Russell wrote:
> Hopefully the obsessed photographers here will have at least as good an idea about this subject as the experts on any photo list :-)
>
> Summary: Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
> _________________
>
> I have a large photo collection scattered across many hard drives of various capacity and vintage. .....

Hi Russell.

So, what works for me... and it is somewhat recently implemented (in the past few years). Like you, I found I had stuff everywhere. It does not sound quite as bad as your situation, but perhaps only because my volume was somewhat less. I have managed to use a few tools to get things nice and tidy again. Primary among them are Linux, jhead, rsync, and some custom scripts and Java programs.

jhead is a command-line tool that does a fair number of things, but all I really use it for is to:

1. Rename the file to represent the date/time the image was taken, using the EXIF data. For example, DSC4321.JPG becomes Img20091025.123456.78.jpg for a photo taken at 12:34:56.78 on 2009/10/25 (the trailing .78 is the sub-second fraction).

2. I have also modified jhead slightly to be able to simultaneously rename the RAW file associated with the JPG (if one exists) to have the same base name but a different extension.

3. jhead is also instructed to inspect the orientation data in the EXIF, and it will losslessly rotate JPG images if they were taken portrait style.

I have used the above to re-process all my previously messy files, but I also use it now as part of the routine workflow for loading pictures from my camera to my server. I group the photos into 'batches', where a batch is normally what I download from one memory card. Each batch is in a different folder, the folders are stored in a calendar-year folder, and they are named sequentially/chronologically. I.e. I have .../2009/Batch001/Img20090101.000501.20.jpg for that photo taken 5 minutes after the new year.
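In case it is useful, the core of the rename/rotate step with a stock jhead looks roughly like this (the sub-second suffix and the matching RAW rename come from my local patches, so they are not shown; the Img prefix is just my convention):

    # Losslessly rotate any shots whose EXIF orientation says 'portrait'
    # (jhead drives jpegtran to do the actual rotation)
    jhead -autorot *.JPG

    # Rename from the EXIF timestamp,
    # e.g. DSC4321.JPG -> Img20091025.123456.jpg
    jhead -nImg%Y%m%d.%H%M%S *.JPG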
So, the new system is one where the photos are loaded off the camera and pre-processed to rotate, rename, and file them (and set all permissions to read-only). The new batch is then loaded into the Gallery2 web server (which I have also slightly modified to skip processing RAW photos); it loads each photo, creates a couple of smaller image sizes for each one, and symbolically links the 'full size' image to the main photo repository. Then I manipulate things in the Gallery2 web app to build photo albums of 'good' photos.

I have also written a Java program that inspects these 'real' albums, tracks back to the real photos each album uses, and pulls out the RAW photo with the same name as the JPG in Gallery2. I can then re-touch and batch-process the RAW files for physical prints in a physical album, or for enlargements.

The folder tree containing the originals is exported read-only using Samba to my other Windows machines. Being read-only is a good thing, because otherwise Windows will mess with the files, etc.

If I retouch images, I normally don't pay special attention to the changed file, and I delete it after it is used for enlargements/prints, etc. I can always re-retouch the pics from the originals. Some special re-works get re-filed with the originals manually, and then manually re-imported to Gallery2. For the most part my re-touching is pretty simple and easy to reproduce.

Once the photos were all consistently named and (normally) stored in separate folders named sequentially within each calendar year, it became a whole lot easier to start filing the 'messy' photos in with the existing photos, removing duplicates, etc. Having established this system, I was able to go back to the messy/scattered files, re-process each set using jhead, and see if there were duplicate names. In some obvious cases I was able to completely discard wads of photos as duplicates; in other cases I had to do some manual merging; and in other cases I decided to skip the manual processing, add a separate 'batch', and live with duplicate sets of 'originals'.

Now that I have a system going, I am also able to keep all the photos, and only photos, in a 'valuable' folder. Because Gallery2 symbolically links to the photos, the re-sized files Gallery2 makes for web viewing are not in the same place.

I then use rsync in an automated system to keep frequent (hourly) snapshot-like backups of all photos. The 'originals' are stored on two 1TB drives set up in a RAID0 (striped) array for performance reasons, but this introduces significant risk from drive failure. So, every hour, all photos are re-synced onto a RAID1 pair of 1.5TB drives (mirrored). The sync is 'intelligent' in that it keeps a history of all file modifications (there are none for photos except new photos being added), but the same system is also used for other documents and settings which do change. As a result I have a copy of the state of every file for every hour in the past day, every day in the past week, every week in the past 3 months, every month in the past year, and every year since the system started. Me deleting a file or making a stupid change should be easy to recover from. Even formatting my drive means at most an hour of work lost.

Once every month I connect an external 1.5TB drive and completely re-sync my complete historical backup. This disk is then stored safely offsite ... ;-)
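The Java program is tied to Gallery2's internals, so I won't reproduce it here, but because Gallery2 only symlinks the full-size images, the essential move can be sketched in shell (the album path, output folder, and .nef extension are just examples):

    # For each JPG an album uses, follow the Gallery2 symlink back to the
    # original in the repository, then copy out the RAW file beside it.
    mkdir -p /tmp/print-queue
    for link in /gallery/albums/Best2009/*.jpg; do
        orig=$(readlink -f "$link")      # resolve symlink to the real file
        raw="${orig%.*}.nef"             # same base name, RAW extension
        [ -f "$raw" ] && cp "$raw" /tmp/print-queue/
    done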
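As for the duplicate hunt itself: once the names are consistent, nothing clever is needed, because two copies of the same shot end up with the same name. A sketch of the kind of check involved (paths are made up):

    # Compare same-named files in a messy pile against the repository;
    # same name and same bytes means the stray copy can go.
    for f in /messy/Img*.jpg; do
        name=$(basename "$f")
        match=$(find /data/valuable -name "$name" -print | head -n 1)
        if [ -n "$match" ] && cmp -s "$f" "$match"; then
            echo "duplicate: $f == $match"
        fi
    done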
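The snapshot scripts are home-grown, but the effect can be approximated with rsync's --link-dest option: unchanged files are hard-linked against the previous snapshot, so every snapshot looks like a complete tree while only new files consume space. Roughly (paths made up):

    #!/bin/sh
    # Hourly snapshot: hard-link unchanged files against the previous
    # snapshot so each one is a complete, browsable tree.
    SRC=/data/valuable/
    DST=/mirror/snapshots
    NEW="$DST/hourly.$(date +%Y%m%d%H)"
    PREV=$(ls -1d "$DST"/hourly.* 2>/dev/null | tail -n 1)
    rsync -a --delete ${PREV:+--link-dest="$PREV"} "$SRC" "$NEW"

Deleting old snapshot directories on a schedule is what gives the hourly/daily/weekly/monthly/yearly retention.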
I submitted my patches for jhead but they were not added to the tool; if you want, I can happily send you the changes I have made (adding millisecond granularity to timestamps, processing RAW files along with the JPG, etc.), and I can share other details and code.

The system works for me in the sense that I now have about 78K photos occupying about 560 Gig. For the record, I will never back up to CD/DVD/etc. again; hard drives only for me. The same drives are used for other data like email, documents, music, etc.; I have 800 Gig of data actively processed through the same backup and storage system.

Because my data is stored on a central server, I have also invested in Gigabit networking, and, with the RAID0 array used for storage, it is actually faster for me to process data over the network than to process it locally. I do in fact max out the network at around 80MB/s when I am transferring files around for upload or editing. Also, because the 'server' does a lot of photo processing when uploading photos, and because Gallery2 is somewhat intensive, I have a pretty beefy server (quad-core with 8 Gig of memory, etc.), and this helps a lot to make the web site 'slick'.

It almost embarrasses me that my server got out of control... 6TB of storage permanently attached, plus 3TB on hand (two external 1.5TB backup drives), with a beefy network, CPU, memory, etc. But it does the job really, really well, and I am 'comfortable' with my data in that it is as safe as I can afford at a relatively low price: all in all, just over $1K CAD, and no cost in software.

I just wish I could find a good software solution for photo-editing and processing of the Nikon RAW files I work with in Linux.

Anyway, I thought you might want to hear how I do it.

Rolf