Russell wrote:
> Hopefully the obsessed photographers here will have at least as good an idea about this subject as the experts on any photo list :-)
>
> Summary: Seeking opinions re duplicate-file location and management software. Needs to work with essentially unlimited capacity and number of files and number of connected or disconnected drives. "Actually working" is highly desirable. Free would be nice but is not essential. Emphasis is on photo files.
> _________________
>
> I have a large photo collection scattered across many hard drives of various capacity and vintage. .....

Hi Russell.

So, what works for me... and it is somewhat recently implemented (in the past few years). Like you, I found I had stuff everywhere. It does not sound quite as bad as your situation, but perhaps only because my volume was somewhat less. I have managed to use a few tools to get things nice and tidy again. Primary among them are Linux, jhead, rsync, and some custom scripts and Java programs.

jhead is a command-line tool that does a fair number of things, but all I really use it for is to:

1. Rename the file to represent the date/time the image was taken, using the EXIF data. For example, DSC4321.JPG becomes Img20091025.123456.78.jpg for a photo taken at 12:34:56.78 on 2009/10/25 (the trailing .78 is the sub-second fraction).

2. I have also modified jhead slightly to be able to simultaneously rename the RAW file associated with the JPG (if one exists) to have the same base name but a different extension.

3. jhead is also instructed to inspect the orientation data in the EXIF, and it will losslessly rotate JPG images if they were taken portrait style.

I have used the above to re-process all my previously messy files, but I also use it now as part of the routine workflow for loading pictures from my camera to my server. I group the photos into 'batches', where a batch is normally what I download from one memory card. Each batch is in a different folder, the folders are stored in a calendar-year folder, and they are named sequentially/chronologically. I.e. I have .../2009/Batch001/Img20090101.000501.20.jpg for that photo taken 5 minutes after the new year.
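In case it is useful, the core of the rename/rotate step with a stock jhead looks roughly like this (the sub-second suffix and the matching RAW rename come from my local patches, so they are not shown; the Img prefix is just my convention):

    # Losslessly rotate any shots whose EXIF orientation says 'portrait'
    # (jhead drives jpegtran to do the actual rotation)
    jhead -autorot *.JPG

    # Rename from the EXIF timestamp,
    # e.g. DSC4321.JPG -> Img20091025.123456.jpg
    jhead -nImg%Y%m%d.%H%M%S *.JPG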
So, the new system is one where the photos are loaded off the camera and pre-processed to rotate, rename, and file them (and set all permissions to read-only). The new batch is then loaded into the Gallery2 web server (which I have also slightly modified to skip processing RAW photos); it loads each photo, creates a couple of smaller image sizes for each one, and symbolically links the 'full size' image to the main photo repository. Then I manipulate things in the Gallery2 web app to build photo albums of 'good' photos.

I have also written a Java program that inspects these 'real' albums, tracks back to the real photos each album uses, and pulls out the RAW photo with the same name as the JPG in Gallery2. I can then re-touch and batch-process the RAW files for physical prints in a physical album, or for enlargements.

The folder tree containing the originals is exported read-only using Samba to my other Windows machines. Being read-only is a good thing, because otherwise Windows will mess with the files, etc.

If I retouch images, I normally don't pay special attention to the changed file, and I delete it after it is used for enlargements/prints, etc. I can always re-retouch the pics from the originals. Some special re-works get re-filed with the originals manually, and then manually re-imported to Gallery2. For the most part my re-touching is pretty simple and easy to reproduce.

Once the photos were all consistently named and (normally) stored in separate folders named sequentially within each calendar year, it became a whole lot easier to start filing the 'messy' photos in with the existing photos, removing duplicates, etc. Having established this system, I was able to go back to the messy/scattered files, re-process each set using jhead, and see if there were duplicate names. In some obvious cases I was able to completely discard wads of photos as duplicates; in other cases I had to do some manual merging; and in other cases I decided to skip the manual processing, add a separate 'batch', and live with duplicate sets of 'originals'.

Now that I have a system going, I am also able to keep all the photos, and only photos, in a 'valuable' folder. Because Gallery2 symbolically links to the photos, the re-sized files Gallery2 makes for web viewing are not in the same place.

I then use rsync in an automated system to keep frequent (hourly) snapshot-like backups of all photos. The 'originals' are stored on two 1TB drives set up in a RAID0 (striped) array for performance reasons, but this introduces significant risk from drive failure. So, every hour, all photos are re-synced onto a RAID1 pair of 1.5TB drives (mirrored). The sync is 'intelligent' in that it keeps a history of all file modifications (there are none for photos except new photos being added), but the same system is also used for other documents and settings which do change. As a result I have a copy of the state of every file for every hour in the past day, every day in the past week, every week in the past 3 months, every month in the past year, and every year since the system started. Me deleting a file or making a stupid change should be easy to recover from. Even formatting my drive means at most an hour of work lost.

Once every month I connect an external 1.5TB drive and completely re-sync my complete historical backup. This disk is then stored safely offsite ... ;-)
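The Java program is tied to Gallery2's internals, so I won't reproduce it here, but because Gallery2 only symlinks the full-size images, the essential move can be sketched in shell (the album path, output folder, and .nef extension are just examples):

    # For each JPG an album uses, follow the Gallery2 symlink back to the
    # original in the repository, then copy out the RAW file beside it.
    mkdir -p /tmp/print-queue
    for link in /gallery/albums/Best2009/*.jpg; do
        orig=$(readlink -f "$link")      # resolve symlink to the real file
        raw="${orig%.*}.nef"             # same base name, RAW extension
        [ -f "$raw" ] && cp "$raw" /tmp/print-queue/
    done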
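As for the duplicate hunt itself: once the names are consistent, nothing clever is needed, because two copies of the same shot end up with the same name. A sketch of the kind of check involved (paths are made up):

    # Compare same-named files in a messy pile against the repository;
    # same name and same bytes means the stray copy can go.
    for f in /messy/Img*.jpg; do
        name=$(basename "$f")
        match=$(find /data/valuable -name "$name" -print | head -n 1)
        if [ -n "$match" ] && cmp -s "$f" "$match"; then
            echo "duplicate: $f == $match"
        fi
    done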
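The snapshot scripts are home-grown, but the effect can be approximated with rsync's --link-dest option: unchanged files are hard-linked against the previous snapshot, so every snapshot looks like a complete tree while only new files consume space. Roughly (paths made up):

    #!/bin/sh
    # Hourly snapshot: hard-link unchanged files against the previous
    # snapshot so each one is a complete, browsable tree.
    SRC=/data/valuable/
    DST=/mirror/snapshots
    NEW="$DST/hourly.$(date +%Y%m%d%H)"
    PREV=$(ls -1d "$DST"/hourly.* 2>/dev/null | tail -n 1)
    rsync -a --delete ${PREV:+--link-dest="$PREV"} "$SRC" "$NEW"

Deleting old snapshot directories on a schedule is what gives the hourly/daily/weekly/monthly/yearly retention.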
I submitted my patches for jhead but they were not added to the tool; if you want, I can happily send you the changes I have made (adding millisecond granularity to timestamps, processing RAW files along with the JPG, etc.), and I can share other details and code.

The system works for me in the sense that I now have about 78K photos occupying about 560 Gig. For the record, I will never back up to CD/DVD/etc. again; hard drives only for me. The same drives are used for other data like email, documents, music, etc.; I have 800 Gig of data actively processed through the same backup and storage system.

Because my data is stored on a central server, I have also invested in Gigabit networking, and, with the RAID0 array used for storage, it is actually faster for me to process data over the network than to process it locally. I do in fact max out the network at around 80MB/s when I am transferring files around for upload or editing. Also, because the 'server' does a lot of photo processing when uploading photos, and because Gallery2 is somewhat intensive, I have a pretty beefy server (quad-core with 8 Gig of memory, etc.), and this helps a lot to make the web site 'slick'.

It almost embarrasses me that my server got out of control... 6TB of storage permanently attached, plus 3TB on hand (two external 1.5TB backup drives), with a beefy network, CPU, memory, etc. But it does the job really, really well, and I am 'comfortable' with my data in that it is as safe as I can afford at a relatively low price: all in all, just over $1K CAD, and no cost in software.

I just wish I could find a good software solution for photo-editing and processing of the Nikon RAW files I work with in Linux.

Anyway, I thought you might want to hear how I do it.

Rolf