Thread: Finding and deleting duplicate image files

  1. #1
    cerbere (x10Hosting Member)
    Join Date
    Nov 2007
    Posts
    51

    Finding and deleting duplicate image files

    Hi everyone,

    I want to find (and optionally delete) duplicate images
    in my directories. I have already written a Perl program
    (here: http://ixedix.x10hosting.com/findsame.htm ) that lists
    JPG files of the same size as possible candidates for manual deletion.

    To make the deletion automatic, I need to be pretty sure that the images
    are the same. From what I read while googling the subject, it seems
    that there is no checksum or similar signature (MD5, etc.) in the
    JPG format. My idea now is to read a 256-byte block in the middle
    of each potentially duplicate image, compare it to a block at the same
    offset from the first image of that size, and delete the clone if
    the blocks are identical.

    Of course, a byte-by-byte comparison of the entire files could do
    the job, but it would be a lot slower.

    Any suggestions on how to do it the fastest, safest way?

    Ancillary question: is there a "move file" function in Perl or PHP?
    I'd like to move the duplicate files to a garbage bin before finally
    deleting them for good.
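
A rough sketch of the size-then-middle-block idea described above, written in Python rather than the Perl of the original program. The function names (`group_by_size`, `middle_block`, `probable_duplicates`) are made up for illustration. Note that a matching 256-byte block only suggests two files are identical, it does not prove it, and comparing only against the first file of a group can miss clones that match each other but not that first file.

```python
import os
from collections import defaultdict

def group_by_size(paths):
    """Return {size: [paths]} keeping only sizes shared by 2+ files."""
    groups = defaultdict(list)
    for path in paths:
        groups[os.path.getsize(path)].append(path)
    return {size: files for size, files in groups.items() if len(files) > 1}

def middle_block(path, block_size=256):
    """Read block_size bytes centred on the middle of the file."""
    size = os.path.getsize(path)
    offset = max(0, size // 2 - block_size // 2)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(block_size)

def probable_duplicates(files):
    """Files whose middle block matches that of the first file in the list."""
    reference = middle_block(files[0])
    return [p for p in files[1:] if middle_block(p) == reference]
```

Each candidate costs one seek and one small read, which is why this is much cheaper than reading whole files; the price is the reduced certainty noted above.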

  2. #2
    lordskid (x10Hosting Member)
    Join Date
    Mar 2008
    Posts
    41

    Re: Finding and deleting duplicate image files

    http://us.php.net/crc32#86628

    Check this out. They say you can use this to generate crc16 checks on a file.

    Now all you have to do is run this code on both JPGs. If the results are the same, the files are most likely the same, so the duplicate can be deleted.
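
For comparison, a minimal Python sketch of the same checksum idea. It swaps in CRC-32 from the standard `zlib` module instead of the crc16 routine from the linked PHP note, and `file_crc32` is a hypothetical helper name. The file is read in chunks so that memory use stays small.

```python
import zlib

def file_crc32(path, chunk_size=65536):
    """Compute the CRC-32 of a file without loading it all into memory."""
    crc = 0
    with open(path, "rb") as f:
        # Feed each chunk into the running CRC until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Equal CRCs make identical content very likely but not certain, so a final byte-by-byte check before deleting anything is a reasonable safeguard.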

  3. #3
    cerbere (x10Hosting Member)
    Join Date
    Nov 2007
    Posts
    51

    Re: Finding and deleting duplicate image files

    Thanks for the suggestion, lordskid, but calculating a CRC
    for each candidate file would be even slower than doing
    a plain byte-by-byte comparison...

    On second thought, the above is clear if I have only 2 files
    of the same length. For more files, maybe your idea is the best way.

    Here is a fragment of my list of potential duplicates:

    Code:
    c:/2008Mai4/LFS001090.JPG   208163
    c:/2008Nov2/1227225623.JPG   208163
    
    c:/Sept2007Bis/1189442639290.JPG   209199
    c:/Oct2007Ter/1192542190263.JPG   209199
    c:/Oct2007Quad/1193434359791.JPG   209199
    c:/Oct2007Penta/1193639838277.JPG   209199
    
    c:/2008Mai2/117943817617.JPG   210542
    c:/2008Juin3/1213585426621.JPG   210542

    The number on each line is the file size in bytes.

  4. #4
    misson (x10 Spammer)
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Finding and deleting duplicate image files

    Quote (originally posted by cerbere):
        Thanks for the suggestion, lordskid, but calculating a CRC
        for each candidate file would be even slower than doing
        a plain byte-by-byte comparison...

        On second thought, the above is clear if I have only 2 files
        of the same length. For more files, maybe your idea is the best way.

    That was going to be my suggestion: compare the files directly when only 2 share a size, and look for checksum collisions when 3 or more share a size.
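
One way to read that suggestion, as a Python sketch under assumptions: `duplicate_sets` and `crc32_of` are illustrative names, the input is a list of paths already known to have the same size, and CRC-32 stands in for whichever checksum ends up being used.

```python
import filecmp
import zlib
from collections import defaultdict

def crc32_of(path, chunk_size=65536):
    """CRC-32 of a file, read in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

def duplicate_sets(same_size_files):
    """Return lists of files believed identical among same-sized files."""
    if len(same_size_files) < 2:
        return []
    if len(same_size_files) == 2:
        # Two candidates: one direct comparison, which stops at the
        # first differing chunk, is cheaper than two full checksums.
        a, b = same_size_files
        return [[a, b]] if filecmp.cmp(a, b, shallow=False) else []
    # Three or more: read each file once and group by checksum,
    # instead of comparing every pair.
    by_crc = defaultdict(list)
    for path in same_size_files:
        by_crc[crc32_of(path)].append(path)
    return [group for group in by_crc.values() if len(group) > 1]
```

With n same-sized files, the checksum branch reads each file once, whereas pairwise comparison could need up to n(n-1)/2 comparisons.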

    As for comparing random samples of the content, it's hard to even estimate how many samples you'd need to achieve a given confidence level without knowing more about what's depicted in the images, but I'd be very surprised if you needed to compare more than 25-50% of the file contents, and even that estimate is very conservative. Assuming there's more variation in the center of the image, a good strategy would be to pick blocks from the middle of the file for a baseline JPEG and from the start of the file for a progressive JPEG.

    Some JPEG markers might also be good candidates to compare (e.g. JFIF, if it contains a thumbnail), depending on what produced the images. Quantization tables, in particular, might vary highly from source to source. If the images were produced by the same software, it might be better to skip the markers and compare just the scan data. Of course, reliably checking or skipping markers for fast comparison depends on their being at the start of the file.

    Are the files stored on an NTFS formatted disk? I don't think there are any NTFS attributes (other than size) that are helpful, but someone more familiar with NTFS internals could prove me wrong.

    Finally, how many files do you have to compare? You probably have the cycles to spare if this isn't a frequent task or the number of duplicates is small.

    Quote (originally posted by cerbere):
        Ancillary question: is there a "move file" function in Perl or PHP?
        I'd like to move the duplicate files to a garbage bin before finally
        deleting them for good.

    For Perl, see File::Copy::move() or (failing that) rename. For PHP, see rename().
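
For illustration only, the same "garbage bin" step sketched in Python (not Perl or PHP) using the standard `shutil.move`, which falls back to copy-then-delete when source and destination are on different drives. The helper name `move_to_bin` and the `_1`, `_2` renaming scheme are assumptions, not anything from the thread.

```python
import os
import shutil

def move_to_bin(path, bin_dir):
    """Move a file into bin_dir, renaming it if the name is already taken."""
    os.makedirs(bin_dir, exist_ok=True)
    base = os.path.basename(path)
    stem, ext = os.path.splitext(base)
    target = os.path.join(bin_dir, base)
    n = 1
    # Avoid overwriting an earlier duplicate that had the same file name.
    while os.path.exists(target):
        target = os.path.join(bin_dir, f"{stem}_{n}{ext}")
        n += 1
    shutil.move(path, target)
    return target
```

Nothing is deleted here; emptying the bin directory stays a separate, manual step.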

  5. #5
    cerbere (x10Hosting Member)
    Join Date
    Nov 2007
    Posts
    51

    Re: Finding and deleting duplicate image files

    Thanks for the pointers, misson.

    I have around 60000 images (art, technology, eye-candy :>) ),
    with an average file size of say 200 KB. They are spread
    over 3 drives.

    I now realize that I must give more thought to the subject,
    notably about the ">= 3" case. My PC is very low-end
    ("jurassic" comes to mind: 133 MHz Pentium II with 32 MB of RAM)
    so I must get smart with the comparison algorithm
    if I want it to complete this year! The good thing is that I will
    run it only every 3 months or so, until I can afford a new PC
    with a HUGE drive (14 GB total right now).

    So I will experiment with different approaches, and post what
    I find here in a few days.

    P.S. Shame on me for not having thought of "rename"...

  6. #6
    misson (x10 Spammer)
    Join Date
    Mar 2008
    Location
    Libertatia
    Posts
    2,506

    Re: Finding and deleting duplicate image files

    Given the nature of the problem, I don't expect changing the programming language will have a huge effect, but it's worth testing. Python's fairly fast, even for numeric processing. A mix of C and assembly might be the fastest, but compiler optimizations may be able to beat hand-coded assembly.

    Where do the images come from? Do you need to rescan the entire image store each time? You also might be able to use one of the APPn markers to tag JPEGs with checksums to speed future comparisons.
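
A hedged sketch of what tagging via an APPn marker could look like, in Python and working on bytes in memory. The choice of APP15 (`FF EF`), the `DUPCRC` identifier, and both function names are assumptions made for illustration. The segment is inserted directly after the SOI marker, which strict JFIF readers may object to since they expect APP0 there, and the reader assumes every segment before the scan data carries a length field.

```python
import struct
import zlib

SOI = b"\xff\xd8"      # JPEG start-of-image marker
APP15 = b"\xff\xef"    # application segment 15 (arbitrary choice)
TAG = b"DUPCRC\x00"    # hypothetical identifier for this segment

def tag_with_crc(jpeg):
    """Return a copy of the JPEG bytes with a CRC-32 of the original embedded."""
    if not jpeg.startswith(SOI):
        raise ValueError("not a JPEG stream")
    payload = TAG + struct.pack(">I", zlib.crc32(jpeg) & 0xFFFFFFFF)
    # The 2-byte segment length counts itself plus the payload.
    segment = APP15 + struct.pack(">H", len(payload) + 2) + payload
    return SOI + segment + jpeg[2:]

def read_crc_tag(jpeg):
    """Return the embedded CRC-32, or None if no tag precedes the scan data."""
    pos = 2
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        body = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xEF and body.startswith(TAG):
            return struct.unpack(">I", body[len(TAG):len(TAG) + 4])[0]
        if marker == 0xDA:  # start of scan: header segments are over
            return None
        pos += 2 + length
    return None
```

Tagging changes the file's size, so a size-based candidate list would have to be built either before tagging or from files that have all been tagged the same way.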
