How to deduplicate files on Linux with dupeGuru

Recently, I was given the task of cleaning up my father's files and folders. What made it difficult was the abnormal number of duplicate files with incorrect names. Between keeping a backup on an external drive, editing multiple versions of the same file at once, and changing the directory structure, the same file can get copied many times, change names, change locations, and clog up disk space. Hunting down every single copy can become a problem of gigantic proportions. Fortunately, there is a nice little piece of software that can save you precious hours by finding and removing duplicate files on your system: dupeGuru. Written in Python, this file deduplication software switched to a GPLv3 license a few hours ago. So it is time to apply your new year's resolutions and clean up your stuff!

Installation of dupeGuru

On Ubuntu, you can add the Hardcoded Software PPA:

$ sudo apt-add-repository ppa:hsoft/ppa
$ sudo apt-get update

And then install with:

$ sudo apt-get install dupeguru-se

On Arch Linux, the package is present in the AUR.

If you prefer compiling it yourself, the sources are on GitHub.

Basic Usage of dupeGuru

dupeGuru is designed to be fast and safe, which means that the program is not going to run berserk on your system: it has a very low risk of deleting anything that you did not intend to delete. However, since we are still talking about file deletion, it is always a good idea to stay vigilant and cautious: a good backup is always necessary.

Once you have taken your precautions, you can launch dupeGuru with the command:

$ dupeguru_se

You should be greeted by the folder selection screen, where you can add folders to scan for deduplication.

Once you have selected your directories and launched the scan, dupeGuru shows its results by grouping duplicate files together in a list.

Note that by default dupeGuru matches files based on their content, not their name. To help ensure that you do not accidentally delete something important, the match column shows how closely each duplicate matches. From there, you can select the duplicate files that you want to act on, and click on the "Actions" button to see the available actions.
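
To make the idea of content-based matching concrete, here is a minimal Python sketch. It is not dupeGuru's actual code: the function names `file_digest` and `find_duplicates` are made up for this illustration, and the approach (compare sizes first, then SHA-256 digests) is simply one common way to find byte-for-byte identical files regardless of their names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return sorted groups of paths under root that have identical content."""
    # Step 1: bucket regular files by size (symlinks are skipped).
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Step 2: hash only the files that share a size with another file.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_digest.values() if len(g) > 1)
    return sorted(groups)
```

Calling `find_duplicates` on a directory returns a list of groups, each group being a list of paths whose content is identical, which is roughly the shape of the result list described above.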

The choice of actions is quite extensive. In short, you can delete the duplicates, move them to another location, ignore them, open them, rename them, or even invoke a custom command on them. If you choose to delete a duplicate, you might be as pleasantly surprised as I was by the available deletion options.

Not only can you send the duplicate files to the trash or delete them permanently, but you can also choose to leave a link to the original file (either a symlink or a hardlink). In other words, the duplicates will be erased, and a link to the original will be left in their place, saving a lot of disk space. This can be particularly useful if you imported those files into a workspace, or have dependencies based on them.
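
For readers curious about what "leaving a link" amounts to on disk, here is a small Python sketch. It is an illustration only, not dupeGuru's implementation: `replace_with_link` and the `.dedupe-tmp` suffix are hypothetical names. The link is created under a temporary name first and then renamed over the duplicate, so the duplicate's path never disappears halfway through. Keep in mind that hard links only work when both files live on the same filesystem.

```python
import os


def replace_with_link(original, duplicate, symbolic=False):
    """Replace duplicate with a hard link (or symlink) to original."""
    temp = duplicate + ".dedupe-tmp"  # hypothetical temporary name
    if symbolic:
        # An absolute target keeps the symlink valid wherever it sits.
        os.symlink(os.path.abspath(original), temp)
    else:
        # Hard link: both names point at the same data on disk.
        os.link(original, temp)
    # Rename the link over the duplicate, replacing it in one step.
    os.replace(temp, duplicate)
```

After a call such as `replace_with_link("photos/cat.jpg", "backup/cat.jpg")` (example paths), both names still open the same content, but the data is stored only once.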

Another fancy option: you can export the results to an HTML or CSV file. I am not really sure why you would do that, but I suppose that it can be useful if you prefer keeping track of duplicates rather than using any of dupeGuru's actions on them.

Finally, last but not least, the preferences menu will make all your dreams about duplicate busting come true.

There you can select the criterion for the scan, either content based or name based, and a threshold for duplicates to control the number of results. It is also possible to define the custom command that you can select in the actions. Among the myriad of other little options, it is worth noting that by default, dupeGuru ignores files smaller than 10 KB.
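
To get a feel for how a threshold controls the number of results in a name-based scan, here is a rough Python sketch. It makes no claim to be dupeGuru's algorithm: it scores file names with `difflib.SequenceMatcher` from the standard library, and `name_similarity` and `similar_names` are names invented for this example.

```python
import os
from difflib import SequenceMatcher


def name_similarity(path_a, path_b):
    """Return a 0-100 score comparing two file names, ignoring case and extension."""
    stem_a = os.path.splitext(os.path.basename(path_a))[0].lower()
    stem_b = os.path.splitext(os.path.basename(path_b))[0].lower()
    return round(100 * SequenceMatcher(None, stem_a, stem_b).ratio())


def similar_names(paths, threshold=80):
    """Return (first, second, score) for every pair scoring at least threshold."""
    pairs = []
    for index, first in enumerate(paths):
        for second in paths[index + 1:]:
            score = name_similarity(first, second)
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs
```

With this kind of scoring, a high threshold only reports near-identical names, while a lower one also catches variants such as a "report" and a "report_v2", at the cost of more false positives.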

For more information, I suggest that you check out the official website, which is filled with documentation, support forums, and other goodies.

To conclude, dupeGuru is my go-to software whenever I have to prepare a backup or free up some space. I find it powerful enough for advanced users, and yet intuitive enough for newcomers. Cherry on the cake: dupeGuru is cross platform, which means that you can also use it on your Mac or Windows PC. If you have specific needs, and want to clean up music or image files, there are two variations: dupeguru-me and dupeguru-pe, which respectively find duplicate audio tracks and pictures. The main difference from the regular version is that they compare beyond file formats and take into account media-specific metadata like quality and bit rate.

What do you think of dupeGuru? Would you consider using it? Or do you have any alternative deduplication software to suggest? Let us know in the comments.

Adrien Brochard

I am a Linux aficionado from France. After trying multiple distributions, I finally settled for Archlinux. But I am always trying to improve my system by stacking up tips and tricks.

8 thoughts on “How to deduplicate files on Linux with dupeGuru”

  1. For dedupe of music collections, the problem is always when you find three copies of songs (all with different paths and directories), how do you select which two to delete?

    For thousands of songs this is not practical.
    We need some automated rule to do this automatically.

    My manual rule is delete all except the one with the longest path. That version is the most categorized into directories of genre, artist, album etc.
    Could such a rule be automated?

  2. Wish I'd seen this before I wrote mine. The approach I took was ... will search through a file system for file above a minimum specified size. The set of all inodes of each file of each size found is computed. If there is more than one inode for each size then each inode of that size has the md5sum computed. Any inodes with duplicate md5sums have their shasum computed if the switch --sha is set. Any inodes with duplicate checksums are then consolidated to one inode and hard links made to the previous names.
    File system integrity is preserved, but the total space is reduced. This is especially useful on back-up disks with multiple copies of large files.

    I should put my code on my web site...
