A tool for removing duplicate files

Download RemoveDuplicates.py

One of the problems with working in a hybrid Windows and Linux environment is that you need to watch closely for filesystem and file anomalies and inconsistencies. Differing end-of-line markers, for example, cause many problems when sharing files between the two operating systems. One particular problem I've run into is duplicate files: multiple files in the same directory whose names differ only in capitalization. This can happen if, say, you copy a directory somewhere in Windows, then switch to Linux and use a tool such as rsync to copy that same directory over again. If the capitalization differs, Linux will not replace the old files, because Linux, unlike Windows, treats filenames case-sensitively. This can even happen on NTFS filesystems, which technically permit names that differ only in case.

The solution I've come up with is this simple script, called RemoveDuplicates.py. You need Python installed to run it, but it has no additional dependencies. Simply run it in the directory you wish to clean, and it should do the rest. Note that you shouldn't use this on entire filesystems (yet), because it will use ridiculous amounts of memory when given a large number of files. Download it here!
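The script itself isn't reproduced here, but a minimal sketch of one plausible approach follows: walk the directory tree, group filenames that differ only in capitalization, and keep the most recently modified copy in each group. The function name remove_case_duplicates and the keep-the-newest policy are my assumptions for illustration, not details taken from RemoveDuplicates.py.

```python
#!/usr/bin/env python
"""Minimal sketch of a case-insensitive duplicate remover.

This is NOT the original RemoveDuplicates.py; it only illustrates
one plausible approach: group files by lowercased name and keep the
most recently modified copy in each group.
"""

import os
from collections import defaultdict


def remove_case_duplicates(root="."):
    for dirpath, _dirnames, filenames in os.walk(root):
        # Group names in this directory that differ only in capitalization.
        groups = defaultdict(list)
        for name in filenames:
            groups[name.lower()].append(os.path.join(dirpath, name))

        for paths in groups.values():
            if len(paths) < 2:
                continue
            # Keep the newest file; remove the older duplicates.
            # (Keeping the newest is an assumed policy, not the script's.)
            paths.sort(key=os.path.getmtime, reverse=True)
            for stale in paths[1:]:
                print("Removing duplicate:", stale)
                os.remove(stale)


if __name__ == "__main__":
    remove_case_duplicates()
```

Because this sketch only holds one directory's filenames in memory at a time, it sidesteps the memory concern mentioned above; the trade-off is that it can only catch duplicates that live side by side in the same directory.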

P.S. I cannot guarantee that this tool will work as intended or be bug-free. Use wisely.
