How to remove duplicates on your drive using Python

Kalebu Jordan

Python is powerful; it can do almost anything you can imagine. In this tutorial I will guide you through building a Python script that can ultimately save you a lot of disk space.

Program Overview

The main idea of the program you're about to build is this: in many situations you end up with duplicate files on your disk, and checking for them manually is tedious work, since the copies may have been saved under different names.

With that in mind, we are going to build a Python script that recursively removes all duplicate files in a given directory.

How do we do it?

If we were to read each whole file and then compare it against every other file in the directory tree, it would take a very long time. So how do we do it?

The answer is hashing: through hashing we can generate a fixed string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.

There are a variety of hashing algorithms out there, such as MD5, SHA-1, and SHA-256.
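If you're curious which algorithms your Python ships with, hashlib can tell you (a quick aside; the exact set depends on your Python version):

>>> import hashlib
>>> sorted(hashlib.algorithms_guaranteed)   # includes 'md5', 'sha1', 'sha256', 'sha512', among others

We will use md5 for the short examples below and sha256 for the final program.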

Let’s do some coding

Hashing in Python is pretty straightforward: we are going to use the hashlib module, which comes with the Python standard library.

Let's see how to create the hash of a string in Python using the MD5 hashing algorithm.

Example of usage:

>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'

It's straightforward: you just import hashlib, use md5() to create a hash object, and finally call hexdigest() to get the hash as a string.
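Worth noting: the final program below uses sha256 from the same library instead of md5, and the call pattern is identical; only the digest is longer:

>>> hashlib.sha256(example_text).hexdigest()   # returns a 64-character hex string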

The example above showed how to hash a string, but for the project we are about to build we need to hash files.

How do we hash files?

Hashing a file is similar to hashing a string, with a minor difference: when hashing a file we need to open it in binary mode first and then hash the binary content we read.

Hashing a File

Let's say you have a simple text document named learn.txt in your project directory; let's try hashing it.

>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

------ Try generating the hashes again ------
>>> file2 = open('learn.txt', 'rb').read()
>>> hashlib.md5(file2).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

As you can see above, even if you generate the hashes a second time, they remain exactly the same; hashing is deterministic.
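The reverse also holds: change the content even slightly and the hash changes completely, which is exactly what lets a hash act as a file's identity. A quick check (any two different strings will do):

>>> hashlib.md5(b'Duplython is amazing').hexdigest() == hashlib.md5(b'Duplython is amazing!').hexdigest()
False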

A challenge arises when we try to hash a very large file: waiting for the whole file to be read into memory and only computing the hash afterwards takes time. We need a way to build the hash as we read the file.

Well, don't worry, because Python has a way to do that: we read the file in blocks and update the hash as we keep reading, until the file has been read entirely.

Reading in blocks this way saves us a lot of the time we would otherwise spend waiting for the whole file to load.

Example:

>>> import hashlib
>>> block_size = 1024
>>> hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block)>0:
...             hash.update(block)
...             block = file.read(block_size)
...     print(hash.hexdigest())
... 
0534cf6d5816c4f1ace48fff75f616c9

As you can see, the hash has not changed; it is still the same. (The final program uses this same technique with a larger block size, 65536 bytes, to cut down the number of reads.) We are therefore ready to start building our Python program to save disk space for us.

Before we build our app, we need a way to delete those duplicate files. How do we do it? Well, you may already know.

We will use the Python os module's remove() method to delete any duplicate file found while inspecting files.

Let's try deleting learn.txt with the os module.

Example of usage (os module):

>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']

Well, that's simple: you just call remove() with the name of the file you want to remove, and you're done. Now let's go build our application.
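One more small piece we will need: the program also reports how much space it frees, so before deleting a duplicate it records the file's size with os.path.getsize(). A minimal sketch of that bookkeeping (the filename is just an example):

>>> import os
>>> bytes_saved = os.path.getsize('duplicate.txt')   # size in bytes, taken before deletion
>>> os.remove('duplicate.txt')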

Building our Disk space saver program

Importing the necessary libraries

import time
import os
from hashlib import sha256

Creating the initial class for the program

app.py

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

That's just the initial skeleton of our Python program; when we run it, it simply prints the welcome message to the screen.

Output :

$ python3 app.py
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................

We now have to create a simple function that generates the hash of a given file, using the hashing knowledge we learned above.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except:
            return False
        
    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()
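Before moving on, you can sanity-check generate_hash() in the interpreter; the digest will of course depend on the file's contents ('app.py' here is just whatever file you have at hand):

>>> from app import Duplython
>>> cleaner = Duplython()
>>> cleaner.generate_hash('app.py')   # a 64-character sha256 hex digest, or False if the file can't be read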

Architecting the Program Logic

Now that we have made a function to generate a hash for a given filename, the remaining task is to build our program logic.

We will compare those hashes and remove a file whenever we find a duplicate.
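The heart of the cleaning step is os.walk(), which yields a (dirpath, dirnames, filenames) tuple for every directory under the starting point; we keep only the first element of each tuple to get the list of directories to visit. A quick illustration (the subdirectory names shown are just an example):

>>> import os
>>> [path[0] for path in os.walk('.')]
['.', './images', './docs']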

I have made a simple function called clean() to do just that.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')
        
    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except:
            return False

    def clean(self)->None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        # first time this content is seen: remember its hash
                        self.File_hashes.append(filehash)
                else:
                    # duplicate content: record the bytes freed, then delete the file
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            os.chdir(self.home_dir)
        
    def main(self)->None:
        self.welcome()
        self.clean()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Our program logic is now complete, and the program itself is nearly done; we just have to add a simple method that prints a summary of the cleaning process.

app.py

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
                Filehash = Filehash.hexdigest()
            return Filehash
        except:
            return False

    def clean(self)->None:
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if filehash not in self.File_hashes:
                    if filehash:
                        # first time this content is seen: remember its hash
                        self.File_hashes.append(filehash)
                else:
                    # duplicate content: record the bytes freed, then delete the file
                    byte_saved = os.path.getsize(file)
                    self.count_cleaned += 1
                    self.Total_bytes_saved += byte_saved
                    os.remove(file)
                    print(file, '.. cleaned ')
            os.chdir(self.home_dir)
    
    def cleaning_summary(self)->None:
        mb_saved = self.Total_bytes_saved/1048576   # 1048576 bytes = 1 MB
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')
        
    def main(self)->None:
        self.welcome()
        self.clean()
        self.cleaning_summary()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Our app is complete. Now run it in the specific folder you want to clean: it will iterate recursively over that folder, find all the files, and remove the duplicates.

Example output :

$ python3 app.py 
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 


--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------

I hope you found this post interesting. Don't forget to subscribe to get more tutorials like this. To get the full code, check it out on my GitHub.

If you have any suggestion or comment, drop it in the comment box and I will reply to you right away.
