Hello guys, in this tutorial I will guide you through converting a picture to sound using Python. We are going to cover optical character recognition (OCR) to detect the text in an image, and speech synthesis to turn the decoded text into speech.
Project Requirements
To follow along with this tutorial, you need the following libraries installed on your machine.
- Python Imaging Library (Pillow) for reading images
- Python-tesseract (OCR Tool) for converting picture to text
- gTTS (Google text to speech) for converting text to sound
Installation
$ pip install Pillow
$ pip install gTTS
$ pip install pytesseract
Also, for pytesseract to work, you have to install Google’s Tesseract-OCR engine on your operating system.
To install the Tesseract engine, CLICK HERE for full installation instructions for your operating system.
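Once the engine is installed, it can help to confirm that Python can actually find it before going further. The snippet below is only a minimal sketch: it assumes the binary is called `tesseract` and lives on your PATH, and the helper names `find_tesseract` and `describe_tesseract` are made up for this illustration.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    # Assumption: the engine is installed under the name "tesseract".
    return shutil.which(executable)


def describe_tesseract(path):
    """Turn the lookup result into a short human-readable message."""
    if path is None:
        return "Tesseract not found on PATH; install it or tell pytesseract where it is."
    return f"Tesseract found at {path}"


if __name__ == "__main__":
    print(describe_tesseract(find_tesseract()))
```

If the binary is installed somewhere that is not on your PATH, pytesseract lets you point at it by setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable.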
Now that everything is installed, let’s start building our program.
Project Folder
In your project folder you should have a sample image containing text, which we will use to test our program.
.
├── app.py
└── image.jpg

0 directories, 2 files
Our project will be divided into two main parts
- Converting the image to Text (OCR)
- Converting Text to speech (Speech Synthesis)
How do we convert an image to a string of text?
At this stage, we use the Python Imaging Library (Pillow) to load our image and then pytesseract to perform optical character recognition (OCR), which detects all the recognizable characters in the image.
In the example below I used this image:

Example of Usage
>>> from PIL import Image
>>> from pytesseract import image_to_string
>>> text = image_to_string(Image.open('image.jpg'))
>>> text
'JOBS FILL\nYflUR POCKET.\nADVENTURES\nFILL YOUR\nLIFE.'
That’s how you can easily perform OCR in just one line of code. Notice that the recognition is not perfect (YOUR came out as YflUR here). Now let’s see how we can convert the text to speech using gTTS.
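Before moving on, here is a slightly more defensive way to wrap the same OCR call. Treat it as a sketch rather than the one right way to do it: the function name `ocr_image` is hypothetical, and it assumes the English language data (`eng`) is installed for Tesseract.

```python
from pathlib import Path


def ocr_image(path_to_image, lang="eng"):
    """Return the text Tesseract recognises in the image at path_to_image."""
    image_path = Path(path_to_image)
    if not image_path.is_file():
        # Fail early with a clear message instead of a longer Pillow traceback.
        raise FileNotFoundError(f"No image found at {image_path}")

    # Imported here so the path check above works even if the
    # third-party packages are not installed yet.
    from PIL import Image
    from pytesseract import image_to_string

    # The with-block closes the image file once OCR has finished.
    with Image.open(image_path) as image:
        return image_to_string(image, lang=lang)
```

Calling `ocr_image('image.jpg')` should return the same kind of string as the one-liner above, while a mistyped file name gives a readable error.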
Converting Generated Text to speech
I have a separate article that explains different ways to convert text to speech in Python; you can review them here: 3 ways to convert text to speech.
In this tutorial, we are going to use gTTS (Google Text-to-Speech) to convert our decoded text into sound.
The syntax for text to speech is very simple; you can also do it with just one line of code, as shown in the example below.
>>> from gtts import gTTS
>>> gTTS('Coding is awesome trust me').save('sound.mp3')
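The one-liner can also be wrapped in a small function that refuses to run on empty text, which is what you get when OCR finds nothing in the picture. This is a sketch under a few assumptions: `text_to_mp3` is a name invented for this example, and gTTS needs an internet connection because it calls Google's text-to-speech service.

```python
def text_to_mp3(text, out_path="sound.mp3", lang="en"):
    """Save `text` as spoken audio in an MP3 file and return the file path."""
    spoken = text.strip()
    if not spoken:
        # Nothing to say, so stop before making any network request.
        raise ValueError("Nothing to speak: the text is empty")

    # Imported here so the empty-text check works without gTTS installed.
    from gtts import gTTS

    # Assumption: an internet connection is available at this point.
    gTTS(spoken, lang=lang).save(out_path)
    return out_path
```

For example, `text_to_mp3('Coding is awesome trust me')` writes sound.mp3 just like the line above, and passing a different `out_path` lets you keep more than one recording.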
Final program
I made the simple program below using what we just learned, with the addition of a cleaning step that removes \n from the generated text to make it easier to convert to sound.
from PIL import Image
from gtts import gTTS
from pytesseract import image_to_string


def image_to_sound(path_to_image):
    try:
        # Load the image and run OCR on it
        loaded_image = Image.open(path_to_image)
        decoded_text = image_to_string(loaded_image)
        # Replace line breaks with spaces so the text reads as one sentence
        cleaned_text = ' '.join(decoded_text.split('\n'))
        print(cleaned_text)
        # Convert the cleaned text to speech and save it as an MP3 file
        sound = gTTS(cleaned_text, lang='en')
        sound.save('sound.mp3')
        return True
    except Exception as bug:
        print('The bug thrown while executing the code\n', bug)
        return False


image_to_sound('image.jpg')
input()  # keep the console window open until Enter is pressed
When you run the above code, it opens our sample image, performs optical character recognition, cleans the generated text by removing \n, and converts it to sound using gTTS, saving the result as sound.mp3.
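One optional refinement to the cleaning step: replacing each \n with a space can leave double spaces where the OCR output contains blank lines, and other stray whitespace characters are left in place. A possible alternative, shown here only as a sketch with a hypothetical helper name `clean_ocr_text`, is to split on any whitespace and join the words back with single spaces.

```python
def clean_ocr_text(raw_text):
    """Collapse line breaks and repeated whitespace into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading and trailing whitespace as well.
    return " ".join(raw_text.split())


if __name__ == "__main__":
    sample = 'JOBS FILL\nYflUR POCKET.\n\nADVENTURES\nFILL YOUR\nLIFE.\n'
    print(clean_ocr_text(sample))
    # prints: JOBS FILL YflUR POCKET. ADVENTURES FILL YOUR LIFE.
```

Note that this only tidies whitespace; misread characters such as YflUR are still passed on to gTTS as they are.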
I also recommend reading these:
- Build a real-time barcode reader in Python
- Realtime vehicle detection in Python in 5 minutes
- Getting started with image processing using Pillow
- How to detect Edges in a picture using OpenCV Canny algorithm
If you have any questions, just drop them in the comment box and I will reply as fast as I can.
To get the full code for this article, please check it out on my GitHub.