Kalebu Jordan



Hello guys, in this tutorial I will cover how to easily train a machine learning spam filter in Python and use it to properly classify text as spam or ham (legitimate messages).


To follow along with this tutorial, you need the following Python libraries installed on your machine: scikit-learn, pandas, NumPy, wordcloud, Matplotlib, and Jupyter notebook.

You also need a clean dataset in your project directory, which we are going to use to train our spam filter model. You can download it at Training dataset


You can install these Python libraries using pip or conda, depending on the environment you’re using.

pip install scikit-learn pandas numpy wordcloud matplotlib
pip install jupyter notebook

Jupyter notebook

We are going to use Jupyter notebook to train and test our model, so once you have installed the libraries above, start Jupyter notebook as shown below.

jupyter notebook 

Once Jupyter notebook is open, create a new Python 3 notebook file, and we are ready to begin training our machine learning model.

Importing the necessary libraries

Let’s begin by importing the libraries we need for loading, manipulating, and visualizing the data.

#__________importing all necessary libraries___________
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud , STOPWORDS
import numpy as np

We are going to use pandas to load the dataset. It stores the data in a DataFrame, a data structure that gives us an interface to easily manipulate it.

The wordcloud library will be used to visualize which words appear most often in a sentence or paragraph.

Basics of Word cloud

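Below is sample code showing how to create and show a word cloud of a text using Matplotlib and the WordCloud library.

s = 'Today we will be creating a classifier for filtering spam email'
cloud = WordCloud().generate(s)
# Render the generated cloud as an image in the notebook
plt.imshow(cloud, interpolation='bilinear')
plt.show()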

Loading our Dataset

We are going to use the pandas built-in function read_csv() to load our training dataset into our notebook, as shown in the code below.

#____________Loading and Visualizing our data_____________
data = pd.read_csv('spam.csv', usecols=['v1', 'v2'], encoding='latin-1')
# Show the first five rows
data.head()

v1    v2
ham   Go until jurong point, crazy.. Available only ……
ham   Ok lar… Joking wif u oni…
spam  Free entry in 2 a wkly comp to win FA Cup fina…
ham   U dun say so early hor… U c already then say…
ham   Nah I don’t think he goes to usf, he lives aro…
Top 5 rows of the dataset

Checking the length of the dataset
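
The length check can be sketched as below. This is a minimal illustration, not the original notebook cell: the small DataFrame is a hypothetical stand-in for the real dataset, so the numbers it prints say nothing about spam.csv itself.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, for illustration only
data = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham'],
    'v2': ['Ok lar', 'Free entry to win', 'See you later'],
})

# Number of rows, i.e. number of messages
print(len(data))

# (rows, columns) in one call
print(data.shape)

# How many messages carry each label
print(data['v1'].value_counts())
```

On the real dataset, the same three calls applied to the data variable loaded with read_csv() report its size and how the labels are balanced.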


Separating Ham and Spam Data using Pandas

We select the spam and ham rows from the parent DataFrame data separately so that we can easily visualize the spam corpus and the ham corpus on their own. To do this, we build a simple boolean condition on the v1 column and use it to index the DataFrame.

condition1 = data['v1'] == 'ham'
condition2 = data['v1'] == 'spam'
spam = data[condition2]
ham = data[condition1]

Previewing the selected SPAM DataFrame using the head method

spam Free entry in 2 a wkly comp to win FA Cup fina…..

Previewing the HAM DataFrame using the head method

ham Go until jurong point, crazy.. Available only …
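
To make the selection and preview steps concrete, here is a small sketch on a hypothetical stand-in DataFrame (the rows are invented for illustration). It shows that the condition is just a boolean Series, one True or False per row, and that indexing with it keeps only the True rows.

```python
import pandas as pd

# Hypothetical stand-in rows, for illustration only
data = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham', 'spam'],
    'v2': ['Ok lar', 'Free entry to win', 'See you later', 'Claim your prize'],
})

# The condition is a boolean Series aligned with the rows of data
is_spam = data['v1'] == 'spam'
print(is_spam.tolist())

# Indexing with the mask keeps only the rows where it is True
spam = data[is_spam]
ham = data[data['v1'] == 'ham']

# head() previews the first rows of each selection (five by default)
print(spam.head())
print(ham.head())
```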

We need to visualize the spam and ham corpora using word clouds. To do this, we extract all the text from the v2 column of ham and of spam, generate a cloud from each using WordCloud, and finally display it to see which words it mostly contains.

Building a Function to Extract text from columns

I made a simple function called combine, which extracts all the text of a particular column and combines it into one string.

def combine(array):
    # Join every message into one string, separated by spaces so that
    # the last word of one message does not run into the next message
    return ' '.join(array)

Visualize Spam corpus using wordcloud

Visualizing the corpus of SPAM text can be done as shown below

#__________Visualizing spam textual information________
#pandas->numpy array
spam_array = spam.iloc[:, 1].values
spam_text = combine(spam_array)

#_________Generating Wordcloud by removing stopwords_________
spam_cloud = WordCloud(background_color = 'black', stopwords = STOPWORDS).generate(spam_text)

#_________Displaying the cloud with Matplotlib_________
plt.imshow(spam_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Visualize ham corpus using wordcloud

We visualize the corpus of HAM text by following the same procedure we used for SPAM above.

ham_v = ham.iloc[:, 1].values
ham_text = combine(ham_v)
ham_cloud = WordCloud(stopwords=STOPWORDS).generate(ham_text)
# Display the cloud with Matplotlib
plt.imshow(ham_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()
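
A word cloud draws frequent words larger, so the same information can also be read off as plain counts. The sketch below is an illustration using only the standard library, not part of the original notebook: top_words, the sample messages, and the tiny stopword set are all hypothetical names and data chosen for the example.

```python
from collections import Counter
import re

def top_words(messages, stopwords, n=2):
    """Return the n most common non-stopword tokens across all messages."""
    counts = Counter()
    for message in messages:
        # Lowercase, then keep runs of letters, digits and apostrophes as tokens
        for token in re.findall(r"[a-z0-9']+", message.lower()):
            if token not in stopwords:
                counts[token] += 1
    return counts.most_common(n)

# Hypothetical stand-in messages and stopword list, for illustration only
sample_spam = [
    "Free entry to win a free prize",
    "Claim your free prize now",
    "Call us today",
]
sample_stopwords = {"a", "to", "your", "us"}

print(top_words(sample_spam, sample_stopwords))
# prints [('free', 3), ('prize', 2)]
```

With the real data you could pass the spam_array or ham_v arrays and the STOPWORDS set from wordcloud instead of the stand-ins.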

Transforming our textual data to numerical (word2vec)

After exploring and visualizing our data, we now move to the next stage, feature extraction. We can’t fit a machine learning model on raw textual information, so we have to find a way to convert the text into arrays of numbers. This can be achieved by applying a transformer or vectorizer.

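CountVectorizer()

In this tutorial we are going to use CountVectorizer to transform our textual data into vectors of word counts and later convert them into arrays so that we can fit them into a machine learning model. Note that although the variable in the code is named word2vec, CountVectorizer is a bag-of-words counter, not the Word2Vec embedding algorithm.

To use CountVectorizer we have to import it from scikit-learn as shown below.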

from sklearn.feature_extraction.text import CountVectorizer

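#______Converting textual data to arrays___________________
# The messages live in the v2 column of the dataset
words = data['v2'].values
word2vec = CountVectorizer()
# fit_transform learns the vocabulary first, then counts the words per message
vector_words = word2vec.fit_transform(words)
words_array = vector_words.toarray()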
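
To see what the vectorizer actually produces, here is a small sketch on three hypothetical messages (invented for illustration). It also shows why the vectorizer has to be fitted before transform is used on new text: transform only reuses the vocabulary learned during fitting, and words it has never seen are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in messages, for illustration only
train_messages = ["free entry to win", "ok see you", "win free cash"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns one row of counts per message
train_counts = vectorizer.fit_transform(train_messages).toarray()

# Vocabulary words ordered by their column index in the count matrix
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
# prints ['cash', 'entry', 'free', 'ok', 'see', 'to', 'win', 'you']
print(train_counts)

# transform reuses the learned vocabulary; the unseen word "pizza" is ignored
new_counts = vectorizer.transform(["free pizza"]).toarray()
print(new_counts)
# prints [[0 0 1 0 0 0 0 0]]
```

Each column corresponds to one vocabulary word, so the first message becomes a row with a 1 under entry, free, to, and win, and 0 everywhere else.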

Fitting Our Dataset into Machine Learning Model

After performing feature extraction on the data, we are now at the last stage, in which we load a classifier and train it so that it is ready to be used on a sample case.

We are going to use GaussianNB, a naive Bayes classifier that operates on Bayes’ theorem of conditional probability. Load the classifier and fit it with the training data as follows.

#__________Training the model________________
from sklearn.naive_bayes import GaussianNB
model = GaussianNB().fit(words_array, data['v1'])
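
The conditional probability idea behind naive Bayes can be worked through by hand. The sketch below applies Bayes’ theorem to a single word using hypothetical message counts invented for illustration. It is not what GaussianNB computes internally (GaussianNB models each feature with a Gaussian distribution), only the underlying rule.

```python
# Hypothetical message counts, invented for illustration only
n_spam, n_ham = 20, 80
n_spam_with_free, n_ham_with_free = 12, 4

# Prior probabilities of each class
p_spam = n_spam / (n_spam + n_ham)
p_ham = n_ham / (n_spam + n_ham)

# Likelihood of seeing the word "free" in each class
p_free_given_spam = n_spam_with_free / n_spam
p_free_given_ham = n_ham_with_free / n_ham

# Total probability of seeing "free" in any message
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' theorem: P(spam | free) = P(free | spam) * P(spam) / P(free)
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 2))
# prints 0.75
```

So with these made-up counts, a message containing "free" is spam with probability 0.75, even though only 20% of all messages are spam.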

Congratulations, you have just trained a machine learning model. I then made a simple function within the notebook as a use case of the model.

Building a Function to test our model

def spam_filter():
    while True:
        sentence = input('you: ')
        sent = np.array([[sentence]])
        # Reuse the fitted vectorizer so the columns match the training data
        sent_array = word2vec.transform(sent.ravel()).toarray()
        results = model.predict(sent_array)
        if results[0] == 'ham':
            print('chatbot: Hello')
        else:
            print('chatbot: You\'re spam, I can\'t answer you')

A sample session after calling spam_filter():

you:  hi
chatbot: You're spam, I can't answer you
you:  hello are you good
chatbot: You're spam, I can't answer you
you:  free wifi and money
chatbot: Hello
you:  you have free award call us now 
chatbot: You're spam, I can't answer you
you:  I just wanted to know youre alright
chatbot: Hello
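
The interactive loop above runs until the notebook cell is interrupted, which makes it awkward to reuse. A non-interactive variant might look like the sketch below. It is self-contained for illustration: the training messages and labels are hypothetical stand-ins, and classify and reply are names introduced here, not functions from the original project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in training data, for illustration only
messages = [
    "free entry to win a prize",
    "claim your free cash now",
    "are we still meeting today",
    "ok see you at home",
]
labels = ["spam", "spam", "ham", "ham"]

# Fit the vectorizer and the classifier on the same training messages
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(messages).toarray()
classifier = GaussianNB().fit(features, labels)

def classify(sentence):
    """Return the predicted label ('ham' or 'spam') for one sentence."""
    row = vectorizer.transform([sentence]).toarray()
    return str(classifier.predict(row)[0])

def reply(sentence):
    """Return the chatbot's answer for one sentence."""
    if classify(sentence) == "ham":
        return "chatbot: Hello"
    return "chatbot: You're spam, I can't answer you"

print(reply("win a free prize"))
```

In the notebook you would skip the stand-in training step and use the already fitted word2vec and model objects inside classify.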

We have now reached the end of our tutorial. If you have any comments, suggestions, or difficulties, drop them in the comment box below and I will get back to you ASAP.


To get the whole project code, check it out on my GitHub profile.
