Hello guys, in this tutorial I will cover how to easily train a Machine Learning spam filter in Python and use it to properly classify spam and ham text.
To follow this tutorial effectively, you need a few Python libraries installed on your machine: scikit-learn, pandas, numpy, wordcloud, matplotlib, and Jupyter notebook.
You also need a clean dataset in your project directory, which we are going to use to train our spam filter model. You can download it at Training dataset
You can install the libraries using the pip command below, or with conda, depending on the environment you're using.
    pip install scikit-learn pandas numpy wordcloud matplotlib
    pip install jupyter notebook
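If you want to confirm that the installation worked before going further, a quick check like the one below may help. This is just a sketch using Python's standard library; the list of import names is my assumption based on the libraries used in this tutorial.

```python
import importlib.util

# Import names of the libraries used in this tutorial
# (note that scikit-learn is imported as `sklearn`)
required = ['sklearn', 'pandas', 'numpy', 'wordcloud', 'matplotlib']

# find_spec returns None when a top-level module cannot be found
missing = [name for name in required if importlib.util.find_spec(name) is None]

if missing:
    print('Missing libraries:', ', '.join(missing))
else:
    print('All libraries are installed')
```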
We are going to use Jupyter notebook to train and test our model, so once you have installed the above libraries, open Jupyter notebook as shown below.
    jupyter notebook
    .......
Once Jupyter notebook is open, create a new Python 3 notebook file, and then we are ready to begin training our Machine Learning model.
Importing the necessary libraries
Let's begin by importing all the libraries we need for manipulating and training on the data.
    # Importing all necessary libraries
    import pandas as pd
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud, STOPWORDS
    import numpy as np
We are going to use pandas to load the dataset. It stores the data in a data structure (a DataFrame) that gives us an interface for easily manipulating it.
The WordCloud library will be used to visualize the words that appear most often in a sentence or paragraph.
Basics of Word cloud
Below is sample code showing how to create and show a word cloud of some text using Matplotlib and the WordCloud library:
    s = 'Today we will be creating a classifier for filtering spam email'
    cloud = WordCloud().generate(s)
    plt.imshow(cloud)
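A word cloud is essentially a picture of word frequencies, where words that appear more often are drawn larger. To get a feel for the counting behind it, here is a rough sketch using only the standard library. The stop-word set is a tiny made-up one for illustration, and this is not how WordCloud works internally.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word set for illustration;
# WordCloud ships a much larger STOPWORDS set
stop_words = {'we', 'will', 'be', 'a', 'for'}

s = 'Today we will be creating a classifier for filtering spam email'

# Lowercase the text, split it into words, and drop the stop words
tokens = [w for w in re.findall(r"[a-z']+", s.lower()) if w not in stop_words]

# Count how often each remaining word appears
frequencies = Counter(tokens)
print(frequencies.most_common(3))
```

In this short sentence every remaining word appears once, so all words would get roughly the same size in the cloud.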
Loading our Dataset
We are going to use the pandas built-in function read_csv() to load our training dataset into our notebook, as shown in the code below:
    # Loading and previewing our data
    data = pd.read_csv('spam.csv', usecols=['v1', 'v2'], encoding='latin-1')
    data.head()
| v1 | v2 |
| ham | Go until jurong point, crazy.. Available only …… |
| ham | Ok lar… Joking wif u oni… |
| spam | Free entry in 2 a wkly comp to win FA Cup fina… |
| ham | U dun say so early hor… U c already then say… |
| ham | Nah I don’t think he goes to usf, he lives aro… |
Checking the length of the dataset
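The code for this step is not shown here, so below is a minimal sketch of how it could be done. To keep it self-contained I use a small stand-in DataFrame called sample with made-up rows; in the notebook you would run the same calls on data.

```python
import pandas as pd

# Stand-in rows for illustration only; in the notebook,
# use the `data` DataFrame loaded with pd.read_csv above
sample = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham'],
    'v2': ['Ok lar', 'Free entry now', 'See you later'],
})

n_rows = len(sample)                         # number of messages
shape = sample.shape                         # (rows, columns)
label_counts = sample['v1'].value_counts()   # messages per label

print(n_rows, shape)
print(label_counts)
```

Looking at value_counts() is also a quick way to see how balanced the ham and spam classes are.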
Separating Ham and Spam Data using Pandas
We select the SPAM and HAM data separately from the parent DataFrame data so that we can easily visualize the corpus of each. To do so, we use a simple conditional operator together with pandas to filter the rows by the v1 column.
    condition1 = data['v1'] == 'ham'
    condition2 = data['v1'] == 'spam'
    spam = data[condition2]
    ham = data[condition1]
Previewing the selected SPAM DataFrame using the head method
| v1 | v2 |
| spam | Free entry in 2 a wkly comp to win FA Cup fina….. |
Previewing the HAM DataFrame using the head method
| v1 | v2 |
| ham | Go until jurong point, crazy.. Available only … |
We need to visualize the corpus of SPAM and HAM using a word cloud. To do this, we have to extract all the textual information from the v2 column of ham and of spam, generate a cloud for each using WordCloud, and lastly visualize it to see which words appear most.
Building a Function to Extract text from columns
I made a simple function called combine, which helps extract all the text of a particular column and combine it into one string.
    def combine(array):
        # Join with a space so the last word of one message
        # does not merge with the first word of the next
        return ' '.join(array)
Visualize Spam corpus using wordcloud
Visualizing the corpus of SPAM text can be done as shown below:
    # Visualizing spam textual information
    # pandas -> numpy array
    spam_array = spam.iloc[:, 1].values
    spam_text = combine(spam_array)
    # Generating the word cloud with stop words removed
    spam_cloud = WordCloud(background_color='black', stopwords=STOPWORDS).generate(spam_text)
    plt.imshow(spam_cloud)
Visualize ham corpus using wordcloud
Visualizing the corpus of HAM text follows the same procedure we used for SPAM above:
    ham_v = ham.iloc[:, 1].values
    ham_text = combine(ham_v)
    ham_cloud = WordCloud(stopwords=STOPWORDS).generate(ham_text)
    plt.imshow(ham_cloud)
Transforming our textual data to numerical data
After exploring our data, we now move to the next stage, which is feature extraction. We can't fit a Machine Learning model with raw textual information, so we have to find a way to convert the text into numerical arrays. This can be achieved by applying a transformer or vectorizer.
CountVectorizer()
In this tutorial we are going to use CountVectorizer to transform our textual data into vectors of word counts, and later convert them into arrays so that we can fit them into a Machine Learning model. Note that this is a bag-of-words representation; the variable is named word2vec in the code below, but it is not the Word2Vec embedding model.
To use CountVectorizer we have to import it from scikit-learn, as shown below:
    from sklearn.feature_extraction.text import CountVectorizer

    # Assumption: `words` was not defined in the original snippet;
    # here it is taken to be the message text column
    words = data['v2'].values
    # Converting textual data to arrays
    word2vec = CountVectorizer()
    word2vec.fit(words.ravel())
    vector_words = word2vec.transform(words.ravel())
    words_array = vector_words.toarray()
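To make it clearer what fit and transform are doing, here is a rough pure-Python sketch of the same idea. The helper names fit_vocabulary and transform_counts are my own, not part of scikit-learn. The tokenization (lowercasing, keeping words of two or more characters) is meant to mirror CountVectorizer's defaults as I understand them, so treat it as an approximation.

```python
import re

def tokenize(text):
    """Lowercase the text and keep words of two or more characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(documents):
    """Map each distinct token to a column index (alphabetical order)."""
    tokens = set()
    for doc in documents:
        tokens.update(tokenize(doc))
    return {token: index for index, token in enumerate(sorted(tokens))}

def transform_counts(documents, vocabulary):
    """Turn each document into a list of per-token counts."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:  # words not seen during fit are ignored
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

docs = ['free entry free prize', 'see you at home']
vocab = fit_vocabulary(docs)
print(vocab)
print(transform_counts(docs, vocab))
```

Each message becomes one row, and each column holds the number of times a particular word from the vocabulary appears in that message.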
Fitting Our Dataset into a Machine Learning Model
After performing feature extraction on the data, we are now at the last stage, in which we load a classifier and train it so it is ready to be used on a sample case.
We are going to use GaussianNB, a Naive Bayes classifier that works based on Bayes' theorem of conditional probability. Do the following to load the classifier and fit it with the training data:
    # Training the model
    from sklearn.naive_bayes import GaussianNB
    model = GaussianNB().fit(words_array, data['v1'])
    ......
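For some intuition about the Bayes rule that the classifier relies on, here is a small worked example for a single word. All the probabilities are made-up numbers for illustration, not values measured from the dataset.

```python
# Made-up probabilities for illustration only
p_spam = 0.2                 # P(spam): share of messages that are spam
p_ham = 1 - p_spam           # P(ham)
p_free_given_spam = 0.5      # P("free" appears | spam)
p_free_given_ham = 0.05      # P("free" appears | ham)

# Total probability that a message contains "free"
p_free = p_free_given_spam * p_spam + p_free_given_ham * p_ham

# Bayes' rule: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 3))
```

With these numbers, a message containing "free" would be about 71% likely to be spam, even though only 20% of all messages are spam. A Naive Bayes classifier combines evidence like this across all the features of a message.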
Congratulations, you have just trained a Machine Learning model. I then made a simple function within the notebook as a use case for the model.
Building a Function to test our model
    def spam_filter():
        # Keep asking for input until interrupted (Ctrl+C or stopping the kernel)
        while True:
            sentence = input('you: ')
            sent = np.array([[sentence]])
            # Vectorize the sentence with the already fitted CountVectorizer
            sent_array = word2vec.transform(sent.ravel()).toarray()
            results = model.predict(sent_array)
            # predict returns an array, so compare its first element
            if results[0] == 'ham':
                print('chatbot: Hello')
            else:
                print('chatbo: Youre spam I can\'t answer yoo ')

    spam_filter()

    you: hi
    chatbo: Youre spam I can't answer yoo
    you: hello are you good
    chatbo: Youre spam I can't answer yoo
    you: free wifi and money
    chatbot: Hello
    you: you have free award call us now
    chatbo: Youre spam I can't answer yoo
    you: I just wanted to know youre alright
    chatbot: Hello
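Some of the answers in the session above look off (for example, "hi" is flagged as spam). One thing that may be worth trying is MultinomialNB instead of GaussianNB, since it is generally considered a more natural fit for word counts and it accepts the sparse matrix directly, so .toarray() is not needed. This is a swapped-in classifier, not the one used in this tutorial. The sketch below uses made-up toy messages and a hypothetical classify helper so that it runs on its own; in the notebook you would fit on data['v2'] and data['v1'] instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy messages for illustration only
texts = [
    'are we still meeting for lunch today',
    'ok see you at home later',
    'win a free prize call now',
    'free entry claim your cash prize now',
]
labels = ['ham', 'ham', 'spam', 'spam']

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)   # sparse matrix, no .toarray() needed
classifier = MultinomialNB().fit(counts, labels)

def classify(messages):
    """Return a predicted label ('ham' or 'spam') for each message."""
    return list(classifier.predict(vectorizer.transform(messages)))

print(classify(['free prize now', 'see you at lunch']))
```

A non-interactive function like classify is also easier to test than an input loop, because you can pass it a list of messages and inspect the labels it returns.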
We have now reached the end of our tutorial. If you have any comments, suggestions, or difficulties, drop them in the comment box below and I will get back to you ASAP.
I also recommend reading these:
- How to translate languages using Python
- Emotion detection from the text in Python
- 3 ways to convert text to speech in Python
- How to perform speech recognition in Python
- Make your own Plagiarism detector in Python
- Make your own knowledge-based chatbot in Python
- How to perform automatic spelling correction in Python
- A quick guide to Twitter sentiment analysis using Python
To get the whole project code, check it out on My Github Profile