Kalebu Jordan

Become a Pro Python Developer

Web scraping is the practice of extracting data from websites.

As a programmer, you will often need to extract data from websites, so web scraping is a skill worth having.

In this tutorial, you’re going to learn how to perform web scraping in Python using the requests and BeautifulSoup libraries.

Throughout the tutorial you will work through basic web scraping examples and then implement a simple web scraper that scrapes quotations from a website.

Requirements

To follow this tutorial, you need to have the following Python libraries installed on your system: requests and beautifulsoup4.

Installation

$ pip install requests 
$ pip install beautifulsoup4

Requests

Requests is an elegant and simple HTTP library for Python, built for human beings. 

We will use requests in our project to fetch the HTML of the website we scrape quotations from.

BeautifulSoup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
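To make "navigating, searching, and modifying" a bit more concrete, here is a minimal sketch. The HTML string and the greeting id are made up for this illustration and are not part of the tutorial's sample file.

```python
from bs4 import BeautifulSoup

# A tiny HTML string invented for this illustration
html = "<html><head><title>Document</title></head><body><p id='greeting'>Hello <b>world</b></p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

# Navigating: tags are reachable as attributes of the soup object
print(soup.title.string)   # text inside <title>
print(soup.p['id'])        # value of the id attribute of the first <p>

# Searching: find() returns the first matching tag
bold = soup.find('b')
print(bold.text)

# Modifying: replace the text of a tag inside the parse tree
bold.string = 'Python'
print(soup.p.get_text())
```

The last line prints the paragraph text after the change, which shows that edits are applied to the parsed tree and not to the original string.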

Diving deeper into BeautifulSoup

For instance, let’s use BeautifulSoup to extract data from the HTML file below.

sample.html

<!DOCTYPE html>
<html>
<head>
    <title>Document</title>
</head>
<body>
    <div id = 'quotes'>
        <p id = 'normal'>Time the time before the time times you</p>
        <p id = 'normal'>The Future is now </p>
        <p id = 'special'>Be who you wanted to be when you're younger</p>
        <p id = 'special'>The world is reflection of who you're</p>
    </div>

    <div>
        <p id = 'Languages'>Programming Languages</p>
        <ul>
            <li>Python</li>
            <li>C+++</li>
            <li>Javascript</li>
            <li>Golang</li>
        </ul>
    </div>
</body>
</html>

Extracting all paragraphs in HTML

Let’s extract all paragraphs from sample.html.

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    print(paragraph.text)

Output:

When you run the above program, it produces the following result:

$ python app.py 
Time the time before the time times you
The Future is now 
Be who you wanted to be when you're younger
The world is reflection of who you're
Programming Languages

Code Explanation

Importing Library

from bs4 import BeautifulSoup

The above line of code imports the BeautifulSoup class into our program.

Creating a BeautifulSoup object with html string

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

The above two lines of code read sample.html and create a BeautifulSoup object that is ready for us to search for data within it.

The syntax for making a BeautifulSoup object is:

soup = BeautifulSoup(html_string, 'html.parser')
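As a side note, open('sample.html').read() never closes the file explicitly. A variation, sketched below, uses a with block and passes the open file straight to BeautifulSoup, which also accepts file objects. The temporary file and its one-line content are only there to keep the example self-contained.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a small throwaway HTML file so the example runs on its own
path = os.path.join(tempfile.mkdtemp(), 'sample.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write("<p id='normal'>The Future is now</p>")

# The with block closes the file automatically once parsing is done
with open(path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

print(soup.p.text)
```

Either style gives you the same soup object. The with block is just a safer habit once your scripts start opening many files.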

Finding all paragraphs and printing them

for paragraph in soup.find_all('p'):
    print(paragraph.text)

The above two lines of code find all paragraphs in the HTML file and display their text.

The BeautifulSoup object we just created provides many methods for searching through the document to find the data we want.

One of those methods is find_all(). It accepts a tag name as a parameter, searches the parsed HTML for tags with that name, and returns all of them.
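Here is a short sketch of a few other ways find_all() can be called. The HTML string is a cut-down stand-in written for this example.

```python
from bs4 import BeautifulSoup

# A cut-down HTML string written for this example
html = """
<div>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <ul>
        <li>Python</li>
        <li>Golang</li>
    </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# A single tag name returns every matching tag
paragraphs = soup.find_all('p')
print(len(paragraphs))

# A list of names matches any of them, in document order
both = soup.find_all(['p', 'li'])
print([tag.name for tag in both])

# limit stops the search after the given number of matches
first_only = soup.find_all('li', limit=1)
print(first_only[0].text)

# When nothing matches, the result is simply empty
missing = soup.find_all('table')
print(len(missing))
```

Because find_all() always returns a list-like result, you can loop over it safely even when there are no matches.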

Extracting all list items in HTML

For instance, let’s tweak the above program to display the text of the list items found in the HTML file.

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for item in soup.find_all('li'):
    print(item.text)

Output:

$ python app.py
Python
C+++
Javascript
Golang

Extracting Paragraphs with specific Id

Apart from returning all tags of a given name, we can also check the attributes of those tags to get only the data we want.

For instance, here is a program to extract the paragraphs with an id of normal.

app.py

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    # .get() returns None instead of raising KeyError when a tag has no id
    if paragraph.get('id') == 'normal':
        print(paragraph.text)

Output:

$ python app.py 
Time the time before the time times you
The Future is now 
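The same filtering can also be done by find_all() itself, which accepts attribute filters as keyword arguments or through the attrs dictionary. The sketch below uses a shortened HTML string written for this example, including one paragraph without an id.

```python
from bs4 import BeautifulSoup

# Shortened HTML written for this example
html = """
<div id='quotes'>
    <p id='normal'>The Future is now</p>
    <p id='special'>Be who you wanted to be</p>
    <p>No id here</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keyword argument: only <p> tags whose id equals 'normal'
normal = soup.find_all('p', id='normal')
print([p.text for p in normal])

# attrs dictionary: the same idea, handy for attribute names
# that are not valid Python identifiers
special = soup.find_all('p', attrs={'id': 'special'})
print([p.text for p in special])
```

Tags without the attribute are skipped, so there is no need to check for a missing id yourself.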

Building Our Demo Project

So far we have seen how to extract data from a local HTML file. Now let’s see how we can extract data from a live website.

In this project we are going to implement a web scraper that scrapes quotations from the website at a given URL.

We are going to use the requests library to fetch the HTML from the website and then parse that HTML using BeautifulSoup.

Website to Scrape

Note: Don’t scrape just any website you like. First find out what kind of scraping the site permits and whether it is legal, and then build your scraper accordingly.
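One signal you can check from code is the site's robots.txt file, using urllib.robotparser from the standard library. This is only a sketch: the rules, the my-scraper user agent, and the example.com URLs below are made up, and robots.txt does not replace reading the site's terms of use.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only
rules = ["User-agent: *", "Disallow: /private/"]

print(is_allowed(rules, "my-scraper", "http://example.com/page/2/"))
print(is_allowed(rules, "my-scraper", "http://example.com/private/data"))
```

For a real site you would call set_url() with the address of its robots.txt and then read() on the parser, which downloads the rules for you instead of taking them from a list.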

In our demo project we are going to use the URL below to scrape quotations:

URL = 'http://quotes.toscrape.com/'

scraper.py

import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

for span in soup.find_all('span'):
    # .string is None when a tag contains other tags, so those are skipped
    if span.string:
        print(span.string)

Output:

$ python scraper.py 
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
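If you want each quote together with its author, a slightly more structured version could look like the sketch below. It assumes each quote sits in a div with class quote, containing a span with class text and a small tag with class author. These class names are assumptions about the page markup, so inspect the page to confirm them. The sample string and the function names are placeholders made up for the example.

```python
import requests
from bs4 import BeautifulSoup


def extract_quotes(html):
    """Return a list of (text, author) pairs found in the given HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # Assumed markup: <div class="quote"> wrapping each quotation
    for block in soup.find_all('div', class_='quote'):
        text = block.find('span', class_='text')
        author = block.find('small', class_='author')
        if text is not None and author is not None:
            results.append((text.get_text(strip=True),
                            author.get_text(strip=True)))
    return results


def scrape_quotes(url):
    """Download a page and extract its quotes (needs network access)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on 4xx/5xx responses
    return extract_quotes(response.text)


# Placeholder markup that follows the assumed structure
sample = """
<div class="quote">
    <span class="text">An example quotation.</span>
    <span>by <small class="author">Example Author</small></span>
</div>
"""
print(extract_quotes(sample))
```

Calling scrape_quotes('http://quotes.toscrape.com/') should then give you the quotes from the first page, provided the markup matches the assumptions above. Keeping the parsing in its own function also lets you test it on a saved HTML string without hitting the network.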

I hope you found this post interesting. Don’t forget to subscribe to get more tutorials like this.

If you have any suggestions or comments, drop them in the comment box and I will reply to you promptly.
