Web scraping is simply the process of extracting data from websites.
As a programmer, you will often need to pull data out of websites, so web scraping is a skill worth having.
In this tutorial, you’re going to learn how to perform web scraping in Python using the requests and BeautifulSoup libraries.
Throughout the tutorial you will work through basic web scraping examples and then implement a simple web scraper that scrapes quotations from a website.
Requirements
To follow along with this tutorial, you need to have the following installed on your system:
- Requests
- BeautifulSoup
- A sample HTML file (provided below)
Installation
$ pip install requests
$ pip install beautifulsoup4
Requests
Requests is an elegant and simple HTTP library for Python, built for human beings.
We will use requests in our demo project to download the HTML of the website we want to scrape quotations from.
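As a quick illustration, here is a minimal sketch of fetching a page with requests; the URL is the same demo site we will scrape later in this tutorial.

import requests

# Fetch a page and inspect the response
response = requests.get('http://quotes.toscrape.com/')
print(response.status_code)   # 200 if the request succeeded
print(response.text[:200])    # first 200 characters of the HTML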
BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree.
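For a quick taste of what that looks like, here is a minimal sketch that parses a tiny made-up HTML string and then navigates, searches, and modifies the resulting parse tree.

from bs4 import BeautifulSoup

# A tiny made-up HTML snippet, just for illustration
doc = "<html><body><h1>Hello</h1><p class='intro'>First</p><p>Second</p></body></html>"
soup = BeautifulSoup(doc, 'html.parser')

print(soup.h1.text)                          # navigating: "Hello"
print(soup.find('p', class_='intro').text)   # searching: "First"

soup.h1.string = "Hi there"                  # modifying the parse tree
print(soup.h1)                               # <h1>Hi there</h1>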
Diving deeper into BeautifulSoup
As an example, let’s use BeautifulSoup to extract data from the HTML file below.
Sample.html
<!DOCTYPE html>
<html>
<head>
    <title>Document</title>
</head>
<body>
    <div id='quotes'>
        <p id='normal'>Time the time before the time times you</p>
        <p id='normal'>The Future is now</p>
        <p id='special'>Be who you wanted to be when you're younger</p>
        <p id='special'>The world is reflection of who you're</p>
    </div>
    <div>
        <p id='Languages'>Programming Languages</p>
        <ul>
            <li>Python</li>
            <li>C++</li>
            <li>Javascript</li>
            <li>Golang</li>
        </ul>
    </div>
</body>
</html>
Extracting all paragraphs in HTML
Let’s extract all paragraphs from Sample.html.
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    print(paragraph.text)
Output:
When you run this simple program, it produces the following result.
$ python app.py
Time the time before the time times you
The Future is now
Be who you wanted to be when you're younger
The world is reflection of who you're
Programming Languages
Code Explanation
Importing Library
from bs4 import BeautifulSoup
The above line imports the BeautifulSoup class into our program.
Creating a BeautifulSoup object from an HTML string
html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')
The above two lines read sample.html and create a BeautifulSoup object ready for parsing the data within it.
The syntax for creating a BeautifulSoup object is:
soup = BeautifulSoup(html_string, 'html.parser')
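The second argument selects the parser. html.parser ships with Python, so it works out of the box; as an optional variation, if you have the lxml parser installed you can pass it instead, as sketched below.

# Optional: use the lxml parser instead of html.parser
# (requires: pip install lxml)
soup = BeautifulSoup(html_string, 'lxml')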
Finding all paragraphs and printing them
for paragraph in soup.find_all('p'):
    print(paragraph.text)
The above two lines find all paragraphs in the HTML file and display their text.
The BeautifulSoup object we just created provides tons of methods for searching through the parsed document to find the data we want.
One of those methods is find_all(); it accepts the name of a tag, searches the parsed HTML for every occurrence of that tag, and returns the matches as a list.
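Here are a few other ways find_all() can be called, shown as a small sketch against the Sample.html above; these calls follow the standard BeautifulSoup API.

# Find several tag types at once
soup.find_all(['p', 'li'])

# Filter by an attribute directly
soup.find_all('p', id='special')

# Limit the number of results returned
soup.find_all('li', limit=2)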
Extracting all list items in HTML
For instance, let’s tweak the above program to display the text of every list item found in the HTML file.
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for list_item in soup.find_all('li'):
    print(list_item.text)
Output:
$ python app.py
Python
C++
Javascript
Golang
Extracting paragraphs with a specific id
Apart from returning every occurrence of a tag, we can also check the attributes of each tag in order to pick out specific data. For instance:
Here is a program to extract the paragraphs whose id is normal.
app.py
from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

for paragraph in soup.find_all('p'):
    if paragraph['id'] == 'normal':
        print(paragraph.text)
Output:
$ python app.py
Time the time before the time times you
The Future is now
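As a side note, the same result can be obtained by letting find_all() do the filtering for us, passing the id as a keyword argument; here is that small variation of the program above.

from bs4 import BeautifulSoup

html = open('sample.html').read()
soup = BeautifulSoup(html, 'html.parser')

# Let find_all filter by the id attribute directly
for paragraph in soup.find_all('p', id='normal'):
    print(paragraph.text)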
Building Our Demo Project
So far we have seen how to extract data from an HTML file stored locally; now let’s see how to extract data from a live website.
In this project we are going to implement a web scraper that scrapes quotations from a website at a given URL.
We are going to use the requests library to pull the HTML from the website and then parse that HTML using BeautifulSoup.
Website to Scrape
Note: Don’t just go out and scrape whatever website you want. First check whether scraping that site is legal and allowed by its terms, and only then build your scraper for it.
In our demo project we are going to scrape quotations from the URL below.
URL = 'http://quotes.toscrape.com/'
scraper.py
import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

for span in soup.find_all('span'):
    if span.string:
        print(span.string)
Output:
$ python scraper.py
"The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking."
"It is our choices, Harry, that show what we truly are, far more than our abilities."
"There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle."
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid."
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring."
"Try not to become a man of success. Rather become a man of value."
"It is better to be hated for what you are than to be loved for what you are not."
"I have not failed. I've just found 10,000 ways that won't work."
"A woman is like a tea bag; you never know how strong it is until it's in hot water."
"A day without sunshine is like, you know, night."
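If you want to be more precise, the quotes on this particular site appear to live inside <span class="text"> elements, so you can select them by class instead of filtering every span. This is a small sketch that assumes the site keeps its current markup.

import requests
from bs4 import BeautifulSoup

html = requests.get('http://quotes.toscrape.com/').text
soup = BeautifulSoup(html, 'html.parser')

# Assumes the quotes are wrapped in <span class="text"> on this site
for quote in soup.find_all('span', class_='text'):
    print(quote.text)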
Hope you found this post interesting. Don’t forget to subscribe to get more tutorials like this.
In case of any suggestion or comment, drop it in the comment box and I will reply to you immediately.