Building Your First Web Scraper - Python
In this article you will build your first web scraper. It is also my first Python article/post: I haven't written an article or taught Python to anyone before. I just made some $$ by building tools that automate form filling.
What is a Web Scraper? #
It's a term for a script or piece of software that extracts data from a website.
There are two ways to do this:
- With a browser (browser automation)
- Without a browser (plain HTTP requests)
Scraping without a browser is usually preferred, but with a browser you can do a lot more.
When to use a browser for web scraping? #
You should involve a browser only when you are doing some complicated data extraction, e.g. getting data after login. In those cases a browser makes the job simple.
Use a browser when:
- A login is required
- A captcha is present
- The site is a single-page application
- A session is required
- Cookies are required
- The site uses CSRF protection
Step 1 #
First, install the requests library, which helps us send HTTP requests. We will send an HTTP request to a website and read the response.
$ pip install requests
I hope you are familiar with pip, the Python package manager. If you aren't, you should learn a bit about it before continuing with this article.
I am using macOS Catalina and by default I have both python (Python 2) and python3 (Python 3). I always use python3 and its package manager pip3.
$ pip3 install requests
Installing collected packages: idna, certifi, urllib3, chardet, requests
Successfully installed certifi-2020.6.20 chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.9
The requests library has been installed successfully and we are ready to use it in our Python script.
Let's write a simple Python script to confirm that the requests library is installed. The program will send a GET request to my website and print the status code of the response.
import requests
response = requests.get("https://www.nstack.in/")
print(response.status_code)
If the request is successful, we will get 200 as the status code, which means everything is okay.
Let's run the above code and see the response. I expect you know how to run a Python file.
$ python3 scrapper.py
200
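The script above only checks for 200. As a small sketch of how other status codes could be read, the standard library's http.HTTPStatus maps a numeric code to its standard phrase. The helper names describe_status and is_success are my own for this illustration and are not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard phrase, e.g. '404 Not Found'."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # The code is not one of the statuses known to the standard library.
        return f"{code} (unknown status code)"
    return f"{code} {status.phrase}"


def is_success(code):
    """Treat every 2xx code as a successful response."""
    return 200 <= code < 300


for code in (200, 404, 500):
    print(describe_status(code), "-> success" if is_success(code) else "-> failed")
```

In the script above you could pass response.status_code to these helpers instead of printing the bare number.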
Step 2 #
Now we want to get the website content, and from it we will pick our desired data.
When you send a request to any URL, it responds with a Response object which has various properties.
- status_code: The status code of the response.
- headers: The headers of the response (sent by the server).
- content: The raw HTML of the web page, as bytes.
- encoding: The encoding of the response.
import requests
response = requests.get("https://www.nstack.in/")
body = response.content
print(body)
b'<!doctype html><html lang=en-us><head><meta name=generator content="Hugo 0.69.1"><meta charset=utf-8><meta name=viewport content="width=device-width,initial-scale=1"><meta http-equiv=x-ua-compatible content="ie=edge"><title>Homepage | nstack</title>..........localStorage.setItem("subject","link shortern");</script></body></html>'
This isn't very human-readable, though with some effort you can understand it. It is the HTML source code of my website's homepage (https://www.nstack.in/).
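The b'...' prefix in the output means that content is a bytes object, not a string. Here is a minimal sketch of turning such bytes into text, assuming the page is UTF-8 encoded (the meta charset in the output above says so). The helper decode_body is a name I made up; requests also gives you the already decoded string as response.text.

```python
def decode_body(raw, encoding="utf-8"):
    """Decode raw response bytes into a str, replacing undecodable bytes."""
    return raw.decode(encoding, errors="replace")


# A short piece of the homepage output shown above, as bytes.
raw = b"<title>Homepage | nstack</title>"
text = decode_body(raw)

print(type(raw).__name__)   # the raw body is bytes
print(type(text).__name__)  # the decoded body is str
print(text)
```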
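You can get prettified HTML by parsing the content with BeautifulSoup (from the beautifulsoup4 package, installed with pip3 install beautifulsoup4) and calling its prettify method. The raw content is a bytes object and has no prettify method of its own.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.nstack.in/")
body = response.content
# Parse the raw HTML first: prettify() is a BeautifulSoup method,
# not a method of the bytes object stored in body.
soup = BeautifulSoup(body, 'html.parser')
print(soup.prettify())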
Step 3 #
Our end goal is to pick the Top 4 Posts of the Month, which are listed on my homepage. Open my website's homepage, use Inspect Element, and look at the class names and ids of these DOM items.
We will use the beautifulsoup4 library, which helps us select elements using selector logic.
We will find the elements by their class name, but there are various ways to find an element. You can also find an element by id; we will talk about selectors later.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.nstack.in/")
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_="card-title mt-3")
for post in posts:
    print(post.prettify())
When you run the above code, you will get output similar to this.
<h4 class="card-title mt-3">
 Flutter: Slide Button
</h4>
<h4 class="card-title mt-3">
 grpc iOS stuck
</h4>
<h4 class="card-title mt-3">
 Flutter: Dart Doc
</h4>
<h4 class="card-title mt-3">
 Flutter: Deep dive into Button
</h4>
We got the titles of the top 4 posts, and next we will want their links or corresponding URLs.
But we still don't have plain text: each title comes with its HTML tags. Getting the plain text out of this isn't difficult.
The get_text() method extracts the text from the HTML by removing the tags.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.nstack.in/")
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_="card-title mt-3")
for post in posts:
    print(post.get_text())
Flutter: Slide Button
grpc iOS stuck
Flutter: Dart Doc
Flutter: Deep dive into Button
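The text inside a tag often carries extra spaces or line breaks around it. Here is a small sketch of get_text(strip=True), which trims that whitespace. To keep it runnable without a network request, it parses a sample string copied from the output above instead of the live page, and the function name extract_titles is my own.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the earlier output, so no request is needed.
SAMPLE_HTML = """
<h4 class="card-title mt-3"> Flutter: Slide Button </h4>
<h4 class="card-title mt-3"> grpc iOS stuck </h4>
"""


def extract_titles(html):
    """Return the plain-text titles of all elements with the title class."""
    soup = BeautifulSoup(html, "html.parser")
    posts = soup.find_all(class_="card-title mt-3")
    # strip=True removes the whitespace surrounding the text of each title.
    return [post.get_text(strip=True) for post in posts]


print(extract_titles(SAMPLE_HTML))
```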
Step 4 #
Now we will get the links of all 4 posts. Let's take the class of the anchor tag and find all the HTML elements with that class name.
- Using square brackets, you can read an attribute of an HTML element in BeautifulSoup.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.nstack.in/")
soup = BeautifulSoup(response.content, 'html.parser')
links = soup.find_all(class_="indigo-text btn btn-outline-secondary")
for link in links:
    print(link['href'])
We will get output similar to this one, though not exactly the same, because every time I publish a new article the top 4 list is updated.
https://www.nstack.in/blog/flutter-slide-button/
https://www.nstack.in/blog/grpc-ios-stuck/
https://www.nstack.in/blog/flutter-dart-doc/
https://www.nstack.in/blog/flutter-deep-dive-into-button/
I have used the indigo-text btn btn-outline-secondary class name exactly 4 times on the homepage, for the buttons of the top 4 posts. So we will get only four elements for it.
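Square brackets raise a KeyError when the element doesn't have the attribute. Here is a hedged sketch of a safer variant using get, which returns None instead. The sample HTML is made up for illustration (the second anchor without an href is hypothetical, my homepage doesn't have one), and extract_hrefs is a name I picked.

```python
from bs4 import BeautifulSoup

# Made-up sample: the second anchor has no href, to show the difference.
SAMPLE_HTML = """
<a class="indigo-text btn btn-outline-secondary" href="https://www.nstack.in/blog/flutter-slide-button/">Read</a>
<a class="indigo-text btn btn-outline-secondary">No link here</a>
"""


def extract_hrefs(html):
    """Collect the href of every matching anchor, skipping anchors without one."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all(class_="indigo-text btn btn-outline-secondary")
    hrefs = []
    for link in links:
        href = link.get("href")  # None when the attribute is missing
        if href is not None:
            hrefs.append(href)
    return hrefs


print(extract_hrefs(SAMPLE_HTML))
```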
Step 5 #
Here is the final code, which scrapes the top 4 posts from my homepage.
If you have a little experience with Python, this was very straightforward for you, except perhaps selecting elements by class name.
If you come from web development rather than Python, selecting by class wasn't new to you. You probably understood that part pretty well.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://www.nstack.in/")
soup = BeautifulSoup(response.content, 'html.parser')
posts = soup.find_all(class_="card-title mt-3")
links = soup.find_all(class_="indigo-text btn btn-outline-secondary")
for post, link in zip(posts, links):
    print(post.get_text())
    print(link['href'])
If you know both Python and web development, this must have been very simple for you and you got everything in one go.
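If you want to reuse the result instead of only printing it, the same logic can be wrapped in a function. This is just one possible sketch: extract_posts, the dictionary keys and the sample HTML (including the surrounding div tags) are my own assumptions for the example, and it takes an HTML string so you can try it without hitting the website.

```python
from bs4 import BeautifulSoup

# Made-up sample that imitates two of the homepage cards.
SAMPLE_HTML = """
<div>
  <h4 class="card-title mt-3"> Flutter: Slide Button </h4>
  <a class="indigo-text btn btn-outline-secondary" href="https://www.nstack.in/blog/flutter-slide-button/">Read</a>
</div>
<div>
  <h4 class="card-title mt-3"> grpc iOS stuck </h4>
  <a class="indigo-text btn btn-outline-secondary" href="https://www.nstack.in/blog/grpc-ios-stuck/">Read</a>
</div>
"""


def extract_posts(html):
    """Return a list of {'title': ..., 'url': ...} dicts, one per post."""
    soup = BeautifulSoup(html, "html.parser")
    titles = soup.find_all(class_="card-title mt-3")
    links = soup.find_all(class_="indigo-text btn btn-outline-secondary")
    # zip() silently drops extras, so check that titles and links line up.
    if len(titles) != len(links):
        raise ValueError("number of titles and links does not match")
    return [
        {"title": title.get_text(strip=True), "url": link["href"]}
        for title, link in zip(titles, links)
    ]


for post in extract_posts(SAMPLE_HTML):
    print(post["title"], "->", post["url"])
```

With the live page you would pass response.content to extract_posts instead of SAMPLE_HTML.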
There are more ways to select elements, and scraping depends entirely on selecting the right element. If you know how to select the right element, you know everything.
- find_all(class_="jumbotron"): Finds all elements with the class name jumbotron
- find(id="login-form"): Finds one element with the id login-form
- find('a', id="btn-login"): Finds one element with the tag name a and the id btn-login
- find_all('h2', limit=2): Returns the first two h2 elements
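Here is a short sketch that tries the four selectors above on a tiny, made-up HTML snippet. The markup is only an assumption for the demo and is not taken from any real page.

```python
from bs4 import BeautifulSoup

# Made-up markup that contains every element the selectors look for.
SAMPLE_HTML = """
<div class="jumbotron"><h2>First</h2></div>
<div class="jumbotron"><h2>Second</h2></div>
<h2>Third</h2>
<form id="login-form"><a id="btn-login" href="/login">Log in</a></form>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class_ has a trailing underscore because class is a reserved word in Python.
jumbotrons = soup.find_all(class_="jumbotron")
form = soup.find(id="login-form")
button = soup.find("a", id="btn-login")
first_two = soup.find_all("h2", limit=2)

print(len(jumbotrons))                          # 2
print(form.name)                                # form
print(button["href"])                           # /login
print([h2.get_text() for h2 in first_two])      # ['First', 'Second']
```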
If you liked it and it was a good read, don't be a stranger. Write a comment or tweet about the article and help other developers.
I am writing this to help and to share my knowledge. I am not selling any ads to you.