What is Web Scraping?
As the world moves through the second decade of the 21st century, the now-famous saying "Data is the new oil" grows more relevant every year. Web scraping is a very useful technique for retrieving large volumes of data from a live website. It can also be used to download files, images and text, and even to get live updates from one or more websites.
What is a web-scraper?
Web scraping refers to the extraction of data from websites. This data is collected and converted into another format that is useful to the user (such as an API response or an Excel spreadsheet). Software designed to scrape the web and gather large amounts of data for the user is known as a web scraper. In this tutorial, we'll be saving the scraped data as PDF files.
How do web scrapers work?
Ideally, you specify exactly which data you want so that the web scraper extracts only that data, and does so quickly. For example, you might want to scrape an Amazon page for the juicers on offer, but only the data about the different juicer models and not the customer reviews. So when a web scraper scrapes a site, it is first given the URLs of the required pages. It then loads the HTML of those pages, extracts the required data from that HTML, and outputs it in a format specified by the user (a PDF in our case), as the sketch below illustrates.
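To make that flow concrete, here is a minimal sketch of those three steps, using the same libraries we install below. The URL and the model-name class in it are hypothetical placeholders, not part of this tutorial's target site.

import requests
from bs4 import BeautifulSoup

# 1. Tell the scraper which page we want (hypothetical URL).
url = 'https://example.com/juicers'

# 2. Load the page's HTML.
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# 3. Extract only the data we care about (a hypothetical 'model-name' class here).
models = [tag.text.strip() for tag in soup.findAll('span', {'class': 'model-name'})]

# 4. Output it in a format of our choosing (plain print here; a PDF later in this tutorial).
for model in models:
    print(model)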
Prerequisites-
- Python - You must have Python installed. It comes preinstalled on Ubuntu; Windows and Mac users can easily install it from the official website.
- Text-Editor - A good text editor like VS Code is required. It helps us to quickly and efficiently modify code.
Apart from that, a few libraries are required for our code to function (urllib is part of Python's standard library, so it does not need to be installed separately). Execute these commands in your shell, then sit back and let the computer do the hard work for you.
$ pip3 install beautifulsoup4 requests pdfkit
$ sudo apt-get install wkhtmltopdf
$ sudo apt-get install xvfb
Getting started with the code
from bs4 import BeautifulSoup as bs
import requests
import urllib
import os
import pdfkit
import re
from requests.adapters import HTTPAdapter
Importing the required libraries
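If pdfkit cannot find the wkhtmltopdf binary on your system (common on Windows, or when it is installed outside your PATH), you can point it to the executable explicitly. This is only a sketch; the path below is an example, not a required value.

# Only needed if wkhtmltopdf is not on your PATH; the path below is an example.
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf')
# Later calls can then pass it along, e.g.:
# pdfkit.from_url(some_url, 'out.pdf', configuration=config)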
url = 'https://docs.erpnext.com/docs/user/manual/en'
response = requests.get(url)
soup = bs(response.text,'html.parser')
a = soup.findAll('a',{'class':'stretched-link'})
Here, requests.get sends a GET request to the given URL and its return value is stored in response. The response text is then parsed by Beautiful Soup, which gives us the page's HTML as a searchable object. Going to the ERPNext documentation page and observing its source code, we find that the required data is wrapped in <a> tags having the class stretched-link. To filter these from the page, we use the findAll function provided by Beautiful Soup.
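As a quick sanity check, you can peek at what findAll returned before moving on; the exact links will depend on the live page.

# Peek at the first few matched links to confirm the selector is right.
print(len(a), 'links found')
for element in a[:5]:
    print(element['href'], '->', element.text.strip())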
lb1 = soup.findAll('a',{'href': re.compile('^/')})
lb2 = soup.findAll('link',{'href': re.compile('^/')})
lb3 = soup.findAll('script',{'src': re.compile('^/')})
lb4 = soup.findAll('script',{'href': re.compile('https://')})
lb5 = soup.findAll('link',{'href': re.compile('https://')})
# Remove the version query string and make relative URLs absolute,
# so that wkhtmltopdf can resolve every link and asset.
for i in lb1:
    i['href'] = i['href'].replace('?ver=1616352305.528987', '')
    i['href'] = 'https://docs.erpnext.com' + i['href']
for i in lb2:
    i['href'] = i['href'].replace('?ver=1616352305.528987', '')
    i['href'] = 'https://docs.erpnext.com' + i['href']
for i in lb3:
    i['src'] = i['src'].replace('?ver=1616352305.528987', '')
    i['src'] = 'https://docs.erpnext.com' + i['src']
for i in lb4:
    i['href'] = i['href'].replace('?ver=1616352305.528987', '')
for i in lb5:
    i['href'] = i['href'].replace('?ver=1616352305.528987', '')
Now, we need to straighten out a kink so that wkhtmltopdf can do its work. If any URL contains a version number, or does not start with https://, wkhtmltopdf will fail to compile the page into a PDF. To handle this, we isolate all URLs starting with / and strip the version number from every URL. Again, we use the findAll function provided by Beautiful Soup: each call gives us an array of tags whose URLs need correcting, and the for loops iterate over each item and modify it in place. This gives us clean HTML to work with.
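If you want to see the rewrite in isolation, here is a tiny self-contained example of the same idea; the snippet and its version string are made up purely for illustration.

from bs4 import BeautifulSoup as bs

snippet = '<link href="/assets/css/style.css?ver=1616352305.528987">'
demo = bs(snippet, 'html.parser')
tag = demo.find('link')

tag['href'] = tag['href'].replace('?ver=1616352305.528987', '')   # drop the version number
tag['href'] = 'https://docs.erpnext.com' + tag['href']            # make the URL absolute

print(tag)  # <link href="https://docs.erpnext.com/assets/css/style.css"/>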
for element in a:
    name = element['href'].split('/')
    if name[2] == 'frappe.io':
        # links pointing to frappe.io cause an error, so we skip them
        continue
    else:
        name = element['href'].split('/')[5]
    link = element['href']
    directory = 'Give current working directory'
    print('saving : ', name)
A for loop iterates over each element containing an href and splits it at /. Any link pointing to frappe.io is skipped, since it causes an error. For every other URL, the split breaks it into parts and the fifth item is taken as the name.
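Because the exact index depends on whether the href is still relative or has already been made absolute, it is worth printing the split once to see what each position holds. The URL below is only an example.

# Print each piece of a split href so you can confirm which index holds the page name.
example_href = 'https://docs.erpnext.com/docs/user/manual/en/introduction'   # example only
for index, part in enumerate(example_href.split('/')):
    print(index, ':', part)
# Index 2 is the domain (what the frappe.io check looks at);
# pick whichever index holds the page name for the hrefs on your page.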
card = soup.findAll('div', {'class': 'card'})
for i in range(1, len(card)):
    dirname = card[i].h3.text.strip()
    print('creating folder for : ', dirname)
    dirname = 'Give dirname where folder is to be created'
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    links = card[i].findAll('a', {'href': re.compile('/docs/user/manual/en/*')})
    for f in links:
        html_link = f['href']
        page_url = 'Give Base URL' + html_link
        html_res = requests.get(page_url)
        # creating files with the same name as the last part of the html link
        filename = dirname + '/' + html_link.split('/')[-1] + '.pdf'
        if not os.path.isfile(filename):
            # pdfkit.from_url expects the URL string itself, not the Response object
            pdf = pdfkit.from_url(page_url, filename)
            print(filename)
All the HTML that lies within <div class="card"></div> is collected, and for each card in the array the text of its <h3> tag is stripped and used as the folder name. Then, all the links starting with /docs/user/manual/en/ are taken and stored in links. Lastly, a for loop fetches each of these URLs and saves the page in PDF format.
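On a headless server, wkhtmltopdf sometimes needs a virtual display, which is why we installed xvfb earlier. One way to run the finished script under it looks like this (scraper.py is just a placeholder for whatever you named your file):

$ xvfb-run python3 scraper.py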
The entire code is available in my GitHub repo. Check it out!