Preparing a PDF manual by scraping a website

Ch Virinchi
Author
I’m an aspiring space scientist, a coder by night and an inventor, and I love memorising and reciting long hymns in Sanskrit.
What is Web Scraping?

As the 21st century moves along, the now-famous saying “Data is the new oil” is becoming more and more relevant. Web scraping is a useful technique for retrieving large volumes of data from a live website. It can also be used to download files, images and text, and even to get live updates from one or more websites.

What is a web-scraper?

Web scraping refers to the extraction of data from websites. The data is collected and converted into another format that is useful to the user (such as an API or an Excel spreadsheet). Software designed to scrape the web and gather large amounts of data for the user is known as a web scraper. In this tutorial, we’ll save the scraped data in PDF format.

How do web scrapers work?

Ideally, you specify the data you want so that the web scraper extracts only that data, and does so quickly. For example, you might want to scrape an Amazon page for the types of juicers available, but only need the data about the different models and not the customer reviews. So, a web scraper must first be given the URLs of the pages to scrape. It then loads the HTML of those pages, extracts the required data from it, and outputs the data in a format specified by the user (a PDF in our case).

Prerequisites

  • Python - You must have Python installed. It comes preinstalled on Ubuntu; Windows and Mac users can install it from the Python website.
  • Text editor - A good text editor like VS Code is required. It helps us modify code quickly and efficiently.

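Apart from that, a few libraries and tools are required for our code to work. Run these commands in your shell, then sit back and let the computer do the hard work for you. Note that urllib ships with Python and does not need to be installed, while requests and pdfkit (both imported below) do. xvfb provides a virtual display, which wkhtmltopdf may need on a machine without a screen.

$ pip3 install beautifulsoup4

$ pip3 install requests pdfkit

$ sudo apt-get install wkhtmltopdf

$ sudo apt-get install xvfb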

Getting started with the code

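from bs4 import BeautifulSoup as bs          # HTML parsing
import requests                              # HTTP requests
import urllib                                # URL utilities (standard library)
import os                                    # file and directory handling
import pdfkit                                # Python wrapper around wkhtmltopdf
import re                                    # regular expressions
from requests.adapters import HTTPAdapter    # transport adapter for requests sessions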

Importing the required libraries

url = 'https://docs.erpnext.com/docs/user/manual/en'                    
response = requests.get(url)
soup = bs(response.text,'html.parser')
a = soup.findAll('a',{'class':'stretched-link'})

Here, requests.get sends a GET request to the given URL, and the result is stored in response. The HTML in response.text is then parsed by Beautiful Soup into a tree that we can search. Looking at the source of the ERPNext documentation page, we find that the required links are wrapped in <a> tags with the class stretched-link. To pick these out of the page, we use the findAll function provided by Beautiful Soup.
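To see what this filtering step does without touching the network, here is a minimal sketch that uses only Python's built-in html.parser module as a stand-in. It is not how Beautiful Soup works internally, and the HTML fragment, the LinkCollector class and the link paths are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment, invented for this example
SAMPLE_HTML = """
<div class="card">
  <h3>Accounts</h3>
  <a class="stretched-link" href="/docs/user/manual/en/accounts">Accounts</a>
  <a class="footer-link" href="/about">About</a>
  <a class="btn stretched-link" href="/docs/user/manual/en/stock">Stock</a>
</div>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # A class attribute may hold several space-separated class names
        classes = (attr_map.get('class') or '').split()
        if self.css_class in classes and attr_map.get('href'):
            self.links.append(attr_map['href'])


collector = LinkCollector('stretched-link')
collector.feed(SAMPLE_HTML)
print(collector.links)
```

Only the two anchors that carry the stretched-link class are kept; the footer link is ignored.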

# Tags whose URLs are root-relative (start with '/')
lb1 = soup.findAll('a',{'href': re.compile('^/')})
lb2 = soup.findAll('link',{'href': re.compile('^/')})
lb3 = soup.findAll('script',{'src': re.compile('^/')})
# Tags whose URLs are already absolute (scripts keep their URL in src, not href)
lb4 = soup.findAll('script',{'src': re.compile('https://')})
lb5 = soup.findAll('link',{'href': re.compile('https://')})

version = '?ver=1616352305.528987'    # version suffix found on the page
base = 'https://docs.erpnext.com'     # site root

# Root-relative URLs: drop the version suffix, then prepend the site root
for i in lb1 + lb2:
    i['href'] = base + i['href'].replace(version, '')
for i in lb3:
    i['src'] = base + i['src'].replace(version, '')

# Absolute URLs: only drop the version suffix
for i in lb4:
    i['src'] = i['src'].replace(version, '')
for i in lb5:
    i['href'] = i['href'].replace(version, '')

Now, we need to iron out a kink so that wkhtmltopdf can do its work. If any URL contains a version suffix, or does not start with https://, wkhtmltopdf fails to produce the PDF. So we isolate all URLs starting with / and prepend the site root to them, and we remove the version suffix from every URL. Again, we use the findAll function provided by Beautiful Soup, which returns the matching tags as lists. The for loops go over each tag and rewrite its URL. This gives us clean HTML to work with.
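The same clean-up can be sketched on plain strings with the standard library's urllib.parse, which avoids hard-coding one particular version number. This is only an illustration under a few assumptions: clean_url is a hypothetical helper, the asset paths are invented, and the suffix is assumed to always be a query parameter named ver.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

BASE_URL = 'https://docs.erpnext.com'


def clean_url(url, base=BASE_URL):
    """Return an absolute URL with any 'ver' query parameter removed."""
    # urljoin leaves absolute URLs alone and resolves '/path' against the base
    absolute = urljoin(base, url)
    parts = urlsplit(absolute)
    # Keep every query parameter except the version marker
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != 'ver']
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))


# Hypothetical asset paths, used only to exercise the helper
print(clean_url('/assets/css/docs.css?ver=1616352305.528987'))
print(clean_url('https://docs.erpnext.com/assets/js/app.js?ver=1616352305.528987'))
print(clean_url('/docs/user/manual/en/accounts'))
```

In the scraper, a helper like this could be applied to each tag's href or src in place of the replace calls.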

for element in a:
    parts = element['href'].split('/')
    # Skip links that point to the external frappe.io site
    if parts[2] == 'frappe.io':
        continue
    name = parts[5]    # element at index 5 of the split URL
    link = element['href']
    directory = 'Give current working directory'    # placeholder: set your own path
    print('saving : ',name)

A for loop goes over each link in a and splits its href at /. Links pointing to frappe.io caused an error, so they are skipped with continue. For every other link, the element at index 5 of the split list (the sixth item) is taken as the name.
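To make the splitting concrete, here is a small sketch on two invented links. Unlike the loop above, the hypothetical helper page_name takes the last path segment rather than a fixed index, which is an assumption about what the name should be, not the article's exact logic.

```python
from urllib.parse import urlsplit


def page_name(href, skip_hosts=('frappe.io',)):
    """Return the last path segment of href, or None for skipped hosts."""
    parts = urlsplit(href)
    if parts.netloc in skip_hosts:
        return None
    # Drop empty strings produced by leading or trailing slashes
    segments = [segment for segment in parts.path.split('/') if segment]
    return segments[-1] if segments else None


# What a plain split produces for a root-relative link (invented example)
print('/docs/user/manual/en/accounts'.split('/'))
# ['', 'docs', 'user', 'manual', 'en', 'accounts']

print(page_name('/docs/user/manual/en/accounts'))   # accounts
print(page_name('https://frappe.io/erpnext'))       # None
```

The leading slash produces an empty first element, which is why the page name sits at index 5 in a root-relative link of this shape.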

card = soup.findAll('div',{'class':'card'})
for i in range(1,len(card)):    # the first card (index 0) is skipped

    section = card[i].h3.text.strip()
    print('creating folder for : ',section)

    # placeholder: path where the folder for this section is to be created
    dirname = 'Give dirname where folder is to be created'
    if not os.path.isdir(dirname):
        os.mkdir(dirname)
    links = card[i].findAll('a',{'href': re.compile('/docs/user/manual/en/*')})

    for f in links:
        html_link = f['href']
        # creating files with same name as the last part of the html link
        filename = os.path.join(dirname, html_link.rstrip('/').split('/')[-1] + '.pdf')
        if not os.path.isfile(filename):
            # pdfkit.from_url expects a URL string, not a Response object
            pdfkit.from_url('Give Base URL' + html_link, filename)
            print(filename)

Every <div class="card"> element is collected and, for each one except the first, the text of its <h3> tag is read and stripped of surrounding whitespace. Then, all the links matching /docs/user/manual/en/ inside that card are stored in links. Lastly, a for loop converts each linked page to a PDF file, skipping files that already exist.
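The folder and file handling in this last step can be tried out without wkhtmltopdf. The sketch below is only an approximation under stated assumptions: pdf_path, save_if_missing and fake_convert are hypothetical helpers, the section name and link are invented, and fake_convert writes a text placeholder where the real script would call pdfkit.from_url.

```python
import os
import tempfile


def pdf_path(root, section, html_link):
    """Build <root>/<section>/<last link segment>.pdf and create the folder."""
    folder = os.path.join(root, section)
    os.makedirs(folder, exist_ok=True)
    name = html_link.rstrip('/').split('/')[-1]
    return os.path.join(folder, name + '.pdf')


def save_if_missing(url, path, convert):
    """Call convert(url, path) unless path exists; return True if it ran."""
    if os.path.isfile(path):
        return False
    convert(url, path)
    return True


def fake_convert(url, path):
    # Stand-in for pdfkit.from_url so the sketch runs without wkhtmltopdf
    with open(path, 'w') as handle:
        handle.write('placeholder for ' + url)


root = tempfile.mkdtemp()
link = '/docs/user/manual/en/accounts'    # invented link
target = pdf_path(root, 'Accounts', link)
first = save_if_missing('https://docs.erpnext.com' + link, target, fake_convert)
second = save_if_missing('https://docs.erpnext.com' + link, target, fake_convert)
print(target, first, second)
```

With pdfkit and wkhtmltopdf installed, pdfkit.from_url could be passed as convert in place of fake_convert, since it also takes the URL and the output path as its first two arguments.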

The entire code is in my GitHub repo. Check it out!

Happy Coding!
