Web scraping tutorial with BeautifulSoup

In July, I went to the UseR 2019 conference. While it was not physically possible to attend all the awesome talks, I did write a script in Python + BeautifulSoup that fetches all the presentations from the website. Here is a tutorial that describes how. This script is based on the amazing tutorial by Al Sweigart.

In this tutorial I will break the script into small steps and walk you step-by-step through the code on my GitHub -> downloadUser2019.py.

import os, re
import bs4, requests
from random import random
import time
import win32com.client

In Line 1, two packages are imported:
os helps the Python script interact with the operating system and re is the regex library.
In Line 2:
bs4 is the BeautifulSoup library, which makes the source HTML of the webpage navigable (beautiful); requests helps Python interact with the webpage and download items.
In Line 3, random, a pseudo-random number generator, is imported.
time on Line 4 provides various time-related functions.
Finally, win32com.client helps us interact with Windows; we will use it in this tutorial to create shortcuts.
We will see their uses in the course of the tutorial, so don't think too much about the packages now 😉

win32com.client can be a pain to find on pip; install it with:
pip install pywin32 

The first piece of code is:

website = 'http://www.user2019.fr'
res = requests.get(website + '/talk_schedule/')
res.status_code == 200
user19Soup = bs4.BeautifulSoup(res.text, 'html.parser')
type(user19Soup)

In line 1, the target website is defined
In line 2, we use the get method from the requests library. On invoking this method, we can retrieve data from a specified resource. In our case, the resource is the URL: http://www.user2019.fr/talk_schedule/. While I append the page '/talk_schedule/' to the website, you could also write:

res = requests.get('http://www.user2019.fr/talk_schedule/')

and it would still work. As we move further in the code, it will become clear why I separate the parent website from the page; I also feel it is good practice to do so. Next, Line 3, res.status_code == 200, checks whether the request was successful, i.e. it returns

True

Code 200 is good and, as you might have experienced on the interwebs, code 404 is bad 😉
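If you want the script to stop on a bad response instead of carrying on with an error page, requests also offers raise_for_status(); a minimal sketch (not part of the original script):

# raises requests.exceptions.HTTPError for 4xx/5xx responses,
# so the script fails early instead of parsing an error page
res.raise_for_status()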

In Line 4, res.text gives us the HTML downloaded by the requests.get method as text. This HTML text is then fed to BeautifulSoup and stored in user19Soup. The idea is that by making the HTML soup easily parseable (for text detection tasks), it becomes beautiful.

Type the two lines below in your terminal and see the difference in output. While line 1 will display the HTML text in a raw format, the user19Soup object will return a more beautiful output.

res.text
user19Soup

Regex can also be used for picking up text from the HTML, but is not considered a wise choice. You can read some nice articles on the internet on using a parser vs. regex.
Then, in Line 5, we perform a check to see what type of object user19Soup is

 <class 'bs4.BeautifulSoup'>

Go to the target URL: http://www.user2019.fr/talk_schedule/; there you can see the slides. These are what we want to fetch. Right click anywhere on the page and select View page source, which will show you the raw HTML behind the web-page. It is a huge HTML file, so how do we find the location of the slides? Let's do some investigation...
On the target URL, we see the first slide is by Julia Stewart Lowndes. Let's go to the raw HTML source, press Ctrl + F and search for Julia Stewart Lowndes. We get directed to some HTML that looks like this:

<tr class="filtered" data-toggle="collapse" data-target=".1collapsed">
  <td>10:00</td>
  <td></td>
  <td>Keynote</td>
  <td>Julia Stewart Lowndes</td>
  <td>R for better science in less time</td>
  <td>Julia Silge</td>
  <td>Concorde 1+2</td>
  <td style="text-align: center"><a href="/static/pres/keynote_201907101000.pdf" target="_blank" style="color: #a0178b"><i class="fa fa-download fa-fw"></i></a></td>
</tr>

Good! We have found the corresponding HTML code 🙂
Under the <a> element we can also see the link to the pdf slides; that's what we want to fetch!
The href attribute is used in HTML to specify the link’s destination.
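To convince yourself that the link really sits in the href attribute, you can already pull it out of the soup for this first talk. A quick sketch you could type in the terminal (the output assumes the page still looks as shown above):

firstRow = user19Soup.find('tr', class_='filtered')   # the first <tr class="filtered"> row
firstRow.a.get('href')
# '/static/pres/keynote_201907101000.pdf'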
Okay, who is the next speaker? Let's search for Kamil Wais on the raw HTML page, to check if we can find a similar pattern.
The first observation I make is that all the information relating to the slides and presenters sits under the tag: <tr class="filtered"

<tr class="filtered" data-toggle="collapse" data-target=".2collapsed">
  <td>11:30</td>
  <td></td>
  <td>Shiny 1</td>
  <td>Kamil Wais<br/></td>
  <td>Logging and Analyzing Events in Complex Shiny Apps</td>
  <td>Aline Deschamps</td>
  <td>Saint-Exupéry</td>
  <td style="text-align: center"><a href="/static/pres/t256173.zip" target="_blank" style="color: #a0178b"><i class="fa fa-download fa-fw"></i></a></td>
</tr>

Search yourself for some of the other speaker names from the talk page in the raw HTML, to see if this is indeed the case 🙂

Since you confirm it is 😉, we can write:

trFiltered = user19Soup.find_all('tr', class_='filtered')

Here, we use the most popular method of BeautifulSoup: find_all. To quote from the official documentation, "[This] method looks through a tag's descendants and retrieves all descendants that match your filters". I would advise you to read the docs, but for now it is sufficient to know that this method is going to find all the <tr> tags in the soup that have the class filtered, because, as per our observations, all slide-related data falls under the <tr class="filtered" pattern.

The trFiltered object is subscriptable like a list; try the line below in your terminal and you will see the 8th element:

trFiltered[7]
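You can also do a couple of quick sanity checks on the object, for example (the exact count depends on the page at the time of scraping):

len(trFiltered)       # number of <tr class="filtered"> rows, i.e. number of talks
trFiltered[0].a       # the <a> tag of the first row, which holds the download link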

Okay, let's inspect the source HTML in a bit more detail. We see that under the <a href> tag, not every slide is named in the same manner. Check for yourself the slide name of the second speaker vs. the first. Secondly, the file extensions are also different. Hmm, that means there is no pattern here.

Manually inspecting every <tr> tag would be tedious, so let's write some code to find out all the extensions of the files:

exts = []
for tr in trFiltered:    
    fileName = tr.a.get('href')
    ex = re.search(r'\..*', fileName).group()
    exts.append(ex)

In Line 1 above, we declare an empty list, exts, which we will fill with the individual file extensions.
In Line 2, we loop over all the trFiltered items. In each of the <tr> tags, we retrieve the file name's extension, which, as we have already observed, is under the <a> tag. This is how the <a> tag looks in the raw HTML:

<a href="/static/pres/t256173.zip"

Hence, in Line 3 we first navigate to the a tag and get the URL stored in the href attribute. Writing .tag after a soup element returns the first matching tag; since every tr has just one a tag, this works for us. Read the documentation for more elaboration. A typical output of this line should look like:

>>> fileName
'/static/pres/keynote_201907121415.pdf'

Then in Line 4 we use a regex pattern ('\..*') to find the extensions. If we break the pattern down: \. matches the dot character (.), and .* matches all characters after the dot.
The search method looks for this regex pattern and the group() method at the end returns the matched string. Here, we expect strings like .pdf and .zip to be returned by this line.
Read the docs to understand how group works, and of course the Python regex documentation.
I expect the extensions returned by Line 4 to be the usual ones like: .pdf, .zip, .pptx
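As a quick illustration, here is what the regex returns for the two filenames we saw earlier; a sketch you can paste into the terminal:

re.search(r'\..*', '/static/pres/keynote_201907101000.pdf').group()
# '.pdf'
re.search(r'\..*', '/static/pres/t256173.zip').group()
# '.zip'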

In Line 5, we append the extension of each filename in the a tag to the empty exts list we declared earlier. To see all the unique elements in this list, let's do:

set(exts)

However, the generated output is quite unexpected:

{'.cynkra.com/slides/190712-tsbox/slides.html#1', '.r-forge.r-project.org/slides/user2019/conjoint-slides.html#/', '.zip', '.jumpingrivers.com/t/2019-user-security/', '.com/wp-content/uploads/2019/07/Toulouse/201907_toulouse.html#/', '.github.com/hadley/eb5c97bfbf257d133a7337b33d9f02d1', '.pdf', '.com/davisvaughan/user-2019-rray', '.github.io/talks/useR2019/index.html', '.com/files/ikosmidis_cranly_user2019/#1', '.netlify.com/talks/user2019/user2019#1', '.github.io/user_19/presentation/#1', 
 '.netlify.com/#1', '.github.io/useR2019/#1', '.pptx', '.welovedatascience.com/user2019', '.gitlab.io/user2019/', '.jottr.org/2019/07/12/future-user2019-slides/', '.robinlovelace.net/presentations/user2019-r-for-transport-planning.html', '.github.io/useR2019', '.html'}

Besides the usual extensions '.zip', '.pdf', '.pptx' and '.html', we also see a lot of URLs. Apparently, these slides are hosted by the speakers directly on their own websites. That means that while we can download the usual extensions straight to our hard disk, for the web-based links it would be smarter to just save a shortcut to the slide URL on our hard disk.
Let's create a list with all the usual downloadable file extensions:

usualExts = ['.zip', '.pdf', '.pptx', '.html']
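If you are curious how the talks split between the two groups, a small sketch like this (using the exts list we just built) gives a quick count:

sum(1 for e in exts if e in usualExts)      # slides we can download directly
sum(1 for e in exts if e not in usualExts)  # slides hosted on the speakers' own sites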

Okay, so far so good!
We know the slides come not only with typical file extensions, but also as URLs. Moreover, as observed earlier in the raw HTML source, the slides are not named in a consistent manner. Look at the filenames of the first two speakers:

keynote_201907101000.pdf 
t256173.zip

These are not very helpful when there are hundreds of slides to browse through. Wouldn't it be nice to give a more informative name to our downloaded slides?
Say: Session_Speaker Name_Talk Title...
Luckily, we have all of these as tags in our raw HTML and, by extension, also in our soup.
Look in the raw HTML,
the 3rd <td> tag has the Session
the 4th <td> tag has the Speaker Name
and the 5th <td> tag has the Talk title.
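You can verify these positions directly from the soup with CSS nth-of-type selectors, the same idea the main loop below will use. A quick sketch for the first talk (expected values taken from the HTML shown earlier):

firstTalk = trFiltered[0]
firstTalk.select("td:nth-of-type(3)")[0].getText()   # 'Keynote'
firstTalk.select("td:nth-of-type(4)")[0].getText()   # 'Julia Stewart Lowndes'
firstTalk.select("td:nth-of-type(5)")[0].getText()   # 'R for better science in less time'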

The next piece of code is a small helper function:

def OnlyAlphas(tag):
    tagAlpha = " ".join(re.findall("[a-zA-Z]+", tag.getText()))
    return tagAlpha

This function takes a tag as input and returns only its alphabetic characters; it was created to remove any numbers or special characters from the tag's text.
Typing any tag in the terminal returns something like:

<td>You don't need Spark for this - larger-than-RAM data manipulation with disk.frame</td>

tag.getText(), meanwhile, returns just the text:

"You don't need Spark for this - larger-than-RAM data manipulation with disk.frame"

There are dots and dashes in this text, which are undesired in a filename; that is why the OnlyAlphas function was created.
All the action occurs on Line 2.
tag.getText() returns the text inside the tag, re.findall("[a-zA-Z]+", ...) finds all the individual alphabetic words in that text, and " ".join(...) glues them back together with spaces.
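For example, feeding the title cell of the first talk into OnlyAlphas gives back just the words; a quick check you can run in the terminal:

titleTag = trFiltered[0].select("td:nth-of-type(5)")[0]
OnlyAlphas(titleTag)
# 'R for better science in less time'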

All the code components encountered so far were preparation for the following for loop, which actually does all the fetching.
I will also break this loop down into small, manageable components:

for tr in trFiltered:
    t1, t2, t3 = tr.select("td:nth-of-type(3), td:nth-of-type(4), td:nth-of-type(5)")
    tName = OnlyAlphas(t1) + '_' + OnlyAlphas(t2) + '_' + OnlyAlphas(t3)
    fileN = tr.a.get('href')
    ext = re.search(r'\..*', fileN).group()

Line 1 goes through each <tr> element.
We identified earlier that the 3rd, 4th and 5th <td> are the elements needed to rename the file. To pick them out, Line 2 uses the select method.
This method takes one or more CSS selectors as input and returns the instances of tags that match. Read the bs4 documentation and the W3Schools resource about CSS selectors.
Line 3 uses the OnlyAlphas function to extract only the alphabetic parts of the text in the 3rd, 4th and 5th <td> and joins them with underscores.
fileN and ext on Lines 4 and 5 follow the same logic as the fileName and ex variables used above.

Now, if you remember, there are two cases: either the file extension is usual or it is unusual 😉
Let's deal with the usual case first, in this if block:

    if ext in usualExts:
        resFile = requests.get(website + fileN)
        outFile = open(os.getcwd() + '\\downloads\\' + tName + ext, 'wb')
        resFileIter = resFile
        resFile.close()
        for chunk in resFileIter.iter_content(100000):
            outFile.write(chunk)
        outFile.close() 

Line 1 checks whether the extension is in the usualExts list. In Line 2, the requests.get method fetches the file from the specified resource, which in this case looks like:

'http://www.user2019.fr/static/pres/t258286.html'

Now I hope it makes sense why I mentioned in the beginning to keep the website variable on its own: when we begin to fetch files, we can simply append the filename to the website. It makes the code simpler and more reusable. 🙂
In Line 3, we create the destination file on our local drive with Python's standard open() function. The arguments are:
the current working directory + the downloads folder + the desired file name + the file extension, and wb.
Note: you can modify the

os.getcwd() + '\\downloads\\' 

component of the file argument to wherever you want to save the files on your local disk. If you don't have a downloads folder in your working directory, please create one before running the loop.
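If you would rather let Python take care of the folder, one option (a sketch, not part of the original script) is to build the path with os.path.join and create the downloads folder automatically before the loop:

downloadDir = os.path.join(os.getcwd(), 'downloads')
os.makedirs(downloadDir, exist_ok=True)              # create the folder only if it is missing
# then inside the loop:
outFile = open(os.path.join(downloadDir, tName + ext), 'wb')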

wb stands for write-binary and is an important detail if you are on a Windows machine (like me). Windows translates line endings when writing in text mode, which can corrupt binary files such as .html or .pdf, so we need the wb mode. Hence, in Line 3, the file we create in the desired location is opened in binary mode. See this discussion on StackOverflow for more insight.

In Line 4, we keep the response fetched by the requests module under a new name, and in Line 5 we close the connection with the website. This is done to not strain the website: the more open connections, the more load on the site. This, I think, is good etiquette, especially when scraping large amounts of data from the web.

In Line 6, we use a for loop over the iter_content() method. This loop iterates over the response data from the website (resFileIter) in chunks and writes these chunks to our destination file. We have chosen to write 100,000 bytes at a time, which is a reasonable rule of thumb. Read more in the documentation.
Finally, in the last line of this block, we close the opened destination file. If we don't do that, Python will keep the file open, thereby keeping it locked so that you cannot modify it.
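A common alternative, if you tend to forget closing files, is Python's with statement, which closes the file automatically when the block ends. The download part of the if block could then look roughly like this (a sketch under the same assumptions as above):

resFile = requests.get(website + fileN)
with open(os.getcwd() + '\\downloads\\' + tName + ext, 'wb') as outFile:
    for chunk in resFile.iter_content(100000):
        outFile.write(chunk)
# no explicit outFile.close() needed: the with block closes the file for us
resFile.close()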

Remember, in the if clause above we check whether the file extension is usual. When this is not the case, the else clause gets activated.

Let's now see how this else block works:

    else:
        outFileUrl = os.getcwd() + '\\downloads\\' + tName + '.url'
        ws = win32com.client.Dispatch("wscript.shell")
        shortcut = ws.CreateShortcut(outFileUrl)
        shortcut.TargetPath = fileN
        shortcut.Save()
    print('fetched')

The slides that do not have a usual extension are all URLs. So in Line 2, we create the name of our file (similar to what we did in the if block), but rather than adding ext to the filename we add '.url'. A URL slide looks like:

'https://gist.github.com/hadley/eb5c97bfbf257d133a7337b33d9f02d1'

Internet shortcuts have the extension .url.
Lines 3 and 4 call the WScript shell object, which helps create a shortcut at the target path. More info on this object is here. Line 5 sets the internet location the shortcut points to, and Line 6 saves the shortcut. Without the Save command the shortcut will not be created.
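For context, an Internet shortcut is just a tiny INI-style text file, so if you are not on Windows (or prefer to avoid win32com), you could also write it yourself; a minimal sketch:

with open(outFileUrl, 'w') as f:
    f.write('[InternetShortcut]\nURL=' + fileN + '\n')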

The last line in our for loop is

time.sleep(random())

With this command we pause our script (the for loop) for a random duration in the range [0.0, 1.0) seconds. Sleeping for a moment between requests helps reduce the load on the website; the intention is similar to closing the response connection above.

That's it! 🙂
I hope that by reading this script you now understand the concepts of web scraping 🙂