3 When to Use a Browser Driver

The Selenium browser driver is typically used to scrape data from dynamic websites that use JavaScript (although it can scrape data from static websites too). The use of JavaScript can vary from simple form events to single-page apps that download all their content after loading. As a consequence, for many web pages the content displayed in our web browser is not available in the original HTML. For example, the following use-cases often occur:

  1. A result table shows up only after the user clicks the search box.
  2. Content following a click on a link is generated instantaneously rather than already being stored on the server before the click.
  3. A JavaScript request might trigger a new block of content to load.

The following subsections cover these three use-cases in detail.

3.3 Dynamic load

With dynamic loading, new content appears only after a JavaScript request for that information is made to the server. There are two major ways in web application design to make a JavaScript request that triggers a new block of content to load. The first way uses pagination, while the second way involves the scroll bar hitting the bottom of a page.

Let us first look at pagination by inspecting the dynamic search web page again. We can see that the page links are stored within a <div> element with ID “pagination”. Here, the href attribute has a value of javascript:void(0), which means that clicking the link does not navigate anywhere: the browser stays at the same position on the same page. Instead, once a page link is clicked, the browser executes a JavaScript function, previous() or next(), which makes another JavaScript request to the server for the information on that page. The new information from the previous or next page is then displayed.
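The difference between a navigable link and a JavaScript pseudo-link can also be checked in code before deciding whether a plain HTTP request is worth trying. The sketch below is a minimal illustration using only the standard library. The helper name is_javascript_link and the sample href values other than javascript:void(0) are hypothetical and do not come from the example page.

```python
from urllib.parse import urlparse


def is_javascript_link(href):
    """Return True if href uses the javascript: pseudo-scheme instead of a URL."""
    # urlparse splits off the scheme (the part before the first colon)
    # and returns it in lower case.
    return urlparse(href.strip()).scheme == "javascript"


# The href value found on the page links in this example.
print(is_javascript_link("javascript:void(0)"))
# Hypothetical hrefs that point at real, fetchable locations.
print(is_javascript_link("https://example.com/results?page=2"))
print(is_javascript_link("/results?page=2"))
```

Only the first call reports a JavaScript link. For such links there is nothing for requests to fetch, so the content behind them has to be reached by driving a browser.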

In this case, the value of the href attribute is not a URL. So, there is no point in trying to test if this is a static or dynamic link using the requests module. But we can illustrate the dynamic load using the lxml module. The code below tries to scrape the page link information using the lxml module:

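import requests
from lxml import html

search_url = "https://iqssdss2020.pythonanywhere.com/tutorial/cases/search"
search_page = requests.get(search_url)
search_html = html.fromstring(search_page.text)
# Look for the "next" page link in the static HTML returned by the server
page_link = search_html.xpath('//*[@id="next"]')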

The scraper has failed to extract the page links: the xpath() method returns an empty list. Let us examine the page source code to see why. Here, we find that the <div> element with ID “pagination” is empty:

If we scroll down to the end of the source code, we find that the display of the page links is coded in a JavaScript function displayResult(jsonresult) in the JavaScript section. This means that the web page uses JavaScript to load the page links and insert them at the position of the <div> element with ID “pagination” in the original HTML. In the Elements window, we can see the revised HTML after the JavaScript code has run:
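To read the page links from the rendered page rather than from the original HTML, a browser driver has to wait until the JavaScript has filled in the element. The sketch below is one possible way to do this, not the tutorial's own method: it polls the live DOM through Selenium's execute_script() until the “pagination” element has child elements. The function names rendered_child_count and demo, the timeout values, and the use of Chrome are assumptions made for illustration. Selenium's built-in WebDriverWait serves the same purpose; hand-rolled polling is used here only to keep the helper free of imports.

```python
import time


def rendered_child_count(driver, element_id, timeout=10.0, poll=0.5):
    """Poll the live DOM until the element has children; return how many.

    Returns 0 if the element is missing or still empty when the timeout expires.
    """
    script = (
        "var el = document.getElementById(arguments[0]);"
        "return el ? el.children.length : 0;"
    )
    deadline = time.monotonic() + timeout
    while True:
        count = driver.execute_script(script, element_id)
        if count > 0 or time.monotonic() >= deadline:
            return count
        time.sleep(poll)


def demo():
    """Open the example page and count the children rendered inside the <div>."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get("https://iqssdss2020.pythonanywhere.com/tutorial/cases/search")
        # Assumption: the links appear without further interaction. If the page
        # only fills the element after a search, submit the search first.
        return rendered_child_count(driver, "pagination")
    finally:
        driver.quit()
```

Calling demo() launches a browser, so it is left uncalled here. A non-zero result would confirm that the links exist only in the rendered DOM, in contrast to the empty list obtained from the original HTML.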

Now let us examine the second way of dynamically loading content: scrolling down to the bottom of a page. Let’s look at another example, the dynamic search load webpage. This webpage is the same as the previous example, except that new content is loaded when the scroll bar reaches the bottom of the page instead of when a link is clicked. The code below tries to scrape the result table’s entries using the lxml module:

searchLoad_url = "https://iqssdss2020.pythonanywhere.com/tutorial/casesLoad/search"
searchLoad_page = requests.get(searchLoad_url)
searchLoad_html = html.fromstring(searchLoad_page.text)
entries_link = searchLoad_html.xpath('//*[@id="resultstable"]/tbody/tr')

The scraper has failed to extract that information: the xpath() method returns an empty list. Below we see the result table part of the original HTML from the page source code. It is clear that there is nothing under the <tbody> tag, which explains why the xpath() method returns an empty list.

Below we see the same section of the revised HTML from the Elements window after the JavaScript code has executed. The empty result table is created statically in the original HTML. The webpage then runs JavaScript to insert the first chunk of the students’ information into that table before displaying it in the browser.

More interestingly, once we open the Elements window of this example webpage, we find 15 table entries (<tr> tags) under the <tbody> tag. We can therefore infer that the initial load delivers 15 table entries. If we scroll down to the bottom of the webpage, we can see that 9 more entries are appended to the table. If we continue to scroll down to the bottom of the page, nothing changes. This means that there are 24 table entries in total, with the second load delivering the last 9 of them.

Your browser executes the JavaScript code to perform all these actions. If we look at the JavaScript section in the page source code, we will see a jQuery event handler, beginning $(window).on("scroll", function(), that detects when the user has scrolled to the bottom of the page and then triggers the load of the new content. Another jQuery call, $('#resultstable > tbody:last-child').append(htmlContent), appends the new entries to the result table.
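The scrolling behaviour described above can be reproduced with a browser driver by scrolling to the bottom repeatedly until the number of table rows stops growing. The following is a minimal sketch under stated assumptions rather than a definitive implementation: the names scroll_until_stable and demo, the one-second pause, the cap on scroll rounds, and the use of Chrome are all choices made for illustration. The CSS selector mirrors the XPath used earlier for the result table.

```python
import time


def scroll_until_stable(driver, row_selector, pause=1.0, max_rounds=20):
    """Scroll to the bottom until the row count stops growing; return the count."""
    count_js = "return document.querySelectorAll(arguments[0]).length;"
    previous = driver.execute_script(count_js, row_selector)
    for _ in range(max_rounds):
        # Scrolling to the bottom fires the page's own scroll handler.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Crude fixed wait so the JavaScript request has time to finish.
        time.sleep(pause)
        current = driver.execute_script(count_js, row_selector)
        if current == previous:
            break  # nothing new was appended, so all rows are loaded
        previous = current
    return previous


def demo():
    """Open the example page in Chrome and load every table entry."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get("https://iqssdss2020.pythonanywhere.com/tutorial/casesLoad/search")
        time.sleep(2)  # assumption: enough time for the initial rows to render
        return scroll_until_stable(driver, "#resultstable > tbody > tr")
    finally:
        driver.quit()
```

If the page behaves as described above, demo() should see the count grow from 15 to 24 and then stop. The fixed pause is the weakest part of the sketch: on a slow connection a load may not finish in time, and the loop would then stop too early.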