Tuesday, March 31, 2020

Scraping a JS-heavy website using Scrapy and Selenium

Scrapy doesn't really like JavaScript-heavy websites, especially ones that load the rest of their HTML via secondary requests made from JavaScript.

To overcome this, you either use Splash or Selenium. Unfortunately, Splash is no longer supported; it still works, but moving forward it's going to be Selenium.

The good news here is that Scrapy already supports Selenium via a middleware: scrapy-selenium.
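It's installable from PyPI:

pip install scrapy-selenium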

Steps to use scrapy-selenium:

1. Download a Selenium driver. For Firefox, that's geckodriver (you can get it from https://github.com/mozilla/geckodriver/releases). I'm assuming you also have the browser itself installed.

2. Add the following settings to the settings.py file:

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'path/to/gecko'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'path/to/firefox binary'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox

For example (on Windows):

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'c:\\Tools\\geckodriver.exe'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'c:\\Program Files\\Mozilla Firefox\\firefox.exe'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using chrome instead of firefox
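scrapy-selenium's downloader middleware also has to be enabled in settings.py, per the project's README:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}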

3. In the spiders, just replace the Request() calls with SeleniumRequest() calls (a full spider sketch follows after step 4).

4. Add a wait_until test to the SeleniumRequest(). With the imports it needs, it would look like this:

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

SeleniumRequest(url=url,
                callback=self.parse,
                wait_time=5,
                wait_until=EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.search-results div.search-cell')))

In this one, the driver waits up to 5 seconds for the element, selected by CSS class, to become visible in the rendered page. After that, we can use the regular Scrapy selectors on the response to find the things we want.
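To tie steps 3 and 4 together, here's a minimal spider sketch. The URL and the yielded field are just placeholders for illustration; the wait_until condition is the one from the example above:

from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class SearchSpider(Spider):
    name = 'search'

    def start_requests(self):
        # SeleniumRequest instead of the usual scrapy.Request
        yield SeleniumRequest(
            url='https://example.com/search?q=scrapy',  # placeholder URL
            callback=self.parse,
            wait_time=5,
            wait_until=EC.visibility_of_element_located(
                (By.CSS_SELECTOR, 'div.search-results div.search-cell')))

    def parse(self, response):
        # By the time this runs, the JavaScript has rendered the results,
        # so plain Scrapy selectors work on the response.
        for cell in response.css('div.search-results div.search-cell'):
            yield {'text': cell.css('::text').get()}  # placeholder field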

So that's it.