
Tuesday, March 31, 2020

Scrapy a JS heavy website using Selenium

Scrapy doesn't really like JavaScript-heavy websites, especially the ones that load the rest of their HTML via secondary requests using JavaScript.

To overcome this, you use either Splash or Selenium. Unfortunately, Splash is no longer supported. It still works, but moving forward it's going to be Selenium.

The good news here is that Scrapy already supports Selenium via a middleware: scrapy-selenium.

Steps to use scrapy-selenium:

1. Download a Selenium driver. For example, for Firefox get geckodriver. I'm assuming you also have the browser itself installed.

2. Add the following settings to the settings.py file:

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'path/to/gecko'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'path/to/firefox binary'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using Chrome instead of Firefox

For example (on Windows):

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'c:\\Tools\\geckodriver.exe'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'c:\\Program Files\\Mozilla Firefox\\firefox.exe'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']  # '--headless' if using Chrome instead of Firefox
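
Also make sure the middleware itself is enabled; per the scrapy-selenium README, it goes into DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}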

3. In the spiders, just replace the Request() calls with SeleniumRequest().

4. Add a wait_until test to the SeleniumRequest() so it looks like this:

from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

SeleniumRequest(url=url,
                callback=self.parse,
                wait_time=5,
                wait_until=EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.search-results div.search-cell')))

In this one, we wait up to a maximum of 5 seconds for the element, selected by its CSS class, to become visible in the HTML source. After that, we can use Scrapy selectors to find the things we want.
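
Putting the pieces together, a minimal spider could look something like this (the spider name, URL, and yielded fields here are placeholders, not from a real project):

from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class SearchSpider(Spider):
    name = 'search'

    def start_requests(self):
        # SeleniumRequest instead of scrapy.Request, so the browser renders
        # the JavaScript before Scrapy ever sees the HTML
        yield SeleniumRequest(
            url='https://example.com/search',  # placeholder URL
            callback=self.parse,
            wait_time=5,
            wait_until=EC.visibility_of_element_located(
                (By.CSS_SELECTOR, 'div.search-results div.search-cell')))

    def parse(self, response):
        # by this point the JS-rendered HTML is in the response,
        # so the usual Scrapy selectors work
        for cell in response.css('div.search-results div.search-cell'):
            yield {'text': ' '.join(cell.css('::text').getall()).strip()}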

So that's it.

Thursday, August 29, 2019

Handling non-well-formed HTML in Scrapy with BeautifulSoup

With Scrapy, we can deal with non-well-formed HTML in many ways. This is just one of them.

BeautifulSoup has a pretty nifty feature where it tries to fix bad HTML, like adding missing tags. So if we put BeautifulSoup in the middle, whatever we get from a site is fixed before we parse it with Scrapy.
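
To see what that fixing looks like in isolation, here's a quick sketch (separate from the middleware setup):

from bs4 import BeautifulSoup

# neither the <p> tags nor the <b> tag is ever closed in this input
broken = '<html><body><p>First paragraph<p>Second one with an <b>unclosed tag'
soup = BeautifulSoup(broken, 'html.parser')
print(soup.prettify())  # prints well-formed HTML with the missing closing tags added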

Fortunately, all we have to do is pip install Alecxe's scrapy-beautifulsoup middleware.

pip install scrapy-beautifulsoup

Then we configure Scrapy to use it from settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

BeautifulSoup comes with a default parser named 'html.parser'. We can change it:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is the better parser IMO, but it has to be installed separately.
 
pip install html5lib
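
A quick way to see the difference between the parsers (html5lib builds the same tree a browser would, filling in the implied <html>, <head>, and <body>):

from bs4 import BeautifulSoup

snippet = '<p>hello'
print(BeautifulSoup(snippet, 'html.parser'))  # <p>hello</p>
print(BeautifulSoup(snippet, 'html5lib'))     # <html><head></head><body><p>hello</p></body></html>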

Tuesday, May 21, 2019

Using ElementTree (XML) with Scrapy to deal with hard-to-handle HTML

While Scrapy's selectors like XPath and CSS are powerful, there are some cases where they cost too much effort.

An example is irregular HTML text like this:

 ['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',  
  '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']  

This is a sample result of calling the extract() method on a Scrapy selector. The parts we want here are the href and the link text.

We can:
  1. Do multiple Scrapy selector calls to get the data we need, or
  2. Do a single Scrapy selector call and process the result as XML

I went with #2. Dealing with HTML as XML should be relatively easy. Besides, Python already has a way of working with XML via the ElementTree XML API.

So the Python code to solve the problem is short and simple:

import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
....
    def parse(self, response):
        ....
        cap_links = []
        for link_item in our_raw_links:
            root = ET.fromstring(link_item)  # parse the <a> snippet as a tiny XML document
            href = root.attrib['href']       # attribute of the root <a> element
            anchor_text = root[0].text       # text of the nested <b> child

            cap_links.append({'text': anchor_text, 'href': href})
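
To see the same trick outside a spider, here it is run on the first of the sample anchors above:

import xml.etree.ElementTree as ET

raw = '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>'
root = ET.fromstring(raw)
print(root.attrib['href'])  # https://www.blogger.com/u/1/misc/cst2020d.html
print(root[0].text)         # Dimensions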

And Voila!

Thursday, February 14, 2019

Random quirk with Windows, Git and an executable Bash script

The context of the problem: I was making a Docker container with a Linux image, to be deployed to a cloud service, and I was working from a Windows machine. Of course, the container needs code to run. To facilitate this, I wrote a basic bash script - scrape.sh.

On Windows, the scrape.sh file already shows as executable.
For disclosure, I'm working with Scrapy here.

From my Windows terminal, the scrape.sh script seemed to be already executable. I thought I was fine, so I did the whole git cycle of add, commit, and push.

The problem showed up when the container tried to run: it spat out an error saying it couldn't execute the scrape.sh script because it lacked the rights. It turns out git committed my bash script as an ordinary file, not as an executable, despite what I saw on the Windows terminal. The fix, though, was quite easy:


$ git add --chmod=+x -- scrape.sh
$ git commit -m "scrape.sh is now executable"
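
To double-check before pushing, git ls-files --stage shows the mode bits git has recorded for the file: 100755 means executable, 100644 means a plain file.

$ git ls-files --stage scrape.sh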



Friday, January 4, 2019

Installing Scrapy on Windows

Data is the new oil, they say, and you want to start scraping sites for data. Fine! And since you are a Python developer, you'd want to use Scrapy. Unfortunately for you, you are a Windows user, and errors abound.

Anyhow, you run pip install scrapy and you run into an error:

Scrapy failing to pip install in Windows

The error here is with one of Scrapy's dependencies: Twisted. Fortunately, this is fixable.

The first thing we're going to do is download the "binary" wheel file for Twisted. There's a trick here though: YOU MUST DOWNLOAD THE CORRECT ONE THAT MATCHES YOUR PYTHON VERSION. So if you are on Python 3.6, you're looking for something that reads like Twisted-18.9.0-cp36-cp36m-win32.whl; if Python 3.7, then you're looking for something with cp37-cp37m. You get the idea.

The 32-bit vs 64-bit part probably also matters, but I didn't test it.

Once you have that file downloaded somewhere, you can then do a pip install on it, e.g. pip install Twisted-18.9.0-cp36-cp36m-win32.whl.

Try installing Scrapy again and you should be good to go.
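
In other words, the whole dance looks something like this (using the Python 3.6 wheel name from above; substitute the one matching your setup):

pip install Twisted-18.9.0-cp36-cp36m-win32.whl
pip install scrapy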

As an added bonus, this seems to fix other Python packages on Windows that require the Visual Studio Windows C++ SDK, like mysql-python.

Tip for Pipenv users:

1. pipenv shell
2. pip install the Twisted wheel (as above)
3. pipenv sync