Tuesday, May 21, 2019

Using ElementTree (xml) with Scrapy to deal with hard-to-handle HTML

While Scrapy's selectors, such as xpath and css, are powerful, there are some cases where using them takes too much effort.

Here is an example of irregular HTML text:

 ['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',  
  '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']  

This is a sample result of calling the extract() method on a Scrapy selector. The parts we want from each item are the href and the link text.

We can:
  1. Do multiple Scrapy selector calls to get the data we need or
  2. Do a single Scrapy selector call and process it via XML
I went with #2. Dealing with HTML as XML should be relatively easy, as long as each fragment is well-formed, which these are. Besides, Python already has a way to work with XML: the ElementTree XML API.

So the Python code to solve the problem is short and simple:

import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
....
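    for link_item in our_raw_links:
        # Each item is a small, well-formed fragment, so it parses as XML
        root = ET.fromstring(link_item)
        # root is the <a> element; its attributes hold the href
        href = root.attrib['href']
        # The first child is the <b> element that wraps the link text
        anchor_text = root[0].text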

        cap_links.append({'text': anchor_text, 'href': href})
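
To try the loop outside a spider, here is a minimal standalone sketch. It assumes our_raw_links is filled by hand with two of the sample strings from above (in the spider it would come from the selector's extract() call), and that cap_links starts out as an empty list.

```python
import xml.etree.ElementTree as ET

# Assumption: two of the sample fragments from above, pasted in by hand.
our_raw_links = [
    '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',
    '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>',
]

cap_links = []
for link_item in our_raw_links:
    root = ET.fromstring(link_item)   # the <a> element
    href = root.attrib['href']        # attribute lookup on <a>
    anchor_text = root[0].text        # text of the first child, <b>
    cap_links.append({'text': anchor_text, 'href': href})

for link in cap_links:
    print(link)
```

Note that the non-breaking space (\xa0) in the last item survives as-is in the extracted text, so it may need replacing with a normal space later on.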

And voilà!
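
One caveat: this only works when each fragment is well-formed XML. Real-world HTML often is not (an unclosed <br>, an entity like &nbsp;, or a bare & in a URL), and ET.fromstring raises ET.ParseError on those. Below is a rough sketch of a guard around the same idea. The helper name parse_link is my own for illustration, not something from Scrapy or ElementTree, and returning None for bad fragments is just one possible choice.

```python
import xml.etree.ElementTree as ET


def parse_link(fragment):
    """Hypothetical helper: return {'text', 'href'} for an <a> fragment,
    or None when the fragment is not well-formed XML."""
    try:
        root = ET.fromstring(fragment)
    except ET.ParseError:
        # Not valid XML (unclosed tag, undefined entity, bare '&', ...)
        return None
    # itertext() collects the text of <a> and all nested tags,
    # so it does not depend on the text sitting in the first child.
    anchor_text = ''.join(root.itertext())
    return {'text': anchor_text, 'href': root.attrib.get('href')}


good = parse_link('<a href="x.html"><b>Dimensions</b></a>')
bad_entity = parse_link('<a href="y.html"><b>3D&nbsp;model</b></a>')
bad_tag = parse_link('<a href="z.html">Line<br>break</a>')
print(good, bad_entity, bad_tag)
```

Fragments that come back as None could then be handled the other way, with extra Scrapy selector calls.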