
Tuesday, May 21, 2019

Using ElementTree (XML) with Scrapy to deal with hard-to-handle HTML

While Scrapy's selectors like xpath and css are powerful, there are cases where they cost too much effort.

Take, for example, some irregular HTML text like this:

 ['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',  
  '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']  

This is a sample result of calling the extract() method on a Scrapy selector. The parts we want here are the href and the link text.

We can:
  1. Do multiple Scrapy selector calls to get the data we need, or
  2. Do a single Scrapy selector call and process the result as XML
I went with #2. Dealing with HTML as XML should be relatively easy. Besides, Python already has a way of working with XML via the ElementTree XML API.

So the Python code to solve the problem is short and simple:

import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
....
    for link_item in our_raw_links:
        # each link_item is one of the extracted '<a ...>...</a>' strings
        root = ET.fromstring(link_item)
        # the href lives on the <a> element itself
        href = root.attrib['href']
        # the link text is inside the first child element (<b>)
        anchor_text = root[0].text

        cap_links.append({'text': anchor_text, 'href': href})
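
For anyone who wants to try it outside a spider, here is a minimal standalone sketch of the same idea, run over the sample anchor strings shown above (the list name raw_links is just for this demo):

import xml.etree.ElementTree as ET

# the sample anchor strings from the extract() output above
raw_links = [
    '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',
    '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',
    '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',
    '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>',
]

cap_links = []
for link_item in raw_links:
    root = ET.fromstring(link_item)                 # each snippet is well-formed XML
    cap_links.append({'text': root[0].text,         # text of the <b> child
                      'href': root.attrib['href']})

print(cap_links)

Note that this only works because each extracted snippet happens to be well-formed XML. If the markup had unclosed tags or bare & characters, ET.fromstring would raise a ParseError and you would be back to Scrapy selectors (or an HTML parser such as lxml.html).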

And Voila!

Tuesday, July 11, 2017

Splitting 40-character tokens or keys with dashes for readability

A good bit of programming is about mangling text.

Python 3 has a lot of fun language features you can use to process text. Here is a short Python 3 script that splits a 40-character token (or any text) into dash-separated chunks. The chunk size is one sixth of the text length, so a 40-character token yields six-character chunks plus a shorter final chunk for the remainder.


token = 'ad6d9c4e3fe09cfd24afdd62cc3705be02545272'

# floor division (//) avoids the int cast that Python 3's true division
# would need: int(len(token) / 6)
chunks, chunk_size = len(token), len(token) // 6

keyed_token = [token[i:i + chunk_size] for i in range(0, chunks, chunk_size)]

print('-'.join(keyed_token))


It should output: ad6d9c-4e3fe0-9cfd24-afdd62-cc3705-be0254-5272
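
If you need this in more than one place, it is easy to wrap in a small helper. This is just a sketch; the function name dash_split and its parameters are made up here:

def dash_split(text, segments=6, sep='-'):
    # chunk size is one nth of the text length; max(1, ...) guards
    # against texts shorter than the number of segments
    chunk_size = max(1, len(text) // segments)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return sep.join(chunks)

print(dash_split('ad6d9c4e3fe09cfd24afdd62cc3705be02545272'))
# prints: ad6d9c-4e3fe0-9cfd24-afdd62-cc3705-be0254-5272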

Done!