Consider a list of irregular HTML fragments like this:
['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>', '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>', '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>', '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']
This is a sample result of calling the extract() method on a Scrapy selector. The parts we want from each item are the href and the link text.
We can either:
- Make multiple Scrapy selector calls to get the data we need, or
- Make a single Scrapy selector call and process each result as XML
Taking the second approach, the Python code to solve the problem is short and simple:
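import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
    ....
        # our_raw_links holds the strings returned by extract()
        for link_item in our_raw_links:
            # Each item is a small, well-formed <a> element, so it parses as XML
            root = ET.fromstring(link_item)
            href = root.attrib['href']
            # The first child is the <b> tag that wraps the link text
            anchor_text = root[0].text
            cap_links.append({'text': anchor_text, 'href': href})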
And voilà!
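For trying the idea outside a spider, here is a minimal standalone sketch using only the standard library. The names raw_links and parse_links are made up for illustration, and the sample data just mirrors the shape of the extract() output above. It assumes each fragment is well-formed XML. ET.fromstring raises ParseError otherwise, for example on a bare & inside an href, so the sketch skips such fragments rather than crashing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, shaped like the extract() output shown earlier.
raw_links = [
    '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',
    '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>',
]


def parse_links(fragments):
    """Return a list of {'text', 'href'} dicts, skipping unparseable fragments."""
    links = []
    for fragment in fragments:
        try:
            root = ET.fromstring(fragment)
        except ET.ParseError:
            # Not well-formed XML (for example a bare '&' in the href); skip it.
            continue
        # itertext() collects text from all nested tags, not only the first child.
        text = ''.join(root.itertext()).replace('\xa0', ' ').strip()
        links.append({'text': text, 'href': root.get('href')})
    return links


cap_links = parse_links(raw_links)
for link in cap_links:
    print(link['text'], '->', link['href'])
```

Using itertext() instead of root[0].text is a design choice for this sketch: it still works if a link has no <b> wrapper or nests its text more deeply. Replacing the non-breaking space (\xa0) with a regular space is optional and depends on how the text will be used later.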