This is a sample result of calling the extract() method on Scrapy's selector. The parts we want here are the href and the link text.
We can:
1. Do multiple Scrapy selector calls to get the data we need, or
2. Do a single Scrapy selector call and process the result as XML.
I went with #2. Dealing with HTML as XML should be relatively easy. Besides, Python already has a way of working with XML via the ElementTree XML API.
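For example, assuming one extracted link looks something like <a href="/caps/123"><span>Sample cap</span></a> (made-up markup, just to illustrate), ElementTree can pull both pieces out of it:

import xml.etree.ElementTree as ET

# Hypothetical anchor markup, standing in for one item of the extract() output
raw_link = '<a href="/caps/123"><span>Sample cap</span></a>'

root = ET.fromstring(raw_link)   # the <a> element becomes the root
print(root.attrib['href'])       # -> /caps/123
print(root[0].text)              # text of the first child (<span>) -> Sample cap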
So the Python code to solve the problem is short and simple:
import scrapy
import xml.etree.ElementTree as ET
....

class MySpider(scrapy.Spider):
    ....
        # our_raw_links holds the raw <a ...>...</a> strings returned by extract()
        for link_item in our_raw_links:
            root = ET.fromstring(link_item)    # parse the anchor markup as XML
            href = root.attrib['href']         # the link target
            anchor_text = root[0].text         # text of the anchor's first child element
            cap_links.append({'text': anchor_text, 'href': href})
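For comparison, option #1 would have meant one selector call per piece of data, something along these lines (the XPath here is illustrative, not the spider's actual expression):

# Rough sketch of option #1: a separate selector call for each field.
# The XPath is a made-up example; cap_links is the same list as above.
for link_sel in response.xpath('//div[@class="caps"]//a'):
    href = link_sel.xpath('@href').extract()[0]
    anchor_text = link_sel.xpath('.//text()').extract()[0]
    cap_links.append({'text': anchor_text, 'href': href})

It works, but you end up making two selector calls for every link, which is part of why the single call plus ElementTree felt cleaner.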
A student of mine asked me this question. He wanted to read an external XML file, parse it and load the data into an HTML form control. In his case, he wanted to populate a select component (combo box). And he didn't want to use any PHP, Java or whatever. So that leaves me with JavaScript and HTML. This intrigued me for a bit; I hadn't tried anything like this before.
So after a bit of research I stumbled upon jQuery.ajax. I am familiar with jQuery - I've been using it for a while, so this should be quick. So, we prep the HTML file for our display:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>Parsing Data from a File</title>
    <link rel="stylesheet" type="text/css" href="css/demo.css" />
    <script type="text/javascript" src="js/jquery-1.7.1.min.js"></script>
    <script type="text/javascript" src="js/demo.js"></script>
</head>
<body>
    <h2>Parsing Data from a File</h2>
    <p class="compress">The data will come from an external file. It will be formatted as an XML file
    (it could just as easily be an HTML file). It will be parsed using the jQuery library.
    We could do this with raw JavaScript, but why would you want to do that when
    there is a simpler solution?</p>
    <p>To learn in detail how this works, <a href="http://api.jquery.com/jQuery.ajax#options">read this</a>.</p>
    <div id="form-div">
        Our Combo box:
        <form id="form1" class="cmbox">
            <select class="combo1">
            </select>
        </form>
    </div>
</body>
</html>
And here is the demo.js file that goes with that page. If you run both now you'll get an error, because we still have to create a file called data.xml. Also notice in demo.js (refer to the comments) that it loads data.xml: if the file loads successfully it calls a function named "parse", and if it fails it calls the "loadfail" function.
$(document).ready(function(){ // wait for the DOM to be ready

    // called when the XML file cannot be read
    function loadfail(){
        alert("Error: Failed to read file!");
    }

    // called with the parsed XML document on success
    function parse(xml){
        $(xml).find("combo").each(function(){
            var optionLabel = $(this).find('text').text();
            var optionValue = $(this).find('value').text();
            // build an <option> for each <combo> entry and add it to the select
            $('.combo1').append(
                '<option value="' + optionValue + '">' + optionLabel + '</option>'
            );
        });
    }

    $.ajax({
        url: 'js/data.xml', // name of the file with our data
        dataType: 'xml',    // type of file we will be reading
        success: parse,     // function to call when the file has been read
        error: loadfail     // function to call when reading fails
    });
});
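The parse function looks for <combo> elements, each with a <text> and a <value> child, so data.xml has to follow that shape. A minimal example could look like this (the root element name and the sample entries are just placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<combos>
    <combo>
        <text>First option</text>
        <value>1</value>
    </combo>
    <combo>
        <text>Second option</text>
        <value>2</value>
    </combo>
</combos>

Save that as js/data.xml (the url given in the $.ajax call) and the select should fill itself in when the page loads.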
Do note that this will fail in Chrome because of its security model. If you look at the console, it will refuse to load local files, complaining that "Origin null is not allowed by Access-Control-Allow-Origin". That is the browser's same-origin policy objecting to XMLHttpRequests against file:// URLs; serving the page and data.xml from a local web server (even Python's built-in SimpleHTTPServer) makes the error go away.