Showing posts with label XML. Show all posts

Tuesday, May 21, 2019

Using ElementTree (XML) with Scrapy to deal with hard-to-handle HTML

While Scrapy's selectors such as xpath and css are powerful, there are some cases where they take too much effort.

An example with irregular HTML text like this:

 ['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',  
  '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']  

This is a sample result of calling the extract() method on a Scrapy selector. The parts we want are the href and the link text.

We can:
  1. Do multiple Scrapy selector calls to get the data we need or
  2. Do a single Scrapy selector call and process it via XML
I went with #2. Dealing with HTML as XML should be relatively easy, as long as the fragments are well-formed. Besides, Python already has a way of working with XML: the ElementTree XML API.

So the Python code to solve the problem is short and simple:

import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
....
    for link_item in our_raw_links:
        root = ET.fromstring(link_item)
        href = root.attrib['href']
        anchor_text = root[0].text

        cap_links.append({'text': anchor_text, 'href': href})

And Voila!
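For anyone who wants to try this outside a spider, here is a minimal self-contained sketch of the same idea. The `raw_links` list and the `parse_links` helper are names made up for this illustration, and the sample strings are two of the links from above. One caveat: `ET.fromstring` only accepts well-formed XML, so a fragment with, say, an unescaped `&` in the href raises `ET.ParseError`. The sketch simply skips those, which may or may not be what you want.

```python
import xml.etree.ElementTree as ET

# Sample data: two of the <a> fragments shown earlier (hypothetical variable name).
raw_links = [
    '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',
    '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',
]


def parse_links(fragments):
    """Return a list of {'text', 'href'} dicts, skipping fragments that are not well-formed XML."""
    links = []
    for fragment in fragments:
        try:
            root = ET.fromstring(fragment)
        except ET.ParseError:
            # Not valid XML (for example an unescaped '&'), so skip it.
            continue
        bold = root.find('b')
        # Fall back to the <a> element's own text when there is no <b> child.
        text = bold.text if bold is not None else root.text
        links.append({'text': text, 'href': root.get('href')})
    return links


cap_links = parse_links(raw_links)
for link in cap_links:
    print(link['text'], '->', link['href'])
```

Using `root.find('b')` instead of `root[0]` is just a small safety net for links that have no child element.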

Sunday, March 13, 2011

Using jQuery to read an external XML file

A student of mine asked me this question. He wanted to read an external XML file, parse it and load the data into an HTML form control. In his case, he wanted to populate a select component (combo box). And he didn't want to use any PHP, Java or whatever. So that left me with JavaScript and HTML. This intrigued me for a bit. I hadn't tried anything like this before.

So after a bit of research I stumbled upon jQuery.ajax. I am familiar with jQuery, having used it for a bit, so this should be quick. First, we prep the HTML file for our display:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <title>Parsing Data from a File</title>

        <link rel="stylesheet" type="text/css" href="css/demo.css" />
        <script type="text/javascript" src="js/jquery-1.7.1.min.js"></script>
        <script type="text/javascript" src="js/demo.js"></script>

    </head>
    <body>
        <h2>Parsing Data from a File</h2>
        <p class="compress">Data will come from an external file. It will be formatted as an XML file
            (this could just as easily be an HTML file). It will be parsed using the jQuery library.
            We could do this with raw JavaScript but why would you want to do that when
            there is a simpler solution?</p>

        <p>To learn in detail how this thing works <a href="http://api.jquery.com/jQuery.ajax#options">read this.</a></p>

        <div id="form-div">
        Our Combo box:
            <form id="form1" class="cmbox">
                <select class="combo1">
                </select>
            </form>
        </div>       
    </body>
</html> 
And here is the demo.js file that goes with that page. If you run both now you'll get an error, because we still have to create a file called data.xml (the script looks for it at js/data.xml). As the comments in demo.js show, the script tries to load data.xml: if the file loads successfully it calls a function called "parse", and if it fails it calls the "loadfail" function.
$(document).ready(function(){       // run once the page has loaded
 function loadfail(){
  alert("Error: Failed to read file!");
 }
 
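 function parse(xml){                // xml is the parsed data.xml document
  $(xml).find("combo").each(function(){
     var optionLabel = $(this).find('text').text();
     var optionValue = $(this).find('value').text();
     // add one option to the select for each combo entry
     $('.combo1').append(
    $('<option>').attr('value', optionValue).text(optionLabel)
     );
  });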
 }
 
 $.ajax({
  url: 'js/data.xml',    // name of file with our data
  dataType: 'xml',    // type of file we will be reading
  success: parse,     // name of function to call when done reading file
  error: loadfail     // name of function to call when failed to read
 });
});
Here is the data.xml file.
<formdata>
    <combo>
        <value>1</value>
        <text>Option 1</text>
    </combo>
    <combo>
        <value>2</value>
        <text>Option 2</text>
    </combo>
    <combo>
        <value>3</value>
        <text>Option 3</text>
    </combo>
    <combo>
        <value>4</value>
        <text>Option 4</text>
    </combo>
    <combo>
        <value>5</value>
        <text>Option 5</text>
    </combo>
</formdata>
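
As a quick sanity check of the data outside the browser, the same lookup the `parse` function does (find each `combo`, read its `value` and `text`) can be sketched in Python with ElementTree. This is only an illustration: the inline `DATA_XML` string is a shortened copy of the file above, and `read_options` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of data.xml, kept inline so the sketch is self-contained.
DATA_XML = """
<formdata>
    <combo><value>1</value><text>Option 1</text></combo>
    <combo><value>2</value><text>Option 2</text></combo>
</formdata>
"""


def read_options(xml_text):
    """Return (value, label) pairs, one per <combo> element."""
    root = ET.fromstring(xml_text.strip())
    return [(combo.findtext('value'), combo.findtext('text'))
            for combo in root.iter('combo')]


options = read_options(DATA_XML)
for value, label in options:
    print(value, label)
```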

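Do note that this will fail in Chrome when you open the page straight from disk, because of its security model. If you look at the console, Chrome refuses to load the local file with the message "Origin null is not allowed by Access-Control-Allow-Origin". In short, a page opened from a file:// URL has no real origin, so Chrome blocks its Ajax requests. Serving the same files from a local web server avoids the problem.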
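
One way around Chrome's restriction on pages opened from disk is to serve the demo folder over HTTP on localhost, so the page gets a real origin. Below is a minimal sketch using only Python's standard library, assuming Python 3.7 or newer. The `make_server` and `fetch_once` helpers and the throwaway folder are assumptions made for this illustration, not part of the original demo. It briefly serves a folder, fetches js/data.xml over HTTP, and shuts down again.

```python
import functools
import http.client
import http.server
import pathlib
import tempfile
import threading


def make_server(directory, port=0):
    """Create an HTTP server for `directory` on localhost (port 0 = any free port)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(directory)
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def fetch_once(directory, relative_url):
    """Serve `directory` briefly, fetch one file over HTTP, then shut down."""
    server = make_server(directory)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        port = server.server_address[1]
        conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
        conn.request("GET", "/" + relative_url)
        response = conn.getresponse()
        body = response.read().decode("utf-8")
        conn.close()
        return response.status, body
    finally:
        server.shutdown()
        server.server_close()


# Demo with a throwaway folder standing in for the real demo folder.
with tempfile.TemporaryDirectory() as tmp:
    js_dir = pathlib.Path(tmp, "js")
    js_dir.mkdir()
    (js_dir / "data.xml").write_text("<formdata></formdata>", encoding="utf-8")
    status, body = fetch_once(tmp, "js/data.xml")
    print(status, body)
```

To keep a server running for the actual demo, call `make_server(".", 8000).serve_forever()` from the demo folder, or simply run `python -m http.server 8000` there, then open the page through `http://127.0.0.1:8000/` instead of double-clicking the HTML file.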