Thursday, August 29, 2019

Handling non-well-formed HTML in Scrapy with BeautifulSoup

With Scrapy, we can deal with non-well-formed HTML in many ways. This is just one of them.

BeautifulSoup has a pretty nifty feature where it tries to repair bad HTML, for example by adding missing closing tags. So if we put BeautifulSoup in the middle, whatever we get from a site is fixed before we parse it with Scrapy.
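To get a feel for the repair step on its own, here is a minimal sketch outside of Scrapy. The helper name repair_html and the sample markup are made up for illustration; the only real API used is the BeautifulSoup constructor with the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup


def repair_html(html, parser="html.parser"):
    """Parse possibly broken HTML and serialize the repaired tree back to a string."""
    return str(BeautifulSoup(html, parser))


broken = "<p>Hello <b>world"  # neither <p> nor <b> is closed
fixed = repair_html(broken)
print(fixed)  # the open tags are closed in the serialized output
```

Different parsers repair the same markup in different ways, so the exact output can change if you pass another parser name.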

Fortunately, all we have to do is pip install Alecxe's scrapy-beautifulsoup middleware.

pip install scrapy-beautifulsoup

Then we configure Scrapy to use it from settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

BeautifulSoup comes with a default parser named 'html.parser'. We can change it in settings.py:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is the better parser IMO, but it has to be installed separately.
 
pip install html5lib
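
If you're not sure which parsers are installed on a given machine, a small check like the one below can pick the first one that works. This is only a sketch: pick_parser is a hypothetical helper, not part of BeautifulSoup or the middleware, and the preference order is just my own. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def pick_parser(preferred=("html5lib", "lxml", "html.parser")):
    """Return the first parser name in `preferred` that BeautifulSoup can load."""
    for name in preferred:
        try:
            BeautifulSoup("", name)  # raises FeatureNotFound if not installed
        except FeatureNotFound:
            continue
        return name
    raise RuntimeError("none of the requested parsers is installed")


# Could go in settings.py; falls back to the built-in parser if nothing else is found.
BEAUTIFULSOUP_PARSER = pick_parser()
```

Since 'html.parser' ships with Python, the default preference list should always find something.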