Wednesday, November 13, 2019

Testing for many values in a Python list

There will be times when you want to test whether several values are all members of a list.

Of course, we could solve this by writing a separate loop or membership check for every value we are testing. That's a nope.

Fortunately, Python provides a much cleaner solution:


>>> all(x in ['b', 'a', 'foo', 'bar'] for x in ['a', 'b'])
True

There are also other options:

>>> set(['a', 'b']).issubset(set(['a', 'b', 'foo', 'bar']))
True
>>> {'a', 'b'} <= {'a', 'b', 'foo', 'bar'}
True
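
If you do this check often, you could wrap one of the approaches in a small helper. A minimal sketch (the name contains_all is just something I made up for illustration):

>>> def contains_all(needles, haystack):
...     return set(needles) <= set(haystack)
...
>>> contains_all(['a', 'b'], ['b', 'a', 'foo', 'bar'])
True
>>> contains_all(['a', 'z'], ['b', 'a', 'foo', 'bar'])
False

Note that the set-based version only cares about presence, so duplicates in the values you're testing don't matter, and the items have to be hashable.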

Reference:

https://stackoverflow.com/questions/6159313/can-python-test-the-membership-of-multiple-values-in-a-list

Monday, October 28, 2019

Keeping a Forked Repo up to date

Keep your fork up to date by tracking the original "upstream" repo that you forked. To do this, you'll need to add a remote:

# Add 'upstream' repo to list of remotes
git remote add upstream https://github.com/UPSTREAM-USER/ORIGINAL-PROJECT.git

# Verify the new remote named 'upstream'
git remote -v


Whenever you want to update your fork with the latest upstream changes, you'll need to first fetch the upstream repo's branches and latest commits to bring them into your repository:

# Fetch from upstream remote
git fetch upstream

# View all branches, including those from upstream
git branch -va


Now, checkout your own master branch and merge the upstream repo's master branch:

# Checkout your master branch and merge upstream
git checkout master
git merge upstream/master


If there are no unique commits on the local master branch, git will simply perform a fast-forward. Now your local master branch is up to date with everything modified upstream.

Thursday, August 29, 2019

Handling non-well-formed HTML in Scrapy with BeautifulSoup

With Scrapy, we can deal with non-well-formed HTML in many ways. This is just one of them.

BeautifulSoup has a pretty nifty feature where it tries to fix bad HTML, such as adding missing tags. So if we put BeautifulSoup in the middle, whatever we get from a site is fixed before Scrapy parses it.

Fortunately, all we have to do is pip install Alecxe's scrapy-beautifulsoup middleware.

pip install scrapy-beautifulsoup

Then we configure Scrapy to use it from settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_beautifulsoup.middleware.BeautifulSoupMiddleware': 400
}

BeautifulSoup comes with a default parser named 'html.parser'. We can change it:

BEAUTIFULSOUP_PARSER = "html5lib"  # or BEAUTIFULSOUP_PARSER = "lxml"

html5lib is the better parser IMO, but it has to be installed separately.
 
pip install html5lib
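
To see the kind of repair BeautifulSoup does on its own, here's a quick sketch outside of Scrapy (assuming bs4 and html5lib are installed); the exact markup you get back depends on the parser you pick:

from bs4 import BeautifulSoup

broken = '<ul><li>one<li>two'              # missing closing tags
fixed = BeautifulSoup(broken, 'html5lib')
print(fixed.prettify())                    # closing tags (and html/body wrappers) get filled in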

Wednesday, June 26, 2019

Gracefully dealing with different SSH keys for different domains or accounts

This problem often happens when you mix your personal SSH keys with your company's and you work off a single workstation or laptop.

Even if the user and host are the same, they can still be distinguished in the ~/.ssh/config.

Host gitlab.com
  HostName git.company.com
  User git
  IdentityFile /home/whoever/.ssh/id_rsa.alice
  IdentitiesOnly yes

Host kitlab.com
  HostName git.company.com
  User git
  IdentityFile /home/whoever/.ssh/id_dsa.bob
  IdentitiesOnly yes


Then you can use gitlab.com and kitlab.com instead of the hostname in your git remote.

git remote add g-origin git@gitlab.com:whatever.git
git remote add k-origin git@kitlab.com:whatever.git


You probably want to include the option IdentitiesOnly yes to prevent the use of default ids.

Ref: https://blog.developer.atlassian.com/different-ssh-keys-multiple-bitbucket-accounts/

Tuesday, May 21, 2019

Using ElementTree (XML) with Scrapy to deal with hard-to-handle HTML

While Scrapy's selectors like xpath and css are powerful, there are cases where they cost too much effort.

An example is irregular HTML text like this:

 ['<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020s.html"><b>Typical Circuit</b></a>',  
  '<a href="https://www.blogger.com/u/1/misc/cst2020t.html"><b>Temperature vs Current</b></a>',  
  '<a href="http://www.blogger.com/Search.aspx?arg=somepart%20ic2020" target="_blank"><b>3D\xa0model</b></a>']  

This is a sample result of calling the extract() method on a Scrapy selector. The parts we want here are the href and the link text.

We can either:
  1. Do multiple Scrapy selector calls to get the data we need, or
  2. Do a single Scrapy selector call and process the result as XML

I went with #2. Dealing with HTML as XML should be relatively easy. Besides, Python already has a way of working with XML via the ElementTree XML API.

So the Python code to solve the problem is short and simple:

import xml.etree.ElementTree as ET
....
class MySpider(scrapy.Spider):
....
    # inside the spider's callback, after extracting the raw <a> strings
    cap_links = []
    for link_item in our_raw_links:
        # each item is a single well-formed <a> element, so it parses as XML
        root = ET.fromstring(link_item)
        href = root.attrib['href']
        anchor_text = root[0].text  # text of the nested <b> tag

        cap_links.append({'text': anchor_text, 'href': href})

And Voila!
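
If you want to try the ElementTree part on its own, outside of a spider, a standalone run against one of the sample anchors above looks like this:

import xml.etree.ElementTree as ET

link_item = '<a href="https://www.blogger.com/u/1/misc/cst2020d.html"><b>Dimensions</b></a>'
root = ET.fromstring(link_item)

print(root.attrib['href'])  # https://www.blogger.com/u/1/misc/cst2020d.html
print(root[0].text)         # Dimensions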

Wednesday, April 10, 2019

Troubleshooting Cmder Terminal in Visual Studio Code

If you didn't know, you can customize the terminal for VS Code. In my case, I wanted to use cmder because the stock Windows terminals suck; both of them, cmd and PowerShell.

Fortunately, cmder has a guide for integrating it into VS Code.

The interesting thing is that when I tried using it with my Python workflow, which includes virtual environments via Pipenv, I ran into a couple of problems.

Problem #1: Missing posh-git

Symptom: Spits out a warning telling you that you are missing posh-git: "Install-Module posh-git and restart cmder".

Fix #1: Run from an admin-level PowerShell: Install-Module posh-git

You might also get an error saying it overlaps with some other command, like 'TabExpansion'.

Fix #2: Either run Fix #1 with the -AllowClobber or -Force option, OR run it with a -Scope option: Install-Module posh-git -Scope CurrentUser

----

Problem #2: FunctionNotWritable (Cannot write to function prompt...)

Symptom: Spits out the error; also notice that your terminal prompt isn't prefixed by the activated virtual environment. This only happens when you're doing Python work with virtual environments.

Fix #1: Remove or comment out the -Options ReadOnly option in the cmder profile.ps1. Details here.


Monday, March 25, 2019

Azure Cloud, an annoying mess of random errors & warnings

Context: We recently moved to Azure from AWS. We did this for business reasons. It's been about 2 months since we moved.

This would have not been a big deal if Azure wasn't a buggy P.O.S mess.

A few examples:

1. Random errors when completing a merge request. What the fuck is Status Code 0: error error?

Status code 0: error error? #WTF
2. You use Git flow? Well MS says 'fuck you'. You can't complete a Pull Request from the command line. If you do a git flow feature finish from the CLI, Azure will mark your PR as abandoned, unlike the other services. AND THEY EVEN HAVE THE BALLS TO CALL IT A FEATURE!! F!*@*#!@#(*!!!!

3. Random Status code: 200 errors when using the Azure CLI to do stuff. But when you do the same thing with the SLOW AS ASS web app console, no error. If you're a devops guy, do you feel safe?

4. RESOLVING MERGE CONFLICTS IS ASS BACKWARDS!! If a PR is marked as having conflicts, you have to pull the PR branch and attempt a local merge to see them. You will not find the conflicts in the file viewer. WHAT. THE. FUCK.

The takeaway is: move to Azure only if the cost savings are significant, really significant; otherwise don't. Personally, I'd rather pour gasoline over my laptop and set it on fire than use Azure for my projects, EVEN IF IT WAS FREE.



Edit: Ironically, on the day I put up this post, I ran into #4.

Thursday, February 14, 2019

Random quirk with Windows, Git and an executable Bash script

The context of the problem: I was building a Docker container with a Linux image. It's to be deployed to a cloud service, and I'm working from a Windows machine. Of course, the container needs code to run. To facilitate this, I wrote a basic bash script, scrape.sh.

On Windows, the scrape.sh file already is executable
For disclosure, I'm working with Scrapy here.

From my Windows terminal, the scrape.sh script seemed to already be executable. I thought I was fine, so I did the whole git cycle of add, commit, and push.

The problem shows up when the container tries to run: it spits out an error saying it can't execute the scrape.sh script because it lacks execute rights. It seems that git committed my bash script as an ordinary file, not as an executable, despite what I saw in the Windows terminal. The fix, though, was quite easy.


$ git add --chmod=+x -- scrape.sh
$ git commit -m "scrape.sh is now executable"
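
To double-check what file mode git actually recorded (independent of what Windows shows you), you can ask git itself. A small sketch via subprocess, assuming the script is named scrape.sh as above:

import subprocess

# 100755 means git stored the file as executable, 100644 as a plain file
out = subprocess.check_output(['git', 'ls-files', '--stage', 'scrape.sh'])
print(out.decode().strip())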



Friday, January 4, 2019

Installing Scrapy on Windows

Data is the new oil, they say, and you want to start scraping sites for data. Fine! And since you are a Python developer, you'd want to use Scrapy. Unfortunately for you, you are a Windows user, and errors abound.

Anyhow, you run pip install scrapy and you run into an error:

Scrapy failing to pip install in Windows

The error here is with one of Scrapy's dependencies, Twisted. Fortunately, this is fixable.

The first thing we're going to do is download the "binary" wheel file for Twisted. There's a trick here though: YOU MUST DOWNLOAD THE CORRECT ONE THAT MATCHES YOUR PYTHON VERSION. So if you are on Python 3.6, you're looking for something that reads like Twisted-18.9.0-cp36-cp36m-win32.whl; if Python 3.7, then you're looking for something with cp37-cp37m. You get the idea.

The 32-bit vs 64-bit build probably also matters, but I didn't test it.
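
If you're not sure which wheel matches your interpreter, a quick check like this (just a sketch) prints the Python version and whether it's a 32-bit or 64-bit build:

import struct
import sys

print(sys.version_info[:2])      # e.g. (3, 6) -> grab a cp36 wheel
print(struct.calcsize("P") * 8)  # 32 -> win32 wheel, 64 -> win_amd64 wheel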

Once you have that file downloaded somewhere, you can then pip install it by passing the path to the .whl file.

Try installing Scrapy again and you should be good to go.

As an added bonus, this approach seems to work for other Python packages on Windows that need the Visual Studio C++ build tools, like mysql-python.

Tip for Pipenv users:

1. pipenv shell
2. pip install the downloaded Twisted wheel (as above)
3. pipenv sync