Wednesday, November 25, 2020

Downtime Investigations: Pipenv vs Poetry

Python projects start with a virtual environment. It helps keep project dependencies organized and makes builds deterministic. In this area there are two maturing tools: Pipenv and Poetry. I've been a long-time user of Pipenv, but it's never a bad idea to see what the other side has to offer.

Handling Packages

Both Pipenv and Poetry organize project dependencies with production and development kept separate, much like the old requirements.txt / requirements-dev.txt split.

Pipenv uses a TOML file called Pipfile, while Poetry uses a similar TOML file called pyproject.toml. Interesting side note: unlike Pipfile, pyproject.toml follows PEP 518.

The difference here, I discovered, is that Pipfile is smaller in scope than pyproject.toml. The pyproject.toml file can also contain configuration for other tools that read it (black, isort, and so on) as well as the publishing metadata you need if you're going to put the project up on PyPI. With Pipenv you need a separate file, say setup.py, to publish your project/package.
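
For illustration, a minimal Poetry-flavored pyproject.toml might look something like this (the project name, versions, and the isort section are made up, just to show the wider scope):

[tool.poetry]
name = "myproject"
version = "0.1.0"
description = "Example project"
authors = ["Someone <someone@example.com>"]

[tool.poetry.dependencies]
python = "^3.8"
requests = "*"

[tool.poetry.dev-dependencies]
pytest = "*"

[tool.isort]
profile = "black"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"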

Adding dependencies

Pipenv's install command installs all dependencies if you don't specify a package. Poetry decided to have separate commands for adding a new dependency and installing existing ones. In effect, Poetry asks the dev to be more explicit about what they're trying to do.
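
For example, adding and installing pytest as a dev dependency (commands as of late 2020; the --dev flags put it in the dev section):

pipenv install pytest --dev     # Pipenv: one command both records and installs the package
pipenv install                  # ...and the same command with no package installs everything in the Pipfile

poetry add pytest --dev         # Poetry: add records the new dependency and installs it
poetry install                  # ...while install only syncs what's already declared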

Poetry also shows more info in the terminal than Pipenv, which is fairly spartan when installing something.

Pipenv installing pytest

Poetry installing pytest

Uninstalling dependencies

Another thing I discovered: Poetry uninstalls sub-dependencies; Pipenv does not. With Pipenv you have to do something like pipenv uninstall [some package] && pipenv clean to get what Poetry does with a single poetry remove [some package].

Wrap up

Both Pipenv and Poetry share the goal of making dependency management easier and project builds more consistent. Pipenv has broader support, but I think that's just because it's older than Poetry. Poetry has some interesting features, like keeping tool configs in its project TOML and uninstalling sub-dependencies.

At some point in the future, I think I'd like to try Poetry in my professional projects.

Thursday, September 3, 2020

Django-celery Error in Calling apply_async() - takes 1 positional argument but xx were given

This error confused me initially and the Celery documentation wasn't directly helpful. 

 Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>  
  File "/Users/jaypax/.local/share/virtualenvs/server-9an_1rEM/lib/python3.6/site-packages/celery/app/task.py", line 518, in apply_async  
   check_arguments(*(args or ()), **(kwargs or {}))  
 TypeError: run_scraper_one() takes 1 positional argument but 40 were given  

I called my task as run_scraper_one.apply_async(args=('keywords here'), countdown=5). 

The run_scraper_one() function is decorated with @shared_task, so this should work. Should. But after digging around (here and here), I figured out that args wants a list or a tuple. ('keywords here') is just a string wrapped in parentheses, not a tuple, so Celery unpacks the string and passes each character as a separate positional argument, which is where the weird argument count in the error comes from. 

So, the correct way to invoke the task is: 
run_scraper_one.apply_async(("keyword here",), countdown=5)
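
For context, here's a minimal sketch of how the task and the corrected call fit together (the task body is made up; the point is the trailing comma that turns args into a one-element tuple):

from celery import shared_task

@shared_task
def run_scraper_one(keywords):
    # ... run the actual scraper with the keywords string ...
    return keywords

# Wrong: ('keyword here') is just a parenthesized string, so Celery unpacks it
# character by character into positional arguments.
# run_scraper_one.apply_async(args=('keyword here'), countdown=5)

# Right: a one-element tuple (a list works too).
run_scraper_one.apply_async(('keyword here',), countdown=5)
run_scraper_one.apply_async(args=['keyword here'], countdown=5)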

Fixed.

Tuesday, March 31, 2020

Scraping a JS-heavy website with Scrapy and Selenium

Scrapy doesn't really like JavaScript-heavy websites, especially the ones that load the rest of the HTML via secondary requests made from JavaScript.

To overcome this you use either Splash or Selenium. Unfortunately, Splash is no longer supported. It still works, but moving forward it's going to be Selenium. 

The good news here is that Scrapy already supports Selenium via a middleware: scrapy-selenium.

Steps to use scrapy-selenium:

1. Download a Selenium driver. For example, for Firefox get geckodriver. I'm assuming you also have the browser itself installed. 

2. Add the following settings to the settings.py file:

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'path/to/gecko'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'path/to/firefox binary'
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

For example (on Windows):

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = 'c:\\Tools\\geckodriver.exe'
SELENIUM_BROWSER_EXECUTABLE_PATH = 'c:\\Program Files\\Mozilla Firefox\\firefox.exe'
SELENIUM_DRIVER_ARGUMENTS=['-headless']  # '--headless' if using chrome instead of firefox

3. In the spiders, just replace the Request() calls with SeleniumRequest().

4. Add a wait_until condition to the SeleniumRequest(). With the Selenium imports it needs, that would look like:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

SeleniumRequest(url=url,
                callback=self.parse,
                wait_time=5,
                wait_until=EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.search-results div.search-cell')))

In this one, we wait a maximum of 5 seconds, or until the element selected by that CSS class shows up in the rendered page. After that, we can use Scrapy selectors to find the things we want. A minimal spider sketch putting the pieces together is below.
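
This is a bare-bones sketch only; the URL and selectors are placeholders, and it assumes the scrapy_selenium.SeleniumMiddleware downloader middleware is enabled in settings.py as described in the scrapy-selenium README:

from scrapy import Spider
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class SearchSpider(Spider):
    name = 'search'

    def start_requests(self):
        # SeleniumRequest instead of scrapy.Request so the browser renders the JS first
        yield SeleniumRequest(
            url='https://example.com/search?q=python',
            callback=self.parse,
            wait_time=5,
            wait_until=EC.visibility_of_element_located(
                (By.CSS_SELECTOR, 'div.search-results div.search-cell')))

    def parse(self, response):
        # The response now contains the rendered HTML, so plain Scrapy selectors work
        for cell in response.css('div.search-results div.search-cell'):
            yield {'text': cell.css('::text').get()}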

So that's it.

Monday, February 10, 2020

Fixing that CosmosDB Error=2: The index path corresponding to the specified order-by item is excluded.

This bug needs three things:

  1. You're using Azure CosmosDB (I know, I don't like it either).
  2. You have a Mongoose query with a sort option against a ...
  3. ... field that's inside a sub-document.

The query option in question is `{ sort: req.query.order || '-metaData.inserted_at' }`. The metaData.inserted_at field is just a date field; metaData is a plain object with a couple of date fields tracking updates, deletes, and such. So when you submit the query, CosmosDB spits out the Error=2 response.

CAUTION: Azure CosmosDB has an emulator, but it isn't really helpful here. It will probably point you in a different direction. In my case, I was able to replicate the error on the emulator and found a 'fix' by unchecking the Provision Throughput option when creating the database. That didn't solve the problem on the server.

In fixing this, you have two options:
  1. Don't sort in your code.
  2. Create the missing index.

I went with option #2. A bit of a hassle. I treated CosmosDB as if it were a MongoDB equivalent and managed the index from the Mongo shell:

// list existing indexes
db.getCollection('collectionName').getIndexes();

// create a descending index on the sub-document date field
db.getCollection('collectionName').createIndex({'metaData.inserted_at': -1});

// shorter version
db.collectionName.createIndex({'metaData.inserted_at': -1});

It will be a pain if you create the wrong index, because you have to delete it and create it again.
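
If that happens, dropping the bad index from the shell should be something like this (same hypothetical collection name as above):

db.collectionName.dropIndex({'metaData.inserted_at': -1});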

References:
  • https://docs.mongodb.com/manual/tutorial/manage-indexes/#modify-an-index
  • https://docs.microsoft.com/en-us/azure/cosmos-db/index-overview