Running ChemDataExtractor in AWS Lambda

by Afanasy Barbarov

Dealing with AWS Lambda restrictions

One day I needed to create a lambda function for extracting chemical data from texts, and I ran into problems almost immediately. The main one was that I needed to ship a library called ChemDataExtractor (compatible with Python 3.6) in a lambda layer. I couldn't just install it locally and copy it to the remote service, because I was using Mac OS X and AWS lambdas run on amazonlinux, so any compiled dependencies built on the Mac would not work there.

So I needed to run amazonlinux somehow to install the Python dependencies, then zip and ship them as a lambda layer. Unfortunately, chemdataextractor runs on Python 3.6 only, and Python 3.6 cannot simply be installed on amazonlinux via yum (the way Python 3.7 can, with yum install -y python37 && yum install -y python3-pip). The workaround is to build Python 3.6 from source. If you need something like that, first check what I already did: search for abarbarov/python36:latest on hub.docker.com. It will save you a lot of time (building Python from source on Docker Hub took about 90 minutes).

After that, I was ready to work on the main task - creating a lambda layer.

Here is what you need to reproduce it.

Pull the image and start a bash shell in a container:

docker run -it --name lambdalayer abarbarov/python36:latest bash

Then create a virtual environment and install chemdataextractor:

cd ~
python3 -m venv cde
source cde/bin/activate
pip install chemdataextractor

After that, create a config file for chemdataextractor that sets the path where its models will be stored, and point the corresponding environment variable at that file. By default, models are downloaded to the user's directory. Since we do not have access to lambda's underlying operating system, we need to download the models to a separate folder and ship them next to the Python modules.

mkdir -p /root/cdeconfig
echo 'data_dir: /root/cdemodels' > /root/cdeconfig/cdeconfig.yaml
export CHEMDATAEXTRACTOR_CONFIG=/root/cdeconfig/cdeconfig.yaml

Then download the models:

cde data download

The models will be downloaded into the /root/cdemodels directory.

The next tricky part is combining everything into a single zip package to ship as a lambda layer. Here are the steps I took:

  1. Install chemdataextractor once again, but this time into the ./python folder under the current directory:
pip install chemdataextractor -t ./python
  2. Move /root/cdemodels/ to ./python/cdemodels/.
  3. Change data_dir in /root/cdeconfig/cdeconfig.yaml to data_dir: /opt/python/cdemodels (layers are unpacked under /opt at runtime).
  4. Move that file to ./python/cdeconfig/cdeconfig.yaml.
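
Steps 2 to 4 can also be scripted. The following is only a sketch of one way to do it, not something the original setup used: the function name assemble_layer and its parameters are made up for illustration, and it rewrites the config from scratch instead of editing it in place, which assumes the config holds nothing but the data_dir entry shown above.

```python
import shutil
from pathlib import Path


def assemble_layer(models_dir, config_file, layer_root,
                   runtime_data_dir="/opt/python/cdemodels"):
    """Place the models and a runtime config under layer_root (./python).

    Hypothetical helper: all names here are illustrative assumptions.
    """
    layer_root = Path(layer_root)
    layer_root.mkdir(parents=True, exist_ok=True)

    # Step 2: move the downloaded models next to the installed modules.
    shutil.move(str(models_dir), str(layer_root / "cdemodels"))

    # Steps 3 and 4: write a config that points at the runtime location
    # of the models, directly into its final place inside the layer.
    target = layer_root / "cdeconfig" / "cdeconfig.yaml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("data_dir: {}\n".format(runtime_data_dir))

    # The build-time config is no longer needed once the new one exists.
    old_config = Path(config_file)
    if old_config.exists():
        old_config.unlink()
    return target
```

Inside the container it would be called as assemble_layer("/root/cdemodels", "/root/cdeconfig/cdeconfig.yaml", "/root/python").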

Now we have the ./python/ folder with chemdataextractor and dependent libraries, models and the config file. Let's zip it:

zip -r cde-layer-python36.zip ./python/

Now copy that zip out of the docker container. It's easy: open a new terminal and run

docker cp lambdalayer:/root/cde-layer-python36.zip ./

Note that you must run this from the host machine, not from inside the container.

The brutal truth about lambdas

Each lambda function has a hard size limit of 250 MB for its unzipped contents, including all layers. We are lucky, since our zip archive is 245 MB after extraction. Let's create a layer for chemdataextractor. The upload limit via the browser is 50 MB, so we need to upload the archive to an S3 bucket. After uploading, copy the link to the file and use it when creating the lambda layer.
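
With only a few megabytes of headroom, it is worth checking the extracted size before uploading. Here is a minimal sketch using the standard zipfile module. The helper names are hypothetical, and it assumes that "250 MB" means 250 * 1024 * 1024 bytes. Keep in mind that the function's own code and any other layers count toward the same limit.

```python
import zipfile

# Assumption: the 250 MB limit is counted in units of 1024 * 1024 bytes.
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024


def unzipped_size(zip_path):
    """Return the total uncompressed size of all archive members, in bytes."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_in_lambda(zip_path, limit=LAMBDA_UNZIPPED_LIMIT):
    """Return True if the extracted archive is within the given limit."""
    return unzipped_size(zip_path) <= limit
```

For the archive built above, fits_in_lambda("cde-layer-python36.zip") reports whether the layer alone stays within the limit.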

Next, create the lambda function and attach the layer created before. The lambda function may look like this one:

import json
from chemdataextractor import Document

def lambda_handler(event, context):
    # Adjacent string literals are joined into a single sentence.
    doc = Document('UV-vis spectrum of '
                   '5,10,15,20-Tetra(4-carboxyphenyl)porphyrin '
                   'in Tetrahydrofuran (THF).')

    return {
        'statusCode': 200,
        'body': json.dumps(doc.records[0].serialize())
    }
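
The handler above works on a hard-coded sentence and assumes that at least one record is found. A variant that reads the text from the incoming event could look like the sketch below. The "text" event key, make_handler and extract_records are illustrative names of my own choosing, not part of chemdataextractor or Lambda. The extraction function is passed in so that the handler logic can be tried locally without the layer.

```python
import json


def extract_records(text):
    """Run chemdataextractor on text and return serialized records."""
    # Imported lazily so this module still loads where the layer is absent.
    from chemdataextractor import Document
    return [record.serialize() for record in Document(text).records]


def make_handler(extract=extract_records):
    """Build a Lambda handler around the given extraction function."""
    def handler(event, context):
        # "text" is a hypothetical key; use whatever your events carry.
        text = (event or {}).get("text")
        if not text:
            return {"statusCode": 400,
                    "body": json.dumps({"error": "missing 'text'"})}
        # An empty list is a valid result when nothing is recognized.
        return {"statusCode": 200, "body": json.dumps(extract(text))}
    return handler


lambda_handler = make_handler()
```

Returning the whole list of records avoids an IndexError when a text contains no recognizable chemical data.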

If we just run the lambda with some fake test data, it will fail with an error similar to "Could not load models/punkt_chem-1.0.pickle. Have you run cde data download?".

This is easily fixed by setting the correct environment variable, namely CHEMDATAEXTRACTOR_CONFIG=/opt/python/cdeconfig/cdeconfig.yaml. Find the environment variables section on the function's configuration page and set it.
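
If the error persists, a small diagnostic can tell you which part of the chain is broken: the variable, the config file, or the models folder. This is a sketch under the assumption that the config is the flat, one-line file created earlier. It does not parse YAML in general, and check_cde_config is a hypothetical helper, not part of chemdataextractor.

```python
import os


def check_cde_config(environ=None):
    """Return a description of the first problem found, or None if all is well."""
    environ = os.environ if environ is None else environ
    path = environ.get("CHEMDATAEXTRACTOR_CONFIG")
    if not path:
        return "CHEMDATAEXTRACTOR_CONFIG is not set"
    if not os.path.isfile(path):
        return "config file not found: " + path

    # Naive parsing: look for a top-level "data_dir: <path>" line.
    data_dir = None
    with open(path) as config:
        for line in config:
            key, _, value = line.partition(":")
            if key.strip() == "data_dir":
                data_dir = value.strip()
    if not data_dir:
        return "no data_dir entry in " + path
    if not os.path.isdir(data_dir):
        return "data_dir does not exist: " + data_dir
    return None
```

Calling check_cde_config() at the top of the handler and logging the result makes a misconfigured layer easier to spot than the generic model-loading error.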


That's all, folks!

Written by Afanasy Barbarov, Tech Lead with 15+ years shipping production systems in Rust, Go, and TypeScript.
