Running ChemDataExtractor in AWS Lambda
by Afanasy Barbarov
Dealing with AWS Lambda restrictions
One day I needed to create a Lambda function for extracting chemical data from texts, and I ran into problems almost immediately. The main one was that I needed to ship a library called ChemDataExtractor (compatible with Python 3.6) in a Lambda layer. I couldn't just install it locally and copy it to the remote service, because I was using Mac OS X and AWS Lambda functions run on Amazon Linux.
So I needed to run Amazon Linux somehow to install the Python dependencies, then zip them and ship them as a Lambda layer. Unfortunately, chemdataextractor runs on Python 3.6 only, which cannot simply be installed on Amazon Linux via yum (the way yum install -y python37 && yum install -y python3-pip works for Python 3.7). The workaround is to build Python 3.6 from source. If you need something like that, please check what I have already done first: search for abarbarov/python36:latest on hub.docker.com. It will save you a lot of time (building Python from source on Docker Hub took about 90 minutes).
After that, I was ready to work on the main task - creating a lambda layer.
And here is what is needed to reproduce that.
Pull the image and start a bash terminal in a container:
docker run -it --name lambdalayer abarbarov/python36:latest bash
Then create a virtual environment and install chemdataextractor:
cd ~
python3 -m venv cde
source cde/bin/activate
pip install chemdataextractor
After that, create a config file for chemdataextractor that sets the path where its models will be stored, and set the corresponding environment variable. By default, models are downloaded to the user's directory. Since we do not have access to Lambda's underlying operating system, we need to download the models to a separate folder and put them next to the Python modules.
mkdir -p /root/cdeconfig
echo 'data_dir: /root/cdemodels' > /root/cdeconfig/cdeconfig.yaml
export CHEMDATAEXTRACTOR_CONFIG=/root/cdeconfig/cdeconfig.yaml
After that, download the models:
cde data download
The models will be downloaded into the /root/cdemodels directory.
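The config step can also be scripted. Here is a minimal Python sketch, assuming the config file needs only the single data_dir key shown above; write_cde_config is a hypothetical helper name, not part of chemdataextractor.

```python
from pathlib import Path


def write_cde_config(config_path, data_dir):
    """Write a one-line chemdataextractor config and return its path as a string.

    Assumption: the config needs nothing but the data_dir key.
    """
    config_path = Path(config_path)
    # Create the parent folder first; the write fails if it does not exist.
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text("data_dir: {}\n".format(data_dir))
    return str(config_path)
```

Calling write_cde_config('/root/cdeconfig/cdeconfig.yaml', '/root/cdemodels') produces the same file as the echo command. The environment variable still has to be exported in the shell, because a variable set from Python is visible only to that process and its children.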
The next tricky part is combining everything into a single zip package to ship as a Lambda layer. Here are the steps I took:
- Install chemdataextractor once again, but this time into the ./python folder in the current directory:
pip install chemdataextractor -t ./python
- Move /root/cdemodels/ to the ./python/cdemodels/ folder.
- Change data_dir in /root/cdeconfig/cdeconfig.yaml to data_dir: /opt/python/cdemodels.
- Move that file to the path ./python/cdeconfig/cdeconfig.yaml.
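The last three steps above can be sketched in Python as follows. This is an illustration rather than the exact procedure I ran: assemble_layer is a hypothetical helper, and it rewrites the config from scratch on the assumption that data_dir is its only key.

```python
import shutil
from pathlib import Path


def assemble_layer(models_dir, config_path, layer_root,
                   runtime_data_dir="/opt/python/cdemodels"):
    """Move models and config under layer_root and repoint data_dir.

    layer_root is the ./python folder that pip installed the packages into.
    """
    layer_root = Path(layer_root)
    layer_root.mkdir(parents=True, exist_ok=True)

    # Move the downloaded models next to the Python modules.
    shutil.move(str(models_dir), str(layer_root / "cdemodels"))

    # Write the config with the path the models will have inside Lambda.
    new_config = layer_root / "cdeconfig" / "cdeconfig.yaml"
    new_config.parent.mkdir(parents=True, exist_ok=True)
    new_config.write_text("data_dir: {}\n".format(runtime_data_dir))

    # The original config is not needed once the new one is in place.
    Path(config_path).unlink()
    return new_config
```

With the paths from this post, the call would be assemble_layer('/root/cdemodels', '/root/cdeconfig/cdeconfig.yaml', './python').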
Now we have the ./python/ folder with chemdataextractor and dependent libraries, models and the config file. Let's zip it:
zip -r cde-layer-python36.zip ./python/
Now get that zip out of the Docker container. It's easy: open a new terminal and run
docker cp lambdalayer:/root/cde-layer-python36.zip ./
Note that you must run this from the host machine, not from inside the container.
The brutal truth about lambdas
Each Lambda function has a hard size limit of 250 MB, unzipped and including all layers. We are lucky, since our zip archive is 245 MB after extraction. Let's create a layer for chemdataextractor. The upload limit via the browser is 50 MB, so we need to upload the archive to an S3 bucket. After uploading, copy the link to the file and use it when creating the Lambda layer.
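Since the margin is that thin, it is worth checking the extracted size before uploading. Here is a small sketch using only the standard zipfile module; the function names are hypothetical, and the limit constant assumes that 250 MB means 250 * 1024 * 1024 bytes, so check the current AWS quotas for the exact figure.

```python
import zipfile

# Assumption: the 250 MB limit is counted as 250 * 1024 * 1024 bytes.
LAMBDA_UNZIPPED_LIMIT = 250 * 1024 * 1024


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of all entries in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_in_lambda(zip_path, limit=LAMBDA_UNZIPPED_LIMIT):
    """Return True if the extracted archive stays within the given limit."""
    return unzipped_size(zip_path) <= limit
```

Running fits_in_lambda('cde-layer-python36.zip') on the host gives a quick yes or no. Keep in mind that the function's own code and any other layers count against the same limit.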
Next, create the Lambda function and attach the layer created before. The Lambda function may look like this one:
import json
from chemdataextractor import Document

def lambda_handler(event, context):
    # Parse a fixed sample sentence; a real function would take the text from event.
    doc = Document(
        'UV-vis spectrum of '
        '5,10,15,20-Tetra(4-carboxyphenyl)porphyrin '
        'in Tetrahydrofuran (THF).'
    )
    return {
        'statusCode': 200,
        # Serialize the first extracted record as the JSON response body.
        'body': json.dumps(doc.records[0].serialize())
    }
If we just run the Lambda with some fake test data, it will fail with an error similar to "Could not load models/punkt_chem-1.0.pickle. Have you run cde data download?".
But that is easily fixed by setting the correct environment variable, namely CHEMDATAEXTRACTOR_CONFIG=/opt/python/cdeconfig/cdeconfig.yaml. Find the environment variables section on the function's page and set it there.
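Because the error message does not mention the config at all, a small check can make this failure easier to diagnose. The sketch below uses only the standard library; check_cde_config is a hypothetical helper, and it verifies only that the variable is set and points at an existing file, not that chemdataextractor can actually load the models.

```python
import os


def check_cde_config(environ=None):
    """Return a list of human-readable problems with the config setup."""
    environ = os.environ if environ is None else environ
    problems = []
    config_path = environ.get("CHEMDATAEXTRACTOR_CONFIG")
    if not config_path:
        problems.append("CHEMDATAEXTRACTOR_CONFIG is not set")
    elif not os.path.isfile(config_path):
        problems.append("config file not found: {}".format(config_path))
    return problems
```

Calling it at the top of lambda_handler and logging the result (an empty list means the setup looks fine) shows at a glance whether the variable was forgotten or the layer was not attached.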
That's all, folks!