Data mining with AWS Lambda

by Afanasy Barbarov

Scraping (data-mining) with Lambda functions

The other day I needed to check what I could get out of a PDF file... Things went smoothly this time. I just created a new Docker image with Python installed. This is the Dockerfile I used:

FROM amazonlinux:2.0.20191217.0

ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8

RUN yum install -y python37 && \
    yum install -y python3-pip && \
    yum install -y tar && \
    yum install -y gzip && \
    yum install -y zip && \
    yum clean all

RUN python3.7 -m pip install --upgrade pip && \
    python3.7 -m pip install virtualenv

Commands to build the image and open a new shell:

docker build -t abarbarov/python37:latest .
docker run -it --name lambdalayer abarbarov/python37:latest bash

Installing the pdfminer.six library inside the container and zipping it up is just as easy:

cd ~
python3.7 -m venv miner
source miner/bin/activate
pip install pdfminer.six -t ./python
zip -r miner-layer-python37.zip ./python/
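Before shipping the archive, it is worth a quick check of its layout: for Python runtimes, Lambda unpacks a layer under /opt and only picks up packages from the top-level python/ directory. Below is a minimal sketch of such a check using only the standard library. The function name entries_outside_python_dir is my own, not part of any AWS tooling.

```python
import zipfile


def entries_outside_python_dir(zip_path):
    """Return archive entries that do not live under the top-level python/ directory."""
    with zipfile.ZipFile(zip_path) as archive:
        return [name for name in archive.namelist() if not name.startswith("python/")]


# Example call (assumes the archive built above sits in the current directory):
#   entries_outside_python_dir("miner-layer-python37.zip")
# An empty list means everything is where a Python layer expects it.
```

If the list is not empty, the usual culprit is zipping from the wrong directory, so that the packages end up one level too deep or next to python/ instead of inside it.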

After all that, I copied the zip archive to the host machine and used it as a new layer for the Lambda function.
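I did the upload by hand, but the same step can be scripted. Here is a rough sketch with boto3, assuming AWS credentials and a region are already configured on the host. The layer name pdfminer-python37 and the description are placeholders of mine, and the runtime list should match whatever runtime your function actually uses.

```python
def layer_request(zip_path, layer_name="pdfminer-python37"):
    """Build keyword arguments for the Lambda PublishLayerVersion call."""
    with open(zip_path, "rb") as archive:
        content = archive.read()
    return {
        "LayerName": layer_name,  # hypothetical name, pick your own
        "Description": "pdfminer.six packaged for Python 3.7",
        "Content": {"ZipFile": content},  # raw bytes of the zip archive
        "CompatibleRuntimes": ["python3.7"],  # assumption: matches the image above
    }


def publish_layer(zip_path):
    """Publish the archive as a new layer version and return its ARN."""
    # Imported lazily so layer_request stays usable on machines without boto3.
    import boto3

    client = boto3.client("lambda")
    response = client.publish_layer_version(**layer_request(zip_path))
    return response["LayerVersionArn"]
```

Calling publish_layer("miner-layer-python37.zip") should return the ARN of the new layer version, which is what you attach to the function.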

In case anyone needs it, this is the snippet for copying the archive out of the container:

docker cp lambdalayer:/root/miner-layer-python37.zip .
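With the layer attached, a function can import pdfminer as if it were installed locally. The handler below is only a sketch of what that might look like, not the code I actually ran: it assumes the PDF arrives base64-encoded in a made-up event field called pdf_base64, and it returns a short preview of the extracted text.

```python
import base64
import io


def decode_pdf(event):
    """Return the raw PDF bytes from the hypothetical 'pdf_base64' event field."""
    return base64.b64decode(event["pdf_base64"])


def handler(event, context):
    # Imported here because pdfminer.six only exists where the layer is attached.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file path or a binary file-like object.
    text = extract_text(io.BytesIO(decode_pdf(event)))
    return {"characters": len(text), "preview": text[:200]}
```

In a real setup the PDF would more likely come from S3 than from the event payload, but the pdfminer part stays the same.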

That's all, folks!

Written by Afanasy Barbarov, Tech Lead with 15+ years shipping production systems in Rust, Go, and TypeScript.
