Data mining with AWS Lambda
by Afanasy Barbarov
Scraping (data mining) with Lambda functions
The other day I needed to check what I could get out of a PDF file... Things went smoothly this time. I just created a new Docker image with Python installed. This is the Dockerfile I used:
FROM amazonlinux:2.0.20191217.0
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
RUN yum install -y python37 && \
    yum install -y python3-pip && \
    yum install -y tar && \
    yum install -y gzip && \
    yum install -y zip && \
    yum clean all
RUN python3.7 -m pip install --upgrade pip && \
    python3.7 -m pip install virtualenv
Commands to build the image and open a new shell:
docker build -t abarbarov/python37:latest .
docker run -it --name lambdalayer abarbarov/python37:latest bash
Installing the PDF miner library is also easy:
cd ~
python3.7 -m venv miner
source miner/bin/activate
pip install pdfminer.six -t ./python
zip -r miner-layer-python37.zip ./python/
After all that I copied the zip archive to the host machine and used it as a new layer for the Lambda function.
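To give an idea of what the function side might look like once the layer is attached, here is a minimal sketch. It assumes the layer puts pdfminer.six on the import path (Lambda unpacks layers under /opt, and /opt/python is searched for Python packages). The event shape with a `pdf_base64` field and the helper name `extract_pdf_text` are made up for illustration and are not something from my actual function.

```python
import base64
import io


def extract_pdf_text(pdf_bytes, extract=None):
    """Return the text of a PDF given as raw bytes."""
    if extract is None:
        # Imported lazily: pdfminer.six is only importable where the layer is attached.
        from pdfminer.high_level import extract_text as extract
    # extract_text accepts a path or a binary file-like object.
    return extract(io.BytesIO(pdf_bytes))


def lambda_handler(event, context):
    # Hypothetical event shape: {"pdf_base64": "<base64-encoded PDF>"}
    pdf_bytes = base64.b64decode(event["pdf_base64"])
    text = extract_pdf_text(pdf_bytes)
    # Return a short summary rather than the whole document.
    return {"characters": len(text), "preview": text[:200]}
```

The `extract` parameter is only there so the helper can be tried locally with a stand-in function when pdfminer.six is not installed.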
In case you need to copy the archive out of the container, this is the snippet:
docker cp lambdalayer:/root/miner-layer-python37.zip .
That's all, folks!