Write sample-extract-output.csv to s3://$BUCKET/sample-project/extract/$QUERY_DATE/${QUERY_DATE}-extract-output.csv
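The upload step above can be sketched with boto3; the helper builds the S3 key from the path in the spec, and `boto3` is an assumption about the project's AWS client (imported lazily so the key helper works without it installed):

```python
def extract_key(query_date: str) -> str:
    """Build the S3 key for a run's extract output (path taken from the spec above)."""
    return f"sample-project/extract/{query_date}/{query_date}-extract-output.csv"

def write_extract_output(local_path: str, bucket: str, query_date: str) -> None:
    """Upload the local extract CSV to the expected S3 location."""
    import boto3  # deferred so extract_key stays importable without boto3 installed
    boto3.client("s3").upload_file(local_path, bucket, extract_key(query_date))
```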
Transform
Implement a function for each transform operation in the provided target spec.
Create a test for each transform operation using pytest.
Once you have the transform functions tested, update the end-to-end script to:
Read s3://$BUCKET/sample-project/extract/$QUERY_DATE/${QUERY_DATE}-extract-output.csv into memory
Run the transformations
Write the transform output to s3://$BUCKET/sample-project/transform/$QUERY_DATE/${QUERY_DATE}-transform-output.csv
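The three steps above can be sketched as follows; `run_transforms` is a hypothetical stand-in for the transform functions, and boto3/pandas are imported lazily so the key helper is usable on its own:

```python
import io

def transform_keys(query_date: str) -> tuple[str, str]:
    """Return the (source, destination) S3 keys for the transform step."""
    src = f"sample-project/extract/{query_date}/{query_date}-extract-output.csv"
    dst = f"sample-project/transform/{query_date}/{query_date}-transform-output.csv"
    return src, dst

def transform_step(bucket: str, query_date: str, run_transforms) -> None:
    """Read the extract CSV from S3 into memory, apply run_transforms, write the result back."""
    import boto3          # deferred: see lead-in
    import pandas as pd   # deferred: see lead-in
    s3 = boto3.client("s3")
    src, dst = transform_keys(query_date)
    body = s3.get_object(Bucket=bucket, Key=src)["Body"].read()
    df = run_transforms(pd.read_csv(io.BytesIO(body)))
    s3.put_object(Bucket=bucket, Key=dst, Body=df.to_csv(index=False).encode())
```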
Load
Update the end-to-end script to copy s3://$BUCKET/sample-project/transform/$QUERY_DATE/${QUERY_DATE}-transform-output.csv to s3://$BUCKET/sample-project/load/$QUERY_DATE/${QUERY_DATE}.EMP.REC.LRF.csv
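Since the load step is just a copy within the same bucket, it can be done server-side; a sketch using boto3's copy_object (bucket layout as above, boto3 imported lazily so the key helper stands alone):

```python
def load_key(query_date: str) -> str:
    """Final file name under the load prefix, per the spec above."""
    return f"sample-project/load/{query_date}/{query_date}.EMP.REC.LRF.csv"

def load_step(bucket: str, query_date: str) -> None:
    """Copy the transform output to the load prefix without re-downloading it."""
    import boto3  # deferred so load_key stays importable without boto3 installed
    src = f"sample-project/transform/{query_date}/{query_date}-transform-output.csv"
    boto3.client("s3").copy_object(
        Bucket=bucket,
        Key=load_key(query_date),
        CopySource={"Bucket": bucket, "Key": src},
    )
```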
All tests for the project should be runnable from the project root-level via pytest.
Use the AWS CLI to create an AWS Lambda that runs the job.
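One way to do this with the AWS CLI is sketched below; the function name, role ARN, runtime, and zip file name are placeholders, not values from this project:

```shell
# Package the code, then create the function (names/ARNs are placeholders).
zip -r myetl-job.zip myetl/ scripts/lambda_function.py

aws lambda create-function \
  --function-name myetl-job \
  --runtime python3.9 \
  --role arn:aws:iam::123456789012:role/my-lambda-role \
  --handler lambda_function.lambda_handler \
  --zip-file fileb://myetl-job.zip \
  --timeout 300
```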
Create a simple but clear and complete README that documents:
How to build and run the job locally
How to run tests
How to deploy the job to AWS as a Lambda
Transform Spec
| Source Column | Transformation | Target Column |
| --- | --- | --- |
| full_name | Split on space, reverse, uppercase last name, join with a comma | employee_name |
| email | Direct map | email_address |
| address | Replace commas with spaces | home_address |
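The two non-trivial rows of the spec could look something like this (one plausible reading of the full_name rule; the real helper names in myetl/transform.py may differ):

```python
def to_employee_name(full_name: str) -> str:
    """full_name -> employee_name: split on space, reverse, uppercase the
    last name, join with a comma."""
    parts = full_name.split(" ")
    parts.reverse()              # surname now leads
    parts[0] = parts[0].upper()  # uppercase the last name
    return ", ".join(parts)

def to_home_address(address: str) -> str:
    """address -> home_address: replace commas with spaces."""
    return address.replace(",", " ")
```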
Building
All source code is located in the myetl directory
Every function in the .py files within myetl has a docstring, so you can use help() to learn how it works
scripts/lambda_function.py is the run-script used for deployment to AWS; it takes the place of scripts/call_cli_handler.py, which is used to run the code locally.
Each run-script imports only the functions marked as public API in their docstrings. These functions are extract.extract_function, transform.transform_function, and load.load_function. The helper functions within transform are not imported by the run-scripts.
Build the package by running pip install -r requirements.txt from the root directory; this runs setup.py and installs the package to your local machine.
Running Locally
scripts contains an end-to-end run-script named call_cli_handler.py
Typing python3 call_cli_handler.py -h will display a help message describing the required arguments, such as QUERY_DATE, which is passed in as -d YYYY-MM-DD. I've also provided options to point to a local directory for storing temporary files and to delete it when the run completes.
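A sketch of argument parsing along those lines; every flag name besides -d is an assumption about the actual script:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI arguments for the end-to-end run-script."""
    p = argparse.ArgumentParser(description="Run the ETL job end to end.")
    p.add_argument("-d", "--date", required=True, metavar="YYYY-MM-DD",
                   help="QUERY_DATE used to build the S3 paths")
    p.add_argument("-t", "--tmp-dir", default="/tmp/myetl",
                   help="local directory for temporary files")
    p.add_argument("--cleanup", action="store_true",
                   help="delete the temporary directory when the run completes")
    return p.parse_args(argv)
```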
Testing with pytest
From the root directory, type pytest myetl
Passing myetl is necessary because of the code stored in lib that's needed for deployment to AWS: the Amazon Linux-compatible builds of pandas, numpy, and pytz contain a few files and functions with 'test' in their names, and pytest tries to collect and run them, which goes rather poorly.
Pytest will test the two helper functions in myetl/transform.py using myetl/test_transform.py
email to email_address is a direct mapping, so no helper function was needed
Deploying to AWS Lambda
In scripts there are two shell scripts
bundle.sh zips up all necessary code and creates the Lambda using the AWS CLI
invoke_lambda.sh will invoke the Lambda, provided the user passes it one command-line argument, the date in YYYY-MM-DD format: invoke_lambda.sh YYYY-MM-DD
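invoke_lambda.sh presumably amounts to something like the following; the function name and payload shape are assumptions, not taken from the actual script:

```shell
#!/usr/bin/env bash
# Usage: invoke_lambda.sh YYYY-MM-DD
set -euo pipefail
QUERY_DATE="$1"

aws lambda invoke \
  --function-name myetl-job \
  --cli-binary-format raw-in-base64-out \
  --payload "{\"query_date\": \"${QUERY_DATE}\"}" \
  response.json
```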
Go check out s3://mybucket1219 to view the output
UPDATE: code doesn't run in the Lambda anymore
bundle.sh uses pip install -r to install all dependencies of myetl, which unfortunately installs a bunch of Mac-specific packages; these break on the Lambda. I reverted to using scripts/push2lambda.sh, but because AWS Lambda doesn't recognize myetl as a package, all import statements of the form from myetl import break.
If you want to try my code in a Lambda, revert to any prior commit in my repo; push2lambda.sh + invoke_lambda.sh will work.