This proof-of-concept demonstrates a simple Spark ETL job that processes 2016 stock market data:
INSERT OVERWRITE TABLE high_volume_stocks
SELECT ticker, the_date, open, high, low, close, vol
FROM `2016_stock_data`
WHERE vol > 250000
The job runs on Amazon EMR and stores results in S3, making them queryable via Amazon Athena.
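Once the job has written its output and the table is registered in the Athena catalog, the results can be inspected from the AWS CLI. The sketch below is illustrative only: the database name, query-results location, and region are assumptions you should replace with your own values (the bucket and region reuse the examples later in this document).
# Illustrative only: run an ad-hoc Athena query against the results table
# (database name, output location, and region are placeholders)
aws athena start-query-execution \
  --query-string "SELECT ticker, the_date, close, vol FROM high_volume_stocks LIMIT 10" \
  --query-execution-context Database=default \
  --result-configuration OutputLocation=s3://my-emr-etl-bucket-poc/athena-results/ \
  --region us-west-2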
Before running this POC, you’ll need to set up the following AWS resources and configurations:
EMR Service Role:
# Create EMR service role (if not exists)
aws iam create-role --role-name EMR_DefaultRole --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "elasticmapreduce.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}'
# Attach EMR service policy
aws iam attach-role-policy --role-name EMR_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole
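If you want to confirm the service role is in place before moving on, these optional checks can be run (nothing in the POC depends on them):
# Optional: verify the service role exists and has the policy attached
aws iam get-role --role-name EMR_DefaultRole
aws iam list-attached-role-policies --role-name EMR_DefaultRole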
EMR EC2 Instance Profile:
# Create EC2 role for EMR instances
aws iam create-role --role-name EMR_EC2_DefaultRole --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}'
# Attach EC2 instance profile policy
aws iam attach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role
# Create instance profile
aws iam create-instance-profile --instance-profile-name EMR_EC2_DefaultRole
aws iam add-role-to-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole
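As with the service role, an optional check that the instance profile was created and contains the EC2 role:
# Optional: verify the instance profile and its role
aws iam get-instance-profile --instance-profile-name EMR_EC2_DefaultRole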
Create S3 buckets for code storage and logging:
# Code bucket (replace with your bucket name)
aws s3 mb s3://my-emr-etl-bucket-poc
# Logs bucket (replace with your account ID and region)
aws s3 mb s3://aws-logs-{ACCOUNT-ID}-{REGION}
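If you want to stage the Spark job code manually before wiring up the CI/CD workflows, here is a sketch using the example code bucket above; the ./jobs/ path mirrors what deploy.yml syncs, so adjust both to your setup:
# Stage the ETL job code in the code bucket (adjust bucket name and path)
aws s3 sync ./jobs/ s3://my-emr-etl-bucket-poc/jobs/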
Ensure you have:
Add these secrets to your GitHub repository (Settings → Secrets and variables → Actions):
AWS_ACCESS_KEY_ID - Your AWS access key
AWS_SECRET_ACCESS_KEY - Your AWS secret access key
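The same secrets can also be added from the command line with the GitHub CLI, assuming gh is installed and authenticated against this repository (a sketch, not a required step):
# Optional: set the repository secrets with the GitHub CLI
gh secret set AWS_ACCESS_KEY_ID --body "<your-access-key-id>"
gh secret set AWS_SECRET_ACCESS_KEY --body "<your-secret-access-key>"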
Update the following values in the workflow files to match your AWS environment:
146121144646 with your AWS account ID
subnet-06ea10d2d2e7afb5f with your subnet
sg-* values with your security groups
us-west-2 or change to your preferred region
GitHub Actions CI/CD pipelines for automated deployment and ETL execution:
deploy.yml → Automatic deployment triggered on pushes to main. Syncs ETL job code from ./jobs/ to S3 using secure OIDC authentication. Keeps your S3 bucket updated with the latest code changes.
run-emr-etl.yml → Manual workflow for complete ETL pipeline execution. Creates transient EMR cluster, uploads and runs the Spark job, waits for completion, then exports cluster logs and artifacts. Best for production-like ETL testing.
terraform-emr.yml → Manual infrastructure deployment using Terraform. Creates persistent EMR cluster with your job pre-loaded. Useful for development environments where you want a long-running cluster.
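The two manual workflows can be started from the repository's Actions tab or, as a convenience, with the GitHub CLI (assuming gh is authenticated; both run against the default branch unless you pass --ref):
# Trigger the transient-cluster ETL pipeline
gh workflow run run-emr-etl.yml
# Trigger the Terraform-managed persistent cluster deployment
gh workflow run terraform-emr.yml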
Terraform definitions for infrastructure.
Spark jobs (the actual ETL code you want to run).
Developer helper scripts (not production jobs).
docs/Architecture Overview.jpg - High-level system architecture diagram
docs/MLOps_EMR_PySpark_Presentation.pdf - Complete project presentation slides