Deploying a PySpark ETL Routine to Amazon EMR (POC)

What This ETL Does

This proof-of-concept demonstrates a simple Spark ETL job that processes 2016 stock market data:

INSERT OVERWRITE TABLE high_volume_stocks
SELECT ticker, the_date, open, high, low, close, vol
FROM `2016_stock_data`
WHERE vol > 250000

The job runs on Amazon EMR and stores results in S3, making them queryable via Amazon Athena.
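
In PySpark terms, the job boils down to a filter and a column selection over the 2016 dataset. The snippet below is a minimal sketch rather than the exact code in jobs/: the S3 paths, CSV input format, Parquet output format, and app name are assumptions to replace with the project's real values.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("high-volume-stocks-poc").getOrCreate()

# Read the 2016 stock data from S3 (path and CSV format are assumptions)
stocks = spark.read.csv(
    "s3://my-emr-etl-bucket-poc/input/2016_stock_data/",
    header=True,
    inferSchema=True,
)

# Keep only high-volume rows, mirroring the WHERE vol > 250000 filter above
high_volume = stocks.where(col("vol") > 250000).select(
    "ticker", "the_date", "open", "high", "low", "close", "vol"
)

# Overwrite the output location; Athena can query it through an external table
high_volume.write.mode("overwrite").parquet(
    "s3://my-emr-etl-bucket-poc/output/high_volume_stocks/"
)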

🛠️ AWS Setup Prerequisites

Before running this POC, you’ll need to set up the following AWS resources and configurations:

1. IAM Roles & Policies

EMR Service Role:

# Create EMR service role (if not exists)
aws iam create-role --role-name EMR_DefaultRole --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "elasticmapreduce.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}'

# Attach EMR service policy
aws iam attach-role-policy --role-name EMR_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole

EMR EC2 Instance Profile:

# Create EC2 role for EMR instances
aws iam create-role --role-name EMR_EC2_DefaultRole --assume-role-policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ec2.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}'

# Attach EC2 instance profile policy
aws iam attach-role-policy --role-name EMR_EC2_DefaultRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceforEC2Role

# Create instance profile
aws iam create-instance-profile --instance-profile-name EMR_EC2_DefaultRole
aws iam add-role-to-instance-profile --instance-profile-name EMR_EC2_DefaultRole --role-name EMR_EC2_DefaultRole

2. S3 Buckets

Create S3 buckets for code storage and logging:

# Code bucket (replace with your bucket name)
aws s3 mb s3://my-emr-etl-bucket-poc

# Logs bucket (replace with your account ID and region)
aws s3 mb s3://aws-logs-{ACCOUNT-ID}-{REGION}

3. VPC & Security Groups

Ensure you have:

- A VPC with at least one subnet where the EMR cluster can launch (the default VPC is fine for this POC), with a route to S3 (internet gateway, NAT, or S3 gateway endpoint) so the cluster can read the job code and write results and logs
- Security groups for the master and core/task nodes; EMR can create and manage its default ElasticMapReduce-master and ElasticMapReduce-slave groups automatically, or you can supply your own
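
A quick way to confirm this is to inspect the account with the AWS CLI; the commands below are read-only checks, and the VPC ID is a placeholder.

# Find the default VPC, then list its subnets (replace the VPC ID with the one returned)
aws ec2 describe-vpcs --filters Name=isDefault,Values=true
aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-xxxxxxxx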

4. GitHub Repository Secrets

Add these secrets to your GitHub repository (Settings → Secrets and variables → Actions). At a minimum, the workflows need credentials for calling AWS, typically:

- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- AWS_REGION (only if the workflows read the region from a secret instead of hard-coding it)
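
If you prefer the command line, the same secrets can be set with the GitHub CLI (assuming gh is installed and authenticated against this repository; the values are placeholders):

# Store the AWS credentials as repository secrets
gh secret set AWS_ACCESS_KEY_ID --body "<your-access-key-id>"
gh secret set AWS_SECRET_ACCESS_KEY --body "<your-secret-access-key>"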

5. Update Configuration Values

Update the following values in the workflow files to match your AWS environment:

- The code bucket name (my-emr-etl-bucket-poc in the examples above)
- The logs bucket name, i.e. your account ID and region in aws-logs-{ACCOUNT-ID}-{REGION}
- Any other environment-specific settings the workflows reference (for example, region, subnet, or instance type)
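
One way to find every spot that needs updating is a quick grep over the workflow and infrastructure files, assuming the placeholders appear literally as written above:

# Locate bucket, account, and region placeholders to replace
grep -rnE "my-emr-etl-bucket-poc|ACCOUNT-ID|REGION" .github/workflows/ infra/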

Repo Structure & File Descriptions

📂 .github/workflows/

GitHub Actions CI/CD pipelines for automated deployment and ETL execution.

📂 infra/

Terraform definitions for infrastructure.

📂 jobs/

Spark jobs (the actual ETL code you want to run).
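
Once a cluster is running, a job from this folder can be submitted as an EMR step. The example below is a sketch: the cluster ID is a placeholder, etl_job.py is a hypothetical script name, and the job file is assumed to have been uploaded to the code bucket.

# Submit a Spark step to a running EMR cluster (cluster ID and script name are placeholders)
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps 'Type=Spark,Name=high-volume-stocks-etl,ActionOnFailure=CONTINUE,Args=[s3://my-emr-etl-bucket-poc/jobs/etl_job.py]'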

📂 scripts/

Developer helper scripts (not production jobs).

📖 Project Publication

Visual Materials