AWS Data Engineering: DynamoDB, RDS, EC2 & S3

Overview

This project explores how data is generated, stored, and accessed in cloud source systems before it enters a modern data pipeline. It combines:

  1. A NoSQL data layer on AWS DynamoDB using Python/Boto3.
  2. A relational data pipeline using RDS PostgreSQL, EC2 bastion hosts, and S3 for file-based ingestion.

The focus is on data modeling, connectivity, and security: VPCs, security groups, IAM permissions, and bulk data loading.


Architecture

  • AWS DynamoDB – key–value/document store for product & forum data.
  • AWS RDS (PostgreSQL) – managed relational database.
  • Amazon EC2 – bastion hosts used to connect securely to RDS.
  • Amazon S3 – object storage for CSV files used as ingestion source.
  • Python + Boto3 – automation and file transfers.
  • psql – CLI for interacting with PostgreSQL.

Part 1 – NoSQL Data Layer with AWS DynamoDB

I implemented a small NoSQL backend on AWS DynamoDB using Python and Boto3, running everything from a Jupyter notebook.

Data model

I modeled an e-commerce + forum system with four tables (a Boto3 sketch of one of the schemas follows this list):

  • ProductCatalog

    • Primary key: Id (partition key)
    • Stores product information for an online store.
  • Forum

    • Primary key: Name (partition key)
    • Tracks forums plus thread, message, and view counts.
  • Thread

    • Primary key: ForumName (partition key) + Subject (sort key)
    • Represents discussion threads within each forum.
  • Reply

    • Primary key: Id (partition key) + ReplyDateTime (sort key)
    • Stores individual replies, authors, and timestamps.
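
To make the composite-key design concrete, here is a minimal Boto3 sketch of creating the Thread table. The table and attribute names come from the model above; the region configuration and billing mode are assumptions.

```python
import boto3

# Assumes credentials and region are already configured (e.g. via `aws configure`).
dynamodb = boto3.resource("dynamodb")

# Thread uses a composite primary key: ForumName (partition) + Subject (sort).
thread_table = dynamodb.create_table(
    TableName="Thread",
    KeySchema=[
        {"AttributeName": "ForumName", "KeyType": "HASH"},   # partition key
        {"AttributeName": "Subject", "KeyType": "RANGE"},    # sort key
    ],
    AttributeDefinitions=[
        {"AttributeName": "ForumName", "AttributeType": "S"},
        {"AttributeName": "Subject", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",  # assumption; provisioned throughput also works
)

# Block until the table is ready before writing to it.
thread_table.wait_until_exists()
```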

Implementation

  • Programmatic table creation

    • Used create_table with AttributeDefinitions and KeySchema.
    • Wrapped calls in a helper function that accepts **kwargs for flexible configs.
  • Data loading

    • Parsed JSON sample files.
    • Inserted data using put_item and batch_write_item.
  • CRUD operations (condensed into the sketch after this list)

    • Create: Inserted single and batch items.
    • Read: Used get_item, scan, and query to:
      • Fetch products by ID.
      • List threads for a forum.
      • Retrieve replies for a specific thread.
    • Update: Used update_item to increment view and reply counts.
    • Delete: Removed items with delete_item.
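
The sketch below condenses the steps above: a **kwargs-based creation helper, single and batched writes, a query for the threads of a forum, an atomic counter update, and a delete. The helper name, item values, and counter attribute are illustrative assumptions, not the exact notebook code.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

def create_dynamo_table(**kwargs):
    """Thin wrapper around create_table so each table is defined by a kwargs dict."""
    table = dynamodb.create_table(**kwargs)
    table.wait_until_exists()
    return table

forum = dynamodb.Table("Forum")
thread = dynamodb.Table("Thread")
reply = dynamodb.Table("Reply")

# Create: single item, then batched inserts (batch_writer wraps batch_write_item).
forum.put_item(Item={"Name": "AWS Data Engineering", "Threads": 0, "Messages": 0, "Views": 0})
with thread.batch_writer() as batch:
    batch.put_item(Item={"ForumName": "AWS Data Engineering", "Subject": "DynamoDB modeling", "Replies": 0})
    batch.put_item(Item={"ForumName": "AWS Data Engineering", "Subject": "RDS vs DynamoDB", "Replies": 0})

# Read: list all threads in one forum via the partition key.
threads = thread.query(KeyConditionExpression=Key("ForumName").eq("AWS Data Engineering"))["Items"]

# Update: atomically increment a view counter on a forum.
forum.update_item(
    Key={"Name": "AWS Data Engineering"},
    UpdateExpression="SET #v = #v + :inc",
    ExpressionAttributeNames={"#v": "Views"},
    ExpressionAttributeValues={":inc": 1},
)

# Delete: remove a reply by its composite key (illustrative key values).
reply.delete_item(Key={"Id": "AWS Data Engineering#DynamoDB modeling", "ReplyDateTime": "2024-01-01T00:00:00Z"})
```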

Takeaways

  • NoSQL data modeling with simple and composite keys.
  • Understanding how partition keys and sort keys shape access patterns and scalability.
  • Practical use of Python + Boto3 to manage DynamoDB programmatically.

Part 2 – RDS PostgreSQL, EC2 Bastion Host & S3 CSV Ingestion

The second part of the project focuses on a classic ingestion pattern: loading data from S3 into RDS PostgreSQL from an EC2 environment, and solving the connectivity and permission issues that come up along the way.

Goals

  • Connect securely from an EC2 bastion host to an RDS PostgreSQL instance.
  • Create a ratings table using SQL DDL.
  • Download a CSV file from S3 to EC2 with Python/Boto3.
  • Load the CSV data into RDS using COPY.
  • Diagnose and fix common networking and IAM problems.

Environment

  • RDS PostgreSQL instance as the target database.
  • Two EC2 instances:
    • external-bastion-host (different VPC).
    • bastion-host (same VPC as RDS).
  • S3 bucket hosting the ratings CSV.
  • Tools: psql, Python, Boto3, AWS console.

Networking & security troubleshooting

I worked through several realistic failure modes:

  1. VPC mismatch

    • Initial connection from external-bastion-host silently failed.
    • Root cause: EC2 instance and RDS database were in different VPCs with no peering.
    • Solution: Switched to a bastion host EC2 instance deployed in the same VPC as the RDS instance.
  2. Security group configuration

    • Connection from the correct VPC still failed.
    • Root cause: RDS security group only allowed traffic from its own group; EC2 used a different security group.
    • Solution:
      • Edited the RDS security group inbound rules.
      • Opened port 5432 (PostgreSQL default).
      • Restricted the source to the EC2 instance’s security group rather than 0.0.0.0/0 (sketched after this list).
  3. Authentication failure

    • After fixing networking, received “password authentication failed”.
    • Root cause: outdated database password.
    • Solution: obtained updated credentials and reconnected successfully.
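
As a reference for the security-group fix in item 2, here is a Boto3 sketch that adds the inbound rule; the security group IDs are placeholders, and the same change can be made in the AWS console.

```python
import boto3

ec2 = boto3.client("ec2")

RDS_SG_ID = "sg-0123456789abcdef0"      # placeholder: security group attached to the RDS instance
BASTION_SG_ID = "sg-0fedcba9876543210"  # placeholder: security group attached to the EC2 bastion

# Open port 5432 on the RDS security group, but only for traffic whose source
# is the bastion's security group (no 0.0.0.0/0).
ec2.authorize_security_group_ingress(
    GroupId=RDS_SG_ID,
    IpPermissions=[
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "UserIdGroupPairs": [
                {"GroupId": BASTION_SG_ID, "Description": "PostgreSQL from bastion host"}
            ],
        }
    ],
)
```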

Building the ingestion flow

Once networking and security were resolved:

  • Created the table

    • Downloaded ratings_table_ddl.sql to EC2.
    • Connected with psql and ran \i ratings_table_ddl.sql to create the ratings table.
  • Downloaded the CSV from S3 (a minimal version of the script is sketched after this list)

    • Configured a Python script (download_from_s3.py) using Boto3.
    • First attempt failed due to S3 bucket policy denying GetObject.
  • Fixed S3 permissions

    • Inspected the bucket policy and replaced a “deny all” configuration.
    • Applied a new policy allowing s3:GetObject from the EC2 bastion’s IP / security context.
    • Re-ran the script and successfully downloaded the CSV into a local data/ folder.
  • Loaded data into PostgreSQL (see the load sketch after this list)

    • Used a SQL file containing a COPY command pointing to the CSV.
    • Executed it via psql to bulk load the data into ratings.
    • Verified the load with SELECT queries.
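
A minimal sketch of what download_from_s3.py does, assuming placeholder bucket and key names; the instance’s IAM role (or the bucket policy) must allow s3:GetObject for this to succeed.

```python
import os
import boto3

BUCKET = "my-ratings-bucket"     # placeholder bucket name
KEY = "ratings/ratings.csv"      # placeholder object key
LOCAL_PATH = "data/ratings.csv"

os.makedirs("data", exist_ok=True)

# Download the ratings CSV from S3 to the EC2 bastion host.
# Fails with AccessDenied until the bucket policy / IAM permissions allow s3:GetObject.
s3 = boto3.client("s3")
s3.download_file(BUCKET, KEY, LOCAL_PATH)
print(f"Downloaded s3://{BUCKET}/{KEY} to {LOCAL_PATH}")
```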
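
The project runs the COPY through psql; the sketch below shows an equivalent client-side load using psycopg2’s copy_expert, which streams the local CSV from the EC2 host to RDS (RDS cannot read the client filesystem, so a client-side COPY ... FROM STDIN, or psql’s \copy, is the variant that works here). Hostname, credentials, and CSV options are assumptions.

```python
import psycopg2

# Connection details are placeholders; in practice they come from environment
# variables or a secrets store rather than being hard-coded.
conn = psycopg2.connect(
    host="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
    port=5432,
    dbname="postgres",
    user="postgres",
    password="********",
)

copy_sql = "COPY ratings FROM STDIN WITH (FORMAT csv, HEADER true)"

with conn, conn.cursor() as cur:
    # Stream the local CSV from the EC2 host into the ratings table on RDS.
    with open("data/ratings.csv", "r") as f:
        cur.copy_expert(copy_sql, f)

    # Quick sanity check on the load.
    cur.execute("SELECT COUNT(*) FROM ratings")
    print("rows loaded:", cur.fetchone()[0])

conn.close()
```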

Takeaways

  • Designing and operating both NoSQL (DynamoDB) and relational (PostgreSQL) source systems on AWS.
  • Connecting services securely with:
    • VPCs, subnets, and security groups.
    • Correct database ports and least-privilege access.
  • Managing S3 bucket policies and IAM permissions for data ingestion.
  • Using psql and SQL COPY to bulk-load data from files into RDS.
  • Systematic troubleshooting of cloud connectivity and permission issues end-to-end.

Key Skills

  • AWS: DynamoDB, RDS (PostgreSQL), EC2, S3
  • Networking: VPCs, security groups, database ports
  • Security: IAM principles, S3 bucket policies, least-privilege access
  • Programming: Python, Boto3, SQL, psql
  • Data Engineering: source-system modeling, ingestion from object storage, troubleshooting data pipelines