DeFtunes Music Analytics: End-to-End Data Pipeline on AWS

Business Context

DeFtunes is a subscription-based music streaming company that added a new feature: users can now purchase and download songs.
Analytics stakeholders want to understand:

  • How purchases evolve over time
  • Which songs and artists drive revenue
  • How purchases vary by user attributes (e.g. location, timezone)

The project delivers an end-to-end data pipeline that makes this purchase data available in an analytics-ready schema with historical raw retention, daily updates, and automated data-quality checks.


High-Level Architecture

The solution follows a medallion architecture on AWS:

  • Bronze / Landing zone (S3) – raw data from APIs and PostgreSQL, stored with ingestion timestamps.
  • Silver / Transform zone (S3 Iceberg) – curated, schema-enforced tables with lineage metadata.
  • Gold / Serving zone (Redshift) – star-schema model optimized for analytical queries.

Key services:

  • Amazon S3 – central data lake (landing + transform zones)
  • AWS Glue ETL (Spark) – extraction and transformation jobs
  • Apache Iceberg – table format for transform-zone datasets
  • Amazon Redshift + Redshift Spectrum – data warehouse & external table access
  • PostgreSQL – source system for song catalog
  • Custom REST API – source system for sessions & users
  • Terraform – infrastructure-as-code for Glue jobs, S3 layout, Redshift, catalogs
  • dbt – modeling star schema inside Redshift
  • Airflow – orchestration of daily pipeline and data-quality checks

Data Sources

  1. Sessions API endpoint

    • JSON response containing:
      • session_id, user_id
      • Array of purchased items, each with song info, artist info, price, and like status.
    • Represents transactional purchase sessions over a time window (see the API sketch after this list).
  2. Users API endpoint

    • JSON response with:
      • First/last name
      • Location, timezone
      • Subscription start date
    • Used to build the user dimension and enrich sessions.
  3. PostgreSQL “songs” table

    • Stored under the deftunes schema.
    • Contains track ID, title, release, album, artist details, and release year.
    • Used as the source for song and artist dimensions.
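
A minimal sketch of how the two API endpoints might be queried. The base URL, endpoint paths, and the start/end date parameters are hypothetical stand-ins for the real DeFtunes API; the expected response shape follows the descriptions above.

```python
import json
from datetime import date

import requests

# Hypothetical base URL -- the real API details are project-specific.
API_BASE_URL = "https://api.deftunes.example.com/v1"


def fetch_sessions(start_date: date, end_date: date) -> list[dict]:
    """Fetch purchase sessions for a date window from the sessions endpoint."""
    response = requests.get(
        f"{API_BASE_URL}/sessions",
        params={"start_date": start_date.isoformat(), "end_date": end_date.isoformat()},
        timeout=30,
    )
    response.raise_for_status()
    # Expected shape (per the source description): session_id, user_id,
    # and an array of purchased items with song, artist, price, and like status.
    return response.json()


def fetch_users() -> list[dict]:
    """Fetch user profiles (name, location, timezone, subscription date)."""
    response = requests.get(f"{API_BASE_URL}/users", timeout=30)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    sessions = fetch_sessions(date(2025, 1, 1), date(2025, 1, 31))
    print(json.dumps(sessions[:1], indent=2))
```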

Data Modeling – Star Schema in Redshift

The star schema is designed at an atomic grain: one purchased song item within a session.

Fact table – fact_song_purchase

Each row represents one purchased item:

  • Keys

    • fact_key (surrogate key)
    • session_id
    • song_id
    • song_key (FK → dim_song)
    • artist_key (FK → dim_artist)
    • user_key (FK → dim_user)
    • session_start_time (FK → dim_time)
  • Measures

    • price_usd
    • liked (boolean)
    • liked_since (timestamp when the user liked the song)

Dimension tables

  • dim_song

    • Song ID, title, album, release year
    • Genre/other track-level attributes
  • dim_artist

    • Artist ID, name, country, other metadata
  • dim_user

    • User ID, first/last name
    • Location (city, country)
    • Timezone
    • Subscription date
  • dim_time

    • Purchase timestamp → date, hour, day of week, month, quarter, year

The model is built and maintained with dbt models and macros, enabling future extension and documentation.
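
As an illustration of the dim_time grain, here is a minimal sketch of deriving the calendar attributes from a purchase timestamp (the field names are illustrative, not the exact dbt output):

```python
from datetime import datetime


def derive_time_attributes(purchase_ts: datetime) -> dict:
    """Decompose a purchase timestamp into dim_time-style attributes."""
    return {
        "session_start_time": purchase_ts,
        "date": purchase_ts.date(),
        "hour": purchase_ts.hour,
        "day_of_week": purchase_ts.strftime("%A"),
        "month": purchase_ts.month,
        "quarter": (purchase_ts.month - 1) // 3 + 1,
        "year": purchase_ts.year,
    }


print(derive_time_attributes(datetime(2025, 3, 14, 18, 30)))
# {'session_start_time': datetime(2025, 3, 14, 18, 30), 'date': date(2025, 3, 14),
#  'hour': 18, 'day_of_week': 'Friday', 'month': 3, 'quarter': 1, 'year': 2025}
```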


Bronze Layer – Ingestion & Landing Zone (S3)

Raw data is ingested into S3 using AWS Glue ETL jobs defined with Terraform:

  • Glue job: extract_users_api

    • Calls the users API endpoint.
    • Writes raw JSON to:
      • s3://deftunes-lake/landing/users_ingestion_ts=<timestamp>/users.json
  • Glue job: extract_sessions_api

    • Calls the sessions API endpoint.
    • Writes raw JSON to:
      • s3://deftunes-lake/landing/sessions_ingestion_ts=<timestamp>/sessions.json
  • Glue job: extract_songs_postgres

    • Uses PostgreSQL JDBC connection to read deftunes.songs.
    • Writes CSV to:
      • s3://deftunes-lake/landing/songs_ingestion_ts=<timestamp>/songs.csv

Each file path includes an ingestion timestamp partition to preserve history and support reprocessing.
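
A minimal sketch of the extract pattern used by these jobs, shown for the users endpoint. The bucket and key layout follow the paths above; the API URL is a hypothetical placeholder.

```python
import json
from datetime import datetime, timezone

import boto3
import requests

# Hypothetical endpoint; the landing bucket/prefix follows the layout above.
USERS_API_URL = "https://api.deftunes.example.com/v1/users"
LANDING_BUCKET = "deftunes-lake"


def extract_users_api() -> str:
    """Pull raw users JSON and land it in S3 under an ingestion-timestamp partition."""
    ingestion_ts = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    payload = requests.get(USERS_API_URL, timeout=30)
    payload.raise_for_status()

    key = f"landing/users_ingestion_ts={ingestion_ts}/users.json"
    boto3.client("s3").put_object(
        Bucket=LANDING_BUCKET,
        Key=key,
        Body=json.dumps(payload.json()).encode("utf-8"),
    )
    return f"s3://{LANDING_BUCKET}/{key}"


if __name__ == "__main__":
    print(f"Landed raw users at {extract_users_api()}")
```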


Silver Layer – Transform Zone with Iceberg

The raw datasets are transformed and enriched with Glue/Spark and stored as Iceberg tables.

Transform job: Users & Sessions JSON → Iceberg

transform_json Glue job:

  • Reads raw users JSON into a Spark DataFrame.
  • Enriches the single raw location field into:
    • latitude
    • longitude
    • place_name
    • country_code
    • time_zone
  • Adds lineage columns:
    • ingestion_ts (when data landed)
    • processing_ts (when transform ran)
  • Writes to Iceberg at:
    • s3://deftunes-lake/transform/users/

For sessions (sketched after this list):

  • Reads raw sessions JSON into a DataFrame.
  • Explodes the session_items array so each purchased item becomes its own row.
  • Adds the same lineage columns (ingestion_ts, processing_ts).
  • Writes to Iceberg at:
    • s3://deftunes-lake/transform/sessions/
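
A minimal PySpark sketch of the sessions branch of this job. It assumes an Iceberg catalog named glue_catalog is already configured on the Spark session (see the catalog configuration sketch in the next subsection); the nested item field names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transform_json").getOrCreate()

# Read the raw sessions JSON landed by the extract job (path per the landing layout).
raw_sessions = spark.read.json(
    "s3://deftunes-lake/landing/sessions_ingestion_ts=2025-01-31_00-00-00/sessions.json"
)

sessions = (
    raw_sessions
    # One output row per purchased item in the session_items array.
    .withColumn("item", F.explode("session_items"))
    .select(
        "session_id",
        "user_id",
        F.col("item.song_id").alias("song_id"),      # illustrative nested field names
        F.col("item.artist_id").alias("artist_id"),
        F.col("item.price").alias("price_usd"),
        F.col("item.liked").alias("liked"),
    )
    # Lineage columns: when the data landed and when this transform ran.
    .withColumn("ingestion_ts", F.lit("2025-01-31_00-00-00"))
    .withColumn("processing_ts", F.current_timestamp())
)

# Write/overwrite the Iceberg table registered in the Glue Data Catalog.
sessions.writeTo("glue_catalog.transform_db.sessions").using("iceberg").createOrReplace()
```

The users branch is analogous, minus the explode step and plus the location enrichment columns.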

Transform job: Songs CSV → Iceberg

transform_songs Glue job (sketched after this list):

  • Reads songs.csv from the landing zone.
  • Performs schema enforcement (types, required fields).
  • Adds lineage columns:
    • ingestion_ts
    • source_db_name (e.g. deftunes_postgres)
  • Writes as Iceberg to:
    • s3://deftunes-lake/transform/songs/
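
A minimal sketch of the schema-enforcement step, using an illustrative subset of the songs columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = SparkSession.builder.appName("transform_songs").getOrCreate()

# Illustrative subset of the songs columns; the real table carries more attributes.
songs_schema = StructType([
    StructField("track_id", StringType(), nullable=False),
    StructField("title", StringType(), nullable=False),
    StructField("album", StringType(), nullable=True),
    StructField("artist_name", StringType(), nullable=True),
    StructField("year", IntegerType(), nullable=True),
])

songs = (
    spark.read.csv(
        "s3://deftunes-lake/landing/songs_ingestion_ts=2025-01-31_00-00-00/songs.csv",
        header=True,
        schema=songs_schema,                     # enforce types instead of inferring them
    )
    .filter(F.col("track_id").isNotNull())       # required-field check
    .withColumn("ingestion_ts", F.lit("2025-01-31_00-00-00"))
    .withColumn("source_db_name", F.lit("deftunes_postgres"))
)

songs.writeTo("glue_catalog.transform_db.songs").using("iceberg").createOrReplace()
```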

Glue Data Catalog & Transform DB

  • Creates a Glue Data Catalog database: transform_db.
  • Registers Iceberg tables:
    • transform_db.users
    • transform_db.sessions
    • transform_db.songs

These tables are then queryable via Redshift Spectrum as external tables.
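
A minimal sketch of how the Spark session in the transform jobs can be pointed at the Glue Data Catalog so the Iceberg tables above are registered under transform_db. The properties follow the standard Iceberg AWS integration; the catalog alias and warehouse path are illustrative.

```python
from pyspark.sql import SparkSession

# Standard Iceberg + AWS Glue catalog configuration; "glue_catalog" is just
# the alias used when writing tables (e.g. glue_catalog.transform_db.sessions).
spark = (
    SparkSession.builder.appName("transform_jobs")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://deftunes-lake/transform/")
    .getOrCreate()
)

# Once the transform jobs have run, the registered tables can be inspected directly:
spark.sql("SHOW TABLES IN glue_catalog.transform_db").show()
```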


Gold Layer – Serving Zone in Redshift (with dbt)

Using Terraform, the project configures:

  • Redshift provider with cluster credentials.
  • Schemas:
    • analytics_internal – internal schema storing dbt-managed star-schema tables.
    • analytics_external – external schema pointing to transform_db in Glue (for Spectrum).

An existing IAM role is attached to allow Redshift Spectrum to read from the S3 transform zone.
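
In the project this wiring is done with Terraform; for illustration only, the equivalent SQL (issued here through the redshift_connector Python driver, with placeholder credentials and a placeholder IAM role ARN) would look roughly like:

```python
import redshift_connector

# Placeholder connection details and IAM role ARN.
conn = redshift_connector.connect(
    host="deftunes-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="admin",
    password="...",
)

cursor = conn.cursor()
# Expose the Glue transform_db as an external schema readable via Spectrum.
cursor.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS analytics_external
    FROM DATA CATALOG
    DATABASE 'transform_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/deftunes-spectrum-role'
""")
conn.commit()
```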

Modeling with dbt

  • dbt models use external transform tables as sources:
    • source('transform_db', 'sessions'), source('transform_db', 'users'), source('transform_db', 'songs').
  • Build dim_* and fact_song_purchase tables inside analytics_internal.
  • Add:
    • Incremental strategies to support daily updates.
    • Tests for primary/foreign keys and non-null critical fields.
    • Documentation blocks that describe each model and column.
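
A minimal sketch of driving these models and tests from Python via the dbt CLI; the project directory is a placeholder, and the Redshift target comes from the dbt profile.

```python
import subprocess

# Placeholder project directory for the dbt star-schema project.
DBT_PROJECT_DIR = "dbt_modeling"


def run_dbt(command: str) -> None:
    """Invoke the dbt CLI for the project and fail loudly on errors."""
    subprocess.run(["dbt", command, "--project-dir", DBT_PROJECT_DIR], check=True)


run_dbt("run")   # build/refresh the incremental dim_* and fact_song_purchase models
run_dbt("test")  # run the key and not-null tests defined on the models
```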

Orchestration & Data Quality with Airflow

Airflow orchestrates the daily pipeline:

  1. Ingestion DAG

    • Triggers Glue extract jobs for:
      • users API
      • sessions API
      • PostgreSQL songs table
    • Waits for successful completion.
  2. Transform DAG

    • Runs Glue transform jobs for JSON and songs CSV.
    • Updates Iceberg tables in transform_db.
  3. Modeling DAG

    • Executes dbt run/test for the star schema in Redshift.
    • Refreshes analytical views.
  4. Data-quality checks

    • Airflow tasks and dbt tests validate:
      • Row counts above minimum thresholds.
      • No duplicate natural keys in dimensions.
      • No null foreign keys in the fact table.
      • Reasonable ranges for prices and timestamps.

Alerts can be configured for failures or data-quality test violations.
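
The project splits this flow across separate ingestion, transform, and modeling DAGs; as an illustration, here is a condensed single-DAG sketch using the Amazon provider's GlueJobOperator plus BashOperator for dbt. The Glue job names follow the sections above; the task layout and dbt working directory are assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.glue import GlueJobOperator

with DAG(
    dag_id="deftunes_daily_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule="@daily",   # Airflow 2.4+ style schedule argument
    catchup=False,
) as dag:
    # Ingestion: raw data into the S3 landing zone.
    extract_jobs = [
        GlueJobOperator(task_id=f"run_{job}", job_name=job, wait_for_completion=True)
        for job in ("extract_users_api", "extract_sessions_api", "extract_songs_postgres")
    ]

    # Transformation: landing zone -> Iceberg tables in transform_db.
    transform_json = GlueJobOperator(task_id="run_transform_json", job_name="transform_json")
    transform_songs = GlueJobOperator(task_id="run_transform_songs", job_name="transform_songs")

    # Modeling + data quality: dbt builds and tests the star schema in Redshift.
    dbt_run = BashOperator(task_id="dbt_run", bash_command="cd /opt/dbt_modeling && dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="cd /opt/dbt_modeling && dbt test")

    extract_jobs >> transform_json >> dbt_run >> dbt_test
    extract_jobs >> transform_songs >> dbt_run
```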


Analytics & Usage

With the star schema in place, analysts can easily answer questions such as:

  • Revenue and purchase counts by song, artist, and time.
  • Top-performing artists by country or city.
  • Cohort analysis of users by subscription date and purchase behavior.
  • Impact of likes (liked, liked_since) on purchasing.

Views and BI dashboards can be built directly on top of Redshift, using the fact and dimension tables.
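
For example, monthly revenue by artist reads straight off the star schema. A minimal sketch using the redshift_connector driver; connection details are placeholders, and the column names follow the model above but are illustrative.

```python
import redshift_connector

# Placeholder connection details for the Redshift cluster.
conn = redshift_connector.connect(
    host="deftunes-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="analyst",
    password="...",
)

cursor = conn.cursor()
cursor.execute("""
    SELECT
        a.artist_name,
        t.year,
        t.month,
        SUM(f.price_usd) AS revenue_usd,
        COUNT(*)         AS purchases
    FROM analytics_internal.fact_song_purchase f
    JOIN analytics_internal.dim_artist a ON f.artist_key = a.artist_key
    JOIN analytics_internal.dim_time   t ON f.session_start_time = t.session_start_time
    GROUP BY a.artist_name, t.year, t.month
    ORDER BY revenue_usd DESC
    LIMIT 10
""")

for artist, year, month, revenue, purchases in cursor.fetchall():
    print(f"{year}-{month:02d}  {artist}: ${revenue} across {purchases} purchases")
```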


Key Skills

  • Cloud & Infra: Amazon S3, AWS Glue, Amazon Redshift, Redshift Spectrum, Glue Data Catalog, IAM
  • Table Format: Apache Iceberg for transform-zone tables
  • Orchestration & IaC: Airflow, Terraform
  • Modeling: Star schema design, dbt models & tests, medallion architecture
  • Sources & Ingestion: REST APIs, PostgreSQL, JSON, CSV
  • Data Engineering: lineage metadata, schema enforcement, daily incremental loads, data quality