COURSE PROJECT

Summary

The course project is a semester-long team project in which you will design and implement a data engineering pipeline. The goal is to ingest data from one or more sources, transform it into a well-structured format, and make it usable for analysis or downstream consumption.

You will apply the tools, concepts, and techniques covered in this course, including data ingestion, relational modeling, transformation, querying, and data serving.

Projects are completed in groups of 2–3 students.


Goals

By completing this project, you will:

  • Design an end-to-end data pipeline
  • Make informed decisions about data ingestion and storage
  • Apply relational modeling principles (preferably 3NF)
  • Implement reproducible data transformations
  • Communicate technical decisions clearly through documentation and presentation

Team Formation

Teams will be formed early in the semester through an in-class Matchmaking Workshop.

This activity replaces traditional self-selection and is designed to:

  • Expose you to multiple project ideas
  • Encourage balanced teams
  • Help you find collaborators with complementary skills and interests

Active participation is required to receive credit.


Deliverables and Schedule

The project consists of multiple staged deliverables spread across the 15-week semester. Documentation should be iterative, meaning each milestone builds directly on and refines the original project proposal.

Schedule Overview

Component                 Week       Weight
Matchmaking Workshop      Week 3     5%
Proposal                  Week 5     15%
Milestone Checkpoint 1    Week 8     15%
Milestone Checkpoint 2    Week 11    15%
Presentation              Week 14    20%
Final Write-Up            Week 15    30%
Total                                100%

Deliverables

I. Matchmaking Workshop (5%)

Planned: Week 3 (in class)

You are expected to:

  • Actively participate in structured discussions
  • Pitch a preliminary project idea
  • Engage with other students’ ideas
  • Identify potential teammates

Credit is based on participation and professionalism.


II. Proposal (15%)

Due: Week 5

Your proposal serves as the foundation for all future milestones. This document will be revised and expanded throughout the semester.

Required Sections

1. Research Question

Clearly state your research question. It should be:

  • Well-defined
  • Answerable using data
  • Feasible within the scope of the course

2. Data Collection Methods (Ingestion Phase)

  • Identify your data source(s), such as:
    • Public APIs
    • Web scraping
    • Open datasets
  • Provide links where applicable
  • Describe how data will be ingested
  • Explain any planned automation or scheduling

3. Data Transformation Process

  • Describe how raw data will be structured
  • Outline your proposed database schema
  • Aim for third normal form (3NF) where appropriate
  • Explain cleaning, preprocessing, and transformation steps

4. Data Serving and Querying

  • Describe how data will be accessed
  • Possible outputs include:
    • SQL queries
    • Dashboards
    • APIs
  • Justify your approach
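
To make the three phases above concrete, here is a minimal sketch of what a very small pipeline could look like. It is only an illustration, not a required design: the weather-station data, the table and column names, and the helper functions (clean, load, average_temp_by_city) are all hypothetical, and SQLite is used simply because it ships with Python. Your own project may use different sources, tools, and schemas.

```python
import sqlite3

# Hypothetical raw records, standing in for rows pulled from an API or a CSV file.
RAW_ROWS = [
    {"station": "ST01", "city": "Austin", "date": "2024-01-01", "temp_c": "12.5"},
    {"station": "ST01", "city": "Austin", "date": "2024-01-02", "temp_c": "14.0"},
    {"station": "ST02", "city": "Boston", "date": "2024-01-01", "temp_c": " -1.5 "},
    {"station": "ST02", "city": "Boston", "date": "2024-01-01", "temp_c": " -1.5 "},  # duplicate row
    {"station": "ST02", "city": "Boston", "date": "2024-01-02", "temp_c": ""},        # missing value
]

# Assumed schema: city depends only on the station, so it lives in its own
# table instead of being repeated on every observation.
SCHEMA = """
CREATE TABLE IF NOT EXISTS station (
    station_id TEXT PRIMARY KEY,
    city       TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS observation (
    station_id TEXT NOT NULL REFERENCES station(station_id),
    obs_date   TEXT NOT NULL,
    temp_c     REAL NOT NULL,
    PRIMARY KEY (station_id, obs_date)
);
"""


def clean(row):
    """Return a typed (station, city, date, temp) tuple, or None if the row is unusable."""
    temp = row["temp_c"].strip()
    if not temp:
        return None  # drop rows with no temperature reading
    return row["station"].strip(), row["city"].strip(), row["date"], float(temp)


def load(conn, rows):
    """Create the tables if needed and load cleaned rows; safe to run more than once."""
    conn.executescript(SCHEMA)
    for row in rows:
        cleaned = clean(row)
        if cleaned is None:
            continue
        station, city, date, temp = cleaned
        conn.execute("INSERT OR IGNORE INTO station VALUES (?, ?)", (station, city))
        conn.execute(
            "INSERT OR REPLACE INTO observation VALUES (?, ?, ?)",
            (station, date, temp),
        )
    conn.commit()


def average_temp_by_city(conn):
    """Serving step: a SQL query that joins the two tables."""
    query = """
        SELECT s.city, AVG(o.temp_c)
        FROM observation AS o
        JOIN station AS s ON s.station_id = o.station_id
        GROUP BY s.city
        ORDER BY s.city
    """
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
load(conn, RAW_ROWS)
load(conn, RAW_ROWS)  # rerunning does not create duplicate rows
result = average_temp_by_city(conn)
print(result)  # [('Austin', 13.25), ('Boston', -1.5)]
```

In this sketch, the primary keys plus INSERT OR IGNORE / INSERT OR REPLACE are what make the load step repeatable, which is one simple way to approach reproducible transformations. A proposal would describe choices like these in prose and justify them; working code is not expected at the proposal stage.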

III. Milestone Checkpoint 1: Pipeline Progress Update (15%)

Due: Week 8

This milestone focuses on early implementation and validation.

Expectations

  • Updated proposal documentation
  • Evidence of working data ingestion
  • Preliminary database schema
  • Initial transformation logic
  • Clear discussion of challenges and adjustments

Your documentation should clearly show what has changed since the proposal.


IV. Milestone Checkpoint 2: Pipeline Refinement Update (15%)

Due: Week 11

This milestone emphasizes completeness and refinement.

Expectations

  • Revised and expanded documentation
  • Stable ingestion pipeline
  • Refined schema and transformations
  • Demonstrated querying or data serving
  • Discussion of design trade-offs and lessons learned

V. Project Presentation (20%)

Due: Week 14 (in class)

Each group will deliver a formal presentation covering:

  • Project motivation and research question
  • Data sources and ingestion strategy
  • Schema design and transformations
  • Demo of queries, dashboards, or APIs
  • Key challenges and insights

Presentations should be technical, clear, and well-organized.


VI. Final Project Write-Up (30%)

Due: Week 15

The final write-up is a polished, self-contained document that reflects the full lifecycle of your project.

Required Components

  • Finalized research question
  • Detailed pipeline architecture
  • Final schema and transformations
  • Examples of queries or data access
  • Reflection on design decisions
  • Limitations and future work

This document should incorporate and refine all prior milestone documentation.


General Expectations
  • All work must be original and appropriately cited
  • Teams are jointly responsible for deliverables
  • Clear communication and documentation quality matter
  • Iteration and improvement are expected and rewarded