
DATA 503: Fundamentals of Data Engineering
January 26, 2026
Data engineering is not just about moving bytes. It is about enabling decisions.
Your pipeline exists to answer a question that matters.
Every data engineering project answers a question:
Your pipeline is the infrastructure that makes the answer credible.

Without a question, you have a data dump.
Without evidence, you have an opinion.
Without an action, you have a report nobody reads.
Raw data from a single source tells a limited story.
Combining multiple sources creates new knowledge:
| Single Source | Combined Sources |
|---|---|
| Weather data shows rainfall | Weather + traffic + accidents reveal crash risk patterns |
| Job postings list skills | Job postings + salary data + geography show where to move |
| Restaurant inspections show violations | Inspections + reviews + demographics reveal food desert risks |
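As a minimal sketch of what "combining" means in practice, the snippet below joins two toy tables on a shared date key, following the weather row of the table above. The records, field names, and the `join_on_date` helper are all invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict

# Toy records, invented for illustration only.
weather = [
    {"date": "2025-01-03", "rain_mm": 12.0},
    {"date": "2025-01-04", "rain_mm": 0.0},
    {"date": "2025-01-05", "rain_mm": 20.5},
]
accidents = [
    {"date": "2025-01-03", "crash_id": "a1"},
    {"date": "2025-01-03", "crash_id": "a2"},
    {"date": "2025-01-05", "crash_id": "a3"},
]


def join_on_date(weather_rows, accident_rows):
    """Left-join weather days to a per-day crash count."""
    crashes_per_day = defaultdict(int)
    for row in accident_rows:
        crashes_per_day[row["date"]] += 1
    return [
        {**w, "crash_count": crashes_per_day.get(w["date"], 0)}
        for w in weather_rows
    ]


combined = join_on_date(weather, accidents)
# Split crash counts by whether the day had any rain.
rainy = [r["crash_count"] for r in combined if r["rain_mm"] > 0]
dry = [r["crash_count"] for r in combined if r["rain_mm"] == 0]
```

Neither source alone can relate rainfall to crashes; the relationship only becomes visible once both share a key.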
You are not just analysts. You are infrastructure builders.
Your project will:
This is how you create new knowledge.
Research Question:
Which small Oregon towns along historic tourism corridors show the strongest relationship between seasonal visitor traffic and local business survival rates?
Small towns depend on tourism but lack data infrastructure.
A chamber of commerce director asks:
No single dataset answers this. But combining three does.
Source: StreetLight InSight API (academic access available)
What it provides:
Ingestion approach:
Source: Oregon Business Registry public data export
What it provides:
Ingestion approach:
Source: Census API (api.census.gov; the same data is browsable at data.census.gov)
What it provides:
Ingestion approach:

Key derived metrics:
-- Seasonal traffic ratio
SELECT
    town_id,
    SUM(CASE WHEN MONTH(observation_date) IN (6, 7, 8)
             THEN vehicle_count ELSE 0 END) * 1.0 /
    NULLIF(SUM(CASE WHEN MONTH(observation_date) IN (1, 2, 12)
                    THEN vehicle_count ELSE 0 END), 0)
        AS summer_winter_ratio
FROM traffic_observation
GROUP BY town_id;

-- Business survival rate by type
SELECT
    town_id,
    business_type,
    COUNT(CASE WHEN is_active THEN 1 END) * 1.0 /
    COUNT(*) AS survival_rate
FROM business
WHERE registration_date < DATE_SUB(CURRENT_DATE, INTERVAL 3 YEAR)
GROUP BY town_id, business_type;

With this infrastructure, you can answer:
The action: Target economic development resources to towns with potential but gaps.
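The derived-metric queries above use MySQL-style functions (`MONTH`, `DATE_SUB`). To try the seasonal traffic ratio without a database server, the sketch below runs an equivalent query in SQLite through Python's `sqlite3` module, swapping `MONTH(...)` for `strftime('%m', ...)`. Table and column names follow the query above; the rows are invented toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE traffic_observation (
        town_id INTEGER,
        observation_date TEXT,   -- ISO 8601 date, e.g. '2025-07-04'
        vehicle_count INTEGER
    )
    """
)
# Toy rows, invented for illustration only.
conn.executemany(
    "INSERT INTO traffic_observation VALUES (?, ?, ?)",
    [
        (1, "2025-07-04", 900),
        (1, "2025-08-10", 600),
        (1, "2025-01-15", 500),
        (2, "2025-06-20", 300),  # town 2 has no winter observations
    ],
)

# SQLite has no MONTH(); strftime('%m', ...) returns the month as '01'..'12'.
SEASONAL_RATIO_SQL = """
SELECT
    town_id,
    SUM(CASE WHEN CAST(strftime('%m', observation_date) AS INTEGER) IN (6, 7, 8)
             THEN vehicle_count ELSE 0 END) * 1.0 /
    NULLIF(SUM(CASE WHEN CAST(strftime('%m', observation_date) AS INTEGER) IN (1, 2, 12)
                    THEN vehicle_count ELSE 0 END), 0)
        AS summer_winter_ratio
FROM traffic_observation
GROUP BY town_id
ORDER BY town_id
"""

# Map each town_id to its ratio; a NULL ratio comes back as None.
ratios = dict(conn.execute(SEASONAL_RATIO_SQL).fetchall())
```

Town 2 has no winter traffic, so `NULLIF` turns the zero denominator into `NULL` and the ratio comes back as `None` instead of raising a division error.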

| Criterion | Status |
|---|---|
| Data publicly accessible or API available | Yes |
| Rate limits manageable | StreetLight: 1,000 requests/day; Census: 500 requests/day |
| Schema stable | Business registry format unchanged since 2019 |
| Can automate ingestion | All sources support scripted pulls |
| Fits 3NF model | Yes, clear entity relationships |
| Answers novel question | No existing dataset combines these three |
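The rate limits in the table translate directly into a minimum backfill time, which is worth estimating before committing to a source. The sketch below shows that arithmetic. The daily limits come from the table; the planned request counts are hypothetical placeholders, and `days_to_backfill` is a helper named here for illustration.

```python
import math

# Daily limits taken from the feasibility table above.
DAILY_LIMITS = {"streetlight": 1000, "census": 500}


def days_to_backfill(total_requests: int, daily_limit: int) -> int:
    """Minimum number of days to issue total_requests under a daily cap."""
    if daily_limit <= 0:
        raise ValueError("daily_limit must be positive")
    return math.ceil(total_requests / daily_limit)


# Hypothetical workload: these request counts are assumptions, not measurements.
planned_requests = {"streetlight": 4200, "census": 500}

backfill_days = {
    source: days_to_backfill(count, DAILY_LIMITS[source])
    for source, count in planned_requests.items()
}
```

If the estimate runs to weeks rather than days, that is a signal to narrow the question or cache aggressively.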
Write one sentence:
I am investigating [question] by combining [data source 1] and [data source 2] so that [stakeholder] can [decision].
Example:
I am investigating which Portland neighborhoods have the highest gap between Airbnb density and affordable housing availability by combining Inside Airbnb listings and HUD fair market rent data so that housing advocates can target policy interventions.
| Property | Weak Example | Strong Example |
|---|---|---|
| Specific | “How is climate change affecting Oregon?” | “Which Oregon counties show the largest gap between summer fire risk and emergency response capacity?” |
| Multi-source | “What do Yelp reviews say about restaurants?” | “Do Yelp ratings correlate with health inspection scores, and does this vary by neighborhood income?” |
| Actionable | “What is the history of bike lanes?” | “Which Portland intersections have the highest cyclist injury rate per commuter volume?” |
| Feasible | “Predict stock prices” | “Which SEC filing patterns correlate with earnings surprises for Oregon-based public companies?” |
APIs with academic/free tiers:
Public data portals:
Scrapeable sources (with care):
Take 3 minutes. Write:
You will refine this in the matchmaking activity.
By the end of this hour, you will have:
Teams of 2-3 will form based on mutual interest and complementary skills.
| Phase | Duration | Activity |
|---|---|---|
| 1 | 5 min | Prepare your pitch card |
| 2 | 36 min | Speed rounds (6 rounds x 6 min) |
| 3 | 10 min | Reflection and ranking |
Fill out the index card provided with:
Front of card:
Back of card:
Room setup: Two concentric circles facing each other

Outer circle faces inner circle and rotates clockwise after each round.
| Time | Activity |
|---|---|
| 0:00-2:00 | Inner circle person pitches |
| 2:00-2:30 | Outer circle asks one question |
| 2:30-4:30 | Outer circle person pitches |
| 4:30-5:00 | Inner circle asks one question |
| 5:00-6:00 | Both score and take notes, outer rotates |
You will complete 6 rounds total.
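The rotation can be sketched as a simple index shift: the inner circle stays seated and the outer circle moves one seat per round. The names below are placeholders, and which direction counts as "clockwise" depends on how seats are numbered, so treat this as an illustration of the pairing pattern only.

```python
def speed_round_pairs(inner, outer, rounds=6):
    """Yield (round_number, [(inner_person, outer_person), ...]).

    The inner circle stays put; the outer circle shifts one seat per round.
    """
    if len(inner) != len(outer):
        raise ValueError("circles must be the same size")
    n = len(outer)
    for r in range(rounds):
        # Seat i on the inner circle faces outer seat (i + r) mod n.
        pairs = [(inner[i], outer[(i + r) % n]) for i in range(n)]
        yield r + 1, pairs


# Placeholder names, invented for illustration.
inner = ["A", "B", "C", "D", "E", "F"]
outer = ["U", "V", "W", "X", "Y", "Z"]
schedule = dict(speed_round_pairs(inner, outer))
```

With six people per circle, six rounds pair every inner person with every outer person exactly once. With fewer than six per circle, some pairs would meet twice.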
Cover these points in order:
Use your 30-second question slot wisely. Good questions:
After each round, record on your score sheet:
| Field | Description |
|---|---|
| Partner name | Who you talked to |
| Their question | Brief summary |
| Fit score (1-3) | 3 = strong overlap, 2 = complementary, 1 = not a fit |
| Notes | Skills, concerns, ideas that emerged |
Score 3 if:
After all 6 rounds, take 10 minutes to:
Fill out the team preference form:
Process:
Teams of 3: Some teams will have 3 members based on mutual rankings and project scope.
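One plausible way to turn fit scores into teams is a greedy pass over mutual scores. The sketch below is an assumption for illustration, not the matching process actually used in this course; the names and scores are invented.

```python
from itertools import combinations


def greedy_pairs(scores):
    """Pair people by highest mutual fit score.

    scores[a][b] is the 1-3 fit score that a gave b; missing entries count as 0.
    Returns a list of (person, person) pairs; anyone left over is unpaired.
    """
    people = sorted(scores)
    mutual = []
    for a, b in combinations(people, 2):
        total = scores[a].get(b, 0) + scores[b].get(a, 0)
        mutual.append((total, a, b))
    # Highest combined score first; ties broken alphabetically for determinism.
    mutual.sort(key=lambda t: (-t[0], t[1], t[2]))

    paired, teams = set(), []
    for total, a, b in mutual:
        if total > 0 and a not in paired and b not in paired:
            teams.append((a, b))
            paired.update((a, b))
    return teams


# Invented example scores.
scores = {
    "Ana": {"Ben": 3, "Cy": 1, "Dee": 2},
    "Ben": {"Ana": 3, "Cy": 2, "Dee": 1},
    "Cy": {"Ana": 1, "Ben": 2, "Dee": 3},
    "Dee": {"Ana": 2, "Ben": 1, "Cy": 2},
}
teams = greedy_pairs(scores)
```

A greedy pass is easy to explain but not guaranteed to maximize total satisfaction, and a third member would still need to be placed by hand.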
Your project will be evaluated on:
| Criterion | What We Look For |
|---|---|
| Research question | Specific, answerable, novel |
| Data sources | Multiple, disparate, properly cited |
| Pipeline design | Ingestion, transformation, serving layers clear |
| Schema | Normalized (3NF preferred), documented |
| Implementation | Working code, reproducible |
| Story | Clear narrative connecting data to insight to action |
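For the schema criterion, here is a rough sketch of a normalized layout for the Oregon towns example, written for SQLite so it runs from Python. The `traffic_observation` and `business` tables reuse the column names from the derived-metric queries earlier; the `town` table, the `business_id` surrogate key, and all constraints are assumptions added for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE town (
    town_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL          -- assumed column
);
CREATE TABLE business (
    business_id       INTEGER PRIMARY KEY,   -- assumed surrogate key
    town_id           INTEGER NOT NULL REFERENCES town(town_id),
    business_type     TEXT NOT NULL,
    registration_date TEXT NOT NULL,
    is_active         INTEGER NOT NULL CHECK (is_active IN (0, 1))
);
CREATE TABLE traffic_observation (
    town_id          INTEGER NOT NULL REFERENCES town(town_id),
    observation_date TEXT NOT NULL,
    vehicle_count    INTEGER NOT NULL,
    PRIMARY KEY (town_id, observation_date)
);
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript(SCHEMA)

conn.execute("INSERT INTO town VALUES (1, 'Example Town')")
conn.execute("INSERT INTO traffic_observation VALUES (1, '2025-07-04', 900)")

try:
    # town_id 99 does not exist, so the foreign key rejects this row.
    conn.execute("INSERT INTO traffic_observation VALUES (99, '2025-07-04', 10)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
```

Each fact lives in one place: town attributes are stored once in `town`, and the other tables reference it by key instead of repeating town details.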
Before you leave, submit: