Lecture 01: Introduction to Data Engineering

DATA 503: Fundamentals of Data Engineering

Lucas P. Cordova, Ph.D.

lpcordova@willamette.edu

Willamette University

January 12, 2026

Setting the stage

What We’ll Discuss

What data engineering is and why it exists
The data engineering pipeline
Relational databases as a workhorse
SQL as your first “superpower”

The setting

In this lecture, our data world is Dunder Mifflin.
We will use familiar Office characters, events, and business problems as examples.
The goal is not TV trivia.
The goal is to make abstract engineering ideas feel concrete.

Quick question

If you had to define data engineering in one sentence, what would you say?

Write your sentence in 15 seconds.
Share with the person next to you.

What counts as “data” at Dunder Mifflin?

Examples include:

Sales calls, quotes, invoices
Customer records and contacts
Warehouse inventory and shipping logs
HR data (hiring, training, performance)
Emails and calendar invites
“Prank events” if Dwight is logging them

Mini-quiz

Which role is primarily responsible for making raw data reliable and accessible for others?

A. Data Analyst
B. Data Scientist
C. Data Engineer
D. Product Manager

Answer: C

The point of data engineering

The data engineer’s job

Build systems that move and shape data so it can be used reliably.
Make data easy to find, trustworthy, and fast to access.
Reduce chaos so others can do analysis, reporting, and ML.

In The Office terms

Michael wants a dashboard in 10 minutes.

“How many sales did we make this week?”
“Which customers are at risk of churning?”
“What does the warehouse backlog look like?”

Your job is to make those questions answerable without manual spreadsheet heroics.

Where things break

Common failure modes:

Data is missing or duplicated.
Definitions are inconsistent.
The report takes 40 minutes to run.
Nobody knows which table to trust.
The pipeline fails silently on a Tuesday.

A pipeline mental model

A pipeline is a repeatable path from sources to usable outputs.

The “why” in one slide

Data engineering exists because:

Data is produced by many systems.
Data changes over time.
Data is messy.
Organizations still need answers on demand.

Data quality and trust

The Five V’s (Scranton edition)

Volume
- How much data: every order, every call, every invoice
Velocity
- How fast it arrives: live sales calls vs nightly shipments
Variety
- Tables, PDFs, emails, phone call logs, images
Veracity
- Can we trust it: typos, duplicates, missing values
Value
- Does it help decisions: pricing, staffing, inventory planning

Quick question

Which “V” is usually hardest in your experience?

Raise a hand for:
- Volume
- Velocity
- Variety
- Veracity
- Value

Veracity is usually the silent killer

A simple example:

Sales reps enter customer names manually.
“Prince Family Paper” becomes:
- Prince Family Paper
- Prince Family Papers
- Prince Family Papeer

Now “top customers” depends on spelling.

Data quality is not just correctness

Also think about:

Consistency across systems
Timeliness
Completeness
Lineage (where it came from)
Observability (how you know it is working)

ETL, ELT, and the lifecycle

ETL vs ELT

ETL:

Extract
Transform
Load

ELT:

Extract
Load
Transform (inside the warehouse)

Why the difference matters

ETL is often:

Great for strict control and smaller volumes
Easier to reason about transformations

ELT is often:

Faster to iterate for analytics teams
More flexible once data is centralized

Batch vs streaming

Batch:

“Run the daily sales rollup at 2am”
Often cheaper and simpler

Streaming:

“Update the live sales leaderboard every minute”
More complex but lower latency

What “production” means

A pipeline is production when:

It runs on a schedule or event.
It is monitored.
Failures alert the right humans.
Data contracts are stable enough that changes are managed.

A small lifecycle picture

Break (10 minutes)

During break:

Pick one question you want answered about data engineering.
Write it down.
We will collect a few when we return.

Relational databases

Why relational databases still matter

Relational databases remain a core tool because:

Tables match how many business questions are asked.
SQL is powerful and widely supported.
Constraints and relationships reduce duplication and ambiguity.
They are a reliable foundation for analytics and applications.

When a relational database is a good fit

You have structured entities (customers, orders, employees).
You care about relationships and integrity.
You need precise querying and joins.
You want constraints (unique, foreign keys).

Our tiny Dunder Mifflin dataset

We will pretend we have these tables:

employees
customers
orders
order_items
products
episodes (optional, for fun)

Example: employees

Columns:

employee_id (PK)
full_name
role
branch
hire_date

Note

Primary Key (PK) is a unique identifier for each row in the table. More on this later.

Sample employees data

employee_id	full_name	role	branch	hire_date
1	Michael Scott	Regional Manager	Scranton	1992-03-15
2	Dwight Schrute	Assistant Regional Manager	Scranton	1995-04-01
3	Jim Halpert	Sales Representative	Scranton	1999-08-01
4	Pam Beesly	Receptionist	Scranton	2000-01-03
5	Stanley Hudson	Sales Representative	Scranton	1990-09-10
6	Phyllis Vance	Sales Representative	Scranton	2000-02-14
7	Kevin Malone	Accountant	Scranton	1998-06-15
8	Oscar Martinez	Accountant	Scranton	1996-11-20
9	Angela Martin	Head of Accounting	Scranton	1994-05-05
10	Creed Bratton	Quality Assurance	Scranton	1993-12-01

Example: customers

Columns:

customer_id (PK)
customer_name
industry
address_line_1
address_line_2
city
state
zip_code
country

Example: orders and order_items

orders:

order_id (PK)
order_date
customer_id (FK)
sales_rep_id (FK)

order_items:

order_id (FK)
product_id (FK)
quantity
unit_price

Note

Foreign Key (FK) is a reference to a primary key in another table. More on this later.

A simple schema view

Quick question

In what table is the best place to add another address field so that we have both a billing and shipping address?

A. customers
B. orders
C. order_items
D. products

Answer: A

Note

BUT, we should ask ourselves if there is a better way to approach this problem.

Normalization in one minute

Normalization is a way to reduce duplication.

Store the customer names once in customers.
Reference customers from orders.
Avoid rewriting customer names on every order.

SQL: asking questions

SQL is how you ask for answers

SQL lets you:

Select columns
Filter rows
Sort results
Limit output
Combine tables with joins
Aggregate (count, sum, average)

Today we focus on SELECT fundamentals.

The shape of a SELECT query

SELECT column_1, column_2
FROM some_table
WHERE some_condition
ORDER BY some_column
LIMIT 10;

Start simple

All employees:

SELECT *
FROM employees;

employee_id	full_name	role	branch	hire_date
1	Michael Scott	Regional Manager	Scranton	1992-03-15
2	Dwight Schrute	Assistant Regional Manager	Scranton	1995-04-01
3	Jim Halpert	Sales Representative	Scranton	1999-08-01
4	Pam Beesly	Receptionist	Scranton	2000-01-03
5	Stanley Hudson	Sales Representative	Scranton	1990-09-10
6	Phyllis Vance	Sales Representative	Scranton	2000-02-14
7	Kevin Malone	Accountant	Scranton	1998-06-15
8	Oscar Martinez	Accountant	Scranton	1996-11-20
9	Angela Martin	Head of Accounting	Scranton	1994-05-05
10	Creed Bratton	Quality Assurance	Scranton	1993-12-01

Note

SELECT * is a wildcard that selects all columns. It is not a good practice to use * in production queries. Instead, you should list the columns you need.

Choose columns

Only names and roles:

SELECT full_name, role
FROM employees;

full_name	role
Michael Scott	Regional Manager
Dwight Schrute	Assistant Regional Manager
Jim Halpert	Sales Representative
Pam Beesly	Receptionist
Stanley Hudson	Sales Representative
Phyllis Vance	Sales Representative
Kevin Malone	Accountant
Oscar Martinez	Accountant
Angela Martin	Head of Accounting
Creed Bratton	Quality Assurance

DISTINCT

Unique branches:

SELECT DISTINCT branch
FROM employees;

branch
Scranton

WHERE

All Scranton employees:

SELECT full_name, role
FROM employees
WHERE branch = 'Scranton';

full_name	role
Michael Scott	Regional Manager
Dwight Schrute	Assistant Regional Manager
Jim Halpert	Sales Representative
Pam Beesly	Receptionist
Stanley Hudson	Sales Representative
Phyllis Vance	Sales Representative
Kevin Malone	Accountant
Oscar Martinez	Accountant
Angela Martin	Head of Accounting
Creed Bratton	Quality Assurance

ORDER BY

Newest hires first:

SELECT full_name, hire_date
FROM employees
ORDER BY hire_date DESC;

full_name	hire_date
Phyllis Vance	2000-02-14
Pam Beesly	2000-01-03
Jim Halpert	1999-08-01
Kevin Malone	1998-06-15
Oscar Martinez	1996-11-20
Dwight Schrute	1995-04-01
Angela Martin	1994-05-05
Creed Bratton	1993-12-01
Michael Scott	1992-03-15
Stanley Hudson	1990-09-10

LIMIT

Top 5 newest hires:

SELECT full_name, hire_date
FROM employees
ORDER BY hire_date DESC
LIMIT 5;

full_name	hire_date
Phyllis Vance	2000-02-14
Pam Beesly	2000-01-03
Jim Halpert	1999-08-01
Kevin Malone	1998-06-15
Oscar Martinez	1996-11-20

Building a query step by step

Question:

“Show the 5 largest order line items by total line value.”

We define line value as:

quantity * unit_price

Step 1: pick columns

SELECT order_id, product_id, quantity, unit_price
FROM order_items;

Step 2: add a computed column

SELECT
  order_id,
  product_id,
  quantity,
  unit_price,
  quantity * unit_price AS line_value
FROM order_items;

Step 3: sort and limit

SELECT
  order_id,
  product_id,
  quantity,
  unit_price,
  quantity * unit_price AS line_value
FROM order_items
ORDER BY line_value DESC
LIMIT 5;

Quick question

If you filter rows, which clause do you use?

A. FROM
B. WHERE
C. ORDER BY
D. LIMIT

Answer: B

Informal exercise: build a SELECT

The exercise (individual then pair)

We are going to build a single SELECT statement for a given table.

Table:

episodes

Columns:

episode_id
season
episode_number
title
air_date
imdb_rating

Task 1

Write a query to list:

season
episode_number
title
imdb_rating

Conditions:

only season 2
only ratings 8.5 or higher

Output:

highest rated first

Limit:

top 5

Hint: start from the skeleton

SELECT
  -- columns
FROM episodes
WHERE
  -- conditions
ORDER BY
  -- sorting
LIMIT
  -- number
;

One possible solution

SELECT
  season,
  episode_number,
  title,
  imdb_rating
FROM episodes
WHERE season = 2
  AND imdb_rating >= 8.5
ORDER BY imdb_rating DESC
LIMIT 5;

Task 2

Modify your query to break ties by episode_number ascending.

One possible solution

SELECT
  season,
  episode_number,
  title,
  imdb_rating
FROM episodes
WHERE season = 2
  AND imdb_rating >= 8.5
ORDER BY imdb_rating DESC, episode_number ASC
LIMIT 5;

A peek ahead: joins

Why joins exist

Most real questions require combining tables.

Example:

“Which customers did Jim sell to last month?”
That information is split across:
- employees
- orders
- customers

Conceptual join

One join teaser

SELECT
  e.full_name AS sales_rep,
  c.customer_name,
  o.order_date
FROM orders o
JOIN employees e
  ON o.sales_rep_id = e.employee_id
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE e.full_name = 'Jim Halpert'
ORDER BY o.order_date DESC
LIMIT 10;

Quick question

What is the main purpose of a foreign key?

A. Make queries faster
B. Guarantee a relationship points to an existing row
C. Store text efficiently
D. Replace the need for indexes

Answer: B

Wrap-up

What you should leave with

A clear definition of what data engineering is.
A mental model of a pipeline.
A sense of why relational databases matter.
The ability to write basic SELECT queries with:
- WHERE
- ORDER BY
- LIMIT
- DISTINCT

Exit ticket

Write down:

One concept that felt clear.
One concept that felt fuzzy.
One question you want answered next lecture.

Send me your answers on Canvas on the Week 1 Participation Activity.

Vibe check

If you had to explain “ETL vs ELT” to Michael in two sentences, what would you say?
If you had to explain “foreign key” to Dwight in two sentences, what would you say?

Lecture 01: Introduction to Data Engineering

Setting the stage

What We’ll Discuss

The setting

Quick question

What counts as “data” at Dunder Mifflin?

Mini-quiz

The point of data engineering

The data engineer’s job

In The Office terms

Where things break

Think-pair-share: “The spreadsheet problem”

A pipeline mental model

The “why” in one slide

Data quality and trust

The Five V’s (Scranton edition)

Quick question

Veracity is usually the silent killer

Data quality is not just correctness

ETL, ELT, and the lifecycle

ETL vs ELT

Why the difference matters

Batch vs streaming

Think-pair-share: choose a mode

What “production” means

A small lifecycle picture

Break (10 minutes)

Relational databases

Why relational databases still matter

When a relational database is a good fit

Our tiny Dunder Mifflin dataset

Example: employees

Sample employees data

Example: customers

Example: orders and order_items

A simple schema view

Quick question

Normalization in one minute

Think-pair-share: what is the primary key?

SQL: asking questions

SQL is how you ask for answers

The shape of a SELECT query

Start simple

Choose columns

DISTINCT

WHERE

ORDER BY

LIMIT

Building a query step by step

Step 1: pick columns

Step 2: add a computed column

Step 3: sort and limit

Quick question

Informal exercise: build a SELECT

The exercise (individual then pair)

Task 1

Hint: start from the skeleton

Think-pair-share: compare solutions

One possible solution

Task 2

One possible solution

A peek ahead: joins

Why joins exist

Conceptual join

One join teaser

Quick question

Wrap-up

What you should leave with

Exit ticket

Vibe check