
DATA 351: Database Design
February 2, 2026
Today we learn to structure data so it does not structure us.
You have probably seen spreadsheets like this in the wild:
| OrderID | Customer | CustomerEmail | Product1 | Qty1 | Product2 | Qty2 | Product3 | Qty3 |
|---|---|---|---|---|---|---|---|---|
| 1001 | Acme Corp | orders@acme.com | Widget | 5 | Gadget | 2 | ||
| 1002 | Acme Corp | orders@acme.com | Sprocket | 10 | ||||
| 1003 | Beta Inc | purchasing@beta.io | Widget | 3 | Gadget | 1 | Sprocket | 7 |
This looks reasonable at first glance. Then someone orders four products.
When data is poorly structured, three types of anomalies haunt your database:
If Acme Corp changes their email address, you must update every single row where they appear. Miss one? Now you have conflicting data.
OrderID 1001: orders@acme.com
OrderID 1002: orders_new@acme.com -- Oops!
You want to add a new customer who has not placed an order yet. Where do you put them? The table requires an OrderID, but they have no orders.
You either:
If you delete the only order from a customer, you lose all information about that customer. They vanish from your database like they never existed.
Bad database design is expensive:
Normalization prevents all of these problems.
Normalization is the process of organizing data to:

Think of it as Marie Kondo for databases. Does this column spark joy in this table? No? It belongs somewhere else.
The high-level workflow looks like this:

| Step | Question to Ask |
|---|---|
| 1. Identify Entities | What things exist in this domain? |
| 2. List Attributes | What describes each entity? |
| 3. Find Keys | What uniquely identifies each row? |
| 4. Identify Dependencies | What determines what? |
| 5. Apply Normal Forms | Should we split this table? |
| 6. Create Relationships | How do we connect tables with foreign keys? |
| 7. Build ERD | How do we document and communicate the design? |
A key is one or more attributes that uniquely identify a row in a table.
Keys are foundational to relational databases:
Without keys, we cannot distinguish one row from another.
The chosen key used to uniquely identify rows in a table. Every table should have exactly one primary key.
employee(employee_id, full_name, email)
___________
PK
Any attribute or set of attributes that could serve as the primary key. A table may have multiple candidate keys.
In an employee table, both employee_id and email might be unique. Both are candidate keys, but we choose one as primary.
Any set of attributes that uniquely identifies rows, including extra (non-minimal) attributes.
{employee_id, full_name} is a super key, but not a candidate key because full_name is unnecessary for uniqueness.
An attribute in one table that references the primary key of another table. Creates relationships between tables.
employee.department_id -> department.department_id
Candidate keys are all possible keys. Primary key is the one you pick.
| employee_id | full_name | |
|---|---|---|
| 1 | michael@dm.com | Michael Scott |
| 2 | dwight@dm.com | Dwight Schrute |
| 3 | jim@dm.com | Jim Halpert |
Candidate keys here: employee_id, email
Both uniquely identify rows. We typically choose employee_id as the primary key because:
Before we dive into normal forms, we need to understand functional dependencies.
A functional dependency exists when one attribute determines another:
StudentID -> StudentName
This means: if you know the StudentID, you can determine exactly one StudentName.
Written as: StudentID functionally determines StudentName
Tip
Think of it like a function in programming: getStudentName(studentID) always returns the same name for the same ID.
An attribute depends on the entire primary key, not just part of it.
In a table with key (OrderID, ProductID):
An attribute depends on only part of a composite key.
This is a problem. It means data is in the wrong table.
A -> B -> C where A determines B, and B determines C, but B is not a key.
Example: StudentID -> DeptID -> DeptName
The student determines the department, and the department determines the department name. But DeptID is not a candidate key.
Rule: No repeating groups or arrays. Each cell contains exactly one atomic value.
| EmpID | Name | PhoneNumbers |
|---|---|---|
| 1 | Michael | 555-1234, 555-5678 |
| 2 | Dwight | 555-9999 |
Multiple values in one cell. SQL cannot query individual phone numbers easily.
| EmpID | Name | PhoneNumber |
|---|---|---|
| 1 | Michael | 555-1234 |
| 1 | Michael | 555-5678 |
| 2 | Dwight | 555-9999 |
One value per cell. Now we can search, filter, and index phone numbers.
Repeating columns are also a violation:
| StudentID | Course1 | Grade1 | Course2 | Grade2 | Course3 | Grade3 |
|---|---|---|---|---|---|---|
| S001 | Math | A | Physics | B | ||
| S002 | Math | B | Chemistry | A | Biology | A |
What if a student takes 10 courses? 50 courses?
| StudentID | Course | Grade |
|---|---|---|
| S001 | Math | A |
| S001 | Physics | B |
| S002 | Math | B |
| S002 | Chemistry | A |
| S002 | Biology | A |
Scalable. Clean. Queryable.
Rule: Must be in 1NF, plus no partial dependencies on the primary key.
This only matters when you have a composite primary key.
Primary Key: (OrderID, ProductID)
| OrderID | ProductID | ProductName | Quantity | UnitPrice |
|---|---|---|---|---|
| 1001 | P01 | Widget | 5 | 9.99 |
| 1001 | P02 | Gadget | 2 | 14.99 |
| 1002 | P01 | Widget | 10 | 9.99 |
ProductName and UnitPrice depend only on ProductID, not the full key!
Orders Table:
| OrderID | ProductID | Quantity |
|---|---|---|
| 1001 | P01 | 5 |
| 1001 | P02 | 2 |
| 1002 | P01 | 10 |
Products Table:
| ProductID | ProductName | UnitPrice |
|---|---|---|
| P01 | Widget | 9.99 |
| P02 | Gadget | 14.99 |
Now each table has a single theme.
Rule: Must be in 2NF, plus no transitive dependencies.
Non-key attributes should depend only on the key, not on other non-key attributes.
| EmpID | EmpName | DeptID | DeptName | DeptLocation |
|---|---|---|---|---|
| 1 | Michael | D01 | Management | Floor 3 |
| 2 | Dwight | D02 | Sales | Floor 2 |
| 3 | Jim | D02 | Sales | Floor 2 |
EmpID -> DeptID -> DeptName (transitive dependency)
Employees Table:
| EmpID | EmpName | DeptID |
|---|---|---|
| 1 | Michael | D01 |
| 2 | Dwight | D02 |
| 3 | Jim | D02 |
Departments Table:
| DeptID | DeptName | DeptLocation |
|---|---|---|
| D01 | Management | Floor 3 |
| D02 | Sales | Floor 2 |
Update the department once, and it is correct everywhere.
Bill Kent famously summarized 3NF as:
“Every non-key attribute must provide a fact about the key, the whole key, and nothing but the key, so help me Codd.”
Rule: A stricter version of 3NF. Every determinant must be a candidate key.
Most tables in 3NF are also in BCNF. The difference matters in edge cases with overlapping candidate keys.
Consider course scheduling where:
| Student | Course | Professor |
|---|---|---|
| Alice | Math101 | Dr. Smith |
| Bob | Math101 | Dr. Smith |
| Carol | Phys201 | Dr. Jones |
Candidate keys: (Student, Course) or (Student, Professor)
Professor -> Course is a dependency, but Professor is not a superkey.
Enrollment:
| Student | Professor |
|---|---|
| Alice | Dr. Smith |
| Bob | Dr. Smith |
| Carol | Dr. Jones |
Teaching:
| Professor | Course |
|---|---|
| Dr. Smith | Math101 |
| Dr. Jones | Phys201 |
Now every determinant is a candidate key.
3NF is usually sufficient for most applications. BCNF is the practical maximum.
Higher normal forms exist (4NF, 5NF, 6NF) but are rarely needed in practice.
Warning: Over-normalization can hurt performance:
The goal is eliminating anomalies, not winning a normalization contest.
| Form | Rule | Fixes |
|---|---|---|
| 1NF | Atomic values only | Repeating groups |
| 2NF | No partial dependencies | Composite key issues |
| 3NF | No transitive dependencies | Non-key -> non-key |
| BCNF | Every determinant is a candidate key | Overlapping keys |
Relational notation is a shorthand, textual way to represent a database table’s structure, including its name, attributes, primary keys, and foreign keys.
It acts as a stepping stone between:
Think of it as pseudocode for database design. ERDs are the visual, logical representation of the same information that relational notation captures in text form.
The basic format is:
table_name(attribute_1, attribute_2, attribute_3, ...)
Conventions:
Example:
employee(employee_id, full_name, department_id, email)
___________ _____________
PK FK -> department
Given this notation:
department(department_id, department_name, location)
_____________
employee(employee_id, full_name, department_id, hire_date)
___________ _____________
FK -> department
We understand:
department and employeedepartment_idBenefits:
You can sketch a database design in minutes without opening any tools.
We have employee data from Dunder Mifflin Paper Company:
employee_id,full_name,department,email,hire_date,salary_usd,
is_manager,performance_rating,years_experience,commission_rate,last_login
1,Michael Scott,Management,michael.scott@dundermifflin.com,2005-03-24,
75000.00,true,3.2,15,,2026-01-15 09:12:00
2,Dwight Schrute,Sales,dwight.schrute@dundermifflin.com,2006-04-12,
62000.00,false,4.8,12,0.08,2026-01-16 08:01:00
...
Let us normalize this data step by step.
| employee_id | full_name | department | hire_date | salary_usd | is_manager | performance_rating | years_experience | commission_rate | last_login | |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Michael Scott | Management | michael.scott@… | 2005-03-24 | 75000.00 | true | 3.2 | 15 | 2026-01-15 | |
| 2 | Dwight Schrute | Sales | dwight.schrute@… | 2006-04-12 | 62000.00 | false | 4.8 | 12 | 0.08 | 2026-01-16 |
| 3 | Pam Beesly | Reception | pam.beesly@… | 2007-07-02 | 42000.00 | false | 3.9 | 8 | ||
| 4 | Jim Halpert | Sales | jim.halpert@… | 2005-10-05 | 61000.00 | false | 4.1 | 10 | 0.07 | 2026-01-16 |
| 5 | Angela Martin | Accounting | angela.martin@… | 2006-08-15 | 52000.00 | false | 4.5 | 11 | 2026-01-15 |
Questions to ask:
Is this data in First Normal Form?
Checklist:
This data is already in 1NF. Good start.
What does employee_id determine?
employee_id -> full_name
employee_id -> department
employee_id -> email
employee_id -> hire_date
employee_id -> salary_usd
employee_id -> is_manager
employee_id -> performance_rating
employee_id -> years_experience
employee_id -> commission_rate
employee_id -> last_login
Are there other dependencies?
department -> ??? (Does department determine anything else?)
Think about it: does knowing the department tell us anything specific?
Is there a composite key? No, employee_id alone is the primary key.
With a single-column primary key, 2NF is automatically satisfied.
We are in 2NF.
Looking for transitive dependencies…
The department column contains the department name directly. What if departments have additional attributes?
Consider:
If we wanted to add these, we would repeat them for every employee in that department.
Problem Identified: Department should be its own entity.
Let us create a separate department table:
Relational Notation:
department(department_id, department_name)
_____________
employee(employee_id, full_name, department_id, email, hire_date,
___________ _____________
salary_usd, is_manager, performance_rating,
years_experience, commission_rate, last_login)
FK: department_id -> department(department_id)
Should we split further?
Sales-specific attributes:
commission_rate only applies to Sales employeesOptions:
sales_employee subtype tableFor this dataset, option 1 is reasonable. The NULL rate is low.
The last_login column represents temporal data. What if we want login history?
Current: Single timestamp, overwrites each time
Better design for history:
login_history(login_id, employee_id, login_timestamp)
________ ___________
FK -> employee
employee(employee_id, ...) -- Remove last_login
But for the current requirement (just last login), the original design works.
department(department_id, department_name)
_____________
employee(employee_id, full_name, department_id, email, hire_date,
___________ _____________
salary_usd, is_manager, performance_rating,
years_experience, commission_rate, last_login)
FK: employee.department_id -> department.department_id
We have completed the logical design using relational notation:
department(department_id, department_name)
_____________
employee(employee_id, full_name, department_id, email, hire_date,
___________ _____________
salary_usd, is_manager, performance_rating,
years_experience, commission_rate, last_login)
FK: employee.department_id -> department.department_id
Coming Soon: Physical Implementation
In an upcoming lecture, we will cover:
For now, focus on understanding how to design the logical structure.
Working in pairs, answer these questions:
is_manager be a separate table or role-based system?Take 5 minutes, then we will discuss.
An Entity-Relationship Diagram (ERD) is a visual representation of your database structure.
ERDs show:
ERDs are the universal language of database design.

Rectangles represent tables. The entity name appears at the top.
Listed inside the entity box with their data types.
Lines connecting entities show how they relate. The line style indicates cardinality.
|| One (mandatory)|o Zero or one (optional)}| Many (at least one)}o Zero or manyThe most common ERD notation style:
Crow’s Foot Notation Symbols
| Symbol | Meaning |
|---|---|
| Line with perpendicular line | One (mandatory) |
| Line with circle | Zero (optional) |
| Line with crow’s foot | Many |

Reading this diagram:
draw.io (also known as diagrams.net) is a free, browser-based diagramming tool.
Why draw.io?
Access it at: https://app.diagrams.net
Creating our Dunder Mifflin schema:
While we focus on draw.io, other options exist:
| Tool | Pros | Cons |
|---|---|---|
| draw.io | Free, easy, browser-based | Manual relationship lines |
| Lucidchart | Polish, collaboration | Paid for full features |
| dbdiagram.io | Code-based syntax | Limited free tier |
| pgModeler | PostgreSQL specific | Steeper learning curve |
| MySQL Workbench | Reverse engineering | MySQL focused |
For this course, draw.io is sufficient and recommended.
Using draw.io, create an ERD for a simplified library system:
Entities:
Relationships:
Take 10 minutes, then share your diagrams.
Oregon’s native freshwater turtles need help. The Oregon Conservation and Recreation Fund (OCRF) and Oregon Department of Fish and Wildlife (ODFW) track turtle sightings through a citizen science platform.
The raw sighting data arrives as a denormalized CSV file with issues we need to fix:
This is exactly the kind of messy data you will encounter in practice.
Two native species are in decline:
Threatened by two invasive species:
Citizens report sightings via mobile app. ODFW biologists review and intervene when necessary.
The CSV contains denormalized data where a single submission spans multiple rows:
| Submission ID | First Name | Latitude | Longitude | Observation ID | Species | Behavior | Count | |
|---|---|---|---|---|---|---|---|---|
| 101 | Jane | jane@… | 45.123 | -122.456 | 1001 | WesternPond | Basking | 3 |
| 101 | Jane | jane@… | 45.123 | -122.456 | 1002 | RedEaredSlider | Swimming | 1 |
| 102 | Bob | bob@… | 44.987 | -123.111 | 1003 | WesternPainted | Basking | 2 |
Notice how Jane’s submission information repeats because she observed two species.
What problems do you see?
Submitter information (name, email, phone) repeats for every observation within a submission. Location data also repeats.
If Jane submits a report with 5 species, her contact info appears 5 times.
If Jane changes her email, we must update every row where she appears. Miss one? Data inconsistency.
The flat file conflates three distinct concepts:
Repeated text values that should be normalized:
Your assignment will be to design a normalized schema for this turtle sighting data:
This is an ERD-only assignment. Focus on the logical design; we will implement the physical schema in a later assignment.

This is the professional database design workflow. We cover the first four steps in this lecture and the final two in the upcoming DDL and data import lecture.
Starting point: Messy spreadsheet from business users
Next steps:
Questions?
Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377-387.
Kent, W. (1983). A Simple Guide to Five Normal Forms in Relational Database Theory. Communications of the ACM, 26(2), 120-125.
Silberschatz, A., Korth, H., & Sudarshan, S. (2019). Database System Concepts (7th ed.). McGraw-Hill.
Date, C. J. (2003). An Introduction to Database Systems (8th ed.). Addison-Wesley.
Oregon Department of Fish and Wildlife. (n.d.). Oregon Turtle Conservation. https://www.dfw.state.or.us/wildlife/turtle/
diagrams.net. (n.d.). Free Online Diagram Software. https://www.diagrams.net