
DATA 503: Fundamentals of Data Engineering
February 9, 2026

We built the tables. We added constraints. Now comes the part where we actually move the data from the staging table into the normalized schema without breaking anything. No pressure.
Let’s say you have a staging table (music_catalog) full of raw CSV data and a set of normalized tables (artists, albums, tracks) waiting to receive it. The problem is that raw data is rarely clean enough to insert directly.
The workflow:

1. Audit the staging data for quality problems
2. Back up before touching anything
3. Clean with UPDATE and DELETE
4. Migrate with INSERT INTO ... SELECT, wrapped in a transaction
5. Verify the result
The DML (Data Manipulation Language) toolkit — plus ALTER TABLE, which is technically DDL but lives in the same workflow:

| Statement | Purpose |
|---|---|
| UPDATE | Change existing values |
| DELETE | Remove rows |
| ALTER TABLE | Modify table structure |
| INSERT INTO ... SELECT | Copy data between tables |
| START TRANSACTION | Begin a safe modification block |
| COMMIT | Save changes permanently |
| ROLLBACK | Undo everything since the last transaction start |
These are the verbs of data engineering. SELECT asks questions. These statements change answers.
Before modifying anything, inspect the data. Every experienced data engineer has a horror story about an UPDATE that ran without a WHERE clause. Do not become a horror story.
Questions to ask before any migration:

- How many rows are you dealing with?
- Are there duplicates?
- Are there NULLs where values are required?
- Are categorical values (names, genres, decades) spelled and cased consistently?
Last time, we imported a CSV of album data from a defunct record distributor and built our normalized target tables (artists, albums, tracks). The staging table looks like this:
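A quick way to get a feel for it is to pull a handful of rows. The column names here are the ones used by the audit queries later in these notes:

```sql
-- Peek at a few raw rows of the staging table
SELECT catalog_id, artist_name, album_title, release_year,
       genre, label, duration_min, decade
FROM music_catalog
LIMIT 5;
```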
Six decades of music. Also six decades of data entry by people who apparently had strong opinions about capitalization.
Always start with the basics:
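The row count, which produced the output below:

```sql
-- How many rows are we dealing with?
SELECT count(*) FROM music_catalog;
```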
count
-------
609
This tells you the scale of what you are working with. A table with 50 rows and a table with 5 million rows require different strategies. One you can eyeball. The other, you cannot.
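The duplicate check that produced the next listing (the same query reappears in the exercise solutions):

```sql
-- Artist/album pairs that appear more than once
SELECT artist_name, album_title, count(*) AS album_count
FROM music_catalog
GROUP BY artist_name, album_title
HAVING count(*) > 1;
```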
artist_name album_title album_count
----------------- ----------------------- -----------
Radiohead OK Computer 2
The Beatles Abbey Road 2
GROUP BY with HAVING count(*) > 1 is your duplicate detector. Same artist, same album, appearing twice. Could be a reissue. Could be a data entry mistake. Usually the latter.
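Next, the genre breakdown that produced the output below:

```sql
-- Genre distribution; NULL genres sort to the end
SELECT genre, count(*) AS genre_count
FROM music_catalog
GROUP BY genre
ORDER BY genre;
```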
genre genre_count
------------- -----------
Alternative 87
Country 34
Electronic 45
Hip-Hop 62
Pop 128
R&B 41
Rock 203
9
Nine rows have no genre. That blank line at the bottom is NULL. NULL is not nothing. NULL is “I do not know,” which in a database is significantly worse than nothing.
IS NULL is the only way to check for NULL. Using = NULL will not work because NULL is not equal to anything, including itself. This is one of those things that makes perfect logical sense and no intuitive sense whatsoever.
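The audit of artist names that produced the next listing was likely along these lines; the ORDER BY lower(...) is an assumption, chosen so case variants of the same name sort next to each other:

```sql
-- Group by exact spelling; case variants land adjacent in the output
SELECT artist_name, count(*) AS name_count
FROM music_catalog
GROUP BY artist_name
ORDER BY lower(artist_name);
```

The listing below appears to be truncated to the problem artists.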
artist_name name_count
--------------------------- ----------
Led Zeppelin 3
Led Zepplin 1
Outkast 2
OutKast 1
the rolling stones 1
The Rolling Stones 4
Whitney Houston 3
Whitney houston 1
Eight entries for what should be four artists. Typos, inconsistent casing, and missing “The” prefixes. In a normalized database, each artist would be one row. In this staging table, each variation is a separate identity crisis.
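The decade audit, same pattern as before:

```sql
-- Distinct decade labels and their counts
SELECT decade, count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;
```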
decade decade_count
---------- ------------
1970 42
1970s 58
1980s 89
1990s 104
2000s 97
2010s 78
20s 12
2020s 29
Three problems: “1970” should be “1970s”, “20s” should be “2020s”, and the two 1970s variants need merging. Whoever entered this data was consistent about 60% of the time, which is the worst kind of consistent.
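One more sanity check: do the release years actually fall inside the labeled decades? The query behind the output below:

```sql
-- Year range per decade label
SELECT decade,
       min(release_year) AS earliest,
       max(release_year) AS latest
FROM music_catalog
GROUP BY decade
ORDER BY decade;
```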
decade earliest latest
---------- -------- ------
1970 1970 1979
1970s 1970 1979
1980s 1980 1989
1990s 1990 1999
2000s 2000 2009
2010s 2010 2019
20s 2020 2024
2020s 2020 2025
At least the years match the decades. Small mercies.
Our audit revealed three categories of problems:

- Missing values: nine rows with NULL genres
- Inconsistent values: misspelled and miscased artist names, malformed decade labels
- Duplicates: repeated Beatles and Radiohead entries

Now we fix them. But first, a word about safety.
Always create a backup before running UPDATE or DELETE on production data.
This is not optional. This is not paranoia. This is professionalism. The difference between a junior and senior data engineer is not skill. It is the number of times they have been burned by a missing backup.
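The backup itself is one statement (the same one used in the exercise solutions):

```sql
-- Full copy of structure and data
CREATE TABLE music_catalog_backup AS
SELECT * FROM music_catalog;
```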
CREATE TABLE ... AS SELECT copies both structure and data. Verify the backup:
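One way to compare the two counts side by side is a pair of scalar subqueries, which would produce the output below:

```sql
-- Original and backup row counts in one row
SELECT
    (SELECT count(*) FROM music_catalog)        AS original,
    (SELECT count(*) FROM music_catalog_backup) AS backup;
```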
original backup
-------- ------
609 609
Same count. You can proceed with slightly less anxiety.
Before modifying a column, copy it first:
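A sketch, using the artist_name_copy column the next paragraph refers to (the varchar(200) type matches the safety column in the exercise solutions):

```sql
-- Add a scratch column and fill it with the current values
ALTER TABLE music_catalog ADD COLUMN artist_name_copy varchar(200);

UPDATE music_catalog
SET artist_name_copy = artist_name;
```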
Now artist_name_copy preserves the original values. If your UPDATE goes sideways, you can restore from the copy without touching the backup table. Belt and suspenders.
ALTER TABLE changes the shape of a table without destroying its data:
| Operation | Syntax |
|---|---|
| Add column | ALTER TABLE t ADD COLUMN col type; |
| Drop column | ALTER TABLE t DROP COLUMN col; |
| Change type | ALTER TABLE t ALTER COLUMN col SET DATA TYPE type; |
| Set NOT NULL | ALTER TABLE t ALTER COLUMN col SET NOT NULL; |
| Drop NOT NULL | ALTER TABLE t ALTER COLUMN col DROP NOT NULL; |
A common pattern: add a new “clean” column alongside the original dirty one.
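For the artist cleanup that follows, the new column is artist_standard, seeded with the original values (this is the same pair of statements used in the exercise solutions):

```sql
-- Clean column lives alongside the dirty one
ALTER TABLE music_catalog ADD COLUMN artist_standard varchar(200);

UPDATE music_catalog
SET artist_standard = artist_name;
```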
Now you can clean artist_standard without losing the original artist_name values. When you are satisfied the cleaning is correct, you can drop the original. Or keep both. Data engineers who keep both sleep better.
UPDATE changes existing values in a table:
Warning
An UPDATE without a WHERE clause modifies every row in the table. PostgreSQL will not ask “are you sure?” It will simply do it. Immediately. With enthusiasm.
We found nine rows with NULL genres. After researching the albums, we can fill them in:
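One catalog ID (CAT-4501) appears in the exercise solutions; the IN-clause IDs below are hypothetical placeholders for the rest:

```sql
-- One album at a time
UPDATE music_catalog
SET genre = 'Rock'
WHERE catalog_id = 'CAT-4501';

-- Or several at once, when they share a genre
-- (these three catalog IDs are hypothetical)
UPDATE music_catalog
SET genre = 'Pop'
WHERE catalog_id IN ('CAT-0000', 'CAT-0001', 'CAT-0002');
```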
Each UPDATE returns UPDATE 1 (or UPDATE 3 for the IN clause), confirming the exact number of rows modified. If it says UPDATE 0, your WHERE clause matched nothing. If it says UPDATE 609, you forgot the WHERE clause. Both are worth investigating.
If your UPDATE was wrong, restore from the copy:
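Two possible restore paths, assuming the artist_name_copy column and the music_catalog_backup table created earlier:

```sql
-- Option 1: restore from the in-table copy
UPDATE music_catalog
SET artist_name = artist_name_copy;

-- Option 2: restore from the backup table, joined on the key
UPDATE music_catalog m
SET artist_name = b.artist_name
FROM music_catalog_backup b
WHERE m.catalog_id = b.catalog_id;
```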
Option 2 uses UPDATE ... FROM, which joins two tables during the update. This is PostgreSQL-specific syntax and extremely useful for data migration work.
Remember the inconsistent spellings? Fix them with pattern matching and exact matches:
-- Fix the typo
UPDATE music_catalog
SET artist_standard = 'Led Zeppelin'
WHERE artist_name LIKE 'Led Zep%';
-- Fix casing inconsistencies
UPDATE music_catalog
SET artist_standard = 'OutKast'
WHERE lower(artist_name) = 'outkast';
UPDATE music_catalog
SET artist_standard = 'The Rolling Stones'
WHERE lower(artist_name) = 'the rolling stones';
UPDATE music_catalog
SET artist_standard = 'Whitney Houston'
WHERE lower(artist_name) = 'whitney houston';

Verify:
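One way to spot-check is to list the affected artists; the exact filter behind the output below is a guess:

```sql
-- Rows touched by the artist cleanup, original next to clean
SELECT DISTINCT artist_name, artist_standard
FROM music_catalog
WHERE lower(artist_standard) IN ('led zeppelin', 'outkast',
                                 'the rolling stones', 'whitney houston');
```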
artist_name artist_standard
--------------------- --------------------
Led Zepplin Led Zeppelin
the rolling stones The Rolling Stones
OutKast OutKast
Outkast OutKast
Whitney houston Whitney Houston
Five dirty values, four clean artist names. The original artist_name column is untouched for auditing.
The || operator concatenates strings in PostgreSQL. Combined with simple WHERE clauses, it handles our decade problems:
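With || the decade fixes read like this (the literal-value versions in the exercise solutions work just as well):

```sql
-- '1970' -> '1970s': append the missing suffix
UPDATE music_catalog
SET decade = decade || 's'
WHERE decade = '1970';

-- '20s' -> '2020s': prepend the missing century
UPDATE music_catalog
SET decade = '20' || decade
WHERE decade = '20s';
```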
Verify all decades are now consistent:
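Re-running the same audit query as before:

```sql
SELECT decade, count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;
```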
decade decade_count
---------- ------------
1970s 100
1980s 89
1990s 104
2000s 97
2010s 78
2020s 41
Six clean decades. No duplicates, no abbreviations.
For the duplicate Beatles and Radiohead entries, we need to keep one and remove the other. First, identify which to keep:
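Pulling the duplicate rows with their catalog IDs, using the artist/album pairs the earlier audit flagged:

```sql
-- Show both copies of each duplicate, oldest catalog_id first
SELECT catalog_id, artist_name, album_title
FROM music_catalog
WHERE (artist_name, album_title) IN (('Radiohead', 'OK Computer'),
                                     ('The Beatles', 'Abbey Road'))
ORDER BY artist_name, catalog_id;
```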
Keep the one with the lower catalog_id (the first entry) and delete the duplicate:
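The IDs here are the ones used in the exercise solutions:

```sql
-- Remove the later duplicate of each pair
DELETE FROM music_catalog
WHERE catalog_id IN ('CAT-1004', 'CAT-2087');
```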
DELETE 2
Two rows removed. Verify the duplicates are gone:
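Re-run the duplicate detector:

```sql
SELECT artist_name, album_title, count(*)
FROM music_catalog
GROUP BY artist_name, album_title
HAVING count(*) > 1;
```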
Empty result. Clean.
You can use a subquery to determine which rows to update:
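A sketch, assuming a genre_tags lookup table with tag and category columns — category is named in the explanation that follows, but the rest of the table's structure is an assumption:

```sql
-- Pull each row's category from the lookup table,
-- but only update rows that actually have a match
UPDATE music_catalog
SET genre = (SELECT g.category
             FROM genre_tags g
             WHERE g.tag = music_catalog.genre)
WHERE EXISTS (SELECT 1
              FROM genre_tags g
              WHERE g.tag = music_catalog.genre);
```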
A more readable alternative for cross-table updates:
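The same update with PostgreSQL's UPDATE ... FROM, under the same assumed genre_tags structure:

```sql
-- Join during the update; no repeated subquery
UPDATE music_catalog m
SET genre = g.category
FROM genre_tags g
WHERE g.tag = m.genre;
```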
This joins music_catalog to genre_tags during the update and pulls the category value directly. Same result, cleaner syntax. This is PostgreSQL-specific and one of its nicest features.
| Pattern | Use When |
|---|---|
| SET col = value WHERE ... | Simple fixes to specific rows |
| SET col = col2 | Copying between columns |
| SET col = 'prefix' \|\| col | String manipulation |
| UPDATE t1 SET ... FROM t2 WHERE ... | Joining tables during update |
| UPDATE ... WHERE EXISTS (SELECT ...) | Conditional update based on another table |
| WHERE lower(col) = 'value' | Case-insensitive matching |
DELETE removes rows from a table:
Warning
DELETE FROM table_name; without a WHERE clause deletes every row. Unlike dropping the table, the empty table structure remains, staring at you like a reminder of what you have done.
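The deletion in question, with the cutoff inferred from the prose that follows:

```sql
-- Drop everything released before the 1970s
DELETE FROM music_catalog
WHERE release_year < 1970;
```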
DELETE 12
This removes all 12 albums released before the 1970s. PostgreSQL confirms the count, which you should always check against your expectation. If you expected 12 and got 12, good. If you expected 12 and got 412, less good.
| Statement | What It Does | Reversible? |
|---|---|---|
| DELETE FROM t WHERE ... | Removes matching rows | Yes (in a transaction) |
| DELETE FROM t | Removes all rows, keeps structure | Yes (in a transaction) |
| TRUNCATE t | Removes all rows, faster than DELETE | Yes (in a transaction, in PostgreSQL) |
| DROP TABLE t | Removes table entirely | Yes (in a transaction, in PostgreSQL) |
After migration, remove temporary columns:
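Assuming the safety column from earlier:

```sql
-- The scratch column has served its purpose
ALTER TABLE music_catalog DROP COLUMN artist_name_copy;
```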
Clean tables are happy tables. Leftover columns named _copy, _backup, _temp, and _old are the database equivalent of packing boxes you never unpacked after moving.
Remove backup tables when you are confident the migration succeeded:
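The statement itself:

```sql
DROP TABLE music_catalog_backup;
```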
Once committed, DROP TABLE is permanent. There is no undo outside a transaction. Make sure you are done with the backup before dropping it. Then wait a day. Then drop it.
A transaction groups multiple SQL statements into a single atomic unit. Either all of them succeed, or none of them do.
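A demonstration, with a deliberately typo'd UPDATE — the Feetwood Mac scenario in the output that follows:

```sql
START TRANSACTION;

-- Deliberate typo: 'Feetwood' instead of 'Fleetwood'
UPDATE music_catalog
SET artist_standard = 'Feetwood Mac'
WHERE artist_name = 'Fleetwood Mac';
```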

Check the result:
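Still inside the open transaction:

```sql
SELECT artist_name, artist_standard
FROM music_catalog
WHERE artist_name = 'Fleetwood Mac';
```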
artist_name artist_standard
---------------- ----------------
Fleetwood Mac Feetwood Mac -- Typo!
Fleetwood Mac Feetwood Mac -- Typo!
Fleetwood Mac Feetwood Mac -- Typo!
The typo is visible. Roll it back:
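One word undoes everything since START TRANSACTION:

```sql
ROLLBACK;
```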
Query again:
artist_name artist_standard
---------------- ----------------
Fleetwood Mac Fleetwood Mac -- Original restored
Fleetwood Mac Fleetwood Mac
Fleetwood Mac Fleetwood Mac
The ROLLBACK undid the UPDATE as if it never happened. Fleetwood Mac’s name is safe. This is why transactions exist.
Always use transactions when:

- modifying production data
- running multi-statement migrations
- you are not completely sure what a WHERE clause will match

The workflow:

1. START TRANSACTION;
2. Run your UPDATE, DELETE, or INSERT statements
3. SELECT to verify the results
4. COMMIT; if correct, ROLLBACK; if not

Tip
In psql, BEGIN is an alias for START TRANSACTION. You will see both in the wild. They do the same thing.
A few things to be aware of:
DDL statements (CREATE TABLE, DROP TABLE) in PostgreSQL are transactional. This is unusual. Most other databases auto-commit DDL.

With all these tools in hand, here is the complete workflow for migrating data from the staging table to our normalized music catalog:

The key statement for migration copies data from one table to another:
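The pattern, shown here for the artists lookup table (the full migration appears further down):

```sql
-- One row per distinct cleaned artist name
INSERT INTO artists (artist_name)
SELECT DISTINCT artist_standard
FROM music_catalog
WHERE artist_standard IS NOT NULL;
```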
DISTINCT ensures you do not insert duplicate rows. This is how you populate a lookup table from a denormalized staging table.
When migrating to tables with foreign key relationships, order matters:
START TRANSACTION;
-- 1. Populate parent table first (artists)
INSERT INTO artists (artist_name)
SELECT DISTINCT artist_standard
FROM music_catalog
WHERE artist_standard IS NOT NULL;
-- 2. Then populate child table with FK lookups (albums)
INSERT INTO albums (album_title, release_year, genre, label,
duration_min, artist_id)
SELECT DISTINCT m.album_title, m.release_year, m.genre, m.label,
m.duration_min, a.artist_id
FROM music_catalog m
JOIN artists a ON m.artist_standard = a.artist_name;
-- 3. Verify
SELECT count(*) FROM artists;
SELECT count(*) FROM albums;
COMMIT;

Parent tables first, child tables second. Foreign key constraints will reject inserts in the wrong order.
After committing, verify the data made it across cleanly:
-- Check counts
SELECT 'artists' AS tbl, count(*) FROM artists
UNION ALL
SELECT 'albums', count(*) FROM albums;
-- Spot-check with a JOIN
SELECT a.artist_name, al.album_title, al.release_year, al.genre
FROM albums al
JOIN artists a ON al.artist_id = a.artist_id
ORDER BY a.artist_name, al.release_year
LIMIT 10;

If the JOIN returns the data you expect, the migration worked. If it returns nothing, your foreign keys are not aligned. Check the ON clause.
A safer migration pattern creates the new table, verifies it, then swaps:
-- Create backup with extra metadata
CREATE TABLE music_catalog_archive AS
SELECT *,
'2026-02-09'::date AS reviewed_date
FROM music_catalog;
-- Swap tables using RENAME
ALTER TABLE music_catalog
RENAME TO music_catalog_temp;
ALTER TABLE music_catalog_archive
RENAME TO music_catalog;
ALTER TABLE music_catalog_temp
RENAME TO music_catalog_archive;

Three renames and the new table is live. The old one is still available as _archive if anything goes wrong.
You have the music_catalog staging table and the normalized artists and albums tables from the previous lecture. Your job: audit the staging data, fix the problems, and migrate the clean data into the normalized schema.
Write queries to find all the quality issues. You should check for:

- duplicate artist/album pairs
- NULL genres
- inconsistent artist name spellings and casing
- malformed decade labels
Tip
Use GROUP BY ... HAVING count(*) > 1 for duplicates, IS NULL for missing values, and GROUP BY with ORDER BY to spot inconsistencies.
1. Create a backup table
2. Add a safety column for artist names
3. Fix each category of issues using UPDATE and DELETE

Remember: one category at a time, verify after each fix.
Write the migration wrapped in a transaction:
- INSERT INTO artists using SELECT DISTINCT from the cleaned staging data
- INSERT INTO albums with a JOIN back to artists for the foreign key

Tip
Parent tables first, child tables second. Your albums INSERT needs artist_id values, which means artists must be populated first.
-- Duplicates
SELECT artist_name, album_title, count(*)
FROM music_catalog
GROUP BY artist_name, album_title
HAVING count(*) > 1;
-- NULL genres
SELECT catalog_id, artist_name, album_title
FROM music_catalog
WHERE genre IS NULL;
-- Inconsistent artist names
SELECT artist_name, count(*)
FROM music_catalog
GROUP BY artist_name
ORDER BY artist_name;
-- Malformed decades
SELECT decade, count(*)
FROM music_catalog
GROUP BY decade
ORDER BY decade;

-- Backup
CREATE TABLE music_catalog_backup AS
SELECT * FROM music_catalog;
-- Safety column
ALTER TABLE music_catalog ADD COLUMN artist_standard varchar(200);
UPDATE music_catalog SET artist_standard = artist_name;
-- Fix artist names
UPDATE music_catalog
SET artist_standard = 'Led Zeppelin'
WHERE artist_name LIKE 'Led Zep%';
UPDATE music_catalog
SET artist_standard = 'OutKast'
WHERE lower(artist_name) = 'outkast';
UPDATE music_catalog
SET artist_standard = 'The Rolling Stones'
WHERE lower(artist_name) = 'the rolling stones';
UPDATE music_catalog
SET artist_standard = 'Whitney Houston'
WHERE lower(artist_name) = 'whitney houston';
-- Fix decades
UPDATE music_catalog SET decade = '1970s' WHERE decade = '1970';
UPDATE music_catalog SET decade = '2020s' WHERE decade = '20s';
-- Fix NULL genres (after research)
UPDATE music_catalog SET genre = 'Rock'
WHERE catalog_id = 'CAT-4501';
-- ... (remaining NULLs)
-- Remove duplicates
DELETE FROM music_catalog
WHERE catalog_id IN ('CAT-1004', 'CAT-2087');

START TRANSACTION;
-- 1. Artists
INSERT INTO artists (artist_name)
SELECT DISTINCT artist_standard
FROM music_catalog
WHERE artist_standard IS NOT NULL
ORDER BY artist_standard;
-- 2. Albums
INSERT INTO albums (album_title, release_year, genre, label,
duration_min, artist_id)
SELECT DISTINCT m.album_title, m.release_year, m.genre,
m.label, m.duration_min, a.artist_id
FROM music_catalog m
JOIN artists a ON m.artist_standard = a.artist_name;
-- 3. Verify
SELECT 'artists' AS tbl, count(*) FROM artists
UNION ALL
SELECT 'albums', count(*) FROM albums;
COMMIT;

-- Full picture
SELECT a.artist_name, al.album_title,
al.release_year, al.genre
FROM albums al
JOIN artists a ON al.artist_id = a.artist_id
ORDER BY a.artist_name, al.release_year;
-- Check constraints
SELECT * FROM albums WHERE release_year NOT BETWEEN 1900 AND 2100;
SELECT * FROM albums WHERE duration_min <= 0;

Question to consider: why do we use artist_standard for the JOIN instead of artist_name?

| Tool | When to Use |
|---|---|
| UPDATE ... SET ... WHERE | Fix specific data quality issues |
| UPDATE ... FROM | Update using data from another table |
| DELETE FROM ... WHERE | Remove rows that do not belong |
| ALTER TABLE ADD/DROP COLUMN | Reshape tables for cleaning |
| INSERT INTO ... SELECT | Migrate data between tables |
| START TRANSACTION / COMMIT / ROLLBACK | Wrap everything in a safety net |
Before any data modification:

1. Inspect the data and quantify the problems
2. Create a backup
3. Wrap the changes in a transaction
4. Verify, then COMMIT
The goal is not speed. The goal is correctness. A fast migration that loses data is not a migration. It is a disaster with good performance metrics.
You now have the complete toolkit:

- importing raw CSVs into a staging table
- auditing and cleaning the data with UPDATE and DELETE
- enforcing integrity with constraints
- migrating into a normalized schema inside a transaction
From raw CSV to a clean, normalized, constraint-enforced production database. The entire pipeline. Next, you will apply this pipeline end-to-end on your own with a new dataset.
DeBarros, A. (2022). Practical SQL: A Beginner’s Guide to Storytelling with Data (2nd ed.). No Starch Press. Chapter 9: Inspecting and Modifying Data.
PostgreSQL Documentation. “UPDATE.” https://www.postgresql.org/docs/current/sql-update.html
PostgreSQL Documentation. “DELETE.” https://www.postgresql.org/docs/current/sql-delete.html
PostgreSQL Documentation. “Transactions.” https://www.postgresql.org/docs/current/tutorial-transactions.html
PostgreSQL Documentation. “ALTER TABLE.” https://www.postgresql.org/docs/current/sql-altertable.html