Lecture 05-2: Inspecting and Modifying Data

DATA 503: Fundamentals of Data Engineering

Author: Lucas P. Cordova, Ph.D.
Affiliation: Willamette University
Published: February 9, 2026

Abstract

This lecture covers the practical skills of inspecting, cleaning, and modifying data in PostgreSQL. After building normalized tables with constraints, we now learn to migrate data from a staging table into the final schema. Topics include auditing data quality, ALTER TABLE for schema changes, UPDATE for fixing data, DELETE for removing rows, backup strategies, and transactions for safe modifications.

The Final Step

You can import data. You can design a schema. You can build tables with constraints. Now comes the part where you actually move the data from point A to point B without breaking anything. No pressure.

Where We Are in the Pipeline

The Data Migration Challenge

You have a staging table full of raw CSV data and a set of normalized tables waiting to receive it. The problem is that raw data is rarely clean enough to insert directly.

The workflow:

  1. Audit the staging data for quality problems
  2. Back up anything you are about to change
  3. Fix the problems with UPDATE and DELETE
  4. Migrate the clean data with INSERT INTO ... SELECT
  5. Verify the result before dropping anything

What You Will Learn Today

The DML (Data Manipulation Language) toolkit:

Statement Purpose
UPDATE Change existing values
DELETE Remove rows
ALTER TABLE Modify table structure
INSERT INTO ... SELECT Copy data between tables
START TRANSACTION Begin a safe modification block
COMMIT Save changes permanently
ROLLBACK Undo everything since the last transaction start

These are the verbs of data engineering. SELECT asks questions. These statements change answers.

Auditing Data Quality

The Inspection Mindset

Before modifying anything, inspect the data. Every experienced data engineer has a horror story about an UPDATE that ran without a WHERE clause. Do not become a horror story.

Questions to ask before any migration:

  • How many rows do I have?
  • Are there NULLs where there should not be?
  • Are values consistent? (Same entity, different spellings?)
  • Are there duplicates?
  • Do the data types match the target schema?

Example: Music Catalog Dataset

Imagine you work for a streaming service that just acquired a catalog of albums from a defunct record distributor. The data arrived as a CSV and was loaded into a staging table:

CREATE TABLE music_catalog (
    catalog_id varchar(20) CONSTRAINT catalog_key PRIMARY KEY,
    artist_name varchar(200),
    album_title varchar(200),
    release_year smallint,
    genre varchar(50),
    label varchar(100),
    duration_min numeric(5,1),
    decade varchar(10)
);

COPY music_catalog
FROM '/path/to/album_catalog.csv'
WITH (FORMAT CSV, HEADER, DELIMITER ',');

CREATE INDEX artist_idx ON music_catalog (artist_name);

The catalog spans six decades of music, from Fleetwood Mac to Beyonce. It is also, as you will discover, a mess.

Audit Step 1: How Many Rows?

Always start with the basics:

SELECT count(*) FROM music_catalog;

This tells you the scale of what you are working with. A table with 50 rows and a table with 5 million rows require different strategies. One you can eyeball. The other, you cannot.
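For the tables you cannot eyeball in full, a quick sample helps. Two common options (LIMIT for the first rows; TABLESAMPLE, available since PostgreSQL 9.5, for a random sample):

-- First ten rows, in whatever order PostgreSQL finds them
SELECT * FROM music_catalog LIMIT 10;

-- A roughly 1% random sample of the table
SELECT * FROM music_catalog TABLESAMPLE BERNOULLI (1);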

Audit Step 2: Find Duplicate Albums

SELECT artist_name,
       album_title,
       count(*) AS album_count
FROM music_catalog
GROUP BY artist_name, album_title
HAVING count(*) > 1
ORDER BY artist_name, album_title;
artist_name        album_title              album_count
-----------------  -----------------------  -----------
Radiohead          OK Computer                        2
The Beatles        Abbey Road                         2

GROUP BY with HAVING count(*) > 1 is your duplicate detector. Same artist, same album, appearing twice. Could be a reissue. Could be a data entry mistake. Usually the latter.
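GROUP BY tells you which artist/album pairs repeat; to pull up the full duplicate rows themselves (catalog IDs and all) for a closer look, one option is a window function:

-- Tag every row with how many copies of its artist/album pair exist,
-- then keep only the rows that have company
SELECT *
FROM (SELECT mc.*,
             count(*) OVER (PARTITION BY artist_name, album_title) AS copies
      FROM music_catalog mc) dup
WHERE copies > 1
ORDER BY artist_name, album_title, catalog_id;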

Audit Step 3: Find NULL Values

SELECT genre,
       count(*) AS genre_count
FROM music_catalog
GROUP BY genre
ORDER BY genre;
genre          genre_count
-------------  -----------
Alternative             87
Country                 34
Electronic              45
Hip-Hop                 62
Pop                    128
R&B                     41
Rock                   203
                         9

Nine rows have no genre. That blank line at the bottom is NULL. NULL is not nothing. NULL is “I do not know,” which in a database is significantly worse than nothing.

Audit Step 3b: Investigate the NULLs

SELECT catalog_id,
       artist_name,
       album_title,
       genre,
       release_year
FROM music_catalog
WHERE genre IS NULL;

IS NULL is the only way to check for NULL. Using = NULL will not work because NULL is not equal to anything, including itself. This is one of those things that makes perfect logical sense and no intuitive sense whatsoever.
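You can see the behavior directly (a minimal demonstration):

SELECT NULL = NULL AS equals_check,    -- returns NULL, not true
       NULL IS NULL AS is_null_check;  -- returns true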

Audit Step 4: Inconsistent Names

SELECT artist_name,
       count(*) AS name_count
FROM music_catalog
GROUP BY artist_name
ORDER BY artist_name ASC;
artist_name                  name_count
---------------------------  ----------
Led Zeppelin                          3
Led Zepplin                           1
Outkast                               2
OutKast                               1
the rolling stones                    1
The Rolling Stones                    4
Whitney Houston                       3
Whitney houston                       1

Eight entries for what should be four artists. Typos, inconsistent casing, and missing “The” prefixes. In a normalized database, each artist would be one row. In this staging table, each variation is a separate identity crisis.

Audit Step 5: Malformed Decade Values

SELECT decade,
       count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;
decade      decade_count
----------  ------------
1970           42
1970s          58
1980s          89
1990s         104
2000s          97
2010s          78
20s            12
2020s          29

Three problems: “1970” should be “1970s”, “20s” should be “2020s”, and the two 1970s variants need merging. Whoever entered this data was consistent about 60% of the time, which is the worst kind of consistent.

Audit Step 5b: Check Release Years Against Decades

SELECT decade,
       min(release_year) AS earliest,
       max(release_year) AS latest
FROM music_catalog
GROUP BY decade
ORDER BY decade;
decade      earliest    latest
----------  --------    ------
1970              1970    1979
1970s             1970    1979
1980s             1980    1989
1990s             1990    1999
2000s             2000    2009
2010s             2010    2019
20s               2020    2024
2020s             2020    2025

At least the years match the decades. Small mercies.

Data Quality Audit Summary

Our audit revealed three categories of problems:

  • Duplicate rows (the repeated albums)
  • Missing values (the NULL genres)
  • Inconsistent values (artist spellings, decade formats)

Now we fix them. But first, a word about safety.

Backup Before You Modify

The Golden Rule of Data Modification

Always create a backup before running UPDATE or DELETE on production data.

This is not optional. This is not paranoia. This is professionalism. The difference between a junior and senior data engineer is not skill. It is the number of times they have been burned by a missing backup.

Creating a Backup Table

CREATE TABLE music_catalog_backup AS
SELECT * FROM music_catalog;

CREATE TABLE ... AS SELECT copies the column structure and the data, but not indexes, constraints, or defaults. Verify the backup:

SELECT
    (SELECT count(*) FROM music_catalog) AS original,
    (SELECT count(*) FROM music_catalog_backup) AS backup;
original    backup
--------    ------
     609       609

Same count. You can proceed with slightly less anxiety.

The Safety Column Pattern

Before modifying a column, copy it first:

ALTER TABLE music_catalog ADD COLUMN artist_name_copy varchar(200);

UPDATE music_catalog
SET artist_name_copy = artist_name;

Now artist_name_copy preserves the original values. If your UPDATE goes sideways, you can restore from the copy without touching the backup table. Belt and suspenders.

ALTER TABLE: Changing Structure

Modifying Table Structure

ALTER TABLE changes the shape of a table without destroying its data:

Operation Syntax
Add column ALTER TABLE t ADD COLUMN col type;
Drop column ALTER TABLE t DROP COLUMN col;
Change type ALTER TABLE t ALTER COLUMN col SET DATA TYPE type;
Set NOT NULL ALTER TABLE t ALTER COLUMN col SET NOT NULL;
Drop NOT NULL ALTER TABLE t ALTER COLUMN col DROP NOT NULL;
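The type change sometimes needs help. When PostgreSQL cannot convert the existing values automatically, you supply a USING expression. Two illustrative sketches against our staging table (you would not necessarily want either change):

-- Automatic: smallint values fit in integer, no conversion rule needed
ALTER TABLE music_catalog
    ALTER COLUMN release_year SET DATA TYPE integer;

-- Not automatic: PostgreSQL will not silently turn a number into text,
-- so the USING clause spells out the conversion
ALTER TABLE music_catalog
    ALTER COLUMN release_year SET DATA TYPE varchar(4)
    USING release_year::varchar;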

Adding Columns for Data Cleaning

A common pattern: add a new “clean” column alongside the original dirty one.

-- Add a standardized artist name column
ALTER TABLE music_catalog
    ADD COLUMN artist_standard varchar(200);

-- Copy original values as starting point
UPDATE music_catalog
SET artist_standard = artist_name;

Now you can clean artist_standard without losing the original artist_name values. When you are satisfied the cleaning is correct, you can drop the original. Or keep both. Data engineers who keep both sleep better.

UPDATE: Fixing Data

The UPDATE Statement

UPDATE changes existing values in a table:

-- Update ALL rows (dangerous)
UPDATE table
SET column = value;

-- Update specific rows (safer)
UPDATE table
SET column = value
WHERE criteria;

-- Update multiple columns at once
UPDATE table
SET column_a = value_a,
    column_b = value_b
WHERE criteria;
Warning

An UPDATE without a WHERE clause modifies every row in the table. PostgreSQL will not ask “are you sure?” It will simply do it. Immediately. With enthusiasm.
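One defensive habit (a convention, not a PostgreSQL feature): write the WHERE clause as a SELECT first, check exactly what it matches, then attach the identical clause to the UPDATE. Using the first genre fix from the next section as the example:

-- Step 1: preview the rows that will change
SELECT catalog_id, artist_name, album_title, genre
FROM music_catalog
WHERE catalog_id = 'CAT-4501';

-- Step 2: same WHERE clause, now doing the work
UPDATE music_catalog
SET genre = 'Rock'
WHERE catalog_id = 'CAT-4501';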

Fixing Missing Genres

We found nine rows with NULL genres. After researching the albums, we can fill them in. A few of the fixes:

UPDATE music_catalog
SET genre = 'Rock'
WHERE catalog_id = 'CAT-4501';

UPDATE music_catalog
SET genre = 'Pop'
WHERE catalog_id = 'CAT-7823';

UPDATE music_catalog
SET genre = 'Hip-Hop'
WHERE catalog_id IN ('CAT-9102', 'CAT-9103', 'CAT-9104');

Each UPDATE returns UPDATE 1 (or UPDATE 3 for the IN clause), confirming the exact number of rows modified. If it says UPDATE 0, your WHERE clause matched nothing. If it says UPDATE 609, you forgot the WHERE clause. Both are worth investigating.

Restoring from a Safety Column

If your UPDATE was wrong, restore from the copy:

-- Option 1: Restore from the safety column
UPDATE music_catalog
SET artist_name = artist_name_copy;

-- Option 2: Restore from the backup table
UPDATE music_catalog original
SET artist_name = backup.artist_name
FROM music_catalog_backup backup
WHERE original.catalog_id = backup.catalog_id;

Option 2 uses UPDATE ... FROM, which joins two tables during the update. This is a PostgreSQL extension to standard SQL and extremely useful for data migration work.

Standardizing Artist Names

Remember the inconsistent spellings? Fix them with pattern matching and exact matches:

-- Fix the typo
UPDATE music_catalog
SET artist_standard = 'Led Zeppelin'
WHERE artist_name LIKE 'Led Zep%';

-- Fix casing inconsistencies
UPDATE music_catalog
SET artist_standard = 'OutKast'
WHERE lower(artist_name) = 'outkast';

UPDATE music_catalog
SET artist_standard = 'The Rolling Stones'
WHERE lower(artist_name) = 'the rolling stones';

UPDATE music_catalog
SET artist_standard = 'Whitney Houston'
WHERE lower(artist_name) = 'whitney houston';

Verify:

SELECT artist_name, artist_standard
FROM music_catalog
WHERE artist_name != artist_standard
ORDER BY artist_standard;
artist_name            artist_standard
---------------------  --------------------
Led Zepplin            Led Zeppelin
Outkast                OutKast
the rolling stones     The Rolling Stones
Whitney houston        Whitney Houston

Four dirty values, four clean artist names. (The 'OutKast' rows that were already spelled correctly do not appear, because artist_name and artist_standard match.) The original artist_name column is untouched for auditing.

Fixing Inconsistent Decade Values

Our decade problems need only simple UPDATE statements with literal values. (PostgreSQL's || string-concatenation operator offers a pattern-based alternative; SET decade = decade || 's' would append the missing suffix, for instance.)

-- Fix "1970" -> "1970s"
UPDATE music_catalog
SET decade = '1970s'
WHERE decade = '1970';

-- Fix "20s" -> "2020s"
UPDATE music_catalog
SET decade = '2020s'
WHERE decade = '20s';

Verify all decades are now consistent:

SELECT decade, count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;
decade      decade_count
----------  ------------
1970s            100
1980s             89
1990s            104
2000s             97
2010s             78
2020s             41

Six clean decades. No duplicates, no abbreviations.
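Having scrubbed the column, you can stop it from getting dirty again. A CHECK constraint (from the previous lecture) makes the six clean values the only legal ones. A sketch, assuming no other decades will ever appear and using a constraint name of my choosing:

ALTER TABLE music_catalog
    ADD CONSTRAINT decade_check
    CHECK (decade IN ('1970s', '1980s', '1990s', '2000s', '2010s', '2020s'));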

UPDATE with Subqueries

You can use a subquery to determine which rows to update. Suppose we have a genre_tags table that maps genres to broader categories:

ALTER TABLE music_catalog ADD COLUMN category varchar(50);

UPDATE music_catalog m
SET category = 'Contemporary'
WHERE EXISTS (SELECT 1
              FROM genre_tags g
              WHERE m.genre = g.genre
                    AND g.era = 'modern');

This sets category only for albums whose genre appears in the genre_tags table with a matching era. The WHERE EXISTS subquery acts as a filter across tables.

UPDATE with FROM

A more readable alternative for cross-table updates:

UPDATE music_catalog m
SET category = g.category
FROM genre_tags g
WHERE m.genre = g.genre;

This joins music_catalog to genre_tags during the update and pulls the category value directly. Same result, cleaner syntax. UPDATE ... FROM is a PostgreSQL extension to standard SQL and one of its nicest features.
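If you need a portable version, standard SQL expresses the same update with a correlated subquery (a sketch; it assumes genre is unique in genre_tags, as the FROM version implicitly does too):

UPDATE music_catalog
SET category = (SELECT g.category
                FROM genre_tags g
                WHERE g.genre = music_catalog.genre)
WHERE EXISTS (SELECT 1
              FROM genre_tags g
              WHERE g.genre = music_catalog.genre);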

UPDATE Patterns Summary

Pattern Use When
SET col = value WHERE ... Simple fixes to specific rows
SET col = col2 Copying between columns
SET col = 'prefix' || col String manipulation
UPDATE t1 SET ... FROM t2 WHERE ... Joining tables during update
UPDATE ... WHERE EXISTS (SELECT ...) Conditional update based on another table
WHERE lower(col) = 'value' Case-insensitive matching

DELETE: Removing Rows

The DELETE Statement

DELETE removes rows from a table:

-- Delete ALL rows (very dangerous)
DELETE FROM table_name;

-- Delete specific rows (less dangerous)
DELETE FROM table_name WHERE expression;
Warning

DELETE FROM table_name; without a WHERE clause deletes every row. Unlike dropping the table, the empty table structure remains, staring at you like a reminder of what you have done.

Deleting Rows by Condition

DELETE FROM music_catalog
WHERE release_year < 1970;
DELETE 12

This removes all 12 albums released before the 1970s. PostgreSQL confirms the count, which you should always check against your expectation. If you expected 12 and got 12, good. If you expected 12 and got 412, less good.

DELETE vs DROP vs TRUNCATE

Statement What It Does Reversible?
DELETE FROM t WHERE ... Removes matching rows Yes (in a transaction)
DELETE FROM t Removes all rows, keeps structure Yes (in a transaction)
TRUNCATE t Removes all rows, faster than DELETE Yes (in a transaction)
DROP TABLE t Removes table entirely Yes (in a transaction; permanent once committed)
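The "Yes (in a transaction)" entries are worth seeing once. A minimal demonstration (transactions are covered in detail in the next section):

START TRANSACTION;

DELETE FROM music_catalog;           -- every row, gone

SELECT count(*) FROM music_catalog;  -- returns 0 inside this transaction

ROLLBACK;                            -- undo the delete

SELECT count(*) FROM music_catalog;  -- the original count is back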

Cleaning Up: DROP

Dropping Columns

After migration, remove temporary columns:

ALTER TABLE music_catalog DROP COLUMN artist_name_copy;

Clean tables are happy tables. Leftover columns named _copy, _backup, _temp, and _old are the database equivalent of packing boxes you never unpacked after moving.

Dropping Tables

Remove backup tables when you are confident the migration succeeded:

DROP TABLE music_catalog_backup;

DROP TABLE is permanent once committed. There is no undo. Make sure you are done with the backup before dropping it. Then wait a day. Then drop it.

Transactions: Your Safety Net

What Is a Transaction?

A transaction groups multiple SQL statements into a single atomic unit. Either all of them succeed, or none of them do.

Transaction Syntax

START TRANSACTION;

-- One or more modification statements
UPDATE table
SET column = value
WHERE criteria;

-- Check your work
SELECT ...;

-- Then end the transaction one of two ways:
COMMIT;    -- keep the changes permanently
ROLLBACK;  -- undo everything since START TRANSACTION

Transaction Example: Catching a Typo

START TRANSACTION;

UPDATE music_catalog
SET artist_standard = 'Feetwood Mac'   -- Note the missing 'l'
WHERE artist_name LIKE 'Fleetwood%';

Check the result:

artist_name       artist_standard
----------------  ----------------
Fleetwood Mac     Feetwood Mac     -- Typo!
Fleetwood Mac     Feetwood Mac     -- Typo!
Fleetwood Mac     Feetwood Mac     -- Typo!

The typo is visible. Roll it back:

ROLLBACK;

Query again:

artist_name       artist_standard
----------------  ----------------
Fleetwood Mac     Fleetwood Mac    -- Original restored
Fleetwood Mac     Fleetwood Mac
Fleetwood Mac     Fleetwood Mac

The ROLLBACK undid the UPDATE as if it never happened. Fleetwood Mac’s name is safe. This is why transactions exist.

When to Use Transactions

Always use transactions when:

  • Running multiple related UPDATE or DELETE statements
  • Making changes you want to verify before committing
  • Working on production data
  • Performing data migrations

The workflow:

  1. START TRANSACTION;
  2. Run your statements
  3. SELECT to verify the results
  4. COMMIT; if correct, ROLLBACK; if not
Tip

In PostgreSQL, BEGIN is an alias for START TRANSACTION. You will see both in the wild. They do the same thing.

Transaction Gotchas

A few things to be aware of:

  • DDL statements (CREATE TABLE, DROP TABLE) in PostgreSQL are transactional; the sketch after this list demonstrates it. This is unusual. Most other databases auto-commit DDL.
  • If your session disconnects during an open transaction, PostgreSQL rolls it back automatically. Your data is safe, if slightly inconvenienced.
  • Long-running open transactions can block other users. Do your work, then commit or rollback. Do not leave a transaction open while you go to lunch.
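A minimal demonstration of the first point, using the backup table from earlier (assuming it still exists):

START TRANSACTION;

DROP TABLE music_catalog_backup;            -- gone, inside this transaction

ROLLBACK;                                   -- ...and back again

SELECT count(*) FROM music_catalog_backup;  -- still all 609 rows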

Data Migration: Putting It All Together

The Migration Workflow

With all these tools in hand, here is the complete workflow for migrating data from a staging table to normalized tables:

  1. Audit the staging data
  2. Back up before changing anything
  3. Clean the data with UPDATE and DELETE
  4. Migrate with INSERT INTO ... SELECT, parent tables before child tables
  5. Verify row counts and spot-check with joins
  6. Drop temporary columns and backup tables once you are certain

INSERT INTO … SELECT

The key statement for migration copies data from one table to another:

-- Populate a normalized lookup table from a staging table
-- (this example uses the meat_poultry_egg_inspect dataset from the
--  Practical SQL textbook; see the references)
INSERT INTO companies (company_name, city, state, zip)
SELECT DISTINCT company_standard, city, st, zip
FROM meat_poultry_egg_inspect
WHERE company_standard IS NOT NULL;

DISTINCT ensures you do not insert duplicate rows. This is how you populate a lookup table from a denormalized staging table.

Migration with Foreign Keys

When migrating to tables with foreign key relationships, order matters:

START TRANSACTION;

-- 1. Populate parent tables first
INSERT INTO departments (department_name)
SELECT DISTINCT department
FROM staging_employees;

-- 2. Then populate child tables with FK lookups
INSERT INTO employees (name, email, department_id)
SELECT s.name, s.email, d.department_id
FROM staging_employees s
JOIN departments d ON s.department = d.department_name;

-- 3. Verify
SELECT count(*) FROM departments;
SELECT count(*) FROM employees;

COMMIT;

Parent tables first, child tables second. Foreign key constraints will reject inserts in the wrong order.

Backup and Swap Pattern

A safer migration pattern creates the new table, verifies it, then swaps:

-- Build the new version of the table (here it just adds a metadata column)
CREATE TABLE music_catalog_backup AS
SELECT *,
       '2026-02-09'::date AS reviewed_date
FROM music_catalog;

-- Swap tables using RENAME
ALTER TABLE music_catalog
    RENAME TO music_catalog_temp;
ALTER TABLE music_catalog_backup
    RENAME TO music_catalog;
ALTER TABLE music_catalog_temp
    RENAME TO music_catalog_backup;

Three renames and the new table is live. The old one is still available as _backup if anything goes wrong.
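Because PostgreSQL runs DDL inside transactions (see the gotchas above), the three renames can be made atomic, so no concurrent query ever catches the table mid-swap:

START TRANSACTION;

ALTER TABLE music_catalog RENAME TO music_catalog_temp;
ALTER TABLE music_catalog_backup RENAME TO music_catalog;
ALTER TABLE music_catalog_temp RENAME TO music_catalog_backup;

COMMIT;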

Activity: Clean and Migrate Vet Clinic Data

The Scenario

Continuing from the previous lecture’s vet clinic activity, you now have:

A staging table with raw imported data:

CREATE TABLE vet_staging (
    owner_name varchar(100),
    owner_email varchar(200),
    owner_phone varchar(20),
    pet_name varchar(100),
    species varchar(50),
    breed varchar(100),
    visit_date date,
    reason text,
    cost numeric(8,2)
);

And your normalized tables (owners, pets, visits) from last time.

Your job: audit the staging data, fix problems, and migrate it into the normalized schema.

The Staging Data

INSERT INTO vet_staging VALUES
('Maria Lopez', 'maria@email.com', '503-555-0101', 'Luna', 'Dog', 'Labrador', '2026-01-15', 'Checkup', 75.00),
('Maria Lopez', 'maria@email.com', '503-555-0101', 'Luna', 'Dog', 'Labrador', '2026-02-01', 'Vaccination', 120.00),
('Maria Lopez', 'MARIA@EMAIL.COM', '503-555-0101', 'Whiskers', 'Cat', 'Siamese', '2026-01-20', 'Dental', 250.00),
('James Park', 'james@email.com', '541-555-0202', 'Buddy', 'dog', 'Golden Retriever', '2026-01-18', 'Surgery', 800.00),
('James Park', 'james@email.com', NULL, 'Buddy', 'dog', 'Golden Retriever', '2026-02-05', 'Follow-up', 50.00),
('Sarah Chen', 'sarah@email.com', '971-555-0303', 'Max', 'Dog', 'Poodle', '2026-01-22', 'Checkup', 75.00),
('Sarah Chen', 'sarah@email.com', '971-555-0303', 'Bella', 'cat', 'Persian', '2026-01-25', 'Vaccination', 95.00),
('James Park', 'james@email.com', '541-555-0202', 'Rocky', 'Reptile', 'Bearded Dragon', '2026-02-03', 'Checkup', 65.00);

Part 1: Audit the Data (5 min)

Write queries to find the problems. Hint: there are at least three categories of issues.

Tip

Check for:

  • Inconsistent casing in emails
  • NULL values
  • Duplicate owners with different data
  • Case variations in species

Part 2: Fix the Issues (10 min)

Using UPDATE, fix the data quality problems you found. Remember:

  • Back up first (or use a transaction)
  • Fix one category at a time
  • Verify after each fix

Part 3: Migrate the Data (10 min)

Write the INSERT INTO … SELECT statements to populate your normalized tables.

Remember:

  • Parents first (owners), then children (pets), then grandchildren (visits)
  • Use DISTINCT where appropriate
  • You will need to JOIN back to get the foreign key IDs

One Possible Solution

Audit queries:
-- Find case inconsistencies in email
SELECT owner_email, count(*)
FROM vet_staging
GROUP BY owner_email
ORDER BY owner_email;
-- maria@email.com vs MARIA@EMAIL.COM

-- Find case inconsistencies in species
SELECT species, count(*)
FROM vet_staging
GROUP BY species;
-- 'Dog' vs 'dog', 'Cat' vs 'cat'

-- Find NULL phone numbers
SELECT * FROM vet_staging
WHERE owner_phone IS NULL;
Fix queries:

START TRANSACTION;

-- Standardize email to lowercase
UPDATE vet_staging
SET owner_email = lower(owner_email);

-- Standardize species to title case
UPDATE vet_staging
SET species = initcap(species);

-- Fix NULL phone (look up from other rows for same owner)
UPDATE vet_staging s1
SET owner_phone = s2.owner_phone
FROM vet_staging s2
WHERE s1.owner_name = s2.owner_name
  AND s1.owner_phone IS NULL
  AND s2.owner_phone IS NOT NULL;

-- Verify
SELECT DISTINCT owner_name, owner_email, owner_phone
FROM vet_staging;

SELECT DISTINCT species FROM vet_staging;

COMMIT;
Migration:

START TRANSACTION;

-- 1. Populate owners
INSERT INTO owners (owner_name, email, phone)
SELECT DISTINCT owner_name, owner_email, owner_phone
FROM vet_staging;

-- 2. Populate pets (need owner_id from owners table)
INSERT INTO pets (pet_name, species, breed, owner_id)
SELECT DISTINCT s.pet_name, s.species, s.breed, o.owner_id
FROM vet_staging s
JOIN owners o ON s.owner_email = o.email;

-- 3. Populate visits (need pet_id from pets table)
INSERT INTO visits (visit_date, reason, cost, pet_id)
SELECT s.visit_date, s.reason, s.cost, p.pet_id
FROM vet_staging s
JOIN owners o ON s.owner_email = o.email
JOIN pets p ON s.pet_name = p.pet_name AND p.owner_id = o.owner_id;

-- 4. Verify counts
SELECT 'owners' AS tbl, count(*) FROM owners
UNION ALL
SELECT 'pets', count(*) FROM pets
UNION ALL
SELECT 'visits', count(*) FROM visits;

COMMIT;
Verify:

-- Check the full picture with joins
SELECT o.owner_name, p.pet_name, p.species,
       v.visit_date, v.reason, v.cost
FROM visits v
JOIN pets p ON v.pet_id = p.pet_id
JOIN owners o ON p.owner_id = o.owner_id
ORDER BY o.owner_name, p.pet_name, v.visit_date;

Expected: 8 rows matching the original staging data, now properly normalized across three tables.

Part 4: Discussion Questions

  1. What would happen if you tried to insert pets before owners? What error would you see?
  2. The JOIN in the visits migration uses both pet_name and owner_id. Why is pet_name alone not sufficient?
  3. If Maria later changes her email, how many rows need updating in the normalized schema vs the original flat table?
  4. What would you add to make this migration idempotent (safe to run multiple times)?

Key Takeaways

The DML Toolkit

Tool When to Use
UPDATE ... SET ... WHERE Fix specific data quality issues
UPDATE ... FROM Update using data from another table
DELETE FROM ... WHERE Remove rows that do not belong
ALTER TABLE ADD/DROP COLUMN Reshape tables for cleaning
INSERT INTO ... SELECT Migrate data between tables
START TRANSACTION / COMMIT / ROLLBACK Wrap everything in a safety net

The Data Engineer’s Checklist

Before any data modification:

  1. Audit the data thoroughly
  2. Create a backup (table or safety columns)
  3. Use transactions for multi-step changes
  4. Verify after every modification
  5. Keep the originals until you are certain

The goal is not speed. The goal is correctness. A fast migration that loses data is not a migration. It is a disaster with good performance metrics.

What Is Next

You now have the complete toolkit: importing with COPY, schema design and normalization, constraints, and the DML statements to audit, clean, and migrate data.

From raw CSV to a clean, normalized, constraint-enforced production database. The entire pipeline.

References

Sources

  1. DeBarros, A. (2022). Practical SQL: A Beginner’s Guide to Storytelling with Data (2nd ed.). No Starch Press. Chapter 9: Inspecting and Modifying Data.

  2. PostgreSQL Documentation. “UPDATE.” https://www.postgresql.org/docs/current/sql-update.html

  3. PostgreSQL Documentation. “DELETE.” https://www.postgresql.org/docs/current/sql-delete.html

  4. PostgreSQL Documentation. “Transactions.” https://www.postgresql.org/docs/current/tutorial-transactions.html

  5. PostgreSQL Documentation. “ALTER TABLE.” https://www.postgresql.org/docs/current/sql-altertable.html