
Lecture 05-2: Inspecting and Modifying Data
DATA 503: Fundamentals of Data Engineering
This lecture covers the practical skills of inspecting, cleaning, and modifying data in PostgreSQL. After building normalized tables with constraints, we now learn to migrate data from a staging table into the final schema. Topics include auditing data quality, ALTER TABLE for schema changes, UPDATE for fixing data, DELETE for removing rows, backup strategies, and transactions for safe modifications.
The Final Step
You can import data. You can design a schema. You can build tables with constraints. Now comes the part where you actually move the data from point A to point B without breaking anything. No pressure.
Where We Are in the Pipeline
The Data Migration Challenge
You have a staging table full of raw CSV data and a set of normalized tables waiting to receive it. The problem is that raw data is rarely clean enough to insert directly.
The workflow:
- Audit the staging data for quality problems
- Back up, then fix the problems with UPDATE and DELETE
- Migrate into the normalized tables with INSERT INTO ... SELECT
- Verify the results, then clean up the staging artifacts
What You Will Learn Today
The DML (Data Manipulation Language) toolkit:
| Statement | Purpose |
|---|---|
| UPDATE | Change existing values |
| DELETE | Remove rows |
| ALTER TABLE | Modify table structure |
| INSERT INTO ... SELECT | Copy data between tables |
| START TRANSACTION | Begin a safe modification block |
| COMMIT | Save changes permanently |
| ROLLBACK | Undo everything since the last transaction start |
These are the verbs of data engineering. SELECT asks questions. These statements change answers.
Auditing Data Quality
The Inspection Mindset
Before modifying anything, inspect the data. Every experienced data engineer has a horror story about an UPDATE that ran without a WHERE clause. Do not become a horror story.
Questions to ask before any migration:
- How many rows do I have?
- Are there NULLs where there should not be?
- Are values consistent? (Same entity, different spellings?)
- Are there duplicates?
- Do the data types match the target schema?
Example: Music Catalog Dataset
Imagine you work for a streaming service that just acquired a catalog of albums from a defunct record distributor. The data arrived as a CSV and was loaded into a staging table:
CREATE TABLE music_catalog (
catalog_id varchar(20) CONSTRAINT catalog_key PRIMARY KEY,
artist_name varchar(200),
album_title varchar(200),
release_year smallint,
genre varchar(50),
label varchar(100),
duration_min numeric(5,1),
decade varchar(10)
);
COPY music_catalog
FROM '/path/to/album_catalog.csv'
WITH (FORMAT CSV, HEADER, DELIMITER ',');
CREATE INDEX artist_idx ON music_catalog (artist_name);
The catalog spans six decades of music, from Fleetwood Mac to Beyoncé. It is also, as you will discover, a mess.
Audit Step 1: How Many Rows?
Always start with the basics:
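-- total number of rows in the staging table
SELECT count(*) FROM music_catalog;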
This tells you the scale of what you are working with. A table with 50 rows and a table with 5 million rows require different strategies. One you can eyeball. The other, you cannot.
Audit Step 2: Find Duplicate Albums
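Group by artist and album, then keep only the combinations that appear more than once:
SELECT artist_name, album_title, count(*) AS album_count
FROM music_catalog
GROUP BY artist_name, album_title
HAVING count(*) > 1
ORDER BY artist_name;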
artist_name album_title album_count
----------------- ----------------------- -----------
Radiohead OK Computer 2
The Beatles Abbey Road 2
GROUP BY with HAVING count(*) > 1 is your duplicate detector. Same artist, same album, appearing twice. Could be a reissue. Could be a data entry mistake. Usually the latter.
Audit Step 3: Find NULL Values
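Group by genre and count the rows in each group; a missing genre shows up as a blank group at the end:
SELECT genre, count(*) AS genre_count
FROM music_catalog
GROUP BY genre
ORDER BY genre;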
genre genre_count
------------- -----------
Alternative 87
Country 34
Electronic 45
Hip-Hop 62
Pop 128
R&B 41
Rock 203
9
Nine rows have no genre. That blank line at the bottom is NULL. NULL is not nothing. NULL is “I do not know,” which in a database is significantly worse than nothing.
Audit Step 3b: Investigate the NULLs
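Pull the rows where genre is missing to see what you are dealing with:
SELECT catalog_id, artist_name, album_title, release_year
FROM music_catalog
WHERE genre IS NULL;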
IS NULL is the only way to check for NULL. Using = NULL will not work because NULL is not equal to anything, including itself. This is one of those things that makes perfect logical sense and no intuitive sense whatsoever.
Audit Step 4: Inconsistent Names
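Grouping by artist_name exposes spelling and casing variants. A sketch of the check, narrowed here to a few suspect names:
SELECT artist_name, count(*) AS name_count
FROM music_catalog
WHERE lower(artist_name) IN ('led zeppelin', 'led zepplin', 'outkast',
                             'the rolling stones', 'whitney houston')
GROUP BY artist_name
ORDER BY artist_name;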
artist_name name_count
--------------------------- ----------
Led Zeppelin 3
Led Zepplin 1
Outkast 2
OutKast 1
the rolling stones 1
The Rolling Stones 4
Whitney Houston 3
Whitney houston 1
Eight entries for what should be four artists. Typos, inconsistent casing, and missing “The” prefixes. In a normalized database, each artist would be one row. In this staging table, each variation is a separate identity crisis.
Audit Step 5: Malformed Decade Values
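Count the rows for each distinct decade value:
SELECT decade, count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;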
decade decade_count
---------- ------------
1970 42
1970s 58
1980s 89
1990s 104
2000s 97
2010s 78
20s 12
2020s 29
Three problems: “1970” should be “1970s”, “20s” should be “2020s”, and the two 1970s variants need merging. Whoever entered this data was consistent about 60% of the time, which is the worst kind of consistent.
Audit Step 5b: Check Release Years Against Decades
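For each decade label, find the earliest and latest release year it contains:
SELECT decade,
       min(release_year) AS earliest,
       max(release_year) AS latest
FROM music_catalog
GROUP BY decade
ORDER BY decade;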
decade earliest latest
---------- -------- ------
1970 1970 1979
1970s 1970 1979
1980s 1980 1989
1990s 1990 1999
2000s 2000 2009
2010s 2010 2019
20s 2020 2024
2020s 2020 2025
At least the years match the decades. Small mercies.
Data Quality Audit Summary
Our audit revealed three categories of problems:
- Duplicates: the same artist and album appearing more than once
- Missing values: nine albums with a NULL genre
- Inconsistent values: artist-name typos and casing, plus malformed decade labels
Now we fix them. But first, a word about safety.
Backup Before You Modify
The Golden Rule of Data Modification
Always create a backup before running UPDATE or DELETE on production data.
This is not optional. This is not paranoia. This is professionalism. The difference between a junior and senior data engineer is not skill. It is the number of times they have been burned by a missing backup.
Creating a Backup Table
CREATE TABLE ... AS SELECT copies both structure and data:
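CREATE TABLE music_catalog_backup AS
SELECT * FROM music_catalog;
Verify the backup by comparing row counts:
-- scalar subqueries put both counts side by side
SELECT
    (SELECT count(*) FROM music_catalog) AS original,
    (SELECT count(*) FROM music_catalog_backup) AS backup;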
original backup
-------- ------
609 609
Same count. You can proceed with slightly less anxiety.
The Safety Column Pattern
Before modifying a column, copy it first:
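-- add a spare column, then copy the original values into it
ALTER TABLE music_catalog ADD COLUMN artist_name_copy varchar(200);

UPDATE music_catalog
SET artist_name_copy = artist_name;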
Now artist_name_copy preserves the original values. If your UPDATE goes sideways, you can restore from the copy without touching the backup table. Belt and suspenders.
ALTER TABLE: Changing Structure
Modifying Table Structure
ALTER TABLE changes the shape of a table without destroying its data:
| Operation | Syntax |
|---|---|
| Add column | ALTER TABLE t ADD COLUMN col type; |
| Drop column | ALTER TABLE t DROP COLUMN col; |
| Change type | ALTER TABLE t ALTER COLUMN col SET DATA TYPE type; |
| Set NOT NULL | ALTER TABLE t ALTER COLUMN col SET NOT NULL; |
| Drop NOT NULL | ALTER TABLE t ALTER COLUMN col DROP NOT NULL; |
Adding Columns for Data Cleaning
A common pattern: add a new “clean” column alongside the original dirty one.
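For the artist names:
ALTER TABLE music_catalog ADD COLUMN artist_standard varchar(200);

-- start the clean column as an exact copy, then fix it in place
UPDATE music_catalog
SET artist_standard = artist_name;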
Now you can clean artist_standard without losing the original artist_name values. When you are satisfied the cleaning is correct, you can drop the original. Or keep both. Data engineers who keep both sleep better.
UPDATE: Fixing Data
The UPDATE Statement
UPDATE changes existing values in a table:
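UPDATE table_name
SET column_a = value_a,
    column_b = value_b
WHERE criteria;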
An UPDATE without a WHERE clause modifies every row in the table. PostgreSQL will not ask “are you sure?” It will simply do it. Immediately. With enthusiasm.
Fixing Missing Genres
We found nine rows with NULL genres. After researching the albums, we can fill them in. Something like this, with your own catalog IDs:
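UPDATE music_catalog
SET genre = 'Jazz'
WHERE catalog_id = 'BN-4003';   -- placeholder ID

UPDATE music_catalog
SET genre = 'Soul'
WHERE catalog_id IN ('MT-1101', 'MT-1102', 'MT-1109');   -- placeholder IDs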
Each UPDATE returns UPDATE 1 (or UPDATE 3 for the IN clause), confirming the exact number of rows modified. If it says UPDATE 0, your WHERE clause matched nothing. If it says UPDATE 609, you forgot the WHERE clause. Both are worth investigating.
Restoring from a Safety Column
If your UPDATE was wrong, restore from the copy:
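-- Option 1: restore from the safety column in the same table
UPDATE music_catalog
SET artist_name = artist_name_copy;

-- Option 2: restore from the backup table, matched on the primary key
UPDATE music_catalog
SET artist_name = music_catalog_backup.artist_name
FROM music_catalog_backup
WHERE music_catalog.catalog_id = music_catalog_backup.catalog_id;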
Option 2 uses UPDATE ... FROM, which joins two tables during the update. This is a PostgreSQL extension to standard SQL and extremely useful for data migration work.
Standardizing Artist Names
Remember the inconsistent spellings? Fix them with pattern matching and exact matches:
-- Fix the typo
UPDATE music_catalog
SET artist_standard = 'Led Zeppelin'
WHERE artist_name LIKE 'Led Zep%';
-- Fix casing inconsistencies
UPDATE music_catalog
SET artist_standard = 'OutKast'
WHERE lower(artist_name) = 'outkast';
UPDATE music_catalog
SET artist_standard = 'The Rolling Stones'
WHERE lower(artist_name) = 'the rolling stones';
UPDATE music_catalog
SET artist_standard = 'Whitney Houston'
WHERE lower(artist_name) = 'whitney houston';
Verify:
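-- spot-check the rows that were affected
SELECT DISTINCT artist_name, artist_standard
FROM music_catalog
WHERE artist_name IN ('Led Zepplin', 'the rolling stones',
                      'OutKast', 'Outkast', 'Whitney houston');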
artist_name artist_standard
--------------------- --------------------
Led Zepplin Led Zeppelin
the rolling stones The Rolling Stones
OutKast OutKast
Outkast OutKast
Whitney houston Whitney Houston
Five dirty values, four clean artist names. The original artist_name column is untouched for auditing.
Fixing Inconsistent Decade Values
The || operator concatenates strings in PostgreSQL. Combined with CASE or simple WHERE clauses, it handles our decade problems:
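-- '1970' is missing its trailing s
UPDATE music_catalog
SET decade = decade || 's'
WHERE decade = '1970';

-- '20s' is missing its century
UPDATE music_catalog
SET decade = '20' || decade
WHERE decade = '20s';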
Verify all decades are now consistent:
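SELECT decade, count(*) AS decade_count
FROM music_catalog
GROUP BY decade
ORDER BY decade;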
decade decade_count
---------- ------------
1970s 100
1980s 89
1990s 104
2000s 97
2010s 78
2020s 41
Six clean decades. No duplicates, no abbreviations.
UPDATE with Subqueries
You can use a subquery to determine which rows to update. Suppose we have a genre_tags table that maps genres to broader categories:
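-- assumed shape: genre_tags(genre, era, category); music_catalog has a category column to fill
UPDATE music_catalog
SET category = (
    SELECT gt.category
    FROM genre_tags gt
    WHERE gt.genre = music_catalog.genre
      AND gt.era = music_catalog.decade
)
WHERE EXISTS (
    SELECT 1
    FROM genre_tags gt
    WHERE gt.genre = music_catalog.genre
      AND gt.era = music_catalog.decade
);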
This sets category only for albums whose genre appears in the genre_tags table with a matching era. The WHERE EXISTS subquery acts as a filter across tables.
UPDATE with FROM
A more readable alternative for cross-table updates:
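UPDATE music_catalog mc
SET category = gt.category
FROM genre_tags gt
WHERE mc.genre = gt.genre
  AND mc.decade = gt.era;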
This joins music_catalog to genre_tags during the update and pulls the category value directly. Same result, cleaner syntax. This is an extension to standard SQL and one of PostgreSQL's nicest features.
UPDATE Patterns Summary
| Pattern | Use When |
|---|---|
| SET col = value WHERE ... | Simple fixes to specific rows |
| SET col = col2 | Copying between columns |
| SET col = 'prefix' \|\| col | String manipulation |
| UPDATE t1 SET ... FROM t2 WHERE ... | Joining tables during update |
| UPDATE ... WHERE EXISTS (SELECT ...) | Conditional update based on another table |
| WHERE lower(col) = 'value' | Case-insensitive matching |
DELETE: Removing Rows
The DELETE Statement
DELETE removes rows from a table:
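DELETE FROM table_name
WHERE criteria;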
DELETE FROM table_name; without a WHERE clause deletes every row. Unlike dropping the table, the empty table structure remains, staring at you like a reminder of what you have done.
Deleting Rows by Condition
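Suppose the catalog should only contain albums from 1970 onward:
DELETE FROM music_catalog
WHERE release_year < 1970;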
DELETE 12
This removes all 12 albums released before the 1970s. PostgreSQL confirms the count, which you should always check against your expectation. If you expected 12 and got 12, good. If you expected 12 and got 412, less good.
DELETE vs DROP vs TRUNCATE
| Statement | What It Does | Reversible? |
|---|---|---|
| DELETE FROM t WHERE ... | Removes matching rows | Yes (in a transaction) |
| DELETE FROM t | Removes all rows, keeps structure | Yes (in a transaction) |
| TRUNCATE t | Removes all rows, faster than DELETE | Yes (in a transaction), in PostgreSQL |
| DROP TABLE t | Removes table entirely | Yes (in a transaction), in PostgreSQL |
Cleaning Up: DROP
Dropping Columns
After migration, remove temporary columns:
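ALTER TABLE music_catalog DROP COLUMN artist_name_copy;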
Clean tables are happy tables. Leftover columns named _copy, _backup, _temp, and _old are the database equivalent of packing boxes you never unpacked after moving.
Dropping Tables
Remove backup tables when you are confident the migration succeeded:
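DROP TABLE music_catalog_backup;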
DROP TABLE is permanent once committed. There is no undo. Make sure you are done with the backup before dropping it. Then wait a day. Then drop it.
Transactions: Your Safety Net
What Is a Transaction?
A transaction groups multiple SQL statements into a single atomic unit. Either all of them succeed, or none of them do.

Transaction Syntax
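The pattern:
START TRANSACTION;

-- one or more UPDATE, DELETE, or INSERT statements

COMMIT;     -- keep the changes
-- or
ROLLBACK;   -- undo everything since START TRANSACTION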
Transaction Example: Catching a Typo
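Start a transaction, then run an update with a deliberate typo in the replacement value:
START TRANSACTION;

UPDATE music_catalog
SET artist_standard = 'Feetwood Mac'   -- missing the l
WHERE artist_name = 'Fleetwood Mac';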
Check the result:
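SELECT artist_name, artist_standard
FROM music_catalog
WHERE artist_name = 'Fleetwood Mac';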
artist_name artist_standard
---------------- ----------------
Fleetwood Mac Feetwood Mac -- Typo!
Fleetwood Mac Feetwood Mac -- Typo!
Fleetwood Mac Feetwood Mac -- Typo!
The typo is visible. Roll it back:
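ROLLBACK;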
Query again:
artist_name artist_standard
---------------- ----------------
Fleetwood Mac Fleetwood Mac -- Original restored
Fleetwood Mac Fleetwood Mac
Fleetwood Mac Fleetwood Mac
The ROLLBACK undid the UPDATE as if it never happened. Fleetwood Mac’s name is safe. This is why transactions exist.
When to Use Transactions
Always use transactions when:
- Running multiple related UPDATE or DELETE statements
- Making changes you want to verify before committing
- Working on production data
- Performing data migrations
The workflow:
- START TRANSACTION;
- Run your statements
- SELECT to verify the results
- COMMIT; if correct, ROLLBACK; if not
In PostgreSQL, BEGIN is an alias for START TRANSACTION. You will see both in the wild. They do the same thing.
Transaction Gotchas
A few things to be aware of:
- DDL statements (CREATE TABLE, DROP TABLE) in PostgreSQL are transactional. This is unusual. Most other databases auto-commit DDL.
- If your session disconnects during an open transaction, PostgreSQL rolls it back automatically. Your data is safe, if slightly inconvenienced.
- Long-running open transactions can block other users. Do your work, then commit or rollback. Do not leave a transaction open while you go to lunch.
Data Migration: Putting It All Together
The Migration Workflow
With all these tools in hand, here is the complete workflow for migrating data from a staging table to normalized tables:
- Audit the staging data
- Create a backup table or safety columns
- Clean the data with UPDATE and DELETE inside a transaction
- Migrate with INSERT INTO ... SELECT, parent tables before child tables
- Verify row counts, then COMMIT
- Drop temporary columns and backup tables once you are certain
INSERT INTO … SELECT
The key statement for migration copies data from one table to another. For instance, populating a hypothetical genres lookup table from the catalog:
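-- genres is a hypothetical lookup table with a single genre_name column
INSERT INTO genres (genre_name)
SELECT DISTINCT genre
FROM music_catalog
WHERE genre IS NOT NULL;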
DISTINCT ensures you do not insert duplicate rows. This is how you populate a lookup table from a denormalized staging table.
Migration with Foreign Keys
When migrating to tables with foreign key relationships, order matters:
START TRANSACTION;
-- 1. Populate parent tables first
INSERT INTO departments (department_name)
SELECT DISTINCT department
FROM staging_employees;
-- 2. Then populate child tables with FK lookups
INSERT INTO employees (name, email, department_id)
SELECT s.name, s.email, d.department_id
FROM staging_employees s
JOIN departments d ON s.department = d.department_name;
-- 3. Verify
SELECT count(*) FROM departments;
SELECT count(*) FROM employees;
COMMIT;
Parent tables first, child tables second. Foreign key constraints will reject inserts in the wrong order.
Backup and Swap Pattern
A safer migration pattern creates the new table, verifies it, then swaps:
-- Create backup with extra metadata
CREATE TABLE music_catalog_backup AS
SELECT *,
'2026-02-09'::date AS reviewed_date
FROM music_catalog;
-- Swap tables using RENAME
ALTER TABLE music_catalog
RENAME TO music_catalog_temp;
ALTER TABLE music_catalog_backup
RENAME TO music_catalog;
ALTER TABLE music_catalog_temp
RENAME TO music_catalog_backup;
Three renames and the new table is live. The old one is still available as _backup if anything goes wrong.
Activity: Clean and Migrate Vet Clinic Data
The Scenario
Continuing from the previous lecture’s vet clinic activity, you now have:
A staging table with raw imported data:
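-- column names match the activity's solution queries; the types are one reasonable choice
CREATE TABLE vet_staging (
    owner_name varchar(100),
    owner_email varchar(100),
    owner_phone varchar(20),
    pet_name varchar(50),
    species varchar(30),
    breed varchar(50),
    visit_date date,
    reason varchar(100),
    cost numeric(8,2)
);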
And your normalized tables (owners, pets, visits) from last time.
Your job: audit the staging data, fix problems, and migrate it into the normalized schema.
The Staging Data
INSERT INTO vet_staging VALUES
('Maria Lopez', 'maria@email.com', '503-555-0101', 'Luna', 'Dog', 'Labrador', '2026-01-15', 'Checkup', 75.00),
('Maria Lopez', 'maria@email.com', '503-555-0101', 'Luna', 'Dog', 'Labrador', '2026-02-01', 'Vaccination', 120.00),
('Maria Lopez', 'MARIA@EMAIL.COM', '503-555-0101', 'Whiskers', 'Cat', 'Siamese', '2026-01-20', 'Dental', 250.00),
('James Park', 'james@email.com', '541-555-0202', 'Buddy', 'dog', 'Golden Retriever', '2026-01-18', 'Surgery', 800.00),
('James Park', 'james@email.com', NULL, 'Buddy', 'dog', 'Golden Retriever', '2026-02-05', 'Follow-up', 50.00),
('Sarah Chen', 'sarah@email.com', '971-555-0303', 'Max', 'Dog', 'Poodle', '2026-01-22', 'Checkup', 75.00),
('Sarah Chen', 'sarah@email.com', '971-555-0303', 'Bella', 'cat', 'Persian', '2026-01-25', 'Vaccination', 95.00),
('James Park', 'james@email.com', '541-555-0202', 'Rocky', 'Reptile', 'Bearded Dragon', '2026-02-03', 'Checkup', 65.00);
Part 1: Audit the Data (5 min)
Write queries to find the problems. Hint: there are at least three categories of issues.
Check for:
- Inconsistent casing
- NULL values
- Duplicate owners with different data
- Case variations in species
Part 2: Fix the Issues (10 min)
Using UPDATE, fix the data quality problems you found. Remember:
- Back up first (or use a transaction)
- Fix one category at a time
- Verify after each fix
Part 3: Migrate the Data (10 min)
Write the INSERT INTO … SELECT statements to populate your normalized tables.
Remember:
- Parents first (owners), then children (pets), then grandchildren (visits)
- Use DISTINCT where appropriate
- You will need to JOIN back to get the foreign key IDs
One Possible Solution
-- Find case inconsistencies in email
SELECT owner_email, count(*)
FROM vet_staging
GROUP BY owner_email
ORDER BY owner_email;
-- maria@email.com vs MARIA@EMAIL.COM
-- Find case inconsistencies in species
SELECT species, count(*)
FROM vet_staging
GROUP BY species;
-- 'Dog' vs 'dog', 'Cat' vs 'cat'
-- Find NULL phone numbers
SELECT * FROM vet_staging
WHERE owner_phone IS NULL;
START TRANSACTION;
-- Standardize email to lowercase
UPDATE vet_staging
SET owner_email = lower(owner_email);
-- Standardize species to title case
UPDATE vet_staging
SET species = initcap(species);
-- Fix NULL phone (look up from other rows for same owner)
UPDATE vet_staging s1
SET owner_phone = s2.owner_phone
FROM vet_staging s2
WHERE s1.owner_name = s2.owner_name
AND s1.owner_phone IS NULL
AND s2.owner_phone IS NOT NULL;
-- Verify
SELECT DISTINCT owner_name, owner_email, owner_phone
FROM vet_staging;
SELECT DISTINCT species FROM vet_staging;
COMMIT;
START TRANSACTION;
-- 1. Populate owners
INSERT INTO owners (owner_name, email, phone)
SELECT DISTINCT owner_name, owner_email, owner_phone
FROM vet_staging;
-- 2. Populate pets (need owner_id from owners table)
INSERT INTO pets (pet_name, species, breed, owner_id)
SELECT DISTINCT s.pet_name, s.species, s.breed, o.owner_id
FROM vet_staging s
JOIN owners o ON s.owner_email = o.email;
-- 3. Populate visits (need pet_id from pets table)
INSERT INTO visits (visit_date, reason, cost, pet_id)
SELECT s.visit_date, s.reason, s.cost, p.pet_id
FROM vet_staging s
JOIN owners o ON s.owner_email = o.email
JOIN pets p ON s.pet_name = p.pet_name AND p.owner_id = o.owner_id;
-- 4. Verify counts
SELECT 'owners' AS tbl, count(*) FROM owners
UNION ALL
SELECT 'pets', count(*) FROM pets
UNION ALL
SELECT 'visits', count(*) FROM visits;
COMMIT;
Expected: 3 owners, 6 pets, and 8 visits. The 8 original staging rows are now properly normalized across three tables.
Part 4: Discussion Questions
- What would happen if you tried to insert pets before owners? What error would you see?
- The JOIN in the visits migration uses both pet_name and owner_id. Why is pet_name alone not sufficient?
- If Maria later changes her email, how many rows need updating in the normalized schema vs the original flat table?
- What would you add to make this migration idempotent (safe to run multiple times)?
Key Takeaways
The DML Toolkit
| Tool | When to Use |
|---|---|
| UPDATE ... SET ... WHERE | Fix specific data quality issues |
| UPDATE ... FROM | Update using data from another table |
| DELETE FROM ... WHERE | Remove rows that do not belong |
| ALTER TABLE ADD/DROP COLUMN | Reshape tables for cleaning |
| INSERT INTO ... SELECT | Migrate data between tables |
| START TRANSACTION / COMMIT / ROLLBACK | Wrap everything in a safety net |
The Data Engineer’s Checklist
Before any data modification:
- Audit the data thoroughly
- Create a backup (table or safety columns)
- Use transactions for multi-step changes
- Verify after every modification
- Keep the originals until you are certain
The goal is not speed. The goal is correctness. A fast migration that loses data is not a migration. It is a disaster with good performance metrics.
What Is Next
You now have the complete toolkit:

From raw CSV to a clean, normalized, constraint-enforced production database. The entire pipeline.
References
Sources
DeBarros, A. (2022). Practical SQL: A Beginner’s Guide to Storytelling with Data (2nd ed.). No Starch Press. Chapter 9: Inspecting and Modifying Data.
PostgreSQL Documentation. “UPDATE.” https://www.postgresql.org/docs/current/sql-update.html
PostgreSQL Documentation. “DELETE.” https://www.postgresql.org/docs/current/sql-delete.html
PostgreSQL Documentation. “Transactions.” https://www.postgresql.org/docs/current/tutorial-transactions.html
PostgreSQL Documentation. “ALTER TABLE.” https://www.postgresql.org/docs/current/sql-altertable.html