
DATA 351: Data Management with SQL
February 25, 2026
By the end of this lesson, you will be able to:
Last time we built tables with constraints. Now we clean the data that goes into them.

This is the step most people skip and then regret. Constraints will reject dirty data at the door, so we need to clean it first.
We are data janitors today. No cap, this is 80% of real data work:
Open Beekeeper Studio or pgAdmin and run:
Then connect to the lc_music database before running the next step.
music_staging_setup.sql from the course sitelc_music)The script creates:
music_catalog_staging – our messy raw data (40+ rows of chaos)artists, albums, tracks – our clean normalized target tables (empty, waiting)You should see 40 rows. If you see 0, you did not run the INSERT statements. If you see 80, you ran the script twice. Either way, drop and re-run.
Take a look. Notice anything suspicious? We are about to find out.
Before you fix anything, you need to understand what is broken. Think of it like a doctor’s exam before surgery. Every good data cleaning workflow starts with questions:
The most common data quality issue. Let us check for duplicate catalog IDs:
CAT-1001 appears twice – an exact duplicate row. But duplicates are not always that obvious.
Sometimes the catalog_id is different but the data is the same:
The Beatles’ “Come Together” appears with both CAT-1010 and CAT-1099. Same song, different ID. This is real-world data distribution drama.
True or False: If every row has a unique catalog_id, then the table has no duplicates.
Think about it.
False!
Duplicate data can exist even with unique IDs. A duplicate is about the content, not just the identifier. Always check for duplicates on the meaningful columns, not just the ID.
NULLs are the ghosts of your database – invisible until they break something:
SELECT
count(*) AS total_rows,
count(*) - count(genre) AS missing_genre,
count(*) - count(label) AS missing_label,
count(*) - count(track_title) AS missing_track_title,
count(*) - count(track_number) AS missing_track_number,
count(*) - count(track_duration_sec) AS missing_track_duration
FROM music_catalog_staging;count(column) skips NULLs while count(*) counts all rows. The difference reveals the gaps.
Daft Punk is missing genre and label info. Amy Winehouse has a NULL track. These need decisions: fill them in, or remove them?
This is low-key the most annoying data problem:
You will see entries like:
Beyonce vs BEYONCEthe rolling stones vs The Rolling Stones vs THE ROLLING STONESTaylor Swift vs Taylor swiftkendrick lamar vs Kendrick LamarSame artist, different vibes. The database treats them as completely different values.
It is not just artist names. Check genres:
Hip-Hop vs hip-hop, Alternative vs alternative, Pop vs pop. Same energy, different casing.
Labels too:
LaFace vs Laface Records, Capitol vs Capitol Records, TDE vs Top Dawg Entertainment. Absolute chaos.
Check for values that are clearly wrong:
Nirvana with release_year = 9119? Grunge was ahead of its time, but not that far ahead.
Pink Floyd with duration_min = -42.5. Albums cannot have negative runtime. Not even conceptually.
Which SQL function helps you find rows where zip codes or text fields have unexpected lengths?
count()
length()
trim()
upper()
B) length()
The length() function returns the number of characters in a string. Useful for spotting truncated data, padded values, or fields that are suspiciously short or long.
Here is what we found:
| Issue | Count |
|---|---|
| Duplicate catalog_id | 1 row |
| Sneaky content duplicates | 1 row |
| NULL genre | 2 rows |
| NULL label | 1 row |
| NULL track info | 1 row |
| Inconsistent casing | 10+ rows |
| Typos | 2+ rows |
| Bad release year | 1 row |
| Negative duration | 1 row |

That is a lot of mess for 40 rows. Imagine a dataset with 6 million. This is why inspection matters.
We are about to start modifying data. Before you UPDATE or DELETE anything, make a backup. This is non-negotiable. It is the database equivalent of saving your game before the boss fight.
This creates an exact copy. Verify it:
Both should show 40. If they match, you are safe to proceed.
Instead of modifying a column directly, create a copy first:
Now you can modify artist_name_clean while keeping the original artist_name intact. If you mess up, you still have the original.

Every modification should follow this pattern:

This is not paranoia. This is professionalism.
You need to fix typos in the artist_name column. What should you do FIRST?
Run UPDATE directly on artist_name
Create a backup copy of the table
Delete all the bad rows
Drop the column and recreate it
B) Create a backup copy of the table
Always create a backup before modifying data. If your UPDATE goes wrong (and it will, eventually), you want a way to restore the original. No backup = no undo = no sleep tonight.
UPDATE modifies existing rows. The basic syntax:
The WHERE clause is critical. Without it, every row gets updated. That is almost never what you want.
Let us standardize artist names. First, add and populate our working column:
Now fix the casing issues one artist at a time:
UPDATE music_catalog_staging
SET artist_name_clean = 'Beyonce'
WHERE lower(artist_name) = 'beyonce';
UPDATE music_catalog_staging
SET artist_name_clean = 'The Rolling Stones'
WHERE lower(artist_name) = 'the rolling stones';
UPDATE music_catalog_staging
SET artist_name_clean = 'Taylor Swift'
WHERE lower(artist_name) = 'taylor swift';
UPDATE music_catalog_staging
SET artist_name_clean = 'Kendrick Lamar'
WHERE lower(artist_name) = 'kendrick lamar';
UPDATE music_catalog_staging
SET artist_name_clean = 'Outkast'
WHERE lower(artist_name) = 'outkast';The Led Zepplin typo needs a targeted fix:
Verify it:
Every artist should now have a single clean name in artist_name_clean.
Same pattern – add a clean column, then standardize:
initcap() capitalizes the first letter of each word. But check the results:
Hip-Hop becomes Hip-Hop (correct). alternative becomes Alternative (correct). R&B becomes… R&b. Not ideal. Fix it:
Labels have variations like LaFace vs Laface Records and TDE vs Top Dawg Entertainment:
ALTER TABLE music_catalog_staging
ADD COLUMN label_clean varchar(100);
UPDATE music_catalog_staging
SET label_clean = label;
UPDATE music_catalog_staging
SET label_clean = 'LaFace Records'
WHERE label_clean LIKE 'LaFace%'
OR label_clean LIKE 'Laface%';
UPDATE music_catalog_staging
SET label_clean = 'Capitol Records'
WHERE label_clean LIKE 'Capitol%';
UPDATE music_catalog_staging
SET label_clean = 'Big Machine Records'
WHERE label_clean LIKE 'Big Machine%';
UPDATE music_catalog_staging
SET label_clean = 'Top Dawg Entertainment'
WHERE label_clean IN ('TDE', 'Top Dawg Entertainment');
UPDATE music_catalog_staging
SET label_clean = 'Rolling Stones Records'
WHERE label_clean LIKE 'Rolling Stones%';What does this query do?
Capitalizes every letter in the genre
Lowercases the entire genre
Capitalizes the first letter of each word
Removes all spaces from the genre
C) Capitalizes the first letter of each word
initcap('hip-hop') returns 'Hip-Hop'. It is PostgreSQL’s built-in title-case function. Super useful for standardizing casing, but watch out for edge cases like R&B.
The Nirvana release year 9119 should be 1991:
The Pink Floyd negative duration:
Always use the most specific WHERE clause possible. Using catalog_id targets exactly one row, which is what we want.
Daft Punk’s genre is Electronic. We know this, so we fill it in:
The missing label:
The textbook shows a useful pattern for fixing padded or truncated values using the concatenation operator ||:
The || operator glues strings together. 'abc' || '123' gives 'abc123'. Useful for building composite values or fixing formatting issues.
DELETE removes rows. Like UPDATE, it needs a WHERE clause unless you want to nuke everything:
Remember our duplicate CAT-1001? We need to remove one copy. PostgreSQL has a hidden column called ctid (the physical row location) that helps:
This keeps the first occurrence of each unique combination and deletes the rest.
The Beatles’ “Come Together” exists with two different catalog IDs. Keep one, remove the other:
The row with NULL track info is incomplete. Depending on your business rules, you might:
For our exercise, let us delete it since we need track info for our normalized tables:
We started with 40 rows. After removing 1 exact duplicate, 1 content duplicate, and 1 NULL track row, we should have 37 rows.
What happens if you run DELETE FROM music_catalog_staging; without a WHERE clause?
It deletes the table structure and all data
It deletes all rows but keeps the table structure
It raises an error because WHERE is required
It deletes only NULL rows
B) It deletes all rows but keeps the table structure
DELETE FROM without WHERE removes every row but the table still exists (empty). If you want to remove the table entirely, that is DROP TABLE. Big difference. One empties the house, the other demolishes it.
Every UPDATE and DELETE is permanent by default. Transactions let you preview changes before committing them.
Think of it like trying on clothes before buying. You can put them back if they do not fit.

After ROLLBACK, the data is exactly as it was before START TRANSACTION. Crisis averted.
If the changes look correct, use COMMIT:
After COMMIT, the changes are permanent. No going back (unless you have that backup table we made earlier).
| Use a Transaction | Skip Transaction |
|---|---|
| Bulk UPDATEs on many rows | Single-row changes with backup |
| Any DELETE operation | SELECT queries (read-only) |
| Multi-step changes that must succeed together | Adding columns (ALTER TABLE) |
| When you are not 100% sure about the WHERE clause | When you have a backup and are confident |
Pro tip: if you are asking yourself “should I use a transaction?” the answer is yes.
You ran START TRANSACTION, then an UPDATE that changed 500 rows incorrectly. What command undoes the damage?
UNDO;
REVERT;
ROLLBACK;
CTRL+Z;
C) ROLLBACK;
ROLLBACK reverses all changes made since START TRANSACTION. There is no UNDO or CTRL+Z in SQL. And CTRL+Z only works in your text editor, not on your data.
Once you have verified your clean columns are correct, you can either:
-- Drop the original messy columns
ALTER TABLE music_catalog_staging DROP COLUMN artist_name;
ALTER TABLE music_catalog_staging DROP COLUMN genre;
ALTER TABLE music_catalog_staging DROP COLUMN label;
-- Rename clean columns to the standard names
ALTER TABLE music_catalog_staging
RENAME COLUMN artist_name_clean TO artist_name;
ALTER TABLE music_catalog_staging
RENAME COLUMN genre_clean TO genre;
ALTER TABLE music_catalog_staging
RENAME COLUMN label_clean TO label;Once you are confident in your cleaned data and have migrated it to the normalized tables, you can remove the backup:
Only do this when you are truly done. There is no recovering a dropped table (unless you have database-level backups, which you should in production).
The textbook shows a clever pattern for replacing a table with an updated version:
-- Create an improved copy with an extra column
CREATE TABLE music_catalog_staging_v2 AS
SELECT *,
now()::date AS cleaned_date
FROM music_catalog_staging;
-- Swap the tables using rename
ALTER TABLE music_catalog_staging
RENAME TO music_catalog_staging_old;
ALTER TABLE music_catalog_staging_v2
RENAME TO music_catalog_staging;
ALTER TABLE music_catalog_staging_old
RENAME TO music_catalog_staging_backup;This adds a cleaned_date column and swaps the tables without downtime. The old version becomes the new backup.
What is the difference between DELETE FROM table_name; and DROP TABLE table_name;?
Think about what survives each command.
DELETE FROM table_name; removes all rows but the table structure (columns, constraints, indexes) still exists. You can INSERT new data immediately.DROP TABLE table_name; removes the table entirely – structure, data, constraints, indexes, everything. Gone. Reduced to atoms.One is emptying a filing cabinet. The other is throwing the filing cabinet in a dumpster.

Inspection:
| Task | SQL |
|---|---|
| Count rows | count(*) |
| Find dupes | GROUP BY ... HAVING count(*) > 1 |
| Find NULLs | WHERE col IS NULL |
| Check lengths | length(col) |
| Distinct values | SELECT DISTINCT col |
Modification:
| Task | SQL |
|---|---|
| Update data | UPDATE SET WHERE |
| Add column | ALTER TABLE ADD COLUMN |
| Drop column | ALTER TABLE DROP COLUMN |
| Rename column | ALTER TABLE RENAME COLUMN |
| Delete rows | DELETE FROM WHERE |
| Drop table | DROP TABLE |
Safety:
| Task | SQL |
|---|---|
| Backup table | CREATE TABLE backup AS SELECT * FROM original |
| Start transaction | START TRANSACTION; |
| Save changes | COMMIT; |
| Undo changes | ROLLBACK; |
| Title case | initcap(col) |
| Concatenate | col1 \|\| col2 |
| Lowercase | lower(col) |
You receive a CSV with 50,000 rows of customer data. Before loading it into your production tables, describe the first five things you would do to inspect and prepare the data. Use what we learned today.
Take 2 minutes. Write it down. Compare with a neighbor.
Then: backup, fix, verify, migrate. In that order. Always.
Now that we can inspect and clean staging data, next time we will:
INSERT INTO ... SELECT to migrate cleaned data into normalized tables
DeBarros, A. (2022). Practical SQL: A Beginner’s Guide to Storytelling with Data (2nd ed.). No Starch Press. Chapter 9: Inspecting and Modifying Data.
PostgreSQL Documentation. “UPDATE.” https://www.postgresql.org/docs/current/sql-update.html
PostgreSQL Documentation. “ALTER TABLE.” https://www.postgresql.org/docs/current/sql-altertable.html
PostgreSQL Documentation. “Transactions.” https://www.postgresql.org/docs/current/tutorial-transactions.html