
Lecture 10-3: Advanced Query Techniques
DATA 503: Fundamentals of Data Engineering
This lecture covers advanced SQL query techniques: subqueries in WHERE, FROM, and SELECT clauses, correlated subqueries, EXISTS, LATERAL joins, Common Table Expressions, cross tabulations, and CASE expressions. We apply these patterns to Census, survey, and temperature data. Based on Chapter 13 of Practical SQL, 2nd Edition.
Part 1: Setting Up
Database Setup
Create and Populate the Database
Create a fresh database for this lecture:
Connect to it, then run the setup script provided: [advanced_queries.sql]
The script creates and loads the following tables:
| Table | Rows | Description |
|---|---|---|
us_counties_pop_est_2019 |
3,142 | County population estimates |
cbp_naics_72_establishments |
2,764 | Food/accommodation businesses per county |
employees |
6 | Small HR demo table |
teachers |
6 | Small demo table (from Ch 7) |
ice_cream_survey |
200 | Office ice cream flavor preferences |
temperature_readings |
730 | Two years of daily temps at two stations |
All CSVs are in the Chapter_13/ folder alongside these slides. Download them before class.
Verify Your Setup
If any count is wrong, re-run the setup. We need all of these today.
Part 2: Subqueries
A query inside a query. It’s queries all the way down. 🐢
Subqueries in WHERE
Scalar Subquery in WHERE
The most common pattern: use a subquery to compute a threshold, then filter on it.
Run this. The inner query computes the 90th percentile of county populations. The outer query returns only counties at or above that threshold.
The subquery executes first, returns a single value, and the outer query uses it like a constant. This is a scalar subquery – it returns exactly one value.
Subquery with DELETE
Subqueries work with DML too. Let’s create a copy and prune it:
Run this. You started with 3,142 counties and now have ~315 (the top 10%).
In a pipeline context: this is how you build filtered summary tables. Copy the source, delete what you don’t need, keep the rest for downstream consumers.
🧠 Quick Quiz: Subquery Execution
When does the subquery in a WHERE clause execute?
A. Once before the outer query starts
B. Once per row of the outer query
C. After the outer query finishes
D. It depends
. . .
A (for scalar subqueries). The database computes the inner value once, then scans the outer table. Correlated subqueries (coming up) are different – they run once per row.
Try It: Above-Average Counties 🎯
Write a query that returns all counties where pop_est_2019 is above the national average. Show county name, state, and population. How many are there?
. . .
Far fewer than half! The mean is pulled up by massive counties (LA, Cook, Harris), so most counties are below average. Median vs. mean matters.
Subqueries in FROM (Derived Tables)
Using a Subquery as a Table
A subquery in the FROM clause creates a derived table – a temporary result set you can query like a regular table:
Run this. The inner query computes two aggregates. The outer query calculates the difference. The AS calcs alias is required – PostgreSQL demands that derived tables have names.
Joining Two Derived Tables
This is powerful for combining aggregations from different tables:
SELECT census.state_name AS st,
census.pop_est_2018,
est.establishment_count,
round((est.establishment_count / census.pop_est_2018::numeric) * 1000, 1)
AS estabs_per_thousand
FROM
(SELECT st, sum(establishments) AS establishment_count
FROM cbp_naics_72_establishments
GROUP BY st) AS est
JOIN
(SELECT state_name, sum(pop_est_2018) AS pop_est_2018
FROM us_counties_pop_est_2019
GROUP BY state_name) AS census
ON est.st = census.state_name
ORDER BY estabs_per_thousand DESC;Each derived table aggregates to the state level separately, then they join on state name. This is a common ETL pattern: aggregate separately, join at the grain you need.
🧠 Quick Quiz: Derived Table Rules
What happens if you forget the AS alias on a derived table?
A. PostgreSQL infers a name automatically
B. You get a syntax error
C. It works but you can't reference the columns
D. The query runs but returns wrong results
. . .
B. PostgreSQL requires every derived table to have an alias. AS calcs, AS est, AS census – pick a name. Without it, the parser rejects the query.
Subqueries in SELECT
Column-Level Subqueries
You can put a subquery directly in the SELECT list to add a computed column:
Every row gets the same median value alongside its own population. This lets you compare each county to the national benchmark without a JOIN.
Subquery in a Calculation
Take it further – compute the difference from the median inline:
SELECT county_name,
state_name AS st,
pop_est_2019,
pop_est_2019 - (SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY pop_est_2019)
FROM us_counties_pop_est_2019) AS diff_from_median
FROM us_counties_pop_est_2019
WHERE (pop_est_2019 - (SELECT percentile_cont(.5) WITHIN GROUP (ORDER BY pop_est_2019)
FROM us_counties_pop_est_2019))
BETWEEN -1000 AND 1000;This finds counties whose population is within 1,000 of the national median. Notice the subquery appears twice – once in SELECT, once in WHERE. That redundancy is why CTEs exist (we’ll get there).
Try It: Every County vs. Its State Average 🎯
Write a query that shows each county’s population alongside its state’s average population (using a correlated subquery in the SELECT list). Show county_name, state_name, pop_est_2019, and state_avg.
Hint: the subquery references the outer table’s state_name.
. . .
This is a correlated subquery – it runs once per row of the outer query because it references outer_t.state_name. More expensive than scalar subqueries, but very expressive.
Part 3: EXISTS and IN with Subqueries
Checking membership across tables – the SQL equivalent of “are you on the list?”
Setup: Employees and Retirees
Now we have two employees who retired. Let’s find them – and find who’s still active.
Using IN with a Subquery
IN checks if emp_id matches any value in the subquery result. Simple and readable.
NOT EXISTS: The Anti-Join
Run this. Returns employees who are NOT retirees. This is an anti-join – one of the most useful patterns in data engineering. “Give me everything in table A that has no match in table B.”
Pipeline use case: find new records that haven’t been processed yet. WHERE NOT EXISTS (SELECT 1 FROM processed WHERE processed.id = raw.id).
🧠 Quick Quiz: IN vs EXISTS
For a table with 10 million rows, which is typically faster?
A. IN (always)
B. EXISTS (always)
C. EXISTS (usually, because it short-circuits)
D. They're always identical
. . .
C. EXISTS stops at the first match. IN materializes the full subquery result into a list. For large datasets, that difference matters. But the optimizer is smart – sometimes it rewrites one to the other. Profile, don’t guess.
Part 4: LATERAL Joins
The most powerful subquery pattern you’ve never heard of.
LATERAL Subqueries
LATERAL in FROM: Inline Calculations
LATERAL lets a subquery in the FROM clause reference columns from preceding tables. Think of it as a “for each row, compute this”:
Run this. Each LATERAL subquery can reference columns from earlier in the FROM clause – including results from other LATERAL subqueries. pct_chg uses raw_chg from the first LATERAL. Without LATERAL, you’d need nested subqueries or repeat the calculation.
LATERAL with JOIN: Top-N Per Group
The killer feature: get the top N related rows for each row in the driving table.
ALTER TABLE teachers ADD CONSTRAINT id_key PRIMARY KEY (id);
CREATE TABLE teachers_lab_access (
access_id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
access_time timestamp with time zone,
lab_name text,
teacher_id bigint REFERENCES teachers (id)
);
INSERT INTO teachers_lab_access (access_time, lab_name, teacher_id)
VALUES ('2022-11-30 08:59:00-05', 'Science A', 2),
('2022-12-01 08:58:00-05', 'Chemistry B', 2),
('2022-12-21 09:01:00-05', 'Chemistry A', 2),
('2022-12-02 11:01:00-05', 'Science B', 6),
('2022-12-07 10:02:00-05', 'Science A', 6),
('2022-12-17 16:00:00-05', 'Science B', 6);Run the setup, then:
For each teacher, this returns their 2 most recent lab accesses. Without LATERAL, you’d need window functions + CTEs. With LATERAL, it’s clean and direct.
In data engineering: “for each customer, get their last 3 orders.” “For each sensor, get the most recent 5 readings.” LATERAL is the tool.
Try It: Top-1 Per Teacher 🎯
Modify the LATERAL query to return only each teacher’s most recent lab access (not two). Teachers with no lab access should still appear (show NULLs).
. . .
Change LIMIT 2 to LIMIT 1. The LEFT JOIN ensures all teachers appear even with no matches.
Part 5: Common Table Expressions (CTEs)
Named temporary result sets that make complex queries readable.
CTEs with WITH
A Simple CTE
The CTE filters to counties with 100k+ population. The outer query counts them by state. It’s the same as a subquery, but the logic flows top-to-bottom instead of inside-out.
CTEs for Joining Aggregations
Remember the derived table join from earlier? Here it is with CTEs:
WITH
counties (st, pop_est_2018) AS (
SELECT state_name, sum(pop_est_2018)
FROM us_counties_pop_est_2019
GROUP BY state_name
),
establishments (st, establishment_count) AS (
SELECT st, sum(establishments) AS establishment_count
FROM cbp_naics_72_establishments
GROUP BY st
)
SELECT counties.st,
pop_est_2018,
establishment_count,
round((establishments.establishment_count /
counties.pop_est_2018::numeric(10,1)) * 1000, 1)
AS estabs_per_thousand
FROM counties JOIN establishments
ON counties.st = establishments.st
ORDER BY estabs_per_thousand DESC;Compare this to the derived table version. Same result, dramatically more readable. Each CTE has a name and a clear purpose. In code review, this is the version that gets approved.
CTEs to Eliminate Redundancy
Remember the repeated subquery for median comparison? CTEs fix that:
WITH us_median AS (
SELECT percentile_cont(.5)
WITHIN GROUP (ORDER BY pop_est_2019) AS us_median_pop
FROM us_counties_pop_est_2019
)
SELECT county_name,
state_name AS st,
pop_est_2019,
us_median_pop,
pop_est_2019 - us_median_pop AS diff_from_median
FROM us_counties_pop_est_2019 CROSS JOIN us_median
WHERE (pop_est_2019 - us_median_pop) BETWEEN -1000 AND 1000;The median is computed once in the CTE, then CROSS JOINed to every row. No repeated subqueries. Clean, efficient, maintainable.
Try It: State-Level Summary CTE 🎯
Using two CTEs, compute:
- CTE 1: the total population per state
- CTE 2: the number of counties per state
Join them and add a column for average population per county. Show state, total pop, county count, and avg pop per county. Order by average descending. Which state has the largest average county?
. . .
WITH state_pop AS (
SELECT state_name, sum(pop_est_2019) AS total_pop
FROM us_counties_pop_est_2019
GROUP BY state_name
),
state_counties AS (
SELECT state_name, count(*) AS num_counties
FROM us_counties_pop_est_2019
GROUP BY state_name
)
SELECT sp.state_name,
sp.total_pop,
sc.num_counties,
round(sp.total_pop::numeric / sc.num_counties, 0) AS avg_pop_per_county
FROM state_pop sp
JOIN state_counties sc ON sp.state_name = sc.state_name
ORDER BY avg_pop_per_county DESC;Part 6: Cross Tabulations
Pivot tables in SQL. Every analyst’s favorite party trick.
crosstab()
Enable the Extension
This loads the crosstab() function from the tablefunc module. You only need to do this once per database.
The Ice Cream Survey
200 employees across offices, each picking a flavor. We want a pivot: offices as rows, flavors as columns, counts as values.
Generating the Crosstab
Run this. The first argument is the source query (must have exactly 3 columns: row, category, value). The second argument defines the category labels. The AS clause names the output columns.
This transforms long data (one row per observation) into wide data (one row per group). Essential for reporting and dashboards.
Temperature Crosstab
A more complex example: median max temperature by station and month.
SELECT *
FROM crosstab(
'SELECT station_name,
date_part(''month'', observation_date),
percentile_cont(.5)
WITHIN GROUP (ORDER BY max_temp)
FROM temperature_readings
GROUP BY station_name,
date_part(''month'', observation_date)
ORDER BY station_name',
'SELECT month FROM generate_series(1,12) month'
)
AS (station text,
jan numeric(3,0), feb numeric(3,0), mar numeric(3,0),
apr numeric(3,0), may numeric(3,0), jun numeric(3,0),
jul numeric(3,0), aug numeric(3,0), sep numeric(3,0),
oct numeric(3,0), nov numeric(3,0), dec numeric(3,0));Run this. Two stations, 12 months, median temperatures. Notice generate_series(1,12) creates the month column headers. The escaped single quotes ('') inside the string are required.
🧠 Quick Quiz: Crosstab Requirements
The source query for crosstab() must return exactly how many columns?
A. 2
B. 3
C. 4
D. It depends on the output
. . .
B. Always three: (1) row identifier, (2) category, (3) value. The row identifier groups the pivot rows, the category determines which column gets the value.
Part 7: CASE Expressions
Conditional logic inside SQL. Your if/else for data transformation.
CASE for Classification
Reclassifying Data
SELECT max_temp,
CASE WHEN max_temp >= 90 THEN 'Hot'
WHEN max_temp >= 70 AND max_temp < 90 THEN 'Warm'
WHEN max_temp >= 50 AND max_temp < 70 THEN 'Pleasant'
WHEN max_temp >= 33 AND max_temp < 50 THEN 'Cold'
WHEN max_temp >= 20 AND max_temp < 33 THEN 'Frigid'
WHEN max_temp < 20 THEN 'Inhumane'
ELSE 'No reading'
END AS temperature_group
FROM temperature_readings
ORDER BY station_name, observation_date;Each row gets classified into a group. CASE evaluates conditions top-to-bottom and returns the first match. ELSE catches anything that fell through.
CASE Inside a CTE: The Full Pattern
Combine CASE with a CTE to classify, then aggregate:
WITH temps_collapsed (station_name, max_temperature_group) AS (
SELECT station_name,
CASE WHEN max_temp >= 90 THEN 'Hot'
WHEN max_temp >= 70 AND max_temp < 90 THEN 'Warm'
WHEN max_temp >= 50 AND max_temp < 70 THEN 'Pleasant'
WHEN max_temp >= 33 AND max_temp < 50 THEN 'Cold'
WHEN max_temp >= 20 AND max_temp < 33 THEN 'Frigid'
WHEN max_temp < 20 THEN 'Inhumane'
ELSE 'No reading'
END
FROM temperature_readings
)
SELECT station_name, max_temperature_group, count(*)
FROM temps_collapsed
GROUP BY station_name, max_temperature_group
ORDER BY station_name, count(*) DESC;Run this. Step 1 (CTE): classify every reading. Step 2 (outer query): count days per station per group. This two-step pattern – transform then aggregate – is the backbone of analytical queries.
Try It: Population Tiers 🎯
Using a CTE with CASE, classify each county as:
- ‘Metro’ (pop >= 500,000)
- ‘Urban’ (pop 100,000 to 499,999)
- ‘Suburban’ (pop 50,000 to 99,999)
- ‘Rural’ (pop < 50,000)
Then count the number of counties in each tier and compute the total population per tier. Which tier has the most counties? Which holds the most population?
. . .
WITH county_tiers AS (
SELECT county_name, state_name, pop_est_2019,
CASE
WHEN pop_est_2019 >= 500000 THEN 'Metro'
WHEN pop_est_2019 >= 100000 THEN 'Urban'
WHEN pop_est_2019 >= 50000 THEN 'Suburban'
ELSE 'Rural'
END AS tier
FROM us_counties_pop_est_2019
)
SELECT tier,
count(*) AS num_counties,
sum(pop_est_2019) AS total_pop,
round(avg(pop_est_2019), 0) AS avg_county_pop
FROM county_tiers
GROUP BY tier
ORDER BY total_pop DESC;Rural has the most counties by far. Metro has the most total population. The vast majority of U.S. counties are small.
Part 8: Summary
What We Learned
Key Techniques
| Technique | What It Does |
|---|---|
| Scalar subquery in WHERE | Compute a threshold, filter on it |
| Derived table (subquery in FROM) | Create a temporary table inline |
| Column subquery in SELECT | Add a computed column per row |
| Correlated subquery | Inner query references outer query (runs per row) |
IN / EXISTS / NOT EXISTS |
Check membership across tables |
LATERAL join |
Per-row subquery that can reference preceding tables |
CTE (WITH) |
Named temporary result sets for readability and reuse |
crosstab() |
Pivot long data to wide (requires tablefunc extension) |
CASE |
Conditional classification inside queries |
When to Use What
The Big Ideas
- Subqueries go everywhere: WHERE, FROM, SELECT, even inside other subqueries.
- CTEs > nested subqueries for readability. Always. Fight me.
- LATERAL is the top-N-per-group tool. Learn it. Love it.
- EXISTS short-circuits. For large-table membership checks, prefer it over IN.
- CASE is your inline if/else. Classify first, aggregate second.
- crosstab() pivots data. Three columns in, pivot table out.
References
- DeBarros, A. (2022). Practical SQL (2nd ed.). No Starch Press. Chapter 13.
- PostgreSQL: WITH Queries (CTEs)
- PostgreSQL: Subquery Expressions
- PostgreSQL: LATERAL Joins
- PostgreSQL: tablefunc (crosstab)