DATA 351: Data Management with SQL
March 4, 2026
We need data before we can do statistics. Shocking, I know.
Connect to it: \c analysis (or use your GUI client).
The American Community Survey (ACS) gives us county-level data on education, income, and commuting for 3,142 U.S. counties:
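The table definition isn't reproduced in this excerpt. A plausible sketch follows; the table name and the commuting/bachelor's column names are assumptions (the masters and income column names come from the exercises later in this section):

```sql
-- Sketch of the ACS table this section works with.
-- Table name and pct_travel_60_min / pct_bachelors_higher are assumed.
CREATE TABLE acs_2014_2018_stats (
    geoid text PRIMARY KEY,
    county text NOT NULL,
    st text NOT NULL,
    pct_travel_60_min numeric(5,2),
    pct_bachelors_higher numeric(5,2),
    pct_masters_higher numeric(5,2),
    median_hh_income integer,
    CHECK (pct_masters_higher <= pct_bachelors_higher)
);
```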
Notice the CHECK constraint. Why would we enforce that masters must be less than or equal to bachelors?
Because everyone with a master’s degree also has a bachelor’s degree. If pct_masters > pct_bachelors, something is deeply wrong with our data. Constraints catch problems that eyeballs miss.
Update the path to match your machine! Verify:
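Assuming the table name sketched above, the load-and-verify steps might look like this (fill in your own path):

```sql
\copy acs_2014_2018_stats FROM '/path/to/acs_2014_2018_stats.csv' WITH (FORMAT CSV, HEADER);

-- Verify the load: should return 3142, one row per county.
SELECT count(*) FROM acs_2014_2018_stats;
```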
Run this now. Which counties have the highest median household income? Any surprises?
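A query along these lines would answer that (table name assumed, since the DDL isn't shown here):

```sql
-- Top 10 counties by median household income.
SELECT county, st, median_hh_income
FROM acs_2014_2018_stats
ORDER BY median_hh_income DESC
LIMIT 10;
```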
Does education affect income? Let’s ask the data.
The Pearson correlation coefficient (r) measures the strength and direction of a linear relationship between two variables.
| r value | Interpretation |
|---|---|
| +1.0 | Perfect positive (as X goes up, Y goes up proportionally) |
| +0.6 to +0.99 | Strong positive |
| +0.3 to +0.59 | Moderate positive |
| +0.01 to +0.29 | Weak positive |
| 0 | No linear relationship |
| Negative values | Same scale, opposite direction |
Key word: linear. Two variables can have a strong curved relationship and still show r near 0. Correlation only detects straight-line patterns.
Run this. What do you get?
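The correlation query referenced here isn't shown; a sketch, assuming the ACS table and column names used elsewhere in this section:

```sql
-- Pearson r between education and income across all counties.
SELECT corr(median_hh_income, pct_bachelors_higher) AS bachelors_income_r
FROM acs_2014_2018_stats;
```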
About 0.70. That’s a fairly strong positive correlation. Counties with higher percentages of bachelor’s degree holders tend to have higher median household incomes. Not groundbreaking, but now you can quantify it.
If corr() returns -0.65, which of these is true?
A. As X increases, Y increases B. As X increases, Y decreases C. There is no relationship D. The data is broken
B. A negative r means an inverse relationship. Strong, too – -0.65 is well into “moderate to strong” territory.
You can compute several correlations in one query:
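A sketch of such a query (table and commuting column names are assumptions; the bachelors_travel_r alias matches the discussion below):

```sql
-- Three correlations at once. corr() returns double precision,
-- so cast to numeric before rounding.
SELECT
    round(corr(median_hh_income, pct_bachelors_higher)::numeric, 2)
        AS bachelors_income_r,
    round(corr(pct_travel_60_min, median_hh_income)::numeric, 2)
        AS income_travel_r,
    round(corr(pct_travel_60_min, pct_bachelors_higher)::numeric, 2)
        AS bachelors_travel_r
FROM acs_2014_2018_stats;
```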
Run this. Which pair has the weakest correlation? What does that tell you?
The commute-to-education correlation (bachelors_travel_r) is close to zero. Having a degree doesn’t predict whether you’ll have a long commute. Makes sense – PhDs in Manhattan walk to work; PhDs in rural Montana drive 45 minutes.
A classic reminder:
There’s an entire website dedicated to absurd correlations: tylervigen.com/spurious-correlations. The divorce rate in Maine correlates with per-capita margarine consumption. r = 0.99.
Always ask: is there a plausible mechanism, or is this just two things that happen to move together?
Write a query to find the correlation between pct_masters_higher and median_hh_income. Round to 2 decimal places.
Is it stronger or weaker than the bachelor’s correlation? Why might that be?
The master’s correlation is slightly weaker. This makes intuitive sense – a master’s degree adds income, but the bachelor’s is the bigger jump from “no degree” to “degree.” The bachelor’s rate captures more of the variation.
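A solution sketch for the exercise (table name assumed; column names from the prompt):

```sql
SELECT round(corr(median_hh_income, pct_masters_higher)::numeric, 2)
    AS masters_income_r
FROM acs_2014_2018_stats;
```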
Correlation tells us how strong a relationship is. Regression tells us what the relationship looks like – and lets us make predictions.
Simple linear regression fits a straight line through your data:
Y = bX + a
PostgreSQL gives us both:
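The regression query might look like this (table and bachelor's column names assumed). Note the argument order: the dependent variable Y comes first.

```sql
-- regr_slope(Y, X) and regr_intercept(Y, X) fit Y = slope * X + intercept.
SELECT
    round(regr_slope(median_hh_income, pct_bachelors_higher)::numeric, 2)
        AS slope,
    round(regr_intercept(median_hh_income, pct_bachelors_higher)::numeric, 2)
        AS y_intercept
FROM acs_2014_2018_stats;
```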
Run this. You should get a slope around 926.95 and an intercept around 27,901.15.
The slope of ~927 means: for every 1 percentage point increase in bachelor’s degree holders, the median household income increases by about $927.
The intercept of ~$27,901 is the predicted income when the bachelor’s rate is 0%. (A theoretical county with zero college graduates.)
If a county has 30% of residents with bachelor’s degrees, what income does our model predict?
Y = 926.95 * 30 + 27,901.15 = $55,709.65
You just did linear regression. In your head. With SQL. At 10 AM on a Wednesday.
Our regression shows a slope of 926.95 for income vs. bachelor’s rate. If a county’s bachelor’s rate increases from 25% to 35%, how much does our model predict income will change?
A. $926.95 B. $9,269.50 C. $27,901.15 D. It depends on the county
B. The slope is the change per 1 percentage point. A 10-point increase = 10 * $926.95 = $9,269.50.
r-squared (r²) tells you what percentage of the variation in Y is explained by X.
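PostgreSQL computes it with regr_r2(), again with Y first (table and bachelor's column names assumed):

```sql
SELECT round(regr_r2(median_hh_income, pct_bachelors_higher)::numeric, 3)
    AS r_squared
FROM acs_2014_2018_stats;
```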
Run this. You should get about 0.490.
That means ~49% of the variation in median income across counties is explained by the bachelor’s degree rate. The other 51% comes from other factors (industry, cost of living, geography, etc.).
Write a query that computes the slope, intercept, and r-squared for median_hh_income vs. pct_masters_higher. Based on your r², is the master’s rate a better or worse predictor of income than the bachelor’s rate?
The r² for master’s is lower (~0.37 vs ~0.49). Bachelor’s rate is the better predictor.
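A solution sketch (table name assumed):

```sql
SELECT
    round(regr_slope(median_hh_income, pct_masters_higher)::numeric, 2)
        AS slope,
    round(regr_intercept(median_hh_income, pct_masters_higher)::numeric, 2)
        AS y_intercept,
    round(regr_r2(median_hh_income, pct_masters_higher)::numeric, 3)
        AS r_squared
FROM acs_2014_2018_stats;
```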
How spread out is the data? Time for everyone’s favorite statistics topic.
PostgreSQL has both population and sample versions:
| Function | Type | Use When |
|---|---|---|
| var_pop() | Population variance | Your data IS the entire population |
| var_samp() | Sample variance | Your data is a sample of a larger population |
| stddev_pop() | Population std dev | Your data IS the entire population |
| stddev_samp() | Sample std dev | Your data is a sample |
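A query comparing all four on income (table name assumed):

```sql
SELECT
    var_pop(median_hh_income) AS var_pop,
    var_samp(median_hh_income) AS var_samp,
    stddev_pop(median_hh_income) AS stddev_pop,
    stddev_samp(median_hh_income) AS stddev_samp
FROM acs_2014_2018_stats;
```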
Run this. The population and sample versions are very close because we have 3,142 data points. The difference matters more with small samples.
You have test scores for every student in DATA 351 this semester. Which function do you use?
A. stddev_pop() – because you have everyone B. stddev_samp() – because this semester is a sample of all possible students
It depends on your question! If you want the std dev of this class, use pop. If you want to estimate the std dev of all students who might take this class, use samp. Statistics is fun like that. 😅
This is where window functions enter the picture. And they are cool.
```sql
CREATE TABLE widget_companies (
    id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    company text NOT NULL,
    widget_output integer NOT NULL
);

INSERT INTO widget_companies (company, widget_output)
VALUES
    ('Dom Widgets', 125000),
    ('Ariadne Widget Masters', 143000),
    ('Saito Widget Co.', 201000),
    ('Mal Inc.', 133000),
    ('Dream Widget Inc.', 196000),
    ('Miles Amalgamated', 620000),
    ('Arthur Industries', 244000),
    ('Fischer Worldwide', 201000);
```

Run this now to create and populate the table.
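The ranking query itself isn't reproduced here; a sketch that compares both functions side by side:

```sql
-- rank() leaves gaps after ties; dense_rank() does not.
SELECT
    company,
    widget_output,
    rank() OVER (ORDER BY widget_output DESC),
    dense_rank() OVER (ORDER BY widget_output DESC)
FROM widget_companies
ORDER BY widget_output DESC;
```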
Run this. Notice what happens at Saito Widget Co. and Fischer Worldwide – they both produced 201,000 widgets.
| Company | Output | rank() | dense_rank() |
|---|---|---|---|
| Miles Amalgamated | 620,000 | 1 | 1 |
| Arthur Industries | 244,000 | 2 | 2 |
| Saito Widget Co. | 201,000 | 3 | 3 |
| Fischer Worldwide | 201,000 | 3 | 3 |
| Dream Widget Inc. | 196,000 | 5 | 4 |
rank() skips position 4 (goes 3, 3, 5). dense_rank() doesn’t skip (goes 3, 3, 4). Use dense_rank() when you don’t want gaps.
If three companies are tied for 2nd place using rank(), what rank does the next company get?
A. 3 B. 4 C. 5 D. It depends
C. Three companies at rank 2 means positions 2, 3, 4 are “used.” The next company gets rank 5.
```sql
CREATE TABLE store_sales (
    store text NOT NULL,
    category text NOT NULL,
    unit_sales bigint NOT NULL,
    CONSTRAINT store_category_key PRIMARY KEY (store, category)
);

INSERT INTO store_sales (store, category, unit_sales)
VALUES
    ('Broders', 'Cereal', 1104),
    ('Wallace', 'Ice Cream', 1863),
    ('Broders', 'Ice Cream', 2517),
    ('Cramers', 'Ice Cream', 2112),
    ('Broders', 'Beer', 641),
    ('Cramers', 'Cereal', 1003),
    ('Cramers', 'Beer', 640),
    ('Wallace', 'Cereal', 980),
    ('Wallace', 'Beer', 988);
```

Run this. The ranking resets for each category. Broders might be #1 in Ice Cream but #2 in Beer. PARTITION BY creates separate ranking “lanes.”
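The partitioned ranking query referenced here might look like this:

```sql
-- One independent ranking per category "lane".
SELECT
    category,
    store,
    unit_sales,
    rank() OVER (PARTITION BY category ORDER BY unit_sales DESC)
        AS category_rank
FROM store_sales
ORDER BY category, category_rank;
```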
Write a query that ranks all counties by median_hh_income (highest first) using dense_rank(). Show the county name, state, income, and rank. Limit to the top 10.
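A solution sketch (ACS table name assumed):

```sql
SELECT
    county,
    st,
    median_hh_income,
    dense_rank() OVER (ORDER BY median_hh_income DESC) AS income_rank
FROM acs_2014_2018_stats
ORDER BY income_rank
LIMIT 10;
```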
Now modify the query to rank counties within each state using PARTITION BY. Show only rank 1 per state – the richest county in each state.
Hint: you’ll need a subquery or CTE because you can’t WHERE on a window function.
This uses a CTE! You cannot filter on state_rank in the same SELECT where it’s defined, so we wrap it and filter in the outer query.
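A solution sketch (ACS table name assumed; the state_rank alias matches the discussion):

```sql
-- Window functions can't appear in WHERE, so rank in a CTE
-- and filter in the outer query.
WITH ranked AS (
    SELECT
        county,
        st,
        median_hh_income,
        dense_rank() OVER (
            PARTITION BY st ORDER BY median_hh_income DESC
        ) AS state_rank
    FROM acs_2014_2018_stats
)
SELECT county, st, median_hh_income
FROM ranked
WHERE state_rank = 1
ORDER BY st;
```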
Raw counts lie. Rates tell the truth.
Los Angeles County has 31,000 restaurants. Teton County, Wyoming has 234.
Does LA love food more? No – LA has 10 million people. Teton has 23,000. To compare fairly, we need a rate.
```sql
CREATE TABLE cbp_naics_72_establishments (
    state_fips text,
    county_fips text,
    county text NOT NULL,
    st text NOT NULL,
    naics_2017 text NOT NULL,
    naics_2017_label text NOT NULL,
    year smallint NOT NULL,
    establishments integer NOT NULL,
    CONSTRAINT cbp_fips_key PRIMARY KEY (state_fips, county_fips)
);

\copy cbp_naics_72_establishments FROM '/path/to/cbp_naics_72_establishments.csv' WITH (FORMAT CSV, HEADER);
```

This table has county-level counts of “Accommodation and Food Services” businesses (NAICS code 72).
We need a population table. If you have us_counties_pop_est_2019 from Chapter 5, great. If not, we can compute rates using the ACS data. For now, here’s the pattern:
```sql
SELECT
    cbp.county,
    cbp.st,
    cbp.establishments,
    round(
        (cbp.establishments::numeric / pop.pop_est_2018) * 1000, 1
    ) AS estabs_per_1000
FROM cbp_naics_72_establishments cbp
JOIN us_counties_pop_est_2019 pop
    ON cbp.state_fips = pop.state_fips
    AND cbp.county_fips = pop.county_fips
WHERE pop.pop_est_2018 >= 50000
ORDER BY estabs_per_1000 DESC;
```

The ::numeric cast is critical. Without it, integer division truncates everything to 0.
What does 5 / 2 return in PostgreSQL?
A. 2.5 B. 2 C. 3 D. An error
B. Integer division truncates. 5 / 2 = 2. To get 2.5, cast one side: 5::numeric / 2. This trips people up constantly.
Monthly data is noisy. Rolling averages smooth it out.
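The raw-data query referenced next isn't reproduced. A sketch, assuming an export-values table named us_exports with year, month, and citrus_export_value columns (all names are guesses, since the table isn't defined in this section):

```sql
SELECT year, month, citrus_export_value
FROM us_exports
ORDER BY year, month;
```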
Run this. Citrus exports spike in certain months and drop in others. Seasonal patterns make it hard to see the trend. Is citrus exports growing over time? Hard to tell from the raw numbers.
Run this. The rolling average smooths out the seasonal spikes. Now you can see whether the overall trend is up, down, or flat.
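The rolling-average query might look like this (same assumed us_exports table and column names as above, since the table isn't defined in this section):

```sql
-- 12-month rolling average: this row plus the 11 before it.
SELECT
    year,
    month,
    citrus_export_value,
    round(
        avg(citrus_export_value) OVER (
            ORDER BY year, month
            ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
        ), 0
    ) AS twelve_month_avg
FROM us_exports
ORDER BY year, month;
```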
ROWS BETWEEN 11 PRECEDING AND CURRENT ROW means: “average this row plus the 11 before it.” That’s 12 rows total – one full year of monthly data.
The first 11 rows won’t have a full window (there aren’t 11 preceding rows yet), so the average is computed over whatever is available.
Write a query that computes a 6-month rolling average for soybeans_export_value. Show year, month, the raw value, and the rolling average.
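A solution sketch (table name assumed, as above). Six rows total means the current row plus the 5 preceding:

```sql
SELECT
    year,
    month,
    soybeans_export_value,
    round(
        avg(soybeans_export_value) OVER (
            ORDER BY year, month
            ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
        ), 0
    ) AS six_month_avg
FROM us_exports
ORDER BY year, month;
```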
| Function | What It Does |
|---|---|
| corr(Y, X) | Pearson correlation coefficient (-1 to +1) |
| regr_slope(Y, X) | Slope of the regression line |
| regr_intercept(Y, X) | Y-intercept of the regression line |
| regr_r2(Y, X) | Coefficient of determination (0 to 1) |
| var_pop() / var_samp() | Population / sample variance |
| stddev_pop() / stddev_samp() | Population / sample standard deviation |
| rank() | Ranking with gaps on ties |
| dense_rank() | Ranking without gaps on ties |
| PARTITION BY | Rank within groups |
| ROWS BETWEEN ... AND ... | Define a window frame for rolling calculations |
5 / 2 = 2 in PostgreSQL. Don’t learn this the hard way.