DATA 351: Data Management with SQL
March 19, 2026
Before we start manipulating time itself, we need to load some data.
Open a terminal and create a fresh database for this lecture:
Using psql’s createdb command:
OR
Using Beekeeper Studio by connecting to your localhost server and clicking (+).
OR
Using pgAdmin by connecting to your localhost server and right-clicking on the server -> Create -> Database.
We need to create the NYC taxi trips table and import the CSV. Run the following to create the table:
CREATE TABLE nyc_yellow_taxi_trips (
trip_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
vendor_id text NOT NULL,
tpep_pickup_datetime timestamptz NOT NULL,
tpep_dropoff_datetime timestamptz NOT NULL,
passenger_count integer NOT NULL,
trip_distance numeric(8,2) NOT NULL,
pickup_longitude numeric(18,15) NOT NULL,
pickup_latitude numeric(18,15) NOT NULL,
rate_code_id text NOT NULL,
store_and_fwd_flag text NOT NULL,
dropoff_longitude numeric(18,15) NOT NULL,
dropoff_latitude numeric(18,15) NOT NULL,
payment_type text NOT NULL,
fare_amount numeric(9,2) NOT NULL,
extra numeric(9,2) NOT NULL,
mta_tax numeric(5,2) NOT NULL,
tip_amount numeric(9,2) NOT NULL,
tolls_amount numeric(9,2) NOT NULL,
improvement_surcharge numeric(9,2) NOT NULL,
total_amount numeric(9,2) NOT NULL
);Use whichever method you prefer:
Beekeeper Studio or pgAdmin:
Change the path to the CSV file on your computer.
copy nyc_yellow_taxi_trips (vendor_id, tpep_pickup_datetime,
tpep_dropoff_datetime, passenger_count, trip_distance,
pickup_longitude, pickup_latitude, rate_code_id,
store_and_fwd_flag, dropoff_longitude, dropoff_latitude,
payment_type, fare_amount, extra, mta_tax, tip_amount,
tolls_amount, improvement_surcharge, total_amount)
FROM '/path/to/nyc_yellow_taxi_trips.csv'
WITH (FORMAT CSV, HEADER);Beekeeper Studio: Right-click the table, select Import CSV, browse to nyc_yellow_taxi_trips.csv.
Replace /path/to/ with the actual path to your CSV. On macOS, drag the file into the terminal to paste the full path.
The taxi data is from New York City, so let’s make sure we are all seeing the same timestamps:
This is a session-level setting. It lasts until you disconnect. Your data on the server is unchanged.
Time to talk about time. (That sentence hurt to write.)
PostgreSQL gives us four data types for temporal data:
| Type | Stores | Example |
|---|---|---|
timestamp with time zone (timestamptz) |
Date + time + time zone | 2022-12-01 18:37:12 EST |
date |
Date only | 2022-12-01 |
time |
Time only | 18:37:12 |
interval |
A duration | 2 days 3 hours |
The first three are datetime types. The fourth, interval, represents a length of time rather than a specific moment.
Almost always: timestamptz.
A timestamp without a time zone is like an address without a city. Sure, “123 Main Street” means something, but where?
date is fine when you genuinely only care about the calendar date (birthdays, holidays). time alone is almost never useful because a time without a date or zone is meaningless.
And interval? You do not create columns of type interval very often. It shows up as the result of subtracting two timestamps. Think of it as the answer to “how long?” rather than “when?”
What data type would you use for each of these?
Answers:
timestamptz (you need to know where on Earth that order happened)date (nobody was born at 3:47 PM UTC)interval (e.g., '2 hours 23 minutes')time seems right, but without a date, you cannot handle daylight saving changes. In practice, store this as text or use application logic.Sometimes you do not want the whole timestamp. You just want the year, or the hour, or the day of the week.
The date_part() function extracts a single component from a date or timestamp:
SELECT
date_part('year', '2022-12-01 18:37:12 EST'::timestamptz) AS year,
date_part('month', '2022-12-01 18:37:12 EST'::timestamptz) AS month,
date_part('day', '2022-12-01 18:37:12 EST'::timestamptz) AS day,
date_part('hour', '2022-12-01 18:37:12 EST'::timestamptz) AS hour,
date_part('minute', '2022-12-01 18:37:12 EST'::timestamptz) AS minute,
date_part('seconds','2022-12-01 18:37:12 EST'::timestamptz) AS seconds;Run this now. Your hour value might differ from your neighbor’s. Why?
Because your PostgreSQL server converts the timestamp to your session time zone. EST is UTC-5. If your server is set to Pacific time (UTC-8), the hour shows as 15 instead of 18. The underlying moment in time is identical.
date_part() can extract more than the obvious pieces:
| Component | What It Returns |
|---|---|
timezone_hour |
UTC offset in hours (e.g., -5 for EST) |
week |
ISO 8601 week number (1-53) |
quarter |
Quarter of the year (1-4) |
epoch |
Seconds since January 1, 1970 00:00:00 UTC |
Epoch is how computers actually think about time. Every timestamp is secretly just a big number counting seconds from 1970. Everything else is a human convenience.
The SQL standard offers extract(), which does the same thing with slightly different syntax:
Note: no quotes around the component name, and FROM instead of a comma. Both date_part() and extract() return the same result. Use whichever your team prefers.
Write a query that extracts the day of the week (dow) and the hour from the current timestamp.
Hint: date_part('dow', ...) returns 0 for Sunday through 6 for Saturday.
Sometimes your data arrives in pieces: year in one column, month in another, day in a third. PostgreSQL can reassemble them.
PostgreSQL provides three constructor functions:
That last one is the most useful. Notice the time zone argument is a string. The output will be converted to your session’s time zone.
Run the make_timestamptz query now. What time does it show?
If your session is set to US/Eastern, you should see 2022-02-22 13:04:30.3-05. Lisbon is at UTC+0, Eastern in February is UTC-5, so 6:04 PM in Lisbon is 1:04 PM in New York.
PostgreSQL gives you several ways to ask “what time is it?”:
Important: All of these return the time at the start of the query. If your query takes 10 seconds, every row gets the same timestamp.
What if you want the actual clock time as each row is processed?
CREATE TABLE current_time_example (
time_id integer GENERATED ALWAYS AS IDENTITY,
current_timestamp_col timestamptz,
clock_timestamp_col timestamptz
);
INSERT INTO current_time_example
(current_timestamp_col, clock_timestamp_col)
(SELECT current_timestamp,
clock_timestamp()
FROM generate_series(1,1000));
SELECT * FROM current_time_example;Run this and look at the results. The current_timestamp_col is identical for all 1,000 rows. The clock_timestamp_col increases slightly with each row.
When would this matter? Imagine logging the exact processing time of each row during a large data migration. current_timestamp would lie to you. clock_timestamp() tells the truth.
This is where dates and times get spicy.
Quick thought experiment: a server in Oregon logs an event at 2026-03-15 14:00:00. A server in New York logs an event at 2026-03-15 17:00:00. Which happened first?
They happened at the exact same moment. Oregon is UTC-7 in March, New York is UTC-4. Both are 21:00 UTC.
Without time zone information, you would think the Oregon event happened 3 hours earlier. This is how bugs are born, and they are the worst kind of bugs because they look correct until they do not.
Run this. What does yours say? If you followed the setup instructions, it should say US/Eastern because we set it earlier.
PostgreSQL ships with two system tables full of time zone information:
You can filter them:
Write a query to find all time zones in Asia. How many are there?
You can change your session’s time zone at any time:
This changes how timestamps are displayed, not how they are stored. Internally, PostgreSQL always stores timestamptz values as UTC.
Let’s watch the same moment in time change its clothes:
Now watch:
Same row, same data on disk, different display. Pacific is 3 hours behind Eastern, so 4 AM Pacific = 7 AM Eastern. PostgreSQL handles this automatically.
What if you do not want to change your whole session, just peek at one value in a different time zone?
4 AM Pacific on January 1st is 9 PM the same day in Seoul. New Year’s Eve in Portland, New Year’s Day in Korea.
Quirk alert: When you use AT TIME ZONE on a timestamptz, the output is a timestamp without time zone (and vice versa). PostgreSQL strips the zone because you already specified which zone you want.
It is midnight on New Year’s Day 2030 in New York ('2030-01-01 00:00:00 US/Eastern').
Write a query that shows what time it is at that exact moment in:
Europe/London)Asia/Tokyo)Australia/Sydney)London: 5 AM. Tokyo: 2 PM. Sydney: 4 PM. When New Yorkers are watching the ball drop, Sydney already had their fireworks 16 hours ago.
You can do arithmetic with dates and times, and it actually makes sense.
Subtract one date from another and you get an integer (days between them):
Result: 3. Three days between September 27 and September 30, 1929. (That was right before the stock market crash, incidentally. Bad week.)
Add an interval to a date and you get a new timestamp:
Result: 1934-09-30 00:00:00. PostgreSQL knows about leap years, month lengths, and all the calendar quirks. You do not have to think about whether February has 28 or 29 days.
Subtract two timestamps and you get an interval:
Result: 08:00:00. An eight-hour workday. (In theory.)
PostgreSQL’s first release was on July 8, 1996. Write a query that calculates how many days ago that was.
Time to put all of this to work on real data. We have 368,774 taxi rides from June 1, 2016, in New York City.
When do New Yorkers take the most taxi rides? Let’s find out:
Run this now. You should get 24 rows, one per hour.
Look at the results. When is the busiest time? When is the slowest?
The busiest hours are in the evening (6 PM to 10 PM). The slowest are in the early morning (2 AM to 5 AM). No surprises there: nobody is hailing a cab at 3 AM on a Wednesday. (Well, almost nobody.)
Write a query to find the single busiest hour and how many trips occurred in it.
If you look at the data carefully, there is a dip around 3-4 PM. Why would taxi ridership drop in the middle of the afternoon?
Could be shift changes for taxi drivers. Could be that people are still at work. Could be that everyone is stuck in traffic anyway so they walk. This is where data raises questions faster than it answers them.
How long do taxi rides take at different times of day? We can calculate the median trip duration by subtracting pickup from dropoff:
Run this. Which hour has the longest median trip time? Which has the shortest?
The longest median trips are midday (around 1 PM: ~15 minutes). The shortest are early morning (~5 AM: under 8 minutes). Less traffic at 5 AM means you get where you are going faster. At 1 PM, you are fighting for every block.
Write a query to find the 10 longest taxi rides by duration. Include the pickup time, dropoff time, the calculated duration, and the trip distance.
Do any of the results look suspicious?
You will probably see some rides lasting 20+ hours with a trip distance of 0. Those are almost certainly errors: the meter was left running, or the data was entered incorrectly. Real-world data is messy. Always sanity-check your results.
Write a query that shows the average total_amount (fare) for each hour of the day. Which hour has the highest average fare?
The early morning hours (4-5 AM) often have the highest average fares. Fewer but longer rides: probably airport runs and longer-distance trips when there is no traffic.
Let’s switch from taxis to trains. We are going to track a cross-country Amtrak trip through four time zones.
CREATE TABLE train_rides (
trip_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
segment text NOT NULL,
departure timestamptz NOT NULL,
arrival timestamptz NOT NULL
);
INSERT INTO train_rides (segment, departure, arrival)
VALUES
('Chicago to New York',
'2020-11-13 21:30 CST', '2020-11-14 18:23 EST'),
('New York to New Orleans',
'2020-11-15 14:15 EST', '2020-11-16 19:32 CST'),
('New Orleans to Los Angeles',
'2020-11-17 13:45 CST', '2020-11-18 9:00 PST'),
('Los Angeles to San Francisco',
'2020-11-19 10:10 PST', '2020-11-19 21:24 PST'),
('San Francisco to Denver',
'2020-11-20 9:10 PST', '2020-11-21 18:38 MST'),
('Denver to Chicago',
'2020-11-22 19:10 MST', '2020-11-23 14:50 CST');Run this now.
Notice how each timestamp includes the time zone of the city. Chicago departs in CST, arrives in New York in EST. This is critical for accurate duration calculations.
All timestamps now display in Central time. The underlying data has not changed. PostgreSQL just shows everything through a Central time lens.
How long is each leg of the journey?
Two things to notice here:
to_char() formats the timestamp into a human-friendly string. HH12 gives 12-hour time, a.m. adds the AM/PM indicator, TZ shows the time zone abbreviation.interval. PostgreSQL correctly handles the time zone differences. Chicago to New York crosses a time zone boundary, and the math accounts for it.Which segment of the trip takes the longest?
How long has the entire trip taken so far after each segment? This is a job for a window function:
Run this. Look at the cume_duration column. Something looks weird.
The output says things like 2 days 85:47:00. That is technically correct (85 hours is, in fact, 85 hours) but confusing. PostgreSQL sums the days and the hours separately and does not roll them up.
The justify_interval() function converts overflow hours into days and overflow days into months:
Now the final row shows 5 days 13:47:00. That is how long the entire cross-country trip takes. Five and a half days on a train. The scenery better be incredible.
Rewrite the cumulative duration query so that the departure column is formatted as Mon DD, YYYY HH12:MI AM TZ using to_char().
Hint: The format string is 'Mon DD, YYYY HH12:MI AM TZ'.
Let’s combine several concepts from today into more complex queries.
Using the NYC taxi data, write a query that counts trips by day of the week. Use date_part('dow', ...) and a CASE statement to show day names instead of numbers.
Remember: dow returns 0 = Sunday through 6 = Saturday.
(This dataset only has one day, but the query pattern works on multi-day datasets.)
SELECT
CASE date_part('dow', tpep_pickup_datetime)
WHEN 0 THEN 'Sunday'
WHEN 1 THEN 'Monday'
WHEN 2 THEN 'Tuesday'
WHEN 3 THEN 'Wednesday'
WHEN 4 THEN 'Thursday'
WHEN 5 THEN 'Friday'
WHEN 6 THEN 'Saturday'
END AS day_name,
count(*) AS num_trips
FROM nyc_yellow_taxi_trips
GROUP BY day_name
ORDER BY num_trips DESC;Since this dataset is from June 1, 2016 (a Wednesday), you will only see one row. But now you know the pattern for when you have months of data.
Write a query that shows the taxi pickup times converted to Asia/Tokyo time. Just show the first 5 rows.
What day is it in Tokyo when it is Wednesday evening in New York?
Write a query that calculates the 25th, 50th (median), and 75th percentile of trip duration for the entire dataset. Use percentile_cont().
SELECT
percentile_cont(0.25)
WITHIN GROUP (ORDER BY tpep_dropoff_datetime - tpep_pickup_datetime)
AS p25,
percentile_cont(0.50)
WITHIN GROUP (ORDER BY tpep_dropoff_datetime - tpep_pickup_datetime)
AS median,
percentile_cont(0.75)
WITHIN GROUP (ORDER BY tpep_dropoff_datetime - tpep_pickup_datetime)
AS p75
FROM nyc_yellow_taxi_trips;The interquartile range (P75 - P25) tells you how spread out “typical” trip durations are.
| Concept | What It Does |
|---|---|
timestamptz |
The go-to type for storing moments in time |
date |
Date only (no time, no zone) |
interval |
A duration, not a point in time |
date_part() / extract() |
Pull year, month, hour, etc. from a timestamp |
make_date() / make_timestamptz() |
Build dates from components |
current_timestamp / now() |
The time when the query started |
clock_timestamp() |
The actual clock time (changes per row) |
SET TIME ZONE |
Change display for your session |
AT TIME ZONE |
One-off conversion to another zone |
to_char() |
Format timestamps as readable strings |
justify_interval() |
Clean up messy interval output |
timestamptz. A timestamp without a zone is a bug waiting to happen.date_part() is your Swiss Army knife. Need to group by hour? By month? By quarter? It handles all of them.