
DATA 503: Fundamentals of Data Engineering
February 25, 2026
Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting, we write (or configure) tools to do it for us.
Why? Because life is too short to copy 10,000 rows by hand. Your fingers will thank you.
Before you go scraping the entire internet, check the site's Terms of Service and its robots.txt file (e.g., https://example.com/robots.txt). We are responsible data engineers, not chaos agents.
Web Scraper is a free Chrome/Firefox extension that lets you scrape websites without writing code. It uses a point-and-click interface to define what data to extract.
Think of it as “I cannot code a spider, but I can click on things.”
Install it from the Chrome Web Store or the Firefox Add-ons site (search for “Web Scraper” by webscraper.io).
After installation, restart your browser (or just use new tabs).
Open your browser's developer tools (F12, or Cmd+Opt+I on Mac) and look for the Web Scraper tab – use the >> arrows to find it if it is hidden. That is where all the magic happens.
A Sitemap is your scraping project. It defines the start URL(s) to visit and the tree of selectors that describe what data to extract.
Think of it like a treasure map, except the treasure is data and X marks the CSS selector.
Selectors are the building blocks of your sitemap. Web Scraper has three types:
| Category | Purpose | Examples |
|---|---|---|
| Data extraction | Pull data from elements | Text, Link, Image, Table, HTML |
| Link navigation | Follow links to other pages | Link selector |
| Element grouping | Group related data together | Element selector, Element click |
Selectors are organized in a tree structure. The scraper executes them top-down:

Parent selectors define scope. Child selectors extract data within that scope.
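To make the tree idea concrete, here is a toy Python walk over a selector tree shaped like the one we build later (the `walk` function is illustrative, not part of Web Scraper):

```python
# Each selector names its parent; _root is the implicit top of the tree.
selectors = [
    {"id": "pagination", "parent": "_root"},
    {"id": "artist", "parent": "pagination"},
    {"id": "album", "parent": "pagination"},
    {"id": "album_link", "parent": "pagination"},
    {"id": "avg_rating", "parent": "album_link"},
]

def walk(parent, depth=0, out=None):
    """Visit selectors top-down, the same order the scraper executes them."""
    out = [] if out is None else out
    for s in selectors:
        if s["parent"] == parent:
            out.append("  " * depth + s["id"])
            walk(s["id"], depth + 1, out)
    return out

print("\n".join(walk("_root")))
```

The indentation in the output mirrors scope: `avg_rating` only runs inside the pages that `album_link` navigated to.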
When a selector has Multiple checked, it means:
For example: a product listing page has 25 items. The Element selector with Multiple checked will find all 25.
Web Scraper's point-and-click Select tool lets you click elements on the page and generates a CSS selector for you.

Keyboard shortcuts (after clicking Select):
| Key | Action |
|---|---|
| P | Expand to parent element |
| C | Narrow to child element |
| S | Select without clicking (for dynamic elements) |
| Shift | Select multiple element groups |
Always use Element Preview and Data Preview before running a scrape:
If the preview looks wrong, the scrape will be wrong. Trust the preview.
Instead of clicking “Next” 100 times, use range URLs:
| Pattern | Generates |
|---|---|
| https://example.com/page/[1-3] | /page/1, /page/2, /page/3 |
| https://example.com/page/[001-100] | /page/001, /page/002, … (zero-padded) |
| https://example.com/page/[0-100:10] | /page/0, /page/10, /page/20, … |
This is how we handle multi-page scrapes without link selectors for pagination.
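If you later reproduce this in code, the same range expansion is a few lines of Python – a sketch that mimics Web Scraper's bracket syntax (the helper name and regex are my own, not part of the extension):

```python
import re

def expand_range_url(url):
    """Expand a Web Scraper-style range URL like /page/[1-3] or [0-100:10]."""
    m = re.search(r"\[(\d+)-(\d+)(?::(\d+))?\]", url)
    if not m:
        return [url]
    start, end, step = m.group(1), m.group(2), m.group(3)
    # Zero-pad when the pattern itself is zero-padded, e.g. [001-100]
    width = len(start) if start.startswith("0") and len(start) > 1 else 0
    urls = []
    for n in range(int(start), int(end) + 1, int(step or 1)):
        page = str(n).zfill(width) if width else str(n)
        urls.append(url[:m.start()] + page + url[m.end():])
    return urls

print(expand_range_url("https://example.com/page/[1-3]"))
# ['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/3']
```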
We are going to scrape the Most Collected Releases from Discogs – the world’s largest music database.
We want: artist name, album title, and then we will follow the link to each album’s detail page to grab the average rating and number of ratings. This is a two-level scrape.
Our scrape has two levels: the search results page and each album’s detail page:

The pagination selector handles navigating through search result pages. Under it, we extract text (artist, album) and follow links (album_link) to detail pages where we grab ratings.
1. Open the developer tools (F12 or Cmd+Opt+I) and go to the Web Scraper tab
2. Click Create new sitemap and name it discogs5
3. Set the start URL to https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-2]

We use [1-2] to scrape the first 2 pages (~50 results). You can increase this later, but start small when testing.
This handles scrolling or clicking through search results on each page:
- ID: pagination
- Type: Pagination
- Selector: .hidden .cursor-pointer.justify-center span

Why Pagination instead of Element? The Pagination selector both groups the repeating data on the page AND handles navigating to subsequent pages. It is doing double duty.
Navigate into the pagination selector (click on it), then:
- ID: artist
- Type: Text
- Selector: a.block

Use Data Preview to verify it is grabbing artist names correctly.
Still inside the pagination selector:
- ID: album
- Type: Text
- Selector: a.line-clamp-2

Data Preview should now show album titles alongside artists.
This is the key step – we need to follow the link to each album’s detail page to get ratings:
Still inside pagination, click Add new selector:

- ID: album_link
- Type: Link
- Selector: a.group

This tells the scraper: “For each result on the page, follow this link to the detail page.” The child selectors under album_link will extract data from those detail pages.
Navigate into the album_link selector. Now we define what to extract from each album’s detail page.
Average Rating:

- ID: avg_rating
- Type: Text
- Selector: .section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)

Number of Ratings:

- ID: num_ratings
- Type: Text
- Selector: #release-stats li:nth-of-type(4) a

Your sitemap should now look like this:
| Selector ID | Type | Parent | CSS Selector |
|---|---|---|---|
| pagination | Pagination | _root | .hidden .cursor-pointer.justify-center span |
| artist | Text | pagination | a.block |
| album | Text | pagination | a.line-clamp-2 |
| album_link | Link | pagination | a.group |
| avg_rating | Text | album_link | .section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2) |
| num_ratings | Text | album_link | #release-stats li:nth-of-type(4) a |
If you prefer to skip building it manually, you can import the complete sitemap JSON via Create new sitemap > Import Sitemap:

```json
{
  "_id": "discogs5",
  "startUrl": [
    "https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-2]"
  ],
  "selectors": [
    {
      "id": "pagination",
      "paginationType": "auto",
      "parentSelectors": ["_root", "pagination"],
      "selector": ".hidden .cursor-pointer.justify-center span",
      "type": "SelectorPagination"
    },
    {
      "id": "artist",
      "multiple": false,
      "parentSelectors": ["pagination"],
      "selector": "a.block",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "album",
      "multiple": false,
      "parentSelectors": ["pagination"],
      "selector": "a.line-clamp-2",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "album_link",
      "linkType": "linkFromHref",
      "multiple": true,
      "parentSelectors": ["pagination"],
      "selector": "a.group",
      "type": "SelectorLink",
      "version": 2
    },
    {
      "id": "avg_rating",
      "multiple": false,
      "parentSelectors": ["album_link"],
      "selector": ".section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "num_ratings",
      "multiple": false,
      "parentSelectors": ["album_link"],
      "selector": "#release-stats li:nth-of-type(4) a",
      "type": "SelectorText",
      "version": 2
    }
  ]
}
```

Before running the full scrape, always check:
- Data Preview on artist and album – do they show real artist/album names?
- Open one album's detail page and check the avg_rating and num_ratings element previews there

If the preview looks wrong, the scrape will be wrong. Trust the preview.
A popup window will open and start loading pages. It will visit each search results page, extract artist/album names, then follow each album link to grab ratings from the detail pages. Do not close the popup. Go get coffee – this is a two-level scrape so it takes longer than a single-page scrape.
Once scraping is complete:
Congratulations, you just built a two-level scraping pipeline with zero code. You scraped search results AND followed links to detail pages for extra data. Your future self who has to write Python scrapers will be jealous.
| Field | Source | Description |
|---|---|---|
| artist | Search results page | Artist or band name |
| album | Search results page | Album/release title |
| album_link | Search results page | URL to the album detail page |
| avg_rating | Album detail page | Average user rating (e.g., 4.21) |
| num_ratings | Album detail page | Total number of user ratings |
This is our raw data. In a real pipeline, we would clean, transform, and load this into a database. Which is exactly what we are about to learn.
Work individually or with a neighbor. You are going to build your own sitemap from scratch using what we just learned.
The Use Case: Discogs tracks which releases are the most collected by users worldwide. We want to find the most collected vinyl releases from the 2010s decade and pull rating data from each album’s detail page.
Start with this base URL in your browser:
https://www.discogs.com/search?sort=have%2Cdesc&type=release&year1=2010&year2=2020&format_exact=Vinyl
This applies the following filters:
| Filter | Value |
|---|---|
| Sort | Most Collected (have, descending) |
| Type | Release |
| Year Range | 2010 to 2020 |
| Format | Vinyl |
Open this URL in Chrome and verify you see vinyl releases sorted by collector count.
Build a Web Scraper sitemap that:

- Paginates through the filtered search results
- Extracts artist and album from each result
- Follows each album link to grab avg_rating and num_ratings from the detail page

Bonus fields (if you finish early): year and format from the results page, plus num_have and num_want from the detail page.
Hints: reuse the structure of the discogs5 demo we just built, and remember the P (parent) and C (child) keys while the Select tool is active.

When you export to CSV, each row should have:
| Column | Example Value |
|---|---|
| artist | Adele |
| album | 21 |
| album_link | https://www.discogs.com/release/… |
| avg_rating | 4.18 |
| num_ratings | 1,247 |
You have 15 minutes. When you are done (or stuck), we will compare sitemaps.
Here is a working sitemap for this exercise. You can import it via Create new sitemap > Import Sitemap:

The solution sitemap JSON is available as top100_2010-2020.json on the course site.
| Selector ID | Type | Parent | CSS Selector |
|---|---|---|---|
| pagination | Pagination | _root | .hidden .cursor-pointer.justify-center span |
| artist | Text | pagination | a.block |
| album | Text | pagination | a.line-clamp-2 |
| year | Text | pagination | span.block.text-xs |
| format | Text | pagination | p.text-xs.truncate |
| album_link | Link | pagination | a.group |
| avg_rating | Text | album_link | .section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2) |
| num_ratings | Text | album_link | #release-stats li:nth-of-type(4) a |
| num_have | Text | album_link | #release-stats li:nth-of-type(1) a |
| num_want | Text | album_link | #release-stats li:nth-of-type(2) a |
“It works on my machine.” – Every developer, moments before disaster.
Docker packages your application and all its dependencies into a container – a lightweight, portable, isolated environment that runs the same everywhere.
| Concept | What It Is | Analogy |
|---|---|---|
| Image | A blueprint/recipe for your container | A shipping manifest |
| Container | A running instance of an image | The actual shipping container |
| Registry | Where images are stored (Docker Hub) | The warehouse |
| Dockerfile | Instructions to build an image | The recipe card |
Docker turns “how do I deploy this?” from a week-long saga into a one-liner.
We will use a pre-built Docker image called lucascordova/web2db.
This image fetches a URL and stores the JSON response in a Postgres table – a generic, reusable ingestion service.

Environment variables control its behavior – no code changes needed:
| Variable | Purpose |
|---|---|
| SITE_URL | The URL to fetch data from |
| DATABASE_URL | PostgreSQL connection string |
| TABLE_NAME | Which table to insert into |
| DEBUG | Enable verbose logging |
We are building an end-to-end data engineering pipeline:

Seven services, one Railway project, zero hand-waving. Let us build each one.
In this pipeline, Postgres is the brain. Everything flows into it and everything reads from it.
It serves a dual role:
We will deploy it on Railway and connect using Beekeeper Studio.
Beekeeper Studio is an open-source desktop SQL client. It is like pgAdmin but prettier and less likely to make you cry.
Create a new Railway project (e.g., discogs-pipeline) and add a PostgreSQL database service. This database becomes the shared backend for all services in your pipeline.

In the Postgres service's Variables tab, find DATABASE_PUBLIC_URL. It looks something like:
postgresql://postgres:<password>@<server>.proxy.rlwy.net:20848/railway
This is your golden ticket. Guard it like a secret.
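If you ever need the individual pieces (host, port, password) rather than the whole string, Python's standard library can split the URL apart – shown here with made-up credentials, not a real Railway URL:

```python
from urllib.parse import urlparse

# A made-up example in the same shape as Railway's DATABASE_PUBLIC_URL
url = "postgresql://postgres:s3cret@maglev.proxy.rlwy.net:20848/railway"

parts = urlparse(url)
print(parts.scheme)            # postgresql
print(parts.username)          # postgres
print(parts.password)          # s3cret
print(parts.hostname)          # maglev.proxy.rlwy.net
print(parts.port)              # 20848
print(parts.path.lstrip("/"))  # railway  (the database name)
```

This is exactly how client libraries decompose a connection string before opening a socket.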
In Beekeeper Studio, create a new Postgres connection and paste in the DATABASE_PUBLIC_URL. You should see an empty database. That is expected – your ingestion services will fill it up shortly.
Before ingestion can begin, we need tables to hold the raw API responses. These are append-only logs: full JSON payloads with timestamps.
Paste the table-creation SQL into Beekeeper Studio and execute it. Each raw table needs an id, a raw_json JSONB column, and a timestamptz timestamp – the columns the transformation queries below reference.

Why JSONB? It stores JSON in a binary format – faster to query than plain JSON – and supports indexing, containment queries (@>), and path queries (->, ->>).

We store the raw data first, ask questions later. This is the ELT pattern (Extract, Load, Transform) rather than ETL.
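The ELT pattern can be demonstrated end to end in miniature with SQLite's JSON functions standing in for Postgres JSONB (SQLite uses json_extract where Postgres would use -> / ->>; the table and payload here are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Extract + Load: store the raw payload untouched, with a timestamp
conn.execute("CREATE TABLE weather_json_data (raw_json TEXT, ts TEXT)")
payload = {"latitude": 45.52, "current": {"weathercode": 61}}
conn.execute("INSERT INTO weather_json_data VALUES (?, datetime('now'))",
             (json.dumps(payload),))

# Transform later, in SQL, once we know what questions to ask
row = conn.execute(
    "SELECT json_extract(raw_json, '$.latitude'), "
    "       json_extract(raw_json, '$.current.weathercode') "
    "FROM weather_json_data"
).fetchone()
print(row)  # (45.52, 61)
```

Same idea at Postgres scale: land the payload first, carve out typed columns on a schedule.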
The web2db service fetches data from a URL and stores the JSON response in Postgres. We will deploy it as a Docker container on Railway with a cron schedule.
First up: flight data from the OpenSky Network API.
https://opensky-network.org/api/states/all?lamin=45.08&lomin=-123.50&lamax=45.88&lomax=-122.00
| Parameter | Value | Meaning |
|---|---|---|
| lamin | 45.08 | Latitude minimum |
| lamax | 45.88 | Latitude maximum |
| lomin | -123.50 | Longitude minimum |
| lomax | -122.00 | Longitude maximum |
This bounding box covers the Portland metropolitan area and surrounding airspace. We are basically watching planes fly over our heads. Legally.
Deploy a new Railway service from the Docker image lucascordova/web2db. In the Variables tab, add:

| Key | Value |
|---|---|
| SITE_URL | https://opensky-network.org/api/states/all?lamin=45.08&lomin=-123.50&lamax=45.88&lomax=-122.00 |
| DATABASE_URL | ${{Postgres.DATABASE_PUBLIC_URL}} |
| TABLE_NAME | flight_json_data |
| DEBUG | TRUE |
Use Railway’s variable picker to reference the Postgres service directly for DATABASE_URL.
Set the cron schedule to */5 * * * *. This runs the service every 5 minutes.
Cron format: minute hour day month weekday
*/5 = “every 5 minutes”
Bookmark Crontab Guru – it will save your life when writing cron expressions.
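For the simple */N expressions we use here, you can even verify the schedule in a few lines of Python (a toy evaluator for the minute field only; real cron is much richer):

```python
def matching_minutes(expr):
    """Return the minutes (0-59) matched by a '*' or '*/N' minute field."""
    if expr == "*":
        return list(range(60))
    assert expr.startswith("*/"), "toy evaluator: only * and */N supported"
    step = int(expr[2:])
    return [m for m in range(60) if m % step == 0]

print(matching_minutes("*/5"))
# [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
```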
After a few minutes, check in Beekeeper Studio:
If you see rows with timestamps and JSON blobs, you are golden.
Troubleshooting: check the service's deploy logs, and double-check SITE_URL and DATABASE_URL for typos.

Same concept, different API. We are pulling weather data for Portland from Open-Meteo.
https://api.open-meteo.com/v1/forecast?latitude=45.52&longitude=-122.68&current=temperature_2m,wind_speed_10m,precipitation,weathercode

| Parameter | Value | Meaning |
|---|---|---|
| latitude | 45.52 | Downtown Portland |
| longitude | -122.68 | Downtown Portland |
| current | temperature_2m, wind_speed_10m, precipitation, weathercode | Current conditions |

(We request precipitation because the transformation and the flight_weather view use precipitation_mm.)
Same steps as flights:
Deploy another service from the lucascordova/web2db image. In the Variables tab, add:

| Key | Value |
|---|---|
| SITE_URL | https://api.open-meteo.com/v1/forecast?latitude=45.52&longitude=-122.68&current=temperature_2m,wind_speed_10m,precipitation,weathercode |
| DATABASE_URL | ${{Postgres.DATABASE_PUBLIC_URL}} |
| TABLE_NAME | weather_json_data |
| DEBUG | TRUE |
Set the cron schedule to */5 * * * *.

You now have two data streams flowing into your database. You are officially a data engineer. Put it on your resume.
Raw JSON is great for storage but terrible for analysis. The transformation service parses the raw payloads into typed, structured tables and then clears the processed raw rows.

This is the “T” in ETL (or the “T” in ELT, depending on your religion).
Always look at what you are working with:
OpenSky returns an array of arrays. Open-Meteo returns a nested object. Different APIs, different headaches.
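Positional arrays are easy to mishandle, so it helps to name the fields as early as possible. A Python sketch (the sample row is invented; the index-to-field mapping follows the OpenSky API's documented state vector order):

```python
# OpenSky's /states/all returns {"states": [[...], [...]]} where each inner
# list is positional. Per the OpenSky API docs: 0=icao24, 1=callsign,
# 2=origin country, 5=longitude, 6=latitude, 7=barometric altitude (m).
FIELDS = {0: "icao24", 1: "callsign", 2: "country",
          5: "longitude", 6: "latitude", 7: "altitude_meters"}

def to_record(state):
    """Turn one positional state vector into a named dict."""
    return {name: state[i] for i, name in FIELDS.items()}

# Invented sample row in the OpenSky shape
sample = ["a1b2c3", "ASA123  ", "United States", None, None,
          -122.6, 45.55, 1200.5]
print(to_record(sample))
```

Note that longitude comes before latitude in the state vector – exactly the kind of trap that positional formats set for you.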
```sql
CREATE TABLE flights (
    id SERIAL PRIMARY KEY,
    icao24 VARCHAR(10) NOT NULL,
    callsign VARCHAR(10),
    country VARCHAR(64),
    latitude DOUBLE PRECISION,
    longitude DOUBLE PRECISION,
    altitude_meters DOUBLE PRECISION,
    velocity_knots DOUBLE PRECISION,
    heading_degrees DOUBLE PRECISION,
    vertical_rate DOUBLE PRECISION,
    timestamp TIMESTAMPTZ NOT NULL
);
```

```sql
CREATE TABLE weather_observations (
    id SERIAL PRIMARY KEY,
    latitude DOUBLE PRECISION NOT NULL,
    longitude DOUBLE PRECISION NOT NULL,
    timestamp TIMESTAMPTZ NOT NULL,
    precipitation_mm DOUBLE PRECISION,
    weathercode SMALLINT
);

CREATE TABLE weather_condition (
    code SMALLINT PRIMARY KEY,
    description TEXT
);
```

A CTE is a temporary named result set defined with WITH. It makes complex queries readable.
Think of CTEs as named scratch paper for your query. You break the problem into steps, then combine them.
```sql
BEGIN;

WITH raw AS (
    SELECT id, timestamptz,
           jsonb_array_elements(raw_json->'states') AS state
    FROM flight_json_data
),
parsed AS (
    SELECT
        state->>0 AS icao24,
        state->>1 AS callsign,
        state->>2 AS country,
        -- OpenSky state vectors are positional: index 5 is longitude, 6 is latitude
        (state->>6)::DOUBLE PRECISION AS latitude,
        (state->>5)::DOUBLE PRECISION AS longitude,
        (state->>7)::DOUBLE PRECISION AS altitude_meters,
        (state->>9)::DOUBLE PRECISION AS velocity_knots,
        (state->>10)::DOUBLE PRECISION AS heading_degrees,
        (state->>11)::DOUBLE PRECISION AS vertical_rate,
        timestamptz
    FROM raw
)
INSERT INTO flights (
    icao24, callsign, country, latitude, longitude,
    altitude_meters, velocity_knots, heading_degrees,
    vertical_rate, timestamp
)
SELECT icao24, callsign, country, latitude, longitude,
       altitude_meters, velocity_knots, heading_degrees,
       vertical_rate, timestamptz
FROM parsed;

DELETE FROM flight_json_data;

COMMIT;
```

And the same pattern for weather:

```sql
BEGIN;

WITH raw AS (
    SELECT raw_json, timestamptz
    FROM weather_json_data
),
parsed AS (
    SELECT
        (raw_json->>'latitude')::DOUBLE PRECISION  AS latitude,
        (raw_json->>'longitude')::DOUBLE PRECISION AS longitude,
        (raw_json->'current'->>'precipitation')::DOUBLE PRECISION
            AS precipitation_mm,
        (raw_json->'current'->>'weathercode')::SMALLINT
            AS weathercode,
        timestamptz AS ts
    FROM raw
)
INSERT INTO weather_observations (
    latitude, longitude, precipitation_mm,
    weathercode, timestamp
)
SELECT latitude, longitude, precipitation_mm,
       weathercode, ts
FROM parsed;

DELETE FROM weather_json_data;

COMMIT;
```

The BEGIN / COMMIT wrapper ensures atomicity:
This is not optional. This is how adults write SQL.
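The same insert-then-delete atomicity can be seen in miniature with Python's sqlite3, whose connection context manager wraps statements in a single transaction (the table names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_data (payload TEXT);
    CREATE TABLE clean_data (payload TEXT);
    INSERT INTO raw_data VALUES ('row1'), ('row2');
""")

try:
    with conn:  # BEGIN ... COMMIT, or ROLLBACK if anything raises
        conn.execute("INSERT INTO clean_data SELECT payload FROM raw_data")
        conn.execute("DELETE FROM raw_data")
except sqlite3.Error:
    pass  # on failure, neither the insert nor the delete takes effect

print(conn.execute("SELECT COUNT(*) FROM clean_data").fetchone()[0])  # 2
print(conn.execute("SELECT COUNT(*) FROM raw_data").fetchone()[0])    # 0
```

If the DELETE ran outside the transaction and the INSERT failed, you would silently lose data – which is exactly what BEGIN/COMMIT prevents.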
Deploy the transformation as its own Railway service:

1. Create a new service named weather_flight_db_transform from the weather_flight_db_transform repo
2. Edit its clean.sql file (click the pencil icon) to hold the transformation SQL
3. In the Variables tab, add:

| Key | Value |
|---|---|
| DATABASE_URL | ${{Postgres.DATABASE_PUBLIC_URL}} |

4. Set the cron schedule to 0 * * * * (runs hourly, at the top of the hour)

After the first cron run:
If you see structured rows, the transformation is working. Raw JSON in, clean tables out. Beautiful.
A View is a virtual table based on a SQL query. It does not store data – it runs the query on demand.
Why views are great for dashboards:
This view joins flights and weather to answer our research question: Are there “no-fly windows” correlated with weather?
```sql
CREATE OR REPLACE VIEW flight_weather AS
SELECT
    f.timestamp,
    f.callsign,
    f.altitude_meters,
    f.velocity_knots,
    f.latitude AS flight_latitude,
    f.longitude AS flight_longitude,
    w.precipitation_mm,
    w.weathercode,
    w.latitude AS weather_latitude,
    w.longitude AS weather_longitude
FROM flights f
JOIN weather_observations w
    ON date_trunc('minute', f.timestamp)
     = date_trunc('minute', w.timestamp)
ORDER BY f.timestamp DESC;
```

For the Grafana service, set these variables:

| Key | Value |
|---|---|
| GF_SECURITY_ADMIN_USER | grafanareader |
| GF_DEFAULT_INSTANCE_NAME | grafanapg |
| GF_SECURITY_ADMIN_PASSWORD | your-password |
After deployment, enable App Sleeping for the Grafana service in Railway. This makes Grafana sleep when idle and wake on demand. Your wallet will appreciate this.
Do not give Grafana the keys to the kingdom. In Beekeeper, create a restricted grafanareader login role and grant it SELECT on the flight_weather view – nothing else. This user can only read that one view. Principle of least privilege in action.
In Grafana, add a PostgreSQL data source. Use the private network host (PGHOST is postgres.railway.internal):

| Field | Value |
|---|---|
| Host | postgres.railway.internal:5432 |
| Database | railway |
| User | grafanareader |
| Password | your_password |
Build a panel: choose the flight_weather view as the table, then select timestamp and a data column (e.g., altitude_meters or precipitation_mm). You now have a live dashboard powered by your pipeline. Go impress someone.
By default, Grafana stores dashboards in SQLite inside the container. Container restarts = dashboards gone. Let us fix that.
Create a grafana database in your Railway Postgres instance (via Beekeeper), then add these variables to the Grafana service:

```
GF_DATABASE_TYPE=postgres
GF_DATABASE_HOST=postgres.railway.internal:5432
GF_DATABASE_NAME=grafana
GF_DATABASE_USER=postgres
GF_DATABASE_SSL_MODE=disable
GF_DATABASE_PASSWORD=<your-db-password>
GF_SERVER_ROOT_URL=<your-grafana-url>
```
Your dashboard is now publicly viewable. Data storytelling at its finest.
Dashboards are great for humans. But what about other applications?
PostgREST turns your Postgres database into a RESTful API automatically. No backend code required.
Key features: automatic REST endpoints for tables and views, filtering and ordering via query parameters, and access control driven by database roles.

Deploy a new service from the postgrest/postgrest Docker image and set:

| Key | Value |
|---|---|
| PGRST_DB_ANON_ROLE | web_anon |
| PGRST_DB_SCHEMA | api |
| PGRST_DB_URI | ${{DATABASE_URL}} |
| PGRST_OPENAPI_SERVER_PROXY_URI | <your public hostname> |
We expose data through a dedicated api schema with a restricted role:
```sql
-- Create the API schema
CREATE SCHEMA IF NOT EXISTS api;

-- Create a view in the API schema
CREATE OR REPLACE VIEW api.general_aviation_weather_view AS
SELECT * FROM flight_weather;

-- Create an anonymous read-only role
CREATE ROLE web_anon NOLOGIN;
GRANT USAGE ON SCHEMA api TO web_anon;
GRANT SELECT ON api.general_aviation_weather_view TO web_anon;

-- Create the authenticator role that PostgREST logs in as
CREATE ROLE authenticator NOINHERIT LOGIN PASSWORD 'change-me';
GRANT web_anon TO authenticator;
```

Instead of connecting as the all-powerful postgres user, PostgREST logs in as a dedicated authenticator role:

- NOINHERIT – cannot directly access anything
- LOGIN – can authenticate from PostgREST
- GRANT web_anon – can act as the read-only role

Separation of concerns. Even your database roles practice good architecture.
Visit your PostgREST URL with the view name appended:
https://<your-domain>.up.railway.app/general_aviation_weather_view
You should see JSON data. If you do, your database is now a REST API. That was disturbingly easy.
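Beyond plain GETs, PostgREST supports filtering, ordering, and limiting through query-string operators in the form column=operator.value. A Python sketch that builds such a URL (the domain is a placeholder; the column names come from our view):

```python
from urllib.parse import urlencode

base = "https://your-domain.up.railway.app/general_aviation_weather_view"

# PostgREST filter syntax: column=operator.value
params = {
    "altitude_meters": "gt.1000",  # only rows above 1000 m
    "order": "timestamp.desc",     # newest first
    "limit": "5",                  # at most 5 rows
}
url = base + "?" + urlencode(params)
print(url)
```

Pasting the printed URL into a browser (against your real domain) returns the filtered JSON directly.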
An API without documentation is like a library without a catalog. Swagger UI provides an interactive, browsable documentation page for your API.
PostgREST automatically generates an OpenAPI spec. Swagger reads it. They are best friends.
Deploy a new service from the swaggerapi/swagger-ui Docker image and set:

| Key | Value |
|---|---|
| API_URL | ${{postgrest.PGRST_OPENAPI_SERVER_PROXY_URI}} |
This tells Swagger where to find the OpenAPI spec generated by PostgREST.
Open the public URL. You should see the Swagger UI loaded with your API documentation. You can browse and try out the general_aviation_weather_view endpoint.

Your API is now self-documenting, testable, and shareable. Chef's kiss.

| Stage | Service | Purpose |
|---|---|---|
| 1 | PostgreSQL | Central data store |
| 2a | web2db (flights) | Flight data ingestion |
| 2b | web2db (weather) | Weather data ingestion |
| 3 | db_transform | JSON to structured tables |
| 4 | Grafana | Visualization dashboard |
| 5 | PostgREST | REST API from database |
| 6 | Swagger UI | API documentation portal |