
DATA 503: Fundamentals of Data Engineering
March 4, 2026
Web scraping is the process of automatically extracting data from websites. Instead of manually copying and pasting, we configure tools to do it for us.
Why? Because life is too short to copy 10,000 rows by hand. Your fingers will thank you. 🙏
Before you go scraping the entire internet:
- Check the site's robots.txt file (e.g., https://example.com/robots.txt) to see which paths the site asks crawlers to avoid.

We are responsible data engineers, not chaos agents. 😇
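To see what robots.txt actually encodes, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The rules are inlined as a made-up example so it runs offline; a real check would point the parser at the site's actual robots.txt URL.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, inlined so the example runs offline.
# A real site's rules live at https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this agent scrape this URL?
print(rp.can_fetch("*", "https://example.com/search"))     # True
print(rp.can_fetch("*", "https://example.com/private/x"))  # False
```

This is the same question a polite scraper (human or automated) should answer before fetching any page.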
Web Scraper is a free Chrome/Firefox extension that lets you scrape websites without writing code. It uses a point-and-click interface to define what data to extract.
Think of it as "I cannot code a spider, but I can click on things." 🕷️
Install it from the Chrome Web Store or Firefox Add-ons (search for "Web Scraper", or follow the links on webscraper.io).
After installation, restart your browser (or just use new tabs).
Open your browser's DevTools (F12, or Cmd+Opt+I on Mac) and look for the Web Scraper tab; if it is hidden, use the >> arrows to find it. That is where all the magic happens. ✨
A Sitemap is your scraping project. It defines the start URL(s) to visit and the tree of selectors that extract data.
Think of it like a treasure map, except the treasure is data and X marks the CSS selector. 🗺️
Selectors are the building blocks of your sitemap. Web Scraper has three categories:
| Category | Purpose | Examples |
|---|---|---|
| Data extraction | Pull data from elements | Text, Link, Image, Table, HTML |
| Link navigation | Follow links to other pages | Link selector |
| Element grouping | Group related data together | Element selector, Element click, Element scroll |
Selectors are organized in a tree structure. The scraper executes them top-down:

Parent selectors define scope. Child selectors extract data within that scope.
When a selector has Multiple checked, it matches every element on the page that fits the CSS selector, not just the first one.

For example: a product listing page has 25 items. An Element selector with Multiple checked will find all 25. 📋
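To make the parent/child scoping concrete, here is a small sketch of what the scraper effectively does, using Python's standard-library `xml.etree.ElementTree` on a tiny made-up listing page (the HTML, class names, and field names here are illustrative, not Discogs' real markup):

```python
import xml.etree.ElementTree as ET

# A tiny, made-up listing page (well-formed so the stdlib parser accepts it).
html = """<html><body>
  <div role="listitem"><a class="title">Album A</a><a class="artist">Artist A</a></div>
  <div role="listitem"><a class="title">Album B</a><a class="artist">Artist B</a></div>
</body></html>"""

root = ET.fromstring(html)

# "Multiple" checked = findall: every matching element, not just the first.
rows = []
for item in root.findall(".//div[@role='listitem']"):  # parent selector: defines scope
    rows.append({
        "album":  item.find("a[@class='title']").text,   # child selectors only look
        "artist": item.find("a[@class='artist']").text,  # inside the parent element
    })

print(rows)
# [{'album': 'Album A', 'artist': 'Artist A'}, {'album': 'Album B', 'artist': 'Artist B'}]
```

The loop body is the "child selectors extract data within that scope" rule: each child search starts from `item`, never from the whole page.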
Web Scraper's point-and-click Select tool lets you click elements on the page, and it generates a matching CSS selector for you.
Keyboard shortcuts (after clicking Select):
| Key | Action |
|---|---|
| `P` | Expand to parent element |
| `C` | Narrow to child element |
| `S` | Select without clicking (for dynamic elements) |
| `Shift` | Select multiple element groups |
Always use Data Preview and Element preview before running a scrape:
If the preview looks wrong, the scrape will be wrong. Trust the preview. 🔍
The Element (scroll) selector is used for pages that load content dynamically as you scroll (infinite scroll or lazy loading). It automatically scrolls the page to trigger loading, then groups each loaded item as a container for child selectors.
This is the selector type we will use for Discogs search results, since the page loads items as you scroll down.
Instead of relying on pagination buttons, use range URLs:
| Pattern | Generates |
|---|---|
| `https://example.com/page/[1-3]` | `/page/1`, `/page/2`, `/page/3` |
| `https://example.com/page/[001-100]` | `/page/001`, `/page/002`, … (zero-padded) |
| `https://example.com/page/[0-100:10]` | `/page/0`, `/page/10`, `/page/20`, … |
This is how we handle multi-page scrapes without needing a pagination selector.
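The expansion logic behind those patterns can be sketched in a few lines of Python. This is my own reimplementation for illustration (the function name and signature are made up, not the extension's code):

```python
# Sketch of what Web Scraper's range-URL syntax expands to.
def expand_range(base, start, stop, step=1, pad=0):
    """Yield base + number for each value in [start, stop], zero-padded to `pad` digits."""
    for n in range(start, stop + 1, step):
        yield base + str(n).zfill(pad)

# [1-3]: plain numeric range
print(list(expand_range("https://example.com/page/", 1, 3)))
# ['https://example.com/page/1', 'https://example.com/page/2', 'https://example.com/page/3']

# [0-30:10]: range with a step
print(list(expand_range("https://example.com/page/", 0, 30, step=10)))
# ['https://example.com/page/0', 'https://example.com/page/10', 'https://example.com/page/20', 'https://example.com/page/30']

# [001-100]: zero-padded, shown here via the first generated URL
print(next(expand_range("https://example.com/page/", 1, 100, pad=3)))
# https://example.com/page/001
```

The key detail is that the zero-padded form pads every number to the width of the pattern, which matters for sites whose URLs require fixed-width page numbers.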
We are going to scrape the Most Collected Releases from Discogs, the world’s largest music database. 🎵
We want: album name, artist name, and then we will follow the link to each album’s detail page to grab the average rating and number of ratings. This is a two-level scrape.
Our scrape has two levels: the search results page and each album’s detail page:

The element selector (Element scroll) handles grouping each search result card. Under it, we extract text (album_name, artist_name) and follow links (album_link) to detail pages where we grab ratings.
1. Open DevTools (F12 or Cmd+Opt+I) and go to the Web Scraper tab
2. Create a new sitemap named `discogs5`
3. Set the Start URL to `https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-2]`

We use `[1-2]` to scrape the first 2 pages (~50 results). You can increase this later, but start small when testing. 🧪
This groups each search result card on the page and handles scroll-based loading:
- ID: `element`
- Type: Element scroll
- Selector: `div[role='listitem']`
- Check Multiple

Why Element (scroll)? Discogs search results load dynamically. The Element scroll selector automatically scrolls to load all items and groups each result card as a container for child selectors.
Navigate into the element selector (click on it), then:
- ID: `album_name`
- Type: Text
- Selector: `a.line-clamp-2`

Use Data Preview to verify it is grabbing album names correctly. ✅
Still inside the element selector:
- ID: `artist_name`
- Type: Text
- Selector: `a.block`

Data Preview should now show artist names alongside album names.
This is the key step. We need to follow the link to each album’s detail page to get ratings:
Inside `element`, click Add new selector:

- ID: `album_link`
- Type: Link
- Selector: `a.group`
- Check Multiple

This tells the scraper: "For each result on the page, follow this link to the detail page." The child selectors under `album_link` will extract data from those detail pages. 🔗
Navigate into the album_link selector. Now we define what to extract from each album’s detail page.
Average Rating:

- ID: `avg_rating`
- Type: Text
- Selector: `.section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)`

Number of Ratings:

- ID: `num_ratings`
- Type: Text
- Selector: `#release-stats li:nth-of-type(4) a`

Your sitemap should now look like this:
| Selector ID | Type | Parent | CSS Selector |
|---|---|---|---|
| `element` | Element scroll | `_root` | `div[role='listitem']` |
| `album_name` | Text | `element` | `a.line-clamp-2` |
| `artist_name` | Text | `element` | `a.block` |
| `album_link` | Link | `element` | `a.group` |
| `avg_rating` | Text | `album_link` | `.section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)` |
| `num_ratings` | Text | `album_link` | `#release-stats li:nth-of-type(4) a` |
If you prefer to skip building it manually, you can import the complete sitemap JSON:
```json
{
  "_id": "discogs5",
  "startUrl": [
    "https://www.discogs.com/search?sort=have%2Cdesc&type=release&page=[1-2]"
  ],
  "selectors": [
    {
      "id": "element",
      "parentSelectors": ["_root"],
      "selector": "div[role='listitem']",
      "multiple": true,
      "type": "SelectorElementScroll",
      "delay": 2000
    },
    {
      "id": "album_name",
      "multiple": false,
      "parentSelectors": ["element"],
      "selector": "a.line-clamp-2",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "artist_name",
      "multiple": false,
      "parentSelectors": ["element"],
      "selector": "a.block",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "album_link",
      "linkType": "linkFromHref",
      "multiple": true,
      "parentSelectors": ["element"],
      "selector": "a.group",
      "type": "SelectorLink",
      "version": 2
    },
    {
      "id": "avg_rating",
      "multiple": false,
      "parentSelectors": ["album_link"],
      "selector": ".section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "num_ratings",
      "multiple": false,
      "parentSelectors": ["album_link"],
      "selector": "#release-stats li:nth-of-type(4) a",
      "type": "SelectorText",
      "version": 2
    }
  ]
}
```

Before running the full scrape, always check:
- Data Preview on `album_name` and `artist_name`. Do they show real data?
- Open an album detail page and check the `avg_rating` and `num_ratings` element previews there.

If the preview looks wrong, the scrape will be wrong. Trust the preview. 🔎
A popup window will open and start loading pages. It will visit each search results page, extract album/artist names, then follow each album link to grab ratings from the detail pages. Do not close the popup. Go get coffee ☕; this is a two-level scrape, so it takes longer than a single-page scrape.
Once scraping is complete, refresh the data view and export the results as CSV.
Congratulations, you just built a two-level scraping pipeline with zero code. You scraped search results AND followed links to detail pages for extra data. Your future self who has to write Python scrapers will be jealous. 😎
| Field | Source | Description |
|---|---|---|
| `album_name` | Search results page | Album/release title |
| `artist_name` | Search results page | Artist or band name |
| `album_link` | Search results page | URL to the album detail page |
| `avg_rating` | Album detail page | Average user rating (e.g., 4.21) |
| `num_ratings` | Album detail page | Total number of user ratings |
This is our raw data. In a real pipeline, we would clean, transform, and load this into a database.
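As a taste of that cleaning step: everything Web Scraper exports is a string, and counts come back with thousands separators. Here is a minimal stdlib-only sketch on a made-up two-row export (the values are invented; the column names match our sitemap):

```python
import csv
import io

# Hypothetical raw export: all values are strings, counts have commas.
raw_csv = """album_name,artist_name,avg_rating,num_ratings
Album A,Artist A,4.21,"1,247"
Album B,Artist B,3.95,312
"""

cleaned = []
for row in csv.DictReader(io.StringIO(raw_csv)):
    row["avg_rating"] = float(row["avg_rating"])                  # "4.21" -> 4.21
    row["num_ratings"] = int(row["num_ratings"].replace(",", ""))  # "1,247" -> 1247
    cleaned.append(row)

print(cleaned[0])
```

Once the types are fixed, the rows are ready to load into a database table or a dataframe for analysis.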
Work individually or with a neighbor. You are going to build your own sitemap from scratch using what we just learned.
The Use Case: Discogs tracks which releases are the most collected by users worldwide. We want to find the most collected vinyl releases from the 2010s decade and pull rating data from each album’s detail page.
Start with this base URL in your browser:
https://www.discogs.com/search?sort=have%2Cdesc&type=release&year1=2010&year2=2020&format_exact=Vinyl
This applies the following filters:
| Filter | Value |
|---|---|
| Sort | Most Collected (have, descending) |
| Type | Release |
| Year Range | 2010 to 2020 |
| Format | Vinyl |
Open this URL in Chrome and verify you see vinyl releases sorted by collector count.
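If you are curious how that long URL relates to the filter table, here is a sketch that rebuilds it from its query parameters with the standard library. The parameter names (`year1`, `format_exact`, etc.) are read off the URL itself, not from any Discogs documentation:

```python
from urllib.parse import urlencode

# Filters from the table above, expressed as query parameters.
params = {
    "sort": "have,desc",      # Most Collected (have, descending)
    "type": "release",
    "year1": 2010,            # year range start
    "year2": 2020,            # year range end
    "format_exact": "Vinyl",
}

# urlencode percent-escapes the comma in "have,desc" as %2C.
url = "https://www.discogs.com/search?" + urlencode(params)
print(url)
# https://www.discogs.com/search?sort=have%2Cdesc&type=release&year1=2010&year2=2020&format_exact=Vinyl
```

This is handy when you want to generate many filtered start URLs (say, one per decade) instead of editing the string by hand.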
Build a Web Scraper sitemap that:

- Uses an Element scroll selector with `div[role='listitem']` to group each result
- Extracts the album name and artist name from each result
- Follows each album link to pull rating data from the detail page

Bonus fields (if you finish early): the release `year` and `format` from the results page, and the `num_have` / `num_want` counts from the detail page.
Hints:

- Reuse the structure from the `discogs5` demo we just built
- The element selector is still `div[role='listitem']`
- If the highlight grabs the wrong element, use the P (parent) or C (child) keys while the Select tool is active

When you export to CSV, each row should have:
| Column | Example Value |
|---|---|
| `artist_name` | Adele |
| `album_name` | 21 |
| `album_link` | https://www.discogs.com/release/… |
| `avg_rating` | 4.18 |
| `num_ratings` | 1,247 |
You have 15 minutes. When you are done (or stuck), we will compare sitemaps. ⏱️
Here is a working sitemap for this exercise. You can import it via Create new sitemap > Import Sitemap:

The solution sitemap JSON is available as top100_2010-2020.json on the course site.
| Selector ID | Type | Parent | CSS Selector |
|---|---|---|---|
| `element` | Element scroll | `_root` | `div[role='listitem']` |
| `album_name` | Text | `element` | `a.line-clamp-2` |
| `artist_name` | Text | `element` | `a.block` |
| `year` | Text | `element` | `span.block.text-xs` |
| `format` | Text | `element` | `p.text-xs.truncate` |
| `album_link` | Link | `element` | `a.group` |
| `avg_rating` | Text | `album_link` | `.section_Odw8o div div ul:nth-of-type(1) span:nth-of-type(2)` |
| `num_ratings` | Text | `album_link` | `#release-stats li:nth-of-type(4) a` |
| `num_have` | Text | `album_link` | `#release-stats li:nth-of-type(1) a` |
| `num_want` | Text | `album_link` | `#release-stats li:nth-of-type(2) a` |