
Lecture 06-2: Command Line Skills for Data Detectives
DATA 503: Fundamentals of Data Engineering
This supplemental lecture reviews core command-line navigation (cd, ls, paths) with visual diagrams, then builds hands-on proficiency with the text-processing commands needed for the Command Line Mystery homework: grep, sed, awk, head, tail, cat, wc, sort, uniq, cut, pipes, and redirection. Students follow along inside the lucascordova/dataeng Docker container using a practice dataset downloaded with wget.
The Filesystem: Your Map
The Directory Tree
Every Unix system is a tree of directories (folders) starting from the root /:
- / is the root – the top of everything
- /home/dst is your home directory (also called ~)
- /data is where we mounted our local files
Where Am I? pwd
pwd = Print Working Directory. It tells you your current location as an absolute path:
$ pwd
/home/dst

Think of it like GPS for your terminal. When you are lost, pwd is your friend.
What Is Here? ls
ls lists what is in the current directory:
$ ls
$ ls -l # long format (permissions, size, date)
$ ls -la # include hidden files (starting with .)
$ ls -lh # human-readable file sizes
$ ls /data # list a specific directory

Common options:
| Option | What It Shows |
|---|---|
| -l | Long format (one file per line with details) |
| -a | All files including hidden (dotfiles) |
| -h | Human-readable sizes (KB, MB, GB) |
| -F | Append / to directories, * to executables |
Absolute vs Relative Paths
Absolute Path – starts from root /
$ cat /data/ch02/movies.txt

Works from anywhere. Like a full street address.
Relative Path – starts from where you are
$ cd /data
$ cat ch02/movies.txt

Depends on your current directory. Like saying “two blocks north.”

Visual: Moving Around
Starting at /home/dst:

The .. always means “one level up.” You can chain them: ../../ means “two levels up.”
Special Directory Symbols
| Symbol | Meaning | Example |
|---|---|---|
| / | Root directory | cd / |
| ~ | Home directory | cd ~ or just cd |
| . | Current directory | ls . (same as ls) |
| .. | Parent directory | cd .. |
| - | Previous directory | cd - (toggle back) |
These work everywhere – in cd, ls, cat, cp, mv, any command that takes a path.
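You can verify the symbols yourself in a throwaway directory. This is just a sketch; the directory name /tmp/symbols_demo is invented for the exercise:

```shell
# Round trip through the special symbols in a throwaway directory.
mkdir -p /tmp/symbols_demo/inner
cd /tmp/symbols_demo/inner
pwd        # /tmp/symbols_demo/inner
cd ..      # up to the parent
pwd        # /tmp/symbols_demo
cd -       # back to the previous directory (cd - also prints it)
pwd        # /tmp/symbols_demo/inner
cd ~       # jump home from anywhere
```

Run it line by line and watch pwd change; the `cd -` toggle is especially handy when you are bouncing between two deep directories.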
Knowledge Check 1
You are currently in /data/ch02. What is the absolute path you end up at after running these commands?
cd ..
cd ch10
cd ../ch02

Answer: /data/ch02
Step by step:
- cd .. – moves up to /data
- cd ch10 – moves into /data/ch10
- cd ../ch02 – moves up to /data, then into ch02 = /data/ch02
You ended up right where you started!
Knowledge Check 2
You are in /home/dst/projects. Which of these commands will list the files in /data?

A) ls data
B) ls /data
C) ls ../../data
D) Both B and C

Answer: B only.

B is an absolute path – it works from anywhere. C looks tempting but resolves to the wrong place: from /home/dst/projects, .. is /home/dst and ../.. is /home, so ../../data means /home/data, which does not exist. To reach /data with a relative path you would need ls ../../../data (up three levels to /, then into data). A fails because there is no data directory inside /home/dst/projects.
Creating and Removing Directories
$ mkdir practice # create a directory
$ mkdir -p a/b/c # create nested directories
$ rmdir practice # remove empty directory
$ rm -r a # remove directory and everything in it

The -p flag on mkdir creates all parent directories as needed. Without it, mkdir a/b/c fails if a/b does not exist.
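A minimal sketch of these commands in action (the proj name is invented for illustration):

```shell
cd /tmp
mkdir -p proj/data/raw   # -p creates proj and proj/data along the way
ls proj/data             # raw
rmdir proj/data/raw      # rmdir removes only EMPTY directories
rm -r proj               # rm -r removes the directory and all its contents
```

Try running `mkdir proj/data/raw` without `-p` first: it fails because proj does not exist yet.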
Hands-On Setup: Practice Dataset
Find Your Book Data
Last week you downloaded the data from the Data Science at the Command Line book. It should already be inside your /data directory. Let us find it:
$ ls /data

Depending on how you unzipped last week, your chapter directories might be in different places:
# If you unzipped directly into /data:
$ ls /data/ch02
# If you kept the outer folder:
$ ls /data/book_data/ch02
# If you used a different name:
$ ls /data/data-science-at-the-command-line-master/ch02

Use ls to poke around and find where ch02 lives. Once you find it, cd into it:
$ cd /data/ch02 # adjust this path to match YOUR layout
$ ls

You should see files like movies.txt, dates.txt, and other text files.
Verify Your Setup
$ cat movies.txt

If you see a list of movies, you are good. If you get “No such file or directory,” you are in the wrong place. Use pwd to check where you are, then ls to look around.
This is your first real navigation exercise. Finding files by exploring directory structures is exactly what you will do in the Command Line Mystery homework.
Grab a Larger Dataset
The book data is great for small examples, but we also want a real-world dataset. Let us download an Apache web server access log using wget:
cd /tmp
wget https://raw.githubusercontent.com/elastic/examples/master/Common%20Data%20Formats/apache_logs/apache_logs
ls -lh apache_logs

That is it. One command, one file. wget saves the file using the name from the URL (apache_logs) by default.
wget – Download Files from the Web
wget is a command-line tool for downloading files. It is simpler than curl for straightforward downloads:
$ wget https://example.com/data.csv

Common options:
| Flag | Meaning |
|---|---|
| -O filename | Save as a specific filename (capital O) |
| -q | Quiet mode (no progress output) |
| -P dir/ | Save to a specific directory |
| --no-check-certificate | Skip SSL verification (use with caution) |
wget vs curl
Both download files, but they have different strengths:
| Feature | wget | curl |
|---|---|---|
| Default behavior | Saves to file | Prints to stdout |
| Recursive downloads | Yes (-r) | No |
| Resume interrupted downloads | Yes (-c) | Yes (-C -) |
| API requests (POST, headers) | Limited | Full support |
| Availability | Linux/Docker (common) | macOS/Linux (universal) |
Rule of thumb: Use wget when you just need to download a file. Use curl when you need to interact with an API or need fine-grained control over the request.
Rename for Convenience
The downloaded file is called apache_logs (no extension). Let us rename it so it is clear what it is:
mv apache_logs access.log
wc -l access.logNow you have a real Apache web server log file to explore. This is the kind of data you will encounter as data engineers.
Knowledge Check 3
What is the difference between these two commands?
wget https://example.com/data.csv
curl -O https://example.com/data.csv

Answer: They both download the file and save it as data.csv. wget saves to a file by default using the filename from the URL. curl -O (capital O) does the same thing – saves using the remote filename. Without -O, curl prints to stdout instead. For simple file downloads, wget is slightly more convenient. For API work, curl is the standard.
Core Commands: Reading Files
cat – Print Entire File
$ cat movies.txt

cat dumps the whole file to your terminal. Fine for small files. For large files, it will flood your screen.
Use it to combine files too:
$ cat file1.txt file2.txt > combined.txt

head and tail – Peek at Files
$ head movies.txt # first 10 lines
$ head -n 3 movies.txt # first 3 lines
$ tail movies.txt # last 10 lines
$ tail -n 5 movies.txt # last 5 lines
$ tail -n +2 data.csv # everything EXCEPT the first line

tail -n +2 is the classic trick to skip a header row in a CSV.
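A self-contained example of the header-skip trick, using a tiny made-up CSV (the file name and rows are invented):

```shell
# Two data rows under a header row.
printf 'name,score\nada,90\ngrace,95\n' > /tmp/scores.csv
tail -n +2 /tmp/scores.csv            # prints ada,90 then grace,95
tail -n +2 /tmp/scores.csv | wc -l    # 2 data rows (header excluded)
```

Note the asymmetry: `tail -n 2` means "last 2 lines," while `tail -n +2` means "from line 2 onward."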
wc – Count Things
$ wc movies.txt # lines, words, characters
$ wc -l movies.txt # just line count
$ wc -w movies.txt # just word count
$ wc -c movies.txt # just byte count

Quick sanity check: “How big is this file?”
$ wc -l access.logThe access.log should have around 10,000 lines – a realistic web server log file.
less – Page Through Large Files
$ less access.log

| Key | Action |
|---|---|
| Space / f | Next page |
| b | Previous page |
| /pattern | Search forward |
| n | Next search match |
| q | Quit |
Unlike cat, less does not load the entire file into memory. Use it for big files.
Knowledge Check 4
How would you print lines 20 through 25 of a file called data.txt?
Core Commands: Searching with grep
grep – Find Lines Matching a Pattern
grep is your search engine for files. It prints every line that matches your pattern:
$ grep "Star Wars" movies.txt

Essential flags:

| Flag | Meaning |
|---|---|
| -i | Case-insensitive search |
| -c | Count matches (don’t print them) |
| -n | Show line numbers |
| -v | Invert: show lines that do NOT match |
| -l | Show only filenames that contain a match |
| -r | Search recursively through directories |
| -w | Match whole words only |
grep Context Flags
Sometimes you need to see what is around a match:
$ grep -A 3 "error" access.log # 3 lines AFTER match
$ grep -B 2 "error" access.log # 2 lines BEFORE match
$ grep -C 2 "error" access.log # 2 lines BEFORE and AFTER

Think of it as: After, Before, Context.
grep with Pipes
The real power comes from chaining grep with other commands:
$ cat access.log | grep "404" | wc -l

This counts how many 404 (Not Found) errors are in the log.

$ grep "404" access.log | head -n 5

Show the first 5 lines that contain “404”.
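You can reproduce the counting pattern on a tiny made-up log (file name and contents invented for this sketch). Note that grep -c gives the same count without a pipe:

```shell
printf 'GET /a 200\nGET /b 404\nGET /c 404\n' > /tmp/mini.log
grep "404" /tmp/mini.log | wc -l   # 2
grep -c "404" /tmp/mini.log        # 2 — same count, no pipe needed
```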
Knowledge Check 5
Using the access.log file, write a command that counts how many requests came from the IP address 83.149.9.216.
Knowledge Check 6
How would you search for the word “CLUE” in a file called crimescene, but only show the matching lines?
Core Commands: Extracting with sed
sed – Stream Editor
sed processes text line by line. Most common use: print specific lines and find-and-replace.
Print a specific line:
$ sed -n '5p' movies.txt # print only line 5
$ sed -n '10,20p' movies.txt # print lines 10 through 20

Find and replace:

$ sed 's/old/new/' file.txt # replace first occurrence per line
$ sed 's/old/new/g' file.txt # replace ALL occurrences per line

The s/pattern/replacement/ syntax is the substitution command.
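A quick before-and-after on invented data shows why the g flag matters:

```shell
printf 'red fish, red car\n' > /tmp/colors.txt
sed 's/red/green/'  /tmp/colors.txt   # green fish, red car  (first match per line)
sed 's/red/green/g' /tmp/colors.txt   # green fish, green car (every match)
```

Also notice that sed prints the result to stdout; the original file is untouched unless you redirect the output or use an in-place flag.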
sed in Practice
Remove blank lines from a file:
$ sed '/^$/d' file.txt

Delete lines containing a pattern:

$ sed '/pattern/d' file.txt

Print only lines matching a pattern (like grep):

$ sed -n '/error/p' access.log

Why sed -n 'Np' Matters for the Mystery
In the Command Line Mystery, people have addresses like:
Annabel Church F 38 Buckingham Place, line 179
To look up that address, you need line 179 of the street file:
$ sed -n '179p' streets/Buckingham_Place

This is faster than opening the whole file. You jump straight to the line you need.
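You can practice the lookup without the mystery files; here seq fakes a 200-line street file (file name invented):

```shell
# Fake street file: line N reads "house N".
seq 200 | sed 's/^/house /' > /tmp/street
sed -n '179p' /tmp/street    # house 179
```

The -n flag suppresses sed's default behavior of printing every line, so only the explicitly p-rinted line 179 appears.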
Knowledge Check 7
A file called people has this entry:
Jeremy Bowers M 34 Dunstable Road, line 284
Write the command to see what is on line 284 of streets/Dunstable_Road.
Core Commands: cut, sort, uniq
cut – Extract Columns
cut pulls out specific columns from structured text:
$ cut -d',' -f1 data.csv # first field, comma-delimited
$ cut -d',' -f1,3 data.csv # fields 1 and 3
$ cut -d' ' -f1 access.log # first field, space-delimited
$ cut -c1-10 file.txt # characters 1 through 10

| Flag | Meaning |
|---|---|
| -d | Delimiter (comma, tab, space, etc.) |
| -f | Field number(s) to extract |
| -c | Character positions to extract |
sort – Sort Lines
$ sort names.txt # alphabetical sort
$ sort -n numbers.txt # numeric sort
$ sort -r names.txt # reverse sort
$ sort -t',' -k2 data.csv # sort by 2nd field, comma-delimited
$ sort -u names.txt # sort and remove duplicates

uniq – Remove Adjacent Duplicates
uniq only removes consecutive duplicates, so you almost always use it with sort:
$ sort names.txt | uniq # unique values
$ sort names.txt | uniq -c # count occurrences
$ sort names.txt | uniq -d # show only duplicates

The Classic Pipeline: cut | sort | uniq -c | sort -rn
This is the “top N” pattern. Find the most frequent values:
$ cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10

This answers: “What are the top 10 IP addresses in the access log?”
The pipeline:
- cut extracts the IP address (field 1)
- sort groups identical IPs together
- uniq -c counts consecutive duplicates
- sort -rn sorts by count, highest first
- head -10 shows the top 10
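The same pipeline run on a three-line invented log; the most frequent value rises to the top with its count:

```shell
printf '1.1.1.1 a\n2.2.2.2 b\n1.1.1.1 c\n' > /tmp/tiny.log
cut -d' ' -f1 /tmp/tiny.log | sort | uniq -c | sort -rn | head -1
# uniq -c prints the count first, so 1.1.1.1 appears with count 2
```

Try removing the first sort and watch uniq -c miscount: the duplicate IPs are no longer adjacent.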
Knowledge Check 8
Using the access.log, write a pipeline that finds the top 5 most requested URLs (URLs are typically the 7th field in a space-delimited Apache log).
Core Commands: awk
awk – The Swiss Army Knife
awk is a mini programming language for text processing. Basic use: print specific fields.
$ awk '{print $1}' access.log # print first field
$ awk '{print $1, $7}' access.log # print fields 1 and 7
$ awk -F',' '{print $2}' data.csv # comma-delimited, field 2

awk splits each line on whitespace by default. Use -F to change the delimiter.
awk with Conditions
Print only lines where a condition is true:
$ awk '$9 == 404 {print $7}' access.log

This prints the URL (field 7) for every line where the HTTP status code (field 9) is 404.

$ awk '$9 == 200 {count++} END {print count}' access.log

Count the number of successful (200) requests.
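Both one-liners can be checked on a tiny invented log where field 2 plays the role of the status code:

```shell
printf '/a 200\n/b 404\n/c 200\n' > /tmp/status.log
awk '$2 == 404 {print $1}' /tmp/status.log                    # /b
awk '$2 == 200 {count++} END {print count}' /tmp/status.log   # 2
```

The END block runs once after all lines are processed, which is what makes running totals possible.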
awk vs cut
| Feature | cut | awk |
|---|---|---|
| Speed | Faster for simple extraction | Slightly slower |
| Delimiter | Single character only | Any pattern, regex |
| Conditions | None | Full programming logic |
| Multiple delimiters | No | Handles whitespace naturally |
| Math | No | Yes |
Rule of thumb: use cut for simple extraction, awk for anything more complex.
Knowledge Check 9
Using awk, print the IP address and status code (fields 1 and 9) from access.log for all lines where the status code is 500.
Pipes and Redirection Revisited
The Pipe | – Connecting Commands
The pipe takes stdout of one command and feeds it as stdin to the next:

Each command does one small job. Together they answer: “What are the top 5 URLs returning 404?”
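A pipeline of the kind such a diagram describes can be sketched on an invented three-field log (method, URL, status); the exact commands here are one plausible reading, not the figure itself:

```shell
printf 'GET /a 404\nGET /b 404\nGET /a 404\nGET /c 200\n' > /tmp/p.log
grep " 404$" /tmp/p.log | cut -d' ' -f2 | sort | uniq -c | sort -rn | head -5
# /a (twice) tops the list; the 200 line never enters the pipeline
```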
Output Redirection
Overwrite (>):
$ grep "404" access.log > errors.txt

Creates or overwrites the file.

Append (>>):

$ echo "new line" >> errors.txt

Adds to the end of the file.

Suppress errors (2>/dev/null):

$ grep "x" missing.txt 2>/dev/null

Throws away error messages.

Redirect both (&>):

$ command &> output.txt

Captures both stdout and stderr.
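The first three behaviors in one runnable sketch (file names invented):

```shell
cd /tmp
echo "first"  > notes.txt    # > creates or overwrites
echo "second" >> notes.txt   # >> appends, so the file now has 2 lines
wc -l notes.txt
ls /no/such/dir 2>/dev/null  # the error message is discarded
echo "still running"         # ...and the script carries on
```

Note that &> is a bash-ism; the portable equivalent is `command > output.txt 2>&1`.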
Knowledge Check 10
Write a single pipeline that:
- Finds all lines containing “GET” in access.log
- Extracts just the URL (field 7)
- Counts how many unique URLs there are
Working with Archives
tar and zip – Packing and Unpacking
You will encounter compressed archives constantly as a data engineer:
zip/unzip:
$ unzip archive.zip # extract to current directory
$ unzip archive.zip -d folder/ # extract to specific folder
$ zip -r backup.zip folder/ # create a zip from a folder

tar (tape archive):

$ tar xzf archive.tar.gz # extract gzipped tar
$ tar xjf archive.tar.bz2 # extract bzip2 tar
$ tar czf backup.tar.gz folder/ # create gzipped tar
$ tar tf archive.tar.gz # list contents without extracting

Think of the tar flags as: x extract, c create, t list, z for gzip, f for filename.
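A full pack-and-unpack round trip; tar's -C flag extracts into a chosen directory (directory and file names invented):

```shell
cd /tmp
mkdir -p pack && echo "hello" > pack/a.txt
tar czf pack.tar.gz pack/     # create a gzipped archive
tar tf pack.tar.gz            # list: pack/ and pack/a.txt
mkdir -p out
tar xzf pack.tar.gz -C out    # extract into out/ instead of the current directory
cat out/pack/a.txt            # hello
```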
gunzip – Decompress Single Files
$ gunzip file.gz # decompress (replaces .gz file)
$ gunzip -k file.gz # decompress and keep original
$ gunzip -c file.gz > out.txt # decompress to stdout

Knowledge Check 11
You downloaded a file called dataset.tar.gz. Write the command to extract it into a directory called mydata/.
Putting It All Together: Mystery Prep
The Command Line Mystery
Your homework involves solving a murder mystery using only command-line tools. Here is the workflow you will follow:

Commands You Will Need
| Command | Mystery Use |
|---|---|
| cat | Read instructions, interviews |
| grep | Search crimescene for CLUE, search people, vehicles, memberships |
| grep -A | Show context lines after a vehicle match |
| sed -n 'Np' | Look up a specific line in a street file |
| wc -l | Count matches |
| Pipes (\|) | Chain commands together |
| cd | Navigate into mystery/ and its subdirectories |
| ls | See what files and directories are available |
Practice: Mini Mystery
Let us simulate the mystery workflow with our book data. Navigate back to ch02:
cd /data/ch02 # adjust to match your layout
ls

Try these patterns:
# Search for something in a file
grep "Matrix" movies.txt
# Count occurrences
grep -c "the" movies.txt
# Show line numbers
grep -n "Star" movies.txt

Knowledge Check 12
You are in the mystery/ directory. The people file shows:
Joe Germuska M 36 Plainfield Street, line 275
Write the commands to:
- Look up line 275 of streets/Plainfield_Street
- Read the interview file that is referenced there
Knowledge Check 13
A clue says the suspect drives a blue Honda with a plate starting with “L337”. The vehicles file has multi-line records. Write a command to find all matching vehicles and show 5 lines of context after each match.
Knowledge Check 14
You narrowed suspects to two people: Joe Germuska and Jeremy Bowers. You need to check if each person is a member of the “AAA” club. The membership file is at memberships/AAA. Write the commands.
Advanced Patterns
Wildcards in File Paths
The * matches any characters in filenames:
$ ls *.txt # all .txt files
$ grep "Bowers" memberships/* # search ALL membership files
$ cat streets/B* # all streets starting with B

This is called globbing and it is handled by the shell, not by the command.
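A small demo that the shell, not the command, expands the glob; echo receives the already-expanded filenames (directory name invented):

```shell
cd /tmp && mkdir -p glob_demo && cd glob_demo
touch a.txt b.txt c.log
echo *.txt       # a.txt b.txt — echo never saw the literal *
ls *.txt         # same two files, listed by ls
```

If no file matches, most shells pass the pattern through literally, which is why a typo'd glob can produce a "No such file" error from the command itself.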
find – Locate Files
$ find . -name "*.txt" # find all .txt files
$ find . -name "*.log" -size +1M # .log files over 1MB
$ find . -type d # find all directories

xargs – Build Commands from Input
When you need to run a command on each result from another command:
$ find . -name "*.txt" | xargs wc -lThis counts lines in every .txt file found by find.
Knowledge Check 15
Write a single command that searches ALL files in the memberships/ directory for “Jeremy Bowers” and shows which files contain a match.
Command Cheat Sheet
File Reading
| Command | What It Does |
|---|---|
| cat file | Print entire file |
| head -n N file | First N lines |
| tail -n N file | Last N lines |
| tail -n +N file | Everything from line N onward |
| less file | Page through file |
| wc -l file | Count lines |
Searching and Extracting
| Command | What It Does |
|---|---|
| grep "pattern" file | Find matching lines |
| grep -i | Case-insensitive |
| grep -c | Count matches |
| grep -n | Show line numbers |
| grep -v | Invert (non-matching lines) |
| grep -A N | N lines after match |
| grep -B N | N lines before match |
| grep -l | Show filenames only |
| grep -r | Recursive search |
| sed -n 'Np' | Print line N |
| sed 's/old/new/g' | Find and replace |
Transforming
| Command | What It Does |
|---|---|
| cut -d',' -f1 | Extract field 1 (comma-delimited) |
| sort | Sort lines alphabetically |
| sort -n | Sort numerically |
| sort -rn | Sort numerically, descending |
| sort -u | Sort and deduplicate |
| uniq | Remove adjacent duplicates |
| uniq -c | Count duplicates |
| awk '{print $1}' | Print first field |
| awk -F',' '{print $2}' | Print field 2 (comma-delimited) |
Plumbing
| Syntax | What It Does |
|---|---|
| cmd1 \| cmd2 | Pipe output to next command |
| cmd > file | Write output to file (overwrite) |
| cmd >> file | Append output to file |
| cmd 2>/dev/null | Suppress error messages |
| wget URL | Download a file |
| curl -L -o file URL | Download a file (follow redirects) |
| unzip file.zip | Extract zip archive |
| tar xzf file.tar.gz | Extract gzipped tar |
What Is Next
Your Mission
You now have all the commands you need to solve the Command Line Mystery.
Homework 6: Navigate to the mystery directory and follow the instructions. Use grep, sed, cat, pipes, and your navigation skills to find the killer.
cd /data/command-line-mystery/mystery
cat ../instructions

The hints are there if you get stuck (cat ../hint1, cat ../hint2, etc.), but try to solve it on your own first. You have the skills.
Good luck, detective.
References
Janssens, J. (2021). Data Science at the Command Line (2nd ed.). O’Reilly Media. https://datascienceatthecommandline.com
Veltman, N. The Command Line Murders. https://github.com/veltman/clmystery
GNU Coreutils Manual. https://www.gnu.org/software/coreutils/manual
Robbins, A. (2002). sed & awk (2nd ed.). O’Reilly Media.
Shotts, W. (2019). The Linux Command Line (2nd ed.). No Starch Press.