
DATA 503: Fundamentals of Data Engineering
February 2, 2026
The command line has been around for over 50 years, yet it remains one of the most powerful tools in computer science, data science, and beyond. Today we explore why.
How can a technology that is more than 50 years old be essential to a field as young as data science?
Today, data scientists can choose from an overwhelming collection of exciting technologies.
And yet, the command line remains essential. Let us understand why.
The command line offers five key advantages for data science:
| Advantage | Description |
|---|---|
| Agile | REPL environment allows rapid iteration and exploration |
| Augmenting | Integrates with and amplifies other technologies |
| Scalable | Automate, parallelize, and run on remote servers |
| Extensible | Create your own tools in any language |
| Ubiquitous | Available on virtually every system |
Data science is interactive and exploratory. The command line supports this through:
- a read-eval-print loop (REPL) for rapid iteration
- closeness to the filesystem, where your data lives
The command line does not replace your current workflow. It amplifies it.
Three ways it augments your tools:
Every technology has strengths and weaknesses. Knowing multiple technologies lets you use the right one for each task.
On the command line, you do things by typing. In a GUI, you point and click.
Everything you type can be saved, repeated, and automated.
Pointing and clicking is hard to automate. Typing is not.
The command line itself is over 50 years old, but new command-line tools are developed daily.
The command line comes with any Unix-like operating system, such as macOS and Ubuntu.
This technology has been around for more than five decades. It will be here for five more.
We use a practical definition of data science devised by Hilary Mason and Chris H. Wiggins [1].
OSEMN (pronounced “awesome”) defines data science in five steps: Obtain, Scrub, Explore, Model, and iNterpret.
The OSEMN model shows data science as an iterative, nonlinear process.
In practice, you move back and forth between steps. After modeling, you may return to scrubbing to adjust features.
Obtain: download data, query APIs, extract it from files, or generate it yourself.
Common formats: plain text, CSV, JSON, HTML, XML
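For example, downloading a CSV file with curl (the URL and filename here are placeholders):

```bash
# -s: quiet mode, -L: follow redirects, -o: save to a local file
curl -sL https://example.com/movies.csv -o movies.csv
```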
Scrub: clean the data before analysis.
By a common estimate, 80% of the work in a data project is cleaning the data.
Explore: get to know your data.
Model: create statistical models of the data.
iNterpret: draw conclusions, evaluate results, and communicate findings.
This is where human judgment matters most.
We use Docker to ensure everyone has the same environment with all necessary tools.
What is Docker? Docker packages software into isolated containers, so it runs the same way on every machine.
Why Docker for this course? Everyone gets the same tools in the same environment, regardless of operating system.
Download Docker Desktop from docker.com
Install and launch Docker Desktop
Open your terminal (or Command Prompt on Windows)
Pull the course Docker image:
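```bash
docker pull lucascordova/dataeng
```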
Run the Docker image:
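```bash
# --rm removes the container on exit; -it gives you an interactive terminal
docker run --rm -it lucascordova/dataeng
```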
You are now inside an isolated environment with all the necessary command-line tools installed.
Test that it works:
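Any simple command will do, for example:

```bash
seq 3
# 1
# 2
# 3
```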
To get data in and out of the container, mount a local directory:
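```bash
# -v maps your current working directory to /data inside the container
docker run --rm -it -v "$(pwd)":/data lucascordova/dataeng
```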
The -v option maps your current directory to /data inside the container.
When you are done, exit the container by typing:
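```bash
exit
```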
The container is removed (due to --rm flag), but any files you saved to the mounted directory persist on your computer.
Command line on macOS
The terminal is the application where you type commands. It enables you to interact with the shell.
Command line on Ubuntu
Ubuntu is a distribution of GNU/Linux. The same commands work across different Unix-like systems.
The environment consists of four layers:

| Layer | Purpose |
|---|---|
| Command-line tools | Programs you execute |
| Terminal | Application for typing commands |
| Shell | Program that interprets commands (bash, zsh) |
| Operating system | Executes tools, manages hardware |
When you see text like this in the book or slides:
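```
$ seq 3
1
2
3
```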
- $ is the prompt (do not type it)
- seq 3 is the command you type
- 1, 2, 3 is the output

The prompt may show additional information (username, directory, time), but we show only $ for simplicity.
Try these commands in your terminal:
The tool pwd outputs the name of your current directory.
The tool cd changes directories. Values after the command are called arguments or options.
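A short session (the directories here are illustrative):

```bash
pwd       # print the current working directory
cd /tmp   # change to the /tmp directory
cd ..     # move up one directory
cd ~      # return to your home directory
```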
Commands often take arguments and options:
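For example (the tool here is head, consistent with the table below):

```bash
head -n 3 movies.txt
```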
This command has three arguments:
| Argument | Type | Purpose |
|---|---|---|
| -n | Option (short form) | Specifies the number of lines |
| 3 | Value | The number of lines to show |
| movies.txt | Filename | The file to read |
The long form of -n is --lines.

Each command-line tool is one of five types:
1. Binary executables: programs compiled from source code to machine code. Examples: ls, grep, cat
2. Shell builtins: command-line tools provided by the shell itself. Examples: cd, pwd, echo
3. Scripts: text files executed by an interpreter (Python, R, Bash). Advantages: readable, editable, portable.
4. Shell functions: functions executed by the shell itself (typically defined in .bashrc or .zshrc).
5. Aliases: macros that expand to longer commands.
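A minimal sketch of an alias definition (the exact options are illustrative):

```bash
alias l='ls -lhF'
```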
Now typing l expands to the full ls command with options.
Use the type command to identify what kind of tool you have:
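For example (output varies by system and shell):

```bash
type cd      # cd is a shell builtin
type grep    # grep is /usr/bin/grep
```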
Most command-line tools follow the Unix philosophy:
Do one thing and do it well.
- grep filters lines
- wc counts lines, words, and characters
- sort sorts lines
- head shows the first lines

The power comes from combining these simple tools.
Every tool has three standard communication streams:
Standard input (stdin), standard output (stdout), and standard error (stderr)
| Stream | Abbreviation | Default |
|---|---|---|
| Standard input | stdin | Keyboard |
| Standard output | stdout | Terminal |
| Standard error | stderr | Terminal |
The pipe operator (|) connects stdout of one tool to stdin of another:
Piping output from curl to grep
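A minimal sketch (the URL and pattern are placeholders):

```bash
# Fetch a page quietly and keep only the lines matching a pattern
curl -s https://example.com/data.txt | grep "pattern"
```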
You can chain as many tools as needed:
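For example:

```bash
seq 100 | grep 3 | wc -l
# 19
```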
This pipeline:
- generates the numbers 1 through 100 (seq 100)
- keeps only the numbers containing the digit 3 (grep 3)
- counts how many lines remain (wc -l)
Think of piping as automated copy and paste.
Save output to a file using >:
Redirecting output to a file
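For example (filename illustrative):

```bash
seq 10 > numbers.txt    # creates (or overwrites) numbers.txt
```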
Use >> to append instead of overwrite:
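For example (filenames illustrative):

```bash
echo -n "Hello " > greeting.txt    # overwrite (or create) the file
echo "World" >> greeting.txt       # append to it
cat greeting.txt                   # Hello World
```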
The -n option tells echo not to add a trailing newline.
Read from a file using <:
Two ways to use file contents as input
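```bash
cat movies.txt | wc -l    # works, but starts an extra process (cat)
wc -l < movies.txt        # the shell connects the file to stdin directly
```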
Both achieve the same result. The second form avoids starting an extra process.
Suppress error messages by redirecting stderr to /dev/null:
Redirecting stderr to /dev/null
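For example (filename illustrative):

```bash
ls missing-file 2> /dev/null    # the error message is discarded
```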
The 2 refers to standard error (file descriptor 2).
Warning: Do not read from and write to the same file in one command!
The output file is opened (and emptied) before reading starts.
Solutions:
- write the output to a temporary file first, then move it over the original
- use sponge (from the moreutils package) to absorb all input before writing

sponge soaks up its entire input before opening the output file.
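A sketch of the broken and the fixed version (filename illustrative):

```bash
sort results.txt > results.txt           # BROKEN: the file is emptied before sort reads it
sort results.txt | sponge results.txt    # works: sponge buffers everything first
```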
The tee command writes to both a file and stdout:
tee writes to file while passing data through
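For example (filename illustrative):

```bash
seq 10 | tee numbers.txt | wc -l    # saves the numbers and counts them
```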
Useful for saving intermediate results while continuing a pipeline.
The tool ls lists the contents of a directory. Common options:

| Option | Meaning |
|---|---|
| -l | Long format (permissions, size, date) |
| -h | Human-readable sizes |
| -F | Append indicators (/ for directories) |
| -a | Show hidden files |
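For example:

```bash
ls -lhF /tmp    # long, human-readable listing with type indicators
```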
Warning: There is no recycle bin on the command line. Deleted files are gone.
Use the -v (verbose) option to see what is happening:
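For example (filename illustrative; the message shown is GNU coreutils style):

```bash
rm -v old-data.csv
# removed 'old-data.csv'
```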
The man command displays the manual page for a tool:
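```bash
man ls
```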
Navigation:
- f: next page
- b: previous page
- /pattern: search
- q: quit

Many tools support --help:
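```bash
ls --help    # on GNU systems; BSD/macOS tools may not support --help
```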
Some tools use -h instead.
For concise, example-focused help, use tldr:
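```bash
tldr tar    # common tar invocations with brief explanations
```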
Shows practical examples instead of exhaustive documentation.
For shell builtins like cd, check the shell manual:
Or use the help command (in bash):
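```bash
man bash    # the shell manual documents builtins such as cd
help cd     # in bash: concise help for a single builtin
```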
- OSEMN model: Obtain, Scrub, Explore, Model, iNterpret; data science is iterative
- Command-line advantages: agile, augmenting, scalable, extensible, ubiquitous
- Docker setup: use lucascordova/dataeng for a consistent environment
- Tool types: binary executables, shell builtins, scripts, functions, aliases
- Pipes and redirection: combine simple tools into powerful pipelines
- File management: navigate, create, copy, move, and remove files
Next steps:
Questions?
Mason, H. & Wiggins, C. H. (2010). A Taxonomy of Data Science. dataists blog. http://www.dataists.com/2010/09/a-taxonomy-of-data-science
Janssens, J. (2021). Data Science at the Command Line (2nd ed.). O’Reilly Media. https://datascienceatthecommandline.com
Raymond, E. S. (2003). The Art of Unix Programming. Addison-Wesley.
Patil, D. J. (2012). Data Jujitsu. O’Reilly Media.
Docker Documentation. https://docs.docker.com
GNU Coreutils Manual. https://www.gnu.org/software/coreutils/manual