Lecture 04-2: Data Science at the Command Line

DATA 503: Fundamentals of Data Engineering

Lucas P. Cordova, Ph.D.

Willamette University

February 2, 2026

Introduction to Data Science at the Command Line

The command line has been around for over 50 years, yet it remains one of the most powerful tools for computer science, data science, and beyond! Today we explore why.

Why the Command Line?

Data Science Meets Ancient Technology

How can a technology that is more than 50 years old be useful for a field that is only a few years young?

Today, data scientists can choose from an overwhelming collection of exciting technologies:

  • Python, R, Julia
  • Apache Spark
  • Jupyter Notebooks
  • GUI-based tools

And yet, the command line remains essential. Let us understand why.

Five Advantages of the Command Line

The command line offers five key advantages for data science:

Advantage    Description
Agile        REPL environment allows rapid iteration and exploration
Augmenting   Integrates with and amplifies other technologies
Scalable     Automate, parallelize, and run on remote servers
Extensible   Create your own tools in any language
Ubiquitous   Available on virtually every system

The Command Line Is Agile

Data science is interactive and exploratory. The command line supports this through:

Read-Eval-Print Loop (REPL)

  • Type a command, press Enter, see results immediately
  • Stop execution at will
  • Modify and re-run quickly
  • Short iteration cycles let you play with your data

Close to the Filesystem

  • Data lives in files
  • Command line offers convenient tools for file manipulation
  • Navigate, search, and transform data files easily
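
For example, a first look at a data file takes just two commands (a sketch; sales.csv is a hypothetical file):

$ ls -lh sales.csv       # how large is the file?
$ head -n 5 sales.csv    # what do the first rows look like?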

The Command Line Is Augmenting

The command line does not replace your current workflow. It amplifies it.

Three ways it augments your tools:

  1. Glue between tools: Connect output of one tool to input of another using pipes
  2. Delegation: Python, R, and Spark can call command-line tools and capture output
  3. Tool creation: Convert your scripts into reusable command-line tools

Every technology has strengths and weaknesses. Knowing multiple technologies lets you use the right one for each task.
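
As a minimal sketch of delegation, Python's standard subprocess module can run a command-line tool and capture its output (movies.txt is a file used later in this lecture):

import subprocess

# Run `wc -l` on a file and capture its standard output as text
result = subprocess.run(["wc", "-l", "movies.txt"],
                        capture_output=True, text=True)
print(result.stdout.strip())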

The Command Line Is Scalable

On the command line, you do things by typing. In a GUI, you point and click.

Everything you type can be:

  • Saved to scripts
  • Rerun when data changes
  • Scheduled to run at intervals
  • Executed on remote servers
  • Parallelized across multiple cores

Pointing and clicking is hard to automate. Typing is not.
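
As a sketch, commands typed during an interactive session can be saved to a script and scheduled with cron (the script name and schedule below are illustrative):

$ cat count-chapters.sh
#!/usr/bin/env bash
# Re-run the chapter count from this lecture on demand
curl -s "https://www.gutenberg.org/files/11/11-0.txt" | grep " CHAPTER" | wc -l

$ crontab -l
0 6 * * * /home/dst/count-chapters.sh >> /home/dst/chapters.log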

The Command Line Is Extensible

The command line itself is over 50 years old, but new command-line tools are developed daily.

  • Language agnostic: tools can be written in any programming language
  • Open source community produces high-quality tools
  • Tools can work together
  • You can create your own tools

The Command Line Is Ubiquitous

The command line comes with any Unix-like operating system:

  • macOS
  • Linux (100% of top 500 supercomputers)
  • Windows Subsystem for Linux
  • Cloud servers
  • Embedded systems

This technology has been around for more than five decades. It will be here for five more.

The OSEMN Model

Data Science Is OSEMN

A practical definition of data science was devised by Hilary Mason and Chris H. Wiggins [1].

OSEMN (pronounced “awesome”) defines data science in five steps:

  1. Obtaining data
  2. Scrubbing data
  3. Exploring data
  4. Modeling data
  5. iNterpreting data

OSEMN: Expectation vs Reality

The OSEMN model shows data science as an iterative, nonlinear process.

In practice, you move back and forth between steps. After modeling, you may return to scrubbing to adjust features.

The OSEMN Steps

Step 1: Obtaining Data

Download, query APIs, extract from files, or generate data yourself.

Common formats: plain text, CSV, JSON, HTML, XML

Step 2: Scrubbing Data

Clean the data before analysis:

  • Filter lines
  • Extract columns
  • Replace values
  • Handle missing data
  • Convert formats

It is often said that 80% of the work in a data project is cleaning the data.
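
Each of these operations maps onto a small tool; a sketch, assuming a hypothetical comma-separated file sales.csv:

$ grep -v '^#' sales.csv     # filter out comment lines
$ cut -d ',' -f 2 sales.csv  # extract the second column
$ sed 's|N/A||g' sales.csv   # replace "N/A" values with nothing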

Step 3: Exploring Data

Get to know your data:

  • Look at the data
  • Derive statistics
  • Create visualizations

Step 4: Modeling Data

Create statistical models:

  • Clustering
  • Classification
  • Regression
  • Dimensionality reduction

Step 5: Interpreting Data

Draw conclusions, evaluate results, and communicate findings.

This is where human judgment matters most.

Getting Started

Setting Up Your Environment

Docker: Your Portable Environment

We use Docker to ensure everyone has the same environment with all necessary tools.

What is Docker?

  • A Docker image bundles applications with their dependencies
  • A Docker container is an isolated environment running an image
  • Like a lightweight virtual machine, but uses fewer resources

Why Docker for this course?

  • Consistent environment across Windows, macOS, and Linux
  • All tools pre-installed and configured
  • No dependency conflicts
  • Easy to reset if something goes wrong

Installing Docker

  1. Download Docker Desktop from docker.com

  2. Install and launch Docker Desktop

  3. Open your terminal (or Command Prompt on Windows)

  4. Pull the course Docker image:

$ docker pull lucascordova/dataeng

Running the Docker Container

Run the Docker image:

$ docker run --rm -it lucascordova/dataeng

You are now inside an isolated environment with all the necessary command-line tools installed.

Test that it works:

$ cowsay "Data engineering is awesome!"
 _______________________________
< Data engineering is awesome!  >
 -------------------------------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Mounting a Local Directory

To get data in and out of the container, mount a local directory (shown for macOS/Linux, Windows Command Prompt, and Windows PowerShell, respectively):

$ docker run --rm -it -v "$(pwd)":/data lucascordova/dataeng
C:\> docker run --rm -it -v "%cd%":/data lucascordova/dataeng
PS C:\> docker run --rm -it -v ${PWD}:/data lucascordova/dataeng

The -v option maps your current directory to /data inside the container.

Exiting the Container

When you are done, exit the container by typing:

$ exit

The container is removed (due to --rm flag), but any files you saved to the mounted directory persist on your computer.

What Is the Command Line?

The Terminal

Command line on macOS

The terminal is the application where you type commands. It enables you to interact with the shell.

The Terminal on Linux

Command line on Ubuntu

Ubuntu is a distribution of GNU/Linux. The same commands work across different Unix-like systems.

The Four Layers

The environment consists of four layers:

Layer               Purpose
Command-line tools  Programs you execute
Terminal            Application for typing commands
Shell               Program that interprets commands (bash, zsh)
Operating system    Executes tools, manages hardware
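
To see which shell interprets your commands, inspect the SHELL environment variable, which records your login shell (output varies by system):

$ echo $SHELL
/bin/bash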

Understanding the Prompt

When you see text like this in the book or slides:

$ seq 3
1
2
3

  • The $ is the prompt (do not type it)
  • seq 3 is the command you type
  • 1, 2, 3 is the output

The prompt may show additional information (username, directory, time), but we show only $ for simplicity.

Essential Unix Concepts

Executing Commands

Your First Commands

Try these commands in your terminal:

$ pwd
/home/dst

The tool pwd outputs the name of your current directory.

$ cd /data/ch02
$ pwd
/data/ch02

The tool cd changes directories. The values that follow a command name are called arguments or options.

Command Arguments and Options

Commands often take arguments and options:

$ head -n 3 movies.txt
Matrix
Star Wars
Home Alone

This command has three arguments:

Argument    Type                 Purpose
-n          Option (short form)  Specifies number of lines
3           Value                The number of lines to show
movies.txt  Filename             The file to read

The long form of -n is --lines.
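
The long form behaves identically:

$ head --lines 3 movies.txt
Matrix
Star Wars
Home Alone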

Five Types of Command-Line Tools

The Tool Umbrella

Command-line tool is an umbrella term for five types

Each command-line tool is one of five types:

  • Binary executable
  • Shell builtin
  • Interpreted script
  • Shell function
  • Alias

Binary Executables

Programs compiled from source code to machine code.

  • Created by compiling C, C++, Rust, Go, etc.
  • Cannot read the file in a text editor
  • Fast execution
  • Examples: ls, grep, cat

$ file /usr/bin/ls
/usr/bin/ls: ELF 64-bit LSB pie executable...

Shell Builtins

Command-line tools provided by the shell itself.

  • Part of the shell (bash, zsh)
  • May differ between shells
  • Cannot be easily inspected
  • Examples: cd, pwd, echo

$ type cd
cd is a shell builtin

Interpreted Scripts

Text files executed by an interpreter (Python, R, Bash).

#!/usr/bin/env python

# Print the factorial of the integer given as the first
# command-line argument.

def factorial(x):
    result = 1
    for i in range(2, x + 1):
        result *= i
    return result

if __name__ == "__main__":
    import sys
    x = int(sys.argv[1])
    print(factorial(x))

Advantages: Readable, editable, portable
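
To use the script as a command-line tool, save it and mark it executable (a sketch, assuming it is saved as fac.py):

$ chmod +x fac.py
$ ./fac.py 5
120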

Shell Functions

Functions executed by the shell itself:

$ fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
$ fac 5
120

  • Similar to scripts but typically smaller
  • Defined in shell configuration (.bashrc, .zshrc)
  • Good for personal productivity shortcuts

Aliases

Macros that expand to longer commands:

$ alias l='ls --color -lhF --group-directories-first'
$ alias les=less

Now typing l expands to the full ls command with options.

  • Save keystrokes for common commands
  • Fix common typos
  • Cannot take parameters (use functions for that)
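
For example, a small illustrative function that accepts a parameter, which an alias cannot do:

$ mkcd() { mkdir -p "$1" && cd "$1"; }
$ mkcd projects/demo
$ pwd
/home/dst/projects/demo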

Identifying Tool Types

Use the type command to identify what kind of tool you have:

$ type -a pwd
pwd is a shell builtin
pwd is /usr/bin/pwd

$ type -a cd
cd is a shell builtin

$ type -a l
l is an alias for ls --color -lhF --group-directories-first

Combining Tools

The Unix Philosophy

Most command-line tools follow the Unix philosophy:

Do one thing and do it well.

  • grep filters lines
  • wc counts lines, words, characters
  • sort sorts lines
  • head shows first lines

The power comes from combining these simple tools.
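
A classic illustration: ranking the most frequent words in a text by chaining small tools (a sketch; alice.txt appears later in this lecture):

$ tr -cs 'A-Za-z' '\n' < alice.txt |
> tr 'A-Z' 'a-z' |
> sort | uniq -c | sort -rn |
> head -n 3

The first tr splits the text into one word per line, the second lowercases it, and sort, uniq, and head count, rank, and show the top three words.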

Standard Streams

Every tool has three standard communication streams:

Standard input (stdin), standard output (stdout), and standard error (stderr)

Stream           Abbreviation  Default
Standard input   stdin         Keyboard
Standard output  stdout        Terminal
Standard error   stderr        Terminal

The Pipe Operator

The pipe operator (|) connects stdout of one tool to stdin of another:

Piping output from curl to grep

$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" | grep " CHAPTER"
 CHAPTER I.     Down the Rabbit-Hole
 CHAPTER II.    The Pool of Tears
 CHAPTER III.   A Caucus-Race and a Long Tale
...

Chaining Multiple Tools

You can chain as many tools as needed:

$ curl -s "https://www.gutenberg.org/files/11/11-0.txt" |
> grep " CHAPTER" |
> wc -l
12

This pipeline:

  1. Downloads Alice in Wonderland
  2. Filters lines containing " CHAPTER"
  3. Counts the number of lines

Think of piping as automated copy and paste.

Redirecting Input and Output

Output Redirection

Save output to a file using >:

Redirecting output to a file

$ curl -s "https://example.com/data.txt" | grep "pattern" > results.txt

  • Creates file if it does not exist
  • Overwrites file if it exists
  • Standard error still goes to terminal

Appending Output

Use >> to append instead of overwrite:

$ echo -n "Hello" > greeting.txt
$ echo " World" >> greeting.txt
$ cat greeting.txt
Hello World

The -n option tells echo not to add a trailing newline.

Input Redirection

Read from a file using <:

Two ways to use file contents as input

$ cat greeting.txt | wc -w
2

$ < greeting.txt wc -w
2

Both achieve the same result. The second form avoids starting an extra process.

Redirecting Standard Error

Suppress error messages by redirecting stderr to /dev/null:

Redirecting stderr to /dev/null

$ cat movies.txt 404.txt
Matrix
Star Wars
cat: 404.txt: No such file or directory

$ cat movies.txt 404.txt 2> /dev/null
Matrix
Star Wars

The 2 refers to standard error (file descriptor 2).
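
Errors can also be captured rather than discarded, by redirecting stderr to a file or merging it into stdout with 2>&1 (a sketch):

$ cat movies.txt 404.txt 2> errors.log
Matrix
Star Wars

$ cat movies.txt 404.txt > all.txt 2>&1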

The Sponge Problem

Warning: Do not read from and write to the same file in one command!

$ < dates.txt nl > dates.txt  # BAD: results in empty file!

The output file is opened (and emptied) before reading starts.

Solutions:

  1. Write to a different file, then rename
  2. Use sponge to absorb all input before writing

sponge soaks up input before writing

$ < dates.txt nl | sponge dates.txt  # GOOD
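
The first solution, sketched with a temporary file:

$ < dates.txt nl > dates-numbered.txt && mv dates-numbered.txt dates.txt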

Using tee for Intermediate Output

The tee command writes to both a file and stdout:

tee writes to file while passing data through

$ seq 0 2 100 | tee even.txt | head -5
0
2
4
6
8

Useful for saving intermediate results while continuing a pipeline.

Working with Files and Directories

Listing Directory Contents

$ ls /data/ch10
alice.txt  count.py  count.R  Untitled1337.ipynb

$ ls -lhF /data/ch10
total 176K
-rw-r--r-- 1 dst dst 164K Jun 29 14:25 alice.txt
-rwxr-xr-x 1 dst dst  408 Jun 29 14:25 count.py*
-rw-r--r-- 1 dst dst  460 Jun 29 14:25 count.R
-rw-r--r-- 1 dst dst 1.7K Jun 29 14:25 Untitled1337.ipynb

Common options:

Option  Meaning
-l      Long format (permissions, size, date)
-h      Human-readable sizes
-F      Append indicators (/ for directories)
-a      Show hidden files

Creating and Navigating Directories

$ mkdir logs              # Create directory
$ mkdir -p data/raw/2024  # Create nested directories

$ cd /data                # Change to absolute path
$ cd ..                   # Go up one level
$ cd ~                    # Go to home directory
$ cd -                    # Go to previous directory

Moving and Copying Files

$ mv hello.txt /data/ch02      # Move file
$ mv hello.txt goodbye.txt     # Rename file
$ mv old_dir new_dir           # Rename directory

$ cp server.log server.log.bak # Copy file
$ cp -r src/ backup/           # Copy directory recursively

Removing Files and Directories

$ rm bye.txt           # Remove file
$ rm -r /data/ch02/old # Remove directory and contents
$ rm -i *.txt          # Interactive: ask before each removal

Warning: There is no recycle bin on the command line. Deleted files are gone.

Use the -v (verbose) option to see what is happening:

$ mkdir -v backup
mkdir: created directory 'backup'

Getting Help

Manual Pages

The man command displays the manual page for a tool:

$ man tar
TAR(1)                    GNU TAR Manual                   TAR(1)

NAME
       tar - an archiving utility

SYNOPSIS
       tar [options] [pathname ...]
...

Navigation:

  • Space or f: next page
  • b: previous page
  • /pattern: search
  • q: quit

The --help Option

Many tools support --help:

$ jq --help
jq - commandline JSON processor [version 1.6]

Usage: jq [options] <jq filter> [file...]
...

Some tools use -h instead.

TLDR Pages

For concise, example-focused help, use tldr:

$ tldr tar

  tar

  Archiving utility.

  - [c]reate an archive and write it to a [f]ile:
    tar cf target.tar file1 file2 file3

  - E[x]tract a (compressed) archive [f]ile:
    tar xf source.tar[.gz|.bz2|.xz]

Shows practical examples instead of exhaustive documentation.

Shell Builtins Help

For shell builtins like cd, check the shell manual:

$ man zshbuiltins  # for zsh
$ man bash         # for bash (search for the builtin)

Or use the help command (in bash):

$ help cd

Summary

Key Takeaways

What We Learned

  1. OSEMN Model: Obtain, Scrub, Explore, Model, iNterpret - data science is iterative

  2. Command Line Advantages: Agile, augmenting, scalable, extensible, ubiquitous

  3. Docker Setup: Use lucascordova/dataeng for a consistent environment

  4. Tool Types: Binary executables, shell builtins, scripts, functions, aliases

  5. Pipes and Redirection: Combine simple tools into powerful pipelines

  6. File Management: Navigate, create, copy, move, and remove files

Looking Ahead

Next steps:

  • Practice the commands covered today
  • Explore the Docker environment
  • Complete the command-line exercises

Questions?

References

References

  1. Mason, H. & Wiggins, C. H. (2010). A Taxonomy of Data Science. dataists blog. http://www.dataists.com/2010/09/a-taxonomy-of-data-science

  2. Janssens, J. (2021). Data Science at the Command Line (2nd ed.). O’Reilly Media. https://datascienceatthecommandline.com

  3. Raymond, E. S. (2003). The Art of Unix Programming. Addison-Wesley.

  4. Patil, D. J. (2012). Data Jujitsu. O’Reilly Media.

  5. Docker Documentation. https://docs.docker.com

  6. GNU Coreutils Manual. https://www.gnu.org/software/coreutils/manual