Margin-time playbook Β· your tutor, in the margin

When you sit down in a margin block,
I'll tell you what to advance.

The Schedule says when. This page is where I sit next to you and say what. Tap the block you're in β€” the page remembers, jumps you to the action, logs your win when you close the session. When you get stuck, scroll down: I've written the part that usually trips people up.

I'm going to walk you through this the same way I'd walk you through a whiteboard session: one track at a time, one block at a time, one honest question at a time. You don't have to be brilliant in any of these blocks. You have to show up, move one thing, and write it down so tomorrow-you trusts today-you. That's the whole game. Everything else on this page is scaffolding for that one habit.
Which block are you in right now?
β€”
Last win
β€”
β€”
TRACK 01

YouTube Channel β€” animate the concepts, model the process

A channel that animates CS ideas AND the metacognitive moves of a student engaging in highly effective project-based learning. The animations are the hook; the PBL voice is the moat.

Here is the thing nobody tells you about starting a channel: the people who ship start ugly. The people who quit keep polishing. If you only have 45 minutes tonight, I would rather you produce 90 seconds of rough-cut footage you hate than an unopened project file. Your future self will thank you. Ugly and shipped beats pretty and queued β€” every single time.

Do this now Β· 45 min

Pick one. Don't try to do two β€” decision fatigue wins every time.

Default β†’ fallback order
  1. Script beat β€” open the active script; advance one of hook β†’ wrong intuition β†’ derivation β†’ animation β†’ artifact. Stop when that beat reads clean out loud.
  2. Manim scene β€” one animation for the active script. One scene, <60 sec runtime.
  3. Record take β€” one recorded pass. No re-recording today; you'll hate it less after sleep.
  4. Edit pass β€” 1 minute of footage, cuts locked.
  5. Ship β€” render, thumbnail, description, upload unlisted to Jeff, then public.
When you're stuck
  • Can't start writing? Write the hook sentence out loud, badly, right now. Bad-on-purpose dodges perfectionism.
  • Don't know what to record? Record yourself teaching the concept to a friend from memory. That recording is your first script.
  • Manim intimidating? Use a text slide with a fade for this video. Fancy animation is optional; shipping isn't.
No wins logged yet. Close a session and tap βœ“ Done for today.

Positioning & voice

Answer this in one sentence before the first video ships.

Working tagline: "I animate CS concepts while thinking out loud about how I'd actually learn them β€” so you can steal the process, not just the answer."

Who it's for: community-college CS students, career-switchers, and students who've "taken the class" but don't trust their understanding. Not for people who already have the concept β€” for the person still looking for a door in.

What makes it not generic: most animated CS channels show the what. This one shows the how I got here β€” the derivation, the wrong turn, the checkpoint question, the artifact at the end. That's the project-based-learning DNA.

Voice rules:

  • Derive before compute. Never drop a formula without showing where it comes from.
  • Build before import. If a library does it, first show the 10-line version you'd write.
  • Name the confusion. Say the wrong intuition out loud before correcting it.
  • Close with an artifact. Every video ends with a thing the viewer can run, fork, or open.

Milestone ladder β€” tap to mark done

Each rung is small enough to finish in 1–3 margin blocks. State saves automatically.

βœ“
Channel identity locked
Channel name, handle, one-line description, avatar, banner. Write it in a doc you can't edit after Friday. Kills bikeshedding.
βœ“
Tooling spike (one evening, no content)
Install Manim Community Edition, get one example scene rendering. OBS for screen+webcam. Mic level. Defaults everywhere β€” don't spend two weeks on presets.
βœ“
Script template + first script
Five-beat template: hook β†’ wrong intuition β†’ derivation β†’ animation β†’ artifact/CTA. First topic: "Why matrix multiplication is row Γ— column β€” and why we define it that way." Doubles as scaffolding for Lab 1.
βœ“
Ship video 1 β€” ugly and shipped beats pretty and queued
Record, rough cut, no b-roll drama, upload unlisted first. Send to Jeff for one round. Then publish.
βœ“
Batch scripts 2–5
Batching scripts is the single biggest velocity unlock. Each video after the first should need <2 blocks of scripting.
βœ“
Series 1 complete β€” "Seeing Linear Algebra" (10 videos)
Ten 3–5 min videos, one LA concept each, each ending with a link to the matching lab below. Labs + videos become the same product: watch, then do.
βœ“
Quarterly review β€” iterate voice
At 10 videos, read watch-time curves honestly. Keep what worked. Kill the first 20 seconds of everything that lost people.

Series 1 content bank β€” "Seeing Linear Algebra"

Ten topics feeding directly into the three labs in Track 03.

| # | Working title | Core visual | Feeds lab |
|---|---|---|---|
| 1 | Why matrix multiplication is row × column | Basis vectors under a linear map, decomposed as dot products | Morphs |
| 2 | An image is a matrix (and every filter is a function of it) | Grayscale Lena beside a printed grid of numbers | All three |
| 3 | Homogeneous coordinates, or: why we add a fake dimension | 2D translation becoming a 3×3 product | Morphs |
| 4 | Forward vs. inverse warping — and the bug that teaches you which | Rotated image with holes vs. clean inverse-warped version | Morphs |
| 5 | What PCA actually does to a cloud of points | 2D Gaussian blob; axes rotating to principal directions | Eigenfaces, USPS |
| 6 | Eigenvectors without the textbook | A transformation whose action leaves certain arrows fixed | Eigenfaces |
| 7 | The covariance trick — $X^\top X$ vs. $XX^\top$ | Tall-skinny vs. short-fat data matrix and their eigenproblems | Eigenfaces |
| 8 | SVD as three honest steps: rotate, stretch, rotate | Unit circle → ellipse → rotated ellipse | Eigenfaces, USPS |
| 9 | Nearest neighbors, or: a 256-dim digit is still a point | Flatten a 16×16 digit, plot its 2D PCA projection | USPS |
| 10 | Confusion matrices, read like a chessboard | 10×10 grid; diagonals are wins; off-diagonals tell a story | USPS |
TRACK 02

Grad School Prep β€” don't drown in the first semester

Two jobs: pick the right programs, and arrive with enough technical fluency that coursework is learning, not survival. The PhD Tracker tab holds the program database; this page holds the study curriculum and the framework for deciding what rows go in that database.

You don't need to be brilliant in the first semester — you need to be not drowning. The difference between those two is roughly ten hours of self-study a week, nine months before you start. This is that. A warning, though: most students treat grad-school prep as "read more textbooks." That's wrong. Prep is re-deriving the 30 ideas you'll be asked to use as lemmas, without looking. If you can't derive the eigenvalues of a 2×2 matrix from scratch on a napkin right now, that's not a gap in knowledge — it's a gap in fluency. Fluency is what saves you in week two of a grad course.
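As a calibration check, here is the napkin derivation in question, for a generic 2×2 matrix via its characteristic polynomial:

```latex
\det\!\begin{pmatrix} a-\lambda & b \\ c & d-\lambda \end{pmatrix}
  = \lambda^2 - (a+d)\lambda + (ad - bc) = 0
\quad\Longrightarrow\quad
\lambda = \frac{(a+d) \pm \sqrt{(a+d)^2 - 4(ad - bc)}}{2}.
```

If "trace and determinant fall out of the quadratic" isn't something you can reproduce cold, that's the fluency gap to close first.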

Do this now Β· 45–60 min

Pick one. The default is always Read mode unless the rest are blocked.

Default β†’ fallback order
  1. Read mode β€” open the day's chapter. Pen in hand. If you can't re-derive the key theorem on paper after 10 min, loop back.
  2. Exercise mode β€” 3 problems from the chapter. Not "read the solution" β€” finish or fail honestly.
  3. Program mode β€” advance one program row: read one faculty's last paper, update the fit narrative, or draft one outreach email.
  4. Anki mode β€” 3 cards per paper read this week. Clear the queue.
When you're stuck
  • Can't focus on reading? Do 3 exercises from yesterday's chapter. They drag you back into the material without the activation energy of a new topic.
  • Textbook feels abstract? Open Strang's MIT 18.06 lecture for that chapter. Thirty minutes of Strang on YouTube beats ninety minutes of sleepy reading.
  • Feeling behind? You are not behind. Nine months is enough. Write one Anki card and close the laptop. Tomorrow-you will be grateful.

Program research framework

Because "apply to good schools" is not a plan.

Target split

Build a list of 8–12 programs, roughly 3 reach / 5 match / 3 safety. Two parallel tracks:

  • MS in CS (coursework or thesis): SJSU, SFSU, SCU, UC Davis, UC Irvine, Oregon State, Georgia Tech OMSCS, UIUC MCS, UT Austin MSCSO. Strong technical floor; lower stakes on "did you publish yet?"
  • PhD in CS Education / Learning Sciences / HCI: what the PhD Tracker already targets. Candidate institutions to investigate (verify deadlines and research fit on the program site β€” these shift yearly): UW (Info School, CSE), Berkeley (EECS / GSE), Stanford (GSE, CS HCI), CMU (HCII, METALS), Michigan (EECS, UMSI), Georgia Tech (HCC), UC San Diego (Cog Sci, CSE/EdS), UC Irvine (Informatics), Indiana (Learning Sciences), Northwestern (Learning Sciences), Penn (GSE), Utah, NCSU.

For each program, capture 5 things

  1. 3–5 faculty with live funding in your area. Read their last 3 papers. If you can't write one sentence about why the work matters, they don't go on the list.
  2. Admissions surface: deadline, GRE policy, required materials, funding model (TA/RA/fellowship), typical cohort size.
  3. Fit narrative: one paragraph answering "why this program, not a peer in the same tier." This is the seed of your SoP paragraph for that school.
  4. Contact strategy: who to email when (early fall for PhD, not a week before the deadline). What single question you'll ask that a 5-second skim of their site won't already answer.
  5. Red flags: PI not taking students, lab wound down, group moved institutions. Check recent publication activity; an inactive lab is worse than a rejection.
When you write faculty emails, the one thing that separates "opens, reads, replies" from "ignores" is a specific sentence about their work that proves you read it. Not "your work on X is fascinating." A sentence that contains a fact only someone who read the paper would know β€” an experimental detail, a limitation they flagged, a dataset they chose. That's the trust anchor. The rest of the email can be short.

Where to verify

  • CSRankings.org β€” research-area-weighted rankings; filter by HCI, software engineering education, etc.
  • Program website β€” the one authoritative source for deadlines. Google is wrong half the time.
  • Google Scholar profiles of target PIs. Sort by year. Note co-authors β€” often students you could email later.
  • The GradCafe β€” last-year admit timing, funding signals, decision patterns. Noisy but useful.
  • Twitter/Bluesky β€” many HCI/CS-Ed PIs announce openings there months before the deadline.

Self-directed core curriculum β€” survive grad courses

By the time you sit in your first grad class, nothing in the syllabus should be a true first exposure. 9–12 month plan at ~1 hour/day of focused study. Feeds the 8h CS Theory + 6.5h CS Build blocks on the Schedule.

Five pillars β€” open one book first per pillar

| Pillar | Primary text | Why this one | Minimum dose |
|---|---|---|---|
| Linear algebra | Strang, Introduction to Linear Algebra (6e) + MIT 18.06 lectures | Teaches LA as four subspaces + a story about $A\mathbf{x}=\mathbf{b}$. Axler is purer but less useful as a first grad-course refresher. | Ch. 1–7, SVD and eigendecomposition. Feeds Labs 2 & 3. |
| Probability & stats | Blitzstein & Hwang, Intro to Probability + Wasserman, All of Statistics | Blitzstein for intuition; Wasserman is the pocket reference you'll open during every grad ML course. | Blitzstein Ch. 1–8, then Wasserman Ch. 1–10. Exercises matter; probability is not a spectator sport. |
| Algorithms | CLRS, Introduction to Algorithms (4e) | Grad algorithms = CLRS + randomized/approximation on top. Know the core cold. | Parts I–III (growth, divide-and-conquer, sorting), graphs, DP. Skip advanced data structures on first pass. |
| Systems | OSTEP (free online) for OS, then Bryant & O'Hallaron, CS:APP | OSTEP is friendly and surprisingly deep. CS:APP makes all systems courses stop feeling random. | OSTEP: one chapter each from virtualization, concurrency, persistence. CS:APP: 1–3, 6, 9. |
| Machine learning | Hastie, Tibshirani & Friedman, Elements of Statistical Learning (free online) | The "adult book" of ML. Uses USPS digits as a running example — Lab 3 literally comes from Ch. 13. | Ch. 1–4, 7, 13, 14. Read once as a tourist, then redo the math in the chapters your labs touch. |

Supplementary β€” CS Ed / Learning Sciences / HCI

Because the PhD target is CS-Ed adjacent, the technical pillars aren't enough. Add a second stream while the first runs:

  • Sawyer (ed.), The Cambridge Handbook of the Learning Sciences. The field's canonical survey. One chapter per week.
  • Fincher & Robins (eds.), The Cambridge Handbook of Computing Education Research. Essential for reading your target faculty's papers.
  • Creswell, Research Design. Qualitative, quantitative, mixed-methods β€” vocabulary to read CS Ed papers critically.
  • Running paper diet: 1–2 papers/week logged in the PhD Tracker tab. Target venues: ICER, SIGCSE, CHI, L@S, JLS, Cognition and Instruction.

Weekly cadence

  • MON–TUE · LA + Probability: one chapter's reading + 3–5 exercises. The foundation everything else leans on.
  • WED · Algorithms: one CLRS section + implement it from scratch in Python (not from memory of a library).
  • THU · Systems: an OSTEP / CS:APP chapter or a hands-on project (pthread, cache-aware matmul, tiny shell).
  • FRI–SAT · ML + paper: an ESL section + one paper from the target-faculty reading list. Anki: 3 cards per paper.
TRACK 03

PhD Research Labs β€” three labs that braid linear algebra into real artifacts

Three labs, designed to feed the Modeling Bench with student-facing material and to seed content for the YouTube series. Same principle for each: derive the math, build the method from scratch, then benchmark against a library. Each lab is scoped as four 45-minute margin blocks, ending in a committable artifact.

These three labs are designed to braid together. If you only had time to do one, do Eigenfaces. It is the lab where linear algebra stops being symbols on a page and starts being "I just plotted an eigenvector and it looks like a face." That moment β€” the one where the math becomes visible β€” is what you are paying tuition for in every LA class you will ever take. I want you to have it once, on your own laptop, with code you wrote. Then everything else in ML is a variation on that same trick.

Do this now Β· 45 min

Pick the lab and week you're currently in (set it on each lab below). During a block, do only that week's task.

Within any week's block
  1. Derive β€” write the week's key equation on paper, from memory if you can. If you can't, rebuild it from the textbook.
  2. Code β€” implement the week's task in a notebook. NumPy primitives only for the core step β€” no sklearn shortcuts.
  3. Verify β€” compare to the library (np.linalg.svd, sklearn.decomposition.PCA). Confirm agreement up to sign/ordering.
  4. Commit β€” push the notebook. One sentence in the commit: what you learned, not what you did.
When you're stuck
  • Your code doesn't match the library? 90% of mismatches are sign of eigenvectors or column order. Check those first.
  • Derivation won't come back? Close the notebook. Open Strang on that chapter. Re-read ten pages. Try again tomorrow. Don't brute-force a derivation at midnight.
  • Notebook feels too small to matter? Commit it anyway. "Tiny commit today" is the habit you're building, not "brilliant commit today."
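The sign/column-order check from the Verify step can be done mechanically. A minimal sketch on synthetic data (random Gaussian stand-in, not your lab data): hand-rolled PCA via the covariance eigendecomposition, compared against np.linalg.svd after aligning the sign of each column.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))          # 40 samples, 8 features
Xc = X - X.mean(axis=0)               # center, as in the labs

# Hand-rolled PCA: eigenvectors of the covariance matrix.
C = Xc.T @ Xc / (len(Xc) - 1)
evals, evecs = np.linalg.eigh(C)      # eigh returns ascending order
order = np.argsort(evals)[::-1]
pcs_eig = evecs[:, order]             # columns = principal directions

# Library answer: right singular vectors of the centered data.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs_svd = Vt.T

# Agreement is only up to sign, so flip each column before comparing.
signs = np.sign(np.sum(pcs_eig * pcs_svd, axis=0))
assert np.allclose(pcs_eig, pcs_svd * signs, atol=1e-8)
```

If this assert fails, check the ordering step first: eigh hands back eigenvalues ascending, SVD hands back singular values descending.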
LAB 01 Β· 4 Γ— 45 MIN

Morphs & Warps β€” images as functions you multiply

Linear algebra Β· Affine / projective maps Β· Interpolation

Current week

  • Week 1 · Linear maps on a grid: build a 50×50 test grid image. Implement rotation, scaling, shear as $2\times 2$ matrices on pixel coordinates. Forward-warp naively, observe the holes. One-paragraph reflection on why.
  • Week 2 · Homogeneous coords + inverse warp: rewrite every transform as a $3\times 3$. Implement inverse warping with bilinear interpolation. Verify by warping and un-warping; you should recover the original to sub-pixel error.
  • Week 3 · Correspondence + Delaunay: pick two images. Annotate 20–40 corresponding landmarks. Delaunay-triangulate. Compute per-triangle affine warps toward an intermediate average mesh.
  • Week 4 · Morph = warp + cross-dissolve: at $t \in [0,1]$, warp both images toward the shape at $t$ and alpha-blend by $(1-t)$ and $t$. Render 30 frames. Export MP4. 300-word reflection.

Motivation

Every student has seen a rotated image; every student has computed a matrix-vector product. Almost none of them have connected the two. Lab 1 makes the connection viscerally: a warp is a matrix, and a morph is a warp plus a cross-dissolve. The lab produces a short face-morph animation β€” the same visual trick from early SIGGRAPH work and every music video from 1991.

Linear algebra you will actually use

  • Vectors as pixel coordinates. A point $(x, y)$ lives in $\mathbb{R}^2$; an image is a function $I: \mathbb{Z}^2 \to \mathbb{R}$.
  • Linear maps. Rotation, scaling, shear β€” each a $2 \times 2$ matrix. Translation is affine, not linear.
  • Homogeneous coordinates. Promote $(x,y) \to (x,y,1)$ and every affine map is a single $3 \times 3$ multiply. This is the payoff that makes "why did we add a 1?" click.
  • Inverse maps + bilinear interpolation. Forward warping leaves holes; inverse warping plus interpolation does not. The bug teaches you which direction to iterate.
  • Piecewise-affine warps. Triangulate two faces with corresponding landmarks; each triangle gets its own $3 \times 3$. The idea behind Beier–Neely without the field calculus.
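The homogeneous-coordinate payoff is small enough to see in a few lines. A sketch, with an arbitrary 30° rotation and a (5, 2) translation chosen purely for illustration:

```python
import numpy as np

# Rotation by 30 degrees and a translation by (5, 2), each as a 3x3
# acting on homogeneous points (x, y, 1). Numbers are illustrative.
theta = np.deg2rad(30)
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0,              0,             1]])
T = np.array([[1, 0, 5],
              [0, 1, 2],
              [0, 0, 1]])

p = np.array([1.0, 0.0, 1.0])   # the point (1, 0), promoted

# Translation now composes by plain matrix multiply -- the payoff.
q = T @ R @ p                    # rotate, then translate
```

Without the fake dimension, the translation would have to be a vector addition bolted on after the multiply; with it, any chain of affine maps collapses into one matrix.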
The single equation the lab is organized around

A forward affine warp sends source pixel $\mathbf{p}_s = (x_s, y_s, 1)^\top$ to destination pixel $\mathbf{p}_d = A \mathbf{p}_s$, where

$$A = \begin{pmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{pmatrix}.$$

At render time you iterate over every destination pixel and ask "where did you come from?" β€” i.e., compute $\mathbf{p}_s = A^{-1}\mathbf{p}_d$ and sample $I$ there. That one inversion is why inverse warping has no holes.
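That render-time loop vectorizes cleanly. A minimal sketch, assuming a grayscale image as a 2D NumPy array and out-of-bounds samples set to 0 — not a production warper:

```python
import numpy as np

def inverse_warp(img, A, out_shape):
    """Pull every destination pixel from A^{-1} p_d with bilinear
    interpolation. img is a 2D grayscale array; out-of-bounds -> 0."""
    H, W = out_shape
    Ainv = np.linalg.inv(A)
    ys, xs = np.mgrid[0:H, 0:W]
    dest = np.stack([xs, ys, np.ones_like(xs)]).reshape(3, -1)
    x, y, _ = Ainv @ dest                  # "where did you come from?"

    h, w = img.shape
    valid = (x >= 0) & (y >= 0) & (x <= w - 1) & (y <= h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    wx, wy = x - x0, y - y0                # bilinear weights
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x0 = np.clip(x0, 0, w - 1)
    y0 = np.clip(y0, 0, h - 1)

    out = ((1 - wx) * (1 - wy) * img[y0, x0] + wx * (1 - wy) * img[y0, x1]
           + (1 - wx) * wy * img[y1, x0] + wx * wy * img[y1, x1])
    out[~valid] = 0.0
    return out.reshape(H, W)
```

A sanity check worth doing in the notebook: the identity matrix should return the image unchanged, pixel for pixel.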

Check yourself β€” tutor questions
  1. Before Week 1: Can you write down the $2 \times 2$ matrix for a 30Β° rotation from memory, without peeking? If not, the derive-before-compute rule kicks in β€” open Strang Ch. 7 for twenty minutes first. This is the fluency I was talking about above.
  2. After Week 2: If I showed you a forward-warped image with holes in it, could you tell me β€” without running code β€” which direction the bug is in, and what one-line change fixes it?
  3. After Week 4: In one sentence, write down why inverse warping works. If the sentence feels too mechanical, you haven't earned it yet β€” go teach it to someone for five minutes and try again.

Deliverables β€” tap to mark done

warps.py · morph.ipynb · grid_warp_gallery.png · morph.mp4 · reflection.md

Common confusions β€” name them in the video script

  • "Why do we invert the matrix?" β€” Iterating over destination pixels is hole-free; inverting the map lets you pull source color.
  • "Is translation linear?" β€” No, it's affine. Homogeneous coordinates let you pretend otherwise, which is the point.
  • "Why bilinear, not nearest-neighbor?" β€” Bilinear is smooth in the source parameter; nearest-neighbor introduces aliasing that ruins the morph.

Extensions once the baseline ships

  • Projective (perspective) warps: drop the last-row constraint; recover the 8 DoF via a linear system.
  • Thin-plate-spline warps for smoother interpolation between landmarks.
  • Feature-based Beier–Neely morphing (line-segment fields instead of triangles).

Primary references

  • Beier & Neely, Feature-Based Image Metamorphosis, SIGGRAPH 1992.
  • Szeliski, Computer Vision: Algorithms and Applications, Ch. 3 & 8.
  • Strang, Introduction to Linear Algebra, Ch. 7 (linear transformations).
LAB 02 Β· 4 Γ— 45 MIN

Eigenfaces β€” faces as points in a 10,000-dimensional space

Linear algebra Β· PCA / SVD Β· Subspace projection

Current week

  • Week 1 · Dataset + baseline: load a Yale Faces / ORL / LFW subset. Grayscale, fixed size, vectorize. 1-NN-in-pixel-space baseline on a held-out split. This is the number to beat.
  • Week 2 · Mean face + the Gram trick: compute and plot the mean face (uncanny). Form $X_c X_c^\top$, eigendecompose, lift to eigenfaces. Display the top 20. Confirm np.linalg.svd agrees up to sign.
  • Week 3 · Projection & reconstruction: project held-out faces into $k$-dim face space for $k \in \{1,5,10,25,50,100\}$. Measure pixel-space MSE. Plot reconstruction error vs. $k$.
  • Week 4 · Recognition + failure analysis: 1-NN classification in face space at several $k$. Plot accuracy vs. $k$. Pick three failure cases and name what went wrong. That paragraph makes the writeup research-grade.

Motivation

A grayscale face at 100Γ—100 lives in $\mathbb{R}^{10{,}000}$. Almost all of that space is empty β€” faces occupy a thin submanifold of valid images. Lab 2 builds the tiny linear approximation of that submanifold (the "face space") and uses it to recognize, reconstruct, and imagine faces. The surprise every student gets: an eigenvector of the face covariance matrix is itself a face. You can literally plot it.

This lab is the pedagogy version of the Modeling Bench Paper 2 (SVD on classification). The lab produces the student-facing artifact; the Modeling Bench pushes the research contribution.

Linear algebra you will actually use

  • Data matrix. $X \in \mathbb{R}^{n \times d}$ where each row is a flattened, mean-centered face. $d = \text{width} \times \text{height}$.
  • Mean face + centering. $\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i$; the centered matrix $X_c$ is the only object the rest of the lab cares about.
  • Covariance & eigendecomposition. $C = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{d \times d}$. Eigenvectors of $C$ are the eigenfaces.
  • The $d \gg n$ trick. With $d=10{,}000$ and $n=400$, don't form the $10{,}000 \times 10{,}000$ covariance. Eigendecompose the $n \times n$ Gram matrix $X_c X_c^\top$ and lift via $X_c^\top$. That is SVD β€” the lab has you do it by hand first, then show NumPy agrees.
  • Subspace projection & reconstruction. A new face projects to $V_k^\top (\mathbf{y} - \bar{\mathbf{x}}) \in \mathbb{R}^k$, where the columns of $V_k$ are the top-$k$ eigenfaces; reconstruction is $\bar{\mathbf{x}} + V_k V_k^\top (\mathbf{y} - \bar{\mathbf{x}})$.
  • Classification in face space. 1-NN on $k$-dimensional projections is embarrassingly effective.
The single equation the lab is organized around

Write $X_c = U \Sigma V^\top$ (SVD). The columns of $V$ are eigenfaces, and the top-$k$ reconstruction of a face $\mathbf{y}$ is

$$\hat{\mathbf{y}} = \bar{\mathbf{x}} + V_k V_k^\top (\mathbf{y} - \bar{\mathbf{x}}).$$

Increasing $k$ smoothly trades reconstruction error for dimensionality. Plotting that curve is one of the deliverables.
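The Week 2 pipeline, both routes plus the reconstruction above, fits in a short sketch. Random data stands in for a real face matrix (n = 40 "faces", d = 400 pixels); the point is only that the Gram-trick eigenvectors match the SVD's up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 40, 400                        # 40 "faces", 20x20 pixels flattened
X = rng.normal(size=(n, d))           # synthetic stand-in for the face matrix

mean_face = X.mean(axis=0)
Xc = X - mean_face

# Route 1: SVD. Columns of V are the eigenfaces.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 10
Vk = Vt[:k].T                         # (d, k): top-k eigenfaces

# Route 2: the d >> n Gram trick. Eigendecompose the small n x n
# matrix, then lift each eigenvector through Xc^T and normalize.
evals, W = np.linalg.eigh(Xc @ Xc.T)  # ascending eigenvalues
order = np.argsort(evals)[::-1][:k]
lifted = Xc.T @ W[:, order]
lifted /= np.linalg.norm(lifted, axis=0)

# The two routes agree up to the sign of each eigenface.
signs = np.sign(np.sum(Vk * lifted, axis=0))
assert np.allclose(Vk, lifted * signs, atol=1e-6)

# Top-k reconstruction of a new face y: exactly the equation above.
y = rng.normal(size=d)
y_hat = mean_face + Vk @ (Vk.T @ (y - mean_face))
```

On real faces the only change is loading and flattening the images; the linear algebra is identical.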

Check yourself β€” tutor questions
  1. Before Week 1: Right now, without looking anything up, answer this: "What is PCA?" Write your answer in two sentences on paper. Don't erase it. You will come back to it at the end of Week 4, and I want you to see how much your answer changed.
  2. After Week 2: Your first eigenface looks mostly like lighting, not identity. What does that tell you about your dataset β€” and what would you do differently if you wanted the first component to encode identity instead?
  3. After Week 4: A skeptic says "eigenfaces is just compression β€” you're not doing recognition at all." Give them the strongest version of their argument, then give your strongest counter-argument. This is how you'll argue in a PhD interview.

Deliverables β€” tap to mark done

eigenfaces.py · mean_face.png · top20_eigenfaces.png · reconstruction_curve.png · accuracy_vs_k.png · failure_cases.md

Common confusions β€” name them in the video script

  • "Why subtract the mean?" β€” PCA finds directions of maximum variance, not magnitude. Without centering, the first component points at the mean.
  • "Why is the first eigenface ugly?" β€” It encodes the largest variation, which for most face sets is lighting, not identity. Good discovery moment.
  • "Isn't this just compression?" β€” Yes. Recognition-as-compression is the research insight: a face is "recognized" when its short code matches a short code you've seen.
  • "Why not the $d \times d$ covariance?" β€” You can, at 100M entries. The Gram trick is why eigenfaces was feasible on 1991 hardware.

Extensions

  • Fisherfaces: swap PCA for LDA. Frames PCA as unsupervised and LDA as supervised versions of the same projection idea.
  • Face morphing in face space: interpolate projected codes $\alpha \mathbf{c}_A + (1-\alpha)\mathbf{c}_B$ and reconstruct. Closes the loop with Lab 1.
  • Kernel PCA for a nonlinear face manifold β€” gateway to the ML track.

Primary references

  • Turk & Pentland, Eigenfaces for Recognition, J. Cognitive Neuroscience, 1991.
  • Belhumeur, Hespanha & Kriegman, Eigenfaces vs. Fisherfaces, PAMI 1997.
  • Hastie, Tibshirani & Friedman, ESL, Ch. 14 (unsupervised learning).
LAB 03 Β· 4 Γ— 45 MIN

USPS handwritten digits β€” the smallest dataset that teaches every ML idea

Linear algebra Β· PCA Β· Nearest-neighbor Β· Linear classifiers

Current week

  • Week 1 · Load, visualize, baseline: load USPS. Visualize 100 random digits. Split 7291 train / 2007 test (the canonical ESL split). Report pixel-NN accuracy as the baseline.
  • Week 2 · PCA compression: PCA on the training set. Plot the top 20 components as 16×16 images. Reconstruct at $k \in \{5,20,50,100\}$ next to originals. Plot cumulative variance.
  • Week 3 · Least-squares classifier from scratch: solve the normal equations. Classify the test set. Compare raw pixels vs. PCA features. Run 1-NN in PCA space; does lower dimension help or hurt? Why?
  • Week 4 · Confusion matrix + error analysis: build the 10×10 confusion matrix. Find the worst-confused pair (usually 4↔9 or 3↔5). Show 10 misclassified examples with the classifier's "thought" on each. 300-word analysis.

Motivation

The USPS digit set — ~9,298 handwritten digits at 16×16 grayscale — is the dataset Hastie, Tibshirani, and Friedman use as a running example in Elements of Statistical Learning. Small enough to fit in memory, rich enough to distinguish every classical method, and visual enough that an error analysis reads like a story. Lab 3 closes the trilogy by turning images into classifications instead of reconstructions.

This lab is the on-ramp to the Modeling Bench Paper 2 on SVD-based classification: do the lab, then ask "what's the research contribution here?" That's the PhD question.

Linear algebra you will actually use

  • Images as vectors. Each 16Γ—16 digit becomes a point in $\mathbb{R}^{256}$. The training set becomes $X \in \mathbb{R}^{n \times 256}$.
  • Nearest-neighbor classification. Distance is $\|\mathbf{x}_i - \mathbf{x}_q\|^2$ β€” a dot product in disguise. 1-NN is linear algebra, not "AI."
  • PCA for compression and visualization. Same mechanism as Lab 2 β€” top-$k$ directions, project, classify in the low-dim space. Watch accuracy vs. $k$.
  • Linear classification via least squares. One-hot encode labels; solve $\min_W \|XW - Y\|_F^2$; classify by $\arg\max \mathbf{x}^\top W$. A grad-level ML result in ten lines.
  • Confusion matrix as a matrix. A $10 \times 10$ integer matrix; $(i,j)$ is "true $i$, predicted $j$." Diagonals win; off-diagonals tell a story.
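The "dot product in disguise" bullet is worth seeing as code: expanding $\|\mathbf{a}-\mathbf{b}\|^2 = \|\mathbf{a}\|^2 - 2\,\mathbf{a}\cdot\mathbf{b} + \|\mathbf{b}\|^2$ turns the whole distance matrix into one matrix product. A minimal sketch (the function name is mine, not from the lab):

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """1-NN via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, so the
    full test-vs-train distance matrix is one matrix product."""
    d2 = (np.sum(X_test**2, axis=1)[:, None]
          - 2 * X_test @ X_train.T
          + np.sum(X_train**2, axis=1)[None, :])
    return y_train[np.argmin(d2, axis=1)]
```

No loops, no "AI": the entire classifier is a Gram matrix and an argmin.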
The single equation the lab is organized around

Given training data $X \in \mathbb{R}^{n \times d}$ and one-hot labels $Y \in \mathbb{R}^{n \times 10}$, the closed-form least-squares classifier is

$$W = (X^\top X)^{-1} X^\top Y,$$

and a new digit $\mathbf{x}$ is classified as $\arg\max_j (\mathbf{x}^\top W)_j$. Yes β€” a linear regression used as a classifier. Yes, it works better than it has any right to. The question the lab asks: when does it fail, and why?

Check yourself β€” tutor questions
  1. Before Week 1: 1-NN on raw pixels gets ~87% accuracy. Steelman the case for leaving it alone β€” why would you not bother with anything fancier? Your job is to argue the lazy position as strongly as you can, so you know why you're moving past it.
  2. After Week 3: Your least-squares classifier works. Does that mean linear regression is "a good idea for classification," or is something else going on? Be honest β€” what's the cheat that makes it work?
  3. After Week 4: The worst-confused pair in your confusion matrix is telling you about your representation, not your classifier. Say what you would change about the representation β€” not the model β€” to fix it. This is the research question.

Deliverables β€” tap to mark done

usps.py · digit_grid.png · pc_digits.png · reconstruction_grid.png · confusion_matrix.png · error_analysis.md

Common confusions β€” name them in the video script

  • "Why not a neural network?" β€” Because the LA version is 30 lines, trains in 100 ms, and hits ~87%. The gap to a CNN is the research question, not the starting point.
  • "Why is least-squares classification weird?" β€” The output is unbounded; labels are not. It works anyway because nearest-label is a projection.
  • "Why is 4 confused with 9?" β€” Because in 16Γ—16 grayscale the loop of a 9 and the loop of a 4 share most of their mass. Teaches you to care about representation.
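Building the 10×10 matrix itself is a few lines of NumPy. A sketch (the helper name is mine), with a tiny hand-made example in place of real predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    """M[i, j] counts examples whose true label is i, predicted as j."""
    M = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(M, (y_true, y_pred), 1)   # unbuffered indexed accumulation
    return M

# Tiny illustration: two 4s (one read as a 9), a 9, and a 3 read as a 5.
M = confusion_matrix(np.array([4, 4, 9, 3]), np.array([4, 9, 9, 5]))
```

np.add.at matters here rather than `M[y_true, y_pred] += 1`: the fancy-indexed form silently drops repeated (i, j) pairs, which is exactly the bug a confusion matrix would hide.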

Extensions

  • Regularized least squares (ridge): add $\lambda I$ to $X^\top X$. Show stability.
  • Logistic regression via gradient descent; compare to closed-form least squares.
  • Tangent distance (Simard et al.): build invariance to small rotations/translations into the distance function itself. Beautiful linear-algebra-of-invariances result.
  • SVD-based class-subspace classifier: one subspace per digit, classify by projection residual. The natural stepping stone to the Modeling Bench Paper 2.

Primary references

  • Hastie, Tibshirani & Friedman, ESL, Ch. 1, 4, 13.
  • Simard, LeCun, Denker & Victorri, Transformation Invariance in Pattern Recognition β€” Tangent Distance and Tangent Propagation, 1998.
  • LeCun et al., Gradient-Based Learning Applied to Document Recognition, Proc. IEEE 1998.
Focus Blocks Β· part of The System v4.1 Β· back to dashboard Β· modeling bench Β· projects Β· about