When you sit down in a margin block, I'll tell you what to advance.
The Schedule says when. This page is where I sit next to you and say what.
Tap the block you're in – the page remembers, jumps you to the action, logs your win when you close the session.
When you get stuck, scroll down: I've written the part that usually trips people up.
I'm going to walk you through this the same way I'd walk you through a whiteboard session: one track at a time, one block at a time, one honest question at a time. You don't have to be brilliant in any of these blocks. You have to show up, move one thing, and write it down so tomorrow-you trusts today-you. That's the whole game. Everything else on this page is scaffolding for that one habit.
Which block are you in right now?
TRACK 01
YouTube Channel – animate the concepts, model the process
A channel that animates CS ideas AND the metacognitive moves of a student engaging in highly effective project-based learning. The animations are the hook; the PBL voice is the moat.
Here is the thing nobody tells you about starting a channel: the people who ship start ugly. The people who quit keep polishing. If you only have 45 minutes tonight, I would rather you produce 90 seconds of rough-cut footage you hate than an unopened project file. Your future self will thank you. Ugly and shipped beats pretty and queued β every single time.
Do this now · 45 min
Pick one. Don't try to do two – decision fatigue wins every time.
Default → fallback order
Script beat – open the active script; advance one of hook → wrong intuition → derivation → animation → artifact. Stop when that beat reads clean out loud.
Manim scene – one animation for the active script. One scene, <60 sec runtime.
Record take – one recorded pass. No re-recording today; you'll hate it less after sleep.
Edit pass – 1 minute of footage, cuts locked.
Ship – render, thumbnail, description; upload unlisted, send to Jeff for a look, then go public.
When you're stuck
Can't start writing? Write the hook sentence out loud, badly, right now. Bad-on-purpose dodges perfectionism.
Don't know what to record? Record yourself teaching the concept to a friend from memory. That recording is your first script.
Manim intimidating? Use a text slide with a fade for this video. Fancy animation is optional; shipping isn't.
No wins logged yet. Close a session and tap "Done for today" to log one.
Positioning & voice
Answer this in one sentence before the first video ships.
Working tagline: "I animate CS concepts while thinking out loud about how I'd actually learn them – so you can steal the process, not just the answer."
Who it's for: community-college CS students, career-switchers, and students who've "taken the class" but don't trust their understanding. Not for people who already have the concept – for the person still looking for a door in.
What makes it not generic: most animated CS channels show the what. This one shows the how-I-got-here – the derivation, the wrong turn, the checkpoint question, the artifact at the end. That's the project-based-learning DNA.
Voice rules:
Derive before compute. Never drop a formula without showing where it comes from.
Build before import. If a library does it, first show the 10-line version you'd write.
Name the confusion. Say the wrong intuition out loud before correcting it.
Close with an artifact. Every video ends with a thing the viewer can run, fork, or open.
Milestone ladder – tap to mark done
Each rung is small enough to finish in 1–3 margin blocks. State saves automatically.
Channel identity locked
Channel name, handle, one-line description, avatar, banner. Write it in a doc you can't edit after Friday. Kills bikeshedding.
Tooling spike (one evening, no content)
Install Manim Community Edition and get one example scene rendering (a minimal smoke-test scene is sketched just after this ladder). OBS for screen+webcam. Mic level. Defaults everywhere – don't spend two weeks on presets.
Script template + first script
Five-beat template: hook → wrong intuition → derivation → animation → artifact/CTA. First topic: "Why matrix multiplication is row × column – and why we define it that way." Doubles as scaffolding for Lab 1.
Ship video 1 – ugly and shipped beats pretty and queued
Record, rough cut, no b-roll drama, upload unlisted first. Send to Jeff for one round. Then publish.
Batch scripts 2–5
Batching scripts is the single biggest velocity unlock. Each video after the first should need <2 blocks of scripting.
Series 1 complete – "Seeing Linear Algebra" (10 videos)
Ten 3–5 min videos, one LA concept each, each ending with a link to the matching lab below. Labs + videos become the same product: watch, then do.
Quarterly review – iterate voice
At 10 videos, read watch-time curves honestly. Keep what worked. Kill the first 20 seconds of everything that lost people.
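For the tooling-spike rung above, this is the scale of "one example scene rendering" – a minimal sketch assuming Manim Community Edition; the class name is mine:

```python
from manim import Scene, Square, Create, FadeOut

class SmokeTest(Scene):
    def construct(self):
        square = Square()          # any mobject will do; the point is that rendering works
        self.play(Create(square))
        self.play(FadeOut(square))
```

Render it with: manim -pql smoke_test.py SmokeTest. If a preview window opens with the square drawing itself, the spike is done.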
Series 1 content bank – "Seeing Linear Algebra"
Ten topics feeding directly into the three labs in Track 03.
| # | Working title | Core visual | Feeds lab |
| --- | --- | --- | --- |
| 1 | Why matrix multiplication is row × column | Basis vectors under a linear map, decomposed as dot products | Morphs |
| 2 | An image is a matrix (and every filter is a function of it) | Grayscale Lena beside a printed grid of numbers | All three |
| 3 | Homogeneous coordinates, or: why we add a fake dimension | 2D translation becoming a 3×3 product | Morphs |
| 4 | Forward vs. inverse warping – and the bug that teaches you which | Rotated image with holes vs. clean inverse-warped version | Morphs |
| 5 | What PCA actually does to a cloud of points | 2D Gaussian blob; axes rotating to principal directions | Eigenfaces, USPS |
| 6 | Eigenvectors without the textbook | A transformation whose action leaves certain arrows fixed | Eigenfaces |
| 7 | The covariance trick: $X^\top X$ vs. $XX^\top$ | Tall-skinny vs. short-fat data matrix and their eigenproblems | Eigenfaces |
| 8 | SVD as three honest steps: rotate, stretch, rotate | Unit circle → ellipse → rotated ellipse | Eigenfaces, USPS |
| 9 | Nearest neighbors, or: a 256-dim digit is still a point | Flatten a 16×16 digit, plot its 2D PCA projection | USPS |
| 10 | Confusion matrices, read like a chessboard | 10×10 grid; diagonals are wins; off-diagonals tell a story | USPS |
TRACK 02
Grad School Prep – don't drown in the first semester
Two jobs: pick the right programs, and arrive with enough technical fluency that coursework is learning, not survival. The PhD Tracker tab holds the program database; this page holds the study curriculum and the framework for deciding what rows go in that database.
You don't need to be brilliant in the first semester – you need to be not drowning. The difference between those two is roughly ten hours of self-study a week, nine months before you start. This is that. A warning, though: most students treat grad-school prep as "read more textbooks." That's wrong. Prep is re-deriving the 30 ideas you'll be asked to use as lemmas, without looking. If you can't find the eigenvalues of a 2×2 matrix from scratch on a napkin right now, that's not a gap in knowledge – that's a gap in fluency. Fluency is the thing that saves you in week two of a grad course.
Do this now · 45–60 min
Pick one. Read mode is the default; drop to the others only when it's blocked.
Default → fallback order
Read mode – open the day's chapter. Pen in hand. If you can't re-derive the key theorem on paper after 10 min, loop back.
Exercise mode – 3 problems from the chapter. Not "read the solution" – finish or fail honestly.
Program mode – advance one program row: read one target faculty member's latest paper, update the fit narrative, or draft one outreach email.
Anki mode – 3 cards per paper read this week. Clear the queue.
When you're stuck
Can't focus on reading? Do 3 exercises from yesterday's chapter. They drag you back into the material without the activation energy of a new topic.
Textbook feels abstract? Open Strang's MIT 18.06 lecture for that chapter. Thirty minutes of Strang on YouTube beats ninety minutes of sleepy reading.
Feeling behind? You are not behind. Nine months is enough. Write one Anki card and close the laptop. Tomorrow-you will be grateful.
Program research framework
Because "apply to good schools" is not a plan.
Target split
Build a list of 8–12 programs, roughly 3 reach / 5 match / 3 safety. Two parallel tracks:
MS in CS (coursework or thesis): SJSU, SFSU, SCU, UC Davis, UC Irvine, Oregon State, Georgia Tech OMSCS, UIUC MCS, UT Austin MSCSO. Strong technical floor; lower stakes on "did you publish yet?"
PhD in CS Education / Learning Sciences / HCI: what the PhD Tracker already targets. Candidate institutions to investigate (verify deadlines and research fit on the program site – these shift yearly): UW (Info School, CSE), Berkeley (EECS / GSE), Stanford (GSE, CS HCI), CMU (HCII, METALS), Michigan (EECS, UMSI), Georgia Tech (HCC), UC San Diego (Cog Sci, CSE/EdS), UC Irvine (Informatics), Indiana (Learning Sciences), Northwestern (Learning Sciences), Penn (GSE), Utah, NCSU.
For each program, capture 5 things
3–5 faculty with live funding in your area. Read their last 3 papers. If you can't write one sentence about why the work matters, they don't go on the list.
Admissions surface: deadline, GRE policy, required materials, funding model (TA/RA/fellowship), typical cohort size.
Fit narrative: one paragraph answering "why this program, not a peer in the same tier." This is the seed of your SoP paragraph for that school.
Contact strategy: who to email when (early fall for PhD, not a week before the deadline). What single question you'll ask that a 5-second skim of their site won't already answer.
Red flags: PI not taking students, lab wound down, group moved institutions. Check recent publication activity; an inactive lab is worse than a rejection.
When you write faculty emails, the one thing that separates "opens, reads, replies" from "ignores" is a specific sentence about their work that proves you read it. Not "your work on X is fascinating." A sentence that contains a fact only someone who read the paper would know β an experimental detail, a limitation they flagged, a dataset they chose. That's the trust anchor. The rest of the email can be short.
Where to verify
CSRankings.org – research-area-weighted rankings; filter by HCI, software engineering education, etc.
Program website – the one authoritative source for deadlines. Google is wrong half the time.
Google Scholar profiles of target PIs. Sort by year. Note co-authors – often students you could email later.
The GradCafe – last-year admit timing, funding signals, decision patterns. Noisy but useful.
Twitter/Bluesky – many HCI/CS-Ed PIs announce openings there months before the deadline.
Self-directed core curriculum – survive grad courses
By the time you sit in your first grad class, nothing in the syllabus should be a true first exposure. A 9–12-month plan at ~1 hour/day of focused study. Feeds the 8h CS Theory + 6.5h CS Build blocks on the Schedule.
Five pillars – open the primary text for each pillar first
| Pillar | Primary text | Why this one | Minimum dose |
| --- | --- | --- | --- |
| Linear algebra | Strang, Introduction to Linear Algebra (6e) + MIT 18.06 lectures | Teaches LA as four subspaces plus a story about $A\mathbf{x}=\mathbf{b}$. Axler is purer but less useful as a first grad-course refresher. | Ch. 1–7, SVD and eigendecomposition. Feeds Labs 2 & 3. |
| Probability & stats | Blitzstein & Hwang, Intro to Probability + Wasserman, All of Statistics | Blitzstein for intuition; Wasserman is the pocket reference you'll open during every grad ML course. | Blitzstein Ch. 1–8, then Wasserman Ch. 1–10. Exercises matter; probability is not a spectator sport. |
| Algorithms | CLRS, Introduction to Algorithms (4e) | Grad algorithms = CLRS + randomized/approximation on top. Know the core cold. | Parts I–III (growth, divide-and-conquer, sorting), graphs, DP. Skip advanced data structures on first pass. |
| Systems | OSTEP (free online) for OS, then Bryant & O'Hallaron, CS:APP | OSTEP is friendly and surprisingly deep. CS:APP makes all systems courses stop feeling random. | OSTEP: one chapter each from virtualization, concurrency, persistence. CS:APP: Ch. 1–3, 6, 9. |
| Machine learning | Hastie, Tibshirani & Friedman, Elements of Statistical Learning (free online) | The "adult book" of ML. Uses USPS digits as a running example; Lab 3 literally comes from Ch. 13. | Ch. 1–4, 7, 13, 14. Read once as a tourist, then redo the math in the chapters your labs touch. |
Supplementary – CS Ed / Learning Sciences / HCI
Because the PhD target is CS-Ed adjacent, the technical pillars aren't enough. Add a second stream while the first runs:
Sawyer (ed.), The Cambridge Handbook of the Learning Sciences. The field's canonical survey. One chapter per week.
Fincher & Robins (eds.), The Cambridge Handbook of Computing Education Research. Essential for reading your target faculty's papers.
Creswell, Research Design. Qualitative, quantitative, mixed-methods – the vocabulary to read CS Ed papers critically.
Running paper diet: 1–2 papers/week logged in the PhD Tracker tab. Target venues: ICER, SIGCSE, CHI, L@S, JLS, Cognition and Instruction.
Weekly cadence
| Days | Focus | Task |
| --- | --- | --- |
| MON–TUE | LA + Probability | One chapter's reading + 3–5 exercises. The foundation everything else leans on. |
| WED | Algorithms | One CLRS section + implement it from scratch in Python (not from memory of a library). |
| THU | Systems | OSTEP / CS:APP chapter or a hands-on project (pthreads, cache-aware matmul, tiny shell). |
| FRI–SAT | ML + paper | ESL section + one paper from the target-faculty reading list. Anki: 3 cards per paper. |
TRACK 03
PhD Research Labs – three labs that braid linear algebra into real artifacts
Three labs, designed to feed the Modeling Bench with student-facing material and to seed content for the YouTube series. Same principle for each: derive the math, build the method from scratch, then benchmark against a library. Each lab is scoped as four 45-minute margin blocks, ending in a committable artifact.
These three labs are designed to braid together. If you only had time to do one, do Eigenfaces. It is the lab where linear algebra stops being symbols on a page and starts being "I just plotted an eigenvector and it looks like a face." That moment β the one where the math becomes visible β is what you are paying tuition for in every LA class you will ever take. I want you to have it once, on your own laptop, with code you wrote. Then everything else in ML is a variation on that same trick.
Do this now · 45 min
Pick the lab and week you're currently in (set it on each lab below). During a block, do only that week's task.
Within any week's block
Derive – write the week's key equation on paper, from memory if you can. If you can't, rebuild it from the textbook.
Code – implement the week's task in a notebook. NumPy primitives only for the core step – no sklearn shortcuts.
Verify – compare to the library (np.linalg.svd, sklearn.decomposition.PCA). Confirm agreement up to sign/ordering (see the sketch after this list).
Commit – push the notebook. One sentence in the commit: what you learned, not what you did.
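For the Verify step, here's a minimal sketch of a sign- and order-robust comparison between a hand-rolled PCA basis and the library's (plain NumPy; the helper name, tolerance, and synthetic data are mine):

```python
import numpy as np

def components_match(mine, theirs, atol=1e-6):
    """Check two sets of unit-norm principal directions agree up to sign and column order.

    mine, theirs: (d, k) arrays whose columns are directions.
    """
    # |mine^T theirs| should look like a permutation matrix if the bases agree
    overlap = np.abs(mine.T @ theirs)                       # (k, k), entries in [0, 1]
    return (np.allclose(overlap.max(axis=1), 1.0, atol=atol) and
            np.allclose(overlap.max(axis=0), 1.0, atol=atol))

# example: hand-rolled PCA vs. np.linalg.svd on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Xc = X - X.mean(axis=0)

# hand-rolled: eigenvectors of the covariance matrix, sorted by descending eigenvalue
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(Xc) - 1))
mine = evecs[:, np.argsort(evals)[::-1]][:, :3]

# library: right singular vectors of the centered data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
theirs = Vt[:3].T

print(components_match(mine, theirs))                       # True, despite sign flips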
When you're stuck
Your code doesn't match the library? 90% of mismatches come down to eigenvector signs or column ordering. Check those first.
Derivation won't come back? Close the notebook. Open Strang on that chapter. Re-read ten pages. Try again tomorrow. Don't brute-force a derivation at midnight.
Notebook feels too small to matter? Commit it anyway. "Tiny commit today" is the habit you're building, not "brilliant commit today."
LAB 01 · 4 × 45 MIN
Morphs & Warps – images as functions you multiply
Linear algebra · Affine / projective maps · Interpolation
Current week
WEEK 1
Linear maps on a grid
Build a 50×50 test grid image. Implement rotation, scaling, shear as $2\times 2$ matrices on pixel coordinates. Forward-warp naively, observe the holes. One-paragraph reflection on why.
WEEK 2
Homogeneous coords + inverse warp
Rewrite every transform as a $3\times 3$. Implement inverse warping with bilinear interpolation. Verify by warping and un-warping – you should recover the original to sub-pixel error.
WEEK 3
Correspondence + Delaunay
Pick two images. Annotate 20–40 corresponding landmarks. Delaunay-triangulate. Compute per-triangle affine warps toward an intermediate average mesh.
WEEK 4
Morph = warp + cross-dissolve
At $t \in [0,1]$, warp both images toward the shape at $t$ and alpha-blend by $(1-t)$ and $t$. Render 30 frames. Export MP4. 300-word reflection.
Motivation
Every student has seen a rotated image; every student has computed a matrix-vector product. Almost none of them have connected the two. Lab 1 makes the connection viscerally: a warp is a matrix, and a morph is a warp plus a cross-dissolve. The lab produces a short face-morph animation β the same visual trick from early SIGGRAPH work and every music video from 1991.
Linear algebra you will actually use
Vectors as pixel coordinates. A point $(x, y)$ lives in $\mathbb{R}^2$; an image is a function $I: \mathbb{Z}^2 \to \mathbb{R}$.
Linear maps. Rotation, scaling, shear – each is a $2 \times 2$ matrix. Translation is affine, not linear.
Homogeneous coordinates. Promote $(x,y) \to (x,y,1)$ and every affine map is a single $3 \times 3$ multiply. This is the payoff that makes "why did we add a 1?" click.
Inverse maps + bilinear interpolation. Forward warping leaves holes; inverse warping plus interpolation does not. The bug teaches you which direction to iterate.
Piecewise-affine warps. Triangulate two faces with corresponding landmarks; each triangle gets its own $3 \times 3$ (sketched below). The idea behind Beier–Neely without the field calculus.
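A minimal sketch of that per-triangle step (plain NumPy; the helper name is mine): recover the $3 \times 3$ affine map that sends one triangle of landmarks onto another.

```python
import numpy as np

def triangle_affine(src, dst):
    """3x3 affine map sending the triangle `src` onto the triangle `dst`.

    src, dst: (3, 2) arrays of (x, y) landmark coordinates.
    """
    # homogeneous coordinates as columns: each point becomes (x, y, 1)
    S = np.vstack([src.T, np.ones(3)])    # (3, 3)
    D = np.vstack([dst.T, np.ones(3)])    # (3, 3)
    # solve A @ S = D  =>  A = D @ S^{-1}
    return D @ np.linalg.inv(S)

src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
dst = np.array([[2.0, 1.0], [3.0, 1.0], [2.0, 3.0]])
A = triangle_affine(src, dst)

# every source vertex should land exactly on its destination vertex
p = np.array([1.0, 0.0, 1.0])             # second source vertex, homogeneous
print(A @ p)                               # -> [3. 1. 1.]
```

Three point correspondences pin down the six affine unknowns exactly, which is why the triangle is the natural unit for piecewise warps.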
The single equation the lab is organized around
A forward affine warp sends source pixel $\mathbf{p}_s = (x_s, y_s, 1)^\top$ to destination pixel $\mathbf{p}_d = A \mathbf{p}_s$, where
$$A = \begin{bmatrix} a_{11} & a_{12} & t_x \\ a_{21} & a_{22} & t_y \\ 0 & 0 & 1 \end{bmatrix}.$$
At render time you iterate over every destination pixel and ask "where did you come from?" – i.e., compute $\mathbf{p}_s = A^{-1}\mathbf{p}_d$ and sample $I$ there. That one inversion is why inverse warping has no holes.
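Here's a minimal, vectorized sketch of that render-time loop (plain NumPy; the grayscale input and edge-clamping behavior are my simplifications):

```python
import numpy as np

def inverse_warp(img, A, out_shape=None):
    """Warp a grayscale image by the 3x3 affine A using inverse mapping.

    img: (H, W) float array. A maps source coords (x, y, 1) to destination coords.
    """
    H, W = out_shape if out_shape is not None else img.shape
    ys, xs = np.mgrid[0:H, 0:W]                               # destination pixel grid
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W), homogeneous

    src = np.linalg.inv(A) @ dst          # "where did you come from?" (affine: src[2] == 1)
    x, y = src[0], src[1]

    # bilinear interpolation: blend the four surrounding source pixels
    # (out-of-range samples are clamped to the image border for brevity)
    x0 = np.clip(np.floor(x).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, img.shape[0] - 2)
    wx, wy = x - x0, y - y0
    out = ((1 - wx) * (1 - wy) * img[y0, x0] +
           wx * (1 - wy) * img[y0, x0 + 1] +
           (1 - wx) * wy * img[y0 + 1, x0] +
           wx * wy * img[y0 + 1, x0 + 1])
    return out.reshape(H, W)
```

Because every destination pixel is visited exactly once, the output is dense by construction; the holes only appear when you iterate over source pixels instead.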
Check yourself – tutor questions
Before Week 1: Can you write down the $2 \times 2$ matrix for a 30° rotation from memory, without peeking? If not, the derive-before-compute rule kicks in – open Strang Ch. 7 for twenty minutes first. This is the fluency I was talking about above.
After Week 2: If I showed you a forward-warped image with holes in it, could you tell me β without running code β which direction the bug is in, and what one-line change fixes it?
After Week 4: In one sentence, write down why inverse warping works. If the sentence feels too mechanical, you haven't earned it yet β go teach it to someone for five minutes and try again.
Primary references
Szeliski, Computer Vision: Algorithms and Applications, Ch. 3 & 8.
Strang, Introduction to Linear Algebra, Ch. 7 (linear transformations).
LAB 02 · 4 × 45 MIN
Eigenfaces – faces as points in a 10,000-dimensional space
Linear algebra · PCA / SVD · Subspace projection
Current week
WEEK 1
Dataset + baseline
Load Yale Faces / ORL / LFW subset. Grayscale, fixed size, vectorize. 1-NN-in-pixel-space baseline on a held-out split. This is the number to beat.
WEEK 2
Mean face + the Gram trick
Compute and plot the mean face (uncanny). Form $X_c X_c^\top$, eigendecompose, lift to eigenfaces. Display top 20. Confirm np.linalg.svd agrees up to sign.
WEEK 3
Projection & reconstruction
Project held-out faces into $k$-dim face space for $k \in \{1,5,10,25,50,100\}$. Measure pixel-space MSE. Plot reconstruction error vs. $k$.
WEEK 4
Recognition + failure analysis
1-NN classification in face space at several $k$. Plot accuracy vs. $k$. Pick three failure cases and name what went wrong. That paragraph makes the writeup research-grade.
Motivation
A grayscale face at 100×100 lives in $\mathbb{R}^{10{,}000}$. Almost all of that space is empty – faces occupy a thin submanifold of valid images. Lab 2 builds the tiny linear approximation of that submanifold (the "face space") and uses it to recognize, reconstruct, and imagine faces. The surprise every student gets: an eigenvector of the face covariance matrix is itself a face. You can literally plot it.
This lab is the pedagogy version of the Modeling Bench Paper 2 (SVD on classification). The lab produces the student-facing artifact; the Modeling Bench pushes the research contribution.
Linear algebra you will actually use
Data matrix. $X \in \mathbb{R}^{n \times d}$ where each row is a flattened, mean-centered face. $d = \text{width} \times \text{height}$.
Mean face + centering. $\bar{\mathbf{x}} = \frac{1}{n}\sum_i \mathbf{x}_i$; the centered matrix $X_c$ is the only object the rest of the lab cares about.
Covariance & eigendecomposition. $C = \frac{1}{n-1} X_c^\top X_c \in \mathbb{R}^{d \times d}$. Eigenvectors of $C$ are the eigenfaces.
The $d \gg n$ trick. With $d=10{,}000$ and $n=400$, don't form the $10{,}000 \times 10{,}000$ covariance. Eigendecompose the $n \times n$ Gram matrix $X_c X_c^\top$ and lift via $X_c^\top$. That is exactly what the SVD does – the lab has you do it by hand first, then show NumPy agrees.
Subspace projection & reconstruction. A new face projects to $V_k^\top (\mathbf{y} - \bar{\mathbf{x}}) \in \mathbb{R}^k$, where $V_k$ stacks the top $k$ eigenfaces as columns; reconstruction is $\bar{\mathbf{x}} + V_k V_k^\top (\mathbf{y} - \bar{\mathbf{x}})$.
Classification in face space. 1-NN on $k$-dimensional projections is embarrassingly effective.
The single equation the lab is organized around
Write $X_c = U \Sigma V^\top$ (SVD). The columns of $V$ are the eigenfaces, and the top-$k$ reconstruction of a face $\mathbf{y}$ is
$$\hat{\mathbf{y}}_k = \bar{\mathbf{x}} + V_k V_k^\top (\mathbf{y} - \bar{\mathbf{x}}),$$
where $V_k$ holds the first $k$ columns of $V$.
Increasing $k$ smoothly trades reconstruction error for dimensionality. Plotting that curve is one of the deliverables.
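To make the Gram trick and the reconstruction formula concrete, here's a compact sketch under the conventions above (rows of $X$ are flattened faces; dataset loading is omitted; function names and the synthetic sanity check are mine):

```python
import numpy as np

def eigenfaces(X, k):
    """Top-k eigenfaces via the n x n Gram trick (rows of X are flattened faces)."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # (n, d), mean-centered
    # d >> n: eigendecompose the small n x n Gram matrix, not the d x d covariance
    evals, U = np.linalg.eigh(Xc @ Xc.T)            # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    V = Xc.T @ U[:, order]                          # lift to pixel space: columns are eigenfaces
    V /= np.linalg.norm(V, axis=0)                  # unit-norm columns
    return mean, V

def reconstruct(y, mean, V):
    """Project a new face y into face space and map it back to pixels."""
    code = V.T @ (y - mean)                         # k-dimensional face-space code
    return mean + V @ code

# sanity check on synthetic data: library SVD should agree up to sign
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 400))                      # 40 "faces", 400 "pixels"
mean, V = eigenfaces(X, k=5)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
print(np.allclose(np.abs(np.sum(V * Vt[:5].T, axis=0)), 1.0))   # True
```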
Check yourself – tutor questions
Before Week 1: Right now, without looking anything up, answer this: "What is PCA?" Write your answer in two sentences on paper. Don't erase it. You will come back to it at the end of Week 4, and I want you to see how much your answer changed.
After Week 2: Your first eigenface looks mostly like lighting, not identity. What does that tell you about your dataset β and what would you do differently if you wanted the first component to encode identity instead?
After Week 4: A skeptic says "eigenfaces is just compression β you're not doing recognition at all." Give them the strongest version of their argument, then give your strongest counter-argument. This is how you'll argue in a PhD interview.
Common confusions – name them in the video script
"Why subtract the mean?" – PCA finds directions of maximum variance, not magnitude. Without centering, the first component points at the mean.
"Why is the first eigenface ugly?" – It encodes the largest variation, which for most face sets is lighting, not identity. Good discovery moment.
"Isn't this just compression?" – Yes. Recognition-as-compression is the research insight: a face is "recognized" when its short code matches a short code you've seen.
"Why not the $d \times d$ covariance?" – You can, but that's 100 million entries. The Gram trick is why eigenfaces was feasible on 1991 hardware.
Extensions
Fisherfaces: swap PCA for LDA. This frames PCA as the unsupervised and LDA as the supervised version of the same projection idea.
Face morphing in face space: interpolate projected codes $\alpha \mathbf{c}_A + (1-\alpha)\mathbf{c}_B$ and reconstruct. Closes the loop with Lab 1.
Kernel PCA for a nonlinear face manifold – a gateway to the ML track.
Primary references
Turk & Pentland, Eigenfaces for Recognition, J. Cognitive Neuroscience, 1991.
Belhumeur, Hespanha & Kriegman, Eigenfaces vs. Fisherfaces, PAMI 1997.
LAB 03 · 4 × 45 MIN
USPS handwritten digits – the smallest dataset that teaches every ML idea
Linear algebra · PCA · Nearest-neighbor · Linear classifiers
Current week
WEEK 1
Load, visualize, baseline
Load USPS. Visualize 100 random digits. Split 7291 train / 2007 test (the canonical ESL split). Report pixel-NN accuracy as baseline.
WEEK 2
PCA compression
PCA on the training set. Plot the top 20 components as 16×16 images. Reconstruct at $k \in \{5,20,50,100\}$ next to the originals. Plot cumulative variance.
WEEK 3
Least-squares classifier from scratch
Solve the normal equations. Classify the test set. Compare raw pixels vs. PCA features. Run 1-NN in PCA space – does lower dimension help or hurt? Why?
WEEK 4
Confusion matrix + error analysis
Build the 10×10 confusion matrix. Find the worst-confused pair (usually 4/9 or 3/5). Show 10 misclassified examples with the classifier's "thought" on each. 300-word analysis.
Motivation
The USPS digit set (~9,298 handwritten digits at 16×16 grayscale) is the dataset Hastie, Tibshirani, and Friedman use as a running example in Elements of Statistical Learning. Small enough to fit in memory, rich enough to distinguish every classical method, and visual enough that an error analysis reads like a story. Lab 3 closes the trilogy by turning images into classifications instead of reconstructions.
This lab is the on-ramp to the Modeling Bench Paper 2 on SVD-based classification: do the lab, then ask "what's the research contribution here?" That's the PhD question.
Linear algebra you will actually use
Images as vectors. Each 16×16 digit becomes a point in $\mathbb{R}^{256}$. The training set becomes $X \in \mathbb{R}^{n \times 256}$.
Nearest-neighbor classification. Distance is $\|\mathbf{x}_i - \mathbf{x}_q\|^2$ – a dot product in disguise (see the sketch after this list). 1-NN is linear algebra, not "AI."
PCA for compression and visualization. Same mechanism as Lab 2 – top-$k$ directions, project, classify in the low-dim space. Watch accuracy vs. $k$.
Linear classification via least squares. One-hot encode labels; solve $\min_W \|XW - Y\|_F^2$; classify by $\arg\max \mathbf{x}^\top W$. A grad-level ML result in ten lines.
Confusion matrix as a matrix. A $10 \times 10$ integer matrix; $(i,j)$ is "true $i$, predicted $j$." Diagonals win; off-diagonals tell a story.
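Here's what "a dot product in disguise" looks like in code: a vectorized 1-NN using the expansion $\|\mathbf{a}-\mathbf{b}\|^2 = \|\mathbf{a}\|^2 - 2\,\mathbf{a}\cdot\mathbf{b} + \|\mathbf{b}\|^2$ (a minimal sketch; the function name and toy data are mine):

```python
import numpy as np

def nn_predict(X_train, y_train, X_test):
    """1-nearest-neighbor prediction built entirely from dot products."""
    # (n_test, n_train) matrix of squared distances via the norm expansion
    d2 = (np.sum(X_test**2, axis=1)[:, None]
          - 2.0 * X_test @ X_train.T
          + np.sum(X_train**2, axis=1)[None, :])
    return y_train[np.argmin(d2, axis=1)]

# toy check: points near the origin are class 0, points near (5, ..., 5) are class 1
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 256)), rng.normal(5, 1, (50, 256))])
y_train = np.array([0] * 50 + [1] * 50)
X_test = np.vstack([rng.normal(0, 1, (5, 256)), rng.normal(5, 1, (5, 256))])
print(nn_predict(X_train, y_train, X_test))    # -> [0 0 0 0 0 1 1 1 1 1]
```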
The single equation the lab is organized around
Given training data $X \in \mathbb{R}^{n \times d}$ and one-hot labels $Y \in \mathbb{R}^{n \times 10}$, the closed-form least-squares classifier is
$$W = (X^\top X)^{-1} X^\top Y,$$
and a new digit $\mathbf{x}$ is classified as $\arg\max_j (\mathbf{x}^\top W)_j$. Yes, that is a linear regression used as a classifier. Yes, it works better than it has any right to. The question the lab asks: when does it fail, and why?
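A minimal sketch of that classifier (plain NumPy; the function names are mine, and the USPS arrays are assumed to be loaded already). It solves the normal equations with np.linalg.solve rather than an explicit inverse, and exposes a $\lambda I$ term so the ridge extension below is a one-argument change:

```python
import numpy as np

def fit_ls_classifier(X, y, n_classes=10, lam=0.0):
    """Closed-form least-squares classifier: W = (X^T X + lam*I)^{-1} X^T Y.

    X: (n, d) training data, y: (n,) integer labels. lam > 0 gives ridge regression.
    """
    Y = np.eye(n_classes)[y]                           # one-hot labels, (n, n_classes)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def predict(W, X):
    return np.argmax(X @ W, axis=1)

# usage sketch, assuming X_train, y_train, X_test, y_test hold the canonical USPS split:
# W = fit_ls_classifier(X_train, y_train)
# acc = np.mean(predict(W, X_test) == y_test)
# cm = np.zeros((10, 10), dtype=int)                   # confusion matrix: true i, predicted j
# np.add.at(cm, (y_test, predict(W, X_test)), 1)
```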
Check yourself – tutor questions
Before Week 1: 1-NN on raw pixels gets ~87% accuracy. Steelman the case for leaving it alone: why would you not bother with anything fancier? Your job is to argue the lazy position as strongly as you can, so you know why you're moving past it.
After Week 3: Your least-squares classifier works. Does that mean linear regression is "a good idea for classification," or is something else going on? Be honest β what's the cheat that makes it work?
After Week 4: The worst-confused pair in your confusion matrix is telling you about your representation, not your classifier. Say what you would change about the representation β not the model β to fix it. This is the research question.
Common confusions – name them in the video script
"Why not a neural network?" – Because the LA version is 30 lines, trains in 100 ms, and hits ~87%. The gap to a CNN is the research question, not the starting point.
"Why is least-squares classification weird?" – The output is unbounded; labels are not. It works anyway because nearest-label is a projection.
"Why is 4 confused with 9?" – Because in 16×16 grayscale the loop of a 9 and the loop of a 4 share most of their mass. Teaches you to care about representation.
Extensions
Regularized least squares (ridge): add $\lambda I$ to $X^\top X$. Show stability.
Logistic regression via gradient descent; compare to closed-form least squares.
Tangent distance (Simard et al.): build invariance to small rotations/translations into the distance function itself. Beautiful linear-algebra-of-invariances result.
SVD-based class-subspace classifier: one subspace per digit, classify by projection residual. The natural stepping stone to the Modeling Bench Paper 2.
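For that last extension, a minimal sketch of the per-class subspace idea (plain NumPy; the names and the choice of $k$ are mine): fit one top-$k$ right-singular-vector basis per digit, then classify a new digit by whichever class subspace reconstructs it with the smallest residual.

```python
import numpy as np

def fit_class_subspaces(X, y, k=10, n_classes=10):
    """One k-dimensional subspace (top-k right singular vectors) per digit class."""
    bases = []
    for c in range(n_classes):
        _, _, Vt = np.linalg.svd(X[y == c], full_matrices=False)
        bases.append(Vt[:k].T)                         # (d, k) orthonormal basis for class c
    return bases

def predict_by_residual(bases, X):
    """Classify each row of X by the class whose subspace leaves the smallest residual."""
    residuals = np.stack([
        np.linalg.norm(X - (X @ B) @ B.T, axis=1)      # || x - B B^T x || per sample
        for B in bases
    ], axis=1)                                         # (n, n_classes)
    return np.argmin(residuals, axis=1)
```

The design choice worth noticing: nothing here is trained in the gradient-descent sense; the "model" is ten SVDs, which is exactly the kind of baseline the Modeling Bench question starts from.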
Primary references
Hastie, Tibshirani & Friedman, ESL, Ch. 1, 4, 13.
Simard, LeCun, Denker & Victorri, Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation, 1998.
LeCun et al., Gradient-Based Learning Applied to Document Recognition, Proc. IEEE 1998.