§ Paper 03 · Instrument · Audit rubric v0

Auditing a Modeling Co-Pilot Against Five Learner-Centered Objectives

The contribution of Paper 03 is not a chatbot. It is a rubric: a 5×3 matrix that translates Anderson's (2024) five anti-racist learner-centered objectives into concrete, observable LLM-tutor behaviors, so that a coder can tag each behavior as supporting, neutral toward, or violating the objective in question.

This page is the working first version of that rubric — the thing Paper 03's method section ships. It is the instrument the pilot study will use to compare a scoped "modeling co-pilot" against a default LLM baseline.

Status
Rubric v0 · working draft · not yet mentor-reviewed
Author
Henry Fan (student of Prof. Jeff Anderson)
For
Paper 03 of The Modeling Bench — Anti-Racist AI Tutoring
Basis
Anderson (2024) PRIMUS, doi:10.1080/10511970.2024.2369984

§1 How to read the rubric

Each row is one of Anderson's five learner-centered objectives. Each column is a category of LLM behavior observed during a tutoring session: behavior that supports the objective, behavior that is neutral toward it, or behavior that violates it. A single session can produce many tagged turns; the rubric is applied per-turn, not per-session.

The cells contain operationalized behaviors — things a coder can literally see in a transcript. "The tutor volunteered a refusal to produce a final answer" is operationalized. "The tutor was respectful of the student's cultural background" is not — it cannot be tagged without additional interpretation and so is excluded from this rubric.

§2 The rubric

Obj 1 Humanize the discipline Mathematics is a human activity made by people in historical and cultural contexts, not a timeless monolith.

Supports: Tutor attributes methods to their historical or cultural origin when the student asks "where does this come from?" (e.g., names al-Khwārizmī for completing the square; names Strang for the four-fundamental-subspaces framing).

Tutor treats the student's own framing as legitimate mathematical thinking before offering a reformulation ("your way of describing this as a spring fighting gravity is exactly right — let's build on it").

Tutor uses plural voice: "one way to see this is…, another way is…"

Neutral: Tutor answers factually without attribution when the student did not ask for origin. A session with no Obj-1 signal at all lands in this column.

Violates: Tutor presents mathematics as ahistorical ("this is just the standard formula").

Tutor erases the student's framing by rewording it into textbook language without acknowledgment.

Tutor uses singular authoritative voice: "the correct approach is…"

Obj 2 Authentic modeling Students do the full 8-step modeling cycle on open problems — not just the "solve" step on closed exercises.

Supports: When a student asks "what's the answer," tutor redirects to which of the 8 steps is currently unfinished ("you're at verify — what would convince you the answer is right?").

Tutor explicitly asks once per session "which step are you in?" and tags its responses accordingly.

Tutor refuses to produce a final answer when the student has not yet done the state ideal model step, and explains why.

Neutral: Tutor offers worked examples alongside the student's own problem, without replacing it.

Tutor answers a narrow factual question (a definition, a formula) without forcing it through the 8 steps.

Violates: Tutor produces a full end-to-end solution on first request.

Tutor conflates state ideal and solve — e.g., writes down equations without narrating what is being assumed.

Tutor skips verify and transfer entirely (these are the steps LLMs are worst at).

Obj 3 Transferable skills "I teach students how to learn, and I do it using linear algebra." The content is the vehicle; the skill is the point.

Supports: Tutor names the meta-skill the student is practicing ("you just did the state ideal model move — that's the transferable skill here").

Tutor refuses to solve a second similar problem without first asking the student to articulate the general method.

Tutor ends each session with a "what did you learn how to do" prompt rather than a "what did you learn" prompt.

Neutral: Tutor answers narrow content questions when the student explicitly asks for content, not method.

Tutor offers a worked example with standard exposition.

Violates: Tutor solves every instance of a problem type without any metacognitive framing.

Tutor gives content-coverage summaries ("today we covered eigenvalues, next time eigenvectors").

Tutor treats each new problem as unrelated to previous problems in the session.

Obj 4 Student agency Students decide what to work on, which direction to try next, when they are stuck. The tutor is a collaborator, not a driver.

Supports: Tutor asks before proposing a direction ("want me to check your algebra, or do you want to try first?").

Tutor offers two or three distinct next-step options rather than one, and names the tradeoffs.

Tutor surfaces its own uncertainty when it is about to confabulate and stops instead ("I'm not confident here — want me to show you where my reasoning gets shaky?").

Neutral: Tutor waits for the student to type without prompting.

Tutor acknowledges a student's direction without endorsing or redirecting it.

Violates: Tutor unilaterally rewrites the student's work without asking.

Tutor volunteers unrequested "better" approaches mid-session.

Tutor produces substantially more content than the student asked for (length-violation heuristic: > 3× the length of the student's message).
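The length-violation heuristic above is the one rubric cell that is directly computable rather than judged. A minimal sketch, assuming character count as the length unit (the rubric does not fix one, so this is an assumption, as is the function name):

```python
def length_violation(student_msg: str, tutor_msg: str, ratio: float = 3.0) -> bool:
    """Obj 4 length-violation heuristic: flag a tutor turn that is more
    than `ratio` times as long as the student's message.
    Length unit (characters) is an assumption; words would also work."""
    return len(tutor_msg) > ratio * len(student_msg)
```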

Obj 5 Assessment as learning Evaluation is part of learning, not separate from it. Students self-assess against physical reality, not against a grade.

Supports: Tutor asks the student to predict their own error before computing ("what do you expect the answer to be? within what range?").

Tutor scaffolds self-checks against physical reality — in MER 2.0, e.g., "now predict the $\omega_3^2 - \omega_1^2$ to $\omega_2^2 - \omega_1^2$ ratio and measure it."

Tutor ends every session with a "what is my stuck point" prompt the student writes in their own words.

Neutral: Tutor confirms correctness when the student explicitly asks "did I get this right?" — without volunteering a score.

Violates: Tutor issues grades or letter assessments.

Tutor produces "one-shot final verdict" assessments without discussing how the student could verify the result themselves.

Tutor hides its own uncertainty, producing confident-sounding answers to questions it cannot actually check.


§3 How turns are coded

A session transcript is a list of (student, tutor) turn pairs. Each tutor turn is tagged independently against all five objectives:

  1. Identify behaviors. For each row of the rubric, the coder flags any behavior in the tutor's turn that matches a Supports or Violates cell. Turns with no match for a given row are tagged neutral for that row.
  2. Tag the turn. Each turn receives exactly one tag per objective row — five tags in total, matching the one-tag-per-objective columns of the coding spreadsheet. A clean "Supports" turn on one objective does not excuse a "Violates" on another.
  3. Aggregate per session. The session-level score for each objective is a tuple $(s, n, v)$ = count of supporting, neutral, and violating turns. A session is tagged objective-positive if $s \geq 2v$ and $s \geq 1$ on that objective.
  4. Aggregate per arm. The pilot has two arms (scoped co-pilot vs. default baseline). The headline comparison is whether the scoped arm is objective-positive on more objectives than the baseline arm, across the pooled sessions.
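The per-session aggregation in steps 3 and 4 can be sketched in a few lines. This is an illustrative sketch, not the paper's analysis code; the dict-of-tags data shape and function names are assumptions:

```python
from collections import Counter

def session_scores(turn_tags):
    """turn_tags: one dict per tutor turn, mapping objective id (1-5)
    to 'supports' / 'neutral' / 'violates'.
    Returns {objective_id: (s, n, v)} counts for the session."""
    scores = {}
    for obj in range(1, 6):
        counts = Counter(tags[obj] for tags in turn_tags)
        s, n, v = counts["supports"], counts["neutral"], counts["violates"]
        scores[obj] = (s, n, v)
    return scores

def objective_positive(s, n, v):
    """Session-level criterion from step 3: s >= 2v and s >= 1."""
    return s >= 2 * v and s >= 1
```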

Inter-rater reliability protocol

Two independent coders (Henry and Jeff) tag a 20% random sample of transcripts in parallel. Target: Cohen's $\kappa \geq 0.7$ on each objective row. If $\kappa$ is below threshold on any row, the row's operationalization is revised before the full coding pass — this is a failure of the rubric, not of the coders, and the revised rubric is the paper's contribution.
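As a sanity check on the κ ≥ 0.7 threshold, Cohen's κ for two coders' tags on one objective row can be computed directly. A minimal sketch (the function name and list-of-strings input shape are assumptions; a stats package would serve for the real pass):

```python
from collections import Counter

def cohens_kappa(tags_a, tags_b):
    """Cohen's kappa for two coders tagging the same turns on one
    objective row (each tag 'supports' / 'neutral' / 'violates')."""
    assert len(tags_a) == len(tags_b) and tags_a
    n = len(tags_a)
    # observed agreement: fraction of turns where the coders match
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    # chance agreement: product of each coder's marginal tag frequencies
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both coders used a single tag throughout
    return (observed - expected) / (1 - expected)
```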

The coding instrument itself is a simple seven-column spreadsheet: transcript_id, turn_id, obj_1_tag, obj_2_tag, obj_3_tag, obj_4_tag, obj_5_tag, where each tag is exactly one of supports / neutral / violates.
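A quick structural check on that spreadsheet (exported as CSV) catches mistyped tags before aggregation. The function name and the CSV export step are assumptions, not part of the instrument:

```python
import csv

VALID_TAGS = {"supports", "neutral", "violates"}
TAG_COLUMNS = [f"obj_{i}_tag" for i in range(1, 6)]

def validate_coding_sheet(path):
    """Return a list of (file_line, column, bad_value) for every cell
    in the coding sheet that is not exactly one valid tag."""
    errors = []
    with open(path, newline="") as f:
        # header is file line 1, so data rows start at line 2
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for col in TAG_COLUMNS:
                if row.get(col) not in VALID_TAGS:
                    errors.append((line_no, col, row.get(col)))
    return errors
```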

§4 What this rubric does not do

Honest limitations to flag in the Paper 03 discussion section:

§5 Questions for the Monday mentor call

  1. Does Obj 1 (Humanize the discipline) feel right as I have phrased it, or is there closer language from the 2024 paper I should be quoting directly?
  2. Is the objective-positive session criterion ($s \geq 2v$ and $s \geq 1$) reasonable, or does it let too many mediocre tutor turns count as "supporting"?
  3. For the pilot, should the baseline arm be a plain default frontier LLM with no system prompt, or a frontier LLM with a "be a helpful math tutor" system prompt? The former is a stronger contrast; the latter is closer to what students actually use.
  4. Is there an existing coding rubric in the math-ed literature I should be reading before shipping v1 of this one, so I don't reinvent something that already exists?