Auditing a Modeling Co-Pilot Against Five Learner-Centered Objectives
The contribution of Paper 03 is not a chatbot. It is a rubric: a 5×3 matrix that translates Anderson's (2024) five anti-racist learner-centered objectives into concrete, observable LLM-tutor behaviors a coder can tag as supporting, neutral toward, or violating each objective.
This page is the working first version of that rubric — the thing Paper 03's method section ships. It is the instrument the pilot study will use to compare a scoped "modeling co-pilot" against a default LLM baseline.
- **Status:** Rubric v0 · working draft · not yet mentor-reviewed
- **Author:** Henry Fan (student of Prof. Jeff Anderson)
- **For:** Paper 03 of The Modeling Bench — Anti-Racist AI Tutoring
- **Basis:** Anderson (2024) PRIMUS, doi:10.1080/10511970.2024.2369984
§1 How to read the rubric
Each row is one of Anderson's five learner-centered objectives. Each column is a category of LLM behavior observed during a tutoring session: behavior that supports the objective, behavior that is neutral toward it, or behavior that violates it. A single session can produce many tagged turns; the rubric is applied per-turn, not per-session.
The cells contain operationalized behaviors — things a coder can literally see in a transcript. "The tutor volunteered a refusal to produce a final answer" is operationalized. "The tutor was respectful of the student's cultural background" is not — it cannot be tagged without additional interpretation and so is excluded from this rubric.
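Because tags are per-turn and per-row, the coding unit is a small, regular record: one tag per objective for each tutor turn, defaulting to Neutral. A minimal sketch of that data model in Python (names are illustrative, not from the paper):

```python
from enum import Enum

class Tag(Enum):
    """The three rubric columns a coder can assign per objective."""
    SUPPORTS = "supports"
    NEUTRAL = "neutral"
    VIOLATES = "violates"

# One rubric row per learner-centered objective (Anderson 2024).
OBJECTIVES = ["obj_1", "obj_2", "obj_3", "obj_4", "obj_5"]

def empty_turn_tags() -> dict:
    """A fresh tag record for one tutor turn: every row starts Neutral and is
    upgraded only when the coder spots a Supports or Violates behavior."""
    return {obj: Tag.NEUTRAL for obj in OBJECTIVES}
```

A coder's decision for a turn then amounts to flipping at most one entry per row away from Neutral.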
§2 The rubric
| Objective | Supports | Neutral | Violates |
|---|---|---|---|
| **Obj 1 · Humanize the discipline.** Mathematics is a human activity made by people in historical and cultural contexts, not a timeless monolith. | Tutor attributes methods to their historical or cultural origin when the student asks "where does this come from?" (e.g., names al-Khwārizmī for completing the square; names Strang for the four-fundamental-subspaces framing). Tutor treats the student's own framing as legitimate mathematical thinking before offering a reformulation ("your way of describing this as a spring fighting gravity is exactly right — let's build on it"). Tutor uses plural voice: "one way to see this is…, another way is…" | Tutor answers factually without attribution when the student did not ask for origin. A session with no Obj-1 signal at all lands in this column. | Tutor presents mathematics as a-historical ("this is just the standard formula"). Tutor erases the student's framing by rewording it into textbook language without acknowledgment. Tutor uses a singular authoritative voice: "the correct approach is…" |
| **Obj 2 · Authentic modeling.** Students do the full 8-step modeling cycle on open problems — not just the *solve* step on closed exercises. | When a student asks "what's the answer," tutor redirects to whichever of the 8 steps is currently unfinished ("you're at *verify* — what would convince you the answer is right?"). Tutor explicitly asks once per session "which step are you in?" and tags its responses accordingly. Tutor refuses to produce a final answer when the student has not yet done the *state ideal model* step, and explains why. | Tutor offers worked examples alongside the student's own problem, without replacing it. Tutor answers a narrow factual question (a definition, a formula) without forcing it through the 8 steps. | Tutor produces a full end-to-end solution on first request. Tutor conflates *state ideal model* and *solve* — e.g., writes down equations without narrating what is being assumed. Tutor skips *verify* and *transfer* entirely (the steps LLMs are worst at). |
| **Obj 3 · Transferable skills.** "I teach students how to learn, and I do it using linear algebra." The content is the vehicle; the skill is the point. | Tutor names the meta-skill the student is practicing ("you just did the *state ideal model* move — that's the transferable skill here"). Tutor refuses to solve a second similar problem without first asking the student to articulate the general method. Tutor ends each session with a "what did you learn how to do" prompt rather than a "what did you learn" prompt. | Tutor answers narrow content questions when the student explicitly asks for content, not method. Tutor offers a worked example with standard exposition. | Tutor solves every instance of a problem type without any metacognitive framing. Tutor gives content-coverage summaries ("today we covered eigenvalues, next time eigenvectors"). Tutor treats each new problem as unrelated to previous problems in the session. |
| **Obj 4 · Student agency.** Students decide what to work on, which direction to try next, and when they are stuck. The tutor is a collaborator, not a driver. | Tutor asks before proposing a direction ("want me to check your algebra, or do you want to try first?"). Tutor offers two or three distinct next-step options rather than one, and names the tradeoffs. Tutor surfaces its own uncertainty when it is about to confabulate and stops instead ("I'm not confident here — want me to show you where my reasoning gets shaky?"). | Tutor waits for the student to type without prompting. Tutor acknowledges a student's direction without endorsing or redirecting it. | Tutor unilaterally rewrites the student's work without asking. Tutor volunteers unrequested "better" approaches mid-session. Tutor produces substantially more content than the student asked for (length-violation heuristic: more than 3× the length of the student's message). |
| **Obj 5 · Assessment as learning.** Evaluation is part of learning, not separate from it. Students self-assess against physical reality, not against a grade. | Tutor asks the student to predict their own error before computing ("what do you expect the answer to be? within what range?"). Tutor scaffolds self-checks against physical reality — in MER 2.0, e.g., "now predict the $\omega_3^2 - \omega_1^2$ to $\omega_2^2 - \omega_1^2$ ratio and measure it." Tutor ends every session with a "what is my stuck point" prompt the student writes in their own words. | Tutor confirms correctness when the student explicitly asks "did I get this right?", without volunteering a score. | Tutor issues grades or letter assessments. Tutor produces "one-shot final verdict" assessments without discussing how the student could verify the result themselves. Tutor hides its own uncertainty, producing confident-sounding answers to questions it cannot actually check. |
§3 How turns are coded
A session transcript is a list of (student, tutor) turn pairs. Each tutor turn is tagged independently against all five objectives:
1. **Identify behaviors.** For each row of the rubric, the coder flags any behavior in the tutor's turn that matches a Supports or Violates cell. Turns with no match for a given row are tagged Neutral for that row.
2. **Tag the turn.** Each turn receives exactly one tag per row, so a single turn can carry up to five non-neutral tags. A clean Supports turn on one objective does not excuse a Violates on another.
3. **Aggregate per session.** The session-level score for each objective is a tuple $(s, n, v)$: the counts of supporting, neutral, and violating turns. A session is tagged objective-positive on that objective if $s \geq 2v$ and $s \geq 1$.
4. **Aggregate per arm.** The pilot has two arms (scoped co-pilot vs. default baseline). The paper's headline comparison is whether the scoped arm is objective-positive on more objectives than the baseline arm, across the pooled sessions.
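The aggregation steps above can be sketched directly. A minimal Python version, assuming each turn is a dict of one tag string per objective (function names are illustrative):

```python
from collections import Counter

def session_score(turn_tags: list[dict], objective: str) -> tuple[int, int, int]:
    """Per-session score for one objective: counts (s, n, v) of supporting,
    neutral, and violating tutor turns."""
    c = Counter(tags[objective] for tags in turn_tags)
    return c["supports"], c["neutral"], c["violates"]

def objective_positive(s: int, v: int) -> bool:
    """Session-level criterion from the protocol: s >= 2v and s >= 1."""
    return s >= 2 * v and s >= 1
```

The per-arm comparison is then a count, per objective, of objective-positive sessions in each arm.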
**Inter-rater reliability protocol**
Two independent coders (Henry and Jeff) tag a 20% random sample of transcripts in parallel. Target: Cohen's $\kappa \geq 0.7$ on each objective row. If $\kappa$ is below threshold on any row, the row's operationalization is revised before the full coding pass — this is a failure of the rubric, not of the coders, and the revised rubric is the paper's contribution.
The coding instrument itself is a flat spreadsheet with one row per tutor turn and seven columns: `transcript_id`, `turn_id`, `obj_1_tag`, `obj_2_tag`, `obj_3_tag`, `obj_4_tag`, `obj_5_tag`, where each `obj_*_tag` is exactly one of `supports`, `neutral`, or `violates`.
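The κ check itself needs no external tooling: two-rater Cohen's κ on one objective row can be computed straight from the paired tag columns. A minimal sketch of the standard formula (variable names are illustrative):

```python
from collections import Counter

def cohens_kappa(tags_a: list[str], tags_b: list[str]) -> float:
    """Two-rater Cohen's kappa over matched tag lists, e.g. one objective's tag
    column from each coder's spreadsheet, aligned by (transcript, turn)."""
    assert tags_a and len(tags_a) == len(tags_b)
    n = len(tags_a)
    p_obs = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: product of each label's marginal frequencies.
    p_exp = sum(freq_a[l] * freq_b[l] for l in set(tags_a) | set(tags_b)) / (n * n)
    if p_exp == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_obs - p_exp) / (1 - p_exp)
```

If a dependency is acceptable, scikit-learn's `cohen_kappa_score` computes the same statistic.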
§4 What this rubric does not do
Honest limitations to flag in the Paper 03 discussion section:
- It does not measure student learning. The rubric scores the tutor's behavior, not the student's. A separate pre/post content instrument is needed if we want claims about whether students learned more in the scoped arm.
- It assumes Anderson's five objectives are the right five. If a future version of the framework adds a sixth — for example, explicit positionality — the rubric needs a sixth row. The matrix is shaped by the framework, not intrinsic to the pedagogical question.
- It is language-biased. All behaviors in the Supports and Violates cells are described in English and assume a text-only transcript. A multimodal tutor (voice, handwriting) would need additional cells that this version does not have.
- It treats a long-silence tutor turn as Neutral. A tutor that simply fails to respond gets no Violates tag even when a silent tutor is clearly bad. Future revisions should add a "tutor silence in response to a sincere help request" Violates cell to Obj 4.
§5 Questions for the Monday mentor call
- Does Obj 1 (Humanize the discipline) feel right as I have phrased it, or is there closer language from the 2024 paper I should be quoting directly?
- Is the objective-positive session criterion ($s \geq 2v$ and $s \geq 1$) reasonable, or does it let too many mediocre tutor turns count as "supporting"?
- For the pilot, should the baseline arm be a plain default frontier LLM with no system prompt, or a frontier LLM with a "be a helpful math tutor" system prompt? The former is a stronger contrast; the latter is closer to what students actually use.
- Is there an existing coding rubric in the math-ed literature I should be reading before shipping v1 of this one, so I don't reinvent something that already exists?