Methods Appendix
Worked research artifacts.
The research pages describe the four questions at the level of construct and hypothesis. This appendix is the layer below: the actual annotation schemas, interview-guide excerpts, reliability formulas, and power calculations that the projects would run. It exists because "I will measure X" and "here is the exact column I will compute, here is the unit, here is the analysis" are very different claims, and because a reader should be able to audit the second kind.
Worked annotation example — Tool 2 rubric on a real syllabus paragraph
The paragraph below is a lightly fictionalized composite of text I have seen in multiple CC intro-CS syllabi (the real ones are not public). The worked annotation shows exactly how a trained human annotator would score it against the four dimensions, alongside what the Tool-2 rule-based scorer currently returns. This is the artifact Q4's annotation study would produce 120 times.
"CIS 22A is a demanding introduction to programming. Students will cover variables, control flow, functions, recursion, arrays, and file I/O. The course moves quickly; students are expected to keep up with readings and come to class prepared. This course is not for everyone — expect to work harder than in any prior course. Help is available upon request. Grades will be posted after each exam. No late work."
Motivational framing
Human: 18 / 100 · Tool 2: 24 / 100
Why low. The text leads with "demanding" and "rigorous-adjacent" threat language, then immediately lists topics in "cover"-framing. No real problem or student experience is invoked before the technical content is named. Rules fired: rigor-as-threat (−8), coverage-framing (−4), definition-first (−6). No positive rules fired. Human score penalizes "this course is not for everyone" an additional step beyond what the rule captures — that phrase functions simultaneously as motivational and belonging debt.
Scaffolding visibility
Human: 22 / 100 · Tool 2: 46 / 100
Why the gap. "Help is available upon request" matches the help-vague rule (−4), but the Tool-2 rubric does not currently penalize the absence of any named office hours, named tutoring center, or named study group. A human annotator reads the silence; the current rule set does not. This is a known limitation and exactly the kind of rule gap the Q4 study is designed to surface. Candidate rule extension: a named-support-absent rule that penalizes any syllabus with zero matches to the office-hours, tutoring-center, or peer-mentor regexes.
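One way the candidate named-support-absent rule could be sketched. Everything here is illustrative — the regexes, the −6 weight, and the function name are stand-ins, not the actual Tool-2 rule set:

```python
import re

# Hypothetical patterns for "a named support resource exists" — illustrative
# only; the real rule set would be calibrated against the annotation data.
NAMED_SUPPORT_PATTERNS = [
    r"office\s+hours?",
    r"tutoring\s+cent(?:er|re)",
    r"peer\s+mentors?",
    r"study\s+groups?",
]

def named_support_absent(syllabus_text, penalty=-6):
    """Fire a baseline penalty when NO named support resource appears.

    Unlike the existing rules, this penalizes a missing good signal
    rather than a present bad one.
    """
    text = syllabus_text.lower()
    if any(re.search(p, text) for p in NAMED_SUPPORT_PATTERNS):
        return 0  # at least one named resource found; rule does not fire
    return penalty

# The worked-example paragraph names no resource, so the rule fires:
example = "This course is not for everyone. Help is available upon request."
print(named_support_absent(example))  # -6
```

On the composite syllabus above, "Help is available upon request" names nothing, so the rule fires; a syllabus that mentions office hours or a tutoring center by name would score 0 on this rule.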
Verification structure
Human: 12 / 100 · Tool 2: 34 / 100
Why the gap. "Grades will be posted" fires the grade-opaque rule (−3), and "No late work" fires the no-feedback-loop rule (−5). But the text contains no positive verification signals at all — no test cases, no worked examples, no self-check — and the current scorer does not penalize the complete absence of them, only the negative signals. Same failure mode as scaffolding: the rubric currently measures bad signals but not missing-good-signals. Candidate rule extension: a no-verification-mentioned baseline penalty.
Belonging signals
Human: 8 / 100 · Tool 2: 36 / 100
Why low. "Not for everyone" fires the gatekeeping rule (−8). "Students are expected" fires the impersonal register rule (−4). The human annotator also weights the cumulative effect of multiple negative cues at zero positive signal — a pattern the current rule set treats additively rather than multiplicatively. The tool score is less negative than the human score because the rules have a neutral midpoint and the additive structure doesn't capture the gestalt. This is a measurement modeling question, not a rule question.
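The additive-vs-gestalt point can be made concrete with a toy contrast. Both functions, the base score of 50, and the 0.6 shrink factor are purely illustrative — this is a sketch of the modeling question, not the Tool-2 scorer:

```python
def additive(base, penalties):
    """Current structure: penalties simply sum against a neutral midpoint."""
    return max(0, base + sum(penalties))

def compounding(base, penalties, factor=0.6):
    """Illustrative alternative: each extra negative cue at zero positive
    signal shrinks what remains before subtracting, so cues compound."""
    score = base
    for p in penalties:
        score = score * factor + p
    return max(0, score)

cues = [-8, -4]  # gatekeeping, impersonal register (weights from the rules above)
print(additive(50, cues))     # 38
print(compounding(50, cues))  # 9.2
```

With the same two rule weights, the additive form lands near the tool's mid-30s score, while the compounding form lands near the human's single-digit score — which is the shape of the modeling change the paragraph above is pointing at.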
The gap between the human scores and the rule-based scores on this paragraph is the quantity Q4 is trying to minimize on a real corpus. The three failure modes visible above — missing-signal rules, cumulative negative weighting, and implicit register — are each tractable to address in the next iteration of the rule set, and each would be iterated blind to the annotation data to avoid overfitting. The methods discipline this page exists to publicize is that those iterations happen on a held-out split, not on the test set.
P3 interview guide — excerpt
Full guide is ~40 min with warm-up, four content arcs, and a closing round-back. What follows is the content-arc excerpt that maps directly onto Seymour & Hunter's departure-reason taxonomy, rewritten to elicit narrative rather than category endorsement. The discipline is: never name a category in the question, only invite the participant to describe events and relationships in their own terms, and let the coding happen offline on the transcript.
Arc 2 — the course experience (≈12 minutes)
- Walk me through your first week in the course. Probe: what did you expect; what surprised you; what did the instructor say on the first day that stuck with you.
- Tell me about a time you got stuck on an assignment. Probe: what you tried; what you did next; whether you reached out to anyone; what it felt like to be stuck.
- If you did reach out for help — what made that possible? Probe: who was named in the syllabus as a help source; who did you actually go to; did anything about the way help was described make you more or less likely to use it.
- If you didn't reach out for help — can you describe what was happening around that decision? Probe: time; confidence; messages you'd received about what "good students" do; anything the instructor had said about how hard the course should feel.
- Was there a moment you thought "this isn't going to work"? Probe: what was happening that week; what you were telling yourself; who you talked to about it, if anyone.
- What would have had to be different for you to still be enrolled? Probe: scheduling; specific course-design elements; relationships; money; caregiving; advising; transfer-path friction.
Question 6 is deliberately open-ended because it is the question S&H's instrument could not ask directly and because the CC-specific extensions — caregiving, stop-out, transfer-path fracture, advising-gap — emerge in the answers to it far more often than in direct questions about them. The coding rubric for this arc is a 14-code scheme: 7 S&H-derived codes and 7 CC-specific candidate codes, double-coded by two trained coders, with disagreements adjudicated in consensus meetings. Codes and definitions are pre-registered before the interviews are transcribed.
Cohen's κ — a worked calculation
Every reliability claim on the research page is a κ claim. Below is a worked calculation on synthetic data to show exactly what the number means and what would have to happen in the Q4 annotation study for the claim κ ≥ 0.65 to be true.
Suppose two annotators each score 60 syllabi on the motivational framing dimension using a 3-level collapsed rubric — healthy, mixed, debt. The confusion matrix looks like this:
| | B: healthy | B: mixed | B: debt | Row sum |
|---|---|---|---|---|
| A: healthy | 14 | 3 | 1 | 18 |
| A: mixed | 4 | 16 | 2 | 22 |
| A: debt | 0 | 4 | 16 | 20 |
| Column sum | 18 | 23 | 19 | 60 |
Observed agreement is the diagonal over the total: po = (14 + 16 + 16) / 60 = 46 / 60 = 0.767. Expected agreement by chance is the sum of the product of the marginals for each category: for healthy, (18/60) × (18/60) = 0.090; for mixed, (22/60) × (23/60) = 0.140; for debt, (20/60) × (19/60) = 0.106. So pe = 0.090 + 0.140 + 0.106 = 0.336.
Cohen's κ is (po − pe) / (1 − pe) = (0.767 − 0.336) / (1 − 0.336) = 0.431 / 0.664 ≈ 0.649. That is just below the 0.65 threshold — which is exactly what the number will do in a realistic first-round pass on a hard construct like motivational framing, and exactly why the annotation study is set up with a calibration round, a disagreement-resolution protocol, and a second coded batch before the reliability number is reported.
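The same arithmetic, as a few lines of Python that recompute κ directly from the confusion matrix above (rows are annotator A, columns annotator B):

```python
# Confusion matrix from the worked example: healthy / mixed / debt.
matrix = [
    [14, 3, 1],   # A: healthy
    [4, 16, 2],   # A: mixed
    [0, 4, 16],   # A: debt
]
n = sum(sum(row) for row in matrix)                        # 60 syllabi

# Observed agreement: the diagonal over the total.
po = sum(matrix[i][i] for i in range(3)) / n

# Chance agreement: sum over categories of the product of the marginals.
row_sums = [sum(r) for r in matrix]
col_sums = [sum(matrix[i][j] for i in range(3)) for j in range(3)]
pe = sum(row_sums[k] * col_sums[k] for k in range(3)) / n**2

kappa = (po - pe) / (1 - pe)
print(round(kappa, 3))  # 0.649
```

Note that carrying full precision (pe = 1210/3600 ≈ 0.3361 rather than the rounded 0.336) still lands at κ ≈ 0.649, just below the threshold.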
The specific discipline: if Round 1 comes back with κ = 0.649 on motivational framing, I do not adjust the rubric to match the pattern of disagreements in Round 1 (that inflates κ artificially). I look at the cells where disagreement clustered (in this table: A-mixed vs. B-healthy = 4 cases), pull those four syllabi, and adjudicate them in a consensus meeting with written rationale. Then I run Round 2 on a new held-out batch of syllabi using the adjudicated rubric, and that Round 2 κ is the number I report. If Round 2 stays below 0.65, the dimension is unreliable and I say so in print.
Power analysis for Q1 — how big does the sample need to be?
Q1's hypothesis is that course-level pedagogical debt score predicts course-level help-seeking suppression rate with β ≥ 0.2 (standardized). The analysis is a multilevel regression with students nested in course-sections. To detect a standardized slope of 0.2 at α = 0.05 with power 0.8, ignoring clustering, the naïve calculation is:
n ≈ ((zα/2 + zβ) / β)² × (1 − β²) ≈ ((1.96 + 0.84) / 0.2)² × 0.96 ≈ 188 course-section observations at the section level.
With clustering, the design effect is 1 + (n̄ − 1)ρ, where n̄ is mean students per section and ρ is the intraclass correlation; with mean section size 30 and ρ = 0.1, DE ≈ 3.9. But the design effect inflates the required number of student-level observations, and for section-level predictors in a multilevel model the relevant sample is the number of sections, not the number of students. So the naïve calculation already demands ≈ 188 sections, which is plainly infeasible at a single community college. Two responses:
- Relax the effect size. The β ≥ 0.2 threshold is a defensible floor for a "worth reporting" structural effect but is tighter than the hypothesis strictly needs. Re-running at β ≥ 0.35 (a more realistic first-pass expectation for an observational behavioral index) requires ≈ 60 sections, which is tractable across two CC districts over two academic years.
- Shift unit of analysis down. The alternative is to treat the unit of analysis as the individual student, with course-section as a random effect and the pedagogical debt score as a section-level covariate. In that formulation, the ≈ 1200-student target in the H1 statement provides sufficient power for the within-section variance components that carry most of the information.
The pre-registration will commit to the second framing and report the number of sections actually recruited, so that a reviewer can see the design effect and check whether the effective sample is honest. The power analysis script will be committed to the repository before data collection begins.
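A minimal sketch of what that script might contain. The formula is the naïve one above; the thresholds (α = 0.05 two-sided, power 0.8) and the design-effect parameters (n̄ = 30, ρ = 0.1) are the ones quoted in this section:

```python
Z_ALPHA, Z_POWER = 1.96, 0.84   # two-sided alpha = 0.05, power = 0.80

def naive_sections(beta):
    """Naive section count for a standardized slope beta, ignoring clustering."""
    return ((Z_ALPHA + Z_POWER) / beta) ** 2 * (1 - beta ** 2)

def design_effect(mean_cluster_size, icc):
    """Kish design effect: 1 + (n_bar - 1) * rho."""
    return 1 + (mean_cluster_size - 1) * icc

print(round(naive_sections(0.2)))        # 188
print(round(naive_sections(0.35)))       # 56
print(round(design_effect(30, 0.1), 1))  # 3.9
```

The β = 0.35 figure comes out at ≈ 56 sections, consistent with the roughly 60 quoted above once recruitment attrition is padded in.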
Threats to validity, one project at a time
P1 — help-seeking suppression from LMS logs
- Construct validity. "Stuck" is defined behaviorally (long idle stretch within an edit-fail cycle), but a student could be thinking productively, taking a break, or context-switching. Mitigation: validate the behavioral index against a qualitative sub-sample of 20 students who record weekly self-reports of when they felt stuck.
- Internal validity. Course-level pedagogical debt and help-seeking suppression are both downstream of instructor choice; the relationship could be confounded by instructor-level variables that are not in the model. Mitigation: instructor fixed effects where sample allows; failing that, explicit discussion in the limitations.
- External validity. Community colleges in California with LMS data sharing agreements are not a random sample of all CCs. Mitigation: report explicitly and avoid generalizing past the sample frame.
P2 — pedagogical debt rule-scoring against expert annotation
- Construct validity. The four dimensions are defensible but not exhaustive. A dimension we're missing could be carrying signal that the current rubric attributes elsewhere. Mitigation: invite annotators to flag a "something important that doesn't fit" category and analyze those flags as qualitative data.
- Measurement validity. The rules are keyword-based and therefore brittle to paraphrase. Mitigation: report rule-level recall on a held-out paraphrase test set alongside the primary κ.
- Circularity. The rules were designed by the same person who wrote this rubric. Mitigation: annotator training happens on a written rubric that does not reference the rules, and rule refinement happens blind to the held-out annotation data.
P3 — interview replication of S&H at community colleges
- Selection bias. Students willing to be interviewed about leaving a CS course are probably not the same as students who quietly disappear. Mitigation: combine outreach through counseling services with direct recruitment from the course roster; report the ratio of recruited to contacted and the demographics of each.
- Interviewer effects. I am an employee of the Foothill–De Anza district (CVC-OEI Application Support) and my research mentor holds a faculty position at one of the two study sites. Either relationship could bias participant answers. Mitigation: explicit disclosure of the employment relationship during consent; where possible, recruitment and interviewing of participants at the second site (De Anza) handled by a co-investigator blind to my employment; double-coding of transcripts; flagging any transcripts with signs of social-desirability bias.
- Category imposition. Coding against the S&H taxonomy risks forcing CC experiences into a 4YI-derived frame. Mitigation: double-code against the open CC-extension codebook; if the extensions carry ≥ 20% of the coded episodes, the published analysis uses the extended frame, not the original.
P4 — MVC-distance as predictor of DFW
- Annotation reliability. If instructors do not agree on what the learning objectives of a course are, or on which module teaches which objective, the graph is not measurable at all. This is the project's biggest risk and is the reason the κ ≥ 0.65 threshold is load-bearing in H2.
- Spurious correlation. Curricula that are farther from greedy-MVC may be farther because they are larger, older, or serve more transfer requirements — all of which also predict DFW for independent reasons. Mitigation: partial correlation with enrollment, program age, and transfer-requirement count as covariates.
- Greedy as baseline. Using a specific approximation algorithm as the reference point embeds an assumption that the "ideal" curriculum is the greedy one, which is not obvious. Mitigation: report the metric against both greedy and exhaustive-optimum on the annotated graphs, and discuss any divergence.
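On the small annotated graphs this mitigation envisions, the two reference points can be computed directly. A sketch, with a toy graph (arbitrary vertex names) on which the max-degree greedy cover and the exhaustive optimum actually diverge — illustrating why reporting both matters:

```python
from itertools import combinations

def greedy_vertex_cover(edges):
    """Max-degree greedy: repeatedly take the vertex covering the most
    remaining edges."""
    remaining = {frozenset(e) for e in edges}
    cover = set()
    while remaining:
        degree = {}
        for e in remaining:
            for v in e:
                degree[v] = degree.get(v, 0) + 1
        pick = max(degree, key=degree.get)
        cover.add(pick)
        remaining = {e for e in remaining if pick not in e}
    return cover

def optimal_vertex_cover(edges, vertices):
    """Exhaustive minimum cover; only feasible on small graphs."""
    for k in range(len(vertices) + 1):
        for subset in combinations(sorted(vertices), k):
            s = set(subset)
            if all(u in s or w in s for u, w in edges):
                return s
    return set(vertices)

# Toy graph: x is a high-degree hub, but {a, b, c} is the smaller cover.
E = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y"), ("b", "z"), ("c", "w")]
V = {"a", "b", "c", "x", "y", "z", "w"}
print(len(greedy_vertex_cover(E)))      # 4 (greedy grabs the hub x first)
print(len(optimal_vertex_cover(E, V)))  # 3 ({a, b, c})
```

Greedy takes the hub and then needs three more picks; the optimum never takes the hub at all. A distance metric anchored to greedy alone would treat the 3-vertex curriculum as "farther from ideal" than it is.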
Pre-registration discipline
Each of the four projects will be pre-registered on OSF before data collection begins. Pre-registration commits, in writing and with a public timestamp, to (a) the hypothesis, (b) the sample frame, (c) the analysis plan, (d) the exclusion rules, and (e) the decision rules that determine when a result counts as supporting or refuting the hypothesis. Any deviation from the pre-registered plan is reported in the paper as a deviation, labeled as exploratory rather than confirmatory, and justified.
Pre-registration is not a methods fetish. It is the specific discipline that lets a reader distinguish between "the hypothesis was confirmed" and "the data were analyzed until something significant emerged." In a research program whose substantive claims are about the structural causes of student departure — claims that have an obvious interpretation for policy and an obvious way to be rhetorically useful whether or not they are true — the weight of that distinction is exactly as heavy as the policy implication.
Last updated: April 2026. This page is active — worked examples will grow as projects move from design into fieldwork.