The standard · v1.0

The rules we grade by.

Course staff already grade with AI, mostly by pasting student work into a chatbot. We grade with AI as our product, so we wrote down the rules we hold ourselves to, and we publish them so that we can be judged against them. Each commitment below names something you can check, and where published research or professional standards set the bar, we cite them.

Why we grade this way at all is laid out in our grading position.

Who decides

A person approves every grade.

Nothing posts to a student or a gradebook until someone with authority over the course has reviewed and approved it. There is no fully automatic path from submission to posted grade. The U.S. Department of Education’s report on AI in education calls keeping humans in the loop a top policy priority, and UNESCO’s guidance says the same about high-stakes decisions; we built the requirement in rather than treating it as a setting.

Grounding: U.S. Dept. of Education (2023) · UNESCO (2023)
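
One way to picture the approval requirement is as a state machine with no shortcut to the posted state. The sketch below is illustrative only: the class and method names (Grade, approve, post) are hypothetical and are not taken from our production system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "draft"        # proposed score, visible to course staff only
    APPROVED = "approved"  # reviewed by someone with authority over the course
    POSTED = "posted"      # released to the student or the gradebook


@dataclass
class Grade:
    """Hypothetical grade record used only for this illustration."""
    submission_id: str
    score: float
    status: Status = Status.DRAFT
    approved_by: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # Approval must be attributable to a named person.
        if not reviewer:
            raise ValueError("approval needs a named reviewer")
        self.approved_by = reviewer
        self.status = Status.APPROVED

    def post(self) -> None:
        # The only route to POSTED runs through APPROVED.
        if self.status is not Status.APPROVED:
            raise PermissionError("grade has not been approved by a person")
        self.status = Status.POSTED
```

In this sketch, calling post() on a draft raises an error instead of posting, so skipping review is a failure and not a configuration choice.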

Judgment calls stop the machine.

Some decisions belong to course staff: a grade swing large enough to change a student’s standing, a rubric reading the system has low confidence in, a course-policy question, an answer key that disagrees with the problem. These classes of decision are defined before grading starts, and when one comes up the system stops and asks instead of guessing. The EU classifies systems that evaluate learning outcomes as high-risk and requires that people can override them; we are not bound by that law in the US, but we meet its oversight bar voluntarily.

Grounding: EU AI Act, Annex III 3(b) · Article 14
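
The stop conditions named in this commitment can be sketched as a check that runs before any score is proposed for approval. Everything in this example is an assumption made for illustration: the field names, the two thresholds, and the use of a fixed point swing as a stand-in for "large enough to change a student's standing."

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, fixed before grading starts.
MAX_SWING = 10.0        # points; a proxy for a standing-changing swing
MIN_CONFIDENCE = 0.8    # minimum confidence in the rubric reading


@dataclass
class ProposedScore:
    submission_id: str
    score: float                 # newly proposed points
    previous_score: float        # points before this change
    confidence: float            # confidence in the rubric reading, 0..1
    policy_question: bool = False
    key_conflict: bool = False   # answer key disagrees with the problem


def reasons_to_stop(p: ProposedScore) -> List[str]:
    """Return every reason this score must go to course staff."""
    reasons = []
    if abs(p.score - p.previous_score) >= MAX_SWING:
        reasons.append("large grade swing")
    if p.confidence < MIN_CONFIDENCE:
        reasons.append("low-confidence rubric reading")
    if p.policy_question:
        reasons.append("course-policy question")
    if p.key_conflict:
        reasons.append("answer key disagrees with problem")
    return reasons
```

An empty list means grading may continue; any non-empty list means the run pauses and the listed reasons go to a person.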

A zero is never automatic.

Before any zero or “missing work” is recorded, a person confirms it against the original submission. A blank page in a file export is not a blank page in the student’s work until a human has looked.

How rubrics are made

Rubrics come from the course, and they grade understanding.

Criteria are built from the course’s own materials (the lectures, the readings, the problem as assigned) and tagged to the concepts being tested. They describe the quality of the reasoning a student showed, and they never count surface features or match answer text. This is the oldest documented failure in grading: in the 1913 study that founded the field, math teachers grading one geometry solution disagreed by more than 60 points because they mixed up method, answer, and neatness without deciding which they were grading.

Grounding: Brookhart (2018) · Starch & Elliott (1913)

Every rubric is reviewed before it grades anyone.

A rubric is an instrument, and instruments can be invalid. Each one is read adversarially before first use, from the student’s seat, the grader’s seat, and the instructor’s seat, then revised until it holds. The research is specific about what works: rubrics make scoring reliable when they are analytic, task-specific, and paired with exemplars and review; a rubric’s existence alone guarantees nothing.

Grounding: Jonsson & Svingby (2007) · Reddy & Andrade (2010)

The same work gets the same grade.

In 1912, 142 teachers gave the same English paper everything from a failing mark to an A; the geometry version of the study a year later varied more, not less. A century on, graders sharing a rubric still disagree with each other, and with their own earlier grades on a duplicated submission, and scores measurably drift with fatigue and with where a paper sits in the stack. So rubrics are locked before grading starts; if a tier changes mid-run, every submission already graded is re-graded under the final rubric; and a regrade decision applies to every student who made the same choice, not just the one who asked.

Grounding: Starch & Elliott (1912) · Brimi (2011) · Messer et al. (2025)
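
The re-grade rule can be sketched as follows: when the rubric version changes mid-run, every stored submission is scored again under the new version. The GradingRun class and the toy scoring function in the test are hypothetical; the sketch shows only the bookkeeping, not how a score is produced.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class GradingRun:
    """Hypothetical grading run that remembers which rubric graded what."""
    rubric_version: int
    grade_fn: Callable[[str, int], float]  # (submission text, rubric version) -> score
    work: Dict[str, str] = field(default_factory=dict)
    scores: Dict[str, float] = field(default_factory=dict)
    graded_under: Dict[str, int] = field(default_factory=dict)

    def grade(self, student: str, text: str) -> None:
        # Store the work so it can be re-graded later if the rubric changes.
        self.work[student] = text
        self.scores[student] = self.grade_fn(text, self.rubric_version)
        self.graded_under[student] = self.rubric_version

    def change_rubric(self, new_version: int) -> None:
        # A mid-run change re-grades everything already graded,
        # so no score survives from a superseded rubric.
        self.rubric_version = new_version
        for student, text in list(self.work.items()):
            self.grade(student, text)
```

After change_rubric returns, graded_under holds a single version for every student, which is the property the commitment describes.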

What students get

Feedback that teaches, not a bare score.

Every student gets specific, per-question feedback tied to the rubric: what was right, what was missing, where to go next. The research bar is higher than “feedback is good.” In the largest meta-analyses, feedback helps on average while a substantial minority of feedback interventions reduce performance, and what separates the two is information content and task focus. Feedback that carries the what, the how, and the next step shows roughly twice the effect of simple right/wrong marking.

Grounding: Hattie & Timperley (2007) · Kluger & DeNisi (1996) · Wisniewski et al. (2020) · Shute (2008)

Integrity signals are evidence, never verdicts.

AI-text detection is unreliable enough that its own builders have withdrawn classifiers over accuracy, and peer-reviewed testing found detectors flagging 61% of essays by non-native English speakers as machine-written. So at Treemarks, similarity and authorship signals route to course staff as evidence (the work, the signal, the context), and no penalty is ever applied automatically. Students are graded on the merit of what they submitted while a person decides what the signal means.

Grounding: Liang et al. (2023) · OpenAI (2023)

Student work stays private and never trains models.

Identifying details are removed before any cloud model sees a submission. Work is used to grade the course it belongs to and for nothing else: never training data, never shared, retained at the institution’s direction. We operate within FERPA’s school-official conditions: the institution’s direct control, authorized purposes only, no re-disclosure.

Grounding: U.S. Dept. of Education PTAC (2014) · How we handle data →

How we check ourselves

Every grade carries a receipt.

Each score is recorded with its rubric tier, its written reason, and who assigned and approved it, at the moment it happens. A regrade request, an instructor’s audit, or an accreditation review months later reads the same record. Professional testing standards have required documented, traceable scoring for years; we hold coursework grading to the same bar.

Grounding: AERA/APA/NCME Standards (2014)
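
A receipt of this kind can be sketched as an immutable record written at the moment of scoring. The field names and the record_score helper are assumptions for illustration, and a real system would persist the log and not keep it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)  # frozen: a receipt cannot be edited after it is written
class ScoreReceipt:
    submission_id: str
    question: str
    rubric_tier: str
    reason: str
    assigned_by: str
    approved_by: str
    recorded_at: str  # UTC timestamp in ISO 8601 format


def record_score(log: List[ScoreReceipt], **fields: str) -> ScoreReceipt:
    """Stamp the receipt with the current time and append it to the log."""
    receipt = ScoreReceipt(
        recorded_at=datetime.now(timezone.utc).isoformat(), **fields
    )
    log.append(receipt)
    return receipt
```

Because the record is frozen, a later correction is a new receipt appended to the log, so a regrade request or an audit sees both the original and the change.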

Autonomy is measured before it is trusted.

Before the system grades any new kind of work with less supervision, its agreement with human grading is measured on that work, using the educational-measurement field’s own framework: machine scores benchmarked against how well trained human graders agree with each other, monitored over time, checked across student groups. Autonomy expands where agreement holds and contracts where it doesn’t.

Grounding: Williamson, Xi & Breyer (2012)
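
As a rough illustration of benchmarking machine scores against human agreement, the sketch below computes quadratic weighted kappa, a common agreement statistic for ordinal scores, and compares machine-human agreement with human-human agreement. The choice of statistic here is an assumption, and the floor and max_gap values are placeholders a course would set for itself; neither is a description of our production checks.

```python
from typing import Sequence


def quadratic_weighted_kappa(a: Sequence[int], b: Sequence[int], n_levels: int) -> float:
    """Agreement between two raters on integer scores 0..n_levels-1."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must score the same non-empty set of work")
    if n_levels < 2:
        raise ValueError("need at least two score levels")
    n = len(a)
    # Count how often rater A gave i while rater B gave j.
    observed = [[0.0] * n_levels for _ in range(n_levels)]
    for x, y in zip(a, b):
        observed[x][y] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    num = den = 0.0
    for i in range(n_levels):
        for j in range(n_levels):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            num += weight * observed[i][j]                # observed disagreement
            den += weight * hist_a[i] * hist_b[j] / n     # disagreement expected by chance
    if den == 0:
        return 1.0  # both raters used one identical score level throughout
    return 1.0 - num / den


def may_expand_autonomy(machine_human: float, human_human: float,
                        floor: float = 0.70, max_gap: float = 0.10) -> bool:
    # Placeholder rule: agreement must clear a floor and must not trail
    # human-human agreement by more than max_gap.
    return machine_human >= floor and (human_human - machine_human) <= max_gap
```

The same comparison would be repeated over time and within each student group, since a system can agree well overall and still agree poorly for one group.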

This document

This is version 1.0 of a living document, modeled on the practice of publishing assessment standards and revising them in the open. When a commitment changes, the version number changes and the previous text stays available. If we are falling short of this standard, or the standard falls short of where it should be, tell us: hello@treemarks.ai.

Hold us to it.

A pilot on one course is the fastest way to see every commitment in practice.
