For teachers

Grading that makes you a better teacher.

Treemarks shows you where the class is actually struggling and keeps student work protected. Leaf carries the repetitive grading, writes real feedback, and hands you a map of what to reteach. You keep the final say on every grade.

Start a pilot · Argue with us

Where the class needs help

The feedback you almost never get.

Grading is where you are supposed to sense the class. With eight to ten problems on a weekly set and a shrinking TA budget, that sense rarely survives the pile. Whatever the graders noticed dies in their heads, never aggregated, and most students get back a bare score.

Treemarks inverts that. Every grade traces through the knowledge graph to the concept it tests and the module that taught it. In aggregate, that is a map of exactly where your class needs help, scored per concept, so you can reteach what matters and sharpen the course next term. It is produced automatically, as a byproduct of grading.

Concept-mastery report: eight course concepts ranked by class mastery out of 100%, from 70% to 93%. Class of 40, anonymized.
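For readers who want to see the shape of that aggregation, here is a minimal sketch, not the Treemarks implementation. It assumes each graded question maps to exactly one concept. The function name `concept_mastery`, the field names, and the sample concepts are all hypothetical.

```python
from collections import defaultdict


def concept_mastery(grades, question_concepts):
    """Return {concept: percent of available points earned}, weakest first.

    Assumes every question maps to one concept and max_points > 0.
    """
    earned = defaultdict(float)
    possible = defaultdict(float)
    for g in grades:
        concept = question_concepts[g["question"]]
        earned[concept] += g["points"]
        possible[concept] += g["max_points"]
    report = {c: round(100 * earned[c] / possible[c], 1) for c in possible}
    # Sort ascending so the concepts that most need reteaching come first.
    return dict(sorted(report.items(), key=lambda kv: kv[1]))


# Hypothetical, anonymized sample data.
grades = [
    {"student": "Student 1", "question": "Q1", "points": 8, "max_points": 10},
    {"student": "Student 2", "question": "Q1", "points": 6, "max_points": 10},
    {"student": "Student 1", "question": "Q2", "points": 9, "max_points": 10},
    {"student": "Student 2", "question": "Q2", "points": 10, "max_points": 10},
]
question_concepts = {"Q1": "free-body diagrams", "Q2": "unit conversion"}

report = concept_mastery(grades, question_concepts)
print(report)
```

The point of the sketch is that the report needs no extra work from graders: it is a regrouping of scores that already exist.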

The recovered hours go where teaching actually happens: office hours and one-on-ones with the students who need them. The sensing data already exists. It is just locked in a few tired graders’ heads, and we put it on your desk.

Privacy & data

Student work you can defend to your privacy office.

Most AI grading today happens by pasting whole submissions, names and all, into a public chatbot. That is FERPA-protected data, exposed, with no audit trail. Treemarks exists partly to end that.

Anonymized before grading

A local model strips names, emails, and IDs from student work before any cloud model sees it. The grader reads “Student 14,” never a name.

The chat never carries the records

The channel carries conversation and aggregates. Named grades, rosters, and submissions stay in an access-controlled, auditable record behind the link.

Never trains anyone else’s model

Course work is used to grade that course and nothing else. It is never training data, never sold, never shared, and retained at your institution’s direction.

Built for FERPA, and for your IT review

We operate within FERPA’s school-official conditions, and we are glad to complete your security questionnaire and sign your data-processing agreement.

Working with your IT or privacy office? We will walk through the data flow, hosting model, and sub-processors, and complete your review.

How we handle data →

Who decides

You keep the final say on every grade.

There is no automatic path from submission to posted grade. Leaf finishes the pass, then surfaces the judgment calls and waits. A zero is never recorded until a person confirms it against the original work. And when it is not confident, it asks instead of guessing.

A person approves every grade

Nothing posts to a student or a gradebook until you have reviewed and approved it.

Judgment calls stop the machine

Grade swings large enough to change standing, low-confidence rubric reads, and policy questions route to you.

Every grade carries a receipt

Each score is recorded with its rubric tier, its written reason, and who approved it, at the moment it happens.
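One way to picture those three rules together is the sketch below. It is an illustration under stated assumptions, not Leaf's actual logic: the `ProposedGrade` fields, the `CONFIDENCE_FLOOR` value, and both function names are hypothetical, and it covers only two of the routing triggers (zeros and low-confidence reads).

```python
from dataclasses import dataclass
from datetime import datetime, timezone

CONFIDENCE_FLOOR = 0.8  # hypothetical threshold, chosen for illustration


@dataclass
class ProposedGrade:
    student: str        # anonymized label, e.g. "Student 14"
    score: float
    rubric_tier: str
    reason: str
    confidence: float   # assumed to be a value between 0 and 1


def needs_judgment(grade):
    """List the reasons this grade must stop for a person (empty if none)."""
    reasons = []
    if grade.score == 0:
        reasons.append("zero: confirm against original work")
    if grade.confidence < CONFIDENCE_FLOOR:
        reasons.append("low-confidence rubric read")
    return reasons


def approve(grade, approver):
    """Build the receipt recorded at approval; only receipts would be posted."""
    return {
        "student": grade.student,
        "score": grade.score,
        "rubric_tier": grade.rubric_tier,
        "reason": grade.reason,
        "approved_by": approver,
        "approved_at": datetime.now(timezone.utc).isoformat(),
    }


g = ProposedGrade("Student 14", 0.0, "no credit", "No work shown for part (b).", 0.95)
flags = needs_judgment(g)
receipt = approve(g, approver="instructor")
print(flags)
print(receipt)
```

In this sketch the only path to a postable record runs through `approve`, which requires a named approver, so the receipt and the human decision are created in the same step.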

Argue with us

The hard questions, answered.

Here are the strongest objections to AI grading we know, in the voice a thoughtful skeptic would use. Where the objection is right, we say so first. Then we show what we built. The commitments referenced here are published in The Treemarks Standard.

“The human in the loop is a fiction.”

You say a person approves every grade. Under deadline pressure, “review” becomes a rubber stamp: the machine decides, the human signs, and accountability quietly evaporates.

What’s true: That describes most deployments. A “looks good” button is oversight theater.

What we built: Start from the status quo: a professor whose TAs make hundreds of grading calls a week already supervises on trust and one calibration meeting. Here, supervision is measurable. Before being trusted, the system graded a real engineering midterm blind. Its rubric was built independently and agreed with the course’s own TA team on 94% of sub-parts. In a separate course in another department, the same blind method matched the instructor’s grades closely. It states its uncertainty and stops on the calls that belong to course staff instead of guessing, and every approval is recorded with who and why. That is tighter supervision than the status quo, not looser.

Receipts: Standard §01–02, §10–11

“Your average hides who it fails.”

A grader can be right on average and still be wrong about the same students every time: non-native English speakers, unconventional notation, anyone who doesn’t write like the training data. Aggregate accuracy is the number that conceals it. Peer-reviewed testing found AI-text detectors flagged 61% of essays by non-native speakers as machine-written.

What’s true: Averages can hide exactly that harm, and most AI tools never check.

What we built: This is the first grading system where the check is possible. Every score is recorded with its rubric tier and a written reason, so the archive can be audited and re-scored by student group, by question, by anything. Agreement with human grading is measured before autonomy expands, including across student groups, and detector-style signals reach course staff as evidence, never verdicts. Human grading was never written down, so it cannot be audited for bias at any scale. Ours can. If your research touches equity and measurement, help us define the audit.

Receipts: Liang et al. (2023) · Standard §08, §10–11

“Unmaintained tools silently drift.”

Universities are full of adopted-then-abandoned software. A model update quietly shifts grading behavior mid-quarter and nobody notices.

What’s true: Infrastructure fails in maintenance, not at the ribbon-cutting.

What we built: So recalibration is built into the course, every term. Rubrics are versioned and locked before grading starts, agreement with human grading is re-measured before the system grades a new kind of work, and each assignment is checked against the course’s own history for drift. Maintenance never depends on whoever happens to be TAing this quarter.

Receipts: Standard §06, §11

“Would you grade essays with this?”

Language models reward fluent conformity and punish unconventional voices. You will point it at essays anyway, because the demo looked good.

What’s true: So we don’t. Essays and open-ended reports are listed as not validated in our own published record, and we do not grade them.

What we built: The boundary is verification versus interpretation. Engineering work carries an external anchor (governing equations, boundary conditions, units), so the judgment is how far a method drifted from the physics, with partial credit handled under written rules. Where the only anchor is the reader, the reader should be a person, and in our courses it is.

Receipts: Standard §04

Two more sit closer to other readers: how students experience it is on the student page, and what it means for your TAs is on theirs.

Questions teachers ask

How accurate is it?

You see the reasoning behind every point, and Leaf escalates the calls it is not confident on instead of guessing. Before you rely on it, it grades blind against your own graders and we show you the gap. The goal is not to match a lenient grader. It is the consistent, defensible distribution that hand-grading at scale cannot hold.

The honest case →

What can it grade?

Quantitative engineering work: problem sets, exams, derivations. It reads real reasoning, not just final answers. Essays and open-ended reports we deliberately leave to people.

What if the question itself was confusing or flawed?

Leaf checks for that. When a part trips up much of the class, it weighs whether that is the students or the question (an ambiguous prompt, a wrong key) and flags it to you instead of quietly marking everyone down. A bad question becomes something you get told about, not something students get punished for.
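A rough sketch of the first step, noticing that a part tripped up much of the class, might look like this. It is an assumption-laden illustration, not how Leaf decides: the function name and both thresholds are hypothetical, and it does not attempt the harder step of telling an ambiguous prompt from a wrong key.

```python
def flag_suspect_parts(part_scores, miss_fraction=0.6, low_score=0.5):
    """Return the parts that most of the class scored poorly on.

    part_scores maps a part label to the fraction of credit each student
    earned on it. Both thresholds are hypothetical defaults.
    """
    flagged = []
    for part, scores in part_scores.items():
        if not scores:
            continue  # nothing graded yet for this part
        missed = sum(1 for s in scores if s < low_score)
        if missed / len(scores) >= miss_fraction:
            flagged.append(part)
    return flagged


# Hypothetical sample: four students, two parts.
part_scores = {
    "3a": [0.9, 1.0, 0.8, 0.4],
    "3b": [0.2, 0.1, 0.4, 0.9],
}
suspects = flag_suspect_parts(part_scores)
print(suspects)
```

A flagged part is a prompt for a person to look at the question, not a verdict that the question was wrong.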

Does it give students real feedback?

Yes, and it is the point. Every grade comes with what would have made the answer right, grounded in the student’s own work and the course material, not just which rubric tier they landed in. That corrective feedback is produced as part of grading. It is the part students actually learn from.

Does it fit our existing tools?

That is the design. You reach Leaf in the chat your institution already runs (Slack today, Teams next), it pulls assignments from Canvas, and it writes approved grades back to Canvas or Gradescope. It rides on top of the course stack you have rather than replacing it.

How do we start?

A short pilot on one course. We agree on what success looks like up front; the only thing we ask for is a little of your time.

Argue with us.

Bring your hardest objection to a pilot conversation. If we cannot answer it honestly, you will have saved yourself a pilot.

Start a pilot