AI grading for engineering courses

The teaching assistant’s assistant.

Treemarks is an AI agent that helps you design and grade assessments for engineering courses.

Leaf works in the chat your course staff already uses: hand it a problem set, or let it pull one from Canvas. It grades in your style, writes feedback worth reading, and shows you where the class needs help. You approve before anything posts.

Built by a Stanford teaching assistant, applied in live courses.

Course knowledge graph

What one grade hides

Traced to modules

Inflation, real vs nominal: reteach

Present Worth: review

Depreciation: solid

+ 8 more concepts, each traced to its module

A grade is one number. Treemarks connects scores to class concepts, so you know exactly what to revisit.

Illustrative

What grading costs today

The challenges of grading at scale.

Teachers face a false choice: grade everything by hand and fall behind, or hand it to tools that flatten real engineering work into checkboxes. Treemarks is the third option. It carries the load, and the judgment stays with you.

A dozen TAs, a dozen standards

Grades drift between graders, and fatigue shifts them again over a long stack. The same work can earn a different score depending on who marks it, and when.

Zhao et al. (2017)

Feedback arrives too late to matter

Students wait weeks for a graded problem set, and by the time it comes back the class has moved on. In real classrooms, faster feedback has generally proven better for learning.

Kulik & Kulik (1988)

Front-end AI leaks private data

To save time, graders paste whole submissions into public chatbots. That can expose private, FERPA-protected data, with no audit trail.

U.S. Dept. of Education, PTAC (2014)

Do the math: a 120-student course can mean 600+ submissions a term. At ~8 minutes each, that’s 80+ hours of marking a term, before a single regrade. (Illustrative.)
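The arithmetic above can be checked in a few lines. This is a minimal sketch: the figure of 5 graded submissions per student is an assumption chosen to reach the 600 submissions quoted, and `marking_hours` is a hypothetical helper name, not part of Treemarks.

```python
def marking_hours(submissions: int, minutes_each: float) -> float:
    """Return total marking time in hours for a term."""
    return submissions * minutes_each / 60


students = 120
submissions_per_student = 5  # assumption: 600 submissions / 120 students
submissions = students * submissions_per_student

hours = marking_hours(submissions, minutes_each=8)
print(f"{submissions} submissions, {hours:.0f} hours of marking")
```

Swap in your own enrollment, assignment count, and minutes per submission to estimate the load for a specific course.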

Who it’s for

Built for everyone the gradebook touches.

For teachers

See where your class needs help, safely.

The diagnostic an overloaded TA never writes up, student data that stays protected, and your final say on every grade. Plus the hard questions, answered.

The case for teachers

For students

Is a robot grading me?

A person is accountable for your grade, you get real feedback instead of a number, and you are graded on your reasoning, not your handwriting. Your rights, in plain terms.

What this means for you

For TAs

Are we going to be replaced?

The honest answer, with the numbers, from a teaching assistant who built this. Plus how you would actually use Leaf in a grading week.

The case for TAs

How it works

Meet Leaf. It grades. It checks in. You decide.

What a real week looks like. The conversation is the whole interface, with no new dashboard to learn.

It notices the deadline

PS4 closes, and Leaf already knows. It pulls the submissions from Canvas and starts, or you can just DM it a PDF.

It grades the work

It builds the key and rubric from your materials, grades every submission, and writes each student feedback worth reading.

It DMs you

Leaf reports back, surfaces the calls that need you, and asks how you want them handled. Approve, and grades write back to Canvas or Gradescope.

Private by design. A local model strips names before any cloud call, so the grader sees “Student 14,” never a name. The channel carries the conversation and aggregates; named grades stay in an auditable record behind the link, and student work never trains anyone else’s model.
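To make the name-stripping step concrete, here is a rough sketch of the idea. It is not the Treemarks implementation: the page describes a local model, while this stand-in uses a plain roster lookup, and `build_aliases` and `pseudonymize` are hypothetical names used only for illustration.

```python
import re


def build_aliases(roster):
    """Map each real name to a stable alias such as 'Student 14'."""
    return {name: f"Student {i}" for i, name in enumerate(roster, start=1)}


def pseudonymize(text, aliases):
    """Replace every roster name found in text with its alias."""
    if not aliases:
        return text
    # Try longer names first so a full name wins over a shorter overlap.
    names = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        "|".join(r"\b" + re.escape(name) + r"\b" for name in names),
        re.IGNORECASE,
    )
    lookup = {name.lower(): alias for name, alias in aliases.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


# Hypothetical roster and submission text, for illustration only.
roster = ["Ada Lovelace", "Alan Turing"]
aliases = build_aliases(roster)
redacted = pseudonymize("Submitted by Alan Turing. PW = 1200.", aliases)
print(redacted)  # Submitted by Student 2. PW = 1200.
```

Only the redacted text would leave the machine in this sketch. The alias table stays local, so grades can be matched back to names afterward.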

See a real grading week, from a TA’s seat →

The evidence

We did the testing.

Treemarks was built inside real courses, graded against real instructors, and measured rather than asserted.

A full term, graded end to end: ten problem sets, two midterms, a final. Every grade was approved by a human before it posted.
Measured against your own graders: Before Treemarks is trusted on a course, it grades blind against your own TAs and shows you the gap. It has matched an in-house TA team on the large majority of sub-parts on a real midterm, and closely matched an instructor in a different department. Its grading principles are versioned and selected through blind A/B testing, not chosen by vibes.
Real engineering work, and an honest edge: Handwritten exams, spreadsheets, derivations, graded where physics rather than the reader defines the answer. Where the only anchor is the reader, like argue-a-position essays, we leave it to people, and we say so.
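One simple way to picture the blind comparison against your own graders is a per-sub-part match rate. This is an illustrative sketch only: the page does not say which metric Treemarks uses, and the scores below are made up, not results.

```python
def agreement_rate(ai_scores, ta_scores, tolerance=0.0):
    """Fraction of sub-parts where two graders agree within tolerance."""
    if len(ai_scores) != len(ta_scores):
        raise ValueError("score lists must have the same length")
    if not ai_scores:
        raise ValueError("need at least one sub-part to compare")
    matches = sum(
        abs(ai - ta) <= tolerance for ai, ta in zip(ai_scores, ta_scores)
    )
    return matches / len(ai_scores)


# Hypothetical scores for five sub-parts of one submission.
ai = [4, 3, 5, 2, 4]
ta = [4, 3, 4, 2, 4]

rate = agreement_rate(ai, ta)
# Indices of the sub-parts a human should look at first.
gaps = [i for i, (a, t) in enumerate(zip(ai, ta)) if a != t]
print(rate, gaps)  # 0.8 [2]
```

The list of mismatched sub-parts matters as much as the rate, since those are the calls a human reviews.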

Read the full position →

Figure: histogram of final-exam totals for a real, anonymized class, with scores ranging from 43 to 99.5. What real signal looks like: a genuinely hard, differentiating final. Granular grading reveals what each student demonstrated. It cannot manufacture a spread the exam does not contain.

Why now

Why this is possible now.

Instructors already grade with AI

Most do it by pasting student work, names and all, into public chatbots. The privacy problem isn’t coming; it’s already here.

Enrollments up, TA budgets down

Leaner teams, more submissions, the same deadlines. Something has to give, and today it’s feedback quality.

AI can finally do the work

Models can follow derivations, spreadsheets, and reasoning well enough to grade them, and agents can now carry a whole job, not just speed up one step of it. Grading is exactly that kind of job.

The Treemarks network

Join the Treemarks network.

We take a few courses each term as design partners. Leaf joins the chat your staff already uses; you get a written evidence packet at term’s end and a real say in the roadmap. Every course makes the next one sharper: the method and the evidence compound across the network.

One course, one term
We agree on what success looks like up front
Confidential by default: your course never becomes a logo
You keep the final say on every grade