For TAs

Are we going to be replaced?

It is the first thing every TA asks about a grading tool, and it is the right question. Treemarks was built by a teaching assistant who has graded more than a thousand submissions across Stanford engineering courses, so here is the honest answer, with the numbers behind it.

See how you would use it · Get your course a pilot

The answer is no.

Treemarks doesn’t run without a person. A human in the loop is essential, both to handle the edge cases of rubric design and to take responsibility for the grade that gets posted.

The pressure on TA roles is real, and it predates AI. Treemarks is a tool that lets you do the job faster and more thoroughly.

Graduate fellowships that fund many TAs have been cut sharply since 2025, and the TAs who remain carry more students. Being available to those students matters more than ever.

Why now

More to grade, fewer to grade it.

Over two decades, the degrees U.S. colleges confer have grown far faster than the full-time faculty who teach and grade them. By 2020, each faculty member accounted for about 16% more graduates than in 2000. The work each grader carries has grown.

Line chart, U.S. higher education 2000 to 2020: bachelor's degrees conferred per full-time faculty member rose from about 2.1 to about 2.4, up roughly 16%.

We built Treemarks to equip TAs with the tools to grade better and faster, so they can focus on teaching.

Sources: NCES (degrees and faculty counts) · Nature (2025) on the graduate-fellowship cut.

How you would use it

A real grading week, with Leaf.

Leaf is an AI agent that lives in the chat your course already uses. There is no new app to learn. The conversation is the whole interface.

You hand it the work

DM Leaf a PDF of submissions, or say “pull PS4 from Canvas and grade it in my style.” That message is the whole setup.

It does the first pass

It builds the key and rubric from the course materials, grades every submission with a checkable reason, and writes each student feedback worth reading.

You make the calls

Leaf brings you the few decisions that need a human, shows what each one changes, and waits. You approve, and grades write back to Canvas or Gradescope.

🌿 Leaf · direct message

PS4 is graded: 118 submissions, mean 84. Two calls for you before anything posts.

On Q3, six students used the integral form instead of the tabulated factor. Correct, just not your worked method. Credit it as equivalent?

If you credit it

Now: mean 84

Credited: mean 86 · six move from the 70s to the 80s

Yes, credit equivalent methods.

Done, re-scored all six. Nothing is posted; the set is waiting on your approval.

Illustrative

Your corrections stick

One call, applied to the whole class.

Move one rubric tier and every student re-scores the same way, including the ones who would never have emailed to argue. You are not clicking through 118 submissions. You are making the handful of decisions that actually need your judgment, and Leaf carries them across the cohort consistently.
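To make "one call, applied to the whole class" concrete, here is a minimal sketch of how a single rubric decision could be carried across a cohort. It is not Leaf's implementation: the `Submission` class, the `waive_deduction` helper, the rubric line names, and the scores are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Submission:
    """One student's graded answer (hypothetical structure, for illustration)."""
    student: str
    max_points: int
    # Maps a rubric line to the points lost on it.
    deductions: dict = field(default_factory=dict)

    @property
    def score(self):
        return self.max_points - sum(self.deductions.values())


def waive_deduction(submissions, rubric_line):
    """Remove one rubric deduction from every submission that has it.

    Returns the names of the students whose score changed.
    """
    changed = []
    for sub in submissions:
        if sub.deductions.pop(rubric_line, 0) > 0:
            changed.append(sub.student)
    return changed


# Made-up cohort: two students were docked for using a valid alternative method.
cohort = [
    Submission("student_a", 10),
    Submission("student_b", 10, {"method_differs_from_key": 4}),
    Submission("student_c", 10, {"method_differs_from_key": 4, "sign_error": 1}),
    Submission("student_d", 10, {"sign_error": 1}),
]

# One decision from the TA, applied identically to everyone it affects.
changed = waive_deduction(cohort, "method_differs_from_key")
print(changed)
print([sub.score for sub in cohort])
```

In this sketch the decision touches only the rubric line in question, so an unrelated deduction (the sign error here) stays in place for every student.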

On the work itself

Every student gets feedback they can act on.

Not a bare score. A specific note on the student’s own work, tied to the rubric line, with the fix. It is the feedback you would write for every student, if you had the hours.

Student 14 · Q3, step 4

Q̇ = ṁ·c_p·ΔT = (2.4)(1.0)(40) = +96 kW

🌿 Outlet runs cooler, so ΔT is negative. This is heat rejected: Q̇ = −96 kW. Setup and method are right; −1 for the sign. Rubric: first law, correct signs.

Illustrative example
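The sign check in that note can be sketched in a few lines. This assumes the convention implied by the feedback (heat added to the stream is positive, so ΔT is outlet minus inlet) and assumes units of kg/s, kJ/(kg·K), and K, which the example does not state. The inlet and outlet temperatures are hypothetical, picked only so the magnitude of ΔT is 40.

```python
def heat_rate_kw(m_dot, c_p, t_in, t_out):
    """Return Q = m_dot * c_p * (t_out - t_in).

    A negative result means the stream rejects heat (outlet cooler than inlet).
    """
    return m_dot * c_p * (t_out - t_in)


# m_dot and c_p are the values from the student's line;
# t_in and t_out are assumed values giving a 40-degree drop.
q = heat_rate_kw(m_dot=2.4, c_p=1.0, t_in=80.0, t_out=40.0)
print(f"{q:.1f} kW")
print("heat rejected" if q < 0 else "heat absorbed")
```

Writing ΔT as outlet minus inlet makes the sign fall out of the arithmetic, which is the rubric line the student missed.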

The apprenticeship

“But grading is how TAs learn to teach.”

Automate the entry rung and you deskill the pipeline that produces the next generation of instructors.

What’s true: the apprenticeship is real, and it matters.

But look at what grading as practiced actually is. Grading does not teach TAs; solving does. As practiced, grading is value-hunting under fatigue. What builds teaching judgment is seeing the whole error landscape and working the genuinely hard calls. That is what you do here: reviewing rubrics, deciding the escalated cases, and crediting valid alternative methods, the ones that today get docked when a tired grader pattern-matches against the official solution. You spend your time on the interesting ten percent instead of the routine ninety.

You see the patterns

A per-question map of where the class went wrong, in aggregate, instead of one paper at a time.

You make the judgment calls

Alternative methods, partial credit, the ambiguous prompt. The decisions that teach you to teach.

You get the hours back

The recovered time goes to office hours and one-on-ones, the highest-leverage teaching there is.

Want Leaf on your course?

Pilots run on one course for a term. Start one, or send this to whoever runs the class. You keep the final say on every grade.

Get your course a pilot · The case for teachers