Should AI Grade Student Essays?
New Jersey’s New Testing Plan Raises Serious Questions About Fairness, Graduation, and AI in Education
A student writes an essay on a state exam.
That essay may help determine whether they graduate from high school.
But in New Jersey’s new testing system, part of that evaluation may be done by artificial intelligence.
As someone who studies AI in education and works with students every day, I find myself asking a question that educators across the state should be asking as well.
Should an algorithm help decide whether a student earns a diploma?
New Jersey’s new testing system includes adaptive assessments that adjust to students in real time. That part makes sense.
But when artificial intelligence moves from helping deliver a test to judging student writing, the stakes become much higher.
The Part That Actually Makes Sense
New Jersey is introducing adaptive statewide assessments beginning in 2026.
Adaptive testing can be beneficial. Instead of every student taking the exact same set of questions, these digital tests adjust difficulty in real time: students who answer correctly receive more challenging questions, while students who struggle receive questions that better match their level.
This can create a more accurate picture of student learning.
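For readers curious about the mechanics, here is a minimal sketch in Python of the kind of item-selection loop an adaptive test might run. The difficulty scale, starting level, and step rule are my own illustrative assumptions, not details of New Jersey's actual system.

```python
# A minimal sketch of an adaptive item-selection loop. The difficulty
# scale (1-5), starting level, and step rule are illustrative
# assumptions, not details of any real statewide assessment.

from dataclasses import dataclass

@dataclass
class Item:
    prompt: str
    difficulty: int  # 1 (easiest) through 5 (hardest)

def run_adaptive_test(item_bank, answer_fn, num_items=10):
    """Serve items one at a time, adjusting difficulty after each response."""
    level = 3   # start in the middle of the difficulty range
    asked = []
    for _ in range(num_items):
        # Choose an unused item closest to the student's current level.
        remaining = [item for item in item_bank if item not in asked]
        if not remaining:
            break
        item = min(remaining, key=lambda i: abs(i.difficulty - level))
        asked.append(item)
        # Step difficulty up after a correct answer, down after a miss.
        if answer_fn(item):
            level = min(level + 1, 5)
        else:
            level = max(level - 1, 1)
    return asked, level
```

Operational adaptive tests typically use item response theory to estimate ability rather than a simple step rule, but the core idea is the same: each answer changes which question comes next.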
Using technology to adapt the test experience is a reasonable use of AI.
Using AI to evaluate student thinking is a very different question.
Where Concerns Begin
The concern arises when artificial intelligence is used to score essays and written responses.
Writing is one of the most complex forms of student expression. A strong essay reflects reasoning, interpretation, evidence, and voice.
Teachers evaluate writing by considering nuance and context. Algorithms operate differently.
Most automated essay scoring systems analyze measurable patterns in text, such as vocabulary, sentence structure, grammar patterns, and essay length. The system compares the essay to patterns learned from previously scored essays and predicts a score.
The system is not interpreting ideas the way a teacher does. It is identifying statistical patterns.
That distinction matters.
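To make the distinction concrete, here is a minimal sketch of how a feature-based scoring system works. The features, model, and toy training data are illustrative assumptions on my part, not any vendor's actual system, but the basic pattern is the one described above: extract measurable signals, then predict a score from patterns in previously human-scored essays.

```python
# A minimal sketch of feature-based automated essay scoring.
# Features, model, and training data are illustrative assumptions.

import re
from sklearn.linear_model import LinearRegression

def extract_features(essay: str) -> list[float]:
    """Reduce an essay to surface-level measurements."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return [
        float(len(words)),                       # essay length
        float(len({w.lower() for w in words})),  # vocabulary size
        len(words) / max(len(sentences), 1),     # avg sentence length
    ]

# Toy training set: essays previously scored by human raters.
training_essays = [
    "Tests are bad. I do not like tests.",
    "Standardized testing raises real tradeoffs. It offers comparable "
    "data across schools, yet it can narrow instruction and reward "
    "formulaic writing rather than genuine analysis.",
]
human_scores = [1.0, 4.0]  # toy scores assigned by human raters

model = LinearRegression()
model.fit([extract_features(e) for e in training_essays], human_scores)

# Score a new essay: the model sees only the surface features above,
# not whether the argument is sound or the evidence is relevant.
new_essay = "An essay the system has never seen before."
predicted_score = model.predict([extract_features(new_essay)])[0]
print(round(predicted_score, 2))
```

Nothing in this pipeline examines what the essay actually argues, which is exactly the gap the research below describes.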
What Research Has Found
Researchers studying automated essay scoring have identified several recurring issues.
Some systems reward predictable writing structures rather than thoughtful analysis.
Others struggle to distinguish clearly between high-quality and low-quality essays. Instead of identifying the strongest and weakest writing, scores may cluster in the middle.
Researchers have also raised concerns about bias. AI systems learn from training data. If the data reflects specific writing styles or evaluation patterns, the algorithm may reproduce them.
When thousands of students are evaluated by the same system, even small inaccuracies can affect many students.
Why Multilingual Students Matter in This Conversation
Many schools in New Jersey serve students whose primary language at home is not English.
These students are developing academic writing while also learning English.
Multilingual writers may structure sentences differently or use language patterns influenced by their first language. Human educators understand this context and can recognize strong ideas even as language development continues.
Algorithms often struggle with that distinction.
If automated scoring systems are trained primarily on essays written by native English speakers, multilingual students may be evaluated unfairly.
For diverse schools like mine, this is a significant concern.
When AI Scoring Affects Teacher Evaluations
State test scores are not only used to measure student learning. In many states, including New Jersey, assessment data have historically played a role in teacher evaluation systems.
If AI systems score written responses, those scores could influence how educator performance is measured.
This raises additional questions.
How accurate is the system compared with trained human scorers?
What safeguards exist to detect scoring errors?
How frequently is the system audited for bias?
What human oversight exists in the scoring process?
If assessment data influences teacher evaluations, transparency in the scoring process becomes essential.
Graduation Decisions Should Not Depend on an Algorithm
In New Jersey, statewide testing is not only about measuring learning.
The New Jersey Graduation Proficiency Assessment (NJGPA) helps determine whether students meet graduation requirements.
This raises a serious concern.
If artificial intelligence is used to score written responses connected to graduation requirements, algorithmic decisions could influence whether a student receives a diploma.
Graduation decisions should rely on clear, transparent evaluation by trained educators.
Writing reflects reasoning, interpretation, and argument. Evaluating that work requires professional judgment and context.
Technology can assist assessment systems. But when a diploma is on the line, human evaluation should remain central.
When a test can determine whether a student graduates, the evaluation of their writing should never depend on an algorithm alone.
The Teaching-to-the-Algorithm Problem
Assessment systems influence classroom instruction.
When standardized tests reward certain writing structures, teachers often feel pressure to prepare students to write in ways that match those scoring systems.
If automated scoring systems reward predictable patterns, students may learn how to write for the algorithm rather than communicate ideas effectively.
Writing instruction could gradually shift toward producing essays that score well with the system rather than essays that demonstrate thoughtful analysis.
Educators have seen similar patterns before with standardized testing.
Transparency Matters
When artificial intelligence becomes part of statewide testing systems, transparency is critical.
Educators and families should understand:
Who developed the scoring system
How the system was trained
How often it is tested for bias
How accurate it is compared with human scoring
What human oversight exists
What appeal process exists if scores are questioned
Without transparency, trust in the assessment system is difficult to sustain.
Questions Educators Should Be Asking
As AI becomes part of statewide testing, several questions deserve attention.
How accurate is AI scoring compared with trained human scorers?
How are multilingual students protected from potential bias?
What level of human review exists in the scoring process?
How are scoring errors corrected?
Can students challenge AI-generated scores?
If test results influence teacher evaluations, how are educators protected from potential inaccuracies?
These are not arguments against technology. They are questions about responsible implementation.
Final Thoughts
Artificial intelligence will continue to shape education.
Adaptive testing may represent a productive use of technology that can improve assessment systems.
Automated essay scoring raises deeper questions.
When algorithms evaluate student writing on high-stakes exams, transparency, oversight, and fairness become essential.
Students deserve assessment systems that recognize the complexity of their thinking and the diversity of their voices.
And when something as important as graduation is involved, that evaluation should always include human judgment.
Join the Conversation
Artificial intelligence is rapidly entering classrooms, research tools, and now statewide testing systems.
Educators should have a voice in how these technologies are implemented.
If this issue matters to you, consider sharing this article with colleagues, administrators, and policymakers who are thinking about the future of assessment.
The way states approach AI in testing today will shape how student learning is evaluated for years to come.
And as educators, we should be part of that conversation.

I agree with your analysis, but similar questions arise with human scoring in many states. Years ago a friend worked with a researcher at MIT who discovered that students who wrote their MCAS essays in a certain pattern were more likely to receive a good score than those who wrote more engagingly or answered more directly what the question actually asked. Perhaps that problem has been rectified in Massachusetts; I don't know about other states. When graduation relies on the scores on these tests, it is important that we take a much closer look at scoring, whether it is done by humans or machines.