NYU Professor Fights ChatGPT Cheating With 42-Cent AI Oral Exams

NYU professor deploys 42-cent AI voice exams, revealing true student understanding and hidden instructional flaws.

January 4, 2026

The advent of generative artificial intelligence has triggered a crisis of academic integrity in higher education, prompting a rapid and creative evolution in how universities assess student learning. At the forefront of this shift is NYU Stern School of Business professor Panos Ipeirotis, who piloted a low-cost, highly effective assessment method: an AI-powered voice agent that conducts oral examinations at scale, a format previously deemed impractical for large courses. The experiment, which cost a mere 42 cents per student, not only revealed significant gaps between the polish of students' written submissions and their actual understanding, but also served as an unexpected mirror for the professor, highlighting weaknesses in his own instructional delivery.[1][2]
The impetus for this radical change was the professor's observation that submissions in his "AI/ML Product Management" course were becoming "suspiciously good," with a level of polish that suggested professional editing rather than genuinely strong student work. This led Ipeirotis and co-instructor Konstantinos Rizakos to begin cold-calling students in class, a practice that proved "illuminating": many who had submitted thoughtful papers could not coherently defend or explain basic decisions in their own work after a few follow-up questions. The unmistakable conclusion was that the old equilibrium, in which take-home assignments reliably measured understanding, was, in Ipeirotis's words, "dead, gone, kaput," because Large Language Models (LLMs) like ChatGPT can now effortlessly produce high-quality written work that slips past traditional anti-plagiarism tools.[1][2]
In response, the team turned to the ancient, yet newly revitalized, format of the oral exam, which forces real-time reasoning and defense of specific decisions and is therefore exceedingly difficult to fake or outsource to an LLM in the moment. The challenge, however, was scalability: manually conducting 25-minute oral exams for a large class would be a logistical nightmare, consuming dozens of hours of faculty time. Ipeirotis's solution leveraged the very technology students were using to cheat: an AI-powered voice agent built with ElevenLabs Conversational AI, paired with a multi-model grading council.[2] The entire system for 36 students cost about $15: roughly $8 for the primary grading LLM (Claude), $2 for a second LLM (Gemini), 30 cents for a third (OpenAI), and $5 for voice minutes from ElevenLabs. That 42-cent per-student cost stands in stark contrast to the estimated $750 it would take to pay teaching assistants (TAs) to conduct the same number of exams, a dramatic unit-economics advantage.[2]
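For readers who want to sanity-check the unit economics, the arithmetic is straightforward. The sketch below uses only the cost figures reported above; casting them as a small Python tally, and the label names in it, are purely illustrative.

```python
# Reported pilot costs for a 36-student class, in US cents
# (figures from the article; this script is only an illustrative tally).
costs_cents = {
    "claude_grading": 800,    # primary grading LLM (~$8)
    "gemini_grading": 200,    # second grading LLM (~$2)
    "openai_grading": 30,     # third grading LLM (~$0.30)
    "elevenlabs_voice": 500,  # voice minutes for the conversational agent (~$5)
}

num_students = 36
total_cents = sum(costs_cents.values())         # ~1,530 cents, i.e. about $15
per_student_cents = total_cents / num_students  # ~42.5 cents per exam

ta_alternative_cents = 75_000                   # estimated $750 for TA-run orals
savings_factor = ta_alternative_cents / total_cents

print(f"Total: ${total_cents / 100:.2f}")
print(f"Per student: {per_student_cents:.1f} cents")
print(f"Roughly {savings_factor:.0f}x cheaper than TA-conducted exams")
```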
The AI-powered examination had two parts: a defense of the student's capstone project and a real-time case-study discussion. The agent was primed with each student's project context, allowing it to drill into stated goals, data choices, and modeling decisions, an approach that immediately nullified the "LLM did my homework" strategy, since passing required deep, specific, improvisational knowledge. Grading was handled by a three-model council that deliberated over each exam and produced a full audit trail and structured feedback with verbatim quotes, a level of consistency and documentation that human-only grading often lacks.[2]
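The article does not publish the council's implementation, so the following is only a minimal sketch of how a three-model grading council with a simple deliberation rule and an audit trail might look. The function names, rubric fields, and median-based aggregation are assumptions, not Ipeirotis's actual code, and the stub graders stand in for real API calls to Claude, Gemini, and OpenAI.

```python
"""Minimal sketch of a multi-model "grading council" (assumptions noted above)."""
from dataclasses import dataclass, asdict
from statistics import median
from typing import Callable

@dataclass
class Verdict:
    model: str
    score: float      # e.g. 0-100 against the section rubric
    rationale: str    # justification, ideally quoting the transcript verbatim

Grader = Callable[[str, str], Verdict]  # (transcript, rubric) -> Verdict

def grade_with_council(transcript: str, rubric: str, graders: dict[str, Grader]) -> dict:
    """Query each grader independently, then aggregate and keep an audit trail."""
    verdicts = [grade(transcript, rubric) for grade in graders.values()]
    scores = [v.score for v in verdicts]
    return {
        "final_score": median(scores),        # median as a simple deliberation rule
        "spread": max(scores) - min(scores),  # a large spread could trigger human review
        "audit_trail": [asdict(v) for v in verdicts],
    }

def make_stub_grader(name: str, score: float) -> Grader:
    """Stand-in for a real LLM call (Claude, Gemini, or OpenAI) via its API client."""
    def grade(transcript: str, rubric: str) -> Verdict:
        return Verdict(name, score, f"{name}: example rationale with a verbatim quote")
    return grade

if __name__ == "__main__":
    council = {
        "claude": make_stub_grader("claude", 82),
        "gemini": make_stub_grader("gemini", 78),
        "openai": make_stub_grader("openai", 85),
    }
    print(grade_with_council("exam transcript ...", "A/B testing rubric ...", council))
```

The median is used here only to make the "deliberation" idea concrete with a simple, robust aggregate; the reported system has the three models deliberate before producing structured feedback.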
While the experiment's logistics demonstrated a massive leap in cost-effectiveness and scalability, the core findings centered on pedagogy. The grading output from the AI council served as an unvarnished mirror for the instructors, revealing weaknesses in their teaching that had been masked by polished written work. For instance, students universally scored poorly on one particular section, exposing that the in-class unit on A/B testing methodology had been rushed and insufficiently covered. The data also surfaced a counterintuitive finding: the duration of an exam had practically zero correlation with the final score, indicating that the AI was testing comprehension rather than endurance.[2]
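That duration finding amounts to a simple correlation check. The sketch below shows how it would typically be computed, assuming per-student exam records are exported to a CSV; the file name and column names are hypothetical.

```python
# Sketch of the duration-vs-score check, assuming exam records exported to a
# CSV with hypothetical columns "duration_minutes" and "final_score".
import pandas as pd

df = pd.read_csv("oral_exam_results.csv")           # hypothetical export path
r = df["duration_minutes"].corr(df["final_score"])  # Pearson correlation by default
print(f"Correlation between exam duration and final score: r = {r:.2f}")
```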
From the students' perspective, the reaction was mixed but telling. A full 83 percent found the AI oral exam more stressful than a traditional written test. Yet 70 percent agreed that the new format genuinely tested their "actual understanding," the highest-rated item in the survey. And though only 13 percent preferred the AI oral format over other assessment methods, students did appreciate the flexibility of taking the exam at a time and place of their choosing. In short, students accepted the assessment's *validity* even as they were uncomfortable with its *delivery*.[2]
This pivot to scalable, AI-powered oral assessment holds profound implications for the educational technology and AI industries. It offers a functional, immediate answer to the academic integrity challenge posed by generative AI, proving the concept of "fighting fire with fire" by using AI to enforce a higher standard of learning. The technology's low barrier to entry (a basic version can be set up in minutes on platforms like ElevenLabs) suggests the model could be adopted rapidly across universities worldwide.

Beyond assessment, the professor's insight points to a massive, underdeveloped market for AI tools that diagnose teaching efficacy. By providing objective, granular feedback on student performance, such tools can move from being mere cheating detectors to essential instruments for pedagogical improvement, shifting the focus of education back to real-time understanding, critical thinking, and communication skills, all of which are invaluable in the modern workplace. The method also opens the door to a new paradigm of exam preparation: if students are given the entire setup to practice with, "drilling the test" becomes a productive, personalized learning exercise, because the questions are dynamically generated and only true mastery of the underlying concepts gets a student through.[2]
