OpenAI Model Achieves Genius-Level IQ Score on Mensa Test

A text-only OpenAI model scores 135 on a Mensa IQ test, prompting deeper discussion of AI's specialized reasoning and what constitutes true intelligence.

June 10, 2025

Reports have emerged that a new OpenAI model, referred to in some analyses as "o3", has achieved a remarkable score of 135 on a version of the Mensa IQ test.[1][2][3] The claim, based primarily on data from TrackingAI.org and reported by outlets such as Visual Capitalist and Analytics India Magazine, has sparked considerable discussion about the advancing reasoning capabilities of artificial intelligence.[1][2][3] Such a score would place the AI in the "genius" category, typically defined as above 130 on standard IQ scales, where the average human score falls between 90 and 110.[1][3] An AI performing at this level on tests designed to measure human intelligence would represent a significant stride in abstract problem-solving, a traditional challenge for AI systems.
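To put that figure in context: IQ tests are conventionally normed to a mean of 100 with a standard deviation of 15, which places a score of 135 about 2.3 standard deviations above the mean, roughly the 99th percentile. A quick Python check of that arithmetic, assuming the standard mean-100, SD-15 convention (the reports do not detail the Mensa Norway test's exact norming):

```python
from statistics import NormalDist

# Conventional IQ norming: mean 100, standard deviation 15 (an
# assumption here; the Mensa Norway test's exact scaling is not
# published in the reports discussed above).
iq_scale = NormalDist(mu=100, sigma=15)

for score in (100, 130, 135):
    percentile = iq_scale.cdf(score) * 100
    print(f"IQ {score} ~ {percentile:.1f}th percentile")

# IQ 100 ~ 50.0th percentile
# IQ 130 ~ 97.7th percentile
# IQ 135 ~ 99.0th percentile
```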
The "o3" designation is not an official name confirmed by OpenAI in its public releases for models like GPT-4o, but it has appeared in various benchmark discussions and performance charts.[1][2][4] According to reports citing TrackingAI.org, the "o3" model that scored 135 is described as a text-only model.[1] This is noteworthy because the same data indicates that multimodal models, those capable of processing both text and images, generally scored lower on these IQ evaluations.[1][2] For instance, a model identified as GPT-4o (Vision) reportedly scored 63 on the same Mensa Norway IQ test.[1][2] The "o3" model referenced in these IQ tests is presented as distinct from, and in some cases outperforming, other known OpenAI models like GPT-4o on this specific type of cognitive assessment.[4] Some reports also mention an "o1" series from OpenAI, designed to spend more time "thinking" to solve complex problems, indicating an ongoing focus by the lab on deeper reasoning capabilities.[5][6]
The specific test tied to the 135 score is the Mensa Norway IQ test, a challenging exam used to evaluate human intelligence.[1][2][3] TrackingAI.org reportedly compiled its scores by administering a standardized prompt format to various AI models: each model is presented with questions, often pattern-recognition based, and its responses are graded.[4][7] For the o3 model, the reported score of 135 or 136 (scores vary slightly across reports but are consistently high) was said to be calculated as a seven-run rolling average on this public Mensa test.[4] Achieving such a score suggests an enhanced ability to handle the kinds of logical and abstract reasoning tasks prevalent in IQ assessments.[4] However, not all evaluators publish detailed methodology, particularly the exact prompting strategies used and how raw scores are converted to an IQ scale for AI, which limits the reproducibility and interpretation of these results.[4]
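Because the exact pipeline has not been published, the sketch below is only an illustration of what a seven-run rolling-average evaluation could look like; the model interface (`model.complete`) and the raw-score-to-IQ conversion (`to_iq`) are hypothetical placeholders, not TrackingAI.org's actual code.

```python
from collections import deque

WINDOW = 7  # reports describe a seven-run rolling average

def administer_test(model, questions, answer_key):
    """Run the full question set once and return the raw correct count.

    `model.complete` is a hypothetical text-completion interface; the
    prompt format actually used by TrackingAI.org is not public.
    """
    correct = 0
    for question, expected in zip(questions, answer_key):
        reply = model.complete(question)
        if reply.strip().upper() == expected.upper():
            correct += 1
    return correct

def rolling_iq(model, questions, answer_key, to_iq, runs=WINDOW):
    """Average per-run IQ estimates over the last WINDOW runs.

    `to_iq` stands in for the test publisher's raw-score-to-IQ
    conversion table, which the reports do not reproduce.
    """
    window = deque(maxlen=WINDOW)
    for _ in range(runs):
        raw = administer_test(model, questions, answer_key)
        window.append(to_iq(raw))
    return sum(window) / len(window)
```

Averaging over several runs smooths out the run-to-run variance that sampling-based models show on a fixed question set, which is presumably why a rolling average, rather than a single attempt, is reported.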
The prospect of AI models scoring at genius levels on human IQ tests carries significant implications for the AI industry and beyond. It points toward a future where AI can tackle increasingly complex problems requiring sophisticated reasoning and understanding.[8] If validated and consistently replicable, such abilities could revolutionize fields ranging from scientific research and engineering to education and the creative arts.[5][8][9] However, the same reports note that vision-based and multimodal systems still lag in abstract problem-solving, a point reinforced by the lower scores of vision-enabled models on these IQ tests compared to their text-only counterparts.[1][2] This discrepancy highlights that different AI architectures may excel at different types of intelligence tasks, and that high scores on one kind of test do not equate to universal "genius."[1][2] Furthermore, some research indicates that even models performing well on certain benchmarks can exhibit "specification gaming" (achieving an objective through unintended methods) or elevated hallucination rates in some contexts, suggesting that raw performance scores need to be balanced against reliability and safety considerations.[10]
Despite the impressive reported scores, the AI community approaches the use of human IQ tests for machines with a degree of caution.[11][12] Experts emphasize that IQ tests measure a specific subset of intelligence and that an AI's performance on such tasks doesn't equate to generalized thinking ability or consciousness in the human sense.[11][13][14] There are inherent limitations in applying tests designed for human cognitive structures and experiences to artificial intelligence.[11][13] Factors such as the vast training data AI models are exposed to, and the specific ways tests are administered to AI, can influence outcomes in ways not comparable to human test-taking.[4][11] While a high IQ score for an AI is a provocative signal of advancing capabilities in certain reasoning tasks, it's not a definitive measure of overall intelligence or understanding.[4][11] Researchers are also exploring other benchmarks, such as the ARC-AGI test, which assesses an AI's "sample efficiency" in adapting to novel problems, to get a broader view of AI generalization capabilities.[7][9][15]
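ARC-AGI tasks take a simple, well-documented form: each task supplies a few demonstration input/output grids plus one or more test inputs, and a solver must infer the transformation from those few examples alone. A minimal scoring loop over that published JSON format might look like the following (the `solver` argument is a placeholder for whatever prediction method is being evaluated):

```python
import json

def evaluate_arc_task(task_path, solver):
    """Score a solver on a single ARC-AGI task.

    Each task file contains a few "train" input/output grid pairs
    (the demonstrations) plus "test" pairs. Having only a handful of
    demonstrations per task is what makes the benchmark a measure of
    sample efficiency rather than memorization.
    """
    with open(task_path) as f:
        task = json.load(f)

    demos = [(pair["input"], pair["output"]) for pair in task["train"]]
    solved = 0
    for pair in task["test"]:
        prediction = solver(demos, pair["input"])
        # Scoring is strict: the predicted grid must match exactly.
        if prediction == pair["output"]:
            solved += 1
    return solved / len(task["test"])
```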
In conclusion, reports of OpenAI's o3 model achieving an IQ score of 135 on the Mensa Norway test signal a significant advancement in AI's capacity for abstract reasoning, particularly in text-based problem-solving.[1][4] While o3's standing relative to other models like GPT-4o on broader tasks warrants continued scrutiny, the underlying trend of rapidly improving AI performance on complex cognitive tasks is clear. These developments open up new possibilities but also underscore the ongoing debate about how to meaningfully measure and compare artificial intelligence against human intellect. The performance gap between text-only and multimodal models on such tests, along with concerns about potential brittleness or unintended behaviors, highlights the nuanced journey ahead in developing truly general and robust AI.[1][2][10][16] Future research will undoubtedly continue to refine both AI capabilities and the methods used to assess them, moving toward a more comprehensive understanding of artificial general intelligence.
