Analysis of common failure patterns in language models (GPT, Mistral) on French Baccalaureate exams, with concrete examples and insights for LLM practitioners.
Posted on 2025-01-17 by Simon Moisselin
This research was conducted with my intern Waleston Trimh, who is currently seeking summer internship opportunities in the USA. Connect with him on LinkedIn.
We recently ran an extensive evaluation of various LLMs on the French Baccalaureate (high school exit exam), and the results were... surprising. While you might expect a simple correlation between model price and performance, the reality is far more nuanced. Let's dive into the specific failure modes we discovered and what they reveal about current LLM capabilities.
Here's how we systematically evaluated LLM performance on real exam questions:
```mermaid
graph TD
    A[PDF Exam Documents] -->|Mistral OCR| B[Markdown Text]
    B -->|Extract Questions + Points| C[Structured Exercises]
    C -->|Send to LLM| D[Model Responses]
    D -->|Proprietary Scoring| E[Graded Results]
    F[French Teacher Guidelines] -->|Calibration| E
    G[Correct Answers] -->|Reference| E
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#9f9,stroke:#333,stroke-width:2px
    style F fill:#ff9,stroke:#333,stroke-width:2px
```
Each exam is parsed into structured exercise objects (see ExerciceSubmission and BlockQuestionFreeForm in our codebase). The scoring system has access to the correct answers and follows French grading guidelines, which differ noticeably from US standards in how partial credit is awarded and how much explicit reasoning is expected.
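To give a concrete picture, here is a minimal sketch of what those structured objects might look like. Only the class names ExerciceSubmission and BlockQuestionFreeForm come from our codebase; every field below is an illustrative assumption, not our actual schema.

```python
from pydantic import BaseModel


class BlockQuestionFreeForm(BaseModel):
    """One free-form question extracted from the exam markdown (fields are illustrative)."""
    question_text: str        # e.g. "True or False, with justification: ..."
    points: float             # points allotted on the official grading scale
    reference_answer: str     # correct answer consulted by the scoring step


class ExerciceSubmission(BaseModel):
    """A full exercise sent to one model, plus the answers it returned (fields are illustrative)."""
    exercise_id: str
    subject: str                            # "maths", "physique-chimie", "svt", ...
    questions: list[BlockQuestionFreeForm]
    model_name: str                         # e.g. "gpt-3.5-turbo"
    model_answers: list[str]                # one answer per question, in order
```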
Our results revealed something fascinating: while higher-priced models generally perform better, the specific types of errors they make are completely different from human errors. Let's examine the failure modes of three popular models.
GPT-3.5 Turbo consistently ignores parts of exam instructions. When questions ask for "True or False with justification", it often responds with just "True" or "False" - no explanation provided. This pattern appeared across multiple subjects and cost significant points.
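To make the failure concrete: a check along these lines is enough to flag it automatically before grading. This is a hypothetical heuristic written for this post, not our actual scoring code.

```python
import re


def missing_justification(question: str, answer: str) -> bool:
    """Flag a bare 'True'/'False' ('Vrai'/'Faux') answer to a question that asks for justification.

    Purely illustrative heuristic: a verdict followed by fewer than a handful
    of extra words is treated as unjustified.
    """
    asks_for_justification = re.search(r"justif", question, re.IGNORECASE) is not None
    if not asks_for_justification:
        return False
    words = re.findall(r"\w+", answer)
    starts_with_verdict = bool(words) and words[0].lower() in {"true", "false", "vrai", "faux"}
    return starts_with_verdict and len(words) <= 3


print(missing_justification("True or False, with justification: ...", "False"))
# True  -> flagged: verdict with no explanation
print(missing_justification("True or False, with justification: ...", "False, because the slope is 2."))
# False -> a justification is present
```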
Here's a striking example from a computer science exam:
The model completely misunderstood the core requirement, confusing "sum of weights" with "count of items."
In mathematics problems, GPT-3.5 Turbo sometimes invents facts to support its reasoning:
"The line is vertical, therefore its slope is infinite"
...when nothing in the problem statement mentioned the line being vertical.
Bad Example (GPT-3.5 Turbo):
To show that ln(α) = 2(2-α), we first rewrite the right-hand side as
ln(e^(2(2-α))) = ln(e^4e^(-2α)) = ln(e^4/e^(2α)) = ln(e^4)-(ln(e^(2α))) = 4 - 2α.
Now, we have ln(α) = ln(e^(ln(α))) = ln(e^(2(2-α-2))) = 2(2-α) = 4 - 2α,
which proves the desired equality.
This is confused: the argument is circular - it substitutes the very identity it is supposed to prove back into the exponent - and it never uses the one relevant fact, that α is a root of f.
Good Example (GPT-4 Mini):
Since α is a root of f, we have f(α) = ½ ln α + α – 2 = 0
⇔ ½ ln α = 2 – α
⇔ ln α = 2(2 – α)
Clear, precise, rigorous - exactly what examiners expect.
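If you want to sanity-check the identity yourself, a few lines of Python (written for this post, not part of the evaluation pipeline) find the root α of f(x) = ½ ln x + x − 2 by bisection and confirm that ln α = 2(2 − α):

```python
import math


def f(x: float) -> float:
    # f(x) = 1/2 * ln(x) + x - 2, the function from the exam exercise
    return 0.5 * math.log(x) + x - 2


# Bisection on [1, 2]: f(1) = -1 < 0 and f(2) = 0.5*ln(2) > 0, so the root lies in between.
lo, hi = 1.0, 2.0
for _ in range(100):
    mid = (lo + hi) / 2
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid
alpha = (lo + hi) / 2

print(alpha)                                            # ~1.73
print(math.isclose(math.log(alpha), 2 * (2 - alpha)))   # True: ln(alpha) = 2(2 - alpha)
```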
Biology (SVT) exams rely heavily on document analysis: students must extract specific data from graphs, images, and tables. When Mistral Small quotes these French source documents back, it tends to mangle the text.
We verified that the source markdown had correct spelling, making these 100% LLM errors:
- Grammar mistakes
- Spelling disasters
- Complete gibberish, such as: "bohén'tous médités par la moeJulान्त, spiealeu cuii esl un circuit refle lisan"
Interestingly, these errors never appear in Mistral Medium - suggesting this is a model-size issue.
GPT-4.1 can solve complex calculations but sometimes misreads the question context entirely:
Physics Example (Smartphone Drop, Bac 2024):
This pattern repeats: confusing left/right, positive/negative signs, and mathematical identities.
GPT-4.1 often gives imprecise answers:
This aligns with research on emergent abilities in LLMs - precision and perfect instruction-following may require specific training or model scale thresholds.
Even when solving the hardest parts correctly, GPT-4.1 loses points by skipping "obvious" steps:
Mediocre Answer (GPT-4.1 Regular):
• Verifies that A, C, D satisfy the equation ✓
• Concludes: "They define the plane"
Complete Answer (GPT-4 Mini):
- Verifies that A, C, D belong to the plane
- Calculates vectors AC = (2,4,1), AD = (-2,0,4) → Vectors are not proportional → points are not collinear
- Concludes: the three points properly define the requested plane
The "mini" model paradoxically provides more complete reasoning!
In our experiments, no LLM had image access - we simply showed [img1.png]
placeholders. This is fine for comparison since all models had the same limitation.
However, this explains poor performance on biology questions that heavily reference figures:
Document 2: Study of myenteric plexus neurons involved in intestinal transit
Document 2a: Location of myenteric plexus in the intestine
[Cross-section of intestine diagram]
Source: T.Lebouvier (2008). Enteric nervous system and Parkinson's disease...
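For context on how placeholders like the one above can be produced: the sketch below swaps markdown image tags for numbered [imgN.png] markers. It assumes standard markdown image syntax in the OCR output, which may not match our actual pipeline.

```python
import re


def replace_images_with_placeholders(markdown: str) -> str:
    """Replace markdown image tags with numbered [imgN.png] placeholders (illustrative)."""
    counter = 0

    def _placeholder(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[img{counter}.png]"

    return re.sub(r"!\[[^\]]*\]\([^)]*\)", _placeholder, markdown)


doc = "Document 2a: Location of myenteric plexus\n![figure](plexus_cross_section.png)"
print(replace_images_with_placeholders(doc))
# Document 2a: Location of myenteric plexus
# [img1.png]
```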
Strong reasoning models (like o4) compensate by extensively reasoning about what the image might contain based on context. This relates to recent research on "Visual Inference Chains" - thinking before looking to mitigate visual hallucination.
Interested in testing your own models or preparing for exams with AI assistance? Our platform allows you to:
Special thanks to Waleston for his contributions to this research. If you're looking for a talented intern with experience in LLM evaluation and educational technology, he's seeking opportunities for Summer 2025 in the USA.
Want to dive deeper into the technical details? Check out our GitHub repository for the complete evaluation framework and results.