Analysis of common failure patterns in language models (GPT, Mistral) on French Baccalaureate exams, with concrete examples and insights for LLM practitioners.
Posted on 2025-01-17 by Simon Moisselin
This research was conducted with my intern Waleston Trimh, who is currently seeking summer internship opportunities in the USA. Connect with him on LinkedIn.
We recently ran an extensive evaluation of various LLMs on the French Baccalaureate (high school exit exam), and the results were... surprising. While you might expect a simple correlation between model price and performance, the reality is far more nuanced. Let's dive into the specific failure modes we discovered and what they reveal about current LLM capabilities.
Here's how we systematically evaluated LLM performance on real exam questions:
```mermaid
graph TD
    A[PDF Exam Documents] -->|Mistral OCR| B[Markdown Text]
    B -->|Extract Questions + Points| C[Structured Exercises]
    C -->|Send to LLM| D[Model Responses]
    D -->|Proprietary Scoring| E[Graded Results]
    F[French Teacher Guidelines] -->|Calibration| E
    G[Correct Answers] -->|Reference| E
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#9f9,stroke:#333,stroke-width:2px
    style F fill:#ff9,stroke:#333,stroke-width:2px
```
Each exam is parsed into structured exercise objects (see ExerciceSubmission and BlockQuestionFreeForm in our codebase). The scoring system has access to the correct answers and follows French grading guidelines, which differ noticeably from US standards in how partial credit is awarded and how much explicit reasoning is expected.
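To give a concrete picture, here is a minimal sketch of what those structured objects might look like. Only the class names ExerciceSubmission and BlockQuestionFreeForm come from our codebase; every field below is an illustrative assumption, not our actual schema.

```python
from pydantic import BaseModel


class BlockQuestionFreeForm(BaseModel):
    """One free-form question extracted from the exam markdown (fields are illustrative)."""
    question_text: str        # e.g. "True or False, with justification: ..."
    points: float             # points allotted on the official grading scale
    reference_answer: str     # correct answer consulted by the scoring step


class ExerciceSubmission(BaseModel):
    """A full exercise sent to one model, plus the answers it returned (fields are illustrative)."""
    exercise_id: str
    subject: str                            # "maths", "physique-chimie", "svt", ...
    questions: list[BlockQuestionFreeForm]
    model_name: str                         # e.g. "gpt-3.5-turbo"
    model_answers: list[str]                # one answer per question, in order
```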
Our results revealed something fascinating: while higher-priced models generally perform better, the specific types of errors they make are completely different from human errors. Let's examine the failure modes of three popular models.
GPT-3.5 Turbo consistently ignores parts of exam instructions. When questions ask for "True or False with justification", it often responds with just "True" or "False" - no explanation provided. This pattern appeared across multiple subjects and cost significant points.
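To make the failure concrete: a check along these lines is enough to flag it automatically before grading. This is a hypothetical heuristic written for this post, not our actual scoring code.

```python
import re


def missing_justification(question: str, answer: str) -> bool:
    """Flag a bare 'True'/'False' ('Vrai'/'Faux') answer to a question that asks for justification.

    Purely illustrative heuristic: a verdict followed by fewer than a handful
    of extra words is treated as unjustified.
    """
    asks_for_justification = re.search(r"justif", question, re.IGNORECASE) is not None
    if not asks_for_justification:
        return False
    words = re.findall(r"\w+", answer)
    starts_with_verdict = bool(words) and words[0].lower() in {"true", "false", "vrai", "faux"}
    return starts_with_verdict and len(words) <= 3


print(missing_justification("True or False, with justification: ...", "False"))
# True  -> flagged: verdict with no explanation
print(missing_justification("True or False, with justification: ...", "False, because the slope is 2."))
# False -> a justification is present
```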
Here's a striking example from a computer science exam:
The model completely misunderstood the core requirement, confusing "sum of weights" with "count of items."
In mathematics problems, GPT-3.5 Turbo sometimes invents facts to support its reasoning:
"The line is vertical, therefore its slope is infinite"
...when nothing in the problem statement mentioned the line being vertical.
Bad Example (GPT-3.5 Turbo):
To show that ln(α) = 2(2-α), we first rewrite the right-hand side as
ln(e^(2(2-α))) = ln(e^4e^(-2α)) = ln(e^4/e^(2α)) = ln(e^4)-(ln(e^(2α))) = 4 - 2α.
Now, we have ln(α) = ln(e^(ln(α))) = ln(e^(2(2-α-2))) = 2(2-α) = 4 - 2α,
which proves the desired equality.
This is confused: the argument is circular - it substitutes the very identity it is supposed to prove back into the exponent - and it never uses the one relevant fact, that α is a root of f.
Good Example (GPT-4 Mini):
Since α is a root of f, we have f(α) = ½ ln α + α – 2 = 0
⇔ ½ ln α = 2 – α
⇔ ln α = 2(2 – α)
Clear, precise, rigorous - exactly what examiners expect.
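If you want to sanity-check the identity yourself, a few lines of Python (written for this post, not part of the evaluation pipeline) find the root α of f(x) = ½ ln x + x − 2 by bisection and confirm that ln α = 2(2 − α):

```python
import math


def f(x: float) -> float:
    # f(x) = 1/2 * ln(x) + x - 2, the function from the exam exercise
    return 0.5 * math.log(x) + x - 2


# Bisection on [1, 2]: f(1) = -1 < 0 and f(2) = 0.5*ln(2) > 0, so the root lies in between.
lo, hi = 1.0, 2.0
for _ in range(100):
    mid = (lo + hi) / 2
    if f(mid) < 0:
        lo = mid
    else:
        hi = mid
alpha = (lo + hi) / 2

print(alpha)                                            # ~1.73
print(math.isclose(math.log(alpha), 2 * (2 - alpha)))   # True: ln(alpha) = 2(2 - alpha)
```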
Biology (SVT) exams rely heavily on document analysis: students must extract specific data from graphs, images, and tables. When Mistral Small quotes these French source documents back, it tends to mangle the text.
We verified that the source markdown had correct spelling, making these 100% LLM errors:
- Grammar mistakes
- Spelling disasters
- Complete gibberish, such as: "bohén'tous médités par la moeJulान्त, spiealeu cuii esl un circuit refle lisan"
Interestingly, these errors never appear in Mistral Medium - suggesting this is a model-size issue.
GPT-4.1 can solve complex calculations but sometimes misreads the question context entirely:
Physics Example (Smartphone Drop, Bac 2024):
This pattern repeats: confusing left/right, positive/negative signs, and mathematical identities.
GPT-4.1 often gives imprecise answers:
This aligns with research on emergent abilities in LLMs - precision and perfect instruction-following may require specific training or model scale thresholds.
Even when solving the hardest parts correctly, GPT-4.1 loses points by skipping "obvious" steps:
Mediocre Answer (GPT-4.1 Regular):
• Verifies that A, C, D satisfy the equation ✓
• Concludes: "They define the plane"
Complete Answer (GPT-4 Mini):
- Verifies that A, C, D belong to the plane
- Calculates vectors AC = (2,4,1), AD = (-2,0,4) → Vectors are not proportional → points are not collinear
- Concludes: the three points properly define the requested plane
The "mini" model paradoxically provides more complete reasoning!
In our experiments, no LLM had image access - we simply showed [img1.png]
placeholders. This is fine for comparison since all models had the same limitation.
However, this explains poor performance on biology questions that heavily reference figures:
Document 2: Study of myenteric plexus neurons involved in intestinal transit
Document 2a: Location of myenteric plexus in the intestine
[Cross-section of intestine diagram]
Source: T.Lebouvier (2008). Enteric nervous system and Parkinson's disease...
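For context on how placeholders like the one above can be produced: the sketch below swaps markdown image tags for numbered [imgN.png] markers. It assumes standard markdown image syntax in the OCR output, which may not match our actual pipeline.

```python
import re


def replace_images_with_placeholders(markdown: str) -> str:
    """Replace markdown image tags with numbered [imgN.png] placeholders (illustrative)."""
    counter = 0

    def _placeholder(_match: re.Match) -> str:
        nonlocal counter
        counter += 1
        return f"[img{counter}.png]"

    return re.sub(r"!\[[^\]]*\]\([^)]*\)", _placeholder, markdown)


doc = "Document 2a: Location of myenteric plexus\n![figure](plexus_cross_section.png)"
print(replace_images_with_placeholders(doc))
# Document 2a: Location of myenteric plexus
# [img1.png]
```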
Strong reasoning models (like o4) compensate by extensively reasoning about what the image might contain based on context. This relates to recent research on "Visual Inference Chains" - thinking before looking to mitigate visual hallucination.
Interested in testing your own models or preparing for exams with AI assistance? Our platform allows you to:
Special thanks to Waleston for his contributions to this research. If you're looking for a talented intern with experience in LLM evaluation and educational technology, he's seeking opportunities for Summer 2025 in the USA.
Want to dive deeper into the technical details? Check out our GitHub repository for the complete evaluation framework and results.