TL;DR
Italy’s Minerva, a European sovereign large language model trained from scratch, achieved strong technical results but scored near chance on Italian academic tests. This highlights challenges in scaling language models for country-specific knowledge.
Italy’s Minerva, a sovereign large language model family trained entirely from scratch on 2.5 trillion tokens, saw its 3B-parameter model score only 4.9% on the INVALSI Italian school-exam benchmark, despite the project’s impressive technical development. This stark result highlights how hard it is to achieve deep country-specific knowledge through scale alone, and it calls into question assumptions about the relationship between training data size, model parameters, and real-world language understanding.
The Minerva project, led by Sapienza University of Rome and funded through Italy’s national AI strategy, trained models ranging from 350 million to 7 billion parameters on approximately 50% Italian data. The training dataset comprised 2.5 trillion tokens, of which about 1.14 trillion were Italian, a significant investment in native-language data. Despite these efforts, the 3B-parameter model scored only 4.9% on the INVALSI Italian exam, an outcome considered near chance, which indicates a disconnect between training scale and complex language comprehension.
Researchers have concluded that while dataset composition matters, the overall size of the dataset and the number of parameters are more critical for handling complex language tasks. The exam result contrasts with Minerva’s technical achievements: the model outperforms comparable multilingual models on Italian benchmarks but struggles on real-world academic assessments. The findings suggest that simply increasing data and parameters may not be sufficient to produce deep, country-specific language understanding, raising questions about the scale and investment that sovereign models require.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published weights, training data, and code as truly-open from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about — and it has implications that extend well beyond Italy.
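The token figures quoted in this article can be sanity-checked with a few lines of arithmetic. The sketch below uses only the numbers cited here (2.5 trillion total tokens, about 1.14 trillion Italian). The tokens-per-parameter ratio additionally assumes that the 2.5 trillion figure belongs to the 7B tier, which the article does not state, so treat that ratio as illustrative.

```python
# Figures as cited in this article.
TOTAL_TOKENS = 2.5e12     # total training tokens
ITALIAN_TOKENS = 1.14e12  # Italian-language tokens

# ASSUMPTION: the 2.5T corpus is the one used for the 7B tier.
ASSUMED_PARAMS = 7e9


def share(part: float, whole: float) -> float:
    """Fraction of the corpus made up by one component."""
    return part / whole


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


italian_share = share(ITALIAN_TOKENS, TOTAL_TOKENS)
non_italian_tokens = TOTAL_TOKENS - ITALIAN_TOKENS
ratio = tokens_per_parameter(TOTAL_TOKENS, ASSUMED_PARAMS)

print(f"Italian share of corpus: {italian_share:.1%}")
print(f"Non-Italian tokens:      {non_italian_tokens:.3g}")
print(f"Tokens per parameter:    {ratio:.0f} (under the 7B assumption)")
```

On these figures the Italian share works out to a little under half of the corpus, which is consistent with the "approximately 50%" description.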
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose from scratch with substantial native-language foundation. Portugal chose continuation pre-training of a multilingual model. The structural comparison surfaces what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.

4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the Italian school-exam benchmark. The result was unambiguous. This is not a critique of Minerva — it is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
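A benchmark score is easier to interpret when it is reported next to the random-guess baseline for the same questions. The following is a minimal sketch of that comparison, not the evaluation the researchers ran. The `ScoredItem` schema and the sample data are hypothetical, and the INVALSI answer format is not described in this article, so the option counts below are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredItem:
    """One graded multiple-choice question (hypothetical schema)."""
    num_options: int     # answer choices offered for this question
    model_correct: bool  # whether the model chose the right one


def accuracy(items: list[ScoredItem]) -> float:
    """Fraction of questions the model answered correctly."""
    return sum(item.model_correct for item in items) / len(items)


def chance_baseline(items: list[ScoredItem]) -> float:
    """Expected accuracy of uniform random guessing on the same items."""
    return sum(1.0 / item.num_options for item in items) / len(items)


# Illustrative data only: six 4-option and four 2-option questions,
# with the model answering the first three correctly.
items = [
    ScoredItem(num_options=4 if i < 6 else 2, model_correct=(i < 3))
    for i in range(10)
]

print(f"accuracy:        {accuracy(items):.1%}")
print(f"chance baseline: {chance_baseline(items):.1%}")
```

Because the baseline depends on how many options each question offers, what counts as "near chance" can only be judged against the benchmark's actual format and scoring rule.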

350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table: Minerva model tiers and their training corpora. Surviving cell labels: Italian + English; 100B English; ~50% English; + 200B code.]

Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.

Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications of Scale for Country-Specific AI Models
This development matters because it challenges the assumption that larger models trained on more native-language data automatically achieve deep, country-specific knowledge. Italy’s experience with Minerva demonstrates that even substantial investment and large datasets may not guarantee high performance on complex, real-world language tasks. This has broader implications for European efforts to develop sovereign AI, suggesting that scale alone might be insufficient and that strategic focus on model architecture, training methodology, and data quality is essential. The result also raises questions about the optimal investment levels needed to justify national AI projects and the potential need for more targeted or innovative approaches to achieve meaningful language understanding at the country level.
European Sovereign LLM Strategies and Challenges
The European sovereign-LLM movement has seen varied approaches, notably Portugal’s AMÁLIA model, which layered Portuguese specialization onto a multilingual foundation, and Italy’s Minerva, which trained from scratch on a 2.5-trillion-token dataset. While AMÁLIA’s weights are not yet public, Minerva’s open weights, training data, and code have provided transparency. Italy’s approach involved a significant investment in infrastructure, including CINECA’s supercomputing resources and national funding, resulting in models that outperform multilingual counterparts on Italian benchmarks. However, the performance on the INVALSI exam reveals a critical challenge: scale and native-language data alone may not suffice for complex language understanding. This ongoing research underscores the structural debate about the necessary investment and architectural choices for sovereign AI in Europe.
“Despite the large-scale training, our model’s performance on academic content tests remains near chance, indicating a need to rethink our approach.”
— Research team behind Minerva
Unresolved Questions About Model Scaling and Effectiveness
It is not yet clear what specific factors limit Minerva’s performance on complex language tasks despite the large-scale training. The relationship between dataset size, model parameters, architecture, and real-world understanding remains an open question. Researchers continue to investigate whether different training methodologies, data quality, or model architectures could improve results, but definitive conclusions are still pending.
Next Steps for European Sovereign Language Models
The Minerva team plans to continue refining their models, including ongoing experiments with continual training and different data strategies. Further research is expected to explore how to better align large-scale training with real-world language understanding, potentially involving architectural innovations or targeted data curation. Policymakers and researchers are likely to reassess investment levels and strategic priorities based on these emerging insights, aiming to develop models that better serve national language and knowledge needs.
Key Questions
Why did Minerva score so low on the Italian exam despite large-scale training?
The evaluation suggests that scale alone may not be enough; factors like training methodology, data quality, and architecture are critical for complex language understanding.
How does Minerva compare to multilingual models on Italian benchmarks?
Minerva outperforms comparable multilingual models on Italian benchmarks, but struggles with real-world academic tests, indicating a gap between benchmark performance and practical understanding.
What are the implications for Europe’s AI strategy?
The results suggest that European AI efforts may need to reconsider the emphasis on scale and focus more on architecture, data quality, and targeted training to achieve country-specific language mastery.
Is the low exam score a sign that sovereign models are not worth the investment?
Not necessarily; it highlights that current approaches may need adjustment. Large-scale training remains valuable, but effectiveness depends on multiple factors beyond just data size and parameters.
Source: ThorstenMeyerAI.com