Open Training Data in AI
Apertus - Swiss AI Initiative | OLMo - Allen Institute for AI
Most “open-source” AI models publish only their weights (the trained parameters); the training data, training code, and alignment methods remain closed. This limits independent auditing, bias detection, and reproducibility.
A growing number of models go further. Apertus, developed by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre, publishes everything: training data, code, weights, and alignment methods. It supports over 1,000 languages, was designed with EU AI Act compliance in mind, and uses training techniques that suppress memorization of personal data.

OLMo, from the Allen Institute for AI, follows the same philosophy: open weights, open training data, and open code, with full documentation of pretraining, post-training, and evaluation.

Both demonstrate that competitive performance and full transparency are not mutually exclusive. Open training data is the foundation for auditable, reproducible AI.
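What “open weights” means in practice is that anyone can download the published parameters and run or probe the model locally. Here is a minimal sketch using the Hugging Face transformers library; the model identifier is an assumption based on AI2’s published naming and should be checked against the hub:

```python
# Minimal sketch: load a fully open model's published weights and sample
# from it locally. The model ID "allenai/OLMo-2-1124-7B" is an assumed
# identifier; check the Hugging Face hub for current names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-2-1124-7B"  # assumed hub identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Open training data matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```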
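Open training data is just as concrete: the corpus itself can be streamed and spot-checked by anyone, for example to scan for personal data or to measure how sources are represented. A hedged sketch assuming the datasets library and AI2’s Dolma corpus under the hub name allenai/dolma with a text field (name and schema are assumptions; check the dataset card):

```python
# Hedged sketch of auditing open training data: stream documents and
# flag any that match a pattern of interest, e.g. an email-like string.
# "allenai/dolma" and the "text" field are assumptions; streaming avoids
# downloading the multi-terabyte corpus just to spot-check it.
import re
from datasets import load_dataset

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

ds = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(ds):
    if EMAIL.search(doc.get("text", "")):
        print(f"document {i} contains an email-like string")
    if i >= 10_000:  # bounded spot-check for this sketch
        break
```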
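The memorization point can also be made concrete. One published idea in this space, sometimes called a goldfish loss, drops a pseudo-random subset of token positions from the training objective, so no document is ever fully supervised and verbatim recall is never directly rewarded. The PyTorch sketch below illustrates that idea generically; it is not Apertus’s actual implementation:

```python
# Illustrative sketch (not Apertus's code) of goldfish-style loss masking.
import torch
import torch.nn.functional as F

def goldfish_style_loss(logits, labels, drop_rate=0.25, seed=0):
    """Next-token cross-entropy over a pseudo-random subset of positions.

    logits: (batch, seq, vocab); labels: (batch, seq).
    drop_rate and the seeding are illustrative choices; a real
    implementation derives the mask from a hash of the local token
    context, so a recurring passage is masked identically every time.
    """
    gen = torch.Generator().manual_seed(seed)
    keep = torch.rand(labels.shape, generator=gen) >= drop_rate
    # Positions set to -100 are ignored by cross_entropy, so the model
    # never receives a gradient toward reproducing those tokens verbatim.
    masked = labels.masked_fill(~keep, -100)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        masked.reshape(-1),
        ignore_index=-100,
    )
```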