Levanter

You could not prevent a thunderstorm, but you could use the electricity; you could not direct the wind, but you could trim your sail so as to propel your vessel as you pleased, no matter which way the wind blew.
— Cora L. V. Hatch

Levanter is a framework for training large language models (LLMs) and other foundation models that strives for legibility, scalability, and reproducibility:

Legible: Levanter uses our named tensor library Haliax to write easy-to-follow, composable deep learning code, while still being high performance.
Scalable: Levanter scales to large models and can train on a variety of hardware, including GPUs and TPUs.
Reproducible: Levanter is bitwise deterministic, meaning that the same configuration will always produce the same results, even in the face of preemption and resumption.
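To make the reproducibility property above concrete, here is a minimal, framework-free sketch of why a run can survive preemption without changing its results. It uses only the Python standard library, not Levanter or JAX, and the names `step_rng` and `train` are hypothetical. The assumption it illustrates is that all randomness for a step is derived from the seed and the step number alone, so a resumed run replays exactly the same computation as an uninterrupted one.

```python
import random
from typing import Optional


def step_rng(seed: int, step: int) -> random.Random:
    # Derive the generator from (seed, step) only, never from wall-clock time
    # or from how many random numbers were drawn earlier in the process.
    # The mixing formula is an arbitrary choice made for this sketch.
    return random.Random(seed * 1_000_003 + step)


def train(seed: int, num_steps: int, state: Optional[dict] = None) -> dict:
    """Run a toy training loop from state["step"] up to num_steps."""
    if state is None:
        state = {"step": 0, "weight": 0.0}
    else:
        state = dict(state)  # do not mutate the caller's checkpoint
    while state["step"] < num_steps:
        rng = step_rng(seed, state["step"])
        noise = rng.gauss(0.0, 1.0)  # stands in for data order, dropout, etc.
        state["weight"] = 0.9 * state["weight"] + 0.1 * noise
        state["step"] += 1
    return state


# Uninterrupted run.
full = train(seed=0, num_steps=100)

# Run "preempted" after 37 steps, then resumed from the saved state.
checkpoint = train(seed=0, num_steps=37)
resumed = train(seed=0, num_steps=100, state=checkpoint)

# The resumed run matches the uninterrupted one exactly, not just approximately.
assert resumed == full
```

In a real system the checkpoint would also need to capture optimizer state and the position in the data stream; the sketch keeps only a step counter and one weight to show the principle.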

We built Levanter with JAX, Equinox, and Haliax.

Documentation

Levanter's documentation is available at levanter.readthedocs.io. Haliax's documentation is available at haliax.readthedocs.io.

Features

Distributed Training: We support distributed training on TPUs (and soon, GPUs), including FSDP and tensor parallelism.
Compatibility: Levanter supports importing and exporting models to/from the Hugging Face ecosystem, including tokenizers, datasets, and models via SafeTensors.
Performance: Levanter's performance rivals commercially backed frameworks like MosaicML's Composer or Google's MaxText.
Cached On-Demand Data Preprocessing: We preprocess corpora online, but we cache the results of preprocessing so that resumes are much faster and so that subsequent runs are even faster. As soon as the first part of the cache is complete, Levanter will start training.
Optimization: Levanter supports the new Sophia optimizer, which can be 2x as fast as Adam. Levanter also supports Optax for optimization with AdamW, etc.
Logging: Levanter supports a few different logging backends, including WandB and TensorBoard. (Adding a new logging backend is easy!) Levanter even exposes the ability to log inside of JAX jit-ted functions.
Reproducibility: On TPU, Levanter is bitwise deterministic, meaning that the same configuration will always produce the same results, even in the face of preemption and resumption.
Distributed Checkpointing: Distributed checkpointing is supported via Google's TensorStore library. Training can even be resumed on a different number of hosts, though this breaks reproducibility for now.
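The cached on-demand preprocessing feature listed above can be sketched in a few lines. This is an illustration of the idea only, written with the Python standard library; it is not Levanter's cache format or API, and `tokenize`, `cached_shards`, and the `shard-00000.json` file layout are all hypothetical. It assumes the corpus is split into shards that can be preprocessed independently, so the consumer can begin work as soon as the first shard is ready and later runs can read finished shards straight from disk.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, List


def tokenize(text: str) -> List[str]:
    # Stand-in for real (expensive) preprocessing such as tokenization.
    return text.lower().split()


def cached_shards(shards: List[str], cache_dir: Path, stats: dict) -> Iterator[List[str]]:
    """Yield preprocessed shards one at a time, building the cache on demand."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    for index, raw in enumerate(shards):
        path = cache_dir / f"shard-{index:05d}.json"
        if path.exists():
            stats["hits"] += 1
            tokens = json.loads(path.read_text())
        else:
            stats["misses"] += 1
            tokens = tokenize(raw)
            # Write to a temporary name and rename, so a preempted run never
            # leaves a half-written shard that looks complete.
            tmp = path.with_suffix(".tmp")
            tmp.write_text(json.dumps(tokens))
            tmp.replace(path)
        # The consumer can start training on this shard immediately; later
        # shards are only preprocessed when the loop asks for them.
        yield tokens


corpus = ["The wind blew east", "Trim your sail", "Use the electricity"]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir) / "cache"

    # First run: every shard is preprocessed and written to the cache.
    first = {"hits": 0, "misses": 0}
    first_run = list(cached_shards(corpus, cache, first))

    # Second run (or a resume): every shard is read back from the cache.
    second = {"hits": 0, "misses": 0}
    second_run = list(cached_shards(corpus, cache, second))
```

Because `cached_shards` is a generator, the first shard is available to the training loop before any later shard has been touched. The sketch does this lazily in a single process; it does not show how preprocessing might run in the background or across hosts.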

The code is released on GitHub: Levanter repository.

To get started, please refer to the User Guide's chapters:

Please also see the guides in the menu on the left.

To contribute, please refer to the Contributing Guide.