tl;dr When we put lots of text (e.g. a whole code repo) into a language model's context, generation cost soars because of the KV cache's size. What if we trained a smaller KV cache for our documents offline? Using a test-time training recipe called self-study, we show that this simple idea can improve throughput by
26× while maintaining quality. (See our blogpost for more.)
The codebase relies on you setting the following environment variables. We recommend adding them to your ~/.bashrc, ~/.zshrc, Dockerfile, etc.
# path to the directory where you cloned this repo
export CARTRIDGES_DIR=/path/to/cartridges

# path to a directory where you want to store outputs like model checkpoints
export CARTRIDGES_OUTPUT_DIR=/path/to/cartridges/outputs
Synthesizing Training Data with Self-Study
What is self-study? Self-study is a test-time training approach where we generate synthetic conversations about a corpus of text. The process simulates two AI agents: one asks questions or makes requests about the content, and another responds using the provided context. This creates training data that teaches the model to efficiently compress and retrieve information from long contexts.
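Conceptually, each round of self-study pairs a "questioner" prompt with an "answerer" prompt over a chunk of the corpus. The sketch below is illustrative only; the function, prompts, and chat interface are our own, not the library's API.

from typing import Callable, Dict, List

def self_study_round(chat: Callable[[List[Dict[str, str]]], str], chunk: str) -> List[Dict[str, str]]:
    """Illustrative sketch of one self-study round: agent A asks, agent B answers with the chunk in context."""
    # Agent A: generate a question or request about the chunk.
    question = chat([
        {"role": "system", "content": "Ask one specific question about the text below."},
        {"role": "user", "content": chunk},
    ])
    # Agent B: respond to the request, with the chunk provided as context.
    answer = chat([
        {"role": "system", "content": f"Answer using only this context:\n\n{chunk}"},
        {"role": "user", "content": question},
    ])
    # The resulting exchange becomes one synthetic training example.
    return [{"role": "user", "content": question}, {"role": "assistant", "content": answer}]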
Quickstart: Take a look at the script at scripts/longhealth_synthesize.py for an example of how to generate training data with self-study. To actually run the script, you will need to spin up an inference server (either Tokasaurus or SGLang) and set the client variable to point to it.
Below we walk through the process of generating synthetic training data for a corpus of text in more detail. As a running example, we'll be training a cartridge on our paper on Cartridges. How meta!
Here are the steps:
Create a StructuredContext object that contains the data you want to store in the cartridge
Ensure you have an inference server running (either Tokasaurus or SGLang) and configure your client to point to it
Instantiate a SynthesizeConfig object that contains the parameters for the self-study process
Put it all together in one script and run it!
Note: For configuration, we use Pydantic models. Pydantic models are useful for defining the schema of the config and quickly ensuring that the config is valid at the beginning of the script. We also rely on pydrantic, which provides a few utilities for working with configs.
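As a rough illustration of the pattern (the fields below are made up for illustration and are not the actual schema), a Pydantic config validates its inputs as soon as it is constructed:

from pydantic import BaseModel

# Illustrative config with made-up fields; the real configs live in the cartridges package.
class ExampleConfig(BaseModel):
    num_samples: int
    batch_size: int = 16

cfg = ExampleConfig(num_samples=512)   # validated at construction time
# ExampleConfig(num_samples="lots")    # would raise a ValidationError immediately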
Step 1: Create a Context Object
A StructuredContext represents your corpus in a format that the self-study process can work with. We provide several built-in context types. For our example, we'll use the TexDocument context type.
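For example, a context config pointing at the arXiv source of the Cartridges paper might look like the following (the import path for TexDocument is an assumption on our part; adjust it to wherever TexDocument lives in the repo):

# Import path is a guess; adjust to your checkout.
from cartridges.contexts.tex import TexDocument

context_config = TexDocument.Config(
    arxiv_src_url="https://arxiv.org/src/2506.06266",  # arXiv source tarball for the Cartridges paper
    main_file="main.tex",                              # entry point .tex file inside the tarball
)

Step 2: Point a Client at an Inference Server
Self-study generates data by repeatedly calling an inference server, so make sure one is running (either Tokasaurus or SGLang) and create a client config that points to it. For example, with the Tokasaurus batch client used in the full script below:

from cartridges.clients.tokasaurus_batch import TokasaurusBatchClient

client_config = TokasaurusBatchClient.Config(
    url="https://hazyresearch--tksrs-entry-capsules-3b-1xh100-min0-max64-serve.modal.run",  # your server's URL
    ports=None,
    model_name="meta-llama/Llama-3.2-3B-Instruct",
)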
Step 3: Configuring the Synthesizer and Putting it all together
We are now going to put all of the pieces together in a SynthesizeConfig object that configures the entire self-study process.
Core Settings:
num_samples: Total number of training examples to generate
batch_size: Number of training examples to generate per call to the inference server.
max_num_batches_in_parallel: Number of batches to process concurrently. When using Modal, higher values let the generation fan out across more autoscaling replicas of the inference server.
Here's a complete example script:
import os

import pydrantic

from cartridges.clients.tokasaurus_batch import TokasaurusBatchClient
from cartridges.synthesize import SynthesizeConfig
from cartridges.synthesizers.self_study import SelfStudySynthesizer, SlicePromptSamplerWithChunks
from cartridges.utils import WandBConfig
# Import path for TexDocument is an assumption; adjust to your checkout.
from cartridges.contexts.tex import TexDocument

# Client pointing at a running inference server (Tokasaurus here).
client_config = TokasaurusBatchClient.Config(
    url="https://hazyresearch--tksrs-entry-capsules-3b-1xh100-min0-max64-serve.modal.run",
    ports=None,
    model_name="meta-llama/Llama-3.2-3B-Instruct",
)

# Context: the arXiv source of the Cartridges paper.
context_config = TexDocument.Config(
    arxiv_src_url="https://arxiv.org/src/2506.06266",
    main_file="main.tex",
)

config = SynthesizeConfig(
    context=context_config,
    synthesizer=SelfStudySynthesizer.Config(
        client=client_config,
        tokenizer="meta-llama/Llama-3.2-3B-Instruct",
        max_rounds=1,
        prompt_sampler=SlicePromptSamplerWithChunks.Config(
            slices=["structuring", "summarization", "question", "use_case", "creative"],
            min_chunk_size=512,
            max_chunk_size=4096,
            desc="Below is a research paper on test-time training for long contexts.",
        ),
        prob_cot_a=0.2,
        use_tools=False,
        tools=[],
    ),
    output_dir=os.environ.get("CARTRIDGES_OUTPUT_DIR", "."),
    num_samples=512,
    batch_size=16,
    max_num_batches_in_parallel=4,
    handle_exceptions=True,  # continue if individual batches fail
    save_wandb_artifact=True,
    name="cartridges-tutorial",
    wandb=WandBConfig(project="cartridges", entity="hazy-research"),
)

if __name__ == "__main__":
    pydrantic.main([config])
Step 4: Running the Synthesis
Once you've created the file, run it with:
python your_synthesis_script.py
Once the run is complete, it will save the results to a pickle file and print the path:
Final output saved to /path/to/output/dir/artifact/dataset.pkl
Output format
class TrainingExample(BaseModel):
    messages: List[Message]                  # The conversation between agents (system, user, assistant format)
    token_ids: List[int]                     # The token IDs for the response
    top_logprob_ids: List[List[int]]         # The top-k token predictions at each position
    top_logprob_logprobs: List[List[float]]  # The corresponding log probabilities
    metadata: Dict[str, Any]                 # Information about tool usage, prompts, and the generation process
Exploring the synthesized dataset in a DataFrame
import pickle

import pandas as pd

# Load the dataset
with open("/path/to/output/dir/artifact/dataset.pkl", "rb") as f:
    data = pickle.load(f)

rows = data["rows"]
context = data["context"]

# Convert to a DataFrame for exploration
df = pd.DataFrame([
    {
        "num_messages": len(row.messages),
        "num_output_tokens": row.num_output_tokens,
        "seed_prompt": row.metadata.get("seed_prompt", ""),
        "conversation": "\n".join([f"{msg.role}: {msg.content}" for msg in row.messages]),
    }
    for row in rows[:10]  # First 10 examples
])
You can enhance the self-study process with tools that allow agents to dynamically retrieve additional context:
from cartridges.tools.base import Tool

# Define tools for information retrieval.
# Note: SearchTool and SummaryTool are illustrative names; see cartridges.tools for the
# tool implementations actually available in the repo.
tools = [
    SearchTool.Config(description="Search for specific information"),
    SummaryTool.Config(description="Generate summaries of sections"),
]

synthesizer_config = SelfStudySynthesizer.Config(
    # ... other config ...
    use_tools=True,
    tools=tools,
)
Training a Cartridge
Quickstart: Take a look at the script at scripts/longhealth_train.py for an example of how to train a cartridge on data synthesized with self-study.
See cartridges.train.TrainConfig for the schema of the main config we use for training.
Below we provide an example config file, with inline comments describing the key parts of the config:
import os
from pathlib import Path

import pydrantic

from cartridges.initialization.strategies.first_n_tokens import KVCacheInitFromFirstNTokensOfContext
from cartridges.train import EvalDatasetConfig, GenerateDatasetConfig, TrainConfig
from cartridges.config import HFModelConfig
from cartridges.datasets import CartridgeTrainDataset
from cartridges.tasks.longhealth import LongHealthMultipleChoiceGenerateDataset
from cartridges.tasks.longhealth.context import LongHealthStructuredContextConfig
from cartridges.utils import WandBConfig
# Import path for the model class is an assumption; adjust to your checkout.
from cartridges.models import LlamaForCausalLM

file_name = Path(__file__).stem

# Which LongHealth patients to train on -- placeholder values; match these to the
# patients you synthesized data for.
patient_ids = ["patient_01", "patient_02"]
patients_str = "_".join(patient_ids)

config = TrainConfig(
    # The frozen base model whose KV cache (the cartridge) we are training.
    model=HFModelConfig(
        pretrained_model_name_or_path="meta-llama/Llama-3.2-3B-Instruct",
        model_cls=LlamaForCausalLM,
        attn_implementation="einsum",
    ),
    # Initialize the trainable KV cache from the first N tokens of the context.
    kv_cache_initializer=KVCacheInitFromFirstNTokensOfContext.Config(max_tokens=2048),
    lr=2e-2,
    loss_type="logits",  # train against logits (distillation) rather than hard token labels
    epochs=2,
    global_batch_size=32,  # placeholder value; adjust to your setup
    local_batch_size=4,
    use_batch_sampler=True,
    # The dataset produced by the synthesis script we ran above.
    dataset=CartridgeTrainDataset.Config(
        # path should point to the output of the synthesis script we ran above
        data_sources=[("/path/to/output/dir/artifact/dataset.pkl", None)],
        max_sequence_length=1024,
        is_wandb=True,
        label_type="logits",
        top_k_logits=20,
    ),
    context=LongHealthStructuredContextConfig(patient_ids=patient_ids),
    save_every_n_steps=512,
    # Periodically generate from the model during training to track quality.
    generate_every_n_steps=512,
    generate_max_new_tokens=512,
    generate_datasets=[
        GenerateDatasetConfig(
            dataset=LongHealthMultipleChoiceGenerateDataset.Config(
                patient_ids=patient_ids,
                cot=True,
            ),
            name_for_wandb="longhealth_mc",
            num_samples=8,
            num_samples_final=8,
            batch_size=16,
            temperature=0.3,
        )
    ],
    eval_every_n_steps=256,
    eval_datasets=[],
    distributed_backend="gloo",
    wandb=WandBConfig(
        project="cartridges",
        tags=["train", "longhealth", f"patients{patients_str}"],
        entity="hazy-research",
    ),
    output_dir=os.environ["CARTRIDGES_OUTPUT_DIR"],
    name="train-cartridges",
)

if __name__ == "__main__":
    pydrantic.main([config])
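As with the synthesis step, you can launch a single-process training run by executing the config script directly (the filename is a placeholder):

python your_train_script.py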
Distributed data parallel training
To launch a data parallel training run on a multi-GPU machine, you can use a standard PyTorch launcher.
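For example, with torchrun (the command below is illustrative; adjust the process count and script name to your setup):

# Illustrative command: 4 data-parallel workers on one machine.
torchrun --standalone --nproc_per_node=4 your_train_script.py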
Serving and Chatting with a Cartridge
We describe two ways to serve and chat with a trained Cartridge: a simple but slow way that uses a pure PyTorch generation loop, and a faster one that uses a Tokasaurus server.
Serving with Tokasaurus [Fastest and recommended]
We've implemented (h/t @geoffreyangus) an integration with Tokasaurus, a simple LLM inference server optimized for high throughput.
To run the Tokasaurus server, you will need to install Tokasaurus from source, switch to the geoff/cartridges branch, and then follow the instructions here to make API calls to the server.
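A from-source setup typically looks like the following sketch (the repository URL is a placeholder; clone Tokasaurus from its actual source location):

# Placeholder URL -- substitute the actual Tokasaurus repository.
git clone <tokasaurus-repo-url> tokasaurus
cd tokasaurus
git checkout geoff/cartridges
pip install -e .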
streamlit run cartridges/analysis/dashboards/chat_w_cache.py
Serving with Basic PyTorch [Easiest but slow]
streamlit run cartridges/analysis/dashboards/chat_w_cache.py
Acknowledgments and Citation
There are tons of people and organizations who have supported this project. Below we shout out a few, but check out the paper for a full list.
The compute for this project was provided by Modal — who made it super easy to scale out horizontally when running the synthetic data generation for self-study — and Together — who provided the compute for training the Cartridges on the synthetic data. Prime Intellect, Voltage Park, and Azure through the HAI Grants program also contributed compute towards this project.
@article{eyuboglu2025cartridges,
title={Cartridges: Lightweight and general-purpose long context representations via self-study},
author={Eyuboglu, Sabri and Ehrlich, Ryan and Arora, Simran and Guha, Neel and Zinsley, Dylan and Liu, Emily and Tennien, Will and Rudra, Atri and Zou, James and Mirhoseini, Azalia and others},
journal={arXiv preprint arXiv:2506.06266},
year={2025}
}