You have a large JSON file, and you want to load the data into Pydantic. Unfortunately, loading it uses a lot of memory, to the point where sufficiently large JSON files become very difficult to process. What to do?
Assuming you’re stuck with JSON, in this article we’ll cover:
- The high memory usage you get with Pydantic’s default JSON loading.
- How to reduce memory usage by switching to another JSON library.
- Going further by switching to dataclasses with slots.
The problem: 20× memory multiplier
We’re going to start with a 100MB JSON file, and load it into Pydantic (v2.11.4). Here’s what our model looks like:
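For this article, assume each record is a customer with a name and some free-form notes; the exact fields are simplified for illustration and don’t matter for the memory measurements:

```python
from pydantic import BaseModel, RootModel

class Customer(BaseModel):
    name: str
    notes: str

# The top-level JSON is an object mapping customer IDs to records:
Directory = RootModel[dict[str, Customer]]
```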
The JSON we’re loading looks more or less like this:
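```json
{
  "customer_0": {"name": "Anne Example", "notes": "Likes tea."},
  "customer_1": {"name": "Basil Sample", "notes": "Prefers coffee."}
}
```

Except that at 100MB, the real file contains on the order of a million such entries.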
Pydantic has built-in support for loading JSON, though sadly it doesn’t support reading directly from a file. So we load the file into a string and then parse it:
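Something like this, with customers.json standing in for our 100MB file:

```python
def load_customers(path: str) -> Directory:
    # Read the whole file into memory as a string, then hand it
    # to Pydantic's built-in JSON validation:
    with open(path) as f:
        raw = f.read()
    return Directory.model_validate_json(raw)

customers = load_customers("customers.json")
```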
This is very straightforward.
But there’s a problem. If we measure peak memory usage, it’s using a lot of memory:
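On Linux, one way to check is the standard library’s resource module, which reports the process’s peak resident memory:

```python
import resource

customers = load_customers("customers.json")

# ru_maxrss is reported in kilobytes on Linux (bytes on macOS):
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"Peak memory usage: {peak / 1024:.0f} MB")
```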
That’s around 2000MB of memory, 20× the size of the JSON file. If our JSON file had been 10GB, memory usage would be 200GB, and we’d probably run out of memory. Can we do better?
Reducing memory usage
There are two fundamental sources of peak memory usage when parsing JSON:
- The memory used during parsing; many JSON parsers aren’t careful about memory usage, and use more than necessary.
- The memory used by the final representation, the objects we’re creating.
We’ll try to reduce memory usage in each.
1. Memory-efficient JSON parsing
We’ll use ijson, an incremental JSON parser that lets us stream the JSON document we’re parsing. Instead of loading the whole document into memory, we’ll load it one key/value pair at a time. The result is that most of the memory usage will now come from the in-memory representation of the resulting objects, rather than parsing:
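Here’s a sketch of that approach. ijson.kvitems() yields the top-level object’s key/value pairs one at a time, and we hand each value to Pydantic for validation:

```python
import ijson

def load_customers(path: str) -> dict[str, Customer]:
    customers = {}
    with open(path, "rb") as f:
        # Stream the root object's key/value pairs (prefix ""), so only
        # one entry's worth of raw parsed data is in memory at a time:
        for key, value in ijson.kvitems(f, ""):
            customers[key] = Customer.model_validate(value)
    return customers
```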
While parsing this way is significantly slower (5×), it cuts peak memory usage to just 1200MB.
It also requires us to do a bit more of the parsing work ourselves, but only at the top level: anything below the top-level JSON object or list can still be handled by Pydantic.
2. Memory-efficient representation
We’re creating a lot of Python objects, and one way to save memory on Python objects is to use “slots”. Essentially, slots are a more efficient in-memory representation for Python objects, where the list of possible attributes is fixed. This saves memory at the cost of not being able to add extra attributes to an object after creation; in practice that’s rarely needed, so it’s often a good tradeoff.
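As a quick illustration with a plain Python class:

```python
class Point:
    # Attributes live in fixed slots instead of a per-instance
    # __dict__, which is where the memory savings come from:
    __slots__ = ("x", "y")

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

p = Point(1.0, 2.0)
p.z = 3.0  # AttributeError: 'Point' object has no attribute 'z'
```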
Unfortunately, pydantic.BaseModel doesn’t seem to support that at the moment, so I switched to Pydantic’s dataclass support, which does. Here’s our new model:
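Keeping the same simplified fields as before (slots=True is passed through to the underlying dataclass machinery, and requires Python 3.10 or later):

```python
from pydantic.dataclasses import dataclass

@dataclass(slots=True)
class Customer:
    name: str
    notes: str
```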
And we also need to tweak our parsing code slightly:
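The only real change is how each record gets constructed; Pydantic dataclasses validate their fields on construction, so there’s no model_validate() call:

```python
import ijson

def load_customers(path: str) -> dict[str, Customer]:
    customers = {}
    with open(path, "rb") as f:
        for key, value in ijson.kvitems(f, ""):
            # Construction validates, just like BaseModel.model_validate():
            customers[key] = Customer(**value)
    return customers
```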
With this version of the code, memory usage has shrunk to 450MB.
Final thoughts
Here’s a summary of peak memory usage when parsing a 100MB JSON file with the three techniques we covered:
| Technique | Peak memory (MB) |
|---|---|
| Model.model_validate_json() | 2000 |
| ijson | 1200 |
| ijson + @dataclass(slots=True) | 450 |
This particular use case, of loading a large number of objects, may not be something Pydantic developers care about, or have the time to prioritize. But it would certainly be possible for Pydantic to internally work more like ijson, and to add the option for using __slots__ to BaseModel. The end result would use far less memory, while still benefiting from Pydantic’s faster JSON parser.
Until then, you have options you can implement yourself.