Designing with Dataclasses


Python dictionaries are available without an import and extremely flexible, which means many Python programmers default to representing data as a dict. However, dataclasses are often more appropriate. Here is when you should use a dataclass instead, and how to decide between the two.

Note: I’m using dataclass in this article since it’s in the standard library. If you’re already using a 3rd-party library like attrs to define record-like classes, the advice here still applies, just replace dataclass with the library you’re using.

If you’re already familiar with dataclasses, you can skip this section.

dataclass is a class decorator which automatically generates magic methods like __init__ and __eq__, making for more concise class definitions. For instance, this class declaration:

```python
class Order:
    def __init__(self, item_id: str, customer_id: str, amount: int):
        self.item_id = item_id
        self.customer_id = customer_id
        self.amount = amount

    def __eq__(self, other):
        return (
            self.item_id == other.item_id
            and self.customer_id == other.customer_id
            and self.amount == other.amount
        )
```

can be replaced with:

```python
from dataclasses import dataclass

@dataclass
class Order:
    item_id: str
    customer_id: str
    amount: int
```

Readability

A dataclass can be more readable than a dict. When you see a dataclass like Order, you know just by glancing at its definition which fields it contains 1. On the other hand, items can be added or removed from a dict at various points in the code, which means you have to read through much more code to know the shape of the data. While this can be avoided with discipline (for instance, you can avoid inserting new items into a dict after it’s instantiated), dataclasses help enforce this discipline automatically.
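For instance, declaring the same Order frozen makes any post-construction mutation an error (a sketch; `frozen=True` is optional and not used elsewhere in this article, and the field values are made up):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    item_id: str
    customer_id: str
    amount: int

order = Order(item_id="i1435", customer_id="c42", amount=10)

# Assigning to any attribute of a frozen instance, existing or new,
# raises FrozenInstanceError, so the shape of the data is fixed
# at the definition site.
try:
    order.discount = 5
except Exception as e:
    print(type(e).__name__)  # FrozenInstanceError
```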

Error checking & debugging

Representing data as a dataclass can make debugging a lot faster. For instance, using the same Order class as before, if you forgot to provide customer_id when instantiating,

```python
order = Order(item_id="i1435", amount=10)
```

it raises

```
----> 1 Order(item_id="i1435", amount=10)
TypeError: Order.__init__() missing 1 required positional argument: 'customer_id'
```

with the exact line where you forgot to provide the customer_id. However, representing the same data as a dict,

```python
order = {
    "item_id": "i1435",
    "amount": 10,
}
```

does not raise an error. If the "customer_id" were accessed somewhere downstream,

```python
customer = order["customer_id"]
```

raises KeyError: 'customer_id' and you’re left backtracking through the code to find where you forgot to add 'customer_id'.

Dataclasses also work well with type checkers like mypy. Since they encourage annotating each field with types, code using dataclasses can be type checked with very little extra effort.
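For example, a call site that passes the wrong type is caught statically (a sketch; the commented-out line is what mypy would flag, and the field values are made up):

```python
from dataclasses import dataclass, fields

@dataclass
class Order:
    item_id: str
    customer_id: str
    amount: int

# mypy flags this line with something like:
#   Argument "amount" to "Order" has incompatible type "str"; expected "int"
# order = Order(item_id="i1435", customer_id="c42", amount="ten")

# The field definitions are also available at runtime via dataclasses.fields():
print([f.name for f in fields(Order)])  # ['item_id', 'customer_id', 'amount']
```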

Dataclasses are useful when the names of the items in your data container are known ahead of time. Here are some heuristics to help you decide if you should use a dict or a dataclass:

  1. Are item names hardcoded (e.g. you have code that looks like order["item_id"])? Use a dataclass, which enforces the presence of these names.
  2. Do you need to loop over item names or dynamically add or remove items? Use a dict.
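As a quick illustration of the second heuristic (a made-up example, separate from the script below): counting word frequencies needs keys that come from the data itself, so a dict is the right container:

```python
def word_counts(text: str) -> dict[str, int]:
    # Keys are not known ahead of time; they come from the input,
    # so a dict (not a dataclass) is appropriate here.
    counts: dict[str, int] = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

print(word_counts("to be or not to be"))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```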

Let’s see how these heuristics apply in a larger program. This script uploads a directory of text files to object storage (here, S3). Each file’s object key will be {id}/{session_name}_{started_at}, and the metadata used to derive this key is stored on the first line of each file in this format:

```
# id=53,started_at=2021-01-02T11:30:00Z,session_name=daring_foolion
```
```python
import os

import boto3


def upload_directory(directory, s3_bucket):
    headers_by_file = _get_headers(directory)
    metadata_by_file = _parse_headers(headers_by_file)
    s3_key_by_file = _build_s3_keys(metadata_by_file)
    _upload_to_s3(s3_bucket, s3_key_by_file)


def _get_headers(directory):
    headers = {}  # (1)
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        with open(file_path, "r") as f:
            headers[file_path] = f.readline()
    return headers


def _parse_headers(headers):
    metadata_by_file = {}
    for file_path, header in headers.items():
        header = header.removeprefix("# ")
        pairs = header.split(",")
        metadata = {}  # (2)
        for key_value in pairs:
            key, value = key_value.split("=")
            metadata[key] = value
        metadata_by_file[file_path] = metadata
    return metadata_by_file


def _build_s3_keys(metadata_by_file):
    object_keys = {}
    for filepath, metadata in metadata_by_file.items():
        recorder = metadata["id"]  # (3)
        started_at = metadata["started_at"]
        session_name = metadata["session_name"]
        object_keys[filepath] = f"{recorder}/{session_name}_{started_at}"
    return object_keys


def _upload_to_s3(s3_bucket, s3_key_by_file):
    s3_client = boto3.client("s3")
    for filepath, s3_key in s3_key_by_file.items():
        s3_client.upload_file(filepath, s3_bucket, s3_key)
```

The dict of headers in (1) is appropriate: we don’t access or set any of its items through hard-coded key names. However, the dict in (2) fails the first heuristic: its items are accessed through hard-coded key names downstream in _build_s3_keys() (3).

Here is the same script after re-writing (2) to use a dataclass.

```python
import os
from dataclasses import dataclass

import boto3


def upload_directory(directory, s3_bucket):
    headers_by_file = _get_headers(directory)
    metadata_by_file = _parse_headers(headers_by_file)
    s3_key_by_file = _build_s3_keys(metadata_by_file)
    _upload_to_s3(s3_bucket, s3_key_by_file)


def _get_headers(directory):
    headers = {}
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        with open(file_path, "r") as f:
            headers[file_path] = f.readline()
    return headers


@dataclass
class RecordingMetadata:
    recorder_id: int
    started_at: str
    session_name: str


def _parse_headers(headers_by_file):
    metadata_by_file = {}
    for file_path, header in headers_by_file.items():
        header = header.removeprefix("# ")
        pairs = header.split(",")
        metadata = {}  # (2)
        for key_value in pairs:
            key, value = key_value.split("=")
            metadata[key] = value
        metadata_by_file[file_path] = RecordingMetadata(
            recorder_id=int(metadata["id"]),
            started_at=metadata["started_at"],
            session_name=metadata["session_name"],
        )
    return metadata_by_file


def _build_s3_keys(metadata_by_file):
    object_keys = {}
    for filepath, metadata in metadata_by_file.items():
        object_keys[filepath] = (
            f"{metadata.recorder_id}/{metadata.session_name}_{metadata.started_at}"
        )
    return object_keys


def _upload_to_s3(s3_bucket, s3_key_by_file):
    s3_client = boto3.client("s3")
    for filepath, s3_key in s3_key_by_file.items():
        s3_client.upload_file(filepath, s3_bucket, s3_key)
```

The readability benefits are more obvious when you use type hints:

```python
import os
from dataclasses import dataclass

import boto3


def upload_directory(directory: os.PathLike, s3_bucket: str):
    headers_by_file = _get_headers(directory)
    metadata_by_file = _parse_headers(headers_by_file)
    s3_key_by_file = _build_s3_keys(metadata_by_file)
    _upload_to_s3(s3_bucket, s3_key_by_file)


@dataclass
class RecordingMetadata:
    recorder_id: int
    started_at: str
    session_name: str


def _get_headers(directory: os.PathLike) -> dict[str, str]:
    headers = {}
    for file_name in os.listdir(directory):
        file_path = os.path.join(directory, file_name)
        with open(file_path, "r") as f:
            headers[file_path] = f.readline()
    return headers


def _parse_headers(headers: dict[str, str]) -> dict[str, RecordingMetadata]:
    metadata_by_file = {}
    for file_path, header in headers.items():
        header = header.removeprefix("# ")
        pairs = header.split(",")
        metadata = {}
        for key_value in pairs:
            key, value = key_value.split("=")
            metadata[key] = value
        metadata_by_file[file_path] = RecordingMetadata(
            recorder_id=int(metadata["id"]),
            started_at=metadata["started_at"],
            session_name=metadata["session_name"],
        )
    return metadata_by_file


def _build_s3_keys(metadata_by_file: dict[str, RecordingMetadata]) -> dict[str, str]:
    object_keys = {}
    for filepath, metadata in metadata_by_file.items():
        object_keys[filepath] = (
            f"{metadata.recorder_id}/{metadata.session_name}_{metadata.started_at}"
        )
    return object_keys


def _upload_to_s3(s3_bucket: str, s3_key_by_file: dict[str, str]):
    s3_client = boto3.client("s3")
    for filepath, s3_key in s3_key_by_file.items():
        s3_client.upload_file(filepath, s3_bucket, s3_key)
```

These aren’t hard rules: in some cases it’s best to ignore them.

One instance is calling functions that take or return a dict. This is common when serializing or deserializing data, as with the standard library’s json module. If you’re building the data in the same function where it’s used, it’s fine to just use a dict, even with hard-coded keys.
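Even then, a dataclass can live on either side of the serialization boundary via `dataclasses.asdict` (a sketch reusing the Order class from earlier; the field values are made up):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Order:
    item_id: str
    customer_id: str
    amount: int

order = Order(item_id="i1435", customer_id="c42", amount=10)

# Convert to a plain dict only at the serialization boundary.
payload = json.dumps(asdict(order))
print(payload)

# Deserialize by unpacking the parsed dict back into the constructor.
restored = Order(**json.loads(payload))
print(restored == order)  # True
```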

Another good reason is performance. While accessing a dataclass attribute is only slightly slower than accessing a key in a dict, instantiating a dataclass is ~5x slower than creating a dict 2. So, if you’re instantiating tens of thousands of dataclasses and you’ve determined it’s a bottleneck, you can use dicts instead.
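A quick way to check this on your own machine (exact ratios vary by Python version and hardware, so treat the numbers as indicative only):

```python
import timeit
from dataclasses import dataclass

@dataclass
class Order:
    item_id: str
    customer_id: str
    amount: int

# Time 100,000 instantiations of each representation.
dc_time = timeit.timeit(
    lambda: Order(item_id="i1435", customer_id="c42", amount=10), number=100_000
)
dict_time = timeit.timeit(
    lambda: {"item_id": "i1435", "customer_id": "c42", "amount": 10}, number=100_000
)
print(f"dataclass: {dc_time:.3f}s  dict: {dict_time:.3f}s")
```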

In both of these cases, if you’re using a type checker like mypy, you can annotate your code with TypedDicts to regain some readability and error checking.
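A minimal sketch of the dict version of Order as a TypedDict (field names taken from the earlier example, values made up):

```python
from typing import TypedDict

class OrderDict(TypedDict):
    item_id: str
    customer_id: str
    amount: int

order: OrderDict = {"item_id": "i1435", "customer_id": "c42", "amount": 10}

# mypy now knows the exact keys and value types, so a typo like
# order["customer"] or a missing key is flagged statically, while
# the runtime object remains an ordinary dict.
print(order["customer_id"])  # c42
```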
