This post is an account of why I prefer using the attrs library over Pydantic. I'm writing it since I am often asked this question and I want to have something concrete to link to. This is not meant to be an objective comparison of attrs and Pydantic; I'm not interested in comparing bullet points of features, nor can I be unbiased since I'm a major contributor to attrs (at time of writing, second by commit count, after Hynek) and the author of one of its unofficial companion libraries, cattrs.
A small side note on a third contender, dataclasses. Dataclasses is a strict subset of attrs included with Python since version 3.7. There are three reasons to use dataclasses over attrs:
- You're writing code for the actual Python standard library, and hence cannot use third-party packages like attrs
- The code you're running has no access to third-party packages (i.e. you can't use pip)
- Since dataclasses are part of the standard library, some tools may have better support for them than for attrs. If this is a dealbreaker for you, you need to use dataclasses over attrs. This is a problem with the tool, though.
In all other cases you are better served by attrs, since it's faster, has strictly more features, releases them at a better cadence, and offers them over a wider range of Python versions. Migrating between dataclasses and attrs is very straightforward, since dataclasses are essentially a port of attrs. Just use attrs instead.
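To give a feel for how mechanical the migration is, here's a minimal sketch (the Point classes are made up for illustration; Python 3.9+ for the built-in generics):
from dataclasses import dataclass, field as dc_field

from attr import define, field

@dataclass
class PointDC:
    x: int
    tags: list[str] = dc_field(default_factory=list)

# The attrs equivalent: swap the decorator and the field helper.
@define
class PointAttrs:
    x: int
    tags: list[str] = field(factory=list)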
Note: as of the time of writing the current version of attrs is 21.2.0, cattrs 1.8.0, and Pydantic 1.8.2.
The Problem Description
Both attrs and Pydantic are libraries that make writing classes significantly easier. Here's a class written both ways:
from attr import define
from pydantic import BaseModel
@define
class AttrsPrimitives:
a: bytes
b: str
c: int
d: float
e: bool
class PydanticPrimitives(BaseModel):
a: bytes
b: str
c: int
d: float
e: bool
On the face of it, the approaches look very similar with the exception of attrs using a class decorator versus Pydantic using inheritance. The class decorator approach is superior for a number of reasons, one of them being that you don't have to be mindful of accidentally overwriting methods from a base class. (Pydantic also has a class decorator mode, but it has a different feature set and the docs steer you towards using BaseModel anyway, so we'll ignore it.)
Case in point: Pydantic reserved fields
Your Pydantic model cannot have a field named json, since that is one of the methods on the BaseModel. Can you remember all the others? And better hope no more methods get added in the future. attrs only needs one field, __attrs_attrs__, which you are unlikely to use by accident.
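To illustrate, with attrs a field named json is just another attribute (a minimal sketch; the Report class is made up):
from attr import define

@define
class Report:
    # attrs doesn't reserve "json", so this is a perfectly ordinary field.
    json: str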
What gets generated under the hood is very different, though. For the attrs example, it's what you'd have written yourself:
def __init__(self, a, b, c, d, e):
self.a = a
self.b = b
self.c = c
self.d = d
self.e = e
For Pydantic though, the generated code is a little different:
def __init__(__pydantic_self__, **data: Any) -> None:
"""
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be parsed to form a valid model.
"""
# Uses something other than `self` the first arg to allow "self" as a settable attribute
values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
if validation_error:
raise validation_error
try:
object_setattr(__pydantic_self__, '__dict__', values)
except TypeError as e:
raise TypeError(
'Model values must be a dict; you may not have returned a dictionary from a root validator'
) from e
object_setattr(__pydantic_self__, '__fields_set__', fields_set)
__pydantic_self__._init_private_attributes()
attrs is a library for generating the boring parts of writing classes; Pydantic is that but also a complex validation library.
So now you have a class to model a piece of data and you want to store it somewhere, or send it somewhere. For purposes of this article, let's assume you want to convert it to json.
attrs provides an attrs.asdict helper function that can take an instance and turn it into a dictionary to be passed to a json library, but really you should be using cattrs instead. cattrs can create a converter especially tailored for json:
>>> from json import dumps
>>> from cattr.preconf.json import make_converter
>>> c = make_converter()
>>> dumps(c.unstructure(AttrsPrimitives(a=b'', b='', c=1, d=1.0, e=True)))
'{"a": "", "b": "", "c": 1, "d": 1.0, "e": true}'
For Pydantic, this functionality is part of the BaseModel:
>>> PydanticPrimitives(a=b'', b='', c=1, d=1.0, e=True).json()
'{"a": "", "b": "", "c": 1, "d": 1.0, "e": true}'
attrs is a library for automating the boring parts of writing classes; Pydantic is that but also a complex validation library and also a structuring/unstructuring library.
So to spell out the two problems I'm hinting at: Pydantic does a lot, and I prefer a component-based approach where separate components piece together. Pydantic is very opinionated about the things it does, and I simply disagree with a lot of its opinions.
Validation
In attrs, validation is opt-in - you define your validators (there are some included in the attr.validators package), and you pass them in when you define the class. In Pydantic, validation is opt-out - there's a BaseModel.construct classmethod that can bypass it.
Validation is actually a complex topic. For example, if you annotate a field as list[int], and then pass in a list of 10000 integers, do you expect the validation to only check if the argument is a list, or do a for loop with an isinstance check on each and every element (which is what Pydantic does by default)? What is the performance cost of actually doing this? Do you expect the validator to also run when you set the attribute, or just in the __init__ (again, the Pydantic default)? If you actually configure Pydantic to validate everything, and you change element #500 in the list to a string, will that be caught (it won't)?
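For contrast, here's roughly what explicitly opting into that kind of deep validation looks like with attrs (a sketch; the Numbers class is made up, and deep_iterable is one of the bundled validators):
from attr import define, field
from attr.validators import deep_iterable, instance_of

@define
class Numbers:
    # Opting in explicitly: validate the container and each element.
    values: list[int] = field(
        validator=deep_iterable(
            member_validator=instance_of(int),
            iterable_validator=instance_of(list),
        )
    )
With @define, validators also run on attribute assignment by default, so the setattr question gets an explicit answer as well.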
Pydantic has a ton of validation logic magically baked in, whereas attrs gives you the tools to set it up yourself. Hence, fewer surprises. Take this example from the Pydantic docs:
class Foo(BaseModel):
count: int
size: float = None
This will happily validate an instance of Foo with the (float) size field set to None. I don't really understand how this checks out.
Conversion
Conversion is the process of passing in an argument to __init__ that gets converted to something else before being set on the class. In attrs, conversion is opt-in - you define your converters (there are some included in the attr.converters package) and you pass them in when you define the class. Pydantic generates converters automatically.
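For reference, an opt-in attrs converter looks roughly like this (a sketch; AttrsEvent and to_datetime are made up, and fromisoformat is just one rule you might choose):
from datetime import datetime

from attr import define, field

def to_datetime(value):
    # An explicit, narrow rule: accept datetimes as-is, parse ISO 8601 strings, reject the rest.
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)

@define
class AttrsEvent:
    at: datetime = field(converter=to_datetime)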
Conversion is actually a complex topic (noticing a pattern here?). For example, for a datetime.datetime field, Pydantic will generate a converter that can take an existing datetime, a string in something like ISO 8601 format, or an int or float (interpreted as UNIX time, in seconds or milliseconds depending on the actual value). This converter gets called any time the class is instantiated.
My problems with this are as follows.
- The rules are very arbitrary
With attrs, you define the converter yourself, or it doesn't exist. If you define it, you choose how it works exactly.
- The rules are defined inside Pydantic, which hurts composability
I guess the idea here is twofold:
- allow the users to more easily create instances of models with dates by passing in a string instead of an actual datetime instance
I disagree with this - the logic for creating datetimes should be in the datetime class, not in a Pydantic converter. This is the composability argument.
- it allows automatic de/serialization to formats that don't support datetimes natively
I disagree with this. Un/structuring should be handled independently of the model.
As a sidenote, consider the following example. If you wanted to use the DateTime class from Pendulum (probably the best Python datetime class), you'd model it like this:
from pendulum import DateTime
class PydanticPendulum(BaseModel):
a: DateTime
But since pendulum.DateTime is a subclass of datetime.datetime, Pydantic will handle it as a datetime.datetime, which is wrong.
>>> PydanticPendulum(a='2021-05-25T00:00:00')
PydanticPendulum(a=datetime.datetime(2021, 5, 25, 0, 0)) # Whoops, wrong class
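The c/attrs way around this is to register a hook for the exact class you care about (a sketch assuming the pendulum package; the hook body is my choice):
import pendulum
from cattr import Converter

converter = Converter()
# The hook is registered for pendulum.DateTime specifically, so the
# datetime.datetime subclass relationship never gets a say.
converter.register_structure_hook(
    pendulum.DateTime, lambda value, _: pendulum.parse(value)
)

converter.structure("2021-05-25T00:00:00", pendulum.DateTime)
# -> a pendulum.DateTime for 2021-05-25, not a plain datetime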
- There's a performance penalty on each instance creation
Since Pydantic needs to run this converter on each instance creation even if you're not using it (i.e. you're only ever passing in datetimes), creating instances of Pydantic classes is much slower than of attrs classes. Comparing two very similar models:
from datetime import datetime
from attr import define, field
from attr.validators import instance_of

@define
class AttrsDatetime:
    a: datetime = field(validator=instance_of(datetime))

class PydanticDatetime(BaseModel):
    a: datetime
Instantiating these classes on my machine, attrs takes 953 ns +- 20 ns, while Pydantic takes 3.06 us +- 0.07 us, so around ~3x slower. What's worse, if we adopt a better approach - using Mypy to do the validation statically - the attrs class can drop the validator and go down to 387 ns +- 11 ns, while Pydantic needs to switch to using PydanticDatetime.construct (awkward to use), which still takes ~1.36 us, so again ~3.5x slower.
Un/structuring
As mentioned before, I have written a companion library called cattrs to ensure the attrs ecosystem has a good library for structuring and unstructuring data. You can read more about the case for cattrs over here.
Some of the fundamental ideas of cattrs are:
- the un/structuring logic should be separate from the model, since the relationship isn't 1:1, but rather 1:N (a single model may have many ways of being un/structured).
cattrs has the concept of a converter, which is an object containing logic for un/structuring. You can have as many converters as you need ways of un/structuring. For example, you can have a separate converter for ujson, and a different one for msgpack; in fact, this is what you need since msgpack supports bytes natively, and it'd be a waste to encode them into base64/85 for this format.
Case in point: customizing datetime representation
Suppose you'd like to be able to choose whether to unstructure your datetimes as ISO 8601 strings or Unix timestamps. The cattrs way would be to have two converters:
from datetime import datetime
from cattr import Converter
iso_converter = Converter()
iso_converter.register_unstructure_hook(datetime, lambda d: d.isoformat())
unix_converter = Converter()
unix_converter.register_unstructure_hook(datetime, lambda d: d.timestamp())
With Pydantic, you need to override the json_encoders field in the model config. This approach is json-specific (so no msgpack, bson, yaml, toml...), doesn't work with libraries like ujson, and is tied to the model, so you can't pick and choose strategies on the fly.
Case in point: Msgpack
It's difficult to, for example, dump Pydantic classes with datetimes into formats that don't support them natively, like msgpack, because you can't really customize the unstructuring rules. The Pydantic docs mention this might be a problem with ujson, and suggest you try using a different library instead.
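With c/attrs this is just another converter (a sketch assuming the msgpack package and cattrs' preconfigured msgpack converter; the Event class and the timestamp rule are my own choices):
from datetime import datetime, timezone

import msgpack
from attr import define
from cattr.preconf.msgpack import make_converter

@define
class Event:
    name: str
    at: datetime

converter = make_converter()
# msgpack has no native datetime here, so unstructure datetimes to Unix timestamps.
converter.register_unstructure_hook(datetime, lambda d: d.timestamp())

payload = msgpack.packb(
    converter.unstructure(Event("deploy", datetime(2021, 5, 25, tzinfo=timezone.utc)))
)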
- the un/structuring logic should be a separate code path from operations done on the model in code.
Going back to the question of validation: assuming your model has a list containing 10000 integers, the validation question has a better answer. In code, you should use Mypy (ideally, or tests, or your IDE) to ensure you don't make element #500 a string. When loading data from outside, cattrs will iterate over the array and make sure all elements are integers. This approach ensures the maximum possible efficiency (pay the price but only when it's the only way).
Performance aside, the ergonomics are completely different. Because Pydantic supports both simultaneously, the following is possible:
class PydanticInner(BaseModel):
a: int
class PydanticOuter(BaseModel):
a: PydanticInner
>>> PydanticOuter(a={"a": 1})
PydanticOuter(a=PydanticInner(a=1))
Even disregarding the fact this is a nightmare to typecheck, it simply doesn't fit my brain.
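For completeness, the c/attrs version keeps the two paths visibly separate (a sketch; the class names just mirror the Pydantic example above):
from attr import define
from cattr import structure

@define
class AttrsInner:
    a: int

@define
class AttrsOuter:
    a: AttrsInner

# In code: plain constructors, which Mypy can check.
obj = AttrsOuter(AttrsInner(1))

# At the boundary: explicit structuring of untyped data.
obj_from_json = structure({"a": {"a": 1}}, AttrsOuter)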
Performance
Despite their docs claiming Pydantic is the fastest library, it's very simple to prove otherwise. I've already compared it to attrs in the last chapter, and we can quickly benchmark it against cattrs here. We can reuse the PydanticPrimitives and AttrsPrimitives classes from the beginning of the article, and add a simple class to contain some instances of these.
@define
class AttrsModel:
one: AttrsPrimitives
two: AttrsPrimitives
three: AttrsPrimitives
class PydanticModel(BaseModel):
one: PydanticPrimitives
two: PydanticPrimitives
three: PydanticPrimitives
Now we can benchmark. First, dumping to JSON. We can use the standard library JSON module in both cases, even though the c/attrs approach can use ujson more readily than Pydantic.
$ pyperf timeit -g -s "from v import AttrsModel, AttrsPrimitives; from cattr.preconf.json import make_converter; from json import dumps; i = AttrsPrimitives(b'0101', 'a str', 1, 1.0, True); m = AttrsModel(i, i, i); c = make_converter()" "dumps(c.unstructure(m))"
Mean +- std dev: 18.5 us +- 0.5 us
$ pyperf timeit -g -s "from v import PydanticModel, PydanticPrimitives; from json import dumps; i = PydanticPrimitives(a=b'0101', b='a str', c=1, d=1.0, e=True); m = PydanticModel(one=i, two=i, three=i)" "m.json()"
Mean +- std dev: 77.7 us +- 2.1 us
Even though Pydantic uses a naive way of encoding bytes, and the attrs example defaults to base85, c/attrs still beats it by a factor of 4.
Now, loading:
$ pyperf timeit -g -s "from v import AttrsModel, AttrsPrimitives; from cattr.preconf.json import make_converter; from json import dumps, loads; i = AttrsPrimitives(b'0101', 'a str', 1, 1.0, True); m = AttrsModel(i, i, i); c = make_converter(); r = dumps(c.unstructure(m))" "c.structure(loads(r), AttrsModel)"
Mean +- std dev: 22.9 us +- 0.4 us
$ pyperf timeit -g -s "from v import PydanticModel, PydanticPrimitives; from json import dumps, loads; i = PydanticPrimitives(a=b'0101', b='a str', c=1, d=1.0, e=True); m = PydanticModel(one=i, two=i, three=i); r = m.json()" "PydanticModel.parse_raw(r)"
Mean +- std dev: 52.6 us +- 2.9 us
c/attrs is twice as fast.
Validation
One area where Pydantic has an edge over c/attrs is in generating validation errors.
Reusing the AttrsPrimitives and PydanticPrimitives models:
import traceback

import cattr

try:
    cattr.structure({"a": b"", "b": "", "c": "a", "d": 1.0, "e": True}, AttrsPrimitives)
except Exception:
    traceback.print_exc()
Traceback (most recent call last):
File "...", line 27, in <module>
structure(
File "cattr/converters.py", line 223, in structure
return self._structure_func.dispatch(cl)(obj, cl)
File "<cattrs generated structure __main__.AttrsPrimitives>", line 5, in structure_AttrsPrimitives
'c': structure_c(o['c'], type_c),
File "/home/tin/pg/cattrs/src/cattr/converters.py", line 314, in _structure_call
return cl(obj)
ValueError: invalid literal for int() with base 10: 'a'
It's probably good enough while developing (from the stack trace you can see the issue was structuring the c field of the AttrsPrimitives class), but not good enough for generating a complex error message to return as a response.
The Pydantic approach:
from traceback import print_exc

try:
    PydanticPrimitives(**{"a": b"", "b": "", "c": "a", "d": 1.0, "e": True})
except Exception:
    print_exc()
Traceback (most recent call last):
File "/home/tin/pg/cattrs/a02.py", line 35, in <module>
PydanticPrimitives(**{"a": b"", "b": "", "c": "a", "d": 1.0, "e": True})
File "pydantic/main.py", line 406, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for PydanticPrimitives
c
value is not a valid integer (type=type_error.integer)
The exception itself also contains metadata about where exactly the error occurred, and so is much more useful for letting the caller know exactly where validation failed.
This is motivated by performance. cattrs is optimized for the happy path, for use in systems where more than ~99.9% of un/structuring operations succeed, so it doesn't want to sacrifice speed to accommodate the error paths. This is likely to change with Python 3.11 and the introduction of zero-cost exception handling, but not before.
I've considered adding functions to cattrs for generating validation metadata from stack traces, which would possibly provide the best of both worlds: precise error metadata and top-class performance. However, I believe the approach would ultimately be inadequate. Consider the use case of a list with a large number of integers; it'd be impossible to extract the index of the invalid element from the stack trace. So we wait. (Pydantic supports this use case.)
Wrapping up
If there's one takeaway I would like you to walk away with, it's this: the problem with Pydantic is that it makes things that should be hard (since they're complex and you need to be careful, and the library should make you choose exactly what to do) appear easy, and things that should be easy, frustratingly hard.
Additional Nitpicks
- Pydantic doesn't support positional arguments
- Pydantic generates a weird __str__, without the class name
- Pydantic doesn't support slot classes
- Pydantic doesn't support {collections, typing}.Counters
s - Pydantic's strategy for structuring unions is very naive and cannot be easily customized
- Pydantic's support for customizing un/structuring is weak, leading to issues like this one, about adding base64 support, lingering. In cattrs this is two lines of code (see the sketch below). The suggested solution in the issue is adding a special type (Base64) for this, which is surprising to me
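For reference, here's roughly what those two lines look like with cattrs (a sketch; the exact base64 flavour is my choice):
from base64 import b64decode, b64encode

from cattr import Converter

converter = Converter()
# Unstructure bytes to base64 text, and structure them back.
converter.register_unstructure_hook(bytes, lambda b: b64encode(b).decode("ascii"))
converter.register_structure_hook(bytes, lambda v, _: b64decode(v))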