Data Reliability 101: A Practical Guide to Data Validation Using Pydantic in Data Science Projects
Last Updated on January 14, 2024 by Editorial Team
Author(s): Shivamshinde
Originally published on Towards AI.
This article explains why data validation is needed in Python code, how it's done using the Pydantic library, and how to integrate it into your data science projects.
Table of Contents
- Why is Data Validation Needed in Python?
- Pydantic Python Library
- Pydantic Components
– Models
– Fields
– Required, Optional, and Nullable Fields
– Field Validators
- Using Pydantic for Data Validation of Data in DataFrame Format
Why is Data Validation Needed in Python?
Python is a dynamically typed language: the datatype of a variable is determined by its value, so you don't have to declare a type when initializing a variable. The interpreter assigns types to variables at runtime.
This makes Python easy to start with. However, this approach has some disadvantages.
In Python, you can overwrite the value of a variable with another value of a different datatype.
a = 10     # initializing the variable (note that we have not declared its type)
a = "ten"  # rebinding the variable to the string value "ten"
This seems fine at this point, but it could unintentionally create issues later in the code.
Also, it is not easy to tell the datatype of a variable at first glance. This is particularly inconvenient in the case of functions.
Suppose we have a function named resize_image as follows:
def resize_image(image, dim):
    # ...
Now, just by looking at this function, we can't tell the datatype of the parameter dim. Additionally, if we assume dim is a list or tuple, another question arises: which dimension comes first (x or y)?
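Type annotations address this readability problem. Here is a minimal annotated sketch; the parameter types and the (width, height) ordering are assumptions for illustration, since the original function leaves both unspecified:

```python
from typing import Tuple

def resize_image(image: bytes, dim: Tuple[int, int]) -> bytes:
    """Resize `image` to `dim`, where dim is (width, height).

    The types and the (width, height) order are illustrative
    assumptions; the original function specified neither.
    """
    width, height = dim  # unambiguous thanks to the annotation and docstring
    ...  # actual resizing logic would go here
    return image
```

With the hint in place, a caller (and any static checker) can see at a glance what shape of argument dim expects.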
Another problem caused by dynamic typing is that it allows us to create objects with incorrect datatypes, and we won't know about the resulting errors until we actually use those objects. For example,
P1 = Person(name="Kelsier", age=24)
P2 = Person(name="Breeze", age="35")
Here, both objects will be created even though one of the age values is passed as a string. But again, this could create bugs in our code at a later stage.
As developers, we want to discover such errors as early as possible in the development life cycle.
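To make this concrete, here is a minimal sketch of such a Person class (a hypothetical plain class, not from any library) showing that no error occurs at construction time:

```python
class Person:
    """A plain class with no validation -- any type is accepted."""
    def __init__(self, name, age):
        self.name = name
        self.age = age

p1 = Person(name="Kelsier", age=24)
p2 = Person(name="Breeze", age="35")  # wrong type, but no error here

# The bug only surfaces later, e.g. when we do arithmetic on age:
# p2.age + 1 would raise TypeError, long after the object was created.
print(type(p1.age), type(p2.age))
```

Both objects are constructed without complaint; the mistake hides until the string age is actually used in a numeric context.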
Pydantic Python Library
Pydantic is a data validation library in Python. We can make use of Pydantic to validate the data types before using them in any kind of operation. This way, we can avoid potential bugs that are similar to the ones mentioned earlier.
As we will see next, the Pydantic library does more than just validate datatypes.
Pydantic Components
Models
One of the basic ways of using validations is via Models. Models are classes that inherit from the pydantic.BaseModel class and define their fields as annotated attributes.
When we pass in data that may contain mismatched datatypes, Pydantic parses and validates it and guarantees that instances of the resulting model conform to the datatypes declared in the model. If that is not possible, it throws a validation error.
Model Usage:
from pydantic import BaseModel

class User(BaseModel):
    id: int
    name: str = 'Jane Doe'

user = User(id='123')

assert user.id == 123
assert isinstance(user.id, int)
# Note that '123' was coerced to an int and its value is 123

user2 = User(id="onetwothree")
# Pydantic will throw a validation error because "onetwothree" cannot be
# converted to an int
In the above example, we created a class User with two attributes, id and name, and specified their datatypes. Now, when we pass the string value '123' to the id field while creating an object, Pydantic automatically converts '123' into 123 and assigns it to id. If such a conversion is not possible (as when we created the user2 object), it throws a validation error.
Notice that the name field has a default value of 'Jane Doe'.
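Because name has a default and id does not, id is required while name is optional. A short sketch based on the same User model:

```python
from pydantic import BaseModel, ValidationError

class User(BaseModel):
    id: int
    name: str = 'Jane Doe'

# name falls back to its default; id must be supplied
user = User(id=1)
assert user.name == 'Jane Doe'

# Omitting the required id field raises a validation error
try:
    User()
except ValidationError as e:
    print(e)  # reports that the 'id' field is missing
```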
Fields
The Field function is used to customize the validation of fields of the model.
1. Setting the default value of a field
We use the default keyword inside the Field function to give the field a default value.
from pydantic import BaseModel, Field

class User(BaseModel):
    name: str = Field(default='John Doe')

user = User()
print(user)
#> name='John Doe'
2. Adding numerical constraints to fields
We use the following keywords inside the Field function to add numerical constraints:
gt – greater than
lt – less than
ge – greater than or equal to
le – less than or equal to
multiple_of – a multiple of the given number
allow_inf_nan – allow 'inf', '-inf', and 'nan' values
from pydantic import BaseModel, Field

class Foo(BaseModel):
    positive: int = Field(gt=0)
    non_negative: int = Field(ge=0)
    negative: int = Field(lt=0)
    non_positive: int = Field(le=0)
    even: int = Field(multiple_of=2)
    love_for_pydantic: float = Field(allow_inf_nan=True)

foo = Foo(
    positive=1,
    non_negative=0,
    negative=-1,
    non_positive=0,
    even=2,
    love_for_pydantic=float('inf'),
)

print(foo)
"""
positive=1 non_negative=0 negative=-1 non_positive=0 even=2 love_for_pydantic=inf
"""
3. String constraints
We use the following keywords to constrain the strings:
min_length: minimum length of the string.
max_length: maximum length of the string.
pattern: a regular expression that the string must match.
from pydantic import BaseModel, Field

class Foo(BaseModel):
    short: str = Field(min_length=3)
    long: str = Field(max_length=10)
    regex: str = Field(pattern=r'^\d*$')

foo = Foo(short='foo', long='foobarbaz', regex='123')
print(foo)
#> short='foo' long='foobarbaz' regex='123'
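As with the numerical constraints, violating a string constraint raises a ValidationError, and each reported error records which field failed in its loc entry. A small sketch with a trimmed-down model:

```python
from pydantic import BaseModel, Field, ValidationError

class StrFoo(BaseModel):
    short: str = Field(min_length=3)
    regex: str = Field(pattern=r'^\d*$')

try:
    StrFoo(short='ab', regex='12a')  # too short; contains a non-digit
except ValidationError as e:
    # each error's 'loc' names the field that failed
    print(sorted(err['loc'][0] for err in e.errors()))  # ['regex', 'short']
```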
Required, Optional, and Nullable Fields
We can set constraints on fields to indicate whether they are required or optional, and whether they may be None.
We can use the following example as a guide when creating a class with these constraints on its fields:
from typing import Optional
from pydantic import BaseModel, ValidationError

class Foo(BaseModel):
    f1: str  # required, cannot be None
    f2: Optional[str]  # required, can be None - same as str | None
    f3: Optional[str] = None  # not required, can be None
    f4: str = 'Foobar'  # not required, but cannot be None

try:
    Foo(f1=None, f2=None, f4='b')
except ValidationError as e:
    print(e)
"""
1 validation error for Foo
f1
Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
"""
Here, the f1 field is required and cannot have a None value. However, while creating the object, f1 received a None value, and hence we get a validation error.
Field Validators
If we want to apply custom validation to any of our fields, we can do so by creating a method implementing the validation criteria and decorating it with the @field_validator decorator.
from pydantic import (
    BaseModel,
    ValidationError,
    ValidationInfo,
    field_validator,
)

class UserModel(BaseModel):
    id: int
    name: str

    @field_validator('name')
    @classmethod
    def name_must_contain_space(cls, v: str) -> str:
        if ' ' not in v:
            raise ValueError('must contain a space')
        return v.title()

    # you can select multiple fields, or use '*' to select all fields
    @field_validator('id', 'name')
    @classmethod
    def check_alphanumeric(cls, v: str, info: ValidationInfo) -> str:
        if isinstance(v, str):
            # info.field_name is the name of the field being validated
            is_alphanumeric = v.replace(' ', '').isalnum()
            assert is_alphanumeric, f'{info.field_name} must be alphanumeric'
        return v

print(UserModel(id=1, name='John Doe'))
#> id=1 name='John Doe'

try:
    UserModel(id=1, name='samuel')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
name
Value error, must contain a space [type=value_error, input_value='samuel', input_type=str]
"""

try:
    UserModel(id='abc', name='John Doe')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
id
Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='abc', input_type=str]
"""

try:
    UserModel(id=1, name='John Doe!')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
name
Assertion failed, name must be alphanumeric
assert False [type=assertion_error, input_value='John Doe!', input_type=str]
"""
Here, we have created two custom validations using the methods name_must_contain_space(…) and check_alphanumeric(…). The first applies validation to the name field, guaranteeing that it contains at least one space. The second checks whether a string field is alphanumeric.
We also created three objects in the code.
UserModel(id=1, name='samuel')
Here, we get a validation error from the name_must_contain_space() field validator, since the name field does not contain a space.
UserModel(id='abc', name='John Doe')
Here, we get a validation error because id is not of integer datatype and Pydantic is not able to convert it into one.
UserModel(id=1, name='John Doe!')
Here, we get a validation error from the check_alphanumeric(…) field validator: since the name field contains an exclamation mark (!), it is no longer an alphanumeric string.
Using Pydantic for Data Validation of Data in DataFrame Format
Now that we have seen how to use Pydantic to validate the fields of a class, let's extend this knowledge to our data science project.
We will use Pydantic validations to constrain data records in our dataframe. This ensures we donβt use any invalid data while building our machine-learning model.
For a demonstration, let's use the 'Thyroid Disease Dataset' from Kaggle. I will skip the exploratory data analysis and data preprocessing steps since they are outside the scope of this article. If you are interested in all the steps of the project, check out the complete code at the link below.
GitHub – shivamshinde123/Thyroid_Disease_Detection_Internship: Thyroid disease is a common cause of…
Thyroid disease is a common cause of medical diagnosis and prediction, with an onset that is difficult to forecast in…
github.com
Now let's see how to perform data validation on the dataframe.
from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional
import pandas as pd

class Dictvalidator(BaseModel):
    age: int = Field(gt=0, le=100)
    sex: Optional[str]
    on_thyroxine: Optional[str]
    query_on_thyroxine: Optional[str]
    on_antithyroid_meds: Optional[str]
    sick: Optional[str]
    pregnant: Optional[str]
    thyroid_surgery: Optional[str]
    I131_treatment: Optional[str]
    query_hypothyroid: Optional[str]
    query_hyperthyroid: Optional[str]
    lithium: Optional[str]
    goitre: Optional[str]
    tumor: Optional[str]
    hypopituitary: Optional[str]
    psych: Optional[str]
    TSH_measured: str
    TSH: Optional[float]
    T3_measured: str
    T3: Optional[float]
    TT4_measured: str
    TT4: Optional[float]
    T4U_measured: str
    T4U: Optional[float]
    FTI_measured: str
    FTI: Optional[float]
    TBG_measured: str
    TBG: Optional[float]
    referral_source: Optional[str]
    target: str
    patient_id: int

class dataframe_validator(BaseModel):
    df_dict: List[Dictvalidator]

if __name__ == '__main__':
    # raw_data_file_path is the path to the dataset CSV, defined elsewhere in the project
    df = pd.read_csv(raw_data_file_path)
    try:
        dataframe_validator(df_dict=df.to_dict(orient='records'))
    except ValidationError as e:
        raise e
First, we create a class named Dictvalidator with all the features of the dataframe as fields, adding whatever constraints we wish on those fields, as shown in the code above.
Next, we create another class named dataframe_validator with a single field that is a list of Dictvalidator instances. Now, when we create an instance of the dataframe_validator class and pass the dataframe records as a list of dictionaries, every record of the dataframe is validated.
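To see what a failure looks like without downloading the full Kaggle dataset, here is a self-contained sketch with a simplified two-column schema (the column names and constraints are illustrative, not the full thyroid schema). The list of dicts plays the role of df.to_dict(orient='records'):

```python
from typing import List, Optional
from pydantic import BaseModel, Field, ValidationError

# Simplified stand-in for the real schema: just two illustrative columns
class RecordValidator(BaseModel):
    age: int = Field(gt=0, le=100)
    sex: Optional[str] = None

class DataFrameValidator(BaseModel):
    records: List[RecordValidator]

# These dicts play the role of df.to_dict(orient='records')
rows = [
    {"age": 45, "sex": "F"},
    {"age": 172, "sex": "M"},  # violates the le=100 constraint
]

try:
    DataFrameValidator(records=rows)
except ValidationError as e:
    # The error location pinpoints the offending row index and column
    print(e.errors()[0]["loc"])  # ('records', 1, 'age')
```

The loc tuple identifies both the row index and the column name, which makes it easy to trace a bad record back to the dataframe.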
Also, there is another way of validating the pandas dataframe. We can use a Python library called Pandantic for this. You can check out the article by Wessel Huising on how to use this library for the validation of dataframes. The link for the article is given in the reference section.
Outro
Thank you so much for reading. If you liked this article, don't forget to press that clap icon as many times as you want. Keep following for more such articles!
Are you struggling to choose what to read next? Don't worry, I have got you covered.
A Step-by-Step Guide to Building an End-to-End Machine Learning Project
This article will show you the way you can create an end-to-end machine learning project.
ai.plainenglish.io
and one moreβ¦
From Raw to Refined: A Journey Through Data Preprocessing β Part 6: Imbalanced Datasets
This article will explain the concept of imbalanced datasets and the methods used to handle them.
pub.towardsai.net
References
Models
Data validation using Python type hints
docs.pydantic.dev
Validate Pandas DataFrames using Pydantic: pandantic
As I cycled through the beautiful centre of Amsterdam, I tuned in to the Python Bytes podcast. This time the hosts…
wesselhuising.medium.com
Published via Towards AI