

Data Reliability 101: A Practical Guide to Data Validation Using Pydantic in Data Science Projects

Last Updated on January 14, 2024 by Editorial Team

Author(s): Shivamshinde

Originally published on Towards AI.

This article will explain why data validation is needed in Python code, how it's done using the Pydantic library, and how to integrate it into your data science projects.

Photo by Anthony Bressy on Unsplash

Table of Contents

  1. Why is Data Validation Needed in Python?
  2. Pydantic Python Library
  3. Pydantic Components
    Models
    Fields
    Required, Optional, and Nullable Fields
    Field Validators
  4. Using Pydantic for Data Validation of Data in DataFrame Format

Why is Data Validation Needed in Python?

Python is a dynamically typed language. That means the datatype of a variable is determined by its value; you don't have to declare a type when initializing a variable. The interpreter assigns types to variables at runtime.

This makes Python easy to start with. However, this approach has some disadvantages.

In Python, you can reassign a variable to a value of a different datatype.

a = 10     # Initializing the variable (note that we have not mentioned its type)
a = "ten"  # Rebinding the variable to a string value, without any complaint

This seems fine at this point, but it could unintentionally create issues later in the code.

Also, it is not easy to understand the datatype of a variable at first glance. This is particularly inconvenient in the case of functions.

Suppose we have a function named resize_image as follows:

def resize_image(image, dim):
    # ...

Now, just by looking at this function, we can't tell the datatype of the parameter named dim. Additionally, even if we assume dim is a list or a tuple, another question arises: which dimension comes first, x or y?
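Type annotations remove this ambiguity at a glance. Here is a minimal annotated sketch; the naive nearest-neighbor body and the (width, height) ordering of dim are assumptions for illustration only:

```python
def resize_image(image: list[list[int]], dim: tuple[int, int]) -> list[list[int]]:
    """Resize a 2D pixel grid to `dim`, given as (width, height)."""
    width, height = dim
    src_h, src_w = len(image), len(image[0])
    # Naive nearest-neighbor sampling: map each output pixel back to a source pixel.
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]
```

The annotations alone now answer both questions: dim is a tuple of exactly two ints, and the docstring pins down which dimension comes first.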

Another problem dynamic typing causes is that it allows creating objects with incorrect datatypes, and we won't know about the errors this could create until we use those objects. For example,

P1 = Person(name="Kelsier", age=24)
P2 = Person(name="Breeze", age="35")

Here, both objects will be created even though one of the age values is passed as a string. Again, this could create bugs later in the code.

The developer would want to know about the errors as early as possible in their development life cycle.
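The Person class above is not shown in full; a minimal sketch using a plain dataclass (a stand-in assumption, not the author's actual definition) demonstrates why annotations alone don't catch the mistake:

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int  # an annotation only; not enforced at runtime


# Both objects are created without complaint, even though "35" is a string.
p1 = Person(name="Kelsier", age=24)
p2 = Person(name="Breeze", age="35")

print(type(p1.age).__name__)  # int
print(type(p2.age).__name__)  # str
```

The bad value sits silently inside p2 until some later operation (say, p2.age + 1) fails far away from where the mistake was actually made.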

Pydantic Python Library

Pydantic is a data validation library in Python. We can make use of Pydantic to validate the data types before using them in any kind of operation. This way, we can avoid potential bugs that are similar to the ones mentioned earlier.

The Pydantic library does more than just validate datatypes, as we will see next.

Pydantic Components

Models

One of the basic ways of using validations is via models. Models are classes that inherit from pydantic.BaseModel and define fields as annotated attributes.

When we pass data that may contain mismatched datatypes, Pydantic parses and validates it and guarantees that instances of the resulting model conform to the datatypes declared in the model. If that is not possible, it throws a validation error.

Model Usage:

from pydantic import BaseModel


class User(BaseModel):
    id: int
    name: str = 'Jane Doe'

user = User(id='123')

assert user.id == 123
assert isinstance(user.id, int)
# Note that '123' was coerced to an int and its value is 123

user2 = User(id="onetwothree")
# Pydantic will throw a validation error because "onetwothree" cannot be
# converted to an int

In the above example, we created a class User with two attributes, id and name, and specified their datatypes. When we pass the string value '123' to the id field while creating an object, Pydantic automatically converts '123' into 123 and assigns it to id. If such a conversion is not possible (as when we created the user2 object), it throws a validation error.

Notice that the name string has a default value ‘Jane Doe’.
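The failing case can also be caught and inspected instead of left as a comment. This sketch assumes Pydantic v2, where ValidationError.errors() returns a list of error dictionaries:

```python
from pydantic import BaseModel, ValidationError


class User(BaseModel):
    id: int
    name: str = 'Jane Doe'


try:
    User(id='onetwothree')
except ValidationError as e:
    # Each error dict names the failing field (loc) and the failure kind (type).
    for err in e.errors():
        print(err['loc'], err['type'])  # ('id',) int_parsing
```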

Fields

The Field function is used to customize the validation of fields of the model.

  1. Setting the default value to the field

We use the default keyword inside the Field function to give a default value to the field.

from pydantic import BaseModel, Field


class User(BaseModel):
    name: str = Field(default='John Doe')


user = User()
print(user)
#> name='John Doe'

2. Adding numerical constraints to the fields

We use the following keywords inside the Field function to add numerical constraints to the field:

  • gt – greater than
  • lt – less than
  • ge – greater than or equal to
  • le – less than or equal to
  • multiple_of – a multiple of the given number
  • allow_inf_nan – allow 'inf', '-inf', 'nan' values

from pydantic import BaseModel, Field


class Foo(BaseModel):
    positive: int = Field(gt=0)
    non_negative: int = Field(ge=0)
    negative: int = Field(lt=0)
    non_positive: int = Field(le=0)
    even: int = Field(multiple_of=2)
    love_for_pydantic: float = Field(allow_inf_nan=True)


foo = Foo(
    positive=1,
    non_negative=0,
    negative=-1,
    non_positive=0,
    even=2,
    love_for_pydantic=float('inf'),
)
print(foo)
"""
positive=1 non_negative=0 negative=-1 non_positive=0 even=2 love_for_pydantic=inf
"""

3. String constraints

We use the following keywords to constrain the strings:

  • min_length: Minimum length of the string.
  • max_length: Maximum length of the string.
  • pattern: A regular expression that the string must match.

from pydantic import BaseModel, Field


class Foo(BaseModel):
    short: str = Field(min_length=3)
    long: str = Field(max_length=10)
    regex: str = Field(pattern=r'^\d*$')


foo = Foo(short='foo', long='foobarbaz', regex='123')
print(foo)
#> short='foo' long='foobarbaz' regex='123'

Required, Optional, and Nullable Fields

We can mark fields as required or optional, and control whether they may be None.

We can use the following table as a guide while creating a class having these constraints to its fields:

(A table summarizing these combinations is available in the Pydantic documentation.)

from typing import Optional

from pydantic import BaseModel, ValidationError


class Foo(BaseModel):
    f1: str  # required, cannot be None
    f2: Optional[str]  # required, can be None - same as str | None
    f3: Optional[str] = None  # not required, can be None
    f4: str = 'Foobar'  # not required, but cannot be None


try:
    Foo(f1=None, f2=None, f4='b')
except ValidationError as e:
    print(e)
"""
1 validation error for Foo
f1
  Input should be a valid string [type=string_type, input_value=None, input_type=NoneType]
"""

Here, the f1 field is required and cannot have a None value. However, while creating the object, the f1 field receives a None value, and hence we get a validation error.

Field Validators

If you want to apply custom validation to any of your fields, you can do so by creating a method with the validation criteria and decorating it with the @field_validator decorator.

from pydantic import (
    BaseModel,
    ValidationError,
    ValidationInfo,
    field_validator,
)


class UserModel(BaseModel):
    id: int
    name: str

    @field_validator('name')
    @classmethod
    def name_must_contain_space(cls, v: str) -> str:
        if ' ' not in v:
            raise ValueError('must contain a space')
        return v.title()

    # you can select multiple fields, or use '*' to select all fields
    @field_validator('id', 'name')
    @classmethod
    def check_alphanumeric(cls, v: str, info: ValidationInfo) -> str:
        if isinstance(v, str):
            # info.field_name is the name of the field being validated
            is_alphanumeric = v.replace(' ', '').isalnum()
            assert is_alphanumeric, f'{info.field_name} must be alphanumeric'
        return v


print(UserModel(id=1, name='John Doe'))
#> id=1 name='John Doe'

try:
    UserModel(id=1, name='samuel')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
name
  Value error, must contain a space [type=value_error, input_value='samuel', input_type=str]
"""


try:
    UserModel(id='abc', name='John Doe')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
id
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='abc', input_type=str]
"""


try:
    UserModel(id=1, name='John Doe!')
except ValidationError as e:
    print(e)
"""
1 validation error for UserModel
name
  Assertion failed, name must be alphanumeric
  assert False [type=assertion_error, input_value='John Doe!', input_type=str]
"""

Here, we have created two custom validations using the methods name_must_contain_space(…) and check_alphanumeric(…). The first applies validation to the name field, guaranteeing that it contains at least one space. The second checks whether a string field is alphanumeric (ignoring spaces).

We also created three objects in the code.

UserModel(id=1, name='samuel')

Here, we get the validation error because of the name_must_contain_space() field validator, since the name field does not have a space in it.

UserModel(id='abc', name='John Doe')

Here, we get the validation error because the id is not of integer datatype and pydantic is not able to convert it into one.

UserModel(id=1, name='John Doe!')

Here, we get the validation error because of the check_alphanumeric(…) field validator: since the name field contains an exclamation mark (!), it is no longer an alphanumeric string.

Using Pydantic for Data Validation of Data in DataFrame Format

Now that we have seen how to use Pydantic for the validation of fields in a class, let’s extend this knowledge to our data science project.

We will use Pydantic validations to constrain data records in our dataframe. This ensures we don’t use any invalid data while building our machine-learning model.

For a demonstration, let's use the 'Thyroid Disease Dataset' from Kaggle. I will skip the exploratory data analysis and data preprocessing steps since they are outside the scope of this article. If you are interested in all the steps of the project, check out the complete code at the link below.

GitHub – shivamshinde123/Thyroid_Disease_Detection_Internship: Thyroid disease is a common cause of…


Now let’s see how to perform data validation on the dataframe.


import pandas as pd
from pydantic import BaseModel, ValidationError, Field
from typing import List, Optional


class Dictvalidator(BaseModel):

    age: int = Field(gt=0, le=100)
    sex: Optional[str]
    on_thyroxine: Optional[str]
    query_on_thyroxine: Optional[str]
    on_antithyroid_meds: Optional[str]
    sick: Optional[str]
    pregnant: Optional[str]
    thyroid_surgery: Optional[str]
    I131_treatment: Optional[str]
    query_hypothyroid: Optional[str]
    query_hyperthyroid: Optional[str]
    lithium: Optional[str]
    goitre: Optional[str]
    tumor: Optional[str]
    hypopituitary: Optional[str]
    psych: Optional[str]
    TSH_measured: str
    TSH: Optional[float]
    T3_measured: str
    T3: Optional[float]
    TT4_measured: str
    TT4: Optional[float]
    T4U_measured: str
    T4U: Optional[float]
    FTI_measured: str
    FTI: Optional[float]
    TBG_measured: str
    TBG: Optional[float]
    referral_source: Optional[str]
    target: str
    patient_id: int


class dataframe_validator(BaseModel):

    df_dict: List[Dictvalidator]


if __name__ == '__main__':

    # raw_data_file_path is the path to the dataset CSV, defined elsewhere in the project
    df = pd.read_csv(raw_data_file_path)

    try:
        dataframe_validator(df_dict=df.to_dict(orient='records'))
    except ValidationError as e:
        raise e

First, we create a class named Dictvalidator with all the features of the dataframe as its fields, adding whatever constraints we want on these fields, as shown in the code above.

Next, we will create another class named dataframe_validator which will have a field that is the list of Dictvalidator. Now when we create an instance of the dataframe_validator class and pass the dataframe records as a list of dictionaries, all the fields of the dataframe will be validated.
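With a single dataframe_validator call, one bad record fails the whole list. An alternative sketch (with the field set trimmed to two columns purely for illustration) validates rows one at a time, so every offending row index can be collected and reported:

```python
import pandas as pd
from pydantic import BaseModel, Field, ValidationError


class Record(BaseModel):
    age: int = Field(gt=0, le=100)
    sex: str


df = pd.DataFrame({'age': [34, -5, 61], 'sex': ['F', 'M', 'F']})

bad_rows = []
for idx, row in enumerate(df.to_dict(orient='records')):
    try:
        Record(**row)
    except ValidationError as e:
        # Record the row index and the first error message for that row.
        bad_rows.append((idx, e.errors()[0]['msg']))

print(bad_rows)  # row 1 fails the gt=0 constraint on age
```

This way, a data pipeline can drop or flag the bad rows instead of aborting on the first invalid record.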

Also, there is another way of validating the pandas dataframe. We can use a Python library called Pandantic for this. You can check out the article by Wessel Huising on how to use this library for the validation of dataframes. The link for the article is given in the reference section.

Outro

Thank you so much for reading. If you liked this article don’t forget to press that clap icon as many times as you want. Keep following for more such articles!

Are you struggling to choose what to read next? Don’t worry, I have got you covered.

A Step-by-Step Guide to Building an End-to-End Machine Learning Project

This article will show you the way you can create an end-to-end machine learning project.

ai.plainenglish.io

and one more…

From Raw to Refined: A Journey Through Data Preprocessing — Part 6: Imbalanced Datasets

This article will explain the concept of imbalanced datasets and the methods used to handle them.

pub.towardsai.net

References

  • Pydantic documentation: Models — docs.pydantic.dev
  • Pydantic documentation: Data validation using Python type hints — docs.pydantic.dev
  • Validate Pandas DataFrames using Pydantic: pandantic, by Wessel Huising — wesselhuising.medium.com


Published via Towards AI
