Data augmentation
Introduction
Yesterday, I was working on the CMI project,
and my score increased by 2% when I added two new data augmentations.
Data augmentation is a technique in machine learning that artificially expands
our training dataset by applying different transformations.
In torchvision, multiple transformations have already been implemented.
You can access these transformations by importing v2 (version 2) from
torchvision.transforms like this:
from torchvision.transforms import v2
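For example, a couple of standard image transforms from v2 can be composed like this (a minimal sketch; the parameter values are just illustrative):

import torch
from torchvision.transforms import v2

# Compose two common image augmentations; each applies its own randomness.
transforms = v2.Compose([
    v2.RandomHorizontalFlip(p=0.5),
    v2.ColorJitter(brightness=0.2),
])
image = torch.rand(3, 224, 224)  # a dummy image tensor
augmented = transforms(image)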
Data augmentation is extremely helpful when we want the model to generalize,
so it doesn’t overfit on the training data.
It also helps the model avoid becoming biased toward specific features
that only appear in the training data.
My use-case and Gaussian noise
In this project, we have to classify an individual’s behaviour from
a sequence.
I was specifically working on the IMU (Inertial Measurement Unit) part of
the data, trying to increase my score on it as much as I could.
The augmentation that I already had was to add a small amount of Gaussian noise.
The code below shows my approach:
import numpy as np
import polars as pl
from omegaconf import DictConfig


class AddGaussianNoise:
    def __init__(
        self,
        features_to_use: list[str],
        probability: float = 0.5,
        noise_std: float = 0.01,
    ):
        self.features_to_use = features_to_use
        self.probability = probability
        self.noise_std = noise_std

    @classmethod
    def from_config(cls, cfg: DictConfig) -> "AddGaussianNoise":
        if "features_to_use" not in cfg:
            raise ValueError("features_to_use is required")
        if "gaussian_noise" in cfg and "probability" in cfg.gaussian_noise:
            probability = cfg.gaussian_noise.probability
        else:
            probability = 0.5
            print(f"Using default probability: {probability}")
        if "gaussian_noise" in cfg and "noise_std" in cfg.gaussian_noise:
            noise_std = cfg.gaussian_noise.noise_std
        else:
            noise_std = 0.01
            print(f"Using default gaussian noise_std: {noise_std}")
        return cls(
            cfg.features_to_use,
            probability=probability,
            noise_std=noise_std,
        )

    def __call__(
        self,
        sequence: pl.DataFrame,
    ) -> pl.DataFrame:
        result = sequence
        if np.random.rand() < self.probability:
            for feature in self.features_to_use:
                # Add zero-mean Gaussian noise to each selected feature column.
                result = result.with_columns(
                    pl.col(feature)
                    + np.random.normal(
                        loc=0.0, scale=self.noise_std, size=result[feature].shape
                    ),
                )
        return result
The code above adds Gaussian noise with standard deviation noise_std,
applied with the given probability.
I also have a convention that every class I write should be loadable
from a config.
So, this class has a class method called from_config, which makes it
simple to create a new instance of the object, like the code below:
agn = AddGaussianNoise.from_config(cfg)
I really like this approach, and it made my code way cleaner and more readable.
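For completeness, here is a hypothetical config that would satisfy from_config (the feature names below are made up; the keys are the ones the method actually reads):

from omegaconf import OmegaConf

# Hypothetical config for illustration only; the real config may differ.
cfg = OmegaConf.create({
    "features_to_use": ["acc_x", "acc_y", "acc_z"],
    "gaussian_noise": {"probability": 0.5, "noise_std": 0.01},
})
agn = AddGaussianNoise.from_config(cfg)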
Shrink the sequence
The other augmentation that I added is to shrink the sequence. Shrinking a sequence means increasing the speed of the actions, which teaches the model not to care too much about how fast the actions are performed. I did that with the code below:
class ShrinkOneSequence:
    def __init__(
        self,
        probability: float = 0.2,
    ):
        self.probability = probability

    @classmethod
    def from_config(
        cls,
        cfg: DictConfig,
    ) -> "ShrinkOneSequence":
        if "shrink_one_sequence" in cfg and "probability" in cfg.shrink_one_sequence:
            probability = cfg.shrink_one_sequence.probability
        else:
            probability = 0.2
            print(f"Using default probability: {probability}")
        return cls(
            probability=probability,
        )

    def __call__(self, sequence: pl.DataFrame) -> pl.DataFrame:
        if np.random.rand() < self.probability:
            max_len = sequence.shape[0]
            # Keep between 70% and 90% (inclusive) of the original length.
            shrink_random = np.random.randint(70, 91) / 100
            max_len = round(max_len * shrink_random)
            # Take evenly spaced rows, which effectively speeds up the actions.
            take_sample = np.linspace(
                0,
                sequence.shape[0],
                max_len,
                endpoint=False,
                dtype=int,
            )
            sequence = sequence[take_sample]
        return sequence
The code above shrinks the given sequence with the given probability. The resulting sequence length is in the range of [70%, 90%] of the original. These bounds still need to be tuned carefully, but so far this technique has been pretty helpful.
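As a quick sanity check, forcing the augmentation on a toy sequence shows the effect (the column name here is made up for the demo):

import numpy as np
import polars as pl

seq = pl.DataFrame({"acc_x": np.arange(100, dtype=float)})
shrink = ShrinkOneSequence(probability=1.0)  # force the augmentation
out = shrink(seq)
print(seq.shape[0], "->", out.shape[0])  # 100 -> somewhere in [70, 90]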
Crop the start of the sequence
The other augmentation that I use is to crop the start of the sequence.
I know the main action happens at the end of the sequence.
So, by cropping the start, I’m effectively giving the model new data
while also helping it focus more on the end of the sequence.
To do that, I wrote the code below:
class CropStartSequence:
    def __init__(
        self,
        probability: float = 0.2,
    ):
        self.probability = probability

    @classmethod
    def from_config(
        cls,
        cfg: DictConfig,
    ) -> "CropStartSequence":
        if "crop_start_sequence" in cfg and "probability" in cfg.crop_start_sequence:
            probability = cfg.crop_start_sequence.probability
        else:
            probability = 0.2
            print(f"Using default probability: {probability}")
        return cls(
            probability=probability,
        )

    def __call__(self, sequence: pl.DataFrame) -> pl.DataFrame:
        if np.random.rand() < self.probability:
            max_len = sequence.shape[0]
            # Drop between 10% and 50% (inclusive) of the rows from the start.
            crop_percentage = np.random.randint(10, 51) / 100
            crop_start = round(max_len * crop_percentage)
            sequence = sequence[crop_start:]
        return sequence
The code above crops the start of the sequence with the given probability. The crop size is in the range of [10%, 50%] of the sequence length. These bounds should also be tuned carefully, but right now this is working really well.
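These augmentations can then be chained, applying each one after another to a sequence. The sketch below is hypothetical and only illustrates the call order (the wrapper class and feature names are made up):

# A hypothetical wrapper that chains the augmentations in order.
class AugmentationPipeline:
    def __init__(self, transforms: list):
        self.transforms = transforms

    def __call__(self, sequence: pl.DataFrame) -> pl.DataFrame:
        for transform in self.transforms:
            sequence = transform(sequence)
        return sequence

pipeline = AugmentationPipeline([
    AddGaussianNoise(features_to_use=["acc_x", "acc_y", "acc_z"]),
    ShrinkOneSequence(probability=0.2),
    CropStartSequence(probability=0.2),
])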
Final thoughts
In deep learning, data is literally everything.
The better we present our data to the model, the better
our results will be.
In this project, we don’t have that much data.
So, by using the right augmentations, I am trying to help my model generalize.
One of my next plans is to build a data generator that creates data similar
to what we already have, or to find some related datasets.
