Hard Drive Test Data Problem (Kaggle): Solution Approach
Part I
(Image courtesy: Zyzixun.net)
This article is dedicated to solving the Hard Drive Test Data problem on Kaggle. The dataset contains daily snapshots of operational hard drives during the first two quarters of 2016.
Problem statement: Predict hard drive failure / the survival period of a hard drive.
A detailed description of the dataset can be found on its Kaggle page: https://www.kaggle.com/backblaze/hard-drive-test-data
Go through the description before jumping into the approach, as it will make the methods implemented here easier to follow.
Once you have understood the data in depth, step one of the process is complete. Step two is preprocessing:
1. Preprocessing is a vital part of any machine learning pipeline. In layman's terms, a machine learning model is like a child and the preprocessing module is its parent: it carefully monitors the data fed to the model so that the model can perform at its best.
Raw datasets contain sparse and indigestible data (such as features of type object), and missing values and outliers are frequent. Such unevenness in the data confuses the model, resulting in incorrect predictions.
Presented below are the preprocessing approaches tested and/or implemented on the Hard Drive Test Data:
1.1
Reading only the normalized features: As mentioned in the description, the 90 columns essentially contain raw features and their normalized counterparts. Thus, we read only the normalized features, since they are effectively a processed mirror image of the raw data.
Normalization scales the features relative to one another, which makes them easier for the model to interpret. Had normalization not already been done, we would have done it at a later stage of the preprocessing module.
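A minimal sketch of this step, assuming the Kaggle CSV is named harddrive.csv and that the columns follow the Backblaze schema with *_raw / *_normalized SMART features (both are assumptions; adjust to your copy of the data):

```python
import pandas as pd

# Load the daily-snapshot CSV (file name assumed; use your local path).
df = pd.read_csv("harddrive.csv")

# Keep the identifying/label columns plus only the *_normalized SMART
# features, discarding their *_raw counterparts.
id_cols = ["date", "serial_number", "model", "capacity_bytes", "failure"]
normalized_cols = [c for c in df.columns if c.endswith("_normalized")]
df = df[id_cols + normalized_cols]
```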
1.2
Problem-dependent filtering: The dataset provided is huge, and it would only make sense to use the whole of it if that actually aided our solution. Fortunately, we do not need the entire set, since we are solving the problem with a supervised learning approach.
For this dataset, I have filtered the data down to the samples of the hard drives that fail at some date in the observation window. The samples for each hard drive have been stored separately (here, as individual data frames) to generate more accurate values for missing data during the later stages of preprocessing; a sketch follows below.
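A minimal sketch of that filtering, continuing from the df read above (the per_drive dictionary keyed by serial number is my own illustrative structure, not something prescribed by the dataset):

```python
# Serial numbers of the drives that fail at some point in the window.
failed_serials = df.loc[df["failure"] == 1, "serial_number"].unique()

# Keep only the records of those drives, and store each drive's daily
# snapshots as its own data frame so that missing values can later be
# filled per drive rather than across the pooled dataset.
failed_df = df[df["serial_number"].isin(failed_serials)]
per_drive = {
    serial: grp.sort_values("date").reset_index(drop=True)
    for serial, grp in failed_df.groupby("serial_number")
}
```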
1.3
Removal of features with all samples missing (obviously).
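In pandas this is essentially a one-liner (sketch, continuing with the frames from the previous step):

```python
# Drop any column that is NaN for every row of the filtered dataset,
# and keep only the surviving columns in each per-drive frame as well.
failed_df = failed_df.dropna(axis=1, how="all")
per_drive = {s: g[failed_df.columns] for s, g in per_drive.items()}
```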
1.4
Missing value generation: A few approaches were tested, and the most logically sound one has been implemented here. A gist of each approach is given below; a code sketch of the accepted approach is given at the end of this subsection.
i) Replacement with 0 (an integer or float representation of 'zero')
Discarded because: it is a misconception to treat NaN as '0', especially for this dataset, where the features are essentially sensor values. A NaN means the sensor did not report or record a value for that particular instance/row, whereas the value '0' may itself be meaningful for a sensor (for example, a sensor may return 0 when the device is at rest). Thus, replacing NaN values with 0 would amount to incorrect data manipulation.
ii) Replacement with the mean of all rows of the respective feature
Discarded because: if we take the mean of all rows for a given feature, the same value is assigned to both failing and working instances. In the original data, however, sensors may report very different values for failing and working drives.
iii) Replacement with the per-class mean (class 0 or 1)
Discarded because: it may seem logically correct to assign each class's own mean to the missing instances of that class, since this at least assigns different values to failing and working instances/rows. However, if we visualize the feature distributions along with their means, most features turn out to be either left-skewed or right-skewed. A good chunk of the data is missing, and the mean only takes into account the data that is available.
iv) Replacement with the median
Discarded because: even though the median gives better results than the mean, it too only takes into account the data that is available, so the problem of skewed, partially missing data persists.
v) Replacement with interpolated and extrapolated values
Accepted because: interpolation fills the gaps left by the missing sensor values. It takes the range of available data into account and varies the generated values accordingly, which solves the problem of skewed data and partial NaN replacement. One problem persists, however: interpolation cannot generate values outside the observed range, so features with NaN values at the beginning or end of a drive's history cannot be filled by interpolation alone. This is where extrapolation comes into the picture; it generates the data needed to replace those remaining NaNs.
Interpolation: assigns replacement values for NaNs within a feature, but works only within the observed range. It needs a known value on either side to determine the values in between.
Extrapolation: when NaN values lie outside the range bounded by the first and last observed values, interpolation ceases to work, because without a limit on one side it cannot determine the curve from a known point to a non-existent one. Here, extrapolation takes the aid of another feature that has a well-defined relationship with the feature in question (determined with the help of mutual information) and predicts the replacement values from that aiding feature.
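A minimal sketch of the accepted approach, applied per drive. Pandas' interpolate with limit_direction="both" is used here as a simple stand-in that pads edge NaNs with the nearest observed value; the feature-assisted extrapolation described above would replace that padding in the full solution:

```python
def fill_missing(drive_df, feature_cols):
    """Interpolate NaNs within one drive's history, then pad the edges
    where interpolation alone has no neighbouring value to work with."""
    filled = drive_df.copy()
    filled[feature_cols] = filled[feature_cols].interpolate(
        method="linear", limit_direction="both"
    )
    return filled

feature_cols = [c for c in failed_df.columns if c.endswith("_normalized")]
per_drive = {s: fill_missing(g, feature_cols) for s, g in per_drive.items()}
```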
1.5
Removal of columns based on variance score: It is essential to understand that a feature which shows no change across its instances/rows is as good as a column of all-NaN values. If a feature never varies, no pattern that might explain device failure can be learned from it. Thus, we check the variance of each feature and eliminate any feature that shows no variation (variance == 0).
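A sketch of this zero-variance filter on the pooled, gap-filled data (scikit-learn's VarianceThreshold would do the same job; a plain pandas check is shown, and the pooled frame is my assumption about how the per-drive frames are recombined):

```python
# Recombine the per-drive frames, then drop features with zero variance.
pooled = pd.concat(per_drive.values(), ignore_index=True)
variances = pooled[feature_cols].var()
constant_cols = variances[variances == 0].index.tolist()
pooled = pooled.drop(columns=constant_cols)
```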
1.6
Eliminating features based on their relationship with the target variable: Variation in a feature is necessary, but if the variation is abrupt and shows no clear pattern when mapped against the target variable, the feature cannot help us predict the target efficiently. To counter this, we look for a relationship between the feature in question and the target variable. The techniques considered are listed below, with a code sketch after the list:
i) Correlation: correlation captures linear relationships between the given features and is in fact extremely efficient at tracking them down. However, it fails when strong non-linear relationships are involved: it can report a very low correlation value despite the presence of a very significant relationship among the data points.
Reference Link: https://towardsdatascience.com/
ii) Mutual Information: mutual information solves the problem that arises with correlation, as it effectively captures any non-linear relationship between the given variables. This lets us eliminate the features that show no significant relationship with the target variable, thus strengthening the predictive model to be constructed. It is essentially a measure of how well one feature can be determined from another (information gain).
For mathematical understanding: https://en.wikipedia.org/wiki/Mutual_information#Definition
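A sketch of the mutual-information filter using scikit-learn (mutual_info_classif and the 0.01 cutoff are illustrative choices, not values prescribed by the original solution):

```python
from sklearn.feature_selection import mutual_info_classif

remaining = [c for c in feature_cols if c not in constant_cols]

# Drop any rows that still contain NaNs (features entirely missing for a
# given drive cannot be interpolated within that drive's history).
mask = pooled[remaining].notna().all(axis=1)
X, y = pooled.loc[mask, remaining], pooled.loc[mask, "failure"]

# Estimate mutual information between each remaining feature and the target.
mi_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=remaining)

# Keep only the features whose score clears a small, illustrative threshold.
selected = mi_scores[mi_scores > 0.01].index.tolist()
pooled = pooled[["date", "serial_number", "failure"] + selected]
```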
This constitutes the end of Part I, which covered data preprocessing.
Part II: Feature Engineering
Part III: Model Selection and Tuning
Keep tabs on this space for updates on the links.