Hard Drive Test Data Problem (Kaggle): Solution Approach
Part I
(Image courtesy: Zyzixun.net)
This article is dedicated to solving the Hard Drive Test Data problem on Kaggle. The dataset contains daily snapshots of operational hard drives during the first two quarters of 2016.
Problem statement: Predict hard drive failure / the survival period of a hard drive.
A detailed description of the dataset can be found on its Kaggle page: https://www.kaggle.com/backblaze/hard-drive-test-data
Go through the description before jumping into the approach, as it will make the methods implemented here easier to follow.
Once you have understood the data in depth, step one of the process is complete. Step two is preprocessing:
1. Preprocessing is a vital part of any machine learning pipeline. In layman's terms, a machine learning model is like a child and the preprocessing module is its parent: it carefully monitors the data fed to the model so that the model can perform at its best.
Raw datasets contain sparse and indigestible data (such as features of type object), and missing values and outliers are frequent. Such unevenness in the data confuses the model, resulting in incorrect predictions.
Presented below are the preprocessing approaches tested and/or implemented on the Hard Drive Test Data:
1.1
Reading only the normalized features: As mentioned in the description, the 90 columns essentially contain raw features and their normalized counterparts. Thus, we read only the normalized features, since they are effectively a processed mirror image of the raw data.
Normalization scales the features relative to one another, which makes them easier for the model to interpret. Had normalization not already been done, we would have done it at a later stage of the preprocessing module.
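A minimal sketch of this step, assuming the Kaggle CSV is named harddrive.csv and that the columns follow the Backblaze schema with *_raw / *_normalized SMART features (both are assumptions; adjust to your copy of the data):

```python
import pandas as pd

# Load the daily-snapshot CSV (file name assumed; use your local path).
df = pd.read_csv("harddrive.csv")

# Keep the identifying/label columns plus only the *_normalized SMART
# features, discarding their *_raw counterparts.
id_cols = ["date", "serial_number", "model", "capacity_bytes", "failure"]
normalized_cols = [c for c in df.columns if c.endswith("_normalized")]
df = df[id_cols + normalized_cols]
```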
1.2
Problem-dependent filtering: The dataset provided is huge, and it would only make sense to use the whole of it if that actually aided our solution. Fortunately, we do not need the entire set, since we are solving the problem with a supervised learning approach.
For this dataset, I have filtered the data down to the samples of the hard drives that fail at some date in the observation window. The samples for each hard drive have been stored separately (here, as individual data frames) to generate more accurate values for missing data during the later stages of preprocessing; a sketch follows below.
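A minimal sketch of that filtering, continuing from the df read above (the per_drive dictionary keyed by serial number is my own illustrative structure, not something prescribed by the dataset):

```python
# Serial numbers of the drives that fail at some point in the window.
failed_serials = df.loc[df["failure"] == 1, "serial_number"].unique()

# Keep only the records of those drives, and store each drive's daily
# snapshots as its own data frame so that missing values can later be
# filled per drive rather than across the pooled dataset.
failed_df = df[df["serial_number"].isin(failed_serials)]
per_drive = {
    serial: grp.sort_values("date").reset_index(drop=True)
    for serial, grp in failed_df.groupby("serial_number")
}
```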
1.3
Removal of features with all samples missing (obviously).
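In pandas this is essentially a one-liner (sketch, continuing with the frames from the previous step):

```python
# Drop any column that is NaN for every row of the filtered dataset,
# and keep only the surviving columns in each per-drive frame as well.
failed_df = failed_df.dropna(axis=1, how="all")
per_drive = {s: g[failed_df.columns] for s, g in per_drive.items()}
```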
1.4
Missing value generation: A few approaches were tested, and the most logically sound one has been implemented here. A gist of each approach is given below; a code sketch of the accepted approach is given at the end of this subsection.
i) Replacement with 0 (an integer or float representation of 'zero')
Discarded because: it is a misconception to treat NaN as '0', especially for this dataset, where the features are essentially sensor values. A NaN means the sensor did not report or record a value for that particular instance/row, whereas the value '0' may itself be meaningful for a sensor (for example, a sensor may return 0 when the device is at rest). Thus, replacing NaN values with 0 would amount to incorrect data manipulation.
ii) Replacement with the mean of all rows of the respective feature
Discarded because: if we take the mean of all rows for a given feature, the same value is assigned to both failing and working instances. In the original data, however, sensors may report very different values for failing and working drives.
iii) Replacement with the per-class mean (class 0 or 1)
Discarded because: it may seem logically correct to assign each class's own mean to the missing instances of that class, since this at least assigns different values to failing and working instances/rows. However, if we visualize the feature distributions along with their means, most features turn out to be either left-skewed or right-skewed. A good chunk of the data is missing, and the mean only takes into account the data that is available.
iv) Replacement with the median
Discarded because: even though the median gives better results than the mean, it too only takes into account the data that is available, so the problem of skewed, partially missing data persists.
v) Replacement with interpolated and extrapolated values
Accepted because: interpolation fills the gaps left by the missing sensor values. It takes the range of available data into account and varies the generated values accordingly, which solves the problem of skewed data and partial NaN replacement. One problem persists, however: interpolation cannot generate values outside the observed range, so features with NaN values at the beginning or end of a drive's history cannot be filled by interpolation alone. This is where extrapolation comes into the picture; it generates the data needed to replace those remaining NaNs.
Interpolation: assigns replacement values for NaNs within a feature, but works only within the observed range. It needs a known value on either side to determine the values in between.
Extrapolation: when NaN values lie outside the range bounded by the first and last observed values, interpolation ceases to work, because without a limit on one side it cannot determine the curve from a known point to a non-existent one. Here, extrapolation takes the aid of another feature that has a well-defined relationship with the feature in question (determined with the help of mutual information) and predicts the replacement values from that aiding feature.
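A minimal sketch of the accepted approach, applied per drive. Pandas' interpolate with limit_direction="both" is used here as a simple stand-in that pads edge NaNs with the nearest observed value; the feature-assisted extrapolation described above would replace that padding in the full solution:

```python
def fill_missing(drive_df, feature_cols):
    """Interpolate NaNs within one drive's history, then pad the edges
    where interpolation alone has no neighbouring value to work with."""
    filled = drive_df.copy()
    filled[feature_cols] = filled[feature_cols].interpolate(
        method="linear", limit_direction="both"
    )
    return filled

feature_cols = [c for c in failed_df.columns if c.endswith("_normalized")]
per_drive = {s: fill_missing(g, feature_cols) for s, g in per_drive.items()}
```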
1.5
Removal of columns based on variance score: It is essential to understand that a feature which shows no change across its instances/rows is as good as a column of all-NaN values. If a feature never varies, no pattern that might explain device failure can be learned from it. Thus, we check the variance of each feature and eliminate any feature that shows no variation (variance == 0).
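A sketch of this zero-variance filter on the pooled, gap-filled data (scikit-learn's VarianceThreshold would do the same job; a plain pandas check is shown, and the pooled frame is my assumption about how the per-drive frames are recombined):

```python
# Recombine the per-drive frames, then drop features with zero variance.
pooled = pd.concat(per_drive.values(), ignore_index=True)
variances = pooled[feature_cols].var()
constant_cols = variances[variances == 0].index.tolist()
pooled = pooled.drop(columns=constant_cols)
```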
1.6
Eliminating features based on their relationship with the target variable: Variation in a feature is necessary, but if the variation is abrupt and shows no clear pattern when mapped against the target variable, the feature cannot help us predict the target efficiently. To counter this, we look for a relationship between the feature in question and the target variable. The techniques considered are listed below, with a code sketch after the list:
i) Correlation: correlation captures linear relationships between the given features and is in fact extremely efficient at tracking them down. However, it fails when strong non-linear relationships are involved: it can report a very low correlation value despite the presence of a very significant relationship among the data points.
Reference Link: https://towardsdatascience.com/
ii) Mutual Information: mutual information solves the problem that arises with correlation, as it effectively captures any non-linear relationship between the given variables. This lets us eliminate the features that show no significant relationship with the target variable, thus strengthening the predictive model to be constructed. It is essentially a measure of how well one feature can be determined from another (information gain).
For mathematical understanding: https://en.wikipedia.org/wiki/Mutual_information#Definition
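A sketch of the mutual-information filter using scikit-learn (mutual_info_classif and the 0.01 cutoff are illustrative choices, not values prescribed by the original solution):

```python
from sklearn.feature_selection import mutual_info_classif

remaining = [c for c in feature_cols if c not in constant_cols]

# Drop any rows that still contain NaNs (features entirely missing for a
# given drive cannot be interpolated within that drive's history).
mask = pooled[remaining].notna().all(axis=1)
X, y = pooled.loc[mask, remaining], pooled.loc[mask, "failure"]

# Estimate mutual information between each remaining feature and the target.
mi_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=remaining)

# Keep only the features whose score clears a small, illustrative threshold.
selected = mi_scores[mi_scores > 0.01].index.tolist()
pooled = pooled[["date", "serial_number", "failure"] + selected]
```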
This constitutes the end of Part I, which covered data preprocessing.
Part II: Feature Engineering
Part III: Model Selection and Tuning
Keep tabs on this space for updates on the links.