
Machine Learning Techniques
The algorithms chosen for building machine learning models depend on the type, structure, and attributes of the data as a whole. Before applying any machine learning model, the data has to be preprocessed so that the trained model yields the best accuracy possible.
In this blog, we use data of Liver Patients as an example to walk through several preprocessing steps such as oversampling the data, filling in missing values, and hyperparameter tuning the models for better accuracy.
The dataset was downloaded from Kaggle and consists of 583 rows: 416 liver patient records and 167 non-liver patient records. The data was originally collected from the North East of Andhra Pradesh, India. The dataset contains 441 male patient records and 142 female patient records, and the “Dataset” column is the result column that divides the records into liver patients (liver disease) and non-liver patients (no disease).
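Before walking through the steps, here is a minimal sketch of loading the data into a pandas DataFrame, which the later snippets refer to as df. The filename indian_liver_patient.csv is an assumption and may differ from your local copy.
import pandas as pd

# Load the Kaggle liver patient data (filename is an assumption)
df = pd.read_csv('indian_liver_patient.csv')

print(df.shape)                      # expected: 583 rows
print(df['Dataset'].value_counts())  # class counts in the result column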
Here are some of the steps we had to undertake after studying the data:
1. Unbalanced Data: The dataset was unbalanced, with 416 positive (liver disease) records and 167 negative records. We oversampled it using the Synthetic Minority Oversampling Technique (SMOTE). There are various other approaches that can be used to oversample the data and make it balanced.
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbours. A synthetic instance is then created by choosing one of those k nearest neighbours b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as convex combinations of the two chosen instances a and b. Here is how SMOTE is used:
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
Here, X consists of the values of all columns except ‘Dataset’, while y consists of the values of the ‘Dataset’ column only, which is the result column.
X = df.drop('Dataset', axis=1).values
y = df['Dataset'].values
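As a quick sanity check, the class counts can be compared before and after resampling. This is a minimal sketch using collections.Counter together with the X and y defined above.
from collections import Counter
from imblearn.over_sampling import SMOTE

print('Before:', Counter(y))  # 416 liver patient records vs 167 non-liver records
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
print('After:', Counter(y))   # both classes now have the same number of records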
2. Missing Values: The dataset had a couple of missing values in the Albumin_and_Globulin_Ratio column. There are various approaches to handling missing values, such as replacing them with the mean of the column, replacing them with the median of the column, inserting random values, and so on. For this data, the missing values were replaced with the mean. The fillna() method on the data frame can impute missing values with the mean, median, mode, or a constant value. In this case, the Albumin_and_Globulin_Ratio column was filled in with its mean value in the following manner:
df[df.isna().any(axis=1)]  # inspect the rows that contain missing values
df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].mean(), inplace=True)  # impute with the column mean
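The same fillna() pattern covers the other strategies mentioned above; the lines below are only a sketch for comparison and were not used for this dataset.
# Alternative imputation strategies (pick one instead of the mean)
df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].median(), inplace=True)  # median
df['Albumin_and_Globulin_Ratio'].fillna(df['Albumin_and_Globulin_Ratio'].mode()[0], inplace=True)  # mode

# Verify that no missing values remain
print(df['Albumin_and_Globulin_Ratio'].isna().sum())  # expected: 0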
3. Scaling: In simple language, scaling is the process of converting data with different units to a common measure so that the values are easier to compare. Suppose we have one column measured in metres and another measured in seconds; these are very different units, and scaling converts the data so that the columns can be compared. There are various scaling techniques, and the choice also depends on the data being preprocessed. Here the MinMax scaler was used, which rescales the data to lie between 0.0 and 1.0.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
# Rescale every column of the data frame to the range [0.0, 1.0]
df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
After this step, the data has been adjusted to lie between 0.0 and 1.0.
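A quick way to confirm this is to look at the per-column minima and maxima; this is just a small check, not part of the original pipeline.
# Every column should now lie within [0.0, 1.0]
print(df.min())
print(df.max())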
4. Hyperparameter Tuning: Every model has various parameters, and finding the most reliable parameter values is often an uphill task for developers. Without this process, they have to run a great deal of trial and error to find the best values. Hyperparameter tuning lets developers pass parameters in bulk, each with multiple candidate values, and find the combination that yields the maximum accuracy and scores in a single process. There are various types of hyperparameter tuning, such as grid search, random search, and Bayesian optimization. For this data, grid search was used.
We used GridSearchCV as our method of hyperparameter tuning for the Artificial Neural Network, K-nearest Neighbours, SVM, and Logistic Regression models.
Here is an example of how hyperparameter tuning was used with the Artificial Neural Network model. clf.best_params_ shows the parameters that yielded the maximum accuracy.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Base estimator to be tuned (definition assumed here; max_iter raised so training converges)
my_ANN = MLPClassifier(max_iter=1000)

parameter_space = {
    'hidden_layer_sizes': [(10, 30, 10), (20,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
}

clf = GridSearchCV(my_ANN, parameter_space, n_jobs=-1, cv=5)
clf.fit(X, y)  # X is the training samples and y the corresponding labels
print('Best parameters found:\n', clf.best_params_)
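After fitting, the other GridSearchCV attributes are useful as well. Here is a small sketch of pulling out the cross-validated score and the refit model; the variable names are ours, not from the original code.
# Best mean cross-validated accuracy found across the grid
print('Best CV score:', clf.best_score_)

# Estimator refit on the full data with the best parameters (refit=True by default)
best_model = clf.best_estimator_
predictions = best_model.predict(X)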
Here, we have discussed a few machine learning techniques for data preprocessing, as well as hyperparameter tuning, which can be used to find the parameter values that give the model its maximum accuracy.
The code snippets used in this blog have been taken from here; you can see the full code there, and it is free to use.