sklearn-pandas
sklearn-pandas is a small library that provides a bridge between scikit-learn’s machine learning methods and pandas DataFrames.
In this blog post I will show you a simple example of how to use sklearn-pandas in a classification problem. I will use the Titanic dataset from Kaggle. You can find the training set and the test set here.
Imports
import os
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, Imputer, LabelEncoder, \
    FunctionTransformer, Binarizer, StandardScaler, MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn_pandas import DataFrameMapper, CategoricalImputer
here = os.path.abspath(os.path.dirname(__file__))
Data
Kaggle provides separate files for the training set and the test set: train.csv contains class labels (0 = dead; 1 = survived), while test.csv does not.
If you want, you can perform some basic EDA (Exploratory Data Analysis). Be careful of data leakage though! Don’t use the test data to carry out EDA, otherwise you will be tempted to select features or perform operations based on what you see in the test data. Here I will concatenate the training set and the test set just to see the total number of samples and the missing values of the entire dataset. I will “touch” the test set only at the end, for prediction.
data_directory = os.path.abspath(os.path.join(here, 'data', 'titanic'))
train_csv = os.path.join(data_directory, 'train.csv')
test_csv = os.path.join(data_directory, 'test.csv')
df_train = pd.read_csv(train_csv, header=0, index_col='PassengerId')
df_test = pd.read_csv(test_csv, header=0, index_col='PassengerId')
df = pd.concat([df_train, df_test], keys=['train', 'test'])
print('--- Info ---')
print(df.info())
print('--- Describe ---')
print(df.describe())
print('--- Features ---')
for feature in set(df_train.columns.values).difference(set(['Name'])):
    print(feature)
    print(df[feature].value_counts(dropna=False))
    print('-' * 40)
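Since we care about the missing values of the entire dataset, here is a quick way to count them per column, reusing the concatenated df from above (a minimal sketch):
# count missing values per column across training set and test set
print(df.isnull().sum().sort_values(ascending=False))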
Features
When working on a machine learning problem, feature engineering is the process of manually designing what the input x’s should be.
With sklearn-pandas you can use the DataFrameMapper class to declare transformations and variable imputations.
default=False means that only the variables specified in the DataFrameMapper will be kept. All other variables will be discarded.
None means that no transformation will be applied to that variable.
LabelBinarizer converts a categorical variable into a dummy variable (aka binary variable). A dummy variable is either 1 or 0, depending on whether a condition is met (in pandas, categorical variables can be converted into dummy variables with the get_dummies method).
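As a quick illustration, both approaches produce 0/1 encodings of the Sex column (a minimal sketch, reusing df_train and the imports from above):
# scikit-learn: LabelBinarizer encodes 'female'/'male' as a single 0/1 column
sex_binary = LabelBinarizer().fit_transform(df_train['Sex'])  # shape (891, 1)
# pandas: get_dummies creates one 0/1 column per category
sex_dummies = pd.get_dummies(df_train['Sex'])  # columns: 'female', 'male'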
Imputer is a scikit-learn class that can perform NA imputation for quantitative variables, while CategoricalImputer is a sklearn-pandas class that works on categorical variables too. Missing value imputation is a broad topic, and in other languages there are entire packages dedicated to it. For example, in R you can find MICE and Amelia.
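For example (a minimal sketch; Imputer defaults to mean imputation, CategoricalImputer to the most frequent value):
# numeric: replace NaN in 'Age' with the column mean
age_imputed = Imputer().fit_transform(df_train[['Age']])
# categorical: replace NaN in 'Embarked' with the most frequent port
embarked_imputed = CategoricalImputer().fit_transform(df_train['Embarked'])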
In a DataFrameMapper you can also provide a custom name for the transformed features – to be used instead of the automatically generated one – by specifying it as the third argument of the feature definition.
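For example, this small mapper renames the imputed Age column. Here age_imputed is a hypothetical name, and transformed_names_ is the attribute where DataFrameMapper stores the output feature names after fitting (a minimal sketch):
mapper_alias = DataFrameMapper([
    (['Age'], Imputer(), {'alias': 'age_imputed'}),
])
mapper_alias.fit_transform(df_train.copy())
print(mapper_alias.transformed_names_)  # ['age_imputed']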
The difference between specifying the column selector as 'column' (as a simple string) and ['column'] (as a list with one element) is the shape of the array that is passed to the transformer. In the first case, a one-dimensional array will be passed, while in the second case it will be a 2-dimensional array with one column, i.e. a column vector.
Example: with a simple string, Imputer() will discard NaN values for the column Age, and the fitting process will fail because of a mismatch between the size of this array and the other arrays in the DataFrame.
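You can check the two shapes directly on the DataFrame (a minimal sketch; 891 is the number of rows in train.csv):
print(df_train['Age'].values.shape)    # (891,)   -> 1-dimensional array
print(df_train[['Age']].values.shape)  # (891, 1) -> column vector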
mapper = DataFrameMapper([
    ('Pclass', None),
    ('Sex', LabelBinarizer()),
    (['Age'], [Imputer()]),
    ('SibSp', None, {'alias': 'Some variable'}),
    (['Ticket'], [LabelBinarizer()]),
    (['Fare'], Imputer()),
    ('Embarked', [CategoricalImputer(), MultiLabelBinarizer()]),
], default=False)
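To inspect what the mapper produces before wiring it into a pipeline, you can fit it on its own (a minimal sketch):
# fit_transform returns a numpy array with one row per passenger;
# transformed_names_ lists the generated column names (note the alias)
features = mapper.fit_transform(df_train.copy())
print(features.shape)
print(mapper.transformed_names_)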
Once again, here is how to use DataFrameMapper:
mapper = DataFrameMapper([
    (['Age'], [Imputer()]),  # OK
    ('Age', [Imputer()]),    # NO!
])
Pipeline
Now that you have defined the features you want to use, you can build a scikit-learn Pipeline. The first step of the Pipeline is the mapper you have just defined. The last step is a scikit-learn Estimator that will run the classification. In this case I chose a RandomForestClassifier with a basic configuration. Between these two steps you can define additional ones. For example, you might want to z-normalize your features with a StandardScaler.
seed = 42  # also used below for cross validation (the full script passes it as main(seed=42))
pipeline = Pipeline([
    ('feature_mapper', mapper),
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=seed))
])
Cross validation
The pipeline is ready, so you can train your model. In order to provide several estimates of the model’s accuracy you can use cross validation. scikit-learn provides the convenient function cross_val_score to do that, but you can also do it manually. Keep in mind that we are not touching the test set here: xx_train and xx_test are both part of the entire training set. We just split the entire training set to train on xx_train and predict on xx_test.
x_train = df_train[df_train.columns.drop(['Survived'])]
y_train = df_train['Survived']
# one way of computing cross-validated accuracy estimates
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
scores = cross_val_score(pipeline, x_train, y_train, cv=skf)
print('Accuracy estimates: {}'.format(scores))
# another way of computing cross-validated accuracy estimates
for i_split, (ii_train, ii_test) in enumerate(skf.split(X=x_train, y=y_train)):
    # x_train (independent variables, aka features) is a pandas DataFrame.
    # xx_train and xx_test are pandas DataFrames
    xx_train = x_train.iloc[ii_train, :]
    xx_test = x_train.iloc[ii_test, :]
    # y_train (target variable) is a pandas Series.
    # yy_train and yy_test are numpy arrays
    yy_train = y_train.values[ii_train]
    yy_test = y_train.values[ii_test]
    model = pipeline.fit(X=xx_train, y=yy_train)
    predictions = model.predict(xx_test)
    score = accuracy_score(y_true=yy_test, y_pred=predictions)
    print('Accuracy of split num {}: {}'.format(i_split, score))
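# optionally, summarize the three cross-validated estimates from cross_val_score above
print('Mean accuracy: {:.3f} (+/- {:.3f})'.format(scores.mean(), scores.std()))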
# final model (retrain it on the entire train set)
model = pipeline.fit(X=x_train, y=y_train)
Predict
Now that the model is trained, we can finally predict on data that we have never seen before (i.e. the test set).
# In this problem df_test doesn't contain the target variable 'Survived'
x_test = df_test
predictions = model.predict(x_test)
print('Predictions (0 = dead, 1 = survived)')
print(predictions)
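If you want to submit these predictions to Kaggle, you can write them to a CSV with the two expected columns, PassengerId and Survived. This is a minimal sketch; the file name submission.csv is just an example:
# the index of x_test is PassengerId, which Kaggle expects as the first column
submission = pd.DataFrame({'Survived': predictions.astype(int)},
                          index=x_test.index)
submission.to_csv(os.path.join(data_directory, 'submission.csv'))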
The entire script
Here is the entire script:
import os
import pandas as pd
from sklearn.preprocessing import LabelBinarizer, Imputer, LabelEncoder, \
    FunctionTransformer, Binarizer, StandardScaler, MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import accuracy_score
from sklearn_pandas import DataFrameMapper, CategoricalImputer
here = os.path.abspath(os.path.dirname(__file__))
def main(seed=42):
    data_directory = os.path.abspath(os.path.join(here, 'data', 'titanic'))
    train_csv = os.path.join(data_directory, 'train.csv')
    test_csv = os.path.join(data_directory, 'test.csv')
    df_train = pd.read_csv(train_csv, header=0, index_col='PassengerId')
    df_test = pd.read_csv(test_csv, header=0, index_col='PassengerId')
    df = pd.concat([df_train, df_test], keys=['train', 'test'])
    print('--- Info ---')
    print(df.info())
    print('--- Describe ---')
    print(df.describe())
    print('--- Features ---')
    for feature in set(df_train.columns.values).difference(set(['Name'])):
        print(feature)
        print(df[feature].value_counts(dropna=False))
        print('-' * 40)
    mapper = DataFrameMapper([
        ('Pclass', None),
        ('Sex', LabelBinarizer()),
        (['Age'], [Imputer()]),
        ('SibSp', None, {'alias': 'Some variable'}),
        (['Ticket'], [LabelBinarizer()]),
        (['Fare'], Imputer()),
        ('Embarked', [CategoricalImputer(), MultiLabelBinarizer()]),
    ], default=False)
    pipeline = Pipeline([
        ('feature_mapper', mapper),
        ('scaler', StandardScaler()),
        ('classifier', RandomForestClassifier(n_estimators=50, random_state=seed))
    ])
    x_train = df_train[df_train.columns.drop(['Survived'])]
    y_train = df_train['Survived']
    # one way of computing cross-validated accuracy estimates
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    scores = cross_val_score(pipeline, x_train, y_train, cv=skf)
    print('Accuracy estimates: {}'.format(scores))
    # another way of computing cross-validated accuracy estimates
    for i_split, (ii_train, ii_test) in enumerate(skf.split(X=x_train, y=y_train)):
        # x_train (independent variables, aka features) is a pandas DataFrame.
        # xx_train and xx_test are pandas DataFrames
        xx_train = x_train.iloc[ii_train, :]
        xx_test = x_train.iloc[ii_test, :]
        # y_train (target variable) is a pandas Series.
        # yy_train and yy_test are numpy arrays
        yy_train = y_train.values[ii_train]
        yy_test = y_train.values[ii_test]
        model = pipeline.fit(X=xx_train, y=yy_train)
        predictions = model.predict(xx_test)
        score = accuracy_score(y_true=yy_test, y_pred=predictions)
        print('Accuracy of split num {}: {}'.format(i_split, score))
    # final model (retrain it on the entire train set)
    model = pipeline.fit(X=x_train, y=y_train)
    # In this problem df_test doesn't contain the target variable 'Survived'
    x_test = df_test
    predictions = model.predict(x_test)
    print('Predictions (0 = dead, 1 = survived)')
    print(predictions)

if __name__ == '__main__':
    main()
If you run it, you should get these results with seed=42:
Accuracy estimates: [ 0.7979798 0.83501684 0.81144781]
Accuracy of split num 0: 0.797979797979798
Accuracy of split num 1: 0.835016835016835
Accuracy of split num 2: 0.8114478114478114
Predictions (0 = dead, 1 = survived)
[0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0 0
1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0
1 1 0 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
0 0 1 0 0 1 0 0 1 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 0
1 0 1 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1
0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 0 1 0
0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0
0 1 1 1 1 0 0 1 0 0 0]