Train a Linear Regression Model using scikit-learn
In machine learning, a linear regression model is trained to predict the value of a numeric target variable. It models the target as a weighted sum of the input features, y ≈ w0 + w1·x1 + ... + wp·xp, and training finds the weights that minimize the squared prediction error on the training data.
The example below shows how to train and evaluate a linear regression model using scikit-learn.
Example
In this example, you will work with the diabetes dataset that comes with scikit-learn.
Step 1: Load the data
from sklearn.datasets import load_diabetes
import pandas as pd
diabetes = load_diabetes(as_frame=True)
# Split features and target into separate objects (X is a DataFrame, y a Series)
X = diabetes.data
y = diabetes.target
print(X.shape)
print(X.head())
print(f'\ntarget:\n{y.head()}')
You should get the following output:
(442, 10)
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142

         s4        s5        s6
0 -0.002592  0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592  0.002861 -0.025930
3  0.034309  0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641

target:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
That is, the dataset consists of 442 observations with 10 health-related attributes and the target (a quantitative measure of disease progression one year after baseline). Note that scikit-learn ships these features already mean-centered and scaled, which is why the values above are small.
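If you want more detail on what each attribute means, the dataset bundles a full description that you can print (output omitted here):
# Print the dataset description that ships with scikit-learn
print(diabetes.DESCR)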
Let's split the data into train and test sets. Fixing random_state makes the split reproducible; without it you will see slightly different numbers on each run:
from sklearn.model_selection import train_test_split
# Hold out 20% of the observations for testing; the seed value 42 is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 2: Fit a baseline model
As a baseline, fit a linear regression model with all features to the training data:
from sklearn.linear_model import LinearRegression
# Ordinary least squares fit on all 10 features
reg = LinearRegression()
reg.fit(X_train, y_train)
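Once fitted, you can inspect the learned parameters as a quick sanity check; pairing the coefficients with the column names is just one convenient way to display them:
# The learned intercept and one coefficient per feature
print(reg.intercept_)
print(pd.Series(reg.coef_, index=X_train.columns))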
Now evaluate the model on the test set by computing the root mean squared error (RMSE):
from sklearn.metrics import mean_squared_error
from math import sqrt
# RMSE on the held-out test set
mse = mean_squared_error(y_test, reg.predict(X_test))
print("RMSE:", sqrt(mse))
You should get a result similar to this (the exact value depends on how the data was split):
RMSE: 53.92906390435081
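As a side note, if your scikit-learn version is 1.4 or newer, there is a dedicated helper that computes the RMSE directly, so the manual square root is unnecessary:
from sklearn.metrics import root_mean_squared_error
print("RMSE:", root_mean_squared_error(y_test, reg.predict(X_test)))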
Step 3: Feature selection
Let's try to find a better model by selecting features. More specifically, run a univariate F-test on each feature to see which ones have a statistically significant linear relationship with the target:
from sklearn.feature_selection import f_regression
# Compute an F statistic and p-value for each feature
f, pval = f_regression(X_train, y_train)
# Store the results in a dataframe
stat = pd.DataFrame({'feature': X_train.columns, 'F value': f, 'p value': pval})
stat['p value'] = stat['p value'].round(2)
print(stat)
You should get output similar to the following (exact statistics depend on the training split):
  feature     F value  p value
0     age   14.235069     0.00
1     sex    0.009320     0.92
2     bmi  197.891166     0.00
3      bp   78.668287     0.00
4      s1   17.218113     0.00
5      s2   13.096857     0.00
6      s3   65.878968     0.00
7      s4   87.104931     0.00
8      s5  162.300576     0.00
9      s6   73.527875     0.00
Based on these p-values, every feature except sex (p = 0.92) is significantly related to the target. Finally, fit another regression model using only the significant features. SelectFwe keeps the features whose p-values pass a family-wise error rate threshold of alpha=0.05:
from sklearn.feature_selection import SelectFwe
from sklearn.pipeline import Pipeline
# Selection step: keep features whose F-test p-values pass the FWE threshold
best = SelectFwe(f_regression, alpha=0.05)
estimator = LinearRegression()
# Chain selection and regression so both are fit on the training data only
pipeline = Pipeline([('feature_selection', best), ('estimator', estimator)])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, predictions))
print("RMSE:", rmse)
You should get a somewhat lower RMSE than the baseline (again, the exact value depends on the split):
RMSE: 50.81396878578822
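To double-check which features survived the selection step, query the fitted pipeline; get_support returns a boolean mask over the input columns:
# Map the boolean selection mask back to column names
mask = pipeline.named_steps['feature_selection'].get_support()
print(X_train.columns[mask].tolist())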
Conclusion
You just learned how to train a linear regression model using scikit-learn!
For those seeking further practice: try to find a better model using feature engineering. For example, include a polynomial term for the age attribute, as sketched below.
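Here is a minimal sketch of one way to start; the column name 'age^2', the seed 42, and the reuse of the Step 3 pipeline are illustrative choices, and whether this particular term actually lowers the RMSE is left for you to check:
# Add a squared age term as an extra feature
X_poly = X.copy()
X_poly['age^2'] = X_poly['age'] ** 2
# Re-split and refit the selection + regression pipeline on the enriched features
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_poly, y, test_size=0.2, random_state=42)
pipeline.fit(X_train2, y_train2)
print("RMSE:", sqrt(mean_squared_error(y_test2, pipeline.predict(X_test2))))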