Train a Linear Regression Model using scikit-learn
In machine learning, a linear regression model is trained to predict the value of a numeric target variable. It models the target as a weighted sum of the input features, y ≈ w0 + w1·x1 + ... + wp·xp, and training finds the weights that minimize the squared prediction error on the training data.
The example below shows how to train and evaluate a linear regression model using scikit-learn.
Example
In this example, you will work with the diabetes dataset that comes with scikit-learn.
Step 1: Load the data
from sklearn.datasets import load_diabetes
import pandas as pd
diabetes = load_diabetes(as_frame=True)
# Split features and target into separate objects (X is a DataFrame, y a Series)
X = diabetes.data
y = diabetes.target
print(X.shape)
print(X.head())
print(f'\ntarget:\n{y.head()}')
You should get the following output:
(442, 10)
        age       sex       bmi        bp        s1        s2        s3  \
0  0.038076  0.050680  0.061696  0.021872 -0.044223 -0.034821 -0.043401
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163  0.074412
2  0.085299  0.050680  0.044451 -0.005670 -0.045599 -0.034194 -0.032356
3 -0.089063 -0.044642 -0.011595 -0.036656  0.012191  0.024991 -0.036038
4  0.005383 -0.044642 -0.036385  0.021872  0.003935  0.015596  0.008142

         s4        s5        s6
0 -0.002592  0.019907 -0.017646
1 -0.039493 -0.068332 -0.092204
2 -0.002592  0.002861 -0.025930
3  0.034309  0.022688 -0.009362
4 -0.002592 -0.031988 -0.046641

target:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
That is, the dataset consists of 442 observations with 10 health-related attributes and the target (a quantitative measure of disease progression one year after baseline). Note that scikit-learn ships these features already mean-centered and scaled, which is why the values above are small.
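If you want more detail on what each attribute means, the dataset bundles a full description that you can print (output omitted here):
# Print the dataset description that ships with scikit-learn
print(diabetes.DESCR)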
Let's split the data into train and test sets. Fixing random_state makes the split reproducible; without it you will see slightly different numbers on each run:
from sklearn.model_selection import train_test_split
# Hold out 20% of the observations for testing; the seed value 42 is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 2: Fit a baseline model
As a baseline, fit a linear regression model with all features to the training data:
from sklearn.linear_model import LinearRegression
# Ordinary least squares fit on all 10 features
reg = LinearRegression()
reg.fit(X_train, y_train)
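Once fitted, you can inspect the learned parameters as a quick sanity check; pairing the coefficients with the column names is just one convenient way to display them:
# The learned intercept and one coefficient per feature
print(reg.intercept_)
print(pd.Series(reg.coef_, index=X_train.columns))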
Now evaluate the model on the test set by computing the root mean squared error (RMSE):
from sklearn.metrics import mean_squared_error
from math import sqrt
# RMSE on the held-out test set
mse = mean_squared_error(y_test, reg.predict(X_test))
print("RMSE:", sqrt(mse))
You should get a result similar to this (the exact value depends on how the data was split):
RMSE: 53.92906390435081
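As a side note, if your scikit-learn version is 1.4 or newer, there is a dedicated helper that computes the RMSE directly, so the manual square root is unnecessary:
from sklearn.metrics import root_mean_squared_error
print("RMSE:", root_mean_squared_error(y_test, reg.predict(X_test)))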
Step 3: Feature selection
Let's try to find a better model by selecting features. More specifically, run a univariate F-test on each feature to see which ones have a statistically significant linear relationship with the target:
from sklearn.feature_selection import f_regression
# Compute an F statistic and p-value for each feature
f, pval = f_regression(X_train, y_train)
# Store the results in a dataframe
stat = pd.DataFrame({'feature': X_train.columns, 'F value': f, 'p value': pval})
stat['p value'] = stat['p value'].round(2)
print(stat)
You should get output similar to the following (exact statistics depend on the training split):
  feature     F value  p value
0     age   14.235069     0.00
1     sex    0.009320     0.92
2     bmi  197.891166     0.00
3      bp   78.668287     0.00
4      s1   17.218113     0.00
5      s2   13.096857     0.00
6      s3   65.878968     0.00
7      s4   87.104931     0.00
8      s5  162.300576     0.00
9      s6   73.527875     0.00
Based on these p-values, every feature except sex (p = 0.92) is significantly related to the target. Finally, fit another regression model using only the significant features. SelectFwe keeps the features whose p-values pass a family-wise error rate threshold of alpha=0.05:
from sklearn.feature_selection import SelectFwe
from sklearn.pipeline import Pipeline
# Selection step: keep features whose F-test p-values pass the FWE threshold
best = SelectFwe(f_regression, alpha=0.05)
estimator = LinearRegression()
# Chain selection and regression so both are fit on the training data only
pipeline = Pipeline([('feature_selection', best), ('estimator', estimator)])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, predictions))
print("RMSE:", rmse)
You should get a somewhat lower RMSE than the baseline (again, the exact value depends on the split):
RMSE: 50.81396878578822
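To double-check which features survived the selection step, query the fitted pipeline; get_support returns a boolean mask over the input columns:
# Map the boolean selection mask back to column names
mask = pipeline.named_steps['feature_selection'].get_support()
print(X_train.columns[mask].tolist())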
Conclusion
You just learned how to train a linear regression model using scikit-learn!
For those seeking further practice: try to find a better model using feature engineering. For example, include a polynomial term for the age attribute, as sketched below.
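Here is a minimal sketch of one way to start; the column name 'age^2', the seed 42, and the reuse of the Step 3 pipeline are illustrative choices, and whether this particular term actually lowers the RMSE is left for you to check:
# Add a squared age term as an extra feature
X_poly = X.copy()
X_poly['age^2'] = X_poly['age'] ** 2
# Re-split and refit the selection + regression pipeline on the enriched features
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X_poly, y, test_size=0.2, random_state=42)
pipeline.fit(X_train2, y_train2)
print("RMSE:", sqrt(mean_squared_error(y_test2, pipeline.predict(X_test2))))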