Linear Regression using Scikit-learn

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed. – Arthur Samuel (1959)

Machine learning is a branch of Artificial Intelligence (AI). Its focus is on training algorithms to learn patterns from data and to make predictions. Machine learning is especially valuable because it lets us use computers to automate decision-making processes.
Netflix and Amazon use machine learning to recommend new products. Banks use it to detect fraudulent credit card activity. The healthcare industry uses it to detect diseases and to monitor and assess patients.

Linear Regression

Linear regression is a supervised learning method for problems where the value to be predicted is continuous. It predicts the value of an outcome variable Y from one or more input predictor variables X. The aim is to establish a linear relationship (a mathematical formula) between the predictor variable(s) and the response variable, so that we can use this formula to estimate the response Y when only the values of the predictors (Xs) are known.
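To make the "mathematical formula" concrete, here is a minimal sketch of simple linear regression with a single predictor, where the fitted line is Y = intercept + slope * X. The toy data and the helper name predict are assumptions made up for illustration; they are not part of the brain-weight dataset used later.

```python
import numpy as np

# Hypothetical toy data lying exactly on the line y = 2x + 1 (illustration only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least-squares estimates for one predictor:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()


def predict(x_new):
    """Estimate the response for a new predictor value using the fitted line."""
    return intercept + slope * x_new


print(slope, intercept, predict(5.0))
```

Because the toy data has no noise, the estimates recover the line that generated it. With real data the points scatter around the line, and the same formulas give the line that minimizes the sum of squared errors.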

In this tutorial, we will implement simple linear regression in Python using Scikit-learn, a machine learning library for Python.

Loading the dataset.

The dataset that we will use will help us determine whether there is a relationship between brain weight (grams) and head size (cubic cm). The data set comes from the following paper: A Study of the Relations of the Brain to the Size of the Head, by R.J. Gladstone, published in Biometrika, 1905. It’s a rather quaint data set, created well over a century ago, and can be downloaded from here.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Read the whitespace-delimited file, skipping comment lines that start with '#'
df = pd.read_csv('dataset_brain.txt', comment='#', sep=r'\s+')
df.head()
   gender  age-group  head-size  brain-weight
0       1          1       4512          1530
1       1          1       3738          1297
2       1          1       4261          1335
3       1          1       3777          1282
4       1          1       4177          1590

Visualizing data

plt.scatter(df['head-size'], df['brain-weight'])
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Scatter Plot

Preparing the data

y = df['brain-weight'].values
y.shape

(237,)

X = df['head-size'].values
X = X.reshape(X.shape[0], 1)
X.shape

(237, 1)

A common practice is to split your data into a training set and a test set. You train and tune your model on the training set, then use the test set to check how well it generalizes to data it has never seen before. We will use scikit-learn’s train_test_split function to achieve this.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=100)
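As a quick sanity check on what the split does, the sketch below runs train_test_split on stand-in arrays. The names X_demo and y_demo and their contents are assumptions that only mimic the shapes of X and y above; they are not the real data. It shows that test_size=0.2 holds out about one fifth of the rows, that a fixed random_state reproduces the same split, and that each row of X stays paired with its y value.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with the same shapes as X and y above (237 rows).
X_demo = np.arange(237, dtype=float).reshape(-1, 1)
y_demo = np.arange(237, dtype=float)

# Split twice with the same seed to show that the result is reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100)

print(X_tr.shape, X_te.shape)          # training and test set sizes
print(np.array_equal(X_te, X_te2))     # same seed, same test rows
print(np.array_equal(X_tr.ravel(), y_tr))  # features and targets stay aligned
```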

Let us visualize the training and test set data.

plt.scatter(X_train, y_train, c='blue', marker='o')
plt.scatter(X_test, y_test, c='red', marker='p')
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Train and Test Data

Training the Model

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
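Once a model is fitted, the learned slope and intercept are available in its coef_ and intercept_ attributes, and predict expects a two-dimensional array even for a single value. The sketch below shows this on made-up data lying exactly on a line. The numbers and the names X_demo, y_demo and model are assumptions for illustration, not values from the brain-weight dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data on the exact line y = 0.25 * x + 300 (not the real dataset).
X_demo = np.array([[3000.0], [3500.0], [4000.0], [4500.0]])
y_demo = 0.25 * X_demo.ravel() + 300.0

model = LinearRegression().fit(X_demo, y_demo)

slope = model.coef_[0]        # one coefficient per input feature
intercept = model.intercept_  # value of the line at x = 0

# predict() needs shape (n_samples, n_features), hence the nested list.
new_pred = model.predict(np.array([[3800.0]]))[0]

print(slope, intercept, new_pred)
```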

Evaluating the model

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination.
In general, the higher the R-squared, the better the model fits your data. The score method of scikit-learn’s LinearRegression returns the coefficient of determination of the prediction.

lr.score(X_test, y_test)

0.68879769950865688
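The coefficient of determination can also be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below does this on synthetic data and compares the result with scikit-learn's r2_score function and the model's score method. The data, the seed and the names X_demo, y_demo and model are assumptions for illustration, so the value it prints is unrelated to the score above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical noisy data around a line (not the real dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(3000, 4500, size=(50, 1))
y_demo = 0.25 * X_demo.ravel() + 300 + rng.normal(0, 50, size=50)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_demo, y_hat), model.score(X_demo, y_demo))
```

All three values should agree up to floating-point rounding. A value of 1 means the line passes through every point, and a value near 0 means the line does no better than always predicting the mean.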

Let us also plot the regression line, which gives a better intuition for what linear regression does.

# lr.coef_ is a one-element array here, so take its single slope value
min_pred = X_train.min() * lr.coef_[0] + lr.intercept_
max_pred = X_train.max() * lr.coef_[0] + lr.intercept_

plt.scatter(X_train, y_train, c='blue', marker='o')
plt.plot([X_train.min(), X_train.max()],
         [min_pred, max_pred],
         color='red',
         linewidth=4)
plt.xlabel('Head size (cm^3)')
plt.ylabel('Brain weight (grams)');

Linear Regression

The entire code is available on GitHub.

To learn more about linear regression, check out this link.