Linear Regression

In [2]:
%matplotlib inline
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

Example: Boston Housing Dataset

In [3]:
from sklearn.datasets import load_boston
boston = load_boston()
In [4]:
boston.keys()
Out[4]:
dict_keys(['DESCR', 'target', 'data', 'feature_names'])
In [5]:
boston.feature_names
Out[5]:
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], 
      dtype='<U7')
In [6]:
print(boston.DESCR)
Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [7]:
boston.data.shape
Out[7]:
(506, 13)

Lets create a Pandas DataFrame object for this dataset.

In [8]:
data = pd.DataFrame(boston.data, columns=boston.feature_names)
In [9]:
data.head()
Out[9]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33

Lets add the target column also to the dataset.

In [10]:
data['PRICE'] = boston.target
In [11]:
data.head()
Out[11]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.00632 18.0 2.31 0.0 0.538 6.575 65.2 4.0900 1.0 296.0 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0.0 0.469 6.421 78.9 4.9671 2.0 242.0 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0.0 0.469 7.185 61.1 4.9671 2.0 242.0 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0.0 0.458 6.998 45.8 6.0622 3.0 222.0 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0.0 0.458 7.147 54.2 6.0622 3.0 222.0 18.7 396.90 5.33 36.2
In [12]:
data.describe()
Out[12]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.593761 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.596783 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.647423 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000 50.000000

First thing to do when we get a dataset is get a sense of the dataset by visualizing it. Lets try to see if there is any correlation between various columns.

In [13]:
data.plot(kind="scatter", x='RM', y='PRICE')
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x114215780>
In [14]:
plt.hist(data.CRIM)
plt.title("CRIM")
plt.xlabel("Crime rate per capita")
plt.ylabel("Frequencey")
plt.show()

It looks like boston was very peaceful city back then.

In [15]:
data.PRICE.hist()
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1148f7080>

Fitting Linear Regression using sklearn

In [16]:
from sklearn.linear_model import LinearRegression
X = data.drop('PRICE', axis = 1)

model = LinearRegression()
model.fit(X, data.PRICE)
Out[16]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [17]:
print('Estimated intercept coefficient:', model.intercept_)
Estimated intercept coefficient: 36.4911032804
In [18]:
print("coefficients:")
print(model.coef_)
coefficients:
[ -1.07170557e-01   4.63952195e-02   2.08602395e-02   2.68856140e+00
  -1.77957587e+01   3.80475246e+00   7.51061703e-04  -1.47575880e+00
   3.05655038e-01  -1.23293463e-02  -9.53463555e-01   9.39251272e-03
  -5.25466633e-01]
In [19]:
for col, coef in zip(X.columns, model.coef_):
    print("{:+6.4f} {}".format(coef, col))
-0.1072 CRIM
+0.0464 ZN
+0.0209 INDUS
+2.6886 CHAS
-17.7958 NOX
+3.8048 RM
+0.0008 AGE
-1.4758 DIS
+0.3057 RAD
-0.0123 TAX
-0.9535 PTRATIO
+0.0094 B
-0.5255 LSTAT

The coefficients indicate a lot about the important features.

The house price seems to be highly effected by RM, the average number of rooms per dwelling and RAD, accessibility to radial highways and negatively effected by NOX, the pollution and DIS, the average distance to employment centers.

In [20]:
plt.scatter(np.log(data.DIS), data.PRICE)
Out[20]:
<matplotlib.collections.PathCollection at 0x1155594e0>
In [21]:
import sklearn
import numpy as np

X_train, X_test, Y_train, Y_test = sklearn.cross_validation.train_test_split(
    X, data.PRICE, test_size=0.33, random_state = 5)
print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)
(339, 13)
(167, 13)
(339,)
(167,)
In [22]:
model = LinearRegression()
model.fit(X_train, Y_train)
Out[22]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [23]:
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
In [24]:
print("MSE train:", np.mean((Y_train - model.predict(X_train)) ** 2))
print("MSE test:", np.mean((Y_test - model.predict(X_test)) ** 2))
MSE train: 19.546758473534673
MSE test: 28.541367275618263
In [27]:
m = model
plt.scatter(m.predict(X_train), m.predict(X_train) - Y_train, c='b', s=40, alpha=0.5, label="train")
plt.scatter(m.predict(X_test), m.predict(X_test) - Y_test, c='g', s=40, label="test")
plt.hlines(y = 0, xmin=0, xmax = 50)
plt.ylabel('Residuals')
plt.legend()
Out[27]:
<matplotlib.legend.Legend at 0x115641860>
In [28]:
model
Out[28]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [27]:
model.residues_
/Users/anand/anaconda/lib/python3.5/site-packages/sklearn/utils/__init__.py:93: DeprecationWarning: Function residues_ is deprecated; ``residues_`` is deprecated and will be removed in 0.19
  warnings.warn(msg, category=DeprecationWarning)
Out[27]:
6626.3511225282509
In [33]:
model.get_params()
Out[33]:
{'copy_X': True, 'fit_intercept': True, 'n_jobs': 1, 'normalize': False}
In [30]:
import statsmodels
In [ ]: