Linear Regression
Few data sets in practice arrive in perfect shape, so we usually need some preparation steps after importing the data.
Example: Prices of round cut diamonds
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.
price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.
Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html
Download the data from the source page linked above.
Task: Develop a model to estimate price given diamond parameters.
It will make our life much easier if we first separate the data into two tables: the predictors and the predicted variable:
import pandas as pd
import numpy as np

# Load the data (assuming the CSV was saved locally as diamonds.csv)
diamonds = pd.read_csv("diamonds.csv")

diamonds_y = pd.DataFrame(diamonds["price"])
diamonds_x = diamonds.drop("price", axis=1)
First we imported the pandas library (and numpy, which we will need later for the error metrics), then loaded the data. After that we created a new dataframe holding only the price column, and another dataframe with every column except price. The axis=1 argument tells pandas that the drop is column-wise rather than row-wise.
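As a quick sanity check (assuming the full dataset of almost 54,000 rows was loaded), the two frames should have the same number of rows and together cover all of the original columns:

print(diamonds_y.shape)  # (53940, 1) -- price only
print(diamonds_x.shape)  # (53940, 9) -- the nine remaining columns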
Before getting into regression we need to encode the categorical variables as numbers so that we can use them in our model. For example, the cut values "Ideal", "Premium", "Good" and so on could be converted to 1, 2, 3, etc. There is a subtlety with this approach, however. cut is what we call "ordinal" data: the categories have an inherent ranking (Ideal > Premium > Very Good > ...), so we have to convert them to numbers in a way that preserves that ranking (in either straight or reverse order). Let's first see which unique values we have in cut:
set(diamonds['cut'])
{'Fair', 'Good', 'Ideal', 'Premium', 'Very Good'}
The reason we use the set function is that a set contains no duplicates, so it gives us exactly the unique values in that column.
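For comparison, pandas offers its own idioms for the same inspection:

diamonds['cut'].unique()        # unique values as a NumPy array
diamonds['cut'].value_counts()  # unique values together with their counts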
We will now use the OrdinalEncoder class from the scikit-learn library:

from sklearn.preprocessing import OrdinalEncoder

ordenc = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']])
diamonds['cut'] = ordenc.fit_transform(diamonds.cut.values.reshape(-1,1))
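It is worth verifying the mapping the encoder learned; its categories_ attribute lists the labels in the order they were assigned to 0, 1, 2, and so on:

print(ordenc.categories_)
# [array(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], dtype=object)]
# i.e. 'Fair' -> 0.0, 'Good' -> 1.0, ..., 'Ideal' -> 4.0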
We will convert the other categorical variables (clarity and color) into one binary attribute per category. To do this we will use the OneHotEncoder class:
from sklearn.preprocessing import OneHotEncoder

# One 0/1 column per clarity grade
encoder = OneHotEncoder(categories=[['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']])
clarity_one = encoder.fit_transform(diamonds.clarity.values.reshape(-1,1)).toarray()
clarity_one = pd.DataFrame(data=clarity_one, columns=['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2'])

# Same treatment for the color column
encoder = OneHotEncoder(categories=[['E', 'F', 'D', 'H', 'I', 'J', 'G']])
color_one = encoder.fit_transform(diamonds.color.values.reshape(-1,1)).toarray()
color_one = pd.DataFrame(data=color_one, columns=['E', 'F', 'D', 'H', 'I', 'J', 'G'])
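As an aside, pandas can build equivalent one-hot columns in a single call; a minimal sketch (the automatically generated column names, e.g. clarity_I1 and color_E, differ from the ones we chose above):

diamonds_alt = pd.get_dummies(diamonds, columns=['clarity', 'color'])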
Now let's delete the original clarity and color columns. Note that drop returns a new dataframe rather than modifying the existing one, so we must assign the result back:

diamonds = diamonds.drop(['color', 'clarity'], axis=1)
Time to combine these dataframes:
diamonds = pd.concat([diamonds, clarity_one, color_one], axis=1)
We could run our regression model right away, but it is better practice to first split the data into a train and a test set, so that we can evaluate more accurately how the model performs on data it has not seen.
from sklearn.model_selection import train_test_split
diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.25)
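Note that the split is random, so the exact numbers below will vary from run to run. Passing a fixed random_state (42 below is an arbitrary choice) makes the split reproducible:

diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.25, random_state=42)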
Next we train the regression on the training set:
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(diamonds_train.drop('price', axis=1), diamonds_train.price)
Out[]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
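Before scoring, we can peek at what the model learned: one coefficient per predictor column plus an intercept. A quick sketch:

# Pair each predictor column with its fitted coefficient
feature_names = diamonds_train.drop('price', axis=1).columns
for name, coef in zip(feature_names, linr.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", linr.intercept_)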
Our model is now fitted. Let's see how well it does on our test data.
from sklearn.metrics import mean_squared_error, r2_score
diamonds_test_y = linr.predict(diamonds_test.drop('price', axis=1))  # predictions on the test set
linmse = mean_squared_error(diamonds_test.price, diamonds_test_y)
np.sqrt(linmse)  # RMSE, in the same units as price (dollars)
Out[]: 1104.3425369824802
r2_score(diamonds_test.price, diamonds_test_y)
Out[]: 0.9229236625907614
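An RMSE of roughly $1,100 and an R² of about 0.92 look respectable, but raw scores are easier to judge against a baseline. A minimal sketch comparing with a trivial model that always predicts the mean training price (its R² is close to 0 by construction):

# Trivial baseline: predict the mean training price for every diamond
baseline = np.full(len(diamonds_test), diamonds_train.price.mean())
np.sqrt(mean_squared_error(diamonds_test.price, baseline))  # close to the std dev of price
r2_score(diamonds_test.price, baseline)                     # approximately 0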