Linear Regression
Few data sets in practice arrive in perfect shape, so we usually need some preparation steps after importing the data.
Example: Prices of round cut diamonds
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.
price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.
Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html
Download the data from the source page linked above.
Task: Develop a model to estimate price given diamond parameters.
It will make our life much easier if we first separate the data into two tables: the predictors and the predicted variable:
import pandas as pd
import numpy as np

# Load the data (assuming the CSV was saved locally as diamonds.csv)
diamonds = pd.read_csv("diamonds.csv")

diamonds_y = pd.DataFrame(diamonds["price"])
diamonds_x = diamonds.drop("price", axis=1)
First we imported the pandas library (and numpy, which we will need later for the error metrics), then loaded the data. After that we created a new dataframe holding only the price column, and another dataframe with every column except price. The axis=1 argument tells pandas that the drop is column-wise rather than row-wise.
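As a quick sanity check (assuming the full dataset of almost 54,000 rows was loaded), the two frames should have the same number of rows and together cover all of the original columns:

print(diamonds_y.shape)  # (53940, 1) -- price only
print(diamonds_x.shape)  # (53940, 9) -- the nine remaining columns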
Before getting into regression we need to encode the categorical variables as numbers so that we can use them in our model. For example, the cut values "Ideal", "Premium", "Good" and so on could be converted to 1, 2, 3, etc. There is a subtlety with this approach, however. cut is what we call "ordinal" data: the categories have an inherent ranking (Ideal > Premium > Very Good > ...), so we have to convert them to numbers in a way that preserves that ranking (in either straight or reverse order). Let's first see which unique values we have in cut:
set(diamonds['cut'])
{'Fair', 'Good', 'Ideal', 'Premium', 'Very Good'}
The reason we use the set function is that a set contains no duplicates, so it gives us exactly the unique values in that column.
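For comparison, pandas offers its own idioms for the same inspection:

diamonds['cut'].unique()        # unique values as a NumPy array
diamonds['cut'].value_counts()  # unique values together with their counts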
We will now use the OrdinalEncoder class from the scikit-learn library:

from sklearn.preprocessing import OrdinalEncoder

ordenc = OrdinalEncoder(categories=[['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']])
diamonds['cut'] = ordenc.fit_transform(diamonds.cut.values.reshape(-1,1))
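It is worth verifying the mapping the encoder learned; its categories_ attribute lists the labels in the order they were assigned to 0, 1, 2, and so on:

print(ordenc.categories_)
# [array(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'], dtype=object)]
# i.e. 'Fair' -> 0.0, 'Good' -> 1.0, ..., 'Ideal' -> 4.0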
We will convert the other categorical variables (clarity and color) into one binary attribute per category. To do this we will use the OneHotEncoder class:
from sklearn.preprocessing import OneHotEncoder

# One 0/1 column per clarity grade
encoder = OneHotEncoder(categories=[['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2']])
clarity_one = encoder.fit_transform(diamonds.clarity.values.reshape(-1,1)).toarray()
clarity_one = pd.DataFrame(data=clarity_one, columns=['I1', 'IF', 'SI1', 'SI2', 'VS1', 'VS2', 'VVS1', 'VVS2'])

# Same treatment for the color column
encoder = OneHotEncoder(categories=[['E', 'F', 'D', 'H', 'I', 'J', 'G']])
color_one = encoder.fit_transform(diamonds.color.values.reshape(-1,1)).toarray()
color_one = pd.DataFrame(data=color_one, columns=['E', 'F', 'D', 'H', 'I', 'J', 'G'])
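As an aside, pandas can build equivalent one-hot columns in a single call; a minimal sketch (the automatically generated column names, e.g. clarity_I1 and color_E, differ from the ones we chose above):

diamonds_alt = pd.get_dummies(diamonds, columns=['clarity', 'color'])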
Now let's delete the original clarity and color columns. Note that drop returns a new dataframe rather than modifying the existing one, so we must assign the result back:

diamonds = diamonds.drop(['color', 'clarity'], axis=1)
Time to combine these dataframes:
diamonds = pd.concat([diamonds, clarity_one, color_one], axis=1)
We could run our regression model right away, but it is better practice to first split the data into a train and a test set, so that we can evaluate more accurately how the model performs on data it has not seen.
from sklearn.model_selection import train_test_split
diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.25)
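Note that the split is random, so the exact numbers below will vary from run to run. Passing a fixed random_state (42 below is an arbitrary choice) makes the split reproducible:

diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.25, random_state=42)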
Next we train the regression on the training set:
from sklearn.linear_model import LinearRegression
linr = LinearRegression()
linr.fit(diamonds_train.drop('price', axis=1), diamonds_train.price)
Out[]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
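Before scoring, we can peek at what the model learned: one coefficient per predictor column plus an intercept. A quick sketch:

# Pair each predictor column with its fitted coefficient
feature_names = diamonds_train.drop('price', axis=1).columns
for name, coef in zip(feature_names, linr.coef_):
    print(f"{name}: {coef:.2f}")
print("intercept:", linr.intercept_)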
Our model is now fitted. Let's see how well it does on our test data.
from sklearn.metrics import mean_squared_error, r2_score
diamonds_test_y = linr.predict(diamonds_test.drop('price', axis=1))  # predictions on the test set
linmse = mean_squared_error(diamonds_test.price, diamonds_test_y)
np.sqrt(linmse)  # RMSE, in the same units as price (dollars)
Out[]: 1104.3425369824802
r2_score(diamonds_test.price, diamonds_test_y)
Out[]: 0.9229236625907614
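An RMSE of roughly $1,100 and an R² of about 0.92 look respectable, but raw scores are easier to judge against a baseline. A minimal sketch comparing with a trivial model that always predicts the mean training price (its R² is close to 0 by construction):

# Trivial baseline: predict the mean training price for every diamond
baseline = np.full(len(diamonds_test), diamonds_train.price.mean())
np.sqrt(mean_squared_error(diamonds_test.price, baseline))  # close to the std dev of price
r2_score(diamonds_test.price, baseline)                     # approximately 0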