Create Test/Train Datasets
An indispensable step in predictive analytics is splitting the data into training and test sets (and, optionally, a validation set). The sklearn library provides a simple function for this, train_test_split.
Example: Prices of round cut diamonds
Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.
price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.
Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html
Download the data from here
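As a quick illustration of the depth definition in the variable list, here is a minimal sketch of the formula. The function name depth_percentage and the measurements are hypothetical, chosen only for easy arithmetic and not taken from the dataset. The formula as written gives a ratio, so the sketch multiplies by 100 to express it as a percentage.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth as a percentage: 100 * z / mean(x, y)."""
    mean_xy = (x + y) / 2.0          # mean of length and width, in mm
    return 100.0 * z / mean_xy       # same as 100 * 2 * z / (x + y)


# Hypothetical stone: 4.0 mm long, 4.0 mm wide, 2.4 mm deep.
print(depth_percentage(4.0, 4.0, 2.4))  # about 60
```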
Task: Partition the data into test and train sets.
First, start Spyder. Then type the following.
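import pandas as pd
from sklearn.model_selection import train_test_split
diamonds = pd.read_csv("diamonds.csv")  # assumed file name; use the path of the file you downloaded
diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.3, random_state=1234)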
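To see what the split produces without downloading anything, the sketch below runs the same call on a small made-up data frame. The toy data and the names toy, toy_fit and toy_valid are hypothetical stand-ins, not part of the diamonds data. It also shows one common way to get the optional validation set, which is to split the training portion a second time.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the diamonds data: 100 rows, two columns.
toy = pd.DataFrame({
    "carat": [0.2 + 0.01 * i for i in range(100)],
    "price": [300 + 25 * i for i in range(100)],
})

# 70/30 split; random_state fixes the shuffle so the split is reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=1234)
print(len(toy_train), len(toy_test))  # 70 30

# No row appears in both parts.
print(set(toy_train.index).isdisjoint(toy_test.index))  # True

# Optional validation set: split the training portion again.
toy_fit, toy_valid = train_test_split(toy_train, test_size=0.2, random_state=1234)
print(len(toy_fit), len(toy_valid))
```

Because random_state is fixed, running the code again gives the same rows in each part. Change or drop the seed to get a different random partition.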