Data Sampling with Python

Create Test/Train Datasets

An indispensable step in predictive analytics is creating training and test (and optionally validation) data sets. The sklearn library provides a simple function, train_test_split, for partitioning a data set.
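When a validation set is wanted as well, one common approach is to call train_test_split twice. The following is a minimal sketch under assumptions, not a prescribed recipe: the 60/20/20 proportions, the random_state value, and the toy array `rows` are all illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for a real data set: 100 row identifiers (illustrative only).
rows = np.arange(100)

# Step 1: set aside 20% of the rows as the test set.
train_val, test = train_test_split(rows, test_size=0.2, random_state=0)

# Step 2: split the remaining 80 rows again; 25% of them (20 rows) form the validation set.
train, validation = train_test_split(train_val, test_size=0.25, random_state=0)

print(len(train), len(validation), len(test))
```

The second test_size is a fraction of what remains after the first split, which is why 0.25 rather than 0.2 is used to get a validation set the same size as the test set.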

Example: Prices of round cut diamonds

Description: A dataset containing the prices and other attributes of almost 54,000 diamonds.

price: Selling price in dollars.
carat: Weight of the diamond.
cut: Quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color: Diamond colour, from J (worst) to D (best).
clarity: A measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x: Length in mm.
y: Width in mm.
z: Depth in mm.
depth: Total depth percentage = z / mean(x, y) = 2 * z / (x + y).
table: Width of top of diamond relative to widest point.
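
The depth formula in the list above can be checked with a short calculation. This is a sketch only: the function name depth_percentage and the measurements are hypothetical, and multiplying by 100 is an assumption based on the word "percentage" in the description.

```python
def depth_percentage(x: float, y: float, z: float) -> float:
    """Total depth percentage: z divided by the mean of x and y, times 100."""
    return 100 * 2 * z / (x + y)


# Hypothetical measurements in mm (not taken from the data set).
example_depth = depth_percentage(x=4.0, y=4.0, z=2.5)
print(example_depth)  # 62.5
```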

Source: https://vincentarelbundock.github.io/Rdatasets/doc/ggplot2/diamonds.html
Download the data from here

Task: Partition the data into test and train sets.

First, start Spyder. Then type the following.

								
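from sklearn.model_selection import train_test_split

# diamonds: the downloaded data, assumed to be already loaded (for example as a pandas DataFrame).
# Hold out 30% of the rows for testing; random_state fixes the shuffle so the split is reproducible.
diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.3, random_state=1234)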
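
To confirm that the split behaves as expected, it can help to inspect the resulting sizes. The sketch below uses a small made-up DataFrame in place of the real diamonds data (the values are hypothetical, and only two of the documented columns are included), so it runs without the downloaded file.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the diamonds data: 10 made-up rows.
diamonds = pd.DataFrame({
    "carat": [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1],
    "price": [300, 400, 600, 900, 1300, 1800, 2400, 3100, 3900, 4800],
})

diamonds_train, diamonds_test = train_test_split(diamonds, test_size=0.3, random_state=1234)

# With 10 rows and test_size=0.3, 3 rows go to the test set and 7 to the training set.
print(diamonds_train.shape, diamonds_test.shape)

# No row index should appear in both sets.
shared_rows = set(diamonds_train.index) & set(diamonds_test.index)
print(shared_rows)

# Repeating the call with the same random_state reproduces the same split.
train_again, test_again = train_test_split(diamonds, test_size=0.3, random_state=1234)
print(diamonds_test.index.equals(test_again.index))
```

On the full data set the same checks apply, with the test set holding roughly 30% of the rows.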