
Binning
Data binning (also called discrete binning or bucketing) is a data pre-processing technique used to reduce the effects of minor observation errors. The original data values that fall in a given small interval, a bin, are replaced by a value representative of that interval, often the central value. It is a form of quantization. Statistical data binning groups a set of more or less continuous values into a smaller number of "bins". For example, if you have data about a group of people, you might arrange their ages into a smaller number of age intervals (for example, five-year groups). Binning can also be used in multivariate statistics, where it is applied in several dimensions at once.
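As a rough illustration of the idea described above, the following Python sketch (Python rather than SAS, purely for illustration) replaces each age with the central value of its five-year interval. The function name `bin_fixed_width`, the `origin` parameter, and the sample ages are assumptions made for this example and do not come from the Garden dataset.

```python
def bin_fixed_width(values, width, origin=0):
    """Replace each value with the midpoint of its fixed-width bin.

    Bins are [origin, origin + width), [origin + width, origin + 2*width), ...
    """
    binned = []
    for v in values:
        index = (v - origin) // width      # which bin the value falls in
        lower = origin + index * width     # inclusive lower edge of that bin
        binned.append(lower + width / 2)   # central value represents the bin
    return binned


# Hypothetical ages, grouped into five-year intervals.
ages = [23, 27, 31, 34, 35, 58]
print(bin_fixed_width(ages, width=5))
# prints [22.5, 27.5, 32.5, 32.5, 37.5, 57.5]
```

Ages 31 and 34 fall in the same interval (30 to 34) and therefore share one representative value, which is the sense in which binning smooths out minor differences between observations.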
Example: Customer Purchase
Description: The Garden dataset contains 30,000 customers who were observed over a four-year period. Each customer's start time is the date of their first purchase, and they are monitored until the end of the four-year period.
id_number: Customer ID.
start: Customer's first purchase.
end: Customer's second purchase.
censor: 1 if there is a second purchase, 0 if not.
last_day: End of the study.
time: Number of months between first purchase and second purchase or last day of study, whichever is smaller.
garden: Dollars spent in garden department of first purchase.
decorating: Dollars spent in decorating department of first purchase.
car: Dollars spent in auto department of first purchase.
electrical: Dollars spent in electrical department of first purchase.
safety: Dollars spent in safety department of first purchase.
computer: Dollars spent in computer department of first purchase.
previous_garden: Previous amount spent in garden tools.
previous_decorating: Previous amount spent in decorating tools.
previous_car: Previous amount spent in auto tools.
previous_electrical: Previous amount spent in electrical tools.
previous_safety: Previous amount spent in safety tools.
previous_computer: Previous amount spent in computer tools.
amount_clv: Total amount spent in customer lifetime.
strata: Strata based on total number of orders. A: 1-2 orders between 2011 and 2013, B: 1-2 orders in 2014, C: 3-4 orders, D: 5-10 orders, E: 11-20 orders, F: >21 orders.
account_origin: Channel where the account is originated.
order: Channel where the first purchase is made.
age: Binned age.
credit_score: Binned credit score.
behavior: Binned behavior.
mosaic: Mosaic bureau data.
credicard: Method of payment or credit card brand.
family: Family bureau data.
income: Income bureau data.
Source: Business Survival Analysis Using SAS, J. Ribeiro.
Download the data from here
Task: Design a model to predict the time to the next purchase of a product.
One of the decisions we need to make when preparing a predictive model is whether to bin the target variable. Many models perform better with a binned target, partly because binning limits the influence of outliers. For our data, for example, we may simply bin 'time' to predict whether a customer will make a purchase within a year, two years, etc., instead of trying to predict the time exactly. We may benefit from binning time, especially with winsorized binning, which cuts off both tails before the bins are formed. SAS has a procedure called HPBIN to automate binning jobs:
/* Winsorized binning of time into 4 bins, with a winsor rate of 0.05 */
PROC HPBIN DATA=tutorial.garden NUMBIN=4 WINSOR WINSORRATE=0.05;
    INPUT time;
RUN;
Performance Information | |
---|---|
Execution Mode | Single-Machine |
Number of Threads | 4 |
Data Access Information | |||
---|---|---|---|
Data | Engine | Role | Path |
WORK.GARDEN | V9 | Input | On Client |
Binning Information | |
---|---|
Method | Winsor Binning |
Number of Bins Specified | 4 |
Number of Variables | 1 |
Mapping | ||||
---|---|---|---|---|
Variable | Binned Variable | Range | Frequency | Proportion |
Time | BIN_Time | Time < 11 | 373350 | 0.55177459 |
Time | BIN_Time | 11 <= Time < 21 | 107167 | 0.15838229 |
Time | BIN_Time | 21 <= Time < 31 | 80803 | 0.11941889 |
Time | BIN_Time | 31 <= Time | 115315 | 0.17042423 |
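The Mapping table shows four bins of equal width (cut points at 11, 21, and 31) with open-ended first and last bins. The Python sketch below (Python rather than SAS, for illustration only) shows one way such cut points could arise: drop a fraction of observations from each tail, then divide the remaining range into equal-width bins. This is an approximation of the idea, not HPBIN's documented algorithm. The tail rule `int(n * rate)`, the function names, and the sample data are all assumptions.

```python
def winsor_bin_edges(values, num_bins, rate):
    """Interior cut points for equal-width bins over the winsorized range."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(n * rate)                       # observations cut from each tail (assumed rule)
    low, high = ordered[k], ordered[n - 1 - k]
    width = (high - low) / num_bins         # equal width inside the winsorized range
    return [low + i * width for i in range(1, num_bins)]


def assign_bin(value, edges):
    """1-based bin index; the first and last bins are open-ended."""
    return 1 + sum(value >= edge for edge in edges)


# Hypothetical months-to-purchase values with one extreme outlier (300).
times = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20,
         22, 25, 28, 31, 34, 37, 39, 41, 300]

edges = winsor_bin_edges(times, num_bins=4, rate=0.05)
print(edges)                   # prints [11.0, 21.0, 31.0]
print(assign_bin(300, edges))  # prints 4: the outlier lands in the last bin
```

Because the cut points are computed from the winsorized range (1 to 41 here) rather than the raw range (0 to 300), a single extreme value does not stretch the bins; it is simply absorbed by the open-ended last bin.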
Winsorized Statistics | |||||||
---|---|---|---|---|---|---|---|
Variable | Mean | Std Error Mean | N Left Tail | Percent Left Tail | N Right Tail | Percent Right Tail | DF |
Time | 13.6603235 | 0.02003493 | 85439 | 12.6270441 | 38180 | 5.64262860 | 553015 |
Trimmed Statistics | |||||||
---|---|---|---|---|---|---|---|
Variable | Mean | Std Error Mean | N Left Tail | Percent Left Tail | N Right Tail | Percent Right Tail | DF |
Time | 13.7287782 | 0.02003492 | 85439 | 12.6270441 | 38180 | 5.64262860 | 553015 |
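The last two tables report a winsorized mean (about 13.66) and a trimmed mean (about 13.73) for Time. The difference between the two statistics can be sketched in Python (not SAS, and not the exact computation HPBIN performs): a winsorized mean replaces the tail observations with the nearest retained value, while a trimmed mean discards them. The helper names and the symmetric tail count `k` are assumptions for this example; the SAS output above shows different counts in the left and right tails.

```python
def winsorized_mean(values, k):
    """Mean after replacing the k smallest and k largest values
    with the nearest values that remain."""
    ordered = sorted(values)
    n = len(ordered)
    clipped = [ordered[k]] * k + ordered[k:n - k] + [ordered[n - 1 - k]] * k
    return sum(clipped) / n


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one extreme value.
sample = [2, 4, 5, 9, 100]
print(sum(sample) / len(sample))      # prints 24.0 (plain mean, pulled up by 100)
print(winsorized_mean(sample, k=1))   # prints 6.2 (uses 4, 4, 5, 9, 9)
print(trimmed_mean(sample, k=1))      # prints 6.0 (uses 4, 5, 9)
```

Both statistics are much less sensitive to the extreme value than the plain mean, which is the motivation for handling the tails before binning.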