Logistic Regression with Python

Logistic Regression

The logistic model (or logit model) is a widely used statistical model that, in its basic form, uses a logistic function to model a binary dependent variable; many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model; it is a form of binomial regression. Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, win/lose, alive/dead or healthy/sick; these are represented by an indicator variable whose two values are labeled "0" and "1".
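To make the logistic function concrete, here is a minimal sketch in plain Python. The function name `logistic` and the intercept and slope values are made up for illustration and have nothing to do with the frogs data used later.

```python
import math

def logistic(z):
    """Map a real-valued score z to a probability strictly between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

# Hypothetical intercept and slope for a model with a single predictor x.
intercept, slope = -1.0, 0.5

for x in (0.0, 2.0, 10.0):
    p = logistic(intercept + slope * x)
    print(f"x = {x}: P(y = 1) = {p:.3f}")
```

The model is linear in the score `intercept + slope * x`; the logistic function only squeezes that score into the range of a probability.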

Logistic regression is used in various fields, including machine learning, most medical fields, and the social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient's condition have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes or coronary heart disease) based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.). It is also used in marketing applications, such as predicting a customer's propensity to purchase a product or cancel a subscription. In economics it can be used to predict the likelihood of a person choosing to be in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing.

Example: Frogs

Description: This data frame gives the distribution of the Southern Corroboree frog, which occurs in the Snowy Mountains area of New South Wales, Australia.

pres.abs: 0 = frogs were absent, 1 = frogs were present
northing: reference point.
easting: reference point.
altitude: altitude in meters.
distance: distance in meters to nearest extant population.
noofpools: number of potential breeding pools.
noofsites: number of potential breeding sites within a 2 km radius.
avrain: mean rainfall for Spring period.
meanmin: mean minimum Spring temperature.
meanmax: mean maximum Spring temperature.

Source: https://vincentarelbundock.github.io/Rdatasets/doc/DAAG/frogs.html
The data are available for download from the Rdatasets collection linked above.
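Once the file is downloaded, loading it could look like the sketch below. The helper `load_frogs` is hypothetical, and the two rows in `sample_csv` are made up purely to show the call; with the real data you would pass the path of your downloaded file instead. The helper assumes the column names match the variable list above (ignoring capitalization) and keeps only those columns.

```python
import io
import pandas as pd

# Column names as given in the variable list above.
FROG_COLUMNS = [
    "pres.abs", "northing", "easting", "altitude", "distance",
    "noofpools", "noofsites", "avrain", "meanmin", "meanmax",
]

def load_frogs(source):
    """Read the frogs CSV from a path, URL, or open file object."""
    df = pd.read_csv(source)
    df.columns = df.columns.str.lower()  # tolerate different capitalization
    missing = [c for c in FROG_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    return df[FROG_COLUMNS]  # drop any extra columns, e.g. a row-number column

# Two made-up rows standing in for the real file.
sample_csv = io.StringIO(
    "pres.abs,northing,easting,altitude,distance,"
    "noofpools,noofsites,avrain,meanmin,meanmax\n"
    "1,115,1047,1500,500,232,3,155.0,3.6,14.0\n"
    "0,110,1042,1520,250,66,5,157.7,3.5,13.8\n"
)
frogs = load_frogs(sample_csv)
print(frogs.dtypes)
```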

Task: Which variables best predict the presence of frogs?

First, we import the classes we need and initialize our regression model:

								
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Allow up to 250 solver iterations so the fit has room to converge
    logreg = LogisticRegression(max_iter=250)

Let's create our training and test samples:

								
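    # frogs is the DataFrame holding the downloaded data
    X = frogs.drop('pres.abs', axis=1)   # predictors: every column except the response
    y = frogs['pres.abs']                # response: 1 = frogs present, 0 = absent

    # Hold out 25% of the rows for testing; random_state makes the split reproducible
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=444)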
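A quick sanity check of a split can be sketched as follows. The data here are synthetic placeholders (the column names `x1` and `x2` and the row count are made up), and `stratify=y` is an optional extra that the split above does not use; it keeps the share of 0s and 1s roughly equal in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 200  # arbitrary size for the illustration

# Synthetic stand-in for the predictors and the 0/1 response.
X = pd.DataFrame({"x1": rng.normal(size=n_rows), "x2": rng.normal(size=n_rows)})
y = pd.Series((rng.random(n_rows) < 0.4).astype(int), name="pres.abs")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=444, stratify=y
)

print(X_train.shape, X_test.shape)    # sizes of the two parts
print(y_train.mean(), y_test.mean())  # share of 1s in each part
```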

Fit the logistic regression model on the training set:


    logreg.fit(X_train, y_train)
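Since the task asks which variables matter most, one way to look inside a fitted model is to inspect its coefficients. The sketch below is an illustration under stated assumptions, not the method used in this article: the data are synthetic, the column names `strong`, `weak` and `noise` are made up, and the predictors are standardized first with `StandardScaler` so that coefficient magnitudes are roughly comparable.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Synthetic predictors (hypothetical names); only "strong" and "weak" drive y.
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
score = 2.0 * X["strong"].to_numpy() + 0.5 * X["weak"].to_numpy()
prob = 1.0 / (1.0 + np.exp(-score))
y = (rng.random(n) < prob).astype(int)

# Standardize, then fit; model[-1] is the fitted LogisticRegression step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=250))
model.fit(X, y)

coefs = pd.Series(model[-1].coef_[0], index=X.columns)
# Order by absolute size: a larger magnitude means a stronger association with y.
print(coefs.reindex(coefs.abs().sort_values(ascending=False).index))
```

Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the coefficients are shrunk somewhat toward zero.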

To see how our model fares on the test sample, we compute its accuracy:


    logreg.score(X_test, y_test)

0.6981132075471698
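The score above is the mean accuracy, that is, the fraction of test rows classified correctly. Accuracy is easier to judge next to a baseline, so here is a small sketch that compares it with always guessing the majority class. The labels and predictions below are made up for ten rows and are not the frogs results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for ten test rows.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0])

# Same quantity that .score() reports for a classifier.
accuracy = accuracy_score(y_true, y_pred)

# Accuracy of a "model" that always predicts the more common class.
share_of_ones = np.mean(y_true)
baseline = max(share_of_ones, 1 - share_of_ones)

print(accuracy, baseline)
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
```

In this toy example the accuracy is no better than the baseline, which is the kind of thing a single accuracy number can hide.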

Note that scikit-learn, unlike R or SAS, has no built-in method for computing p-values for the predictor variables. I think this is mostly because scikit-learn is geared more toward making predictions than toward looking under the hood. If you are interested in machine learning, Python is wonderful. If you are looking for traditional statistical analysis, R might be the way to go.

We can now predict the outcome for any observation. For example, for the test row with index label 205:

    # Double brackets select the row as a one-row DataFrame, keeping the column names
    logreg.predict(X_test.loc[[205]])

array([0], dtype=int64)

or the class probabilities:

    # Columns follow logreg.classes_: probability of 0 first, then probability of 1
    logreg.predict_proba(X_test.loc[[205]])

array([[0.93240067, 0.06759933]])
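In the output above, the first number is the probability of class 0 (frogs absent) and the second the probability of class 1 (frogs present), which is why the prediction for this row was 0. The sketch below shows that link between the two methods on synthetic data; the data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Synthetic data: two predictors, y depends mainly on the first one.
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=300) > 0).astype(int)

clf = LogisticRegression(max_iter=250).fit(X, y)

proba = clf.predict_proba(X[:5])  # columns follow clf.classes_, here [0, 1]
labels = clf.predict(X[:5])

# For a binary model, predict() returns the class with the larger probability.
from_proba = clf.classes_[np.argmax(proba, axis=1)]

print(proba.round(3))
print(labels, from_proba)
```

If the default 0.5 cut-off does not suit your problem, you can compare the second column of `predict_proba` against a threshold of your own instead of calling `predict`.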