Python StatsModels allows users to explore data, perform statistical tests and estimate statistical models. It is designed to complement SciPy’s stats module. It is part of the Python scientific stack for data science, statistics and data analysis.
Python StatsModels
StatsModels is built on top of NumPy and SciPy.
It also uses Pandas for data handling and Patsy for an R-like formula interface. It takes its graphics functions from matplotlib. Other Python packages also rely on it for their statistical functionality.
Originally, Jonathan Taylor wrote the models module of scipy.stats. It was part of scipy for some time but was removed later.
It was tested, corrected and improved during the Google Summer of Code 2009 and launched as a new package we know as StatsModels.
New models, plotting tools and statistical tests are continuously developed and introduced by the StatsModels development team.
Why StatsModels?
As the name suggests, StatsModels is made for serious statistical work: estimating models, running statistical tests and reporting detailed results.
StatsModels is a great tool for statistical analysis and is closely aligned with R, which makes it easier to pick up for people who work with R and want to move to Python.
Getting Started with StatsModels
Let’s get started with this Python library.
Install StatsModels
Before installing StatsModels, make sure the following are installed and working on your machine:
- Python 2.6 or later
- Numpy 1.6 or later
- Scipy 0.11 or later
- Pandas 0.12 or later
- Patsy 0.2.1 or later
- Cython 0.24 or later
Once you have these you can begin with installation.
To install using pip, open your terminal and type the following command:
pip install statsmodels
You can also install it using conda. To do so, type the following command in your terminal:
conda install statsmodels
Using StatsModels
Once you are done with the installation, you can use StatsModels easily in your Python code by importing it:
import statsmodels
Simple Example with StatsModels
Let’s have a look at a simple example to better understand the package:
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Load data (get_rdataset downloads the dataset, so this needs network access)
dat = sm.datasets.get_rdataset("Guerry", "HistData").data
# Fit regression model (using the natural log of one of the regressors)
results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
# Inspect the results
print(results.summary())
Running the above script prints a summary table for the fitted regression.
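If you want to try the formula interface without downloading a dataset, the same pattern can be sketched on synthetic data. The column names (`literacy`, `population`, `lottery`) and the coefficients used to generate the data are made up for illustration and are not taken from the Guerry dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Build a small synthetic data set with known coefficients.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "literacy": rng.uniform(10, 90, n),
    "population": rng.uniform(100, 1000, n),
})
df["lottery"] = (
    5.0
    - 0.4 * df["literacy"]
    + 3.0 * np.log(df["population"])
    + rng.normal(0, 1.0, n)
)

# The formula can apply functions such as np.log to a regressor directly.
results = smf.ols("lottery ~ literacy + np.log(population)", data=df).fit()

# The estimates are indexed by the term names used in the formula.
print(results.params)
```

The fitted coefficients should land near the values used to generate the data, which is a handy sanity check when learning a new interface.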
Python StatsModels Linear Regression
Now that we are familiar with the package, let’s move on to something a little more involved. Here we fit a linear regression to our data using the array-based interface of StatsModels:
# Load modules and data
import numpy as np
import statsmodels.api as sm
spector_data = sm.datasets.spector.load()
spector_data.exog = sm.add_constant(spector_data.exog, prepend=False)
# Fit and summarize OLS model
mod = sm.OLS(spector_data.endog, spector_data.exog)
res = mod.fit()
print(res.summary())
Running the above code prints an easy-to-read summary of the fitted OLS model.
Generalized Linear Models (GLMs)
GLMs in StatsModels currently support estimation using the one-parameter exponential families. Let’s look at an example that fits a Gamma family model:
# Load modules and data
import statsmodels.api as sm
data = sm.datasets.scotland.load()
data.exog = sm.add_constant(data.exog)
# Instantiate a gamma family model with the default link function.
gamma_model = sm.GLM(data.endog, data.exog, family=sm.families.Gamma())
gamma_results = gamma_model.fit()
print(gamma_results.summary())
Running the above code prints an easy-to-read summary of the fitted GLM.
Generalized Estimating Equations (GEEs)
As the name suggests, GEEs are generalized linear models for panel, cluster or repeated-measures data in which observations may be correlated within a cluster but are assumed to be uncorrelated across clusters.
# Load modules and data
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = sm.datasets.get_rdataset('epil', package='MASS').data
fam = sm.families.Poisson()
ind = sm.cov_struct.Exchangeable()
# Instantiate model with the default link function.
mod = smf.gee("y ~ age + trt + base", "subject", data, cov_struct=ind, family=fam)
res = mod.fit()
print(res.summary())
Running the above code prints the GEE regression summary.
Robust Linear Models
Robust linear models reduce the influence of outliers on the fitted coefficients. As you have probably noticed by now, such models are easy to build with statsmodels:
# Load modules and data
import statsmodels.api as sm
data = sm.datasets.stackloss.load()
data.exog = sm.add_constant(data.exog)
# Fit model and print the estimated parameters
rlm_model = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()
print(rlm_results.params)
Running the above code prints the estimated coefficients of the robust model.
Linear Mixed Effects Models
Sometimes we have to work with dependent data. Such data is common in longitudinal and other study designs where multiple observations are made on each subject. Linear Mixed Effects models are very helpful for regression analysis of such data:
# Load modules and data
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Fit model and print summary
data = sm.datasets.get_rdataset("dietox", "geepack").data
md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"])
mdf = md.fit()
print(mdf.summary())
Running the above code prints the mixed linear model regression summary.
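The same kind of model can be tried offline on synthetic repeated-measures data. In this sketch each subject gets its own random intercept; the column names (`subject`, `time`, `weight`) and all numeric values are hypothetical and only loosely mirror the example above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 60 subjects measured at 8 time points each.
rng = np.random.default_rng(5)
n_subj, n_obs = 60, 8
subject = np.repeat(np.arange(n_subj), n_obs)
time = np.tile(np.arange(n_obs), n_subj)

# Each subject has its own baseline (random intercept, sd = 2.0).
baseline = rng.normal(scale=2.0, size=n_subj)
weight = (20.0 + 0.5 * time + baseline[subject]
          + rng.normal(scale=1.0, size=n_subj * n_obs))

df = pd.DataFrame({"subject": subject, "time": time, "weight": weight})

# Random intercept per subject, fixed effect for time.
md = smf.mixedlm("weight ~ time", df, groups=df["subject"])
mdf = md.fit()

print(mdf.params)   # fixed effects plus the variance parameter
print(mdf.cov_re)   # estimated covariance of the random effects
```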
Conclusion
In this tutorial, we have seen that StatsModels makes it easy to perform statistical analysis. We have looked at several examples of fitting statistical models.
The Python StatsModels module makes it easy to create models without much hassle and with just a few lines of code. It also presents the output in a form that is easy to read and understand.