Bayesian Regression with Count Data
Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The third of these is regression with count data (also called Poisson regression).
A regression with count data behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables $\mathbf{X}$ by the estimated coefficients $\boldsymbol{\beta}$, plus an intercept $\alpha$. However, instead of returning a continuous value $y$, as linear regression does, it returns the natural log of $y$.
We use regression with count data when our dependent variable is restricted to non-negative integers, i.e. $y \in \{0, 1, 2, \dots\}$. See the figure below for a graphical intuition of the exponential function:
using CairoMakie
f, ax, l = lines(-6 .. 6, exp; axis=(xlabel=L"x", ylabel=L"e^x"))
As we can see, the exponential function is basically a mapping of any real number to a positive real number in the range between 0 and $+\infty$ (non-inclusive):

$$ \exp : \mathbb{R} \to (0, +\infty) $$
That is, the exponential function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted to taking positive values only. That is why it is used when we need a model to have a positive-only dependent variable. This is the case of a dependent variable for count data.
Comparison with Linear Regression
Linear regression follows this mathematical formulation:

$$ \text{Linear} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n $$

$\boldsymbol{\theta}$ – model parameters
$\theta_0$ – intercept
$\theta_1, \theta_2, \dots$ – independent variables' coefficients
$n$ – total number of independent variables
Regression with count data would add the exponential function to the linear term:

$$ \log(y) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n $$

which is the same as:

$$ y = e^{(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)} $$
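To make the log link concrete, here is a minimal numeric sketch (the intercept and coefficient values below are made up purely for illustration): whatever the sign of the linear predictor, exponentiating it always yields a positive predicted count.

# Hypothetical intercept and coefficient, just to illustrate the log link.
α, β = 1.0, -0.5
for x in (-10, 0, 10)
    η = α + β * x                                # linear predictor: any real number
    println("x = $x  =>  exp(η) = $(exp(η))")    # always strictly positive
end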
Bayesian Regression with Count Data
We can model regression with count data in two ways: the first option uses a Poisson likelihood function and the second a negative binomial likelihood function.
With the Poisson likelihood we model a discrete and positive dependent variable by assuming that a given number of independent events will occur with a known constant average rate.
With a negative binomial likelihood, we model a discrete and positive dependent variable by assuming that a given number of independent events will occur by asking a yes-no question for each trial with probability $p$ until $k$ success(es) are obtained. Note that it becomes identical to the Poisson likelihood in the limit of $k \to \infty$. This makes the negative binomial a robust option to replace a Poisson likelihood when modeling phenomena with overdispersion (more variation in the data than expected). This occurs because the Poisson likelihood assumes that the dependent variable has the same mean and variance, while in the negative binomial likelihood the mean and the variance do not need to be equal.
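As a quick illustration of overdispersion (a sketch using Distributions.jl, with arbitrary parameter values), compare the mean and variance of a Poisson versus a negative binomial distribution:

using Distributions
d_poisson = Poisson(5.0)
mean(d_poisson), var(d_poisson)        # (5.0, 5.0): mean and variance are forced to be equal
d_negbin = NegativeBinomial(2.0, 0.3)  # arbitrary r and p, just for illustration
mean(d_negbin), var(d_negbin)          # (≈ 4.67, ≈ 15.56): the variance can exceed the mean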
Using Poisson Likelihood
$$ y \sim \text{Poisson}\left( e^{(\alpha + \mathbf{X} \boldsymbol{\beta})} \right) $$

where:
$y$ – discrete and positive dependent variable.
$e$ – exponential.
$\alpha$ – intercept.
$\boldsymbol{\beta}$ – coefficient vector.
$\mathbf{X}$ – data matrix.
As we can see, the linear predictor is the logarithm of the value of $y$. So we need to apply the exponential function to the values of the linear predictor:

$$ \log(y) = \alpha + \mathbf{X} \boldsymbol{\beta} \implies y = e^{(\alpha + \mathbf{X} \boldsymbol{\beta})} $$

The intercept $\alpha$ and coefficients $\boldsymbol{\beta}$ are only interpretable after exponentiation.
What remains is to specify the model parameters' prior distributions:
Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.
Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a regression with count data using the observed data ($\mathbf{X}$ and $y$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means finding the full posterior distribution of:

$$ P(\boldsymbol{\theta} \mid y) = P(\alpha, \boldsymbol{\beta} \mid y) $$
Note that, contrary to linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our regression with count data. This is due to the Poisson distribution not having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.
This is easily accomplished with Turing:
using Turing
using LazyArrays
using Random: seed!
seed!(123)
@model function poissonreg(X, y; predictors=size(X, 2))
#priors
α ~ Normal(0, 2.5)
β ~ filldist(TDist(3), predictors)
#likelihood
return y ~ arraydist(LazyArray(@~ LogPoisson.(α .+ X * β)))
end;
Here I am specifying very weakly informative priors:
$\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of $\alpha$. Remember that the normal distribution has support over all the real number line $\in (-\infty, +\infty)$.
$\boldsymbol{\beta} \sim \text{Student-}t(\nu = 3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and 3 degrees of freedom. That wide-tailed $t$ distribution will cover all possible values for our coefficients. Remember the Student-$t$ also has support over all the real number line $\in (-\infty, +\infty)$. Also, filldist() is a nice Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
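For example, here is a small sketch of what filldist() does with the coefficients' prior (assuming Turing is already loaded, and using 4 predictors purely for illustration):

β_prior = filldist(TDist(3), 4)  # one Student-t(ν = 3) prior repeated across 4 coefficients
rand(β_prior)                    # a 4-element vector, one draw per coefficient
logpdf(β_prior, zeros(4))        # joint log-density evaluated at a 4-element point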
Turing's arraydist() function wraps an array of distributions, returning a new distribution that samples from the individual distributions. And LazyArrays' LazyArray() constructor wraps a lazy object that wraps a computation producing an array. Last but not least, the macro @~ creates a broadcast and is a nice shorthand for the familiar dot broadcasting operator . in Julia. This is an efficient way to tell Turing that our y vector is distributed lazily as a LogPoisson broadcasted over α added to the product of the data matrix X and the β coefficient vector. LogPoisson is Turing's efficient distribution that already applies exponentiation to all the linear predictors.
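If you want to convince yourself that LogPoisson really is just a Poisson with the exponential link already applied, here is a small check (a sketch, assuming Turing is loaded as above):

η = 1.3                                                  # some value of the linear predictor
logpdf(LogPoisson(η), 4) ≈ logpdf(Poisson(exp(η)), 4)    # true: same log-likelihood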
Using Negative Binomial Likelihood
$$ y \sim \text{Negative Binomial}\left( e^{(\alpha + \mathbf{X} \boldsymbol{\beta})}, \phi \right) $$

where:
$y$ – discrete and positive dependent variable.
$e$ – exponential.
$\phi$ – dispersion.
$\phi^-$ – reciprocal dispersion.
$\alpha$ – intercept.
$\boldsymbol{\beta}$ – coefficient vector.
$\mathbf{X}$ – data matrix.
Note that when we compare with the Poisson model, we have a new parameter $\phi$ that parameterizes the negative binomial likelihood. This parameter controls the dispersion of the negative binomial distribution, and we generally use a Gamma distribution as prior for its inverse $\phi^- = \frac{1}{\phi}$, which fulfills the function of a "reciprocal dispersion" parameter. Most of the time we use a weakly informative $\text{Gamma}(0.01, 0.01)$ prior, with shape and scale both equal to 0.01 (Gelman et al., 2013; 2020). But you can also use $\phi^- \sim \text{Exponential}(1)$ as prior (McElreath, 2020).
Here is what a $\text{Gamma}(0.01, 0.01)$ prior looks like:
using Distributions
f, ax, l = lines(
Gamma(0.01, 0.01);
linewidth=2,
axis=(xlabel=L"\phi", ylabel="Density", limits=(0, 0.03, nothing, nothing)),
)
In both likelihood options, what remains is to specify the model parameters' prior distributions:
Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.
Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a regression with count data using the observed data ($\mathbf{X}$ and $y$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means finding the full posterior distribution of:

$$ P(\boldsymbol{\theta} \mid y) = P(\alpha, \boldsymbol{\beta} \mid y) $$
Note that, contrary to linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our regression with count data. This is due to neither the Poisson nor the negative binomial distributions having a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.
Alternative Negative Binomial Parameterization
One last thing before we get into the details of the negative binomial regression is to consider an alternative parameterization. Julia's Distributions.jl and, consequently, Turing's parameterization of the negative binomial distribution follows the Wolfram reference:

$$ \text{Negative Binomial}(k \mid r, p) \sim \frac{\Gamma(k + r)}{k! \, \Gamma(r)} \, p^r (1 - p)^k $$

where:
$k$ – number of failures before the $r$th success in a sequence of independent Bernoulli trials
$r$ – number of successes
$p$ – probability of success in an individual Bernoulli trial
This is not ideal for most of the modeling situations in which we would employ the negative binomial distribution. In particular, we want a parameterization that is more appropriate for count data: the familiar mean (or location) and variance (or scale) parameterization. If we look in Stan's documentation for the neg_binomial_2 function, we have the following two equations relating the mean $\mu$ and the variance $\mu + \frac{\mu^2}{\phi}$ to the $(r, p)$ parameterization above:

$$
\begin{aligned}
\mu &= \frac{r (1 - p)}{p} \\
\mu + \frac{\mu^2}{\phi} &= \frac{r (1 - p)}{p^2}
\end{aligned}
$$

With a little bit of algebra, we can substitute the first equation into the right-hand side of the second equation and get the following:

$$
\begin{aligned}
\mu + \frac{\mu^2}{\phi} &= \frac{\mu}{p} \\
1 + \frac{\mu}{\phi} &= \frac{1}{p} \\
p &= \frac{1}{1 + \frac{\mu}{\phi}}
\end{aligned}
$$

Then, substituting $p$ back into the first equation, we have:

$$
\begin{aligned}
\mu &= r \left( \frac{1}{p} - 1 \right) = r \left( 1 + \frac{\mu}{\phi} - 1 \right) = r \, \frac{\mu}{\phi} \\
r &= \phi
\end{aligned}
$$

Hence, the resulting map is $\text{NegativeBinomial2}(\mu, \phi) \to \text{NegativeBinomial}\left(r = \phi, \; p = \frac{1}{1 + \frac{\mu}{\phi}}\right)$. I would like to point out that this implementation was done by Tor Fjelde in a COVID-19 model with the code available on GitHub. So we can use this parameterization in our negative binomial regression model. But first, we need to define an alternative negative binomial distribution function:
function NegativeBinomial2(μ, ϕ)
p = 1 / (1 + μ / ϕ)
p = p > 0 ? p : 1e-4 # numerical stability
r = ϕ
return NegativeBinomial(r, p)
end
NegativeBinomial2 (generic function with 1 method)
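Before plugging it into a model, we can sanity-check the new parameterization with arbitrary values (a quick sketch): the resulting distribution should have mean ≈ μ and variance ≈ μ + μ²/ϕ.

μ, ϕ = 10.0, 2.0             # arbitrary mean and dispersion, just for checking
d = NegativeBinomial2(μ, ϕ)
mean(d)                      # ≈ 10.0, i.e. μ
var(d)                       # ≈ 60.0, i.e. μ + μ^2 / ϕ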
Now we create our Turing model with the alternative NegativeBinomial2 parameterization:
@model function negbinreg(X, y; predictors=size(X, 2))
#priors
α ~ Normal(0, 2.5)
β ~ filldist(TDist(3), predictors)
ϕ⁻ ~ Gamma(0.01, 0.01)
ϕ = 1 / ϕ⁻
#likelihood
return y ~ arraydist(LazyArray(@~ NegativeBinomial2.(exp.(α .+ X * β), ϕ)))
end;
Here I am also specifying very weakly informative priors:
$\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of $\alpha$. Remember that the normal distribution has support over all the real number line $\in (-\infty, +\infty)$.
$\boldsymbol{\beta} \sim \text{Student-}t(\nu = 3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and 3 degrees of freedom. That wide-tailed $t$ distribution will cover all possible values for our coefficients. Remember the Student-$t$ also has support over all the real number line $\in (-\infty, +\infty)$. Also, filldist() is a nice Turing function which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution.
$\phi^- \sim \text{Gamma}(0.01, 0.01)$ – reciprocal dispersion parameter of the negative binomial distribution.
Turing's arraydist() function wraps an array of distributions, returning a new distribution that samples from the individual distributions. And LazyArrays' LazyArray() constructor wraps a lazy object that wraps a computation producing an array. Last but not least, the macro @~ creates a broadcast and is a nice shorthand for the familiar dot broadcasting operator . in Julia. This is an efficient way to tell Turing that our y vector is distributed lazily as a NegativeBinomial2 broadcasted over α added to the product of the data matrix X and the β coefficient vector. Note that NegativeBinomial2 does not apply exponentiation, so we had to include the exp.() broadcasted function applied to all the linear predictors.
Example - Roaches Extermination
For our example, I will use a famous dataset called roaches
(Gelman & Hill, 2007), which is data on the efficacy of a pest management system at reducing the number of roaches in urban apartments. It has 262 observations and the following variables:
y – number of roaches caught.
roach1 – pretreatment number of roaches.
treatment – binary/dummy (0 or 1) for treatment indicator.
senior – binary/dummy (0 or 1) for only elderly residents in building.
exposure2 – number of days for which the roach traps were used.
Let's read our data with CSV.jl and output it into a DataFrame from DataFrames.jl:
using DataFrames
using CSV
using HTTP
url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/roaches.csv"
roaches = CSV.read(HTTP.get(url).body, DataFrame)
describe(roaches)
5×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Real Float64 Real Int64 DataType
─────┼────────────────────────────────────────────────────────────────────
1 │ y 25.6489 0 3.0 357 0 Int64
2 │ roach1 42.1935 0.0 7.0 450.0 0 Float64
3 │ treatment 0.603053 0 1.0 1 0 Int64
4 │ senior 0.305344 0 0.0 1 0 Int64
5 │ exposure2 1.02105 0.2 1.0 4.28571 0 Float64
As you can see from the describe() output, the average number of roaches caught by the pest management system is around 26 roaches. The average number of roaches pretreatment is around 42 roaches (oh boy...). 30% of the buildings have only elderly residents and 60% of the buildings received treatment by the pest management system. Also note that the traps were set, in general, for only 1 day, ranging from 0.2 days (almost 5 hours) to 4.3 days (approximately 4 days and 7 hours).
Poisson Regression
Let's first run the Poisson regression. First, we instantiate our model with the data:
X = Matrix(select(roaches, Not(:y)))
y = roaches[:, :y]
model_poisson = poissonreg(X, y);
And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():
chain_poisson = sample(model_poisson, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain_poisson)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
α 2.9867 0.1526 0.0257 307.5895 81.9752 1.0121 9.3558
β[1] 0.0065 0.0002 0.0000 3084.6402 2620.6876 1.0023 93.8237
β[2] -0.5007 0.1066 0.0179 277.4972 68.4987 1.0220 8.4405
β[3] -0.3620 0.1247 0.0215 319.4964 79.5812 1.0136 9.7179
β[4] 0.1265 0.2666 0.0452 310.0447 75.1405 1.0124 9.4304
The Markov chains look acceptable: all of the rhat values are close to 1.0 (the largest is around 1.02). Note that the coefficients are on the log scale. So we need to apply the exponential function to them. We can do this with a transformation in a DataFrame constructed from a Chains object:
using Chain
@chain quantile(chain_poisson) begin
DataFrame
select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
end
5×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼───────────────────────────────────────────────────────────────────
1 │ α 17.8944 18.8715 19.4401 20.0227 21.6618
2 │ β[1] 1.00636 1.00648 1.00655 1.00661 1.00672
3 │ β[2] 0.569218 0.587878 0.598309 0.608326 0.642261
4 │ β[3] 0.640725 0.669002 0.685442 0.701448 0.74567
5 │ β[4] 1.07174 1.14706 1.17526 1.20443 1.26015
Let's analyze our results. The intercept α is the basal number of roaches caught y and has a median value of 19.4 roaches caught. The remaining 95% credible intervals for the βs can be interpreted as follows:
β[1] – first column of X, roach1, has a 95% credible interval of 1.01 to 1.01. This means that each increase of one unit in roach1 is related to an increase of 1% more roaches caught.
β[2] – second column of X, treatment, has a 95% credible interval of 0.57 to 0.64. This means that if an apartment was treated with the pest management system, we expect a decrease of around 40% in roaches caught.
β[3] – third column of X, senior, has a 95% credible interval of 0.64 to 0.75. This means that if an apartment building has only elderly residents, we expect a decrease of around 30% in roaches caught.
β[4] – fourth column of X, exposure2, has a 95% credible interval of 1.07 to 1.26. For each additional day of trap exposure in an apartment, we expect an increase of between 7% and 26% in roaches caught.
That's how you interpret 95% credible intervals from the quantile() output of a regression-with-count-data Chains object converted from the log scale.
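If you prefer to read exponentiated coefficients directly as percent changes, a tiny helper does the trick (a sketch; pct_change is a hypothetical name, not something from Turing or Distributions.jl):

pct_change(exp_coef) = (exp_coef - 1) * 100   # percent change in the expected count
pct_change(0.598)    # ≈ -40.2: treatment is associated with ~40% fewer roaches caught
pct_change(1.175)    # ≈ +17.5: each extra day of exposure, ~18% more roaches caught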
Negative Binomial Regression
Let's now run the negative binomial regression.
model_negbin = negbinreg(X, y);
We will also use the default NUTS() sampler with 1_000 samples, with 4 Markov chains using multiple threads MCMCThreads():
chain_negbin = sample(model_negbin, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain_negbin)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
α 2.3935 0.3639 0.0077 2289.8720 1937.3708 1.0026 50.6474
β[1] 0.0127 0.0015 0.0000 4640.3636 2667.0438 1.0014 102.6357
β[2] -0.7327 0.1499 0.0027 3027.2982 2637.8716 1.0028 66.9578
β[3] -0.3199 0.1553 0.0027 3278.7372 2381.6903 1.0018 72.5192
β[4] 0.4266 0.3365 0.0073 2198.4749 1778.9698 1.0027 48.6259
ϕ⁻ 1.4058 0.0800 0.0014 3056.0376 2581.1948 1.0003 67.5935
We had no problem with the Markov chains as all the rhat values are well below 1.01 (or above 0.99). Note that the coefficients are on the log scale. So we need to also apply the exponential function as we did before.
@chain quantile(chain_negbin) begin
DataFrame
select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
end
6×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼─────────────────────────────────────────────────────────────────
1 │ α 5.28389 8.5925 11.0566 13.9834 21.9849
2 │ β[1] 1.00987 1.01173 1.01277 1.01381 1.016
3 │ β[2] 0.358343 0.433873 0.481535 0.53245 0.645215
4 │ β[3] 0.540157 0.650629 0.724575 0.806835 0.984437
5 │ β[4] 0.829193 1.21638 1.50724 1.92373 3.05504
6 │ ϕ⁻ 3.52309 3.85985 4.07161 4.30516 4.80385
Our results show much more uncertainty in the coefficients than in the Poisson regression. So it might be best to use the Poisson regression for the roaches dataset.
References
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian Data Analysis. Chapman and Hall/CRC.
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.
McElreath, R. (2020). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.