Bayesian Logistic Regression

Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The first is logistic regression (also called binomial regression).

A logistic regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables \(\mathbf{X}\) with the estimated coefficients \(\boldsymbol{\beta}\), plus an intercept \(\alpha\). However, instead of returning a continuous value \(y\), as linear regression does, it returns the logistic function of that value:

\[ \text{Logistic}(x) = \frac{1}{1 + e^{(-x)}} \]

We use logistic regression when our dependent variable is binary. It has only two distinct values, usually encoded as \(0\) or \(1\). See the figure below for a graphical intuition of the logistic function:

using Plots, LaTeXStrings

function logistic(x)
    return 1 / (1 + exp(-x))
end

plot(logistic, -10, 10, label=false,
     xlabel=L"x", ylabel=L"\mathrm{Logistic}(x)")

Logistic Function

As we can see, the logistic function is basically a mapping of any real number to a real number in the range between 0 and 1:

\[ \text{Logistic}: (-\infty, +\infty) \to (0, 1) \]

That is, the logistic function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted between 0 and 1. That is why it is used when we need a model to have a probability as a dependent variable (remembering that any real number between 0 and 1 is a valid probability). In the case of a binary dependent variable, we can use this probability as the chance of the dependent variable taking a value of 0 or 1.
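To make this mapping concrete, here is a quick numeric check reusing the logistic() function defined above (just a sketch; the values are rounded):

logistic(-10)   # ≈ 4.5e-5 -- very negative inputs approach 0
logistic(0)     # = 0.5    -- zero maps exactly to one half
logistic(10)    # ≈ 0.99995 -- very positive inputs approach 1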

Comparison with Linear Regression

Linear regression follows the following mathematical formulation:

\[ \text{Linear} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n \]
  • \(\theta\) - model parameters

    • \(\theta_0\) - intercept

    • \(\theta_1, \theta_2, \dots\) - independent variables \(x_1, x_2, \dots\) coefficients

  • \(n\) - total number of independent variables

Logistic regression would add the logistic function to the linear term:

  • \(\hat{p} = \text{Logistic}(\text{Linear}) = \frac{1}{1 + e^{-\operatorname{Linear}}}\) - predicted probability of the observation being the value 1

  • \(\hat{\mathbf{y}}=\left\{\begin{array}{ll} 0 & \text { if } \hat{p} < 0.5 \\ 1 & \text { if } \hat{p} \geq 0.5 \end{array}\right.\) - predicted discrete value of \(\mathbf{y}\)

Example:

\[ \text{Probability of Death} = \text{Logistic} \big(-10 + 10 \cdot \text{cancer} + 12 \cdot \text{diabetes} + 8 \cdot \text{obesity} \big) \]
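As a worked example, take a hypothetical patient with cancer and obesity but no diabetes, using the made-up coefficients above and the logistic() function defined earlier (the names linear, p_hat, and y_hat are just illustrative):

# hypothetical patient: cancer = 1, diabetes = 0, obesity = 1
linear = -10 + 10 * 1 + 12 * 0 + 8 * 1  # linear term = 8
p_hat = logistic(linear)                # ≈ 0.9997 -- predicted probability of death
y_hat = p_hat ≥ 0.5 ? 1 : 0             # classified as 1 using the 0.5 threshold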

Bayesian Logistic Regression

We can model logistic regression in two ways: the first option uses a Bernoulli likelihood function and the second a binomial likelihood function.

With the Bernoulli likelihood we model a binary dependent variable \(y\) which is the result of a Bernoulli trial with a certain probability \(p\).

With the binomial likelihood, we model a discrete (count) dependent variable \(y\), which is the number of successes in \(n\) independent Bernoulli trials.

Using Bernoulli Likelihood

\[ \begin{aligned} \mathbf{y} &\sim \text{Bernoulli}\left( p \right) \\ \mathbf{p} &= \text{Logistic}(\alpha + \mathbf{X} \cdot \boldsymbol{\beta}) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \end{aligned} \]

where:

  • \(\mathbf{y}\) – binary dependent variable.

  • \(\mathbf{p}\) – probability of \(\mathbf{y}\) taking the value 1, i.e. the success probability of an independent Bernoulli trial.

  • \(\text{Logistic}\) – logistic function.

  • \(\alpha\) – intercept.

  • \(\boldsymbol{\beta}\) – coefficient vector.

  • \(\mathbf{X}\) – data matrix.

Using Binomial Likelihood

\[ \begin{aligned} \mathbf{y} &\sim \text{Binomial}\left( n, p \right) \\ \mathbf{p} &= \text{Logistic}(\alpha + \mathbf{X} \cdot \boldsymbol{\beta}) \\ \alpha &\sim \text{Normal}(\mu_\alpha, \sigma_\alpha) \\ \boldsymbol{\beta} &\sim \text{Normal}(\mu_{\boldsymbol{\beta}}, \sigma_{\boldsymbol{\beta}}) \end{aligned} \]

where:

  • \(\mathbf{y}\) – discrete dependent variable – number of successes in \(n\) independent Bernoulli trials.

  • \(n\) – number of independent Bernoulli trials.

  • \(\mathbf{p}\) – probability of \(\mathbf{y}\) taking the value 1, i.e. the success probability of an independent Bernoulli trial.

  • \(\text{Logistic}\) – logistic function.

  • \(\alpha\) – intercept.

  • \(\boldsymbol{\beta}\) – coefficient vector.

  • \(\mathbf{X}\) – data matrix.

In both likelihood options, what remains is to specify the model parameters' prior distributions:

  • Prior Distribution of \(\alpha\) – Knowledge we possess regarding the model's intercept.

  • Prior Distribution of \(\boldsymbol{\beta}\) – Knowledge we possess regarding the model's independent variables' coefficients.

Our goal is to instantiate a logistic regression with the observed data (\(\mathbf{y}\) and \(\mathbf{X}\)) and find the posterior distribution of our model's parameters of interest (\(\alpha\) and \(\boldsymbol{\beta}\)). This means to find the full posterior distribution of:

\[ P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta} \mid \mathbf{y}) \]

Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter \(\sigma\) in our logistic regression. This is due to neither the Bernoulli nor binomial distributions having a "scale" parameter such as the \(\sigma\) parameter in the Gaussian/normal distribution.

Also note that the Bernoulli distribution is a special case of the binomial distribution where \(n = 1\):

\[ \text{Bernoulli}(p) = \text{Binomial}(1, p) \]
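We can check this equivalence numerically with Distributions.jl (just a quick sanity check, not something the model needs):

using Distributions

# the probability mass functions coincide for both outcomes:
pdf(Bernoulli(0.3), 1), pdf(Binomial(1, 0.3), 1)   # both ≈ 0.3
pdf(Bernoulli(0.3), 0), pdf(Binomial(1, 0.3), 0)   # both ≈ 0.7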

Specifying either of these models is easily accomplished with Turing:

using Turing
using LazyArrays
using Random: seed!
seed!(123)

@model function logreg(X, y; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood
    y ~ arraydist(LazyArray(@~ BernoulliLogit.(α .+ X * β)))
end;

Here I am specifying very weakly informative priors:

  • \(\alpha \sim \text{Normal}(0, 2.5)\) – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should easily cover all plausible values of \(\alpha\). Remember that the normal distribution has support over the whole real line, \((-\infty, +\infty)\).

  • \(\boldsymbol{\beta} \sim \text{Student-}t(0,1,3)\) – The predictors all have a prior distribution of a Student-\(t\) distribution centered on 0 with scale 1 and degrees of freedom \(\nu = 3\). That wide-tailed \(t\) distribution will cover all plausible values for our coefficients. Remember that the Student-\(t\) also has support over the whole real line, \((-\infty, +\infty)\). Also, filldist() is a handy Turing function that takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution the specified number of times. A quick check of what these priors imply on the probability scale is sketched right after this list.
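Here is that quick check of what these priors imply on the probability scale, using Distributions.jl and the logistic() function from before (a sketch to build intuition, not part of the model):

using Distributions

# 95% of the Normal(0, 2.5) prior mass lies roughly between -4.9 and 4.9 in log-odds,
# which on the probability scale covers almost all of (0, 1):
logistic.(quantile.(Normal(0, 2.5), [0.025, 0.975]))   # ≈ [0.007, 0.993]

# the Student-t prior with ν = 3 on each coefficient is similarly permissive:
logistic.(quantile.(TDist(3), [0.025, 0.975]))          # ≈ [0.04, 0.96]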

Turing's arraydist() function wraps an array of distributions and returns a new distribution that samples from the individual distributions. And LazyArrays' LazyArray() constructor wraps a computation that produces an array into a lazy array, so the result is only materialized when needed. Last, but not least, the macro @~ creates a broadcast and is a nice shorthand for the familiar dot . broadcasting operator in Julia. Together these are an efficient way to tell Turing that our y vector is distributed lazily as a BernoulliLogit broadcasted over α added to the product of the data matrix X and the β coefficient vector.
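To make explicit what that lazy broadcast expresses, here is an equivalent, but less efficient, way of writing the same likelihood observation by observation. This is only a sketch for intuition; the broadcasted arraydist version above is the one to use in practice:

using LinearAlgebra: dot

@model function logreg_loop(X, y; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood: one BernoulliLogit per observation
    for i in eachindex(y)
        y[i] ~ BernoulliLogit(α + dot(X[i, :], β))
    end
end;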

If your dependent variable y is a count that represents the number of successes in \(n\) independent Bernoulli trials, you can use the binomial likelihood in your model:

y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
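For completeness, a full binomial-likelihood model could look like the sketch below, where n holds the number of trials for each observation (the model name and the extra n argument are illustrative, not part of the wells example that follows):

@model function logreg_binomial(X, y, n; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood: y[i] successes out of n[i] trials
    y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
end;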

Example - Contaminated Water Wells

For our example, I will use a famous dataset called wells (Gelman & Hill, 2007), which contains data from a survey of 3,200 residents of a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area, and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,200 observations and the following variables:

  • switch – binary/dummy (0 or 1) for well-switching.

  • arsenic – arsenic level in respondent's well.

  • dist – distance (meters) from the respondent's house to the nearest well with safe drinking water.

  • assoc – binary/dummy (0 or 1) indicating whether any member of the household participates in community organizations.

  • educ – years of education (head of household).

Ok let's read our data with CSV.jl and output into a DataFrame from DataFrames.jl:

using DataFrames, CSV, HTTP

url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/wells.csv"
wells = CSV.read(HTTP.get(url).body, DataFrame)
describe(wells)
5×7 DataFrame
 Row │ variable  mean       min    median   max      nmissing  eltype
     │ Symbol    Float64    Real   Float64  Real     Int64     DataType
─────┼──────────────────────────────────────────────────────────────────
   1 │ switch     0.575166  0       1.0       1             0  Int64
   2 │ arsenic    1.65693   0.51    1.3       9.65          0  Float64
   3 │ dist      48.3319    0.387  36.7615  339.531         0  Float64
   4 │ assoc      0.422848  0       0.0       1             0  Int64
   5 │ educ       4.82848   0       5.0      17             0  Int64

As you can see from the describe() output, 58% of the respondents switched wells and 42% of respondents are somehow engaged in community organizations. The average years of education of the household's head is approximately 5 years and ranges from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48m, ranging from less than 1m to 339m. Regarding arsenic levels, I cannot comment much; the only thing I know is that it is toxic and you probably would never want to have your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switching to a safe well.

Now let's instantiate our model with the data:

X = Matrix(select(wells, Not(:switch)))
y = wells[:, :switch]
model = logreg(X, y);

And, finally, we will sample from the Turing model. We will be using the default NUTS() sampler with 2_000 samples per chain, running 4 Markov chains on multiple threads with MCMCThreads():

chain = sample(model, NUTS(), MCMCThreads(), 2_000, 4)
summarystats(chain)
Summary Statistics
  parameters      mean       std   naive_se      mcse         ess      rhat   ess_per_sec
      Symbol   Float64   Float64    Float64   Float64     Float64   Float64       Float64

           α   -0.1560    0.0972     0.0011    0.0017   3605.6597    1.0008       30.8754
        β[1]    0.4679    0.0414     0.0005    0.0007   4344.2928    1.0002       37.2003
        β[2]   -0.0090    0.0011     0.0000    0.0000   9237.5731    0.9997       79.1017
        β[3]   -0.1247    0.0767     0.0009    0.0011   4786.7428    1.0005       40.9891
        β[4]    0.0423    0.0095     0.0001    0.0001   5155.7757    1.0004       44.1491

We had no problem with the Markov chains as all the rhat are well below 1.01 (or above 0.99). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds is defined as:

\[ \text{odds} = \frac{p}{1-p} \]

where \(p\) is a probability. So log-odds is defined as:

\[ \log(\text{odds}) = \log \left( \frac{p}{1-p} \right) \]

So in order to get the odds from the log-odds we must undo the log operation with exponentiation. This translates to:

\[ \text{odds} = \exp ( \log ( \text{odds} )) \]
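For instance, a probability of 0.75 corresponds to odds of 3, and exponentiating the log-odds recovers those same odds (a quick numeric round trip):

p = 0.75
odds = p / (1 - p)     # 3.0
log_odds = log(odds)   # ≈ 1.0986
exp(log_odds)          # back to ≈ 3.0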

We can do this with a transformation in a DataFrame constructed from a Chains object:

using Chain

@chain quantile(chain) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(exp),
        renamecols=false)
end
5×6 DataFrame
 Row │ parameters  2.5%      25.0%     50.0%     75.0%     97.5%
     │ Symbol      Float64   Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────
   1 │ α           0.709482  0.799411  0.856869  0.91328   1.03653
   2 │ β[1]        1.47384   1.55248   1.59518   1.64188   1.73346
   3 │ β[2]        0.988966  0.990344  0.991066  0.991741  0.993117
   4 │ β[3]        0.759348  0.837795  0.882745  0.930137  1.02586
   5 │ β[4]        1.02434   1.03653   1.04325   1.04994   1.06333

Our interpretation of odds is the same as in betting games. Anything below 1 signals that \(y = 1\) is less likely than \(y = 0\), and anything above 1 signals that \(y = 1\) is more likely, while 1 itself means even odds of \(y\) being either \(1\) or \(0\). Since I am not a gambling man, let's talk about probabilities. So I will create a function called logodds2prob() that converts log-odds to probabilities:

function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end

@chain quantile(chain) begin
    DataFrame
    select(_,
        :parameters,
        names(_, r"%") .=> ByRow(logodds2prob),
        renamecols=false)
end
5×6 DataFrame
 Row │ parameters  2.5%      25.0%     50.0%     75.0%     97.5%
     │ Symbol      Float64   Float64   Float64   Float64   Float64
─────┼──────────────────────────────────────────────────────────────
   1 │ α           0.415027  0.444263  0.461459  0.477337  0.508969
   2 │ β[1]        0.595769  0.608224  0.61467   0.621481  0.634163
   3 │ β[2]        0.497226  0.497574  0.497757  0.497927  0.498273
   4 │ β[3]        0.431608  0.45587   0.468861  0.481902  0.506381
   5 │ β[4]        0.506012  0.508968  0.510584  0.51218   0.515347

There you go, much better now. Let's analyze our results. The intercept α is the basal switch probability, which has a median value of 46%. Any coefficient whose 95% credible interval captures the value \(\frac{1}{2} = 0.5\) tells us that its effect on the propensity to switch is inconclusive. This is analogous to a 95% credible interval that captures 0 for linear regression coefficients. So this rules out β[3], which is the third column of X, assoc. The remaining 95% credible intervals can be interpreted as follows:

  • β[1] – first column of X, arsenic, has a 95% credible interval from 0.595 to 0.634. This means that each increase of one unit in arsenic is related to an increase of 9.6% to 13.4% in the propensity of switch being 1.

  • β[2] – second column of X, dist, has a 95% credible interval from 0.497 to 0.498. So we expect that each increase of one meter in dist is related to a decrease of roughly 0.2% to 0.3% in the propensity of switch being 1.

  • β[4] – fourth column of X, educ, has a 95% credible interval from 0.506 to 0.515. Each increase of one year in educ is related to an increase of 0.6% to 1.5% in the propensity of switch being 1.

That's how you interpret 95% credible intervals from a quantile() output of a logistic regression Chains object converted from log-odds to probability.

Footnotes

[1] actually the logit function or the log-odds is the logarithm of the odds \(\frac{p}{1-p}\) where \(p\) is a probability.

References

Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.