Bayesian Logistic Regression
Leaving the universe of linear models, we start to venture into generalized linear models (GLM). The first is logistic regression (also called binomial regression).
A logistic regression behaves exactly like a linear model: it makes a prediction simply by computing a weighted sum of the independent variables $\mathbf{X}$ by the estimated coefficients $\boldsymbol{\beta}$, plus an intercept $\alpha$. However, instead of returning a continuous value $y$, such as linear regression, it returns the logistic function of this value:

$$ \hat{y} = \mathrm{Logistic}(\alpha + \mathbf{X} \boldsymbol{\beta}) $$
We use logistic regression when our dependent variable is binary. It has only two distinct values, usually encoded as 0 or 1. See the figure below for a graphical intuition of the logistic function:
using CairoMakie

# the logistic (sigmoid) function
function logistic(x)
    return 1 / (1 + exp(-x))
end

f, ax, l = lines(-10 .. 10, logistic; axis=(; xlabel=L"x", ylabel=L"\mathrm{Logistic}(x)"))
f
As we can see, the logistic function is basically a mapping of any real number to a real number in the range between 0 and 1:

$$ \mathrm{Logistic}(x) = \frac{1}{1 + e^{-x}} $$
That is, the logistic function is the ideal candidate for when we need to convert something continuous without restrictions to something continuous restricted between 0 and 1. That is why it is used when we need a model to have a probability as a dependent variable (remembering that any real number between 0 and 1 is a valid probability). In the case of a binary dependent variable, we can use this probability as the chance of the dependent variable taking a value of 0 or 1.
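To make that mapping concrete, here is a quick numeric check reusing the `logistic` function defined above: large negative inputs land near 0, zero maps to exactly 0.5, and large positive inputs land near 1.

```julia
# Reusing the logistic function defined above.
logistic.([-5, 0, 5])
# ≈ [0.0067, 0.5, 0.9933]
```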
Comparison with Linear Regression
Linear regression follows the following mathematical formulation:

$$ \mathrm{Linear} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_k x_k $$

- $\boldsymbol{\theta}$ - model parameters
- $\theta_0$ - intercept
- $\theta_1, \theta_2, \dots, \theta_k$ - independent variables' coefficients
- $k$ - total number of independent variables
Logistic regression would add the logistic function to the linear term:

- $\hat{p} = \mathrm{Logistic}(\mathrm{Linear}) = \frac{1}{1 + e^{-\mathrm{Linear}}}$ - predicted probability of the observation being the value 1
- $\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$ - predicted discrete value of $y$
Example: with an intercept of $-1$ and coefficients $2$ and $-0.5$ for two independent variables $x_1$ and $x_2$ (made-up numbers, just for illustration):

$$ \hat{p} = \mathrm{Logistic}(-1 + 2 x_1 - 0.5 x_2) $$
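The same computation as a minimal sketch in Julia, reusing the `logistic` function defined earlier (again, all numbers are made up purely for illustration):

```julia
# Made-up values, purely to illustrate the prediction rule above.
α = -1.0               # intercept
β = [2.0, -0.5]        # coefficients of the independent variables
x = [1.2, 3.0]         # one observation of the independent variables

p̂ = logistic(α + β[1] * x[1] + β[2] * x[2])   # predicted probability of y being 1 (≈ 0.475)
ŷ = p̂ ≥ 0.5 ? 1 : 0                            # predicted discrete value of y (0 here)
```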
Bayesian Logistic Regression
We can model logistic regression in two ways: the first with a Bernoulli likelihood function and the second with a binomial likelihood function.

With the Bernoulli likelihood we model a binary dependent variable $y$ which is the result of a Bernoulli trial with a certain probability $p$.

With the binomial likelihood, we model a discrete dependent variable $y$ which is the number of successes of $n$ independent Bernoulli trials.
Using Bernoulli Likelihood

$$
\begin{aligned}
\mathbf{y} &\sim \text{Bernoulli}(p) \\
p &= \text{Logistic}(\alpha + \mathbf{X} \boldsymbol{\beta})
\end{aligned}
$$

where:

- $\mathbf{y}$ – binary dependent variable.
- $p$ – probability of $\mathbf{y}$ taking the value of 1 – success of an independent Bernoulli trial.
- $\text{Logistic}$ – logistic function.
- $\alpha$ – intercept.
- $\boldsymbol{\beta}$ – coefficient vector.
- $\mathbf{X}$ – data matrix.
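To build intuition for what this specification says generatively, here is a small simulation sketch (all names, sizes, and "true" parameter values below are arbitrary assumptions, not part of the wells example that follows):

```julia
# Simulating data from the Bernoulli-likelihood model above.
# All sizes and "true" parameter values are arbitrary, for illustration only.
using Distributions
using Random

Random.seed!(1)
n_obs, k = 100, 2
X_sim = randn(n_obs, k)                   # data matrix
α_true = 0.5                              # intercept
β_true = [1.0, -2.0]                      # coefficient vector
p = logistic.(α_true .+ X_sim * β_true)   # per-observation probability of success
y_sim = rand.(Bernoulli.(p))              # binary dependent variable
```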
Using Binomial Likelihood

$$
\begin{aligned}
\mathbf{y} &\sim \text{Binomial}(n, p) \\
p &= \text{Logistic}(\alpha + \mathbf{X} \boldsymbol{\beta})
\end{aligned}
$$

where:

- $\mathbf{y}$ – discrete dependent variable – the number of successes of $n$ independent Bernoulli trials.
- $n$ – number of independent Bernoulli trials.
- $p$ – probability of success of an independent Bernoulli trial.
- $\text{Logistic}$ – logistic function.
- $\alpha$ – intercept.
- $\boldsymbol{\beta}$ – coefficient vector.
- $\mathbf{X}$ – data matrix.
In both likelihood options, what remains is to specify the model parameters' prior distributions:
- Prior Distribution of $\alpha$ – Knowledge we possess regarding the model's intercept.
- Prior Distribution of $\boldsymbol{\beta}$ – Knowledge we possess regarding the model's independent variables' coefficients.
Our goal is to instantiate a logistic regression with the observed data ($\mathbf{y}$ and $\mathbf{X}$) and find the posterior distribution of our model's parameters of interest ($\alpha$ and $\boldsymbol{\beta}$). This means to find the full posterior distribution of:

$$ P(\boldsymbol{\theta} \mid \mathbf{y}) = P(\alpha, \boldsymbol{\beta} \mid \mathbf{y}) $$
Note that contrary to the linear regression, which used a Gaussian/normal likelihood function, we don't have an error parameter $\sigma$ in our logistic regression. This is because neither the Bernoulli nor the binomial distribution has a "scale" parameter such as the $\sigma$ parameter in the Gaussian/normal distribution.
Also note that the Bernoulli distribution is a special case of the binomial distribution where $n = 1$:

$$ \text{Bernoulli}(p) = \text{Binomial}(1, p) $$
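We can check this quickly with Distributions.jl; both distributions assign the same probabilities to the outcomes 0 and 1:

```julia
using Distributions

pdf.(Bernoulli(0.3), [0, 1])     # [0.7, 0.3]
pdf.(Binomial(1, 0.3), [0, 1])   # [0.7, 0.3] (up to floating point)
```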
This is easily accomplished with Turing:
using Turing
using LazyArrays
using Random: seed!
seed!(123)

@model function logreg(X, y; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood
    return y ~ arraydist(LazyArray(@~ BernoulliLogit.(α .+ X * β)))
end;
Here I am specifying very weakly informative priors:
- $\alpha \sim \text{Normal}(0, 2.5)$ – This means a normal distribution centered on 0 with a standard deviation of 2.5. That prior should with ease cover all possible values of $\alpha$. Remember that the normal distribution has support over all the real number line $\in (-\infty, +\infty)$.
- $\boldsymbol{\beta} \sim \text{Student-}t(0, 1, 3)$ – The predictors all have a prior distribution of a Student-$t$ distribution centered on 0 with scale 1 and degrees of freedom $\nu = 3$. That wide-tailed distribution will cover all possible values for our coefficients. Remember the Student-$t$ also has support over all the real number line $\in (-\infty, +\infty)$.

Also, `filldist()` is a nice function from Turing which takes any univariate or multivariate distribution and returns another distribution that repeats the input distribution. Turing's `arraydist()` function wraps an array of distributions, returning a new distribution that samples from the individual distributions. And LazyArrays' `LazyArray()` constructor wraps a lazy object representing a computation that produces an array. Last, but not least, the macro `@~` creates a broadcast and is a nice shorthand for the familiar dot `.` broadcasting operator in Julia. This is an efficient way to tell Turing that our `y` vector is distributed lazily as a `BernoulliLogit` broadcasted to `α` added to the product of the data matrix `X` and the `β` coefficient vector.
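If the lazy one-liner feels opaque, here is a more verbose (and slower) way of writing essentially the same likelihood, looping over observations; this sketch is only for intuition, not a replacement for the model above:

```julia
# For intuition only: a loop-based version of the same Bernoulli likelihood.
@model function logreg_loop(X, y; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood: each yᵢ is a Bernoulli trial whose log-odds are α + Xᵢ ⋅ β
    logits = α .+ X * β
    for i in eachindex(y)
        y[i] ~ BernoulliLogit(logits[i])
    end
end;
```

Depending on your Turing version, a broadcasted statement such as `y .~ BernoulliLogit.(α .+ X * β)` may also work; the `LazyArray` version in the model above just avoids materializing the intermediate array of distributions.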
If your dependent variable `y` is a count that represents the number of successes of $n$ independent Bernoulli trials, you can use the binomial likelihood in your model:
y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
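For completeness, here is a sketch of what a full binomial-likelihood model could look like (the name `logreg_binomial` is just a placeholder, and `n` holds the number of Bernoulli trials behind each observation):

```julia
# Sketch of a binomial-likelihood version; `n` holds the number of
# independent Bernoulli trials behind each observation in `y`.
@model function logreg_binomial(X, y, n; predictors=size(X, 2))
    # priors
    α ~ Normal(0, 2.5)
    β ~ filldist(TDist(3), predictors)

    # likelihood
    return y ~ arraydist(LazyArray(@~ BinomialLogit.(n, α .+ X * β)))
end;
```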
Example - Contaminated Water Wells
For our example, I will use a famous dataset called `wells` (Gelman & Hill, 2007), which is data from a survey of 3,200 residents in a small area of Bangladesh suffering from arsenic contamination of groundwater. Respondents with elevated arsenic levels in their wells had been encouraged to switch their water source to a safe public or private well in the nearby area, and the survey was conducted several years later to learn which of the affected residents had switched wells. It has 3,200 observations and the following variables:
- `switch` – binary/dummy (0 or 1) for well-switching.
- `arsenic` – arsenic level in respondent's well.
- `dist` – distance (meters) from the respondent's house to the nearest well with safe drinking water.
- `assoc` – binary/dummy (0 or 1) if member(s) of the household participate in community organizations.
- `educ` – years of education (head of household).
Ok, let's read our data with CSV.jl and output it into a `DataFrame` from DataFrames.jl:
using DataFrames
using CSV
using HTTP
url = "https://raw.githubusercontent.com/storopoli/Bayesian-Julia/master/datasets/wells.csv"
wells = CSV.read(HTTP.get(url).body, DataFrame)
describe(wells)
5×7 DataFrame
Row │ variable mean min median max nmissing eltype
│ Symbol Float64 Real Float64 Real Int64 DataType
─────┼──────────────────────────────────────────────────────────────────
1 │ switch 0.575166 0 1.0 1 0 Int64
2 │ arsenic 1.65693 0.51 1.3 9.65 0 Float64
3 │ dist 48.3319 0.387 36.7615 339.531 0 Float64
4 │ assoc 0.422848 0 0.0 1 0 Int64
5 │ educ 4.82848 0 5.0 17 0 Int64
As you can see from the `describe()` output, 58% of the respondents switched wells and 42% of respondents are somehow engaged in community organizations. The average years of education of the household's head is approximately 5 years, ranging from 0 (no education at all) to 17 years. The distance to safe drinking water is measured in meters and averages 48m, ranging from less than 1m to 339m. Regarding arsenic levels I cannot comment, because the only thing I know is that it is toxic and you probably would never want to have your well contaminated with it. Here, we believe that all of those variables somehow influence the probability of a respondent switching to a safe well.
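If you want to double-check a couple of those numbers straight from the `DataFrame`, the proportions are just column means:

```julia
using Statistics: mean

# Proportion of respondents who switched wells and who participate in
# community organizations (≈ 0.58 and ≈ 0.42, matching describe() above).
mean(wells.switch), mean(wells.assoc)
```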
Now let us instantiate our model with the data:
X = Matrix(select(wells, Not(:switch)))
y = wells[:, :switch]
model = logreg(X, y);
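Since the order of the columns in `X` determines which coefficient in `β` belongs to which predictor, it's worth checking it explicitly; it is simply the `wells` columns without `switch`:

```julia
# Column order of X, and hence of the β coefficients:
# β[1] ↦ arsenic, β[2] ↦ dist, β[3] ↦ assoc, β[4] ↦ educ
names(select(wells, Not(:switch)))
```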
And, finally, we will sample from the Turing model. We will be using the default `NUTS()` sampler with `1_000` samples, with 4 Markov chains using multiple threads `MCMCThreads()`:
chain = sample(model, NUTS(), MCMCThreads(), 1_000, 4)
summarystats(chain)
Summary Statistics
parameters mean std mcse ess_bulk ess_tail rhat ess_per_sec
Symbol Float64 Float64 Float64 Float64 Float64 Float64 Float64
α -0.1537 0.1008 0.0026 1487.6006 2091.0499 1.0020 26.5174
β[1] 0.4664 0.0419 0.0009 2237.3707 2405.0432 1.0010 39.8825
β[2] -0.0090 0.0010 0.0000 4269.6543 3192.4775 1.0009 76.1093
β[3] -0.1226 0.0777 0.0019 1680.2431 1877.4329 1.0002 29.9514
β[4] 0.0424 0.0096 0.0002 2656.3110 2520.0618 1.0020 47.3504
We had no problem with the Markov chains as all the `rhat` are well below 1.01 (or above 0.99). Note that the coefficients are in log-odds scale. They are the natural log of the odds[1], and odds is defined as:

$$ \text{odds} = \frac{p}{1-p} $$

where $p$ is a probability. So log-odds is defined as:

$$ \log(\text{odds}) = \log \left( \frac{p}{1-p} \right) $$

So in order to get odds from log-odds we must undo the log operation with an exponentiation. This translates to:

$$ \text{odds} = \exp(\log(\text{odds})) $$
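For instance, taking the posterior mean of `β[1]` (the `arsenic` coefficient) from the `summarystats()` output above:

```julia
# log-odds → odds for the posterior mean of β[1] reported above
exp(0.4664)   # ≈ 1.59
```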
We can do this for all parameters at once with a transformation in a `DataFrame` constructed from a `Chains` object:
using Chain

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(exp); renamecols=false)
end
5×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ α 0.700906 0.801908 0.859547 0.91731 1.04274
2 │ β[1] 1.47317 1.54865 1.59402 1.63747 1.7351
3 │ β[2] 0.989048 0.990354 0.991039 0.991766 0.993111
4 │ β[3] 0.75859 0.84017 0.885127 0.931194 1.03098
5 │ β[4] 1.0235 1.03671 1.04353 1.0502 1.06242
Our interpretation of odds is the same as in betting games. Anything below 1 signals an unlikely probability that $y$ will be 1. And anything above 1 increases the probability of $y$ being 1, while 1 itself is neutral odds for $y$ being either 1 or 0. Since I am not a gambling man, let's talk about probabilities. So I will create a function called `logodds2prob()` that converts log-odds to probabilities:
function logodds2prob(logodds::Float64)
    return exp(logodds) / (1 + exp(logodds))
end

@chain quantile(chain) begin
    DataFrame
    select(_, :parameters, names(_, r"%") .=> ByRow(logodds2prob); renamecols=false)
end
5×6 DataFrame
Row │ parameters 2.5% 25.0% 50.0% 75.0% 97.5%
│ Symbol Float64 Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────
1 │ α 0.412078 0.445033 0.462234 0.478436 0.510462
2 │ β[1] 0.59566 0.607635 0.614497 0.620848 0.634383
3 │ β[2] 0.497247 0.497577 0.49775 0.497933 0.498272
4 │ β[3] 0.431363 0.456572 0.469532 0.482186 0.507626
5 │ β[4] 0.505807 0.509012 0.510649 0.512243 0.515133
There you go, much better now. Let's analyze our results. The intercept `α` is the basal `switch` probability which has a median value of 46%. All coefficients whose 95% credible intervals capture the value 0.5 tell us that the effect on the propensity of `switch` is inconclusive. This is pretty much similar to a 95% credible interval that captures 0 in linear regression coefficients. So this rules out `β[3]`, which is the third column of `X` – `assoc`. The other remaining 95% credible intervals can be interpreted as follows:
- `β[1]` – first column of `X`, `arsenic`, has a 95% credible interval from 0.595 to 0.634. This means that each increase of one unit in `arsenic` is related to an increase of 9.6% to 13.4% in the propensity of `switch` being 1.
- `β[2]` – second column of `X`, `dist`, has a 95% credible interval from 0.497 to 0.498. So we expect that each increase of one meter in `dist` is related to a decrease of 0.01% in the propensity of `switch` being 1.
- `β[4]` – fourth column of `X`, `educ`, has a 95% credible interval from 0.506 to 0.515. Each increase of one year in `educ` is related to an increase of 0.6% to 1.5% in the propensity of `switch` being 1.
That's how you interpret 95% credible intervals from a `quantile()` output of a logistic regression `Chains` object converted from log-odds to probability.
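If you want those "percentage point" readings as numbers, they are just the probability-scale credible-interval bounds minus the neutral 0.5; here is a tiny hypothetical helper mirroring that interpretation, applied to `arsenic`'s 95% credible interval from the table above:

```julia
# Hypothetical helper: distance from the neutral probability of 0.5,
# in percentage points, as used in the interpretation above.
propensity_change(p) = (p - 0.5) * 100

propensity_change.([0.59566, 0.634383])   # ≈ [9.6, 13.4] for arsenic
```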
Footnotes
[1] Actually the logit function or the log-odds is the logarithm of the odds $\frac{p}{1-p}$, where $p$ is a probability.
References
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.