BBR: Bayesian Logistic Regression Software

Alexander Genkin, David D. Lewis, and David Madigan

Last update: Nov 16, 2005

 New!    in ver.3.0: Hierarchical modeling

 New!    in ver.3.0: Self-tuning of hyperparameter

 New!    Estimates of uncertainty of model parameters

see also our Bayesian Multinomial Regression software


See www.bayesianregression.com for pointers to the latest versions of BBR, BMR, and BXR.



Overview

Download: Source code and Binaries

User Guide

Acknowledgments

Hierarchical Modeling  New!   

Legal Notice

Estimating uncertainty of model parameters  New!   

Feedback



Overview

This software implements Bayesian logistic regression with two choices of priors: Gaussian and Laplace (also known as double exponential). A detailed technical report presenting theoretical background on the approach, the fitting algorithm, and experimental results can be found at http://stat.rutgers.edu/~madigan/PAPERS/techno-06-09-18.pdf (Technometrics, 2006, to appear).

A general binary regression model takes the form:

where y is the class label, 1 or -1; x is the predictor vector extended with 1 to take care of the intercept; β is the vector of model parameters; and ψ is the link function. We use the logistic link function:

The program finds the maximum a posteriori (MAP) estimate of the parameter vector β under one of two choices for the prior distribution, Gaussian or Laplace. The latter is given by the formula:

where βj is a component of the vector of parameters. BBR assumes that the priors for the components are independent, so that the overall prior on the parameter vector is the product of the priors on its individual components. In the simplest case all prior components have a mode of 0 and the same variance, but this can be modified (see the Individual Priors section below). The Laplace prior favors a sparse solution: the MAP estimate of the parameter vector β tends to have many components equal to zero. For this reason the Laplace prior or, equivalently, L1 penalization of parameters, has been widely investigated as an alternative to feature selection, dating back at least to the LASSO algorithm of Tibshirani (1996).

To find the MAP parameter estimates, BBR uses a minor variant of the coordinate descent algorithm of Zhang and Oles (2001).

Modeling options are described below; see also sections on Hierarchical Modeling and Estimating uncertainty of model parameters.

Specification of the overall prior

The user specifies the overall prior by choosing 1) the form of the prior on each model parameter (Gaussian or Laplace); 2) the skew of the prior; and 3) the hyperparameter value, i.e. the common variance of those parameterwise priors. (Note in the Laplace case variance is equal to 2/λ2 ). The mode of the prior is always 0. The -p option is used the specify the form of the prior, with Gaussian the default.

For the Laplace prior, the user may choose to impose a skew on the prior, which allows model parameters in the solution to take on only non-negative or non-positive values. (A skew may be specified for a Gaussian prior also, but the only legal value for skew is the default, 0.) Strictly speaking, if one specifies that BBR should use a Laplace prior with mode 0 (the default), variance v (i.e. λ = sqrt(2/v)), and positive skew, BBR actually uses an exponential prior with variance v (i.e. exponential parameter λ = sqrt(1/v)). If mode 0, variance v, and negative skew are specified, BBR uses an exponential prior (with variance v) on the negation of the parameter values.

Nonzero skews may be specified for individual priors (see below) as well. Individual priors can have modes other than 0, and if a mode other than 0 is specified with a skewed Laplace, BBR uses an exponential or negated exponential prior shifted accordingly.

The variance of the prior can be set simply by specifying a single value to the -V option. More flexible approaches to setting the variance are possible:

Norm-based default for prior variance

If -V is omitted, BBR sets the prior variance equal to d/m, where d is the number of distinct features appearing in the training examples, and m is the mean 2-norm of the training examples. If feature selection is enabled (see below), then d and m are calculated using only the selected features.

Selecting prior variance through cross-validation

If multiple possible prior variance values are specified with the -V option, BBR will choose among them using k-fold cross validation. The training data is randomly split into k disjoint subsets, with class frequencies balanced among the subsets to the extent possible. For each prior variance value, we train k logistic regression models under a prior with that hyperparameter. Each model is trained on the union of k-1 of the subsets, and tested on the remaining subset, with each subset used as the test subset once. BBR then selects the prior variance that maximizes the average value of the log-likelihood of a training instance when it appears in the test subset.

However, if in addition the --errbar option is specified, BBR instead uses the “one-standard error” version of cross-validation described by Hastie et al. (2001). In this approach, BBR again computes the mean log-likelihood of instances under each prior variance, and finds the prior variance value that maximizes the mean log-likelihood of instances. Let lmax be that maximal mean. For the prior variance value that gives lmax, BBR computes the mean log-likelihood of instances for each of the test subsets separately, and computes the standard deviation, smax, of these k quantities. BBR then chooses the smallest prior variance whose corresponding mean log-likelihood is greater than or equal to lmax - smax. The effect of this approach is to reduce the danger of overfitting that choosing the "best" prior variance on the list might otherwise introduce.

By default BBR uses full k-fold cross-validation with k=10. The user can specify a positive integer value of k by the option -C k. If they specify -C k1,k2, with integers k1 ≥ k2 ≥ 1, BBR performs a kind of "fractional" cross-validation. It splits the training data into k1 pieces, but trains and runs models on only k2 of the unions of training subsets, applying those models to the corresponding test subsets. Selection is based on mean loglikelihood over instances in just those k2 test subsets.

How to choose a good set of prior variances to specify with -V is not well understood. For some problems a geometric progression centered on 1, e.g. 0.001, 0.01, 0.1, 1, 10, 100, 1000, may make sense, but users should experiment with their own data.

Self-tuning  New!   

Selecting prior variance value from a set through cross-validation is a widely and successfully used practice with regression, SVM, and other regularized modeling techniques. However, the researcher is left with several questions that are hard to answer:

– is the set of choices wide enough to cover the optimal value?

– is the set of choices fine enough to approach the optimal value closely?

– isn't there an overfit associated with selecting the best value from the set?

Here we are trying to address these issues, and also to relieve the user from the burden of defining the set of choices. Cross-validation is still used to evaluate each particular value of prior variance, but the set of candidate values to evaluate is defined fully automatically.

Most researchers agree that in the overwhelming majority of cases test set loglikelihood of the model as a function of prior variance is unimodal. Our experiments show that, as a function of the logarithm of prior variance, it can be fairly well approximated by a parabole. So we are basing our search for prior variance value on this heuristic.

Search starts with the norm-based default, then makes a step towards smaller prior variance. After that the search continues in the direction of increasing loglikelihood as long as it is monotone. As soon as monotonicity has been violated, the quadratic approximation kicks in: next step is made to the maximum of the approximation, which is re-evaluated after each step. The search stops when the new maximum predicted by the approximation is not significantly larger than what already has been acheved or when the step becomes too small. Experiments with this strategy show results very close to those of achieved by regular cross-validation strategy with carefully selected grid of choices.

Self-tuning is enabled by the command-line option --autosearch. Cross-validation is governed by -C option, exactly like above.

Individual priors

The Bayesian approach is attractive for its ability to incorporate prior knowledge into statistical inference. As a step in this direction, BBRtrain allows a different prior to be specified for each model parameter. The use can specify the mode, variance, and skew (for Laplace) of the prior for any parameter (including the intercept). However, all priors must have the same form (Gaussian or Laplace/exponential) as specified by the -p option (or default). The general prior specified by -p and -V is used for all model parameters whose prior is not specified individually.

Note that the variance of an individual prior is specified relative to the variance of the general prior. If v is specified as the variance of the prior on a particular parameter, and the variance of the general prior (specified by user, set by default, or chosen by cross-validation) is v0, then the actual variance used for the parameter is v·v0. This allows users to indicate the relative size of variances, while letting the norm-based default or cross-validation be used to tune the absolute variances. A user who wishes to specify exact values for variances in individual priors can do this by forcing the general variance to be 1.0 (option -V 1).

Individual prior variances can also be specified as infinite, in which case the overall prior variance is ignored for that parameter. This is equivalent to an uninformative prior on this parameter, with all parameter values treated as equally plausible.

Setting classification threshold

Logistic regression models estimate the probability that a data vector belongs to the class with label 1. Classification with a logistic regression model typically uses a threshold: we assign a case to class 1 iff the probability estimate is greater or equal to the threshold value.

After the final parameter vector is produced, it is applied to all training examples to produce an estimate of their class membership probability. The examples (and their corresonding class labels) are sorted on this value. A threshold on the estimated probabilities is then chosen so as to maximize the specified effectiveness measure on the training set.

The effectiveness measures available for tuning are specified in terms of this confusion matrix:


Correct label

1

-1

Prediction

1

a

b

-1

c

d

The program offers the following choices for threshold tuning criteria:

  1. no tuning; threshold is equal to 0.5 (default)

  2. sum of errors = b+c

  3. balanced error rate BER = (c/(a+c) + b/(b+d))/2

  4. T11U = 2*a - b

  5. F1 = (2*a)/(2*a + b + c)

  6. T13U = 20*a - b

T11U is a measure used in the TREC-11 / TREC 2002 filtering evaluation (Robertson & Soboroff, 2002), and T13U is a measure used in the TREC-13 / TREC 2004 genomics evaluation (Hersh, 2005). F1 is Van Rijsbergen's F-measure with the tradeoff parameter set to 1 (Van Rijsbergen, 1972; Van Rijsbergen, 1979; Lewis, 1995).

There is also an option to set the threshold at a probability value specified by user. This of course discards any of the tuning options.

Feature selection

BBR allows the user to select features that will be used in modeling. The user chooses a feature quality criterion and a number of features, k say, that should be used in modeling. BBR computes the criterion value for each of the d features that takes on a nonzero value in the training data, and selects the k features with the highest value on the criterion. Four feature selection criteria are supported: Pearson's correlation coefficient, chi-square, Yule's Q, and bi-normal separation (BNS) (Forman, 2002).

None of the current feature selection criterion take into account the prior on parameter values in computing criterion values. However, any feature for which the user specifies an individual prior is always selected, regardless of its criterion value, and even if that feature never takes on a nonzero value on the training data. This selection occurs after the top k features are chosen using the specified criterion. Thus if the user has specified individual priors on c features, the actual number of features selected can range from min(k, d) up to min(k+c, d).

Standardization

This optional data transformation centers and scales (Ryan, 1997) each input feature to have a sample mean of 0 and a sample standard deviation of 1.0 on the training data. In other words, each feature value xj is replaced with (xj - aj)/sj, when aj is the sample mean and sj is the sample standard deviation.

Centering and scaling is a common, if sometimes controversial, strategy in maximum likelihood linear regression. The primary purpose is to avoid numerical difficulties that would result from inverting an ill-conditioned data matrix. Its benefits in Bayesian logistic regression are unclear. For logistic (or linear) regression with a Bayesian prior, ill-conditioning is much less of a problem, and our algorithm in any case does not perform matrix inversion. Centering also destroys any sparseness that the input vector may have, adding to memory and CPU load. On the other hand, centering and scaling may make the best choice of prior distribution more similar between features and/or easier to set manually.

Note that when centering and scaling is used during training, BMRtrain outputs a model that is intended to be applied to new in raw data form. Test examples do not need to be and should not be, centered and scaled themselves. BMRtrain accomplishes this by centering and scaling the training data, training a model on the transformed data, and then adjusting the model parameters to be appropriate for raw data. Each βj from the model trained on centered and scaled data is replaced by βj/sj, and βj·aj/sj is subtracted from the intercept term. This gives the same result on new data as if the new data had been centered and scaled using the aj and sj values from the training set, and the original fitted model applied.

Cosine normalization

This optional data transformation centrally projects each data vector onto the unit Euclidian sphere, giving it a 2-norm of 1.0. After that the dot product of any two vectors is equal to the cosine of the angle between those vectors, hence the name. Cosine normalization is popular in text classification because it helps to compensate for variations in document length.

At the classification step, if there are features in the data that never occurred in training, these features are ignored and do not paricipate in 2-norm calculation or any subsequent operations.

Notes on data transformations

1. The constant feature 1 that corresponds to the intercept terms of the logistic regression model does not participate in, and is not affected by either centering and scaling or cosine normalization.

2. If both standardization (-s) and cosine normalization (-c) are specified then standardization is applied first. This is usually undesirable.

3. If both feature selection and cosine normalization are enabled, cosine normalization is applied first.

User Guide

This software consists of two executable modules: the training module and the classification module. The training module takes a training data file and optional individual priors file as inputs and generates a model file as output. The classification module takes the model file and a test data file as inputs, and outputs a results file with predicted class probabilities and class labels for the test data.

File formats

The Data file format for training or testing follows the same format as Joachims' SVMlight software for training support vector machines (SVM). Each line represents an instance. The line format is:

<label>{ <feature_id>:<value>}*

The label may take value 1, -1, or (for test data only) 0. Each feature_id is must be a positive integer, and each value a number in double float notation. The feature IDs should contain no duplicates, and a feature value of 0 is assumed for any feature ID which is not present. The order of features in not important; it is allowed for a line to have no feature/value pairs at all. Lines starting with '#' are ignored and can be used for comments.

A modified format of Data file is required for Hierarchical Modeling.

The Results file lines correspond to cases in the same order as in the data file, which could be training or test data. Each line has two fields: score and label predicted by the model (1 or -1):

<score> <label>

Score is the model's estimate of the probability for the case to have label 1, while label is the label assigned. For instance, with the default threshold a label is 1 if the score is ≥0.5, and -1 otherwise.

The Individual priors file has one line per model parameter for which a user wishes to specify an explicit prior. (The file can be omitted if the user does not wish to specify any explicit priors, and an empty file is allowed with the same effect.) There should not be multiple lines for the same model parameter. The line format is:

<feature_id> <mode> <variance> [<skew>]

The first value is the feature ID, with a feature ID of 0 used to specify the prior for the intercept term. The mode is the mode of the prior, and can be any real value. The variance can be any nonnegative number, or the string "inf". The latter means infinity, indicating that no penalty should be imposed on the value of this parameter. A variance of 0 can be specified: this sets the parameter to the value specified by the mode and does not allow fitting to change that value.

The skew specification is meaningful only for the Laplace prior. (A skew can be specified for Gaussian priors, but 0, the default, is the only legal value.) The skew is optional with a default value of 0. It can take one of three values: 0 (Laplace prior with no hard restriction on parameter values), 1 (only allow parameter values that are greater or equal to the prior mode), or -1 (only allow parameter values that are less or equal to the prior mode). The rest of the line is ignored by the program, and can be used for comments.

Training module

The training program is called from the command line as:

BBRtrain [options] training_data_file model_file

where the options are:

-p <[1,2]>, Type of prior, 1-Laplace 2-Gaussian (default is 2)

-V <number[,number]*>, Prior variance values; if more than one, cross-validation will be used

-C <integer[,integer]>, Cross-validation: number of folds, number of runs. If the number of runs is not given, it's assumed equal to the number of folds. Default is 10,10

--errbar Use “one-standard error” rule to select the prior variance after cross-validation. Ignored unless multiple variances specified with -V

--autosearch  New!   Enable self-tuning of the prior variance. Do not use -V or --errbar options with it.

-I <indPriorsFile> Individual Priors file

--pos Forces all model parameters to be only greater or equal to the prior mode (unless overridden by individual priors). Allowed only with a Laplace prior.

--neg Forces all model parameters to be only less or equal to the prior mode(unless overridden by individual priors). Allowed only with a Laplace prior.

-s Standardize variables in input vectors.

-c Cosine normalize input vectors.

-e <float>, Convergence threshold. Default is 0.0005 (see tech report for details).

--iter <integer>, Maximum number of passes allowed over parameter vector during fitting. Default is 1,000

--accurate High accuracy mode that attempts to limit accumulation of numerical errors. Increases training time. (Default is not to do this.)

-t <[0..5]>, Threshold tuning, 0: No tuning (threshold is 0.5). 1: Number of errors. 2: T11U. 3: F1. 4: Balanced error rate. 5: T13U. (Default is 0.)

--probthres <p>, where 0<=p<=1 Set classification threshold at the given probability value; if this option is specified, -t is ignored

-u <[0..3]>, Feature selection criterion, 0-Pearson's Correlation, 1-Yule's Q, 2-Chi-square, 3-BNS (default is 0)

-f <integer>, Number of features to select; if not provided, no feature selection is performed, even if '-u' is specified

-r <file_name>, Generate results file

-l <[0..2]>, Program log verbosity level (default is 0)

-v Displays version information and exits

-h Displays usage information and exits

Please look for additional options that are required for Hierarchical Modeling.

Deprecated arguments, retained for back compatibility:

-H <float>, Hyperparameter value, depends on the type of prior

-S <number[,number]*>, Search for hyperparameter value: list of values to try in cross-validation

The training data will be read from standard input if dash '-' is specified for training_data_file instead of a file path. An execution log (detail controlled by -l) is written to standard output.

Classification module

The command line call for the classification program is:

BBRclassify [options] new_data_file model_file

where the options are:

-r <file_name>, Results file

-l <[0..2]>, Program log verbosity level (default is 0)

--probthres <p>, where 0<=p<=1 Set classification threshold at the given probability value; ignore one specified with the model

-v Displays version information and exits.

-h Displays usage information and exits

The test data will be read from standard input if dash '-' is specified for new_data_file instead of a file path. An execution log (detail controlled by -l) is written to standard output.

Estimating uncertainty of model parameters  New! 

Bootstrap is used to estimate the uncertainty in model parameters. The script makes bootstrap samples of the training file, runs BBRtrain against each of them, and computes the variance for each model parameter. Script is controlled by the following command-line arguments:

-B nn specifies the number of bootstrap replicates (default=30)

-N 1 (default) to use the same hyperparameter setting strategy for every boostrap sample; 0 to choose hyperparameter on training data and reuse for all samples

Remaining command line options pass through to BBRtrain; the last two have to be the training data file and the model file. Here is an example:

> perl bootstrap.pl -B 30 -N 1 -p 2 -C 10,10 -V 0.001,0.01,0.1,1,10,100 extract.bbr entity.mod

On the output the script prints a table with the following information about each model parameter:

As a convenience, we also provide a script to call WinBUGS and perform fully Bayesian logistic regression via MCMC using BBR-like syntax.

Download and Installation

The software is available in binaries for Windows or Linux. In any case, you will need two executable modules: for training and classification.

To complete the installation, just copy executables to the folder where you can execute them.

There is also a Perl script for estimating uncertainty of model parameters:

Source code and build instructions

Please send us email to let us know that you have downloaded the software. We will notify you of future releases, bug fixes, etc.

Acknowledgments

The work was partially supported under funds provided by the KD-D group for a project at DIMACS on Monitoring Message Streams, funded through National Science Foundation grant EIA-0087022 to Rutgers University. The NSF also partially supported the work through ITR grant DMS-0113236.

Source code used:

Legal Notice

The BBR software, and this webpage, are covered by the following notice:

Copyright 2005, Rutgers University, New Brunswick, NJ.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose other than its incorporation into a commercial product is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Rutgers University, DIMACS, and the authors not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

RUTGERS UNIVERSITY, DIMACS, AND THE AUTHORS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT SHALL RUTGERS UNIVERSITY, DIMACS, OR THE AUTHORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

References: