BMR: Bayesian Multinomial Regression Software

Alexander Genkin, David D. Lewis, and David Madigan

Last update: Nov 16, 2005

 New!    Individual priors

See also our Bayesian Logistic Regression software


Overview

Acknowledgments

User Guide

Legal Notice

Download: Source code and Binaries

Feedback



Overview

This software implements Bayesian multinomial logistic regression (also known as polytomous logistic regression or polychotomous logistic regression). The software will also perform polytomous classification (also called 1-of-k classification, ambiguously called multiclass classification, and occasionally mistakenly called multilabel classification). Assume that yk is the class indicator for the kth class (1 if the observation belongs to the class k, 0 otherwise); x is the predictor vector extended with 1 to be paired with the intercept parameter; and each βk is a vector of parameters, one for each class. The multinomial logistic regression model takes the form:
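
As a sketch in the notation just defined, the standard multinomial logistic form can be written as below. The index k' running over all classes is introduced here for the normalizing sum, and the exact parameterization used inside BMR is assumed rather than confirmed by this page.

```latex
P(y_k = 1 \mid x, \beta) = \frac{\exp(\beta_k^{T} x)}{\sum_{k'} \exp(\beta_{k'}^{T} x)}
```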

BMR finds the maximum a posteriori (MAP) estimate of the complete parameter vector β under two choices of prior distribution for the parameters: Gaussian or Laplace. The latter is given by the formula:
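
Assuming the usual rate parameterization with λ > 0 and mode 0, which agrees with the variance 2/λ² stated in the section on prior specification, the Laplace density is:

```latex
p(\beta_{jk} \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda \, |\beta_{jk}|\right)
```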

where βjk is a component of the vector of parameters. BMR assumes that the priors for the components are independent, so that the overall prior on the parameter vector is the product of the priors on its individual components. In the simplest case all prior components have a mode of 0 and the same variance, but this can be modified (see the Individual Priors section below). The Laplace prior favors a sparse solution: the MAP estimate of the parameter vector β tends to have many components equal to zero. For this reason the Laplace prior or, equivalently, L1 penalization of parameters, has been widely investigated as an alternative to feature selection, dating back at least to the LASSO algorithm of Tibshirani (1996).

To find the MAP parameter estimates, BMR uses a minor variant of the coordinate descent algorithm of Zhang and Oles (2001). Details on theoretical background, the fitting algorithm, and experimental results can be found at http://stat.rutgers.edu/~madigan/PAPERS/authorid-csna05.pdf .

Specification of the priors

The user specifies the prior by choosing 1) the form of the prior on each model parameter (Gaussian or Laplace), and 2) the hyperparameter value, i.e. the common variance of those parameter-wise priors. (In the Laplace case the variance is equal to 2/λ².) The -p option is used to specify the form of the prior, with Gaussian the default. The mode of this general prior is always 0. The variance of the prior can be set by specifying a single value to the -V option. Two other approaches to setting the variance are possible:

Norm-based default for prior variance

If -V is omitted, BMR sets the prior variance equal to d/m, where d is the number of distinct features appearing in the training examples, and m is the mean 2-norm of the training examples. If feature selection is enabled (see below), then d and m are calculated using only the selected features.
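
A minimal Python sketch of this default follows. It is not BMR's code: the function name and the dict-based sparse vectors are illustrative, and it assumes that a feature "appears" when it has a nonzero value in at least one training example.

```python
import math

def default_prior_variance(examples):
    """examples: list of dicts mapping feature_id -> value (sparse vectors)."""
    if not examples:
        raise ValueError("need at least one training example")
    features = set()
    total_norm = 0.0
    for x in examples:
        # Assumption: only features with a nonzero value count as appearing.
        features.update(fid for fid, v in x.items() if v != 0)
        total_norm += math.sqrt(sum(v * v for v in x.values()))
    d = len(features)                 # number of distinct features
    m = total_norm / len(examples)    # mean 2-norm of the examples
    if m == 0.0:
        raise ValueError("all training vectors are zero")
    return d / m

examples = [{1: 3.0, 2: 4.0}, {2: 1.0}, {3: 0.0, 5: 2.0}]
print(default_prior_variance(examples))
```

With feature selection enabled, the same calculation would be applied to vectors restricted to the selected features.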

Selecting prior variance through cross-validation

If multiple possible prior variance values are specified with the -V option, BMR will choose among them using k-fold cross-validation. The training data is randomly split into k disjoint subsets, with class frequencies balanced among the subsets to the extent possible. For each prior variance value, BMR trains k logistic regression models under a prior with that hyperparameter. Each model is trained on the union of k-1 of the subsets, and tested on the remaining subset, with each subset used as the test subset once. BMR then selects the prior variance that maximizes the average value of the log-likelihood of a training instance when it appears in the test subset.
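
One way to obtain a class-balanced random split is sketched below in Python. The round-robin dealing scheme, the function name, and the seed parameter are assumptions made for illustration; the page does not say how BMR performs the split.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=0):
    """Return fold_of, where fold_of[i] in range(k) is the fold of instance i."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    next_fold = 0
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)          # random order within each class
        for i in members:
            # Deal instances round-robin so each class is spread evenly.
            fold_of[i] = next_fold
            next_fold = (next_fold + 1) % k
    return fold_of

labels = [0] * 10 + [1] * 5
print(stratified_folds(labels, k=5))
```

Each fold then serves once as the test subset while the other k-1 folds form the training set.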

However, if the --errbar option is also specified, BMR instead uses the “one-standard-error” version of cross-validation described by Hastie et al. (2001). In this approach, BMR again computes the mean log-likelihood of instances under each prior variance, and finds the prior variance value that maximizes it. Let lmax be that maximal mean. For the prior variance value that gives lmax, BMR computes the mean log-likelihood of instances for each of the k test subsets separately, and computes the standard deviation, smax, of these k quantities. BMR then chooses the smallest prior variance whose mean log-likelihood is greater than or equal to lmax - smax. The effect of this approach is to reduce the danger of overfitting that choosing the "best" prior variance on the list might otherwise introduce.
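
The two selection rules can be sketched in Python as follows. This is an illustration under stated assumptions, not BMR's implementation: the input is a hypothetical dict of per-fold mean log-likelihoods, folds are taken to be of equal size (so the mean of fold means equals the mean over instances), and the sample standard deviation is used because the page does not say which variant BMR computes.

```python
from statistics import mean, stdev

def select_variance(fold_ll, errbar=False):
    """fold_ll maps each candidate prior variance to its k per-fold mean
    log-likelihoods. Returns the selected prior variance."""
    means = {v: mean(lls) for v, lls in fold_ll.items()}
    best = max(means, key=means.get)        # variance achieving lmax
    if not errbar:
        return best
    smax = stdev(fold_ll[best])             # spread across the k folds
    threshold = means[best] - smax          # lmax - smax
    # Smallest variance whose mean log-likelihood clears the threshold.
    return min(v for v in means if means[v] >= threshold)

fold_ll = {
    0.1:  [-0.60, -0.62, -0.58],
    1.0:  [-0.50, -0.54, -0.46],
    10.0: [-0.49, -0.55, -0.43],
}
print(select_variance(fold_ll))
print(select_variance(fold_ll, errbar=True))
```

In this made-up example the plain rule picks 10.0, while the one-standard-error rule falls back to 1.0 because its mean lies within one standard deviation of the best.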

By default BMR uses full k-fold cross-validation with k=10. The user can specify a positive integer value of k by the option -C k. If they specify -C k1,k2, with integers k1 ≥ k2 ≥ 1, BMR performs a kind of "fractional" cross-validation. It splits the training data into k1 pieces, but trains and runs models on only k2 of the unions of training subsets, applying those models to the corresponding test subsets. Selection is based on the mean log-likelihood over instances in just those k2 test subsets.

How to choose a good set of prior variances to specify with -V is not well understood. For some problems a geometric progression centered on 1, e.g. 0.001, 0.01, 0.1, 1, 10, 100, 1000, may make sense, but users should experiment with their own data.

Individual priors

The Bayesian approach is attractive for its ability to incorporate prior knowledge into statistical inference. As a step in this direction, BMRtrain allows a different prior to be specified for each model parameter. Recall that there is a model parameter for each class and feature combination, and one more intercept parameter for each class. The user can specify individual priors for some of the model parameters, leaving the rest to the usual mechanisms of prior specification. Because of the potentially large number of parameters, there are two mechanisms for specifying individual priors:

1. Feature level: the prior is specified per feature, identified by the feature ID alone.

2. Detailed level: the prior is specified for a single model parameter, identified by a class ID and a feature ID.

NOTE: a prior specified at the detailed level always overrides a prior specified at the feature level when both are given.

The user can specify the mode and variance of the prior for any parameter. However, all priors must have the same form (Gaussian or Laplace) as specified by the -p option. The general prior specified by -p and -V is used for all model parameters whose prior is not specified individually.

Note that the variance of an individual prior is specified relative to the variance of the general prior. If v is specified as the variance of the prior on a particular parameter, and the variance of the general prior (specified by user, set by default, or chosen by cross-validation) is v0, then the actual variance used for the parameter is v·v0. This allows users to indicate the relative size of variances, while letting the norm-based default or cross-validation be used to tune the absolute variances. A user who wishes to specify exact values for variances in individual priors can do this by forcing the general variance to be 1.0 (option -V 1).

Individual prior variances can also be specified as infinite, in which case the overall prior variance is ignored for that parameter. This is equivalent to an uninformative prior on this parameter, with all parameter values treated as equally plausible. A variance of 0 can also be specified: this sets the parameter to the value specified by the mode and does not allow fitting to change that value.
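
The scaling rule and its two special cases can be illustrated with a short Python sketch. The function name is hypothetical and the code only restates the arithmetic described above.

```python
import math

def effective_variance(relative_v, general_v0):
    """Actual prior variance for one parameter, given its relative variance v
    and the general prior variance v0."""
    if relative_v < 0 or general_v0 <= 0:
        raise ValueError("relative variance must be >= 0 and general variance > 0")
    if math.isinf(relative_v):
        return math.inf              # uninformative prior: v0 is ignored
    return relative_v * general_v0   # v = 0 gives 0: parameter fixed at its mode

print(effective_variance(2.0, 0.5))       # twice the general variance
print(effective_variance(math.inf, 0.5))  # no penalty on this parameter
print(effective_variance(0.0, 0.5))       # parameter held at the prior mode
```

Running BMRtrain with -V 1 makes the general variance 1.0, so the relative values become the actual variances.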

Individual priors are defined in a separate file, which is named on the command line with the -I option; see the User Guide for details.

Forcing Model Parameters to Be Zero

Multinomial logistic regression models can easily have thousands to millions of parameters. One way to reduce the number of model parameters that need to be estimated is to force some parameters to be zero rather than fitting them. BMR supports two constraints of this type: a reference class, and a constraint based on the lack of nonzero feature values.

The -R option causes BMR to treat the class with the largest label value as a reference class. All parameters associated with this class are forced to equal zero. Defining a reference class is necessary in maximum likelihood estimation of multinomial logistic regression to allow identifiability of parameters. For MAP estimation with the priors supported by BMR a reference class is not necessary, but may be advantageous.

The -z option specifies that if feature j takes on the value 0 for all training instances in class k then βjk is forced to equal 0. There are heuristic arguments that such parameters are likely to end up with values at or close to 0 in any case, and experimental evidence suggests that this option has very little effect on model fit. Further, with sparse data it can greatly increase the speed of fitting.
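
The -z rule can be made concrete with a small Python sketch that lists which (feature, class) parameters would be forced to zero. The function name and data layout are illustrative only and do not reflect BMR's internals.

```python
def zero_forced_parameters(examples, labels):
    """Return the set of (feature_id, class_label) pairs whose parameter
    would be forced to 0 because the feature is 0 throughout that class."""
    classes = set(labels)
    features = set()
    nonzero = set()   # (feature, class) pairs seen with a nonzero value
    for x, y in zip(examples, labels):
        for fid, value in x.items():
            features.add(fid)
            if value != 0:
                nonzero.add((fid, y))
    all_pairs = {(j, k) for j in features for k in classes}
    return all_pairs - nonzero

examples = [{1: 1.0, 2: 0.5}, {2: 2.0}, {3: 1.0}]
labels = [1, 1, 2]
print(sorted(zero_forced_parameters(examples, labels)))
```

With sparse data most (feature, class) pairs fall into the returned set, which is why the option can speed up fitting considerably.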

Standardization

This optional data transformation centers and scales (Ryan, 1997) each input feature to have a sample mean of 0 and a sample standard deviation of 1.0 on the training data. In other words, each feature value xj is replaced with (xj - aj)/sj, where aj is the sample mean and sj is the sample standard deviation.

Centering and scaling is a common, if sometimes controversial, strategy in maximum likelihood linear regression. The primary purpose is to avoid numerical difficulties that would result from inverting an ill-conditioned data matrix. Its benefits in Bayesian logistic regression are unclear. For logistic (or linear) regression with a Bayesian prior, ill-conditioning is much less of a problem, and our algorithm in any case does not perform matrix inversion. Centering also destroys any sparseness that the input vector may have, adding to memory and CPU load. On the other hand, centering and scaling may make the best choice of prior distribution more similar between features and/or easier to set manually.

Note that when centering and scaling is used during training, BMRtrain outputs a model that is intended to be applied to new data in raw form. Test examples do not need to be, and should not be, centered and scaled themselves. BMRtrain accomplishes this by centering and scaling the training data, training a model on the transformed data, and then adjusting the model parameters to be appropriate for raw data. Each βj from the model trained on centered and scaled data is replaced by βj/sj, and βj·aj/sj is subtracted from the intercept term. This gives the same result on new data as if the new data had been centered and scaled using the aj and sj values from the training set, and the original fitted model applied.
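
The parameter adjustment can be checked numerically. The Python sketch below handles the linear score of a single class with dense lists; the function names and the numbers are illustrative, and in the multinomial model the same adjustment would apply to each class's parameter vector.

```python
def to_raw_scale(intercept, beta, a, s):
    """Convert parameters fitted on standardized features to raw-data form."""
    raw_beta = [b / sj for b, sj in zip(beta, s)]
    shift = sum(b * aj / sj for b, aj, sj in zip(beta, a, s))
    return intercept - shift, raw_beta

def score(intercept, beta, x):
    """Linear score: intercept plus dot product of beta and x."""
    return intercept + sum(b * xj for b, xj in zip(beta, x))

a, s = [2.0, -1.0], [0.5, 4.0]        # training means and standard deviations
intercept, beta = 0.25, [1.5, -2.0]   # fitted on standardized data
x_raw = [3.0, 7.0]
x_std = [(xj - aj) / sj for xj, aj, sj in zip(x_raw, a, s)]

raw_intercept, raw_beta = to_raw_scale(intercept, beta, a, s)
print(score(intercept, beta, x_std))          # original model, standardized input
print(score(raw_intercept, raw_beta, x_raw))  # adjusted model, raw input
```

Both calls give the same score, which is the equivalence described above.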

Cosine normalization

This optional data transformation centrally projects each data vector onto the unit Euclidean sphere, giving it a 2-norm of 1.0. After that the dot product of any two vectors is equal to the cosine of the angle between those vectors, hence the name. Cosine normalization is popular in text classification because it helps to compensate for variations in document length.

At the classification step, if there are features in the data that never occurred in training, these features are ignored and do not participate in the 2-norm calculation or any subsequent operations.
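
A minimal Python sketch of cosine normalization on a sparse vector follows, including the rule that features unseen in training are dropped before the norm is computed. The function and parameter names are hypothetical, and leaving an all-zero vector unchanged is an assumption, since the page does not cover that case.

```python
import math

def cosine_normalize(x, known_features=None):
    """x: dict feature_id -> value. If known_features is given, features
    outside it are dropped first (as at classification time)."""
    if known_features is not None:
        x = {f: v for f, v in x.items() if f in known_features}
    norm = math.sqrt(sum(v * v for v in x.values()))
    if norm == 0.0:
        return dict(x)    # assumption: an all-zero vector is left as is
    return {f: v / norm for f, v in x.items()}

print(cosine_normalize({1: 3.0, 2: 4.0}))
print(cosine_normalize({1: 3.0, 2: 4.0, 9: 100.0}, known_features={1, 2}))
```

The intercept feature is not part of x here, in line with the note below that the constant feature is unaffected by normalization.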

Notes on data transformations

1. The constant feature 1 that corresponds to the intercept terms of the logistic regression model does not participate in, and is not affected by, either centering and scaling or cosine normalization.

2. If both standardization (-s) and cosine normalization (-c) are specified then standardization is applied first. This is usually undesirable.

User Guide

This software consists of two executable modules: BMRtrain, the training module, and BMRclassify, the classification module. BMRtrain takes a training data file as input and generates a model file. BMRclassify inputs a model file, plus a data file with new data, and outputs a results file with predicted probabilities and class labels.

File formats

The Data file format for training or testing is similar to that used by Joachims' SVMlight software for training support vector machines (SVM). Each line represents an instance. The line format is:

<label>{ <feature_id>:<value>}*

The label here may be any integer; feature_id must be a positive integer; each value is a number in double float notation. Lines starting with '#' are ignored and can be used for comments.
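
A line in this format can be read with a few lines of Python. The parser below is a sketch based only on the description above: the function name is illustrative, fields are assumed to be separated by whitespace, and blank lines are skipped as an assumption.

```python
def parse_data_line(line):
    """Parse '<label>{ <feature_id>:<value>}*' into (label, {feature_id: value}).
    Returns None for comment lines and (by assumption) blank lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()
    label = int(fields[0])
    features = {}
    for field in fields[1:]:
        fid_text, value_text = field.split(":", 1)
        fid = int(fid_text)
        if fid <= 0:
            raise ValueError(f"feature_id must be a positive integer: {fid}")
        features[fid] = float(value_text)
    return label, features

print(parse_data_line("2 1:0.5 7:1e-3"))
print(parse_data_line("# a comment line"))
```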

The Results file lines correspond to cases in the same order as in the data file, which could be training or test data. Each line has r+2 fields, where r is the number of classes in the training data (and hence in the model). The first field is the true label copied from the data file; the last field is the label predicted by the model. Fields in the middle are the model's estimates of the posterior probability for the case to have that label, in ascending order of labels. The predicted label is the label of the class with the highest predicted probability of class membership.
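
The results layout can be unpacked as in the Python sketch below. Whitespace-separated fields and the function name are assumptions, and the caller is expected to supply the set of class labels from the training data.

```python
def parse_results_line(line, class_labels):
    """Return (true_label, {label: probability}, predicted_label)."""
    labels = sorted(class_labels)        # probabilities come in ascending label order
    fields = line.split()
    if len(fields) != len(labels) + 2:   # r + 2 fields expected
        raise ValueError("unexpected number of fields")
    true_label = int(fields[0])
    probs = dict(zip(labels, (float(f) for f in fields[1:-1])))
    predicted = int(fields[-1])
    return true_label, probs, predicted

true_label, probs, predicted = parse_results_line("3 0.1 0.2 0.7 3", [3, 1, 2])
print(true_label, probs, predicted)
print(max(probs, key=probs.get) == predicted)  # predicted label has the top probability
```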

Individual priors are specified in the Individual priors file. To implement the two mechanisms for specifying individual priors, there are two kinds of lines in the file. The format for the feature level line is:

<feature_id> <mode> <variance>

and the format for the detailed specification line starts with the keyword “class”:

class <class_id> <feature_id> <mode> <variance>

The feature ID and class ID should be what actually occurs in the training data file; feature ID of 0 is used to specify the prior for the intercept term. The mode is the mode of the prior, and can be any real value. The variance can be any nonnegative number, or the string "inf". The latter means infinity, indicating that no penalty should be imposed on the value of this parameter. The rest of the line is ignored by the program, and can be used for comments.
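
Both line types, and the rule that a detailed prior overrides a feature-level one, can be sketched in Python. The names are illustrative, whitespace-separated fields are assumed, and applying a feature-level prior to that feature in every class is an interpretation of the text rather than something the page states outright.

```python
import math

def parse_prior_line(line):
    """Return (class_id or None, feature_id, mode, variance), or None if blank.
    Anything after the variance is treated as a comment and ignored."""
    fields = line.split()
    if not fields:
        return None
    if fields[0] == "class":
        class_id = int(fields[1])
        rest = fields[2:5]
    else:
        class_id = None               # feature-level line
        rest = fields[0:3]
    feature_id, mode = int(rest[0]), float(rest[1])
    variance = float(rest[2])         # float("inf") yields infinity
    if variance < 0:
        raise ValueError("variance must be nonnegative or 'inf'")
    return class_id, feature_id, mode, variance

def resolve_prior(priors, class_id, feature_id):
    """priors: dict keyed by (class_id or None, feature_id) -> (mode, variance).
    A detailed entry wins over a feature-level entry."""
    return priors.get((class_id, feature_id), priors.get((None, feature_id)))

lines = ["17 0.0 2.0 feature-level prior", "class 2 17 0.5 inf detailed prior"]
priors = {}
for text in lines:
    c, f, mode, var = parse_prior_line(text)
    priors[(c, f)] = (mode, var)
print(resolve_prior(priors, 2, 17))   # detailed entry
print(resolve_prior(priors, 1, 17))   # falls back to the feature-level entry
print(resolve_prior(priors, 1, 99))   # None: the general prior applies
```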

Training module

The training program is called from the command line as:

BMRtrain [options] training_data_file model_file

where the options are:

-p <[1,2]>, Type of prior, 1-Laplace 2-Gaussian (default is 2)

-V <number[,number]*>, Prior variance values; if more than one, cross-validation will be used

-C <integer[,integer]>, Cross-validation: number of folds, number of runs. If the number of runs is not given, it is assumed equal to the number of folds. Default is 10,10. The argument is ignored unless multiple possible variances are specified with -V

-I <indPriorsFile> Individual Priors file

-z Exclude all-zero per class variables (default is no)

-R Reference class: the class with the largest label will be used as reference (default is no)

-s Standardize variables in input vectors (default is no)

-c Cosine normalize input vectors (default is no)

-e <float>, Convergence threshold; default is 0.001

-r <file_name>, Generate results file

-l <[0..2]>, Program log verbosity level (default is 0)

-v Displays version information and exits

-h Displays usage information and exits

The training data will be read from standard input if dash '-' is specified for training_data_file instead of a file path. An execution log (detail controlled by -l) is written to standard output.
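
For scripted runs, the argument list can be assembled programmatically. The Python helper below is a hypothetical convenience that uses only options listed above; the file names are placeholders, and passing each option value as a separate argument is an assumption about how BMRtrain parses its command line.

```python
def build_train_command(train_file, model_file, prior=None, variances=None,
                        folds=None, runs=None, priors_file=None,
                        standardize=False, cosine=False):
    """Assemble an argument list for BMRtrain from documented options."""
    cmd = ["BMRtrain"]
    if prior is not None:
        cmd += ["-p", {"laplace": "1", "gaussian": "2"}[prior]]
    if variances:
        # Several values trigger cross-validation over the list.
        cmd += ["-V", ",".join(str(v) for v in variances)]
    if folds is not None:
        if runs is not None and not (folds >= runs >= 1):
            raise ValueError("need folds >= runs >= 1")
        cmd += ["-C", str(folds) if runs is None else f"{folds},{runs}"]
    if priors_file is not None:
        cmd += ["-I", priors_file]
    if standardize:
        cmd.append("-s")
    if cosine:
        cmd.append("-c")
    return cmd + [train_file, model_file]

cmd = build_train_command("train.dat", "model.out", prior="laplace",
                          variances=[0.1, 1, 10], folds=10, runs=3)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run once the BMRtrain executable is on the search path.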

Classification module

Here is how to use the classification module:

BMRclassify [options] new_data_file model_file

where the options are:

-r <file_name>, Results file. Format is as described for BMRtrain.

-l <[0..2]>, Program log verbosity level (default is 0)

-v Displays version information and exits.

-h Displays usage information and exits

The data to be classified will be read from standard input if dash '-' is specified for new_data_file instead of a file path. An execution log (detail controlled by -l) is written to standard output.

If the data file contains class labels that did not participate in training, the results file will still contain those records, but they will not contribute to the statistics in the log, such as log-likelihood and number of errors.

Download and Installation

The software is available now as binaries for Windows or Linux. In either case, you will need both executable modules: one for training and one for classification.

To complete the installation, copy the executables to a folder from which you can run them.

Source code and build instructions

Please send us email to let us know that you have downloaded the software. We will notify you of future releases, bug fixes, etc.

Acknowledgments

The work was partially supported under funds provided by the KD-D group for a project at DIMACS on Monitoring Message Streams, funded through National Science Foundation grant EIA-0087022 to Rutgers University. The NSF also partially supported the work through ITR grant DMS-0113236.

Source code used:

Legal Notice

The BMR software, and this webpage, are covered by the following notice:

Copyright 2005, Rutgers University, New Brunswick, NJ.

All Rights Reserved

Permission to use, copy, modify, and distribute this software and its documentation for any purpose other than its incorporation into a commercial product is hereby granted without fee, provided that the above copyright notice appears in all copies and that both that copyright notice and this permission notice appear in supporting documentation, and that the names of Rutgers University, DIMACS, and the authors not be used in advertising or publicity pertaining to distribution of the software without specific, written prior permission.

RUTGERS UNIVERSITY, DIMACS, AND THE AUTHORS DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT SHALL RUTGERS UNIVERSITY, DIMACS, OR THE AUTHORS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

References: