Automatic Data Reweighting


In "Automatic data reweighting", Cook writes:

Suppose you are designing an autonomous system that will gather data and adapt its behavior to that data.

At first you face the so-called cold-start problem. You don't have any data when you first turn the system on, and yet the system needs to do something before it has accumulated data. So you prime the pump by having the system act at first on what you believe to be true prior to seeing any data.

Now you face a problem. You initially let the system operate on assumptions rather than data out of necessity, but you'd like to go by data rather than assumptions once you have enough data. Not just some data, but enough data. Once you have a single data point, you have some data, but you can hardly expect a system to act reasonably based on one datum.

Rather than abruptly moving from prior assumptions to empirical data, you'd like the system to gradually transition from reliance on the former to reliance on the latter, weaning the system off initial assumptions as it becomes more reliant on new data.

The delicate part is how to manage this transition. How often should you adjust the relative weight of prior assumptions and empirical data? And how should you determine what weights to use? Should you set the weight given to the prior assumptions to zero at some point, or should you let the weight asymptotically approach zero?

Fortunately, there is a general theory of how to design such systems. It's called Bayesian statistics. The design issues raised above are all handled automatically by Bayes theorem. You start with a prior distribution summarizing what you believe to be true before collecting new data. Then as new data arrive, the magic of Bayes theorem adjusts this distribution, putting more emphasis on the empirical data and less on the prior. The effect of the prior gradually fades away as more data become available.

My general comment is that nothing here specifically screams "Bayes!" to me.

For example, the "cold-start problem" can be solved, or at least explored, by making modelling assumptions and feeding the system simulated data or data from previous similar studies. I'm also not sure why "the system needs to do something before it has accumulated data" in the first place. Why not just incorporate a rule to wait until n data points have been accumulated? Yes, with, say, five real data points we'd have some data, but in my opinion no approach, frequentist, Bayesian, or anything else, is great for small samples.

The transition can be managed in a variety of ways: for example, a weighted average of assumptions and data, a sample size requirement, a check that an estimate falls in a certain interval, changepoint analysis, stopping rules, or any other criterion. Transitioning doesn't require Bayes.
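As a minimal sketch of two of these options (not anything from Cook's post), the Python below implements a weighted average whose data weight grows with the sample size, and a plain sample size rule. The function names, the weight formula n / (n + k), the constant k, and the threshold min_n are all hypothetical choices made for illustration.

```python
from statistics import fmean


def blended_estimate(assumed_value, observations, k=10.0):
    """Weighted average of an assumed value and the sample mean.

    The weight on the data is n / (n + k): 0 with no data, approaching 1
    as n grows. k is a hypothetical tuning constant, not a recommendation.
    """
    n = len(observations)
    if n == 0:
        # Cold start: nothing observed yet, so fall back on the assumption.
        return assumed_value
    w = n / (n + k)
    return w * fmean(observations) + (1 - w) * assumed_value


def threshold_estimate(assumed_value, observations, min_n=30):
    """Sample size rule: keep the assumption until min_n points exist."""
    if len(observations) < min_n:
        return assumed_value
    return fmean(observations)
```

With k = 10 and ten observations, the first function gives the data and the assumption equal weight; the second switches abruptly once min_n is reached. Neither calls on Bayes theorem, although how to pick k or min_n is its own judgment call.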

I noticed that there wasn't much discussion of the selection of the priors, or of sensitivity to them. In fact, the piece seems to assume that priors are easy to set up and are adequate from the start. The "what you believe to be true before collecting new data" can be problematic if what you believe and what is actually true are quite different, and that issue can be magnified in a multivariate setting. A prior can be chosen automatically ("objective Bayes"), but using such priors can defeat the purpose of basing priors on "what you believe to be true". Gelman has noted that "..., if you could in general express your knowledge in a subjective prior, you wouldn't need formal Bayesian statistics at all: you could just look at your data and write your subjective posterior distribution." In a non-Bayes setting, "what you believe to be true before collecting new data" comes from looking at previous studies, real or simulated data, subject matter expert input, and modelling assumptions.

A Bayes-only approach would also have you using a hammer (Bayes), or a Swiss Army knife to be generous, for every nail (problem), even when there are great non-Bayes approaches. Bayes is fine, but I would rather stay flexible in solving problems than be tied to one method. I'm also not convinced that "handled automatically" is necessarily a desirable thing.

There are also adaptive and weighting approaches that are not Bayesian, such as the one explained in Adaptive Tests of Significance Using Permutations of Residuals by O'Gorman, as well as sequential statistics and machine learning approaches.

The actual magic of the system Cook describes, I believe, is not in the "Bayes theorem" part (the standard Bayes theorem for events is a frequentist result, by the way), but in the "more data" part. Magic happens as n/N approaches 1 (the sample size n approaches the population size N) or as the likelihood swamps the prior, assuming the models are good. Additionally, we hope this new data comes from well-designed experiments or samples, and is not just any ol' data gathered any ol' way. The problem with a pure Bayesian coherence approach is that one can be coherent yet completely wrong.
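To make the "likelihood swamps the prior" point concrete, here is a small sketch using the textbook Beta-Binomial model, where a Beta(a, b) prior and s successes in n trials give a posterior mean of (a + s) / (a + b + n). That mean is a weighted average of the prior mean and the sample proportion, with weight (a + b) / (a + b + n) on the prior. The prior parameters and the "true" rate below are made-up values, and the data are idealized counts rather than a simulation.

```python
def posterior_mean(a, b, successes, n):
    """Posterior mean of a Beta(a, b) prior after `successes` in n trials."""
    return (a + successes) / (a + b + n)


def prior_weight(a, b, n):
    """Share of the posterior mean that comes from the prior mean."""
    return (a + b) / (a + b + n)


# Hypothetical prior with mean 0.2, deliberately far from the data.
a, b = 2.0, 8.0
# Hypothetical success rate used to generate idealized counts.
true_rate = 0.6

for n in (0, 10, 100, 1000, 10000):
    successes = round(true_rate * n)
    print(
        f"n={n:>5}  prior weight={prior_weight(a, b, n):.4f}  "
        f"posterior mean={posterior_mean(a, b, successes, n):.4f}"
    )
```

The prior weight depends only on n relative to a + b, so in this model the fading of the prior is driven entirely by the amount of data. A badly chosen prior still dominates while n is small.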

Thanks for reading.
