
The 'Entropy Triangle' (or 'Jaynes Triangle') and The Brain

Some notes from my talk "The 'Entropy Triangle' and The Brain" on Friday morning:

Notes


Some ideas from Jaynes:

  • Probability extends Boolean logic, with 0 and 1 (certain falsehood and certain truth) as special cases.
  • Probability distributions are "carriers of incomplete information".
  • Probability Theory, Information Theory and Statistical Mechanics are fundamentally related by the concept of 'Entropy', which is - loosely speaking - a 'lack of information'.
Cox's Theorems: 1) Sum rule:     P(A) + P(notA) = 1
                2) Product rule: P(X,Y) = P(X|Y)P(Y)
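
Combining the two rules gives Bayes' Theorem; writing it out (my notation, not taken from the slides) makes explicit the integral in the denominator that the 'evidence' comment below refers to:

    P(\Theta \mid X) \;=\; \frac{P(X \mid \Theta)\, P(\Theta)}{P(X)},
    \qquad
    P(X) \;=\; \int P(X \mid \Theta)\, P(\Theta)\, d\Theta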

The evidence (the integral in the denominator of Bayes' Theorem, also called the marginal likelihood) is an important quantity and should not be ignored. It provides an 'Occam's Razor' that guards against over-fitting and allows rational & quantitative comparison (and averaging, for combination) of models.
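
A minimal sketch of model comparison via the evidence, on made-up data (15 heads in 20 coin flips): a 'fair coin' model against an 'unknown bias' model with a uniform Beta(1,1) prior, both evidences available in closed form.

    # Compare two models by their evidence (marginal likelihood); made-up data.
    from math import comb, lgamma, exp, log

    n, k = 20, 15                                    # hypothetical coin-flip data

    # Model 0: theta fixed at 0.5, so the evidence is just the binomial likelihood.
    log_ev_fair = log(comb(n, k)) + n * log(0.5)

    # Model 1: theta ~ Beta(1,1); integrating the binomial likelihood over the
    # uniform prior gives C(n,k) * Beta(k+1, n-k+1) in closed form.
    log_ev_bias = log(comb(n, k)) + lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

    print("log evidence, fair coin    :", round(log_ev_fair, 3))
    print("log evidence, unknown bias :", round(log_ev_bias, 3))
    print("Bayes factor (bias vs fair):", round(exp(log_ev_bias - log_ev_fair), 2))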

Entropy (and its generalisation, the KL Divergence or Relative Entropy) provides an axiomatically justified basis for inference and the measurement of information at all 3 points of the triangle. For example,

'Measure of what you have learned in an experiment':  KL( P(Theta|X) || P(Theta) )
Mutual Information:  MI[X,Y] = KL( P(X,Y) || P(X)P(Y) )
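
As a small illustration (made-up numbers, not from the talk), both quantities for discrete distributions:

    # KL divergence between a posterior and a prior, and mutual information as the
    # KL divergence between a joint distribution and the product of its marginals.
    import numpy as np

    def kl(p, q):
        """KL(p || q) in nats for discrete distributions on the same support."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

    # 'What you have learned': a flat prior vs a posterior over three parameter values.
    prior     = [1/3, 1/3, 1/3]
    posterior = [0.7, 0.2, 0.1]
    print("information gained (nats):", kl(posterior, prior))

    # Mutual information of two binary variables from a made-up joint P(X, Y).
    joint = np.array([[0.30, 0.10],
                      [0.15, 0.45]])
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    print("mutual information (nats):", kl(joint.ravel(), (px * py).ravel()))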

Optimal Experimental Design provides many tools for choosing data (sub)sets that maximise learning, e.g. 'Maximum Entropy Sampling'. Sequential Design adds that you should, in general, fit subsets and learn as you go.
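
A minimal sketch of greedy Maximum Entropy Sampling in the spirit of Shewry & Wynn (my own toy setup, with a made-up squared-exponential covariance): for jointly Gaussian observations the entropy of a subset grows with the log-determinant of its covariance sub-matrix, so each step picks the candidate that adds the most entropy.

    # Greedy selection of design points by maximum entropy (log-det) gain.
    import numpy as np

    x = np.linspace(0.0, 10.0, 50)                    # candidate measurement locations
    K = np.exp(-0.5 * (x[:, None] - x[None, :])**2) + 1e-6 * np.eye(len(x))

    def log_det(subset):
        return np.linalg.slogdet(K[np.ix_(subset, subset)])[1]

    chosen = []
    for _ in range(5):                                # choose 5 design points
        remaining = [i for i in range(len(x)) if i not in chosen]
        base = log_det(chosen) if chosen else 0.0
        gains = [log_det(chosen + [i]) - base for i in remaining]
        chosen.append(remaining[int(np.argmax(gains))])

    print("selected locations:", np.round(x[chosen], 2))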

Physicists have been studying the properties of very large collections of particles for 150 years. The Maximum Entropy distributions that arise from the standard theory of equilibrium Statistical Mechanics & Thermodynamics provide the majority of the distributions used in Probability Theory. They are distributions which "add as little information as possible beyond the constraints that define them".
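
As a small worked example of that construction (my addition, not from the talk): maximising the entropy of a density on x >= 0 subject only to normalisation and a fixed mean mu forces an exponential form,

    \max_{p}\; -\!\int_0^{\infty} p(x)\,\ln p(x)\,dx
    \quad \text{s.t.} \quad \int_0^{\infty} p(x)\,dx = 1,\;\; \int_0^{\infty} x\,p(x)\,dx = \mu
    \;\;\Rightarrow\;\; \ln p(x) = \lambda_0 + \lambda_1 x
    \;\;\Rightarrow\;\; p(x) = \tfrac{1}{\mu}\, e^{-x/\mu},

with one Lagrange multiplier per constraint inside the exponential; adding a variance constraint on the whole real line gives the Gaussian in the same way.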

Beyond simple equilibrium for a perfect gas, much interesting structure & behaviour can be generated by, e.g., phase transitions, interactions between particles, the presence of an external field, and movements between equilibria. All of these (and others) may have analogues in appropriately represented neural circuits, allowing the tools and theories of Physics to be used for their analysis and construction.

Papers by Tkacik, Bialek, Schneidman et al. raise the intriguing possibility of building MaxEnt models from actual neural data, and thereby reasoning about patterns, coding, information flow and other properties by representing the recorded activity as a distribution over states, as in a Physics problem.
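
A minimal sketch of the pairwise ('Ising') version of such a model, fitted to made-up binary spike words by matching means and pairwise correlations (a toy re-implementation of the general idea, not the authors' code; exact enumeration limits it to a handful of neurons):

    # Pairwise maximum entropy model P(s) ~ exp(h.s + s.J.s) for binary patterns.
    import itertools
    import numpy as np

    rng = np.random.default_rng(1)
    N = 5
    data = (rng.random((2000, N)) < 0.2).astype(float)   # made-up 'spike words'

    states = np.array(list(itertools.product([0, 1], repeat=N)), float)  # all 2^N patterns

    def model_moments(h, J):
        energy = states @ h + np.einsum('ki,ij,kj->k', states, J, states)
        p = np.exp(energy - energy.max())
        p /= p.sum()
        return p @ states, states.T @ (states * p[:, None])   # <s_i>, <s_i s_j>

    emp_mean = data.mean(axis=0)
    emp_corr = data.T @ data / len(data)

    h, J = np.zeros(N), np.zeros((N, N))
    for _ in range(2000):                    # gradient ascent on the log-likelihood
        m, c = model_moments(h, J)
        h += 0.1 * (emp_mean - m)
        J += 0.1 * np.triu(emp_corr - c, k=1)   # couplings on the upper triangle only

    print("fitted fields   :", np.round(h, 2))
    print("fitted couplings:", np.round(J[np.triu_indices(N, 1)], 2))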

'Entropic Inference' and 'Information Geometry' provide some advanced ideas for choosing optimal priors that represent the required information and nothing else, and for reasoning about the relative structure of probability models in a non-Euclidean geometry that describes information content directly.
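
One concrete piece of that geometry, added here for reference (a standard result, not from the talk): the local metric on a family p(x|Theta) is the Fisher information, and the KL divergence between nearby members is quadratic in it,

    g_{ij}(\theta) \;=\; \mathrm{E}_{\theta}\!\left[ \frac{\partial \ln p(x \mid \theta)}{\partial \theta_i}\, \frac{\partial \ln p(x \mid \theta)}{\partial \theta_j} \right],
    \qquad
    \mathrm{KL}\big( p_{\theta} \,\|\, p_{\theta + d\theta} \big) \;\approx\; \tfrac{1}{2}\, d\theta^{\mathsf{T}} g(\theta)\, d\theta .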

References


'Maximum Entropy Sampling', Shewry & Wynn - Journal of Applied Statistics, vol. 14, no. 2, pp. 165-170, 1987

'Simulation based optimal design', P. Müller - Handbook of Statistics, 2005

'Optimal Information Processing and Bayes' Theorem', Zellner - American Statistician, vol. 42, no. 4, pp. 278-284, 1988

'The simplest maximum entropy model for collective behavior in a neural network', Tkacik, Marre, Mora, Amodei, Berry & Bialek - Journal of Statistical Mechanics, 2013

...also see list below for recommended books.

Also some other background material.

An excellent intro to many of the ideas, by Sivia. A concise and practical starting point if you want to go on to Jaynes' book.

http://www.amazon.co.uk/Data-Analysis-A-Bayesian-Tutorial/dp/0198568320

Box & Tiao. The first few chapters in particular are good for background and for an analysis of assumptions and how inference changes as the assumptions do. Originally meant to show how Bayesian inference offers advantages over classical methods, but it has much useful insight and does not age.

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0471574287.html

The not-quite-finished masterpiece by Jaynes. A valuable source reference with much thought-provoking personal comment and deep consideration. Also quite funny and rude about some of the opposition to the ideas.

http://www.amazon.com/Probability-Theory-The-Logic-Science/dp/0521592712

Raiffa & Schlaifer. Possibly still the best book on Bayesian Decision Theory, with a very thorough analysis of the possibilities. DeGroot is also very good in this area.

http://www.amazon.com/Applied-Statistical-Decision-Theory-Howard/dp/047138349X

Bernardo & Smith, and O'Hagan. Two useful source texts for understanding the technicalities.

http://www.amazon.com/Kendalls-Advanced-Theory-Statistic-2B/dp/0470685697
http://www.amazon.com/Bayesian-Theory-Jos-233-Bernardo/dp/047149464X

There are many more! These are just a personal selection.

Some advice:

  • Work from the Cox theorems (sum & product rule) upwards. Everything derives from there. You can find an optimal probability setup for any given problem, get 100% inferential fidelity (as proven by Zellner) and avoid ad-hoc solutions. For many problems, these setups have already been solved but it's good to know that you can devise new ones that you can trust for any new, specialised problem.
  • Appropriate priors can be important in high-dimensional problems and don't forget that every real problem has some prior information. Sometimes it's so obvious that you can't see it, sometimes it's difficult to find or express appropriately. In many areas of 'Inverse Problems' such as image reconstruction and Seismology, results would not be possible without prior information expressed appropriately.
  • Don't forget that the likelihood function is also an expression of prior information in the sense that you are employing a probability model of some kind for the uncertainty in your data. A real objective analysis will appropriately take into account the full joint distribution of data and parameters: p( X, θ ) = p( X | θ ) * p( θ ) i.e. the likelihood * the prior.
  • Although the machine learning community use Bayesian inference widely (as they should), sometimes their presentation of it seems a bit unintuitive compared to the original writers - or maybe that is just my bias! For example, they often talk about "learning priors" when priors are what you have before you have any data (this time around, anyway) and posteriors are what you learn. To draw new data from the fitted model you use the predictive distribution (see the short note after this list).
  • Robbins' ideas about 'Empirical Bayes' should probably be of more interest to the AI community than they are.
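
The predictive distribution mentioned above, for the record (a standard result, my notation): average the sampling model over the posterior; e.g. for a Bernoulli rate with a uniform prior it reduces to Laplace's rule of succession,

    p(x_{\text{new}} \mid X) \;=\; \int p(x_{\text{new}} \mid \Theta)\, p(\Theta \mid X)\, d\Theta,
    \qquad
    P(x_{\text{new}} = 1 \mid k \text{ successes in } n) \;=\; \frac{k+1}{n+2} .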

Ed Jaynes produced much original thinking on these ideas. A collection of his papers is here (some are about Physics):

http://bayes.wustl.edu/etj/node1.html

Ariel Caticha and Carlos Rodriguez at Albany, NY are doing some very interesting work on appropriate choice of Bayesian prior (Entropic priors) and information processing via 'Information Geometry'. Anything by George Box, Arnold Zellner, John Skilling & Henry Wynn is worth reading too.

The MaxEnt community have some very useful ideas and their conference proceedings since the 1980s contain much creative thought and helpful information. Caticha shows that Bayesian inference can be seen as a special case of Maximum Entropy inference.

You will find that, in general, much of the best work in the area of objective inference based upon probability theory integrates understandings from Shannon's Information Theory and Gibbs' ideas on Statistical Mechanics, with Jaynes at the centre of it all:

http://bayes.wustl.edu/
