2011/DailyLogs/ReinfLearn

Reinforcement learning and decision making in real-time behaving systems

(A) The first session was spent introducing basic concepts of reinforcement learning (RL)

Juergen: Illustrative example of the basic principles of RL
Gianluca: Formalization of an actor critic model
Marco: Link to biology: dopaminergic system as modelled by the brain
Kevin: extension to complex systems / settings
Juergen: introduction to intrinsic rewards

(B) Group Discussion on Intrinsically Motivated Reinforcement Learning

Distinction between intrinsic and extrinsic reward

Gianluca: external reward (such as food and sex) is not the only reward in the world guiding trial-and-error learning. The acquisition of knowledge and success in doing things (e.g., for a scientist making a discovery, or for an engineer designing a chip, etc.) is a type of intrinsic reward.

Question: how can we distinguish between extrinsic and intrinsic rewards?

Gianluca: it is not obvious, there is a lot of discussion ongoing in the workshop. I will give you my idea. For example, the distinction between external/internal is not a good distinction: both intrinsic and extrinsic motivations involve external events that ultimately affect your brain.

Juergen: Well, I do think it is obvious - see  http://www.idsia.ch/~juergen/creativity.html and  http://www.idsia.ch/~juergen/interest.html

Gianluca: In general, extrinsic motivations generate extrinsic rewards on the basis of homeostatic needs (such as regulation of sugar in blood, level of water in cells, etc.) and serve to directly increase your biological fitness.

Intrinsic motivations, instead, generate learning signals that allow you to build competence and knowledge that are only later used to increase your fitness. These learning signals are produced when the brain detects that knowledge and competence are themselves increasing. For example, children spend time acquiring knowledge and competences while their parents still take care of basic needs, such as food and other fitness-related events. During this time, the mechanisms in their brain allow them to evaluate success/progress and to generate reward signals accordingly.

Intrinsic motivations and reward signals generated by information compression

Question: is this related to understanding and compressing information we get from the world?

Juergen: Yes! What generates the increase in understanding the world?

Leslie: society keeps extending the time period that children / students spend learning (going to school, etc.); i.e. the time during which they rely on intrinsic reward. What makes a little child learn how to stand up if not the social environment?

Juergen: compression. Imagine a system that has a network that predicts upcoming things (e.g., states), thereby learning to represent data in a compressed fashion. The compressor/predictor/data-encoder changes over time; i.e. it gets better at predicting and compressing information. When you succeed in making progress, you generate a reward. This reward can be used to train a second component of the system, for example a reinforcement learning component, and this explains the acquisition of new motor skills. Later, if the system cannot compress the information any further, it no longer feels rewarded and so changes activity.
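As a rough illustration of the scheme Juergen describes, the following minimal sketch computes an intrinsic reward as the improvement of a predictor. Everything here is an assumption made for illustration: the toy running-mean predictor, the names RunningMeanPredictor and intrinsic_reward, and the simplification of measuring progress only on the observation just seen. It is not the formulation used in Juergen's work.

```python
class RunningMeanPredictor:
    """Toy predictor (hypothetical): predicts the next observation as a running mean."""

    def __init__(self, learning_rate=0.1):
        self.estimate = 0.0
        self.learning_rate = learning_rate

    def error(self, observation):
        # Squared prediction error for a single observation.
        return (observation - self.estimate) ** 2

    def update(self, observation):
        # Move the estimate a small step towards the observation.
        self.estimate += self.learning_rate * (observation - self.estimate)


def intrinsic_reward(predictor, observation):
    """Reward = how much learning on this observation reduced its prediction error."""
    error_before = predictor.error(observation)
    predictor.update(observation)
    error_after = predictor.error(observation)
    return error_before - error_after


# A stream that is perfectly regular: the reward shrinks as the predictor improves.
predictor = RunningMeanPredictor()
rewards = [intrinsic_reward(predictor, 1.0) for _ in range(100)]
```

In this sketch the reward is largest while the predictor is still improving quickly and fades towards zero once nothing more can be learned, which is the point at which the system described above would change activity. In a fuller system the reward would be passed to a separate reinforcement learner.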

Project IM-CLeVeR and Summer School on intrinsically motivated RL, abstraction, and cumulative learning

Gianluca: Please come to visit the projects ongoing within the Summer School of the European-funded project IM-CLeVeR to see some examples of the things I am about to say.

Intrinsic motivations and learning signals based on improvements of competence

In contrast to Juergen, we think that in the brain there is a mechanism that directly measures your capabilities for doing things. If you acquire a new skill (such as setting up a glass), the reward does not come from a predictor but from the actual action that has an impact on the world. Such a mechanism measures progress in competence (it relies on an inverse model, not a forward model as in the case of prediction).

Juergen: This is just making things unnecessarily complex: the predictor can provide such a reward along the lines of learning to compress data.

Alex: there is still a reward even for things that we know well how to do, e.g. climbing (so without surprise or improvement). Why is this still rewarding?

Juergen: Intrinsic reward is still there as mountains are all different.

A biological example of intrinsic motivations

Kevin: There is biological evidence that supports Juergen's view. It can be found in the basal ganglia. How do I learn actions? As a little child I did not have access to a pool of possible actions, but I might have found a light switch, pressed it by chance, and recognized a surprising result (light suddenly going on). How can I learn the action that safely leads to switching on the light?

The basal ganglia receive the relevant information: context and motor patterns. Initially (without an internal model) the outcome of a motor pattern might be a surprise, which elicits a pulse in dopaminergic neurons as it activates the superior colliculus. Dopamine acts as a learning signal which then enhances cortical connections within the basal ganglia, so you increase the chance of repeating that action. If you repeat this action over and over again, you become more and more likely to repeat it further, and this lets you refine the action and also learn the action-outcome effects. Once you have learned a certain set of actions you can deploy them in different settings.
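The loop Kevin describes (surprising outcome, learning signal, increased chance of repeating the action, fading surprise) can be caricatured in a few lines. This is only a toy sketch under stated assumptions: the class SurpriseLearner, the action names, the use of positive surprise only, and the update rules are all hypothetical and are not a model of the basal ganglia.

```python
class SurpriseLearner:
    """Toy learner (hypothetical): unexpected outcomes reinforce the action that caused them."""

    def __init__(self, actions, learning_rate=0.5):
        # Propensity: how strongly each action tends to be selected.
        self.propensity = {action: 1.0 for action in actions}
        # Predicted chance that the light comes on after each action.
        self.predicted_outcome = {action: 0.0 for action in actions}
        self.learning_rate = learning_rate

    def observe(self, action, light_on):
        outcome = 1.0 if light_on else 0.0
        # Assumption: only an unexpected event (positive surprise) produces a signal.
        surprise = max(0.0, outcome - self.predicted_outcome[action])
        # The surprise signal makes the action more likely to be repeated.
        self.propensity[action] += surprise
        # The internal model improves, so the same outcome is less surprising next time.
        self.predicted_outcome[action] += self.learning_rate * (
            outcome - self.predicted_outcome[action]
        )
        return surprise

    def choice_probability(self, action):
        return self.propensity[action] / sum(self.propensity.values())


learner = SurpriseLearner(["press_switch", "wave_arm"])
first_surprise = learner.observe("press_switch", light_on=True)
second_surprise = learner.observe("press_switch", light_on=True)
no_surprise = learner.observe("wave_arm", light_on=False)
```

After two presses the switch action is selected more often than the alternative, while the surprise itself has already halved, so the reinforcement fades once the action-outcome relation is learned.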

Question: One-shot learning vs. “twiddling around”.

Kevin: Toddlers learn by trying and by refining actions. We explore the action space and keep repeating actions until we build/refine our internal model.

Gianluca: you can build actions bottom-up, by chunking movements and assigning them a label, or based on goals: you first acquire a goal, and this drives the learning of the sensorimotor mapping that leads to such a goal.

Juergen: there are all sorts of reinforcement learners (but not “different motivation”). There is always a predictor that generates reward and a reinforcement learner that tries to maximize such reward. Imagine the two extremes: you get bored in a white room with white walls (no stimuli) because there is no novelty, nothing to be learned. On the other hand, if you live in “white noise” the compressor also cannot detect regularities, so you again get bored. We, like any RL system, are most interested in medium-complexity settings where the system can learn to improve its compression of information.
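The two extremes in Juergen's example can be illustrated with a small numerical sketch. It is only an assumption-laden toy: the function learning_progress, the running-mean predictor, the window size, and the three synthetic streams are hypothetical choices, and progress is measured simply as the drop in mean squared prediction error between an early and a late window.

```python
import random


def learning_progress(stream, learning_rate=0.005, window=500):
    """Mean squared prediction error in the first window minus that in the last window."""
    estimate = 0.0
    errors = []
    for observation in stream:
        # Predict first, then learn from the observation.
        errors.append((observation - estimate) ** 2)
        estimate += learning_rate * (observation - estimate)
    early = sum(errors[:window]) / window
    late = sum(errors[-window:]) / window
    return early - late


rng = random.Random(0)  # fixed seed so the sketch is repeatable
steps = 1000

white_room = [0.0] * steps                                    # already perfectly predicted
white_noise = [rng.uniform(-1.0, 1.0) for _ in range(steps)]  # nothing to learn
novel_regularity = [1.0] * steps                              # regular, but not yet known

progress_white_room = learning_progress(white_room)
progress_white_noise = learning_progress(white_noise)
progress_novel = learning_progress(novel_regularity)
```

Under these assumptions the white room yields exactly zero progress, white noise yields progress that fluctuates around zero, and only the stream with a regularity the predictor does not yet know yields clearly positive progress.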

Matthew: Can we actually use these learning principles for anything useful? Anything where the computational complexity is tractable?

Juergen: systems scale with time: 20 years ago we had very simple networks that “lived” in a maze and (on slow computers) explored certain regions of the maze. Today we're working on more complex robots, and the IM-CLeVeR project aims at exactly this, e.g. using the iCub humanoid robot.

A new experimental paradigm usable with babies

Jochen: I want to introduce an experimental paradigm that can be used to study intrinsic motivations in young infants, but that is also a general new paradigm usable for many other studies with very young babies. These are not capable of acting on the environment, which rules out a number of tests that can be performed with adults. To overcome this problem, we had the idea of using an eye-tracking system to allow young babies to 'manipulate' the world. For example, when the baby fixates a black dot on a screen, she causes an image to appear. Babies are very good at learning this, e.g. in 2-3 trials!

Related topic: hierarchical reinforcement learning

Jochen: Aside from this, I want to introduce another topic that has not yet been mentioned but is very important both for RL and for IM-CLeVeR. This relates to the problem of the 'curse of dimensionality'. How do you control a body that has hundreds of degrees of freedom and receives millions of sensory inputs? The essence of the solution we are exploring is to search in sub-sets of these dimensions, so that finding a solution gets much easier. This leads us to develop hierarchical RL. The idea is to exploit the independence between dimensions whenever possible.
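The benefit of searching in sub-sets of dimensions can be seen from simple counting. The sketch below is not hierarchical RL itself; it only illustrates the combinatorics under the assumption that groups of dimensions are fully independent. The function names and the numbers (10 values per dimension, 6 dimensions, 3 groups of 2) are hypothetical.

```python
def joint_search_size(values_per_dimension, num_dimensions):
    """Number of configurations when all dimensions are searched together."""
    return values_per_dimension ** num_dimensions


def factored_search_size(values_per_dimension, group_sizes):
    """Number of configurations when independent groups are searched separately."""
    return sum(values_per_dimension ** size for size in group_sizes)


# Six dimensions with ten values each, searched jointly or as three independent pairs.
joint = joint_search_size(10, 6)                # 1_000_000 configurations
factored = factored_search_size(10, [2, 2, 2])  # 300 configurations
```

The joint search grows exponentially with the number of dimensions, while the factored search grows only with the size of the largest group, which is why exploiting independence pays off where it actually holds.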

Attention is a fundamental process in the brain

Ernst Niebur: Statement: in a complex (real-world) scene you cannot do RL in real-time reactive systems. Example: while you are here paying attention to the presenter, you do not pay attention, say, to what your left foot does inside your left shoe. But if you want to do so (such as now) you can easily do it. The point is that whatever you do, you need to direct your attention to the currently relevant stimuli. To do this you can use bottom-up or top-down strategies. Bottom-up example: a bee sting will divert your attention immediately, no matter how much you concentrate on the talk. Top-down example: as discussed, you can choose to intentionally focus your attention on your left foot.

The problem that all organisms face is that they need to have a large number of sensors available (such as pressure on skin, etc.), but the amount of incoming information is more than the brain can process. For instance, primates have around 1 million fibers per optic nerve, which leads to an information inflow of the order of 100 megabits per second. And this is only for the visual system; the somatosensory system contributes about the same amount of information, plus audition and the chemical senses.

Juergen: isn't there more information in the timing? The exact interspike interval might provide additional information.

Ernst: yes, more efficient temporal coding provides even more information, and there is some evidence for such coding schemes. Thus, if you do the math, the visual information alone would fill all of the synapses in your brain within weeks if not filtered, which is clearly not what happens. So the brain must select.
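The "do the math" step can be sketched as a back-of-the-envelope calculation. Only the fiber count and the 100 megabits per second figure come from the discussion; the synapse count and the storage capacity of one bit per synapse are assumptions chosen purely for illustration, and all constant names are hypothetical.

```python
# Figures quoted in the discussion (orders of magnitude only).
OPTIC_NERVE_FIBERS = 1_000_000
VISUAL_BITS_PER_SECOND = 100_000_000

# Assumptions for illustration only, not taken from the discussion.
ASSUMED_SYNAPSES = 10**14
ASSUMED_BITS_PER_SYNAPSE = 1

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate carried by a single fiber.
bits_per_fiber_per_second = VISUAL_BITS_PER_SECOND / OPTIC_NERVE_FIBERS

# Time until unfiltered visual input alone would use up the assumed capacity.
seconds_to_fill = ASSUMED_SYNAPSES * ASSUMED_BITS_PER_SYNAPSE / VISUAL_BITS_PER_SECOND
days_to_fill = seconds_to_fill / SECONDS_PER_DAY
```

With these assumed values each fiber carries about 100 bits per second and the capacity would be used up in roughly a week and a half. Different assumptions about synapse count or capacity shift the result, but it stays far short of a lifetime, which is the point of the argument.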

Matthew: Is attention some sub-cortical mechanism that was specifically built for this purpose?

Ernst: Selective attention is a very fundamental process of perception and cognition so there are many structures involved in attention, both sub-cortical (e.g. thalamic) and cortical (e.g. posterior parietal cortex).

Kevin: While attending to the presenter, are other retinal ganglion cells not responding?

Ernst: No, at least in the visual system, the afferent sensory input stays the same, there is no feedback to the retina. But already at the next level, the thalamus, there are likely attentional influences. Then, the higher we go up in the information processing hierarchy, the more important the effect of selection becomes.

Jochen: Attention on the sensory side is a hot topic; is there also attention on the motor side?

Ernst: Yes. We necessarily need to select between different actions. For instance, we can direct our gaze either to the left or to the right, not both. In fact, it has been proposed that visual selective attention might be very similar to motor control, in the pre-motor theory of covert attention.

Work-group on Attention: contact Alex Rast, Ernst Niebur, Kevin Gurney.
