Wednesday, 15 April 2009

When stochastic search beats direct methods

The central problem of reinforcement learning is to fit the parameters of an agent's policy so that the agent makes "good" decisions according to some unknown objective.

Let's consider a very simple case. We have an MDP that is essentially stateless: all the world does is map the agent's action directly to a reward. Let this reward function be a mixture of two Gaussians, with one mode at -3 and one at 3, and say we pick a single-mode Gaussian as our model of the world.

After fitting our model to the data gathered via a rollout, weighting each sample by its reward, we get a Gaussian whose mode lies somewhere around 0.

A better strategy for the agent would be to stick to one of the modes. Even though that model of the world is no more correct, it leads to a better policy: picking 0 as an action yields a reward of approximately zero, while always picking -3 or always picking 3 yields a reward of about 0.5.

Under the assumption of a single-mode Gaussian, the maximum likelihood approach thus picks a non-optimal solution.
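To make this concrete, here is a minimal Python sketch of the stateless example. The exact reward shape (two Gaussian bumps of height 0.5 with modes at -3 and 3) and the broad exploration Gaussian are assumptions chosen to match the numbers above, not a reference implementation.

import numpy as np

def reward(a):
    # Assumed reward landscape: two Gaussian bumps with modes at -3 and 3, peak height ~0.5.
    return 0.5 * np.exp(-0.5 * (a + 3) ** 2) + 0.5 * np.exp(-0.5 * (a - 3) ** 2)

rng = np.random.default_rng(0)
actions = rng.normal(0.0, 4.0, size=10_000)   # rollout from a broad exploration Gaussian
r = reward(actions)

mu = np.average(actions, weights=r)           # reward-weighted maximum likelihood mean
print(mu)                                     # lands near 0, between the two peaks
print(reward(mu), reward(3.0))                # roughly 0 versus roughly 0.5 for sticking to one mode

The reward-weighted fit splits the difference between the two modes, which is exactly the non-optimal solution described above.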

Of course, there are superior methods. For example, one could estimate the gradient and simply move up to one of the modes. But what happens if the gradient information is incomplete? This can happen if our world has time dependencies that are not modelled (e.g. non-Markovian environments). It can also happen if our observations are not complete but only partial (as in POMDPs). In such cases, direct methods are again prone to fail.

Imagine the world is accessible only through a proxy that turns the complete state information into an observation via a many-to-one mapping. For example, two different states s and s' map to the same observation o, yet the rewards r(s) and r(s') are completely different, say -1 and 1. In that case our sampling gives a completely wrong impression.
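A tiny sketch of this aliasing effect, using the -1 and 1 from above (the uniform choice between the two hidden states is an assumption): since every sample looks like the same observation o to the agent, the estimated value of o averages out to roughly 0, a value that no single state ever produces.

import random

random.seed(0)
samples = []
for _ in range(1000):
    s = random.choice(["s", "s_prime"])   # hidden state, never visible to the agent
    o = "o"                               # many-to-one mapping: both states yield the same observation
    r = -1 if s == "s" else 1
    samples.append(r)

print(sum(samples) / len(samples))        # roughly 0 -- neither -1 nor 1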

Stochastic search does not suffer from this problem. A direct method uses an approximation of the world's dynamics, built from the observations and rewards, together with the agent's current policy in order to generate a new (hopefully better) policy. A simple hill climber, by contrast, uses only the current policy and its expected reward. Because it does not rely on local information at all, it cannot be misled by wrong local information, which in theory makes it more robust in environments that are full of it.
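For contrast with the reward-weighted fit above, here is a minimal (1+1) hill-climber sketch on the same stateless problem; the Gaussian perturbation and the greedy acceptance rule are assumptions. It only ever looks at the current policy's reward and ends up at one of the two modes instead of between them.

import numpy as np

def reward(a):
    # Same assumed reward landscape as above: two Gaussian bumps at -3 and 3.
    return 0.5 * np.exp(-0.5 * (a + 3) ** 2) + 0.5 * np.exp(-0.5 * (a - 3) ** 2)

rng = np.random.default_rng(1)
policy = 0.0                                  # deterministic policy: always pick this action
best = reward(policy)

for _ in range(200):
    candidate = policy + rng.normal(0.0, 1.0) # random perturbation of the current policy
    r = reward(candidate)
    if r >= best:                             # keep the candidate only if it is no worse
        policy, best = candidate, r

print(policy, best)                           # ends up near -3 or 3 with reward around 0.5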

Friday, 13 February 2009

Bash Job Management Essentials

Did you ever wish your program had a pause button? I do, regularly. I often have to run lengthy calculations, for example to evaluate heuristics for an optimization problem such as Maximum Independent Set or Matching. A typical program could look like this:
for all benchmark graphs G:
  for all algorithm variants A:
    for k in [2, 4, 8, ..., 128]:
      Execute A on G with the current value for k
Now if this computation takes 20 hours and I'm running it on my laptop, one of the CPU cores is hogged and the fan goes berserk. So what happens if I want to take my computer somewhere else or watch a movie? As it turns out, I can simply suspend the machine, and after waking it up the program will continue to run. That already solves the first problem and shows that the operating system can suspend my program out of the box.

The bash shell (and I'm sure other shells, too) helps me with the second problem: it can do basic job management. Consider the following command sequence, which lists my temporary directory and pipes it through less. less then waits for my input.
$ ls /tmp | less
Now, I press Ctrl+Z to suspend less:
[1]+  Stopped                 ls /tmp | less
Neat! How do we get the listing back? With fg %<job number>:
$ fg %1
We can also have multiple jobs:
$ ls /tmp | less
Ctrl+Z
[1]+  Stopped                 ls /tmp | less
$ ls /usr | less
Ctrl+Z
[2]+  Stopped                 ls /usr | less
The command jobs gives us a listing of the current jobs:
$ jobs
[1]-  Stopped                 ls /tmp | less
[2]+  Stopped                 ls /usr | less
There is one gotcha: if you use the bash built-in time, it prints the elapsed time at the moment you suspend the job and does not report the time spent after you resume it:
$ time ls /tmp | less
Ctrl+Z
[1]+  Stopped                 ls /tmp | less

real 0m1.139s
user 0m0.001s
sys 0m0.003s
manuel@coxorange ~
$ fg %1
ls /tmp | less
Ctrl+Z
[1]+  Stopped                 ls /tmp | less
In this case, run time on the fg command instead:
$ time ls /tmp | less
Ctrl+Z
[1]+  Stopped                 ls /tmp | less

real 0m0.723s
user 0m0.001s
sys 0m0.003s
manuel@coxorange ~
$ time fg
ls /tmp | less
Ctrl+Z
[1]+  Stopped                 ls /tmp | less

real 0m0.913s
user 0m0.000s
sys 0m0.000s
 
