Visualizing Rewards in Reinforcement Learning – Apart from the policy, reward is the second most important concept in RL. In the illustration shown, if you always take the action with the maximum immediate reward, you may never reach the jackpot. That’s because it is fenced off by low-reward actions in front of it. So we can clearly see that the simple immediate reward cannot write the policy on its own.
So how are we going to get a better policy? Well, the answer sits on the shelves of dynamic programming. Instead of using the immediate reward, we can use a better deciding mechanism: choose the action with the maximum discounted reward.
Discounted reward, as the name suggests, takes the future into account. If the discount factor is zero, we are just following our stupid old look-ahead that sees only the immediate reward, and if it is 1, we are summing up all the rewards straight up until we hit a dead end. Think of a rod that is heated at one end and hinged at the other. You can then think of discounting as the propagation of heat, or how heating one end affects the adjacent blocks.
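The effect of the discount factor can be sketched in a few lines. This is only an illustration: the function name `discounted_return` and the reward numbers (three small penalties fencing a jackpot of 100) are made up for the example and do not come from the illustration in the video.

```python
def discounted_return(rewards, gamma):
    """Add up the rewards, weighting each one by gamma raised to its delay."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Hypothetical path: three low rewards fencing a jackpot of 100.
path_to_jackpot = [-1, -1, -1, 100]

for gamma in (0.0, 0.5, 1.0):
    print(gamma, discounted_return(path_to_jackpot, gamma))
```

With a discount factor of 0 only the first penalty is visible, so the path looks bad. With 1 every reward counts in full, and anything in between lets part of the jackpot shine through.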
We use the discount factor to connect the states in the reward sense. If it is zero, then they are disconnected, and if it is 1, then they are no different from each other. This means that, in our rod, each block starts at 0, but in steady state each one reaches an average reward or temperature. This is called the state value function. If you haven’t figured it out, the blocks here are the states, and the temperature is the expected reward. Put in mathematical terms, these are the Bellman equations.
You can look up the mathematics in the blog that I mentioned in the description. But in essence it is just using the discounting mechanism to spit out a number for each state and each action. You can think of this number as temperature or expected reward. For a state this number is called the state value ‘v’, and for an action it is called the action value ‘q’. The function that emits it with a state as input is called the state value function, and for an action it is called the action value function.
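For reference, here is one common way of writing these two numbers down, in the optimality form. The symbols other than v and q are standard textbook notation and are assumptions here, not taken from the blog: P(s' | s, a) is the probability of landing in state s' after taking action a in state s, r(s, a, s') is the reward for that step, and \gamma is the discount factor.

```latex
q(s, a) = \sum_{s'} P(s' \mid s, a) \, \bigl[ r(s, a, s') + \gamma \, v(s') \bigr]

v(s) = \max_{a} \; q(s, a)
```

The first line is a weighted average over the possible next states, and the second picks the best action for the state.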
There is one last revelation that I would like to make. As my Nana says, love thy neighbor and the world will be a better place. Well, guess what Bellman said... He said, “discount only the neighbor’s value, and after a few iterations the value will be in steady state”. So for every single state you need to love thy neighbor, I mean, update based on your neighbor. You don’t need to discount all the way up.
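The neighbor-only update can be sketched on the heated rod from earlier. This is a toy under stated assumptions: a hypothetical rod of five blocks where only the last one is heated (reward 10), a discount factor of 0.5, and a made-up helper called `sweep` in which every block looks only at the block to its right.

```python
def sweep(values, rewards, gamma):
    """One pass over the rod: each block is updated from its right-hand neighbor only."""
    n = len(values)
    new_values = []
    for i in range(n):
        # The last block has nothing beyond it, so its neighbor contributes 0.
        neighbor = values[i + 1] if i + 1 < n else 0.0
        new_values.append(rewards[i] + gamma * neighbor)
    return new_values

rewards = [0.0, 0.0, 0.0, 0.0, 10.0]  # hypothetical rod: only the last block is heated
gamma = 0.5
values = [0.0] * len(rewards)         # every block starts at 0

for iteration in range(1, 6):
    values = sweep(values, rewards, gamma)
    print(iteration, values)
# final line printed: 5 [0.625, 1.25, 2.5, 5.0, 10.0]
```

Each pass moves the heat one block further along the rod, and after five passes another sweep changes nothing, which is the steady state.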
Just discount the neighbor and the reward will start flowing everywhere. So it’s time for a little quiz here! He has only a single bullet left, so if he aims at the cops, the chance that he ends up dead is 70%. But if he’s successful, he gets his billion dollars back.
On the other hand, if he puts the gun down, there is a 50/50 chance of getting killed in an encounter or serving a shorter jail term. So what is the best action for him? Time’s up. Well, if you have done the calculation, you will find that shooting the cop is the best action, with an action value of minus 0.4. The answer is explained in the blog, but in real life there are multiple actions that can follow once the bullet is fired. So here’s where our love-thy-neighbor concept comes into the picture.
We don’t care whether the initial state value is high or low, sinner or saint. If each state loves its neighbor (not everyone, just its neighbor), then its value ends up exactly equal to the expected reward. This is called the steady state of the Bellman optimality equation in dynamic programming. Its goal is to express the current state or action value as a function of its neighbor, I mean, the next state or action value. It’s super easy peasy.
You just look at the state. There are actions sprouting from it. Each action can lead you to new states probabilistically. Why probabilistically? Because you can choose to jump from the window, but how many bones get broken, or rather whether you end up in the alive or the dead state, is probabilistic. Let’s look at expressing the current state value as a function of the next state value.
You take the weighted average of the states resulting from an action. Statisticians call this weighted average the expectation. After you find the expected reward for each action, you update the state value with the max of these action values, whether there are two of them or many. If you are reading this theory for the first time, you are bound to have lots of doubts.
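The weighted average and the max described above can be sketched as follows. Everything here is hypothetical: the tiny three-state world, its probabilities and rewards, and the helper names `action_value`, `backup` and `value_iteration` are made up for illustration and are not the numbers from the quiz or the blog.

```python
# Hypothetical toy world:
# transitions[state][action] = [(probability, next_state, reward), ...]
transitions = {
    "start": {
        "safe":  [(1.0, "end", 1.0)],
        "risky": [(0.7, "end", -1.0), (0.3, "jackpot", 0.0)],
    },
    "jackpot": {
        "collect": [(1.0, "end", 10.0)],
    },
    "end": {},  # dead end: no actions, so its value is 0
}

def action_value(outcomes, values, gamma):
    """Weighted average (expectation) over the states an action can lead to."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)

def backup(state, values, gamma):
    """New state value: the max over the action values sprouting from the state."""
    actions = transitions[state]
    if not actions:
        return 0.0
    return max(action_value(outcomes, values, gamma) for outcomes in actions.values())

def value_iteration(initial_values, gamma=0.9, sweeps=50):
    """Repeat the neighbor-only update for every state a fixed number of times."""
    values = dict(initial_values)
    for _ in range(sweeps):
        values = {state: backup(state, values, gamma) for state in transitions}
    return values

sinner = value_iteration({state: -100.0 for state in transitions})
saint = value_iteration({state: 100.0 for state in transitions})
print(sinner)
print(saint)
```

Both runs settle on the same values, so the starting guess does not matter. In this made-up world the risky action comes out ahead of the safe one (about 2.0 against 1.0) even though its immediate reward is worse, because the jackpot’s value flows back through the neighbor.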
That’s why crazymuse has an even crazier explanation. Let’s take another simple example from Game of Thrones.