# Imagination Based AIs - Part 5

# Deep Learning Theory

As per Demis Hassabis’s paper on Deep Learning theory[UC1], the knowledge of making good decisions is represented as a Deep Q-Network (DQN), a type of Artificial Neural Network (ANN) that calculates the right amount of reward based on the decision strategy. To continue the game-character example from my previous articles: if the chosen path is a winning path that leads to the end of the game in fewer steps, the character receives two energy drinks; if the path is still a winning path but takes more steps to reach the end, only one energy drink is provided.
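
The reward rule in the game example can be sketched as a tiny function. This is purely illustrative: the step threshold and function name are my own assumptions, not anything from the paper.

```python
# Toy sketch of the reward idea above: a winning path in fewer steps
# earns more reward. The 20-step threshold is made up for illustration.
def energy_drinks(winning_path, steps, threshold=20):
    if not winning_path:
        return 0                          # no reward without reaching the goal
    return 2 if steps <= threshold else 1 # fewer steps -> bigger reward

fast_win = energy_drinks(True, 12)   # short winning path: 2 drinks
slow_win = energy_drinks(True, 35)   # longer winning path: 1 drink
```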

The performance of DQN is quite high compared to shallow networks (simple ANNs with one or two hidden layers), medium neural networks (Recurrent Neural Networks and LSTMs), and traditional feed-forward Machine Learning networks (supervised and unsupervised).

As I mentioned in my previous article, DQN has multiple hidden layers of neurons, giving rise to a number of possible decisions based on the previous action, along with the probability of each next state leading to the destination. Based on this information, the agent can decide the next state. The DQN stores these experiences as training episodes in a replay memory, then randomly samples and replays them, feeding diverse and de-correlated training data into future iterations until the algorithm cracks the problem and produces the winning strategy. A single-layer ANN can also maintain a Q-table; nonetheless, the DQN provides multiple strategies and hence multiple Q-values.
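
The store-then-randomly-sample loop described above can be sketched in a few lines. This is a minimal illustration of experience replay, not DeepMind's actual implementation; the class name and capacity are my own choices.

```python
import random
from collections import deque

# Minimal sketch of DQN-style experience replay.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are dropped

    def store(self, state, action, reward, next_state, done):
        # One transition = one step of the agent's experience
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling de-correlates consecutive transitions
        return random.sample(list(self.buffer), batch_size)

buf = ReplayBuffer(capacity=100)
for step in range(50):
    buf.store(step, 0, 1.0, step + 1, False)
batch = buf.sample(8)  # a diverse, de-correlated training batch
```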

The above video from DeepMind shows how the DQN learns with every iteration, stores the data as a training episode, and learns from it until it finally breaks down the pixel wall. The number of squares the algorithm breaks in each iteration is shown at the top of the video as a rolling number, and as you can see, it keeps changing based on the decision made.

It works on the simple principles of imagination and innovation, where training models acquire knowledge much as the human brain gains knowledge through experience and education as we grow older, and then use this acquired capability to deduce or predict different scenarios from the existing environment information in order to solve real-time problems. The human brain achieves this using the hippocampus, a seahorse-shaped structure that helps gather and replay experiences during rest and sleep (this is the main reason well-rested people can invent and create better than the sleep-deprived). This type of AI also learns from past failures to do better. This is the concept behind Imagination-Augmented Agents, proposed by Demis Hassabis et al. Building on the same idea of enhancing a reinforcement learning agent, researchers have improved and stabilized the Deep Q-Network using techniques that involve prioritizing, scaling, aggregating, approximating, and normalizing the replayed experiences and training episodes.

# Types of DQN

Thus, the reinforcement-based DQN has evolved, and is still evolving as you read this article. The following are the main types of evolved DQNs.

# Double DQN

[UC2]The standard DQN outputs various strategies in terms of Q-values. The best strategy is usually chosen using the **maximum Q-value**, which leads to overestimation of the target goals. Double DQN is proposed instead: it overcomes the problem of overestimated Q-values and overshooting targets by decoupling the two roles, so that the first Q-network selects the action and a second network evaluates the Q-value of that specific action.
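
The difference between the two targets can be shown with plain lists standing in for network outputs. This is a hedged sketch: `q_online_next` and `q_target_next` are hypothetical Q-value lookups, not a real API.

```python
def dqn_target(reward, gamma, q_target_next):
    # Standard DQN: the same (target) values both select and evaluate
    # the next action, which tends to overestimate.
    return reward + gamma * max(q_target_next)

def double_dqn_target(reward, gamma, q_online_next, q_target_next):
    # Double DQN: the online network selects the action...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ...and the target network evaluates that specific action.
    return reward + gamma * q_target_next[best_action]

q_online_next = [1.0, 2.5, 2.0]  # online net prefers action 1
q_target_next = [1.2, 2.0, 3.0]  # target net overestimates action 2

standard = dqn_target(1.0, 0.9, q_target_next)
double = double_dqn_target(1.0, 0.9, q_online_next, q_target_next)
# The double target is smaller: it ignores the overestimated action 2.
```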

# Prioritized DQN

[UC3] DQN learns from previous experiences stored in a data buffer. This is called experience replay: experiences are drawn from the buffer and sampled frequently in order to choose the next strategy. In standard DQN, the experience replay is buffered and sampled uniformly. Nonetheless, this slows learning, because we need to sample the **significant, high-error transitions more frequently**. Thus, Prioritized DQN uses the transition (TD) error to sample important experiences more often, improving the speed of the network.
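
The idea of sampling in proportion to TD error can be sketched as follows. The `alpha` exponent and the small `eps` offset are conventional choices from the prioritized-replay literature, but the numbers and names here are illustrative.

```python
import random

# Sketch of proportional prioritized replay: transitions with larger
# TD error are sampled more often.
def sample_prioritized(transitions, td_errors, batch_size, alpha=0.6, eps=1e-5):
    # Priority p_i = (|td_error_i| + eps) ** alpha
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    # Sample indices according to the priority distribution
    return random.choices(range(len(transitions)), weights=probs, k=batch_size)

transitions = ["t0", "t1", "t2", "t3"]
td_errors = [0.01, 5.0, 0.02, 0.01]   # t1 was the most "surprising" step
idx = sample_prioritized(transitions, td_errors, batch_size=1000)
# t1 dominates the sampled batch, so the network revisits it most often.
```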

# Dueling DQN

[UC4] Dueling DQN (DuDQN) uses aggregation to improve on the results of a standard DQN, ending in a better-performing network. Understanding which states are **more valuable**, without learning the effect of every action in every state, allows the network to be more efficient and produces better estimates of the Q-values. A simple analogy is the student who is terrific at mental arithmetic and reaches a mathematical solution faster than most. The **aggregating layer** acts as that superfast student, doing the mental arithmetic and focusing on the important steps toward a solution. A good example: when the next Q strategy or move devised by a self-driving car would result in a collision with another vehicle on the road, that strategy needs to be changed; it carries more value than a strategy that does not end in a collision. The value of this strategy is therefore considered in the **final combination layer** used to reach the target goal, which is arriving at point B without getting involved in an accident.
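
The aggregating layer itself is a small formula: Q(s, a) = V(s) + A(s, a) - mean(A). A minimal sketch, with made-up value and advantage numbers standing in for the two network streams:

```python
# Sketch of the dueling aggregation layer: Q(s,a) = V(s) + A(s,a) - mean(A).
def dueling_q_values(state_value, advantages):
    mean_adv = sum(advantages) / len(advantages)
    # Subtracting the mean advantage keeps the value and advantage
    # streams identifiable, so V(s) really measures "how good the state is".
    return [state_value + a - mean_adv for a in advantages]

V = 10.0              # how good the state is on its own
A = [1.0, -1.0, 0.0]  # how much each action improves on that
Q = dueling_q_values(V, A)
# The mean of Q equals V: the state's value is shared by all actions.
```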

# Distributional DQN

Distributional DQN takes into account the unexpected scenarios of everyday life to reach the target goal more efficiently. It involves approximating the Q-values using a modified version of the Bellman equation used in most DQNs for policy or strategy mapping. In other words, it helps reach the target goal step by step by feeding in the current state and current action.
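
For reference, the standard Bellman backup that the distributional variant modifies looks like this; the values below are illustrative.

```python
# Sketch of the standard Bellman backup used to bootstrap Q-values:
# Q(s, a) <- r + gamma * max_a' Q(s', a')
def bellman_target(reward, gamma, next_q_values, done):
    if done:
        return reward  # terminal state: no future to bootstrap from
    return reward + gamma * max(next_q_values)

target = bellman_target(reward=1.0, gamma=0.9,
                        next_q_values=[2.0, 3.0], done=False)
terminal = bellman_target(reward=5.0, gamma=0.9,
                          next_q_values=[2.0], done=True)
```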

While accommodating randomness in the model's behavior, the algorithm considers three possible outcomes:

1) When given two choices, the learning agent can decide which outcome is best by using **environmental input**. For example, when one route takes 15 minutes to reach the destination and the other takes 30 minutes due to road maintenance, it chooses the 15-minute route by receiving the road-maintenance alert notifications.

2) When faced with two choices, the learning agent chooses the option that varies least from the **average** of the previous choices.

3) Given two optional strategies, the DiDQN can come up with multiple strategies using the modified Bellman equation with a **probability distribution** and choose the best among them.

The new Distributional DQ Network uses the third outcome and an improved, modified Bellman equation to predict possible strategies without taking the average of all the outcomes. Thus, the probability distribution of the outcomes is preferred over averaging them, which reduces instability, inductive bias, and poor predictions in the network.
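
A tiny sketch shows what averaging throws away. The "atoms" (possible return values) and probabilities are made-up numbers, not a real learned distribution.

```python
# Distributional view: keep a probability distribution over possible
# returns ("atoms") instead of only the expected Q-value.
def expected_return(atoms, probs):
    # A standard DQN would only ever see this scalar expectation
    return sum(z * p for z, p in zip(atoms, probs))

atoms = [0.0, 5.0, 10.0]    # possible discounted returns
route_a = [0.0, 1.0, 0.0]   # reliable route: always returns 5
route_b = [0.5, 0.0, 0.5]   # risky route: returns 0 or 10

mean_a = expected_return(atoms, route_a)
mean_b = expected_return(atoms, route_b)
# Both routes have the same mean, but wildly different distributions --
# exactly the information that averaging the outcomes discards.
```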

# Asynchronous DQN

[UC5] This uses multithreading to solve problems by running many instances of the algorithm in parallel across a number of parallel processors instead of a single standard CPU. In this way, imagination-based AIs learn faster and make efficient decisions within a short time using multiple networks. This is similar to the concept of parallel universes in quantum entanglement. Here, two or more agents from different instances of the algorithm interact with the environment **in parallel**. The experience of each agent is then shared among all agents, and this combined data helps to reduce the error in the **gradient function**[UC6] and helps all the agents reach their best performance; the best among all the learning agents is chosen as the final network architecture.
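
A toy sketch of the asynchronous idea: several workers apply updates to one shared parameter set. The "gradient" here is a constant stand-in, and the dictionary-of-one-weight network is my own simplification, nothing like a real A3C implementation.

```python
import threading

# Toy sketch: 4 workers interact with their own copy of the environment
# and asynchronously update one shared parameter set.
shared_params = {"w": 0.0}
lock = threading.Lock()

def worker(steps):
    for _ in range(steps):
        grad = 0.01   # pretend gradient computed from this worker's experience
        with lock:    # asynchronous updates to the shared network
            shared_params["w"] += grad

threads = [threading.Thread(target=worker, args=(100,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# shared_params["w"] now reflects all 4 workers' 100 updates each.
```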

# Exploration-Exploitation strategies

[UC7] In DQN, there is always a trade-off between exploration, which involves finding new Q-values and new states, and exploitation, which uses experience replay to learn from previous episodes in order to get to the target. Many researchers have implemented successful strategies to obtain a maximum reward while finding the target goal. The disadvantage of this type of DQN is that the learning agent might lean on exploitation to maximize the reward without exploring the environment completely. While exploitation is a good winning strategy, it has led to a great number of debates on the legal and ethical aspects of artificial intelligence, as exploration leads to **enhanced learning** while exploitation leads to **blitzscaling** to the winning line without acquiring the required knowledge.
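
One of the most common ways to manage this trade-off is epsilon-greedy action selection with a decaying epsilon. The decay rate and floor below are assumptions for illustration, not values from any particular paper.

```python
import random

# Epsilon-greedy: with probability epsilon the agent explores a random
# action; otherwise it exploits the best known Q-value.
def choose_action(q_values, epsilon, rng=random):
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

def decay(epsilon, rate=0.99, floor=0.05):
    # Explore a lot early on, exploit more as experience accumulates,
    # but never stop exploring entirely (the floor).
    return max(floor, epsilon * rate)

q_values = [0.1, 0.9, 0.3]
greedy = choose_action(q_values, epsilon=0.0)  # pure exploitation -> action 1

eps = 1.0
for _ in range(500):
    eps = decay(eps)
# After many episodes, epsilon has settled at its floor.
```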

I have detailed only six strategies used to improve reinforcement learning methods similar to Imagination-Augmented Agents, the inspiration behind the title of this series. But there are many more, such as the **Rainbow model**, which combines the distributional, double, and dueling networks, and the **recurrent DQN**, built on RNNs and LSTMs, which I mentioned in the last article of this series. The most important one I do not want to forget is the **Hybrid Reward Architecture**[UC8], which is used to face the challenges raised by generalization in real-world problems with a large state space.

Before I conclude this series, I want to briefly mention the significance of neuroscience-inspired AIs and reward-driven, attention-based AIs.

**Neuroscience-based AIs** use a neural Turing machine for memory, knowledge transfer for extracting concepts, and hippocampus-inspired imagination to devise strategies by simulating the impact of multiple strategies on the final goal.

**Reward-Driven Attention AIs**, as per Brian Anderson[UC9], unlike the neuroscience-inspired AIs, adapt their attention quickly based on a simple reward. The algorithms are written such that the AI is asked to pay attention to varied sets of inputs and to make a decision. Whenever the algorithm makes a decision quickly, and in the direction of the goal, based on partial inputs (acquiring partial knowledge to improve speed) rather than detailed full inputs, a reward is given as a token of appreciation, which in turn motivates the algorithm to make quicker decisions from partial inputs. If the decisions made are good, the reward level is increased, and thus the AI software learns to adapt to quick, context-sensitive information in order to take simple day-to-day decisions.

To cite an example, the algorithm is shown a variety of pictures spanning five categories: flora, fauna, automobiles, street signs, and houses. Hundreds of pictures are shown for a few seconds in quick succession, and the AI software is then asked to identify the pictures belonging to these five categories. The faster the decision, the greater the reward, and thus the software learns to make decisions in quick succession based on a context such as ‘Appears in nature’. In this case, with respect to the context, the pictures were recognized quickly. These AIs are also based on perception modelling, where cognitive access plays an important role. This type of AI is especially good for applications such as painting, NLP, and content creation, where the creativity level needed is higher than for the rest of the Imagination-Augmented Agents.

As I conclude this five-part series, I also want to mention the importance of symbiotic intelligence. The enhancement of AI technology, especially deep reinforcement learning, has also led to talk of symbiotic intelligence, also known as collective intelligence. This concept of symbiotic artificial systems enables multi-disciplinary opportunities and enormous enhancements in the fields of technology and medicine.

The advantages of symbiotic intelligence are enhanced results when handling extremely complex tasks at exceptionally low cost, achieved by eliminating human error, by taking a proactive approach, and by drawing on terabytes of data. Because of these advantages, the fields of application range from data privacy and security in mobile communications to meal orders for airplane passengers.

Reinforcement learning based AIs are the foundation stones of the futuristic AGIs I mentioned in article 2 of this series. If any of my readers want to enjoy AGIs in a non-technical, lightweight mode, I suggest watching the movie ‘Blade Runner 2049’. Except for the dystopian future depicted in the movie (just like many other futuristic sci-fi movies), I like the concept and the screenplay.

[UC1]Imagination-Augmented Agents for Deep Reinforcement Learning by Sébastien Racanière, Théophane Weber, David P. Reichert, Lars Buesing, Arthur Guez, Danilo Rezende, Adria Puigdomènech Badia, Oriol Vinyals, Nicolas Heess, Yujia Li, Razvan Pascanu, Peter Battaglia, Demis Hassabis, David Silver, Daan Wierstra of ‘DeepMind’

[UC2]Ref: Van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-Learning. In AAAI (pp. 2094–2100).

[UC3]Ref: Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2015). Prioritized experience replay. arXiv preprint arXiv:1511.05952.

[UC4] Ref: Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., & De Freitas, N. (2015). Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.

[UC5]Ref: Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., & Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning.

[UC6]In most learning algorithms, the difference between the obtained outcome and the target outcome is measured by a loss function (often the squared error). The gradient of this loss with respect to the network weights is then fed back through the network to correct the agents and steer them onto the right path toward the target.

[UC7]Ref: Learning Exploration/Exploitation Strategies for Single Trajectory Reinforcement Learning By Michael Castronovo, Francis Maes, Raphael Fonteneau, Damien Ernst, Department of Electrical Engineering and Computer Science, University of Liège, BELGIUM

[UC8]Ref: van Seijen, H., Fatemi, M., Romoff, J., Laroche, R., Barnes, T., & Tsang, J. (2017). Hybrid Reward Architecture for Reinforcement Learning. One way of tackling this problem is to use a reward function with a low-dimensional representation. But if the problem is too complex, the value function cannot be represented in a low-dimensional space, and forcing it can lead to instability in most cases. The HRA algorithm overcomes this by splitting the required function into many reward functions; each is assigned to a separate RL agent, and the agents operate and learn in parallel, devising their own strategies with their own parameters. To obtain a final strategy plan, the strategies of all the RL agents are combined, and this approach has been powerful at devising high-performing strategies for Atari-era games such as Ms. Pac-Man.

[UC9] Anderson, Brian A.: Department of Psychological & Brain Sciences, Johns Hopkins University, Baltimore, MD, US