
Journal Review | Learning to Paint With Model-based Deep Reinforcement Learning - (3)

생름 2023. 6. 22. 19:02

Huang, Zhewei, Wen Heng, and Shuchang Zhou. 2019. "Learning to Paint With Model-Based Deep Reinforcement Learning." In Proceedings of the IEEE International Conference on Computer Vision, pp. 8709–8718. PDF


  While I kept digging into reinforcement learning, I came to realize that all those technical terms could be re-wired to the concepts of general algorithms: the state as the input of each step; the agent as the output-producing part of the AI algorithm; the policy as the functions inside the AI algorithm; the reward as the performance metric, assessed from the state and the agent's choice at each step; and the environment as what the AI algorithm reproduces along the steps, in combination with miscellaneous factors. (Read more)

  Also, the Learning to Paint algorithm uses Actor-Critic networks. The actor network updates the policy parameters and selects the agent's action at each state in the loop; the critic network evaluates that action with a value function. (Read more) This action-value function with extra parameters is useful in reinforcement learning for reducing learning time, which usually takes longer than for other ML types because a return value is computed in every loop. In Learning to Paint, actor.py and critic.py implement these networks with a ResNet-style architecture in PyTorch, working alongside the neural renderer network.
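To make this concrete, here is a minimal sketch of an actor and a critic written with PyTorch modules. The tiny MLP layers and the toy dimensions are my own simplifications for illustration; the repository's actor.py and critic.py build ResNet-style backbones instead.

```python
import torch
import torch.nn as nn

# Simplified actor/critic sketch (toy dimensions), not the repository's exact code.
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Sigmoid()  # stroke parameters squashed into [0, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 1)  # scalar estimate of the action's value
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Usage: the actor proposes an action for a state, the critic scores that action.
actor, critic = Actor(64, 13), Critic(64, 13)   # 64 and 13 are toy dimensions
s = torch.randn(1, 64)
a = actor(s)
q = critic(s, a)
```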

With this in mind, my previous diagram of model-based deep reinforcement learning should be revised to the image below. In this diagram, I consider the components outside the environment to be made up by the agent in each phase.

 

This is a far more accurate version of the diagram compared to the previous one, now that I understand the algorithm's scripts. The meanings of model-based DDPG and model-based DRL have been amended, and the roles of the Actor-Critic networks, along with state, policy, action, value, reward, and so on, make more sense than before. I should probably erase the former diagram.

  Meanwhile, the environment here in Learning to Paint is what constitutes the given canvas along the steps. For example, env.py has the Paint class together with functions such as load data, previous image, observation (= state), reward, and step, as sketched below. The specific drawing function, the quadratic Bézier curve (QBC), is located separately in the Renderer directory.
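Below is a minimal skeleton of how such a Paint-style environment could be organized. The method names mirror the ones mentioned above, but the bodies are simplified placeholders of my own, not the original env.py.

```python
import numpy as np

# Minimal Paint-style environment skeleton; the reward here is simply the reduction
# of the L2 distance to the target image after each stroke (a simplification of mine).
class Paint:
    def __init__(self, target_img, renderer, max_step=40):
        self.target = target_img.astype(np.float32) / 255.0  # image the agent should reproduce
        self.renderer = renderer                              # e.g. a quadratic Bezier stroke renderer
        self.max_step = max_step

    def reset(self):
        self.canvas = np.zeros_like(self.target)  # start from a blank canvas
        self.step_count = 0
        return self.observation()

    def observation(self):
        # The state bundles the current canvas, the target image, and the step index.
        return self.canvas, self.target, self.step_count

    def reward(self, prev_canvas):
        # How much closer to the target did the last stroke bring the canvas?
        before = np.mean((prev_canvas - self.target) ** 2)
        after = np.mean((self.canvas - self.target) ** 2)
        return before - after

    def step(self, action):
        prev = self.canvas.copy()
        self.canvas = self.renderer(action, self.canvas)  # draw the stroke described by the action
        self.step_count += 1
        done = self.step_count >= self.max_step
        return self.observation(), self.reward(prev), done
```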

 

ICCV2019-LearningToPaint directory (GitHub page)

 

How does a training session work?

  The training session trains the agent so that the policy becomes capable of performing the painting task in a given environment. I left notes underneath, highlighted in orange, and attempted to interpret how the agent, the state (here a parameter called observation), and the environment work in the training session.

Below is the train function extracted from train.py of the Learning To Paint deep reinforcement learning algorithm:

 

 
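For orientation, here is a much-simplified sketch of what such a DDPG-style training loop looks like. The agent, env, and their method names (random_action, select_action, observe, update_policy) are placeholders of mine, not the verbatim train.py.

```python
# Simplified DDPG-style training loop (placeholder names, not the original train.py):
# the agent acts in the painting environment, transitions are stored, and the
# actor/critic networks are updated once the warm-up phase is over.
def train(agent, env, num_episodes=1000, warmup=400):
    total_steps = 0
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            if total_steps < warmup:
                action = agent.random_action()        # explore before learning starts
            else:
                action = agent.select_action(state)   # actor network picks the stroke
            next_state, reward, done = env.step(action)
            agent.observe(state, action, reward, next_state, done)  # store the transition
            if total_steps >= warmup:
                agent.update_policy()                 # one actor/critic gradient step
            state = next_state
            total_steps += 1
```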

How do neural renderer layers operate in Python? Example:

  There are several networks serving Learning to Paint, and I cannot help but wonder what their layers look like. The following script is a sample of the neural renderer that draws on the canvas in the given environment. Basically, the FCN (fully convolutional network) model is built with PyTorch modules; it is difficult to grasp all the concepts beyond its appearance, but overall, the input is transformed into an output tensor.

 

Renderer.py:

 

 
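As an illustration, here is a simplified sketch of what such a fully convolutional renderer can look like: a stroke-parameter vector is expanded by fully connected layers, reshaped into a small feature map, and upsampled into a stroke image. The 10-dimensional input, the layer widths, and the 128×128 output resolution are assumptions for the sketch, not values copied from the repository.

```python
import torch
import torch.nn as nn

# Simplified FCN-style neural renderer: stroke parameters in, grayscale stroke image out.
# Input size, layer widths, and output resolution are illustrative assumptions.
class NeuralRenderer(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(10, 512), nn.ReLU(),
            nn.Linear(512, 4096), nn.ReLU(),       # expand stroke params to a 16x16x16 tensor
        )
        self.deconv = nn.Sequential(
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.PixelShuffle(2),                    # 16x16 -> 32x32, channels 32 -> 8
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.PixelShuffle(2),                    # 32x32 -> 64x64, channels 16 -> 4
            nn.Conv2d(4, 4, 3, padding=1), nn.ReLU(),
            nn.PixelShuffle(2),                    # 64x64 -> 128x128, channels 4 -> 1
            nn.Conv2d(1, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, stroke_params):
        x = self.fc(stroke_params).view(-1, 16, 16, 16)  # reshape to an image-like tensor
        return self.deconv(x)                             # (batch, 1, 128, 128) stroke image

# A batch of 8 random stroke-parameter vectors -> 8 rendered stroke images.
strokes = torch.rand(8, 10)
images = NeuralRenderer()(strokes)   # shape: (8, 1, 128, 128)
```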

How does a testing session work?

  The output of the trained model is saved to the specified output file; here I only inferred its final format because I have not run this part myself. When it came to a testing session in Colab, however, I was asked to download actor.pkl and renderer.pkl in advance to proceed with the Learning to Paint test. From this phase, I assumed the final results from training can be transferred in PKL format, at least online.

  A PKL file, called a pickle, is a serialized Python file holding binary data, which keeps it compact. This format often serves to pack the training output of a machine learning model for testing and further training, and so it does here. Lines 18-20 in the test.py script further below show how the PKL files are loaded into the testing script. The packed data comprises the trained policies for reducing the loss and acquiring the reward in a given state. Meanwhile, the log files produced during the training session, such as rewards, losses, and learning rates, are saved by TensorBoard according to train.py.
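Before that, here is a rough illustration of how such downloaded .pkl weights are usually restored with torch.load; it is not the actual test.py code, only the general pattern.

```python
import torch

# Rough illustration, not the verbatim test.py lines: torch.load deserializes the
# pickled weights, and map_location keeps this working on a CPU-only machine
# such as a basic Colab runtime.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

actor_weights = torch.load("actor.pkl", map_location=device)
renderer_weights = torch.load("renderer.pkl", map_location=device)

# The weights would then typically be restored into networks of the same
# architecture used in training, along the lines of:
#   actor.load_state_dict(actor_weights)
#   renderer.load_state_dict(renderer_weights)
#   actor.eval(); renderer.eval()   # inference mode: no further parameter updates
```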

 

test.py: 

  Last week, while I tested my photo with the Learning to Paint scripts, I attempted to tell model-based DRL apart from model-free DRL but failed: both runs showed exactly the same L2 loss at each step. This was because the PKL files I embedded were the same, meaning the training results came from a single framework, perhaps the model-based DRL.

Now it is time to distinguish the model-based DRL from the model-free DRL.

 

How are model-based DRL and model-free DRL distinguishable in the training scripts?

  I vaguely but gradually understood its meaning. I skimmed through all the source code files on GitHub and found that a few extra lines appear in critic.py of the model-free DRL baseline, and that ddpg.py of the model-based and model-free DRL versions differ in how they treat next-step prediction. (Please see the script images beneath.) From last week's article, the main difference between model-based and model-free DRL was whether the agent can actively predict the next states and rewards to learn comprehensive inference. Likewise, in the actual scripts, one can predict the next step while the other cannot. Nonetheless, once I saw how tiny the difference was, I doubted whether such a small discrepancy could really change what kind of model it is.

  For example, the model-based DRL's critic network in ddpg.py takes additional inputs that include the current canvas and the last canvas in the state representation, whereas the model-free DRL focuses only on the current state. (Please find the scripts below.) Including the last canvas as input helps the network predict future state values better. The author intended the model-based DRL critic network to let the model learn the dynamics of the environment more accurately through the inclusion of the last canvas.

  Thus, the small change of including the last canvas as input can have a significant impact on the performance and behavior of the algorithm. Furthermore, the model-based DRL critic network has access to more information and can capture more intricate dynamics of the environment. My diagram at the top also represents the model-based DRL: both canvas 0 and canvas 1 sit in the environment for training the agent. It also shows what the model-based DDPG is, with the neural renderer interacting with the actor in a modeled environment.
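The following conceptual sketch summarizes that difference; the names (decode, merged_state) and the tensor handling are simplified placeholders of mine, not the repository's exact ddpg.py code.

```python
import torch

# Conceptual sketch of the model-free vs. model-based critic input (simplified names
# and shapes, not the repository's code). "decode" stands in for the neural renderer
# that paints the chosen stroke onto the current canvas.

def evaluate_model_free(critic, canvas_now, action):
    # Model-free: the critic judges the action from the current canvas only.
    return critic(canvas_now, action)

def evaluate_model_based(critic, canvas_now, action, decode):
    # Model-based: the renderer predicts the canvas the action would produce,
    # so the critic sees both canvases stacked together as its state input.
    canvas_next = decode(action, canvas_now)
    merged_state = torch.cat([canvas_now, canvas_next], dim=1)  # stack along channels
    return critic(merged_state)
```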

 

Extracted from ddpg.py in model-based DRL:

 

Extracted from ddpg.py in model-free DRL:
