It all began with AlphaGo. Where are we now? MuZero.
Mastering the game of Go by imitating expert human play.
Mastering Go from scratch with pure self-play, but still relying on a simulator of the game's rules and dynamics.
Mastering StarCraft II.
Introducing MuZero. It still relies heavily on MCTS, but the key point is that the model does not learn to simulate the game's dynamics; instead it learns its own latent dynamics and focuses on predicting a value function, a policy, and a reward function.
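A minimal sketch of that three-function structure (representation, dynamics, prediction), assuming toy network sizes and placeholder dimensions; this is not the paper's architecture, just an illustration of how planning unrolls entirely in latent space without ever querying the environment:

```python
# Sketch of MuZero's three learned functions; dims and layer sizes are made up.
import torch
import torch.nn as nn

class Representation(nn.Module):      # h: observation -> initial latent state s_0
    def __init__(self, obs_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):            # g: (s_k, a_k) -> (r_k, s_{k+1}) in latent space
    def __init__(self, latent_dim, num_actions):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(latent_dim + num_actions, 128), nn.ReLU())
        self.next_state = nn.Linear(128, latent_dim)
        self.reward = nn.Linear(128, 1)
    def forward(self, state, action_onehot):
        x = self.trunk(torch.cat([state, action_onehot], dim=-1))
        return self.reward(x), self.next_state(x)

class Prediction(nn.Module):          # f: s_k -> (policy logits, value)
    def __init__(self, latent_dim, num_actions):
        super().__init__()
        self.policy = nn.Linear(latent_dim, num_actions)
        self.value = nn.Linear(latent_dim, 1)
    def forward(self, state):
        return self.policy(state), self.value(state)

# Unroll: encode a real observation once, then step only the learned dynamics.
obs_dim, latent_dim, num_actions = 8, 32, 4
h = Representation(obs_dim, latent_dim)
g = Dynamics(latent_dim, num_actions)
f = Prediction(latent_dim, num_actions)

s = h(torch.randn(1, obs_dim))
for a in [1, 3, 0]:                   # an arbitrary action sequence
    logits, value = f(s)              # policy/value predicted from the latent state
    a_onehot = torch.nn.functional.one_hot(torch.tensor([a]), num_actions).float()
    r, s = g(s, a_onehot)             # latent transition; no game simulator involved
```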
Combination of various improvements.
Improvements to MuZero (EfficientZero) that make it substantially more sample-efficient. https://github.com/YeWR/EfficientZero
Improvements to MuZero.
A limitation of AlphaZero and MuZero is that the action space is assumed to be discrete. The following works extend the MCTS idea to continuous action spaces; a small sketch of the general approach follows the list.
AlphaZero-style algorithm with continuous action spaces.
MuZero in arbitrary action spaces.
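One common way to make MCTS-style search work over continuous actions is to sample a small set of candidate actions from the current policy at each node and search over that set as if it were discrete. The sketch below is only an illustration of that idea under assumed names (`expand`, `select_child`, a Gaussian policy, sample count `K`); it is not the interface of the papers above.

```python
# Illustrative action-sampling trick for MCTS over continuous actions.
import math
import numpy as np

class Node:
    def __init__(self, prior):
        self.prior = prior            # weight assigned to this sampled action
        self.visit_count = 0
        self.value_sum = 0.0
        self.children = {}            # sampled continuous action (tuple) -> Node

    def value(self):
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def expand(node, policy_mean, policy_std, K=8, rng=np.random.default_rng()):
    """Sample K continuous actions from a Gaussian policy; add them as children."""
    actions = rng.normal(policy_mean, policy_std, size=(K, len(policy_mean)))
    for a in actions:
        node.children[tuple(a)] = Node(prior=1.0 / K)   # uniform weight over samples

def select_child(node, c_puct=1.25):
    """Standard PUCT selection, but restricted to the sampled action set."""
    best, best_score = None, -float("inf")
    for action, child in node.children.items():
        score = child.value() + c_puct * child.prior * \
            math.sqrt(node.visit_count + 1) / (1 + child.visit_count)
        if score > best_score:
            best, best_score = (action, child), score
    return best

# Usage: expand the root with policy samples, then search as in the discrete case.
root = Node(prior=1.0)
expand(root, policy_mean=np.zeros(2), policy_std=np.ones(2))
action, child = select_child(root)
```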