AlphaZero Homebrew

Homebrew your own AlphaZero

Let’s have a look at the AlphaZero algorithm, implement it for the game of TicTacToe, and inspect its performance. All the code can be found in this repository.

ttt
“The game Tic Tac Toe being played by steampunk computers, digital art”

Overview

Issue: Some problems (such as the game of Go) have very wide decision trees. For a 19x19 board, there are 361 possibilities for the first move. To map all of the first two moves, there are 361*360 = 129,960 possibilities. First three? Over 46 million. Because of this explosive growth, previous approaches (such as alpha-beta pruning) become infeasible.
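Just to make the growth concrete, counting the sequences takes only a few lines (ignoring symmetries, captures, and repeated positions):

```python
# Number of distinct opening sequences on a 19x19 Go board.
points = 19 * 19                        # 361 possible first moves
two_moves = points * (points - 1)       # 361 * 360 = 129,960
three_moves = two_moves * (points - 2)  # 46,655,640 -- over 46 million

print(points, two_moves, three_moves)
```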

Solution: The AlphaZero (AZ) algorithm doesn’t perform a full tree search, but rather uses a probabilistic search algorithm (Monte-Carlo tree search: MCTS) to explore only those sections of the tree that are deemed most promising. MCTS is combined with a neural network in the following way:

The current board-state is fed through a network that is trained to indicate promising moves to make. This provides a prior distribution over moves to kickstart the MCTS.

Guided by these priors (and by the statistics gathered so far), the search walks down the tree until it reaches a leaf position: either the game has ended there, which gives us an actual outcome, or the network's value output gives us an estimate for the unexplored position.

Whichever it is, we now have a value estimate of how ‘good’ the path we explored is. This is pushed back up the tree and associated with the moves that led us here.

Monte-Carlo Tree Search ([source](https://en.wikipedia.org/wiki/Monte_Carlo_tree_search))
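To make the interplay concrete, here is a minimal sketch of a single MCTS simulation with a network in the loop. The `Node` fields, the `game` interface (`is_terminal`, `result`, `apply`) and `net.predict` (returning move priors and a value) are placeholder assumptions of mine, not necessarily the interfaces used in the repository.

```python
import math

class Node:
    """One board state in the search tree."""
    def __init__(self, state, prior):
        self.state = state        # game state at this node
        self.prior = prior        # prior probability from the policy head
        self.children = {}        # move -> Node
        self.visits = 0           # N
        self.value_sum = 0.0      # W, so Q = W / N

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def simulate(root, game, net, c_puct=1.5):
    """One MCTS simulation: select, expand/evaluate, backpropagate."""
    node, path = root, [root]

    # 1. Select: walk down the tree, balancing exploitation (Q) against
    #    the network's prior and the visit counts (PUCT rule).
    while node.children:
        sqrt_n = math.sqrt(node.visits + 1)
        _, node = max(
            node.children.items(),
            key=lambda kv: kv[1].q() + c_puct * kv[1].prior * sqrt_n / (1 + kv[1].visits),
        )
        path.append(node)

    # 2. Expand & evaluate: a finished game gives an actual outcome,
    #    otherwise the network supplies priors for the children and a value estimate.
    if game.is_terminal(node.state):
        value = game.result(node.state)
    else:
        priors, value = net.predict(node.state)   # dict move -> prob, scalar value
        for move, p in priors.items():
            node.children[move] = Node(game.apply(node.state, move), p)

    # 3. Backpropagate: push the value back up the path,
    #    flipping its sign each ply for the alternating players.
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value
        value = -value
```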

Code & Classes

Let’s play

Loss

Here I’ve run the model for 12 iterations with 1000 games per iteration and an MCTS depth of 15. I’ve colour-coded the loss to show the progression from one version of the model to the next. The ‘score’ (wins minus losses) of AZ against random play more than doubles in only a couple of iterations. It’s noteworthy that the model starts off much better than random play (at a score of ~150 instead of ~0). This is because TicTacToe is quite a shallow game, so even a very shallow MCTS already boosts our performance by quite a bit.
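For reference, the outer loop that produced this plot can be outlined as below. The hyper-parameter defaults mirror the run described above; `self_play`, `train_step` and `evaluate` are caller-supplied callables with placeholder names, not the repository's actual API.

```python
def train_alphazero(model, self_play, train_step, evaluate,
                    n_iterations=12, games_per_iter=1000, mcts_depth=15):
    """Self-play -> train -> evaluate loop.

    evaluate(model) is assumed to return the score (wins minus losses)
    against a random-play opponent.
    """
    scores = []
    for _ in range(n_iterations):
        # Generate training data with the current model version.
        games = [self_play(model, mcts_depth) for _ in range(games_per_iter)]
        # Fit the policy and value heads on the freshly generated games.
        model = train_step(model, games)
        # Track progress against random play.
        scores.append(evaluate(model))
    return model, scores
```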

Before inspecting AZ further, it might be useful to briefly visualize the MCTS results.

MCTS in action

MCTS

Left, a board is plotted. MCTS is the ‘x’-player. In this position there are 5 possible moves, with one obvious best option. Here MCTS ran for 20 iterations, and the bar-plot shows the number of times each move was selected. By running these visit counts through a soft-max operation, $\pi_a = \frac{e^{N_a}}{\sum_{a'} e^{N_{a'}}}$, we get $\pi$: a probability distribution over moves, which I’ve overlaid on the board on the left.
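That conversion from visit counts to $\pi$ is just a soft-max; in numpy it is a couple of lines (the counts below are made-up numbers for illustration):

```python
import numpy as np

def visits_to_pi(visit_counts):
    """Soft-max over MCTS visit counts: pi_a = exp(N_a) / sum_a' exp(N_a')."""
    counts = np.asarray(visit_counts, dtype=float)
    exps = np.exp(counts - counts.max())   # subtracting the max keeps it numerically stable
    return exps / exps.sum()

# Example: 5 legal moves, one of them clearly preferred by the search.
print(visits_to_pi([1, 2, 12, 3, 2]))
```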

Value-net in action

To get some insight into whether the training worked, let’s run the board evaluation on some examples:

badboard goodboard

Again from the ‘x’-perspective: the left board is terrible, a guaranteed loss, while the right board is great, as we can guarantee ourselves a win with multiple moves. Above the boards I’ve included the network’s value estimates when fed these game-states, and we can see that (at least for this tiny sample) the network produces useful results that can aid the tree search.
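A quick sketch of how such a check might look, with two illustrative boards (not necessarily the exact positions shown above). Boards are encoded as 3x3 arrays with 1 = x, -1 = o, 0 = empty, and `value_net` is an assumed callable returning a scalar in [-1, 1]; both are placeholder conventions, not necessarily the repository's.

```python
import numpy as np

def show_value_estimates(value_net):
    """Print the value head's estimate for a clearly lost and a clearly
    winnable board, both from the 'x'-perspective."""
    lost = np.array([[-1, -1,  0],
                     [ 0,  1,  0],
                     [-1,  0,  1]])   # 'o' threatens two lines; 'x' can only block one
    won = np.array([[ 1,  0,  1],
                    [ 0,  1, -1],
                    [ 0, -1, -1]])    # 'x' to move wins at (0,1) or (2,0)
    for name, board in [("lost", lost), ("won", won)]:
        print(name, float(value_net(board)))
```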

Policy-net in action

Finally, let’s see how the policy priors take shape across training. Here, for the initial 6 iterations, I’ll feed a board to the policy-net and overlay the output $p_a$ on the board.

p_1

This is an easy one, with moves on (0,0) and (2,0) both guaranteeing a victory. Ideally the network wouldn’t put all of its weight on just one of them, but since the game finishes as soon as the winning move is played, the model is never punished for favouring (0,0).

p_2

This is quite a nice result. It’s still early in the game, but we can guarantee a win here by playing (0,1) into (1,1), which the network seems to suggest. However, this is again not the only move with such a high value, and it would be nice if the network produced a distribution with greater entropy.
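For reference, producing such an overlay is mostly a matter of reshaping the policy output back onto the grid. A minimal sketch, assuming a `policy_net` callable that returns 9 move probabilities (a placeholder of mine, not necessarily the repository's interface):

```python
import numpy as np

def policy_overlay(policy_net, board):
    """Return the policy head's output p_a as a 3x3 grid, with occupied
    squares masked out and the remaining probabilities re-normalised."""
    board = np.asarray(board)
    p = np.asarray(policy_net(board), dtype=float).reshape(3, 3)
    p[board != 0] = 0.0           # already-occupied squares get zero probability
    return p / p.sum()
```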

Some final notes