DeltaZero is a Python implementation of the AlphaZero architecture for chess. Design ideas were
drawn from a couple of existing open-source Python implementations, namely
alpha-zero-general and
chess-alpha-zero. My goal is to learn from these implementations
and make readability and design improvements along the way.
The environment module contains a class declaration for the ChessEnvironment, which is responsible
for managing information about the state of the chess board and translating that information into
a format interpretable by a convolutional neural network.
DeltaZero uses python-chess for its fundamental board
representation. python-chess is an extremely well-written chess library implemented entirely in Python.
It is well documented, and the source code is very readable and understandable.
The state of a chess game is encoded into a (19, 8, 8) numpy array, with the first dimension representing
what are referred to as "planes" (or channels), and the second two dimensions representing the shape of the
chess board (8 by 8 squares). The first 12 planes are bit matrices representing the positions of the white
and black pieces. For example, the plane at index 0 represents the white pawn positions, and at the start
of the game would look like this:
```
[
  0, 0, 0, 0, 0, 0, 0, 0
  0, 0, 0, 0, 0, 0, 0, 0
  0, 0, 0, 0, 0, 0, 0, 0
  0, 0, 0, 0, 0, 0, 0, 0
  0, 0, 0, 0, 0, 0, 0, 0
  0, 0, 0, 0, 0, 0, 0, 0
  1, 1, 1, 1, 1, 1, 1, 1
  0, 0, 0, 0, 0, 0, 0, 0
]
```
The last 7 planes contain game metadata: castling rights for both sides, the en passant square, the fifty-move counter, and the side to move.
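As an illustration of the piece-plane encoding, here is a minimal sketch (not DeltaZero's actual code) that builds one such bit plane from square indices in python-chess's 0-63 convention:

```python
# Hypothetical sketch: build one (8, 8) bit plane from a set of square
# indices (0 = a1 ... 63 = h8, python-chess convention). The real
# ChessEnvironment encoding may differ in orientation and details.

def squares_to_plane(squares):
    """Convert 0-63 square indices into an 8x8 bit matrix, with rank 8
    as the first row (matching the diagram above)."""
    plane = [[0] * 8 for _ in range(8)]
    for sq in squares:
        rank, file = divmod(sq, 8)   # rank 0 is rank 1 (white's back rank)
        plane[7 - rank][file] = 1    # flip so rank 8 is row 0
    return plane

# White pawns on their starting squares a2-h2 (indices 8-15):
start_pawns = squares_to_plane(range(8, 16))
```

With the starting pawns, row index 6 (rank 2) comes out as all ones, matching the example plane shown above.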
The ChessEnvironment can also adjudicate games, bringing an immediate end to a game in progress. This is helpful
in the early training stages, when the Agent makes essentially random moves and games can drag on indefinitely. Adjudication
is performed using Stockfish 10, one of the world's strongest open-source chess engines. At the time of adjudication,
a position evaluation is performed, and the game result is determined based on this evaluation.
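The adjudication decision itself can be sketched as a simple threshold on the engine's evaluation. The threshold value below is an assumption for illustration, not DeltaZero's actual setting:

```python
# Illustrative sketch of the adjudication idea, not DeltaZero's code.
# Assumes a centipawn score from Stockfish, from White's perspective;
# the threshold is a made-up value for demonstration.

def adjudicate(centipawns, threshold=100):
    """Map an engine evaluation to a game result:
    1 = white win, -1 = black win, 0 = draw."""
    if centipawns >= threshold:
        return 1
    if centipawns <= -threshold:
        return -1
    return 0
```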
Games played by DeltaZero can be exported to PGN format for review.
The agent module contains a class declaration for the ChessAgent, which is the entity that traverses the ChessEnvironment by
taking actions (making chess moves) and receiving rewards. The ChessAgent executes episodes of self-play during which it plays
chess against itself, generating training examples in the form of board positions + corresponding game outcome.
DeltaZero implements a multiprocessing module called selfplay.py dedicated to parallelizing the self-play process. During self-play,
a neural network-based Monte Carlo Tree Search is performed to generate actions for the Agent to take. Although the algorithm itself
is computationally efficient, a very large number of training examples (and therefore a large number of self-play games) are needed.
Realistically speaking, without access to an extremely high-powered distributed environment (see Lc0), it simply isn't feasible to generate enough self-play games to bring the network to superhuman playing strength. Training methods are discussed in the coming sections.
The ChessAgent also has optional access to an opening book for evaluation sessions against either a previous version, or human/engine opponents.
The network module contains two class declarations: 1) An abstract base class NeuralNetwork, containing abstract methods for building the network architecture, training, running predictions, and saving/loading weights. 2) A Keras-based implementation of NeuralNetwork called ChessNetwork.
The network architecture closely mimics the design of AlphaZero. DeltaZero uses a convolutional res-net, or residual network. The network consists of a body followed by two heads - a policy head and a value head. The policy represents move probabilities given a particular board state, and the value represents the estimated outcome of the game in [-1.0, 1.0], with -1.0 being a loss, 0.0 being a draw, and 1.0 being a win.
The network body consists of a linear rectified, batch-normalized convolutional layer with 64 filters, kernel size 3x3 and stride 1, followed by (currently) 19 residual layers (note that the number of residual layers is currently evolving through different model iterations). Each residual layer consists of two skip-connected, linear rectified, batch-normalized convolutional layers of kernel size 3x3 and stride 1. Both convolutional layers in each residual layer also have 64 filters.
The policy head differs slightly from the one implemented in AlphaZero, because the policy is represented differently in DeltaZero. AlphaZero represents the chess policy as a 73x8x8 matrix, while DeltaZero represents its policy as a flat vector of length 1968, each element representing a possible move on the chess board. The policy head takes the output of the residual layers and passes it through a single linear rectified, batch-normalized convolutional layer with 32 filters, kernel size 1x1 and stride 1. The convolutional output is then flattened and passed through a linear rectified fully-connected layer of size 2048. This dense layer leads to a final softmax layer of size 1968, the output of which represents the policy estimate.
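When working with the flat 1968-element policy vector, illegal moves are typically masked out before normalizing. A minimal sketch of that idea follows; the function name and details are assumptions, and a short vector is used for brevity:

```python
import math

# Hedged sketch: softmax over only the legal-move entries of a flat
# policy vector, assigning probability 0 to illegal moves. This is an
# illustration of the idea, not DeltaZero's implementation.

def masked_softmax(logits, legal_indices):
    """Renormalize policy logits over legal moves only."""
    legal = set(legal_indices)
    m = max(logits[i] for i in legal)        # subtract max for stability
    exps = [math.exp(logits[i] - m) if i in legal else 0.0
            for i in range(len(logits))]
    total = sum(exps)
    return [e / total for e in exps]
```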
The value head is also fed by the output of the residual layers, and consists of a single linear rectified, batch-normalized convolutional layer with a single filter, kernel size 1x1 and stride 1. The convolutional output is flattened and passed through a linear rectified fully-connected layer of size 256 with 40% dropout, followed by a tanh layer, the output of which represents the value estimate.
In the interest of saving years of computation time (or thousands of dollars on distributed computing), the network is "warm-started" with supervised training data from a database containing roughly 1 million chess games in PGN format. At the time of writing, the current iteration of DeltaZero has been trained on approximately 56,000 games.
For each game, training examples are extracted in the following manner:

- The game is loaded into a `ChessEnvironment` object.
- At each move, the board state is encoded into a `(19, 8, 8)` matrix.
- The policy for the move is created in the following manner:
  - The ELO ratings of each player are recorded. The move that was actually played in the game is assigned a high probability, proportional to the ELO rating of the player that made the move. The probability is calculated as `P = 1 - exp(a*e)`, where `e` is the ELO rating and `a = -0.00122` (found by trial and error). The remaining probability `(1 - P)` is divided amongst the rest of the legal moves in the position. Illegal moves are masked with a probability of 0.
  - For example, a player with an ELO rating of 2000 (the minimum ELO of any player in the training games) would have a probability of `0.913` assigned to the move he/she played in a game, while a player with an ELO rating of 2750 would have a probability of `0.965` assigned to a move.
- The value is the result of the game: 1 for a white win, 0 for a draw, and -1 for a black win.
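The policy construction above can be sketched as follows. The constant `a` matches the text; the function signature and the even split of the remainder over the other legal moves are assumptions:

```python
import math

# Sketch of the supervised-policy construction described above.
# A = -0.00122 is the constant from the text; everything else
# (names, even split of the remainder) is an assumption.

A = -0.00122

def supervised_policy(played_idx, legal_indices, elo, size=1968):
    """Assign P = 1 - exp(A * elo) to the played move, split the
    remaining (1 - P) over the other legal moves, mask the rest."""
    p_played = 1.0 - math.exp(A * elo)
    others = [i for i in legal_indices if i != played_idx]
    policy = [0.0] * size
    policy[played_idx] = p_played
    for i in others:
        policy[i] = (1.0 - p_played) / len(others)
    return policy
```

Plugging in the ELO values from the example above reproduces the quoted probabilities: roughly 0.913 at 2000 ELO and 0.965 at 2750.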
One training example per move is generated. DeltaZero trains with a batch size of 64, and uses SGD with Nesterov momentum as its optimizer, with an initial learning rate of 0.001 and learning-rate decay of 1e-6.
The oomcts (object-oriented MCTS; couldn't find a better name) module contains two class definitions, Node and MCTS, which implement the Monte Carlo Tree Search algorithm, slightly modified to use the predictions of the current iteration of the neural network to generate move probabilities and value estimates. (The original mcts.py wasn't understandable enough in its previous state, so a rework was done; mcts.py remains in the repo for reference.) The MCTS hyperparameters are almost identical to those specified in the AlphaZero paper.
When playing a game of chess, the ChessAgent will use the MCTS algorithm to find the "best" move to make. Actions are explored according to the policy network output. Given a board state, moves are explored out to a specified depth d. Nodes in the tree at depth d are leaf nodes and are evaluated using the value network output. Q-values at each node in the tree are updated via a backpropagation step, and are visit count-adjusted according to an exploration parameter c_puct. During self-play, Dirichlet noise is added to the prior probabilities in the root nodes.
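The visit count-adjusted exploration mentioned above follows the PUCT rule from AlphaZero-style MCTS. A minimal sketch, with names following Silver et al. rather than DeltaZero's internals:

```python
import math

# Minimal sketch of the PUCT selection score used by AlphaZero-style
# MCTS: Q plus a prior-weighted, visit-count-adjusted exploration bonus.
# Names follow Silver et al.; this is not DeltaZero's exact code.

def puct_score(q, prior, visits, parent_visits, c_puct=1.0):
    """Score for choosing which child node to explore next."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + visits)
```

An unvisited move with a high prior gets a large bonus, while the bonus shrinks as a move's visit count grows, pushing the search toward under-explored but promising actions.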
A single iteration to depth d is referred to as a "playout." As many playouts are executed (the number of playouts being referred to as the "simulation number"), the probability of visiting nodes in the search tree evolves. Actions are chosen in proportion to the node visit counts, as specified in Silver et al.
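The visit-count-proportional move selection described above can be sketched as follows; the function and its temperature handling are illustrative, not DeltaZero's exact code:

```python
# Sketch of turning per-move visit counts into selection probabilities
# after the playouts finish, pi(a) ~ N(a)^(1/T); as T -> 0 this
# approaches picking the most-visited move. Illustrative only.

def visit_probabilities(visit_counts, temperature=1.0):
    """Convert visit counts to a probability distribution over moves."""
    powered = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(powered)
    return [p / total for p in powered]
```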
During self-play, a small simulation number is used, typically between 25 and 50. This number can be increased as more computational power becomes available; AlphaZero used a simulation number of 800 in self-play games. When evaluating the network, a larger simulation number (and/or a larger depth d) can be used - think of this as allowing the Agent to think more deeply.