Love the ideas for reducing randomness, but need more details to figure out how to implement them.
mike wrote:Ensure the decks are always stacked the same
Do you mean the hand and library are always in the same order?
mike wrote:Ensure starting player is always the same (and always 50% of all the matches for each AI)
I've been thinking about playing each match twice with the same initial configuration of the library and starting player, but with the AIs swapped. This should remove the bias due to the initial configuration.
If AI1 beats AI2 regardless of the initial conditions then it is definitely better; if each wins one match out of two, then they are even and the outcome was decided by the initial configuration of the cards.
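To make that concrete, here is a minimal sketch of the pairing logic, assuming a hypothetical run_match(first_ai, second_ai, seed) helper that asks Magarena to play one match with the libraries and starting player fixed by the seed:
[code]
def mirrored_pair(run_match, ai1, ai2, seed):
    """Play the same initial configuration twice, swapping which AI goes first.
    run_match(first_ai, second_ai, seed) is assumed to set up the libraries and
    starting player deterministically from seed and return the winner's name."""
    result_a = run_match(ai1, ai2, seed)   # ai1 plays first
    result_b = run_match(ai2, ai1, seed)   # identical setup, ai2 plays first
    if result_a == result_b:
        return result_a                    # same AI won both games: a real win
    return None                            # split pair: decided by the setup

def compare(run_match, ai1, ai2, pairs=25):
    """Tally real wins and split pairs over a number of seeded pairs."""
    tally = {ai1: 0, ai2: 0, "split": 0}
    for seed in range(pairs):
        winner = mirrored_pair(run_match, ai1, ai2, seed)
        tally[winner if winner is not None else "split"] += 1
    return tally
[/code]
Split pairs can simply be discarded, so only the pairs where one AI wins both games count towards the comparison.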
mike wrote:Run a predefined gauntlet of "fair" decks equally distributed
How do you find the set of "fair" decks? Currently I'm mostly using randomly generated decks based on two colors, which assumes the generated decks are at roughly the same power level. But even with "fair" decks, the game could be decided purely by the library order (mana flood/mana screw). Playing twice with the same initial conditions but with the AIs swapped seems to solve this.
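For the deck side of this, a seed could also fix both the card choices and the library order, so a pair of games can replay exactly the same deck. A rough sketch with a made-up card pool (the real generator would draw from Magarena's card database):
[code]
import random

# Hypothetical card pool; Magarena's generator would use its card database
# and a proper mana curve.
CARD_POOL = {
    "W": ["Savannah Lions", "Serra Angel", "Pacifism"],
    "U": ["Counterspell", "Air Elemental", "Unsummon"],
    "B": ["Dark Ritual", "Hypnotic Specter", "Terror"],
    "R": ["Lightning Bolt", "Shock", "Shivan Dragon"],
    "G": ["Llanowar Elves", "Giant Growth", "Craw Wurm"],
}

def random_two_color_deck(seed, spells=36, lands=24):
    """The same seed always yields the same two colors, the same card
    choices, and the same library order."""
    rng = random.Random(seed)
    colors = rng.sample(sorted(CARD_POOL), 2)
    pool = CARD_POOL[colors[0]] + CARD_POOL[colors[1]]
    deck = [rng.choice(pool) for _ in range(spells)]
    deck += [colors[0] + " land"] * (lands // 2)
    deck += [colors[1] + " land"] * (lands - lands // 2)
    rng.shuffle(deck)   # the library order is part of the seed's output
    return colors, deck
[/code]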
mike wrote:This gave me close to deterministic results (1 match result difference at most given the same AI for 50 games)
Does "1 match result difference at most given the same AI for 50 games" mean in one run you got say 30/50 and in another run you got 31/50 for the same AI?
It would be very helpful if you could publish the scripts/code for running the tests. I would like to apply them to my testing as well.
mike wrote:What would be the best way to give it a new AI/magarena version to test with? .jar upload through the gui? github repo with commit hooks? rewrite the AI in groovy so it can be replaced at runtime?

I'm thinking commit hooks. The speculative changes that require testing will be committed to a branch, say with a special name that starts with "ai-".
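For detection, a small poller on the test machine may be enough; a sketch using plain git commands and the "ai-" branch naming convention above:
[code]
import subprocess

def testing_branches(repo="."):
    """Return (branch, commit) pairs for all local branches starting with 'ai-'."""
    out = subprocess.run(
        ["git", "for-each-ref",
         "--format=%(refname:short) %(objectname:short)",
         "refs/heads/ai-*"],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return [tuple(line.split()) for line in out.splitlines()]

# A cron job or post-receive hook could compare these commits against the
# set already tested and queue a tournament for anything new.
[/code]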
We can define the list of AIs and their associated source files. Each AI then has variants, with each variant denoted by the hash of its source files. We can establish a baseline by playing the AIs against each other at various time-control settings. This will give us a Whole-History Rating (WHR) for the base variant of each AI.
When a new commit on a testing branch is detected, we hash the source files to determine the new AI variant, then play this new variant against the other AIs to establish its score, which can be compared against the score of the base variant of the same AI.
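A sketch of the variant hashing, with an assumed (made-up) mapping from AI name to source files:
[code]
import hashlib

# Assumed mapping; the real list would point at the AI classes in
# Magarena's source tree.
AI_SOURCES = {
    "MCTS": ["src/magic/ai/MCTSAI.java"],
    "MMAB": ["src/magic/ai/MMAB.java"],
}

def variant_id(ai_name):
    """Hash the AI's source files; any change to them yields a new variant id."""
    digest = hashlib.sha1()
    for path in sorted(AI_SOURCES[ai_name]):
        with open(path, "rb") as f:
            digest.update(f.read())
    return ai_name + "-" + digest.hexdigest()[:7]
[/code]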
Example: we start from a particular version of Magarena and establish MMAB-ce21fea and MCTS-defab87 as the base variants. We then play matches with various decks and time settings for MMAB-ce21fea and MCTS-defab87, at two representative levels, 1 (fast) and 6 (slow). From the results of the matches we can determine the WHR scores, something like:
MCTS-defab87-6 0.943
MCTS-defab87-1 0.723
MMAB-ce21fea-6 0.542
MMAB-ce21fea-1 0.341
When a commit modifies MCTS-defab87 into MCTS-4ead2d1, then by playing MCTS-4ead2d1 against the unchanged AIs we can include it in the score list to get a new list:
MCTS-4ead2d1-6 0.957
MCTS-defab87-6 0.943
MCTS-4ead2d1-1 0.801
MCTS-defab87-1 0.723
MMAB-ce21fea-6 0.542
MMAB-ce21fea-1 0.341
In this example, commit 4ead2d1 produced an overall improvement in the playing strength of MCTS at both levels.
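For reference, here is a rough idea of how relative strengths could be fitted from the raw match results. It uses a static Bradley-Terry fit as a stand-in for WHR (WHR additionally models strength drifting over time), with players labelled per variant and level, e.g. "MCTS-defab87-6":
[code]
from collections import defaultdict

def bradley_terry(results, iterations=200):
    """Fit static Bradley-Terry strengths from (winner, loser) match results.
    A simplified stand-in for Whole-History Rating."""
    wins = defaultdict(float)        # wins per player
    pair_games = defaultdict(int)    # games per unordered pair of players
    players = set()
    for winner, loser in results:
        wins[winner] += 1
        pair_games[frozenset((winner, loser))] += 1
        players.update((winner, loser))

    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        new = {}
        for p in players:
            denom = sum(
                pair_games.get(frozenset((p, q)), 0) / (strength[p] + strength[q])
                for q in players if q != p
            )
            # half a win of smoothing keeps winless variants at a small
            # positive strength (a crude stand-in for WHR's prior)
            new[p] = (wins[p] + 0.5) / denom if denom else strength[p]
        total = sum(new.values())
        strength = {p: s / total for p, s in new.items()}  # normalize
    return strength

# Usage: bradley_terry([("MCTS-defab87-6", "MMAB-ce21fea-6"), ...])
[/code]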