Using Project Firemind to test speculative AI improvements

Postby melvin » 25 Feb 2015, 01:04

I found out about http://tests.stockfishchess.org/ recently and realized it solves a long-standing problem I have run into while developing the AIs for Magarena.

In summary, it is a distributed testing framework that evaluates every AI change made to Stockfish. Since its introduction, Stockfish's playing strength has increased significantly, because the developers can quickly tell which changes actually improve the AI.

Right now I'm simulating about 500 games on my PC with a 1s time limit, but it is extremely slow: testing even a one-line change takes many hours.

I think Project Firemind could be a great help in running these AI comparisons. Instead of running the same AI against different decks, it would run different AIs on randomly generated decks (or perhaps the list of top Firemind decks).

Re: Using Project Firemind to test speculative AI improvemen

Postby muppet » 25 Feb 2015, 09:22

Anything the Stockfish team is doing is likely to be good; it was the best free engine a few years ago, at least back when I could still play against chess computers with some hope. Let me know if there is anything I can do to help. I'm OK at building decks and playing Magic, but I can't code.

I still think position evaluation is much better than using life totals if it is at all possible to do.

Re: Using Project Firemind to test speculative AI improvemen

Postby mike » 25 Feb 2015, 09:44

I like the idea. In fact, I've started working on something similar for my own AI experiments already.

The big difference between MTG and chess is obviously the fact that MTG has both stochasticity and hidden information. This makes games more volatile, so a lot more games need to be run to get reasonably accurate results.

This is why I figured I'd remove as many elements of randomness as possible to keep the number of required games minimal.

What I did was this:
- Ensure the decks are always stacked the same
- Ensure the starting player is always the same (and that each AI starts in 50% of all matches)
- Run a predefined gauntlet of "fair" decks equally distributed
- It's also interesting to keep track of a separate set of results for the mirror (how do the AIs play the exact same deck, stacked exactly the same)

This gave me close to deterministic results (1 match result difference at most given the same AI for 50 games)
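
Roughly, the setup was along these lines (just a sketch: runGame() is a made-up placeholder for launching a headless Magarena game, not the real API; the point is only the structure of a fixed gauntlet, a fixed seed list, and mirror results tracked separately):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;

    public class DeterministicGauntlet {

        // Placeholder: pretend this plays one game and returns true if the
        // first AI wins. A real harness would invoke Magarena headlessly here.
        static boolean runGame(String ai1, String ai2, String deck1, String deck2, long seed) {
            return new Random(seed ^ deck1.hashCode() ^ deck2.hashCode()).nextBoolean();
        }

        public static void main(String[] args) {
            List<String> gauntlet = Arrays.asList("deckA.dec", "deckB.dec", "deckC.dec");
            long[] seeds = {101, 102, 103, 104, 105};   // same stacked decks on every run
            int wins = 0, games = 0, mirrorWins = 0, mirrorGames = 0;

            for (String d1 : gauntlet) {
                for (String d2 : gauntlet) {
                    for (long seed : seeds) {
                        boolean win = runGame("NEW_AI", "OLD_AI", d1, d2, seed);
                        games++;
                        if (win) wins++;
                        if (d1.equals(d2)) {            // the mirror: same deck, same stacking
                            mirrorGames++;
                            if (win) mirrorWins++;
                        }
                    }
                }
            }
            System.out.printf("overall %d/%d, mirror %d/%d%n", wins, games, mirrorWins, mirrorGames);
        }
    }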

And as to running them on Project Firemind:
Let me see if I can throw together a GUI and a worker structure for this. What would be the best way to give it a new AI/Magarena version to test with? A .jar upload through the GUI? A GitHub repo with commit hooks? Rewriting the AI in Groovy so it can be replaced at runtime? ;)

Re: Using Project Firemind to test speculative AI improvemen

Postby muppet » 25 Feb 2015, 09:56

I don't think it is unreasonable to tell the AI what the opponent's deck is before the match. Given the enormous number of cards in Magic, doing anything like a full-width search of the possibilities is impossible otherwise. This might enable something similar to chess, where every possible move is calculated a few moves ahead and then a longer look-ahead follows only the captures and checks that result; in Magic that could perhaps translate into deep look-aheads restricted to the combat and damage phases.
I was also thinking of playing a game and simply recording everything I was thinking, along with what the AI was doing, watching for things like suicide attacks. I might have a go at this if I get some time.
Oh, and deck-specific evaluations at some point: e.g. a reanimator deck values things like big creatures in the graveyard very differently from a red beatdown deck.

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 25 Feb 2015, 12:51

muppet wrote:I still think position evaluation is much better than using life totals if it is at all possible to do.
Life totals are only one part of the formula; we generate a score for each action, and the score of a game is the sum of all the action scores. Playing a permanent has a score, tapping a permanent has a score, dealing damage has a score, getting poison counters has a score, etc. For details, refer to https://github.com/magarena/magarena/bl ... ystem.java
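
Conceptually it is just a running sum of per-action scores. A tiny illustration (the class names here are made up for this post, they are not the actual classes in the linked file):

    // Illustrative only -- these names do not exist in Magarena's code base.
    // The idea: every action contributes a score, and the score of a game
    // is the sum of the scores of all actions taken.
    interface ScoredAction {
        int score();
    }

    class GameScore {
        private int total = 0;

        // e.g. +N for playing a permanent, -N for gaining poison counters, etc.
        void apply(ScoredAction action) {
            total += action.score();
        }

        int total() {
            return total;
        }
    }

    class DealDamage implements ScoredAction {
        private final int amount;
        DealDamage(int amount) { this.amount = amount; }
        public int score() { return 3 * amount; }   // made-up weight
    }

    class Demo {
        public static void main(String[] args) {
            GameScore score = new GameScore();
            score.apply(new DealDamage(2));
            score.apply(new DealDamage(3));
            System.out.println(score.total());   // prints 15 with the made-up weight
        }
    }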

This is mostly used by MMAB; MCTS uses a different mechanism that doesn't depend on these formulas at all. It runs many simulations from a particular board state to the end of the game, making random moves for both sides.
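
A single MCTS playout is roughly the following sketch (GameState and Move are stand-ins for whatever the engine actually uses, not Magarena's real classes):

    import java.util.List;
    import java.util.Random;

    // Rough sketch of one MCTS playout: play random legal moves for both
    // sides until the game ends, then report the result back up the tree.
    class RandomPlayout {
        interface Move { }
        interface GameState {
            boolean isFinished();
            List<Move> legalMoves();
            GameState play(Move m);
            boolean playerOneWon();
        }

        private final Random rng = new Random();

        double simulate(GameState state) {
            while (!state.isFinished()) {
                List<Move> moves = state.legalMoves();
                state = state.play(moves.get(rng.nextInt(moves.size())));
            }
            return state.playerOneWon() ? 1.0 : 0.0;
        }
    }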

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 25 Feb 2015, 13:28

Love the ideas for reducing randomness, but need more details to figure out how to implement them.

mike wrote:Ensure the decks are always stacked the same
Do you mean the hand and library are always in the same order?

mike wrote:Ensure the starting player is always the same (and that each AI starts in 50% of all matches)
I've been thinking about playing a match twice with the same initial configuration of the library and starting player but swapping the AI. This should remove the bias due to the initial configuration.

If AI1 beats AI2 regardless of the initial conditions then it is definitely better; if each wins one match out of two, then they are even and the result was decided by the initial configuration of the cards.
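
As a sketch of the pairing (playGame() is a placeholder, not an existing Magarena entry point; it would set up the libraries from a seed, fix the starting player, and return true if the AI in the first slot wins):

    // Play each initial configuration twice, swapping which AI sits in the
    // starting slot. playGame() is a hypothetical hook into the engine.
    class PairedComparison {

        static boolean playGame(String firstAi, String secondAi, long seed) {
            throw new UnsupportedOperationException("hook up to the engine here");
        }

        // +1 if ai1 wins both games, -1 if ai2 wins both,
        // 0 if they split (the initial configuration decided it).
        static int comparePair(String ai1, String ai2, long seed) {
            boolean ai1WinsAsFirst = playGame(ai1, ai2, seed);
            boolean ai1WinsAsSecond = !playGame(ai2, ai1, seed);
            if (ai1WinsAsFirst && ai1WinsAsSecond) return +1;
            if (!ai1WinsAsFirst && !ai1WinsAsSecond) return -1;
            return 0;
        }
    }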

mike wrote:Run a predefined gauntlet of "fair" decks equally distributed
How do you find the set of "fair" decks? Currently I'm mostly using randomly generated two-color decks, which assumes the generated decks are all at roughly the same power level. But a game could still be decided purely by library order (mana flood/mana screw) even if the decks are "fair". Playing twice with the same initial conditions but swapping the AIs seems to solve this.

mike wrote:This gave me close to deterministic results (1 match result difference at most given the same AI for 50 games)
Does "1 match result difference at most given the same AI for 50 games" mean in one run you got say 30/50 and in another run you got 31/50 for the same AI?

It would be very helpful if you could publish the scripts/code for running the tests. I would like to apply them to my testing as well.

mike wrote:What would be the best way to give it a new AI/Magarena version to test with? A .jar upload through the GUI? A GitHub repo with commit hooks? Rewriting the AI in Groovy so it can be replaced at runtime? ;)
I'm thinking commit hooks. The speculative changes that require testing will be committed to a branch, say with a special name that starts with "ai-".

We can define the list of AIs and their associated source files. Each AI then has variants, where each variant is denoted by the hash of its source files. We can establish a baseline by playing the AIs against each other at various time-control settings. This gives us a Whole-History Rating (WHR) for the base variant of each AI.

When a new commit on a testing branch is detected, we can hash the source files to determine the new AI variant. Then play this new variant against the other AIs to establish its score, which can be compared against the base variant of this AI.

Example: we take a particular version of Magarena and establish MMAB-ce21fea and MCTS-defab87 as the base variants. Then we play various decks and time settings to get a WHR for MMAB-ce21fea and MCTS-defab87 at two representative levels, 1 (fast) and 6 (slow). From the match results we can determine the WHR scores, something like
MCTS-defab87-6 0.943
MCTS-defab87-1 0.723
MMAB-ce21fea-6 0.542
MMAB-ce21fea-1 0.341

When a commit modifies MCTS-defab87 into MCTS-4ead2d1, then by playing MCTS-4ead2d1 against the unchanged AIs we can add MCTS-4ead2d1 to the score list to get a new list:
MCTS-4ead2d1-6 0.957
MCTS-defab87-6 0.943
MCTS-4ead2d1-1 0.801
MCTS-defab87-1 0.723
MMAB-ce21fea-6 0.542
MMAB-ce21fea-1 0.341

In the example, the commit 4ead2d1 produced an overall improvement in playing strength of MCTS.
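
To make the variant naming concrete, deriving a variant id by hashing the AI's source files could look like this (a sketch; the file path and the choice of SHA-1 are assumptions for illustration, not an existing Magarena tool):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.List;

    public class VariantId {

        // Hash the concatenated source files and keep a short git-style prefix.
        static String variantId(String aiName, List<Path> sourceFiles)
                throws IOException, NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            for (Path file : sourceFiles) {
                digest.update(Files.readAllBytes(file));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return aiName + "-" + hex.substring(0, 7);
        }

        public static void main(String[] args) throws Exception {
            // Hypothetical path, shown only to illustrate the call.
            System.out.println(variantId("MCTS", List.of(Paths.get("src/magic/ai/MCTSAI.java"))));
        }
    }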

Re: Using Project Firemind to test speculative AI improvemen

Postby PalladiaMors » 25 Feb 2015, 15:54

Just trying to see if I can kind of follow the debate here - I'm also a chess player, and I once tried to understand how chess engines have been improved in recent decades. The engines assign a certain score to each position on the board, and decide between moves by choosing the one that leads to the position with the highest score. The problem with this is that assigning a precise score to each feature of a position isn't that simple - I don't think there's a way to demonstrate mathematically that a rook on an open file is worth 0.25 points or 0.30 points, or that a knight on an advanced support point in a closed position is worth 0.37 or 0.48 points. What you can do is configure the engine both ways and see which one leads to a higher win percentage. Doing this systematically over a long time to fine-tune the engines' positional "understanding" seems to have been a significant part of what led computers to reach their current world-champion-level ratings. Looks like you guys are doing something similar here?

One thing that I'm a bit concerned about is using the same deck over and over to do the testing. Wouldn't that lead to the improvements being biased towards that specific deck? For instance, if you use a top Firemind deck, a deck that the AI already plays well, won't you fine-tune its "scoring values" (dunno the right term to use) to play that specific deck even better? In that case, I'd suggest using decks that are known to be strong in live play, but that the AI doesn't perform well with yet. That way, you could work on tweaking its weaknesses?

Don't know if this makes any sense, just trying to understand the subject a bit better!

Re: Using Project Firemind to test speculative AI improvemen

Postby mike » 25 Feb 2015, 17:34

melvin wrote:Do you mean the hand and library are always in the same order?
Exactly, but not for every game. Since this is controlled by the seed param I made sure that if it ran 10 games against the same deck it would use the same 10 seeds for every test. Preferably I would find 10 seeds that lead to a 50% win rate.


melvin wrote:If AI1 beats AI2 regardless of the initial conditions then it is definitely better; if each wins one match out of two, then they are even and the result was decided by the initial configuration of the cards.
The problem here is that you would need to run 4 games: one for each AI with each deck, once on the play and once on the draw. Being on the play is such a huge advantage that it can't be neglected.

In my tests I let the AIs play mirror matches with equal hands to cut this down to two games (once on the play, once on the draw).

Does "1 match result difference at most given the same AI for 50 games" mean in one run you got say 30/50 and in another run you got 31/50 for the same AI?
Exactly.

melvin wrote:It would be very helpful if you could publish the scripts/code for running the tests. I would like to apply them to my testing as well.
Not sure if I still have working code (I was just messing around and it's been a while). I will certainly publish the code once I get everything working (and hopefully running as a Project Firemind Worker).

melvin wrote:I'm thinking commit hooks. The speculative changes that require testing will be committed to a branch, say with a special name that starts with "ai-".
I'm actually still building new versions with Eclipse on my dev machine. This would require building the code automatically. Is there already something in place that could take care of that?

I'd also like to slim down the number of files as much as possible. Basically for each AI that gets tested I want the JAR file and only the scripts directory with the cards that are part of decks in the gauntlet.

This speeds up the process and also lets me keep the versions longer for debugging purposes.

Re: Using Project Firemind to test speculative AI improvemen

Postby mike » 25 Feb 2015, 17:49

PalladiaMors wrote:One thing that I'm a bit concerned about is using the same deck over and over to do the testing. Wouldn't that lead to the improvements being biased towards that specific deck?
Yeah this would lead to a very inbred AI. I think picking a balanced gauntlet of 10-15 decks should solve this nicely.

PalladiaMors wrote:For instance, if you use a top Firemind deck, a deck that the AI already plays well, won't you fine-tune its "scoring values" (dunno the right term to use) to play that specific deck even better? In that case, I'd suggest using decks that are known to be strong in live play, but that the AI doesn't perform well with yet. That way, you could work on tweaking its weaknesses?
In my earlier tests I used the 10 most popular non-combo decks in Modern. This seemed to work pretty well. The decks the AI is worst at playing seem to be combo. To address this we would have to "tell the AI how the combo works". Currently the scoring is solely focused on playing an attrition match until it can "see" far enough ahead to recognize the win.

What actually confused me was that it can't "recognize", for example, the Splinter Twin combo. If the scoring attributed a positive value to each Deceiver Exarch token, it would seem obvious to me that 10 tokens are better than one token, and 20 tokens are better than 10, and that attacking for 20 (with the opponent at 20) is better than making a 21st token. Instead the AI makes 3 or 4 tokens and then attacks.

This doesn't seem to be tied to the scoring itself but some limit in the exploration strategy. I know this because I replaced the scoring with a more simplistic scoring system (assign value x for each creature, value y for each card in hand etc.). Can you maybe enlighten me as to why this happens @melvin?
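
By "more simplistic scoring system" I mean something along these lines (the weights here are made up for illustration, not the values actually used in any test):

    // Flat values per creature, per card in hand, per life point; the score
    // is just "my stuff minus their stuff". Weights are illustrative only.
    class SimpleBoardScore {
        static int score(int myCreatures, int myHandSize, int myLife,
                         int oppCreatures, int oppHandSize, int oppLife) {
            int mine   = 5 * myCreatures  + 2 * myHandSize  + myLife;
            int theirs = 5 * oppCreatures + 2 * oppHandSize + oppLife;
            return mine - theirs;   // positive means the board favours us
        }

        public static void main(String[] args) {
            System.out.println(score(2, 3, 18, 1, 4, 15));   // prints 6
        }
    }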

Re: Using Project Firemind to test speculative AI improvemen

Postby muppet » 25 Feb 2015, 18:25

One way they tuned a chess engine once it was already pretty good was to take grandmaster games: the more of the engine's moves that matched the grandmasters', the better. Even if this were viable for Magic, the AI is not at that level of competence yet.

Of course, these days the chess computers are better than the GMs.

I am a bit worried the Monte Carlo approach is a bit like setting up a chess board and telling the computer to find the forced mate, when no computer has yet been able to find one, even if it actually exists.

Is it possible to brute force, i.e. not randomise, the choices for at least the first combat phase, and consider, say, all possible blocks at the very least?

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 26 Feb 2015, 01:47

PalladiaMors wrote:Doing this systematically over a long time to fine-tune the engines' positional "understanding" seems to have been a significant part of what led computers to reach their current world-champion-level ratings.
What you've mentioned is tuning the evaluation function, which tells the AI how good or bad a particular board state is. This is one kind of improvement; there are many other kinds, such as increasing how many moves ahead the AI considers, completely changing the formula of the evaluation function, and so on.

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 26 Feb 2015, 02:02

mike wrote:Exactly, but not for every game. Since this is controlled by the seed param I made sure that if it ran 10 games against the same deck it would use the same 10 seeds for every test.
Got it, I use the same technique too. Each set of tests uses the same set of seeds.

If AI1 beats AI2 regardless of the initial conditions then it is definitely better; if each wins one match out of two, then they are even and the result was decided by the initial configuration of the cards.
I consider both the starting player and the library order as part of the initial conditions. Personally I think the factor with the strongest influence on the game is library order, which determines whether you get the right cards at the right time. So there are only two games: one with AI1 as player1 and AI2 as player2, and another with AI1 as player2 and AI2 as player1. Assuming player1 with deck1 goes first in the first match, player1 with deck1 (arranged the same way as before) also goes first in the second match, but is now controlled by AI2 instead of AI1.

mike wrote:I'm actually still building new versions with Eclipse on my dev machine. This would require building the code automatically. Is there already something in place that could take care of that?
Yes, using ant:

    ant -f build.xml

will produce the jar file in release/Magarena.jar, and

    ant clean

removes the build artifacts.

mike wrote:I'd also like to slim down the number of files as much as possible. Basically for each AI that gets tested I want the JAR file and only the scripts directory with the cards that are part of decks in the gauntlet.
Sure, just delete any card scripts that are not used. Currently I'm only building separate jars but sharing the card scripts. I consider the cards to be part of the game; although they can change or new ones can be added, I don't think that affects the playing strength of the AI, since any particular card doesn't come up that often in test games.

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 26 Feb 2015, 04:44

mike wrote:This doesn't seem to be tied to the scoring itself but some limit in the exploration strategy. I know this because I replaced the scoring with a more simplistic scoring system (assign value x for each creature, value y for each card in hand etc.). Can you maybe enlighten me as to why this happens @melvin?
You're right, the exploration is limited, since the space of possible future moves grows very large very quickly. The AI can only look ahead a certain number of moves. As far as it can tell, the combo only generates a few tokens; it doesn't realize that it is an infinite combo.

Attacking also has a score. Likely at some point the score of attacking becomes greater than the score of the tokens generated within the limited look-ahead. It could also be that the scoring system is buggy; it is currently not maintained. AI development effort is focused on the MCTS method, which doesn't use action scoring.

Re: Using Project Firemind to test speculative AI improvemen

Postby melvin » 26 Feb 2015, 04:58

muppet wrote:I am a bit worried the Monte Carlo approach is a bit like setting up a chess board and telling the computer to find the forced mate, when no computer has yet been able to find one, even if it actually exists.
I don't see how this relates to a forced mate (I'm not a chess person). Can you use an MTG analogy?

muppet wrote:Is it possible to brute force, i.e. not randomise, the choices for at least the first combat phase, and consider, say, all possible blocks at the very least?
I think you may be confusing evaluation and exploration. I mentioned how the MCTS AI evaluates moves by simulation. MCTS also builds a game tree like regular minimax (my moves, your moves, at each level considering a set of possible future moves), except that it uses the simulation results to direct the growth of the game tree instead of an evaluation function.

For blocking, though, we have to limit the set of options to at most 12 to make it feasible to consider them all. The actual number of blocking options is very large: consider a case with 4 attackers and 4 blockers. Ignoring the order of blockers (which matters for damage assignment), there are 5^4 = 625 ways to block, since each blocker has 5 options: block one of the 4 attackers, or don't block at all. We do generate the most reasonable cases, such as those that deal lethal damage, and then narrow them down to the 12 with the highest scores according to the evaluation function.
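
As a rough illustration of that enumeration (not the actual Magarena code), generating the 5^4 assignments and pruning to the 12 best could look like this, with a dummy heuristic standing in for the real evaluation function:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    public class BlockEnumeration {

        // Each of B blockers independently picks one of A attackers or "no
        // block" (option 0), giving (A+1)^B assignments.
        static List<int[]> allAssignments(int attackers, int blockers) {
            List<int[]> result = new ArrayList<>();
            int options = attackers + 1;
            int total = (int) Math.pow(options, blockers);
            for (int code = 0; code < total; code++) {
                int[] assignment = new int[blockers];
                int c = code;
                for (int b = 0; b < blockers; b++) {
                    assignment[b] = c % options;   // which attacker this blocker takes
                    c /= options;
                }
                result.add(assignment);
            }
            return result;
        }

        // Dummy stand-in for the evaluation function: prefer more blocks.
        static int dummyScore(int[] assignment) {
            int blocksMade = 0;
            for (int a : assignment) if (a > 0) blocksMade++;
            return blocksMade;
        }

        public static void main(String[] args) {
            List<int[]> all = allAssignments(4, 4);   // 5^4 = 625 assignments
            all.sort(Comparator.comparingInt(BlockEnumeration::dummyScore).reversed());
            List<int[]> kept = all.subList(0, Math.min(12, all.size()));
            System.out.println(all.size() + " assignments, keeping " + kept.size());
        }
    }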

Re: Using Project Firemind to test speculative AI improvemen

Postby muppet » 26 Feb 2015, 10:38

OK, what I mean is: if the space you are exploring is very large compared to the number of trials, then the criterion "I win the game" is a very narrow target to hit. This is especially true if the look-ahead is not far enough to reach the end of the game. Say we can see 30 moves ahead and the average game is 50 moves; then only the special cases where we win quickly fall inside the "I win" target, and the moves we make will be biased towards quick wins, e.g. a mad series of attacks hoping the opponent doesn't block.
When you say random moves, how do you determine what cards the AI and the player have in their hands to cast? Does the AI know the cards in its own deck and/or the player's deck? If it simply takes the current game state and allows no extra cards to appear, that would explain some weird things that happen, as would an assumption that all unknown cards are 1/1 defenders that cost 8 mana. I'll come back to this and explain why once I know what it's doing.
To continue my analogy: if a trial ends before the end of the game, is it possible to count it not as +1 for "won the game" but as +x, where x is the evaluation of that position? That way, all the trials that end before the end of the game, which presumably previously counted as 0 ("did not win"), would still count for something.
This would maybe alleviate my perceived problem of suicidal lines of play that try to win the game before the look-ahead runs out.
I still plan to play a game and annotate everything, and I'll try to show what I mean if the positions arise.

This theory might also explain why a 1/1 attacks into a 2/2: it's the only way to win fast enough.

Please keep the explanations coming if you have time. I've often found that explaining how something like quantum mechanics works to someone who doesn't know it is a good way to firm up your own understanding of what is going on.