Using Project Firemind to test speculative AI improvements


Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 26 Feb 2015, 13:23

muppet wrote:Ok, what I mean is: if the space you are exploring is very large compared to the number of trials, the criterion of "I win the game" is a very narrow target to hit...
Simulations run until the game is over and someone has won, so there isn't an issue of insufficient look-ahead. Some simulations end with the AI winning, some end with the player winning. The intuition is that a game state where most simulations end with the AI winning is a strong position for the AI.

muppet wrote:When you say random moves, how do you determine what cards the AI and the player have in their hands to cast...
For simulation, the cheating MCTS uses the actual cards and the actual library order, so the simulated card draws follow what would happen in reality. The honest MCTS uses a random permutation of the hidden cards, which assumes the AI knows the opponent's deck.
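As a rough illustration, the determinization step of the honest variant could be sketched as below; the names here are hypothetical stand-ins, not the actual Magarena code.

Code:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: before each simulation, the honest MCTS replaces the hidden
// zones (opponent's hand and library) with a random permutation of the
// cards known to be hidden, keeping the zone sizes unchanged. The generic
// card type C stands in for the real card class.
final class Determinizer {
    static <C> void randomizeHiddenCards(final List<C> hiddenCards,
                                         final List<C> opponentHand,
                                         final List<C> opponentLibrary) {
        final List<C> pool = new ArrayList<>(hiddenCards);
        Collections.shuffle(pool);
        final int handSize = opponentHand.size();
        opponentHand.clear();
        opponentHand.addAll(pool.subList(0, handSize));
        opponentLibrary.clear();
        opponentLibrary.addAll(pool.subList(handSize, pool.size()));
    }
}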

muppet wrote:To continue my analogy: if the trials end at a point before the end of the game..
Every simulation ends with one player winning. In practice, we do have an upper limit on the length of a simulation to prevent a never-ending simulation, but it is much higher than the average game length and rarely reached. When the limit is reached, the value of the simulation is 0.5, effectively representing a draw. In simulations where the AI wins the value is 1, and where the AI loses the value is 0. MCTS does not make use of a traditional evaluation function based on board state.
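To make the scoring concrete, a minimal sketch (the constant and result type are made-up names, not the actual Magarena implementation):

Code:
// Sketch of the simulation scoring described above: 1.0 if the AI wins,
// 0.0 if the AI loses, and 0.5 (a draw) if the length cap is reached.
// MAX_SIMULATION_LENGTH and SimResult are hypothetical names.
enum SimResult { AI_WON, AI_LOST, LENGTH_CAP_REACHED }

final class SimulationScore {
    static final int MAX_SIMULATION_LENGTH = 1000; // well above the average game length

    static double score(final SimResult result) {
        switch (result) {
            case AI_WON:  return 1.0;
            case AI_LOST: return 0.0;
            default:      return 0.5;
        }
    }
}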

We have a more detailed write up of MCTS on our wiki, https://github.com/magarena/magarena/wi ... TreeSearch

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 26 Feb 2015, 22:04

I quickly outlined what I envision the AI testing process would look like. Please tell me if anything conflicts with what you have planned, if anything is unclear, or if you have ideas on how to improve this. It's just a draft atm.

Creating the AI Rating Match

* Create by git hook / api call
  * push to new branch named ‘ai-’ on magarena/magarena
  * call firemind.ch/api/ai_rating_matches/create_via_github
  * create new AIRM with branch and pushed ref id
* Create AIRM manually
  * alternative way for me to run AI tests that are not part of the magarena repo

Confirming the AI Rating Match

The AIRM needs to be confirmed via the firemind.ch web GUI by an authorized user (melvin, me, whoever else requests the privilege).
In addition to confirmation, the following parameters can/must be set:
* AI identifier to test (so we later know which implementation has changed)
* AI identifier to test against, defaults to MCTS AI
* AI strengths, defaults to 2 for both

Upon confirmation of the AIRM, the following are generated:
* A new API access key specifically for this test run
* All the duels that are part of the gauntlet (assigned to the confirming user)
* A job in the auto-checkout queue

Running the AI Rating Match

The queue for new AIRM jobs is checked by a cron job on an AI worker VM. If one is found, the test process is started:

* Check out the git repo
* ant build + cleanup
* cleanup scripts for unused scripts etc.
* set the special API key in general.cfg
* run the firemind queue worker
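A rough sketch of those steps as a Java worker is below; the repository URL, config property name and queue-worker entry point are placeholders of mine, not the real Firemind setup.

Code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of the checkout/build/run pipeline listed above, driven via
// ProcessBuilder. The clone URL, branch handling, config property name and
// worker jar are placeholders, not the actual Firemind worker.
final class CheckoutWorker {
    static void run(final String branch, final String apiKey) throws IOException, InterruptedException {
        final Path workDir = Files.createTempDirectory("airm-");
        exec(workDir, "git", "clone", "--branch", branch,
                "https://github.com/magarena/magarena.git", ".");
        exec(workDir, "ant"); // build the jar; cleanup scripts would run here too
        // Set the test-run specific API key in general.cfg.
        Files.write(workDir.resolve("general.cfg"),
                ("firemind.apikey=" + apiKey + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        exec(workDir, "java", "-jar", "firemind-queue-worker.jar"); // placeholder entry point
    }

    private static void exec(final Path dir, final String... cmd) throws IOException, InterruptedException {
        final Process p = new ProcessBuilder(cmd).directory(dir.toFile()).inheritIO().start();
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }
}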

Getting the results

While the worker is running, the progress of the duels can be viewed on firemind.ch/ai_rating_matches/:id
* shows the state of every duel in list form
* shows if a duel has failed
* Once the last match result is posted to the firemind API and every duel was successful, a worker is started to recalculate the AI Whole History Ratings (WHR)
* The WHRs can be viewed on the AIRM overview page, firemind.ch/ai_rating_matches

What if something goes wrong?
Hopefully this covers all the scenarios:
* The web hook failed to create an AIRM
** Probably something wrong on the server side, I get an email
** try creating an AIRM manually
* The AIRM can’t be confirmed (error 500)
** I get an email
* The Checkout worker failed somehow
** There is a cron job running notifying me about duels that have been sitting in the queue for too long
** Creating a new AIRM may fix the problem if it was temporary
* A duel failed (usually because of a Java exception)
** The AI worker should catch the exception and post it to the duel status page, which can be viewed on the website
* A duel failed for stranger reasons (sometimes a Java program just crashes)
** The duel will be restarted (like every duel on firemind.ch) up to 5 times; see the sketch after this list
** if the error was temporary, nothing needs to be done
** if the error persists (it fails for the 5th time), I get an email and I can have a look at the server logs
* The WHR calc worker fails
** I get an email
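The restart policy mentioned above could be sketched like this (runDuel is a placeholder, not actual Firemind code):

Code:
// Sketch of the restart policy: retry a failed duel up to 5 times and only
// escalate (email, server logs) if the fifth attempt also fails. runDuel is
// a placeholder for launching a single duel.
final class DuelRetry {
    static final int MAX_ATTEMPTS = 5;

    static void runWithRetries(final int duelId) {
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                runDuel(duelId);
                return; // success, nothing more to do
            } catch (final RuntimeException e) {
                if (attempt == MAX_ATTEMPTS) {
                    throw e; // error persists: surface it so the email/alert goes out
                }
            }
        }
    }

    private static void runDuel(final int duelId) {
        // Placeholder for starting one duel on a worker.
    }
}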


Open questions

I feel like there should be a copy of the MCTS (or MMAB?) AI in its current state that can always be used to test against. If later improvements to it are implemented, we lose our baseline for comparison.

We need to settle on a gauntlet that gives an accurate representation of the AI's capabilities. Unfortunately, this needs to be a fixed set of decks and parameters; otherwise it will be very hard to compare things in the future.

I will start by proposing a list of decks that would be suitable for this:
Modern
- Burn
- Ur Delver
- Bogles
- Infect
- UWR Control (Sphinx's Rev)
- UWR Midrange (Geist)
- Merfolk
- Splinter Twin
- Soul Sisters

Decks that I'd like to include but that are too strongly crippled by missing cards:
- Tron
- Storm
- Abzan Midrange (or any rock decks)


I just named Modern decks because I think it is the one eternal format that actually has enough major archetypes playable. I'd like to include more combo decks (even if the current AI can't play them), so if anyone knows of a Modern combo deck whose cards are in Magarena, please tell me.

Also the list is just a quick rundown from the top of my head. Any input is welcome.

The next step would then be to find the sweet spot for the number of games: enough to get an accurate representation, but not more than necessary, so that we can run as many tests as possible.
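One way to reason about that sweet spot is the margin of error of a measured win rate; a quick back-of-the-envelope sketch (plain normal approximation, nothing Firemind-specific):

Code:
// Sketch: 95% margin of error of a win rate measured over n games, using
// the normal approximation 1.96 * sqrt(p * (1 - p) / n). For example,
// 500 games at p = 0.5 gives roughly +/- 4.4 percentage points, so
// differences smaller than that are likely just noise.
final class WinRateError {
    static double marginOfError95(final double winRate, final int games) {
        return 1.96 * Math.sqrt(winRate * (1.0 - winRate) / games);
    }

    public static void main(final String[] args) {
        System.out.printf("500 games:  +/- %.3f%n", marginOfError95(0.5, 500));
        System.out.printf("2000 games: +/- %.3f%n", marginOfError95(0.5, 2000));
    }
}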

Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 27 Feb 2015, 06:26

mike wrote:I quickly outlined what I envision the AI testing process would look like. Please tell me if anything conflicts with what you have planned, if anything is unclear, or if you have ideas on how to improve this. It's just a draft atm.
Thanks for coming up with the draft!

I think we can merge the creating and confirming steps by removing the commit hook and only having a manual creation step.

The information to be supplied is the commit hash on the main GitHub repo and the two AIs to be tested. Each AI is specified by the symbolic name of the AI (one of the enums in MagicAIImpl), the descriptive name of the variant, and the level. The number of games to be played is also specified, along with the random deck profile. The information (in JSON) will look something like:
Code:
{
   "commit": "8e4c79227c4b9cd3aca439d781bab5c8d62fd24f",
   "games": 500,
   "AIs": [
      {
         "name": "MMAB",
         "variant": "mmab-fast-choices",
         "level": 2
      },
      {
         "name": "MCTS",
         "variant": "base-ae3df12",
         "level": 2
      }
   ]
}
From this, the worker will clone the repo, check out the specific commit and build the jar. Then it plays MMAB level 2 vs MCTS level 2 on the set of decks you proposed, for a total of 500 games. Internally we can split this into smaller blocks of games to be run on different workers in parallel.
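To illustrate the block splitting, a minimal sketch with a hypothetical playBlock helper (not the actual worker code):

Code:
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: split the total number of games into blocks and run them on a
// pool of workers in parallel, summing the first AI's wins at the end.
// playBlock is a hypothetical stand-in for "run N duels, return AI1's wins".
final class BlockRunner {
    static int runMatch(final int totalGames, final int blockSize, final int workers) throws Exception {
        final ExecutorService pool = Executors.newFixedThreadPool(workers);
        final List<Future<Integer>> blocks = new ArrayList<>();
        for (int queued = 0; queued < totalGames; queued += blockSize) {
            final int games = Math.min(blockSize, totalGames - queued);
            final Callable<Integer> block = () -> playBlock(games);
            blocks.add(pool.submit(block));
        }
        int ai1Wins = 0;
        for (final Future<Integer> block : blocks) {
            ai1Wins += block.get();
        }
        pool.shutdown();
        return ai1Wins;
    }

    private static int playBlock(final int games) {
        return games / 2; // placeholder: the real worker would run the built jar here
    }
}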

The results can be displayed on a summary page as a table; each row shows the submitted information and how many games were won by each AI. To keep things simple, all the WHR calculations can be done offline for now. Since the raw data is provided, we can also use different methods to evaluate the results.

I don't think we can have a frozen version of the AIs. The goal is to continuously improve the AI code. When we find a good modification, we'll apply it to the base version, then start testing other changes, and so on. That's why it is necessary to uniquely identify each AI with the variant tag, and it must be specified for both AIs in the match.

Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 27 Feb 2015, 07:37

Alternatively, instead of using fixed decks and figuring out how to split the games among the decks and which decks to use, we can use randomly generated decks as I am doing now.

Then we also need to provide a profile option when creating the AI rating job, e.g. deck_profile="**", meaning each game is run with a randomly generated two-color deck. The current DeckStrCal entry point accepts a profile command line parameter.
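Just to illustrate what the profile string means (this is not the real parser in DeckStrCal, only a sketch of the idea): each '*' stands for one randomly chosen color.

Code:
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustration only: a profile like "**" means a deck built from two
// randomly chosen colors, one random color per '*'.
final class DeckProfileSketch {
    private static final List<Character> COLORS = Arrays.asList('w', 'u', 'b', 'r', 'g');

    static List<Character> randomColors(final String profile) {
        final int wildcards = (int) profile.chars().filter(c -> c == '*').count();
        final List<Character> pool = new ArrayList<>(COLORS);
        Collections.shuffle(pool);
        return pool.subList(0, wildcards);
    }
}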

Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 27 Feb 2015, 08:10

melvin wrote:Currently I'm mostly using randomly generated decks based on two colors; this assumes that the decks generated will be of roughly the same power level. But the game could be decided purely by the library order (mana flood/mana screw) even if the decks are "fair". Playing twice with the same initial conditions but swapping AIs seems to solve this.
Just found out this technique is called "duplicate"; it is used in the Annual Computer Poker Competition. http://www.computerpokercompetition.org ... -duplicate
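A minimal sketch of that pairing, assuming a hypothetical firstSeatWins(seed, first, second) helper that replays the exact same decks, hands and library order for a given seed:

Code:
// Sketch of the "duplicate" technique: each random setup (seed) is played
// twice with the seats swapped, so mana screw/flood hits both AIs equally.
// firstSeatWins is a placeholder returning true if the AI in the first
// seat won that game.
final class DuplicateRunner {
    static int[] runDuplicatePairs(final long[] seeds, final String ai1, final String ai2) {
        int ai1Wins = 0;
        int ai2Wins = 0;
        for (final long seed : seeds) {
            // Game 1: ai1 in the first seat.
            if (firstSeatWins(seed, ai1, ai2)) { ai1Wins++; } else { ai2Wins++; }
            // Game 2: identical initial conditions, seats swapped.
            if (firstSeatWins(seed, ai2, ai1)) { ai2Wins++; } else { ai1Wins++; }
        }
        return new int[] { ai1Wins, ai2Wins };
    }

    private static boolean firstSeatWins(final long seed, final String first, final String second) {
        return true; // placeholder for running one seeded Magarena duel
    }
}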

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 27 Feb 2015, 09:48

melvin wrote:Alternatively, instead of using fixed decks and figuring out how to split the games among the decks and which decks to use, we can use randomly generated decks as I am doing now.
My experience with the randomly generated decks is that it leads to very high variance even if both AIs play the same deck. It was pretty easy to replicate this by substituting the scoring function for one AI with one that heavily favored an aggressive play style. It won the Burn mirror about 85% of the time but did terribly with the randomly generated decks (unless it got a very aggressive build).

This variance can only be prevented by ensuring the randomly generated decks cover a wide variety of play styles, and I can't even imagine what that would look like for combo.

So ultimately, I do agree that it seems tedious to select the decks and their distribution in the gauntlet, but right now I believe it is the only way to ensure the AI gets tested on a broad enough spectrum. Additionally, with fixed decks it will be very easy to identify for which decks the AI improved and where it is still lacking. For this I would even go as far as to include some especially challenging cards that future AI implementations will have an edge in using.

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 27 Feb 2015, 09:59

melvin wrote:I don't think we can have a frozen version of the AIs. The goal is to continuously improve the AI code. When we find a good modification, we'll apply it to the base version, then start testing other changes, and so on. That's why it is necessary to uniquely identify each AI with the variant tag, and it must be specified for both AIs in the match.
I think I get where you're coming from. As I understand it, you want this variant tag to be set manually, which seems prone to human error. Obviously it doesn't really matter if all we want to do is compare the two AIs at hand, but it might break our ability to get statistical data over time, like the WHR.

To be honest, I doubt that we'll ever be able to do much with historical data anyway, since the build is always based on improving versions of the Magarena engine itself, so maybe it's a fool's errand to try to ensure comparability over time. I'd be fine if the system is just used to learn whether a change makes the AI better or worse and nothing beyond that.

Re: Using Project Firemind to test speculative AI improvements

Postby muppet » 27 Feb 2015, 10:07

I basically agree with your assessment of the Modern decks, with the exception of Rock, which is only really missing the targeted discard; you can make viable decks of various sorts, for example see my Death Cloud deck. Abzan seems to suffer more from the loss of the discard than Rock, presumably because it takes a bit longer to get its things going and needs to disrupt the opponent to manage this in time. But lingering spirits, rhinos and Elspeth are mostly all they have over Rock, which Magarena has.

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 27 Feb 2015, 10:09

melvin wrote:Then it plays MMAB level 2 vs MCTS level 2 on the set of decks you proposed, for a total of 500 games. Internally we can split this into smaller blocks of games to be run on different workers in parallel.
Why would you run two different AI implementations against each other? They probably don't have a 50/50 ranking right now, so it would be harder to identify whether something actually changed.

I had something like MCTS level 2 vs MCTSNEW level 2 in mind to see clearly where the AI improved. For example, one could see that the new version plays Twin correctly and thus has a >80% win rate in the Twin mirror.

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 27 Feb 2015, 10:24

muppet wrote:I basically agree with your assessment of the Modern decks, with the exception of Rock, which is only really missing the targeted discard; you can make viable decks of various sorts, for example see my Death Cloud deck. Abzan seems to suffer more from the loss of the discard than Rock, presumably because it takes a bit longer to get its things going and needs to disrupt the opponent to manage this in time. But lingering spirits, rhinos and Elspeth are mostly all they have over Rock, which Magarena has.
A lot of people would say the core of the Rock decks is the targeted discard, Goyf and Lili. If two of those are missing, it's not the same deck. Yes, you can build Death Cloud, but that to me is a different deck (not necessarily bad, and we could include it in the gauntlet). Death Cloud to me borders more on the side of the unfair decks that try to retain value by playing cards like planeswalkers that are not affected by Death Cloud, to create asymmetry. The Rock decks I was talking about are very linear one-for-one decks built to play an attrition game. Of course, the matchups where the targeted removal is strongest are against combo, which is not really putting up a fight on Magarena (yet). So right now the lack of Thoughtseize and Inquisition might not be as noticeable, but as soon as Twin (or Scapeshift, or Amulet) starts to rise it will become clear that it is not the same deck.

If we really want a Rock deck to be present in the gauntlet, it might be enough to replace the black one-drops with Birds and Hierarchs to power out Siege Rhino. It's not the same deck, but at least it adds a lot of cards that are not yet part of the list.

Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 27 Feb 2015, 12:44

mike wrote:This variance can only be prevented by ensuring the randomly generated decks cover a wide variety of play styles. I can't even imagine what that would look like for combo.
Magarena is designed to be played with randomly generated decks. It was never intended that the AI would be able to expertly pilot specialized decks. That's why I suggested random decks, as it fits the goal of the program. With sufficient games, playing with random decks is able to differentiate between different AIs. The "duplicate" technique mentioned earlier will also reduce the variance.

I'm fine with using a custom gauntlet too. I think for the first iteration we should go with whichever method is simpler.

mike wrote:As I understand it you want this variant tag to be set manually which seems prone to human error.
Agreed that a custom variant tag is error-prone. That's why, in my earlier post, I suggested using the hash of the AI source files as the variant tag. This can be the "automatic" method of generating the variant tag for a future iteration of the AI ranking feature. With an automatic variant tag, it will still be helpful to attach a human-readable name, say the commit message, to make it easier to identify different variants by eye.

mike wrote:Why would you run two different AI implementations against each other? They probably don't have a 50/50 ranking right now so it would be harder to identify that something actually changed.
I tried to maintain two versions of an AI before; it just leads to source management overhead and a bit of a mess. Each change has to be done on the next version, then ported over to the main version manually. Git branches should be the way to do this; unfortunately, we can't compare AIs across different jars.

Testing only against yourself can also lead to brittle changes, where change A is better than B, B is better than C, and C is better than A. By testing against different AIs we get a more robust ranking result using the WHR score. For the purpose of WHR, it is also necessary to have matches across AIs, or it is not possible to generate a single set of rankings that includes all of them. For example, if p1, p2 and p3 play against each other, but never against p4, p5 and p6, it is not possible to rank all of them on the same scale because there is no way to tell if p1 is better than p4 or vice versa.

We can have a single baseline AI, perhaps a simple one-step look-ahead greedy AI, solely for the purpose of normalizing WHR scores. The baseline AI is maintained to keep up with API changes but never improved, and we can define it to have a scaled WHR score of 1, so that the rest of the AIs have a ranking on a fixed scale. Assuming the computed WHR score of the baseline is 0.4, we then divide all the AIs' WHR scores by 0.4, since we defined the scaled WHR of the baseline to be 1.
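A tiny sketch of that normalization (illustrative names only, not an existing Firemind API): with the numbers above, a baseline raw score of 0.4 is pinned to 1, and an AI with a raw score of 0.8 would end up at 2.

Code:
import java.util.HashMap;
import java.util.Map;

// Sketch of the normalization described above: divide every AI's computed
// WHR score by the baseline AI's score, so the baseline sits at exactly 1
// and every other AI is placed on a fixed scale relative to it.
final class WhrNormalizer {
    static Map<String, Double> normalize(final Map<String, Double> rawWhr, final String baselineAi) {
        final double baselineScore = rawWhr.get(baselineAi);
        final Map<String, Double> scaled = new HashMap<>();
        for (final Map.Entry<String, Double> entry : rawWhr.entrySet()) {
            scaled.put(entry.getKey(), entry.getValue() / baselineScore);
        }
        return scaled;
    }
}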

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 01 Mar 2015, 23:58

melvin wrote:Magarena is designed to be played with randomly generated decks. It was never intended that the AI would be able to expertly pilot specialized decks.
The optimist in me would like to believe that the AI will some day be able to do both. But that is not why I proposed the gauntlet of established decks.

My reasoning behind the use of explored archetypes is that it will make it a lot easier to pinpoint AI weak spots. These decks have been tested by a lot of people for years now and there is an established precedent on how one plays them well. Figuring out if a move was good or not seems a lot harder to me for decks that I have no experience with.

The other advantage I see is the one concerning play styles. I think we can agree that the AI should be able to pilot decks anywhere on the scale between aggro and control (leaving out combo for now). To test that a new version of the AI doesn't improve one play style by sacrificing its ability to pilot decks from the other side of the spectrum, we have to ensure that all play styles are represented. Sure, by generating enough random decks we eventually test every strategy, but hand-picking is much faster and the results are quickly interpreted. For example, consider these results:

The new AI went from 55/45 with Burn to only 40/60 now. A reasonable conclusion would be that it overvalued board presence or its own life total while undervaluing speed and direct damage. That doesn't have to be the reason, but it's a good starting point for understanding the AI's reasoning.

If instead of Burn I saw a randomly generated deck list it would take me a lot longer to figure out how that deck should have been played and where the AI went wrong.

melvin wrote:That's why, in my earlier post, I suggested using the hash of the AI source files as the variant tag. This can be the "automatic" method of generating the variant tag for a future iteration of the AI ranking feature.
Ok, I will implement this as follows:
- When creating the AIRM you choose the names of the two AIs you want to run (e.g. MMAB, MCTSAI)
- After checking out the source I will calculate the md5sum of the Java files (or would the .class files be preferable?) that correspond to those names in the src/magic/ai/ directory (see the sketch after this list)
- This hash will be saved separately for both AIs so the AI versions that were used can be uniquely identified later
- The last commit message will be saved along with the AIRM record
- All of that data will be visible on the AIRM overview page
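A rough sketch of that hashing step; which files "correspond to" an AI name is my assumption here, the real mapping would come from MagicAIImpl.

Code:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Sketch: build an md5-based variant tag from the AI's Java source files
// under src/magic/ai/. Matching files by name substring is a simplification.
final class VariantTag {
    static String md5ForAi(final String aiName) throws IOException, NoSuchAlgorithmException {
        final MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (Stream<Path> files = Files.list(Paths.get("src/magic/ai"))) {
            final List<Path> sources = files
                    .filter(p -> p.getFileName().toString().contains(aiName))
                    .sorted() // stable order so the hash is reproducible
                    .collect(Collectors.toList());
            for (final Path source : sources) {
                md5.update(Files.readAllBytes(source));
            }
        }
        final StringBuilder hex = new StringBuilder();
        for (final byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}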


Also, a quick status update: I made all the changes necessary for this feature on the web app side. Now I'm somewhat stuck on how to do the checkout worker properly. The truth I don't yet want to face is that it most likely needs to be done in Java to ensure the thing doesn't break the scalability of the system.

I'll get back to you once that is solved and we can start testing.

Re: Using Project Firemind to test speculative AI improvements

Postby melvin » 02 Mar 2015, 01:05

I just realized that using the hash of the source files is a bad idea. It precludes refactoring the code, as any refactoring changes the hash even when the AI's behaviour is unchanged. For simplicity, the user has to submit the variant name manually.

Re: Using Project Firemind to test speculative AI improvements

Postby mike » 09 Mar 2015, 12:25

Does anyone know about cards or card combinations the AI currently handles badly?

I'd like to include some more challenges for future AI implementations that show that it "knows the cards better" than old versions.

Re: Using Project Firemind to test speculative AI improvements

Postby muppet » 09 Mar 2015, 13:16

I'll add some to this post as I see them. Splinter Twin is an obvious one but I know you know that one.

* Cards that champion, e.g. Mistbind Clique.
* Man lands: it tends to get in a mess with tapping them to power them up, for example Creeping Tar Pit.
* Snapcaster Mage: it tends not to wait long enough to cast it.
* Pretty much all non-targeted spells that have some effect, e.g. Wrath of God.
* Reanimate in general: not sure what's up with this at the moment; it seemed OK when I tried it out, but the deck is getting terrible results and it's not bad.
* Force of Will: tends to use it at the first opportunity.
* Batterskull: moved it back to its hand for no reason when it had a Germ token.
* Cards that search the library for cards, e.g. Stoneforge Mystic: it still occasionally declines to find a card. Goblin Matron too.
