I've quickly outlined what I envision the AI testing to look like. Please tell me if anything conflicts with what you have planned, if anything is unclear, or if you have ideas on how to improve it. It's just a draft atm.
Creating the AI Rating Match
* Create by git hook / API call
** push to a new branch named ‘ai-’ on magarena/magarena
** call firemind.ch/api/ai_rating_matches/create_via_github
** create a new AIRM with the branch and pushed ref id
* Create AIRM manually
** alternative way for me to run AI tests that are not part of the magarena repo
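The git-hook path above could be sketched roughly as follows. This is only an illustration: the endpoint path is the one from this draft, but the payload field names (`branch`, `ref`) and the example branch/ref values are my assumptions, not a settled API.

```python
# Hypothetical sketch of the hook side: collect the pushed branch and ref
# and POST them to the (assumed) create_via_github endpoint.
# Payload field names and example values are invented for illustration.
import json
from urllib import request

API_URL = "https://www.firemind.ch/api/ai_rating_matches/create_via_github"

def build_airm_payload(branch, ref_id):
    """Build the JSON body the hook would send for an 'ai-' branch push."""
    return {"branch": branch, "ref": ref_id}

def notify_firemind(branch, ref_id):
    """POST the payload; would be called from a post-receive hook."""
    data = json.dumps(build_airm_payload(branch, ref_id)).encode()
    req = request.Request(API_URL, data=data,
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

# Example (no network call is made here):
payload = build_airm_payload("ai-mcts-tuning", "4f2a9c1")
print(json.dumps(payload))
```

On the server side, the endpoint would then create the AIRM record from `branch` and `ref`, leaving it unconfirmed until an authorized user signs off.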
Confirming the AI Rating Match
An AIRM needs to be confirmed via the firemind.ch web GUI by an authorized user (melvin, me, whoever else requests the privilege).
* In addition to confirmation, the following parameters can/must be set:
** AI identifier to test (so we later know which implementation has changed)
** AI identifier to test against, defaults to the MCTS AI
** AI strengths, defaulting to 2 for both
Upon confirmation of the AIRM, the following are generated:
* A new API access key specifically for this test run
* All the duels that are part of the gauntlet (assigned to the confirming user)
* A job in the auto checkout queue
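The confirmation step above could generate something like the following. This is a minimal sketch under my own assumptions: all names (`confirm_airm`, the duel fields, the `auto_checkout` queue name) are invented, and only the three generated artifacts come from the draft.

```python
# Hypothetical sketch of what confirming an AIRM generates: a per-run API
# key, the gauntlet duels, and a checkout-queue job. All identifiers are
# invented for illustration.
import secrets

def confirm_airm(ai_id, opponent_id="MCTS", strength=2, gauntlet=()):
    api_key = secrets.token_hex(16)   # new key, valid only for this test run
    duels = [{"deck": deck, "ai": ai_id, "vs": opponent_id,
              "strength": strength, "state": "queued"}
             for deck in gauntlet]
    job = {"queue": "auto_checkout", "api_key": api_key}
    return api_key, duels, job

key, duels, job = confirm_airm("MCTS_EXPERIMENTAL",
                               gauntlet=["Burn", "Infect"])
print(len(key), len(duels), job["queue"])
```

Scoping the key to a single run means a leaked or stale key can't post results for any other AIRM, and it lets the server attribute incoming duel results to the right test without extra bookkeeping.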
Running the AI Rating Match
The queue for new AIRM jobs is checked by a cron job on an AI worker VM. If one is found, the test process is started:
* check out the git repo
* ant build + cleanup
* run cleanup scripts for unused scripts etc.
* set the special API key in general.cfg
* run the firemind queue worker
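The worker steps above could be sketched like this. Everything beyond the step order is an assumption: the repository URL, the ant build file, the cleanup and worker script names, and the `general.cfg` key format are all placeholders, not the real paths.

```python
# A dry-run sketch of the cron-driven worker: the steps from the draft,
# expressed as shell commands. Script names, paths, and the config-key
# format are assumptions; set dry_run=False to actually execute them.
import subprocess

def airm_job_steps(branch, api_key, workdir="/tmp/airm"):
    return [
        ["git", "clone", "--branch", branch,
         "https://github.com/magarena/magarena.git", workdir],
        ["ant", "-f", f"{workdir}/build.xml"],            # build target assumed
        [f"{workdir}/scripts/cleanup_unused.sh"],         # hypothetical cleanup script
        ["sh", "-c",                                      # config key name is a guess
         f"echo 'firemind.apikey={api_key}' >> {workdir}/general.cfg"],
        [f"{workdir}/scripts/firemind_queue_worker.sh"],  # hypothetical worker script
    ]

def run_job(branch, api_key, dry_run=True):
    steps = airm_job_steps(branch, api_key)
    if not dry_run:
        for cmd in steps:
            subprocess.run(cmd, check=True)
    return steps

steps = run_job("ai-example", "SECRETKEY")
print(len(steps))
```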
Getting the results
While the worker is running, the progress of the duels can be viewed on firemind.ch:
firemind.ch/ai_rating_matches/:id
* shows the state of every duel in list form
* shows if a duel has failed
* Once the last match result is posted to the firemind API and every duel was successful, a worker is started to recalculate the AI Whole History Ratings (WHR)
* The WHRs can be viewed on the AIRM overview page: firemind.ch/ai_rating_matches
What if something goes wrong
Hopefully this covers all the scenarios:
* The web hook failed to create an AIRM
** Probably something went wrong on the server side; I get an email
** Try creating an AIRM manually
* The AIRM can’t be confirmed (error 500)
** I get an email
* The checkout worker failed somehow
** A cron job notifies me about duels that have been sitting in the queue for too long
** Creating a new AIRM may fix the problem if it was temporary
* A duel failed (usually because of a Java exception)
** The AI worker should catch the exception and post it to the duel status page, which can be viewed on the website
* A duel failed for stranger reasons (sometimes a Java program just crashes)
** The duel will be restarted (like every duel on firemind.ch), up to 5 times
** If the error was temporary, nothing needs to be done
** If the error persists (i.e. it fails for the 5th time), I get an email and I can have a look at the server logs
* The WHR calc worker fails
** I get an email
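The restart policy described above can be sketched as a simple retry loop; a hedged illustration, with the function names and the flaky example invented:

```python
# Sketch of the restart policy above: retry a duel up to 5 times, and
# escalate (that's where the email would go out) only if the final
# attempt also fails. Names are illustrative.
MAX_ATTEMPTS = 5

def run_with_retries(duel_fn, max_attempts=MAX_ATTEMPTS):
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return duel_fn()
        except Exception as exc:   # e.g. a crashed Java process
            last_error = exc
    # persistent failure: notify by email and keep the logs for inspection
    raise RuntimeError(f"duel failed {max_attempts} times") from last_error

# A flaky duel that succeeds on its third attempt:
attempts = {"n": 0}
def flaky_duel():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("java crashed")
    return "win"

result = run_with_retries(flaky_duel)
print(result, attempts["n"])
```

The key property is that transient crashes are absorbed silently (matching "if the error was temporary nothing needs to be done"), while only the fifth consecutive failure triggers human attention.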
Open questions
I feel like there should be a frozen copy of the MCTS (or MMAB?) AI in its current state that can always be used to test against. If later improvements to it are implemented, we lose our baseline for comparison.
We need to settle on a gauntlet that gives an accurate representation of the AI's capabilities. Unfortunately, this needs to be a fixed set of decks and parameters, or it will be very hard to compare results in the future.
I will start by proposing a list of decks that would be suitable for this:
Modern
- Burn
- Ur Delver
- Bogles
- Infect
- UWR Control (Sphinx's Rev)
- UWR Midrange (Geist)
- Merfolk
- Splinter Twin
- Soul Sisters
Decks that I'd like to include but that are too strongly crippled by missing cards:
- Tron
- Storm
- Abzan Midrange (or any rock decks)
I only named Modern decks because I think it is the one eternal format that actually has enough major archetypes playable. I'd like to include more combo decks (even if the current AI can't play them), so if anyone knows of a Modern combo deck whose cards are all in Magarena, please tell me.
Also, the list is just a quick rundown off the top of my head. Any input is welcome.
The next step would then be to find the sweet spot for the number of games: enough to get an accurate measurement, but no more than necessary, so that we can run as many tests as possible.
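For picking that sweet spot, a rough starting point is to treat each game as a Bernoulli trial and size the run by the margin of error we're willing to accept on the measured win rate. This is only a back-of-the-envelope model (it ignores deck matchup effects and the WHR machinery), but it bounds the numbers:

```python
# Back-of-the-envelope sample sizing: the 95% margin of error on a
# measured win rate p over n games is about 1.96 * sqrt(p*(1-p)/n).
# The worst case (widest interval) is p = 0.5.
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the ~95% confidence interval on the win rate."""
    return z * math.sqrt(p * (1 - p) / n)

def games_needed(target_moe, p=0.5, z=1.96):
    """Games required to pin the win rate down to +/- target_moe."""
    return math.ceil(z * z * p * (1 - p) / target_moe ** 2)

# e.g. to know a win rate to within +/-5 percentage points:
print(games_needed(0.05))                  # 385 games total
print(round(margin_of_error(1000), 3))     # ~0.031 at 1000 games
```

So roughly 400 games pins a win rate down to about ±5 points, and going much past 1000 games buys only slow improvements, which suggests where the "not more than necessary" cutoff might sit.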