And it’s over! Spider GM played well to win 8 games, but monkey went one better with 9 victories. Most of the games were easy, but there were a few exceptions. Game 3 was just horrible: the final round was 833A5A8jk4 which is close to an instant loss (assuming no lucky suited connector such as a 3 landing on a 4 of the same suit). And that was not the only problem. Both human and monkey “agreed” on every game (i.e. both win or both lose) except game 4. Spider GM never found an empty column since the tactics were all wrong. Even so, 64% by the monkey was not the most convincing of victories. The conclusion is that the monkey’s win rate should have some correlation with an expert playing 4-suit solitaire. In other words, the effects of making random moves and playing at the one-suit level pretty much cancel each other out.
Of course 10 trials is not a lot of data, and perhaps I need more games to say for sure. Also, lesser players may find similar results, but the threshold should be e.g. 30% not 50%.
Congrats to Monkey for a well-earned win and commiserations to Spider GM.
BTW If anyone knows how to change the word “eight” in the above spreadsheet into a number 8 without the unwanted percentage sign, please let me know 🙂
Once upon a time, Odysseus was trapped in a maze and wanted
to find his way back to Ithaca. Hang on, is that Odysseus and the Sirens or
Theseus and the Minotaur? Whatevs, I was never any good with Greek Mythology.
Without loss of generality, assume it really is Odysseus and the Sirens.
Let us denote a maze as a finite set of nodes, where O represents Odysseus’ starting position, I represents Ithaca and S represents a Siren. Odysseus wins if he reaches Ithaca and loses if he reaches a Siren. We will also assume that a legal move consists of Odysseus moving from one node to another adjacent node in a downward direction. Any “terminal” node is marked either I or S. This implies that “draws” are impossible i.e. the game cannot continue indefinitely without a result. Example mazes are shown below:
Clearly, if Odysseus knows the entire maze layout then he is guaranteed to win (if at least one winning path exists) or lose (no such path exists). But if the maze layout is not known then Odyssues must guess which path to take. For instance, Odysseus might approach a dozen Sirens if they are singing Benjamin Britten’s Missa Brevis or get the 70-95-67-75 outta here if they are improvising karaoke with a Richard Clayderman song. Note that I didn’t claim that this was a good strategy.
The simplest strategy is a random-walk strategy. At any node
we can compute the number of legal moves available to Odysseus and stipulate
that each move occurs with equal probability. One can easily see that Odyssues
should probably win in the left maze and probably lose in the right maze.
Calculating the exact probabilities is left as the proverbial exercise for the
In Spider Solitaire we can (in theory at least) enumerate the set of all legal game states reachable from a given start position and it will look like the Odysseus-Siren maze except it will obviously have a ridiculous number of nodes and links. We can say that a game is “easy” if it resembles the left maze and “hard” if it resembles the right. If we can compute (or estimate) exact probabilities of winning then we can have a quantitative measure of difficulty (probability of winning) rather than a categorical measure (easy/medium/hard). Obviously this will not be appropriate for human players since humans understand basic Spider Solitaire strategy (such as obtaining empty columns). But it might be appropriate for monkeys playing at the 1-suit level. At least we have a ball-park estimate of how easy or difficult a given hand should be.
The diagram below (left) is an example endgame where an expert player has removed six suits and only needs two more moves to win. For purposes of this example we will assume that (i) empty columns cannot be used and (ii) if the player fails to win the game in two moves then it’s an automatic loss. One can verify the Odysseus-Siren maze looks like the figure below (right). The probability of winning in two moves with random play is 50% (since only the first move is relevant and there are 2 good moves out of 4)
As an aside, one should note in practical play, game states
in Spider Solitaire are not explicitly labelled “LOSE” since a player can
resign even if there are one or more legal moves. But one can easily tweak
things e.g. by imposing a move limit and saying that a game is automatically
lost if we cannot win in N moves or less. Obviously in the above example it is
impossible to reach an unwinnable state even if we tried. Hence a random move
algorithm is practically certain to win if the move limit was (say) 50 instead
Let us suppose that Joe Bloggs plays a game of Spider Solitaire without undo and loses. He then proceeds to replay the same game but with unlimited 85,78,68,79. Joe Bloggs is able to win (and hence record the identity of every unseen card). JB can then estimate the difficulty of the hand using the random walk method described above. If the probability of winning is 80% then there is a fair chance JB misplayed when playing without undo. If the probability is only 20% then chances are JB was destined to lose even with expert play. This could give Joe Bloggs a better estimate of his playing strength than simple win-rate.
If this post has inspired you to do a Ph. D. based on Spider
Solitaire please leave a comment below!
To start things off, we will kick off with a simple number puzzle:
Traditional expert advice says if you are having a
76,79,83,73,78,71 streak then it’s probably time to take a break from the game and
return when you are in a better frame of mind. Of course every man dog and
millipede on the planet knows the best revenge is PROVING THE GAME IS RIGGED.
But that’s easier said than done if you excuse the cliche.
To prove the game is rigged, people use one of two methods: the Scientific Method or the Internet Forum Method. The IFM is easier to understand, more convenient and more likely to produce the desired results, but today I prefer to discuss the SM.
I’ve seen a forum or two where card players think Microsoft
Hearts is rigged. For those in the dark, Hearts is a trick-taking game – or
more correctly a trick avoidance game where the aim is to avoid taking penalty
cards. The rules will not be discussed in detail and are easily found online.
Some players have accused MS programmers of rigging the cards if they win too
often, to compensate for the poor playing strength of the three AI players (to
be fair Microsoft isn’t the only one accused of cheating, and it is also
possible things have moved on and the forums are seriously out of date).
In the Scientific Method we need statistical hypothesis testing.
I must admit I have never studied statistics formally in my uni days, and I
only began to truly appreciate what it all meant when I was playing Spider
Solitaire – for research purposes.
Roughly speaking, to prove a game is biased we wanna design a test that is
Because Hearts is better known than Spider, I will discuss
the former game. Of course similar considerations would apply to Four-Suit
Let us suppose that Joe Bloggs plays a number of hands of Hearts
against three AI players. A “game” is a series of hands until one player
gets 100 penalty points or more and the winner is the player with fewest
points. A win occurs if and only if Joe Bloggs has the fewest penalty points after
the game. If we ignore ties than Joe Bloggs should win 25% of the time. But
since the AI players are weak, Joe Bloggs expects to win 50% of the time.
However to simplify matters, we assume Joe Bloggs is only interested in
individual hands. Joe gets the following results:
11010 01101 00010 00101
where 1 represents a “good” hand and 0 is a
“bad” hand. Let us assume for the moment that we have a sound
definition of what makes a hand good or bad. JB is careful to define good/bad
hands (which are player-independent) rather than use wins or losses – to remove
the possibility that Joe is falsifying his results, intentionally or otherwise
(e.g. by playing on tilt after a 76,79,83,83).
“Hmmm” … thinks Joe. “I started with 60% good
after 10 hands but the next 10 hands had only 30% good. Therefore something is
But why did JB stop at 20 hands? Did he sign a contract saying
he will play exactly 20 hands in front of three senior witnesses working at
Microsoft, or did he feel like stopping now because that would conveniently corroborate
his gut feel the program is rigged? I guess only Joe Bloggs would know the
answer to that question.
If Joe had stopped at 10 hands then he wouldn’t have noticed any difference between the first and second halves of his results (they would both be 60% good games). If Joe had stopped after 50 hands then maybe his luck would even out and Joe would realise the software is not rigged after all. Or perhaps his luck would deteriorate even further. The important point is of course JB must have pre-determined the number of hands before playing, otherwise his results are invalid. Suppose we define a hand as bad if it is impossible to avoid taking Ukula(*), assuming all opponents gang up on JB and the hand is played double-dummy (for simplicity we ignore the possibility of shooting the moon). In theory this is well-defined but it is not trivial to determine if a given hand allows JB to avoid taking Ukula if played double-dummy. So we need a different definition of good/bad hands.
(*) Ukula is the name of a well-mannered Pitjantjatjara-speaking woman who I befriended many years ago. Other players will probably have a different and more colorful name for the card that is worth 13 penalty points.
Joe Bloggs knows that many of his friends like to complain
about the 2,3,4,5 of hearts being in separate hands. Assuming the Hearts are
not broken, it is mathematically impossible for the 5 of Hearts to win a trick
unless (i) somebody is void or (ii) the 2,3,4,5 are in different hands. The
chances of event (ii) happening is
52/52 * 39/51 * 26/50 * 13/49 = 0.1055
Let us ignore heart voids and say a hand is bad if the
2,3,4,5 of hearts are in different hands, else it is good.
But wait! In Hearts, most of the hands begin by players
passing three unwanted cards to an opponent before play begins. Of course
players will not pass random cards so to simplify things we only consider hands
with no passing. This implies we need to play four times as many games to get
the correct data.
Assuming that the random number generator is truly random we
can say this experiment is “measurable”
in the sense that we can obtain a probability of certain events. For instance,
if we played 2 hands the probability that both are bad is 0.0111 and the probability
that at least one is bad is 0.1999.
What we are actually measuring is called a p-value. Assuming the null hypothesis is true, the probability that we would have observed “our actual result” (or sometimes “at least as bad as our actual result”) is the p-value. If this p-value is less than 0.05 then we say the result is statistically significant at the alpha = 0.05 level. If it was less than 0.01 then it would be statistically significant at the alpha = 0.01 level. Of course alpha must be pre-determined otherwise we are back to the first problem of a test that is not pre-determined.
Our final test would be something like the following:
One final warning: A p-value less than alpha does not implyconclusive evidence. For instance, we may have been very lucky/unlucky and the Random Number Generator gods gave us evidence the game is rigged when in fact it wasn’t rigged after all. But it may enable us to justify further testing – which may then lead to conclusive evidence.
As a chess analogy: suppose Wile. E. Cheetah beats three
grandmasters in consecutive games. The organisers suspect cheating because
there is too much correlation with the moves of top chess engines. They perform
complex calculations in their heads and find p < 0.05. The organisers then
force Wile. E. Cheetah to play the next few rounds without his “lucky EMCA
headphones” (i.e. further testing). Sure enough W. E. C. learns the hard
way that 1 e4 c6 2 d4 d5 3 Nc3 dxe4 4 Nxe4 Nd7 5 Bc4 Ngf6 6 Ng5 e6 7 Qe2 h6??
is not the main line in the Caro-Kann and confesses to everything.
Yes, incidents like these have happened in top-level chess.
Googling examples is left as an exercise for the reader.
So there you have it. To derive the final test, we needed to
have some knowledge of the game itself (Hearts or Spider) and some basic
statistical theory (e.g. hypothesis testing), and we needed to take some care
to make sure our experiment is sound. After all it’s hard to prove something is
rigged if your experiment is itself rigged!
DISCLAIMER: I have not studied stats formally at uni, so I won’t be surprised if someone can explain hypothesis testing much better than I did here. If you aced statistics at University or High School and found reading this was way beneath your dignity then congrats, perhaps you should start writing your own blog and I should be the one learning from you 😊
Congrats on reaching the end of this post and fully comprehending every word! Here is your reward:
We now consider the following question: How can we evaluate a starting position? That is, if you are given an initial game state with 10 exposed cards how do we determine if the chances of winning are good, average or poor? Can we quantify our winning chances as a percentage (e.g. 58%)?
NOTE: evaluating a start position is useful since most
Spider Solitaire implementations allow the player to abandon a game without
counting it as a loss. But if you are serious about improving your game, I
strongly recommend you never abandon games with a poor initial state.
A first thought may be to look for “features” of a game state. For instance suppose we are watching some top quality chess at an unhealthy time of the day. We might notice that
White has an extra pawn
The position is an endgame: both sides want to
activate their king without fear of suddenly being mated.
Black’s rook and king are more active than their
Both sides have vulnerable pawns
Bear in mind we are only identifying individual features at this early stage. Eventually we may wish to formulate an overall assessment by combining these features somehow, but that comes later.
QUESTION: What are plausible features to use in an opening game state in Spider Solitaire?
Avid readers of this blog (yes you!) would immediately identify “guaranteed turnovers” as a possible feature. In the above diagram you should be able to quickly identify 5 turnovers. Of course every man, dog and millipede on the planet knows that building in-suit is even more desirable. In this case we have Q-J in spades and 2-3 in clubs. Therefore we have 2 guaranteed suited turnovers (and hence 3 off-suit turnovers).
Finally we can look at rank multiplicity. All players know that having too much of one rank can be a problem, especially when the adjacent rank is in short supply. You don’t need a Ph. D. in economics to work out things are less than ideal when the Spider Solitaire gods have supplied five Jacks on the opening deal and there is little demand for them. For simplicity let us define the rank multiplicity as the count of the most frequent rank. For instance the above diagram has a rank multiplicity of 2 since we have two Threes/Deuces and no rank appears more than twice. In summary:
We have 5 guaranteed turnovers
We have 2 guaranteed suited turnovers
The rank multiplicity is 2.
There may be other features to consider, but we’ll keep things simple for now.
Are these values
good, bad, or average? It turns out one can use simulation to answer this
question. For instance if I had nothing better to do, I could play 10 million
games of Spider and compute the number of guaranteed turnovers should be 3.97
Of course the lazy solution is to write a computer program to do the simulation for me. The program can simply deal 10 cards, do the relevant calculations and then abandon the game. An even lazier solution is to copy the results from Steve Brown’s excellent book Spider Solitaire Winning Strategies. He got the following results:
Looking at these graphs, I would immediately dismiss rank
multiplicity as a useful feature (the entry for 5 is non-zero but is too small
to be visible). After all more than 90% of games will have a value of 2 or 3!
It is true that one can tweak rank multiplicity somehow (e.g. giving more weight
to Aces and Kings which are the bugbears of most players), but I wanted to keep
things simple for the time being. The important point is these quantities are
easily obtained via simulation.
Suited turnovers is nice, but I think it’s more important to
have many turnovers at the start of the game. In other words, quantity is more
important than quality. In the above example, we have 5 guaranteed turnovers
and 2 suited, both of which are above average. Hence if given a choice, I would
take this position over a random position.
If you are a beginner, I would estimate that:
If you start with exactly 4 guaranteed
turnovers, your chances of winning are average
If more (less) than 4 then your chances are
above (below) average.
Of course if you lose every game at the 4-suit level then
this rule only works when you have exactly 4 turnovers! So perhaps this rule is
better suited to the 2 suit level, if you excuse the lousy pun. As you gain
more experience, you would be able to tweak these guidelines. For instance, you
might think that two suited turnovers is worth an extra non-suited turnover,
That’s it for now. Happy Spider playing, and may all your
builds be in-suit 😊