To start things off, here is a simple number puzzle:

Traditional expert advice says that if you are having a 76,79,83,73,78,71 streak then it’s probably time to take a break from the game and return when you are in a better frame of mind. Of course, every man, dog and millipede on the planet knows the best revenge is PROVING THE GAME IS RIGGED. But that’s easier said than done, if you’ll excuse the cliché.

To prove the game is rigged, people use one of two methods: the Scientific Method or the Internet Forum Method. The IFM is easier to understand, more convenient and more likely to produce the desired results, but today I’d rather discuss the SM.

I’ve seen a forum or two where card players think Microsoft Hearts is rigged. For those in the dark, Hearts is a trick-taking game – or more correctly a trick-avoidance game, where the aim is to avoid taking penalty cards. The rules will not be discussed in detail and are easily found online. Some players have accused MS programmers of rigging the cards against players who win too often, to compensate for the poor playing strength of the three AI players (to be fair, Microsoft isn’t the only one accused of cheating, and it is also possible that things have moved on and the forums are seriously out of date).

The Scientific Method requires statistical hypothesis testing. I must admit I never studied statistics formally in my uni days, and I only began to truly appreciate what it all meant when I was playing Spider Solitaire – for research purposes.

Roughly speaking, to prove a game is biased we want to design a test that is:

(1) **Pre-determined**

(2) **Well-defined**

(3) **Measurable**

Because Hearts is better known than Spider, I will discuss the former. Of course, similar considerations would apply to Four-Suit Spider.

Let us suppose that Joe Bloggs plays a number of hands of Hearts against three AI players. A “game” is a series of hands that continues until one player reaches 100 penalty points or more, and the winner is the player with the fewest points. A win occurs if and only if Joe Bloggs has the fewest penalty points after the game. If we ignore ties, then Joe Bloggs should win 25% of the time. But since the AI players are weak, Joe Bloggs expects to win 50% of the time. However, to simplify matters, we assume Joe Bloggs is only interested in individual hands. Joe gets the following results:

11010 01101 00010 00101

where 1 represents a “good” hand and 0 is a “bad” hand. Let us assume for the moment that we have a sound definition of what makes a hand good or bad. JB is careful to define good/bad hands (which are player-independent) rather than use wins or losses – to remove the possibility that Joe is falsifying his results, intentionally or otherwise (e.g. by playing on tilt after a 76,79,83,83).

“Hmmm” … thinks Joe. “I started with 60% good after 10 hands but the next 10 hands had only 30% good. Therefore something is wrong.”
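Joe’s arithmetic is easy to check. A minimal Python sketch (the variable names are mine):

```python
# Joe's 20 hands: 1 = "good" hand, 0 = "bad" hand
results = "11010 01101 00010 00101".replace(" ", "")
hands = [int(c) for c in results]

first_half, second_half = hands[:10], hands[10:]
print(sum(first_half) / len(first_half))    # 0.6  (60% good in hands 1-10)
print(sum(second_half) / len(second_half))  # 0.3  (30% good in hands 11-20)
```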

But why did JB stop at 20 hands? Did he sign a contract saying he will play exactly 20 hands in front of three senior witnesses working at Microsoft, or did he feel like stopping now because that would conveniently corroborate his gut feel the program is rigged? I guess only Joe Bloggs would know the answer to that question.

If Joe had stopped at 10 hands then he wouldn’t have noticed any difference between the first and second halves of his results (both would be 60% good). If Joe had stopped after 50 hands then maybe his luck would have evened out and Joe would realise the software is not rigged after all. Or perhaps his luck would have deteriorated even further. The important point is, of course, that JB must have **pre-determined** the number of hands before playing, otherwise his results are invalid.

Now suppose we define a hand as bad if it is impossible to avoid taking Ukula(*), assuming all opponents gang up on JB and the hand is played double-dummy (for simplicity we ignore the possibility of shooting the moon). In theory this is **well-defined**, but it is not trivial to determine whether a given hand allows JB to avoid taking Ukula when played double-dummy. So we need a different definition of good/bad hands.

(*) Ukula is the name of a well-mannered Pitjantjatjara-speaking woman who I befriended many years ago. Other players will probably have a different and more colorful name for the card that is worth 13 penalty points.

Joe Bloggs knows that many of his friends like to complain about the 2,3,4,5 of Hearts being in four separate hands. Assuming Hearts are not broken, it is mathematically impossible for the 5 of Hearts to win a trick unless (i) somebody is void in Hearts or (ii) the 2,3,4,5 are in different hands. The chance of event (ii) happening is

52/52 * 39/51 * 26/50 * 13/49 = 0.1055

Let us ignore heart voids and say a hand is bad if the 2,3,4,5 of hearts are in different hands, else it is good.
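For readers who want to verify the 0.1055 figure, here is a small Python sketch: an exact calculation of the dealing probability plus a Monte Carlo cross-check (the helper name and seed are my own choices):

```python
import random
from fractions import Fraction

# Exact calculation: the 2 of Hearts can sit anywhere (52/52); the 3 must
# land in one of the 39 slots belonging to the other three hands, the 4 in
# one of the 26 remaining foreign slots, and the 5 in one of 13.
exact = Fraction(52, 52) * Fraction(39, 51) * Fraction(26, 50) * Fraction(13, 49)
print(float(exact))  # ≈ 0.1055

# Monte Carlo sanity check: deal 52 cards into four hands of 13 and count
# how often cards 0-3 (standing in for the 2,3,4,5 of Hearts) end up in
# four different hands.
def separated_fraction(trials=100_000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        slots = rng.sample(range(52), 52)       # slots[c] = deal position of card c
        if len({slots[c] // 13 for c in range(4)}) == 4:
            hits += 1
    return hits / trials

print(separated_fraction())  # should hover near 0.1055
```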

But wait! In Hearts, most hands begin with players passing three unwanted cards to an opponent before play starts. Of course, players will not pass random cards, so to simplify things we only consider hands with no passing (every fourth hand, under the usual passing rotation). This implies we need to play four times as many hands to collect the required data.

Assuming that the random number generator is truly random, we can say this experiment is “**measurable**”, in the sense that we can obtain the probability of certain events. For instance, if we played 2 hands, the probability that both are bad is 0.0111 and the probability that at least one is bad is 0.1999.

More complex calculations are possible. For instance, if 20 hands are played, then the chance of exactly 5 bad hands is 0.0381 and the chance of at least 5 bad hands is 0.0525.
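These binomial probabilities can be reproduced with a few lines of Python, using the exact dealing probability rather than the rounded 0.1055 (the function name is mine):

```python
from math import comb

P_BAD = 39 * 26 * 13 / (51 * 50 * 49)   # ≈ 0.1055, from the dealing argument above

def binom_pmf(k, n, p=P_BAD):
    """Probability of exactly k bad hands in n independent hands."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two hands:
print(round(binom_pmf(2, 2), 4))        # both bad → 0.0111
print(round(1 - binom_pmf(0, 2), 4))    # at least one bad → 0.1999
# Twenty hands:
print(round(binom_pmf(5, 20), 4))                              # exactly 5 bad ≈ 0.0381
print(round(sum(binom_pmf(k, 20) for k in range(5, 21)), 4))   # at least 5 bad ≈ 0.0525
```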

What we are actually measuring is called a p-value. Assuming the null hypothesis is true, the p-value is the probability of observing a result at least as extreme as the one we actually got. If this p-value is less than 0.05, we say the result is statistically significant at the alpha = 0.05 level; if it is less than 0.01, it is significant at the alpha = 0.01 level. Of course, alpha must be pre-determined, otherwise we are back to the first problem of a test that is not pre-determined.

Our final test would be something like the following: before playing, commit to a fixed number of no-pass hands (say 20) and a significance level (say alpha = 0.05); call a hand bad if the 2,3,4,5 of Hearts are in four different hands; then compute the probability of at least the observed number of bad hands under the null hypothesis that each hand is bad with probability 0.1055, and cry foul only if that p-value is below alpha.
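In Python, such a test might be sketched as follows. The observed count of 5 bad hands in 20 pre-determined hands is an illustrative assumption, not real data:

```python
from math import comb

# All three parameters below are fixed BEFORE any hands are played.
P_BAD = 39 * 26 * 13 / (51 * 50 * 49)   # null-hypothesis chance a hand is bad (≈ 0.1055)
N_HANDS = 20                             # pre-determined number of no-pass hands
ALPHA = 0.05                             # pre-determined significance level

def p_value(bad_observed, n=N_HANDS, p=P_BAD):
    """Upper-tail binomial p-value: the chance of seeing at least
    `bad_observed` bad hands in n hands if the deal is truly random."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(bad_observed, n + 1))

# Illustrative outcome: Joe sees 5 bad hands out of his 20.
pv = p_value(5)
print(f"p-value = {pv:.4f}")             # ≈ 0.0525
if pv < ALPHA:
    print("Statistically significant - worth further testing")
else:
    print("Not significant - no evidence the game is rigged")
```

Note that with exactly these numbers the p-value of roughly 0.0525 falls just short of alpha = 0.05, so Joe would have to grumble and keep playing fairly.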

One final warning: A p-value less than alpha does **not imply** **conclusive evidence**. For instance, we may have been very lucky/unlucky and the Random Number Generator gods gave us evidence the game is rigged when in fact it wasn’t rigged after all. But it may enable us to justify further testing – which may then lead to conclusive evidence.

As a chess analogy: suppose Wile E. Cheetah beats three grandmasters in consecutive games. The organisers suspect cheating because there is too much correlation with the moves of top chess engines. They perform complex calculations in their heads and find p < 0.05. The organisers then force Wile E. Cheetah to play the next few rounds without his “lucky EMCA headphones” (i.e. further testing). Sure enough, W. E. C. learns the hard way that 1 e4 c6 2 d4 d5 3 Nc3 dxe4 4 Nxe4 Nd7 5 Bc4 Ngf6 6 Ng5 e6 7 Qe2 h6?? is not the main line in the Caro-Kann and confesses to everything.

Yes, incidents like these have happened in top-level chess. Googling examples is left as an exercise for the reader.

So there you have it. To derive the final test, we needed to have some knowledge of the game itself (Hearts or Spider) and some basic statistical theory (e.g. hypothesis testing), and we needed to take some care to make sure our experiment is sound. After all it’s hard to prove something is rigged if your experiment is itself rigged!

**DISCLAIMER:** I have not studied stats formally at uni, so I won’t be surprised if someone can explain hypothesis testing much better than I did here. If you aced statistics at University or High School and found reading this was way beneath your dignity then congrats, perhaps you should start writing your own blog and I should be the one learning from you 😊

Congrats on reaching the end of this post and fully comprehending every word! Here is your reward: