Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv. Benchmarked on Mar 25, 2024.
Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
---|---|---|---|---|---|
gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
gemini-1.5-pro-latest* | FAIL | - | - | - | - |
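
The percentage columns appear to be simple rates over the 1000 puzzles. The sketch below recomputes them for a few rows, assuming each percentage is the count divided by 1000 and that an illegal move is counted at most once per puzzle. The names `results` and `rate` are illustrative and do not come from the benchmark code. The Adjusted Elo column is not reproduced here because its formula is not given.

```python
# Counts copied from the table: (solved, illegal moves) per model.
# Assumption: every model attempted the same 1000 puzzles.
TOTAL_PUZZLES = 1000

results = {
    "gpt-4-turbo-preview": (229, 163),
    "gpt-4": (195, 183),
    "claude-3-opus-20240229": (72, 464),
    "gpt-3.5-turbo": (23, 683),
    "mixtral-8x7b": (9, 832),
}


def rate(count, total=TOTAL_PUZZLES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for model, (solved, illegal) in results.items():
    print(f"{model}: solved {rate(solved)}%, illegal moves {rate(illegal)}%")
```

Reading the two rates side by side makes the trade-off in the table easier to see: the models that solve the most puzzles are also the ones that attempt the fewest illegal moves.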
Published by the CEO of Kagi!
If I tried to make an illegal move 20% of the time, would you also say I am good at chess?
Depends on circumstances, obviously.
Okay. What if the circumstance is that I’m just recalling a bunch of chess puzzle solutions I’ve seen before and regurgitating the one I think is correct for this particular puzzle, without really understanding the rules of chess?
That’s another thing I’m wondering about, but so is everyone. I’d still want to know why GPT-4 does so much better than the others.