Self-play reinforcement learning has produced capable game-playing agents in complex domains such as Go and chess. These agents have been evaluated against other state-of-the-art agents and professional human players, and have demonstrated competence surpassing both. But does strong competitive performance also mean that an agent can (weakly or strongly) solve the game, or even approximately solve it? To our knowledge, no existing work has considered this question. We propose aligning the evaluation of self-play agents with the notions of weakly and strongly solving strategies to provide a measure of an agent’s strength. Using small games, we establish a methodology for measuring the strength of a self-play agent and the gap between it and a strongly solving agent, one which plays optimally regardless of an opponent’s decisions. We provide metrics that use ground-truth data from small, solved games to quantify the strength of an agent and its ability to generalize to a domain. We then analyze a self-play agent using scaled-down versions of Chinese checkers.