344 is actually not enough for effects of this size. A 95% confidence interval for the win rate (assuming the “true value” is reasonably close to 50%) is about 2/sqrt(n) wide, using repeated Bernoulli trials as the underlying model: the full width is 2·1.96·sqrt(p(1−p)/n), which at p = 0.5 is roughly 2/sqrt(n). For this sample size that's over 10%, which in isolation is terrible. People in general tend to underestimate variance and overestimate how good a sample size is, so it’s important to at least come up with some sort of statistical basis behind claims that a sample seems “large enough.”
You can definitely argue that in this case we have other (much more statistically significant!) supporting data, such as extrapolation from win rates and trends in other Elos, which will strongly affect our priors on the topic in question (so perhaps a 95% interval is overkill, and we’d be satisfied with a much weaker level). But taken by itself, a sample size of 344 is not nearly enough to measure effects whose size is <5%.
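The back-of-the-envelope calculation above can be sketched in a few lines (the function name `ci_width` is just an illustration; this is the standard normal approximation to a binomial proportion, not anything specific to this dataset):

```python
import math

def ci_width(n, p=0.5, z=1.96):
    """Approximate full width of a 95% confidence interval for a
    Bernoulli proportion, using the normal approximation."""
    return 2 * z * math.sqrt(p * (1 - p) / n)

print(f"n = 344:  {ci_width(344):.1%} wide")   # about 10.6%, i.e. over 10%
```

Since sqrt(p(1−p)) is maximized at p = 0.5, this is the worst case; the interval only gets narrower if the true win rate is further from 50%.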
Thank you for this. It's been so long since I did any real hypothesis testing that I've forgotten all rules of thumb for confidence intervals and such.
Intuitively, I really tried to only use samples with at least 1,000 games. At that size the 95% interval is about 6% wide (roughly ±3% around the observed win rate), which is somewhat reasonable as long as you're not trying to read small movements as telling.