I've just had a concrete example of this effect.
Test 1 https://tests.stockfishchess.org/tests/view/662db69c6115ff6764c8065a
This failed, quite surprisingly, in 12k games with more than 1% timelosses across the whole test, and with some tasks reaching nearly 10%.
By contrast, Test A https://tests.stockfishchess.org/tests/view/662db6b36115ff6764c80667 had passed in 50k games, and several other tests in the family had shown that both Tests 1 and A should have passed or failed together.
On that basis I rescheduled Test 1 as Test 2 after the timelossing workers had been fixed: https://tests.stockfishchess.org/tests/view/662edf05e1ff56336c0223d1
This time it went as I expected: it passed (considerably more slowly than I'd hoped, but it passed), a stark contrast to failing in 12k games.
Is the difference in these two runs attributable to luck? Yes, certainly, a large portion of it is luck. However, a large part is also clearly the direct impact of timelosses; for example, the WW/LL rate in the bad test was much, much higher than in the reschedule, or indeed in STCs generally. That can only be attributed directly to the server accepting bad data.
Please note that last year I submitted a PR (#1571) to (among other things) tighten the acceptable data quality, but it was entirely ignored. That PR is now out of date, but as these tests demonstrate, the issue is clearly still a problem on the server, and it should be mitigated as soon as possible.
Ideally the server should be able to reject only the affected game pairs, but rejecting tasks with more than 3-5 timelosses would be a good place to start (and ideally such rejection should occur even before a manual or auto purge).
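As a rough illustration of the proposed check, a minimal sketch of a task-level gate might look like the following. Note this is purely hypothetical: the threshold value, the `time_losses` field name, and the function itself are assumptions for illustration, not fishtest's actual schema or API.

```python
# Hypothetical sketch: discard a task's results outright when its reported
# timeloss count exceeds a small fixed threshold, before any purge logic runs.
# The field name "time_losses" and the threshold are illustrative assumptions.

TIMELOSS_REJECT_THRESHOLD = 3  # tasks with more than this many timelosses are rejected


def should_reject_task(task_stats: dict) -> bool:
    """Return True if the task's data should be rejected as low quality."""
    return task_stats.get("time_losses", 0) > TIMELOSS_REJECT_THRESHOLD


# A task reporting 7 timelosses would be rejected; one reporting 2 would not.
assert should_reject_task({"time_losses": 7, "games": 500})
assert not should_reject_task({"time_losses": 2, "games": 500})
```

Per-gamepair rejection would be finer-grained, but a blunt per-task cutoff like this is simple to apply server-side at result-submission time.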
Given the impact of this issue, I'd recommend that anyone who ran STCs while the bad data was being accepted by the server (around a day or so before this issue's creation; see the bad test's timestamps) consider rerunning those tests, as the results may differ without the bad data.