Hi, wonderful and insightful work!
I attempted to reproduce your results by evaluating your Hugging Face checkpoint with OpenCompass. I set the batch size to 1 and enabled trust_remote_code, along with the other settings you recommended. However, I had trouble matching your reported results, so I suspect there is a discrepancy between my settings and yours. Have you evaluated your checkpoint on OpenCompass or other benchmarks? Or do you have any other experience you could share with us?
Thanks!