According to the blog, EvaByte consumed only 0.5T tokens yet matched the performance of models trained on 5x as much data. So the new architecture is at least more data-efficient during training.
Would it be much better if it were trained on the same amount of data?