Hello,
We are running evaluations for the E3 Cadets dataset and have encountered a discrepancy between the paper's results and our own self-trained model.
The provided pre-trained model reproduces the F1-Score reported in the paper (0.9701), which gives us confidence that our evaluation setup is correct.
However, our self-trained model performs significantly worse: the F1-Score drops from 0.9701 to 0.8972, a relative difference of -7.51%. We generally consider a difference of up to 2% acceptable, but this gap is well beyond that.
OUR RESULTS
| Model | Precision | Recall | F1-Score | % F1 Diff (from Paper) |
| --- | --- | --- | --- | --- |
| Paper (Baseline) | 0.9440 | 0.9977 | 0.9701 | N/A |
| Pre-trained | 0.9441 | 0.9977 | 0.9701 | 0.00% |
| Own-trained | 0.8151 | 0.9977 | 0.8972 | -7.51% |
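For reference, the F1 and percentage figures in the table can be recomputed from the reported precision and recall. The snippet below is only a minimal sketch of that arithmetic: `f1_score` and `relative_diff_pct` are hypothetical helper names introduced here, not functions from the paper's codebase, and the inputs are the rounded values from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_diff_pct(value: float, baseline: float) -> float:
    """Relative difference of `value` from `baseline`, in percent."""
    return (value - baseline) / baseline * 100


# Rounded precision/recall values copied from the table (assumed inputs).
paper_f1 = f1_score(0.9440, 0.9977)
own_f1 = f1_score(0.8151, 0.9977)

print(f"Paper F1:       {paper_f1:.4f}")
print(f"Own-trained F1: {own_f1:.4f}")
print(f"Relative diff:  {relative_diff_pct(0.8972, 0.9701):.2f}%")
```

The percentage is relative to the paper's F1, i.e. (0.8972 - 0.9701) / 0.9701, not an absolute difference in F1 points.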
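Since the pre-trained model matches the paper's results, the discrepancy seems to lie in the training process itself. Recall is unchanged (0.9977), so the entire drop comes from precision (0.9440 to 0.8151). Could you offer any guidance on why our training run differs this much?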