Hi obryanlouis,
A few days ago, I trained several models, and the results are as follows:
commit b31a8e8ec1897c1eef8e80570cca19ea08b85467
| model name | EM(T) | F1(T) | EM(L) | F1(L) |
| --- | --- | --- | --- | --- |
| conductor-net | 62.48% | 72.35% | 73.24% | 81.93% |
| fusion-net | 67.92% | 77.83% | 75.96% | 83.90% |
| match-lstm | 48.53% | 58.40% | 54.50% | 67.74% |
| mnemonic_reader | 67.32% | 80.63% | 70.95% | 80.14% |
(T) means my trained results, and (L) means the results posted on the SQuAD leaderboard.
It seems that the trained results are about 10% lower than the results posted on the SQuAD leaderboard, except for mnemonic_reader. I suspect that inappropriate parameters are the reason. Could you share the parameters you used when training the match-lstm model, the conductor-net model, and the other models?
I also trained the fusion-net model at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 with the parameters posted in README.md and got results similar to yours.
Moreover, I noticed that you didn't use the official evaluate-v1.1.py posted by SQuAD for evaluation. So I wrote transform.py to convert your evaluation results in qa/evaluation_results/predicted_spans.visualization.txt into the official prediction-file format (the Sample Prediction File posted by SQuAD). You can run it with the following command:
python transform.py dev-v1.1.json predicted_spans.visualization.txt transform.json
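For reference, here is roughly what transform.py does (a minimal sketch; it assumes each non-empty line of predicted_spans.visualization.txt holds a question id and the predicted answer separated by a tab, so the parsing may need to be adapted to the actual layout of that file):

```python
import json
import sys


def main(dev_path, pred_path, out_path):
    # Official SQuAD prediction format: a flat JSON object {question_id: answer_text}.
    predictions = {}
    with open(pred_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            qid, answer = line.split("\t", 1)  # assumed "id<TAB>answer" layout
            predictions[qid] = answer

    # Sanity check: every predicted id should appear in dev-v1.1.json.
    with open(dev_path, encoding="utf-8") as f:
        data = json.load(f)["data"]
    dev_ids = {qa["id"]
               for article in data
               for paragraph in article["paragraphs"]
               for qa in paragraph["qas"]}
    missing = set(predictions) - dev_ids
    if missing:
        print("warning: {} predicted ids not found in the dev set".format(len(missing)))

    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False)


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```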
I found that my transform.py worked when transforming the results from the first commit mentioned above. After getting the transformed prediction file (transform.json), I used the official command to evaluate the model:
python evaluate-v1.1.py dev-v1.1.json transform.json
I found that all of your evaluation results are about 3% higher than the official evaluation results.
Unfortunately, at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 (the second commit), this doesn't work, because the generated predicted_spans.visualization.txt has 10586 lines. It should have 10570 lines, the same as in the first commit, because there are 10570 questions in dev-v1.1.json.
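To see where the extra lines come from, a quick check like the following can help (again a sketch, under the same assumed one-record-per-line layout of predicted_spans.visualization.txt):

```python
import json
from collections import Counter

# Collect all question ids from the official dev set.
with open("dev-v1.1.json", encoding="utf-8") as f:
    data = json.load(f)["data"]
dev_ids = {qa["id"]
           for article in data
           for paragraph in article["paragraphs"]
           for qa in paragraph["qas"]}

# Collect the ids that appear in the generated prediction file
# (assumed "id<TAB>answer" layout, one record per line).
with open("predicted_spans.visualization.txt", encoding="utf-8") as f:
    pred_ids = [line.split("\t", 1)[0] for line in f if line.strip()]

counts = Counter(pred_ids)
print("lines:", len(pred_ids), "unique ids:", len(counts))
print("ids appearing more than once:", [i for i, c in counts.items() if c > 1])
print("ids not in dev-v1.1.json:", [i for i in counts if i not in dev_ids])
```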
Here are some statistics about dev-v1.1.json:

| topic | paragraph | question |
| --- | --- | --- |
| 48 | 2067 | 10570 |
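These numbers can be reproduced with a short script over the standard SQuAD v1.1 JSON layout, for example:

```python
import json

with open("dev-v1.1.json", encoding="utf-8") as f:
    data = json.load(f)["data"]

topics = len(data)
paragraphs = sum(len(article["paragraphs"]) for article in data)
questions = sum(len(paragraph["qas"])
                for article in data
                for paragraph in article["paragraphs"])

print(topics, paragraphs, questions)  # 48 2067 10570
```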
So, why did the number of questions change? Would it be possible to use the official evaluation script, or to change predicted_spans.visualization.txt back to its original format?
Thank you ~ (ฅ´ω`ฅ) ~