
How about using the official evaluation script? #2

@Bearsuny


Hi obryanlouis,

A few days ago, I trained several models, and the results are as follows:

commit b31a8e8ec1897c1eef8e80570cca19ea08b85467

| model name | EM (T) | F1 (T) | EM (L) | F1 (L) |
|---|---|---|---|---|
| conductor-net | 62.48% | 72.35% | 73.24% | 81.93% |
| fusion-net | 67.92% | 77.83% | 75.96% | 83.90% |
| match-lstm | 48.53% | 58.40% | 54.50% | 67.74% |
| mnemonic_reader | 67.32% | 80.63% | 70.95% | 80.14% |

T denotes my trained results, and L denotes the results posted on the SQuAD leaderboard.

It seems that my trained results are roughly 10% lower than the results posted on the SQuAD leaderboard, except for mnemonic_reader. I suspect inappropriate hyperparameters are the reason. Could you share the parameters you used when training the match-lstm and conductor-net models (or the other models)?

I also trained the fusion-net model at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 with the parameters posted in your README.md and got results similar to yours.


Moreover, I noticed that you don't use evaluate-v1.1.py, the evaluation script officially posted by SQuAD. So I wrote transform.py to convert your evaluation output in qa/evaluation_results/predicted_spans.visualization.txt into the official prediction format (see the Sample Prediction File posted by SQuAD). You can run it with the following command:

python transform.py dev-v1.1.json predicted_spans.visualization.txt transform.json
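
For reference, the conversion itself is small. The sketch below is my own rough version, and it assumes predicted_spans.visualization.txt contains one predicted answer per line, in the same order as the questions appear in dev-v1.1.json (the parsing step is an assumption; adjust it if the actual file format differs):

```python
import json
import sys

def main():
    # Usage: python transform.py dev-v1.1.json predicted_spans.visualization.txt transform.json
    dataset_file, prediction_file, output_file = sys.argv[1:4]

    # Collect question ids in dataset order from the official SQuAD dev set.
    with open(dataset_file) as f:
        dataset = json.load(f)["data"]
    question_ids = [qa["id"]
                    for article in dataset
                    for paragraph in article["paragraphs"]
                    for qa in paragraph["qas"]]

    # ASSUMPTION: one predicted answer span per line, in the same order as
    # the questions above. Adapt this parsing to the real file layout.
    with open(prediction_file) as f:
        predicted_answers = [line.rstrip("\n") for line in f]

    if len(question_ids) != len(predicted_answers):
        raise ValueError(f"{len(question_ids)} questions but "
                         f"{len(predicted_answers)} predictions")

    # evaluate-v1.1.py expects a flat {question_id: answer_string} JSON object.
    with open(output_file, "w") as f:
        json.dump(dict(zip(question_ids, predicted_answers)), f)

if __name__ == "__main__":
    main()
```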

My transform.py worked when I used it to transform the results from the first commit mentioned above. After getting transform.json, I used the official command to evaluate the model:

python evaluate-v1.1.py dev-v1.1.json transform.json
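
For anyone who has not looked inside the official script: evaluate-v1.1.py normalizes each answer (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace) and then computes exact match and token-level F1, taking the maximum over all ground-truth answers for each question. A condensed sketch of that logic (not the verbatim script):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```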

I found that all of your evaluation results are about 3% higher than the official evaluation results.

Unfortunately, at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 (the second commit), the transformation does not work, because the generated predicted_spans.visualization.txt has 10586 lines. It should have 10570 lines, the same as in the first commit, since dev-v1.1.json contains 10570 questions.

Here are some statistics about dev-v1.1.json:

| topic | paragraph | question |
|---|---|---|
| 48 | 2067 | 10570 |
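
These counts come straight from the structure of the official JSON (data → paragraphs → qas); a minimal sketch to reproduce them:

```python
import json

with open("dev-v1.1.json") as f:
    data = json.load(f)["data"]

num_topics = len(data)
num_paragraphs = sum(len(article["paragraphs"]) for article in data)
num_questions = sum(len(paragraph["qas"])
                    for article in data
                    for paragraph in article["paragraphs"])

# Expected output: 48 2067 10570
print(num_topics, num_paragraphs, num_questions)
```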

So, why did the number of questions change? Would it be possible to use the official evaluation script, or to change the predicted_spans.visualization.txt format back to the original one?

Thank you ~ (ฅ´ω`ฅ) ~
