
How about using the official evaluation script? #2

@Bearsuny


Hi obryanlouis,

A few days ago, I trained several models, and the results are as follows:

commit b31a8e8ec1897c1eef8e80570cca19ea08b85467

| model name | EM (T) | F1 (T) | EM (L) | F1 (L) |
|---|---|---|---|---|
| conductor-net | 62.48% | 72.35% | 73.24% | 81.93% |
| fusion-net | 67.92% | 77.83% | 75.96% | 83.90% |
| match-lstm | 48.53% | 58.40% | 54.50% | 67.74% |
| mnemonic_reader | 67.32% | 80.63% | 70.95% | 80.14% |

T denotes my trained results, and L denotes the results posted on the SQuAD leaderboard.

It seems that my trained results are roughly 10% lower than the results posted on the SQuAD leaderboard, except for mnemonic_reader. I suspect inappropriate hyperparameters are the reason. Could you share the parameters you used when training the match-lstm and conductor-net models (or the other models)?

I also trained the fusion-net model at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 with the parameters posted in your README.md and got results similar to yours.


Moreover, I noticed that you don't use evaluate-v1.1.py, the evaluation script officially posted by SQuAD. So I wrote transform.py to convert your evaluation output in qa/evaluation_results/predicted_spans.visualization.txt into the official prediction format (see the Sample Prediction File posted by SQuAD). You can run it with the following command:

python transform.py dev-v1.1.json predicted_spans.visualization.txt transform.json
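
For reference, the conversion itself is small. The sketch below is my own rough version, and it assumes predicted_spans.visualization.txt contains one predicted answer per line, in the same order as the questions appear in dev-v1.1.json (the parsing step is an assumption; adjust it if the actual file format differs):

```python
import json
import sys

def main():
    # Usage: python transform.py dev-v1.1.json predicted_spans.visualization.txt transform.json
    dataset_file, prediction_file, output_file = sys.argv[1:4]

    # Collect question ids in dataset order from the official SQuAD dev set.
    with open(dataset_file) as f:
        dataset = json.load(f)["data"]
    question_ids = [qa["id"]
                    for article in dataset
                    for paragraph in article["paragraphs"]
                    for qa in paragraph["qas"]]

    # ASSUMPTION: one predicted answer span per line, in the same order as
    # the questions above. Adapt this parsing to the real file layout.
    with open(prediction_file) as f:
        predicted_answers = [line.rstrip("\n") for line in f]

    if len(question_ids) != len(predicted_answers):
        raise ValueError(f"{len(question_ids)} questions but "
                         f"{len(predicted_answers)} predictions")

    # evaluate-v1.1.py expects a flat {question_id: answer_string} JSON object.
    with open(output_file, "w") as f:
        json.dump(dict(zip(question_ids, predicted_answers)), f)

if __name__ == "__main__":
    main()
```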

My transform.py worked when I used it to transform the results from the first commit mentioned above. After getting transform.json, I used the official command to evaluate the model:

python evaluate-v1.1.py dev-v1.1.json transform.json
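
For anyone who has not looked inside the official script: evaluate-v1.1.py normalizes each answer (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace) and then computes exact match and token-level F1, taking the maximum over all ground-truth answers for each question. A condensed sketch of that logic (not the verbatim script):

```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match_score(prediction, ground_truth):
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```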

I found that all of your evaluation results are about 3% higher than the official evaluation results.

Unfortunately, at checkout 28c18bc0b23381e5c9dfd8ee1834f9e559ae9714 (the second commit), the transformation does not work, because the generated predicted_spans.visualization.txt has 10586 lines. It should have 10570 lines, the same as in the first commit, since dev-v1.1.json contains 10570 questions.

Here are some statistics about dev-v1.1.json:

| topic | paragraph | question |
|---|---|---|
| 48 | 2067 | 10570 |
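
These counts come straight from the structure of the official JSON (data → paragraphs → qas); a minimal sketch to reproduce them:

```python
import json

with open("dev-v1.1.json") as f:
    data = json.load(f)["data"]

num_topics = len(data)
num_paragraphs = sum(len(article["paragraphs"]) for article in data)
num_questions = sum(len(paragraph["qas"])
                    for article in data
                    for paragraph in article["paragraphs"])

# Expected output: 48 2067 10570
print(num_topics, num_paragraphs, num_questions)
```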

So, why did the number of questions change? Would it be possible to use the official evaluation script, or to change the predicted_spans.visualization.txt format back to the original one?

Thank you ~ (ฅ´ω`ฅ) ~
