bootstrap sampling, (single or paired possible) #78
JJumSSu wants to merge 1 commit into mjpost:master from JJumSSu:bootstrap
|
Once #88 is merged, I will take a look at this. I'm adding collaborators so there will no longer be a bottleneck on me. |
|
No problem! Take your time :) |
martinpopel left a comment
I haven't reviewed all the code properly. Just a few comments.
Also, would it be possible to move most of the code into a separate file to keep the main sacrebleu.py at a reasonable size?
(If the answer is "not easily", we can leave it for future PRs, I think.)
    :param force: Ignore data that looks already tokenized
    :param lowercase: Lowercase the data
    :param tokenize: The tokenizer to use
    :param bootstrap_trials=1: number of trials for bootstrap resampling
Please also document paired_significance_test and significance_value.
    else:
        print("System1 is superior with p-value {}".format(round(1-sys1_win, 3)))

    return orig_bleu
It would be nice if the bootstrap statistics (e.g. the p-value) were available through the API, i.e. returned in the BLEU object. However, this is just a suggestion: I haven't thought about it much, and it could be done in a separate PR.
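A minimal sketch of what returning the statistics (rather than printing them) could look like. The class names `BootstrapStats` and `ScoreWithStats` are hypothetical and are not part of sacreBLEU's API; they only illustrate the shape of the idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BootstrapStats:
    """Hypothetical container for bootstrap results (not a sacreBLEU class)."""
    trials: int
    ci_low: float
    ci_high: float
    p_value: Optional[float] = None  # only meaningful for paired tests


@dataclass(frozen=True)
class ScoreWithStats:
    """Hypothetical score object carrying optional bootstrap statistics."""
    score: float
    bootstrap: Optional[BootstrapStats] = None

    def format(self) -> str:
        # Build the printable form from the stored values, so callers
        # can either print it or read the fields programmatically.
        text = f"score = {self.score:.1f}"
        if self.bootstrap is not None:
            b = self.bootstrap
            text += f" (CI [{b.ci_low:.1f}, {b.ci_high:.1f}], {b.trials} trials)"
            if b.p_value is not None:
                text += f", p = {b.p_value:.3f}"
        return text
```

With something like this, command-line output and API access would share one source of truth, and the printing could stay in the CLI layer.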
|
@JJumSSu: Thank you very much for this PR. Can you please rebase on the current master (or merge in) and resolve the conflicts? |
|
@martinpopel: Thank you for your thoughtful comment! |
|
I have been working my way through several papers, Wikipedia articles and code written for this purpose. I think this way of computing CI using the t-statistic (i.e. The t-statistic method above always yields a CI offset of So I need some guidance here to implement this bootstrap resampling and CI functionality correctly. In the meantime I'll replicate the behavior of the Moses script to be on the safe side. (I don't know if mentioning people outside the contributors actually works but let's see.) |
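As general background on why the choice of interval matters, here is a minimal sketch, assuming the bootstrap scores are already available as a plain list. The function names and the synthetic Gaussian scores are illustrative only; they are not taken from this PR or from the Moses script. An interval built from the standard error of the bootstrap mean narrows as more resamples are drawn, whereas a percentile interval reflects the spread of the resampled scores themselves.

```python
import random
import statistics


def se_of_mean_offset(scores, z=1.96):
    """Half-width of a normal-approximation interval around the bootstrap mean.

    It divides by sqrt(n), so it shrinks as the number of resamples grows.
    """
    return z * statistics.stdev(scores) / len(scores) ** 0.5


def percentile_interval(scores, alpha=0.05):
    """Percentile interval: drop alpha/2 of the sorted scores from each tail."""
    ordered = sorted(scores)
    lo = ordered[int(len(ordered) * alpha / 2)]
    hi = ordered[int(len(ordered) * (1 - alpha / 2)) - 1]
    return lo, hi


rng = random.Random(0)
for n in (100, 10000):
    # Stand-in for n bootstrap scores (synthetic, not real system output).
    scores = [rng.gauss(30.0, 1.0) for _ in range(n)]
    lo, hi = percentile_interval(scores)
    print(n, round(se_of_mean_offset(scores), 3), round(hi - lo, 2))
```

In this toy setup the first printed offset drops by roughly a factor of ten between the two runs, while the percentile width stays about the same.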
Hello, I implemented a significance test based on bootstrap resampling, as requested in issue #70.
I referenced the code from https://github.com/mjpost/sacrebleu/pull/11/files and https://github.com/neubig/util-scripts/blob/master/paired-bootstrap.py.
Confidence intervals and significance tests can now be computed for a single system or for a pair of systems.
(paired systems)
(single system)
system1 is the prediction file given as the first input and system2 is the second; the two are joined line by line, delimited by a tab (e.g. paste system1 system2 | sacrebleu ... ).
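The paired procedure described above can be sketched roughly as follows. This is not the code from this PR: `split_paired`, `toy_metric` and `paired_bootstrap` are hypothetical names, and the exact-match metric is a placeholder where a real run would compute corpus-level BLEU. The key point is that both systems are scored on the same resampled sentence indices in every trial.

```python
import random


def split_paired(lines):
    """Split 'system1<TAB>system2' lines into two lists of hypotheses."""
    sys1, sys2 = [], []
    for line in lines:
        first, second = line.rstrip("\n").split("\t")
        sys1.append(first)
        sys2.append(second)
    return sys1, sys2


def toy_metric(hyps, refs):
    """Placeholder corpus metric: fraction of exact sentence matches."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap(sys1, sys2, refs, trials=1000, seed=0, metric=toy_metric):
    """Return the fraction of resamples won by system1, by system2, or tied."""
    rng = random.Random(seed)
    n = len(refs)
    wins1 = wins2 = ties = 0
    for _ in range(trials):
        # Draw sentence indices with replacement; reuse them for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        r = [refs[i] for i in idx]
        s1 = metric([sys1[i] for i in idx], r)
        s2 = metric([sys2[i] for i in idx], r)
        if s1 > s2:
            wins1 += 1
        elif s2 > s1:
            wins2 += 1
        else:
            ties += 1
    return wins1 / trials, wins2 / trials, ties / trials
```

Going by the message printed in the diff excerpt, one minus the win fraction of the better system would then be reported as the p-value.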
Also, when a single system is provided (the standard usage) with bootstrap-trials greater than 1, the original sacreBLEU score and the confidence intervals are printed as output.
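For the single-system case, a minimal sketch under the same caveats: `bootstrap_ci` and `toy_metric` are hypothetical names, the metric is a placeholder for corpus-level BLEU, and a percentile interval is assumed here, which may differ from what this PR computes.

```python
import random


def toy_metric(hyps, refs):
    """Placeholder corpus metric: fraction of exact sentence matches."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def bootstrap_ci(hyps, refs, trials=1000, alpha=0.05, seed=0, metric=toy_metric):
    """Return (original score, lower bound, upper bound) for one system."""
    rng = random.Random(seed)
    n = len(refs)
    # The score on the full, unresampled test set is reported unchanged.
    orig = metric(hyps, refs)
    scores = []
    for _ in range(trials):
        # Resample sentences with replacement and rescore the whole sample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([hyps[i] for i in idx], [refs[i] for i in idx]))
    scores.sort()
    lo = scores[int(trials * alpha / 2)]
    hi = scores[int(trials * (1 - alpha / 2)) - 1]
    return orig, lo, hi
```

Note that the metric is recomputed on each resampled corpus rather than averaged over sentence-level scores, which matters for corpus-level metrics such as BLEU.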