Skip to content

bootstrap sampling, (single or paired possible)#78

Closed
JJumSSu wants to merge 1 commit intomjpost:masterfrom
JJumSSu:bootstrap
Closed

bootstrap sampling, (single or paired possible)#78
JJumSSu wants to merge 1 commit intomjpost:masterfrom
JJumSSu:bootstrap

Conversation

@JJumSSu
Copy link
Copy Markdown

@JJumSSu JJumSSu commented May 5, 2020

Hello, I implemented the feature of a significance test based on bootstrapping sampling from this issue (#70).

I referenced the code from '''https://github.com/mjpost/sacrebleu/pull/11/files''' and '''https://github.com/neubig/util-scripts/blob/master/paired-bootstrap.py'''.

Single or paired systems' confidence intervals and significance tests can now be conducted.

(paired systems)

image

(single system)

image

system1 represents the prediction file of the first input, system2 represents the second input delimited by a tab. (e.g. paste system1 system2 | sacrebleu ... )

Also when a single system is provided(standard version) with bootstrap-trials more than 1, the original sacreBLEU score and the confidence intervals are provided as an output.

@mjpost
Copy link
Copy Markdown
Owner

mjpost commented Jul 1, 2020

Once #88 is merged, I will take a look at this. I'm adding collaborators so there will no longer be a bottleneck on me.

@JJumSSu
Copy link
Copy Markdown
Author

JJumSSu commented Jul 2, 2020

No problem! Take your time :)

Copy link
Copy Markdown
Collaborator

@martinpopel martinpopel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't reviewed all the code properly. Just a few comments.
Also, would it be possible to move most of the code into a separate file to keep the main sacrebleu.py of a reasonable size?
(If the the answer is "not easily", we can leave it for future PRs, I think.)

:param force: Ignore data that looks already tokenized
:param lowercase: Lowercase the data
:param tokenize: The tokenizer to use
:param bootstrap_trials=1: number of trials for bootstrap resampling
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

document also paired_significance_test and significance_value

else:
print("System1 is superior with p-value {}".format(round(1-sys1_win, 3)))

return orig_bleu
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice if the bootstrap statistics (e.g. p-value) are available through the API, i.e. returned in the BLEU object. However, this is just a suggestion - I haven't thought about it much and it could be done in a separate PR.

@martinpopel
Copy link
Copy Markdown
Collaborator

@JJumSSu: Thank you very much for this PR. Can you please rebase on the current master (or merge in) and resolve the conflicts?

@JJumSSu
Copy link
Copy Markdown
Author

JJumSSu commented Aug 12, 2020

@martinpopel : Thank you for your thoughtful comment!
May I take a look at it next week?
I'm currently in the middle of doing something else.

@martinpopel martinpopel mentioned this pull request Nov 27, 2020
@ozancaglayan
Copy link
Copy Markdown
Collaborator

ozancaglayan commented Mar 7, 2021

Working my way through several papers, wikipedia articles and codes written for this purpose. I think this way of computing CI using the t-statistic (i.e. 1.96 * (stdev / sqrt(n_bootstrap_samples))) may not be the correct way of doing this. This is what is explained in Philipp Koehn's (@phikoehn) paper (section 4): Unfortunately, this method to compute confidence intervals does not work for the BLEU metric, since the BLEU metric is not the mean of single sentence scores
Moreover, neither the Moses' script nor the @neubig 's code above (also his compare-mt code) applies this technique but instead sorts the scores and picks 25th and 975th elements if n_samples == 1000.

The t-statistic method above always yields a CI offset of ~0.02 for BLEU scores between [0, 100].

So I need some guidance here to implement this bootstrap resampling and CI functionality correctly. In the meantime I'll replicate Moses' script's behavior to be on the safe side.

(I don't know if mentioning people outside the contributors actually works but let's see.)

ozancaglayan added a commit that referenced this pull request Mar 26, 2021
@ozancaglayan ozancaglayan mentioned this pull request Mar 26, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants