bootstrap sampling, (single or paired possible) by JJumSSu · Pull Request #78 · mjpost/sacrebleu

JJumSSu · 2020-05-05T09:13:35Z

Hello, I implemented a significance test based on bootstrap resampling, as requested in issue #70.

I referenced the code from https://github.com/mjpost/sacrebleu/pull/11/files and https://github.com/neubig/util-scripts/blob/master/paired-bootstrap.py.

Confidence intervals and significance tests can now be computed for a single system or for a pair of systems.

(paired systems)

(single system)

For the paired case, the two systems are read as tab-delimited input: system1 is the prediction file of the first system and system2 that of the second (e.g. paste system1 system2 | sacrebleu ... ).

Also, when a single system is provided (the standard usage) with bootstrap-trials greater than 1, the output contains the original sacreBLEU score together with its confidence interval.
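To make the paired procedure concrete, here is a minimal sketch of paired bootstrap resampling. It is not the code from this PR: `paired_bootstrap` and `corpus_accuracy` are hypothetical names, and the toy exact-match metric only stands in for corpus-level BLEU so that the example runs without sacrebleu installed.

```python
import random


def corpus_accuracy(hyps, refs):
    """Toy corpus-level metric (stand-in for BLEU): percentage of exact matches."""
    return 100.0 * sum(h == r for h, r in zip(hyps, refs)) / len(refs)


def paired_bootstrap(sys1, sys2, refs, metric, n_trials=1000, seed=12345):
    """Resample sentence indices with replacement, score both systems on the
    same resampled set, and return the fraction of trials each system wins."""
    rng = random.Random(seed)
    n = len(refs)
    wins1 = wins2 = 0
    for _ in range(n_trials):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score1 = metric([sys1[i] for i in idx], sample_refs)
        score2 = metric([sys2[i] for i in idx], sample_refs)
        if score1 > score2:
            wins1 += 1
        elif score2 > score1:
            wins2 += 1
    return wins1 / n_trials, wins2 / n_trials


# Made-up data: system 1 copies the references, system 2 is always wrong.
refs = ["a b c", "d e f", "g h i", "j k l"]
sys1 = list(refs)
sys2 = ["x"] * len(refs)

sys1_win, sys2_win = paired_bootstrap(sys1, sys2, refs, corpus_accuracy)
print("System1 win rate:", sys1_win, "p-value estimate:", round(1 - sys1_win, 3))
```

The important detail is that both systems are scored on the same resampled indices in every trial, which is what makes the test paired. Ties are counted for neither system here; that is a choice made for this sketch.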

mjpost · 2020-07-01T16:08:52Z

Once #88 is merged, I will take a look at this. I'm adding collaborators so there will no longer be a bottleneck on me.

JJumSSu · 2020-07-02T06:49:36Z

No problem! Take your time :)

martinpopel

I haven't reviewed all the code properly, so these are just a few comments.
Also, would it be possible to move most of the code into a separate file, to keep the main sacrebleu.py at a reasonable size?
(If the answer is "not easily", we can leave it for future PRs, I think.)

martinpopel · 2020-08-08T16:26:59Z

sacrebleu/sacrebleu.py

    :param force: Ignore data that looks already tokenized
    :param lowercase: Lowercase the data
    :param tokenize: The tokenizer to use
+    :param bootstrap_trials=1: number of trials for bootstrap resampling


Please also document paired_significance_test and significance_value.

martinpopel · 2020-08-08T16:38:51Z

sacrebleu/sacrebleu.py

+            else:
+                print("System1 is superior with p-value {}".format(round(1-sys1_win, 3)))
+
+    return orig_bleu


It would be nice if the bootstrap statistics (e.g. the p-value) were available through the API, i.e. returned in the BLEU object. However, this is just a suggestion: I haven't thought about it much, and it could be done in a separate PR.
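One possible shape for such a return value is sketched below. Everything in it is an assumption made for illustration: `BootstrapStats` and its fields are hypothetical, they are not part of the sacrebleu API, and the numbers are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootstrapStats:
    """Hypothetical container for bootstrap results; not a sacrebleu class."""
    score: float                     # score on the original test set
    ci_low: float                    # lower bound of the confidence interval
    ci_high: float                   # upper bound of the confidence interval
    n_trials: int                    # number of bootstrap resamples
    p_value: Optional[float] = None  # only meaningful for paired tests

    def summary(self) -> str:
        text = (f"{self.score:.1f} [{self.ci_low:.1f}, {self.ci_high:.1f}] "
                f"({self.n_trials} trials)")
        if self.p_value is not None:
            text += f", p = {self.p_value:.3f}"
        return text


# Made-up numbers, only to show the shape of the object.
stats = BootstrapStats(score=27.3, ci_low=26.1, ci_high=28.4,
                       n_trials=1000, p_value=0.012)
print(stats.summary())
```

Returning a structured object instead of printing would let callers read the p-value programmatically, which is the point of the suggestion above.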

martinpopel · 2020-08-08T16:43:11Z

@JJumSSu: Thank you very much for this PR. Can you please rebase on the current master (or merge in) and resolve the conflicts?

JJumSSu · 2020-08-12T09:12:01Z

@martinpopel : Thank you for your thoughtful comment!
May I take a look at it next week?
I'm currently in the middle of doing something else.

ozancaglayan · 2021-03-07T18:34:55Z

I am working my way through several papers, Wikipedia articles and code written for this purpose. I think computing the CI with the t-statistic (i.e. 1.96 * (stdev / sqrt(n_bootstrap_samples))) may not be the correct way of doing this. This is what Philipp Koehn's (@phikoehn) paper explains (section 4): "Unfortunately, this method to compute confidence intervals does not work for the BLEU metric, since the BLEU metric is not the mean of single sentence scores."
Moreover, neither the Moses script nor @neubig's code above (nor his compare-mt code) applies this technique; instead, they sort the scores and pick the 25th and 975th elements if n_samples == 1000.

The t-statistic method above always yields a CI offset of ~0.02 for BLEU scores in the range [0, 100].
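The two approaches can be compared with a small sketch. This is not the Moses script or the compare-mt code: `percentile_ci` and `stderr_offset` are hypothetical names, the scores are random numbers standing in for bootstrap BLEU scores, and reading "25th and 975th elements" as 0-based indices 25 and 975 is an assumption.

```python
import math
import random
import statistics


def percentile_ci(scores, alpha=0.05):
    """Percentile interval: sort the resampled scores and read off the bounds."""
    ordered = sorted(scores)
    n = len(ordered)
    lo_idx = round(n * alpha / 2)                    # 25 when n == 1000
    hi_idx = min(round(n * (1 - alpha / 2)), n - 1)  # 975 when n == 1000
    return ordered[lo_idx], ordered[hi_idx]


def stderr_offset(scores, z=1.96):
    """The offset questioned above: z * stdev / sqrt(number of resamples)."""
    return z * statistics.stdev(scores) / math.sqrt(len(scores))


# Made-up stand-in for 1000 bootstrap BLEU scores (not real results).
rng = random.Random(0)
fake_scores = [rng.gauss(30.0, 0.5) for _ in range(1000)]

low, high = percentile_ci(fake_scores)
print(f"percentile CI: [{low:.2f}, {high:.2f}]")
print(f"stderr offset: +/- {stderr_offset(fake_scores):.3f}")
```

Because the second function divides by the square root of the number of resamples, its offset shrinks as more resamples are drawn, whereas the percentile interval reflects the spread of the resampled scores themselves. This may be part of why the offset reported above is so small.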

So I need some guidance here to implement this bootstrap resampling and CI functionality correctly. In the meantime I'll replicate Moses' script's behavior to be on the safe side.

(I don't know if mentioning people outside the contributors actually works but let's see.)

bootstrap sampling, (single or paired possible)

8e19255

JJumSSu mentioned this pull request May 5, 2020

feature support for the significance test by bootstrapping? #70

Closed

martinpopel mentioned this pull request Jul 1, 2020

Refactoring & Fixes #88

Merged

martinpopel reviewed Aug 8, 2020

martinpopel mentioned this pull request Nov 27, 2020

Refactoring ideas #125

Closed

ozancaglayan added a commit that referenced this pull request Mar 26, 2021

add statistical test module (#40, #78)

04ecd07

ozancaglayan mentioned this pull request Mar 26, 2021

Changes for 2.0.0 #152

Merged

ozancaglayan closed this in 078c440 Jul 18, 2021

JJumSSu wants to merge 1 commit into mjpost:master from JJumSSu:bootstrap

4 participants
