Add --subset option and domain+country annotation for wmt18+wmt19 by martinpopel · Pull Request #44 · mjpost/sacrebleu

martinpopel · 2019-08-22T00:00:45Z

There are surprising differences in BLEU scores between different domains (topics),
but most researchers ignore this because there is no simple way how to analyze it.
Now it is as simple as adding --detail to the sacrebleu command (see test.sh).

In addition to 6 domains: business, crime, entertainment, politics, scitech, sport, world,
this PR adds metadata about the main country/region for each news article:
EU, GB, US, OTHER (OTHER is further subclassified, but this annotation is not shown in --detail).
This annotation can be useful for disentangling the effect of translationese (original language) and domain/topics.
For such analysis, it would be nice to add domain+country annotation for some originally non-English documents.

I've tried to implement (and name) the --subset option in a relatively versatile way,
so it can be used for other purposes in future if someone adds other annotation.
Versioning the annotation (document metadata missing in the original sgm files)
in the source code is not elegant, but it is practical because

sacrebleu.py can still be used without installing (if portalocker is installed)
the annotation is tied with SacreBLEU's version
it is just 10-20 lines of "code" per wmt testset (original English docs only).

ArgumentParser deletes all newlines are multiple spaces by default. The script is now called "sacrebleu", not "sacreBLEU" and most users should have it installed in PATH, so "./" is not needed.

@mjpost

… sets. The list was included twice in the "sacrebleu -h" output. The list is getting quite long, so when printed on single line it is not much readable anyway. I think a separate --list option is better (it prints 1 test set per line). However, if anyone (@mjpost) wants to print the list within sacreble -h, I suggest to use the ArgumentParser's epilog. I know that `choices=DATASETS.keys()` does also validation in addition to documentation, but this functionality is already included (again, with a nicer multi-line listing of available test sets).

`args.langpair is None or args.langpair not in DATASETS[args.test_set])` this was unnecessarily duplicated. Also, I think the new structure is more readable.

SacreBLEU is not versioned in sockeye_contrib anymore. While `sacrebleu` with no arguments still prints all the test set names, it prints also the error message `sacreBLEU: I need either a predefined test set (-t) or a list of references' and ends with non-zero exit code. Using `sacrebleu --list` seems a tiny bit more user friendly.

use e.g. "-t wmt16,wmt17,wmt18"

So now, there are two ways how to evaluate e.g. original English subset of wmt18 en-de: 1) Generate the src and ref subsets using `sacrebleu -t wmt18 -l en-de --origlang en --echo src > my-orig-en.src` `sacrebleu -t wmt18 -l en-de --origlang en --echo ref > my-orig-en.ref` Translate only the subset. Evaluate it with `sacrebleu my-orig-en.ref < my-orig-en.translation`. Note that there is no --origlang in the evaluation. 2) Generate the full test set (i.e. no --origlang in the `sacrebleu --echo src` command). Translate the full test set. Evaluate it with `sacrebleu -t wmt18 -l en-de --origlang en < my-wmt18.translation`. So the meaning of --origlang is slightly different in the evaluation and in --echo, but I think this is intuitive and both scenarios 1 and 2 are useful in different situations: scenario 1 if you want to focus only on origlang=en during system tuning, scenario 2 if you have the full test sets already translated and want to analyze the effect of --origlang.

…cuments

with the expected output, i.e. iterating only over the other dimension. Also regarding the any->all: when using e.g. --verbose -t wmt17,wmt18 where the domain (subset) annotation is available only for wmt18, the --verbose command should not fail, but print only the breakdown by origlang (and don't report any subsets breakdown because that would be based only on the wmt18 documents and thus potentially confusing).

test.sh included a literal carriage return character (for the purpose of testing that it does not affect BLEU). However, editing test.sh in some editors (including GitHub's merge function) either deletes the CR symbol or substitutes with LF. Thus, it is safer to print CR with `printf \r`.

mjpost

Please also merge in master :)

sacrebleu.py

mjpost · 2019-09-10T13:46:22Z

sacrebleu.py

+    return [[sentence for sentence,keep in zip(sys, indices_to_keep) if keep] for sys in systems]
+
+
 def main():


This would add some work: but I've thought for a while that we should get rid of this hacky XML parsing in favor of using lxml.etree. This would simplify a lot of code and remove a potential for errors to creep in. What do you think? We use this library extensively in the Anthology.

I would prefer to merge this PR first and possibly do the lxml refactoring in a separate PR, after considering the following:

I used the ad-hoc reg-ex style, which was already there for the SGM->plaintext conversion. It's just a few lines of code and I doubt the lxml refactoring can be notably shorter/simpler (though I am not familiar with lxml).

I thought that sacreBLEU is meant to be usable also as a single script with no installation, i.e. no dependencies (meanwhile, portalocker was added, I know). That said, if dependencies are allowed now, we could reconsider re-implementing everything with numpy and adding bootstrap resampling (Add bootstrap resampling #11) and paired bootstrap tests.

Are you sure that all legacy WMT *.sgm files are XML 1.0 valid and lxml.etree parses them correctly? They have no <?xml version="1.1" encoding="UTF-8"?> declaration, so they are not valid XML 1.1. Some older testsets (wmt08, I think) use <DOC> instead of <doc> etc.

I'm not sure, but my impression is that lxml.etree is flexible against imperfect specs.

I don't mind if sacrebleu has dependencies, since pip makes it easy to install them and doesn't IMO make it harder to use. Rewriting BLEU in numpy would be awesome. It's too slow as it is. Did you have a reference numpy implementation in mind? I hadn't given any thought about how to make this faster.

martinpopel · 2019-09-10T23:40:35Z

Thanks for merging. If time allows, I will try lxml.

Regarding the speed issues, I started answering it here, but then moved it to #46.

mjpost · 2019-09-10T23:41:25Z

Sounds good! I'm going to do some full testing and then I'll release version 1.4.0 to pypi.

tholiao · 2019-09-12T06:41:53Z

Hi @martinpopel, how are the annotations sourced?

martinpopel added 14 commits August 15, 2019 09:41

print the help description with proper formatting

40da744

ArgumentParser deletes all newlines are multiple spaces by default. The script is now called "sacrebleu", not "sacreBLEU" and most users should have it installed in PATH, so "./" is not needed.

fix code duplication, nicer formatting of test set listing

fc24380

prevent code duplication

8844d64

`args.langpair is None or args.langpair not in DATASETS[args.test_set])` this was unnecessarily duplicated. Also, I think the new structure is more readable.

allow evaluation on concatenation of multiple test sets

2c0e788

use e.g. "-t wmt16,wmt17,wmt18"

add --origlang

d1ab3ca

don't use args in functions (including private _functions)

7f8abbd

add --verbose which iterates over all available origlangs

4424fbf

trying to fix https://travis-ci.com/mjpost/sacreBLEU/jobs/226898546

cc8fa74

Python3 has a more elegant solution of mkdir race condition

0227fbf

add --subset and country+domain annotation for wmt18+wmt19 orig-en do…

c93ae19

…cuments

martinpopel mentioned this pull request Sep 3, 2019

Allow evaluation on concatenation of multiple test sets #42

Merged

martinpopel added 4 commits September 4, 2019 12:56

Merge branch 'master' into testset-origlang

9dcaa34

CHANGELOG.md update

3acdd9c

Merge branch 'testset-origlang' into testset-subsets

76f137c

martinpopel mentioned this pull request Sep 5, 2019

Add --origlang #43

Merged

martinpopel added 2 commits September 9, 2019 23:42

rename --verbose to --detail

2d6b78f

Merge branch 'testset-origlang' into testset-subsets

ca445f9

martinpopel changed the title ~~Add --subset and --verbose options and domain annotation for wmt18+wmt19~~ Add --subset option and domain+country annotation for wmt18+wmt19 Sep 10, 2019

mjpost reviewed Sep 10, 2019

View reviewed changes

Merge branch 'master' into testset-subsets

2228dd8

mjpost merged commit b9d5c74 into mjpost:master Sep 10, 2019

martinpopel deleted the testset-subsets branch September 10, 2019 23:40

tholiao mentioned this pull request Sep 12, 2019

Where is document context information from? #47

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --subset option and domain+country annotation for wmt18+wmt19#44

Add --subset option and domain+country annotation for wmt18+wmt19#44
mjpost merged 21 commits intomjpost:masterfrom
martinpopel:testset-subsets

martinpopel commented Aug 22, 2019 •

edited

Loading

Uh oh!

mjpost left a comment

Uh oh!

Uh oh!

mjpost Sep 10, 2019

Uh oh!

martinpopel Sep 10, 2019

Uh oh!

mjpost Sep 10, 2019

Uh oh!

martinpopel commented Sep 10, 2019

Uh oh!

mjpost commented Sep 10, 2019

Uh oh!

tholiao commented Sep 12, 2019 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		return [[sentence for sentence,keep in zip(sys, indices_to_keep) if keep] for sys in systems]


		def main():

Conversation

martinpopel commented Aug 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mjpost left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mjpost Sep 10, 2019

Choose a reason for hiding this comment

Uh oh!

martinpopel Sep 10, 2019

Choose a reason for hiding this comment

Uh oh!

mjpost Sep 10, 2019

Choose a reason for hiding this comment

Uh oh!

martinpopel commented Sep 10, 2019

Uh oh!

mjpost commented Sep 10, 2019

Uh oh!

tholiao commented Sep 12, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

martinpopel commented Aug 22, 2019 •

edited

Loading

tholiao commented Sep 12, 2019 •

edited

Loading