Skip to content

Add --subset option and domain+country annotation for wmt18+wmt19#44

Merged
mjpost merged 21 commits intomjpost:masterfrom
martinpopel:testset-subsets
Sep 10, 2019
Merged

Add --subset option and domain+country annotation for wmt18+wmt19#44
mjpost merged 21 commits intomjpost:masterfrom
martinpopel:testset-subsets

Conversation

@martinpopel
Copy link
Copy Markdown
Collaborator

@martinpopel martinpopel commented Aug 22, 2019

There are surprising differences in BLEU scores between different domains (topics),
but most researchers ignore this because there is no simple way how to analyze it.
Now it is as simple as adding --detail to the sacrebleu command (see test.sh).

In addition to 6 domains: business, crime, entertainment, politics, scitech, sport, world,
this PR adds metadata about the main country/region for each news article:
EU, GB, US, OTHER (OTHER is further subclassified, but this annotation is not shown in --detail).
This annotation can be useful for disentangling the effect of translationese (original language) and domain/topics.
For such analysis, it would be nice to add domain+country annotation for some originally non-English documents.

I've tried to implement (and name) the --subset option in a relatively versatile way,
so it can be used for other purposes in future if someone adds other annotation.
Versioning the annotation (document metadata missing in the original sgm files)
in the source code is not elegant, but it is practical because

  • sacrebleu.py can still be used without installing (if portalocker is installed)
  • the annotation is tied with SacreBLEU's version
  • it is just 10-20 lines of "code" per wmt testset (original English docs only).

ArgumentParser deletes all newlines are multiple spaces by default.

The script is now called "sacrebleu", not "sacreBLEU"
and most users should have it installed in PATH, so "./" is not needed.
… sets.

The list was included twice in the "sacrebleu -h" output.
The list is getting quite long, so when printed on single line
it is not much readable anyway.
I think a separate --list option is better (it prints 1 test set per line).

However, if anyone (@mjpost) wants to print the list within sacreble -h,
I suggest to use the ArgumentParser's epilog.

I know that `choices=DATASETS.keys()` does also validation
in addition to documentation, but this functionality is already included
(again, with a nicer multi-line listing of available test sets).
`args.langpair is None or args.langpair not in DATASETS[args.test_set])`
this was unnecessarily duplicated.
Also, I think the new structure is more readable.
SacreBLEU is not versioned in sockeye_contrib anymore.

While `sacrebleu` with no arguments still prints all the test set names,
it prints also the error message
`sacreBLEU: I need either a predefined test set (-t) or a list of references'
and ends with non-zero exit code.
Using `sacrebleu --list` seems a tiny bit more user friendly.
So now, there are two ways how to evaluate e.g. original English subset of wmt18 en-de:

1) Generate the src and ref subsets using
   `sacrebleu -t wmt18 -l en-de --origlang en --echo src > my-orig-en.src`
   `sacrebleu -t wmt18 -l en-de --origlang en --echo ref > my-orig-en.ref`
   Translate only the subset.
   Evaluate it with `sacrebleu my-orig-en.ref < my-orig-en.translation`.
   Note that there is no --origlang in the evaluation.

2) Generate the full test set (i.e. no --origlang in the `sacrebleu --echo src` command).
   Translate the full test set.
   Evaluate it with `sacrebleu -t wmt18 -l en-de --origlang en < my-wmt18.translation`.

So the meaning of --origlang is slightly different in the evaluation and in --echo,
but I think this is intuitive and both scenarios 1 and 2 are useful in different situations:
scenario 1 if you want to focus only on origlang=en during system tuning,
scenario 2 if you have the full test sets already translated and want to analyze the effect of --origlang.
with the expected output, i.e. iterating only over the other dimension.

Also regarding the any->all:
when using e.g. --verbose -t wmt17,wmt18
where the domain (subset) annotation is available only for wmt18,
the --verbose command should not fail, but print only the breakdown
by origlang (and don't report any subsets breakdown because that would be
based only on the wmt18 documents and thus potentially confusing).
test.sh included a literal carriage return character
(for the purpose of testing that it does not affect BLEU).
However, editing test.sh in some editors (including GitHub's merge function)
either deletes the CR symbol or substitutes with LF.
Thus, it is safer to print CR with `printf \r`.
@martinpopel martinpopel mentioned this pull request Sep 5, 2019
@martinpopel martinpopel changed the title Add --subset and --verbose options and domain annotation for wmt18+wmt19 Add --subset option and domain+country annotation for wmt18+wmt19 Sep 10, 2019
Copy link
Copy Markdown
Owner

@mjpost mjpost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also merge in master :)

return [[sentence for sentence,keep in zip(sys, indices_to_keep) if keep] for sys in systems]


def main():
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would add some work: but I've thought for a while that we should get rid of this hacky XML parsing in favor of using lxml.etree. This would simplify a lot of code and remove a potential for errors to creep in. What do you think? We use this library extensively in the Anthology.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to merge this PR first and possibly do the lxml refactoring in a separate PR, after considering the following:

  • I used the ad-hoc reg-ex style, which was already there for the SGM->plaintext conversion. It's just a few lines of code and I doubt the lxml refactoring can be notably shorter/simpler (though I am not familiar with lxml).
  • I thought that sacreBLEU is meant to be usable also as a single script with no installation, i.e. no dependencies (meanwhile, portalocker was added, I know). That said, if dependencies are allowed now, we could reconsider re-implementing everything with numpy and adding bootstrap resampling (Add bootstrap resampling #11) and paired bootstrap tests.
  • Are you sure that all legacy WMT *.sgm files are XML 1.0 valid and lxml.etree parses them correctly? They have no <?xml version="1.1" encoding="UTF-8"?> declaration, so they are not valid XML 1.1. Some older testsets (wmt08, I think) use <DOC> instead of <doc> etc.

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, but my impression is that lxml.etree is flexible against imperfect specs.

I don't mind if sacrebleu has dependencies, since pip makes it easy to install them and doesn't IMO make it harder to use. Rewriting BLEU in numpy would be awesome. It's too slow as it is. Did you have a reference numpy implementation in mind? I hadn't given any thought about how to make this faster.

@mjpost mjpost merged commit b9d5c74 into mjpost:master Sep 10, 2019
@martinpopel
Copy link
Copy Markdown
Collaborator Author

Thanks for merging. If time allows, I will try lxml.

Regarding the speed issues, I started answering it here, but then moved it to #46.

@martinpopel martinpopel deleted the testset-subsets branch September 10, 2019 23:40
@mjpost
Copy link
Copy Markdown
Owner

mjpost commented Sep 10, 2019

Sounds good! I'm going to do some full testing and then I'll release version 1.4.0 to pypi.

@tholiao
Copy link
Copy Markdown
Contributor

tholiao commented Sep 12, 2019

Hi @martinpopel, how are the annotations sourced?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants