Add --subset option and domain+country annotation for wmt18+wmt19#44
Add --subset option and domain+country annotation for wmt18+wmt19#44mjpost merged 21 commits intomjpost:masterfrom
Conversation
ArgumentParser deletes all newlines are multiple spaces by default. The script is now called "sacrebleu", not "sacreBLEU" and most users should have it installed in PATH, so "./" is not needed.
… sets. The list was included twice in the "sacrebleu -h" output. The list is getting quite long, so when printed on single line it is not much readable anyway. I think a separate --list option is better (it prints 1 test set per line). However, if anyone (@mjpost) wants to print the list within sacreble -h, I suggest to use the ArgumentParser's epilog. I know that `choices=DATASETS.keys()` does also validation in addition to documentation, but this functionality is already included (again, with a nicer multi-line listing of available test sets).
`args.langpair is None or args.langpair not in DATASETS[args.test_set])` this was unnecessarily duplicated. Also, I think the new structure is more readable.
SacreBLEU is not versioned in sockeye_contrib anymore. While `sacrebleu` with no arguments still prints all the test set names, it prints also the error message `sacreBLEU: I need either a predefined test set (-t) or a list of references' and ends with non-zero exit code. Using `sacrebleu --list` seems a tiny bit more user friendly.
use e.g. "-t wmt16,wmt17,wmt18"
So now, there are two ways how to evaluate e.g. original English subset of wmt18 en-de: 1) Generate the src and ref subsets using `sacrebleu -t wmt18 -l en-de --origlang en --echo src > my-orig-en.src` `sacrebleu -t wmt18 -l en-de --origlang en --echo ref > my-orig-en.ref` Translate only the subset. Evaluate it with `sacrebleu my-orig-en.ref < my-orig-en.translation`. Note that there is no --origlang in the evaluation. 2) Generate the full test set (i.e. no --origlang in the `sacrebleu --echo src` command). Translate the full test set. Evaluate it with `sacrebleu -t wmt18 -l en-de --origlang en < my-wmt18.translation`. So the meaning of --origlang is slightly different in the evaluation and in --echo, but I think this is intuitive and both scenarios 1 and 2 are useful in different situations: scenario 1 if you want to focus only on origlang=en during system tuning, scenario 2 if you have the full test sets already translated and want to analyze the effect of --origlang.
with the expected output, i.e. iterating only over the other dimension. Also regarding the any->all: when using e.g. --verbose -t wmt17,wmt18 where the domain (subset) annotation is available only for wmt18, the --verbose command should not fail, but print only the breakdown by origlang (and don't report any subsets breakdown because that would be based only on the wmt18 documents and thus potentially confusing).
test.sh included a literal carriage return character (for the purpose of testing that it does not affect BLEU). However, editing test.sh in some editors (including GitHub's merge function) either deletes the CR symbol or substitutes with LF. Thus, it is safer to print CR with `printf \r`.
| return [[sentence for sentence,keep in zip(sys, indices_to_keep) if keep] for sys in systems] | ||
|
|
||
|
|
||
| def main(): |
There was a problem hiding this comment.
This would add some work: but I've thought for a while that we should get rid of this hacky XML parsing in favor of using lxml.etree. This would simplify a lot of code and remove a potential for errors to creep in. What do you think? We use this library extensively in the Anthology.
There was a problem hiding this comment.
I would prefer to merge this PR first and possibly do the lxml refactoring in a separate PR, after considering the following:
- I used the ad-hoc reg-ex style, which was already there for the SGM->plaintext conversion. It's just a few lines of code and I doubt the lxml refactoring can be notably shorter/simpler (though I am not familiar with lxml).
- I thought that sacreBLEU is meant to be usable also as a single script with no installation, i.e. no dependencies (meanwhile,
portalockerwas added, I know). That said, if dependencies are allowed now, we could reconsider re-implementing everything with numpy and adding bootstrap resampling (Add bootstrap resampling #11) and paired bootstrap tests. - Are you sure that all legacy WMT *.sgm files are XML 1.0 valid and
lxml.etreeparses them correctly? They have no<?xml version="1.1" encoding="UTF-8"?>declaration, so they are not valid XML 1.1. Some older testsets (wmt08, I think) use<DOC>instead of<doc>etc.
There was a problem hiding this comment.
I'm not sure, but my impression is that lxml.etree is flexible against imperfect specs.
I don't mind if sacrebleu has dependencies, since pip makes it easy to install them and doesn't IMO make it harder to use. Rewriting BLEU in numpy would be awesome. It's too slow as it is. Did you have a reference numpy implementation in mind? I hadn't given any thought about how to make this faster.
|
Thanks for merging. If time allows, I will try lxml. Regarding the speed issues, I started answering it here, but then moved it to #46. |
|
Sounds good! I'm going to do some full testing and then I'll release version 1.4.0 to pypi. |
|
Hi @martinpopel, how are the annotations sourced? |
There are surprising differences in BLEU scores between different domains (topics),
but most researchers ignore this because there is no simple way how to analyze it.
Now it is as simple as adding
--detailto thesacrebleucommand (see test.sh).In addition to 6 domains: business, crime, entertainment, politics, scitech, sport, world,
this PR adds metadata about the main country/region for each news article:
EU, GB, US, OTHER (OTHER is further subclassified, but this annotation is not shown in
--detail).This annotation can be useful for disentangling the effect of translationese (original language) and domain/topics.
For such analysis, it would be nice to add domain+country annotation for some originally non-English documents.
I've tried to implement (and name) the
--subsetoption in a relatively versatile way,so it can be used for other purposes in future if someone adds other annotation.
Versioning the annotation (document metadata missing in the original sgm files)
in the source code is not elegant, but it is practical because