Moses has this script for estimating the "true" BLEU score within a confidence interval. Unfortunately, it is not as configurable as sacreBLEU.
To compare systems for statistical significance, it would be nice to have a similar script that supports sacreBLEU.
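
To make the request concrete, here is a rough sketch of what such a script could do, assuming it uses bootstrap resampling over sentence pairs, which is a common way to get a confidence interval for a corpus-level metric. The names `bootstrap_ci` and `exact_match` are hypothetical, and the toy metric only stands in for BLEU so that the sketch runs without extra dependencies.

```python
import random
from typing import Callable, List, Sequence, Tuple


def exact_match(hyps: List[str], refs: List[str]) -> float:
    """Toy stand-in metric (hypothetical): fraction of exactly matching lines."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(hyps)


def bootstrap_ci(
    hyps: Sequence[str],
    refs: Sequence[str],
    metric: Callable[[List[str], List[str]], float],
    n_samples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) of the metric over bootstrap resamples."""
    if len(hyps) != len(refs) or not hyps:
        raise ValueError("hyps and refs must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(hyps)
    scores = []
    for _ in range(n_samples):
        # Draw n sentence indices with replacement and rescore the resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([hyps[i] for i in idx], [refs[i] for i in idx]))

    scores.sort()
    # Percentile interval: cut alpha/2 of the resamples off each tail.
    lower = scores[int((alpha / 2) * n_samples)]
    upper = scores[min(n_samples - 1, int((1 - alpha / 2) * n_samples))]
    return sum(scores) / n_samples, lower, upper
```

With sacreBLEU installed and a single reference per sentence, the metric argument could be something like `lambda h, r: sacrebleu.corpus_bleu(h, [r]).score`, so any tokenization or smoothing options would be set in that one call.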