
Non-Autoregressive Neural Machine Translation #6


https://arxiv.org/abs/1711.02281

Abstract

Features

  • Non-autoregressive: output tokens have no dependency on one another
  • Outputs are produced in parallel

How

  • Knowledge distillation
  • Input token fertilities
  • Policy Gradient

1. Introduction

The model builds on CNN and self-attention (Transformer) architectures, which already avoid sequential computation during training; this paper removes the autoregressive dependency from decoding as well.

2. Background

2.1. Autoregressive Neural Machine Translation

  • Training can be parallelized with causal attention masking (Transformer) or causal convolutions (CNN); the Transformer's masked self-attention performs better

2.2. Non-Autoregressive decoding

Problems with beam search:

  • suffers from diminishing returns as beam size grows
  • limits the parallelism of the search

The output length T is treated as a random variable: the model first predicts a length, then emits all T tokens in parallel.
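
This gives the naive non-autoregressive factorization; my reconstruction of the paper's formula, with a length model p_L and source x_{1:m}:

```latex
% Naive non-autoregressive factorization: predict the length T, then all
% T output tokens conditionally independently given the source x_{1:m}.
p_{\mathcal{NA}}(Y \mid X; \theta)
  = p_L(T \mid x_{1:m}; \theta)
    \prod_{t=1}^{T} p(y_t \mid x_{1:m}; \theta)
```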

2.3. The multimodality problem

The multimodality problem: one source sentence admits many valid translations, so the target distribution is highly multimodal, and conditionally independent output tokens cannot commit to a single mode. E.g., "Thank you." can become "Danke schön." or "Vielen Dank.", and independent per-token choices can mix modes into outputs like "Danke Dank."
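
A toy script (mine, not the paper's code) illustrating mode mixing with the paper's "Danke schön." / "Vielen Dank." example:

```python
import random

# Two equally valid German translations of "Thank you.", i.e. two modes
# of the target distribution.
modes = [["Danke", "schön"], ["Vielen", "Dank"]]

# Sampling each output position independently from its marginal can mix
# the modes: half the time this prints an inconsistent output such as
# "Danke Dank" or "Vielen schön".
random.seed(0)
sample = [random.choice([mode[t] for mode in modes]) for t in range(2)]
print(" ".join(sample))
```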

3. The non-autoregressive transformer

(figure: the NAT architecture, with encoder, fertility predictor, and parallel decoder)

3.3. Modeling fertility to tackle the multimodality problem

Fertilities are supervised with alignments from IBM Model 2 (via an external aligner).

Definition of fertilities and their benefits:

  • Definition: the number of times each input word is copied into the decoder input
  • Provide a natural factorization that dramatically reduces the space of modes
  • Make the decoder's job easier, since each position knows roughly which source word to translate (see the sketch below)
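
A minimal sketch (function name and example tokens are mine, not the paper's) of how fertilities turn the source into the decoder input:

```python
# Each source token is copied as many times as its fertility; the copy
# counts also fix the output length T.
def decoder_input_from_fertilities(src_tokens, fertilities):
    out = []
    for tok, f in zip(src_tokens, fertilities):
        out.extend([tok] * f)  # fertility 0 drops the token entirely
    return out

# e.g. ["we", "totally", "accept", "it"] with fertilities [1, 2, 1, 0]
# -> ["we", "totally", "totally", "accept"], so T = 4
print(decoder_input_from_fertilities(["we", "totally", "accept", "it"],
                                     [1, 2, 1, 0]))
```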

3.4. Translation predictor and the decoding process

  • Argmax decoding: take the highest-probability fertility at each position
  • Average decoding: use the expected fertility at each position (rounded)
  • Noisy parallel decoding (NPD): sample several fertility sequences, decode them in parallel, and rescore with the autoregressive teacher (sketched below)
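
A hedged sketch of NPD; `sample_fertilities`, `nat_decode`, and `teacher_score` are hypothetical stand-ins for the fertility predictor, the NAT decoder, and the autoregressive teacher:

```python
def noisy_parallel_decode(src, sample_fertilities, nat_decode, teacher_score,
                          num_samples=8):
    candidates = []
    for _ in range(num_samples):        # each iteration is independent,
        fert = sample_fertilities(src)  # so all samples can run in parallel
        candidates.append(nat_decode(src, fert))
    # a single pass of the teacher scores all candidates; keep the best one
    return max(candidates, key=lambda hyp: teacher_score(src, hyp))
```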

4. Training

I didn't like this section

(equation screenshot: the fertility-marginalized likelihood)
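
My reconstruction of the likelihood the screenshot likely showed: it marginalizes over fertility sequences f, where x^f denotes the source copied according to its fertilities:

```latex
% Fertility-marginalized likelihood; \mathcal{F} is the set of fertility
% sequences whose counts sum to the target length T.
p_{\mathcal{NA}}(Y \mid X; \theta)
  = \sum_{f \in \mathcal{F}}
      p_F(f \mid x_{1:m}; \theta)
      \prod_{t=1}^{T} p\!\left(y_t \mid x^{f}; \theta\right),
\qquad
\mathcal{F} = \Big\{ f : \sum_{i=1}^{m} f_i = T \Big\}
```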

4.2. Fine-Tuning

The fine-tuning loss combines a word-level KL-divergence distillation term against the teacher, an RL (policy-gradient) term for the non-differentiable fertility sampling, and a deterministic term trained by ordinary backpropagation.
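
A minimal REINFORCE sketch (an assumption-laden illustration using PyTorch, not the paper's exact fine-tuning loss): fertility sampling is discrete, so the gradient flows through log-prob times reward rather than plain backprop.

```python
import torch

def reinforce_loss(fert_logits, reward_fn, baseline=0.0):
    # fert_logits: per-source-position logits over fertility values
    dist = torch.distributions.Categorical(logits=fert_logits)
    ferts = dist.sample()      # one fertility per source position
    reward = reward_fn(ferts)  # e.g. teacher log-likelihood of the decode
    # minimizing this raises the probability of high-reward fertilities
    return -(reward - baseline) * dist.log_prob(ferts).sum()
```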

Word-level knowledge distillation (Teacher)
(equation screenshot: word-level distillation loss)
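
For reference, the standard word-level distillation loss (Kim & Rush, 2016), which I believe is what the screenshot showed; the paper fine-tunes with a reverse-KL variant of this, which is more mode-seeking:

```latex
% Word-level distillation: q is the autoregressive teacher, p the NAT
% student; the student matches the teacher position by position.
\mathcal{L}_{KD}
  = -\sum_{t=1}^{T} \sum_{k=1}^{|V|}
      q\big(y_t = k \mid \hat{y}_{<t}, x\big)\,
      \log p\big(y_t = k \mid x\big)
```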

External fertility inference model
(equation screenshot: objective term using the external fertility inference model)
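
The deterministic external inference reads fertility targets off a hard word alignment (e.g. from fast_align, an IBM Model 2 implementation); a minimal sketch with names and example of my own:

```python
from collections import Counter

def fertilities_from_alignment(alignment, src_len):
    # alignment: (src_idx, tgt_idx) pairs; the fertility of source word i
    # is the number of target words aligned to it
    counts = Counter(s for s, _ in alignment)
    return [counts.get(i, 0) for i in range(src_len)]

# e.g. alignment [(0, 0), (1, 1), (1, 2), (2, 3)] over 3 source words -> [1, 2, 1]
print(fertilities_from_alignment([(0, 0), (1, 1), (1, 2), (2, 3)], 3))
```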

Todo

  • (3.4) Read up on average decoding and noisy parallel decoding
