https://arxiv.org/abs/1609.04938
1. Abstract
- The model is trained end-to-end
- It combines a convolutional network with a recurrent network
- Existing OCR systems reach about 25% accuracy on this task, while the paper's model reaches about 75%
2. Introduction
- OCR requires joint processing of image and text data
- WYGIWYS is a simple extension of the attention-based encoder-decoder model
- The paper introduces the IM2LATEX-100K dataset
3. Problem: image-to-markup generation
- The authors define the image-to-markup problem as converting a rendered source image to target presentational markup
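One way to write this definition down, as a sketch only. The notation is assumed and does not come from these notes: x is the image of height H and width W, y is the markup token sequence of length T, and Σ is the token vocabulary.

```latex
\begin{aligned}
x &\in \mathbb{R}^{H \times W}, \qquad y = (y_1, \dots, y_T), \quad y_t \in \Sigma \\
\hat{y} &= \arg\max_{y} \; p(y \mid x), \qquad
p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid y_1, \dots, y_{t-1}, x)
\end{aligned}
```

The factorization over tokens is the usual encoder-decoder form: each token is predicted from the image and the tokens generated so far.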
4. Model

Convolutional Network
- The convolutional network does not use a fully connected layer
- This preserves the locality of the CNN features so that visual attention can be applied to them
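A minimal sketch of what "preserving locality" means in practice: without a fully connected layer, the CNN output is still a grid, and each grid cell can be traced back to a box of pixels. The pooling stack `POOLS` and the 64x256 input size are hypothetical values chosen for illustration, not the paper's configuration, and the wider receptive field of the convolutions is ignored.

```python
def feature_grid_shape(height, width, pools):
    """Return (rows, cols) of the feature grid after a stack of pooling layers.

    `pools` is a list of (pool_h, pool_w) strides. Convolutions are assumed
    to be padded so that only pooling changes the spatial size.
    """
    for pool_h, pool_w in pools:
        height //= pool_h
        width //= pool_w
    return height, width


def cell_to_region(row, col, pools):
    """Return the (top, left, bottom, right) pixel box behind one grid cell."""
    scale_h = scale_w = 1
    for pool_h, pool_w in pools:
        scale_h *= pool_h
        scale_w *= pool_w
    return (row * scale_h, col * scale_w, (row + 1) * scale_h, (col + 1) * scale_w)


# Hypothetical stack: two 2x2 pools, then a 1x2 pool and a 2x1 pool.
POOLS = [(2, 2), (2, 2), (1, 2), (2, 1)]
grid = feature_grid_shape(64, 256, POOLS)   # 8 rows by 32 columns
region = cell_to_region(0, 3, POOLS)        # pixels behind row 0, column 3
```

A fully connected layer would mix all cells into one vector, and this cell-to-region mapping would be lost, leaving attention nothing spatial to point at.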
Row Encoder
- Show, Attend and Tell shows that an image feature grid can be fed directly into the decoder
- In OCR, however, the input carries significant relative sequential order information
- So re-encoding each row of the grid with an RNN can help in two ways:
- The left-to-right order can be learned easily by the encoder
- The RNN can use the surrounding horizontal context to refine the hidden representation
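The row encoder idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's implementation: a simple tanh RNN stands in for whatever recurrent cell the paper uses, the weights are random, the hidden state is reset to zeros at the start of each row, and the names `rnn_step` and `encode_rows` are made up here.

```python
import math
import random


def rnn_step(x, h, W_x, W_h):
    """One tanh RNN step: h' = tanh(W_x x + W_h h)."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(W_x[j], x))
            + sum(w * hi for w, hi in zip(W_h[j], h))
        )
        for j in range(len(W_h))
    ]


def encode_rows(grid, W_x, W_h):
    """Re-encode a feature grid row by row, left to right.

    grid[r][c] is the CNN feature vector at row r, column c. The output
    keeps the same rows x cols layout, so attention can still point at
    individual positions.
    """
    hidden_size = len(W_h)
    encoded = []
    for row in grid:
        h = [0.0] * hidden_size  # fresh state for every row (an assumption)
        out_row = []
        for x in row:
            h = rnn_step(x, h, W_x, W_h)  # h now summarizes columns 0..c
            out_row.append(h)
        encoded.append(out_row)
    return encoded


random.seed(0)
FEAT, HID = 3, 4  # toy feature and hidden sizes
W_x = [[random.uniform(-0.5, 0.5) for _ in range(FEAT)] for _ in range(HID)]
W_h = [[random.uniform(-0.5, 0.5) for _ in range(HID)] for _ in range(HID)]
grid = [[[random.random() for _ in range(FEAT)] for _ in range(5)] for _ in range(2)]
encoded = encode_rows(grid, W_x, W_h)
```

Each output vector depends on everything to its left in the same row, which is the "horizontal context" from the last bullet. A bidirectional variant would run a second pass right to left and concatenate the two states.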
Decoder
- Uses an attention mechanism (Bahdanau attention)
- Uses beam search at test time
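The two decoder ingredients above can be sketched as follows. Everything here is a simplified illustration: Bahdanau attention scores each position with a small learned network, whereas this sketch uses a plain dot product to stay short, and `TOY_MODEL` is a made-up next-token table standing in for a trained decoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, score_fn):
    """Return (weights, context) for one decoder step.

    `keys` is the flattened list of encoder vectors and `score_fn(query, key)`
    gives an unnormalized relevance score for one position.
    """
    weights = softmax([score_fn(query, k) for k in keys])
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return weights, context


def beam_search(step_fn, start, end, beam_size, max_len):
    """Keep the `beam_size` best partial sequences by total log-probability.

    `step_fn(prefix)` returns {token: probability} for the next token.
    No length normalization is applied in this sketch.
    """
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in step_fn(prefix).items():
                if prob > 0:
                    candidates.append((prefix + [token], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == end:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished beams at max_len
    return max(finished, key=lambda c: c[1])[0]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights, context = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], dot)

# Made-up next-token probabilities, keyed by the last token of the prefix.
TOY_MODEL = {
    "<s>": {"\\frac": 0.6, "x": 0.4},
    "\\frac": {"a": 0.55, "b": 0.45},
    "x": {"^": 0.9, "</s>": 0.1},
}


def toy_step(prefix):
    return TOY_MODEL.get(prefix[-1], {"</s>": 1.0})


greedy = beam_search(toy_step, "<s>", "</s>", beam_size=1, max_len=5)
wide = beam_search(toy_step, "<s>", "</s>", beam_size=2, max_len=5)
```

With a beam of 1 the search commits to the locally best first token and ends with total probability 0.33. With a beam of 2 it keeps the second-best first token alive and finds a sequence with probability 0.36, which is why beam search is preferred over greedy decoding at test time.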
5. Dataset
Tokenization
- Character-based models did not perform as well
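To make the difference between the two tokenizations concrete, here is a small sketch. The regular expression is an assumption for illustration and not the tokenizer used for the dataset. The notes do not say why character-based models did worse; one plausible factor, shown here, is that character sequences are much longer.

```python
import re

# A LaTeX command (\frac, \alpha), an escaped symbol (\{, \\), or any
# single non-space character.
TOKEN_RE = re.compile(r"\\[a-zA-Z]+|\\.|\S")


def tokenize_latex(markup):
    """Split LaTeX markup into command-level tokens."""
    return TOKEN_RE.findall(markup)


def tokenize_chars(markup):
    """Character-level baseline: every non-space character is a token."""
    return [ch for ch in markup if not ch.isspace()]


formula = r"\frac{\alpha}{2} + x^{2}"
tokens = tokenize_latex(formula)   # 13 tokens, \frac and \alpha kept whole
chars = tokenize_chars(formula)    # 22 tokens, commands split into letters
```

With command-level tokens the decoder emits `\alpha` in one step instead of spelling it out over six steps.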
Optional: Normalization
- A modified KaTeX is used to produce normalized input data
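A toy illustration of why normalization helps: different markup strings can render to the same image, so mapping them to a single spelling gives the model one consistent target. The rule below (subscript before superscript) is an assumed example, since these notes do not list the rewrites actually applied, and a regex is used only for brevity where a real normalizer would work on a parse tree.

```python
import re

# One script argument: a braced group without nested braces, a command,
# or a single character.
ARG = r"(\{[^{}]*\}|\\[a-zA-Z]+|[^\s{}])"
SUP_THEN_SUB = re.compile(r"\^\s*" + ARG + r"\s*_\s*" + ARG)


def normalize_scripts(markup):
    """Rewrite `^a_b` as `_b^a` so both spellings map to one target string."""
    return SUP_THEN_SUB.sub(lambda m: "_" + m.group(2) + "^" + m.group(1), markup)


same_a = normalize_scripts("x^2_i")
same_b = normalize_scripts("x_i^2")
summed = normalize_scripts(r"\sum^{n}_{i=1} a_i")
```

Both `x^2_i` and `x_i^2` come out as `x_i^2`, so the training data no longer contains two spellings of one rendering. This sketch does not handle nested braces.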
My Notes
- Each GitHub implementation uses a different loss function