speculative decoding #630
Merged
research4pan merged 4 commits into OptimalScale:main from wheresmyhair:main on Sep 6, 2023
Conversation
research4pan (Contributor) reviewed on Sep 2, 2023:
Overall, the implementation looks good to me and is well-documented 👍 The quality can be further improved with the following minor problems fixed. I think this PR is only one question away from approval.
src/lmflow/pipeline/inferencer.py
- [Feature] line 331: better treat `temperature=0.0` as `use_argmax=True`.
- ⚠️ [Bug or Question] line 344: I think the denominator is the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
- [Bug] line 359: no bug when `num_sample=1`, but note that `torch.multinomial` samples without replacement by default (see this link), so `replacement=True` should be specified.
- [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
- [Style] lines 458-459, 484: better use `logger.debug` instead of `print`.
- [Feature] line 465: given `ThreadPoolExecutor(max_worker=num_model_for_verify)`, better expose the `num_model_for_verify` argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
- [Style] lines 499, 502: typo: "flnal" -> "final"
- [Style] lines 507-508, 512-513: use `logger.debug` instead of `print`.
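To illustrate the `torch.multinomial` point above: the default is sampling without replacement, which both skews repeated draws and fails outright when `num_samples` exceeds the number of non-zero categories. A minimal sketch with toy probabilities (not the PR's actual code):

```python
import torch

probs = torch.tensor([0.9, 0.1])

# Default is sampling WITHOUT replacement: asking for more samples than
# there are non-zero categories raises a RuntimeError.
try:
    torch.multinomial(probs, num_samples=5)
except RuntimeError as e:
    print("without replacement fails:", e)

# With replacement=True, any num_samples works and repeated draws are
# i.i.d. from the given distribution, which is what sampling needs here.
samples = torch.multinomial(probs, num_samples=5, replacement=True)
print(samples.shape)  # torch.Size([5])
```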
tests/pipeline/test_spec_inf.py
- The `tests` folder is used for unit tests; better rewrite this part in the standard unittest format, or move it to `examples/*.py` later.
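A minimal sketch of the standard unittest layout the comment asks for. Class and method names here are hypothetical; the real test would instantiate `SpeculativeInferencer` with the draft and target models in `setUp`:

```python
import unittest


class TestSpeculativeInferencer(unittest.TestCase):
    def setUp(self):
        # Hypothetical fixture: the real test would build the gpt2 draft
        # model and gpt2-large target model here.
        self.prompt = "Hello"

    def test_returns_longer_text(self):
        # Placeholder for the SpeculativeInferencer output; the real
        # assertion would check the generated continuation.
        output = self.prompt + " world"
        self.assertTrue(len(output) > len(self.prompt))
```

Such a file is picked up automatically by `python -m unittest discover tests`, which is the main benefit over an ad-hoc script.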
research4pan (Contributor) reviewed on Sep 2, 2023:
- [Bug] lines 465-466: we should use one forward pass over the whole sequence (utilizing GPU parallelism) instead of thread-level parallelism for the large model M_p; otherwise there will be no acceleration.
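The point above is that a causal LM already produces logits for every position in one forward pass, so all γ draft tokens can be verified at once. A toy sketch with a stand-in linear "model" (hypothetical names; the real pipeline would call a Hugging Face causal LM):

```python
import torch

# Toy stand-in for the large target model M_p: maps token ids to logits.
vocab_size, hidden = 16, 8
torch.manual_seed(0)
embed = torch.randn(vocab_size, hidden)
head = torch.randn(hidden, vocab_size)

def target_logits(token_ids: torch.Tensor) -> torch.Tensor:
    # One forward over the whole sequence: logits for every position come
    # out of a single batched matmul, exploiting GPU parallelism.
    return embed[token_ids] @ head  # (seq_len, vocab_size)

prefix = torch.tensor([1, 2, 3])
draft = torch.tensor([4, 5, 6])   # gamma tokens proposed by the draft model
seq = torch.cat([prefix, draft])

logits = target_logits(seq)       # single forward pass, no threads
# Position i predicts token i+1, so the distribution needed to verify
# draft[j] lives at index len(prefix) - 1 + j.
verify_logits = logits[len(prefix) - 1 : len(prefix) - 1 + len(draft)]
probs = torch.softmax(verify_logits, dim=-1)
print(probs.shape)  # one distribution per draft token: torch.Size([3, 16])
```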
wheresmyhair (Collaborator, Author) commented:
inferencer now supports speculative decoding via `SpeculativeInferencer`. Tested with gpt2 (draft model) and gpt2-large (target model); see `/tests/pipeline/test_spec_inf.py`. Only functionality testing is finished; performance testing is still needed. I'm not sure whether my implementation of STEP 2 in speculative sampling (running the target model in parallel) is correct, so please review & revise. Thanks a lot!
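For reference, the accept/reject rule at the heart of speculative sampling can be sketched on toy distributions (hypothetical tensors, not the PR's code): accept the draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p − q, 0).

```python
import torch

torch.manual_seed(0)
vocab = 16
# Hypothetical distributions for one position: q from the draft model,
# p from the target model.
q = torch.softmax(torch.randn(vocab), dim=-1)
p = torch.softmax(torch.randn(vocab), dim=-1)

x = torch.multinomial(q, 1, replacement=True).item()  # draft token
# Accept x with probability min(1, p(x)/q(x)); on rejection, resample
# from the residual distribution norm(max(p - q, 0)). This makes the
# combined output exactly distributed according to p.
if torch.rand(()) < min(1.0, (p[x] / q[x]).item()):
    token = x
else:
    residual = torch.clamp(p - q, min=0)
    residual = residual / residual.sum()
    token = torch.multinomial(residual, 1, replacement=True).item()
print(token)  # a valid token id in [0, vocab)
```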