speculative decoding #630
Merged
research4pan merged 4 commits into OptimalScale:main from wheresmyhair:main on Sep 6, 2023
Conversation
research4pan (Contributor) reviewed on Sep 2, 2023:
Overall, the implementation looks good to me and is well-documented 👍 The quality can be further improved with the following minor problems fixed. I think this PR is only one question away from approval.
src/lmflow/pipeline/inferencer.py
- [Feature] line 331: better treat `temperature=0.0` as `use_argmax=True`.
- ⚠️ [Bug or Question] line 344: I think the denominator is the maximum non-zero cumulative probability, not the sum of those cumulative probabilities?
- [Bug] line 359: no bug when `num_sample=1`, but note that `torch.multinomial` samples without replacement by default (see this link), so `replacement=True` should be specified.
- [Style] line 455: comment typo "x1,...,γ" -> "x1,...,xγ"
- [Style] lines 458-459, 484: better use `logger.debug` instead of `print`.
- [Feature] line 465: given `ThreadPoolExecutor(max_worker=num_model_for_verify)`, better expose the `num_model_for_verify` argument (default=1) to users, since for very large models GPU memory can become the bottleneck when multiple large models run in parallel for verification. A better implementation could verify batch by batch and let the user specify the batch size.
- [Style] lines 499, 502: typo: "flnal" -> "final"
- [Style] lines 507-508, 512-513: use `logger.debug` instead of `print`.
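To illustrate the `torch.multinomial` point above: the default is sampling without replacement, which both skews repeated draws and fails outright when `num_samples` exceeds the number of non-zero categories. A minimal sketch with toy probabilities (not the PR's actual code):

```python
import torch

probs = torch.tensor([0.9, 0.1])

# Default is sampling WITHOUT replacement: asking for more samples than
# there are non-zero categories raises a RuntimeError.
try:
    torch.multinomial(probs, num_samples=5)
except RuntimeError as e:
    print("without replacement fails:", e)

# With replacement=True, any num_samples works and repeated draws are
# i.i.d. from the given distribution, which is what sampling needs here.
samples = torch.multinomial(probs, num_samples=5, replacement=True)
print(samples.shape)  # torch.Size([5])
```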
tests/pipeline/test_spec_inf.py
- The `tests` folder is used for unit tests; better rewrite this part in the standard unittest format, or move it to `examples/*.py` later.
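A minimal sketch of the standard unittest layout the comment asks for. Class and method names here are hypothetical; the real test would instantiate `SpeculativeInferencer` with the draft and target models in `setUp`:

```python
import unittest


class TestSpeculativeInferencer(unittest.TestCase):
    def setUp(self):
        # Hypothetical fixture: the real test would build the gpt2 draft
        # model and gpt2-large target model here.
        self.prompt = "Hello"

    def test_returns_longer_text(self):
        # Placeholder for the SpeculativeInferencer output; the real
        # assertion would check the generated continuation.
        output = self.prompt + " world"
        self.assertTrue(len(output) > len(self.prompt))
```

Such a file is picked up automatically by `python -m unittest discover tests`, which is the main benefit over an ad-hoc script.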
research4pan (Contributor) reviewed on Sep 2, 2023:
- [Bug] lines 465-466: we should use one forward pass over the whole sequence (utilizing GPU parallelism) instead of thread-level parallelism for the large model M_p; otherwise there will be no acceleration.
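The point above is that a causal LM already produces logits for every position in one forward pass, so all γ draft tokens can be verified at once. A toy sketch with a stand-in linear "model" (hypothetical names; the real pipeline would call a Hugging Face causal LM):

```python
import torch

# Toy stand-in for the large target model M_p: maps token ids to logits.
vocab_size, hidden = 16, 8
torch.manual_seed(0)
embed = torch.randn(vocab_size, hidden)
head = torch.randn(hidden, vocab_size)

def target_logits(token_ids: torch.Tensor) -> torch.Tensor:
    # One forward over the whole sequence: logits for every position come
    # out of a single batched matmul, exploiting GPU parallelism.
    return embed[token_ids] @ head  # (seq_len, vocab_size)

prefix = torch.tensor([1, 2, 3])
draft = torch.tensor([4, 5, 6])   # gamma tokens proposed by the draft model
seq = torch.cat([prefix, draft])

logits = target_logits(seq)       # single forward pass, no threads
# Position i predicts token i+1, so the distribution needed to verify
# draft[j] lives at index len(prefix) - 1 + j.
verify_logits = logits[len(prefix) - 1 : len(prefix) - 1 + len(draft)]
probs = torch.softmax(verify_logits, dim=-1)
print(probs.shape)  # one distribution per draft token: torch.Size([3, 16])
```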
wheresmyhair (Collaborator, Author) commented:
inferencer now supports speculative decoding via `SpeculativeInferencer`. Tested with gpt2 (draft model) and gpt2-large (target model); see `/tests/pipeline/test_spec_inf.py`. Only functionality testing is finished; performance testing is still needed. I'm not sure whether my implementation of STEP 2 in speculative sampling (running the target model in parallel) is correct, so please review & revise. Thanks a lot!
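For reference, the accept/reject rule at the heart of speculative sampling can be sketched on toy distributions (hypothetical tensors, not the PR's code): accept the draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p − q, 0).

```python
import torch

torch.manual_seed(0)
vocab = 16
# Hypothetical distributions for one position: q from the draft model,
# p from the target model.
q = torch.softmax(torch.randn(vocab), dim=-1)
p = torch.softmax(torch.randn(vocab), dim=-1)

x = torch.multinomial(q, 1, replacement=True).item()  # draft token
# Accept x with probability min(1, p(x)/q(x)); on rejection, resample
# from the residual distribution norm(max(p - q, 0)). This makes the
# combined output exactly distributed according to p.
if torch.rand(()) < min(1.0, (p[x] / q[x]).item()):
    token = x
else:
    residual = torch.clamp(p - q, min=0)
    residual = residual / residual.sum()
    token = torch.multinomial(residual, 1, replacement=True).item()
print(token)  # a valid token id in [0, vocab)
```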