Date: 2023-07-03
- [Experimental] Added support for GPT-NeoX models.
- [Experimental] Added support for BLOOM models.
- [Prototype] Added support for LLaMA models.
- Added support for more flexible tensor-parallel configurations to GPT2, OPT, and BLOOM. The number of attention heads no longer needs to be evenly divisible by tp_degree. (Note: tp_degree still needs to satisfy the runtime topology constraint for collective communication (i.e. Allreduce). For more details on supported topologies, see: Tensor-parallelism support and https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/collective-communication.html.)
- Added multi-query / multi-group attention support for GPT2.
- Fixed NaN issues for GPT2 model.
- Fixed OPT/GPT-NeoX gibberish output.
- Resolved an issue where NaN values could be produced when the context_length argument was used in GPT2/OPT.
- Known issue: cache reorder support for beam search is currently missing.
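A common way to relax the even-divisibility requirement is to pad the attention-head dimension up to the next multiple of the tensor-parallel degree so that every rank holds an equal share. The helper below is a minimal sketch of that idea; the function name and padding approach are illustrative assumptions, not the library's actual implementation:

```python
import math

def pad_head_count(n_heads: int, tp_degree: int) -> int:
    """Hypothetical illustration: round the attention-head count up to the
    next multiple of tp_degree so each tensor-parallel rank holds an equal
    number of (possibly padded) heads."""
    return math.ceil(n_heads / tp_degree) * tp_degree

# 20 heads on tp_degree=8: padded to 24, i.e. 3 heads per rank
padded = pad_head_count(20, 8)
heads_per_rank = padded // 8
```

With this scheme, tp_degree itself is still constrained by the runtime topology rules linked above; only the divisibility of the head count is relaxed.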
Date: 2023-06-12
- Added int8 weight storage for GPT2 models.
- Improved prompt context encoding performance for GPT2 models.
- Improved collective communications performance for tp-degrees 4, 8, and 24 on Inf2.
- Improved collective communications performance for tp-degrees 8 and 32 on Trn1.
- Added support for the --model-type=transformer-inference compiler flag for optimized decoder-only LLM inference.
- Added padding to the GPT-J linear layer to correctly handle odd vocabulary sizes.
- Resolved issues where the HuggingFace generate method produces incorrect results when beam_search is used.
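int8 weight storage cuts weight memory roughly in half versus 16-bit formats by keeping each weight as an 8-bit integer plus a floating-point scale. The snippet below sketches the general idea using simple symmetric per-tensor quantization; it is an illustration of the concept, not the library's actual scheme:

```python
def quantize_int8(weights):
    # Symmetric quantization: one scale maps [-max|w|, +max|w|] onto [-127, 127].
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    # Recover approximate floating-point weights from the int8 values.
    return [q * scale for q in quantized]

q, s = quantize_int8([0.5, -1.27, 0.0, 1.27])
restored = dequantize(q, s)  # close to the originals, within one scale step
```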
Date: 2023-04-28
- Added transformers-neuronx artifacts to PyPI repository.
- Added support for the Hugging Face generate() method.
- Added support for model serialization, including model saving, loading, and weight swapping.
- Added support for caching compiled artifacts.
- Improved performance by removing unnecessary KV-cache tensor resetting.
- Improved prompt context encoding performance (OPT, GPT2).
- Fixed an incorrect GPT-J amp_callback import: the GPT-J demo now imports the correct amp_callback function.
- Incorrect output with HuggingFace beam_search: When the HuggingFace generate method is configured to use beam_search, it can produce incorrect results for certain configurations. It is recommended to use other generation methods such as sample or greedy_search.
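In Hugging Face generate terms, the workaround amounts to passing num_beams=1 with do_sample=False (greedy) or do_sample=True (sampling) instead of num_beams > 1. The toy snippet below only illustrates how the two recommended selection rules differ on a single decoding step; it is a pure-Python stand-in, not modeling code:

```python
import math
import random

def greedy_pick(logits):
    # greedy_search: always take the highest-scoring token id
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, rng):
    # sample: draw a token id from the softmax distribution over logits
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return rng.choices(range(len(logits)), weights=[e / total for e in exps])[0]

logits = [0.1, 2.5, -1.0]
greedy_token = greedy_pick(logits)                     # always token 1
sampled_token = sample_pick(logits, random.Random(0))  # usually, not always, token 1
```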
Date: 2023-02-24
- Added error handling to check whether the desired generated sequence length is valid based on the model configuration.
- Improved logging:
- Reduced overly verbose compiler messages
- Disabled lazy module warnings
- Updated src/transformers_neuronx/gptj/demo.py to correctly use the amp_callback function from transformers_neuronx.gpt2.demo.
- Extended the gpt_demo.py save function to support GPT-2 and GPT-J configs.
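A sequence-length check of the kind described above typically fails fast, before compilation or generation begins. The sketch below is a hypothetical version of such a guard; the function and parameter names are assumptions for illustration, not the library's actual API:

```python
def check_sequence_length(requested_length: int, n_positions: int) -> None:
    # Reject lengths the compiled model cannot represent, rather than
    # failing later with a confusing device-side error.
    if requested_length < 1:
        raise ValueError(f"sequence length must be positive, got {requested_length}")
    if requested_length > n_positions:
        raise ValueError(
            f"requested sequence length {requested_length} exceeds the "
            f"model's maximum of {n_positions}"
        )

check_sequence_length(128, 2048)  # valid request, no error
```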
Date: 2023-02-08
First release of transformers-neuronx, a new library that enables LLM model inference on Inf2 & Trn1 using the Neuron SDK. transformers-neuronx contains optimized model implementations that are checkpoint-compatible with HuggingFace Transformers, and currently supports Transformer Decoder models like GPT2, GPT-J and OPT.