On ARM CPUs, not all processors support bfloat16. On those that do not, trying to run inference crashes with a stack trace like the following:
ERROR 01-07 18:11:29 engine.py:135] RuntimeError('"rms_norm_impl" not implemented for \'BFloat16\'')
ERROR 01-07 18:11:29 engine.py:135] Traceback (most recent call last):
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 133, in start
ERROR 01-07 18:11:29 engine.py:135] self.run_engine_loop()
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 196, in run_engine_loop
ERROR 01-07 18:11:29 engine.py:135] request_outputs = self.engine_step()
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 214, in engine_step
ERROR 01-07 18:11:29 engine.py:135] raise e
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 205, in engine_step
ERROR 01-07 18:11:29 engine.py:135] return self.engine.step()
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1394, in step
ERROR 01-07 18:11:29 engine.py:135] outputs = self.model_executor.execute_model(
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 201, in execute_model
ERROR 01-07 18:11:29 engine.py:135] output = self.driver_method_invoker(self.driver_worker,
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/executor/cpu_executor.py", line 298, in _driver_method_invoker
ERROR 01-07 18:11:29 engine.py:135] return getattr(driver, method)(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 344, in execute_model
ERROR 01-07 18:11:29 engine.py:135] output = self.model_runner.execute_model(
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 01-07 18:11:29 engine.py:135] return func(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/worker/cpu_model_runner.py", line 530, in execute_model
ERROR 01-07 18:11:29 engine.py:135] hidden_states = model_executable(
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-07 18:11:29 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-07 18:11:29 engine.py:135] return forward_call(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 477, in forward
ERROR 01-07 18:11:29 engine.py:135] hidden_states = self.model(input_ids, positions, kv_caches,
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/compilation/decorators.py", line 168, in __call__
ERROR 01-07 18:11:29 engine.py:135] return self.forward(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 340, in forward
ERROR 01-07 18:11:29 engine.py:135] hidden_states, residual = layer(
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-07 18:11:29 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-07 18:11:29 engine.py:135] return forward_call(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2.py", line 243, in forward
ERROR 01-07 18:11:29 engine.py:135] hidden_states = self.input_layernorm(hidden_states)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 01-07 18:11:29 engine.py:135] return self._call_impl(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 01-07 18:11:29 engine.py:135] return forward_call(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 24, in forward
ERROR 01-07 18:11:29 engine.py:135] return self._forward_method(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/custom_op.py", line 48, in forward_cpu
ERROR 01-07 18:11:29 engine.py:135] return self.forward_cuda(*args, **kwargs)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/layernorm.py", line 94, in forward_cuda
ERROR 01-07 18:11:29 engine.py:135] ops.rms_norm(
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 182, in rms_norm
ERROR 01-07 18:11:29 engine.py:135] torch.ops._C.rms_norm(out, input, weight, epsilon)
ERROR 01-07 18:11:29 engine.py:135] File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1116, in __call__
ERROR 01-07 18:11:29 engine.py:135] return self._op(*args, **(kwargs or {}))
ERROR 01-07 18:11:29 engine.py:135] RuntimeError: "rms_norm_impl" not implemented for 'BFloat16'
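One way to see whether the host CPU advertises the BF16 extension is to read the kernel's feature flags. This is a rough illustrative sketch, not something vLLM does today: the `Features` line and the `bf16` token in `/proc/cpuinfo` are what recent Linux kernels expose on AArch64, and both are assumptions here.

```python
# Sketch: detect the AArch64 BF16 extension from Linux CPU feature flags.
# Illustrative only -- not vLLM's own check.
def linux_cpu_has_bf16(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                key, _, value = line.partition(":")
                # AArch64 uses "Features"; x86 uses "flags".
                if key.strip().lower() in ("features", "flags"):
                    if "bf16" in value.lower().split():
                        return True
    except OSError:
        pass  # non-Linux or unreadable: assume no native bf16
    return False
```

On an M1 (or any pre-ARMv8.6 core) this returns False. Exact-token matching also avoids false positives on x86, where the corresponding capabilities appear under different flag names (`avx512_bf16`, `amx_bf16`).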
Whether to support bfloat16 is a device-dependent issue. The ideal solution is for the vLLM code to check whether the device supports it. I see two ways to address this problem:
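Whichever form the check takes, the fallback logic could be as simple as downgrading the requested dtype at configuration time. A minimal sketch follows; `resolve_cpu_dtype` is a hypothetical helper, not vLLM's actual API (its real dtype resolution lives in its model config), and float32 is chosen as the conservative fallback every CPU backend supports.

```python
import logging

def resolve_cpu_dtype(requested: str, has_native_bf16: bool) -> str:
    """Hypothetical helper: downgrade bfloat16 when the CPU lacks it.

    float32 is used as the conservative fallback that every CPU
    backend supports.
    """
    if requested == "bfloat16" and not has_native_bf16:
        logging.warning("CPU lacks native bfloat16; falling back to float32")
        return "float32"
    return requested
```

A check like this at engine startup would turn the hard crash in `rms_norm` above into a logged downgrade.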
This would also help on Macs with Apple silicon: the M1 does not support bfloat16, but newer chips do and could take advantage of this feature.
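Until such a check exists, a user-side workaround is to override the dtype so the bfloat16 kernels are never dispatched. `--dtype` is a standard vLLM engine argument; the model name below is only a placeholder.

```shell
# Workaround sketch: force float32 so the bf16 rms_norm kernel is never hit.
# Replace the model with whatever you are serving.
vllm serve Qwen/Qwen2-0.5B-Instruct --dtype float32
```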