vllm-integration with multi rdma devices error #35

@junna2016

Using the latest Mooncake code, I tested on a single machine with one prefill instance and one decode instance, using tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100.
My mooncake_config.json is shown below:

{
  "prefill_url": "127.0.0.1:8144",
  "decode_url": "127.0.0.1:8149",
  "metadata_server": "127.0.0.1:2333",
  "metadata_backend": "etcd",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1"
}
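
For anyone reproducing this, here is a small sanity-check sketch that reads the config and lists the RDMA devices it names. It assumes `device_name` is a comma-separated list, as in the config above; `rdma_devices` is a hypothetical helper of mine and is not how Mooncake itself parses the file.

```python
import json

# The config from this issue, inlined so the snippet is self-contained.
CONFIG_TEXT = """
{
  "prefill_url": "127.0.0.1:8144",
  "decode_url": "127.0.0.1:8149",
  "metadata_server": "127.0.0.1:2333",
  "metadata_backend": "etcd",
  "protocol": "rdma",
  "device_name": "mlx5_0,mlx5_1"
}
"""


def rdma_devices(config_text):
    """Return the device names listed in "device_name".

    Assumption: the field is a comma-separated string; empty entries
    and surrounding whitespace are dropped.
    """
    config = json.loads(config_text)
    raw = config.get("device_name", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


if __name__ == "__main__":
    print(rdma_devices(CONFIG_TEXT))
```

With the config above this yields two entries, mlx5_0 and mlx5_1, which is the two-NIC case that fails.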

The transfer engine then reports the following errors:

E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded

With a single RDMA device (either mlx5_0 or mlx5_1), everything works fine.
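
To make the failing path easier to see, here is a minimal sketch that tallies the failures by NIC pair. It assumes the log line layout shown above; `summarize_failures` and the regular expression are my own helpers, not part of Mooncake. All four logged failures are from local_nic mlx5_1 to peer_nic 127.0.0.1:8149@mlx5_0.

```python
import re
from collections import Counter

# Matches the tail of a "Process failed for slice" line.
# Assumption: field names and order are as in the log excerpt in this issue.
SLICE_FAILURE = re.compile(
    r"local_nic: (?P<local>[^,]+), "
    r"peer_nic: (?P<peer>[^,]+), "
    r".*?retry_cnt: (?P<retry>\d+)\): (?P<reason>.+)$"
)


def summarize_failures(log_lines):
    """Count failures per (local_nic, peer_nic, reason) triple."""
    counts = Counter()
    for line in log_lines:
        match = SLICE_FAILURE.search(line)
        if match:
            key = (match["local"], match["peer"], match["reason"].strip())
            counts[key] += 1
    return counts


# Two of the lines from the excerpt above, copied verbatim.
SAMPLE_LINES = [
    "E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed "
    "for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, "
    "dest_addr: 140532604981264, local_nic: mlx5_1, "
    "peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): "
    "transport retry counter exceeded",
    "E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed "
    "for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, "
    "dest_addr: 140532604973072, local_nic: mlx5_1, "
    "peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): "
    "transport retry counter exceeded",
]

if __name__ == "__main__":
    for (local, peer, reason), n in summarize_failures(SAMPLE_LINES).items():
        print(f"{local} -> {peer}: {n} x {reason}")
```

Running the same function over a full log would show whether any same-index pair (mlx5_0 to mlx5_0, mlx5_1 to mlx5_1) also fails, or only the cross-NIC path.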
