Use latest mooncake code, when I test with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 in a single machine, which prefill instance num is 1 and decode instance num also is 1.
My mooncake_config.json is shown as below:
{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0,mlx5_1"
}
There will occur an error in transfer_engine:
E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
And with one rdma device(mlx5_0 or mlx5_1) is ok
Use latest mooncake code, when I test with tp=1, num_rdma_nic=2, qps=2, input_len=200, output_len=100 in a single machine, which prefill instance num is 1 and decode instance num also is 1.
My mooncake_config.json is shown as below:
{
"prefill_url": "127.0.0.1:8144",
"decode_url": "127.0.0.1:8149",
"metadata_server": "127.0.0.1:2333",
"metadata_backend": "etcd",
"protocol": "rdma",
"device_name": "mlx5_0,mlx5_1"
}
There will occur an error in transfer_engine:
E1213 02:57:10.528410 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:14.286239 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 404, dest_addr: 140532604981264, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
E1213 02:57:18.044381 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 0): transport retry counter exceeded
E1213 02:57:21.802461 5811 worker_pool.cpp:274] Worker: Process failed for slice (opcode: 0, source_addr: 0x7efdf3ffd010, length: 1975, dest_addr: 140532604973072, local_nic: mlx5_1, peer_nic: 127.0.0.1:8149@mlx5_0, dest_rkey: 2105088, retry_cnt: 1): transport retry counter exceeded
And with one rdma device(mlx5_0 or mlx5_1) is ok