Merged
2 changes: 1 addition & 1 deletion docs/advanced_features/router.md
@@ -45,7 +45,7 @@ SGLang Model Gateway is a high-performance model-routing gateway for large-scale
## Architecture

### Control Plane
- - **Worker Manager** discovers capabilities (`/server_info`, `/model_info`), tracks load, and registers/removes workers in the shared registry.
+ - **Worker Manager** discovers capabilities (`/get_server_info`, `/get_model_info`), tracks load, and registers/removes workers in the shared registry.
- **Job Queue** serializes add/remove requests and exposes status (`/workers/{url}`) so clients can track onboarding progress.
- **Load Monitor** feeds cache-aware and power-of-two policies with live worker load statistics.
- **Health Checker** continuously probes workers and updates readiness, circuit breaker state, and router metrics.
2 changes: 1 addition & 1 deletion docs/advanced_features/server_arguments.md
@@ -219,7 +219,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
| Argument | Description | Defaults | Options |
| --- | --- | --- | --- |
| `--json-model-override-args` | A dictionary in JSON string format used to override default model configurations. | `{}` | Type: str |
- | `--preferred-sampling-params` | json-formatted sampling settings that will be returned in /model_info | `None` | Type: str |
+ | `--preferred-sampling-params` | json-formatted sampling settings that will be returned in /get_model_info | `None` | Type: str |

## LoRA
| Argument | Description | Defaults | Options |
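The `--preferred-sampling-params` row above takes a JSON string. As a side note, a minimal sketch of how such a value round-trips through `json.loads`; the field names are illustrative sampling knobs, not a schema taken from this diff:

```python
import json

# Illustrative only: any valid JSON object works here; these keys are
# common sampling parameters, not a documented SGLang schema.
arg_value = '{"temperature": 0.2, "top_p": 0.9, "max_new_tokens": 128}'
preferred = json.loads(arg_value)

# The parsed dict is the shape a server could echo back from /get_model_info.
print(preferred["temperature"], preferred["top_p"])  # -> 0.2 0.9
```

Because the value is parsed as JSON, shell quoting matters when passing it on the command line.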
10 changes: 5 additions & 5 deletions docs/basic_usage/native_api.ipynb
@@ -9,8 +9,8 @@
"Apart from the OpenAI compatible APIs, the SGLang Runtime also provides its native server APIs. We introduce the following APIs:\n",
"\n",
"- `/generate` (text generation model)\n",
- "- `/model_info`\n",
- "- `/server_info`\n",
+ "- `/get_model_info`\n",
+ "- `/get_server_info`\n",
"- `/health`\n",
"- `/health_generate`\n",
"- `/flush_cache`\n",
@@ -99,7 +99,7 @@
"metadata": {},
"outputs": [],
"source": [
- "url = f\"http://localhost:{port}/model_info\"\n",
+ "url = f\"http://localhost:{port}/get_model_info\"\n",
"\n",
"response = requests.get(url)\n",
"response_json = response.json()\n",
@@ -127,7 +127,7 @@
"source": [
"## Get Server Info\n",
"Gets the server information including CLI arguments, token limits, and memory pool sizes.\n",
- "- Note: `server_info` merges the following deprecated endpoints:\n",
+ "- Note: `get_server_info` merges the following deprecated endpoints:\n",
Contributor review comment (medium):

The note is updated to refer to /get_server_info, but the server implementation marks this endpoint as deprecated. This could be confusing for users reading the documentation. To improve clarity, I suggest rephrasing the note to clarify the relationship between the new and old endpoints.

Suggested change:
- "- Note: `get_server_info` merges the following deprecated endpoints:\n",
+ "- Note: The `/server_info` endpoint (aliased by the deprecated `/get_server_info`) merges the following older deprecated endpoints:\n",
" - `get_server_args`\n",
" - `get_memory_pool_size` \n",
" - `get_max_total_num_tokens`"
@@ -139,7 +139,7 @@
"metadata": {},
"outputs": [],
"source": [
- "url = f\"http://localhost:{port}/server_info\"\n",
+ "url = f\"http://localhost:{port}/get_server_info\"\n",
"\n",
"response = requests.get(url)\n",
"print_highlight(response.text)"
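Given the back-and-forth in this notebook between the `/server_info` and `/get_server_info` spellings (see the review comment above), a hedged client-side sketch of tolerating both names. `fetch_with_fallback` and `FakeResponse` are invented here for illustration and are not part of SGLang:

```python
class FakeResponse:
    """Stand-in for requests.Response so the sketch runs offline."""

    def __init__(self, status_code, payload=None):
        self.status_code = status_code
        self._payload = payload or {}

    def raise_for_status(self):
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")

    def json(self):
        return self._payload


def fetch_with_fallback(get, base_url, primary, fallback):
    """Try the primary endpoint path; fall back to the alias on a 404.

    `get` is any callable returning an object with .status_code,
    .raise_for_status(), and .json() (e.g. requests.get); it is injected
    so the sketch stays testable without a live server.
    """
    resp = get(base_url + primary)
    if resp.status_code == 404:
        resp = get(base_url + fallback)
    resp.raise_for_status()
    return resp.json()


def fake_get(url):
    # Pretend this server only knows the /get_* spelling.
    if url.endswith("/get_server_info"):
        return FakeResponse(200, {"version": "x"})
    return FakeResponse(404)


info = fetch_with_fallback(
    fake_get, "http://localhost:30000", "/server_info", "/get_server_info"
)
print(info)  # -> {'version': 'x'}
```

With a real server, `requests.get` would be passed in place of `fake_get`.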
2 changes: 1 addition & 1 deletion docs/developer_guide/bench_serving.md
@@ -352,4 +352,4 @@ python3 -m sglang.bench_serving \
### Notes

- The script raises the file descriptor soft limit (`RLIMIT_NOFILE`) to help with many concurrent connections.
- - For sglang, `/server_info` is queried post-run to report speculative decoding accept length when available.
+ - For sglang, `/get_server_info` is queried post-run to report speculative decoding accept length when available.
4 changes: 2 additions & 2 deletions python/sglang/bench_one_batch_server.py
@@ -339,7 +339,7 @@ def run_one_case(
output_throughput = batch_size * output_len / (latency - last_ttft)
overall_throughput = batch_size * (input_len + output_len) / latency

- server_info = requests.get(url + "/server_info").json()
+ server_info = requests.get(url + "/get_server_info").json()
internal_state = server_info.get("internal_states", [{}])
last_gen_throughput = internal_state[0].get("last_gen_throughput", None) or -1
acc_length = internal_state[0].get("avg_spec_accept_length", None) or -1
@@ -451,7 +451,7 @@ def run_benchmark(server_args: ServerArgs, bench_args: BenchArgs):
proc, base_url = launch_server_process(server_args)

# Get tokenizer
- server_info = requests.get(base_url + "/server_info").json()
+ server_info = requests.get(base_url + "/get_server_info").json()
if "tokenizer_path" in server_info:
tokenizer_path = server_info["tokenizer_path"]
elif "prefill" in server_info:
4 changes: 2 additions & 2 deletions python/sglang/bench_serving.py
@@ -2062,7 +2062,7 @@ async def limited_request_func(request_func_input, pbar):

if "sglang" in backend:
server_info = requests.get(
- base_url + "/server_info", headers=get_auth_headers()
+ base_url + "/get_server_info", headers=get_auth_headers()
)
if server_info.status_code == 200:
server_info_json = server_info.json()
@@ -2180,7 +2180,7 @@ async def limited_request_func(request_func_input, pbar):
print("{:<40} {:<10.2f}".format("Max ITL (ms):", metrics.max_itl_ms))
print("=" * 50)

- resp = requests.get(base_url + "/server_info", headers=get_auth_headers())
+ resp = requests.get(base_url + "/get_server_info", headers=get_auth_headers())
server_info = resp.json() if resp.status_code == 200 else None

if (
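Several call sites in this PR index `server_info["internal_states"][0]` directly. A hedged sketch of a more defensive accessor, with key names assumed from the payload shapes visible in this diff (the field only exists when speculative decoding is enabled):

```python
def avg_spec_accept_length(server_info):
    """Pull the speculative-decoding accept length out of a
    /get_server_info payload, tolerating missing keys.

    Returns None when the field is unavailable instead of raising
    KeyError/IndexError the way direct indexing would.
    """
    if not isinstance(server_info, dict):
        return None
    states = server_info.get("internal_states") or [{}]
    return states[0].get("avg_spec_accept_length")


payload = {"internal_states": [{"avg_spec_accept_length": 3.1}]}
print(avg_spec_accept_length(payload))  # -> 3.1
print(avg_spec_accept_length({}))       # -> None
```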
6 changes: 3 additions & 3 deletions python/sglang/lang/backend/runtime_endpoint.py
@@ -38,7 +38,7 @@ def __init__(
self.verify = verify

res = http_request(
- self.base_url + "/model_info",
+ self.base_url + "/get_model_info",
api_key=self.api_key,
verify=self.verify,
)
@@ -66,7 +66,7 @@ def flush_cache(self):

def get_server_info(self):
res = http_request(
- self.base_url + "/server_info",
+ self.base_url + "/get_server_info",
api_key=self.api_key,
verify=self.verify,
)
@@ -514,7 +514,7 @@ def encode(

async def get_server_info(self):
async with aiohttp.ClientSession() as session:
- async with session.get(f"{self.url}/server_info") as response:
+ async with session.get(f"{self.url}/get_server_info") as response:
if response.status == 200:
return await response.json()
else:
2 changes: 1 addition & 1 deletion python/sglang/profiler.py
@@ -41,7 +41,7 @@ def run_profile(
# Dump server args.
file_path = Path(output_dir) / "server_args.json"
if not file_path.exists():
- response = requests.get(url + "/server_info")
+ response = requests.get(url + "/get_server_info")
response.raise_for_status()
server_args_data = response.json()
with open(file_path, "w") as file:
2 changes: 1 addition & 1 deletion python/sglang/srt/entrypoints/http_server.py
@@ -1480,7 +1480,7 @@ def _execute_server_warmup(
for _ in range(120):
time.sleep(1)
try:
- res = requests.get(url + "/model_info", timeout=5, headers=headers)
+ res = requests.get(url + "/get_model_info", timeout=5, headers=headers)
Contributor review comment (medium):

This change reverts the server warmup logic to call the deprecated /get_model_info endpoint on itself. This will cause a deprecation warning to be logged every time the server starts, which can add noise to the logs. If the /model_info endpoint is functional during the warmup phase, consider using it here to avoid these warnings.
assert res.status_code == 200, f"{res=}, {res.text=}"
success = True
break
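The warmup loop above polls the endpoint up to 120 times, one second apart, swallowing connection errors until the server answers. A runnable sketch of that retry shape with the prober injected for offline testing; `wait_until_ready` and `flaky_probe` are names invented here, and the real loop sleeps one second per attempt rather than zero:

```python
import time


def wait_until_ready(probe, attempts=120, delay=0.0):
    """Poll `probe` (a zero-arg callable returning True when the server
    is up) until it succeeds or the attempt budget is spent.

    Exceptions from the probe (e.g. connection refused while the server
    is still starting) are treated as "not ready yet".
    """
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass
        time.sleep(delay)
    return False


calls = {"n": 0}


def flaky_probe():
    # Simulate a server that only comes up on the third poll.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not up yet")
    return True


print(wait_until_ready(flaky_probe))  # -> True
print(calls["n"])                     # -> 3
```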
2 changes: 1 addition & 1 deletion python/sglang/srt/server_args.py
@@ -2801,7 +2801,7 @@ def add_cli_args(parser: argparse.ArgumentParser):
parser.add_argument(
"--preferred-sampling-params",
type=str,
- help="json-formatted sampling settings that will be returned in /model_info",
+ help="json-formatted sampling settings that will be returned in /get_model_info",
)

# LoRA
2 changes: 1 addition & 1 deletion scripts/playground/bench_speculative.py
@@ -120,7 +120,7 @@ def send_one_batch(base_url, num_prompts, batch_size, processor, is_multimodal):
acc_length = results["accept_length"] or 1.0
avg_output_token = results["total_output_tokens"] / results["completed"]

- server_info = requests.get(base_url + "/server_info").json()
+ server_info = requests.get(base_url + "/get_server_info").json()
# We use 20% percentile instead of median on purpose
step_time = np.percentile(
server_info["internal_states"][0]["step_time_dict"][str(batch_size)], 20
2 changes: 1 addition & 1 deletion test/manual/test_deepseek_v32_cp_single_node.py
@@ -79,7 +79,7 @@ def test_a_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/manual/test_eagle_infer_beta_dp_attention.py
@@ -77,7 +77,7 @@ def test_a_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
20 changes: 10 additions & 10 deletions test/manual/test_weight_version.py
@@ -3,7 +3,7 @@

This test suite verifies the weight_version feature implementation including:
1. Default weight_version setting
- 2. /weight_version endpoint
+ 2. /get_weight_version endpoint
3. /update_weight_version endpoint
4. /generate request meta_info contains weight_version
5. OpenAI API response metadata contains weight_version
@@ -48,13 +48,13 @@ def tearDownClass(cls):
def test_weight_version_comprehensive(self):
"""Comprehensive test for all weight_version functionality."""

- response = requests.get(f"{self.base_url}/model_info")
+ response = requests.get(f"{self.base_url}/get_model_info")
self.assertEqual(response.status_code, 200)
data = response.json()
self.assertIn("weight_version", data)
self.assertEqual(data["weight_version"], "test_version_1.0")

- response = requests.get(f"{self.base_url}/weight_version")
+ response = requests.get(f"{self.base_url}/get_weight_version")
self.assertEqual(response.status_code, 200)
data = response.json()
self.assertIn("weight_version", data)
@@ -114,7 +114,7 @@ def test_weight_version_comprehensive(self):
self.assertTrue(data["success"])
self.assertEqual(data["new_version"], "updated_version_2.0")

- response = requests.get(f"{self.base_url}/weight_version")
+ response = requests.get(f"{self.base_url}/get_weight_version")
self.assertEqual(response.status_code, 200)
data = response.json()
self.assertEqual(data["weight_version"], "updated_version_2.0")
@@ -148,13 +148,13 @@ def test_weight_version_comprehensive(self):
self.assertTrue(data["success"])
self.assertEqual(data["new_version"], "final_version_3.0")

- # Check /weight_version
- response = requests.get(f"{self.base_url}/weight_version")
+ # Check /get_weight_version
+ response = requests.get(f"{self.base_url}/get_weight_version")
self.assertEqual(response.status_code, 200)
self.assertEqual(response.json()["weight_version"], "final_version_3.0")

- # Check /model_info
- response = requests.get(f"{self.base_url}/model_info")
+ # Check /get_model_info
+ response = requests.get(f"{self.base_url}/get_model_info")
self.assertEqual(response.status_code, 200)
self.assertEqual(response.json()["weight_version"], "final_version_3.0")

@@ -193,7 +193,7 @@ def test_update_weight_version_with_weight_updates(self):
print("Testing weight_version update with real weight operations...")

# Get current model info for reference
- model_info_response = requests.get(f"{self.base_url}/model_info")
+ model_info_response = requests.get(f"{self.base_url}/get_model_info")
self.assertEqual(model_info_response.status_code, 200)
current_model_path = model_info_response.json()["model_path"]

@@ -214,7 +214,7 @@
)

# Verify version was updated
- version_response = requests.get(f"{self.base_url}/weight_version")
+ version_response = requests.get(f"{self.base_url}/get_weight_version")
self.assertEqual(version_response.status_code, 200)
self.assertEqual(
version_response.json()["weight_version"], "disk_update_v2.0.0"
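The test above verifies that `/get_model_info` and `/get_weight_version` report the same version string after an update. The same invariant expressed as a small reusable predicate; `weight_versions_consistent` is a name invented here for illustration, not part of the test suite:

```python
def weight_versions_consistent(*payloads):
    """True when every payload reports the same non-empty weight_version.

    Mirrors the cross-endpoint checks in the test: all endpoints must
    agree, and a missing field counts as inconsistent.
    """
    versions = {p.get("weight_version") for p in payloads}
    return len(versions) == 1 and None not in versions


a = {"weight_version": "final_version_3.0", "model_path": "m"}
b = {"weight_version": "final_version_3.0"}
print(weight_versions_consistent(a, b))                        # -> True
print(weight_versions_consistent(a, {"weight_version": "v2"}))  # -> False
```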
2 changes: 1 addition & 1 deletion test/srt/ep/test_deepep_large.py
@@ -143,7 +143,7 @@ def test_gsm8k(self):

self.assertGreater(metrics["accuracy"], 0.92)

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
4 changes: 2 additions & 2 deletions test/srt/ep/test_deepep_small.py
@@ -307,7 +307,7 @@ def test_gsm8k(self):

self.assertGreater(metrics["accuracy"], 0.60)

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
@@ -381,7 +381,7 @@ def test_gsm8k(self):

self.assertGreater(metrics["accuracy"], 0.60)

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/srt/hicache/test_hicache_variants.py
@@ -152,7 +152,7 @@ def test_mmlu(self):
self.assertGreaterEqual(metrics["score"], self.expected_mmlu_score)

# EAGLE-specific check
- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
print(f"{server_info=}")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
2 changes: 1 addition & 1 deletion test/srt/quant/test_w4a8_deepseek_v3.py
@@ -103,7 +103,7 @@ def test_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
6 changes: 3 additions & 3 deletions test/srt/rl/test_update_weights_from_disk.py
@@ -113,7 +113,7 @@ def run_decode_random(self, max_new_tokens=32):
return response.json()

def get_model_info(self):
- response = requests.get(self.base_url + "/model_info")
+ response = requests.get(self.base_url + "/get_model_info")
model_path = response.json()["model_path"]
print(json.dumps(response.json()))
return model_path
@@ -254,7 +254,7 @@ def run_decode(self, max_new_tokens=32):
return response.json()

def get_model_info(self):
- response = requests.get(self.base_url + "/model_info")
+ response = requests.get(self.base_url + "/get_model_info")
model_path = response.json()["model_path"]
print(json.dumps(response.json()))
return model_path
@@ -386,7 +386,7 @@ def run_decode():
return response.json()["text"]

def get_model_info():
- response = requests.get(base_url + "/model_info")
+ response = requests.get(base_url + "/get_model_info")
model_path = response.json()["model_path"]
print(json.dumps(response.json()))
return model_path
2 changes: 1 addition & 1 deletion test/srt/rl/test_update_weights_from_tensor.py
@@ -209,7 +209,7 @@ def run_decode(self, max_new_tokens=32):
return response.json()

def get_model_info(self):
- response = requests.get(self.base_url + "/model_info")
+ response = requests.get(self.base_url + "/get_model_info")
model_path = response.json()["model_path"]
print(json.dumps(response.json()))
return model_path
4 changes: 2 additions & 2 deletions test/srt/test_data_parallelism.py
@@ -65,12 +65,12 @@ def test_update_weight(self):

def test_get_memory_pool_size(self):
# use `get_server_info` instead since `get_memory_pool_size` is merged into `get_server_info`
- response = requests.get(self.base_url + "/server_info")
+ response = requests.get(self.base_url + "/get_server_info")
assert response.status_code == 200

time.sleep(1)

- response = requests.get(self.base_url + "/server_info")
+ response = requests.get(self.base_url + "/get_server_info")
assert response.status_code == 200


2 changes: 1 addition & 1 deletion test/srt/test_deepseek_v32_mtp.py
@@ -69,7 +69,7 @@ def test_a_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/srt/test_deepseek_v3_fp4_4gpu.py
@@ -143,7 +143,7 @@ def test_a_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/srt/test_deepseek_v3_mtp.py
@@ -67,7 +67,7 @@ def test_a_gsm8k(
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/srt/test_dp_attention.py
@@ -112,7 +112,7 @@ def test_gsm8k(self):

self.assertGreater(metrics["accuracy"], 0.60)

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
avg_spec_accept_length = server_info.json()["internal_states"][0][
"avg_spec_accept_length"
]
2 changes: 1 addition & 1 deletion test/srt/test_eagle_dp_attention.py
@@ -78,7 +78,7 @@ def test_a_gsm8k(self):
metrics = run_eval_few_shot_gsm8k(args)
print(f"{metrics=}")

- server_info = requests.get(self.base_url + "/server_info")
+ server_info = requests.get(self.base_url + "/get_server_info")
server_data = server_info.json()

# Try to get avg_spec_accept_length
2 changes: 1 addition & 1 deletion test/srt/test_eagle_infer_b.py
@@ -144,7 +144,7 @@ def test_gsm8k(self):
print(f"{metrics=}")
self.assertGreater(metrics["accuracy"], 0.20)

- server_info = requests.get(self.base_url + "/server_info").json()
+ server_info = requests.get(self.base_url + "/get_server_info").json()
avg_spec_accept_length = server_info["internal_states"][0][
"avg_spec_accept_length"
]