ci: 增强syscall测试监控脚本的健壮性#1364
Conversation
fslongjin
commented
Nov 12, 2025
- 添加超时配置和进程自动检测机制
- 改进资源清理和错误处理逻辑
- 增加详细的诊断信息和进度报告
- 优化测试流程监控和超时处理
- 添加超时配置和进程自动检测机制 - 改进资源清理和错误处理逻辑 - 增加详细的诊断信息和进度报告 - 优化测试流程监控和超时处理 Signed-off-by: longjin <longjin@DragonOS.org>
c63f59b to
157c279
Compare
There was a problem hiding this comment.
Pull Request Overview
This PR enhances the syscall test monitoring script's robustness and reliability for CI/CD environments. The main focus is on adding comprehensive timeout mechanisms, automatic QEMU process detection, improved error handling, and better diagnostic reporting.
Key changes:
- Added multiple timeout configurations (boot, test start, idle, single test, and overall test timeouts) to detect and handle stuck tests
- Implemented automatic QEMU process detection when PID is not explicitly provided
- Enhanced resource cleanup with graceful shutdown attempts before forceful termination
- Added detailed diagnostic information and progress reporting for better visibility during test execution
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| user/apps/tests/syscall/gvisor/monitor_test_results.sh | Major enhancement with timeout monitoring, auto-detection of QEMU processes, improved cleanup logic, diagnostic information display, and comprehensive progress tracking |
| Makefile | Removed non-functional QEMU_PID assignment and added sleep delay to allow QEMU startup before monitoring begins |
| .github/workflows/test-x86.yml | Added 60-minute timeout to the integration test job as an outer safety net |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1) | ||
| if [ -n "$RUN_LINE_NUM" ]; then | ||
| # 从测试开始行之后查找结果 | ||
| if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -aq "\[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME"; then |
There was a problem hiding this comment.
The grep pattern uses \| for alternation which is correct for basic regular expressions, but the -a flag is being used with -q which means binary mode is treated as text but output is suppressed. The pattern \[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME could match incorrect results if $TEST_NAME contains special regex characters. Consider using grep -F (fixed string) for the test name matching, or properly escape the test name with a function that escapes regex special characters.
| if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -aq "\[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME"; then | |
| if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -Fqa "[ OK ] $TEST_NAME" || \ | |
| tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -Fqa "[ FAILED ] $TEST_NAME"; then |
| LAST_RUN_LINE=$(grep -a "\[ RUN \]" "$SERIAL_FILE" | tail -n 1) | ||
| if [ -n "$LAST_RUN_LINE" ]; then | ||
| TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //') | ||
|
|
||
| # 检查这个测试是否已经完成(有OK或FAILED) | ||
| # 从测试开始行开始查找,避免误判 | ||
| RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1) | ||
| if [ -n "$RUN_LINE_NUM" ]; then |
There was a problem hiding this comment.
The function performs multiple full-file grep operations (grep -a "\[ RUN \]" "$SERIAL_FILE" on line 164, grep -an "\[ RUN \].*$TEST_NAME" on line 170) which can become slow as the serial file grows during long test runs. Consider using tail to limit the search to recent lines, or maintaining state to track the last processed line number to avoid re-scanning the entire file repeatedly.
| LAST_RUN_LINE=$(grep -a "\[ RUN \]" "$SERIAL_FILE" | tail -n 1) | |
| if [ -n "$LAST_RUN_LINE" ]; then | |
| TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //') | |
| # 检查这个测试是否已经完成(有OK或FAILED) | |
| # 从测试开始行开始查找,避免误判 | |
| RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1) | |
| if [ -n "$RUN_LINE_NUM" ]; then | |
| # 只检查最近1000行,避免全文件扫描 | |
| TAIL_LINES=1000 | |
| LAST_RUN_LINE=$(tail -n $TAIL_LINES "$SERIAL_FILE" | grep -a "\[ RUN \]" | tail -n 1) | |
| if [ -n "$LAST_RUN_LINE" ]; then | |
| TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //') | |
| # 检查这个测试是否已经完成(有OK或FAILED) | |
| # 从测试开始行开始查找,避免误判 | |
| # 获取serial文件总行数 | |
| TOTAL_LINES=$(wc -l < "$SERIAL_FILE") | |
| # 获取tail的起始行号 | |
| TAIL_START_LINE=$((TOTAL_LINES > TAIL_LINES ? TOTAL_LINES - TAIL_LINES + 1 : 1)) | |
| # 在tail输出中查找RUN行号(相对tail的行号) | |
| REL_RUN_LINE_NUM=$(tail -n $TAIL_LINES "$SERIAL_FILE" | grep -an "\[ RUN \].*$TEST_NAME" | tail -n 1 | cut -d: -f1) | |
| if [ -n "$REL_RUN_LINE_NUM" ]; then | |
| # 转换为serial文件的绝对行号 | |
| RUN_LINE_NUM=$((TAIL_START_LINE + REL_RUN_LINE_NUM - 1)) |
| # 查找并杀死所有相关子进程 | ||
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') | ||
| for child_pid in $CHILD_PIDS; do | ||
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | ||
| echo "[监控] 终止子进程 (PID: $child_pid)" | ||
| kill -TERM $child_pid 2>/dev/null | ||
| fi | ||
| done | ||
| sleep 2 | ||
|
|
||
| # 强制杀死主进程 | ||
| kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null | ||
| # 再次强制杀死所有子进程 | ||
| for child_pid in $CHILD_PIDS; do | ||
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | ||
| kill -9 $child_pid 2>/dev/null | ||
| fi | ||
| done | ||
| fi |
There was a problem hiding this comment.
The cleanup process for child processes iterates through $CHILD_PIDS twice (lines 59-64 and 70-74), but CHILD_PIDS is set once at line 58. If child processes spawn their own children between the first and second iteration, these grandchildren won't be cleaned up. Consider using a recursive process tree kill approach or repeatedly fetching the process tree until no children remain.
| # 查找并杀死所有相关子进程 | |
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') | |
| for child_pid in $CHILD_PIDS; do | |
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | |
| echo "[监控] 终止子进程 (PID: $child_pid)" | |
| kill -TERM $child_pid 2>/dev/null | |
| fi | |
| done | |
| sleep 2 | |
| # 强制杀死主进程 | |
| kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null | |
| # 再次强制杀死所有子进程 | |
| for child_pid in $CHILD_PIDS; do | |
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | |
| kill -9 $child_pid 2>/dev/null | |
| fi | |
| done | |
| fi | |
| # 递归查找并杀死所有相关子进程,直到没有子进程为止 | |
| while true; do | |
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') | |
| FOUND_CHILDREN=0 | |
| for child_pid in $CHILD_PIDS; do | |
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | |
| echo "[监控] 终止子进程 (PID: $child_pid)" | |
| kill -TERM $child_pid 2>/dev/null | |
| FOUND_CHILDREN=1 | |
| fi | |
| done | |
| sleep 2 | |
| # 强制杀死主进程 | |
| kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null | |
| # 再次强制杀死所有子进程 | |
| for child_pid in $CHILD_PIDS; do | |
| if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then | |
| kill -9 $child_pid 2>/dev/null | |
| fi | |
| done | |
| # 如果没有子进程了,则退出循环 | |
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') | |
| if [ -z "$CHILD_PIDS" ] || [ "$CHILD_PIDS" = "$QEMU_PID" ]; then | |
| break | |
| fi | |
| done |
| # 如果进程还在,检查是否有子进程需要一起杀死 | ||
| if kill -0 $QEMU_PID 2>/dev/null; then | ||
| # 查找并杀死所有相关子进程 | ||
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') |
There was a problem hiding this comment.
The pstree command may not be available in all minimal shell environments (e.g., busybox). Since this script uses #!/bin/busybox sh, you should check if pstree is available before using it, or use an alternative approach like ps --ppid to find child processes.
| CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ') | |
| # 递归查找所有子进程 | |
| get_child_pids() { | |
| local parent_pid="$1" | |
| local children=$(ps -eo pid,ppid | awk -v ppid="$parent_pid" '$2==ppid {print $1}') | |
| for child in $children; do | |
| echo "$child" | |
| get_child_pids "$child" | |
| done | |
| } | |
| CHILD_PIDS=$(get_child_pids "$QEMU_PID") |
| fi | ||
|
|
||
| # 每60秒报告一次进度 | ||
| if [ $((ELAPSED % 60)) -eq 0 ]; then |
There was a problem hiding this comment.
The progress report at line 310 uses a modulo operation [ $((ELAPSED % 60)) -eq 0 ] which will trigger exactly when ELAPSED is divisible by 60. However, with a 10-second sleep interval, this might miss the exact 60-second mark and never trigger. Consider using [ $((ELAPSED % 60)) -lt 10 ] instead to ensure the progress report is displayed at least once per minute.
| if [ $((ELAPSED % 60)) -eq 0 ]; then | |
| if [ $((ELAPSED % 60)) -lt 10 ]; then |
| BOOT_TIMEOUT=300 # DragonOS开机超时(5分钟) | ||
| TEST_START_TIMEOUT=600 # 测试程序启动超时(10分钟) | ||
| TEST_TIMEOUT=1800 # 整个测试超时(30分钟) | ||
| IDLE_TIMEOUT=120 # 无输出超时(5分钟) |
There was a problem hiding this comment.
The comment states "无输出超时(5分钟)" but the value is set to 120 seconds (2 minutes), not 5 minutes. This should be corrected to either "无输出超时(2分钟)" or the value should be changed to 300 seconds to match the comment.
| IDLE_TIMEOUT=120 # 无输出超时(5分钟) | |
| IDLE_TIMEOUT=300 # 无输出超时(5分钟) |
| $(MAKE) qemu-nographic AUTO_TEST=syscall SYSCALL_TEST_DIR=/opt/tests/gvisor & | ||
| QEMU_PID=$$! | ||
| sleep 5 | ||
| bash user/apps/tests/syscall/gvisor/monitor_test_results.sh || { bash user/apps/tests/syscall/gvisor/disable_compile_gvisor.sh; exit 1; } |
There was a problem hiding this comment.
[nitpick] The removed line QEMU_PID=$$! was on its own line, which wouldn't have worked correctly in a Makefile anyway (the variable would be set but not exported to the subsequent command). However, the replacement with only sleep 5 means the QEMU PID is never passed to the monitoring script. While the script has auto-detection logic, it would be more reliable to explicitly pass the PID. Consider: $(MAKE) qemu-nographic AUTO_TEST=syscall SYSCALL_TEST_DIR=/opt/tests/gvisor & sleep 5 && QEMU_PID=$$! bash user/apps/tests/syscall/gvisor/monitor_test_results.sh
| if [ -f "$SERIAL_FILE" ]; then | ||
| CURRENT_TIME=$(date +%s) | ||
| CURRENT_LINE_COUNT=$(wc -l < "$SERIAL_FILE" 2>/dev/null || echo 0) | ||
| FILE_MTIME=$(stat -c %Y "$SERIAL_FILE" 2>/dev/null || echo 0) |
There was a problem hiding this comment.
The stat -c %Y option is GNU-specific and may not work on all systems (e.g., BSD/macOS uses stat -f %m). Since the script uses busybox, verify that busybox's stat implementation supports the -c option, or use a more portable alternative like checking file modification with ls -l or find.
| FILE_MTIME=$(stat -c %Y "$SERIAL_FILE" 2>/dev/null || echo 0) | |
| FILE_MTIME=$(find "$SERIAL_FILE" -printf '%T@\n' 2>/dev/null | cut -d. -f1) | |
| [ -z "$FILE_MTIME" ] && FILE_MTIME=0 |
| check_qemu_alive() { | ||
| if [ -n "$QEMU_PID" ]; then | ||
| # 使用ps检查进程是否存在,避免PID重用问题 | ||
| ps aux | grep -E "[q]emu-system-x86_64" | awk '{print $2}' | grep -q "^$QEMU_PID$" |
There was a problem hiding this comment.
The regex pattern [q]emu-system-x86_64 in the grep command is used to avoid matching the grep process itself, but this assumes the QEMU executable is specifically qemu-system-x86_64. However, the QEMU_SEARCH_PATTERN variable (line 15) is more flexible and includes the AUTO_TEST=syscall argument. For consistency and to avoid false positives from other QEMU instances, consider using $QEMU_SEARCH_PATTERN here instead of the hardcoded pattern.
| ps aux | grep -E "[q]emu-system-x86_64" | awk '{print $2}' | grep -q "^$QEMU_PID$" | |
| ps aux | grep -E "$QEMU_SEARCH_PATTERN" | awk '{print $2}' | grep -q "^$QEMU_PID$" |
| local current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "无") | ||
| local last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0") |
There was a problem hiding this comment.
[nitpick] The local keyword is used in a POSIX shell script (#!/bin/busybox sh). While busybox ash supports local, it's not part of the POSIX standard. For maximum portability, consider avoiding local or explicitly documenting that the script requires busybox or a shell that supports local.
| local current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "无") | |
| local last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0") | |
| current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "无") | |
| last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0") |