Skip to content

ci: 增强syscall测试监控脚本的健壮性#1364

Merged
fslongjin merged 1 commit intoDragonOS-Community:masterfrom
fslongjin:fix-ci-intergrate-test-hang
Nov 12, 2025
Merged

ci: 增强syscall测试监控脚本的健壮性#1364
fslongjin merged 1 commit intoDragonOS-Community:masterfrom
fslongjin:fix-ci-intergrate-test-hang

Conversation

@fslongjin
Copy link
Member

  • 添加超时配置和进程自动检测机制
  • 改进资源清理和错误处理逻辑
  • 增加详细的诊断信息和进度报告
  • 优化测试流程监控和超时处理

- 添加超时配置和进程自动检测机制
- 改进资源清理和错误处理逻辑
- 增加详细的诊断信息和进度报告
- 优化测试流程监控和超时处理

Signed-off-by: longjin <longjin@DragonOS.org>
Copilot AI review requested due to automatic review settings November 12, 2025 06:53
@fslongjin fslongjin force-pushed the fix-ci-intergrate-test-hang branch from c63f59b to 157c279 Compare November 12, 2025 06:54
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the syscall test monitoring script's robustness and reliability for CI/CD environments. The main focus is on adding comprehensive timeout mechanisms, automatic QEMU process detection, improved error handling, and better diagnostic reporting.

Key changes:

  • Added multiple timeout configurations (boot, test start, idle, single test, and overall test timeouts) to detect and handle stuck tests
  • Implemented automatic QEMU process detection when PID is not explicitly provided
  • Enhanced resource cleanup with graceful shutdown attempts before forceful termination
  • Added detailed diagnostic information and progress reporting for better visibility during test execution

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 13 comments.

File Description
user/apps/tests/syscall/gvisor/monitor_test_results.sh Major enhancement with timeout monitoring, auto-detection of QEMU processes, improved cleanup logic, diagnostic information display, and comprehensive progress tracking
Makefile Removed non-functional QEMU_PID assignment and added sleep delay to allow QEMU startup before monitoring begins
.github/workflows/test-x86.yml Added 60-minute timeout to the integration test job as an outer safety net

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1)
if [ -n "$RUN_LINE_NUM" ]; then
# 从测试开始行之后查找结果
if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -aq "\[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME"; then
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The grep pattern uses \| for alternation which is correct for basic regular expressions, but the -a flag is being used with -q which means binary mode is treated as text but output is suppressed. The pattern \[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME could match incorrect results if $TEST_NAME contains special regex characters. Consider using grep -F (fixed string) for the test name matching, or properly escape the test name with a function that escapes regex special characters.

Suggested change
if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -aq "\[ OK \] $TEST_NAME\|\[ FAILED \] $TEST_NAME"; then
if tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -Fqa "[ OK ] $TEST_NAME" || \
tail -n +$((RUN_LINE_NUM + 1)) "$SERIAL_FILE" | grep -Fqa "[ FAILED ] $TEST_NAME"; then

Copilot uses AI. Check for mistakes.
Comment on lines +164 to +171
LAST_RUN_LINE=$(grep -a "\[ RUN \]" "$SERIAL_FILE" | tail -n 1)
if [ -n "$LAST_RUN_LINE" ]; then
TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //')

# 检查这个测试是否已经完成(有OK或FAILED)
# 从测试开始行开始查找,避免误判
RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1)
if [ -n "$RUN_LINE_NUM" ]; then
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function performs multiple full-file grep operations (grep -a "\[ RUN \]" "$SERIAL_FILE" on line 164, grep -an "\[ RUN \].*$TEST_NAME" on line 170) which can become slow as the serial file grows during long test runs. Consider using tail to limit the search to recent lines, or maintaining state to track the last processed line number to avoid re-scanning the entire file repeatedly.

Suggested change
LAST_RUN_LINE=$(grep -a "\[ RUN \]" "$SERIAL_FILE" | tail -n 1)
if [ -n "$LAST_RUN_LINE" ]; then
TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //')
# 检查这个测试是否已经完成(有OK或FAILED)
# 从测试开始行开始查找,避免误判
RUN_LINE_NUM=$(grep -an "\[ RUN \].*$TEST_NAME" "$SERIAL_FILE" | tail -n 1 | cut -d: -f1)
if [ -n "$RUN_LINE_NUM" ]; then
# 只检查最近1000行,避免全文件扫描
TAIL_LINES=1000
LAST_RUN_LINE=$(tail -n $TAIL_LINES "$SERIAL_FILE" | grep -a "\[ RUN \]" | tail -n 1)
if [ -n "$LAST_RUN_LINE" ]; then
TEST_NAME=$(echo "$LAST_RUN_LINE" | sed 's/.*\[ RUN \] //')
# 检查这个测试是否已经完成(有OK或FAILED)
# 从测试开始行开始查找,避免误判
# 获取serial文件总行数
TOTAL_LINES=$(wc -l < "$SERIAL_FILE")
# 获取tail的起始行号
TAIL_START_LINE=$((TOTAL_LINES > TAIL_LINES ? TOTAL_LINES - TAIL_LINES + 1 : 1))
# 在tail输出中查找RUN行号(相对tail的行号)
REL_RUN_LINE_NUM=$(tail -n $TAIL_LINES "$SERIAL_FILE" | grep -an "\[ RUN \].*$TEST_NAME" | tail -n 1 | cut -d: -f1)
if [ -n "$REL_RUN_LINE_NUM" ]; then
# 转换为serial文件的绝对行号
RUN_LINE_NUM=$((TAIL_START_LINE + REL_RUN_LINE_NUM - 1))

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +75
# 查找并杀死所有相关子进程
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
echo "[监控] 终止子进程 (PID: $child_pid)"
kill -TERM $child_pid 2>/dev/null
fi
done
sleep 2

# 强制杀死主进程
kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null
# 再次强制杀死所有子进程
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
kill -9 $child_pid 2>/dev/null
fi
done
fi
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanup process for child processes iterates through $CHILD_PIDS twice (lines 59-64 and 70-74), but CHILD_PIDS is set once at line 58. If child processes spawn their own children between the first and second iteration, these grandchildren won't be cleaned up. Consider using a recursive process tree kill approach or repeatedly fetching the process tree until no children remain.

Suggested change
# 查找并杀死所有相关子进程
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
echo "[监控] 终止子进程 (PID: $child_pid)"
kill -TERM $child_pid 2>/dev/null
fi
done
sleep 2
# 强制杀死主进程
kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null
# 再次强制杀死所有子进程
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
kill -9 $child_pid 2>/dev/null
fi
done
fi
# 递归查找并杀死所有相关子进程,直到没有子进程为止
while true; do
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
FOUND_CHILDREN=0
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
echo "[监控] 终止子进程 (PID: $child_pid)"
kill -TERM $child_pid 2>/dev/null
FOUND_CHILDREN=1
fi
done
sleep 2
# 强制杀死主进程
kill -0 $QEMU_PID 2>/dev/null && kill -9 $QEMU_PID 2>/dev/null
# 再次强制杀死所有子进程
for child_pid in $CHILD_PIDS; do
if [ "$child_pid" != "$QEMU_PID" ] && kill -0 $child_pid 2>/dev/null; then
kill -9 $child_pid 2>/dev/null
fi
done
# 如果没有子进程了,则退出循环
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
if [ -z "$CHILD_PIDS" ] || [ "$CHILD_PIDS" = "$QEMU_PID" ]; then
break
fi
done

Copilot uses AI. Check for mistakes.
# 如果进程还在,检查是否有子进程需要一起杀死
if kill -0 $QEMU_PID 2>/dev/null; then
# 查找并杀死所有相关子进程
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pstree command may not be available in all minimal shell environments (e.g., busybox). Since this script uses #!/bin/busybox sh, you should check if pstree is available before using it, or use an alternative approach like ps --ppid to find child processes.

Suggested change
CHILD_PIDS=$(pstree -p $QEMU_PID | grep -o '([0-9]\+)' | grep -o '[0-9]\+' | tr '\n' ' ')
# 递归查找所有子进程
get_child_pids() {
local parent_pid="$1"
local children=$(ps -eo pid,ppid | awk -v ppid="$parent_pid" '$2==ppid {print $1}')
for child in $children; do
echo "$child"
get_child_pids "$child"
done
}
CHILD_PIDS=$(get_child_pids "$QEMU_PID")

Copilot uses AI. Check for mistakes.
fi

# 每60秒报告一次进度
if [ $((ELAPSED % 60)) -eq 0 ]; then
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The progress report at line 310 uses a modulo operation [ $((ELAPSED % 60)) -eq 0 ] which will trigger exactly when ELAPSED is divisible by 60. However, with a 10-second sleep interval, this might miss the exact 60-second mark and never trigger. Consider using [ $((ELAPSED % 60)) -lt 10 ] instead to ensure the progress report is displayed at least once per minute.

Suggested change
if [ $((ELAPSED % 60)) -eq 0 ]; then
if [ $((ELAPSED % 60)) -lt 10 ]; then

Copilot uses AI. Check for mistakes.
BOOT_TIMEOUT=300 # DragonOS开机超时(5分钟)
TEST_START_TIMEOUT=600 # 测试程序启动超时(10分钟)
TEST_TIMEOUT=1800 # 整个测试超时(30分钟)
IDLE_TIMEOUT=120 # 无输出超时(5分钟)
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment states "无输出超时(5分钟)" but the value is set to 120 seconds (2 minutes), not 5 minutes. This should be corrected to either "无输出超时(2分钟)" or the value should be changed to 300 seconds to match the comment.

Suggested change
IDLE_TIMEOUT=120 # 无输出超时(5分钟)
IDLE_TIMEOUT=300 # 无输出超时(5分钟)

Copilot uses AI. Check for mistakes.
Comment on lines 161 to 163
$(MAKE) qemu-nographic AUTO_TEST=syscall SYSCALL_TEST_DIR=/opt/tests/gvisor &
QEMU_PID=$$!
sleep 5
bash user/apps/tests/syscall/gvisor/monitor_test_results.sh || { bash user/apps/tests/syscall/gvisor/disable_compile_gvisor.sh; exit 1; }
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The removed line QEMU_PID=$$! was on its own line, which wouldn't have worked correctly in a Makefile anyway (the variable would be set but not exported to the subsequent command). However, the replacement with only sleep 5 means the QEMU PID is never passed to the monitoring script. While the script has auto-detection logic, it would be more reliable to explicitly pass the PID. Consider: $(MAKE) qemu-nographic AUTO_TEST=syscall SYSCALL_TEST_DIR=/opt/tests/gvisor & sleep 5 && QEMU_PID=$$! bash user/apps/tests/syscall/gvisor/monitor_test_results.sh

Copilot uses AI. Check for mistakes.
if [ -f "$SERIAL_FILE" ]; then
CURRENT_TIME=$(date +%s)
CURRENT_LINE_COUNT=$(wc -l < "$SERIAL_FILE" 2>/dev/null || echo 0)
FILE_MTIME=$(stat -c %Y "$SERIAL_FILE" 2>/dev/null || echo 0)
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stat -c %Y option is GNU-specific and may not work on all systems (e.g., BSD/macOS uses stat -f %m). Since the script uses busybox, verify that busybox's stat implementation supports the -c option, or use a more portable alternative like checking file modification with ls -l or find.

Suggested change
FILE_MTIME=$(stat -c %Y "$SERIAL_FILE" 2>/dev/null || echo 0)
FILE_MTIME=$(find "$SERIAL_FILE" -printf '%T@\n' 2>/dev/null | cut -d. -f1)
[ -z "$FILE_MTIME" ] && FILE_MTIME=0

Copilot uses AI. Check for mistakes.
check_qemu_alive() {
if [ -n "$QEMU_PID" ]; then
# 使用ps检查进程是否存在,避免PID重用问题
ps aux | grep -E "[q]emu-system-x86_64" | awk '{print $2}' | grep -q "^$QEMU_PID$"
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern [q]emu-system-x86_64 in the grep command is used to avoid matching the grep process itself, but this assumes the QEMU executable is specifically qemu-system-x86_64. However, the QEMU_SEARCH_PATTERN variable (line 15) is more flexible and includes the AUTO_TEST=syscall argument. For consistency and to avoid false positives from other QEMU instances, consider using $QEMU_SEARCH_PATTERN here instead of the hardcoded pattern.

Suggested change
ps aux | grep -E "[q]emu-system-x86_64" | awk '{print $2}' | grep -q "^$QEMU_PID$"
ps aux | grep -E "$QEMU_SEARCH_PATTERN" | awk '{print $2}' | grep -q "^$QEMU_PID$"

Copilot uses AI. Check for mistakes.
Comment on lines +121 to +122
local current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "无")
local last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0")
Copy link

Copilot AI Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The local keyword is used in a POSIX shell script (#!/bin/busybox sh). While busybox ash supports local, it's not part of the POSIX standard. For maximum portability, consider avoiding local or explicitly documenting that the script requires busybox or a shell that supports local.

Suggested change
local current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "")
local last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0")
current_test=$(tail -n 20 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ RUN \]" | tail -n 1 | sed 's/.*\[ RUN \] //' || echo "")
last_passed=$(tail -n 100 "$SERIAL_FILE" 2>/dev/null | grep -a "\[ PASSED \]" | tail -n 1 | sed 's/.*\[ PASSED \] *\([0-9]*\).*/\1/' || echo "0")

Copilot uses AI. Check for mistakes.
@fslongjin fslongjin merged commit 8215106 into DragonOS-Community:master Nov 12, 2025
16 of 18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants