Adding global-timeout, individual command timeout, log files collection#1249
Adding global-timeout, individual command timeout, log files collection#1249qiluo-msft merged 4 commits intosonic-net:masterfrom
Conversation
|
retest default please |
|
retest this please |
1 similar comment
|
retest this please |
|
Thanks for enhancing the tech support collection. One suggestion I have is next time PRs like this can be split into multiple PRs with each PR having a specific fix or enhancement. This makes it easier to revert a PR when there's an issue, also the PR looks cleaner and easier on the reviewer. |
|
Have you raised or going to raise a PR for tech support unit tests |
smaheshm
left a comment
There was a problem hiding this comment.
I'd suggest to break this into multiple PRs, up to you.
FuzailBrcm
left a comment
There was a problem hiding this comment.
- As most of the review comments are regarding the multi_asic, I would like to keep all this in single PR.
- I have done manual testing
- I am adding some tests in sonic-mgmt for the changes
|
Thanks for making the changes. Looks good. Since you added new features, in case you are still working on writing unit tests for the new features, would like to see the output. For example.. the output of "global timeout" case and the "command timeout" case. I'm sure you've tested it, just 'copy/paste' the output. Regarding "I am adding some tests in sonic-mgmt for the changes" -- have you raised a PR or are working on it. Adding @qiluo-msft for another pair of eyes. Rest looks good to me. |
|
|
||
| save_cmd "cat /proc/bcm/knet/debug" "broadcom.knet.debug" | ||
| save_cmd "cat /proc/bcm/knet/dma" "broadcom.knet.dma" | ||
| save_cmd "cat /proc/bcm/knet/dstats" "broadcom.knet.dstats" |
There was a problem hiding this comment.
not sure why this is removed
There was a problem hiding this comment.
It is not removed. These stats were moved under save_counter_snapshot() function. We need to take two snapshots of the counter stats to get the counter trend.
|
Note that this PR does not contain sonic-utilities UT because we are extending a command that does not have such a test today. However we will fix this in a follow-up PR. |
|
retest this please |
agree, that should be the rule. @ben-gale |
|
retest this please |
1 similar comment
|
retest this please |
smaheshm
left a comment
There was a problem hiding this comment.
Looks good. Please wait for one more approval.
|
@lguohan - Can this one merge now pls? |
| local filepath="${LOGDIR}/$filename" | ||
| local do_gzip=${3:-false} | ||
| local tarpath="${BASE}/dump/$filename" | ||
| local timeout_cmd="timeout --foreground ${TIMEOUT_MIN}m" |
There was a problem hiding this comment.
timeout [](start = 23, length = 7)
bcmcmd support timeout -t. Could you use that instead of Could you use that and maybe together with timeout command?timeout command
There was a problem hiding this comment.
@qiluo-msft
We can't predict the behaviour of 'bcmcmd' in case syncd is crashed/stuck/in bad shape. The change you are suggesting might require very extensive testing.
There was a problem hiding this comment.
There was a problem hiding this comment.
'timeout' supersedes the '-t' option of 'bcmcmd'. Is the concern regarding 'bcmcmd' may end up in a bad state like leaving the socket open?
There was a problem hiding this comment.
bcmcmd is not well tested with signal termination. I am not sure.
In reply to: 550679331 [](ancestors = 550679331,546269345)
There was a problem hiding this comment.
@qiluo-msft
"Could you use that and maybe together with timeout command" --- The bcmcmd has a default timeout of 30 sec. If no argument is given using -t option, '-t 30' is automatically taken. So the outside 'timeout' command is being used together with bcmcmd timeout option.
root@sonic:/home/admin# bcmcmd
USAGE: bcmcmd [-f <sun_path>] -v <cmd>
-v verbose mode
-f domain socket filename, default /var/run/sswsyncd/sswsyncd.socket
-t timeout in seconds, default 30
RETURN VALUE:
0 success
5 socket io error
22 invalid argument
62 timeout
root@sonic:/home/admin#
|
|
||
| handle_signal() | ||
| { | ||
| echo "Generate Dump received interrupt" |
There was a problem hiding this comment.
echo [](start = 4, length = 4)
echo to stderr? #Closed
| handle_signal() | ||
| { | ||
| echo "Generate Dump received interrupt" | ||
| $RM $V -rf $TARDIR |
There was a problem hiding this comment.
$RM $V [](start = 4, length = 6)
These variables should be defined before this function. #Closed
| if [[ ( "$NUM_ASICS" > 1 ) ]]; then | ||
| for (( i=0; i<$NUM_ASICS; i++ )) | ||
| do | ||
| local cmd="bcmcmd -n $i $1" |
There was a problem hiding this comment.
bcmcmd -n [](start = 23, length = 9)
bcmcmd does not have -n option. #Closed
There was a problem hiding this comment.
The change was requested by @smaheshm.
bcmcmd has been modified to include the ASIC number. sonic-net/sonic-buildimage#4926
| ns=" -n ${asic_id}" | ||
| fi | ||
| echo "$ns" | ||
| #echo "$ns" |
There was a problem hiding this comment.
#echo "$ns" [](start = 4, length = 11)
The function stdout is changed. #Closed
There was a problem hiding this comment.
Removed the comment
| save_vtysh "show bfd peers json" "frr.bfd.peers.json" | ||
| save_vtysh "show bfd peers counters json" "frr.bfd.peers.counters.json" | ||
|
|
||
| } |
There was a problem hiding this comment.
Remove extra blank line #Closed
| ############################################################################### | ||
| # Dumps all fields and values from given Redis DB. | ||
| # Arguments: | ||
| # DB id: id of DB for sonic-db-cli |
There was a problem hiding this comment.
DB id [](start = 3, length = 5)
The option has nothing to do with DB id. Why add? Could you remove? #Closed
There was a problem hiding this comment.
I changed the comments appropriately
|
retest this please |
2 similar comments
|
retest this please |
|
retest this please |
|
@qiluo-msft - can this proceed to approve/merge now? All checks are passing and I believe review comments have been addressed. |
|
Pls merge |
…on (sonic-net#1249) and some other enhancements to techsupport **- What I did** Following is the brief description of the changes, - Adding a ‘--silent’ option to ‘show techsupport’ command. Various tar/untar, addition and removal logs appear on the console by default. This option would disable above logs. - Adding global and per-command timeouts. This would provide more user control on ‘show techsupport’ CLI. - Adding time profiling information for the commands in techsupport. Time profiling information would be part of the tarball and helps to analyse the time consumption per command. - Sometimes ‘syncd’ docker is down and bcmshell is unavailable. In such cases all the bcmcmd commands would timeout and result in tremendous increase in the total techsupport collection time. We provided an option to skip rest of the bcmcmd commands once one command times out. - Added ‘show services’, ‘show reboot-cause’ and various BGP, BFD, bcm shell and other commands - Optimised the /var/log files collection. If the number of files are large in /var/log folder, it takes a long time to add each individually to the tarball. If the folder is tar'ed at once, the time taken reduces significantly. - Following error was observed while tar'ing softlinks inside .etc folder. ** Tar append operation failed. Aborting for safety. ** This issue was due to softlinks present at /etc folder where the destination file is absent. Fixed this issue by deleting such softlinks before adding them to the tarball. **- How I did it** - Added new options to the CLICK command 'show techsupport' - Modified the 'generate_dump' script to accomodate other changes **- How to verify it** Here are some outputs, root@sonic:/home/admin# show techsupport --silent Techsupport is running with silent option. This command might take a long time. HW Mgmt dump script /usr/bin/hw-management-generate-dump.sh does not exist /var/dump/sonic_dump_sonic_20201117_161246.tar.gz root@sonic:/home/admin# root@sonic:~# show techsupport -h Usage: show techsupport [OPTIONS] Gather information for troubleshooting Options: --since TEXT Collect logs and core files since given date -g, --global-timeout INTEGER Global timeout for techsupport in minutes. Default 30 mins -c, --cmd-timeout INTEGER Command timeout for techsupport in minutes. Default 5 mins --verbose Enable verbose output --silent Run techsupport in silent mode -?, -h, --help Show this message and exit. root@sonic:~# **- Previous command output (if the output of a command-line utility has changed)** - Previous command "show techsupport" works as is **- New command output (if the output of a command-line utility has changed)**
and some other enhancements to techsupport
- What I did
Following is the brief description of the changes,
** Tar append operation failed. Aborting for safety. **
This issue was due to softlinks present at /etc folder where the destination file is absent. Fixed this issue by deleting such softlinks before adding them to the tarball.
- How I did it
- How to verify it
Here are some outputs,
root@sonic:/home/admin# show techsupport --silent
Techsupport is running with silent option. This command might take a long time.
HW Mgmt dump script /usr/bin/hw-management-generate-dump.sh does not exist
/var/dump/sonic_dump_sonic_20201117_161246.tar.gz
root@sonic:/home/admin#
root@sonic:~# show techsupport -h
Usage: show techsupport [OPTIONS]
Gather information for troubleshooting
Options:
--since TEXT Collect logs and core files since given date
-g, --global-timeout INTEGER Global timeout for techsupport in minutes.
Default 30 mins
-c, --cmd-timeout INTEGER Command timeout for techsupport in minutes.
Default 5 mins
--verbose Enable verbose output
--silent Run techsupport in silent mode
-?, -h, --help Show this message and exit.
root@sonic:~#
- Previous command output (if the output of a command-line utility has changed)
- New command output (if the output of a command-line utility has changed)