
Conversation

@ismhong
Contributor

@ismhong ismhong commented Dec 6, 2021

Currently, hardirqs simply counts an interrupt event whenever the tracepoint
irq:irq_handler_entry fires. That is fine on systems without shared IRQs, but
it produces wrong interrupt counts on systems with shared IRQs, because the
kernel iterates over all irq handlers belonging to the IRQ descriptor.

Take the following system with shared IRQs as an example.

root@localhost:/# cat /proc/interrupts
           CPU0
 13:     385248     GICv3  39 Level     DDOMAIN ISR, gdma
 ...
 23:      61532     GICv3  38 Level     VGIP ISR, OnlineMeasure ISR

DDOMAIN ISR and gdma share IRQ 13, while VGIP ISR and OnlineMeasure ISR
share IRQ 23. Use 'hardirqs -C' to measure the counts of these
interrupt events.

root@localhost:/# hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.

HARDIRQ                    TOTAL_count
OnlineMeasure ISR                  300
VGIP ISR                           300
gdma                              2103
DDOMAIN ISR                       2103
eth0                              6677

hardirqs reports the same interrupt count for the shared IRQ pairs
'OnlineMeasure ISR/VGIP ISR' and 'gdma/DDOMAIN ISR'.

We should check whether the ret field of the tracepoint irq:irq_handler_exit is
IRQ_HANDLED or IRQ_WAKE_THREAD to make sure the current event belongs to this
interrupt handler. For simplicity, just check `args->ret != IRQ_NONE`.

At the same time, the same change should be applied to the interrupt time
measurement.
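
A minimal bcc-style sketch of this idea (not the exact hardirqs.py code; the map and struct names here are illustrative): record the handler name at irq_handler_entry keyed by tid, and only count it at irq_handler_exit when the handler actually claimed the interrupt.

```c
#include <linux/irqreturn.h>   // IRQ_NONE, IRQ_HANDLED, IRQ_WAKE_THREAD

typedef struct irq_key {
    char name[32];
} irq_key_t;

BPF_HASH(irqnames, u32, irq_key_t);   // tid -> handler name seen at entry
BPF_HASH(counts, irq_key_t, u64);     // handler name -> claimed-event count

TRACEPOINT_PROBE(irq, irq_handler_entry)
{
    u32 tid = bpf_get_current_pid_tgid();
    irq_key_t key = {};
    TP_DATA_LOC_READ_CONST(&key.name, name, sizeof(key.name));
    irqnames.update(&tid, &key);
    return 0;
}

TRACEPOINT_PROBE(irq, irq_handler_exit)
{
    u32 tid = bpf_get_current_pid_tgid();
    irq_key_t *keyp = irqnames.lookup(&tid);
    if (keyp == 0)
        return 0;               // missed the matching entry event
    // Only count handlers that claimed the interrupt; on a shared IRQ
    // the other handlers return IRQ_NONE.
    if (args->ret != IRQ_NONE)
        counts.increment(*keyp);
    irqnames.delete(&tid);
    return 0;
}
```

The same entry/exit pairing applies to the time-measurement path: record the timestamp at entry and only accumulate the elapsed time at exit when `args->ret != IRQ_NONE`.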

The fixed hardirqs shows the output below.

(bcc)root@localhost:/# ./hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.

HARDIRQ                    TOTAL_count
OnlineMeasure ISR                    1
VGIP ISR                           294
gdma                              1168
DDOMAIN ISR                       1476
eth0                              5210

@ismhong force-pushed the fix-hardirqs-shared-irq-number branch from 9bb2f53 to ee6d955 on December 6, 2021 07:16
@chenhengqi
Collaborator

I am not familiar with this shared IRQ thing, but I tested this change and it looks correct to me.

I have a small question:
In IRQ context, is the tid useful? I think they should all be zero.
I tried printing the tid; most of them are 0, with some exceptions:

          <idle>-0       [000] d.h41 2763892.666859: bpf_trace_printk: tid = 0
          <idle>-0       [001] d.h41 2763892.668261: bpf_trace_printk: tid = 0
 systemd-resolve-905     [000] d.h41 2763894.418696: bpf_trace_printk: tid = 905
          <idle>-0       [000] d.h41 2763894.419803: bpf_trace_printk: tid = 0
          <idle>-0       [000] d.h41 2763894.420129: bpf_trace_printk: tid = 0

@yonghong-song
Collaborator

Thanks for the detailed commit message/explanation. The change looks good to me.

@yonghong-song yonghong-song merged commit 814c264 into iovisor:master Dec 6, 2021
@netedwardwu
Contributor

I am not familiar with this shared IRQ thing, but I tested this change and it looks correct to me.

I have a small question: In IRQ context, is the tid useful? I think they should all be zero. I tried printing the tid; most of them are 0, with some exceptions:

          <idle>-0       [000] d.h41 2763892.666859: bpf_trace_printk: tid = 0
          <idle>-0       [001] d.h41 2763892.668261: bpf_trace_printk: tid = 0
 systemd-resolve-905     [000] d.h41 2763894.418696: bpf_trace_printk: tid = 905
          <idle>-0       [000] d.h41 2763894.419803: bpf_trace_printk: tid = 0
          <idle>-0       [000] d.h41 2763894.420129: bpf_trace_printk: tid = 0

In IRQ context, the tid is not very useful. A zero tid (swapper) means the system was going to sleep, idle, or handling some scheduler matter; the IRQ interrupts those jobs, so tid = 0.

The only use I can think of is knowing whether the process you are measuring was interrupted.

For example: you use funclatency to measure function-A and get 300 us, but sometimes part of that latency belongs to an ISR, so the accurate latency should be 300 us minus the ISR latency.

@ismhong
Contributor Author

ismhong commented Dec 7, 2021

I have a small question:
In IRQ context, is the tid useful? I think they should all be zero.
I tried printing the tid; most of them are 0, with some exceptions:

          <idle>-0       [000] d.h41 2763892.666859: bpf_trace_printk: tid = 0
          <idle>-0       [001] d.h41 2763892.668261: bpf_trace_printk: tid = 0
 systemd-resolve-905     [000] d.h41 2763894.418696: bpf_trace_printk: tid = 905
          <idle>-0       [000] d.h41 2763894.419803: bpf_trace_printk: tid = 0
          <idle>-0       [000] d.h41 2763894.420129: bpf_trace_printk: tid = 0

Hi @chenhengqi
Hmmm... nice question. :)
I simply referenced the original implementation of the interrupt time measurement, bpf_text_time, but what you mentioned makes sense: tid should not be useful in ISR context.

This could cause an issue similar to the one reported in #2804.

(bcc)root@localhost:/# bpftrace -e 'tracepoint:irq:irq_handler_entry { @[comm, pid, tid, cpu]=count(); }'
@[hdmi_task, 590, 590, 1]: 45
@[hdmi_task, 590, 590, 0]: 89
@[swapper/3, 0, 0, 3]: 1718
@[swapper/2, 0, 0, 2]: 2319
@[swapper/0, 0, 0, 0]: 7819
@[swapper/1, 0, 0, 1]: 8715

For ISR context, maybe we should use the cpu id, obtained from the bpf helper bpf_get_smp_processor_id, as the key to store the ISR name at tracepoint:irq:irq_handler_entry and read it back at tracepoint:irq:irq_handler_exit.

Alternatively, we could use a PERCPU_HASH to prevent this issue. In my opinion, using cpu_id as the key is simpler, since it keeps a plain hash map instead of requiring a PERCPU_HASH.
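
A rough, hypothetical sketch of the cpu_id-keyed idea (the map and struct names are made up for illustration; this is not the code that was merged):

```c
typedef struct irq_name {
    char name[32];
} irq_name_t;

// Key the entry/exit correlation map by CPU id instead of tid, so two
// idle-task (tid = 0) interrupts on different CPUs cannot overwrite
// each other's entry record.
BPF_HASH(name_by_cpu, u32, irq_name_t);

TRACEPOINT_PROBE(irq, irq_handler_entry)
{
    u32 cpu = bpf_get_smp_processor_id();
    irq_name_t entry = {};
    TP_DATA_LOC_READ_CONST(&entry.name, name, sizeof(entry.name));
    name_by_cpu.update(&cpu, &entry);
    return 0;
}

TRACEPOINT_PROBE(irq, irq_handler_exit)
{
    u32 cpu = bpf_get_smp_processor_id();
    irq_name_t *entryp = name_by_cpu.lookup(&cpu);
    if (entryp == 0)
        return 0;
    // ... update the count or accumulated time for entryp->name here,
    // guarded by args->ret != IRQ_NONE ...
    name_by_cpu.delete(&cpu);
    return 0;
}
```

This relies on a hard irq handler running to completion on the CPU where it started, so the CPU id read at exit matches the one read at entry.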

Hi @yonghong-song
Do you have any suggestion for this? Thank you. :)

@yonghong-song
Collaborator

@ismhong The only issue we have is tid = 0. For any non-zero tid, since the bpf program runs in irq context, the underlying tid won't change and won't conflict with other tids, so we can use it as the key, since the key is removed at irq_exit.

To handle tid = 0, as you mentioned, either cpu_id or a percpu hash is fine; cpu_id is simpler. Please submit a follow-up patch to fix this issue. It would be good to add a comment mentioning that the change fixes the tid = 0 issue and that cpu_id is okay since irq handlers do not migrate between CPUs.

ismhong added a commit to ismhong/bcc that referenced this pull request Dec 8, 2021
Currently, hardirqs uses tid as the key to store information at the tracepoint
irq_handler_entry. This works fine if the irq is triggered while a normal task
is running, but it can cause an overwrite issue when the irq is triggered while
the idle task (a.k.a. swapper/x, tid=0) is running on a multi-core system.

Let's say two irq events trigger simultaneously on two CPU cores, irq A on
core #0 and irq B on core #1, and the system load is pretty light, so the BPF
program gets tid=0 on both cores since the current task is swapper/x. In this
case, the information of the first irq event stored in the map can be
overwritten by the incoming second irq event.

Use tid and cpu_id together to make sure the key is unique for each
event in this corner case.

For more detail, see iovisor#2804 and iovisor#3733.
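
A minimal sketch of how such a combined key could be built in the BPF program (illustrative only; the merged patch may differ in detail):

```c
// Illustrative only: pack the CPU id and the tid into one 64-bit map key,
// so concurrent idle-task (tid = 0) interrupts on different CPUs get
// distinct keys. Hard irq handlers do not migrate between CPUs, so the
// same key can be recomputed at irq_handler_exit.
static inline u64 irq_event_key(void)
{
    u32 tid = bpf_get_current_pid_tgid();   // lower 32 bits are the tid
    u32 cpu = bpf_get_smp_processor_id();
    return ((u64)cpu << 32) | tid;
}
```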
yonghong-song pushed a commit that referenced this pull request Dec 9, 2021
CrackerCat pushed a commit to CrackerCat/bcc that referenced this pull request Jul 31, 2024