Hello,
We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
$ sudo bpftool prog show id 693110
693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
loaded_at 2024-04-10T22:33:09+0000 uid 62727
xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
btf_id 8726522
pids prometheus-ebpf(2224907)
According to bpftool, this program reported 66555800ns of runtime at
one point and then it jumped to 2493581675247416ns just 53s later when
we looked at it again. This is happening only on arm64 nodes in our
fleet on both v6.1.82 and v6.6.25.
We have two services that are involved:
* ebpf_exporter attaches bpf programs to the kernel and exports
prometheus metrics and opentelementry traces driven by its probes
* bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
The problematic fentry is attached to __tcp_retransmit_skb, but an
identical one is also attached to tcp_send_loss_probe, which does not
exhibit the same issue:
SEC("fentry/__tcp_retransmit_skb")
int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
{
return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
}
SEC("fentry/tcp_send_loss_probe")
int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
{
return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
}
In handle_sk we do a map lookup and an optional ringbuf push. There is
no sleeping (I don't think it's even allowed on v6.1). It's
interesting that it only happens for the retransmit, but not for the
loss probe.
The issue manifests some time after we restart ebpf_exporter and
reattach the probes. It doesn't happen immediately, as we need to
capture metrics 53s apart to produce a visible spike in metrics.
There is no corresponding spike in execution count, only in execution time.
It doesn't happen deterministically. Some ebpf_exporter restarts show
it, some don't.
It doesn't keep happening after ebpf_exporter restart. It happens once
and that's it.
Maybe recursion_misses plays a role here? We see none for
tcp_send_loss_probe. We do see some for inet_sk_error_report
tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
The biggest smoking gun is that it only happens on arm64.
I'm happy to try out patches to figure this one out.
On 4/11/24 11:09 AM, Ivan Babrou wrote:
> Hello,
>
> We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
>
> $ sudo bpftool prog show id 693110
> 693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
> run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
> loaded_at 2024-04-10T22:33:09+0000 uid 62727
> xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
> btf_id 8726522
> pids prometheus-ebpf(2224907)
>
> According to bpftool, this program reported 66555800ns of runtime at
> one point and then it jumped to 2493581675247416ns just 53s later when
> we looked at it again. This is happening only on arm64 nodes in our
> fleet on both v6.1.82 and v6.6.25.
>
> We have two services that are involved:
>
> * ebpf_exporter attaches bpf programs to the kernel and exports
> prometheus metrics and opentelementry traces driven by its probes
> * bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
>
> The problematic fentry is attached to __tcp_retransmit_skb, but an
> identical one is also attached to tcp_send_loss_probe, which does not
> exhibit the same issue:
>
> SEC("fentry/__tcp_retransmit_skb")
> int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
> }
>
> SEC("fentry/tcp_send_loss_probe")
> int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
> }
>
> In handle_sk we do a map lookup and an optional ringbuf push. There is
> no sleeping (I don't think it's even allowed on v6.1). It's
> interesting that it only happens for the retransmit, but not for the
> loss probe.
>
> The issue manifests some time after we restart ebpf_exporter and
> reattach the probes. It doesn't happen immediately, as we need to
> capture metrics 53s apart to produce a visible spike in metrics.
>
> There is no corresponding spike in execution count, only in execution time.
>
> It doesn't happen deterministically. Some ebpf_exporter restarts show
> it, some don't.
>
> It doesn't keep happening after ebpf_exporter restart. It happens once
> and that's it.
>
> Maybe recursion_misses plays a role here? We see none for
> tcp_send_loss_probe. We do see some for inet_sk_error_report
> tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
>
> The biggest smoking gun is that it only happens on arm64.
I am not an expert for arm64. But you or somebody could check
and compare arm64 and x86 jit trampoline codes to see whether
anything is suspicious.
>
> I'm happy to try out patches to figure this one out.
>
On 24/04/11 11:09AM, Ivan Babrou wrote:
> Hello,
>
> We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
>
> $ sudo bpftool prog show id 693110
> 693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
> run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
> loaded_at 2024-04-10T22:33:09+0000 uid 62727
> xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
> btf_id 8726522
> pids prometheus-ebpf(2224907)
>
> According to bpftool, this program reported 66555800ns of runtime at
> one point and then it jumped to 2493581675247416ns just 53s later when
> we looked at it again. This is happening only on arm64 nodes in our
> fleet on both v6.1.82 and v6.6.25.
>
> We have two services that are involved:
>
> * ebpf_exporter attaches bpf programs to the kernel and exports
> prometheus metrics and opentelementry traces driven by its probes
> * bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
>
> The problematic fentry is attached to __tcp_retransmit_skb, but an
> identical one is also attached to tcp_send_loss_probe, which does not
> exhibit the same issue:
>
> SEC("fentry/__tcp_retransmit_skb")
> int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
> }
>
> SEC("fentry/tcp_send_loss_probe")
> int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
> }
>
> In handle_sk we do a map lookup and an optional ringbuf push. There is
> no sleeping (I don't think it's even allowed on v6.1). It's
> interesting that it only happens for the retransmit, but not for the
> loss probe.
>
> The issue manifests some time after we restart ebpf_exporter and
> reattach the probes. It doesn't happen immediately, as we need to
> capture metrics 53s apart to produce a visible spike in metrics.
>
> There is no corresponding spike in execution count, only in execution time.
>
> It doesn't happen deterministically. Some ebpf_exporter restarts show
> it, some don't.
>
> It doesn't keep happening after ebpf_exporter restart. It happens once
> and that's it.
>
> Maybe recursion_misses plays a role here? We see none for
> tcp_send_loss_probe. We do see some for inet_sk_error_report
> tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
>
> The biggest smoking gun is that it only happens on arm64.
>
> I'm happy to try out patches to figure this one out.
Ivan, I recently submitted a patch that improves how the bpf runtime stats are
calculated. I'm not sure if it will fix your issue, but it would be useful to
see if removing the intrumentation time from the runtime calculation helps.
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=ce09cbdd9888
You can also use bpftop to chart bpf stats in a time series graph. Visualizing
the stats that way may help surface more patterns about the issue.
https://github.com/Netflix/bpftop
On 4/12/2024 2:09 AM, Ivan Babrou wrote:
> Hello,
>
> We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
>
> $ sudo bpftool prog show id 693110
> 693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
> run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
> loaded_at 2024-04-10T22:33:09+0000 uid 62727
> xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
> btf_id 8726522
> pids prometheus-ebpf(2224907)
>
> According to bpftool, this program reported 66555800ns of runtime at
> one point and then it jumped to 2493581675247416ns just 53s later when
> we looked at it again. This is happening only on arm64 nodes in our
> fleet on both v6.1.82 and v6.6.25.
>
> We have two services that are involved:
>
> * ebpf_exporter attaches bpf programs to the kernel and exports
> prometheus metrics and opentelementry traces driven by its probes
> * bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
>
> The problematic fentry is attached to __tcp_retransmit_skb, but an
> identical one is also attached to tcp_send_loss_probe, which does not
> exhibit the same issue:
>
> SEC("fentry/__tcp_retransmit_skb")
> int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
> }
>
> SEC("fentry/tcp_send_loss_probe")
> int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
> {
> return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
> }
>
> In handle_sk we do a map lookup and an optional ringbuf push. There is
> no sleeping (I don't think it's even allowed on v6.1). It's
> interesting that it only happens for the retransmit, but not for the
> loss probe.
>
> The issue manifests some time after we restart ebpf_exporter and
> reattach the probes. It doesn't happen immediately, as we need to
> capture metrics 53s apart to produce a visible spike in metrics.
>
> There is no corresponding spike in execution count, only in execution time.
>
> It doesn't happen deterministically. Some ebpf_exporter restarts show
> it, some don't.
>
> It doesn't keep happening after ebpf_exporter restart. It happens once
> and that's it.
>
> Maybe recursion_misses plays a role here? We see none for
> tcp_send_loss_probe. We do see some for inet_sk_error_report
> tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
>
> The biggest smoking gun is that it only happens on arm64.
>
> I'm happy to try out patches to figure this one out.
>
I guess the issue is caused by the not setting of x20 register
when __bpf_prog_enter(prog) returns zero.
The following patch may help:
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -1905,15 +1905,15 @@ static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
emit_call(enter_prog, ctx);
+ /* save return value to callee saved register x20 */
+ emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
+
/* if (__bpf_prog_enter(prog) == 0)
* goto skip_exec_of_prog;
*/
branch = ctx->image + ctx->idx;
emit(A64_NOP, ctx);
- /* save return value to callee saved register x20 */
- emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
-
emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
if (!p->jited)
emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
On Thu, Apr 11, 2024 at 7:30 PM Xu Kuohai <[email protected]> wrote:
>
> On 4/12/2024 2:09 AM, Ivan Babrou wrote:
> > Hello,
> >
> > We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
> >
> > $ sudo bpftool prog show id 693110
> > 693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
> > run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
> > loaded_at 2024-04-10T22:33:09+0000 uid 62727
> > xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
> > btf_id 8726522
> > pids prometheus-ebpf(2224907)
> >
> > According to bpftool, this program reported 66555800ns of runtime at
> > one point and then it jumped to 2493581675247416ns just 53s later when
> > we looked at it again. This is happening only on arm64 nodes in our
> > fleet on both v6.1.82 and v6.6.25.
> >
> > We have two services that are involved:
> >
> > * ebpf_exporter attaches bpf programs to the kernel and exports
> > prometheus metrics and opentelementry traces driven by its probes
> > * bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
> >
> > The problematic fentry is attached to __tcp_retransmit_skb, but an
> > identical one is also attached to tcp_send_loss_probe, which does not
> > exhibit the same issue:
> >
> > SEC("fentry/__tcp_retransmit_skb")
> > int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
> > {
> > return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
> > }
> >
> > SEC("fentry/tcp_send_loss_probe")
> > int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
> > {
> > return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
> > }
> >
> > In handle_sk we do a map lookup and an optional ringbuf push. There is
> > no sleeping (I don't think it's even allowed on v6.1). It's
> > interesting that it only happens for the retransmit, but not for the
> > loss probe.
> >
> > The issue manifests some time after we restart ebpf_exporter and
> > reattach the probes. It doesn't happen immediately, as we need to
> > capture metrics 53s apart to produce a visible spike in metrics.
> >
> > There is no corresponding spike in execution count, only in execution time.
> >
> > It doesn't happen deterministically. Some ebpf_exporter restarts show
> > it, some don't.
> >
> > It doesn't keep happening after ebpf_exporter restart. It happens once
> > and that's it.
> >
> > Maybe recursion_misses plays a role here? We see none for
> > tcp_send_loss_probe. We do see some for inet_sk_error_report
> > tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
> >
> > The biggest smoking gun is that it only happens on arm64.
> >
> > I'm happy to try out patches to figure this one out.
> >
>
> I guess the issue is caused by the not setting of x20 register
> when __bpf_prog_enter(prog) returns zero.
Yes, I think this is it. Your patch makes it match x86_64 and it seems logical.
I'm building a kernel with it to put it into production to make sure.
> The following patch may help:
>
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -1905,15 +1905,15 @@ static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
>
> emit_call(enter_prog, ctx);
>
> + /* save return value to callee saved register x20 */
> + emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
> +
> /* if (__bpf_prog_enter(prog) == 0)
> * goto skip_exec_of_prog;
> */
> branch = ctx->image + ctx->idx;
> emit(A64_NOP, ctx);
>
> - /* save return value to callee saved register x20 */
> - emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
> -
> emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
> if (!p->jited)
> emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
>
On Fri, Apr 12, 2024 at 2:46 PM Ivan Babrou <[email protected]> wrote:
>
> On Thu, Apr 11, 2024 at 7:30 PM Xu Kuohai <[email protected]> wrote:
> >
> > On 4/12/2024 2:09 AM, Ivan Babrou wrote:
> > > Hello,
> > >
> > > We're seeing incorrect data for bpf runtime stats on arm64. Here's an example:
> > >
> > > $ sudo bpftool prog show id 693110
> > > 693110: tracing name __tcp_retransmit_skb tag e37be2fbe8be4726 gpl
> > > run_time_ns 2493581964213176 run_cnt 1133532 recursion_misses 1
> > > loaded_at 2024-04-10T22:33:09+0000 uid 62727
> > > xlated 312B jited 344B memlock 4096B map_ids 8550445,8550441
> > > btf_id 8726522
> > > pids prometheus-ebpf(2224907)
> > >
> > > According to bpftool, this program reported 66555800ns of runtime at
> > > one point and then it jumped to 2493581675247416ns just 53s later when
> > > we looked at it again. This is happening only on arm64 nodes in our
> > > fleet on both v6.1.82 and v6.6.25.
> > >
> > > We have two services that are involved:
> > >
> > > * ebpf_exporter attaches bpf programs to the kernel and exports
> > > prometheus metrics and opentelementry traces driven by its probes
> > > * bpf_stats_exporter runs bpftool every 53s to capture bpf runtime metrics
> > >
> > > The problematic fentry is attached to __tcp_retransmit_skb, but an
> > > identical one is also attached to tcp_send_loss_probe, which does not
> > > exhibit the same issue:
> > >
> > > SEC("fentry/__tcp_retransmit_skb")
> > > int BPF_PROG(__tcp_retransmit_skb, struct sock *sk)
> > > {
> > > return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_retransmit_skb);
> > > }
> > >
> > > SEC("fentry/tcp_send_loss_probe")
> > > int BPF_PROG(tcp_send_loss_probe, struct sock *sk)
> > > {
> > > return handle_sk((struct pt_regs *) ctx, sk, sk_kind_tcp_send_loss_probe);
> > > }
> > >
> > > In handle_sk we do a map lookup and an optional ringbuf push. There is
> > > no sleeping (I don't think it's even allowed on v6.1). It's
> > > interesting that it only happens for the retransmit, but not for the
> > > loss probe.
> > >
> > > The issue manifests some time after we restart ebpf_exporter and
> > > reattach the probes. It doesn't happen immediately, as we need to
> > > capture metrics 53s apart to produce a visible spike in metrics.
> > >
> > > There is no corresponding spike in execution count, only in execution time.
> > >
> > > It doesn't happen deterministically. Some ebpf_exporter restarts show
> > > it, some don't.
> > >
> > > It doesn't keep happening after ebpf_exporter restart. It happens once
> > > and that's it.
> > >
> > > Maybe recursion_misses plays a role here? We see none for
> > > tcp_send_loss_probe. We do see some for inet_sk_error_report
> > > tracepoint, but it doesn't spike like __tcp_retransmit_skb does.
> > >
> > > The biggest smoking gun is that it only happens on arm64.
> > >
> > > I'm happy to try out patches to figure this one out.
> > >
> >
> > I guess the issue is caused by the not setting of x20 register
> > when __bpf_prog_enter(prog) returns zero.
>
> Yes, I think this is it. Your patch makes it match x86_64 and it seems logical.
>
> I'm building a kernel with it to put it into production to make sure.
I let it simmer over the weekend. The issue kept happening on the
control group, but the test group was fine. Please proceed with this
patch.
> > The following patch may help:
> >
> > --- a/arch/arm64/net/bpf_jit_comp.c
> > +++ b/arch/arm64/net/bpf_jit_comp.c
> > @@ -1905,15 +1905,15 @@ static void invoke_bpf_prog(struct jit_ctx *ctx, struct bpf_tramp_link *l,
> >
> > emit_call(enter_prog, ctx);
> >
> > + /* save return value to callee saved register x20 */
> > + emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
> > +
> > /* if (__bpf_prog_enter(prog) == 0)
> > * goto skip_exec_of_prog;
> > */
> > branch = ctx->image + ctx->idx;
> > emit(A64_NOP, ctx);
> >
> > - /* save return value to callee saved register x20 */
> > - emit(A64_MOV(1, A64_R(20), A64_R(0)), ctx);
> > -
> > emit(A64_ADD_I(1, A64_R(0), A64_SP, args_off), ctx);
> > if (!p->jited)
> > emit_addr_mov_i64(A64_R(1), (const u64)p->insnsi, ctx);
> >