2024-01-22 19:12:00

by Jakub Kicinski

Subject: Re: Kernel panic in netif_rx_internal after v6 pings between netns

On Mon, 22 Jan 2024 19:22:42 +0100 Matthieu Baerts wrote:
> > Somewhat related. What do you do currently to ignore crashes?
>
> I was wondering why you wanted to ignore crashes :) ... but then I saw
> the new "Test ignored" and "Crashes ignored" sections on the status
> page. Just to be sure: you don't want to report issues that have not
> been introduced by the new patches, right?

Initially, yes, but going forward I bet we'll always see crashes and
breakage introduced downstream. So we need some knobs to selectively
silence failing things.

In an ideal world we'd also have some form of "last seen" stat
displayed to know when to retire these entries..

> We don't need to do that on the MPTCP side:
> - either it is a new crash caused by patches that are in review and not
> impacting others → we test each series individually, not a batch of
> series.
> - or there are issues with recent patches, not in netdev yet → we fix,
> or revert.
> - or there is an issue elsewhere, like the kernel panic we reported
> here: usually I try to quickly apply a workaround, e.g. applying a fix,
> or a revert. I don't think we ever had an issue really impacting us
> where we couldn't find a quick solution in one or two days. With the
> panic we reported here, ~15% of the tests had an issue; that is "OK" to
> live with for a few days/weeks.
>
> With fewer tests and a smaller community, it is easier for us to just
> say on the ML and in the weekly meetings: "this is a known issue, please ignore
> for the moment". But if possible, I try to add a workaround/fix in our
> repo used by the CI and devs (not upstreamed).
>
> For NIPA CI, do you want to do as with the builds and compare against a
> reference? Or against multiple ones to take unstable tests into account? Or
> maintain a list of known issues (I think you started to do that,
> probably safer/easier for the moment)?

Exactly - where we can, a before/after diff is best. We do that for
all the static checker / build kind of tests. But for selftests I'm not
sure how effective and applicable that is. Even the stack trace I
posted here shows up somewhat unreliably :( We can try to develop more
intelligent approaches going forward, obviously :)
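
Conceptually the before/after comparison is just a diff of two result
sets, something like the sketch below - a simplified illustration, not
the actual NIPA code, and the names are made up:

def new_failures(baseline, patched):
    """Return tests failing with the patches applied that were not
    already failing on the baseline tree (illustrative sketch)."""
    # baseline / patched map a test name -> result ("pass" / "fail")
    return sorted(name for name, result in patched.items()
                  if result == "fail" and baseline.get(name) != "fail")

# e.g. new_failures({"checkpatch": "pass"}, {"checkpatch": "fail"})
#      -> ["checkpatch"], i.e. a regression introduced by the series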

> > I was seeing a lot of:
> > https://netdev-2.bots.linux.dev/vmksft-net-mp/results/431181/vm-crash-thr0-2
> >
> > So I hacked up this function to filter the crash from NIPA CI:
> > https://github.com/kuba-moo/nipa/blob/master/contest/remote/lib/vm.py#L50
> > It tries to get the first 5 function names from the stack, to form
> > a "fingerprint". But I seem to recall a discussion at LPC's testing
> > track that there are existing solutions for generating fingerprints.
> > Are you aware of any?
>
> No, sorry. But I guess they are using that with syzkaller, no?
>
> I have to admit that crashes (or warnings) are quite rare, so there was
> no need for automation there. But if it is easy to generate a
> fingerprint, I would be interested as well; it can help with the tracking,
> to find occurrences of crashes/warnings that are very hard to reproduce.

Indeed, I'll keep my ear to the ground. I believe it was discussed in
relation to KCIDB.
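
For reference, the fingerprinting is conceptually along these lines -
a simplified sketch of the idea rather than the exact code in vm.py,
and the trace-line regex is only an approximation of typical oops
output:

import re

def crash_fingerprint(log, max_funcs=5):
    """Build a crash fingerprint from the first few function names
    found in an oops/panic call trace (illustrative sketch only)."""
    funcs = []
    for line in log.splitlines():
        # Call trace entries usually look like:
        #   " ? netif_rx_internal+0x5c/0x120"
        m = re.search(r'([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+', line)
        if m:
            funcs.append(m.group(1))
            if len(funcs) == max_funcs:
                break
    return ":".join(funcs)

Reports with the same fingerprint can then be collapsed into a single
"known crash" entry on the status page.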


2024-01-22 19:26:32

by Matthieu Baerts (NGI0)

Subject: Re: Kernel panic in netif_rx_internal after v6 pings between netns

On 22/01/2024 19:36, Jakub Kicinski wrote:
> On Mon, 22 Jan 2024 19:22:42 +0100 Matthieu Baerts wrote:
>>> Somewhat related. What do you do currently to ignore crashes?
>>
>> I was wondering why you wanted to ignore crashes :) ... but then I saw
>> the new "Test ignored" and "Crashes ignored" sections on the status
>> page. Just to be sure: you don't want to report issues that have not
>> been introduced by the new patches, right?
>
> Initially, yes, but going forward I bet we'll always see crashes and
> breakage introduced downstream. So we need some knobs to selectively
> silence failing things.

Even if I guess this will mainly happen around the merge window, I
understand it would be annoying to have issues for at least one week,
until the next sync with Linus' tree.

> In an ideal world we'd also have some form of "last seen" stat
> displayed to know when to retire these entries..

Good idea!

>> We don't need to do that on the MPTCP side:
>> - either it is a new crash caused by patches that are in review and not
>> impacting others → we test each series individually, not a batch of
>> series.
>> - or there are issues with recent patches, not in netdev yet → we fix,
>> or revert.
>> - or there is an issue elsewhere, like the kernel panic we reported
>> here: usually I try to quickly apply a workaround, e.g. applying a fix,
>> or a revert. I don't think we ever had an issue really impacting us
>> where we couldn't find a quick solution in one or two days. With the
>> panic we reported here, ~15% of the tests had an issue; that is "OK" to
>> live with for a few days/weeks.
>>
>> With fewer tests and a smaller community, it is easier for us to just
>> say on the ML and in the weekly meetings: "this is a known issue, please ignore
>> for the moment". But if possible, I try to add a workaround/fix in our
>> repo used by the CI and devs (not upstreamed).
>>
>> For NIPA CI, do you want to do as with the builds and compare against a
>> reference? Or against multiple ones to take unstable tests into account? Or
>> maintain a list of known issues (I think you started to do that,
>> probably safer/easier for the moment)?
>
> Exactly - where we can, a before/after diff is best. We do that for
> all the static checker / build kind of tests. But for selftests I'm not
> sure how effective and applicable that is. Even the stack trace I
> posted here shows up somewhat unreliably :( We can try to develop more
> intelligent approaches going forward, obviously :)

Maybe just by checking the last X runs, instead of only the last one?

Maybe it would also be enough to just mark the tests as WARN, with a
"was failing before" note (+ a percentage)?

>>> I was seeing a lot of:
>>> https://netdev-2.bots.linux.dev/vmksft-net-mp/results/431181/vm-crash-thr0-2
>>>
>>> So I hacked up this function to filter the crash from NIPA CI:
>>> https://github.com/kuba-moo/nipa/blob/master/contest/remote/lib/vm.py#L50
>>> It tries to get the first 5 function names from the stack, to form
>>> a "fingerprint". But I seem to recall a discussion at LPC's testing
>>> track that there are existing solutions for generating fingerprints.
>>> Are you aware of any?
>>
>> No, sorry. But I guess they are using that with syzkaller, no?
>>
>> I have to admit that crashes (or warnings) are quite rare, so there was
>> no need for automation there. But if it is easy to generate a
>> fingerprint, I would be interested as well; it can help with the tracking,
>> to find occurrences of crashes/warnings that are very hard to reproduce.
>
> Indeed, I'll keep my ear to the ground. I believe it was discussed in
> relation to KCIDB.

Maybe it would be good to ask the people behind Syzkaller; they must
have something in place :)

Cheers,
Matt
--
Sponsored by the NGI0 Core fund.