Subject: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

* @SCSI maintainers: could you please look into below please?

* @Stable team: you might want to take a look as well and consider a
revert in 6.1.y (yes, I know, those are normally avoided, but here it
might make sense).

Hi everyone!

TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
hangs for a while) that was reported months ago already but is still not
fixed. Not only that, apparently more and more users have run into it
recently, as the culprit was recently integrated into 6.1.y; I wonder if
it would be best to revert it there, unless a fix for mainline comes
into reach soon.

Details:

Quite a few machines with Adaptec controllers seem to hang for a few
tens of seconds to a few minutes before things start to work normally
again for a while:
https://bugzilla.kernel.org/show_bug.cgi?id=217599

That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
commit, despite a warning of mine to Sasha, recently made it into 6.1.53
-- and that way apparently reached more users, as quite a few joined
that ticket recently.

The culprit is authored by Sagar Biradar who, unless I missed something,
never replied even once to the ticket or earlier mails about it. Lore
has no messages from him since early June.

Hannes Reinecke at least tried to fix it a few weeks ago (many thx), but
that didn't work out (see the ticket for details). Since then things
look stalled again, which is, ehh, unfortunate when it comes to
regressions.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On 21.11.23 10:50, Thorsten Leemhuis wrote:
> * @SCSI maintainers: could you please look into below please?
>
> * @Stable team: you might want to take a look as well and consider a
> revert in 6.1.y (yes, I know, those are normally avoided, but here it
> might make sense).
>
> TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
> hangs for a while) that was reported months ago already but is still not
> fixed. Not only that, it apparently more and more users run into this
> recently, as the culprit was recently integrated into 6.1.y; I wonder if
> it would be best to revert it there, unless a fix for mainline comes
> into reach soon.
>
> Details:
>
> Quite a few machines with Adaptec controllers seems to hang for a few
> tens of seconds to a few minutes before things start to work normally
> again for a while:
> https://bugzilla.kernel.org/show_bug.cgi?id=217599

Quick follow up, only saw this now while posting something to the
ticket: according to one reporter the problem even causes data damage.
To quote:

'''
if you run fsck.ext4 on ext4 file system with buggy kernel it will
damage file system and its data

using buggy kernel BTRFS scrub also says that checksums are wrong
'''

Ciao, Thorsten

> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
> commit despite a warning of mine to Sasha recently made it into 6.1.53
> -- and that way apparently recently reached more users recently, as
> quite a few joined that ticket.
>
> The culprit is authored by Sagar Biradar who unless I missed something
> never replied even once to the ticket or earlier mails about it. Lore
> has no messages from him since early June.
>
> Hannes Reinecke at least tried to fix it a few weeks ago (many thx), but
> that didn't work out (see the ticket for details). Since then things
> look stalled again, which is, ehh, unfortunate when it comes to
> regressions.
>
> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
> --
> Everything you wanna know about Linux kernel regression tracking:
> https://linux-regtracking.leemhuis.info/about/#tldr
> If I did something stupid, please tell me, as explained on that page.

2023-11-21 11:31:56

by John Garry

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On 21/11/2023 09:50, Thorsten Leemhuis wrote:
> Quite a few machines with Adaptec controllers seems to hang for a few
> tens of seconds to a few minutes before things start to work normally
> again for a while:
> https://bugzilla.kernel.org/show_bug.cgi?id=217599
>
> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
> commit despite a warning of mine to Sasha recently made it into 6.1.53
> -- and that way apparently recently reached more users recently, as
> quite a few joined that ticket.

Is there a full kernel log for this hanging system?

I can only see snippets in the ticket.

And what does /sys/class/scsi_host/host*/nr_hw_queues show?

Thanks,
John
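
For anyone on an affected machine who wants to supply the queue count John asks about, a minimal sketch follows. It is not something posted in the thread: the function name `show_nr_hw_queues` and its optional sysfs-root argument are made up for illustration (the argument only exists so the loop can be tried against a scratch directory), while the sysfs path is the one John names above.

```shell
#!/bin/sh
# Print the nr_hw_queues value of every SCSI host visible in sysfs.
# The optional first argument is a hypothetical override of the sysfs
# root (default: /sys), useful only for trying the loop on a test tree.
show_nr_hw_queues() {
    root="${1:-/sys}"
    for f in "$root"/class/scsi_host/host*/nr_hw_queues; do
        # Skip hosts without the attribute, or an unmatched glob.
        [ -r "$f" ] || continue
        host=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$host" "$(cat "$f")"
    done
}

show_nr_hw_queues "$@"
```

Run without arguments, this prints one line per SCSI host. The full kernel log John mentions could be saved alongside it with something like `dmesg > kernel.log` (possibly as root), ideally right after one of the hangs.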


Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On 21.11.23 12:30, John Garry wrote:
> On 21/11/2023 09:50, Thorsten Leemhuis wrote:
>> Quite a few machines with Adaptec controllers seems to hang for a few
>> tens of seconds to a few minutes before things start to work normally
>> again for a while:
>> https://bugzilla.kernel.org/show_bug.cgi?id=217599
>>
>> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
>> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
>> commit despite a warning of mine to Sasha recently made it into 6.1.53
>> -- and that way apparently recently reached more users recently, as
>> quite a few joined that ticket.
>
> Is there a full kernel log for this hanging system?
>
> I can only see snippets in the ticket.
>
> And what does /sys/class/scsi_host/host*/nr_hw_queues show?

Sorry, I'm just the man-in-the-middle: you need to ask in the ticket, as
the privacy policy for bugzilla.kernel.org does not allow CCing the
reporters from the ticket here without their consent.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.



2023-11-21 13:06:37

by James Bottomley

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On Tue, 2023-11-21 at 13:24 +0100, Linux regression tracking (Thorsten
Leemhuis) wrote:
> On 21.11.23 12:30, John Garry wrote:
[...]
> > Is there a full kernel log for this hanging system?
> >
> > I can only see snippets in the ticket.
> >
> > And what does /sys/class/scsi_host/host*/nr_hw_queues show?
>
> Sorry, I'm just the man-in-the-middle: you need to ask in the ticket,
> as  the privacy policy for bugzilla.kernel.org does not allow to CC
> the reporters from the ticket here without their consent.

How did you arrive at that conclusion? Tickets for linux-scsi are
vectored to the list:

https://lore.kernel.org/linux-scsi/[email protected]%2F/

So all the email addresses in the bugzilla are already archived on our
list.

James

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On 21.11.23 14:05, James Bottomley wrote:
> On Tue, 2023-11-21 at 13:24 +0100, Linux regression tracking (Thorsten
> Leemhuis) wrote:
>> On 21.11.23 12:30, John Garry wrote:
> [...]
>>> Is there a full kernel log for this hanging system?
>>> I can only see snippets in the ticket.
>>> And what does /sys/class/scsi_host/host*/nr_hw_queues show?
>>
>> Sorry, I'm just the man-in-the-middle: you need to ask in the ticket,
>> as  the privacy policy for bugzilla.kernel.org does not allow to CC
>> the reporters from the ticket here without their consent.
>
> How did you arrive at that conclusion?

To quote https://bugzilla.kernel.org/createaccount.cgi:
"""
Note that your email address will never be displayed to logged out
users. Only registered users will be able to see it.
"""

Not sure since when it's there. Maybe it was added due to EU GDPR?
Konstantin should know. But for me that's enough to not CC people. I
even heard from one well known kernel developer that his company got a
GDPR complaint because he had mentioned the reporter's name and email
address in a Reported-by: tag.

Side note: bugbot afaics can solve the initial problem (e.g. interact
with reporters in bugzilla by mail without exposing their email
address). But to use bugbot one *afaik* still has to reassign a ticket
to a specific product and component in bugzilla. Some subsystem
maintainers don't want that, as those issues then do not show up in the
usual queries.

Ciao, Thorsten

2023-11-21 13:32:16

by James Bottomley

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On Tue, 2023-11-21 at 14:24 +0100, Linux regression tracking (Thorsten
Leemhuis) wrote:
> On 21.11.23 14:05, James Bottomley wrote:
> > On Tue, 2023-11-21 at 13:24 +0100, Linux regression tracking
> > (Thorsten
> > Leemhuis) wrote:
> > > On 21.11.23 12:30, John Garry wrote:
> > [...]
> > > > Is there a full kernel log for this hanging system?
> > > > I can only see snippets in the ticket.
> > > > And what does /sys/class/scsi_host/host*/nr_hw_queues show?
> > >
> > > Sorry, I'm just the man-in-the-middle: you need to ask in the
> > > ticket, as  the privacy policy for bugzilla.kernel.org does not
> > > allow to CC the reporters from the ticket here without their
> > > consent.
> >
> > How did you arrive at that conclusion?
>
> To quote https://bugzilla.kernel.org/createaccount.cgi:
> """
> Note that your email address will never be displayed to logged out
> users. Only registered users will be able to see it.
> """

OK, so someone needs to update that to reflect reality.

> Not sure since when it's there. Maybe it was added due to EU GDPR?
> Konstantin should know. But for me that's enough to not CC people. I
> even heard from one well known kernel developer that his company got
> a
> GDPR complaint because he had mentioning the reporters name and email
> address in a Reported-by: tag.
>
> Side note: bugbot afaics can solve the initial problem (e.g. interact
> with reporters in bugzilla by mail without exposing their email
> address). But to use bugbot one *afaik* still has to reassign a
> ticket to a specific product and component in bugzilla. Some
> subsystem maintainers don't want that, as that issues then does not
> show up in the usual queries.

I'm not sure we need to solve a problem that doesn't exist. Switching
to email is a standard maintainer response:

https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/
...

James

2023-11-24 16:25:31

by Greg KH

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On Tue, Nov 21, 2023 at 10:50:57AM +0100, Thorsten Leemhuis wrote:
> * @SCSI maintainers: could you please look into below please?
>
> * @Stable team: you might want to take a look as well and consider a
> revert in 6.1.y (yes, I know, those are normally avoided, but here it
> might make sense).
>
> Hi everyone!
>
> TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
> hangs for a while) that was reported months ago already but is still not
> fixed. Not only that, it apparently more and more users run into this
> recently, as the culprit was recently integrated into 6.1.y; I wonder if
> it would be best to revert it there, unless a fix for mainline comes
> into reach soon.
>
> Details:
>
> Quite a few machines with Adaptec controllers seems to hang for a few
> tens of seconds to a few minutes before things start to work normally
> again for a while:
> https://bugzilla.kernel.org/show_bug.cgi?id=217599
>
> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
> commit despite a warning of mine to Sasha recently made it into 6.1.53
> -- and that way apparently recently reached more users recently, as
> quite a few joined that ticket.
>
> The culprit is authored by Sagar Biradar who unless I missed something
> never replied even once to the ticket or earlier mails about it. Lore
> has no messages from him since early June.
>
> Hannes Reinecke at least tried to fix it a few weeks ago (many thx), but
> that didn't work out (see the ticket for details). Since then things
> look stalled again, which is, ehh, unfortunate when it comes to
> regressions.

I am loath to revert a stable patch that has been there for so long as
any upgrade will just cause the same bug to show back up. Why can't we
just revert it in Linus's tree now and I'll take that revert in the
stable trees as well?

thanks,

greg k-h

2023-11-24 22:59:10

by Martin K. Petersen

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too


Greg,

> I am loath to revert a stable patch that has been there for so long as
> any upgrade will just cause the same bug to show back up. Why can't we
> just revert it in Linus's tree now and I'll take that revert in the
> stable trees as well?

Hannes just posted another tentative patch. I'd prefer an incremental
fix if possible.

--
Martin K. Petersen Oracle Linux Engineering

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On 24.11.23 17:25, Greg KH wrote:
> On Tue, Nov 21, 2023 at 10:50:57AM +0100, Thorsten Leemhuis wrote:
>> * @SCSI maintainers: could you please look into below please?
>>
>> * @Stable team: you might want to take a look as well and consider a
>> revert in 6.1.y (yes, I know, those are normally avoided, but here it
>> might make sense).
>>
>> Hi everyone!
>>
>> TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
>> hangs for a while) that was reported months ago already but is still not
>> fixed. Not only that, it apparently more and more users run into this
>> recently, as the culprit was recently integrated into 6.1.y; I wonder if
>> it would be best to revert it there, unless a fix for mainline comes
>> into reach soon.
>>
>> Details:
>>
>> Quite a few machines with Adaptec controllers seems to hang for a few
>> tens of seconds to a few minutes before things start to work normally
>> again for a while:
>> https://bugzilla.kernel.org/show_bug.cgi?id=217599
>>
>> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
>> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
>> commit despite a warning of mine to Sasha recently made it into 6.1.53
>> -- and that way apparently recently reached more users recently, as
>> quite a few joined that ticket.
>[...]
> I am loath to revert a stable patch that has been there for so long as
> any upgrade will just cause the same bug to show back up. Why can't we
> just revert it in Linus's tree now and I'll take that revert in the
> stable trees as well?

FWIW, I know and in general agree with that strategy, that's why I
normally wouldn't have brought a stable-only revert up for
consideration. But this issue to me looked somewhat special and urgent
for two and a half reasons: (1) that backport apparently made a lot more
people suddenly hit the issue (2) there was also this data corruption
aspect one of the reporters mentioned (not sure if that is real and/or
if this might be just a 6.1.y thing). Furthermore for 6.1.y it was
recently confirmed that reverting the change fixes things, while we iirc
had no such confirmation for recent mainline kernels at that point. So
it looked like it would take a while to get this sorted out in mainline.
But it seems we finally might get closer to that now, so yeah, maybe
it's not worth a stable revert.

Ciao, Thorsten

2023-12-29 20:13:36

by Salvatore Bonaccorso

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

Hi all,

On Sat, Nov 25, 2023 at 08:10:35AM +0100, Thorsten Leemhuis wrote:
> On 24.11.23 17:25, Greg KH wrote:
> > On Tue, Nov 21, 2023 at 10:50:57AM +0100, Thorsten Leemhuis wrote:
> >> * @SCSI maintainers: could you please look into below please?
> >>
> >> * @Stable team: you might want to take a look as well and consider a
> >> revert in 6.1.y (yes, I know, those are normally avoided, but here it
> >> might make sense).
> >>
> >> Hi everyone!
> >>
> >> TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
> >> hangs for a while) that was reported months ago already but is still not
> >> fixed. Not only that, it apparently more and more users run into this
> >> recently, as the culprit was recently integrated into 6.1.y; I wonder if
> >> it would be best to revert it there, unless a fix for mainline comes
> >> into reach soon.
> >>
> >> Details:
> >>
> >> Quite a few machines with Adaptec controllers seems to hang for a few
> >> tens of seconds to a few minutes before things start to work normally
> >> again for a while:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=217599
> >>
> >> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
> >> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
> >> commit despite a warning of mine to Sasha recently made it into 6.1.53
> >> -- and that way apparently recently reached more users recently, as
> >> quite a few joined that ticket.
> >[...]
> > I am loath to revert a stable patch that has been there for so long as
> > any upgrade will just cause the same bug to show back up. Why can't we
> > just revert it in Linus's tree now and I'll take that revert in the
> > stable trees as well?
>
> FWIW, I know and in general agree with that strategy, that's why I
> normally wouldn't have brought a stable-only revert up for
> consideration. But this issue to me looked somewhat special and urgent
> for two and a half reasons: (1) that backport apparently made a lot more
> people suddenly hit the issue (2) there was also this data corruption
> aspect one of the reporters mentioned (not sure if that is real and/or
> if this might be just a 6.1.y thing). Furthermore for 6.1.y it was
> recently confirmed that reverting the change fixes things, while we iirc
> had no such confirmation for recent mainline kernels at that point. So
> it looked like it would take a while to get this sorted out in mainline.
> But it seems we finally might get closer to that now, so yeah, maybe
> it's not worth a stable revert.

If I'm not completely wrong, the commit has indeed finally been
reverted in mainline, with c5becf57dd56 ("Revert "scsi: aacraid: Reply
queue mapping to CPUs based on IRQ affinity"").

This is what was mentioned here:
https://bugzilla.kernel.org/show_bug.cgi?id=217599#c52

So should/can it be reverted now as well in the 6.1.y stable series
(and the others as needed)?

#regzbot link: https://bugs.debian.org/1059624
#regzbot fixed-by: c5becf57dd56

Thorsten, hope I got the above right.

Regards,
Salvatore

2023-12-30 10:58:36

by Greg KH

Subject: Re: scsi regression that after months is still not addressed and now bothering 6.1.y users, too

On Fri, Dec 29, 2023 at 09:13:18PM +0100, Salvatore Bonaccorso wrote:
> Hi all,
>
> On Sat, Nov 25, 2023 at 08:10:35AM +0100, Thorsten Leemhuis wrote:
> > On 24.11.23 17:25, Greg KH wrote:
> > > On Tue, Nov 21, 2023 at 10:50:57AM +0100, Thorsten Leemhuis wrote:
> > >> * @SCSI maintainers: could you please look into below please?
> > >>
> > >> * @Stable team: you might want to take a look as well and consider a
> > >> revert in 6.1.y (yes, I know, those are normally avoided, but here it
> > >> might make sense).
> > >>
> > >> Hi everyone!
> > >>
> > >> TLDR: I noticed a regression (Adaptec 71605z with aacraid sometimes
> > >> hangs for a while) that was reported months ago already but is still not
> > >> fixed. Not only that, it apparently more and more users run into this
> > >> recently, as the culprit was recently integrated into 6.1.y; I wonder if
> > >> it would be best to revert it there, unless a fix for mainline comes
> > >> into reach soon.
> > >>
> > >> Details:
> > >>
> > >> Quite a few machines with Adaptec controllers seems to hang for a few
> > >> tens of seconds to a few minutes before things start to work normally
> > >> again for a while:
> > >> https://bugzilla.kernel.org/show_bug.cgi?id=217599
> > >>
> > >> That problem is apparently caused by 9dc704dcc09eae ("scsi: aacraid:
> > >> Reply queue mapping to CPUs based on IRQ affinity") [v6.4-rc7]. That
> > >> commit despite a warning of mine to Sasha recently made it into 6.1.53
> > >> -- and that way apparently recently reached more users recently, as
> > >> quite a few joined that ticket.
> > >[...]
> > > I am loath to revert a stable patch that has been there for so long as
> > > any upgrade will just cause the same bug to show back up. Why can't we
> > > just revert it in Linus's tree now and I'll take that revert in the
> > > stable trees as well?
> >
> > FWIW, I know and in general agree with that strategy, that's why I
> > normally wouldn't have brought a stable-only revert up for
> > consideration. But this issue to me looked somewhat special and urgent
> > for two and a half reasons: (1) that backport apparently made a lot more
> > people suddenly hit the issue (2) there was also this data corruption
> > aspect one of the reporters mentioned (not sure if that is real and/or
> > if this might be just a 6.1.y thing). Furthermore for 6.1.y it was
> > recently confirmed that reverting the change fixes things, while we iirc
> > had no such confirmation for recent mainline kernels at that point. So
> > it looked like it would take a while to get this sorted out in mainline.
> > But it seems we finally might get closer to that now, so yeah, maybe
> > it's not worth a stable revert.
>
> If I'm not completely wrong, finally indeed the commit has been
> reverted in mainline, with c5becf57dd56 ("Revert "scsi: aacraid: Reply
> queue mapping to CPUs based on IRQ affinity"") .
>
> This is what was mentioned here:
> https://bugzilla.kernel.org/show_bug.cgi?id=217599#c52
>
> So should/can it be reverted it now as well on the 6.1.y stable series
> (and the others up as needed?)

Now queued up, thanks.

greg k-h