Subject: Re: Linux regressions report for mainline [2023-09-24]

On 25.09.23 10:02, Greg KH wrote:
> On Sun, Sep 24, 2023 at 04:17:40PM +0000, Regzbot (on behalf of Thorsten Leemhuis) wrote:
>> (2) Nearly six weeks ago there was a report that 101bd907b4244a ("misc:
>> rtsx: judge ASPM Mode to set PETXCFG Reg") [v6.5-rc6, v6.4.11, v6.1.46,
>> v5.15.127] broke booting various laptops (many or all of them are Dell).
>> This apparently plagues quite a few users, hence there were multiple
>> reports (see [2] for those I'm aware of). At least Fedora, openSUSE, and
>> NixOS have meanwhile reverted the change in their latest stable kernels
>> [3]. One and a half weeks ago I proposed to revert the culprit when I fully
>> noticed its impact, but Greg wanted to give the developers more time.
>> We finally have a fix in sight now [5]; someone affected replied that it
>> helps. Not sure what's the right way forward now. But overall this to me
>> feels a lot like "this is not how a regression should be handled".
>> That's why I wanted to bring it up here, to ensure you are aware
>> of this.
>
> We now have confirmed testing that the proposed fix resolves the issue
> so I'll be getting it to Linus in time for the next -rc.

Many thx!

> I've been
> traveling all last week and this week for conferences so my response
> times have been a bit slow, sorry.

No worries, I already suspected this[1]. The major aspect in this whole
episode that bugs me a lot is different anyway:

Wouldn't it have been much much better to revert[2] the culprit quickly
once it was known to cause a regression that annoyed some users a whole
lot[3, 4]?

Yes, looking back now it's easy to ask. But I encounter similar
situations all the time: developers and maintainers are
(understandably!) often quite reluctant to revert commits causing
regressions, especially when a fix seems not far off. But in the end it
often (like in this case) takes quite a while to polish the fix, get it
tested, reviewed, in -next for a day or two, into mainline, and (when
needed, like in this case) incorporated into the affected stable series.

That's why I wrote the "Expectations and best practices for fixing
regressions" section in Documentation/process/handling-regressions.rst,
which mentions rough time frames to help decide when a revert is appropriate.
But nobody cares about them -- and I don't blame anyone, as Linus never
ACKed them; even parts that are directly based on statements from Linus
are ignored all the time (often because people simply don't know about
them [5]). That makes my job hard. :-/

Ciao, Thorsten

[1] Sadly I couldn't make it to Bilbao this year; ohh, and BTW, enjoy
Paris this week; wanted to be there, but that didn't work out due to
stupid reasons. :-/

[2] Or would that have caused a big regression for anyone? Doesn't look
like it from here, but maybe I'm missing something.

[3] FWIW, I consider it partly my fault that this didn't happen, as I
should have rooted for this way earlier. :-/ I was on vacation when
the report came in and only realized the full impact much later; then I
finally suggested to revert this ~11 days ago, when a fix seemed not too
far off. OTOH I still think a revert at that point would have been the
right thing to do.

[4] And reapply it later (outside of the merge window) together with a
fix or directly in fixed form.

[5] recent example:
https://lore.kernel.org/all/[email protected]/


2023-09-25 22:13:34

by Genes Lists

Subject: Re: Linux regressions report for mainline [2023-09-24]

On 9/25/23 05:11, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 25.09.23 10:02, Greg KH wrote:
>> On Sun, Sep 24, 2023 at 04:17:40PM +0000, Regzbot (on behalf of Thorsten Leemhuis) wrote:
>>> (2) Nearly six weeks ago there was a report that 101bd907b4244a ("misc:
>>> rtsx: judge ASPM Mode to set PETXCFG Reg") [v6.5-rc6, v6.4.11, v6.1.46,
>>> v5.15.127] broke booting various laptops (many or all of them are Dell).
>>> This apparently plagues quite a few users, hence there were multiple
>>> reports (see [2] for those I'm aware of). At least Fedora, openSUSE, and
>>> NixOS have meanwhile reverted the change in their latest stable kernels
>>> [3]. One and a half weeks ago I proposed to revert the culprit when I fully
>>> noticed its impact, but Greg wanted to give the developers more time.
>>> We finally have a fix in sight now [5]; someone affected replied that it
>>> helps. Not sure what's the right way forward now. But overall this to me
>>> feels a lot like "this is not how a regression should be handled".
>>> That's why I wanted to bring it up here, to ensure you are aware
>>> of this.
>>
>> We now have confirmed testing that the proposed fix resolves the issue
>> so I'll be getting it to Linus in time for the next -rc.
>
> Many thx!
>
Thank you all for taking care of this - much appreciated.

gene

2023-09-28 20:13:43

by Greg KH

Subject: Re: Linux regressions report for mainline [2023-09-24]

On Mon, Sep 25, 2023 at 11:11:51AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 25.09.23 10:02, Greg KH wrote:
> > I've been
> > traveling all last week and this week for conferences so my response
> > times have been a bit slow, sorry.
>
> No worries, I already suspected this[1]. The major aspect in this whole
> episode that bugs me a lot is different anyway:
>
> Wouldn't it have been much much better to revert[2] the culprit quickly
> once it was known to cause a regression that annoyed some users a whole
> lot[3, 4]?

Possibly, yes. It's a balancing act between keeping the pressure on the
developer to provide a fix, vs. the severity of the issue and how
wide-spread it is, vs. my ability to do anything at all due to
non-development issues (i.e. travel and conference work.)

Trying to pick the best thing with all of those is hard. Sometimes we
get it right, sometimes we get it wrong, and usually someone is upset no
matter what we pick, including a lack of sleep for the maintainer.

So "it's complicated", as you know...

thanks,

greg k-h