Hi Linus. Just three 6.6 regressions remain on my list after a few I
tracked were resolved last week. One of the remaining ones is new:
module loading trouble on some laptops. Not nice, but likely nothing
many users will encounter. The quota compilation oddity problem from
Andy is also still around (unless it was fixed without me noticing); and
a memleak, too. See below for details.
FWIW, there was some news wrt the two 6.5 regressions I mentioned in
last week's report[1]:
* There was another report about a blank screen during boot on a Lenovo
laptop because simpledrm (that users apparently had enabled without
problems beforehand) started to support those machines due to
60aebc9559492c ("drivers/firmware: Move sysfb_init() from
device_initcall to subsys_initcall_sync"). I suggested a revert, but the
developers disagree (to quote: "From my point of view, this is not a
regression, 60aebc9559492c doesn't cause a problem, but exposes a
problem.") and the DRM maintainer didn't comment. At least we got a very
small step closer to finding the root cause. In case you care about the
details or want to speak up, check out this thread:
https://lore.kernel.org/dri-devel/[email protected]/
* The FUSE on Android breakage is still being discussed; it's still unclear
to me if this really is a regression, as the reporter did not answer some
questions I brought up to hopefully clarify the situation. But it seems
Greg thinks that this is a regression. In case you care, check out this
thread:
https://lore.kernel.org/all/2023102731-wobbly-glimpse-97f5@gregkh/
[1]
https://lore.kernel.org/all/[email protected]/
Ciao, Thorsten
---
Hi, this is regzbot, the Linux kernel regression tracking bot.
Currently I'm aware of 3 regressions in linux-mainline. Find the
current status below and the latest on the web:
https://linux-regtracking.leemhuis.info/regzbot/mainline/
Bye bye, hope to see you soon for the next report.
Regzbot (on behalf of Thorsten Leemhuis)
======================================================
current cycle (v6.5.. aka v6.6-rc), culprit identified
======================================================
[ *NEW* ] sysfs: cannot create duplicate filename .../system76_acpi::kbd_backlight/color
----------------------------------------------------------------------------------------
https://linux-regtracking.leemhuis.info/regzbot/regression/bugzilla.kernel.org/218045/
https://bugzilla.kernel.org/show_bug.cgi?id=218045
https://lore.kernel.org/lkml/[email protected]/
By Johannes Penßel and Johannes Penßel; 3 days ago; 3 activities, latest 0 days ago.
Introduced in c7d80059b086 (v6.6-rc1)
Recent activities from: Johannes Penßel (2), Bagas Sanjaya (1)
[ *NEW* ] stable offsets directory operation support triggers offset_ctx->xa memory leak
----------------------------------------------------------------------------------------
https://linux-regtracking.leemhuis.info/regzbot/regression/bugzilla.kernel.org/218039/
https://bugzilla.kernel.org/show_bug.cgi?id=218039
https://lore.kernel.org/lkml/[email protected]/
By vladbu and vladbu; 6 days ago; 4 activities, latest 3 days ago.
Introduced in 6faddda69f62 (v6.6-rc1)
Recent activities from: vladbu (2), Bagas Sanjaya (1), Vlad Buslov (1)
quota: boot on Intel Merrifield after merge commit 1500e7e0726e
---------------------------------------------------------------
https://linux-regtracking.leemhuis.info/regzbot/regression/lore/[email protected]/
https://lore.kernel.org/linux-fsdevel/[email protected]/
By Andy Shevchenko; 12 days ago; 46 activities, latest 5 days ago.
Introduced in 024128477809 (v6.6-rc1)
Recent activities from: Andy Shevchenko (2), Kees Cook (1), Baokun
Li (1), Jan Kara (1)
2 patch postings are associated with this regression, the latest is this:
* Re: [GIT PULL] ext2, quota, and udf fixes for 6.6-rc1
https://lore.kernel.org/linux-fsdevel/[email protected]/
8 days ago, by Andy Shevchenko
=============
End of report
=============
All regressions marked '[ *NEW* ]' were added since the previous report,
which can be found here:
https://lore.kernel.org/r/[email protected]
Thanks for your attention, have a nice day!
Regzbot, your hard working Linux kernel regression tracking robot
P.S.: Wanna know more about regzbot or how to use it to track regressions
for your subsystem? Then check out the getting started guide or the
reference documentation:
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md
https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md
The short version: if you see a regression report you want to see
tracked, just send a reply to the report where you Cc
[email protected] with a line like this:
#regzbot introduced: v5.13..v5.14-rc1
If you want to fix a tracked regression, just do what is expected
anyway: add a 'Link:' tag with the url to the report, e.g.:
Link: https://lore.kernel.org/all/[email protected]/
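To illustrate the two conventions described above, here is a minimal sketch of how a script might pick '#regzbot' command lines and 'Link:' tags out of a mail body. It is not regzbot's actual implementation: the function name scan_mail_body, the regular expressions, and the choice to skip quoted lines are all assumptions made for illustration only.

```python
import re

# Hypothetical patterns; regzbot's real parsing rules may differ.
REGZBOT_CMD = re.compile(r"^#regzbot\s+(\w+):\s*(.+)$")
LINK_TAG = re.compile(r"^Link:\s*(\S+)$")


def scan_mail_body(body):
    """Return (commands, links) found in the unquoted lines of a mail body.

    commands is a list of (name, argument) tuples, e.g.
    ("introduced", "v5.13..v5.14-rc1"); links is a list of URLs taken
    from 'Link:' tags.
    """
    commands = []
    links = []
    for raw in body.splitlines():
        line = raw.strip()
        # Assumption: lines quoted from an earlier mail are ignored.
        if line.startswith(">"):
            continue
        match = REGZBOT_CMD.match(line)
        if match:
            commands.append((match.group(1), match.group(2).strip()))
            continue
        match = LINK_TAG.match(line)
        if match:
            links.append(match.group(1))
    return commands, links
```

Fed a reply containing the '#regzbot introduced: v5.13..v5.14-rc1' line from above, the sketch would return that range as the argument of an 'introduced' command; a commit message with a 'Link:' tag would yield the report URL.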
On Sun, 29 Oct 2023 at 03:52, Regzbot (on behalf of Thorsten Leemhuis)
<[email protected]> wrote:
>
> One of the remaining ones is new:
> module loading trouble on some laptops. Not nice, but likely nothing
> many users will encounter. The quota compilation oddity problem from
> Andy is also still around (unless it was fixed without me noticing); and
> a memleak, too.
The quota thing remains unexplained, and honestly seems like a timing
issue that just happens to hit Andy. Very strange, but I suspect that
without more reports (that may or may not ever happen), we're stuck.
> * There was another report about a blank screen during boot on a Lenovo
> laptop because simpledrm (that users apparently had enabled without
> problems beforehand) started to support those machines due to
> 60aebc9559492c ("drivers/firmware: Move sysfb_init() from
> device_initcall to subsys_initcall_sync"). I suggested a revert, but the
> developers disagree (to quote: "From my point of view, this is not a
> regression, 60aebc9559492c doesn't cause a problem, but exposes a
> problem.")
Honestly, "exposes a problem" is pretty much the *definition* of a
regression. So that excuse is particularly bad.
The whole point of "regression" is "things that used to work no longer work".
And no, "there's another bug that needs to be fixed" is _not_ the
answer - not unless you have that fix in hand.
That said, this already went into 6.5, so I'm not going to revert it
now just before the 6.6 release. That would be more dangerous than
just letting things be. But yes, a revert is likely the right thing to
do, unless people have figured out what is wrong with simplefb.
Linus
Hi, Linus,
On Mon, Oct 30, 2023 at 1:19 AM Linus Torvalds
<[email protected]> wrote:
>
> On Sun, 29 Oct 2023 at 03:52, Regzbot (on behalf of Thorsten Leemhuis)
> <[email protected]> wrote:
> >
> > One of the remaining ones is new:
> > module loading trouble on some laptops. Not nice, but likely nothing
> > many users will encounter. The quota compilation oddity problem from
> > Andy is also still around (unless it was fixed without me noticing); and
> > a memleak, too.
>
> The quota thing remains unexplained, and honestly seems like a timing
> issue that just happens to hit Andy. Very strange, but I suspect that
> without more reports (that may or may not ever happen), we're stuck.
>
> > * There was another report about a blank screen during boot on a Lenovo
> > laptop because simpledrm (that users apparently had enabled without
> > problems beforehand) started to support those machines due to
> > 60aebc9559492c ("drivers/firmware: Move sysfb_init() from
> > device_initcall to subsys_initcall_sync"). I suggested a revert, but the
> > developers disagree (to quote: "From my point of view, this is not a
> > regression, 60aebc9559492c doesn't cause a problem, but exposes a
> > problem.")
>
> Honestly, "exposes a problem" is pretty much the *definition* of a
> regression. So that excuse is particularly bad.
>
> The whole point of "regression" is "things that used to work no longer work".
>
> And no, "there's another bug that needs to be fixed" is _not_ the
> answer - not unless you have that fix in hand.
>
> That said, this already went into 6.5, so I'm not going to revert it
> now just before the 6.6 release. That would be more dangerous than
> just letting things be. But yes, a revert is likely the right thing to
> do, unless people have figured out what is wrong with simplefb.
We are investigating and hope the simpledrm problem can be fixed in
some days [1], and the blank screen does not seem to be a very harmful
problem (maybe I'm wrong, but I think most people are using a GUI now).
So, can we keep the commit 60aebc9559492c for now?
[1] https://lore.kernel.org/dri-devel/CAAhV-H7UTnTWQeT_qo7VgBczaZo37zjosREr16H8DsLi21XPqQ@mail.gmail.com/T/#t
Huacai
>
> Linus
On Sun, 29 Oct 2023 at 16:18, Huacai Chen <[email protected]> wrote:
>
> We are investigating and hope the simpledrm problem can be fixed in
> some days [1],
I don't understand your "some days". The original report was two+
weeks ago, and the link you point to does not seem to have a suggested
patch for the problem either.
So where does the "some days" come from?
The WHOLE POINT of the "no regressions" rule - and the reason it came
to be in the first place - was that we used to have these endless "one
step forward, two steps back" things with suspend/resume in
particular, where people fixed one device, but then broke a random
number of other devices, and kept saying "but I fixed something".
No. If you broke something else, YOU DIDN'T FIX ANYTHING AT ALL.
This is literally why we have that "no regressions" rule. No amount of
"but it's a fix" is valid at all if something else breaks. And no
amount of "I will fix the thing I broke in the future" is valid
either.
If you don't have a fix for it, it's broken. And I don't even see a
*suggested* fix for people to try out.
> and the blank screen seems not a very harmful problem
> (maybe I'm wrong but I think most of people are using GUI now). So,
> can we keep the commit 60aebc9559492c at this time?
At least the email from Evan Preston seems to imply it's a blank
screen that doesn't go away.
"Upgrading from Linux 6.4.12 to 6.5 and later results in only a
blank screen after boot and a rapidly flashing device-access-status
indicator"
And no, "most people using GUI" doesn't matter. You are supposed to be
able to upgrade your working kernel, and it's supposed to keep
working. That's *important*, because it's really really important that
people *trust* that they can upgrade the kernel and not end up with
something non-working, because that's how people then dare do kernel
updates and dare test new kernels.
If people then stop testing new kernels because they think new kernels
might break their setup, we have lost something truly important.
And yes, there are always exceptions. At some point, devices are just
too old legacy and there is no way of testing. Or we've had some
interface that was *so* mis-designed that it was a fundamental
security issue or something like that.
But no, this does not seem to be one of those issues.
Now, I'm not going to revert it just before releasing v6.6 (which I
have locally tagged, but not pushed out yet). And I'll have the merge
window for 6.7 opening tomorrow. But if this is not fixed by -rc1,
we'll just revert it.
Linus
Hi, Linus,
On Mon, Oct 30, 2023 at 10:53 AM Linus Torvalds
<[email protected]> wrote:
>
> On Sun, 29 Oct 2023 at 16:18, Huacai Chen <[email protected]> wrote:
> >
> > We are investigating and hope the simpledrm problem can be fixed in
> > some days [1],
>
> I don't understand your "some days". The original report was two+
> weeks ago, and the link you point to does not seem to have a suggested
> patch for the problem either.
>
> So where does the "some days" come from?
>
> The WHOLE POINT of the "no regressions" rule - and the reason it came
> to be in the first place - was that we used to have these endless "one
> step forward, two steps back" things with suspend/resume in
> particular, where people fixed one device, but then broke a random
> number of other devices, and kept saying "but I fixed something".
>
> No. If you broke something else, YOU DIDN'T FIX ANYTHING AT ALL.
>
> This is literally why we have that "no regressions" rule. No amount of
> "but it's a fix" is valid at all if something else breaks. And no
> amount of "I will fix the thing I broke in the future" is valid
> either.
>
> If you don't have a fix for it, it's broken. And I don't even see a
> *suggested* fix for people to try out.
>
> > and the blank screen seems not a very harmful problem
> > (maybe I'm wrong but I think most of people are using GUI now). So,
> > can we keep the commit 60aebc9559492c at this time?
>
> At least the email from Evan Preston seems to imply it's a blank
> screen that doesn't go away.
>
> "Upgrading from Linux 6.4.12 to 6.5 and later results in only a
> blank screen after boot and a rapidly flashing device-access-status
> indicator"
>
> And no, "most people using GUI" doesn't matter. You are supposed to be
> able to upgrade your working kernel, and it's supposed to keep
> working. That's *important*, because it's really really important that
> people *trust* that they can upgrade the kernel and not end up with
> something non-working, because that's how people then dare do kernel
> updates and dare test new kernels.
>
> If people then stop testing new kernels because they think new kernels
> might break their setup, we have lost something truly important.
>
> And yes, there are always exceptions. At some point, devices are just
> too old legacy and there is no way of testing. Or we've had some
> interface that was *so* mis-designed that it was a fundamental
> security issue or something like that.
>
> But no, this does not seem to be one of those issues.
>
> Now, I'm not going to revert it just before releasing v6.6 (which I
> have locally tagged, but not pushed out yet). And I'll have the merge
> window for 6.7 opening tomorrow. But if this is not fixed by -rc1,
> we'll just revert it.
OK, understood. I will try my best. Thank you.
Huacai
>
> Linus
On 29.10.23 18:19, Linus Torvalds wrote:
> On Sun, 29 Oct 2023 at 03:52, Regzbot (on behalf of Thorsten Leemhuis)
> <[email protected]> wrote:
>
>> * There was another report about a blank screen during boot on a Lenovo
>> laptop because simpledrm (that users apparently had enabled without
>> problems beforehand) started to support those machines due to
>> 60aebc9559492c ("drivers/firmware: Move sysfb_init() from
>> device_initcall to subsys_initcall_sync"). I suggested a revert, but the
>> developers disagree (to quote: "From my point of view, this is not a
>> regression, 60aebc9559492c doesn't cause a problem, but exposes a
>> problem.")
>
> Honestly, "exposes a problem" is pretty much the *definition* of a
> regression. So that excuse is particularly bad.
>
> The whole point of "regression" is "things that used to work no longer work".
>
> And no, "there's another bug that needs to be fixed" is _not_ the
> answer - not unless you have that fix in hand.
Thx for stating it so clearly. I had tried to get that point across, but
failed despite some links to LKML messages from you that covered similar
situations.
This happens frequently, which is tiresome and draining for me. I wish
we had *your* overall view on what regressions are and how they are meant
to be handled written up in one short text you explicitly vetted. That
might give me a better lever and make things easier for maintainers as
well, especially new ones.
See the text below[1] to give you a rough idea what kind of text I'm
thinking of.
The beginning of the merge window is a bad time to bring this up for
discussion, especially when you're also traveling. So I will let this rest
for now and get back to you. Unless you say "that's a bad idea, don't
waste your time on it".
> That said, this already went into 6.5, so I'm not going to revert it
> now just before the 6.6 release. That would be more dangerous than
> just letting things be.
Yup, fully agreed. Thx again for looking into this.
Ciao, Thorsten
[1] here is something I quickly compiled
"""
Linus' "no regressions" rule
----------------------------
The goal
~~~~~~~~
People should always feel like they can update to a new kernel version
without worrying that anything might break.
What qualifies as a regression
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It's a regression if some practical use case running fine with one Linux
version works worse or not at all with a newer version compiled using a
similar configuration.
To elaborate:
* The aspect "works worse" includes higher power consumption or lower
performance, unless the difference is minor.
* Among the things that do not qualify as "practical use case" are
legacy museum style equipment, ABI/API test scripts, and microbenchmarks.
* It's irrelevant if the change causing the regression only does so by
exposing a problem that beforehand was silently lurking somewhere else
(hardware, firmware, userland, or some other part of the kernel).
* It's irrelevant if a change is fixing some undefined behavior or a bug.
* It's irrelevant if users could easily avoid the problem somehow, e.g.
by changing the configuration or updating some other software (this
includes firmware stored in the device or shipped in the linux-firmware
package).
* A "similar configuration" usually means that the .config of the old
kernel was taken as base for the newer one and processed with
"olddefconfig".
* Old and new kernel versions obviously must both be untainted vanilla
kernels.
How to handle regression reports
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[FIXME: this section is missing for now; this requires some more thought.
I guess what's needed here is basically the very essence of what
Documentation/process/handling-regressions.rst outlines in "Expectations
and best practices for fixing regressions"]
Closing words
~~~~~~~~~~~~~
Reality is never entirely black-and-white, therefore in rare cases
regressions will not be fixed.
For example, sometimes it is impossible to resolve a security
vulnerability without causing a regression. That being said, developers
should try hard to avoid such an outcome and when unable to do so
minimize the impact as much as possible.
Another example: regressions only found years after the culprit was
merged might be handled like regular bugs or not addressed at all, as
the developers who introduced it might have moved on to other endeavors.
"""