2022-02-02 16:01:30

by Thorsten Leemhuis

Subject: [PATCH v4 0/3] docs: add two texts covering regressions

"We don't cause regressions" might be the first rule of Linux kernel
development, but it and other aspects of regressions nevertheless are hardly
described in the Linux kernel's documentation. The following patches change
this by creating two documents dedicated to the topic.

The second patch could easily be folded into the first one, but was kept
separate, as it might be a bit controversial. This also allows the patch
description to explain some background for this part of the document.
Additionally, ACKs and Reviewed-by tags can be collected separately this way.

v4 (this version):
- countless small and medium changes after review feedback from Jon (thx), which
also led to a big change:
- split the document into two, one for users and one for developers (both added
by the first patch, as they are interlinked)
- fixed and improved a bunch of areas I stumbled upon while checking the text
again after the split
- add a third patch to get the user-centric document on regressions
mentioned in Documentation/admin-guide/reporting-issues.rst
- note: the content added by the second patch did not change significantly,
that's why I left an earlier reviewed-by for the patch and an ACK for the
series in place there, but dropped the ACK for the first patch of the series

v3 (https://lore.kernel.org/regressions/[email protected]/):
- drop RFC tag
- heavily reshuffled and slightly adjusted the text in the sections "The
important bits for people fixing regressions" and "How to add a regression to
regzbot's tracking somebody else reported?" to make them easier to grasp
- a few small fixes and improvements
- add ACK for the series from Greg (now for real)

v2/RFC (https://lore.kernel.org/linux-doc/[email protected]/):
- a lot of small fixes, most are for spelling mistakes and grammar
errors/problems pointed out in the review feedback I got so far
- add ACK for the series from Greg

v1/RFC (https://lore.kernel.org/linux-doc/[email protected]/):
- initial version

---

Hi! Here is an updated version of my patch-set adding documentation regarding
regressions. This brings bigger changes after Jon took a look and suggested
splitting the text up. I did that and changed a bunch of other things along the
way. But I for now decided against splitting off the regression tracking stuff
into one or two other documents, as suggested by Jon, as that IMHO distributes
the information over too many places.

Ciao, Thorsten

Thorsten Leemhuis (3):
docs: add two documents about regression handling
docs: regressions*rst: rules of thumb for handling regressions
docs: reporting-issues.rst: link new document about regressions

Documentation/admin-guide/index.rst | 1 +
.../admin-guide/regressions-users.rst | 448 +++++++++++
.../admin-guide/reporting-issues.rst | 60 +-
Documentation/process/index.rst | 1 +
Documentation/process/regressions-devs.rst | 753 ++++++++++++++++++
MAINTAINERS | 2 +
6 files changed, 1234 insertions(+), 31 deletions(-)
create mode 100644 Documentation/admin-guide/regressions-users.rst
create mode 100644 Documentation/process/regressions-devs.rst


base-commit: b8f4eee6a630ef8c5f00594e25c377463b4f299c
--
2.31.1


2022-02-03 09:50:21

by Thorsten Leemhuis

Subject: [PATCH v4 2/3] docs: regressions*rst: rules of thumb for handling regressions

Add a section with a few rules of thumb about how
quickly developers should address regressions to
Documentation/process/regressions-devs.rst; additionally,
add a short paragraph about this to the companion document
Documentation/admin-guide/regressions-users.rst as well.

The rules of thumb were written after studying the quotes from Linus
found in regressions-devs.rst and especially influenced by statements
like "Users are literally the _only_ thing that matters" and "without
users, your program is not a program, it's a pointless piece of code
that you might as well throw away". The author interpreted those in
light of how the various Linux kernel series are maintained
currently and what those practices might mean for users running into a
regression on a small or big kernel update.

That for example led to the paragraph starting with "Aim to get fixes
for regressions mainlined within one week after identifying the culprit,
if the regression was introduced in a stable/longterm release or the
devel cycle for the latest mainline release". Some might see this as
a pretty high bar, but on the other hand something like that is needed to
not leave users out in the cold for too long -- which can quickly happen
when updating to the latest stable series, as the previous one is
normally stamped "End of Life" about three or four weeks after a new
mainline release. This makes a lot of users switch during this
timeframe. Any of them thus risk running into regressions not promptly
fixed; even worse, once the previous stable series is EOLed for real,
users that face a regression might be left with only three options:

(1) continue running an outdated and thus potentially insecure kernel
version from an abandoned stable series

(2) run the kernel with the regression

(3) downgrade to an earlier longterm series still supported

This is better avoided, as (1) puts users and their data in danger, (2)
will only be possible if it's a minor regression that doesn't interfere
with booting or serious usage, and (3) might be a regression itself or
impossible on the particular machine, as the users might require drivers
or features only introduced after the latest longterm series branched
off.

In the end this led to the aforementioned "Aim to fix regression within
one week" part. It's also the reason for the "Try to resolve any
regressions introduced in the current development cycle before its
end.".

Signed-off-by: Thorsten Leemhuis <[email protected]>
CC: Linus Torvalds <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
Reviewed-by: Lukas Bulwahn <[email protected]>
---
Hi! A lot of developers are doing a good job in fixing regressions in a
reasonable time span, but I noticed it sometimes takes many weeks to get
even simple fixes for regressions merged. Most of the time this is due
to one of these factors:

* it takes a long time to get the fix ready, as some developers
apparently don't prioritize work on fixing regressions

* fully developed fixes linger in git trees of maintainers for weeks,
sometimes even without the fix being in linux-next

This afaics is especially a problem for regressions introduced in
mainline, but only found after a new version was released and a new
stable kernel series derived from it. Sometimes fixes for these
regressions are even left lying around for weeks until the next merge
window, which contributes to a huge pile of fixes getting backported to
stable and longterm releases after a merge window ended. Asking
developers to speed things up rarely helped, as people have different
opinions on how fast regression fixes need to be developed and merged
upstream.

That's why it would be a great help to my work as regression tracker if
we had some rough written down guidelines for handling regressions, as
proposed by the patch below. I'm well aware that the text sets a pretty
high bar. That's because I approached the problem primarily from the
point of view of a user, as can be seen by the patch description.

The text added by this patch likely will lead to some discussions,
that's why I submit it separately from the rest of the new documents on
regressions, which are found in patch 1/3; I also CCed Linus on this
patch and hope he states his opinion or even ACKs it. In the end I can
easily tone this down or write something totally different: that's
totally fine for me, I'm mainly interested in having some expectations
roughly documented to get everyone on the same page.

Ciao, Thorsten
---
.../admin-guide/regressions-users.rst | 12 +++
Documentation/process/regressions-devs.rst | 81 +++++++++++++++++++
2 files changed, 93 insertions(+)

diff --git a/Documentation/admin-guide/regressions-users.rst b/Documentation/admin-guide/regressions-users.rst
index d32f446e9651..78df16f113b0 100644
--- a/Documentation/admin-guide/regressions-users.rst
+++ b/Documentation/admin-guide/regressions-users.rst
@@ -214,6 +214,18 @@ your report on the radar of these people by CCing or forwarding each report to
the regressions mailing list, ideally with a "regzbot command" in your mail to
get it tracked.

+How quickly are regressions normally fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Developers should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+The answer thus depends on various factors like the impact of a regression, its
+age, or the Linux series in which it occurs. In the end though, most regressions
+should be fixed within two weeks.
+
Is it a regression, if the issue can be avoided by updating some software?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

diff --git a/Documentation/process/regressions-devs.rst b/Documentation/process/regressions-devs.rst
index 7eb66304a694..d59aa2c38d1f 100644
--- a/Documentation/process/regressions-devs.rst
+++ b/Documentation/process/regressions-devs.rst
@@ -45,6 +45,10 @@ The important bits (aka "The TL;DR")
mandated by Documentation/process/submitting-patches.rst and
:ref:`Documentation/process/5.Posting.rst <development_posting>`.

+#. Try to fix regressions quickly once the culprit has been identified; fixes
+ for most regressions should be merged within two weeks, but some need to be
+ resolved within two or three days.
+

All the details on Linux kernel regressions relevant for developers
===================================================================
@@ -117,6 +121,83 @@ or others might need to look into the fix months or years later; these links are
also crucial for tools and scripts like regzbot, as it allows them to associate
changes with reports that were mailed or submitted to bug trackers.

+Prioritize work on fixing regressions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You should fix any reported regression as quickly as possible, to provide
+affected users with a solution in a timely manner and prevent more users from
+running into the issue; nevertheless developers need to take enough time and
+care to ensure regression fixes do not cause additional damage.
+
+In the end though, developers should do their best to prevent users from
+running into situations where a regression leaves them only three options: "run
+a kernel with a regression that seriously impacts usage", "continue running an
+outdated and thus potentially insecure kernel version for more than two weeks
+after a regression's culprit was identified", and "downgrade to a still
+supported kernel series that lacks required features".
+
+How to realize this depends a lot on the situation. Here are a few rules of
+thumb for developers, in order of importance:
+
+ * Prioritize work on handling regression reports and fixing regressions over all
+ other Linux kernel work, unless the latter concerns acute security issues or
+ bugs causing data loss or damage.
+
+ * Always consider reverting the culprit commits and reapplying them later
+ together with necessary fixes, as this might be the least dangerous and
+ quickest way to fix a regression.
+
+ * Try to resolve any regressions introduced in the current development cycle
+ before its end. If you fear a fix might be too risky to apply only days
+ before a new mainline release, let Linus decide: submit the fix separately to
+ him as soon as possible with an explanation of the situation. He then can
+ make a call and postpone the release if necessary, for example if multiple
+ such changes show up in his inbox.
+
+ * Address regressions in stable, longterm, or proper mainline releases with
+ more urgency than regressions in mainline pre-releases. That changes after
+ the release of the fifth pre-release, aka "-rc5": mainline then becomes as
+ important, to ensure all the improvements and fixes are ideally tested
+ together for at least one week before Linus releases a new mainline version.
+
+ * Fix regressions within two or three days, if they are critical for some
+ reason -- for example, if the issue is likely to affect many users of the
+ kernel series in question on all or certain architectures. Note, this
+ includes mainline, as issues like compile errors otherwise might prevent many
+ testers or continuous integration systems from testing the series.
+
+ * Aim to merge regression fixes into mainline within one week after the culprit
+ was identified, if the regression was introduced in a stable/longterm release
+ or the development cycle for the latest mainline release (say v5.14). If
+ possible, try to address the issue even quicker, if the previous stable
+ series (v5.13.y) will be abandoned soon or already was stamped "End-of-Life"
+ (EOL) -- this usually happens about three to four weeks after a new mainline
+ release.
+
+ * Try to fix all other regressions within two weeks after the culprit was
+ found. Two or three additional weeks are acceptable for performance
+ regressions and other issues which are annoying, but don't prevent anyone
+ from running Linux (unless it's an issue in the current development cycle,
+ as those should ideally be addressed before the release). A few weeks in
+ total are acceptable if a regression can only be fixed with a risky change
+ and at the same time is affecting only a few users; as much time is
+ also okay if the regression is already present in the second newest longterm
+ kernel series.
+
+Note: The aforementioned time frames for resolving regressions are meant to
+include getting the fix tested, reviewed, and merged into mainline, ideally with
+the fix being in linux-next for two days.
+
+Developers need to account for this.
+
+Subsystem maintainers are expected to assist in reaching those periods by doing
+timely reviews and quick handling of accepted patches. They thus might have to
+send git-pull requests earlier or more often than usual; depending on the fix,
+it might even be acceptable to skip testing in linux-next. Especially fixes for
+regressions in stable and longterm kernels need to be handled quickly, as fixes
+need to be merged in mainline before they can be backported to older series.
+
+
More aspects regarding regressions developers should be aware of
----------------------------------------------------------------

--
2.31.1

2022-02-03 14:24:23

by Jonathan Corbet

Subject: Re: [PATCH v4 2/3] docs: regressions*rst: rules of thumb for handling regressions

Thorsten Leemhuis <[email protected]> writes:

One thing that caught my eye this time around...

> + * Address regressions in stable, longterm, or proper mainline releases with
> + more urgency than regressions in mainline pre-releases. That changes after
> + the release of the fifth pre-release, aka "-rc5": mainline then becomes as
> + important, to ensure all the improvements and fixes are ideally tested
> + together for at least one week before Linus releases a new mainline version.

Is that really what we want to suggest? I ask because (1) fixes for
stable releases need to show up in mainline first anyway, and (2) Greg
has often stated that the stable releases shouldn't be something that
most maintainers need to worry about. So if the bug is in mainline,
that has to get fixed first, and if it's something special to a stable
release, well, then the stable folks should fix it :)

> + * Fix regressions within two or three days, if they are critical for some
> + reason -- for example, if the issue is likely to affect many users of the
> + kernel series in question on all or certain architectures. Note, this
> + includes mainline, as issues like compile errors otherwise might prevent many
> + testers or continuous integration systems from testing the series.
> +
> + * Aim to merge regression fixes into mainline within one week after the culprit
> + was identified, if the regression was introduced in a stable/longterm release
> + or the development cycle for the latest mainline release (say v5.14). If
> + possible, try to address the issue even quicker, if the previous stable
> + series (v5.13.y) will be abandoned soon or already was stamped "End-of-Life"
> + (EOL) -- this usually happens about three to four weeks after a new mainline
> + release.

How much do we really think developers should worry about nearly-dead
stable kernels? We're about to tell users they shouldn't be running the
kernel anyway...

Thanks,

jon