LinuxLists.cc - [PATCH v3 0/2] docs: add a text about regressions to the Linux kernel's documentation

2022-01-25 16:45:53

Subject: [PATCH v3 0/2] docs: add a text about regressions to the Linux kernel's documentation

'We don't cause regressions' might be the first rule of Linux kernel
development, but it and other aspects of regressions nevertheless are hardly
described in the Linux kernel's documentation. The following two patches change
this by creating a document dedicated to the topic.

The second patch could easily be folded into the first one, but was kept
separate, as it might be a bit controversial. This also allows the patch
description to explain some background for this part of the document.
Additionally, ACKs and Reviewed-by tags can be collected separately this way.

v3 (this mail):
- drop RFC tag
- heavily reshuffled and slightly adjusted the text in the sections "The
important bits for people fixing regressions" and "How to add a regression to
regzbot's tracking somebody else reported?" to make them easier to grasp
- a few small fixes and improvements
- add ACK for the series from Greg (now for real)

v2/RFC (https://lore.kernel.org/linux-doc/[email protected]/):
- a lot of small fixes, most are for spelling mistakes and grammar
errors/problems pointed out in the review feedback I got so far
- add ACK for the series from Greg

v1/RFC (https://lore.kernel.org/linux-doc/[email protected]/):
- initial version

---

Hi! Merge window is over, so I'd like to move on with this. Dropped the RFC tag;
I had a small hope Linus would take a look at this (especially the second patch)
before I remove it; well, didn't work out, but Greg ACKed it, which is good
enough reason to move on with this. :-D

Ciao, Thorsten

Thorsten Leemhuis (2):
docs: add a document about regression handling
docs: regressions.rst: rules of thumb for handling regressions

Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/regressions.rst | 992 ++++++++++++++++++++++
MAINTAINERS | 1 +
3 files changed, 994 insertions(+)
create mode 100644 Documentation/admin-guide/regressions.rst

base-commit: e783362eb54cd99b2cac8b3a9aeac942e6f6ac07
--
2.31.1

2022-01-25 16:48:01

by Thorsten Leemhuis

Subject: [PATCH v3 1/2] docs: add a document about regression handling

Create a document explaining various aspects around regression handling
and tracking both for users and developers. Among other things, describe
the first rule of Linux kernel development and what it means in practice.
Also explain what a regression actually is and how to report one
properly. The text additionally provides a brief introduction to the bot
the kernel's regression tracker uses to facilitate his work. To sum
things up, provide a few quotes from Linus to show how seriously he takes
regressions.

Signed-off-by: Thorsten Leemhuis <[email protected]>
Acked-by: Greg Kroah-Hartman <[email protected]>
---
Documentation/admin-guide/index.rst | 1 +
Documentation/admin-guide/regressions.rst | 911 ++++++++++++++++++++++
MAINTAINERS | 1 +
3 files changed, 913 insertions(+)
create mode 100644 Documentation/admin-guide/regressions.rst

diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
index 1bedab498104..17157ee5a416 100644
--- a/Documentation/admin-guide/index.rst
+++ b/Documentation/admin-guide/index.rst
@@ -36,6 +36,7 @@ problems and bugs in particular.

reporting-issues
security-bugs
+ regressions
bug-hunting
bug-bisect
tainted-kernels
diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
new file mode 100644
index 000000000000..837b1658d149
--- /dev/null
+++ b/Documentation/admin-guide/regressions.rst
@@ -0,0 +1,911 @@
+.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
+..
+ If you want to distribute this text under CC-BY-4.0 only, please use 'The
+ Linux kernel developers' for author attribution and link this as source:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/regressions.rst
+..
+ Note: Only the content of this RST file as found in the Linux kernel sources
+ is available under CC-BY-4.0, as versions of this text that were processed
+ (for example by the kernel's build system) might contain content taken from
+ files which use a more restrictive license.
+
+
+Regressions
++++++++++++
+
+The first rule of Linux kernel development: '*We don't cause regressions*'.
+Linux founder and lead developer Linus Torvalds strictly enforces the rule
+himself. This document describes what it means in practice and how the Linux
+kernel's development model ensures all reported regressions are addressed.
+The text covers aspects relevant for both users and developers.
+
+The important bits for people affected by regressions
+=====================================================
+
+It's a regression if something running fine with one Linux kernel works worse or
+not at all with a newer version. Note, the newer kernel has to be compiled using
+a similar configuration -- for this and other fine print, see the section
+"What is a 'regression' and what is the 'no regressions rule'?" below.
+
+Report your regression as outlined in
+`Documentation/admin-guide/reporting-issues.rst`; it already covers all aspects
+important for regressions. The section "How do I report a regression?" below
+highlights them for convenience.
+
+The most important aspect: CC or forward the report to `the regression mailing
+list <https://lore.kernel.org/regressions/>`_ ([email protected]).
+When doing so, consider mentioning the version range where the regression
+started using a paragraph like this::
+
+ #regzbot introduced v5.13..v5.14-rc1
+
+The Linux kernel regression tracking bot 'regzbot' will then start to track the
+issue. This is in your interest, as it brings the report on the radar of people
+ensuring all regressions are acted upon in a timely manner.
+
+The important bits for people fixing regressions
+================================================
+
+When submitting fixes for regressions, add "Link:" tags pointing to all places
+where the issue was reported, as tools like the Linux kernel regression bot
+'regzbot' heavily rely on them. These pointers are also of great value when
+looking into the issue months or years later, that's why
+`Documentation/process/submitting-patches.rst` and
+:ref:`Documentation/process/5.Posting.rst <development_posting>` mandate their
+use::
+
+ Link: https://lore.kernel.org/r/[email protected]/
+ Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
+
+Let the Linux kernel's regression tracker and all other subscribers of the
+`regression mailing list <https://lore.kernel.org/regressions/>`_
+([email protected]) quickly know about newly reported regressions:
+
+ * When you receive a mailed report that did not CC the list, immediately send
+ at least a brief "Reply-all" which gets the list into the loop; also ensure
+ it's CCed on all future replies.
+
+ * If you get a report from a bug tracker, forward or bounce the report to the
+ list, unless the reporter did that already as outlined by
+ `Documentation/admin-guide/reporting-issues.rst`.
+
+Ensure regzbot tracks the issue (this is optional, but recommended):
+
+ * For mailed reports, check if the reporter included a 'regzbot command' like
+ the ``#regzbot introduced v5.13..v5.14-rc1`` described above. If not, send a
+ reply (with the regressions list in CC) with a paragraph like the following,
+ which brings regzbot into the loop by specifying the version range or commit
+ when the issue started to happen::
+
+ #regzbot ^introduced 1f2e3d4c5b6a
+
+ * When receiving a report from a bug tracker and forwarding it to the
+ regressions list (see above), include a paragraph like the following, which
+ brings regzbot into the loop by specifying the version range or commit when
+ the issue started to happen::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+ #regzbot from: Some N. Ice Human <[email protected]>
+ #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
+
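The regzbot paragraphs shown above follow a simple pattern, so they are easy to
generate when forwarding or replying to many reports. The following Python
sketch is an illustration only and not part of regzbot: the function name
`regzbot_paragraph`, its parameters, and the reporter address are made up for
this example, and it always uses the spelling with a colon, as in the
forwarded-report example above.

```python
def regzbot_paragraph(introduced, reporter=None, monitor=None,
                      parent_is_report=False):
    """Return a paragraph of regzbot commands as plain text.

    introduced       -- version range or commit id where the issue started
    reporter         -- optional 'Name <address>' of the original reporter
    monitor          -- optional URL of the bug tracker ticket
    parent_is_report -- True when replying to the mail that holds the report
    """
    # The caret tells regzbot to treat the parent mail as the report.
    command = "^introduced:" if parent_is_report else "introduced:"
    lines = [f"#regzbot {command} {introduced}"]
    if reporter:
        lines.append(f"#regzbot from: {reporter}")
    if monitor:
        lines.append(f"#regzbot monitor: {monitor}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical reporter and ticket, for illustration only.
    print(regzbot_paragraph(
        "v5.13..v5.14-rc1",
        reporter="Some N. Ice Human <some.human@example.com>",
        monitor="http://some.bugtracker.example.com/ticket?id=123456789",
    ))
```

Remember to separate the resulting text from the rest of the mail with blank
lines, as regzbot commands need to be in their own paragraph.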
+All the details on handling Linux kernel regressions
+====================================================
+
+The important basics
+--------------------
+
+What is a 'regression' and what is the 'no regressions rule'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's a regression if some application or practical use case running fine with
+one Linux kernel works worse or not at all with a newer version compiled using a
+similar configuration. The 'no regressions rule' forbids this to take place; if
+it happens by accident, developers that caused it are expected to quickly fix
+the issue.
+
+It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
+5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
+It's also a regression if a perfectly working application suddenly shows erratic
+behavior with a newer kernel version, which can be caused by changes in procfs,
+sysfs, or one of the many other interfaces Linux provides to userland software.
+But keep in mind, as mentioned earlier: 5.14 in this example needs to be built
+from a configuration similar to the one from 5.13. This can be achieved using
+``make olddefconfig``, as explained in more detail below.
+
+Note the 'practical use case' in the first sentence of this section: developers
+despite the 'no regressions' rule are free to change any aspect of the kernel
+and even APIs or ABIs to userland, as long as no existing application or use
+case breaks.
+
+Also be aware the 'no regressions' rule covers only interfaces the kernel
+provides to the userland. It thus does not apply to kernel-internal interfaces
+like the module API, which some externally developed drivers use to hook into
+the kernel.
+
+What is the goal of the 'no regressions rule'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Users should feel safe when updating kernel versions and not have to worry
+something might break. This is in the interest of the kernel developers to make
+updating attractive: they don't want users to stay on stable or longterm Linux
+series that are either abandoned or more than one and a half years old. That's in
+everybody's interest, as `those series might have known bugs, security issues,
+or other problematic aspects already fixed in later versions
+<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
+Additionally, the kernel developers want to make it simple and appealing for
+users to test the latest pre-release or regular release. That's also in
+everybody's interest, as it's a lot easier to track down and fix problems, if
+they are reported shortly after being introduced.
+
+
+How hard is the rule enforced?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Extraordinarily strictly, as can be seen by many mailing list posts from Linux
+creator and lead developer Linus Torvalds, some of which are quoted at the end
+of this document.
+
+Exceptions to this rule are extremely rare; in the past developers almost always
+turned out to be wrong when they assumed a particular situation was warranting
+an exception.
+
+How is the rule enforced?
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's the duty of the subsystem maintainers, who are watched and supported by
+Linus Torvalds for mainline or by stable/longterm tree maintainers like Greg
+Kroah-Hartman. All of them are supported by Thorsten Leemhuis: he's acting as
+'regressions tracker' for the Linux kernel and trying to ensure all regression
+reports are acted upon in a timely manner.
+
+The distributed and slightly unstructured nature of the Linux kernel's
+development makes tracking regressions hard. That's why Thorsten relies on the
+help of his Linux kernel regression tracking robot 'regzbot'. It watches mailing
+lists and git trees to semi-automatically associate regression reports with
+patch submissions and commits fixing the issue, as this provides all necessary
+insights into the fixing progress.
+
+Note, the regression tracker can only ensure no regression falls through the
+cracks, if someone tells him or his bot about every regression found. That's why
+each report needs to be CCed or forwarded to the regressions mailing list
+(ideally with a 'regzbot command' in the mail), as explained in the next
+section.
+
+How do I report a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Just report the issue as outlined in
+`Documentation/admin-guide/reporting-issues.rst`, it already describes the
+important points. The following aspects described there are especially relevant
+for regressions:
+
+ * When checking for existing reports to join, first check the `archives of the
+ Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
+ `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
+
+ * In your report, mention the last kernel version that worked fine and the
+ first broken one. Even better: try to find the commit causing the regression
+ using a bisection.
+
+ * Remember to let the Linux regressions mailing list
+ ([email protected]) know about your report:
+
+ * If you report the regression by mail, CC the regressions list.
+
+ * If you report your regression to some bug tracker, forward the filed report
+ by mail to the regressions list while CCing the maintainer and the mailing
+ list for the subsystem in question.
+
+Additionally, in both cases you should directly tell the aforementioned Linux
+kernel regression tracking bot about your report. To do that, include a
+paragraph like this in your report to tell the bot when the regression started
+to happen::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+
+In this example, v5.13 was the last version that worked, while v5.14-rc1 was the
+first broken one. The smaller the range, the better, as that makes it easier to
+find out what's wrong and who's responsible. That's why you ideally should
+perform a bisection to find the commit causing the regression (the 'culprit').
+If you did, specify it instead::
+
+ #regzbot introduced: 1f2e3d4c5d
+
+Placing such a 'regzbot command' is in your interest, as it will ensure the
+report won't fall through the cracks unnoticed. If you omit this, the Linux
+kernel's regressions tracker will take care of telling regzbot about your
+regression, as long as you send a copy to the regressions mailing list. But the
+regression tracker is just one human who sometimes has to rest or occasionally
+might even enjoy some time away from computers (as crazy as that might sound).
+Relying on this person thus will result in an unnecessary delay before the
+regression is mentioned `on the list of tracked and unresolved Linux
+kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
+weekly regression reports sent by regzbot. Such delays can result in Linus
+Torvalds being unaware of important regressions when deciding between 'continue
+development or call this finished and release the final?'.
+
+How to add a regression to regzbot's tracking somebody else reported?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It depends on the report:
+
+ * If the regression was reported by mail, reply using your mailer's 'Reply-all'
+ function with the regressions mailing list ([email protected]) in
+ CC. In your reply, include a paragraph with a regzbot command like this::
+
+ #regzbot ^introduced: v5.13..v5.14-rc1
+
+ The caret (^) before the 'introduced' tells regzbot to treat the parent mail
+ (the one you reply to) as the initial report for the regression you want to
+ see tracked; regzbot then will automatically associate any patches with this
+ regression that point to the report using 'Link:' tags.
+
+ * If the regression was reported to a bug tracker, forward it to the
+ regressions list and include a paragraph with these regzbot commands::
+
+ #regzbot introduced: v5.13..v5.14-rc1
+ #regzbot from: Some N. Ice Human <[email protected]>
+ #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
+
+ Regzbot will automatically associate patches with the report that use 'Link:'
+ tags pointing to your mail or the mentioned ticket.
+
+In both cases you can specify a commit-id instead of a version range, as the
+previous section outlines in more detail.
+
+In case you are having trouble, simply forward the report or a pointer to it
+without further ado to Thorsten Leemhuis ([email protected]), he then
+will handle things.
+
+Are really all regressions fixed?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Nearly all of them are, as long as the change causing the regression (the
+'culprit commit') is reliably identified. Some regressions can be fixed without
+this, but often it's required.
+
+Who needs to find the commit causing a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's the reporter's duty to find the culprit, but developers of the affected
+subsystem should offer advice and reasonably help where they can.
+
+How can I find the change causing a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Perform a bisection, as roughly outlined in `Documentation/admin-guide/reporting-issues.rst`
+and described in more detail by `Documentation/admin-guide/bug-bisect.rst`.
+It might sound like a lot of work, but in many cases finds the culprit
+relatively quickly. If it's hard or time-consuming to reliably reproduce the
+issue, consider teaming up with others affected by the problem to narrow down
+the search range together.
+
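To see why a bisection usually finds the culprit quickly, it helps to look at
the idea behind it: every test halves the remaining range of suspect commits.
The Python sketch below only simulates this on a strictly linear list of
commits; it is not how `git bisect` is implemented (real histories contain
merges), and the names `find_culprit` and `is_good` are invented for this
illustration. It also assumes the issue reproduces reliably.

```python
def find_culprit(commits, is_good):
    """Return (first bad commit, number of test runs).

    commits -- list ordered from oldest to newest; the first entry is
               known to be good, the last one is known to be bad
    is_good -- callable that builds/boots/tests one commit and returns
               True if the regression does not show up
    """
    low, high = 0, len(commits) - 1   # commits[low] good, commits[high] bad
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1
        if is_good(commits[mid]):
            low = mid                 # culprit is newer than mid
        else:
            high = mid                # culprit is mid or older
    return commits[high], tests


if __name__ == "__main__":
    # 1001 hypothetical commits; the regression starts with number 700.
    history = [f"c{i}" for i in range(1001)]
    culprit, tests = find_culprit(history, lambda c: int(c[1:]) < 700)
    print(culprit, tests)
```

For a range of a thousand commits this needs roughly ten test runs, which is
why even wide version ranges can be narrowed down in a reasonable time.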
+Who can I ask for advice when it comes to regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send a mail to the regressions mailing list ([email protected]) while
+CCing the Linux kernel's regression tracker ([email protected]); if the
+issue might better be dealt with in private, feel free to omit the list.
+
+
+More details about regressions relevant for reporters
+-----------------------------------------------------
+
+Does a regression need to be fixed, if it can be avoided by updating some other software?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Almost always: yes. If a developer tells you otherwise, ask the regression
+tracker for advice as outlined above.
+
+Does it qualify as a regression if a newer kernel works slower or makes the system consume more energy?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but the difference has to be significant. A five percent slow-down in a
+micro-benchmark thus is unlikely to qualify as regression, unless it also
+influences the results of a broad benchmark by more than one percent. If in
+doubt, ask for advice.
+
+Is it a regression, if an externally developed kernel module is incompatible with a newer kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+No, as the 'no regression' rule is about interfaces and services the Linux
+kernel provides to the userland. It thus does not cover building or running
+externally developed kernel modules, as they run in kernel-space and hook into
+the kernel using internal interfaces occasionally changed.
+
+How are regressions handled that are caused by a fix for security vulnerability?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In extremely rare situations security issues can't be fixed without causing
+regressions; those are then accepted, as they are the lesser evil in the end.
+Luckily this almost always can be avoided, as key developers for the affected
+area and often Linus Torvalds himself try very hard to fix security issues
+without causing regressions.
+
+If you nevertheless face such a case, check the mailing list archives if people
+tried their best to avoid the regression; if in doubt, ask for advice as
+outlined above.
+
+What happens if fixing a regression is impossible without causing another regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Sadly these things happen, but luckily not very often; if they occur, expert
+developers of the affected code area should look into the issue to find a fix
+that avoids regressions or at least limits their impact. If you run into such a
+situation, do what was outlined already for regressions caused by security
+fixes: check earlier discussions if people already tried their best and ask for
+advice if in doubt.
+
+A quick note while at it: these situations could be avoided, if people would
+regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each cycle a
+test run. This is best explained by imagining a change integrated between Linux
+v5.14 and v5.15-rc1 which causes a regression, but at the same time is a hard
+requirement for some other improvement applied for 5.15-rc1. All these changes
+often can simply be reverted and the regression thus solved, if someone finds
+and reports it before 5.15 is released. A few days or weeks later this solution
+can become impossible, as some software might have started to rely on aspects
+introduced by one of the follow-up changes: reverting all changes would then
+cause a regression for users of said software and thus is out of the question.
+
+A feature I relied on was removed months ago, but I only noticed now. Does that qualify as regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but such regressions are often hard to fix due to the aspects outlined
+in the previous section. They hence need to be dealt with on a case-by-case
+basis. This is another reason why it's in everybody's interest to regularly test
+mainline pre-releases.
+
+Does the 'no regression' rule apply if I seem to be the only person in the world that is affected by a regression?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It does, but only for practical usage: the Linux developers want to be free to
+remove support for hardware that can only be found in attics and museums.
+
+Note, sometimes regressions can't be avoided to make progress -- and the latter
+is needed to prevent Linux from stagnation. Hence, if only very few users seem
+to be affected by a regression, it for the greater good might be in their and
+everyone else's interest to not insist on the rule. Especially if there is an
+easy way to circumvent the regression somehow, for example by updating some
+software or using a kernel parameter created just for this purpose.
+
+Does the regression rule apply for code in the staging tree as well?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not according to the `help text for the configuration option covering all
+staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
+which since its early days states::
+
+ Please note that these drivers are under heavy development, may or
+ may not work, and may contain userspace interfaces that most likely
+ will be changed in the near future.
+
+The staging developers nevertheless often adhere to the 'no regressions' rule,
+but sometimes bend it to make progress. That's for example why some users had to
+deal with (often negligible) regressions when a WiFi driver from the staging
+tree was replaced by a totally different one written from scratch.
+
+Why do later versions have to be 'compiled with a similar configuration'?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Because the Linux kernel developers sometimes integrate changes known to cause
+regressions, but make them optional and disable them in the kernel's default
+configuration. This trick allows progress, as the 'no regressions' rule
+otherwise would lead to stagnation.
+
+Consider for example a new security feature blocking access to some kernel
+interfaces often abused by malware, which at the same time are required to run a
+few rarely used applications. The outlined approach makes both camps happy:
+people using these applications can leave the new security feature off, while
+everyone else can enable it without running into trouble.
+
+How to create a configuration similar to the one of an older kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Start a known-good kernel and configure the newer Linux version with ``make
+olddefconfig``. This makes the kernel's build scripts pick up the configuration
+file (the `.config` file) from the running kernel as base for the new one you
+are about to compile; afterwards they set all new configuration options to their
+default value, which should disable new features that might cause regressions.
+
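After running ``make olddefconfig`` it can be worth checking that the new
configuration really is similar to the old one. The following Python sketch
compares the text of two `.config` files; it is a simplified illustration, the
function names `parse_config` and `config_differences` are invented here, and
the option names used in the example are hypothetical.

```python
def parse_config(text):
    """Map option names to values for .config-style text.

    Options that are explicitly disabled ('# CONFIG_FOO is not set')
    are recorded with the value 'n'.
    """
    suffix = " is not set"
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def config_differences(old_text, new_text):
    """Return (changed, added, removed) between two configurations."""
    old, new = parse_config(old_text), parse_config(new_text)
    changed = {name: (old[name], new[name])
               for name in old.keys() & new.keys()
               if old[name] != new[name]}
    added = {name: new[name] for name in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    return changed, added, removed


if __name__ == "__main__":
    old_cfg = "CONFIG_EXAMPLE_A=y\n# CONFIG_EXAMPLE_B is not set\n"
    new_cfg = "CONFIG_EXAMPLE_A=y\nCONFIG_EXAMPLE_B=y\n"
    print(config_differences(old_cfg, new_cfg))
```

Options listed as changed deserve a closer look before reporting a regression;
options listed as added are the new ones that were set to their default value.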
+Can I report a regression to the upstream developers I found in a pre-compiled vanilla kernel?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You need to ensure the newer kernel was compiled with a similar configuration
+file as the older one (see above), as whoever built them might have enabled
+some known-to-be incompatible feature for the newer kernel. If in doubt,
+report this problem to the kernel's provider and ask for advice.
+
+
+More details about regressions relevant for developers
+------------------------------------------------------
+
+What should I do, if I suspect a change I'm working on might cause regressions?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Evaluate how big the risk of regressions is, for example by performing a code
+search in Linux distributions and Git forges. Also consider asking other
+developers or projects likely to be affected to evaluate or even test the
+proposed change; if problems surface, maybe some middle ground acceptable for
+all can be found.
+
+If the risk of regressions in the end seems to be relatively small, go ahead
+with the change, but let all involved parties know about the risk. Hence, make
+sure your patch description makes this aspect obvious. Once the change is
+merged, tell the Linux kernel's regression tracker and the regressions mailing
+list about the risk, so everyone has the change on the radar in case reports
+trickle in. Depending on the risk, you also might want to ask the subsystem
+maintainer to mention the issue in his mainline pull request.
+
+
+Everything developers need to know about regression tracking
+------------------------------------------------------------
+
+Do I have to use regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It's in the interest of everyone if you do, as kernel maintainers like Linus
+Torvalds partly rely on regzbot's tracking in their work -- for example when
+deciding to release a new version or extend the development phase. For this they
+need to be aware of all unfixed regressions; to do that, Linus is known to look
+into the weekly reports sent by regzbot.
+
+Do I have to tell regzbot about every regression I stumble upon?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ideally yes: we are all humans and easily forget problems when something more
+important unexpectedly comes up -- for example a bigger problem in the Linux
+kernel or something in real life that's keeping us away from keyboards for a
+while. Hence, it's best to tell regzbot about every regression, except when you
+immediately write a fix and commit it to a tree regularly merged to the affected
+kernel series.
+
+Why does the Linux kernel need a regression tracker, and why does he utilize regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Rules like 'no regressions' need someone to enforce them, otherwise they are
+broken either accidentally or on purpose. History has shown that this is true
+for the Linux kernel as well. That's why Thorsten volunteered to keep an eye on
+things.
+
+Tracking regressions completely manually has proven to be exhausting and
+demotivating, which is why earlier attempts to establish it failed after a
+while. To prevent this from happening again, Thorsten developed regzbot to
+facilitate the work, with the long term goal to automate regression tracking as
+much as possible for everyone involved.
+
+How does regression tracking work with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bot keeps track of all the reports and monitors their fixing progress. It
+tries to do that with as little overhead as possible for both reporters and
+developers.
+
+In fact, only reporters or someone helping them are burdened with an extra duty:
+they need to tell regzbot about the regression report using one of the
+``#regzbot introduced`` commands outlined above.
+
+For developers there normally is no extra work involved, they just need to do
+something that's expected from them already: add 'Link:' tags to the patch
+description pointing to all reports about the issue fixed.
+
+Thanks to these tags regzbot can associate regression reports with patches to
+fix the issue, whenever they are posted for review or applied to a git tree. The
+bot additionally watches out for replies to the report. All this data combined
+provides a good impression about the current status of the fixing process.
+
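The association via 'Link:' tags can be pictured with a small Python sketch.
It is a rough illustration of the idea and not regzbot's actual code: the
names `link_targets` and `fixes_for_report`, the commit messages, and the
message-id in the URL are all hypothetical.

```python
import re

# A 'Link:' tag sits on its own line and is followed by a single URL.
LINK_TAG = re.compile(r"^Link:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def link_targets(commit_message):
    """Return all URLs mentioned in 'Link:' tags of a commit message."""
    return LINK_TAG.findall(commit_message)


def fixes_for_report(report_url, commit_messages):
    """Return the subject lines of all commits linking to report_url."""
    wanted = report_url.rstrip("/")
    return [
        message.splitlines()[0]
        for message in commit_messages
        if any(url.rstrip("/") == wanted for url in link_targets(message))
    ]


if __name__ == "__main__":
    fix = (
        "wifi: example: restore old scan behavior\n"
        "\n"
        "The change mentioned below broke scanning for some users.\n"
        "\n"
        "Link: https://lore.kernel.org/r/hypothetical-msgid@example.com/\n"
        "Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890\n"
    )
    report = "https://bugzilla.kernel.org/show_bug.cgi?id=1234567890"
    print(fixes_for_report(report, [fix, "docs: unrelated change\n"]))
```

A patch without such tags cannot be matched this way, which is why the tags
matter so much for the tracking.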
+How to see which regressions regzbot tracks currently?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Check `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_
+for the latest info; alternatively, `search for the latest regression report
+<https://lore.kernel.org/lkml/?q=%22Linux+regressions+report%22+f%3Aregzbot>`_,
+which regzbot normally sends out once a week on Sunday evening (UTC), which is a
+few hours before Linus usually publishes new (pre-)releases.
+
+What places is regzbot monitoring?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Regzbot is watching the most important Linux mailing lists as well as the
+linux-next, mainline and stable/longterm git repositories.
+
+How to interact with regzbot?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Everyone can interact with the bot using mails containing 'regzbot commands',
+which need to be in their own paragraph (IOW: they need to be separated from the
+rest of the mail using blank lines). One such command is ``#regzbot introduced
+<version or commit>``, which adds a report to the tracking, as already described
+above; ``#regzbot ^introduced <version or commit>`` is another such command,
+which makes regzbot consider the parent mail as a report for a regression which
+it starts to track.
+
+Once one of those two commands has been utilized, other regzbot commands can be
+used. You can write them below one of the `introduced` commands or in replies to
+the mail that used one of them or itself is a reply to that mail:
+
+ * Set or update the title::
+
+ #regzbot title: foo
+
+ * Link to a related discussion (for example the posting of a patch to fix the
+ issue) and monitor it::
+
+ #regzbot monitor: https://lore.kernel.org/all/[email protected]/
+
+ Monitoring only works for lore.kernel.org; regzbot will consider all messages
+ in that thread as related to the fixing process.
+
+ * Point to a place with further details, like a bug tracker or a related
+ mailing list post::
+
+ #regzbot link: https://bugzilla.kernel.org/show_bug.cgi?id=123456789
+
+ * Mark a regression as fixed by a commit that is heading upstream or already
+ landed::
+
+ #regzbot fixed-by: 1f2e3d4c5d
+
+ * Mark a regression as a duplicate of another one already tracked by regzbot::
+
+ #regzbot dup-of: https://lore.kernel.org/all/[email protected]/
+
+ * Mark a regression as invalid::
+
+ #regzbot invalid: wasn't a regression, problem has always existed
+
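The requirement that commands live in their own paragraph can be illustrated
with a short Python sketch. It shows one possible reading of that rule (a
paragraph counts only if every line in it is a command) and is not regzbot's
real parser; the name `regzbot_commands` and the sample mail are made up for
this example.

```python
import re

# '#regzbot', a command word (optionally with a leading caret), an
# optional colon, and the rest of the line as the argument.
COMMAND = re.compile(r"^#regzbot\s+(\^?[a-z-]+):?\s*(.*)$")


def regzbot_commands(mail_body):
    """Return (command, argument) pairs found in command-only paragraphs."""
    commands = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [line.strip() for line in paragraph.splitlines()
                 if line.strip()]
        matches = [COMMAND.match(line) for line in lines]
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2)) for m in matches)
    return commands


if __name__ == "__main__":
    mail = (
        "Hi, forwarding a report about broken WiFi.\n"
        "\n"
        "#regzbot ^introduced: v5.13..v5.14-rc1\n"
        "#regzbot title: foo\n"
        "\n"
        "Thanks!\n"
    )
    print(regzbot_commands(mail))
```

In this sketch a command line glued to ordinary text without a separating
blank line is ignored, which mirrors the advice above to always surround
commands with blank lines.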
+Is there more to tell about regzbot and its commands?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+More detailed and up-to-date information about the Linux
+kernel's regression tracking bot can be found on its
+`project page <https://gitlab.com/knurd42/regzbot>`_, which among other things
+contains a `getting started guide <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/getting_started.md>`_
+and `reference documentation <https://gitlab.com/knurd42/regzbot/-/blob/main/docs/reference.md>`_,
+both of which cover more details than the section above.
+
+
+Quotes from Linus about regression
+----------------------------------
+
+Find below a few real-life examples of how Linus Torvalds expects regressions to
+be handled:
+
+ * From `2017-10-26 (1/2)
+ <https://lore.kernel.org/lkml/CA+55aFwiiQYJ+YoLKCXjN_beDVfu38mg=Ggg5LFOcqHE8Qi7Zw@mail.gmail.com/>`_::
+
+ If you break existing user space setups THAT IS A REGRESSION.
+
+ It's not ok to say "but we'll fix the user space setup".
+
+ Really. NOT OK.
+
+ [...]
+
+ The first rule is:
+
+ - we don't cause regressions
+
+ and the corollary is that when regressions *do* occur, we admit to
+ them and fix them, instead of blaming user space.
+
+ The fact that you have apparently been denying the regression now for
+ three weeks means that I will revert, and I will stop pulling apparmor
+ requests until the people involved understand how kernel development
+ is done.
+
+ * From `2017-10-26 (2/2)
+ <https://lore.kernel.org/lkml/CA+55aFxW7NMAMvYhkvz1UPbUTUJewRt6Yb51QAx5RtrWOwjebg@mail.gmail.com/>`_::
+
+ People should basically always feel like they can update their kernel
+ and simply not have to worry about it.
+
+ I refuse to introduce "you can only update the kernel if you also
+ update that other program" kind of limitations. If the kernel used to
+ work for you, the rule is that it continues to work for you.
+
+ There have been exceptions, but they are few and far between, and they
+ generally have some major and fundamental reasons for having happened,
+ that were basically entirely unavoidable, and people _tried_hard_ to
+ avoid them. Maybe we can't practically support the hardware any more
+ after it is decades old and nobody uses it with modern kernels any
+ more. Maybe there's a serious security issue with how we did things,
+ and people actually depended on that fundamentally broken model. Maybe
+ there was some fundamental other breakage that just _had_ to have a
+ flag day for very core and fundamental reasons.
+
+ And notice that this is very much about *breaking* peoples environments.
+
+ Behavioral changes happen, and maybe we don't even support some
+ feature any more. There's a number of fields in /proc/<pid>/stat that
+ are printed out as zeroes, simply because they don't even *exist* in
+ the kernel any more, or because showing them was a mistake (typically
+ an information leak). But the numbers got replaced by zeroes, so that
+ the code that used to parse the fields still works. The user might not
+ see everything they used to see, and so behavior is clearly different,
+ but things still _work_, even if they might no longer show sensitive
+ (or no longer relevant) information.
+
+ But if something actually breaks, then the change must get fixed or
+ reverted. And it gets fixed in the *kernel*. Not by saying "well, fix
+ your user space then". It was a kernel change that exposed the
+ problem, it needs to be the kernel that corrects for it, because we
+ have a "upgrade in place" model. We don't have a "upgrade with new
+ user space".
+
+ And I seriously will refuse to take code from people who do not
+ understand and honor this very simple rule.
+
+ This rule is also not going to change.
+
+ And yes, I realize that the kernel is "special" in this respect. I'm
+ proud of it.
+
+ I have seen, and can point to, lots of projects that go "We need to
+ break that use case in order to make progress" or "you relied on
+ undocumented behavior, it sucks to be you" or "there's a better way to
+ do what you want to do, and you have to change to that new better
+ way", and I simply don't think that's acceptable outside of very early
+ alpha releases that have experimental users that know what they signed
+ up for. The kernel hasn't been in that situation for the last two
+ decades.
+
+ We do API breakage _inside_ the kernel all the time. We will fix
+ internal problems by saying "you now need to do XYZ", but then it's
+ about internal kernel API's, and the people who do that then also
+ obviously have to fix up all the in-kernel users of that API. Nobody
+ can say "I now broke the API you used, and now _you_ need to fix it
+ up". Whoever broke something gets to fix it too.
+
+ And we simply do not break user space.
+
+ * From `2020-05-21
+ <https://lore.kernel.org/all/CAHk-=wiVi7mSrsMP=fLXQrXK_UimybW=ziLOwSzFTtoXUacWVQ@mail.gmail.com/>`_::
+
+ The rules about regressions have never been about any kind of
+ documented behavior, or where the code lives.
+
+ The rules about regressions are always about "breaks user workflow".
+
+ Users are literally the _only_ thing that matters.
+
+ No amount of "you shouldn't have used this" or "that behavior was
+ undefined, it's your own fault your app broke" or "that used to work
+ simply because of a kernel bug" is at all relevant.
+
+ Now, reality is never entirely black-and-white. So we've had things
+ like "serious security issue" etc that just forces us to make changes
+ that may break user space. But even then the rule is that we don't
+ really have other options that would allow things to continue.
+
+ And obviously, if users take years to even notice that something
+ broke, or if we have sane ways to work around the breakage that
+ doesn't make for too much trouble for users (ie "ok, there are a
+ handful of users, and they can use a kernel command line to work
+ around it" kind of things) we've also been a bit less strict.
+
+ But no, "that was documented to be broken" (whether it's because the
+ code was in staging or because the man-page said something else) is
+ irrelevant. If staging code is so useful that people end up using it,
+ that means that it's basically regular kernel code with a flag saying
+ "please clean this up".
+
+ The other side of the coin is that people who talk about "API
+ stability" are entirely wrong. API's don't matter either. You can make
+ any changes to an API you like - as long as nobody notices.
+
+ Again, the regression rule is not about documentation, not about
+ API's, and not about the phase of the moon.
+
+ It's entirely about "we caused problems for user space that used to work".
+
+ * From `2012-07-06 <https://lore.kernel.org/all/CA+55aFwnLJ+0sjx92EGREGTWOx84wwKaraSzpTNJwPVV8edw8g@mail.gmail.com/>`_::
+
+ > Now this got me wondering if Debian _unstable_ actually qualifies as a
+ > standard distro userspace.
+
+ Oh, if the kernel breaks some standard user space, that counts. Tons
+ of people run Debian unstable (and from my limited interactions with
+ it, for damn good reasons: -stable tends to run so old versions of
+ everything that you have to sometimes deal with cuneiform writing when
+ using it)
+
+ * From `2017-11-05
+ <https://lore.kernel.org/all/CA+55aFzUvbGjD8nQ-+3oiMBx14c_6zOj2n7KLN3UsJ-qsd4Dcw@mail.gmail.com/>`_::
+
+ And our regression rule has never been "behavior doesn't change".
+ That would mean that we could never make any changes at all.
+
+ For example, we do things like add new error handling etc all the
+ time, which we then sometimes even add tests for in our kselftest
+ directory.
+
+ So clearly behavior changes all the time and we don't consider that a
+ regression per se.
+
+ The rule for a regression for the kernel is that some real user
+ workflow breaks. Not some test. Not a "look, I used to be able to do
+ X, now I can't".
+
+ * From `2018-08-03
+ <https://lore.kernel.org/all/CA+55aFwWZX=CXmWDTkDGb36kf12XmTehmQjbiMPCqCRG2hi9kw@mail.gmail.com/>`_::
+
+ YOU ARE MISSING THE #1 KERNEL RULE.
+
+ We do not regress, and we do not regress exactly because your are 100% wrong.
+
+ And the reason you state for your opinion is in fact exactly *WHY* you
+ are wrong.
+
+ Your "good reasons" are pure and utter garbage.
+
+ The whole point of "we do not regress" is so that people can upgrade
+ the kernel and never have to worry about it.
+
+ > Kernel had a bug which has been fixed
+
+ That is *ENTIRELY* immaterial.
+
+ Guys, whether something was buggy or not DOES NOT MATTER.
+
+ Why?
+
+ Bugs happen. That's a fact of life. Arguing that "we had to break
+ something because we were fixing a bug" is completely insane. We fix
+ tens of bugs every single day, thinking that "fixing a bug" means that
+ we can break something is simply NOT TRUE.
+
+ So bugs simply aren't even relevant to the discussion. They happen,
+ they get found, they get fixed, and it has nothing to do with "we
+ break users".
+
+ Because the only thing that matters IS THE USER.
+
+ How hard is that to understand?
+
+ Anybody who uses "but it was buggy" as an argument is entirely missing
+ the point. As far as the USER was concerned, it wasn't buggy - it
+ worked for him/her.
+
+ Maybe it worked *because* the user had taken the bug into account,
+ maybe it worked because the user didn't notice - again, it doesn't
+ matter. It worked for the user.
+
+ Breaking a user workflow for a "bug" is absolutely the WORST reason
+ for breakage you can imagine.
+
+ It's basically saying "I took something that worked, and I broke it,
+ but now it's better". Do you not see how f*cking insane that statement
+ is?
+
+ And without users, your program is not a program, it's a pointless
+ piece of code that you might as well throw away.
+
+ Seriously. This is *why* the #1 rule for kernel development is "we
+ don't break users". Because "I fixed a bug" is absolutely NOT AN
+ ARGUMENT if that bug fix broke a user setup. You actually introduced a
+ MUCH BIGGER bug by "fixing" something that the user clearly didn't
+ even care about.
+
+ And dammit, we upgrade the kernel ALL THE TIME without upgrading any
+ other programs at all. It is absolutely required, because flag-days
+ and dependencies are horribly bad.
+
+ And it is also required simply because I as a kernel developer do not
+ upgrade random other tools that I don't even care about as I develop
+ the kernel, and I want any of my users to feel safe doing the same
+ time.
+
+ So no. Your rule is COMPLETELY wrong. If you cannot upgrade a kernel
+ without upgrading some other random binary, then we have a problem.
+
+ * From `2021-06-05
+ <https://lore.kernel.org/all/CAHk-=wiUVqHN76YUwhkjZzwTdjMMJf_zN4+u7vEJjmEGh3recw@mail.gmail.com/>`_::
+
+ THERE ARE NO VALID ARGUMENTS FOR REGRESSIONS.
+
+ Honestly, security people need to understand that "not working" is not
+ a success case of security. It's a failure case.
+
+ Yes, "not working" may be secure. But security in that case is *pointless*.
+
+ * From `2021-07-30
+ <https://lore.kernel.org/lkml/CAHk-=witY33b-vqqp=ApqyoFDpx9p+n4PwG9N-TvF8bq7-tsHw@mail.gmail.com/>`_::
+
+ But we have the policy that regressions aren't about documentation or
+ even sane behavior.
+
+ Regressions are about whether a user application broke in a noticeable way.
+
+ * From `2011-05-06 (1/3)
+ <https://lore.kernel.org/all/[email protected]/>`_::
+
+ Binary compatibility is more important.
+
+ And if binaries don't use the interface to parse the format (or just
+ parse it wrongly - see the fairly recent example of adding uuid's to
+ /proc/self/mountinfo), then it's a regression.
+
+ And regressions get reverted, unless there are security issues or
+ similar that makes us go "Oh Gods, we really have to break things".
+
+ I don't understand why this simple logic is so hard for some kernel
+ developers to understand. Reality matters. Your personal wishes matter
+ NOT AT ALL.
+
+ If you made an interface that can be used without parsing the
+ interface description, then we're stuck with the interface. Theory
+ simply doesn't matter.
+
+ You could help fix the tools, and try to avoid the compatibility
+ issues that way. There aren't that many of them.
+
+ * From `2011-05-06 (2/3)
+ <https://lore.kernel.org/all/[email protected]/>`_::
+
+ it's clearly NOT an internal tracepoint. By definition. It's being
+ used by powertop.
+
+ * From `2011-05-06 (3/3)
+ <https://lore.kernel.org/all/[email protected]/>`_::
+
+ We have programs that use that ABI and thus it's a regression if they break.
+
+ * From `2006-02-21
+ <https://lore.kernel.org/lkml/[email protected]/>`_::
+
+ The fact is, if changing the kernel breaks user-space, it's a regression.
+ IT DOES NOT MATTER WHETHER IT'S IN /sbin/hotplug OR ANYTHING ELSE. If it
+ was installed by a distribution, it's user-space. If it got installed by
+ "vmlinux", it's the kernel.
+
+ The only piece of user-space code we ship with the kernel is the system
+ call trampoline etc that the kernel sets up. THOSE interfaces we can
+ really change, because it changes with the kernel.
+
+ * From `2019-09-15
+ <https://lore.kernel.org/lkml/CAHk-=wiP4K8DRJWsCo=20hn_6054xBamGKF2kPgUzpB5aMaofA@mail.gmail.com/>`_::
+
+ One _particularly_ last-minute revert is the top-most commit (ignoring
+ the version change itself) done just before the release, and while
+ it's very annoying, it's perhaps also instructive.
+
+ What's instructive about it is that I reverted a commit that wasn't
+ actually buggy. In fact, it was doing exactly what it set out to do,
+ and did it very well. In fact it did it _so_ well that the much
+ improved IO patterns it caused then ended up revealing a user-visible
+ regression due to a real bug in a completely unrelated area.
+
+ The actual details of that regression are not the reason I point that
+ revert out as instructive, though. It's more that it's an instructive
+ example of what counts as a regression, and what the whole "no
+ regressions" kernel rule means. The reverted commit didn't change any
+ API's, and it didn't introduce any new bugs. But it ended up exposing
+ another problem, and as such caused a kernel upgrade to fail for a
+ user. So it got reverted.
+
+ The point here being that we revert based on user-reported _behavior_,
+ not based on some "it changes the ABI" or "it caused a bug" concept.
+ The problem was really pre-existing, and it just didn't happen to
+ trigger before. The better IO patterns introduced by the change just
+ happened to expose an old bug, and people had grown to depend on the
+ previously benign behavior of that old issue.
+
+ And never fear, we'll re-introduce the fix that improved on the IO
+ patterns once we've decided just how to handle the fact that we had a
+ bad interaction with an interface that people had then just happened
+ to rely on incidental behavior for before. It's just that we'll have
+ to hash through how to do that (there are no less than three different
+ patches by three different developers being discussed, and there might
+ be more coming...). In the meantime, I reverted the thing that exposed
+ the problem to users for this release, even if I hope it will be
+ re-introduced (perhaps even backported as a stable patch) once we have
+ consensus about the issue it exposed.
+
+ Take-away from the whole thing: it's not about whether you change the
+ kernel-userspace ABI, or fix a bug, or about whether the old code
+ "should never have worked in the first place". It's about whether
+ something breaks existing users' workflow.
+
+ Anyway, that was my little aside on the whole regression thing. Since
+ it's that "first rule of kernel programming", I felt it is perhaps
+ worth just bringing it up every once in a while.
diff --git a/MAINTAINERS b/MAINTAINERS
index ea3e6c914384..03bb629302cb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10438,6 +10438,7 @@ KERNEL REGRESSIONS
M: Thorsten Leemhuis <[email protected]>
L: [email protected]
S: Supported
+F: Documentation/admin-guide/regressions.rst

KERNEL SELFTEST FRAMEWORK
M: Shuah Khan <[email protected]>
--
2.31.1
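
The "How to interact with regzbot?" section in the patch above says that
regzbot commands must sit in their own paragraph, separated from the rest of
the mail by blank lines. A minimal Python sketch of that rule follows. It is
not regzbot's actual parser: the function name extract_regzbot_commands and
the regular expression are assumptions made for illustration only, and the
colon after the command name is treated as optional because the text shows
both spellings.

```python
import re

# One command line, e.g. "#regzbot ^introduced: v5.13..v5.14-rc1".
# Assumption: the colon after the command name is optional, since the
# document shows both "introduced v5.13.." and "introduced: v5.13..".
COMMAND_RE = re.compile(r"^#regzbot\s+(\^?[a-z-]+):?\s*(.*)$")


def extract_regzbot_commands(mail_body):
    """Return (command, argument) pairs found in a mail body.

    Hypothetical helper: only paragraphs that consist entirely of
    '#regzbot' lines are honored; command-like lines mixed into other
    text are ignored.
    """
    commands = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", mail_body.strip()):
        lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
        matches = [COMMAND_RE.match(ln) for ln in lines]
        if lines and all(matches):
            commands.extend((m.group(1), m.group(2).strip()) for m in matches)
    return commands
```

Called on a mail whose body contains a standalone paragraph such as
"#regzbot ^introduced: v5.13..v5.14-rc1", the helper returns that command
together with its argument, while a "#regzbot" line glued to ordinary text is
skipped. How the real bot treats such mixed paragraphs is not stated in the
document, so that behavior is a guess.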

2022-01-26 13:35:35

by Jonathan Corbet

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

Thorsten Leemhuis <[email protected]> writes:

> Create a document explaining various aspects around regression handling
> and tracking both for users and developers. Among others describe the
> first rule of Linux kernel development and what it means in practice.
> Also explain what a regression actually is and how to report one
> properly. The text additionally provides a brief introduction to the bot
> the kernel's regression tracker uses to facilitate his work. To sum
> things up, provide a few quotes from Linus to show how serious he takes
> regressions.
>
> Signed-off-by: Thorsten Leemhuis <[email protected]>
> Acked-by: Greg Kroah-Hartman <[email protected]>
> ---
> Documentation/admin-guide/index.rst | 1 +
> Documentation/admin-guide/regressions.rst | 911 ++++++++++++++++++++++
> MAINTAINERS | 1 +
> 3 files changed, 913 insertions(+)
> create mode 100644 Documentation/admin-guide/regressions.rst
>
> diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst
> index 1bedab498104..17157ee5a416 100644
> --- a/Documentation/admin-guide/index.rst
> +++ b/Documentation/admin-guide/index.rst
> @@ -36,6 +36,7 @@ problems and bugs in particular.
>
> reporting-issues
> security-bugs
> + regressions
> bug-hunting
> bug-bisect
> tainted-kernels
> diff --git a/Documentation/admin-guide/regressions.rst b/Documentation/admin-guide/regressions.rst
> new file mode 100644
> index 000000000000..837b1658d149
> --- /dev/null
> +++ b/Documentation/admin-guide/regressions.rst
> @@ -0,0 +1,911 @@
> +.. SPDX-License-Identifier: (GPL-2.0+ OR CC-BY-4.0)
> +..
> + If you want to distribute this text under CC-BY-4.0 only, please use 'The
> + Linux kernel developers' for author attribution and link this as source:
> + https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/Documentation/admin-guide/regressions.rst
> +..
> + Note: Only the content of this RST file as found in the Linux kernel sources
> + is available under CC-BY-4.0, as versions of this text that were processed
> + (for example by the kernel's build system) might contain content taken from
> + files which use a more restrictive license.

I wonder if we could put this boilerplate at the bottom, with a single
"see the bottom for redistribution information" line here? Most readers
won't care about this stuff and shouldn't have to slog through it to get
to what they want to read.

> +Regressions
> ++++++++++++
> +
> +The first rule of Linux kernel development: '*We don't cause regressions*'.
> +Linux founder and lead developer Linus Torvalds strictly enforces the rule
> +himself. This document describes what it means in practice and how the Linux
> +kernel's development model ensures all reported regressions are addressed.
> +The text covers aspects relevant for both users and developers.

So that last line makes me a bit nervous; I've really been trying to get
us to organize our documentation for the readers. So, without having
read what follows in depth yet, I wonder if we don't really want two
different documents: a developer document (which maybe belongs in
Documentation/process) and a user document?

> +The important bits for people affected by regressions
> +=====================================================
> +
> +It's a regression if something running fine with one Linux kernel works worse or
> +not at all with a newer version. Note, the newer kernel has to be compiled using
> +a similar configuration -- for this and other fine print, check out below
> +section "What is a 'regression' and what is the 'no regressions rule'?".

Can we be consistent with either single or double quotes? I'd suggest
"double quotes" but won't make a fuss about that.

> +Report your regression as outlined in
> +`Documentation/admin-guide/reporting-issues.rst`, it already covers all aspects

No need to quote the file name.

> +important for regressions. Below section "How do I report a regression?"
> +highlights them for convenience.

The "How do I report a regression?" section, below, highlights...

> +The most important aspect: CC or forward the report to `the regression mailing
> +list <https://lore.kernel.org/regressions/>`_ ([email protected]).

Is that really *the* most important thing? :)

> +When doing so, consider mentioning the version range where the regression
> +started using a paragraph like this::
> +
> + #regzbot introduced v5.13..v5.14-rc1
> +
> +The Linux kernel regression tracking bot 'regzbot' will then start to track the
> +issue. This is in your interest, as it brings the report on the radar of people
> +ensuring all regressions are acted upon in a timely manner.
> +
> +The important bits for people fixing regressions
> +================================================
> +
> +When submitting fixes for regressions, add "Link:" tags pointing to all places
> +where the issue was reported, as tools like the Linux kernel regression bot
> +'regzbot' heavily rely on them. These pointers are also of great value when
> +looking into the issue months or years later, that's why
> +`Documentation/process/submitting-patches.rst` and
> +:ref:`Documentation/process/5.Posting.rst <development_posting>` mandate their
> +use::
> +
> + Link: https://lore.kernel.org/r/[email protected]/
> + Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890

What is this literal block here for?

> +Let the Linux kernel's regression tracker and all other subscribers of the
> +`regression mailing list <https://lore.kernel.org/regressions/>`_
> +([email protected]) quickly know about newly reported regressions:

You've already linked this above, not sure it's needed again.

> + * When you receive a mailed report that did not CC the list, immediately send
> + at least a brief "Reply-all" which get the list into the loop; also ensure
> + it's CCed on all future replies.
> +
> + * If you get a report from a bug tracker, forward or bounce the report to the
> + list, unless the reporter did that already as outlined by
> + `Documentation/admin-guide/reporting-issues.rst`.
> +
> +Ensure regzbot tracks the issue (this is optional, but recommended):
> +
> + * For mailed reports, check if the reporter included a 'regzbot command' like
> + the ``#regzbot introduced v5.13..v5.14-rc1`` described above. If not, send a
> + reply (with the regressions list in CC) with a paragraph like the following,
> + which brings regzbot into the loop by specifying the version range or commit
> + when the issue started to happen::
> +
> + #regzbot ^introduced 1f2e3d4c5b6a
> +
> + * When receiving a report from a bug tracker and forwarding it to the
> + regressions list (see above), include a paragraph like the following, which
> + brings regzbot into the loop by specifying the version range or commit when
> + the issue started to happen::
> +
> + #regzbot introduced: v5.13..v5.14-rc1
> + #regzbot from: Some N. Ice Human <[email protected]>
> + #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
> +
> +All the details on handling Linux kernel regressions
> +====================================================
> +
> +The important basics
> +--------------------
> +
> +What is a 'regression' and what is the 'no regressions rule'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's a regression if some application or practical use case running fine with
> +one Linux kernel works worse or not at all with a newer version compiled using a
> +similar configuration. The 'no regressions rule' forbids this to take place; if

So this is something you already said above. This document is quite
long, we're asking a lot for people to read through the whole thing.
Repeating yourself and making it longer may not help that cause.

> +it happens by accident, developers that caused it are expected to quickly fix
> +the issue.
> +
> +It thus is a regression when a WiFi driver from Linux 5.13 works fine, but with
> +5.14 doesn't work at all, works significantly slower, or misbehaves somehow.
> +It's also a regression if a perfectly working application suddenly shows erratic
> +behavior with a newer kernel version, which can be caused by changes in procfs,
> +sysfs, or one of the many other interfaces Linux provides to userland software.
> +But keep in mind, as mentioned earlier: 5.14 in this example needs to be built
> +from a configuration similar to the one from 5.13. This can be achieved using
> +``make olddefconfig``, as explained in more detail below.
> +
> +Note the 'practical use case' in the first sentence of this section: developers
> +despite the 'no regressions' rule are free to change any aspect of the kernel
> +and even APIs or ABIs to userland, as long as no existing application or use
> +case breaks.
> +
> +Also be aware the 'no regressions' rule covers only interfaces the kernel
> +provides to the userland. It thus does not apply to kernel-internal interfaces
> +like the module API, which some externally developed drivers use to hook into
> +the kernel.
> +
> +What is the goal of the 'no regressions rule'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Users should feel safe when updating kernel versions and not have to worry
> +something might break. This is in the interest of the kernel developers to make
> +updating attractive: they don't want users to stay on stable or longterm Linux
> +series that are either abandoned or more than one and a half year old. That's in
> +everybody's interest, as `those series might have known bugs, security issues,
> +or other problematic aspects already fixed in later versions
> +<http://www.kroah.com/log/blog/2018/08/24/what-stable-kernel-should-i-use/>`_.
> +Additionally, the kernel developers want to make it simple and appealing for
> +users to test the latest pre-release or regular release. That's also in
> +everybody's interest, as it's a lot easier to track down and fix problems, if
> +they are reported shortly after being introduced.
> +
> +
> +How hard is the rule enforced?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Extraordinarily strict, as can be seen by many mailing list posts from Linux
> +creator and lead developer Linus Torvalds, some of which are quoted at the end
> +of this document.
> +
> +Exceptions to this rule are extremely rare; in the past developers almost always
> +turned out to be wrong when they assumed a particular situation was warranting
> +an exception.
> +
> +How is the rule enforced?
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's the duty of the subsystem maintainers, which are watched and supported by
> +Linus Torvalds for mainline or stable/longterm tree maintainers like Greg
> +Kroah-Hartman. All of them are supported by Thorsten Leemhuis: he's acting as
> +'regressions tracker' for the Linux kernel and trying to ensure all regression
> +reports are acted upon in a timely manner.
> +
> +The distributed and slightly unstructured nature of the Linux kernel's
> +development makes tracking regressions hard. That's why Thorsten relies on the
> +help of his Linux kernel regression tracking robot 'regzbot'. It watches mailing
> +lists and git trees to semi-automatically associate regression reports with
> +patch submissions and commits fixing the issue, as this provides all necessary
> +insights into the fixing progress.
> +
> +Note, the regression tracker can only ensure no regression falls through the
> +cracks, if someone tells him or his bot about every regression found. That's why
> +each report needs to be CCed or forwarded to the regressions mailing list
> +(ideally with a 'regzbot command' in the mail), as explained in the next
> +section.

So this isn't really enforcement information, it's tracking, which is
different... If you really want to talk about enforcement, you might
mention that offending patches can be reverted if they are not fixed in
a timely manner.

> +How do I report a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Just report the issue as outlined in
> +`Documentation/admin-guide/reporting-issues.rst`, it already describes the
> +important points. The following aspects described there are especially relevant
> +for regressions:
> +
> + * When checking for existing reports to join, first check the `archives of the
> + Linux regressions mailing list <https://lore.kernel.org/regressions/>`_ and
> + `regzbot's web-interface <https://linux-regtracking.leemhuis.info/regzbot/>`_.
> +
> + * In your report, mention the last kernel version that worked fine and the
> + first broken one. Even better: try to find the commit causing the regression
> + using a bisection.
> +
> + * Remember to let the Linux regressions mailing list
> + ([email protected]) know about your report:
> +
> + * If you report the regression by mail, CC the regressions list.
> +
> + * If you report your regression to some bug tracker, forward the filed report
> + by mail to the regressions list while CCing the maintainer and the mailing
> + list for the subsystem in question.
> +
> +Additionally, you in both cases should directly tell the aforementioned Linux
> +kernel regression tracking bot about your report. To do that, include a
> +paragraph like this in your report to tell the bot when the regression started
> +to happen::
> +
> + #regzbot introduced: v5.13..v5.14-rc1
> +
> +In this example, v5.13 was the last version that worked, while v5.14-rc1 was the
> +first broken one. The smaller the range, the better, as that makes it easier to
> +find out what's wrong and who's responsible. That's why you ideally should
> +perform a bisection to find the commit causing the regression (the 'culprit').
> +If you did, specify it instead::
> +
> + #regzbot introduced: 1f2e3d4c5d
> +
> +Placing such a 'regzbot command' is in your interest, as it will ensure the
> +report won't fall through the cracks unnoticed. If you omit this, the Linux
> +kernel's regressions tracker will take care of telling regzbot about your
> +regression, as long as you send a copy to the regressions mailing lists. But the
> +regression tracker is just one human which sometimes has to rest or occasionally
> +might even enjoy some time away from computers (as crazy as that might sound).

Naw, we don't allow that, sorry :)

> +Relying on this person thus will result in an unnecessary delay before the
> +regressions becomes mentioned `on the list of tracked and unresolved Linux
> +kernel regressions <https://linux-regtracking.leemhuis.info/regzbot/>`_ and the
> +weekly regression reports sent by regzbot. Such delays can result in Linus
> +Torvalds being unaware of important regressions when deciding between 'continue
> +development or call this finished and release the final?'.
> +
> +How to add a regression to regzbot's tracking somebody else reported?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It depends on the report:
> +
> + * If the regression was reported by mail, reply using your mailers 'Reply-all'
> + function with the regressions mailing list ([email protected]) in
> + CC. In your reply, include a paragraph with a regzbot command like this::
> +
> + #regzbot ^introduced: v5.13..v5.14-rc1
> +
> + The caret (^) before the 'introduced' tells regzbot to treat the parent mail
> + (the one you reply to) as the initial report for the regression you want to
> + see tracked; regzbot then will automatically associate any patches with this
> + regression that point to the report using 'Link:' tags.
> +
> + * If the regressions was reported to a bug tracker, forward it to the
> + regressions list and include a paragraph with these regzbot commands::
> +
> + #regzbot introduced: v5.13..v5.14-rc1
> + #regzbot from: Some N. Ice Human <[email protected]>
> + #regzbot monitor: http://some.bugtracker.example.com/ticket?id=123456789
> +
> + Regzbot will automatically associate patches with the report that use 'Link:'
> + tags pointing to your mail or the mentioned ticket.
> +
> +In both cases you can specify a commit-id instead of a version range, as the
> +previous section outlines in more detail.
> +
> +In case you are having trouble, simply forward the report or a pointer to it
> +without further ado to Thorsten Leemhuis ([email protected]), he then
> +will handle things.
> +
> +Are really all regressions fixed?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Nearly all of them are, as long as the change causing the regression (the
> +'culprit commit') is reliably identified. Some regressions can be fixed without
> +this, but often it's required.
> +
> +Who needs to find the commit causing a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's the reporter's duty to find the culprit, but developers of the affected
> +subsystem should offer advice and reasonably help where they can.

Is it really our policy that *reporters* need to find the offending
commit? That's certainly not my view of things, anyway?

> +How can I find the change causing a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Perform a bisection, as roughly outlined in `Documentation/admin-guide/reporting-issues.rst`
> +and described in more detail by `Documentation/admin-guide/bug-bisect.rst`.
> +It might sound like a lot of work, but in many cases finds the culprit
> +relatively quickly. If it's hard or time-consuming to reliably reproduce the
> +issue, consider teaming up with others affected by the problem to narrow down
> +the search range together.
> +
> +Who can I ask for advice when it comes to regressions?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Send a mail to the regressions mailing list ([email protected]) while
> +CCing the Linux kernel's regression tracker ([email protected]); if the
> +issue might better be dealt with in private, feel free to omit the list.
> +
> +
> +More details about regressions relevant for reporters
> +-----------------------------------------------------
> +
> +Does a regression need to be fixed, if it can be avoided by updating some other software?

It would be nice to keep to 80 columns if possible. These long section
headings aren't great for readability.

> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Almost always: yes. If a developer tells you otherwise, ask the regression
> +tracker for advice as outlined above.
> +
> +Does it qualify as a regression if a newer kernel works slower or makes the system consume more energy?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but the difference has to be significant. A five percent slow-down in a
> +micro-benchmark thus is unlikely to qualify as regression, unless it also
> +influences the results of a broad benchmark by more than one percent. If in
> +doubt, ask for advice.
> +
> +Is it a regression, if an externally developed kernel module is incompatible with a newer kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +No, as the 'no regression' rule is about interfaces and services the Linux
> +kernel provides to the userland. It thus does not cover building or running
> +externally developed kernel modules, as they run in kernel-space and hook into
> +the kernel using internal interfaces occasionally changed.
> +
> +How are regressions handled that are caused by a fix for security vulnerability?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +In extremely rare situations security issues can't be fixed without causing
> +regressions; those are given way, as they are the lesser evil in the end.
> +Luckily this almost always can be avoided, as key developers for the affected
> +area and often Linus Torvalds himself try very hard to fix security issues
> +without causing regressions.
> +
> +If you nevertheless face such a case, check the mailing list archives if people
> +tried their best to avoid the regression; if in doubt, ask for advice as
> +outlined above.
> +
> +What happens if fixing a regression is impossible without causing another regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Sadly these things happen, but luckily not very often; if they occur, expert
> +developers of the affected code area should look into the issue to find a fix
> +that avoids regressions or at least their impact. If you run into such a
> +situation, do what was outlined already for regressions caused by security
> +fixes: check earlier discussions if people already tried their best and ask for
> +advice if in doubt.
> +
> +A quick note while at it: these situations could be avoided, if people would
> +regularly give mainline pre-releases (say v5.15-rc1 or -rc3) from each cycle a
> +test run. This is best explained by imagining a change integrated between Linux
> +v5.14 and v5.15-rc1 which causes a regression, but at the same time is a hard
> +requirement for some other improvement applied for 5.15-rc1. All these changes
> +often can simply be reverted and the regression thus solved, if someone finds
> +and reports it before 5.15 is released. A few days or weeks later this solution
> +can become impossible, as some software might have started to rely on aspects
> +introduced by one of the follow-up changes: reverting all changes would then
> +cause a regression for users of said software and thus is out of the question.
> +
> +A feature I relied on was removed months ago, but I only noticed now. Does that qualify as regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but often it's hard to fix them due to the aspects outlined in the
> +previous section. It hence needs to be dealt with on a case-by-case basis. This
> +is another reason why it's in everybody's interest to regularly test mainline
> +pre-releases.
> +
> +Does the 'no regression' rule apply if I seem to be the only person in the world that is affected by a regression?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It does, but only for practical usage: the Linux developers want to be free to
> +remove support for hardware only to be found in attics and museums anymore.
> +
> +Note, sometimes regressions can't be avoided to make progress -- and the latter
> +is needed to prevent Linux from stagnation. Hence, if only very few users seem
> +to be affected by a regression, it for the greater good might be in their and
> +everyone else's interest to not insist on the rule. Especially if there is an
> +easy way to circumvent the regression somehow, for example by updating some
> +software or using a kernel parameter created just for this purpose.
> +
> +Does the regression rule apply for code in the staging tree as well?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Not according to the `help text for the configuration option covering all
> +staging code <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/staging/Kconfig>`_,
> +which since its early days states::
> +
> + Please note that these drivers are under heavy development, may or
> + may not work, and may contain userspace interfaces that most likely
> + will be changed in the near future.
> +
> +The staging developers nevertheless often adhere to the 'no regressions' rule,
> +but sometimes bend it to make progress. That's for example why some users had to
> +deal with (often negligible) regressions when a WiFi driver from the staging
> +tree was replaced by a totally different one written from scratch.
> +
> +Why do later versions have to be 'compiled with a similar configuration'?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Because the Linux kernel developers sometimes integrate changes known to cause
> +regressions, but make them optional and disable them in the kernel's default
> +configuration. This trick allows progress, as the 'no regressions' rule
> +otherwise would lead to stagnation.
> +
> +Consider for example a new security feature blocking access to some kernel
> +interfaces often abused by malware, which at the same time are required to run a
> +few rarely used applications. The outlined approach makes both camps happy:
> +people using these applications can leave the new security feature off, while
> +everyone else can enable it without running into trouble.
> +
> +How to create a configuration similar to the one of an older kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Start a known-good kernel and configure the newer Linux version with ``make

Start *with* a ... ?

> +olddefconfig``. This makes the kernel's build scripts pick up the configuration
> +file (the `.config` file) from the running kernel as base for the new one you
> +are about to compile; afterwards they set all new configuration options to their
> +default value, which should disable new features that might cause regressions.
> +
> +Can I report a regression to the upstream developers I found in a pre-compiled vanilla kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +You need to ensure the newer kernel was compiled with a similar configuration
> +file as the older one (see above), as the one that built them might have enabled
> +some known-to-be incompatible feature for the newer kernel. If in doubt,
> +report this problem to the kernel's provider and ask for advice.
> +
> +
> +More details about regressions relevant for developers
> +------------------------------------------------------
> +
> +What should I do, if I suspect a change I'm working on might cause regressions?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Evaluate how big the risk of regressions is, for example by performing a code
> +search in Linux distributions and Git forges. Also consider asking other
> +developers or projects likely to be affected to evaluate or even test the
> +proposed change; if problems surface, maybe some middle ground acceptable for
> +all can be found.
> +
> +If the risk of regressions in the end seems to be relatively small, go ahead
> +with the change, but let all involved parties know about the risk. Hence, make
> +sure your patch description makes this aspect obvious. Once the change is
> +merged, tell the Linux kernel's regression tracker and the regressions mailing
> +list about the risk, so everyone has the change on the radar in case reports
> +trickle in. Depending on the risk, you also might want to ask the subsystem
> +maintainer to mention the issue in his mainline pull request.
> +
> +
> +Everything developers need to know about regression tracking
> +------------------------------------------------------------
> +
> +Do I have to use regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +It's in the interest of everyone if you do, as kernel maintainers like Linus
> +Torvalds partly rely on regzbot's tracking in their work -- for example when
> +deciding to release a new version or extend the development phase. For this they
> +need to be aware of all unfixed regression; to do that, Linus is known to look
> +into the weekly reports sent by regzbot.
> +
> +Do I have to tell regzbot about every regression I stumble upon?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Ideally yes: we are all humans and easily forget problems when something more
> +important unexpectedly comes up -- for example a bigger problem in the Linux
> +kernel or something in real life that's keeping us away from keyboards for a
> +while. Hence, it's best to tell regzbot about every regression, except when you
> +immediately write a fix and commit it to a tree regularly merged to the affected
> +kernel series.
> +
> +Why does the Linux kernel need a regression tracker, and why does he utilize regzbot?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

s/he/it/ (or "why is regzbot used?")

> +Rules like 'no regressions' need someone to enforce them, otherwise they are
> +broken either accidentally or on purpose. History has shown that this is true
> +for the Linux kernel as well. That's why Thorsten volunteered to keep an eye on
> +things.

[...]

> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea3e6c914384..03bb629302cb 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -10438,6 +10438,7 @@ KERNEL REGRESSIONS
> M: Thorsten Leemhuis <[email protected]>
> L: [email protected]
> S: Supported
> +F: Documentation/admin-guide/regressions.rst
>
> KERNEL SELFTEST FRAMEWORK
> M: Shuah Khan <[email protected]>

OK, now that I'm at the end, I would like to suggest splitting this
material up. Few people will make it through that whole thing... It
seems to me that what we need is:

- How to report and track a regression (admin guide probably)
- Regression sermon for developers including Linus quotes (process)
- Reference guide for regzbot directives (process or tools)

That should get each of them to a manageable length where readers will
be able to get to the information they are looking for at the time. I
would think this would mostly be a matter of hacking out pieces from the
above and putting them in the proper place.

See what I'm getting at? Does that make sense to you?

Thanks,

jon

2022-01-26 21:29:02

by Thorsten Leemhuis

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

On 26.01.22 00:59, Jonathan Corbet wrote:
> Thorsten Leemhuis <[email protected]> writes:
>
>> Create a document explaining various aspects around regression handling
>> and tracking both for users and developers. Among others describe the
>> first rule of Linux kernel development and what it means in practice.
>> Also explain what a regression actually is and how to report one
>> properly. The text additionally provides a brief introduction to the bot
>> the kernel's regression tracker uses to facilitate his work. To sum
>> things up, provide a few quotes from Linus to show how serious he takes
>> regressions.
>>
> [...]

Many thx for your feedback, much appreciated. I trimmed the reply to not
mention all the places where I fixed things as suggested.

>> + Note: Only the content of this RST file as found in the Linux kernel sources
>> + is available under CC-BY-4.0, as versions of this text that were processed
>> + (for example by the kernel's build system) might contain content taken from
>> + files which use a more restrictive license.
>
> I wonder if we could put this boilerplate at the bottom, with a single
> "see the bottom for redistribution information" line here? Most readers
> won't care about this stuff and shouldn't have to slog through it to get
> to what they want to read.

Totally fine with me. When I touch reporting-issues.rst the next time
I'll move it downwards as well.

>> +Regressions
>> ++++++++++++
>> +
>> +The first rule of Linux kernel development: '*We don't cause regressions*'.
>> +Linux founder and lead developer Linus Torvalds strictly enforces the rule
>> +himself. This document describes what it means in practice and how the Linux
>> +kernel's development model ensures all reported regressions are addressed.
>> +The text covers aspects relevant for both users and developers.
>
> So that last line makes me a bit nervous; I've really been trying to get
> us to organize our documentation for the readers. So, without having
> read what follows in depth yet, I wonder if we don't really want two
> different documents: a developer document (which maybe belongs in
> Documentation/process) and a user document?

Fun fact: I also got nervous when I added that sentence, as it led to a
similar thought in my head already. :-/ After some internal debate I
decided that quite a few things overlap and continued to keep it in one
document, but after the text grew somewhat more I guess I have to agree.
More on this below.

>> +The important bits for people affected by regressions
>> +=====================================================
>> +
>> +It's a regression if something running fine with one Linux kernel works worse or
>> +not at all with a newer version. Note, the newer kernel has to be compiled using
>> +a similar configuration -- for this and other fine print, check out below
>> +section "What is a 'regression' and what is the 'no regressions rule'?".
> Can we be consistent with either single or double quotes? I'd suggest
> "double quotes" but won't make a fuss about that.

Changed to "double quotes" everywhere in the text. But just to make sure
I get things right: in this particular case this will result in

...section "What is a "regression" and what is the "no regressions rule"?".

This looks a bit strange to me. Something in me really would like to
quote the section's header in single quotes, but I guess grammar rules
do not allow that, so whatever. :-D

>> +Report your regression as outlined in
>> +`Documentation/admin-guide/reporting-issues.rst`, it already covers all aspects
> No need to quote the file name.

Okay, I thought I had seen some commit or instructions that it's better
to use them in this case, but my brain must have imagined it.

>> +When submitting fixes for regressions, add "Link:" tags pointing to all places
>> +where the issue was reported, as tools like the Linux kernel regression bot
>> +'regzbot' heavily rely on them. These pointers are also of great value when
>> +looking into the issue months or years later, that's why
>> +`Documentation/process/submitting-patches.rst` and
>> +:ref:`Documentation/process/5.Posting.rst <development_posting>` mandate their
>> +use::
>> +
>> + Link: https://lore.kernel.org/r/[email protected]/
>> + Link: https://bugzilla.kernel.org/show_bug.cgi?id=1234567890
>
> What is this literal block here for?

It's meant as an example. Should I simply add a ", e.g. like this" after
the "use"?

>> +Let the Linux kernel's regression tracker and all other subscribers of the
>> +`regression mailing list <https://lore.kernel.org/regressions/>`_
>> +([email protected]) quickly know about newly reported regressions:
> You've already linked this above, not sure it's needed again.

When I started this document I didn't mean it to be read from top to
bottom and repeated some things. But with the split this will likely
change somewhat.

>> +What is a 'regression' and what is the 'no regressions rule'?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +It's a regression if some application or practical use case running fine with
>> +one Linux kernel works worse or not at all with a newer version compiled using a
>> +similar configuration. The 'no regressions rule' forbids this to take place; if
>
> So this is something you already said above.

That above was more like a "TLDR", guess I'll call it that after
splitting things up.

>> +How is the rule enforced?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +It's the duty of the subsystem maintainers, which are watched and supported by
>> +Linus Torvalds for mainline or stable/longterm tree maintainers like Greg
>> +Kroah-Hartman. All of them are supported by Thorsten Leemhuis: he's acting as
>> +'regressions tracker' for the Linux kernel and trying to ensure all regression
>> +reports are acted upon in a timely manner.
>> +
>> +The distributed and slightly unstructured nature of the Linux kernel's
>> +development makes tracking regressions hard. That's why Thorsten relies on the
>> +help of his Linux kernel regression tracking robot 'regzbot'. It watches mailing
>> +lists and git trees to semi-automatically associate regression reports with
>> +patch submissions and commits fixing the issue, as this provides all necessary
>> +insights into the fixing progress.
>> +
>> +Note, the regression tracker can only ensure no regression falls through the
>> +cracks, if someone tells him or his bot about every regression found. That's why
>> +each report needs to be CCed or forwarded to the regressions mailing list
>> +(ideally with a 'regzbot command' in the mail), as explained in the next
>> +section.
>
> So this isn't really enforcement information, it's tracking, which is
> different... If you really want to talk about enforcement, you might
> mention that offending patches can be reverted if they are not fixed in
> a timely manner.

Well, I'd say tracking is part of enforcing, but whatever, I see your
point and will address it in the next version.

>> +Placing such a 'regzbot command' is in your interest, as it will ensure the
>> +report won't fall through the cracks unnoticed. If you omit this, the Linux
>> +kernel's regressions tracker will take care of telling regzbot about your
>> +regression, as long as you send a copy to the regressions mailing lists. But the
>> +regression tracker is just one human which sometimes has to rest or occasionally
>> +might even enjoy some time away from computers (as crazy as that might sound).
>
> Naw, we don't allow that, sorry :)

:-D

>> +Who needs to find the commit causing a regression?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +It's the reporter's duty to find the culprit, but developers of the affected
>> +subsystem should offer advice and reasonably help where they can.
>
> Is it really our policy that *reporters* need to find the offending
> commit? That's certainly not my view of things, anyway?

Well, do we have something on that written down somewhere or a few
quotes from Linus that might help to clarify things?

Anyway: I was not totally happy with it either, as I found the first
part of the sentence too strong, and the second too soft. But I had
trouble finding something better, maybe a native speaker could help out
here. Maybe something along these lines?

```
The developers of the affected code area should try their best to
identify what's causing the regression. But that might be impossible
with reasonable effort, as quite a few regressions only occur in a
particular environment (Linux distro, hardware, configuration, ...).
That's why in the end it's often up to the reporter to locate the
culprit; to make this easy, developers should offer advice and
reasonably help where they can.
```

As that's what things boil down to in practice, don't they?

>> +More details about regressions relevant for reporters
>> +-----------------------------------------------------
>> +
>> +Does a regression need to be fixed, if it can be avoided by updating some other software?
>
> It would be nice to keep to 80 columns if possible. These long section
> headings aren't great for readability.

I know and already tried to keep them shorter, but I sometimes simply
failed to do so. Thing is: more descriptive section headings IMHO make
it easier to skim over the document and just read the sections one is
interested in (kinda FAQ style). And that was my goal, as I don't expect
many people to read the text from top to bottom in these interesting times.
That's why I think the benefits outweigh the downsides in times where we
even allow code to exceed the old 80 character limit when it makes sense.

I'll give it another shot to make them shorter, but I can't promise I'll
manage -- as you wrote "It would be *nice* to keep" I for now assume
that's okay.

>> +How to create a configuration similar to the one of an older kernel?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Start a known-good kernel and configure the newer Linux version with ``make
> Start *with* a ... ?

Hmmm. I went for "Start *your machine with* a" to make it really obvious
what to start.
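
For illustration only, the step under discussion might be sketched like
this. `seed_config` is a hypothetical helper, and the location of the
running kernel's configuration (here `/boot/config-$(uname -r)`) is an
assumption that varies between distributions:

```shell
#!/bin/sh
# Sketch only: seed_config is a hypothetical helper, not part of the kernel.
# It copies the configuration of the known-good kernel into the newer
# source tree, so "make olddefconfig" can use it as the base.
seed_config() {
    tree=$1      # top of the newer kernel source tree
    running=$2   # config of the known-good kernel (path is distro-specific)
    cp "$running" "$tree/.config"
}

# Typical use after booting the known-good kernel:
#   seed_config . "/boot/config-$(uname -r)"
#   make olddefconfig   # options new to this version get their defaults
```

Copying the file explicitly makes it obvious which configuration served
as the base, instead of relying on the build scripts to locate it.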

> OK, now that I'm at the end, I would like to suggest splitting this
> material up. Few people will make it through that whole thing... It
> seems to me that what we need is:
>
> - How to report and track a regression (admin guide probably)
> - Regression sermon for developers including Linus quotes (process)
> - Reference guide for regzbot directives (process or tools)
>
> That should get each of them to a manageable length where readers will
> be able to get to the information they are looking for at the time. I
> would think this would mostly be a matter of hacking out pieces from the
> above and putting them in the proper place.
>
> See what I'm getting at? Does that make sense to you?

Sure, as mentioned earlier, I had thought about this already.

But I'll try to avoid a separate document for regzbot and see how this
works out. I fear splitting this off is a bad idea, as reporters
otherwise need to look into too many documents for one task, as they
already need to deal with reporting-issues.rst as well.

Thx again!

Ciao, Thorsten

2022-01-26 21:32:58

by Geert Uytterhoeven

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

Hi Thorsten,

On Tue, Jan 25, 2022 at 5:45 PM Thorsten Leemhuis <[email protected]> wrote:
> +How to create a configuration similar to the one of an older kernel?
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Start a known-good kernel and configure the newer Linux version with ``make
> +olddefconfig``. This makes the kernel's build scripts pick up the configuration
> +file (the `.config` file) from the running kernel as base for the new one you
> +are about to compile; afterwards they set all new configuration options to their
> +default value, which should disable new features that might cause regressions.

Doing so may actually cause mutations to appear in your .config
when going back and forth (i.e. when bisecting), interfering with
the bisection process.

To avoid that, I usually start bisecting with
"cp .config <src>/arch/<arch>/configs/bisect_defconfig", and use
"make bisect_defconfig" in every bisection step. That way all steps
are reproducible, and unaffected by config mutations.
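
A minimal sketch of that workflow, assuming the commands are run from the
top of the kernel tree. `snapshot_config` and its arguments are
hypothetical names used only for illustration; the arch directory is the
one below `arch/` (for example `x86` for x86_64 builds):

```shell
#!/bin/sh
# Sketch only: snapshot_config is a hypothetical helper, not part of the kernel.
# It stores the known-good .config as a named defconfig inside the tree, so
# every bisection step can regenerate its .config from the same file.
snapshot_config() {
    src=$1    # top of the kernel source tree
    arch=$2   # directory name below arch/, e.g. x86 or arm64
    cp "$src/.config" "$src/arch/$arch/configs/bisect_defconfig"
}

# Once, before starting the bisection:
#   snapshot_config . x86
# Then at every bisection step:
#   make bisect_defconfig && make -j"$(nproc)"
#   ... boot and test, then "git bisect good" or "git bisect bad"
```

The copied file is untracked, so it stays in place while git checks out
different commits during the bisection.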

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2022-01-26 22:21:09

by Randy Dunlap

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

On 1/26/22 06:10, Thorsten Leemhuis wrote:
>>> +The important bits for people affected by regressions
>>> +=====================================================
>>> +
>>> +It's a regression if something running fine with one Linux kernel works worse or
>>> +not at all with a newer version. Note, the newer kernel has to be compiled using
>>> +a similar configuration -- for this and other fine print, check out below
>>> +section "What is a 'regression' and what is the 'no regressions rule'?".
>> Can we be consistent with either single or double quotes? I'd suggest
>> "double quotes" but won't make a fuss about that.
> Changed to "double quotes" everywhere in the text. But just to make sure
> I get things right: in this particular case this will result in
>
> ...section "What is a "regression" and what is the "no regressions rule"?".
>
> This looks a bit strange to me. Something in me really would like to
> quote the section's header in single quotes, but I guess grammar rules
> do not allow that, so whatever. :-D
>

I think that it was correct with the mixed quotes. Using all double
quotes here is confusing.

--
~Randy

2022-02-01 20:41:35

by Thorsten Leemhuis

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

On 26.01.22 15:10, Thorsten Leemhuis wrote:
>
> On 26.01.22 00:59, Jonathan Corbet wrote:
>> Thorsten Leemhuis <[email protected]> writes:

>>> + Note: Only the content of this RST file as found in the Linux kernel sources
>>> + is available under CC-BY-4.0, as versions of this text that were processed
>>> + (for example by the kernel's build system) might contain content taken from
>>> + files which use a more restrictive license.
>>
>> I wonder if we could put this boilerplate at the bottom, with a single
>> "see the bottom for redistribution information" line here? Most readers
>> won't care about this stuff and shouldn't have to slog through it to get
>> to what they want to read.
>
> Totally fine with me. When I touch reporting-issues.rst the next time
> I'll move it downwards as well.

V4 will do that, as I added a patch to point from reporting-issues.rst
to one of the two new documents.

>>> +The important bits for people affected by regressions
>>> +=====================================================
>>> +
>>> +It's a regression if something running fine with one Linux kernel works worse or
>>> +not at all with a newer version. Note, the newer kernel has to be compiled using
>>> +a similar configuration -- for this and other fine print, check out below
>>> +section "What is a 'regression' and what is the 'no regressions rule'?".
>> Can we be consistent with either single or double quotes? I'd suggest
>> "double quotes" but won't make a fuss about that.
>
> Changed to "double quotes" everywhere in the text. But just to make sure
> I get things right: in this particular case this will result in
>
> ...section "What is a "regression" and what is the "no regressions rule"?".
>
> This looks a bit strange to me. Something in me really would like to
> quote the section's header in single quotes, but I guess grammar rules
> do not allow that, so whatever. :-D

I changed something and now simply don't mention the section names to
avoid this problem. After the split that's not strictly needed afaics.

>>> +Report your regression as outlined in
>>> +`Documentation/admin-guide/reporting-issues.rst`, it already covers all aspects
>> No need to quote the file name.
> Okay, I thought I had seen some commit or instructions that it's better
> to use them in this case, but my brain must have imagined it.

I noticed I quoted internal references in reporting-issues.rst quite
often. IMHO it improves readability sometimes (it depends a lot on the
title of the target document), as can be seen in this example:

```
If your kernel is tainted, study
'Documentation/admin-guide/tainted-kernels.rst' to find out why.
[...]
To find the change there is a process called 'bisection' which the document
'Documentation/admin-guide/bug-bisect.rst' describes in detail.
```

after processing to HTML looks like this:

```
If your kernel is tainted, study 'Tainted kernels' to find out why.
[...]
To find the change there is a process called ‘bisection’ which the
document ‘Bisecting a bug’ describes in detail.
```

Sure, "Tainted kernels" and "Bisecting a bug" are links and hence
displayed differently by the browser, but I think the quotes help. But YMMV.

I sooner or later hope to improve and fix a few things in
reporting-issues.rst anyway. Let me know if I should take the
opportunity to remove the single quotes then.

>>> +Who needs to find the commit causing a regression?
>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> +
>>> +It's the reporter's duty to find the culprit, but developers of the affected
>>> +subsystem should offer advice and reasonably help where they can.
>>
>> Is it really our policy that *reporters* need to find the offending
>> commit? That's certainly not my view of things, anyway?

BTW, I noticed reporting-issues.rst covers it like this:

Normally it's up to the reporter to track down the culprit, as
maintainers often won't have the time or setup at hand to reproduce it
themselves.

> Well, do we have something on that written down somewhere or a few
> quotes from Linus that might help to clarify things?
>
> Anyway: I was not totally happy with it either, as I found the first
> part of the sentence too strong, and the second too soft. But I had
> trouble finding something better, maybe a native speaker could help out
> here. Maybe something along these lines?

I plan to go with this now:
```
Who needs to find the root cause of a regression?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Developers of the affected code area should try to locate the culprit on
their own. But for them that's often impossible to do with reasonable
effort, as quite a lot of issues only occur in a particular environment
outside the developer's reach -- for example, a specific hardware
platform, firmware, Linux distro, system's configuration, or
application. That's why in the end it's often up to the reporter to
locate the culprit commit; sometimes users might even need to run
additional tests afterwards to pinpoint the exact root cause. Developers
should offer advice and reasonably help where they can, to make this
process relatively easy and achievable for typical users.
```

Ciao, Thorsten

2022-02-04 23:08:47

by Thorsten Leemhuis

Subject: Re: [PATCH v3 1/2] docs: add a document about regression handling

Hi! I noticed I forgot to reply here:

On 26.01.22 15:28, Geert Uytterhoeven wrote:
> On Tue, Jan 25, 2022 at 5:45 PM Thorsten Leemhuis <[email protected]> wrote:
>> +How to create a configuration similar to the one of an older kernel?
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +
>> +Start a known-good kernel and configure the newer Linux version with ``make
>> +olddefconfig``. This makes the kernel's build scripts pick up the configuration
>> +file (the `.config` file) from the running kernel as base for the new one you
>> +are about to compile; afterwards they set all new configuration options to their
>> +default value, which should disable new features that might cause regressions.
>
> Doing so may actually cause mutations to appear in your .config
> when going back and forth (i.e. when bisecting), interfering with
> the bisection process.

Good point, I knew about this, but hadn't thought of this when writing
the text.

> To avoid that, I usually start bisecting with
> "cp .config <src>/arch/<arch>/configs/bisect_defconfig", and use
> "make bisect_defconfig" in every bisection step. That way all steps
> are reproducible, and unaffected by config mutations.

That's a really cool trick, thx for mentioning it. But I think it's not
needed in the text about regressions and would be better mentioned in
Documentation/admin-guide/bug-bisect.rst. I hope to improve (rewrite?)
that document sooner or later anyway and will make sure to keep it in
mind for that time.
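
For reference, the workflow Geert describes could be sketched roughly as
below. This is only an illustration, not something taken from the patch:
the function names are made up, and SRC (top of the kernel source tree)
and ARCH_DIR (the directory name below arch/, for example x86) are
assumptions that need adjusting to the local setup.

```shell
#!/bin/sh
# Sketch only: SRC and ARCH_DIR are assumed values, adjust them to your setup.
set -eu

SRC=${SRC:-.}                # assumed: top of the kernel source tree
ARCH_DIR=${ARCH_DIR:-x86}    # assumed: directory name below arch/

# Run once before bisecting: keep a pristine copy of the known-good
# configuration as a named defconfig inside the source tree.
save_bisect_config() {
    cp "$1" "$SRC/arch/$ARCH_DIR/configs/bisect_defconfig"
}

# Run at every bisection step: regenerate .config from the pristine copy,
# so changes made during earlier steps cannot leak into this build.
reset_bisect_config() {
    make -C "$SRC" bisect_defconfig
}
```

One would call "save_bisect_config .config" once while the known-good
configuration is in place, and "reset_bisect_config" before building at
each bisection step.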

I wonder if there is a way to make this work without touching the
source tree? I took a quick look at the sources. It seems to me that it's
possible to "cp .config ~/working.config" and then use "make
KBUILD_DEFCONFIG=~/working.config defconfig" at every bisection step. It
seems to do the trick as well -- but I only tried briefly, so I might
have missed something.

Ciao, Thorsten