2024-05-07 02:30:03

by John Hubbard

[permalink] [raw]
Subject: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

If a system experiences a lot of memory failures, then any associated
printk() output really needs to be rate-limited. I noticed this while
running selftests/mm/uffd-unit-tests, which logs 12,305 lines of output,
adding (on my system) an extra 97 seconds of runtime due to printk time.

The test normally only takes about 10 seconds, and the enclosing set of
mm selftests normally takes 305 seconds, so the additional 97 seconds
really hurts.

Generating lots of memory poisoning events seems like a valid use case,
by which I mean that this is not just a testing artifact. And that's why
the fix applies to the code that directly generates the output, rather
than the selftest that triggers it.

With this patch, all but 10 lines are suppressed, thus speeding up that
particular selftest by 90% (runtime drops from 107 seconds, to 10.6
seconds).

Signed-off-by: John Hubbard <[email protected]>
---
arch/x86/mm/fault.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index bba4e020dd64..e4f3c7721f45 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -928,7 +928,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
struct task_struct *tsk = current;
unsigned lsb = 0;

- pr_err(
+ pr_err_ratelimited(
"MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
tsk->comm, tsk->pid, address);
if (fault & VM_FAULT_HWPOISON_LARGE)

base-commit: dccb07f2914cdab2ac3a5b6c98406f765acab803
prerequisite-patch-id: b901ece2a5b78503e2fb5480f20e304d36a0ea27
--
2.45.0



2024-05-07 08:13:48

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

The patch subject is misleading. This should be "don't flood the system
log". Nobody cares about the speed of a unittest ;)

On 07.05.24 04:29, John Hubbard wrote:
> If a system experiences a lot of memory failures, then any associated
> printk() output really needs to be rate-limited. I noticed this while
> running selftests/mm/uffd-unit-tests, which logs 12,305 lines of output,
> adding (on my system) an extra 97 seconds of runtime due to printk time.

Recently discussed:

https://lkml.kernel.org/r/[email protected]

See the pros/cons of using ratelimiting, and what an alternative for
uffd is that Axel is working on.

(CCing Peter and Axel)

--
Cheers,

David / dhildenb


2024-05-07 16:55:40

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

On 07.05.24 18:28, John Hubbard wrote:
> On 5/7/24 1:13 AM, David Hildenbrand wrote:
>> The patch subject is misleading. This should be "don't flood the system
>
> I went back and forth on that subject line. :)
>
>> log". Nobody cares about the speed of a unittest ;)
>
> Yes they do. People should actually run the selftests, which in turn have
> enshrined their guidelines in kernel doc. See dev-tools/kselftest.rst,
> "Contributing new tests", which says, as you would hope, "Don't take
> too long".
>
> It's important. Tests need to be quick, and having one out of 50 that
> blows it up is worth fixing.

I'm pretty sure you got my point: it's much more important to not let
unprivileged users flood the log (possibly harming the system?) than
making a test run faster :)

>
>>
>> On 07.05.24 04:29, John Hubbard wrote:
>>> If a system experiences a lot of memory failures, then any associated
>>> printk() output really needs to be rate-limited. I noticed this while
>>> running selftests/mm/uffd-unit-tests, which logs 12,305 lines of output,
>>> adding (on my system) an extra 97 seconds of runtime due to printk time.
>>
>> Recently discussed:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> See the pros/cons of using ratelimiting, and what an alternative for
>> uffd is that Axel is working on.
>>
>> (CCing Peter and Axel)
>>
>
> That thread seems to have stalled.

Yes, there was no follow-up.

> I *do* have MCE experience (writing a
> handler,
> dealing with MCEs and related bugs), and what they wrote so far is exactly
> correct: if you were going to flood the log, then no, we don't need to see
> every single line printed. The first 10 or so, plus the fact that rate
> limiting
> kicked in, is sufficient to proceed with debugging and/or hardware
> replacement.
>
> I'd like to just do this patch almost as-is, just with a fixed up subject,
> perhaps:
>
> x86/fault: rate-limit to avoid flooding dmesg with "MCE: Killing"
> reports
>
> Yes?


Makes sense to me (and thanks for confirming that we don't need
complexity elsewhere).

I think we at least need "Fixes:" (not sure if stable is warranted as
well, 1b0a151c10a6d823f033023b9fdd9af72a89591b didn't CC stable).

Consider adding a link to the report in that thread.

Acked-by: David Hildenbrand <[email protected]>

--
Cheers,

David / dhildenb


2024-05-07 17:00:37

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

On 5/7/24 1:13 AM, David Hildenbrand wrote:
> The patch subject is misleading. This should be "don't flood the system

I went back and forth on that subject line. :)

> log". Nobody cares about the speed of a unittest ;)

Yes they do. People should actually run the selftests, which in turn have
enshrined their guidelines in kernel doc. See dev-tools/kselftest.rst,
"Contributing new tests", which says, as you would hope, "Don't take
too long".

It's important. Tests need to be quick, and having one out of 50 that
blows it up is worth fixing.

>
> On 07.05.24 04:29, John Hubbard wrote:
>> If a system experiences a lot of memory failures, then any associated
>> printk() output really needs to be rate-limited. I noticed this while
>> running selftests/mm/uffd-unit-tests, which logs 12,305 lines of output,
>> adding (on my system) an extra 97 seconds of runtime due to printk time.
>
> Recently discussed:
>
> https://lkml.kernel.org/r/[email protected]
>
> See the pros/cons of using ratelimiting, and what an alternative for
> uffd is that Axel is working on.
>
> (CCing Peter and Axel)
>

That thread seems to have stalled. I *do* have MCE experience (writing a
handler,
dealing with MCEs and related bugs), and what they wrote so far is exactly
correct: if you were going to flood the log, then no, we don't need to see
every single line printed. The first 10 or so, plus the fact that rate
limiting
kicked in, is sufficient to proceed with debugging and/or hardware
replacement.

I'd like to just do this patch almost as-is, just with a fixed up subject,
perhaps:

x86/fault: rate-limit to avoid flooding dmesg with "MCE: Killing"
reports

Yes?

thanks,
--
John Hubbard
NVIDIA


2024-05-07 17:05:04

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

On 07.05.24 18:53, John Hubbard wrote:
> On 5/7/24 9:43 AM, David Hildenbrand wrote:
>> On 07.05.24 18:28, John Hubbard wrote:
>>> On 5/7/24 1:13 AM, David Hildenbrand wrote:
> ...
>>> I'd like to just do this patch almost as-is, just with a fixed up
>>> subject,
>>> perhaps:
>>>
>>>       x86/fault: rate-limit to avoid flooding dmesg with "MCE: Killing"
>>> reports
>>>
>>> Yes?
>>
>>
>> Makes sense to me (and thanks for confirming that we don't need
>> complexity elsewhere).
>>
>> I think we at least need "Fixes:" (not sure if stable is warranted as
>> well, 1b0a151c10a6d823f033023b9fdd9af72a89591b didn't CC stable).
>>
>> Consider adding a link to the report in that thread.
>>
>> Acked-by: David Hildenbrand <[email protected]>
>>
>
> OK, there's enough changes to the commit description that I'll go ahead
> and send a v2, with a new subject line, a Fixes:, your Ack, a link to
> the other conversation, and maybe reword a bit based on your feedback
> too.

Sounds good. We might want to CC stable: I just read
1b0a151c10a6d823f033023b9fdd9af72a89591b and I think the problem there
cannot really be triggered by unprivileged users.

Hm... thanks!

--
Cheers,

David / dhildenb


2024-05-07 19:27:52

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

On Mon, May 06, 2024 at 07:29:39PM -0700, John Hubbard wrote:
> Generating lots of memory poisoning events seems like a valid use case,
> by which I mean that this is not just a testing artifact. And that's why
> the fix applies to the code that directly generates the output, rather
> than the selftest that triggers it.

Sorry, not taking a "fix" for something hypothetical.

If this is a real issue on a real system and the printing is the
*actual* problem at hand in a hw error storm sure, but no, not because
a selftest runs slower.

I'm pretty sure in a hw error storm scenario, printk being slow is the
least of your problems.

And in such a scenario the *last* thing you wanna do is ratelimit prints
so that you can't even get all the logs which is *the* thing you need to
debug the hw.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-05-07 22:50:20

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH] x86/fault: speed up uffd-unit-test by 10x: rate-limit "MCE: Killing" logs

On 5/7/24 11:15 AM, Axel Rasmussen wrote:
> On Tue, May 7, 2024 at 11:11 AM John Hubbard <[email protected]> wrote:
>>
>> On 5/7/24 11:08 AM, Axel Rasmussen wrote:
>>> On Tue, May 7, 2024 at 9:43 AM David Hildenbrand <[email protected]> wrote:
>> ...
>>>>> That thread seems to have stalled.
>>>>
>>>> Yes, there was no follow-up.
>>>
>>> Apologies, I had completely forgotten about this. I blame the weekend. :)
>>>
>>> No objections from me to the simple rate limiting proposed here, if
>>> useful you can take:
>>>
>>> Acked-by: Axel Rasmussen <[email protected]>
>>>
>>> But, it seems to me the earlier proposal may still be useful.
>>> Specifically, don't print at all for "synthetic" poisons from
>>> UFFDIO_POISON or similar mechanisms. This way, "real" errors aren't
>>> gobbled up by the ratelimit due to spam from "synthetic" errors. If
>>> folks agree, I can *actually* send a patch this time. :)
>>>
>>
>> That sounds good to me. (Should it also rate limit, though? I'm leaning
>> toward yes.)
>
> I believe the proposal so far was, simulated poisons aren't really
> "global" events, and are only relevant to the process itself. So don't
> send them to the global kernel log at all, and instead let the process
> do whatever it wants with them (e.g. it could print something when it
> receives a signal, perhaps with rate limiting).

OK. And seeing as how I'm not (at all) in alignment with Borislav on
where to apply rate limiting, we'd better go with your approach.


thanks,
--
John Hubbard
NVIDIA