Hi,
We are seeing segfaults logged for processes while they are being killed,
starting with kernels that include commit eda0047296a16d65a7f2bc60a408f70d178b2014
("mm: make the page fault mmap locking killable"), all the way up to
v6.5-rc3, which is the kernel this report is based on.
Unfortunately I don't have a simple reproducer; the one I have is big
and quite racy. My working theory for what happens is (see also the
bpftrace script and output [0]):
Since get_mmap_lock_carefully() now uses mmap_read_lock_killable(), if
the rwsem read slowpath (rwsem_down_read_slowpath()) is taken while a
fatal signal is pending, the lock attempt fails with -EINTR. The error
is propagated up, and get_mmap_lock_carefully() returns the boolean
negation !mmap_read_lock_killable(mm), i.e. false.
Then lock_mm_and_find_vma() returns NULL
> if (!get_mmap_lock_carefully(mm, regs))
> return NULL;
and so do_user_addr_fault()
> vma = lock_mm_and_find_vma(mm, address, regs);
> if (unlikely(!vma)) {
> bad_area_nosemaphore(regs, error_code, address);
> return;
> }
will end up without a vma and cause/log the segfault. Of course the
process is already being killed, but I'd argue it is very confusing to
users when apparent segfaults from such processes are being logged by
the kernel.
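For reference, this is roughly what the helper looks like in mm/memory.c as
of v6.5-rc (paraphrased from my reading of the code, the comments are mine):
> static inline bool get_mmap_lock_carefully(struct mm_struct *mm,
>                                            struct pt_regs *regs)
> {
>         /* Common case: the lock is uncontended */
>         if (likely(mmap_read_trylock(mm))) {
>                 might_sleep();
>                 return true;
>         }
>
>         /* Kernel-mode faults need an exception fixup before we may sleep */
>         if (regs && !user_mode(regs)) {
>                 unsigned long ip = instruction_pointer(regs);
>
>                 if (!search_exception_tables(ip))
>                         return false;
>         }
>
>         /* This is where a pending fatal signal turns into "return false" */
>         return !mmap_read_lock_killable(mm);
> }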
Happy to provide other traces or information if required!
Best Regards,
Fiona
[0]:
I ended up with the following bpftrace script:
> #include <linux/signal.h>
> #include <linux/sched.h>
>
> // entry into down_read_killable() (what mmap_read_lock_killable() calls)
> kprobe:down_read_killable {
>     printf("%s %d %d\n", func, pid, tid);
> }
>
> // entry into the rwsem read slowpath, i.e. the lock was contended
> kprobe:rwsem_down_read_slowpath {
>     printf("%s %d %d\n", func, pid, tid);
> }
>
> // return from the slowpath: print the return value and the kernel stack
> kretprobe:rwsem_down_read_slowpath {
>     printf("%s %d %d retval 0x%x\n", func, pid, tid, retval);
>     printf("%s\n", kstack());
> }
>
> // fires when the x86 fault handler reports a bad area without holding
> // mmap_lock; also dump the kernel stack if the task has a signal pending
> kprobe:bad_area_nosemaphore {
>     printf("%s %d %d %s pending signal: %d\n", func, pid, tid, comm,
>         curtask->pending.signal.sig[0]);
>     if (curtask->pending.signal.sig[0]) {
>         printf("%s\n", kstack());
>     }
> }
and here is a capture of a process running into the logged segfault. AFAIU,
the pending signal mask 256 (= 1 << (SIGKILL - 1)) translates to SIGKILL,
and the return value 0xfffffffc (= -4) means down_read_killable() failed
with -EINTR.
> down_read_killable 987299 987299
> rwsem_down_read_slowpath 987299 987299
> down_read_killable 987299 987299 retval 0xfffffffc
>
> down_read_killable+72
> lock_mm_and_find_vma+167
> do_user_addr_fault+477
> exc_page_fault+131
> asm_exc_page_fault+39
>
> bad_area_nosemaphore 987299 987299 pverados pending signal: 256
>
> bad_area_nosemaphore+1
> exc_page_fault+131
> asm_exc_page_fault+39
>
> bad_area_nosemaphore 987299 987299 pverados pending signal: 256
>
> bad_area_nosemaphore+1
> exc_page_fault+131
> asm_exc_page_fault+39
> rep_movs_alternative+96
> show_opcodes+118
> __bad_area_nosemaphore+640
> bad_area_nosemaphore+22
> do_user_addr_fault+708
> exc_page_fault+131
> asm_exc_page_fault+39
>
> bad_area_nosemaphore 987299 987299 pverados pending signal: 256
>
> bad_area_nosemaphore+1
> exc_page_fault+131
> asm_exc_page_fault+39
> rep_movs_alternative+15
> show_opcodes+118
> __bad_area_nosemaphore+640
> bad_area_nosemaphore+22
> do_user_addr_fault+708
> exc_page_fault+131
> asm_exc_page_fault+39
>
On Tue, 25 Jul 2023 at 04:16, Fiona Ebner <[email protected]> wrote:
>
> will end up without a vma and cause/log the segfault. Of course the
> process is already being killed, but I'd argue it is very confusing to
> users when apparent segfaults from such processes are being logged by
> the kernel.
Ahh. Yes, that wasn't the intent. A process that is being killed
should exit with the lethal signal, not SIGSEGV.
This is not new, btw - this situation is exactly the same as if you
use something like NFS where the process is killed and the IO is
interrupted and the page fault faults for that reason.
But I suspect *very* few people actually encounter that NFS situation
(you can get it on local filesystems too, but the IO race window is
then so small as to probably not be triggerable at all).
So the new killable() check is probably much easier to actually
trigger in practice, even though it's not a new situation per se.
What exactly made you notice? Is it just the logging from
'show_unhandled_signals' being set?
Because the actual signal itself, from the
force_sig_fault(SIGSEGV, si_code, (void __user *)address);
in __bad_area_nosemaphore() should be overridden by the fact that a
lethal signal was already pending.
But let's add a couple of signal people rather than the mm people to
the participants. Eric, Oleg - would not an existing fatal signal take
precedence over a new SIGSEGV? I obviously thought it did, but looking
at 'get_signal()' and the signal delivery, I don't actually see any
code to that effect.
Fiona - that patch is easily reverted, and it was done as a separate
patch exactly because I was wondering if there was some subtle reason
we didn't already do that.
But before we revert it, would you mind trying out the attached
trivial patch instead?
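(The attachment is not inlined in this thread. The idea is simply to bail
out of the segfault logging when a fatal signal is already pending; a rough
sketch of that kind of check in arch/x86/mm/fault.c - not the actual
attachment, placement and details may differ, and written here with the tsk
argument the attachment itself was missing, see below - would be:)

	static void show_signal_msg(struct pt_regs *regs, unsigned long error_code,
				    unsigned long address, struct task_struct *tsk)
	{
		/* Sketch: don't log a "segfault" for a task that is already dying */
		if (fatal_signal_pending(tsk))
			return;

		if (!unhandled_signal(tsk, SIGSEGV))
			return;

		if (!printk_ratelimit())
			return;

		/* ... print the "segfault at ..." message as before ... */
	}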
I'd also still be interested if the symptoms were anything else than
'show_unhandled_signals' causing the show_signal_msg() dance, and
resulting in a message something like
a.out[1567]: segfault at xyz ip [..] likely on CPU X
in dmesg...
Linus
On 25/07/2023 18:38, Linus Torvalds wrote:
> But before we revert it, would you mind trying out the attached
> trivial patch instead?
Not Fiona, but as I was still online yesterday I already got around to
trying that patch out, after adding the missing `tsk` task_struct parameter
to the fatal_signal_pending() call.
With the patched kernel booted, the original case we found in the wild went
from logging a segfault roughly twice per hour to none at all, over a bit
more than 10 hours of uptime.
Fiona might have a more definitive confirmation, as IIRC she has a better
(= faster) reproducer that she used for bisecting.
>
> I'd also still be interested if the symptoms were anything else than
> 'show_unhandled_signals' causing the show_signal_msg() dance, and
> resulting in a message something like
>
> a.out[1567]: segfault at xyz ip [..] likely on CPU X
>
> in dmesg...
Exactly, it was just like that, with no actual fallout. The messages were
like:
> pverados[2183248]: segfault at 55e5a00f9ae0 ip 000055e5a00f9ae0 sp 00007ffc0720bea8 error 14 in perl[55e5a00d4000+195000] likely on CPU 10 (core 4, socket 0)
And the slightly odd code triggering this was basically a fork, where the
child wrote a message to the parent via a unix socket pair and then called
exit. The parent read that message and then sent a SIGKILL to the child
process, i.e., the child's exit and the parent killing the child would be
pretty closely aligned, basically racing with each other (a minimal sketch
of that pattern follows below).
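For illustration only, here is a minimal userspace sketch of that
exit-vs-SIGKILL pattern (made up for this mail, not our actual code; on its
own it won't necessarily reproduce the logged segfault, since the child also
needs to be in the middle of a page fault with mmap_lock contended when the
SIGKILL arrives):

	#include <signal.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			int sv[2];
			char buf[8];
			pid_t pid;

			if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
				perror("socketpair");
				return 1;
			}

			pid = fork();
			if (pid < 0) {
				perror("fork");
				return 1;
			}

			if (pid == 0) {
				/* child: tell the parent we are done, then exit */
				close(sv[0]);
				(void)write(sv[1], "done", 4);
				_exit(0);
			}

			/* parent: wait for the message, then immediately kill the
			 * (possibly already exiting) child -- the child's exit and
			 * our SIGKILL race with each other */
			close(sv[1]);
			(void)read(sv[0], buf, sizeof(buf));
			kill(pid, SIGKILL);
			waitpid(pid, NULL, 0);
			close(sv[0]);
		}

		return 0;
	}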
cheers,
Thomas
Am 25.07.23 um 18:38 schrieb Linus Torvalds:
> On Tue, 25 Jul 2023 at 04:16, Fiona Ebner <[email protected]> wrote:
>>
>> will end up without a vma and cause/log the segfault. Of course the
>> process is already being killed, but I'd argue it is very confusing to
>> users when apparent segfaults from such processes are being logged by
>> the kernel.
>
> Ahh. Yes, that wasn't the intent. A process that is being killed
> should exit with the lethal signal, not SIGSEGV.
>
Checking the status from waitpid, it does show that the process was
terminated by signal 9, even if the segfault was logged.
> But before we revert it, would you mind trying out the attached
> trivial patch instead?
>
The patch works for me too :) (after adding the missing tsk argument
like Thomas pointed out)
> I'd also still be interested if the symptoms were anything else than
> 'show_unhandled_signals' causing the show_signal_msg() dance, and
> resulting in a message something like
>
> a.out[1567]: segfault at xyz ip [..] likely on CPU X
>
> in dmesg...
>
Yes, AFAICS, it is just those messages and nothing else.
Best Regards,
Fiona
On 07/25, Linus Torvalds wrote:
>
> What exactly made you notice? Is it just the logging from
> 'show_unhandled_signals' being set?
>
> Because the actual signal itself, from the
>
> force_sig_fault(SIGSEGV, si_code, (void __user *)address);
>
> in __bad_area_nosemaphore() should be overridden by the fact that a
> lethal signal was already pending.
Yes, the SIGSEGV won't even be delivered; prepare_signal() returns F if
SIGNAL_GROUP_EXIT is set.
> But let's add a couple of signal people rather than the mm people to
> the participants. Eric, Oleg - would not an existing fatal signal take
> precedence over a new SIGSEGV? I obviously thought it did, but looking
> at 'get_signal()' and the signal delivery, I don't actually see any
> code to that effect.
See
	/* Has this task already been marked for death? */
	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
	     signal->group_exec_task) {
		clear_siginfo(&ksig->info);
		ksig->info.si_signo = signr = SIGKILL;
		sigdelset(&current->pending.signal, SIGKILL);
		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
			&sighand->action[SIGKILL - 1]);
		recalc_sigpending();
		goto fatal;
	}
in get_signal().
So yes, get_signal() returns SIGKILL if fatal_signal_pending() == T which
implies SIGNAL_GROUP_EXIT.
I think your patch is fine.
Oleg.
On Wed, 26 Jul 2023 at 01:19, Fiona Ebner <[email protected]> wrote:
>
> Checking the status from waitpid, it does show that the process was
> terminated by signal 9, even if the segfault was logged.
Thanks for verifying. That's what I thought, and I had just entirely
forgotten about the logging of failed page faults.
This whole "fatal signals during IO can also cause a failed page fault"
situation has been true for a long, long time, but because that case is
handled later, by the actual VM code, we end up going through
"fault_signal_pending()" there and suppress the logging of the page fault
failure that way.
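(For reference, that later check is the fault_signal_pending() helper the
architecture fault handlers call right after handle_mm_fault() - roughly,
paraphrasing include/linux/sched/signal.h rather than quoting it:

	static inline bool fault_signal_pending(vm_fault_t fault_flags,
						struct pt_regs *regs)
	{
		return unlikely((fault_flags & VM_FAULT_RETRY) &&
				(fatal_signal_pending(current) ||
				 (user_mode(regs) && signal_pending(current))));
	}

and when it returns true the x86 handler just returns, or does the
kernel-mode fixup, without ever reaching the bad_area logging path.)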
> > But before we revert it, would you mind trying out the attached
> > trivial patch instead?
>
> The patch works for me too :) (after adding the missing tsk argument
> like Thomas pointed out)
So it turns out that not only did I forget the argument, I also decided
that I had put that test for fatal signals in the wrong place.
The patch obviously does fix the problem on x86, and we could do the
same thing for all the other architectures that do this signal
logging.
But there's actually a much better place to put the fatal signal
check, which will take care of all architectures: just do it in the
'unhandled_signal()' function.
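(i.e. roughly along these lines in kernel/signal.c - a paraphrase of the
idea rather than the exact diff, see the commit below for the real thing:

	int unhandled_signal(struct task_struct *tsk, int sig)
	{
		if (is_global_init(tsk))
			return 1;

		if (sig_handler(tsk, sig) != SIG_DFL)
			return 0;

		/* If dying, we handle all new signals by ignoring them */
		if (fatal_signal_pending(tsk))
			return 0;

		/* if ptraced, let the tracer determine */
		return !tsk->ptrace;
	}

so every architecture that checks unhandled_signal() before logging gets the
suppression for free.)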
So I fixed the missing argument, and moved the test to a different
place, but I still added your (and Thomas') "Tested-by:" even if you
ended up testing something that was a bit different.
Oleg, I took your Acked-by too. Despite the final patch being somewhat
different. Holler if you see something objectionable.
It's commit 5f0bc0b042fc ("mm: suppress mm fault logging if fatal
signal already pending") in my tree now.
And because it's a bit different from what you already tested, it
would be lovely to just get a confirmation that I didn't screw
anything up when I decided I needed to make a fix that covers more
than just x86.
Thanks,
Linus
Am 26.07.23 um 19:59 schrieb Linus Torvalds:
>
> It's commit 5f0bc0b042fc ("mm: suppress mm fault logging if fatal
> signal already pending") in my tree now.
>
> And because it's a bit different from what you already tested, it
> would be lovely to just get a confirmation that I didn't screw
> anything up when I decided I needed to make a fix that covers more
> than just x86.
>
I can confirm this still avoids the misleading logging for me :)
Thank you for the fix!
Best Regards,
Fiona