2021-06-10 21:00:04

by Eric W. Biederman

Subject: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads


Folks,

Digging through the guts of exit I found something I am not quite
certain what to do with. On some architectures, such as alpha, m68k,
and nios2, the kernel enters system calls with only a subset of the
registers saved on the kernel stack, while it enters signal handling
and a few other contexts with all of the registers saved on the kernel
stack. The problem is that sometimes we read all of the registers from
a context where not all of them were saved.

When this was initially observed it looked just like a coredump problem
and it could be solved by tweaking the coredump code. That change was
77f6ab8b7768 ("don't dump the threads that had been already exiting when
zapped.")

However I have looked further, and we have the location where get_signal
is called from io_uring, and we have the ptrace_stop in
PTRACE_EVENT_EXIT. In PTRACE_EVENT_EXIT we could be called from exit(2),
which is a syscall, and we definitely won't have everything saved on the
kernel stack. I have not double-checked create_io_thread, but I don't
think create_io_thread saves all of the registers on the kernel stack.

I think at this point we need to say that the architectures that
do this need to be fixed to at least call do_exit and the kernel
function in create_io_thread with the deeper stack.

Is that reasonable of me to ask? Is there some other way to deal with
this issue that I am not seeing? Am I missing some critical detail that
makes PTRACE_EVENT_EXIT in do_exit not a problem if someone reads the
registers with ptrace?

Eric


2021-06-10 22:15:30

by Linus Torvalds

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <[email protected]> wrote:
>
> The problem is sometimes we read all of the registers from
> a context where they are not all saved.

Ouch. Yes. And this is really painful because none of the *normal*
architectures do this, so it gets absolutely no coverage.

> I think at this point we need to say that the architectures that
> do this need to be fixed to at least call do_exit and the kernel
> function in create_io_thread with the deeper stack.

Yeah. We traditionally have that requirement for fork() and friends
too (vfork/clone), so making exit and io_uring do the same seems like
the most straightforward thing.

But I really wish we had some way to test and trigger this so that we
wouldn't get caught on this before. Something in task_pt_regs() that
catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
affected architectures?

Linus

2021-06-11 21:42:04

by Eric W. Biederman

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Thu, Jun 10, 2021 at 1:58 PM Eric W. Biederman <[email protected]> wrote:
>>
>> The problem is sometimes we read all of the registers from
>> a context where they are not all saved.
>
> Ouch. Yes. And this is really painful because none of the *normal*
> architectures do this, so it gets absolutely no coverage.
>
>> I think at this point we need to say that the architectures that
>> do this need to be fixed to at least call do_exit and the kernel
>> function in create_io_thread with the deeper stack.
>
> Yeah. We traditionally have that requirement for fork() and friends
> too (vfork/clone), so adding exit and io_uring to do so seems like the
> most straightforward thing.

Interesting. I am starting with Al's analysis and reading the code
to see if I can understand what is going on. So I am still glossing
over a few details as I dig into this. Kernel threads not having
all of their registers saved is one of those details.

Looking at copy_thread, it appears that at least on alpha we are
dealing with a structure that defines all of the registers. So
perhaps all of the registers are already there for kernel threads. I
don't read alpha assembly very well and fork is a bit subtle, so I
don't know which piece of code is calling
ret_from_fork/ret_from_kernel_thread.

I really suspect that all of those registers are popped, so at least
for IO threads we need to push them again, in a way that
signal_pt_regs() can find them.

It looks like we just need something like this to cover the userspace
side of exit.

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..ab0dcb545bd1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -812,6 +812,22 @@ fork_like fork
fork_like vfork
fork_like clone

+.macro exit_like name
+ .align 4
+ .globl alpha_\name
+ .ent alpha_\name
+alpha_\name:
+ .prologue 0
+ DO_SWITCH_STACK
+ jsr $26, sys_\name
+ UNDO_SWITCH_STACK
+ ret
+.end alpha_\name
+.endm
+
+exit_like exit
+exit_like exit_group
+
.macro sigreturn_like name
.align 4
.globl sys_\name
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..b9d6449d6caa 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
# The <abi> is always "common" for this file
#
0 common osf_syscall alpha_syscall_zero
-1 common exit sys_exit
+1 common exit alpha_exit
2 common fork alpha_fork
3 common read sys_read
4 common write sys_write
@@ -333,7 +333,7 @@
400 common io_getevents sys_io_getevents
401 common io_submit sys_io_submit
402 common io_cancel sys_io_cancel
-405 common exit_group sys_exit_group
+405 common exit_group alpha_exit_group
406 common lookup_dcookie sys_lookup_dcookie
407 common epoll_create sys_epoll_create
408 common epoll_ctl sys_epoll_ctl


> But I really wish we had some way to test and trigger this so that we
> wouldn't get caught on this before. Something in task_pt_regs() that
> catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
> affected architectures?

I think that would require pushing an extra magic value in SWITCH_STACK
and not just popping it but deliberately changing that value in
UNDO_SWITCH_STACK. Basically stack canaries.

I don't see how we could do it in an arch independent way though.
Which means it will require auditing all of the architectures to get
there. Volunteers?

This is looking straightforward enough that I can probably pull
something together; just don't count on me to have it done in anything
resembling a timely manner.

Eric

2021-06-11 23:28:41

by Linus Torvalds

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <[email protected]> wrote:
>
> Looking at copy_thread it looks like at least on alpha we are dealing
> with a structure that defines all of the registers in copy_thread.

On the target side, yes.

On the _source_ side, the code does

struct pt_regs *regs = current_pt_regs();

and that's the part that means that fork() and related functions need
to have done that DO_SWITCH_STACK(), so that they have the full
register set to be copied.

Otherwise it would copy random contents from the source stack.

But that

if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {

ends up protecting us, and the code never uses that set of source
registers for the io worker threads.

So io_uring looks fine on alpha. I didn't check m68k and friends, but
I think they have the same thing going.

> It looks like we just need something like this to cover the userspace
> side of exit.

Looks correct to me. Except I think you could just use "fork_like()"
instead of creating a new (and identical) "exit_like()" macro.

> > But I really wish we had some way to test and trigger this so that we
> > wouldn't get caught on this before. Something in task_pt_regs() that
> > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
> > affected architectures?
>
> I think that would require pushing an extra magic value in SWITCH_STACK
> and not just popping it but deliberately changing that value in
> UNDO_SWITCH_STACK. Basically stack canaries.
>
> I don't see how we could do it in an arch independent way though.

No, I think you're right. There's no obvious generic solution to it,
and once we look at arch-specific ones we're back to "just alpha,
m68k and nios2 need this or care", and once you're there you might as
well just fix it.

ia64 has some "fast system call" model with limited registers too, but
I think that's limited to just a few very special system calls (ie it
does the reverse of what alpha does: alpha does the fast case by
default, and then marks fork/vfork/clone as special).

Linus

2021-06-13 22:00:11

by Eric W. Biederman

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Fri, Jun 11, 2021 at 2:40 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Looking at copy_thread it looks like at least on alpha we are dealing
>> with a structure that defines all of the registers in copy_thread.
>
> On the target side, yes.
>
> On the _source_ side, the code does
>
> struct pt_regs *regs = current_pt_regs();
>
> and that's the part that means that fork() and related functions need
> to have done that DO_SWITCH_STACK(), so that they have the full
> register set to be copied.
>
> Otherwise it would copy random contents from the source stack.
>
> But that
>
> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>
> ends up protecting us, and the code never uses that set of source
> registers for the io worker threads.

That test is in copy_thread. It isn't the case I am worried about.

> So io_uring looks fine on alpha. I didn't check m68k and friends, but
> I think they have the same thing going.

As I have read through the code more I don't think so.

The code paths I am worried about are:

ret_from_kernel_thread
io_wqe_worker
get_signal
do_coredump
ptrace_stop

ret_from_kernel_thread
io_sq_thread
get_signal
do_coredump
ptrace_stop


As I understand the code the new thread created by create_thread
initially has a full complement of registers, and then is started
by alpha_switch_to:

.align 4
.globl alpha_switch_to
.type alpha_switch_to, @function
.cfi_startproc
alpha_switch_to:
DO_SWITCH_STACK
call_pal PAL_swpctx
lda $8, 0x3fff
UNDO_SWITCH_STACK
bic $sp, $8, $8
mov $17, $0
ret
.cfi_endproc
.size alpha_switch_to, .-alpha_switch_to


alpha_switch_to will remove the extra registers from the stack and
then execute ret, which, if I understand alpha assembly correctly, is
equivalent to jumping to where $26 points. That is
ret_from_kernel_thread (as set up by copy_thread).

Which leaves ret_from_kernel_thread and everything it calls without
the extra context saved on the stack.

I am still trying to understand how we get the registers populated at
a fixed offset on the stack during schedule, as it looks like
switch_to assumes the stack pointer is in the proper location.

>> It looks like we just need something like this to cover the userspace
>> side of exit.
>
> Looks correct to me. Except I think you could just use "fork_like()"
> instead of creating a new (and identical) "exit_like()" macro.
>
>> > But I really wish we had some way to test and trigger this so that we
>> > wouldn't get caught on this before. Something in task_pt_regs() that
>> > catches "this doesn't actually work" and does a WARN_ON_ONCE() on the
>> > affected architectures?
>>
>> I think that would require pushing an extra magic value in SWITCH_STACK
>> and not just popping it but deliberately changing that value in
>> UNDO_SWITCH_STACK. Basically stack canaries.
>>
>> I don't see how we could do it in an arch independent way though.
>
> No, I think you're right. There's no obvious generic solution to it,
> and once we look at arch-specific ones we're back to "just alpha,
> m68k and nios2 need this or care", and once you're there you might as
> well just fix it.
>
> ia64 has some "fast system call" model with limited registers too, but
> I think that's limited to just a few very special system calls (ie it
> does the reverse of what alpha does: alpha does the fast case by
> default, and then marks fork/vfork/clone as special).

I wonder if the arch specific solution should be to move the registers
to a fixed location in task_struct (perhaps thread_struct) so that the
same pattern can apply across all architectures and we don't get
surprises at all.

What appears to be unique about alpha, m68k, and nios2 is that
space is not always reserved for all of the registers, so we can't
always count on them being saved after a task switch.

Eric

2021-06-13 22:29:39

by Linus Torvalds

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <[email protected]> wrote:
>
> The alpha_switch_to will remove the extra registers from the stack and
> then call ret which if I understand alpha assembly correctly is
> equivalent to jumping to where $26 points. Which is
> ret_from_kernel_thread (as setup by copy_thread).
>
> Which leaves ret_from_kernel_thread and everything it calls without
> the extra context saved on the stack.

Uhhuh. Right you are, I think. It's been ages since I worked on that
code and my alpha handbook is somewhere else, but yes, when
alpha_switch_to() has context-switched to the new PCB state, it will
then pop those registers in the new context and return.

So we do set up the right stack frame for the worker thread, but as
you point out, it then gets used up immediately when running. So by
the time the IO worker thread calls get_signal(), it's no longer
useful.

How very annoying.

The (obviously UNTESTED) patch might be something like the attached.

I wouldn't be surprised if m68k has the exact same thing for the exact
same reason, but I didn't check..

Linus


Attachments:
patch.diff (982.00 B)

2021-06-14 02:08:59

by Michael Schmitz

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Linus,

On 14/06/21 10:18 am, Linus Torvalds wrote:
> On Sun, Jun 13, 2021 at 2:55 PM Eric W. Biederman <[email protected]> wrote:
>> The alpha_switch_to will remove the extra registers from the stack and
>> then call ret which if I understand alpha assembly correctly is
>> equivalent to jumping to where $26 points. Which is
>> ret_from_kernel_thread (as setup by copy_thread).
>>
>> Which leaves ret_from_kernel_thread and everything it calls without
>> the extra context saved on the stack.
> Uhhuh. Right you are, I think. It's been ages since I worked on that
> code and my alpha handbook is somewhere else, but yes, when
> alpha_switch_to() has context-switched to the new PCB state, it will
> then pop those registers in the new context and return.
>
> So we do set up the right stack frame for the worker thread, but as
> you point out, it then gets used up immediately when running. So by
> the time the IO worker thread calls get_signal(), it's no longer
> useful.
>
> How very annoying.
>
> The (obviously UNTESTED) patch might be something like the attached.
>
> I wouldn't be surprised if m68k has the exact same thing for the exact
> same reason, but I didn't check..

m68k is indeed similar, it has:

       if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
                /* kernel thread */
                memset(frame, 0, sizeof(struct fork_frame));
                frame->regs.sr = PS_S;
                frame->sw.a3 = usp; /* function */
                frame->sw.d7 = arg;
                frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
                p->thread.usp = 0;
                return 0;
        }

so a similar patch should be possible.

Cheers,

    Michael



>
> Linus

2021-06-14 05:10:13

by Michael Schmitz

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On second thought, I'm not certain what adding another empty stack frame
would achieve here.

On m68k, 'frame' already is a new stack frame, for running the new
thread in. This new frame does not have any user context at all, and
it's explicitly wiped anyway.

Unless we save all user context on the stack, then push that context to
a new save frame, and somehow point get_signal to look there for IO
threads (essentially what Eric suggested), I don't see how this could work?

I must be missing something.

Cheers,

Michael Schmitz

On 14.06.2021 at 14:05, Michael Schmitz wrote:
>>
>> I wouldn't be surprised if m68k has the exact same thing for the exact
>> same reason, but I didn't check..
>
> m68k is indeed similar, it has:
>
> if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
>         /* kernel thread */
>         memset(frame, 0, sizeof(struct fork_frame));
>         frame->regs.sr = PS_S;
>         frame->sw.a3 = usp; /* function */
>         frame->sw.d7 = arg;
>         frame->sw.retpc = (unsigned long)ret_from_kernel_thread;
>         p->thread.usp = 0;
>         return 0;
> }
>
> so a similar patch should be possible.
>
> Cheers,
>
> Michael
>
>
>
>>
>> Linus

2021-06-14 16:28:39

by Eric W. Biederman

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Michael Schmitz <[email protected]> writes:

> On second thought, I'm not certain what adding another empty stack frame would
> achieve here.
>
> On m68k, 'frame' already is a new stack frame, for running the new thread
> in. This new frame does not have any user context at all, and it's explicitly
> wiped anyway.
>
> Unless we save all user context on the stack, then push that context to a new
> save frame, and somehow point get_signal to look there for IO threads
> (essentially what Eric suggested), I don't see how this could work?
>
> I must be missing something.

It is only designed to work well enough that ptrace will access
something well defined when it accesses io_uring tasks.

The io_uring tasks are special in that they are user process
threads that never run in userspace. So as long as everything
ptrace can read is accessible on that process all is well.

Having stared a bit longer at the code I think the short term
fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall.
Which out of necessity ensures that user context is always readable.
Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring
problems.

What I especially like about that is there are a lot of other reasons
to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction.
I think the biggies are getting architectures to store the extra
saved state on context switch into some place in task_struct
and to implement the regset view of registers.

Hmm. This is odd. CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
CORE_DUMP_USE_REGSET. But alpha, csky, h8300, m68k, microblaze, and
nds32 don't implement CORE_DUMP_USE_REGSET, even though nds32
implements CONFIG_HAVE_ARCH_TRACEHOOK.

I will keep digging and see what clean code I can come up with.

Eric

2021-06-14 22:30:28

by Michael Schmitz

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Eric,

On 15/06/21 4:26 am, Eric W. Biederman wrote:
> Michael Schmitz <[email protected]> writes:
>
>> On second thought, I'm not certain what adding another empty stack frame would
>> achieve here.
>>
>> On m68k, 'frame' already is a new stack frame, for running the new thread
>> in. This new frame does not have any user context at all, and it's explicitly
>> wiped anyway.
>>
>> Unless we save all user context on the stack, then push that context to a new
>> save frame, and somehow point get_signal to look there for IO threads
>> (essentially what Eric suggested), I don't see how this could work?
>>
>> I must be missing something.
> It is only designed to work well enough so that ptrace will access
> something well defined when ptrace accesses io_uring tasks.
>
> The io_uring tasks are special in that they are user process
> threads that never run in userspace. So as long as everything
> ptrace can read is accessible on that process all is well.
OK, I'm testing a patch that would save extra context in
sys_io_uring_setup, which ought to ensure that for m68k.
> Having stared a bit longer at the code I think the short term
> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
> them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Fair enough :-)

Cheers,

    Michael

>
> Today CONFIG_HAVE_ARCH_TRACEHOOK guards access to /proc/self/syscall.
> Which out of necessity ensures that user context is always readable.
> Which seems to solve both the PTRACE_EVENT_EXIT and the io_uring
> problems.
>
> What I especially like about that is there are a lot of other reasons
> to encourage architectures in a CONFIG_HAVE_ARCH_TRACEHOOK direction.
> I think the biggies are getting architectures to store the extra
> saved state on context switch into some place in task_struct
> and to implement the regset view of registers.
>
> Hmm. This is odd. CONFIG_HAVE_ARCH_TRACEHOOK is supposed to imply
> CORE_DUMP_USE_REGSET. But alpha, csky, h8300, m68k, microblaze, and
> nds32 don't implement CORE_DUMP_USE_REGSET, even though nds32
> implements CONFIG_HAVE_ARCH_TRACEHOOK.
>
> I will keep digging and see what clean code I can come up with.
>
> Eric

2021-06-15 19:33:17

by Eric W. Biederman

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Michael Schmitz <[email protected]> writes:

> Hi Eric,
>
> On 15/06/21 4:26 am, Eric W. Biederman wrote:
>> Michael Schmitz <[email protected]> writes:
>>
>>> On second thought, I'm not certain what adding another empty stack frame would
>>> achieve here.
>>>
>>> On m68k, 'frame' already is a new stack frame, for running the new thread
>>> in. This new frame does not have any user context at all, and it's explicitly
>>> wiped anyway.
>>>
>>> Unless we save all user context on the stack, then push that context to a new
>>> save frame, and somehow point get_signal to look there for IO threads
>>> (essentially what Eric suggested), I don't see how this could work?
>>>
>>> I must be missing something.
>> It is only designed to work well enough so that ptrace will access
>> something well defined when ptrace accesses io_uring tasks.
>>
>> The io_uring tasks are special in that they are user process
>> threads that never run in userspace. So as long as everything
>> ptrace can read is accessible on that process all is well.
> OK, I'm testing a patch that would save extra context in sys_io_uring_setup,
> which ought to ensure that for m68k.

I had to update ret_from_kernel_thread to pop that state to get Linus's
change to boot. Apparently kernel threads exiting need to be handled.

>> Having stared a bit longer at the code I think the short term
>> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
>> them both with CONFIG_HAVE_ARCH_TRACEHOOK.

Which does not work, because nios2, which looks susceptible,
sets CONFIG_HAVE_ARCH_TRACEHOOK.

A further look shows that there is also PTRACE_EVENT_EXEC that
needs to be handled, so execve and execveat need to be wrapped
as well.

Do you happen to know if there is userspace that will run
in qemu-system-m68k that can be used for testing?

Eric

2021-06-15 19:40:17

by Eric W. Biederman

Subject: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads


While thinking about the information leaks fixed in 77f6ab8b7768
("don't dump the threads that had been already exiting when zapped.")
I realized the problem is much more general than just coredumps and
exit_mm. We have io_uring threads, PTRACE_EVENT_EXEC and
PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
registers, but on some architectures has not saved them.

The function alpha_switch_to does something reasonable: it saves the
floating point registers and the callee-saved registers and switches
to a different thread. Any register its caller is expected to have
saved, it does not save.

Meanwhile the system call entry point on alpha also does something
reasonable: it saves all but the callee-saved integer registers, and
doesn't touch the floating point registers, as the kernel code does
not touch them.

This is a nice happy fast path until the kernel wants to access user
space's registers through ptrace or similar. As user space's
callee-saved registers may be spilled at an unpredictable point in the
kernel code's stack, any routine that may stop and make the userspace
registers available must be wrapped by code that will first save a
switch stack frame at the bottom of the call stack, call the code that
may access those registers, and then pop the switch stack frame.

The practical problem with this code structure is that it results in
a game of whack-a-mole wrapping different kernel system calls. Losing
the game of whack-a-mole results in a security hole where userspace can
write arbitrary data to the kernel stack.

I looked, and there is nothing I can do that is not arch specific, so
whack the moles with a minimal backportable fix.

This change survives boot testing on qemu-system-alpha.

Cc: [email protected]
Inspired-by: Linus Torvalds <[email protected]>
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/entry.S | 21 +++++++++++++++++++++
arch/alpha/kernel/process.c | 11 ++++++++++-
arch/alpha/kernel/syscalls/syscall.tbl | 8 ++++----
3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index e227f3a29a43..98bb5b805089 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -785,6 +785,7 @@ ret_from_kernel_thread:
mov $9, $27
mov $10, $16
jsr $26, ($9)
+ lda $sp, SWITCH_STACK_SIZE($sp)
br $31, ret_to_user
.end ret_from_kernel_thread

@@ -811,6 +812,26 @@ alpha_\name:
fork_like fork
fork_like vfork
fork_like clone
+fork_like exit
+fork_like exit_group
+
+.macro exec_like name
+ .align 4
+ .globl alpha_\name
+ .ent alpha_\name
+ .cfi_startproc
+alpha_\name:
+ .prologue 0
+ DO_SWITCH_STACK
+ jsr $26, sys_\name
+ UNDO_SWITCH_STACK
+ ret
+ .cfi_endproc
+.end alpha_\name
+.endm
+
+exec_like execve
+exec_like execveat

.macro sigreturn_like name
.align 4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..edbfe03f4b2c 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -251,8 +251,17 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,

if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
/* kernel thread */
+ /*
+ * Give it *two* switch stacks, one for the kernel
+ * state return that is used up by alpha_switch_to,
+ * and one for the "user state" which is accessed
+ * by ptrace.
+ */
+ childstack--;
+ childti->pcb.ksp = (unsigned long) childstack;
+
memset(childstack, 0,
- sizeof(struct switch_stack) + sizeof(struct pt_regs));
+ 2*sizeof(struct switch_stack) + sizeof(struct pt_regs));
childstack->r26 = (unsigned long) ret_from_kernel_thread;
childstack->r9 = usp; /* function */
childstack->r10 = kthread_arg;
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
# The <abi> is always "common" for this file
#
0 common osf_syscall alpha_syscall_zero
-1 common exit sys_exit
+1 common exit alpha_exit
2 common fork alpha_fork
3 common read sys_read
4 common write sys_write
@@ -65,7 +65,7 @@
56 common osf_revoke sys_ni_syscall
57 common symlink sys_symlink
58 common readlink sys_readlink
-59 common execve sys_execve
+59 common execve alpha_execve
60 common umask sys_umask
61 common chroot sys_chroot
62 common osf_old_fstat sys_ni_syscall
@@ -333,7 +333,7 @@
400 common io_getevents sys_io_getevents
401 common io_submit sys_io_submit
402 common io_cancel sys_io_cancel
-405 common exit_group sys_exit_group
+405 common exit_group alpha_exit_group
406 common lookup_dcookie sys_lookup_dcookie
407 common epoll_create sys_epoll_create
408 common epoll_ctl sys_epoll_ctl
@@ -441,7 +441,7 @@
510 common renameat2 sys_renameat2
511 common getrandom sys_getrandom
512 common memfd_create sys_memfd_create
-513 common execveat sys_execveat
+513 common execveat alpha_execveat
514 common seccomp sys_seccomp
515 common bpf sys_bpf
516 common userfaultfd sys_userfaultfd
--
2.20.1

2021-06-15 20:59:27

by Michael Schmitz

Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Eric,

On 16/06/21 7:30 am, Eric W. Biederman wrote:
>
>>> The io_uring tasks are special in that they are user process
>>> threads that never run in userspace. So as long as everything
>>> ptrace can read is accessible on that process all is well.
>> OK, I'm testing a patch that would save extra context in sys_io_uring_setup,
>> which ought to ensure that for m68k.
> I had to update ret_from_kernel_thread to pop that state to get Linus's
> change to boot. Apparently kernel_threads exiting needs to be handled.

Hadn't yet got to that stage, sorry. Still stress testing stage 1 of my
fix (push complete context). I would have thought that this should be
sufficient (gives us a complete stack frame for ptrace code to work on)?

But it makes sense that when you push an extra stack frame, you'd need
to pop that on exit.

>
>>> Having stared a bit longer at the code I think the short term
>>> fix for both of PTRACE_EVENT_EXIT and io_uring is to guard
>>> them both with CONFIG_HAVE_ARCH_TRACEHOOK.
> Which does not work because nios2 which looks susceptible
> sets CONFIG_HAVE_ARCH_TRACEHOOK.
>
> A further look shows that there is also PTRACE_EVENT_EXEC that
> needs to be handled so execve and execveat need to be wrapped
> as well.
>
> Do you happen to know if there is userspace that will run
> in qemu-system-m68k that can be used for testing?

I surmise so. I don't use qemu myself - either ARAnyM, or actual
hardware. Hardware is limited to 14 MB RAM, which has prevented me from
using more than simple regression testing. In particular, I can't test
sys_io_uring_setup there.

Adrian uses qemu a lot, and has supplied disk images to work from on
occasion. Maybe he's got something recent enough to support
sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more
testing as well.

Cheers,

    Michael

>
> Eric
>

2021-06-15 22:05:08

by Linus Torvalds

Subject: Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads

On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman
<[email protected]> wrote:
>
> I looked, and there is nothing I can do that is not arch specific, so
> whack the moles with a minimal backportable fix.
>
> This change survives boot testing on qemu-system-alpha.

So as mentioned in the other thread, I think this patch is exactly right.

However, the need for this part

> @@ -785,6 +785,7 @@ ret_from_kernel_thread:
> mov $9, $27
> mov $10, $16
> jsr $26, ($9)
> + lda $sp, SWITCH_STACK_SIZE($sp)
> br $31, ret_to_user
> .end ret_from_kernel_thread

obviously eluded me in my "how about something like this", and I had
to really try to figure out why we'd ever return.

Which is why I came to that "oooh - kernel_execve()" realization.

It might be good to comment on that somewhere. And if you can think of
some other case, that should be mentioned too.

Anyway, thanks for looking into this odd case. And if you have a
test-case for this all, it really would be a good thing. Yes, it
should only affect a couple of odd-ball architectures, but still... It
would also be good to hear that you actually did verify the behavior
of this patch wrt that ptrace-of-io-worker-threads case..

Linus


2021-06-15 22:09:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman
<[email protected]> wrote:
>
> I had to update ret_from_kernel_thread to pop that state to get Linus's
> change to boot. Apparently kernel_threads exiting needs to be handled.

You are very right.

That, btw, seems to be a horrible design mistake, but I think it's how
"kernel_execve()" works - both for the initial "init", but also for
user-mode helper processes.

Both of those cases do "kernel_thread()" to create a new thread, and
then that new kernel thread does kernel_execve() to create the user
space image for that thread. And that act of "execve()" clears
PF_KTHREAD from the thread, and then the final return from the kernel
thread function returns to that new user space.

Or something like that. It's been ages since I looked at that code,
and your patch initially confused the heck out of me because I went
"that can't _possibly_ be needed".

But yes, I think your patch is right.

And I think our horrible "kernel threads return to user space when
done" is absolutely horrifically nasty. Maybe of the clever sort, but
mostly of the historical horror sort.

Or am I mis-remembering how this ends up working? Did you look at
exactly what it was that returned from kernel threads?

This might be worth commenting on somewhere. But your patch for alpha
looks correct to me. Did you have some test-case to verify ptrace() on
io worker threads?

Linus

2021-06-16 00:27:39

by Finn Thain

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Wed, 16 Jun 2021, Michael Schmitz wrote:

> >
> > Do you happen to know if there is userspace that will run in
> > qemu-system-m68k that can be used for testing?
>
> I surmise so. I don't use qemu myself - either ARAnyM, or actual
> hardware. Hardware is limited to 14 MB RAM, which has prevented me from
> using more than simple regression testing. In particular, I can't test
> sys_io_uring_setup there.
>
> Adrian uses qemu a lot, and has supplied disk images to work from on
> occasion. Maybe he's got something recent enough to support
> sys_io_uring_setup ... I've CC:ed him in, as I'd love to do some more
> testing as well.
>

As well as Debian/m68k, there is also a Gentoo/m68k stage3 rootfs
available here:
https://sourceforge.net/projects/linux-mac68k/files/Gentoo%20m68k%20unauthorized/

I built that rootfs last year using Catalyst. Some background (including
the qemu-system-m68k command-line) can be found here:
https://forums.gentoo.org/viewtopic-t-1100780.html

2021-06-16 07:40:42

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Eric,

On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <[email protected]> wrote:
> Do you happen to know if there is userspace that will run
> in qemu-system-m68k that can be used for testing?

There's a link to an image in Laurent's patch series "[PATCH 0/2]
m68k: Add Virtual M68k Machine"
https://lore.kernel.org/linux-m68k/[email protected]/

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-06-16 20:29:02

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Tue, Jun 15, 2021 at 12:32 PM Eric W. Biederman
> <[email protected]> wrote:
>>
>> I had to update ret_from_kernel_thread to pop that state to get Linus's
>> change to boot. Apparently kernel_threads exiting needs to be handled.
>
> You are very right.
>
> That, btw, seems to be a horrible design mistake, but I think it's how
> "kernel_execve()" works - both for the initial "init", but also for
> user-mode helper processes.
>
> Both of those cases do "kernel_thread()" to create a new thread, and
> then that new kernel thread does kernel_execve() to create the user
> space image for that thread. And that act of "execve()" clears
> PF_KTHREAD from the thread, and then the final return from the kernel
> thread function returns to that new user space.
>
> Or something like that. It's been ages since I looked at that code,
> and your patch initially confused the heck out of me because I went
> "that can't _possibly_ be needed".
>
> But yes, I think your patch is right.
>
> And I think our horrible "kernel threads return to user space when
> done" is absolutely horrifically nasty. Maybe of the clever sort, but
> mostly of the historical horror sort.
>
> Or am I mis-remembering how this ends up working? Did you look at
> exactly what it was that returned from kernel threads?
>
> This might be worth commenting on somewhere. But your patch for alpha
> looks correct to me. Did you have some test-case to verify ptrace() on
> io worker threads?

At this point I just booted an alpha image on qemu-system-alpha.

I do have gdb in my kernel image so I can test that. I haven't yet but
I can and should.

Sleeping on it I came up with a plan to add TF_SWITCH_STACK_SAVED to
indicate if the registers have been saved. The DO_SWITCH_STACK and
UNDO_SWITCH_STACK helpers (except in alpha_switch_to) can test that.
The ptrace helpers can test that and turn an access of random kernel
stack contents into something well behaved that does WARN_ON_ONCE
because we should not get there.

I suspect adding TF_SWITCH_STACK_SAVED should come first so it
is easy to verify the problem behavior, before I fix it.

My real goal is to find a pattern that architectures whose register
saves are structured like alpha's can emulate, to minimize problems in
the future.

Plus I would really like to get the last handful of architectures
updated so we can remove CONFIG_HAVE_ARCH_TRACEHOOK. I think we can
do that on alpha because we save all of the system call arguments
in pt_regs and that is all the other non-ptrace code paths care about.

AKA I am trying to move the old architectures forward so we can get rid
of unnecessary complications in the core code.

Eric

2021-06-16 23:56:52

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads

Linus Torvalds <[email protected]> writes:

> On Tue, Jun 15, 2021 at 12:36 PM Eric W. Biederman
> <[email protected]> wrote:
>>
>> I looked and there is nothing I can do that is not arch specific, so
>> whack the moles with a minimal backportable fix.
>>
>> This change survives boot testing on qemu-system-alpha.
>
> So as mentioned in the other thread, I think this patch is exactly right.
>
> However, the need for this part
>
>> @@ -785,6 +785,7 @@ ret_from_kernel_thread:
>> mov $9, $27
>> mov $10, $16
>> jsr $26, ($9)
>> + lda $sp, SWITCH_STACK_SIZE($sp)
>> br $31, ret_to_user
>> .end ret_from_kernel_thread
>
> obviously eluded me in my "how about something like this", and I had
> to really try to figure out why we'd ever return.
>
> Which is why I came to that "oooh - kernel_execve()" realization.
>
> It might be good to comment on that somewhere. And if you can think of
> some other case, that should be mentioned too.
>
> Anyway, thanks for looking into this odd case. And if you have a
> test-case for this all, it really would be a good thing. Yes, it
> should only affect a couple of odd-ball architectures, but still... It
> would also be good to hear that you actually did verify the behavior
> of this patch wrt that ptrace-of-io-worker-threads case..

*Grumble*

So just going through and looking to see what it takes to instrument
and put in warnings when things go wrong I have found another issue.

Today there exists:
PTRACE_EVENT_FORK
PTRACE_EVENT_VFORK
PTRACE_EVENT_CLONE

Which happens after the actual fork operation in the kernel.

The following code wraps those operations in arch/alpha/kernel/entry.S

.macro fork_like name
.align 4
.globl alpha_\name
.ent alpha_\name
alpha_\name:
.prologue 0
bsr $1, do_switch_stack
jsr $26, sys_\name
ldq $26, 56($sp)
lda $sp, SWITCH_STACK_SIZE($sp)
ret
.end alpha_\name
.endm

The kernel code in fork.c calls ptrace_event_pid, which ultimately
calls ptrace_stop. So userspace can reasonably expect to stop the
process and change its registers.

With unconditionally popping the switch stack any of those registers
that are modified are lost.

So I will update my changes to handle that case as well.


Eric

2021-06-17 01:58:33

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Geert,

On 16/06/21 7:38 pm, Geert Uytterhoeven wrote:
> Hi Eric,
>
> On Tue, Jun 15, 2021 at 9:32 PM Eric W. Biederman <[email protected]> wrote:
>> Do you happen to know if there is userspace that will run
>> in qemu-system-m68k that can be used for testing?
> There's a link to an image in Laurent's patch series "[PATCH 0/2]
> m68k: Add Virtual M68k Machine"
> https://lore.kernel.org/linux-m68k/[email protected]/

Thanks, I'll try that one.

I'll try and implement a few of the solutions Eric came up with for
alpha, unless someone beats me to it (Andreas?).

Cheers,

    Michael


>
> Gr{oetje,eeting}s,
>
> Geert
>

2021-06-17 02:23:15

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] alpha: Add extra switch_stack frames in exit, exec, and kernel threads

On Tue, Jun 15, 2021 at 02:36:38PM -0500, Eric W. Biederman wrote:
>
> While thinking about the information leaks fixed in 77f6ab8b7768
> ("don't dump the threads that had been already exiting when zapped.")
> I realized the problem is much more general than just coredumps and
> exit_mm. We have io_uring threads, PTRACE_EVENT_EXEC and
> PTRACE_EVENT_EXIT where ptrace is allowed to access userspace
> registers, but on some architectures has not saved them.

Wait a sec. To have anything happen on PTRACE_EVENT_EXEC, you need the
fucker traced. *IF* you want to go that way, at least make it conditional
upon the same thing.

2021-06-17 03:12:08

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/2] alpha/ptrace: Improved switch_stack handling


This pair of changes has not received anything beyond build and boot
testing. I am posting these changes as they do a much better job of
warning of problems and shutting down the security hole, making them
a much better pattern than my last patch.

I hope to get the test cases soon.

arch/alpha/include/asm/thread_info.h | 2 ++
arch/alpha/kernel/entry.S | 62 ++++++++++++++++++++++++++--------
arch/alpha/kernel/process.c | 3 ++
arch/alpha/kernel/ptrace.c | 13 +++++--
arch/alpha/kernel/syscalls/syscall.tbl | 8 ++---
5 files changed, 67 insertions(+), 21 deletions(-)

Eric W. Biederman (2):
alpha/ptrace: Record and handle the absence of switch_stack
alpha/ptrace: Add missing switch_stack frames

2021-06-17 03:12:12

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/2] alpha/ptrace: Add missing switch_stack frames


With the introduction of PTRACE_EVENT_XXX flags during the 2.5
development cycle it became possible for ptrace to write arbitrary
data to alpha kernel stack frames because it was assumed that wherever
ptrace_stop was called both a pt_regs and a switch_stack were saved
upon the stack.

The introduction of TIF_ALLREGS_SAVED removed the assumption that
switch_stack was saved on the kernel stack, transforming the
problem into a lesser bug where the accesses simply don't work.

Saving struct switch_stack has to happen on the lowest level of the
stack on alpha because it contains caller saved registers, which will
be saved by the C code in arbitrary locations on the stack if the data
is not saved immediately.

Update kernel threads to save a full set of userspace registers on
the stack so that io_uring threads can be ptraced.

Update fork, vfork, clone, exit, exit_group, execve, and execveat to
save all of the userspace registers when they are called, as there are
known PTRACE_EVENT_XXX ptrace stop points in those functions where
registers can be manipulated.

The switch_stack frames serve double duty in fork, vfork, and clone,
as both the child's inputs to alpha_switch_to and the parent's
saved copy of the registers for debuggers to modify. This change
marks the frame as present in the parent, and clears
TIF_ALLREGS_SAVED in the child, as alpha_switch_to will consume the
switch_stack when the child is started.

Cc: [email protected]
Inspired-by: Linus Torvalds <[email protected]>
Fixes: dbe1bdbb39db ("io_uring: handle signals for IO threads like a normal thread")
Fixes: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
Fixes: a0691b116f6a ("Add new ptrace event tracing mechanism")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Signed-off-by: "Eric W. Biederman" <[email protected]>
---
arch/alpha/kernel/entry.S | 24 +++++++++++++++++-------
arch/alpha/kernel/process.c | 3 +++
arch/alpha/kernel/syscalls/syscall.tbl | 8 ++++----
3 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/arch/alpha/kernel/entry.S b/arch/alpha/kernel/entry.S
index c1edf54dc035..f29a40e2daf1 100644
--- a/arch/alpha/kernel/entry.S
+++ b/arch/alpha/kernel/entry.S
@@ -801,13 +801,18 @@ ret_from_fork:
.align 4
.globl ret_from_kernel_thread
.ent ret_from_kernel_thread
+ .cfi_startproc
ret_from_kernel_thread:
mov $17, $16
jsr $26, schedule_tail
+ /* PF_IO_WORKER threads can be ptraced */
+ SAVE_SWITCH_STACK
mov $9, $27
mov $10, $16
jsr $26, ($9)
+ RESTORE_SWITCH_STACK
br $31, ret_to_user
+ .cfi_endproc
.end ret_from_kernel_thread


@@ -816,23 +821,28 @@ ret_from_kernel_thread:
* have to play switch_stack games.
*/

-.macro fork_like name
+.macro allregs name
.align 4
.globl alpha_\name
.ent alpha_\name
+ .cfi_startproc
alpha_\name:
.prologue 0
- bsr $1, do_switch_stack
+ SAVE_SWITCH_STACK
jsr $26, sys_\name
- ldq $26, 56($sp)
- lda $sp, SWITCH_STACK_SIZE($sp)
+ RESTORE_SWITCH_STACK
ret
+ .cfi_endproc
.end alpha_\name
.endm

-fork_like fork
-fork_like vfork
-fork_like clone
+allregs fork
+allregs vfork
+allregs clone
+allregs exit
+allregs exit_group
+allregs execve
+allregs execveat

.macro sigreturn_like name
.align 4
diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index 5112ab996394..3bf480468a89 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -249,6 +249,9 @@ int copy_thread(unsigned long clone_flags, unsigned long usp,
childti->pcb.ksp = (unsigned long) childstack;
childti->pcb.flags = 1; /* set FEN, clear everything else */

+ /* In the child the registers are consumed by alpha_switch_to */
+ clear_ti_thread_flag(childti, TIF_ALLREGS_SAVED);
+
if (unlikely(p->flags & (PF_KTHREAD | PF_IO_WORKER))) {
/* kernel thread */
memset(childstack, 0,
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..5f85f3c11ed4 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -8,7 +8,7 @@
# The <abi> is always "common" for this file
#
0 common osf_syscall alpha_syscall_zero
-1 common exit sys_exit
+1 common exit alpha_exit
2 common fork alpha_fork
3 common read sys_read
4 common write sys_write
@@ -65,7 +65,7 @@
56 common osf_revoke sys_ni_syscall
57 common symlink sys_symlink
58 common readlink sys_readlink
-59 common execve sys_execve
+59 common execve alpha_execve
60 common umask sys_umask
61 common chroot sys_chroot
62 common osf_old_fstat sys_ni_syscall
@@ -333,7 +333,7 @@
400 common io_getevents sys_io_getevents
401 common io_submit sys_io_submit
402 common io_cancel sys_io_cancel
-405 common exit_group sys_exit_group
+405 common exit_group alpha_exit_group
406 common lookup_dcookie sys_lookup_dcookie
407 common epoll_create sys_epoll_create
408 common epoll_ctl sys_epoll_ctl
@@ -441,7 +441,7 @@
510 common renameat2 sys_renameat2
511 common getrandom sys_getrandom
512 common memfd_create sys_memfd_create
-513 common execveat sys_execveat
+513 common execveat alpha_execveat
514 common seccomp sys_seccomp
515 common bpf sys_bpf
516 common userfaultfd sys_userfaultfd
--
2.20.1

2021-06-21 13:56:00

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:

> And I think our horrible "kernel threads return to user space when
> done" is absolutely horrifically nasty. Maybe of the clever sort, but
> mostly of the historical horror sort.

How would you prefer to handle that, then? Separate magical path from
kernel_execve() to switch to userland? We used to have something of
that sort, and that had been a real horror...

As it is, it's "kernel thread is spawned at the point similar to
ret_from_fork(), runs the payload (which almost never returns) and
then proceeds out to userland, same way fork(2) would've done."
That way kernel_execve() doesn't have to do anything magical.

Al, digging through the old notes and current call graph...

2021-06-21 14:17:28

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
>
> How would you prefer to handle that, then? Separate magical path from
> kernel_execve() to switch to userland? We used to have something of
> that sort, and that had been a real horror...
>
> As it is, it's "kernel thread is spawned at the point similar to
> ret_from_fork(), runs the payload (which almost never returns) and
> then proceeds out to userland, same way fork(2) would've done."
> That way kernel_execve() doesn't have to do anything magical.
>
> Al, digging through the old notes and current call graph...

FWIW, the major assumption back then had been that get_signal(),
signal_delivered() and all associated machinery (including coredumps)
runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.

And "has complete registers on stack" is only a part of that;
there was other fun stuff in the area ;-/ Do we want coredumps for
those, and if we do, will the de_thread stuff work there?

2021-06-21 15:39:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 6:55 AM Al Viro <[email protected]> wrote:
>
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
>
> How would you prefer to handle that, then? Separate magical path from
> kernel_execve() to switch to userland? We used to have something of
> that sort, and that had been a real horror...

Hmm. Maybe the alternatives would all be worse. The current thing is
clever, and shares the return path with the normal case. It's just
also a bit surprising, in that a kernel thread normally must not
return - with the magical exception of "if it had done a
kernel_execve() at some point, then returning is magically the way you
actually start user mode".

So it all feels very special, and there's not even a comment about it.

I think we only have two users of that thing (the very first 'init',
and the user-mode helper), so I guess it doesn't really matter.

Linus

2021-06-21 16:56:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Al Viro <[email protected]> writes:

> On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>>
>> > And I think our horrible "kernel threads return to user space when
>> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> > mostly of the historical horror sort.
>>
>> How would you prefer to handle that, then? Separate magical path from
>> kernel_execve() to switch to userland? We used to have something of
>> that sort, and that had been a real horror...
>>
>> As it is, it's "kernel thread is spawned at the point similar to
>> ret_from_fork(), runs the payload (which almost never returns) and
>> then proceeds out to userland, same way fork(2) would've done."
>> That way kernel_execve() doesn't have to do anything magical.
>>
>> Al, digging through the old notes and current call graph...
>
> FWIW, the major assumption back then had been that get_signal(),
> signal_delivered() and all associated machinery (including coredumps)
> runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>
> And "has complete registers on stack" is only a part of that;
> there was other fun stuff in the area ;-/ Do we want coredumps for
> those, and if we do, will the de_thread stuff work there?

Do we want coredumps from processes that use io_uring? yes
Exactly what we want from io_uring threads is less clear. We can't
really give much that is meaningful beyond the thread ids of the
io_uring threads.

What problems are you seeing beyond the missing registers on the
stack for kernel threads?

I don't immediately see the connection between coredumps and de_thread.

The function de_thread arranges for the fatal_signal_pending to be true,
and that should work just fine for io_uring threads. The io_uring
threads process the fatal_signal with get_signal and then proceed to
exit eventually calling do_exit.

Eric





2021-06-21 19:00:33

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>
> > And I think our horrible "kernel threads return to user space when
> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > mostly of the historical horror sort.
>
> How would you prefer to handle that, then? Separate magical path from
> kernel_execve() to switch to userland? We used to have something of
> that sort, and that had been a real horror...
>
> As it is, it's "kernel thread is spawned at the point similar to
> ret_from_fork(), runs the payload (which almost never returns) and
> then proceeds out to userland, same way fork(2) would've done."
> That way kernel_execve() doesn't have to do anything magical.
>
> Al, digging through the old notes and current call graph...

There's a large mess around do_exit() - we have a bunch of
callers all over arch/*; if nothing else, I very much doubt that we really
want to let the tracer play with a thread in the middle of die_if_kernel()
or similar.

We sure as hell do not want to arrange for anything on the kernel
stack in such situations, no matter what's done in exit(2)...

2021-06-21 19:24:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 11:59 AM Al Viro <[email protected]> wrote:
>
> There's a large mess around do_exit() - we have a bunch of
> callers all over arch/*; if nothing else, I very much doubt that we really
> want to let the tracer play with a thread in the middle of die_if_kernel()
> or similar.

Right you are.

I'm really beginning to hate ptrace_{event,notify}() and those
PTRACE_EVENT_xyz things.

I don't even know what uses them, honestly. How very annoying.

I guess it's easy enough (famous last words) to move the
ptrace_event() call out of do_exit() and into the actual
exit/exit_group system calls, and the signal handling path. The paths
that actually have proper pt_regs.

Looks like sys_exit() and do_group_exit() would be the two places to
do it (do_group_exit() would handle the signal case and
sys_group_exit()).

Linus

2021-06-21 19:25:46

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 06:59:01PM +0000, Al Viro wrote:
> On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> > On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> >
> > > And I think our horrible "kernel threads return to user space when
> > > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> > > mostly of the historical horror sort.
> >
> > How would you prefer to handle that, then? Separate magical path from
> > kernel_execve() to switch to userland? We used to have something of
> > that sort, and that had been a real horror...
> >
> > As it is, it's "kernel thread is spawned at the point similar to
> > ret_from_fork(), runs the payload (which almost never returns) and
> > then proceeds out to userland, same way fork(2) would've done."
> > That way kernel_execve() doesn't have to do anything magical.
> >
> > Al, digging through the old notes and current call graph...
>
> There's a large mess around do_exit() - we have a bunch of
> callers all over arch/*; if nothing else, I very much doubt that we really
> want to let the tracer play with a thread in the middle of die_if_kernel()
> or similar.
>
> We sure as hell do not want to arrange for anything on the kernel
> stack in such situations, no matter what's done in exit(2)...

FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(),
all in not-from-userland cases. On m68k - die_if_kernel(), do_page_fault()
(both for non-from-userland cases) and something really odd - fpsp040_die().
Exception handling for floating point stuff on 68040? Looks like it has
an open-coded copy_to_user()/copy_from_user(), with faults doing hard
do_exit(SIGSEGV) instead of raising a signal and trying to do something
sane...

I really don't want to try and figure out how painful would it be to
teach that code how to deal with faults - _testing_ anything in that
area sure as hell will be. IIRC, details of recovery from FPU exceptions
on 68040 in the manual left impression of a minefield...

2021-06-21 19:49:13

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 12:22:06PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 11:59 AM Al Viro <[email protected]> wrote:
> >
> > There's a large mess around do_exit() - we have a bunch of
> > callers all over arch/*; if nothing else, I very much doubt that we really
> > want to let the tracer play with a thread in the middle of die_if_kernel()
> > or similar.
>
> Right you are.
>
> I'm really beginning to hate ptrace_{event,notify}() and those
> PTRACE_EVENT_xyz things.
>
> I don't even know what uses them, honestly. How very annoying.
>
> I guess it's easy enough (famous last words) to move the
> ptrace_event() call out of do_exit() and into the actual
> exit/exit_group system calls, and the signal handling path. The paths
> that actually have proper pt_regs.
>
> Looks like sys_exit() and do_group_exit() would be the two places to
> do it (do_group_exit() would handle the signal case and
> sys_group_exit()).

Maybe... I'm digging through that pile right now, will follow up when
I get a reasonably complete picture. In the meanwhile, do the kernel/kthread.c
uses look even remotely sane? Intentional - sure, but it really looks
wrong to use the thread exit code as a communication channel there...

2021-06-21 20:05:32

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 21, 2021 at 11:59 AM Al Viro <[email protected]> wrote:
>>
>> There's a large mess around do_exit() - we have a bunch of
>> callers all over arch/*; if nothing else, I very much doubt that we really
>> want to let the tracer play with a thread in the middle of die_if_kernel()
>> or similar.
>
> Right you are.
>
> I'm really beginning to hate ptrace_{event,notify}() and those
> PTRACE_EVENT_xyz things.
>
> I don't even know what uses them, honestly. How very annoying.

Modern strace does. Modern gdb appears not to.

However strace at least does not read the exit code,
or really appear to care about stopping for PTRACE_EVENT_EXIT.

I completely agree with you that they are very annoying.

> I guess it's easy enough (famous last words) to move the
> ptrace_event() call out of do_exit() and into the actual
> exit/exit_group system calls, and the signal handling path. The paths
> that actually have proper pt_regs.
>
> Looks like sys_exit() and do_group_exit() would be the two places to
> do it (do_group_exit() would handle the signal case and
> sys_group_exit()).

For other ptrace_event calls I am playing with seeing if I can split
them in two, like sending a signal, so that we can perform all
of the work in get_signal.

I think we can even change exit_group(2) and exit(2) so that (at least
when ptraced) they just send the ``event signal'' and then the signal
handling path handles all of the ptrace stuff.


When I started it was just going to be exit and PTRACE_EVENT_EXIT and
some old architectures, and that a generic solution was going to be
hard.

I still think we are going to need to fix the io_uring threads on the
architectures that use the caller saved register optimization like alpha
and m68k. But I think we can handle the rest in generic code.

Eric

2021-06-21 23:07:36

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
> Al Viro <[email protected]> writes:
>
> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
> >>
> >> > And I think our horrible "kernel threads return to user space when
> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
> >> > mostly of the historical horror sort.
> >>
> >> How would you prefer to handle that, then? Separate magical path from
> >> kernel_execve() to switch to userland? We used to have something of
> >> that sort, and that had been a real horror...
> >>
> >> As it is, it's "kernel thread is spawned at the point similar to
> >> ret_from_fork(), runs the payload (which almost never returns) and
> >> then proceeds out to userland, same way fork(2) would've done."
> >> That way kernel_execve() doesn't have to do anything magical.
> >>
> >> Al, digging through the old notes and current call graph...
> >
> > FWIW, the major assumption back then had been that get_signal(),
> > signal_delivered() and all associated machinery (including coredumps)
> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
> >
> > And "has complete registers on stack" is only a part of that;
> > there was other fun stuff in the area ;-/ Do we want coredumps for
> > those, and if we do, will the de_thread stuff work there?
>
> Do we want coredumps from processes that use io_uring? yes
> Exactly what we want from io_uring threads is less clear. We can't
> really give much that is meaningful beyond the thread ids of the
> io_uring threads.
>
> What problems are you seeing beyond the missing registers on the
> stack for kernel threads?
>
> I don't immediately see the connection between coredumps and de_thread.
>
> The function de_thread arranges for the fatal_signal_pending to be true,
> and that should work just fine for io_uring threads. The io_uring
> threads process the fatal_signal with get_signal and then proceed to
> exit eventually calling do_exit.

I would like to see the testing in cases when the io-uring thread is
the one getting hit by the initial signal and when it's the normal one
with associated io-uring ones. The thread-collecting logic at least
used to depend upon fairly subtle assumptions, and "kernel threads
obviously can't show up as candidates" used to narrow the analysis
down...

In any case, WTF would we allow reads or writes to *any* registers of
such threads? It's not as simple as "just return zeroes", BTW - the
values allowed in special registers might have non-trivial constraints
on them. The same goes for coredump - we don't _have_ registers to
dump for those, period.

Looks like the first things to do would be
* prohibit ptrace accessing any regsets of worker threads
* make coredump skip all register notes for those

Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
current->ptrace in the caller of the latter. So fork and exec side
has ptrace_event() crap limited to real syscalls.

It's seccomp[1] and exit-related stuff that are messy...

[1] "never trust somebody who introduces himself as Honest Joe and keeps
carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
etc.

2021-06-21 23:16:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 12:45 PM Al Viro <[email protected]> wrote:
> >
> > Looks like sys_exit() and do_group_exit() would be the two places to
> > do it (do_group_exit() would handle the signal case and
> > sys_group_exit()).
>
> Maybe... I'm digging through that pile right now, will follow up when
> I get a reasonably complete picture

We might have another possible way to solve this:

(a) make it the rule that everybody always saves the full (integer)
register set in pt_regs

(b) make m68k just always create that switch-stack for all system
calls (it's really not that big, I think it's like six words or
something)

(c) admit that alpha is broken, but nobody really cares

> In the meanwhile, do kernel/kthread.c uses look even remotely sane?
> Intentional - sure, but it really looks wrong to use thread exit code
> as communication channel there...

I really doubt that it is even "intentional".

I think it's "use some errno as a random exit code" and nobody ever
really thought about it, or thought about how that doesn't really
work. People are used to the error numbers, not thinking about how
do_exit() doesn't take an error number, but a signal number (and an
8-bit positive error code in bits 8-15).

Because no, it's not even remotely sane.

I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make
more sense. And the -ENOMEM might be SIGBUS, perhaps.

It does look like the usermode-helper code does save the exit code
with things like

kernel_wait(pid, &sub_info->retval);

and I see call_usermodehelper_exec() doing

retval = sub_info->retval;

and treating it as an error code. But I think those have never been
tested with that (bogus) exit code thing from kernel_wait(), because
it wouldn't have worked. It has only ever been tested with the (real)
exit code things like

if (pid < 0) {
sub_info->retval = pid;

which does actually assign a negative error code to it.

So I think that

kernel_wait(pid, &sub_info->retval);

line is buggy, and should be something like

int wstatus;
kernel_wait(pid, &wstatus);
sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0;

or something.

Linus

2021-06-21 23:17:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <[email protected]> wrote:
>
> For other ptrace_event calls I am playing with seeing if I can split
> them in two. Like sending a signal. So that we can have perform all
> of the work in get_signal.

That sounds like the right model, but I don't think it works.
Particularly not for exit(). The second phase will never happen.

Linus

2021-06-21 23:24:16

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 04:14:36PM -0700, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <[email protected]> wrote:
> > >
> > > Looks like sys_exit() and do_group_exit() would be the two places to
> > > do it (do_group_exit() would handle the signal case and
> > > sys_group_exit()).
> >
> > Maybe... I'm digging through that pile right now, will follow up when
> > I get a reasonably complete picture
>
> We might have another possible way to solve this:
>
> (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
>
> (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)
>
> (c) admit that alpha is broken, but nobody really cares

How would it help e.g. oopsen on the way out of timer interrupts?
IMO we simply shouldn't allow ptrace access if the tracee is in that kind
of state, on any architecture...

2021-06-21 23:28:45

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Al,

On 22/06/21 7:24 am, Al Viro wrote:
>
>> There's a large mess around do_exit() - we have a bunch of
>> callers all over arch/*; if nothing else, I very much doubt that really
>> want to let tracer play with a thread in the middle of die_if_kernel()
>> or similar.
>>
>> We sure as hell do not want to arrange for anything on the kernel
>> stack in such situations, no matter what's done in exit(2)...
> FWIW, on alpha it's die_if_kernel(), do_entUna() and do_page_fault(),
> all in not-from-userland cases. On m68k - die_if_kernel(), do_page_fault()
> (both for non-from-userland cases) and something really odd - fpsp040_die().
> Exception handling for floating point stuff on 68040? Looks like it has
Exception handling for emulated floating point instructions, really -
exceptions happening when executing FPU instructions on hardware will
do the normal exception processing.
> an open-coded copy_to_user()/copy_from_user(), with faults doing hard
> do_exit(SIGSEGV) instead of raising a signal and trying to do something
> sane...

Yes, that's what it does. Not pretty ... though all that using m68k
copy_to_user()/copy_from_user() would change is returning how many bytes
could not be copied. In contrast to the ifpsp060 code, we could not pass on
that return status to callers of copyin/copyout in fpsp040, so I don't
see what sane thing could be done if a fault happens.

(I'd expect the MMU would have raised a bus error and resolved the
problem by a page fault if possible, before we ever get to this point?)

> I really don't want to try and figure out how painful would it be to
> teach that code how to deal with faults - _testing_ anything in that
> area sure as hell will be. IIRC, details of recovery from FPU exceptions
> on 68040 in the manual left impression of a minefield...

This is only about faults when moving data from/to user space. FPU
exceptions are handled elsewhere in the code. So we at least don't have
to deal with that particular minefield.

Teaching the fpsp040 code to deal with access faults looks horrible
indeed... let's not go there.

Cheers,

    Michael


2021-06-21 23:38:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Mon, Jun 21, 2021 at 4:23 PM Al Viro <[email protected]> wrote:
>
> How would it help e.g. oopsen on the way out of timer interrupts?
> IMO we simply shouldn't allow ptrace access if the tracee is in that kind
> of state, on any architecture...

Yeah no, we can't do the "wait for ptrace" when the exit is due to an
oops. Although honestly, we have other cases like that where do_exit()
isn't 100% robust if you kill something in an interrupt. Like all the
locks it leaves locked etc.

So do_exit() from a timer interrupt is going to cause problems
regardless. I agree it's probably a good idea to try to avoid causing
even more with the odd ptrace thing, but I don't think ptrace_event is
some really "fundamental" problem at that point - it's just one detail
among many many.

So I was more thinking of the debug patch for m68k to catch all the
_regular_ cases, and all the other random cases of ptrace_event() or
ptrace_notify().

Although maybe we've really caught them all. The exit case was clearly
missing, and the thread fork case was scrogged. There are patches for
the known problems. The patches I really don't like are the
verification ones to find any unknown ones..

Linus

2021-06-22 00:02:39

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Linus,

On 22/06/21 11:14 am, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <[email protected]> wrote:
>>> Looks like sys_exit() and do_group_exit() would be the two places to
>>> do it (do_group_exit() would handle the signal case and
>>> sys_group_exit()).
>> Maybe... I'm digging through that pile right now, will follow up when
>> I get a reasonably complete picture
> We might have another possible way to solve this:
>
> (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
>
> (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)

Correct - six words for registers, one for the return address. Probably
still a win compared to setting and clearing flag bits all over the
place in an attempt to catch any as yet undetected unsafe cases of
ptrace_stop.

I'll have to see how much of a performance impact I can see (not that I
can even remotely measure that accurately - it's more of a 'does it now
feel real sluggish' thing).

Cheers,

    Michael

>
> (c) admit that alpha is broken, but nobody really cares
>
>> In the meanwhile, do kernel/kthread.c uses look even remotely sane?
>> Intentional - sure, but it really looks wrong to use thread exit code
>> as communication channel there...
> I really doubt that it is even "intentional".
>
> I think it's "use some errno as a random exit code" and nobody ever
> really thought about it, or thought about how that doesn't really
> work. People are used to the error numbers, not thinking about how
> do_exit() doesn't take an error number, but a signal number (and an
> 8-bit positive error code in bits 8-15).
>
> Because no, it's not even remotely sane.
>
> I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make
> more sense. And the -ENOMEM might be SIGBUS, perhaps.
>
> It does look like the usermode-helper code does save the exit code
> with things like
>
> kernel_wait(pid, &sub_info->retval);
>
> and I see call_usermodehelper_exec() doing
>
> retval = sub_info->retval;
>
> and treating it as an error code. But I think those have never been
> tested with that (bogus) exit code thing from kernel_wait(), because
> it wouldn't have worked. It has only ever been tested with the (real)
> exit code things like
>
> if (pid < 0) {
> sub_info->retval = pid;
>
> which does actually assign a negative error code to it.
>
> So I think that
>
> kernel_wait(pid, &sub_info->retval);
>
> line is buggy, and should be something like
>
> int wstatus;
> kernel_wait(pid, &wstatus);
> sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0;
>
> or something.
>
> Linus

2021-06-22 16:44:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Al Viro <[email protected]> writes:

> On Mon, Jun 21, 2021 at 11:50:56AM -0500, Eric W. Biederman wrote:
>> Al Viro <[email protected]> writes:
>>
>> > On Mon, Jun 21, 2021 at 01:54:56PM +0000, Al Viro wrote:
>> >> On Tue, Jun 15, 2021 at 02:58:12PM -0700, Linus Torvalds wrote:
>> >>
>> >> > And I think our horrible "kernel threads return to user space when
>> >> > done" is absolutely horrifically nasty. Maybe of the clever sort, but
>> >> > mostly of the historical horror sort.
>> >>
>> >> How would you prefer to handle that, then? Separate magical path from
>> >> kernel_execve() to switch to userland? We used to have something of
>> >> that sort, and that had been a real horror...
>> >>
>> >> As it is, it's "kernel thread is spawned at the point similar to
>> >> ret_from_fork(), runs the payload (which almost never returns) and
>> >> then proceeds out to userland, same way fork(2) would've done."
>> >> That way kernel_execve() doesn't have to do anything magical.
>> >>
>> >> Al, digging through the old notes and current call graph...
>> >
>> > FWIW, the major assumption back then had been that get_signal(),
>> > signal_delivered() and all associated machinery (including coredumps)
>> > runs *only* from SIGPENDING/NOTIFY_SIGNAL handling.
>> >
>> > And "has complete registers on stack" is only a part of that;
>> > there was other fun stuff in the area ;-/ Do we want coredumps for
>> > those, and if we do, will the de_thread stuff work there?
>>
>> Do we want coredumps from processes that use io_uring? yes
>> Exactly what we want from io_uring threads is less clear. We can't
>> really give much that is meaningful beyond the thread ids of the
>> io_uring threads.
>>
>> What problems are you seeing beyond the missing registers on the
>> stack for kernel threads?
>>
>> I don't immediately see the connection between coredumps and de_thread.
>>
>> The function de_thread arranges for the fatal_signal_pending to be true,
>> and that should work just fine for io_uring threads. The io_uring
>> threads process the fatal_signal with get_signal and then proceed to
>> exit eventually calling do_exit.
>
> I would like to see testing of the cases where the io-uring thread is
> the one getting hit by the initial signal and where it's the normal one
> with associated io-uring ones. The thread-collecting logic at least
> used to depend upon fairly subtle assumptions, and "kernel threads
> obviously can't show up as candidates" used to narrow the analysis
> down...
>
> In any case, WTF would we allow reads or writes to *any* registers of
> such threads? It's not as simple as "just return zeroes", BTW - the
> values allowed in special registers might have non-trivial constraints
> on them. The same goes for coredump - we don't _have_ registers to
> dump for those, period.
>
> Looks like the first things to do would be
> * prohibit ptrace accessing any regsets of worker threads
> * make coredump skip all register notes for those

Skipping register notes is fine. Prohibiting ptrace access to any
regsets of worker threads is interesting. I think that was tried and
shown to confuse gdb. So the conclusion was just to provide a fake set
of registers.

Which appears to work up to the point of dealing with architectures
that have their magic caller-saved optimization (like alpha and m68k),
and no check that all of the registers were saved when accessed. Adding
a dummy switch stack frame for the kernel threads on those architectures
looks like a good/cheap solution at first glance.

> Note, BTW, that kernel_thread() and kernel_execve() do *NOT* step into
> ptrace_notify() - explicit CLONE_UNTRACED for the former and zero
> current->ptrace in the caller of the latter. So fork and exec side
> has ptrace_event() crap limited to real syscalls.

That is where I thought we were. Thanks for confirming that.

> It's seccomp[1] and exit-related stuff that are messy...
>
> [1] "never trust somebody who introduces himself as Honest Joe and keeps
> carping on that all the time"; c.f. __secure_computing(), CONFIG_INTEGRITY,
> etc.

2021-06-22 20:06:53

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Linus,

On 22/06/21 11:14 am, Linus Torvalds wrote:
> On Mon, Jun 21, 2021 at 12:45 PM Al Viro <[email protected]> wrote:
>>> Looks like sys_exit() and do_group_exit() would be the two places to
>>> do it (do_group_exit() would handle the signal case and
>>> sys_group_exit()).
>> Maybe... I'm digging through that pile right now, will follow up when
>> I get a reasonably complete picture
> We might have another possible way to solve this:
>
> (a) make it the rule that everybody always saves the full (integer)
> register set in pt_regs
>
> (b) make m68k just always create that switch-stack for all system
> calls (it's really not that big, I think it's like six words or
> something)

Turns out that is harder than it looked at first glance (at least for me).

All syscalls that _do_ save the switch stack are currently called
through wrappers which pull the syscall arguments out of the saved
pt_regs on the stack (pushing the switch stack after the SAVE_ALL saved
stuff buries the syscall arguments on the stack, see comment about
m68k_clone()). We'd have to push the switch stack _first_ when entering
system_call to leave the syscall arguments in place, but that will
require further changes to the syscall exit path (currently shared with
the interrupt exit path). Not to mention the register offset
calculations in arch/m68k/kernel/ptrace.c, and perhaps a few other
dependencies that don't come to mind immediately.

We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the
ordering of the two is only mentioned in a comment. Can we reorder them
on the stack, as long as we don't change the struct definitions proper?

This will take a little more time to work out and test - certainly not
before the weekend. I'll send a corrected version of my debug patch
before that.

Cheers,

    Michael


>
> (c) admit that alpha is broken, but nobody really cares
>
>> In the meanwhile, do kernel/kthread.c uses look even remotely sane?
>> Intentional - sure, but it really looks wrong to use thread exit code
>> as communication channel there...
> I really doubt that it is even "intentional".
>
> I think it's "use some errno as a random exit code" and nobody ever
> really thought about it, or thought about how that doesn't really
> work. People are used to the error numbers, not thinking about how
> do_exit() doesn't take an error number, but a signal number (and an
> 8-bit positive error code in bits 8-15).
>
> Because no, it's not even remotely sane.
>
> I think the do_exit(-EINTR) could be do_exit(SIGINT) and it would make
> more sense. And the -ENOMEM might be SIGBUS, perhaps.
>
> It does look like the usermode-helper code does save the exit code
> with things like
>
> kernel_wait(pid, &sub_info->retval);
>
> and I see call_usermodehelper_exec() doing
>
> retval = sub_info->retval;
>
> and treating it as an error code. But I think those have never been
> tested with that (bogus) exit code thing from kernel_wait(), because
> it wouldn't have worked. It has only ever been tested with the (real)
> exit code things like
>
> if (pid < 0) {
> sub_info->retval = pid;
>
> which does actually assign a negative error code to it.
>
> So I think that
>
> kernel_wait(pid, &sub_info->retval);
>
> line is buggy, and should be something like
>
> int wstatus;
> kernel_wait(pid, &wstatus);
> sub_info->retval = WEXITSTATUS(wstatus) ? -EINVAL : 0;
>
> or something.
>
> Linus

2021-06-22 20:21:01

by Al Viro

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote:

> All syscalls that _do_ save the switch stack are currently called through
> wrappers which pull the syscall arguments out of the saved pt_regs on the
> stack (pushing the switch stack after the SAVE_ALL saved stuff buries the
> syscall arguments on the stack, see comment about m68k_clone()). We'd have to
> push the switch stack _first_ when entering system_call to leave the syscall
> arguments in place, but that will require further changes to the syscall
> exit path (currently shared with the interrupt exit path). Not to mention
> the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a
> few other dependencies that don't come to mind immediately.
>
> We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering
> of the two is only mentioned in a comment. Can we reorder them on the stack,
> as long as we don't change the struct definitions proper?
>
> This will take a little more time to work out and test - certainly not
> before the weekend. I'll send a corrected version of my debug patch before
> that.

This is insane, *especially* on m68k where you have the mess with different
frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for
the (very) full barfbag).

2021-06-22 20:57:12

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 21, 2021 at 1:04 PM Eric W. Biederman <[email protected]> wrote:
>>
>> For other ptrace_event calls I am playing with seeing if I can split
>> them in two. Like sending a signal. So that we can have perform all
>> of the work in get_signal.
>
> That sounds like the right model, but I don't think it works.
> Particularly not for exit(). The second phase will never happen.

Playing with it some more I think I have everything working
except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
exit_group(2).

Basically in exit sending yourself a signal and then calling do_exit
from the signal handler is not unreasonable, as exit is an ordinary
system call.

I haven't seen anything that ``knows'' that exit(2) or exit_group(2)
will never return and adds a special case in the system call table for
that case.

The complications of exit_group(2) are roughly those of moving
ptrace_event out of do_exit. They look doable and I am going to look
at that next.

This is not to say that this is the most maintainable way or that we
necessarily want to implement things this way, but I need to look and
see what it looks like.

For purposes of discussion this is my current draft implementation.

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c881384517..891812d32b90 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1087,6 +1087,7 @@ struct task_struct {
struct capture_control *capture_control;
#endif
/* Ptrace state: */
+ int stop_code;
unsigned long ptrace_message;
kernel_siginfo_t *last_siginfo;

diff --git a/include/linux/ptrace.h b/include/linux/ptrace.h
index b5ebf6c01292..33c50119b193 100644
--- a/include/linux/ptrace.h
+++ b/include/linux/ptrace.h
@@ -164,18 +164,29 @@ static inline void ptrace_event(int event, unsigned long message)
}
}

+static inline bool ptrace_post_event(int event, unsigned long message)
+{
+ bool posted = false;
+ if (unlikely(ptrace_event_enabled(current, event))) {
+ current->ptrace_message = message;
+ current->stop_code = (event << 8) | SIGTRAP;
+ set_tsk_thread_flag(current, TIF_SIGPENDING);
+ posted = true;
+ } else if (event == PTRACE_EVENT_EXEC) {
+ /* legacy EXEC report via SIGTRAP */
+ if ((current->ptrace & (PT_PTRACED|PT_SEIZED)) == PT_PTRACED)
+ send_sig(SIGTRAP, current, 0);
+ }
+ return posted;
+}
+
/**
- * ptrace_event_pid - possibly stop for a ptrace event notification
- * @event: %PTRACE_EVENT_* value to report
- * @pid: process identifier for %PTRACE_GETEVENTMSG to return
- *
- * Check whether @event is enabled and, if so, report @event and @pid
- * to the ptrace parent. @pid is reported as the pid_t seen from the
- * ptrace parent's pid namespace.
+ * pid_parent_nr - Return the number the parent knows this pid as
+ * @pid: The struct pid whose numerical value we want
*
* Called without locks.
*/
-static inline void ptrace_event_pid(int event, struct pid *pid)
+static inline pid_t pid_parent_nr(struct pid *pid)
{
/*
* FIXME: There's a potential race if a ptracer in a different pid
@@ -183,16 +194,15 @@ static inline void ptrace_event_pid(int event, struct pid *pid)
* when we acquire tasklist_lock in ptrace_stop(). If this happens,
* the ptracer will get a bogus pid from PTRACE_GETEVENTMSG.
*/
- unsigned long message = 0;
+ pid_t nr = 0;
struct pid_namespace *ns;

rcu_read_lock();
ns = task_active_pid_ns(rcu_dereference(current->parent));
if (ns)
- message = pid_nr_ns(pid, ns);
+ nr = pid_nr_ns(pid, ns);
rcu_read_unlock();
-
- ptrace_event(event, message);
+ return nr;
}

/**
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e24b1fe348e3..a2eac3831369 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -97,6 +97,8 @@ extern void exit_mm_release(struct task_struct *, struct mm_struct *);
/* Remove the current tasks stale references to the old mm_struct on exec() */
extern void exec_mm_release(struct task_struct *, struct mm_struct *);

+extern int wait_for_vfork_done(struct task_struct *child, struct completion *vfork);
+
#ifdef CONFIG_MEMCG
extern void mm_update_next_owner(struct mm_struct *mm);
#else
diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..bb4751d84e2d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1781,7 +1781,7 @@ static int exec_binprm(struct linux_binprm *bprm)

audit_bprm(bprm);
trace_sched_process_exec(current, old_pid, bprm);
- ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
+ ptrace_post_event(PTRACE_EVENT_EXEC, old_vpid);
proc_exec_connector(current);
return 0;
}
diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..aeb22a8e4d24 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);

SYSCALL_DEFINE1(exit, int, error_code)
{
- do_exit((error_code&0xff)<<8);
+ long code = (error_code&0xff)<<8;
+ if (!ptrace_post_event(PTRACE_EVENT_EXIT, code))
+ do_exit(code);
}

/*
diff --git a/kernel/fork.c b/kernel/fork.c
index dc06afd725cb..8533e056a3d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1266,8 +1266,7 @@ static void complete_vfork_done(struct task_struct *tsk)
task_unlock(tsk);
}

-static int wait_for_vfork_done(struct task_struct *child,
- struct completion *vfork)
+int wait_for_vfork_done(struct task_struct *child, struct completion *vfork)
{
int killed;

@@ -2278,7 +2277,8 @@ static __latent_entropy struct task_struct *copy_process(

init_task_pid_links(p);
if (likely(p->pid)) {
- ptrace_init_task(p, (clone_flags & CLONE_PTRACE) || trace);
+ ptrace_init_task(p, (clone_flags & CLONE_PTRACE) ||
+ (trace && ptrace_event_enabled(current, trace)));

init_task_pid(p, PIDTYPE_PID, pid);
if (thread_group_leader(p)) {
@@ -2462,7 +2462,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
pid_t kernel_clone(struct kernel_clone_args *args)
{
u64 clone_flags = args->flags;
- struct completion vfork;
+ unsigned long message;
struct pid *pid;
struct task_struct *p;
int trace = 0;
@@ -2495,9 +2495,6 @@ pid_t kernel_clone(struct kernel_clone_args *args)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
-
- if (likely(!ptrace_event_enabled(current, trace)))
- trace = 0;
}

p = copy_process(NULL, trace, NUMA_NO_NODE, args);
@@ -2512,30 +2509,27 @@ pid_t kernel_clone(struct kernel_clone_args *args)
*/
trace_sched_process_fork(current, p);

- pid = get_task_pid(p, PIDTYPE_PID);
+ pid = task_pid(p);
nr = pid_vnr(pid);
+ message = pid_parent_nr(pid);

if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, args->parent_tid);

- if (clone_flags & CLONE_VFORK) {
- p->vfork_done = &vfork;
+ if (!(clone_flags & CLONE_VFORK)) {
+ wake_up_new_task(p);
+ ptrace_post_event(trace, message);
+ }
+ else if (!ptrace_post_event(PTRACE_EVENT_VFORK, (unsigned long)p)) {
+ struct completion vfork;
init_completion(&vfork);
+ p->vfork_done = &vfork;
get_task_struct(p);
+ wake_up_new_task(p);
+ if (wait_for_vfork_done(p, &vfork))
+ ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message);
}

- wake_up_new_task(p);
-
- /* forking complete and child started to run, tell ptracer */
- if (unlikely(trace))
- ptrace_event_pid(trace, pid);
-
- if (clone_flags & CLONE_VFORK) {
- if (!wait_for_vfork_done(p, &vfork))
- ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
- }
-
- put_pid(pid);
return nr;
}

diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..8ac8c4a31d88 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -155,7 +155,8 @@ static bool recalc_sigpending_tsk(struct task_struct *t)
if ((t->jobctl & (JOBCTL_PENDING_MASK | JOBCTL_TRAP_FREEZE)) ||
PENDING(&t->pending, &t->blocked) ||
PENDING(&t->signal->shared_pending, &t->blocked) ||
- cgroup_task_frozen(t)) {
+ cgroup_task_frozen(t) ||
+ t->stop_code) {
set_tsk_thread_flag(t, TIF_SIGPENDING);
return true;
}
@@ -2607,6 +2608,39 @@ bool get_signal(struct ksignal *ksig)
if (unlikely(current->task_works))
task_work_run();

+ptrace_event:
+ /* Handle a posted ptrace event */
+ if (unlikely(current->stop_code)) {
+ int stop_code = current->stop_code;
+ int event = stop_code >> 8; /* stop_code is (event << 8) | SIGTRAP */
+ unsigned long message = current->ptrace_message;
+ struct completion vfork;
+ struct task_struct *p;
+
+ current->stop_code = 0;
+
+ if (event == PTRACE_EVENT_VFORK) {
+ p = (struct task_struct *)message;
+ get_task_struct(p);
+ message = pid_parent_nr(task_pid(p));
+ current->ptrace_message = message;
+ init_completion(&vfork);
+ p->vfork_done = &vfork;
+ wake_up_new_task(p);
+ }
+
+ spin_lock_irq(&sighand->siglock);
+ ptrace_do_notify(SIGTRAP, stop_code, CLD_TRAPPED);
+ spin_unlock_irq(&sighand->siglock);
+
+ if ((event == PTRACE_EVENT_VFORK) &&
+ wait_for_vfork_done(p, &vfork) &&
+ ptrace_post_event(PTRACE_EVENT_VFORK_DONE, message))
+ goto ptrace_event;
+
+ if (event == PTRACE_EVENT_EXIT) {
+ do_exit(message);
+ }
+ }
+
/*
* For non-generic architectures, check for TIF_NOTIFY_SIGNAL so
* that the arch handlers don't all have to do it. If we get here

Eric

2021-06-22 21:04:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 21, 2021 at 4:23 PM Al Viro <[email protected]> wrote:
>>
>> How would it help e.g. oopsen on the way out of timer interrupts?
>> IMO we simply shouldn't allow ptrace access if the tracee is in that kind
>> of state, on any architecture...
>
> Yeah no, we can't do the "wait for ptrace" when the exit is due to an
> oops. Although honestly, we have other cases like that where do_exit()
> isn't 100% robust if you kill something in an interrupt. Like all the
> locks it leaves locked etc.
>
> So do_exit() from a timer interrupt is going to cause problems
> regardless. I agree it's probably a good idea to try to avoid causing
> even more with the odd ptrace thing, but I don't think ptrace_event is
> some really "fundamental" problem at that point - it's just one detail
> among many many.
>
> So I was more thinking of the debug patch for m68k to catch all the
> _regular_ cases, and all the other random cases of ptrace_event() or
> ptrace_notify().
>
> Although maybe we've really caught them all. The exit case was clearly
> missing, and the thread fork case was scrogged. There are patches for
> the known problems. The patches I really don't like are the
> verification ones to find any unknown ones..

We still have nios2 which copied the m68k logic at some point. I think
that is a processor that is still ``shipping'' and that people might
still be using in new designs.

I haven't looked closely enough to see what the other architectures with
caller saved registers are doing.

The challenging ones are /proc/pid/syscall and seccomp which want to see
all of the system call arguments. I think every architecture always
saves the system call arguments unconditionally, so those cases are
probably not as interesting. But they certainly look like they could be
trouble.

Eric

2021-06-22 21:50:35

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Eric,

On 23/06/21 9:02 am, Eric W. Biederman wrote:
> Linus Torvalds <[email protected]> writes:
>
> So I was more thinking of the debug patch for m68k to catch all the
> _regular_ cases, and all the other random cases of ptrace_event() or
> ptrace_notify().
>
> Although maybe we've really caught them all. The exit case was clearly
> missing, and the thread fork case was scrogged. There are patches for
> the known problems. The patches I really don't like are the
> verification ones to find any unknown ones..
> We still have nios2 which copied the m68k logic at some point. I think
> that is a processor that is still ``shipping'' and that people might
> still be using in new designs.
>
> I haven't looked closely enough to see what the other architectures with
> caller saved registers are doing.
>
> The challenging ones are /proc/pid/syscall and seccomp which want to see
> all of the system call arguments. I think every architecture always
> saves the system call arguments unconditionally, so those cases are
> probably not as interesting. But they certainly look like they could be
> trouble.

Seccomp hasn't yet been implemented on m68k, though I'm working on that
with Adrian. The sole secure_computing() call will happen in
syscall_trace_enter(), so all system call arguments have been saved on
the stack.

Haven't looked at /proc/pid/syscall yet ...

Cheers,

    Michael

> Eric
>

2021-06-22 21:58:52

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Al,

On 23/06/21 8:18 am, Al Viro wrote:
> On Wed, Jun 23, 2021 at 08:04:11AM +1200, Michael Schmitz wrote:
>
>> All syscalls that _do_ save the switch stack are currently called through
>> wrappers which pull the syscall arguments out of the saved pt_regs on the
>> stack (pushing the switch stack after the SAVE_ALL saved stuff buries the
>> syscall arguments on the stack, see comment about m68k_clone(). We'd have to
>> push the switch stack _first_ when entering system_call to leave the syscall
>> arguments in place, but that will require further changes to the syscall
>> exit path (currently shared with the interrupt exit path). Not to mention
>> the register offset calculations in arch/m68k/kernel/ptrace.c, and perhaps a
>> few other dependencies that don't come to mind immediately.
>>
>> We have both pt_regs and switch_stack in uapi/asm/ptrace.h, but the ordering
>> of the two is only mentioned in a comment. Can we reorder them on the stack,
>> as long as we don't change the struct definitions proper?
>>
>> This will take a little more time to work out and test - certainly not
>> before the weekend. I'll send a corrected version of my debug patch before
>> that.
> This is insane, *especially* on m68k where you have the mess with different
> frame layouts and associated ->stkadj crap (see mangle_kernel_stack() for
> the (very) full barfbag).

Indeed - that's one of the uses of pt_regs and switch_stack that I
hadn't yet seen.

So it's either leave the stack layout in system calls unchanged (aside
from the ones that need the extra registers) and protect against
accidental misuse of registers that weren't saved, with the overhead of
playing with thread_info->status bits, or tackle the mess of redoing the
stack layout to save all registers, always (did I already mention that
I'd need a _lot_ of help from someone more conversant with m68k assembly
coding for that option?).

Which one of these two barf bags is the fuller one?

Cheers,

    Michael

2021-06-23 00:42:56

by Linus Torvalds

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <[email protected]> wrote:
>
> Playing with it some more I think I have everything working
> except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
> group_exit(2).
>
> Basically in exit sending yourself a signal and then calling do_exit
> from the signal handler is not unreasonable, as exit is an ordinary
> system call.

Ok, this is a bit odd, but I do like the concept of just making
ptrace_event just post a signal, and have all ptrace things always be
handled at signal time (or the special system call entry/exit, which
is fine too).

> For purposes of discussion this is my current draft implementation.

I didn't check what is so different about exit_group() that you left
that as an exercise for the reader, but if that ends up then removing
the whole "wait synchronously for ptrace" cases for good I don't
_hate_ this. It's a bit odd, but it would be really nice to limit
where ptrace picks up data.

We do end up doing that stuff in "get_signal()", and that means that
we have the interaction with io_uring calling it directly, but it's at
least not a new thing.

Linus

2021-06-23 05:28:10

by Michael Schmitz

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Hi Eric,

Am 23.06.2021 um 09:48 schrieb Michael Schmitz:
>>
>> The challenging ones are /proc/pid/syscall and seccomp which want to see
>> all of the system call arguments. I think every architecture always
>> saves the system call arguments unconditionally, so those cases are
>> probably not as interesting. But they certainly look like they could be
>> trouble.
>
> Seccomp hasn't yet been implemented on m68k, though I'm working on that
> with Adrian. The sole secure_computing() call will happen in
> syscall_trace_enter(), so all system call arguments have been saved on
> the stack.
>
> Haven't looked at /proc/pid/syscall yet ...

Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the
syscall_get_arguments I wrote for seccomp support only copies the first
five data registers, which are always saved.

Cheers,

Michael

2021-06-23 14:36:22

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Linus Torvalds <[email protected]> writes:

> On Tue, Jun 22, 2021 at 1:53 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Playing with it some more I think I have everything working
>> except for PTRACE_EVENT_SECCOMP (which can stay ptrace_event) and
>> group_exit(2).
>>
>> Basically in exit sending yourself a signal and then calling do_exit
>> from the signal handler is not unreasonable, as exit is an ordinary
>> system call.
>
> Ok, this is a bit odd, but I do like the concept of just making
> ptrace_event just post a signal, and have all ptrace things always be
> handled at signal time (or the special system call entry/exit, which
> is fine too).
>
>> For purposes of discussion this is my current draft implementation.
>
> I didn't check what is so different about exit_group() that you left
> that as an exercise for the reader, but if that ends up then removing
> the whole "wait synchronously for ptrace" cases for good I don't
> _hate_ this. It's a bit odd, but it would be really nice to limit
> where ptrace picks up data.

I am still figuring out exit_group. I am hoping for sometime today.
My intuition tells me I can do it, and I have a sense of what threads I
need to pull to get there. I just don't know what the code is going to
look like yet.

Basically solving exit_group means moving ptrace_event out of do_exit.

> We do end up doing that stuff in "get_signal()", and that means that
> we have the interaction with io_uring calling it directly, but it's at
> least not a new thing.

The ugliest bit is having to repeat the wait_for_vfork_done both in fork
and in get_signal.

Eric

2021-06-23 14:37:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: Kernel stack read with PTRACE_EVENT_EXIT and io_uring threads

Michael Schmitz <[email protected]> writes:

> Hi Eric,
>
> Am 23.06.2021 um 09:48 schrieb Michael Schmitz:
>>>
>>> The challenging ones are /proc/pid/syscall and seccomp which want to see
>>> all of the system call arguments. I think every architecture always
>>> saves the system call arguments unconditionally, so those cases are
>>> probably not as interesting. But they certainly look like they could be
>>> trouble.
>>
>> Seccomp hasn't yet been implemented on m68k, though I'm working on that
>> with Adrian. The sole secure_computing() call will happen in
>> syscall_trace_enter(), so all system call arguments have been saved on
>> the stack.
>>
>> Haven't looked at /proc/pid/syscall yet ...
>
> Not supported at present (no HAVE_ARCH_TRACEHOOK for m68k). And the
> syscall_get_arguments I wrote for seccomp support only copies the first five
> data registers, which are always saved.

Yes. It is looking like I can fix everything generically except for
faking user space registers for io_uring threads.

Eric

2021-06-24 19:02:17

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 0/9] Refactoring exit


I dug into exit because PTRACE_EVENT_EXIT not being guaranteed to be
called with a stack where ptrace can read and write all of the
userspace registers can lead to unfiltered reads and writes of kernel
stack contents.

While looking into it I realized that there are a lot of little races
between all of the ways an exit can be initiated. I don't know of a way
in which those races are harmful, but they make the code difficult to
reason about.

The solution this set of changes adopts is to implement good primitives
for asynchronous exit and exit_group requests and modifies exit(2) and
exit_group(2) to use those primitives.

The result should be more consistent determination of the reason for an
exit, as well as PTRACE_EVENT_EXIT always being called from a context
(get_signal) where ptrace is guaranteed to be able to read and write
all of the registers.

I believe the set of changes could be justified for the cleanups alone,
even if PTRACE_EVENT_EXIT did not need to be moved, which makes me
feel good about this approach.

If a way can be found to start coredumps from complete_signal
(needed for timely handling of fatal signals) instead of starting them
in do_coredump for proper synchronization, force_siginfo_to_task
and get_signal can be significantly simplified. As it is, a lot of
checks are duplicated to ensure that everything works properly in the
presence of do_coredump.

So far the code has been lightly tested, and the descriptions of some
of the patches are a bit light, but I think this shows the direction
I am aiming to travel for sorting out exit(2) and exit_group(2).

Eric W. Biederman (9):
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/seccomp: Refactor seccomp signal and coredump generation
signal/seccomp: Dump core when there is only one live thread
signal: Factor start_group_exit out of complete_signal
signal/group_exit: Use start_group_exit in place of do_group_exit
signal: Fold do_group_exit into get_signal fixing io_uring threads
signal: Make individual tasks exiting a first class concept.
signal/task_exit: Use start_task_exit in place of do_exit
signal: Move PTRACE_EVENT_EXIT into get_signal

arch/sh/kernel/cpu/fpu.c | 10 +--
fs/exec.c | 10 ++-
include/linux/sched/jobctl.h | 2 +
include/linux/sched/signal.h | 5 ++
include/linux/sched/task.h | 1 -
kernel/exit.c | 41 ++---------
kernel/seccomp.c | 45 +++---------
kernel/signal.c | 166 ++++++++++++++++++++++++++++++-------------
8 files changed, 154 insertions(+), 126 deletions(-)

2021-06-24 19:03:03

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal


Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 2 ++
kernel/signal.c | 52 +++++++++++++++++++++++++-----------
2 files changed, 39 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 774be5d3ac3e..c007e55cb119 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -428,6 +428,8 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
signal_wake_up_state(t, resume ? __TASK_TRACED : 0);
}

+void start_group_exit(int exit_code);
+
void task_join_group_stop(struct task_struct *task);

#ifdef TIF_RESTORE_SIGMASK
diff --git a/kernel/signal.c b/kernel/signal.c
index da37cc4515f2..c79c010ca5f3 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1027,6 +1027,42 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
}

+static void start_group_exit_locked(struct signal_struct *signal, int exit_code)
+{
+ /*
+ * Start a group exit and wake everybody up.
+ * This way we don't have other threads
+ * running and doing things after a slower
+ * thread has the fatal signal pending.
+ */
+ struct task_struct *t;
+
+ signal->flags = SIGNAL_GROUP_EXIT;
+ signal->group_exit_code = exit_code;
+ signal->group_stop_count = 0;
+ __for_each_thread(signal, t) {
+ task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
+
+ /* Don't bother with already dead threads */
+ if (t->exit_state)
+ continue;
+ sigaddset(&t->pending.signal, SIGKILL);
+ signal_wake_up(t, 1);
+ }
+}
+
+void start_group_exit(int exit_code)
+{
+ if (!fatal_signal_pending(current)) {
+ struct sighand_struct *const sighand = current->sighand;
+
+ spin_lock_irq(&sighand->siglock);
+ if (!fatal_signal_pending(current))
+ start_group_exit_locked(current->signal, exit_code);
+ spin_unlock_irq(&sighand->siglock);
+ }
+}
+
static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
{
struct signal_struct *signal = p->signal;
@@ -1076,21 +1112,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
* This signal will be fatal to the whole group.
*/
if (!sig_kernel_coredump(sig)) {
- /*
- * Start a group exit and wake everybody up.
- * This way we don't have other threads
- * running and doing things after a slower
- * thread has the fatal signal pending.
- */
- signal->flags = SIGNAL_GROUP_EXIT;
- signal->group_exit_code = sig;
- signal->group_stop_count = 0;
- t = p;
- do {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
- } while_each_thread(p, t);
+ start_group_exit_locked(signal, sig);
return;
}
}
--
2.20.1

2021-06-24 19:03:32

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit


Make thread exiting uniform by causing all threads to pass through
get_signal when they are exiting. This simplifies the analysis
of synchronization during exit and guarantees that the full set
of registers will be available for ptrace to examine for
threads that stop at PTRACE_EVENT_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 4 ++--
kernel/seccomp.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index fd1c04193e18..921519d80b56 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -931,8 +931,8 @@ do_group_exit(int exit_code)
*/
SYSCALL_DEFINE1(exit_group, int, error_code)
{
- do_group_exit((error_code & 0xff) << 8);
- /* NOTREACHED */
+ start_group_exit((error_code & 0xff) << 8);
+ /* get_signal will call do_exit */
return 0;
}

diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 5301eca670a0..b1c06fd1b205 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
if (action == SECCOMP_RET_KILL_THREAD)
do_exit(SIGSYS);
else
- do_group_exit(SIGSYS);
+ start_group_exit(SIGSYS);
}
return -1;
}
--
2.20.1

2021-06-24 19:03:59

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation


Factor out force_sig_seccomp from the seccomp signal generation and
place it in kernel/signal.c. The function force_sig_seccomp takes a
parameter force_coredump to indicate that the sigaction field should be
reset to SIG_DFL so that a coredump will be generated when the signal
is delivered.

force_sig_seccomp is then used to replace both seccomp_send_sigsys
and seccomp_init_siginfo.

force_sig_info_to_task gains an extra parameter to force using
the default signal action.

With this change seccomp is no longer a special case and there
becomes exactly one place do_coredump is called from.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 1 +
kernel/seccomp.c | 43 ++++++++----------------------------
kernel/signal.c | 30 +++++++++++++++++++++----
3 files changed, 36 insertions(+), 38 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 7f4278fa21fe..774be5d3ac3e 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -329,6 +329,7 @@ int force_sig_pkuerr(void __user *addr, u32 pkey);
int force_sig_perf(void __user *addr, u32 type, u64 sig_data);

int force_sig_ptrace_errno_trap(int errno, void __user *addr);
+int force_sig_seccomp(int syscall, int reason, bool force_coredump);

extern int send_sig_info(int, struct kernel_siginfo *, struct task_struct *);
extern void force_sigsegv(int sig);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 6ecd3f3a52b5..3e06d4628d98 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -920,30 +920,6 @@ void get_seccomp_filter(struct task_struct *tsk)
refcount_inc(&orig->users);
}

-static void seccomp_init_siginfo(kernel_siginfo_t *info, int syscall, int reason)
-{
- clear_siginfo(info);
- info->si_signo = SIGSYS;
- info->si_code = SYS_SECCOMP;
- info->si_call_addr = (void __user *)KSTK_EIP(current);
- info->si_errno = reason;
- info->si_arch = syscall_get_arch(current);
- info->si_syscall = syscall;
-}
-
-/**
- * seccomp_send_sigsys - signals the task to allow in-process syscall emulation
- * @syscall: syscall number to send to userland
- * @reason: filter-supplied reason code to send to userland (via si_errno)
- *
- * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
- */
-static void seccomp_send_sigsys(int syscall, int reason)
-{
- struct kernel_siginfo info;
- seccomp_init_siginfo(&info, syscall, reason);
- force_sig_info(&info);
-}
#endif /* CONFIG_SECCOMP_FILTER */

/* For use with seccomp_actions_logged */
@@ -1195,7 +1171,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
/* Show the handler the original registers. */
syscall_rollback(current, current_pt_regs());
/* Let the filter pass back 16 bits of data. */
- seccomp_send_sigsys(this_syscall, data);
+ force_sig_seccomp(this_syscall, data, false);
goto skip;

case SECCOMP_RET_TRACE:
@@ -1266,18 +1242,17 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
/* Dump core only if this is the last remaining thread. */
if (action != SECCOMP_RET_KILL_THREAD ||
get_nr_threads(current) == 1) {
- kernel_siginfo_t info;
-
/* Show the original registers in the dump. */
syscall_rollback(current, current_pt_regs());
- /* Trigger a manual coredump since do_exit skips it. */
- seccomp_init_siginfo(&info, this_syscall, data);
- do_coredump(&info);
+ /* Trigger a coredump with SIGSYS */
+ force_sig_seccomp(this_syscall, data, true);
+ } else {
+ if (action == SECCOMP_RET_KILL_THREAD)
+ do_exit(SIGSYS);
+ else
+ do_group_exit(SIGSYS);
}
- if (action == SECCOMP_RET_KILL_THREAD)
- do_exit(SIGSYS);
- else
- do_group_exit(SIGSYS);
+ return -1;
}

unreachable();
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..da37cc4515f2 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -54,6 +54,7 @@
#include <asm/unistd.h>
#include <asm/siginfo.h>
#include <asm/cacheflush.h>
+#include <asm/syscall.h> /* for syscall_get_* */

/*
* SLAB caches for signal bits.
@@ -1349,7 +1350,7 @@ int do_send_sig_info(int sig, struct kernel_siginfo *info, struct task_struct *p
* that is why we also clear SIGNAL_UNKILLABLE.
*/
static int
-force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
+force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t, bool sigdfl)
{
unsigned long int flags;
int ret, blocked, ignored;
@@ -1360,7 +1361,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
action = &t->sighand->action[sig-1];
ignored = action->sa.sa_handler == SIG_IGN;
blocked = sigismember(&t->blocked, sig);
- if (blocked || ignored) {
+ if (blocked || ignored || sigdfl) {
action->sa.sa_handler = SIG_DFL;
if (blocked) {
sigdelset(&t->blocked, sig);
@@ -1381,7 +1382,7 @@ force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)

int force_sig_info(struct kernel_siginfo *info)
{
- return force_sig_info_to_task(info, current);
+ return force_sig_info_to_task(info, current, false);
}

/*
@@ -1712,7 +1713,7 @@ int force_sig_fault_to_task(int sig, int code, void __user *addr
info.si_flags = flags;
info.si_isr = isr;
#endif
- return force_sig_info_to_task(&info, t);
+ return force_sig_info_to_task(&info, t, false);
}

int force_sig_fault(int sig, int code, void __user *addr
@@ -1820,6 +1821,27 @@ int force_sig_perf(void __user *addr, u32 type, u64 sig_data)
return force_sig_info(&info);
}

+/**
+ * force_sig_seccomp - signals the task to allow in-process syscall emulation
+ * @syscall: syscall number to send to userland
+ * @reason: filter-supplied reason code to send to userland (via si_errno)
+ *
+ * Forces a SIGSYS with a code of SYS_SECCOMP and related sigsys info.
+ */
+int force_sig_seccomp(int syscall, int reason, bool force_coredump)
+{
+ struct kernel_siginfo info;
+
+ clear_siginfo(&info);
+ info.si_signo = SIGSYS;
+ info.si_code = SYS_SECCOMP;
+ info.si_call_addr = (void __user *)KSTK_EIP(current);
+ info.si_errno = reason;
+ info.si_arch = syscall_get_arch(current);
+ info.si_syscall = syscall;
+ return force_sig_info_to_task(&info, current, force_coredump);
+}
+
/* For the crazy architectures that include trap information in
* the errno field, instead of an actual errno value.
*/
--
2.20.1

2021-06-24 19:04:22

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads


Fold do_group_exit into get_signal as it is the last caller.

Move the group_exit logic above the PF_IO_WORKER exit, ensuring
that if a PF_IO_WORKER catches SIGKILL every thread in
the thread group will exit, not just the PF_IO_WORKER.

Now that the information is easily available, only set PF_SIGNALED
when it was a signal that caused the exit.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/task.h | 1 -
kernel/exit.c | 31 -------------------------------
kernel/signal.c | 35 +++++++++++++++++++++++++----------
3 files changed, 25 insertions(+), 42 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index ef02be869cf2..45525512e3d0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
{
}
#endif
-extern void do_group_exit(int);

extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 921519d80b56..635f434122b7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
do_exit((error_code&0xff)<<8);
}

-/*
- * Take down every thread in the group. This is called by fatal signals
- * as well as by sys_exit_group (below).
- */
-void
-do_group_exit(int exit_code)
-{
- struct signal_struct *sig = current->signal;
-
- BUG_ON(exit_code & 0x80); /* core dumps don't get here */
-
- if (signal_group_exit(sig))
- exit_code = sig->group_exit_code;
- else if (!thread_group_empty(current)) {
- struct sighand_struct *const sighand = current->sighand;
-
- spin_lock_irq(&sighand->siglock);
- if (signal_group_exit(sig))
- /* Another thread got here before we took the lock. */
- exit_code = sig->group_exit_code;
- else {
- sig->group_exit_code = exit_code;
- sig->flags = SIGNAL_GROUP_EXIT;
- zap_other_threads(current);
- }
- spin_unlock_irq(&sighand->siglock);
- }
-
- do_exit(exit_code);
- /* NOTREACHED */
-}

/*
* this kills every thread in the thread group. Note that any externally
diff --git a/kernel/signal.c b/kernel/signal.c
index c79c010ca5f3..95a076af600a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
{
struct sighand_struct *sighand = current->sighand;
struct signal_struct *signal = current->signal;
+ int exit_code;
int signr;

if (unlikely(current->task_works))
@@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
/*
* Anything else is fatal, maybe with a core dump.
*/
- current->flags |= PF_SIGNALED;
-
if (sig_kernel_coredump(signr)) {
if (print_fatal_signals)
print_fatal_signal(ksig->info.si_signo);
@@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
/*
* If it was able to dump core, this kills all
* other threads in the group and synchronizes with
- * their demise. If we lost the race with another
- * thread getting here, it set group_exit_code
- * first and our do_group_exit call below will use
- * that value and ignore the one we pass it.
+ * their demise. If another thread makes it
+ * to do_coredump first, it will set group_exit_code
+ * which will be passed to do_exit.
*/
do_coredump(&ksig->info);
}

+ /*
+ * Death signals, no core dump.
+ */
+ exit_code = signr;
+ if (signal_group_exit(signal)) {
+ exit_code = signal->group_exit_code;
+ } else {
+ spin_lock_irq(&sighand->siglock);
+ if (signal_group_exit(signal)) {
+ /* Another thread got here before we took the lock. */
+ exit_code = signal->group_exit_code;
+ } else {
+ start_group_exit_locked(signal, exit_code);
+ }
+ spin_unlock_irq(&sighand->siglock);
+ }
+
+ if (exit_code & 0x7f)
+ current->flags |= PF_SIGNALED;
+
/*
* PF_IO_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
@@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
if (current->flags & PF_IO_WORKER)
goto out;

- /*
- * Death signals, no core dump.
- */
- do_group_exit(ksig->info.si_signo);
+ do_exit(exit_code);
/* NOTREACHED */
}
spin_unlock_irq(&sighand->siglock);
--
2.20.1

2021-06-24 19:05:54

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.


Implement start_task_exit_locked and rewrite the de_thread logic
in exec using it.

Calling start_task_exit_locked is equivalent to asynchronously
calling exit(2) aka pthread_exit on a task.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
fs/exec.c | 10 +++++++++-
include/linux/sched/jobctl.h | 2 ++
include/linux/sched/signal.h | 1 +
kernel/signal.c | 37 ++++++++++++++++--------------------
4 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..b6f50213f0a0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1040,6 +1040,7 @@ static int de_thread(struct task_struct *tsk)
struct signal_struct *sig = tsk->signal;
struct sighand_struct *oldsighand = tsk->sighand;
spinlock_t *lock = &oldsighand->siglock;
+ struct task_struct *t;

if (thread_group_empty(tsk))
goto no_thread_group;
@@ -1058,7 +1059,14 @@ static int de_thread(struct task_struct *tsk)
}

sig->group_exit_task = tsk;
- sig->notify_count = zap_other_threads(tsk);
+ sig->group_stop_count = 0;
+ sig->notify_count = 0;
+ __for_each_thread(sig, t) {
+ if (t == tsk)
+ continue;
+ sig->notify_count++;
+ start_task_exit_locked(t, SIGKILL);
+ }
if (!thread_group_leader(tsk))
sig->notify_count--;

diff --git a/include/linux/sched/jobctl.h b/include/linux/sched/jobctl.h
index fa067de9f1a9..e94833b0c819 100644
--- a/include/linux/sched/jobctl.h
+++ b/include/linux/sched/jobctl.h
@@ -19,6 +19,7 @@ struct task_struct;
#define JOBCTL_TRAPPING_BIT 21 /* switching to TRACED */
#define JOBCTL_LISTENING_BIT 22 /* ptracer is listening for events */
#define JOBCTL_TRAP_FREEZE_BIT 23 /* trap for cgroup freezer */
+#define JOBCTL_TASK_EXITING_BIT 31 /* the task is exiting */

#define JOBCTL_STOP_DEQUEUED (1UL << JOBCTL_STOP_DEQUEUED_BIT)
#define JOBCTL_STOP_PENDING (1UL << JOBCTL_STOP_PENDING_BIT)
@@ -28,6 +29,7 @@ struct task_struct;
#define JOBCTL_TRAPPING (1UL << JOBCTL_TRAPPING_BIT)
#define JOBCTL_LISTENING (1UL << JOBCTL_LISTENING_BIT)
#define JOBCTL_TRAP_FREEZE (1UL << JOBCTL_TRAP_FREEZE_BIT)
+#define JOBCTL_TASK_EXITING (1UL << JOBCTL_TASK_EXITING_BIT)

#define JOBCTL_TRAP_MASK (JOBCTL_TRAP_STOP | JOBCTL_TRAP_NOTIFY)
#define JOBCTL_PENDING_MASK (JOBCTL_STOP_PENDING | JOBCTL_TRAP_MASK)
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index c007e55cb119..a958381ba4a9 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -429,6 +429,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
}

void start_group_exit(int exit_code);
+void start_task_exit_locked(struct task_struct *task, int exit_code);

void task_join_group_stop(struct task_struct *task);

diff --git a/kernel/signal.c b/kernel/signal.c
index 95a076af600a..afbc001220dd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -264,6 +264,12 @@ static inline void print_dropped_signal(int sig)
current->comm, current->pid, sig);
}

+static void task_set_jobctl_exiting(struct task_struct *task, int exit_code)
+{
+ WARN_ON_ONCE(task->jobctl & ~JOBCTL_STOP_SIGMASK);
+ task->jobctl = JOBCTL_TASK_EXITING | (exit_code & JOBCTL_STOP_SIGMASK);
+}
+
/**
* task_set_jobctl_pending - set jobctl pending bits
* @task: target task
@@ -1407,28 +1413,15 @@ int force_sig_info(struct kernel_siginfo *info)
return force_sig_info_to_task(info, current, false);
}

-/*
- * Nuke all other threads in the group.
- */
-int zap_other_threads(struct task_struct *p)
+void start_task_exit_locked(struct task_struct *task, int exit_code)
{
- struct task_struct *t = p;
- int count = 0;
-
- p->signal->group_stop_count = 0;
-
- while_each_thread(p, t) {
- task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
- count++;
-
- /* Don't bother with already dead threads */
- if (t->exit_state)
- continue;
- sigaddset(&t->pending.signal, SIGKILL);
- signal_wake_up(t, 1);
+ task_clear_jobctl_pending(task, JOBCTL_PENDING_MASK);
+ /* Only bother with threads that might be alive */
+ if (!task->exit_state) {
+ task_set_jobctl_exiting(task, exit_code);
+ sigaddset(&task->pending.signal, SIGKILL);
+ signal_wake_up(task, 1);
}
-
- return count;
}

struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
@@ -2714,7 +2707,7 @@ bool get_signal(struct ksignal *ksig)
}

/* Has this task already been marked for death? */
- if (signal_group_exit(signal)) {
+ if (signal_group_exit(signal) || (current->jobctl & JOBCTL_TASK_EXITING)) {
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
@@ -2874,6 +2867,8 @@ bool get_signal(struct ksignal *ksig)
if (signal_group_exit(signal)) {
/* Another thread got here before we took the lock. */
exit_code = signal->group_exit_code;
+ } else if (current->jobctl & JOBCTL_TASK_EXITING) {
+ exit_code = current->jobctl & JOBCTL_STOP_SIGMASK;
} else {
start_group_exit_locked(signal, exit_code);
}
--
2.20.1

2021-06-24 19:06:07

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit


Reuse start_task_exit_locked to implement start_task_exit.

Simplify the exit logic by having all exits go through get_signal.
This simplifies the analysis of synchronization during exit and
guarantees a full set of registers will be available for ptrace to
examine at PTRACE_EVENT_EXIT.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
include/linux/sched/signal.h | 1 +
kernel/exit.c | 4 +++-
kernel/seccomp.c | 2 +-
kernel/signal.c | 12 ++++++++++++
4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index a958381ba4a9..3f4e69c019b7 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)

void start_group_exit(int exit_code);
void start_task_exit_locked(struct task_struct *task, int exit_code);
+void start_task_exit(int exit_code);

void task_join_group_stop(struct task_struct *task);

diff --git a/kernel/exit.c b/kernel/exit.c
index 635f434122b7..51e0c82b3f7d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);

SYSCALL_DEFINE1(exit, int, error_code)
{
- do_exit((error_code&0xff)<<8);
+ start_task_exit((error_code&0xff)<<8);
+ /* get_signal will call do_exit */
+ return 0;
}


diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b1c06fd1b205..e0c4c123a8bf 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
force_sig_seccomp(this_syscall, data, true);
} else {
if (action == SECCOMP_RET_KILL_THREAD)
- do_exit(SIGSYS);
+ start_task_exit(SIGSYS);
else
start_group_exit(SIGSYS);
}
diff --git a/kernel/signal.c b/kernel/signal.c
index afbc001220dd..63fda9b6bbf9 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
}
}

+void start_task_exit(int exit_code)
+{
+ struct task_struct *task = current;
+ if (!fatal_signal_pending(task)) {
+ struct sighand_struct *const sighand = task->sighand;
+ spin_lock_irq(&sighand->siglock);
+ if (!fatal_signal_pending(current))
+ start_task_exit_locked(task, exit_code);
+ spin_unlock_irq(&sighand->siglock);
+ }
+}
+
struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
unsigned long *flags)
{
--
2.20.1

2021-06-24 19:07:46

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH 9/9] signal: Move PTRACE_EVENT_EXIT into get_signal


This ensures that we always have a full set of registers available
when PTRACE_EVENT_EXIT is called, something that is not guaranteed
for callers of do_exit.

Additionally this guarantees PTRACE_EVENT_EXIT will not cause havoc
with abnormal exits.

Signed-off-by: "Eric W. Biederman" <[email protected]>
---
kernel/exit.c | 2 --
kernel/signal.c | 2 ++
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 51e0c82b3f7d..309f1d71e340 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -763,8 +763,6 @@ void __noreturn do_exit(long code)
profile_task_exit(tsk);
kcov_task_exit(tsk);

- ptrace_event(PTRACE_EVENT_EXIT, code);
-
validate_creds_for_do_exit(tsk);

/*
diff --git a/kernel/signal.c b/kernel/signal.c
index 63fda9b6bbf9..7214331836bc 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2890,6 +2890,8 @@ bool get_signal(struct ksignal *ksig)
if (exit_code & 0x7f)
current->flags |= PF_SIGNALED;

+ ptrace_event(PTRACE_EVENT_EXIT, exit_code);
+
/*
* PF_IO_WORKER threads will catch and exit on fatal signals
* themselves. They have cleanup that must be performed, so
--
2.20.1

2021-06-24 20:13:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal

I don't really mind the patch, but this patch doesn't actually do what
it says it does.

It factors out start_group_exit_locked() - which all looks good.

But then it also creates that new start_group_exit() function and
makes the declaration for it, and nothing actually uses it. Yet.

I'd do that second part later when you actually introduce the use in
the next patch (5/9).

Hmm?

Linus

2021-06-24 20:20:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.

On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman
<[email protected]> wrote:
>
> Implement start_task_exit_locked and rewrite the de_thread logic
> in exec using it.
>
> Calling start_task_exit_locked is equivalent to asynchronously
> calling exit(2) aka pthread_exit on a task.

Ok, so this is the patch that makes me go "Yeah, this seems to all go together".

The whole "start_exit()" thing seemed fairly sane as an interesting
concept to the whole ptrace notification thing, but this one actually
made me think it makes conceptual sense, given how we had exactly that
"start exit asynchronously" case already in zap_other_threads().

So doing that zap_other_threads() as that async exit makes me just
think that yes, this series is the right thing, because it not only
cleans up the ptrace condition, it makes sense in this entirely
unrelated area too.

So I think I'm convinced. I'd like Oleg in particular to Ack this
series, and Al to look it over, but I do think this is the right
direction.

Linus

2021-06-24 21:38:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 7/9] signal: Make individual tasks exiting a first class concept.

Linus Torvalds <[email protected]> writes:

> On Thu, Jun 24, 2021 at 12:03 PM Eric W. Biederman
> <[email protected]> wrote:
>>
>> Implement start_task_exit_locked and rewrite the de_thread logic
>> in exec using it.
>>
>> Calling start_task_exit_locked is equivalent to asynchronously
>> calling exit(2) aka pthread_exit on a task.
>
> Ok, so this is the patch that makes me go "Yeah, this seems to all go together".
>
> The whole "start_exit()" thing seemed fairly sane as an interesting
> concept to the whole ptrace notification thing, but this one actually
> made me think it makes conceptual sense, given how we had exactly that
> "start exit asynchronously" case already in zap_other_threads().
>
> So doing that zap_other_threads() as that async exit makes me just
> think that yes, this series is the right thing, because it not only
> cleans up the ptrace condition, it makes sense in this entirely
> unrelated area too.
>
> So I think I'm convinced. I'd like Oleg in particular to Ack this
> series, and Al to look it over, but I do think this is the right
> direction.

Thanks.

It took a bit of exploration and playing with things to get here,
but I had the same sense.

Next round I will see if I can clean up the patches a bit more.

Eric

2021-06-24 22:47:11

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/9] Refactoring exit

On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote:

> So far the code has been lightly tested, and the descriptions of some
> of the patches are a bit light, but I think this shows the direction
> I am aiming to travel for sorting out exit(2) and exit_group(2).

FWIW, here's the current picture for do_exit(), aside of exit(2) and do_exit_group():

1) stuff that is clearly oops-like -
alpha:die_if_kernel() alpha:do_entUna() alpha:do_page_fault() arm:oops_end()
arm:__do_kernel_fault() arm64:die() arm64:die_kernel_fault() csky:alignment()
csky:die() csky:no_context() h8300:die() h8300:do_page_fault() hexagon:die()
ia64:die() ia64:ia64_do_page_fault() m68k:die_if_kernel() m68k:send_fault_sig()
microblaze:die() mips:die() nds32:handle_fpu_exception() nds32:die()
nds32:unhandled_interruption() nds32:unhandled_exceptions() nds32:do_revinsn()
nds32:do_page_fault() nios:die() openrisc:die() openrisc:do_page_fault()
parisc:die_if_kernel() ppc:oops_end() riscv:die() riscv:die_kernel_fault()
s390:die() s390:do_no_context() s390:do_low_address() sh:die()
sparc32:die_if_kernel() sparc32:do_sparc_fault() sparc64:die_if_kernel()
x86:rewind_stack_do_exit() xtensa:die() xtensa:bad_page_fault()
We really do not want ptrace anywhere near any of those and we do not want
any of that to return; this shit would better be handled right there and
there - no "post a fatal signal" would do.

2) sparc32 playing silly buggers with SIGILL in case when signal delivery
can't get a valid sigframe. The regular variant for that kind of stuff
is forced SIGSEGV from failure case of signal_setup_done(). We could force
that SIGILL instead of do_exit() there (and report failure from sigframe
setup), but I suspect that we'll get SIGSEGV override that SIGILL, with
user-visible behaviour change. Triggered by altstack overflow on sparc32;
sparc64 gets SIGSEGV in the same situation, just like everybody else.

3) ppc swapcontext(2). Normal syscall, on failure results in exit(SIGSEGV).
Not sure if we want to post signal here - exposing the caller to results
of failure might be... interesting. And I really don't know if we want
to allow ptrace() to poke around in the results of such failure. That's
a question for ppc maintainers.

4) sparc32:try_to_clear_window_buffer(). Probably could force SIGSEGV
instead of do_exit() there, but that might need a bit of massage in
asm glue - it's called on the way out of kernel, right before handling
signals. I'd like comments from davem on that one, though.

5) in xtensa fast_syscall_spill_registers() stuff. Might or might not
be similar to the above.

6) sparc64 in tsb_grow() - looks like "impossible case, kill the sucker
dead if that ever happens". Not sure if it's reachable at all.

7) s390 copy_thread() is doing something interesting in kernel thread
case - frame->childregs.gprs[11] = (unsigned long)do_exit;
AFAICS, had been unused since 30dcb0996e40, when s390 switched to new
kernel_execve() semantics and kernel_thread_starter stopped using r11
(or proceeding to do_exit() in the first place). Ought to be removed,
if s390 folks ACK that.

8) x86:emulate_vsyscall(), x86:save_v86_state(), m68k:fpsp040_die(),
mips:bad_stack(), s390:__s390_handle_mcck(), ia64:mca_handler_bh(),
s390:default_trap_handler() - fuck knows.

9) seccomp stuff - this one should *NOT* be switched to posting signals;
it's on syscall_trace_enter() paths and we'd better have signal-equivalent
environment there. We sure as hell do have regular "stop and let tracer
poke around" in the same area - that's where strace is poking around.

10) there's a (moderate) bunch of places all over the tree where we
have kthread() payload hit do_exit(), with or without complete() or
module_put(). No ptrace stuff is going to be hit there and I see no
point in switching those to posting anything. In particular,
module_put_and_exit() sure as hell does *NOT* want to return to caller -
it might've been unmapped by the point we are done. This do_exit()
should really be noreturn.

11) abuses in kernel/kthread.c; AFAICS, it's misused as a mechanism
to return an error value to parent. No ptrace possible (parent
definitely not traced) and I don't see any point in delaying the
handling of that do_exit() either (same as with the execve failure
in call_usermodehelper_exec_async()).

12) io-uring threads hitting do_exit(). These, apparently, can be
ptraced...

13) there's bdflush(1, whatever), which is equivalent to exit(0).
IMO it's long past the time to simply remove the sucker.

14) reboot(2) stuff. No idea.

15) syscall_user_dispatch(). Didn't have time to look through that
stuff in details yet, so no idea at the moment.

2021-06-25 08:50:12

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal

Hi Eric,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.13-rc7 next-20210624]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/signal-sh-Use-force_sig-SIGKILL-instead-of-do_group_exit-SIGKILL/20210625-040018
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 4a09d388f2ab382f217a764e6a152b3f614246f6
config: riscv-randconfig-s032-20210622 (attached as .config)
compiler: riscv64-linux-gcc (GCC) 9.3.0
reproduce:
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-341-g8af24329-dirty
# https://github.com/0day-ci/linux/commit/096b21cc14d8d22f557833af71ad16318cfe51f0
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Eric-W-Biederman/signal-sh-Use-force_sig-SIGKILL-instead-of-do_group_exit-SIGKILL/20210625-040018
git checkout 096b21cc14d8d22f557833af71ad16318cfe51f0
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=riscv

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)
kernel/signal.c: note: in included file (through include/uapi/asm-generic/signal.h, include/asm-generic/signal.h, arch/riscv/include/generated/uapi/asm/signal.h, ...):
include/uapi/asm-generic/signal-defs.h:82:29: sparse: sparse: multiple address spaces given
kernel/signal.c:195:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:195:31: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:195:31: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:198:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:198:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:198:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:535:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:535:9: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:535:9: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:539:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:539:34: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:539:34: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:572:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:572:9: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:572:9: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:575:36: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:575:36: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:575:36: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:597:53: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct k_sigaction *ka @@ got struct k_sigaction [noderef] __rcu * @@
kernel/signal.c:597:53: sparse: expected struct k_sigaction *ka
kernel/signal.c:597:53: sparse: got struct k_sigaction [noderef] __rcu *
include/uapi/asm-generic/signal-defs.h:82:29: sparse: sparse: multiple address spaces given
kernel/signal.c:750:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:750:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:750:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:752:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:752:31: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:752:31: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:939:9: sparse: sparse: cast removes address space '__rcu' of expression
>> kernel/signal.c:1072:63: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sighand_struct *const sighand @@ got struct sighand_struct [noderef] __rcu *sighand @@
kernel/signal.c:1072:63: sparse: expected struct sighand_struct *const sighand
kernel/signal.c:1072:63: sparse: got struct sighand_struct [noderef] __rcu *sighand
kernel/signal.c:1156:9: sparse: sparse: cast removes address space '__rcu' of expression
kernel/signal.c:1397:9: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:1397:9: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:1397:9: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:1398:16: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct k_sigaction *action @@ got struct k_sigaction [noderef] __rcu * @@
kernel/signal.c:1398:16: sparse: expected struct k_sigaction *action
kernel/signal.c:1398:16: sparse: got struct k_sigaction [noderef] __rcu *
kernel/signal.c:1415:34: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:1415:34: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:1415:34: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:1726:17: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:1726:17: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:1726:17: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:1728:42: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:1728:42: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:1728:42: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:1932:36: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:1932:36: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:1932:36: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2042:44: sparse: sparse: cast removes address space '__rcu' of expression
kernel/signal.c:2061:65: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *parent @@
kernel/signal.c:2061:65: sparse: expected struct task_struct *tsk
kernel/signal.c:2061:65: sparse: got struct task_struct [noderef] __rcu *parent
kernel/signal.c:2062:40: sparse: sparse: cast removes address space '__rcu' of expression
kernel/signal.c:2080:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sighand_struct *psig @@ got struct sighand_struct [noderef] __rcu *[noderef] __rcu sighand @@
kernel/signal.c:2080:14: sparse: expected struct sighand_struct *psig
kernel/signal.c:2080:14: sparse: got struct sighand_struct [noderef] __rcu *[noderef] __rcu sighand
kernel/signal.c:2109:46: sparse: sparse: incorrect type in argument 3 (different address spaces) @@ expected struct task_struct *t @@ got struct task_struct [noderef] __rcu *parent @@
kernel/signal.c:2109:46: sparse: expected struct task_struct *t
kernel/signal.c:2109:46: sparse: got struct task_struct [noderef] __rcu *parent
kernel/signal.c:2110:34: sparse: sparse: incorrect type in argument 2 (different address spaces) @@ expected struct task_struct *parent @@ got struct task_struct [noderef] __rcu *parent @@
kernel/signal.c:2110:34: sparse: expected struct task_struct *parent
kernel/signal.c:2110:34: sparse: got struct task_struct [noderef] __rcu *parent
kernel/signal.c:2139:24: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *parent @@ got struct task_struct [noderef] __rcu *parent @@
kernel/signal.c:2139:24: sparse: expected struct task_struct *parent
kernel/signal.c:2139:24: sparse: got struct task_struct [noderef] __rcu *parent
kernel/signal.c:2142:24: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct task_struct *parent @@ got struct task_struct [noderef] __rcu *real_parent @@
kernel/signal.c:2142:24: sparse: expected struct task_struct *parent
kernel/signal.c:2142:24: sparse: got struct task_struct [noderef] __rcu *real_parent
kernel/signal.c:2175:17: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct sighand_struct *sighand @@ got struct sighand_struct [noderef] __rcu *sighand @@
kernel/signal.c:2175:17: sparse: expected struct sighand_struct *sighand
kernel/signal.c:2175:17: sparse: got struct sighand_struct [noderef] __rcu *sighand
kernel/signal.c:2250:41: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2250:41: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2250:41: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2252:39: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2252:39: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2252:39: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2300:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2300:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2300:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2355:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2355:31: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2355:31: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2389:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2389:31: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2389:31: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2391:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2391:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2391:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2488:41: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2488:41: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2488:41: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2573:41: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2573:41: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2573:41: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2585:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2585:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2585:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:2623:52: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct task_struct *tsk @@ got struct task_struct [noderef] __rcu *parent @@
kernel/signal.c:2623:52: sparse: expected struct task_struct *tsk
kernel/signal.c:2623:52: sparse: got struct task_struct [noderef] __rcu *parent
kernel/signal.c:2625:49: sparse: sparse: cast removes address space '__rcu' of expression
kernel/signal.c:2662:49: sparse: sparse: incorrect type in initializer (different address spaces) @@ expected struct sighand_struct *sighand @@ got struct sighand_struct [noderef] __rcu *sighand @@
kernel/signal.c:2662:49: sparse: expected struct sighand_struct *sighand
kernel/signal.c:2662:49: sparse: got struct sighand_struct [noderef] __rcu *sighand
kernel/signal.c:2991:27: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:2991:27: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:2991:27: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3011:29: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3011:29: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3011:29: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3078:27: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3078:27: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3078:27: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3080:29: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3080:29: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3080:29: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3231:31: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3231:31: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3231:31: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3234:33: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3234:33: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3234:33: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3617:27: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@
kernel/signal.c:3617:27: sparse: expected struct spinlock [usertype] *lock
kernel/signal.c:3617:27: sparse: got struct spinlock [noderef] __rcu *
kernel/signal.c:3629:37: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct spinlock [usertype] *lock @@ got struct spinlock [noderef] __rcu * @@

vim +1072 kernel/signal.c

1068
1069 void start_group_exit(int exit_code)
1070 {
1071 if (!fatal_signal_pending(current)) {
> 1072 struct sighand_struct *const sighand = current->sighand;
1073
1074 spin_lock_irq(&sighand->siglock);
1075 if (!fatal_signal_pending(current))
1076 start_group_exit_locked(current->signal, exit_code);
1077 spin_unlock_irq(&sighand->siglock);
1078 }
1079 }
1080

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (17.73 kB)
.config.gz (29.21 kB)
Download all attachments

2021-06-26 03:21:42

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation

On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote:
>
> Factor out force_sig_seccomp from the seccomp signal generation and
> place it in kernel/signal.c. The function force_sig_seccomp takes a
> parameter force_coredump to indicate that the sigaction field should be
> reset to SIG_DFL so that a coredump will be generated when the signal
> is delivered.

Ah! This is the part I missed when I was originally trying to figure
out the coredump stuff. It's the need for setting a default handler
(i.e. doing a coredump)?

> force_sig_seccomp is then used to replace both seccomp_send_sigsys
> and seccomp_init_siginfo.
>
> force_sig_info_to_task gains an extra parameter to force using
> the default signal action.
>
> With this change seccomp is no longer a special case and there
> becomes exactly one place do_coredump is called from.

Looks good to me. This may benefit from force_sig_seccomp() to be wrapped
in an #ifdef CONFIG_SECCOMP.

(This patch reminds me that the seccomp self tests don't check for core
dumps...)

--
Kees Cook

2021-06-26 03:24:53

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 4/9] signal: Factor start_group_exit out of complete_signal

On Thu, Jun 24, 2021 at 02:01:20PM -0500, Eric W. Biederman wrote:
> +static void start_group_exit_locked(struct signal_struct *signal, int exit_code)
> +{
> + /*
> + * Start a group exit and wake everybody up.
> + * This way we don't have other threads
> + * running and doing things after a slower
> + * thread has the fatal signal pending.
> + */
> + struct task_struct *t;
> +
> + signal->flags = SIGNAL_GROUP_EXIT;
> + signal->group_exit_code = exit_code;
> + signal->group_stop_count = 0;
> + __for_each_thread(signal, t) {
> + task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> +
> + /* Don't bother with already dead threads */
> + if (t->exit_state)
> + continue;
> + sigaddset(&t->pending.signal, SIGKILL);
> + signal_wake_up(t, 1);
> + }

This both extracts it and changes it. For ease-of-review, maybe split
this patch into the move and then the logic changes?

--
Kees Cook

2021-06-26 03:36:12

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 5/9] signal/group_exit: Use start_group_exit in place of do_group_exit

On Thu, Jun 24, 2021 at 02:01:40PM -0500, Eric W. Biederman wrote:
>
> Make thread exiting uniform by causing all threads to pass through
> get_signal when they are exiting. This simplifies the analysis
> of synchronization during exit and guarantees that a full set
> of registers will be available for ptrace to examine for
> threads that stop at PTRACE_EVENT_EXIT.

Yeah, cool. I do like making the process lifetime more sensible here. It
always threw me that do_exit*() just stopped execution. :)

For future me, can you add a comment on start_group_exit() that mentions
where final process death happens?

>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> kernel/exit.c | 4 ++--
> kernel/seccomp.c | 2 +-
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd1c04193e18..921519d80b56 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -931,8 +931,8 @@ do_group_exit(int exit_code)
> */
> SYSCALL_DEFINE1(exit_group, int, error_code)
> {
> - do_group_exit((error_code & 0xff) << 8);
> - /* NOTREACHED */
> + start_group_exit((error_code & 0xff) << 8);
> + /* get_signal will call do_exit */
> return 0;

"0" feels weird here, but I can't think of any better "fail closed"
error code here.

> }
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 5301eca670a0..b1c06fd1b205 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1250,7 +1250,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> if (action == SECCOMP_RET_KILL_THREAD)
> do_exit(SIGSYS);
> else
> - do_group_exit(SIGSYS);
> + start_group_exit(SIGSYS);

This could use a similar comment to the syscall's comment, just so I
don't panic when I read this code in like 3 years. ;)

Otherwise, yeah, looks good.

-Kees

> }
> return -1;
> }
> --
> 2.20.1
>

--
Kees Cook

2021-06-26 03:44:12

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads

On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote:
>
> Fold do_group_exit into get_signal as it is the last caller.
>
> Move the group_exit logic above the PF_IO_WORKER exit, ensuring
> that if a PF_IO_WORKER catches SIGKILL every thread in
> the thread group will exit not just the PF_IO_WORKER.
>
> Now that the information is easily available, only set PF_SIGNALED
> when it was a signal that caused the exit.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/sched/task.h | 1 -
> kernel/exit.c | 31 -------------------------------
> kernel/signal.c | 35 +++++++++++++++++++++++++----------
> 3 files changed, 25 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index ef02be869cf2..45525512e3d0 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
> {
> }
> #endif
> -extern void do_group_exit(int);
>
> extern void exit_files(struct task_struct *);
> extern void exit_itimers(struct signal_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 921519d80b56..635f434122b7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
> do_exit((error_code&0xff)<<8);
> }
>
> -/*
> - * Take down every thread in the group. This is called by fatal signals
> - * as well as by sys_exit_group (below).
> - */
> -void
> -do_group_exit(int exit_code)
> -{
> - struct signal_struct *sig = current->signal;
> -
> - BUG_ON(exit_code & 0x80); /* core dumps don't get here */
> -
> - if (signal_group_exit(sig))
> - exit_code = sig->group_exit_code;
> - else if (!thread_group_empty(current)) {
> - struct sighand_struct *const sighand = current->sighand;
> -
> - spin_lock_irq(&sighand->siglock);
> - if (signal_group_exit(sig))
> - /* Another thread got here before we took the lock. */
> - exit_code = sig->group_exit_code;
> - else {
> - sig->group_exit_code = exit_code;
> - sig->flags = SIGNAL_GROUP_EXIT;
> - zap_other_threads(current);

Oh, now I see it: the "new code" in start_group_exit() is an open-coded
zap_other_threads()? That wasn't clear to me, but makes sense now.

> - }
> - spin_unlock_irq(&sighand->siglock);
> - }
> -
> - do_exit(exit_code);
> - /* NOTREACHED */
> -}
>
> /*
> * this kills every thread in the thread group. Note that any externally
> diff --git a/kernel/signal.c b/kernel/signal.c
> index c79c010ca5f3..95a076af600a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
> {
> struct sighand_struct *sighand = current->sighand;
> struct signal_struct *signal = current->signal;
> + int exit_code;
> int signr;
>
> if (unlikely(current->task_works))
> @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
> /*
> * Anything else is fatal, maybe with a core dump.
> */
> - current->flags |= PF_SIGNALED;
> -
> if (sig_kernel_coredump(signr)) {
> if (print_fatal_signals)
> print_fatal_signal(ksig->info.si_signo);
> @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
> /*
> * If it was able to dump core, this kills all
> * other threads in the group and synchronizes with
> - * their demise. If we lost the race with another
> - * thread getting here, it set group_exit_code
> - * first and our do_group_exit call below will use
> - * that value and ignore the one we pass it.
> + * their demise. If another thread makes it
> + * to do_coredump first, it will set group_exit_code
> + * which will be passed to do_exit.
> */
> do_coredump(&ksig->info);
> }
>
> + /*
> + * Death signals, no core dump.
> + */
> + exit_code = signr;
> + if (signal_group_exit(signal)) {
> + exit_code = signal->group_exit_code;
> + } else {
> + spin_lock_irq(&sighand->siglock);
> + if (signal_group_exit(signal)) {
> + /* Another thread got here before we took the lock. */
> + exit_code = signal->group_exit_code;
> + } else {
> + start_group_exit_locked(signal, exit_code);

And here's the "if we didn't already do start_group_exit(), do it here".
And that state is entirely captured via the SIGNAL_GROUP_EXIT flag.
Cool.

> + }
> + spin_unlock_irq(&sighand->siglock);
> + }
> +
> + if (exit_code & 0x7f)
> + current->flags |= PF_SIGNALED;
> +
> /*
> * PF_IO_WORKER threads will catch and exit on fatal signals
> * themselves. They have cleanup that must be performed, so
> @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
> if (current->flags & PF_IO_WORKER)
> goto out;
>
> - /*
> - * Death signals, no core dump.
> - */
> - do_group_exit(ksig->info.si_signo);
> + do_exit(exit_code);
> /* NOTREACHED */
> }
> spin_unlock_irq(&sighand->siglock);
> --
> 2.20.1
>

--
Kees Cook

2021-06-26 05:58:54

by Kees Cook

Subject: Re: [PATCH 8/9] signal/task_exit: Use start_task_exit in place of do_exit

On Thu, Jun 24, 2021 at 02:03:25PM -0500, Eric W. Biederman wrote:
>
> Reuse start_task_exit_locked to implement start_task_exit.
>
> Simplify the exit logic by having all exits go through get_signal.
> This simplifies the analysis of synchronization during exit and
> guarantees a full set of registers will be available for ptrace to
> examine at PTRACE_EVENT_EXIT.
>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
> include/linux/sched/signal.h | 1 +
> kernel/exit.c | 4 +++-
> kernel/seccomp.c | 2 +-
> kernel/signal.c | 12 ++++++++++++
> 4 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
> index a958381ba4a9..3f4e69c019b7 100644
> --- a/include/linux/sched/signal.h
> +++ b/include/linux/sched/signal.h
> @@ -430,6 +430,7 @@ static inline void ptrace_signal_wake_up(struct task_struct *t, bool resume)
>
> void start_group_exit(int exit_code);
> void start_task_exit_locked(struct task_struct *task, int exit_code);
> +void start_task_exit(int exit_code);
>
> void task_join_group_stop(struct task_struct *task);
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 635f434122b7..51e0c82b3f7d 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -889,7 +889,9 @@ EXPORT_SYMBOL(complete_and_exit);
>
> SYSCALL_DEFINE1(exit, int, error_code)
> {
> - do_exit((error_code&0xff)<<8);
> + start_task_exit((error_code&0xff)<<8);
> + /* get_signal will call do_exit */
> + return 0;
> }
>
>
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b1c06fd1b205..e0c4c123a8bf 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -1248,7 +1248,7 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
> force_sig_seccomp(this_syscall, data, true);
> } else {
> if (action == SECCOMP_RET_KILL_THREAD)
> - do_exit(SIGSYS);
> + start_task_exit(SIGSYS);
> else
> start_group_exit(SIGSYS);
> }

Looks good, yeah.

> diff --git a/kernel/signal.c b/kernel/signal.c
> index afbc001220dd..63fda9b6bbf9 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -1424,6 +1424,18 @@ void start_task_exit_locked(struct task_struct *task, int exit_code)
> }
> }
>
> +void start_task_exit(int exit_code)
> +{
> + struct task_struct *task = current;
> + if (!fatal_signal_pending(task)) {
> + struct sighand_struct *const sighand = task->sighand;
> + spin_lock_irq(&sighand->siglock);
> + if (!fatal_signal_pending(current))

efficiency nit: "task" instead of "current" here, yes?

> + start_task_exit_locked(task, exit_code);
> + spin_unlock_irq(&sighand->siglock);
> + }
> +}
> +
> struct sighand_struct *__lock_task_sighand(struct task_struct *tsk,
> unsigned long *flags)
> {
> --
> 2.20.1
>

--
Kees Cook

2021-06-27 22:17:40

by Al Viro

Subject: Re: [PATCH 0/9] Refactoring exit

On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:

> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
> IMO it's long past the time to simply remove the sucker.

Incidentally, calling that from ptraced process on alpha leads to
the same headache for tracer. _If_ we leave it around, this is
another candidate for "hit yourself with that special signal" -
both alpha and m68k have that syscall, and IMO adding an asm
wrapper for that one is over the top.

Said that, we really ought to bury that thing:

commit 2f268ee88abb33968501a44368db55c63adaad40
Author: Andrew Morton <[email protected]>
Date: Sat Dec 14 03:16:29 2002 -0800

[PATCH] deprecate use of bdflush()

Patch from Robert Love <[email protected]>

We can never get rid of it if we do not deprecate it - so do so and
print a stern warning to those who still run bdflush daemons.

Deprecated for 18.5 years by now - I seriously suspect that we have
some contributors younger than that...

2021-06-27 23:10:13

by Michael Schmitz

Subject: Re: [PATCH 0/9] Refactoring exit

On 28/06/21 10:13 am, Al Viro wrote:

> On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
>
>> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
>> IMO it's long past the time to simply remove the sucker.
> Incidentally, calling that from ptraced process on alpha leads to
> the same headache for tracer. _If_ we leave it around, this is
> another candidate for "hit yourself with that special signal" -
> both alpha and m68k have that syscall, and IMO adding an asm
> wrapper for that one is over the top.
>
> Said that, we really ought to bury that thing:
>
> commit 2f268ee88abb33968501a44368db55c63adaad40
> Author: Andrew Morton <[email protected]>
> Date: Sat Dec 14 03:16:29 2002 -0800
>
> [PATCH] deprecate use of bdflush()
>
> Patch from Robert Love <[email protected]>
>
> We can never get rid of it if we do not deprecate it - so do so and
> print a stern warning to those who still run bdflush daemons.
>
> Deprecated for 18.5 years by now - I seriously suspect that we have
> some contributors younger than that...

Haven't found that warning in over 7 years' worth of console logs, and
I'm a good candidate for running the oldest userland in existence for m68k.

Time to let it go.

Cheers,

    Michael


2021-06-28 07:48:32

by Geert Uytterhoeven

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Michael,

On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <[email protected]> wrote:
> On 28/06/21 10:13 am, Al Viro wrote:
> > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
> >
> >> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
> >> IMO it's long past the time to simply remove the sucker.
> > Incidentally, calling that from ptraced process on alpha leads to
> > the same headache for tracer. _If_ we leave it around, this is
> > another candidate for "hit yourself with that special signal" -
> > both alpha and m68k have that syscall, and IMO adding an asm
> > wrapper for that one is over the top.
> >
> > Said that, we really ought to bury that thing:
> >
> > commit 2f268ee88abb33968501a44368db55c63adaad40
> > Author: Andrew Morton <[email protected]>
> > Date: Sat Dec 14 03:16:29 2002 -0800
> >
> > [PATCH] deprecate use of bdflush()
> >
> > Patch from Robert Love <[email protected]>
> >
> > We can never get rid of it if we do not deprecate it - so do so and
> > print a stern warning to those who still run bdflush daemons.
> >
> > Deprecated for 18.5 years by now - I seriously suspect that we have
> > some contributors younger than that...
>
> Haven't found that warning in over 7 years' worth of console logs, and
> I'm a good candidate for running the oldest userland in existence for m68k.
>
> Time to let it go.

The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
which is a very old ramdisk from right after the m68k a.out to ELF
transition:

warning: process `update' used the obsolete bdflush system call
Fix your initscripts?

I still boot it, once in a while.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-06-28 14:19:44

by kernel test robot

Subject: [signal/seccomp] 3fdd8c68c2: kernel-selftests.seccomp.seccomp_bpf.fail



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 3fdd8c68c2f4a6753189c4f44e9a60ff415b01a1 ("[PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation")
url: https://github.com/0day-ci/linux/commits/Eric-W-Biederman/signal-sh-Use-force_sig-SIGKILL-instead-of-do_group_exit-SIGKILL/20210625-040018
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 4a09d388f2ab382f217a764e6a152b3f614246f6

in testcase: kernel-selftests
version: kernel-selftests-x86_64-f8879e85-1_20210621
with following parameters:

group: group-s
ucode: 0xe2

test-description: The kernel contains a set of "self tests" under the tools/testing/selftests/ directory. These are intended to be small unit tests to exercise individual code paths in the kernel.
test-url: https://www.kernel.org/doc/Documentation/kselftest.txt


on test machine: 8 threads Intel(R) Core(TM) i7-6770HQ CPU @ 2.60GHz with 32G memory

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):




If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

KERNEL SELFTESTS: linux_headers_dir is /usr/src/linux-headers-x86_64-rhel-8.3-kselftests-3fdd8c68c2f4a6753189c4f44e9a60ff415b01a1
2021-06-27 11:02:28 ln -sf /usr/bin/clang
2021-06-27 11:02:28 ln -sf /usr/bin/llc
2021-06-27 11:02:28 sed -i s/default_timeout=45/default_timeout=300/ kselftest/runner.sh
2021-06-27 11:02:29 sed -i s/default_timeout=45/default_timeout=300/ /kselftests/kselftest/runner.sh
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:02:29 /kselftests/run_kselftest.sh -c safesetid
TAP version 13
1..1
# selftests: safesetid: safesetid-test.sh
# mounting securityfs failed
# safesetid-test.sh: done
ok 1 selftests: safesetid: safesetid-test.sh
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:02:29 /kselftests/run_kselftest.sh -c seccomp
TAP version 13
1..2
# selftests: seccomp: seccomp_bpf
# TAP version 13
# 1..87
# # Starting 87 tests from 7 test cases.
# # RUN global.kcmp ...
# # OK global.kcmp
# ok 1 global.kcmp
# # RUN global.mode_strict_support ...
# # OK global.mode_strict_support
# ok 2 global.mode_strict_support
# # RUN global.mode_strict_cannot_call_prctl ...
# # OK global.mode_strict_cannot_call_prctl
# ok 3 global.mode_strict_cannot_call_prctl
# # RUN global.no_new_privs_support ...
# # OK global.no_new_privs_support
# ok 4 global.no_new_privs_support
# # RUN global.mode_filter_support ...
# # OK global.mode_filter_support
# ok 5 global.mode_filter_support
# # RUN global.mode_filter_without_nnp ...
# # OK global.mode_filter_without_nnp
# ok 6 global.mode_filter_without_nnp
# # RUN global.filter_size_limits ...
# # OK global.filter_size_limits
# ok 7 global.filter_size_limits
# # RUN global.filter_chain_limits ...
# # OK global.filter_chain_limits
# ok 8 global.filter_chain_limits
# # RUN global.mode_filter_cannot_move_to_strict ...
# # OK global.mode_filter_cannot_move_to_strict
# ok 9 global.mode_filter_cannot_move_to_strict
# # RUN global.mode_filter_get_seccomp ...
# # OK global.mode_filter_get_seccomp
# ok 10 global.mode_filter_get_seccomp
# # RUN global.ALLOW_all ...
# # OK global.ALLOW_all
# ok 11 global.ALLOW_all
# # RUN global.empty_prog ...
# # OK global.empty_prog
# ok 12 global.empty_prog
# # RUN global.log_all ...
# # OK global.log_all
# ok 13 global.log_all
# # RUN global.unknown_ret_is_kill_inside ...
# # OK global.unknown_ret_is_kill_inside
# ok 14 global.unknown_ret_is_kill_inside
# # RUN global.unknown_ret_is_kill_above_allow ...
# # OK global.unknown_ret_is_kill_above_allow
# ok 15 global.unknown_ret_is_kill_above_allow
# # RUN global.KILL_all ...
# # OK global.KILL_all
# ok 16 global.KILL_all
# # RUN global.KILL_one ...
# # OK global.KILL_one
# ok 17 global.KILL_one
# # RUN global.KILL_one_arg_one ...
# # OK global.KILL_one_arg_one
# ok 18 global.KILL_one_arg_one
# # RUN global.KILL_one_arg_six ...
# # OK global.KILL_one_arg_six
# ok 19 global.KILL_one_arg_six
# # RUN global.KILL_thread ...
# # OK global.KILL_thread
# ok 20 global.KILL_thread
# # RUN global.KILL_process ...
# # OK global.KILL_process
# ok 21 global.KILL_process
# # RUN global.KILL_unknown ...
# # OK global.KILL_unknown
# ok 22 global.KILL_unknown
# # RUN global.arg_out_of_range ...
# # OK global.arg_out_of_range
# ok 23 global.arg_out_of_range
# # RUN global.ERRNO_valid ...
# # OK global.ERRNO_valid
# ok 24 global.ERRNO_valid
# # RUN global.ERRNO_zero ...
# # OK global.ERRNO_zero
# ok 25 global.ERRNO_zero
# # RUN global.ERRNO_capped ...
# # OK global.ERRNO_capped
# ok 26 global.ERRNO_capped
# # RUN global.ERRNO_order ...
# # OK global.ERRNO_order
# ok 27 global.ERRNO_order
# # RUN global.negative_ENOSYS ...
# # OK global.negative_ENOSYS
# ok 28 global.negative_ENOSYS
# # RUN global.seccomp_syscall ...
# # OK global.seccomp_syscall
# ok 29 global.seccomp_syscall
# # RUN global.seccomp_syscall_mode_lock ...
# # OK global.seccomp_syscall_mode_lock
# ok 30 global.seccomp_syscall_mode_lock
# # RUN global.detect_seccomp_filter_flags ...
# # OK global.detect_seccomp_filter_flags
# ok 31 global.detect_seccomp_filter_flags
# # RUN global.TSYNC_first ...
# # OK global.TSYNC_first
# ok 32 global.TSYNC_first
# # RUN global.syscall_restart ...
# # OK global.syscall_restart
# ok 33 global.syscall_restart
# # RUN global.filter_flag_log ...
# # OK global.filter_flag_log
# ok 34 global.filter_flag_log
# # RUN global.get_action_avail ...
# # OK global.get_action_avail
# ok 35 global.get_action_avail
# # RUN global.get_metadata ...
# # OK global.get_metadata
# ok 36 global.get_metadata
# # RUN global.user_notification_basic ...
# # OK global.user_notification_basic
# ok 37 global.user_notification_basic
# # RUN global.user_notification_with_tsync ...
# # OK global.user_notification_with_tsync
# ok 38 global.user_notification_with_tsync
# # RUN global.user_notification_kill_in_middle ...
# # OK global.user_notification_kill_in_middle
# ok 39 global.user_notification_kill_in_middle
# # RUN global.user_notification_signal ...
# # OK global.user_notification_signal
# ok 40 global.user_notification_signal
# # RUN global.user_notification_closed_listener ...
# # OK global.user_notification_closed_listener
# ok 41 global.user_notification_closed_listener
# # RUN global.user_notification_child_pid_ns ...
# # OK global.user_notification_child_pid_ns
# ok 42 global.user_notification_child_pid_ns
# # RUN global.user_notification_sibling_pid_ns ...
# # OK global.user_notification_sibling_pid_ns
# ok 43 global.user_notification_sibling_pid_ns
# # RUN global.user_notification_fault_recv ...
# # OK global.user_notification_fault_recv
# ok 44 global.user_notification_fault_recv
# # RUN global.seccomp_get_notif_sizes ...
# # OK global.seccomp_get_notif_sizes
# ok 45 global.seccomp_get_notif_sizes
# # RUN global.user_notification_continue ...
# # OK global.user_notification_continue
# ok 46 global.user_notification_continue
# # RUN global.user_notification_filter_empty ...
# # OK global.user_notification_filter_empty
# ok 47 global.user_notification_filter_empty
# # RUN global.user_notification_filter_empty_threaded ...
# # OK global.user_notification_filter_empty_threaded
# ok 48 global.user_notification_filter_empty_threaded
# # RUN global.user_notification_addfd ...
# # OK global.user_notification_addfd
# ok 49 global.user_notification_addfd
# # RUN global.user_notification_addfd_rlimit ...
# # OK global.user_notification_addfd_rlimit
# ok 50 global.user_notification_addfd_rlimit
# # RUN TRAP.dfl ...
# # OK TRAP.dfl
# ok 51 TRAP.dfl
# # RUN TRAP.ign ...
# # OK TRAP.ign
# ok 52 TRAP.ign
# # RUN TRAP.handler ...
# # OK TRAP.handler
# ok 53 TRAP.handler
# # RUN precedence.allow_ok ...
# # OK precedence.allow_ok
# ok 54 precedence.allow_ok
# # RUN precedence.kill_is_highest ...
# # OK precedence.kill_is_highest
# ok 55 precedence.kill_is_highest
# # RUN precedence.kill_is_highest_in_any_order ...
# # OK precedence.kill_is_highest_in_any_order
# ok 56 precedence.kill_is_highest_in_any_order
# # RUN precedence.trap_is_second ...
# # OK precedence.trap_is_second
# ok 57 precedence.trap_is_second
# # RUN precedence.trap_is_second_in_any_order ...
# # OK precedence.trap_is_second_in_any_order
# ok 58 precedence.trap_is_second_in_any_order
# # RUN precedence.errno_is_third ...
# # OK precedence.errno_is_third
# ok 59 precedence.errno_is_third
# # RUN precedence.errno_is_third_in_any_order ...
# # OK precedence.errno_is_third_in_any_order
# ok 60 precedence.errno_is_third_in_any_order
# # RUN precedence.trace_is_fourth ...
# # OK precedence.trace_is_fourth
# ok 61 precedence.trace_is_fourth
# # RUN precedence.trace_is_fourth_in_any_order ...
# # OK precedence.trace_is_fourth_in_any_order
# ok 62 precedence.trace_is_fourth_in_any_order
# # RUN precedence.log_is_fifth ...
# # OK precedence.log_is_fifth
# ok 63 precedence.log_is_fifth
# # RUN precedence.log_is_fifth_in_any_order ...
# # OK precedence.log_is_fifth_in_any_order
# ok 64 precedence.log_is_fifth_in_any_order
# # RUN TRACE_poke.read_has_side_effects ...
# # OK TRACE_poke.read_has_side_effects
# ok 65 TRACE_poke.read_has_side_effects
# # RUN TRACE_poke.getpid_runs_normally ...
# # OK TRACE_poke.getpid_runs_normally
# ok 66 TRACE_poke.getpid_runs_normally
# # RUN TRACE_syscall.ptrace.negative_ENOSYS ...
# # OK TRACE_syscall.ptrace.negative_ENOSYS
# ok 67 TRACE_syscall.ptrace.negative_ENOSYS
# # RUN TRACE_syscall.ptrace.syscall_allowed ...
# # OK TRACE_syscall.ptrace.syscall_allowed
# ok 68 TRACE_syscall.ptrace.syscall_allowed
# # RUN TRACE_syscall.ptrace.syscall_redirected ...
# # OK TRACE_syscall.ptrace.syscall_redirected
# ok 69 TRACE_syscall.ptrace.syscall_redirected
# # RUN TRACE_syscall.ptrace.syscall_errno ...
# # OK TRACE_syscall.ptrace.syscall_errno
# ok 70 TRACE_syscall.ptrace.syscall_errno
# # RUN TRACE_syscall.ptrace.syscall_faked ...
# # OK TRACE_syscall.ptrace.syscall_faked
# ok 71 TRACE_syscall.ptrace.syscall_faked
# # RUN TRACE_syscall.ptrace.skip_after ...
# # OK TRACE_syscall.ptrace.skip_after
# ok 72 TRACE_syscall.ptrace.skip_after
# # RUN TRACE_syscall.ptrace.kill_after ...
# # seccomp_bpf.c:2019:kill_after:Expected entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY : PTRACE_EVENTMSG_SYSCALL_EXIT (1) == msg (0)
# # seccomp_bpf.c:2019:kill_after:Expected entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY : PTRACE_EVENTMSG_SYSCALL_EXIT (2) == msg (1)
# # seccomp_bpf.c:2019:kill_after:Expected entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY : PTRACE_EVENTMSG_SYSCALL_EXIT (1) == msg (2)
# # kill_after: Test exited normally instead of by signal (code: 12)
# # FAIL TRACE_syscall.ptrace.kill_after
# not ok 73 TRACE_syscall.ptrace.kill_after
# # RUN TRACE_syscall.seccomp.negative_ENOSYS ...
# # OK TRACE_syscall.seccomp.negative_ENOSYS
# ok 74 TRACE_syscall.seccomp.negative_ENOSYS
# # RUN TRACE_syscall.seccomp.syscall_allowed ...
# # OK TRACE_syscall.seccomp.syscall_allowed
# ok 75 TRACE_syscall.seccomp.syscall_allowed
# # RUN TRACE_syscall.seccomp.syscall_redirected ...
# # OK TRACE_syscall.seccomp.syscall_redirected
# ok 76 TRACE_syscall.seccomp.syscall_redirected
# # RUN TRACE_syscall.seccomp.syscall_errno ...
# # OK TRACE_syscall.seccomp.syscall_errno
# ok 77 TRACE_syscall.seccomp.syscall_errno
# # RUN TRACE_syscall.seccomp.syscall_faked ...
# # OK TRACE_syscall.seccomp.syscall_faked
# ok 78 TRACE_syscall.seccomp.syscall_faked
# # RUN TRACE_syscall.seccomp.skip_after ...
# # OK TRACE_syscall.seccomp.skip_after
# ok 79 TRACE_syscall.seccomp.skip_after
# # RUN TRACE_syscall.seccomp.kill_after ...
# # seccomp_bpf.c:1543:kill_after:Expected !ptrace_syscall (1) == IS_SECCOMP_EVENT(status) (0)
# # kill_after: Test exited normally instead of by signal (code: 0)
# # FAIL TRACE_syscall.seccomp.kill_after
# not ok 80 TRACE_syscall.seccomp.kill_after
# # RUN TSYNC.siblings_fail_prctl ...
# # OK TSYNC.siblings_fail_prctl
# ok 81 TSYNC.siblings_fail_prctl
# # RUN TSYNC.two_siblings_with_ancestor ...
# # OK TSYNC.two_siblings_with_ancestor
# ok 82 TSYNC.two_siblings_with_ancestor
# # RUN TSYNC.two_sibling_want_nnp ...
# # OK TSYNC.two_sibling_want_nnp
# ok 83 TSYNC.two_sibling_want_nnp
# # RUN TSYNC.two_siblings_with_no_filter ...
# # OK TSYNC.two_siblings_with_no_filter
# ok 84 TSYNC.two_siblings_with_no_filter
# # RUN TSYNC.two_siblings_with_one_divergence ...
# # OK TSYNC.two_siblings_with_one_divergence
# ok 85 TSYNC.two_siblings_with_one_divergence
# # RUN TSYNC.two_siblings_with_one_divergence_no_tid_in_err ...
# # OK TSYNC.two_siblings_with_one_divergence_no_tid_in_err
# ok 86 TSYNC.two_siblings_with_one_divergence_no_tid_in_err
# # RUN TSYNC.two_siblings_not_under_filter ...
# # OK TSYNC.two_siblings_not_under_filter
# ok 87 TSYNC.two_siblings_not_under_filter
# # FAILED: 85 / 87 tests passed.
# # Totals: pass:85 fail:2 xfail:0 xpass:0 skip:0 error:0
not ok 1 selftests: seccomp: seccomp_bpf # exit=1
# selftests: seccomp: seccomp_benchmark
# net.core.bpf_jit_enable = 1
# net.core.bpf_jit_harden = 0
# Current BPF sysctl settings:
# Calibrating sample size for 15 seconds worth of syscalls ...
# Benchmarking 6133395 syscalls...
# 14.923201557 - 0.999613884 = 13923587673 (13.9s)
# getpid native: 2270 ns
# 27.524888508 - 14.930184379 = 12594704129 (12.6s)
# getpid RET_ALLOW 1 filter (bitmap): 2053 ns
# 39.418429896 - 27.525194300 = 11893235596 (11.9s)
# getpid RET_ALLOW 2 filters (bitmap): 1939 ns
# 54.097596955 - 39.418720961 = 14678875994 (14.7s)
# getpid RET_ALLOW 3 filters (full): 2393 ns
# 68.769520433 - 54.097906014 = 14671614419 (14.7s)
# getpid RET_ALLOW 4 filters (full): 2392 ns
# Estimated total seccomp overhead for 1 bitmapped filter: 18446744073709551399 ns
# Saw unexpected benchmark result. Try running again with more samples?
ok 2 selftests: seccomp: seccomp_benchmark
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:41 /kselftests/run_kselftest.sh -c sigaltstack
TAP version 13
1..1
# selftests: sigaltstack: sas
# TAP version 13
# 1..3
# ok 1 Initial sigaltstack state was SS_DISABLE
# # [RUN] signal USR1
# ok 2 sigaltstack is disabled in sighandler
# # [RUN] switched to user ctx
# # [RUN] signal USR2
# # [OK] Stack preserved
# ok 3 sigaltstack is still SS_AUTODISARM after signal
# # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: sigaltstack: sas
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:42 /kselftests/run_kselftest.sh -c size
TAP version 13
1..1
# selftests: size: get_size
# TAP version 13
# # Testing system size.
# ok 1 get runtime memory use
# # System runtime memory report (units in Kilobytes):
# ---
# Total: 32741296
# Free: 29442404
# Buffer: 4
# In use: 3298888
# ...
# 1..1
ok 1 selftests: size: get_size
LKP SKIP sparc64
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:42 /kselftests/run_kselftest.sh -c splice
TAP version 13
1..2
# selftests: splice: default_file_splice_read.sh
ok 1 selftests: splice: default_file_splice_read.sh
# selftests: splice: short_splice_read.sh
# splice: Invalid argument
# FAIL: /proc/1919/limits 4096
# splice: Invalid argument
# FAIL: /proc/1919/limits 2
# splice: Invalid argument
# FAIL: /proc/1919/comm 4096
# splice: Invalid argument
# FAIL: /proc/1919/comm 2
# ok: /proc/sys/fs/nr_open 4096
# ok: /proc/sys/fs/nr_open 2
# ok: /proc/sys/kernel/modprobe 4096
# ok: /proc/sys/kernel/modprobe 2
# ok: /proc/sys/kernel/version 4096
# ok: /proc/sys/kernel/version 2
# ok: /sys/module/test_module/coresize 4096
# ok: /sys/module/test_module/coresize 2
# ok: /sys/module/test_module/sections/.init.text 4096
# ok: /sys/module/test_module/sections/.init.text 2
not ok 2 selftests: splice: short_splice_read.sh # exit=1
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:43 /kselftests/run_kselftest.sh -c static_keys
TAP version 13
1..1
# selftests: static_keys: test_static_keys.sh
# static_key: ok
ok 1 selftests: static_keys: test_static_keys.sh
LKP WARN miss config CONFIG_SYNC= of sync/config
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:43 /kselftests/run_kselftest.sh -c sync
TAP version 13
1..1
# selftests: sync: sync_test
# TAP version 13
# 1..10
# # [RUN] Testing sync framework
# ok 1 [RUN] test_alloc_timeline
# ok 2 [RUN] test_alloc_fence
# ok 3 [RUN] test_alloc_fence_negative
# ok 4 [RUN] test_fence_one_timeline_wait
# ok 5 [RUN] test_fence_one_timeline_merge
# ok 6 [RUN] test_fence_merge_same_fence
# ok 7 [RUN] test_fence_multi_timeline_wait
# ok 8 [RUN] test_stress_two_threads_shared_timeline
# ok 9 [RUN] test_consumer_stress_multi_producer_single_consumer
# ok 10 [RUN] test_merge_stress_random_merge
# # Totals: pass:10 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: sync: sync_test
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:03:50 /kselftests/run_kselftest.sh -c syscall_user_dispatch
TAP version 13
1..2
# selftests: syscall_user_dispatch: sud_test
# TAP version 13
# 1..6
# # Starting 6 tests from 1 test cases.
# # RUN global.dispatch_trigger_sigsys ...
# # OK global.dispatch_trigger_sigsys
# ok 1 global.dispatch_trigger_sigsys
# # RUN global.bad_prctl_param ...
# # OK global.bad_prctl_param
# ok 2 global.bad_prctl_param
# # RUN global.dispatch_and_return ...
# # OK global.dispatch_and_return
# ok 3 global.dispatch_and_return
# # RUN global.bad_selector ...
# # OK global.bad_selector
# ok 4 global.bad_selector
# # RUN global.disable_dispatch ...
# # OK global.disable_dispatch
# ok 5 global.disable_dispatch
# # RUN global.direct_dispatch_range ...
# # OK global.direct_dispatch_range
# ok 6 global.direct_dispatch_range
# # PASSED: 6 / 6 tests passed.
# # Totals: pass:6 fail:0 xfail:0 xpass:0 skip:0 error:0
ok 1 selftests: syscall_user_dispatch: sud_test
# selftests: syscall_user_dispatch: sud_benchmark
# Enabling syscall trapping.
# Caught sys_ff00
# Calibrating test set to last ~5 seconds...
# test iterations = 2000000
# Avg syscall time 2350ns.
# trapped_call_count 1, native_call_count 0.
# Avg syscall time 2418ns.
# Interception overhead: 2.9% (+68ns).
ok 2 selftests: syscall_user_dispatch: sud_benchmark
source /lkp/lkp/src/lib/tests/kernel-selftests-ext.sh
2021-06-27 11:04:01 /kselftests/run_kselftest.sh -c sysctl
TAP version 13
1..1
# selftests: sysctl: sysctl.sh
# Checking production write strict setting ... ok
# Sun Jun 27 11:04:01 UTC 2021
# Running test: sysctl_test_0001 - run #0
# == Testing sysctl behavior against /proc/sys/debug/test_sysctl/int_0001 ==
# Writing test file ... ok
# Checking sysctl is not set to test value ... ok
# Writing sysctl from shell ... ok
# Resetting sysctl to original value ... ok
# Writing entire sysctl in single write ... ok
# Writing middle of sysctl after synchronized seek ... ok
# Writing beyond end of sysctl ... ok
# Writing sysctl with multiple long writes ... ok
# Testing that 0x0000000100000000 fails as expected...ok
# Testing that 0x0000000100000001 fails as expected...ok
# Testing that 0x00000001ffffffff fails as expected...ok
# Testing that 0x0000000180000000 fails as expected...ok
# Testing that 0x000000017fffffff fails as expected...ok
# Testing that 0xffffffff00000000 fails as expected...ok
# Testing that 0xffffffff00000001 fails as expected...ok
# Testing that 0xffffffffffffffff fails as expected...ok
# Testing that 0xffffffff80000000 fails as expected...ok
# Testing that 0xffffffff7fffffff fails as expected...ok
# Testing that -0x0000000100000000 fails as expected...ok
# Testing that -0x0000000100000001 fails as expected...ok
# Testing that -0x00000001ffffffff fails as expected...ok
# Testing that -0x0000000180000000 fails as expected...ok
# Testing that -0x000000017fffffff fails as expected...ok
# Testing that -0xffffffff00000000 fails as expected...ok
# Testing that -0xffffffff00000001 fails as expected...ok
# Testing that -0xffffffffffffffff fails as expected...ok
# Testing that -0xffffffff80000000 fails as expected...ok
# Testing that -0xffffffff7fffffff fails as expected...ok
# Checking ignoring spaces up to PAGE_SIZE works on write ...ok
# Checking passing PAGE_SIZE of spaces fails on write ...ok
# Sun Jun 27 11:04:02 UTC 2021
# Running test: sysctl_test_0002 - run #0
# == Testing sysctl behavior against /proc/sys/debug/test_sysctl/string_0001 ==
# Writing test file ... ok
# Checking sysctl is not set to test value ... ok
# Writing sysctl from shell ... ok
# Resetting sysctl to original value ... ok
# Writing entire sysctl in single write ... ok
# Writing middle of sysctl after synchronized seek ... ok
# Writing beyond end of sysctl ... ok
# Writing sysctl with multiple long writes ... ok
# Writing entire sysctl in short writes ... ok
# Writing middle of sysctl after unsynchronized seek ... ok
# Checking sysctl maxlen is at least 65 ... ok
# Checking sysctl keeps original string on overflow append ... ok
# Checking sysctl stays NULL terminated on write ... ok
# Checking sysctl stays NULL terminated on overwrite ... ok
# Sun Jun 27 11:04:02 UTC 2021
# Running test: sysctl_test_0003 - run #0
# == Testing sysctl behavior against /proc/sys/debug/test_sysctl/int_0002 ==
# Writing test file ... ok
# Checking sysctl is not set to test value ... ok
# Writing sysctl from shell ... ok
# Resetting sysctl to original value ... ok
# Writing entire sysctl in single write ... ok
# Writing middle of sysctl after synchronized seek ... ok
# Writing beyond end of sysctl ... ok
# Writing sysctl with multiple long writes ... ok
# Testing that 0x0000000100000000 fails as expected...ok
# Testing that 0x0000000100000001 fails as expected...ok
# Testing that 0x00000001ffffffff fails as expected...ok
# Testing that 0x0000000180000000 fails as expected...ok
# Testing that 0x000000017fffffff fails as expected...ok
# Testing that 0xffffffff00000000 fails as expected...ok
# Testing that 0xffffffff00000001 fails as expected...ok
# Testing that 0xffffffffffffffff fails as expected...ok
# Testing that 0xffffffff80000000 fails as expected...ok
# Testing that 0xffffffff7fffffff fails as expected...ok
# Testing that -0x0000000100000000 fails as expected...ok
# Testing that -0x0000000100000001 fails as expected...ok
# Testing that -0x00000001ffffffff fails as expected...ok
# Testing that -0x0000000180000000 fails as expected...ok
# Testing that -0x000000017fffffff fails as expected...ok
# Testing that -0xffffffff00000000 fails as expected...ok
# Testing that -0xffffffff00000001 fails as expected...ok
# Testing that -0xffffffffffffffff fails as expected...ok
# Testing that -0xffffffff80000000 fails as expected...ok
# Testing that -0xffffffff7fffffff fails as expected...ok
# Checking ignoring spaces up to PAGE_SIZE works on write ...ok
# Checking passing PAGE_SIZE of spaces fails on write ...ok
# Testing INT_MAX works ...ok
# Testing INT_MAX + 1 will fail as expected...ok
# Testing negative values will work as expected...ok
# Sun Jun 27 11:04:03 UTC 2021
# Running test: sysctl_test_0004 - run #0
# == Testing sysctl behavior against /proc/sys/debug/test_sysctl/uint_0001 ==
# Writing test file ... ok
# Checking sysctl is not set to test value ... ok
# Writing sysctl from shell ... ok
# Resetting sysctl to original value ... ok
# Writing entire sysctl in single write ... ok
# Writing middle of sysctl after synchronized seek ... ok
# Writing beyond end of sysctl ... ok
# Writing sysctl with multiple long writes ... ok
# Testing that 0x0000000100000000 fails as expected...ok
# Testing that 0x0000000100000001 fails as expected...ok
# Testing that 0x00000001ffffffff fails as expected...ok
# Testing that 0x0000000180000000 fails as expected...ok
# Testing that 0x000000017fffffff fails as expected...ok
# Testing that 0xffffffff00000000 fails as expected...ok
# Testing that 0xffffffff00000001 fails as expected...ok
# Testing that 0xffffffffffffffff fails as expected...ok
# Testing that 0xffffffff80000000 fails as expected...ok
# Testing that 0xffffffff7fffffff fails as expected...ok
# Testing that -0x0000000100000000 fails as expected...ok
# Testing that -0x0000000100000001 fails as expected...ok
# Testing that -0x00000001ffffffff fails as expected...ok
# Testing that -0x0000000180000000 fails as expected...ok
# Testing that -0x000000017fffffff fails as expected...ok
# Testing that -0xffffffff00000000 fails as expected...ok
# Testing that -0xffffffff00000001 fails as expected...ok
# Testing that -0xffffffffffffffff fails as expected...ok
# Testing that -0xffffffff80000000 fails as expected...ok
# Testing that -0xffffffff7fffffff fails as expected...ok
# Checking ignoring spaces up to PAGE_SIZE works on write ...ok
# Checking passing PAGE_SIZE of spaces fails on write ...ok
# Testing UINT_MAX works ...ok
# Testing UINT_MAX + 1 will fail as expected...ok
# Testing negative values will not work as expected ...ok
# Sun Jun 27 11:04:05 UTC 2021
# Running test: sysctl_test_0005 - run #0
# Testing array works as expected ... ok
# Testing skipping trailing array elements works ... ok
# Testing PAGE_SIZE limit on array works ... ok
# Testing exceeding PAGE_SIZE limit fails as expected ... ok
# Sun Jun 27 11:04:05 UTC 2021
# Running test: sysctl_test_0005 - run #1
# Testing array works as expected ... ok
# Testing skipping trailing array elements works ... ok
# Testing PAGE_SIZE limit on array works ... ok
# Testing exceeding PAGE_SIZE limit fails as expected ... ok
# Sun Jun 27 11:04:05 UTC 2021
# Running test: sysctl_test_0005 - run #2
# Testing array works as expected ... ok
# Testing skipping trailing array elements works ... ok
# Testing PAGE_SIZE limit on array works ... ok
# Testing exceeding PAGE_SIZE limit fails as expected ... ok
# Sun Jun 27 11:04:05 UTC 2021
# Running test: sysctl_test_0006 - run #0
# Checking bitmap handler... ok
# Sun Jun 27 11:04:10 UTC 2021
# Running test: sysctl_test_0006 - run #1
# Checking bitmap handler... ok
# Sun Jun 27 11:04:13 UTC 2021
# Running test: sysctl_test_0006 - run #2
# Checking bitmap handler... ok
# Sun Jun 27 11:04:16 UTC 2021
# Running test: sysctl_test_0006 - run #3
# Checking bitmap handler... ok
# Sun Jun 27 11:04:19 UTC 2021
# Running test: sysctl_test_0006 - run #4
# Checking bitmap handler... ok
# Sun Jun 27 11:04:22 UTC 2021
# Running test: sysctl_test_0006 - run #5
# Checking bitmap handler... ok
# Sun Jun 27 11:04:25 UTC 2021
# Running test: sysctl_test_0006 - run #6
# Checking bitmap handler... ok
# Sun Jun 27 11:04:26 UTC 2021
# Running test: sysctl_test_0006 - run #7
# Checking bitmap handler... ok
# Sun Jun 27 11:04:27 UTC 2021
# Running test: sysctl_test_0006 - run #8
# Checking bitmap handler... ok
# Sun Jun 27 11:04:27 UTC 2021
# Running test: sysctl_test_0006 - run #9
# Checking bitmap handler... ok
# Sun Jun 27 11:04:29 UTC 2021
# Running test: sysctl_test_0006 - run #10
# Checking bitmap handler... ok
# Sun Jun 27 11:04:29 UTC 2021
# Running test: sysctl_test_0006 - run #11
# Checking bitmap handler... ok
# Sun Jun 27 11:04:29 UTC 2021
# Running test: sysctl_test_0006 - run #12
# Checking bitmap handler... ok
# Sun Jun 27 11:04:30 UTC 2021
# Running test: sysctl_test_0006 - run #13
# Checking bitmap handler... ok
# Sun Jun 27 11:04:31 UTC 2021
# Running test: sysctl_test_0006 - run #14
# Checking bitmap handler... ok
# Sun Jun 27 11:04:31 UTC 2021
# Running test: sysctl_test_0006 - run #15
# Checking bitmap handler... ok
# Sun Jun 27 11:04:34 UTC 2021
# Running test: sysctl_test_0006 - run #16
# Checking bitmap handler... ok
# Sun Jun 27 11:04:38 UTC 2021
# Running test: sysctl_test_0006 - run #17
# Checking bitmap handler... ok
# Sun Jun 27 11:04:42 UTC 2021
# Running test: sysctl_test_0006 - run #18
# Checking bitmap handler... ok
# Sun Jun 27 11:04:44 UTC 2021
# Running test: sysctl_test_0006 - run #19
# Checking bitmap handler... ok
# Sun Jun 27 11:04:45 UTC 2021
# Running test: sysctl_test_0006 - run #20
# Checking bitmap handler... ok
# Sun Jun 27 11:04:45 UTC 2021
# Running test: sysctl_test_0006 - run #21
# Checking bitmap handler... ok
# Sun Jun 27 11:04:47 UTC 2021
# Running test: sysctl_test_0006 - run #22
# Checking bitmap handler... ok
# Sun Jun 27 11:04:48 UTC 2021
# Running test: sysctl_test_0006 - run #23
# Checking bitmap handler... ok
# Sun Jun 27 11:04:49 UTC 2021
# Running test: sysctl_test_0006 - run #24
# Checking bitmap handler... ok
# Sun Jun 27 11:04:49 UTC 2021
# Running test: sysctl_test_0006 - run #25
# Checking bitmap handler... ok
# Sun Jun 27 11:04:51 UTC 2021
# Running test: sysctl_test_0006 - run #26
# Checking bitmap handler... ok
# Sun Jun 27 11:04:51 UTC 2021
# Running test: sysctl_test_0006 - run #27
# Checking bitmap handler... ok
# Sun Jun 27 11:04:52 UTC 2021
# Running test: sysctl_test_0006 - run #28
# Checking bitmap handler... ok
# Sun Jun 27 11:04:55 UTC 2021
# Running test: sysctl_test_0006 - run #29
# Checking bitmap handler... ok
# Sun Jun 27 11:04:56 UTC 2021
# Running test: sysctl_test_0006 - run #30
# Checking bitmap handler... ok
# Sun Jun 27 11:04:58 UTC 2021
# Running test: sysctl_test_0006 - run #31
# Checking bitmap handler... ok
# Sun Jun 27 11:04:59 UTC 2021
# Running test: sysctl_test_0006 - run #32
# Checking bitmap handler... ok
# Sun Jun 27 11:04:59 UTC 2021
# Running test: sysctl_test_0006 - run #33
# Checking bitmap handler... ok
# Sun Jun 27 11:04:59 UTC 2021
# Running test: sysctl_test_0006 - run #34
# Checking bitmap handler... ok
# Sun Jun 27 11:04:59 UTC 2021
# Running test: sysctl_test_0006 - run #35
# Checking bitmap handler... ok
# Sun Jun 27 11:05:03 UTC 2021
# Running test: sysctl_test_0006 - run #36
# Checking bitmap handler... ok
# Sun Jun 27 11:05:03 UTC 2021
# Running test: sysctl_test_0006 - run #37
# Checking bitmap handler... ok
# Sun Jun 27 11:05:06 UTC 2021
# Running test: sysctl_test_0006 - run #38
# Checking bitmap handler... ok
# Sun Jun 27 11:05:07 UTC 2021
# Running test: sysctl_test_0006 - run #39
# Checking bitmap handler... ok
# Sun Jun 27 11:05:09 UTC 2021
# Running test: sysctl_test_0006 - run #40
# Checking bitmap handler... ok
# Sun Jun 27 11:05:09 UTC 2021
# Running test: sysctl_test_0006 - run #41
# Checking bitmap handler... ok
# Sun Jun 27 11:05:09 UTC 2021
# Running test: sysctl_test_0006 - run #42
# Checking bitmap handler... ok
# Sun Jun 27 11:05:09 UTC 2021
# Running test: sysctl_test_0006 - run #43
# Checking bitmap handler... ok
# Sun Jun 27 11:05:10 UTC 2021
# Running test: sysctl_test_0006 - run #44
# Checking bitmap handler... ok
# Sun Jun 27 11:05:12 UTC 2021
# Running test: sysctl_test_0006 - run #45
# Checking bitmap handler... ok
# Sun Jun 27 11:05:16 UTC 2021
# Running test: sysctl_test_0006 - run #46
# Checking bitmap handler... ok
# Sun Jun 27 11:05:19 UTC 2021
# Running test: sysctl_test_0006 - run #47
# Checking bitmap handler... ok
# Sun Jun 27 11:05:20 UTC 2021
# Running test: sysctl_test_0006 - run #48
# Checking bitmap handler... ok
# Sun Jun 27 11:05:20 UTC 2021
# Running test: sysctl_test_0006 - run #49
# Checking bitmap handler... ok
# Sun Jun 27 11:05:23 UTC 2021
# Running test: sysctl_test_0007 - run #0
# Testing if /proc/sys/debug/test_sysctl/boot_int is set to 1 ...ok
ok 1 selftests: sysctl: sysctl.sh



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (33.68 kB)
config-5.13.0-rc7-00027-g3fdd8c68c2f4 (178.15 kB)
job-script (6.47 kB)
dmesg.xz (11.30 kB)
kernel-selftests (32.02 kB)
job.yaml (5.38 kB)
reproduce (581.00 B)

2021-06-28 16:23:05

by Eric W. Biederman

Subject: Re: [PATCH 0/9] Refactoring exit

Geert Uytterhoeven <[email protected]> writes:

> Hi Michael,
>
> On Mon, Jun 28, 2021 at 1:00 AM Michael Schmitz <[email protected]> wrote:
>> On 28/06/21 10:13 am, Al Viro wrote:
>> > On Thu, Jun 24, 2021 at 10:45:23PM +0000, Al Viro wrote:
>> >
>> >> 13) there's bdflush(1, whatever), which is equivalent to exit(0).
>> >> IMO it's long past the time to simply remove the sucker.
>> > Incidentally, calling that from ptraced process on alpha leads to
>> > the same headache for tracer. _If_ we leave it around, this is
>> > another candidate for "hit yourself with that special signal" -
>> > both alpha and m68k have that syscall, and IMO adding an asm
>> > wrapper for that one is over the top.
>> >
>> > Said that, we really ought to bury that thing:
>> >
>> > commit 2f268ee88abb33968501a44368db55c63adaad40
>> > Author: Andrew Morton <[email protected]>
>> > Date: Sat Dec 14 03:16:29 2002 -0800
>> >
>> > [PATCH] deprecate use of bdflush()
>> >
>> > Patch from Robert Love <[email protected]>
>> >
>> > We can never get rid of it if we do not deprecate it - so do so and
>> > print a stern warning to those who still run bdflush daemons.
>> >
>> > Deprecated for 18.5 years by now - I seriously suspect that we have
>> > some contributors younger than that...
>>
>> Haven't found that warning in over 7 years' worth of console logs, and
>> I'm a good candidate for running the oldest userland in existence for m68k.
>>
>> Time to let it go.
>
> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> which is a very old ramdisk from right after the m68k a.out to ELF
> transition:
>
> warning: process `update' used the obsolete bdflush system call
> Fix your initscripts?
>
> I still boot it, once in a while.

The only thing left in bdflush is that func == 1 calls do_exit(0);

Which is a hack introduced in 2.3.23 aka October of 1999 to force the
userspace process calling bdflush to exit, instead of repeatedly calling
sys_bdflush.

Can you try deleting that func == 1 call and seeing if your old ramdisk
works?

I suspect userspace used to get into a tight spin calling bdflush
func == 1 when that function stopped doing anything. That was back in
1999, so we are probably safe without it.
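(For reference, the surviving stub in fs/buffer.c looks roughly like the
fragment below -- quoted from memory rather than a specific tree, so treat
the details as approximate. It matches the five-warning behaviour reported
later in this thread:)

```c
/* fs/buffer.c -- approximately what remains of sys_bdflush */
SYSCALL_DEFINE2(bdflush, int, func, long, data)
{
	static int msg_count;

	if (!capable(CAP_SYS_ADMIN))
		return -EPERM;

	/* Nag the first few callers, then stay quiet. */
	if (msg_count < 5) {
		msg_count++;
		printk(KERN_INFO
		       "warning: process `%s' used the obsolete bdflush system call\n",
		       current->comm);
		printk(KERN_INFO "Fix your initscripts?\n");
	}

	/* The 2.3.23-era hack under discussion: force the caller to exit
	 * instead of letting it spin on a syscall that does nothing. */
	if (func == 1)
		do_exit(0);

	return 0;
}
```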

Eric

2021-06-28 23:40:03

by Michael Schmitz

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Geert,

Am 28.06.2021 um 19:31 schrieb Geert Uytterhoeven:
>> Haven't found that warning in over 7 years' worth of console logs, and
>> I'm a good candidate for running the oldest userland in existence for m68k.
>>
>> Time to let it go.
>
> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> which is a very old ramdisk from right after the m68k a.out to ELF
> transition:
>
> warning: process `update' used the obsolete bdflush system call
> Fix your initscripts?
>
> I still boot it, once in a while.

OK; you take the cake. That ramdisk came to mind when I thought about
where I'd last seen bdflush, but I've not used it in ages (not sure 14
MB are enough for that).

The question then is - will bdflush fail gracefully, or spin retrying
the syscall?

Cheers,

Michael

>
> Gr{oetje,eeting}s,
>
> Geert
>

2021-06-28 23:40:39

by Geert Uytterhoeven

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Michael,

On Mon, Jun 28, 2021 at 7:14 PM Michael Schmitz <[email protected]> wrote:
> Am 28.06.2021 um 19:31 schrieb Geert Uytterhoeven:
> >> Haven't found that warning in over 7 years' worth of console logs, and
> >> I'm a good candidate for running the oldest userland in existence for m68k.
> >>
> >> Time to let it go.
> >
> > The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> > which is a very old ramdisk from right after the m68k a.out to ELF
> > transition:
> >
> > warning: process `update' used the obsolete bdflush system call
> > Fix your initscripts?
> >
> > I still boot it, once in a while.
>
> OK; you take the cake. That ramdisk came to mind when I thought about
> where I'd last seen bdflush, but I've not used it in ages (not sure 14
> MB are enough for that).

Of course it will work on your 14 MiB machine! It fits on a floppy, _after_
decompression. It was used by people to install Linux on the hard disks
of their beefy m68k machines, after they had set up the family Christmas
tree, in December 1996.

I also have a slightly larger one, built from OpenWRT when I did my first
experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
a shell on the serial console, so it is more useful to me.

> The question then is - will bdflush fail gracefully, or spin retrying
> the syscall?

Will add to my todo list...
BTW, you can boot this ramdisk on ARAnyM, too.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-06-28 23:41:01

by Eric W. Biederman

Subject: Re: [PATCH 2/9] signal/seccomp: Refactor seccomp signal and coredump generation

Kees Cook <[email protected]> writes:

> On Thu, Jun 24, 2021 at 01:59:55PM -0500, Eric W. Biederman wrote:
>>
>> Factor out force_sig_seccomp from the seccomp signal generation and
>> place it in kernel/signal.c. The function force_sig_seccomp takes a
>> paramter force_coredump to indicate that the sigaction field should be
>> reset to SIGDFL so that a coredump will be generated when the signal
>> is delivered.
>
> Ah! This is the part I missed when I was originally trying to figure
> out the coredump stuff. It's the need for setting a default handler
> (i.e. doing a coredump)?

Yes. If we don't force the handler to SIG_DFL someone might catch
SIGSYS.

>> force_sig_seccomp is then used to replace both seccomp_send_sigsys
>> and seccomp_init_siginfo.
>>
>> force_sig_info_to_task gains an extra parameter to force using
>> the default signal action.
>>
>> With this change seccomp is no longer a special case and there
>> becomes exactly one place do_coredump is called from.
>
> Looks good to me. This may benefit from force_sig_seccomp() to be wrapped
> in an #ifdef CONFIG_SECCOMP.

At which point Linus will probably be grumpy with me for introducing
#ifdefs.

I suspect seccomp at this point is sufficiently common that it is probably
more productive to figure out how to remove #ifdef CONFIG_SECCOMP.

> (This patch reminds me that the seccomp self tests don't check for core
> dumps...)

This patch is slightly wrong in that it kept the call to do_group_exit
when it can never be reached.

Eric


2021-06-28 23:41:26

by Eric W. Biederman

Subject: Re: [PATCH 0/9] Refactoring exit

Al Viro <[email protected]> writes:

> On Thu, Jun 24, 2021 at 01:57:35PM -0500, Eric W. Biederman wrote:
>
>> So far the code has been lightly tested, and the descriptions of some
>> of the patches are a bit light, but I think this shows the direction
>> I am aiming to travel for sorting out exit(2) and exit_group(2).
>
> FWIW, here's the current picture for do_exit(), aside of exit(2) and do_exit_group():
>
> 1) stuff that is clearly oops-like -
> alpha:die_if_kernel() alpha:do_entUna() alpha:do_page_fault() arm:oops_end()
> arm:__do_kernel_fault() arm64:die() arm64:die_kernel_fault() csky:alignment()
> csky:die() csky:no_context() h8300:die() h8300:do_page_fault() hexagon:die()
> ia64:die() i64:ia64_do_page_fault() m68k:die_if_kernel() m68k:send_fault_sig()
> microblaze:die() mips:die() nds32:handle_fpu_exception() nds32:die()
> nds32:unhandled_interruption() nds32:unhandled_exceptions() nds32:do_revinsn()
> nds32:do_page_fault() nios:die() openrisc:die() openrisc:do_page_fault()
> parisc:die_if_kernel() ppc:oops_end() riscv:die() riscv:die_kernel_fault()
> s390:die() s390:do_no_context() s390:do_low_address() sh:die()
> sparc32:die_if_kernel() sparc32:do_sparc_fault() sparc64:die_if_kernel()
> x86:rewind_stack_do_exit() xtensa:die() xtensa:bad_page_fault()
> We really do not want ptrace anywhere near any of those and we do not want
> any of that to return; this shit would better be handled right there and
> there - no "post a fatal signal" would do.

Thanks that makes a good start for digging into these.

I think the distinction I would make is:
- If the kernel is broken use do_task_dead.
- Otherwise clean up the semantics by using start_group_exit,
  start_task_exit, or by just cleaning up the code.


Looking at the reboot case it looks like the code
should have become do_group_exit in 2.5. I have a suspicion
we have a bunch of similar cases that want to terminate the
entire process, but we simply never updated to deal with
multi-thread processes.

I suspect in the reboot case a panic if machine_halt or
machine_power_off fails is more likely the correct
handling. But we do have funny semantics sometimes.

I will see what I can do to expand my patchset to handle all of these
various callers of do_exit.

Eric

2021-06-28 23:41:48

by Eric W. Biederman

Subject: Re: [PATCH 6/9] signal: Fold do_group_exit into get_signal fixing io_uring threads

Kees Cook <[email protected]> writes:

> On Thu, Jun 24, 2021 at 02:02:16PM -0500, Eric W. Biederman wrote:
>>
>> Fold do_group_exit into get_signal as it is the last caller.
>>
>> Move the group_exit logic above the PF_IO_WORKER exit, ensuring
>> that if a PF_IO_WORKER catches SIGKILL every thread in
>> the thread group will exit, not just the PF_IO_WORKER.
>>
>> Now that the information is easily available only set PF_SIGNALED
>> when it was a signal that caused the exit.
>>
>> Signed-off-by: "Eric W. Biederman" <[email protected]>
>> ---
>> include/linux/sched/task.h | 1 -
>> kernel/exit.c | 31 -------------------------------
>> kernel/signal.c | 35 +++++++++++++++++++++++++----------
>> 3 files changed, 25 insertions(+), 42 deletions(-)
>>
>> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
>> index ef02be869cf2..45525512e3d0 100644
>> --- a/include/linux/sched/task.h
>> +++ b/include/linux/sched/task.h
>> @@ -77,7 +77,6 @@ static inline void exit_thread(struct task_struct *tsk)
>> {
>> }
>> #endif
>> -extern void do_group_exit(int);
>>
>> extern void exit_files(struct task_struct *);
>> extern void exit_itimers(struct signal_struct *);
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 921519d80b56..635f434122b7 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -892,37 +892,6 @@ SYSCALL_DEFINE1(exit, int, error_code)
>> do_exit((error_code&0xff)<<8);
>> }
>>
>> -/*
>> - * Take down every thread in the group. This is called by fatal signals
>> - * as well as by sys_exit_group (below).
>> - */
>> -void
>> -do_group_exit(int exit_code)
>> -{
>> - struct signal_struct *sig = current->signal;
>> -
>> - BUG_ON(exit_code & 0x80); /* core dumps don't get here */
>> -
>> - if (signal_group_exit(sig))
>> - exit_code = sig->group_exit_code;
>> - else if (!thread_group_empty(current)) {
>> - struct sighand_struct *const sighand = current->sighand;
>> -
>> - spin_lock_irq(&sighand->siglock);
>> - if (signal_group_exit(sig))
>> - /* Another thread got here before we took the lock. */
>> - exit_code = sig->group_exit_code;
>> - else {
>> - sig->group_exit_code = exit_code;
>> - sig->flags = SIGNAL_GROUP_EXIT;
>> - zap_other_threads(current);
>
> Oh, now I see it: the "new code" in start_group_exit() is an open-coded
> zap_other_threads()? That wasn't clear to me, but makes sense now.

Pretty much. I think zap_other_threads has actually muddied the waters
quite a bit by putting reuse in the wrong place.
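
(For readers without the tree handy, zap_other_threads() in kernel/signal.c
is roughly the following -- a from-memory sketch, so the exact helpers may
differ by version. The start_group_exit_locked() path in the patch
open-codes the same walk:)

```c
/* kernel/signal.c -- approximate shape of zap_other_threads().
 * Caller must hold sighand->siglock. */
int zap_other_threads(struct task_struct *p)
{
	struct task_struct *t = p;
	int count = 0;

	p->signal->group_stop_count = 0;

	while_each_thread(p, t) {
		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
		count++;

		/* Don't bother with already dead threads. */
		if (t->exit_state)
			continue;
		sigaddset(&t->pending.signal, SIGKILL);
		signal_wake_up(t, 1);
	}

	return count;
}
```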

>> - }
>> - spin_unlock_irq(&sighand->siglock);
>> - }
>> -
>> - do_exit(exit_code);
>> - /* NOTREACHED */
>> -}
>>
>> /*
>> * this kills every thread in the thread group. Note that any externally
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index c79c010ca5f3..95a076af600a 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2646,6 +2646,7 @@ bool get_signal(struct ksignal *ksig)
>> {
>> struct sighand_struct *sighand = current->sighand;
>> struct signal_struct *signal = current->signal;
>> + int exit_code;
>> int signr;
>>
>> if (unlikely(current->task_works))
>> @@ -2848,8 +2849,6 @@ bool get_signal(struct ksignal *ksig)
>> /*
>> * Anything else is fatal, maybe with a core dump.
>> */
>> - current->flags |= PF_SIGNALED;
>> -
>> if (sig_kernel_coredump(signr)) {
>> if (print_fatal_signals)
>> print_fatal_signal(ksig->info.si_signo);
>> @@ -2857,14 +2856,33 @@ bool get_signal(struct ksignal *ksig)
>> /*
>> * If it was able to dump core, this kills all
>> * other threads in the group and synchronizes with
>> - * their demise. If we lost the race with another
>> - * thread getting here, it set group_exit_code
>> - * first and our do_group_exit call below will use
>> - * that value and ignore the one we pass it.
>> + * their demise. If another thread makes it
>> + * to do_coredump first, it will set group_exit_code
>> + * which will be passed to do_exit.
>> */
>> do_coredump(&ksig->info);
>> }
>>
>> + /*
>> + * Death signals, no core dump.
>> + */
>> + exit_code = signr;
>> + if (signal_group_exit(signal)) {
>> + exit_code = signal->group_exit_code;
>> + } else {
>> + spin_lock_irq(&sighand->siglock);
>> + if (signal_group_exit(signal)) {
>> + /* Another thread got here before we took the lock. */
>> + exit_code = signal->group_exit_code;
>> + } else {
>> + start_group_exit_locked(signal, exit_code);
>
> And here's the "if we didn't already do start_group_exit(), do it here".
> And that state is entirely captured via the SIGNAL_GROUP_EXIT flag.
> Cool.

Yes. At least when the dust clears.

>> + }
>> + spin_unlock_irq(&sighand->siglock);
>> + }
>> +
>> + if (exit_code & 0x7f)
>> + current->flags |= PF_SIGNALED;
>> +
>> /*
>> * PF_IO_WORKER threads will catch and exit on fatal signals
>> * themselves. They have cleanup that must be performed, so
>> @@ -2873,10 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>> if (current->flags & PF_IO_WORKER)
>> goto out;
>>
>> - /*
>> - * Death signals, no core dump.
>> - */
>> - do_group_exit(ksig->info.si_signo);
>> + do_exit(exit_code);
>> /* NOTREACHED */
>> }
>> spin_unlock_irq(&sighand->siglock);
>> --
>> 2.20.1
>>

2021-06-28 23:43:23

by Geert Uytterhoeven

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Michael,

On Mon, Jun 28, 2021 at 10:13 PM Michael Schmitz <[email protected]> wrote:
> On 29/06/21 7:17 am, Geert Uytterhoeven wrote:
> >>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
> >>> which is a very old ramdisk from right after the m68k a.out to ELF
> >>> transition:
> >>>
> >>> warning: process `update' used the obsolete bdflush system call
> >>> Fix your initscripts?
> >>>
> >>> I still boot it, once in a while.
> >> OK; you take the cake. That ramdisk came to mind when I thought about
> >> where I'd last seen bdflush, but I've not used it in ages (not sure 14
> >> MB are enough for that).
> > Of course it will work on your 14 MiB machine! It fits on a floppy, _after_
> > decompression. It was used by people to install Linux on the hard disks
> > of their beefy m68k machines, after they had set up the family Christmas
> > tree, in December 1996.
>
> Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping.

I knew I could revive your memory ;-)

> > I also have a slightly larger one, built from OpenWRT when I did my first
> > experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
> > a shell on the serial console, so it is more useful to me.
> >
> >> The question then is - will bdflush fail gracefully, or spin retrying
> >> the syscall?
> > Will add to my todo list...
> > BTW, you can boot this ramdisk on ARAnyM, too.
>
> True. I can't find that ramdisk image anywhere - if you can point me to
> some archive, I'll give that a try.

http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-06-29 00:26:36

by Michael Schmitz

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Geert,

On 29/06/21 7:17 am, Geert Uytterhoeven wrote:
>>> The warning is printed when using filesys-ELF-2.0.x-1400K-2.gz,
>>> which is a very old ramdisk from right after the m68k a.out to ELF
>>> transition:
>>>
>>> warning: process `update' used the obsolete bdflush system call
>>> Fix your initscripts?
>>>
>>> I still boot it, once in a while.
>> OK; you take the cake. That ramdisk came to mind when I thought about
>> where I'd last seen bdflush, but I've not used it in ages (not sure 14
>> MB are enough for that).
> Of course it will work on your 14 MiB machine! It fits on a floppy, _after_
> decompression. It was used by people to install Linux on the hard disks
> of their beefy m68k machines, after they had set up the family Christmas
> tree, in December 1996.
Been there, done that. Wrote the HOWTO for ext2 filesystem byte-swapping.
> I also have a slightly larger one, built from OpenWRT when I did my first
> experiments on that. Unlike filesys-ELF-2.0.x-1400K-2.gz, it does open
> a shell on the serial console, so it is more useful to me.
>
>> The question then is - will bdflush fail gracefully, or spin retrying
>> the syscall?
> Will add to my todo list...
> BTW, you can boot this ramdisk on ARAnyM, too.

True. I can't find that ramdisk image anywhere - if you can point me to
some archive, I'll give that a try.

Cheers,

    Michael


>
> Gr{oetje,eeting}s,
>
> Geert
>

2021-06-29 00:41:07

by Michael Schmitz

Subject: Re: [PATCH 0/9] Refactoring exit

Hi Geert,

On 29/06/21 9:18 am, Geert Uytterhoeven wrote:
>
>>>> The question then is - will bdflush fail gracefully, or spin retrying
>>>> the syscall?
>>> Will add to my todo list...
>>> BTW, you can boot this ramdisk on ARAnyM, too.
>> True. I can't find that ramdisk image anywhere - if you can point me to
>> some archive, I'll give that a try.
> http://ftp.mac.linux-m68k.org/pub/linux-mac68k/initrd/

Thanks - removing the if (func==1) do_exit(0); part gives similar
behaviour to before - the kernel warns five times, then shuts up (without
the change, it warns twice only, and /sbin/update no longer runs).

Removing the syscall from the m68k syscall table altogether still gives
a working ramdisk. /sbin/update is still running, so it evidently doesn't
care about the invalid syscall result ...

Cheers,

    Michael


>
> Gr{oetje,eeting}s,
>
> Geert
>

2021-06-29 21:25:06

by Eric W. Biederman

Subject: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call


The bdflush system call has been deprecated for a very long time.
Recently Michael Schmitz tested[1] and found that the last known
caller of the bdflush system call is unaffected by its removal.

Since the code is not needed, delete it.

[1] https://lkml.kernel.org/r/[email protected]
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

I think we have consensus that bdflush can be removed. Can folks please
verify I have removed it correctly?

Michael could you give me a Tested-by on this patch?

arch/alpha/kernel/syscalls/syscall.tbl | 2 +-
arch/arm/tools/syscall.tbl | 2 +-
arch/arm64/include/asm/unistd32.h | 2 +-
arch/ia64/kernel/syscalls/syscall.tbl | 2 +-
arch/m68k/kernel/syscalls/syscall.tbl | 2 +-
arch/microblaze/kernel/syscalls/syscall.tbl | 2 +-
arch/mips/kernel/syscalls/syscall_o32.tbl | 2 +-
arch/parisc/kernel/syscalls/syscall.tbl | 2 +-
arch/powerpc/kernel/syscalls/syscall.tbl | 2 +-
arch/s390/kernel/syscalls/syscall.tbl | 2 +-
arch/sh/kernel/syscalls/syscall.tbl | 2 +-
arch/sparc/kernel/syscalls/syscall.tbl | 2 +-
arch/x86/entry/syscalls/syscall_32.tbl | 2 +-
arch/xtensa/kernel/syscalls/syscall.tbl | 2 +-
fs/buffer.c | 27 -------------------
include/linux/syscalls.h | 1 -
include/uapi/linux/capability.h | 1 -
kernel/sys_ni.c | 1 -
.../arch/powerpc/entry/syscalls/syscall.tbl | 2 +-
.../perf/arch/s390/entry/syscalls/syscall.tbl | 2 +-
20 files changed, 16 insertions(+), 46 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 3000a2e8ee21..85d2bcd9cf36 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -230,7 +230,7 @@
259 common osf_swapctl sys_ni_syscall
260 common osf_memcntl sys_ni_syscall
261 common osf_fdatasync sys_ni_syscall
-300 common bdflush sys_bdflush
+300 common bdflush sys_ni_syscall
301 common sethae sys_sethae
302 common mount sys_mount
303 common old_adjtimex sys_old_adjtimex
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 28e03b5fec00..241988512648 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -147,7 +147,7 @@
131 common quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 common personality sys_personality
# 137 was sys_afs_syscall
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 5dab69d2c22b..a35cd6c4909c 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -279,7 +279,7 @@ __SYSCALL(__NR_getpgid, sys_getpgid)
#define __NR_fchdir 133
__SYSCALL(__NR_fchdir, sys_fchdir)
#define __NR_bdflush 134
-__SYSCALL(__NR_bdflush, sys_bdflush)
+__SYSCALL(__NR_bdflush, sys_ni_syscall)
#define __NR_sysfs 135
__SYSCALL(__NR_sysfs, sys_sysfs)
#define __NR_personality 136
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index bb11fe4c875a..7de53a9a2972 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -123,7 +123,7 @@
# 1135 was get_kernel_syms
# 1136 was query_module
113 common quotactl sys_quotactl
-114 common bdflush sys_bdflush
+114 common bdflush sys_ni_syscall
115 common sysfs sys_sysfs
116 common personality sys_personality
117 common afs_syscall sys_ni_syscall
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 79c2d24c89dd..be5abd9c8c07 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
131 common quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 common personality sys_personality
# 137 was afs_syscall
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index b11395a20c20..555fd987f4ab 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
131 common quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 common personality sys_personality
137 common afs_syscall sys_ni_syscall
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index d560c467a8c6..2c6b10db3bd5 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -145,7 +145,7 @@
131 o32 quotactl sys_quotactl
132 o32 getpgid sys_getpgid
133 o32 fchdir sys_fchdir
-134 o32 bdflush sys_bdflush
+134 o32 bdflush sys_ni_syscall
135 o32 sysfs sys_sysfs
136 o32 personality sys_personality sys_32_personality
137 o32 afs_syscall sys_ni_syscall
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index aabc37f8cae3..51c156cb00f1 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -147,7 +147,7 @@
131 common quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 32 personality parisc_personality
136 64 personality sys_personality
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 8f052ff4058c..2518e4e6dccf 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
131 nospu quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 32 personality sys_personality ppc64_personality
136 64 personality ppc64_personality
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 0690263df1dd..ffcf03714f12 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
131 common quotactl sys_quotactl sys_quotactl
132 common getpgid sys_getpgid sys_getpgid
133 common fchdir sys_fchdir sys_fchdir
-134 common bdflush sys_bdflush sys_bdflush
+134 common bdflush sys_ni_syscall sys_ni_syscall
135 common sysfs sys_sysfs sys_sysfs
136 common personality sys_s390_personality sys_s390_personality
137 common afs_syscall - -
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 0b91499ebdcf..6e7305066a70 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -141,7 +141,7 @@
131 common quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 common personality sys_personality
# 137 was afs_syscall
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e34cc30ef22c..bf330dda7c61 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -270,7 +270,7 @@
222 common delete_module sys_delete_module
223 common get_kernel_syms sys_ni_syscall
224 common getpgid sys_getpgid
-225 common bdflush sys_bdflush
+225 common bdflush sys_ni_syscall
226 common sysfs sys_sysfs
227 common afs_syscall sys_nis_syscall
228 common setfsuid sys_setfsuid16
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 4bbc267fb36b..a21a72763d58 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -145,7 +145,7 @@
131 i386 quotactl sys_quotactl
132 i386 getpgid sys_getpgid
133 i386 fchdir sys_fchdir
-134 i386 bdflush sys_bdflush
+134 i386 bdflush sys_ni_syscall
135 i386 sysfs sys_sysfs
136 i386 personality sys_personality
137 i386 afs_syscall
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index fd2f30227d96..db4e3d09b249 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -223,7 +223,7 @@
# 205 was old nfsservctl
205 common nfsservctl sys_ni_syscall
206 common _sysctl sys_ni_syscall
-207 common bdflush sys_bdflush
+207 common bdflush sys_ni_syscall
208 common uname sys_newuname
209 common sysinfo sys_sysinfo
210 common init_module sys_init_module
diff --git a/fs/buffer.c b/fs/buffer.c
index ea48c01fb76b..04ddff76c860 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3292,33 +3292,6 @@ int try_to_free_buffers(struct page *page)
}
EXPORT_SYMBOL(try_to_free_buffers);

-/*
- * There are no bdflush tunables left. But distributions are
- * still running obsolete flush daemons, so we terminate them here.
- *
- * Use of bdflush() is deprecated and will be removed in a future kernel.
- * The `flush-X' kernel threads fully replace bdflush daemons and this call.
- */
-SYSCALL_DEFINE2(bdflush, int, func, long, data)
-{
- static int msg_count;
-
- if (!capable(CAP_SYS_ADMIN))
- return -EPERM;
-
- if (msg_count < 5) {
- msg_count++;
- printk(KERN_INFO
- "warning: process `%s' used the obsolete bdflush"
- " system call\n", current->comm);
- printk(KERN_INFO "Fix your initscripts?\n");
- }
-
- if (func == 1)
- do_exit(0);
- return 0;
-}
-
/*
* Buffer-head allocation
*/
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 050511e8f1f8..1bd6e05ea116 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1157,7 +1157,6 @@ asmlinkage long sys_ustat(unsigned dev, struct ustat __user *ubuf);
asmlinkage long sys_vfork(void);
asmlinkage long sys_recv(int, void __user *, size_t, unsigned);
asmlinkage long sys_send(int, void __user *, size_t, unsigned);
-asmlinkage long sys_bdflush(int func, long data);
asmlinkage long sys_oldumount(char __user *name);
asmlinkage long sys_uselib(const char __user *library);
asmlinkage long sys_sysfs(int option,
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 2ddb4226cd23..463d1ba2232a 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -243,7 +243,6 @@ struct vfs_ns_cap_data {
/* Allow examination and configuration of disk quotas */
/* Allow setting the domainname */
/* Allow setting the hostname */
-/* Allow calling bdflush() */
/* Allow mount() and umount(), setting up new smb connection */
/* Allow some autofs root ioctls */
/* Allow nfsservctl */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 0ea8128468c3..adf4d66ffae2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -414,7 +414,6 @@ COND_SYSCALL(epoll_wait);
COND_SYSCALL(recv);
COND_SYSCALL_COMPAT(recv);
COND_SYSCALL(send);
-COND_SYSCALL(bdflush);
COND_SYSCALL(uselib);

/* optional: time32 */
diff --git a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
index 2e68fbb57cc6..ab72dec9dadb 100644
--- a/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/powerpc/entry/syscalls/syscall.tbl
@@ -176,7 +176,7 @@
131 nospu quotactl sys_quotactl
132 common getpgid sys_getpgid
133 common fchdir sys_fchdir
-134 common bdflush sys_bdflush
+134 common bdflush sys_ni_syscall
135 common sysfs sys_sysfs
136 32 personality sys_personality ppc64_personality
136 64 personality ppc64_personality
diff --git a/tools/perf/arch/s390/entry/syscalls/syscall.tbl b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
index 7e4a2aba366d..f2eba775e676 100644
--- a/tools/perf/arch/s390/entry/syscalls/syscall.tbl
+++ b/tools/perf/arch/s390/entry/syscalls/syscall.tbl
@@ -122,7 +122,7 @@
131 common quotactl sys_quotactl sys_quotactl
132 common getpgid sys_getpgid sys_getpgid
133 common fchdir sys_fchdir sys_fchdir
-134 common bdflush sys_bdflush sys_bdflush
+134 common bdflush - -
135 common sysfs sys_sysfs sys_sysfs
136 common personality sys_s390_personality sys_s390_personality
137 common afs_syscall - -
--
2.20.1

2021-06-29 21:46:33

by Michael Schmitz

[permalink] [raw]
Subject: Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call


On 30/06/21 8:28 am, Eric W. Biederman wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of the bdflush system call is unaffected by its removal.
>
> Since the code is not needed delete it.
>
> [1] https://lkml.kernel.org/r/[email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>
Tested-by: Michael Schmitz <[email protected]>
> [...]

2021-06-30 08:27:45

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<[email protected]> wrote:
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of of the bdflush system call is unaffected by it's removal.
>
> Since the code is not needed delete it.
>
> [1] https://lkml.kernel.org/r/[email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>

> arch/m68k/kernel/syscalls/syscall.tbl | 2 +-

Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-06-30 08:42:23

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call

On Tue, Jun 29, 2021 at 10:28 PM Eric W. Biederman
<[email protected]> wrote:
>
>
> The bdflush system call has been deprecated for a very long time.
> Recently Michael Schmitz tested[1] and found that the last known
> caller of of the bdflush system call is unaffected by it's removal.
>
> Since the code is not needed delete it.
>
> [1] https://lkml.kernel.org/r/[email protected]
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>
> I think we have consensus that bdflush can be removed. Can folks please
> verify I have removed it correctly?

Reviewed-by: Arnd Bergmann <[email protected]>

We are traditionally somewhat inconsistent about whether to leave the
__NR_bdflush macro present in asm/unistd.h or to remove it when the
syscall is gone. Leaving it in place as you do is probably better here.

Arnd

2021-06-30 12:58:43

by Cyril Hrubis

[permalink] [raw]
Subject: Re: [CFT][PATCH] exit/bdflush: Remove the deprecated bdflush system call

Hi!
I've sent a similar patch [1] a while ago, when I removed the bdflush tests
from LTP.

[1] https://lore.kernel.org/lkml/[email protected]/

Acked-by: Cyril Hrubis <[email protected]>

--
Cyril Hrubis
[email protected]