2021-08-18 03:07:02

by Jens Axboe

Subject: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

task_work being added with notify == TWA_SIGNAL will utilize
TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
If this happens while a task is going through a core dump, it'll
potentially disturb and truncate the dump as a signal interruption.

Have task_work_add() with notify == TWA_SIGNAL check if a task has been
signaled for a core dump, and refuse to add the work if that is the case.
When a core dump is invoked, explicitly check for TIF_NOTIFY_SIGNAL and
run any pending task_work if that is set. This is similar to how an
exiting task will not get new task_work added, and we return the same
error for the core dump case. As we return success or failure from
task_work_add(), the caller has to be prepared to handle this case
already.

Currently this manifests itself in that io_uring tasks that end up using
task_work will experience truncated core dumps.

Reported-by: Tony Battersby <[email protected]>
Reported-by: Olivier Langlois <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: [email protected] # 5.10+
Signed-off-by: Jens Axboe <[email protected]>

---

diff --git a/fs/coredump.c b/fs/coredump.c
index 07afb5ddb1c4..ca7c1ee44ada 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
 		.mm_flags = mm->flags,
 	};
 
+	/*
+	 * task_work_add() will refuse to add work after PF_SIGNALED has
+	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
+	 * if any was queued before that.
+	 */
+	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
+		tracehook_notify_signal();
+
 	audit_core_dumps(siginfo->si_signo);
 
 	binfmt = mm->binfmt;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 1698fbe6f0e1..1ab28904adc4 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
 		head = READ_ONCE(task->task_works);
 		if (unlikely(head == &work_exited))
 			return -ESRCH;
+		/*
+		 * TIF_NOTIFY_SIGNAL notifications will interfere with
+		 * a core dump in progress, reject them.
+		 */
+		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
+			return -ESRCH;
 		work->next = head;
 	} while (cmpxchg(&task->task_works, head, work) != head);

--
Jens Axboe
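
For readers who want to poke at the idea outside the kernel, the two halves of the patch above (refuse TWA_SIGNAL work once the dump has started, and flush whatever was queued before that) can be modelled in plain C11. This is only a userspace sketch: `struct task`, `model_task_work_add()`, `model_flush_pending()` and the `MODEL_TWA_*` values are hypothetical stand-ins, not kernel APIs, and the atomics here say nothing about how `task->flags` is actually serialized in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these names are kernel APIs. */
struct work {
	struct work *next;
	void (*func)(struct work *);
};

struct task {
	_Atomic(struct work *) works;	/* models task->task_works */
	atomic_bool signaled;		/* models PF_SIGNALED */
	atomic_bool notify_signal;	/* models TIF_NOTIFY_SIGNAL */
};

enum notify_mode { MODEL_TWA_NONE, MODEL_TWA_SIGNAL };

/* Returns 0 on success, -1 where the patch would return -ESRCH. */
static int model_task_work_add(struct task *t, struct work *w,
			       enum notify_mode notify)
{
	struct work *head = atomic_load(&t->works);

	do {
		/* The added check: refuse signal-style work once a dump began. */
		if (notify == MODEL_TWA_SIGNAL && atomic_load(&t->signaled))
			return -1;
		w->next = head;
		/* On failure the CAS reloads head and the loop retries. */
	} while (!atomic_compare_exchange_weak(&t->works, &head, w));

	if (notify == MODEL_TWA_SIGNAL)
		atomic_store(&t->notify_signal, true);
	return 0;
}

/* Models the flush added to do_coredump(): run work queued before the dump. */
static int model_flush_pending(struct task *t)
{
	int ran = 0;

	if (!atomic_exchange(&t->notify_signal, false))
		return 0;
	for (struct work *w = atomic_exchange(&t->works, NULL); w; ) {
		struct work *next = w->next;

		w->func(w);
		ran++;
		w = next;
	}
	return ran;
}

static void noop(struct work *w) { (void)w; }

struct demo_result { int early_add, flushed, late_add; };

/* Queue one item, mark the task as dumping, flush, then try to queue again. */
static struct demo_result run_demo(void)
{
	struct task t;
	struct work early = { NULL, noop }, late = { NULL, noop };
	struct demo_result r;

	atomic_init(&t.works, NULL);
	atomic_init(&t.signaled, false);
	atomic_init(&t.notify_signal, false);

	r.early_add = model_task_work_add(&t, &early, MODEL_TWA_SIGNAL);
	atomic_store(&t.signaled, true);	/* dump begins */
	r.flushed = model_flush_pending(&t);
	r.late_add = model_task_work_add(&t, &late, MODEL_TWA_SIGNAL);
	assert(atomic_load(&t.works) == NULL);	/* nothing queued during the dump */
	return r;
}
```

Calling `run_demo()` from a `main()` exercises the intended ordering: the early item is accepted and flushed, the late one is refused.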


2021-08-19 02:58:49

by Linus Torvalds

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

On Tue, Aug 17, 2021 at 8:06 PM Jens Axboe <[email protected]> wrote:
>
> task_work being added with notify == TWA_SIGNAL will utilize
> TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
> If this happens while a task is going through a core dump, it'll
> potentially disturb and truncate the dump as a signal interruption.

This patch seems (a) buggy and (b) hacky.

> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
> +			return -ESRCH;

This basically seems to check task->flags with no serialization.

I'm sure it works 99.9% of the time in practice, since you'd be really
unlucky to hit any races, but I really don't see what the
serialization logic is.

Also, the main user that actually triggered the problem already has

	if (unlikely(tsk->flags & PF_EXITING))
		goto fail;

just above the call to task_work_add(), so this all seems very hacky indeed.

Of course, I don't see what the serialization for _that_ one is either.

Pls explain. You can't just randomly add tests for random flags that
get modified by other random code.

Linus
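
One way to picture the window being objected to here is to split the adder into its two steps (test the flag, then publish the work) and let the dump start in between. The sketch below replays both orderings sequentially in a single thread, so it is deterministic; it is an illustration of a check-then-act gap under assumed names (`adder_check()`, `adder_publish()`, `dumper_begin()` are all hypothetical), not a reproduction of actual kernel behaviour.

```c
#include <stdbool.h>
#include <stddef.h>

struct work { struct work *next; };

struct task {
	struct work *works;	/* stand-in for task->task_works */
	bool signaled;		/* stand-in for PF_SIGNALED */
};

/* Adder, step 1: the flag test, done without any lock or ordering. */
static bool adder_check(const struct task *t)
{
	return !t->signaled;
}

/* Adder, step 2: publish the work (the cmpxchg in the real code). */
static void adder_publish(struct task *t, struct work *w)
{
	w->next = t->works;
	t->works = w;
}

/* Dumper: set the flag, then drain whatever is queued at that instant. */
static void dumper_begin(struct task *t)
{
	t->signaled = true;
	t->works = NULL;
}

static size_t pending(const struct task *t)
{
	size_t n = 0;

	for (const struct work *w = t->works; w; w = w->next)
		n++;
	return n;
}

/* Dump starts first: the check sees the flag and nothing is queued. */
static size_t ordered_case(void)
{
	struct task t = { NULL, false };
	struct work w = { NULL };

	dumper_begin(&t);
	if (adder_check(&t))
		adder_publish(&t, &w);
	return pending(&t);
}

/* Dump starts between the adder's two steps: work is queued mid-dump. */
static size_t racy_case(void)
{
	struct task t = { NULL, false };
	struct work w = { NULL };

	if (adder_check(&t)) {		/* flag still clear, check passes */
		dumper_begin(&t);	/* "other CPU" starts the dump here */
		adder_publish(&t, &w);	/* lands after the flush regardless */
	}
	return pending(&t);
}
```

In the ordered case nothing is left pending once the dump is running; in the racy case one item is, which is the situation the flag test was meant to exclude.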

2021-08-19 15:03:34

by Jens Axboe

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

On 8/18/21 8:57 PM, Linus Torvalds wrote:
> On Tue, Aug 17, 2021 at 8:06 PM Jens Axboe <[email protected]> wrote:
>>
>> task_work being added with notify == TWA_SIGNAL will utilize
>> TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
>> If this happens while a task is going through a core dump, it'll
>> potentially disturb and truncate the dump as a signal interruption.
>
> This patch seems (a) buggy and (b) hacky.
>
>> --- a/kernel/task_work.c
>> +++ b/kernel/task_work.c
>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>  		head = READ_ONCE(task->task_works);
>>  		if (unlikely(head == &work_exited))
>>  			return -ESRCH;
>> +		/*
>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>> +		 * a core dump in progress, reject them.
>> +		 */
>> +		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
>> +			return -ESRCH;
>
> This basically seems to check task->flags with no serialization.
>
> I'm sure it works 99.9% of the time in practice, since you'd be really
> unlucky to hit any races, but I really don't see what the
> serialization logic is.
>
> Also, the main user that actually triggered the problem already has
>
> 	if (unlikely(tsk->flags & PF_EXITING))
> 		goto fail;
>
> just above the call to task_work_add(), so this all seems very hacky indeed.
>
> Of course, I don't see what the serialization for _that_ one is either.
>
> Pls explain. You can't just randomly add tests for random flags that
> get modified by other random code.

You're absolutely right. On the io_uring side, in the current tree,
there's only one check where current != task being checked - and that's
in the poll rewait arming. That one should likely just go away. It may
be fine as it is, as it just pertains to ring exit cancelations. We want
to ensure that we don't rearm poll requests if the process is canceling
and going away. I'll take a closer look at that one.

For this particular patch, I agree it's racy. I'll see if I can come up
with something better...

--
Jens Axboe

2021-08-22 20:56:46

by Olivier Langlois

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

On Thu, 2021-08-19 at 08:59 -0600, Jens Axboe wrote:
>
> You're absolutely right. On the io_uring side, in the current tree,
> there's only one check where current != task being checked - and that's
> in the poll rewait arming. That one should likely just go away. It may
> be fine as it is, as it just pertains to ring exit cancelations. We want
> to ensure that we don't rearm poll requests if the process is canceling
> and going away. I'll take a closer look at that one.
>
> For this particular patch, I agree it's racy. I'll see if I can come up
> with something better...
>
I have finally found the patch that you wanted me to test. I'm going to
test it ASAP, despite still having issues.

I do have a different approach to solve the same core dump issue.

Feel free to consider it if this can avoid the race condition described
here.


2021-08-23 04:57:45

by Olivier Langlois

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

On Tue, 2021-08-17 at 21:06 -0600, Jens Axboe wrote:
> task_work being added with notify == TWA_SIGNAL will utilize
> TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
> If this happens while a task is going through a core dump, it'll
> potentially disturb and truncate the dump as a signal interruption.
>
> Have task_work_add() with notify == TWA_SIGNAL check if a task has been
> signaled for a core dump, and refuse to add the work if that is the case.
> When a core dump is invoked, explicitly check for TIF_NOTIFY_SIGNAL and
> run any pending task_work if that is set. This is similar to how an
> exiting task will not get new task_work added, and we return the same
> error for the core dump case. As we return success or failure from
> task_work_add(), the caller has to be prepared to handle this case
> already.
>
> Currently this manifests itself in that io_uring tasks that end up using
> task_work will experience truncated core dumps.
>
> Reported-by: Tony Battersby <[email protected]>
> Reported-by: Olivier Langlois <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: [email protected] # 5.10+
> Signed-off-by: Jens Axboe <[email protected]>
>
> ---
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 07afb5ddb1c4..ca7c1ee44ada 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -602,6 +602,14 @@ void do_coredump(const kernel_siginfo_t *siginfo)
>  		.mm_flags = mm->flags,
>  	};
>
> +	/*
> +	 * task_work_add() will refuse to add work after PF_SIGNALED has
> +	 * been set, ensure that we flush any pending TIF_NOTIFY_SIGNAL work
> +	 * if any was queued before that.
> +	 */
> +	if (test_thread_flag(TIF_NOTIFY_SIGNAL))
> +		tracehook_notify_signal();
> +
>  	audit_core_dumps(siginfo->si_signo);
>
>  	binfmt = mm->binfmt;
> diff --git a/kernel/task_work.c b/kernel/task_work.c
> index 1698fbe6f0e1..1ab28904adc4 100644
> --- a/kernel/task_work.c
> +++ b/kernel/task_work.c
> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>  		head = READ_ONCE(task->task_works);
>  		if (unlikely(head == &work_exited))
>  			return -ESRCH;
> +		/*
> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
> +		 * a core dump in progress, reject them.
> +		 */
> +		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
> +			return -ESRCH;
>  		work->next = head;
>  	} while (cmpxchg(&task->task_works, head, work) != head);
>

tested successfully on 5.12.19

Tested-by: Olivier Langlois <[email protected]>


2022-03-21 23:28:17

by Tony Battersby

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

On 8/19/21 10:59, Jens Axboe wrote:
> On 8/18/21 8:57 PM, Linus Torvalds wrote:
>> On Tue, Aug 17, 2021 at 8:06 PM Jens Axboe <[email protected]> wrote:
>>> task_work being added with notify == TWA_SIGNAL will utilize
>>> TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
>>> If this happens while a task is going through a core dump, it'll
>>> potentially disturb and truncate the dump as a signal interruption.
>> This patch seems (a) buggy and (b) hacky.
>>
>>> --- a/kernel/task_work.c
>>> +++ b/kernel/task_work.c
>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>  		head = READ_ONCE(task->task_works);
>>>  		if (unlikely(head == &work_exited))
>>>  			return -ESRCH;
>>> +		/*
>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>> +		 * a core dump in progress, reject them.
>>> +		 */
>>> +		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
>>> +			return -ESRCH;
>> This basically seems to check task->flags with no serialization.
>>
>> I'm sure it works 99.9% of the time in practice, since you'd be really
>> unlucky to hit any races, but I really don't see what the
>> serialization logic is.
>>
>> Also, the main user that actually triggered the problem already has
>>
>> 	if (unlikely(tsk->flags & PF_EXITING))
>> 		goto fail;
>>
>> just above the call to task_work_add(), so this all seems very hacky indeed.
>>
>> Of course, I don't see what the serialization for _that_ one is either.
>>
>> Pls explain. You can't just randomly add tests for random flags that
>> get modified by other random code.
> You're absolutely right. On the io_uring side, in the current tree,
> there's only one check where current != task being checked - and that's
> in the poll rewait arming. That one should likely just go away. It may
> be fine as it is, as it just pertains to ring exit cancelations. We want
> to ensure that we don't rearm poll requests if the process is canceling
> and going away. I'll take a closer look at that one.
>
> For this particular patch, I agree it's racy. I'll see if I can come up
> with something better...
>

Continuing this thread from August 2021:

I previously tested a version of Jens' patch backported to 5.10 and it
fixed my problem.  Now I am trying to upgrade kernels, and 5.17 still
has the same problem - coredumps from an io_uring program to a pipe are
truncated.  Jens' patch applied to 5.17 again fixes the problem.  Has
there been any progress with fixing the problem upstream?

Reference:

https://lore.kernel.org/all/[email protected]/
https://lore.kernel.org/all/[email protected]/

Tony Battersby
Cybernetics
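
As a rough picture of the symptom reported above: the patch description says a TWA_SIGNAL notification can "disturb and truncate the dump as a signal interruption". The toy model below assumes the dump writer treats a pending notification like a pending signal and stops at the next chunk boundary. `struct dump_ctx`, `model_dump()` and the chunk-index trigger are hypothetical and only illustrate the shape of the failure, not the actual coredump or pipe write path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct dump_ctx {
	bool notify_pending;	/* models a pending TIF_NOTIFY_SIGNAL */
	size_t interrupt_at;	/* chunk index where a notification arrives,
				 * SIZE_MAX for "never" */
};

/*
 * Copy src to dst in chunks, giving up as soon as a notification is
 * pending. Returns the number of bytes that made it into the "dump".
 */
static size_t model_dump(struct dump_ctx *ctx, char *dst, const char *src,
			 size_t len, size_t chunk)
{
	size_t done = 0, idx = 0;

	while (done < len) {
		size_t n = len - done < chunk ? len - done : chunk;

		if (idx == ctx->interrupt_at)
			ctx->notify_pending = true;	/* work queued mid-dump */
		if (ctx->notify_pending)
			break;		/* an interruptible write bails out */
		memcpy(dst + done, src + done, n);
		done += n;
		idx++;
	}
	return done;
}
```

With a 10-byte source and 4-byte chunks, an undisturbed run copies everything, while a notification arriving at chunk index 1 leaves only the first 4 bytes in the output.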

2022-03-23 12:31:54

by Eric W. Biederman

Subject: Re: [PATCH] kernel: make TIF_NOTIFY_SIGNAL and core dumps co-exist

Tony Battersby <[email protected]> writes:

> On 8/19/21 10:59, Jens Axboe wrote:
>> On 8/18/21 8:57 PM, Linus Torvalds wrote:
>>> On Tue, Aug 17, 2021 at 8:06 PM Jens Axboe <[email protected]> wrote:
>>>> task_work being added with notify == TWA_SIGNAL will utilize
>>>> TIF_NOTIFY_SIGNAL for signaling the targeted task that work is available.
>>>> If this happens while a task is going through a core dump, it'll
>>>> potentially disturb and truncate the dump as a signal interruption.
>>> This patch seems (a) buggy and (b) hacky.
>>>
>>>> --- a/kernel/task_work.c
>>>> +++ b/kernel/task_work.c
>>>> @@ -41,6 +41,12 @@ int task_work_add(struct task_struct *task, struct callback_head *work,
>>>>  		head = READ_ONCE(task->task_works);
>>>>  		if (unlikely(head == &work_exited))
>>>>  			return -ESRCH;
>>>> +		/*
>>>> +		 * TIF_NOTIFY_SIGNAL notifications will interfere with
>>>> +		 * a core dump in progress, reject them.
>>>> +		 */
>>>> +		if (notify == TWA_SIGNAL && (task->flags & PF_SIGNALED))
>>>> +			return -ESRCH;
>>> This basically seems to check task->flags with no serialization.
>>>
>>> I'm sure it works 99.9% of the time in practice, since you'd be really
>>> unlucky to hit any races, but I really don't see what the
>>> serialization logic is.
>>>
>>> Also, the main user that actually triggered the problem already has
>>>
>>> 	if (unlikely(tsk->flags & PF_EXITING))
>>> 		goto fail;
>>>
>>> just above the call to task_work_add(), so this all seems very hacky indeed.
>>>
>>> Of course, I don't see what the serialization for _that_ one is either.
>>>
>>> Pls explain. You can't just randomly add tests for random flags that
>>> get modified by other random code.
>> You're absolutely right. On the io_uring side, in the current tree,
>> there's only one check where current != task being checked - and that's
>> in the poll rewait arming. That one should likely just go away. It may
>> be fine as it is, as it just pertains to ring exit cancelations. We want
>> to ensure that we don't rearm poll requests if the process is canceling
>> and going away. I'll take a closer look at that one.
>>
>> For this particular patch, I agree it's racy. I'll see if I can come up
>> with something better...
>>
>
> Continuing this thread from August 2021:
>
> I previously tested a version of Jens' patch backported to 5.10 and it
> fixed my problem.  Now I am trying to upgrade kernels, and 5.17 still
> has the same problem - coredumps from an io_uring program to a pipe are
> truncated.  Jens' patch applied to 5.17 again fixes the problem.  Has
> there been any progress with fixing the problem upstream?
>
> Reference:
>
> https://lore.kernel.org/all/[email protected]/
> https://lore.kernel.org/all/[email protected]/

I am still slowly working on this. (I was unfortunately preempted by
some painful to track down and fix regressions elsewhere).

When I double-checked that I understood the problem, the case you
describe turned out to be one of the easy cases that needs to be handled.

There is at least one more difficult interaction that is not solved by
squelching task_work_add after PF_SIGNALED is set, and I am not 100%
convinced that it is even correct to squelch task_work_add at that point
in the code.

The progress I have made to date that I am sending to Linus for v5.18 is
the removal of tracehook.h which makes the code much more
understandable.

I think I have a general solution that I am planning to post after
v5.18-rc1 that I have not tested yet on the cases that I know about,
but I expect it will work.

So I think that puts a good general fix 2-3 weeks out.

This is quite possibly a case where perfection is getting in the way of
the good, but I honestly can't judge anything except a fix that cleans
up everything and is complete. There are too many weird and subtle
interactions that I don't understand.

So I am going to continue concentrating on a good general solution so
that the code is readable and makes sense.

Eric