2021-07-06 19:07:12

by Luck, Tony

Subject: [PATCH 0/3] More machine check recovery fixes

Fix a couple of issues in machine check handling

1) Repeated machine checks inside the kernel, without the task work
function being called in between, put the kernel into an infinite
loop
2) Machine checks in kernel functions copying data from user addresses
send SIGBUS to the user as if the application had consumed the
poison. But this is wrong. The user should see either an -EFAULT
error return or a reduced byte count (in the case of write(2)).

Tony Luck (3):
x86/mce: Change to not send SIGBUS error during copy from user
x86/mce: Avoid infinite loop for copy from user recovery
x86/mce: Drop copyin special case for #MC

arch/x86/kernel/cpu/mce/core.c | 62 ++++++++++++++++++++++++----------
arch/x86/lib/copy_user_64.S | 13 -------
include/linux/sched.h | 1 +
3 files changed, 45 insertions(+), 31 deletions(-)

--
2.29.2


2021-07-06 19:07:46

by Luck, Tony

Subject: [PATCH 1/3] x86/mce: Change to not send SIGBUS error during copy from user

Sending a SIGBUS for a copy from user is not the correct semantic.
System calls should return -EFAULT (or a short count for write(2)).

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 33 +++++++++++++++++++--------------
1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22791aadc085..dd03971e5ad5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1265,7 +1265,7 @@ static void kill_me_maybe(struct callback_head *cb)
flags |= MF_MUST_KILL;

ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
- if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+ if (!ret) {
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
sync_core();
return;
@@ -1279,25 +1279,27 @@ static void kill_me_maybe(struct callback_head *cb)
if (ret == -EHWPOISON)
return;

- if (p->mce_vaddr != (void __user *)-1l) {
- force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
- } else {
- pr_err("Memory error not recovered");
- kill_me_now(cb);
- }
+ pr_err("Memory error not recovered");
+ kill_me_now(cb);
+}
+
+static void kill_me_never(struct callback_head *cb)
+{
+ struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
+
+ pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
+ if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
+ set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
}

-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, void (*func)(struct callback_head *))
{
current->mce_addr = m->addr;
current->mce_kflags = m->kflags;
current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
current->mce_whole_page = whole_page(m);

- if (kill_current_task)
- current->mce_kill_me.func = kill_me_now;
- else
- current->mce_kill_me.func = kill_me_maybe;
+ current->mce_kill_me.func = func;

task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}
@@ -1435,7 +1437,10 @@ noinstr void do_machine_check(struct pt_regs *regs)
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));

- queue_task_work(&m, kill_current_task);
+ if (kill_current_task)
+ queue_task_work(&m, kill_me_now);
+ else
+ queue_task_work(&m, kill_me_maybe);

} else {
/*
@@ -1453,7 +1458,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
}

if (m.kflags & MCE_IN_KERNEL_COPYIN)
- queue_task_work(&m, kill_current_task);
+ queue_task_work(&m, kill_me_never);
}
out:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
--
2.29.2

2021-07-06 19:07:51

by Luck, Tony

Subject: [PATCH 3/3] x86/mce: Drop copyin special case for #MC

Fixes to the iterator code to handle faults that are not on
page boundaries mean that the special case for machine check
during copy from user is no longer needed.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/lib/copy_user_64.S | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 57b79c577496..2797e630b9b1 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -234,24 +234,11 @@ EXPORT_SYMBOL(copy_user_enhanced_fast_string)
*/
SYM_CODE_START_LOCAL(.Lcopy_user_handle_tail)
movl %edx,%ecx
- cmp $X86_TRAP_MC,%eax /* check if X86_TRAP_MC */
- je 3f
1: rep movsb
2: mov %ecx,%eax
ASM_CLAC
ret

- /*
- * Return zero to pretend that this copy succeeded. This
- * is counter-intuitive, but needed to prevent the code
- * in lib/iov_iter.c from retrying and running back into
- * the poison cache line again. The machine check handler
- * will ensure that a SIGBUS is sent to the task.
- */
-3: xorl %eax,%eax
- ASM_CLAC
- ret
-
_ASM_EXTABLE_CPY(1b, 2b)
SYM_CODE_END(.Lcopy_user_handle_tail)

--
2.29.2

2021-07-06 19:07:55

by Luck, Tony

Subject: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

The recovery action when get_user() triggers a machine check uses the
fixup path to make get_user() return -EFAULT. queue_task_work() also
arranges for kill_me_maybe() to be called on return to user mode to send a
SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that
was used in the first call, the net result is an entry on the
current->task_works list that points to itself. When task_work_run()
is called it loops forever in this code:

do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Add a counter (current->mce_count) to keep track of repeated machine checks
before task_work() is called. The first machine check saves the address
information and calls task_work_add(). Subsequent machine checks before that
task_work callback is executed check that the address is in the same page as
the first machine check (since the callback will offline exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 39 ++++++++++++++++++++++++++--------
include/linux/sched.h | 1 +
2 files changed, 31 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index dd03971e5ad5..957ec60cd2a8 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin

static void kill_me_now(struct callback_head *ch)
{
+ struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+ p->mce_count = 0;
force_sig(SIGBUS);
}

@@ -1259,6 +1262,7 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
int ret;

+ p->mce_count = 0;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);

if (!p->mce_ripv)
@@ -1287,19 +1291,36 @@ static void kill_me_never(struct callback_head *cb)
{
struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);

+ p->mce_count = 0;
pr_err("Kernel accessed poison in user space at %llx\n", p->mce_addr);
if (!memory_failure(p->mce_addr >> PAGE_SHIFT, 0))
set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
}

-static void queue_task_work(struct mce *m, void (*func)(struct callback_head *))
+static void queue_task_work(struct mce *m, char *msg, void (*func)(struct callback_head *))
{
- current->mce_addr = m->addr;
- current->mce_kflags = m->kflags;
- current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
- current->mce_whole_page = whole_page(m);
+ int count = ++current->mce_count;
+
+ /* First call, save all the details */
+ if (count == 1) {
+ current->mce_addr = m->addr;
+ current->mce_kflags = m->kflags;
+ current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(m);
+ current->mce_kill_me.func = func;
+ }

- current->mce_kill_me.func = func;
+ /* Ten is likley overkill. Don't expect more than two faults before task_work() */
+ if (count > 10)
+ mce_panic("Too many machine checks while accessing user data", m, msg);
+
+ /* Second or later call, make sure page address matches the one from first call */
+ if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+ mce_panic("Machine checks to different user pages", m, msg);
+
+ /* Do not call task_work_add() more than once */
+ if (count > 1)
+ return;

task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}
@@ -1438,9 +1459,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
BUG_ON(!on_thread_stack() || !user_mode(regs));

if (kill_current_task)
- queue_task_work(&m, kill_me_now);
+ queue_task_work(&m, msg, kill_me_now);
else
- queue_task_work(&m, kill_me_maybe);
+ queue_task_work(&m, msg, kill_me_maybe);

} else {
/*
@@ -1458,7 +1479,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
}

if (m.kflags & MCE_IN_KERNEL_COPYIN)
- queue_task_work(&m, kill_me_never);
+ queue_task_work(&m, msg, kill_me_never);
}
out:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..f6935787e7e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1394,6 +1394,7 @@ struct task_struct {
mce_whole_page : 1,
__mce_reserved : 62;
struct callback_head mce_kill_me;
+ int mce_count;
#endif

#ifdef CONFIG_KRETPROBES
--
2.29.2

2021-07-22 13:56:31

by Jue Wang

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

This patch assumes the UC error consumed in the kernel is always the same UC.

Yet it's possible two UCs on different pages are consumed in a row.
The patch below will panic on the 2nd MCE. How can we make the code work
with multiple UC errors?


> + int count = ++current->mce_count;
> +
> + /* First call, save all the details */
> + if (count == 1) {
> + current->mce_addr = m->addr;
> + current->mce_kflags = m->kflags;
> + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> + current->mce_whole_page = whole_page(m);
> + current->mce_kill_me.func = func;
> + }
> ......
> + /* Second or later call, make sure page address matches the one from first call */
> + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> + mce_panic("Machine checks to different user pages", m, msg);

2021-07-22 15:22:39

by Luck, Tony

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

On Thu, Jul 22, 2021 at 06:54:37AM -0700, Jue Wang wrote:
> This patch assumes the UC error consumed in the kernel is always the same UC.
>
> Yet it's possible two UCs on different pages are consumed in a row.
> The patch below will panic on the 2nd MCE. How can we make the code work
> with multiple UC errors?
>
>
> > + int count = ++current->mce_count;
> > +
> > + /* First call, save all the details */
> > + if (count == 1) {
> > + current->mce_addr = m->addr;
> > + current->mce_kflags = m->kflags;
> > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > + current->mce_whole_page = whole_page(m);
> > + current->mce_kill_me.func = func;
> > + }
> > ......
> > + /* Second or later call, make sure page address matches the one from first call */
> > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > + mce_panic("Machine checks to different user pages", m, msg);

The issue is getting the information about the location
of the error from the machine check handler to the "task_work"
function that processes it. Currently there is a single place
to store the address of the error in the task structure:

current->mce_addr = m->addr;

Plausibly that could be made into an array, indexed by
current->mce_count to save multiple addresses (perhaps
also need mce_kflags, mce_ripv, etc. to also be arrays).

But I don't want to pre-emptively make such a change without
some data to show that situations arise with multiple errors
to different addresses:
1) Actually occur
2) Would be recovered if we made the change.

The first would be indicated by seeing the:

"Machine checks to different user pages"

panic. You'd have to code up the change to have arrays
to confirm that would fix the problem.

-Tony

2021-07-22 23:32:04

by Jue Wang

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

I think the challenge is that uncorrectable errors are essentially
random. It's just a matter of time before >1 UC errors show up in
sequential kernel accesses.

It's easy to create such cases with artificial error injections.

I suspect we want to design this part of the kernel to be able to handle generic
cases?

Thanks,
-Jue

On Thu, Jul 22, 2021 at 8:19 AM Luck, Tony <[email protected]> wrote:
>
> On Thu, Jul 22, 2021 at 06:54:37AM -0700, Jue Wang wrote:
> > This patch assumes the UC error consumed in the kernel is always the same UC.
> >
> > Yet it's possible two UCs on different pages are consumed in a row.
> > The patch below will panic on the 2nd MCE. How can we make the code work
> > with multiple UC errors?
> >
> >
> > > + int count = ++current->mce_count;
> > > +
> > > + /* First call, save all the details */
> > > + if (count == 1) {
> > > + current->mce_addr = m->addr;
> > > + current->mce_kflags = m->kflags;
> > > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> > > + current->mce_whole_page = whole_page(m);
> > > + current->mce_kill_me.func = func;
> > > + }
> > > ......
> > > + /* Second or later call, make sure page address matches the one from first call */
> > > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > > + mce_panic("Machine checks to different user pages", m, msg);
>
> The issue is getting the information about the location
> of the error from the machine check handler to the "task_work"
> function that processes it. Currently there is a single place
> to store the address of the error in the task structure:
>
> current->mce_addr = m->addr;
>
> Plausibly that could be made into an array, indexed by
> current->mce_count to save multiple addresses (perhaps
> also need mce_kflags, mce_ripv, etc. to also be arrays).
>
> But I don't want to pre-emptively make such a change without
> some data to show that situations arise with multiple errors
> to different addresses:
> 1) Actually occur
> 2) Would be recovered if we made the change.
>
> The first would be indicated by seeing the:
>
> "Machine checks to different user pages"
>
> panic. You'd have to code up the change to have arrays
> to confirm that would fix the problem.
>
> -Tony

2021-07-23 00:17:07

by Luck, Tony

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

On Thu, Jul 22, 2021 at 04:30:44PM -0700, Jue Wang wrote:
> I think the challenge is that uncorrectable errors are essentially
> random. It's just a matter of time before >1 UC errors show up in
> sequential kernel accesses.
>
> It's easy to create such cases with artificial error injections.
>
> I suspect we want to design this part of the kernel to be able to handle generic
> cases?

Remember that:
1) These errors are all in application memory
2) We reset the count every time we get into the task_work function that
will return to user

So the multiple error scenario here is one where we hit errors
on different user pages on a single trip into the kernel.

Hitting the same page is easy. The kernel has places where it
can hit poison with page faults disabled, and it then enables
page faults and retries the same access, and hits poison again.

I'm not aware of, nor expecting to find, places where the kernel
tries to access user address A and hits poison, and then tries to
access user address B (without returning to user between access
A and access B).

-Tony

2021-07-23 03:49:57

by Jue Wang

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

On Thu, Jul 22, 2021 at 5:14 PM Luck, Tony <[email protected]> wrote:
>
> I'm not aware of, nor expecting to find, places where the kernel
> tries to access user address A and hits poison, and then tries to
> access user address B (without returning to user between access
> A and access B).
This seems a reasonably easy scenario.

A user space app allocates a buffer of xyz KB/MB/GB.

Unfortunately the dimms are bad and multiple cache lines have
uncorrectable errors in them on different pages.

Then the user space app tries to write the content of the buffer into some
file via write(2) from the entire buffer in one go.

We have some test cases like this that repro reliably with an infinite MCE loop.

I believe the key here is that in the real world this will happen,
in particular the bit flips tend to be clustered physically -
same dimm row, dimm column, or same rank, same device etc.
>
> -Tony

2021-07-23 04:07:37

by Luck, Tony

Subject: RE: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

>> I'm not aware of, nor expecting to find, places where the kernel
>> tries to access user address A and hits poison, and then tries to
>> access user address B (without returning to user between access
>> A and access B).
>This seems a reasonably easy scenario.
>
> A user space app allocates a buffer of xyz KB/MB/GB.
>
> Unfortunately the dimms are bad and multiple cache lines have
> uncorrectable errors in them on different pages.
>
> Then the user space app tries to write the content of the buffer into some
> file via write(2) from the entire buffer in one go.

Before this patch Linux gets into an infinite loop taking machine
checks on the first of the poison addresses in the buffer.

With this patch (and also patch 3/3 in this series). There are
a few machine checks on the first poison address (I think the number
depends on the alignment of the poison within a page ... but I'm
not sure). My test code shows 4 machine checks at the same
address. Then Linux returns a short byte count to the user
showing how many bytes were actually written to the file.

The fact that there are many more poison lines in the buffer
beyond the place where the write stopped on the first one is
irrelevant.

[Well, if the second poisoned line is immediately after the first
you may hit h/w prefetch issues and h/w may signal a fatal
machine check ... but that's a different problem that s/w could
only solve with painful LFENCE operations between each 64-bytes
of the copy]

-Tony

2021-07-23 04:18:13

by Jue Wang

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

On Thu, Jul 22, 2021 at 9:01 PM Luck, Tony <[email protected]> wrote:
>
> >> I'm not aware of, nor expecting to find, places where the kernel
> >> tries to access user address A and hits poison, and then tries to
> >> access user address B (without returning to user between access
> >> A and access B).
> >This seems a reasonably easy scenario.
> >
> > A user space app allocates a buffer of xyz KB/MB/GB.
> >
> > Unfortunately the dimms are bad and multiple cache lines have
> > uncorrectable errors in them on different pages.
> >
> > Then the user space app tries to write the content of the buffer into some
> > file via write(2) from the entire buffer in one go.
>
> Before this patch Linux gets into an infinite loop taking machine
> checks on the first of the poison addresses in the buffer.
>
> With this patch (and also patch 3/3 in this series). There are
> a few machine checks on the first poison address (I think the number
> depends on the alignment of the poison within a page ... but I'm
> not sure). My test code shows 4 machine checks at the same
> address. Then Linux returns a short byte count to the user
> showing how many bytes were actually written to the file.
>
> The fact that there are many more poison lines in the buffer
> beyond the place where the write stopped on the first one is
> irrelevant.
In our test, the application memory was anon.
With 1 UC error injected, the test always passes with the error
recovered and a SIGBUS delivered to user space.

When there are >1 UC errors in buffer, then indefinite mce loop.
>
> [Well, if the second poisoned line is immediately after the first
> you may hit h/w prefetch issues and h/w may signal a fatal
> machine check ... but that's a different problem that s/w could
> only solve with painful LFENCE operations between each 64-bytes
> of the copy]
>
> -Tony

2021-07-23 14:49:30

by Luck, Tony

Subject: RE: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

> In our test, the application memory was anon.
> With 1 UC error injected, the test always passes with the error
> recovered and a SIGBUS delivered to user space.
>
> When there are >1 UC errors in buffer, then indefinite mce loop.

Do you still see the infinite loop with these three patches on top of
v5.14-rc? Or do you now get either a short byte count returned from
write(2), or the panic from:

mce_panic("Machine checks to different user pages", m, msg);

-Tony


2021-07-31 06:33:59

by Jue Wang

Subject: RE: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

Been busy with some other work.

After cherry picking patch 1 & 2, I saw the following with 2 UC errors injected
into the user space buffer passed into write(2), as expected:

[ 287.994754] Kernel panic - not syncing: Machine checks to different user pages

The kernel tested with has its x86/mce and mm/memory-failure aligned with
upstream till around 2020/11.

Is there any other patch that I have missed to the write syscall etc?

Thanks,
-Jue

2021-07-31 20:45:17

by Luck, Tony

Subject: RE: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

> After cherry picking patch 1 & 2, I saw the following with 2 UC errors injected
> into the user space buffer passed into write(2), as expected:
>
> [ 287.994754] Kernel panic - not syncing: Machine checks to different user pages

Interesting. What are the offsets of the two injected errors in your test (both
w.r.t. the start of the buffer, and within a page)?

> The kernel tested with has its x86/mce and mm/memory-failure aligned with
> upstream till around 2020/11.
>
> Is there any other patch that I have missed to the write syscall etc?

There is a long series of patches from Al Viro to lib/iov_iter.c that may
also be relevant in making the kernel copy from user stop at the first poison
address in the buffer.

-Tony

2021-08-02 15:32:06

by Jue Wang

Subject: Re: [PATCH 2/3] x86/mce: Avoid infinite loop for copy from user recovery

On Sat, Jul 31, 2021 at 1:43 PM Luck, Tony <[email protected]> wrote:
>
> > After cherry picking patch 1 & 2, I saw the following with 2 UC errors injected
> > into the user space buffer passed into write(2), as expected:
> >
> > [ 287.994754] Kernel panic - not syncing: Machine checks to different user pages
>
> Interesting. What are the offsets of the two injected errors in your test (both
> w.r.t. the start of the buffer, and within a page)?
They are just random offsets into the first 2 pages of the buffer (4k aligned),
1 error per page. To be precise: 0x440 and 0x1c0 within each page.

>
> > The kernel tested with has its x86/mce and mm/memory-failure aligned with
> > upstream till around 2020/11.
> >
> > Is there any other patch that I have missed to the write syscall etc?
>
> There is a long series of patches from Al Viro to lib/iov_iter.c that may
> also be relevant in making the kernel copy from user stop at the first poison
> address in the buffer.
Thanks for the pointer.

Looks like [1],[2] are not yet merged.

Is lib/iov_iter.c the only place the kernel performs a copy from user
and gets multiple poisons? I suspect not.

For example, lots of kernel accesses to user space memory are from kernel agents
like khugepaged, NUMA auto balancing etc. These paths are not handled by the fix
to lib/iov_iter.c.

I think the fix might have to be made to the #MC handler's behavior wrt
the task work. Sending #MC signals and performing memory-failure actions
from a task work is fine for #MCs originating from user space, but not
suitable for the kernel accessing poisoned (user space) memory. For the
latter, the #MC handler must handle recovery in the exception context
without resorting to task work; this may be OK since the recovery action
in that case is minimal: mark PG_hwpoison and remove the kernel mapping.

1. https://lore.kernel.org/linux-mm/[email protected]/
2. https://lore.kernel.org/linux-mm/[email protected]/

>
> -Tony

2021-08-18 00:32:36

by Luck, Tony

Subject: [PATCH v2 0/3] More machine check recovery fixes

Fix a couple of issues in machine check handling

1) Repeated machine checks inside the kernel, without the task work
function being called in between, put the kernel into an infinite
loop
2) Machine checks in kernel functions copying data from user addresses
send SIGBUS to the user as if the application had consumed the
poison. But this is wrong. The user should see either an -EFAULT
error return or a reduced byte count (in the case of write(2)).

My latest tests have been on v5.14-rc6 with this patch (that's already
in -mm) applied:
https://lore.kernel.org/r/[email protected]

Changes since v1:
1) Fix bug in kill_me_never() that forgot to clear p->mce_count so
repeated recovery in the same task would trigger the panic for
"Machine checks to different user pages"
[Note to Jue Wang ... this *might* be why your test that injects
two errors into the same buffer passed to a write(2) syscall
failed with this message]
2) Re-order patches so that "Avoid infinite loop" can be backported
to stable.

Note that the other two parts of this series depend upon Al Viro's
extensive re-work to lib/iov_iter.c ... so don't try to backport those
without also picking up Al's work.

Tony Luck (3):
x86/mce: Avoid infinite loop for copy from user recovery
x86/mce: Change to not send SIGBUS error during copy from user
x86/mce: Drop copyin special case for #MC

arch/x86/kernel/cpu/mce/core.c | 62 ++++++++++++++++++++++++----------
arch/x86/lib/copy_user_64.S | 13 -------
include/linux/sched.h | 1 +
3 files changed, 45 insertions(+), 31 deletions(-)


base-commit: 7c60610d476766e128cc4284bb6349732cbd6606
--
2.29.2

2021-08-18 00:33:28

by Luck, Tony

Subject: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

The recovery action when get_user() triggers a machine check uses the
fixup path to make get_user() return -EFAULT. queue_task_work() also
arranges for kill_me_maybe() to be called on return to user mode to send
a SIGBUS to the current process.

But there are places in the kernel where the code assumes that this
EFAULT return was simply because of a page fault. The code takes some
action to fix that, and then retries the access. This results in a second
machine check.

While processing this second machine check queue_task_work() is called
again. But since this uses the same callback_head structure that was used
in the first call, the net result is an entry on the current->task_works
list that points to itself. When task_work_run() is called it loops
forever in this code:

do {
next = work->next;
work->func(work);
work = next;
cond_resched();
} while (work);

Add a counter (current->mce_count) to keep track of repeated machine
checks before task_work() is called. The first machine check saves the
address information and calls task_work_add(). Subsequent machine checks
before that task_work callback is executed check that the address is in the
same page as the first machine check (since the callback will offline
exactly one page).

Expected worst case is two machine checks before moving on (e.g. one user
access with page faults disabled, then a repeat to the same address with
page faults enabled). Just in case there is some code that loops forever,
enforce a limit of 10.

Cc: <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++---------
include/linux/sched.h | 1 +
2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22791aadc085..94830ee9581c 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin

static void kill_me_now(struct callback_head *ch)
{
+ struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+ p->mce_count = 0;
force_sig(SIGBUS);
}

@@ -1259,6 +1262,7 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
int ret;

+ p->mce_count = 0;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);

if (!p->mce_ripv)
@@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb)
}
}

-static void queue_task_work(struct mce *m, int kill_current_task)
+static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
{
- current->mce_addr = m->addr;
- current->mce_kflags = m->kflags;
- current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
- current->mce_whole_page = whole_page(m);
+ int count = ++current->mce_count;

- if (kill_current_task)
- current->mce_kill_me.func = kill_me_now;
- else
- current->mce_kill_me.func = kill_me_maybe;
+ /* First call, save all the details */
+ if (count == 1) {
+ current->mce_addr = m->addr;
+ current->mce_kflags = m->kflags;
+ current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(m);
+
+ if (kill_current_task)
+ current->mce_kill_me.func = kill_me_now;
+ else
+ current->mce_kill_me.func = kill_me_maybe;
+ }
+
+ /* Ten is likley overkill. Don't expect more than two faults before task_work() */
+ if (count > 10)
+ mce_panic("Too many machine checks while accessing user data", m, msg);
+
+ /* Second or later call, make sure page address matches the one from first call */
+ if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+ mce_panic("Machine checks to different user pages", m, msg);
+
+ /* Do not call task_work_add() more than once */
+ if (count > 1)
+ return;

task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}
@@ -1435,7 +1456,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));

- queue_task_work(&m, kill_current_task);
+ queue_task_work(&m, msg, kill_current_task);

} else {
/*
@@ -1453,7 +1474,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
}

if (m.kflags & MCE_IN_KERNEL_COPYIN)
- queue_task_work(&m, kill_current_task);
+ queue_task_work(&m, msg, kill_current_task);
}
out:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ec8d07d88641..f6935787e7e8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1394,6 +1394,7 @@ struct task_struct {
mce_whole_page : 1,
__mce_reserved : 62;
struct callback_head mce_kill_me;
+ int mce_count;
#endif

#ifdef CONFIG_KRETPROBES
--
2.29.2

2021-08-18 16:19:50

by Luck, Tony

Subject: RE: [PATCH v2 0/3] More machine check recovery fixes

> Changes since v1:
> 1) Fix bug in kill_me_never() that forgot to clear p->mce_count so
> repeated recovery in the same task would trigger the panic for
> "Machine checks to different user pages"
> [Note to Jue Wang ... this *might* be why your test that injects
> two errors into the same buffer passed to a write(2) syscall
> failed with this message]

I recreated Jue's specific test today with uncorrected errors in two
pages passed to a write(2) syscall.

buf = alloc(2 pages);
inject(buf + 0x440);
inject(buf + 0x11c0);
n = write(fd, buf, 8K);

Result was that the write returned 0x440 (i.e. bytes written up to the
first poison location).
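Sketched as a userspace model (model_write() and the POISON offsets are stand-ins for the injection harness, not real syscall machinery), the expected semantics are:

```c
#include <stddef.h>

/* Hypothetical poison offsets matching the injections above. */
#define POISON1 0x440UL
#define POISON2 0x11c0UL

/*
 * Model of write(2) over a buffer with two poisoned lines: the kernel
 * copy stops at the first poisoned byte, and the syscall returns the
 * number of bytes successfully copied before it.
 */
static long model_write(size_t len)
{
	size_t copied;

	for (copied = 0; copied < len; copied++) {
		if (copied == POISON1 || copied == POISON2)
			break;	/* machine check: copy stops here */
	}
	return (long)copied;
}
```

The second poison location never matters for the return value; only the first one encountered limits the count.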

-Tony

2021-08-20 17:33:21

by Borislav Petkov

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> @@ -1287,17 +1291,34 @@ static void kill_me_maybe(struct callback_head *cb)
> }
> }
>
> -static void queue_task_work(struct mce *m, int kill_current_task)
> +static void queue_task_work(struct mce *m, char *msg, int kill_current_task)
> {
> - current->mce_addr = m->addr;
> - current->mce_kflags = m->kflags;
> - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> - current->mce_whole_page = whole_page(m);
> + int count = ++current->mce_count;
>
> - if (kill_current_task)
> - current->mce_kill_me.func = kill_me_now;
> - else
> - current->mce_kill_me.func = kill_me_maybe;
> + /* First call, save all the details */
> + if (count == 1) {
> + current->mce_addr = m->addr;
> + current->mce_kflags = m->kflags;
> + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> + current->mce_whole_page = whole_page(m);
> +
> + if (kill_current_task)
> + current->mce_kill_me.func = kill_me_now;
> + else
> + current->mce_kill_me.func = kill_me_maybe;
> + }
> +
> + /* Ten is likley overkill. Don't expect more than two faults before task_work() */

"likely"

> + if (count > 10)
> + mce_panic("Too many machine checks while accessing user data", m, msg);

Ok, aren't we too nasty here? Why should we panic the whole box even
with 10 MCEs? It is still user memory...

IOW, why not:

	if (count > 10)
		current->mce_kill_me.func = kill_me_now;

and when we return, that user process dies immediately.

> + /* Second or later call, make sure page address matches the one from first call */
> + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> + mce_panic("Machine checks to different user pages", m, msg);

Same question here.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-20 19:04:33

by Luck, Tony

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 07:31:43PM +0200, Borislav Petkov wrote:
> On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> > + /* Ten is likley overkill. Don't expect more than two faults before task_work() */
>
> "likely"

Oops.

>
> > + if (count > 10)
> > + mce_panic("Too many machine checks while accessing user data", m, msg);
>
> Ok, aren't we too nasty here? Why should we panic the whole box even
> with 10 MCEs? It is still user memory...
>
> IOW, why not:
>
> if (count > 10)
> current->mce_kill_me.func = kill_me_now;
>
> and when we return, that user process dies immediately.

It's the "when we return" part that is the problem here. Logical
trace looks like:

user-syscall:

kernel does get_user() or copyin(), hits user poison address

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

still in kernel, see that get_user() or copyin() failed

Kernel does another get_user() or copyin() (maybe the first
was inside a pagefault_disable() region, and kernel is trying
again to see if the error was a fixable page fault. But that
wasn't the problem so ...

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

still in kernel ... but persistently thinks that just trying again
might fix it.

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

still in kernel ... this time for sure! get_user()

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

still in kernel ... but you may see the pattern get_user()

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

I'm bored typing this, but the kernel may not ever give up

machine check
sees that this was kernel get_user()/copyin() and
uses extable to "return" to exception path

I.e. the kernel doesn't ever get to call current->mce_kill_me.func()

I do have tests that show as many as 4 consecutive machine checks
before the kernel gives up trying and returns to the user to complete
recovery.
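The count-based backstop from the patch can be modelled in plain userspace C. This is only a sketch: fake_task, model_queue_task_work(), and the 4K PAGE_SHIFT_SIM are made-up stand-ins, and mce_panic() is modelled as a -1 return rather than a crash:

```c
#include <stdbool.h>

#define PAGE_SHIFT_SIM	12	/* assume 4K pages for the model */
#define MCE_LIMIT	10

struct fake_task {
	unsigned long mce_addr;
	int mce_count;
	bool work_queued;
};

/* Model of queue_task_work(): 0 = handled, -1 = would mce_panic(). */
static int model_queue_task_work(struct fake_task *t, unsigned long addr)
{
	int count = ++t->mce_count;

	/* First call: save the details. */
	if (count == 1)
		t->mce_addr = addr;

	/* The kernel kept bouncing off the poison: give up hard. */
	if (count > MCE_LIMIT)
		return -1;

	/* Later machine checks must hit the same page as the first. */
	if (count > 1 && (t->mce_addr >> PAGE_SHIFT_SIM) != (addr >> PAGE_SHIFT_SIM))
		return -1;

	/* Only the first call adds the task_work entry. */
	if (count > 1)
		return 0;

	t->work_queued = true;
	return 0;
}
```

Repeated machine checks to the same page are absorbed without queueing a second work item, which is what breaks the self-referential task_works loop.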

Maybe the message could be clearer?

mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);

>
> > + /* Second or later call, make sure page address matches the one from first call */
> > + if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
> > + mce_panic("Machine checks to different user pages", m, msg);
>
> Same question here.

Not quite the same answer ... but similar. We could in theory handle
multiple different machine check addresses by turning the "mce_addr"
field in the task structure into an array and saving each address so
that when the kernel eventually gives up poking at poison and tries
to return to user kill_me_maybe() could loop through them and deal
with each poison page.

I don't think this can happen. Jue Wang suggested that multiple poisoned
pages passed to a single write(2) syscall might trigger this panic (and
because of a bug in my earlier version, he managed to trigger this
"different user pages" panic). But this fixed up version survives the
"Jue test".

-Tony

2021-08-20 19:29:27

by Borislav Petkov

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> It's the "when we return" part that is the problem here. Logical
> trace looks like:
>
> user-syscall:
>
> kernel does get_user() or copyin(), hits user poison address
>
> machine check
> sees that this was kernel get_user()/copyin() and
> uses extable to "return" to exception path
>
> still in kernel, see that get_user() or copyin() failed
>
> Kernel does another get_user() or copyin() (maybe the first

I forgot all the details we were talking about at the time, but there's no way
to tell the kernel to back off here, is there?

As in: there was an MCE while trying to access this user memory, you
should not do get_user anymore. You did add that

* Return zero to pretend that this copy succeeded. This
* is counter-intuitive, but needed to prevent the code
* in lib/iov_iter.c from retrying and running back into

which you're removing with the last patch so I'm confused.

IOW, the problem is that with repeated MCEs while the kernel is
accessing that memory, it should be the kernel which should back off.
And then we should kill that process too but apparently we don't even
come to that.

> Maybe the message could be clearer?
>
> mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);

That's not my point - it is rather: this is a recoverable error because
it is in user memory even if it is the kernel which tries to access it.
And maybe we should not panic the whole box but try to cordon off the
faulty memory only and poison it after having killed the process using
it...

> Not quite the same answer ... but similar. We could in theory handle
> multiple different machine check addresses by turning the "mce_addr"
> field in the task structure into an array and saving each address so
> that when the kernel eventually gives up poking at poison and tries
> to return to user kill_me_maybe() could loop through them and deal
> with each poison page.

Yes, I like the aspect of making the kernel give up poking at poison and
when we return we should kill the process and poison all pages collected
so that the error source is hopefully contained.

But again, I think the important thing is how to make the kernel to back
off quicker so that we can poison the pages at all...

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-20 20:26:34

by Luck, Tony

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> > It's the "when we return" part that is the problem here. Logical
> > trace looks like:
> >
> > user-syscall:
> >
> > kernel does get_user() or copyin(), hits user poison address
> >
> > machine check
> > sees that this was kernel get_user()/copyin() and
> > uses extable to "return" to exception path
> >
> > still in kernel, see that get_user() or copyin() failed
> >
> > Kernel does another get_user() or copyin() (maybe the first
>
> I forgot all the details we were talking at the time but there's no way
> to tell the kernel to back off here, is it?
>
> As in: there was an MCE while trying to access this user memory, you
> should not do get_user anymore. You did add that
>
> * Return zero to pretend that this copy succeeded. This
> * is counter-intuitive, but needed to prevent the code
> * in lib/iov_iter.c from retrying and running back into
>
> which you're removing with the last patch so I'm confused.
>
> IOW, the problem is that with repeated MCEs while the kernel is
> accessing that memory, it should be the kernel which should back off.
> And then we should kill that process too but apparently we don't even
> come to that.

My first foray into this just tried to fix the futex() case ... which is one
where the first copy is done with pagefault_disable() set, then tried again
with it clear. My attempt there was to make get_user() return -EHWPOISON, and
change the futex code to give up immediately when it saw that code.

AndyL (and maybe others) barfed all over that (and rightly so ... there
are thousands of get_user() and copyin() calls ... making sure all of them
did the right thing with a new error code would be a huge effort, and very
error prone, because testing all these paths is hard). The new direction was
to just deal with the fact that we might take more than one machine check
before the kernel is finished poking at the poison.

>
> > Maybe the message could be clearer?
> >
> > mce_panic("Too many consecutive machine checks in kernel while accessing user data", m, msg);
>
> That's not my point - it is rather: this is a recoverable error because
> it is in user memory even if it is the kernel which tries to access it.
> And maybe we should not panic the whole box but try to cordon off the
> faulty memory only and poison it after having killed the process using
> it...

To recover we need to have some other place to jump to (besides the
normal extable error return ... which isn't working if we find ourselves
in this situation) when we hit a fault covered by an extable entry. And
also know how many machine checks is "normal" before taking the other path.

For futex(2) things resolve in two machine checks (one with
pagefault_disable() and then one without). For write(2) I see up to
four machine checks (I didn't do a detailed trace ... but I think it is
because copyin() does a retry to see if a failure in a many-bytes-at-a-time
copy might be able to get a few more bytes by doing byte-at-a-time).
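The futex-style two-pass pattern can be sketched as a userspace model (sim_get_user() and sim_futex_read() are made-up stand-ins; poison is simulated by an EFAULT-style return plus a machine check counter):

```c
#define SIM_EFAULT (-14)

static int mc_count;	/* machine checks taken by the model */

/* Model of get_user(): poisoned memory faults on every attempt. */
static int sim_get_user(int *dst, int poisoned)
{
	if (poisoned) {
		mc_count++;	/* #MC; extable fixup "returns" -EFAULT */
		return SIM_EFAULT;
	}
	*dst = 42;
	return 0;
}

/*
 * Two-pass access as in futex: first try under pagefault_disable(),
 * then retry with page faults enabled in case the failure was just a
 * fixable fault. Poison makes both passes machine-check.
 */
static int sim_futex_read(int *dst, int poisoned)
{
	if (sim_get_user(dst, poisoned) == 0)
		return 0;
	return sim_get_user(dst, poisoned);
}
```

One poisoned access thus costs exactly two machine checks before the syscall path finally gives up, which is why a count of one is not enough.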

The logical spot for the alternate return would be to unwind the stack
back to the syscall entry point, and then force an error return from
there. But that seems a perilous path ... what if some code between
syscall entry and the copyin() grabbed a mutex? We would also need to
know about that and release it as part of the recovery.

Another failed approach was to mark the page not present in the page
tables of the process accessing the poison. That would get us out of the
machine check loop. But walking page tables and changing them while still
in machine check context can't be done in a safe way (the process may be
multi-threaded and other threads could still be running on other cores).

Bottom line is that I don't think this panic can actually happen unless
there is some buggy kernel code that retries get_user() or copyin()
indefinitely.

Probably the same for the two different addresses case ... though I'm
not 100% confident about that. There could be some ioctl() that peeks
at two parts of a passed in structure, and the user might pass in a
structure that spans across a page boundary with both pages poisoned.
But that would only hit if the driver code ignored the failure of the
first get_user() and blindly tried the second. So I'd count that as a
critically bad driver bug.

-Tony

2021-08-20 20:35:47

by Luck, Tony

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 09:27:44PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 11:59:45AM -0700, Luck, Tony wrote:
> As in: there was an MCE while trying to access this user memory, you
> should not do get_user anymore. You did add that
>
> * Return zero to pretend that this copy succeeded. This
> * is counter-intuitive, but needed to prevent the code
> * in lib/iov_iter.c from retrying and running back into
>
> which you're removing with the last patch so I'm confused.

Forget to address this part in the earlier reply.

My original code that forced a zero return was a hack. It
allowed recovery to complete, but only because there was
going to be a SIGBUS. There were some unpleasant side effects.
E.g. on a write syscall the file size was updated as if the
write had succeeded. That would be very confusing for anyone
trying to clean up afterwards as the file would have good
data that was copied from the user up to the point where
the machine check interrupted things. Then NUL bytes after
(because the kernel clears pages that are allocated into
the page cache).

The new version (thanks to Al fixing iov_iter.c) now does
exactly what POSIX says should happen. If I have a buffer
with poison at offset 213, and I do this:

ret = write(fd, buf, 512);

Then the return from write is 213, and the first 213 bytes
from the buffer appear in the file, and the file size is
incremented by 213 (assuming the write started with the lseek
offset at the original size of the file).

-Tony

2021-08-21 04:54:39

by Luck, Tony

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <[email protected]> wrote:
> Probably the same for the two different addresses case ... though I'm
> not 100% confident about that. There could be some ioctl() that peeks
> at two parts of a passed in structure, and the user might pass in a
> structure that spans across a page boundary with both pages poisoned.
> But that would only hit if the driver code ignored the failure of the
> first get_user() and blindly tried the second. So I'd count that as a
> critically bad driver bug.

Or maybe driver writers are just evil :-(

	for (i = 0; i < len; i++) {
		tx_wait(10);
		get_user(dsp56k_host_interface.data.b[1], bin++);
		get_user(dsp56k_host_interface.data.b[2], bin++);
		get_user(dsp56k_host_interface.data.b[3], bin++);
	}
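For contrast, a checked version of that loop body, as a userspace sketch (get_user_sim() and copy_words() are made-up stand-ins, not the driver's API), bails out on the first failure instead of blindly touching the poison again:

```c
#define SIM_EFAULT (-14)

/* Stand-in for get_user(): fails when the source is "poisoned". */
static int get_user_sim(unsigned char *dst, const unsigned char *src,
			int poisoned)
{
	if (poisoned)
		return SIM_EFAULT;
	*dst = *src;
	return 0;
}

/* Checked variant: stop at the first failed copy. */
static int copy_words(unsigned char out[3], const unsigned char *bin,
		      int poisoned)
{
	for (int i = 0; i < 3; i++) {
		if (get_user_sim(&out[i], bin + i, poisoned))
			return SIM_EFAULT;	/* don't touch poison again */
	}
	return 0;
}
```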

-Tony

2021-08-21 21:58:43

by Al Viro

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 09:51:41PM -0700, Tony Luck wrote:
> On Fri, Aug 20, 2021 at 1:25 PM Luck, Tony <[email protected]> wrote:
> > Probably the same for the two different addresses case ... though I'm
> > not 100% confident about that. There could be some ioctl() that peeks
> > at two parts of a passed in structure, and the user might pass in a
> > structure that spans across a page boundary with both pages poisoned.
> > But that would only hit if the driver code ignored the failure of the
> > first get_user() and blindly tried the second. So I'd count that as a
> > critically bad driver bug.
>
> Or maybe driver writers are just evil :-(
>
> 	for (i = 0; i < len; i++) {
> 		tx_wait(10);
> 		get_user(dsp56k_host_interface.data.b[1], bin++);
> 		get_user(dsp56k_host_interface.data.b[2], bin++);
> 		get_user(dsp56k_host_interface.data.b[3], bin++);
> 	}

Almost any unchecked get_user()/put_user() is a bug. Fortunately, there's
not a lot of them
<greps>
93 for put_user() and 73 for get_user(). _Some_ of the former variety might
be legitimate, but most should be taken out and shot.

And dsp56k should be taken out and shot, period ;-/ This is far from the
worst in there...

2021-08-22 14:38:40

by Borislav Petkov

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 01:23:46PM -0700, Luck, Tony wrote:
> To recover we need to have some other place to jump to (besides the
> normal extable error return ... which isn't working if we find ourselves
> in this situation) when we hit a fault covered by an extable entry. And
> also know how many machine checks is "normal" before taking the other path.

Hohumm, we're on the same page here.

...

> Bottom line is that I don't think this panic can actually happen unless
> there is some buggy kernel code that retries get_user() or copyin()
> indefinitely.

You know how such statements of "well, this should not really happen in
practice" get disproved by, well, practice. :-)

I guess we'll see anyway what actually happens in practice.

> Probably the same for the two different addresses case ... though I'm
> not 100% confident about that. There could be some ioctl() that peeks
> at two parts of a passed in structure, and the user might pass in a
> structure that spans across a page boundary with both pages poisoned.
> But that would only hit if the driver code ignored the failure of the
> first get_user() and blindly tried the second. So I'd count that as a
> critically bad driver bug.

Right.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-22 14:47:46

by Borislav Petkov

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote:
> The new version (thanks to All fixing iov_iter.c) now does
> exactly what POSIX says should happen. If I have a buffer
> with poison at offset 213, and I do this:
>
> ret = write(fd, buf, 512);
>
> Then the return from write is 213, and the first 213 bytes
> from the buffer appear in the file, and the file size is
> incremented by 213 (assuming the write started with the lseek
> offset at the original size of the file).

... and the user still gets a SIGBUS so that it gets a chance to handle
the encountered poison? I.e., not retry the write for the remaining 512
- 213 bytes?

If so, do we document that somewhere so that application writers can
know what they should do in such cases?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-08-23 15:28:30

by Luck, Tony

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Sun, Aug 22, 2021 at 04:46:14PM +0200, Borislav Petkov wrote:
> On Fri, Aug 20, 2021 at 01:33:56PM -0700, Luck, Tony wrote:
> > The new version (thanks to All fixing iov_iter.c) now does
> > exactly what POSIX says should happen. If I have a buffer
> > with poison at offset 213, and I do this:
> >
> > ret = write(fd, buf, 512);
> >
> > Then the return from write is 213, and the first 213 bytes
> > from the buffer appear in the file, and the file size is
> > incremented by 213 (assuming the write started with the lseek
> > offset at the original size of the file).
>
> ... and the user still gets a SIGBUS so that it gets a chance to handle
> the encountered poison? I.e., not retry the write for the remaining 512
> - 213 bytes?

Whether the user gets a SIGBUS depends on what they do next. In a typical
user loop trying to do a write:

	while (nbytes) {
		ret = write(fd, buf, nbytes);
		if (ret == -1)
			return ret;
		buf += ret;
		nbytes -= ret;
	}

The next iteration after the short write caused by the machine check
will return ret == -1, errno = EFAULT.
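With a stubbed write(2) modelling the poison at offset 213 from the earlier example (sim_write() and write_loop() are made-up names; this is a sketch of the semantics, not real syscall behaviour), the loop above plays out like this:

```c
#include <stddef.h>

#define SIM_EFAULT (-1)	/* write(2) returns -1 with errno == EFAULT */
#define POISON_OFF 213	/* poison offset from the earlier example */

/* Stub write(2): short count up to the poison, then a hard failure. */
static long sim_write(size_t off, size_t nbytes)
{
	if (off >= POISON_OFF)
		return SIM_EFAULT;	/* buffer now starts at the poison */
	if (off + nbytes > POISON_OFF)
		return (long)(POISON_OFF - off);	/* short write */
	return (long)nbytes;
}

/* The retry loop from above: -1 when the retry hits the poison. */
static long write_loop(size_t nbytes)
{
	size_t off = 0;

	while (nbytes) {
		long ret = sim_write(off, nbytes);

		if (ret == SIM_EFAULT)
			return SIM_EFAULT;
		off += (size_t)ret;
		nbytes -= (size_t)ret;
	}
	return (long)off;
}
```

The first iteration succeeds with a short count of 213; the second starts at the poisoned byte and fails, so the application sees a normal failed write.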

Andy Lutomirski convinced me that the kernel should not send a SIGBUS
to an application when the kernel accesses the poison in user memory.

If the user tries to access the page with the poison directly they'll
get a SIGBUS (page was unmapped so user gets a #PF, but the x86 fault
handler sees that the page was unmapped because of poison, so sends a
SIGBUS).

> If so, do we document that somewhere so that application writers can
> know what they should do in such cases?

Applications see a failed write ... they should do whatever they would
normally do for a failed write.

-Tony

2021-09-13 09:27:25

by Borislav Petkov

Subject: Re: [PATCH v2 1/3] x86/mce: Avoid infinite loop for copy from user recovery

On Tue, Aug 17, 2021 at 05:29:40PM -0700, Tony Luck wrote:
> Recovery action when get_user() triggers a machine check uses the fixup
> path to make get_user() return -EFAULT. Also queue_task_work() sets up
> so that kill_me_maybe() will be called on return to user mode to send
> a SIGBUS to the current process.
>
> But there are places in the kernel where the code assumes that this
> EFAULT return was simply because of a page fault. The code takes some
> action to fix that, and then retries the access. This results in a second
> machine check.
>
> While processing this second machine check queue_task_work() is called
> again. But since this uses the same callback_head structure that was used
> in the first call, the net result is an entry on the current->task_works
> list that points to itself. When task_work_run() is called it loops
> forever in this code:
>
> do {
> next = work->next;
> work->func(work);
> work = next;
> cond_resched();
> } while (work);
>
> Add a counter (current->mce_count) to keep track of repeated machine
> checks before task_work() is called. First machine check saves the address
> information and calls task_work_add(). Subsequent machine checks before
> that task_work call back is executed check that the address is in the
> same page as the first machine check (since the callback will offline
> exactly one page).
>
> Expected worst case is two machine checks before moving on (e.g. one user
> access with page faults disabled, then a repeat to the same address with
> page faults enabled). Just in case there is some code that loops forever
> enforce a limit of 10.
>
> Cc: <[email protected]>

What about a Fixes: tag?

I guess backporting this to the respective kernels is predicated upon
the existence of those other "places" in the kernel where code assumes
the EFAULT was because of a #PF.

Hmmm?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-09-14 01:15:10

by Luck, Tony

Subject: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery

There are two cases for machine check recovery:
1) The machine check was triggered by ring3 (application) code.
This is the simpler case. The machine check handler simply queues
work to be executed on return to user. That code unmaps the page
from all users and arranges to send a SIGBUS to the task that
triggered the poison.
2) The machine check was triggered in kernel code that is covered by
an extable entry.
In this case the machine check handler still queues a work entry to
unmap the page, etc. but this will not be called right away because
the #MC handler returns to the fix up code address in the extable
entry.

Problems occur if the kernel triggers another machine check before the
return to user processes the first queued work item.

Specifically the work is queued using the "mce_kill_me" callback
structure in the task struct for the current thread. Attempting to queue
a second work item using this same callback results in a loop in the
linked list of work functions to call. So when the kernel does return to
user it enters an infinite loop processing the same entry for ever.

There are some legitimate scenarios where the kernel may take a second
machine check before returning to the user.

1) Some code (e.g. futex) first tries a get_user() with page faults
disabled. If this fails, the code retries with page faults enabled
expecting that this will resolve the page fault.
2) Copy from user code retries a copy in byte-at-time mode to check
whether any additional bytes can be copied.

On the other side of the fence are some bad drivers that do not check
the return value from individual get_user() calls and may access
multiple user addresses without noticing that some/all calls have
failed.

Fix by adding a counter (current->mce_count) to keep track of repeated
machine checks before task_work() is called. First machine check saves
the address information and calls task_work_add(). Subsequent machine
checks before that task_work call back is executed check that the address
is in the same page as the first machine check (since the callback will
offline exactly one page).

Expected worst case is four machine checks before moving on (e.g. one
user access with page faults disabled, then a repeat to the same address
with page faults enabled ... repeat in copy tail bytes). Just in case
there is some code that loops forever enforce a limit of 10.

Also mark queue_task_work() as "noinstr" (as reported by the kernel test
robot <[email protected]>).

Cc: <[email protected]>
Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
Signed-off-by: Tony Luck <[email protected]>
---

> What about a Fixes: tag?

Added a Fixes tag.

Also added "noinstr" to queue_task_work() per a kernel robot report.

Also re-wrote the commit comment (based on questions raised against v2)

> I guess backporting this to the respective kernels is predicated upon
> the existence of those other "places" in the kernel where code assumes
> the EFAULT was because of a #PF.

Not really. I don't expect to change any kernel code that just bounces
off the same machine check a few times. This patch does work best in
conjunction with patches 2 & 3 (unchanged, not reposted here). But it
will fix some old issues even without those two.

arch/x86/kernel/cpu/mce/core.c | 43 +++++++++++++++++++++++++---------
include/linux/sched.h | 1 +
2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 8cb7816d03b4..9891b4070a61 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1253,6 +1253,9 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin

static void kill_me_now(struct callback_head *ch)
{
+ struct task_struct *p = container_of(ch, struct task_struct, mce_kill_me);
+
+ p->mce_count = 0;
force_sig(SIGBUS);
}

@@ -1262,6 +1265,7 @@ static void kill_me_maybe(struct callback_head *cb)
int flags = MF_ACTION_REQUIRED;
int ret;

+ p->mce_count = 0;
pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);

if (!p->mce_ripv)
@@ -1290,17 +1294,34 @@ static void kill_me_maybe(struct callback_head *cb)
}
}

-static void queue_task_work(struct mce *m, int kill_current_task)
+static noinstr void queue_task_work(struct mce *m, char *msg, int kill_current_task)
{
- current->mce_addr = m->addr;
- current->mce_kflags = m->kflags;
- current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
- current->mce_whole_page = whole_page(m);
+ int count = ++current->mce_count;

- if (kill_current_task)
- current->mce_kill_me.func = kill_me_now;
- else
- current->mce_kill_me.func = kill_me_maybe;
+ /* First call, save all the details */
+ if (count == 1) {
+ current->mce_addr = m->addr;
+ current->mce_kflags = m->kflags;
+ current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+ current->mce_whole_page = whole_page(m);
+
+ if (kill_current_task)
+ current->mce_kill_me.func = kill_me_now;
+ else
+ current->mce_kill_me.func = kill_me_maybe;
+ }
+
+ /* Ten is likely overkill. Don't expect more than two faults before task_work() */
+ if (count > 10)
+ mce_panic("Too many machine checks while accessing user data", m, msg);
+
+ /* Second or later call, make sure page address matches the one from first call */
+ if (count > 1 && (current->mce_addr >> PAGE_SHIFT) != (m->addr >> PAGE_SHIFT))
+ mce_panic("Machine checks to different user pages", m, msg);
+
+ /* Do not call task_work_add() more than once */
+ if (count > 1)
+ return;

task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}
@@ -1438,7 +1459,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
/* If this triggers there is no way to recover. Die hard. */
BUG_ON(!on_thread_stack() || !user_mode(regs));

- queue_task_work(&m, kill_current_task);
+ queue_task_work(&m, msg, kill_current_task);

} else {
/*
@@ -1456,7 +1477,7 @@ noinstr void do_machine_check(struct pt_regs *regs)
}

if (m.kflags & MCE_IN_KERNEL_COPYIN)
- queue_task_work(&m, kill_current_task);
+ queue_task_work(&m, msg, kill_current_task);
}
out:
mce_wrmsrl(MSR_IA32_MCG_STATUS, 0);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e12b524426b0..39039ce8ac4c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1471,6 +1471,7 @@ struct task_struct {
mce_whole_page : 1,
__mce_reserved : 62;
struct callback_head mce_kill_me;
+ int mce_count;
#endif

#ifdef CONFIG_KRETPROBES
--
2.31.1

2021-09-14 08:30:03

by Borislav Petkov

Subject: Re: [PATCH v3] x86/mce: Avoid infinite loop for copy from user recovery

On Mon, Sep 13, 2021 at 02:52:39PM -0700, Luck, Tony wrote:
> Also mark queue_task_work() as "noinstr" (as reported kernel test robot
> <[email protected]>)

Yeah, that's not enough - I have a patchset in the works for all this so
I'm going to drop your annotation.

> Cc: <[email protected]>
> Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work")

Ah ok, that one makes sense.

> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> > What about a Fixes: tag?
>
> Added a Fixes tag.
>
> Also added "noinstr" to queue_task_work() per a kernel robot report.
>
> Also re-wrote the commit comment (based on questions raised against v2)

Thanks - very much appreciated and it reads really good!

> > I guess backporting this to the respective kernels is predicated upon
> > the existence of those other "places" in the kernel where code assumes
> > the EFAULT was because of a #PF.
>
> Not really. I don't expect to change any kernel code that just bounces
> off the same machine check a few times. This patch does work best in
> conjunction with patches 2 & 3 (unchanged, not reposted here). But it
> will fix some old issues even without those two.

Ok, got it.

/me queues.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette