When one page is already hwpoisoned by AO action, process may not be
killed, the process mapping this page may make a syscall include this
page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
mode it may be fixed by fixup_exception. Current code will just return
error code to user process.
This is not sufficient, we should send a SIGBUS to the process and log
the info to console, as we can't trust the process will handle the error
correctly.
Suggested-by: Feng Yang <[email protected]>
Signed-off-by: Aili Yao <[email protected]>
---
arch/x86/mm/fault.c | 34 +++++++++++++++++++++++++++-------
1 file changed, 27 insertions(+), 7 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..23095b94cf42 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -631,7 +631,7 @@ static void set_signal_archinfo(unsigned long address,
static noinline void
no_context(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, int signal, int si_code)
+ unsigned long address, int signal, int si_code, vm_fault_t fault)
{
struct task_struct *tsk = current;
unsigned long flags;
@@ -662,12 +662,32 @@ no_context(struct pt_regs *regs, unsigned long error_code,
* In this case we need to make sure we're not recursively
* faulting through the emulate_vsyscall() logic.
*/
+
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
+ fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
+ unsigned int lsb = 0;
+
+ pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+ current->comm, current->pid, address);
+
+ sanitize_error_code(address, &error_code);
+ set_signal_archinfo(address, error_code);
+
+ if (fault & VM_FAULT_HWPOISON_LARGE)
+ lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
+ if (fault & VM_FAULT_HWPOISON)
+ lsb = PAGE_SHIFT;
+
+ force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+
+ return;
+ }
+
if (current->thread.sig_on_uaccess_err && signal) {
sanitize_error_code(address, &error_code);
set_signal_archinfo(address, error_code);
- /* XXX: hwpoison faults will set the wrong code. */
force_sig_fault(signal, si_code, (void __user *)address);
}
@@ -836,7 +856,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
if (is_f00f_bug(regs, address))
return;
- no_context(regs, error_code, address, SIGSEGV, si_code);
+ no_context(regs, error_code, address, SIGSEGV, si_code, 0);
}
static noinline void
@@ -927,7 +947,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
{
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
- no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+ no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, fault);
return;
}
@@ -966,7 +986,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, vm_fault_t fault)
{
if (fatal_signal_pending(current) && !(error_code & X86_PF_USER)) {
- no_context(regs, error_code, address, 0, 0);
+ no_context(regs, error_code, address, 0, 0, 0);
return;
}
@@ -974,7 +994,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
no_context(regs, error_code, address,
- SIGSEGV, SEGV_MAPERR);
+ SIGSEGV, SEGV_MAPERR, 0);
return;
}
@@ -1396,7 +1416,7 @@ void do_user_addr_fault(struct pt_regs *regs,
if (fault_signal_pending(fault, regs)) {
if (!user_mode(regs))
no_context(regs, hw_error_code, address, SIGBUS,
- BUS_ADRERR);
+ BUS_ADRERR, 0);
return;
}
--
2.25.1
On Mon, Feb 1, 2021 at 12:17 AM Aili Yao <[email protected]> wrote:
>
> When one page is already hwpoisoned by AO action, process may not be
> killed, the process mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> mode it may be fixed by fixup_exception. Current code will just return
> error code to user process.
>
> This is not sufficient, we should send a SIGBUS to the process and log
> the info to console, as we can't trust the process will handle the error
> correctly.
Does this happen when one process gets SIGBUSed due to memory failure
and another process independently hits the poisoned memory? I'm not
entirely convinced that this is a problem.
In any case, this patch needs rebasing on top of my big fault series
-- as it stands, it's way too difficult to keep track of which paths
even call your new code.. And the various signal paths need to be
consolidated -- we already have three of them, and the last thing we
need is a fourth.
> In any case, this patch needs rebasing on top of my big fault series
Is that series in some GIT tree? Or can you give a lore.kernel.org link?
Thanks
-Tony
On Mon, Feb 1, 2021 at 1:10 PM Luck, Tony <[email protected]> wrote:
>
> > In any case, this patch needs rebasing on top of my big fault series
>
> Is that series in some GIT tree? Or can you give a lore.kernel.org link?
https://lore.kernel.org/lkml/[email protected]/T/#t
>
> Thanks
>
> -Tony
On Mon, 1 Feb 2021 08:58:27 -0800
Andy Lutomirski <[email protected]> wrote:
> On Mon, Feb 1, 2021 at 12:17 AM Aili Yao <[email protected]> wrote:
> >
> > When one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> > mode it may be fixed by fixup_exception. Current code will just return
> > error code to user process.
> >
> > This is not sufficient, we should send a SIGBUS to the process and log
> > the info to console, as we can't trust the process will handle the error
> > correctly.
>
> Does this happen when one process gets SIGBUSed due to memory failure
> and another process independently hits the poisoned memory? I'm not
> entirely convinced that this is a problem.
>
OK, I will explain more, hope this will be helpful:
One page get poisoned which can be caused by at least two scenarios:
1. One user process access a address which corrupted, the memory failure() will
be called, the function will unmap the page which contain the corrupt memory cell, the process
triggering the error will get signaled with SIGBUS. Other process sharing this page
will get its related pte marked with SWP_HWPOISON, and in early-kill case, these other processes
will also be signaled with SIGBUS, In later-kill case, It should be signaled when it touch the page
which has been poisoned.
2.A patrol scrub UE error will also trigger the same process, page unmapped, pte being marked with
SWP_HWPOISON. In later-kill case, the process which touch the poisoned page will trigger a page fault
and should be signaled with SIGBUS.
In this later-kill case, normally it will hit the following code:
965 #ifdef CONFIG_MEMORY_FAILURE
966 if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
967 struct task_struct *tsk = current;
968 unsigned lsb = 0;
969
970 pr_err(
971 "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
972 tsk->comm, tsk->pid, address);
973 if (fault & VM_FAULT_HWPOISON_LARGE)
974 lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
975 if (fault & VM_FAULT_HWPOISON)
976 lsb = PAGE_SHIFT;
977 force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
978 return;
979 }
Or there is a case that user process make a syscall including the posioned user address(assume to be ADD_A)
and make a request to copy the ADD_A content to kernel space. In such a case, it will trigger a page fault when
copying starts. As it's in kernel mode and the address is in user space, the process will hit:
944 static void
945 do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
946 vm_fault_t fault)
947 {
948 /* Kernel mode? Handle exceptions or die: */
949 if (!(error_code & X86_PF_USER)) {
950 no_context(regs, error_code, address, SIGBUS, BUS_ADRERR, fault);
951 return;
952 }
In no_context(), fixup_exception() will be called, usually the copy function in such a case will provide
one fixup function, which will return true, then no_context() return, finally the syscall will return one ERROR
code(errno=14 Bad address) to user process, which the user process won't know the where the real problem is.
From syslog, we can't guarantee memory error recovey log and the user process error will have a close correlation in
timestamp.
Previous behavior is not only for latest upstream code, but also apply to older kernel versions.
This patch is to correct this behavior by Sending SIGBUS to user process in such a scenario. This behavior
in this patch is consistent with other memory poison case.
Following is the test result:
Current 5.11 rc6:
./einj_mem_uc -S -c 1 -f copyin
0: copyin vaddr = 0x7fed9808e400 paddr = 1b00c00400
./einj_mem_uc: couldn't write temp file (errno=14)
Expected SIGBUS, didn't get one
page not present
Big surprise ... still running. Thought that would be fatal
Test passed
Current 5.11 rc6 with this patch:
./einj_mem_uc -S -c 1 -f copyin
0: copyin vaddr = 0x7fef60e3d400 paddr = 14b7f43400
SIGBUS: addr = 0x7fef60e3d400
page not present
Big surprise ... still running. Thought that would be fatal
Test passed
And there is a small modification in the einj_mem_uc.c I missed in previous mail, which will lead the
test result unexpected.
306 int trigger_copyin(char *addr)
307 {
316 if ((ret = write(copyin_fd, addr - memcpy_runup, memcpy_size)) != memcpy_size) {
324 }
> In any case, this patch needs rebasing on top of my big fault series
> -- as it stands, it's way too difficult to keep track of which paths
> even call your new code.. And the various signal paths need to be
> consolidated -- we already have three of them, and the last thing we
> need is a fourth.
If you recognize the point listed, I will rebase the patch to your latest code.
Thanks
--
Best Regards!
Aili Yao
Hi Aili,
On Mon, Feb 01, 2021 at 04:17:49PM +0800, Aili Yao wrote:
> When one page is already hwpoisoned by AO action, process may not be
> killed, the process mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> mode it may be fixed by fixup_exception. Current code will just return
> error code to user process.
>
> This is not sufficient, we should send a SIGBUS to the process and log
> the info to console, as we can't trust the process will handle the error
> correctly.
>
> Suggested-by: Feng Yang <[email protected]>
> Signed-off-by: Aili Yao <[email protected]>
> ---
...
> @@ -662,12 +662,32 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> * In this case we need to make sure we're not recursively
> * faulting through the emulate_vsyscall() logic.
> */
> +
> + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
> + fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
> + unsigned int lsb = 0;
> +
> + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> + current->comm, current->pid, address);
> +
> + sanitize_error_code(address, &error_code);
> + set_signal_archinfo(address, error_code);
> +
> + if (fault & VM_FAULT_HWPOISON_LARGE)
> + lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
> + if (fault & VM_FAULT_HWPOISON)
> + lsb = PAGE_SHIFT;
> +
> + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
This part contains some duplicated code with do_sigbus(), so some refactoring (like
adding a common function) would be more helpful.
Thanks,
Naoya Horiguchi
On Thu, 4 Feb 2021 07:25:55 +0000
HORIGUCHI NAOYA(堀口 直也) <[email protected]> wrote:
> Hi Aili,
>
> On Mon, Feb 01, 2021 at 04:17:49PM +0800, Aili Yao wrote:
> > When one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, if it's in kernel
> > mode it may be fixed by fixup_exception. Current code will just return
> > error code to user process.
> >
> > This is not sufficient, we should send a SIGBUS to the process and log
> > the info to console, as we can't trust the process will handle the error
> > correctly.
> >
> > Suggested-by: Feng Yang <[email protected]>
> > Signed-off-by: Aili Yao <[email protected]>
> > ---
> ...
>
> > @@ -662,12 +662,32 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> > * In this case we need to make sure we're not recursively
> > * faulting through the emulate_vsyscall() logic.
> > */
> > +
> > + if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
> > + fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
> > + unsigned int lsb = 0;
> > +
> > + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> > + current->comm, current->pid, address);
> > +
> > + sanitize_error_code(address, &error_code);
> > + set_signal_archinfo(address, error_code);
> > +
> > + if (fault & VM_FAULT_HWPOISON_LARGE)
> > + lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
> > + if (fault & VM_FAULT_HWPOISON)
> > + lsb = PAGE_SHIFT;
> > +
> > + force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
>
> This part contains some duplicated code with do_sigbus(), so some refactoring (like
> adding a common function) would be more helpful.
Yes, agree, I will modify this and rebase to the big fault series from tip.
Thanks
Aili Yao
When one page is already hwpoisoned by MCE AO action, processes may not
be killed, processes mapping this page may make a syscall include this
page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
mode it may be fixed by fixup_exception, current code will just return
error code to user code.
This is not sufficient, we should send a SIGBUS to the process and log
the info to console, as we can't trust the process will handle the error
correctly.
Suggested-by: Feng Yang <[email protected]>
Signed-off-by: Aili Yao <[email protected]>
---
arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
1 file changed, 40 insertions(+), 22 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 08f5f74cf989..62df798abb56 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -617,6 +617,30 @@ static void set_signal_archinfo(unsigned long address,
tsk->thread.cr2 = address;
}
+static int
+do_sigbus_mceerr(unsigned long error_code, unsigned long address, vm_fault_t fault, int prepared)
+{
+ struct task_struct *tsk = current;
+ unsigned int lsb = 0;
+
+ if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
+ pr_err(
+ "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+ tsk->comm, tsk->pid, address);
+ if (fault & VM_FAULT_HWPOISON_LARGE)
+ lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
+ if (fault & VM_FAULT_HWPOISON)
+ lsb = PAGE_SHIFT;
+ if (!prepared) {
+ sanitize_error_code(address, &error_code);
+ set_signal_archinfo(address, error_code);
+ }
+ force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+ return 1;
+ }
+ return 0;
+}
+
static noinline void
page_fault_oops(struct pt_regs *regs, unsigned long error_code,
unsigned long address)
@@ -694,7 +718,7 @@ page_fault_oops(struct pt_regs *regs, unsigned long error_code,
static noinline void
kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
- unsigned long address, int signal, int si_code)
+ unsigned long address, int signal, int si_code, vm_fault_t fault)
{
WARN_ON_ONCE(user_mode(regs));
@@ -714,12 +738,17 @@ kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
* In this case we need to make sure we're not recursively
* faulting through the emulate_vsyscall() logic.
*/
+
+ /* Sending MCERR Sigbus for page fault error from hwpoison */
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
+ && do_sigbus_mceerr(error_code, address, fault, 0))
+ return;
+
if (current->thread.sig_on_uaccess_err && signal) {
sanitize_error_code(address, &error_code);
set_signal_archinfo(address, error_code);
- /* XXX: hwpoison faults will set the wrong code. */
force_sig_fault(signal, si_code, (void __user *)address);
}
@@ -782,7 +811,7 @@ __bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
struct task_struct *tsk = current;
if (!user_mode(regs)) {
- kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code);
+ kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code, 0);
return;
}
@@ -914,7 +943,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
{
/* Kernel mode? Handle exceptions or die: */
if (!user_mode(regs)) {
- kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, BUS_ADRERR);
+ kernelmode_fixup_or_oops(regs, error_code, address, SIGBUS, BUS_ADRERR, fault);
return;
}
@@ -929,22 +958,11 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
set_signal_archinfo(address, error_code);
-#ifdef CONFIG_MEMORY_FAILURE
- if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
- struct task_struct *tsk = current;
- unsigned lsb = 0;
-
- pr_err(
- "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
- tsk->comm, tsk->pid, address);
- if (fault & VM_FAULT_HWPOISON_LARGE)
- lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
- if (fault & VM_FAULT_HWPOISON)
- lsb = PAGE_SHIFT;
- force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
+ /* Sending MCERR Sigbus for page fault error from hwpoison */
+ if (IS_ENABLED(CONFIG_MEMORY_FAILURE)
+ && do_sigbus_mceerr(error_code, address, fault, 1))
return;
- }
-#endif
+
force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
}
@@ -1380,7 +1398,7 @@ void do_user_addr_fault(struct pt_regs *regs,
*/
if (!user_mode(regs))
kernelmode_fixup_or_oops(regs, error_code, address,
- SIGBUS, BUS_ADRERR);
+ SIGBUS, BUS_ADRERR, 0);
return;
}
@@ -1400,7 +1418,7 @@ void do_user_addr_fault(struct pt_regs *regs,
return;
if (fatal_signal_pending(current) && !user_mode(regs)) {
- kernelmode_fixup_or_oops(regs, error_code, address, 0, 0);
+ kernelmode_fixup_or_oops(regs, error_code, address, 0, 0, 0);
return;
}
@@ -1408,7 +1426,7 @@ void do_user_addr_fault(struct pt_regs *regs,
/* Kernel mode? Handle exceptions or die: */
if (!user_mode(regs)) {
kernelmode_fixup_or_oops(regs, error_code, address,
- SIGSEGV, SEGV_MAPERR);
+ SIGSEGV, SEGV_MAPERR, 0);
return;
}
base-commit: 167104452673f6c1b0e7bdbdf6a4b25f862e2319
prerequisite-patch-id: 6d1e0f23558f83f851ae883633b5c498f0c826c6
prerequisite-patch-id: 2dd1d48b132e6fa0c7b564dbbd9af2e667ebc997
prerequisite-patch-id: 4fa3543c8706f54bfc0b12b787222c65ff701c10
prerequisite-patch-id: 7465d0fe35a243e23f21a60b3595c09f4a95bdc2
prerequisite-patch-id: 0e7f44bbfa3911af10aa208dee5a8f7fb33f54b7
prerequisite-patch-id: 19bdc10dd055ae8761a8a87f6912f84b4bda68be
prerequisite-patch-id: df466945a2b3150abca123da0467b52220c3cc19
prerequisite-patch-id: 6678a8efeced8907cbdf167221b7d8c0deaa6c44
prerequisite-patch-id: d2270eca0467bbd5e8575fda9f18ddf7943f5562
prerequisite-patch-id: 97ab827e60b94faa6e4fc9f955aa88c4f51ceb5a
--
2.25.1
On Fri, 5 Feb 2021 17:01:35 +0800
Aili Yao <[email protected]> wrote:
> When one page is already hwpoisoned by MCE AO action, processes may not
> be killed, processes mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> mode it may be fixed by fixup_exception, current code will just return
> error code to user code.
>
> This is not sufficient, we should send a SIGBUS to the process and log
> the info to console, as we can't trust the process will handle the error
> correctly.
>
> Suggested-by: Feng Yang <[email protected]>
> Signed-off-by: Aili Yao <[email protected]>
> ---
> arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
> 1 file changed, 40 insertions(+), 22 deletions(-)
>
Hi luto;
Is there any feedback?
Thanks
Aili Yao
> On Feb 23, 2021, at 4:44 AM, Aili Yao <[email protected]> wrote:
>
> On Fri, 5 Feb 2021 17:01:35 +0800
> Aili Yao <[email protected]> wrote:
>
>> When one page is already hwpoisoned by MCE AO action, processes may not
>> be killed, processes mapping this page may make a syscall include this
>> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
>> mode it may be fixed by fixup_exception, current code will just return
>> error code to user code.
>>
>> This is not sufficient, we should send a SIGBUS to the process and log
>> the info to console, as we can't trust the process will handle the error
>> correctly.
>>
>> Suggested-by: Feng Yang <[email protected]>
>> Signed-off-by: Aili Yao <[email protected]>
>> ---
>> arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
>> 1 file changed, 40 insertions(+), 22 deletions(-)
>>
> Hi luto;
> Is there any feedback?
At the very least, this needs a clear explanation of why your proposed behavior is better than the existing behavior.
>
> Thanks
> Aili Yao
On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
>
> > On Feb 23, 2021, at 4:44 AM, Aili Yao <[email protected]> wrote:
> >
> > On Fri, 5 Feb 2021 17:01:35 +0800
> > Aili Yao <[email protected]> wrote:
> >
> >> When one page is already hwpoisoned by MCE AO action, processes may not
> >> be killed, processes mapping this page may make a syscall include this
> >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> >> mode it may be fixed by fixup_exception, current code will just return
> >> error code to user code.
> >>
> >> This is not sufficient, we should send a SIGBUS to the process and log
> >> the info to console, as we can't trust the process will handle the error
> >> correctly.
> >>
> >> Suggested-by: Feng Yang <[email protected]>
> >> Signed-off-by: Aili Yao <[email protected]>
> >> ---
> >> arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
> >> 1 file changed, 40 insertions(+), 22 deletions(-)
> >>
> > Hi luto;
> > Is there any feedback?
>
> At the very least, this needs a clear explanation of why your proposed behavior is better than the existing behavior.
The explanation is buried in that "can't trust the process" line.
E.g. user space isn't good about checking for failed write(2) syscalls.
So if the poison was in a user buffer passed to write(fd, buffer, count)
sending a SIGBUS would be the action if they read the poison directly,
so it seems reasonable to send the same signal if the kernel read their
poison for them.
It would avoid users that didn't check the return value merrily proceeding
as if everything was ok.
-Tony
On Tue, 23 Feb 2021 08:42:59 -0800
"Luck, Tony" <[email protected]> wrote:
> On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
> >
> > > On Feb 23, 2021, at 4:44 AM, Aili Yao <[email protected]> wrote:
> > >
> > > On Fri, 5 Feb 2021 17:01:35 +0800
> > > Aili Yao <[email protected]> wrote:
> > >
> > >> When one page is already hwpoisoned by MCE AO action, processes may not
> > >> be killed, processes mapping this page may make a syscall include this
> > >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> > >> mode it may be fixed by fixup_exception, current code will just return
> > >> error code to user code.
> > >>
> > >> This is not sufficient, we should send a SIGBUS to the process and log
> > >> the info to console, as we can't trust the process will handle the error
> > >> correctly.
> > >>
> > >> Suggested-by: Feng Yang <[email protected]>
> > >> Signed-off-by: Aili Yao <[email protected]>
> > >> ---
> > >> arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
> > >> 1 file changed, 40 insertions(+), 22 deletions(-)
> > >>
> > > Hi luto;
> > > Is there any feedback?
> >
> > At the very least, this needs a clear explanation of why your proposed behavior is better than the existing behavior.
>
> The explanation is buried in that "can't trust the process" line.
>
> E.g. user space isn't good about checking for failed write(2) syscalls.
> So if the poison was in a user buffer passed to write(fd, buffer, count)
> sending a SIGBUS would be the action if they read the poison directly,
> so it seems reasonable to send the same signal if the kernel read their
> poison for them.
>
> It would avoid users that didn't check the return value merrily proceeding
> as if everything was ok.
Hi luto:
I will add more infomation:
Even if the process will check return value of syscall like write, I don't think
process will take proper action for this.
In test example, the return value will be errno is 14 (Bad Address), the process may not realize
this is a hw issue, and may take wrong action not as expected.
And totally, A hw error will rarely happen, and the hw error hitting this branch will be
more unlikely, the impaction without this patch is quite minor, but this is still not good enough, we should
make it better, right?
Thanks
Aili Yao
On Wed, Feb 24, 2021 at 8:47 PM Aili Yao <[email protected]> wrote:
>
> On Tue, 23 Feb 2021 08:42:59 -0800
> "Luck, Tony" <[email protected]> wrote:
>
> > On Tue, Feb 23, 2021 at 07:33:46AM -0800, Andy Lutomirski wrote:
> > >
> > > > On Feb 23, 2021, at 4:44 AM, Aili Yao <[email protected]> wrote:
> > > >
> > > > On Fri, 5 Feb 2021 17:01:35 +0800
> > > > Aili Yao <[email protected]> wrote:
> > > >
> > > >> When one page is already hwpoisoned by MCE AO action, processes may not
> > > >> be killed, processes mapping this page may make a syscall include this
> > > >> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> > > >> mode it may be fixed by fixup_exception, current code will just return
> > > >> error code to user code.
> > > >>
> > > >> This is not sufficient, we should send a SIGBUS to the process and log
> > > >> the info to console, as we can't trust the process will handle the error
> > > >> correctly.
> > > >>
> > > >> Suggested-by: Feng Yang <[email protected]>
> > > >> Signed-off-by: Aili Yao <[email protected]>
> > > >> ---
> > > >> arch/x86/mm/fault.c | 62 +++++++++++++++++++++++++++++----------------
> > > >> 1 file changed, 40 insertions(+), 22 deletions(-)
> > > >>
> > > > Hi luto;
> > > > Is there any feedback?
> > >
> > > At the very least, this needs a clear explanation of why your proposed behavior is better than the existing behavior.
> >
> > The explanation is buried in that "can't trust the process" line.
> >
> > E.g. user space isn't good about checking for failed write(2) syscalls.
> > So if the poison was in a user buffer passed to write(fd, buffer, count)
> > sending a SIGBUS would be the action if they read the poison directly,
> > so it seems reasonable to send the same signal if the kernel read their
> > poison for them.
> >
> > It would avoid users that didn't check the return value merrily proceeding
> > as if everything was ok.
>
> Hi luto:
> I will add more infomation:
> Even if the process will check return value of syscall like write, I don't think
> process will take proper action for this.
> In test example, the return value will be errno is 14 (Bad Address), the process may not realize
> this is a hw issue, and may take wrong action not as expected.
> And totally, A hw error will rarely happen, and the hw error hitting this branch will be
> more unlikely, the impaction without this patch is quite minor, but this is still not good enough, we should
> make it better, right?
There are a few issues I can imagine:
Some programs may use read(2), write(2), etc as ways to check if
memory is valid without getting a signal. They might not want
signals, which means that this feature might need to be configurable.
It's worth making sure that this doesn't end up sending duplicate
signals. If nothing else, this would impact the vsyscall emulation
code.
Programs that get a signal might expect that the RIP that the signal
frame points to is the instruction that caused the signal and that the
instruction faulted without side effects. For SIGSEGV, I would be
especially nervous about this. Maybe SIGBUS is safer. For SIGSEGV,
it's entirely valid to look at CR2 / si_fault_addr, fix it up, and
return. This would be completely *invalid* with your patch. I'm not
sure what to do about this.
--Andy
Hi Luto:
> > > > At the very least, this needs a clear explanation of why your proposed behavior is better than the existing behavior.
> > >
> > > The explanation is buried in that "can't trust the process" line.
> > >
> > > E.g. user space isn't good about checking for failed write(2) syscalls.
> > > So if the poison was in a user buffer passed to write(fd, buffer, count)
> > > sending a SIGBUS would be the action if they read the poison directly,
> > > so it seems reasonable to send the same signal if the kernel read their
> > > poison for them.
> > >
> > > It would avoid users that didn't check the return value merrily proceeding
> > > as if everything was ok.
> >
> > Hi luto:
> > I will add more infomation:
> > Even if the process will check return value of syscall like write, I don't think
> > process will take proper action for this.
> > In test example, the return value will be errno is 14 (Bad Address), the process may not realize
> > this is a hw issue, and may take wrong action not as expected.
> > And totally, A hw error will rarely happen, and the hw error hitting this branch will be
> > more unlikely, the impaction without this patch is quite minor, but this is still not good enough, we should
> > make it better, right?
>
> There are a few issues I can imagine:
>
> Some programs may use read(2), write(2), etc as ways to check if
> memory is valid without getting a signal. They might not want
> signals, which means that this feature might need to be configurable.
I checked the code again and found that: For poison page access, the process may not ignore the SIGBUS signal even if it was set to
1298 /*
1299 * Force a signal that the process can't ignore: if necessary
1300 * we unblock the signal and change any SIG_IGN to SIG_DFL.
1301 *
1302 * Note: If we unblock the signal, we always reset it to SIG_DFL,
1303 * since we do not want to have a signal handler that was blocked
1304 * be invoked when user space had explicitly blocked it.
1305 *
1306 * We don't want to have recursive SIGSEGV's etc, for example,
1307 * that is why we also clear SIGNAL_UNKILLABLE.
1308 */
1309 static int
1310 force_sig_info_to_task(struct kernel_siginfo *info, struct task_struct *t)
> It's worth making sure that this doesn't end up sending duplicate
> signals. If nothing else, this would impact the vsyscall emulation
> code.
I am not totally get the "duplicate signals" meaning , SIGBUS is a fatal signal and if it was
processed, the process should exit and another same signal will not be processed i think. Or if
the process capture the signal and not to exit, duplicate SIGBUS signal seems not a problem if that happens
For vsyscall emulation:
I do check the related code, and this may be a read operation like instruction fetch for the issue, it will
not hit the modified branch but go to emulation code, it seems we can't differentiate between a vsyscall emulation page fault
and a hwposion page fault, for current code it may access the invalid page again and lead to a panic. This patch will not
cover this scenario.
> Programs that get a signal might expect that the RIP that the signal
> frame points to is the instruction that caused the signal and that the
> instruction faulted without side effects. For SIGSEGV, I would be
> especially nervous about this. Maybe SIGBUS is safer. For SIGSEGV,
> it's entirely valid to look at CR2 / si_fault_addr, fix it up, and
> return. This would be completely *invalid* with your patch. I'm not
> sure what to do about this.
Do you mean the patch will replace the SIGSEGV with SIGBUS for hwposion case? I think SIGBUS is more accurate for the error.
Normally for poison access, the process shouldn't be returned and an exit will be good or we need another code stream for this I think.
This is the legacy way to process user poison access error like other posion code branch in kernel.
Thanks!
Aili Yao
> Some programs may use read(2), write(2), etc as ways to check if
> memory is valid without getting a signal. They might not want
> signals, which means that this feature might need to be configurable.
That sounds like an appalling hack. If users need such a mechanism
we should create some better way to do that.
An aeon ago ACPI created the RASF table as a way for the OS to
ask the platform to scan a block of physical memory using the patrol
scrubber in the memory controller. I never did anything with it in Linux
because it was just too complex and didn't know of any use cases.
Users would want to check virtual addresses. Perhaps some new
option MADV_CHECKFORPOISON to madvise(2) ?
-Tony
> Programs that get a signal might expect that the RIP that the signal
> frame points to is the instruction that caused the signal and that the
> instruction faulted without side effects. For SIGSEGV, I would be
> especially nervous about this. Maybe SIGBUS is safer. For SIGSEGV,
> it's entirely valid to look at CR2 / si_fault_addr, fix it up, and
> return. This would be completely *invalid* with your patch. I'm not
> sure what to do about this.
The original plan was that s/w like databases would be able to write
their own application specific recovery code. E.g. they hit poison while
reading some "table". The process gets a SIGBUS with siginfo telling
the handler the virtual address range that has been lost. The code
uses mmap(MAP_FIXED) to map a new page into the lost address and
fills it with suitable data (either reconstructing lost data by replaying
transactions, or filling the table with some "data unknown" indicator).
Then the SIGBUS handler returns to re-execute the instruction that
failed.
As far as I know nobody has been that creative in production s/w.
But I think there are folks with a siglongjmp() to a "this whole transaction
just failed" safe point.
-Tony
> On Mar 1, 2021, at 11:02 AM, Luck, Tony <[email protected]> wrote:
>
>
>>
>> Some programs may use read(2), write(2), etc as ways to check if
>> memory is valid without getting a signal. They might not want
>> signals, which means that this feature might need to be configurable.
>
> That sounds like an appalling hack. If users need such a mechanism
> we should create some better way to do that.
>
Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
> An aeon ago ACPI created the RASF table as a way for the OS to
> ask the platform to scan a block of physical memory using the patrol
> scrubber in the memory controller. I never did anything with it in Linux
> because it was just too complex and didn't know of any use cases.
>
> Users would want to check virtual addresses. Perhaps some new
> option MADV_CHECKFORPOISON to madvise(2) ?
>
> -Tony
On Wed, 3 Mar 2021 20:24:02 +0800
Aili Yao <[email protected]> wrote:
> On Mon, 1 Mar 2021 11:09:36 -0800
> Andy Lutomirski <[email protected]> wrote:
>
> > > On Mar 1, 2021, at 11:02 AM, Luck, Tony <[email protected]> wrote:
> > >
> > >
> > >>
> > >> Some programs may use read(2), write(2), etc as ways to check if
> > >> memory is valid without getting a signal. They might not want
> > >> signals, which means that this feature might need to be configurable.
> > >
> > > That sounds like an appalling hack. If users need such a mechanism
> > > we should create some better way to do that.
> > >
> >
> > Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
>
> Sorry for my previous mis-understanding, and i have some questions:
> if programs use read,write to check if if memory is valid, does it really want to cover the poison case?
> When for such a case, an error is returned, can the program realize it's hwposion issue not other software error and process correctly?
>
> if this is the proper action, the original posion flow in current code from read and write need to change too.
>
Sorry, another question:
When programs use read(2), write(2) as ways to check if memory is valid, does it really want to check if the user page the program provided is valid, not the destination or disk space valid?
the patch will not affect this purpose as it's only valid for user page which program provide to write or some syscall similiar parameter.
--
Thanks!
Aili Yao
On Mon, 1 Mar 2021 11:09:36 -0800
Andy Lutomirski <[email protected]> wrote:
> > On Mar 1, 2021, at 11:02 AM, Luck, Tony <[email protected]> wrote:
> >
> >
> >>
> >> Some programs may use read(2), write(2), etc as ways to check if
> >> memory is valid without getting a signal. They might not want
> >> signals, which means that this feature might need to be configurable.
> >
> > That sounds like an appalling hack. If users need such a mechanism
> > we should create some better way to do that.
> >
>
> Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
Sorry for my previous mis-understanding, and i have some questions:
if programs use read,write to check if if memory is valid, does it really want to cover the poison case?
When for such a case, an error is returned, can the program realize it's hwposion issue not other software error and process correctly?
if this is the proper action, the original posion flow in current code from read and write need to change too.
--
Thanks!
Aili Yao
On Wed, Mar 3, 2021 at 4:51 AM Aili Yao <[email protected]> wrote:
>
> On Wed, 3 Mar 2021 20:24:02 +0800
> Aili Yao <[email protected]> wrote:
>
> > On Mon, 1 Mar 2021 11:09:36 -0800
> > Andy Lutomirski <[email protected]> wrote:
> >
> > > > On Mar 1, 2021, at 11:02 AM, Luck, Tony <[email protected]> wrote:
> > > >
> > > >
> > > >>
> > > >> Some programs may use read(2), write(2), etc as ways to check if
> > > >> memory is valid without getting a signal. They might not want
> > > >> signals, which means that this feature might need to be configurable.
> > > >
> > > > That sounds like an appalling hack. If users need such a mechanism
> > > > we should create some better way to do that.
> > > >
> > >
> > > Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
> >
> > Sorry for my previous mis-understanding, and i have some questions:
> > if programs use read,write to check if if memory is valid, does it really want to cover the poison case?
I don't know.
> > When for such a case, an error is returned, can the program realize it's hwposion issue not other software error and process correctly?
Again, I don't know. But changing the API like this seems potentialy
dangerous and needs to be done with care.
> >
> > if this is the proper action, the original posion flow in current code from read and write need to change too.
> >
>
> Sorry, another question:
> When programs use read(2), write(2) as ways to check if memory is valid, does it really want to check if the user page the program provided is valid, not the destination or disk space valid?
They may well be trying to see if their memory is valid.
> the patch will not affect this purpose as it's only valid for user page which program provide to write or some syscall similiar parameter
>
> --
> Thanks!
> Aili Yao
--
Andy Lutomirski
AMA Capital Management, LLC
On Sun, 7 Mar 2021 11:16:24 -0800
Andy Lutomirski <[email protected]> wrote:
> > > > >> Some programs may use read(2), write(2), etc as ways to check if
> > > > >> memory is valid without getting a signal. They might not want
> > > > >> signals, which means that this feature might need to be configurable.
> > > > >
> > > > > That sounds like an appalling hack. If users need such a mechanism
> > > > > we should create some better way to do that.
> > > > >
> > > >
> > > > Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
> > >
> > > Sorry for my previous mis-understanding, and i have some questions:
> > > if programs use read,write to check if if memory is valid, does it really want to cover the poison case?
>
> I don't know.
>
> > > When for such a case, an error is returned, can the program realize it's hwposion issue not other software error and process correctly?
>
> Again, I don't know. But changing the API like this seems potentialy
> dangerous and needs to be done with care.
>
> > >
> > > if this is the proper action, the original posion flow in current code from read and write need to change too.
> > >
> >
> > Sorry, another question:
> > When programs use read(2), write(2) as ways to check if memory is valid, does it really want to check if the user page the program provided is valid, not the destination or disk space valid?
>
> They may well be trying to see if their memory is valid.
Thanks for your reply, and I don't know what to do.
For current code, if user program write to a block device(maybe a test try) and if its user copy page corrupt when in kernel copy, the process is killed with a SIGBUS.
And for the page fault case in this thread, the process is error returned.
--
Thanks!
Aili Yao
> On Mar 8, 2021, at 1:49 AM, Aili Yao <[email protected]> wrote:
>
> On Sun, 7 Mar 2021 11:16:24 -0800
> Andy Lutomirski <[email protected]> wrote:
>
>>>>>>> Some programs may use read(2), write(2), etc as ways to check if
>>>>>>> memory is valid without getting a signal. They might not want
>>>>>>> signals, which means that this feature might need to be configurable.
>>>>>>
>>>>>> That sounds like an appalling hack. If users need such a mechanism
>>>>>> we should create some better way to do that.
>>>>>>
>>>>>
>>>>> Appalling hack or not, it works. So, if we’re going to send a signal to user code that looks like it originated from a bina fide architectural recoverable fault, it needs to be recoverable. A load from a failed NVDIMM page is such a fault. A *kernel* load is not. So we need to distinguish it somehow.
>>>>
>>>> Sorry for my previous mis-understanding, and i have some questions:
>>>> if programs use read,write to check if if memory is valid, does it really want to cover the poison case?
>>
>> I don't know.
>>
>>>> When for such a case, an error is returned, can the program realize it's hwposion issue not other software error and process correctly?
>>
>> Again, I don't know. But changing the API like this seems potentialy
>> dangerous and needs to be done with care.
>>
>>>>
>>>> if this is the proper action, the original posion flow in current code from read and write need to change too.
>>>>
>>>
>>> Sorry, another question:
>>> When programs use read(2), write(2) as ways to check if memory is valid, does it really want to check if the user page the program provided is valid, not the destination or disk space valid?
>>
>> They may well be trying to see if their memory is valid.
>
> Thanks for your reply, and I don't know what to do.
> For current code, if user program write to a block device(maybe a test try) and if its user copy page corrupt when in kernel copy, the process is killed with a SIGBUS.
> And for the page fault case in this thread, the process is error returned.
Can you point me at that SIGBUS code in a current kernel?
>
> --
> Thanks!
> Aili Yao
> Can you point me at that SIGBUS code in a current kernel?
It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
or copy from user variant was in use in the kernel when the poison memory was consumed.
if (p->mce_vaddr != (void __user *)-1l) {
force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
That would make it match up better with what happens when poison is found
asynchronously by the patrol scrubber. I.e. the semantics are:
AR: You just touched poison at this address and need to do something about that.
AO: Just letting you know that you have some poison at the address in siginfo.
-Tony
> On Mar 8, 2021, at 10:31 AM, Luck, Tony <[email protected]> wrote:
>
>
>>
>> Can you point me at that SIGBUS code in a current kernel?
>
> It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> or copy from user variant was in use in the kernel when the poison memory was consumed.
>
> if (p->mce_vaddr != (void __user *)-1l) {
> force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
Hmm. On the one hand, no one has complained yet. On the other hand, hardware that supports this isn’t exactly common.
We may need some actual ABI design here. We also need to make sure that things like io_uring accesses or, more generally, anything using the use_mm / use_temporary_mm ends up either sending no signal or sending a signal to the right target.
>
> Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
Dunno.
>
> That would make it match up better with what happens when poison is found
> asynchronously by the patrol scrubber. I.e. the semantics are:
>
> AR: You just touched poison at this address and need to do something about that.
> AO: Just letting you know that you have some poison at the address in siginfo.
>
> -Tony
On Mon, 8 Mar 2021 18:31:07 +0000
"Luck, Tony" <[email protected]> wrote:
> > Can you point me at that SIGBUS code in a current kernel?
>
> It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> or copy from user variant was in use in the kernel when the poison memory was consumed.
>
> if (p->mce_vaddr != (void __user *)-1l) {
> force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
>
> Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
>
> That would make it match up better with what happens when poison is found
> asynchronously by the patrol scrubber. I.e. the semantics are:
>
> AR: You just touched poison at this address and need to do something about that.
> AO: Just letting you know that you have some poison at the address in siginfo.
>
> -Tony
Is the kill action for this scenario in memory_failure()?
--
Thanks!
Aili Yao
On Tue, 9 Mar 2021 10:14:52 +0800
Aili Yao <[email protected]> wrote:
> On Mon, 8 Mar 2021 18:31:07 +0000
> "Luck, Tony" <[email protected]> wrote:
>
> > > Can you point me at that SIGBUS code in a current kernel?
> >
> > It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> > or copy from user variant was in use in the kernel when the poison memory was consumed.
> >
> > if (p->mce_vaddr != (void __user *)-1l) {
> > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> >
> > Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
> >
> > That would make it match up better with what happens when poison is found
> > asynchronously by the patrol scrubber. I.e. the semantics are:
> >
> > AR: You just touched poison at this address and need to do something about that.
> > AO: Just letting you know that you have some poison at the address in siginfo.
> >
> > -Tony
>
> Is the kill action for this scenario in memory_failure()?
Does the current logic kill the process twice for this scenario ?
I am confused.
--
Thanks!
Aili Yao
On Mon, 8 Mar 2021 11:00:28 -0800
Andy Lutomirski <[email protected]> wrote:
> > On Mar 8, 2021, at 10:31 AM, Luck, Tony <[email protected]> wrote:
> >
> >
> >>
> >> Can you point me at that SIGBUS code in a current kernel?
> >
> > It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> > or copy from user variant was in use in the kernel when the poison memory was consumed.
> >
> > if (p->mce_vaddr != (void __user *)-1l) {
> > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
>
> Hmm. On the one hand, no one has complained yet. On the other hand, hardware that supports this isn’t exactly common.
>
> We may need some actual ABI design here. We also need to make sure that things like io_uring accesses or, more generally, anything using the use_mm / use_temporary_mm ends up either sending no signal or sending a signal to the right target.
>
> >
> > Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
>
> Dunno.
I have one thought here but don't know if it's proper:
Previous patch use force_sig_mceerr to the user process for such a scenario; with this method
The SIGBUS can't be ignored as force_sig_mceerr() was designed to.
If the user process don't want this signal, will it set signal config to ignore?
Maybe we can use a send_sig_mceerr() instead of force_sig_mceerr(), if process want to
ignore the SIGBUS, then it will ignore that, or it can also process the SIGBUS?
--
Thanks!
Aili Yao
On Wed, Mar 10, 2021 at 5:19 PM Aili Yao <[email protected]> wrote:
>
> On Mon, 8 Mar 2021 11:00:28 -0800
> Andy Lutomirski <[email protected]> wrote:
>
> > > On Mar 8, 2021, at 10:31 AM, Luck, Tony <[email protected]> wrote:
> > >
> > >
> > >>
> > >> Can you point me at that SIGBUS code in a current kernel?
> > >
> > > It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> > > or copy from user variant was in use in the kernel when the poison memory was consumed.
> > >
> > > if (p->mce_vaddr != (void __user *)-1l) {
> > > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> >
> > Hmm. On the one hand, no one has complained yet. On the other hand, hardware that supports this isn’t exactly common.
> >
> > We may need some actual ABI design here. We also need to make sure that things like io_uring accesses or, more generally, anything using the use_mm / use_temporary_mm ends up either sending no signal or sending a signal to the right target.
> >
> > >
> > > Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
> >
> > Dunno.
>
> I have one thought here but don't know if it's proper:
>
> Previous patch use force_sig_mceerr to the user process for such a scenario; with this method
> The SIGBUS can't be ignored as force_sig_mceerr() was designed to.
>
> If the user process don't want this signal, will it set signal config to ignore?
> Maybe we can use a send_sig_mceerr() instead of force_sig_mceerr(), if process want to
> ignore the SIGBUS, then it will ignore that, or it can also process the SIGBUS?
I don't think the signal blocking mechanism makes sense for this.
Blocking a signal is for saying that, if another process sends the
signal (or an async event like ctrl-C), then the process doesn't want
it. Blocking doesn't block synchronous things like faults.
I think we need to at least fix the existing bug before we add more
signals. AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
threads.
--Andy
On Wed, 10 Mar 2021 17:28:12 -0800
Andy Lutomirski <[email protected]> wrote:
> On Wed, Mar 10, 2021 at 5:19 PM Aili Yao <[email protected]> wrote:
> >
> > On Mon, 8 Mar 2021 11:00:28 -0800
> > Andy Lutomirski <[email protected]> wrote:
> >
> > > > On Mar 8, 2021, at 10:31 AM, Luck, Tony <[email protected]> wrote:
> > > >
> > > >
> > > >>
> > > >> Can you point me at that SIGBUS code in a current kernel?
> > > >
> > > > It is in kill_me_maybe(). mce_vaddr is setup when we disassemble whatever get_user()
> > > > or copy from user variant was in use in the kernel when the poison memory was consumed.
> > > >
> > > > if (p->mce_vaddr != (void __user *)-1l) {
> > > > force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
> > >
> > > Hmm. On the one hand, no one has complained yet. On the other hand, hardware that supports this isn’t exactly common.
> > >
> > > We may need some actual ABI design here. We also need to make sure that things like io_uring accesses or, more generally, anything using the use_mm / use_temporary_mm ends up either sending no signal or sending a signal to the right target.
> > >
> > > >
> > > > Would it be any better if we used the BUS_MCEERR_AO code that goes into siginfo?
> > >
> > > Dunno.
> >
> > I have one thought here but don't know if it's proper:
> >
> > Previous patch use force_sig_mceerr to the user process for such a scenario; with this method
> > The SIGBUS can't be ignored as force_sig_mceerr() was designed to.
> >
> > If the user process don't want this signal, will it set signal config to ignore?
> > Maybe we can use a send_sig_mceerr() instead of force_sig_mceerr(), if process want to
> > ignore the SIGBUS, then it will ignore that, or it can also process the SIGBUS?
>
> I don't think the signal blocking mechanism makes sense for this.
> Blocking a signal is for saying that, if another process sends the
> signal (or an async event like ctrl-C), then the process doesn't want
> it. Blocking doesn't block synchronous things like faults.
>
> I think we need to at least fix the existing bug before we add more
> signals. AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
> threads.
Got this, Thanks!
I read https://man7.org/linux/man-pages/man2/write.2.html, and it seems the write syscall is not expecting
an signal, maybe a specific error code for this scenario is enough.
--
Thanks!
Aili Yao
> I think we need to at least fix the existing bug before we add more
> signals. AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
> threads.
Can a kernel thread do get_user() or copy_from_user()? Do we have kernel threads
that have an associated user address space to copy from?
-Tony
On Thu, Mar 11, 2021 at 04:52:10PM +0000, Luck, Tony wrote:
> > I think we need to at least fix the existing bug before we add more
> > signals. AFAICS the MCE_IN_KERNEL_COPYIN code is busted for kernel
> > threads.
>
> Can a kernel thread do get_user() or copy_from_user()? Do we have kernel threads
> that have an associated user address space to copy from?
kthread_use_mm()