when one page is already hwpoisoned by AO action, process may not be
killed, the process mapping this page may make a syscall include this
page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
mode it may be fixed by fixup_exception, current code will just return
error code to user process.
This is not suffient, we should send a SIGBUS to the process and log the
info to console, as we can't trust the process will handle the error
correctly.
Suggested-by: Feng Yang <[email protected]>
Signed-off-by: Aili Yao <[email protected]>
---
arch/x86/mm/fault.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f1f1b5a0956a..36d1e385512b 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -662,7 +662,16 @@ no_context(struct pt_regs *regs, unsigned long error_code,
* In this case we need to make sure we're not recursively
* faulting through the emulate_vsyscall() logic.
*/
+#ifdef CONFIG_MEMORY_FAILURE
+ if (si_code == BUS_MCEERR_AR && signal == SIGBUS)
+ pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
+ current->comm, current->pid, address);
+
+ if ((current->thread.sig_on_uaccess_err && signal) ||
+ (si_code == BUS_MCEERR_AR && signal == SIGBUS)) {
+#else
if (current->thread.sig_on_uaccess_err && signal) {
+#endif
sanitize_error_code(address, &error_code);
set_signal_archinfo(address, error_code);
@@ -927,7 +936,14 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
{
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & X86_PF_USER)) {
+#ifdef CONFIG_MEMORY_FAILURE
+ if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
+ no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
+ else
+ no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+#else
no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
+#endif
return;
}
--
2.25.1
On Thu, Jan 28, 2021 at 07:43:26PM +0800, Aili Yao wrote:
> when one page is already hwpoisoned by AO action, process may not be
> killed, the process mapping this page may make a syscall include this
> page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> mode it may be fixed by fixup_exception, current code will just return
> error code to user process.
Shouldn't the AO action that poisoned the page have also unmapped it?
>
> This is not suffient, we should send a SIGBUS to the process and log the
> info to console, as we can't trust the process will handle the error
> correctly.
I agree with this part ... few apps check for -EFAULT and do the right
thing. But I'm not sure how this happens. Can you provide a bit more
detail on the steps
-Tony
P.S. Typo: s/suffient/sufficient/
>
> Suggested-by: Feng Yang <[email protected]>
> Signed-off-by: Aili Yao <[email protected]>
> ---
> arch/x86/mm/fault.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f1f1b5a0956a..36d1e385512b 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -662,7 +662,16 @@ no_context(struct pt_regs *regs, unsigned long error_code,
> * In this case we need to make sure we're not recursively
> * faulting through the emulate_vsyscall() logic.
> */
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (si_code == BUS_MCEERR_AR && signal == SIGBUS)
> + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> + current->comm, current->pid, address);
> +
> + if ((current->thread.sig_on_uaccess_err && signal) ||
> + (si_code == BUS_MCEERR_AR && signal == SIGBUS)) {
> +#else
> if (current->thread.sig_on_uaccess_err && signal) {
> +#endif
> sanitize_error_code(address, &error_code);
>
> set_signal_archinfo(address, error_code);
> @@ -927,7 +936,14 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> {
> /* Kernel mode? Handle exceptions or die: */
> if (!(error_code & X86_PF_USER)) {
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE))
> + no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
> + else
> + no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> +#else
> no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
> +#endif
> return;
> }
>
> --
> 2.25.1
>
On Thu, 28 Jan 2021 09:43:52 -0800
"Luck, Tony" <[email protected]> wrote:
> On Thu, Jan 28, 2021 at 07:43:26PM +0800, Aili Yao wrote:
> > when one page is already hwpoisoned by AO action, process may not be
> > killed, the process mapping this page may make a syscall include this
> > page and result to trigger a VM_FAULT_HWPOISON fault, as it's in kernel
> > mode it may be fixed by fixup_exception, current code will just return
> > error code to user process.
>
> Shouldn't the AO action that poisoned the page have also unmapped it?
Yes, The page has been unmapped in the code mm/rmap.c:
1567 if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
1568 pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
1569 if (PageHuge(page)) {
1570 hugetlb_count_sub(compound_nr(page), mm);
1571 set_huge_swap_pte_at(mm, address,
1572 pvmw.pte, pteval,
1573 vma_mmu_pagesize(vma));
1574 } else {
1575 dec_mm_counter(mm, mm_counter(page));
1576 set_pte_at(mm, address, pvmw.pte, pteval);
1577 }
1578
1579 }
The pte for this page of processes mapping it should be marked with SWP_HWPOISON.
But the process may not be informed and may continue with the address which has been
ummaped, Then accessing the content in the page will trigger a page fault.
Normally, it will hit the code in arch/x86/mm/fault.c:
945 #ifdef CONFIG_MEMORY_FAILURE
946 if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
947 struct task_struct *tsk = current;
948 unsigned lsb = 0;
949
950 pr_err(
951 "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
952 tsk->comm, tsk->pid, address);
953 if (fault & VM_FAULT_HWPOISON_LARGE)
954 lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault));
955 if (fault & VM_FAULT_HWPOISON)
956 lsb = PAGE_SHIFT;
957 force_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb);
958 return;
959 }
960 #endif
961 force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
962 }
But when the user processes do a syscall and make a copyin action in kernel space,
the page fault triggered by this action will not got the the above code, it actually
go to the code in arch/x86/mm/fault.c:
650 if (fixup_exception(regs, X86_TRAP_PF, error_code, address)) {
674 /*
675 * Barring that, we can do the fixup and be happy.
676 */
677 return;
678 }
> >
> > This is not suffient, we should send a SIGBUS to the process and log the
> > info to console, as we can't trust the process will handle the error
> > correctly.
>
> I agree with this part ... few apps check for -EFAULT and do the right
> thing. But I'm not sure how this happens. Can you provide a bit more
> detail on the steps
>
Attachment is a simple code to test this, you can try this test with:
./einj_mem_uc -f copyin2
In my enviroment, the stack will be:
[ 1583.063050] Memory failure: 0x1030254: recovery action for dirty LRU page: Recovered
[ 1583.071724] MCE: Killing einj_mem_uc:11139 due to hardware memory corruption fault at 7f4d59032000
[ 1583.081732] CPU: 38 PID: 11139 Comm: einj_mem_uc Kdump: loaded Not tainted 5.11.0-rc2+ #43
[ 1583.102607] Call Trace:
[ 1583.105338] dump_stack+0x57/0x6a
[ 1583.109041] no_context.cold+0xf6/0x284
[ 1583.113315] mm_fault_error+0xc3/0x1b0
[ 1583.117503] exc_page_fault+0x57/0x110
[ 1583.121690] asm_exc_page_fault+0x1e/0x30
[ 1583.126159] RIP: 0010:__get_user_nocheck_8+0x10/0x13
[ 1583.131704] Code: 0f b7 10 31 c0 0f 01 ca c3 90 0f 01 cb 0f ae e8 8b 10 31 c0 0f 01 ca c3 66 90 0f 01 cb 0f ae e8 48 8b 10 31 c0 0f 01 ca c3 90 <0f> 01 ca 31 d2 48 c7 c0 f2 ff ff ff c3 cc cc cc 0f 1f 44 00 00 40
[ 1583.152659] RSP: 0018:ffffb9e462917d90 EFLAGS: 00050293
[ 1583.158490] RAX: 00007f4d59032000 RBX: 0000000000000000 RCX: 00007f4d59032000
[ 1583.166453] RDX: 0000000000000000 RSI: 0000000000000200 RDI: 00007f4d590321ff
[ 1583.174418] RBP: 0000000000000200 R08: 0000000000000200 R09: ffffb9e462917e50
[ 1583.182382] R10: 0000000000000200 R11: 0000000000000000 R12: ffffb9e462917e60
[ 1583.190345] R13: ffff941470e93058 R14: 0000000000001000 R15: ffffffffc0626540
[ 1583.198310] iov_iter_fault_in_readable+0x4f/0x120
[ 1583.203657] generic_perform_write+0x83/0x1c0
[ 1583.208520] ext4_buffered_write_iter+0x93/0x150 [ext4]
[ 1583.214378] new_sync_write+0x11f/0x1b0
[ 1583.218661] vfs_write+0x1c0/0x280
[ 1583.222455] ksys_write+0x5f/0xe0
[ 1583.226153] do_syscall_64+0x33/0x40
[ 1583.230142] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1583.235778] RIP: 0033:0x7f4d58b35cd0
> P.S. Typo: s/suffient/sufficient/
Thanks for correction!
--
Best Regards!
Aili Yao
Thanks for the explanation and test code. I think I see better what
is going on here.
[I took your idea for using madvise(...MADV_HWPOISON) and added a new "-S"
option to my einj_mem_uc test program to use madvise instead of ACPI/EINJ
for injections. Update pushed here:
git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git ]
There have been some small changes to arch/x86/mm/fault.c since you wrote
the patch. Can you rebase to v5.11-rc5?
Also maybe this might be a case to use IS_ENABLED() instead of #ifdef to
make the code a little less ugly. At least for the 2nd hunk in your patch
this would work well:
if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
(fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)))
no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
else
no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
The first hunk might need a bit more thought.
-Tony
On Fri, 29 Jan 2021 14:55:29 -0800
"Luck, Tony" <[email protected]> wrote:
> Thanks for the explanation and test code. I think I see better what
> is going on here.
>
> [I took your idea for using madvise(...MADV_HWPOISON) and added a new "-S"
> option to my einj_mem_uc test program to use madvise instead of ACPI/EINJ
> for injections. Update pushed here:
> git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git ]
>
> There have been some small changes to arch/x86/mm/fault.c since you wrote
> the patch. Can you rebase to v5.11-rc5?
Yes, I will rebase it to newest version.
> Also maybe this might be a case to use IS_ENABLED() instead of #ifdef to
> make the code a little less ugly. At least for the 2nd hunk in your patch
> this would work well:
>
> if (IS_ENABLED(CONFIG_MEMORY_FAILURE) &&
> (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)))
> no_context(regs, error_code, address, SIGBUS, BUS_MCEERR_AR);
> else
> no_context(regs, error_code, address, SIGBUS, BUS_ADRERR);
>
> The first hunk might need a bit more thought.
>
Do you mean the force_sig_mceerr and force_sig_fault difference? I see a hwpoison related comment
there, but it's better to follow the usual way force_sig_mceerr, I will modify this in a v2 patch.
Or something other, you may post a better one.
Thanks
--
Best Regards!
Aili Yao