LinuxLists.cc - mm: NULL ptr deref in remove_migration

2014-04-16 14:59:34

Subject: mm: NULL ptr deref in remove_migration_pte

Hi all,

While fuzzing with trinity inside a KVM tools guest running latest -next
kernel I've stumbled on the following:

[ 2552.313602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 2552.315878] IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
[ 2552.315878] PGD 465836067 PUD 465837067 PMD 0
[ 2552.315878] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 2552.315878] Dumping ftrace buffer:
[ 2552.315878] (ftrace buffer empty)
[ 2552.315878] Modules linked in:
[ 2552.315878] CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
[ 2552.315878] task: ffff88046548b000 ti: ffff88044e532000 task.ti: ffff88044e532000
[ 2552.320286] RIP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
[ 2552.320286] RSP: 0018:ffff88044e5339c8 EFLAGS: 00010002
[ 2552.320286] RAX: 0000000000000082 RBX: ffff88046548b000 RCX: 0000000000000000
[ 2552.320286] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 2552.320286] RBP: ffff88044e533ab8 R08: 0000000000000001 R09: 0000000000000000
[ 2552.320286] R10: ffff88046548b000 R11: 0000000000000001 R12: 0000000000000000
[ 2552.320286] R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000000
[ 2552.320286] FS: 00007fd286a9a700(0000) GS:ffff88018b000000(0000) knlGS:0000000000000000
[ 2552.320286] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2552.320286] CR2: 0000000000000018 CR3: 0000000442c17000 CR4: 00000000000006a0
[ 2552.320286] DR0: 0000000000695000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2552.320286] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 2552.320286] Stack:
[ 2552.320286] ffff88044e5339e8 ffffffff9f56e761 0000000000000000 ffff880315c13000
[ 2552.320286] ffff88044e533a38 ffffffff9c193f0d ffffffff9c193e34 ffff8804654e8000
[ 2552.320286] ffff8804654e8000 0000000000000001 ffff88046548b000 0000000000000007
[ 2552.320286] Call Trace:
[ 2552.320286] ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
[ 2552.320286] ? finish_task_switch (include/linux/tick.h:206 kernel/sched/core.c:2163)
[ 2552.320286] ? finish_task_switch (arch/x86/include/asm/current.h:14 kernel/sched/sched.h:993 kernel/sched/core.c:2145)
[ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
[ 2552.320286] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 2552.320286] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
[ 2552.320286] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
[ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
[ 2552.320286] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
[ 2552.320286] remove_migration_pte (mm/migrate.c:137)
[ 2552.320286] rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
[ 2552.320286] remove_migration_ptes (mm/migrate.c:224)
[ 2552.320286] ? new_page_node (mm/migrate.c:107)
[ 2552.320286] ? remove_migration_pte (mm/migrate.c:195)
[ 2552.320286] migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
[ 2552.320286] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1574)
[ 2552.320286] migrate_misplaced_page (mm/migrate.c:1733)
[ 2552.320286] __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
[ 2552.320286] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 2552.320286] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 2552.320286] handle_mm_fault (mm/memory.c:3948)
[ 2552.320286] __get_user_pages (mm/memory.c:1851)
[ 2552.320286] ? preempt_count_sub (kernel/sched/core.c:2527)
[ 2552.320286] __mlock_vma_pages_range (mm/mlock.c:255)
[ 2552.320286] __mm_populate (mm/mlock.c:711)
[ 2552.320286] SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
[ 2552.320286] tracesys (arch/x86/kernel/entry_64.S:749)
[ 2552.320286] Code: 85 2d 1e 00 00 48 c7 c1 d7 68 6c a0 48 c7 c2 47 11 6c a0 31 c0 be fa 0b 00 00 48 c7 c7 91 68 6c a0 e8 1c 6d f9 ff e9 07 1e 00 00 <49> 81 7d 00 80 31 76 a2 b8 00 00 00 00 44 0f 44 c0 eb 07 0f 1f
[ 2552.320286] RIP __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
[ 2552.320286] RSP <ffff88044e5339c8>
[ 2552.320286] CR2: 0000000000000018

Thanks,
Sasha

2014-06-10 04:21:56

by Hugh Dickins

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On Tue, 27 May 2014, Sasha Levin wrote:
> On 05/26/2014 04:05 PM, Hugh Dickins wrote:
> > On Fri, 23 May 2014, Sasha Levin wrote:
> >
> >> Ping?
> >>
> >> On 05/05/2014 11:51 AM, Sasha Levin wrote:
> >>> Did anyone have a chance to look at it? I still see it in -next.
> >>>
> >>>
> >>> Thanks,
> >>> Sasha
> >>>
> >>> On 04/16/2014 10:59 AM, Sasha Levin wrote:
> >>>> Hi all,
> >>>>
> >>>> While fuzzing with trinity inside a KVM tools guest running latest -next
> >>>> kernel I've stumbled on the following:
> >>>>
> >>>> [ 2552.313602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> >>>> [ 2552.315878] IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
> >>>> [ 2552.315878] PGD 465836067 PUD 465837067 PMD 0
> >>>> [ 2552.315878] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> >>>> [ 2552.315878] Dumping ftrace buffer:
> >>>> [ 2552.315878] (ftrace buffer empty)
> >>>> [ 2552.315878] Modules linked in:
> >>>> [ 2552.315878] CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
> >>>> [ 2552.315878] task: ffff88046548b000 ti: ffff88044e532000 task.ti: ffff88044e532000
> >>>> [ 2552.320286] RIP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
> >>>> [ 2552.320286] RSP: 0018:ffff88044e5339c8 EFLAGS: 00010002
> >>>> [ 2552.320286] RAX: 0000000000000082 RBX: ffff88046548b000 RCX: 0000000000000000
> >>>> [ 2552.320286] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
> >>>> [ 2552.320286] RBP: ffff88044e533ab8 R08: 0000000000000001 R09: 0000000000000000
> >>>> [ 2552.320286] R10: ffff88046548b000 R11: 0000000000000001 R12: 0000000000000000
> >>>> [ 2552.320286] R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000000
> >>>> [ 2552.320286] FS: 00007fd286a9a700(0000) GS:ffff88018b000000(0000) knlGS:0000000000000000
> >>>> [ 2552.320286] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >>>> [ 2552.320286] CR2: 0000000000000018 CR3: 0000000442c17000 CR4: 00000000000006a0
> >>>> [ 2552.320286] DR0: 0000000000695000 DR1: 0000000000000000 DR2: 0000000000000000
> >>>> [ 2552.320286] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> >>>> [ 2552.320286] Stack:
> >>>> [ 2552.320286] ffff88044e5339e8 ffffffff9f56e761 0000000000000000 ffff880315c13000
> >>>> [ 2552.320286] ffff88044e533a38 ffffffff9c193f0d ffffffff9c193e34 ffff8804654e8000
> >>>> [ 2552.320286] ffff8804654e8000 0000000000000001 ffff88046548b000 0000000000000007
> >>>> [ 2552.320286] Call Trace:
> >>>> [ 2552.320286] ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
> >>>> [ 2552.320286] ? finish_task_switch (include/linux/tick.h:206 kernel/sched/core.c:2163)
> >>>> [ 2552.320286] ? finish_task_switch (arch/x86/include/asm/current.h:14 kernel/sched/sched.h:993 kernel/sched/core.c:2145)
> >>>> [ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
> >>>> [ 2552.320286] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> >>>> [ 2552.320286] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
> >>>> [ 2552.320286] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
> >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
> >>>> [ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
> >>>> [ 2552.320286] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
> >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
> >>>> [ 2552.320286] remove_migration_pte (mm/migrate.c:137)
> >>>> [ 2552.320286] rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
> >>>> [ 2552.320286] remove_migration_ptes (mm/migrate.c:224)
> >>>> [ 2552.320286] ? new_page_node (mm/migrate.c:107)
> >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:195)
> >>>> [ 2552.320286] migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
> >>>> [ 2552.320286] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1574)
> >>>> [ 2552.320286] migrate_misplaced_page (mm/migrate.c:1733)
> >>>> [ 2552.320286] __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
> >>>> [ 2552.320286] ? __const_udelay (arch/x86/lib/delay.c:126)
> >>>> [ 2552.320286] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> >>>> [ 2552.320286] handle_mm_fault (mm/memory.c:3948)
> >>>> [ 2552.320286] __get_user_pages (mm/memory.c:1851)
> >>>> [ 2552.320286] ? preempt_count_sub (kernel/sched/core.c:2527)
> >>>> [ 2552.320286] __mlock_vma_pages_range (mm/mlock.c:255)
> >>>> [ 2552.320286] __mm_populate (mm/mlock.c:711)
> >>>> [ 2552.320286] SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
> >>>> [ 2552.320286] tracesys (arch/x86/kernel/entry_64.S:749)
> >>>> [ 2552.320286] Code: 85 2d 1e 00 00 48 c7 c1 d7 68 6c a0 48 c7 c2 47 11 6c a0 31 c0 be fa 0b 00 00 48 c7 c7 91 68 6c a0 e8 1c 6d f9 ff e9 07 1e 00 00 <49> 81 7d 00 80 31 76 a2 b8 00 00 00 00 44 0f 44 c0 eb 07 0f 1f
> >>>> [ 2552.320286] RIP __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
> >>>> [ 2552.320286] RSP <ffff88044e5339c8>
> >>>> [ 2552.320286] CR2: 0000000000000018
> >
> > Sasha, please clarify your Ping: I've seen you say in other mail
> > "I had to disable transhuge/hugetlb in my testing .config".
> >
> > Do you see this remove_migration_pte oops even with THP disabled?
> >
> > Do you see the filemap.c:202 BUG_ON(page_mapped(page))
> > even with THP disabled?
>
> The mail that you mentioned prompted me to go back and re-enable THP and
> see what still breaks, which would explain why I pinged this thread again (I
> only do that once I see that problem still occurs).
>
> However, I can't confirm if these problems happen without THP as I didn't
> think they were related. I'll disable THP again and give it a go.

Although there's nothing in the backtrace to implicate it,
I think this crash is caused by THP: please try this patch - thanks.

[PATCH] mm: let mm_find_pmd fix buggy race with THP fault

Trinity has reported:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
lock_acquire (arch/x86/include/asm/current.h:14
kernel/locking/lockdep.c:3602)
_raw_spin_lock (include/linux/spinlock_api_smp.h:143
kernel/locking/spinlock.c:151)
remove_migration_pte (mm/migrate.c:137)
rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
remove_migration_ptes (mm/migrate.c:224)
migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
migrate_misplaced_page (mm/migrate.c:1733)
__handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
handle_mm_fault (mm/memory.c:3948)
__get_user_pages (mm/memory.c:1851)
__mlock_vma_pages_range (mm/mlock.c:255)
__mm_populate (mm/mlock.c:711)
SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

I believe this comes about because, whereas collapsing and splitting
THP functions take anon_vma lock in write mode (which excludes
concurrent rmap walks), faulting THP functions (write protection and
misplaced NUMA) do not - and mostly they do not need to.

But they do use a pmdp_clear_flush(), set_pmd_at() sequence which,
for an instant (indeed, for a long instant, given the inter-CPU
TLB flush in there), leaves *pmd neither present not trans_huge.

Which can confuse a concurrent rmap walk, as when removing migration
ptes, seen in the dumped trace. Although that rmap walk has a 4k
page to insert, anon_vmas containing THPs are in no way segregated
from 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to
cope with that instant when a trans_huge pmd is temporarily absent.

I don't think we need strengthen the locking at the THP end: it's
easily handled with an ACCESS_ONCE() before testing both conditions.

And since mm_find_pmd() had only one caller who wanted a THP rather
than a pmd, let's slightly repurpose it to fail when it hits a THP
or non-present pmd, and open code split_huge_page_address() again.

Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---

mm/huge_memory.c | 18 ++++++++++++------
mm/ksm.c | 1 -
mm/migrate.c | 2 --
mm/rmap.c | 12 ++++++++----
4 files changed, 20 insertions(+), 13 deletions(-)

--- v3.15/mm/huge_memory.c 2014-06-08 11:19:54.000000000 -0700
+++ linux/mm/huge_memory.c 2014-06-09 19:18:20.004702465 -0700
@@ -2390,8 +2390,6 @@ static void collapse_huge_page(struct mm
pmd = mm_find_pmd(mm, address);
if (!pmd)
goto out;
- if (pmd_trans_huge(*pmd))
- goto out;

anon_vma_lock_write(vma->anon_vma);

@@ -2490,8 +2488,6 @@ static int khugepaged_scan_pmd(struct mm
pmd = mm_find_pmd(mm, address);
if (!pmd)
goto out;
- if (pmd_trans_huge(*pmd))
- goto out;

memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
@@ -2844,12 +2840,22 @@ void split_huge_page_pmd_mm(struct mm_st
static void split_huge_page_address(struct mm_struct *mm,
unsigned long address)
{
+ pgd_t *pgd;
+ pud_t *pud;
pmd_t *pmd;

VM_BUG_ON(!(address & ~HPAGE_PMD_MASK));

- pmd = mm_find_pmd(mm, address);
- if (!pmd)
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ return;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return;
+
+ pmd = pmd_offset(pud, address);
+ if (!pmd_present(*pmd))
return;
/*
* Caller holds the mmap_sem write mode, so a huge pmd cannot
--- v3.15/mm/ksm.c 2014-03-30 20:40:15.000000000 -0700
+++ linux/mm/ksm.c 2014-06-09 19:18:20.004702465 -0700
@@ -945,7 +945,6 @@ static int replace_page(struct vm_area_s
pmd = mm_find_pmd(mm, addr);
if (!pmd)
goto out;
- BUG_ON(pmd_trans_huge(*pmd));

mmun_start = addr;
mmun_end = addr + PAGE_SIZE;
--- v3.15/mm/migrate.c 2014-03-30 20:40:15.000000000 -0700
+++ linux/mm/migrate.c 2014-06-09 19:18:20.008702465 -0700
@@ -120,8 +120,6 @@ static int remove_migration_pte(struct p
pmd = mm_find_pmd(mm, addr);
if (!pmd)
goto out;
- if (pmd_trans_huge(*pmd))
- goto out;

ptep = pte_offset_map(pmd, addr);

--- v3.15/mm/rmap.c 2014-06-08 11:19:54.000000000 -0700
+++ linux/mm/rmap.c 2014-06-09 19:18:20.008702465 -0700
@@ -567,6 +567,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd = NULL;
+ pmd_t pmde;

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -577,7 +578,13 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
goto out;

pmd = pmd_offset(pud, address);
- if (!pmd_present(*pmd))
+ /*
+ * Some THP functions use the sequence pmdp_clear_flush(), set_pmd_at()
+ * without holding anon_vma lock for write. So when looking for a
+ * genuine pmde (in which to find pte), test present and !THP together.
+ */
+ pmde = ACCESS_ONCE(*pmd);
+ if (!pmd_present(pmde) || pmd_trans_huge(pmde))
pmd = NULL;
out:
return pmd;
@@ -613,9 +620,6 @@ pte_t *__page_check_address(struct page
if (!pmd)
return NULL;

- if (pmd_trans_huge(*pmd))
- return NULL;
-
pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!sync && !pte_present(*pte)) {

2014-06-10 10:09:51

by Kirill A. Shutemov

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On Mon, Jun 09, 2014 at 09:20:33PM -0700, Hugh Dickins wrote:
> [PATCH] mm: let mm_find_pmd fix buggy race with THP fault
>
> Trinity has reported:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
> CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
> 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
> lock_acquire (arch/x86/include/asm/current.h:14
> kernel/locking/lockdep.c:3602)
> _raw_spin_lock (include/linux/spinlock_api_smp.h:143
> kernel/locking/spinlock.c:151)
> remove_migration_pte (mm/migrate.c:137)
> rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
> remove_migration_ptes (mm/migrate.c:224)
> migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
> migrate_misplaced_page (mm/migrate.c:1733)
> __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
> handle_mm_fault (mm/memory.c:3948)
> __get_user_pages (mm/memory.c:1851)
> __mlock_vma_pages_range (mm/mlock.c:255)
> __mm_populate (mm/mlock.c:711)
> SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
>
> I believe this comes about because, whereas collapsing and splitting
> THP functions take anon_vma lock in write mode (which excludes
> concurrent rmap walks), faulting THP functions (write protection and
> misplaced NUMA) do not - and mostly they do not need to.
>
> But they do use a pmdp_clear_flush(), set_pmd_at() sequence which,
> for an instant (indeed, for a long instant, given the inter-CPU
> TLB flush in there), leaves *pmd neither present not trans_huge.
>
> Which can confuse a concurrent rmap walk, as when removing migration
> ptes, seen in the dumped trace. Although that rmap walk has a 4k
> page to insert, anon_vmas containing THPs are in no way segregated
> from 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to
> cope with that instant when a trans_huge pmd is temporarily absent.
>
> I don't think we need strengthen the locking at the THP end: it's
> easily handled with an ACCESS_ONCE() before testing both conditions.
>
> And since mm_find_pmd() had only one caller who wanted a THP rather
> than a pmd, let's slightly repurpose it to fail when it hits a THP
> or non-present pmd, and open code split_huge_page_address() again.
>
> Reported-by: Sasha Levin <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-06-16 21:04:25

by Sasha Levin

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On 06/10/2014 12:20 AM, Hugh Dickins wrote:
> On Tue, 27 May 2014, Sasha Levin wrote:
>> > On 05/26/2014 04:05 PM, Hugh Dickins wrote:
>>> > > On Fri, 23 May 2014, Sasha Levin wrote:
>>> > >
>>>> > >> Ping?
>>>> > >>
>>>> > >> On 05/05/2014 11:51 AM, Sasha Levin wrote:
>>>>> > >>> Did anyone have a chance to look at it? I still see it in -next.
>>>>> > >>>
>>>>> > >>>
>>>>> > >>> Thanks,
>>>>> > >>> Sasha
>>>>> > >>>
>>>>> > >>> On 04/16/2014 10:59 AM, Sasha Levin wrote:
>>>>>> > >>>> Hi all,
>>>>>> > >>>>
>>>>>> > >>>> While fuzzing with trinity inside a KVM tools guest running latest -next
>>>>>> > >>>> kernel I've stumbled on the following:
>>>>>> > >>>>
>>>>>> > >>>> [ 2552.313602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
>>>>>> > >>>> [ 2552.315878] IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
>>>>>> > >>>> [ 2552.315878] PGD 465836067 PUD 465837067 PMD 0
>>>>>> > >>>> [ 2552.315878] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
>>>>>> > >>>> [ 2552.315878] Dumping ftrace buffer:
>>>>>> > >>>> [ 2552.315878] (ftrace buffer empty)
>>>>>> > >>>> [ 2552.315878] Modules linked in:
>>>>>> > >>>> [ 2552.315878] CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
>>>>>> > >>>> [ 2552.315878] task: ffff88046548b000 ti: ffff88044e532000 task.ti: ffff88044e532000
>>>>>> > >>>> [ 2552.320286] RIP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
>>>>>> > >>>> [ 2552.320286] RSP: 0018:ffff88044e5339c8 EFLAGS: 00010002
>>>>>> > >>>> [ 2552.320286] RAX: 0000000000000082 RBX: ffff88046548b000 RCX: 0000000000000000
>>>>>> > >>>> [ 2552.320286] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
>>>>>> > >>>> [ 2552.320286] RBP: ffff88044e533ab8 R08: 0000000000000001 R09: 0000000000000000
>>>>>> > >>>> [ 2552.320286] R10: ffff88046548b000 R11: 0000000000000001 R12: 0000000000000000
>>>>>> > >>>> [ 2552.320286] R13: 0000000000000018 R14: 0000000000000000 R15: 0000000000000000
>>>>>> > >>>> [ 2552.320286] FS: 00007fd286a9a700(0000) GS:ffff88018b000000(0000) knlGS:0000000000000000
>>>>>> > >>>> [ 2552.320286] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>>>>>> > >>>> [ 2552.320286] CR2: 0000000000000018 CR3: 0000000442c17000 CR4: 00000000000006a0
>>>>>> > >>>> [ 2552.320286] DR0: 0000000000695000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>> > >>>> [ 2552.320286] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>>>>>> > >>>> [ 2552.320286] Stack:
>>>>>> > >>>> [ 2552.320286] ffff88044e5339e8 ffffffff9f56e761 0000000000000000 ffff880315c13000
>>>>>> > >>>> [ 2552.320286] ffff88044e533a38 ffffffff9c193f0d ffffffff9c193e34 ffff8804654e8000
>>>>>> > >>>> [ 2552.320286] ffff8804654e8000 0000000000000001 ffff88046548b000 0000000000000007
>>>>>> > >>>> [ 2552.320286] Call Trace:
>>>>>> > >>>> [ 2552.320286] ? _raw_spin_unlock_irq (arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
>>>>>> > >>>> [ 2552.320286] ? finish_task_switch (include/linux/tick.h:206 kernel/sched/core.c:2163)
>>>>>> > >>>> [ 2552.320286] ? finish_task_switch (arch/x86/include/asm/current.h:14 kernel/sched/sched.h:993 kernel/sched/core.c:2145)
>>>>>> > >>>> [ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
>>>>>> > >>>> [ 2552.320286] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>>>>>> > >>>> [ 2552.320286] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2557 kernel/locking/lockdep.c:2599)
>>>>>> > >>>> [ 2552.320286] lock_acquire (arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
>>>>>> > >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
>>>>>> > >>>> [ 2552.320286] ? retint_restore_args (arch/x86/kernel/entry_64.S:1040)
>>>>>> > >>>> [ 2552.320286] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
>>>>>> > >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:137)
>>>>>> > >>>> [ 2552.320286] remove_migration_pte (mm/migrate.c:137)
>>>>>> > >>>> [ 2552.320286] rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
>>>>>> > >>>> [ 2552.320286] remove_migration_ptes (mm/migrate.c:224)
>>>>>> > >>>> [ 2552.320286] ? new_page_node (mm/migrate.c:107)
>>>>>> > >>>> [ 2552.320286] ? remove_migration_pte (mm/migrate.c:195)
>>>>>> > >>>> [ 2552.320286] migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
>>>>>> > >>>> [ 2552.320286] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1574)
>>>>>> > >>>> [ 2552.320286] migrate_misplaced_page (mm/migrate.c:1733)
>>>>>> > >>>> [ 2552.320286] __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
>>>>>> > >>>> [ 2552.320286] ? __const_udelay (arch/x86/lib/delay.c:126)
>>>>>> > >>>> [ 2552.320286] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>>>>>> > >>>> [ 2552.320286] handle_mm_fault (mm/memory.c:3948)
>>>>>> > >>>> [ 2552.320286] __get_user_pages (mm/memory.c:1851)
>>>>>> > >>>> [ 2552.320286] ? preempt_count_sub (kernel/sched/core.c:2527)
>>>>>> > >>>> [ 2552.320286] __mlock_vma_pages_range (mm/mlock.c:255)
>>>>>> > >>>> [ 2552.320286] __mm_populate (mm/mlock.c:711)
>>>>>> > >>>> [ 2552.320286] SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
>>>>>> > >>>> [ 2552.320286] tracesys (arch/x86/kernel/entry_64.S:749)
>>>>>> > >>>> [ 2552.320286] Code: 85 2d 1e 00 00 48 c7 c1 d7 68 6c a0 48 c7 c2 47 11 6c a0 31 c0 be fa 0b 00 00 48 c7 c7 91 68 6c a0 e8 1c 6d f9 ff e9 07 1e 00 00 <49> 81 7d 00 80 31 76 a2 b8 00 00 00 00 44 0f 44 c0 eb 07 0f 1f
>>>>>> > >>>> [ 2552.320286] RIP __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
>>>>>> > >>>> [ 2552.320286] RSP <ffff88044e5339c8>
>>>>>> > >>>> [ 2552.320286] CR2: 0000000000000018
>>> > >
>>> > > Sasha, please clarify your Ping: I've seen you say in other mail
>>> > > "I had to disable transhuge/hugetlb in my testing .config".
>>> > >
>>> > > Do you see this remove_migration_pte oops even with THP disabled?
>>> > >
>>> > > Do you see the filemap.c:202 BUG_ON(page_mapped(page))
>>> > > even with THP disabled?
>> >
>> > The mail that you mentioned prompted me to go back and re-enable THP and
>> > see what still breaks, which would explain why I pinged this thread again (I
>> > only do that once I see that problem still occurs).
>> >
>> > However, I can't confirm if these problems happen without THP as I didn't
>> > think they were related. I'll disable THP again and give it a go.
> Although there's nothing in the backtrace to implicate it,
> I think this crash is caused by THP: please try this patch - thanks.
>
> [PATCH] mm: let mm_find_pmd fix buggy race with THP fault
>
> Trinity has reported:
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
> CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
> 3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
> lock_acquire (arch/x86/include/asm/current.h:14
> kernel/locking/lockdep.c:3602)
> _raw_spin_lock (include/linux/spinlock_api_smp.h:143
> kernel/locking/spinlock.c:151)
> remove_migration_pte (mm/migrate.c:137)
> rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
> remove_migration_ptes (mm/migrate.c:224)
> migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
> migrate_misplaced_page (mm/migrate.c:1733)
> __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
> handle_mm_fault (mm/memory.c:3948)
> __get_user_pages (mm/memory.c:1851)
> __mlock_vma_pages_range (mm/mlock.c:255)
> __mm_populate (mm/mlock.c:711)
> SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)
>
> I believe this comes about because, whereas collapsing and splitting
> THP functions take anon_vma lock in write mode (which excludes
> concurrent rmap walks), faulting THP functions (write protection and
> misplaced NUMA) do not - and mostly they do not need to.
>
> But they do use a pmdp_clear_flush(), set_pmd_at() sequence which,
> for an instant (indeed, for a long instant, given the inter-CPU
> TLB flush in there), leaves *pmd neither present not trans_huge.
>
> Which can confuse a concurrent rmap walk, as when removing migration
> ptes, seen in the dumped trace. Although that rmap walk has a 4k
> page to insert, anon_vmas containing THPs are in no way segregated
> from 4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to
> cope with that instant when a trans_huge pmd is temporarily absent.
>
> I don't think we need strengthen the locking at the THP end: it's
> easily handled with an ACCESS_ONCE() before testing both conditions.
>
> And since mm_find_pmd() had only one caller who wanted a THP rather
> than a pmd, let's slightly repurpose it to fail when it hits a THP
> or non-present pmd, and open code split_huge_page_address() again.
>
> Reported-by: Sasha Levin <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>

Hi Hugh,

It took some time to hit something here, but I think that the following
is related:

[ 489.152166] INFO: trying to register non-static key.
[ 489.152166] the code is fine but needs lockdep annotation.
[ 489.152166] turning off the locking correctness validator.
[ 489.152166] CPU: 23 PID: 12148 Comm: trinity-c79 Not tainted 3.15.0-next-20140616-sasha-00025-g0fd1f7d-dirty #657
[ 489.152166] ffff8804dd013000 ffff8804e15a38e8 ffffffff965140d1 0000000000000002
[ 489.152166] ffffffff9a5ce7c0 ffff8804e15a39e8 ffffffff931ca363 ffff8804e15a3928
[ 489.152166] 0000000000000000 0000000000000000 ffff8804e4730978 0000000000000001
[ 489.152166] Call Trace:
[ 489.152166] dump_stack (lib/dump_stack.c:52)
[ 489.152166] __lock_acquire (kernel/locking/lockdep.c:743 kernel/locking/lockdep.c:3078)
[ 489.152166] ? __lock_acquire (kernel/locking/lockdep.c:3189)
[ 489.152166] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
[ 489.152166] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
[ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
[ 489.152166] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
[ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
[ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
[ 489.152166] __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
[ 489.152166] try_to_unmap_one (mm/rmap.c:1153)
[ 489.152166] ? __const_udelay (arch/x86/lib/delay.c:126)
[ 489.152166] ? __rcu_read_unlock (kernel/rcu/update.c:97)
[ 489.152166] ? page_lock_anon_vma_read (mm/rmap.c:448)
[ 489.152166] rmap_walk (mm/rmap.c:1654 mm/rmap.c:1725)
[ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
[ 489.152166] try_to_unmap (mm/rmap.c:1547)
[ 489.152166] ? page_remove_rmap (mm/rmap.c:1144)
[ 489.152166] ? invalid_migration_vma (mm/rmap.c:1503)
[ 489.152166] ? try_to_unmap_one (mm/rmap.c:1411)
[ 489.152166] ? anon_vma_prepare (mm/rmap.c:448)
[ 489.152166] ? invalid_mkclean_vma (mm/rmap.c:1498)
[ 489.152166] ? page_get_anon_vma (mm/rmap.c:405)
[ 489.152166] migrate_pages (mm/migrate.c:913 mm/migrate.c:959 mm/migrate.c:1146)
[ 489.152166] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
[ 489.152166] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1594)
[ 489.152166] migrate_misplaced_page (mm/migrate.c:1754)
[ 489.152166] __handle_mm_fault (mm/memory.c:3157 mm/memory.c:3207 mm/memory.c:3317)
[ 489.152166] handle_mm_fault (include/linux/memcontrol.h:151 mm/memory.c:3343)
[ 489.152166] ? __do_page_fault (arch/x86/mm/fault.c:1163)
[ 489.152166] __do_page_fault (arch/x86/mm/fault.c:1230)
[ 489.152166] ? vtime_account_user (kernel/sched/cputime.c:687)
[ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
[ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
[ 489.152166] ? context_tracking_user_exit (kernel/context_tracking.c:184)
[ 489.152166] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 489.152166] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
[ 489.152166] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
[ 489.152166] do_async_page_fault (arch/x86/kernel/kvm.c:264)
[ 489.152166] async_page_fault (arch/x86/kernel/entry_64.S:1322)
[ 494.710068] =============================================================================
[ 494.710068] BUG page->ptl (Not tainted): Redzone overwritten
[ 494.710068] -----------------------------------------------------------------------------
[ 494.710068]
[ 494.710068] INFO: 0xffff8804e4730e58-0xffff8804e4730e5f. First byte 0x0 instead of 0xbb
[ 494.710068] INFO: Slab 0xffffea001391cc00 objects=40 used=40 fp=0x (null) flags=0x56fffff80004080
[ 494.710068] INFO: Object 0xffff8804e4730e10 @offset=3600 fp=0x (null)
[ 494.710068]
[ 494.710068] Bytes b4 ffff8804e4730e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 494.710068] Object ffff8804e4730e10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 494.710068] Object ffff8804e4730e20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 494.710068] Object ffff8804e4730e30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 494.710068] Object ffff8804e4730e40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 494.710068] Object ffff8804e4730e50: 00 00 00 00 00 00 00 00 ........
[ 494.710068] Redzone ffff8804e4730e58: 00 00 00 00 00 00 00 00 ........
[ 494.710068] Padding ffff8804e4730f98: 00 00 00 00 00 00 00 00 ........
[ 494.710068] CPU: 21 PID: 12452 Comm: trinity-c128 Tainted: G B 3.15.0-next-20140616-sasha-00025-g0fd1f7d-dirty #657
[ 494.710068] ffff8804e4730e10 ffff88040b7d3980 ffffffff965140d1 0000000000000001
[ 494.710068] ffff88003680bb80 ffff88040b7d39b0 ffffffff932eac11 ffff8804e4730e60
[ 494.710068] ffff88003680bb80 00000000000000bb ffff8804e4730e10 ffff88040b7d3a00
[ 494.710068] Call Trace:
[ 494.710068] dump_stack (lib/dump_stack.c:52)
[ 494.710068] print_trailer (mm/slub.c:641)
[ 494.710068] check_bytes_and_report (mm/slub.c:680 mm/slub.c:704)
[ 494.710068] check_object (mm/slub.c:804)
[ 494.710068] ? ptlock_alloc (mm/memory.c:3826)
[ 494.742119] alloc_debug_processing (mm/slub.c:1082)
[ 494.742119] __slab_alloc (mm/slub.c:2382 (discriminator 1))
[ 494.742119] ? ptlock_alloc (mm/memory.c:3826)
[ 494.742119] ? get_parent_ip (kernel/sched/core.c:2546)
[ 494.742119] kmem_cache_alloc (mm/slub.c:2442 mm/slub.c:2484 mm/slub.c:2489)
[ 494.742119] ? ptlock_alloc (mm/memory.c:3826)
[ 494.742119] ? pte_alloc_one (arch/x86/mm/pgtable.c:28)
[ 494.742119] ? copy_huge_pmd (./arch/x86/include/asm/paravirt.h:571 ./arch/x86/include/asm/pgtable.h:168 mm/huge_memory.c:867)
[ 494.742119] ptlock_alloc (mm/memory.c:3826)
[ 494.742119] pte_alloc_one (include/linux/mm.h:1464 include/linux/mm.h:1499 arch/x86/mm/pgtable.c:30)
[ 494.742119] copy_huge_pmd (mm/huge_memory.c:858)
[ 494.742119] copy_page_range (mm/memory.c:968 mm/memory.c:998 mm/memory.c:1062)
[ 494.742119] copy_process (kernel/fork.c:460 kernel/fork.c:835 kernel/fork.c:898 kernel/fork.c:1346)
[ 494.742119] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2619)
[ 494.742119] do_fork (kernel/fork.c:1607)
[ 494.742119] ? get_parent_ip (kernel/sched/core.c:2546)
[ 494.742119] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
[ 494.742119] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2564)
[ 494.742119] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
[ 494.742119] SyS_clone (kernel/fork.c:1693)
[ 494.742119] stub_clone (arch/x86/kernel/entry_64.S:637)
[ 494.742119] ? tracesys (arch/x86/kernel/entry_64.S:542)
[ 494.742119] FIX page->ptl: Restoring 0xffff8804e4730e58-0xffff8804e4730e5f=0xbb
[ 494.742119]
[ 494.742119] FIX page->ptl: Marking all objects used

Thanks,
Sasha

2014-06-17 14:09:14

by Christoph Lameter (Ampere)

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On Mon, 16 Jun 2014, Sasha Levin wrote:

> It took some time to hit something here, but I think that the following
> is related:

This related thing looks like someone did a random memset. The SLUB
diagnostic show the object, redzone and padding were overwritten with
zeros.

> [ 494.710068] =============================================================================
> [ 494.710068] BUG page->ptl (Not tainted): Redzone overwritten
> [ 494.710068] -----------------------------------------------------------------------------
> [ 494.710068]
> [ 494.710068] INFO: 0xffff8804e4730e58-0xffff8804e4730e5f. First byte 0x0 instead of 0xbb
> [ 494.710068] INFO: Slab 0xffffea001391cc00 objects=40 used=40 fp=0x (null) flags=0x56fffff80004080
> [ 494.710068] INFO: Object 0xffff8804e4730e10 @offset=3600 fp=0x (null)
> [ 494.710068]
> [ 494.710068] Bytes b4 ffff8804e4730e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e50: 00 00 00 00 00 00 00 00 ........
> [ 494.710068] Redzone ffff8804e4730e58: 00 00 00 00 00 00 00 00 ........
> [ 494.710068] Padding ffff8804e4730f98: 00 00 00 00 00 00 00 00 ........

2014-06-18 03:37:53

by Hugh Dickins

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On Mon, 16 Jun 2014, Sasha Levin wrote:
> On 06/10/2014 12:20 AM, Hugh Dickins wrote:
> > Although there's nothing in the backtrace to implicate it,
> > I think this crash is caused by THP: please try this patch - thanks.
> >
> > [PATCH] mm: let mm_find_pmd fix buggy race with THP fault
...
>
> It took some time to hit something here,

I take that to mean that you were running with the mm_find_pmd patch in,
and it seemed to take a little longer to hit the problem than before?

> but I think that the following is related:

I agree it does look like a variant of what you got before the patch;
but I still think the patch is good, and don't believe it caused this.

It looks as if these symptoms have an additional cause, which that patch
did not even attempt to address. I've looked around, but not found what.

>
> [ 489.152166] INFO: trying to register non-static key.
> [ 489.152166] the code is fine but needs lockdep annotation.
> [ 489.152166] turning off the locking correctness validator.
> [ 489.152166] CPU: 23 PID: 12148 Comm: trinity-c79 Not tainted 3.15.0-next-20140616-sasha-00025-g0fd1f7d-dirty #657
> [ 489.152166] ffff8804dd013000 ffff8804e15a38e8 ffffffff965140d1 0000000000000002
> [ 489.152166] ffffffff9a5ce7c0 ffff8804e15a39e8 ffffffff931ca363 ffff8804e15a3928
> [ 489.152166] 0000000000000000 0000000000000000 ffff8804e4730978 0000000000000001
> [ 489.152166] Call Trace:
> [ 489.152166] dump_stack (lib/dump_stack.c:52)
> [ 489.152166] __lock_acquire (kernel/locking/lockdep.c:743 kernel/locking/lockdep.c:3078)
> [ 489.152166] ? __lock_acquire (kernel/locking/lockdep.c:3189)
> [ 489.152166] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
> [ 489.152166] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
> [ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
> [ 489.152166] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
> [ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
> [ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
> [ 489.152166] __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
> [ 489.152166] try_to_unmap_one (mm/rmap.c:1153)
> [ 489.152166] ? __const_udelay (arch/x86/lib/delay.c:126)
> [ 489.152166] ? __rcu_read_unlock (kernel/rcu/update.c:97)
> [ 489.152166] ? page_lock_anon_vma_read (mm/rmap.c:448)
> [ 489.152166] rmap_walk (mm/rmap.c:1654 mm/rmap.c:1725)
> [ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
> [ 489.152166] try_to_unmap (mm/rmap.c:1547)
> [ 489.152166] ? page_remove_rmap (mm/rmap.c:1144)
> [ 489.152166] ? invalid_migration_vma (mm/rmap.c:1503)
> [ 489.152166] ? try_to_unmap_one (mm/rmap.c:1411)
> [ 489.152166] ? anon_vma_prepare (mm/rmap.c:448)
> [ 489.152166] ? invalid_mkclean_vma (mm/rmap.c:1498)
> [ 489.152166] ? page_get_anon_vma (mm/rmap.c:405)
> [ 489.152166] migrate_pages (mm/migrate.c:913 mm/migrate.c:959 mm/migrate.c:1146)
> [ 489.152166] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
> [ 489.152166] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1594)
> [ 489.152166] migrate_misplaced_page (mm/migrate.c:1754)
> [ 489.152166] __handle_mm_fault (mm/memory.c:3157 mm/memory.c:3207 mm/memory.c:3317)
> [ 489.152166] handle_mm_fault (include/linux/memcontrol.h:151 mm/memory.c:3343)
> [ 489.152166] ? __do_page_fault (arch/x86/mm/fault.c:1163)
> [ 489.152166] __do_page_fault (arch/x86/mm/fault.c:1230)
> [ 489.152166] ? vtime_account_user (kernel/sched/cputime.c:687)
> [ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
> [ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
> [ 489.152166] ? context_tracking_user_exit (kernel/context_tracking.c:184)
> [ 489.152166] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 489.152166] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
> [ 489.152166] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
> [ 489.152166] do_async_page_fault (arch/x86/kernel/kvm.c:264)
> [ 489.152166] async_page_fault (arch/x86/kernel/entry_64.S:1322)

Originally I thought that the trace above and the trace below were
probably unrelated, there being more than five seconds between them.

But then noticed ffff8804e4730978 in the stack contents above, with
ffff8804e4730e10 the object address in the slub diagnostic below.

As Christoph points out, the slub diagnostic shows that something
has been overwriting with zeroes there.

Maybe the whole page (containing the slub-allocated page table lock
being checked by lockdep above) has been overwritten with zeroes.

>From experience a few months ago in another context, I believe that
would issue precisely the cryptic "INFO: trying to register non-static
key. the code is fine but needs lockdep annotation." seen above.

I think I'm going to ignore this one for now, assuming it to be
some randomish slab corruption from a bad patch in linux-next.

(I do take reports on 3.XY-rcZ more seriously than reports on
linux-next; but recognize that you're trying to give advance
warning, and to cover different territory than Dave does.)

If it's reproducible on linux-next in a week or so's time,
please let me know, and I'll worry about it some more then.

Hugh

> [ 494.710068] =============================================================================
> [ 494.710068] BUG page->ptl (Not tainted): Redzone overwritten
> [ 494.710068] -----------------------------------------------------------------------------
> [ 494.710068]
> [ 494.710068] INFO: 0xffff8804e4730e58-0xffff8804e4730e5f. First byte 0x0 instead of 0xbb
> [ 494.710068] INFO: Slab 0xffffea001391cc00 objects=40 used=40 fp=0x (null) flags=0x56fffff80004080
> [ 494.710068] INFO: Object 0xffff8804e4730e10 @offset=3600 fp=0x (null)
> [ 494.710068]
> [ 494.710068] Bytes b4 ffff8804e4730e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> [ 494.710068] Object ffff8804e4730e50: 00 00 00 00 00 00 00 00 ........
> [ 494.710068] Redzone ffff8804e4730e58: 00 00 00 00 00 00 00 00 ........
> [ 494.710068] Padding ffff8804e4730f98: 00 00 00 00 00 00 00 00 ........
> [ 494.710068] CPU: 21 PID: 12452 Comm: trinity-c128 Tainted: G B 3.15.0-next-20140616-sasha-00025-g0fd1f7d-dirty #657
> [ 494.710068] ffff8804e4730e10 ffff88040b7d3980 ffffffff965140d1 0000000000000001
> [ 494.710068] ffff88003680bb80 ffff88040b7d39b0 ffffffff932eac11 ffff8804e4730e60
> [ 494.710068] ffff88003680bb80 00000000000000bb ffff8804e4730e10 ffff88040b7d3a00
> [ 494.710068] Call Trace:
> [ 494.710068] dump_stack (lib/dump_stack.c:52)
> [ 494.710068] print_trailer (mm/slub.c:641)
> [ 494.710068] check_bytes_and_report (mm/slub.c:680 mm/slub.c:704)
> [ 494.710068] check_object (mm/slub.c:804)
> [ 494.710068] ? ptlock_alloc (mm/memory.c:3826)
> [ 494.742119] alloc_debug_processing (mm/slub.c:1082)
> [ 494.742119] __slab_alloc (mm/slub.c:2382 (discriminator 1))
> [ 494.742119] ? ptlock_alloc (mm/memory.c:3826)
> [ 494.742119] ? get_parent_ip (kernel/sched/core.c:2546)
> [ 494.742119] kmem_cache_alloc (mm/slub.c:2442 mm/slub.c:2484 mm/slub.c:2489)
> [ 494.742119] ? ptlock_alloc (mm/memory.c:3826)
> [ 494.742119] ? pte_alloc_one (arch/x86/mm/pgtable.c:28)
> [ 494.742119] ? copy_huge_pmd (./arch/x86/include/asm/paravirt.h:571 ./arch/x86/include/asm/pgtable.h:168 mm/huge_memory.c:867)
> [ 494.742119] ptlock_alloc (mm/memory.c:3826)
> [ 494.742119] pte_alloc_one (include/linux/mm.h:1464 include/linux/mm.h:1499 arch/x86/mm/pgtable.c:30)
> [ 494.742119] copy_huge_pmd (mm/huge_memory.c:858)
> [ 494.742119] copy_page_range (mm/memory.c:968 mm/memory.c:998 mm/memory.c:1062)
> [ 494.742119] copy_process (kernel/fork.c:460 kernel/fork.c:835 kernel/fork.c:898 kernel/fork.c:1346)
> [ 494.742119] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2619)
> [ 494.742119] do_fork (kernel/fork.c:1607)
> [ 494.742119] ? get_parent_ip (kernel/sched/core.c:2546)
> [ 494.742119] ? context_tracking_user_exit (./arch/x86/include/asm/paravirt.h:809 (discriminator 2) kernel/context_tracking.c:184 (discriminator 2))
> [ 494.742119] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2564)
> [ 494.742119] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
> [ 494.742119] SyS_clone (kernel/fork.c:1693)
> [ 494.742119] stub_clone (arch/x86/kernel/entry_64.S:637)
> [ 494.742119] ? tracesys (arch/x86/kernel/entry_64.S:542)
> [ 494.742119] FIX page->ptl: Restoring 0xffff8804e4730e58-0xffff8804e4730e5f=0xbb
> [ 494.742119]
> [ 494.742119] FIX page->ptl: Marking all objects used

2014-06-18 11:24:55

by Sasha Levin

[permalink] [raw]

Subject: Re: mm: NULL ptr deref in remove_migration_pte

On 06/17/2014 11:36 PM, Hugh Dickins wrote:
> On Mon, 16 Jun 2014, Sasha Levin wrote:
>> On 06/10/2014 12:20 AM, Hugh Dickins wrote:
>>> Although there's nothing in the backtrace to implicate it,
>>> I think this crash is caused by THP: please try this patch - thanks.
>>>
>>> [PATCH] mm: let mm_find_pmd fix buggy race with THP fault
> ...
>>
>> It took some time to hit something here,
>
> I take that to mean that you were running with the mm_find_pmd patch in,
> and it seemed to take a little longer to hit the problem than before?

Yes. I've also only seen it once so far.

>> but I think that the following is related:
>
> I agree it does look like a variant of what you got before the patch;
> but I still think the patch is good, and don't believe it caused this.
>
> It looks as if these symptoms have an additional cause, which that patch
> did not even attempt to address. I've looked around, but not found what.
>
>>
>> [ 489.152166] INFO: trying to register non-static key.
>> [ 489.152166] the code is fine but needs lockdep annotation.
>> [ 489.152166] turning off the locking correctness validator.
>> [ 489.152166] CPU: 23 PID: 12148 Comm: trinity-c79 Not tainted 3.15.0-next-20140616-sasha-00025-g0fd1f7d-dirty #657
>> [ 489.152166] ffff8804dd013000 ffff8804e15a38e8 ffffffff965140d1 0000000000000002
>> [ 489.152166] ffffffff9a5ce7c0 ffff8804e15a39e8 ffffffff931ca363 ffff8804e15a3928
>> [ 489.152166] 0000000000000000 0000000000000000 ffff8804e4730978 0000000000000001
>> [ 489.152166] Call Trace:
>> [ 489.152166] dump_stack (lib/dump_stack.c:52)
>> [ 489.152166] __lock_acquire (kernel/locking/lockdep.c:743 kernel/locking/lockdep.c:3078)
>> [ 489.152166] ? __lock_acquire (kernel/locking/lockdep.c:3189)
>> [ 489.152166] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:90 arch/x86/kernel/kvmclock.c:86)
>> [ 489.152166] lock_acquire (./arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602)
>> [ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
>> [ 489.152166] _raw_spin_lock (include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151)
>> [ 489.152166] ? __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
>> [ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
>> [ 489.152166] __page_check_address (include/linux/spinlock.h:303 mm/rmap.c:630)
>> [ 489.152166] try_to_unmap_one (mm/rmap.c:1153)
>> [ 489.152166] ? __const_udelay (arch/x86/lib/delay.c:126)
>> [ 489.152166] ? __rcu_read_unlock (kernel/rcu/update.c:97)
>> [ 489.152166] ? page_lock_anon_vma_read (mm/rmap.c:448)
>> [ 489.152166] rmap_walk (mm/rmap.c:1654 mm/rmap.c:1725)
>> [ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
>> [ 489.152166] try_to_unmap (mm/rmap.c:1547)
>> [ 489.152166] ? page_remove_rmap (mm/rmap.c:1144)
>> [ 489.152166] ? invalid_migration_vma (mm/rmap.c:1503)
>> [ 489.152166] ? try_to_unmap_one (mm/rmap.c:1411)
>> [ 489.152166] ? anon_vma_prepare (mm/rmap.c:448)
>> [ 489.152166] ? invalid_mkclean_vma (mm/rmap.c:1498)
>> [ 489.152166] ? page_get_anon_vma (mm/rmap.c:405)
>> [ 489.152166] migrate_pages (mm/migrate.c:913 mm/migrate.c:959 mm/migrate.c:1146)
>> [ 489.152166] ? _raw_spin_unlock_irq (./arch/x86/include/asm/preempt.h:98 include/linux/spinlock_api_smp.h:169 kernel/locking/spinlock.c:199)
>> [ 489.152166] ? perf_trace_mm_numa_migrate_ratelimit (mm/migrate.c:1594)
>> [ 489.152166] migrate_misplaced_page (mm/migrate.c:1754)
>> [ 489.152166] __handle_mm_fault (mm/memory.c:3157 mm/memory.c:3207 mm/memory.c:3317)
>> [ 489.152166] handle_mm_fault (include/linux/memcontrol.h:151 mm/memory.c:3343)
>> [ 489.152166] ? __do_page_fault (arch/x86/mm/fault.c:1163)
>> [ 489.152166] __do_page_fault (arch/x86/mm/fault.c:1230)
>> [ 489.152166] ? vtime_account_user (kernel/sched/cputime.c:687)
>> [ 489.152166] ? get_parent_ip (kernel/sched/core.c:2546)
>> [ 489.152166] ? preempt_count_sub (kernel/sched/core.c:2602)
>> [ 489.152166] ? context_tracking_user_exit (kernel/context_tracking.c:184)
>> [ 489.152166] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
>> [ 489.152166] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2638 (discriminator 2))
>> [ 489.152166] trace_do_page_fault (arch/x86/mm/fault.c:1313 include/linux/jump_label.h:115 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1314)
>> [ 489.152166] do_async_page_fault (arch/x86/kernel/kvm.c:264)
>> [ 489.152166] async_page_fault (arch/x86/kernel/entry_64.S:1322)
>
> Originally I thought that the trace above and the trace below were
> probably unrelated, there being more than five seconds between them.
>
> But then noticed ffff8804e4730978 in the stack contents above, with
> ffff8804e4730e10 the object address in the slub diagnostic below.
>
> As Christoph points out, the slub diagnostic shows that something
> has been overwriting with zeroes there.
>
> Maybe the whole page (containing the slub-allocated page table lock
> being checked by lockdep above) has been overwritten with zeroes.
>
> From experience a few months ago in another context, I believe that
> would issue precisely the cryptic "INFO: trying to register non-static
> key. the code is fine but needs lockdep annotation." seen above.
>
> I think I'm going to ignore this one for now, assuming it to be
> some randomish slab corruption from a bad patch in linux-next.
>
> (I do take reports on 3.XY-rcZ more seriously than reports on
> linux-next; but recognize that you're trying to give advance
> warning, and to cover different territory than Dave does.)
>
> If it's reproducible on linux-next in a week or so's time,
> please let me know, and I'll worry about it some more then.

Sounds good to me. I'll ping again whenever the issue reproduces. The
way it looks now it won't be any time soon.

Thanks,
Sasha