pmdp_collapse_flush() should be given the start address at which the huge
page is mapped, haddr: instead it was given addr, which by that point
had been reused as a loop variable and incremented to the end address of
the extent.
Found by source inspection while chasing a hugepage locking bug, which,
it turned out, this did not explain. At first I thought this was very bad;
then saw that all of the page translations that were not flushed would
actually still point to the right pages afterwards, so harmless; then
realized that I know nothing of how different architectures and models
cache intermediate paging structures, so maybe it matters after all -
particularly since the page table concerned is immediately freed.
Much easier to fix than to think about.
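For context, a heavily abbreviated sketch of collapse_pte_mapped_thp() as
it stands (not the exact source): addr is reused as the cursor of the
per-pte loops, so by step 4 it has advanced to haddr + HPAGE_PMD_SIZE,
the end of the extent.

	void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
	{
		unsigned long haddr = addr & HPAGE_PMD_MASK;
		...
		/* step 1: check all mapped PTEs are to the right huge page */
		for (i = 0, addr = haddr, pte = start_pte;
		     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
			...
		}
		...
		/* step 4: collapse pmd */
		ptl = pmd_lock(vma->vm_mm, pmd);
		_pmd = pmdp_collapse_flush(vma, addr, pmd);	/* addr is now the end, not haddr */
		spin_unlock(ptl);
	}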
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v5.4+
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- 5.8-rc7/mm/khugepaged.c 2020-07-26 16:58:02.189038680 -0700
+++ linux/mm/khugepaged.c 2020-08-02 10:48:59.890925896 -0700
@@ -1502,7 +1502,7 @@ void collapse_pte_mapped_thp(struct mm_s
/* step 4: collapse pmd */
ptl = pmd_lock(vma->vm_mm, pmd);
- _pmd = pmdp_collapse_flush(vma, addr, pmd);
+ _pmd = pmdp_collapse_flush(vma, haddr, pmd);
spin_unlock(ptl);
mm_dec_nr_ptes(mm);
pte_free(mm, pmd_pgtable(_pmd));
When retract_page_tables() removes a page table to make way for a huge
pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
pmd lock; but when collapse_pte_mapped_thp() does the same (to handle
the case when the original mmap_write_trylock had failed), only
mmap_write_trylock and pmd lock are held.
That's not enough. One machine has twice crashed under load, with
"BUG: spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the
second crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl
(serving page_referenced() on a file THP, that had found a page table
at *pmd) discovers that the page table page and its lock have already
been freed by the time it comes to unlock.
Follow the example of retract_page_tables(); but only one of huge page
lock or i_mmap_lock_write is needed to secure against this. Choose huge
page lock here: it is the narrower lock, and knowing the hpage earlier
simplifies collapse_pte_mapped_thp().
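In outline, after this patch the pmd is cleared and the page table freed
only while hpage is locked (abbreviated sketch mirroring the diff below):

	hpage = find_lock_page(vma->vm_file->f_mapping,
			       linear_page_index(vma, haddr));
	if (!hpage)
		return;
	...
	/* step 4: collapse pmd */
	ptl = pmd_lock(vma->vm_mm, pmd);
	_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	spin_unlock(ptl);
	mm_dec_nr_ptes(mm);
	pte_free(mm, pmd_pgtable(_pmd));	/* freed under huge page lock */

drop_hpage:
	unlock_page(hpage);
	put_page(hpage);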
Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v5.4+
---
mm/khugepaged.c | 44 +++++++++++++++++++-------------------------
1 file changed, 19 insertions(+), 25 deletions(-)
--- 5.8-rc7/mm/khugepaged.c 2020-07-26 16:58:02.189038680 -0700
+++ linux/mm/khugepaged.c 2020-08-02 10:51:02.127688808 -0700
@@ -1412,7 +1412,7 @@ void collapse_pte_mapped_thp(struct mm_s
{
unsigned long haddr = addr & HPAGE_PMD_MASK;
struct vm_area_struct *vma = find_vma(mm, haddr);
- struct page *hpage = NULL;
+ struct page *hpage;
pte_t *start_pte, *pte;
pmd_t *pmd, _pmd;
spinlock_t *ptl;
@@ -1432,9 +1432,17 @@ void collapse_pte_mapped_thp(struct mm_s
if (!hugepage_vma_check(vma, vma->vm_flags | VM_HUGEPAGE))
return;
+ hpage = find_lock_page(vma->vm_file->f_mapping,
+ linear_page_index(vma, haddr));
+ if (!hpage)
+ return;
+
+ if (!PageHead(hpage))
+ goto drop_hpage;
+
pmd = mm_find_pmd(mm, haddr);
if (!pmd)
- return;
+ goto drop_hpage;
start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
@@ -1453,30 +1461,11 @@ void collapse_pte_mapped_thp(struct mm_s
page = vm_normal_page(vma, addr, *pte);
- if (!page || !PageCompound(page))
- goto abort;
-
- if (!hpage) {
- hpage = compound_head(page);
- /*
- * The mapping of the THP should not change.
- *
- * Note that uprobe, debugger, or MAP_PRIVATE may
- * change the page table, but the new page will
- * not pass PageCompound() check.
- */
- if (WARN_ON(hpage->mapping != vma->vm_file->f_mapping))
- goto abort;
- }
-
/*
- * Confirm the page maps to the correct subpage.
- *
- * Note that uprobe, debugger, or MAP_PRIVATE may change
- * the page table, but the new page will not pass
- * PageCompound() check.
+ * Note that uprobe, debugger, or MAP_PRIVATE may change the
+ * page table, but the new page will not be a subpage of hpage.
*/
- if (WARN_ON(hpage + i != page))
+ if (hpage + i != page)
goto abort;
count++;
}
@@ -1495,7 +1484,7 @@ void collapse_pte_mapped_thp(struct mm_s
pte_unmap_unlock(start_pte, ptl);
/* step 3: set proper refcount and mm_counters. */
- if (hpage) {
+ if (count) {
page_ref_sub(hpage, count);
add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
}
@@ -1506,10 +1495,15 @@ void collapse_pte_mapped_thp(struct mm_s
spin_unlock(ptl);
mm_dec_nr_ptes(mm);
pte_free(mm, pmd_pgtable(_pmd));
+
+drop_hpage:
+ unlock_page(hpage);
+ put_page(hpage);
return;
abort:
pte_unmap_unlock(start_pte, ptl);
+ goto drop_hpage;
}
static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
Only once have I seen this scenario (and forgot even to notice what
forced the eventual crash): a sequence of "BUG: Bad page map" alerts
from vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.
The pte mappings being zapped in this case were supposed to be from a
huge page of ext4 text (but could as well have been shmem): my belief
is that it was racing with collapse_file()'s retract_page_tables(),
found *pmd pointing to a page table, locked it, but *pmd had become
0 by the time start_pte was decided.
In most cases, that possibility is excluded by holding mmap lock;
but exit_mmap() proceeds without mmap lock. Most of what's run by
khugepaged checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate()
do so, for example. But retract_page_tables() did not: fix that
(using an mm variable instead of vma->vm_mm repeatedly).
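For reference, khugepaged_test_exit() at this point in the series is just
the mm_users check (it is extended by a later patch below):

	static inline int khugepaged_test_exit(struct mm_struct *mm)
	{
		return atomic_read(&mm->mm_users) == 0;
	}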
Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v4.8+
---
mm/khugepaged.c | 24 ++++++++++++++----------
1 file changed, 14 insertions(+), 10 deletions(-)
--- 5.8-rc7/mm/khugepaged.c 2020-07-26 16:58:02.189038680 -0700
+++ linux/mm/khugepaged.c 2020-08-02 10:53:37.892660983 -0700
@@ -1538,6 +1538,7 @@ out:
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
+ struct mm_struct *mm;
unsigned long addr;
pmd_t *pmd, _pmd;
@@ -1566,7 +1567,8 @@ static void retract_page_tables(struct a
continue;
if (vma->vm_end < addr + HPAGE_PMD_SIZE)
continue;
- pmd = mm_find_pmd(vma->vm_mm, addr);
+ mm = vma->vm_mm;
+ pmd = mm_find_pmd(mm, addr);
if (!pmd)
continue;
/*
@@ -1576,17 +1578,19 @@ static void retract_page_tables(struct a
* mmap_lock while holding page lock. Fault path does it in
* reverse order. Trylock is a way to avoid deadlock.
*/
- if (mmap_write_trylock(vma->vm_mm)) {
- spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
- /* assume page table is clear */
- _pmd = pmdp_collapse_flush(vma, addr, pmd);
- spin_unlock(ptl);
- mmap_write_unlock(vma->vm_mm);
- mm_dec_nr_ptes(vma->vm_mm);
- pte_free(vma->vm_mm, pmd_pgtable(_pmd));
+ if (mmap_write_trylock(mm)) {
+ if (!khugepaged_test_exit(mm)) {
+ spinlock_t *ptl = pmd_lock(mm, pmd);
+ /* assume page table is clear */
+ _pmd = pmdp_collapse_flush(vma, addr, pmd);
+ spin_unlock(ptl);
+ mm_dec_nr_ptes(mm);
+ pte_free(mm, pmd_pgtable(_pmd));
+ }
+ mmap_write_unlock(mm);
} else {
/* Try again later */
- khugepaged_add_pte_mapped_thp(vma->vm_mm, addr);
+ khugepaged_add_pte_mapped_thp(mm, addr);
}
}
i_mmap_unlock_write(mapping);
Move collapse_huge_page()'s mmget_still_valid() check into
khugepaged_test_exit() itself. collapse_huge_page() is used for anon
THP only, and earned its mmget_still_valid() check because it inserts
a huge pmd entry in place of the page table's pmd entry; whereas
collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
merely clears the page table's pmd entry. But core dumping without
mmap lock must have been just as open as exit_mmap() was to mistaking a
racily cleared pmd entry for a page table at physical page 0. And we
certainly have no interest in mapping as a THP once dumping core.
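For reference, and not part of the patch: mmget_still_valid() is
essentially just the core_state test, roughly

	static inline bool mmget_still_valid(struct mm_struct *mm)
	{
		return likely(!mm->core_state);
	}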
Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v4.8+
---
Note on stable backporting: the coredump fix was backported as far as
v3.16, but this extension is only needed where shmem or file THP is
supported.
mm/khugepaged.c | 5 +----
1 file changed, 1 insertion(+), 4 deletions(-)
--- 5.8-rc7/mm/khugepaged.c 2020-07-26 16:58:02.189038680 -0700
+++ linux/mm/khugepaged.c 2020-08-02 10:56:11.105617241 -0700
@@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(stru
static inline int khugepaged_test_exit(struct mm_struct *mm)
{
- return atomic_read(&mm->mm_users) == 0;
+ return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
}
static bool hugepage_vma_check(struct vm_area_struct *vma,
@@ -1100,9 +1100,6 @@ static void collapse_huge_page(struct mm
* handled by the anon_vma lock + PG_lock.
*/
mmap_write_lock(mm);
- result = SCAN_ANY_PROCESS;
- if (!mmget_still_valid(mm))
- goto out;
result = hugepage_vma_revalidate(mm, address, &vma);
if (result)
goto out;
On Sun, Aug 02, 2020 at 12:12:42PM -0700, Hugh Dickins wrote:
> pmdp_collapse_flush() should be given the start address at which the huge
> page is mapped, haddr: instead it was given addr, which by that point
> had been reused as a loop variable and incremented to the end address
> of the extent.
>
> Found by source inspection while chasing a hugepage locking bug, which,
> it turned out, this did not explain. At first I thought this was very
> bad;
> then saw that all of the page translations that were not flushed would
> actually still point to the right pages afterwards, so harmless; then
> realized that I know nothing of how different architectures and models
> cache intermediate paging structures, so maybe it matters after all -
> particularly since the page table concerned is immediately freed.
>
> Much easier to fix than to think about.
>
> Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected] # v5.4+
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Sun, Aug 02, 2020 at 12:15:24PM -0700, Hugh Dickins wrote:
> When retract_page_tables() removes a page table to make way for a huge
> pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
> pmd lock; but when collapse_pte_mapped_thp() does the same (to handle
> the case when the original mmap_write_trylock had failed), only
> mmap_write_trylock and pmd lock are held.
>
> That's not enough. One machine has twice crashed under load, with
> "BUG: spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the
> second crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl
> (serving page_referenced() on a file THP, that had found a page table
> at *pmd) discovers that the page table page and its lock have already
> been freed by the time it comes to unlock.
>
> Follow the example of retract_page_tables(); but only one of huge page
> lock or i_mmap_lock_write is needed to secure against this. Choose huge
> page lock here: it is the narrower lock, and knowing the hpage earlier
> simplifies collapse_pte_mapped_thp().
>
> Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected] # v5.4+
We could avoid the page cache lookup by locking the page on the first
valid PTE and rechecking page->mapping, but this way is cleaner.
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
On Sun, Aug 02, 2020 at 12:16:53PM -0700, Hugh Dickins wrote:
> Only once have I seen this scenario (and forgot even to notice what
> forced the eventual crash): a sequence of "BUG: Bad page map" alerts
> from vm_normal_page(), from zap_pte_range() servicing exit_mmap();
> pmd:00000000, pte values corresponding to data in physical page 0.
>
> The pte mappings being zapped in this case were supposed to be from a
> huge page of ext4 text (but could as well have been shmem): my belief
> is that it was racing with collapse_file()'s retract_page_tables(),
> found *pmd pointing to a page table, locked it, but *pmd had become
> 0 by the time start_pte was decided.
>
> In most cases, that possibility is excluded by holding mmap lock;
> but exit_mmap() proceeds without mmap lock. Most of what's run by
> khugepaged checks khugepaged_test_exit() after acquiring mmap lock:
> khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate()
> do so, for example. But retract_page_tables() did not: fix that
> (using an mm variable instead of vma->vm_mm repeatedly).
Hm. I'm not sure I follow. vma->vm_mm has to be valid as long as we hold
i_mmap lock, no? Unlinking a VMA requires it.
--
Kirill A. Shutemov
On Mon, 3 Aug 2020, Kirill A. Shutemov wrote:
> On Sun, Aug 02, 2020 at 12:16:53PM -0700, Hugh Dickins wrote:
> > Only once have I seen this scenario (and forgot even to notice what
> > forced the eventual crash): a sequence of "BUG: Bad page map" alerts
> > from vm_normal_page(), from zap_pte_range() servicing exit_mmap();
> > pmd:00000000, pte values corresponding to data in physical page 0.
> >
> > The pte mappings being zapped in this case were supposed to be from a
> > huge page of ext4 text (but could as well have been shmem): my belief
> > is that it was racing with collapse_file()'s retract_page_tables(),
> > found *pmd pointing to a page table, locked it, but *pmd had become
> > 0 by the time start_pte was decided.
> >
> > In most cases, that possibility is excluded by holding mmap lock;
> > but exit_mmap() proceeds without mmap lock. Most of what's run by
> > khugepaged checks khugepaged_test_exit() after acquiring mmap lock:
> > khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate()
> > do so, for example. But retract_page_tables() did not: fix that
> > (using an mm variable instead of vma->vm_mm repeatedly).
>
> Hm. I'm not sure I follow. vma->vm_mm has to be valid as long as we hold
> i_mmap lock, no? Unlinking a VMA requires it.
Ah, my wording is misleading, yes. That comment
"(using an mm variable instead of vma->vm_mm repeatedly)"
was nothing more than a note that the patch is bigger than it could be,
because I decided to use an mm variable instead of vma->vm_mm repeatedly.
But it looks as if I'm saying there used to be a need for READ_ONCE() or
something, and that by using the mm variable I was fixing the problem.
No, sorry: delete that line now that the point is made: the mm variable
is just a patch detail, it's not important.
The fix (as the subject suggested) is for retract_page_tables() to check
khugepaged_test_exit(), after acquiring mmap lock, before doing anything
to the page table. Getting the mmap lock serializes with __mmput(),
which briefly takes and drops it in __khugepaged_exit(); then the
khugepaged_test_exit() check on mm_users makes sure we don't touch the
page table once exit_mmap() might reach it, since exit_mmap() will be
proceeding without mmap lock, not expecting anyone to be racing with it.
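In code terms, the protocol is roughly this (sketch matching what the
retract_page_tables() patch above now does):

	if (mmap_write_trylock(mm)) {
		/*
		 * Taking mmap lock serializes with __mmput(), which takes
		 * and drops it briefly in __khugepaged_exit() before
		 * reaching exit_mmap().
		 */
		if (!khugepaged_test_exit(mm)) {
			/*
			 * The check on mm_users ensures we never touch the
			 * page table once exit_mmap() might reach it.
			 */
			/* ... clear the pmd and free the page table ... */
		}
		mmap_write_unlock(mm);
	} else {
		/* try again later */
	}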
(I devised that protocol for ksmd, then Andrea adopted it for khugepaged:
back then it was important for these daemons to have a hold on the mm,
without an actual reference to mm_users, because that would prevent the
OOM killer from reaching exit_mmap(). Nowadays with the OOM reaper, it's
probably less crucial to avoid mm_users, but I think still worthwhile.)
Thanks a lot for looking at these patches so quickly,
Hugh
On Sun, Aug 02, 2020 at 05:35:23PM -0700, Hugh Dickins wrote:
> On Mon, 3 Aug 2020, Kirill A. Shutemov wrote:
> > On Sun, Aug 02, 2020 at 12:16:53PM -0700, Hugh Dickins wrote:
> > > Only once have I seen this scenario (and forgot even to notice what
> > > forced the eventual crash): a sequence of "BUG: Bad page map" alerts
> > > from vm_normal_page(), from zap_pte_range() servicing exit_mmap();
> > > pmd:00000000, pte values corresponding to data in physical page 0.
> > >
> > > The pte mappings being zapped in this case were supposed to be from a
> > > huge page of ext4 text (but could as well have been shmem): my belief
> > > is that it was racing with collapse_file()'s retract_page_tables(),
> > > found *pmd pointing to a page table, locked it, but *pmd had become
> > > 0 by the time start_pte was decided.
> > >
> > > In most cases, that possibility is excluded by holding mmap lock;
> > > but exit_mmap() proceeds without mmap lock. Most of what's run by
> > > khugepaged checks khugepaged_test_exit() after acquiring mmap lock:
> > > khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate()
> > > do so, for example. But retract_page_tables() did not: fix that
> > > (using an mm variable instead of vma->vm_mm repeatedly).
> >
> > Hm. I'm not sure I follow. vma->vm_mm has to be valid as long as we hold
> > i_mmap lock, no? Unlinking a VMA requires it.
>
> Ah, my wording is misleading, yes. That comment
> "(using an mm variable instead of vma->vm_mm repeatedly)"
> was nothing more than a note that the patch is bigger than it could be,
> because I decided to use an mm variable instead of vma->vm_mm repeatedly.
> But it looks as if I'm saying there used to be a need for READ_ONCE() or
> something, and that by using the mm variable I was fixing the problem.
>
> No, sorry: delete that line now that the point is made: the mm variable
> is just a patch detail, it's not important.
>
> The fix (as the subject suggested) is for retract_page_tables() to check
> khugepaged_test_exit(), after acquiring mmap lock, before doing anything
> to the page table. Getting the mmap lock serializes with __mmput(),
> which briefly takes and drops it in __khugepaged_exit(); then the
> khugepaged_test_exit() check on mm_users makes sure we don't touch the
> page table once exit_mmap() might reach it, since exit_mmap() will be
> proceeding without mmap lock, not expecting anyone to be racing with it.
Okay, makes sense.
Acked-by: Kirill A. Shutemov <[email protected]>
--
Kirill A. Shutemov
syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
__khugepaged_enter(): yes, when one thread is about to dump core, has set
core_state, and is waiting for others, another might do something calling
__khugepaged_enter(), which now crashes because I lumped the core_state
test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
think it's best to lump them together, so just in this exceptional case,
check mm->mm_users directly instead of khugepaged_test_exit().
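That is, the two conditions now differ exactly in the core dump case
(sketch for clarity, drawn from the diffs in this series):

	/* khugepaged_test_exit() since the previous patch: */
	static inline int khugepaged_test_exit(struct mm_struct *mm)
	{
		/* true when exiting, or when a core dump has set core_state */
		return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
	}

	/* but __khugepaged_enter() should only complain about the exit case: */
	VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);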
Reported-by: syzbot <[email protected]>
Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected] # v4.8+
---
mm/khugepaged.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- v5.9-rc/mm/khugepaged.c 2020-08-12 19:46:50.867196579 -0700
+++ linux/mm/khugepaged.c 2020-08-14 14:24:32.739457309 -0700
@@ -466,7 +466,7 @@ int __khugepaged_enter(struct mm_struct
return -ENOMEM;
/* __khugepaged_exit() must not run from under us */
- VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
+ VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
free_mm_slot(mm_slot);
return 0;
On Fri, Aug 14, 2020 at 3:13 PM Hugh Dickins <[email protected]> wrote:
>
> syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
> __khugepaged_enter(): yes, when one thread is about to dump core, has set
> core_state, and is waiting for others, another might do something calling
> __khugepaged_enter(), which now crashes because I lumped the core_state
> test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
> think it's best to lump them together, so just in this exceptional case,
> check mm->mm_users directly instead of khugepaged_test_exit().
>
> Reported-by: syzbot <[email protected]>
> Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected] # v4.8+
Acked-by: Yang Shi <[email protected]>
> ---
>
> mm/khugepaged.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- v5.9-rc/mm/khugepaged.c 2020-08-12 19:46:50.867196579 -0700
> +++ linux/mm/khugepaged.c 2020-08-14 14:24:32.739457309 -0700
> @@ -466,7 +466,7 @@ int __khugepaged_enter(struct mm_struct
> return -ENOMEM;
>
> /* __khugepaged_exit() must not run from under us */
> - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
> + VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);
> if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
> free_mm_slot(mm_slot);
> return 0;
>