On Thu, 27 Aug 2020, Jann Horn wrote:
> The preceding patches have ensured that core dumping properly takes the
> mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> its users.
Hi Jann, while the only tears to be shed over losing mmget_still_valid()
will be tears of joy, I think you need to explain why you believe it's
safe to remove the instance in mm/khugepaged.c: which you'll have found
I moved just recently, to cover an extra case (sorry for not Cc'ing you).
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
>
> static inline int khugepaged_test_exit(struct mm_struct *mm)
> {
> - return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> + return atomic_read(&mm->mm_users) == 0;
> }
>
> static bool hugepage_vma_check(struct vm_area_struct *vma,
The movement (which you have correctly followed) was in
bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
but the "pmd .. physical page 0" issue is explained better in its parent
18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
I think your core dumping is still reading the page tables without
holding mmap_lock, so still vulnerable to that extra issue. It won't
be as satisfying as removing all traces of mmget_still_valid(), but
right now I think you should add an mm->core_state check there instead.
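For illustration only, the interim check suggested above might look something like the following userspace mock. The names mock_mm, mock_core_state and mock_khugepaged_test_exit() are hypothetical stand-ins, the sketch assumes core_state is non-NULL while a core dump is in progress, and it is not a patch posted in this thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace mock of the two fields involved; not struct mm_struct. */
struct mock_core_state {
	int unused;
};

struct mock_mm {
	atomic_int mm_users;
	/* Assumption: non-NULL while a core dump is in progress. */
	struct mock_core_state *core_state;
};

/*
 * Treat the mm as "exiting" if it has no users left, or if a core dump
 * is in progress, so the caller backs off in both cases.
 */
static bool mock_khugepaged_test_exit(struct mock_mm *mm)
{
	return atomic_load(&mm->mm_users) == 0 || mm->core_state != NULL;
}
```

A caller would test this before touching page tables and skip the mm whenever it returns true.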
(I do have a better solution in use, but it's a much larger patch, that
will take a lot more effort to get in: checks in pte_offset_map_lock(),
perhaps returning NULL when pmd is transitioning, requiring retry.)

Or maybe it's me who has missed what you're doing instead.

Hugh
On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins wrote:
> On Thu, 27 Aug 2020, Jann Horn wrote:
>
> > The preceding patches have ensured that core dumping properly takes the
> > mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> > its users.
>
> Hi Jann, while the only tears to be shed over losing mmget_still_valid()
> will be tears of joy, I think you need to explain why you believe it's
> safe to remove the instance in mm/khugepaged.c: which you'll have found
> I moved just recently, to cover an extra case (sorry for not Cc'ing you).
>
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
> >
> > static inline int khugepaged_test_exit(struct mm_struct *mm)
> > {
> > - return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> > + return atomic_read(&mm->mm_users) == 0;
> > }
> >
> > static bool hugepage_vma_check(struct vm_area_struct *vma,
>
> The movement (which you have correctly followed) was in
> bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
> but the "pmd .. physical page 0" issue is explained better in its parent
> 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
>
> I think your core dumping is still reading the page tables without
> holding mmap_lock
Where? get_dump_page() takes mmap_lock now:
<https://lore.kernel.org/lkml/[email protected]/>
I don't think there should be any paths into __get_user_pages() left
that don't hold the mmap_lock. Actually, we should probably try
sticking mmap_assert_locked() in there now as a follow-up?
> so still vulnerable to that extra issue. It won't
> be as satisfying as removing all traces of mmget_still_valid(), but
> right now I think you should add an mm->core_state check there instead.
>
> (I do have a better solution in use, but it's a much larger patch, that
> will take a lot more effort to get in: checks in pte_offset_map_lock(),
> perhaps returning NULL when pmd is transitioning, requiring retry.)
Just to clarify: This is an issue only between GUP's software page
table walks when running without mmap_lock and concurrent page table
modifications from hugepage code, correct? Hardware page table walks
and get_user_pages_fast() are fine because they properly load PTEs
atomically and are written to assume that the page tables can change
arbitrarily under them, and the only guarantee is that disabling
interrupts ensures that pages referenced by PTEs can't be freed,
right?
> Or maybe it's me who has missed what you're doing instead.
>
> Hugh
On Mon, 31 Aug 2020, Jann Horn wrote:
> On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins wrote:
> > On Thu, 27 Aug 2020, Jann Horn wrote:
> >
> > > The preceding patches have ensured that core dumping properly takes the
> > > mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
> > > its users.
> >
> > Hi Jann, while the only tears to be shed over losing mmget_still_valid()
> > will be tears of joy, I think you need to explain why you believe it's
> > safe to remove the instance in mm/khugepaged.c: which you'll have found
> > I moved just recently, to cover an extra case (sorry for not Cc'ing you).
> >
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -431,7 +431,7 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
> > >
> > > static inline int khugepaged_test_exit(struct mm_struct *mm)
> > > {
> > > - return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
> > > + return atomic_read(&mm->mm_users) == 0;
> > > }
> > >
> > > static bool hugepage_vma_check(struct vm_area_struct *vma,
> >
> > The movement (which you have correctly followed) was in
> > bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
> > but the "pmd .. physical page 0" issue is explained better in its parent
> > 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
> >
> > I think your core dumping is still reading the page tables without
> > holding mmap_lock
>
> Where? get_dump_page() takes mmap_lock now:
> <https://lore.kernel.org/lkml/[email protected]/>
Right, sorry for the noise, that's precisely what 6/7 is all about,
and properly declares itself there in its Subject - I plead that I
got distracted by the vma snapshot part of the series, and paid too
little attention before bleating.

Looks good to me - thanks.
>
> I don't think there should be any paths into __get_user_pages() left
> that don't hold the mmap_lock. Actually, we should probably try
> sticking mmap_assert_locked() in there now as a follow-up?
Maybe: I haven't given it thought, to be honest.

Hugh
I didn't answer your questions further down, sorry, resuming...
On Mon, 31 Aug 2020, Jann Horn wrote:
> On Mon, Aug 31, 2020 at 8:07 AM Hugh Dickins wrote:
...
> > but the "pmd .. physical page 0" issue is explained better in its parent
> > 18e77600f7a1 ("khugepaged: retract_page_tables() remember to test exit")
...
> Just to clarify: This is an issue only between GUP's software page
Not just GUP's software page table walks: any of our software page
table walks that could occur concurrently (notably, unmapping when
exiting).
> table walks when running without mmap_lock and concurrent page table
> modifications from hugepage code, correct?
Correct.
> Hardware page table walks
Have no problem: the necessary TLB flush is already done.
> and get_user_pages_fast() are fine because they properly load PTEs
> atomically and are written to assume that the page tables can change
> arbitrarily under them, and the only guarantee is that disabling
> interrupts ensures that pages referenced by PTEs can't be freed,
> right?
mm/gup.c has changed a lot since I was familiar with it, and I'm
out of touch with the history of architectural variants. I think
internal_get_user_pages_fast() is now the place to look, and I see

	local_irq_save(flags);
	gup_pgd_range(addr, end, fast_flags, pages, &nr_pinned);
	local_irq_restore(flags);

reassuringly there, which is how x86 always used to do it,
and the dependence of x86 TLB flush on IPIs made it all safe.

Looking at gup_pmd_range(), its operations on pmd (= READ_ONCE(*pmdp))
look correct to me, and where I said "any of our software page table
walks" above, there should be an exception for GUP_fast.

But the other software page table walks are more loosely coded, and
less able to fall back - if gup_pmd_range() catches sight of a fleeting
*pmdp 0, it rightly just gives up immediately on !pmd_present(pmd);
whereas tearing down a userspace mapping needs to wait or retry on
seeing a transient state (but mmap_lock happens to give protection
against that particular transient state).
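As a rough userspace illustration of the snapshot-then-give-up pattern described above (not kernel code), the sketch below reads an entry exactly once and decides everything from that copy. The names model_pmd_t, MODEL_PRESENT, model_read_once() and model_fast_lookup() are hypothetical, and a relaxed C11 atomic load is assumed as a stand-in for READ_ONCE().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
#define MODEL_PRESENT 0x1ULL

typedef _Atomic uint64_t model_pmd_t;

/* Take one snapshot of the entry, in the spirit of pmd = READ_ONCE(*pmdp). */
static uint64_t model_read_once(model_pmd_t *pmdp)
{
	return atomic_load_explicit(pmdp, memory_order_relaxed);
}

/*
 * Lockless-style lookup: every decision is made from the single snapshot,
 * never from a second read of *pmdp. If the snapshot is not present
 * (for example a transient 0), give up at once and let the caller fall
 * back to a slower, locked path.
 */
static bool model_fast_lookup(model_pmd_t *pmdp, uint64_t *table_out)
{
	uint64_t pmd = model_read_once(pmdp);

	if (!(pmd & MODEL_PRESENT))
		return false;

	*table_out = pmd & ~MODEL_PRESENT;
	return true;
}
```

The point of the design is that a walker which can simply bail out tolerates a fleeting empty entry, whereas a walker that must finish its work (such as teardown) has no such fallback and needs a lock or a retry.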
I assume that all the architectures which support GUP_fast have now
been gathered into the same mechanism (perhaps by an otherwise
superfluous IPI on TLB flush?) and are equally safe.

Hugh