2021-02-25 21:08:55

by Matthew Wilcox

[permalink] [raw]
Subject: Freeing page tables through RCU

In order to walk the page tables without the mmap semaphore, it must
be possible to prevent them from being freed and reused (eg if munmap()
races with viewing /proc/$pid/smaps).

There is various commentary within the mm on how to prevent this. One way
is to disable interrupts, relying on that to block rcu_sched or IPIs.
I don't think the RT people are terribly happy about reading a proc file
disabling interrupts, and it doesn't work for architectures that free
page tables directly instead of batching them into an rcu_sched (because
the IPI may not be sent to this CPU if the task has never run on it).

See "Fast GUP" in mm/gup.c

Ideally, I'd like rcu_read_lock() to delay page table reuse. This is
close to trivial for architectures which use entire pages or multiple
pages for levels of their page tables as we can use the rcu_head embedded
in struct page to queue the page for RCU.

s390 and powerpc are the only two architectures I know of that have
levels of their page table that are smaller than their PAGE_SIZE.
I'd like to discuss options. There may be a complicated scheme that
allows partial pages to be freed via RCU, but I have something simpler
in mind. For powerpc in particular, it can have a PAGE_SIZE of 64kB
and then the MMU wants to see 4kB entries in the PMD. I suggest that
instead of allocating each 4kB entry individually, we allocate a 64kB
page and fill in 16 consecutive PMDs. This could cost a bit more memory
(although if you've asked for a CONFIG_PAGE_SIZE of 64kB, you presumably
don't care too much about it), but it'll make future page faults cheaper
(as the PMDs will already be present, assuming you have good locality
of reference).

I'd like to hear better ideas than this.


2021-02-26 14:47:03

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Freeing page tables through RCU

On Thu, Feb 25, 2021 at 08:58:20PM +0000, Matthew Wilcox wrote:

> I'd like to hear better ideas than this.

You didn't like my suggestion to put a sleepable lock around the
freeing of page tables during flushing?

I still don't see how you convert the sleepable page walkers to use
rcu??

Jason

2021-02-26 16:08:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Freeing page tables through RCU

On Fri, Feb 26, 2021 at 10:42:00AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 25, 2021 at 08:58:20PM +0000, Matthew Wilcox wrote:
>
> > I'd like to hear better ideas than this.
>
> You didn't like my suggestion to put a sleepable lock around the
> freeing of page tables during flushing?
>
> I still don't see how you convert the sleepable page walkers to use
> rcu??

I don't want to convert the sleepable ones to use RCU ... I want to
convert the non-sleeping ones to use RCU. A page_table_free_lock might
work, but it might have its own problems later (eg a sleeping lock can't
be acquired under RCU or spinlock, and it can't be a spinlock because
it'd have to be held while we wait for IPIs).

I think it would solve my immediate problem, and I wonder if it might
solve some other problems ...

2021-02-26 16:15:14

by Gerald Schaefer

[permalink] [raw]
Subject: Re: Freeing page tables through RCU

On Thu, 25 Feb 2021 20:58:20 +0000
Matthew Wilcox <[email protected]> wrote:

> In order to walk the page tables without the mmap semaphore, it must
> be possible to prevent them from being freed and reused (eg if munmap()
> races with viewing /proc/$pid/smaps).
>
> There is various commentary within the mm on how to prevent this. One way
> is to disable interrupts, relying on that to block rcu_sched or IPIs.
> I don't think the RT people are terribly happy about reading a proc file
> disabling interrupts, and it doesn't work for architectures that free
> page tables directly instead of batching them into an rcu_sched (because
> the IPI may not be sent to this CPU if the task has never run on it).
>
> See "Fast GUP" in mm/gup.c
>
> Ideally, I'd like rcu_read_lock() to delay page table reuse. This is
> close to trivial for architectures which use entire pages or multiple
> pages for levels of their page tables as we can use the rcu_head embedded
> in struct page to queue the page for RCU.
>
> s390 and powerpc are the only two architectures I know of that have
> levels of their page table that are smaller than their PAGE_SIZE.
> I'd like to discuss options. There may be a complicated scheme that
> allows partial pages to be freed via RCU, but I have something simpler
> in mind. For powerpc in particular, it can have a PAGE_SIZE of 64kB
> and then the MMU wants to see 4kB entries in the PMD. I suggest that
> instead of allocating each 4kB entry individually, we allocate a 64kB
> page and fill in 16 consecutive PMDs. This could cost a bit more memory
> (although if you've asked for a CONFIG_PAGE_SIZE of 64kB, you presumably
> don't care too much about it), but it'll make future page faults cheaper
> (as the PMDs will already be present, assuming you have good locality
> of reference).
>
> I'd like to hear better ideas than this.

Some background on the situation for s390: The architecture defines
an 8 bit pagetable index, so we have 256 entries in a 2 KB pagetable,
but PAGE_SIZE is 4 KB. pte_alloc(_one) will use alloc_page() to allocate
a full 4 KB page, and then do some housekeeping to maintain a per-mm list
of such 4 KB pages, which will contain either one or two 2 KB pagetable
fragments.

This is also the reason why pgtable_t on s390 is not pointing to the
struct page of the (4 KB) page containing a 2 KB pagetable fragment, but
rather to the 2 KB pagetable itself.

I see at least two issues here, with using rcu_head embedded in the
struct page (for a 4 KB page):

1) There might be two 2 KB pagetables present in that 4 KB page,
and the rcu_head would affect both. Not sure if this would really be
a problem, because we already have a similar situation with the split
ptlock embedded in struct page, which also might lock two 2 KB pagetables,
i.e. more than necessary. It still is far less "over-locking" than
using mm->page_table_lock, and the move_pte() code e.g. takes care
to avoid a deadlock if src and dst ptlocks happen to be on the
same page.

So, a similar "over-locking" might also be possible and acceptable
for the rcu_head approach, but I do not really understand if that could
have some deadlock or other unwanted side-effects.

2) The "housekeeping" of our 2 KB pagetable fragments uses page->lru
to maintain the per-mm list. It also (mis)uses page->_refcount to mark
which 2 KB half is used/free, but that should not be an issue I guess.
Using page->lru will be an issue though. IIUC, then page->rcu_head will
overlay with page->lru, so using page->rcu_head for pagetable pages
on s390 would conflict with our page->lru usage for such pagetable pages.

I do not really see how that could be fixed, maybe we could find and
re-use other struct page members for our 2 KB fragment list. Also, for
kvm, there seem to be even more users of page->lru for pagetable pages,
in arch/s390/mm/gmap.c. Not sure though if those would actually also
affect "regular" pagetable walks, or if they are somehow independent.
But if we'd find some new list home for the 2 KB fragments, then that
could probably also be used for the gmap stuff.

2021-02-26 16:25:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: Freeing page tables through RCU

On Fri, Feb 26, 2021 at 04:03:54PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 26, 2021 at 10:42:00AM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 25, 2021 at 08:58:20PM +0000, Matthew Wilcox wrote:
> >
> > > I'd like to hear better ideas than this.
> >
> > You didn't like my suggestion to put a sleepable lock around the
> > freeing of page tables during flushing?
> >
> > I still don't see how you convert the sleepable page walkers to use
> > rcu??
>
> I don't want to convert the sleepable ones to use RCU ... I want to
> convert the non-sleeping ones to use RCU.

Why? Convert them to use the normal page table spinlocks?

It makes everything much more understandable if it is locked properly.

The lockless walks should be reserved for special places because they
are actually hard to do properly.

> A page_table_free_lock might work, but it might have its own
> problems later (eg a sleeping lock can't be acquired under RCU or
> spinlock, and it can't be a spinlock because it'd have to be held
> while we wait for IPIs).

The mmap_sem today is serving the function of the page_table_free_lock
idea, so I think a sleepable lock is already proven to work?

Jason