2013-10-29 17:38:21

by Mel Gorman

[permalink] [raw]
Subject: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
currently have a problem with the page lock being contended for unbounded
periods of time during futex operations. The three of us discussed the
possibiltity that the page lock is unnecessary in this case because we are
not concerned with the usual races with reclaim and page cache updates. For
anonymous pages, the associated futex object is the mm_struct which does
not require the page lock. For inodes, we should be able to check under
RCU read lock if the page mapping is still valid to take a reference to
the inode. This just leaves one rare race that requires the page lock
in the slow path. This patch does not completely eliminate the page lock
but it should reduce contention in the majority of cases.

Patch boots and futextest did not explode but I did no comparison
performance tests. Thomas, do you have details of the workload that
drove you to examine this problem? Alternatively, can you test it and
see does it help you? I added Chris to the To list because he mentioned
that some filesystems might already be doing tricks similar to this
patch that are worth copying.

Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
Not-yet-signed-off-by: Mel Gorman <[email protected]>
---
kernel/futex.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 81 insertions(+), 8 deletions(-)

diff --git a/kernel/futex.c b/kernel/futex.c
index c3a1a55..a918358 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -239,6 +239,7 @@ static int
get_futex_key(u32 __user *uaddr, int fshared, union futex_key *key, int rw)
{
unsigned long address = (unsigned long)uaddr;
+ struct address_space *mapping;
struct mm_struct *mm = current->mm;
struct page *page, *page_head;
int err, ro = 0;
@@ -318,10 +319,20 @@ again:
}
#endif

- lock_page(page_head);
+ /*
+ * The treatment of mapping from this point on is critical. The page
+ * lock protects many things but in this context the page lock
+ * stabilises mapping, prevents inode freeing in the shared
+ * file-backed region case and guards against movement to swap cache.
+ * Strictly speaking the page lock is not needed in all cases being
+ * considered here and page lock forces unnecessarily serialisation.
+ * From this point on, mapping will be reverified if necessary and
+ * page lock will be acquired only if it is unavoiable.
+ */
+ mapping = ACCESS_ONCE(page_head->mapping);

/*
- * If page_head->mapping is NULL, then it cannot be a PageAnon
+ * If mapping is NULL, then it cannot be a PageAnon
* page; but it might be the ZERO_PAGE or in the gate area or
* in a special mapping (all cases which we are happy to fail);
* or it may have been a good file page when get_user_pages_fast
@@ -335,10 +346,22 @@ again:
* shmem_writepage move it from filecache to swapcache beneath us:
* an unlikely race, but we do need to retry for page_head->mapping.
*/
- if (!page_head->mapping) {
- int shmem_swizzled = PageSwapCache(page_head);
+ if (!mapping) {
+ int shmem_swizzled;
+
+ /*
+ * Page lock is required to identify which special case above
+ * applies. If this is really a shmem page then the page lock
+ * will prevent unexpected transitions.
+ */
+ lock_page(page_head);
+ mapping = page_head->mapping;
+ shmem_swizzled = PageSwapCache(page_head);
unlock_page(page_head);
+
put_page(page_head);
+ WARN_ON_ONCE(mapping);
+
if (shmem_swizzled)
goto again;
return -EFAULT;
@@ -347,6 +370,11 @@ again:
/*
* Private mappings are handled in a simple way.
*
+ * If the futex key is stored on an anonymous page then the associated
+ * object is the mm which is implicitly pinned by the calling process.
+ * Page lock is unnecessary to stabilise page->mapping in this case and
+ * is not taken.
+ *
* NOTE: When userspace waits on a MAP_SHARED mapping, even if
* it's a read-only handle, it's expected that futexes attach to
* the object not the particular process.
@@ -364,16 +392,61 @@ again:
key->both.offset |= FUT_OFF_MMSHARED; /* ref taken on mm */
key->private.mm = mm;
key->private.address = address;
+
+ get_futex_key_refs(key);
} else {
+ struct inode *inode;
+
+ /*
+ * The associtated futex object in this case is the inode and
+ * the page->mapping must be traversed. Ordinarily this should
+ * be stabilised under page lock but it's not strictly
+ * necessary in this case as we just want to pin the inode, not
+ * update radix tree or anything like that.
+ *
+ * The RCU read lock is taken as the inode is finally freed
+ * under RCU. If the mapping still matches expectations then the
+ * mapping->host can be safely accessed as being a valid inode.
+ */
+ rcu_read_lock();
+ if (page->mapping != mapping || !mapping->host) {
+ rcu_read_unlock();
+ put_page(page_head);
+ goto again;
+ }
+ inode = mapping->host;
+
+ /*
+ * Take a reference unless it is about to be freed. Previously
+ * this reference was taken by ihold under the page lock
+ * pinning the inode in place so i_lock was unnecessary. The
+ * only way for this check to fail is if the inode was
+ * truncated in parallel so warn for now if this happens.
+ *
+ * TODO: VFS and/or filesystem people should review this check
+ * and see if there is a safer or more reliable way to do this.
+ */
+ if (WARN_ON(!atomic_inc_not_zero(&inode->i_count))) {
+ rcu_read_unlock();
+ put_page(page_head);
+ goto again;
+ }
+
+ /* Should be impossible but lets be paranoid for now */
+ if (WARN_ON(inode->i_mapping != mapping)) {
+ rcu_read_unlock();
+ iput(inode);
+ put_page(page_head);
+ goto again;
+ }
+
key->both.offset |= FUT_OFF_INODE; /* inode-based key */
- key->shared.inode = page_head->mapping->host;
+ key->shared.inode = inode;
key->shared.pgoff = basepage_index(page);
+ rcu_read_unlock();
}

- get_futex_key_refs(key);
-
out:
- unlock_page(page_head);
put_page(page_head);
return err;
}


2013-10-29 18:48:36

by Chris Mason

[permalink] [raw]
Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

Quoting Mel Gorman (2013-10-29 13:38:14)
> Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
> currently have a problem with the page lock being contended for unbounded
> periods of time during futex operations. The three of us discussed the
> possibiltity that the page lock is unnecessary in this case because we are
> not concerned with the usual races with reclaim and page cache updates. For
> anonymous pages, the associated futex object is the mm_struct which does
> not require the page lock. For inodes, we should be able to check under
> RCU read lock if the page mapping is still valid to take a reference to
> the inode. This just leaves one rare race that requires the page lock
> in the slow path. This patch does not completely eliminate the page lock
> but it should reduce contention in the majority of cases.
>
> Patch boots and futextest did not explode but I did no comparison
> performance tests. Thomas, do you have details of the workload that
> drove you to examine this problem? Alternatively, can you test it and
> see does it help you? I added Chris to the To list because he mentioned
> that some filesystems might already be doing tricks similar to this
> patch that are worth copying.

Unfortunately, all the special cases I see in the filesystems either
have an inode ref or are trylocking the page to safety.

XFS is a special case because they have their own inode cache, but by my
reading they are still using i_count and free by rcu.

The iput in here is a little tricky:

>
> + /* Should be impossible but lets be paranoid for now */
> + if (WARN_ON(inode->i_mapping != mapping)) {
> + rcu_read_unlock();
> + iput(inode);
> + put_page(page_head);
> + goto again;
> + }
> +

Once you call iput, you add the potential to call the filesystem unlink
operation if i_nlink had gone to zero. This shouldn't be a problem
since you've dropped the rcu lock, but just for fun I'd move the
put_page up a line.

Or, change it to a BUG_ON instead, it really should be impossible.

-chris

2013-10-30 08:45:43

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

On Tue, 29 Oct 2013, Mel Gorman wrote:

> Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
> currently have a problem with the page lock being contended for unbounded
> periods of time during futex operations. The three of us discussed the
> possibiltity that the page lock is unnecessary in this case because we are
> not concerned with the usual races with reclaim and page cache updates. For
> anonymous pages, the associated futex object is the mm_struct which does
> not require the page lock. For inodes, we should be able to check under
> RCU read lock if the page mapping is still valid to take a reference to
> the inode. This just leaves one rare race that requires the page lock
> in the slow path. This patch does not completely eliminate the page lock
> but it should reduce contention in the majority of cases.
>
> Patch boots and futextest did not explode but I did no comparison
> performance tests. Thomas, do you have details of the workload that
> drove you to examine this problem? Alternatively, can you test it and

The scenario is simple. All you need is a PSHARED futex.

Task A
get_futex_key()
lock_page()

---> preemption

Now any other task trying to lock that page will have to wait until
task A gets scheduled back in, which is an unbound time.

It takes quite some time to reproduce, but I'll ask the people who
have that workload to give it a try.

Thanks,

tglx

2013-10-30 08:57:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

On Tue, Oct 29, 2013 at 02:48:27PM -0400, Chris Mason wrote:
> > + /* Should be impossible but lets be paranoid for now */
> > + if (WARN_ON(inode->i_mapping != mapping)) {
> > + rcu_read_unlock();
> > + iput(inode);
> > + put_page(page_head);
> > + goto again;
> > + }
> > +
>
> Once you call iput, you add the potential to call the filesystem unlink
> operation if i_nlink had gone to zero. This shouldn't be a problem
> since you've dropped the rcu lock, but just for fun I'd move the
> put_page up a line.
>
> Or, change it to a BUG_ON instead, it really should be impossible.

So I still meant to have a look at the RCU freeing of inodes etc.. but
that comparison was to guard against inode reuse. I don't know if that
actually happens, the inode free path is a tad longer than is trivially
understood.

But if an inode would be put on a free list and reused the
atomic_inc_not_zero() could inc on a different inode than the one we
wanted and thus we need to validate we indeed got the object we set out
to acquire.

Now if its guaranteed that once an inode's refcount hits zero it will be
freed the above test is indeed superfluous and we can do the BUG_ON as
you suggest.

2013-10-30 09:26:16

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

On Wed, Oct 30, 2013 at 09:45:31AM +0100, Thomas Gleixner wrote:
> On Tue, 29 Oct 2013, Mel Gorman wrote:
>
> > Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
> > currently have a problem with the page lock being contended for unbounded
> > periods of time during futex operations. The three of us discussed the
> > possibiltity that the page lock is unnecessary in this case because we are
> > not concerned with the usual races with reclaim and page cache updates. For
> > anonymous pages, the associated futex object is the mm_struct which does
> > not require the page lock. For inodes, we should be able to check under
> > RCU read lock if the page mapping is still valid to take a reference to
> > the inode. This just leaves one rare race that requires the page lock
> > in the slow path. This patch does not completely eliminate the page lock
> > but it should reduce contention in the majority of cases.
> >
> > Patch boots and futextest did not explode but I did no comparison
> > performance tests. Thomas, do you have details of the workload that
> > drove you to examine this problem? Alternatively, can you test it and
>
> The scenario is simple. All you need is a PSHARED futex.
>
> Task A
> get_futex_key()
> lock_page()
>
> ---> preemption
>

Ok, so scaling numbers of threads doing something like multiple
consumers using FUTEX_WAIT and then all being woken should trigger it.
Should not be that hard to device a test if something in futextest does
not do it alreayd.

> Now any other task trying to lock that page will have to wait until
> task A gets scheduled back in, which is an unbound time.
>
> It takes quite some time to reproduce, but I'll ask the people who
> have that workload to give it a try.
>

Do please. I'd rather not sink time into trying to reproduce a hypothetical
problem when people who are already familiar with it can provide better
data. If it stays quiet for too long then I'll either use an existing
futextest, extend futextest or conclude that the problem was not major
in the first place if the users cannot be arsed testing a patch.

Thanks.

--
Mel Gorman
SUSE Labs

2014-02-11 15:51:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

On Wed, 30 Oct 2013, Mel Gorman wrote:
> On Wed, Oct 30, 2013 at 09:45:31AM +0100, Thomas Gleixner wrote:
> > On Tue, 29 Oct 2013, Mel Gorman wrote:
> >
> > > Thomas Gleixner and Peter Zijlstra discussed off-list that real-time users
> > > currently have a problem with the page lock being contended for unbounded
> > > periods of time during futex operations. The three of us discussed the
> > > possibiltity that the page lock is unnecessary in this case because we are
> > > not concerned with the usual races with reclaim and page cache updates. For
> > > anonymous pages, the associated futex object is the mm_struct which does
> > > not require the page lock. For inodes, we should be able to check under
> > > RCU read lock if the page mapping is still valid to take a reference to
> > > the inode. This just leaves one rare race that requires the page lock
> > > in the slow path. This patch does not completely eliminate the page lock
> > > but it should reduce contention in the majority of cases.
> > >
> > > Patch boots and futextest did not explode but I did no comparison
> > > performance tests. Thomas, do you have details of the workload that
> > > drove you to examine this problem? Alternatively, can you test it and
> >
> > The scenario is simple. All you need is a PSHARED futex.
> >
> > Task A
> > get_futex_key()
> > lock_page()
> >
> > ---> preemption
> >
>
> Ok, so scaling numbers of threads doing something like multiple
> consumers using FUTEX_WAIT and then all being woken should trigger it.
> Should not be that hard to device a test if something in futextest does
> not do it alreayd.
>
> > Now any other task trying to lock that page will have to wait until
> > task A gets scheduled back in, which is an unbound time.
> >
> > It takes quite some time to reproduce, but I'll ask the people who
> > have that workload to give it a try.
> >
>
> Do please. I'd rather not sink time into trying to reproduce a hypothetical
> problem when people who are already familiar with it can provide better
> data. If it stays quiet for too long then I'll either use an existing
> futextest, extend futextest or conclude that the problem was not major
> in the first place if the users cannot be arsed testing a patch.

Took some time, but the folks finally came around to give it a try and
it fixes their problem. I did not explode either, but I doubt, that
their workload can trigger any of the corner cases.

Thanks,

tglx

Subject: Re: [RFC PATCH] futex: Remove requirement for lock_page in get_futex_key

On 2014-02-11 16:51:55 [+0100], Thomas Gleixner wrote:
> On Wed, 30 Oct 2013, Mel Gorman wrote:
> > On Wed, Oct 30, 2013 at 09:45:31AM +0100, Thomas Gleixner wrote:
> > > On Tue, 29 Oct 2013, Mel Gorman wrote:
> > > > Patch boots and futextest did not explode but I did no comparison
> > > > performance tests. Thomas, do you have details of the workload that
> > > > drove you to examine this problem? Alternatively, can you test it and
> > > The scenario is simple. All you need is a PSHARED futex.
> > >
> > > Task A
> > > get_futex_key()
> > > lock_page()
> > >
> > > ---> preemption
> > >
> > Do please. I'd rather not sink time into trying to reproduce a hypothetical
> > problem when people who are already familiar with it can provide better
> Took some time, but the folks finally came around to give it a try and
> it fixes their problem. I did not explode either, but I doubt, that
> their workload can trigger any of the corner cases.

I just stumbled uppon this patch and now I am curious about its status.
tglx said that it solves the problem in question (with doubt of
triggering all the corner cases). Chris pointed out that moving
put_page() might be worth doing.

Sebastian