2015-07-23 21:54:46

by Spencer Baugh

Subject: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

From: Joern Engel <[email protected]>

~150ms scheduler latency for both observed in the wild.

Signed-off-by: Joern Engel <[email protected]>
Signed-off-by: Spencer Baugh <[email protected]>
---
mm/hugetlb.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..2eb6919 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
 			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
 		else
 			ret = alloc_fresh_huge_page(h, nodes_allowed);
+		cond_resched();
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			spin_unlock(ptl);
 			ret = hugetlb_fault(mm, vma, vaddr,
 				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
+			cond_resched();
 			if (!(ret & VM_FAULT_ERROR))
 				continue;

--
2.5.0.rc3


2015-07-23 22:09:09

by David Rientjes

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu, 23 Jul 2015, Spencer Baugh wrote:

> From: Joern Engel <[email protected]>
>
> ~150ms scheduler latency for both observed in the wild.
>
> Signed-off-by: Joern Engel <[email protected]>
> Signed-off-by: Spencer Baugh <[email protected]>
> ---
> mm/hugetlb.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a8c3087..2eb6919 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>  			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
>  		else
>  			ret = alloc_fresh_huge_page(h, nodes_allowed);
> +		cond_resched();
>  		spin_lock(&hugetlb_lock);
>  		if (!ret)
>  			goto out;

This is wrong, you'd want to do any cond_resched() before the page
allocation to avoid racing with an update to h->nr_huge_pages or
h->surplus_huge_pages made while hugetlb_lock was dropped, which would
result in the page having been uselessly allocated.
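
Concretely, that ordering would look something like the sketch below -
illustration only, not a tested patch, reusing the names from
mm/hugetlb.c's set_max_huge_pages() loop:

	while (count > persistent_huge_pages(h)) {
		spin_unlock(&hugetlb_lock);

		/* Sketch: yield before the allocation rather than after
		 * it, so we never sleep while holding a freshly
		 * allocated page that may turn out to be unneeded. */
		cond_resched();

		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;
		/* ... signal check as in the existing loop ... */
	}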

> @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			spin_unlock(ptl);
>  			ret = hugetlb_fault(mm, vma, vaddr,
>  				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> +			cond_resched();
>  			if (!(ret & VM_FAULT_ERROR))
>  				continue;
>

This is almost certainly the wrong placement as well, since it's inserted
inside a conditional inside a while loop and there's no reason to
hugetlb_fault(), schedule, and then check the return value. You need to
insert your cond_resched() calls in legitimate places.
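
For follow_hugetlb_page(), one such place might be the top of each loop
iteration, before any locks are taken - again a sketch only, assuming the
loop keeps its current shape:

	while (vaddr < vma->vm_end && remainder) {
		/* Sketch: yield once per iteration, instead of only on
		 * the path that just called hugetlb_fault(). */
		cond_resched();
		/* ... rest of the loop body unchanged ... */
	}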

2015-07-23 22:36:59

by Jörn Engel

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu, Jul 23, 2015 at 03:08:58PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Spencer Baugh wrote:
> > From: Joern Engel <[email protected]>
> >
> > ~150ms scheduler latency for both observed in the wild.
> >
> > Signed-off-by: Joern Engel <[email protected]>
> > Signed-off-by: Spencer Baugh <[email protected]>
> > ---
> > mm/hugetlb.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index a8c3087..2eb6919 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
> >  			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> >  		else
> >  			ret = alloc_fresh_huge_page(h, nodes_allowed);
> > +		cond_resched();
> >  		spin_lock(&hugetlb_lock);
> >  		if (!ret)
> >  			goto out;
>
> This is wrong, you'd want to do any cond_resched() before the page
> allocation to avoid racing with an update to h->nr_huge_pages or
> h->surplus_huge_pages made while hugetlb_lock was dropped, which would
> result in the page having been uselessly allocated.

There are three options. Either
	/* some allocation */
	cond_resched();
or
	cond_resched();
	/* some allocation */
or
	if (cond_resched()) {
		spin_lock(&hugetlb_lock);
		continue;
	}
	/* some allocation */

I think you want the second option instead of the first. That way a
little less memory sits allocated during the time we are scheduled out.
Sure, we can do that. It probably doesn't make a big difference either
way, but why not.

If you are asking for the third option, I would rather avoid that. It
makes the code more complex and doesn't change the fact that we have a
race and had better be able to handle it. The code size growth will
likely cost us more performance than we would ever gain. nr_huge_pages
tends to get updated once per system boot.

> > @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >  			spin_unlock(ptl);
> >  			ret = hugetlb_fault(mm, vma, vaddr,
> >  				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> > +			cond_resched();
> >  			if (!(ret & VM_FAULT_ERROR))
> >  				continue;
> >
>
> This is almost certainly the wrong placement as well, since it's inserted
> inside a conditional inside a while loop and there's no reason to
> hugetlb_fault(), schedule, and then check the return value. You need to
> insert your cond_resched() calls in legitimate places.

I assume you want the second option here as well. Am I right?

Jörn

--
Sometimes it pays to stay in bed on Monday, rather than spending the rest
of the week debugging Monday's code.
-- Christopher Thompson

2015-07-23 22:54:51

by David Rientjes

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu, 23 Jul 2015, Jörn Engel wrote:

> > This is wrong, you'd want to do any cond_resched() before the page
> > allocation to avoid racing with an update to h->nr_huge_pages or
> > h->surplus_huge_pages made while hugetlb_lock was dropped, which would
> > result in the page having been uselessly allocated.
>
> There are three options. Either
> 	/* some allocation */
> 	cond_resched();
> or
> 	cond_resched();
> 	/* some allocation */
> or
> 	if (cond_resched()) {
> 		spin_lock(&hugetlb_lock);
> 		continue;
> 	}
> 	/* some allocation */
>
> I think you want the second option instead of the first. That way a
> little less memory sits allocated during the time we are scheduled out.
> Sure, we can do that. It probably doesn't make a big difference either
> way, but why not.
>

The loop is dropping the lock simply to do the allocation and it needs to
compare with the user-written number of hugepages to allocate.

What we don't want is to allocate, reschedule, and check if we really
needed to allocate. That's what your patch does, because it races with
persistent_huge_pages(). It's probably the worst place to do it.

Rather, what you want to do is check if you need to allocate, reschedule
if needed (and if so, recheck), and then allocate.
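
In loop form that comes out as something like the following - a sketch
only, reusing the names from the code being patched, and relying on
cond_resched() returning nonzero when it actually rescheduled:

	while (count > persistent_huge_pages(h)) {
		spin_unlock(&hugetlb_lock);

		/* If we slept, retake the lock and re-test the loop
		 * condition before committing to an allocation that
		 * may no longer be needed. */
		if (cond_resched()) {
			spin_lock(&hugetlb_lock);
			continue;
		}

		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;
		/* ... signal check as in the existing loop ... */
	}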

> If you are asking for the third option, I would rather avoid that. It
> makes the code more complex and doesn't change the fact that we have a
> race and had better be able to handle it. The code size growth will
> likely cost us more performance than we would ever gain. nr_huge_pages
> tends to get updated once per system boot.
>

Your third option is nonsensical: you didn't save the state of whether
you locked the lock, so you can't reliably unlock it, and you cannot hold
a spinlock while allocating in this context.

2015-07-23 23:09:34

by Jörn Engel

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu, Jul 23, 2015 at 03:54:43PM -0700, David Rientjes wrote:
> On Thu, 23 Jul 2015, Jörn Engel wrote:
>
> > > This is wrong, you'd want to do any cond_resched() before the page
> > > allocation to avoid racing with an update to h->nr_huge_pages or
> > > h->surplus_huge_pages made while hugetlb_lock was dropped, which would
> > > result in the page having been uselessly allocated.
> >
> > There are three options. Either
> > 	/* some allocation */
> > 	cond_resched();
> > or
> > 	cond_resched();
> > 	/* some allocation */
> > or
> > 	if (cond_resched()) {
> > 		spin_lock(&hugetlb_lock);
> > 		continue;
> > 	}
> > 	/* some allocation */
> >
> > I think you want the second option instead of the first. That way a
> > little less memory sits allocated during the time we are scheduled out.
> > Sure, we can do that. It probably doesn't make a big difference either
> > way, but why not.
> >
>
> The loop is dropping the lock simply to do the allocation and it needs to
> compare with the user-written number of hugepages to allocate.

And at this point the existing code is racy. Page allocation might
block for minutes trying to free some memory. A cond_resched doesn't
change that - it only increases the odds of hitting the race window.

> What we don't want is to allocate, reschedule, and check if we really
> needed to allocate. That's what your patch does, because it races with
> persistent_huge_pages(). It's probably the worst place to do it.
>
> Rather, what you want to do is check if you need to allocate, reschedule
> if needed (and if so, recheck), and then allocate.
>
> > If you are asking for the third option, I would rather avoid that. It
> > makes the code more complex and doesn't change the fact that we have a
> > race and had better be able to handle it. The code size growth will
> > likely cost us more performance than we would ever gain. nr_huge_pages
> > tends to get updated once per system boot.
>
> Your third option is nonsensical: you didn't save the state of whether
> you locked the lock, so you can't reliably unlock it, and you cannot hold
> a spinlock while allocating in this context.

Are we looking at the same code? Mine looks like this:
	while (count > persistent_huge_pages(h)) {
		/*
		 * If this allocation races such that we no longer need the
		 * page, free_huge_page will handle it by freeing the page
		 * and reducing the surplus.
		 */
		spin_unlock(&hugetlb_lock);
		if (hstate_is_gigantic(h))
			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
		else
			ret = alloc_fresh_huge_page(h, nodes_allowed);
		spin_lock(&hugetlb_lock);
		if (!ret)
			goto out;

		/* Bail for signals. Probably ctrl-c from user */
		if (signal_pending(current))
			goto out;
	}

What state is there to save? We just called spin_unlock; we did a
schedule, and if we want to continue without doing the page allocation we
had better take the lock again. Or do you want to go even more complex
and check for signals as well?

The case you are concerned about is rare. It is so rare that it doesn't
matter from a performance point of view, only for correctness. And if
we hit the rare case, the worst harm would be an unnecessary allocation
that we return to the system. How much complexity do you think it
is worth to avoid this allocation? How much runtime will the bigger
text size cost you in the common cases?

What matters to me is the scheduler latency. That is real and happens
reliably once per boot.

Jörn

--
Chance favors only the prepared mind.
-- Louis Pasteur

2015-07-24 07:00:08

by Michal Hocko

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu 23-07-15 14:54:31, Spencer Baugh wrote:
> From: Joern Engel <[email protected]>
>
> ~150ms scheduler latency for both observed in the wild.

This is way too vague. Could you describe your problem in more detail,
please?
There are scheduling points in the page allocator (when it triggers
reclaim), why are those not sufficient? Or do you manage to allocate
many hugetlb pages without performing reclaim, and does that lead to
soft lockups?

>
> Signed-off-by: Joern Engel <[email protected]>
> Signed-off-by: Spencer Baugh <[email protected]>
> ---
> mm/hugetlb.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a8c3087..2eb6919 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1836,6 +1836,7 @@ static unsigned long set_max_huge_pages(struct hstate *h, unsigned long count,
>  			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
>  		else
>  			ret = alloc_fresh_huge_page(h, nodes_allowed);
> +		cond_resched();
>  		spin_lock(&hugetlb_lock);
>  		if (!ret)
>  			goto out;
> @@ -3521,6 +3522,7 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			spin_unlock(ptl);
>  			ret = hugetlb_fault(mm, vma, vaddr,
>  				(flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
> +			cond_resched();
>  			if (!(ret & VM_FAULT_ERROR))
>  				continue;
>
> --
> 2.5.0.rc3
>

--
Michal Hocko
SUSE Labs

2015-07-24 17:12:46

by Jörn Engel

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Fri, Jul 24, 2015 at 08:59:59AM +0200, Michal Hocko wrote:
> On Thu 23-07-15 14:54:31, Spencer Baugh wrote:
> > From: Joern Engel <[email protected]>
> >
> > ~150ms scheduler latency for both observed in the wild.
>
> This is way too vague. Could you describe your problem in more detail,
> please?
> There are scheduling points in the page allocator (when it triggers
> reclaim), why are those not sufficient? Or do you manage to allocate
> many hugetlb pages without performing reclaim, and does that lead to
> soft lockups?

We don't use transparent hugepages - they cause too much latency.
Instead we reserve somewhere around 3/4 or so of physical memory for
hugepages. "sysctl -w vm.nr_hugepages=100000" or something similar in a
startup script.

Since it is early in boot we don't go through page reclaim.

Jörn

--
Everything should be made as simple as possible, but not simpler.
-- Albert Einstein

2015-07-24 19:49:18

by David Rientjes

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Thu, 23 Jul 2015, Jörn Engel wrote:

> > The loop is dropping the lock simply to do the allocation and it needs to
> > compare with the user-written number of hugepages to allocate.
>
> And at this point the existing code is racy. Page allocation might
> block for minutes trying to free some memory. A cond_resched doesn't
> change that - it only increases the odds of hitting the race window.
>

The existing code has always been racy and explicitly admits this; the
problem is that your patch makes the race window larger.

> Are we looking at the same code? Mine looks like this:
> 	while (count > persistent_huge_pages(h)) {
> 		/*
> 		 * If this allocation races such that we no longer need the
> 		 * page, free_huge_page will handle it by freeing the page
> 		 * and reducing the surplus.
> 		 */
> 		spin_unlock(&hugetlb_lock);
> 		if (hstate_is_gigantic(h))
> 			ret = alloc_fresh_gigantic_page(h, nodes_allowed);
> 		else
> 			ret = alloc_fresh_huge_page(h, nodes_allowed);
> 		spin_lock(&hugetlb_lock);
> 		if (!ret)
> 			goto out;
>
> 		/* Bail for signals. Probably ctrl-c from user */
> 		if (signal_pending(current))
> 			goto out;
> 	}
>

I don't see the cond_resched() you propose to add, but the need for it is
obvious with a large user-written nr_hugepages in the above loop.

The suggestion is to check the conditional, reschedule if needed (and if
so, recheck the conditional), and then allocate.

Your third option looks fine and is the best place to do the
cond_resched(). I was looking at your second option when I responded and
compared it to the first. We don't want to do cond_resched() immediately
before or after the allocation; the net result is the same: we may be
pointlessly allocating the hugepage, and each hugepage allocation can be
very heavyweight.

So I agree with your third option from the previous email.

You may also want to include the actual text of the warning from the
kernel log in your commit message. When people encounter this, they will
probably grep the kernel logs for some keywords to see if it was already
fixed, and I fear your current commit message may allow it to be missed.

2015-07-24 20:28:46

by Davidlohr Bueso

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Fri, 2015-07-24 at 10:12 -0700, Jörn Engel wrote:
> On Fri, Jul 24, 2015 at 08:59:59AM +0200, Michal Hocko wrote:
> > On Thu 23-07-15 14:54:31, Spencer Baugh wrote:
> > > From: Joern Engel <[email protected]>
> > >
> > > ~150ms scheduler latency for both observed in the wild.
> >
> > This is way too vague. Could you describe your problem in more detail,
> > please?
> > There are scheduling points in the page allocator (when it triggers
> > reclaim), why are those not sufficient? Or do you manage to allocate
> > many hugetlb pages without performing reclaim, and does that lead to
> > soft lockups?
>
> We don't use transparent hugepages - they cause too much latency.
> Instead we reserve somewhere around 3/4 or so of physical memory for
> hugepages. "sysctl -w vm.nr_hugepages=100000" or something similar in a
> startup script.
>
> Since it is early in boot, memory is still plentiful, so the allocations
> succeed without ever entering page reclaim - and never reach the
> scheduling points there.

Still, please be more verbose about what you _are_ encountering. IOW,
please include a decent changelog in v2.

2015-07-24 20:49:20

by Jörn Engel

Subject: Re: [PATCH] hugetlb: cond_resched for set_max_huge_pages and follow_hugetlb_page

On Fri, Jul 24, 2015 at 12:49:14PM -0700, David Rientjes wrote:
>
> I don't see the cond_resched() you propose to add, but the need for it is
> obvious with a large user-written nr_hugepages in the above loop.
>
> The suggestion is to check the conditional, reschedule if needed (and if
> so, recheck the conditional), and then allocate.
>
> Your third option looks fine and is the best place to do the
> cond_resched(). I was looking at your second option when I responded and
> compared it to the first. We don't want to do cond_resched() immediately
> before or after the allocation; the net result is the same: we may be
> pointlessly allocating the hugepage, and each hugepage allocation can be
> very heavyweight.
>
> So I agree with your third option from the previous email.

All right. We are talking about the same thing now. But I previously
argued that the pointless allocation will a) not impact correctness and
b) be so rare as to not impact performance. The problem with the third
option is that it adds a bit of constant overhead all the time in
exchange for avoiding the pointless allocation.

On my systems at least, the pointless allocation will happen, on
average, less than once per boot. Unless my systems are vastly
unrepresentative, the third option doesn't look appealing to me.

> You may also want to include the actual text of the warning from the
> kernel log in your commit message. When people encounter this, they will
> probably grep the kernel logs for some keywords to see if it was already
> fixed, and I fear your current commit message may allow it to be missed.

Ack.

I should still have those warnings in logfiles somewhere and can hunt
them down.

Jörn

--
Act only according to that maxim whereby you can, at the same time,
will that it should become a universal law.
-- Immanuel Kant