2014-01-03 19:55:56

by Davidlohr Bueso

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

Hi Joonsoo,

Sorry about the delay...

On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <[email protected]> wrote:
> > > > >
> > > > > > If parallel faults occur, we can fail to allocate a hugepage, because many
> > > > > > threads dequeue a hugepage to handle a fault on the same address. This causes
> > > > > > a reserved pool shortage for just a little while, and it makes a faulting
> > > > > > thread that should be able to get a hugepage receive a SIGBUS signal instead.
> > > > > >
> > > > > > To solve this problem, we already have a nice solution, that is, the
> > > > > > hugetlb_instantiation_mutex. It blocks other threads from diving into the
> > > > > > fault handler. This solves the problem cleanly, but it introduces performance
> > > > > > degradation, because it serializes all fault handling.
> > > > > >
> > > > > > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > > > > > performance degradation.
> > > > >
> > > > > So the whole point of the patch is to improve performance, but the
> > > > > changelog doesn't include any performance measurements!
> > > > >
> > > >
> > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > workload and after that it does not matter. At worst, it hurts application
> > > > startup time but that is still poor motivation for putting a lot of work
> > > > into removing the mutex.
> > >
> > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > then it naturally decreases.
> > >
> > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > be important to check if any workload that matters is actually hitting
> > > > that problem.
> > >
> > > I was thinking of writing one to actually get some numbers for this
> > > patchset -- I don't know of any benchmark that might stress this lock.
> > >
> > > However I first measured the amount of cycles it costs to start an
> > > Oracle DB and things went south with these changes. A simple 'startup
> > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > billion. While there is naturally a fair amount of variation, these
> > > changes do seem to do more harm than good, at least in real world
> > > scenarios.
> >
> > Hello,
> >
> > I don't think the number of cycles is the right metric for this patchset,
> > because cycles are wasted by fault handling failures. Instead, it targets
> > improved elapsed time.

Fair enough, but the fact of the matter is that this approach does end up
hurting performance. Regarding total startup time, I hardly saw any
difference: with both vanilla and this patchset it takes close to
33.5 seconds.

> > Could you tell me how long it takes to fault all of its hugepages?
> >
> > Anyway, this order of magnitude still seems a problem. :/
> >
> > I guess the cycles are wasted by zeroing hugepages in the fault path, as
> > Andrew pointed out.
> >
> > I will send more patches to fix this problem.
>
> Hello, Davidlohr.
>
> Here goes the fix on top of this series.

... and with this patch we go from 27 down to 11 billion cycles, so this
approach still costs more than what we currently have. A perf stat shows
that an entire 1GB hugepage-aware DB startup costs around 30 billion
cycles on a vanilla kernel, so the impact of hugetlb_fault() is
definitely non-trivial and IMO worth considering.

Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
ride and things do look quite a bit better, which is basically what Andrew
was suggesting previously anyway. With the hash table approach the startup
time went down to ~25.1 seconds, a nice 24.7% time reduction, with
hugetlb_fault() consuming roughly 5.3 billion cycles. This hash table test
was on an 80-core system, so since we round up to the next power of two we
end up with 256 entries -- I think we can do better if we enlarge it
further, maybe something like a static 1024, or probably better,
8-ish * nr cpus.
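
Just to make the idea concrete, the table is nothing more than an array of
mutexes indexed by a hash of whatever identifies the page being
instantiated. A rough, untested user-space sketch of the scheme (the names,
hash and sizing below are only illustrative, not the actual mm/hugetlb.c
code):

#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

static pthread_mutex_t *fault_mutex_table;
static unsigned long fault_mutex_mask;	/* table size - 1, size is a power of two */

static unsigned long roundup_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Size the table at, say, 8 * nr_cpus, rounded up to a power of two. */
static void fault_mutex_table_init(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	unsigned long size = roundup_pow_of_two(8 * (ncpus > 0 ? ncpus : 1));
	unsigned long i;

	fault_mutex_table = malloc(size * sizeof(*fault_mutex_table));
	if (!fault_mutex_table)
		exit(1);
	for (i = 0; i < size; i++)
		pthread_mutex_init(&fault_mutex_table[i], NULL);
	fault_mutex_mask = size - 1;
}

/*
 * Hash something that identifies the page being faulted, e.g. the backing
 * object plus page index: concurrent faults on the same page still
 * serialize, while faults on different pages no longer do.
 */
static pthread_mutex_t *fault_mutex(const void *mapping, unsigned long idx)
{
	unsigned long key = (unsigned long)mapping / sizeof(long) + idx;

	key *= 0x9e3779b9UL;		/* golden-ratio multiplicative hash */
	key ^= key >> 16;		/* mix the high bits into the index */
	return &fault_mutex_table[key & fault_mutex_mask];
}

The fault path would then take fault_mutex(mapping, idx) around the
dequeue-or-allocate step instead of the single global mutex, which keeps
the SIGBUS race closed while letting unrelated faults run in parallel.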

Thoughts? Is there any reason why we cannot go with this instead? Yes,
we still keep the mutex, but the approach is (1) proven better for
performance on real world workloads and (2) far less invasive.

Thanks,
Davidlohr


2014-01-06 00:19:31

by Joonsoo Kim

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> Hi Joonsoo,
>
> Sorry about the delay...
>
> On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <[email protected]> wrote:
> > > > > >
> > > > > > > If parallel faults occur, we can fail to allocate a hugepage, because many
> > > > > > > threads dequeue a hugepage to handle a fault on the same address. This causes
> > > > > > > a reserved pool shortage for just a little while, and it makes a faulting
> > > > > > > thread that should be able to get a hugepage receive a SIGBUS signal instead.
> > > > > > >
> > > > > > > To solve this problem, we already have a nice solution, that is, the
> > > > > > > hugetlb_instantiation_mutex. It blocks other threads from diving into the
> > > > > > > fault handler. This solves the problem cleanly, but it introduces performance
> > > > > > > degradation, because it serializes all fault handling.
> > > > > > >
> > > > > > > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > > > > > > performance degradation.
> > > > > >
> > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > changelog doesn't include any performance measurements!
> > > > > >
> > > > >
> > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > into removing the mutex.
> > > >
> > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > then it naturally decreases.
> > > >
> > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > be important to check if any workload that matters is actually hitting
> > > > > that problem.
> > > >
> > > > I was thinking of writing one to actually get some numbers for this
> > > > patchset -- I don't know of any benchmark that might stress this lock.
> > > >
> > > > However I first measured the amount of cycles it costs to start an
> > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > billion. While there is naturally a fair amount of variation, these
> > > > changes do seem to do more harm than good, at least in real world
> > > > scenarios.
> > >
> > > Hello,
> > >
> > > I don't think the number of cycles is the right metric for this patchset,
> > > because cycles are wasted by fault handling failures. Instead, it targets
> > > improved elapsed time.
>
> Fair enough, but the fact of the matter is that this approach does end up
> hurting performance. Regarding total startup time, I hardly saw any
> difference: with both vanilla and this patchset it takes close to
> 33.5 seconds.
>
> > > Could you tell me how long it takes to fault all of its hugepages?
> > >
> > > Anyway, this order of magnitude still seems a problem. :/
> > >
> > > I guess the cycles are wasted by zeroing hugepages in the fault path, as
> > > Andrew pointed out.
> > >
> > > I will send more patches to fix this problem.
> >
> > Hello, Davidlohr.
> >
> > Here goes the fix on top of this series.
>
> ... and with this patch we go from 27 down to 11 billion cycles, so this
> approach still costs more than what we currently have. A perf stat shows
> that an entire 1GB hugepage-aware DB startup costs around 30 billion
> cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> definitely non-trivial and IMO worth considering.

Thanks a lot for your help. :)

>
> Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> ride and things do look quite a bit better, which is basically what Andrew
> was suggesting previously anyway. With the hash table approach the startup
> time went down to ~25.1 seconds, a nice 24.7% time reduction, with
> hugetlb_fault() consuming roughly 5.3 billion cycles. This hash table test
> was on an 80-core system, so since we round up to the next power of two we
> end up with 256 entries -- I think we can do better if we enlarge it
> further, maybe something like a static 1024, or probably better,
> 8-ish * nr cpus.
>
> Thoughts? Is there any reason why we cannot go with this instead? Yes,
> we still keep the mutex, but the approach is (1) proven better for
> performance on real world workloads and (2) far less invasive.

I have no more ideas for improving my patches now, so I agree with your
approach. When I reviewed your approach last time, I found one race
condition, but at that time I couldn't think of a solution for it. If you
resend it, I will review it and think it over again.

Thanks.

2014-01-06 12:19:17

by Davidlohr Bueso

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > Hi Joonsoo,
> >
> > Sorry about the delay...
> >
> > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <[email protected]> wrote:
> > > > > > >
> > > > > > > > If parallel faults occur, we can fail to allocate a hugepage, because many
> > > > > > > > threads dequeue a hugepage to handle a fault on the same address. This causes
> > > > > > > > a reserved pool shortage for just a little while, and it makes a faulting
> > > > > > > > thread that should be able to get a hugepage receive a SIGBUS signal instead.
> > > > > > > >
> > > > > > > > To solve this problem, we already have a nice solution, that is, the
> > > > > > > > hugetlb_instantiation_mutex. It blocks other threads from diving into the
> > > > > > > > fault handler. This solves the problem cleanly, but it introduces performance
> > > > > > > > degradation, because it serializes all fault handling.
> > > > > > > >
> > > > > > > > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > > > > > > > performance degradation.
> > > > > > >
> > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > changelog doesn't include any performance measurements!
> > > > > > >
> > > > > >
> > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > into removing the mutex.
> > > > >
> > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > then it naturally decreases.
> > > > >
> > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > be important to check if any workload that matters is actually hitting
> > > > > > that problem.
> > > > >
> > > > > I was thinking of writing one to actually get some numbers for this
> > > > > patchset -- I don't know of any benchmark that might stress this lock.
> > > > >
> > > > > However I first measured the amount of cycles it costs to start an
> > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > billion. While there is naturally a fair amount of variation, these
> > > > > changes do seem to do more harm than good, at least in real world
> > > > > scenarios.
> > > >
> > > > Hello,
> > > >
> > > > I don't think the number of cycles is the right metric for this patchset,
> > > > because cycles are wasted by fault handling failures. Instead, it targets
> > > > improved elapsed time.
> >
> > Fair enough, but the fact of the matter is that this approach does end up
> > hurting performance. Regarding total startup time, I hardly saw any
> > difference: with both vanilla and this patchset it takes close to
> > 33.5 seconds.
> >
> > > > Could you tell me how long it takes to fault all of its hugepages?
> > > >
> > > > Anyway, this order of magnitude still seems a problem. :/
> > > >
> > > > I guess the cycles are wasted by zeroing hugepages in the fault path, as
> > > > Andrew pointed out.
> > > >
> > > > I will send more patches to fix this problem.
> > >
> > > Hello, Davidlohr.
> > >
> > > Here goes the fix on top of this series.
> >
> > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > approach still costs more than what we currently have. A perf stat shows
> > that an entire 1GB hugepage-aware DB startup costs around 30 billion
> > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > definitely non-trivial and IMO worth considering.
>
> Thanks a lot for your help. :)
>
> >
> > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > ride and things do look quite a bit better, which is basically what Andrew
> > was suggesting previously anyway. With the hash table approach the startup
> > time went down to ~25.1 seconds, a nice 24.7% time reduction, with
> > hugetlb_fault() consuming roughly 5.3 billion cycles. This hash table test
> > was on an 80-core system, so since we round up to the next power of two we
> > end up with 256 entries -- I think we can do better if we enlarge it
> > further, maybe something like a static 1024, or probably better,
> > 8-ish * nr cpus.
> >
> > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > we still keep the mutex, but the approach is (1) proven better for
> > performance on real world workloads and (2) far less invasive.
>
> I have no more ideas for improving my patches now, so I agree with your
> approach. When I reviewed your approach last time, I found one race
> condition, but at that time I couldn't think of a solution for it. If you
> resend it, I will review it and think it over again.

hmm, so how do you want to play this? Your first 3 patches basically deal
(more elegantly) with my patch 1/2, in which I was just going to change the
lock -- we had agreed that serializing regions with a spinlock was better
than with a sleeping lock, as the critical section is small enough and we
just had to deal with that trivial kmalloc case in region_chg(). So I can
pick up your patches 1, 2 & 3 and then add the instantiation mutex hash
table change, sounds good?
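
For reference, the kmalloc case is just the usual allocate-before-locking
dance: do the allocation (which may block) before taking the spinlock, and
throw it away if it turns out to be unneeded. A simplified, untested
user-space sketch of that pattern -- not the actual region_chg() code:

#include <pthread.h>
#include <stdlib.h>

struct file_region {
	struct file_region *next;
	long from, to;
};

static pthread_spinlock_t resv_lock;
static struct file_region *region_list;	/* stand-in for the resv_map region list */

static void resv_lock_init(void)	/* call once at startup */
{
	pthread_spin_init(&resv_lock, PTHREAD_PROCESS_PRIVATE);
}

/*
 * Reserve [from, to): the new node is allocated up front because the
 * allocation may block and so cannot happen under the spinlock; if an
 * existing region already covers the range, the node is simply freed.
 */
static long region_chg_sketch(long from, long to)
{
	struct file_region *nrg = malloc(sizeof(*nrg));	/* may block: done unlocked */
	struct file_region *rg;
	long chg = 0;

	if (!nrg)
		return -1;	/* would be -ENOMEM in the kernel */

	pthread_spin_lock(&resv_lock);
	for (rg = region_list; rg; rg = rg->next)
		if (rg->from <= from && rg->to >= to)
			break;			/* already covered, nothing to add */
	if (!rg) {
		nrg->from = from;
		nrg->to = to;
		nrg->next = region_list;
		region_list = nrg;
		nrg = NULL;			/* preallocated node was consumed */
		chg = to - from;
	}
	pthread_spin_unlock(&resv_lock);

	free(nrg);				/* no-op if the node was used */
	return chg;
}

The point is only that the node which may be needed under the lock is
obtained while nothing is held, so a plain spinlock is enough for the
region list itself.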

Thanks,
Davidlohr

2014-01-07 01:56:53

by Joonsoo Kim

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > Hi Joonsoo,
> > >
> > > Sorry about the delay...
> > >
> > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > If parallel faults occur, we can fail to allocate a hugepage, because many
> > > > > > > > > threads dequeue a hugepage to handle a fault on the same address. This causes
> > > > > > > > > a reserved pool shortage for just a little while, and it makes a faulting
> > > > > > > > > thread that should be able to get a hugepage receive a SIGBUS signal instead.
> > > > > > > > >
> > > > > > > > > To solve this problem, we already have a nice solution, that is, the
> > > > > > > > > hugetlb_instantiation_mutex. It blocks other threads from diving into the
> > > > > > > > > fault handler. This solves the problem cleanly, but it introduces performance
> > > > > > > > > degradation, because it serializes all fault handling.
> > > > > > > > >
> > > > > > > > > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > > > > > > > > performance degradation.
> > > > > > > >
> > > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > >
> > > > > > >
> > > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > > into removing the mutex.
> > > > > >
> > > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > > then it naturally decreases.
> > > > > >
> > > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > > be important to check if any workload that matters is actually hitting
> > > > > > > that problem.
> > > > > >
> > > > > > I was thinking of writing one to actually get some numbers for this
> > > > > > patchset -- I don't know of any benchmark that might stress this lock.
> > > > > >
> > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > > billion. While there is naturally a fair amount of variation, these
> > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > scenarios.
> > > > >
> > > > > Hello,
> > > > >
> > > > > I don't think the number of cycles is the right metric for this patchset,
> > > > > because cycles are wasted by fault handling failures. Instead, it targets
> > > > > improved elapsed time.
> > >
> > > Fair enough, but the fact of the matter is that this approach does end up
> > > hurting performance. Regarding total startup time, I hardly saw any
> > > difference: with both vanilla and this patchset it takes close to
> > > 33.5 seconds.
> > >
> > > > > Could you tell me how long it takes to fault all of its hugepages?
> > > > >
> > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > >
> > > > > I guess the cycles are wasted by zeroing hugepages in the fault path, as
> > > > > Andrew pointed out.
> > > > >
> > > > > I will send more patches to fix this problem.
> > > >
> > > > Hello, Davidlohr.
> > > >
> > > > Here goes the fix on top of this series.
> > >
> > > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > > approach still costs more than what we currently have. A perf stat shows
> > > that an entire 1GB hugepage-aware DB startup costs around 30 billion
> > > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > > definitely non-trivial and IMO worth considering.
> >
> > Thanks a lot for your help. :)
> >
> > >
> > > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > > ride and things do look quite a bit better, which is basically what Andrew
> > > was suggesting previously anyway. With the hash table approach the startup
> > > time went down to ~25.1 seconds, a nice 24.7% time reduction, with
> > > hugetlb_fault() consuming roughly 5.3 billion cycles. This hash table test
> > > was on an 80-core system, so since we round up to the next power of two we
> > > end up with 256 entries -- I think we can do better if we enlarge it
> > > further, maybe something like a static 1024, or probably better,
> > > 8-ish * nr cpus.
> > >
> > > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > > we still keep the mutex, but the approach is (1) proven better for
> > > performance on real world workloads and (2) far less invasive.
> >
> > I have no more ideas for improving my patches now, so I agree with your
> > approach. When I reviewed your approach last time, I found one race
> > condition, but at that time I couldn't think of a solution for it. If you
> > resend it, I will review it and think it over again.
>
> hmm, so how do you want to play this? Your first 3 patches basically deal
> (more elegantly) with my patch 1/2, in which I was just going to change the
> lock -- we had agreed that serializing regions with a spinlock was better
> than with a sleeping lock, as the critical section is small enough and we
> just had to deal with that trivial kmalloc case in region_chg(). So I can
> pick up your patches 1, 2 & 3 and then add the instantiation mutex hash
> table change, sounds good?

Hello,

If Andrew agrees, it would be great to merge patches 1-7 into mainline
before your mutex approach. Some of them are clean-up patches and, IMO,
they make the code more readable and maintainable, so they are worth
merging separately.

If he disagrees, merging patches 1-3 and then adding your approach is fine
with me.

Andrew, what are your thoughts?

Thanks.

2014-01-07 02:37:10

by Davidlohr Bueso

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Tue, 2014-01-07 at 10:57 +0900, Joonsoo Kim wrote:
> On Mon, Jan 06, 2014 at 04:19:05AM -0800, Davidlohr Bueso wrote:
> > On Mon, 2014-01-06 at 09:19 +0900, Joonsoo Kim wrote:
> > > On Fri, Jan 03, 2014 at 11:55:45AM -0800, Davidlohr Bueso wrote:
> > > > Hi Joonsoo,
> > > >
> > > > Sorry about the delay...
> > > >
> > > > On Mon, 2013-12-23 at 11:11 +0900, Joonsoo Kim wrote:
> > > > > On Mon, Dec 23, 2013 at 09:44:38AM +0900, Joonsoo Kim wrote:
> > > > > > On Fri, Dec 20, 2013 at 10:48:17PM -0800, Davidlohr Bueso wrote:
> > > > > > > On Fri, 2013-12-20 at 14:01 +0000, Mel Gorman wrote:
> > > > > > > > On Thu, Dec 19, 2013 at 05:02:02PM -0800, Andrew Morton wrote:
> > > > > > > > > On Wed, 18 Dec 2013 15:53:59 +0900 Joonsoo Kim <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > If parallel faults occur, we can fail to allocate a hugepage, because many
> > > > > > > > > > threads dequeue a hugepage to handle a fault on the same address. This causes
> > > > > > > > > > a reserved pool shortage for just a little while, and it makes a faulting
> > > > > > > > > > thread that should be able to get a hugepage receive a SIGBUS signal instead.
> > > > > > > > > >
> > > > > > > > > > To solve this problem, we already have a nice solution, that is, the
> > > > > > > > > > hugetlb_instantiation_mutex. It blocks other threads from diving into the
> > > > > > > > > > fault handler. This solves the problem cleanly, but it introduces performance
> > > > > > > > > > degradation, because it serializes all fault handling.
> > > > > > > > > >
> > > > > > > > > > Now, I try to remove the hugetlb_instantiation_mutex to get rid of this
> > > > > > > > > > performance degradation.
> > > > > > > > >
> > > > > > > > > So the whole point of the patch is to improve performance, but the
> > > > > > > > > changelog doesn't include any performance measurements!
> > > > > > > > >
> > > > > > > >
> > > > > > > > I don't really deal with hugetlbfs any more and I have not examined this
> > > > > > > > series but I remember why I never really cared about this mutex. It wrecks
> > > > > > > > fault scalability but AFAIK fault scalability almost never mattered for
> > > > > > > > workloads using hugetlbfs. The most common user of hugetlbfs by far is
> > > > > > > > sysv shared memory. The memory is faulted early in the lifetime of the
> > > > > > > > workload and after that it does not matter. At worst, it hurts application
> > > > > > > > startup time but that is still poor motivation for putting a lot of work
> > > > > > > > into removing the mutex.
> > > > > > >
> > > > > > > Yep, important hugepage workloads initially pound heavily on this lock,
> > > > > > > then it naturally decreases.
> > > > > > >
> > > > > > > > Microbenchmarks will be able to trigger problems in this area but it'd
> > > > > > > > be important to check if any workload that matters is actually hitting
> > > > > > > > that problem.
> > > > > > >
> > > > > > > I was thinking of writing one to actually get some numbers for this
> > > > > > > patchset -- I don't know of any benchmark that might stress this lock.
> > > > > > >
> > > > > > > However I first measured the amount of cycles it costs to start an
> > > > > > > Oracle DB and things went south with these changes. A simple 'startup
> > > > > > > immediate' calls hugetlb_fault() ~5000 times. For a vanilla kernel, this
> > > > > > > costs ~7.5 billion cycles and with this patchset it goes up to ~27.1
> > > > > > > billion. While there is naturally a fair amount of variation, these
> > > > > > > changes do seem to do more harm than good, at least in real world
> > > > > > > scenarios.
> > > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I don't think the number of cycles is the right metric for this patchset,
> > > > > > because cycles are wasted by fault handling failures. Instead, it targets
> > > > > > improved elapsed time.
> > > >
> > > > Fair enough, but the fact of the matter is that this approach does end up
> > > > hurting performance. Regarding total startup time, I hardly saw any
> > > > difference: with both vanilla and this patchset it takes close to
> > > > 33.5 seconds.
> > > >
> > > > > > Could you tell me how long it takes to fault all of its hugepages?
> > > > > >
> > > > > > Anyway, this order of magnitude still seems a problem. :/
> > > > > >
> > > > > > I guess the cycles are wasted by zeroing hugepages in the fault path, as
> > > > > > Andrew pointed out.
> > > > > >
> > > > > > I will send more patches to fix this problem.
> > > > >
> > > > > Hello, Davidlohr.
> > > > >
> > > > > Here goes the fix on top of this series.
> > > >
> > > > ... and with this patch we go from 27 down to 11 billion cycles, so this
> > > > approach still costs more than what we currently have. A perf stat shows
> > > > that an entire 1GB hugepage-aware DB startup costs around 30 billion
> > > > cycles on a vanilla kernel, so the impact of hugetlb_fault() is
> > > > definitely non-trivial and IMO worth considering.
> > >
> > > Thanks a lot for your help. :)
> > >
> > > >
> > > > Now, I took my old patchset (https://lkml.org/lkml/2013/7/26/299) for a
> > > > ride and things do look quite a bit better, which is basically what Andrew
> > > > was suggesting previously anyway. With the hash table approach the startup
> > > > time went down to ~25.1 seconds, a nice 24.7% time reduction, with
> > > > hugetlb_fault() consuming roughly 5.3 billion cycles. This hash table test
> > > > was on an 80-core system, so since we round up to the next power of two we
> > > > end up with 256 entries -- I think we can do better if we enlarge it
> > > > further, maybe something like a static 1024, or probably better,
> > > > 8-ish * nr cpus.
> > > >
> > > > Thoughts? Is there any reason why we cannot go with this instead? Yes,
> > > > we still keep the mutex, but the approach is (1) proven better for
> > > > performance on real world workloads and (2) far less invasive.
> > >
> > > I have no more ideas for improving my patches now, so I agree with your
> > > approach. When I reviewed your approach last time, I found one race
> > > condition, but at that time I couldn't think of a solution for it. If you
> > > resend it, I will review it and think it over again.
> >
> > hmm, so how do you want to play this? Your first 3 patches basically deal
> > (more elegantly) with my patch 1/2, in which I was just going to change the
> > lock -- we had agreed that serializing regions with a spinlock was better
> > than with a sleeping lock, as the critical section is small enough and we
> > just had to deal with that trivial kmalloc case in region_chg(). So I can
> > pick up your patches 1, 2 & 3 and then add the instantiation mutex hash
> > table change, sounds good?
>
> Hello,
>
> If Andrew agrees, it would be great to merge patches 1-7 into mainline
> before your mutex approach. Some of them are clean-up patches and, IMO,
> they make the code more readable and maintainable, so they are worth
> merging separately.

Fine by me.

2014-01-15 03:13:32

by David Rientjes

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Mon, 6 Jan 2014, Davidlohr Bueso wrote:

> > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > before your mutex approach. Some of them are clean-up patches and, IMO,
> > they make the code more readable and maintainable, so they are worth
> > merging separately.
>
> Fine by me.
>

It appears like patches 1-7 are still missing from linux-next, would you
mind posting them in a series with your approach?

2014-01-15 04:37:58

by Davidlohr Bueso

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
>
> > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > they make the code more readable and maintainable, so they are worth
> > > merging separately.
> >
> > Fine by me.
> >
>
> It appears like patches 1-7 are still missing from linux-next, would you
> mind posting them in a series with your approach?

I haven't looked much into patches 4-7, but at least the first three are
ok. I was waiting for Andrew to take all seven for linux-next and then
I'd rebase my approach on top. Anyway, unless Andrew has any
preferences, if by later this week they're not picked up, I'll resend
everything.

2014-01-15 04:54:13

by Andrew Morton

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso <[email protected]> wrote:

> On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> >
> > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > they make the code more readable and maintainable, so they are worth
> > > > merging separately.
> > >
> > > Fine by me.
> > >
> >
> > It appears like patches 1-7 are still missing from linux-next, would you
> > mind posting them in a series with your approach?
>
> I haven't looked much into patches 4-7, but at least the first three are
> ok. I was waiting for Andrew to take all seven for linux-next and then
> I'd rebase my approach on top. Anyway, unless Andrew has any
> preferences, if by later this week they're not picked up, I'll resend
> everything.

Well, we're mainly looking for bugfixes this late in the cycle.
"[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
introduced resv_map lock" fixes a bug, but I'd assumed that it depended
on earlier patches. If we think that one is serious then it would be
better to cook up a minimal fix which is backportable into 3.12 and
earlier?

2014-01-15 20:47:10

by Davidlohr Bueso

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Tue, 2014-01-14 at 20:56 -0800, Andrew Morton wrote:
> On Tue, 14 Jan 2014 20:37:49 -0800 Davidlohr Bueso <[email protected]> wrote:
>
> > On Tue, 2014-01-14 at 19:08 -0800, David Rientjes wrote:
> > > On Mon, 6 Jan 2014, Davidlohr Bueso wrote:
> > >
> > > > > If Andrew agrees, it would be great to merge patches 1-7 into mainline
> > > > > before your mutex approach. Some of them are clean-up patches and, IMO,
> > > > > they make the code more readable and maintainable, so they are worth
> > > > > merging separately.
> > > >
> > > > Fine by me.
> > > >
> > >
> > > It appears like patches 1-7 are still missing from linux-next, would you
> > > mind posting them in a series with your approach?
> >
> > I haven't looked much into patches 4-7, but at least the first three are
> > ok. I was waiting for Andrew to take all seven for linux-next and then
> > I'd rebase my approach on top. Anyway, unless Andrew has any
> > preferences, if by later this week they're not picked up, I'll resend
> > everything.
>
> Well, we're mainly looking for bugfixes this late in the cycle.
> "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> on earlier patches.

It doesn't seem to depend on anything. All seven patches apply cleanly on
linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).

> If we think that one is serious then it would be
> better to cook up a minimal fix which is backportable into 3.12 and
> earlier?

I don't think it's too serious, afaik it's a theoretical race and I
haven't seen any bug reports for it. So we can probably just wait for
3.14, as you say, it's already late in the cycle anyways. Just let me
know what you want to do so we can continue working on the actual
performance issue.

Thanks,
Davidlohr

2014-01-15 20:50:55

by Andrew Morton

Subject: Re: [PATCH v3 13/14] mm, hugetlb: retry if failed to allocate and there is concurrent user

On Wed, 15 Jan 2014 12:47:00 -0800 Davidlohr Bueso <[email protected]> wrote:

> > Well, we're mainly looking for bugfixes this late in the cycle.
> > "[PATCH v3 03/14] mm, hugetlb: protect region tracking via newly
> > introduced resv_map lock" fixes a bug, but I'd assumed that it depended
> > on earlier patches.
>
> It doesn't seem to depend on anything. All seven patches apply cleanly on
> linux-next; the last change to mm/hugetlb.c was commit 3ebac7fa (mm:
> dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE).
>
> > If we think that one is serious then it would be
> > better to cook up a minimal fix which is backportable into 3.12 and
> > earlier?
>
> I don't think it's too serious, afaik it's a theoretical race and I
> haven't seen any bug reports for it. So we can probably just wait for
> 3.14, as you say, it's already late in the cycle anyways.

OK, thanks.

> Just let me
> know what you want to do so we can continue working on the actual
> performance issue.

A resend after -rc1 would suit.