2015-02-02 04:00:13

by Oleg Drokin

Subject: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

From: Oleg Drokin <[email protected]>

leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
it clearly does not want any shrinker activity within the fs itself.
convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
this goal.

Signed-off-by: Oleg Drokin <[email protected]>
---
fs/gfs2/dir.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
index c5a34f0..6371192 100644
--- a/fs/gfs2/dir.c
+++ b/fs/gfs2/dir.c
@@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,

 	ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
 	if (ht == NULL)
-		ht = vzalloc(size);
+		ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
+			       PAGE_KERNEL);
 	if (!ht)
 		return -ENOMEM;

--
2.1.0


2015-02-02 05:37:13

by Dave Chinner

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

On Sun, Feb 01, 2015 at 10:59:54PM -0500, [email protected] wrote:
> From: Oleg Drokin <[email protected]>
>
> leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
> it clearly does not want any shrinker activity within the fs itself.
> convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
> this goal.
>
> Signed-off-by: Oleg Drokin <[email protected]>
> ---
> fs/gfs2/dir.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> index c5a34f0..6371192 100644
> --- a/fs/gfs2/dir.c
> +++ b/fs/gfs2/dir.c
> @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
>
> ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> if (ht == NULL)
> - ht = vzalloc(size);
> + ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> + PAGE_KERNEL);

That, in the end, won't help as vmalloc still uses GFP_KERNEL
allocations deep down in the PTE allocation code. See the hacks in
the DM and XFS code to work around this. i.e. go look for callers of
memalloc_noio_save(). It's ugly and grotesque, but we've got no
other way to limit reclaim context because the MM devs won't pass
the vmalloc gfp context down the stack to the PTE allocations....
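
For reference, the workaround Dave describes looks roughly like this (a
simplified sketch of the memalloc_noio_save() pattern, not the actual XFS
or DM code):

```c
/*
 * memalloc_noio_save() sets PF_MEMALLOC_NOIO on the current task,
 * which makes every allocation below this point -- including the
 * GFP_KERNEL page-table allocations buried inside vmalloc -- behave
 * as GFP_NOIO, so reclaim cannot recurse into the fs or block layer.
 */
unsigned int noio_flag;
void *ht;

noio_flag = memalloc_noio_save();
ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO, PAGE_KERNEL);
memalloc_noio_restore(noio_flag);
```

Note this is stronger than a GFP_NOFS caller actually needs (it cuts off
all IO, not just filesystem reclaim), which is part of why it is
considered a hack.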

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-02-02 06:57:41

by Oleg Drokin

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

Hello!

On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:

> On Sun, Feb 01, 2015 at 10:59:54PM -0500, [email protected] wrote:
>> From: Oleg Drokin <[email protected]>
>>
>> leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
>> it clearly does not want any shrinker activity within the fs itself.
>> convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
>> this goal.
>>
>> Signed-off-by: Oleg Drokin <[email protected]>
>> ---
>> fs/gfs2/dir.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
>> index c5a34f0..6371192 100644
>> --- a/fs/gfs2/dir.c
>> +++ b/fs/gfs2/dir.c
>> @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
>>
>> ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
>> if (ht == NULL)
>> - ht = vzalloc(size);
>> + ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
>> + PAGE_KERNEL);
> That, in the end, won't help as vmalloc still uses GFP_KERNEL
> allocations deep down in the PTE allocation code. See the hacks in
> the DM and XFS code to work around this. i.e. go look for callers of
> memalloc_noio_save(). It's ugly and grotesque, but we've got no
> other way to limit reclaim context because the MM devs won't pass
> the vmalloc gfp context down the stack to the PTE allocations....

Hm, interesting.
So all the other code in the kernel that does this sort of thing (and there's quite a bit
outside of xfs and ocfs2) would not get the desired effect?

So, I did some digging in archives and found this thread from 2010 onward with various
patches and rants.
Not sure how I missed that before.

Should we have another run at this I wonder?

Bye,
Oleg-

2015-02-02 08:11:20

by Dave Chinner

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
> Hello!
>
> On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
>
> > On Sun, Feb 01, 2015 at 10:59:54PM -0500, [email protected] wrote:
> >> From: Oleg Drokin <[email protected]>
> >>
> >> leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
> >> it clearly does not want any shrinker activity within the fs itself.
> >> convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
> >> this goal.
> >>
> >> Signed-off-by: Oleg Drokin <[email protected]>
> >> ---
> >> fs/gfs2/dir.c | 3 ++-
> >> 1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
> >> index c5a34f0..6371192 100644
> >> --- a/fs/gfs2/dir.c
> >> +++ b/fs/gfs2/dir.c
> >> @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
> >>
> >> ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> >> if (ht == NULL)
> >> - ht = vzalloc(size);
> >> + ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> >> + PAGE_KERNEL);
> > That, in the end, won't help as vmalloc still uses GFP_KERNEL
> > allocations deep down in the PTE allocation code. See the hacks in
> > the DM and XFS code to work around this. i.e. go look for callers of
> > memalloc_noio_save(). It's ugly and grotesque, but we've got no
> > other way to limit reclaim context because the MM devs won't pass
> > the vmalloc gfp context down the stack to the PTE allocations....
>
> Hm, interesting.
> So all the other code in the kernel that does this sort of thing (and there's quite a bit
> outside of xfs and ocfs2) would not get the desired effect?

No. I expect, however, that very few people would ever see a
deadlock as a result - it's a pretty rare case to hit in
practice. XFS does make extensive use of vm_map_ram() in
GFP_NOFS context, however, when large directory block sizes are in
use, and we also have a history of lockdep throwing warnings under
memory pressure. In the end, the memalloc_noio_save() changes were
made to stop the frequent lockdep reports rather than actual
deadlocks.

> So, I did some digging in archives and found this thread from 2010 onward with various
> patches and rants.
> Not sure how I missed that before.
>
> Should we have another run at this I wonder?

By all means, but I don't think you'll have any more luck than
anyone else in the past. We've still got the problem of attitude
("vmalloc is not for general use") and making it actually work is
seen as "encouraging undesirable behaviour". If you can change
attitudes towards vmalloc first, then you'll be much more likely to
make progress in getting these problems solved....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-02-02 10:30:42

by Steven Whitehouse

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

Hi,

On 02/02/15 08:11, Dave Chinner wrote:
> On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
>> Hello!
>>
>> On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
>>
>>> On Sun, Feb 01, 2015 at 10:59:54PM -0500, [email protected] wrote:
>>>> From: Oleg Drokin <[email protected]>
>>>>
>>>> leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
>>>> it clearly does not want any shrinker activity within the fs itself.
>>>> convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
>>>> this goal.
>>>>
>>>> Signed-off-by: Oleg Drokin <[email protected]>
>>>> ---
>>>> fs/gfs2/dir.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/gfs2/dir.c b/fs/gfs2/dir.c
>>>> index c5a34f0..6371192 100644
>>>> --- a/fs/gfs2/dir.c
>>>> +++ b/fs/gfs2/dir.c
>>>> @@ -1896,7 +1896,8 @@ static int leaf_dealloc(struct gfs2_inode *dip, u32 index, u32 len,
>>>>
>>>> ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
>>>> if (ht == NULL)
>>>> - ht = vzalloc(size);
>>>> + ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
>>>> + PAGE_KERNEL);
>>> That, in the end, won't help as vmalloc still uses GFP_KERNEL
>>> allocations deep down in the PTE allocation code. See the hacks in
>>> the DM and XFS code to work around this. i.e. go look for callers of
>>> memalloc_noio_save(). It's ugly and grotesque, but we've got no
>>> other way to limit reclaim context because the MM devs won't pass
>>> the vmalloc gfp context down the stack to the PTE allocations....
>> Hm, interesting.
>> So all the other code in the kernel that does this sort of thing (and there's quite a bit
>> outside of xfs and ocfs2) would not get the desired effect?
> No. I expect, however, that very few people would ever see a
> deadlock as a result - it's a pretty rare sort of kernel case to hit
> in most cases. XFS does make extensive use of vm_map_ram() in
> GFP_NOFS context, however, when large directory block sizes are in
> use, and we also have a history of lockdep throwing warnings under
> memory pressure. In the end, the memalloc_noio_save() changes were
> made to stop the frequent lockdep reports rather than actual
> deadlocks.
Indeed. I think the patch is still an improvement, however, so I'm happy
to apply it while a better solution is found.

>> So, I did some digging in archives and found this thread from 2010 onward with various
>> patches and rants.
>> Not sure how I missed that before.
>>
>> Should we have another run at this I wonder?
> By all means, but I don't think you'll have any more luck than
> anyone else in the past. We've still got the problem of attitude
> ("vmalloc is not for general use") and making it actually work is
> seen as "encouraging undesirable behaviour". If you can change
> attitudes towards vmalloc first, then you'll be much more likely to
> make progress in getting these problems solved....
>

Well I don't know whether it has to be vmalloc that provides the
solution here... if memory fragmentation could be controlled then
kmalloc of larger contiguous chunks of memory could be done using that,
which might be a better solution overall. But I do agree that we need to
try and come to some kind of solution to this problem as it is one of
those things that has been rumbling on for a long time without a proper
solution.

I also wonder if vmalloc is still very slow? That was the case some time
ago when I noticed a problem in directory access times in gfs2, which
made us change to use kmalloc with a vmalloc fallback in the first place.
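
The fallback Steve mentions is the pattern the patch at the top of this
thread modifies; in outline it looks like this (a sketch, shown here with
the kvfree() helper that frees either kind of pointer):

```c
/*
 * "kmalloc with a vmalloc fallback": try a physically contiguous
 * allocation first (fast, no page-table setup), and fall back to
 * vmalloc only when memory is too fragmented for kmalloc to succeed.
 */
ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
if (ht == NULL)
	ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
		       PAGE_KERNEL);
if (!ht)
	return -ENOMEM;
/* ... use ht ... */
kvfree(ht);	/* dispatches to vfree() or kfree() as appropriate */
```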

Steve.

2015-02-03 22:33:56

by Dave Chinner

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

On Mon, Feb 02, 2015 at 10:30:29AM +0000, Steven Whitehouse wrote:
> Hi,
>
> On 02/02/15 08:11, Dave Chinner wrote:
> >On Mon, Feb 02, 2015 at 01:57:23AM -0500, Oleg Drokin wrote:
> >>Hello!
> >>
> >>On Feb 2, 2015, at 12:37 AM, Dave Chinner wrote:
> >>
> >>>On Sun, Feb 01, 2015 at 10:59:54PM -0500, [email protected] wrote:
> >>>>From: Oleg Drokin <[email protected]>
> >>>>
> >>>>leaf_dealloc uses vzalloc as a fallback to kzalloc(GFP_NOFS), so
> >>>>it clearly does not want any shrinker activity within the fs itself.
> >>>>convert vzalloc into __vmalloc(GFP_NOFS|__GFP_ZERO) to better achieve
> >>>>this goal.
....
> >>>> ht = kzalloc(size, GFP_NOFS | __GFP_NOWARN);
> >>>> if (ht == NULL)
> >>>>- ht = vzalloc(size);
> >>>>+ ht = __vmalloc(size, GFP_NOFS | __GFP_NOWARN | __GFP_ZERO,
> >>>>+ PAGE_KERNEL);
> >>>That, in the end, won't help as vmalloc still uses GFP_KERNEL
> >>>allocations deep down in the PTE allocation code. See the hacks in
> >>>the DM and XFS code to work around this. i.e. go look for callers of
> >>>memalloc_noio_save(). It's ugly and grotesque, but we've got no
> >>>other way to limit reclaim context because the MM devs won't pass
> >>>the vmalloc gfp context down the stack to the PTE allocations....
....
> >>So, I did some digging in archives and found this thread from
> >>2010 onward with various patches and rants. Not sure how I
> >>missed that before.
> >>
> >>Should we have another run at this I wonder?
> >
> >By all means, but I don't think you'll have any more luck than
> >anyone else in the past. We've still got the problem of attitude
> >("vmalloc is not for general use") and making it actually work is
> >seen as "encouraging undesirable behaviour". If you can change
> >attitudes towards vmalloc first, then you'll be much more likely to
> >make progress in getting these problems solved....
>
> Well I don't know whether it has to be vmalloc that provides the
> solution here... if memory fragmentation could be controlled then
> kmalloc of larger contiguous chunks of memory could be done using
> that, which might be a better solution overall.

Which has been said repeatedly for the past 15 years. And after all
this time kmalloc is still horribly unreliable for large contiguous
allocations. Hence we still have need for vmalloc for large
contiguous buffers because we have places where memory allocation
failure is simply not an option.

> But I do agree that
> we need to try and come to some kind of solution to this problem as
> it is one of those things that has been rumbling on for a long time
> without a proper solution.
>
> I also wonder if vmalloc is still very slow? That was the case some
> time ago when I noticed a problem in directory access times in gfs2,
> which made us change to use kmalloc with a vmalloc fallback in the
> first place,

Another of the "myths" about vmalloc. The speed and scalability of
vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
of those problems 5-6 years ago - see the rewrite from 2008 that
started with commit db64fe0 ("mm: rewrite vmap layer")....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-02-04 07:13:51

by Oleg Drokin

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

Hello!

On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
>> I also wonder if vmalloc is still very slow? That was the case some
>> time ago when I noticed a problem in directory access times in gfs2,
>> which made us change to use kmalloc with a vmalloc fallback in the
>> first place,
> Another of the "myths" about vmalloc. The speed and scalability of
> vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
> of those problems 5-6 years ago - see the rewrite from 2008 that
> started with commit db64fe0 ("mm: rewrite vmap layer")....

This actually might be less true than one would hope. At least somewhat
recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
show that there's huge contention on vmlist_lock, so if you have vmalloc
intense workloads, you get penalized heavily. Granted, this is a RHEL6 kernel,
but that is still (albeit heavily modified) 2.6.32, which was released at
the end of 2009, way after 2008.
I see that vmlist_lock is gone now, but, e.g., the heavily used
vmap_area_lock is still in place.

So of course with that in place there's every incentive to not use vmalloc
if at all possible. But if used, one would still hope it would be at least
safe to do even if somewhat slow.

Bye,
Oleg

2015-02-04 09:50:11

by Steven Whitehouse

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

Hi,

On 04/02/15 07:13, Oleg Drokin wrote:
> Hello!
>
> On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
>>> I also wonder if vmalloc is still very slow? That was the case some
>>> time ago when I noticed a problem in directory access times in gfs2,
>>> which made us change to use kmalloc with a vmalloc fallback in the
>>> first place,
>> Another of the "myths" about vmalloc. The speed and scalability of
>> vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
>> of those problems 5-6 years ago - see the rewrite from 2008 that
>> started with commit db64fe0 ("mm: rewrite vmap layer")....
> This actually might be less true than one would hope. At least somewhat
> recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
> show that there's huge contention on vmlist_lock, so if you have vmalloc
> intense workloads, you get penalized heavily. Granted, this is rhel6 kernel,
> but that is still (albeit heavily modified) 2.6.32, which was released at
> the end of 2009, way after 2008.
> I see that vmlist_lock is gone now, but e.g. vmap_area_lock that is heavily
> used is still in place.
>
> So of course with that in place there's every incentive to not use vmalloc
> if at all possible. But if used, one would still hope it would be at least
> safe to do even if somewhat slow.
>
> Bye,
> Oleg

I was thinking back to this thread:
https://lkml.org/lkml/2010/4/12/207

More recent than 2008, and although it resulted in a patch that
apparently fixed the problem, I don't think it was ever applied on the
basis that it was too risky and kmalloc was the proper solution
anyway.... I've not tested recently, so it may have been fixed in the
meantime.

Steve.

2015-02-05 11:50:09

by Dave Chinner

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

On Wed, Feb 04, 2015 at 02:13:29AM -0500, Oleg Drokin wrote:
> Hello!
>
> On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> >> I also wonder if vmalloc is still very slow? That was the case some
> >> time ago when I noticed a problem in directory access times in gfs2,
> >> which made us change to use kmalloc with a vmalloc fallback in the
> >> first place,
> > Another of the "myths" about vmalloc. The speed and scalability of
> > vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
> > of those problems 5-6 years ago - see the rewrite from 2008 that
> > started with commit db64fe0 ("mm: rewrite vmap layer")....
>
> This actually might be less true than one would hope. At least somewhat
> recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
> show that there's huge contention on vmlist_lock, so if you have vmalloc

vmlist_lock and the list it protected went away in 3.10.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-02-05 20:11:54

by Dave Chinner

Subject: Re: [PATCH] gfs2: use __vmalloc GFP_NOFS for fs-related allocations.

On Wed, Feb 04, 2015 at 09:49:50AM +0000, Steven Whitehouse wrote:
> Hi,
>
> On 04/02/15 07:13, Oleg Drokin wrote:
> >Hello!
> >
> >On Feb 3, 2015, at 5:33 PM, Dave Chinner wrote:
> >>>I also wonder if vmalloc is still very slow? That was the case some
> >>>time ago when I noticed a problem in directory access times in gfs2,
> >>>which made us change to use kmalloc with a vmalloc fallback in the
> >>>first place,
> >>Another of the "myths" about vmalloc. The speed and scalability of
> >>vmap/vmalloc is a long solved problem - Nick Piggin fixed the worst
> >>of those problems 5-6 years ago - see the rewrite from 2008 that
> >>started with commit db64fe0 ("mm: rewrite vmap layer")....
> >This actually might be less true than one would hope. At least somewhat
> >recent studies by LLNL (https://jira.hpdd.intel.com/browse/LU-4008)
> >show that there's huge contention on vmlist_lock, so if you have vmalloc
> >intense workloads, you get penalized heavily. Granted, this is rhel6 kernel,
> >but that is still (albeit heavily modified) 2.6.32, which was released at
> >the end of 2009, way after 2008.
> >I see that vmlist_lock is gone now, but e.g. vmap_area_lock that is heavily
> >used is still in place.
> >
> >So of course with that in place there's every incentive to not use vmalloc
> >if at all possible. But if it is used, one would still hope it would be at least
> >safe to do even if somewhat slow.
> >
> >Bye,
> > Oleg
>
> I was thinking back to this thread:
> https://lkml.org/lkml/2010/4/12/207
>
> More recent than 2008, and although it resulted in a patch that
> apparently fixed the problem, I don't think it was ever applied on
> the basis that it was too risky and kmalloc was the proper solution
> anyway.... I've not tested recently, so it may have been fixed in
> the meantime.

IIUC, the problem was resolved with a different fix back in 2011 - a
lookaside cache that avoids the overhead of searching the entire
list on every vmalloc.

commit 89699605fe7cfd8611900346f61cb6cbf179b10a
Author: Nick Piggin <[email protected]>
Date: Tue Mar 22 16:30:36 2011 -0700

mm: vmap area cache

Provide a free area cache for the vmalloc virtual address allocator, based
on the algorithm used by the user virtual memory allocator.

This reduces the number of rbtree operations and linear traversals over
the vmap extents in order to find a free area, by starting off at the last
point that a free area was found.
....
After this patch, the search will start from where it left off, giving
closer to an amortized O(1).

This is verified to solve regressions reported Steven in GFS2, and Avi in
KVM.
....
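
The idea in that commit can be illustrated with a toy userspace model
(all names here are invented for illustration; the real vmap code tracks
variable-sized areas in an rbtree and caches a search address plus hole
size, but the amortization argument is the same):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model of the free-area-cache idea: allocate fixed-size slots
 * by linear scan, but remember where the last successful search
 * ended ("cache_hint") and resume there, so a long run of
 * allocations costs amortized O(1) instead of O(n) per call.
 */
#define NSLOTS 1024

static bool used[NSLOTS];
static size_t cache_hint;	/* position to try first on the next search */

static long alloc_slot(void)
{
	for (size_t i = 0; i < NSLOTS; i++) {
		size_t slot = (cache_hint + i) % NSLOTS;
		if (!used[slot]) {
			used[slot] = true;
			cache_hint = slot + 1;	/* resume after this hole */
			return (long)slot;
		}
	}
	return -1;	/* address space exhausted */
}

static void free_slot(long slot)
{
	used[slot] = false;
	/* Like the real cache invalidation: never let the hint skip
	 * over a hole that just opened up below it. */
	if ((size_t)slot < cache_hint)
		cache_hint = (size_t)slot;
}
```

Without the hint, every alloc_slot() call would rescan all the slots
already handed out; with it, back-to-back allocations find a free slot
immediately, which is the regression fix Steven saw in GFS2.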

Cheers,

Dave.
--
Dave Chinner
[email protected]