2018-04-24 16:29:07

by Michal Hocko

Subject: vmalloc with GFP_NOFS

Hi,
it seems that we still have a few vmalloc users who perform GFP_NOFS
allocations:
drivers/mtd/ubi/io.c
fs/ext4/xattr.c
fs/gfs2/dir.c
fs/gfs2/quota.c
fs/nfs/blocklayout/extent_tree.c
fs/ubifs/debug.c
fs/ubifs/lprops.c
fs/ubifs/lpt_commit.c
fs/ubifs/orphan.c

Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
because we do have hardcoded GFP_KERNEL allocations deep inside the
vmalloc layers. That means that if GFP_NOFS really protects against
deadlocks caused by recursion into the fs then the vmalloc call is broken.

What to do about this? Well, there are two things. Firstly, it would be
really great to double check whether the GFP_NOFS is really needed. I
cannot judge that because I am not familiar with the code. It would be
great if the respective maintainers could check (hopefully
get_maintainer.pl pointed me to all relevant ones). If there is no
reclaim recursion issue then simply use the standard vmalloc (aka
GFP_KERNEL request).

If the use is really valid then we have a way to do the vmalloc
allocation properly. We have memalloc_nofs_{save,restore} scope api. How
does that work? You simply call memalloc_nofs_save when the reclaim
recursion critical section starts (e.g. when you take a lock which is
then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
when the critical section ends. _All_ allocations within that scope
will get GFP_NOFS semantics automagically. If you are not sure about the
scope itself then the easiest workaround is to wrap the vmalloc call
itself in a save/restore pair, with a big fat comment that this should
be revisited.
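
To make the save/restore pairing concrete, here is a minimal user-space
sketch of how such a scope behaves. It is a model, not kernel code: every
MODEL_ and model_ name is hypothetical and only mirrors the roles of
__GFP_FS, GFP_KERNEL, GFP_NOFS, the per-task flag and
memalloc_nofs_{save,restore}, and the bit values are made up.

```c
#include <assert.h>

/*
 * User-space model only. The MODEL_ names are hypothetical stand-ins
 * for __GFP_IO, __GFP_FS, GFP_KERNEL, GFP_NOFS and the per-task NOFS
 * flag; the bit values are invented for this example.
 */
#define MODEL_GFP_IO            0x1u
#define MODEL_GFP_FS            0x2u
#define MODEL_GFP_KERNEL        (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS          (MODEL_GFP_IO)
#define MODEL_PF_MEMALLOC_NOFS  0x1u

/* Stands in for the per-task flags word. */
static unsigned int model_task_flags;

/* Enter a NOFS scope; return the previous state so scopes can nest. */
static unsigned int model_memalloc_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

	model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting back the state saved on entry. */
static void model_memalloc_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* What every allocator in this model does with the caller's gfp mask. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Returns 0 when every step behaves as described in the text. */
static int model_scope_demo(void)
{
	unsigned int outer, inner;

	if (model_effective_gfp(MODEL_GFP_KERNEL) != MODEL_GFP_KERNEL)
		return 1;

	outer = model_memalloc_nofs_save();	/* e.g. lock taken here */
	if (model_effective_gfp(MODEL_GFP_KERNEL) != MODEL_GFP_NOFS)
		return 2;

	inner = model_memalloc_nofs_save();	/* nested scope */
	model_memalloc_nofs_restore(inner);
	if (model_effective_gfp(MODEL_GFP_KERNEL) != MODEL_GFP_NOFS)
		return 3;			/* still inside the outer scope */

	model_memalloc_nofs_restore(outer);	/* e.g. lock dropped here */
	if (model_effective_gfp(MODEL_GFP_KERNEL) != MODEL_GFP_KERNEL)
		return 4;
	return 0;
}
```

The point of returning the old state from save is that scopes nest:
restoring an inner scope must not drop the restriction of an outer one.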

Does that sound like something that can be done in a reasonable time?
I have tried to bring this up in the past but our speed is glacial and
there are attempts to do hacks like checking for abusers inside the
vmalloc which is just too ugly to live.

Please do not hesitate to get back to me if something is not clear.

Thanks!
--
Michal Hocko
SUSE Labs


2018-04-24 16:48:33

by Mikulas Patocka

Subject: Re: vmalloc with GFP_NOFS



On Tue, 24 Apr 2018, Michal Hocko wrote:

> Hi,
> it seems that we still have few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c
> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c
>
> Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> because we do have hardcoded GFP_KERNEL allocations deep inside the
> vmalloc layers. That means that if GFP_NOFS really protects from
> recursion into the fs deadlocks then the vmalloc call is broken.
>
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code. It would be
> great if the respective maintainers (hopefully get_maintainer.sh pointed
> me to all relevant ones). If there is not reclaim recursion issue then
> simply use the standard vmalloc (aka GFP_KERNEL request).
>
> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantic automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.
>
> Does that sound like something that can be done in a reasonable time?
> I have tried to bring this up in the past but our speed is glacial and
> there are attempts to do hacks like checking for abusers inside the
> vmalloc which is just too ugly to live.
>
> Please do not hesitate to get back to me if something is not clear.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs

I made a patch that adds memalloc_noio/fs_save around these calls a year
ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html

Mikulas

2018-04-24 16:57:07

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 12:46:55, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > Hi,
> > it seems that we still have few vmalloc users who perform GFP_NOFS
> > allocation:
> > drivers/mtd/ubi/io.c
> > fs/ext4/xattr.c
> > fs/gfs2/dir.c
> > fs/gfs2/quota.c
> > fs/nfs/blocklayout/extent_tree.c
> > fs/ubifs/debug.c
> > fs/ubifs/lprops.c
> > fs/ubifs/lpt_commit.c
> > fs/ubifs/orphan.c
> >
> > Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> > because we do have hardcoded GFP_KERNEL allocations deep inside the
> > vmalloc layers. That means that if GFP_NOFS really protects from
> > recursion into the fs deadlocks then the vmalloc call is broken.
> >
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code. It would be
> > great if the respective maintainers (hopefully get_maintainer.sh pointed
> > me to all relevant ones). If there is not reclaim recursion issue then
> > simply use the standard vmalloc (aka GFP_KERNEL request).
> >
> > If the use is really valid then we have a way to do the vmalloc
> > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > does that work? You simply call memalloc_nofs_save when the reclaim
> > recursion critical section starts (e.g. when you take a lock which is
> > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > when the critical section ends. _All_ allocations within that scope
> > will get GFP_NOFS semantic automagically. If you are not sure about the
> > scope itself then the easiest workaround is to wrap the vmalloc itself
> > with a big fat comment that this should be revisited.
> >
> > Does that sound like something that can be done in a reasonable time?
> > I have tried to bring this up in the past but our speed is glacial and
> > there are attempts to do hacks like checking for abusers inside the
> > vmalloc which is just too ugly to live.
> >
> > Please do not hesitate to get back to me if something is not clear.
> >
> > Thanks!
> > --
> > Michal Hocko
> > SUSE Labs
>
> I made a patch that adds memalloc_noio/fs_save around these calls a year
> ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html

Yeah, and that is the wrong approach. Let's try to fix this properly
this time. As the above outlines, the worst case we can end up with
mid-term is wrapping vmalloc calls in the scope api with a TODO. But I am
pretty sure the respective maintainers can come up with a better
solution. I am definitely willing to help here.
--
Michal Hocko
SUSE Labs

2018-04-24 17:08:01

by Mikulas Patocka

Subject: Re: vmalloc with GFP_NOFS



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 12:46:55, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > Hi,
> > > it seems that we still have few vmalloc users who perform GFP_NOFS
> > > allocation:
> > > drivers/mtd/ubi/io.c
> > > fs/ext4/xattr.c
> > > fs/gfs2/dir.c
> > > fs/gfs2/quota.c
> > > fs/nfs/blocklayout/extent_tree.c
> > > fs/ubifs/debug.c
> > > fs/ubifs/lprops.c
> > > fs/ubifs/lpt_commit.c
> > > fs/ubifs/orphan.c
> > >
> > > Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> > > because we do have hardcoded GFP_KERNEL allocations deep inside the
> > > vmalloc layers. That means that if GFP_NOFS really protects from
> > > recursion into the fs deadlocks then the vmalloc call is broken.
> > >
> > > What to do about this? Well, there are two things. Firstly, it would be
> > > really great to double check whether the GFP_NOFS is really needed. I
> > > cannot judge that because I am not familiar with the code. It would be
> > > great if the respective maintainers (hopefully get_maintainer.sh pointed
> > > me to all relevant ones). If there is not reclaim recursion issue then
> > > simply use the standard vmalloc (aka GFP_KERNEL request).
> > >
> > > If the use is really valid then we have a way to do the vmalloc
> > > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > > does that work? You simply call memalloc_nofs_save when the reclaim
> > > recursion critical section starts (e.g. when you take a lock which is
> > > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > > when the critical section ends. _All_ allocations within that scope
> > > will get GFP_NOFS semantic automagically. If you are not sure about the
> > > scope itself then the easiest workaround is to wrap the vmalloc itself
> > > with a big fat comment that this should be revisited.
> > >
> > > Does that sound like something that can be done in a reasonable time?
> > > I have tried to bring this up in the past but our speed is glacial and
> > > there are attempts to do hacks like checking for abusers inside the
> > > vmalloc which is just too ugly to live.
> > >
> > > Please do not hesitate to get back to me if something is not clear.
> > >
> > > Thanks!
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > I made a patch that adds memalloc_noio/fs_save around these calls a year
> > ago: http://lkml.iu.edu/hypermail/linux/kernel/1707.0/01376.html
>
> Yeah, and that is the wrong approach.

It is crude, but it fixes the deadlock possibility. Then, the maintainers
will have a lot of time to refactor the code and move these
memalloc_noio_save calls to the proper scope.

> Let's try to fix this properly
> this time. As the above outlines, the worst case we can end up mid-term
> would be to wrap vmalloc calls with the scope api with a TODO. But I am
> pretty sure the respective maintainers can come up with a better
> solution. I am definitely willing to help here.
> --
> Michal Hocko
> SUSE Labs

Mikulas

2018-04-24 18:37:39

by Theodore Ts'o

Subject: Re: vmalloc with GFP_NOFS

On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> fs/ext4/xattr.c
>
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code.

*Most* of the time it's not needed, but there are times when it is.
We could be smarter about sending down GFP_NOFS only when it is
needed. If we are sending so many GFP_NOFS allocations that
it's causing heartburn, we could fix this. (xattr commands are rare
enough that I didn't think it was worth it to modulate the GFP flags
for this particular case, but we could make it smarter if it would
help.)

> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantic automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.

This is something we could do in ext4. It hadn't been high priority,
because we've been rather overloaded. As a suggestion, could you take
documentation about how to convert to the memalloc_nofs_{save,restore}
scope api (which I think you've written about in e-mails at length
before), and put that into a file in Documentation/core-api?

The question I was trying to figure out which triggered the above
request is how/whether to gradually convert to that scope API. Is it
safe to add the memalloc_nofs_{save,restore} to code and keep the
GFP_NOFS flags until we're sure we got it all right, for all of the
code paths, and then drop the GFP_NOFS?

Thanks,

- Ted

2018-04-24 19:11:46

by Richard Weinberger

Subject: Re: vmalloc with GFP_NOFS

[resending without html ...]

Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> Hi,
> it seems that we still have few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c

UBI is not a big deal. We use it here like in UBIFS for debugging
when self-checks are enabled.

> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c

All users in UBIFS are debugging code and some error reporting.
No fast paths.
I think we can switch to preallocation + locking without much hassle.
I can prepare a patch.

Thanks,
//richard

2018-04-24 19:27:12

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 14:35:36, Theodore Ts'o wrote:
> On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> > fs/ext4/xattr.c
> >
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code.
>
> *Most* of the time it's not needed, but there are times when it is.
> We could be more smart about sending down GFP_NOFS only when it is
> needed.

Well, the primary idea is that you do not have to. All you care about is
to use the scope api where it matters + a comment describing the
reclaim recursion context (e.g. this lock will be held in the reclaim
path here and there).

> If we are sending too many GFP_NOFS's allocations such that
> it's causing heartburn, we could fix this. (xattr commands are rare
> enough that I didn't think it was worth it to modulate the GFP flags
> for this particular case, but we could make it be smarter if it would
> help.)

Well, the vmalloc is actually a correctness issue rather than a
heartburn...

> > If the use is really valid then we have a way to do the vmalloc
> > allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> > does that work? You simply call memalloc_nofs_save when the reclaim
> > recursion critical section starts (e.g. when you take a lock which is
> > then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> > when the critical section ends. _All_ allocations within that scope
> > will get GFP_NOFS semantic automagically. If you are not sure about the
> > scope itself then the easiest workaround is to wrap the vmalloc itself
> > with a big fat comment that this should be revisited.
>
> This is something we could do in ext4. It hadn't been high priority,
> because we've been rather overloaded.

Well, ext/jbd already has scopes defined for the transaction context so
anything down that road can be converted to GFP_KERNEL (well, unless the
same code path is shared outside of the transaction context and still
requires a protection). It would be really great to identify other
contexts and slowly move away from the explicit GFP_NOFS. Are you aware
of other contexts?

> As a suggestion, could you take
> documentation about how to convert to the memalloc_nofs_{save,restore}
> scope api (which I think you've written about e-mails at length
> before), and put that into a file in Documentation/core-api?

I can.

> The question I was trying to figure out which triggered the above
> request is how/whether to gradually convert to that scope API. Is it
> safe to add the memalloc_nofs_{save,restore} to code and keep the
> GFP_NOFS flags until we're sure we got it all right, for all of the
> code paths, and then drop the GFP_NOFS?

The first stage is to define and document those scopes. I have provided
a debugging patch [1] in the past that would dump_stack when seeing an
explicit GFP_NOFS from a scope which could help to eliminate existing
users.
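
The idea behind such a debugging aid can be sketched in user space as
follows. This is not the patch from [1], only a hypothetical model of
the check it describes: all MODEL_ and model_ names and bit values are
invented, and a counter stands in for the stack dump.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for __GFP_IO, __GFP_FS, GFP_KERNEL, GFP_NOFS. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    (MODEL_GFP_IO)

/* True while the (modelled) task is inside a NOFS scope. */
static bool model_in_nofs_scope;

/* Counts explicit NOFS requests seen inside a scope. */
static unsigned int model_redundant_nofs_reports;

/* Model of an allocator entry point with the debugging check added. */
static unsigned int model_alloc_gfp(unsigned int gfp)
{
	/* "Explicit NOFS" here means: IO allowed, FS not allowed. */
	bool explicit_nofs = (gfp & MODEL_GFP_IO) && !(gfp & MODEL_GFP_FS);

	if (model_in_nofs_scope) {
		if (explicit_nofs)
			model_redundant_nofs_reports++;	/* report the caller */
		gfp &= ~MODEL_GFP_FS;
	}
	return gfp;
}

/* Returns 0 when the model behaves as described. */
static int model_debug_demo(void)
{
	model_in_nofs_scope = false;
	model_redundant_nofs_reports = 0;

	/* Outside any scope an explicit NOFS request is not reported. */
	if (model_alloc_gfp(MODEL_GFP_NOFS) != MODEL_GFP_NOFS ||
	    model_redundant_nofs_reports != 0)
		return 1;

	model_in_nofs_scope = true;

	/* A plain request inside the scope is restricted, not reported. */
	if (model_alloc_gfp(MODEL_GFP_KERNEL) != MODEL_GFP_NOFS ||
	    model_redundant_nofs_reports != 0)
		return 2;

	/* An explicit NOFS request inside the scope gets reported. */
	if (model_alloc_gfp(MODEL_GFP_NOFS) != MODEL_GFP_NOFS ||
	    model_redundant_nofs_reports != 1)
		return 3;

	model_in_nofs_scope = false;
	return 0;
}
```

In this model an explicit NOFS request inside a scope yields the same
mask as a plain request would, so the report only marks call sites whose
flag has become redundant.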

[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs

2018-04-24 19:28:22

by Steven Whitehouse

Subject: Re: vmalloc with GFP_NOFS

Hi,


On 24/04/18 17:27, Michal Hocko wrote:
> Hi,
> it seems that we still have few vmalloc users who perform GFP_NOFS
> allocation:
> drivers/mtd/ubi/io.c
> fs/ext4/xattr.c
> fs/gfs2/dir.c
> fs/gfs2/quota.c
> fs/nfs/blocklayout/extent_tree.c
> fs/ubifs/debug.c
> fs/ubifs/lprops.c
> fs/ubifs/lpt_commit.c
> fs/ubifs/orphan.c
>
> Unfortunately vmalloc doesn't support GFP_NOFS semantics properly
> because we do have hardcoded GFP_KERNEL allocations deep inside the
> vmalloc layers. That means that if GFP_NOFS really protects from
> recursion into the fs deadlocks then the vmalloc call is broken.
>
> What to do about this? Well, there are two things. Firstly, it would be
> really great to double check whether the GFP_NOFS is really needed. I
> cannot judge that because I am not familiar with the code. It would be
> great if the respective maintainers (hopefully get_maintainer.sh pointed
> me to all relevant ones). If there is not reclaim recursion issue then
> simply use the standard vmalloc (aka GFP_KERNEL request).
For GFS2, and I suspect for other fs too, it is really needed. We don't
want to enter reclaim while holding filesystem locks.

> If the use is really valid then we have a way to do the vmalloc
> allocation properly. We have memalloc_nofs_{save,restore} scope api. How
> does that work? You simply call memalloc_nofs_save when the reclaim
> recursion critical section starts (e.g. when you take a lock which is
> then used in the reclaim path - e.g. shrinker) and memalloc_nofs_restore
> when the critical section ends. _All_ allocations within that scope
> will get GFP_NOFS semantic automagically. If you are not sure about the
> scope itself then the easiest workaround is to wrap the vmalloc itself
> with a big fat comment that this should be revisited.
>
> Does that sound like something that can be done in a reasonable time?
> I have tried to bring this up in the past but our speed is glacial and
> there are attempts to do hacks like checking for abusers inside the
> vmalloc which is just too ugly to live.
>
> Please do not hesitate to get back to me if something is not clear.
>
> Thanks!

It would be good to fix this, and it has been known as an issue for a
long time. We might well be able to make use of the new API though. It
might be as simple as adding the calls when we get & release glocks, but
I'd have to check the code to be sure.

Steve.


2018-04-24 19:31:06

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 21:03:43, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> > fs/ubifs/debug.c
>
> This one is just for debugging.
> So, preallocating + locking would not hurt much.
>
> > fs/ubifs/lprops.c
>
> Ditto.
>
> > fs/ubifs/lpt_commit.c
>
> Here we use it also only in debugging mode and in one case for
> fatal error reporting.
> No hot paths.
>
> > fs/ubifs/orphan.c
>
> Also only for debugging.
> Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> I can prepare a patch.

Cool!

Anyway, if UBIFS has some reclaim recursion critical sections in general
it would be really great to have them documented and that is where the
scope api is really handy. Just add the scope and document what is the
recursion issue. This will help people reading the code as well. Ideally
there shouldn't be any explicit GFP_NOFS in the code.

Thanks for a quick turnaround.

--
Michal Hocko
SUSE Labs

2018-04-24 20:11:02

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 20:26:23, Steven Whitehouse wrote:
[...]
> It would be good to fix this, and it has been known as an issue for a long
> time. We might well be able to make use of the new API though. It might be
> as simple as adding the calls when we get & release glocks, but I'd have to
> check the code to be sure,

Yeah, starting with annotating those locking contexts and documenting
how they are used in reclaim is a great first step. This has to
be done per-fs obviously.
--
Michal Hocko
SUSE Labs

2018-04-24 22:20:23

by Richard Weinberger

Subject: Re: vmalloc with GFP_NOFS

Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > Also only for debugging.
> > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > I can prepare a patch.
>
> Cool!
>
> Anyway, if UBIFS has some reclaim recursion critical sections in general
> it would be really great to have them documented and that is where the
> scope api is really handy. Just add the scope and document what is the
> recursion issue. This will help people reading the code as well. Ideally
> there shouldn't be any explicit GFP_NOFS in the code.

So in a perfect world a filesystem calls memalloc_nofs_save/restore and
always uses GFP_KERNEL for kmalloc/vmalloc?

Thanks,
//richard



2018-04-24 23:11:35

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > Also only for debugging.
> > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > I can prepare a patch.
> >
> > Cool!
> >
> > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > it would be really great to have them documented and that is where the
> > scope api is really handy. Just add the scope and document what is the
> > recursion issue. This will help people reading the code as well. Ideally
> > there shouldn't be any explicit GFP_NOFS in the code.
>
> So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> always uses GFP_KERNEL for kmalloc/vmalloc?

Exactly! And in a dream world those memalloc_nofs_save calls act as
documentation of the reclaim recursion ;)
--
Michal Hocko
SUSE Labs

2018-04-24 23:18:41

by Mikulas Patocka

Subject: Re: vmalloc with GFP_NOFS



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > Also only for debugging.
> > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > I can prepare a patch.
> > >
> > > Cool!
> > >
> > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > it would be really great to have them documented and that is where the
> > > scope api is really handy. Just add the scope and document what is the
> > > recursion issue. This will help people reading the code as well. Ideally
> > > there shouldn't be any explicit GFP_NOFS in the code.
> >
> > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > always uses GFP_KERNEL for kmalloc/vmalloc?
>
> Exactly! And in a dream world those memalloc_nofs_save act as a
> documentation of the reclaim recursion documentation ;)
> --
> Michal Hocko
> SUSE Labs

BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
one that prevents both I/O and FS recursion?

memalloc_nofs_save allows submitting bios to I/O stack and the bios
created under memalloc_nofs_save could be sent to the loop device and the
loop device calls the filesystem...

Mikulas

2018-04-24 23:26:46

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > > Also only for debugging.
> > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > > I can prepare a patch.
> > > >
> > > > Cool!
> > > >
> > > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > > it would be really great to have them documented and that is where the
> > > > scope api is really handy. Just add the scope and document what is the
> > > > recursion issue. This will help people reading the code as well. Ideally
> > > > there shouldn't be any explicit GFP_NOFS in the code.
> > >
> > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > always uses GFP_KERNEL for kmalloc/vmalloc?
> >
> > Exactly! And in a dream world those memalloc_nofs_save act as a
> > documentation of the reclaim recursion documentation ;)
> > --
> > Michal Hocko
> > SUSE Labs
>
> BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
> one that prevents both I/O and FS recursion?

Why should FS usage stop IO altogether?

> memalloc_nofs_save allows submitting bios to I/O stack and the bios
> created under memalloc_nofs_save could be sent to the loop device and the
> loop device calls the filesystem...

Don't those use NOIO context?
--
Michal Hocko
SUSE Labs

2018-04-25 12:45:05

by Mikulas Patocka

Subject: Re: vmalloc with GFP_NOFS



On Tue, 24 Apr 2018, Michal Hocko wrote:

> On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > > > Also only for debugging.
> > > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > > > I can prepare a patch.
> > > > >
> > > > > Cool!
> > > > >
> > > > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > > > it would be really great to have them documented and that is where the
> > > > > scope api is really handy. Just add the scope and document what is the
> > > > > recursion issue. This will help people reading the code as well. Ideally
> > > > > there shouldn't be any explicit GFP_NOFS in the code.
> > > >
> > > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > > always uses GFP_KERNEL for kmalloc/vmalloc?
> > >
> > > Exactly! And in a dream world those memalloc_nofs_save act as a
> > > documentation of the reclaim recursion documentation ;)
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
> > one that prevents both I/O and FS recursion?
>
> Why should FS usage stop IO altogether?

Because the IO may reach loop and loop may redirect it to the same
filesystem that is running under memalloc_nofs_save and deadlock.

> > memalloc_nofs_save allows submitting bios to I/O stack and the bios
> > created under memalloc_nofs_save could be sent to the loop device and the
> > loop device calls the filesystem...
>
> Don't those use NOIO context?

What do you mean?

Mikulas

2018-04-25 14:48:48

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 25-04-18 08:43:32, Mikulas Patocka wrote:
>
>
> On Tue, 24 Apr 2018, Michal Hocko wrote:
>
> > On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Wed 25-04-18 00:18:40, Richard Weinberger wrote:
> > > > > Am Dienstag, 24. April 2018, 21:28:03 CEST schrieb Michal Hocko:
> > > > > > > Also only for debugging.
> > > > > > > Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> > > > > > > I can prepare a patch.
> > > > > >
> > > > > > Cool!
> > > > > >
> > > > > > Anyway, if UBIFS has some reclaim recursion critical sections in general
> > > > > > it would be really great to have them documented and that is where the
> > > > > > scope api is really handy. Just add the scope and document what is the
> > > > > > recursion issue. This will help people reading the code as well. Ideally
> > > > > > there shouldn't be any explicit GFP_NOFS in the code.
> > > > >
> > > > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > > > always uses GFP_KERNEL for kmalloc/vmalloc?
> > > >
> > > > Exactly! And in a dream world those memalloc_nofs_save act as a
> > > > documentation of the reclaim recursion documentation ;)
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
> > >
> > > BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
> > > one that prevents both I/O and FS recursion?
> >
> > Why should FS usage stop IO altogether?
>
> Because the IO may reach loop and loop may redirect it to the same
> filesystem that is running under memalloc_nofs_save and deadlock.

So what is the difference with the current GFP_NOFS?

> > > memalloc_nofs_save allows submitting bios to I/O stack and the bios
> > > created under memalloc_nofs_save could be sent to the loop device and the
> > > loop device calls the filesystem...
> >
> > Don't those use NOIO context?
>
> What do you mean?

That the loop driver should make sure it will not recurse. The scope API
doesn't add anything new here.
--
Michal Hocko
SUSE Labs

2018-04-25 15:26:56

by Mikulas Patocka

Subject: Re: vmalloc with GFP_NOFS



On Wed, 25 Apr 2018, Michal Hocko wrote:

> On Wed 25-04-18 08:43:32, Mikulas Patocka wrote:
> >
> >
> > On Tue, 24 Apr 2018, Michal Hocko wrote:
> >
> > > On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> > > >
> > > >
> > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > >
> > > > > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > > > > always uses GFP_KERNEL for kmalloc/vmalloc?
> > > > >
> > > > > Exactly! And in a dream world those memalloc_nofs_save act as a
> > > > > documentation of the reclaim recursion documentation ;)
> > > > > --
> > > > > Michal Hocko
> > > > > SUSE Labs
> > > >
> > > > BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
> > > > one that prevents both I/O and FS recursion?
> > >
> > > Why should FS usage stop IO altogether?
> >
> > Because the IO may reach loop and loop may redirect it to the same
> > filesystem that is running under memalloc_nofs_save and deadlock.
>
> So what is the difference with the current GFP_NOFS?

My point is that filesystems should use GFP_NOIO too. If
alloc_pages(GFP_NOFS) issues some random I/O to some block device, the I/O
may end up being redirected (via the block loop device) to the filesystem
that is calling alloc_pages(GFP_NOFS).

> > > > memalloc_nofs_save allows submitting bios to I/O stack and the bios
> > > > created under memalloc_nofs_save could be sent to the loop device and the
> > > > loop device calls the filesystem...
> > >
> > > Don't those use NOIO context?
> >
> > What do you mean?
>
> That the loop driver should make sure it will not recurse. The scope API
> doesn't add anything new here.

The loop driver doesn't recurse. The loop driver will add the request to a
queue and wake up a thread that processes it. But if the request queue is
full, __get_request will wait until the loop thread finishes processing
some other request.

It doesn't recurse, but it waits until the filesystem makes some progress.

Mikulas

2018-04-25 16:59:41

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 25-04-18 11:25:09, Mikulas Patocka wrote:
>
>
> On Wed, 25 Apr 2018, Michal Hocko wrote:
>
> > On Wed 25-04-18 08:43:32, Mikulas Patocka wrote:
> > >
> > >
> > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > >
> > > > On Tue 24-04-18 19:17:12, Mikulas Patocka wrote:
> > > > >
> > > > >
> > > > > On Tue, 24 Apr 2018, Michal Hocko wrote:
> > > > >
> > > > > > > So in a perfect world a filesystem calls memalloc_nofs_save/restore and
> > > > > > > always uses GFP_KERNEL for kmalloc/vmalloc?
> > > > > >
> > > > > > Exactly! And in a dream world those memalloc_nofs_save act as a
> > > > > > documentation of the reclaim recursion documentation ;)
> > > > > > --
> > > > > > Michal Hocko
> > > > > > SUSE Labs
> > > > >
> > > > > BTW. should memalloc_nofs_save and memalloc_noio_save be merged into just
> > > > > one that prevents both I/O and FS recursion?
> > > >
> > > > Why should FS usage stop IO altogether?
> > >
> > > Because the IO may reach loop and loop may redirect it to the same
> > > filesystem that is running under memalloc_nofs_save and deadlock.
> >
> > So what is the difference with the current GFP_NOFS?
>
> My point is that filesystems should use GFP_NOIO too. If
> alloc_pages(GFP_NOFS) issues some random I/O to some block device, the I/O
> may end up being redirected (via block loop device) to the filesystem
> that is calling alloc_pages(GFP_NOFS).

Talk to FS people, but I believe there is a good reason to distinguish
the two.

--
Michal Hocko
SUSE Labs

2018-05-09 13:43:05

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 13:25:42, Michal Hocko wrote:
[...]
> > As a suggestion, could you take
> > documentation about how to convert to the memalloc_nofs_{save,restore}
> > scope api (which I think you've written about e-mails at length
> > before), and put that into a file in Documentation/core-api?
>
> I can.

Does something like the below sound reasonable/helpful?
---
=================================
GFP masks used from FS/IO context
=================================

:Date: May, 2018
:Author: Michal Hocko <[email protected]>

Introduction
============

FS resp. IO submitting code paths have to be careful when allocating
memory to prevent from potential recursion deadlocks caused by direct
memory reclaim calling back into the FS/IO path and block on already
held resources (e.g. locks). Traditional way to avoid this problem
is to clear __GFP_FS resp. __GFP_IO (note the later implies clearing
the first as well) in the gfp mask when calling an allocator. GFP_NOFS
resp. GFP_NOIO can be used as shortcut.

This has been the traditional way to avoid deadlocks since ages. It
turned out though that above approach has led to abuses when the restricted
gfp mask is used "just in case" without a deeper consideration which leads
to problems because an excessive use of GFP_NOFS/GFP_NOIO can lead to
memory over-reclaim or other memory reclaim issues.

New API
=======

Since 4.12 we do have a generic scope API for both NOFS and NOIO context
``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
``memalloc_noio_restore`` which allow to mark a scope to be a critical
section from the memory reclaim recursion into FS/IO POV. Any allocation
from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
mask so no memory allocation can recurse back in the FS/IO.

FS/IO code then simply calls the appropriate save function right at
the layer where a lock taken from the reclaim context (e.g. shrinker)
is taken and the corresponding restore function when the lock is
released. All that ideally along with an explanation what is the reclaim
context for easier maintenance.
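
The save/restore semantic described above can be sketched in plain userspace C. This is only a model under stated assumptions, not the kernel implementation: every ``model_`` / ``MODEL_`` name is hypothetical and the bit values are arbitrary. It is meant to show that save returns the previous scope state, restore puts that state back, and the allocator side strips the FS (or FS and IO) bit from whatever mask the caller passed.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's gfp bits; the values are arbitrary. */
#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-ins for the per-task scope flags. */
#define MODEL_PF_NOFS 0x1u
#define MODEL_PF_NOIO 0x2u

/* Models the flags word of the current task. */
unsigned int model_task_flags;

/* Enter a NOFS scope; the return value must be handed to the restore call. */
unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

/* Leave a NOFS scope, putting back whatever state the matching save saw. */
void model_nofs_restore(unsigned int old)
{
	assert((old & ~MODEL_PF_NOFS) == 0);	/* only a value from save is valid */
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

/* What an allocator would do with the caller's mask before reclaiming. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);	/* NOIO implies NOFS */
	else if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

Because restore puts back the saved state instead of unconditionally clearing the flag, a nested save/restore pair inside an outer scope leaves the outer scope in force, which is what lets lower layers use the API without knowing what their callers did.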

What about __vmalloc(GFP_NOFS)
==============================

vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
GFP_KERNEL allocations deep inside the allocator which are quit non-trivial
to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
almost always a bug. The good news is that the NOFS/NOIO semantic can be
achieved by the scope api.

In the ideal world, upper layers should already mark dangerous contexts
and so no special care is required and vmalloc should be called without
any problems. Sometimes if the context is not really clear or there are
layering violations then the recommended way around that is to wrap ``vmalloc``
by the scope API with a comment explaining the problem.
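
To illustrate why passing a restricted mask is not enough while the scope is, here is a small userspace model. It is a sketch under stated assumptions rather than real vmalloc code: all ``model_`` / ``MODEL_`` names are hypothetical, and the "internal allocation" is reduced to recording which mask it would have used.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel's gfp bits; the values are arbitrary. */
#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   MODEL_GFP_IO

unsigned int model_nofs_scope;		/* 1 while inside a NOFS scope */
unsigned int model_last_internal_gfp;	/* mask used by the internal allocation */

unsigned int model_nofs_save(void)
{
	unsigned int old = model_nofs_scope;

	model_nofs_scope = 1;
	return old;
}

void model_nofs_restore(unsigned int old)
{
	model_nofs_scope = old;
}

/* The scope, not the caller's mask, decides whether the FS bit survives. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	return model_nofs_scope ? (gfp & ~MODEL_GFP_FS) : gfp;
}

/*
 * Models an allocator that ignores the caller's mask for its internal
 * allocations and uses a hardcoded GFP_KERNEL there instead.
 */
void *model_vmalloc(size_t size, unsigned int gfp)
{
	(void)gfp;	/* never reaches the internal allocation */
	model_last_internal_gfp = model_effective_gfp(MODEL_GFP_KERNEL);
	return malloc(size);
}

/* The workaround from the text: wrap the call in a scope, with a comment. */
void *model_fs_alloc(size_t size)
{
	unsigned int old;
	void *buf;

	/*
	 * XXX: revisit. The scope belongs where the lock shared with
	 * reclaim is taken, not around this single allocation.
	 */
	old = model_nofs_save();
	buf = model_vmalloc(size, MODEL_GFP_KERNEL);
	model_nofs_restore(old);
	return buf;
}
```

In this model, calling ``model_vmalloc(size, MODEL_GFP_NOFS)`` outside a scope still leaves the FS bit set for the internal allocation, whereas ``model_fs_alloc()`` clears it. In real kernel code the wrapper would use ``memalloc_nofs_save()`` and ``memalloc_nofs_restore()`` around a plain ``vmalloc()`` call.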
--
Michal Hocko
SUSE Labs

2018-05-09 14:17:15

by David Sterba

Subject: Re: vmalloc with GFP_NOFS

On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
> On Tue 24-04-18 13:25:42, Michal Hocko wrote:
> [...]
> > > As a suggestion, could you take
> > > documentation about how to convert to the memalloc_nofs_{save,restore}
> > > scope api (which I think you've written about e-mails at length
> > > before), and put that into a file in Documentation/core-api?
> >
> > I can.
>
> Does something like the below sound reasonable/helpful?

Sounds good to me and matches how we've been using the vmalloc/nofs so
far.

2018-05-09 15:16:35

by Darrick J. Wong

Subject: Re: vmalloc with GFP_NOFS

On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
> On Tue 24-04-18 13:25:42, Michal Hocko wrote:
> [...]
> > > As a suggestion, could you take
> > > documentation about how to convert to the memalloc_nofs_{save,restore}
> > > scope api (which I think you've written about e-mails at length
> > > before), and put that into a file in Documentation/core-api?
> >
> > I can.
>
> Does something like the below sound reasonable/helpful?
> ---
> =================================
> GFP masks used from FS/IO context
> =================================
>
> :Date: Mapy, 2018
> :Author: Michal Hocko <[email protected]>
>
> Introduction
> ============
>
> FS resp. IO submitting code paths have to be careful when allocating

Not sure what 'FS resp. IO' means here -- 'FS and IO' ?

(Or is this one of those things where this looks like plain English text
but in reality it's some sort of markup that I'm not so familiar with?)

Confused because I've seen 'resp.' used as shorthand for
'responsible'...

> memory to prevent from potential recursion deadlocks caused by direct
> memory reclaim calling back into the FS/IO path and block on already
> held resources (e.g. locks). Traditional way to avoid this problem

'The traditional way to avoid this deadlock problem...'

> is to clear __GFP_FS resp. __GFP_IO (note the later implies clearing
> the first as well) in the gfp mask when calling an allocator. GFP_NOFS
> resp. GFP_NOIO can be used as shortcut.
>
> This has been the traditional way to avoid deadlocks since ages. It

I think this sentence is a little redundant with the previous sentence,
you could chop it out and join this paragraph to the one before it.

> turned out though that above approach has led to abuses when the restricted
> gfp mask is used "just in case" without a deeper consideration which leads
> to problems because an excessive use of GFP_NOFS/GFP_NOIO can lead to
> memory over-reclaim or other memory reclaim issues.
>
> New API
> =======
>
> Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> ``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> ``memalloc_noio_restore`` which allow to mark a scope to be a critical
> section from the memory reclaim recursion into FS/IO POV. Any allocation
> from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> mask so no memory allocation can recurse back in the FS/IO.
>
> FS/IO code then simply calls the appropriate save function right at
> the layer where a lock taken from the reclaim context (e.g. shrinker)
> is taken and the corresponding restore function when the lock is
> released. All that ideally along with an explanation what is the reclaim
> context for easier maintenance.
>
> What about __vmalloc(GFP_NOFS)
> ==============================
>
> vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> GFP_KERNEL allocations deep inside the allocator which are quit non-trivial

...which are quite non-trivial...

> to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> almost always a bug. The good news is that the NOFS/NOIO semantic can be
> achieved by the scope api.
>
> In the ideal world, upper layers should already mark dangerous contexts
> and so no special care is required and vmalloc should be called without
> any problems. Sometimes if the context is not really clear or there are
> layering violations then the recommended way around that is to wrap ``vmalloc``
> by the scope API with a comment explaining the problem.

Otherwise looks ok to me based on my understanding of how all this is
supposed to work...

Reviewed-by: Darrick J. Wong <[email protected]>

--D

> --
> Michal Hocko
> SUSE Labs

2018-05-09 16:25:40

by Mike Rapoport

Subject: Re: vmalloc with GFP_NOFS

On Wed, May 09, 2018 at 08:13:51AM -0700, Darrick J. Wong wrote:
> On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
> > On Tue 24-04-18 13:25:42, Michal Hocko wrote:
> > [...]
> > > > As a suggestion, could you take
> > > > documentation about how to convert to the memalloc_nofs_{save,restore}
> > > > scope api (which I think you've written about e-mails at length
> > > > before), and put that into a file in Documentation/core-api?
> > >
> > > I can.
> >
> > Does something like the below sound reasonable/helpful?
> > ---
> > =================================
> > GFP masks used from FS/IO context
> > =================================
> >
> > :Date: Mapy, 2018
> > :Author: Michal Hocko <[email protected]>
> >
> > Introduction
> > ============
> >
> > FS resp. IO submitting code paths have to be careful when allocating
>
> Not sure what 'FS resp. IO' means here -- 'FS and IO' ?
>
> (Or is this one of those things where this looks like plain English text
> but in reality it's some sort of markup that I'm not so familiar with?)
>
> Confused because I've seen 'resp.' used as shorthand for
> 'responsible'...
>
> > memory to prevent from potential recursion deadlocks caused by direct
> > memory reclaim calling back into the FS/IO path and block on already
> > held resources (e.g. locks). Traditional way to avoid this problem
>
> 'The traditional way to avoid this deadlock problem...'
>
> > is to clear __GFP_FS resp. __GFP_IO (note the later implies clearing
> > the first as well) in the gfp mask when calling an allocator. GFP_NOFS
> > resp. GFP_NOIO can be used as shortcut.
> >
> > This has been the traditional way to avoid deadlocks since ages. It
>
> I think this sentence is a little redundant with the previous sentence,
> you could chop it out and join this paragraph to the one before it.
>
> > turned out though that above approach has led to abuses when the restricted
> > gfp mask is used "just in case" without a deeper consideration which leads
> > to problems because an excessive use of GFP_NOFS/GFP_NOIO can lead to
> > memory over-reclaim or other memory reclaim issues.
> >
> > New API
> > =======
> >
> > Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> > ``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> > ``memalloc_noio_restore`` which allow to mark a scope to be a critical
> > section from the memory reclaim recursion into FS/IO POV. Any allocation
> > from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> > mask so no memory allocation can recurse back in the FS/IO.
> >
> > FS/IO code then simply calls the appropriate save function right at
> > the layer where a lock taken from the reclaim context (e.g. shrinker)
> > is taken and the corresponding restore function when the lock is

Seems like the second "is taken" got there by mistake

> > released. All that ideally along with an explanation what is the reclaim
> > context for easier maintenance.
> >
> > What about __vmalloc(GFP_NOFS)
> > ==============================
> >
> > vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> > GFP_KERNEL allocations deep inside the allocator which are quit non-trivial
>
> ...which are quite non-trivial...
>
> > to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> > almost always a bug. The good news is that the NOFS/NOIO semantic can be
> > achieved by the scope api.
> >
> > In the ideal world, upper layers should already mark dangerous contexts
> > and so no special care is required and vmalloc should be called without
> > any problems. Sometimes if the context is not really clear or there are
> > layering violations then the recommended way around that is to wrap ``vmalloc``
> > by the scope API with a comment explaining the problem.
>
> Otherwise looks ok to me based on my understanding of how all this is
> supposed to work...
>
> Reviewed-by: Darrick J. Wong <[email protected]>
>
> --D
>
> > --
> > Michal Hocko
> > SUSE Labs
>

--
Sincerely yours,
Mike.


2018-05-09 21:05:20

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 09-05-18 08:13:51, Darrick J. Wong wrote:
> On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
> > On Tue 24-04-18 13:25:42, Michal Hocko wrote:
> > [...]
> > > > As a suggestion, could you take
> > > > documentation about how to convert to the memalloc_nofs_{save,restore}
> > > > scope api (which I think you've written about e-mails at length
> > > > before), and put that into a file in Documentation/core-api?
> > >
> > > I can.
> >
> > Does something like the below sound reasonable/helpful?
> > ---
> > =================================
> > GFP masks used from FS/IO context
> > =================================
> >
> > :Date: Mapy, 2018
> > :Author: Michal Hocko <[email protected]>
> >
> > Introduction
> > ============
> >
> > FS resp. IO submitting code paths have to be careful when allocating
>
> Not sure what 'FS resp. IO' means here -- 'FS and IO' ?
>
> (Or is this one of those things where this looks like plain English text
> but in reality it's some sort of markup that I'm not so familiar with?)
>
> Confused because I've seen 'resp.' used as shorthand for
> 'responsible'...

Well, I've tried to cover both. Filesystem and IO code paths which
allocate while in sensitive context. IO submission is kinda clear but I
am not sure what a general term for filesystem code paths would be. I
would be grateful for any hints here.

>
> > memory to prevent from potential recursion deadlocks caused by direct
> > memory reclaim calling back into the FS/IO path and block on already
> > held resources (e.g. locks). Traditional way to avoid this problem
>
> 'The traditional way to avoid this deadlock problem...'

Done

> > is to clear __GFP_FS resp. __GFP_IO (note the later implies clearing
> > the first as well) in the gfp mask when calling an allocator. GFP_NOFS
> > resp. GFP_NOIO can be used as shortcut.
> >
> > This has been the traditional way to avoid deadlocks since ages. It
>
> I think this sentence is a little redundant with the previous sentence,
> you could chop it out and join this paragraph to the one before it.

OK

>
> > turned out though that above approach has led to abuses when the restricted
> > gfp mask is used "just in case" without a deeper consideration which leads
> > to problems because an excessive use of GFP_NOFS/GFP_NOIO can lead to
> > memory over-reclaim or other memory reclaim issues.
> >
> > New API
> > =======
> >
> > Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> > ``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> > ``memalloc_noio_restore`` which allow to mark a scope to be a critical
> > section from the memory reclaim recursion into FS/IO POV. Any allocation
> > from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> > mask so no memory allocation can recurse back in the FS/IO.
> >
> > FS/IO code then simply calls the appropriate save function right at
> > the layer where a lock taken from the reclaim context (e.g. shrinker)
> > is taken and the corresponding restore function when the lock is
> > released. All that ideally along with an explanation what is the reclaim
> > context for easier maintenance.
> >
> > What about __vmalloc(GFP_NOFS)
> > ==============================
> >
> > vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> > GFP_KERNEL allocations deep inside the allocator which are quit non-trivial
>
> ...which are quite non-trivial...

fixed

> > to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> > almost always a bug. The good news is that the NOFS/NOIO semantic can be
> > achieved by the scope api.
> >
> > In the ideal world, upper layers should already mark dangerous contexts
> > and so no special care is required and vmalloc should be called without
> > any problems. Sometimes if the context is not really clear or there are
> > layering violations then the recommended way around that is to wrap ``vmalloc``
> > by the scope API with a comment explaining the problem.
>
> Otherwise looks ok to me based on my understanding of how all this is
> supposed to work...
>
> Reviewed-by: Darrick J. Wong <[email protected]>

Thanks for your review!

--
Michal Hocko
SUSE Labs

2018-05-09 21:07:09

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 09-05-18 19:24:51, Mike Rapoport wrote:
> On Wed, May 09, 2018 at 08:13:51AM -0700, Darrick J. Wong wrote:
> > On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
[...]
> > > FS/IO code then simply calls the appropriate save function right at
> > > the layer where a lock taken from the reclaim context (e.g. shrinker)
> > > is taken and the corresponding restore function when the lock is
>
> Seems like the second "is taken" got there by mistake

yeah, fixed. Thanks!
--
Michal Hocko
SUSE Labs

2018-05-09 22:04:45

by Darrick J. Wong

Subject: Re: vmalloc with GFP_NOFS

On Wed, May 09, 2018 at 11:04:47PM +0200, Michal Hocko wrote:
> On Wed 09-05-18 08:13:51, Darrick J. Wong wrote:
> > On Wed, May 09, 2018 at 03:42:22PM +0200, Michal Hocko wrote:
> > > On Tue 24-04-18 13:25:42, Michal Hocko wrote:
> > > [...]
> > > > > As a suggestion, could you take
> > > > > documentation about how to convert to the memalloc_nofs_{save,restore}
> > > > > scope api (which I think you've written about e-mails at length
> > > > > before), and put that into a file in Documentation/core-api?
> > > >
> > > > I can.
> > >
> > > Does something like the below sound reasonable/helpful?
> > > ---
> > > =================================
> > > GFP masks used from FS/IO context
> > > =================================
> > >
> > > :Date: Mapy, 2018
> > > :Author: Michal Hocko <[email protected]>
> > >
> > > Introduction
> > > ============
> > >
> > > FS resp. IO submitting code paths have to be careful when allocating
> >
> > Not sure what 'FS resp. IO' means here -- 'FS and IO' ?
> >
> > (Or is this one of those things where this looks like plain English text
> > but in reality it's some sort of markup that I'm not so familiar with?)
> >
> > Confused because I've seen 'resp.' used as shorthand for
> > 'responsible'...
>
> Well, I've tried to cover both. Filesystem and IO code paths which
> allocate while in sensitive context. IO submission is kinda clear but I
> am not sure what a general term for filsystem code paths would be. I
> would be greatful for any hints here.

"Code paths in the filesystem and IO stacks must be careful when
allocating memory to prevent recursion deadlocks caused by direct memory
reclaim calling back into the FS or IO paths and blocking on already
held resources (e.g. locks)." ?

--D

>
> >
> > > memory to prevent from potential recursion deadlocks caused by direct
> > > memory reclaim calling back into the FS/IO path and block on already
> > > held resources (e.g. locks). Traditional way to avoid this problem
> >
> > 'The traditional way to avoid this deadlock problem...'
>
> Done
>
> > > is to clear __GFP_FS resp. __GFP_IO (note the later implies clearing
> > > the first as well) in the gfp mask when calling an allocator. GFP_NOFS
> > > resp. GFP_NOIO can be used as shortcut.
> > >
> > > This has been the traditional way to avoid deadlocks since ages. It
> >
> > I think this sentence is a little redundant with the previous sentence,
> > you could chop it out and join this paragraph to the one before it.
>
> OK
>
> >
> > > turned out though that above approach has led to abuses when the restricted
> > > gfp mask is used "just in case" without a deeper consideration which leads
> > > to problems because an excessive use of GFP_NOFS/GFP_NOIO can lead to
> > > memory over-reclaim or other memory reclaim issues.
> > >
> > > New API
> > > =======
> > >
> > > Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> > > ``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> > > ``memalloc_noio_restore`` which allow to mark a scope to be a critical
> > > section from the memory reclaim recursion into FS/IO POV. Any allocation
> > > from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> > > mask so no memory allocation can recurse back in the FS/IO.
> > >
> > > FS/IO code then simply calls the appropriate save function right at
> > > the layer where a lock taken from the reclaim context (e.g. shrinker)
> > > is taken and the corresponding restore function when the lock is
> > > released. All that ideally along with an explanation what is the reclaim
> > > context for easier maintenance.
> > >
> > > What about __vmalloc(GFP_NOFS)
> > > ==============================
> > >
> > > vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> > > GFP_KERNEL allocations deep inside the allocator which are quit non-trivial
> >
> > ...which are quite non-trivial...
>
> fixed
>
> > > to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> > > almost always a bug. The good news is that the NOFS/NOIO semantic can be
> > > achieved by the scope api.
> > >
> > > In the ideal world, upper layers should already mark dangerous contexts
> > > and so no special care is required and vmalloc should be called without
> > > any problems. Sometimes if the context is not really clear or there are
> > > layering violations then the recommended way around that is to wrap ``vmalloc``
> > > by the scope API with a comment explaining the problem.
> >
> > Otherwise looks ok to me based on my understanding of how all this is
> > supposed to work...
> >
> > Reviewed-by: Darrick J. Wong <[email protected]>
>
> Thanks for your review!
>
> --
> Michal Hocko
> SUSE Labs

2018-05-10 06:00:02

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Wed 09-05-18 15:02:31, Darrick J. Wong wrote:
> On Wed, May 09, 2018 at 11:04:47PM +0200, Michal Hocko wrote:
> > On Wed 09-05-18 08:13:51, Darrick J. Wong wrote:
[...]
> > > > FS resp. IO submitting code paths have to be careful when allocating
> > >
> > > Not sure what 'FS resp. IO' means here -- 'FS and IO' ?
> > >
> > > (Or is this one of those things where this looks like plain English text
> > > but in reality it's some sort of markup that I'm not so familiar with?)
> > >
> > > Confused because I've seen 'resp.' used as shorthand for
> > > 'responsible'...
> >
> > Well, I've tried to cover both. Filesystem and IO code paths which
> > allocate while in sensitive context. IO submission is kinda clear but I
> > am not sure what a general term for filsystem code paths would be. I
> > would be greatful for any hints here.
>
> "Code paths in the filesystem and IO stacks must be careful when
> allocating memory to prevent recursion deadlocks caused by direct memory
> reclaim calling back into the FS or IO paths and blocking on already
> held resources (e.g. locks)." ?

Great, thanks!
--
Michal Hocko
SUSE Labs

2018-05-10 07:20:50

by Michal Hocko

Subject: Re: vmalloc with GFP_NOFS

On Thu 10-05-18 07:58:25, Michal Hocko wrote:
> On Wed 09-05-18 15:02:31, Darrick J. Wong wrote:
> > On Wed, May 09, 2018 at 11:04:47PM +0200, Michal Hocko wrote:
> > > On Wed 09-05-18 08:13:51, Darrick J. Wong wrote:
> [...]
> > > > > FS resp. IO submitting code paths have to be careful when allocating
> > > >
> > > > Not sure what 'FS resp. IO' means here -- 'FS and IO' ?
> > > >
> > > > (Or is this one of those things where this looks like plain English text
> > > > but in reality it's some sort of markup that I'm not so familiar with?)
> > > >
> > > > Confused because I've seen 'resp.' used as shorthand for
> > > > 'responsible'...
> > >
> > > Well, I've tried to cover both. Filesystem and IO code paths which
> > > allocate while in sensitive context. IO submission is kinda clear but I
> > > am not sure what a general term for filsystem code paths would be. I
> > > would be greatful for any hints here.
> >
> > "Code paths in the filesystem and IO stacks must be careful when
> > allocating memory to prevent recursion deadlocks caused by direct memory
> > reclaim calling back into the FS or IO paths and blocking on already
> > held resources (e.g. locks)." ?
>
> Great, thanks!

I dared to extend the last part to "(e.g. locks - most commonly those
used for the transaction context)"
--
Michal Hocko
SUSE Labs

2018-05-24 11:45:35

by Michal Hocko

Subject: [PATCH] doc: document scope NOFS, NOIO APIs

From: Michal Hocko <[email protected]>

Although the api is documented in the source code Ted has pointed out
that there is no mention in the core-api Documentation and there are
people looking there to find answers how to use a specific API.

Cc: "Darrick J. Wong" <[email protected]>
Cc: David Sterba <[email protected]>
Requested-by: "Theodore Y. Ts'o" <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---

Hi Jonathan,
Ted has proposed this at LSFMM and then we discussed that briefly on the
mailing list [1]. I received some useful feedback from Darrick and Dave
which has been (hopefully) integrated. Then the thing fell off my radar
and I am rediscovering it now while doing some cleanup. Could you take
the patch please?

[1] http://lkml.kernel.org/r/[email protected]
.../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
1 file changed, 55 insertions(+)
create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
new file mode 100644
index 000000000000..e8b2678e959b
--- /dev/null
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -0,0 +1,55 @@
+=================================
+GFP masks used from FS/IO context
+=================================
+
+:Date: May, 2018
+:Author: Michal Hocko <[email protected]>
+
+Introduction
+============
+
+Code paths in the filesystem and IO stacks must be careful when
+allocating memory to prevent recursion deadlocks caused by direct
+memory reclaim calling back into the FS or IO paths and blocking on
+already held resources (e.g. locks - most commonly those used for the
+transaction context).
+
+The traditional way to avoid this deadlock problem is to clear __GFP_FS
+resp. __GFP_IO (note the later implies clearing the first as well) in
+the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
+used as shortcut. It turned out though that above approach has led to
+abuses when the restricted gfp mask is used "just in case" without a
+deeper consideration which leads to problems because an excessive use
+of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
+reclaim issues.
+
+New API
+========
+
+Since 4.12 we do have a generic scope API for both NOFS and NOIO context
+``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
+``memalloc_noio_restore`` which allow to mark a scope to be a critical
+section from the memory reclaim recursion into FS/IO POV. Any allocation
+from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
+mask so no memory allocation can recurse back in the FS/IO.
+
+FS/IO code then simply calls the appropriate save function right at the
+layer where a lock that is also used in the reclaim context (e.g. by a
+shrinker) is taken, and the corresponding restore function when the lock
+is released. All that ideally along with an explanation of what the
+reclaim context is, for easier maintenance.
+
+What about __vmalloc(GFP_NOFS)
+==============================
+
+vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
+GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
+to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
+almost always a bug. The good news is that the NOFS/NOIO semantic can be
+achieved by the scope api.
+
+In the ideal world, upper layers should already mark dangerous contexts
+and so no special care is required and vmalloc should be called without
+any problems. Sometimes if the context is not really clear or there are
+layering violations then the recommended way around that is to wrap ``vmalloc``
+by the scope API with a comment explaining the problem.
--
2.17.0


2018-05-25 02:11:38

by Shakeel Butt

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu, May 24, 2018 at 4:43 AM, Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: David Sterba <[email protected]>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi Johnatan,
> Ted has proposed this at LSFMM and then we discussed that briefly on the
> mailing list [1]. I received some useful feedback from Darrick and Dave
> which has been (hopefully) integrated. Then the thing fall off my radar
> rediscovering it now when doing some cleanup. Could you take the patch
> please?
>
> [1] http://lkml.kernel.org/r/[email protected]
> .../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
> 1 file changed, 55 insertions(+)
> create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index 000000000000..e8b2678e959b
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,55 @@
> +=================================
> +GFP masks used from FS/IO context
> +=================================
> +
> +:Date: Mapy, 2018
> +:Author: Michal Hocko <[email protected]>
> +
> +Introduction
> +============
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +resp. __GFP_IO (note the later implies clearing the first as well) in

Is resp. == respectively? Why not use the full word (here and below)?

> +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +========
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from the memory reclaim recursion into FS/IO POV. Any allocation
> +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.
> +
> +FS/IO code then simply calls the appropriate save function right at the
> +layer where a lock taken from the reclaim context (e.g. shrinker) and
> +the corresponding restore function when the lock is released. All that
> +ideally along with an explanation what is the reclaim context for easier
> +maintenance.
> +
> +What about __vmalloc(GFP_NOFS)
> +==============================
> +
> +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
> +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> +almost always a bug. The good news is that the NOFS/NOIO semantic can be
> +achieved by the scope api.
> +
> +In the ideal world, upper layers should already mark dangerous contexts
> +and so no special care is required and vmalloc should be called without
> +any problems. Sometimes if the context is not really clear or there are
> +layering violations then the recommended way around that is to wrap ``vmalloc``
> +by the scope API with a comment explaining the problem.
> --
> 2.17.0
>
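
The vmalloc point quoted above can be sketched as a small userspace model. Everything here is hypothetical (the bit values, ``model_vmalloc`` and the ``MODEL_*`` names are not kernel code): it only mimics an allocator whose internal allocations use a hardcoded GFP_KERNEL-like mask, so that a NOFS mask passed by the caller never reaches them, while a task-wide scope flag does.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative bit values only, not the kernel's real constants. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    MODEL_GFP_IO
#define MODEL_PF_NOFS     0x2u

/* Stand-in for current->flags of the running task. */
static unsigned int model_task_flags;
/* Mask that the last "internal" allocation ended up using. */
static unsigned int model_last_internal_gfp;

/* The scope flag filters every allocation made by the task. */
static unsigned int model_apply_scope(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* The caller's mask is ignored by the hardcoded internal allocation. */
void *model_vmalloc(size_t size, unsigned int caller_gfp)
{
	(void)caller_gfp;
	model_last_internal_gfp = model_apply_scope(MODEL_GFP_KERNEL);
	return malloc(size);
}

/* Recommended shape: wrap the call in a NOFS scope instead. */
void *model_vmalloc_nofs_scoped(size_t size)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;
	void *p;

	/* Comment in real code: explain which reclaim recursion is avoided. */
	model_task_flags |= MODEL_PF_NOFS;
	p = model_vmalloc(size, MODEL_GFP_KERNEL);
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
	return p;
}

unsigned int model_last_gfp(void)
{
	return model_last_internal_gfp;
}
```

In this model, passing ``MODEL_GFP_NOFS`` to ``model_vmalloc`` still leaves the FS bit set for the internal allocation, whereas the scoped wrapper clears it.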

2018-05-25 02:16:36

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu 24-05-18 07:33:39, Shakeel Butt wrote:
> On Thu, May 24, 2018 at 4:43 AM, Michal Hocko <[email protected]> wrote:
[...]
> > +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> > +resp. __GFP_IO (note the later implies clearing the first as well) in
>
> Is resp. == respectively? Why not use the full word (here and below)?

yes. Because I was lazy ;)

--
Michal Hocko
SUSE Labs

2018-05-25 02:30:30

by Randy Dunlap

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On 05/24/2018 04:43 AM, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: David Sterba <[email protected]>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi Johnatan,
> Ted has proposed this at LSFMM and then we discussed that briefly on the
> mailing list [1]. I received some useful feedback from Darrick and Dave
> which has been (hopefully) integrated. Then the thing fall off my radar
> rediscovering it now when doing some cleanup. Could you take the patch
> please?
>
> [1] http://lkml.kernel.org/r/[email protected]
> .../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
> 1 file changed, 55 insertions(+)
> create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index 000000000000..e8b2678e959b
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,55 @@
> +=================================
> +GFP masks used from FS/IO context
> +=================================
> +
> +:Date: Mapy, 2018
> +:Author: Michal Hocko <[email protected]>
> +
> +Introduction
> +============
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +resp. __GFP_IO (note the later implies clearing the first as well) in

latter

> +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +========
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from the memory reclaim recursion into FS/IO POV. Any allocation

s/POV/point of view/ or whatever it is.

> +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.
> +
> +FS/IO code then simply calls the appropriate save function right at the
> +layer where a lock taken from the reclaim context (e.g. shrinker) and
> +the corresponding restore function when the lock is released. All that
> +ideally along with an explanation what is the reclaim context for easier
> +maintenance.
> +
> +What about __vmalloc(GFP_NOFS)
> +==============================
> +
> +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
> +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> +almost always a bug. The good news is that the NOFS/NOIO semantic can be
> +achieved by the scope api.

I would prefer s/api/API/ throughout.

> +
> +In the ideal world, upper layers should already mark dangerous contexts
> +and so no special care is required and vmalloc should be called without
> +any problems. Sometimes if the context is not really clear or there are
> +layering violations then the recommended way around that is to wrap ``vmalloc``
> +by the scope API with a comment explaining the problem.
>


--
~Randy

2018-05-25 02:44:18

by Jonathan Corbet

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu, 24 May 2018 13:43:41 +0200
Michal Hocko <[email protected]> wrote:

> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: David Sterba <[email protected]>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
>
> Hi Johnatan,
> Ted has proposed this at LSFMM and then we discussed that briefly on the
> mailing list [1]. I received some useful feedback from Darrick and Dave
> which has been (hopefully) integrated. Then the thing fall off my radar
> rediscovering it now when doing some cleanup. Could you take the patch
> please?
>
> [1] http://lkml.kernel.org/r/[email protected]
> .../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
> 1 file changed, 55 insertions(+)
> create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst

So you create the rst file, but don't add it in index.rst; that means it
won't be a part of the docs build and Sphinx will complain.

> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index 000000000000..e8b2678e959b
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,55 @@
> +=================================
> +GFP masks used from FS/IO context
> +=================================
> +
> +:Date: Mapy, 2018

Ah...the wonderful month of Mapy....:)

> +:Author: Michal Hocko <[email protected]>
> +
> +Introduction
> +============
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +resp. __GFP_IO (note the later implies clearing the first as well) in

"resp." is indeed a bit terse. Even spelled out as "respectively", though,
I'm not sure what the word is intended to mean here. Did you mean "or"?

> +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be

Here too.

> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +========
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from the memory reclaim recursion into FS/IO POV. Any allocation

"from a filesystem or I/O point of view" ?

> +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.

Wouldn't it be nice if those functions had kerneldoc comments that could be
pulled in here! :)

> +FS/IO code then simply calls the appropriate save function right at the
> +layer where a lock taken from the reclaim context (e.g. shrinker) and

where a lock *is* taken ?

> +the corresponding restore function when the lock is released. All that
> +ideally along with an explanation what is the reclaim context for easier
> +maintenance.
> +
> +What about __vmalloc(GFP_NOFS)
> +==============================
> +
> +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
> +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> +almost always a bug. The good news is that the NOFS/NOIO semantic can be
> +achieved by the scope api.

Agree with others on "API"

> +In the ideal world, upper layers should already mark dangerous contexts
> +and so no special care is required and vmalloc should be called without
> +any problems. Sometimes if the context is not really clear or there are
> +layering violations then the recommended way around that is to wrap ``vmalloc``
> +by the scope API with a comment explaining the problem.

Thanks,

jon

2018-05-25 02:45:51

by Dave Chinner

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: David Sterba <[email protected]>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Yay, Documentation! :)

> ---
>
> Hi Johnatan,
> Ted has proposed this at LSFMM and then we discussed that briefly on the
> mailing list [1]. I received some useful feedback from Darrick and Dave
> which has been (hopefully) integrated. Then the thing fall off my radar
> rediscovering it now when doing some cleanup. Could you take the patch
> please?
>
> [1] http://lkml.kernel.org/r/[email protected]
> .../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
> 1 file changed, 55 insertions(+)
> create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index 000000000000..e8b2678e959b
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,55 @@
> +=================================
> +GFP masks used from FS/IO context
> +=================================
> +
> +:Date: Mapy, 2018
> +:Author: Michal Hocko <[email protected]>
> +
> +Introduction
> +============
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +resp. __GFP_IO (note the later implies clearing the first as well) in
> +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +========
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from the memory reclaim recursion into FS/IO POV. Any allocation
> +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.
> +
> +FS/IO code then simply calls the appropriate save function right at the
> +layer where a lock taken from the reclaim context (e.g. shrinker) and
> +the corresponding restore function when the lock is released. All that
> +ideally along with an explanation what is the reclaim context for easier
> +maintenance.

This paragraph doesn't make much sense to me. I think you're trying
to say that we should call the appropriate save function "before
locks are taken that a reclaim context (e.g a shrinker) might
require access to."

I think it's also worth making a note about recursive/nested
save/restore stacking, because it's not clear from this description
that this is allowed and will work as long as inner save/restore
calls are fully nested inside outer save/restore contexts.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-05-25 02:47:54

by Theodore Ts'o

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Fri, May 25, 2018 at 08:17:15AM +1000, Dave Chinner wrote:
> On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Although the api is documented in the source code Ted has pointed out
> > that there is no mention in the core-api Documentation and there are
> > people looking there to find answers how to use a specific API.
> >
> > Cc: "Darrick J. Wong" <[email protected]>
> > Cc: David Sterba <[email protected]>
> > Requested-by: "Theodore Y. Ts'o" <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Yay, Documentation! :)

Indeed, many thanks!!!

- Ted

2018-05-25 07:52:59

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu 24-05-18 09:37:18, Randy Dunlap wrote:
> On 05/24/2018 04:43 AM, Michal Hocko wrote:
[...]
> > +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> > +resp. __GFP_IO (note the later implies clearing the first as well) in
>
> latter

?
No, I really meant that clearing __GFP_IO implies clearing __GFP_FS.
--
Michal Hocko
SUSE Labs

2018-05-25 08:12:26

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Thu 24-05-18 14:52:02, Jonathan Corbet wrote:
> On Thu, 24 May 2018 13:43:41 +0200
> Michal Hocko <[email protected]> wrote:
>
> > From: Michal Hocko <[email protected]>
> >
> > Although the api is documented in the source code Ted has pointed out
> > that there is no mention in the core-api Documentation and there are
> > people looking there to find answers how to use a specific API.
> >
> > Cc: "Darrick J. Wong" <[email protected]>
> > Cc: David Sterba <[email protected]>
> > Requested-by: "Theodore Y. Ts'o" <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> >
> > Hi Johnatan,
> > Ted has proposed this at LSFMM and then we discussed that briefly on the
> > mailing list [1]. I received some useful feedback from Darrick and Dave
> > which has been (hopefully) integrated. Then the thing fall off my radar
> > rediscovering it now when doing some cleanup. Could you take the patch
> > please?
> >
> > [1] http://lkml.kernel.org/r/[email protected]
> > .../core-api/gfp_mask-from-fs-io.rst | 55 +++++++++++++++++++
> > 1 file changed, 55 insertions(+)
> > create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> So you create the rst file, but don't add it in index.rst; that means it
> won't be a part of the docs build and Sphinx will complain.

I am not really familiar with how the whole rst thing works.

diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8a5f48ef16f2 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+ gfp_mask-from-fs-io

Interfaces for kernel debugging
===============================

This?

>
> > diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> > new file mode 100644
> > index 000000000000..e8b2678e959b
> > --- /dev/null
> > +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> > @@ -0,0 +1,55 @@
> > +=================================
> > +GFP masks used from FS/IO context
> > +=================================
> > +
> > +:Date: Mapy, 2018
>
> Ah...the wonderful month of Mapy....:)

fixed

> > +:Author: Michal Hocko <[email protected]>
> > +
> > +Introduction
> > +============
> > +
> > +Code paths in the filesystem and IO stacks must be careful when
> > +allocating memory to prevent recursion deadlocks caused by direct
> > +memory reclaim calling back into the FS or IO paths and blocking on
> > +already held resources (e.g. locks - most commonly those used for the
> > +transaction context).
> > +
> > +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> > +resp. __GFP_IO (note the later implies clearing the first as well) in
>
> "resp." is indeed a bit terse. Even spelled out as "respectively", though,

OK s@resp\.@respectively@g

> I'm not sure what the word is intended to mean here. Did you mean "or"?

Basically yes. There are two cases here: NOFS and NOIO, the latter being
a subset of the former. I didn't really want to repeat the whole thing
for NOIO.

>
> > +the gfp mask when calling an allocator. GFP_NOFS resp. GFP_NOIO can be
>
> Here too.
>
> > +used as shortcut. It turned out though that above approach has led to
> > +abuses when the restricted gfp mask is used "just in case" without a
> > +deeper consideration which leads to problems because an excessive use
> > +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> > +reclaim issues.
> > +
> > +New API
> > +========
> > +
> > +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> > +``memalloc_nofs_save``, ``memalloc_nofs_restore`` resp. ``memalloc_noio_save``,
> > +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> > +section from the memory reclaim recursion into FS/IO POV. Any allocation
>
> "from a filesystem or I/O point of view" ?

OK

> > +from that scope will inherently drop __GFP_FS resp. __GFP_IO from the given
> > +mask so no memory allocation can recurse back in the FS/IO.
>
> Wouldn't it be nice if those functions had kerneldoc comments that could be
> pulled in here! :)

Most probably yes ;) I thought I've done that but that was probably in a
different universe. This probably?

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e1f8411e6b80..f49ece8ee37a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -166,6 +166,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { }
static inline void fs_reclaim_release(gfp_t gfp_mask) { }
#endif

+/**
+ * memalloc_noio_save - Marks implicit GFP_NOIO allocation scope.
+ *
+ * This function marks the beginning of the GFP_NOIO allocation scope.
+ * All further allocations will implicitly drop __GFP_IO flag and so
+ * they are safe for the IO critical section from the allocation recursion
+ * point of view. Use memalloc_noio_restore to end the scope with flags
+ * returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
static inline unsigned int memalloc_noio_save(void)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
@@ -173,11 +184,30 @@ static inline unsigned int memalloc_noio_save(void)
return flags;
}

+/**
+ * memalloc_noio_restore - Ends the implicit GFP_NOIO scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
+ * Always make sure that the given flags is the return value from the
+ * pairing memalloc_noio_save call.
+ */
static inline void memalloc_noio_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+/**
+ * memalloc_nofs_save - Marks implicit GFP_NOFS allocation scope.
+ *
+ * This function marks the beginning of the GFP_NOFS allocation scope.
+ * All further allocations will implicitly drop __GFP_FS flag and so
+ * they are safe for the FS critical section from the allocation recursion
+ * point of view. Use memalloc_nofs_restore to end the scope with flags
+ * returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
static inline unsigned int memalloc_nofs_save(void)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
@@ -185,6 +215,14 @@ static inline unsigned int memalloc_nofs_save(void)
return flags;
}

+/**
+ * memalloc_nofs_restore - Ends the implicit GFP_NOFS scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
+ * Always make sure that the given flags is the return value from the
+ * pairing memalloc_nofs_save call.
+ */
static inline void memalloc_nofs_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
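
The save/restore pattern in the hunks above can be tried out in a small userspace model. This is only a sketch under stated assumptions: ``model_task_flags`` stands in for ``current->flags``, the ``MODEL_*`` bit values are made up, and ``model_effective_gfp`` is a hypothetical helper that mimics how an allocation inside a scope loses __GFP_FS (NOFS scope) or both __GFP_IO and __GFP_FS (NOIO scope).

```c
#include <assert.h>

/* Illustrative bit values only, not the kernel's real constants. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOIO     0x1u
#define MODEL_PF_NOFS     0x2u

/* Stand-in for current->flags of the running task. */
static unsigned int model_task_flags;

/* Remember the old scope bit, then set it. */
unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

/* Put back whatever the pairing save call returned. */
void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOIO;

	model_task_flags |= MODEL_PF_NOIO;
	return old;
}

void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOIO) | old;
}

/* Mask an allocation would effectively use inside the current scope. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Walk through one NOFS scope and one NOIO scope, starting with no scope. */
void model_scope_demo(void)
{
	unsigned int old = model_nofs_save();

	assert(model_effective_gfp(MODEL_GFP_KERNEL) == MODEL_GFP_IO);
	model_nofs_restore(old);
	assert(model_effective_gfp(MODEL_GFP_KERNEL) == MODEL_GFP_KERNEL);

	old = model_noio_save();
	assert(model_effective_gfp(MODEL_GFP_KERNEL) == 0);
	model_noio_restore(old);
}
```

The NOIO branch also drops the FS bit, which matches the remark earlier in the thread that clearing __GFP_IO implies clearing __GFP_FS.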

> > +FS/IO code then simply calls the appropriate save function right at the
> > +layer where a lock taken from the reclaim context (e.g. shrinker) and
>
> where a lock *is* taken ?

fixed

--
Michal Hocko
SUSE Labs

2018-05-25 08:16:54

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
[...]
> > +FS/IO code then simply calls the appropriate save function right at the
> > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > +the corresponding restore function when the lock is released. All that
> > +ideally along with an explanation what is the reclaim context for easier
> > +maintenance.
>
> This paragraph doesn't make much sense to me. I think you're trying
> to say that we should call the appropriate save function "before
> locks are taken that a reclaim context (e.g a shrinker) might
> require access to."
>
> I think it's also worth making a note about recursive/nested
> save/restore stacking, because it's not clear from this description
> that this is allowed and will work as long as inner save/restore
> calls are fully nested inside outer save/restore contexts.

Any better?

-FS/IO code then simply calls the appropriate save function right at the
-layer where a lock taken from the reclaim context (e.g. shrinker) and
-the corresponding restore function when the lock is released. All that
-ideally along with an explanation what is the reclaim context for easier
-maintenance.
+FS/IO code then simply calls the appropriate save function before any
+lock shared with the reclaim context is taken. The corresponding
+restore function when the lock is released. All that ideally along with
+an explanation what is the reclaim context for easier maintenance.
+
+Please note that the proper pairing of save/restore function allows nesting
+so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.

What about __vmalloc(GFP_NOFS)
==============================
--
Michal Hocko
SUSE Labs

2018-05-27 12:48:10

by Mike Rapoport

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> [...]
> > > +FS/IO code then simply calls the appropriate save function right at the
> > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > +the corresponding restore function when the lock is released. All that
> > > +ideally along with an explanation what is the reclaim context for easier
> > > +maintenance.
> >
> > This paragraph doesn't make much sense to me. I think you're trying
> > to say that we should call the appropriate save function "before
> > locks are taken that a reclaim context (e.g a shrinker) might
> > require access to."
> >
> > I think it's also worth making a note about recursive/nested
> > save/restore stacking, because it's not clear from this description
> > that this is allowed and will work as long as inner save/restore
> > calls are fully nested inside outer save/restore contexts.
>
> Any better?
>
> -FS/IO code then simply calls the appropriate save function right at the
> -layer where a lock taken from the reclaim context (e.g. shrinker) and
> -the corresponding restore function when the lock is released. All that
> -ideally along with an explanation what is the reclaim context for easier
> -maintenance.
> +FS/IO code then simply calls the appropriate save function before any
> +lock shared with the reclaim context is taken. The corresponding
> +restore function when the lock is released. All that ideally along with

Maybe: "The corresponding restore function is called when the lock is
released"

> +an explanation what is the reclaim context for easier maintenance.
> +
> +Please note that the proper pairing of save/restore function allows nesting
> +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.

so it is safe to call memalloc_noio_save from an existing NOIO or NOFS
scope

> What about __vmalloc(GFP_NOFS)
> ==============================
> --
> Michal Hocko
> SUSE Labs
>

--
Sincerely yours,
Mike.


2018-05-27 23:49:32

by Dave Chinner

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> [...]
> > > +FS/IO code then simply calls the appropriate save function right at the
> > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > +the corresponding restore function when the lock is released. All that
> > > +ideally along with an explanation what is the reclaim context for easier
> > > +maintenance.
> >
> > This paragraph doesn't make much sense to me. I think you're trying
> > to say that we should call the appropriate save function "before
> > locks are taken that a reclaim context (e.g a shrinker) might
> > require access to."
> >
> > I think it's also worth making a note about recursive/nested
> > save/restore stacking, because it's not clear from this description
> > that this is allowed and will work as long as inner save/restore
> > calls are fully nested inside outer save/restore contexts.
>
> Any better?
>
> -FS/IO code then simply calls the appropriate save function right at the
> -layer where a lock taken from the reclaim context (e.g. shrinker) and
> -the corresponding restore function when the lock is released. All that
> -ideally along with an explanation what is the reclaim context for easier
> -maintenance.
> +FS/IO code then simply calls the appropriate save function before any
> +lock shared with the reclaim context is taken. The corresponding
> +restore function when the lock is released. All that ideally along with
> +an explanation what is the reclaim context for easier maintenance.
> +
> +Please note that the proper pairing of save/restore function allows nesting
> +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.

It's better, but the talk of this being necessary for locking makes
me cringe. XFS doesn't do it for locking reasons - it does it
largely for preventing transaction context nesting, which has all
sorts of problems that cause hangs (e.g. log space reservations
can't be filled) that aren't directly locking related.

i.e. we should be talking about using these functions around contexts
where recursion back into the filesystem through reclaim is
problematic, not that "holding locks" is problematic. Locks can be
used as an example of a problematic context, but locks are not the
only recursion issue that requires a GFP_NOFS allocation context to
avoid.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-05-28 15:55:47

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Sun 27-05-18 15:47:22, Mike Rapoport wrote:
> On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> > On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> > [...]
> > > > +FS/IO code then simply calls the appropriate save function right at the
> > > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > > +the corresponding restore function when the lock is released. All that
> > > > +ideally along with an explanation what is the reclaim context for easier
> > > > +maintenance.
> > >
> > > This paragraph doesn't make much sense to me. I think you're trying
> > > to say that we should call the appropriate save function "before
> > > locks are taken that a reclaim context (e.g a shrinker) might
> > > require access to."
> > >
> > > I think it's also worth making a note about recursive/nested
> > > save/restore stacking, because it's not clear from this description
> > > that this is allowed and will work as long as inner save/restore
> > > calls are fully nested inside outer save/restore contexts.
> >
> > Any better?
> >
> > -FS/IO code then simply calls the appropriate save function right at the
> > -layer where a lock taken from the reclaim context (e.g. shrinker) and
> > -the corresponding restore function when the lock is released. All that
> > -ideally along with an explanation what is the reclaim context for easier
> > -maintenance.
> > +FS/IO code then simply calls the appropriate save function before any
> > +lock shared with the reclaim context is taken. The corresponding
> > +restore function when the lock is released. All that ideally along with
>
> Maybe: "The corresponding restore function is called when the lock is
> released"

This will get rewritten some more based on comments from Dave

> > +an explanation what is the reclaim context for easier maintenance.
> > +
> > +Please note that the proper pairing of save/restore function allows nesting
> > +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
>
> so it is safe to call memalloc_noio_save from an existing NOIO or NOFS
> scope

Here is what I have right now on top

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
index c0ec212d6773..0cff411693ab 100644
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -34,12 +34,15 @@ scope will inherently drop __GFP_FS respectively __GFP_IO from the given
mask so no memory allocation can recurse back in the FS/IO.

FS/IO code then simply calls the appropriate save function before any
-lock shared with the reclaim context is taken. The corresponding
-restore function when the lock is released. All that ideally along with
-an explanation what is the reclaim context for easier maintenance.
-
-Please note that the proper pairing of save/restore function allows nesting
-so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
+critical section wrt. the reclaim is started - e.g. lock shared with the
+reclaim context or when a transaction context nesting would be possible
+via reclaim. The corresponding restore function when the critical
+section ends. All that ideally along with an explanation what is
+the reclaim context for easier maintenance.
+
+Please note that the proper pairing of save/restore function allows
+nesting so it is safe to call ``memalloc_noio_save`` respectively
+``memalloc_noio_restore`` from an existing NOIO or NOFS scope.

What about __vmalloc(GFP_NOFS)
==============================

--
Michal Hocko
SUSE Labs

2018-05-28 15:55:48

by Nikolay Borisov

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs



On 25.05.2018 10:52, Michal Hocko wrote:
> On Thu 24-05-18 09:37:18, Randy Dunlap wrote:
>> On 05/24/2018 04:43 AM, Michal Hocko wrote:
> [...]
>>> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
>>> +resp. __GFP_IO (note the later implies clearing the first as well) in
>>
>> latter
>
> ?
> No I really meant that clearing __GFP_IO implies __GFP_FS clearing
Sorry to barge in like that, but Randy is right.

<NIT WARNING>


https://www.merriam-webster.com/dictionary/latter

" of, relating to, or being the second of two groups or things or the
last of several groups or things referred to"

</NIT WARNING>


>

2018-05-28 16:05:02

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Mon 28-05-18 09:48:54, Dave Chinner wrote:
> On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> > On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> > [...]
> > > > +FS/IO code then simply calls the appropriate save function right at the
> > > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > > +the corresponding restore function when the lock is released. All that
> > > > +ideally along with an explanation what is the reclaim context for easier
> > > > +maintenance.
> > >
> > > This paragraph doesn't make much sense to me. I think you're trying
> > > to say that we should call the appropriate save function "before
> > > locks are taken that a reclaim context (e.g a shrinker) might
> > > require access to."
> > >
> > > I think it's also worth making a note about recursive/nested
> > > save/restore stacking, because it's not clear from this description
> > > that this is allowed and will work as long as inner save/restore
> > > calls are fully nested inside outer save/restore contexts.
> >
> > Any better?
> >
> > -FS/IO code then simply calls the appropriate save function right at the
> > -layer where a lock taken from the reclaim context (e.g. shrinker) and
> > -the corresponding restore function when the lock is released. All that
> > -ideally along with an explanation what is the reclaim context for easier
> > -maintenance.
> > +FS/IO code then simply calls the appropriate save function before any
> > +lock shared with the reclaim context is taken. The corresponding
> > +restore function when the lock is released. All that ideally along with
> > +an explanation what is the reclaim context for easier maintenance.
> > +
> > +Please note that the proper pairing of save/restore function allows nesting
> > +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
>
> It's better, but the talk of this being necessary for locking makes
> me cringe. XFS doesn't do it for locking reasons - it does it
> largely for preventing transaction context nesting, which has all
> sorts of problems that cause hangs (e.g. log space reservations
> can't be filled) that aren't directly locking related.

Yeah, I wanted to not mention locks as much as possible.

> i.e we should be talking about using these functions around contexts
> where recursion back into the filesystem through reclaim is
> problematic, not that "holding locks" is problematic. Locks can be
> used as an example of a problematic context, but locks are not the
> only recursion issue that require GFP_NOFS allocation contexts to
> avoid.

Agreed. Do you have any suggestion for a more abstract wording
that would not make heads spin?

I've tried the following. Any better?

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
index c0ec212d6773..adac362b2875 100644
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -34,9 +34,11 @@ scope will inherently drop __GFP_FS respectively __GFP_IO from the given
mask so no memory allocation can recurse back in the FS/IO.

FS/IO code then simply calls the appropriate save function before any
-lock shared with the reclaim context is taken. The corresponding
-restore function when the lock is released. All that ideally along with
-an explanation what is the reclaim context for easier maintenance.
+critical section wrt. the reclaim is started - e.g. lock shared with the
+reclaim context or when a transaction context nesting would be possible
+via reclaim. The corresponding restore function when the critical
+section ends. All that ideally along with an explanation what is
+the reclaim context for easier maintenance.

Please note that the proper pairing of save/restore function allows nesting
so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
--
Michal Hocko
SUSE Labs

2018-05-28 16:05:58

by Vlastimil Babka

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On 05/25/2018 09:52 AM, Michal Hocko wrote:
> On Thu 24-05-18 09:37:18, Randy Dunlap wrote:
>> On 05/24/2018 04:43 AM, Michal Hocko wrote:
> [...]
>>> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
>>> +resp. __GFP_IO (note the later implies clearing the first as well) in
>>
>> latter
>
> ?
> No I really meant that clearing __GFP_IO implies __GFP_FS clearing

In that case "latter" is the proper word AFAIK. You could also use
"former" instead of "first". Or maybe just repeat the flag names to
avoid confusion...
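
A small user-space sketch may make the flag relationship in this exchange concrete. The bit values and the MODEL_ names below are hypothetical and chosen only for illustration; the real definitions live in include/linux/gfp.h and look different. What is modelled is only that the GFP_NOFS shortcut keeps __GFP_IO, while the GFP_NOIO shortcut drops both __GFP_IO and __GFP_FS:

```c
#include <assert.h>

/* Hypothetical bit values for this model only, not the kernel's. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_RECLAIM 0x4u

/* Composition of the shortcuts, as discussed in the thread. */
#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)

/* Forbidding FS recursion leaves the IO bit alone. */
unsigned int model_clear_fs(unsigned int mask)
{
	return mask & ~MODEL_GFP_FS;
}

/* Forbidding IO also forbids FS: the latter implies the former. */
unsigned int model_clear_io(unsigned int mask)
{
	return mask & ~(MODEL_GFP_IO | MODEL_GFP_FS);
}

/* Returns 1 when the shortcuts match the clearing helpers above. */
int model_shortcuts_consistent(void)
{
	assert((MODEL_GFP_NOIO & MODEL_GFP_FS) == 0);
	return model_clear_fs(MODEL_GFP_KERNEL) == MODEL_GFP_NOFS &&
	       model_clear_io(MODEL_GFP_KERNEL) == MODEL_GFP_NOIO;
}
```

Repeating the flag names, as suggested above, sidesteps the latter/former question; the sketch only shows which direction the implication runs.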

2018-05-28 16:13:38

by Randy Dunlap

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On 05/28/2018 02:21 AM, Michal Hocko wrote:
> On Sun 27-05-18 15:47:22, Mike Rapoport wrote:
>> On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
>>> On Fri 25-05-18 08:17:15, Dave Chinner wrote:
>>>> On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
>>> [...]
>>>>> +FS/IO code then simply calls the appropriate save function right at the
>>>>> +layer where a lock taken from the reclaim context (e.g. shrinker) and
>>>>> +the corresponding restore function when the lock is released. All that
>>>>> +ideally along with an explanation what is the reclaim context for easier
>>>>> +maintenance.
>>>>
>>>> This paragraph doesn't make much sense to me. I think you're trying
>>>> to say that we should call the appropriate save function "before
>>>> locks are taken that a reclaim context (e.g a shrinker) might
>>>> require access to."
>>>>
>>>> I think it's also worth making a note about recursive/nested
>>>> save/restore stacking, because it's not clear from this description
>>>> that this is allowed and will work as long as inner save/restore
>>>> calls are fully nested inside outer save/restore contexts.
>>>
>>> Any better?
>>>
>>> -FS/IO code then simply calls the appropriate save function right at the
>>> -layer where a lock taken from the reclaim context (e.g. shrinker) and
>>> -the corresponding restore function when the lock is released. All that
>>> -ideally along with an explanation what is the reclaim context for easier
>>> -maintenance.
>>> +FS/IO code then simply calls the appropriate save function before any
>>> +lock shared with the reclaim context is taken. The corresponding
>>> +restore function when the lock is released. All that ideally along with
>>
>> Maybe: "The corresponding restore function is called when the lock is
>> released"
>
> This will get rewritten some more based on comments from Dave
>
>>> +an explanation what is the reclaim context for easier maintenance.
>>> +
>>> +Please note that the proper pairing of save/restore function allows nesting
>>> +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
>>
>> so it is safe to call memalloc_noio_save from an existing NOIO or NOFS
>> scope
>
> Here is what I have right now on top
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> index c0ec212d6773..0cff411693ab 100644
> --- a/Documentation/core-api/gfp_mask-from-fs-io.rst
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -34,12 +34,15 @@ scope will inherently drop __GFP_FS respectively __GFP_IO from the given
> mask so no memory allocation can recurse back in the FS/IO.
>
> FS/IO code then simply calls the appropriate save function before any
> -lock shared with the reclaim context is taken. The corresponding
> -restore function when the lock is released. All that ideally along with
> -an explanation what is the reclaim context for easier maintenance.
> -
> -Please note that the proper pairing of save/restore function allows nesting
> -so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
> +critical section wrt. the reclaim is started - e.g. lock shared with the

Please spell out "with respect to".

> +reclaim context or when a transaction context nesting would be possible
> +via reclaim. The corresponding restore function when the critical

"The corresponding restore ... ends." << That is not a complete sentence.
It's missing something.

> +section ends. All that ideally along with an explanation what is
> +the reclaim context for easier maintenance.
> +
> +Please note that the proper pairing of save/restore function allows
> +nesting so it is safe to call ``memalloc_noio_save`` respectively
> +``memalloc_noio_restore`` from an existing NOIO or NOFS scope.

Please note that the proper pairing of save/restore functions allows
nesting so it is safe to call ``memalloc_noio_save`` or
``memalloc_noio_restore`` respectively from an existing NOIO or NOFS scope.


>
> What about __vmalloc(GFP_NOFS)
> ==============================
>


--
~Randy

2018-05-29 03:56:16

by Dave Chinner

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Mon, May 28, 2018 at 11:19:23AM +0200, Michal Hocko wrote:
> On Mon 28-05-18 09:48:54, Dave Chinner wrote:
> > On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> > > On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > > > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> > > [...]
> > > > > +FS/IO code then simply calls the appropriate save function right at the
> > > > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > > > +the corresponding restore function when the lock is released. All that
> > > > > +ideally along with an explanation what is the reclaim context for easier
> > > > > +maintenance.
> > > >
> > > > This paragraph doesn't make much sense to me. I think you're trying
> > > > to say that we should call the appropriate save function "before
> > > > locks are taken that a reclaim context (e.g a shrinker) might
> > > > require access to."
> > > >
> > > > I think it's also worth making a note about recursive/nested
> > > > save/restore stacking, because it's not clear from this description
> > > > that this is allowed and will work as long as inner save/restore
> > > > calls are fully nested inside outer save/restore contexts.
> > >
> > > Any better?
> > >
> > > -FS/IO code then simply calls the appropriate save function right at the
> > > -layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > -the corresponding restore function when the lock is released. All that
> > > -ideally along with an explanation what is the reclaim context for easier
> > > -maintenance.
> > > +FS/IO code then simply calls the appropriate save function before any
> > > +lock shared with the reclaim context is taken. The corresponding
> > > +restore function when the lock is released. All that ideally along with
> > > +an explanation what is the reclaim context for easier maintenance.
> > > +
> > > +Please note that the proper pairing of save/restore function allows nesting
> > > +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
> >
> > It's better, but the talk of this being necessary for locking makes
> > me cringe. XFS doesn't do it for locking reasons - it does it
> > largely for preventing transaction context nesting, which has all
> > sorts of problems that cause hangs (e.g. log space reservations
> > can't be filled) that aren't directly locking related.
>
> Yeah, I wanted to not mention locks as much as possible.
>
> > i.e we should be talking about using these functions around contexts
> > where recursion back into the filesystem through reclaim is
> > problematic, not that "holding locks" is problematic. Locks can be
> > used as an example of a problematic context, but locks are not the
> > only recursion issue that require GFP_NOFS allocation contexts to
> > avoid.
>
> agreed. Do you have any suggestion how to add a more abstract wording
> that would not make head spinning?
>
> I've tried the following. Any better?
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> index c0ec212d6773..adac362b2875 100644
> --- a/Documentation/core-api/gfp_mask-from-fs-io.rst
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -34,9 +34,11 @@ scope will inherently drop __GFP_FS respectively __GFP_IO from the given
> mask so no memory allocation can recurse back in the FS/IO.
>
> FS/IO code then simply calls the appropriate save function before any
> -lock shared with the reclaim context is taken. The corresponding
> -restore function when the lock is released. All that ideally along with
> -an explanation what is the reclaim context for easier maintenance.
> +critical section wrt. the reclaim is started - e.g. lock shared with the
> +reclaim context or when a transaction context nesting would be possible
> +via reclaim. The corresponding restore function when the critical

.... restore function should be called when ...

But otherwise I think this is much better.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-05-29 08:20:20

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Tue 29-05-18 08:32:05, Dave Chinner wrote:
> On Mon, May 28, 2018 at 11:19:23AM +0200, Michal Hocko wrote:
> > On Mon 28-05-18 09:48:54, Dave Chinner wrote:
> > > On Fri, May 25, 2018 at 10:16:24AM +0200, Michal Hocko wrote:
> > > > On Fri 25-05-18 08:17:15, Dave Chinner wrote:
> > > > > On Thu, May 24, 2018 at 01:43:41PM +0200, Michal Hocko wrote:
> > > > [...]
> > > > > > +FS/IO code then simply calls the appropriate save function right at the
> > > > > > +layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > > > > +the corresponding restore function when the lock is released. All that
> > > > > > +ideally along with an explanation what is the reclaim context for easier
> > > > > > +maintenance.
> > > > >
> > > > > This paragraph doesn't make much sense to me. I think you're trying
> > > > > to say that we should call the appropriate save function "before
> > > > > locks are taken that a reclaim context (e.g a shrinker) might
> > > > > require access to."
> > > > >
> > > > > I think it's also worth making a note about recursive/nested
> > > > > save/restore stacking, because it's not clear from this description
> > > > > that this is allowed and will work as long as inner save/restore
> > > > > calls are fully nested inside outer save/restore contexts.
> > > >
> > > > Any better?
> > > >
> > > > -FS/IO code then simply calls the appropriate save function right at the
> > > > -layer where a lock taken from the reclaim context (e.g. shrinker) and
> > > > -the corresponding restore function when the lock is released. All that
> > > > -ideally along with an explanation what is the reclaim context for easier
> > > > -maintenance.
> > > > +FS/IO code then simply calls the appropriate save function before any
> > > > +lock shared with the reclaim context is taken. The corresponding
> > > > +restore function when the lock is released. All that ideally along with
> > > > +an explanation what is the reclaim context for easier maintenance.
> > > > +
> > > > +Please note that the proper pairing of save/restore function allows nesting
> > > > +so memalloc_noio_save is safe to be called from an existing NOIO or NOFS scope.
> > >
> > > It's better, but the talk of this being necessary for locking makes
> > > me cringe. XFS doesn't do it for locking reasons - it does it
> > > largely for preventing transaction context nesting, which has all
> > > sorts of problems that cause hangs (e.g. log space reservations
> > > can't be filled) that aren't directly locking related.
> >
> > Yeah, I wanted to not mention locks as much as possible.
> >
> > > i.e we should be talking about using these functions around contexts
> > > where recursion back into the filesystem through reclaim is
> > > problematic, not that "holding locks" is problematic. Locks can be
> > > used as an example of a problematic context, but locks are not the
> > > only recursion issue that require GFP_NOFS allocation contexts to
> > > avoid.
> >
> > agreed. Do you have any suggestion how to add a more abstract wording
> > that would not make head spinning?
> >
> > I've tried the following. Any better?
> >
> > diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> > index c0ec212d6773..adac362b2875 100644
> > --- a/Documentation/core-api/gfp_mask-from-fs-io.rst
> > +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> > @@ -34,9 +34,11 @@ scope will inherently drop __GFP_FS respectively __GFP_IO from the given
> > mask so no memory allocation can recurse back in the FS/IO.
> >
> > FS/IO code then simply calls the appropriate save function before any
> > -lock shared with the reclaim context is taken. The corresponding
> > -restore function when the lock is released. All that ideally along with
> > -an explanation what is the reclaim context for easier maintenance.
> > +critical section wrt. the reclaim is started - e.g. lock shared with the
> > +reclaim context or when a transaction context nesting would be possible
> > +via reclaim. The corresponding restore function when the critical
>
> .... restore function should be called when ...

fixed

> But otherwise I think this is much better.

Thanks!

--
Michal Hocko
SUSE Labs

2018-05-29 08:23:29

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Mon 28-05-18 09:10:43, Randy Dunlap wrote:
> On 05/28/2018 02:21 AM, Michal Hocko wrote:
[...]
> > +reclaim context or when a transaction context nesting would be possible
> > +via reclaim. The corresponding restore function when the critical
>
> "The corresponding restore ... ends." << That is not a complete sentence.
> It's missing something.

Dave has pointed that out.
"The restore function should be called when the critical section ends."

> > +section ends. All that ideally along with an explanation what is
> > +the reclaim context for easier maintenance.
> > +
> > +Please note that the proper pairing of save/restore function allows
> > +nesting so it is safe to call ``memalloc_noio_save`` respectively
> > +``memalloc_noio_restore`` from an existing NOIO or NOFS scope.
>
> Please note that the proper pairing of save/restore functions allows
> nesting so it is safe to call ``memalloc_noio_save`` or
> ``memalloc_noio_restore`` respectively from an existing NOIO or NOFS scope.

Fixed. Thanks
--
Michal Hocko
SUSE Labs
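
The nesting guarantee discussed in this message can be illustrated with a user-space model of the save/restore pairs. It follows the flag handling visible in include/linux/sched/mm.h, but uses a plain global in place of current->flags, and the MODEL_ names and bit values are assumptions made for the sketch rather than kernel definitions:

```c
#include <assert.h>

/* Hypothetical flag bits; the kernel's PF_MEMALLOC_* values differ. */
#define MODEL_PF_NOIO 0x1u
#define MODEL_PF_NOFS 0x2u

/* Stand-in for current->flags of a single task. */
unsigned int model_task_flags;

unsigned int model_noio_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOIO;

	model_task_flags |= MODEL_PF_NOIO;
	return old;
}

void model_noio_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOIO) | old;
}

unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

/* A NOIO scope fully nested inside a NOFS scope. Returns 1 on success. */
int model_mixed_nesting_ok(void)
{
	unsigned int outer, inner;

	model_task_flags = 0;
	outer = model_nofs_save();
	inner = model_noio_save();
	assert(model_task_flags == (MODEL_PF_NOFS | MODEL_PF_NOIO));
	model_noio_restore(inner);
	/* Ending the inner scope must leave the outer NOFS scope active. */
	if (model_task_flags != MODEL_PF_NOFS)
		return 0;
	model_nofs_restore(outer);
	return model_task_flags == 0;
}

/* A NOFS scope nested inside another NOFS scope. Returns 1 on success. */
int model_same_type_nesting_ok(void)
{
	unsigned int outer, inner;

	model_task_flags = 0;
	outer = model_nofs_save();
	inner = model_nofs_save();
	assert(outer == 0 && inner == MODEL_PF_NOFS);
	model_nofs_restore(inner);
	/* The inner restore puts the bit back, so the outer scope survives. */
	if (!(model_task_flags & MODEL_PF_NOFS))
		return 0;
	model_nofs_restore(outer);
	return model_task_flags == 0;
}
```

The point of returning the previous bit from save is visible in the second helper: the inner restore re-applies the saved bit instead of clearing it, which is why each restore has to receive the value returned by its pairing save.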

2018-05-29 08:24:03

by Michal Hocko

Subject: Re: [PATCH] doc: document scope NOFS, NOIO APIs

On Mon 28-05-18 10:21:00, Nikolay Borisov wrote:
>
>
> On 25.05.2018 10:52, Michal Hocko wrote:
> > On Thu 24-05-18 09:37:18, Randy Dunlap wrote:
> >> On 05/24/2018 04:43 AM, Michal Hocko wrote:
> > [...]
> >>> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> >>> +resp. __GFP_IO (note the later implies clearing the first as well) in
> >>
> >> latter
> >
> > ?
> > No I really meant that clearing __GFP_IO implies __GFP_FS clearing
> Sorry to barge in like that, but Randy is right.
>
> <NIT WARNING>
>
>
> https://www.merriam-webster.com/dictionary/latter
>
> " of, relating to, or being the second of two groups or things or the
> last of several groups or things referred to
>
> </NIT WARNING>

Fixed
--
Michal Hocko
SUSE Labs

2018-05-29 08:28:18

by Michal Hocko

Subject: [PATCH v2] doc: document scope NOFS, NOIO APIs

From: Michal Hocko <[email protected]>

Although the API is documented in the source code, Ted has pointed out
that there is no mention of it in the core-api documentation, and there are
people looking there to find answers on how to use a specific API.

Changes since v1
- add kerneldoc for the API - suggested by Jonathan
- review feedback from Dave and Jonathan
- feedback from Dave about more general critical context rather than
locking
- feedback from Mike
- typo fixed - Randy, Dave

Requested-by: "Theodore Y. Ts'o" <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
.../core-api/gfp_mask-from-fs-io.rst | 61 +++++++++++++++++++
Documentation/core-api/index.rst | 1 +
include/linux/sched/mm.h | 38 ++++++++++++
3 files changed, 100 insertions(+)
create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
new file mode 100644
index 000000000000..2dc442b04a77
--- /dev/null
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -0,0 +1,61 @@
+=================================
+GFP masks used from FS/IO context
+=================================
+
+:Date: May, 2018
+:Author: Michal Hocko <[email protected]>
+
+Introduction
+============
+
+Code paths in the filesystem and IO stacks must be careful when
+allocating memory to prevent recursion deadlocks caused by direct
+memory reclaim calling back into the FS or IO paths and blocking on
+already held resources (e.g. locks - most commonly those used for the
+transaction context).
+
+The traditional way to avoid this deadlock problem is to clear __GFP_FS
+respectively __GFP_IO (note the latter implies clearing the first as well) in
+the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
+used as a shortcut. It turned out though that the above approach has led to
+abuses when the restricted gfp mask is used "just in case" without a
+deeper consideration, which leads to problems because an excessive use
+of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
+reclaim issues.
+
+New API
+========
+
+Since 4.12 there is a generic scope API for both the NOFS and NOIO contexts:
+``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
+``memalloc_noio_restore``. It allows a scope to be marked as a critical
+section from a filesystem or I/O point of view. Any allocation from that
+scope will inherently drop __GFP_FS respectively __GFP_IO from the given
+mask so no memory allocation can recurse back into the FS/IO.
+
+FS/IO code then simply calls the appropriate save function before
+any critical section with respect to the reclaim is started - e.g.
+a lock shared with the reclaim context is taken, or a transaction
+context nesting would be possible via reclaim. The restore function should be
+called when the critical section ends. All that should ideally come with an
+explanation of what the reclaim context is, for easier maintenance.
+
+Please note that the proper pairing of save/restore functions
+allows nesting so it is safe to call ``memalloc_noio_save`` or
+``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
+scope.
+
+What about __vmalloc(GFP_NOFS)
+==============================
+
+vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
+GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
+to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
+almost always a bug. The good news is that the NOFS/NOIO semantic can be
+achieved by the scope API.
+
+In the ideal world, upper layers should already mark dangerous contexts
+and so no special care is required and vmalloc should be called without
+any problems. Sometimes if the context is not really clear or there are
+layering violations then the recommended way around that is to wrap ``vmalloc``
+by the scope API with a comment explaining the problem.
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index c670a8031786..8a5f48ef16f2 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -25,6 +25,7 @@ Core utilities
genalloc
errseq
printk-formats
+ gfp_mask-from-fs-io

Interfaces for kernel debugging
===============================
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index e1f8411e6b80..af5ba077bbc4 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -166,6 +166,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { }
static inline void fs_reclaim_release(gfp_t gfp_mask) { }
#endif

+/**
+ * memalloc_noio_save - Marks implicit GFP_NOIO allocation scope.
+ *
+ * This function marks the beginning of the GFP_NOIO allocation scope.
+ * All further allocations will implicitly drop __GFP_IO flag and so
+ * they are safe for the IO critical section from the allocation recursion
+ * point of view. Use memalloc_noio_restore to end the scope with flags
+ * returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
static inline unsigned int memalloc_noio_save(void)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
@@ -173,11 +184,30 @@ static inline unsigned int memalloc_noio_save(void)
return flags;
}

+/**
+ * memalloc_noio_restore - Ends the implicit GFP_NOIO scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
+ * Always make sure that the given flags is the return value from the
+ * pairing memalloc_noio_save call.
+ */
static inline void memalloc_noio_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+/**
+ * memalloc_nofs_save - Marks implicit GFP_NOFS allocation scope.
+ *
+ * This function marks the beginning of the GFP_NOFS allocation scope.
+ * All further allocations will implicitly drop __GFP_FS flag and so
+ * they are safe for the FS critical section from the allocation recursion
+ * point of view. Use memalloc_nofs_restore to end the scope with flags
+ * returned by this function.
+ *
+ * This function is safe to be used from any context.
+ */
static inline unsigned int memalloc_nofs_save(void)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
@@ -185,6 +215,14 @@ static inline unsigned int memalloc_nofs_save(void)
return flags;
}

+/**
+ * memalloc_nofs_restore - Ends the implicit GFP_NOFS scope.
+ * @flags: Flags to restore.
+ *
+ * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
+ * Always make sure that the given flags is the return value from the
+ * pairing memalloc_nofs_save call.
+ */
static inline void memalloc_nofs_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
--
2.17.0
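
As an illustration of the vmalloc section in the patch above, the following user-space sketch models why passing GFP_NOFS to vmalloc does not help while wrapping the call in a NOFS scope does. It is a deliberate simplification, not kernel code: model_vmalloc() stands in for vmalloc, the internal allocation is reduced to a single recorded mask, and all MODEL_ names and bit values are hypothetical:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit values for this model only. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_RECLAIM 0x4u
#define MODEL_GFP_KERNEL  (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS    (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_PF_NOFS     0x1u

unsigned int model_task_flags;      /* stand-in for current->flags */
unsigned int model_last_inner_mask; /* mask used by the last inner allocation */

unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

/* The scope, not the caller, decides whether the FS bit survives. */
unsigned int model_effective_mask(unsigned int mask)
{
	if (model_task_flags & MODEL_PF_NOFS)
		mask &= ~MODEL_GFP_FS;
	return mask;
}

/* The inner allocation hardcodes GFP_KERNEL and ignores the caller's mask. */
void *model_vmalloc(size_t size, unsigned int caller_mask)
{
	(void)caller_mask;
	model_last_inner_mask = model_effective_mask(MODEL_GFP_KERNEL);
	return malloc(size);
}

/* Passing NOFS directly: returns 1 if the inner allocation may enter the FS. */
int model_plain_nofs_call_enters_fs(void)
{
	void *p = model_vmalloc(64, MODEL_GFP_NOFS);

	free(p);
	return (model_last_inner_mask & MODEL_GFP_FS) != 0;
}

/* Wrapping the call in a NOFS scope: same question, same return convention. */
int model_scoped_call_enters_fs(void)
{
	/* Reclaim context: hypothetical lock shared with a shrinker. */
	unsigned int nofs_flags = model_nofs_save();
	void *p = model_vmalloc(64, MODEL_GFP_KERNEL);

	model_nofs_restore(nofs_flags);
	free(p);
	return (model_last_inner_mask & MODEL_GFP_FS) != 0;
}
```

In this model the direct GFP_NOFS call still leaves the FS bit set on the inner allocation, while the scoped call clears it, which is the behaviour the documentation recommends relying on.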


2018-05-29 10:24:58

by Dave Chinner

Subject: Re: [PATCH v2] doc: document scope NOFS, NOIO APIs

On Tue, May 29, 2018 at 10:26:44AM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Changes since v1
> - add kerneldoc for the api - suggested by Jonathan
> - review feedback from Dave and Jonathan
> - feedback from Dave about more general critical context rather than
> locking
> - feedback from Mike
> - typo fixed - Randy, Dave
>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

We could bikeshed forever about the exact wording, but it covers
everything I think needs to be documented.

Reviewed-by: Dave Chinner <[email protected]>

--
Dave Chinner
[email protected]

2018-05-29 11:52:09

by Mike Rapoport

Subject: Re: [PATCH v2] doc: document scope NOFS, NOIO APIs

On Tue, May 29, 2018 at 10:26:44AM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.
>
> Changes since v1
> - add kerneldoc for the api - suggested by Jonathan
> - review feedback from Dave and Jonathan
> - feedback from Dave about more general critical context rather than
> locking
> - feedback from Mike
> - typo fixed - Randy, Dave
>
> Requested-by: "Theodore Y. Ts'o" <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

I believe it's worth linking the kernel-doc part with the text. e.g.:

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
index 2dc442b..b001f5f 100644
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -59,3 +59,10 @@ and so no special care is required and vmalloc should be called without
any problems. Sometimes if the context is not really clear or there are
layering violations then the recommended way around that is to wrap ``vmalloc``
by the scope API with a comment explaining the problem.
+
+Reference
+=========
+
+.. kernel-doc:: include/linux/sched/mm.h
+ :functions: memalloc_nofs_save memalloc_nofs_restore \
+ memalloc_noio_save memalloc_noio_restore

Except that, all looks good to me

Reviewed-by: Mike Rapoport <[email protected]>

> ---
> .../core-api/gfp_mask-from-fs-io.rst | 61 +++++++++++++++++++
> Documentation/core-api/index.rst | 1 +
> include/linux/sched/mm.h | 38 ++++++++++++
> 3 files changed, 100 insertions(+)
> create mode 100644 Documentation/core-api/gfp_mask-from-fs-io.rst
>
> diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
> new file mode 100644
> index 000000000000..2dc442b04a77
> --- /dev/null
> +++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
> @@ -0,0 +1,61 @@
> +=================================
> +GFP masks used from FS/IO context
> +=================================
> +
> +:Date: May, 2018
> +:Author: Michal Hocko <[email protected]>
> +
> +Introduction
> +============
> +
> +Code paths in the filesystem and IO stacks must be careful when
> +allocating memory to prevent recursion deadlocks caused by direct
> +memory reclaim calling back into the FS or IO paths and blocking on
> +already held resources (e.g. locks - most commonly those used for the
> +transaction context).
> +
> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +respectively __GFP_IO (note the latter implies clearing the first as well) in
> +the gfp mask when calling an allocator. GFP_NOFS respectively GFP_NOIO can be
> +used as shortcut. It turned out though that above approach has led to
> +abuses when the restricted gfp mask is used "just in case" without a
> +deeper consideration which leads to problems because an excessive use
> +of GFP_NOFS/GFP_NOIO can lead to memory over-reclaim or other memory
> +reclaim issues.
> +
> +New API
> +========
> +
> +Since 4.12 we do have a generic scope API for both NOFS and NOIO context
> +``memalloc_nofs_save``, ``memalloc_nofs_restore`` respectively ``memalloc_noio_save``,
> +``memalloc_noio_restore`` which allow to mark a scope to be a critical
> +section from a filesystem or I/O point of view. Any allocation from that
> +scope will inherently drop __GFP_FS respectively __GFP_IO from the given
> +mask so no memory allocation can recurse back in the FS/IO.
> +
> +FS/IO code then simply calls the appropriate save function before
> +any critical section with respect to the reclaim is started - e.g.
> +lock shared with the reclaim context or when a transaction context
> +nesting would be possible via reclaim. The restore function should be
> +called when the critical section ends. All that ideally along with an
> +explanation what is the reclaim context for easier maintenance.
> +
> +Please note that the proper pairing of save/restore functions
> +allows nesting so it is safe to call ``memalloc_noio_save`` or
> +``memalloc_noio_restore`` respectively from an existing NOIO or NOFS
> +scope.
> +
> +What about __vmalloc(GFP_NOFS)
> +==============================
> +
> +vmalloc doesn't support GFP_NOFS semantic because there are hardcoded
> +GFP_KERNEL allocations deep inside the allocator which are quite non-trivial
> +to fix up. That means that calling ``vmalloc`` with GFP_NOFS/GFP_NOIO is
> +almost always a bug. The good news is that the NOFS/NOIO semantic can be
> +achieved by the scope API.
> +
> +In the ideal world, upper layers should already mark dangerous contexts
> +and so no special care is required and vmalloc should be called without
> +any problems. Sometimes if the context is not really clear or there are
> +layering violations then the recommended way around that is to wrap ``vmalloc``
> +by the scope API with a comment explaining the problem.
> diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
> index c670a8031786..8a5f48ef16f2 100644
> --- a/Documentation/core-api/index.rst
> +++ b/Documentation/core-api/index.rst
> @@ -25,6 +25,7 @@ Core utilities
> genalloc
> errseq
> printk-formats
> + gfp_mask-from-fs-io
>
> Interfaces for kernel debugging
> ===============================
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index e1f8411e6b80..af5ba077bbc4 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -166,6 +166,17 @@ static inline void fs_reclaim_acquire(gfp_t gfp_mask) { }
> static inline void fs_reclaim_release(gfp_t gfp_mask) { }
> #endif
>
> +/**
> + * memalloc_noio_save - Marks implicit GFP_NOIO allocation scope.
> + *
> + * This function marks the beginning of the GFP_NOIO allocation scope.
> + * All further allocations will implicitly drop __GFP_IO flag and so
> + * they are safe for the IO critical section from the allocation recursion
> + * point of view. Use memalloc_noio_restore to end the scope with flags
> + * returned by this function.
> + *
> + * This function is safe to be used from any context.
> + */
> static inline unsigned int memalloc_noio_save(void)
> {
> unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
> @@ -173,11 +184,30 @@ static inline unsigned int memalloc_noio_save(void)
> return flags;
> }
>
> +/**
> + * memalloc_noio_restore - Ends the implicit GFP_NOIO scope.
> + * @flags: Flags to restore.
> + *
> + * Ends the implicit GFP_NOIO scope started by memalloc_noio_save function.
> + * Always make sure that the given flags is the return value from the
> + * pairing memalloc_noio_save call.
> + */
> static inline void memalloc_noio_restore(unsigned int flags)
> {
> current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
> }
>
> +/**
> + * memalloc_nofs_save - Marks implicit GFP_NOFS allocation scope.
> + *
> + * This function marks the beginning of the GFP_NOFS allocation scope.
> + * All further allocations will implicitly drop __GFP_FS flag and so
> + * they are safe for the FS critical section from the allocation recursion
> + * point of view. Use memalloc_nofs_restore to end the scope with flags
> + * returned by this function.
> + *
> + * This function is safe to be used from any context.
> + */
> static inline unsigned int memalloc_nofs_save(void)
> {
> unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> @@ -185,6 +215,14 @@ static inline unsigned int memalloc_nofs_save(void)
> return flags;
> }
>
> +/**
> + * memalloc_nofs_restore - Ends the implicit GFP_NOFS scope.
> + * @flags: Flags to restore.
> + *
> + * Ends the implicit GFP_NOFS scope started by memalloc_nofs_save function.
> + * Always make sure that the given flags is the return value from the
> + * pairing memalloc_nofs_save call.
> + */
> static inline void memalloc_nofs_restore(unsigned int flags)
> {
> current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> --
> 2.17.0
>
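
For readers following along, the save/restore pairing and nesting described
in the patch can be sketched in userspace. This is only a model under stated
assumptions: every model_* name and every bit value below is hypothetical,
there is no real task struct, and only the save/restore bodies follow the
hunks quoted above.

```c
#include <assert.h>

/* Illustrative bit values only; these are NOT the kernel's real values. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u
#define MODEL_GFP_IO           0x1u
#define MODEL_GFP_FS           0x2u
#define MODEL_GFP_RECLAIM      0x4u
#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-in for the per-task flags reached via "current". */
struct model_task { unsigned int flags; };
static struct model_task model_current;

/* Start a NOFS scope; returns the previous state of the NOFS bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int flags = model_current.flags & MODEL_PF_MEMALLOC_NOFS;

	model_current.flags |= MODEL_PF_MEMALLOC_NOFS;
	return flags;
}

/* End a NOFS scope by putting back the state returned by the paired save. */
static void model_nofs_restore(unsigned int flags)
{
	model_current.flags =
		(model_current.flags & ~MODEL_PF_MEMALLOC_NOFS) | flags;
}

/* What an allocator would effectively use: FS is dropped inside a scope. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_current.flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Walk through a nested scope and check the state at each step. */
static void model_nesting_demo(void)
{
	unsigned int outer, inner;

	model_current.flags = 0;
	outer = model_nofs_save();              /* scope starts */
	assert(!(model_effective_gfp(MODEL_GFP_KERNEL) & MODEL_GFP_FS));

	inner = model_nofs_save();              /* nested scope */
	model_nofs_restore(inner);              /* outer scope is still active */
	assert(model_current.flags & MODEL_PF_MEMALLOC_NOFS);

	model_nofs_restore(outer);              /* scope ends */
	assert(model_effective_gfp(MODEL_GFP_KERNEL) & MODEL_GFP_FS);
}
```

The point of returning the old bit from save is that an inner restore puts
the bit back as it found it, so the outer scope stays in force until its own
restore runs.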

--
Sincerely yours,
Mike.


2018-05-29 11:54:37

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH v2] doc: document scope NOFS, NOIO APIs

On Tue, 29 May 2018 10:26:44 +0200
Michal Hocko <[email protected]> wrote:

> Although the api is documented in the source code Ted has pointed out
> that there is no mention in the core-api Documentation and there are
> people looking there to find answers how to use a specific API.

So, I still think that this:

> +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> +respectively __GFP_IO (note the latter implies clearing the first as well) in

doesn't read the way you intend it to. But we've sent you in more
than enough circles on this already, so I went ahead and applied it;
wording can always be tweaked later.

You added the kerneldoc comments, but didn't bring them into your new
document. I'm going to tack this on afterward, hopefully nobody will
object.

Thanks,

jon

---
docs: Use the kerneldoc comments for memalloc_no*()

Now that we have kerneldoc comments for
memalloc_no{fs,io}_{save,restore}(), go ahead and pull them into the docs.

Signed-off-by: Jonathan Corbet <[email protected]>
---
Documentation/core-api/gfp_mask-from-fs-io.rst | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/Documentation/core-api/gfp_mask-from-fs-io.rst b/Documentation/core-api/gfp_mask-from-fs-io.rst
index 2dc442b04a77..e0df8f416582 100644
--- a/Documentation/core-api/gfp_mask-from-fs-io.rst
+++ b/Documentation/core-api/gfp_mask-from-fs-io.rst
@@ -33,6 +33,11 @@ section from a filesystem or I/O point of view. Any allocation from that
scope will inherently drop __GFP_FS respectively __GFP_IO from the given
mask so no memory allocation can recurse back in the FS/IO.

+.. kernel-doc:: include/linux/sched/mm.h
+ :functions: memalloc_nofs_save memalloc_nofs_restore
+.. kernel-doc:: include/linux/sched/mm.h
+ :functions: memalloc_noio_save memalloc_noio_restore
+
FS/IO code then simply calls the appropriate save function before
any critical section with respect to the reclaim is started - e.g.
lock shared with the reclaim context or when a transaction context
--
2.14.3


2018-05-29 12:38:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2] doc: document scope NOFS, NOIO APIs

On Tue 29-05-18 05:51:58, Jonathan Corbet wrote:
> On Tue, 29 May 2018 10:26:44 +0200
> Michal Hocko <[email protected]> wrote:
>
> > Although the api is documented in the source code Ted has pointed out
> > that there is no mention in the core-api Documentation and there are
> > people looking there to find answers how to use a specific API.
>
> So, I still think that this:
>
> > +The traditional way to avoid this deadlock problem is to clear __GFP_FS
> > +respectively __GFP_IO (note the latter implies clearing the first as well) in
>
> doesn't read the way you intend it to. But we've sent you in more
> than enough circles on this already, so I went ahead and applied it;
> wording can always be tweaked later.

Thanks a lot Jonathan! I am open to any suggestions of course and can
follow up with some refinements. Just for the background. The above
paragraph is meant to say that:
- clearing __GFP_FS is a way to avoid deadlocks caused by reclaim
recursing into filesystems
- clearing __GFP_IO is a way to avoid deadlocks caused by reclaim
recursing into the IO layer
- GFP_NOIO implies GFP_NOFS (a mask without __GFP_IO has __GFP_FS cleared too)
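
The three points can be written down as a small userspace model. This is a
sketch only: the model_* names are hypothetical and the bit values are made
up for illustration, although the way the composite masks are built mirrors
how GFP_KERNEL, GFP_NOFS and GFP_NOIO are composed from __GFP_RECLAIM,
__GFP_IO and __GFP_FS.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int model_gfp_t;

/* Illustrative bit values only; these are NOT the kernel's real values. */
#define MODEL_GFP_IO      0x1u  /* allocation may start physical IO */
#define MODEL_GFP_FS      0x2u  /* allocation may call down into the FS */
#define MODEL_GFP_RECLAIM 0x4u  /* allocation may reclaim memory */

#define MODEL_GFP_KERNEL (MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO   (MODEL_GFP_RECLAIM)

/* True if reclaim triggered by this allocation may recurse into the FS. */
static bool model_may_enter_fs(model_gfp_t mask)
{
	return (mask & MODEL_GFP_FS) != 0;
}

/* True if reclaim triggered by this allocation may recurse into the IO layer. */
static bool model_may_start_io(model_gfp_t mask)
{
	return (mask & MODEL_GFP_IO) != 0;
}

/* Check the three statements from the list above against the model. */
static void model_check_masks(void)
{
	assert(model_may_enter_fs(MODEL_GFP_KERNEL));
	assert(!model_may_enter_fs(MODEL_GFP_NOFS));  /* no FS recursion */
	assert(model_may_start_io(MODEL_GFP_NOFS));   /* IO still allowed */
	assert(!model_may_start_io(MODEL_GFP_NOIO));  /* no IO recursion */
	assert(!model_may_enter_fs(MODEL_GFP_NOIO));  /* NOIO implies NOFS */
}
```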

> You added the kerneldoc comments, but didn't bring them into your new
> document. I'm going to tack this on afterward, hopefully nobody will
> object.

I have to confess I've never studied how the rst and kerneldoc should be
interlinked so thanks for the fix up!
--
Michal Hocko
SUSE Labs

2018-07-17 12:49:00

by Michal Hocko

[permalink] [raw]
Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 21:03:43, Richard Weinberger wrote:
> Am Dienstag, 24. April 2018, 18:27:12 CEST schrieb Michal Hocko:
> > fs/ubifs/debug.c
>
> This one is just for debugging.
> So, preallocating + locking would not hurt much.
>
> > fs/ubifs/lprops.c
>
> Ditto.
>
> > fs/ubifs/lpt_commit.c
>
> Here we use it also only in debugging mode and in one case for
> fatal error reporting.
> No hot paths.
>
> > fs/ubifs/orphan.c
>
> Also only for debugging.
> Getting rid of vmalloc with GFP_NOFS in UBIFS is no big problem.
> I can prepare a patch.

Hi Richard, I have just got back to this and noticed that the vmalloc
NOFS usage is still there. Do you have any plans to push changes to
remove it?
--
Michal Hocko
SUSE Labs

2018-07-17 12:50:28

by Michal Hocko

[permalink] [raw]
Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 14:35:36, Theodore Ts'o wrote:
> On Tue, Apr 24, 2018 at 10:27:12AM -0600, Michal Hocko wrote:
> > fs/ext4/xattr.c
> >
> > What to do about this? Well, there are two things. Firstly, it would be
> > really great to double check whether the GFP_NOFS is really needed. I
> > cannot judge that because I am not familiar with the code.
>
> *Most* of the time it's not needed, but there are times when it is.
> We could be more smart about sending down GFP_NOFS only when it is
> needed. If we are sending too many GFP_NOFS's allocations such that
> it's causing heartburn, we could fix this. (xattr commands are rare
> enough that I didn't think it was worth it to modulate the GFP flags
> for this particular case, but we could make it be smarter if it would
> help.)

There still seem to be ext4_kvmalloc(NOFS) callers in the ext4 code. Do
you have any plans to get rid of those?
--
Michal Hocko
SUSE Labs

2018-07-17 12:52:32

by Michal Hocko

[permalink] [raw]
Subject: Re: vmalloc with GFP_NOFS

On Tue 24-04-18 14:09:27, Michal Hocko wrote:
> On Tue 24-04-18 20:26:23, Steven Whitehouse wrote:
> [...]
> > It would be good to fix this, and it has been known as an issue for a long
> > time. We might well be able to make use of the new API though. It might be
> > as simple as adding the calls when we get & release glocks, but I'd have to
> > check the code to be sure,
>
> Yeah, starting with annotating those locking contexts and documenting
> how they are used in the reclaim is a great first step. This has to
> be done per-fs obviously.

Any chance of progress here?
--
Michal Hocko
SUSE Labs