Many places that need to wait before retrying a memory allocation use
congestion_wait(). xfs_buf_alloc_pages() is a good example, and it
follows a pattern similar to the one in svc_alloc_arg().
It makes sense to do the same thing in svc_alloc_arg(). This will allow
the allocation to be retried sooner if some backing device becomes
non-congested before the timeout expires.
Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
as the first argument, so we do the same here.
The second argument - an upper limit for waiting - seems fairly
arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
choice, it seems reasonable to leave the maximum time unchanged.
If a service using svc_alloc_arg() is terminated, it may now have to
wait up to the full 500ms before termination completes, as
congestion_wait() cannot be interrupted. I don't believe this will be a
problem in practice, though it might be justification for using a
smaller timeout.
Signed-off-by: NeilBrown <[email protected]>
---
I happened to notice this inconsistency between svc_alloc_arg() and
xfs_buf_alloc_pages(), despite them doing very similar things, so I
thought I'd send a patch.
NeilBrown
net/sunrpc/svc_xprt.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 796eebf1787d..161433ae0fab 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -10,6 +10,7 @@
 #include <linux/freezer.h>
 #include <linux/kthread.h>
 #include <linux/slab.h>
+#include <linux/backing-dev.h>
 #include <net/sock.h>
 #include <linux/sunrpc/addr.h>
 #include <linux/sunrpc/stats.h>
@@ -682,12 +683,10 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			/* Made progress, don't sleep yet */
 			continue;
 
-		set_current_state(TASK_INTERRUPTIBLE);
-		if (signalled() || kthread_should_stop()) {
-			set_current_state(TASK_RUNNING);
+		if (signalled() || kthread_should_stop())
 			return -EINTR;
-		}
-		schedule_timeout(msecs_to_jiffies(500));
+
+		congestion_wait(BLK_RW_ASYNC, msecs_to_jiffies(500));
 	}
 	rqstp->rq_page_end = &rqstp->rq_pages[pages];
 	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
--
2.33.0
Hi Neil-
> On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
>
>
> Many places that need to wait before retrying a memory allocation use
> congestion_wait(). xfs_buf_alloc_pages() is a good example which
> follows a similar pattern to that in svc_alloc_args().
>
> It make sense to do the same thing in svc_alloc_args(); This will allow
> the allocation to be retried sooner if some backing device becomes
> non-congested before the timeout.
>
> Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
> as the first argument, so we should so.
>
> The second argument - an upper limit for waiting - seem fairly
> arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
> choice, it seems reasonable to leave the maximum time unchanged.
>
> If a service using svc_alloc_args() is terminated, it may now have to
> wait up to the full 500ms before termination completes as
> congestion_wait() cannot be interrupted. I don't believe this will be a
> problem in practice, though it might be justification for using a
> smaller timeout.
>
> Signed-off-by: NeilBrown <[email protected]>
> ---
>
> I happened to notice this inconsistency between svc_alloc_args() and
> xfs_buf_alloc_pages() despite them doing very similar things, so thought
> I'd send a patch.
>
> NeilBrown
When we first considered the alloc_pages_bulk() API, the SUNRPC
patch in that series replaced this schedule_timeout(). Mel
suggested we postpone that to a separate patch. Now is an ideal
time to consider this change again. I've added the MM folks for
expert commentary.
I would rather see a shorter timeout, since that will be less
disruptive in practice and today's systems shouldn't have to wait
that long for free memory to become available. DEFAULT_IO_TIMEOUT
might be a defensible choice -- it will slow down this loop
effectively without adding a significant delay.
> net/sunrpc/svc_xprt.c | 9 ++++-----
> 1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 796eebf1787d..161433ae0fab 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -10,6 +10,7 @@
>  #include <linux/freezer.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> +#include <linux/backing-dev.h>
>  #include <net/sock.h>
>  #include <linux/sunrpc/addr.h>
>  #include <linux/sunrpc/stats.h>
> @@ -682,12 +683,10 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>  			/* Made progress, don't sleep yet */
>  			continue;
>  
> -		set_current_state(TASK_INTERRUPTIBLE);
> -		if (signalled() || kthread_should_stop()) {
> -			set_current_state(TASK_RUNNING);
> +		if (signalled() || kthread_should_stop())
>  			return -EINTR;
> -		}
> -		schedule_timeout(msecs_to_jiffies(500));
> +
> +		congestion_wait(BLK_RW_ASYNC, msecs_to_jiffies(500));
>  	}
>  	rqstp->rq_page_end = &rqstp->rq_pages[pages];
>  	rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
> --
> 2.33.0
>
--
Chuck Lever
On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> Hi Neil-
>
> > On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
> >
> >
> > Many places that need to wait before retrying a memory allocation use
> > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > follows a similar pattern to that in svc_alloc_args().
> >
> > It make sense to do the same thing in svc_alloc_args(); This will allow
> > the allocation to be retried sooner if some backing device becomes
> > non-congested before the timeout.
It's adorable that you believe this is still true.
https://lore.kernel.org/linux-mm/[email protected]/
On Tue, 07 Sep 2021, Chuck Lever III wrote:
> Hi Neil-
>
> > On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
> >
> >
> > Many places that need to wait before retrying a memory allocation use
> > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > follows a similar pattern to that in svc_alloc_args().
> >
> > It make sense to do the same thing in svc_alloc_args(); This will allow
> > the allocation to be retried sooner if some backing device becomes
> > non-congested before the timeout.
> >
> > Every call to congestion_wait() in the entire kernel passes BLK_RW_ASYNC
> > as the first argument, so we should so.
> >
> > The second argument - an upper limit for waiting - seem fairly
> > arbitrary. Many places use "HZ/50" or "HZ/10". As there is no obvious
> > choice, it seems reasonable to leave the maximum time unchanged.
> >
> > If a service using svc_alloc_args() is terminated, it may now have to
> > wait up to the full 500ms before termination completes as
> > congestion_wait() cannot be interrupted. I don't believe this will be a
> > problem in practice, though it might be justification for using a
> > smaller timeout.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > ---
> >
> > I happened to notice this inconsistency between svc_alloc_args() and
> > xfs_buf_alloc_pages() despite them doing very similar things, so thought
> > I'd send a patch.
> >
> > NeilBrown
>
> When we first considered the alloc_pages_bulk() API, the SUNRPC
> patch in that series replaced this schedule_timeout(). Mel
> suggested we postpone that to a separate patch. Now is an ideal
> time to consider this change again. I've added the MM folks for
> expert commentary.
>
> I would rather see a shorter timeout, since that will be less
> disruptive in practice and today's systems shouldn't have to wait
> that long for free memory to become available. DEFAULT_IO_TIMEOUT
> might be a defensible choice -- it will slow down this loop
> effectively without adding a significant delay.
DEFAULT_IO_TIMEOUT is local to f2fs, so might not be the best choice.
I could be comfortable with any number from 1 to HZ, and have no basis
for making a choice - which is why I deliberately avoided making
one.
Ideally, the full timeout would (almost) never expire in practice.
Ideally, the interface would not even ask that we supply a timeout.
But we are not currently at that ideal ;-(
Thanks,
NeilBrown
On Mon, Sep 06, 2021 at 09:20:35PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > Hi Neil-
> >
> > > On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
> > >
> > >
> > > Many places that need to wait before retrying a memory allocation use
> > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > follows a similar pattern to that in svc_alloc_args().
> > >
> > > It make sense to do the same thing in svc_alloc_args(); This will allow
> > > the allocation to be retried sooner if some backing device becomes
> > > non-congested before the timeout.
>
> It's adorable that you believe this is still true.
>
> https://lore.kernel.org/linux-mm/[email protected]/
So, what's the advice for now? Do we add the congestion_wait() call
anyway and assume it'll be fixed to do something less broken in the
future, or just skip it completely?
--b.
On Tue, 07 Sep 2021, Matthew Wilcox wrote:
> On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > Hi Neil-
> >
> > > On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
> > >
> > >
> > > Many places that need to wait before retrying a memory allocation use
> > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > follows a similar pattern to that in svc_alloc_args().
> > >
> > > It make sense to do the same thing in svc_alloc_args(); This will allow
> > > the allocation to be retried sooner if some backing device becomes
> > > non-congested before the timeout.
>
> It's adorable that you believe this is still true.
always happy to be called "adorable" !!
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
>
Interesting ... a few filesystems call clear_bdi_congested(), but not
enough to make a difference.
At least my patch won't make things worse. And when (not if !!)
congestion_wait() gets fixed, sunrpc will immediately benefit.
I suspect that "congestion_wait()" needs to be replaced by several
different interfaces.
Some callers want to wait until memory might be available, which should
be tracked entirely by MM, not by filesystems.
Other callers are really only interested in their own bdi making progress
and should be allowed to specify that bdi.
And in general, it seems that the waiters aren't really interested in
congestion being eased, but in progress being made.
reclaim_progress_wait()
bdi_progress_wait()
??
Even if we just provided
void reclaim_progress_wait(void)
{
	schedule_timeout_uninterruptible(HZ/20);
}
that would be an improvement. It could be called from svc_alloc_arg()
and various other places that want to wait before retrying an
allocation, and it would be no worse than current behaviour.
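In svc_alloc_arg(), for example, the retry loop would then be roughly
as below (an untested sketch - reclaim_progress_wait() is of course only
the not-yet-existing helper suggested above):

	for (filled = 0; filled < pages; filled = ret) {
		ret = alloc_pages_bulk_array(GFP_KERNEL, pages,
					     rqstp->rq_pages);
		if (ret > filled)
			/* Made progress, don't sleep yet */
			continue;

		if (signalled() || kthread_should_stop())
			return -EINTR;

		/* Wait for reclaim to (hopefully) make some progress,
		 * then try the allocation again.
		 */
		reclaim_progress_wait();
	}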
Then if anyone had a clever idea of how to get a notification when
progress had been made, there would be an obvious place to implement
that idea.
NeilBrown
On Tue, 07 Sep 2021, NeilBrown wrote:
> Even if we just provided
>
> void reclaim_progress_wait(void)
> {
> 	schedule_timeout_uninterruptible(HZ/20);
> }
>
> that would be an improvement.
I contemplated providing a patch, but the more I thought about it, the
less sure I was.
When does a single-page GFP_KERNEL allocation fail? Ever?
I know that if I add __GFP_NOFAIL then it won't fail and that is
preferred to looping.
I know that if I add __GFP_RETRY_MAYFAIL (or others) then it might
fail.
But what are the semantics for a plain GFP_KERNEL ??
I recall a suggestion once that it would only fail if the process was
being killed by the oom killer. That seems reasonable and would suggest
that retrying is really bad. Is that true?
For svc_alloc_arg(), it might be better to fail and have the calling
server thread exit. This would need to be combined with dynamic
thread-count management so that when a request arrived, a new thread
might be started.
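In code terms, the sleep in svc_alloc_arg() would then become something
like this (hand-waving - the dynamic thread management is the real work
and isn't shown):

		if (signalled() || kthread_should_stop())
			return -EINTR;
		/* No progress was made: fail, and let the calling
		 * server thread exit rather than retrying forever.
		 */
		return -ENOMEM;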
So maybe we really don't want reclaim_progress_wait(), and all current
callers of congestion_wait() which are just waiting for allocation to
succeed should either be changed to use __GFP_NOFAIL, or to handle
failure.
For the ext4_truncate() case that would be easier if there were a
PF_MEMALLOC_NOFAIL flag, though maybe passing down __GFP_NOFAIL isn't
too hard.
(this is why we all work around problems in the platform rather than
fixing them. Fixing them is just too hard...)
NeilBrown
On Tue, Sep 07, 2021 at 08:22:39AM +1000, NeilBrown wrote:
> On Tue, 07 Sep 2021, Matthew Wilcox wrote:
> > On Mon, Sep 06, 2021 at 03:46:34PM +0000, Chuck Lever III wrote:
> > > Hi Neil-
> > >
> > > > On Sep 6, 2021, at 12:44 AM, NeilBrown <[email protected]> wrote:
> > > >
> > > >
> > > > Many places that need to wait before retrying a memory allocation use
> > > > congestion_wait(). xfs_buf_alloc_pages() is a good example which
> > > > follows a similar pattern to that in svc_alloc_args().
> > > >
> > > > It make sense to do the same thing in svc_alloc_args(); This will allow
> > > > the allocation to be retried sooner if some backing device becomes
> > > > non-congested before the timeout.
> >
> > It's adorable that you believe this is still true.
>
> always happy to be called "adorable" !!
>
> >
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> >
> Interesting ... a few filesystems call clear_bdi_congested(), but not
> enough to make a difference.
>
> At least my patch won't make things worse. And when (not if !!)
> congestion_wait() gets fixed, sunrpc will immediately benefit.
>
> I suspect that "congestion_wait()" needs to be replaced by several
> different interfaces.
>
> Some callers want to wait until memory might be available, which should
> be tracked entirely by MM, not by filesystems.
> Other caller are really only interested in their own bdi making progress
> and should be allowed to specify that bdi.
>
For the available memory side, I believe the interface would involve a
waitqueue combined with something like struct capture_control except it
has a waitqueue, a zone, an order, a struct page pointer and a list_head
that is declared on stack. Reclaimers for that zone would check if there
are any such waiters and if so, add a page that has just being reclaimed
and wake the waiter.
That then would be more event driven than time driven which is usually
what mm is meant to do. Despite congestion_wait being known to be broken
for a long time, I don't recall anyone trying to actually fix it.
> And in general, it seems that that waits aren't really interested in
> congestion being eased, but in progress being made.
>
> reclaim_progress_wait()
> bdi_progress_wait()
>
> ??
>
> Even if we just provided
>
> void reclaim_progress_wait(void)
> {
> 	schedule_timeout_uninterruptible(HZ/20);
> }
>
reclaim_progress_wait() at least would clarify that it's waiting on a page,
but ultimately it shouldn't be time based.
--
Mel Gorman
SUSE Labs
> On Sep 6, 2021, at 8:41 PM, NeilBrown <[email protected]> wrote:
>
> When does a single-page GFP_KERNEL allocation fail? Ever?
>
> I know that if I add __GFP_NOFAIL then it won't fail and that is
> preferred to looping.
> I know that if I add __GFP_RETRY_MAILFAIL (or others) then it might
> fail.
> But that is the semantics for a plain GFP_KERNEL ??
>
> I recall a suggestion one that it would only fail if the process was
> being killed by the oom killer. That seems reasonable and would suggest
> that retrying is really bad. Is that true?
>
> For svc_alloc_args(), it might be better to fail and have the calling
> server thread exit. This would need to be combined with dynamic
> thread-count management so that when a request arrived, a new thread
> might be started.
I don't immediately see a benefit to killing server threads
during periods of memory exhaustion, but sometimes I lack
imagination.
> So maybe we really don't want reclaim_progress_wait(), and all current
> callers of congestion_wait() which are just waiting for allocation to
> succeed should be either change to use __GFP_NOFAIL, or to handle
> failure.
I had completely forgotten about GFP_NOFAIL. That seems like the
preferred answer, as it avoids an arbitrary time-based wait for
a memory resource. (And maybe svc_rqst_alloc() should also get
the NOFAIL treatment?).
The question in my mind is how the new alloc_pages_bulk() will
behave when passed NOFAIL. I would expect that NOFAIL would simply
guarantee that at least one page is allocated on every call.
--
Chuck Lever
On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
>
>
> > On Sep 6, 2021, at 8:41 PM, NeilBrown <[email protected]> wrote:
> >
> > When does a single-page GFP_KERNEL allocation fail? Ever?
> >
> > I know that if I add __GFP_NOFAIL then it won't fail and that is
> > preferred to looping.
> > I know that if I add __GFP_RETRY_MAILFAIL (or others) then it might
> > fail.
> > But that is the semantics for a plain GFP_KERNEL ??
> >
> > I recall a suggestion one that it would only fail if the process was
> > being killed by the oom killer. That seems reasonable and would suggest
> > that retrying is really bad. Is that true?
> >
> > For svc_alloc_args(), it might be better to fail and have the calling
> > server thread exit. This would need to be combined with dynamic
> > thread-count management so that when a request arrived, a new thread
> > might be started.
>
> I don't immediately see a benefit to killing server threads
> during periods of memory exhaustion, but sometimes I lack
> imagination.
Give up parallelism in return for at least a hope of forward progress?
(Which should be possible as long as there's at least one server
thread.)
--b.
On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
> > So maybe we really don't want reclaim_progress_wait(), and all current
> > callers of congestion_wait() which are just waiting for allocation to
> > succeed should be either change to use __GFP_NOFAIL, or to handle
> > failure.
>
> I had completely forgotten about GFP_NOFAIL. That seems like the
> preferred answer, as it avoids an arbitrary time-based wait for
> a memory resource. (And maybe svc_rqst_alloc() should also get
> the NOFAIL treatment?).
>
Don't use NOFAIL. It's intended for allocation requests where, if they fail,
there is no sane way for the kernel to recover. Among other things,
it can access emergency memory reserves, and if those are not returned
quickly it may cause premature OOM or even a livelock.
> The question in my mind is how the new alloc_pages_bulk() will
> behave when passed NOFAIL. I would expect that NOFAIL would simply
> guarantee that at least one page is allocated on every call.
>
alloc_pages_bulk ignores GFP_NOFAIL. If called repeatedly, it might pass
the GFP_NOFAIL flag to allocate at least one page but you'll be depleting
reserves to do it from a call site that has other options for recovery.
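(i.e. something along the lines of the sketch below, shown only to be
clear about what is being discouraged, not as a suggestion:)

	filled = alloc_pages_bulk_array(GFP_KERNEL, pages, rqstp->rq_pages);
	if (filled < pages)
		/* Guarantees one more page, but may dip into the
		 * emergency reserves to provide it.
		 */
		rqstp->rq_pages[filled] = alloc_page(GFP_KERNEL | __GFP_NOFAIL);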
The docs say it better
* %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
* cannot handle allocation failures. The allocation could block
* indefinitely but will never return with failure. Testing for
* failure is pointless.
* New users should be evaluated carefully (and the flag should be
* used only when there is no reasonable failure policy) but it is
* definitely preferable to use the flag rather than opencode endless
* loop around allocator.
* Using this flag for costly allocations is _highly_ discouraged
--
Mel Gorman
SUSE Labs
> On Sep 7, 2021, at 11:41 AM, Mel Gorman <[email protected]> wrote:
>
> On Tue, Sep 07, 2021 at 02:53:48PM +0000, Chuck Lever III wrote:
>>> So maybe we really don't want reclaim_progress_wait(), and all current
>>> callers of congestion_wait() which are just waiting for allocation to
>>> succeed should be either change to use __GFP_NOFAIL, or to handle
>>> failure.
>>
>> I had completely forgotten about GFP_NOFAIL. That seems like the
>> preferred answer, as it avoids an arbitrary time-based wait for
>> a memory resource. (And maybe svc_rqst_alloc() should also get
>> the NOFAIL treatment?).
>>
>
> Don't use NOFAIL. It's intended for allocation requests that if they fail,
> there is no sane way for the kernel to recover. Amoung other things,
> it can access emergency memory reserves and if not returned quickly may
> cause premature OOM or even a livelock.
>
>> The question in my mind is how the new alloc_pages_bulk() will
>> behave when passed NOFAIL. I would expect that NOFAIL would simply
>> guarantee that at least one page is allocated on every call.
>>
>
> alloc_pages_bulk ignores GFP_NOFAIL. If called repeatedly, it might pass
> the GFP_NOFAIL flag to allocate at least one page but you'll be depleting
> reserves to do it from a call site that has other options for recovery.
True, an allocation failure in svc_alloc_arg() won't cause kernel
instability.
But AFAICS svc_alloc_arg() can't do anything but call again if an
allocation request fails to make forward progress. There really
aren't other options for recovery.
On its face, svc_alloc_arg() seems to match the "use the flag rather
than opencode [an] endless loop around [the] allocator" comment below.
> The docs say it better
>
> * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> * cannot handle allocation failures. The allocation could block
> * indefinitely but will never return with failure. Testing for
> * failure is pointless.
> * New users should be evaluated carefully (and the flag should be
> * used only when there is no reasonable failure policy) but it is
> * definitely preferable to use the flag rather than opencode endless
> * loop around allocator.
As an aside, there's nothing here that suggests that using NOFAIL would
be risky. And the "no reasonable failure policy" phrasing is a
parenthetical, but it seems to be most pertinent.
IMHO the comment should be updated to match the current implementation
of NOFAIL.
> * Using this flag for costly allocations is _highly_ discouraged
I don't have a preference about using NOFAIL. I think it's a good
idea to base the wait for resources on the availability of those
resources rather than an arbitrary time interval, so I'd like to
see the schedule_timeout() in svc_alloc_arg() replaced with
something smarter.
--
Chuck Lever
On Wed, 08 Sep 2021, Chuck Lever III wrote:
>
> > On Sep 6, 2021, at 8:41 PM, NeilBrown <[email protected]> wrote:
> >
> > When does a single-page GFP_KERNEL allocation fail? Ever?
> >
> > I know that if I add __GFP_NOFAIL then it won't fail and that is
> > preferred to looping.
> > I know that if I add __GFP_RETRY_MAILFAIL (or others) then it might
> > fail.
> > But that is the semantics for a plain GFP_KERNEL ??
> >
> > I recall a suggestion one that it would only fail if the process was
> > being killed by the oom killer. That seems reasonable and would suggest
> > that retrying is really bad. Is that true?
> >
> > For svc_alloc_args(), it might be better to fail and have the calling
> > server thread exit. This would need to be combined with dynamic
> > thread-count management so that when a request arrived, a new thread
> > might be started.
>
> I don't immediately see a benefit to killing server threads
> during periods of memory exhaustion, but sometimes I lack
> imagination.
Dining philosophers problem. If you cannot get all the resources you
need, a polite response is to release *all* the resources you have
before trying again.
>
>
> > So maybe we really don't want reclaim_progress_wait(), and all current
> > callers of congestion_wait() which are just waiting for allocation to
> > succeed should be either change to use __GFP_NOFAIL, or to handle
> > failure.
>
> I had completely forgotten about GFP_NOFAIL. That seems like the
> preferred answer, as it avoids an arbitrary time-based wait for
> a memory resource. (And maybe svc_rqst_alloc() should also get
> the NOFAIL treatment?).
Apart from what others have said, GFP_NOFAIL is not appropriate for
svc_rqst_alloc() because it cannot honour kthread_stop().
NeilBrown