2020-04-16 01:16:24

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE


PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the deamon writing to the "final" bdi.

This patch changes the heuristic so that the task is only throttled if
*both* the global threshhold *and* the per-wb threshold are exceeded.
This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the
global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE
task will be allowed to proceed unthrottled if the global threshold is
not exceeded or if the local threshold is not exceeded. They need to
both be exceeded before PF_LOCAL_THROTTLE tasks are throttled.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless both
thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks. I don't know what is wanted for realtime.

Acked-by: Chuck Lever <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
---
drivers/block/loop.c | 2 +-
fs/nfsd/vfs.c | 9 +++++----
include/linux/sched.h | 3 ++-
kernel/sys.c | 2 +-
mm/page-writeback.c | 18 ++++++++++++++----
mm/vmscan.c | 4 ++--
6 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..d89c25ba3b89 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo)

static int loop_kthread_worker_fn(void *worker_ptr)
{
- current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+ current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
return kthread_worker_fn(worker_ptr);
}

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,

if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
/*
- * We want less throttling in balance_dirty_pages()
- * and shrink_inactive_list() so that nfs to
+ * We want throttling in balance_dirty_pages()
+ * and shrink_inactive_list() to only consider
+ * the backingdev we are writing to, so that nfs to
* localhost doesn't cause nfsd to lock up due to all
* the client's dirty pages or its congested queue.
*/
- current->flags |= PF_LESS_THROTTLE;
+ current->flags |= PF_LOCAL_THROTTLE;

exp = fhp->fh_export;
use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
nfserr = nfserrno(host_err);
}
if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
- current_restore_flags(pflags, PF_LESS_THROTTLE);
+ current_restore_flags(pflags, PF_LOCAL_THROTTLE);
return nfserr;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@ extern struct pid *cad_pid;
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
-#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
+#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to,
+ * I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..180a2fa33f7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}

-#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7326b54ab728..9692c553526b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void)
* Calculate @dtc->thresh and ->bg_thresh considering
* vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
* must ensure that @dtc->avail is set before calling this function. The
- * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
+ * dirty limits will be lifted by 1/4 for real-time tasks.
*/
static void domain_dirty_limits(struct dirty_throttle_control *dtc)
{
@@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
if (bg_thresh >= thresh)
bg_thresh = thresh / 2;
tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ if (rt_task(tsk)) {
bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
@@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
else
dirty = vm_dirty_ratio * node_memory / 100;

- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ if (rt_task(tsk))
dirty += dirty / 4;

return dirty;
@@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
sdtc = mdtc;
}

+ if (current->flags & PF_LOCAL_THROTTLE)
+ /* This task must only be throttled based on the bdi
+ * it is writing to - dirty pages for other bdis might
+ * be pages this task is trying to write out. So it
+ * gets a free pass unless both global and local
+ * thresholds are exceeded. i.e unless
+ * "dirty_exceeded".
+ */
+ if (!dirty_exceeded)
+ break;
+
if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b06868fc4926..80ab523926aa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1879,13 +1879,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,

/*
* If a kernel thread (such as nfsd for loop-back mounts) services
- * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
* In that case we should only throttle if the backing device it is
* writing to is congested. In other cases it is safe to throttle.
*/
static int current_may_throttle(void)
{
- return !(current->flags & PF_LESS_THROTTLE) ||
+ return !(current->flags & PF_LOCAL_THROTTLE) ||
current->backing_dev_info == NULL ||
bdi_write_congested(current->backing_dev_info);
}
--
2.26.0


Attachments:
signature.asc (847.00 B)

2020-04-16 06:55:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

> + if (current->flags & PF_LOCAL_THROTTLE)
> + /* This task must only be throttled based on the bdi
> + * it is writing to - dirty pages for other bdis might
> + * be pages this task is trying to write out. So it
> + * gets a free pass unless both global and local
> + * thresholds are exceeded. i.e unless
> + * "dirty_exceeded".
> + */

This is not our normal multi-line comment style. The first line should
be just a

/*

Otherwise this looks good.

2020-04-21 02:23:59

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

On Thu, Apr 16 2020, Jan Kara wrote:

> On Thu 16-04-20 10:30:42, NeilBrown wrote:
>>
>> PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
>> loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
>> daemon needs to write to one bdi (the final bdi) in order to free up
>> writes queued to another bdi (the client bdi).
>>
>> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
>> pages, so that it can still dirty pages after other processses have been
>> throttled.
>>
>> This approach was designed when all threads were blocked equally,
>> independently on which device they were writing to, or how fast it was.
>> Since that time the writeback algorithm has changed substantially with
>> different threads getting different allowances based on non-trivial
>> heuristics. This means the simple "add 25%" heuristic is no longer
>> reliable.
>>
>> The important issue is not that the daemon needs a *larger* dirty page
>> allowance, but that it needs a *private* dirty page allowance, so that
>> dirty pages for the "client" bdi that it is helping to clear (the bdi for
>> an NFS filesystem or loop block device etc) do not affect the throttling
>> of the deamon writing to the "final" bdi.
>>
>> This patch changes the heuristic so that the task is only throttled if
>> *both* the global threshhold *and* the per-wb threshold are exceeded.
>> This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the
>> global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE
>> task will be allowed to proceed unthrottled if the global threshold is
>> not exceeded or if the local threshold is not exceeded. They need to
>> both be exceeded before PF_LOCAL_THROTTLE tasks are throttled.
>>
>> This approach of "only throttle when target bdi is busy" is consistent
>> with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
>> it causes attention to be focussed only on the target bdi.
>>
>> So this patch
>> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
>> - removes the 25% bonus that that flag gives, and
>> - If PF_LOCAL_THROTTLE is set, don't delay at all unless both
>> thresholds are exceeded.
>>
>> Note that previously realtime threads were treated the same as
>> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
>> real-time threads, so it is now different from the behaviour of nfsd and
>> loop tasks. I don't know what is wanted for realtime.
>>
>> Acked-by: Chuck Lever <[email protected]>
>> Signed-off-by: NeilBrown <[email protected]>
>
> ...
>
>> @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>> sdtc = mdtc;
>> }
>>
>> + if (current->flags & PF_LOCAL_THROTTLE)
>> + /* This task must only be throttled based on the bdi
>> + * it is writing to - dirty pages for other bdis might
>> + * be pages this task is trying to write out. So it
>> + * gets a free pass unless both global and local
>> + * thresholds are exceeded. i.e unless
>> + * "dirty_exceeded".
>> + */
>> + if (!dirty_exceeded)
>> + break;
>> +
>> if (dirty_exceeded && !wb->dirty_exceeded)
>> wb->dirty_exceeded = 1;
>
> Ok, but note that this will have one sideeffect you maybe didn't realize:
> Currently we try to throttle tasks softly - the heuristic rougly works like
> this: If dirty < (thresh + bg_thresh)/2, leave the task alone.
> (thresh+bg_thresh)/2 is called "freerun ceiling". If dirty is greater than
> this, we delay the task somewhat (the aim is to delay the task as long as
> it would take to write back the pages task has dirtied) in
> balance_dirty_pages() so ideally 'thresh' is never hit. Only if the
> heuristic consistently underestimates the time to writeback pages, we hit
> 'thresh' and then block the task as long as it takes flush worker to clean
> enough pages to get below 'thresh'. This all leads to task being usually
> gradually slowed down in balance_dirty_pages() which generally leads to
> smoother overall system behavior.
>
> What you did makes PF_LOCAL_THROTTLE tasks ignore any limits and then when
> local bdi limit is exceeded, they'll suddently hit the wall and be blocked
> for a long time in balance_dirty_pages().
>
> So I like what you suggest in principle, just I think the implementation
> has undesirable sideeffects. I think it would be better to modify
> wb_position_ratio() to take PF_LOCAL_THROTTLE into account. It will be
> probably similar to how BDI_CAP_STRICTLIMIT is handled but different in
> some ways because BDI_CAP_STRICTLIMIT takes minimum from wb_pos_ratio and
> global pos_ratio, you rather want to take wb_pos_ratio only. Also there are
> some early bail out conditions when we are over global dirty limit which
> you need to handle differently for PF_LOCAL_THROTTLE. And then, when you
> have appropriate pos_ratio computed based on your policy, you can let the
> task wait for appropriate amount of time and things should just work (TM) ;).
> Thinking about it, you probably also want to add 'freerun' condition for
> PF_LOCAL_THROTTLE tasks like:
>
> if ((current->flags & PF_LOCAL_THROTTLE) &&
> wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh))
> go the freerun path...
>

Thanks.....
I have 2 thoughts on this.
One is that I'm not sure how much it really matters.
The PF_LOCAL_THROTTLE task it always doing writeout on behalf of some
other process. Some process writes to NFS or to a loop block device or
somewhere, then the PF_LOCAL_THROTTLE task writes those dirty pages out
to a different BDI. So the top level task will be throttled, an the
PF_LOCAL_THROTTLE task won't get more than it can handle.
There will be starting transients of course, but I doubt it would
generally be a problem. However it would still be nice to find the
"right" solution.

My second thought is that I really don't understand the writeback code.
I think I understand the general principle, and there are lots of big
comments that try to explain things, but it just doesn't seem to help.
I look at the code and see more questions than answers.

What are the units for "dirty_ratelimit"?? I think it is pages per
second, because it is initialized to INIT_BW which is documented as 100
MB/s.
What is the difference between dirty_ratelimit and
balanced_dirty_ratelimit?
The later is "balanced" I guess. What does that mean?
Apparently (from backing-dev-defs.h) dirty_ratelimit moves in smaller
steps and is more smooth than balanced_dirty_ratelimit. How is being
less smooth, more balanced??

What is pos_ratio? And what is RATELIMIT_CALC_SHIFT ???
Maybe pos_ratio is the ratio of the actual number of dirty pages to the
desired number? And pos_ratio is calculated with fixed-point arithmetic
and RATELIMIT_CALC_SHIFT tells where the point is?

I think I understand freerun - half way between the dirty limit and the
dirty_bg limit. Se below dirty_bg, no writeback happens. Between there
and freerun, writeback happens, but nothing in throttled. From free up
to the limit, tasks are progressively throttled.
"setpoint" is the midpoint of this range. Is the goal that pos_ratio is
computed for.
(except that in the BDI_CAP_STRICTLIMIT part of wb_position_ratio)
wb_setpoint is set to the bottom of this range, same as the freerun ceiling.)

Then we have the control lines, which are cubic(?) for global counts and
linear for per-wb - but truncated at 1/4. The comment says "so that
wb_dirty can be smoothly throttled". It'll take me a while to work out
what a hard edge results in smooth throttling. I suspect it makes sense
but it doesn't jump out at me.

So, you see, I don't feel at all confident changing any of this code
because I just don't get it.

So I'm inclined to stick with the patch that I have. :-(

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2020-04-22 12:55:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

On Tue 21-04-20 12:22:59, NeilBrown wrote:
> On Thu, Apr 16 2020, Jan Kara wrote:
>
> > On Thu 16-04-20 10:30:42, NeilBrown wrote:
> >>
> >> PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
> >> loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
> >> daemon needs to write to one bdi (the final bdi) in order to free up
> >> writes queued to another bdi (the client bdi).
> >>
> >> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> >> pages, so that it can still dirty pages after other processses have been
> >> throttled.
> >>
> >> This approach was designed when all threads were blocked equally,
> >> independently on which device they were writing to, or how fast it was.
> >> Since that time the writeback algorithm has changed substantially with
> >> different threads getting different allowances based on non-trivial
> >> heuristics. This means the simple "add 25%" heuristic is no longer
> >> reliable.
> >>
> >> The important issue is not that the daemon needs a *larger* dirty page
> >> allowance, but that it needs a *private* dirty page allowance, so that
> >> dirty pages for the "client" bdi that it is helping to clear (the bdi for
> >> an NFS filesystem or loop block device etc) do not affect the throttling
> >> of the deamon writing to the "final" bdi.
> >>
> >> This patch changes the heuristic so that the task is only throttled if
> >> *both* the global threshhold *and* the per-wb threshold are exceeded.
> >> This is similar to the effect of BDI_CAP_STRICTLIMIT which causes the
> >> global limits to be ignored, but it isn't as strict. A PF_LOCAL_THROTTLE
> >> task will be allowed to proceed unthrottled if the global threshold is
> >> not exceeded or if the local threshold is not exceeded. They need to
> >> both be exceeded before PF_LOCAL_THROTTLE tasks are throttled.
> >>
> >> This approach of "only throttle when target bdi is busy" is consistent
> >> with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
> >> it causes attention to be focussed only on the target bdi.
> >>
> >> So this patch
> >> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
> >> - removes the 25% bonus that that flag gives, and
> >> - If PF_LOCAL_THROTTLE is set, don't delay at all unless both
> >> thresholds are exceeded.
> >>
> >> Note that previously realtime threads were treated the same as
> >> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
> >> real-time threads, so it is now different from the behaviour of nfsd and
> >> loop tasks. I don't know what is wanted for realtime.
> >>
> >> Acked-by: Chuck Lever <[email protected]>
> >> Signed-off-by: NeilBrown <[email protected]>
> >
> > ...
> >
> >> @@ -1700,6 +1699,17 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> >> sdtc = mdtc;
> >> }
> >>
> >> + if (current->flags & PF_LOCAL_THROTTLE)
> >> + /* This task must only be throttled based on the bdi
> >> + * it is writing to - dirty pages for other bdis might
> >> + * be pages this task is trying to write out. So it
> >> + * gets a free pass unless both global and local
> >> + * thresholds are exceeded. i.e unless
> >> + * "dirty_exceeded".
> >> + */
> >> + if (!dirty_exceeded)
> >> + break;
> >> +
> >> if (dirty_exceeded && !wb->dirty_exceeded)
> >> wb->dirty_exceeded = 1;
> >
> > Ok, but note that this will have one sideeffect you maybe didn't realize:
> > Currently we try to throttle tasks softly - the heuristic rougly works like
> > this: If dirty < (thresh + bg_thresh)/2, leave the task alone.
> > (thresh+bg_thresh)/2 is called "freerun ceiling". If dirty is greater than
> > this, we delay the task somewhat (the aim is to delay the task as long as
> > it would take to write back the pages task has dirtied) in
> > balance_dirty_pages() so ideally 'thresh' is never hit. Only if the
> > heuristic consistently underestimates the time to writeback pages, we hit
> > 'thresh' and then block the task as long as it takes flush worker to clean
> > enough pages to get below 'thresh'. This all leads to task being usually
> > gradually slowed down in balance_dirty_pages() which generally leads to
> > smoother overall system behavior.
> >
> > What you did makes PF_LOCAL_THROTTLE tasks ignore any limits and then when
> > local bdi limit is exceeded, they'll suddently hit the wall and be blocked
> > for a long time in balance_dirty_pages().
> >
> > So I like what you suggest in principle, just I think the implementation
> > has undesirable sideeffects. I think it would be better to modify
> > wb_position_ratio() to take PF_LOCAL_THROTTLE into account. It will be
> > probably similar to how BDI_CAP_STRICTLIMIT is handled but different in
> > some ways because BDI_CAP_STRICTLIMIT takes minimum from wb_pos_ratio and
> > global pos_ratio, you rather want to take wb_pos_ratio only. Also there are
> > some early bail out conditions when we are over global dirty limit which
> > you need to handle differently for PF_LOCAL_THROTTLE. And then, when you
> > have appropriate pos_ratio computed based on your policy, you can let the
> > task wait for appropriate amount of time and things should just work (TM) ;).
> > Thinking about it, you probably also want to add 'freerun' condition for
> > PF_LOCAL_THROTTLE tasks like:
> >
> > if ((current->flags & PF_LOCAL_THROTTLE) &&
> > wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh))
> > go the freerun path...
> >
>
> Thanks.....
> I have 2 thoughts on this.
> One is that I'm not sure how much it really matters.
> The PF_LOCAL_THROTTLE task it always doing writeout on behalf of some
> other process. Some process writes to NFS or to a loop block device or
> somewhere, then the PF_LOCAL_THROTTLE task writes those dirty pages out
> to a different BDI. So the top level task will be throttled, an the
> PF_LOCAL_THROTTLE task won't get more than it can handle.
> There will be starting transients of course, but I doubt it would
> generally be a problem. However it would still be nice to find the
> "right" solution.

I'm not sure PF_LOCAL_THROTTLE "won't get more than it can handle". Once
dirty pages on NFS BDI accumulate, flush worker will start to push them out
as fast as it can. So the only thing that's limitting this is the dirty
throttling on the receiving (NFS server - thus underlying BDI) side. When
underlying BDI throttling triggers depends on that BDI dirty limits and
those are proportional part of global dirty limits scaled by writeback
throughput on underlying BDI compared to other BDIs. So depending on which
BDIs are in the system and how active they are in dirtying pages
'underlying BDI' will get different dirty limits set. It's quite imaginable
that in some configurations it will be easy to push NFS server to hit its
dirty limit even with PF_LOCAL_THROTTLE. And then having NFS server
undersponsive for couple seconds because it is blocked in
balance_dirty_pages() just is not nice...

> My second thought is that I really don't understand the writeback code.
> I think I understand the general principle, and there are lots of big
> comments that try to explain things, but it just doesn't seem to help.
> I look at the code and see more questions than answers.

I fully understand what you mean :). The logic is complex and while
Fengguang wrote a lot of comments it is still rather hard to follow.

> What are the units for "dirty_ratelimit"?? I think it is pages per
> second, because it is initialized to INIT_BW which is documented as 100
> MB/s.

Yes, that's what I think as well.

> What is the difference between dirty_ratelimit and
> balanced_dirty_ratelimit?
> The later is "balanced" I guess. What does that mean?
> Apparently (from backing-dev-defs.h) dirty_ratelimit moves in smaller
> steps and is more smooth than balanced_dirty_ratelimit. How is being
> less smooth, more balanced??

Yeah. So I cannot really explain the naming to you (not sure why Fengguang
chose these names). But 'balanced_dirty_ratelimit' is pages/second value we
want to limit task to based on the events in the most current time slice.
'dirty_ratelimit' is smoothed version of 'balanced_dirty_ratelimit' taking
more of history into account.

> What is pos_ratio? And what is RATELIMIT_CALC_SHIFT ???
> Maybe pos_ratio is the ratio of the actual number of dirty pages to the
> desired number? And pos_ratio is calculated with fixed-point arithmetic
> and RATELIMIT_CALC_SHIFT tells where the point is?

So RATELIMIT_CALC_SHIFT is indeed the shift of fixed point arithmetic used
in the computations. Pos_ratio is the multiplicative "correction" factor we
apply to computed dirty_ratelimit (i.e., task_ratelimit = dirty_ratelimit *
pos_ratio) - so if we see we are able to writeout say 100 MB/s but we are
still relatively far from dirty limits, we let tasks dirty 200 MB/s
(pos_ratio is 2). As we are nearing dirty limits, pos_ratio is dropping
(it's appropriately scaled and shifted third order polynomial) so very
close to dirty limits, we let tasks dirty only say 10 MB/s even though we
are still able to write out 100 MB/s.

> I think I understand freerun - half way between the dirty limit and the
> dirty_bg limit. Se below dirty_bg, no writeback happens. Between there
> and freerun, writeback happens, but nothing in throttled. From free up
> to the limit, tasks are progressively throttled.

Correct.

> "setpoint" is the midpoint of this range. Is the goal that pos_ratio is
> computed for.
> (except that in the BDI_CAP_STRICTLIMIT part of wb_position_ratio)
> wb_setpoint is set to the bottom of this range, same as the freerun ceiling.)

Correct.

> Then we have the control lines, which are cubic(?) for global counts and
> linear for per-wb - but truncated at 1/4. The comment says "so that
> wb_dirty can be smoothly throttled". It'll take me a while to work out
> what a hard edge results in smooth throttling. I suspect it makes sense
> but it doesn't jump out at me.
>
> So, you see, I don't feel at all confident changing any of this code
> because I just don't get it.
>
> So I'm inclined to stick with the patch that I have. :-(

OK, I'll try to write something and we'll see if it will work :)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-05-13 07:17:33

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 1/2 V3] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE


I thought about this some more and come up with another "simple"
approach that didn't require me understanding too much code, but does -
I think - address your concerns.

I've changed the heuristic to avoid any throttling on PF_LOCAL_THROTTLE
task if:
- the global dirty count is below the global free-run threshold. The
code did this already.
- (or) the per-wb dirty count is below the per-wb free-run threshold.
This is the change.

This means that:
- in a steady stated, all bdis will be throttled based on their (steady
state) throughput, which is equally appropriate for PF_LOCAL_THROTTLE
tasks.
- a PF_LOCAL_THROTTLE task will never be *completely* blocked by dirty
pages queued for other devices. This means no deadlock, and that is
the primary purpose of PF_LOCAL_THROTTLE.
- when writes through the PF_LOCAL_THROTTLE task start up from idle -
when there is no current throughput estimate - the PF_LOCAL_THROTTLE
can be expected to get a fair share of the available memory, just as
much as any other writer. This was the possible problem with
treating PF_LOCAL_THROTTLE just like BDI_CAP_STRICTLIMIT.

So I think this is a good solution. Thoughts?
Patches follow - I've address the comment formatting issue.

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2020-05-13 07:18:06

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE


PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled. The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the deamon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi. This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless the
global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks. I don't know what is wanted for realtime.

Acked-by: Chuck Lever <[email protected]> (for nfsd change)
Signed-off-by: NeilBrown <[email protected]>
---
drivers/block/loop.c | 2 +-
fs/nfsd/vfs.c | 9 +++++----
include/linux/sched.h | 3 ++-
kernel/sys.c | 2 +-
mm/page-writeback.c | 41 +++++++++++++++++++++++++++++++++--------
mm/vmscan.c | 4 ++--
6 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..d89c25ba3b89 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo)

static int loop_kthread_worker_fn(void *worker_ptr)
{
- current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+ current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
return kthread_worker_fn(worker_ptr);
}

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,

if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
/*
- * We want less throttling in balance_dirty_pages()
- * and shrink_inactive_list() so that nfs to
+ * We want throttling in balance_dirty_pages()
+ * and shrink_inactive_list() to only consider
+ * the backingdev we are writing to, so that nfs to
* localhost doesn't cause nfsd to lock up due to all
* the client's dirty pages or its congested queue.
*/
- current->flags |= PF_LESS_THROTTLE;
+ current->flags |= PF_LOCAL_THROTTLE;

exp = fhp->fh_export;
use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
nfserr = nfserrno(host_err);
}
if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
- current_restore_flags(pflags, PF_LESS_THROTTLE);
+ current_restore_flags(pflags, PF_LOCAL_THROTTLE);
return nfserr;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@ extern struct pid *cad_pid;
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
-#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
+#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to,
+ * I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..180a2fa33f7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}

-#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7326b54ab728..f02a71797781 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void)
* Calculate @dtc->thresh and ->bg_thresh considering
* vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
* must ensure that @dtc->avail is set before calling this function. The
- * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
+ * dirty limits will be lifted by 1/4 for real-time tasks.
*/
static void domain_dirty_limits(struct dirty_throttle_control *dtc)
{
@@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
if (bg_thresh >= thresh)
bg_thresh = thresh / 2;
tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ if (rt_task(tsk)) {
bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
@@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
else
dirty = vm_dirty_ratio * node_memory / 100;

- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ if (rt_task(tsk))
dirty += dirty / 4;

return dirty;
@@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
(!mdtc ||
m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
- unsigned long intv = dirty_poll_interval(dirty, thresh);
- unsigned long m_intv = ULONG_MAX;
+ unsigned long intv;
+ unsigned long m_intv;
+
+ free_running:
+ intv = dirty_poll_interval(dirty, thresh);
+ m_intv = ULONG_MAX;

current->dirty_paused_when = now;
current->nr_dirtied = 0;
@@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
* Calculate global domain's pos_ratio and select the
* global dtc by default.
*/
- if (!strictlimit)
+ if (!strictlimit) {
wb_dirty_limits(gdtc);

+ if ((current->flags & PF_LOCAL_THROTTLE) &&
+ gdtc->wb_dirty <
+ dirty_freerun_ceiling(gdtc->wb_thresh,
+ gdtc->wb_bg_thresh))
+ /*
+ * LOCAL_THROTTLE tasks must not be throttled
+ * when below the per-wb freerun ceiling.
+ */
+ goto free_running;
+ }
+
dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
((gdtc->dirty > gdtc->thresh) || strictlimit);

@@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
* both global and memcg domains. Choose the one
* w/ lower pos_ratio.
*/
- if (!strictlimit)
+ if (!strictlimit) {
wb_dirty_limits(mdtc);

+ if ((current->flags & PF_LOCAL_THROTTLE) &&
+ mdtc->wb_dirty <
+ dirty_freerun_ceiling(mdtc->wb_thresh,
+ mdtc->wb_bg_thresh))
+ /*
+ * LOCAL_THROTTLE tasks must not be
+ * throttled when below the per-wb
+ * freerun ceiling.
+ */
+ goto free_running;
+ }
dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
((mdtc->dirty > mdtc->thresh) || strictlimit);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a37c87b5aee2..b2f5deb3603c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,

/*
* If a kernel thread (such as nfsd for loop-back mounts) services
- * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
* In that case we should only throttle if the backing device it is
* writing to is congested. In other cases it is safe to throttle.
*/
static int current_may_throttle(void)
{
- return !(current->flags & PF_LESS_THROTTLE) ||
+ return !(current->flags & PF_LOCAL_THROTTLE) ||
current->backing_dev_info == NULL ||
bdi_write_congested(current->backing_dev_info);
}
--
2.26.2


Attachments:
signature.asc (847.00 B)

2020-05-13 07:21:55

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/2 V4] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead.


After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.
With this patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. static trace points and warning printks which mentioned this
counter no longer report it.

Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Trond Myklebust <[email protected]>
Acked-by: Michal Hocko <[email protected]> # for MM parts
Signed-off-by: NeilBrown <[email protected]>
---
Documentation/filesystems/proc.rst | 4 ++--
drivers/base/node.c | 2 +-
fs/fs-writeback.c | 1 -
fs/nfs/internal.h | 10 +++++++---
fs/nfs/write.c | 4 ++--
fs/proc/meminfo.c | 3 +--
include/linux/mmzone.h | 1 -
include/trace/events/writeback.h | 5 +----
mm/memcontrol.c | 1 -
mm/page-writeback.c | 17 ++++-------------
mm/page_alloc.c | 5 +----
mm/vmstat.c | 11 +++++++++--
12 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 38b606991065..092b7b44d158 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1042,8 +1042,8 @@ PageTables
amount of memory dedicated to the lowest level of page
tables.
NFS_Unstable
- NFS pages sent to the server, but not yet committed to stable
- storage
+ Always zero. Previous counted pages which had been written to
+ the server, but has not been committed to stable storage.
Bounce
Memory used for block device "bounce buffers"
WritebackTmp
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..15f5ed6a8830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(i.sharedram),
nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
- nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+ nid, 0,
nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
nid, K(sreclaimable +
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..c5bdf46e3b4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
static unsigned long get_nr_dirty_pages(void)
{
return global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS) +
get_nr_dirty_inodes();
}

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 1f32a9fbfdaf..6673a77884d9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize)
}

/*
- * Record the page as unstable and mark its inode as dirty.
+ * Record the page as unstable (an extra writeback period) and mark its
+ * inode as dirty.
*/
static inline
void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
@@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
if (!cinfo->dreq) {
struct inode *inode = page_file_mapping(page)->host;

- inc_node_page_state(page, NR_UNSTABLE_NFS);
- inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
+ /* This page is really still in write-back - just that the
+ * writeback is happening on the server now.
+ */
+ inc_node_page_state(page, NR_WRITEBACK);
+ inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index df4b87c30ac9..d9ea824accb7 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
static void
nfs_clear_page_commit(struct page *page)
{
- dec_node_page_state(page, NR_UNSTABLE_NFS);
+ dec_node_page_state(page, NR_WRITEBACK);
dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
- WB_RECLAIMABLE);
+ WB_WRITEBACK);
}

/* Called holding the request lock on @req */
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..9bd94b5a9658 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "PageTables: ",
global_zone_page_state(NR_PAGETABLE));

- show_val_kb(m, "NFS_Unstable: ",
- global_node_page_state(NR_UNSTABLE_NFS));
+ show_val_kb(m, "NFS_Unstable: ", 0);
show_val_kb(m, "Bounce: ",
global_zone_page_state(NR_BOUNCE));
show_val_kb(m, "WritebackTmp: ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1b9de7d220fb..a89f47515eb1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -193,7 +193,6 @@ enum node_stat_item {
NR_FILE_THPS,
NR_FILE_PMDMAPPED,
NR_ANON_THPS,
- NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 85a33bea76f1..10f5d1fa7347 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -541,7 +541,6 @@ TRACE_EVENT(global_dirty_state,
TP_STRUCT__entry(
__field(unsigned long, nr_dirty)
__field(unsigned long, nr_writeback)
- __field(unsigned long, nr_unstable)
__field(unsigned long, background_thresh)
__field(unsigned long, dirty_thresh)
__field(unsigned long, dirty_limit)
@@ -552,7 +551,6 @@ TRACE_EVENT(global_dirty_state,
TP_fast_assign(
__entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY);
__entry->nr_writeback = global_node_page_state(NR_WRITEBACK);
- __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS);
__entry->nr_dirtied = global_node_page_state(NR_DIRTIED);
__entry->nr_written = global_node_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
@@ -560,12 +558,11 @@ TRACE_EVENT(global_dirty_state,
__entry->dirty_limit = global_wb_domain.dirty_limit;
),

- TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+ TP_printk("dirty=%lu writeback=%lu "
"bg_thresh=%lu thresh=%lu limit=%lu "
"dirtied=%lu written=%lu",
__entry->nr_dirty,
__entry->nr_writeback,
- __entry->nr_unstable,
__entry->background_thresh,
__entry->dirty_thresh,
__entry->dirty_limit,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3b97f103966..1db4b285c407 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,

*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);

- /* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f02a71797781..d11b097c8002 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat)
unsigned long nr_pages = 0;

nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
- nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
nr_pages += node_page_state(pgdat, NR_WRITEBACK);

return nr_pages <= limit;
@@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*
* Return: @wb's dirty limit in pages. The term "dirty" in the context of
- * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * dirty balancing includes all PG_dirty and PG_writeback pages.
*/
static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
@@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
&mdtc_stor : NULL;
struct dirty_throttle_control *sdtc;
- unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
+ unsigned long nr_reclaimable; /* = file_dirty */
long period;
long pause;
long max_pause;
@@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
unsigned long m_thresh = 0;
unsigned long m_bg_thresh = 0;

- /*
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- */
- nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
+ nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);

@@ -1963,8 +1955,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
* as we're trying to decide whether to put more under writeback.
*/
gdtc->avail = global_dirtyable_memory();
- gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
+ gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
domain_dirty_limits(gdtc);

if (gdtc->dirty > gdtc->bg_thresh)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..cc406ee17ad9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5319,7 +5319,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
- " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+ " unevictable:%lu dirty:%lu writeback:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
" free:%lu free_pcp:%lu free_cma:%lu\n",
@@ -5332,7 +5332,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
global_node_page_state(NR_UNEVICTABLE),
global_node_page_state(NR_FILE_DIRTY),
global_node_page_state(NR_WRITEBACK),
- global_node_page_state(NR_UNSTABLE_NFS),
global_node_page_state(NR_SLAB_RECLAIMABLE),
global_node_page_state(NR_SLAB_UNRECLAIMABLE),
global_node_page_state(NR_FILE_MAPPED),
@@ -5365,7 +5364,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
" anon_thp: %lukB"
#endif
" writeback_tmp:%lukB"
- " unstable:%lukB"
" all_unreclaimable? %s"
"\n",
pgdat->node_id,
@@ -5387,7 +5385,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
#endif
K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
- K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
"yes" : "no");
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 96d21a792b57..6c719f184843 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
TEXT_FOR_HIGHMEM(xx) xx "_movable",

const char * const vmstat_text[] = {
- /* enum zone_stat_item countes */
+ /* enum zone_stat_item counters */
"nr_free_pages",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
@@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
"nr_file_hugepages",
"nr_file_pmdmapped",
"nr_anon_transparent_hugepages",
- "nr_unstable",
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
@@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg)
seq_puts(m, vmstat_text[off]);
seq_put_decimal_ull(m, " ", *l);
seq_putc(m, '\n');
+
+ if (off == NR_VMSTAT_ITEMS - 1) {
+ /* We've come to the end - add any deprecated counters
+ * to avoid breaking userspace which might depend on
+ * them being present.
+ */
+ seq_puts(m, "nr_unstable 0\n");
+ }
return 0;
}

--
2.26.2


Attachments:
signature.asc (847.00 B)

2020-05-15 11:11:42

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/2 V4] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE

On Wed 13-05-20 17:17:20, NeilBrown wrote:
>
> PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
> loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
> daemon needs to write to one bdi (the final bdi) in order to free up
> writes queued to another bdi (the client bdi).
>
> The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
> pages, so that it can still dirty pages after other processses have been
> throttled. The purpose of this is to avoid deadlock that happen when
> the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
> but it is being thottled and cannot write.
>
> This approach was designed when all threads were blocked equally,
> independently on which device they were writing to, or how fast it was.
> Since that time the writeback algorithm has changed substantially with
> different threads getting different allowances based on non-trivial
> heuristics. This means the simple "add 25%" heuristic is no longer
> reliable.
>
> The important issue is not that the daemon needs a *larger* dirty page
> allowance, but that it needs a *private* dirty page allowance, so that
> dirty pages for the "client" bdi that it is helping to clear (the bdi for
> an NFS filesystem or loop block device etc) do not affect the throttling
> of the deamon writing to the "final" bdi.
>
> This patch changes the heuristic so that the task is not throttled when
> the bdi it is writing to has a dirty page count below below (or equal
> to) the free-run threshold for that bdi. This ensures it will always be
> able to have some pages in flight, and so will not deadlock.
>
> In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
> still be throttled by global threshold, but that is acceptable as it is
> only the deadlock state that is interesting for this flag.
>
> This approach of "only throttle when target bdi is busy" is consistent
> with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
> it causes attention to be focussed only on the target bdi.
>
> So this patch
> - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
> - removes the 25% bonus that that flag gives, and
> - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
> global and the local free-run thresholds are exceeded.
>
> Note that previously realtime threads were treated the same as
> PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
> real-time threads, so it is now different from the behaviour of nfsd and
> loop tasks. I don't know what is wanted for realtime.
>
> Acked-by: Chuck Lever <[email protected]> (for nfsd change)
> Signed-off-by: NeilBrown <[email protected]>

The idea looks good to me. It will still have the "threads may hit hard
wall" behavior when BDI freerun threshold is crossed at the moment we are
very close to the global limit but that should be rare. So I think for now
this good enough.

The patch looks good to me so feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Throttling patches usually go through mm tree so ask Andrew to pick them
up.


Honza

> @@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
> (!mdtc ||
> m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
> - unsigned long intv = dirty_poll_interval(dirty, thresh);
> - unsigned long m_intv = ULONG_MAX;
> + unsigned long intv;
> + unsigned long m_intv;
> +
> + free_running:
> + intv = dirty_poll_interval(dirty, thresh);
> + m_intv = ULONG_MAX;
>
> current->dirty_paused_when = now;
> current->nr_dirtied = 0;
> @@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> * Calculate global domain's pos_ratio and select the
> * global dtc by default.
> */
> - if (!strictlimit)
> + if (!strictlimit) {
> wb_dirty_limits(gdtc);
>
> + if ((current->flags & PF_LOCAL_THROTTLE) &&
> + gdtc->wb_dirty <
> + dirty_freerun_ceiling(gdtc->wb_thresh,
> + gdtc->wb_bg_thresh))
> + /*
> + * LOCAL_THROTTLE tasks must not be throttled
> + * when below the per-wb freerun ceiling.
> + */
> + goto free_running;
> + }
> +
> dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
> ((gdtc->dirty > gdtc->thresh) || strictlimit);
>
> @@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
> * both global and memcg domains. Choose the one
> * w/ lower pos_ratio.
> */
> - if (!strictlimit)
> + if (!strictlimit) {
> wb_dirty_limits(mdtc);
>
> + if ((current->flags & PF_LOCAL_THROTTLE) &&
> + mdtc->wb_dirty <
> + dirty_freerun_ceiling(mdtc->wb_thresh,
> + mdtc->wb_bg_thresh))
> + /*
> + * LOCAL_THROTTLE tasks must not be
> + * throttled when below the per-wb
> + * freerun ceiling.
> + */
> + goto free_running;
> + }
> dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
> ((mdtc->dirty > mdtc->thresh) || strictlimit);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a37c87b5aee2..b2f5deb3603c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
>
> /*
> * If a kernel thread (such as nfsd for loop-back mounts) services
> - * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
> + * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
> * In that case we should only throttle if the backing device it is
> * writing to is congested. In other cases it is safe to throttle.
> */
> static int current_may_throttle(void)
> {
> - return !(current->flags & PF_LESS_THROTTLE) ||
> + return !(current->flags & PF_LOCAL_THROTTLE) ||
> current->backing_dev_info == NULL ||
> bdi_write_congested(current->backing_dev_info);
> }
> --
> 2.26.2
>


--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-06-01 00:47:36

by NeilBrown

[permalink] [raw]
Subject: Writeback fixes for NFS


Hi Andrew,
could you please queue these two patches (following).
I think they have sufficient review and no remaining complaints.

Thanks,
NeilBrown


Attachments:
signature.asc (847.00 B)

2020-06-01 00:49:13

by NeilBrown

[permalink] [raw]
Subject: [PATCH 1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE


PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
daemon needs to write to one bdi (the final bdi) in order to free up
writes queued to another bdi (the client bdi).

The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
pages, so that it can still dirty pages after other processses have been
throttled. The purpose of this is to avoid deadlock that happen when
the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
but it is being thottled and cannot write.

This approach was designed when all threads were blocked equally,
independently on which device they were writing to, or how fast it was.
Since that time the writeback algorithm has changed substantially with
different threads getting different allowances based on non-trivial
heuristics. This means the simple "add 25%" heuristic is no longer
reliable.

The important issue is not that the daemon needs a *larger* dirty page
allowance, but that it needs a *private* dirty page allowance, so that
dirty pages for the "client" bdi that it is helping to clear (the bdi for
an NFS filesystem or loop block device etc) do not affect the throttling
of the deamon writing to the "final" bdi.

This patch changes the heuristic so that the task is not throttled when
the bdi it is writing to has a dirty page count below below (or equal
to) the free-run threshold for that bdi. This ensures it will always be
able to have some pages in flight, and so will not deadlock.

In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
still be throttled by global threshold, but that is acceptable as it is
only the deadlock state that is interesting for this flag.

This approach of "only throttle when target bdi is busy" is consistent
with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
it causes attention to be focussed only on the target bdi.

So this patch
- renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
- removes the 25% bonus that that flag gives, and
- If PF_LOCAL_THROTTLE is set, don't delay at all unless the
global and the local free-run thresholds are exceeded.

Note that previously realtime threads were treated the same as
PF_LESS_THROTTLE threads. This patch does *not* change the behvaiour for
real-time threads, so it is now different from the behaviour of nfsd and
loop tasks. I don't know what is wanted for realtime.

Reviewed-by: Jan Kara <[email protected]>
Acked-by: Chuck Lever <[email protected]> (for nfsd change)
Signed-off-by: NeilBrown <[email protected]>
---
drivers/block/loop.c | 2 +-
fs/nfsd/vfs.c | 9 +++++----
include/linux/sched.h | 3 ++-
kernel/sys.c | 2 +-
mm/page-writeback.c | 41 +++++++++++++++++++++++++++++++++--------
mm/vmscan.c | 4 ++--
6 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index da693e6a834e..d89c25ba3b89 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -919,7 +919,7 @@ static void loop_unprepare_queue(struct loop_device *lo)

static int loop_kthread_worker_fn(void *worker_ptr)
{
- current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+ current->flags |= PF_LOCAL_THROTTLE | PF_MEMALLOC_NOIO;
return kthread_worker_fn(worker_ptr);
}

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 0aa02eb18bd3..c3fbab1753ec 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -979,12 +979,13 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,

if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
/*
- * We want less throttling in balance_dirty_pages()
- * and shrink_inactive_list() so that nfs to
+ * We want throttling in balance_dirty_pages()
+ * and shrink_inactive_list() to only consider
+ * the backingdev we are writing to, so that nfs to
* localhost doesn't cause nfsd to lock up due to all
* the client's dirty pages or its congested queue.
*/
- current->flags |= PF_LESS_THROTTLE;
+ current->flags |= PF_LOCAL_THROTTLE;

exp = fhp->fh_export;
use_wgather = (rqstp->rq_vers == 2) && EX_WGATHER(exp);
@@ -1037,7 +1038,7 @@ nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfsd_file *nf,
nfserr = nfserrno(host_err);
}
if (test_bit(RQ_LOCAL, &rqstp->rq_flags))
- current_restore_flags(pflags, PF_LESS_THROTTLE);
+ current_restore_flags(pflags, PF_LOCAL_THROTTLE);
return nfserr;
}

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5955a089df32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1481,7 +1481,8 @@ extern struct pid *cad_pid;
#define PF_KSWAPD 0x00020000 /* I am kswapd */
#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
-#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
+#define PF_LOCAL_THROTTLE 0x00100000 /* Throttle writes only agasint the bdi I write to,
+ * I am cleaning dirty pages from some other bdi. */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
diff --git a/kernel/sys.c b/kernel/sys.c
index d325f3ab624a..180a2fa33f7f 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2262,7 +2262,7 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
return -EINVAL;
}

-#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LESS_THROTTLE)
+#define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
unsigned long, arg4, unsigned long, arg5)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7326b54ab728..f02a71797781 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -387,8 +387,7 @@ static unsigned long global_dirtyable_memory(void)
* Calculate @dtc->thresh and ->bg_thresh considering
* vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
* must ensure that @dtc->avail is set before calling this function. The
- * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
+ * dirty limits will be lifted by 1/4 for real-time tasks.
*/
static void domain_dirty_limits(struct dirty_throttle_control *dtc)
{
@@ -436,7 +435,7 @@ static void domain_dirty_limits(struct dirty_throttle_control *dtc)
if (bg_thresh >= thresh)
bg_thresh = thresh / 2;
tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ if (rt_task(tsk)) {
bg_thresh += bg_thresh / 4 + global_wb_domain.dirty_limit / 32;
thresh += thresh / 4 + global_wb_domain.dirty_limit / 32;
}
@@ -486,7 +485,7 @@ static unsigned long node_dirty_limit(struct pglist_data *pgdat)
else
dirty = vm_dirty_ratio * node_memory / 100;

- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+ if (rt_task(tsk))
dirty += dirty / 4;

return dirty;
@@ -1653,8 +1652,12 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
(!mdtc ||
m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
- unsigned long intv = dirty_poll_interval(dirty, thresh);
- unsigned long m_intv = ULONG_MAX;
+ unsigned long intv;
+ unsigned long m_intv;
+
+ free_running:
+ intv = dirty_poll_interval(dirty, thresh);
+ m_intv = ULONG_MAX;

current->dirty_paused_when = now;
current->nr_dirtied = 0;
@@ -1673,9 +1676,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
* Calculate global domain's pos_ratio and select the
* global dtc by default.
*/
- if (!strictlimit)
+ if (!strictlimit) {
wb_dirty_limits(gdtc);

+ if ((current->flags & PF_LOCAL_THROTTLE) &&
+ gdtc->wb_dirty <
+ dirty_freerun_ceiling(gdtc->wb_thresh,
+ gdtc->wb_bg_thresh))
+ /*
+ * LOCAL_THROTTLE tasks must not be throttled
+ * when below the per-wb freerun ceiling.
+ */
+ goto free_running;
+ }
+
dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
((gdtc->dirty > gdtc->thresh) || strictlimit);

@@ -1689,9 +1703,20 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
* both global and memcg domains. Choose the one
* w/ lower pos_ratio.
*/
- if (!strictlimit)
+ if (!strictlimit) {
wb_dirty_limits(mdtc);

+ if ((current->flags & PF_LOCAL_THROTTLE) &&
+ mdtc->wb_dirty <
+ dirty_freerun_ceiling(mdtc->wb_thresh,
+ mdtc->wb_bg_thresh))
+ /*
+ * LOCAL_THROTTLE tasks must not be
+ * throttled when below the per-wb
+ * freerun ceiling.
+ */
+ goto free_running;
+ }
dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
((mdtc->dirty > mdtc->thresh) || strictlimit);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a37c87b5aee2..b2f5deb3603c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1878,13 +1878,13 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,

/*
* If a kernel thread (such as nfsd for loop-back mounts) services
- * a backing device by writing to the page cache it sets PF_LESS_THROTTLE.
+ * a backing device by writing to the page cache it sets PF_LOCAL_THROTTLE.
* In that case we should only throttle if the backing device it is
* writing to is congested. In other cases it is safe to throttle.
*/
static int current_may_throttle(void)
{
- return !(current->flags & PF_LESS_THROTTLE) ||
+ return !(current->flags & PF_LOCAL_THROTTLE) ||
current->backing_dev_info == NULL ||
bdi_write_congested(current->backing_dev_info);
}
--
2.26.2


Attachments:
signature.asc (847.00 B)

2020-06-01 00:49:57

by NeilBrown

[permalink] [raw]
Subject: [PATCH 2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead.


After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds. If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count. This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g. releasepage() used to
send a COMMIT). However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.
With this patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero. static trace points and warning printks which mentioned this
counter no longer report it.

Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Acked-by: Trond Myklebust <[email protected]>
Acked-by: Michal Hocko <[email protected]> # for MM parts
Signed-off-by: NeilBrown <[email protected]>
---
Documentation/filesystems/proc.rst | 4 ++--
drivers/base/node.c | 2 +-
fs/fs-writeback.c | 1 -
fs/nfs/internal.h | 10 +++++++---
fs/nfs/write.c | 4 ++--
fs/proc/meminfo.c | 3 +--
include/linux/mmzone.h | 1 -
include/trace/events/writeback.h | 5 +----
mm/memcontrol.c | 1 -
mm/page-writeback.c | 17 ++++-------------
mm/page_alloc.c | 5 +----
mm/vmstat.c | 11 +++++++++--
12 files changed, 28 insertions(+), 36 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 38b606991065..092b7b44d158 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1042,8 +1042,8 @@ PageTables
amount of memory dedicated to the lowest level of page
tables.
NFS_Unstable
- NFS pages sent to the server, but not yet committed to stable
- storage
+ Always zero. Previous counted pages which had been written to
+ the server, but has not been committed to stable storage.
Bounce
Memory used for block device "bounce buffers"
WritebackTmp
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..15f5ed6a8830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(i.sharedram),
nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
- nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+ nid, 0,
nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
nid, K(sreclaimable +
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..c5bdf46e3b4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
static unsigned long get_nr_dirty_pages(void)
{
return global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS) +
get_nr_dirty_inodes();
}

diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 1f32a9fbfdaf..6673a77884d9 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -668,7 +668,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize)
}

/*
- * Record the page as unstable and mark its inode as dirty.
+ * Record the page as unstable (an extra writeback period) and mark its
+ * inode as dirty.
*/
static inline
void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
@@ -676,8 +677,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
if (!cinfo->dreq) {
struct inode *inode = page_file_mapping(page)->host;

- inc_node_page_state(page, NR_UNSTABLE_NFS);
- inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
+ /* This page is really still in write-back - just that the
+ * writeback is happening on the server now.
+ */
+ inc_node_page_state(page, NR_WRITEBACK);
+ inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1e767f779c49..639c34fec04a 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -946,9 +946,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
static void
nfs_clear_page_commit(struct page *page)
{
- dec_node_page_state(page, NR_UNSTABLE_NFS);
+ dec_node_page_state(page, NR_WRITEBACK);
dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
- WB_RECLAIMABLE);
+ WB_WRITEBACK);
}

/* Called holding the request lock on @req */
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..9bd94b5a9658 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
show_val_kb(m, "PageTables: ",
global_zone_page_state(NR_PAGETABLE));

- show_val_kb(m, "NFS_Unstable: ",
- global_node_page_state(NR_UNSTABLE_NFS));
+ show_val_kb(m, "NFS_Unstable: ", 0);
show_val_kb(m, "Bounce: ",
global_zone_page_state(NR_BOUNCE));
show_val_kb(m, "WritebackTmp: ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1b9de7d220fb..a89f47515eb1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -193,7 +193,6 @@ enum node_stat_item {
NR_FILE_THPS,
NR_FILE_PMDMAPPED,
NR_ANON_THPS,
- NR_UNSTABLE_NFS, /* NFS unstable pages */
NR_VMSCAN_WRITE,
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 85a33bea76f1..10f5d1fa7347 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -541,7 +541,6 @@ TRACE_EVENT(global_dirty_state,
TP_STRUCT__entry(
__field(unsigned long, nr_dirty)
__field(unsigned long, nr_writeback)
- __field(unsigned long, nr_unstable)
__field(unsigned long, background_thresh)
__field(unsigned long, dirty_thresh)
__field(unsigned long, dirty_limit)
@@ -552,7 +551,6 @@ TRACE_EVENT(global_dirty_state,
TP_fast_assign(
__entry->nr_dirty = global_node_page_state(NR_FILE_DIRTY);
__entry->nr_writeback = global_node_page_state(NR_WRITEBACK);
- __entry->nr_unstable = global_node_page_state(NR_UNSTABLE_NFS);
__entry->nr_dirtied = global_node_page_state(NR_DIRTIED);
__entry->nr_written = global_node_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
@@ -560,12 +558,11 @@ TRACE_EVENT(global_dirty_state,
__entry->dirty_limit = global_wb_domain.dirty_limit;
),

- TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+ TP_printk("dirty=%lu writeback=%lu "
"bg_thresh=%lu thresh=%lu limit=%lu "
"dirtied=%lu written=%lu",
__entry->nr_dirty,
__entry->nr_writeback,
- __entry->nr_unstable,
__entry->background_thresh,
__entry->dirty_thresh,
__entry->dirty_limit,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a3b97f103966..1db4b285c407 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4330,7 +4330,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,

*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);

- /* this should eventually include NR_UNSTABLE_NFS */
*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f02a71797781..d11b097c8002 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat)
unsigned long nr_pages = 0;

nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
- nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
nr_pages += node_page_state(pgdat, NR_WRITEBACK);

return nr_pages <= limit;
@@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*
* Return: @wb's dirty limit in pages. The term "dirty" in the context of
- * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * dirty balancing includes all PG_dirty and PG_writeback pages.
*/
static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
@@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
&mdtc_stor : NULL;
struct dirty_throttle_control *sdtc;
- unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
+ unsigned long nr_reclaimable; /* = file_dirty */
long period;
long pause;
long max_pause;
@@ -1586,14 +1585,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
unsigned long m_thresh = 0;
unsigned long m_bg_thresh = 0;

- /*
- * Unstable writes are a feature of certain networked
- * filesystems (i.e. NFS) in which data may have been
- * written to the server's write cache, but has not yet
- * been flushed to permanent storage.
- */
- nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
+ nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
gdtc->avail = global_dirtyable_memory();
gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);

@@ -1963,8 +1955,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
* as we're trying to decide whether to put more under writeback.
*/
gdtc->avail = global_dirtyable_memory();
- gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
- global_node_page_state(NR_UNSTABLE_NFS);
+ gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
domain_dirty_limits(gdtc);

if (gdtc->dirty > gdtc->bg_thresh)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 13cc653122b7..cc406ee17ad9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5319,7 +5319,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
- " unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+ " unevictable:%lu dirty:%lu writeback:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
" free:%lu free_pcp:%lu free_cma:%lu\n",
@@ -5332,7 +5332,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
global_node_page_state(NR_UNEVICTABLE),
global_node_page_state(NR_FILE_DIRTY),
global_node_page_state(NR_WRITEBACK),
- global_node_page_state(NR_UNSTABLE_NFS),
global_node_page_state(NR_SLAB_RECLAIMABLE),
global_node_page_state(NR_SLAB_UNRECLAIMABLE),
global_node_page_state(NR_FILE_MAPPED),
@@ -5365,7 +5364,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
" anon_thp: %lukB"
#endif
" writeback_tmp:%lukB"
- " unstable:%lukB"
" all_unreclaimable? %s"
"\n",
pgdat->node_id,
@@ -5387,7 +5385,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
#endif
K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
- K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
"yes" : "no");
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 96d21a792b57..6c719f184843 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
TEXT_FOR_HIGHMEM(xx) xx "_movable",

const char * const vmstat_text[] = {
- /* enum zone_stat_item countes */
+ /* enum zone_stat_item counters */
"nr_free_pages",
"nr_zone_inactive_anon",
"nr_zone_active_anon",
@@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
"nr_file_hugepages",
"nr_file_pmdmapped",
"nr_anon_transparent_hugepages",
- "nr_unstable",
"nr_vmscan_write",
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
@@ -1723,6 +1722,14 @@ static int vmstat_show(struct seq_file *m, void *arg)
seq_puts(m, vmstat_text[off]);
seq_put_decimal_ull(m, " ", *l);
seq_putc(m, '\n');
+
+ if (off == NR_VMSTAT_ITEMS - 1) {
+ /* We've come to the end - add any deprecated counters
+ * to avoid breaking userspace which might depend on
+ * them being present.
+ */
+ seq_puts(m, "nr_unstable 0\n");
+ }
return 0;
}

--
2.26.2


Attachments:
signature.asc (847.00 B)