2019-10-31 23:51:59

by Dave Chinner

Subject: [PATCH 00/28] mm, xfs: non-blocking inode reclaim

Hi folks,

This is an updated version of the non-blocking inode reclaim
patchset for XFS. Original description of the problem and patchset
is below the changelog.

Thoughts, comments and improvements welcome.

-Dave.

Changelog:

V3:

- rebased on 5.4-rc5 + linux-xfs/for-next
- fixed various typos and whitespace damage
- new patch to factor out inode lookup from xfs_ifree_cluster() to
simplify the loop walking all the in-core inodes in the chunk
being freed.
- fixed up issues with temporary xfs_ail_push_sync() scaffolding
- fixed up per-ag locking description in commit message of patch
"xfs: use AIL pushing for inode reclaim IO"
- fixed missing per-ag reclaim cursor unlock
- made sync LSN calculation consistent between inode grab and
reclaim for pinned inodes
- check against NULLCOMMITLSN rather than implicitly relying on
XFS_LSN_CMP() to do the right thing.
- renamed xfs_reclaim_inodes -> xfs_reclaim_all_inodes().
- moved new LRU list init to xfs_fs_fill_super()
- fixed incorrect locking around xfs_ifunlock() in
xfs_inode_reclaim_isolate().
- added lockdep_assert_held() to __xfs_iflock_nowait()
- use list_first_entry_or_null() in xfs_dispose_inodes()
- moved removal of ag reclaim walk into patch that changes reclaim
to use the lru list walk.
- added patch to introduce down_write_non_owner() and friends
- added mrupdate_*_non_owner() for inode reclaim to use.
- fixed incorrect flag manipulation in xfs_iget_cache_hit()
- removed spurious extra xfs_ail_push_all() calls


V2:

https://lore.kernel.org/linux-xfs/[email protected]/

- added current_reclaim_account_pages() wrapper for reclaim_state
updates
- moved xfs_buf page free accounting to the page freeing code rather
than the reclaim loop that drops the LRU buffer reference.
- increased log CIL flush limit by 2x
- moved xfs_buf GFP_NOFS reclaim changes to correct patch
- renamed sc->will_defer to sc->defer_work
- used min() in do_shrink_slab() appropriately
- converted xfs_trans_ail_update_bulk() to use new
xfs_ail_update_finish() function (fixes missing wakeups w/
xfs_ail_push_sync())
- fixed CIL size limit calculation units - was calculating min size
in sectors rather than bytes. Fixes performance degradation w/
patch series.
- dropped shadow buffer freeing - Brian pointed out it wasn't doing
exactly what I thought it was, and the memory savings were from the
massively undersized CIL limits, not early freeing of the shadow
buffers. Needs rethinking.
- fixed stray NULLCOMMITLSN in comments
- pinned items don't track the commit LSN, just the CIL sequence
number so we can't use that to push the AIL.
- removed stale tracing debug from AIL push code.
- fixes to memory reclaim shrinker accounting in 5.3-rc3 result in
direct reclaim backoff working a whole lot better, such that it's
no longer necessary for the XFS inode shrinker to wait for IO to
complete. Changed the LRU reclaim logic to simply push on the AIL
if dirty inodes are hit, but never wait on them. Relevant commits
are:

0308f7cf19c9 ("mm/vmscan.c: calculate reclaimed slab caches in all reclaim paths")
e5ca8071fe65 ("mm/vmscan.c: add a new member reclaim_state in struct shrink_control")

- Added a patch to convert xfs_reclaim_inodes() to use
xfs_ail_push_all() which gets rid of the last user of
xfs_ail_push_sync(). This allows it to be removed as "temporary
infrastructure for the series" rather than having to be fixed up
and made robust. The optimisations and factoring it drove are
retained, as they are still a net improvement overall.
- fixed atomic_long vs atomic64 issues with shrinker deferral
rework.
- don't drop ag reclaim cursor locking any more, it gets removed
when all the old reclaim code is removed.
- added patch to change inode reclaim vs unreferenced XFS inode
lookup done by inode write clustering and inode cluster freeing.
This gets rid of the need to cycle the ILOCK before running
call_rcu() to queue the inode to be freed when the current RCU
grace period expires. In doing so, the last major blocking point
in XFS inode reclaim is removed.

V1 (original RFC):

https://lore.kernel.org/linux-xfs/[email protected]/

----
Original text:

We've had a problem with inode reclaim for a long time - XFS is
capable of caching tens of millions of inodes with ease and dirtying
hundreds of thousands of those cached inodes every second. It is
also capable of reclaiming more than half a million clean inodes per
second per reclaim thread.

The result of this is that when there is a significant change in
sustained memory pressure on a system with a large inode cache,
memory reclaim rapidly frees all the clean inodes, but cannot make
progress on reclaiming dirty inodes because they are rate limited
by IO.

However, the shrinker infrastructure in the kernel has no way to
feed back rate limiting to the core memory reclaim algorithms. In
fact there are no feedback mechanisms at all, and so when reclaim
has freed all the clean inodes and starts hitting dirty inodes, the
filesystem has no way of telling reclaim that the inode reclaim rate
has dropped from 500k/s to 500/s.

The result is that reclaim continues to try to free memory, and
because it makes no progress freeing inodes, it puts much more
pressure on the page LRUs and frees pages. When it runs out of
pages, it starts swapping, and when it runs out of swap or can't get
a page for swap-in it starts going on an OOM kill rampage.

That does nothing to "fix" the shortage of memory caused by the
slowness of dirty inode reclaim - if memory demand continues we just
keep hitting the OOM killer until either something critical is
killed or memory demand eases.

For a long time, XFS has avoided the insane spiral of shouty
OOM-killer rage death by cleaning inodes directly in the shrinker.
This has the effect of throttling memory reclaim to the rate at
which dirty inodes can be cleaned, and so when we get into the state
when memory reclaim is dependent on inode reclaim making progress
we don't ever allow LRU reclaim to run so far ahead of inode reclaim
that it winds up reclaim priority and runs out of LRU pages to
reclaim and/or swap.

This has a downside, though. When there is a large amount of clean
page cache and a small amount of inode cache that is dirty (e.g.
lots of file data pressure and/or application memory demand) the
inode reclaim shrinkers can run out of clean inodes to reclaim and
start blocking on inode writeback. This can result in long reclaim
latencies even though there is lots of memory that can be
immediately reclaimed from the page cache.

There are other issues, too. We have to block kswapd, too, because
it will continue running until watermarks are satisfied, and that
is largely the vector for shouty swappy death if it doesn't back
off before priority windup from lack of progress occurs. Blocking
kswapd then affects direct reclaim function, which often backs off
expecting kswapd to make progress in the mean time. But if kswapd
is not making progress, direct reclaim ends up in priority windup
from lack of progress, too. This is especially prevalent in
workloads that have a high percentage of GFP_NOFS allocations (e.g.
filesystem modification workloads).

The shrinkers have another problem w/ GFP_NOFS reclaim: the work
that is deferred because the shrinker cannot make progress gets
lumped on the first reclaim context that can do that work. That
means a direct reclaimer might get lumped with scanning millions of
objects during low priority scanning when it should only be scanning
a couple of thousand objects. This can result in highly
unpredictable and extremely long direct reclaim delays.

This is most definitely sub-optimal, but it's better than random
and/or premature OOM killer invocation under trivial workloads and
lots of reclaimable memory still being available.

This patch set aims to fix all these problems. The memory reclaim
and shrinker changes involve:

- a substantial rework of how the shrinker defers work, moving all
the deferred work to kswapd to remove all the unpredictability
from direct reclaim. Direct reclaim will only do the work the
direct reclaim context determines is necessary.

- deferred work is capped, and the amount of deferred work kswapd
will do in each scan is increased linearly w.r.t. increasing
reclaim priority. Hence when we are desperate for memory, kswapd
will be running all the deferred work as quickly as possible (a
small sketch of this scaling follows this list).

- The amount of deferred work and the amount of scanning that is
done by the shrinkers is now tracked in the struct reclaim_state.
This allows shrink_node() to see how much work is being done in
comparison to both the LRU scanning and how much is being deferred
to kswapd. This allows direct reclaim to back off when too much
work is being deferred and hence allow kswapd to make progress on
the deferred work while it waits.

- A "need backoff" flag has been added to the struct reclaim_state.
This allows individual shrinkers to indicate to kswapd that they
need some time to finish work before being scanned again. This is
basically for the same case as kswapd backs off from LRU scanning.

i.e. the LRU scanning has run into the tail of the LRU and is
finding dirty objects that require IO to complete before reclaim
can make further progress. This is exactly the same problem we
have with inode reclaim in XFS, and it is this mechanism that
enables us to move to IO-less inode reclaim.
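
To make the scaling described in the deferred-work item above a bit
more concrete, here is a minimal sketch. It is illustrative only and
not code from this series: the function name and the exact linear
formula are assumptions, and DEF_PRIORITY is the usual kernel reclaim
priority constant (12).

#include <linux/mmzone.h>	/* DEF_PRIORITY */

/*
 * Illustrative only: kswapd picks up a larger share of the deferred
 * shrinker work as reclaim priority winds up. Priority counts down
 * from DEF_PRIORITY towards 0 as reclaim gets more desperate, so by
 * priority 0 all of the deferred work is issued.
 */
static unsigned long
deferred_work_this_pass(unsigned long deferred, int priority)
{
	if (priority <= 0)
		return deferred;
	return deferred * (DEF_PRIORITY - priority) / DEF_PRIORITY;
}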

The XFS changes are all over the place, and address both the reclaim
blocking problems and all the other related issues I found while
working on this patchset. These involve:

- fixing IO priority inversion problems between metadata
writeback (inodes!) and log IO caused by the block layer write
throttling (more on this later).

- some slab caches weren't marked as reclaimable, so were
incorrectly accounted. Also account for the pages xfs_buf reclaim
releases.

- reduced the delayed logging dirty item aggregation size (the CIL).
This defines the minimum amount of memory XFS can operate in when
heavy modifications are in progress.

- reduced the memory footprint of the CIL when repeated
modifications to objects occur.

- Added a mechanism to push the AIL to a specific LSN (metadata
modification epoch) and wait for it. This forms the basis for
direct inode reclaim deferring IO and waiting for some progress
without issuing IO itself.

- reworked inode reclaim to use a list_lru to track inodes in
reclaim rather than a radix tree tag in the inode cache. We
iterated the radix tree for reclaim because it resulted in optimal
IO patterns from multiple concurrent reclaimers, but we don't have
to care about that any more because all IO comes from the AIL now.

This gives us true LRU reclaim, and it allows us to effectively
determine when we've run out of clean inodes to easily reclaim and
provide that feedback to the higher levels via the "need backoff"
flag. (A minimal sketch of the list_lru usage follows this list.)

- direct reclaim is non-blocking while scanning; at the end of a
scan it will still block waiting for IO, but only for /some/
progress to be made, not for specific individual IOs.

- kswapd based reclaim is fully non-blocking.
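
As a rough illustration of the list_lru based tracking described
above, here is a minimal sketch using the generic list_lru API from
5.4. All of the example_* names are hypothetical and the isolate
logic is heavily simplified - the real patches handle inode locking,
dirty inodes and batched disposal.

#include <linux/list.h>
#include <linux/list_lru.h>
#include <linux/shrinker.h>

/* hypothetical per-filesystem LRU of reclaimable inodes */
static struct list_lru example_reclaim_lru;

/* add an inode's lru list_head once it becomes reclaimable */
static void
example_mark_reclaimable(struct list_head *lru_item)
{
	list_lru_add(&example_reclaim_lru, lru_item);
}

/*
 * Simplified isolate callback: a real implementation would trylock
 * the inode, rotate dirty inodes (remembering the lowest LSN so the
 * AIL can be pushed) and only move clean inodes to the dispose list.
 */
static enum lru_status
example_reclaim_isolate(
	struct list_head	*item,
	struct list_lru_one	*lru,
	spinlock_t		*lru_lock,
	void			*arg)
{
	struct list_head	*dispose = arg;

	list_lru_isolate_move(lru, item, dispose);
	return LRU_REMOVED;
}

static unsigned long
example_reclaim_scan(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	LIST_HEAD(dispose);
	unsigned long		freed;

	freed = list_lru_shrink_walk(&example_reclaim_lru, sc,
			example_reclaim_isolate, &dispose);

	/* a real implementation would now free everything on @dispose */
	return freed;
}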

The result is that there is now enough feedback from the shrinkers
into the main memory reclaim loop for it to back off in the
situations where back-off is required to avoid OOM killer
invocation, despite XFS now largely doing non-blocking reclaim.

Testing involves a 16p/16GB machine running a fsmark workload that
creates sustained heavy dirty inode cache pressure, then
progressively locking 2GB of memory at a time to squeeze the workload
into less and less memory. A vanilla kernel runs well up to 12GB
squeezed, but at 14GB squeezed performance goes to hell. With just
the hacky "don't block kswapd by removing SYNC_WAIT" patch that
people seem to like, OOM kills start when squeezed to 12GB. With
that extended to direct reclaim, OOM kills start when squeezed to
just 8GB. With the full patchset, it runs similar to a vanilla
kernel up to 12GB squeezed, and vastly out-performs the vanilla
kernel with 14GB squeezed. Performance only drops ~20% with a 14GB
squeeze, whereas the vanilla kernel sees up to a 90% drop in
performance.

I also ran testing with simoop, a simulated workload that Chris
Mason put together to demonstrate the long tail latency and
allocation stall problems the blocking in inode reclaim was causing.
The vanilla kernel averaged ~5 stalls/s over a test period of 10
hours; this patch series resulted in:

alloc stall rate = 0.00/sec (avg: 0.04) (p50: 0.04) (p95: 0.16) (p99: 0.32)

stalls almost going away entirely.

So the signs are there that this is a workable solution to the
problems caused by blocking inode reclaim without re-introducing the
Death-by-OOM-killer issues the blocking avoids.

Please note that I haven't fully gone non-blocking on direct reclaim
for a couple of reasons:

1. congestion_wait() and wait_iff_congested() are completely broken.
The blkmq change-over ripped out all the block layer congestion
reporting in 5.0 and didn't replace it with anything, so unless you
are operating on an NFS client, Ceph, FUSE or a DVD, congestion
checks and backoff aren't actually doing what they are supposed to.
i.e. wait_iff_congested() never blocks, and congestion_wait() always
sleeps for its full timeout.

IOWs, the whole bdi-based IO congestion feedback mechanism no longer
functions as intended, and so I'm betting a lot of the memory
reclaim heuristics no longer function as they were intended to...

2. The block layer write throttle is full of priority inversions.
Apart from the log IO one I fixed in this series, I noticed that
swap in/out has a major problem. I lost count of the number of OOM
kills that occurred from the swap in path when there were several
processes blocked in wbt_wait() in the block layer in the swap out
path. i.e. if swap out had been making progress, swap in would not
have oom killed. Hence I found it still necessary to throttle direct
reclaim back in the shrinker as there wasn't a reliable way to get
the core reclaim code to throttle effectively.

FWIW, from the swap in/out perspective, this whole inversion problem
is made worse by #1: the congestion_wait/wait_iff_congested
interfaces being broken. Direct reclaim uses wait_iff_congested() to
back off if kswapd has indicated that the node is congested
(PGDAT_CONGESTED) and reclaim is struggling to make progress.
However, this backoff never actually happens now and hence direct
reclaim barrels into the swap code as hard as it can and blocks in
wbt_wait() waiting behind other swap IO instead of backing off and
waiting for some IO to complete and then retrying its allocation....

So maybe if we fix the bdi congestion interfaces so they work again
we can get rid of the waiting in direct reclaim, but right now I
don't see any other choice....



Dave Chinner (28):
xfs: Lower CIL flush limit for large logs
xfs: Throttle commits on delayed background CIL push
xfs: don't allow log IO to be throttled
xfs: Improve metadata buffer reclaim accountability
xfs: correctly acount for reclaimable slabs
xfs: factor common AIL item deletion code
xfs: tail updates only need to occur when LSN changes
xfs: factor inode lookup from xfs_ifree_cluster
mm: directed shrinker work deferral
shrinkers: use defer_work for GFP_NOFS sensitive shrinkers
mm: factor shrinker work calculations
shrinker: defer work only to kswapd
shrinker: clean up variable types and tracepoints
mm: reclaim_state records pages reclaimed, not slabs
mm: back off direct reclaim on excessive shrinker deferral
mm: kswapd backoff for shrinkers
xfs: synchronous AIL pushing
xfs: don't block kswapd in inode reclaim
xfs: reduce kswapd blocking on inode locking.
xfs: kill background reclaim work
xfs: use AIL pushing for inode reclaim IO
xfs: remove mode from xfs_reclaim_inodes()
xfs: track reclaimable inodes using a LRU list
xfs: reclaim inodes from the LRU
xfs: remove unusued old inode reclaim code
xfs: use xfs_ail_push_all in xfs_reclaim_inodes
rwsem: introduce down/up_write_non_owner
xfs: rework unreferenced inode lookups

drivers/staging/android/ashmem.c | 8 +-
fs/gfs2/glock.c | 5 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 3 +-
fs/nfs/dir.c | 6 +-
fs/super.c | 6 +-
fs/xfs/mrlock.h | 27 ++
fs/xfs/xfs_buf.c | 4 +-
fs/xfs/xfs_icache.c | 632 +++++++++----------------------
fs/xfs/xfs_icache.h | 26 +-
fs/xfs/xfs_inode.c | 219 +++++------
fs/xfs/xfs_inode.h | 18 +
fs/xfs/xfs_inode_item.c | 28 +-
fs/xfs/xfs_log.c | 10 +-
fs/xfs/xfs_log_cil.c | 37 +-
fs/xfs/xfs_log_priv.h | 53 ++-
fs/xfs/xfs_mount.c | 10 +-
fs/xfs/xfs_mount.h | 6 +-
fs/xfs/xfs_qm.c | 11 +-
fs/xfs/xfs_super.c | 93 +++--
fs/xfs/xfs_trace.h | 1 +
fs/xfs/xfs_trans_ail.c | 88 +++--
fs/xfs/xfs_trans_priv.h | 6 +-
include/linux/rwsem.h | 6 +
include/linux/shrinker.h | 9 +-
include/linux/swap.h | 23 +-
include/trace/events/vmscan.h | 69 ++--
kernel/locking/rwsem.c | 23 ++
mm/slab.c | 3 +-
mm/slob.c | 4 +-
mm/slub.c | 3 +-
mm/vmscan.c | 231 +++++++----
net/sunrpc/auth.c | 5 +-
33 files changed, 863 insertions(+), 816 deletions(-)

--
2.24.0.rc0


2019-10-31 23:52:16

by Dave Chinner

Subject: [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability

From: Dave Chinner <[email protected]>

The buffer cache shrinker frees more than just the xfs_buf slab
objects - it also frees the pages attached to the buffers. Make sure
the memory reclaim code accounts for this memory being freed
correctly, similar to how the inode shrinker accounts for pages
freed from the page cache due to mapping invalidation.

We also need to make sure that the mm subsystem knows these are
reclaimable objects. We provide the memory reclaim subsystem with a
shrinker to reclaim xfs_bufs, so we should really mark the slab
that way.

We also have a lot of xfs_bufs in a busy system, so spread them
around like we do inodes.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_buf.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 1e63dd3d1257..d34e5d2edacd 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -324,6 +324,9 @@ xfs_buf_free(

__free_page(page);
}
+ if (current->reclaim_state)
+ current->reclaim_state->reclaimed_slab +=
+ bp->b_page_count;
} else if (bp->b_flags & _XBF_KMEM)
kmem_free(bp->b_addr);
_xfs_buf_free_pages(bp);
@@ -2061,7 +2064,8 @@ int __init
xfs_buf_init(void)
{
xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
- KM_ZONE_HWALIGN, NULL);
+ KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
+ NULL);
if (!xfs_buf_zone)
goto out;

--
2.24.0.rc0
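
For context on why accounting the freed pages in reclaimed_slab
matters: the core reclaim loop folds that counter into its progress
accounting, roughly as sketched below (paraphrased from shrink_node()
in mm/vmscan.c circa 5.3/5.4 - see the commits quoted in the cover
letter changelog; the helper name here is made up).

#include <linux/sched.h>	/* current */
#include <linux/swap.h>		/* struct reclaim_state */

/*
 * Pages freed by shrinkers - such as the xfs_buf pages accounted in
 * the patch above - get credited to the reclaim pass that freed them.
 */
static unsigned long
example_fold_slab_reclaim(unsigned long nr_reclaimed)
{
	struct reclaim_state *reclaim_state = current->reclaim_state;

	if (reclaim_state) {
		nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}
	return nr_reclaimed;
}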

2019-11-01 00:49:27

by Dave Chinner

Subject: [PATCH 09/28] mm: directed shrinker work deferral

From: Dave Chinner <[email protected]>

Introduce a mechanism for ->count_objects() to indicate to the
shrinker infrastructure that the reclaim context will not allow
scanning work to be done and so the work it decides is necessary
needs to be deferred.

This simplifies the code by separating out the accounting of
deferred work from the actual doing of the work, and allows better
decisions to be made by the shrinker control logic on what action it
can take.

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/shrinker.h | 7 +++++++
mm/vmscan.c | 8 ++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 0f80123650e2..3405c39ab92c 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -31,6 +31,13 @@ struct shrink_control {

/* current memcg being shrunk (for memcg aware shrinkers) */
struct mem_cgroup *memcg;
+
+ /*
+ * set by ->count_objects if reclaim context prevents reclaim from
+ * occurring. This allows the shrinker to immediately defer all the
+ * work and not even attempt to scan the cache.
+ */
+ bool defer_work;
};

#define SHRINK_STOP (~0UL)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee4eecc7e1c2..a215d71d9d4b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
freeable, delta, total_scan, priority);

+ /*
+ * If the shrinker can't run (e.g. due to gfp_mask constraints), then
+ * defer the work to a context that can scan the cache.
+ */
+ if (shrinkctl->defer_work)
+ goto done;
+
/*
* Normally, we should not scan less than batch_size objects in one
* pass to avoid too frequent shrinker calls, but if the slab has less
@@ -570,6 +577,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
cond_resched();
}

+done:
if (next_deferred >= scanned)
next_deferred -= scanned;
else
--
2.24.0.rc0
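
To show how a GFP_NOFS sensitive filesystem shrinker is expected to
use the new flag (the next patch in the series converts the existing
shrinkers along these lines), here is a minimal sketch; the
example_* names are hypothetical and the cache is assumed to be
backed by a list_lru.

#include <linux/gfp.h>
#include <linux/list_lru.h>
#include <linux/shrinker.h>

/* hypothetical LRU backing a filesystem object cache */
static struct list_lru example_fs_lru;

/*
 * Sketch of a ->count_objects implementation: if the allocation
 * context forbids filesystem recursion, ask the shrinker core to
 * defer all the work this count generates instead of calling
 * ->scan_objects in this context.
 */
static unsigned long
example_fs_shrink_count(
	struct shrinker		*shrink,
	struct shrink_control	*sc)
{
	if (!(sc->gfp_mask & __GFP_FS))
		sc->defer_work = true;

	return list_lru_shrink_count(&example_fs_lru, sc);
}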

2019-11-01 00:49:27

by Dave Chinner

Subject: [PATCH 01/28] xfs: Lower CIL flush limit for large logs

From: Dave Chinner <[email protected]>

The current CIL size aggregation limit is 1/8th the log size. This
means for large logs we might be aggregating at least 250MB of dirty objects
in memory before the CIL is flushed to the journal. With CIL shadow
buffers sitting around, this means the CIL is often consuming >500MB
of temporary memory that is all allocated under GFP_NOFS conditions.

Flushing the CIL can take some time to do if there is other IO
ongoing, and can introduce substantial log force latency by itself.
It also pins the memory until the objects are in the AIL and can be
written back and reclaimed by shrinkers. Hence this threshold also
tends to determine the minimum amount of memory XFS can operate in
under heavy modification without triggering the OOM killer.

Modify the CIL space limit to prevent such huge amounts of pinned
metadata from aggregating. We can have 2MB of log IO in flight at
once, so limit aggregation to 16x this size. This threshold was
chosen as it has little impact on performance (on 16-way fsmark) or log
traffic but pins a lot less memory on large logs, especially under
heavy memory pressure. An aggregation limit of 8x had a 5-10%
performance degradation and a 50% increase in log throughput for
the same workload, so clearly that was too small for highly
concurrent workloads on large logs.

This was found via trace analysis of AIL behaviour. e.g. insertion
from a single CIL flush:

xfs_ail_insert: old lsn 0/0 new lsn 1/3033090 type XFS_LI_INODE flags IN_AIL

$ grep xfs_ail_insert /mnt/scratch/s.t |grep "new lsn 1/3033090" |wc -l
1721823
$

So there were 1.7 million objects inserted into the AIL from this
CIL checkpoint, the first at 2323.392108, the last at 2325.667566 which
was the end of the trace (i.e. it hadn't finished). Clearly a major
problem.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/xfs/xfs_log_priv.h | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 4f19375f6592..abd382cfffe3 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -318,13 +318,30 @@ struct xfs_cil {
* tries to keep 25% of the log free, so we need to keep below that limit or we
* risk running out of free log space to start any new transactions.
*
- * In order to keep background CIL push efficient, we will set a lower
- * threshold at which background pushing is attempted without blocking current
- * transaction commits. A separate, higher bound defines when CIL pushes are
- * enforced to ensure we stay within our maximum checkpoint size bounds.
- * threshold, yet give us plenty of space for aggregation on large logs.
+ * In order to keep background CIL push efficient, we only need to ensure the
+ * CIL is large enough to maintain sufficient in-memory relogging to avoid
+ * repeated physical writes of frequently modified metadata. If we allow the CIL
+ * to grow to a substantial fraction of the log, then we may be pinning hundreds
+ * of megabytes of metadata in memory until the CIL flushes. This can cause
+ * issues when we are running low on memory - pinned memory cannot be reclaimed,
+ * and the CIL consumes a lot of memory. Hence we need to set an upper physical
+ * size limit for the CIL that limits the maximum amount of memory pinned by the
+ * CIL but does not limit performance by reducing relogging efficiency
+ * significantly.
+ *
+ * As such, the CIL push threshold ends up being the smaller of two thresholds:
+ * - a threshold large enough that it allows CIL to be pushed and progress to be
+ * made without excessive blocking of incoming transaction commits. This is
+ * defined to be 12.5% of the log space - half the 25% push threshold of the
+ * AIL.
+ * - small enough that it doesn't pin excessive amounts of memory but maintains
+ * close to peak relogging efficiency. This is defined to be 16x the iclog
+ * buffer window (32MB) as measurements have shown this to be roughly the
+ * point of diminishing performance increases under highly concurrent
+ * modification workloads.
*/
-#define XLOG_CIL_SPACE_LIMIT(log) (log->l_logsize >> 3)
+#define XLOG_CIL_SPACE_LIMIT(log) \
+ min_t(int, (log)->l_logsize >> 3, BBTOB(XLOG_TOTAL_REC_SHIFT(log)) << 4)

/*
* ticket grant locks, queues and accounting have their own cachlines
--
2.24.0.rc0
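
A quick worked example of the new limit (illustrative userspace
arithmetic only, not kernel code; the 2GB log size and the 2MB
in-flight iclog window are assumptions taken from the commit
message):

#include <stdio.h>

int main(void)
{
	unsigned long log_size = 2048UL << 20;		/* assumed 2GB log */
	unsigned long old_limit = log_size >> 3;	/* 1/8th of the log */
	unsigned long iclog_window = 2UL << 20;		/* ~2MB of log IO in flight */
	unsigned long cap = iclog_window << 4;		/* 16x the window */
	unsigned long new_limit = old_limit < cap ? old_limit : cap;

	/* prints: old limit: 256MB, new limit: 32MB */
	printf("old limit: %luMB, new limit: %luMB\n",
	       old_limit >> 20, new_limit >> 20);
	return 0;
}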

2019-11-01 00:52:54

by Dave Chinner

Subject: [PATCH 18/28] xfs: don't block kswapd in inode reclaim

From: Dave Chinner <[email protected]>

We have a number of reasons for blocking kswapd in XFS inode
reclaim, mainly all to do with the fact that memory reclaim has no
feedback mechanisms to throttle on dirty slab objects that need IO
to reclaim.

As a result, we currently throttle inode reclaim by issuing IO in
the reclaim context. The unfortunate side effect of this is that it
can cause long tail latencies in reclaim and for some workloads this
can be a problem.

Now that the shrinkers finally have a method of telling kswapd to
back off, we can start the process of making inode reclaim in XFS
non-blocking. The first thing we need to do is stop blocking kswapd;
so that this doesn't cause immediate serious problems, make sure inode
writeback is always underway when kswapd is running.

As we don't block kswapd now, we don't have to worry about reclaim
scans taking long delays due to IO being issued and waited for.
Hence while direct reclaim gets delayed by IO, kswapd will not and
so it will keep pushing the AIL to clean inodes. Hence direct
reclaim doesn't need to push the AIL anymore as kswapd will do it
reliably now.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
---
fs/xfs/xfs_icache.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 944add5ff8e0..edcc3f6bb3bf 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1378,11 +1378,22 @@ xfs_reclaim_inodes_nr(
struct xfs_mount *mp,
int nr_to_scan)
{
- /* kick background reclaimer and push the AIL */
+ int sync_mode = SYNC_TRYLOCK;
+
+ /* kick background reclaimer */
xfs_reclaim_work_queue(mp);
- xfs_ail_push_all(mp->m_ail);

- return xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+ /*
+ * For kswapd, we kick background inode writeback. For direct
+ * reclaim, we issue and wait on inode writeback to throttle
+ * reclaim rates and avoid shouty OOM-death.
+ */
+ if (current_is_kswapd())
+ xfs_ail_push_all(mp->m_ail);
+ else
+ sync_mode |= SYNC_WAIT;
+
+ return xfs_reclaim_inodes_ag(mp, sync_mode, &nr_to_scan);
}

/*
--
2.24.0.rc0

2019-11-01 00:55:08

by Dave Chinner

Subject: [PATCH 17/28] xfs: synchronous AIL pushing

From: Dave Chinner <[email protected]>

Provide an interface to push the AIL to a target LSN and wait for
the tail of the log to move past that LSN. This is used to wait for
all items older than a specific LSN to either be cleaned (written
back) or relogged to a higher LSN in the AIL. The primary use for
this is to allow IO free inode reclaim throttling.

Factor the common AIL deletion code that does all the wakeups into a
helper so we only have one copy of this somewhat tricky code to
interface with all the wakeups necessary when the LSN of the log
tail changes.

xfs_ail_push_sync() is temporary infrastructure to facilitate
non-blocking, IO-less inode reclaim throttling that allows further
structural changes to be made. Once those structural changes are
made, the need for this function goes away and it is removed. In
essence, it is only provided to ensure git bisects don't break while
the changes to the reclaim algorithms are in progress.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_trans_ail.c | 32 ++++++++++++++++++++++++++++++++
fs/xfs/xfs_trans_priv.h | 2 ++
2 files changed, 34 insertions(+)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 685a21cd24c0..3e1d0e1439e2 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -662,6 +662,36 @@ xfs_ail_push_all(
xfs_ail_push(ailp, threshold_lsn);
}

+/*
+ * Push the AIL to a specific lsn and wait for it to complete.
+ */
+void
+xfs_ail_push_sync(
+ struct xfs_ail *ailp,
+ xfs_lsn_t threshold_lsn)
+{
+ struct xfs_log_item *lip;
+ DEFINE_WAIT(wait);
+
+ spin_lock(&ailp->ail_lock);
+ while ((lip = xfs_ail_min(ailp)) != NULL) {
+ prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
+ if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
+ XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
+ break;
+ if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
+ ailp->ail_target = threshold_lsn;
+ wake_up_process(ailp->ail_task);
+ spin_unlock(&ailp->ail_lock);
+ schedule();
+ spin_lock(&ailp->ail_lock);
+ }
+ spin_unlock(&ailp->ail_lock);
+
+ finish_wait(&ailp->ail_push, &wait);
+}
+
+
/*
* Push out all items in the AIL immediately and wait until the AIL is empty.
*/
@@ -702,6 +732,7 @@ xfs_ail_update_finish(
if (!XFS_FORCED_SHUTDOWN(mp))
xlog_assign_tail_lsn_locked(mp);

+ wake_up_all(&ailp->ail_push);
if (list_empty(&ailp->ail_head))
wake_up_all(&ailp->ail_empty);
spin_unlock(&ailp->ail_lock);
@@ -858,6 +889,7 @@ xfs_trans_ail_init(
spin_lock_init(&ailp->ail_lock);
INIT_LIST_HEAD(&ailp->ail_buf_list);
init_waitqueue_head(&ailp->ail_empty);
+ init_waitqueue_head(&ailp->ail_push);

ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
ailp->ail_mount->m_fsname);
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 35655eac01a6..1b6f4bbd47c0 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -61,6 +61,7 @@ struct xfs_ail {
int ail_log_flush;
struct list_head ail_buf_list;
wait_queue_head_t ail_empty;
+ wait_queue_head_t ail_push;
};

/*
@@ -113,6 +114,7 @@ xfs_trans_ail_remove(
}

void xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
+void xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
void xfs_ail_push_all(struct xfs_ail *);
void xfs_ail_push_all_sync(struct xfs_ail *);
struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp);
--
2.24.0.rc0
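
A hedged sketch of how a reclaim pass might use the new interface.
This is not the series' actual reclaim code: the function name is
made up and the usual fs/xfs headers are assumed. The idea, per the
cover letter, is that a scan records the lowest LSN of any dirty or
pinned inode it had to skip and waits once, at the end of the scan,
for the AIL to move past it rather than waiting on individual IOs.

/* assumes the usual fs/xfs includes: xfs.h, xfs_log.h, xfs_mount.h,
 * xfs_trans_priv.h */

static void
example_reclaim_wait_for_ail(
	struct xfs_mount	*mp,
	xfs_lsn_t		lowest_lsn)
{
	/* nothing dirty or pinned was skipped: no need to wait */
	if (lowest_lsn == NULLCOMMITLSN)
		return;

	xfs_ail_push_sync(mp->m_ail, lowest_lsn);
}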

2019-11-01 13:43:01

by Brian Foster

Subject: Re: [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability

On Fri, Nov 01, 2019 at 10:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
>
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with a
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
>
> We also have a lot of xfs_bufs in a busy system, spread them around
> like we do inodes.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---

I still don't see why we wouldn't set the spread flag on the bli cache
as well, but afaict it doesn't matter in most cases unless the spread
knob is enabled. Unless I'm misunderstanding how that works, I think the
commit log could be improved to describe that, since to me it implies the
flag by itself has an effect, but otherwise the change seems fine:

Reviewed-by: Brian Foster <[email protected]>

> fs/xfs/xfs_buf.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1e63dd3d1257..d34e5d2edacd 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>
> __free_page(page);
> }
> + if (current->reclaim_state)
> + current->reclaim_state->reclaimed_slab +=
> + bp->b_page_count;
> } else if (bp->b_flags & _XBF_KMEM)
> kmem_free(bp->b_addr);
> _xfs_buf_free_pages(bp);
> @@ -2061,7 +2064,8 @@ int __init
> xfs_buf_init(void)
> {
> xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> - KM_ZONE_HWALIGN, NULL);
> + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,
> + NULL);
> if (!xfs_buf_zone)
> goto out;
>
> --
> 2.24.0.rc0
>

2019-11-04 15:27:41

by Brian Foster

Subject: Re: [PATCH 09/28] mm: directed shrinker work deferral

On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Introduce a mechanism for ->count_objects() to indicate to the
> shrinker infrastructure that the reclaim context will not allow
> scanning work to be done and so the work it decides is necessary
> needs to be deferred.
>
> This simplifies the code by separating out the accounting of
> deferred work from the actual doing of the work, and allows better
> decisions to be made by the shrinekr control logic on what action it
> can take.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---

My understanding from the previous discussion(s) is that this is not
tied directly to the gfp mask because that is not the only intended use.
While it is currently a boolean tied to the entire shrinker call,
the longer term objective is per-object granularity.

I find the argument reasonable enough, but if the above is true, why do
we move these checks from ->scan_objects() to ->count_objects() (in the
next patch) when per-object decisions will ultimately need to be made by
the former? That seems like unnecessary churn and inconsistent with the
argument against just temporarily doing something like what Christoph
suggested in the previous version, particularly since IIRC the only use
in this series was for gfp mask purposes.

> include/linux/shrinker.h | 7 +++++++
> mm/vmscan.c | 8 ++++++++
> 2 files changed, 15 insertions(+)
>
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 0f80123650e2..3405c39ab92c 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -31,6 +31,13 @@ struct shrink_control {
>
> /* current memcg being shrunk (for memcg aware shrinkers) */
> struct mem_cgroup *memcg;
> +
> + /*
> + * set by ->count_objects if reclaim context prevents reclaim from
> + * occurring. This allows the shrinker to immediately defer all the
> + * work and not even attempt to scan the cache.
> + */
> + bool defer_work;
> };
>
> #define SHRINK_STOP (~0UL)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ee4eecc7e1c2..a215d71d9d4b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> freeable, delta, total_scan, priority);
>
> + /*
> + * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> + * defer the work to a context that can scan the cache.
> + */
> + if (shrinkctl->defer_work)
> + goto done;
> +

I still find the fact that this per-shrinker invocation field is never
reset unnecessarily fragile, and I don't see any good reason not to
reset it prior to the shrinker callback that potentially sets it.

Brian

> /*
> * Normally, we should not scan less than batch_size objects in one
> * pass to avoid too frequent shrinker calls, but if the slab has less
> @@ -570,6 +577,7 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> cond_resched();
> }
>
> +done:
> if (next_deferred >= scanned)
> next_deferred -= scanned;
> else
> --
> 2.24.0.rc0
>

2019-11-04 23:27:33

by Darrick J. Wong

Subject: Re: [PATCH 04/28] xfs: Improve metadata buffer reclaim accountability

On Fri, Nov 01, 2019 at 10:45:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> The buffer cache shrinker frees more than just the xfs_buf slab
> objects - it also frees the pages attached to the buffers. Make sure
> the memory reclaim code accounts for this memory being freed
> correctly, similar to how the inode shrinker accounts for pages
> freed from the page cache due to mapping invalidation.
>
> We also need to make sure that the mm subsystem knows these are
> reclaimable objects. We provide the memory reclaim subsystem with a
> a shrinker to reclaim xfs_bufs, so we should really mark the slab
> that way.
>
> We also have a lot of xfs_bufs in a busy system, spread them around
> like we do inodes.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/xfs/xfs_buf.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 1e63dd3d1257..d34e5d2edacd 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -324,6 +324,9 @@ xfs_buf_free(
>
> __free_page(page);
> }
> + if (current->reclaim_state)
> + current->reclaim_state->reclaimed_slab +=
> + bp->b_page_count;
> } else if (bp->b_flags & _XBF_KMEM)
> kmem_free(bp->b_addr);
> _xfs_buf_free_pages(bp);
> @@ -2061,7 +2064,8 @@ int __init
> xfs_buf_init(void)
> {
> xfs_buf_zone = kmem_zone_init_flags(sizeof(xfs_buf_t), "xfs_buf",
> - KM_ZONE_HWALIGN, NULL);
> + KM_ZONE_HWALIGN | KM_ZONE_SPREAD | KM_ZONE_RECLAIM,

As discussed on the previous iteration of this series, I'd like to
capture the reasons for adding KM_ZONE_SPREAD as a separate patch.

--D

> + NULL);
> if (!xfs_buf_zone)
> goto out;
>
> --
> 2.24.0.rc0
>

2019-11-05 17:06:16

by Brian Foster

Subject: Re: [PATCH 17/28] xfs: synchronous AIL pushing

On Fri, Nov 01, 2019 at 10:46:07AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Provide an interface to push the AIL to a target LSN and wait for
> the tail of the log to move past that LSN. This is used to wait for
> all items older than a specific LSN to either be cleaned (written
> back) or relogged to a higher LSN in the AIL. The primary use for
> this is to allow IO free inode reclaim throttling.
>
> Factor the common AIL deletion code that does all the wakeups into a
> helper so we only have one copy of this somewhat tricky code to
> interface with all the wakeups necessary when the LSN of the log
> tail changes.
>

The above paragraph doesn't seem applicable to this patch. With that
fixed:

Reviewed-by: Brian Foster <[email protected]>

> xfs_ail_push_sync() is temporary infrastructure to facilitate
> non-blocking, IO-less inode reclaim throttling that allows further
> structural changes to be made. Once those structural changes are
> made, the need for this function goes away and it is removed. In
> essence, it is only provided to ensure git bisects don't break while
> the changes to the reclaim algorithms are in progress.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> fs/xfs/xfs_trans_ail.c | 32 ++++++++++++++++++++++++++++++++
> fs/xfs/xfs_trans_priv.h | 2 ++
> 2 files changed, 34 insertions(+)
>
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index 685a21cd24c0..3e1d0e1439e2 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -662,6 +662,36 @@ xfs_ail_push_all(
> xfs_ail_push(ailp, threshold_lsn);
> }
>
> +/*
> + * Push the AIL to a specific lsn and wait for it to complete.
> + */
> +void
> +xfs_ail_push_sync(
> + struct xfs_ail *ailp,
> + xfs_lsn_t threshold_lsn)
> +{
> + struct xfs_log_item *lip;
> + DEFINE_WAIT(wait);
> +
> + spin_lock(&ailp->ail_lock);
> + while ((lip = xfs_ail_min(ailp)) != NULL) {
> + prepare_to_wait(&ailp->ail_push, &wait, TASK_UNINTERRUPTIBLE);
> + if (XFS_FORCED_SHUTDOWN(ailp->ail_mount) ||
> + XFS_LSN_CMP(threshold_lsn, lip->li_lsn) < 0)
> + break;
> + if (XFS_LSN_CMP(threshold_lsn, ailp->ail_target) > 0)
> + ailp->ail_target = threshold_lsn;
> + wake_up_process(ailp->ail_task);
> + spin_unlock(&ailp->ail_lock);
> + schedule();
> + spin_lock(&ailp->ail_lock);
> + }
> + spin_unlock(&ailp->ail_lock);
> +
> + finish_wait(&ailp->ail_push, &wait);
> +}
> +
> +
> /*
> * Push out all items in the AIL immediately and wait until the AIL is empty.
> */
> @@ -702,6 +732,7 @@ xfs_ail_update_finish(
> if (!XFS_FORCED_SHUTDOWN(mp))
> xlog_assign_tail_lsn_locked(mp);
>
> + wake_up_all(&ailp->ail_push);
> if (list_empty(&ailp->ail_head))
> wake_up_all(&ailp->ail_empty);
> spin_unlock(&ailp->ail_lock);
> @@ -858,6 +889,7 @@ xfs_trans_ail_init(
> spin_lock_init(&ailp->ail_lock);
> INIT_LIST_HEAD(&ailp->ail_buf_list);
> init_waitqueue_head(&ailp->ail_empty);
> + init_waitqueue_head(&ailp->ail_push);
>
> ailp->ail_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
> ailp->ail_mount->m_fsname);
> diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
> index 35655eac01a6..1b6f4bbd47c0 100644
> --- a/fs/xfs/xfs_trans_priv.h
> +++ b/fs/xfs/xfs_trans_priv.h
> @@ -61,6 +61,7 @@ struct xfs_ail {
> int ail_log_flush;
> struct list_head ail_buf_list;
> wait_queue_head_t ail_empty;
> + wait_queue_head_t ail_push;
> };
>
> /*
> @@ -113,6 +114,7 @@ xfs_trans_ail_remove(
> }
>
> void xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
> +void xfs_ail_push_sync(struct xfs_ail *, xfs_lsn_t);
> void xfs_ail_push_all(struct xfs_ail *);
> void xfs_ail_push_all_sync(struct xfs_ail *);
> struct xfs_log_item *xfs_ail_min(struct xfs_ail *ailp);
> --
> 2.24.0.rc0
>

2019-11-14 20:50:50

by Dave Chinner

Subject: Re: [PATCH 09/28] mm: directed shrinker work deferral

On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Introduce a mechanism for ->count_objects() to indicate to the
> > shrinker infrastructure that the reclaim context will not allow
> > scanning work to be done and so the work it decides is necessary
> > needs to be deferred.
> >
> > This simplifies the code by separating out the accounting of
> > deferred work from the actual doing of the work, and allows better
> > decisions to be made by the shrinekr control logic on what action it
> > can take.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
>
> My understanding from the previous discussion(s) is that this is not
> tied directly to the gfp mask because that is not the only intended use.
> While it is currently a boolean tied to the the entire shrinker call,
> the longer term objective is per-object granularity.

Longer term, yes, but right now such things are not possible as the
shrinker needs more context to be able to make sane per-object
decisions. shrinker policy decisions that affect the entire run
scope should be handled by the ->count operation - it's the one that
says whether the scan loop should run or not, and right now GFP_NOFS
for all filesystem shrinkers is a pure boolean policy
implementation.

The next future step is to provide a superblock context with
GFP_NOFS to indicate which filesystem we cannot recurse into. That
is also a shrinker instance wide check, so again it's something that
->count should be deciding.

i.e. ->count determines what is to be done, ->scan iterates the work
that has to be done until we are done.

> I find the argument reasonable enough, but if the above is true, why do
> we move these checks from ->scan_objects() to ->count_objects() (in the
> next patch) when per-object decisions will ultimately need to be made by
> the former?

Because run/no-run policy belongs in one place, and things like
GFP_NOFS do not change across calls to the ->scan loop. i.e. after
the first ->scan call in a loop that calls it hundreds to thousands
of times, the GFP_NOFS run/no-run check is completely redundant.

Once we introduce a new policy that allows the fs shrinker to do
careful reclaim in GFP_NOFS conditions, we need to do substantial
rework of the shrinker scan loop and how it accounts the work that is
done - we now have at least 3 or 4 different return counters
(skipped because locked, skipped because referenced,
reclaimed, deferred reclaim because couldn't lock/recursion) and
the accounting and decisions to be made are a lot more complex.

In that case, the ->count function will drop the GFP_NOFS check, but
still do all the other things it needs to do. The GFP_NOFS check
will go deep in the guts of the shrinker scan implementation where
the per-object recursion problem exists. But for most shrinkers,
it's still going to be a global boolean check...

> That seems like unnecessary churn and inconsistent with the
> argument against just temporarily doing something like what Christoph
> suggested in the previous version, particularly since IIRC the only use
> in this series was for gfp mask purposes.

If people want to call avoiding repeated, unnecessary evaluation of
the same condition hundreds of times instead of once "unnecessary
churn", then I'll drop it.

> > include/linux/shrinker.h | 7 +++++++
> > mm/vmscan.c | 8 ++++++++
> > 2 files changed, 15 insertions(+)
> >
> > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > index 0f80123650e2..3405c39ab92c 100644
> > --- a/include/linux/shrinker.h
> > +++ b/include/linux/shrinker.h
> > @@ -31,6 +31,13 @@ struct shrink_control {
> >
> > /* current memcg being shrunk (for memcg aware shrinkers) */
> > struct mem_cgroup *memcg;
> > +
> > + /*
> > + * set by ->count_objects if reclaim context prevents reclaim from
> > + * occurring. This allows the shrinker to immediately defer all the
> > + * work and not even attempt to scan the cache.
> > + */
> > + bool defer_work;
> > };
> >
> > #define SHRINK_STOP (~0UL)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ee4eecc7e1c2..a215d71d9d4b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > freeable, delta, total_scan, priority);
> >
> > + /*
> > + * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > + * defer the work to a context that can scan the cache.
> > + */
> > + if (shrinkctl->defer_work)
> > + goto done;
> > +
>
> I still find the fact that this per-shrinker invocation field is never
> reset unnecessarily fragile, and I don't see any good reason not to
> reset it prior to the shrinker callback that potentially sets it.

I missed that when updating. I'll reset it in the next version.

-Dave.
--
Dave Chinner
[email protected]

2019-11-15 17:23:25

by Brian Foster

Subject: Re: [PATCH 09/28] mm: directed shrinker work deferral

On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <[email protected]>
> > >
> > > Introduce a mechanism for ->count_objects() to indicate to the
> > > shrinker infrastructure that the reclaim context will not allow
> > > scanning work to be done and so the work it decides is necessary
> > > needs to be deferred.
> > >
> > > This simplifies the code by separating out the accounting of
> > > deferred work from the actual doing of the work, and allows better
> > > decisions to be made by the shrinekr control logic on what action it
> > > can take.
> > >
> > > Signed-off-by: Dave Chinner <[email protected]>
> > > ---
> >
> > My understanding from the previous discussion(s) is that this is not
> > tied directly to the gfp mask because that is not the only intended use.
> > While it is currently a boolean tied to the the entire shrinker call,
> > the longer term objective is per-object granularity.
>
> Longer term, yes, but right now such things are not possible as the
> shrinker needs more context to be able to make sane per-object
> decisions. shrinker policy decisions that affect the entire run
> scope should be handled by the ->count operation - it's the one that
> says whether the scan loop should run or not, and right now GFP_NOFS
> for all filesystem shrinkers is a pure boolean policy
> implementation.
>
> The next future step is to provide a superblock context with
> GFP_NOFS to indicate which filesystem we cannot recurse into. That
> is also a shrinker instance wide check, so again it's something that
> ->count should be deciding.
>
> i.e. ->count determines what is to be done, ->scan iterates the work
> that has to be done until we are done.
>

Sure, makes sense in general.

> > I find the argument reasonable enough, but if the above is true, why do
> > we move these checks from ->scan_objects() to ->count_objects() (in the
> > next patch) when per-object decisions will ultimately need to be made by
> > the former?
>
> Because run/no-run policy belongs in one place, and things like
> GFP_NOFS do no change across calls to the ->scan loop. i.e. after
> the first ->scan call in a loop that calls it hundreds to thousands
> of times, the GFP_NOFS run/no-run check is completely redundant.
>

What loop is currently called hundreds to thousands of times that this
change prevents? AFAICT the current nofs checks in the ->scan calls
explicitly terminate the scan loop. So we're effectively saving a
function call by doing this earlier in the ->count call. (Nothing wrong
with that, I'm just not following the numbers used in this reasoning..).

> Once we introduce a new policy that allows the fs shrinker to do
> careful reclaim in GFP_NOFS conditions, we need to do substantial
> rework the shrinker scan loop and how it accounts the work that is
> done - we now have at least 3 or 4 different return counters
> (skipped because locked, skipped because referenced,
> reclaimed, deferred reclaim because couldn't lock/recursion) and
> the accounting and decisions to be made are a lot more complex.
>

Yeah, that's generally what I expected from your previous description.

> In that case, the ->count function will drop the GFP_NOFS check, but
> still do all the other things is needs to do. The GFP_NOFS check
> will go deep in the guts of the shrinker scan implementation where
> the per-object recursion problem exists. But for most shrinkers,
> it's still going to be a global boolean check...
>

So once the nofs checks are lifted out of the ->count callback and into
the core shrinker, is there still a use case to defer an entire ->count
instance from the callback?

> > That seems like unnecessary churn and inconsistent with the
> > argument against just temporarily doing something like what Christoph
> > suggested in the previous version, particularly since IIRC the only use
> > in this series was for gfp mask purposes.
>
> If people want to call avoiding repeated, unnecessary evaluation of
> the same condition hundreds of times instead of once "unnecessary
> churn", then I'll drop it.
>

I'm not referring to the functional change as churn. What I was
referring to is that we're shuffling around the boilerplate gfp checking
code between the different shrinker callbacks, knowing that it's
eventually going to be lifted out, when we could potentially just lift
that code up a level now.

Brian

> > > include/linux/shrinker.h | 7 +++++++
> > > mm/vmscan.c | 8 ++++++++
> > > 2 files changed, 15 insertions(+)
> > >
> > > diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> > > index 0f80123650e2..3405c39ab92c 100644
> > > --- a/include/linux/shrinker.h
> > > +++ b/include/linux/shrinker.h
> > > @@ -31,6 +31,13 @@ struct shrink_control {
> > >
> > > /* current memcg being shrunk (for memcg aware shrinkers) */
> > > struct mem_cgroup *memcg;
> > > +
> > > + /*
> > > + * set by ->count_objects if reclaim context prevents reclaim from
> > > + * occurring. This allows the shrinker to immediately defer all the
> > > + * work and not even attempt to scan the cache.
> > > + */
> > > + bool defer_work;
> > > };
> > >
> > > #define SHRINK_STOP (~0UL)
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index ee4eecc7e1c2..a215d71d9d4b 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -536,6 +536,13 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> > > trace_mm_shrink_slab_start(shrinker, shrinkctl, nr,
> > > freeable, delta, total_scan, priority);
> > >
> > > + /*
> > > + * If the shrinker can't run (e.g. due to gfp_mask constraints), then
> > > + * defer the work to a context that can scan the cache.
> > > + */
> > > + if (shrinkctl->defer_work)
> > > + goto done;
> > > +
> >
> > I still find the fact that this per-shrinker invocation field is never
> > reset unnecessarily fragile, and I don't see any good reason not to
> > reset it prior to the shrinker callback that potentially sets it.
>
> I missed that when updating. I'll reset it in the next version.
>
> -Dave.
> --
> Dave Chinner
> [email protected]
>

2019-11-18 00:51:30

by Dave Chinner

Subject: Re: [PATCH 09/28] mm: directed shrinker work deferral

On Fri, Nov 15, 2019 at 12:21:40PM -0500, Brian Foster wrote:
> On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > > From: Dave Chinner <[email protected]>
> > > >
> > > > Introduce a mechanism for ->count_objects() to indicate to the
> > > > shrinker infrastructure that the reclaim context will not allow
> > > > scanning work to be done and so the work it decides is necessary
> > > > needs to be deferred.
> > > >
> > > > This simplifies the code by separating out the accounting of
> > > > deferred work from the actual doing of the work, and allows better
> > > > decisions to be made by the shrinekr control logic on what action it
> > > > can take.
> > > >
> > > > Signed-off-by: Dave Chinner <[email protected]>
> > > > ---
> > >
> > > My understanding from the previous discussion(s) is that this is not
> > > tied directly to the gfp mask because that is not the only intended use.
> > > While it is currently a boolean tied to the the entire shrinker call,
> > > the longer term objective is per-object granularity.
> >
> > Longer term, yes, but right now such things are not possible as the
> > shrinker needs more context to be able to make sane per-object
> > decisions. shrinker policy decisions that affect the entire run
> > scope should be handled by the ->count operation - it's the one that
> > says whether the scan loop should run or not, and right now GFP_NOFS
> > for all filesystem shrinkers is a pure boolean policy
> > implementation.
> >
> > The next future step is to provide a superblock context with
> > GFP_NOFS to indicate which filesystem we cannot recurse into. That
> > is also a shrinker instance wide check, so again it's something that
> > ->count should be deciding.
> >
> > i.e. ->count determines what is to be done, ->scan iterates the work
> > that has to be done until we are done.
> >
>
> Sure, makes sense in general.
>
> > > I find the argument reasonable enough, but if the above is true, why do
> > > we move these checks from ->scan_objects() to ->count_objects() (in the
> > > next patch) when per-object decisions will ultimately need to be made by
> > > the former?
> >
> > Because run/no-run policy belongs in one place, and things like
> > GFP_NOFS do no change across calls to the ->scan loop. i.e. after
> > the first ->scan call in a loop that calls it hundreds to thousands
> > of times, the GFP_NOFS run/no-run check is completely redundant.
> >
>
> What loop is currently called hundreds to thousands of times that this
> change prevents? AFAICT the current nofs checks in the ->scan calls
> explicitly terminate the scan loop.

Right, but when we are in GFP_KERNEL context, every call to ->scan()
checks it and says "ok". If we are scanning tens of thousands of
objects in a scan, and we are using a default batch size of 128
objects per scan, then we have hundreds of calls in a single scan
loop that check the GFP context and say "ok"....

> So we're effectively saving a
> function call by doing this earlier in the count ->call. (Nothing wrong
> with that, I'm just not following the numbers used in this reasoning..).

It's the don't terminate case. :)

> > Once we introduce a new policy that allows the fs shrinker to do
> > careful reclaim in GFP_NOFS conditions, we need to do substantial
> > rework the shrinker scan loop and how it accounts the work that is
> > done - we now have at least 3 or 4 different return counters
> > (skipped because locked, skipped because referenced,
> > reclaimed, deferred reclaim because couldn't lock/recursion) and
> > the accounting and decisions to be made are a lot more complex.
> >
>
> Yeah, that's generally what I expected from your previous description.
>
> > In that case, the ->count function will drop the GFP_NOFS check, but
> > still do all the other things it needs to do. The GFP_NOFS check
> > will go deep in the guts of the shrinker scan implementation where
> > the per-object recursion problem exists. But for most shrinkers,
> > it's still going to be a global boolean check...
> >
>
> So once the nofs checks are lifted out of the ->count callback and into
> the core shrinker, is there still a use case to defer an entire ->count
> instance from the callback?

Not right now. There may be in future, but I don't want to make
things more complex than they need to be by trying to support
functionality that isn't used.

> > If people want to call avoiding repeated, unnecessary evaluation of
> > the same condition hundreds of times instead of once "unnecessary
> > churn", then I'll drop it.
> >
>
> I'm not referring to the functional change as churn. What I was
> referring to is that we're shuffling around the boilerplate gfp checking
> code between the different shrinker callbacks, knowing that it's
> eventually going to be lifted out, when we could potentially just lift
> that code up a level now.

I don't think that lifting it up will save much code at all, once we
add all the gfp mask initialisation to all the shrinkers, etc. It
just means we can't look at the shrinker implementation and know
that it can't run in GFP_NOFS context - we have to go look up
where it is instantiated instead to see if there are gfp context
constraints.

I think it's better where it is, documenting the constraints the
shrinker implementation runs under in the implementation itself...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-11-19 15:14:47

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 09/28] mm: directed shrinker work deferral

On Mon, Nov 18, 2019 at 11:49:56AM +1100, Dave Chinner wrote:
> On Fri, Nov 15, 2019 at 12:21:40PM -0500, Brian Foster wrote:
> > On Fri, Nov 15, 2019 at 07:49:26AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 04, 2019 at 10:25:25AM -0500, Brian Foster wrote:
> > > > On Fri, Nov 01, 2019 at 10:45:59AM +1100, Dave Chinner wrote:
> > > > > From: Dave Chinner <[email protected]>
> > > > >
> > > > > Introduce a mechanism for ->count_objects() to indicate to the
> > > > > shrinker infrastructure that the reclaim context will not allow
> > > > > scanning work to be done and so the work it decides is necessary
> > > > > needs to be deferred.
> > > > >
> > > > > This simplifies the code by separating out the accounting of
> > > > > deferred work from the actual doing of the work, and allows better
> > > > > decisions to be made by the shrinker control logic on what action it
> > > > > can take.
> > > > >
> > > > > Signed-off-by: Dave Chinner <[email protected]>
> > > > > ---
> > > >
> > > > My understanding from the previous discussion(s) is that this is not
> > > > tied directly to the gfp mask because that is not the only intended use.
> > > > While it is currently a boolean tied to the entire shrinker call,
> > > > the longer term objective is per-object granularity.
> > >
> > > Longer term, yes, but right now such things are not possible as the
> > > shrinker needs more context to be able to make sane per-object
> > > decisions. Shrinker policy decisions that affect the entire run
> > > scope should be handled by the ->count operation - it's the one that
> > > says whether the scan loop should run or not, and right now GFP_NOFS
> > > for all filesystem shrinkers is a pure boolean policy
> > > implementation.
> > >
> > > The next future step is to provide a superblock context with
> > > GFP_NOFS to indicate which filesystem we cannot recurse into. That
> > > is also a shrinker instance wide check, so again it's something that
> > > ->count should be deciding.
> > >
> > > i.e. ->count determines what is to be done, ->scan iterates the work
> > > that has to be done until we are done.
> > >
> >
> > Sure, makes sense in general.
> >
> > > > I find the argument reasonable enough, but if the above is true, why do
> > > > we move these checks from ->scan_objects() to ->count_objects() (in the
> > > > next patch) when per-object decisions will ultimately need to be made by
> > > > the former?
> > >
> > > Because run/no-run policy belongs in one place, and things like
> > > GFP_NOFS do not change across calls to the ->scan loop. i.e. after
> > > the first ->scan call in a loop that calls it hundreds to thousands
> > > of times, the GFP_NOFS run/no-run check is completely redundant.
> > >
> >
> > What loop is currently called hundreds to thousands of times that this
> > change prevents? AFAICT the current nofs checks in the ->scan calls
> > explicitly terminate the scan loop.
>
> Right, but when we are in GFP_KERNEL context, every call to ->scan()
> checks it and says "ok". If we are scanning tens of thousands of
> objects in a scan, and we are using a default batch size of 128
> objects per scan, then we have hundreds of calls in a single scan
> loop that check the GFP context and say "ok"....
>
> > So we're effectively saving a
> > function call by doing this earlier in the ->count call. (Nothing wrong
> > with that, I'm just not following the numbers used in this reasoning..).
>
> It's the don't terminate case. :)
>

Oh, I see. You're talking about the number of executions of the gfp
check itself. That makes sense, though my understanding is that we'll
ultimately have a similar check anyway if we want per-object
granularity based on the allocation constraints of the current context.
OTOH, the check would still occur only once with an alloc flags field in
the shrinker structure too, FWIW.
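
For reference, a purely hypothetical sketch of that alternative (the
required_gfp field and helper below don't exist anywhere; they're only
meant to show where a single registration-time constraint check could
sit in the core, before either callback is invoked):

#include <linux/shrinker.h>
#include <linux/gfp.h>

struct example_shrinker {
	struct shrinker shrink;
	gfp_t required_gfp;	/* e.g. __GFP_FS for fs shrinkers */
};

static bool example_shrinker_may_run(struct example_shrinker *es,
				     struct shrink_control *sc)
{
	/*
	 * One core-side check per invocation: if the reclaim context
	 * doesn't provide the allocation capabilities this shrinker
	 * declared it needs, skip (or defer) the whole call.
	 */
	return (sc->gfp_mask & es->required_gfp) == es->required_gfp;
}

Whether that is clearer than keeping the check in the shrinker
implementation itself is exactly the trade-off discussed below.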

> > > Once we introduce a new policy that allows the fs shrinker to do
> > > careful reclaim in GFP_NOFS conditions, we need to do a substantial
> > > rework of the shrinker scan loop and how it accounts the work that is
> > > done - we now have at least 3 or 4 different return counters
> > > (skipped because locked, skipped because referenced,
> > > reclaimed, deferred reclaim because couldn't lock/recursion) and
> > > the accounting and decisions to be made are a lot more complex.
> > >
> >
> > Yeah, that's generally what I expected from your previous description.
> >
> > > In that case, the ->count function will drop the GFP_NOFS check, but
> > > still do all the other things it needs to do. The GFP_NOFS check
> > > will go deep in the guts of the shrinker scan implementation where
> > > the per-object recursion problem exists. But for most shrinkers,
> > > it's still going to be a global boolean check...
> > >
> >
> > So once the nofs checks are lifted out of the ->count callback and into
> > the core shrinker, is there still a use case to defer an entire ->count
> > instance from the callback?
>
> Not right now. There may be in future, but I don't want to make
> things more complex than they need to be by trying to support
> functionality that isn't used.
>

Ok, but do note that the reason I ask is to touch on simply whether it's
worth putting this in the ->scan callback at all. It's not like _not_
doing that is some big complexity adjustment. ;)

> > > If people want to call avoiding repeated, unnecessary evaluation of
> > > the same condition hundreds of times instead of once "unnecessary
> > > churn", then I'll drop it.
> > >
> >
> > I'm not referring to the functional change as churn. What I was
> > referring to is that we're shuffling around the boilerplate gfp checking
> > code between the different shrinker callbacks, knowing that it's
> > eventually going to be lifted out, when we could potentially just lift
> > that code up a level now.
>
> I don't think that lifting it up will save much code at all, once we
> add all the gfp mask initialisation to all the shrinkers, etc. It
> just means we can't look at the shrinker implementation and know
> that it can't run in GFP_NOFS context - we have to go look up
> where it is instantiated instead to see if there are gfp context
> constraints.
>
> I think it's better where it is, documenting the constraints the
> shrinker implementation runs under in the implementation itself...
>

Fair enough.. I don't necessarily agree that this is the best approach,
but the implementation is reasonable enough that I certainly don't
object to it (provided the fragility nits are addressed) and I don't
feel particularly tied to the suggested alternative. At the end of the
day this isn't a lot of code and it's not difficult to change (and it
probably will). I just wanted to make sure the alternative was fairly
considered and to test the reasoning for the approach a bit. I'll
move along from this topic on review of the next version...

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>