2011-06-02 07:01:28

by Dave Chinner

Subject: [PATCH 0/12] Per superblock cache reclaim

This series converts the VFS cache shrinkers to a per-superblock
shrinker, and provides a callout from the superblock shrinker to
allow the filesystem to shrink internal caches proportionally to the
amount of reclaim done to the VFS caches.

The motivation for this work is that the VFS caches are dependent
caches - dentries pin inodes, and inodes often pin other filesystem
specific structures. The caches can grow quite large and it is easy
for them to get unbalanced when they are shrunk independently.

Reclaim is also focussed on spreading each reclaim batch across all
superblocks rather than concentrating it within a superblock, so
reclaim calls often remove only a few objects from each superblock
at a time. This means that we touch lots of superblocks and LRUs on
every shrinker call, and we have to traverse the superblock list
every time.

This leads to life-cycle issues - we have to ensure that the
superblock we are trying to work on is active and won't go away, and
also ensure that the unmount process synchronises correctly with
active shrinkers. This is complex, and the locks involved cause
lockdep to regularly report false-positive lock inversions.

Before that can happen, however, there are several longstanding
bugs in the VM shrinker infrastructure that need to be fixed.
Firstly, we need to add tracepoints so we can observe the behaviour
of the shrinker calculations. Secondly, the shrinker scan
calculations are not SMP safe, which causes shrinkers to either miss
work they should be doing or do a lot more work than they should.

With these fixes in place, I found the reason that I was not able to
balance system behaviour on my first attempt at per-sb shrinkers.
When a shrinker repeatedly returns "-1" to avoid deadlocks, as will
happen when a filesystem is doing GFP_NOFS memory allocations
during transactions (and that happens *a lot* during filesystem
intensive workloads), the work is deferred by adding it to
shrinker->nr for the next shrinker call to do.

This causes the shrinker->nr to increase until it is 2x the number
of objects in the cache, and so when the shrinker is finally able to
do work, it is effectively told to shrink the entire cache to zero.
Twice over. You'll never guess how I found it - the tracepoints I
added, perhaps? This problem is fixed by only allowing the
shrinker->nr to wind up to half the size of the cache when there are
lots of little additions caused by deadlock avoidance. This is
sufficient to maintain current levels of performance whilst avoiding
the cache trashing problem.
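
In code terms the fix (patch 3 below) boils down to a single clamp
in shrink_slab(), shown here out of context as a sketch:

	/*
	 * Only let the deferred count wind up to a large value when the
	 * delta calculated for this call is itself large; otherwise cap
	 * the total scan at half the number of objects in the cache.
	 */
	if (delta < max_pass / 4)
		total_scan = min(total_scan, max_pass / 2);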

So, back to the VFS cache shrinkers. To avoid all the above
problems, we can use the per-shrinker context infrastructure that
was introduced recently for XFS. By adding a shrinker context to
each superblock, registering the shrinker after the superblock is
created, and unregistering it early in the unmount process, we avoid
the need for any specific synchronisation between the shrinker and
the unmount process. Goodbye iprune_sem.
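
In outline, the register/unregister placement looks like this (taken
from the superblock shrinker patch later in the series, with all the
surrounding code elided):

	/* alloc_super(): set up the shrinker with the superblock */
	s->s_shrink.seeks = DEFAULT_SEEKS;
	s->s_shrink.shrink = prune_super;

	/* sget(): register once the superblock is fully set up */
	register_shrinker(&s->s_shrink);

	/*
	 * generic_shutdown_super(): unregister first so nothing can race
	 * with the cache teardown during unmount
	 */
	unregister_shrinker(&sb->s_shrink);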

Further, by having per-superblock shrinker callouts, we no longer
need to walk the superblock list on every shrinker call for both the
dentry and inode caches, nor do we need to proportion reclaim
between superblocks. That simplifies the cache shrinking
implementation significantly.

However, to take advantage of this, the first thing we need to do is
convert the inode cache LRU to a per-superblock LRU. This is trivial
to do - it's just a copy of the dentry cache infrastructure. The
global inode LRU lock can be trivially converted to a per-superblock
lock as well, so that is done at the same time.
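
The net effect on struct super_block (patches 6 and 7) is a per-sb
LRU list, an object count and a lock to protect them:

	/* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
	spinlock_t	s_inode_lru_lock ____cacheline_aligned_in_smp;
	struct list_head s_inode_lru;		/* unused inode lru */
	int		s_nr_inodes_unused;	/* # of inodes on lru */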

[ Note that it looks like the same change can be made to the dentry
cache LRU, but the simple conversion from the global dcache_lru_lock
to per-sb locks results in occasional, strange ENOENT errors during
path lookups. So that patch is on hold. ]

With a single shrinker - prune_super() - that can address both the
per-sb dentry and inode LRUs, it is a simple matter of proportioning
the reclaim batch between them. The split is based on the ratio of
objects in the two caches, and the dentry cache is pruned first so
that it unpins inodes before the inode cache is pruned.
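
A minimal sketch of that proportioning, following the prune_super()
code added later in the series (the helper name here is illustrative,
and locking and the filesystem-specific callout are omitted):

	static void prune_super_caches(struct super_block *sb, int nr_to_scan)
	{
		int total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
		int dentries = (nr_to_scan * sb->s_nr_dentry_unused) / total;

		/* prune the dcache first as the icache is pinned by it */
		prune_dcache_sb(sb, dentries);
		prune_icache_sb(sb, nr_to_scan - dentries);
	}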

Now that we have prune_super(), reclaiming hundreds of
thousands or millions of dentries and inodes in batches of 128
objects does not make much sense. The VM shrinker infrastructure
uses a batch size of 128 so that it can regularly reschedule if
necessary. The dentry cache pruner already has reschedule checks,
and it is trivial to add them to the VFS and XFS inode cache
pruners. With that done, there is no reason why we can't use a much
larger reclaim batch size and remove more objects from each cache on
each visit to them.

To do this, add a per-shrinker batch size configuration field, and
configure prune_super() to use a larger batch size of 1024
objects. This reduces the number of times we need to redo the scan
calculations and traffic locks and shared structures, and means we
spend more time in cache-specific loops than we would with a smaller
batch size. This reduces the overhead of cache shrinking.
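
Concretely, a shrinker opts in by setting the new field added in
patch 4; for the superblock shrinker that looks something like this
(a sketch of what the later "vfs: increase shrinker batch size"
patch does):

	s->s_shrink.batch = 1024;	/* larger reclaim batch for prune_super() */

and shrink_slab() then uses it in place of the hard-coded default:

	long batch_size = shrinker->batch ? shrinker->batch : SHRINK_BATCH;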

Overall, the changes result in steady state cache ratios on XFS,
ext4 and btrfs of 1 dentry : 3 inodes. The steady state ratio is
1 in-use inode : 2 free inodes (the in-use inode is pinned by the
dentry). The following chart demonstrates ext4 (left) and btrfs
(right) cache ratios under steady state 8-way file creation
conditions.

http://userweb.kernel.org/~dgc/shrinker/ext4-btrfs-cache-ratio.png

For XFS, however, the situation is slightly more complex. XFS
maintains its own inode cache (the VFS inode cache is a subset of
the XFS cache), and so needs to be able to keep that cache
synchronised with the VFS caches. Hence a filesystem-specific
callout, proportioned with the VFS dentry and inode caches, is added
to the superblock pruning method. Implementing these methods is
optional, and this is done for XFS in the last patch in the series.
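
The callouts are ordinary super_operations methods (added in patch
10). A filesystem that wants them wires them up roughly like this -
a sketch only, with the xfs_fs_* names standing in for whatever the
filesystem implements rather than quoting the final XFS patch:

	static const struct super_operations xfs_super_operations = {
		...
		.nr_cached_objects	= xfs_fs_nr_cached_objects,
		.free_cached_objects	= xfs_fs_free_cached_objects,
	};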

XFS behaviour at different stages of the patch series can be seen in
the following chart:

http://userweb.kernel.org/~dgc/shrinker/per-sb-shrinker-comparison.png

The left-most traces are from a kernel with just the VM
shrink_slab() fixes. The middle trace is the same 8-way create
workload, but with the inode cache LRU changes and the per-sb
superblock shrinker addressing just the VFS dentry and inode
caches. The right-most (partial) workload trace is the full series
with the XFS inode cache shrinker being called from prune_super().

You can see from the top chart that the cache behaviour has much
less variance in the middle trace with the per-sb shrinkers compared
to the left-most trace. Also, you can see that the XFS inode cache
size follows the VFS inode cache residency much more closely in the
right-most trace as a result of using the prune_super() filesystem
callout.

Yes, these XFS traces are much more variable than the ext4 and btrfs
charts, but XFS is putting significantly more pressure on the caches
and most allocations are GFP_NOFS, hence triggering the wind-up
problems described above. It is, however, much better behaved than
the existing shrinker code (which is worse even than the left-most
trace with the VM fixes) and much better than the previous (aborted)
per-sb shrinker attempts:

http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-per-sb-basic-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-per-sb-balance-16x500-xfs.png
http://userweb.kernel.org/~dgc/shrinker-2.6.36/fs_mark-2.6.35-rc4-per-sb-proportional-16x500-xfs.png

---

The following changes since commit c7427d23f7ed695ac226dbe3a84d7f19091d34ce:

autofs4: bogus dentry_unhash() added in ->unlink() (2011-05-30 01:50:53 -0400)

are available in the git repository at:
git://git.kernel.org/pub/scm/linux/people/dgc/xfsdev.git per-sb-shrinker

Dave Chinner (12):
vmscan: add shrink_slab tracepoints
vmscan: shrinker->nr updates race and go wrong
vmscan: reduce wind up shrinker->nr when shrinker can't do work
vmscan: add customisable shrinker batch size
inode: convert inode_stat.nr_unused to per-cpu counters
inode: Make unused inode LRU per superblock
inode: move to per-sb LRU locks
superblock: introduce per-sb cache shrinker infrastructure
inode: remove iprune_sem
superblock: add filesystem shrinker operations
vfs: increase shrinker batch size
xfs: make use of new shrinker callout for the inode cache

Documentation/filesystems/vfs.txt | 21 ++++++
fs/dcache.c | 121 ++++--------------------------------
fs/inode.c | 124 ++++++++++++-------------------------
fs/super.c | 79 +++++++++++++++++++++++-
fs/xfs/linux-2.6/xfs_super.c | 26 +++++---
fs/xfs/linux-2.6/xfs_sync.c | 71 ++++++++-------------
fs/xfs/linux-2.6/xfs_sync.h | 5 +-
include/linux/fs.h | 14 ++++
include/linux/mm.h | 1 +
include/trace/events/vmscan.h | 71 +++++++++++++++++++++
mm/vmscan.c | 70 ++++++++++++++++-----
11 files changed, 337 insertions(+), 266 deletions(-)


2011-06-02 07:01:45

by Dave Chinner

Subject: [PATCH 01/12] vmscan: add shrink_slab tracepoints

From: Dave Chinner <[email protected]>

It is impossible to understand what the shrinkers are actually doing
without instrumenting the code, so add some tracepoints to allow
insight to be gained.

Signed-off-by: Dave Chinner <[email protected]>
---
include/trace/events/vmscan.h | 67 +++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 6 +++-
2 files changed, 72 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index ea422aa..c798cd7 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -310,6 +310,73 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
);

+TRACE_EVENT(mm_shrink_slab_start,
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+ unsigned long pgs_scanned, unsigned long lru_pgs,
+ unsigned long cache_items, unsigned long long delta,
+ unsigned long total_scan),
+
+ TP_ARGS(shr, sc, pgs_scanned, lru_pgs, cache_items, delta, total_scan),
+
+ TP_STRUCT__entry(
+ __field(struct shrinker *, shr)
+ __field(long, shr_nr)
+ __field(gfp_t, gfp_flags)
+ __field(unsigned long, pgs_scanned)
+ __field(unsigned long, lru_pgs)
+ __field(unsigned long, cache_items)
+ __field(unsigned long long, delta)
+ __field(unsigned long, total_scan)
+ ),
+
+ TP_fast_assign(
+ __entry->shr = shr;
+ __entry->shr_nr = shr->nr;
+ __entry->gfp_flags = sc->gfp_mask;
+ __entry->pgs_scanned = pgs_scanned;
+ __entry->lru_pgs = lru_pgs;
+ __entry->cache_items = cache_items;
+ __entry->delta = delta;
+ __entry->total_scan = total_scan;
+ ),
+
+ TP_printk("shrinker %p: nr %ld gfp_flags %s pgs_scanned %ld lru_pgs %ld cache items %ld delta %lld total_scan %ld",
+ __entry->shr,
+ __entry->shr_nr,
+ show_gfp_flags(__entry->gfp_flags),
+ __entry->pgs_scanned,
+ __entry->lru_pgs,
+ __entry->cache_items,
+ __entry->delta,
+ __entry->total_scan)
+);
+
+TRACE_EVENT(mm_shrink_slab_end,
+ TP_PROTO(struct shrinker *shr, int shrinker_ret,
+ unsigned long total_scan),
+
+ TP_ARGS(shr, shrinker_ret, total_scan),
+
+ TP_STRUCT__entry(
+ __field(struct shrinker *, shr)
+ __field(long, shr_nr)
+ __field(int, shrinker_ret)
+ __field(unsigned long, total_scan)
+ ),
+
+ TP_fast_assign(
+ __entry->shr = shr;
+ __entry->shr_nr = shr->nr;
+ __entry->shrinker_ret = shrinker_ret;
+ __entry->total_scan = total_scan;
+ ),
+
+ TP_printk("shrinker %p: nr %ld total_scan %ld return val %d",
+ __entry->shr,
+ __entry->shr_nr,
+ __entry->total_scan,
+ __entry->shrinker_ret)
+);

#endif /* _TRACE_VMSCAN_H */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index faa0a08..48e3fbd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -250,6 +250,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
unsigned long long delta;
unsigned long total_scan;
unsigned long max_pass;
+ int shrink_ret = 0;

max_pass = do_shrinker_shrink(shrinker, shrink, 0);
delta = (4 * nr_pages_scanned) / shrinker->seeks;
@@ -274,9 +275,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
total_scan = shrinker->nr;
shrinker->nr = 0;

+ trace_mm_shrink_slab_start(shrinker, shrink, nr_pages_scanned,
+ lru_pages, max_pass, delta, total_scan);
+
while (total_scan >= SHRINK_BATCH) {
long this_scan = SHRINK_BATCH;
- int shrink_ret;
int nr_before;

nr_before = do_shrinker_shrink(shrinker, shrink, 0);
@@ -293,6 +296,7 @@ unsigned long shrink_slab(struct shrink_control *shrink,
}

shrinker->nr += total_scan;
+ trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan);
}
up_read(&shrinker_rwsem);
out:
--
1.7.5.1

2011-06-02 07:01:22

by Dave Chinner

Subject: [PATCH 02/12] vmscan: shrinker->nr updates race and go wrong

From: Dave Chinner <[email protected]>

shrink_slab() allows shrinkers to be called in parallel so the
struct shrinker can be updated concurrently. It does not provide any
exclusion for such updates, so we can get the shrinker->nr value
increasing or decreasing incorrectly.

As a result, when a shrinker repeatedly returns a value of -1 (e.g.
a VFS shrinker called w/ GFP_NOFS), the shrinker->nr goes haywire,
sometimes updating with the scan count that wasn't used, sometimes
losing it altogether. Worse is when a shrinker does work and that
update is lost due to racy updates, which means the shrinker will do
the work again!

Fix this by making the total_scan calculations independent of
shrinker->nr, and making the shrinker->nr updates atomic w.r.t.
other updates via cmpxchg loops.

Signed-off-by: Dave Chinner <[email protected]>
---
include/trace/events/vmscan.h | 26 ++++++++++++++----------
mm/vmscan.c | 43 ++++++++++++++++++++++++++++++----------
2 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c798cd7..6147b4e 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -311,12 +311,13 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
);

TRACE_EVENT(mm_shrink_slab_start,
- TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
+ TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long shr_nr,
unsigned long pgs_scanned, unsigned long lru_pgs,
unsigned long cache_items, unsigned long long delta,
unsigned long total_scan),

- TP_ARGS(shr, sc, pgs_scanned, lru_pgs, cache_items, delta, total_scan),
+ TP_ARGS(shr, sc, shr_nr, pgs_scanned, lru_pgs,
+ cache_items, delta, total_scan),

TP_STRUCT__entry(
__field(struct shrinker *, shr)
@@ -331,7 +332,7 @@ TRACE_EVENT(mm_shrink_slab_start,

TP_fast_assign(
__entry->shr = shr;
- __entry->shr_nr = shr->nr;
+ __entry->shr_nr = shr_nr;
__entry->gfp_flags = sc->gfp_mask;
__entry->pgs_scanned = pgs_scanned;
__entry->lru_pgs = lru_pgs;
@@ -353,27 +354,30 @@ TRACE_EVENT(mm_shrink_slab_start,

TRACE_EVENT(mm_shrink_slab_end,
TP_PROTO(struct shrinker *shr, int shrinker_ret,
- unsigned long total_scan),
+ long old_nr, long new_nr),

- TP_ARGS(shr, shrinker_ret, total_scan),
+ TP_ARGS(shr, shrinker_ret, old_nr, new_nr),

TP_STRUCT__entry(
__field(struct shrinker *, shr)
- __field(long, shr_nr)
+ __field(long, old_nr)
+ __field(long, new_nr)
__field(int, shrinker_ret)
- __field(unsigned long, total_scan)
+ __field(long, total_scan)
),

TP_fast_assign(
__entry->shr = shr;
- __entry->shr_nr = shr->nr;
+ __entry->old_nr = old_nr;
+ __entry->new_nr = new_nr;
__entry->shrinker_ret = shrinker_ret;
- __entry->total_scan = total_scan;
+ __entry->total_scan = new_nr - old_nr;
),

- TP_printk("shrinker %p: nr %ld total_scan %ld return val %d",
+ TP_printk("shrinker %p: old_nr %ld new_nr %ld total_scan %ld return val %d",
__entry->shr,
- __entry->shr_nr,
+ __entry->old_nr,
+ __entry->new_nr,
__entry->total_scan,
__entry->shrinker_ret)
);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48e3fbd..dce2767 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink,
unsigned long total_scan;
unsigned long max_pass;
int shrink_ret = 0;
+ long nr;
+ long new_nr;

+ /*
+ * copy the current shrinker scan count into a local variable
+ * and zero it so that other concurrent shrinker invocations
+ * don't also do this scanning work.
+ */
+ do {
+ nr = shrinker->nr;
+ } while (cmpxchg(&shrinker->nr, nr, 0) != nr);
+
+ total_scan = nr;
max_pass = do_shrinker_shrink(shrinker, shrink, 0);
delta = (4 * nr_pages_scanned) / shrinker->seeks;
delta *= max_pass;
do_div(delta, lru_pages + 1);
- shrinker->nr += delta;
- if (shrinker->nr < 0) {
+ total_scan += delta;
+ if (total_scan < 0) {
printk(KERN_ERR "shrink_slab: %pF negative objects to "
"delete nr=%ld\n",
- shrinker->shrink, shrinker->nr);
- shrinker->nr = max_pass;
+ shrinker->shrink, total_scan);
+ total_scan = max_pass;
}

/*
@@ -269,13 +281,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
* never try to free more than twice the estimate number of
* freeable entries.
*/
- if (shrinker->nr > max_pass * 2)
- shrinker->nr = max_pass * 2;
+ if (total_scan > max_pass * 2)
+ total_scan = max_pass * 2;

- total_scan = shrinker->nr;
- shrinker->nr = 0;

- trace_mm_shrink_slab_start(shrinker, shrink, nr_pages_scanned,
+ trace_mm_shrink_slab_start(shrinker, shrink, nr, nr_pages_scanned,
lru_pages, max_pass, delta, total_scan);

while (total_scan >= SHRINK_BATCH) {
@@ -295,8 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink,
cond_resched();
}

- shrinker->nr += total_scan;
- trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan);
+ /*
+ * move the unused scan count back into the shrinker in a
+ * manner that handles concurrent updates. If we exhausted the
+ * scan, there is no need to do an update.
+ */
+ do {
+ nr = shrinker->nr;
+ new_nr = total_scan + nr;
+ if (total_scan <= 0)
+ break;
+ } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
+
+ trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
}
up_read(&shrinker_rwsem);
out:
--
1.7.5.1

2011-06-02 07:01:53

by Dave Chinner

Subject: [PATCH 03/12] vmscan: reduce wind up shrinker->nr when shrinker can't do work

From: Dave Chinner <[email protected]>

When a shrinker returns -1 to shrink_slab() to indicate it cannot do
any work given the current memory reclaim requirements, it adds the
entire total_scan count to shrinker->nr. The idea behind this is that
when the shrinker is next called and can do work, it will do the work
of the previously aborted shrinker call as well.

However, if a filesystem is doing lots of allocation with GFP_NOFS
set, then we get many, many more aborts from the shrinkers than we
do successful calls. The result is that shrinker->nr winds up to
its maximum permissible value (twice the current cache size) and
then when the next shrinker call that can do work is issued, it
has enough scan count built up to free the entire cache twice over.

This manifests itself in the cache going from full to empty in a
matter of seconds, even when only a small part of the cache is
needed to be emptied to free sufficient memory.

Under metadata-intensive workloads on ext4 and XFS, I'm seeing the
VFS caches increase memory consumption up to 75% of memory (no page
cache pressure) over a period of 30-60s, and then the shrinker
empties them down to zero in the space of 2-3s. This cycle repeats
over and over again, with the shrinker completely trashing the inode
and dentry caches every minute or so for as long as the workload
continues.

This behaviour was made obvious by the shrink_slab tracepoints added
earlier in the series, and made worse by the patch that corrected
the concurrent accounting of shrinker->nr.

To avoid this problem, stop repeated small increments of the total
scan value from winding shrinker->nr up to a value that can cause
the entire cache to be freed. We still need to allow it to wind up,
so use the delta as the "large scan" threshold check - if the delta
is more than a quarter of the entire cache size, then it is a large
scan and is allowed to cause lots of windup because we clearly need
to free lots of memory.

If it isn't a large scan then limit the total scan to half the size
of the cache so that windup never increases to consume the whole
cache. Reducing the total scan limit further does not allow enough
wind-up to maintain the current levels of performance, whilst a
higher threshold does not prevent the windup from freeing the entire
cache under sustained workloads.

Signed-off-by: Dave Chinner <[email protected]>
---
mm/vmscan.c | 14 ++++++++++++++
1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dce2767..3688f47 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -277,6 +277,20 @@ unsigned long shrink_slab(struct shrink_control *shrink,
}

/*
+ * Avoid excessive windup on filesystem shrinkers due to large
+ * numbers of GFP_NOFS allocations causing the shrinkers to
+ * return -1 all the time. This results in a large nr being
+ * built up so when a shrink that can do some work comes along
+ * it empties the entire cache due to nr >>> max_pass. This is
+ * bad for sustaining a working set in memory.
+ *
+ * Hence only allow nr to go large when a large delta is
+ * calculated.
+ */
+ if (delta < max_pass / 4)
+ total_scan = min(total_scan, max_pass / 2);
+
+ /*
* Avoid risking looping forever due to too large nr value:
* never try to free more than twice the estimate number of
* freeable entries.
--
1.7.5.1

2011-06-02 07:01:51

by Dave Chinner

Subject: [PATCH 04/12] vmscan: add customisable shrinker batch size

From: Dave Chinner <[email protected]>

For shrinkers that have their own cond_resched* calls, having
shrink_slab break the work down into small batches is not
particularly efficient. Add a custom batch size field to the struct
shrinker so that shrinkers can use a larger batch size if they
desire.

A value of zero (uninitialised) means "use the default", so
behaviour is unchanged by this patch.

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/mm.h | 1 +
mm/vmscan.c | 11 ++++++-----
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9670f71..9b9777a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1150,6 +1150,7 @@ struct shrink_control {
struct shrinker {
int (*shrink)(struct shrinker *, struct shrink_control *sc);
int seeks; /* seeks to recreate an obj */
+ long batch; /* reclaim batch size, 0 = default */

/* These are for internal use */
struct list_head list;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3688f47..a17909f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -253,6 +253,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
int shrink_ret = 0;
long nr;
long new_nr;
+ long batch_size = shrinker->batch ? shrinker->batch
+ : SHRINK_BATCH;

/*
* copy the current shrinker scan count into a local variable
@@ -302,19 +304,18 @@ unsigned long shrink_slab(struct shrink_control *shrink,
trace_mm_shrink_slab_start(shrinker, shrink, nr, nr_pages_scanned,
lru_pages, max_pass, delta, total_scan);

- while (total_scan >= SHRINK_BATCH) {
- long this_scan = SHRINK_BATCH;
+ while (total_scan >= batch_size) {
int nr_before;

nr_before = do_shrinker_shrink(shrinker, shrink, 0);
shrink_ret = do_shrinker_shrink(shrinker, shrink,
- this_scan);
+ batch_size);
if (shrink_ret == -1)
break;
if (shrink_ret < nr_before)
ret += nr_before - shrink_ret;
- count_vm_events(SLABS_SCANNED, this_scan);
- total_scan -= this_scan;
+ count_vm_events(SLABS_SCANNED, batch_size);
+ total_scan -= batch_size;

cond_resched();
}
--
1.7.5.1

2011-06-02 07:03:54

by Dave Chinner

Subject: [PATCH 05/12] inode: convert inode_stat.nr_unused to per-cpu counters

From: Dave Chinner <[email protected]>

Before we split up the inode_lru_lock, the unused inode counter
needs to be made independent of the global inode_lru_lock. Convert
it to per-cpu counters to do this.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 0f7e88a..17fea5b 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -95,6 +95,7 @@ EXPORT_SYMBOL(empty_aops);
struct inodes_stat_t inodes_stat;

static DEFINE_PER_CPU(unsigned int, nr_inodes);
+static DEFINE_PER_CPU(unsigned int, nr_unused);

static struct kmem_cache *inode_cachep __read_mostly;

@@ -109,7 +110,11 @@ static int get_nr_inodes(void)

static inline int get_nr_inodes_unused(void)
{
- return inodes_stat.nr_unused;
+ int i;
+ int sum = 0;
+ for_each_possible_cpu(i)
+ sum += per_cpu(nr_unused, i);
+ return sum < 0 ? 0 : sum;
}

int get_nr_dirty_inodes(void)
@@ -127,6 +132,7 @@ int proc_nr_inodes(ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
inodes_stat.nr_inodes = get_nr_inodes();
+ inodes_stat.nr_unused = get_nr_inodes_unused();
return proc_dointvec(table, write, buffer, lenp, ppos);
}
#endif
@@ -340,7 +346,7 @@ static void inode_lru_list_add(struct inode *inode)
spin_lock(&inode_lru_lock);
if (list_empty(&inode->i_lru)) {
list_add(&inode->i_lru, &inode_lru);
- inodes_stat.nr_unused++;
+ this_cpu_inc(nr_unused);
}
spin_unlock(&inode_lru_lock);
}
@@ -350,7 +356,7 @@ static void inode_lru_list_del(struct inode *inode)
spin_lock(&inode_lru_lock);
if (!list_empty(&inode->i_lru)) {
list_del_init(&inode->i_lru);
- inodes_stat.nr_unused--;
+ this_cpu_dec(nr_unused);
}
spin_unlock(&inode_lru_lock);
}
@@ -649,7 +655,7 @@ static void prune_icache(int nr_to_scan)
(inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
- inodes_stat.nr_unused--;
+ this_cpu_dec(nr_unused);
continue;
}

@@ -686,7 +692,7 @@ static void prune_icache(int nr_to_scan)
spin_unlock(&inode->i_lock);

list_move(&inode->i_lru, &freeable);
- inodes_stat.nr_unused--;
+ this_cpu_dec(nr_unused);
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
--
1.7.5.1

2011-06-02 07:01:26

by Dave Chinner

Subject: [PATCH 06/12] inode: Make unused inode LRU per superblock

From: Dave Chinner <[email protected]>

The inode unused list is currently a global LRU. This does not match
the other global filesystem cache - the dentry cache - which uses
per-superblock LRU lists. Hence we have related filesystem object
types using different LRU reclamation schemes.

To enable a per-superblock filesystem cache shrinker, both of these
caches need to have per-sb unused object LRU lists. Hence this patch
converts the global inode LRU to per-sb LRUs.

The patch only does rudimentary per-sb proportioning in the shrinker
infrastructure, as this gets removed when the per-sb shrinker
callouts are introduced later on.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 91 +++++++++++++++++++++++++++++++++++++++++++++------
fs/super.c | 1 +
include/linux/fs.h | 4 ++
3 files changed, 85 insertions(+), 11 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 17fea5b..e039115 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -34,7 +34,7 @@
* inode->i_lock protects:
* inode->i_state, inode->i_hash, __iget()
* inode_lru_lock protects:
- * inode_lru, inode->i_lru
+ * inode->i_sb->s_inode_lru, inode->i_lru
* inode_sb_list_lock protects:
* sb->s_inodes, inode->i_sb_list
* inode_wb_list_lock protects:
@@ -64,7 +64,6 @@ static unsigned int i_hash_shift __read_mostly;
static struct hlist_head *inode_hashtable __read_mostly;
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);

-static LIST_HEAD(inode_lru);
static DEFINE_SPINLOCK(inode_lru_lock);

__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
@@ -345,7 +344,8 @@ static void inode_lru_list_add(struct inode *inode)
{
spin_lock(&inode_lru_lock);
if (list_empty(&inode->i_lru)) {
- list_add(&inode->i_lru, &inode_lru);
+ list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
+ inode->i_sb->s_nr_inodes_unused++;
this_cpu_inc(nr_unused);
}
spin_unlock(&inode_lru_lock);
@@ -356,6 +356,7 @@ static void inode_lru_list_del(struct inode *inode)
spin_lock(&inode_lru_lock);
if (!list_empty(&inode->i_lru)) {
list_del_init(&inode->i_lru);
+ inode->i_sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
}
spin_unlock(&inode_lru_lock);
@@ -621,21 +622,20 @@ static int can_unuse(struct inode *inode)
* LRU does not have strict ordering. Hence we don't want to reclaim inodes
* with this flag set because they are the inodes that are out of order.
*/
-static void prune_icache(int nr_to_scan)
+static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
{
LIST_HEAD(freeable);
int nr_scanned;
unsigned long reap = 0;

- down_read(&iprune_sem);
spin_lock(&inode_lru_lock);
- for (nr_scanned = 0; nr_scanned < nr_to_scan; nr_scanned++) {
+ for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
struct inode *inode;

- if (list_empty(&inode_lru))
+ if (list_empty(&sb->s_inode_lru))
break;

- inode = list_entry(inode_lru.prev, struct inode, i_lru);
+ inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);

/*
* we are inverting the inode_lru_lock/inode->i_lock here,
@@ -643,7 +643,7 @@ static void prune_icache(int nr_to_scan)
* inode to the back of the list so we don't spin on it.
*/
if (!spin_trylock(&inode->i_lock)) {
- list_move(&inode->i_lru, &inode_lru);
+ list_move(&inode->i_lru, &sb->s_inode_lru);
continue;
}

@@ -655,6 +655,7 @@ static void prune_icache(int nr_to_scan)
(inode->i_state & ~I_REFERENCED)) {
list_del_init(&inode->i_lru);
spin_unlock(&inode->i_lock);
+ sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
continue;
}
@@ -662,7 +663,7 @@ static void prune_icache(int nr_to_scan)
/* recently referenced inodes get one more pass */
if (inode->i_state & I_REFERENCED) {
inode->i_state &= ~I_REFERENCED;
- list_move(&inode->i_lru, &inode_lru);
+ list_move(&inode->i_lru, &sb->s_inode_lru);
spin_unlock(&inode->i_lock);
continue;
}
@@ -676,7 +677,7 @@ static void prune_icache(int nr_to_scan)
iput(inode);
spin_lock(&inode_lru_lock);

- if (inode != list_entry(inode_lru.next,
+ if (inode != list_entry(sb->s_inode_lru.next,
struct inode, i_lru))
continue; /* wrong inode or list_empty */
/* avoid lock inversions with trylock */
@@ -692,6 +693,7 @@ static void prune_icache(int nr_to_scan)
spin_unlock(&inode->i_lock);

list_move(&inode->i_lru, &freeable);
+ sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
}
if (current_is_kswapd())
@@ -699,8 +701,75 @@ static void prune_icache(int nr_to_scan)
else
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&inode_lru_lock);
+ *nr_to_scan = nr_scanned;

dispose_list(&freeable);
+}
+
+static void prune_icache(int count)
+{
+ struct super_block *sb, *p = NULL;
+ int w_count;
+ int unused = inodes_stat.nr_unused;
+ int prune_ratio;
+ int pruned;
+
+ if (unused == 0 || count == 0)
+ return;
+ down_read(&iprune_sem);
+ if (count >= unused)
+ prune_ratio = 1;
+ else
+ prune_ratio = unused / count;
+ spin_lock(&sb_lock);
+ list_for_each_entry(sb, &super_blocks, s_list) {
+ if (list_empty(&sb->s_instances))
+ continue;
+ if (sb->s_nr_inodes_unused == 0)
+ continue;
+ sb->s_count++;
+ /* Now, we reclaim unused dentrins with fairness.
+ * We reclaim them same percentage from each superblock.
+ * We calculate number of dentries to scan on this sb
+ * as follows, but the implementation is arranged to avoid
+ * overflows:
+ * number of dentries to scan on this sb =
+ * count * (number of dentries on this sb /
+ * number of dentries in the machine)
+ */
+ spin_unlock(&sb_lock);
+ if (prune_ratio != 1)
+ w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
+ else
+ w_count = sb->s_nr_inodes_unused;
+ pruned = w_count;
+ /*
+ * We need to be sure this filesystem isn't being unmounted,
+ * otherwise we could race with generic_shutdown_super(), and
+ * end up holding a reference to an inode while the filesystem
+ * is unmounted. So we try to get s_umount, and make sure
+ * s_root isn't NULL.
+ */
+ if (down_read_trylock(&sb->s_umount)) {
+ if ((sb->s_root != NULL) &&
+ (!list_empty(&sb->s_dentry_lru))) {
+ shrink_icache_sb(sb, &w_count);
+ pruned -= w_count;
+ }
+ up_read(&sb->s_umount);
+ }
+ spin_lock(&sb_lock);
+ if (p)
+ __put_super(p);
+ count -= pruned;
+ p = sb;
+ /* more work left to do? */
+ if (count <= 0)
+ break;
+ }
+ if (p)
+ __put_super(p);
+ spin_unlock(&sb_lock);
up_read(&iprune_sem);
}

diff --git a/fs/super.c b/fs/super.c
index c755939..ef7caf7 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,6 +77,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
+ INIT_LIST_HEAD(&s->s_inode_lru);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c55d6b7..a96071d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1393,6 +1393,10 @@ struct super_block {
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */

+ /* inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
+ struct list_head s_inode_lru; /* unused inode lru */
+ int s_nr_inodes_unused; /* # of inodes on lru */
+
struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
--
1.7.5.1

2011-06-02 07:01:58

by Dave Chinner

Subject: [PATCH 07/12] inode: move to per-sb LRU locks

From: Dave Chinner <[email protected]>

With the inode LRUs moving to per-sb structures, there is no longer
a need for a global inode_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesytsems completely from each other.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 27 +++++++++++++--------------
fs/super.c | 1 +
include/linux/fs.h | 3 ++-
3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index e039115..667a29c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -33,7 +33,7 @@
*
* inode->i_lock protects:
* inode->i_state, inode->i_hash, __iget()
- * inode_lru_lock protects:
+ * inode->i_sb->s_inode_lru_lock protects:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode_sb_list_lock protects:
* sb->s_inodes, inode->i_sb_list
@@ -46,7 +46,7 @@
*
* inode_sb_list_lock
* inode->i_lock
- * inode_lru_lock
+ * inode->i_sb->s_inode_lru_lock
*
* inode_wb_list_lock
* inode->i_lock
@@ -64,8 +64,6 @@ static unsigned int i_hash_shift __read_mostly;
static struct hlist_head *inode_hashtable __read_mostly;
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);

-static DEFINE_SPINLOCK(inode_lru_lock);
-
__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);

@@ -342,24 +340,24 @@ EXPORT_SYMBOL(ihold);

static void inode_lru_list_add(struct inode *inode)
{
- spin_lock(&inode_lru_lock);
+ spin_lock(&inode->i_sb->s_inode_lru_lock);
if (list_empty(&inode->i_lru)) {
list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
inode->i_sb->s_nr_inodes_unused++;
this_cpu_inc(nr_unused);
}
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&inode->i_sb->s_inode_lru_lock);
}

static void inode_lru_list_del(struct inode *inode)
{
- spin_lock(&inode_lru_lock);
+ spin_lock(&inode->i_sb->s_inode_lru_lock);
if (!list_empty(&inode->i_lru)) {
list_del_init(&inode->i_lru);
inode->i_sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
}
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&inode->i_sb->s_inode_lru_lock);
}

/**
@@ -608,7 +606,8 @@ static int can_unuse(struct inode *inode)

/*
* Scan `goal' inodes on the unused list for freeable ones. They are moved to a
- * temporary list and then are freed outside inode_lru_lock by dispose_list().
+ * temporary list and then are freed outside sb->s_inode_lru_lock by
+ * dispose_list().
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. If the inode has metadata buffers attached to
@@ -628,7 +627,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
int nr_scanned;
unsigned long reap = 0;

- spin_lock(&inode_lru_lock);
+ spin_lock(&sb->s_inode_lru_lock);
for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
struct inode *inode;

@@ -638,7 +637,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);

/*
- * we are inverting the inode_lru_lock/inode->i_lock here,
+ * we are inverting the sb->s_inode_lru_lock/inode->i_lock here,
* so use a trylock. If we fail to get the lock, just move the
* inode to the back of the list so we don't spin on it.
*/
@@ -670,12 +669,12 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
if (inode_has_buffers(inode) || inode->i_data.nrpages) {
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&sb->s_inode_lru_lock);
if (remove_inode_buffers(inode))
reap += invalidate_mapping_pages(&inode->i_data,
0, -1);
iput(inode);
- spin_lock(&inode_lru_lock);
+ spin_lock(&sb->s_inode_lru_lock);

if (inode != list_entry(sb->s_inode_lru.next,
struct inode, i_lru))
@@ -700,7 +699,7 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
__count_vm_events(KSWAPD_INODESTEAL, reap);
else
__count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&inode_lru_lock);
+ spin_unlock(&sb->s_inode_lru_lock);
*nr_to_scan = nr_scanned;

dispose_list(&freeable);
diff --git a/fs/super.c b/fs/super.c
index ef7caf7..9c3fa1f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,6 +78,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
INIT_LIST_HEAD(&s->s_inode_lru);
+ spin_lock_init(&s->s_inode_lru_lock);
init_rwsem(&s->s_umount);
mutex_init(&s->s_lock);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a96071d..bbd478e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1393,7 +1393,8 @@ struct super_block {
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */

- /* inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
+ /* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
+ spinlock_t s_inode_lru_lock ____cacheline_aligned_in_smp;
struct list_head s_inode_lru; /* unused inode lru */
int s_nr_inodes_unused; /* # of inodes on lru */

--
1.7.5.1

2011-06-02 07:02:57

by Dave Chinner

Subject: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

From: Dave Chinner <[email protected]>

With context based shrinkers, we can implement a per-superblock
shrinker that shrinks the caches attached to the superblock. We
currently have global shrinkers for the inode and dentry caches that
split up into per-superblock operations via a coarse proportioning
method that does not batch very well. The global shrinkers also
have a dependency - dentries pin inodes - so we have to be very
careful about how we register the global shrinkers so that the
implicit call order is always correct.

With a per-sb shrinker callout, we can encode this dependency
directly into the per-sb shrinker, hence avoiding the need for
strictly ordering shrinker registrations. We also have no need for
any proportioning code, as the shrinker subsystem already provides
this functionality across all shrinkers. Allowing the shrinker to
operate on a single superblock at a time means we do fewer superblock
list traversals and take fewer locks, and reclaim should batch more
effectively. This should result in less CPU overhead for reclaim and
potentially faster reclaim of items from each filesystem.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/dcache.c | 121 +++++----------------------------------------------
fs/inode.c | 117 ++++----------------------------------------------
fs/super.c | 55 +++++++++++++++++++++++-
include/linux/fs.h | 7 +++
4 files changed, 82 insertions(+), 218 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 37f72ee..f73ef23 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -720,13 +720,11 @@ static void shrink_dentry_list(struct list_head *list)
*
* If flags contains DCACHE_REFERENCED reference dentries will not be pruned.
*/
-static void __shrink_dcache_sb(struct super_block *sb, int *count, int flags)
+static void __shrink_dcache_sb(struct super_block *sb, int count, int flags)
{
- /* called from prune_dcache() and shrink_dcache_parent() */
struct dentry *dentry;
LIST_HEAD(referenced);
LIST_HEAD(tmp);
- int cnt = *count;

relock:
spin_lock(&dcache_lru_lock);
@@ -754,7 +752,7 @@ relock:
} else {
list_move_tail(&dentry->d_lru, &tmp);
spin_unlock(&dentry->d_lock);
- if (!--cnt)
+ if (!--count)
break;
}
cond_resched_lock(&dcache_lru_lock);
@@ -764,83 +762,22 @@ relock:
spin_unlock(&dcache_lru_lock);

shrink_dentry_list(&tmp);
-
- *count = cnt;
}

/**
- * prune_dcache - shrink the dcache
- * @count: number of entries to try to free
+ * prune_dcache_sb - shrink the dcache
+ * @nr_to_scan: number of entries to try to free
*
- * Shrink the dcache. This is done when we need more memory, or simply when we
- * need to unmount something (at which point we need to unuse all dentries).
+ * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
+ * done when we need more memory and called from the superblock shrinker
+ * function.
*
- * This function may fail to free any resources if all the dentries are in use.
+ * This function may fail to free any resources if all the dentries are in
+ * use.
*/
-static void prune_dcache(int count)
+void prune_dcache_sb(struct super_block *sb, int nr_to_scan)
{
- struct super_block *sb, *p = NULL;
- int w_count;
- int unused = dentry_stat.nr_unused;
- int prune_ratio;
- int pruned;
-
- if (unused == 0 || count == 0)
- return;
- if (count >= unused)
- prune_ratio = 1;
- else
- prune_ratio = unused / count;
- spin_lock(&sb_lock);
- list_for_each_entry(sb, &super_blocks, s_list) {
- if (list_empty(&sb->s_instances))
- continue;
- if (sb->s_nr_dentry_unused == 0)
- continue;
- sb->s_count++;
- /* Now, we reclaim unused dentrins with fairness.
- * We reclaim them same percentage from each superblock.
- * We calculate number of dentries to scan on this sb
- * as follows, but the implementation is arranged to avoid
- * overflows:
- * number of dentries to scan on this sb =
- * count * (number of dentries on this sb /
- * number of dentries in the machine)
- */
- spin_unlock(&sb_lock);
- if (prune_ratio != 1)
- w_count = (sb->s_nr_dentry_unused / prune_ratio) + 1;
- else
- w_count = sb->s_nr_dentry_unused;
- pruned = w_count;
- /*
- * We need to be sure this filesystem isn't being unmounted,
- * otherwise we could race with generic_shutdown_super(), and
- * end up holding a reference to an inode while the filesystem
- * is unmounted. So we try to get s_umount, and make sure
- * s_root isn't NULL.
- */
- if (down_read_trylock(&sb->s_umount)) {
- if ((sb->s_root != NULL) &&
- (!list_empty(&sb->s_dentry_lru))) {
- __shrink_dcache_sb(sb, &w_count,
- DCACHE_REFERENCED);
- pruned -= w_count;
- }
- up_read(&sb->s_umount);
- }
- spin_lock(&sb_lock);
- if (p)
- __put_super(p);
- count -= pruned;
- p = sb;
- /* more work left to do? */
- if (count <= 0)
- break;
- }
- if (p)
- __put_super(p);
- spin_unlock(&sb_lock);
+ __shrink_dcache_sb(sb, nr_to_scan, DCACHE_REFERENCED);
}

/**
@@ -1215,42 +1152,10 @@ void shrink_dcache_parent(struct dentry * parent)
int found;

while ((found = select_parent(parent)) != 0)
- __shrink_dcache_sb(sb, &found, 0);
+ __shrink_dcache_sb(sb, found, 0);
}
EXPORT_SYMBOL(shrink_dcache_parent);

-/*
- * Scan `sc->nr_slab_to_reclaim' dentries and return the number which remain.
- *
- * We need to avoid reentering the filesystem if the caller is performing a
- * GFP_NOFS allocation attempt. One example deadlock is:
- *
- * ext2_new_block->getblk->GFP->shrink_dcache_memory->prune_dcache->
- * prune_one_dentry->dput->dentry_iput->iput->inode->i_sb->s_op->put_inode->
- * ext2_discard_prealloc->ext2_free_blocks->lock_super->DEADLOCK.
- *
- * In this case we return -1 to tell the caller that we baled.
- */
-static int shrink_dcache_memory(struct shrinker *shrink,
- struct shrink_control *sc)
-{
- int nr = sc->nr_to_scan;
- gfp_t gfp_mask = sc->gfp_mask;
-
- if (nr) {
- if (!(gfp_mask & __GFP_FS))
- return -1;
- prune_dcache(nr);
- }
-
- return (dentry_stat.nr_unused / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker dcache_shrinker = {
- .shrink = shrink_dcache_memory,
- .seeks = DEFAULT_SEEKS,
-};
-
/**
* d_alloc - allocate a dcache entry
* @parent: parent of entry to allocate
@@ -3030,8 +2935,6 @@ static void __init dcache_init(void)
*/
dentry_cache = KMEM_CACHE(dentry,
SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|SLAB_MEM_SPREAD);
-
- register_shrinker(&dcache_shrinker);

/* Hash may have been set up in dcache_init_early */
if (!hashdist)
diff --git a/fs/inode.c b/fs/inode.c
index 667a29c..890d95e 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -73,7 +73,7 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);
*
* We don't actually need it to protect anything in the umount path,
* but only need to cycle through it to make sure any inode that
- * prune_icache took off the LRU list has been fully torn down by the
+ * prune_icache_sb took off the LRU list has been fully torn down by the
* time we are past evict_inodes.
*/
static DECLARE_RWSEM(iprune_sem);
@@ -537,7 +537,7 @@ void evict_inodes(struct super_block *sb)
dispose_list(&dispose);

/*
- * Cycle through iprune_sem to make sure any inode that prune_icache
+ * Cycle through iprune_sem to make sure any inode that prune_icache_sb
* moved off the list before we took the lock has been fully torn
* down.
*/
@@ -605,9 +605,10 @@ static int can_unuse(struct inode *inode)
}

/*
- * Scan `goal' inodes on the unused list for freeable ones. They are moved to a
- * temporary list and then are freed outside sb->s_inode_lru_lock by
- * dispose_list().
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. If the inode has metadata buffers attached to
@@ -621,14 +622,15 @@ static int can_unuse(struct inode *inode)
* LRU does not have strict ordering. Hence we don't want to reclaim inodes
* with this flag set because they are the inodes that are out of order.
*/
-static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
+void prune_icache_sb(struct super_block *sb, int nr_to_scan)
{
LIST_HEAD(freeable);
int nr_scanned;
unsigned long reap = 0;

+ down_read(&iprune_sem);
spin_lock(&sb->s_inode_lru_lock);
- for (nr_scanned = *nr_to_scan; nr_scanned >= 0; nr_scanned--) {
+ for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
struct inode *inode;

if (list_empty(&sb->s_inode_lru))
@@ -700,111 +702,11 @@ static void shrink_icache_sb(struct super_block *sb, int *nr_to_scan)
else
__count_vm_events(PGINODESTEAL, reap);
spin_unlock(&sb->s_inode_lru_lock);
- *nr_to_scan = nr_scanned;

dispose_list(&freeable);
-}
-
-static void prune_icache(int count)
-{
- struct super_block *sb, *p = NULL;
- int w_count;
- int unused = inodes_stat.nr_unused;
- int prune_ratio;
- int pruned;
-
- if (unused == 0 || count == 0)
- return;
- down_read(&iprune_sem);
- if (count >= unused)
- prune_ratio = 1;
- else
- prune_ratio = unused / count;
- spin_lock(&sb_lock);
- list_for_each_entry(sb, &super_blocks, s_list) {
- if (list_empty(&sb->s_instances))
- continue;
- if (sb->s_nr_inodes_unused == 0)
- continue;
- sb->s_count++;
- /* Now, we reclaim unused dentrins with fairness.
- * We reclaim them same percentage from each superblock.
- * We calculate number of dentries to scan on this sb
- * as follows, but the implementation is arranged to avoid
- * overflows:
- * number of dentries to scan on this sb =
- * count * (number of dentries on this sb /
- * number of dentries in the machine)
- */
- spin_unlock(&sb_lock);
- if (prune_ratio != 1)
- w_count = (sb->s_nr_inodes_unused / prune_ratio) + 1;
- else
- w_count = sb->s_nr_inodes_unused;
- pruned = w_count;
- /*
- * We need to be sure this filesystem isn't being unmounted,
- * otherwise we could race with generic_shutdown_super(), and
- * end up holding a reference to an inode while the filesystem
- * is unmounted. So we try to get s_umount, and make sure
- * s_root isn't NULL.
- */
- if (down_read_trylock(&sb->s_umount)) {
- if ((sb->s_root != NULL) &&
- (!list_empty(&sb->s_dentry_lru))) {
- shrink_icache_sb(sb, &w_count);
- pruned -= w_count;
- }
- up_read(&sb->s_umount);
- }
- spin_lock(&sb_lock);
- if (p)
- __put_super(p);
- count -= pruned;
- p = sb;
- /* more work left to do? */
- if (count <= 0)
- break;
- }
- if (p)
- __put_super(p);
- spin_unlock(&sb_lock);
up_read(&iprune_sem);
}

-/*
- * shrink_icache_memory() will attempt to reclaim some unused inodes. Here,
- * "unused" means that no dentries are referring to the inodes: the files are
- * not open and the dcache references to those inodes have already been
- * reclaimed.
- *
- * This function is passed the number of inodes to scan, and it returns the
- * total number of remaining possibly-reclaimable inodes.
- */
-static int shrink_icache_memory(struct shrinker *shrink,
- struct shrink_control *sc)
-{
- int nr = sc->nr_to_scan;
- gfp_t gfp_mask = sc->gfp_mask;
-
- if (nr) {
- /*
- * Nasty deadlock avoidance. We may hold various FS locks,
- * and we don't want to recurse into the FS that called us
- * in clear_inode() and friends..
- */
- if (!(gfp_mask & __GFP_FS))
- return -1;
- prune_icache(nr);
- }
- return (get_nr_inodes_unused() / 100) * sysctl_vfs_cache_pressure;
-}
-
-static struct shrinker icache_shrinker = {
- .shrink = shrink_icache_memory,
- .seeks = DEFAULT_SEEKS,
-};
-
static void __wait_on_freeing_inode(struct inode *inode);
/*
* Called with the inode lock held.
@@ -1684,7 +1586,6 @@ void __init inode_init(void)
(SLAB_RECLAIM_ACCOUNT|SLAB_PANIC|
SLAB_MEM_SPREAD),
init_once);
- register_shrinker(&icache_shrinker);

/* Hash may have been set up in inode_init_early */
if (!hashdist)
diff --git a/fs/super.c b/fs/super.c
index 9c3fa1f..f4630d9 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -38,6 +38,50 @@
LIST_HEAD(super_blocks);
DEFINE_SPINLOCK(sb_lock);

+static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct super_block *sb;
+ int count;
+
+ sb = container_of(shrink, struct super_block, s_shrink);
+
+ /*
+ * Deadlock avoidance. We may hold various FS locks, and we don't want
+ * to recurse into the FS that called us in clear_inode() and friends..
+ */
+ if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+ return -1;
+
+ /*
+ * if we can't get the umount lock, then there's no point having the
+ * shrinker try again because the sb is being torn down.
+ */
+ if (!down_read_trylock(&sb->s_umount))
+ return -1;
+
+ if (!sb->s_root) {
+ up_read(&sb->s_umount);
+ return -1;
+ }
+
+ if (sc->nr_to_scan) {
+ /* proportion the scan between the two caches */
+ int total;
+
+ total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
+ count = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total;
+
+ /* prune dcache first as icache is pinned by it */
+ prune_dcache_sb(sb, count);
+ prune_icache_sb(sb, sc->nr_to_scan - count);
+ }
+
+ count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
+ * sysctl_vfs_cache_pressure;
+ up_read(&sb->s_umount);
+ return count;
+}
+
/**
* alloc_super - create new superblock
* @type: filesystem type superblock should belong to
@@ -116,6 +160,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
s->s_op = &default_op;
s->s_time_gran = 1000000000;
s->cleancache_poolid = -1;
+
+ s->s_shrink.seeks = DEFAULT_SEEKS;
+ s->s_shrink.shrink = prune_super;
}
out:
return s;
@@ -278,7 +325,12 @@ void generic_shutdown_super(struct super_block *sb)
{
const struct super_operations *sop = sb->s_op;

-
+ /*
+ * shut down the shrinker first so we know that there are no possible
+ * races when shrinking the dcache or icache. Removes the need for
+ * external locking to prevent such races.
+ */
+ unregister_shrinker(&sb->s_shrink);
if (sb->s_root) {
shrink_dcache_for_umount(sb);
sync_filesystem(sb);
@@ -366,6 +418,7 @@ retry:
list_add(&s->s_instances, &type->fs_supers);
spin_unlock(&sb_lock);
get_filesystem(type);
+ register_shrinker(&s->s_shrink);
return s;
}

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bbd478e..c3b3462 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -391,6 +391,7 @@ struct inodes_stat_t {
#include <linux/semaphore.h>
#include <linux/fiemap.h>
#include <linux/rculist_bl.h>
+#include <linux/mm.h>

#include <asm/atomic.h>
#include <asm/byteorder.h>
@@ -1440,8 +1441,14 @@ struct super_block {
* Saved pool identifier for cleancache (-1 means none)
*/
int cleancache_poolid;
+
+ struct shrinker s_shrink; /* per-sb shrinker handle */
};

+/* superblock cache pruning functions */
+extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
+extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
+
extern struct timespec current_fs_time(struct super_block *sb);

/*
--
1.7.5.1

2011-06-02 07:02:55

by Dave Chinner

Subject: [PATCH 09/12] inode: remove iprune_sem

From: Dave Chinner <[email protected]>

Now that we have per-sb shrinkers and they are unregistered before
evict_inodes() is called, there is no longer any race condition for
the iprune_sem to protect against. Hence we can remove it.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 21 ---------------------
1 files changed, 0 insertions(+), 21 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 890d95e..167adfd 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -68,17 +68,6 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_sb_list_lock);
__cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_wb_list_lock);

/*
- * iprune_sem provides exclusion between the icache shrinking and the
- * umount path.
- *
- * We don't actually need it to protect anything in the umount path,
- * but only need to cycle through it to make sure any inode that
- * prune_icache_sb took off the LRU list has been fully torn down by the
- * time we are past evict_inodes.
- */
-static DECLARE_RWSEM(iprune_sem);
-
-/*
* Empty aops. Can be used for the cases where the user does not
* define any of the address_space operations.
*/
@@ -535,14 +524,6 @@ void evict_inodes(struct super_block *sb)
spin_unlock(&inode_sb_list_lock);

dispose_list(&dispose);
-
- /*
- * Cycle through iprune_sem to make sure any inode that prune_icache_sb
- * moved off the list before we took the lock has been fully torn
- * down.
- */
- down_write(&iprune_sem);
- up_write(&iprune_sem);
}

/**
@@ -628,7 +609,6 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
int nr_scanned;
unsigned long reap = 0;

- down_read(&iprune_sem);
spin_lock(&sb->s_inode_lru_lock);
for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
struct inode *inode;
@@ -704,7 +684,6 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
spin_unlock(&sb->s_inode_lru_lock);

dispose_list(&freeable);
- up_read(&iprune_sem);
}

static void __wait_on_freeing_inode(struct inode *inode);
--
1.7.5.1

2011-06-02 07:03:33

by Dave Chinner

Subject: [PATCH 10/12] superblock: add filesystem shrinker operations

From: Dave Chinner <[email protected]>

Now we have a per-superblock shrinker implementation, we can add a
filesystem specific callout to it to allow filesystem internal
caches to be shrunk by the superblock shrinker.

Rather than perpetuate the multipurpose shrinker callback API (i.e.
nr_to_scan == 0 meaning "tell me how many objects are freeable in
the cache"), two operations will be added. The first will return the
number of objects that are freeable, the second is the actual
shrinker call.

Signed-off-by: Dave Chinner <[email protected]>
---
Documentation/filesystems/vfs.txt | 16 +++++++++++++
fs/super.c | 43 +++++++++++++++++++++++++++---------
include/linux/fs.h | 2 +
3 files changed, 50 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 88b9f55..dc732d2 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -229,6 +229,8 @@ struct super_operations {

ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
+ int (*nr_cached_objects)(struct super_block *);
+ void (*free_cached_objects)(struct super_block *, int);
};

All methods are called without any locks being held, unless otherwise
@@ -301,6 +303,20 @@ or bottom half).

quota_write: called by the VFS to write to filesystem quota file.

+ nr_cached_objects: called by the sb cache shrinking function for the
+ filesystem to return the number of freeable cached objects it contains.
+ Optional.
+
+ free_cached_objects: called by the sb cache shrinking function for the
+ filesystem to scan the number of objects indicated to try to free them.
+ Optional, but any filesystem implementing this method needs to also
+ implement ->nr_cached_objects for it to be called correctly.
+
+ We can't do anything with any errors that the filesystem might
+ encounter, hence the void return type. This will never be called if
+ the VM is trying to reclaim under GFP_NOFS conditions, hence this
+ method does not need to handle that situation itself.
+
Whoever sets up the inode is responsible for filling in the "i_op" field. This
is a pointer to a "struct inode_operations" which describes the methods that
can be performed on individual inodes.
diff --git a/fs/super.c b/fs/super.c
index f4630d9..b55f968 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -41,7 +41,8 @@ DEFINE_SPINLOCK(sb_lock);
static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
{
struct super_block *sb;
- int count;
+ int fs_objects = 0;
+ int total_objects;

sb = container_of(shrink, struct super_block, s_shrink);

@@ -64,22 +65,42 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
return -1;
}

- if (sc->nr_to_scan) {
- /* proportion the scan between the two caches */
- int total;
+ if (sb->s_op && sb->s_op->nr_cached_objects)
+ fs_objects = sb->s_op->nr_cached_objects(sb);
+
+ total_objects = sb->s_nr_dentry_unused +
+ sb->s_nr_inodes_unused + fs_objects + 1;

- total = sb->s_nr_dentry_unused + sb->s_nr_inodes_unused + 1;
- count = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total;
+ if (sc->nr_to_scan) {
+ int dentries;
+ int inodes;
+
+ /* proportion the scan between the caches */
+ dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
+ total_objects;
+ inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
+ total_objects;
+ if (fs_objects)
+ fs_objects = (sc->nr_to_scan * fs_objects) /
+ total_objects;
+ /*
+ * prune the dcache first as the icache is pinned by it, then
+ * prune the icache, followed by the filesystem specific caches
+ */
+ prune_dcache_sb(sb, dentries);
+ prune_icache_sb(sb, inodes);

- /* prune dcache first as icache is pinned by it */
- prune_dcache_sb(sb, count);
- prune_icache_sb(sb, sc->nr_to_scan - count);
+ if (fs_objects && sb->s_op->free_cached_objects) {
+ sb->s_op->free_cached_objects(sb, fs_objects);
+ fs_objects = sb->s_op->nr_cached_objects(sb);
+ }
+ total_objects = sb->s_nr_dentry_unused +
+ sb->s_nr_inodes_unused + fs_objects;
}

- count = ((sb->s_nr_dentry_unused + sb->s_nr_inodes_unused) / 100)
- * sysctl_vfs_cache_pressure;
+ total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
up_read(&sb->s_umount);
- return count;
+ return total_objects;
}

/**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3b3462..4f0ed0a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1654,6 +1654,8 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
+ int (*nr_cached_objects)(struct super_block *);
+ void (*free_cached_objects)(struct super_block *, int);
};

/*
--
1.7.5.1

2011-06-02 07:01:56

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 11/12] vfs: increase shrinker batch size

From: Dave Chinner <[email protected]>

Now that the per-sb shrinker is responsible for shrinking 2 or more
caches, increase the batch size to keep econmies of scale for
shrinking each cache. Increase the shrinker batch size to 1024
objects.

To allow for a large increase in batch size, add a conditional
reschedule to prune_icache_sb() so that we don't hold the LRU spin
lock for too long. This mirrors the behaviour of
__shrink_dcache_sb(), and allows us to increase the batch size
without needing to worry about problems caused by long lock hold
times.

To ensure that filesystems using the per-sb shrinker callouts don't
cause problems, document that the object freeing method must
reschedule appropriately inside loops.
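
As a rough sketch of the expected shape of such a loop (generic code, not
from this patch; the examplefs helpers are hypothetical):

	while (nr_to_scan-- > 0) {
		struct examplefs_object *obj;

		/* hypothetical helper: detach the next reclaimable object */
		obj = examplefs_grab_reclaimable(sb);
		if (!obj)
			break;
		examplefs_free_object(obj);

		/*
		 * Yield the CPU periodically so that large batches don't
		 * cause long scheduling holdoffs.
		 */
		cond_resched();
	}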

Signed-off-by: Dave Chinner <[email protected]>
---
Documentation/filesystems/vfs.txt | 5 +++++
fs/super.c | 1 +
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index dc732d2..2e26973 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -317,6 +317,11 @@ or bottom half).
the VM is trying to reclaim under GFP_NOFS conditions, hence this
method does not need to handle that situation itself.

+ Implementations must include conditional reschedule calls inside any
+ scanning loop that is done. This allows the VFS to determine
+ appropriate scan batch sizes without having to worry about whether
+ implementations will cause holdoff problems due ot large batch sizes.
+
Whoever sets up the inode is responsible for filling in the "i_op" field. This
is a pointer to a "struct inode_operations" which describes the methods that
can be performed on individual inodes.
diff --git a/fs/super.c b/fs/super.c
index b55f968..323a63e 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -184,6 +184,7 @@ static struct super_block *alloc_super(struct file_system_type *type)

s->s_shrink.seeks = DEFAULT_SEEKS;
s->s_shrink.shrink = prune_super;
+ s->s_shrink.batch = 1024;
}
out:
return s;
--
1.7.5.1

2011-06-02 07:02:39

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 12/12] xfs: make use of new shrinker callout for the inode cache

From: Dave Chinner <[email protected]>

Convert the inode reclaim shrinker to use the new per-sb shrinker
operations. This allows much bigger reclaim batches to be used, and
allows the XFS inode cache to be shrunk in proportion with the VFS
dentry and inode caches. This avoids the problem of the VFS caches
being shrunk significantly before the XFS inode cache is shrunk
resulting in imbalances in the caches during reclaim.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/linux-2.6/xfs_super.c | 26 ++++++++++-----
fs/xfs/linux-2.6/xfs_sync.c | 71 ++++++++++++++++--------------------------
fs/xfs/linux-2.6/xfs_sync.h | 5 +--
3 files changed, 46 insertions(+), 56 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 1e3a7ce..b133b69 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1087,11 +1087,6 @@ xfs_fs_put_super(
{
struct xfs_mount *mp = XFS_M(sb);

- /*
- * Unregister the memory shrinker before we tear down the mount
- * structure so we don't have memory reclaim racing with us here.
- */
- xfs_inode_shrinker_unregister(mp);
xfs_syncd_stop(mp);

/*
@@ -1491,8 +1486,6 @@ xfs_fs_fill_super(
if (error)
goto out_filestream_unmount;

- xfs_inode_shrinker_register(mp);
-
error = xfs_mountfs(mp);
if (error)
goto out_syncd_stop;
@@ -1515,7 +1508,6 @@ xfs_fs_fill_super(
return 0;

out_syncd_stop:
- xfs_inode_shrinker_unregister(mp);
xfs_syncd_stop(mp);
out_filestream_unmount:
xfs_filestream_unmount(mp);
@@ -1540,7 +1532,6 @@ xfs_fs_fill_super(
}

fail_unmount:
- xfs_inode_shrinker_unregister(mp);
xfs_syncd_stop(mp);

/*
@@ -1566,6 +1557,21 @@ xfs_fs_mount(
return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
}

+static int
+xfs_fs_nr_cached_objects(
+ struct super_block *sb)
+{
+ return xfs_reclaim_inodes_count(XFS_M(sb));
+}
+
+static void
+xfs_fs_free_cached_objects(
+ struct super_block *sb,
+ int nr_to_scan)
+{
+ xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+}
+
static const struct super_operations xfs_super_operations = {
.alloc_inode = xfs_fs_alloc_inode,
.destroy_inode = xfs_fs_destroy_inode,
@@ -1579,6 +1585,8 @@ static const struct super_operations xfs_super_operations = {
.statfs = xfs_fs_statfs,
.remount_fs = xfs_fs_remount,
.show_options = xfs_fs_show_options,
+ .nr_cached_objects = xfs_fs_nr_cached_objects,
+ .free_cached_objects = xfs_fs_free_cached_objects,
};

static struct file_system_type xfs_fs_type = {
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 8ecad5f..9bd7e89 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -179,6 +179,8 @@ restart:
if (error == EFSCORRUPTED)
break;

+ cond_resched();
+
} while (nr_found && !done);

if (skipped) {
@@ -986,6 +988,8 @@ restart:

*nr_to_scan -= XFS_LOOKUP_BATCH;

+ cond_resched();
+
} while (nr_found && !done && *nr_to_scan > 0);

if (trylock && !done)
@@ -1003,7 +1007,7 @@ restart:
* ensure that when we get more reclaimers than AGs we block rather
* than spin trying to execute reclaim.
*/
- if (trylock && skipped && *nr_to_scan > 0) {
+ if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
trylock = 0;
goto restart;
}
@@ -1021,44 +1025,38 @@ xfs_reclaim_inodes(
}

/*
- * Inode cache shrinker.
+ * Scan a certain number of inodes for reclaim.
*
* When called we make sure that there is a background (fast) inode reclaim in
- * progress, while we will throttle the speed of reclaim via doiing synchronous
+ * progress, while we will throttle the speed of reclaim via doing synchronous
* reclaim of inodes. That means if we come across dirty inodes, we wait for
* them to be cleaned, which we hope will not be very long due to the
* background walker having already kicked the IO off on those dirty inodes.
*/
-static int
-xfs_reclaim_inode_shrink(
- struct shrinker *shrink,
- struct shrink_control *sc)
+void
+xfs_reclaim_inodes_nr(
+ struct xfs_mount *mp,
+ int nr_to_scan)
{
- struct xfs_mount *mp;
- struct xfs_perag *pag;
- xfs_agnumber_t ag;
- int reclaimable;
- int nr_to_scan = sc->nr_to_scan;
- gfp_t gfp_mask = sc->gfp_mask;
-
- mp = container_of(shrink, struct xfs_mount, m_inode_shrink);
- if (nr_to_scan) {
- /* kick background reclaimer and push the AIL */
- xfs_syncd_queue_reclaim(mp);
- xfs_ail_push_all(mp->m_ail);
+ /* kick background reclaimer and push the AIL */
+ xfs_syncd_queue_reclaim(mp);
+ xfs_ail_push_all(mp->m_ail);

- if (!(gfp_mask & __GFP_FS))
- return -1;
+ xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+}

- xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT,
- &nr_to_scan);
- /* terminate if we don't exhaust the scan */
- if (nr_to_scan > 0)
- return -1;
- }
+/*
+ * Return the number of reclaimable inodes in the filesystem for
+ * the shrinker to determine how much to reclaim.
+ */
+int
+xfs_reclaim_inodes_count(
+ struct xfs_mount *mp)
+{
+ struct xfs_perag *pag;
+ xfs_agnumber_t ag = 0;
+ int reclaimable = 0;

- reclaimable = 0;
- ag = 0;
while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
ag = pag->pag_agno + 1;
reclaimable += pag->pag_ici_reclaimable;
@@ -1067,18 +1065,3 @@ xfs_reclaim_inode_shrink(
return reclaimable;
}

-void
-xfs_inode_shrinker_register(
- struct xfs_mount *mp)
-{
- mp->m_inode_shrink.shrink = xfs_reclaim_inode_shrink;
- mp->m_inode_shrink.seeks = DEFAULT_SEEKS;
- register_shrinker(&mp->m_inode_shrink);
-}
-
-void
-xfs_inode_shrinker_unregister(
- struct xfs_mount *mp)
-{
- unregister_shrinker(&mp->m_inode_shrink);
-}
diff --git a/fs/xfs/linux-2.6/xfs_sync.h b/fs/xfs/linux-2.6/xfs_sync.h
index e3a6ad2..2e15685 100644
--- a/fs/xfs/linux-2.6/xfs_sync.h
+++ b/fs/xfs/linux-2.6/xfs_sync.h
@@ -43,6 +43,8 @@ void xfs_quiesce_attr(struct xfs_mount *mp);
void xfs_flush_inodes(struct xfs_inode *ip);

int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+int xfs_reclaim_inodes_count(struct xfs_mount *mp);
+void xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);

void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);
void __xfs_inode_set_reclaim_tag(struct xfs_perag *pag, struct xfs_inode *ip);
@@ -54,7 +56,4 @@ int xfs_inode_ag_iterator(struct xfs_mount *mp,
int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag, int flags),
int flags);

-void xfs_inode_shrinker_register(struct xfs_mount *mp);
-void xfs_inode_shrinker_unregister(struct xfs_mount *mp);
-
#endif
--
1.7.5.1

2011-06-02 09:31:14

by Nicolas Kaiser

[permalink] [raw]
Subject: Re: [PATCH 11/12] vfs: increase shrinker batch size

Just noticed the two typos below.

* Dave Chinner <[email protected]>:
> From: Dave Chinner <[email protected]>
>
> Now that the per-sb shrinker is responsible for shrinking 2 or more
> caches, increase the batch size to keep econmies of scale for

economies

(..)

> Documentation/filesystems/vfs.txt | 5 +++++
> fs/super.c | 1 +
> 2 files changed, 6 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index dc732d2..2e26973 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -317,6 +317,11 @@ or bottom half).
> the VM is trying to reclaim under GFP_NOFS conditions, hence this
> method does not need to handle that situation itself.
>
> + Implementations must include conditional reschedule calls inside any
> + scanning loop that is done. This allows the VFS to determine
> + appropriate scan batch sizes without having to worry about whether
> + implementations will cause holdoff problems due ot large batch sizes.

due to

Best regards,
Nicolas Kaiser

2011-06-04 00:25:57

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 06/12] inode: Make unused inode LRU per superblock

On Thu, Jun 02, 2011 at 05:01:01PM +1000, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> The inode unused list is currently a global LRU. This does not match
> the other global filesystem cache - the dentry cache - which uses
> per-superblock LRU lists. Hence we have related filesystem object
> types using different LRU reclamation schemes.
>
> To enable a per-superblock filesystem cache shrinker, both of these
> caches need to have per-sb unused object LRU lists. Hence this patch
> converts the global inode LRU to per-sb LRUs.
>
> The patch only does rudimentary per-sb proportioning in the shrinker
> infrastructure, as this gets removed when the per-sb shrinker
> callouts are introduced later on.

What protects s_nr_inodes_unused?

2011-06-04 00:42:34

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

> @@ -278,7 +325,12 @@ void generic_shutdown_super(struct super_block *sb)
> {
> const struct super_operations *sop = sb->s_op;
>
> -
> + /*
> + * shut down the shrinker first so we know that there are no possible
> + * races when shrinking the dcache or icache. Removes the need for
> + * external locking to prevent such races.
> + */
> + unregister_shrinker(&sb->s_shrink);
> if (sb->s_root) {
> shrink_dcache_for_umount(sb);
> sync_filesystem(sb);

What it means is that shrinker_rwsem now nests inside ->s_umount... IOW,
if any ->shrink() gets stuck, so does every generic_shutdown_super().
I'm still not convinced it's a good idea - especially since _this_
superblock will be skipped anyway. Is there any good reason to evict
shrinker that early? Note that doing that after ->s_umount is dropped
should be reasonably safe - your shrinker will see that superblock is
doomed if it's called anywhere in that window...

2011-06-04 01:40:28

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 06/12] inode: Make unused inode LRU per superblock

On Sat, Jun 04, 2011 at 01:25:52AM +0100, Al Viro wrote:
> On Thu, Jun 02, 2011 at 05:01:01PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > The inode unused list is currently a global LRU. This does not match
> > the other global filesystem cache - the dentry cache - which uses
> > per-superblock LRU lists. Hence we have related filesystem object
> > types using different LRU reclamation schemes.
> >
> > To enable a per-superblock filesystem cache shrinker, both of these
> > caches need to have per-sb unused object LRU lists. Hence this patch
> > converts the global inode LRU to per-sb LRUs.
> >
> > The patch only does rudimentary per-sb proportioning in the shrinker
> > infrastructure, as this gets removed when the per-sb shrinker
> > callouts are introduced later on.
>
> What protects s_nr_inodes_unused?

For this patch, the modifications are protected by the
inode_lru_lock, but the reads are unprotected. That's the same
protection as the inode_stat.nr_unused field, and the same as the
existing dentry cache per-sb LRU accounting. In the next patch
modifications are moved under the sb->s_inode_lru_lock, but reads
still remain unprotected.

I can see how the multiple reads in shrink_icache_sb() could each
return a different value during the proportioning, but I don't think
that is a big problem. That proportioning code goes away in the next
patch and is replaced by different code in prune_super(), so if you
want the reads protected by locks, or a single snapshot used for the
proportioning calculations, I'll do it in the new code in
prune_super().
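
A minimal sketch of what the snapshot variant could look like (field names
as used in this series; illustrative only, not the final code):

	/*
	 * Snapshot the unlocked counters once so that all the
	 * proportioning calculations use a consistent set of values.
	 */
	int dentries = sb->s_nr_dentry_unused;
	int inodes = sb->s_nr_inodes_unused;
	int total = dentries + inodes + 1;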

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-04 01:52:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

On Sat, Jun 04, 2011 at 01:42:31AM +0100, Al Viro wrote:
> > @@ -278,7 +325,12 @@ void generic_shutdown_super(struct super_block *sb)
> > {
> > const struct super_operations *sop = sb->s_op;
> >
> > -
> > + /*
> > + * shut down the shrinker first so we know that there are no possible
> > + * races when shrinking the dcache or icache. Removes the need for
> > + * external locking to prevent such races.
> > + */
> > + unregister_shrinker(&sb->s_shrink);
> > if (sb->s_root) {
> > shrink_dcache_for_umount(sb);
> > sync_filesystem(sb);
>
> What it means is that shrinker_rwsem now nests inside ->s_umount... IOW,
> if any ->shrink() gets stuck, so does every generic_shutdown_super().
> I'm still not convinced it's a good idea - especially since _this_
> superblock will be skipped anyway.

True, that's not nice.

> Is there any good reason to evict
> shrinker that early?

I wanted to put it early on in the unmount path so that the shrinker
was guaranteed to be gone before evict_inodes() was called. That
would mean that it is obviously safe to remove the iprune_sem
serialisation in that function.

The code in the umount path is quite different between 2.6.35 (the
original version of the patchset) and 3.0-rc1, so I'm not surprised
that I haven't put the unregister call in the right place.

> Note that doing that after ->s_umount is dropped
> should be reasonably safe - your shrinker will see that superblock is
> doomed if it's called anywhere in that window...

Agreed. In trying to find the best "early" place to unregister the
shrinker, I've completely missed the obvious "late is safe"
solution. I'll respin it with these changes.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-04 14:08:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

On Sat, Jun 04, 2011 at 11:52:12AM +1000, Dave Chinner wrote:
> I wanted to put it early on in the unmount path so that the shrinker
> was guaranteed to be gone before evict_inodes() was called. That
> would mean that it is obviously safe to remove the iprune_sem
> serialisation in that function.

The iprune_sem removal is fine as soon as you have a per-sb shrinker
for the inodes which keeps an active reference on the superblock until
all the inodes are evicted.

2011-06-04 14:19:48

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

On Sat, Jun 04, 2011 at 10:08:48AM -0400, Christoph Hellwig wrote:
> On Sat, Jun 04, 2011 at 11:52:12AM +1000, Dave Chinner wrote:
> > I wanted to put it early on in the unmount path so that the shrinker
> > was guaranteed to be gone before evict_inodes() was called. That
> > would mean that it is obviously safe to remove the iprune_sem
> > serialisation in that function.
>
> The iprune_sem removal is fine as soon as you have a per-sb shrinker
> for the inodes which keeps an active reference on the superblock until
> all the inodes are evicted.

I really don't like that. Stuff keeping active refs, worse yet doing that
asynchronously... Shrinkers should *not* do that. Just grab a passive
ref (i.e. bump s_count), try grab s_umount (shared) and if that thing still
has ->s_root while we hold s_umount, go ahead. Unregister either at the
end of generic_shutdown_super() or from deactivate_locked_super(), between
the calls of ->kill_sb() and put_filesystem().
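
A rough sketch of that pattern (illustrative only; error paths and the
final s_count drop are simplified, and the real drop would go through
__put_super() under sb_lock):

static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
{
	struct super_block *sb;
	int count = -1;

	sb = container_of(shrink, struct super_block, s_shrink);

	/*
	 * Passive reference: keeps the superblock structure around but
	 * does not pin the filesystem as active.
	 */
	spin_lock(&sb_lock);
	sb->s_count++;
	spin_unlock(&sb_lock);

	if (!down_read_trylock(&sb->s_umount))
		goto out_put;
	if (!sb->s_root)		/* superblock is being shut down */
		goto out_unlock;

	/*
	 * ... prune/count the dcache, icache and fs-specific caches as
	 * before, setting count accordingly ...
	 */

out_unlock:
	up_read(&sb->s_umount);
out_put:
	spin_lock(&sb_lock);
	sb->s_count--;			/* simplified; see __put_super() */
	spin_unlock(&sb_lock);
	return count;
}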

2011-06-04 14:24:52

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/12] superblock: introduce per-sb cache shrinker infrastructure

On Sat, Jun 04, 2011 at 03:19:40PM +0100, Al Viro wrote:
> > The iprune_sem removal is fine as soon as you have a per-sb shrinker
> > for the inodes which keeps an active reference on the superblock until
> > all the inodes are evicted.
>
> I really don't like that. Stuff keeping active refs, worse yet doing that
> asynchronously... Shrinkers should *not* do that. Just grab a passive
> ref (i.e. bump s_count), try grab s_umount (shared) and if that thing still
> has ->s_root while we hold s_umount, go ahead. Unregister either at the
> end of generic_shutdown_super() or from deactivate_locked_super(), between
> the calls of ->kill_sb() and put_filesystem().

PS: shrinkers should not acquire active refs; more specifically, they should
not _drop_ active refs, lest they end up dropping the last active one and
trigger unregistering the shrinker for the superblock in question. From inside of
->shrink(), with shrinker_rwsem held by caller. Deadlock...

2011-06-16 11:33:29

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 0/12] Per superblock cache reclaim

Can we get some comments from the MM folks for patches 1-3? Those look
like some pretty urgent fixes for really dumb shrinker behaviour.

2011-06-17 03:36:07

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/12] Per superblock cache reclaim

(2011/06/16 20:33), Christoph Hellwig wrote:
> Can we get some comments from the MM folks for patches 1-3? Those look
> like some pretty urgent fixes for really dumb shrinker behaviour.

Yeah, I'm reviewing today. I'm sorry I delayed it. Generally, they look
pretty good to me. Thanks, Dave. I have a few minor comments and I'll post
them when my review is finished. (Maybe some hours later).

2011-06-20 00:44:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 01/12] vmscan: add shrink_slab tracepoints

(2011/06/02 16:00), Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> It is impossible to understand what the shrinkers are actually doing
> without instrumenting the code, so add some tracepoints to allow
> insight to be gained.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> include/trace/events/vmscan.h | 67 +++++++++++++++++++++++++++++++++++++++++
> mm/vmscan.c | 6 +++-
> 2 files changed, 72 insertions(+), 1 deletions(-)

This looks good to me. I have two minor requests. 1) please change the patch order,
move this patch after the shrinker changes. iow, now both this and [2/12] have
tracepoint changes. I don't like it. 2) please avoid cryptic abbreviated variable
names. Instead, please just use the same variable names as the vmscan.c source code.

2011-06-20 00:46:59

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 02/12] vmscan: shrinker->nr updates race and go wrong

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48e3fbd..dce2767 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> unsigned long total_scan;
> unsigned long max_pass;
> int shrink_ret = 0;
> + long nr;
> + long new_nr;
>
> + /*
> + * copy the current shrinker scan count into a local variable
> + * and zero it so that other concurrent shrinker invocations
> + * don't also do this scanning work.
> + */
> + do {
> + nr = shrinker->nr;
> + } while (cmpxchg(&shrinker->nr, nr, 0) != nr);
> +
> + total_scan = nr;
> max_pass = do_shrinker_shrink(shrinker, shrink, 0);
> delta = (4 * nr_pages_scanned) / shrinker->seeks;
> delta *= max_pass;
> do_div(delta, lru_pages + 1);
> - shrinker->nr += delta;
> - if (shrinker->nr < 0) {
> + total_scan += delta;
> + if (total_scan < 0) {
> printk(KERN_ERR "shrink_slab: %pF negative objects to "
> "delete nr=%ld\n",
> - shrinker->shrink, shrinker->nr);
> - shrinker->nr = max_pass;
> + shrinker->shrink, total_scan);
> + total_scan = max_pass;
> }
>
> /*
> @@ -269,13 +281,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> * never try to free more than twice the estimate number of
> * freeable entries.
> */
> - if (shrinker->nr > max_pass * 2)
> - shrinker->nr = max_pass * 2;
> + if (total_scan > max_pass * 2)
> + total_scan = max_pass * 2;
>
> - total_scan = shrinker->nr;
> - shrinker->nr = 0;
>
> - trace_mm_shrink_slab_start(shrinker, shrink, nr_pages_scanned,
> + trace_mm_shrink_slab_start(shrinker, shrink, nr, nr_pages_scanned,
> lru_pages, max_pass, delta, total_scan);
>
> while (total_scan >= SHRINK_BATCH) {
> @@ -295,8 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> cond_resched();
> }
>
> - shrinker->nr += total_scan;
> - trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan);
> + /*
> + * move the unused scan count back into the shrinker in a
> + * manner that handles concurrent updates. If we exhausted the
> + * scan, there is no need to do an update.
> + */
> + do {
> + nr = shrinker->nr;
> + new_nr = total_scan + nr;
> + if (total_scan <= 0)
> + break;
> + } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
> +
> + trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
> }
> up_read(&shrinker_rwsem);
> out:

Looks like a great fix. Please remove the tracepoint change from this patch
and send it to -stable. iow, I expect I'll ack your next spin.

thanks.

2011-06-20 00:51:13

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 03/12] vmscan: reduce wind up shrinker->nr when shrinker can't do work

(2011/06/02 16:00), Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> When a shrinker returns -1 to shrink_slab() to indicate it cannot do
> any work given the current memory reclaim requirements, it adds the
> entire total_scan count to shrinker->nr. The idea ehind this is that
> whenteh shrinker is next called and can do work, it will do the work
> of the previously aborted shrinker call as well.
>
> However, if a filesystem is doing lots of allocation with GFP_NOFS
> set, then we get many, many more aborts from the shrinkers than we
> do successful calls. The result is that shrinker->nr winds up to
> its maximum permissible value (twice the current cache size) and
> then when the next shrinker call that can do work is issued, it
> has enough scan count built up to free the entire cache twice over.
>
> This manifests itself in the cache going from full to empty in a
> matter of seconds, even when only a small part of the cache is
> needed to be emptied to free sufficient memory.
>
> Under metadata intensive workloads on ext4 and XFS, I'm seeing the
> VFS caches increase memory consumption up to 75% of memory (no page
> cache pressure) over a period of 30-60s, and then the shrinker
> empties them down to zero in the space of 2-3s. This cycle repeats
> over and over again, with the shrinker completely trashing the inode
> and dentry caches every minute or so the workload continues.
>
> This behaviour was made obvious by the shrink_slab tracepoints added
> earlier in the series, and made worse by the patch that corrected
> the concurrent accounting of shrinker->nr.
>
> To avoid this problem, stop repeated small increments of the total
> scan value from winding shrinker->nr up to a value that can cause
> the entire cache to be freed. We still need to allow it to wind up,
> so use the delta as the "large scan" threshold check - if the delta
> is more than a quarter of the entire cache size, then it is a large
> scan and allowed to cause lots of windup because we are clearly
> needing to free lots of memory.
>
> If it isn't a large scan then limit the total scan to half the size
> of the cache so that windup never increases to consume the whole
> cache. Reducing the total scan limit further does not allow enough
> wind-up to maintain the current levels of performance, whilst a
> higher threshold does not prevent the windup from freeing the entire
> cache under sustained workloads.
>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> mm/vmscan.c | 14 ++++++++++++++
> 1 files changed, 14 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index dce2767..3688f47 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -277,6 +277,20 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> }
>
> /*
> + * Avoid excessive windup on filesystem shrinkers due to large
> + * numbers of GFP_NOFS allocations causing the shrinkers to
> + * return -1 all the time. This results in a large nr being
> + * built up so when a shrink that can do some work comes along
> + * it empties the entire cache due to nr >>> max_pass. This is
> + * bad for sustaining a working set in memory.
> + *
> + * Hence only allow nr to go large when a large delta is
> + * calculated.
> + */
> + if (delta < max_pass / 4)
> + total_scan = min(total_scan, max_pass / 2);
> +
> + /*
> * Avoid risking looping forever due to too large nr value:
> * never try to free more than twice the estimate number of
> * freeable entries.

I guess "max_pass/4" and "min(total_scan, max_pass / 2)" are your heuristic value. right?
If so, please write your benchmark name and its result into the description. I mean,
currently some mm folks plan to enhance shrinker. So, sharing benchmark may help to avoid
an accidental regression.

I mean, your code itself looks pretty good to me.

thanks.

2011-06-20 00:53:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 01/12] vmscan: add shrink_slab tracepoints

On Mon, Jun 20, 2011 at 09:44:33AM +0900, KOSAKI Motohiro wrote:
> (2011/06/02 16:00), Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > It is impossible to understand what the shrinkers are actually doing
> > without instrumenting the code, so add some tracepoints to allow
> > insight to be gained.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
> > include/trace/events/vmscan.h | 67 +++++++++++++++++++++++++++++++++++++++++
> > mm/vmscan.c | 6 +++-
> > 2 files changed, 72 insertions(+), 1 deletions(-)
>
> This looks good to me. I have two minor requests. 1) please change the patch order,
> move this patch after the shrinker changes. iow, now both this and [2/12] have
> tracepoint changes. I don't like it.

No big deal - I'll just fold the second change (how shrinker->nr is
passed into the tracepoint) into the first. Tracepoints should be
first in the series, anyway, otherwise there is no way to validate
the before/after effect of the bug fixes....

> 2) please avoid cryptic abbreviated variable
> names. Instead, please just use the same variable names as the
> vmscan.c source code.

So replace cryptic abbreviated names with slightly different
cryptic abbreviated names? ;)

Sure, I can do that...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-20 01:25:38

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 02/12] vmscan: shrinker->nr updates race and go wrong

On Mon, Jun 20, 2011 at 09:46:54AM +0900, KOSAKI Motohiro wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 48e3fbd..dce2767 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -251,17 +251,29 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> > unsigned long total_scan;
> > unsigned long max_pass;
> > int shrink_ret = 0;
> > + long nr;
> > + long new_nr;
> >
> > + /*
> > + * copy the current shrinker scan count into a local variable
> > + * and zero it so that other concurrent shrinker invocations
> > + * don't also do this scanning work.
> > + */
> > + do {
> > + nr = shrinker->nr;
> > + } while (cmpxchg(&shrinker->nr, nr, 0) != nr);
> > +
> > + total_scan = nr;
> > max_pass = do_shrinker_shrink(shrinker, shrink, 0);
> > delta = (4 * nr_pages_scanned) / shrinker->seeks;
> > delta *= max_pass;
> > do_div(delta, lru_pages + 1);
> > - shrinker->nr += delta;
> > - if (shrinker->nr < 0) {
> > + total_scan += delta;
> > + if (total_scan < 0) {
> > printk(KERN_ERR "shrink_slab: %pF negative objects to "
> > "delete nr=%ld\n",
> > - shrinker->shrink, shrinker->nr);
> > - shrinker->nr = max_pass;
> > + shrinker->shrink, total_scan);
> > + total_scan = max_pass;
> > }
> >
> > /*
> > @@ -269,13 +281,11 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> > * never try to free more than twice the estimate number of
> > * freeable entries.
> > */
> > - if (shrinker->nr > max_pass * 2)
> > - shrinker->nr = max_pass * 2;
> > + if (total_scan > max_pass * 2)
> > + total_scan = max_pass * 2;
> >
> > - total_scan = shrinker->nr;
> > - shrinker->nr = 0;
> >
> > - trace_mm_shrink_slab_start(shrinker, shrink, nr_pages_scanned,
> > + trace_mm_shrink_slab_start(shrinker, shrink, nr, nr_pages_scanned,
> > lru_pages, max_pass, delta, total_scan);
> >
> > while (total_scan >= SHRINK_BATCH) {
> > @@ -295,8 +305,19 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> > cond_resched();
> > }
> >
> > - shrinker->nr += total_scan;
> > - trace_mm_shrink_slab_end(shrinker, shrink_ret, total_scan);
> > + /*
> > + * move the unused scan count back into the shrinker in a
> > + * manner that handles concurrent updates. If we exhausted the
> > + * scan, there is no need to do an update.
> > + */
> > + do {
> > + nr = shrinker->nr;
> > + new_nr = total_scan + nr;
> > + if (total_scan <= 0)
> > + break;
> > + } while (cmpxchg(&shrinker->nr, nr, new_nr) != nr);
> > +
> > + trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
> > }
> > up_read(&shrinker_rwsem);
> > out:
>
> Looks like a great fix. Please remove the tracepoint change from this patch
> and send it to -stable. iow, I expect I'll ack your next spin.

I don't believe such a change belongs in -stable. This code has been
buggy for many years and as I mentioned it actually makes existing
bad shrinker behaviour worse. I don't test stable kernels, so I've
got no idea what side effects it will have outside of this series.
I'm extremely hesitant to change VM behaviour in stable kernels
without having tested first, so I'm not going to push it for stable
kernels.

If you want it in stable kernels, then you can always let
[email protected] know once the commits are in the mainline tree and
you've tested them...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-20 04:30:39

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 02/12] vmscan: shrinker->nr updates race and go wrong

>> Looks like a great fix. Please remove the tracepoint change from this patch
>> and send it to -stable. iow, I expect I'll ack your next spin.
>
> I don't believe such a change belongs in -stable. This code has been
> buggy for many years and as I mentioned it actually makes existing
> bad shrinker behaviour worse. I don't test stable kernels, so I've
> got no idea what side effects it will have outside of this series.
> I'm extremely hesitant to change VM behaviour in stable kernels
> without having tested first, so I'm not going to push it for stable
> kernels.

Ok, I have no strong opinion.



>
> If you want it in stable kernels, then you can always let
> [email protected] know once the commits are in the mainline tree and
> you've tested them...

2011-06-21 05:09:22

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 03/12] vmscan: reduce wind up shrinker->nr when shrinker can't do work

On Mon, Jun 20, 2011 at 09:51:08AM +0900, KOSAKI Motohiro wrote:
> (2011/06/02 16:00), Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > When a shrinker returns -1 to shrink_slab() to indicate it cannot do
> > any work given the current memory reclaim requirements, it adds the
> > entire total_scan count to shrinker->nr. The idea ehind this is that
> > whenteh shrinker is next called and can do work, it will do the work
> > of the previously aborted shrinker call as well.
> >
> > However, if a filesystem is doing lots of allocation with GFP_NOFS
> > set, then we get many, many more aborts from the shrinkers than we
> > do successful calls. The result is that shrinker->nr winds up to
> > its maximum permissible value (twice the current cache size) and
> > then when the next shrinker call that can do work is issued, it
> > has enough scan count built up to free the entire cache twice over.
> >
> > This manifests itself in the cache going from full to empty in a
> > matter of seconds, even when only a small part of the cache is
> > needed to be emptied to free sufficient memory.
> >
> > Under metadata intensive workloads on ext4 and XFS, I'm seeing the
> > VFS caches increase memory consumption up to 75% of memory (no page
> > cache pressure) over a period of 30-60s, and then the shrinker
> > empties them down to zero in the space of 2-3s. This cycle repeats
> > over and over again, with the shrinker completely trashing the inode
> > and dentry caches every minute or so the workload continues.
> >
> > This behaviour was made obvious by the shrink_slab tracepoints added
> > earlier in the series, and made worse by the patch that corrected
> > the concurrent accounting of shrinker->nr.
> >
> > To avoid this problem, stop repeated small increments of the total
> > scan value from winding shrinker->nr up to a value that can cause
> > the entire cache to be freed. We still need to allow it to wind up,
> > so use the delta as the "large scan" threshold check - if the delta
> > is more than a quarter of the entire cache size, then it is a large
> > scan and allowed to cause lots of windup because we are clearly
> > needing to free lots of memory.
> >
> > If it isn't a large scan then limit the total scan to half the size
> > of the cache so that windup never increases to consume the whole
> > cache. Reducing the total scan limit further does not allow enough
> > wind-up to maintain the current levels of performance, whilst a
> > higher threshold does not prevent the windup from freeing the entire
> > cache under sustained workloads.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > ---
> > mm/vmscan.c | 14 ++++++++++++++
> > 1 files changed, 14 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index dce2767..3688f47 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -277,6 +277,20 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> > }
> >
> > /*
> > + * Avoid excessive windup on filesystem shrinkers due to large
> > + * numbers of GFP_NOFS allocations causing the shrinkers to
> > + * return -1 all the time. This results in a large nr being
> > + * built up so when a shrink that can do some work comes along
> > + * it empties the entire cache due to nr >>> max_pass. This is
> > + * bad for sustaining a working set in memory.
> > + *
> > + * Hence only allow nr to go large when a large delta is
> > + * calculated.
> > + */
> > + if (delta < max_pass / 4)
> > + total_scan = min(total_scan, max_pass / 2);
> > +
> > + /*
> > * Avoid risking looping forever due to too large nr value:
> > * never try to free more than twice the estimate number of
> > * freeable entries.
>
> I guess "max_pass/4" and "min(total_scan, max_pass / 2)" are your heuristic value. right?

Yes.

> If so, please write your benchmark name and its results into the description.

Take your pick.

Anything that generates a large amount of vfs cache pressure
combined with GFP_NOFS memory allocation will show changes in
-behaviour- as you modify these variables. e.g. creating 50 million
inodes in parallel with fs_mark, parallel chmod -R traversals of
said inodes, parallel rm -rf, etc. The same cache behaviour can be
observed with any of these sorts of cold cache workloads.

I say changes in behaviour rather than performance because just
measuring wall time does not necessarily show any difference in
performance. The bug this fixes is the cache being completely trashed
periodically, but none of the above workloads show significant wall
time differences in behaviour because of this modification as they
don't rely on cache hits for performance. If you have a workload
that actually relies on resident cache hits for good performance,
then you'll see a difference in performance that you can measure
with wall time....

What this change does have an effect on is the variance of the
resident cache size, which I monitor via live graphing of the
various cache metrics. It is not as simple to monitor as having a
single number fall out the bottom that you can point to for
improvement. Details of this are already documented in patch zero of
the series:

https://lkml.org/lkml/2011/6/2/42

Basically, the above numbers gave the lowest variance in the resident
cache size without preventing the shrinker from being able to free
enough memory for the system to work effectively.
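
As a made-up worked example of those thresholds (numbers are illustrative
only): with max_pass = 100000 cached objects,

	/*
	 * delta = 5000  (< max_pass / 4 = 25000):
	 *	total_scan is clamped to max_pass / 2 = 50000, so windup
	 *	can never build up enough to empty the whole cache.
	 * delta = 40000 (>= max_pass / 4):
	 *	no extra clamp is applied; only the existing 2 * max_pass
	 *	cap limits total_scan.
	 */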

> I mean, currently some mm folks plan to enhance the shrinkers. So,
> sharing the benchmark may help to avoid an accidental regression.

I predict that I will have some bug reporting to do in future. ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2011-06-21 05:27:52

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 03/12] vmscan: reduce wind up shrinker->nr when shrinker can't do work

>> I mean, currently some mm folks plan to enhance the shrinkers. So,
>> sharing the benchmark may help to avoid an accidental regression.
>
> I predict that I will have some bug reporting to do in future. ;)

Ok, I have no objection. thanks.