2012-11-27 23:15:14

by Dave Chinner

Subject: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

Hi Glauber,

Here's a working version of my patchset for generic LRU lists and
NUMA-aware shrinkers.

There are several parts to this patch set. The NUMA aware shrinkers
are based on having a generic node-based LRU list implementation,
and there are subsystems that need to be converted to use these
lists as part of the process. There is also a long overdue change to
the shrinker API to give it separate object count and object scan
callbacks, getting rid of the magic and confusing "nr_to_scan = 0"
semantics.

First of all, the patch set is based on a current 3.7-rc7 tree with
the current xfs dev tree merged into it [can be found at
git://oss.sgi.com/xfs/xfs]. That's because there are lots of XFS
changes in the patch set, and there's no way I'm going to write them
a second time in a couple of weeks when the current dev tree is
merged into 3.8-rc1....

So, here's what the patches do:

[PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu
[PATCH 02/19] dentry: move to per-sb LRU locks
[PATCH 03/19] dcache: remove dentries from LRU before putting on

These three patches are preparation of the dcache for moving to the
generic LRU list API. They basically get rid of the global dentry LRU
lock, and in doing so have to remove several creative abuses of the
LRU list membership detection that allowed dentries on shrink lists
to still magically appear to be on the LRU list. The main change here
is that dentries on the shrink lists now *must* have the
DCACHE_SHRINK_LIST flag set and be entirely removed from the LRU
before being disposed of.

This is probably a good cleanup to do regardless of the rest of the
patch set because it removes a couple of landmines in
shrink_dentry_list() that took me a while to work out...

[PATCH 04/19] mm: new shrinker API
[PATCH 05/19] shrinker: convert superblock shrinkers to new API

These introduce the count/scan shrinker API, and for testing
purposes convert the superblock shrinker to use it before any other
changes are made. This gives a clean separation of counting the
number of objects in a cache for pressure calculations, and the act
of scanning objects in an attempt to free memory. Indeed, the
scan_objects() callback now returns the number of objects freed by
the scan instead of having to try to work out whether any progress
was made by comparing absolute counts.

This is also more efficient as we don't have to count all the
objects in a cache on every scan pass. It is now done once per
shrink_slab() invocation to calculate how much to scan, and we get
direct feedback on how much gets reclaimed in that pass. i.e. we get
reliable, accurate feedback on shrinker progress.
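
For illustration, a minimal shrinker using the new API might look
something like the sketch below. This is a sketch only - "foo_cache"
and its helper functions are hypothetical - but the shrinker fields
and the shrink_control usage match the API introduced in patch 04:

static long foo_cache_count(struct shrinker *shrink,
                            struct shrink_control *sc)
{
        /* cheap count of freeable objects - no reclaim work here */
        return foo_cache_nr_freeable();
}

static long foo_cache_scan(struct shrinker *shrink,
                           struct shrink_control *sc)
{
        long freed = 0;

        if (!(sc->gfp_mask & __GFP_FS))
                return -1;      /* can't do any scanning in this context */

        while (sc->nr_to_scan-- > 0) {
                if (!foo_cache_free_one())
                        break;
                freed++;
        }

        /* report the number of objects actually freed by this pass */
        return freed;
}

static struct shrinker foo_shrinker = {
        .count_objects  = foo_cache_count,
        .scan_objects   = foo_cache_scan,
        .seeks          = DEFAULT_SEEKS,
};

The shrinker is then registered with register_shrinker() exactly as
before; only the callbacks change.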

[PATCH 06/19] list: add a new LRU list type
[PATCH 07/19] inode: convert inode lru list to generic lru list
[PATCH 08/19] dcache: convert to use new lru list infrastructure

These add the generic LRU list API and infrastructure and convert
the inode and dentry caches to use it. This is still just a single
global list per LRU at this point, so it's really only changing
where the LRU implementation lives rather than the fundamental
algorithm. It does, however, introduce a new method of walking the
LRU lists and building the dispose list of items for shrinkers, but
because we are still dealing with a global list the algorithmic
changes are minimal.
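
As a rough sketch of how a cache ends up using this infrastructure
(the "foo" object type and its f_lru field are hypothetical; the
list_lru calls are the ones added in patch 06, quoted in full further
down):

static struct list_lru foo_lru;

static void foo_cache_init(void)
{
        list_lru_init(&foo_lru);
}

static void foo_mark_unused(struct foo *f)
{
        /* returns 1 if added, 0 if the item was already on the LRU */
        list_lru_add(&foo_lru, &f->f_lru);
}

static int foo_isolate(struct list_head *item, spinlock_t *lru_lock,
                       void *cb_arg)
{
        struct list_head *dispose = cb_arg;

        /* called with the lru lock held; move the item off the LRU */
        list_move(item, dispose);
        return 0;       /* 0 tells the walker the item was removed */
}

static long foo_shrink(long nr_to_scan)
{
        LIST_HEAD(dispose);
        long freed;

        freed = list_lru_walk(&foo_lru, foo_isolate, &dispose, nr_to_scan);
        /* free everything on the dispose list outside the lru lock */
        return freed;
}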

[PATCH 09/19] list_lru: per-node list infrastructure

This makes the generic LRU list much more scalable by changing it to
a {list,lock,count} tuple per node. There are no external API
changes in this changeover, so it is transparent to current users.
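
Roughly speaking, the layout described above looks something like the
following sketch (field names are illustrative only - patch 09 itself
isn't quoted in this mail):

struct list_lru_node {
        spinlock_t              lock;
        struct list_head        list;
        long                    nr_items;
} ____cacheline_aligned_in_smp;

struct list_lru {
        struct list_lru_node    node[MAX_NUMNODES];
        nodemask_t              active_nodes;   /* nodes with items on them */
};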

[PATCH 10/19] shrinker: add node awareness
[PATCH 11/19] fs: convert inode and dentry shrinking to be node

Adds a nodemask to the struct shrink_control for callers of
shrink_slab to set appropriately for their reclaim context. This
nodemask is then passed by the inode and dentry cache reclaim code
to the generic LRU list code to implement node aware shrinking.
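
For example, a reclaim caller that is restricted to a single node
sets up the shrink_control along these lines; this mirrors the
mm/vmscan.c hunks in patch 10, with "shrink", "zone", "sc" and
"lru_pages" coming from the surrounding reclaim context:

        nodes_clear(shrink.nodes_to_scan);
        node_set(zone_to_nid(zone), shrink.nodes_to_scan);
        shrink_slab(&shrink, sc.nr_scanned, lru_pages);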

What this doesn't do is convert the internal shrink_slab() algorithm
to be node aware. I'm not sure what the best approach is here, but
it strikes me that it should really be calculating and keeping track
of scan counts and pressure on a per-node basis. The current code
seems to work OK at the moment, though.

[PATCH 12/19] xfs: convert buftarg LRU to generic code
[PATCH 13/19] xfs: Node aware direct inode reclaim
[PATCH 14/19] xfs: use generic AG walk for background inode reclaim
[PATCH 15/19] xfs: convert dquot cache lru to list_lru

These patches convert all the XFS LRUs and shrinkers to be node
aware. This gets rid of a lot of hairy, special-case code in the
inode cache shrinker for avoiding concurrent shrinker contention and
throttling direct reclaim to prevent premature OOM conditions.
Removing this code greatly simplifies inode cache reclaim whilst
reducing overhead and improving performance. In all, it converts
three separate caches and shrinkers to use the generic LRU lists and
pass nodemasks around appropriately.

This is how I've really tested the code - lots of interesting
filesystem workloads that generate simultaneous slab and page cache
pressure on VMs with and without fake_numa configs....

[PATCH 16/19] fs: convert fs shrinkers to new scan/count API
[PATCH 17/19] drivers: convert shrinkers to new count/scan API
[PATCH 18/19] shrinker: convert remaining shrinkers to count/scan
[PATCH 19/19] shrinker: Kill old ->shrink API.

These last three patches convert all the other shrinker
implementations to the new count/scan API. The fs, android and dm
shrinkers are pretty well behaved and are implemented as expected
for their intended purposes. The driver and staging code, however,
is basically a bunch of hacks to try to do something resembling
reclaim when a shrinker tells it there is memory pressure. Looking
at all the driver and staging code is not an exercise I recommend if
you value your eyes and/or your sanity.

I haven't even tried to compile this on a CONFIG_SMP=n
configuration, nor have I done extensive allmod style build tests
and it's only been built and tested on x86-64. That said, apart from
CONFIG_SMP=n, I don't see there being any major problems here.

There's still a bunch of cleanup work needed. e.g. the LRU list
walk/isolation code needs to use enums for the isolate callback
return code; there needs to be a generic list_lru_for_each() style
function for walking all the objects in the cache (which will allow
the list_lru structures to be used for things like the per-sb inode
list). Indeed, even the name "list_lru" is probably something that
should be changed - I think the list has become more of a general
per-node list than its initial functionality as a scalable LRU list
implementation and I can see uses for it outside of LRUs...

Comments, thoughts and flames all welcome.

Cheers,

Dave.


2012-11-27 23:15:01

by Dave Chinner

Subject: [PATCH 06/19] list: add a new LRU list type

From: Dave Chinner <[email protected]>

Several subsystems use the same construct for LRU lists - a list
head, a spin lock and an item count. They also use exactly the same
code for adding and removing items from the LRU. Create a generic
type for these LRU lists.

This is the beginning of generic, node aware LRUs for shrinkers to
work with.

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/list_lru.h | 36 ++++++++++++++
lib/Makefile | 2 +-
lib/list_lru.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 154 insertions(+), 1 deletion(-)
create mode 100644 include/linux/list_lru.h
create mode 100644 lib/list_lru.c

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
new file mode 100644
index 0000000..3423949
--- /dev/null
+++ b/include/linux/list_lru.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#ifndef _LRU_LIST_H
+#define _LRU_LIST_H 0
+
+#include <linux/list.h>
+
+struct list_lru {
+ spinlock_t lock;
+ struct list_head list;
+ long nr_items;
+};
+
+int list_lru_init(struct list_lru *lru);
+int list_lru_add(struct list_lru *lru, struct list_head *item);
+int list_lru_del(struct list_lru *lru, struct list_head *item);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+ return lru->nr_items;
+}
+
+typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
+ void *cb_arg);
+typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);
+
+long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+ void *cb_arg, long nr_to_walk);
+
+long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);
+
+#endif /* _LRU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 821a162..a0849d7 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -12,7 +12,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
idr.o int_sqrt.o extable.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
- is_single_threaded.o plist.o decompress.o
+ is_single_threaded.o plist.o decompress.o list_lru.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/list_lru.c b/lib/list_lru.c
new file mode 100644
index 0000000..475d0e9
--- /dev/null
+++ b/lib/list_lru.c
@@ -0,0 +1,117 @@
+/*
+ * Copyright (c) 2010-2012 Red Hat, Inc. All rights reserved.
+ * Author: David Chinner
+ *
+ * Generic LRU infrastructure
+ */
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/list_lru.h>
+
+int
+list_lru_add(
+ struct list_lru *lru,
+ struct list_head *item)
+{
+ spin_lock(&lru->lock);
+ if (list_empty(item)) {
+ list_add_tail(item, &lru->list);
+ lru->nr_items++;
+ spin_unlock(&lru->lock);
+ return 1;
+ }
+ spin_unlock(&lru->lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_add);
+
+int
+list_lru_del(
+ struct list_lru *lru,
+ struct list_head *item)
+{
+ spin_lock(&lru->lock);
+ if (!list_empty(item)) {
+ list_del_init(item);
+ lru->nr_items--;
+ spin_unlock(&lru->lock);
+ return 1;
+ }
+ spin_unlock(&lru->lock);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_del);
+
+long
+list_lru_walk(
+ struct list_lru *lru,
+ list_lru_walk_cb isolate,
+ void *cb_arg,
+ long nr_to_walk)
+{
+ struct list_head *item, *n;
+ long removed = 0;
+restart:
+ spin_lock(&lru->lock);
+ list_for_each_safe(item, n, &lru->list) {
+ int ret;
+
+ if (nr_to_walk-- < 0)
+ break;
+
+ ret = isolate(item, &lru->lock, cb_arg);
+ switch (ret) {
+ case 0: /* item removed from list */
+ lru->nr_items--;
+ removed++;
+ break;
+ case 1: /* item referenced, give another pass */
+ list_move_tail(item, &lru->list);
+ break;
+ case 2: /* item cannot be locked, skip */
+ break;
+ case 3: /* item not freeable, lock dropped */
+ goto restart;
+ default:
+ BUG();
+ }
+ }
+ spin_unlock(&lru->lock);
+ return removed;
+}
+EXPORT_SYMBOL_GPL(list_lru_walk);
+
+long
+list_lru_dispose_all(
+ struct list_lru *lru,
+ list_lru_dispose_cb dispose)
+{
+ long disposed = 0;
+ LIST_HEAD(dispose_list);
+
+ spin_lock(&lru->lock);
+ while (!list_empty(&lru->list)) {
+ list_splice_init(&lru->list, &dispose_list);
+ disposed += lru->nr_items;
+ lru->nr_items = 0;
+ spin_unlock(&lru->lock);
+
+ dispose(&dispose_list);
+
+ spin_lock(&lru->lock);
+ }
+ spin_unlock(&lru->lock);
+ return disposed;
+}
+
+int
+list_lru_init(
+ struct list_lru *lru)
+{
+ spin_lock_init(&lru->lock);
+ INIT_LIST_HEAD(&lru->list);
+ lru->nr_items = 0;
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(list_lru_init);
--
1.7.10

2012-11-27 23:15:18

by Dave Chinner

Subject: [PATCH 16/19] fs: convert fs shrinkers to new scan/count API

From: Dave Chinner <[email protected]>

Convert the filesystem shrinkers to use the new API, and standardise
some of the behaviours of the shrinkers at the same time. For
example, nr_to_scan means the number of objects to scan, not the
number of objects to free.

I refactored the CIFS idmap shrinker a little - it really needs to
be broken up into a shrinker per tree, keeping an item count with
the tree root so that we don't need to walk the tree every time the
shrinker needs to count the number of objects in the tree (i.e.
all the time under memory pressure).

Signed-off-by: Dave Chinner <[email protected]>
---
fs/cifs/cifsacl.c | 112 ++++++++++++++++++++++++++++++++---------------------
fs/gfs2/glock.c | 23 ++++++-----
fs/gfs2/main.c | 3 +-
fs/gfs2/quota.c | 12 +++---
fs/gfs2/quota.h | 4 +-
fs/mbcache.c | 53 ++++++++++++++-----------
fs/nfs/dir.c | 21 ++++++++--
fs/nfs/internal.h | 4 +-
fs/nfs/super.c | 3 +-
fs/quota/dquot.c | 37 ++++++++----------
10 files changed, 163 insertions(+), 109 deletions(-)

diff --git a/fs/cifs/cifsacl.c b/fs/cifs/cifsacl.c
index 0fb15bb..a0e5c22 100644
--- a/fs/cifs/cifsacl.c
+++ b/fs/cifs/cifsacl.c
@@ -44,66 +44,95 @@ static const struct cifs_sid sid_user = {1, 2 , {0, 0, 0, 0, 0, 5}, {} };

const struct cred *root_cred;

-static void
-shrink_idmap_tree(struct rb_root *root, int nr_to_scan, int *nr_rem,
- int *nr_del)
+static long
+cifs_idmap_tree_scan(
+ struct rb_root *root,
+ spinlock_t *tree_lock,
+ long nr_to_scan)
{
struct rb_node *node;
- struct rb_node *tmp;
- struct cifs_sid_id *psidid;
+ long freed = 0;

+ spin_lock(tree_lock);
node = rb_first(root);
- while (node) {
+ while (nr_to_scan-- >= 0 && node) {
+ struct cifs_sid_id *psidid;
+ struct rb_node *tmp;
+
tmp = node;
node = rb_next(tmp);
psidid = rb_entry(tmp, struct cifs_sid_id, rbnode);
- if (nr_to_scan == 0 || *nr_del == nr_to_scan)
- ++(*nr_rem);
- else {
- if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
- && psidid->refcount == 0) {
- rb_erase(tmp, root);
- ++(*nr_del);
- } else
- ++(*nr_rem);
+ if (time_after(jiffies, psidid->time + SID_MAP_EXPIRE)
+ && psidid->refcount == 0) {
+ rb_erase(tmp, root);
+ freed++;
}
}
+ spin_unlock(tree_lock);
+ return freed;
+}
+
+static long
+cifs_idmap_tree_count(
+ struct rb_root *root,
+ spinlock_t *tree_lock)
+{
+ struct rb_node *node;
+ long count = 0;
+
+ spin_lock(tree_lock);
+ node = rb_first(root);
+ while (node) {
+ node = rb_next(node);
+ count++;
+ }
+ spin_unlock(tree_lock);
+ return count;
}

/*
- * Run idmap cache shrinker.
+ * idmap tree shrinker.
+ *
+ * XXX (dchinner): this should really be 4 separate shrinker instances (one per
+ * tree structure) so that each are shrunk proportionally to their individual
+ * sizes.
*/
-static int
-cifs_idmap_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+cifs_idmap_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
- int nr_to_scan = sc->nr_to_scan;
- int nr_del = 0;
- int nr_rem = 0;
- struct rb_root *root;
+ long freed = 0;

- root = &uidtree;
- spin_lock(&siduidlock);
- shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
- spin_unlock(&siduidlock);
+ freed += cifs_idmap_tree_scan(&uidtree, &siduidlock, sc->nr_to_scan);
+ freed += cifs_idmap_tree_scan(&gidtree, &sidgidlock, sc->nr_to_scan);
+ freed += cifs_idmap_tree_scan(&siduidtree, &siduidlock, sc->nr_to_scan);
+ freed += cifs_idmap_tree_scan(&sidgidtree, &sidgidlock, sc->nr_to_scan);

- root = &gidtree;
- spin_lock(&sidgidlock);
- shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
- spin_unlock(&sidgidlock);
+ return freed;
+}

- root = &siduidtree;
- spin_lock(&uidsidlock);
- shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
- spin_unlock(&uidsidlock);
+static long
+cifs_idmap_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ long count = 0;

- root = &sidgidtree;
- spin_lock(&gidsidlock);
- shrink_idmap_tree(root, nr_to_scan, &nr_rem, &nr_del);
- spin_unlock(&gidsidlock);
+ count += cifs_idmap_tree_count(&uidtree, &siduidlock);
+ count += cifs_idmap_tree_count(&gidtree, &sidgidlock);
+ count += cifs_idmap_tree_count(&siduidtree, &siduidlock);
+ count += cifs_idmap_tree_count(&sidgidtree, &sidgidlock);

- return nr_rem;
+ return count;
}

+static struct shrinker cifs_shrinker = {
+ .count_objects = cifs_idmap_shrink_count,
+ .scan_objects = cifs_idmap_shrink_scan,
+ .seeks = DEFAULT_SEEKS,
+};
+
static void
sid_rb_insert(struct rb_root *root, unsigned long cid,
struct cifs_sid_id **psidid, char *typestr)
@@ -161,11 +190,6 @@ sid_rb_search(struct rb_root *root, unsigned long cid)
return NULL;
}

-static struct shrinker cifs_shrinker = {
- .shrink = cifs_idmap_shrinker,
- .seeks = DEFAULT_SEEKS,
-};
-
static int
cifs_idmap_key_instantiate(struct key *key, struct key_preparsed_payload *prep)
{
diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index e6c2fd5..4be61a22e 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1365,9 +1365,8 @@ void gfs2_glock_complete(struct gfs2_glock *gl, int ret)
gfs2_glock_put(gl);
}

-
-static int gfs2_shrink_glock_memory(struct shrinker *shrink,
- struct shrink_control *sc)
+static long gfs2_glock_shrink_scan(struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct gfs2_glock *gl;
int may_demote;
@@ -1375,15 +1374,13 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
int nr = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
LIST_HEAD(skipped);
-
- if (nr == 0)
- goto out;
+ long freed = 0;

if (!(gfp_mask & __GFP_FS))
return -1;

spin_lock(&lru_lock);
- while(nr && !list_empty(&lru_list)) {
+ while(nr-- >= 0 && !list_empty(&lru_list)) {
gl = list_entry(lru_list.next, struct gfs2_glock, gl_lru);
list_del_init(&gl->gl_lru);
clear_bit(GLF_LRU, &gl->gl_flags);
@@ -1397,7 +1394,7 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
may_demote = demote_ok(gl);
if (may_demote) {
handle_callback(gl, LM_ST_UNLOCKED, 0);
- nr--;
+ freed++;
}
clear_bit(GLF_LOCK, &gl->gl_flags);
smp_mb__after_clear_bit();
@@ -1414,12 +1411,18 @@ static int gfs2_shrink_glock_memory(struct shrinker *shrink,
list_splice(&skipped, &lru_list);
atomic_add(nr_skipped, &lru_count);
spin_unlock(&lru_lock);
-out:
+ return freed;
+}
+
+static long gfs2_glock_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
return (atomic_read(&lru_count) / 100) * sysctl_vfs_cache_pressure;
}

static struct shrinker glock_shrinker = {
- .shrink = gfs2_shrink_glock_memory,
+ .count_objects = gfs2_glock_shrink_count,
+ .scan_objects = gfs2_glock_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

diff --git a/fs/gfs2/main.c b/fs/gfs2/main.c
index e04d0e0..a105d84 100644
--- a/fs/gfs2/main.c
+++ b/fs/gfs2/main.c
@@ -32,7 +32,8 @@
struct workqueue_struct *gfs2_control_wq;

static struct shrinker qd_shrinker = {
- .shrink = gfs2_shrink_qd_memory,
+ .count_objects = gfs2_qd_shrink_count,
+ .scan_objects = gfs2_qd_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index c5af8e1..f4bf289 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -78,14 +78,12 @@ static LIST_HEAD(qd_lru_list);
static atomic_t qd_lru_count = ATOMIC_INIT(0);
static DEFINE_SPINLOCK(qd_lru_lock);

-int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
+long gfs2_qd_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
struct gfs2_quota_data *qd;
struct gfs2_sbd *sdp;
int nr_to_scan = sc->nr_to_scan;
-
- if (nr_to_scan == 0)
- goto out;
+ long freed = 0;

if (!(sc->gfp_mask & __GFP_FS))
return -1;
@@ -113,10 +111,14 @@ int gfs2_shrink_qd_memory(struct shrinker *shrink, struct shrink_control *sc)
kmem_cache_free(gfs2_quotad_cachep, qd);
spin_lock(&qd_lru_lock);
nr_to_scan--;
+ freed++;
}
spin_unlock(&qd_lru_lock);
+ return freed;
+}

-out:
+long gfs2_qd_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
return (atomic_read(&qd_lru_count) * sysctl_vfs_cache_pressure) / 100;
}

diff --git a/fs/gfs2/quota.h b/fs/gfs2/quota.h
index f25d98b..e0bd306 100644
--- a/fs/gfs2/quota.h
+++ b/fs/gfs2/quota.h
@@ -52,7 +52,9 @@ static inline int gfs2_quota_lock_check(struct gfs2_inode *ip)
return ret;
}

-extern int gfs2_shrink_qd_memory(struct shrinker *shrink,
+extern long gfs2_qd_shrink_count(struct shrinker *shrink,
+ struct shrink_control *sc);
+extern long gfs2_qd_shrink_scan(struct shrinker *shrink,
struct shrink_control *sc);
extern const struct quotactl_ops gfs2_quotactl_ops;

diff --git a/fs/mbcache.c b/fs/mbcache.c
index 8c32ef3..04bf2fb 100644
--- a/fs/mbcache.c
+++ b/fs/mbcache.c
@@ -86,18 +86,6 @@ static LIST_HEAD(mb_cache_list);
static LIST_HEAD(mb_cache_lru_list);
static DEFINE_SPINLOCK(mb_cache_spinlock);

-/*
- * What the mbcache registers as to get shrunk dynamically.
- */
-
-static int mb_cache_shrink_fn(struct shrinker *shrink,
- struct shrink_control *sc);
-
-static struct shrinker mb_cache_shrinker = {
- .shrink = mb_cache_shrink_fn,
- .seeks = DEFAULT_SEEKS,
-};
-
static inline int
__mb_cache_entry_is_hashed(struct mb_cache_entry *ce)
{
@@ -151,7 +139,7 @@ forget:


/*
- * mb_cache_shrink_fn() memory pressure callback
+ * mb_cache_shrink_scan() memory pressure callback
*
* This function is called by the kernel memory management when memory
* gets low.
@@ -159,17 +147,18 @@ forget:
* @shrink: (ignored)
* @sc: shrink_control passed from reclaim
*
- * Returns the number of objects which are present in the cache.
+ * Returns the number of objects freed.
*/
-static int
-mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
+static long
+mb_cache_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
LIST_HEAD(free_list);
- struct mb_cache *cache;
struct mb_cache_entry *entry, *tmp;
- int count = 0;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
+ long freed = 0;

mb_debug("trying to free %d entries", nr_to_scan);
spin_lock(&mb_cache_spinlock);
@@ -179,19 +168,39 @@ mb_cache_shrink_fn(struct shrinker *shrink, struct shrink_control *sc)
struct mb_cache_entry, e_lru_list);
list_move_tail(&ce->e_lru_list, &free_list);
__mb_cache_entry_unhash(ce);
+ freed++;
+ }
+ spin_unlock(&mb_cache_spinlock);
+ list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
+ __mb_cache_entry_forget(entry, gfp_mask);
}
+ return freed;
+}
+
+static long
+mb_cache_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct mb_cache *cache;
+ long count = 0;
+
+ spin_lock(&mb_cache_spinlock);
list_for_each_entry(cache, &mb_cache_list, c_cache_list) {
mb_debug("cache %s (%d)", cache->c_name,
atomic_read(&cache->c_entry_count));
count += atomic_read(&cache->c_entry_count);
}
spin_unlock(&mb_cache_spinlock);
- list_for_each_entry_safe(entry, tmp, &free_list, e_lru_list) {
- __mb_cache_entry_forget(entry, gfp_mask);
- }
- return (count / 100) * sysctl_vfs_cache_pressure;
+ return count;
}

+static struct shrinker mb_cache_shrinker = {
+ .count_objects = mb_cache_shrink_count,
+ .scan_objects = mb_cache_shrink_scan,
+ .seeks = DEFAULT_SEEKS,
+};
+

/*
* mb_cache_create() create a new cache
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index ce8cb92..15f6fbb 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1909,17 +1909,20 @@ static void nfs_access_free_list(struct list_head *head)
}
}

-int nfs_access_cache_shrinker(struct shrinker *shrink,
- struct shrink_control *sc)
+long
+nfs_access_cache_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
LIST_HEAD(head);
struct nfs_inode *nfsi, *next;
struct nfs_access_entry *cache;
int nr_to_scan = sc->nr_to_scan;
gfp_t gfp_mask = sc->gfp_mask;
+ long freed = 0;

if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
- return (nr_to_scan == 0) ? 0 : -1;
+ return -1;

spin_lock(&nfs_access_lru_lock);
list_for_each_entry_safe(nfsi, next, &nfs_access_lru_list, access_cache_inode_lru) {
@@ -1935,6 +1938,7 @@ int nfs_access_cache_shrinker(struct shrinker *shrink,
struct nfs_access_entry, lru);
list_move(&cache->lru, &head);
rb_erase(&cache->rb_node, &nfsi->access_cache);
+ freed++;
if (!list_empty(&nfsi->access_cache_entry_lru))
list_move_tail(&nfsi->access_cache_inode_lru,
&nfs_access_lru_list);
@@ -1949,7 +1953,16 @@ remove_lru_entry:
}
spin_unlock(&nfs_access_lru_lock);
nfs_access_free_list(&head);
- return (atomic_long_read(&nfs_access_nr_entries) / 100) * sysctl_vfs_cache_pressure;
+ return freed;
+}
+
+long
+nfs_access_cache_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ return (atomic_long_read(&nfs_access_nr_entries) / 100) *
+ sysctl_vfs_cache_pressure;
}

static void __nfs_access_zap_cache(struct nfs_inode *nfsi, struct list_head *head)
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 05521ca..6b9c45b 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -285,7 +285,9 @@ extern struct nfs_client *nfs_init_client(struct nfs_client *clp,
const char *ip_addr, rpc_authflavor_t authflavour);

/* dir.c */
-extern int nfs_access_cache_shrinker(struct shrinker *shrink,
+extern long nfs_access_cache_count(struct shrinker *shrink,
+ struct shrink_control *sc);
+extern long nfs_access_cache_scan(struct shrinker *shrink,
struct shrink_control *sc);
struct dentry *nfs_lookup(struct inode *, struct dentry *, unsigned int);
int nfs_create(struct inode *, struct dentry *, umode_t, bool);
diff --git a/fs/nfs/super.c b/fs/nfs/super.c
index 652d3f7..e7dd232 100644
--- a/fs/nfs/super.c
+++ b/fs/nfs/super.c
@@ -354,7 +354,8 @@ static void unregister_nfs4_fs(void)
#endif

static struct shrinker acl_shrinker = {
- .shrink = nfs_access_cache_shrinker,
+ .count_objects = nfs_access_cache_count,
+ .scan_objects = nfs_access_cache_scan,
.seeks = DEFAULT_SEEKS,
};

diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 05ae3c9..544bd65 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -687,45 +687,42 @@ int dquot_quota_sync(struct super_block *sb, int type)
}
EXPORT_SYMBOL(dquot_quota_sync);

-/* Free unused dquots from cache */
-static void prune_dqcache(int count)
+static long
+dqcache_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct list_head *head;
struct dquot *dquot;
+ long freed = 0;

head = free_dquots.prev;
- while (head != &free_dquots && count) {
+ while (head != &free_dquots && sc->nr_to_scan) {
dquot = list_entry(head, struct dquot, dq_free);
remove_dquot_hash(dquot);
remove_free_dquot(dquot);
remove_inuse(dquot);
do_destroy_dquot(dquot);
- count--;
+ sc->nr_to_scan--;
+ freed++;
head = free_dquots.prev;
}
+ return freed;
}

-/*
- * This is called from kswapd when we think we need some
- * more memory
- */
-static int shrink_dqcache_memory(struct shrinker *shrink,
- struct shrink_control *sc)
+static long
+dqcache_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
- int nr = sc->nr_to_scan;
-
- if (nr) {
- spin_lock(&dq_list_lock);
- prune_dqcache(nr);
- spin_unlock(&dq_list_lock);
- }
- return ((unsigned)
+ return ((long)
percpu_counter_read_positive(&dqstats.counter[DQST_FREE_DQUOTS])
- /100) * sysctl_vfs_cache_pressure;
+ / 100) * sysctl_vfs_cache_pressure;
}

static struct shrinker dqcache_shrinker = {
- .shrink = shrink_dqcache_memory,
+ .count_objects = dqcache_shrink_count,
+ .scan_objects = dqcache_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

--
1.7.10

2012-11-27 23:15:29

by Dave Chinner

Subject: [PATCH 14/19] xfs: use generic AG walk for background inode reclaim

From: Dave Chinner <[email protected]>

The per-ag inode cache radix trees are not walked by the shrinkers
any more, so there is no need for a special walker that contained
heuristics to prevent multiple shrinker instances from colliding
with each other. Hence we can just remove that and convert the code
to use the generic walker.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_ag.h | 2 -
fs/xfs/xfs_icache.c | 217 +++++++++++-----------------------------------
fs/xfs/xfs_icache.h | 4 +-
fs/xfs/xfs_mount.c | 1 -
fs/xfs/xfs_qm_syscalls.c | 2 +-
5 files changed, 55 insertions(+), 171 deletions(-)

diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
index f2aeedb..40a7df9 100644
--- a/fs/xfs/xfs_ag.h
+++ b/fs/xfs/xfs_ag.h
@@ -218,8 +218,6 @@ typedef struct xfs_perag {
spinlock_t pag_ici_lock; /* incore inode cache lock */
struct radix_tree_root pag_ici_root; /* incore inode cache root */
int pag_ici_reclaimable; /* reclaimable inodes */
- struct mutex pag_ici_reclaim_lock; /* serialisation point */
- unsigned long pag_ici_reclaim_cursor; /* reclaim restart point */

/* buffer cache index */
spinlock_t pag_buf_lock; /* lock for pag_buf_tree */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 82b053f..5cfc2eb 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -468,7 +468,8 @@ out_error_or_again:

STATIC int
xfs_inode_ag_walk_grab(
- struct xfs_inode *ip)
+ struct xfs_inode *ip,
+ int flags)
{
struct inode *inode = VFS_I(ip);

@@ -517,6 +518,7 @@ STATIC int
xfs_inode_ag_walk(
struct xfs_mount *mp,
struct xfs_perag *pag,
+ int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip,
struct xfs_perag *pag, int flags,
void *args),
@@ -530,6 +532,9 @@ xfs_inode_ag_walk(
int done;
int nr_found;

+ if (!grab)
+ grab = xfs_inode_ag_walk_grab;
+
restart:
done = 0;
skipped = 0;
@@ -564,7 +569,7 @@ restart:
for (i = 0; i < nr_found; i++) {
struct xfs_inode *ip = batch[i];

- if (done || xfs_inode_ag_walk_grab(ip))
+ if (done || grab(ip, flags))
batch[i] = NULL;

/*
@@ -593,7 +598,8 @@ restart:
if (!batch[i])
continue;
error = execute(batch[i], pag, flags, args);
- IRELE(batch[i]);
+ if (grab == xfs_inode_ag_walk_grab)
+ IRELE(batch[i]);
if (error == EAGAIN) {
skipped++;
continue;
@@ -617,35 +623,10 @@ restart:
return last_error;
}

-/*
- * Background scanning to trim post-EOF preallocated space. This is queued
- * based on the 'background_prealloc_discard_period' tunable (5m by default).
- */
-STATIC void
-xfs_queue_eofblocks(
- struct xfs_mount *mp)
-{
- rcu_read_lock();
- if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
- queue_delayed_work(mp->m_eofblocks_workqueue,
- &mp->m_eofblocks_work,
- msecs_to_jiffies(xfs_eofb_secs * 1000));
- rcu_read_unlock();
-}
-
-void
-xfs_eofblocks_worker(
- struct work_struct *work)
-{
- struct xfs_mount *mp = container_of(to_delayed_work(work),
- struct xfs_mount, m_eofblocks_work);
- xfs_icache_free_eofblocks(mp, NULL);
- xfs_queue_eofblocks(mp);
-}
-
int
xfs_inode_ag_iterator(
struct xfs_mount *mp,
+ int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip,
struct xfs_perag *pag, int flags,
void *args),
@@ -660,7 +641,8 @@ xfs_inode_ag_iterator(
ag = 0;
while ((pag = xfs_perag_get(mp, ag))) {
ag = pag->pag_agno + 1;
- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, -1);
+ error = xfs_inode_ag_walk(mp, pag, grab, execute,
+ flags, args, -1);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -674,6 +656,7 @@ xfs_inode_ag_iterator(
int
xfs_inode_ag_iterator_tag(
struct xfs_mount *mp,
+ int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip,
struct xfs_perag *pag, int flags,
void *args),
@@ -689,7 +672,8 @@ xfs_inode_ag_iterator_tag(
ag = 0;
while ((pag = xfs_perag_get_tag(mp, ag, tag))) {
ag = pag->pag_agno + 1;
- error = xfs_inode_ag_walk(mp, pag, execute, flags, args, tag);
+ error = xfs_inode_ag_walk(mp, pag, grab, execute,
+ flags, args, tag);
xfs_perag_put(pag);
if (error) {
last_error = error;
@@ -904,7 +888,8 @@ STATIC int
xfs_reclaim_inode(
struct xfs_inode *ip,
struct xfs_perag *pag,
- int sync_mode)
+ int sync_mode,
+ void *args)
{
struct xfs_buf *bp = NULL;
int error;
@@ -1032,140 +1017,14 @@ out:
return 0;
}

-/*
- * Walk the AGs and reclaim the inodes in them. Even if the filesystem is
- * corrupted, we still want to try to reclaim all the inodes. If we don't,
- * then a shut down during filesystem unmount reclaim walk leak all the
- * unreclaimed inodes.
- */
-STATIC int
-xfs_reclaim_inodes_ag(
- struct xfs_mount *mp,
- int flags,
- long *nr_to_scan)
-{
- struct xfs_perag *pag;
- int error = 0;
- int last_error = 0;
- xfs_agnumber_t ag;
- int trylock = flags & SYNC_TRYLOCK;
- int skipped;
-
-restart:
- ag = 0;
- skipped = 0;
- while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
- unsigned long first_index = 0;
- int done = 0;
- int nr_found = 0;
-
- ag = pag->pag_agno + 1;
-
- if (trylock) {
- if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
- skipped++;
- xfs_perag_put(pag);
- continue;
- }
- first_index = pag->pag_ici_reclaim_cursor;
- } else
- mutex_lock(&pag->pag_ici_reclaim_lock);
-
- do {
- struct xfs_inode *batch[XFS_LOOKUP_BATCH];
- int i;
-
- rcu_read_lock();
- nr_found = radix_tree_gang_lookup_tag(
- &pag->pag_ici_root,
- (void **)batch, first_index,
- XFS_LOOKUP_BATCH,
- XFS_ICI_RECLAIM_TAG);
- if (!nr_found) {
- done = 1;
- rcu_read_unlock();
- break;
- }
-
- /*
- * Grab the inodes before we drop the lock. if we found
- * nothing, nr == 0 and the loop will be skipped.
- */
- for (i = 0; i < nr_found; i++) {
- struct xfs_inode *ip = batch[i];
-
- if (done || xfs_reclaim_inode_grab(ip, flags))
- batch[i] = NULL;
-
- /*
- * Update the index for the next lookup. Catch
- * overflows into the next AG range which can
- * occur if we have inodes in the last block of
- * the AG and we are currently pointing to the
- * last inode.
- *
- * Because we may see inodes that are from the
- * wrong AG due to RCU freeing and
- * reallocation, only update the index if it
- * lies in this AG. It was a race that lead us
- * to see this inode, so another lookup from
- * the same index will not find it again.
- */
- if (XFS_INO_TO_AGNO(mp, ip->i_ino) !=
- pag->pag_agno)
- continue;
- first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1);
- if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino))
- done = 1;
- }
-
- /* unlock now we've grabbed the inodes. */
- rcu_read_unlock();
-
- for (i = 0; i < nr_found; i++) {
- if (!batch[i])
- continue;
- error = xfs_reclaim_inode(batch[i], pag, flags);
- if (error && last_error != EFSCORRUPTED)
- last_error = error;
- }
-
- *nr_to_scan -= XFS_LOOKUP_BATCH;
-
- cond_resched();
-
- } while (nr_found && !done && *nr_to_scan > 0);
-
- if (trylock && !done)
- pag->pag_ici_reclaim_cursor = first_index;
- else
- pag->pag_ici_reclaim_cursor = 0;
- mutex_unlock(&pag->pag_ici_reclaim_lock);
- xfs_perag_put(pag);
- }
-
- /*
- * if we skipped any AG, and we still have scan count remaining, do
- * another pass this time using blocking reclaim semantics (i.e
- * waiting on the reclaim locks and ignoring the reclaim cursors). This
- * ensure that when we get more reclaimers than AGs we block rather
- * than spin trying to execute reclaim.
- */
- if (skipped && (flags & SYNC_WAIT) && *nr_to_scan > 0) {
- trylock = 0;
- goto restart;
- }
- return XFS_ERROR(last_error);
-}
-
int
xfs_reclaim_inodes(
- xfs_mount_t *mp,
- int mode)
+ struct xfs_mount *mp,
+ int flags)
{
- long nr_to_scan = LONG_MAX;
-
- return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
+ return xfs_inode_ag_iterator_tag(mp, xfs_reclaim_inode_grab,
+ xfs_reclaim_inode, flags,
+ NULL, XFS_ICI_RECLAIM_TAG);
}

static int
@@ -1229,13 +1088,39 @@ xfs_reclaim_inodes_nr(

pag = xfs_perag_get(mp,
XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino));
- xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT);
+ xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT, NULL);
xfs_perag_put(pag);
}

return freed;
}

+/*
+ * Background scanning to trim post-EOF preallocated space. This is queued
+ * based on the 'background_prealloc_discard_period' tunable (5m by default).
+ */
+STATIC void
+xfs_queue_eofblocks(
+ struct xfs_mount *mp)
+{
+ rcu_read_lock();
+ if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_EOFBLOCKS_TAG))
+ queue_delayed_work(mp->m_eofblocks_workqueue,
+ &mp->m_eofblocks_work,
+ msecs_to_jiffies(xfs_eofb_secs * 1000));
+ rcu_read_unlock();
+}
+
+void
+xfs_eofblocks_worker(
+ struct work_struct *work)
+{
+ struct xfs_mount *mp = container_of(to_delayed_work(work),
+ struct xfs_mount, m_eofblocks_work);
+ xfs_icache_free_eofblocks(mp, NULL);
+ xfs_queue_eofblocks(mp);
+}
+
STATIC int
xfs_inode_match_id(
struct xfs_inode *ip,
@@ -1310,8 +1195,8 @@ xfs_icache_free_eofblocks(
if (eofb && (eofb->eof_flags & XFS_EOF_FLAGS_SYNC))
flags = SYNC_WAIT;

- return xfs_inode_ag_iterator_tag(mp, xfs_inode_free_eofblocks, flags,
- eofb, XFS_ICI_EOFBLOCKS_TAG);
+ return xfs_inode_ag_iterator_tag(mp, NULL, xfs_inode_free_eofblocks,
+ flags, eofb, XFS_ICI_EOFBLOCKS_TAG);
}

void
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index 4214518..a3380bf 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -29,7 +29,7 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,

void xfs_reclaim_worker(struct work_struct *work);

-int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
+int xfs_reclaim_inodes(struct xfs_mount *mp, int flags);
long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan,
nodemask_t *nodes_to_scan);

@@ -42,10 +42,12 @@ void xfs_eofblocks_worker(struct work_struct *);

int xfs_sync_inode_grab(struct xfs_inode *ip);
int xfs_inode_ag_iterator(struct xfs_mount *mp,
+ int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag,
int flags, void *args),
int flags, void *args);
int xfs_inode_ag_iterator_tag(struct xfs_mount *mp,
+ int (*grab)(struct xfs_inode *ip, int flags),
int (*execute)(struct xfs_inode *ip, struct xfs_perag *pag,
int flags, void *args),
int flags, void *args, int tag);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index da50846..6985a32 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -456,7 +456,6 @@ xfs_initialize_perag(
pag->pag_agno = index;
pag->pag_mount = mp;
spin_lock_init(&pag->pag_ici_lock);
- mutex_init(&pag->pag_ici_reclaim_lock);
INIT_RADIX_TREE(&pag->pag_ici_root, GFP_ATOMIC);
spin_lock_init(&pag->pag_buf_lock);
pag->pag_buf_tree = RB_ROOT;
diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 5f53e75..85294a6 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -883,5 +883,5 @@ xfs_qm_dqrele_all_inodes(
uint flags)
{
ASSERT(mp->m_quotainfo);
- xfs_inode_ag_iterator(mp, xfs_dqrele_inode, flags, NULL);
+ xfs_inode_ag_iterator(mp, NULL, xfs_dqrele_inode, flags, NULL);
}
--
1.7.10

2012-11-27 23:15:35

by Dave Chinner

Subject: [PATCH 19/19] shrinker: Kill old ->shrink API.

From: Dave Chinner <[email protected]>

There are no more users of this API, so kill it dead, dead, dead and
quietly bury the corpse in a shallow, unmarked grave in a dark
forest deep in the hills...

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/shrinker.h | 15 +++++----------
include/trace/events/vmscan.h | 4 ++--
mm/vmscan.c | 39 ++++++++-------------------------------
3 files changed, 15 insertions(+), 43 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index e71286f..d4636a0 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -7,14 +7,15 @@
*
* The 'gfpmask' refers to the allocation we are currently trying to
* fulfil.
- *
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
*/
struct shrink_control {
gfp_t gfp_mask;

- /* How many slab objects shrinker() should scan and try to reclaim */
+ /*
+ * How many objects scan_objects should scan and try to reclaim.
+ * This is reset before every call, so it is safe for callees
+ * to modify.
+ */
long nr_to_scan;

/* shrink from these nodes */
@@ -24,11 +25,6 @@ struct shrink_control {
/*
* A callback you can register to apply pressure to ageable caches.
*
- * @shrink() should look through the least-recently-used 'nr_to_scan' entries
- * and attempt to free them up. It should return the number of objects which
- * remain in the cache. If it returns -1, it means it cannot do any scanning at
- * this time (eg. there is a risk of deadlock).
- *
* @count_objects should return the number of freeable items in the cache. If
* there are no objects to free or the number of freeable items cannot be
* determined, it should return 0. No deadlock checks should be done during the
@@ -44,7 +40,6 @@ struct shrink_control {
* @scan_objects will be made from the current reclaim context.
*/
struct shrinker {
- int (*shrink)(struct shrinker *, struct shrink_control *sc);
long (*count_objects)(struct shrinker *, struct shrink_control *sc);
long (*scan_objects)(struct shrinker *, struct shrink_control *sc);

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 63cfccc..132a985 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -202,7 +202,7 @@ TRACE_EVENT(mm_shrink_slab_start,

TP_fast_assign(
__entry->shr = shr;
- __entry->shrink = shr->shrink;
+ __entry->shrink = shr->scan_objects;
__entry->nr_objects_to_shrink = nr_objects_to_shrink;
__entry->gfp_flags = sc->gfp_mask;
__entry->pgs_scanned = pgs_scanned;
@@ -241,7 +241,7 @@ TRACE_EVENT(mm_shrink_slab_end,

TP_fast_assign(
__entry->shr = shr;
- __entry->shrink = shr->shrink;
+ __entry->shrink = shr->scan_objects;
__entry->unused_scan = unused_scan_cnt;
__entry->new_scan = new_scan_cnt;
__entry->retval = shrinker_retval;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4a602ec..81731f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -176,14 +176,6 @@ void unregister_shrinker(struct shrinker *shrinker)
}
EXPORT_SYMBOL(unregister_shrinker);

-static inline int do_shrinker_shrink(struct shrinker *shrinker,
- struct shrink_control *sc,
- unsigned long nr_to_scan)
-{
- sc->nr_to_scan = nr_to_scan;
- return (*shrinker->shrink)(shrinker, sc);
-}
-
#define SHRINK_BATCH 128
/*
* Call the shrink functions to age shrinkable caches
@@ -229,11 +221,8 @@ unsigned long shrink_slab(struct shrink_control *sc,
long batch_size = shrinker->batch ? shrinker->batch
: SHRINK_BATCH;

- if (shrinker->scan_objects) {
- max_pass = shrinker->count_objects(shrinker, sc);
- WARN_ON(max_pass < 0);
- } else
- max_pass = do_shrinker_shrink(shrinker, sc, 0);
+ max_pass = shrinker->count_objects(shrinker, sc);
+ WARN_ON(max_pass < 0);
if (max_pass <= 0)
continue;

@@ -252,7 +241,7 @@ unsigned long shrink_slab(struct shrink_control *sc,
if (total_scan < 0) {
printk(KERN_ERR
"shrink_slab: %pF negative objects to delete nr=%ld\n",
- shrinker->shrink, total_scan);
+ shrinker->scan_objects, total_scan);
total_scan = max_pass;
}

@@ -286,24 +275,12 @@ unsigned long shrink_slab(struct shrink_control *sc,
while (total_scan >= batch_size) {
long ret;

- if (shrinker->scan_objects) {
- sc->nr_to_scan = batch_size;
- ret = shrinker->scan_objects(shrinker, sc);
+ sc->nr_to_scan = batch_size;
+ ret = shrinker->scan_objects(shrinker, sc);

- if (ret == -1)
- break;
- freed += ret;
- } else {
- int nr_before;
-
- nr_before = do_shrinker_shrink(shrinker, sc, 0);
- ret = do_shrinker_shrink(shrinker, sc,
- batch_size);
- if (ret == -1)
- break;
- if (ret < nr_before)
- freed += nr_before - ret;
- }
+ if (ret == -1)
+ break;
+ freed += ret;
count_vm_events(SLABS_SCANNED, batch_size);
total_scan -= batch_size;

--
1.7.10

2012-11-27 23:15:33

by Dave Chinner

Subject: [PATCH 18/19] shrinker: convert remaining shrinkers to count/scan API

From: Dave Chinner <[email protected]>

Convert the remaining couple of random shrinkers in the tree to the
new API.

Signed-off-by: Dave Chinner <[email protected]>
---
arch/x86/kvm/mmu.c | 35 +++++++++++++++++++++++++----------
net/sunrpc/auth.c | 45 +++++++++++++++++++++++++++++++--------------
2 files changed, 56 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6f85fe0..3dbc3c0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -4212,26 +4212,28 @@ restart:
spin_unlock(&kvm->mmu_lock);
}

-static void kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
+static long kvm_mmu_remove_some_alloc_mmu_pages(struct kvm *kvm,
struct list_head *invalid_list)
{
struct kvm_mmu_page *page;

if (list_empty(&kvm->arch.active_mmu_pages))
- return;
+ return 0;

page = container_of(kvm->arch.active_mmu_pages.prev,
struct kvm_mmu_page, link);
- kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
+ return kvm_mmu_prepare_zap_page(kvm, page, invalid_list);
}

-static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
+
+static long
+mmu_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct kvm *kvm;
int nr_to_scan = sc->nr_to_scan;
-
- if (nr_to_scan == 0)
- goto out;
+ long freed = 0;

raw_spin_lock(&kvm_lock);

@@ -4259,24 +4261,37 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);

- kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
+ freed += kvm_mmu_remove_some_alloc_mmu_pages(kvm, &invalid_list);
kvm_mmu_commit_zap_page(kvm, &invalid_list);

spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);

+ /*
+ * unfair on small ones
+ * per-vm shrinkers cry out
+ * sadness comes quickly
+ */
list_move_tail(&kvm->vm_list, &vm_list);
break;
}

raw_spin_unlock(&kvm_lock);
+ return freed;

-out:
+}
+
+static long
+mmu_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
return percpu_counter_read_positive(&kvm_total_used_mmu_pages);
}

static struct shrinker mmu_shrinker = {
- .shrink = mmu_shrink,
+ .count_objects = mmu_shrink_count,
+ .scan_objects = mmu_shrink_scan,
.seeks = DEFAULT_SEEKS * 10,
};

diff --git a/net/sunrpc/auth.c b/net/sunrpc/auth.c
index b5c067b..969c629 100644
--- a/net/sunrpc/auth.c
+++ b/net/sunrpc/auth.c
@@ -340,12 +340,13 @@ EXPORT_SYMBOL_GPL(rpcauth_destroy_credcache);
/*
* Remove stale credentials. Avoid sleeping inside the loop.
*/
-static int
+static long
rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
{
spinlock_t *cache_lock;
struct rpc_cred *cred, *next;
unsigned long expired = jiffies - RPC_AUTH_EXPIRY_MORATORIUM;
+ long freed = 0;

list_for_each_entry_safe(cred, next, &cred_unused, cr_lru) {

@@ -357,10 +358,11 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
*/
if (time_in_range(cred->cr_expire, expired, jiffies) &&
test_bit(RPCAUTH_CRED_HASHED, &cred->cr_flags) != 0)
- return 0;
+ break;

list_del_init(&cred->cr_lru);
number_cred_unused--;
+ freed++;
if (atomic_read(&cred->cr_count) != 0)
continue;

@@ -373,29 +375,43 @@ rpcauth_prune_expired(struct list_head *free, int nr_to_scan)
}
spin_unlock(cache_lock);
}
- return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
+ return freed;
}

/*
* Run memory cache shrinker.
*/
-static int
-rpcauth_cache_shrinker(struct shrinker *shrink, struct shrink_control *sc)
+static long
+rpcauth_cache_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+
{
LIST_HEAD(free);
- int res;
- int nr_to_scan = sc->nr_to_scan;
- gfp_t gfp_mask = sc->gfp_mask;
+ long freed;
+
+ if ((sc->gfp_mask & GFP_KERNEL) != GFP_KERNEL)
+ return -1;

- if ((gfp_mask & GFP_KERNEL) != GFP_KERNEL)
- return (nr_to_scan == 0) ? 0 : -1;
+ /* nothing left, don't come back */
if (list_empty(&cred_unused))
- return 0;
+ return -1;
+
spin_lock(&rpc_credcache_lock);
- res = rpcauth_prune_expired(&free, nr_to_scan);
+ freed = rpcauth_prune_expired(&free, sc->nr_to_scan);
spin_unlock(&rpc_credcache_lock);
rpcauth_destroy_credlist(&free);
- return res;
+
+ return freed;
+}
+
+static long
+rpcauth_cache_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+
+{
+ return (number_cred_unused / 100) * sysctl_vfs_cache_pressure;
}

/*
@@ -712,7 +728,8 @@ rpcauth_uptodatecred(struct rpc_task *task)
}

static struct shrinker rpc_cred_shrinker = {
- .shrink = rpcauth_cache_shrinker,
+ .count_objects = rpcauth_cache_shrink_count,
+ .scan_objects = rpcauth_cache_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

--
1.7.10

2012-11-27 23:15:27

by Dave Chinner

Subject: [PATCH 10/19] shrinker: add node awareness

From: Dave Chinner <[email protected]>

Pass the node of the current zone being reclaimed to shrink_slab(),
allowing the shrinker control nodemask to be set appropriately for
node aware shrinkers.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/drop_caches.c | 1 +
include/linux/shrinker.h | 3 +++
mm/memory-failure.c | 2 ++
mm/vmscan.c | 12 +++++++++---
4 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index c00e055..9fd702f 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -44,6 +44,7 @@ static void drop_slab(void)
.gfp_mask = GFP_KERNEL,
};

+ nodes_setall(shrink.nodes_to_scan);
do {
nr_objects = shrink_slab(&shrink, 1000, 1000);
} while (nr_objects > 10);
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 4f59615..e71286f 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -16,6 +16,9 @@ struct shrink_control {

/* How many slab objects shrinker() should scan and try to reclaim */
long nr_to_scan;
+
+ /* shrink from these nodes */
+ nodemask_t nodes_to_scan;
};

/*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..7bcbde4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -248,10 +248,12 @@ void shake_page(struct page *p, int access)
*/
if (access) {
int nr;
+ int nid = page_to_nid(p);
do {
struct shrink_control shrink = {
.gfp_mask = GFP_KERNEL,
};
+ node_set(nid, shrink.nodes_to_scan);

nr = shrink_slab(&shrink, 1000, 1000);
if (page_count(p) == 1)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55c4fc9..4a602ec 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2122,15 +2122,20 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
*/
if (global_reclaim(sc)) {
unsigned long lru_pages = 0;
+
+ nodes_clear(shrink->nodes_to_scan);
for_each_zone_zonelist(zone, z, zonelist,
gfp_zone(sc->gfp_mask)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;

lru_pages += zone_reclaimable_pages(zone);
+ node_set(zone_to_nid(zone),
+ shrink->nodes_to_scan);
}

shrink_slab(shrink, sc->nr_scanned, lru_pages);
+
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -2682,6 +2687,8 @@ loop_again:
shrink_zone(zone, &sc);

reclaim_state->reclaimed_slab = 0;
+ nodes_clear(shrink.nodes_to_scan);
+ node_set(zone_to_nid(zone), shrink.nodes_to_scan);
nr_slab = shrink_slab(&shrink, sc.nr_scanned, lru_pages);
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
@@ -3318,10 +3325,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
* number of slab pages and shake the slab until it is reduced
* by the same nr_pages that we used for reclaiming unmapped
* pages.
- *
- * Note that shrink_slab will free memory on all zones and may
- * take a long time.
*/
+ nodes_clear(shrink.nodes_to_scan);
+ node_set(zone_to_nid(zone), shrink.nodes_to_scan);
for (;;) {
unsigned long lru_pages = zone_reclaimable_pages(zone);

--
1.7.10

2012-11-27 23:16:30

by Dave Chinner

Subject: [PATCH 13/19] xfs: Node aware direct inode reclaim

From: Dave Chinner <[email protected]>

XFS currently only tracks inodes for reclaim via tag bits in the
inode cache radix tree. While this is awesome for background reclaim
because it allows inodes to be reclaimed in ascending disk offset
order, it sucks for direct memory reclaim which really is trying to
free the oldest inodes from memory.

As such, the direct reclaim code is a bit of a mess. It has all
sorts of heuristics code to try to avoid dueling shrinker problems
and to limit each radix tree to a single direct reclaim walker at a
time. We can do better.

Given that at the point in time that we mark an inode as under
reclaim, it has been evicted from the VFS inode cache, we can reuse
the struct inode LRU fields to hold our own reclaim ordered LRU
list. With the generic LRU code, it doesn't impact on scalability,
and the shrinker can walk the LRU lists directly giving us
node-aware inode cache reclaim.

This means that we get the best of both worlds - background reclaim
runs very efficiently in terms of IO for cleaning dirty reclaimable
inodes, while direct reclaim can walk the LRU lists and pick inodes
to reclaim that suit the MM subsystem the best.

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_icache.c | 77 ++++++++++++++++++++++++++++++++++++++-------------
fs/xfs/xfs_icache.h | 4 +--
fs/xfs/xfs_linux.h | 1 +
fs/xfs/xfs_mount.h | 2 +-
fs/xfs/xfs_super.c | 6 ++--
5 files changed, 65 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 2f91e2b..82b053f 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -244,6 +244,8 @@ xfs_iget_cache_hit(

spin_unlock(&ip->i_flags_lock);
spin_unlock(&pag->pag_ici_lock);
+
+ list_lru_del(&mp->m_inode_lru, &VFS_I(ip)->i_lru);
} else {
/* If the VFS inode is being torn down, pause and try again. */
if (!igrab(inode)) {
@@ -990,6 +992,17 @@ reclaim:
spin_unlock(&pag->pag_ici_lock);

/*
+ * It is safe to do this unlocked check as we've guaranteed that we have
+ * exclusive access to this inode via the XFS_IRECLAIM flag. Hence
+ * concurrent LRU list walks will avoid removing this inode from the
+ * list. For direct reclaim, we know the inode has already been removed
+ * from any list it might be on, hence there's no need to traffic the
+ * LRU code to find that out.
+ */
+ if (!list_empty(&VFS_I(ip)->i_lru))
+ list_lru_del(&ip->i_mount->m_inode_lru, &VFS_I(ip)->i_lru);
+
+ /*
* Here we do an (almost) spurious inode lock in order to coordinate
* with inode cache radix tree lookups. This is because the lookup
* can reference the inodes in the cache without taking references.
@@ -1155,6 +1168,32 @@ xfs_reclaim_inodes(
return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
}

+static int
+xfs_reclaim_inode_isolate(
+ struct list_head *item,
+ spinlock_t *lru_lock,
+ void *cb_arg)
+{
+ struct inode *inode = container_of(item, struct inode,
+ i_lru);
+ struct list_head *dispose = cb_arg;
+
+ rcu_read_lock();
+ if (xfs_reclaim_inode_grab(XFS_I(inode), SYNC_TRYLOCK)) {
+ /* not a reclaim candidate, skip it */
+ rcu_read_unlock();
+ return 2;
+ }
+ rcu_read_unlock();
+
+ /*
+ * We have the XFS_IRECLAIM flag set now, so nobody is going to touch
+ * this inode now except us.
+ */
+ list_move(item, dispose);
+ return 0;
+}
+
/*
* Scan a certain number of inodes for reclaim.
*
@@ -1167,36 +1206,34 @@ xfs_reclaim_inodes(
long
xfs_reclaim_inodes_nr(
struct xfs_mount *mp,
- long nr_to_scan)
+ long nr_to_scan,
+ nodemask_t *nodes_to_scan)
{
- long nr = nr_to_scan;
+ LIST_HEAD(dispose);
+ long freed;

/* kick background reclaimer and push the AIL */
xfs_reclaim_work_queue(mp);
xfs_ail_push_all(mp->m_ail);

- xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr);
- return nr_to_scan - nr;
-}
+ freed = list_lru_walk_nodemask(&mp->m_inode_lru,
+ xfs_reclaim_inode_isolate, &dispose,
+ nr_to_scan, nodes_to_scan);

-/*
- * Return the number of reclaimable inodes in the filesystem for
- * the shrinker to determine how much to reclaim.
- */
-long
-xfs_reclaim_inodes_count(
- struct xfs_mount *mp)
-{
- struct xfs_perag *pag;
- xfs_agnumber_t ag = 0;
- long reclaimable = 0;
+ while (!list_empty(&dispose)) {
+ struct xfs_perag *pag;
+ struct inode *inode;

- while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
- ag = pag->pag_agno + 1;
- reclaimable += pag->pag_ici_reclaimable;
+ inode = list_first_entry(&dispose, struct inode, i_lru);
+ list_del_init(&inode->i_lru);
+
+ pag = xfs_perag_get(mp,
+ XFS_INO_TO_AGNO(mp, XFS_I(inode)->i_ino));
+ xfs_reclaim_inode(XFS_I(inode), pag, SYNC_WAIT);
xfs_perag_put(pag);
}
- return reclaimable;
+
+ return freed;
}

STATIC int
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index c860d07..4214518 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -30,8 +30,8 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
void xfs_reclaim_worker(struct work_struct *work);

int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
-long xfs_reclaim_inodes_count(struct xfs_mount *mp);
-long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan,
+ nodemask_t *nodes_to_scan);

void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);

diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index fe7e4df..40cc5d1 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -72,6 +72,7 @@
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/list_sort.h>
+#include <linux/list_lru.h>

#include <asm/page.h>
#include <asm/div64.h>
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index bab8314..859fd5d 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -200,9 +200,9 @@ typedef struct xfs_mount {
trimming */
__int64_t m_update_flags; /* sb flags we need to update
on the next remount,rw */
- struct shrinker m_inode_shrink; /* inode reclaim shrinker */
int64_t m_low_space[XFS_LOWSP_MAX];
/* low free space thresholds */
+ struct list_lru m_inode_lru; /* direct inode reclaim list */

struct workqueue_struct *m_data_workqueue;
struct workqueue_struct *m_unwritten_workqueue;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 33d67d5..814e07a 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -968,6 +968,7 @@ xfs_fs_destroy_inode(
* reclaim tear down all inodes.
*/
out_reclaim:
+ list_lru_add(&ip->i_mount->m_inode_lru, &inode->i_lru);
xfs_inode_set_reclaim_tag(ip);
}

@@ -1402,6 +1403,7 @@ xfs_fs_fill_super(
atomic_set(&mp->m_active_trans, 0);
INIT_DELAYED_WORK(&mp->m_reclaim_work, xfs_reclaim_worker);
INIT_DELAYED_WORK(&mp->m_eofblocks_work, xfs_eofblocks_worker);
+ list_lru_init(&mp->m_inode_lru);

mp->m_super = sb;
sb->s_fs_info = mp;
@@ -1519,7 +1521,7 @@ xfs_fs_nr_cached_objects(
struct super_block *sb,
nodemask_t *nodes_to_count)
{
- return xfs_reclaim_inodes_count(XFS_M(sb));
+ return list_lru_count_nodemask(&XFS_M(sb)->m_inode_lru, nodes_to_count);
}

static long
@@ -1528,7 +1530,7 @@ xfs_fs_free_cached_objects(
long nr_to_scan,
nodemask_t *nodes_to_scan)
{
- return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+ return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan, nodes_to_scan);
}

static const struct super_operations xfs_super_operations = {
--
1.7.10

2012-11-27 23:17:07

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

From: Dave Chinner <[email protected]>

Convert the driver shrinkers to the new API. Most changes are
compile-tested only, because I either don't have the hardware or it's
staging stuff.
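
For reference, every conversion in this patch follows the same basic
shape (a minimal sketch against the count/scan API introduced earlier
in the series; the foo_* names are placeholders, not code from any of
the drivers touched below):

static long
foo_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
{
        /* cheap side: just report how many reclaimable objects exist */
        return foo_nr_cached_objects();
}

static long
foo_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
{
        /* try to free up to sc->nr_to_scan objects, report how many went */
        return foo_free_objects(sc->nr_to_scan);
}

static struct shrinker foo_shrinker = {
        .count_objects  = foo_shrink_count,
        .scan_objects   = foo_shrink_scan,
        .seeks          = DEFAULT_SEEKS,
};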

FWIW, the md and android code is pretty good, but the rest of it
makes me want to claw my eyes out. The amount of broken code I just
encountered is mind-boggling. I've added comments explaining what
is broken, but I fear that some of the code would be best dealt with
by being dragged behind the bike shed, buried in mud up to its neck
and then run over repeatedly with a blunt lawn mower.

Special mention goes to the zcache/zcache2 drivers. They can't
co-exist in the build at the same time, they are under different
menu options in menuconfig, and they only show up when you've got
the right set of mm subsystem options configured, so even compile
testing is an exercise in pulling teeth. And that doesn't even take
into account the horrible, broken code...

Signed-off-by: Dave Chinner <[email protected]>
---
drivers/gpu/drm/i915/i915_dma.c | 4 +-
drivers/gpu/drm/i915/i915_gem.c | 64 +++++++++++++++++++++-------
drivers/gpu/drm/ttm/ttm_page_alloc.c | 48 ++++++++++++++-------
drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 55 +++++++++++++++---------
drivers/md/dm-bufio.c | 65 +++++++++++++++++++----------
drivers/staging/android/ashmem.c | 44 ++++++++++++-------
drivers/staging/android/lowmemorykiller.c | 60 +++++++++++++++++---------
drivers/staging/ramster/zcache-main.c | 58 ++++++++++++++++++-------
drivers/staging/zcache/zcache-main.c | 40 ++++++++++--------
9 files changed, 297 insertions(+), 141 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
index 61ae104..0ddec32 100644
--- a/drivers/gpu/drm/i915/i915_dma.c
+++ b/drivers/gpu/drm/i915/i915_dma.c
@@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
return 0;

out_gem_unload:
- if (dev_priv->mm.inactive_shrinker.shrink)
+ if (dev_priv->mm.inactive_shrinker.scan_objects)
unregister_shrinker(&dev_priv->mm.inactive_shrinker);

if (dev->pdev->msi_enabled)
@@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)

i915_teardown_sysfs(dev);

- if (dev_priv->mm.inactive_shrinker.shrink)
+ if (dev_priv->mm.inactive_shrinker.scan_objects)
unregister_shrinker(&dev_priv->mm.inactive_shrinker);

mutex_lock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 107f09b..ceab752 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
struct drm_i915_fence_reg *fence,
bool enable);

-static int i915_gem_inactive_shrink(struct shrinker *shrinker,
+static long i915_gem_inactive_count(struct shrinker *shrinker,
struct shrink_control *sc);
+static long i915_gem_inactive_scan(struct shrinker *shrinker,
+ struct shrink_control *sc);
static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
@@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)

dev_priv->mm.interruptible = true;

- dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
+ dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
+ dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
register_shrinker(&dev_priv->mm.inactive_shrinker);
}
@@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
spin_unlock(&file_priv->mm.lock);
}

-static int
-i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
+/*
+ * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
+ *
+ * i915_gem_purge() expects a byte count to be passed, and the minimum object
+ * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
+ * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
+ * here to mean "free 128 bytes". That means a single object will be freed, as
+ * the minimum object size is a page.
+ *
+ * But the craziest part comes when i915_gem_purge() has walked all the objects
+ * and can't free any memory. That results in i915_gem_shrink_all() being
+ * called, which idles the GPU and frees everything the driver has in its
+ * active and inactive lists. It's basically hitting the driver with a great big
+ * hammer because it was busy doing stuff when something else generated memory
+ * pressure. This doesn't seem particularly wise...
+ */
+static long
+i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
{
struct drm_i915_private *dev_priv =
container_of(shrinker,
struct drm_i915_private,
mm.inactive_shrinker);
struct drm_device *dev = dev_priv->dev;
- struct drm_i915_gem_object *obj;
- int nr_to_scan = sc->nr_to_scan;
- int cnt;
+ long freed = 0;

if (!mutex_trylock(&dev->struct_mutex))
return 0;

- if (nr_to_scan) {
- nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
- if (nr_to_scan > 0)
- i915_gem_shrink_all(dev_priv);
- }
+ freed = i915_gem_purge(dev_priv, sc->nr_to_scan);
+ if (freed < sc->nr_to_scan)
+ i915_gem_shrink_all(dev_priv);
+
+ mutex_unlock(&dev->struct_mutex);
+ return freed;
+}
+
+static long
+i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
+{
+ struct drm_i915_private *dev_priv =
+ container_of(shrinker,
+ struct drm_i915_private,
+ mm.inactive_shrinker);
+ struct drm_device *dev = dev_priv->dev;
+ struct drm_i915_gem_object *obj;
+ long count = 0;
+
+ if (!mutex_trylock(&dev->struct_mutex))
+ return 0;

- cnt = 0;
list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
if (obj->pages_pin_count == 0)
- cnt += obj->base.size >> PAGE_SHIFT;
+ count += obj->base.size >> PAGE_SHIFT;
list_for_each_entry(obj, &dev_priv->mm.bound_list, gtt_list)
if (obj->pin_count == 0 && obj->pages_pin_count == 0)
- cnt += obj->base.size >> PAGE_SHIFT;
+ count += obj->base.size >> PAGE_SHIFT;

mutex_unlock(&dev->struct_mutex);
- return cnt;
+ return count;
}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index bd2a3b4..83058a2 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -377,28 +377,28 @@ out:
return nr_free;
}

-/* Get good estimation how many pages are free in pools */
-static int ttm_pool_get_num_unused_pages(void)
-{
- unsigned i;
- int total = 0;
- for (i = 0; i < NUM_POOLS; ++i)
- total += _manager->pools[i].npages;
-
- return total;
-}
-
/**
* Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_page_pool_free() does memory allocation using GFP_KERNEL. That means
+ * this can deadlock when called with a sc->gfp_mask that is not equal to
+ * GFP_KERNEL.
+ *
+ * This code is crying out for a shrinker per pool....
*/
-static int ttm_pool_mm_shrink(struct shrinker *shrink,
- struct shrink_control *sc)
+static long
+ttm_pool_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
static atomic_t start_pool = ATOMIC_INIT(0);
unsigned i;
unsigned pool_offset = atomic_add_return(1, &start_pool);
struct ttm_page_pool *pool;
int shrink_pages = sc->nr_to_scan;
+ long freed = 0;

pool_offset = pool_offset % NUM_POOLS;
/* select start pool in round robin fashion */
@@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
break;
pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
shrink_pages = ttm_page_pool_free(pool, nr_free);
+ freed += nr_free - shrink_pages;
}
- /* return estimated number of unused pages in pool */
- return ttm_pool_get_num_unused_pages();
+ return freed;
+}
+
+
+static long
+ttm_pool_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ unsigned i;
+ long count = 0;
+
+ for (i = 0; i < NUM_POOLS; ++i)
+ count += _manager->pools[i].npages;
+
+ return count;
}

static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
{
- manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
+ manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
+ manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
manager->mm_shrink.seeks = 1;
register_shrinker(&manager->mm_shrink);
}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index b8b3943..b3b4f99 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
}
EXPORT_SYMBOL_GPL(ttm_dma_populate);

-/* Get good estimation how many pages are free in pools */
-static int ttm_dma_pool_get_num_unused_pages(void)
-{
- struct device_pools *p;
- unsigned total = 0;
-
- mutex_lock(&_manager->lock);
- list_for_each_entry(p, &_manager->pools, pools)
- total += p->pool->npages_free;
- mutex_unlock(&_manager->lock);
- return total;
-}
-
/* Put all pages in pages list to correct pool to wait for reuse */
void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
{
@@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);

/**
* Callback for mm to request pool to reduce number of page held.
+ *
+ * XXX: (dchinner) Deadlock warning!
+ *
+ * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
+ * needs to be paid to sc->gfp_mask to determine if this can be done or not.
+ * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context would be really
+ * bad.
+ *
+ * I'm getting sadder as I hear more pathetic whimpers about needing per-pool
+ * shrinkers...
*/
-static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
- struct shrink_control *sc)
+static long
+ttm_dma_pool_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
static atomic_t start_pool = ATOMIC_INIT(0);
unsigned idx = 0;
unsigned pool_offset = atomic_add_return(1, &start_pool);
unsigned shrink_pages = sc->nr_to_scan;
struct device_pools *p;
+ long freed = 0;

if (list_empty(&_manager->pools))
- return 0;
+ return -1;

mutex_lock(&_manager->lock);
pool_offset = pool_offset % _manager->npools;
@@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
continue;
nr_free = shrink_pages;
shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
+ freed += nr_free - shrink_pages;
+
pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
p->pool->dev_name, p->pool->name, current->pid,
nr_free, shrink_pages);
}
mutex_unlock(&_manager->lock);
- /* return estimated number of unused pages in pool */
- return ttm_dma_pool_get_num_unused_pages();
+ return freed;
+}
+
+static long
+ttm_dma_pool_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct device_pools *p;
+ long count = 0;
+
+ mutex_lock(&_manager->lock);
+ list_for_each_entry(p, &_manager->pools, pools)
+ count += p->pool->npages_free;
+ mutex_unlock(&_manager->lock);
+ return count;
}

static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
{
- manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
+ manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
+ manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
manager->mm_shrink.seeks = 1;
register_shrinker(&manager->mm_shrink);
}
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index 651ca79..0898bf5 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -1359,62 +1359,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
unsigned long max_jiffies)
{
if (jiffies - b->last_accessed < max_jiffies)
- return 1;
+ return 0;

if (!(gfp & __GFP_IO)) {
if (test_bit(B_READING, &b->state) ||
test_bit(B_WRITING, &b->state) ||
test_bit(B_DIRTY, &b->state))
- return 1;
+ return 0;
}

if (b->hold_count)
- return 1;
+ return 0;

__make_buffer_clean(b);
__unlink_buffer(b);
__free_buffer_wake(b);

- return 0;
+ return 1;
}

-static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
- struct shrink_control *sc)
+static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
+ gfp_t gfp_mask)
{
int l;
struct dm_buffer *b, *tmp;
+ long freed = 0;

for (l = 0; l < LIST_SIZE; l++) {
- list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
- if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
- !--nr_to_scan)
- return;
+ list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
+ freed += __cleanup_old_buffer(b, gfp_mask, 0);
+ if (!--nr_to_scan)
+ break;
+ }
dm_bufio_cond_resched();
}
+ return freed;
}

-static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
+static long
+dm_bufio_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct dm_bufio_client *c =
- container_of(shrinker, struct dm_bufio_client, shrinker);
- unsigned long r;
- unsigned long nr_to_scan = sc->nr_to_scan;
+ container_of(shrink, struct dm_bufio_client, shrinker);
+ long freed;

if (sc->gfp_mask & __GFP_IO)
dm_bufio_lock(c);
else if (!dm_bufio_trylock(c))
- return !nr_to_scan ? 0 : -1;
+ return -1;

- if (nr_to_scan)
- __scan(c, nr_to_scan, sc);
+ freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
+ dm_bufio_unlock(c);
+ return freed;
+}

- r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
- if (r > INT_MAX)
- r = INT_MAX;
+static long
+dm_bufio_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct dm_bufio_client *c =
+ container_of(shrink, struct dm_bufio_client, shrinker);
+ long count;
+
+ if (sc->gfp_mask & __GFP_IO)
+ dm_bufio_lock(c);
+ else if (!dm_bufio_trylock(c))
+ return 0;

+ count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
dm_bufio_unlock(c);
+ return count;

- return r;
}

/*
@@ -1516,7 +1534,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
__cache_size_refresh();
mutex_unlock(&dm_bufio_clients_lock);

- c->shrinker.shrink = shrink;
+ c->shrinker.count_objects = dm_bufio_shrink_count;
+ c->shrinker.scan_objects = dm_bufio_shrink_scan;
c->shrinker.seeks = 1;
c->shrinker.batch = 0;
register_shrinker(&c->shrinker);
@@ -1603,7 +1622,7 @@ static void cleanup_old_buffers(void)
struct dm_buffer *b;
b = list_entry(c->lru[LIST_CLEAN].prev,
struct dm_buffer, lru_list);
- if (__cleanup_old_buffer(b, 0, max_age * HZ))
+ if (!__cleanup_old_buffer(b, 0, max_age * HZ))
break;
dm_bufio_cond_resched();
}
diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
index 634b9ae..30f9f8e 100644
--- a/drivers/staging/android/ashmem.c
+++ b/drivers/staging/android/ashmem.c
@@ -341,27 +341,28 @@ out:
/*
* ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
*
- * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
- * many objects (pages) we have in total.
+ * 'nr_to_scan' is the number of objects to scan for freeing.
*
* 'gfp_mask' is the mask of the allocation that got us into this mess.
*
- * Return value is the number of objects (pages) remaining, or -1 if we cannot
+ * Return value is the number of objects freed or -1 if we cannot
* proceed without risk of deadlock (due to gfp_mask).
*
* We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
* chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
* pages freed.
*/
-static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
+static long
+ashmem_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct ashmem_range *range, *next;
+ long freed = 0;

/* We might recurse into filesystem code, so bail out if necessary */
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+ if (!(sc->gfp_mask & __GFP_FS))
return -1;
- if (!sc->nr_to_scan)
- return lru_count;

mutex_lock(&ashmem_mutex);
list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
@@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
range->purged = ASHMEM_WAS_PURGED;
lru_del(range);

- sc->nr_to_scan -= range_size(range);
- if (sc->nr_to_scan <= 0)
+ freed += range_size(range);
+ if (--sc->nr_to_scan <= 0)
break;
}
mutex_unlock(&ashmem_mutex);
+ return freed;
+}

+static long
+ashmem_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ /*
+ * Note that lru_count is a count of pages on the lru, not a count of
+ * objects on the list. This means the scan function needs to return the
+ * number of pages freed, not the number of objects scanned.
+ */
return lru_count;
}

static struct shrinker ashmem_shrinker = {
- .shrink = ashmem_shrink,
+ .count_objects = ashmem_shrink_count,
+ .scan_objects = ashmem_shrink_scan,
+ /*
+ * XXX (dchinner): I wish people would comment on why they need such
+ * significant changes to the default value here.
+ */
.seeks = DEFAULT_SEEKS * 4,
};

@@ -671,11 +689,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
if (capable(CAP_SYS_ADMIN)) {
struct shrink_control sc = {
.gfp_mask = GFP_KERNEL,
- .nr_to_scan = 0,
+ .nr_to_scan = LONG_MAX,
};
- ret = ashmem_shrink(&ashmem_shrinker, &sc);
- sc.nr_to_scan = ret;
- ashmem_shrink(&ashmem_shrinker, &sc);
+ ashmem_shrink_scan(&ashmem_shrinker, &sc);
}
break;
}
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index b91e4bc..2bf2c2f 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -63,11 +63,19 @@ static unsigned long lowmem_deathpending_timeout;
printk(x); \
} while (0)

-static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
+/*
+ * XXX (dchinner): this should all be using longs, not ints, as
+ * functions like global_page_state, get_mm_rss, etc all return longs or
+ * unsigned longs. Even the shrinker now uses longs....
+ */
+static long
+lowmem_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
struct task_struct *tsk;
struct task_struct *selected = NULL;
- int rem = 0;
+ long freed = 0;
int tasksize;
int i;
int min_score_adj = OOM_SCORE_ADJ_MAX + 1;
@@ -89,19 +97,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
break;
}
}
- if (sc->nr_to_scan > 0)
- lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
- sc->nr_to_scan, sc->gfp_mask, other_free,
- other_file, min_score_adj);
- rem = global_page_state(NR_ACTIVE_ANON) +
- global_page_state(NR_ACTIVE_FILE) +
- global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE);
- if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
- lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
- sc->nr_to_scan, sc->gfp_mask, rem);
- return rem;
+ lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
+ sc->nr_to_scan, sc->gfp_mask, other_free,
+ other_file, min_score_adj);
+
+ if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
+ /* nothing to do, no point in calling again */
+ lowmem_print(5, "lowmem_shrink %lu, %x, return -1\n",
+ sc->nr_to_scan, sc->gfp_mask);
+ return -1;
}
+
selected_oom_score_adj = min_score_adj;

rcu_read_lock();
@@ -151,16 +157,32 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
lowmem_deathpending_timeout = jiffies + HZ;
send_sig(SIGKILL, selected, 0);
set_tsk_thread_flag(selected, TIF_MEMDIE);
- rem -= selected_tasksize;
+ freed += selected_tasksize;
}
- lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
- sc->nr_to_scan, sc->gfp_mask, rem);
+ lowmem_print(4, "lowmem_shrink %lu, %x, return %ld\n",
+ sc->nr_to_scan, sc->gfp_mask, freed);
rcu_read_unlock();
- return rem;
+ return freed;
+}
+
+static long
+lowmem_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ long count;
+
+ count = global_page_state(NR_ACTIVE_ANON) +
+ global_page_state(NR_ACTIVE_FILE) +
+ global_page_state(NR_INACTIVE_ANON) +
+ global_page_state(NR_INACTIVE_FILE);
+ return count;
}

static struct shrinker lowmem_shrinker = {
- .shrink = lowmem_shrink,
+ .count_objects = lowmem_shrink_count,
+ .scan_objects = lowmem_shrink_scan,
+ /* why can't we document magic numbers? */
.seeks = DEFAULT_SEEKS * 16
};

diff --git a/drivers/staging/ramster/zcache-main.c b/drivers/staging/ramster/zcache-main.c
index a09dd5c..7d50688 100644
--- a/drivers/staging/ramster/zcache-main.c
+++ b/drivers/staging/ramster/zcache-main.c
@@ -1054,12 +1054,13 @@ static bool zcache_freeze;
* used by zcache to approximately the same as the total number of LRU_FILE
* pageframes in use.
*/
-static int shrink_zcache_memory(struct shrinker *shrink,
- struct shrink_control *sc)
+static long
+zcache_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
static bool in_progress;
- int ret = -1;
- int nr = sc->nr_to_scan;
+ long freed = 0;
int nr_evict = 0;
int nr_unuse = 0;
struct page *page;
@@ -1067,15 +1068,23 @@ static int shrink_zcache_memory(struct shrinker *shrink,
int unuse_ret;
#endif

- if (nr <= 0)
- goto skip_evict;
+ /*
+ * XXX (dchinner): My kingdom for a mutex! There's no way this should
+ * ever be allowed to move out of staging until it supports concurrent
+ * shrinkers correctly.
+ *
+ * This whole shrinker is making me want to claw my eyes out. It has no
+ * redeeming values whatsoever and I can't undo the damage it has
+ * already done to my brain.
+ */

/* don't allow more than one eviction thread at a time */
if (in_progress)
- goto skip_evict;
+ return -1;

in_progress = true;

+
/* we are going to ignore nr, and target a different value */
zcache_last_active_file_pageframes =
global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
@@ -1083,11 +1092,13 @@ static int shrink_zcache_memory(struct shrinker *shrink,
global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
nr_evict = zcache_eph_pageframes - zcache_last_active_file_pageframes +
zcache_last_inactive_file_pageframes;
+
while (nr_evict-- > 0) {
page = zcache_evict_eph_pageframe();
if (page == NULL)
break;
zcache_free_page(page);
+ freed++;
}

zcache_last_active_anon_pageframes =
@@ -1104,25 +1115,42 @@ static int shrink_zcache_memory(struct shrinker *shrink,
unuse_ret = zcache_frontswap_unuse();
if (unuse_ret == -ENOMEM)
break;
+ freed++;
}
#endif
in_progress = false;
+ return freed;
+}
+
+
+/*
+ * XXX (dchinner): the shrinker updates global variables? You've got to be
+ * kidding me! And the object count can (apparently) go negative - that's
+ * *always* a bug, so be bloody noisy about it.
+ */
+static long
+zcache_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ long count;

-skip_evict:
- /* resample: has changed, but maybe not all the way yet */
zcache_last_active_file_pageframes =
global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
zcache_last_inactive_file_pageframes =
global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
- ret = zcache_eph_pageframes - zcache_last_active_file_pageframes +
- zcache_last_inactive_file_pageframes;
- if (ret < 0)
- ret = 0;
- return ret;
+
+ count = zcache_last_active_file_pageframes +
+ zcache_last_inactive_file_pageframes +
+ zcache_eph_pageframes;
+
+ WARN_ON_ONCE(count < 0);
+ return count < 0 ? 0 : count;
}

static struct shrinker zcache_shrinker = {
- .shrink = shrink_zcache_memory,
+ .count_objects = zcache_shrink_count,
+ .scan_objects = zcache_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index 52b43b7..d17ab5d 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -536,10 +536,11 @@ static void zbud_evict_zbpg(struct zbud_page *zbpg)
* page in use by another cpu, but also to avoid potential deadlock due to
* lock inversion.
*/
-static void zbud_evict_pages(int nr)
+static long zbud_evict_pages(int nr)
{
struct zbud_page *zbpg;
int i;
+ long freed = 0;

/* first try freeing any pages on unused list */
retry_unused_list:
@@ -554,6 +555,7 @@ retry_unused_list:
spin_unlock_bh(&zbpg_unused_list_spinlock);
zcache_free_page(zbpg);
zcache_evicted_raw_pages++;
+ freed++;
if (--nr <= 0)
goto out;
goto retry_unused_list;
@@ -578,6 +580,7 @@ retry_unbud_list_i:
/* want budlists unlocked when doing zbpg eviction */
zbud_evict_zbpg(zbpg);
local_bh_enable();
+ freed++;
if (--nr <= 0)
goto out;
goto retry_unbud_list_i;
@@ -602,13 +605,14 @@ retry_bud_list:
/* want budlists unlocked when doing zbpg eviction */
zbud_evict_zbpg(zbpg);
local_bh_enable();
+ freed++;
if (--nr <= 0)
goto out;
goto retry_bud_list;
}
spin_unlock_bh(&zbud_budlists_spinlock);
out:
- return;
+ return freed;
}

static void __init zbud_init(void)
@@ -1527,26 +1531,28 @@ static bool zcache_freeze;
/*
* zcache shrinker interface (only useful for ephemeral pages, so zbud only)
*/
-static int shrink_zcache_memory(struct shrinker *shrink,
- struct shrink_control *sc)
+static long
+zcache_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
{
- int ret = -1;
- int nr = sc->nr_to_scan;
- gfp_t gfp_mask = sc->gfp_mask;
+ if (!(sc->gfp_mask & __GFP_FS))
+ return -1;

- if (nr >= 0) {
- if (!(gfp_mask & __GFP_FS))
- /* does this case really need to be skipped? */
- goto out;
- zbud_evict_pages(nr);
- }
- ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
-out:
- return ret;
+ return zbud_evict_pages(sc->nr_to_scan);
+}
+
+static long
+zcache_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ return (long)atomic_read(&zcache_zbud_curr_raw_pages);
}

static struct shrinker zcache_shrinker = {
- .shrink = shrink_zcache_memory,
+ .count_objects = zcache_shrink_count,
+ .scan_objects = zcache_shrink_scan,
.seeks = DEFAULT_SEEKS,
};

--
1.7.10

2012-11-27 23:15:22

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 02/19] dentry: move to per-sb LRU locks

From: Dave Chinner <[email protected]>

With the dentry LRUs being per-sb structures, there is no real need
for a global dentry_lru_lock. The locking can be made more
fine-grained by moving to a per-sb LRU lock, isolating the LRU
operations of different filesystems completely from each other.
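
In effect, every place that used to serialise on the one global lock
now takes the lock embedded in the dentry's superblock. Illustrative
sketch only; the real call sites are in the diff below:

        /* before: one lock shared by every filesystem */
        spin_lock(&dcache_lru_lock);
        list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
        spin_unlock(&dcache_lru_lock);

        /* after: contention is confined to a single superblock */
        spin_lock(&dentry->d_sb->s_dentry_lru_lock);
        list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
        spin_unlock(&dentry->d_sb->s_dentry_lru_lock);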

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/dcache.c | 37 ++++++++++++++++++-------------------
fs/super.c | 1 +
include/linux/fs.h | 4 +++-
3 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 2fc0daa..e0c97fe 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -48,7 +48,7 @@
* - the dcache hash table
* s_anon bl list spinlock protects:
* - the s_anon list (see __d_drop)
- * dcache_lru_lock protects:
+ * dentry->d_sb->s_dentry_lru_lock protects:
* - the dcache lru lists and counters
* d_lock protects:
* - d_flags
@@ -63,7 +63,7 @@
* Ordering:
* dentry->d_inode->i_lock
* dentry->d_lock
- * dcache_lru_lock
+ * dentry->d_sb->s_dentry_lru_lock
* dcache_hash_bucket lock
* s_anon lock
*
@@ -81,7 +81,6 @@
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);

-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_lru_lock);
__cacheline_aligned_in_smp DEFINE_SEQLOCK(rename_lock);

EXPORT_SYMBOL(rename_lock);
@@ -320,11 +319,11 @@ static void dentry_unlink_inode(struct dentry * dentry)
static void dentry_lru_add(struct dentry *dentry)
{
if (list_empty(&dentry->d_lru)) {
- spin_lock(&dcache_lru_lock);
+ spin_lock(&dentry->d_sb->s_dentry_lru_lock);
list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
this_cpu_inc(nr_dentry_unused);
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
}

@@ -342,9 +341,9 @@ static void __dentry_lru_del(struct dentry *dentry)
static void dentry_lru_del(struct dentry *dentry)
{
if (!list_empty(&dentry->d_lru)) {
- spin_lock(&dcache_lru_lock);
+ spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
}

@@ -359,15 +358,15 @@ static void dentry_lru_prune(struct dentry *dentry)
if (dentry->d_flags & DCACHE_OP_PRUNE)
dentry->d_op->d_prune(dentry);

- spin_lock(&dcache_lru_lock);
+ spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
}

static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
{
- spin_lock(&dcache_lru_lock);
+ spin_lock(&dentry->d_sb->s_dentry_lru_lock);
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
dentry->d_sb->s_nr_dentry_unused++;
@@ -375,7 +374,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
} else {
list_move_tail(&dentry->d_lru, list);
}
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}

/**
@@ -879,14 +878,14 @@ void prune_dcache_sb(struct super_block *sb, int count)
LIST_HEAD(tmp);

relock:
- spin_lock(&dcache_lru_lock);
+ spin_lock(&sb->s_dentry_lru_lock);
while (!list_empty(&sb->s_dentry_lru)) {
dentry = list_entry(sb->s_dentry_lru.prev,
struct dentry, d_lru);
BUG_ON(dentry->d_sb != sb);

if (!spin_trylock(&dentry->d_lock)) {
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&sb->s_dentry_lru_lock);
cpu_relax();
goto relock;
}
@@ -902,11 +901,11 @@ relock:
if (!--count)
break;
}
- cond_resched_lock(&dcache_lru_lock);
+ cond_resched_lock(&sb->s_dentry_lru_lock);
}
if (!list_empty(&referenced))
list_splice(&referenced, &sb->s_dentry_lru);
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&sb->s_dentry_lru_lock);

shrink_dentry_list(&tmp);
}
@@ -922,14 +921,14 @@ void shrink_dcache_sb(struct super_block *sb)
{
LIST_HEAD(tmp);

- spin_lock(&dcache_lru_lock);
+ spin_lock(&sb->s_dentry_lru_lock);
while (!list_empty(&sb->s_dentry_lru)) {
list_splice_init(&sb->s_dentry_lru, &tmp);
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&sb->s_dentry_lru_lock);
shrink_dentry_list(&tmp);
- spin_lock(&dcache_lru_lock);
+ spin_lock(&sb->s_dentry_lru_lock);
}
- spin_unlock(&dcache_lru_lock);
+ spin_unlock(&sb->s_dentry_lru_lock);
}
EXPORT_SYMBOL(shrink_dcache_sb);

diff --git a/fs/super.c b/fs/super.c
index 12f1237..21abf02 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -182,6 +182,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
+ spin_lock_init(&s->s_dentry_lru_lock);
INIT_LIST_HEAD(&s->s_inode_lru);
spin_lock_init(&s->s_inode_lru_lock);
INIT_LIST_HEAD(&s->s_mounts);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..0845283 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1262,7 +1262,9 @@ struct super_block {
struct list_head s_files;
#endif
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
- /* s_dentry_lru, s_nr_dentry_unused protected by dcache.c lru locks */
+
+ /* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
+ spinlock_t s_dentry_lru_lock ____cacheline_aligned_in_smp;
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */

--
1.7.10

2012-11-27 23:17:32

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 01/19] dcache: convert dentry_stat.nr_unused to per-cpu counters

From: Dave Chinner <[email protected]>

Before we split up the dcache_lru_lock, the unused dentry counter
needs to be made independent of the global dcache_lru_lock. Convert
it to per-cpu counters to do this.
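
The counter uses the standard kernel per-cpu pattern: updates are
cheap, contention-free local operations, and the total is only summed
up when the statistic is actually read (sketch of the idea; the real
hunks follow):

static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);

        /* on LRU add/remove: no shared cacheline, no lock */
        this_cpu_inc(nr_dentry_unused);
        this_cpu_dec(nr_dentry_unused);

        /* only in proc_nr_dentry(), when someone reads the stats */
        int i, sum = 0;
        for_each_possible_cpu(i)
                sum += per_cpu(nr_dentry_unused, i);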

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
---
fs/dcache.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 3a463d0..2fc0daa 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -118,6 +118,7 @@ struct dentry_stat_t dentry_stat = {
};

static DEFINE_PER_CPU(unsigned int, nr_dentry);
+static DEFINE_PER_CPU(unsigned int, nr_dentry_unused);

#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
static int get_nr_dentry(void)
@@ -129,10 +130,20 @@ static int get_nr_dentry(void)
return sum < 0 ? 0 : sum;
}

+static int get_nr_dentry_unused(void)
+{
+ int i;
+ int sum = 0;
+ for_each_possible_cpu(i)
+ sum += per_cpu(nr_dentry_unused, i);
+ return sum < 0 ? 0 : sum;
+}
+
int proc_nr_dentry(ctl_table *table, int write, void __user *buffer,
size_t *lenp, loff_t *ppos)
{
dentry_stat.nr_dentry = get_nr_dentry();
+ dentry_stat.nr_unused = get_nr_dentry_unused();
return proc_dointvec(table, write, buffer, lenp, ppos);
}
#endif
@@ -312,7 +323,7 @@ static void dentry_lru_add(struct dentry *dentry)
spin_lock(&dcache_lru_lock);
list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
dentry->d_sb->s_nr_dentry_unused++;
- dentry_stat.nr_unused++;
+ this_cpu_inc(nr_dentry_unused);
spin_unlock(&dcache_lru_lock);
}
}
@@ -322,7 +333,7 @@ static void __dentry_lru_del(struct dentry *dentry)
list_del_init(&dentry->d_lru);
dentry->d_flags &= ~DCACHE_SHRINK_LIST;
dentry->d_sb->s_nr_dentry_unused--;
- dentry_stat.nr_unused--;
+ this_cpu_dec(nr_dentry_unused);
}

/*
@@ -360,7 +371,7 @@ static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
dentry->d_sb->s_nr_dentry_unused++;
- dentry_stat.nr_unused++;
+ this_cpu_inc(nr_dentry_unused);
} else {
list_move_tail(&dentry->d_lru, list);
}
--
1.7.10

2012-11-27 23:17:59

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 12/19] xfs: convert buftarg LRU to generic code

From: Dave Chinner <[email protected]>

Convert the buftarg LRU to use the new generic LRU list and take
advantage of the functionality it supplies to make the buffer cache
shrinker node aware.
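
The conversion follows the same isolate/dispose shape as the other
list_lru users in the series. Rough sketch with placeholder foo_*
names; the actual xfs_buftarg_isolate() and xfs_buftarg_shrink_scan()
are in the diff below:

static int
foo_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
{
        struct foo *obj = container_of(item, struct foo, lru);
        struct list_head *dispose = arg;

        if (!foo_reclaimable(obj))
                return 1;               /* rotate: another trip around the LRU */

        list_move(item, dispose);       /* still under *lru_lock */
        return 0;                       /* isolated */
}

        /* in the shrinker's scan callback */
        LIST_HEAD(dispose);
        long freed;

        freed = list_lru_walk_nodemask(&foo_lru, foo_isolate, &dispose,
                                       sc->nr_to_scan, &sc->nodes_to_scan);

        /* dispose of the isolated objects without holding any LRU lock */
        while (!list_empty(&dispose)) {
                struct foo *obj = list_first_entry(&dispose, struct foo, lru);

                list_del_init(&obj->lru);
                foo_free(obj);
        }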

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_buf.c | 162 +++++++++++++++++++++++++-----------------------------
fs/xfs/xfs_buf.h | 5 +-
2 files changed, 76 insertions(+), 91 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index a80195b..1011c59 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -85,20 +85,14 @@ xfs_buf_vmap_len(
* The LRU takes a new reference to the buffer so that it will only be freed
* once the shrinker takes the buffer off the LRU.
*/
-STATIC void
+static void
xfs_buf_lru_add(
struct xfs_buf *bp)
{
- struct xfs_buftarg *btp = bp->b_target;
-
- spin_lock(&btp->bt_lru_lock);
- if (list_empty(&bp->b_lru)) {
- atomic_inc(&bp->b_hold);
- list_add_tail(&bp->b_lru, &btp->bt_lru);
- btp->bt_lru_nr++;
+ if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
+ atomic_inc(&bp->b_hold);
}
- spin_unlock(&btp->bt_lru_lock);
}

/*
@@ -107,24 +101,13 @@ xfs_buf_lru_add(
* The unlocked check is safe here because it only occurs when there are not
* b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
* to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free(). i.e. it removes an unnecessary round trip on the
- * bt_lru_lock.
+ * xfs_buf_free().
*/
-STATIC void
+static void
xfs_buf_lru_del(
struct xfs_buf *bp)
{
- struct xfs_buftarg *btp = bp->b_target;
-
- if (list_empty(&bp->b_lru))
- return;
-
- spin_lock(&btp->bt_lru_lock);
- if (!list_empty(&bp->b_lru)) {
- list_del_init(&bp->b_lru);
- btp->bt_lru_nr--;
- }
- spin_unlock(&btp->bt_lru_lock);
+ list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
}

/*
@@ -151,18 +134,10 @@ xfs_buf_stale(
bp->b_flags &= ~_XBF_DELWRI_Q;

atomic_set(&(bp)->b_lru_ref, 0);
- if (!list_empty(&bp->b_lru)) {
- struct xfs_buftarg *btp = bp->b_target;
-
- spin_lock(&btp->bt_lru_lock);
- if (!list_empty(&bp->b_lru) &&
- !(bp->b_lru_flags & _XBF_LRU_DISPOSE)) {
- list_del_init(&bp->b_lru);
- btp->bt_lru_nr--;
- atomic_dec(&bp->b_hold);
- }
- spin_unlock(&btp->bt_lru_lock);
- }
+ if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+ (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
+ atomic_dec(&bp->b_hold);
+
ASSERT(atomic_read(&bp->b_hold) >= 1);
}

@@ -1476,81 +1451,92 @@ xfs_buf_iomove(
* returned. These buffers will have an elevated hold count, so wait on those
* while freeing all the buffers only held by the LRU.
*/
-void
-xfs_wait_buftarg(
- struct xfs_buftarg *btp)
+static int
+xfs_buftarg_wait_rele(
+ struct list_head *item,
+ spinlock_t *lru_lock,
+ void *arg)
{
- struct xfs_buf *bp;
+ struct xfs_buf *bp = container_of(item, struct xfs_buf, b_lru);

-restart:
- spin_lock(&btp->bt_lru_lock);
- while (!list_empty(&btp->bt_lru)) {
- bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
- if (atomic_read(&bp->b_hold) > 1) {
- spin_unlock(&btp->bt_lru_lock);
- delay(100);
- goto restart;
- }
+ if (atomic_read(&bp->b_hold) > 1) {
+ /* need to wait */
+ spin_unlock(lru_lock);
+ delay(100);
+ } else {
/*
* clear the LRU reference count so the buffer doesn't get
* ignored in xfs_buf_rele().
*/
atomic_set(&bp->b_lru_ref, 0);
- spin_unlock(&btp->bt_lru_lock);
+ spin_unlock(lru_lock);
xfs_buf_rele(bp);
- spin_lock(&btp->bt_lru_lock);
}
- spin_unlock(&btp->bt_lru_lock);
+ return 3;
}

-int
-xfs_buftarg_shrink(
+void
+xfs_wait_buftarg(
+ struct xfs_buftarg *btp)
+{
+ while (list_lru_count(&btp->bt_lru))
+ list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
+ NULL, LONG_MAX);
+}
+
+static int
+xfs_buftarg_isolate(
+ struct list_head *item,
+ spinlock_t *lru_lock,
+ void *arg)
+{
+ struct xfs_buf *bp = container_of(item, struct xfs_buf, b_lru);
+ struct list_head *dispose = arg;
+
+ /*
+ * Decrement the b_lru_ref count unless the value is already
+ * zero. If the value is already zero, we need to reclaim the
+ * buffer, otherwise it gets another trip through the LRU.
+ */
+ if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+ return 1;
+
+ bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+ list_move(item, dispose);
+ return 0;
+}
+
+static long
+xfs_buftarg_shrink_scan(
struct shrinker *shrink,
struct shrink_control *sc)
{
struct xfs_buftarg *btp = container_of(shrink,
struct xfs_buftarg, bt_shrinker);
- struct xfs_buf *bp;
- int nr_to_scan = sc->nr_to_scan;
LIST_HEAD(dispose);
+ long freed;

- if (!nr_to_scan)
- return btp->bt_lru_nr;
-
- spin_lock(&btp->bt_lru_lock);
- while (!list_empty(&btp->bt_lru)) {
- if (nr_to_scan-- <= 0)
- break;
-
- bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-
- /*
- * Decrement the b_lru_ref count unless the value is already
- * zero. If the value is already zero, we need to reclaim the
- * buffer, otherwise it gets another trip through the LRU.
- */
- if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
- list_move_tail(&bp->b_lru, &btp->bt_lru);
- continue;
- }
-
- /*
- * remove the buffer from the LRU now to avoid needing another
- * lock round trip inside xfs_buf_rele().
- */
- list_move(&bp->b_lru, &dispose);
- btp->bt_lru_nr--;
- bp->b_lru_flags |= _XBF_LRU_DISPOSE;
- }
- spin_unlock(&btp->bt_lru_lock);
+ freed = list_lru_walk_nodemask(&btp->bt_lru, xfs_buftarg_isolate,
+ &dispose, sc->nr_to_scan,
+ &sc->nodes_to_scan);

while (!list_empty(&dispose)) {
+ struct xfs_buf *bp;
bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
list_del_init(&bp->b_lru);
xfs_buf_rele(bp);
}
+ return freed;
+}
+static long
+xfs_buftarg_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct xfs_buftarg *btp = container_of(shrink,
+ struct xfs_buftarg, bt_shrinker);

- return btp->bt_lru_nr;
+ return list_lru_count_nodemask(&btp->bt_lru, &sc->nodes_to_scan);
}

void
@@ -1632,11 +1618,11 @@ xfs_alloc_buftarg(
if (!btp->bt_bdi)
goto error;

- INIT_LIST_HEAD(&btp->bt_lru);
- spin_lock_init(&btp->bt_lru_lock);
+ list_lru_init(&btp->bt_lru);
if (xfs_setsize_buftarg_early(btp, bdev))
goto error;
- btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+ btp->bt_shrinker.count_objects = xfs_buftarg_shrink_count;
+ btp->bt_shrinker.scan_objects = xfs_buftarg_shrink_scan;
btp->bt_shrinker.seeks = DEFAULT_SEEKS;
register_shrinker(&btp->bt_shrinker);
return btp;
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index e541c0a..54113a1 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -25,6 +25,7 @@
#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/uio.h>
+#include <linux/list_lru.h>

/*
* Base types
@@ -92,9 +93,7 @@ typedef struct xfs_buftarg {

/* LRU control structures */
struct shrinker bt_shrinker;
- struct list_head bt_lru;
- spinlock_t bt_lru_lock;
- unsigned int bt_lru_nr;
+ struct list_lru bt_lru;
} xfs_buftarg_t;

struct xfs_buf;
--
1.7.10

2012-11-27 23:18:31

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 08/19] dcache: convert to use new lru list infrastructure

From: Dave Chinner <[email protected]>

Signed-off-by: Dave Chinner <[email protected]>
---
fs/dcache.c | 171 +++++++++++++++++++++-------------------------------
fs/super.c | 10 +--
include/linux/fs.h | 15 +++--
3 files changed, 82 insertions(+), 114 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index ca647b8..d72e388 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -37,6 +37,7 @@
#include <linux/rculist_bl.h>
#include <linux/prefetch.h>
#include <linux/ratelimit.h>
+#include <linux/list_lru.h>
#include "internal.h"
#include "mount.h"

@@ -318,20 +319,8 @@ static void dentry_unlink_inode(struct dentry * dentry)
*/
static void dentry_lru_add(struct dentry *dentry)
{
- if (list_empty(&dentry->d_lru)) {
- spin_lock(&dentry->d_sb->s_dentry_lru_lock);
- list_add(&dentry->d_lru, &dentry->d_sb->s_dentry_lru);
- dentry->d_sb->s_nr_dentry_unused++;
+ if (list_lru_add(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
this_cpu_inc(nr_dentry_unused);
- spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
- }
-}
-
-static void __dentry_lru_del(struct dentry *dentry)
-{
- list_del_init(&dentry->d_lru);
- dentry->d_sb->s_nr_dentry_unused--;
- this_cpu_dec(nr_dentry_unused);
}

/*
@@ -341,11 +330,8 @@ static void dentry_lru_del(struct dentry *dentry)
{
BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);

- if (!list_empty(&dentry->d_lru)) {
- spin_lock(&dentry->d_sb->s_dentry_lru_lock);
- __dentry_lru_del(dentry);
- spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
- }
+ if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
+ this_cpu_dec(nr_dentry_unused);
}

/*
@@ -361,35 +347,19 @@ static void dentry_lru_del(struct dentry *dentry)
*/
static void dentry_lru_prune(struct dentry *dentry)
{
- if (!list_empty(&dentry->d_lru)) {
+ int prune = dentry->d_flags & DCACHE_OP_PRUNE;

- if (dentry->d_flags & DCACHE_OP_PRUNE)
- dentry->d_op->d_prune(dentry);
-
- if ((dentry->d_flags & DCACHE_SHRINK_LIST))
- list_del_init(&dentry->d_lru);
- else {
- spin_lock(&dentry->d_sb->s_dentry_lru_lock);
- __dentry_lru_del(dentry);
- spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
- }
- dentry->d_flags &= ~DCACHE_SHRINK_LIST;
- }
-}
-
-static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
-{
- BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
-
- spin_lock(&dentry->d_sb->s_dentry_lru_lock);
- if (list_empty(&dentry->d_lru)) {
- list_add_tail(&dentry->d_lru, list);
- } else {
- list_move_tail(&dentry->d_lru, list);
- dentry->d_sb->s_nr_dentry_unused--;
+ if (!list_empty(&dentry->d_lru) &&
+ (dentry->d_flags & DCACHE_SHRINK_LIST))
+ list_del_init(&dentry->d_lru);
+ else if (list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru))
this_cpu_dec(nr_dentry_unused);
- }
- spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+ else
+ prune = 0;
+
+ dentry->d_flags &= ~DCACHE_SHRINK_LIST;
+ if (prune)
+ dentry->d_op->d_prune(dentry);
}

/**
@@ -880,6 +850,51 @@ static void shrink_dentry_list(struct list_head *list)
rcu_read_unlock();
}

+static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+ void *arg)
+{
+ struct list_head *freeable = arg;
+ struct dentry *dentry = container_of(item, struct dentry, d_lru);
+
+
+ /*
+ * we are inverting the lru lock/dentry->d_lock here,
+ * so use a trylock. If we fail to get the lock, just skip
+ * it
+ */
+ if (!spin_trylock(&dentry->d_lock))
+ return 2;
+
+ /*
+ * Referenced dentries are still in use. If they have active
+ * counts, just remove them from the LRU. Otherwise give them
+ * another pass through the LRU.
+ */
+ if (dentry->d_count) {
+ list_del_init(&dentry->d_lru);
+ spin_unlock(&dentry->d_lock);
+ return 0;
+ }
+
+ if (dentry->d_flags & DCACHE_REFERENCED) {
+ dentry->d_flags &= ~DCACHE_REFERENCED;
+ spin_unlock(&dentry->d_lock);
+
+ /*
+ * XXX: this list move should be done under d_lock. Need to
+ * determine if it is safe just to do it under the lru lock.
+ */
+ return 1;
+ }
+
+ dentry->d_flags |= DCACHE_SHRINK_LIST;
+ list_move_tail(&dentry->d_lru, freeable);
+ this_cpu_dec(nr_dentry_unused);
+ spin_unlock(&dentry->d_lock);
+
+ return 0;
+}
+
/**
* prune_dcache_sb - shrink the dcache
* @sb: superblock
@@ -894,45 +909,12 @@ static void shrink_dentry_list(struct list_head *list)
*/
long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
{
- struct dentry *dentry;
- LIST_HEAD(referenced);
- LIST_HEAD(tmp);
- long freed = 0;
-
-relock:
- spin_lock(&sb->s_dentry_lru_lock);
- while (!list_empty(&sb->s_dentry_lru)) {
- dentry = list_entry(sb->s_dentry_lru.prev,
- struct dentry, d_lru);
- BUG_ON(dentry->d_sb != sb);
-
- if (!spin_trylock(&dentry->d_lock)) {
- spin_unlock(&sb->s_dentry_lru_lock);
- cpu_relax();
- goto relock;
- }
-
- if (dentry->d_flags & DCACHE_REFERENCED) {
- dentry->d_flags &= ~DCACHE_REFERENCED;
- list_move(&dentry->d_lru, &referenced);
- spin_unlock(&dentry->d_lock);
- } else {
- list_move_tail(&dentry->d_lru, &tmp);
- dentry->d_flags |= DCACHE_SHRINK_LIST;
- this_cpu_dec(nr_dentry_unused);
- sb->s_nr_dentry_unused--;
- spin_unlock(&dentry->d_lock);
- freed++;
- if (!--nr_to_scan)
- break;
- }
- cond_resched_lock(&sb->s_dentry_lru_lock);
- }
- if (!list_empty(&referenced))
- list_splice(&referenced, &sb->s_dentry_lru);
- spin_unlock(&sb->s_dentry_lru_lock);
+ LIST_HEAD(dispose);
+ long freed;

- shrink_dentry_list(&tmp);
+ freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
+ &dispose, nr_to_scan);
+ shrink_dentry_list(&dispose);
return freed;
}

@@ -967,24 +949,10 @@ shrink_dcache_list(
*/
void shrink_dcache_sb(struct super_block *sb)
{
- LIST_HEAD(tmp);
-
- spin_lock(&sb->s_dentry_lru_lock);
- while (!list_empty(&sb->s_dentry_lru)) {
- list_splice_init(&sb->s_dentry_lru, &tmp);
-
- /*
- * account for removal here so we don't need to handle it later
- * even though the dentry is no longer on the lru list.
- */
- this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
- sb->s_nr_dentry_unused = 0;
+ long freed;

- spin_unlock(&sb->s_dentry_lru_lock);
- shrink_dcache_list(&tmp);
- spin_lock(&sb->s_dentry_lru_lock);
- }
- spin_unlock(&sb->s_dentry_lru_lock);
+ freed = list_lru_dispose_all(&sb->s_dentry_lru, shrink_dcache_list);
+ this_cpu_sub(nr_dentry_unused, freed);
}
EXPORT_SYMBOL(shrink_dcache_sb);

@@ -1255,7 +1223,8 @@ resume:
if (dentry->d_count) {
dentry_lru_del(dentry);
} else if (!(dentry->d_flags & DCACHE_SHRINK_LIST)) {
- dentry_lru_move_list(dentry, dispose);
+ dentry_lru_del(dentry);
+ list_add_tail(&dentry->d_lru, dispose);
dentry->d_flags |= DCACHE_SHRINK_LIST;
found++;
}
diff --git a/fs/super.c b/fs/super.c
index a2f09c8..b1d24ef 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -78,10 +78,11 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
fs_objects = sb->s_op->nr_cached_objects(sb);

inodes = list_lru_count(&sb->s_inode_lru);
- total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;
+ dentries = list_lru_count(&sb->s_dentry_lru);
+ total_objects = dentries + inodes + fs_objects + 1;

/* proportion the scan between the caches */
- dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
+ dentries = (sc->nr_to_scan * dentries) / total_objects;
inodes = (sc->nr_to_scan * inodes) / total_objects;

/*
@@ -113,7 +114,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb);

- total_objects += sb->s_nr_dentry_unused;
+ total_objects += list_lru_count(&sb->s_dentry_lru);
total_objects += list_lru_count(&sb->s_inode_lru);

total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
@@ -193,8 +194,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
- INIT_LIST_HEAD(&s->s_dentry_lru);
- spin_lock_init(&s->s_dentry_lru_lock);
+ list_lru_init(&s->s_dentry_lru);
list_lru_init(&s->s_inode_lru);
INIT_LIST_HEAD(&s->s_mounts);
init_rwsem(&s->s_umount);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 36b7db5..befa46f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1263,14 +1263,6 @@ struct super_block {
struct list_head s_files;
#endif
struct list_head s_mounts; /* list of mounts; _not_ for fs use */
-
- /* s_dentry_lru_lock protects s_dentry_lru and s_nr_dentry_unused */
- spinlock_t s_dentry_lru_lock ____cacheline_aligned_in_smp;
- struct list_head s_dentry_lru; /* unused dentry lru */
- int s_nr_dentry_unused; /* # of dentry on lru */
-
- struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
-
struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info *s_mtd;
@@ -1321,6 +1313,13 @@ struct super_block {

/* Being remounted read-only */
int s_readonly_remount;
+
+ /*
+ * Keep the lru lists last in the structure so they always sit on their
+ * own individual cachelines.
+ */
+ struct list_lru s_dentry_lru ____cacheline_aligned_in_smp;
+ struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
};

extern struct timespec current_fs_time(struct super_block *sb);
--
1.7.10

2012-11-27 23:18:51

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 09/19] list_lru: per-node list infrastructure

From: Dave Chinner <[email protected]>

Now that we have an LRU list API, we can start to enhance the
implementation. This splits the single LRU list into per-node lists
and locks to enhance scalability. Items are placed on lists
according to the node the memory belongs to. To make scanning the
lists efficient, also track whether the per-node lists have entries
in them in an active nodemask.
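
The core of the change is the node selection and nodemask maintenance
on insert, roughly (sketch of the idea; the full implementation,
including delete, count and the per-node walk, follows):

        int nid = page_to_nid(virt_to_page(item));
        struct list_lru_node *nlru = &lru->node[nid];

        spin_lock(&nlru->lock);
        list_add_tail(item, &nlru->list);
        if (nlru->nr_items++ == 0)
                node_set(nid, lru->active_nodes);  /* node now has entries */
        spin_unlock(&nlru->lock);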

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/list_lru.h | 14 ++--
lib/list_lru.c | 160 +++++++++++++++++++++++++++++++++++-----------
2 files changed, 129 insertions(+), 45 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 3423949..b0e3ba2 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -8,21 +8,23 @@
#define _LRU_LIST_H 0

#include <linux/list.h>
+#include <linux/nodemask.h>

-struct list_lru {
+struct list_lru_node {
spinlock_t lock;
struct list_head list;
long nr_items;
+} ____cacheline_aligned_in_smp;
+
+struct list_lru {
+ struct list_lru_node node[MAX_NUMNODES];
+ nodemask_t active_nodes;
};

int list_lru_init(struct list_lru *lru);
int list_lru_add(struct list_lru *lru, struct list_head *item);
int list_lru_del(struct list_lru *lru, struct list_head *item);
-
-static inline long list_lru_count(struct list_lru *lru)
-{
- return lru->nr_items;
-}
+long list_lru_count(struct list_lru *lru);

typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
void *cb_arg);
diff --git a/lib/list_lru.c b/lib/list_lru.c
index 475d0e9..881e342 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -6,6 +6,7 @@
*/
#include <linux/kernel.h>
#include <linux/module.h>
+#include <linux/mm.h>
#include <linux/list_lru.h>

int
@@ -13,14 +14,19 @@ list_lru_add(
struct list_lru *lru,
struct list_head *item)
{
- spin_lock(&lru->lock);
+ int nid = page_to_nid(virt_to_page(item));
+ struct list_lru_node *nlru = &lru->node[nid];
+
+ spin_lock(&nlru->lock);
+ BUG_ON(nlru->nr_items < 0);
if (list_empty(item)) {
- list_add_tail(item, &lru->list);
- lru->nr_items++;
- spin_unlock(&lru->lock);
+ list_add_tail(item, &nlru->list);
+ if (nlru->nr_items++ == 0)
+ node_set(nid, lru->active_nodes);
+ spin_unlock(&nlru->lock);
return 1;
}
- spin_unlock(&lru->lock);
+ spin_unlock(&nlru->lock);
return 0;
}
EXPORT_SYMBOL_GPL(list_lru_add);
@@ -30,43 +36,72 @@ list_lru_del(
struct list_lru *lru,
struct list_head *item)
{
- spin_lock(&lru->lock);
+ int nid = page_to_nid(virt_to_page(item));
+ struct list_lru_node *nlru = &lru->node[nid];
+
+ spin_lock(&nlru->lock);
if (!list_empty(item)) {
list_del_init(item);
- lru->nr_items--;
- spin_unlock(&lru->lock);
+ if (--nlru->nr_items == 0)
+ node_clear(nid, lru->active_nodes);
+ BUG_ON(nlru->nr_items < 0);
+ spin_unlock(&nlru->lock);
return 1;
}
- spin_unlock(&lru->lock);
+ spin_unlock(&nlru->lock);
return 0;
}
EXPORT_SYMBOL_GPL(list_lru_del);

long
-list_lru_walk(
- struct list_lru *lru,
- list_lru_walk_cb isolate,
- void *cb_arg,
- long nr_to_walk)
+list_lru_count(
+ struct list_lru *lru)
{
+ long count = 0;
+ int nid;
+
+ for_each_node_mask(nid, lru->active_nodes) {
+ struct list_lru_node *nlru = &lru->node[nid];
+
+ spin_lock(&nlru->lock);
+ BUG_ON(nlru->nr_items < 0);
+ count += nlru->nr_items;
+ spin_unlock(&nlru->lock);
+ }
+
+ return count;
+}
+EXPORT_SYMBOL_GPL(list_lru_count);
+
+static long
+list_lru_walk_node(
+ struct list_lru *lru,
+ int nid,
+ list_lru_walk_cb isolate,
+ void *cb_arg,
+ long *nr_to_walk)
+{
+ struct list_lru_node *nlru = &lru->node[nid];
struct list_head *item, *n;
- long removed = 0;
+ long isolated = 0;
restart:
- spin_lock(&lru->lock);
- list_for_each_safe(item, n, &lru->list) {
+ spin_lock(&nlru->lock);
+ list_for_each_safe(item, n, &nlru->list) {
int ret;

- if (nr_to_walk-- < 0)
+ if ((*nr_to_walk)-- < 0)
break;

- ret = isolate(item, &lru->lock, cb_arg);
+ ret = isolate(item, &nlru->lock, cb_arg);
switch (ret) {
case 0: /* item removed from list */
- lru->nr_items--;
- removed++;
+ if (--nlru->nr_items == 0)
+ node_clear(nid, lru->active_nodes);
+ BUG_ON(nlru->nr_items < 0);
+ isolated++;
break;
case 1: /* item referenced, give another pass */
- list_move_tail(item, &lru->list);
+ list_move_tail(item, &nlru->list);
break;
case 2: /* item cannot be locked, skip */
break;
@@ -76,42 +111,89 @@ restart:
BUG();
}
}
- spin_unlock(&lru->lock);
- return removed;
+ spin_unlock(&nlru->lock);
+ return isolated;
+}
+
+long
+list_lru_walk(
+ struct list_lru *lru,
+ list_lru_walk_cb isolate,
+ void *cb_arg,
+ long nr_to_walk)
+{
+ long isolated = 0;
+ int nid;
+
+ for_each_node_mask(nid, lru->active_nodes) {
+ isolated += list_lru_walk_node(lru, nid, isolate,
+ cb_arg, &nr_to_walk);
+ if (nr_to_walk <= 0)
+ break;
+ }
+ return isolated;
}
EXPORT_SYMBOL_GPL(list_lru_walk);

long
-list_lru_dispose_all(
- struct list_lru *lru,
- list_lru_dispose_cb dispose)
+list_lru_dispose_all_node(
+ struct list_lru *lru,
+ int nid,
+ list_lru_dispose_cb dispose)
{
- long disposed = 0;
+ struct list_lru_node *nlru = &lru->node[nid];
LIST_HEAD(dispose_list);
+ long disposed = 0;

- spin_lock(&lru->lock);
- while (!list_empty(&lru->list)) {
- list_splice_init(&lru->list, &dispose_list);
- disposed += lru->nr_items;
- lru->nr_items = 0;
- spin_unlock(&lru->lock);
+ spin_lock(&nlru->lock);
+ while (!list_empty(&nlru->list)) {
+ list_splice_init(&nlru->list, &dispose_list);
+ disposed += nlru->nr_items;
+ nlru->nr_items = 0;
+ node_clear(nid, lru->active_nodes);
+ spin_unlock(&nlru->lock);

dispose(&dispose_list);

- spin_lock(&lru->lock);
+ spin_lock(&nlru->lock);
}
- spin_unlock(&lru->lock);
+ spin_unlock(&nlru->lock);
return disposed;
}

+long
+list_lru_dispose_all(
+ struct list_lru *lru,
+ list_lru_dispose_cb dispose)
+{
+ long disposed;
+ long total = 0;
+ int nid;
+
+ do {
+ disposed = 0;
+ for_each_node_mask(nid, lru->active_nodes) {
+ disposed += list_lru_dispose_all_node(lru, nid,
+ dispose);
+ }
+ total += disposed;
+ } while (disposed != 0);
+
+ return total;
+}
+
int
list_lru_init(
struct list_lru *lru)
{
- spin_lock_init(&lru->lock);
- INIT_LIST_HEAD(&lru->list);
- lru->nr_items = 0;
+ int i;

+ nodes_clear(lru->active_nodes);
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ spin_lock_init(&lru->node[i].lock);
+ INIT_LIST_HEAD(&lru->node[i].list);
+ lru->node[i].nr_items = 0;
+ }
return 0;
}
EXPORT_SYMBOL_GPL(list_lru_init);
--
1.7.10

2012-11-27 23:19:06

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 07/19] inode: convert inode lru list to generic lru list code.

From: Dave Chinner <[email protected]>

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 173 +++++++++++++++++++++-------------------------------
fs/super.c | 11 ++--
include/linux/fs.h | 6 +-
3 files changed, 75 insertions(+), 115 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 3624ae0..2662305 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -17,6 +17,7 @@
#include <linux/prefetch.h>
#include <linux/buffer_head.h> /* for inode_has_buffers */
#include <linux/ratelimit.h>
+#include <linux/list_lru.h>
#include "internal.h"

/*
@@ -24,7 +25,7 @@
*
* inode->i_lock protects:
* inode->i_state, inode->i_hash, __iget()
- * inode->i_sb->s_inode_lru_lock protects:
+ * Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
* inode_sb_list_lock protects:
* sb->s_inodes, inode->i_sb_list
@@ -37,7 +38,7 @@
*
* inode_sb_list_lock
* inode->i_lock
- * inode->i_sb->s_inode_lru_lock
+ * Inode LRU list locks
*
* bdi->wb.list_lock
* inode->i_lock
@@ -399,24 +400,14 @@ EXPORT_SYMBOL(ihold);

static void inode_lru_list_add(struct inode *inode)
{
- spin_lock(&inode->i_sb->s_inode_lru_lock);
- if (list_empty(&inode->i_lru)) {
- list_add(&inode->i_lru, &inode->i_sb->s_inode_lru);
- inode->i_sb->s_nr_inodes_unused++;
+ if (list_lru_add(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_inc(nr_unused);
- }
- spin_unlock(&inode->i_sb->s_inode_lru_lock);
}

static void inode_lru_list_del(struct inode *inode)
{
- spin_lock(&inode->i_sb->s_inode_lru_lock);
- if (!list_empty(&inode->i_lru)) {
- list_del_init(&inode->i_lru);
- inode->i_sb->s_nr_inodes_unused--;
+ if (list_lru_del(&inode->i_sb->s_inode_lru, &inode->i_lru))
this_cpu_dec(nr_unused);
- }
- spin_unlock(&inode->i_sb->s_inode_lru_lock);
}

/**
@@ -660,24 +651,8 @@ int invalidate_inodes(struct super_block *sb, bool kill_dirty)
return busy;
}

-static int can_unuse(struct inode *inode)
-{
- if (inode->i_state & ~I_REFERENCED)
- return 0;
- if (inode_has_buffers(inode))
- return 0;
- if (atomic_read(&inode->i_count))
- return 0;
- if (inode->i_data.nrpages)
- return 0;
- return 1;
-}
-
/*
- * Walk the superblock inode LRU for freeable inodes and attempt to free them.
- * This is called from the superblock shrinker function with a number of inodes
- * to trim from the LRU. Inodes to be freed are moved to a temporary list and
- * then are freed outside inode_lock by dispose_list().
+ * Isolate the inode from the LRU in preparation for freeing it.
*
* Any inodes which are pinned purely because of attached pagecache have their
* pagecache removed. If the inode has metadata buffers attached to
@@ -691,90 +666,78 @@ static int can_unuse(struct inode *inode)
* LRU does not have strict ordering. Hence we don't want to reclaim inodes
* with this flag set because they are the inodes that are out of order.
*/
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
+ void *arg)
{
- LIST_HEAD(freeable);
- long nr_scanned;
- long freed = 0;
- unsigned long reap = 0;
+ struct list_head *freeable = arg;
+ struct inode *inode = container_of(item, struct inode, i_lru);

- spin_lock(&sb->s_inode_lru_lock);
- for (nr_scanned = nr_to_scan; nr_scanned >= 0; nr_scanned--) {
- struct inode *inode;
+ /*
+ * we are inverting the lru lock/inode->i_lock here, so use a trylock.
+ * If we fail to get the lock, just skip it.
+ */
+ if (!spin_trylock(&inode->i_lock))
+ return 2;

- if (list_empty(&sb->s_inode_lru))
- break;
+ /*
+ * Referenced or dirty inodes are still in use. Give them another pass
+ * through the LRU as we cannot reclaim them now.
+ */
+ if (atomic_read(&inode->i_count) ||
+ (inode->i_state & ~I_REFERENCED)) {
+ list_del_init(&inode->i_lru);
+ spin_unlock(&inode->i_lock);
+ this_cpu_dec(nr_unused);
+ return 0;
+ }

- inode = list_entry(sb->s_inode_lru.prev, struct inode, i_lru);
+ /* recently referenced inodes get one more pass */
+ if (inode->i_state & I_REFERENCED) {
+ inode->i_state &= ~I_REFERENCED;
+ spin_unlock(&inode->i_lock);
+ return 1;
+ }

- /*
- * we are inverting the sb->s_inode_lru_lock/inode->i_lock here,
- * so use a trylock. If we fail to get the lock, just move the
- * inode to the back of the list so we don't spin on it.
- */
- if (!spin_trylock(&inode->i_lock)) {
- list_move_tail(&inode->i_lru, &sb->s_inode_lru);
- continue;
+ if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+ __iget(inode);
+ spin_unlock(&inode->i_lock);
+ spin_unlock(lru_lock);
+ if (remove_inode_buffers(inode)) {
+ unsigned long reap;
+ reap = invalidate_mapping_pages(&inode->i_data, 0, -1);
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_INODESTEAL, reap);
+ else
+ __count_vm_events(PGINODESTEAL, reap);
+ if (current->reclaim_state)
+ current->reclaim_state->reclaimed_slab += reap;
}
+ iput(inode);
+ return 3;
+ }

- /*
- * Referenced or dirty inodes are still in use. Give them
- * another pass through the LRU as we canot reclaim them now.
- */
- if (atomic_read(&inode->i_count) ||
- (inode->i_state & ~I_REFERENCED)) {
- list_del_init(&inode->i_lru);
- spin_unlock(&inode->i_lock);
- sb->s_nr_inodes_unused--;
- this_cpu_dec(nr_unused);
- continue;
- }
+ WARN_ON(inode->i_state & I_NEW);
+ inode->i_state |= I_FREEING;
+ spin_unlock(&inode->i_lock);

- /* recently referenced inodes get one more pass */
- if (inode->i_state & I_REFERENCED) {
- inode->i_state &= ~I_REFERENCED;
- list_move(&inode->i_lru, &sb->s_inode_lru);
- spin_unlock(&inode->i_lock);
- continue;
- }
- if (inode_has_buffers(inode) || inode->i_data.nrpages) {
- __iget(inode);
- spin_unlock(&inode->i_lock);
- spin_unlock(&sb->s_inode_lru_lock);
- if (remove_inode_buffers(inode))
- reap += invalidate_mapping_pages(&inode->i_data,
- 0, -1);
- iput(inode);
- spin_lock(&sb->s_inode_lru_lock);
-
- if (inode != list_entry(sb->s_inode_lru.next,
- struct inode, i_lru))
- continue; /* wrong inode or list_empty */
- /* avoid lock inversions with trylock */
- if (!spin_trylock(&inode->i_lock))
- continue;
- if (!can_unuse(inode)) {
- spin_unlock(&inode->i_lock);
- continue;
- }
- }
- WARN_ON(inode->i_state & I_NEW);
- inode->i_state |= I_FREEING;
- spin_unlock(&inode->i_lock);
+ list_move(&inode->i_lru, freeable);
+ this_cpu_dec(nr_unused);
+ return 0;
+}

- list_move(&inode->i_lru, &freeable);
- sb->s_nr_inodes_unused--;
- this_cpu_dec(nr_unused);
- freed++;
- }
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_INODESTEAL, reap);
- else
- __count_vm_events(PGINODESTEAL, reap);
- spin_unlock(&sb->s_inode_lru_lock);
- if (current->reclaim_state)
- current->reclaim_state->reclaimed_slab += reap;
+/*
+ * Walk the superblock inode LRU for freeable inodes and attempt to free them.
+ * This is called from the superblock shrinker function with a number of inodes
+ * to trim from the LRU. Inodes to be freed are moved to a temporary list and
+ * then are freed outside inode_lock by dispose_list().
+ */
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+{
+ LIST_HEAD(freeable);
+ long freed;

+ freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
+ &freeable, nr_to_scan);
dispose_list(&freeable);
return freed;
}
diff --git a/fs/super.c b/fs/super.c
index fda6f12..a2f09c8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,12 +77,12 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
if (sb->s_op && sb->s_op->nr_cached_objects)
fs_objects = sb->s_op->nr_cached_objects(sb);

- total_objects = sb->s_nr_dentry_unused +
- sb->s_nr_inodes_unused + fs_objects + 1;
+ inodes = list_lru_count(&sb->s_inode_lru);
+ total_objects = sb->s_nr_dentry_unused + inodes + fs_objects + 1;

/* proportion the scan between the caches */
dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
- inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) / total_objects;
+ inodes = (sc->nr_to_scan * inodes) / total_objects;

/*
* prune the dcache first as the icache is pinned by it, then
@@ -114,7 +114,7 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
total_objects = sb->s_op->nr_cached_objects(sb);

total_objects += sb->s_nr_dentry_unused;
- total_objects += sb->s_nr_inodes_unused;
+ total_objects += list_lru_count(&sb->s_inode_lru);

total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
drop_super(sb);
@@ -195,8 +195,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_LIST_HEAD(&s->s_inodes);
INIT_LIST_HEAD(&s->s_dentry_lru);
spin_lock_init(&s->s_dentry_lru_lock);
- INIT_LIST_HEAD(&s->s_inode_lru);
- spin_lock_init(&s->s_inode_lru_lock);
+ list_lru_init(&s->s_inode_lru);
INIT_LIST_HEAD(&s->s_mounts);
init_rwsem(&s->s_umount);
lockdep_set_class(&s->s_umount, &type->s_umount_key);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13833e3..36b7db5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -10,6 +10,7 @@
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/list.h>
+#include <linux/list_lru.h>
#include <linux/radix-tree.h>
#include <linux/rbtree.h>
#include <linux/init.h>
@@ -1268,10 +1269,7 @@ struct super_block {
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */

- /* s_inode_lru_lock protects s_inode_lru and s_nr_inodes_unused */
- spinlock_t s_inode_lru_lock ____cacheline_aligned_in_smp;
- struct list_head s_inode_lru; /* unused inode lru */
- int s_nr_inodes_unused; /* # of inodes on lru */
+ struct list_lru s_inode_lru ____cacheline_aligned_in_smp;

struct block_device *s_bdev;
struct backing_dev_info *s_bdi;
--
1.7.10

2012-11-27 23:15:16

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 03/19] dcache: remove dentries from LRU before putting on dispose list

From: Dave Chinner <[email protected]>

One of the big problems with modifying the way the dcache shrinker
and LRU implementation works is that the LRU is abused in several
ways. One of these is shrink_dentry_list().

Basically, we can move a dentry off the LRU onto a different list
without doing any accounting changes, and then use dentry_lru_prune()
to remove it from whatever list it is now on to do the LRU
accounting at that point.

This makes it -really hard- to change the LRU implementation. The
use of the per-sb LRU lock serialises movement of the dentries
between the different lists and the removal of them, and this is the
only reason that it works. If we want to break up the dentry LRU
lock and lists into, say, per-node lists, we remove the only
serialisation that allows this lru list/dispose list abuse to work.

To make this work effectively, the dispose list has to be isolated
from the LRU list - dentries have to be removed from the LRU
*before* being placed on the dispose list. This means that the LRU
accounting and isolation is completed before disposal is started,
and that means we can change the LRU implementation freely in
future.

This means that dentries *must* be marked with DCACHE_SHRINK_LIST
when they are placed on the dispose list so that we don't think that
parent dentries found in try_prune_one_dentry() are on the LRU when
they are actually on the dispose list. This would result in
accounting the dentry to the LRU a second time. Hence
dentry_lru_prune() has to handle the DCACHE_SHRINK_LIST case
differently because the dentry isn't on the LRU list.
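
(Illustration only, not part of the patch: the isolate-then-dispose
pattern described above amounts to something like the sketch below.
The helper name is invented; the accounting mirrors what the patch
adds where a dentry is moved onto a private shrink list while
s_dentry_lru_lock and d_lock are held.)

	static void isolate_dentry_for_dispose(struct dentry *dentry,
					       struct list_head *dispose)
	{
		list_move_tail(&dentry->d_lru, dispose);
		dentry->d_flags |= DCACHE_SHRINK_LIST;
		dentry->d_sb->s_nr_dentry_unused--;
		this_cpu_dec(nr_dentry_unused);
	}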

Signed-off-by: Dave Chinner <[email protected]>
---
fs/dcache.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 63 insertions(+), 10 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e0c97fe..0124a84 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -330,7 +330,6 @@ static void dentry_lru_add(struct dentry *dentry)
static void __dentry_lru_del(struct dentry *dentry)
{
list_del_init(&dentry->d_lru);
- dentry->d_flags &= ~DCACHE_SHRINK_LIST;
dentry->d_sb->s_nr_dentry_unused--;
this_cpu_dec(nr_dentry_unused);
}
@@ -340,6 +339,8 @@ static void __dentry_lru_del(struct dentry *dentry)
*/
static void dentry_lru_del(struct dentry *dentry)
{
+ BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
if (!list_empty(&dentry->d_lru)) {
spin_lock(&dentry->d_sb->s_dentry_lru_lock);
__dentry_lru_del(dentry);
@@ -351,28 +352,42 @@ static void dentry_lru_del(struct dentry *dentry)
* Remove a dentry that is unreferenced and about to be pruned
* (unhashed and destroyed) from the LRU, and inform the file system.
* This wrapper should be called _prior_ to unhashing a victim dentry.
+ *
+ * Check that the dentry really is on the LRU as it may be on a private dispose
+ * list and in that case we do not want to call the generic LRU removal
+ * functions. This typically happens when shrink_dcache_sb() clears the LRU in
+ * one go and then try_prune_one_dentry() walks back up the parent chain finding
+ * dentries that are also on the dispose list.
*/
static void dentry_lru_prune(struct dentry *dentry)
{
if (!list_empty(&dentry->d_lru)) {
+
if (dentry->d_flags & DCACHE_OP_PRUNE)
dentry->d_op->d_prune(dentry);

- spin_lock(&dentry->d_sb->s_dentry_lru_lock);
- __dentry_lru_del(dentry);
- spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+ if ((dentry->d_flags & DCACHE_SHRINK_LIST))
+ list_del_init(&dentry->d_lru);
+ else {
+ spin_lock(&dentry->d_sb->s_dentry_lru_lock);
+ __dentry_lru_del(dentry);
+ spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
+ }
+ dentry->d_flags &= ~DCACHE_SHRINK_LIST;
}
}

static void dentry_lru_move_list(struct dentry *dentry, struct list_head *list)
{
+ BUG_ON(dentry->d_flags & DCACHE_SHRINK_LIST);
+
spin_lock(&dentry->d_sb->s_dentry_lru_lock);
if (list_empty(&dentry->d_lru)) {
list_add_tail(&dentry->d_lru, list);
- dentry->d_sb->s_nr_dentry_unused++;
- this_cpu_inc(nr_dentry_unused);
} else {
list_move_tail(&dentry->d_lru, list);
+ dentry->d_sb->s_nr_dentry_unused--;
+ this_cpu_dec(nr_dentry_unused);
}
spin_unlock(&dentry->d_sb->s_dentry_lru_lock);
}
@@ -840,12 +855,18 @@ static void shrink_dentry_list(struct list_head *list)
}

/*
+ * The dispose list is isolated and dentries are not accounted
+ * to the LRU here, so we can simply remove it from the list
+ * here regardless of whether it is referenced or not.
+ */
+ list_del_init(&dentry->d_lru);
+
+ /*
* We found an inuse dentry which was not removed from
- * the LRU because of laziness during lookup. Do not free
- * it - just keep it off the LRU list.
+ * the LRU because of laziness during lookup. Do not free it.
*/
if (dentry->d_count) {
- dentry_lru_del(dentry);
+ dentry->d_flags &= ~DCACHE_SHRINK_LIST;
spin_unlock(&dentry->d_lock);
continue;
}
@@ -897,6 +918,8 @@ relock:
} else {
list_move_tail(&dentry->d_lru, &tmp);
dentry->d_flags |= DCACHE_SHRINK_LIST;
+ this_cpu_dec(nr_dentry_unused);
+ sb->s_nr_dentry_unused--;
spin_unlock(&dentry->d_lock);
if (!--count)
break;
@@ -910,6 +933,28 @@ relock:
shrink_dentry_list(&tmp);
}

+/*
+ * Mark all the dentries as being on the dispose list so we don't think they are
+ * still on the LRU if we try to kill them from ascending the parent chain in
+ * try_prune_one_dentry() rather than directly from the dispose list.
+ */
+static void
+shrink_dcache_list(
+ struct list_head *dispose)
+{
+ struct dentry *dentry;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(dentry, dispose, d_lru) {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_SHRINK_LIST;
+ this_cpu_dec(nr_dentry_unused);
+ spin_unlock(&dentry->d_lock);
+ }
+ rcu_read_unlock();
+ shrink_dentry_list(dispose);
+}
+
/**
* shrink_dcache_sb - shrink dcache for a superblock
* @sb: superblock
@@ -924,8 +969,16 @@ void shrink_dcache_sb(struct super_block *sb)
spin_lock(&sb->s_dentry_lru_lock);
while (!list_empty(&sb->s_dentry_lru)) {
list_splice_init(&sb->s_dentry_lru, &tmp);
+
+ /*
+ * account for removal here so we don't need to handle it later
+ * even though the dentry is no longer on the lru list.
+ */
+ this_cpu_sub(nr_dentry_unused, sb->s_nr_dentry_unused);
+ sb->s_nr_dentry_unused = 0;
+
spin_unlock(&sb->s_dentry_lru_lock);
- shrink_dentry_list(&tmp);
+ shrink_dcache_list(&tmp);
spin_lock(&sb->s_dentry_lru_lock);
}
spin_unlock(&sb->s_dentry_lru_lock);
--
1.7.10

2012-11-27 23:19:45

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 15/19] xfs: convert dquot cache lru to list_lru

From: Dave Chinner <[email protected]>

Convert the XFS dquot lru to use the list_lru construct and convert
the shrinker to being node aware.
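
(Illustration only, not part of the patch: the freelist manipulation
reduces to the list_lru primitives, which do their own locking and
per-node accounting.)

	/* put an unreferenced dquot on the per-node LRU */
	if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
		XFS_STATS_INC(xs_qm_dquot_unused);

	/* and take it back off again when the dquot is purged */
	if (list_lru_del(&qi->qi_lru, &dqp->q_lru))
		XFS_STATS_DEC(xs_qm_dquot_unused);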

Signed-off-by: Dave Chinner <[email protected]>
---
fs/xfs/xfs_dquot.c | 7 +-
fs/xfs/xfs_qm.c | 307 ++++++++++++++++++++++++++--------------------------
fs/xfs/xfs_qm.h | 4 +-
3 files changed, 156 insertions(+), 162 deletions(-)

diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 9e1bf52..4fcd42e 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -843,13 +843,8 @@ xfs_qm_dqput_final(

trace_xfs_dqput_free(dqp);

- mutex_lock(&qi->qi_lru_lock);
- if (list_empty(&dqp->q_lru)) {
- list_add_tail(&dqp->q_lru, &qi->qi_lru_list);
- qi->qi_lru_count++;
+ if (list_lru_add(&qi->qi_lru, &dqp->q_lru))
XFS_STATS_INC(xs_qm_dquot_unused);
- }
- mutex_unlock(&qi->qi_lru_lock);

/*
* If we just added a udquot to the freelist, then we want to release
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index e6a0af0..534b924 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -50,7 +50,6 @@
*/
STATIC int xfs_qm_init_quotainos(xfs_mount_t *);
STATIC int xfs_qm_init_quotainfo(xfs_mount_t *);
-STATIC int xfs_qm_shake(struct shrinker *, struct shrink_control *);

/*
* We use the batch lookup interface to iterate over the dquots as it
@@ -196,12 +195,9 @@ xfs_qm_dqpurge(
* We move dquots to the freelist as soon as their reference count
* hits zero, so it really should be on the freelist here.
*/
- mutex_lock(&qi->qi_lru_lock);
ASSERT(!list_empty(&dqp->q_lru));
- list_del_init(&dqp->q_lru);
- qi->qi_lru_count--;
+ list_lru_del(&qi->qi_lru, &dqp->q_lru);
XFS_STATS_DEC(xs_qm_dquot_unused);
- mutex_unlock(&qi->qi_lru_lock);

xfs_qm_dqdestroy(dqp);

@@ -617,6 +613,156 @@ xfs_qm_dqdetach(
}
}

+STATIC void
+xfs_qm_dqfree_one(
+ struct xfs_dquot *dqp)
+{
+ struct xfs_mount *mp = dqp->q_mount;
+ struct xfs_quotainfo *qi = mp->m_quotainfo;
+
+ mutex_lock(&qi->qi_tree_lock);
+ radix_tree_delete(XFS_DQUOT_TREE(qi, dqp->q_core.d_flags),
+ be32_to_cpu(dqp->q_core.d_id));
+
+ qi->qi_dquots--;
+ mutex_unlock(&qi->qi_tree_lock);
+
+ xfs_qm_dqdestroy(dqp);
+}
+
+struct xfs_qm_isolate {
+ struct list_head buffers;
+ struct list_head dispose;
+};
+
+static int
+xfs_qm_dquot_isolate(
+ struct list_head *item,
+ spinlock_t *lru_lock,
+ void *arg)
+{
+ struct xfs_dquot *dqp = container_of(item,
+ struct xfs_dquot, q_lru);
+ struct xfs_qm_isolate *isol = arg;
+
+ if (!xfs_dqlock_nowait(dqp))
+ goto out_miss_busy;
+
+ /*
+ * This dquot has acquired a reference in the meantime; remove it from
+ * the freelist and try again.
+ */
+ if (dqp->q_nrefs) {
+ xfs_dqunlock(dqp);
+ XFS_STATS_INC(xs_qm_dqwants);
+
+ trace_xfs_dqreclaim_want(dqp);
+ list_del_init(&dqp->q_lru);
+ XFS_STATS_DEC(xs_qm_dquot_unused);
+ return 0;
+ }
+
+ /*
+ * If the dquot is dirty, flush it. If it's already being flushed, just
+ * skip it so there is time for the IO to complete before we try to
+ * reclaim it again on the next LRU pass.
+ */
+ if (!xfs_dqflock_nowait(dqp)) {
+ xfs_dqunlock(dqp);
+ goto out_miss_busy;
+ }
+
+ if (XFS_DQ_IS_DIRTY(dqp)) {
+ struct xfs_buf *bp = NULL;
+ int error;
+
+ trace_xfs_dqreclaim_dirty(dqp);
+
+ /* we have to drop the LRU lock to flush the dquot */
+ spin_unlock(lru_lock);
+
+ error = xfs_qm_dqflush(dqp, &bp);
+ if (error) {
+ xfs_warn(dqp->q_mount, "%s: dquot %p flush failed",
+ __func__, dqp);
+ goto out_unlock_dirty;
+ }
+
+ xfs_buf_delwri_queue(bp, &isol->buffers);
+ xfs_buf_relse(bp);
+ goto out_unlock_dirty;
+ }
+ xfs_dqfunlock(dqp);
+
+ /*
+ * Prevent lookups now that we are past the point of no return.
+ */
+ dqp->dq_flags |= XFS_DQ_FREEING;
+ xfs_dqunlock(dqp);
+
+ ASSERT(dqp->q_nrefs == 0);
+ list_move_tail(&dqp->q_lru, &isol->dispose);
+ XFS_STATS_DEC(xs_qm_dquot_unused);
+ trace_xfs_dqreclaim_done(dqp);
+ XFS_STATS_INC(xs_qm_dqreclaims);
+ return 0;
+
+out_miss_busy:
+ trace_xfs_dqreclaim_busy(dqp);
+ XFS_STATS_INC(xs_qm_dqreclaim_misses);
+ return 2;
+
+out_unlock_dirty:
+ trace_xfs_dqreclaim_busy(dqp);
+ XFS_STATS_INC(xs_qm_dqreclaim_misses);
+ return 3;
+}
+
+static long
+xfs_qm_shrink_scan(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct xfs_quotainfo *qi = container_of(shrink,
+ struct xfs_quotainfo, qi_shrinker);
+ struct xfs_qm_isolate isol;
+ long freed;
+ int error;
+
+ if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+ return 0;
+
+ INIT_LIST_HEAD(&isol.buffers);
+ INIT_LIST_HEAD(&isol.dispose);
+
+ freed = list_lru_walk_nodemask(&qi->qi_lru, xfs_qm_dquot_isolate, &isol,
+ sc->nr_to_scan, &sc->nodes_to_scan);
+
+ error = xfs_buf_delwri_submit(&isol.buffers);
+ if (error)
+ xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
+
+ while (!list_empty(&isol.dispose)) {
+ struct xfs_dquot *dqp;
+
+ dqp = list_first_entry(&isol.dispose, struct xfs_dquot, q_lru);
+ list_del_init(&dqp->q_lru);
+ xfs_qm_dqfree_one(dqp);
+ }
+
+ return freed;
+}
+
+static long
+xfs_qm_shrink_count(
+ struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ struct xfs_quotainfo *qi = container_of(shrink,
+ struct xfs_quotainfo, qi_shrinker);
+
+ return list_lru_count_nodemask(&qi->qi_lru, &sc->nodes_to_scan);
+}
+
/*
* This initializes all the quota information that's kept in the
* mount structure
@@ -647,9 +793,7 @@ xfs_qm_init_quotainfo(
INIT_RADIX_TREE(&qinf->qi_gquota_tree, GFP_NOFS);
mutex_init(&qinf->qi_tree_lock);

- INIT_LIST_HEAD(&qinf->qi_lru_list);
- qinf->qi_lru_count = 0;
- mutex_init(&qinf->qi_lru_lock);
+ list_lru_init(&qinf->qi_lru);

/* mutex used to serialize quotaoffs */
mutex_init(&qinf->qi_quotaofflock);
@@ -716,7 +860,8 @@ xfs_qm_init_quotainfo(
qinf->qi_rtbwarnlimit = XFS_QM_RTBWARNLIMIT;
}

- qinf->qi_shrinker.shrink = xfs_qm_shake;
+ qinf->qi_shrinker.count_objects = xfs_qm_shrink_count;
+ qinf->qi_shrinker.scan_objects = xfs_qm_shrink_scan;
qinf->qi_shrinker.seeks = DEFAULT_SEEKS;
register_shrinker(&qinf->qi_shrinker);
return 0;
@@ -1428,150 +1573,6 @@ xfs_qm_init_quotainos(
return 0;
}

-STATIC void
-xfs_qm_dqfree_one(
- struct xfs_dquot *dqp)
-{
- struct xfs_mount *mp = dqp->q_mount;
- struct xfs_quotainfo *qi = mp->m_quotainfo;
-
- mutex_lock(&qi->qi_tree_lock);
- radix_tree_delete(XFS_DQUOT_TREE(qi, dqp->q_core.d_flags),
- be32_to_cpu(dqp->q_core.d_id));
-
- qi->qi_dquots--;
- mutex_unlock(&qi->qi_tree_lock);
-
- xfs_qm_dqdestroy(dqp);
-}
-
-STATIC void
-xfs_qm_dqreclaim_one(
- struct xfs_dquot *dqp,
- struct list_head *buffer_list,
- struct list_head *dispose_list)
-{
- struct xfs_mount *mp = dqp->q_mount;
- struct xfs_quotainfo *qi = mp->m_quotainfo;
- int error;
-
- if (!xfs_dqlock_nowait(dqp))
- goto out_busy;
-
- /*
- * This dquot has acquired a reference in the meantime remove it from
- * the freelist and try again.
- */
- if (dqp->q_nrefs) {
- xfs_dqunlock(dqp);
-
- trace_xfs_dqreclaim_want(dqp);
- XFS_STATS_INC(xs_qm_dqwants);
-
- list_del_init(&dqp->q_lru);
- qi->qi_lru_count--;
- XFS_STATS_DEC(xs_qm_dquot_unused);
- return;
- }
-
- /*
- * Try to grab the flush lock. If this dquot is in the process of
- * getting flushed to disk, we don't want to reclaim it.
- */
- if (!xfs_dqflock_nowait(dqp))
- goto out_busy;
-
- if (XFS_DQ_IS_DIRTY(dqp)) {
- struct xfs_buf *bp = NULL;
-
- trace_xfs_dqreclaim_dirty(dqp);
-
- error = xfs_qm_dqflush(dqp, &bp);
- if (error) {
- xfs_warn(mp, "%s: dquot %p flush failed",
- __func__, dqp);
- goto out_busy;
- }
-
- xfs_buf_delwri_queue(bp, buffer_list);
- xfs_buf_relse(bp);
- /*
- * Give the dquot another try on the freelist, as the
- * flushing will take some time.
- */
- goto out_busy;
- }
- xfs_dqfunlock(dqp);
-
- /*
- * Prevent lookups now that we are past the point of no return.
- */
- dqp->dq_flags |= XFS_DQ_FREEING;
- xfs_dqunlock(dqp);
-
- ASSERT(dqp->q_nrefs == 0);
- list_move_tail(&dqp->q_lru, dispose_list);
- qi->qi_lru_count--;
- XFS_STATS_DEC(xs_qm_dquot_unused);
-
- trace_xfs_dqreclaim_done(dqp);
- XFS_STATS_INC(xs_qm_dqreclaims);
- return;
-
-out_busy:
- xfs_dqunlock(dqp);
-
- /*
- * Move the dquot to the tail of the list so that we don't spin on it.
- */
- list_move_tail(&dqp->q_lru, &qi->qi_lru_list);
-
- trace_xfs_dqreclaim_busy(dqp);
- XFS_STATS_INC(xs_qm_dqreclaim_misses);
-}
-
-STATIC int
-xfs_qm_shake(
- struct shrinker *shrink,
- struct shrink_control *sc)
-{
- struct xfs_quotainfo *qi =
- container_of(shrink, struct xfs_quotainfo, qi_shrinker);
- int nr_to_scan = sc->nr_to_scan;
- LIST_HEAD (buffer_list);
- LIST_HEAD (dispose_list);
- struct xfs_dquot *dqp;
- int error;
-
- if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
- return 0;
- if (!nr_to_scan)
- goto out;
-
- mutex_lock(&qi->qi_lru_lock);
- while (!list_empty(&qi->qi_lru_list)) {
- if (nr_to_scan-- <= 0)
- break;
- dqp = list_first_entry(&qi->qi_lru_list, struct xfs_dquot,
- q_lru);
- xfs_qm_dqreclaim_one(dqp, &buffer_list, &dispose_list);
- }
- mutex_unlock(&qi->qi_lru_lock);
-
- error = xfs_buf_delwri_submit(&buffer_list);
- if (error)
- xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
-
- while (!list_empty(&dispose_list)) {
- dqp = list_first_entry(&dispose_list, struct xfs_dquot, q_lru);
- list_del_init(&dqp->q_lru);
- xfs_qm_dqfree_one(dqp);
- }
-
-out:
- return (qi->qi_lru_count / 100) * sysctl_vfs_cache_pressure;
-}
-
/*
* Start a transaction and write the incore superblock changes to
* disk. flags parameter indicates which fields have changed.
diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h
index 44b858b..d08b72d 100644
--- a/fs/xfs/xfs_qm.h
+++ b/fs/xfs/xfs_qm.h
@@ -47,9 +47,7 @@ typedef struct xfs_quotainfo {
struct mutex qi_tree_lock;
xfs_inode_t *qi_uquotaip; /* user quota inode */
xfs_inode_t *qi_gquotaip; /* group quota inode */
- struct list_head qi_lru_list;
- struct mutex qi_lru_lock;
- int qi_lru_count;
+ struct list_lru qi_lru;
int qi_dquots;
time_t qi_btimelimit; /* limit for blks timer */
time_t qi_itimelimit; /* limit for inodes timer */
--
1.7.10

2012-11-27 23:20:52

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 05/19] shrinker: convert superblock shrinkers to new API

From: Dave Chinner <[email protected]>

Convert superblock shrinker to use the new count/scan API, and
propagate the API changes through to the filesystem callouts. The
filesystem callouts already use a count/scan API, so it's just
changing counters to longs to match the VM API.

This requires the dentry and inode shrinker callouts to be converted
to the count/scan API. This is mainly a mechanical change.
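
Schematically - a simplified sketch of the patch below, with the
grab_super_passive() and fs-specific object handling omitted - the
scan side now proportions the work between the caches and returns
how much it actually freed:

	/* inside super_cache_scan(), names as in the patch below */
	total_objects = sb->s_nr_dentry_unused +
			sb->s_nr_inodes_unused + fs_objects + 1;

	/* proportion the scan between the caches */
	dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
	inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) / total_objects;

	freed = prune_dcache_sb(sb, dentries);
	freed += prune_icache_sb(sb, inodes);
	return freed;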

Signed-off-by: Dave Chinner <[email protected]>
---
fs/dcache.c | 10 +++++---
fs/inode.c | 7 +++--
fs/internal.h | 3 +++
fs/super.c | 71 ++++++++++++++++++++++++++++++---------------------
fs/xfs/xfs_icache.c | 17 +++++++-----
fs/xfs/xfs_icache.h | 4 +--
fs/xfs/xfs_super.c | 8 +++---
include/linux/fs.h | 8 ++----
8 files changed, 74 insertions(+), 54 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0124a84..ca647b8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -892,11 +892,12 @@ static void shrink_dentry_list(struct list_head *list)
* This function may fail to free any resources if all the dentries are in
* use.
*/
-void prune_dcache_sb(struct super_block *sb, int count)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
{
struct dentry *dentry;
LIST_HEAD(referenced);
LIST_HEAD(tmp);
+ long freed = 0;

relock:
spin_lock(&sb->s_dentry_lru_lock);
@@ -921,7 +922,8 @@ relock:
this_cpu_dec(nr_dentry_unused);
sb->s_nr_dentry_unused--;
spin_unlock(&dentry->d_lock);
- if (!--count)
+ freed++;
+ if (!--nr_to_scan)
break;
}
cond_resched_lock(&sb->s_dentry_lru_lock);
@@ -931,6 +933,7 @@ relock:
spin_unlock(&sb->s_dentry_lru_lock);

shrink_dentry_list(&tmp);
+ return freed;
}

/*
@@ -1317,9 +1320,8 @@ rename_retry:
void shrink_dcache_parent(struct dentry * parent)
{
LIST_HEAD(dispose);
- int found;

- while ((found = select_parent(parent, &dispose)) != 0)
+ while (select_parent(parent, &dispose))
shrink_dentry_list(&dispose);
}
EXPORT_SYMBOL(shrink_dcache_parent);
diff --git a/fs/inode.c b/fs/inode.c
index c9fb382..3624ae0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -691,10 +691,11 @@ static int can_unuse(struct inode *inode)
* LRU does not have strict ordering. Hence we don't want to reclaim inodes
* with this flag set because they are the inodes that are out of order.
*/
-void prune_icache_sb(struct super_block *sb, int nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan)
{
LIST_HEAD(freeable);
- int nr_scanned;
+ long nr_scanned;
+ long freed = 0;
unsigned long reap = 0;

spin_lock(&sb->s_inode_lru_lock);
@@ -764,6 +765,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
list_move(&inode->i_lru, &freeable);
sb->s_nr_inodes_unused--;
this_cpu_dec(nr_unused);
+ freed++;
}
if (current_is_kswapd())
__count_vm_events(KSWAPD_INODESTEAL, reap);
@@ -774,6 +776,7 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan)
current->reclaim_state->reclaimed_slab += reap;

dispose_list(&freeable);
+ return freed;
}

static void __wait_on_freeing_inode(struct inode *inode);
diff --git a/fs/internal.h b/fs/internal.h
index 916b7cb..7d7908b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,6 +110,8 @@ extern int open_check_o_direct(struct file *f);
* inode.c
*/
extern spinlock_t inode_sb_list_lock;
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+

/*
* fs-writeback.c
@@ -124,3 +126,4 @@ extern int invalidate_inodes(struct super_block *, bool);
* dcache.c
*/
extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
diff --git a/fs/super.c b/fs/super.c
index 21abf02..fda6f12 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -53,11 +53,14 @@ static char *sb_writers_name[SB_FREEZE_LEVELS] = {
* shrinker path and that leads to deadlock on the shrinker_rwsem. Hence we
* take a passive reference to the superblock to avoid this from occurring.
*/
-static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
+static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
{
struct super_block *sb;
- int fs_objects = 0;
- int total_objects;
+ long fs_objects = 0;
+ long total_objects;
+ long freed = 0;
+ long dentries;
+ long inodes;

sb = container_of(shrink, struct super_block, s_shrink);

@@ -65,7 +68,7 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
* Deadlock avoidance. We may hold various FS locks, and we don't want
* to recurse into the FS that called us in clear_inode() and friends..
*/
- if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
+ if (!(sc->gfp_mask & __GFP_FS))
return -1;

if (!grab_super_passive(sb))
@@ -77,33 +80,42 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
total_objects = sb->s_nr_dentry_unused +
sb->s_nr_inodes_unused + fs_objects + 1;

- if (sc->nr_to_scan) {
- int dentries;
- int inodes;
-
- /* proportion the scan between the caches */
- dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) /
- total_objects;
- inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) /
- total_objects;
- if (fs_objects)
- fs_objects = (sc->nr_to_scan * fs_objects) /
- total_objects;
- /*
- * prune the dcache first as the icache is pinned by it, then
- * prune the icache, followed by the filesystem specific caches
- */
- prune_dcache_sb(sb, dentries);
- prune_icache_sb(sb, inodes);
+ /* proportion the scan between the caches */
+ dentries = (sc->nr_to_scan * sb->s_nr_dentry_unused) / total_objects;
+ inodes = (sc->nr_to_scan * sb->s_nr_inodes_unused) / total_objects;

- if (fs_objects && sb->s_op->free_cached_objects) {
- sb->s_op->free_cached_objects(sb, fs_objects);
- fs_objects = sb->s_op->nr_cached_objects(sb);
- }
- total_objects = sb->s_nr_dentry_unused +
- sb->s_nr_inodes_unused + fs_objects;
+ /*
+ * prune the dcache first as the icache is pinned by it, then
+ * prune the icache, followed by the filesystem specific caches
+ */
+ freed = prune_dcache_sb(sb, dentries);
+ freed += prune_icache_sb(sb, inodes);
+
+ if (fs_objects) {
+ fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
+ freed += sb->s_op->free_cached_objects(sb, fs_objects);
}

+ drop_super(sb);
+ return freed;
+}
+
+static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+ struct super_block *sb;
+ long total_objects = 0;
+
+ sb = container_of(shrink, struct super_block, s_shrink);
+
+ if (!grab_super_passive(sb))
+ return -1;
+
+ if (sb->s_op && sb->s_op->nr_cached_objects)
+ total_objects = sb->s_op->nr_cached_objects(sb);
+
+ total_objects += sb->s_nr_dentry_unused;
+ total_objects += sb->s_nr_inodes_unused;
+
total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
drop_super(sb);
return total_objects;
@@ -217,7 +229,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->cleancache_poolid = -1;

s->s_shrink.seeks = DEFAULT_SEEKS;
- s->s_shrink.shrink = prune_super;
+ s->s_shrink.scan_objects = super_cache_scan;
+ s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
}
out:
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 96e344e..2f91e2b 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -1029,7 +1029,7 @@ STATIC int
xfs_reclaim_inodes_ag(
struct xfs_mount *mp,
int flags,
- int *nr_to_scan)
+ long *nr_to_scan)
{
struct xfs_perag *pag;
int error = 0;
@@ -1150,7 +1150,7 @@ xfs_reclaim_inodes(
xfs_mount_t *mp,
int mode)
{
- int nr_to_scan = INT_MAX;
+ long nr_to_scan = LONG_MAX;

return xfs_reclaim_inodes_ag(mp, mode, &nr_to_scan);
}
@@ -1164,29 +1164,32 @@ xfs_reclaim_inodes(
* them to be cleaned, which we hope will not be very long due to the
* background walker having already kicked the IO off on those dirty inodes.
*/
-void
+long
xfs_reclaim_inodes_nr(
struct xfs_mount *mp,
- int nr_to_scan)
+ long nr_to_scan)
{
+ long nr = nr_to_scan;
+
/* kick background reclaimer and push the AIL */
xfs_reclaim_work_queue(mp);
xfs_ail_push_all(mp->m_ail);

- xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr_to_scan);
+ xfs_reclaim_inodes_ag(mp, SYNC_TRYLOCK | SYNC_WAIT, &nr);
+ return nr_to_scan - nr;
}

/*
* Return the number of reclaimable inodes in the filesystem for
* the shrinker to determine how much to reclaim.
*/
-int
+long
xfs_reclaim_inodes_count(
struct xfs_mount *mp)
{
struct xfs_perag *pag;
xfs_agnumber_t ag = 0;
- int reclaimable = 0;
+ long reclaimable = 0;

while ((pag = xfs_perag_get_tag(mp, ag, XFS_ICI_RECLAIM_TAG))) {
ag = pag->pag_agno + 1;
diff --git a/fs/xfs/xfs_icache.h b/fs/xfs/xfs_icache.h
index e0f138c..c860d07 100644
--- a/fs/xfs/xfs_icache.h
+++ b/fs/xfs/xfs_icache.h
@@ -30,8 +30,8 @@ int xfs_iget(struct xfs_mount *mp, struct xfs_trans *tp, xfs_ino_t ino,
void xfs_reclaim_worker(struct work_struct *work);

int xfs_reclaim_inodes(struct xfs_mount *mp, int mode);
-int xfs_reclaim_inodes_count(struct xfs_mount *mp);
-void xfs_reclaim_inodes_nr(struct xfs_mount *mp, int nr_to_scan);
+long xfs_reclaim_inodes_count(struct xfs_mount *mp);
+long xfs_reclaim_inodes_nr(struct xfs_mount *mp, long nr_to_scan);

void xfs_inode_set_reclaim_tag(struct xfs_inode *ip);

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index ab8839b..00aa61d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1514,19 +1514,19 @@ xfs_fs_mount(
return mount_bdev(fs_type, flags, dev_name, data, xfs_fs_fill_super);
}

-static int
+static long
xfs_fs_nr_cached_objects(
struct super_block *sb)
{
return xfs_reclaim_inodes_count(XFS_M(sb));
}

-static void
+static long
xfs_fs_free_cached_objects(
struct super_block *sb,
- int nr_to_scan)
+ long nr_to_scan)
{
- xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+ return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
}

static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0845283..13833e3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1325,10 +1325,6 @@ struct super_block {
int s_readonly_remount;
};

-/* superblock cache pruning functions */
-extern void prune_icache_sb(struct super_block *sb, int nr_to_scan);
-extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
-
extern struct timespec current_fs_time(struct super_block *sb);

/*
@@ -1621,8 +1617,8 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
- int (*nr_cached_objects)(struct super_block *);
- void (*free_cached_objects)(struct super_block *, int);
+ long (*nr_cached_objects)(struct super_block *);
+ long (*free_cached_objects)(struct super_block *, long);
};

/*
--
1.7.10

2012-11-27 23:21:18

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 11/19] fs: convert inode and dentry shrinking to be node aware

From: Dave Chinner <[email protected]>

Now that the shrinker is passing a nodemask in the scan control
structure, we can pass this to the generic LRU list code to
isolate reclaim to the lists on matching nodes.

This requires a small amount of refactoring of the LRU list API,
which might be best split out into a separate patch.
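
As a condensed sketch of the resulting call shape (lifted from the
inode pruner in the diff below), the shrinker's nodemask is simply
handed down to the list_lru walk so only the targeted nodes are
touched:

	freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
				       &freeable, nr_to_scan, nodes_to_walk);
	dispose_list(&freeable);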

Signed-off-by: Dave Chinner <[email protected]>
---
fs/dcache.c | 7 ++++---
fs/inode.c | 7 ++++---
fs/internal.h | 6 ++++--
fs/super.c | 22 +++++++++++++---------
fs/xfs/xfs_super.c | 6 ++++--
include/linux/fs.h | 4 ++--
include/linux/list_lru.h | 19 ++++++++++++++++---
lib/list_lru.c | 18 ++++++++++--------
8 files changed, 57 insertions(+), 32 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d72e388..7f107fb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -907,13 +907,14 @@ static int dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
* This function may fail to free any resources if all the dentries are in
* use.
*/
-long prune_dcache_sb(struct super_block *sb, long nr_to_scan)
+long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+ nodemask_t *nodes_to_walk)
{
LIST_HEAD(dispose);
long freed;

- freed = list_lru_walk(&sb->s_dentry_lru, dentry_lru_isolate,
- &dispose, nr_to_scan);
+ freed = list_lru_walk_nodemask(&sb->s_dentry_lru, dentry_lru_isolate,
+ &dispose, nr_to_scan, nodes_to_walk);
shrink_dentry_list(&dispose);
return freed;
}
diff --git a/fs/inode.c b/fs/inode.c
index 2662305..3857f9f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -731,13 +731,14 @@ static int inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock,
* to trim from the LRU. Inodes to be freed are moved to a temporary list and
* then are freed outside inode_lock by dispose_list().
*/
-long prune_icache_sb(struct super_block *sb, long nr_to_scan)
+long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+ nodemask_t *nodes_to_walk)
{
LIST_HEAD(freeable);
long freed;

- freed = list_lru_walk(&sb->s_inode_lru, inode_lru_isolate,
- &freeable, nr_to_scan);
+ freed = list_lru_walk_nodemask(&sb->s_inode_lru, inode_lru_isolate,
+ &freeable, nr_to_scan, nodes_to_walk);
dispose_list(&freeable);
return freed;
}
diff --git a/fs/internal.h b/fs/internal.h
index 7d7908b..95c4e9b 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -110,7 +110,8 @@ extern int open_check_o_direct(struct file *f);
* inode.c
*/
extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_icache_sb(struct super_block *sb, long nr_to_scan,
+ nodemask_t *nodes_to_scan);


/*
@@ -126,4 +127,5 @@ extern int invalidate_inodes(struct super_block *, bool);
* dcache.c
*/
extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
-extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan);
+extern long prune_dcache_sb(struct super_block *sb, long nr_to_scan,
+ nodemask_t *nodes_to_scan);
diff --git a/fs/super.c b/fs/super.c
index b1d24ef..3c975b1 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,10 +75,10 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
return -1;

if (sb->s_op && sb->s_op->nr_cached_objects)
- fs_objects = sb->s_op->nr_cached_objects(sb);
+ fs_objects = sb->s_op->nr_cached_objects(sb, &sc->nodes_to_scan);

- inodes = list_lru_count(&sb->s_inode_lru);
- dentries = list_lru_count(&sb->s_dentry_lru);
+ inodes = list_lru_count_nodemask(&sb->s_inode_lru, &sc->nodes_to_scan);
+ dentries = list_lru_count_nodemask(&sb->s_dentry_lru, &sc->nodes_to_scan);
total_objects = dentries + inodes + fs_objects + 1;

/* proportion the scan between the caches */
@@ -89,12 +89,13 @@ static long super_cache_scan(struct shrinker *shrink, struct shrink_control *sc)
* prune the dcache first as the icache is pinned by it, then
* prune the icache, followed by the filesystem specific caches
*/
- freed = prune_dcache_sb(sb, dentries);
- freed += prune_icache_sb(sb, inodes);
+ freed = prune_dcache_sb(sb, dentries, &sc->nodes_to_scan);
+ freed += prune_icache_sb(sb, inodes, &sc->nodes_to_scan);

if (fs_objects) {
fs_objects = (sc->nr_to_scan * fs_objects) / total_objects;
- freed += sb->s_op->free_cached_objects(sb, fs_objects);
+ freed += sb->s_op->free_cached_objects(sb, fs_objects,
+ &sc->nodes_to_scan);
}

drop_super(sb);
@@ -112,10 +113,13 @@ static long super_cache_count(struct shrinker *shrink, struct shrink_control *sc
return -1;

if (sb->s_op && sb->s_op->nr_cached_objects)
- total_objects = sb->s_op->nr_cached_objects(sb);
+ total_objects = sb->s_op->nr_cached_objects(sb,
+ &sc->nodes_to_scan);

- total_objects += list_lru_count(&sb->s_dentry_lru);
- total_objects += list_lru_count(&sb->s_inode_lru);
+ total_objects += list_lru_count_nodemask(&sb->s_dentry_lru,
+ &sc->nodes_to_scan);
+ total_objects += list_lru_count_nodemask(&sb->s_inode_lru,
+ &sc->nodes_to_scan);

total_objects = (total_objects / 100) * sysctl_vfs_cache_pressure;
drop_super(sb);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 00aa61d..33d67d5 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1516,7 +1516,8 @@ xfs_fs_mount(

static long
xfs_fs_nr_cached_objects(
- struct super_block *sb)
+ struct super_block *sb,
+ nodemask_t *nodes_to_count)
{
return xfs_reclaim_inodes_count(XFS_M(sb));
}
@@ -1524,7 +1525,8 @@ xfs_fs_nr_cached_objects(
static long
xfs_fs_free_cached_objects(
struct super_block *sb,
- long nr_to_scan)
+ long nr_to_scan,
+ nodemask_t *nodes_to_scan)
{
return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index befa46f..fda4ee2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1614,8 +1614,8 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
- long (*nr_cached_objects)(struct super_block *);
- long (*free_cached_objects)(struct super_block *, long);
+ long (*nr_cached_objects)(struct super_block *, nodemask_t *);
+ long (*free_cached_objects)(struct super_block *, long, nodemask_t *);
};

/*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index b0e3ba2..02796da 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -24,14 +24,27 @@ struct list_lru {
int list_lru_init(struct list_lru *lru);
int list_lru_add(struct list_lru *lru, struct list_head *item);
int list_lru_del(struct list_lru *lru, struct list_head *item);
-long list_lru_count(struct list_lru *lru);
+long list_lru_count_nodemask(struct list_lru *lru, nodemask_t *nodes_to_count);
+
+static inline long list_lru_count(struct list_lru *lru)
+{
+ return list_lru_count_nodemask(lru, &lru->active_nodes);
+}
+

typedef int (*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock,
void *cb_arg);
typedef void (*list_lru_dispose_cb)(struct list_head *dispose_list);

-long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
- void *cb_arg, long nr_to_walk);
+long list_lru_walk_nodemask(struct list_lru *lru, list_lru_walk_cb isolate,
+ void *cb_arg, long nr_to_walk, nodemask_t *nodes_to_walk);
+
+static inline long list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
+ void *cb_arg, long nr_to_walk)
+{
+ return list_lru_walk_nodemask(lru, isolate, cb_arg, nr_to_walk,
+ &lru->active_nodes);
+}

long list_lru_dispose_all(struct list_lru *lru, list_lru_dispose_cb dispose);

diff --git a/lib/list_lru.c b/lib/list_lru.c
index 881e342..0f08ed6 100644
--- a/lib/list_lru.c
+++ b/lib/list_lru.c
@@ -54,13 +54,14 @@ list_lru_del(
EXPORT_SYMBOL_GPL(list_lru_del);

long
-list_lru_count(
- struct list_lru *lru)
+list_lru_count_nodemask(
+ struct list_lru *lru,
+ nodemask_t *nodes_to_count)
{
long count = 0;
int nid;

- for_each_node_mask(nid, lru->active_nodes) {
+ for_each_node_mask(nid, *nodes_to_count) {
struct list_lru_node *nlru = &lru->node[nid];

spin_lock(&nlru->lock);
@@ -71,7 +72,7 @@ list_lru_count(

return count;
}
-EXPORT_SYMBOL_GPL(list_lru_count);
+EXPORT_SYMBOL_GPL(list_lru_count_nodemask);

static long
list_lru_walk_node(
@@ -116,16 +117,17 @@ restart:
}

long
-list_lru_walk(
+list_lru_walk_nodemask(
struct list_lru *lru,
list_lru_walk_cb isolate,
void *cb_arg,
- long nr_to_walk)
+ long nr_to_walk,
+ nodemask_t *nodes_to_walk)
{
long isolated = 0;
int nid;

- for_each_node_mask(nid, lru->active_nodes) {
+ for_each_node_mask(nid, *nodes_to_walk) {
isolated += list_lru_walk_node(lru, nid, isolate,
cb_arg, &nr_to_walk);
if (nr_to_walk <= 0)
@@ -133,7 +135,7 @@ list_lru_walk(
}
return isolated;
}
-EXPORT_SYMBOL_GPL(list_lru_walk);
+EXPORT_SYMBOL_GPL(list_lru_walk_nodemask);

long
list_lru_dispose_all_node(
--
1.7.10

2012-11-27 23:21:38

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 04/19] mm: new shrinker API

From: Dave Chinner <[email protected]>

The current shrinker callout API uses a single shrinker call for
multiple functions. To determine the function, a special magical
value is passed in a parameter to change the behaviour. This
complicates the implementation and return value specification for
the different behaviours.

Separate the two different behaviours into separate operations, one
to return a count of freeable objects in the cache, and another to
scan a certain number of objects in the cache for freeing. In
defining these new operations, ensure the return values and
resultant behaviours are clearly defined and documented.

Modify shrink_slab() to use the new API and implement the callouts
for all the existing shrinkers.

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/shrinker.h | 37 ++++++++++++++++++++++++----------
mm/vmscan.c | 50 +++++++++++++++++++++++++++++-----------------
2 files changed, 59 insertions(+), 28 deletions(-)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index ac6b8ee..4f59615 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,31 +4,47 @@
/*
* This struct is used to pass information from page reclaim to the shrinkers.
* We consolidate the values for easier extention later.
+ *
+ * The 'gfpmask' refers to the allocation we are currently trying to
+ * fulfil.
+ *
+ * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
+ * querying the cache size, so a fastpath for that case is appropriate.
*/
struct shrink_control {
gfp_t gfp_mask;

/* How many slab objects shrinker() should scan and try to reclaim */
- unsigned long nr_to_scan;
+ long nr_to_scan;
};

/*
* A callback you can register to apply pressure to ageable caches.
*
- * 'sc' is passed shrink_control which includes a count 'nr_to_scan'
- * and a 'gfpmask'. It should look through the least-recently-used
- * 'nr_to_scan' entries and attempt to free them up. It should return
- * the number of objects which remain in the cache. If it returns -1, it means
- * it cannot do any scanning at this time (eg. there is a risk of deadlock).
+ * @shrink() should look through the least-recently-used 'nr_to_scan' entries
+ * and attempt to free them up. It should return the number of objects which
+ * remain in the cache. If it returns -1, it means it cannot do any scanning at
+ * this time (eg. there is a risk of deadlock).
*
- * The 'gfpmask' refers to the allocation we are currently trying to
- * fulfil.
+ * @count_objects should return the number of freeable items in the cache. If
+ * there are no objects to free or the number of freeable items cannot be
+ * determined, it should return 0. No deadlock checks should be done during the
+ * count callback - the shrinker relies on aggregating scan counts that couldn't
+ * be executed due to potential deadlocks to be run at a later call when the
+ * deadlock condition is no longer pending.
*
- * Note that 'shrink' will be passed nr_to_scan == 0 when the VM is
- * querying the cache size, so a fastpath for that case is appropriate.
+ * @scan_objects will only be called if @count_objects returned a positive
+ * value for the number of freeable objects. The callout should scan the cache
+ * and attempt to free items from the cache. It should then return the number of
+ * objects freed during the scan, or -1 if progress cannot be made due to
+ * potential deadlocks. If -1 is returned, then no further attempts to call the
+ * @scan_objects will be made from the current reclaim context.
*/
struct shrinker {
int (*shrink)(struct shrinker *, struct shrink_control *sc);
+ long (*count_objects)(struct shrinker *, struct shrink_control *sc);
+ long (*scan_objects)(struct shrinker *, struct shrink_control *sc);
+
int seeks; /* seeks to recreate an obj */
long batch; /* reclaim batch size, 0 = default */

@@ -36,6 +52,7 @@ struct shrinker {
struct list_head list;
atomic_long_t nr_in_batch; /* objs pending delete */
};
+
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
extern void register_shrinker(struct shrinker *);
extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..55c4fc9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,19 +204,19 @@ static inline int do_shrinker_shrink(struct shrinker *shrinker,
*
* Returns the number of slab objects which we shrunk.
*/
-unsigned long shrink_slab(struct shrink_control *shrink,
+unsigned long shrink_slab(struct shrink_control *sc,
unsigned long nr_pages_scanned,
unsigned long lru_pages)
{
struct shrinker *shrinker;
- unsigned long ret = 0;
+ unsigned long freed = 0;

if (nr_pages_scanned == 0)
nr_pages_scanned = SWAP_CLUSTER_MAX;

if (!down_read_trylock(&shrinker_rwsem)) {
/* Assume we'll be able to shrink next time */
- ret = 1;
+ freed = 1;
goto out;
}

@@ -224,13 +224,16 @@ unsigned long shrink_slab(struct shrink_control *shrink,
unsigned long long delta;
long total_scan;
long max_pass;
- int shrink_ret = 0;
long nr;
long new_nr;
long batch_size = shrinker->batch ? shrinker->batch
: SHRINK_BATCH;

- max_pass = do_shrinker_shrink(shrinker, shrink, 0);
+ if (shrinker->scan_objects) {
+ max_pass = shrinker->count_objects(shrinker, sc);
+ WARN_ON(max_pass < 0);
+ } else
+ max_pass = do_shrinker_shrink(shrinker, sc, 0);
if (max_pass <= 0)
continue;

@@ -247,8 +250,8 @@ unsigned long shrink_slab(struct shrink_control *shrink,
do_div(delta, lru_pages + 1);
total_scan += delta;
if (total_scan < 0) {
- printk(KERN_ERR "shrink_slab: %pF negative objects to "
- "delete nr=%ld\n",
+ printk(KERN_ERR
+ "shrink_slab: %pF negative objects to delete nr=%ld\n",
shrinker->shrink, total_scan);
total_scan = max_pass;
}
@@ -276,20 +279,31 @@ unsigned long shrink_slab(struct shrink_control *shrink,
if (total_scan > max_pass * 2)
total_scan = max_pass * 2;

- trace_mm_shrink_slab_start(shrinker, shrink, nr,
+ trace_mm_shrink_slab_start(shrinker, sc, nr,
nr_pages_scanned, lru_pages,
max_pass, delta, total_scan);

while (total_scan >= batch_size) {
- int nr_before;
+ long ret;

- nr_before = do_shrinker_shrink(shrinker, shrink, 0);
- shrink_ret = do_shrinker_shrink(shrinker, shrink,
- batch_size);
- if (shrink_ret == -1)
- break;
- if (shrink_ret < nr_before)
- ret += nr_before - shrink_ret;
+ if (shrinker->scan_objects) {
+ sc->nr_to_scan = batch_size;
+ ret = shrinker->scan_objects(shrinker, sc);
+
+ if (ret == -1)
+ break;
+ freed += ret;
+ } else {
+ int nr_before;
+
+ nr_before = do_shrinker_shrink(shrinker, sc, 0);
+ ret = do_shrinker_shrink(shrinker, sc,
+ batch_size);
+ if (ret == -1)
+ break;
+ if (ret < nr_before)
+ freed += nr_before - ret;
+ }
count_vm_events(SLABS_SCANNED, batch_size);
total_scan -= batch_size;

@@ -307,12 +321,12 @@ unsigned long shrink_slab(struct shrink_control *shrink,
else
new_nr = atomic_long_read(&shrinker->nr_in_batch);

- trace_mm_shrink_slab_end(shrinker, shrink_ret, nr, new_nr);
+ trace_mm_shrink_slab_end(shrinker, freed, nr, new_nr);
}
up_read(&shrinker_rwsem);
out:
cond_resched();
- return ret;
+ return freed;
}

static inline int is_page_cache_freeable(struct page *page)
--
1.7.10

2012-11-28 01:13:20

by Chris Wilson

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE.

No, purge() expects a count of pages to be freed. Each pass of the
shrinker therefore tries to free a minimum of 128 pages.

> The shrinker doesn't work on bytes - it works on
> + * *objects*.

And I thought you were reviewing the shrinker API to be useful where a
single object may range between 4K and 4G.

> So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a single object will be freed, as
> + * the minimum object size is a page.
> + *
> + * But the craziest part comes when i915_gem_purge() has walked all the objects
> + * and can't free any memory. That results in i915_gem_shrink_all() being
> + * called, which idles the GPU and frees everything the driver has in it's
> + * active and inactive lists. It's basically hitting the driver with a great big
> + * hammer because it was busy doing stuff when something else generated memory
> + * pressure. This doesn't seem particularly wise...
> + */

As opposed to triggering an OOM? The choice was between custom code for
a hopefully rare code path in a situation of last resort, or first
implementing the simplest code that stopped i915 from starving the
system of memory.
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2012-11-28 03:17:25

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
> > +/*
> > + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> > + *
> > + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> > + * size is PAGE_SIZE.
>
> No, purge() expects a count of pages to be freed. Each pass of the
> shrinker therefore tries to free a minimum of 128 pages.

Ah, I got the shifts mixed up. I'd been looking at way too much crap
already when I saw this. But the fact this can be misunderstood says
something about the level of documentation that the code has (i.e.
none).

> > The shrinker doesn't work on bytes - it works on
> > + * *objects*.
>
> And I thought you were reviewing the shrinker API to be useful where a
> single object may range between 4K and 4G.

Which requires rewriting all the algorithms to not be dependent on
the subsystems using a fixed size object. The shrinker control
function is called shrink_slab() for a reason - it was expected to
be used to shrink caches of fixed sized objects allocated from slab
memory.

It has no concept of the amount of memory that each object consumes,
just an idea of how much *IO* it takes to replace the object in
memory once it's been reclaimed. The DEFAULT_SEEKS is designed to
encode the fact it generally takes 2 IOs to replace either a LRU
page or a filesystem slab object, and so balances the scanning based
on that value. i.e. the shrinker algorithms are solidly based around
fixed sized objects that have some relationship to the cost of
physical IO operations to replace them in the cache.
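
Roughly speaking - leaving out batching and the deferred scan count -
the proportioning in shrink_slab() works out as:

	/* max_pass is the freeable object count from the shrinker */
	delta = (4 * nr_pages_scanned) / shrinker->seeks;
	delta *= max_pass;
	do_div(delta, lru_pages + 1);	/* scale against the page LRU size */
	total_scan += delta;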

The API change is the first step in the path to removing these built
in assumptions. The current API is just insane and any attempt to
build on it is going to be futile. The way I see this developing is
this:

- make the shrink_slab count -> scan algorithm per node

- add information about size of objects in the cache for
fixed size object caches.
- the shrinker now has some idea of how many objects
need to be freed to be able to free a page of
memory, as well as the relative penalty for
replacing them.
- tells the shrinker the size of the cache
in bytes so overall memory footprint of the caches
can be taken into account

- add new count and scan operations for caches that are
based on memory used, not object counts
- allows us to use the same count/scan algorithm for
calculating how much pressure to put on caches
with variable size objects.

My care factor mostly ends here, as it will allow XFS to correctly
balance the metadata buffer cache (variable size objects) against the
inode, dentry and dquot caches which are object based. The next
steps that I'm about to give you are based on some discussions with
some MM people over bottles of red wine, so take it with a grain of
salt...

- calculate a "pressure" value for each cache controlled by a
shrinker so that the relative memory pressure between
caches can be compared. This allows the shrinkers to bias
reclaim based on where the memory pressure is being
generated

- start grouping shrinkers into a hierarchy, allowing
related shrinkers (e.g. all the caches in a memcg) to be
shrunk according to resource limits that can be placed on the
group. i.e. memory pressure is proportioned across
groups rather than many individual shrinkers.

- comments have been made to the extent that with generic
per-node lists and a node aware shrinker, all of the page
scanning could be driven by the shrinker infrastructure,
rather than the shrinkers being driven by how many pages
in the page cache just got scanned for reclaim.

IOWs, the main memory reclaim algorithm walks all the
shrinker groups to calculate overall memory pressure,
calculate how much reclaim is necessary, and then
proportion reclaim across all the shrinker groups. i.e.
everything is a shrinker.

This patch set is really just the start of a long process. Balance
between the page cache and VFS/filesystem shrinkers is critical to
the efficient operation of the OS under many, many workloads, so I'm
not about to change more than one little thing at a time. This API
change is just one little step. You'll get what you want eventually,
but you're not going to get it as a first step.
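
For reference, a shrinker under the new API boils down to something
like this (a bare-bones sketch; the my_cache_* functions are stand-in
names, not real kernel interfaces):

	static long my_cache_count(struct shrinker *shrink,
				   struct shrink_control *sc)
	{
		/* cheap: just report how many objects could be freed */
		return my_cache_nr_cached_objects();
	}

	static long my_cache_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
	{
		/* free up to sc->nr_to_scan objects, report how many went */
		return my_cache_free_objects(sc->nr_to_scan);
	}

	static struct shrinker my_shrinker = {
		.count_objects	= my_cache_count,
		.scan_objects	= my_cache_scan,
		.seeks		= DEFAULT_SEEKS,
	};

	/* registered once at init time with register_shrinker(&my_shrinker) */

Counting and scanning are separate operations, and the scan callback
returns how many objects it actually freed - see the i915 conversion
quoted further down for a real example.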

> > + * But the craziest part comes when i915_gem_purge() has walked all the objects
> > + * and can't free any memory. That results in i915_gem_shrink_all() being
> > + * called, which idles the GPU and frees everything the driver has in it's
> > + * active and inactive lists. It's basically hitting the driver with a great big
> > + * hammer because it was busy doing stuff when something else generated memory
> > + * pressure. This doesn't seem particularly wise...
> > + */
>
> As opposed to triggering an OOM? The choice was between custom code for
> a hopefully rare code path in a situation of last resort, or first
> implementing the simplest code that stopped i915 from starving the
> system of memory.

And when it's something else that is causing the memory pressure?
The shrinker gets called whenever something runs low on memory - it
might be called thousands of times a second, so there's a very good
chance you have very little to free after the first purge has
occurred. After that you're going to idle the GPU and purge all the
memory on every single shrinker call, even though the GPU is not
generating memory pressure and not causing the shrinkers to run.
That's why it's crazy - it's got close to worst-case behaviour when
the GPU is already using as little memory as possible.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-28 08:22:04

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On 11/28/2012 07:17 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
>>> +/*
>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>> + *
>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>> + * size is PAGE_SIZE.
>>
>> No, purge() expects a count of pages to be freed. Each pass of the
>> shrinker therefore tries to free a minimum of 128 pages.
>
> Ah, I got the shifts mixed up. I'd been looking at way too much crap
> already when I saw this. But the fact this can be misunderstood says
> something about the level of documentation that the code has (i.e.
> none).
>
>>> The shrinker doesn't work on bytes - it works on
>>> + * *objects*.
>>
>> And I thought you were reviewing the shrinker API to be useful where a
>> single object may range between 4K and 4G.
>
> Which requires rewriting all the algorithms to not be dependent on
> the subsystems using a fixed size object. The shrinker control
> function is called shrink_slab() for a reason - it was expected to
> be used to shrink caches of fixed sized objects allocated from slab
> memory.
>
> It has no concept of the amount of memory that each object consumes,
> just an idea of how much *IO* it takes to replace the object in
> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> encode the fact it generally takes 2 IOs to replace either a LRU
> page or a filesystem slab object, and so balances the scanning based
> on that value. i.e. the shrinker algorithms are solidly based around
> fixed sized objects that have some relationship to the cost of
> physical IO operations to replace them in the cache.

One nit: it shouldn't take 2 IOs to replace a slab object, right? This
should be the cost of allocating a new page, which can contain multiple
objects.

Once the page is in, a new object should be quite cheap to come up with.

This is a very wild thought, but now that I am diving deep in the
shrinker API, and seeing things like this:

	if (reclaim_state) {
		sc->nr_reclaimed += reclaim_state->reclaimed_slab;
		reclaim_state->reclaimed_slab = 0;
	}

I am becoming more convinced that we should have a page-based mechanism,
like the rest of vmscan.

Also, if we are seeing pressure from someone requesting user pages, what
good does it do to free, say, 35 MB of memory, if this means we are
freeing objects across 5k different pages without actually releasing
any of them? (It is still TBD whether this is a theoretical problem or a
practical one.) It would maybe be better to free objects that are
moderately hot, but are on pages dominated by cold objects...


>
> The API change is the first step in the path to removing these built
> in assumptions. The current API is just insane and any attempt to
> build on it is going to be futile.

Amen, brother!

> The way I see this developing is
> this:
>
> - make the shrink_slab count -> scan algorithm per node
>
pages are per-node.

> - add information about size of objects in the cache for
> fixed size object caches.
> - the shrinker now has some idea of how many objects
> need to be freed to be able to free a page of
> memory, as well as the relative penalty for
> replacing them.
This is still guesswork; telling it how many pages it should free could
be a better idea.

> - tells the shrinker the size of the cache
> in bytes so overall memory footprint of the caches
> can be taken into account

> - add new count and scan operations for caches that are
> based on memory used, not object counts
> - allows us to use the same count/scan algorithm for
> calculating how much pressure to put on caches
> with variable size objects.

IOW, pages.

> My care factor mostly ends here, as it will allow XFS to corectly
> balance the metadata buffer cache (variable size objects) against the
> inode, dentry and dquot caches which are object based. The next
> steps that I'm about to give you are based on some discussions with
> some MM people over bottles of red wine, so take it with a grain of
> salt...
>
> - calculate a "pressure" value for each cache controlled by a
> shrinker so that the relative memory pressure between
> caches can be compared. This allows the shrinkers to bias
> reclaim based on where the memory pressure is being
> generated
>

Ok, if a cache is using a lot of memory, this would indicate it has the
dominant workload, right? Should we free from it, or should we free from
the others, so this one gets the pages it needs?

> - start grouping shrinkers into a heirarchy, allowing
> related shrinkers (e.g. all the caches in a memcg) to be
> shrunk according resource limits that can be placed on the
> group. i.e. memory pressure is proportioned across
> groups rather than many individual shrinkers.
>
pages are already grouped like that!

> - comments have been made to the extent that with generic
> per-node lists and a node aware shrinker, all of the page
> scanning could be driven by the shrinker infrastructure,
> rather than the shrinkers being driven by how many pages
> in the page cache just got scanned for reclaim.
>
> IOWs, the main memory reclaim algorithm walks all the
> shrinkers groups to calculate overall memory pressure,
> calculate how much reclaim is necessary, and then
> proportion reclaim across all the shrinker groups. i.e.
> everything is a shrinker.
>
> This patch set is really just the start of a long process. balance
> between the page cache and VFS/filesystem shrinkers is critical to
> the efficient operation of the OS under many, many workloads, so I'm
> not about to change more than oe little thing at a time. This API
> change is just one little step. You'll get what you want eventually,
> but you're not going to get it as a first step.
>

I have to note again that this is my first *serious* look at the
problem... but this is a summary of what I got, that fits in the context
of this particular discussion =)

I still have to go through all your other patches...

But one thing we seem to agree on is that we have quite a long road ahead.

2012-11-28 16:10:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/19] list: add a new LRU list type

On Wed, Nov 28, 2012 at 10:14:33AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Several subsystems use the same construct for LRU lists - a list
> head, a spin lock and an item count. They also use exactly the same
> code for adding and removing items from the LRU. Create a generic
> type for these LRU lists.
>
> This is the beginning of generic, node aware LRUs for shrinkers to
> work with.

Can you please add kerneldoc comments for the functions, and add
symbolic constants for the possible return values from the isolate
callback?

2012-11-28 16:17:11

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 15/19] xfs: convert dquot cache lru to list_lru

> + if (!xfs_dqflock_nowait(dqp))
> + xfs_dqunlock(dqp);
> + goto out_miss_busy;

This seems to miss braces.
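
Presumably it was meant to read (guessing at the intent from the goto):

	if (!xfs_dqflock_nowait(dqp)) {
		xfs_dqunlock(dqp);
		goto out_miss_busy;
	}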

2012-11-28 21:29:12

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> >> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
> >>> +/*
> >>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> >>> + *
> >>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> >>> + * size is PAGE_SIZE.
> >>
> >> No, purge() expects a count of pages to be freed. Each pass of the
> >> shrinker therefore tries to free a minimum of 128 pages.
> >
> > Ah, I got the shifts mixed up. I'd been looking at way too much crap
> > already when I saw this. But the fact this can be misunderstood says
> > something about the level of documentation that the code has (i.e.
> > none).
> >
> >>> The shrinker doesn't work on bytes - it works on
> >>> + * *objects*.
> >>
> >> And I thought you were reviewing the shrinker API to be useful where a
> >> single object may range between 4K and 4G.
> >
> > Which requires rewriting all the algorithms to not be dependent on
> > the subsystems using a fixed size object. The shrinker control
> > function is called shrink_slab() for a reason - it was expected to
> > be used to shrink caches of fixed sized objects allocated from slab
> > memory.
> >
> > It has no concept of the amount of memory that each object consumes,
> > just an idea of how much *IO* it takes to replace the object in
> > memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> > encode the fact it generally takes 2 IOs to replace either a LRU
> > page or a filesystem slab object, and so balances the scanning based
> > on that value. i.e. the shrinker algorithms are solidly based around
> > fixed sized objects that have some relationship to the cost of
> > physical IO operations to replace them in the cache.
>
> One nit: It shouldn't take 2IOs to replace a slab object, right?
> objects.

A random dentry in a small directory will take one IO to read the
inode, then another to read the block the dirent sits in. To read an
inode from a cached dentry will generally take one IO to read the
inode, and another to read related, out-of-inode information (e.g.
attributes or extent/block maps). Sometimes it will only take one IO,
sometimes it might take 3 or, in the case of dirents, could take
hundreds of IOs if the directory structure is large enough.

So a value of 2 seeks to replace any single dentry/inode in the
cache is a pretty good default to use.

> This
> should be the cost of allocating a new page, that can contain, multiple
> Once the page is in, a new object should be quite cheap to come up with.

It's not the cost of allocating the page (a couple of microseconds)
that is being considered - it's the 3-4 orders of magnitude worse cost
of reading the object from disk (could be 20ms). The slab/page
allocation is lost in the noise compared to the time it takes to
fill the page cache page with data or a single slab object.
Essentially, slab pages with multiple objects in them are much more
expensive to replace in the cache than a page cache page....

> This is a very wild thought, but now that I am diving deep in the
> shrinker API, and seeing things like this:
>
> if (reclaim_state) {
> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> reclaim_state->reclaimed_slab = 0;
> }

That's not part of the shrinker - that's part of the vmscan
code, external to the shrinker infrastructure. It's getting
information back from the slab caches behind the shrinkers, and it's
not the full picture because many shrinkers are not backed by slab
caches. It's a workaround for not having accurate feedback from
the shrink_slab() code about how many pages were freed.

Essentially, the problem is an impedance mismatch between the way
the LRUs are scanned/balanced (in pages) and slab caches are managed
(by objects). That's what needs unifying...

> I am becoming more convinced that we should have a page-based mechanism,
> like the rest of vmscan.

Been thought about and considered before. Would you like to rewrite
the slab code?

> Also, if we are seeing pressure from someone requesting user pages, what
> good does it make to free, say, 35 Mb of memory, if this means we are
> freeing objects across 5k different pages, without actually releasing
> any of them? (still is TBD if this is a theoretical problem or a
> practical one). It would maybe be better to free objects that are
> moderately hot, but are on pages dominated by cold objects...

Yup, that's a problem, but now you're asking shrinker
implementations to know in great detail the physical locality of
objects and not just their temporal locality. The node-aware LRU list
does this at a coarse level, but to do page-based reclaim you need
to track pages in SL*B that contain unreferenced objects, as those
are the only ones that can be reclaimed.

If you have no pages with unreferenced objects, then you don't make
progress and you have to fall back to freeing unreferenced objects
from random pages. And under most workloads that aren't benchmarks,
slab object population ends up with little correlation between
physical and temporal locality. Hence this is the norm rather than
the exception...

Also, handling physical locality in this way means we'd need to tie
the shrinker deep into the SLAB/SLUB/SLOB implementation that is
being used to allocate the objects..

There have been various attempts at this sort of thing in the past.
e.g:

http://marc.info/?l=linux-mm&m=112810938004047

or for slab defragmentation:

https://lkml.org/lkml/2010/1/29/332

and more on LRUs in slab caches and general shrinker design in that
thread (definitely worth reading this, at least):

https://lkml.org/lkml/2010/2/2/499

And it's made far more complex by the fact that some shrinkers don't
necessarily free the objects they are working on. e.g. the VFS inode
cache shrinker basically hands objects to XFS, and the XFS inode
cache takes over from there (via the XFS inode cache shrinker) to
free the objects. i.e. two shrinkers act on the same structure...

> > The API change is the first step in the path to removing these built
> > in assumptions. The current API is just insane and any attempt to
> > build on it is going to be futile.
>
> Amen, brother!
>
> > The way I see this developing is
> > this:
> >
> > - make the shrink_slab count -> scan algorithm per node
> >
> pages are per-node.
>
> > - add information about size of objects in the cache for
> > fixed size object caches.
> > - the shrinker now has some idea of how many objects
> > need to be freed to be able to free a page of
> > memory, as well as the relative penalty for
> > replacing them.
> this is still guesswork, telling how many pages it should free, could
> be a better idea.
>
> > - tells the shrinker the size of the cache
> > in bytes so overall memory footprint of the caches
> > can be taken into account
>
> > - add new count and scan operations for caches that are
> > based on memory used, not object counts
> > - allows us to use the same count/scan algorithm for
> > calculating how much pressure to put on caches
> > with variable size objects.
>
> IOW, pages.

Not necessarily - if we are going to deal with multiple objects in a
page as well as multi-page objects, then pages are unable to express
the full range of possibilities. We may as well use byte counts at
this point, especially when you consider page sizes differ on
different platforms...

> > My care factor mostly ends here, as it will allow XFS to corectly
> > balance the metadata buffer cache (variable size objects) against the
> > inode, dentry and dquot caches which are object based. The next
> > steps that I'm about to give you are based on some discussions with
> > some MM people over bottles of red wine, so take it with a grain of
> > salt...
> >
> > - calculate a "pressure" value for each cache controlled by a
> > shrinker so that the relative memory pressure between
> > caches can be compared. This allows the shrinkers to bias
> > reclaim based on where the memory pressure is being
> > generated
> >
>
> Ok, if a cache is using a lot of memory, this would indicate it has the
> dominant workload, right?

Not necessarily. Someone might just have run a find across their
filesystem, and that is where all the pressure is coming from. In
this case, you don't want that memory pressure to toss out all the
other caches. I suspect that the "pressure" measure is going to need
to take into account cache hit rates to work properly...

> Should we free from it, or should we free from
> the others, so this ones gets the pages it needs?

That's the million dollar question...

> > - start grouping shrinkers into a heirarchy, allowing
> > related shrinkers (e.g. all the caches in a memcg) to be
> > shrunk according resource limits that can be placed on the
> > group. i.e. memory pressure is proportioned across
> > groups rather than many individual shrinkers.
> >
> pages are already grouped like that!

But shrinkers and slab caches are not.

Besides, once you start grouping shrinkers, why should we treat the
page LRU list scanning any differently from any other cache that has
a shrinker?

> But one thing we seem to agree is that we have quite a long road ahead

Definitely.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-29 10:29:41

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On 11/29/2012 01:28 AM, Dave Chinner wrote:
> On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
>> On 11/28/2012 07:17 AM, Dave Chinner wrote:
>>> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
>>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
>>>>> +/*
>>>>> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
>>>>> + *
>>>>> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
>>>>> + * size is PAGE_SIZE.
>>>>
>>>> No, purge() expects a count of pages to be freed. Each pass of the
>>>> shrinker therefore tries to free a minimum of 128 pages.
>>>
>>> Ah, I got the shifts mixed up. I'd been looking at way too much crap
>>> already when I saw this. But the fact this can be misunderstood says
>>> something about the level of documentation that the code has (i.e.
>>> none).
>>>
>>>>> The shrinker doesn't work on bytes - it works on
>>>>> + * *objects*.
>>>>
>>>> And I thought you were reviewing the shrinker API to be useful where a
>>>> single object may range between 4K and 4G.
>>>
>>> Which requires rewriting all the algorithms to not be dependent on
>>> the subsystems using a fixed size object. The shrinker control
>>> function is called shrink_slab() for a reason - it was expected to
>>> be used to shrink caches of fixed sized objects allocated from slab
>>> memory.
>>>
>>> It has no concept of the amount of memory that each object consumes,
>>> just an idea of how much *IO* it takes to replace the object in
>>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
>>> encode the fact it generally takes 2 IOs to replace either a LRU
>>> page or a filesystem slab object, and so balances the scanning based
>>> on that value. i.e. the shrinker algorithms are solidly based around
>>> fixed sized objects that have some relationship to the cost of
>>> physical IO operations to replace them in the cache.
>>
>> One nit: It shouldn't take 2IOs to replace a slab object, right?
>> objects.
>
> A random dentry in a small directory will take on IO to read the
> inode, then another to read the block the dirent sits in. TO read an
> inode froma cached dentry will generally take one IO to read the
> inode, and another to read related, out of inode information (e.g.
> attributes or extent/block maps). Sometimes it will only take on IO,
> sometimes it might take 3 or, in the case of dirents, coult take
> hundreds of IOs if the directory structure is large enough.
>
> So a default of 2 seeks to replace any single dentry/inode in the
> cache is a pretty good default to use.
>
>> This
>> should be the cost of allocating a new page, that can contain, multiple
>> Once the page is in, a new object should be quite cheap to come up with.
>

Indeed. More on this in the next paragraph...

> It's not the cost of allocating the page (a couple of microseconds)
> that is being considered - it the 3-4 orders of magnitude worse cost
> of reading the object from disk (could be 20ms). The slab/page
> allocation is lost in the noise compared to the time it takes to
> fill the page cache page with data or a single slab object.
> Essentially, slab pages with multiple objects in them are much more
> expensive to replace in the cache than a page cache page....
>
>> This is a very wild thought, but now that I am diving deep in the
>> shrinker API, and seeing things like this:
>>
>> if (reclaim_state) {
>> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
>> reclaim_state->reclaimed_slab = 0;
>> }
>
> That's not part of the shrinker - that's part of the vmscan
> code, external to the shrinker infrastructure. It's getting
> information back from the slab caches behind the shrinkers, and it's
> not the full picture because many shrinkers are not backed by slab
> caches. It's a work around for not not having accurate feedback from
> the shrink_slab() code about how many pages were freed.
>
I know it is not part of the shrinkers, and that is precisely my point.
vmscan needs to go through these kinds of hacks because our API is not
strong enough to just give it back the answer that matters to the caller.

> Essentially, the problem is an impedance mismatch between the way
> the LRUs are scanned/balanced (in pages) and slab caches are managed
> (by objects). That's what needs unifying...
>
So read my statement again, Dave: this is precisely what I am advocating!

The fact that you are so much more concerned with bringing the dentries
back from disk is just an obvious consequence of your FS background. The
problem I am more concerned with is when a user needs to allocate a page
for whatever reason. We're short on pages, and then we shrink. But the
shrink gives us nothing. If this is a user-page driven workload, it
would be better to do this than to get rid of user pages - which we
may end up doing if the shrinkers do not release enough pages. This is
in contrast with a dcache-driven workload, where what you are saying
makes total sense.

In fact, those goals are so orthogonal that I wonder if it wouldn't be
worth it to introduce some kind of pressure measurement to determine if
the workload is kernel-driven or user-page driven.

I am right now experimenting with something (dcache only for starters)
like this (simplified version):

static void dentry_lru_add(struct dentry *dentry)
{
	if (!dentry->score) {
		struct page *page;
		page = virt_to_head_page(dentry);
		page->slab_score_page += score;
		dentry->score = score();
	}
}

static void __dentry_lru_del(struct dentry *dentry)
{
	struct page *page;
	page = virt_to_head_page(dentry);

	dentry->d_flags &= ~DCACHE_SHRINK_LIST;
	page->slab_score_page += dentry->score();
}


score decreases as time passes. So if a page has a very large score,
it means that it is likely to have a lot of objects that are somewhat
old. When we scan, we start from them. If this is a purely kernel-driven
workload, we only delete the objects that are themselves old. If this is
a user-driven workload, we try to take down the rest as well. With a
grey area in the middle, maybe...

In fact, this may (I am not bold enough to state anything as certain at
this point!!) free a lot *fewer* dentries than the normal shrinkers
would, while still being able to give userspace what it really
cares about: pages!

The big advantage of scanning through pages is that it becomes trivial
to walk them either per-zone or per-memcg, since you can trivially derive
the set of pages that belong to each of them.

The determination of user vs kernel driven workloads can also be done
per-{memcg,zone}. So we can conceivably have a {memcg,zone} that is
receiving a lot of kernel-objects pressure, and another that is
receiving a lot of user-driven pressure, due to the different nature of
their workloads. We would shrink them differently, in this case.

>> I am becoming more convinced that we should have a page-based mechanism,
>> like the rest of vmscan.
>
> Been thought about and consiered before. Would you like to rewrite
> the slab code?
>
It would be fun!

>> Also, if we are seeing pressure from someone requesting user pages, what
>> good does it make to free, say, 35 Mb of memory, if this means we are
>> freeing objects across 5k different pages, without actually releasing
>> any of them? (still is TBD if this is a theoretical problem or a
>> practical one). It would maybe be better to free objects that are
>> moderately hot, but are on pages dominated by cold objects...
>
> Yup, that's a problem, but now you're asking shrinker
> implementations to know in great detail the physical locality of
> object and not just the temporal locality. the node-aware LRU list
> does this at a coarse level, but to do page based reclaim you need
> ot track pages in SL*B that contain unreferenced objects as those
> are the only ones that can be reclaimed.
>
Yes, this is the part that I am still struggling with. What I am
currently trying to do is come up with a stable SL*B API that will allow
us to know which objects are alive in a given page, and how to walk them.

Combined with the score mechanism above, we can define a minimum score
for an object to be freed - which relates to how long ago it was marked
- and then free the objects in the pages that have a high enough score.
If pressure persists, we lower the threshold. Do this n times.

> If you have no pages with unreferenced objects, then you don't make
> progress and you have to fall back to freeing unreferenced objects
> from random pages. ANd under most workloads that aren't benchmarks,
> slab object population ends up with little correlation between
> physical and temporal locality. Hence this is the norm rather than
> the exception..
>

More or less.

Last time I counted, a dentry had 248 bytes. Let's say 256 for
simplicity. So each page will have around 16 dentries (it is actually
usually a bit less than that, due to slab metadata).

In a normal shrink, how many objects do you expect to free? At some
point, as your scan rate grows, the likelihood of having a bunch of them
in the same page increases, even if not all objects in that page are
cold. But still, we can easily scan the objects in a page with a high
score, and find a bunch of objects that are cold. If it is a
kernel-driven workload, we move on to the next page when they are out.
If it is user-driven, it may be worth it to free the others, and give
the user what it wants - a page - leaving the thousands of other
dentries untouched.


> Also, handling physical locality in this way means we'd need to tie
> the shrinker deep into the SLAB/SLUB/SLOB implementation that is
> being used to allocate the objects..
>
> There have been various attempts at this sort of thing in the past.
> e.g:
>
> http://marc.info/?l=linux-mm&m=112810938004047
>
> or for slab defragmentation:
>
> https://lkml.org/lkml/2010/1/29/332
>
> and more on LRUs in slab caches and general shrinker design in that
> thread (definitely worth reading this, at least):
>
> https://lkml.org/lkml/2010/2/2/499
>

Thanks for the pointers. I will read them right now. A lot of good ideas
end up being killed by the unseen details, so if I radically change my
mind, I will let you know =)

> And it's made far more complex by the fact that some shrinkers don't
> necessarily free the objects they are working on. e.g. the VFS inode
> cache shrinker basically hands objects to XFS, and the XFS inode
> cache takes over from there (via the XFS inode cache shrinker) to
> free the objects. i.e. two shrinkers act on the same structure...
>
I will dig into this as well. Right now, I don't really understand it.

>>> The API change is the first step in the path to removing these built
>>> in assumptions. The current API is just insane and any attempt to
>>> build on it is going to be futile.
>>
>> Amen, brother!
>>
>>> The way I see this developing is
>>> this:
>>>
>>> - make the shrink_slab count -> scan algorithm per node
>>>
>> pages are per-node.
>>
>>> - add information about size of objects in the cache for
>>> fixed size object caches.
>>> - the shrinker now has some idea of how many objects
>>> need to be freed to be able to free a page of
>>> memory, as well as the relative penalty for
>>> replacing them.
>> this is still guesswork, telling how many pages it should free, could
>> be a better idea.
>>
>>> - tells the shrinker the size of the cache
>>> in bytes so overall memory footprint of the caches
>>> can be taken into account
>>
>>> - add new count and scan operations for caches that are
>>> based on memory used, not object counts
>>> - allows us to use the same count/scan algorithm for
>>> calculating how much pressure to put on caches
>>> with variable size objects.
>>
>> IOW, pages.
>
> Not necessarily - if we are going to deal with multiple objects in a
> page as well as multi-page objects, then pages are unable to express
> the full range of possibilities. We may as well use byte counts at
> this point, expecially when you consider page sizes differ on
> different platforms...
>

Byte counts by themselves don't seem that bad to me. It also makes
sense. But I would still note that for that memory to be useful, it must
be in consumable units. What is a consumable unit depends on where the
pressure comes from.

An object is a consumable unit, so if the pressure is coming from the
dentry cache, it is great to free 4000 consumable units - objects. If it
is coming from the page cache, it would be better to free as many pages
as we can, because that is what can be used.

>>> My care factor mostly ends here, as it will allow XFS to corectly
>>> balance the metadata buffer cache (variable size objects) against the
>>> inode, dentry and dquot caches which are object based. The next
>>> steps that I'm about to give you are based on some discussions with
>>> some MM people over bottles of red wine, so take it with a grain of
>>> salt...
>>>
>>> - calculate a "pressure" value for each cache controlled by a
>>> shrinker so that the relative memory pressure between
>>> caches can be compared. This allows the shrinkers to bias
>>> reclaim based on where the memory pressure is being
>>> generated
>>>
>>
>> Ok, if a cache is using a lot of memory, this would indicate it has the
>> dominant workload, right?
>
> Not necessarily. Someone might jus thave run a find across their
> filesystem, and that is where all the pressure is coming from. In
> this case, you don't want that memory prssure to toss out all the
> other caches. I suspect that the "pressure" measure is going to need
> to take into account cache hit rates to work properly...
>
Indeed you are right, memory usage is a very bad estimate. Let me stand
corrected: memory usage over a sliding time window, or something like
that.

>> Should we free from it, or should we free from
>> the others, so this ones gets the pages it needs?
>
> That's the million dollar question...
>

For god's sake, it's time we stop this bullshit of telling people
"that's the million dollar question". Those fucktards are printing money
like crazy, so we need to correct it for inflation, at the very least.

This is the USD 1,289,648.82 question, using Y2K as a baseline.


>>> - start grouping shrinkers into a heirarchy, allowing
>>> related shrinkers (e.g. all the caches in a memcg) to be
>>> shrunk according resource limits that can be placed on the
>>> group. i.e. memory pressure is proportioned across
>>> groups rather than many individual shrinkers.
>>>
>> pages are already grouped like that!
>
> But shrinkers and slab caches are not.
>
> Besides, once you start grouping shrinkers, why should we treat the
> page LRU list scanning any differently from any other cache that has
> a shrinker?
>

Dunno. Should we?

2012-11-29 19:02:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

Dave Chinner <[email protected]> writes:
>
> Comments, thoughts and flames all welcome.

Doing the reclaim per CPU sounds like a big change in the VM balance.
Doesn't this invalidate some zone reclaim mode settings?
How did you validate all this?

-Andi

--
[email protected] -- Speaking for myself only

2012-11-29 22:02:23

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On Thu, Nov 29, 2012 at 02:29:33PM +0400, Glauber Costa wrote:
> On 11/29/2012 01:28 AM, Dave Chinner wrote:
> > On Wed, Nov 28, 2012 at 12:21:54PM +0400, Glauber Costa wrote:
> >> On 11/28/2012 07:17 AM, Dave Chinner wrote:
> >>> On Wed, Nov 28, 2012 at 01:13:11AM +0000, Chris Wilson wrote:
> >>>> On Wed, 28 Nov 2012 10:14:44 +1100, Dave Chinner <[email protected]> wrote:
> >>>>> The shrinker doesn't work on bytes - it works on
> >>>>> + * *objects*.
> >>>>
> >>>> And I thought you were reviewing the shrinker API to be useful where a
> >>>> single object may range between 4K and 4G.
> >>>
> >>> Which requires rewriting all the algorithms to not be dependent on
> >>> the subsystems using a fixed size object. The shrinker control
> >>> function is called shrink_slab() for a reason - it was expected to
> >>> be used to shrink caches of fixed sized objects allocated from slab
> >>> memory.
> >>>
> >>> It has no concept of the amount of memory that each object consumes,
> >>> just an idea of how much *IO* it takes to replace the object in
> >>> memory once it's been reclaimed. The DEFAULT_SEEKS is design to
> >>> encode the fact it generally takes 2 IOs to replace either a LRU
> >>> page or a filesystem slab object, and so balances the scanning based
> >>> on that value. i.e. the shrinker algorithms are solidly based around
> >>> fixed sized objects that have some relationship to the cost of
> >>> physical IO operations to replace them in the cache.
> >>
> >> One nit: It shouldn't take 2IOs to replace a slab object, right?
> >> objects.
> >
> > A random dentry in a small directory will take on IO to read the
> > inode, then another to read the block the dirent sits in. TO read an
> > inode froma cached dentry will generally take one IO to read the
> > inode, and another to read related, out of inode information (e.g.
> > attributes or extent/block maps). Sometimes it will only take on IO,
> > sometimes it might take 3 or, in the case of dirents, coult take
> > hundreds of IOs if the directory structure is large enough.
> >
> > So a default of 2 seeks to replace any single dentry/inode in the
> > cache is a pretty good default to use.
> >
> >> This
> >> should be the cost of allocating a new page, that can contain, multiple
> >> Once the page is in, a new object should be quite cheap to come up with.
> >
>
> Indeed. More on this in the next paragraph...

I'm not sure what you are trying to say here. Are you saying that
you think that the IO cost for replacing a slab cache object doesn't
matter?

> > It's not the cost of allocating the page (a couple of microseconds)
> > that is being considered - it the 3-4 orders of magnitude worse cost
> > of reading the object from disk (could be 20ms). The slab/page
> > allocation is lost in the noise compared to the time it takes to
> > fill the page cache page with data or a single slab object.
> > Essentially, slab pages with multiple objects in them are much more
> > expensive to replace in the cache than a page cache page....
> >
> >> This is a very wild thought, but now that I am diving deep in the
> >> shrinker API, and seeing things like this:
> >>
> >> if (reclaim_state) {
> >> sc->nr_reclaimed += reclaim_state->reclaimed_slab;
> >> reclaim_state->reclaimed_slab = 0;
> >> }
> >
> > That's not part of the shrinker - that's part of the vmscan
> > code, external to the shrinker infrastructure. It's getting
> > information back from the slab caches behind the shrinkers, and it's
> > not the full picture because many shrinkers are not backed by slab
> > caches. It's a work around for not not having accurate feedback from
> > the shrink_slab() code about how many pages were freed.
> >
> I know it is not part of the shrinkers, and that is precisely my point.
> vmscan needs to go through this kinds of hacks because our API is not
> strong enough to just give it back the answer that matters to the caller.

What matters is that the slab caches are shrunk in proportion to the
page cache. i.e. balanced reclaim. For dentry and inode caches, what
matters is the number of objects reclaimed because the shrinker
algorithm balances based on the relative cost of object replacement
in the cache.

e.g. if you have 1000 pages in the page LRUs and 1000 objects in
the dentry cache, each taking 1 IO to replace, then if you reclaim 2
pages from the page LRUs and 2 pages worth of objects from the dentry
cache (at roughly 18 dentries per page, that's about 36 objects), it
will take 2 IOs to replace the pages in the LRU, but 36 IOs to replace
the objects in the dentry cache that were reclaimed.

This is why the shrinker balances "objects scanned" vs "LRU pages
scanned" - it treats each page as an object and the shrinker relates
that to the relative cost of objects in the slab cache being
reclaimed. i.e. the focus is on keeping a balance between caches,
not reclaiming an absolute number of pages.

Note: I'm not saying this is perfect; what I'm trying to do is let
you know the "why" behind the current algorithm. i.e. why it mostly
works, and why ignoring the principles behind why it works is
fraught with danger...

> > Essentially, the problem is an impedance mismatch between the way
> > the LRUs are scanned/balanced (in pages) and slab caches are managed
> > (by objects). That's what needs unifying...
> >
> So read my statement again, Dave: this is precisely what I am advocating!

Not clearly enough, obviously :/

> The fact that you are so more concerned with bringing the dentries back
> from disk is just an obvious consequence of your FS background.

No, it's a solution to a systemic problem that Andrew Morton
identified way back in the 2.5 days, which resulted in him implementing
the shrinker infrastructure to directly control the sizes of the
inode and dentry caches. IOWs, the historic reason for having the
shrinkers is to balance the dentry/inode caches against the page cache
size to solve performance problems related to cache imbalance.

> The
> problem I was more concerned, is when a user needs to allocate a page
> for whatever reason. We're short on pages, and then we shrink. But the
> shrink gives us nothing. If this is a user-page driven workload, it
> should be better to do this, than to get rid of user pages - which we
> may end up doing if the shrinkers does not release enough pages. This is
> in contrast with a dcache-driven workload, where what you are saying
> makes total sense.

What you are ignoring is that the dcache is a global resource.
Trashing the dcache because a user wants anonymous memory will cause
system/memcg-wide performance degradation, instead of just slowing
down the single user application. It is preferable that the
user application demanding memory is penalised, not the whole
system.

> In fact, those goals are so orthogonal, that I wonder if it wouldn't be
> worth it to introduce some kind of pressure measurement to determine if
> the workload is kernel-driven or userpages-driven.

Dentry cache growth is user-driven - the kernel doesn't walk
directories or open files itself - and is a direct representation of
the recent set of userspace directory/file accesses. So the
whole concept of "treat the dcache as a kernel-driven cache" is
fundamentally flawed.

> I am right now experimenting with something (dcache only for starters)
> like this (simplified version):
>
> static void dentry_lru_add(struct dentry *dentry)
> {
> if (!dentry->score) {
> struct page *page;
> page = virt_to_head_page(dentry);
> page->slab_score_page += score;
> dentry->score = score();
> }
> }
>
> static void __dentry_lru_del(struct dentry *dentry)
> {
> struct page *page;
> page = virt_to_head_page(dentry);
>
> dentry->d_flags &= ~DCACHE_SHRINK_LIST;
> page->slab_score_page += dentry->score();
> }

slab_score_page only ever increases according to this code.... :/

> score decreases as the time passes. So if a page has a very large
> score, it means that it is likely to have a lot of objects that
> are somewhat old.

And if score decreases over time, then a small score means the
entries on that page are older than those on a page with a large score,
right?

> When we scan, we start from them. If this is a purely kernel driven
> workload, we only delete the objects that are, itself, old. If this is a
> user-driven workload, we try to take down the rest as well. With a
> grey-area in the middle, maybe...
>
> In fact, this may (I am not bold enough to state anything as certain at
> this point!!) free a lot *less* dentries than the normal shrinkers

The question is this: does it free the right ones? If you have a
single hot dentry on a page full of cold ones, freeing that hot
dentry is the wrong thing to do as it will immediately trigger IO to
bring it back into cache. This is the real problem with page-based
slab reclaim....

> would, while still being able to give userspace users what it really
> cares about: pages!

Userspace doesn't care about how the kernel reclaims, tracks or
accounts for memory. All it cares about is having a memory
allocation succeed or fail quickly.

> The big advantage of scanning through pages, is that it becomes trivial
> to either walk it per-zone, or per-memcg, since you can trivially derive
> the set of pages that belong to each of them.

But you really don't have a viable LRU ordering because pages with
mixed hot/cold objects don't look that much different from pages
with all mildly cold objects...

> The determination of user vs kernel driven workloads can also be done
> per-{memcg,zone}. So we can conceivably have a {memcg,zone} that is
> receiving a lot of kernel-objects pressure, and another that is
> receiving a lot of user-driven pressure, due to the different nature of
> their workloads. We would shrink them differently, in this case.

I see that whole concept as fundamentally flawed. Maybe I
misunderstand what you mean by "kernel vs user driven workloads" as
you haven't really defined what you mean, but I still see the idea
of treating kernel vs userspace driven memory allocation differently
as a vector for insanity. And, in most cases, ensuring kernel memory
allocation succeeds is far, far more important than userspace memory
allocation, so I think that even on this basis this is a
non-starter....

> >> Also, if we are seeing pressure from someone requesting user pages, what
> >> good does it make to free, say, 35 Mb of memory, if this means we are
> >> freeing objects across 5k different pages, without actually releasing
> >> any of them? (still is TBD if this is a theoretical problem or a
> >> practical one). It would maybe be better to free objects that are
> >> moderately hot, but are on pages dominated by cold objects...
> >
> > Yup, that's a problem, but now you're asking shrinker
> > implementations to know in great detail the physical locality of
> > object and not just the temporal locality. the node-aware LRU list
> > does this at a coarse level, but to do page based reclaim you need
> > ot track pages in SL*B that contain unreferenced objects as those
> > are the only ones that can be reclaimed.
> >
> Yes, this is the part that I am still struggling with. What I am
> currently trying is to come up with a stable SL*B api that will allow us
> to know which objects are alive in a given page, and how to walk them.
>
> Combined with the score mechanism above, we can define minimum scores
> for an object to be freed - which relates to how long ago it was marked
> - and then free the objects in the page that has a high enough score. If
> pressure is kept, we lower the threshold. Do this n times.
>
> > If you have no pages with unreferenced objects, then you don't make
> > progress and you have to fall back to freeing unreferenced objects
> > from random pages. ANd under most workloads that aren't benchmarks,
> > slab object population ends up with little correlation between
> > physical and temporal locality. Hence this is the norm rather than
> > the exception..
> >
>
> More or less.
>
> Last time I counted, a dentry had 248 bytes. Let's say 256 for
> simplicity. So each page will have around 16 dentries (it is actually
> usually a bit less than that, due to slab metadata).

IIRC, it's 192 bytes. It was carefully constructed to be exactly
3 cachelines in size for x86_64 when you don't have any lock
debugging turned on.


> In a normal shrink, how many objects do you expect to free ? At some
> point, as your scan rate grows, the likelihood of having a bunch of them
> in the same page increases, even if not all objects in that page are
> cold. But still, we can easily scan the objects in a page with a high
> score, and find a bunch of objects that are cold. If it is a

How are you scanning those pages?

> >>> - add new count and scan operations for caches that are
> >>> based on memory used, not object counts
> >>> - allows us to use the same count/scan algorithm for
> >>> calculating how much pressure to put on caches
> >>> with variable size objects.
> >>
> >> IOW, pages.
> >
> > Not necessarily - if we are going to deal with multiple objects in a
> > page as well as multi-page objects, then pages are unable to express
> > the full range of possibilities. We may as well use byte counts at
> > this point, expecially when you consider page sizes differ on
> > different platforms...
> >
>
> bytecounts by themselves doesn't seem that bad to me. It also makes
> sense. But I would still note that for that memory to be useful, it must
> be in consumable units. What is a consumable unit depends on where the
> pressure comes from.
>
> An object is a consumable unit, so if the pressure is coming from the
> dentry cache, it is great to free 4000 consumable units - objects. If it
> is coming from the page cache, it would be better to free as many pages
> as we can, because that is what can be used.

For the page cache, the consumable unit is a page. This is what I've
been trying to say - you're looking at the page cache as pages, not
as consumable units. There's nothing special about the page cache;
it's just full of objects that are a single page in size, and so
generic shrinker algorithms can be applied to it just as easily as
to the dentry cache...



> >>> My care factor mostly ends here, as it will allow XFS to corectly
> >>> balance the metadata buffer cache (variable size objects) against the
> >>> inode, dentry and dquot caches which are object based. The next
> >>> steps that I'm about to give you are based on some discussions with
> >>> some MM people over bottles of red wine, so take it with a grain of
> >>> salt...
> >>>
> >>> - calculate a "pressure" value for each cache controlled by a
> >>> shrinker so that the relative memory pressure between
> >>> caches can be compared. This allows the shrinkers to bias
> >>> reclaim based on where the memory pressure is being
> >>> generated
> >>>
> >>
> >> Ok, if a cache is using a lot of memory, this would indicate it has the
> >> dominant workload, right?
> >
> > Not necessarily. Someone might jus thave run a find across their
> > filesystem, and that is where all the pressure is coming from. In
> > this case, you don't want that memory prssure to toss out all the
> > other caches. I suspect that the "pressure" measure is going to need
> > to take into account cache hit rates to work properly...
> >
> Indeed you are right, memory usage is a very bad estimate. Let me stand
> corrected: memory usage over a sliding time window, or something like
> the such.

/me handwaves about how it might work.

This will need a lot of active research, so coming up with something
that works without heuristics is something that is way beyond what
I'm considering right now.

> >> Should we free from it, or should we free from
> >> the others, so this ones gets the pages it needs?
> >
> > That's the million dollar question...
>
> For god's sake, it's time we stop this bullshit of telling people
> "that's the million dollar question". Those fucktards are printing money
> like crazy, so we need to correct it for inflation, at the very least.
>
> This is the USD 1,289,648.82 question, using Y2K as a baseline.

:)

>
>
> >>> - start grouping shrinkers into a heirarchy, allowing
> >>> related shrinkers (e.g. all the caches in a memcg) to be
> >>> shrunk according resource limits that can be placed on the
> >>> group. i.e. memory pressure is proportioned across
> >>> groups rather than many individual shrinkers.
> >>>
> >> pages are already grouped like that!
> >
> > But shrinkers and slab caches are not.
> >
> > Besides, once you start grouping shrinkers, why should we treat the
> > page LRU list scanning any differently from any other cache that has
> > a shrinker?
> >
>
> Dunno. Should we?

Another USD 1,289,648.82 question ;)

We currently base all our memory reclaim decisions on the contents of
the LRU lists without taking into account the various other caches
in the system. I think we really should be taking into account the
usage of all the caches in the system before deciding exactly what
we should be trying to reclaim and where we reclaim from.

That kind of implies a level playing field for all caches, rather
than the primary (page LRUs) vs secondary (shrinkers) reclaim
architecture we have now. Given that observation, we could treat the
page LRUs as an object-based shrinker implementation....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-11-29 22:09:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [RFC, PATCH 00/19] Numa aware LRU lists and shrinkers

On Thu, Nov 29, 2012 at 11:02:24AM -0800, Andi Kleen wrote:
> Dave Chinner <[email protected]> writes:
> >
> > Comments, thoughts and flames all welcome.
>
> Doing the reclaim per CPU sounds like a big change in the VM balance.

It's per node, not per CPU. And AFAICT, it hasn't changed the
balance of page cache vs inode/dentry caches under general, global
workloads at all.

> Doesn't this invalidate some zone reclaim mode settings?

No, because zone reclaim is per-node and the shrinkers now can
reclaim just from a single node. i.e. the behaviour is now better
suited to the aims of zone reclaim which is to free memory from a
single, targetted node. Indeed, I removed a hack in the zone reclaim
code that sprayed slab reclaim across the entire machine until
sufficient objects had been freed from the target node....

> How did you validate all this?

Fake NUMA setups, various workloads that generate even dentry/slab
cache loadings across all nodes, adding page cache pressure on a
single node, watching slab reclaim from a single node. That sort of
thing.

I haven't really done any performance testing other than "not
obviously slower". There's no point optimising anything before
there's any sort of agreement as to whether this is the right
approach to take or not....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-07 13:37:40

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 17/19] drivers: convert shrinkers to new count/scan API

On Wed, Nov 28, 2012 at 10:14:44AM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Convert the driver shrinkers to the new API. Most changes are
> compile tested only because I either don't have the hardware or it's
> staging stuff.

I presume that the i915, ttm_page_alloc and ttm_page_alloc_dma
were tested by you? They cover the most common graphics drivers.

>
> FWIW, the md and android code is pretty good, but the rest of it
> makes me want to claw my eyes out. The amount of broken code I just
> encountered is mind boggling. I've added comments explaining what
> is broken, but I fear that some of the code would be best dealt with
> by being dragged behind the bike shed, burying in mud up to it's
> neck and then run over repeatedly with a blunt lawn mower.
>
> Special mention goes to the zcache/zcache2 drivers. They can't
> co-exist in the build at the same time, they are under different
> menu options in menuconfig, they only show up when you've got the
> right set of mm subsystem options configured and so even compile
> testing is an exercise in pulling teeth. And that doesn't even take
> into account the horrible, broken code...

Hm, I was under the impression that there is only one zcache code?
Are you referring to ramster perhaps?

>
> Signed-off-by: Dave Chinner <[email protected]>
> ---
> drivers/gpu/drm/i915/i915_dma.c | 4 +-
> drivers/gpu/drm/i915/i915_gem.c | 64 +++++++++++++++++++++-------
> drivers/gpu/drm/ttm/ttm_page_alloc.c | 48 ++++++++++++++-------
> drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 55 +++++++++++++++---------
> drivers/md/dm-bufio.c | 65 +++++++++++++++++++----------
> drivers/staging/android/ashmem.c | 44 ++++++++++++-------
> drivers/staging/android/lowmemorykiller.c | 60 +++++++++++++++++---------
> drivers/staging/ramster/zcache-main.c | 58 ++++++++++++++++++-------
> drivers/staging/zcache/zcache-main.c | 40 ++++++++++--------
> 9 files changed, 297 insertions(+), 141 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_dma.c b/drivers/gpu/drm/i915/i915_dma.c
> index 61ae104..0ddec32 100644
> --- a/drivers/gpu/drm/i915/i915_dma.c
> +++ b/drivers/gpu/drm/i915/i915_dma.c
> @@ -1658,7 +1658,7 @@ int i915_driver_load(struct drm_device *dev, unsigned long flags)
> return 0;
>
> out_gem_unload:
> - if (dev_priv->mm.inactive_shrinker.shrink)
> + if (dev_priv->mm.inactive_shrinker.scan_objects)
> unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>
> if (dev->pdev->msi_enabled)
> @@ -1695,7 +1695,7 @@ int i915_driver_unload(struct drm_device *dev)
>
> i915_teardown_sysfs(dev);
>
> - if (dev_priv->mm.inactive_shrinker.shrink)
> + if (dev_priv->mm.inactive_shrinker.scan_objects)
> unregister_shrinker(&dev_priv->mm.inactive_shrinker);
>
> mutex_lock(&dev->struct_mutex);
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 107f09b..ceab752 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -53,8 +53,10 @@ static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
> struct drm_i915_fence_reg *fence,
> bool enable);
>
> -static int i915_gem_inactive_shrink(struct shrinker *shrinker,
> +static long i915_gem_inactive_count(struct shrinker *shrinker,
> struct shrink_control *sc);
> +static long i915_gem_inactive_scan(struct shrinker *shrinker,
> + struct shrink_control *sc);
> static long i915_gem_purge(struct drm_i915_private *dev_priv, long target);
> static void i915_gem_shrink_all(struct drm_i915_private *dev_priv);
> static void i915_gem_object_truncate(struct drm_i915_gem_object *obj);
> @@ -4197,7 +4199,8 @@ i915_gem_load(struct drm_device *dev)
>
> dev_priv->mm.interruptible = true;
>
> - dev_priv->mm.inactive_shrinker.shrink = i915_gem_inactive_shrink;
> + dev_priv->mm.inactive_shrinker.count_objects = i915_gem_inactive_count;
> + dev_priv->mm.inactive_shrinker.scan_objects = i915_gem_inactive_scan;
> dev_priv->mm.inactive_shrinker.seeks = DEFAULT_SEEKS;
> register_shrinker(&dev_priv->mm.inactive_shrinker);
> }
> @@ -4407,35 +4410,64 @@ void i915_gem_release(struct drm_device *dev, struct drm_file *file)
> spin_unlock(&file_priv->mm.lock);
> }
>
> -static int
> -i915_gem_inactive_shrink(struct shrinker *shrinker, struct shrink_control *sc)
> +/*
> + * XXX: (dchinner) This is one of the worst cases of shrinker abuse I've seen.
> + *
> + * i915_gem_purge() expects a byte count to be passed, and the minimum object
> + * size is PAGE_SIZE. The shrinker doesn't work on bytes - it works on
> + * *objects*. So it passes a nr_to_scan of 128 objects, which is interpreted
> + * here to mean "free 128 bytes". That means a single object will be freed, as
> + * the minimum object size is a page.
> + *
> + * But the craziest part comes when i915_gem_purge() has walked all the objects
> + * and can't free any memory. That results in i915_gem_shrink_all() being
> + * called, which idles the GPU and frees everything the driver has in it's
> + * active and inactive lists. It's basically hitting the driver with a great big
> + * hammer because it was busy doing stuff when something else generated memory
> + * pressure. This doesn't seem particularly wise...
> + */
> +static long
> +i915_gem_inactive_scan(struct shrinker *shrinker, struct shrink_control *sc)
> {
> struct drm_i915_private *dev_priv =
> container_of(shrinker,
> struct drm_i915_private,
> mm.inactive_shrinker);
> struct drm_device *dev = dev_priv->dev;
> - struct drm_i915_gem_object *obj;
> - int nr_to_scan = sc->nr_to_scan;
> - int cnt;
> + long freed = 0;
>
> if (!mutex_trylock(&dev->struct_mutex))
> return 0;
>
> - if (nr_to_scan) {
> - nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
> - if (nr_to_scan > 0)
> - i915_gem_shrink_all(dev_priv);
> - }
> + freed = i915_gem_purge(dev_priv, sc->nr_to_scan);
> + if (freed < sc->nr_to_scan)
> + i915_gem_shrink_all(dev_priv);
> +
> + mutex_unlock(&dev->struct_mutex);
> + return freed;
> +}
> +
> +static long
> +i915_gem_inactive_count(struct shrinker *shrinker, struct shrink_control *sc)
> +{
> + struct drm_i915_private *dev_priv =
> + container_of(shrinker,
> + struct drm_i915_private,
> + mm.inactive_shrinker);
> + struct drm_device *dev = dev_priv->dev;
> + struct drm_i915_gem_object *obj;
> + long count = 0;
> +
> + if (!mutex_trylock(&dev->struct_mutex))
> + return 0;
>
> - cnt = 0;
> list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)
> if (obj->pages_pin_count == 0)
> - cnt += obj->base.size >> PAGE_SHIFT;
> + count += obj->base.size >> PAGE_SHIFT;
> list_for_each_entry(obj, &dev_priv->mm.bound_list, gtt_list)
> if (obj->pin_count == 0 && obj->pages_pin_count == 0)
> - cnt += obj->base.size >> PAGE_SHIFT;
> + count += obj->base.size >> PAGE_SHIFT;
>
> mutex_unlock(&dev->struct_mutex);
> - return cnt;
> + return count;
> }
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> index bd2a3b4..83058a2 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
> @@ -377,28 +377,28 @@ out:
> return nr_free;
> }
>
> -/* Get good estimation how many pages are free in pools */
> -static int ttm_pool_get_num_unused_pages(void)
> -{
> - unsigned i;
> - int total = 0;
> - for (i = 0; i < NUM_POOLS; ++i)
> - total += _manager->pools[i].npages;
> -
> - return total;
> -}
> -
> /**
> * Callback for mm to request pool to reduce number of page held.
> + *
> + * XXX: (dchinner) Deadlock warning!
> + *
> + * ttm_page_pool_free() does memory allocation using GFP_KERNEL. that means
> + * this can deadlock when called a sc->gfp_mask that is not equal to
> + * GFP_KERNEL.
> + *
> + * This code is crying out for a shrinker per pool....
> */
> -static int ttm_pool_mm_shrink(struct shrinker *shrink,
> - struct shrink_control *sc)
> +static long
> +ttm_pool_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Please don't change the style.
> {
> static atomic_t start_pool = ATOMIC_INIT(0);
> unsigned i;
> unsigned pool_offset = atomic_add_return(1, &start_pool);
> struct ttm_page_pool *pool;
> int shrink_pages = sc->nr_to_scan;
> + long freed = 0;
>
> pool_offset = pool_offset % NUM_POOLS;
> /* select start pool in round robin fashion */
> @@ -408,14 +408,30 @@ static int ttm_pool_mm_shrink(struct shrinker *shrink,
> break;
> pool = &_manager->pools[(i + pool_offset)%NUM_POOLS];
> shrink_pages = ttm_page_pool_free(pool, nr_free);
> + freed += nr_free - shrink_pages;
> }
> - /* return estimated number of unused pages in pool */
> - return ttm_pool_get_num_unused_pages();
> + return freed;
> +}
> +
> +
> +static long
> +ttm_pool_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Please don't change the style. Why did you move it down?

> +{
> + unsigned i;
> + long count = 0;
> +
> + for (i = 0; i < NUM_POOLS; ++i)
> + count += _manager->pools[i].npages;
> +
> + return count;
> }
>
> static void ttm_pool_mm_shrink_init(struct ttm_pool_manager *manager)
> {
> - manager->mm_shrink.shrink = &ttm_pool_mm_shrink;
> + manager->mm_shrink.count_objects = &ttm_pool_shrink_count;
> + manager->mm_shrink.scan_objects = &ttm_pool_shrink_scan;
> manager->mm_shrink.seeks = 1;
> register_shrinker(&manager->mm_shrink);
> }
> diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> index b8b3943..b3b4f99 100644
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -918,19 +918,6 @@ int ttm_dma_populate(struct ttm_dma_tt *ttm_dma, struct device *dev)
> }
> EXPORT_SYMBOL_GPL(ttm_dma_populate);
>
> -/* Get good estimation how many pages are free in pools */
> -static int ttm_dma_pool_get_num_unused_pages(void)
> -{
> - struct device_pools *p;
> - unsigned total = 0;
> -
> - mutex_lock(&_manager->lock);
> - list_for_each_entry(p, &_manager->pools, pools)
> - total += p->pool->npages_free;
> - mutex_unlock(&_manager->lock);
> - return total;
> -}
> -
> /* Put all pages in pages list to correct pool to wait for reuse */
> void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
> {
> @@ -1002,18 +989,31 @@ EXPORT_SYMBOL_GPL(ttm_dma_unpopulate);
>
> /**
> * Callback for mm to request pool to reduce number of page held.
> + *
> + * XXX: (dchinner) Deadlock warning!
> + *
> + * ttm_dma_page_pool_free() does GFP_KERNEL memory allocation, and so attention
> + * needs to be paid to sc->gfp_mask to determine if this can be done or not.
> + * GFP_KERNEL memory allocation in a GFP_ATOMIC reclaim context woul dbe really

would be
> + * bad.

It (ttm_dma_page_pool_free) could use GFP_ATOMIC, as the allocation is
just for an array of pages (so that we can iterate over all of them
in the list and find the candidates). At the end of
ttm_dma_page_pool_free it ends up freeing that array again.

The same treatment can be applied to the ttm_page_pool_free.
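
Roughly something like this is what I have in mind - only a sketch,
with the function name and structure invented for illustration (the
real free path obviously does more than this):

static unsigned ttm_pool_free_some(struct ttm_page_pool *pool,
                                   unsigned nr_free)
{
        struct page **pages_to_free;
        unsigned freed = 0;

        /*
         * GFP_ATOMIC here means the shrinker never recurses back into
         * reclaim just to build its scratch array of page pointers.
         */
        pages_to_free = kmalloc(nr_free * sizeof(struct page *),
                                GFP_ATOMIC | __GFP_NOWARN);
        if (!pages_to_free)
                return 0;       /* allocation failed: report nothing freed */

        /* ... walk the pool and collect up to nr_free candidate pages ... */

        kfree(pages_to_free);
        return freed;
}
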
> + *
> + * I'm getting sadder as I hear more pathetical whimpers about needing per-pool
> + * shrinkers

I am not entirely clear on how this comment is useful in the code.
> */
> -static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
> - struct shrink_control *sc)
> +static long
> +ttm_dma_pool_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> {
> static atomic_t start_pool = ATOMIC_INIT(0);
> unsigned idx = 0;
> unsigned pool_offset = atomic_add_return(1, &start_pool);
> unsigned shrink_pages = sc->nr_to_scan;
> struct device_pools *p;
> + long freed = 0;
>
> if (list_empty(&_manager->pools))
> - return 0;
> + return -1;

Could there be a set of #defines for these values?
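
For instance (the names below are purely illustrative, not part of the
posted API):

/* scan_objects() return values instead of open-coded 0 and -1 */
#define SHRINK_SCAN_NONE        0L      /* nothing freed this pass */
#define SHRINK_SCAN_ABORT       (-1L)   /* cannot make progress, do not retry */

        if (list_empty(&_manager->pools))
                return SHRINK_SCAN_ABORT;
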
>
> mutex_lock(&_manager->lock);
> pool_offset = pool_offset % _manager->npools;
> @@ -1029,18 +1029,35 @@ static int ttm_dma_pool_mm_shrink(struct shrinker *shrink,
> continue;
> nr_free = shrink_pages;
> shrink_pages = ttm_dma_page_pool_free(p->pool, nr_free);
> + freed += nr_free - shrink_pages;
> +
> pr_debug("%s: (%s:%d) Asked to shrink %d, have %d more to go\n",
> p->pool->dev_name, p->pool->name, current->pid,
> nr_free, shrink_pages);
> }
> mutex_unlock(&_manager->lock);
> - /* return estimated number of unused pages in pool */
> - return ttm_dma_pool_get_num_unused_pages();
> + return freed;
> +}
> +
> +static long
> +ttm_dma_pool_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Again, please don't change the style. Also, why move the code?

> +{
> + struct device_pools *p;
> + long count = 0;
> +
> + mutex_lock(&_manager->lock);
> + list_for_each_entry(p, &_manager->pools, pools)
> + count += p->pool->npages_free;
> + mutex_unlock(&_manager->lock);
> + return count;
> }
>
> static void ttm_dma_pool_mm_shrink_init(struct ttm_pool_manager *manager)
> {
> - manager->mm_shrink.shrink = &ttm_dma_pool_mm_shrink;
> + manager->mm_shrink.count_objects = &ttm_dma_pool_shrink_count;
> + manager->mm_shrink.scan_objects = &ttm_dma_pool_shrink_scan;
> manager->mm_shrink.seeks = 1;
> register_shrinker(&manager->mm_shrink);
> }
> diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
> index 651ca79..0898bf5 100644
> --- a/drivers/md/dm-bufio.c
> +++ b/drivers/md/dm-bufio.c
> @@ -1359,62 +1359,80 @@ static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
> unsigned long max_jiffies)
> {
> if (jiffies - b->last_accessed < max_jiffies)
> - return 1;
> + return 0;
>
> if (!(gfp & __GFP_IO)) {
> if (test_bit(B_READING, &b->state) ||
> test_bit(B_WRITING, &b->state) ||
> test_bit(B_DIRTY, &b->state))
> - return 1;
> + return 0;
> }
>
> if (b->hold_count)
> - return 1;
> + return 0;
>
> __make_buffer_clean(b);
> __unlink_buffer(b);
> __free_buffer_wake(b);
>
> - return 0;
> + return 1;
> }
>
> -static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> - struct shrink_control *sc)
> +static long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
> + gfp_t gfp_mask)
> {
> int l;
> struct dm_buffer *b, *tmp;
> + long freed = 0;
>
> for (l = 0; l < LIST_SIZE; l++) {
> - list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
> - if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
> - !--nr_to_scan)
> - return;
> + list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
> + freed += __cleanup_old_buffer(b, gfp_mask, 0);
> + if (!--nr_to_scan)
> + break;
> + }
> dm_bufio_cond_resched();
> }
> + return freed;
> }
>
> -static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
> +static long
> +dm_bufio_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> {
> struct dm_bufio_client *c =
> - container_of(shrinker, struct dm_bufio_client, shrinker);
> - unsigned long r;
> - unsigned long nr_to_scan = sc->nr_to_scan;
> + container_of(shrink, struct dm_bufio_client, shrinker);
> + long freed;
>
> if (sc->gfp_mask & __GFP_IO)
> dm_bufio_lock(c);
> else if (!dm_bufio_trylock(c))
> - return !nr_to_scan ? 0 : -1;
> + return -1;
>
> - if (nr_to_scan)
> - __scan(c, nr_to_scan, sc);
> + freed = __scan(c, sc->nr_to_scan, sc->gfp_mask);
> + dm_bufio_unlock(c);
> + return freed;
> +}
>
> - r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
> - if (r > INT_MAX)
> - r = INT_MAX;
> +static long
> +dm_bufio_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> +{
> + struct dm_bufio_client *c =
> + container_of(shrink, struct dm_bufio_client, shrinker);
> + long count;
> +
> + if (sc->gfp_mask & __GFP_IO)
> + dm_bufio_lock(c);
> + else if (!dm_bufio_trylock(c))
> + return 0;
>
> + count = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
> dm_bufio_unlock(c);
> + return count;
>
> - return r;
> }
>
> /*
> @@ -1516,7 +1534,8 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
> __cache_size_refresh();
> mutex_unlock(&dm_bufio_clients_lock);
>
> - c->shrinker.shrink = shrink;
> + c->shrinker.count_objects = dm_bufio_shrink_count;
> + c->shrinker.scan_objects = dm_bufio_shrink_scan;
> c->shrinker.seeks = 1;
> c->shrinker.batch = 0;
> register_shrinker(&c->shrinker);
> @@ -1603,7 +1622,7 @@ static void cleanup_old_buffers(void)
> struct dm_buffer *b;
> b = list_entry(c->lru[LIST_CLEAN].prev,
> struct dm_buffer, lru_list);
> - if (__cleanup_old_buffer(b, 0, max_age * HZ))
> + if (!__cleanup_old_buffer(b, 0, max_age * HZ))
> break;
> dm_bufio_cond_resched();
> }
> diff --git a/drivers/staging/android/ashmem.c b/drivers/staging/android/ashmem.c
> index 634b9ae..30f9f8e 100644
> --- a/drivers/staging/android/ashmem.c
> +++ b/drivers/staging/android/ashmem.c
> @@ -341,27 +341,28 @@ out:
> /*
> * ashmem_shrink - our cache shrinker, called from mm/vmscan.c :: shrink_slab
> *
> - * 'nr_to_scan' is the number of objects (pages) to prune, or 0 to query how
> - * many objects (pages) we have in total.
> + * 'nr_to_scan' is the number of objects to scan for freeing.
> *
> * 'gfp_mask' is the mask of the allocation that got us into this mess.
> *
> - * Return value is the number of objects (pages) remaining, or -1 if we cannot
> + * Return value is the number of objects freed or -1 if we cannot
> * proceed without risk of deadlock (due to gfp_mask).
> *
> * We approximate LRU via least-recently-unpinned, jettisoning unpinned partial
> * chunks of ashmem regions LRU-wise one-at-a-time until we hit 'nr_to_scan'
> * pages freed.
> */
> -static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
> +static long
> +ashmem_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> {
> struct ashmem_range *range, *next;
> + long freed = 0;
>
> /* We might recurse into filesystem code, so bail out if necessary */
> - if (sc->nr_to_scan && !(sc->gfp_mask & __GFP_FS))
> + if (!(sc->gfp_mask & __GFP_FS))
> return -1;
> - if (!sc->nr_to_scan)
> - return lru_count;
>
> mutex_lock(&ashmem_mutex);
> list_for_each_entry_safe(range, next, &ashmem_lru_list, lru) {
> @@ -374,17 +375,34 @@ static int ashmem_shrink(struct shrinker *s, struct shrink_control *sc)
> range->purged = ASHMEM_WAS_PURGED;
> lru_del(range);
>
> - sc->nr_to_scan -= range_size(range);
> - if (sc->nr_to_scan <= 0)
> + freed += range_size(range);
> + if (--sc->nr_to_scan <= 0)
> break;
> }
> mutex_unlock(&ashmem_mutex);
> + return freed;
> +}
>
> +static long
> +ashmem_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> +{
> + /*
> + * note that lru_count is count of pages on the lru, not a count of
> + * objects on the list. This means the scan function needs to return the
> + * number of pages freed, not the number of objects scanned.
> + */
> return lru_count;
> }
>
> static struct shrinker ashmem_shrinker = {
> - .shrink = ashmem_shrink,
> + .count_objects = ashmem_shrink_count,
> + .scan_objects = ashmem_shrink_scan,
> + /*
> + * XXX (dchinner): I wish people would comment on why they need on
> + * significant changes to the default value here
> + */
> .seeks = DEFAULT_SEEKS * 4,
> };
>
> @@ -671,11 +689,9 @@ static long ashmem_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> if (capable(CAP_SYS_ADMIN)) {
> struct shrink_control sc = {
> .gfp_mask = GFP_KERNEL,
> - .nr_to_scan = 0,
> + .nr_to_scan = LONG_MAX,
> };
> - ret = ashmem_shrink(&ashmem_shrinker, &sc);
> - sc.nr_to_scan = ret;
> - ashmem_shrink(&ashmem_shrinker, &sc);
> + ashmem_shrink_scan(&ashmem_shrinker, &sc);
> }
> break;
> }
> diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
> index b91e4bc..2bf2c2f 100644
> --- a/drivers/staging/android/lowmemorykiller.c
> +++ b/drivers/staging/android/lowmemorykiller.c
> @@ -63,11 +63,19 @@ static unsigned long lowmem_deathpending_timeout;
> printk(x); \
> } while (0)
>
> -static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
> +/*
> + * XXX (dchinner): this should all be using longs, not ints, as
> + * functions like global_page_state, get_mm_rss, etc all return longs or
> + * unsigned longs. Even the shrinker now uses longs....
> + */
> +static long
> +lowmem_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> {
> struct task_struct *tsk;
> struct task_struct *selected = NULL;
> - int rem = 0;
> + long freed = 0;
> int tasksize;
> int i;
> int min_score_adj = OOM_SCORE_ADJ_MAX + 1;
> @@ -89,19 +97,17 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
> break;
> }
> }
> - if (sc->nr_to_scan > 0)
> - lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
> - sc->nr_to_scan, sc->gfp_mask, other_free,
> - other_file, min_score_adj);
> - rem = global_page_state(NR_ACTIVE_ANON) +
> - global_page_state(NR_ACTIVE_FILE) +
> - global_page_state(NR_INACTIVE_ANON) +
> - global_page_state(NR_INACTIVE_FILE);
> - if (sc->nr_to_scan <= 0 || min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> - lowmem_print(5, "lowmem_shrink %lu, %x, return %d\n",
> - sc->nr_to_scan, sc->gfp_mask, rem);
> - return rem;
> + lowmem_print(3, "lowmem_shrink %lu, %x, ofree %d %d, ma %d\n",
> + sc->nr_to_scan, sc->gfp_mask, other_free,
> + other_file, min_score_adj);
> +
> + if (min_score_adj == OOM_SCORE_ADJ_MAX + 1) {
> + /* nothing to do, no point in calling again */
> + lowmem_print(5, "lowmem_shrink %lu, %x, return -1\n",
> + sc->nr_to_scan, sc->gfp_mask);
> + return -1;
> }
> +
> selected_oom_score_adj = min_score_adj;
>
> rcu_read_lock();
> @@ -151,16 +157,32 @@ static int lowmem_shrink(struct shrinker *s, struct shrink_control *sc)
> lowmem_deathpending_timeout = jiffies + HZ;
> send_sig(SIGKILL, selected, 0);
> set_tsk_thread_flag(selected, TIF_MEMDIE);
> - rem -= selected_tasksize;
> + freed += selected_tasksize;
> }
> - lowmem_print(4, "lowmem_shrink %lu, %x, return %d\n",
> - sc->nr_to_scan, sc->gfp_mask, rem);
> + lowmem_print(4, "lowmem_shrink %lu, %x, return %ld\n",
> + sc->nr_to_scan, sc->gfp_mask, freed);
> rcu_read_unlock();
> - return rem;
> + return freed;
> +}
> +
> +static long
> +lowmem_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> +{
> + long count;
> +
> + count = global_page_state(NR_ACTIVE_ANON) +
> + global_page_state(NR_ACTIVE_FILE) +
> + global_page_state(NR_INACTIVE_ANON) +
> + global_page_state(NR_INACTIVE_FILE);
> + return count;
> }
>
> static struct shrinker lowmem_shrinker = {
> - .shrink = lowmem_shrink,
> + .count_objects = lowmem_shrink_count,
> + .scan_objects = lowmem_shrink_scan,
> + /* why can't we document magic numbers? */
> .seeks = DEFAULT_SEEKS * 16
> };
>
> diff --git a/drivers/staging/ramster/zcache-main.c b/drivers/staging/ramster/zcache-main.c
> index a09dd5c..7d50688 100644
> --- a/drivers/staging/ramster/zcache-main.c
> +++ b/drivers/staging/ramster/zcache-main.c
> @@ -1054,12 +1054,13 @@ static bool zcache_freeze;
> * used by zcache to approximately the same as the total number of LRU_FILE
> * pageframes in use.
> */
> -static int shrink_zcache_memory(struct shrinker *shrink,
> - struct shrink_control *sc)
> +static long
> +zcache_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Why the style changes? If the code has a certain style, there is no need
to alter just one function in it to be at odds with the rest.
> {
> static bool in_progress;
> - int ret = -1;
> - int nr = sc->nr_to_scan;
> + long freed = 0;
> int nr_evict = 0;
> int nr_unuse = 0;
> struct page *page;
> @@ -1067,15 +1068,23 @@ static int shrink_zcache_memory(struct shrinker *shrink,
> int unuse_ret;
> #endif
>
> - if (nr <= 0)
> - goto skip_evict;
> + /*
> + * XXX (dchinner): My kingdom for a mutex! There's no way this should
> + * ever be allowed to move out of staging until it supports concurrent
> + * shrinkers correctly.
> + *
> + * This whole shrinker is making me want to claw my eyes out. It has no
> + * redeeming values whatsoever and I can't undo the damage it has
> + * already done to my brain.

I am sad to hear that your brain has been damaged, but I fear that part
of working for open source is that there is no warranty and it can be
hazardous to your health.

How would you add redeeming values here? Perhaps that should
be added to the TODO file in the root directory of the driver. That is
the usual policy that Greg wants.

Is it that you would like to insert a mutex here? The underlying code
(zcache_evict_eph_pageframe) ends up taking a spinlock.
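
If a mutex is what you would like, I imagine something along these
lines - only a sketch, with zcache_do_evictions() standing in for the
existing eviction loop:

static DEFINE_MUTEX(zcache_shrink_mutex);

static long zcache_shrink_scan(struct shrinker *shrink,
                               struct shrink_control *sc)
{
        long freed;

        /* allow only one eviction thread at a time, without blocking reclaim */
        if (!mutex_trylock(&zcache_shrink_mutex))
                return -1;

        freed = zcache_do_evictions(sc->nr_to_scan);

        mutex_unlock(&zcache_shrink_mutex);
        return freed;
}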

> + */
>
> /* don't allow more than one eviction thread at a time */
> if (in_progress)
> - goto skip_evict;
> + return -1;
>
> in_progress = true;
>
> +

Hm..
> /* we are going to ignore nr, and target a different value */
> zcache_last_active_file_pageframes =
> global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
> @@ -1083,11 +1092,13 @@ static int shrink_zcache_memory(struct shrinker *shrink,
> global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
> nr_evict = zcache_eph_pageframes - zcache_last_active_file_pageframes +
> zcache_last_inactive_file_pageframes;
> +

Hmmm?
> while (nr_evict-- > 0) {
> page = zcache_evict_eph_pageframe();
> if (page == NULL)
> break;
> zcache_free_page(page);
> + freed++;
> }
>
> zcache_last_active_anon_pageframes =
> @@ -1104,25 +1115,42 @@ static int shrink_zcache_memory(struct shrinker *shrink,
> unuse_ret = zcache_frontswap_unuse();
> if (unuse_ret == -ENOMEM)
> break;
> + freed++;
> }
> #endif
> in_progress = false;
> + return freed;
> +}
> +
> +
> +/*
> + * XXX (dchinner): the shrinker updates global variables? You've got to be
> + * kidding me! And the object count can (apparently) go negative - that's
> + * *always* a bug so be bloody noisy about it.

The reason for this is that the ramster code shares code with zcache.
The mechanism for finding out whether a page has been freed or not
currently goes via these counters - which are actually ssize_t, so they
are a bit silly.

Perhaps you could expand the comment to mention a better mechanism for
keeping that value non-global and atomic?
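
As a strawman for what such a comment could suggest (every name below
is invented; this is not what the code does today): keep one atomic
counter that is bumped on every ephemeral pageframe alloc/free, and
have the count callback simply read it.

static atomic_long_t zcache_eph_pageframe_count = ATOMIC_LONG_INIT(0);

/* called wherever an ephemeral pageframe is allocated or freed */
static inline void zcache_eph_page_added(void)
{
        atomic_long_inc(&zcache_eph_pageframe_count);
}

static inline void zcache_eph_page_freed(void)
{
        atomic_long_dec(&zcache_eph_pageframe_count);
}

static long zcache_shrink_count(struct shrinker *shrink,
                                struct shrink_control *sc)
{
        long count = atomic_long_read(&zcache_eph_pageframe_count);

        WARN_ON_ONCE(count < 0);        /* going negative is an accounting bug */
        return count < 0 ? 0 : count;
}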

> + */
> +static long
> +zcache_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Please, no style guide changes.

> +{
> + long count;
>
> -skip_evict:
> - /* resample: has changed, but maybe not all the way yet */
> zcache_last_active_file_pageframes =
> global_page_state(NR_LRU_BASE + LRU_ACTIVE_FILE);
> zcache_last_inactive_file_pageframes =
> global_page_state(NR_LRU_BASE + LRU_INACTIVE_FILE);
> - ret = zcache_eph_pageframes - zcache_last_active_file_pageframes +
> - zcache_last_inactive_file_pageframes;
> - if (ret < 0)
> - ret = 0;
> - return ret;
> +
> + count = zcache_last_active_file_pageframes +
> + zcache_last_inactive_file_pageframes +
> + zcache_eph_pageframes;

Why not just use the 'free' value?
> +
> + WARN_ON_ONCE(count < 0);
> + return count < 0 ? 0 : count;
> }
>
> static struct shrinker zcache_shrinker = {
> - .shrink = shrink_zcache_memory,
> + .count_objects = zcache_shrink_count,
> + .scan_objects = zcache_shrink_scan,
> .seeks = DEFAULT_SEEKS,
> };
>
> diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
> index 52b43b7..d17ab5d 100644
> --- a/drivers/staging/zcache/zcache-main.c
> +++ b/drivers/staging/zcache/zcache-main.c

What Linux version is this based on? I am not seeing this in v3.10-rc4?

> @@ -536,10 +536,11 @@ static void zbud_evict_zbpg(struct zbud_page *zbpg)
> * page in use by another cpu, but also to avoid potential deadlock due to
> * lock inversion.
> */
> -static void zbud_evict_pages(int nr)
> +static long zbud_evict_pages(int nr)
> {
> struct zbud_page *zbpg;
> int i;
> + long freed = 0;
>
> /* first try freeing any pages on unused list */
> retry_unused_list:
> @@ -554,6 +555,7 @@ retry_unused_list:
> spin_unlock_bh(&zbpg_unused_list_spinlock);
> zcache_free_page(zbpg);
> zcache_evicted_raw_pages++;
> + freed++;
> if (--nr <= 0)
> goto out;
> goto retry_unused_list;
> @@ -578,6 +580,7 @@ retry_unbud_list_i:
> /* want budlists unlocked when doing zbpg eviction */
> zbud_evict_zbpg(zbpg);
> local_bh_enable();
> + freed++;
> if (--nr <= 0)
> goto out;
> goto retry_unbud_list_i;
> @@ -602,13 +605,14 @@ retry_bud_list:
> /* want budlists unlocked when doing zbpg eviction */
> zbud_evict_zbpg(zbpg);
> local_bh_enable();
> + freed++;
> if (--nr <= 0)
> goto out;
> goto retry_bud_list;
> }
> spin_unlock_bh(&zbud_budlists_spinlock);
> out:
> - return;
> + return freed;
> }
>
> static void __init zbud_init(void)
> @@ -1527,26 +1531,28 @@ static bool zcache_freeze;
> /*
> * zcache shrinker interface (only useful for ephemeral pages, so zbud only)
> */
> -static int shrink_zcache_memory(struct shrinker *shrink,
> - struct shrink_control *sc)
> +static long
> +zcache_shrink_scan(
> + struct shrinker *shrink,
> + struct shrink_control *sc)

Please, no style guide changes.
> {
> - int ret = -1;
> - int nr = sc->nr_to_scan;
> - gfp_t gfp_mask = sc->gfp_mask;
> + if (!(sc->gfp_mask & __GFP_FS))
> + return -1;
>
> - if (nr >= 0) {
> - if (!(gfp_mask & __GFP_FS))
> - /* does this case really need to be skipped? */
> - goto out;
> - zbud_evict_pages(nr);
> - }
> - ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
> -out:
> - return ret;
> + return zbud_evict_pages(sc->nr_to_scan);
> +}
> +
> +static long
> +zcache_shrink_count(
> + struct shrinker *shrink,
> + struct shrink_control *sc)
> +{
> + return (long)atomic_read(&zcache_zbud_curr_raw_pages);
> }
>
> static struct shrinker zcache_shrinker = {
> - .shrink = shrink_zcache_memory,
> + .count_objects = zcache_shrink_count,
> + .scan_objects = zcache_shrink_scan,
> .seeks = DEFAULT_SEEKS,
> };
>
> --
> 1.7.10
>