2014-10-24 10:38:05

by Vladimir Davydov

Subject: [PATCH -mm v2 0/9] Per memcg slab shrinkers

Hi,

Kmem accounting of memcg is currently unusable, because it lacks slab
shrinker support. That means that when we hit the limit, we get ENOMEM
without any chance to recover. What we should do then is call
shrink_slab, which would reclaim old inode/dentry caches from this
cgroup. This is what this patch set is intended to do.

Basically, it does two things. First, it introduces the notion of a
per-memcg slab shrinker. A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE. It will then be
passed the memory cgroup to scan from in shrink_control->memcg. For such
shrinkers, shrink_slab iterates over the whole cgroup subtree under the
target cgroup and calls the shrinker for each kmem-active memory cgroup.
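
For illustration only (not part of the patches themselves), a memcg-aware
shrinker built on top of list_lru ends up looking roughly like the sketch
below; my_lru, my_isolate and the dispose step are placeholders, while the
shrink_control fields and the flags come from patches 1, 3 and 8:

    static unsigned long my_shrink_count(struct shrinker *shrink,
                                         struct shrink_control *sc)
    {
            /* sc->nid and sc->memcg select which list to count */
            return list_lru_shrink_count(&my_lru, sc);
    }

    static unsigned long my_shrink_scan(struct shrinker *shrink,
                                        struct shrink_control *sc)
    {
            LIST_HEAD(dispose);
            unsigned long freed;

            /* my_isolate moves reclaimable items onto the dispose list */
            freed = list_lru_shrink_walk(&my_lru, sc, my_isolate, &dispose);
            /* free whatever was collected on the dispose list here */
            return freed;
    }

    static struct shrinker my_shrinker = {
            .count_objects  = my_shrink_count,
            .scan_objects   = my_shrink_scan,
            .seeks          = DEFAULT_SEEKS,
            .flags          = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
    };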

Second, this patch set makes the list_lru structure per-memcg. It's
done transparently to list_lru users - all they have to do is tell
list_lru_init that they want a memcg-aware list_lru. The list_lru will
then automatically distribute objects among per-memcg lists based on
which cgroup each object is accounted to. This way, to make FS shrinkers
(icache, dcache) memcg aware, we only need to make them use a
memcg-aware list_lru, and that is what this patch set does.
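
For example, making a cache's shrinker reclaim per cgroup boils down to how
its lru is initialized (my_lru and obj are placeholders here); the add/del
paths stay untouched:

    /* initialization: ask for per-memcg lists (see patch 8) */
    if (list_lru_init_memcg(&my_lru))        /* instead of list_lru_init() */
            goto fail;

    /* users are unchanged: the lru looks up the memcg the object is
     * accounted to and puts it on the corresponding per-memcg list */
    list_lru_add(&my_lru, &obj->lru);
    list_lru_del(&my_lru, &obj->lru);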

As before, this patch set only enables per-memcg kmem reclaim when the
pressure comes from memory.limit, not from memory.kmem.limit. Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations; it
will probably require a kind of soft limit to work properly. I'm leaving
this for future work.

The main difference in v2 is that I got rid of reparenting of list_lrus.
Thanks to Johannes' and Tejun's efforts, mem_cgroup_iter now iterates
over all memory cgroups, including dead ones, so dangling offline cgroups
whose css is pinned by kmem allocations are no longer a problem - they
will eventually be reaped by memory pressure. This also allowed me to
simplify the patch set significantly. Nevertheless, if we want
reparenting one day, it won't cause any problems to implement it on top.
Other differences in v2 include:

- Rebased on top of v3.18-rc1-mmotm-2014-10-23-16-26
- Simplified handling of the list of all list_lrus
- Improved comments and function names
- Fixed a leak on the list_lru_init error path
- Fixed a list_lru_destroy crash on an uninitialized zeroed object

v1 can be found at https://lkml.org/lkml/2014/9/21/64

The patch set is organized as follows:

- Patches 1-3 implement the per-memcg shrinker core, with patches 1 and 2
preparing list_lru users for the upcoming changes and patch 3 tuning
shrink_slab.

- Patches 4 and 5 clean up handling of the max memcg_cache_id in the memcg
core.

- Patch 6 gets rid of the useless list_lru->active_nodes, and patch 7
links all list_lrus to a list, which is required by memcg.

- Patch 8 adds per-memcg lrus to the list_lru structure, and finally
patch 9 marks fs shrinkers as memcg aware.

Reviews are more than welcome.

Thanks,

Vladimir Davydov (9):
list_lru: introduce list_lru_shrink_{count,walk}
fs: consolidate {nr,free}_cached_objects args in shrink_control
vmscan: shrink slab on memcg pressure
memcg: rename some cache id related variables
memcg: add rwsem to sync against memcg_caches arrays relocation
list_lru: get rid of ->active_nodes
list_lru: organize all list_lrus to list
list_lru: introduce per-memcg lists
fs: make shrinker memcg aware

fs/dcache.c | 14 +-
fs/gfs2/quota.c | 6 +-
fs/inode.c | 7 +-
fs/internal.h | 7 +-
fs/super.c | 44 +++---
fs/xfs/xfs_buf.c | 7 +-
fs/xfs/xfs_qm.c | 7 +-
fs/xfs/xfs_super.c | 7 +-
include/linux/fs.h | 6 +-
include/linux/list_lru.h | 72 +++++++--
include/linux/memcontrol.h | 53 ++++++-
include/linux/shrinker.h | 10 +-
mm/list_lru.c | 361 ++++++++++++++++++++++++++++++++++++++++----
mm/memcontrol.c | 117 +++++++++++---
mm/slab_common.c | 14 +-
mm/vmscan.c | 87 ++++++++---
mm/workingset.c | 6 +-
17 files changed, 678 insertions(+), 147 deletions(-)

--
1.7.10.4


2014-10-24 10:37:56

by Vladimir Davydov

Subject: [PATCH -mm v2 1/9] list_lru: introduce list_lru_shrink_{count,walk}

NUMA-aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists. Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues calls like this:
    count = list_lru_count_node(lru, sc->nid);
    freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                               isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
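
With the new helpers, the snippet above boils down to something like this
(same isolate_func and isolate_arg):

    count = list_lru_shrink_count(lru, sc);
    freed = list_lru_shrink_walk(lru, sc, isolate_func, isolate_arg);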

Suggested-by: Dave Chinner <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/dcache.c | 14 ++++++--------
fs/gfs2/quota.c | 6 +++---
fs/inode.c | 7 +++----
fs/internal.h | 7 +++----
fs/super.c | 24 +++++++++++-------------
fs/xfs/xfs_buf.c | 7 +++----
fs/xfs/xfs_qm.c | 7 +++----
include/linux/list_lru.h | 16 ++++++++++++++++
mm/workingset.c | 6 +++---
9 files changed, 51 insertions(+), 43 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index d5a23fd0da90..670c925d7c87 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -930,24 +930,22 @@ dentry_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
/**
* prune_dcache_sb - shrink the dcache
* @sb: superblock
- * @nr_to_scan : number of entries to try to free
- * @nid: which node to scan for freeable entities
+ * @sc: shrink control, passed to list_lru_shrink_walk()
*
- * Attempt to shrink the superblock dcache LRU by @nr_to_scan entries. This is
- * done when we need more memory an called from the superblock shrinker
+ * Attempt to shrink the superblock dcache LRU by @sc->nr_to_scan entries. This
+ * is done when we need more memory and called from the superblock shrinker
* function.
*
* This function may fail to free any resources if all the dentries are in
* use.
*/
-long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
- int nid)
+long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc)
{
LIST_HEAD(dispose);
long freed;

- freed = list_lru_walk_node(&sb->s_dentry_lru, nid, dentry_lru_isolate,
- &dispose, &nr_to_scan);
+ freed = list_lru_shrink_walk(&sb->s_dentry_lru, sc,
+ dentry_lru_isolate, &dispose);
shrink_dentry_list(&dispose);
return freed;
}
diff --git a/fs/gfs2/quota.c b/fs/gfs2/quota.c
index 64b29f7f6b4c..6292d79fc340 100644
--- a/fs/gfs2/quota.c
+++ b/fs/gfs2/quota.c
@@ -171,8 +171,8 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
if (!(sc->gfp_mask & __GFP_FS))
return SHRINK_STOP;

- freed = list_lru_walk_node(&gfs2_qd_lru, sc->nid, gfs2_qd_isolate,
- &dispose, &sc->nr_to_scan);
+ freed = list_lru_shrink_walk(&gfs2_qd_lru, sc,
+ gfs2_qd_isolate, &dispose);

gfs2_qd_dispose(&dispose);

@@ -182,7 +182,7 @@ static unsigned long gfs2_qd_shrink_scan(struct shrinker *shrink,
static unsigned long gfs2_qd_shrink_count(struct shrinker *shrink,
struct shrink_control *sc)
{
- return vfs_pressure_ratio(list_lru_count_node(&gfs2_qd_lru, sc->nid));
+ return vfs_pressure_ratio(list_lru_shrink_count(&gfs2_qd_lru, sc));
}

struct shrinker gfs2_qd_shrinker = {
diff --git a/fs/inode.c b/fs/inode.c
index 26753ba7b6d6..f08420a3bf50 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -749,14 +749,13 @@ inode_lru_isolate(struct list_head *item, spinlock_t *lru_lock, void *arg)
* to trim from the LRU. Inodes to be freed are moved to a temporary list and
* then are freed outside inode_lock by dispose_list().
*/
-long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
- int nid)
+long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
{
LIST_HEAD(freeable);
long freed;

- freed = list_lru_walk_node(&sb->s_inode_lru, nid, inode_lru_isolate,
- &freeable, &nr_to_scan);
+ freed = list_lru_shrink_walk(&sb->s_inode_lru, sc,
+ inode_lru_isolate, &freeable);
dispose_list(&freeable);
return freed;
}
diff --git a/fs/internal.h b/fs/internal.h
index 9477f8f6aefc..7a6aa641c060 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -14,6 +14,7 @@ struct file_system_type;
struct linux_binprm;
struct path;
struct mount;
+struct shrink_control;

/*
* block_dev.c
@@ -112,8 +113,7 @@ extern int open_check_o_direct(struct file *f);
* inode.c
*/
extern spinlock_t inode_sb_list_lock;
-extern long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
- int nid);
+extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
extern void inode_add_lru(struct inode *inode);

/*
@@ -130,8 +130,7 @@ extern int invalidate_inodes(struct super_block *, bool);
*/
extern struct dentry *__d_alloc(struct super_block *, const struct qstr *);
extern int d_set_mounted(struct dentry *dentry);
-extern long prune_dcache_sb(struct super_block *sb, unsigned long nr_to_scan,
- int nid);
+extern long prune_dcache_sb(struct super_block *sb, struct shrink_control *sc);

/*
* read_write.c
diff --git a/fs/super.c b/fs/super.c
index eae088f6aaae..4554ac257647 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,8 +77,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
if (sb->s_op->nr_cached_objects)
fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);

- inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
- dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+ inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
+ dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects = dentries + inodes + fs_objects + 1;
if (!total_objects)
total_objects = 1;
@@ -86,20 +86,20 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
/* proportion the scan between the caches */
dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
+ fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);

/*
* prune the dcache first as the icache is pinned by it, then
* prune the icache, followed by the filesystem specific caches
*/
- freed = prune_dcache_sb(sb, dentries, sc->nid);
- freed += prune_icache_sb(sb, inodes, sc->nid);
+ sc->nr_to_scan = dentries;
+ freed = prune_dcache_sb(sb, sc);
+ sc->nr_to_scan = inodes;
+ freed += prune_icache_sb(sb, sc);

- if (fs_objects) {
- fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
- total_objects);
+ if (fs_objects)
freed += sb->s_op->free_cached_objects(sb, fs_objects,
sc->nid);
- }

drop_super(sb);
return freed;
@@ -118,17 +118,15 @@ static unsigned long super_cache_count(struct shrinker *shrink,
* scalability bottleneck. The counts could get updated
* between super_cache_count and super_cache_scan anyway.
* Call to super_cache_count with shrinker_rwsem held
- * ensures the safety of call to list_lru_count_node() and
+ * ensures the safety of call to list_lru_shrink_count() and
* s_op->nr_cached_objects().
*/
if (sb->s_op && sb->s_op->nr_cached_objects)
total_objects = sb->s_op->nr_cached_objects(sb,
sc->nid);

- total_objects += list_lru_count_node(&sb->s_dentry_lru,
- sc->nid);
- total_objects += list_lru_count_node(&sb->s_inode_lru,
- sc->nid);
+ total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+ total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

total_objects = vfs_pressure_ratio(total_objects);
return total_objects;
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 24b4ebea0d4d..38f6671f75a3 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1578,10 +1578,9 @@ xfs_buftarg_shrink_scan(
struct xfs_buftarg, bt_shrinker);
LIST_HEAD(dispose);
unsigned long freed;
- unsigned long nr_to_scan = sc->nr_to_scan;

- freed = list_lru_walk_node(&btp->bt_lru, sc->nid, xfs_buftarg_isolate,
- &dispose, &nr_to_scan);
+ freed = list_lru_shrink_walk(&btp->bt_lru, sc,
+ xfs_buftarg_isolate, &dispose);

while (!list_empty(&dispose)) {
struct xfs_buf *bp;
@@ -1600,7 +1599,7 @@ xfs_buftarg_shrink_count(
{
struct xfs_buftarg *btp = container_of(shrink,
struct xfs_buftarg, bt_shrinker);
- return list_lru_count_node(&btp->bt_lru, sc->nid);
+ return list_lru_shrink_count(&btp->bt_lru, sc);
}

void
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index d68f23021af3..ec92f6b2e0f2 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -525,7 +525,6 @@ xfs_qm_shrink_scan(
struct xfs_qm_isolate isol;
unsigned long freed;
int error;
- unsigned long nr_to_scan = sc->nr_to_scan;

if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
return 0;
@@ -533,8 +532,8 @@ xfs_qm_shrink_scan(
INIT_LIST_HEAD(&isol.buffers);
INIT_LIST_HEAD(&isol.dispose);

- freed = list_lru_walk_node(&qi->qi_lru, sc->nid, xfs_qm_dquot_isolate, &isol,
- &nr_to_scan);
+ freed = list_lru_shrink_walk(&qi->qi_lru, sc,
+ xfs_qm_dquot_isolate, &isol);

error = xfs_buf_delwri_submit(&isol.buffers);
if (error)
@@ -559,7 +558,7 @@ xfs_qm_shrink_count(
struct xfs_quotainfo *qi = container_of(shrink,
struct xfs_quotainfo, qi_shrinker);

- return list_lru_count_node(&qi->qi_lru, sc->nid);
+ return list_lru_shrink_count(&qi->qi_lru, sc);
}

/*
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f3434533fbf8..f500a2e39b13 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -9,6 +9,7 @@

#include <linux/list.h>
#include <linux/nodemask.h>
+#include <linux/shrinker.h>

/* list_lru_walk_cb has to always return one of those */
enum lru_status {
@@ -81,6 +82,13 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item);
* Callers that want such a guarantee need to provide an outer lock.
*/
unsigned long list_lru_count_node(struct list_lru *lru, int nid);
+
+static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
+ struct shrink_control *sc)
+{
+ return list_lru_count_node(lru, sc->nid);
+}
+
static inline unsigned long list_lru_count(struct list_lru *lru)
{
long count = 0;
@@ -120,6 +128,14 @@ unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
unsigned long *nr_to_walk);

static inline unsigned long
+list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
+ list_lru_walk_cb isolate, void *cb_arg)
+{
+ return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
+ &sc->nr_to_scan);
+}
+
+static inline unsigned long
list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
void *cb_arg, unsigned long nr_to_walk)
{
diff --git a/mm/workingset.c b/mm/workingset.c
index f7216fa7da27..d4fa7fb10a52 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -275,7 +275,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,

/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
local_irq_disable();
- shadow_nodes = list_lru_count_node(&workingset_shadow_nodes, sc->nid);
+ shadow_nodes = list_lru_shrink_count(&workingset_shadow_nodes, sc);
local_irq_enable();

pages = node_present_pages(sc->nid);
@@ -376,8 +376,8 @@ static unsigned long scan_shadow_nodes(struct shrinker *shrinker,

/* list_lru lock nests inside IRQ-safe mapping->tree_lock */
local_irq_disable();
- ret = list_lru_walk_node(&workingset_shadow_nodes, sc->nid,
- shadow_lru_isolate, NULL, &sc->nr_to_scan);
+ ret = list_lru_shrink_walk(&workingset_shadow_nodes, sc,
+ shadow_lru_isolate, NULL);
local_irq_enable();
return ret;
}
--
1.7.10.4

2014-10-24 10:37:59

by Vladimir Davydov

Subject: [PATCH -mm v2 2/9] fs: consolidate {nr,free}_cached_objects args in shrink_control

We are going to make FS shrinkers memcg-aware. To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan. Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
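
For instance, a filesystem implementing these hooks would end up with
something like the sketch below (the foofs_* names and helpers are purely
illustrative; the XFS hunk further down is the real example):

    static long foofs_nr_cached_objects(struct super_block *sb,
                                        struct shrink_control *sc)
    {
            /* count reclaimable fs-private objects for sc->nid */
            return foofs_count_reclaimable(sb);             /* placeholder */
    }

    static long foofs_free_cached_objects(struct super_block *sb,
                                          struct shrink_control *sc)
    {
            /* free up to sc->nr_to_scan of them */
            return foofs_reclaim(sb, sc->nr_to_scan);       /* placeholder */
    }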

Suggested-by: Dave Chinner <[email protected]>
Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/super.c | 12 ++++++------
fs/xfs/xfs_super.c | 7 +++----
include/linux/fs.h | 6 ++++--
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 4554ac257647..a2b735a42e74 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,7 +75,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
return SHRINK_STOP;

if (sb->s_op->nr_cached_objects)
- fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+ fs_objects = sb->s_op->nr_cached_objects(sb, sc);

inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
@@ -97,9 +97,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
sc->nr_to_scan = inodes;
freed += prune_icache_sb(sb, sc);

- if (fs_objects)
- freed += sb->s_op->free_cached_objects(sb, fs_objects,
- sc->nid);
+ if (fs_objects) {
+ sc->nr_to_scan = fs_objects;
+ freed += sb->s_op->free_cached_objects(sb, sc);
+ }

drop_super(sb);
return freed;
@@ -122,8 +123,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
* s_op->nr_cached_objects().
*/
if (sb->s_op && sb->s_op->nr_cached_objects)
- total_objects = sb->s_op->nr_cached_objects(sb,
- sc->nid);
+ total_objects = sb->s_op->nr_cached_objects(sb, sc);

total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9f622feda6a4..222d8c53bb72 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1527,7 +1527,7 @@ xfs_fs_mount(
static long
xfs_fs_nr_cached_objects(
struct super_block *sb,
- int nid)
+ struct shrink_control *sc)
{
return xfs_reclaim_inodes_count(XFS_M(sb));
}
@@ -1535,10 +1535,9 @@ xfs_fs_nr_cached_objects(
static long
xfs_fs_free_cached_objects(
struct super_block *sb,
- long nr_to_scan,
- int nid)
+ struct shrink_control *sc)
{
- return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+ return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
}

static const struct super_operations xfs_super_operations = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4111ee0afeea..d5624f5a463d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1568,8 +1568,10 @@ struct super_operations {
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
- long (*nr_cached_objects)(struct super_block *, int);
- long (*free_cached_objects)(struct super_block *, long, int);
+ long (*nr_cached_objects)(struct super_block *,
+ struct shrink_control *);
+ long (*free_cached_objects)(struct super_block *,
+ struct shrink_control *);
};

/*
--
1.7.10.4

2014-10-24 10:38:15

by Vladimir Davydov

Subject: [PATCH -mm v2 6/9] list_lru: get rid of ->active_nodes

The active_nodes mask allows us to skip empty nodes when walking over
list_lru items from all nodes in list_lru_count/walk. However, these
functions are never called from hot paths, so it doesn't seem we need
that kind of optimization there. On the other hand, removing the mask
will make it easier to make list_lru per-memcg.

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/list_lru.h | 5 ++---
mm/list_lru.c | 10 +++-------
2 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index f500a2e39b13..53c1d6b78270 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,7 +31,6 @@ struct list_lru_node {

struct list_lru {
struct list_lru_node *node;
- nodemask_t active_nodes;
};

void list_lru_destroy(struct list_lru *lru);
@@ -94,7 +93,7 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
long count = 0;
int nid;

- for_each_node_mask(nid, lru->active_nodes)
+ for_each_node_state(nid, N_NORMAL_MEMORY)
count += list_lru_count_node(lru, nid);

return count;
@@ -142,7 +141,7 @@ list_lru_walk(struct list_lru *lru, list_lru_walk_cb isolate,
long isolated = 0;
int nid;

- for_each_node_mask(nid, lru->active_nodes) {
+ for_each_node_state(nid, N_NORMAL_MEMORY) {
isolated += list_lru_walk_node(lru, nid, isolate,
cb_arg, &nr_to_walk);
if (nr_to_walk <= 0)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index f1a0db194173..07e198c77888 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -19,8 +19,7 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item)
WARN_ON_ONCE(nlru->nr_items < 0);
if (list_empty(item)) {
list_add_tail(item, &nlru->list);
- if (nlru->nr_items++ == 0)
- node_set(nid, lru->active_nodes);
+ nlru->nr_items++;
spin_unlock(&nlru->lock);
return true;
}
@@ -37,8 +36,7 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
spin_lock(&nlru->lock);
if (!list_empty(item)) {
list_del_init(item);
- if (--nlru->nr_items == 0)
- node_clear(nid, lru->active_nodes);
+ nlru->nr_items--;
WARN_ON_ONCE(nlru->nr_items < 0);
spin_unlock(&nlru->lock);
return true;
@@ -90,8 +88,7 @@ restart:
case LRU_REMOVED_RETRY:
assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
- if (--nlru->nr_items == 0)
- node_clear(nid, lru->active_nodes);
+ nlru->nr_items--;
WARN_ON_ONCE(nlru->nr_items < 0);
isolated++;
/*
@@ -133,7 +130,6 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
if (!lru->node)
return -ENOMEM;

- nodes_clear(lru->active_nodes);
for (i = 0; i < nr_node_ids; i++) {
spin_lock_init(&lru->node[i].lock);
if (key)
--
1.7.10.4

2014-10-24 10:38:25

by Vladimir Davydov

Subject: [PATCH -mm v2 9/9] fs: make shrinker memcg aware

Now, to make any list_lru-based shrinker memcg aware, all we have to do
is initialize its list_lru as memcg aware. Let's do it for the generic
FS shrinker (super_block::s_shrink).

There are other FS-specific shrinkers that use list_lru for storing
objects, such as the XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that are shared among different cgroups, there is no
point in making them memcg aware. It's a big question whether we should
account them to memcg at all.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/super.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b027849d92d2..482b4071f4de 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -189,9 +189,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);

- if (list_lru_init(&s->s_dentry_lru))
+ if (list_lru_init_memcg(&s->s_dentry_lru))
goto fail;
- if (list_lru_init(&s->s_inode_lru))
+ if (list_lru_init_memcg(&s->s_inode_lru))
goto fail;

init_rwsem(&s->s_umount);
@@ -227,7 +227,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
s->s_shrink.scan_objects = super_cache_scan;
s->s_shrink.count_objects = super_cache_count;
s->s_shrink.batch = 1024;
- s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+ s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
return s;

fail:
--
1.7.10.4

2014-10-24 10:38:51

by Vladimir Davydov

Subject: [PATCH -mm v2 8/9] list_lru: introduce per-memcg lists

There are several FS shrinkers, including super_block::s_shrink, that
keep reclaimable objects in the list_lru structure. Hence, to turn them
into memcg-aware shrinkers, it is enough to make list_lru per-memcg.

This patch does the trick. It adds an array of lru lists to the
list_lru_node structure (the per-node part of the list_lru), one for each
kmem-active memcg, and dispatches every item addition or removal to the
list corresponding to the memcg the item is accounted to. So now the
list_lru structure is not just per node, but per node and per memcg.

Not all list_lrus need this feature, so this patch also adds a new
method, list_lru_init_memcg, which initializes a list_lru as memcg
aware. Otherwise (i.e. if initialized with the old list_lru_init), the
list_lru won't have per-memcg lists.

Just like the arrays of per-memcg caches, the arrays of per-memcg lists
are indexed by memcg_cache_id, so we must grow them whenever
memcg_max_cache_ids is increased. Therefore we introduce a callback,
memcg_update_all_list_lrus, which is invoked by memcg_alloc_cache_id if
the id space is full.

The locking is implemented in a manner similar to lruvecs, i.e. we have
one lock per node that protects all lists (both global and per cgroup)
on the node.
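
In other words, every list operation now resolves its target list roughly
like this (cf. list_lru_add/del below; lru_from_memcg_idx must be called
under nlru->lock):

    int nid = page_to_nid(virt_to_page(item));
    struct list_lru_node *nlru = &lru->node[nid];
    struct mem_cgroup *memcg = mem_cgroup_from_kmem(item);
    struct list_lru_one *l;

    spin_lock(&nlru->lock);
    /* the global list for the root cgroup or non-memcg-aware lrus,
     * nlru->memcg_lrus->lru[memcg_cache_id(memcg)] otherwise */
    l = lru_from_memcg_idx(nlru, memcg_cache_id(memcg));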

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/list_lru.h | 54 ++++++--
include/linux/memcontrol.h | 7 +
mm/list_lru.c | 325 ++++++++++++++++++++++++++++++++++++++++----
mm/memcontrol.c | 27 ++++
4 files changed, 377 insertions(+), 36 deletions(-)

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index ee9486ac0621..731acd3bd6e6 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -11,6 +11,8 @@
#include <linux/nodemask.h>
#include <linux/shrinker.h>

+struct mem_cgroup;
+
/* list_lru_walk_cb has to always return one of those */
enum lru_status {
LRU_REMOVED, /* item removed from list */
@@ -22,11 +24,26 @@ enum lru_status {
internally, but has to return locked. */
};

-struct list_lru_node {
- spinlock_t lock;
+struct list_lru_one {
struct list_head list;
/* kept as signed so we can catch imbalance bugs */
long nr_items;
+};
+
+struct list_lru_memcg {
+ /* array of per cgroup lists, indexed by memcg_cache_id */
+ struct list_lru_one *lru[0];
+};
+
+struct list_lru_node {
+ /* protects all lists on the node, including per cgroup */
+ spinlock_t lock;
+ /* global list, used for the root cgroup in cgroup aware lrus */
+ struct list_lru_one lru;
+#ifdef CONFIG_MEMCG_KMEM
+ /* for cgroup aware lrus points to per cgroup lists, otherwise NULL */
+ struct list_lru_memcg *memcg_lrus;
+#endif
} ____cacheline_aligned_in_smp;

struct list_lru {
@@ -36,12 +53,17 @@ struct list_lru {
#endif
};

+#ifdef CONFIG_MEMCG_KMEM
+int memcg_update_all_list_lrus(int num_memcgs);
+#endif
+
void list_lru_destroy(struct list_lru *lru);
-int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key);
-static inline int list_lru_init(struct list_lru *lru)
-{
- return list_lru_init_key(lru, NULL);
-}
+int __list_lru_init(struct list_lru *lru, bool memcg_aware,
+ struct lock_class_key *key);
+
+#define list_lru_init(lru) __list_lru_init((lru), false, NULL)
+#define list_lru_init_key(lru, key) __list_lru_init((lru), false, (key))
+#define list_lru_init_memcg(lru) __list_lru_init((lru), true, NULL)

/**
* list_lru_add: add an element to the lru list's tail
@@ -75,20 +97,23 @@ bool list_lru_add(struct list_lru *lru, struct list_head *item);
bool list_lru_del(struct list_lru *lru, struct list_head *item);

/**
- * list_lru_count_node: return the number of objects currently held by @lru
+ * list_lru_count_one: return the number of objects currently held by @lru
* @lru: the lru pointer.
* @nid: the node id to count from.
+ * @memcg: the cgroup to count from.
*
* Always return a non-negative number, 0 for empty lists. There is no
* guarantee that the list is not updated while the count is being computed.
* Callers that want such a guarantee need to provide an outer lock.
*/
+unsigned long list_lru_count_one(struct list_lru *lru,
+ int nid, struct mem_cgroup *memcg);
unsigned long list_lru_count_node(struct list_lru *lru, int nid);

static inline unsigned long list_lru_shrink_count(struct list_lru *lru,
struct shrink_control *sc)
{
- return list_lru_count_node(lru, sc->nid);
+ return list_lru_count_one(lru, sc->nid, sc->memcg);
}

static inline unsigned long list_lru_count(struct list_lru *lru)
@@ -105,9 +130,10 @@ static inline unsigned long list_lru_count(struct list_lru *lru)
typedef enum lru_status
(*list_lru_walk_cb)(struct list_head *item, spinlock_t *lock, void *cb_arg);
/**
- * list_lru_walk_node: walk a list_lru, isolating and disposing freeable items.
+ * list_lru_walk_one: walk a list_lru, isolating and disposing freeable items.
* @lru: the lru pointer.
* @nid: the node id to scan from.
+ * @memcg: the cgroup to scan from.
* @isolate: callback function that is resposible for deciding what to do with
* the item currently being scanned
* @cb_arg: opaque type that will be passed to @isolate
@@ -125,6 +151,10 @@ typedef enum lru_status
*
* Return value: the number of objects effectively removed from the LRU.
*/
+unsigned long list_lru_walk_one(struct list_lru *lru,
+ int nid, struct mem_cgroup *memcg,
+ list_lru_walk_cb isolate, void *cb_arg,
+ unsigned long *nr_to_walk);
unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
list_lru_walk_cb isolate, void *cb_arg,
unsigned long *nr_to_walk);
@@ -133,8 +163,8 @@ static inline unsigned long
list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
list_lru_walk_cb isolate, void *cb_arg)
{
- return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
- &sc->nr_to_scan);
+ return list_lru_walk_one(lru, sc->nid, sc->memcg, isolate, cb_arg,
+ &sc->nr_to_scan);
}

static inline unsigned long
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index eebb56b94d23..fd1219cb8ee5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -419,6 +419,8 @@ static inline bool memcg_kmem_enabled(void)
bool memcg_kmem_is_active(struct mem_cgroup *memcg);
bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg);

+struct mem_cgroup *mem_cgroup_from_kmem(void *ptr);
+
/*
* In general, we'll do everything in our power to not incur in any overhead
* for non-memcg users for the kmem functions. Not even a function call, if we
@@ -554,6 +556,11 @@ static inline bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
return false;
}

+static inline struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
+{
+ return NULL;
+}
+
static inline bool
memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
{
diff --git a/mm/list_lru.c b/mm/list_lru.c
index a9021cb3ccde..4041e5a1569b 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -10,6 +10,7 @@
#include <linux/list_lru.h>
#include <linux/slab.h>
#include <linux/mutex.h>
+#include <linux/memcontrol.h>

#ifdef CONFIG_MEMCG_KMEM
static LIST_HEAD(list_lrus);
@@ -38,16 +39,56 @@ static void list_lru_unregister(struct list_lru *lru)
}
#endif /* CONFIG_MEMCG_KMEM */

+#ifdef CONFIG_MEMCG_KMEM
+static inline bool list_lru_memcg_aware(struct list_lru *lru)
+{
+ return !!lru->node[0].memcg_lrus;
+}
+
+/*
+ * For the given list_lru_node return the list corresponding to the memory
+ * cgroup whose memcg_cache_id equals @idx. If @idx < 0 or the list_lru is not
+ * cgroup aware, the global list is returned.
+ */
+static inline struct list_lru_one *
+lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
+{
+ /*
+ * The lock protects the array of per cgroup lists from relocation
+ * (see update_memcg_lrus).
+ */
+ lockdep_assert_held(&nlru->lock);
+ if (nlru->memcg_lrus && idx >= 0)
+ return nlru->memcg_lrus->lru[idx];
+
+ return &nlru->lru;
+}
+#else
+static inline bool list_lru_memcg_aware(struct list_lru *lru)
+{
+ return false;
+}
+
+static inline struct list_lru_one *
+lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
+{
+ return &nlru->lru;
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
+ struct mem_cgroup *memcg = mem_cgroup_from_kmem(item);
+ struct list_lru_one *l;

spin_lock(&nlru->lock);
- WARN_ON_ONCE(nlru->nr_items < 0);
+ l = lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
+ WARN_ON_ONCE(l->nr_items < 0);
if (list_empty(item)) {
- list_add_tail(item, &nlru->list);
- nlru->nr_items++;
+ list_add_tail(item, &l->list);
+ l->nr_items++;
spin_unlock(&nlru->lock);
return true;
}
@@ -60,12 +101,15 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
{
int nid = page_to_nid(virt_to_page(item));
struct list_lru_node *nlru = &lru->node[nid];
+ struct mem_cgroup *memcg = mem_cgroup_from_kmem(item);
+ struct list_lru_one *l;

spin_lock(&nlru->lock);
+ l = lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
if (!list_empty(item)) {
list_del_init(item);
- nlru->nr_items--;
- WARN_ON_ONCE(nlru->nr_items < 0);
+ l->nr_items--;
+ WARN_ON_ONCE(l->nr_items < 0);
spin_unlock(&nlru->lock);
return true;
}
@@ -74,33 +118,58 @@ bool list_lru_del(struct list_lru *lru, struct list_head *item)
}
EXPORT_SYMBOL_GPL(list_lru_del);

-unsigned long
-list_lru_count_node(struct list_lru *lru, int nid)
+static unsigned long __list_lru_count_one(struct list_lru *lru,
+ int nid, int memcg_idx)
{
- unsigned long count = 0;
struct list_lru_node *nlru = &lru->node[nid];
+ struct list_lru_one *l;
+ unsigned long count;

spin_lock(&nlru->lock);
- WARN_ON_ONCE(nlru->nr_items < 0);
- count += nlru->nr_items;
+ l = lru_from_memcg_idx(nlru, memcg_idx);
+ WARN_ON_ONCE(l->nr_items < 0);
+ count = l->nr_items;
spin_unlock(&nlru->lock);

return count;
}
+
+unsigned long list_lru_count_one(struct list_lru *lru,
+ int nid, struct mem_cgroup *memcg)
+{
+ return __list_lru_count_one(lru, nid, memcg_cache_id(memcg));
+}
+EXPORT_SYMBOL_GPL(list_lru_count_one);
+
+unsigned long list_lru_count_node(struct list_lru *lru, int nid)
+{
+ long count = 0;
+ int memcg_idx;
+
+ count += __list_lru_count_one(lru, nid, -1);
+ if (list_lru_memcg_aware(lru)) {
+ for_each_memcg_cache_index(memcg_idx)
+ count += __list_lru_count_one(lru, nid, memcg_idx);
+ }
+ return count;
+}
EXPORT_SYMBOL_GPL(list_lru_count_node);

-unsigned long
-list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
- void *cb_arg, unsigned long *nr_to_walk)
+static unsigned long
+__list_lru_walk_one(struct list_lru *lru, int nid, int memcg_idx,
+ list_lru_walk_cb isolate, void *cb_arg,
+ unsigned long *nr_to_walk)
{

- struct list_lru_node *nlru = &lru->node[nid];
+ struct list_lru_node *nlru = &lru->node[nid];
+ struct list_lru_one *l;
struct list_head *item, *n;
unsigned long isolated = 0;

spin_lock(&nlru->lock);
+ l = lru_from_memcg_idx(nlru, memcg_idx);
restart:
- list_for_each_safe(item, n, &nlru->list) {
+ list_for_each_safe(item, n, &l->list) {
enum lru_status ret;

/*
@@ -116,8 +185,8 @@ restart:
case LRU_REMOVED_RETRY:
assert_spin_locked(&nlru->lock);
case LRU_REMOVED:
- nlru->nr_items--;
- WARN_ON_ONCE(nlru->nr_items < 0);
+ l->nr_items--;
+ WARN_ON_ONCE(l->nr_items < 0);
isolated++;
/*
* If the lru lock has been dropped, our list
@@ -128,7 +197,7 @@ restart:
goto restart;
break;
case LRU_ROTATE:
- list_move_tail(item, &nlru->list);
+ list_move_tail(item, &l->list);
break;
case LRU_SKIP:
break;
@@ -147,12 +216,205 @@ restart:
spin_unlock(&nlru->lock);
return isolated;
}
+
+unsigned long
+list_lru_walk_one(struct list_lru *lru, int nid, struct mem_cgroup *memcg,
+ list_lru_walk_cb isolate, void *cb_arg,
+ unsigned long *nr_to_walk)
+{
+ return __list_lru_walk_one(lru, nid, memcg_cache_id(memcg),
+ isolate, cb_arg, nr_to_walk);
+}
+EXPORT_SYMBOL_GPL(list_lru_walk_one);
+
+unsigned long list_lru_walk_node(struct list_lru *lru, int nid,
+ list_lru_walk_cb isolate, void *cb_arg,
+ unsigned long *nr_to_walk)
+{
+ long isolated = 0;
+ int memcg_idx;
+
+ isolated += __list_lru_walk_one(lru, nid, -1, isolate, cb_arg,
+ nr_to_walk);
+ if (*nr_to_walk > 0 && list_lru_memcg_aware(lru)) {
+ for_each_memcg_cache_index(memcg_idx) {
+ isolated += __list_lru_walk_one(lru, nid, memcg_idx,
+ isolate, cb_arg, nr_to_walk);
+ if (*nr_to_walk <= 0)
+ break;
+ }
+ }
+ return isolated;
+}
EXPORT_SYMBOL_GPL(list_lru_walk_node);

-int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
+static void init_one_lru(struct list_lru_one *l)
+{
+ INIT_LIST_HEAD(&l->list);
+ l->nr_items = 0;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
+/*
+ * Free the given array of per cgroup lists.
+ */
+static void list_lru_memcg_free(struct list_lru_memcg *p)
+{
+ int i, nr;
+
+ if (p) {
+ nr = ksize(p) / sizeof(void *);
+ for (i = 0; i < nr; i++)
+ kfree(p->lru[i]);
+ kfree(p);
+ }
+}
+
+/*
+ * Allocate an array of per cgroup lists that may store up to @nr lists and
+ * initialize it starting from @init_from.
+ */
+static struct list_lru_memcg *list_lru_memcg_alloc(int nr, int init_from)
+{
+ int i;
+ struct list_lru_memcg *p;
+ struct list_lru_one *l;
+
+ p = kmalloc(nr * sizeof(void *), GFP_KERNEL);
+ if (!p)
+ return NULL;
+
+ /*
+ * Instead of storing the array size along with the array, we employ
+ * ksize(). Therefore we must zero the whole structure to make sure
+ * list_lru_memcg_free() won't dereference crap.
+ */
+ memset(p, 0, ksize(p));
+
+ for (i = init_from; i < nr; i++) {
+ l = kmalloc(sizeof(struct list_lru_one), GFP_KERNEL);
+ if (!l) {
+ list_lru_memcg_free(p);
+ return NULL;
+ }
+ init_one_lru(l);
+ p->lru[i] = l;
+ }
+ return p;
+}
+
+/*
+ * Destroy per cgroup lists on each node for the given list_lru.
+ */
+static void list_lru_destroy_memcg_lrus(struct list_lru *lru)
+{
+ int i;
+
+ for (i = 0; i < nr_node_ids; i++)
+ list_lru_memcg_free(lru->node[i].memcg_lrus);
+}
+
+/*
+ * Initialize per cgroup lists on each node for the given list_lru.
+ */
+static int list_lru_init_memcg_lrus(struct list_lru *lru)
+{
+ int i;
+ struct list_lru_memcg *p;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ /*
+ * If memcg_max_cache_ids equals 0 (i.e. kmem accounting is
+ * inactive), kmalloc will return ZERO_SIZE_PTR (not NULL), so
+ * that the lru will still be cgroup aware.
+ */
+ p = list_lru_memcg_alloc(memcg_max_cache_ids, 0);
+ if (!p) {
+ list_lru_destroy_memcg_lrus(lru);
+ return -ENOMEM;
+ }
+ lru->node[i].memcg_lrus = p;
+ }
+ return 0;
+}
+
+/*
+ * Update per cgroup list arrays on each node for the given list_lru to be able
+ * to store up to @num_memcgs elements.
+ */
+static int list_lru_update_memcg_lrus(struct list_lru *lru, int num_memcgs)
+{
+ int i;
+ struct list_lru_node *nlru;
+ struct list_lru_memcg *old, *new;
+
+ for (i = 0; i < nr_node_ids; i++) {
+ nlru = &lru->node[i];
+ old = nlru->memcg_lrus;
+
+ new = list_lru_memcg_alloc(num_memcgs, memcg_max_cache_ids);
+ if (!new)
+ return -ENOMEM;
+
+ memcpy(new, old, memcg_max_cache_ids * sizeof(void *));
+
+ /*
+ * The lock guarantees that we won't race with a reader
+ * (see also lru_from_memcg_idx).
+ *
+ * Since list_lru functions may be called under an IRQ-safe
+ * lock, we have to use IRQ-safe primitives here to avoid
+ * deadlock.
+ */
+ spin_lock_irq(&nlru->lock);
+ nlru->memcg_lrus = new;
+ spin_unlock_irq(&nlru->lock);
+
+ kfree(old);
+ }
+ return 0;
+}
+
+/*
+ * This function is called from the memory cgroup core before increasing
+ * memcg_max_cache_ids. We must update all lrus' arrays of per cgroup lists to
+ * conform to the new size. The memcg_cache_id_space_sem is held for writing.
+ */
+int memcg_update_all_list_lrus(int num_memcgs)
+{
+ int ret = 0;
+ struct list_lru *lru;
+
+ mutex_lock(&list_lrus_mutex);
+ list_for_each_entry(lru, &list_lrus, list) {
+ ret = list_lru_update_memcg_lrus(lru, num_memcgs);
+ /*
+ * It isn't worth the trouble to revert to the old size if we
+ * fail, so we just leave the lrus updated to this point.
+ */
+ if (ret)
+ break;
+ }
+ mutex_unlock(&list_lrus_mutex);
+ return ret;
+}
+#else
+static int list_lru_init_memcg_lrus(struct list_lru *lru)
+{
+ return 0;
+}
+
+static void list_lru_destroy_memcg_lrus(struct list_lru *lru)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
+
+int __list_lru_init(struct list_lru *lru, bool memcg_aware,
+ struct lock_class_key *key)
{
int i;
size_t size = sizeof(*lru->node) * nr_node_ids;
+ int err = 0;

lru->node = kzalloc(size, GFP_KERNEL);
if (!lru->node)
@@ -162,13 +424,27 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
spin_lock_init(&lru->node[i].lock);
if (key)
lockdep_set_class(&lru->node[i].lock, key);
- INIT_LIST_HEAD(&lru->node[i].list);
- lru->node[i].nr_items = 0;
+ init_one_lru(&lru->node[i].lru);
}
- list_lru_register(lru);
- return 0;
+
+ /*
+ * Note, memcg_max_cache_ids must remain stable while we are
+ * allocating per cgroup lrus *and* registering the list_lru,
+ * otherwise memcg_update_all_list_lrus can skip our list_lru.
+ */
+ memcg_lock_cache_id_space();
+ if (memcg_aware)
+ err = list_lru_init_memcg_lrus(lru);
+
+ if (!err)
+ list_lru_register(lru);
+ else
+ kfree(lru->node);
+
+ memcg_unlock_cache_id_space();
+ return err;
}
-EXPORT_SYMBOL_GPL(list_lru_init_key);
+EXPORT_SYMBOL_GPL(__list_lru_init);

void list_lru_destroy(struct list_lru *lru)
{
@@ -176,6 +452,7 @@ void list_lru_destroy(struct list_lru *lru)
if (!lru->node)
return;
list_lru_unregister(lru);
+ list_lru_destroy_memcg_lrus(lru);
kfree(lru->node);
lru->node = NULL;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 444bf8fe5f1d..81f4d2485fbc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2625,6 +2625,9 @@ static int memcg_alloc_cache_id(void)
mutex_unlock(&memcg_slab_mutex);

if (!err)
+ err = memcg_update_all_list_lrus(size);
+
+ if (!err)
memcg_max_cache_ids = size;
up_write(&memcg_cache_id_space_sem);

@@ -3013,6 +3016,30 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
memcg_uncharge_kmem(memcg, 1 << order);
pc->mem_cgroup = NULL;
}
+
+struct mem_cgroup *mem_cgroup_from_kmem(void *ptr)
+{
+ struct mem_cgroup *memcg = NULL;
+ struct page_cgroup *pc;
+ struct kmem_cache *cachep;
+ struct page *page;
+
+ if (!memcg_kmem_enabled())
+ return NULL;
+
+ page = virt_to_head_page(ptr);
+ if (PageSlab(page)) {
+ cachep = page->slab_cache;
+ if (!is_root_cache(cachep))
+ memcg = cachep->memcg_params->memcg;
+ } else {
+ /* page allocated with alloc_kmem_pages */
+ pc = lookup_page_cgroup(page);
+ if (pc->mem_cgroup)
+ memcg = pc->mem_cgroup;
+ }
+ return memcg;
+}
#else
static inline void memcg_unregister_all_caches(struct mem_cgroup *memcg)
{
--
1.7.10.4

2014-10-24 10:39:22

by Vladimir Davydov

Subject: [PATCH -mm v2 7/9] list_lru: organize all list_lrus to list

To make list_lru memcg aware, I need all list_lrus to be kept on a list
protected by a mutex, so that I can sleep while walking over the list.

Therefore, after this change list_lru_destroy may sleep. Fortunately,
there is only one user that calls it from an atomic context - it's
put_super - and we can easily fix it by calling list_lru_destroy before
put_super in deactivate_locked_super; we no longer need the lrus by that
time anyway.

Another point that should be noted is that list_lru_destroy is allowed
to be called on an uninitialized zeroed-out object, in which case it is
a no-op. Before this patch this was guaranteed by kfree, but now we need
an explicit check there.

Signed-off-by: Vladimir Davydov <[email protected]>
---
fs/super.c | 8 ++++++++
include/linux/list_lru.h | 3 +++
mm/list_lru.c | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 45 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index a2b735a42e74..b027849d92d2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -282,6 +282,14 @@ void deactivate_locked_super(struct super_block *s)
unregister_shrinker(&s->s_shrink);
fs->kill_sb(s);

+ /*
+ * Since list_lru_destroy() may sleep, we cannot call it from
+ * put_super(), where we hold the sb_lock. Therefore we destroy
+ * the lru lists right now.
+ */
+ list_lru_destroy(&s->s_dentry_lru);
+ list_lru_destroy(&s->s_inode_lru);
+
put_filesystem(fs);
put_super(s);
} else {
diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 53c1d6b78270..ee9486ac0621 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,9 @@ struct list_lru_node {

struct list_lru {
struct list_lru_node *node;
+#ifdef CONFIG_MEMCG_KMEM
+ struct list_head list;
+#endif
};

void list_lru_destroy(struct list_lru *lru);
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 07e198c77888..a9021cb3ccde 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -9,6 +9,34 @@
#include <linux/mm.h>
#include <linux/list_lru.h>
#include <linux/slab.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_MEMCG_KMEM
+static LIST_HEAD(list_lrus);
+static DEFINE_MUTEX(list_lrus_mutex);
+
+static void list_lru_register(struct list_lru *lru)
+{
+ mutex_lock(&list_lrus_mutex);
+ list_add(&lru->list, &list_lrus);
+ mutex_unlock(&list_lrus_mutex);
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+ mutex_lock(&list_lrus_mutex);
+ list_del(&lru->list);
+ mutex_unlock(&list_lrus_mutex);
+}
+#else
+static void list_lru_register(struct list_lru *lru)
+{
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */

bool list_lru_add(struct list_lru *lru, struct list_head *item)
{
@@ -137,12 +165,18 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
INIT_LIST_HEAD(&lru->node[i].list);
lru->node[i].nr_items = 0;
}
+ list_lru_register(lru);
return 0;
}
EXPORT_SYMBOL_GPL(list_lru_init_key);

void list_lru_destroy(struct list_lru *lru)
{
+ /* Already destroyed or not yet initialized? */
+ if (!lru->node)
+ return;
+ list_lru_unregister(lru);
kfree(lru->node);
+ lru->node = NULL;
}
EXPORT_SYMBOL_GPL(list_lru_destroy);
--
1.7.10.4

2014-10-24 10:38:13

by Vladimir Davydov

Subject: [PATCH -mm v2 5/9] memcg: add rwsem to sync against memcg_caches arrays relocation

We need a stable value of memcg_max_cache_ids in kmem_cache_create()
(memcg_alloc_cache_params() wants it for root caches), where we only
hold the slab_mutex and no memcg-related locks. As a result, we have to
update memcg_max_cache_ids under the slab_mutex, which we can only take
from the slab's side. This looks awkward and will become even worse when
per-memcg list_lru is introduced, which also wants stable access to
memcg_max_cache_ids.

To get rid of this dependency between memcg_max_cache_ids and the
slab_mutex, this patch introduces a special rwsem. The rwsem is held for
writing during memcg_caches arrays relocation and memcg_max_cache_ids
updates. Therefore one can take it for reading to get stable access to
the memcg_caches arrays and/or memcg_max_cache_ids.

Currently the semaphore is taken for reading only from
kmem_cache_create, right before taking the slab_mutex, so right now
there is no point in using an rwsem instead of a mutex. However, once
list_lru is made per-memcg, it will allow list_lru initializations to
proceed concurrently.
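
To summarize, the resulting locking pattern is roughly the following
(sketch condensed from the hunks below):

    /* readers: anyone who needs memcg_max_cache_ids to stay stable */
    memcg_lock_cache_id_space();
    /* ... use memcg_max_cache_ids / the memcg_caches arrays ... */
    memcg_unlock_cache_id_space();

    /* writer: memcg_alloc_cache_id(), when the id space must grow */
    down_write(&memcg_cache_id_space_sem);
    err = memcg_update_all_caches(size);
    if (!err)
            memcg_max_cache_ids = size;
    up_write(&memcg_cache_id_space_sem);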

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 15 +++++++++++++--
mm/memcontrol.c | 28 ++++++++++++++++++----------
mm/slab_common.c | 10 +++++-----
3 files changed, 36 insertions(+), 17 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1a894c1018f..eebb56b94d23 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -395,8 +395,13 @@ extern struct static_key memcg_kmem_enabled_key;
* The maximal number of kmem-active memory cgroups that can exist on the
* system. May grow, but never shrinks. The value returned by memcg_cache_id()
* is always less.
+ *
+ * To prevent memcg_max_cache_ids from growing, memcg_lock_cache_id_space() can
+ * be used. It's backed by rw semaphore.
*/
extern int memcg_max_cache_ids;
+extern void memcg_lock_cache_id_space(void);
+extern void memcg_unlock_cache_id_space(void);

/*
* Helper macro to loop through all memcg-specific caches. Callers must still
@@ -433,8 +438,6 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order);

int memcg_cache_id(struct mem_cgroup *memcg);

-void memcg_update_array_size(int num_groups);
-
struct kmem_cache *
__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);

@@ -571,6 +574,14 @@ static inline int memcg_cache_id(struct mem_cgroup *memcg)
return -1;
}

+static inline void memcg_lock_cache_id_space(void)
+{
+}
+
+static inline void memcg_unlock_cache_id_space(void)
+{
+}
+
static inline struct kmem_cache *
memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fc1e2067a4c4..444bf8fe5f1d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -595,6 +595,19 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
static DEFINE_IDA(memcg_cache_ida);
int memcg_max_cache_ids;

+/* Protects memcg_max_cache_ids */
+static DECLARE_RWSEM(memcg_cache_id_space_sem);
+
+void memcg_lock_cache_id_space(void)
+{
+ down_read(&memcg_cache_id_space_sem);
+}
+
+void memcg_unlock_cache_id_space(void)
+{
+ up_read(&memcg_cache_id_space_sem);
+}
+
/*
* MIN_SIZE is different than 1, because we would like to avoid going through
* the alloc/free process all the time. In a small machine, 4 kmem-limited
@@ -2599,6 +2612,7 @@ static int memcg_alloc_cache_id(void)
* There's no space for the new id in memcg_caches arrays,
* so we have to grow them.
*/
+ down_write(&memcg_cache_id_space_sem);

size = 2 * (id + 1);
if (size < MEMCG_CACHES_MIN_SIZE)
@@ -2610,6 +2624,10 @@ static int memcg_alloc_cache_id(void)
err = memcg_update_all_caches(size);
mutex_unlock(&memcg_slab_mutex);

+ if (!err)
+ memcg_max_cache_ids = size;
+ up_write(&memcg_cache_id_space_sem);
+
if (err) {
ida_simple_remove(&memcg_cache_ida, id);
return err;
@@ -2622,16 +2640,6 @@ static void memcg_free_cache_id(int id)
ida_simple_remove(&memcg_cache_ida, id);
}

-/*
- * We should update the current array size iff all caches updates succeed. This
- * can only be done from the slab side. The slab mutex needs to be held when
- * calling this.
- */
-void memcg_update_array_size(int num)
-{
- memcg_max_cache_ids = num;
-}
-
static void memcg_register_cache(struct mem_cgroup *memcg,
struct kmem_cache *root_cache)
{
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 41fe0ad199f2..879c1a8c54ba 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -169,8 +169,8 @@ int memcg_update_all_caches(int num_memcgs)
{
struct kmem_cache *s;
int ret = 0;
- mutex_lock(&slab_mutex);

+ mutex_lock(&slab_mutex);
list_for_each_entry(s, &slab_caches, list) {
if (!is_root_cache(s))
continue;
@@ -181,11 +181,8 @@ int memcg_update_all_caches(int num_memcgs)
* up to this point in an updated state.
*/
if (ret)
- goto out;
+ break;
}
-
- memcg_update_array_size(num_memcgs);
-out:
mutex_unlock(&slab_mutex);
return ret;
}
@@ -365,6 +362,8 @@ kmem_cache_create(const char *name, size_t size, size_t align,

get_online_cpus();
get_online_mems();
+ memcg_lock_cache_id_space(); /* memcg_alloc_cache_params() needs a
+ stable value of memcg_max_cache_ids */

mutex_lock(&slab_mutex);

@@ -403,6 +402,7 @@ kmem_cache_create(const char *name, size_t size, size_t align,
out_unlock:
mutex_unlock(&slab_mutex);

+ memcg_unlock_cache_id_space();
put_online_mems();
put_online_cpus();

--
1.7.10.4

2014-10-24 10:40:15

by Vladimir Davydov

Subject: [PATCH -mm v2 4/9] memcg: rename some cache id related variables

memcg_limited_groups_array_size, which defines the size of the
memcg_caches arrays, sounds rather cumbersome. Also, its name does not
indicate in any way that it is related to kmem/caches. So let's rename
it to memcg_max_cache_ids. It's concise and points us directly to
memcg_cache_id.

Also, rename kmem_limited_groups to memcg_cache_ida, because it is not a
container for groups, but the memcg_cache_id allocator.
Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 9 +++++++--
mm/memcontrol.c | 19 +++++++++----------
mm/slab_common.c | 4 ++--
3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a592fe75192b..e1a894c1018f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -391,7 +391,12 @@ static inline void sock_release_memcg(struct sock *sk)
#ifdef CONFIG_MEMCG_KMEM
extern struct static_key memcg_kmem_enabled_key;

-extern int memcg_limited_groups_array_size;
+/*
+ * The maximal number of kmem-active memory cgroups that can exist on the
+ * system. May grow, but never shrinks. The value returned by memcg_cache_id()
+ * is always less.
+ */
+extern int memcg_max_cache_ids;

/*
* Helper macro to loop through all memcg-specific caches. Callers must still
@@ -399,7 +404,7 @@ extern int memcg_limited_groups_array_size;
* the slab_mutex must be held when looping through those caches
*/
#define for_each_memcg_cache_index(_idx) \
- for ((_idx) = 0; (_idx) < memcg_limited_groups_array_size; (_idx)++)
+ for ((_idx) = 0; (_idx) < memcg_max_cache_ids; (_idx)++)

static inline bool memcg_kmem_enabled(void)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa7c1d7b0376..fc1e2067a4c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -589,12 +589,11 @@ static void disarm_sock_keys(struct mem_cgroup *memcg)
* memcgs, and none but the 200th is kmem-limited, we'd have to have a
* 200 entry array for that.
*
- * The current size of the caches array is stored in
- * memcg_limited_groups_array_size. It will double each time we have to
- * increase it.
+ * The current size of the caches array is stored in memcg_max_cache_ids. It
+ * will double each time we have to increase it.
*/
-static DEFINE_IDA(kmem_limited_groups);
-int memcg_limited_groups_array_size;
+static DEFINE_IDA(memcg_cache_ida);
+int memcg_max_cache_ids;

/*
* MIN_SIZE is different than 1, because we would like to avoid going through
@@ -2588,12 +2587,12 @@ static int memcg_alloc_cache_id(void)
int id, size;
int err;

- id = ida_simple_get(&kmem_limited_groups,
+ id = ida_simple_get(&memcg_cache_ida,
0, MEMCG_CACHES_MAX_SIZE, GFP_KERNEL);
if (id < 0)
return id;

- if (id < memcg_limited_groups_array_size)
+ if (id < memcg_max_cache_ids)
return id;

/*
@@ -2612,7 +2611,7 @@ static int memcg_alloc_cache_id(void)
mutex_unlock(&memcg_slab_mutex);

if (err) {
- ida_simple_remove(&kmem_limited_groups, id);
+ ida_simple_remove(&memcg_cache_ida, id);
return err;
}
return id;
@@ -2620,7 +2619,7 @@ static int memcg_alloc_cache_id(void)

static void memcg_free_cache_id(int id)
{
- ida_simple_remove(&kmem_limited_groups, id);
+ ida_simple_remove(&memcg_cache_ida, id);
}

/*
@@ -2630,7 +2629,7 @@ static void memcg_free_cache_id(int id)
*/
void memcg_update_array_size(int num)
{
- memcg_limited_groups_array_size = num;
+ memcg_max_cache_ids = num;
}

static void memcg_register_cache(struct mem_cgroup *memcg,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 406944207b61..41fe0ad199f2 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -116,7 +116,7 @@ static int memcg_alloc_cache_params(struct mem_cgroup *memcg,

if (!memcg) {
size = offsetof(struct memcg_cache_params, memcg_caches);
- size += memcg_limited_groups_array_size * sizeof(void *);
+ size += memcg_max_cache_ids * sizeof(void *);
} else
size = sizeof(struct memcg_cache_params);

@@ -154,7 +154,7 @@ static int memcg_update_cache_params(struct kmem_cache *s, int num_memcgs)

cur_params = s->memcg_params;
memcpy(new_params->memcg_caches, cur_params->memcg_caches,
- memcg_limited_groups_array_size * sizeof(void *));
+ memcg_max_cache_ids * sizeof(void *));

new_params->is_root_cache = true;

--
1.7.10.4

2014-10-24 10:40:46

by Vladimir Davydov

Subject: [PATCH -mm v2 3/9] vmscan: shrink slab on memcg pressure

This patch makes the direct reclaim path shrink slab caches not only
under global memory pressure, but also when we reach the user memory
limit of a memcg. To achieve that, it makes shrink_slab() walk over the
memcg hierarchy and run shrinkers marked as memcg-aware on the target
memcg and all its descendants. The memcg to scan is passed in the
shrink_control structure; memcg-unaware shrinkers are still called only
on global memory pressure, with memcg=NULL. It is up to each shrinker
how to organize the objects it is responsible for to achieve per-memcg
reclaim.
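
To illustrate the intent, here is a simplified sketch of the hierarchy walk
for memcg-triggered reclaim - it is not the literal vmscan diff below, and
run_one_shrinker() merely stands in for the existing per-node invocation
logic:

    struct mem_cgroup *root = sc->target_mem_cgroup;  /* memcg that hit its limit */
    struct mem_cgroup *memcg = root;

    do {
            if (!memcg_kmem_is_active(memcg))
                    continue;               /* nothing accounted here */
            sc->memcg = memcg;
            list_for_each_entry(shrinker, &shrinker_list, list) {
                    if (!(shrinker->flags & SHRINKER_MEMCG_AWARE))
                            continue;       /* runs only on global pressure */
                    freed += run_one_shrinker(shrinker, sc);  /* placeholder */
            }
    } while ((memcg = mem_cgroup_iter(root, memcg, NULL)) != NULL);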

Signed-off-by: Vladimir Davydov <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++++++
include/linux/shrinker.h | 10 ++++-
mm/memcontrol.c | 45 ++++++++++++++++++++++-
mm/vmscan.c | 87 +++++++++++++++++++++++++++++++++-----------
4 files changed, 141 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ea007615e8f9..a592fe75192b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -68,6 +68,9 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+ struct mem_cgroup *memcg);
+
bool __mem_cgroup_same_or_subtree(const struct mem_cgroup *root_memcg,
struct mem_cgroup *memcg);
bool task_in_mem_cgroup(struct task_struct *task,
@@ -226,6 +229,12 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
return &zone->lruvec;
}

+static inline unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+ struct mem_cgroup *)
+{
+ return 0;
+}
+
static inline struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
return NULL;
@@ -397,6 +406,9 @@ static inline bool memcg_kmem_enabled(void)
return static_key_false(&memcg_kmem_enabled_key);
}

+bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg);
+
/*
* In general, we'll do everything in our power to not incur in any overhead
* for non-memcg users for the kmem functions. Not even a function call, if we
@@ -524,6 +536,16 @@ static inline bool memcg_kmem_enabled(void)
return false;
}

+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
+static inline bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
+{
+ return false;
+}
+
static inline bool
memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
{
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 68c097077ef0..ab79b174bfbe 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -20,8 +20,15 @@ struct shrink_control {

/* shrink from these nodes */
nodemask_t nodes_to_scan;
+
+ /* shrink from this memory cgroup hierarchy (if not NULL) */
+ struct mem_cgroup *target_mem_cgroup;
+
/* current node being shrunk (for NUMA aware shrinkers) */
int nid;
+
+ /* current memcg being shrunk (for memcg aware shrinkers) */
+ struct mem_cgroup *memcg;
};

#define SHRINK_STOP (~0UL)
@@ -63,7 +70,8 @@ struct shrinker {
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */

/* Flags */
-#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_NUMA_AWARE (1 << 0)
+#define SHRINKER_MEMCG_AWARE (1 << 1)

extern int register_shrinker(struct shrinker *);
extern void unregister_shrinker(struct shrinker *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c50176429fa3..aa7c1d7b0376 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -371,11 +371,29 @@ static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}

-static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+bool memcg_kmem_is_active(struct mem_cgroup *memcg)
{
return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
}

+/*
+ * Returns true if the given cgroup or any of its descendants has kmem
+ * accounting enabled.
+ */
+bool memcg_kmem_is_active_subtree(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *iter;
+
+ iter = memcg;
+ do {
+ if (memcg_kmem_is_active(iter)) {
+ mem_cgroup_iter_break(memcg, iter);
+ return true;
+ }
+ } while ((iter = mem_cgroup_iter(memcg, iter, NULL)) != NULL);
+
+ return false;
+}
#endif

/* Stuffs for move charges at task migration. */
@@ -1307,6 +1325,31 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
VM_BUG_ON((long)(*lru_size) < 0);
}

+unsigned long mem_cgroup_zone_reclaimable_pages(struct zone *zone,
+ struct mem_cgroup *memcg)
+{
+ unsigned long nr = 0;
+ unsigned int lru_mask;
+ struct mem_cgroup *iter;
+
+ lru_mask = LRU_ALL_FILE;
+ if (get_nr_swap_pages() > 0)
+ lru_mask |= LRU_ALL_ANON;
+
+ iter = memcg;
+ do {
+ struct mem_cgroup_per_zone *mz;
+ enum lru_list lru;
+
+ mz = mem_cgroup_zone_zoneinfo(iter, zone);
+ for_each_lru(lru)
+ if (BIT(lru) & lru_mask)
+ nr += mz->lru_size[lru];
+ } while ((iter = mem_cgroup_iter(memcg, iter, NULL)) != NULL);
+
+ return nr;
+}
+
/*
* Checks whether given mem is same or in the root_mem_cgroup's
* hierarchy subtree
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a384339bf718..2cf6b04a4e0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -339,6 +339,26 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
return freed;
}

+static unsigned long
+run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
+ unsigned long nr_pages_scanned, unsigned long lru_pages)
+{
+ unsigned long freed = 0;
+
+ if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
+ shrinkctl->nid = 0;
+ return shrink_slab_node(shrinkctl, shrinker,
+ nr_pages_scanned, lru_pages);
+ }
+
+ for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
+ if (node_online(shrinkctl->nid))
+ freed += shrink_slab_node(shrinkctl, shrinker,
+ nr_pages_scanned, lru_pages);
+ }
+ return freed;
+}
+
/*
* Call the shrink functions to age shrinkable caches
*
@@ -380,19 +400,32 @@ unsigned long shrink_slab(struct shrink_control *shrinkctl,
}

list_for_each_entry(shrinker, &shrinker_list, list) {
- if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
- shrinkctl->nid = 0;
- freed += shrink_slab_node(shrinkctl, shrinker,
- nr_pages_scanned, lru_pages);
+ /*
+ * Call memcg-unaware shrinkers only on global pressure.
+ */
+ if (!(shrinker->flags & SHRINKER_MEMCG_AWARE)) {
+ if (!shrinkctl->target_mem_cgroup) {
+ shrinkctl->memcg = NULL;
+ freed += run_shrinker(shrinkctl, shrinker,
+ nr_pages_scanned, lru_pages);
+ }
continue;
}

- for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
- if (node_online(shrinkctl->nid))
- freed += shrink_slab_node(shrinkctl, shrinker,
+ /*
+ * For memcg-aware shrinkers iterate over the target memcg
+ * hierarchy and run the shrinker on each kmem-active memcg
+ * found in the hierarchy.
+ */
+ shrinkctl->memcg = shrinkctl->target_mem_cgroup;
+ do {
+ if (!shrinkctl->memcg ||
+ memcg_kmem_is_active(shrinkctl->memcg))
+ freed += run_shrinker(shrinkctl, shrinker,
nr_pages_scanned, lru_pages);
-
- }
+ } while ((shrinkctl->memcg =
+ mem_cgroup_iter(shrinkctl->target_mem_cgroup,
+ shrinkctl->memcg, NULL)) != NULL);
}
up_read(&shrinker_rwsem);
out:
@@ -2381,6 +2414,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
gfp_t orig_mask;
struct shrink_control shrink = {
.gfp_mask = sc->gfp_mask,
+ .target_mem_cgroup = sc->target_mem_cgroup,
};
enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
bool reclaimable = false;
@@ -2400,18 +2434,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!populated_zone(zone))
continue;
+
+ if (global_reclaim(sc) &&
+ !cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
+ continue;
+
+ lru_pages += global_reclaim(sc) ?
+ zone_reclaimable_pages(zone) :
+ mem_cgroup_zone_reclaimable_pages(zone,
+ sc->target_mem_cgroup);
+ node_set(zone_to_nid(zone), shrink.nodes_to_scan);
+
/*
* Take care memory controller reclaiming has small influence
* to global LRU.
*/
if (global_reclaim(sc)) {
- if (!cpuset_zone_allowed(zone,
- GFP_KERNEL | __GFP_HARDWALL))
- continue;
-
- lru_pages += zone_reclaimable_pages(zone);
- node_set(zone_to_nid(zone), shrink.nodes_to_scan);
-
if (sc->priority != DEF_PRIORITY &&
!zone_reclaimable(zone))
continue; /* Let kswapd poll it */
@@ -2459,12 +2497,11 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
}

/*
- * Don't shrink slabs when reclaiming memory from over limit cgroups
- * but do shrink slab at least once when aborting reclaim for
- * compaction to avoid unevenly scanning file/anon LRU pages over slab
- * pages.
+ * Shrink slabs at least once when aborting reclaim for compaction
+ * to avoid unevenly scanning file/anon LRU pages over slab pages.
*/
- if (global_reclaim(sc)) {
+ if (global_reclaim(sc) ||
+ memcg_kmem_is_active_subtree(sc->target_mem_cgroup)) {
shrink_slab(&shrink, sc->nr_scanned, lru_pages);
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
@@ -2767,6 +2804,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
struct zonelist *zonelist;
unsigned long nr_reclaimed;
int nid;
+ struct reclaim_state reclaim_state;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2787,6 +2825,10 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,

zonelist = NODE_DATA(nid)->node_zonelists;

+ lockdep_set_current_reclaim_state(sc.gfp_mask);
+ reclaim_state.reclaimed_slab = 0;
+ current->reclaim_state = &reclaim_state;
+
trace_mm_vmscan_memcg_reclaim_begin(0,
sc.may_writepage,
sc.gfp_mask);
@@ -2795,6 +2837,9 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,

trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);

+ current->reclaim_state = NULL;
+ lockdep_clear_current_reclaim_state();
+
return nr_reclaimed;
}
#endif
--
1.7.10.4

2014-11-06 15:21:55

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm v2 3/9] vmscan: shrink slab on memcg pressure

On Fri, Oct 24, 2014 at 02:37:34PM +0400, Vladimir Davydov wrote:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a384339bf718..2cf6b04a4e0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -339,6 +339,26 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> return freed;
> }
>
> +static unsigned long
> +run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> + unsigned long nr_pages_scanned, unsigned long lru_pages)
> +{
> + unsigned long freed = 0;
> +
> + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
> + shrinkctl->nid = 0;
> + return shrink_slab_node(shrinkctl, shrinker,
> + nr_pages_scanned, lru_pages);
> + }
> +
> + for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> + if (node_online(shrinkctl->nid))
> + freed += shrink_slab_node(shrinkctl, shrinker,
> + nr_pages_scanned, lru_pages);
> + }
> + return freed;
> +}

The slab shrinking logic accumulates the lru pages, as well as the
nodes_to_scan mask, when going over the zones, only to go over the
zones here again using the accumulated node information. Why not just
invoke the thing per-zone instead in the first place? Kswapd already
does that (although it could probably work with the per-zone lru_pages
and nr_scanned deltas) and direct reclaim should as well. It would
simplify the existing code as well as your series a lot.
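
Roughly, such a per-zone call could look like the sketch below. It is
only an illustration: shrink_slab_for_zone() does not exist, and it
assumes shrink_slab() is changed to act on shrinkctl->nid for NUMA-aware
shrinkers instead of walking nodes_to_scan.

static void shrink_slab_for_zone(struct zone *zone, struct scan_control *sc,
				 struct shrink_control *shrink,
				 unsigned long zone_lru_pages)
{
	/*
	 * Each zone drives the shrinkers for its own node with its own
	 * lru_pages delta, so shrink_zones() no longer has to accumulate
	 * a nodemask and walk it a second time in shrink_slab().
	 */
	shrink->nid = zone_to_nid(zone);
	shrink_slab(shrink, sc->nr_scanned, zone_lru_pages);
}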

> + /*
> + * For memcg-aware shrinkers iterate over the target memcg
> + * hierarchy and run the shrinker on each kmem-active memcg
> + * found in the hierarchy.
> + */
> + shrinkctl->memcg = shrinkctl->target_mem_cgroup;
> + do {
> + if (!shrinkctl->memcg ||
> + memcg_kmem_is_active(shrinkctl->memcg))
> + freed += run_shrinker(shrinkctl, shrinker,
> nr_pages_scanned, lru_pages);
> -
> - }
> + } while ((shrinkctl->memcg =
> + mem_cgroup_iter(shrinkctl->target_mem_cgroup,
> + shrinkctl->memcg, NULL)) != NULL);

More symptoms of the above. This hierarchy walk is duplicative and
potentially quite expensive.

> @@ -2381,6 +2414,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> gfp_t orig_mask;
> struct shrink_control shrink = {
> .gfp_mask = sc->gfp_mask,
> + .target_mem_cgroup = sc->target_mem_cgroup,
> };
> enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> bool reclaimable = false;
> @@ -2400,18 +2434,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> gfp_zone(sc->gfp_mask), sc->nodemask) {
> if (!populated_zone(zone))
> continue;
> +
> + if (global_reclaim(sc) &&
> + !cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> + continue;
> +
> + lru_pages += global_reclaim(sc) ?
> + zone_reclaimable_pages(zone) :
> + mem_cgroup_zone_reclaimable_pages(zone,
> + sc->target_mem_cgroup);
> + node_set(zone_to_nid(zone), shrink.nodes_to_scan);

And yet another costly hierarchy walk.

The reclaim code walks zonelists according to a nodemask, and within
each zone it walks lruvecs according to the memcg hierarchy. The
shrinkers are wrong in making up an ad-hoc concept of NUMA nodes that
otherwise does not exist anywhere in the VM. Please integrate them
properly instead of adding more duplication on top.

Thanks

2014-11-06 15:42:24

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH -mm v2 3/9] vmscan: shrink slab on memcg pressure

Hi Johannes,

On Thu, Nov 06, 2014 at 10:21:35AM -0500, Johannes Weiner wrote:
> On Fri, Oct 24, 2014 at 02:37:34PM +0400, Vladimir Davydov wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a384339bf718..2cf6b04a4e0c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -339,6 +339,26 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> > return freed;
> > }
> >
> > +static unsigned long
> > +run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> > + unsigned long nr_pages_scanned, unsigned long lru_pages)
> > +{
> > + unsigned long freed = 0;
> > +
> > + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
> > + shrinkctl->nid = 0;
> > + return shrink_slab_node(shrinkctl, shrinker,
> > + nr_pages_scanned, lru_pages);
> > + }
> > +
> > + for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> > + if (node_online(shrinkctl->nid))
> > + freed += shrink_slab_node(shrinkctl, shrinker,
> > + nr_pages_scanned, lru_pages);
> > + }
> > + return freed;
> > +}
>
> The slab shrinking logic accumulates the lru pages, as well as the
> nodes_to_scan mask, when going over the zones, only to go over the
> zones here again using the accumulated node information. Why not just
> invoke the thing per-zone instead in the first place? Kswapd already
> does that (although it could probably work with the per-zone lru_pages
> and nr_scanned deltas) and direct reclaim should as well. It would
> simplify the existing code as well as your series a lot.

100% agree. Yet another argument for invoking shrinkers per-zone is soft
(or low?) memory limit reclaim (once it's fixed/rewritten): the current
code would shrink the slabs of all memory cgroups even if only those
that exceeded the limit were scanned, which is unfair.

>
> > + /*
> > + * For memcg-aware shrinkers iterate over the target memcg
> > + * hierarchy and run the shrinker on each kmem-active memcg
> > + * found in the hierarchy.
> > + */
> > + shrinkctl->memcg = shrinkctl->target_mem_cgroup;
> > + do {
> > + if (!shrinkctl->memcg ||
> > + memcg_kmem_is_active(shrinkctl->memcg))
> > + freed += run_shrinker(shrinkctl, shrinker,
> > nr_pages_scanned, lru_pages);
> > -
> > - }
> > + } while ((shrinkctl->memcg =
> > + mem_cgroup_iter(shrinkctl->target_mem_cgroup,
> > + shrinkctl->memcg, NULL)) != NULL);
>
> More symptoms of the above. This hierarchy walk is duplicative and
> potentially quite expensive.
>
> > @@ -2381,6 +2414,7 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > gfp_t orig_mask;
> > struct shrink_control shrink = {
> > .gfp_mask = sc->gfp_mask,
> > + .target_mem_cgroup = sc->target_mem_cgroup,
> > };
> > enum zone_type requested_highidx = gfp_zone(sc->gfp_mask);
> > bool reclaimable = false;
> > @@ -2400,18 +2434,22 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
> > gfp_zone(sc->gfp_mask), sc->nodemask) {
> > if (!populated_zone(zone))
> > continue;
> > +
> > + if (global_reclaim(sc) &&
> > + !cpuset_zone_allowed(zone, GFP_KERNEL | __GFP_HARDWALL))
> > + continue;
> > +
> > + lru_pages += global_reclaim(sc) ?
> > + zone_reclaimable_pages(zone) :
> > + mem_cgroup_zone_reclaimable_pages(zone,
> > + sc->target_mem_cgroup);
> > + node_set(zone_to_nid(zone), shrink.nodes_to_scan);
>
> And yet another costly hierarchy walk.
>
> The reclaim code walks zonelists according to a nodemask, and within
> each zone it walks lruvecs according to the memcg hierarchy. The
> shrinkers are wrong in making up an ad-hoc concept of NUMA nodes that
> otherwise does not exist anywhere in the VM. Please integrate them
> properly instead of adding more duplication on top.

Will do.

Thanks,
Vladimir

2014-11-10 04:06:56

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH -mm v2 3/9] vmscan: shrink slab on memcg pressure

On Thu, Nov 06, 2014 at 10:21:35AM -0500, Johannes Weiner wrote:
> On Fri, Oct 24, 2014 at 02:37:34PM +0400, Vladimir Davydov wrote:
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a384339bf718..2cf6b04a4e0c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -339,6 +339,26 @@ shrink_slab_node(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> > return freed;
> > }
> >
> > +static unsigned long
> > +run_shrinker(struct shrink_control *shrinkctl, struct shrinker *shrinker,
> > + unsigned long nr_pages_scanned, unsigned long lru_pages)
> > +{
> > + unsigned long freed = 0;
> > +
> > + if (!(shrinker->flags & SHRINKER_NUMA_AWARE)) {
> > + shrinkctl->nid = 0;
> > + return shrink_slab_node(shrinkctl, shrinker,
> > + nr_pages_scanned, lru_pages);
> > + }
> > +
> > + for_each_node_mask(shrinkctl->nid, shrinkctl->nodes_to_scan) {
> > + if (node_online(shrinkctl->nid))
> > + freed += shrink_slab_node(shrinkctl, shrinker,
> > + nr_pages_scanned, lru_pages);
> > + }
> > + return freed;
> > +}
>
> The slab shrinking logic accumulates the lru pages, as well as the
> nodes_to_scan mask, when going over the zones, only to go over the
> zones here again using the accumulated node information. Why not just
> invoke the thing per-zone instead in the first place?

It's not iterating zones here - it's iterating nodes. This is the
external interface that other subsystems call to cause reclaim to
occur, and they can specify multiple nodes to scan at once (e.g.
drop_slab()). Hence we have to iterate....
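
For reference, this is roughly how such a caller builds the mask
(modelled on drop_slab() around this kernel version, not a verbatim
copy):

static void drop_slab_sketch(void)
{
	unsigned long nr_objects;
	struct shrink_control shrink = {
		.gfp_mask = GFP_KERNEL,
	};

	/* Request reclaim on every node; shrink_slab() walks the mask. */
	nodes_setall(shrink.nodes_to_scan);
	do {
		nr_objects = shrink_slab(&shrink, 1000, 1000);
	} while (nr_objects > 10);
}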

Indeed, shrink_zones() requires this because it can be passed an
arbitrary zonelist that may span multiple nodes. Hence shrink_slab()
can be passed a shrinkctl with multiple nodes set in its nodemask
and so, again, iteration is required.

If you want other callers from the VM that guarantee only a single
node needs to be scanned (such as __zone_reclaim()) to avoid the
zone iteration, then factor the code such that shrink_slab_node()
can be called directly by those functions.
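
Something along these lines (a sketch only; it assumes shrink_slab_node(),
shrinker_list and shrinker_rwsem are made reachable from such a caller -
today they are internal to mm/vmscan.c):

static unsigned long zone_reclaim_slab(struct zone *zone, gfp_t gfp_mask,
				       unsigned long nr_scanned,
				       unsigned long lru_pages)
{
	struct shrink_control shrink = {
		.gfp_mask = gfp_mask,
		.nid = zone_to_nid(zone),	/* one node, no mask walk */
	};
	struct shrinker *shrinker;
	unsigned long freed = 0;

	down_read(&shrinker_rwsem);
	list_for_each_entry(shrinker, &shrinker_list, list)
		freed += shrink_slab_node(&shrink, shrinker,
					  nr_scanned, lru_pages);
	up_read(&shrinker_rwsem);
	return freed;
}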

> Kswapd already
> does that (although it could probably work with the per-zone lru_pages
> and nr_scanned deltas) and direct reclaim should as well. It would
> simplify the existing code as well as your series a lot.
>
> > + /*
> > + * For memcg-aware shrinkers iterate over the target memcg
> > + * hierarchy and run the shrinker on each kmem-active memcg
> > + * found in the hierarchy.
> > + */
> > + shrinkctl->memcg = shrinkctl->target_mem_cgroup;
> > + do {
> > + if (!shrinkctl->memcg ||
> > + memcg_kmem_is_active(shrinkctl->memcg))
> > + freed += run_shrinker(shrinkctl, shrinker,
> > nr_pages_scanned, lru_pages);
> > -
> > - }
> > + } while ((shrinkctl->memcg =
> > + mem_cgroup_iter(shrinkctl->target_mem_cgroup,
> > + shrinkctl->memcg, NULL)) != NULL);
>
> More symptoms of the above. This hierarchy walk is duplicative and
> potentially quite expensive.

Same again - if the "zone" being reclaimed is controlled by a memcg
rather than a node ID, then ensure that shrink_slab_foo() can be
called directly with the correct shrinkctl configuration to avoid
unnecessary iteration.

> The reclaim code walks zonelists according to a nodemask, and within
> each zone it walks lruvecs according to the memcg hierarchy. The
> shrinkers are wrong in making up an ad-hoc concept of NUMA nodes that
> otherwise does not exist anywhere in the VM.

Hardly. The shrinker API is an *external VM interface*, just like
memory allocation is an external interface. Node IDs and node masks
are exactly the way memory locality is conveyed to the MM subsystem
during allocation, so reclaim interfaces should match that for
consistency. IOWs, the shrinker matches the "ad-hoc concept of NUMA
nodes" that exists everywhere *outside* the VM.

IOWs, the shrinkers have not "made up" anything - they conform to
the existing VM abstractions that everyone is used to. Yes, the
layers between the core VM LRU reclaim code and the shrinker
infrastructure could do with some improvement and refinement, but
the external interface is consistent with all the other external
locality interfaces the VM provides....

Cheers,

Dave.
--
Dave Chinner
[email protected]