2015-07-08 12:28:25

by Michal Hocko

Subject: [PATCH 0/8 -v3] memcg cleanups + get rid of mm_struct::owner

Hi,
this is the third version of the patch series. The previous version was posted
here: http://marc.info/?l=linux-mm&m=143290066126282 and the first version
here: http://marc.info/?l=linux-mm&m=143264102317318&w=2.

Johannes has suggested exporting struct mem_cgroup because it will make some
other things easier. So even though this is not directly related to the
patchset, I have posted it together because there are some dependencies.

The first patch exports struct mem_cgroup along with its dependencies and
some simple accessor functions so that they can be inlined in the code.

The second patch is a minor cleanup of the cgroup writeback code which wasn't
present in the previous version.

The third patch simply cleans up some extern declarations in memcontrol.h
because we are not consistent in that regard.

The fourth patch moves some more memcg code to vmscan as it doesn't have any
other users.

The fifth patch is from Tejun and it cleans up mem_cgroup_can_attach a
bit and the follow up patches depend on it.

The sixth patch is preparatory; it touches sock_update_memcg which doesn't
check for cg_proto == NULL. It doesn't have to do so currently because
mem_cgroup_from_task always returns non-NULL, but the code is awkward and
this will no longer be true after the follow-up patch.

Patch number 7 is the core one and it gets rid of mm_struct::owner
in favor of mm_struct::memcg. The rationale is described in the patch
so I will not repeat it here.

The last patch gets rid of mem_cgroup_from_task which is not needed anymore.

I have compile tested this with my usual battery of configs +
randconfigs and nothing blew up. I have also runtime tested migration
between two memcgs as fast as possible. Except for pre-existing races no
regressions seemed to be introduced.

I was worried about the user visible change this patch introduces,
but Johannes and Tejun seem to be OK with it and nobody has complained
since this was last posted as an RFC. The changelog states that a
potential usecase is not really worth all the trouble the current
implementation imposes on the memcg behavior.

changes since v2
- rebased on top of the current mmotm tree
- cleanup mem_cgroup_root_css
changes since v1
- added Suggested-by for Oleg as it was him who kicked me into doing
it
- exec_mmap association was missing [Oleg]
- mem_cgroup_from_task cannot return NULL anymore [Oleg]
- mm_match_cgroup doesn't have to play games and it can use
rcu_dereference after mm_struct is exported [Johannes]
- drop mm_move_memcg as it has only one user in memcontrol.c
so it can be opencoded [Johannes]
- functions to associate and drop mm->memcg association have
to be static inline for !CONFIG_MEMCG [Johannes]
- drop synchronize_rcu nonsense during memcg move [Johannes]
- drop "memcg: Use mc.moving_task as the indication for charge
moving" patch because we can get the target task from css
easily [Johannes]
- renamed css_set_memcg to css_inherit_memcg because the name
better suits its usage
- dropped css_get(&from->css) during move because it is pinned
already by the mm. We just have to be careful not to drop it
before we are really done with the migration.

Shortlog says:
Michal Hocko (7):
memcg: export struct mem_cgroup
memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG
memcg: get rid of extern for functions in memcontrol.h
memcg, mm: move mem_cgroup_select_victim_node into vmscan
memcg, tcp_kmem: check for cg_proto in sock_update_memcg
memcg: get rid of mm_struct::owner
memcg: get rid of mem_cgroup_from_task

Tejun Heo (1):
memcg: restructure mem_cgroup_can_attach()

And diffstat looks promising as well.
fs/exec.c | 2 +-
include/linux/memcontrol.h | 439 +++++++++++++++++++++++++++++++++----
include/linux/mm_types.h | 12 +-
include/linux/swap.h | 10 +-
include/net/sock.h | 28 ---
kernel/exit.c | 89 --------
kernel/fork.c | 10 +-
mm/debug.c | 4 +-
mm/memcontrol.c | 527 ++++++---------------------------------------
mm/memory-failure.c | 2 +-
mm/slab_common.c | 2 +-
mm/vmscan.c | 98 ++++++++-
12 files changed, 579 insertions(+), 644 deletions(-)

Thanks


2015-07-08 12:28:41

by Michal Hocko

Subject: [PATCH 1/8] memcg: export struct mem_cgroup

From: Michal Hocko <[email protected]>

The mem_cgroup structure is currently defined in mm/memcontrol.c, which
means that code outside of this file has to use the external API even
for trivial accesses.

This patch exports struct mem_cgroup with its dependencies and makes some
of the exported functions inline. This even helps to reduce the code size
a bit (make defconfig + CONFIG_MEMCG=y):

text data bss dec hex filename
12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
12354970 1823792 1089536 15268298 e8f9ca vmlinux.after

This is not much (376B) but better than nothing. We also save a function
call in some hot paths like the callers of mem_cgroup_count_vm_event which
is used for accounting.

The patch doesn't introduce any functional changes.

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 365 +++++++++++++++++++++++++++++++++++++++++----
include/linux/swap.h | 10 +-
include/net/sock.h | 28 ----
mm/memcontrol.c | 305 -------------------------------------
mm/memory-failure.c | 2 +-
mm/slab_common.c | 2 +-
mm/vmscan.c | 2 +-
7 files changed, 344 insertions(+), 370 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 73b02b0a8f60..f5a8d0bbef8d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -23,8 +23,11 @@
#include <linux/vm_event_item.h>
#include <linux/hardirq.h>
#include <linux/jump_label.h>
+#include <linux/page_counter.h>
+#include <linux/vmpressure.h>
+#include <linux/mmzone.h>
+#include <linux/writeback.h>

-struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
@@ -67,12 +70,221 @@ enum mem_cgroup_events_index {
MEMCG_NR_EVENTS,
};

+/*
+ * Per memcg event counter is incremented at every pagein/pageout. With THP,
+ * it will be incremated by the number of pages. This counter is used for
+ * for trigger some periodic events. This is straightforward and better
+ * than using jiffies etc. to handle periodic memcg event.
+ */
+enum mem_cgroup_events_target {
+ MEM_CGROUP_TARGET_THRESH,
+ MEM_CGROUP_TARGET_SOFTLIMIT,
+ MEM_CGROUP_TARGET_NUMAINFO,
+ MEM_CGROUP_NTARGETS,
+};
+
+struct mem_cgroup_stat_cpu {
+ long count[MEM_CGROUP_STAT_NSTATS];
+ unsigned long events[MEMCG_NR_EVENTS];
+ unsigned long nr_page_events;
+ unsigned long targets[MEM_CGROUP_NTARGETS];
+};
+
+struct reclaim_iter {
+ struct mem_cgroup *position;
+ /* scan generation, increased every round-trip */
+ unsigned int generation;
+};
+
+/*
+ * per-zone information in memory controller.
+ */
+struct mem_cgroup_per_zone {
+ struct lruvec lruvec;
+ unsigned long lru_size[NR_LRU_LISTS];
+
+ struct reclaim_iter iter[DEF_PRIORITY + 1];
+
+ struct rb_node tree_node; /* RB tree node */
+ unsigned long usage_in_excess;/* Set to the value by which */
+ /* the soft limit is exceeded*/
+ bool on_tree;
+ struct mem_cgroup *memcg; /* Back pointer, we cannot */
+ /* use container_of */
+};
+
+struct mem_cgroup_per_node {
+ struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
+};
+
+struct mem_cgroup_threshold {
+ struct eventfd_ctx *eventfd;
+ unsigned long threshold;
+};
+
+/* For threshold */
+struct mem_cgroup_threshold_ary {
+ /* An array index points to threshold just below or equal to usage. */
+ int current_threshold;
+ /* Size of entries[] */
+ unsigned int size;
+ /* Array of thresholds */
+ struct mem_cgroup_threshold entries[0];
+};
+
+struct mem_cgroup_thresholds {
+ /* Primary thresholds array */
+ struct mem_cgroup_threshold_ary *primary;
+ /*
+ * Spare threshold array.
+ * This is needed to make mem_cgroup_unregister_event() "never fail".
+ * It must be able to store at least primary->size - 1 entries.
+ */
+ struct mem_cgroup_threshold_ary *spare;
+};
+
+/*
+ * Bits in struct cg_proto.flags
+ */
+enum cg_proto_flags {
+ /* Currently active and new sockets should be assigned to cgroups */
+ MEMCG_SOCK_ACTIVE,
+ /* It was ever activated; we must disarm static keys on destruction */
+ MEMCG_SOCK_ACTIVATED,
+};
+
+struct cg_proto {
+ struct page_counter memory_allocated; /* Current allocated memory. */
+ struct percpu_counter sockets_allocated; /* Current number of sockets. */
+ int memory_pressure;
+ long sysctl_mem[3];
+ unsigned long flags;
+ /*
+ * memcg field is used to find which memcg we belong directly
+ * Each memcg struct can hold more than one cg_proto, so container_of
+ * won't really cut.
+ *
+ * The elegant solution would be having an inverse function to
+ * proto_cgroup in struct proto, but that means polluting the structure
+ * for everybody, instead of just for memcg users.
+ */
+ struct mem_cgroup *memcg;
+};
+
#ifdef CONFIG_MEMCG
+/*
+ * The memory controller data structure. The memory controller controls both
+ * page cache and RSS per cgroup. We would eventually like to provide
+ * statistics based on the statistics developed by Rik Van Riel for clock-pro,
+ * to help the administrator determine what knobs to tune.
+ */
+struct mem_cgroup {
+ struct cgroup_subsys_state css;
+
+ /* Accounted resources */
+ struct page_counter memory;
+ struct page_counter memsw;
+ struct page_counter kmem;
+
+ /* Normal memory consumption range */
+ unsigned long low;
+ unsigned long high;
+
+ unsigned long soft_limit;
+
+ /* vmpressure notifications */
+ struct vmpressure vmpressure;
+
+ /* css_online() has been completed */
+ int initialized;
+
+ /*
+ * Should the accounting and control be hierarchical, per subtree?
+ */
+ bool use_hierarchy;
+
+ /* protected by memcg_oom_lock */
+ bool oom_lock;
+ int under_oom;
+
+ int swappiness;
+ /* OOM-Killer disable */
+ int oom_kill_disable;
+
+ /* protect arrays of thresholds */
+ struct mutex thresholds_lock;
+
+ /* thresholds for memory usage. RCU-protected */
+ struct mem_cgroup_thresholds thresholds;
+
+ /* thresholds for mem+swap usage. RCU-protected */
+ struct mem_cgroup_thresholds memsw_thresholds;
+
+ /* For oom notifier event fd */
+ struct list_head oom_notify;
+
+ /*
+ * Should we move charges of a task when a task is moved into this
+ * mem_cgroup ? And what type of charges should we move ?
+ */
+ unsigned long move_charge_at_immigrate;
+ /*
+ * set > 0 if pages under this cgroup are moving to other cgroup.
+ */
+ atomic_t moving_account;
+ /* taken only while moving_account > 0 */
+ spinlock_t move_lock;
+ struct task_struct *move_lock_task;
+ unsigned long move_lock_flags;
+ /*
+ * percpu counter.
+ */
+ struct mem_cgroup_stat_cpu __percpu *stat;
+ spinlock_t pcp_counter_lock;
+
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
+ struct cg_proto tcp_mem;
+#endif
+#if defined(CONFIG_MEMCG_KMEM)
+ /* Index in the kmem_cache->memcg_params.memcg_caches array */
+ int kmemcg_id;
+ bool kmem_acct_activated;
+ bool kmem_acct_active;
+#endif
+
+ int last_scanned_node;
+#if MAX_NUMNODES > 1
+ nodemask_t scan_nodes;
+ atomic_t numainfo_events;
+ atomic_t numainfo_updating;
+#endif
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct list_head cgwb_list;
+ struct wb_domain cgwb_domain;
+#endif
+
+ /* List of events which userspace want to receive */
+ struct list_head event_list;
+ spinlock_t event_list_lock;
+
+ struct mem_cgroup_per_node *nodeinfo[0];
+ /* WARNING: nodeinfo must be the last member here */
+};
extern struct cgroup_subsys_state *mem_cgroup_root_css;

-void mem_cgroup_events(struct mem_cgroup *memcg,
+/**
+ * mem_cgroup_events - count memory events against a cgroup
+ * @memcg: the memory cgroup
+ * @idx: the event index
+ * @nr: the number of events to account for
+ */
+static inline void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
- unsigned int nr);
+ unsigned int nr)
+{
+ this_cpu_add(memcg->stat->events[idx], nr);
+}

bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);

@@ -90,15 +302,31 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

-bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
- struct mem_cgroup *root);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
-extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
+ return css ? container_of(css, struct mem_cgroup, css) : NULL;
+}
+
+struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
+ struct mem_cgroup *,
+ struct mem_cgroup_reclaim_cookie *);
+void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+
+static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
+ struct mem_cgroup *root)
+{
+ if (root == memcg)
+ return true;
+ if (!root->use_hierarchy)
+ return false;
+ return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
+}

static inline bool mm_match_cgroup(struct mm_struct *mm,
struct mem_cgroup *memcg)
@@ -114,22 +342,65 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}

-extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);

-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
- struct mem_cgroup *,
- struct mem_cgroup_reclaim_cookie *);
-void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
+static inline bool mem_cgroup_disabled(void)
+{
+ if (memory_cgrp_subsys.disabled)
+ return true;
+ return false;
+}

/*
* For memory reclaim.
*/
-int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec);
-bool mem_cgroup_lruvec_online(struct lruvec *lruvec);
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
-void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
+
+void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
+ int nr_pages);
+
+static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
+{
+ struct mem_cgroup_per_zone *mz;
+ struct mem_cgroup *memcg;
+
+ if (mem_cgroup_disabled())
+ return true;
+
+ mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+ memcg = mz->memcg;
+
+ return !!(memcg->css.flags & CSS_ONLINE);
+}
+
+static inline
+unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
+{
+ struct mem_cgroup_per_zone *mz;
+
+ mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
+ return mz->lru_size[lru];
+}
+
+static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
+{
+ unsigned long inactive_ratio;
+ unsigned long inactive;
+ unsigned long active;
+ unsigned long gb;
+
+ inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
+ active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
+
+ gb = (inactive + active) >> (30 - PAGE_SHIFT);
+ if (gb)
+ inactive_ratio = int_sqrt(10 * gb);
+ else
+ inactive_ratio = 1;
+
+ return inactive * inactive_ratio < active;
+}
+
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

@@ -156,18 +427,26 @@ bool mem_cgroup_oom_synchronize(bool wait);
extern int do_swap_account;
#endif

-static inline bool mem_cgroup_disabled(void)
-{
- if (memory_cgrp_subsys.disabled)
- return true;
- return false;
-}
-
struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page);
-void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
- enum mem_cgroup_stat_index idx, int val);
void mem_cgroup_end_page_stat(struct mem_cgroup *memcg);

+/**
+ * mem_cgroup_update_page_stat - update page state statistics
+ * @memcg: memcg to account against
+ * @idx: page state item to account
+ * @val: number of pages (positive or negative)
+ *
+ * See mem_cgroup_begin_page_stat() for locking requirements.
+ */
+static inline void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
+ enum mem_cgroup_stat_index idx, int val)
+{
+ VM_BUG_ON(!rcu_read_lock_held());
+
+ if (memcg)
+ this_cpu_add(memcg->stat->count[idx], val);
+}
+
static inline void mem_cgroup_inc_page_stat(struct mem_cgroup *memcg,
enum mem_cgroup_stat_index idx)
{
@@ -184,13 +463,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);

-void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
enum vm_event_item idx)
{
+ struct mem_cgroup *memcg;
+
if (mem_cgroup_disabled())
return;
- __mem_cgroup_count_vm_event(mm, idx);
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!memcg))
+ goto out;
+
+ switch (idx) {
+ case PGFAULT:
+ this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
+ break;
+ case PGMAJFAULT:
+ this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
+ break;
+ default:
+ BUG();
+ }
+out:
+ rcu_read_unlock();
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void mem_cgroup_split_huge_fixup(struct page *head);
@@ -275,12 +572,6 @@ static inline bool task_in_mem_cgroup(struct task_struct *task,
return true;
}

-static inline struct cgroup_subsys_state
- *mem_cgroup_css(struct mem_cgroup *memcg)
-{
- return NULL;
-}
-
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -463,7 +754,15 @@ void __memcg_kmem_commit_charge(struct page *page,
struct mem_cgroup *memcg, int order);
void __memcg_kmem_uncharge_pages(struct page *page, int order);

-int memcg_cache_id(struct mem_cgroup *memcg);
+/*
+ * helper for acessing a memcg's index. It will be used as an index in the
+ * child cache array in kmem_cache, and also to derive its name. This function
+ * will return -1 when this is not a kmem-limited memcg.
+ */
+static inline int memcg_cache_id(struct mem_cgroup *memcg)
+{
+ return memcg ? memcg->kmemcg_id : -1;
+}

struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep);
void __memcg_kmem_put_cache(struct kmem_cache *cachep);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9a7adfb85e22..e617b3eefc8a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -352,7 +352,15 @@ extern void check_move_unevictable_pages(struct page **, int nr_pages);
extern int kswapd_run(int nid);
extern void kswapd_stop(int nid);
#ifdef CONFIG_MEMCG
-extern int mem_cgroup_swappiness(struct mem_cgroup *mem);
+static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
+{
+ /* root ? */
+ if (mem_cgroup_disabled() || !memcg->css.parent)
+ return vm_swappiness;
+
+ return memcg->swappiness;
+}
+
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
diff --git a/include/net/sock.h b/include/net/sock.h
index 3a4898ec8c67..35628cf42044 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1041,34 +1041,6 @@ struct proto {
#endif
};

-/*
- * Bits in struct cg_proto.flags
- */
-enum cg_proto_flags {
- /* Currently active and new sockets should be assigned to cgroups */
- MEMCG_SOCK_ACTIVE,
- /* It was ever activated; we must disarm static keys on destruction */
- MEMCG_SOCK_ACTIVATED,
-};
-
-struct cg_proto {
- struct page_counter memory_allocated; /* Current allocated memory. */
- struct percpu_counter sockets_allocated; /* Current number of sockets. */
- int memory_pressure;
- long sysctl_mem[3];
- unsigned long flags;
- /*
- * memcg field is used to find which memcg we belong directly
- * Each memcg struct can hold more than one cg_proto, so container_of
- * won't really cut.
- *
- * The elegant solution would be having an inverse function to
- * proto_cgroup in struct proto, but that means polluting the structure
- * for everybody, instead of just for memcg users.
- */
- struct mem_cgroup *memcg;
-};
-
int proto_register(struct proto *prot, int alloc_slab);
void proto_unregister(struct proto *prot);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..759ec413e72c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -111,56 +111,10 @@ static const char * const mem_cgroup_lru_names[] = {
"unevictable",
};

-/*
- * Per memcg event counter is incremented at every pagein/pageout. With THP,
- * it will be incremated by the number of pages. This counter is used for
- * for trigger some periodic events. This is straightforward and better
- * than using jiffies etc. to handle periodic memcg event.
- */
-enum mem_cgroup_events_target {
- MEM_CGROUP_TARGET_THRESH,
- MEM_CGROUP_TARGET_SOFTLIMIT,
- MEM_CGROUP_TARGET_NUMAINFO,
- MEM_CGROUP_NTARGETS,
-};
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024
#define NUMAINFO_EVENTS_TARGET 1024

-struct mem_cgroup_stat_cpu {
- long count[MEM_CGROUP_STAT_NSTATS];
- unsigned long events[MEMCG_NR_EVENTS];
- unsigned long nr_page_events;
- unsigned long targets[MEM_CGROUP_NTARGETS];
-};
-
-struct reclaim_iter {
- struct mem_cgroup *position;
- /* scan generation, increased every round-trip */
- unsigned int generation;
-};
-
-/*
- * per-zone information in memory controller.
- */
-struct mem_cgroup_per_zone {
- struct lruvec lruvec;
- unsigned long lru_size[NR_LRU_LISTS];
-
- struct reclaim_iter iter[DEF_PRIORITY + 1];
-
- struct rb_node tree_node; /* RB tree node */
- unsigned long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
- struct mem_cgroup *memcg; /* Back pointer, we cannot */
- /* use container_of */
-};
-
-struct mem_cgroup_per_node {
- struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
-};
-
/*
* Cgroups above their limits are maintained in a RB-Tree, independent of
* their hierarchy representation
@@ -181,32 +135,6 @@ struct mem_cgroup_tree {

static struct mem_cgroup_tree soft_limit_tree __read_mostly;

-struct mem_cgroup_threshold {
- struct eventfd_ctx *eventfd;
- unsigned long threshold;
-};
-
-/* For threshold */
-struct mem_cgroup_threshold_ary {
- /* An array index points to threshold just below or equal to usage. */
- int current_threshold;
- /* Size of entries[] */
- unsigned int size;
- /* Array of thresholds */
- struct mem_cgroup_threshold entries[0];
-};
-
-struct mem_cgroup_thresholds {
- /* Primary thresholds array */
- struct mem_cgroup_threshold_ary *primary;
- /*
- * Spare threshold array.
- * This is needed to make mem_cgroup_unregister_event() "never fail".
- * It must be able to store at least primary->size - 1 entries.
- */
- struct mem_cgroup_threshold_ary *spare;
-};
-
/* for OOM */
struct mem_cgroup_eventfd_list {
struct list_head list;
@@ -256,106 +184,6 @@ struct mem_cgroup_event {
static void mem_cgroup_threshold(struct mem_cgroup *memcg);
static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);

-/*
- * The memory controller data structure. The memory controller controls both
- * page cache and RSS per cgroup. We would eventually like to provide
- * statistics based on the statistics developed by Rik Van Riel for clock-pro,
- * to help the administrator determine what knobs to tune.
- */
-struct mem_cgroup {
- struct cgroup_subsys_state css;
-
- /* Accounted resources */
- struct page_counter memory;
- struct page_counter memsw;
- struct page_counter kmem;
-
- /* Normal memory consumption range */
- unsigned long low;
- unsigned long high;
-
- unsigned long soft_limit;
-
- /* vmpressure notifications */
- struct vmpressure vmpressure;
-
- /* css_online() has been completed */
- int initialized;
-
- /*
- * Should the accounting and control be hierarchical, per subtree?
- */
- bool use_hierarchy;
-
- /* protected by memcg_oom_lock */
- bool oom_lock;
- int under_oom;
-
- int swappiness;
- /* OOM-Killer disable */
- int oom_kill_disable;
-
- /* protect arrays of thresholds */
- struct mutex thresholds_lock;
-
- /* thresholds for memory usage. RCU-protected */
- struct mem_cgroup_thresholds thresholds;
-
- /* thresholds for mem+swap usage. RCU-protected */
- struct mem_cgroup_thresholds memsw_thresholds;
-
- /* For oom notifier event fd */
- struct list_head oom_notify;
-
- /*
- * Should we move charges of a task when a task is moved into this
- * mem_cgroup ? And what type of charges should we move ?
- */
- unsigned long move_charge_at_immigrate;
- /*
- * set > 0 if pages under this cgroup are moving to other cgroup.
- */
- atomic_t moving_account;
- /* taken only while moving_account > 0 */
- spinlock_t move_lock;
- struct task_struct *move_lock_task;
- unsigned long move_lock_flags;
- /*
- * percpu counter.
- */
- struct mem_cgroup_stat_cpu __percpu *stat;
- spinlock_t pcp_counter_lock;
-
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
- struct cg_proto tcp_mem;
-#endif
-#if defined(CONFIG_MEMCG_KMEM)
- /* Index in the kmem_cache->memcg_params.memcg_caches array */
- int kmemcg_id;
- bool kmem_acct_activated;
- bool kmem_acct_active;
-#endif
-
- int last_scanned_node;
-#if MAX_NUMNODES > 1
- nodemask_t scan_nodes;
- atomic_t numainfo_events;
- atomic_t numainfo_updating;
-#endif
-
-#ifdef CONFIG_CGROUP_WRITEBACK
- struct list_head cgwb_list;
- struct wb_domain cgwb_domain;
-#endif
-
- /* List of events which userspace want to receive */
- struct list_head event_list;
- spinlock_t event_list_lock;
-
- struct mem_cgroup_per_node *nodeinfo[0];
- /* WARNING: nodeinfo must be the last member here */
-};
-
#ifdef CONFIG_MEMCG_KMEM
bool memcg_kmem_is_active(struct mem_cgroup *memcg)
{
@@ -423,11 +251,6 @@ enum res_type {
*/
static DEFINE_MUTEX(memcg_create_mutex);

-struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *s)
-{
- return s ? container_of(s, struct mem_cgroup, css) : NULL;
-}
-
/* Some nice accessors for the vmpressure. */
struct vmpressure *memcg_to_vmpressure(struct mem_cgroup *memcg)
{
@@ -593,11 +416,6 @@ mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
return &memcg->nodeinfo[nid]->zoneinfo[zid];
}

-struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
-{
- return &memcg->css;
-}
-
/**
* mem_cgroup_css_from_page - css of the memcg associated with a page
* @page: page of interest
@@ -876,14 +694,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
}

-unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list lru)
-{
- struct mem_cgroup_per_zone *mz;
-
- mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
- return mz->lru_size[lru];
-}
-
static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid,
unsigned int lru_mask)
@@ -1173,30 +983,6 @@ void mem_cgroup_iter_break(struct mem_cgroup *root,
iter != NULL; \
iter = mem_cgroup_iter(NULL, iter, NULL))

-void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
-{
- struct mem_cgroup *memcg;
-
- rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!memcg))
- goto out;
-
- switch (idx) {
- case PGFAULT:
- this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
- break;
- case PGMAJFAULT:
- this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
- break;
- default:
- BUG();
- }
-out:
- rcu_read_unlock();
-}
-EXPORT_SYMBOL(__mem_cgroup_count_vm_event);
-
/**
* mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
* @zone: zone of the wanted lruvec
@@ -1295,15 +1081,6 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
VM_BUG_ON((long)(*lru_size) < 0);
}

-bool mem_cgroup_is_descendant(struct mem_cgroup *memcg, struct mem_cgroup *root)
-{
- if (root == memcg)
- return true;
- if (!root->use_hierarchy)
- return false;
- return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
-}
-
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
{
struct mem_cgroup *task_memcg;
@@ -1330,39 +1107,6 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
return ret;
}

-int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
-{
- unsigned long inactive_ratio;
- unsigned long inactive;
- unsigned long active;
- unsigned long gb;
-
- inactive = mem_cgroup_get_lru_size(lruvec, LRU_INACTIVE_ANON);
- active = mem_cgroup_get_lru_size(lruvec, LRU_ACTIVE_ANON);
-
- gb = (inactive + active) >> (30 - PAGE_SHIFT);
- if (gb)
- inactive_ratio = int_sqrt(10 * gb);
- else
- inactive_ratio = 1;
-
- return inactive * inactive_ratio < active;
-}
-
-bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
-{
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup *memcg;
-
- if (mem_cgroup_disabled())
- return true;
-
- mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
- memcg = mz->memcg;
-
- return !!(memcg->css.flags & CSS_ONLINE);
-}
-
#define mem_cgroup_from_counter(counter, member) \
container_of(counter, struct mem_cgroup, member)

@@ -1394,15 +1138,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
return margin;
}

-int mem_cgroup_swappiness(struct mem_cgroup *memcg)
-{
- /* root ? */
- if (mem_cgroup_disabled() || !memcg->css.parent)
- return vm_swappiness;
-
- return memcg->swappiness;
-}
-
/*
* A routine for checking "mem" is under move_account() or not.
*
@@ -2062,23 +1797,6 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(mem_cgroup_end_page_stat);

-/**
- * mem_cgroup_update_page_stat - update page state statistics
- * @memcg: memcg to account against
- * @idx: page state item to account
- * @val: number of pages (positive or negative)
- *
- * See mem_cgroup_begin_page_stat() for locking requirements.
- */
-void mem_cgroup_update_page_stat(struct mem_cgroup *memcg,
- enum mem_cgroup_stat_index idx, int val)
-{
- VM_BUG_ON(!rcu_read_lock_held());
-
- if (memcg)
- this_cpu_add(memcg->stat->count[idx], val);
-}
-
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
@@ -2504,16 +2222,6 @@ void memcg_uncharge_kmem(struct mem_cgroup *memcg, unsigned long nr_pages)
css_put_many(&memcg->css, nr_pages);
}

-/*
- * helper for acessing a memcg's index. It will be used as an index in the
- * child cache array in kmem_cache, and also to derive its name. This function
- * will return -1 when this is not a kmem-limited memcg.
- */
-int memcg_cache_id(struct mem_cgroup *memcg)
-{
- return memcg ? memcg->kmemcg_id : -1;
-}
-
static int memcg_alloc_cache_id(void)
{
int id, size;
@@ -5521,19 +5229,6 @@ struct cgroup_subsys memory_cgrp_subsys = {
};

/**
- * mem_cgroup_events - count memory events against a cgroup
- * @memcg: the memory cgroup
- * @idx: the event index
- * @nr: the number of events to account for
- */
-void mem_cgroup_events(struct mem_cgroup *memcg,
- enum mem_cgroup_events_index idx,
- unsigned int nr)
-{
- this_cpu_add(memcg->stat->events[idx], nr);
-}
-
-/**
* mem_cgroup_low - check if memory consumption is below the normal range
* @root: the highest ancestor to consider
* @memcg: the memory cgroup to check
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1cf7f2988422..ef33ccf37224 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -146,7 +146,7 @@ static int hwpoison_filter_task(struct page *p)
if (!mem)
return -EINVAL;

- css = mem_cgroup_css(mem);
+ css = &mem->css;
ino = cgroup_ino(css->cgroup);
css_put(css);

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 8873985f905e..ec3f68b14b87 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -501,7 +501,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
struct kmem_cache *root_cache)
{
static char memcg_name_buf[NAME_MAX + 1]; /* protected by slab_mutex */
- struct cgroup_subsys_state *css = mem_cgroup_css(memcg);
+ struct cgroup_subsys_state *css = &memcg->css;
struct memcg_cache_array *arr;
struct kmem_cache *s = NULL;
char *cache_name;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c8d82827279a..903ca5f53339 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -175,7 +175,7 @@ static bool sane_reclaim(struct scan_control *sc)
if (!memcg)
return true;
#ifdef CONFIG_CGROUP_WRITEBACK
- if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup))
+ if (cgroup_on_dfl(memcg->css.cgroup))
return true;
#endif
return false;
--
2.1.4

2015-07-08 12:28:23

by Michal Hocko

Subject: [PATCH 2/8] memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG

The only user is cgwb_bdi_init and that one depends on
CONFIG_CGROUP_WRITEBACK, which in turn depends on CONFIG_MEMCG,
so it doesn't make much sense to define an empty stub for
!CONFIG_MEMCG. Moreover, ERR_PTR(-EINVAL) is ugly and would lead
to runtime crashes if used in unguarded code paths. It is better to
fail during compilation.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f5a8d0bbef8d..680cefec8c2a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -496,8 +496,6 @@ void mem_cgroup_split_huge_fixup(struct page *head);
#else /* CONFIG_MEMCG */
struct mem_cgroup;

-#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
-
static inline void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr)
--
2.1.4

2015-07-08 12:28:06

by Michal Hocko

Subject: [PATCH 3/8] memcg: get rid of extern for functions in memcontrol.h

From: Michal Hocko <[email protected]>

Most of the exported functions in this header are not marked extern, so
change the rest to follow the same style.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 680cefec8c2a..8818eee95f93 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -304,10 +304,10 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

-extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

-extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -342,7 +342,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}

-extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);

static inline bool mem_cgroup_disabled(void)
{
@@ -401,8 +401,8 @@ static inline int mem_cgroup_inactive_anon_is_low(struct lruvec *lruvec)
return inactive * inactive_ratio < active;
}

-extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
- struct task_struct *p);
+void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
+ struct task_struct *p);

static inline void mem_cgroup_oom_enable(void)
{
@@ -717,8 +717,8 @@ static inline void sock_release_memcg(struct sock *sk)
extern struct static_key memcg_kmem_enabled_key;

extern int memcg_nr_cache_ids;
-extern void memcg_get_cache_ids(void);
-extern void memcg_put_cache_ids(void);
+void memcg_get_cache_ids(void);
+void memcg_put_cache_ids(void);

/*
* Helper macro to loop through all memcg-specific caches. Callers must still
--
2.1.4

2015-07-08 12:28:28

by Michal Hocko

Subject: [PATCH 4/8] memcg, mm: move mem_cgroup_select_victim_node into vmscan

From: Michal Hocko <[email protected]>

We currently have only one caller of mem_cgroup_select_victim_node, which
sits in mm/vmscan.c and is already wrapped by a CONFIG_MEMCG ifdef. Now
that struct mem_cgroup is visible outside of mm/memcontrol.c, we can move
the function and its dependencies there.
This even shrinks the code size by a few bytes:

text data bss dec hex filename
478509 65806 26384 570699 8b54b mm/built-in.o.before
478445 65806 26384 570635 8b50b mm/built-in.o.after

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 6 ++-
mm/memcontrol.c | 99 +---------------------------------------------
mm/vmscan.c | 96 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 101 insertions(+), 100 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8818eee95f93..78e9d4ac57a1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -354,11 +354,13 @@ static inline bool mem_cgroup_disabled(void)
/*
* For memory reclaim.
*/
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg);
-
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
int nr_pages);

+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+ int nid,
+ unsigned int lru_mask);
+
static inline bool mem_cgroup_lruvec_online(struct lruvec *lruvec)
{
struct mem_cgroup_per_zone *mz;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 759ec413e72c..4f76ee67023b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -694,7 +694,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__this_cpu_add(memcg->stat->nr_page_events, nr_pages);
}

-static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid,
unsigned int lru_mask)
{
@@ -1352,103 +1352,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
mutex_unlock(&oom_lock);
}

-#if MAX_NUMNODES > 1
-
-/**
- * test_mem_cgroup_node_reclaimable
- * @memcg: the target memcg
- * @nid: the node ID to be checked.
- * @noswap : specify true here if the user wants flle only information.
- *
- * This function returns whether the specified memcg contains any
- * reclaimable pages on a node. Returns true if there are any reclaimable
- * pages in the node.
- */
-static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
- int nid, bool noswap)
-{
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
- return true;
- if (noswap || !total_swap_pages)
- return false;
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
- return true;
- return false;
-
-}
-
-/*
- * Always updating the nodemask is not very good - even if we have an empty
- * list or the wrong list here, we can start from some node and traverse all
- * nodes based on the zonelist. So update the list loosely once per 10 secs.
- *
- */
-static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
-{
- int nid;
- /*
- * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
- * pagein/pageout changes since the last update.
- */
- if (!atomic_read(&memcg->numainfo_events))
- return;
- if (atomic_inc_return(&memcg->numainfo_updating) > 1)
- return;
-
- /* make a nodemask where this memcg uses memory from */
- memcg->scan_nodes = node_states[N_MEMORY];
-
- for_each_node_mask(nid, node_states[N_MEMORY]) {
-
- if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
- node_clear(nid, memcg->scan_nodes);
- }
-
- atomic_set(&memcg->numainfo_events, 0);
- atomic_set(&memcg->numainfo_updating, 0);
-}
-
-/*
- * Selecting a node where we start reclaim from. Because what we need is just
- * reducing usage counter, start from anywhere is O,K. Considering
- * memory reclaim from current node, there are pros. and cons.
- *
- * Freeing memory from current node means freeing memory from a node which
- * we'll use or we've used. So, it may make LRU bad. And if several threads
- * hit limits, it will see a contention on a node. But freeing from remote
- * node means more costs for memory reclaim because of memory latency.
- *
- * Now, we use round-robin. Better algorithm is welcomed.
- */
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
-{
- int node;
-
- mem_cgroup_may_update_nodemask(memcg);
- node = memcg->last_scanned_node;
-
- node = next_node(node, memcg->scan_nodes);
- if (node == MAX_NUMNODES)
- node = first_node(memcg->scan_nodes);
- /*
- * We call this when we hit limit, not when pages are added to LRU.
- * No LRU may hold pages because all pages are UNEVICTABLE or
- * memcg is too small and all pages are not on LRU. In that case,
- * we use curret node.
- */
- if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
-
- memcg->last_scanned_node = node;
- return node;
-}
-#else
-int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
-{
- return 0;
-}
-#endif
-
static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
struct zone *zone,
gfp_t gfp_mask,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 903ca5f53339..27eb9343888a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,6 +2923,102 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
return sc.nr_reclaimed;
}

+#if MAX_NUMNODES > 1
+
+/**
+ * test_mem_cgroup_node_reclaimable
+ * @memcg: the target memcg
+ * @nid: the node ID to be checked.
+ * @noswap : specify true here if the user wants flle only information.
+ *
+ * This function returns whether the specified memcg contains any
+ * reclaimable pages on a node. Returns true if there are any reclaimable
+ * pages in the node.
+ */
+static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
+ int nid, bool noswap)
+{
+ if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
+ return true;
+ if (noswap || !total_swap_pages)
+ return false;
+ if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
+ return true;
+ return false;
+
+}
+/*
+ * Always updating the nodemask is not very good - even if we have an empty
+ * list or the wrong list here, we can start from some node and traverse all
+ * nodes based on the zonelist. So update the list loosely once per 10 secs.
+ *
+ */
+static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
+{
+ int nid;
+ /*
+ * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
+ * pagein/pageout changes since the last update.
+ */
+ if (!atomic_read(&memcg->numainfo_events))
+ return;
+ if (atomic_inc_return(&memcg->numainfo_updating) > 1)
+ return;
+
+ /* make a nodemask where this memcg uses memory from */
+ memcg->scan_nodes = node_states[N_MEMORY];
+
+ for_each_node_mask(nid, node_states[N_MEMORY]) {
+
+ if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
+ node_clear(nid, memcg->scan_nodes);
+ }
+
+ atomic_set(&memcg->numainfo_events, 0);
+ atomic_set(&memcg->numainfo_updating, 0);
+}
+
+/*
+ * Selecting a node where we start reclaim from. Because what we need is just
+ * reducing usage counter, start from anywhere is O,K. Considering
+ * memory reclaim from current node, there are pros. and cons.
+ *
+ * Freeing memory from current node means freeing memory from a node which
+ * we'll use or we've used. So, it may make LRU bad. And if several threads
+ * hit limits, it will see a contention on a node. But freeing from remote
+ * node means more costs for memory reclaim because of memory latency.
+ *
+ * Now, we use round-robin. Better algorithm is welcomed.
+ */
+static int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
+{
+ int node;
+
+ mem_cgroup_may_update_nodemask(memcg);
+ node = memcg->last_scanned_node;
+
+ node = next_node(node, memcg->scan_nodes);
+ if (node == MAX_NUMNODES)
+ node = first_node(memcg->scan_nodes);
+ /*
+ * We call this when we hit limit, not when pages are added to LRU.
+ * No LRU may hold pages because all pages are UNEVICTABLE or
+ * memcg is too small and all pages are not on LRU. In that case,
+ * we use curret node.
+ */
+ if (unlikely(node == MAX_NUMNODES))
+ node = numa_node_id();
+
+ memcg->last_scanned_node = node;
+ return node;
+}
+#else
+static int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+#endif
+
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
--
2.1.4

2015-07-08 12:28:44

by Michal Hocko

Subject: [PATCH 5/8] memcg: restructure mem_cgroup_can_attach()

From: Tejun Heo <[email protected]>

Restructure it to lower nesting level and help the planned threadgroup
leader iteration changes.

This is pure reorganization.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 61 ++++++++++++++++++++++++++++++---------------------------
1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4f76ee67023b..559e3b2926e8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4738,10 +4738,12 @@ static void mem_cgroup_clear_mc(void)
static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
struct cgroup_taskset *tset)
{
- struct task_struct *p = cgroup_taskset_first(tset);
- int ret = 0;
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct mem_cgroup *from;
+ struct task_struct *p;
+ struct mm_struct *mm;
unsigned long move_flags;
+ int ret = 0;

/*
* We are now commited to this value whatever it is. Changes in this
@@ -4749,36 +4751,37 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
* So we need to save it, and keep it going.
*/
move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
- if (move_flags) {
- struct mm_struct *mm;
- struct mem_cgroup *from = mem_cgroup_from_task(p);
+ if (!move_flags)
+ return 0;

- VM_BUG_ON(from == memcg);
+ p = cgroup_taskset_first(tset);
+ from = mem_cgroup_from_task(p);

- mm = get_task_mm(p);
- if (!mm)
- return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
- VM_BUG_ON(mc.from);
- VM_BUG_ON(mc.to);
- VM_BUG_ON(mc.precharge);
- VM_BUG_ON(mc.moved_charge);
- VM_BUG_ON(mc.moved_swap);
-
- spin_lock(&mc.lock);
- mc.from = from;
- mc.to = memcg;
- mc.flags = move_flags;
- spin_unlock(&mc.lock);
- /* We set mc.moving_task later */
-
- ret = mem_cgroup_precharge_mc(mm);
- if (ret)
- mem_cgroup_clear_mc();
- }
- mmput(mm);
+ VM_BUG_ON(from == memcg);
+
+ mm = get_task_mm(p);
+ if (!mm)
+ return 0;
+ /* We move charges only when we move a owner of the mm */
+ if (mm->owner == p) {
+ VM_BUG_ON(mc.from);
+ VM_BUG_ON(mc.to);
+ VM_BUG_ON(mc.precharge);
+ VM_BUG_ON(mc.moved_charge);
+ VM_BUG_ON(mc.moved_swap);
+
+ spin_lock(&mc.lock);
+ mc.from = from;
+ mc.to = memcg;
+ mc.flags = move_flags;
+ spin_unlock(&mc.lock);
+ /* We set mc.moving_task later */
+
+ ret = mem_cgroup_precharge_mc(mm);
+ if (ret)
+ mem_cgroup_clear_mc();
}
+ mmput(mm);
return ret;
}

--
2.1.4

2015-07-08 12:28:36

by Michal Hocko

Subject: [PATCH 6/8] memcg, tcp_kmem: check for cg_proto in sock_update_memcg

From: Michal Hocko <[email protected]>

sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg
doesn't check for NULL. The function gets away with that only because of
the mem_cgroup_is_root check: mem_cgroup_from_task always returns a
non-NULL memcg, so NULL can only come from the root memcg.

All other callers check for NULL and we can safely replace the
mem_cgroup_is_root() check by cg_proto != NULL, which is more
straightforward (proto_cgroup already returns NULL for the root memcg).

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 559e3b2926e8..19ffae804076 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -322,8 +322,7 @@ void sock_update_memcg(struct sock *sk)
rcu_read_lock();
memcg = mem_cgroup_from_task(current);
cg_proto = sk->sk_prot->proto_cgroup(memcg);
- if (!mem_cgroup_is_root(memcg) &&
- memcg_proto_active(cg_proto) &&
+ if (cg_proto && memcg_proto_active(cg_proto) &&
css_tryget_online(&memcg->css)) {
sk->sk_cgrp = cg_proto;
}
--
2.1.4

2015-07-08 12:28:34

by Michal Hocko

Subject: [PATCH 7/8] memcg: get rid of mm_struct::owner

From: Michal Hocko <[email protected]>

mm_struct::owner keeps track of the task which is in charge of the
specific mm. This is usually the thread group leader of the process but
there are exotic cases where this doesn't hold.

The most prominent one is when separate tasks (not in the same thread
group) share the address space (by using clone with CLONE_VM without
CLONE_THREAD). The first task will be the owner until it exits.
mm_update_next_owner will then try to find a new owner - a task which
points to the same mm_struct. There is no guarantee a new owner will
be a thread group leader though, because the leader for that thread
group might have exited. Even though such a leader will still be around
waiting for the remaining threads from its group, its mm will be NULL
so it cannot be chosen.

The cgroup migration code, however, assumes only group leaders when
migrating via cgroup.procs (which will be the only mode in the unified
hierarchy API) while mem_cgroup_can_attach considers only those tasks
which are owners of the mm. So we might end up with tasks which cannot
be migrated. mm_update_next_owner could be tweaked to try harder and use
a group leader whenever possible, but this will never be 100% reliable
because all the leaders might be dead. It seems that getting rid of
mm->owner is a better and less hacky option.

The whole concept of the mm owner is a bit artificial and too tricky to
get right. All the memcg code needs is to find the struct mem_cgroup for
a given mm_struct, and there are only two events when the association
is either established or changed:
- a new mm is created - dup_mm resp. exec_mmap - and the memcg
is inherited from the oldmm
- the task associated with the mm is moved to another memcg
So it is much easier to bind the mm_struct to the mem_cgroup directly
rather than indirectly via a task. This is exactly what this patch does.

mm_inherit_memcg and mm_drop_memcg are exported for the core kernel
to bind the old memcg during dup_mm (fork) resp. exec_mmap (exec) and
to release that memcg in mmput after the last reference is dropped and
no task sees the mm anymore. We have to be careful and take a reference
to the memcg->css so that it doesn't vanish from under our feet.
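
To illustrate, here is a condensed sketch of the call sites this patch adds
(the real hunks are in the diff below):

	/* kernel/fork.c: dup_mm() - the new mm starts in the parent's memcg */
	mm_inherit_memcg(mm, oldmm);

	/* fs/exec.c: exec_mmap() - the new mm inherits from the old one */
	mm_inherit_memcg(mm, old_mm);

	/* kernel/fork.c: mmput() - the last user is gone, drop the css reference */
	mm_drop_memcg(mm);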

The only remaining part is to catch task migration and change the
association. This is done in mem_cgroup_move_task before charges get
moved, because mem_cgroup_can_attach is too early: other controllers
might still fail and we would have to handle the rollback.

mm->memcg conforms to standard mem_cgroup locking rules. It has to be
used inside rcu_read_{un}lock() and a reference has to be taken before the
unlock if the memcg is supposed to be used outside.
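
For illustration, a reader outside of the charge paths would do something
along these lines (a minimal sketch, not part of the patch itself;
css_tryget_online/css_put is the usual way to pin and release a css):

	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = rcu_dereference(mm->memcg);
	/* pin the memcg if it is going to be used outside of the RCU section */
	if (memcg && !css_tryget_online(&memcg->css))
		memcg = NULL;
	rcu_read_unlock();

	/* ... use memcg ... */

	if (memcg)
		css_put(&memcg->css);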

Finally mem_cgroup_can_attach will allow task migration only for the
thread group leaders to conform with cgroup core requirements.

Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
Without mm->owner _all_ tasks (group leaders to be precise) associated
with the mm_struct will initiate memcg migration, while previously
only the owner of the mm_struct could do that. The original behavior was
awkward though, because the user task didn't have any means to find out
the current owner (esp. after mm_update_next_owner) so the migration
behavior was not well defined in general.
The new cgroup API (unified hierarchy) will discontinue the tasks cgroup
file, which means that migrating individual threads will no longer be
possible. In such a case having CLONE_VM without CLONE_THREAD could
emulate the thread behavior, but this patch prevents isolating the memcg
controller from the others in that setup. Nevertheless I am not convinced
such a use case would really deserve complications on the memcg code side.

Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/exec.c | 2 +-
include/linux/memcontrol.h | 58 ++++++++++++++++++++++++--
include/linux/mm_types.h | 12 +-----
kernel/exit.c | 89 ---------------------------------------
kernel/fork.c | 10 +----
mm/debug.c | 4 +-
mm/memcontrol.c | 101 ++++++++++++++++++++++++++++-----------------
7 files changed, 123 insertions(+), 153 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1977c2a553ac..3ed9c0abc9f5 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -870,7 +870,7 @@ static int exec_mmap(struct mm_struct *mm)
up_read(&old_mm->mmap_sem);
BUG_ON(active_mm != old_mm);
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
- mm_update_next_owner(old_mm);
+ mm_inherit_memcg(mm, old_mm);
mmput(old_mm);
return 0;
}
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 78e9d4ac57a1..8e6b2444ebfe 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -274,6 +274,52 @@ struct mem_cgroup {
extern struct cgroup_subsys_state *mem_cgroup_root_css;

/**
+ * __mm_set_memcg - Set mm_struct::memcg to a given memcg.
+ * @mm: mm struct
+ * @memcg: mem_cgroup to be used
+ *
+ * Note that this function doesn't clean up the previous mm->memcg.
+ * This should be done by caller when necessary (e.g. when moving
+ * mm from one memcg to another).
+ */
+static inline
+void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
+{
+ if (memcg)
+ css_get(&memcg->css);
+ rcu_assign_pointer(mm->memcg, memcg);
+}
+
+/**
+ * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
+ * @newmm: new mm struct
+ * @oldmm: old mm struct to inherit from
+ *
+ * Should be called for each new mm_struct.
+ */
+static inline
+void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
+{
+ struct mem_cgroup *memcg = oldmm->memcg;
+
+ __mm_set_memcg(newmm, memcg);
+}
+
+/**
+ * mm_drop_memcg - drop mm_struct::memcg association
+ * @mm: mm struct
+ *
+ * Should be called after the mm has been removed from all tasks
+ * and before it is freed (e.g. from mmput)
+ */
+static inline void mm_drop_memcg(struct mm_struct *mm)
+{
+ if (mm->memcg)
+ css_put(&mm->memcg->css);
+ mm->memcg = NULL;
+}
+
+/**
* mem_cgroup_events - count memory events against a cgroup
* @memcg: the memory cgroup
* @idx: the event index
@@ -305,7 +351,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);

struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);

struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
static inline
@@ -335,7 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
bool match = false;

rcu_read_lock();
- task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ task_memcg = rcu_dereference(mm->memcg);
if (task_memcg)
match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
@@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
return;

rcu_read_lock();
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
goto out;

@@ -498,6 +543,13 @@ void mem_cgroup_split_huge_fixup(struct page *head);
#else /* CONFIG_MEMCG */
struct mem_cgroup;

+static inline void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
+{
+}
+static inline void mm_drop_memcg(struct mm_struct *mm)
+{
+}
+
static inline void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f6266742ce1f..93dc8cb9c636 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -426,17 +426,7 @@ struct mm_struct {
struct kioctx_table __rcu *ioctx_table;
#endif
#ifdef CONFIG_MEMCG
- /*
- * "owner" points to a task that is regarded as the canonical
- * user/owner of this mm. All of the following must be true in
- * order for it to be changed:
- *
- * current == mm->owner
- * current->mm != mm
- * new_owner->mm == mm
- * new_owner->alloc_lock is held
- */
- struct task_struct __rcu *owner;
+ struct mem_cgroup __rcu *memcg;
#endif

/* store ref to file /proc/<pid>/exe symlink points to */
diff --git a/kernel/exit.c b/kernel/exit.c
index 185752a729f6..339554612677 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
}
}

-#ifdef CONFIG_MEMCG
-/*
- * A task is exiting. If it owned this mm, find a new owner for the mm.
- */
-void mm_update_next_owner(struct mm_struct *mm)
-{
- struct task_struct *c, *g, *p = current;
-
-retry:
- /*
- * If the exiting or execing task is not the owner, it's
- * someone else's problem.
- */
- if (mm->owner != p)
- return;
- /*
- * The current owner is exiting/execing and there are no other
- * candidates. Do not leave the mm pointing to a possibly
- * freed task structure.
- */
- if (atomic_read(&mm->mm_users) <= 1) {
- mm->owner = NULL;
- return;
- }
-
- read_lock(&tasklist_lock);
- /*
- * Search in the children
- */
- list_for_each_entry(c, &p->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search in the siblings
- */
- list_for_each_entry(c, &p->real_parent->children, sibling) {
- if (c->mm == mm)
- goto assign_new_owner;
- }
-
- /*
- * Search through everything else, we should not get here often.
- */
- for_each_process(g) {
- if (g->flags & PF_KTHREAD)
- continue;
- for_each_thread(g, c) {
- if (c->mm == mm)
- goto assign_new_owner;
- if (c->mm)
- break;
- }
- }
- read_unlock(&tasklist_lock);
- /*
- * We found no owner yet mm_users > 1: this implies that we are
- * most likely racing with swapoff (try_to_unuse()) or /proc or
- * ptrace or page migration (get_task_mm()). Mark owner as NULL.
- */
- mm->owner = NULL;
- return;
-
-assign_new_owner:
- BUG_ON(c == p);
- get_task_struct(c);
- /*
- * The task_lock protects c->mm from changing.
- * We always want mm->owner->mm == mm
- */
- task_lock(c);
- /*
- * Delay read_unlock() till we have the task_lock()
- * to ensure that c does not slip away underneath us
- */
- read_unlock(&tasklist_lock);
- if (c->mm != mm) {
- task_unlock(c);
- put_task_struct(c);
- goto retry;
- }
- mm->owner = c;
- task_unlock(c);
- put_task_struct(c);
-}
-#endif /* CONFIG_MEMCG */
-
/*
* Turn us into a lazy TLB process if we
* aren't already..
@@ -433,7 +345,6 @@ static void exit_mm(struct task_struct *tsk)
up_read(&mm->mmap_sem);
enter_lazy_tlb(mm, current);
task_unlock(tsk);
- mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
exit_oom_victim();
diff --git a/kernel/fork.c b/kernel/fork.c
index 16e0f872f084..d073b6249d98 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -570,13 +570,6 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}

-static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
-{
-#ifdef CONFIG_MEMCG
- mm->owner = p;
-#endif
-}
-
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
mm->mmap = NULL;
@@ -596,7 +589,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
- mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
clear_tlb_flush_pending(mm);
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
@@ -702,6 +694,7 @@ void mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ mm_drop_memcg(mm);
mmdrop(mm);
}
}
@@ -926,6 +919,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
if (mm->binfmt && !try_module_get(mm->binfmt->module))
goto free_pt;

+ mm_inherit_memcg(mm, oldmm);
return mm;

free_pt:
diff --git a/mm/debug.c b/mm/debug.c
index 3eb3ac2fcee7..d0347a168651 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -184,7 +184,7 @@ void dump_mm(const struct mm_struct *mm)
"ioctx_table %p\n"
#endif
#ifdef CONFIG_MEMCG
- "owner %p "
+ "memcg %p "
#endif
"exe_file %p\n"
#ifdef CONFIG_MMU_NOTIFIER
@@ -218,7 +218,7 @@ void dump_mm(const struct mm_struct *mm)
mm->ioctx_table,
#endif
#ifdef CONFIG_MEMCG
- mm->owner,
+ mm->memcg,
#endif
mm->exe_file,
#ifdef CONFIG_MMU_NOTIFIER
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 19ffae804076..4069ec8f52be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,6 +294,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
return mem_cgroup_from_css(css);
}

+static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
+{
+ if (p->mm)
+ return rcu_dereference(p->mm->memcg);
+
+ /*
+ * If the process doesn't have mm struct anymore we have to fallback
+ * to the task_css.
+ */
+ return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
+}
+
/* Writing them here to avoid exposing memcg's inner layout */
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)

@@ -783,19 +795,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
}
}

-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- /*
- * mm_update_next_owner() may clear mm->owner to NULL
- * if it races with swapoff, page migration, etc.
- * So this can be called with p == NULL.
- */
- if (unlikely(!p))
- return NULL;
-
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-
static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
struct mem_cgroup *memcg = NULL;
@@ -810,7 +809,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ memcg = rcu_dereference(mm->memcg);
if (unlikely(!memcg))
memcg = root_mem_cgroup;
}
@@ -2286,7 +2285,7 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
}

/*
- * We need to verify if the allocation against current->mm->owner's memcg is
+ * We need to verify if the allocation against current->mm->memcg is
* possible for the given order. But the page is not allocated yet, so we'll
* need a further commit step to do the final arrangements.
*
@@ -4737,7 +4736,7 @@ static void mem_cgroup_clear_mc(void)
static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
struct cgroup_taskset *tset)
{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct mem_cgroup *to = mem_cgroup_from_css(css);
struct mem_cgroup *from;
struct task_struct *p;
struct mm_struct *mm;
@@ -4749,37 +4748,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
* tunable will only affect upcoming migrations, not the current one.
* So we need to save it, and keep it going.
*/
- move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+ move_flags = READ_ONCE(to->move_charge_at_immigrate);
if (!move_flags)
return 0;

p = cgroup_taskset_first(tset);
- from = mem_cgroup_from_task(p);
-
- VM_BUG_ON(from == memcg);
+ if (!thread_group_leader(p))
+ return 0;

mm = get_task_mm(p);
if (!mm)
return 0;
- /* We move charges only when we move a owner of the mm */
- if (mm->owner == p) {
- VM_BUG_ON(mc.from);
- VM_BUG_ON(mc.to);
- VM_BUG_ON(mc.precharge);
- VM_BUG_ON(mc.moved_charge);
- VM_BUG_ON(mc.moved_swap);
-
- spin_lock(&mc.lock);
- mc.from = from;
- mc.to = memcg;
- mc.flags = move_flags;
- spin_unlock(&mc.lock);
- /* We set mc.moving_task later */
-
- ret = mem_cgroup_precharge_mc(mm);
- if (ret)
- mem_cgroup_clear_mc();
- }
+
+ /*
+ * tasks' cgroup might be different from the one p->mm is associated
+ * with because CLONE_VM is allowed without CLONE_THREAD. The task is
+ * moving so we have to migrate from the memcg associated with its
+ * address space.
+ * No need to take a reference here because the memcg is pinned by the
+ * mm_struct.
+ */
+ from = READ_ONCE(mm->memcg);
+ if (!from)
+ from = root_mem_cgroup;
+ if (from == to)
+ goto out;
+
+ VM_BUG_ON(mc.from);
+ VM_BUG_ON(mc.to);
+ VM_BUG_ON(mc.precharge);
+ VM_BUG_ON(mc.moved_charge);
+ VM_BUG_ON(mc.moved_swap);
+
+ spin_lock(&mc.lock);
+ mc.from = from;
+ mc.to = to;
+ mc.flags = move_flags;
+ spin_unlock(&mc.lock);
+ /* We set mc.moving_task later */
+
+ ret = mem_cgroup_precharge_mc(mm);
+ if (ret)
+ mem_cgroup_clear_mc();
+out:
mmput(mm);
return ret;
}
@@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
{
struct task_struct *p = cgroup_taskset_first(tset);
struct mm_struct *mm = get_task_mm(p);
+ struct mem_cgroup *old_memcg = NULL;

if (mm) {
+ old_memcg = READ_ONCE(mm->memcg);
+ __mm_set_memcg(mm, mem_cgroup_from_css(css));
+
if (mc.to)
mem_cgroup_move_charge(mm);
mmput(mm);
}
if (mc.to)
mem_cgroup_clear_mc();
+
+ /*
+ * Be careful and drop the reference only after we are done because
+ * p's task_css memcg might be different from p->memcg and nothing else
+ * might be pinning the old memcg.
+ */
+ if (old_memcg)
+ css_put(&old_memcg->css);
}
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
--
2.1.4

2015-07-08 12:28:32

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 8/8] memcg: get rid of mem_cgroup_from_task

From: Michal Hocko <[email protected]>

mem_cgroup_from_task has always been a tricky API. It was added
by 78fb74669e80 ("Memory controller: accounting setup") for
mm_struct::mem_cgroup initialization. Later on it gained new callers
mostly due to the mm_struct::mem_cgroup -> mm_struct::owner transition and
most users had to do mem_cgroup_from_task(mm->owner) to get the
resulting memcg. Now that mm_struct::owner is gone this is not
necessary, yet the API is still confusing.

One tricky part has always been that the API sounds generic but it is
not really. mem_cgroup_from_task(current) doesn't necessarily mean the
same thing as current->mm->memcg (previously
mem_cgroup_from_task(current->mm->owner)) because the mm might be
associated with a different cgroup than the process.

Another tricky part is that p->mm->memcg is unsafe if p != current,
as pointed out by Oleg, because nobody is holding a reference on that
mm. This is not a problem right now because we have only 2 callers in
the tree: sock_update_memcg operates on current and task_in_mem_cgroup
is given a non-NULL task so it always ends up using task_css.

Let's ditch this function and use current->mm->memcg for
sock_update_memcg and use task_css for task_in_mem_cgroup. This doesn't
have any functional effect.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 24 +++++++-----------------
1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4069ec8f52be..fb8e9bd04a29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -294,18 +294,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
return mem_cgroup_from_css(css);
}

-static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
-{
- if (p->mm)
- return rcu_dereference(p->mm->memcg);
-
- /*
- * If the process doesn't have mm struct anymore we have to fallback
- * to the task_css.
- */
- return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
-}
-
/* Writing them here to avoid exposing memcg's inner layout */
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)

@@ -332,7 +320,7 @@ void sock_update_memcg(struct sock *sk)
}

rcu_read_lock();
- memcg = mem_cgroup_from_task(current);
+ memcg = rcu_dereference(current->mm->memcg);
cg_proto = sk->sk_prot->proto_cgroup(memcg);
if (cg_proto && memcg_proto_active(cg_proto) &&
css_tryget_online(&memcg->css)) {
@@ -1091,12 +1079,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
task_unlock(p);
} else {
/*
- * All threads may have already detached their mm's, but the oom
- * killer still needs to detect if they have already been oom
- * killed to prevent needlessly killing additional tasks.
+ * All threads have already detached their mm's but we should
+ * still be able to at least guess the original memcg from the
+ * task_css. These two will match most of the time but there are
+ * corner cases where task->mm and task_css refer to a different
+ * cgroups.
*/
rcu_read_lock();
- task_memcg = mem_cgroup_from_task(task);
+ task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
css_get(&task_memcg->css);
rcu_read_unlock();
}
--
2.1.4

2015-07-08 15:39:44

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/8] memcg: export struct mem_cgroup

Hi Michal,

On Wed, Jul 08, 2015 at 02:27:45PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mem_cgroup structure is defined in mm/memcontrol.c currently which
> means that the code outside of this file has to use external API even
> for trivial access stuff.
>
> This patch exports mem_cgroup with its dependencies and makes some of the

IMO it's a step in the right direction. A few nitpicks below.

> exported functions inlines. This even helps to reduce the code size a bit
> (make defconfig + CONFIG_MEMCG=y)
>
> text data bss dec hex filename
> 12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
> 12354970 1823792 1089536 15268298 e8f9ca vmlinux.after
>
> This is not much (370B) but better than nothing. We also save a function
> call in some hot paths like callers of mem_cgroup_count_vm_event which is
> used for accounting.
>
> The patch doesn't introduce any functional changes.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> include/linux/memcontrol.h | 365 +++++++++++++++++++++++++++++++++++++++++----
> include/linux/swap.h | 10 +-
> include/net/sock.h | 28 ----
> mm/memcontrol.c | 305 -------------------------------------
> mm/memory-failure.c | 2 +-
> mm/slab_common.c | 2 +-
> mm/vmscan.c | 2 +-
> 7 files changed, 344 insertions(+), 370 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 73b02b0a8f60..f5a8d0bbef8d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -23,8 +23,11 @@
> #include <linux/vm_event_item.h>
> #include <linux/hardirq.h>
> #include <linux/jump_label.h>
> +#include <linux/page_counter.h>
> +#include <linux/vmpressure.h>
> +#include <linux/mmzone.h>
> +#include <linux/writeback.h>
>
> -struct mem_cgroup;

I think we still need this forward declaration e.g. for defining
reclaim_iter.

> struct page;
> struct mm_struct;
> struct kmem_cache;
> @@ -67,12 +70,221 @@ enum mem_cgroup_events_index {
> MEMCG_NR_EVENTS,
> };
>
> +/*
> + * Per memcg event counter is incremented at every pagein/pageout. With THP,
> + * it will be incremated by the number of pages. This counter is used for
> + * for trigger some periodic events. This is straightforward and better
> + * than using jiffies etc. to handle periodic memcg event.
> + */
> +enum mem_cgroup_events_target {
> + MEM_CGROUP_TARGET_THRESH,
> + MEM_CGROUP_TARGET_SOFTLIMIT,
> + MEM_CGROUP_TARGET_NUMAINFO,
> + MEM_CGROUP_NTARGETS,
> +};
> +
> +struct mem_cgroup_stat_cpu {
> + long count[MEM_CGROUP_STAT_NSTATS];
> + unsigned long events[MEMCG_NR_EVENTS];
> + unsigned long nr_page_events;
> + unsigned long targets[MEM_CGROUP_NTARGETS];
> +};
> +
> +struct reclaim_iter {

I think we'd better rename it to mem_cgroup_reclaim_iter.

> + struct mem_cgroup *position;
> + /* scan generation, increased every round-trip */
> + unsigned int generation;
> +};
> +
> +/*
> + * per-zone information in memory controller.
> + */
> +struct mem_cgroup_per_zone {
> + struct lruvec lruvec;
> + unsigned long lru_size[NR_LRU_LISTS];
> +
> + struct reclaim_iter iter[DEF_PRIORITY + 1];
> +
> + struct rb_node tree_node; /* RB tree node */
> + unsigned long usage_in_excess;/* Set to the value by which */
> + /* the soft limit is exceeded*/
> + bool on_tree;
> + struct mem_cgroup *memcg; /* Back pointer, we cannot */
> + /* use container_of */
> +};
> +
> +struct mem_cgroup_per_node {
> + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> +};
> +
> +struct mem_cgroup_threshold {
> + struct eventfd_ctx *eventfd;
> + unsigned long threshold;
> +};
> +
> +/* For threshold */
> +struct mem_cgroup_threshold_ary {
> + /* An array index points to threshold just below or equal to usage. */
> + int current_threshold;
> + /* Size of entries[] */
> + unsigned int size;
> + /* Array of thresholds */
> + struct mem_cgroup_threshold entries[0];
> +};
> +
> +struct mem_cgroup_thresholds {
> + /* Primary thresholds array */
> + struct mem_cgroup_threshold_ary *primary;
> + /*
> + * Spare threshold array.
> + * This is needed to make mem_cgroup_unregister_event() "never fail".
> + * It must be able to store at least primary->size - 1 entries.
> + */
> + struct mem_cgroup_threshold_ary *spare;
> +};

I think we'd better define these structures inside CONFIG_MEMCG section,
just like struct mem_cgroup.

> +
> +/*
> + * Bits in struct cg_proto.flags
> + */
> +enum cg_proto_flags {
> + /* Currently active and new sockets should be assigned to cgroups */
> + MEMCG_SOCK_ACTIVE,
> + /* It was ever activated; we must disarm static keys on destruction */
> + MEMCG_SOCK_ACTIVATED,
> +};
> +
> +struct cg_proto {
> + struct page_counter memory_allocated; /* Current allocated memory. */
> + struct percpu_counter sockets_allocated; /* Current number of sockets. */
> + int memory_pressure;
> + long sysctl_mem[3];
> + unsigned long flags;
> + /*
> + * memcg field is used to find which memcg we belong directly
> + * Each memcg struct can hold more than one cg_proto, so container_of
> + * won't really cut.
> + *
> + * The elegant solution would be having an inverse function to
> + * proto_cgroup in struct proto, but that means polluting the structure
> + * for everybody, instead of just for memcg users.
> + */
> + struct mem_cgroup *memcg;
> +};

I'd prefer to leave it where it is now. I don't see any reason why we
have to embed it into mem_cgroup, so maybe we'd better keep a pointer
to it in struct mem_cgroup instead?

> +
> #ifdef CONFIG_MEMCG
> +/*
> + * The memory controller data structure. The memory controller controls both
> + * page cache and RSS per cgroup. We would eventually like to provide
> + * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> + * to help the administrator determine what knobs to tune.
> + */
> +struct mem_cgroup {
> + struct cgroup_subsys_state css;
> +
> + /* Accounted resources */
> + struct page_counter memory;
> + struct page_counter memsw;
> + struct page_counter kmem;
> +
> + /* Normal memory consumption range */
> + unsigned long low;
> + unsigned long high;
> +
> + unsigned long soft_limit;
> +
> + /* vmpressure notifications */
> + struct vmpressure vmpressure;
> +
> + /* css_online() has been completed */
> + int initialized;
> +
> + /*
> + * Should the accounting and control be hierarchical, per subtree?
> + */
> + bool use_hierarchy;
> +
> + /* protected by memcg_oom_lock */
> + bool oom_lock;
> + int under_oom;
> +
> + int swappiness;
> + /* OOM-Killer disable */
> + int oom_kill_disable;
> +
> + /* protect arrays of thresholds */
> + struct mutex thresholds_lock;
> +
> + /* thresholds for memory usage. RCU-protected */
> + struct mem_cgroup_thresholds thresholds;
> +
> + /* thresholds for mem+swap usage. RCU-protected */
> + struct mem_cgroup_thresholds memsw_thresholds;
> +
> + /* For oom notifier event fd */
> + struct list_head oom_notify;
> +
> + /*
> + * Should we move charges of a task when a task is moved into this
> + * mem_cgroup ? And what type of charges should we move ?
> + */
> + unsigned long move_charge_at_immigrate;
> + /*
> + * set > 0 if pages under this cgroup are moving to other cgroup.
> + */
> + atomic_t moving_account;
> + /* taken only while moving_account > 0 */
> + spinlock_t move_lock;
> + struct task_struct *move_lock_task;
> + unsigned long move_lock_flags;
> + /*
> + * percpu counter.
> + */
> + struct mem_cgroup_stat_cpu __percpu *stat;
> + spinlock_t pcp_counter_lock;
> +
> +#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> + struct cg_proto tcp_mem;
> +#endif
> +#if defined(CONFIG_MEMCG_KMEM)
> + /* Index in the kmem_cache->memcg_params.memcg_caches array */
> + int kmemcg_id;
> + bool kmem_acct_activated;
> + bool kmem_acct_active;
> +#endif
> +
> + int last_scanned_node;
> +#if MAX_NUMNODES > 1
> + nodemask_t scan_nodes;
> + atomic_t numainfo_events;
> + atomic_t numainfo_updating;
> +#endif
> +
> +#ifdef CONFIG_CGROUP_WRITEBACK
> + struct list_head cgwb_list;
> + struct wb_domain cgwb_domain;
> +#endif
> +
> + /* List of events which userspace want to receive */
> + struct list_head event_list;
> + spinlock_t event_list_lock;
> +
> + struct mem_cgroup_per_node *nodeinfo[0];
> + /* WARNING: nodeinfo must be the last member here */
> +};
> extern struct cgroup_subsys_state *mem_cgroup_root_css;
>
> -void mem_cgroup_events(struct mem_cgroup *memcg,
> +/**
> + * mem_cgroup_events - count memory events against a cgroup
> + * @memcg: the memory cgroup
> + * @idx: the event index
> + * @nr: the number of events to account for
> + */
> +static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> enum mem_cgroup_events_index idx,
> - unsigned int nr);
> + unsigned int nr)
> +{
> + this_cpu_add(memcg->stat->events[idx], nr);
> +}
>
> bool mem_cgroup_low(struct mem_cgroup *root, struct mem_cgroup *memcg);
>
> @@ -90,15 +302,31 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage,
> struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>
> -bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
> - struct mem_cgroup *root);
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>
> extern struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);

It's a trivial one line function, so why not inline it too?

> -extern struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css);
> +static inline
> +struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> + return css ? container_of(css, struct mem_cgroup, css) : NULL;
> +}
> +
> +struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
> + struct mem_cgroup *,
> + struct mem_cgroup_reclaim_cookie *);
> +void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
> +
> +static inline bool mem_cgroup_is_descendant(struct mem_cgroup *memcg,
> + struct mem_cgroup *root)
> +{
> + if (root == memcg)
> + return true;
> + if (!root->use_hierarchy)
> + return false;
> + return cgroup_is_descendant(memcg->css.cgroup, root->css.cgroup);
> +}
>
> static inline bool mm_match_cgroup(struct mm_struct *mm,
> struct mem_cgroup *memcg)
[...]
> @@ -184,13 +463,31 @@ unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask,
> unsigned long *total_scanned);
>
> -void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
> static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> enum vm_event_item idx)
> {
> + struct mem_cgroup *memcg;
> +
> if (mem_cgroup_disabled())
> return;
> - __mem_cgroup_count_vm_event(mm, idx);
> +
> + rcu_read_lock();
> + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + if (unlikely(!memcg))
> + goto out;
> +
> + switch (idx) {
> + case PGFAULT:
> + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
> + break;
> + case PGMAJFAULT:
> + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
> + break;
> + default:
> + BUG();

This switch-case looks bulky and weird. Let's make this function accept
MEM_CGROUP_EVENTS_PGFAULT/PGMAJFAULT directly instead.
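
Roughly something like this, perhaps (a completely untested sketch on my
side, keeping the mm->owner lookup as it is at this point in the series):

static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
					     enum mem_cgroup_events_index idx)
{
	struct mem_cgroup *memcg;

	if (mem_cgroup_disabled())
		return;

	rcu_read_lock();
	/* charge the event to the memcg the mm's owner belongs to */
	memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
	if (likely(memcg))
		this_cpu_inc(memcg->stat->events[idx]);
	rcu_read_unlock();
}

with the callers passing MEM_CGROUP_EVENTS_PGFAULT or
MEM_CGROUP_EVENTS_PGMAJFAULT directly.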

> + }
> +out:
> + rcu_read_unlock();
> }
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> void mem_cgroup_split_huge_fixup(struct page *head);
[...]
> @@ -463,7 +754,15 @@ void __memcg_kmem_commit_charge(struct page *page,
> struct mem_cgroup *memcg, int order);
> void __memcg_kmem_uncharge_pages(struct page *page, int order);
>
> -int memcg_cache_id(struct mem_cgroup *memcg);
> +/*
> + * helper for acessing a memcg's index. It will be used as an index in the
> + * child cache array in kmem_cache, and also to derive its name. This function
> + * will return -1 when this is not a kmem-limited memcg.
> + */
> +static inline int memcg_cache_id(struct mem_cgroup *memcg)
> +{
> + return memcg ? memcg->kmemcg_id : -1;
> +}

We can inline memcg_kmem_is_active too.

>
> struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep);
> void __memcg_kmem_put_cache(struct kmem_cache *cachep);
[...]

Thanks,
Vladimir

2015-07-08 15:43:39

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 2/8] memcg: get rid of mem_cgroup_root_css for !CONFIG_MEMCG

On Wed, Jul 08, 2015 at 02:27:46PM +0200, Michal Hocko wrote:
> The only user is cgwb_bdi_init and that one depends on
> CONFIG_CGROUP_WRITEBACK which in turn depends on CONFIG_MEMCG
> so it doesn't make much sense to definte an empty stub for
> !CONFIG_MEMCG. Moreover ERR_PTR(-EINVAL) is ugly and would lead
> to runtime crashes if used in unguarded code paths. Better fail
> during compilation.
>
> Signed-off-by: Michal Hocko <[email protected]>

Reviewed-by: Vladimir Davydov <[email protected]>

2015-07-08 15:44:10

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 3/8] memcg: get rid of extern for functions in memcontrol.h

On Wed, Jul 08, 2015 at 02:27:47PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Most of the exported functions in this header are not marked extern so
> change the rest to follow the same style.
>
> Signed-off-by: Michal Hocko <[email protected]>

Reviewed-by: Vladimir Davydov <[email protected]>

2015-07-08 16:02:24

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 4/8] memcg, mm: move mem_cgroup_select_victim_node into vmscan

On Wed, Jul 08, 2015 at 02:27:48PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> We currently have only one caller of mem_cgroup_select_victim_node which
> is sitting in mm/vmscan.c and which is already wrapped by CONFIG_MEMCG
> ifdef. Now that we have struct mem_cgroup visible outside of
> mm/memcontrol.c we can move the function and its dependencies there.
> This even shrinks the code size by a few bytes:
>
> text data bss dec hex filename
> 478509 65806 26384 570699 8b54b mm/built-in.o.before
> 478445 65806 26384 570635 8b50b mm/built-in.o.after
>
> Signed-off-by: Michal Hocko <[email protected]>

I dislike this patch, because I don't see any reason why logic specific
to per memcg reclaim should live in the file representing the global
reclaim path. With such an approach you may end up moving
mem_cgroup_low, mem_cgroup_soft_limit_reclaim, etc. to vmscan.c, because
they are used only there. I don't think it's right.

Thanks,
Vladimir

2015-07-08 16:05:57

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 5/8] memcg: restructure mem_cgroup_can_attach()

On Wed, Jul 08, 2015 at 02:27:49PM +0200, Michal Hocko wrote:
> From: Tejun Heo <[email protected]>
>
> Restructure it to lower nesting level and help the planned threadgroup
> leader iteration changes.
>
> This is pure reorganization.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Reviewed-by: Vladimir Davydov <[email protected]>

2015-07-08 16:11:16

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 6/8] memcg, tcp_kmem: check for cg_proto in sock_update_memcg

On Wed, Jul 08, 2015 at 02:27:50PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> sk_prot->proto_cgroup is allowed to return NULL but sock_update_memcg
> doesn't check for NULL. The function relies on the mem_cgroup_is_root
> check because we shouldn't get NULL otherwise, since
> mem_cgroup_from_task will always return a non-NULL memcg.
>
> All other callers are checking for NULL and we can safely replace
> mem_cgroup_is_root() check by cg_proto != NULL which will be more
> straightforward (proto_cgroup returns NULL for the root memcg already).
>
> Signed-off-by: Michal Hocko <[email protected]>

Reviewed-by: Vladimir Davydov <[email protected]>

2015-07-08 17:33:08

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

I like the gist of this patch. A few comments below.

On Wed, Jul 08, 2015 at 02:27:51PM +0200, Michal Hocko wrote:
[...]
> diff --git a/fs/exec.c b/fs/exec.c
> index 1977c2a553ac..3ed9c0abc9f5 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -870,7 +870,7 @@ static int exec_mmap(struct mm_struct *mm)
> up_read(&old_mm->mmap_sem);
> BUG_ON(active_mm != old_mm);
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> - mm_update_next_owner(old_mm);
> + mm_inherit_memcg(mm, old_mm);
> mmput(old_mm);
> return 0;
> }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 78e9d4ac57a1..8e6b2444ebfe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -274,6 +274,52 @@ struct mem_cgroup {
> extern struct cgroup_subsys_state *mem_cgroup_root_css;
>
> /**
> + * __mm_set_memcg - Set mm_struct:memcg to a given memcg.
> + * @mm: mm struct
> + * @memcg: mem_cgroup to be used
> + *
> + * Note that this function doesn't clean up the previous mm->memcg.
> + * This should be done by caller when necessary (e.g. when moving
> + * mm from one memcg to another).
> + */
> +static inline
> +void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> + if (memcg)
> + css_get(&memcg->css);
> + rcu_assign_pointer(mm->memcg, memcg);
> +}
> +
> +/**
> + * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
> + * @newmm: new mm struct
> + * @oldmm: old mm struct to inherit from
> + *
> + * Should be called for each new mm_struct.
> + */
> +static inline
> +void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> + struct mem_cgroup *memcg = oldmm->memcg;

FWIW, if CONFIG_SPARSE_RCU_POINTER is on, this will trigger a
compile-time warning, as will any other unannotated dereference of
mm_struct->memcg below.
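
Just to illustrate what I mean, a sparse-clean variant could look
something like this (only a sketch; whether the "protected" assertion is
actually justified is a separate question):

static inline
void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
{
	/* the "1" just documents the assumption that oldmm->memcg is stable */
	struct mem_cgroup *memcg = rcu_dereference_protected(oldmm->memcg, 1);

	__mm_set_memcg(newmm, memcg);
}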

> +
> + __mm_set_memcg(newmm, memcg);
> +}
> +
> +/**
> + * mm_drop_iter - drop mm_struct::memcg association

s/mm_drop_iter/mm_drop_memcg

> + * @mm: mm struct
> + *
> + * Should be called after the mm has been removed from all tasks
> + * and before it is freed (e.g. from mmput)
> + */
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> + if (mm->memcg)
> + css_put(&mm->memcg->css);
> + mm->memcg = NULL;
> +}
> +
> +/**
> * mem_cgroup_events - count memory events against a cgroup
> * @memcg: the memory cgroup
> * @idx: the event index
> @@ -305,7 +351,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>
> struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
> static inline
> @@ -335,7 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
> bool match = false;
>
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + task_memcg = rcu_dereference(mm->memcg);
> if (task_memcg)
> match = mem_cgroup_is_descendant(task_memcg, memcg);
> rcu_read_unlock();
> @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> return;
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (unlikely(!memcg))
> goto out;
>

If I'm not mistaken, mm->memcg equals NULL for any task in the root
memory cgroup (BTW, if it's true, it's worth mentioning in the comment
next to the mm->memcg definition IMO). As a result, we won't account the
stats for such tasks, will we?
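
I guess the accounting path would then need an explicit fallback, e.g.
something like (a sketch, assuming root_mem_cgroup is, or can be made,
visible to this code):

	memcg = rcu_dereference(mm->memcg);
	if (!memcg)
		memcg = root_mem_cgroup;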

[...]
> @@ -4749,37 +4748,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> * tunable will only affect upcoming migrations, not the current one.
> * So we need to save it, and keep it going.
> */
> - move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> + move_flags = READ_ONCE(to->move_charge_at_immigrate);
> if (!move_flags)
> return 0;
>
> p = cgroup_taskset_first(tset);
> - from = mem_cgroup_from_task(p);
> -
> - VM_BUG_ON(from == memcg);
> + if (!thread_group_leader(p))
> + return 0;
>
> mm = get_task_mm(p);
> if (!mm)
> return 0;
> - /* We move charges only when we move a owner of the mm */
> - if (mm->owner == p) {
> - VM_BUG_ON(mc.from);
> - VM_BUG_ON(mc.to);
> - VM_BUG_ON(mc.precharge);
> - VM_BUG_ON(mc.moved_charge);
> - VM_BUG_ON(mc.moved_swap);
> -
> - spin_lock(&mc.lock);
> - mc.from = from;
> - mc.to = memcg;
> - mc.flags = move_flags;
> - spin_unlock(&mc.lock);
> - /* We set mc.moving_task later */
> -
> - ret = mem_cgroup_precharge_mc(mm);
> - if (ret)
> - mem_cgroup_clear_mc();
> - }
> +
> + /*
> + * tasks' cgroup might be different from the one p->mm is associated
> + * with because CLONE_VM is allowed without CLONE_THREAD. The task is
> + * moving so we have to migrate from the memcg associated with its
> + * address space.

> + * No need to take a reference here because the memcg is pinned by the
> + * mm_struct.
> + */

But after we drop the reference to the mm below, mc.from can go away
and we can get a use-after-free in mem_cgroup_move_task, can't we?

AFAIU the real reason why we can skip taking a reference to mc.from, as
well as to mc.to, is that task migration proceeds under cgroup_mutex,
which blocks cgroup destruction. Am I missing something? If not, please
remove this comment, because it's confusing.

> + from = READ_ONCE(mm->memcg);
> + if (!from)
> + from = root_mem_cgroup;
> + if (from == to)
> + goto out;
> +
> + VM_BUG_ON(mc.from);
> + VM_BUG_ON(mc.to);
> + VM_BUG_ON(mc.precharge);
> + VM_BUG_ON(mc.moved_charge);
> + VM_BUG_ON(mc.moved_swap);
> +
> + spin_lock(&mc.lock);
> + mc.from = from;
> + mc.to = to;
> + mc.flags = move_flags;
> + spin_unlock(&mc.lock);
> + /* We set mc.moving_task later */
> +
> + ret = mem_cgroup_precharge_mc(mm);
> + if (ret)
> + mem_cgroup_clear_mc();
> +out:
> mmput(mm);
> return ret;
> }
> @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
> {
> struct task_struct *p = cgroup_taskset_first(tset);
> struct mm_struct *mm = get_task_mm(p);
> + struct mem_cgroup *old_memcg = NULL;
>
> if (mm) {
> + old_memcg = READ_ONCE(mm->memcg);
> + __mm_set_memcg(mm, mem_cgroup_from_css(css));
> +
> if (mc.to)
> mem_cgroup_move_charge(mm);
> mmput(mm);
> }
> if (mc.to)
> mem_cgroup_clear_mc();
> +
> + /*
> + * Be careful and drop the reference only after we are done because
> + * p's task_css memcg might be different from p->memcg and nothing else
> + * might be pinning the old memcg.
> + */
> + if (old_memcg)
> + css_put(&old_memcg->css);

Please explain why the following race is impossible:

CPU0                                    CPU1
----                                    ----
[current = T]
dup_mm or exec_mmap
  mm_inherit_memcg
    memcg = current->mm->memcg;
                                        mem_cgroup_move_task
                                          p = T;
                                          mm = get_task_mm(p);
                                          old_memcg = mm->memcg;
                                          css_put(&old_memcg->css);
                                          /* old_memcg can be freed now */
    css_get(memcg); /* BUG */

> }
> #else /* !CONFIG_MMU */
> static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,

Thanks,
Vladimir

2015-07-08 17:43:52

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg: get rid of mem_cgroup_from_task

On Wed, Jul 08, 2015 at 02:27:52PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mem_cgroup_from_task has always been a tricky API. It was added
> by 78fb74669e80 ("Memory controller: accounting setup") for
> mm_struct::mem_cgroup initialization. Later on it gained new callers
> mostly due to the mm_struct::mem_cgroup -> mm_struct::owner transition and
> most users had to do mem_cgroup_from_task(mm->owner) to get the
> resulting memcg. Now that mm_struct::owner is gone this is not
> necessary, yet the API is still confusing.
>
> One tricky part has always been that the API sounds generic but it is
> not really. mem_cgroup_from_task(current) doesn't necessarily mean the
> same thing as current->mm->memcg (previously
> mem_cgroup_from_task(current->mm->owner)) because the mm might be
> associated with a different cgroup than the process.
>
> Another tricky part is that p->mm->memcg is unsafe if p != current,
> as pointed out by Oleg, because nobody is holding a reference on that
> mm. This is not a problem right now because we have only 2 callers in
> the tree: sock_update_memcg operates on current and task_in_mem_cgroup
> is given a non-NULL task so it always ends up using task_css.
>
> Let's ditch this function and use current->mm->memcg for
> sock_update_memcg and use task_css for task_in_mem_cgroup. This doesn't
> have any functional effect.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 24 +++++++-----------------
> 1 file changed, 7 insertions(+), 17 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4069ec8f52be..fb8e9bd04a29 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -294,18 +294,6 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> return mem_cgroup_from_css(css);
> }
>
> -static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> - if (p->mm)
> - return rcu_dereference(p->mm->memcg);
> -
> - /*
> - * If the process doesn't have mm struct anymore we have to fallback
> - * to the task_css.
> - */
> - return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -
> /* Writing them here to avoid exposing memcg's inner layout */
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -332,7 +320,7 @@ void sock_update_memcg(struct sock *sk)
> }
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(current);
> + memcg = rcu_dereference(current->mm->memcg);
> cg_proto = sk->sk_prot->proto_cgroup(memcg);
> if (cg_proto && memcg_proto_active(cg_proto) &&
> css_tryget_online(&memcg->css)) {
> @@ -1091,12 +1079,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
> task_unlock(p);
> } else {
> /*
> - * All threads may have already detached their mm's, but the oom
> - * killer still needs to detect if they have already been oom
> - * killed to prevent needlessly killing additional tasks.
> + * All threads have already detached their mm's but we should
> + * still be able to at least guess the original memcg from the
> + * task_css. These two will match most of the time but there are
> + * corner cases where task->mm and task_css refer to a different
> + * cgroups.
> */
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(task);
> + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> css_get(&task_memcg->css);

I wonder why it's safe to call css_get here.

The patch itself looks good though,

Reviewed-by: Vladimir Davydov <[email protected]>

> rcu_read_unlock();
> }
> --
> 2.1.4
>

2015-07-09 11:22:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/8] memcg: export struct mem_cgroup

On Wed 08-07-15 18:39:26, Vladimir Davydov wrote:
> Hi Michal,
>
> On Wed, Jul 08, 2015 at 02:27:45PM +0200, Michal Hocko wrote:
[...]
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 73b02b0a8f60..f5a8d0bbef8d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -23,8 +23,11 @@
> > #include <linux/vm_event_item.h>
> > #include <linux/hardirq.h>
> > #include <linux/jump_label.h>
> > +#include <linux/page_counter.h>
> > +#include <linux/vmpressure.h>
> > +#include <linux/mmzone.h>
> > +#include <linux/writeback.h>
> >
> > -struct mem_cgroup;
>
> I think we still need this forward declaration e.g. for defining
> reclaim_iter.

Yes, this was just an omission and I hadn't noticed it because the
compiler didn't complain. I do agree that we shouldn't rely on
forward declarations from other header files. Fixed.

[...]
> > +struct mem_cgroup_stat_cpu {
> > + long count[MEM_CGROUP_STAT_NSTATS];
> > + unsigned long events[MEMCG_NR_EVENTS];
> > + unsigned long nr_page_events;
> > + unsigned long targets[MEM_CGROUP_NTARGETS];
> > +};
> > +
> > +struct reclaim_iter {
>
> I think we'd better rename it to mem_cgroup_reclaim_iter.

The name is awfully long but it is used in only a single place so it
shouldn't matter much, and we will be more consistent in naming and won't
pollute the namespace. Changed...

>
> > + struct mem_cgroup *position;
> > + /* scan generation, increased every round-trip */
> > + unsigned int generation;
> > +};
> > +
> > +/*
> > + * per-zone information in memory controller.
> > + */
> > +struct mem_cgroup_per_zone {
> > + struct lruvec lruvec;
> > + unsigned long lru_size[NR_LRU_LISTS];
> > +
> > + struct reclaim_iter iter[DEF_PRIORITY + 1];
> > +
> > + struct rb_node tree_node; /* RB tree node */
> > + unsigned long usage_in_excess;/* Set to the value by which */
> > + /* the soft limit is exceeded*/
> > + bool on_tree;
> > + struct mem_cgroup *memcg; /* Back pointer, we cannot */
> > + /* use container_of */
> > +};
> > +
> > +struct mem_cgroup_per_node {
> > + struct mem_cgroup_per_zone zoneinfo[MAX_NR_ZONES];
> > +};
> > +
> > +struct mem_cgroup_threshold {
> > + struct eventfd_ctx *eventfd;
> > + unsigned long threshold;
> > +};
> > +
> > +/* For threshold */
> > +struct mem_cgroup_threshold_ary {
> > + /* An array index points to threshold just below or equal to usage. */
> > + int current_threshold;
> > + /* Size of entries[] */
> > + unsigned int size;
> > + /* Array of thresholds */
> > + struct mem_cgroup_threshold entries[0];
> > +};
> > +
> > +struct mem_cgroup_thresholds {
> > + /* Primary thresholds array */
> > + struct mem_cgroup_threshold_ary *primary;
> > + /*
> > + * Spare threshold array.
> > + * This is needed to make mem_cgroup_unregister_event() "never fail".
> > + * It must be able to store at least primary->size - 1 entries.
> > + */
> > + struct mem_cgroup_threshold_ary *spare;
> > +};
>
> I think we'd better define these structures inside CONFIG_MEMCG section,
> just like struct mem_cgroup.

OK. I just wanted to make the code compilable but I can move the ifdef
up which would be cleaner. We only need enums and cg_proto stuff to be
compilable. Will do that.

> > +
> > +/*
> > + * Bits in struct cg_proto.flags
> > + */
> > +enum cg_proto_flags {
> > + /* Currently active and new sockets should be assigned to cgroups */
> > + MEMCG_SOCK_ACTIVE,
> > + /* It was ever activated; we must disarm static keys on destruction */
> > + MEMCG_SOCK_ACTIVATED,
> > +};
> > +
> > +struct cg_proto {
> > + struct page_counter memory_allocated; /* Current allocated memory. */
> > + struct percpu_counter sockets_allocated; /* Current number of sockets. */
> > + int memory_pressure;
> > + long sysctl_mem[3];
> > + unsigned long flags;
> > + /*
> > + * memcg field is used to find which memcg we belong directly
> > + * Each memcg struct can hold more than one cg_proto, so container_of
> > + * won't really cut.
> > + *
> > + * The elegant solution would be having an inverse function to
> > + * proto_cgroup in struct proto, but that means polluting the structure
> > + * for everybody, instead of just for memcg users.
> > + */
> > + struct mem_cgroup *memcg;
> > +};
>
> I'd prefer to leave it where it is now. I don't see any reason why we
> have to embed it into mem_cgroup, so maybe we'd better keep a pointer
> to it in struct mem_cgroup instead?

This patch is supposed to be minimal without any functional changes.
Changing tcp_mem to a pointer would require allocation and freeing and that
is out of the scope of this patch. Besides that I do not see any strong
advantage in doing that.

[...]
> > extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
>
> It's a trivial one line function, so why not inline it too?

Yes it is trivial but according to my notes it increased the code size
by ~100B.

[...]
> > -void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
> > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > enum vm_event_item idx)
> > {
> > + struct mem_cgroup *memcg;
> > +
> > if (mem_cgroup_disabled())
> > return;
> > - __mem_cgroup_count_vm_event(mm, idx);
> > +
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > + if (unlikely(!memcg))
> > + goto out;
> > +
> > + switch (idx) {
> > + case PGFAULT:
> > + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
> > + break;
> > + case PGMAJFAULT:
> > + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
> > + break;
> > + default:
> > + BUG();
>
> This switch-case looks bulky and weird. Let's make this function accept
> MEM_CGROUP_EVENTS_PGFAULT/PGMAJFAULT directly instead.

Yes it looks ugly but I didn't intend to change it in this particular
patch. I wouldn't mind a follow up cleanup patch.

[...]
> > -int memcg_cache_id(struct mem_cgroup *memcg);
> > +/*
> > + * helper for acessing a memcg's index. It will be used as an index in the
> > + * child cache array in kmem_cache, and also to derive its name. This function
> > + * will return -1 when this is not a kmem-limited memcg.
> > + */
> > +static inline int memcg_cache_id(struct mem_cgroup *memcg)
> > +{
> > + return memcg ? memcg->kmemcg_id : -1;
> > +}
>
> We can inline memcg_kmem_is_active too.

I do not have this one in my notes so I haven't tried to inline it.
Let's see.

text data bss dec hex filename
12355346 1823792 1089536 15268674 e8fb42 vmlinux.before
12354970 1823792 1089536 15268298 e8f9ca vmlinux.after
12354970 1823792 1089536 15268298 e8f9ca vmlinux.memcg_kmem_is_active

Interesting. The code size hasn't changed. This is quite surprising, but
I guess it only changed the alignment of the code. Anyway, I will
inline it.

Thanks for the review Vladimir!

The current diff against the patch is:
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f5a8d0bbef8d..42f118ae04cf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,7 @@
#include <linux/mmzone.h>
#include <linux/writeback.h>

+struct mem_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
@@ -83,6 +84,35 @@ enum mem_cgroup_events_target {
MEM_CGROUP_NTARGETS,
};

+/*
+ * Bits in struct cg_proto.flags
+ */
+enum cg_proto_flags {
+ /* Currently active and new sockets should be assigned to cgroups */
+ MEMCG_SOCK_ACTIVE,
+ /* It was ever activated; we must disarm static keys on destruction */
+ MEMCG_SOCK_ACTIVATED,
+};
+
+struct cg_proto {
+ struct page_counter memory_allocated; /* Current allocated memory. */
+ struct percpu_counter sockets_allocated; /* Current number of sockets. */
+ int memory_pressure;
+ long sysctl_mem[3];
+ unsigned long flags;
+ /*
+ * memcg field is used to find which memcg we belong directly
+ * Each memcg struct can hold more than one cg_proto, so container_of
+ * won't really cut.
+ *
+ * The elegant solution would be having an inverse function to
+ * proto_cgroup in struct proto, but that means polluting the structure
+ * for everybody, instead of just for memcg users.
+ */
+ struct mem_cgroup *memcg;
+};
+
+#ifdef CONFIG_MEMCG
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEMCG_NR_EVENTS];
@@ -90,7 +120,7 @@ struct mem_cgroup_stat_cpu {
unsigned long targets[MEM_CGROUP_NTARGETS];
};

-struct reclaim_iter {
+struct mem_cgroup_reclaim_iter {
struct mem_cgroup *position;
/* scan generation, increased every round-trip */
unsigned int generation;
@@ -103,7 +133,7 @@ struct mem_cgroup_per_zone {
struct lruvec lruvec;
unsigned long lru_size[NR_LRU_LISTS];

- struct reclaim_iter iter[DEF_PRIORITY + 1];
+ struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];

struct rb_node tree_node; /* RB tree node */
unsigned long usage_in_excess;/* Set to the value by which */
@@ -144,35 +174,6 @@ struct mem_cgroup_thresholds {
};

/*
- * Bits in struct cg_proto.flags
- */
-enum cg_proto_flags {
- /* Currently active and new sockets should be assigned to cgroups */
- MEMCG_SOCK_ACTIVE,
- /* It was ever activated; we must disarm static keys on destruction */
- MEMCG_SOCK_ACTIVATED,
-};
-
-struct cg_proto {
- struct page_counter memory_allocated; /* Current allocated memory. */
- struct percpu_counter sockets_allocated; /* Current number of sockets. */
- int memory_pressure;
- long sysctl_mem[3];
- unsigned long flags;
- /*
- * memcg field is used to find which memcg we belong directly
- * Each memcg struct can hold more than one cg_proto, so container_of
- * won't really cut.
- *
- * The elegant solution would be having an inverse function to
- * proto_cgroup in struct proto, but that means polluting the structure
- * for everybody, instead of just for memcg users.
- */
- struct mem_cgroup *memcg;
-};
-
-#ifdef CONFIG_MEMCG
-/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
* statistics based on the statistics developed by Rik Van Riel for clock-pro,
@@ -735,7 +736,10 @@ static inline bool memcg_kmem_enabled(void)
return static_key_false(&memcg_kmem_enabled_key);
}

-bool memcg_kmem_is_active(struct mem_cgroup *memcg);
+static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
+{
+ return memcg->kmem_acct_active;
+}

/*
* In general, we'll do everything in our power to not incur in any overhead
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 759ec413e72c..a3543dedc153 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -184,13 +184,6 @@ struct mem_cgroup_event {
static void mem_cgroup_threshold(struct mem_cgroup *memcg);
static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);

-#ifdef CONFIG_MEMCG_KMEM
-bool memcg_kmem_is_active(struct mem_cgroup *memcg)
-{
- return memcg->kmem_acct_active;
-}
-#endif
-
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved.
@@ -841,7 +834,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
struct mem_cgroup_reclaim_cookie *reclaim)
{
- struct reclaim_iter *uninitialized_var(iter);
+ struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
struct cgroup_subsys_state *css = NULL;
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *pos = NULL;
--
Michal Hocko
SUSE Labs

2015-07-09 11:51:31

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 1/8] memcg: export struct mem_cgroup

On Thu, Jul 09, 2015 at 01:22:39PM +0200, Michal Hocko wrote:
> On Wed 08-07-15 18:39:26, Vladimir Davydov wrote:
> > On Wed, Jul 08, 2015 at 02:27:45PM +0200, Michal Hocko wrote:
[...]
> > > +struct cg_proto {
> > > + struct page_counter memory_allocated; /* Current allocated memory. */
> > > + struct percpu_counter sockets_allocated; /* Current number of sockets. */
> > > + int memory_pressure;
> > > + long sysctl_mem[3];
> > > + unsigned long flags;
> > > + /*
> > > + * memcg field is used to find which memcg we belong directly
> > > + * Each memcg struct can hold more than one cg_proto, so container_of
> > > + * won't really cut.
> > > + *
> > > + * The elegant solution would be having an inverse function to
> > > + * proto_cgroup in struct proto, but that means polluting the structure
> > > + * for everybody, instead of just for memcg users.
> > > + */
> > > + struct mem_cgroup *memcg;
> > > +};
> >
> > I'd prefer to leave it where it is now. I don't see any reason why we
> > have to embed it into mem_cgroup, so maybe we'd better keep a pointer
> > to it in struct mem_cgroup instead?
>
> This patch is supposed to be minimal without any functional changes.
> Changing tcp_mem to a pointer would require allocation and freeing and that
> is out of the scope of this patch. Besides that I do not see any strong
> advantage in doing that.

OK, got it.

> > > extern struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
> >
> > It's a trivial one line function, so why not inline it too?
>
> Yes it is trivial but according to my notes it increased the code size
> by ~100B.

Ugh, it's surprising. I think this is because it's called from so many
places.

> > > -void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
> > > static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > > enum vm_event_item idx)
> > > {
> > > + struct mem_cgroup *memcg;
> > > +
> > > if (mem_cgroup_disabled())
> > > return;
> > > - __mem_cgroup_count_vm_event(mm, idx);
> > > +
> > > + rcu_read_lock();
> > > + memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > + if (unlikely(!memcg))
> > > + goto out;
> > > +
> > > + switch (idx) {
> > > + case PGFAULT:
> > > + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGFAULT]);
> > > + break;
> > > + case PGMAJFAULT:
> > > + this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PGMAJFAULT]);
> > > + break;
> > > + default:
> > > + BUG();
> >
> > This switch-case looks bulky and weird. Let's make this function accept
> > MEM_CGROUP_EVENTS_PGFAULT/PGMAJFAULT directly instead.
>
> Yes it looks ugly but I didn't intend to change it in this particular
> patch. I wouldn't mind a follow up cleanup patch.

OK, I'll probably do that.

[...]
> The current diff against the patch is:
> ---
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f5a8d0bbef8d..42f118ae04cf 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -28,6 +28,7 @@
> #include <linux/mmzone.h>
> #include <linux/writeback.h>
>
> +struct mem_cgroup;
> struct page;
> struct mm_struct;
> struct kmem_cache;
> @@ -83,6 +84,35 @@ enum mem_cgroup_events_target {
> MEM_CGROUP_NTARGETS,
> };
>
> +/*
> + * Bits in struct cg_proto.flags
> + */
> +enum cg_proto_flags {
> + /* Currently active and new sockets should be assigned to cgroups */
> + MEMCG_SOCK_ACTIVE,
> + /* It was ever activated; we must disarm static keys on destruction */
> + MEMCG_SOCK_ACTIVATED,
> +};
> +
> +struct cg_proto {
> + struct page_counter memory_allocated; /* Current allocated memory. */
> + struct percpu_counter sockets_allocated; /* Current number of sockets. */
> + int memory_pressure;
> + long sysctl_mem[3];
> + unsigned long flags;
> + /*
> + * memcg field is used to find which memcg we belong directly
> + * Each memcg struct can hold more than one cg_proto, so container_of
> + * won't really cut.
> + *
> + * The elegant solution would be having an inverse function to
> + * proto_cgroup in struct proto, but that means polluting the structure
> + * for everybody, instead of just for memcg users.
> + */
> + struct mem_cgroup *memcg;
> +};
> +
> +#ifdef CONFIG_MEMCG
> struct mem_cgroup_stat_cpu {
> long count[MEM_CGROUP_STAT_NSTATS];
> unsigned long events[MEMCG_NR_EVENTS];
> @@ -90,7 +120,7 @@ struct mem_cgroup_stat_cpu {
> unsigned long targets[MEM_CGROUP_NTARGETS];
> };
>
> -struct reclaim_iter {
> +struct mem_cgroup_reclaim_iter {
> struct mem_cgroup *position;
> /* scan generation, increased every round-trip */
> unsigned int generation;
> @@ -103,7 +133,7 @@ struct mem_cgroup_per_zone {
> struct lruvec lruvec;
> unsigned long lru_size[NR_LRU_LISTS];
>
> - struct reclaim_iter iter[DEF_PRIORITY + 1];
> + struct mem_cgroup_reclaim_iter iter[DEF_PRIORITY + 1];
>
> struct rb_node tree_node; /* RB tree node */
> unsigned long usage_in_excess;/* Set to the value by which */
> @@ -144,35 +174,6 @@ struct mem_cgroup_thresholds {
> };
>
> /*
> - * Bits in struct cg_proto.flags
> - */
> -enum cg_proto_flags {
> - /* Currently active and new sockets should be assigned to cgroups */
> - MEMCG_SOCK_ACTIVE,
> - /* It was ever activated; we must disarm static keys on destruction */
> - MEMCG_SOCK_ACTIVATED,
> -};
> -
> -struct cg_proto {
> - struct page_counter memory_allocated; /* Current allocated memory. */
> - struct percpu_counter sockets_allocated; /* Current number of sockets. */
> - int memory_pressure;
> - long sysctl_mem[3];
> - unsigned long flags;
> - /*
> - * memcg field is used to find which memcg we belong directly
> - * Each memcg struct can hold more than one cg_proto, so container_of
> - * won't really cut.
> - *
> - * The elegant solution would be having an inverse function to
> - * proto_cgroup in struct proto, but that means polluting the structure
> - * for everybody, instead of just for memcg users.
> - */
> - struct mem_cgroup *memcg;
> -};
> -
> -#ifdef CONFIG_MEMCG
> -/*
> * The memory controller data structure. The memory controller controls both
> * page cache and RSS per cgroup. We would eventually like to provide
> * statistics based on the statistics developed by Rik Van Riel for clock-pro,
> @@ -735,7 +736,10 @@ static inline bool memcg_kmem_enabled(void)
> return static_key_false(&memcg_kmem_enabled_key);
> }
>
> -bool memcg_kmem_is_active(struct mem_cgroup *memcg);
> +static inline bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> +{
> + return memcg->kmem_acct_active;
> +}
>
> /*
> * In general, we'll do everything in our power to not incur in any overhead
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 759ec413e72c..a3543dedc153 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -184,13 +184,6 @@ struct mem_cgroup_event {
> static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>
> -#ifdef CONFIG_MEMCG_KMEM
> -bool memcg_kmem_is_active(struct mem_cgroup *memcg)
> -{
> - return memcg->kmem_acct_active;
> -}
> -#endif
> -
> /* Stuffs for move charges at task migration. */
> /*
> * Types of charges to be moved.
> @@ -841,7 +834,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> struct mem_cgroup *prev,
> struct mem_cgroup_reclaim_cookie *reclaim)
> {
> - struct reclaim_iter *uninitialized_var(iter);
> + struct mem_cgroup_reclaim_iter *uninitialized_var(iter);
> struct cgroup_subsys_state *css = NULL;
> struct mem_cgroup *memcg = NULL;
> struct mem_cgroup *pos = NULL;

Reviewed-by: Vladimir Davydov <[email protected]>

Thanks,
Vladimir

2015-07-09 12:08:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 4/8] memcg, mm: move mem_cgroup_select_victim_node into vmscan

On Wed 08-07-15 19:01:59, Vladimir Davydov wrote:
> On Wed, Jul 08, 2015 at 02:27:48PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > We currently have only one caller of mem_cgroup_select_victim_node which
> > is sitting in mm/vmscan.c and which is already wrapped by CONFIG_MEMCG
> > ifdef. Now that we have struct mem_cgroup visible outside of
> > mm/memcontrol.c we can move the function and its dependencies there.
> > This even shrinks the code size by few bytes:
> >
> > text data bss dec hex filename
> > 478509 65806 26384 570699 8b54b mm/built-in.o.before
> > 478445 65806 26384 570635 8b50b mm/built-in.o.after
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> I dislike this patch, because I don't see any reason why logic specific
> to per memcg reclaim should live in the file representing the global
> reclaim path.

Well the idea was that mem_cgroup_select_victim_node is specific to
try_to_free_mem_cgroup_pages. It is basically a split-up of an otherwise
large function for readability. The same applies to
mem_cgroup_may_update_nodemask. Having that code together makes some
sense to me.

On the other hand I do agree that at least
test_mem_cgroup_node_reclaimable is generally reusable and so it
shouldn't live in vmscan.c. I can move it back to memcontrol.c but that
leaves the generated code much worse.

Fair enough then, I will drop this patch.
--
Michal Hocko
SUSE Labs

2015-07-09 14:09:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Wed 08-07-15 20:32:51, Vladimir Davydov wrote:
> I like the gist of this patch. A few comments below.
>
> On Wed, Jul 08, 2015 at 02:27:51PM +0200, Michal Hocko wrote:
[...]
> > +/**
> > + * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
> > + * @newmm: new mm struct
> > + * @oldmm: old mm struct to inherit from
> > + *
> > + * Should be called for each new mm_struct.
> > + */
> > +static inline
> > +void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> > +{
> > + struct mem_cgroup *memcg = oldmm->memcg;
>
> FWIW, if CONFIG_SPARSE_RCU_POINTER is on, this will trigger a compile
> time warning, as well as any unannotated dereference of mm_struct->memcg
> below.

The idea was that this would be a false positive because the
oldmm->memcg should be stable. But now that I am reading your race
scenario below I am not so sure anymore and we may need to use rcu
locking here. More below

>
> > +
> > + __mm_set_memcg(newmm, memcg);
> > +}
> > +
> > +/**
> > + * mm_drop_iter - drop mm_struct::memcg association
>
> s/mm_drop_iter/mm_drop_memcg

Thanks

>
> > + * @mm: mm struct
> > + *
> > + * Should be called after the mm has been removed from all tasks
> > + * and before it is freed (e.g. from mmput)
> > + */
> > +static inline void mm_drop_memcg(struct mm_struct *mm)
> > +{
> > + if (mm->memcg)
> > + css_put(&mm->memcg->css);
> > + mm->memcg = NULL;
> > +}
[...]
> > @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > return;
> >
> > rcu_read_lock();
> > - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > + memcg = rcu_dereference(mm->memcg);
> > if (unlikely(!memcg))
> > goto out;
> >
>
> If I'm not mistaken, mm->memcg equals NULL for any task in the root
> memory cgroup

right

> (BTW, it it's true, it's worth mentioning in the comment
> to mm->memcg definition IMO). As a result, we won't account the stats
> for such tasks, will we?

Well spotted! This is certainly a bug. There are more places which are
checking for mm->memcg being NULL and falling back to root_mem_cgroup. I
think it would be better to simply use root_mem_cgroup right away. We
can set up init_mm.memcg = root_mem_cgroup during initialization and be
done with it. What do you think? The diff is at the very end of the
email (completely untested yet).

[...]
> > + * No need to take a reference here because the memcg is pinned by the
> > + * mm_struct.
> > + */
>
> But after we drop the reference to the mm below, mc.from can pass away
> and we can get use-after-free in mem_cgroup_move_task, can't we?

Right, the comment is a leftover from the previous attempt when I was
holding the reference throughout the migration.
But then I managed to convince myself that...

> AFAIU the real reason why we can skip taking a reference to mc.from, as
> well as to mc.to, is that task migration proceeds under cgroup_mutex,
> which blocks cgroup destruction.

is true. But now that I am thinking about that again I think I just
misled myself. If a task p is moving from A -> B but p->mm->memcg = C
then we are not protected. I will think about this some more.

[...]

> > @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
> > {
> > struct task_struct *p = cgroup_taskset_first(tset);
> > struct mm_struct *mm = get_task_mm(p);
> > + struct mem_cgroup *old_memcg = NULL;
> >
> > if (mm) {
> > + old_memcg = READ_ONCE(mm->memcg);
> > + __mm_set_memcg(mm, mem_cgroup_from_css(css));
> > +
> > if (mc.to)
> > mem_cgroup_move_charge(mm);
> > mmput(mm);
> > }
> > if (mc.to)
> > mem_cgroup_clear_mc();
> > +
> > + /*
> > + * Be careful and drop the reference only after we are done because
> > + * p's task_css memcg might be different from p->memcg and nothing else
> > + * might be pinning the old memcg.
> > + */
> > + if (old_memcg)
> > + css_put(&old_memcg->css);
>
> Please explain why the following race is impossible:
>
> CPU0 CPU1
> ---- ----
> [current = T]
> dup_mm or exec_mmap
> mm_inherit_memcg
> memcg = current->mm->memcg;
> mem_cgroup_move_task
> p = T;
> mm = get_task_mm(p);
> old_memcg = mm->memcg;
> css_put(&old_memcg->css);
> /* old_memcg can be freed now */
> css_get(memcg); /* BUG */

I guess you are right. The window seems to be very small but CPU0 might
simply get preempted by the moving task and so even cgroup pinning
wouldn't help here.

I guess we need
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b3e7e30b5a74..6fbd33273b6d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -300,9 +300,17 @@ void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
static inline
void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
{
- struct mem_cgroup *memcg = oldmm->memcg;
+ struct mem_cgroup *memcg;

+ /*
+ * oldmm might be under move and just replacing its memcg (see
+ * mem_cgroup_move_task) so we have to protect from its memcg
+ * going away between the time we dereference it and take a reference.
+ */
+ rcu_read_lock();
+ memcg = rcu_dereference(oldmm->memcg);
__mm_set_memcg(newmm, memcg);
+ rcu_read_unlock();
}

/**


Make sure that all tasks have non NULL memcg.
---
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f23e29f3d4fa..b3e7e30b5a74 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -286,8 +286,7 @@ extern struct cgroup_subsys_state *mem_cgroup_root_css;
static inline
void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
{
- if (memcg)
- css_get(&memcg->css);
+ css_get(&memcg->css);
rcu_assign_pointer(mm->memcg, memcg);
}

@@ -307,7 +306,7 @@ void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
}

/**
- * mm_drop_iter - drop mm_struct::memcg association
+ * mm_drop_memcg - drop mm_struct::memcg association
* @mm: mm struct
*
* Should be called after the mm has been removed from all tasks
@@ -315,8 +314,7 @@ void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
*/
static inline void mm_drop_memcg(struct mm_struct *mm)
{
- if (mm->memcg)
- css_put(&mm->memcg->css);
+ css_put(&mm->memcg->css);
mm->memcg = NULL;
}

@@ -382,8 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,

rcu_read_lock();
task_memcg = rcu_dereference(mm->memcg);
- if (task_memcg)
- match = mem_cgroup_is_descendant(task_memcg, memcg);
+ match = mem_cgroup_is_descendant(task_memcg, memcg);
rcu_read_unlock();
return match;
}
@@ -526,8 +523,6 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,

rcu_read_lock();
memcg = rcu_dereference(mm->memcg);
- if (unlikely(!memcg))
- goto out;

switch (idx) {
case PGFAULT:
@@ -539,7 +534,6 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
default:
BUG();
}
-out:
rcu_read_unlock();
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7169a9f7a47..23ee92c396e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -801,11 +801,8 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
*/
if (unlikely(!mm))
memcg = root_mem_cgroup;
- else {
+ else
memcg = rcu_dereference(mm->memcg);
- if (unlikely(!memcg))
- memcg = root_mem_cgroup;
- }
} while (!css_tryget_online(&memcg->css));
rcu_read_unlock();
return memcg;
@@ -4176,6 +4173,11 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
memcg->soft_limit = PAGE_COUNTER_MAX;
page_counter_init(&memcg->memsw, NULL);
page_counter_init(&memcg->kmem, NULL);
+ /*
+ * Make sure all tasks will inherit root_mem_cgroup
+ * implicitly.
+ */
+ __mm_set_memcg(&init_mm, root_mem_cgroup);
}

memcg->last_scanned_node = MAX_NUMNODES;
@@ -4787,8 +4789,6 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
* mm_struct.
*/
from = READ_ONCE(mm->memcg);
- if (!from)
- from = root_mem_cgroup;
if (from == to)
goto out;

@@ -4979,8 +4979,7 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
* p's task_css memcg might be different from p->memcg and nothing else
* might be pinning the old memcg.
*/
- if (old_memcg)
- css_put(&old_memcg->css);
+ css_put(&old_memcg->css);
}
#else /* !CONFIG_MMU */
static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,

--
Michal Hocko
SUSE Labs

2015-07-09 14:13:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg: get rid of mem_cgroup_from_task

On Wed 08-07-15 20:43:31, Vladimir Davydov wrote:
> On Wed, Jul 08, 2015 at 02:27:52PM +0200, Michal Hocko wrote:
[...]
> > @@ -1091,12 +1079,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
> > task_unlock(p);
> > } else {
> > /*
> > - * All threads may have already detached their mm's, but the oom
> > - * killer still needs to detect if they have already been oom
> > - * killed to prevent needlessly killing additional tasks.
> > + * All threads have already detached their mm's but we should
> > + * still be able to at least guess the original memcg from the
> > + * task_css. These two will match most of the time but there are
> > + * corner cases where task->mm and task_css refer to a different
> > + * cgroups.
> > */
> > rcu_read_lock();
> > - task_memcg = mem_cgroup_from_task(task);
> > + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> > css_get(&task_memcg->css);
>
> I wonder why it's safe to call css_get here.

What do you mean by safe? Memcg cannot go away because we are under rcu
lock.

>
> The patch itself looks good though,
>
> Reviewed-by: Vladimir Davydov <[email protected]>

Thanks!
--
Michal Hocko
SUSE Labs

2015-07-09 14:33:13

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg: get rid of mem_cgroup_from_task

On Thu, Jul 09, 2015 at 04:13:21PM +0200, Michal Hocko wrote:
> On Wed 08-07-15 20:43:31, Vladimir Davydov wrote:
> > On Wed, Jul 08, 2015 at 02:27:52PM +0200, Michal Hocko wrote:
> [...]
> > > @@ -1091,12 +1079,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
> > > task_unlock(p);
> > > } else {
> > > /*
> > > - * All threads may have already detached their mm's, but the oom
> > > - * killer still needs to detect if they have already been oom
> > > - * killed to prevent needlessly killing additional tasks.
> > > + * All threads have already detached their mm's but we should
> > > + * still be able to at least guess the original memcg from the
> > > + * task_css. These two will match most of the time but there are
> > > + * corner cases where task->mm and task_css refer to a different
> > > + * cgroups.
> > > */
> > > rcu_read_lock();
> > > - task_memcg = mem_cgroup_from_task(task);
> > > + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> > > css_get(&task_memcg->css);
> >
> > I wonder why it's safe to call css_get here.
>
> What do you mean by safe? Memcg cannot go away because we are under rcu
> lock.

No, it can't, but css->refcnt can reach zero while we are here, can't
it? If it happens, css->refcnt.release will be called twice, which will
have very bad consequences. I think it's OK to call css_tryget{_online}
from an RCU read-side section, but not css_get. Am I missing something?
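
To illustrate what I mean, a lookup done this way would be safe because
the tryget backs off once the refcount has already dropped to zero (just
a sketch of the pattern, reusing the names from the hunk above):

    struct mem_cgroup *task_memcg;

    rcu_read_lock();
    task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
    /* a bare css_get() here could resurrect a css whose refcount is 0 */
    if (!css_tryget_online(&task_memcg->css))
        task_memcg = NULL;   /* the caller would have to handle this */
    rcu_read_unlock();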

Thanks,
Vladimir

2015-07-09 16:33:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 8/8] memcg: get rid of mem_cgroup_from_task

On Thu 09-07-15 17:32:47, Vladimir Davydov wrote:
> On Thu, Jul 09, 2015 at 04:13:21PM +0200, Michal Hocko wrote:
> > On Wed 08-07-15 20:43:31, Vladimir Davydov wrote:
> > > On Wed, Jul 08, 2015 at 02:27:52PM +0200, Michal Hocko wrote:
> > [...]
> > > > @@ -1091,12 +1079,14 @@ bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
> > > > task_unlock(p);
> > > > } else {
> > > > /*
> > > > - * All threads may have already detached their mm's, but the oom
> > > > - * killer still needs to detect if they have already been oom
> > > > - * killed to prevent needlessly killing additional tasks.
> > > > + * All threads have already detached their mm's but we should
> > > > + * still be able to at least guess the original memcg from the
> > > > + * task_css. These two will match most of the time but there are
> > > > + * corner cases where task->mm and task_css refer to a different
> > > > + * cgroups.
> > > > */
> > > > rcu_read_lock();
> > > > - task_memcg = mem_cgroup_from_task(task);
> > > > + task_memcg = mem_cgroup_from_css(task_css(task, memory_cgrp_id));
> > > > css_get(&task_memcg->css);
> > >
> > > I wonder why it's safe to call css_get here.
> >
> > What do you mean by safe? Memcg cannot go away because we are under rcu
> > lock.
>
> No, it can't, but css->refcnt can reach zero while we are here, can't
> it? If it happens, css->refcnt.release will be called twice, which will
> have very bad consequences. I think it's OK to call css_tryget{_online}
> from an RCU read-side section, but not css_get. Am I missing something?

OK, now I see what you mean. This is a good question indeed. This code has been
like that for quite a while and I took it for granted. I have to think
about it some more. Anyway the patch doesn't change the behavior here.
--
Michal Hocko
SUSE Labs

2015-07-10 07:54:21

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Thu, Jul 09, 2015 at 04:09:41PM +0200, Michal Hocko wrote:
> On Wed 08-07-15 20:32:51, Vladimir Davydov wrote:
> > On Wed, Jul 08, 2015 at 02:27:51PM +0200, Michal Hocko wrote:
[...]
> > > @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > > return;
> > >
> > > rcu_read_lock();
> > > - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > + memcg = rcu_dereference(mm->memcg);
> > > if (unlikely(!memcg))
> > > goto out;
> > >
> >
> > If I'm not mistaken, mm->memcg equals NULL for any task in the root
> > memory cgroup
>
> right
>
> > (BTW, it it's true, it's worth mentioning in the comment
> > to mm->memcg definition IMO). As a result, we won't account the stats
> > for such tasks, will we?
>
> well spotted! This is certainly a bug. There are more places which are
> checking for mm->memcg being NULL and falling back to root_mem_cgroup. I
> think it would be better to simply use root_mem_cgroup right away. We
> can setup init_mm.memcg = root_mem_cgroup during initialization and be
> done with it. What do you think? The diff is in the very end of the
> email (completely untested yet).

I'd prefer initializing init_mm.memcg to root_mem_cgroup. This way we
wouldn't have to check whether mm->memcg is NULL or not here and there,
which would make the code cleaner IMO.

[...]
> > > @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
> > > {
> > > struct task_struct *p = cgroup_taskset_first(tset);
> > > struct mm_struct *mm = get_task_mm(p);
> > > + struct mem_cgroup *old_memcg = NULL;
> > >
> > > if (mm) {
> > > + old_memcg = READ_ONCE(mm->memcg);
> > > + __mm_set_memcg(mm, mem_cgroup_from_css(css));
> > > +
> > > if (mc.to)
> > > mem_cgroup_move_charge(mm);
> > > mmput(mm);
> > > }
> > > if (mc.to)
> > > mem_cgroup_clear_mc();
> > > +
> > > + /*
> > > + * Be careful and drop the reference only after we are done because
> > > + * p's task_css memcg might be different from p->memcg and nothing else
> > > + * might be pinning the old memcg.
> > > + */
> > > + if (old_memcg)
> > > + css_put(&old_memcg->css);
> >
> > Please explain why the following race is impossible:
> >
> > CPU0 CPU1
> > ---- ----
> > [current = T]
> > dup_mm or exec_mmap
> > mm_inherit_memcg
> > memcg = current->mm->memcg;
> > mem_cgroup_move_task
> > p = T;
> > mm = get_task_mm(p);
> > old_memcg = mm->memcg;
> > css_put(&old_memcg->css);
> > /* old_memcg can be freed now */
> > css_get(memcg); /* BUG */
>
> I guess you are right. The window seem to be very small but CPU0 simly
> might get preempted by the moving task and so even cgroup pinning
> wouldn't help here.
>
> I guess we need
> ---
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b3e7e30b5a74..6fbd33273b6d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -300,9 +300,17 @@ void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> static inline
> void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> {
> - struct mem_cgroup *memcg = oldmm->memcg;
> + struct mem_cgroup *memcg;
>
> + /*
> + * oldmm might be under move and just replacing its memcg (see
> + * mem_cgroup_move_task) so we have to protect from its memcg
> + * going away between we dereference and take a reference.
> + */
> + rcu_read_lock();
> + memcg = rcu_dereference(oldmm->memcg);
> __mm_set_memcg(newmm, memcg);

If it's safe to call css_get under rcu_read_lock, then it's OK,
otherwise we probably need to use a do {} while (!css_tryget(memcg))
loop in __mm_set_memcg.
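
I.e. something along these lines (untested sketch, open coded in
mm_inherit_memcg just for illustration):

    struct mem_cgroup *memcg;

    rcu_read_lock();
    do {
        memcg = rcu_dereference(oldmm->memcg);
    } while (memcg && !css_tryget(&memcg->css));
    rcu_read_unlock();
    /* the reference is already taken above, so assign directly */
    rcu_assign_pointer(newmm->memcg, memcg);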

> + rcu_read_unlock();
> }
>
> /**
>
>
> Make sure that all tasks have non NULL memcg.
[...]

That looks better to me.

Thanks,
Vladimir

2015-07-10 12:53:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Fri 10-07-15 10:54:00, Vladimir Davydov wrote:
> On Thu, Jul 09, 2015 at 04:09:41PM +0200, Michal Hocko wrote:
> > On Wed 08-07-15 20:32:51, Vladimir Davydov wrote:
> > > On Wed, Jul 08, 2015 at 02:27:51PM +0200, Michal Hocko wrote:
> [...]
> > > > @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > > > return;
> > > >
> > > > rcu_read_lock();
> > > > - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > > + memcg = rcu_dereference(mm->memcg);
> > > > if (unlikely(!memcg))
> > > > goto out;
> > > >
> > >
> > > If I'm not mistaken, mm->memcg equals NULL for any task in the root
> > > memory cgroup
> >
> > right
> >
> > > (BTW, it it's true, it's worth mentioning in the comment
> > > to mm->memcg definition IMO). As a result, we won't account the stats
> > > for such tasks, will we?
> >
> > well spotted! This is certainly a bug. There are more places which are
> > checking for mm->memcg being NULL and falling back to root_mem_cgroup. I
> > think it would be better to simply use root_mem_cgroup right away. We
> > can setup init_mm.memcg = root_mem_cgroup during initialization and be
> > done with it. What do you think? The diff is in the very end of the
> > email (completely untested yet).
>
> I'd prefer initializing init_mm.memcg to root_mem_cgroup. This way we
> wouldn't have to check whether mm->memcg is NULL or not here and there,
> which would make the code cleaner IMO.

So the patch I've posted will not work, as a simple boot test told me. We
are initializing root_mem_cgroup too late, so this will be more complicated.
I will leave this idea outside of this patch series and will come up
with a separate patch which cleans this up later. I will update the
doc to discourage any use of mm->memcg outside of memcg and to use accessor
functions instead. There is only one such user currently (mm/debug.c) and
it only prints the pointer, which is safe.

--
Michal Hocko
SUSE Labs

2015-07-10 14:05:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

JFYI: I've found some more issues while hammering this some more. Please
ignore this and the follow-up patch for now. If others are OK with the
cleanups preceding this patch I will repost with the changes based on
the feedback so far and let them be merged into the mm tree before I
settle with this much trickier part.

On Wed 08-07-15 14:27:51, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> mm_struct::owner keeps track of the task which is in charge for the
> specific mm. This is usually the thread group leader of the process but
> there are exotic cases where this doesn't hold.
>
> The most prominent one is when separate tasks (not in the same thread
> group) share the address space (by using clone with CLONE_VM without
> CLONE_THREAD). The first task will be the owner until it exits.
> mm_update_next_owner will then try to find a new owner - a task which
> points to the same mm_struct. There is no guarantee a new owner will
> be a thread group leader though because the leader for that thread
> group might have exited. Even though such a thread will be still around
> waiting for the remaining threads from its group, its mm will be NULL
> so it cannot be chosen.
>
> cgroup migration code, however, assumes only group leaders when migrating
> via cgroup.procs (which will be the only mode in the unified hierarchy
> API) while mem_cgroup_can_attach considers only those tasks which are
> owner of the mm. So we might end up with tasks which cannot be migrated.
> mm_update_next_owner could be tweaked to try harder and use a group
> leader whenever possible but this will never be 100% because all the
> leaders might be dead. It seems that getting rid of the mm->owner sounds
> like a better and less hacky option.
>
> The whole concept of the mm owner is a bit artificial and too tricky to
> get right. All the memcg code needs is to find struct mem_cgroup from
> a given mm_struct and there are only two events when the association
> is either built or changed
> - a new mm is created - dup_mm resp. exec_mmap - when the memcg
> is inherited from the oldmm
> - task associated with the mm is moved to another memcg
> So it is much easier to bind mm_struct with the mem_cgroup directly
> rather than indirectly via a task. This is exactly what this patch does.
>
> mm_inherit_memcg and mm_drop_memcg are exported for the core kernel
> to bind an old memcg during dup_mm (fork) resp. exec_mmap (exec) and
> releasing that memcg in mmput after the last reference is dropped and no
> task sees the mm anymore. We have to be careful and take a reference to
> the memcg->css so that it doesn't vanish from under our feet.
>
> The only remaining part is to catch task migration and change the
> association. This is done in mem_cgroup_move_task before charges get
> moved because mem_cgroup_can_attach is too early and other controllers
> might fail and we would have to handle the rollback.
>
> mm->memcg conforms to standard mem_cgroup locking rules. It has to be
> used inside rcu_read_{un}lock() and a reference has to be taken before the
> unlock if the memcg is supposed to be used outside.
>
> Finally mem_cgroup_can_attach will allow task migration only for the
> thread group leaders to conform with cgroup core requirements.
>
> Please note that this patch introduces a USER VISIBLE CHANGE OF BEHAVIOR.
> Without mm->owner _all_ tasks (group leaders to be precise) associated
> with the mm_struct would initiate memcg migration while previously
> only owner of the mm_struct could do that. The original behavior was
> awkward though because the user task didn't have any means to find out
> the current owner (esp. after mm_update_next_owner) so the migration
> behavior was not well defined in general.
> New cgroup API (unified hierarchy) will discontinue tasks cgroup file
> which means that migrating threads will no longer be possible. In such
> a case having CLONE_VM without CLONE_THREAD could emulate the thread
> behavior but this patch prevents from isolating memcg controllers from
> others. Nevertheless I am not convinced such a use case would really
> deserve complications on the memcg code side.
>
> Suggested-by: Oleg Nesterov <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/exec.c | 2 +-
> include/linux/memcontrol.h | 58 ++++++++++++++++++++++++--
> include/linux/mm_types.h | 12 +-----
> kernel/exit.c | 89 ---------------------------------------
> kernel/fork.c | 10 +----
> mm/debug.c | 4 +-
> mm/memcontrol.c | 101 ++++++++++++++++++++++++++++-----------------
> 7 files changed, 123 insertions(+), 153 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 1977c2a553ac..3ed9c0abc9f5 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -870,7 +870,7 @@ static int exec_mmap(struct mm_struct *mm)
> up_read(&old_mm->mmap_sem);
> BUG_ON(active_mm != old_mm);
> setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
> - mm_update_next_owner(old_mm);
> + mm_inherit_memcg(mm, old_mm);
> mmput(old_mm);
> return 0;
> }
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 78e9d4ac57a1..8e6b2444ebfe 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -274,6 +274,52 @@ struct mem_cgroup {
> extern struct cgroup_subsys_state *mem_cgroup_root_css;
>
> /**
> + * __mm_set_memcg - Set mm_struct:memcg to a given memcg.
> + * @mm: mm struct
> + * @memcg: mem_cgroup to be used
> + *
> + * Note that this function doesn't clean up the previous mm->memcg.
> + * This should be done by caller when necessary (e.g. when moving
> + * mm from one memcg to another).
> + */
> +static inline
> +void __mm_set_memcg(struct mm_struct *mm, struct mem_cgroup *memcg)
> +{
> + if (memcg)
> + css_get(&memcg->css);
> + rcu_assign_pointer(mm->memcg, memcg);
> +}
> +
> +/**
> + * mm_inherit_memcg - Initialize mm_struct::memcg from an existing mm_struct
> + * @newmm: new mm struct
> + * @oldmm: old mm struct to inherit from
> + *
> + * Should be called for each new mm_struct.
> + */
> +static inline
> +void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> + struct mem_cgroup *memcg = oldmm->memcg;
> +
> + __mm_set_memcg(newmm, memcg);
> +}
> +
> +/**
> + * mm_drop_iter - drop mm_struct::memcg association
> + * @mm: mm struct
> + *
> + * Should be called after the mm has been removed from all tasks
> + * and before it is freed (e.g. from mmput)
> + */
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> + if (mm->memcg)
> + css_put(&mm->memcg->css);
> + mm->memcg = NULL;
> +}
> +
> +/**
> * mem_cgroup_events - count memory events against a cgroup
> * @memcg: the memory cgroup
> * @idx: the event index
> @@ -305,7 +351,6 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>
> struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page);
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
>
> struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
> static inline
> @@ -335,7 +380,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
> bool match = false;
>
> rcu_read_lock();
> - task_memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + task_memcg = rcu_dereference(mm->memcg);
> if (task_memcg)
> match = mem_cgroup_is_descendant(task_memcg, memcg);
> rcu_read_unlock();
> @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> return;
>
> rcu_read_lock();
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (unlikely(!memcg))
> goto out;
>
> @@ -498,6 +543,13 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> #else /* CONFIG_MEMCG */
> struct mem_cgroup;
>
> +static inline void mm_inherit_memcg(struct mm_struct *newmm, struct mm_struct *oldmm)
> +{
> +}
> +static inline void mm_drop_memcg(struct mm_struct *mm)
> +{
> +}
> +
> static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> enum mem_cgroup_events_index idx,
> unsigned int nr)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index f6266742ce1f..93dc8cb9c636 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -426,17 +426,7 @@ struct mm_struct {
> struct kioctx_table __rcu *ioctx_table;
> #endif
> #ifdef CONFIG_MEMCG
> - /*
> - * "owner" points to a task that is regarded as the canonical
> - * user/owner of this mm. All of the following must be true in
> - * order for it to be changed:
> - *
> - * current == mm->owner
> - * current->mm != mm
> - * new_owner->mm == mm
> - * new_owner->alloc_lock is held
> - */
> - struct task_struct __rcu *owner;
> + struct mem_cgroup __rcu *memcg;
> #endif
>
> /* store ref to file /proc/<pid>/exe symlink points to */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 185752a729f6..339554612677 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -292,94 +292,6 @@ kill_orphaned_pgrp(struct task_struct *tsk, struct task_struct *parent)
> }
> }
>
> -#ifdef CONFIG_MEMCG
> -/*
> - * A task is exiting. If it owned this mm, find a new owner for the mm.
> - */
> -void mm_update_next_owner(struct mm_struct *mm)
> -{
> - struct task_struct *c, *g, *p = current;
> -
> -retry:
> - /*
> - * If the exiting or execing task is not the owner, it's
> - * someone else's problem.
> - */
> - if (mm->owner != p)
> - return;
> - /*
> - * The current owner is exiting/execing and there are no other
> - * candidates. Do not leave the mm pointing to a possibly
> - * freed task structure.
> - */
> - if (atomic_read(&mm->mm_users) <= 1) {
> - mm->owner = NULL;
> - return;
> - }
> -
> - read_lock(&tasklist_lock);
> - /*
> - * Search in the children
> - */
> - list_for_each_entry(c, &p->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search in the siblings
> - */
> - list_for_each_entry(c, &p->real_parent->children, sibling) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - }
> -
> - /*
> - * Search through everything else, we should not get here often.
> - */
> - for_each_process(g) {
> - if (g->flags & PF_KTHREAD)
> - continue;
> - for_each_thread(g, c) {
> - if (c->mm == mm)
> - goto assign_new_owner;
> - if (c->mm)
> - break;
> - }
> - }
> - read_unlock(&tasklist_lock);
> - /*
> - * We found no owner yet mm_users > 1: this implies that we are
> - * most likely racing with swapoff (try_to_unuse()) or /proc or
> - * ptrace or page migration (get_task_mm()). Mark owner as NULL.
> - */
> - mm->owner = NULL;
> - return;
> -
> -assign_new_owner:
> - BUG_ON(c == p);
> - get_task_struct(c);
> - /*
> - * The task_lock protects c->mm from changing.
> - * We always want mm->owner->mm == mm
> - */
> - task_lock(c);
> - /*
> - * Delay read_unlock() till we have the task_lock()
> - * to ensure that c does not slip away underneath us
> - */
> - read_unlock(&tasklist_lock);
> - if (c->mm != mm) {
> - task_unlock(c);
> - put_task_struct(c);
> - goto retry;
> - }
> - mm->owner = c;
> - task_unlock(c);
> - put_task_struct(c);
> -}
> -#endif /* CONFIG_MEMCG */
> -
> /*
> * Turn us into a lazy TLB process if we
> * aren't already..
> @@ -433,7 +345,6 @@ static void exit_mm(struct task_struct *tsk)
> up_read(&mm->mmap_sem);
> enter_lazy_tlb(mm, current);
> task_unlock(tsk);
> - mm_update_next_owner(mm);
> mmput(mm);
> if (test_thread_flag(TIF_MEMDIE))
> exit_oom_victim();
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 16e0f872f084..d073b6249d98 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -570,13 +570,6 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
> -{
> -#ifdef CONFIG_MEMCG
> - mm->owner = p;
> -#endif
> -}
> -
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> {
> mm->mmap = NULL;
> @@ -596,7 +589,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> spin_lock_init(&mm->page_table_lock);
> mm_init_cpumask(mm);
> mm_init_aio(mm);
> - mm_init_owner(mm, p);
> mmu_notifier_mm_init(mm);
> clear_tlb_flush_pending(mm);
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> @@ -702,6 +694,7 @@ void mmput(struct mm_struct *mm)
> }
> if (mm->binfmt)
> module_put(mm->binfmt->module);
> + mm_drop_memcg(mm);
> mmdrop(mm);
> }
> }
> @@ -926,6 +919,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk)
> if (mm->binfmt && !try_module_get(mm->binfmt->module))
> goto free_pt;
>
> + mm_inherit_memcg(mm, oldmm);
> return mm;
>
> free_pt:
> diff --git a/mm/debug.c b/mm/debug.c
> index 3eb3ac2fcee7..d0347a168651 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -184,7 +184,7 @@ void dump_mm(const struct mm_struct *mm)
> "ioctx_table %p\n"
> #endif
> #ifdef CONFIG_MEMCG
> - "owner %p "
> + "memcg %p "
> #endif
> "exe_file %p\n"
> #ifdef CONFIG_MMU_NOTIFIER
> @@ -218,7 +218,7 @@ void dump_mm(const struct mm_struct *mm)
> mm->ioctx_table,
> #endif
> #ifdef CONFIG_MEMCG
> - mm->owner,
> + mm->memcg,
> #endif
> mm->exe_file,
> #ifdef CONFIG_MMU_NOTIFIER
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 19ffae804076..4069ec8f52be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -294,6 +294,18 @@ static inline struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> return mem_cgroup_from_css(css);
> }
>
> +static struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> +{
> + if (p->mm)
> + return rcu_dereference(p->mm->memcg);
> +
> + /*
> + * If the process doesn't have mm struct anymore we have to fallback
> + * to the task_css.
> + */
> + return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> +}
> +
> /* Writing them here to avoid exposing memcg's inner layout */
> #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
>
> @@ -783,19 +795,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> }
> }
>
> -struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
> -{
> - /*
> - * mm_update_next_owner() may clear mm->owner to NULL
> - * if it races with swapoff, page migration, etc.
> - * So this can be called with p == NULL.
> - */
> - if (unlikely(!p))
> - return NULL;
> -
> - return mem_cgroup_from_css(task_css(p, memory_cgrp_id));
> -}
> -
> static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> {
> struct mem_cgroup *memcg = NULL;
> @@ -810,7 +809,7 @@ static struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> if (unlikely(!mm))
> memcg = root_mem_cgroup;
> else {
> - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> + memcg = rcu_dereference(mm->memcg);
> if (unlikely(!memcg))
> memcg = root_mem_cgroup;
> }
> @@ -2286,7 +2285,7 @@ void __memcg_kmem_put_cache(struct kmem_cache *cachep)
> }
>
> /*
> - * We need to verify if the allocation against current->mm->owner's memcg is
> + * We need to verify if the allocation against current->mm->memcg is
> * possible for the given order. But the page is not allocated yet, so we'll
> * need a further commit step to do the final arrangements.
> *
> @@ -4737,7 +4736,7 @@ static void mem_cgroup_clear_mc(void)
> static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> struct cgroup_taskset *tset)
> {
> - struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> + struct mem_cgroup *to = mem_cgroup_from_css(css);
> struct mem_cgroup *from;
> struct task_struct *p;
> struct mm_struct *mm;
> @@ -4749,37 +4748,49 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> * tunable will only affect upcoming migrations, not the current one.
> * So we need to save it, and keep it going.
> */
> - move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> + move_flags = READ_ONCE(to->move_charge_at_immigrate);
> if (!move_flags)
> return 0;
>
> p = cgroup_taskset_first(tset);
> - from = mem_cgroup_from_task(p);
> -
> - VM_BUG_ON(from == memcg);
> + if (!thread_group_leader(p))
> + return 0;
>
> mm = get_task_mm(p);
> if (!mm)
> return 0;
> - /* We move charges only when we move a owner of the mm */
> - if (mm->owner == p) {
> - VM_BUG_ON(mc.from);
> - VM_BUG_ON(mc.to);
> - VM_BUG_ON(mc.precharge);
> - VM_BUG_ON(mc.moved_charge);
> - VM_BUG_ON(mc.moved_swap);
> -
> - spin_lock(&mc.lock);
> - mc.from = from;
> - mc.to = memcg;
> - mc.flags = move_flags;
> - spin_unlock(&mc.lock);
> - /* We set mc.moving_task later */
> -
> - ret = mem_cgroup_precharge_mc(mm);
> - if (ret)
> - mem_cgroup_clear_mc();
> - }
> +
> + /*
> + * tasks' cgroup might be different from the one p->mm is associated
> + * with because CLONE_VM is allowed without CLONE_THREAD. The task is
> + * moving so we have to migrate from the memcg associated with its
> + * address space.
> + * No need to take a reference here because the memcg is pinned by the
> + * mm_struct.
> + */
> + from = READ_ONCE(mm->memcg);
> + if (!from)
> + from = root_mem_cgroup;
> + if (from == to)
> + goto out;
> +
> + VM_BUG_ON(mc.from);
> + VM_BUG_ON(mc.to);
> + VM_BUG_ON(mc.precharge);
> + VM_BUG_ON(mc.moved_charge);
> + VM_BUG_ON(mc.moved_swap);
> +
> + spin_lock(&mc.lock);
> + mc.from = from;
> + mc.to = to;
> + mc.flags = move_flags;
> + spin_unlock(&mc.lock);
> + /* We set mc.moving_task later */
> +
> + ret = mem_cgroup_precharge_mc(mm);
> + if (ret)
> + mem_cgroup_clear_mc();
> +out:
> mmput(mm);
> return ret;
> }
> @@ -4932,14 +4943,26 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
> {
> struct task_struct *p = cgroup_taskset_first(tset);
> struct mm_struct *mm = get_task_mm(p);
> + struct mem_cgroup *old_memcg = NULL;
>
> if (mm) {
> + old_memcg = READ_ONCE(mm->memcg);
> + __mm_set_memcg(mm, mem_cgroup_from_css(css));
> +
> if (mc.to)
> mem_cgroup_move_charge(mm);
> mmput(mm);
> }
> if (mc.to)
> mem_cgroup_clear_mc();
> +
> + /*
> + * Be careful and drop the reference only after we are done because
> + * p's task_css memcg might be different from p->memcg and nothing else
> + * might be pinning the old memcg.
> + */
> + if (old_memcg)
> + css_put(&old_memcg->css);
> }
> #else /* !CONFIG_MMU */
> static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
> --
> 2.1.4

--
Michal Hocko
SUSE Labs

2015-07-11 07:09:19

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Fri, Jul 10, 2015 at 02:45:20PM +0200, Michal Hocko wrote:
> On Fri 10-07-15 10:54:00, Vladimir Davydov wrote:
> > On Thu, Jul 09, 2015 at 04:09:41PM +0200, Michal Hocko wrote:
> > > On Wed 08-07-15 20:32:51, Vladimir Davydov wrote:
> > > > On Wed, Jul 08, 2015 at 02:27:51PM +0200, Michal Hocko wrote:
> > [...]
> > > > > @@ -474,7 +519,7 @@ static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
> > > > > return;
> > > > >
> > > > > rcu_read_lock();
> > > > > - memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
> > > > > + memcg = rcu_dereference(mm->memcg);
> > > > > if (unlikely(!memcg))
> > > > > goto out;
> > > > >
> > > >
> > > > If I'm not mistaken, mm->memcg equals NULL for any task in the root
> > > > memory cgroup
> > >
> > > right
> > >
> > > > (BTW, it it's true, it's worth mentioning in the comment
> > > > to mm->memcg definition IMO). As a result, we won't account the stats
> > > > for such tasks, will we?
> > >
> > > well spotted! This is certainly a bug. There are more places which are
> > > checking for mm->memcg being NULL and falling back to root_mem_cgroup. I
> > > think it would be better to simply use root_mem_cgroup right away. We
> > > can setup init_mm.memcg = root_mem_cgroup during initialization and be
> > > done with it. What do you think? The diff is in the very end of the
> > > email (completely untested yet).
> >
> > I'd prefer initializing init_mm.memcg to root_mem_cgroup. This way we
> > wouldn't have to check whether mm->memcg is NULL or not here and there,
> > which would make the code cleaner IMO.
>
> So the patch I've posted will not work as a simple boot test told me. We
> are initializing root_mem_cgroup too late. This will be more complicated.
> I will leave this idea outside of this patch series and will come up
> with a separate patch which will clean this up later. I will update the
> doc discouraging any use of mm->memcg outside of memcg and use accessor
> functions instead. There is only one currently (mm/debug.c) and this is
> used only to print the pointer which is safe.

Why can't we make root_mem_cgroup statically allocated? AFAICS it's a
common practice - e.g. see blkcg_root, root_task_group.

Thanks,
Vladimir

2015-07-14 15:18:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Fri 10-07-15 16:05:33, Michal Hocko wrote:
> JFYI: I've found some more issues while hamerring this more.

OK so the main issue is quite simple but I have completely missed it when
thinking about the patch before. clone(CLONE_VM) without CLONE_THREAD is
really nasty and it will easily lock up the machine with preemption
disabled forever. It goes like this:
taskA (in memcg A)
  taskB = clone(CLONE_VM)
                                    taskB
                                      A -> B      # Both tasks charge to B now
                                      exit()      # No tasks in B -> can be
                                                  # offlined now
                                    css_offline()
  mem_cgroup_try_charge
    get_mem_cgroup_from_mm
      rcu_read_lock()
      do {
      } while css_tryget_online(mm->memcg)        # will never succeed
      rcu_read_unlock()

taskA and taskB are basically independent entities wrt. the life
cycle (unlike threads which are bound to the group leader). The
previous code handles this by re-ownering during exit by the monster
mm_update_next_owner.
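
For clarity, the loop in get_mem_cgroup_from_mm which spins forever once
B is offline looks roughly like this (per the hunks earlier in the
thread):

    rcu_read_lock();
    do {
        memcg = rcu_dereference(mm->memcg);     /* still points to B */
        if (unlikely(!memcg))
            memcg = root_mem_cgroup;
    } while (!css_tryget_online(&memcg->css));  /* never succeeds for offline B */
    rcu_read_unlock();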

I can see the following options without reintroducing some form of
mm_update_next_owner:

1) Do not allow offlining a cgroup if we have active users in it. This
would require a callback from the cgroup core to the subsystem called if
there are no active tasks tracked by the cgroup core. Tracking on the memcg
side doesn't sound terribly hard - just mark a mm_struct as an alien and
count the number of aliens during the move in mem_cgroup. mm_drop_memcg
then drops the counter. We could end up with EBUSY cgroup without any
visible tasks which is a bit awkward.

2) update get_mem_cgroup_from_mm and others to fallback to the parent
memcg if the current one is offline. This would be in line with charge
reparenting we used to do. I cannot say I would like this because it
allows for easy runaway to the root memcg if the hierarchy is not
configured cautiously. The code would be also quite tricky because each
direct consumer of mm->memcg would have to be aware of this. This is
awkward (a rough sketch of what this could look like follows the list of
options).

3) fail mem_cgroup_can_attach if we are trying to migrate a task sharing
mm_struct with a process outside of the tset. If I understand the
tset properly this would require all the sharing tasks to be migrated
together and we would never end up with task_css != &task->mm->css.
__cgroup_procs_write doesn't seem to support multi pid move currently
AFAICS, though. cgroup_migrate_add_src, however, seems to be intended
for this purpose so this should be doable. Without that support we would
basically disallow migrating these tasks - I wouldn't object if you ask
me.
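
For concreteness, the fallback in 2) could look roughly like this inside
get_mem_cgroup_from_mm (untested sketch only):

    rcu_read_lock();
    memcg = rcu_dereference(mm->memcg);
    if (!memcg)
        memcg = root_mem_cgroup;
    /* walk up until we find an online memcg; the root is always online */
    while (!css_tryget_online(&memcg->css))
        memcg = parent_mem_cgroup(memcg);
    rcu_read_unlock();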

Do you see other options? From the above three options the 3rd one
sounds the most sane to me and the 1st quite easy to implement. Both will
require some cgroup core work though. But maybe we would be good enough
with the 3rd option without supporting moving such schizophrenic tasks,
and that would be reduced to memcg code.

Or we can, of course, stay with the current state but I think it would
be much saner to get rid of the schizophrenia.

What do you think?
--
Michal Hocko
SUSE Labs

2015-07-14 15:32:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Sat 11-07-15 10:09:06, Vladimir Davydov wrote:
[...]
> Why can't we make root_mem_cgroup statically allocated? AFAICS it's a
> common practice - e.g. see blkcg_root, root_task_group.

It's not that easy. mem_cgroup doesn't have a fixed size. It depends
on the number of online nodes. I haven't looked closer at this yet, but
cgroup has an early init code path; maybe we can hook in there.
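
For illustration, the allocation is currently sized roughly like this,
which is why a static definition doesn't work out of the box:

    size_t size;

    /* the per-node pointer array at the end of struct mem_cgroup */
    size = sizeof(struct mem_cgroup);
    size += nr_node_ids * sizeof(struct mem_cgroup_per_node *);
    memcg = kzalloc(size, GFP_KERNEL);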

I would like to settle with the current issues described in other email
first, though. This is just a cosmetic issue.
--
Michal Hocko
SUSE Labs

2015-07-29 11:58:45

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Tue 14-07-15 17:18:23, Michal Hocko wrote:
> On Fri 10-07-15 16:05:33, Michal Hocko wrote:
> > JFYI: I've found some more issues while hamerring this more.
>
> OK so the main issue is quite simple but I have completely missed it when
> thinking about the patch before. clone(CLONE_VM) without CLONE_THREAD is
> really nasty and it will easily lockup the machine with preempt. disabled
> for ever. It goes like this:
> taskA (in memcg A)
> taskB = clone(CLONE_VM)
> taskB
> A -> B # Both tasks charge to B now
> exit() # No tasks in B -> can be
> # offlined now
> css_offline()
> mem_cgroup_try_charge
> get_mem_cgroup_from_mm
> rcu_read_lock()
> do {
> } while css_tryget_online(mm->memcg) # will never succeed
> rcu_read_unlock()
>
> taskA and taskB are basically independent entities wrt. the life
> cycle (unlike threads which are bound to the group leader). The
> previous code handles this by re-ownering during exit by the monster
> mm_update_next_owner.
>
> I can see the following options without reintroducing reintroducing
> some form of mm_update_next_owner:
>
> 1) Do not allow offlining a cgroup if we have active users in it. This
> would require a callback from the cgroup core to the subsystem called if
> there are no active tasks tracked by the cgroup core. Tracking on the memcg
> side doesn't sound terribly hard - just mark a mm_struct as an alien and
> count the number of aliens during the move in mem_cgroup. mm_drop_memcg
> then drops the counter. We could end up with EBUSY cgroup without any
> visible tasks which is a bit awkward.
>
> 2) update get_mem_cgroup_from_mm and others to fallback to the parent
> memcg if the current one is offline. This would be in line with charge
> reparenting we used to do. I cannot say I would like this because it
> allows for easy runaway to the root memcg if the hierarchy is not
> configured cautiously. The code would be also quite tricky because each
> direct consumer of mm->memcg would have to be aware of this. This is
> awkward.
>
> 3) fail mem_cgroup_can_attach if we are trying to migrate a task sharing
> mm_struct with a process outside of the tset. If I understand the
> tset properly this would require all the sharing tasks to be migrated
> together and we would never end up with task_css != &task->mm->css.
> __cgroup_procs_write doesn't seem to support multi pid move currently
> AFAICS, though. cgroup_migrate_add_src, however, seems to be intended
> for this purpose so this should be doable. Without that support we would
> basically disallow migrating these tasks - I wouldn't object if you ask
> me.
>
> Do you see other options? From the above three options the 3rd one
> sounds the most sane to me and the 1st quite easy to implement. Both will
> require some cgroup core work though. But maybe we would be good enough
> with 3rd option without supporting moving schizophrenic tasks and that
> would be reduced to memcg code.
>
> Or we can, of course, stay with the current state but I think it would
> be much saner to get rid of the schizophrenia.
>
> What do you think?

Ideas, thoughs? Anybody?
--
Michal Hocko
SUSE Labs

2015-07-29 13:15:42

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Tue, Jul 14, 2015 at 05:18:23PM +0200, Michal Hocko wrote:
> On Fri 10-07-15 16:05:33, Michal Hocko wrote:
> > JFYI: I've found some more issues while hamerring this more.
>
> OK so the main issue is quite simple but I have completely missed it when
> thinking about the patch before. clone(CLONE_VM) without CLONE_THREAD is
> really nasty and it will easily lockup the machine with preempt. disabled
> for ever. It goes like this:
> taskA (in memcg A)
> taskB = clone(CLONE_VM)
> taskB
> A -> B # Both tasks charge to B now
> exit() # No tasks in B -> can be
> # offlined now
> css_offline()
> mem_cgroup_try_charge
> get_mem_cgroup_from_mm
> rcu_read_lock()
> do {
> } while css_tryget_online(mm->memcg) # will never succeed
> rcu_read_unlock()
>
> taskA and taskB are basically independent entities wrt. the life
> cycle (unlike threads which are bound to the group leader). The
> previous code handles this by re-ownering during exit by the monster
> mm_update_next_owner.
>
> I can see the following options without reintroducing reintroducing
> some form of mm_update_next_owner:
>
> 1) Do not allow offlining a cgroup if we have active users in it. This
> would require a callback from the cgroup core to the subsystem called if
> there are no active tasks tracked by the cgroup core. Tracking on the memcg
> side doesn't sound terribly hard - just mark a mm_struct as an alien and
> count the number of aliens during the move in mem_cgroup. mm_drop_memcg
> then drops the counter. We could end up with EBUSY cgroup without any
> visible tasks which is a bit awkward.

You couldn't remove the group, and you wouldn't know which task needs
to move to get the mm out of there. That's no good.

> 2) update get_mem_cgroup_from_mm and others to fallback to the parent
> memcg if the current one is offline. This would be in line with charge
> reparenting we used to do. I cannot say I would like this because it
> allows for easy runaway to the root memcg if the hierarchy is not
> configured cautiously. The code would be also quite tricky because each
> direct consumer of mm->memcg would have to be aware of this. This is
> awkward.

In the unified hierarchy, there won't be tasks inside intermediate
nodes, so reparenting would lead to surprising behavior.

> 3) fail mem_cgroup_can_attach if we are trying to migrate a task sharing
> mm_struct with a process outside of the tset. If I understand the
> tset properly this would require all the sharing tasks to be migrated
> together and we would never end up with task_css != &task->mm->css.
> __cgroup_procs_write doesn't seem to support multi pid move currently
> AFAICS, though. cgroup_migrate_add_src, however, seems to be intended
> for this purpose so this should be doable. Without that support we would
> basically disallow migrating these tasks - I wouldn't object if you ask
> me.

I'd prefer not adding controller-specific failure modes for attaching,
and this too would lead to very non-obvious behavior.

> Do you see other options? From the above three options the 3rd one
> sounds the most sane to me and the 1st quite easy to implement. Both will
> require some cgroup core work though. But maybe we would be good enough
> with 3rd option without supporting moving schizophrenic tasks and that
> would be reduced to memcg code.

A modified form of 1) would be to track the mms referring to a memcg
but during offline search the process tree for a matching task. This
is heavy-handed, but it's a rare case and this work would be done in
the cgroup removal path rather than during task exit. This is stolen
from the current mm_update_next_owner():

    list_for_each_entry(mm, memcg->mms, memcg_list) {
        for_each_process(g) {
            if (g->flags & PF_KTHREAD)
                continue;
            for_each_thread(g, c) {
                if (c->mm == mm)
                    goto assign;
                if (c->mm)
                    break;
            }
        }
    assign:
        memcg = mem_cgroup_from_task(c);
        mm->memcg = memcg;
        list_move(&mm->memcg_list, &memcg->mms);
    }

(plus appropriate synchronization, of course)

2015-07-29 15:05:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Wed 29-07-15 09:14:54, Johannes Weiner wrote:
> On Tue, Jul 14, 2015 at 05:18:23PM +0200, Michal Hocko wrote:
[...]
> > 3) fail mem_cgroup_can_attach if we are trying to migrate a task sharing
> > mm_struct with a process outside of the tset. If I understand the
> > tset properly this would require all the sharing tasks to be migrated
> > together and we would never end up with task_css != &task->mm->css.
> > __cgroup_procs_write doesn't seem to support multi pid move currently
> > AFAICS, though. cgroup_migrate_add_src, however, seems to be intended
> > for this purpose so this should be doable. Without that support we would
> > basically disallow migrating these tasks - I wouldn't object if you ask
> > me.
>
> I'd prefer not adding controller-specific failure modes for attaching,

Does this mean that there is a plan to drop the return value from
can_attach? I can see that both cpuset and cpu controllers currently
allow to fail to attach. Are those going to change? I remember some
discussions but no clear outcome of those.

> and this too would lead to very non-obvious behavior.

Yeah, the user will not get an error source with the current API but
this is an inherent restriction currently. Maybe we can add a knob with
the error source?

If there is a clear consensus that can_attach failures are clearly a no
go, then what about "silent" moving of the associated tasks? This would
be similar to a thread group, except the group would be a more generic term.

> > Do you see other options? From the above three options the 3rd one
> > sounds the most sane to me and the 1st quite easy to implement. Both will
> > require some cgroup core work though. But maybe we would be good enough
> > with 3rd option without supporting moving schizophrenic tasks and that
> > would be reduced to memcg code.
>
> A modified form of 1) would be to track the mms referring to a memcg
> but during offline search the process tree for a matching task.

But we might have many of those and all of them living in different
cgroups. So which one do we take? The first encountered, the one with
the majority? I am not sure this is much better.

I would really prefer if we could get rid of the schizophrenia if it is
possible.

> This is heavy-handed, but it's a rare case and this work would be done
> in the cgroup removal path rather than during task exit.

Yes, it would be a bit lighter.

> This is stolen
> from the current mm_update_next_owner():
>
> list_for_each_entry(mm, memcg->mms, memcg_list) {
> for_each_process(g) {
> if (g->flags & PF_KTHREAD)
> continue;
> for_each_thread(g, c) {
> if (c->mm == mm)
> goto assign;
> if (c->mm)
> break;
> }
> }
> assign:
> memcg = mem_cgroup_from_task(c);
> mm->memcg = memcg;
> list_move(&mm->memcg_list, &memcg->mms);
> }
>
> (plus appropriate synchronization, of course)

--
Michal Hocko
SUSE Labs

2015-07-29 16:43:25

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 7/8] memcg: get rid of mm_struct::owner

On Wed, Jul 29, 2015 at 05:05:49PM +0200, Michal Hocko wrote:
> On Wed 29-07-15 09:14:54, Johannes Weiner wrote:
> > On Tue, Jul 14, 2015 at 05:18:23PM +0200, Michal Hocko wrote:
> [...]
> > > 3) fail mem_cgroup_can_attach if we are trying to migrate a task sharing
> > > mm_struct with a process outside of the tset. If I understand the
> > > tset properly this would require all the sharing tasks to be migrated
> > > together and we would never end up with task_css != &task->mm->css.
> > > __cgroup_procs_write doesn't seem to support multi pid move currently
> > > AFAICS, though. cgroup_migrate_add_src, however, seems to be intended
> > > for this purpose so this should be doable. Without that support we would
> > > basically disallow migrating these tasks - I wouldn't object if you ask
> > > me.
> >
> > I'd prefer not adding controller-specific failure modes for attaching,
>
> Does this mean that there is a plan to drop the return value from
> can_attach? I can see that both cpuset and cpu controllers currently
> allow to fail to attach. Are those going to change? I remember some
> discussions but no clear outcome of those.

Nothing but the realtime stuff needs to be able to fail migration due
to controller restraints. This should probably remain a fringe thing,
because it does make for a much more ambiguous interface.

So I think can_attach() will have to stay, but it should be avoided.

> > and this too would lead to very non-obvious behavior.
>
> Yeah, the user will not get an error source with the current API but
> this is an inherent restriction currently. Maybe we can add a knob with
> the error source?
>
> If there is a clear consensus that can_attach failures are clearly a no
> go then what about "silent" moving of the associated tasks? This would
> be similar to thread group except the group would be more generic term.
>
> > > Do you see other options? From the above three options the 3rd one
> > > sounds the most sane to me and the 1st quite easy to implement. Both will
> > > require some cgroup core work though. But maybe we would be good enough
> > > with 3rd option without supporting moving schizophrenic tasks and that
> > > would be reduced to memcg code.
> >
> > A modified form of 1) would be to track the mms referring to a memcg
> > but during offline search the process tree for a matching task.
>
> But we might have many of those and all of them living in different
> cgroups. So which one do we take? The first encountered, the one with
> the majority? I am not sure this is much better.
>
> I would really prefer if we could get rid of the schizophrenia if it is
> possible.

The first encountered.

This is just our model for sharing memory across groups. Page cache,
writeback, address space--we have always accounted based on who's
touching it first. We might as well stick with it for shared mms.