2014-10-21 13:16:07

by Vladimir Davydov

[permalink] [raw]
Subject: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages
on *any* NUMA node. However, the only place where it's called is
mem_cgroup_soft_reclaim(), which tries to reclaim memory from a
*specific* zone. So the way how it's used is incorrect - it will return
true even if the cgroup doesn't have pages on the zone we're scanning.

I think we can get rid of this check completely, because
mem_cgroup_shrink_node_zone(), which is called by
mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
equivalent to shrink_lruvec(), which exits almost immediately if the
lruvec passed to it is empty. So there's no need to optimize anything
here. Besides, we don't have such a check in the general scan path
(shrink_zone) either.

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/memcontrol.c | 43 -------------------------------------------
1 file changed, 43 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53393e27ff03..833b6a696aab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1799,52 +1799,11 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
memcg->last_scanned_node = node;
return node;
}
-
-/*
- * Check all nodes whether it contains reclaimable pages or not.
- * For quick scan, we make use of scan_nodes. This will allow us to skip
- * unused nodes. But scan_nodes is lazily updated and may not cotain
- * enough new information. We need to do double check.
- */
-static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
-{
- int nid;
-
- /*
- * quick check...making use of scan_node.
- * We can skip unused nodes.
- */
- if (!nodes_empty(memcg->scan_nodes)) {
- for (nid = first_node(memcg->scan_nodes);
- nid < MAX_NUMNODES;
- nid = next_node(nid, memcg->scan_nodes)) {
-
- if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
- return true;
- }
- }
- /*
- * Check rest of nodes.
- */
- for_each_node_state(nid, N_MEMORY) {
- if (node_isset(nid, memcg->scan_nodes))
- continue;
- if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
- return true;
- }
- return false;
-}
-
#else
int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
{
return 0;
}
-
-static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
-{
- return test_mem_cgroup_node_reclaimable(memcg, 0, noswap);
-}
#endif

static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
@@ -1888,8 +1847,6 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
}
continue;
}
- if (!mem_cgroup_reclaimable(victim, false))
- continue;
total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
zone, &nr_scanned);
*total_scanned += nr_scanned;
--
1.7.10.4


2014-10-21 15:37:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Tue 21-10-14 17:15:50, Vladimir Davydov wrote:
> mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages
> on *any* NUMA node. However, the only place where it's called is
> mem_cgroup_soft_reclaim(), which tries to reclaim memory from a
> *specific* zone. So the way how it's used is incorrect - it will return
> true even if the cgroup doesn't have pages on the zone we're scanning.
>
> I think we can get rid of this check completely, because
> mem_cgroup_shrink_node_zone(), which is called by
> mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
> equivalent to shrink_lruvec(), which exits almost immediately if the
> lruvec passed to it is empty. So there's no need to optimize anything
> here. Besides, we don't have such a check in the general scan path
> (shrink_zone) either.
>
> Signed-off-by: Vladimir Davydov <[email protected]>

yeah, let's ditch it. It doesn't make any sense without checking the
target zone and even then it is dubious bevause get_scan_count will give
us 0 target on an empty lru list.

Acked-by: Michal Hocko <[email protected]>

Thanks

> ---
> mm/memcontrol.c | 43 -------------------------------------------
> 1 file changed, 43 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 53393e27ff03..833b6a696aab 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1799,52 +1799,11 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
> memcg->last_scanned_node = node;
> return node;
> }
> -
> -/*
> - * Check all nodes whether it contains reclaimable pages or not.
> - * For quick scan, we make use of scan_nodes. This will allow us to skip
> - * unused nodes. But scan_nodes is lazily updated and may not cotain
> - * enough new information. We need to do double check.
> - */
> -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
> -{
> - int nid;
> -
> - /*
> - * quick check...making use of scan_node.
> - * We can skip unused nodes.
> - */
> - if (!nodes_empty(memcg->scan_nodes)) {
> - for (nid = first_node(memcg->scan_nodes);
> - nid < MAX_NUMNODES;
> - nid = next_node(nid, memcg->scan_nodes)) {
> -
> - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
> - return true;
> - }
> - }
> - /*
> - * Check rest of nodes.
> - */
> - for_each_node_state(nid, N_MEMORY) {
> - if (node_isset(nid, memcg->scan_nodes))
> - continue;
> - if (test_mem_cgroup_node_reclaimable(memcg, nid, noswap))
> - return true;
> - }
> - return false;
> -}
> -
> #else
> int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
> {
> return 0;
> }
> -
> -static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
> -{
> - return test_mem_cgroup_node_reclaimable(memcg, 0, noswap);
> -}
> #endif
>
> static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> @@ -1888,8 +1847,6 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> }
> continue;
> }
> - if (!mem_cgroup_reclaimable(victim, false))
> - continue;
> total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
> zone, &nr_scanned);
> *total_scanned += nr_scanned;
> --
> 1.7.10.4
>

--
Michal Hocko
SUSE Labs

2014-10-21 18:22:51

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Tue, Oct 21, 2014 at 05:15:50PM +0400, Vladimir Davydov wrote:
> mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages
> on *any* NUMA node. However, the only place where it's called is
> mem_cgroup_soft_reclaim(), which tries to reclaim memory from a
> *specific* zone. So the way how it's used is incorrect - it will return
> true even if the cgroup doesn't have pages on the zone we're scanning.
>
> I think we can get rid of this check completely, because
> mem_cgroup_shrink_node_zone(), which is called by
> mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
> equivalent to shrink_lruvec(), which exits almost immediately if the
> lruvec passed to it is empty. So there's no need to optimize anything
> here. Besides, we don't have such a check in the general scan path
> (shrink_zone) either.
>
> Signed-off-by: Vladimir Davydov <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

How about this on top?

---

>From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Tue, 21 Oct 2014 09:53:54 -0400
Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
round-robin

The round-robin node reclaim currently tries to include only nodes
that have memory of the memcg in question, which is quite elaborate.

Just use plain round-robin over the nodes that are allowed by the
task's cpuset, which are the most likely to contain that memcg's
memory. But even if zones without memcg memory are encountered,
direct reclaim will skip over them without too much hassle.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 97 +++------------------------------------------------------
1 file changed, 5 insertions(+), 92 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d353d9e1fdca..293db8234179 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/cpuset.h>
#include <linux/lockdep.h>
#include <linux/file.h>
#include "internal.h"
@@ -129,12 +130,10 @@ static const char * const mem_cgroup_lru_names[] = {
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
MEM_CGROUP_TARGET_SOFTLIMIT,
- MEM_CGROUP_TARGET_NUMAINFO,
MEM_CGROUP_NTARGETS,
};
#define THRESHOLDS_EVENTS_TARGET 128
#define SOFTLIMIT_EVENTS_TARGET 1024
-#define NUMAINFO_EVENTS_TARGET 1024

struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
@@ -352,11 +351,6 @@ struct mem_cgroup {
#endif

int last_scanned_node;
-#if MAX_NUMNODES > 1
- nodemask_t scan_nodes;
- atomic_t numainfo_events;
- atomic_t numainfo_updating;
-#endif

/* List of events which userspace want to receive */
struct list_head event_list;
@@ -965,9 +959,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
case MEM_CGROUP_TARGET_SOFTLIMIT:
next = val + SOFTLIMIT_EVENTS_TARGET;
break;
- case MEM_CGROUP_TARGET_NUMAINFO:
- next = val + NUMAINFO_EVENTS_TARGET;
- break;
default:
break;
}
@@ -986,22 +977,10 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
- bool do_softlimit;
- bool do_numainfo __maybe_unused;
-
- do_softlimit = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_SOFTLIMIT);
-#if MAX_NUMNODES > 1
- do_numainfo = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_NUMAINFO);
-#endif
mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
+ if (mem_cgroup_event_ratelimit(memcg,
+ MEM_CGROUP_TARGET_SOFTLIMIT))
mem_cgroup_update_tree(memcg, page);
-#if MAX_NUMNODES > 1
- if (unlikely(do_numainfo))
- atomic_inc(&memcg->numainfo_events);
-#endif
}
}

@@ -1708,61 +1687,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
NULL, "Memory cgroup out of memory");
}

-/**
- * test_mem_cgroup_node_reclaimable
- * @memcg: the target memcg
- * @nid: the node ID to be checked.
- * @noswap : specify true here if the user wants flle only information.
- *
- * This function returns whether the specified memcg contains any
- * reclaimable pages on a node. Returns true if there are any reclaimable
- * pages in the node.
- */
-static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
- int nid, bool noswap)
-{
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
- return true;
- if (noswap || !total_swap_pages)
- return false;
- if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
- return true;
- return false;
-
-}
#if MAX_NUMNODES > 1
-
-/*
- * Always updating the nodemask is not very good - even if we have an empty
- * list or the wrong list here, we can start from some node and traverse all
- * nodes based on the zonelist. So update the list loosely once per 10 secs.
- *
- */
-static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
-{
- int nid;
- /*
- * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
- * pagein/pageout changes since the last update.
- */
- if (!atomic_read(&memcg->numainfo_events))
- return;
- if (atomic_inc_return(&memcg->numainfo_updating) > 1)
- return;
-
- /* make a nodemask where this memcg uses memory from */
- memcg->scan_nodes = node_states[N_MEMORY];
-
- for_each_node_mask(nid, node_states[N_MEMORY]) {
-
- if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
- node_clear(nid, memcg->scan_nodes);
- }
-
- atomic_set(&memcg->numainfo_events, 0);
- atomic_set(&memcg->numainfo_updating, 0);
-}
-
/*
* Selecting a node where we start reclaim from. Because what we need is just
* reducing usage counter, start from anywhere is O,K. Considering
@@ -1779,21 +1704,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
{
int node;

- mem_cgroup_may_update_nodemask(memcg);
- node = memcg->last_scanned_node;
-
- node = next_node(node, memcg->scan_nodes);
+ node = next_node(memcg->last_scanned_node, cpuset_current_mems_allowed);
if (node == MAX_NUMNODES)
- node = first_node(memcg->scan_nodes);
- /*
- * We call this when we hit limit, not when pages are added to LRU.
- * No LRU may hold pages because all pages are UNEVICTABLE or
- * memcg is too small and all pages are not on LRU. In that case,
- * we use curret node.
- */
- if (unlikely(node == MAX_NUMNODES))
- node = numa_node_id();
-
+ node = first_node(cpuset_current_mems_allowed);
memcg->last_scanned_node = node;
return node;
}
--
2.1.2

2014-10-22 06:40:57

by Vladimir Davydov

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Tue, Oct 21, 2014 at 02:22:39PM -0400, Johannes Weiner wrote:
> On Tue, Oct 21, 2014 at 05:15:50PM +0400, Vladimir Davydov wrote:
> > mem_cgroup_reclaimable() checks whether a cgroup has reclaimable pages
> > on *any* NUMA node. However, the only place where it's called is
> > mem_cgroup_soft_reclaim(), which tries to reclaim memory from a
> > *specific* zone. So the way how it's used is incorrect - it will return
> > true even if the cgroup doesn't have pages on the zone we're scanning.
> >
> > I think we can get rid of this check completely, because
> > mem_cgroup_shrink_node_zone(), which is called by
> > mem_cgroup_soft_reclaim() if mem_cgroup_reclaimable() returns true, is
> > equivalent to shrink_lruvec(), which exits almost immediately if the
> > lruvec passed to it is empty. So there's no need to optimize anything
> > here. Besides, we don't have such a check in the general scan path
> > (shrink_zone) either.
> >
> > Signed-off-by: Vladimir Davydov <[email protected]>
>
> Acked-by: Johannes Weiner <[email protected]>
>
> How about this on top?
>
> ---
>
> From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Tue, 21 Oct 2014 09:53:54 -0400
> Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> round-robin
>
> The round-robin node reclaim currently tries to include only nodes
> that have memory of the memcg in question, which is quite elaborate.
>
> Just use plain round-robin over the nodes that are allowed by the
> task's cpuset, which are the most likely to contain that memcg's
> memory. But even if zones without memcg memory are encountered,
> direct reclaim will skip over them without too much hassle.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Totally agree.

Acked-by: Vladimir Davydov <[email protected]>

> ---
> mm/memcontrol.c | 97 +++------------------------------------------------------
> 1 file changed, 5 insertions(+), 92 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d353d9e1fdca..293db8234179 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -54,6 +54,7 @@
> #include <linux/page_cgroup.h>
> #include <linux/cpu.h>
> #include <linux/oom.h>
> +#include <linux/cpuset.h>
> #include <linux/lockdep.h>
> #include <linux/file.h>
> #include "internal.h"
> @@ -129,12 +130,10 @@ static const char * const mem_cgroup_lru_names[] = {
> enum mem_cgroup_events_target {
> MEM_CGROUP_TARGET_THRESH,
> MEM_CGROUP_TARGET_SOFTLIMIT,
> - MEM_CGROUP_TARGET_NUMAINFO,
> MEM_CGROUP_NTARGETS,
> };
> #define THRESHOLDS_EVENTS_TARGET 128
> #define SOFTLIMIT_EVENTS_TARGET 1024
> -#define NUMAINFO_EVENTS_TARGET 1024
>
> struct mem_cgroup_stat_cpu {
> long count[MEM_CGROUP_STAT_NSTATS];
> @@ -352,11 +351,6 @@ struct mem_cgroup {
> #endif
>
> int last_scanned_node;
> -#if MAX_NUMNODES > 1
> - nodemask_t scan_nodes;
> - atomic_t numainfo_events;
> - atomic_t numainfo_updating;
> -#endif
>
> /* List of events which userspace want to receive */
> struct list_head event_list;
> @@ -965,9 +959,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> case MEM_CGROUP_TARGET_SOFTLIMIT:
> next = val + SOFTLIMIT_EVENTS_TARGET;
> break;
> - case MEM_CGROUP_TARGET_NUMAINFO:
> - next = val + NUMAINFO_EVENTS_TARGET;
> - break;
> default:
> break;
> }
> @@ -986,22 +977,10 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> /* threshold event is triggered in finer grain than soft limit */
> if (unlikely(mem_cgroup_event_ratelimit(memcg,
> MEM_CGROUP_TARGET_THRESH))) {
> - bool do_softlimit;
> - bool do_numainfo __maybe_unused;
> -
> - do_softlimit = mem_cgroup_event_ratelimit(memcg,
> - MEM_CGROUP_TARGET_SOFTLIMIT);
> -#if MAX_NUMNODES > 1
> - do_numainfo = mem_cgroup_event_ratelimit(memcg,
> - MEM_CGROUP_TARGET_NUMAINFO);
> -#endif
> mem_cgroup_threshold(memcg);
> - if (unlikely(do_softlimit))
> + if (mem_cgroup_event_ratelimit(memcg,
> + MEM_CGROUP_TARGET_SOFTLIMIT))
> mem_cgroup_update_tree(memcg, page);
> -#if MAX_NUMNODES > 1
> - if (unlikely(do_numainfo))
> - atomic_inc(&memcg->numainfo_events);
> -#endif
> }
> }
>
> @@ -1708,61 +1687,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> NULL, "Memory cgroup out of memory");
> }
>
> -/**
> - * test_mem_cgroup_node_reclaimable
> - * @memcg: the target memcg
> - * @nid: the node ID to be checked.
> - * @noswap : specify true here if the user wants flle only information.
> - *
> - * This function returns whether the specified memcg contains any
> - * reclaimable pages on a node. Returns true if there are any reclaimable
> - * pages in the node.
> - */
> -static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
> - int nid, bool noswap)
> -{
> - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
> - return true;
> - if (noswap || !total_swap_pages)
> - return false;
> - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
> - return true;
> - return false;
> -
> -}
> #if MAX_NUMNODES > 1
> -
> -/*
> - * Always updating the nodemask is not very good - even if we have an empty
> - * list or the wrong list here, we can start from some node and traverse all
> - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> - *
> - */
> -static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
> -{
> - int nid;
> - /*
> - * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
> - * pagein/pageout changes since the last update.
> - */
> - if (!atomic_read(&memcg->numainfo_events))
> - return;
> - if (atomic_inc_return(&memcg->numainfo_updating) > 1)
> - return;
> -
> - /* make a nodemask where this memcg uses memory from */
> - memcg->scan_nodes = node_states[N_MEMORY];
> -
> - for_each_node_mask(nid, node_states[N_MEMORY]) {
> -
> - if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
> - node_clear(nid, memcg->scan_nodes);
> - }
> -
> - atomic_set(&memcg->numainfo_events, 0);
> - atomic_set(&memcg->numainfo_updating, 0);
> -}
> -
> /*
> * Selecting a node where we start reclaim from. Because what we need is just
> * reducing usage counter, start from anywhere is O,K. Considering
> @@ -1779,21 +1704,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
> {
> int node;
>
> - mem_cgroup_may_update_nodemask(memcg);
> - node = memcg->last_scanned_node;
> -
> - node = next_node(node, memcg->scan_nodes);
> + node = next_node(memcg->last_scanned_node, cpuset_current_mems_allowed);
> if (node == MAX_NUMNODES)
> - node = first_node(memcg->scan_nodes);
> - /*
> - * We call this when we hit limit, not when pages are added to LRU.
> - * No LRU may hold pages because all pages are UNEVICTABLE or
> - * memcg is too small and all pages are not on LRU. In that case,
> - * we use curret node.
> - */
> - if (unlikely(node == MAX_NUMNODES))
> - node = numa_node_id();
> -
> + node = first_node(cpuset_current_mems_allowed);
> memcg->last_scanned_node = node;
> return node;
> }
> --
> 2.1.2

2014-10-22 11:21:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Tue 21-10-14 14:22:39, Johannes Weiner wrote:
[...]
> From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Tue, 21 Oct 2014 09:53:54 -0400
> Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> round-robin
>
> The round-robin node reclaim currently tries to include only nodes
> that have memory of the memcg in question, which is quite elaborate.
>
> Just use plain round-robin over the nodes that are allowed by the
> task's cpuset, which are the most likely to contain that memcg's
> memory. But even if zones without memcg memory are encountered,
> direct reclaim will skip over them without too much hassle.

I do not think that using the current's node mask is correct. Different
tasks in the same memcg might be bound to different nodes and then a set
of nodes might be reclaimed much more if a particular task hits limit
more often. It also doesn't make much sense from semantical POV, we are
reclaiming memcg so the mask should be union of all tasks allowed nodes.

What we do currently is overly complicated though and I agree that there
is no good reason for it.
Let's just s@cpuset_current_mems_allowed@node_online_map@ and round
robin over all nodes. As you said we do not have to optimize for empty
zones.

Anyway I very much welcome all these simplifications (my todo list is
shrinking ;). Thanks!

> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 97 +++------------------------------------------------------
> 1 file changed, 5 insertions(+), 92 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d353d9e1fdca..293db8234179 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -54,6 +54,7 @@
> #include <linux/page_cgroup.h>
> #include <linux/cpu.h>
> #include <linux/oom.h>
> +#include <linux/cpuset.h>
> #include <linux/lockdep.h>
> #include <linux/file.h>
> #include "internal.h"
> @@ -129,12 +130,10 @@ static const char * const mem_cgroup_lru_names[] = {
> enum mem_cgroup_events_target {
> MEM_CGROUP_TARGET_THRESH,
> MEM_CGROUP_TARGET_SOFTLIMIT,
> - MEM_CGROUP_TARGET_NUMAINFO,
> MEM_CGROUP_NTARGETS,
> };
> #define THRESHOLDS_EVENTS_TARGET 128
> #define SOFTLIMIT_EVENTS_TARGET 1024
> -#define NUMAINFO_EVENTS_TARGET 1024
>
> struct mem_cgroup_stat_cpu {
> long count[MEM_CGROUP_STAT_NSTATS];
> @@ -352,11 +351,6 @@ struct mem_cgroup {
> #endif
>
> int last_scanned_node;
> -#if MAX_NUMNODES > 1
> - nodemask_t scan_nodes;
> - atomic_t numainfo_events;
> - atomic_t numainfo_updating;
> -#endif
>
> /* List of events which userspace want to receive */
> struct list_head event_list;
> @@ -965,9 +959,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> case MEM_CGROUP_TARGET_SOFTLIMIT:
> next = val + SOFTLIMIT_EVENTS_TARGET;
> break;
> - case MEM_CGROUP_TARGET_NUMAINFO:
> - next = val + NUMAINFO_EVENTS_TARGET;
> - break;
> default:
> break;
> }
> @@ -986,22 +977,10 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
> /* threshold event is triggered in finer grain than soft limit */
> if (unlikely(mem_cgroup_event_ratelimit(memcg,
> MEM_CGROUP_TARGET_THRESH))) {
> - bool do_softlimit;
> - bool do_numainfo __maybe_unused;
> -
> - do_softlimit = mem_cgroup_event_ratelimit(memcg,
> - MEM_CGROUP_TARGET_SOFTLIMIT);
> -#if MAX_NUMNODES > 1
> - do_numainfo = mem_cgroup_event_ratelimit(memcg,
> - MEM_CGROUP_TARGET_NUMAINFO);
> -#endif
> mem_cgroup_threshold(memcg);
> - if (unlikely(do_softlimit))
> + if (mem_cgroup_event_ratelimit(memcg,
> + MEM_CGROUP_TARGET_SOFTLIMIT))
> mem_cgroup_update_tree(memcg, page);
> -#if MAX_NUMNODES > 1
> - if (unlikely(do_numainfo))
> - atomic_inc(&memcg->numainfo_events);
> -#endif
> }
> }
>
> @@ -1708,61 +1687,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> NULL, "Memory cgroup out of memory");
> }
>
> -/**
> - * test_mem_cgroup_node_reclaimable
> - * @memcg: the target memcg
> - * @nid: the node ID to be checked.
> - * @noswap : specify true here if the user wants flle only information.
> - *
> - * This function returns whether the specified memcg contains any
> - * reclaimable pages on a node. Returns true if there are any reclaimable
> - * pages in the node.
> - */
> -static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
> - int nid, bool noswap)
> -{
> - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
> - return true;
> - if (noswap || !total_swap_pages)
> - return false;
> - if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_ANON))
> - return true;
> - return false;
> -
> -}
> #if MAX_NUMNODES > 1
> -
> -/*
> - * Always updating the nodemask is not very good - even if we have an empty
> - * list or the wrong list here, we can start from some node and traverse all
> - * nodes based on the zonelist. So update the list loosely once per 10 secs.
> - *
> - */
> -static void mem_cgroup_may_update_nodemask(struct mem_cgroup *memcg)
> -{
> - int nid;
> - /*
> - * numainfo_events > 0 means there was at least NUMAINFO_EVENTS_TARGET
> - * pagein/pageout changes since the last update.
> - */
> - if (!atomic_read(&memcg->numainfo_events))
> - return;
> - if (atomic_inc_return(&memcg->numainfo_updating) > 1)
> - return;
> -
> - /* make a nodemask where this memcg uses memory from */
> - memcg->scan_nodes = node_states[N_MEMORY];
> -
> - for_each_node_mask(nid, node_states[N_MEMORY]) {
> -
> - if (!test_mem_cgroup_node_reclaimable(memcg, nid, false))
> - node_clear(nid, memcg->scan_nodes);
> - }
> -
> - atomic_set(&memcg->numainfo_events, 0);
> - atomic_set(&memcg->numainfo_updating, 0);
> -}
> -
> /*
> * Selecting a node where we start reclaim from. Because what we need is just
> * reducing usage counter, start from anywhere is O,K. Considering
> @@ -1779,21 +1704,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
> {
> int node;
>
> - mem_cgroup_may_update_nodemask(memcg);
> - node = memcg->last_scanned_node;
> -
> - node = next_node(node, memcg->scan_nodes);
> + node = next_node(memcg->last_scanned_node, cpuset_current_mems_allowed);
> if (node == MAX_NUMNODES)
> - node = first_node(memcg->scan_nodes);
> - /*
> - * We call this when we hit limit, not when pages are added to LRU.
> - * No LRU may hold pages because all pages are UNEVICTABLE or
> - * memcg is too small and all pages are not on LRU. In that case,
> - * we use curret node.
> - */
> - if (unlikely(node == MAX_NUMNODES))
> - node = numa_node_id();
> -
> + node = first_node(cpuset_current_mems_allowed);
> memcg->last_scanned_node = node;
> return node;
> }
> --
> 2.1.2

--
Michal Hocko
SUSE Labs

2014-10-22 12:40:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Wed, Oct 22, 2014 at 01:21:16PM +0200, Michal Hocko wrote:
> On Tue 21-10-14 14:22:39, Johannes Weiner wrote:
> [...]
> > From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> > From: Johannes Weiner <[email protected]>
> > Date: Tue, 21 Oct 2014 09:53:54 -0400
> > Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> > round-robin
> >
> > The round-robin node reclaim currently tries to include only nodes
> > that have memory of the memcg in question, which is quite elaborate.
> >
> > Just use plain round-robin over the nodes that are allowed by the
> > task's cpuset, which are the most likely to contain that memcg's
> > memory. But even if zones without memcg memory are encountered,
> > direct reclaim will skip over them without too much hassle.
>
> I do not think that using the current's node mask is correct. Different
> tasks in the same memcg might be bound to different nodes and then a set
> of nodes might be reclaimed much more if a particular task hits limit
> more often. It also doesn't make much sense from semantical POV, we are
> reclaiming memcg so the mask should be union of all tasks allowed nodes.

Unless the cpuset hierarchy is separate from the memcg hierarchy, all
tasks in the memcg belong to the same cpuset. And the whole point of
cpusets is that a group of tasks has the same nodemask, no?

Sure, there are *possible* configurations for which this assumption
breaks, like multiple hierarchies, but are they sensible? Do we care?

> What we do currently is overly complicated though and I agree that there
> is no good reason for it.
> Let's just s@cpuset_current_mems_allowed@node_online_map@ and round
> robin over all nodes. As you said we do not have to optimize for empty
> zones.

That was what I first had. And cpuset_current_mems_allowed defaults
to node_online_map, but once the user sets up cpusets in conjunction
with memcgs, it seems to be the preferred value.

The other end of this is that if you have 16 nodes and use cpuset to
bind the task to node 14 and 15, round-robin iterations of node 1-13
will reclaim the group's memory on 14 and only the 15 iteration will
actually look at memory from node 15 first.

It seems using the cpuset bindings, while theoretically independent,
would do the right thing for all intents and purposes.

2014-10-22 13:51:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] memcg: remove mem_cgroup_reclaimable check from soft reclaim

On Wed 22-10-14 08:40:25, Johannes Weiner wrote:
> On Wed, Oct 22, 2014 at 01:21:16PM +0200, Michal Hocko wrote:
> > On Tue 21-10-14 14:22:39, Johannes Weiner wrote:
> > [...]
> > > From 27bd24b00433d9f6c8d60ba2b13dbff158b06c13 Mon Sep 17 00:00:00 2001
> > > From: Johannes Weiner <[email protected]>
> > > Date: Tue, 21 Oct 2014 09:53:54 -0400
> > > Subject: [patch] mm: memcontrol: do not filter reclaimable nodes in NUMA
> > > round-robin
> > >
> > > The round-robin node reclaim currently tries to include only nodes
> > > that have memory of the memcg in question, which is quite elaborate.
> > >
> > > Just use plain round-robin over the nodes that are allowed by the
> > > task's cpuset, which are the most likely to contain that memcg's
> > > memory. But even if zones without memcg memory are encountered,
> > > direct reclaim will skip over them without too much hassle.
> >
> > I do not think that using the current's node mask is correct. Different
> > tasks in the same memcg might be bound to different nodes and then a set
> > of nodes might be reclaimed much more if a particular task hits limit
> > more often. It also doesn't make much sense from semantical POV, we are
> > reclaiming memcg so the mask should be union of all tasks allowed nodes.
>
> Unless the cpuset hierarchy is separate from the memcg hierarchy, all
> tasks in the memcg belong to the same cpuset. And the whole point of
> cpusets is that a group of tasks has the same nodemask, no?

Memory limit and memory placement are orthogonal configurations and they
might be stacked one on top of other in both directions.

> Sure, there are *possible* configurations for which this assumption
> breaks, like multiple hierarchies, but are they sensible? Do we care?

Why wouldn't they be sensible? What is wrong about limiting memory of
a load which internally uses node placement for its components?

> > What we do currently is overly complicated though and I agree that there
> > is no good reason for it.
> > Let's just s@cpuset_current_mems_allowed@node_online_map@ and round
> > robin over all nodes. As you said we do not have to optimize for empty
> > zones.
>
> That was what I first had. And cpuset_current_mems_allowed defaults
> to node_online_map, but once the user sets up cpusets in conjunction
> with memcgs, it seems to be the preferred value.
>
> The other end of this is that if you have 16 nodes and use cpuset to
> bind the task to node 14 and 15, round-robin iterations of node 1-13
> will reclaim the group's memory on 14 and only the 15 iteration will
> actually look at memory from node 15 first.

mem_cgroup_select_victim_node can check reclaimability of the memcg
(hierarchy) and skip nodes without pages. Or would that be too
expensive? We are in the slow path already.

> It seems using the cpuset bindings, while theoretically independent,
> would do the right thing for all intents and purposes.

Only if cpuset is on top of memcg. Not the other way around as mentioned
above (possible node over-reclaim).
--
Michal Hocko
SUSE Labs