Hi,
This is the third version of the patchset.
The first version has been posted here: http://permalink.gmane.org/gmane.linux.kernel.mm/97973
(lkml wasn't CCed at the time so I cannot find it in lwn.net
archives). There were no major objections. The second version
has been posted here http://lwn.net/Articles/548191/ as a part
of a longer and spicier thread which started after LSF here:
https://lwn.net/Articles/548192/
So this is a back-to-the-trees repost which keeps only the guts of the
original post.
Changes between RFC (aka V1) -> V2
As there were no major objections, there were only some minor cleanups
since the last version and I have moved "memcg: Ignore soft limit until
it is explicitly specified" to the end of the series.
Changes between V2 -> V3
No changes in the code since the last version. I have just rebased the
series on top of the current mmotm tree. The most controversial part
has been dropped (the last patch "memcg: Ignore soft limit until it is
explicitly specified") so there are no semantic changes to the soft
limit behavior. This makes the work mostly a code cleanup and
reorganization. Nevertheless, according to my testing this is enough to
make the soft limit work more efficiently, and groups above the soft
limit are reclaimed much less as a result.
The basic idea is quite simple. The first step pulls soft reclaim into
shrink_zone and gets rid of the previous soft reclaim infrastructure.
shrink_zone is now done in two passes. First it tries to do the soft
limit reclaim and it falls back to reclaim-all mode if no group is over
the limit or no pages have been scanned. The second pass happens at the
same priority so the only time we waste is the memcg tree walk which
shouldn't be a big deal [1]. There is certainly room for improvement
in that direction (e.g. do not do the soft reclaim pass if no limit has
been specified for any group yet or if we know that no group is over
its limit). But let's keep it simple for now. As a bonus we get rid of
a _lot_ of code and soft reclaim no longer stands out like it did
before. The cleanup is in a separate patch because I felt it would be
easier to review that way.
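In code, the reworked shrink_zone boils down to the following
(simplified, the full version is in the first patch below):

	static void shrink_zone(struct zone *zone, struct scan_control *sc)
	{
		bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
		unsigned long nr_scanned = sc->nr_scanned;

		/* first pass: only groups over their soft limit (or with such a parent) */
		__shrink_zone(zone, sc, do_soft_reclaim);

		/*
		 * Nothing eligible was found or it had no pages in this zone
		 * so fall back to reclaiming all groups.
		 */
		if (do_soft_reclaim && sc->nr_scanned == nr_scanned)
			__shrink_zone(zone, sc, false);
	}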
The second step integrates soft limit reclaim into targeted
reclaim. It should be rather straightforward. The soft limit has been
used only for global reclaim so far, but it makes sense for any kind of
pressure coming from up the hierarchy, including targeted reclaim.
My primary test case was a parallel kernel build (each make run with -j4
with a distribution .config in a separate cgroup without any hard
limit) on an 8 CPU machine booted with 1GB memory. I was mostly
interested in 2 setups: default - no soft limit set - and 0 soft
limit set for both groups.
The first one should tell us whether the rework regresses the default
behavior while the second one should show us improvements in an extreme
case where both workloads are always over the soft limit.
/usr/bin/time -v has been used to collect the statistics and each
configuration had 3 runs after fresh boot without any other load on the
system.
* No-limit
Base kernel (before)
System: min:238.26 max:240.87 avg:239.85 stdev:0.90
User: min:1166.40 max:1173.79 avg:1169.92 stdev:2.41
Elapsed: min:09:38.06 max:09:57.96 avg:09:48.80 stdev:00:07.14
Rework (after)
System: min:238.88[100.3%] max:242.43[100.6%] avg:240.70[100.4%] stdev:1.62
User: min:1168.61[100.2%] max:1171.15[99.8%] avg:1169.69[100%] stdev:0.82
Elapsed: min:09:31.85[98.9%] max:09:57.74[100.0%] avg:09:47.57[99.8%] stdev:00:09.54
The numbers are within the stdev so I think we are doing well here.
* 0-limit
Base kernel (before)
System: min:244.74 max:248.66 avg:246.39 stdev:1.20
User: min:1190.51 max:1195.70 avg:1193.77 stdev:1.66
Elapsed: min:12:39.73 max:12:56.79 avg:12:48.33 stdev:00:06.59
Rework (after)
System: min:238.64[97.5%] max:242.34[97.5%] avg:240.67[97.7%] stdev:1.26
User: min:1166.20[98%] max:1170.98[98%] avg:1168.17[97.9%] stdev:1.44
Elapsed: min:09:47.53[77.3%] max:09:59.05[77.1%] avg:09:53.23[77.2%] stdev:00:04.72
We can see a 2% decrease in both System and User time, which is not a
huge win but sounds like a good outcome for a cleanup ;). It is even
more interesting to check the Elapsed time numbers which show that the
parallel load is much more effective. I haven't looked deeply into the
specific reasons for this speedup but I would guess that the priority-0
reclaim done in the original implementation is a big contributor.
Page fault statistics tell us at least part of the story:
Base
Major min:29686 max:34224 avg:31028.67 stdev:1503.11
Minor min:35919443 max:35980518 avg:35945936.50 stdev:21340.41
Rework
Major min:332[1.1%] max:883[2.6%] avg:478.67[1.5%] stdev:187.46
Minor min:35583644[99%] max:35660073[99%] avg:35621868.67[99%] stdev:29240.64
While the minor faults are within the noise, the major faults are reduced
considerably. This points at aggressive pageout during reclaim, which
presumably affects the working set. If we look at the same numbers for
no-limit (aka no soft limit in action) we get numbers very similar to
the 0-limit rework:
Base
Major min:208.00 max:454.00, avg:375.17 stdev:86.73
Rework
Major min:270.00 max:730.00, avg:431.67 stdev:143.58
So this clearly points at the priority-0 reclaim.
Shortlog says:
Michal Hocko (3):
memcg: integrate soft reclaim tighter with zone shrinking code
memcg: Get rid of soft-limit tree infrastructure
vmscan, memcg: Do softlimit reclaim also for targeted reclaim
And the diffstat:
include/linux/memcontrol.h | 12 +-
mm/memcontrol.c | 387 +++-----------------------------------------
mm/vmscan.c | 62 ++++---
3 files changed, 65 insertions(+), 396 deletions(-)
Memcg soft reclaim has been traditionally triggered from the global
reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
then picked up a group which exceeds the soft limit the most and
reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
The infrastructure requires per-node-zone trees which hold over-limit
groups and keep them up-to-date (via memcg_check_events), which is not
cost free. Although this overhead hasn't turned out to be a bottleneck,
the implementation is suboptimal because mem_cgroup_update_tree has no
idea which zones consumed memory over the limit, so we could easily end
up having a group on a node-zone tree with only a few pages from that
node-zone.
This patch doesn't try to fix the node-zone tree management because
integrating soft reclaim into zone shrinking seems much easier and more
appropriate, for several reasons.
First of all, 0 priority reclaim was a crude hack which might lead to
big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
of dirty/writeback pages).
Soft reclaim should also be applicable to targeted reclaim, which is
awkward right now without additional hacks.
Last but not least, the whole infrastructure eats quite a lot of code.
After this patch shrink_zone is done in 2 passes. First it tries to do
soft reclaim if appropriate (only for global reclaim for now to stay
compatible with the original state) and falls back to ignoring the soft
limit if no group is eligible for soft reclaim or nothing has been
scanned during the first pass. Only groups which are over their soft
limit, or which have a parent up the hierarchy over its limit, are
considered eligible during the first pass.
The soft limit tree, which is not necessary anymore, will be removed in
the follow-up patch to make this patch smaller and easier to review.
Changes since v1
- __shrink_zone doesn't return the number of shrunk groups as nr_scanned
test covers both no groups scanned and no pages from the required zone
as pointed by Johannes
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 10 +--
mm/memcontrol.c | 161 ++++++--------------------------------------
mm/vmscan.c | 62 ++++++++++-------
3 files changed, 59 insertions(+), 174 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..1833c95 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,9 +179,7 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -358,11 +356,9 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
{
- return 0;
+ return false;
}
static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fe4f123..7e5d7a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2083,57 +2083,28 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
}
#endif
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
- struct zone *zone,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- struct mem_cgroup *victim = NULL;
- int total = 0;
- int loop = 0;
- unsigned long excess;
- unsigned long nr_scanned;
- struct mem_cgroup_reclaim_cookie reclaim = {
- .zone = zone,
- .priority = 0,
- };
+/*
+ * A group is eligible for the soft limit reclaim if it is
+ * a) is over its soft limit
+ * b) any parent up the hierarchy is over its soft limit
+ */
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+{
+ struct mem_cgroup *parent = memcg;
- excess = res_counter_soft_limit_excess(&root_memcg->res) >> PAGE_SHIFT;
-
- while (1) {
- victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
- if (!victim) {
- loop++;
- if (loop >= 2) {
- /*
- * If we have not been able to reclaim
- * anything, it might because there are
- * no reclaimable pages under this hierarchy
- */
- if (!total)
- break;
- /*
- * We want to do more targeted reclaim.
- * excess >> 2 is not to excessive so as to
- * reclaim too much, nor too less that we keep
- * coming back to reclaim from this cgroup
- */
- if (total >= (excess >> 2) ||
- (loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
- break;
- }
- continue;
- }
- if (!mem_cgroup_reclaimable(victim, false))
- continue;
- total += mem_cgroup_shrink_node_zone(victim, gfp_mask, false,
- zone, &nr_scanned);
- *total_scanned += nr_scanned;
- if (!res_counter_soft_limit_excess(&root_memcg->res))
- break;
+ if (res_counter_soft_limit_excess(&memcg->res))
+ return true;
+
+ /*
+ * If any parent up the hierarchy is over its soft limit then we
+ * have to obey and reclaim from this group as well.
+ */
+ while((parent = parent_mem_cgroup(parent))) {
+ if (res_counter_soft_limit_excess(&parent->res))
+ return true;
}
- mem_cgroup_iter_break(root_memcg, victim);
- return total;
+
+ return false;
}
/*
@@ -4751,98 +4722,6 @@ static int mem_cgroup_resize_memsw_limit(struct mem_cgroup *memcg,
return ret;
}
-unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
- gfp_t gfp_mask,
- unsigned long *total_scanned)
-{
- unsigned long nr_reclaimed = 0;
- struct mem_cgroup_per_zone *mz, *next_mz = NULL;
- unsigned long reclaimed;
- int loop = 0;
- struct mem_cgroup_tree_per_zone *mctz;
- unsigned long long excess;
- unsigned long nr_scanned;
-
- if (order > 0)
- return 0;
-
- mctz = soft_limit_tree_node_zone(zone_to_nid(zone), zone_idx(zone));
- /*
- * This loop can run a while, specially if mem_cgroup's continuously
- * keep exceeding their soft limit and putting the system under
- * pressure
- */
- do {
- if (next_mz)
- mz = next_mz;
- else
- mz = mem_cgroup_largest_soft_limit_node(mctz);
- if (!mz)
- break;
-
- nr_scanned = 0;
- reclaimed = mem_cgroup_soft_reclaim(mz->memcg, zone,
- gfp_mask, &nr_scanned);
- nr_reclaimed += reclaimed;
- *total_scanned += nr_scanned;
- spin_lock(&mctz->lock);
-
- /*
- * If we failed to reclaim anything from this memory cgroup
- * it is time to move on to the next cgroup
- */
- next_mz = NULL;
- if (!reclaimed) {
- do {
- /*
- * Loop until we find yet another one.
- *
- * By the time we get the soft_limit lock
- * again, someone might have aded the
- * group back on the RB tree. Iterate to
- * make sure we get a different mem.
- * mem_cgroup_largest_soft_limit_node returns
- * NULL if no other cgroup is present on
- * the tree
- */
- next_mz =
- __mem_cgroup_largest_soft_limit_node(mctz);
- if (next_mz == mz)
- css_put(&next_mz->memcg->css);
- else /* next_mz == NULL or other memcg */
- break;
- } while (1);
- }
- __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
- excess = res_counter_soft_limit_excess(&mz->memcg->res);
- /*
- * One school of thought says that we should not add
- * back the node to the tree if reclaim returns 0.
- * But our reclaim could return 0, simply because due
- * to priority we are exposing a smaller subset of
- * memory to reclaim from. Consider this as a longer
- * term TODO.
- */
- /* If excess == 0, no tree ops */
- __mem_cgroup_insert_exceeded(mz->memcg, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- css_put(&mz->memcg->css);
- loop++;
- /*
- * Could not reclaim anything and there are no more
- * mem cgroups to try or we seem to be looping without
- * reclaiming anything.
- */
- if (!nr_reclaimed &&
- (next_mz == NULL ||
- loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
- break;
- } while (!nr_reclaimed);
- if (next_mz)
- css_put(&next_mz->memcg->css);
- return nr_reclaimed;
-}
-
/**
* mem_cgroup_force_empty_list - clears LRU of a group
* @memcg: group to clear
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fa6a853..fe63a43 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -139,11 +139,21 @@ static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup;
}
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+ return global_reclaim(sc);
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
return true;
}
+
+static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
+{
+ return false;
+}
#endif
static unsigned long get_lru_size(struct lruvec *lruvec, enum lru_list lru)
@@ -1943,7 +1953,8 @@ static inline bool should_continue_reclaim(struct zone *zone,
}
}
-static void shrink_zone(struct zone *zone, struct scan_control *sc)
+static void
+__shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
{
unsigned long nr_reclaimed, nr_scanned;
@@ -1962,6 +1973,12 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
do {
struct lruvec *lruvec;
+ if (soft_reclaim &&
+ !mem_cgroup_soft_reclaim_eligible(memcg)) {
+ memcg = mem_cgroup_iter(root, memcg, &reclaim);
+ continue;
+ }
+
lruvec = mem_cgroup_zone_lruvec(zone, memcg);
shrink_lruvec(lruvec, sc);
@@ -1992,6 +2009,24 @@ static void shrink_zone(struct zone *zone, struct scan_control *sc)
sc->nr_scanned - nr_scanned, sc));
}
+
+static void shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+ bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
+ unsigned long nr_scanned = sc->nr_scanned;
+
+ __shrink_zone(zone, sc, do_soft_reclaim);
+
+ /*
+ * No group is over the soft limit or those that are do not have
+ * pages in the zone we are reclaiming so we have to reclaim everybody
+ */
+ if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
+ __shrink_zone(zone, sc, false);
+ return;
+ }
+}
+
/* Returns true if compaction should go ahead for a high-order request */
static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
{
@@ -2053,8 +2088,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
struct zoneref *z;
struct zone *zone;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
bool aborted_reclaim = false;
/*
@@ -2094,18 +2127,6 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
continue;
}
}
- /*
- * This steals pages from memory cgroups over softlimit
- * and returns the number of reclaimed pages and
- * scanned pages. This works for global memory pressure
- * and balancing, not for a memcg's limit.
- */
- nr_soft_scanned = 0;
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- sc->order, sc->gfp_mask,
- &nr_soft_scanned);
- sc->nr_reclaimed += nr_soft_reclaimed;
- sc->nr_scanned += nr_soft_scanned;
/* need some check for avoid more shrink_zone() */
}
@@ -2628,8 +2649,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int i;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
struct reclaim_state *reclaim_state = current->reclaim_state;
- unsigned long nr_soft_reclaimed;
- unsigned long nr_soft_scanned;
struct scan_control sc = {
.gfp_mask = GFP_KERNEL,
.may_unmap = 1,
@@ -2728,15 +2747,6 @@ loop_again:
sc.nr_scanned = 0;
- nr_soft_scanned = 0;
- /*
- * Call soft limit reclaim before calling shrink_zone.
- */
- nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone,
- order, sc.gfp_mask,
- &nr_soft_scanned);
- sc.nr_reclaimed += nr_soft_reclaimed;
-
/*
* We put equal pressure on every zone, unless
* one zone has way too many pages free
--
1.7.10.4
Now that the soft limit is integrated to the reclaim directly the whole
soft-limit tree infrastructure is not needed anymore. Rip it out.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 224 +------------------------------------------------------
1 file changed, 1 insertion(+), 223 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e5d7a9..1223aaa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -39,7 +39,6 @@
#include <linux/limits.h>
#include <linux/export.h>
#include <linux/mutex.h>
-#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -137,7 +136,6 @@ static const char * const mem_cgroup_lru_names[] = {
*/
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
- MEM_CGROUP_TARGET_SOFTLIMIT,
MEM_CGROUP_TARGET_NUMAINFO,
MEM_CGROUP_NTARGETS,
};
@@ -173,10 +171,6 @@ struct mem_cgroup_per_zone {
struct mem_cgroup_reclaim_iter reclaim_iter[DEF_PRIORITY + 1];
- struct rb_node tree_node; /* RB tree node */
- unsigned long long usage_in_excess;/* Set to the value by which */
- /* the soft limit is exceeded*/
- bool on_tree;
struct mem_cgroup *memcg; /* Back pointer, we cannot */
/* use container_of */
};
@@ -189,26 +183,6 @@ struct mem_cgroup_lru_info {
struct mem_cgroup_per_node *nodeinfo[0];
};
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_zone {
- struct rb_root rb_root;
- spinlock_t lock;
-};
-
-struct mem_cgroup_tree_per_node {
- struct mem_cgroup_tree_per_zone rb_tree_per_zone[MAX_NR_ZONES];
-};
-
-struct mem_cgroup_tree {
- struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
struct mem_cgroup_threshold {
struct eventfd_ctx *eventfd;
u64 threshold;
@@ -533,7 +507,6 @@ static bool move_file(void)
* limit reclaim to prevent infinite loops, if they ever occur.
*/
#define MEM_CGROUP_MAX_RECLAIM_LOOPS 100
-#define MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS 2
enum charge_type {
MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
@@ -764,164 +737,6 @@ page_cgroup_zoneinfo(struct mem_cgroup *memcg, struct page *page)
return mem_cgroup_zoneinfo(memcg, nid, zid);
}
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_node_zone(int nid, int zid)
-{
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static struct mem_cgroup_tree_per_zone *
-soft_limit_tree_from_page(struct page *page)
-{
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
-
- return &soft_limit_tree.rb_tree_per_node[nid]->rb_tree_per_zone[zid];
-}
-
-static void
-__mem_cgroup_insert_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz,
- unsigned long long new_usage_in_excess)
-{
- struct rb_node **p = &mctz->rb_root.rb_node;
- struct rb_node *parent = NULL;
- struct mem_cgroup_per_zone *mz_node;
-
- if (mz->on_tree)
- return;
-
- mz->usage_in_excess = new_usage_in_excess;
- if (!mz->usage_in_excess)
- return;
- while (*p) {
- parent = *p;
- mz_node = rb_entry(parent, struct mem_cgroup_per_zone,
- tree_node);
- if (mz->usage_in_excess < mz_node->usage_in_excess)
- p = &(*p)->rb_left;
- /*
- * We can't avoid mem cgroups that are over their soft
- * limit by the same amount
- */
- else if (mz->usage_in_excess >= mz_node->usage_in_excess)
- p = &(*p)->rb_right;
- }
- rb_link_node(&mz->tree_node, parent, p);
- rb_insert_color(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = true;
-}
-
-static void
-__mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- if (!mz->on_tree)
- return;
- rb_erase(&mz->tree_node, &mctz->rb_root);
- mz->on_tree = false;
-}
-
-static void
-mem_cgroup_remove_exceeded(struct mem_cgroup *memcg,
- struct mem_cgroup_per_zone *mz,
- struct mem_cgroup_tree_per_zone *mctz)
-{
- spin_lock(&mctz->lock);
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- spin_unlock(&mctz->lock);
-}
-
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, struct page *page)
-{
- unsigned long long excess;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
- int nid = page_to_nid(page);
- int zid = page_zonenum(page);
- mctz = soft_limit_tree_from_page(page);
-
- /*
- * Necessary to update all ancestors when hierarchy is used.
- * because their event counter is not touched.
- */
- for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- mz = mem_cgroup_zoneinfo(memcg, nid, zid);
- excess = res_counter_soft_limit_excess(&memcg->res);
- /*
- * We have to update the tree if mz is on RB-tree or
- * mem is over its softlimit.
- */
- if (excess || mz->on_tree) {
- spin_lock(&mctz->lock);
- /* if on-tree, remove it */
- if (mz->on_tree)
- __mem_cgroup_remove_exceeded(memcg, mz, mctz);
- /*
- * Insert again. mz->usage_in_excess will be updated.
- * If excess is 0, no tree ops.
- */
- __mem_cgroup_insert_exceeded(memcg, mz, mctz, excess);
- spin_unlock(&mctz->lock);
- }
- }
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
- int node, zone;
- struct mem_cgroup_per_zone *mz;
- struct mem_cgroup_tree_per_zone *mctz;
-
- for_each_node(node) {
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- mz = mem_cgroup_zoneinfo(memcg, node, zone);
- mctz = soft_limit_tree_node_zone(node, zone);
- mem_cgroup_remove_exceeded(memcg, mz, mctz);
- }
- }
-}
-
-static struct mem_cgroup_per_zone *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct rb_node *rightmost = NULL;
- struct mem_cgroup_per_zone *mz;
-
-retry:
- mz = NULL;
- rightmost = rb_last(&mctz->rb_root);
- if (!rightmost)
- goto done; /* Nothing to reclaim from */
-
- mz = rb_entry(rightmost, struct mem_cgroup_per_zone, tree_node);
- /*
- * Remove the node now but someone else can add it back,
- * we will to add it back at the end of reclaim to its correct
- * position in the tree.
- */
- __mem_cgroup_remove_exceeded(mz->memcg, mz, mctz);
- if (!res_counter_soft_limit_excess(&mz->memcg->res) ||
- !css_tryget(&mz->memcg->css))
- goto retry;
-done:
- return mz;
-}
-
-static struct mem_cgroup_per_zone *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_zone *mctz)
-{
- struct mem_cgroup_per_zone *mz;
-
- spin_lock(&mctz->lock);
- mz = __mem_cgroup_largest_soft_limit_node(mctz);
- spin_unlock(&mctz->lock);
- return mz;
-}
-
/*
* Implementation Note: reading percpu statistics for memcg.
*
@@ -1075,9 +890,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
case MEM_CGROUP_TARGET_THRESH:
next = val + THRESHOLDS_EVENTS_TARGET;
break;
- case MEM_CGROUP_TARGET_SOFTLIMIT:
- next = val + SOFTLIMIT_EVENTS_TARGET;
- break;
case MEM_CGROUP_TARGET_NUMAINFO:
next = val + NUMAINFO_EVENTS_TARGET;
break;
@@ -1100,11 +912,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
- bool do_softlimit;
bool do_numainfo __maybe_unused;
- do_softlimit = mem_cgroup_event_ratelimit(memcg,
- MEM_CGROUP_TARGET_SOFTLIMIT);
#if MAX_NUMNODES > 1
do_numainfo = mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_NUMAINFO);
@@ -1112,8 +921,6 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
preempt_enable();
mem_cgroup_threshold(memcg);
- if (unlikely(do_softlimit))
- mem_cgroup_update_tree(memcg, page);
#if MAX_NUMNODES > 1
if (unlikely(do_numainfo))
atomic_inc(&memcg->numainfo_events);
@@ -2955,9 +2762,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,
unlock_page_cgroup(pc);
/*
- * "charge_statistics" updated event counter. Then, check it.
- * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
- * if they exceeds softlimit.
+ * "charge_statistics" updated event counter.
*/
memcg_check_events(memcg, page);
}
@@ -6086,8 +5891,6 @@ static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *memcg, int node)
for (zone = 0; zone < MAX_NR_ZONES; zone++) {
mz = &pn->zoneinfo[zone];
lruvec_init(&mz->lruvec);
- mz->usage_in_excess = 0;
- mz->on_tree = false;
mz->memcg = memcg;
}
memcg->info.nodeinfo[node] = pn;
@@ -6143,7 +5946,6 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
int node;
size_t size = memcg_size();
- mem_cgroup_remove_from_trees(memcg);
free_css_id(&mem_cgroup_subsys, &memcg->css);
for_each_node(node)
@@ -6225,29 +6027,6 @@ struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
}
EXPORT_SYMBOL(parent_mem_cgroup);
-static void __init mem_cgroup_soft_limit_tree_init(void)
-{
- struct mem_cgroup_tree_per_node *rtpn;
- struct mem_cgroup_tree_per_zone *rtpz;
- int tmp, node, zone;
-
- for_each_node(node) {
- tmp = node;
- if (!node_state(node, N_NORMAL_MEMORY))
- tmp = -1;
- rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, tmp);
- BUG_ON(!rtpn);
-
- soft_limit_tree.rb_tree_per_node[node] = rtpn;
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++) {
- rtpz = &rtpn->rb_tree_per_zone[zone];
- rtpz->rb_root = RB_ROOT;
- spin_lock_init(&rtpz->lock);
- }
- }
-}
-
static struct cgroup_subsys_state * __ref
mem_cgroup_css_alloc(struct cgroup *cont)
{
@@ -7040,7 +6819,6 @@ static int __init mem_cgroup_init(void)
{
hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
enable_swap_cgroup();
- mem_cgroup_soft_limit_tree_init();
memcg_stock_init();
return 0;
}
--
1.7.10.4
Soft reclaim has been done only for the global reclaim (both background
and direct). Since "memcg: integrate soft reclaim tighter with zone
shrinking code" there is no reason for this limitation anymore as the
soft limit reclaim doesn't use any special code paths and it is a
part of the zone shrinking code which is used by both global and
targeted reclaims.
From a semantic point of view it is even natural to consider the soft
limit before touching all groups in the hierarchy tree which is hitting
the hard limit, because the soft limit tells us where to push back when
there is memory pressure. It is not important whether the pressure
comes from the limit or from imbalanced zones.
This patch simply enables soft reclaim unconditionally in
mem_cgroup_should_soft_reclaim so it is enabled for both global and
targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
about the root of the reclaim to know where to stop checking soft limit
state of parents up the hierarchy.
Say we have
A (over soft limit)
\
B (below s.l., hit the hard limit)
/ \
C D (below s.l.)
B is now the source of outside memory pressure for D, but we shouldn't
soft reclaim D because it is behaving well under the B subtree and we
can still reclaim from C (presumably it is over the limit).
mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
hierarchy at B (the root of the memory pressure).
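With the reclaim rooted at B, the check for D stops at B and returns
false; without the root check it would climb up to A and wrongly mark D
as eligible. For reference, the resulting helper (assembled here from
the diff below) looks like this:

	bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
					      struct mem_cgroup *root)
	{
		struct mem_cgroup *parent = memcg;

		if (res_counter_soft_limit_excess(&memcg->res))
			return true;

		/* check parents up the hierarchy, but never past the reclaim root */
		while ((parent = parent_mem_cgroup(parent))) {
			if (res_counter_soft_limit_excess(&parent->res))
				return true;
			if (parent == root)
				break;
		}

		return false;
	}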
Changes since v1
- add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 6 ++++--
mm/memcontrol.c | 14 +++++++++-----
mm/vmscan.c | 4 ++--
3 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1833c95..80ed1b6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,7 +179,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg);
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root);
void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
static inline void mem_cgroup_count_vm_event(struct mm_struct *mm,
@@ -356,7 +357,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root)
{
return false;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1223aaa..163567b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1891,11 +1891,13 @@ static bool mem_cgroup_reclaimable(struct mem_cgroup *memcg, bool noswap)
#endif
/*
- * A group is eligible for the soft limit reclaim if it is
- * a) is over its soft limit
+ * A group is eligible for the soft limit reclaim under the given root
+ * hierarchy if
+ * a) it is over its soft limit
* b) any parent up the hierarchy is over its soft limit
*/
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
+bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+ struct mem_cgroup *root)
{
struct mem_cgroup *parent = memcg;
@@ -1903,12 +1905,14 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg)
return true;
/*
- * If any parent up the hierarchy is over its soft limit then we
- * have to obey and reclaim from this group as well.
+ * If any parent up to the root in the hierarchy is over its soft limit
+ * then we have to obey and reclaim from this group as well.
*/
while((parent = parent_mem_cgroup(parent))) {
if (res_counter_soft_limit_excess(&parent->res))
return true;
+ if (parent == root)
+ break;
}
return false;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fe63a43..d738802 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -142,7 +142,7 @@ static bool global_reclaim(struct scan_control *sc)
static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
{
- return global_reclaim(sc);
+ return true;
}
#else
static bool global_reclaim(struct scan_control *sc)
@@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
struct lruvec *lruvec;
if (soft_reclaim &&
- !mem_cgroup_soft_reclaim_eligible(memcg)) {
+ !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
memcg = mem_cgroup_iter(root, memcg, &reclaim);
continue;
}
--
1.7.10.4
On 05/13/2013 11:46 AM, Michal Hocko wrote:
> Memcg soft reclaim has been traditionally triggered from the global
> reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> then picked up a group which exceeds the soft limit the most and
> reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
>
> The infrastructure requires per-node-zone trees which hold over-limit
> groups and keep them up-to-date (via memcg_check_events) which is not
> cost free. Although this overhead hasn't turned out to be a bottle neck
> the implementation is suboptimal because mem_cgroup_update_tree has no
> idea which zones consumed memory over the limit so we could easily end
> up having a group on a node-zone tree having only few pages from that
> node-zone.
>
> This patch doesn't try to fix node-zone trees management because it
> seems that integrating soft reclaim into zone shrinking sounds much
> easier and more appropriate for several reasons.
> First of all 0 priority reclaim was a crude hack which might lead to
> big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> of dirty/writeback pages).
> Soft reclaim should be applicable also to the targeted reclaim which is
> awkward right now without additional hacks.
> Last but not least the whole infrastructure eats quite some code.
>
> After this patch shrink_zone is done in 2 passes. First it tries to do the
> soft reclaim if appropriate (only for global reclaim for now to keep
> compatible with the original state) and fall back to ignoring soft limit
> if no group is eligible to soft reclaim or nothing has been scanned
> during the first pass. Only groups which are over their soft limit or
> any of their parents up the hierarchy is over the limit are considered
> eligible during the first pass.
>
> Soft limit tree which is not necessary anymore will be removed in the
> follow up patch to make this patch smaller and easier to review.
>
> Changes since v1
> - __shrink_zone doesn't return the number of shrunk groups as nr_scanned
> test covers both no groups scanned and no pages from the required zone
> as pointed by Johannes
>
> Signed-off-by: Michal Hocko <[email protected]>
Patch looks fine to me
Reviewed-by: Glauber Costa <[email protected]>
On 05/13/2013 11:46 AM, Michal Hocko wrote:
> Now that the soft limit is integrated to the reclaim directly the whole
> soft-limit tree infrastructure is not needed anymore. Rip it out.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 224 +------------------------------------------------------
> 1 file changed, 1 insertion(+), 223 deletions(-)
Great =)
Reviewed-by: Glauber Costa <[email protected]>
On 05/13/2013 11:46 AM, Michal Hocko wrote:
> Soft reclaim has been done only for the global reclaim (both background
> and direct). Since "memcg: integrate soft reclaim tighter with zone
> shrinking code" there is no reason for this limitation anymore as the
> soft limit reclaim doesn't use any special code paths and it is a
> part of the zone shrinking code which is used by both global and
> targeted reclaims.
>
> From semantic point of view it is even natural to consider soft limit
> before touching all groups in the hierarchy tree which is touching the
> hard limit because soft limit tells us where to push back when there is
> a memory pressure. It is not important whether the pressure comes from
> the limit or imbalanced zones.
>
> This patch simply enables soft reclaim unconditionally in
> mem_cgroup_should_soft_reclaim so it is enabled for both global and
> targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
> about the root of the reclaim to know where to stop checking soft limit
> state of parents up the hierarchy.
> Say we have
> A (over soft limit)
> \
> B (below s.l., hit the hard limit)
> / \
> C D (below s.l.)
>
> B is the source of the outside memory pressure now for D but we
> shouldn't soft reclaim it because it is behaving well under B subtree
> and we can still reclaim from C (pressumably it is over the limit).
> mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
> hierarchy at B (root of the memory pressure).
>
> Changes since v1
> - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
Reviewed-by: Glauber Costa <[email protected]>
Sorry about the delay. Just getting back to memcg.
On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
...
> during the first pass. Only groups which are over their soft limit or
> any of their parents up the hierarchy is over the limit are considered
ancestors?
> +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> +{
> + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> + unsigned long nr_scanned = sc->nr_scanned;
> +
> + __shrink_zone(zone, sc, do_soft_reclaim);
> +
> + /*
> + * No group is over the soft limit or those that are do not have
> + * pages in the zone we are reclaiming so we have to reclaim everybody
> + */
> + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
> + __shrink_zone(zone, sc, false);
> + return;
> + }
> +}
Maybe the following is easier to follow?
if (mem_cgroup_should_soft_reclaim(sc)) {
__shrink_zone(zone, sc, true);
if (sc->nr_scanned == nr_scanned)
__shrink_zone(zone, sc, false);
} else {
__shrink_zone(zone, sc, false);
}
But it's a minor point, please feel free to ignore.
Reviewed-by: Tejun Heo <[email protected]>
Thanks.
--
tejun
One more thing,
Given that this is a rather significant behavior change, it probably
is a good idea to include the benchmark results from the head
message?
Thanks.
--
tejun
On Mon, May 13, 2013 at 09:46:11AM +0200, Michal Hocko wrote:
> Now that the soft limit is integrated to the reclaim directly the whole
> soft-limit tree infrastructure is not needed anymore. Rip it out.
>
> Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Nice cleanup, thanks.
--
tejun
On Mon, May 13, 2013 at 09:46:12AM +0200, Michal Hocko wrote:
> Soft reclaim has been done only for the global reclaim (both background
> and direct). Since "memcg: integrate soft reclaim tighter with zone
> shrinking code" there is no reason for this limitation anymore as the
> soft limit reclaim doesn't use any special code paths and it is a
> part of the zone shrinking code which is used by both global and
> targeted reclaims.
...
> Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Tejun Heo <[email protected]>
Some nitpicks follow.
> /*
> - * A group is eligible for the soft limit reclaim if it is
> - * a) is over its soft limit
> + * A group is eligible for the soft limit reclaim under the given root
> + * hierarchy if
> + * a) it is over its soft limit
> * b) any parent up the hierarchy is over its soft limit
This was added before but in general I think the use of parent for
ancestor is a bit confusing. Not a big deal but no reason to continue
it.
> /*
> - * If any parent up the hierarchy is over its soft limit then we
> - * have to obey and reclaim from this group as well.
> + * If any parent up to the root in the hierarchy is over its soft limit
> + * then we have to obey and reclaim from this group as well.
Prolly using terms ancestors and subtree would make the explanation
clearer?
> static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
> {
> - return global_reclaim(sc);
> + return true;
Kinda silly after this change, maybe just modify shrink_zone() like
the following?
if (IS_ENABLED(CONFIG_MEMCG)) {
__shrink_zone(zone, sc, true);
if (sc->nr_scanned == nr_scanned)
__shrink_zone(zone, sc, false);
} else {
__shrink_zone(zone, sc, false);
}
> @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
> struct lruvec *lruvec;
>
> if (soft_reclaim &&
> - !mem_cgroup_soft_reclaim_eligible(memcg)) {
> + !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
Weird indentation which breaks line and goes over 80 col, why not do
the following?
if (soft_reclaim &&
!mem_cgroup_soft_reclaim_eligible(memcg, root)) {
memcg = mem_cgroup_iter(root, memcg, &reclaim);
continue;
}
Thanks.
--
tejun
On Thu 16-05-13 15:12:00, Tejun Heo wrote:
> Sorry about the delay. Just getting back to memcg.
>
> On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> ...
> > during the first pass. Only groups which are over their soft limit or
> > any of their parents up the hierarchy is over the limit are considered
>
> ancestors?
Well, ancestors might or might not be part of the hierarchy, so while it
is much shorter I consider the original more precise.
> > +static void shrink_zone(struct zone *zone, struct scan_control *sc)
> > +{
> > + bool do_soft_reclaim = mem_cgroup_should_soft_reclaim(sc);
> > + unsigned long nr_scanned = sc->nr_scanned;
> > +
> > + __shrink_zone(zone, sc, do_soft_reclaim);
> > +
> > + /*
> > + * No group is over the soft limit or those that are do not have
> > + * pages in the zone we are reclaiming so we have to reclaim everybody
> > + */
> > + if (do_soft_reclaim && (sc->nr_scanned == nr_scanned)) {
> > + __shrink_zone(zone, sc, false);
> > + return;
> > + }
> > +}
>
> Maybe the following is easier to follow?
>
> if (mem_cgroup_should_soft_reclaim(sc)) {
> __shrink_zone(zone, sc, true);
> if (sc->nr_scanned == nr_scanned)
> __shrink_zone(zone, sc, false);
> } else {
> __shrink_zone(zone, sc, false);
> }
>
> But it's a minor point, please feel free to ignore.
I will stick with the original code as I have some plans to build on top
of that.
> Reviewed-by: Tejun Heo <[email protected]>
Thank you!
>
> Thanks.
>
> --
> tejun
--
Michal Hocko
SUSE Labs
On Thu 16-05-13 15:15:01, Tejun Heo wrote:
> One more thing,
>
> Given that this is a rather significant behavior change, it probably
> is a good idea to include the the benchmark results from the head
> message?
The testing I have done was on top of the complete series. The last
patch should be irrelevant as I have tested the global reclaim but the
second patch might still influence figures a tiny bit (we still do the
soft limit tree thing). That's why I haven't pushed the numbers here.
I can add that information if people prefer, or just ask Andrew to squash
the leader email into the first patch as he tends to do quite often in
other cases as well.
--
Michal Hocko
SUSE Labs
On Thu 16-05-13 16:12:38, Tejun Heo wrote:
> On Mon, May 13, 2013 at 09:46:12AM +0200, Michal Hocko wrote:
> > Soft reclaim has been done only for the global reclaim (both background
> > and direct). Since "memcg: integrate soft reclaim tighter with zone
> > shrinking code" there is no reason for this limitation anymore as the
> > soft limit reclaim doesn't use any special code paths and it is a
> > part of the zone shrinking code which is used by both global and
> > targeted reclaims.
> ...
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Reviewed-by: Tejun Heo <[email protected]>
Thanks
>
> Some nitpicks follow.
>
> > /*
> > - * A group is eligible for the soft limit reclaim if it is
> > - * a) is over its soft limit
> > + * A group is eligible for the soft limit reclaim under the given root
> > + * hierarchy if
> > + * a) it is over its soft limit
> > * b) any parent up the hierarchy is over its soft limit
>
> This was added before but in general I think the use of parent for
> ancestor is a bit confusing. Not a big deal but no reason to continue
> it.
$ git grep ancestor mm/memcontrol.c | wc -l
4
$ git grep parent mm/memcontrol.c | wc -l
80
Yeah, we are used to using parent much more. Maybe it is worth a cleanup
on its own but I will stick with the majority in this patch.
> > /*
> > - * If any parent up the hierarchy is over its soft limit then we
> > - * have to obey and reclaim from this group as well.
> > + * If any parent up to the root in the hierarchy is over its soft limit
> > + * then we have to obey and reclaim from this group as well.
>
> Prolly using terms ancestors and subtree would make the explanation
> clearer?
As I said earlier, we should be explicit about the hierarchy as an
ancestor/parent (whatever we call it) might or might not be part of the
hierarchy. Yeah, we have that use_hierarchy thingy which we love so
much.
> > static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
> > {
> > - return global_reclaim(sc);
> > + return true;
>
> Kinda silly after this change, maybe just modify shrink_zone() like
> the following?
>
> if (IS_ENABLED(CONFIG_MEMCG)) {
> __shrink_zone(zone, sc, true);
> if (sc->nr_scanned == nr_scanned)
> __shrink_zone(zone, sc, false);
> } else {
> __shrink_zone(zone, sc, false);
> }
I plan to build on top of this where mem_cgroup_should_soft_reclaim
would do more than just return true. So I will keep it this way if you
do not mind.
> > @@ -1974,7 +1974,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
> > struct lruvec *lruvec;
> >
> > if (soft_reclaim &&
> > - !mem_cgroup_soft_reclaim_eligible(memcg)) {
> > + !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
>
> Weird indentation which breaks line and goes over 80 col, why not do
> the following?
>
> if (soft_reclaim &&
> !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
> memcg = mem_cgroup_iter(root, memcg, &reclaim);
> continue;
> }
Hmm, I usually rely on vim doing the_right_thing. I definitely do not
mind changing the formatting. I have fixed this in the first patch
where the code has been introduced and refreshed this patch on top of
that.
I will repost the whole series with reviewed-bys and other acks later
Thanks!
--
Michal Hocko
SUSE Labs
On Wed 15-05-13 12:42:10, Glauber Costa wrote:
> On 05/13/2013 11:46 AM, Michal Hocko wrote:
> > Soft reclaim has been done only for the global reclaim (both background
> > and direct). Since "memcg: integrate soft reclaim tighter with zone
> > shrinking code" there is no reason for this limitation anymore as the
> > soft limit reclaim doesn't use any special code paths and it is a
> > part of the zone shrinking code which is used by both global and
> > targeted reclaims.
> >
> > From semantic point of view it is even natural to consider soft limit
> > before touching all groups in the hierarchy tree which is touching the
> > hard limit because soft limit tells us where to push back when there is
> > a memory pressure. It is not important whether the pressure comes from
> > the limit or imbalanced zones.
> >
> > This patch simply enables soft reclaim unconditionally in
> > mem_cgroup_should_soft_reclaim so it is enabled for both global and
> > targeted reclaim paths. mem_cgroup_soft_reclaim_eligible needs to learn
> > about the root of the reclaim to know where to stop checking soft limit
> > state of parents up the hierarchy.
> > Say we have
> > A (over soft limit)
> > \
> > B (below s.l., hit the hard limit)
> > / \
> > C D (below s.l.)
> >
> > B is the source of the outside memory pressure now for D but we
> > shouldn't soft reclaim it because it is behaving well under B subtree
> > and we can still reclaim from C (pressumably it is over the limit).
> > mem_cgroup_soft_reclaim_eligible should therefore stop climbing up the
> > hierarchy at B (root of the memory pressure).
> >
> > Changes since v1
> > - add sc->target_mem_cgroup handling into mem_cgroup_soft_reclaim_eligible
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
>
> Reviewed-by: Glauber Costa <[email protected]>
Thanks for the review Glauber!
--
Michal Hocko
SUSE Labs
On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> Memcg soft reclaim has been traditionally triggered from the global
> reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> then picked up a group which exceeds the soft limit the most and
> reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
>
> The infrastructure requires per-node-zone trees which hold over-limit
> groups and keep them up-to-date (via memcg_check_events) which is not
> cost free. Although this overhead hasn't turned out to be a bottle neck
> the implementation is suboptimal because mem_cgroup_update_tree has no
> idea which zones consumed memory over the limit so we could easily end
> up having a group on a node-zone tree having only few pages from that
> node-zone.
>
> This patch doesn't try to fix node-zone trees management because it
> seems that integrating soft reclaim into zone shrinking sounds much
> easier and more appropriate for several reasons.
> First of all 0 priority reclaim was a crude hack which might lead to
> big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> of dirty/writeback pages).
> Soft reclaim should be applicable also to the targeted reclaim which is
> awkward right now without additional hacks.
> Last but not least the whole infrastructure eats quite some code.
>
> After this patch shrink_zone is done in 2 passes. First it tries to do the
> soft reclaim if appropriate (only for global reclaim for now to keep
> compatible with the original state) and fall back to ignoring soft limit
> if no group is eligible to soft reclaim or nothing has been scanned
> during the first pass. Only groups which are over their soft limit or
> any of their parents up the hierarchy is over the limit are considered
> eligible during the first pass.
There are setups with thousands of groups that do not even use soft
limits. Having them pointlessly iterate over all of them for every
couple of pages reclaimed is just not acceptable.
This is not the first time this implementation was proposed, either.
I'm afraid we have now truly gone full circle on this stuff.
Hello, Johannes.
On Fri, May 17, 2013 at 12:02:47PM -0400, Johannes Weiner wrote:
> There are setups with thousands of groups that do not even use soft
> limits. Having them pointlessly iterate over all of them for every
> couple of pages reclaimed is just not acceptable.
Hmmm... if the iteration is the problem, it shouldn't be difficult to
build a list of children which should be iterated. Would that make it
acceptable?
Thanks.
--
tejun
Tejun Heo <[email protected]> wrote:
>Hello, Johannes.
>
>On Fri, May 17, 2013 at 12:02:47PM -0400, Johannes Weiner wrote:
>> There are setups with thousands of groups that do not even use soft
>> limits. Having them pointlessly iterate over all of them for every
>> couple of pages reclaimed is just not acceptable.
>
>Hmmm... if the iteration is the problem, it shouldn't be difficult to
>build list of children which should be iterated. Would that make it
>acceptable?
You mean, a separate structure that tracks which groups are in excess of the limit? Like the current tree? :)
Kidding aside, yes, that would be better, and an unsorted list would probably be enough for the global case.
To support target reclaim soft limits later on, we could maybe propagate tags up the cgroup tree when a group is in excess so that reclaim can be smarter about which subtrees to test for soft limits and which to skip during the soft limit pass. The no-softlimit-set-anywhere case is then only a single tag test in the root cgroup.
But starting with the list would be simple enough, delete a bunch of code, come with the same performance improvements, etc.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Hello,
On Fri, May 17, 2013 at 10:27 AM, Johannes Weiner <[email protected]> wrote:
>>Hmmm... if the iteration is the problem, it shouldn't be difficult to
>>build list of children which should be iterated. Would that make it
>>acceptable?
>
> You mean, a separate structure that tracks which groups are in excess of the limit? Like the current tree? :)
Heh, yeah, realized that after writing it, but it can be something much
simpler, i.e. just a linked list of children with a soft limit configured.
> Kidding aside, yes, that would be better, and an unsorted list would probably be enough for the global case.
Yeap.
> To support target reclaim soft limits later on, we could maybe propagate tags upwards the cgroup tree when a group is in excess so that reclaim can be smarter about which subtrees to test for soft limits and which to skip during the soft limit pass. The no-softlimit-set-anywhere case is then only a single tag test in the root cgroup.
>
> But starting with the list would be simple enough, delete a bunch of code, come with the same performance improvements etc.
Thanks.
--
tejun
On Fri 17-05-13 12:02:47, Johannes Weiner wrote:
> On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
> > Memcg soft reclaim has been traditionally triggered from the global
> > reclaim paths before calling shrink_zone. mem_cgroup_soft_limit_reclaim
> > then picked up a group which exceeds the soft limit the most and
> > reclaimed it with 0 priority to reclaim at least SWAP_CLUSTER_MAX pages.
> >
> > The infrastructure requires per-node-zone trees which hold over-limit
> > groups and keep them up-to-date (via memcg_check_events) which is not
> > cost free. Although this overhead hasn't turned out to be a bottle neck
> > the implementation is suboptimal because mem_cgroup_update_tree has no
> > idea which zones consumed memory over the limit so we could easily end
> > up having a group on a node-zone tree having only few pages from that
> > node-zone.
> >
> > This patch doesn't try to fix node-zone trees management because it
> > seems that integrating soft reclaim into zone shrinking sounds much
> > easier and more appropriate for several reasons.
> > First of all 0 priority reclaim was a crude hack which might lead to
> > big stalls if the group's LRUs are big and hard to reclaim (e.g. a lot
> > of dirty/writeback pages).
> > Soft reclaim should be applicable also to the targeted reclaim which is
> > awkward right now without additional hacks.
> > Last but not least the whole infrastructure eats quite some code.
> >
> > After this patch shrink_zone is done in 2 passes. First it tries to do the
> > soft reclaim if appropriate (only for global reclaim for now to keep
> > compatible with the original state) and fall back to ignoring soft limit
> > if no group is eligible to soft reclaim or nothing has been scanned
> > during the first pass. Only groups which are over their soft limit or
> > any of their parents up the hierarchy is over the limit are considered
> > eligible during the first pass.
>
> There are setups with thousands of groups that do not even use soft
> limits. Having them pointlessly iterate over all of them for every
> couple of pages reclaimed is just not acceptable.
OK, that is a fair point. The soft reclaim pass should not be done if
we know that every group is below the limit. This can be fixed easily.
mem_cgroup_should_soft_reclaim could check a counter of over limit
groups. This still doesn't solve the problem if there are relatively few
groups over the limit wrt. those that are under the limit.
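Roughly what I mean by the counter check (the counter name is made up
and this is completely untested):

	/* hypothetical: number of groups currently above their soft limit */
	static atomic_t nr_over_soft_limit;

	static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
	{
		/* skip the soft limit pass when nobody is over the limit */
		return atomic_read(&nr_over_soft_limit) > 0;
	}

with the counter updated whenever a group crosses its soft limit in
either direction.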
You are proposing a simple list in the follow-up email. I have
considered this approach as well but then I decided not to go that
way because the list iteration doesn't respect the per-node-zone-priority
tree walk, which makes it tricky to prevent over-reclaim with many
parallel reclaimers. I rather wanted to integrate the soft reclaim into
the reclaim tree walk. There is also a problem when all groups are in
excess: then the whole tree collapses into a linked list, which is not
nice either (hmm, this could be mitigated if only the group in excess
which is highest in the hierarchy were kept on the list).
I was playing with a huge number of groups a bit and my conclusion
was that although the System time for the workload was higher with more
groups, the Elapsed time was still better than or comparable to the
original kernel. I do not know how many groups are used in practice but
I have never heard of configurations with more than thousands of memcgs,
and even for 8k groups the results didn't look too crazy.
I had one group (call it A) with the streaming IO load (dd if=/dev/zero
of=file with 4*TotalRam size) and a parallel hierarchy with 2 groups
with up to 12 levels each (512, 1024, 4096, 8192 groups) and no limit
set. I have compared the results with the same configuration with the
base kernel.
Two configurations have been tested: the `A' group without any soft limit,
and with its limit set to 0. The first configuration measures the
overhead of the additional pass, as there is no soft reclaim done in
either the base kernel or the rework. The second configuration compares
the effectiveness of the reworked implementation wrt. the base kernel.
* No soft limit set
Elapsed
500-no-limit/base: min: 16.32 max: 18.03 avg: 17.37 std: 0.75 runs: 3
500-no-limit/rework: min: 15.76 [96.6%] max: 19.72 [109.4%] avg: 17.49 [100.7%] std: 1.66 runs: 3
User
500-no-limit/base: min: 1.53 max: 1.60 avg: 1.57 std: 0.03 runs: 3
500-no-limit/rework: min: 1.18 [77.1%] max: 1.45 [90.6%] avg: 1.33 [84.7%] std: 0.11 runs: 3
System
500-no-limit/base: min: 38.60 max: 41.54 avg: 39.95 std: 1.21 runs: 3
500-no-limit/rework: min: 39.78 [103.1%] max: 42.93 [103.3%] avg: 41.06 [102.8%] std: 1.35 runs: 3
Elapsed
1k-no-limit/base: min: 37.04 max: 43.36 avg: 40.26 std: 2.58 runs: 3
1k-no-limit/rework: min: 16.38 [44.2%] max: 17.82 [41.1%] avg: 17.22 [42.8%] std: 0.61 runs: 3
User
1k-no-limit/base: min: 1.12 max: 1.38 avg: 1.24 std: 0.11 runs: 3
1k-no-limit/rework: min: 1.11 [99.1%] max: 1.26 [91.3%] avg: 1.20 [96.8%] std: 0.07 runs: 3
System
1k-no-limit/base: min: 33.51 max: 36.29 avg: 34.99 std: 1.14 runs: 3
1k-no-limit/rework: min: 45.09 [134.6%] max: 49.52 [136.5%] avg: 47.99 [137.2%] std: 2.05 runs: 3
Elapsed
4k-no-limit/base: min: 40.04 max: 47.14 avg: 44.46 std: 3.15 runs: 3
4k-no-limit/rework: min: 30.38 [75.9%] max: 37.66 [79.9%] avg: 34.24 [77.0%] std: 2.99 runs: 3
User
4k-no-limit/base: min: 1.16 max: 1.33 avg: 1.25 std: 0.07 runs: 3
4k-no-limit/rework: min: 0.70 [60.3%] max: 0.82 [61.7%] avg: 0.77 [61.6%] std: 0.05 runs: 3
System
4k-no-limit/base: min: 37.91 max: 39.91 avg: 39.19 std: 0.91 runs: 3
4k-no-limit/rework: min: 130.35 [343.8%] max: 133.26 [333.9%] avg: 131.63 [335.9%] std: 1.21 runs: 3
Elapsed
8k-no-limit/base: min: 41.27 max: 50.60 avg: 45.51 std: 3.86 runs: 3
8k-no-limit/rework: min: 39.56 [95.9%] max: 52.12 [103.0%] avg: 44.49 [97.8%] std: 5.47 runs: 3
User
8k-no-limit/base: min: 1.26 max: 1.38 avg: 1.32 std: 0.05 runs: 3
8k-no-limit/rework: min: 0.68 [54.0%] max: 0.82 [59.4%] avg: 0.73 [55.3%] std: 0.06 runs: 3
System
8k-no-limit/base: min: 39.93 max: 40.73 avg: 40.25 std: 0.34 runs: 3
8k-no-limit/rework: min: 228.74 [572.9%] max: 238.43 [585.4%] avg: 232.57 [577.8%] std: 4.21 runs: 3
* Soft limit set to 0 for the group with the dd load
Elapsed
500-0-limit/base: min: 30.29 max: 38.91 avg: 34.83 std: 3.53 runs: 3
500-0-limit/rework: min: 14.34 [47.3%] max: 17.18 [44.2%] avg: 16.01 [46.0%] std: 1.21 runs: 3
User
500-0-limit/base: min: 1.14 max: 1.29 avg: 1.24 std: 0.07 runs: 3
500-0-limit/rework: min: 1.42 [124.6%] max: 1.47 [114.0%] avg: 1.44 [116.1%] std: 0.02 runs: 3
System
500-0-limit/base: min: 31.94 max: 35.66 avg: 33.77 std: 1.52 runs: 3
500-0-limit/rework: min: 45.25 [141.7%] max: 47.43 [133.0%] avg: 46.27 [137.0%] std: 0.89 runs: 3
Elapsed
1k-0-limit/base: min: 37.23 max: 45.11 avg: 40.48 std: 3.36 runs: 3
1k-0-limit/rework: min: 15.18 [40.8%] max: 18.69 [41.4%] avg: 16.99 [42.0%] std: 1.44 runs: 3
User
1k-0-limit/base: min: 1.33 max: 1.56 avg: 1.44 std: 0.09 runs: 3
1k-0-limit/rework: min: 1.31 [98.5%] max: 1.55 [99.4%] avg: 1.44 [100.0%] std: 0.10 runs: 3
System
1k-0-limit/base: min: 33.21 max: 34.44 avg: 33.77 std: 0.51 runs: 3
1k-0-limit/rework: min: 45.52 [137.1%] max: 50.82 [147.6%] avg: 48.76 [144.4%] std: 2.32 runs: 3
Elapsed
4k-0-limit/base: min: 42.71 max: 47.83 avg: 45.45 std: 2.11 runs: 3
4k-0-limit/rework: min: 34.24 [80.2%] max: 34.99 [73.2%] avg: 34.56 [76.0%] std: 0.32 runs: 3
User
4k-0-limit/base: min: 1.11 max: 1.34 avg: 1.21 std: 0.10 runs: 3
4k-0-limit/rework: min: 0.80 [72.1%] max: 0.87 [64.9%] avg: 0.83 [68.6%] std: 0.03 runs: 3
System
4k-0-limit/base: min: 37.08 max: 40.28 avg: 38.91 std: 1.35 runs: 3
4k-0-limit/rework: min: 131.08 [353.5%] max: 132.33 [328.5%] avg: 131.66 [338.4%] std: 0.51 runs: 3
Elapsed
8k-0-limit/base: min: 35.71 max: 47.18 avg: 43.19 std: 5.29 runs: 3
8k-0-limit/rework: min: 43.95 [123.1%] max: 59.77 [126.7%] avg: 50.48 [116.9%] std: 6.75 runs: 3
User
8k-0-limit/base: min: 1.18 max: 1.21 avg: 1.19 std: 0.01 runs: 3
8k-0-limit/rework: min: 0.72 [61.0%] max: 0.85 [70.2%] avg: 0.77 [64.7%] std: 0.06 runs: 3
System
8k-0-limit/base: min: 38.34 max: 39.91 avg: 39.24 std: 0.66 runs: 3
8k-0-limit/rework: min: 196.90 [513.6%] max: 235.32 [589.6%] avg: 222.34 [566.6%] std: 17.99 runs: 3
As we can see, the System time climbs really high but the Elapsed time
is better than with the base kernel (except for 8k-0-limit). With more
parallel reclaimers the System time should be amortized better because
the reclaim tree walk is shared.
I think that the numbers can be improved even without introducing
the list of groups in excess. One way to go could be introducing a
conditional (callback) into the memcg iterator so that groups under the
limit are excluded during the walk without playing with css references
and other things. My quick and dirty patch shows that the 4k-0-limit
System time was reduced by 40% wrt. this patchset. With proper tagging
we can make the walk close to free.
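Roughly, the callback I have in mind would look something like this
(just a sketch at this point):

/* Sketch of a filter hook for the memcg iterator. */
enum mem_cgroup_filter_t {
	VISIT,		/* visit the current node */
	SKIP,		/* skip the node but continue into its children */
	SKIP_TREE,	/* skip the whole subtree */
};

typedef enum mem_cgroup_filter_t
(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);

struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
					struct mem_cgroup *prev,
					struct mem_cgroup_reclaim_cookie *reclaim,
					mem_cgroup_iter_filter cond);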
Nevertheless, I guess I can live with the excess list as well if the
above sounds like a no-go for you.
[...]
--
Michal Hocko
SUSE Labs
On Mon 20-05-13 16:44:38, Michal Hocko wrote:
[...]
> I had one group (call it A) with the streaming IO load (dd if=/dev/zero
> of=file with 4*TotalRam size) and a parallel hierarchy with 2 groups
> with up to 12 levels each (512, 1024, 4096, 8192 groups) and no limit
> set. I have compared the results with the same configuration with the
> base kernel.
> Two configurations have been tested. `A' group without any soft limit and
> with limit set to 0. The first configuration measures overhead of an
> additional pass as there is no soft reclaim done in both the base kernel
> and the rework. The second configuration compares the effectiveness of
> the reworked implementation wrt. the base kernel.
And I have forgotten to mention that the machine was booted with mem=1G
so there was only a single node without a Normal zone. This setup with
only a single mem hog makes the original implementation really effective
because there is a good chance that the aggressive soft limit reclaim
prevents us from getting into shrink_zone most of the time, and thus
from iterating all the groups.
The reality with bigger machines is quite different though. The
per-node tree where a group is inserted depends on timing because
mem_cgroup_update_tree uses the current page for the decision, so the
group can end up hanging on a node where it charged just a few pages.
The figures below therefore compare something close to the best case
for the original implementation with an average case.
As I have already said, the numbers can be improved, but is it essential
to do that right now rather than slowly evolving to something smarter?
> * No soft limit set
> Elapsed
> 500-no-limit/base: min: 16.32 max: 18.03 avg: 17.37 std: 0.75 runs: 3
> 500-no-limit/rework: min: 15.76 [96.6%] max: 19.72 [109.4%] avg: 17.49 [100.7%] std: 1.66 runs: 3
> User
> 500-no-limit/base: min: 1.53 max: 1.60 avg: 1.57 std: 0.03 runs: 3
> 500-no-limit/rework: min: 1.18 [77.1%] max: 1.45 [90.6%] avg: 1.33 [84.7%] std: 0.11 runs: 3
> System
> 500-no-limit/base: min: 38.60 max: 41.54 avg: 39.95 std: 1.21 runs: 3
> 500-no-limit/rework: min: 39.78 [103.1%] max: 42.93 [103.3%] avg: 41.06 [102.8%] std: 1.35 runs: 3
>
> Elapsed
> 1k-no-limit/base: min: 37.04 max: 43.36 avg: 40.26 std: 2.58 runs: 3
> 1k-no-limit/rework: min: 16.38 [44.2%] max: 17.82 [41.1%] avg: 17.22 [42.8%] std: 0.61 runs: 3
> User
> 1k-no-limit/base: min: 1.12 max: 1.38 avg: 1.24 std: 0.11 runs: 3
> 1k-no-limit/rework: min: 1.11 [99.1%] max: 1.26 [91.3%] avg: 1.20 [96.8%] std: 0.07 runs: 3
> System
> 1k-no-limit/base: min: 33.51 max: 36.29 avg: 34.99 std: 1.14 runs: 3
> 1k-no-limit/rework: min: 45.09 [134.6%] max: 49.52 [136.5%] avg: 47.99 [137.2%] std: 2.05 runs: 3
>
> Elapsed
> 4k-no-limit/base: min: 40.04 max: 47.14 avg: 44.46 std: 3.15 runs: 3
> 4k-no-limit/rework: min: 30.38 [75.9%] max: 37.66 [79.9%] avg: 34.24 [77.0%] std: 2.99 runs: 3
> User
> 4k-no-limit/base: min: 1.16 max: 1.33 avg: 1.25 std: 0.07 runs: 3
> 4k-no-limit/rework: min: 0.70 [60.3%] max: 0.82 [61.7%] avg: 0.77 [61.6%] std: 0.05 runs: 3
> System
> 4k-no-limit/base: min: 37.91 max: 39.91 avg: 39.19 std: 0.91 runs: 3
> 4k-no-limit/rework: min: 130.35 [343.8%] max: 133.26 [333.9%] avg: 131.63 [335.9%] std: 1.21 runs: 3
>
> Elapsed
> 8k-no-limit/base: min: 41.27 max: 50.60 avg: 45.51 std: 3.86 runs: 3
> 8k-no-limit/rework: min: 39.56 [95.9%] max: 52.12 [103.0%] avg: 44.49 [97.8%] std: 5.47 runs: 3
> User
> 8k-no-limit/base: min: 1.26 max: 1.38 avg: 1.32 std: 0.05 runs: 3
> 8k-no-limit/rework: min: 0.68 [54.0%] max: 0.82 [59.4%] avg: 0.73 [55.3%] std: 0.06 runs: 3
> System
> 8k-no-limit/base: min: 39.93 max: 40.73 avg: 40.25 std: 0.34 runs: 3
> 8k-no-limit/rework: min: 228.74 [572.9%] max: 238.43 [585.4%] avg: 232.57 [577.8%] std: 4.21 runs: 3
>
> * Soft limit set to 0 for the group with the dd load
> Elapsed
> 500-0-limit/base: min: 30.29 max: 38.91 avg: 34.83 std: 3.53 runs: 3
> 500-0-limit/rework: min: 14.34 [47.3%] max: 17.18 [44.2%] avg: 16.01 [46.0%] std: 1.21 runs: 3
> User
> 500-0-limit/base: min: 1.14 max: 1.29 avg: 1.24 std: 0.07 runs: 3
> 500-0-limit/rework: min: 1.42 [124.6%] max: 1.47 [114.0%] avg: 1.44 [116.1%] std: 0.02 runs: 3
> System
> 500-0-limit/base: min: 31.94 max: 35.66 avg: 33.77 std: 1.52 runs: 3
> 500-0-limit/rework: min: 45.25 [141.7%] max: 47.43 [133.0%] avg: 46.27 [137.0%] std: 0.89 runs: 3
>
> Elapsed
> 1k-0-limit/base: min: 37.23 max: 45.11 avg: 40.48 std: 3.36 runs: 3
> 1k-0-limit/rework: min: 15.18 [40.8%] max: 18.69 [41.4%] avg: 16.99 [42.0%] std: 1.44 runs: 3
> User
> 1k-0-limit/base: min: 1.33 max: 1.56 avg: 1.44 std: 0.09 runs: 3
> 1k-0-limit/rework: min: 1.31 [98.5%] max: 1.55 [99.4%] avg: 1.44 [100.0%] std: 0.10 runs: 3
> System
> 1k-0-limit/base: min: 33.21 max: 34.44 avg: 33.77 std: 0.51 runs: 3
> 1k-0-limit/rework: min: 45.52 [137.1%] max: 50.82 [147.6%] avg: 48.76 [144.4%] std: 2.32 runs: 3
>
> Elapsed
> 4k-0-limit/base: min: 42.71 max: 47.83 avg: 45.45 std: 2.11 runs: 3
> 4k-0-limit/rework: min: 34.24 [80.2%] max: 34.99 [73.2%] avg: 34.56 [76.0%] std: 0.32 runs: 3
> User
> 4k-0-limit/base: min: 1.11 max: 1.34 avg: 1.21 std: 0.10 runs: 3
> 4k-0-limit/rework: min: 0.80 [72.1%] max: 0.87 [64.9%] avg: 0.83 [68.6%] std: 0.03 runs: 3
> System
> 4k-0-limit/base: min: 37.08 max: 40.28 avg: 38.91 std: 1.35 runs: 3
> 4k-0-limit/rework: min: 131.08 [353.5%] max: 132.33 [328.5%] avg: 131.66 [338.4%] std: 0.51 runs: 3
>
> Elapsed
> 8k-0-limit/base: min: 35.71 max: 47.18 avg: 43.19 std: 5.29 runs: 3
> 8k-0-limit/rework: min: 43.95 [123.1%] max: 59.77 [126.7%] avg: 50.48 [116.9%] std: 6.75 runs: 3
> User
> 8k-0-limit/base: min: 1.18 max: 1.21 avg: 1.19 std: 0.01 runs: 3
> 8k-0-limit/rework: min: 0.72 [61.0%] max: 0.85 [70.2%] avg: 0.77 [64.7%] std: 0.06 runs: 3
> System
> 8k-0-limit/base: min: 38.34 max: 39.91 avg: 39.24 std: 0.66 runs: 3
> 8k-0-limit/rework: min: 196.90 [513.6%] max: 235.32 [589.6%] avg: 222.34 [566.6%] std: 17.99 runs: 3
>
> As we can see the System time climbs really high but the Elapsed time
> is better than in the base kernel (except for 8k-0-limit). If we had
> more reclaimers then the system time should be amortized more because
> the reclaim tree walk is shared.
>
> I think that the numbers can be improved even without introducing
> the list of groups in excess. One way to go could be introducing a
> conditional (callback) to the memcg iterator so the groups under the
> limit would be excluded during the walk without playing with css
> references and other things. My quick and dirty patch shows that
> 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> proper tagging we can make the walk close to free.
>
> Nevertheless, I guess I can live with the excess list as well if the
> above sounds like a no-go for you.
--
Michal Hocko
SUSE Labs
Hi,
it took me a bit longer than I wanted but I was stuck in a conference
room at the end of last week so I didn't have much time.
On Mon 20-05-13 16:44:38, Michal Hocko wrote:
> On Fri 17-05-13 12:02:47, Johannes Weiner wrote:
> > On Mon, May 13, 2013 at 09:46:10AM +0200, Michal Hocko wrote:
[...]
> > > After this patch shrink_zone is done in 2 passes. First it tries to do the
> > > soft reclaim if appropriate (only for global reclaim for now to keep
> > > compatible with the original state) and fall back to ignoring soft limit
> > > if no group is eligible to soft reclaim or nothing has been scanned
> > > during the first pass. Only groups which are over their soft limit or
> > > any of their parents up the hierarchy is over the limit are considered
> > > eligible during the first pass.
> >
> > There are setups with thousands of groups that do not even use soft
> > limits. Having them pointlessly iterate over all of them for every
> > couple of pages reclaimed is just not acceptable.
>
> OK, that is a fair point. The soft reclaim pass should not be done if
> we know that every group is below the limit. This can be fixed easily.
> mem_cgroup_should_soft_reclaim could check a counter of over limit
> groups. This still doesn't solve the problem if there are relatively few
> groups over the limit wrt. those that are under the limit.
>
> You are proposing a simple list in the follow up email. I have
> considered this approach as well but then I decided not to go that
> way because the list iteration doesn't respect per-node-zone-priority
> tree walk which makes it tricky to prevent from over-reclaim with many
> parallel reclaimers. I rather wanted to integrate the soft reclaim into
> the reclaim tree walk. There is also a problem when all groups are in
> excess then the whole tree collapses into a linked list which is not
> nice either (hmm, this could be mitigated if only a group in excess
> which is highest in the hierarchy would be in the list)
>
[...]
> I think that the numbers can be improved even without introducing
> the list of groups in excess. One way to go could be introducing a
> conditional (callback) to the memcg iterator so the groups under the
> limit would be excluded during the walk without playing with css
> references and other things. My quick and dirty patch shows that
> 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> proper tagging we can make the walk close to free.
And the following patchset implements that. My first numbers showed an
improvement (I will post some numbers later after I collect them).
Nevertheless I have encountered an issue while testing the
huge-number-of-groups scenario, and unfortunately the issue is not
limited to this scenario. As memcg iterators use a per
node-zone-priority cache to prevent over-reclaim, it might quite easily
happen that the walk does not visit all groups and either terminates
the loop prematurely or skips some groups. An example could be direct
reclaim racing with kswapd. This might cause the loop to miss over-limit
groups, so no pages are scanned and we fall back to the all-groups
reclaim.
Not good! But also not easy to fix without risking over-reclaim or
potential stalls. I was thinking about introducing something like:

bool should_soft_limit_reclaim_continue(struct mem_cgroup *root, int groups_reclaimed)
{
	if (!groups_reclaimed)
		return false;
	/* continue only while the root is still eligible for soft reclaim */
	if (mem_cgroup_soft_reclaim_eligible(root, root) == VISIT)
		return true;
	return false;
}
and loop again a few times in __shrink_zone. I am not entirely thrilled
about this as the effectiveness depends on the number of parallel
reclaimers at the same priority, but it should work at least somehow. If
anybody has a better idea I am all for it.
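For illustration, shrink_zone could then retry the soft limit pass a
bounded number of times, along the lines of the sketch below (it assumes
__shrink_zone is changed to return the number of groups it reclaimed
from; the retry count is an arbitrary placeholder):

static void shrink_zone(struct zone *zone, struct scan_control *sc)
{
	int retries = 3;	/* arbitrary placeholder */
	int groups_reclaimed;
	struct mem_cgroup *root = sc->target_mem_cgroup;

	if (!root)
		root = root_mem_cgroup;

	do {
		/* soft limit pass only */
		groups_reclaimed = __shrink_zone(zone, sc, true);
	} while (--retries &&
		 should_soft_limit_reclaim_continue(root, groups_reclaimed));

	/* nothing reclaimed by the soft limit pass - reclaim all groups */
	if (!groups_reclaimed)
		__shrink_zone(zone, sc, false);
}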
I will think about it some more.
Anyway I will post 3 patches which should mitigate the "too many groups"
issue as a reply to this email. See patches for details.
Soft limit reclaim has to check the whole reclaim hierarchy while doing
the first pass of the reclaim. This leads to a higher system time which
can be visible especially when there are many groups in the hierarchy.
- TODO put testing results here
This patch adds a per-memcg counter of children in excess. It also
restores MEM_CGROUP_TARGET_SOFTLIMIT in mem_cgroup_event_ratelimit for
proper batching.
If a group crosses its soft limit for the first time it increases the
children_in_excess of its parents up the hierarchy. Similarly, if a
group gets below the limit it decreases the counter. The transition is
recorded in the soft_contributed flag.
mem_cgroup_soft_reclaim_eligible then uses this information to better
decide whether to skip just the node or the whole subtree. The rule is
simple: skip only the node if it has children in excess, otherwise skip
the whole subtree.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 51 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 981ee12..60b48bc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -136,6 +136,7 @@ static const char * const mem_cgroup_lru_names[] = {
*/
enum mem_cgroup_events_target {
MEM_CGROUP_TARGET_THRESH,
+ MEM_CGROUP_TARGET_SOFTLIMIT,
MEM_CGROUP_TARGET_NUMAINFO,
MEM_CGROUP_NTARGETS,
};
@@ -355,6 +356,10 @@ struct mem_cgroup {
atomic_t numainfo_updating;
#endif
+ spinlock_t soft_lock;
+ bool soft_contributed;
+ atomic_t children_in_excess;
+
/*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
@@ -890,6 +895,9 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
case MEM_CGROUP_TARGET_THRESH:
next = val + THRESHOLDS_EVENTS_TARGET;
break;
+ case MEM_CGROUP_TARGET_SOFTLIMIT:
+ next = val + SOFTLIMIT_EVENTS_TARGET;
+ break;
case MEM_CGROUP_TARGET_NUMAINFO:
next = val + NUMAINFO_EVENTS_TARGET;
break;
@@ -902,6 +910,34 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
return false;
}
+static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
+{
+ unsigned long long excess = res_counter_soft_limit_excess(&memcg->res);
+ struct mem_cgroup *parent = memcg;
+ int delta = 0;
+
+ spin_lock(&memcg->soft_lock);
+ if (excess) {
+ if (!memcg->soft_contributed) {
+ delta = 1;
+ memcg->soft_contributed = true;
+ }
+ } else {
+ if (memcg->soft_contributed) {
+ delta = -1;
+ memcg->soft_contributed = false;
+ }
+ }
+
+ /*
+ * Necessary to update all ancestors when hierarchy is used
+ * because their event counter is not touched.
+ */
+ while (delta && (parent = parent_mem_cgroup(parent)))
+ atomic_add(delta, &parent->children_in_excess);
+ spin_unlock(&memcg->soft_lock);
+}
+
/*
* Check events in order.
*
@@ -912,8 +948,11 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
/* threshold event is triggered in finer grain than soft limit */
if (unlikely(mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_THRESH))) {
+ bool do_softlimit;
bool do_numainfo __maybe_unused;
+ do_softlimit = mem_cgroup_event_ratelimit(memcg,
+ MEM_CGROUP_TARGET_SOFTLIMIT);
#if MAX_NUMNODES > 1
do_numainfo = mem_cgroup_event_ratelimit(memcg,
MEM_CGROUP_TARGET_NUMAINFO);
@@ -921,6 +960,8 @@ static void memcg_check_events(struct mem_cgroup *memcg, struct page *page)
preempt_enable();
mem_cgroup_threshold(memcg);
+ if (unlikely(do_softlimit))
+ mem_cgroup_update_soft_limit(memcg);
#if MAX_NUMNODES > 1
if (unlikely(do_numainfo))
atomic_inc(&memcg->numainfo_events);
@@ -1894,6 +1935,9 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* hierarchy if
* a) it is over its soft limit
* b) any parent up the hierarchy is over its soft limit
+ *
+ * If the given group doesn't have any children over the limit then it
+ * doesn't make any sense to iterate its subtree.
*/
enum mem_cgroup_filter_t
mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
@@ -1915,6 +1959,8 @@ mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
break;
}
+ if (!atomic_read(&memcg->children_in_excess))
+ return SKIP_TREE;
return SKIP;
}
@@ -6061,6 +6107,7 @@ mem_cgroup_css_alloc(struct cgroup *cont)
mutex_init(&memcg->thresholds_lock);
spin_lock_init(&memcg->move_lock);
vmpressure_init(&memcg->vmpressure);
+ spin_lock_init(&memcg->soft_lock);
return &memcg->css;
@@ -6150,6 +6197,10 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
mem_cgroup_invalidate_reclaim_iterators(memcg);
mem_cgroup_reparent_charges(memcg);
+ if (memcg->soft_contributed) {
+ while ((memcg = parent_mem_cgroup(memcg)))
+ atomic_dec(&memcg->children_in_excess);
+ }
mem_cgroup_destroy_all_caches(memcg);
}
--
1.7.10.4
Children in soft limit excess are currently tracked up the hierarchy
in memcg->children_in_excess. Nevertheless there might still exist
tons of groups that are not in a hierarchical relation to the root
cgroup (e.g. all first-level groups if root_mem_cgroup->use_hierarchy ==
false).
As the whole tree walk has to be done when the iteration starts at
root_mem_cgroup, the iterator should be able to skip the walk altogether
if there is no child above the limit, without iterating the groups. This
can be done easily if the root tracks all children rather than only
hierarchical ones. This patch therefore updates root_mem_cgroup's
children_in_excess if root_mem_cgroup->use_hierarchy == false so the
root knows about all children in excess.
Please note that this is not an issue for inner memcgs which have
use_hierarchy == false because then only the single group is visited,
so no special optimization is necessary.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 592df10..7fb063f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -932,9 +932,15 @@ static void mem_cgroup_update_soft_limit(struct mem_cgroup *memcg)
/*
* Necessary to update all ancestors when hierarchy is used
* because their event counter is not touched.
+ * We track children even outside the hierarchy for the root
+ * cgroup because tree walk starting at root should visit
+ * all cgroups and we want to prevent from pointless tree
+ * walk if no children is below the limit.
*/
while (delta && (parent = parent_mem_cgroup(parent)))
atomic_add(delta, &parent->children_in_excess);
+ if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy)
+ atomic_add(delta, &root_mem_cgroup->children_in_excess);
spin_unlock(&memcg->soft_lock);
}
@@ -6200,6 +6206,9 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
if (memcg->soft_contributed) {
while ((memcg = parent_mem_cgroup(memcg)))
atomic_dec(&memcg->children_in_excess);
+
+ if (memcg != root_mem_cgroup && !root_mem_cgroup->use_hierarchy)
+ atomic_dec(&root_mem_cgroup->children_in_excess);
}
mem_cgroup_destroy_all_caches(memcg);
}
--
1.7.10.4
mem_cgroup_should_soft_reclaim controls whether the soft reclaim pass is
done and it currently always says yes. Memcg iterators are clever enough
to skip nodes that are not soft reclaimable quite efficiently, but
mem_cgroup_should_soft_reclaim can be even cleverer and not start the
soft reclaim pass at all if it knows that nothing would be scanned
anyway.
In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for
the target group of the reclaim and allow the pass only if the whole
subtree wouldn't be skipped.
TODO - should mem_cgroup_soft_reclaim_eligible check for NULL root/memcg
so that we do not export root_mem_cgroup?
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 2 ++
mm/memcontrol.c | 2 +-
mm/vmscan.c | 5 ++++-
3 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 811967a..909bb6b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -51,6 +51,8 @@ typedef enum mem_cgroup_filter_t
(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);
#ifdef CONFIG_MEMCG
+extern struct mem_cgroup *root_mem_cgroup;
+
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
* (gfp_mask & GFP_RECLAIM_MASK). In current implementatin, memcg doesn't
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 60b48bc..592df10 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -66,7 +66,7 @@ struct cgroup_subsys mem_cgroup_subsys __read_mostly;
EXPORT_SYMBOL(mem_cgroup_subsys);
#define MEM_CGROUP_RECLAIM_RETRIES 5
-static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct mem_cgroup *root_mem_cgroup __read_mostly;
#ifdef CONFIG_MEMCG_SWAP
/* Turned on only when memory cgroup is enabled && really_do_swap_account = 1 */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 49878dd..22c1278 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -142,7 +142,10 @@ static bool global_reclaim(struct scan_control *sc)
static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc)
{
- return true;
+ struct mem_cgroup *root = sc->target_mem_cgroup;
+ if (!root)
+ root = root_mem_cgroup;
+ return mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE;
}
#else
static bool global_reclaim(struct scan_control *sc)
--
1.7.10.4
Oh, forgot to post the first patch...
---
From 54b422ba96e3fa9264ee308c4fd9d5fe36efde63 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 20 May 2013 13:40:11 +0200
Subject: [PATCH] memcg: enhance memcg iterator to support predicates
The caller of the iterator might know that some nodes or even subtrees
should be skipped but there is no way to tell the iterator about that,
so the only choice left is to let the iterator visit each node and do
the selection outside of the iterating code. This, however, doesn't
scale well with hierarchies with many groups where only a few groups
are interesting.
This patch adds a mem_cgroup_iter_cond variant of the iterator with a
callback which gets called for every visited node. There are three
possible ways the callback can influence the walk: the node is visited,
the node is skipped but the tree walk continues down into its subtree,
or the whole subtree of the current group is skipped.
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 40 +++++++++++++++++++++----
mm/memcontrol.c | 69 ++++++++++++++++++++++++++++++++++----------
mm/vmscan.c | 16 ++++------
3 files changed, 93 insertions(+), 32 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 80ed1b6..811967a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -41,6 +41,15 @@ struct mem_cgroup_reclaim_cookie {
unsigned int generation;
};
+enum mem_cgroup_filter_t {
+ VISIT,
+ SKIP,
+ SKIP_TREE,
+};
+
+typedef enum mem_cgroup_filter_t
+(*mem_cgroup_iter_filter)(struct mem_cgroup *memcg, struct mem_cgroup *root);
+
#ifdef CONFIG_MEMCG
/*
* All "charge" functions with gfp_mask should use GFP_KERNEL or
@@ -107,9 +116,18 @@ mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
extern void mem_cgroup_end_migration(struct mem_cgroup *memcg,
struct page *oldpage, struct page *newpage, bool migration_ok);
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
- struct mem_cgroup *,
- struct mem_cgroup_reclaim_cookie *);
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
+ struct mem_cgroup *prev,
+ struct mem_cgroup_reclaim_cookie *reclaim,
+ mem_cgroup_iter_filter cond);
+
+static inline struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+ struct mem_cgroup *prev,
+ struct mem_cgroup_reclaim_cookie *reclaim)
+{
+ return mem_cgroup_iter_cond(root, prev, reclaim, NULL);
+}
+
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
/*
@@ -179,7 +197,8 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
struct mem_cgroup *root);
void __mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx);
@@ -294,6 +313,14 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *memcg,
struct page *oldpage, struct page *newpage, bool migration_ok)
{
}
+static inline struct mem_cgroup *
+mem_cgroup_iter_cond(struct mem_cgroup *root,
+ struct mem_cgroup *prev,
+ struct mem_cgroup_reclaim_cookie *reclaim,
+ mem_cgroup_iter_filter cond)
+{
+ return NULL;
+}
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
@@ -357,10 +384,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
}
static inline
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
struct mem_cgroup *root)
{
- return false;
+ return VISIT;
}
static inline void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f8dd37..981ee12 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -969,6 +969,15 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
return memcg;
}
+static enum mem_cgroup_filter_t
+mem_cgroup_filter(struct mem_cgroup *memcg, struct mem_cgroup *root,
+ mem_cgroup_iter_filter cond)
+{
+ if (!cond)
+ return VISIT;
+ return cond(memcg, root);
+}
+
/*
* Returns a next (in a pre-order walk) alive memcg (with elevated css
* ref. count) or NULL if the whole root's subtree has been visited.
@@ -976,7 +985,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_mm(struct mm_struct *mm)
* helper function to be used by mem_cgroup_iter
*/
static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
- struct mem_cgroup *last_visited)
+ struct mem_cgroup *last_visited, mem_cgroup_iter_filter cond)
{
struct cgroup *prev_cgroup, *next_cgroup;
@@ -984,10 +993,18 @@ static struct mem_cgroup *__mem_cgroup_iter_next(struct mem_cgroup *root,
* Root is not visited by cgroup iterators so it needs an
* explicit visit.
*/
- if (!last_visited)
- return root;
+ if (!last_visited) {
+ switch(mem_cgroup_filter(root, root, cond)) {
+ case VISIT:
+ return root;
+ case SKIP:
+ break;
+ case SKIP_TREE:
+ return NULL;
+ }
+ }
- prev_cgroup = (last_visited == root) ? NULL
+ prev_cgroup = (last_visited == root || !last_visited) ? NULL
: last_visited->css.cgroup;
skip_node:
next_cgroup = cgroup_next_descendant_pre(
@@ -1003,11 +1020,22 @@ skip_node:
if (next_cgroup) {
struct mem_cgroup *mem = mem_cgroup_from_cont(
next_cgroup);
- if (css_tryget(&mem->css))
- return mem;
- else {
+
+ switch (mem_cgroup_filter(mem, root, cond)) {
+ case SKIP:
prev_cgroup = next_cgroup;
goto skip_node;
+ case SKIP_TREE:
+ prev_cgroup = cgroup_rightmost_descendant(next_cgroup);
+ goto skip_node;
+ case VISIT:
+ if (css_tryget(&mem->css))
+ return mem;
+ else {
+ prev_cgroup = next_cgroup;
+ goto skip_node;
+ }
+ break;
}
}
@@ -1019,6 +1047,7 @@ skip_node:
* @root: hierarchy root
* @prev: previously returned memcg, NULL on first invocation
* @reclaim: cookie for shared reclaim walks, NULL for full walks
+ * @cond: filter for visited nodes, NULL for no filter
*
* Returns references to children of the hierarchy below @root, or
* @root itself, or %NULL after a full round-trip.
@@ -1031,9 +1060,10 @@ skip_node:
* divide up the memcgs in the hierarchy among all concurrent
* reclaimers operating on the same zone and priority.
*/
-struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
+struct mem_cgroup *mem_cgroup_iter_cond(struct mem_cgroup *root,
struct mem_cgroup *prev,
- struct mem_cgroup_reclaim_cookie *reclaim)
+ struct mem_cgroup_reclaim_cookie *reclaim,
+ mem_cgroup_iter_filter cond)
{
struct mem_cgroup *memcg = NULL;
struct mem_cgroup *last_visited = NULL;
@@ -1051,7 +1081,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
if (!root->use_hierarchy && root != root_mem_cgroup) {
if (prev)
goto out_css_put;
- return root;
+ if (mem_cgroup_filter(root, root, cond) == VISIT)
+ return root;
+ return NULL;
}
rcu_read_lock();
@@ -1094,7 +1126,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
}
}
- memcg = __mem_cgroup_iter_next(root, last_visited);
+ memcg = __mem_cgroup_iter_next(root, last_visited, cond);
+ /*
+ * No group has been visited because filter told us to skip
+ * the root node
+ */
+ if (cond && !last_visited && !memcg)
+ goto out_unlock;
if (reclaim) {
if (last_visited)
@@ -1857,13 +1895,14 @@ int mem_cgroup_select_victim_node(struct mem_cgroup *memcg)
* a) it is over its soft limit
* b) any parent up the hierarchy is over its soft limit
*/
-bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
+enum mem_cgroup_filter_t
+mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
struct mem_cgroup *root)
{
struct mem_cgroup *parent = memcg;
if (res_counter_soft_limit_excess(&memcg->res))
- return true;
+ return VISIT;
/*
* If any parent up to the root in the hierarchy is over its soft limit
@@ -1871,12 +1910,12 @@ bool mem_cgroup_soft_reclaim_eligible(struct mem_cgroup *memcg,
*/
while((parent = parent_mem_cgroup(parent))) {
if (res_counter_soft_limit_excess(&parent->res))
- return true;
+ return VISIT;
if (parent == root)
break;
}
- return false;
+ return SKIP;
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 151f6b9..49878dd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1964,21 +1964,16 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
.zone = zone,
.priority = sc->priority,
};
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ mem_cgroup_iter_filter filter = (soft_reclaim) ?
+ mem_cgroup_soft_reclaim_eligible : NULL;
nr_reclaimed = sc->nr_reclaimed;
nr_scanned = sc->nr_scanned;
- memcg = mem_cgroup_iter(root, NULL, &reclaim);
- do {
+ while ((memcg = mem_cgroup_iter_cond(root, memcg, &reclaim, filter))) {
struct lruvec *lruvec;
- if (soft_reclaim &&
- !mem_cgroup_soft_reclaim_eligible(memcg, root)) {
- memcg = mem_cgroup_iter(root, memcg, &reclaim);
- continue;
- }
-
lruvec = mem_cgroup_zone_lruvec(zone, memcg);
shrink_lruvec(lruvec, sc);
@@ -1998,8 +1993,7 @@ __shrink_zone(struct zone *zone, struct scan_control *sc, bool soft_reclaim)
mem_cgroup_iter_break(root, memcg);
break;
}
- memcg = mem_cgroup_iter(root, memcg, &reclaim);
- } while (memcg);
+ }
vmpressure(sc->gfp_mask, sc->target_mem_cgroup,
sc->nr_scanned - nr_scanned,
--
1.7.10.4
--
Michal Hocko
SUSE Labs
On Mon 27-05-13 19:13:08, Michal Hocko wrote:
[...]
> Nevertheless I have encountered an issue while testing the huge number
> of groups scenario. And the issue is not limitted to only to this
> scenario unfortunately. As memcg iterators use per node-zone-priority
> cache to prevent from over reclaim it might quite easily happen that
> the walk will not visit all groups and will terminate the loop either
> prematurely or skip some groups. An example could be the direct reclaim
> racing with kswapd. This might cause that the loop misses over limit
> groups so no pages are scanned and so we will fall back to all groups
> reclaim.
And after some more testing and head scratching it turned out that
the fallbacks to pass#2 I was seeing are caused by something else. It
is not a race between iterators but rather reclaiming from the DMA
zone, which has trouble scanning anything even though there are pages
on the LRU, and so we fall back. I have to look into that more but
whatever the issue is it shouldn't be related to the patch series.
--
Michal Hocko
SUSE Labs
On Mon 27-05-13 19:13:08, Michal Hocko wrote:
[...]
> > I think that the numbers can be improved even without introducing
> > the list of groups in excess. One way to go could be introducing a
> > conditional (callback) to the memcg iterator so the groups under the
> > limit would be excluded during the walk without playing with css
> > references and other things. My quick and dirty patch shows that
> > 4k-0-limit System time was reduced by 40% wrt. this patchset. With a
> > proper tagging we can make the walk close to free.
>
> And the following patchset implements that. My first numbers shown an
> improvement (I will post some numbers later after I collect them).
With that series applied (+ some minor cleanups - no functional changes)
I am getting much better numbers this time. Again, the test case is
streaming IO (dd from /dev/zero to a file with 4*MemTotal size) booted
with mem=1G, so there are just the DMA and DMA32 zones. This helps the
previous soft limit implementation as the over-limit group will be in
the proper mztree.
The load was running in a group (A) and other groups were created under
the root/bunch cgroups (with use_hierarchy and two groups at each
level). I have run each test 3 times. rework denotes the original
series, reworkoptim adds the 3 patches on top. The first number in the
name is the number of groups under root/bunch and the second the soft
limit for A.
System
0-0-limit/base: min: 32.45 max: 36.91 avg: 34.65 std: 1.82 runs: 3
0-0-limit/rework: min: 38.03 [117.2%] max: 45.30 [122.7%] avg: 42.30 [122.1%] std: 3.10 runs: 3
0-0-limit/reworkoptim: min: 43.99 [135.6%] max: 47.80 [129.5%] avg: 45.45 [131.2%] std: 1.68 runs: 3
Elapsed
0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3
0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3
0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3
System
0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3
0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3
0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3
Elapsed
0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3
0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3
0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3
System
2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3
2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3
2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3
Elapsed
2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3
2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3
2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3
System
8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3
8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3
8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3
Elapsed
8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3
8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3
8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3
The System time is increased by ~30-40% which can be explained by the
fact that the original soft reclaim scanned at priority 0 so it was much
more effective for this workload (which is basically touch once and
writeback). The Elapsed time looks better though (~20%), which sounds
like good news to me.
The tree walk overhead seems to be reduced considerably if we compare
reworkoptim to rework.
Same test without a soft limit set on A:
System
0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3
0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3
0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3
Elapsed
0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3
0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3
0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3
System
0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3
0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3
0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3
Elapsed
0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3
0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3
0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3
System
2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3
2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3
2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3
Elapsed
2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3
2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3
2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3
System
8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3
8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3
8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3
Elapsed
8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3
8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3
8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3
Both System and Elapsed are within stdev of the base kernel for all
configurations except for 8k, where both System and Elapsed are up by
35%. I do not have a good explanation for this because there is no soft
reclaim pass going on, as no group is above the limit, which is checked
in mem_cgroup_should_soft_reclaim.
I am still running kbuild tests with the same configuration to see a
more general workload.
Does this sound like a reasonable mitigation of the issue you were
worried about, Johannes?
--
Michal Hocko
SUSE Labs
On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> [...]
> > Nevertheless I have encountered an issue while testing the huge number
> > of groups scenario. And the issue is not limitted to only to this
> > scenario unfortunately. As memcg iterators use per node-zone-priority
> > cache to prevent from over reclaim it might quite easily happen that
> > the walk will not visit all groups and will terminate the loop either
> > prematurely or skip some groups. An example could be the direct reclaim
> > racing with kswapd. This might cause that the loop misses over limit
> > groups so no pages are scanned and so we will fall back to all groups
> > reclaim.
>
> And after some more testing and head scratching it turned out that
> fallbacks to pass#2 I was seeing are caused by something else. It is
> not race between iterators but rather reclaiming from zone DMA which
> has troubles to scan anything despite there are pages on LRU and so we
> fall back. I have to look into that more but what-ever the issue is it
> shouldn't be related to the patch series.
I think I know what is going on. get_scan_count sees a relatively small
number of pages on the lists (around 2k). This means that
get_scan_count will tell us to scan nothing at DEF_PRIORITY (as the DMA
zone is usually ~16M), so the DEF_PRIORITY pass is basically a no-op
and we have to wait and fall down to a priority which actually lets us
scan something.
Hmm, maybe ignoring soft reclaim for the DMA zone would help to reduce
one pointless loop over the groups.
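For example, mem_cgroup_should_soft_reclaim could take the zone into
account, along these lines (a sketch only; the extra zone argument and
the size threshold are made up here, not part of the series):

static bool mem_cgroup_should_soft_reclaim(struct scan_control *sc,
					   struct zone *zone)
{
	struct mem_cgroup *root = sc->target_mem_cgroup;

	/*
	 * A zone too small to yield a non-zero scan target at
	 * DEF_PRIORITY (like the ~16M DMA zone) would only cause a
	 * pointless walk over the groups.
	 */
	if (zone->present_pages < (1UL << DEF_PRIORITY))
		return false;

	if (!root)
		root = root_mem_cgroup;
	return mem_cgroup_soft_reclaim_eligible(root, root) != SKIP_TREE;
}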
--
Michal Hocko
SUSE Labs
On Wed, May 29, 2013 at 05:57:56PM +0200, Michal Hocko wrote:
> On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> > On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> > [...]
> > > Nevertheless I have encountered an issue while testing the huge number
> > > of groups scenario. And the issue is not limitted to only to this
> > > scenario unfortunately. As memcg iterators use per node-zone-priority
> > > cache to prevent from over reclaim it might quite easily happen that
> > > the walk will not visit all groups and will terminate the loop either
> > > prematurely or skip some groups. An example could be the direct reclaim
> > > racing with kswapd. This might cause that the loop misses over limit
> > > groups so no pages are scanned and so we will fall back to all groups
> > > reclaim.
> >
> > And after some more testing and head scratching it turned out that
> > fallbacks to pass#2 I was seeing are caused by something else. It is
> > not race between iterators but rather reclaiming from zone DMA which
> > has troubles to scan anything despite there are pages on LRU and so we
> > fall back. I have to look into that more but what-ever the issue is it
> > shouldn't be related to the patch series.
>
> Think I know what is going on. get_scan_count sees relatively small
> amount of pages in the lists (around 2k). This means that get_scan_count
> will tell us to scan nothing for DEF_PRIORITY (as the DMA32 is usually
> ~16M) then the DEF_PRIORITY is basically no-op and we have to wait and
> fall down to a priority which actually let us scan something.
>
> Hmm, maybe ignoring soft reclaim for DMA zone would help to reduce
> one pointless loop over groups.
If you have a small group in excess of its soft limit and bigger
groups that are not, you may reclaim something in the regular reclaim
cycle before reclaiming anything in the soft limit cycle with the way
the code is structured.
The soft limit cycle probably needs to sit outside of the priority
loop, not inside the loop, so that the soft limit reclaim cycle
descends priority levels until it makes progress BEFORE it exits to
the regular reclaim cycle.
On Wed 29-05-13 16:54:00, Michal Hocko wrote:
[...]
> I am still running kbuild tests with the same configuration to see a
> more general workload.
And here we go with the kbuild numbers. Same configuration (mem=1G, one
group for the kernel build - it actually expands the tree + builds a
distro config - plus a bunch of groups under root/bunch).
System
0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3
0-0-limit/reclaim: min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3
0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3
Elapsed
0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3
0-0-limit/reclaim: min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3
0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3
System
0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3
0.5k-0-limit/reclaim: min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3
0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3
Elapsed
0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3
0.5k-0-limit/reclaim: min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3
0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3
System
2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3
2k-0-limit/reclaim: min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3
2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3
Elapsed
2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3
2k-0-limit/reclaim: min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3
2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3
System
8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3
8k-0-limit/reclaim: min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3
8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3
Elapsed
8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3
8k-0-limit/reclaim: min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3
8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3
Apart from 8k the System time is comparable with the base kernel while
Elapsed is up to 20% better for all configurations.
And with no soft limit set:
System
0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3
0-no-limit/reclaim: min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3
0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3
Elapsed
0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3
0-no-limit/reclaim: min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3
0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3
System
0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3
0.5k-no-limit/reclaim: min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3
0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3
Elapsed
0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3
0.5k-no-limit/reclaim: min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3
0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3
System
2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3
2k-no-limit/reclaim: min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3
2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3
Elapsed
2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3
2k-no-limit/reclaim: min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3
2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3
System
8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3
8k-no-limit/reclaim: min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3
8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3
Elapsed
8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3
8k-no-limit/reclaim: min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3
8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3
and these numbers look good as well. System time is around 100%
(surprisingly better for the 8k case) and Elapsed copies that trend.
--
Michal Hocko
SUSE Labs
On Wed 29-05-13 16:01:54, Johannes Weiner wrote:
> On Wed, May 29, 2013 at 05:57:56PM +0200, Michal Hocko wrote:
> > On Wed 29-05-13 15:05:38, Michal Hocko wrote:
> > > On Mon 27-05-13 19:13:08, Michal Hocko wrote:
> > > [...]
> > > > Nevertheless I have encountered an issue while testing the huge number
> > > > of groups scenario. And the issue is not limitted to only to this
> > > > scenario unfortunately. As memcg iterators use per node-zone-priority
> > > > cache to prevent from over reclaim it might quite easily happen that
> > > > the walk will not visit all groups and will terminate the loop either
> > > > prematurely or skip some groups. An example could be the direct reclaim
> > > > racing with kswapd. This might cause that the loop misses over limit
> > > > groups so no pages are scanned and so we will fall back to all groups
> > > > reclaim.
> > >
> > > And after some more testing and head scratching it turned out that
> > > fallbacks to pass#2 I was seeing are caused by something else. It is
> > > not race between iterators but rather reclaiming from zone DMA which
> > > has troubles to scan anything despite there are pages on LRU and so we
> > > fall back. I have to look into that more but what-ever the issue is it
> > > shouldn't be related to the patch series.
> >
> > Think I know what is going on. get_scan_count sees relatively small
> > amount of pages in the lists (around 2k). This means that get_scan_count
> > will tell us to scan nothing for DEF_PRIORITY (as the DMA32 is usually
> > ~16M) then the DEF_PRIORITY is basically no-op and we have to wait and
> > fall down to a priority which actually let us scan something.
> >
> > Hmm, maybe ignoring soft reclaim for DMA zone would help to reduce
> > one pointless loop over groups.
>
> If you have a small group in excess of its soft limit and bigger
> groups that are not, you may reclaim something in the regular reclaim
> cycle before reclaiming anything in the soft limit cycle with the way
> the code is structured.
Yes, the way get_scan_count works might really cause this. Although
targeted reclaim is protected from it, the global reclaim can really
suffer. I am not sure this is necessarily a problem though. If we are
under global reclaim then a small group which doesn't have at least
1<<DEF_PRIORITY pages probably doesn't matter that much. The soft
limit is not a guarantee anyway so we can sacrifice some pages from all
groups in such a case.
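To illustrate with the numbers above (a simplified sketch of the
proportional scan target, roughly what get_scan_count computes per LRU
list):

/*
 * Simplified: the per-LRU scan target is the list size shifted down by
 * the reclaim priority.
 */
static unsigned long scan_target(unsigned long lru_size, int priority)
{
	return lru_size >> priority;
}

/*
 * With DEF_PRIORITY == 12 and ~2k pages on the lists:
 *   scan_target(2048, 12) == 0
 * so nothing is scanned until the priority drops to 11 or below
 * (2048 >> 11 == 1).
 */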
I also think that the force_scan logic should be enhanced a bit,
especially for cases like the DMA zone. The zone is clearly under the
watermarks but we have to wait a few priority cycles to reclaim
something. But this is a different issue, independent of the soft
reclaim rework.
> The soft limit cycle probably needs to sit outside of the priority
> loop, not inside the loop, so that the soft limit reclaim cycle
> descends priority levels until it makes progress BEFORE it exits to
> the regular reclaim cycle.
I do not like this, to be honest. shrink_zone is an ideal place as it is
shared among all reclaimers and we really want to obey the priority in
the soft reclaim as well. The corner case mentioned above is probably
fixable at the get_scan_count layer and even if not I wouldn't call it
a disaster.
--
Michal Hocko
SUSE Labs