Hi,
I am sending the patch below as an RFC because I am not entirely happy
about myself and maybe somebody can come up with a different approach
which would be less hackish.
As a background, I have noticed that memcg OOM killer kills a wrong
tasks while playing with memory.swappiness==0 in a small group (e.g.
50M). I have multiple anon mem eaters which fault in more than the hard
limit. OOM killer kills the last executed task:
# mem_eater spawns one process per parameter, mmaps the given size and
# faults memory in in parallel (all of them are synced to start together)
./mem_eater anon:50M anon:20M anon:20M anon:20M
10571: anon_eater for 20971520B
10570: anon_eater for 52428800B
10573: anon_eater for 20971520B
10572: anon_eater for 20971520B
10573: done with status 9
10571: done with status 0
10572: done with status 9
10570: done with status 9
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 5706] 0 5706 4955 556 13 0 0 bash
[10569] 0 10569 1015 134 6 0 0 mem_eater
[10570] 0 10570 13815 4118 15 0 0 mem_eater
[10571] 0 10571 6135 5140 16 0 0 mem_eater
[10572] 0 10572 6135 22 7 0 0 mem_eater
[10573] 0 10573 6135 3541 14 0 0 mem_eater
Memory cgroup out of memory: Kill process 10573 (mem_eater) score 0 or sacrifice child
Killed process 10573 (mem_eater) total-vm:24540kB, anon-rss:14028kB, file-rss:136kB
[...]
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 5706] 0 5706 4955 556 13 0 0 bash
[10569] 0 10569 1015 134 6 0 0 mem_eater
[10570] 0 10570 13815 10267 27 0 0 mem_eater
[10572] 0 10572 6135 2519 12 0 0 mem_eater
Memory cgroup out of memory: Kill process 10572 (mem_eater) score 0 or sacrifice child
Killed process 10572 (mem_eater) total-vm:24540kB, anon-rss:9940kB, file-rss:136kB
[...]
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 5706] 0 5706 4955 556 13 0 0 bash
[10569] 0 10569 1015 134 6 0 0 mem_eater
[10570] 0 10570 13815 12773 31 0 0 mem_eater
Memory cgroup out of memory: Kill process 10570 (mem_eater) score 2 or sacrifice child
Killed process 10570 (mem_eater) total-vm:55260kB, anon-rss:50956kB, file-rss:136kB
As you can see 50M (pid:10570) is killed as the last one while 20M ones
are killed first. See the patch for more details about the problem.
As I state in the changelog the very same issue is present in the global
oom killer as well but it is much less probable as the amount of swap is
usualy much smaller than the available RAM and I think it is not worth
considering.
---
>From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 10 Oct 2012 15:46:54 +0200
Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0
oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).
This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller
than the available swap because the used memory (capped by the limit)
is negligible comparing to totalpages so the resulting score is too
small. A wrong process might be selected as result.
The same issue exists for the global oom killer as well but it is not
that problematic as the amount of the RAM is usually much bigger than
the swap space.
The problem can be worked around by checking swappiness==0 and not
considering swap at all.
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
u64 limit;
- u64 memsw;
limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- limit += total_swap_pages << PAGE_SHIFT;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
/*
- * If memsw is finite and limits the amount of swap space available
- * to this memcg, return that limit.
+ * Do not consider swap space if we cannot swap due to swappiness
*/
- return min(limit, memsw);
+ if (mem_cgroup_swappiness(memcg)) {
+ u64 memsw;
+
+ limit += total_swap_pages << PAGE_SHIFT;
+ memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+ /*
+ * If memsw is finite and limits the amount of swap space
+ * available to this memcg, return that limit.
+ */
+ limit = min(limit, memsw);
+ }
+
+ return limit;
}
void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
1.7.10.4
--
Michal Hocko
SUSE Labs
On Wed, 10 Oct 2012, Michal Hocko wrote:
> Hi,
> I am sending the patch below as an RFC because I am not entirely happy
> about myself and maybe somebody can come up with a different approach
> which would be less hackish.
I don't see this as hackish, if memory.swappiness limits access to swap
then this shouldn't be factored into the calculation, and that's what your
patch fixes.
The reason why the process with the largest rss isn't killed in this case
is because all processes have CAP_SYS_ADMIN so they get a 3% bonus; when
factoring swap into the calculation and subtracting 3% from the score in
oom_badness(), they all end up having an internal score of 1 so they are
all considered equal. It appears like the cgroup_iter_next() iteration
for memcg ooms does this in reverse order, which is actually helpful so it
will select the task that is newer.
The only suggestion I have to make is specify this is for
memory.swappiness in the patch title, otherwise:
Acked-by: David Rientjes <[email protected]>
On Wed 10-10-12 13:50:21, David Rientjes wrote:
> On Wed, 10 Oct 2012, Michal Hocko wrote:
>
> > Hi,
> > I am sending the patch below as an RFC because I am not entirely happy
> > about myself and maybe somebody can come up with a different approach
> > which would be less hackish.
>
> I don't see this as hackish,
I didn't like how swappiness spreads outside of the LRU scanning code...
> if memory.swappiness limits access to swap then this shouldn't be
> factored into the calculation, and that's what your patch fixes.
>
> The reason why the process with the largest rss isn't killed in this case
> is because all processes have CAP_SYS_ADMIN so they get a 3% bonus;
OK I should have mentioned that I have tested it as root which makes a
big difference with the current upstream as totalpages are considered
only if adj!=0.
I have originally seen the problem in 3.0 kernel (with fe35004f applied)
where the calculation is different (missing a7f638f9) and we always
consider total_pages there so it doesn't depend on root or oom_score_adj.
> when factoring swap into the calculation and subtracting 3% from
> the score in oom_badness(), they all end up having an internal
> score of 1 so they are all considered equal. It appears like the
> cgroup_iter_next() iteration for memcg ooms does this in reverse
> order, which is actually helpful so it will select the task that is
> newer.
>
> The only suggestion I have to make is specify this is for
> memory.swappiness in the patch title, otherwise:
OK. I will also update the changelog to mention oom_score_adj and
CAP_SYS_ADMIN, mark the patch for stable and repost it.
> Acked-by: David Rientjes <[email protected]>
Thanks
--
Michal Hocko
SUSE Labs
oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).
This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.
The same issue exists for the global oom killer as well but it is not
that problematic as the amount of the RAM is usually much bigger than
the swap space.
The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: stable [3.5+]
---
mm/memcontrol.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
u64 limit;
- u64 memsw;
limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- limit += total_swap_pages << PAGE_SHIFT;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
/*
- * If memsw is finite and limits the amount of swap space available
- * to this memcg, return that limit.
+ * Do not consider swap space if we cannot swap due to swappiness
*/
- return min(limit, memsw);
+ if (mem_cgroup_swappiness(memcg)) {
+ u64 memsw;
+
+ limit += total_swap_pages << PAGE_SHIFT;
+ memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+ /*
+ * If memsw is finite and limits the amount of swap space
+ * available to this memcg, return that limit.
+ */
+ limit = min(limit, memsw);
+ }
+
+ return limit;
}
void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
1.7.10.4
On Thu 11-10-12 10:57:39, Michal Hocko wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: stable [3.5+]
I have just realized that fe35004f (introduced in 3.5-rc1) has been
backported to 3.2 and 3.4 stable kernels so this should be [3.2+]
> ---
> mm/memcontrol.c | 21 +++++++++++++++------
> 1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..93a7e36 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
> static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
> {
> u64 limit;
> - u64 memsw;
>
> limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
> - limit += total_swap_pages << PAGE_SHIFT;
>
> - memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> /*
> - * If memsw is finite and limits the amount of swap space available
> - * to this memcg, return that limit.
> + * Do not consider swap space if we cannot swap due to swappiness
> */
> - return min(limit, memsw);
> + if (mem_cgroup_swappiness(memcg)) {
> + u64 memsw;
> +
> + limit += total_swap_pages << PAGE_SHIFT;
> + memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
> +
> + /*
> + * If memsw is finite and limits the amount of swap space
> + * available to this memcg, return that limit.
> + */
> + limit = min(limit, memsw);
> + }
> +
> + return limit;
> }
>
> void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> --
> 1.7.10.4
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Michal Hocko
SUSE Labs
On Thu, Oct 11, 2012 at 10:57:39AM +0200, Michal Hocko wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: stable [3.5+]
I also don't think it's hackish, the limit depends very much on
whether reclaim can swap, so it's natural that swappiness shows up
here.
Acked-by: Johannes Weiner <[email protected]>
On Thu, Oct 11, 2012 at 4:57 AM, Michal Hocko <[email protected]> wrote:
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: stable [3.5+]
Acked-by: KOSAKI Motohiro <[email protected]>
On Thu 11-10-12 08:20:37, Johannes Weiner wrote:
> On Thu, Oct 11, 2012 at 10:57:39AM +0200, Michal Hocko wrote:
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> >
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> >
> > The same issue exists for the global oom killer as well but it is not
> > that problematic as the amount of the RAM is usually much bigger than
> > the swap space.
> >
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
> > Cc: stable [3.5+]
>
> I also don't think it's hackish, the limit depends very much on
> whether reclaim can swap, so it's natural that swappiness shows up
> here.
OK, maybe I was just a bit over sensitive here. The other reason I
didn't like it is that swappiness might change over time we some of the
tasks could be already swapped out. oom_score already considers
MM_SWAPENTS but this just tells the number of swapped out ptes not the
pages. So we could still kill something that is resident with much
smaller memory footprint. But this is a different issue and probably a
corner case.
> Acked-by: Johannes Weiner <[email protected]>
Thanks
--
Michal Hocko
SUSE Labs
On Thu 11-10-12 18:36:26, KOSAKI Motohiro wrote:
> On Thu, Oct 11, 2012 at 4:57 AM, Michal Hocko <[email protected]> wrote:
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> >
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> >
> > The same issue exists for the global oom killer as well but it is not
> > that problematic as the amount of the RAM is usually much bigger than
> > the swap space.
> >
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
> > Cc: stable [3.5+]
>
> Acked-by: KOSAKI Motohiro <[email protected]>
Thanks!
--
Michal Hocko
SUSE Labs
(2012/10/10 23:11), Michal Hocko wrote:
> Hi,
> I am sending the patch below as an RFC because I am not entirely happy
> about myself and maybe somebody can come up with a different approach
> which would be less hackish.
> As a background, I have noticed that memcg OOM killer kills a wrong
> tasks while playing with memory.swappiness==0 in a small group (e.g.
> 50M). I have multiple anon mem eaters which fault in more than the hard
> limit. OOM killer kills the last executed task:
>
> # mem_eater spawns one process per parameter, mmaps the given size and
> # faults memory in in parallel (all of them are synced to start together)
> ./mem_eater anon:50M anon:20M anon:20M anon:20M
> 10571: anon_eater for 20971520B
> 10570: anon_eater for 52428800B
> 10573: anon_eater for 20971520B
> 10572: anon_eater for 20971520B
> 10573: done with status 9
> 10571: done with status 0
> 10572: done with status 9
> 10570: done with status 9
>
> [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> [ 5706] 0 5706 4955 556 13 0 0 bash
> [10569] 0 10569 1015 134 6 0 0 mem_eater
> [10570] 0 10570 13815 4118 15 0 0 mem_eater
> [10571] 0 10571 6135 5140 16 0 0 mem_eater
> [10572] 0 10572 6135 22 7 0 0 mem_eater
> [10573] 0 10573 6135 3541 14 0 0 mem_eater
> Memory cgroup out of memory: Kill process 10573 (mem_eater) score 0 or sacrifice child
> Killed process 10573 (mem_eater) total-vm:24540kB, anon-rss:14028kB, file-rss:136kB
> [...]
> [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> [ 5706] 0 5706 4955 556 13 0 0 bash
> [10569] 0 10569 1015 134 6 0 0 mem_eater
> [10570] 0 10570 13815 10267 27 0 0 mem_eater
> [10572] 0 10572 6135 2519 12 0 0 mem_eater
> Memory cgroup out of memory: Kill process 10572 (mem_eater) score 0 or sacrifice child
> Killed process 10572 (mem_eater) total-vm:24540kB, anon-rss:9940kB, file-rss:136kB
> [...]
> [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> [ 5706] 0 5706 4955 556 13 0 0 bash
> [10569] 0 10569 1015 134 6 0 0 mem_eater
> [10570] 0 10570 13815 12773 31 0 0 mem_eater
> Memory cgroup out of memory: Kill process 10570 (mem_eater) score 2 or sacrifice child
> Killed process 10570 (mem_eater) total-vm:55260kB, anon-rss:50956kB, file-rss:136kB
>
> As you can see 50M (pid:10570) is killed as the last one while 20M ones
> are killed first. See the patch for more details about the problem.
> As I state in the changelog the very same issue is present in the global
> oom killer as well but it is much less probable as the amount of swap is
> usualy much smaller than the available RAM and I think it is not worth
> considering.
>
> ---
> From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 10 Oct 2012 15:46:54 +0200
> Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0
>
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller
> than the available swap because the used memory (capped by the limit)
> is negligible comparing to totalpages so the resulting score is too
> small. A wrong process might be selected as result.
>
> The same issue exists for the global oom killer as well but it is not
> that problematic as the amount of the RAM is usually much bigger than
> the swap space.
>
> The problem can be worked around by checking swappiness==0 and not
> considering swap at all.
>
> Signed-off-by: Michal Hocko <[email protected]>@jp.fujitsu.com>
Hm...where should we describe this behavior....
Documentation/cgroup/memory.txt "5.3 swappiness" ?
Anyway, the patch itself seems good.
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
On Mon 15-10-12 18:11:24, KAMEZAWA Hiroyuki wrote:
> (2012/10/10 23:11), Michal Hocko wrote:
[...]
> > From 445c2ced957cd77cbfca44d0e3f5056fed252a34 Mon Sep 17 00:00:00 2001
> >From: Michal Hocko <[email protected]>
> >Date: Wed, 10 Oct 2012 15:46:54 +0200
> >Subject: [PATCH] memcg: oom: fix totalpages calculation for swappiness==0
> >
> >oom_badness takes totalpages argument which says how many pages are
> >available and it uses it as a base for the score calculation. The value
> >is calculated by mem_cgroup_get_limit which considers both limit and
> >total_swap_pages (resp. memsw portion of it).
> >
> >This is usually correct but since fe35004f (mm: avoid swapping out
> >with swappiness==0) we do not swap when swappiness is 0 which means
> >that we cannot really use up all the totalpages pages. This in turn
> >confuses oom score calculation if the memcg limit is much smaller
> >than the available swap because the used memory (capped by the limit)
> >is negligible comparing to totalpages so the resulting score is too
> >small. A wrong process might be selected as result.
> >
> >The same issue exists for the global oom killer as well but it is not
> >that problematic as the amount of the RAM is usually much bigger than
> >the swap space.
> >
> >The problem can be worked around by checking swappiness==0 and not
> >considering swap at all.
> >
> >Signed-off-by: Michal Hocko <[email protected]>@jp.fujitsu.com>
>
> Hm...where should we describe this behavior....
> Documentation/cgroup/memory.txt "5.3 swappiness" ?
Hmm. The swappiness behavior is consistent with the global knob. On the
other hand the visible effects are still "stronger" as the environment
is much more constrained with memcgs so the corner cases are hit more
frequently. But this is somehow expected so I am not sure whether we
need to be explicit about this one.
Maybe we could be more explicit about the swappiness==0 behavior in
Documentation/sysctl/vm.txt because the current description is quite
vague as it doesn't say anything about the range. Maybe a patch bellow
will help to clarify this?
> Anyway, the patch itself seems good.
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Thanks!
---
>From 712995bc656cb7ad278aad45974b9e23fb524498 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe swappiness more precisely
since fe35004f (mm: avoid swapping out with swappiness==0) reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect.
While we are at it also mention the upper limit and its effect.
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/sysctl/vm.txt | 3 +++
1 file changed, 3 insertions(+)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 078701f..308fd77 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -640,6 +640,9 @@ swappiness
This control is used to define how aggressive the kernel will swap
memory pages. Higher values will increase agressiveness, lower values
decrease the amount of swap.
+The value can be used from the [0, 100] range, where 0 means no swapping
+at all (even if there is a swap storage enabled) while 100 means that
+anonymous pages are reclaimed in the same rate as file pages.
The default value is 60.
--
1.7.10.4
--
Michal Hocko
SUSE Labs
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 078701f..308fd77 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -640,6 +640,9 @@ swappiness
> This control is used to define how aggressive the kernel will swap
> memory pages. Higher values will increase agressiveness, lower values
> decrease the amount of swap.
> +The value can be used from the [0, 100] range, where 0 means no swapping
> +at all (even if there is a swap storage enabled) while 100 means that
> +anonymous pages are reclaimed in the same rate as file pages.
I think this only correct when memcg. Even if swappiness==0, global reclaim swap
out anon pages before oom.
see below.
get_scan_count()
(snip)
if (global_reclaim(sc)) {
free = zone_page_state(zone, NR_FREE_PAGES);
/* If we have very few page cache pages,
force-scan anon pages. */
if (unlikely(file + free <= high_wmark_pages(zone))) {
fraction[0] = 1;
fraction[1] = 0;
denominator = 1;
goto out;
}
}
On Mon 15-10-12 10:25:14, KOSAKI Motohiro wrote:
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 078701f..308fd77 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -640,6 +640,9 @@ swappiness
> > This control is used to define how aggressive the kernel will swap
> > memory pages. Higher values will increase agressiveness, lower values
> > decrease the amount of swap.
> > +The value can be used from the [0, 100] range, where 0 means no swapping
> > +at all (even if there is a swap storage enabled) while 100 means that
> > +anonymous pages are reclaimed in the same rate as file pages.
>
> I think this only correct when memcg. Even if swappiness==0, global reclaim swap
> out anon pages before oom.
Right you are (we really do swap when the file pages are really
low)! Sorry about the confusion. I kind of became if(global_reclaim)
block blind...
Then this really needs a memcg specific documentation fix. What about
the following?
---
>From 59a60705abd2faf9e266a4270bbf302001845588 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe memcg swappiness more precisely
since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect. Moreover global reclaim swapps out even when swappiness is 0
to prevent from OOM killer.
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/cgroups/memory.txt | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..71c4da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -466,6 +466,10 @@ Note:
5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+Please note that unlike the global swappiness, memcg knob set to 0
+really prevents from any swapping even if there is a swap storage
+available. This might lead to memcg OOM killer if there are no file
+pages to reclaim.
Following cgroups' swappiness can't be changed.
- root cgroup (uses /proc/sys/vm/swappiness).
--
1.7.10.4
--
Michal Hocko
SUSE Labs
As Kosaki correctly pointed out, the glogal reclaim doesn't have this
issue because we _do_ swap on swappinnes==0 so the swap space has
to be considered. So the v2 is just acks + changelog fix.
Changes since v1
- drop a note about global swappiness affected as well from the
changelog
- stable needs 3.2+ rather than 3.5+ because the fe35004f has been
backported to stable
---
>From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 10 Oct 2012 15:46:54 +0200
Subject: [PATCH] memcg: oom: fix totalpages calculation for
memory.swappiness==0
oom_badness takes totalpages argument which says how many pages are
available and it uses it as a base for the score calculation. The value
is calculated by mem_cgroup_get_limit which considers both limit and
total_swap_pages (resp. memsw portion of it).
This is usually correct but since fe35004f (mm: avoid swapping out
with swappiness==0) we do not swap when swappiness is 0 which means
that we cannot really use up all the totalpages pages. This in turn
confuses oom score calculation if the memcg limit is much smaller than
the available swap because the used memory (capped by the limit) is
negligible comparing to totalpages so the resulting score is too small
if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
A wrong process might be selected as result.
The problem can be worked around by checking mem_cgroup_swappiness==0
and not considering swap at all in such a case.
Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: KOSAKI Motohiro <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Cc: stable [3.2+]
---
mm/memcontrol.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..93a7e36 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1452,17 +1452,26 @@ static int mem_cgroup_count_children(struct mem_cgroup *memcg)
static u64 mem_cgroup_get_limit(struct mem_cgroup *memcg)
{
u64 limit;
- u64 memsw;
limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
- limit += total_swap_pages << PAGE_SHIFT;
- memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
/*
- * If memsw is finite and limits the amount of swap space available
- * to this memcg, return that limit.
+ * Do not consider swap space if we cannot swap due to swappiness
*/
- return min(limit, memsw);
+ if (mem_cgroup_swappiness(memcg)) {
+ u64 memsw;
+
+ limit += total_swap_pages << PAGE_SHIFT;
+ memsw = res_counter_read_u64(&memcg->memsw, RES_LIMIT);
+
+ /*
+ * If memsw is finite and limits the amount of swap space
+ * available to this memcg, return that limit.
+ */
+ limit = min(limit, memsw);
+ }
+
+ return limit;
}
void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
--
1.7.10.4
--
Michal Hocko
SUSE Labs
And a follow up for memcg.swappiness documentation which is more
specific about spwappiness==0 meaning.
---
>From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Mon, 15 Oct 2012 11:43:56 +0200
Subject: [PATCH] doc: describe memcg swappiness more precisely
since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
stopped swapping out anon pages completely when 0 value is used.
Although this is somehow expected it hasn't been done for a really long
time this way and so it is probably better to be explicit about the
effect. Moreover global reclaim swapps out even when swappiness is 0
to prevent from OOM killer.
Signed-off-by: Michal Hocko <[email protected]>
---
Documentation/cgroups/memory.txt | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index c07f7b4..71c4da4 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -466,6 +466,10 @@ Note:
5.3 swappiness
Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
+Please note that unlike the global swappiness, memcg knob set to 0
+really prevents from any swapping even if there is a swap storage
+available. This might lead to memcg OOM killer if there are no file
+pages to reclaim.
Following cgroups' swappiness can't be changed.
- root cgroup (uses /proc/sys/vm/swappiness).
--
1.7.10.4
--
Michal Hocko
SUSE Labs
>> I think this only correct when memcg. Even if swappiness==0, global reclaim swap
>> out anon pages before oom.
>
> Right you are (we really do swap when the file pages are really
> low)! Sorry about the confusion. I kind of became if(global_reclaim)
> block blind...
>
> Then this really needs a memcg specific documentation fix. What about
> the following?
> ---
> From 59a60705abd2faf9e266a4270bbf302001845588 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
> 5.3 swappiness
>
> Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.
Pretty good to me. Thank you!
Acked-by: KOSAKI Motohiro <[email protected]>
(2012/10/16 7:07), Michal Hocko wrote:
> And a follow up for memcg.swappiness documentation which is more
> specific about spwappiness==0 meaning.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko <[email protected]>
Nice :)
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index c07f7b4..71c4da4 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -466,6 +466,10 @@ Note:
> 5.3 swappiness
>
> Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only.
> +Please note that unlike the global swappiness, memcg knob set to 0
> +really prevents from any swapping even if there is a swap storage
> +available. This might lead to memcg OOM killer if there are no file
> +pages to reclaim.
>
> Following cgroups' swappiness can't be changed.
> - root cgroup (uses /proc/sys/vm/swappiness).
>
On Tue, 16 Oct 2012, Michal Hocko wrote:
> And a follow up for memcg.swappiness documentation which is more
> specific about spwappiness==0 meaning.
> ---
> From 1bc3a94fea728107ed108edd42df464b908cd067 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Mon, 15 Oct 2012 11:43:56 +0200
> Subject: [PATCH] doc: describe memcg swappiness more precisely
>
> since fe35004f (mm: avoid swapping out with swappiness==0) memcg reclaim
> stopped swapping out anon pages completely when 0 value is used.
> Although this is somehow expected it hasn't been done for a really long
> time this way and so it is probably better to be explicit about the
> effect. Moreover global reclaim swapps out even when swappiness is 0
> to prevent from OOM killer.
>
> Signed-off-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Tue, 16 Oct 2012 00:04:08 +0200
Michal Hocko <[email protected]> wrote:
> As Kosaki correctly pointed out, the glogal reclaim doesn't have this
> issue because we _do_ swap on swappinnes==0 so the swap space has
> to be considered. So the v2 is just acks + changelog fix.
>
> Changes since v1
> - drop a note about global swappiness affected as well from the
> changelog
> - stable needs 3.2+ rather than 3.5+ because the fe35004f has been
> backported to stable
> ---
> >From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 10 Oct 2012 15:46:54 +0200
> Subject: [PATCH] memcg: oom: fix totalpages calculation for
> memory.swappiness==0
>
> oom_badness takes totalpages argument which says how many pages are
> available and it uses it as a base for the score calculation. The value
> is calculated by mem_cgroup_get_limit which considers both limit and
> total_swap_pages (resp. memsw portion of it).
>
> This is usually correct but since fe35004f (mm: avoid swapping out
> with swappiness==0) we do not swap when swappiness is 0 which means
> that we cannot really use up all the totalpages pages. This in turn
> confuses oom score calculation if the memcg limit is much smaller than
> the available swap because the used memory (capped by the limit) is
> negligible comparing to totalpages so the resulting score is too small
> if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> A wrong process might be selected as result.
>
> The problem can be worked around by checking mem_cgroup_swappiness==0
> and not considering swap at all in such a case.
>
> Signed-off-by: Michal Hocko <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: KOSAKI Motohiro <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Cc: stable [3.2+]
That's "Cc: <[email protected]>", please.
It's unobvious from the changelog that a -stable backport is really
needed. The bug looks pretty obscure and has been there for a long
time. Realistically, is anyone likely to hurt from this?
On Wed 07-11-12 14:10:25, Andrew Morton wrote:
> On Tue, 16 Oct 2012 00:04:08 +0200
> Michal Hocko <[email protected]> wrote:
>
> > As Kosaki correctly pointed out, the glogal reclaim doesn't have this
> > issue because we _do_ swap on swappinnes==0 so the swap space has
> > to be considered. So the v2 is just acks + changelog fix.
> >
> > Changes since v1
> > - drop a note about global swappiness affected as well from the
> > changelog
> > - stable needs 3.2+ rather than 3.5+ because the fe35004f has been
> > backported to stable
> > ---
> > >From c2ae4849f09dbfda6b61472c6dd1fd8c2fe8ac81 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Wed, 10 Oct 2012 15:46:54 +0200
> > Subject: [PATCH] memcg: oom: fix totalpages calculation for
> > memory.swappiness==0
> >
> > oom_badness takes totalpages argument which says how many pages are
> > available and it uses it as a base for the score calculation. The value
> > is calculated by mem_cgroup_get_limit which considers both limit and
> > total_swap_pages (resp. memsw portion of it).
> >
> > This is usually correct but since fe35004f (mm: avoid swapping out
> > with swappiness==0) we do not swap when swappiness is 0 which means
> > that we cannot really use up all the totalpages pages. This in turn
> > confuses oom score calculation if the memcg limit is much smaller than
> > the available swap because the used memory (capped by the limit) is
> > negligible comparing to totalpages so the resulting score is too small
> > if adj!=0 (typically task with CAP_SYS_ADMIN or non zero oom_score_adj).
> > A wrong process might be selected as result.
> >
> > The problem can be worked around by checking mem_cgroup_swappiness==0
> > and not considering swap at all in such a case.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
> > Acked-by: Johannes Weiner <[email protected]>
> > Acked-by: KOSAKI Motohiro <[email protected]>
> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Cc: stable [3.2+]
>
> That's "Cc: <[email protected]>", please.
Will do next time.
> It's unobvious from the changelog that a -stable backport is really
> needed. The bug looks pretty obscure and has been there for a long
> time.
Yes but it is not _that_ long since fe35004f made it into stable trees
(e.g. 3.2.29).
The reason why we probably do not see many reports is because people
didn't get used to swappiness==0 really works these days - especially
with memcg where it means _really_ no swapping.
> Realistically, is anyone likely to hurt from this?
The primary motivation for the fix was a real report by a customer.
--
Michal Hocko
SUSE Labs
On Wed, 7 Nov 2012 23:46:40 +0100
Michal Hocko <[email protected]> wrote:
> > Realistically, is anyone likely to hurt from this?
>
> The primary motivation for the fix was a real report by a customer.
Describe it please and I'll copy it to the changelog.
So that Greg can understand why we sent this at him and so that others
who hit the same problem (or any other problem!) will be able to
determine whether this patch might solve it.
On Wed 07-11-12 14:53:40, Andrew Morton wrote:
> On Wed, 7 Nov 2012 23:46:40 +0100
> Michal Hocko <[email protected]> wrote:
>
> > > Realistically, is anyone likely to hurt from this?
> >
> > The primary motivation for the fix was a real report by a customer.
>
> Describe it please and I'll copy it to the changelog.
The original issue (a wrong tasks get killed in a small group and memcg
swappiness=0) has been reported on top of our 3.0 based kernel (with
fe35004f backported). I have tried to replicate it by the test case
mentioned https://lkml.org/lkml/2012/10/10/223.
As David correctly pointed out (https://lkml.org/lkml/2012/10/10/418)
the significant role played the fact that all the processes in the group
have CAP_SYS_ADMIN but oom_score_adj has the similar effect.
Say there is 2G of swap space which is 524288 pages. If you add
CAP_SYS_ADMIN bonus then you have -15728 score for the bias. This means
that all tasks with less than 60M get the minimum score and it is tasks
ordering which determines who gets killed as a result.
To summarize it. Users of small groups (relatively to the swap size)
with CAP_SYS_ADMIN tasks resp. oom_score_adj are affected the most
others might see an unexpected oom_badness calculation.
Whether this is a workload which is representative, I don't know but
I think that it is worth fixing and pushing to stable as well.
--
Michal Hocko
SUSE Labs