Hi, Andrew!
Please, find an updated version of memory.min refinements/fixes
in this patchset. It's against linus tree.
Please, merge these patches into 4.18.
Thanks!
Roman Gushchin (3):
mm: fix null pointer dereference in mem_cgroup_protected
mm, memcg: propagate memory effective protection on setting
memory.min/low
mm, memcg: don't skip memory guarantee calculations
mm/memcontrol.c | 33 ++++++++++++++++++++++++---------
1 file changed, 24 insertions(+), 9 deletions(-)
--
2.14.4
There are two cases when effective memory guarantee calculation is
mistakenly skipped:
1) If memcg is a child of the root cgroup, and the root cgroup is not
root_mem_cgroup (in other words, if the reclaim is targeted).
Top-level memory cgroups are handled specially in
mem_cgroup_protected(), because the root memory cgroup doesn't have
memory guarantee and can't limit its children guarantees. So, all
effective guarantee calculation is skipped. But in case of targeted
reclaim things are different: cgroups, which parent exceeded its memory
limit aren't special.
2) If memcg has no charged memory (memory usage is 0). In this case
mem_cgroup_protected() always returns MEMCG_PROT_NONE, which is correct
and prevents to generate fake memory low events for empty cgroups. But
skipping memory emin/elow calculation is wrong: if there is no global
memory pressure there might be no good chance again, so we can end up
with effective guarantees set to 0 without any reason.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/memcontrol.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 485df6f63d26..3220c992ee26 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5477,15 +5477,10 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
if (mem_cgroup_disabled())
return MEMCG_PROT_NONE;
- if (!root)
- root = root_mem_cgroup;
- if (memcg == root)
+ if (memcg == root_mem_cgroup)
return MEMCG_PROT_NONE;
usage = page_counter_read(&memcg->memory);
- if (!usage)
- return MEMCG_PROT_NONE;
-
emin = memcg->memory.min;
elow = memcg->memory.low;
@@ -5494,7 +5489,7 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
if (!parent)
return MEMCG_PROT_NONE;
- if (parent == root)
+ if (parent == root_mem_cgroup)
goto exit;
parent_emin = READ_ONCE(parent->memory.emin);
@@ -5529,6 +5524,12 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
memcg->memory.emin = emin;
memcg->memory.elow = elow;
+ if (root && memcg == root)
+ return MEMCG_PROT_NONE;
+
+ if (!usage)
+ return MEMCG_PROT_NONE;
+
if (usage <= emin)
return MEMCG_PROT_MIN;
else if (usage <= elow)
--
2.14.4
Shakeel reported a crash in mem_cgroup_protected(), which
can be triggered by memcg reclaim if the legacy cgroup v1
use_hierarchy=0 mode is used:
[ 226.060572] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000120
[ 226.068310] PGD 8000001ff55da067 P4D 8000001ff55da067 PUD 1fdc7df067 PMD 0
[ 226.075191] Oops: 0000 [#4] SMP PTI
[ 226.078637] CPU: 0 PID: 15581 Comm: bash Tainted: G D
4.17.0-smp-clean #5
[ 226.086635] Hardware name: ...
[ 226.094546] RIP: 0010:mem_cgroup_protected+0x54/0x130
[ 226.099533] Code: 4c 8b 8e 00 01 00 00 4c 8b 86 08 01 00 00 48 8d
8a 08 ff ff ff 48 85 d2 ba 00 00 00 00 48 0f 44 ca 48 39 c8 0f 84 cf
00 00 00 <48> 8b 81 20 01 00 00 4d 89 ca 4c 39 c8 4c 0f 46 d0 4d 85 d2
74 05
[ 226.118194] RSP: 0000:ffffabe64dfafa58 EFLAGS: 00010286
[ 226.123358] RAX: ffff9fb6ff03d000 RBX: ffff9fb6f5b1b000 RCX: 0000000000000000
[ 226.130406] RDX: 0000000000000000 RSI: ffff9fb6f5b1b000 RDI: ffff9fb6f5b1b000
[ 226.137454] RBP: ffffabe64dfafb08 R08: 0000000000000000 R09: 0000000000000000
[ 226.144503] R10: 0000000000000000 R11: 000000000000c800 R12: ffffabe64dfafb88
[ 226.151551] R13: ffff9fb6f5b1b000 R14: ffffabe64dfafb88 R15: ffff9fb77fffe000
[ 226.158602] FS: 00007fed1f8ac700(0000) GS:ffff9fb6ff400000(0000)
knlGS:0000000000000000
[ 226.166594] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 226.172270] CR2: 0000000000000120 CR3: 0000001fdcf86003 CR4: 00000000001606f0
[ 226.179317] Call Trace:
[ 226.181732] ? shrink_node+0x194/0x510
[ 226.185435] do_try_to_free_pages+0xfd/0x390
[ 226.189653] try_to_free_mem_cgroup_pages+0x123/0x210
[ 226.194643] try_charge+0x19e/0x700
[ 226.198088] mem_cgroup_try_charge+0x10b/0x1a0
[ 226.202478] wp_page_copy+0x134/0x5b0
[ 226.206094] do_wp_page+0x90/0x460
[ 226.209453] __handle_mm_fault+0x8e3/0xf30
[ 226.213498] handle_mm_fault+0xfe/0x220
[ 226.217285] __do_page_fault+0x262/0x500
[ 226.221158] do_page_fault+0x28/0xd0
[ 226.224689] ? page_fault+0x8/0x30
[ 226.228048] page_fault+0x1e/0x30
[ 226.231323] RIP: 0033:0x485b72
The problem happens because parent_mem_cgroup() returns a NULL
pointer, which is dereferenced later without a check.
As cgroup v1 has no memory guarantee support, let's make
mem_cgroup_protected() immediately return MEMCG_PROT_NONE,
if the given cgroup has no parent (non-hierarchical mode is used).
Reported-by: Shakeel Butt <[email protected]>
Tested-by: Shakeel Butt <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Fixes: bf8d5d52ffe8 ("memcg: introduce memory.min")
---
mm/memcontrol.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1e64d60ed02..5a3873e9d657 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5480,6 +5480,10 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
elow = memcg->memory.low;
parent = parent_mem_cgroup(memcg);
+ /* No parent means a non-hierarchical mode on v1 memcg */
+ if (!parent)
+ return MEMCG_PROT_NONE;
+
if (parent == root)
goto exit;
--
2.14.4
Explicitly propagate effective memory min/low values down by the tree.
If there is the global memory pressure, it's not really necessary.
Effective memory guarantees will be propagated automatically as we
traverse memory cgroup tree in the reclaim path.
But if there is no global memory pressure, effective memory protection
still matters for local (memcg-scoped) memory pressure. So, we have to
update effective limits in the subtree, if a user changes memory.min and
memory.low values.
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Shuah Khan <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/memcontrol.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5a3873e9d657..485df6f63d26 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5084,7 +5084,7 @@ static int memory_min_show(struct seq_file *m, void *v)
static ssize_t memory_min_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(of_css(of));
unsigned long min;
int err;
@@ -5095,6 +5095,11 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
page_counter_set_min(&memcg->memory, min);
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, memcg)
+ mem_cgroup_protected(NULL, iter);
+ rcu_read_unlock();
+
return nbytes;
}
@@ -5114,7 +5119,7 @@ static int memory_low_show(struct seq_file *m, void *v)
static ssize_t memory_low_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
- struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+ struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(of_css(of));
unsigned long low;
int err;
@@ -5125,6 +5130,11 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
page_counter_set_low(&memcg->memory, low);
+ rcu_read_lock();
+ for_each_mem_cgroup_tree(iter, memcg)
+ mem_cgroup_protected(NULL, iter);
+ rcu_read_unlock();
+
return nbytes;
}
--
2.14.4
On Mon, 11 Jun 2018 10:54:15 -0700 Roman Gushchin <[email protected]> wrote:
> Hi, Andrew!
>
> Please, find an updated version of memory.min refinements/fixes
> in this patchset. It's against linus tree.
> Please, merge these patches into 4.18.
>
> ...
>
> mm: fix null pointer dereference in mem_cgroup_protected
> mm, memcg: propagate memory effective protection on setting
> memory.min/low
> mm, memcg: don't skip memory guarantee calculations
Has nobody reviewed or acked #2 and #3?
On Mon, Jun 11, 2018 at 03:36:21PM -0700, Andrew Morton wrote:
> On Mon, 11 Jun 2018 10:54:15 -0700 Roman Gushchin <[email protected]> wrote:
>
> > Hi, Andrew!
> >
> > Please, find an updated version of memory.min refinements/fixes
> > in this patchset. It's against linus tree.
> > Please, merge these patches into 4.18.
> >
> > ...
> >
> > mm: fix null pointer dereference in mem_cgroup_protected
> > mm, memcg: propagate memory effective protection on setting
> > memory.min/low
> > mm, memcg: don't skip memory guarantee calculations
>
> Has nobody reviewed or acked #2 and #3?
>
Looks so...
You took them very fast into the mm tree last time, so probably nobody
did give it too much attention.
Johannes, can you, please, take a look?
Thanks!
On Mon, Jun 11, 2018 at 10:54:17AM -0700, Roman Gushchin wrote:
> Explicitly propagate effective memory min/low values down by the tree.
>
> If there is the global memory pressure, it's not really necessary.
> Effective memory guarantees will be propagated automatically as we
> traverse memory cgroup tree in the reclaim path.
>
> But if there is no global memory pressure, effective memory protection
> still matters for local (memcg-scoped) memory pressure. So, we have to
> update effective limits in the subtree, if a user changes memory.min and
> memory.low values.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Vladimir Davydov <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Shuah Khan <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
> mm/memcontrol.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5a3873e9d657..485df6f63d26 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5084,7 +5084,7 @@ static int memory_min_show(struct seq_file *m, void *v)
> static ssize_t memory_min_write(struct kernfs_open_file *of,
> char *buf, size_t nbytes, loff_t off)
> {
> - struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(of_css(of));
> unsigned long min;
> int err;
>
> @@ -5095,6 +5095,11 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
>
> page_counter_set_min(&memcg->memory, min);
>
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, memcg)
> + mem_cgroup_protected(NULL, iter);
> + rcu_read_unlock();
I'm not quite following. mem_cgroup_protected() is a just-in-time
query that depends on the groups' usage. How does it make sense to run
this at the time the limit is set?
Also, why is target reclaim different from global reclaim here? We
have all the information we need, even if we don't start at the
root_mem_cgroup. If we enter target reclaim against a specific cgroup,
yes, we don't know the elow it receives from its parents. What we *do*
know, though, is that it hit its own hard limit. What is happening
higher up that group doesn't matter for the purpose of protection.
I.e. it seems to me that instead of this patch we should be treating
the reclaim root and its first-level children the same way we treat
root_mem_cgroup and top-level cgroups: no protection for the root,
first children use their low setting as the elow, all descendants get
the proportional low-usage distribution.
On Tue, Jun 12, 2018 at 11:52:42AM -0400, Johannes Weiner wrote:
> On Mon, Jun 11, 2018 at 10:54:17AM -0700, Roman Gushchin wrote:
> > Explicitly propagate effective memory min/low values down by the tree.
> >
> > If there is the global memory pressure, it's not really necessary.
> > Effective memory guarantees will be propagated automatically as we
> > traverse memory cgroup tree in the reclaim path.
> >
> > But if there is no global memory pressure, effective memory protection
> > still matters for local (memcg-scoped) memory pressure. So, we have to
> > update effective limits in the subtree, if a user changes memory.min and
> > memory.low values.
> >
> > Link: http://lkml.kernel.org/r/[email protected]
> > Signed-off-by: Roman Gushchin <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Vladimir Davydov <[email protected]>
> > Cc: Greg Thelen <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> > Cc: Shuah Khan <[email protected]>
> > Signed-off-by: Andrew Morton <[email protected]>
> > ---
> > mm/memcontrol.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5a3873e9d657..485df6f63d26 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5084,7 +5084,7 @@ static int memory_min_show(struct seq_file *m, void *v)
> > static ssize_t memory_min_write(struct kernfs_open_file *of,
> > char *buf, size_t nbytes, loff_t off)
> > {
> > - struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > + struct mem_cgroup *iter, *memcg = mem_cgroup_from_css(of_css(of));
> > unsigned long min;
> > int err;
> >
> > @@ -5095,6 +5095,11 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
> >
> > page_counter_set_min(&memcg->memory, min);
> >
> > + rcu_read_lock();
> > + for_each_mem_cgroup_tree(iter, memcg)
> > + mem_cgroup_protected(NULL, iter);
> > + rcu_read_unlock();
>
> I'm not quite following. mem_cgroup_protected() is a just-in-time
> query that depends on the groups' usage. How does it make sense to run
> this at the time the limit is set?
mem_cgroup_protected() emulates memory pressure to propagate
effective memory guarantee values.
>
> Also, why is target reclaim different from global reclaim here? We
> have all the information we need, even if we don't start at the
> root_mem_cgroup. If we enter target reclaim against a specific cgroup,
> yes, we don't know the elow it receives from its parents. What we *do*
> know, though, is that it hit its own hard limit. What is happening
> higher up that group doesn't matter for the purpose of protection.
>
> I.e. it seems to me that instead of this patch we should be treating
> the reclaim root and its first-level children the same way we treat
> root_mem_cgroup and top-level cgroups: no protection for the root,
> first children use their low setting as the elow, all descendants get
> the proportional low-usage distribution.
>
Ok, we can keep it this way. We can have some races between the global
and targeted reclaim, but it's fine.
Andrew,
can you, please, drop these patches from the mm tree:
selftests: cgroup: add test for memory.low corner cases
mm, memcg: don't skip memory guarantee calculations
mm, memcg: propagate memory effective protection on setting memory.min/low
The null pointer fix ("b2c21aa3690a mm: fix null pointer dereference in mem_cgroup_protected")
should be kept and merged asap.
Thank you!