Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
from cgroup itself"), cgroup_bpf release occurs asynchronously
(from a worker context), and before the release of the cgroup itself.
This introduced a previously non-existing race between the release
and update paths. E.g. if a leaf's cgroup_bpf is released and a new
bpf program is attached to the one of ancestor cgroups at the same
time. The race may result in double-free and other memory corruptions.
To fix the problem, let's protect the body of cgroup_bpf_release()
with cgroup_mutex, as it was effectively previously, when all this
code was called from the cgroup release path with cgroup mutex held.
Also make sure, that we don't leave already freed pointers to the
effective prog arrays. Otherwise, they can be released again by
the update path. It wasn't necessary before, because previously
the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
itself were released together.
Big thanks for Tejun Heo for discovering and debugging of this
problem!
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from
cgroup itself")
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
---
kernel/bpf/cgroup.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 1b65ab0df457..3128770c0f47 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -19,6 +19,8 @@
#include <linux/bpf-cgroup.h>
#include <net/sock.h>
+#include "../cgroup/cgroup-internal.h"
+
DEFINE_STATIC_KEY_FALSE(cgroup_bpf_enabled_key);
EXPORT_SYMBOL(cgroup_bpf_enabled_key);
@@ -41,6 +43,8 @@ static void cgroup_bpf_release(struct work_struct *work)
struct bpf_prog_array *old_array;
unsigned int type;
+ mutex_lock(&cgroup_mutex);
+
for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
struct list_head *progs = &cgrp->bpf.progs[type];
struct bpf_prog_list *pl, *tmp;
@@ -57,10 +61,13 @@ static void cgroup_bpf_release(struct work_struct *work)
}
old_array = rcu_dereference_protected(
cgrp->bpf.effective[type],
- percpu_ref_is_dying(&cgrp->bpf.refcnt));
+ lockdep_is_held(&cgroup_mutex));
+ RCU_INIT_POINTER(cgrp->bpf.effective[type], NULL);
bpf_prog_array_free(old_array);
}
+ mutex_unlock(&cgroup_mutex);
+
percpu_ref_exit(&cgrp->bpf.refcnt);
cgroup_put(cgrp);
}
--
2.21.0
On 6/23/19 7:30 PM, Roman Gushchin wrote:
> Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
> from cgroup itself"), cgroup_bpf release occurs asynchronously
> (from a worker context), and before the release of the cgroup itself.
>
> This introduced a previously non-existing race between the release
> and update paths. E.g. if a leaf's cgroup_bpf is released and a new
> bpf program is attached to the one of ancestor cgroups at the same
> time. The race may result in double-free and other memory corruptions.
>
> To fix the problem, let's protect the body of cgroup_bpf_release()
> with cgroup_mutex, as it was effectively previously, when all this
> code was called from the cgroup release path with cgroup mutex held.
>
> Also make sure, that we don't leave already freed pointers to the
> effective prog arrays. Otherwise, they can be released again by
> the update path. It wasn't necessary before, because previously
> the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
> itself were released together.
I thought dying cgroup won't have any children cgroups ?
It should have been empty with no tasks inside it?
Only some resources are still held?
mutex and zero init are highly suspicious.
It feels that cgroup_bpf_release is called too early.
Thinking from another angle... if child cgroups can still attach then
this bpf_release is broken. The code should be
calling __cgroup_bpf_detach() one by one to make sure
update_effective_progs() is called, since descendant are still
sort-of alive and can attach?
My money is on 'too early'.
May be cgroup is not dying ?
Just cgroup_sk_free() is called on the last socket and
this auto-detach logic got triggered incorrectly?
On Sun, Jun 23, 2019 at 08:29:21PM -0700, Alexei Starovoitov wrote:
> On 6/23/19 7:30 PM, Roman Gushchin wrote:
> > Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
> > from cgroup itself"), cgroup_bpf release occurs asynchronously
> > (from a worker context), and before the release of the cgroup itself.
> >
> > This introduced a previously non-existing race between the release
> > and update paths. E.g. if a leaf's cgroup_bpf is released and a new
> > bpf program is attached to the one of ancestor cgroups at the same
> > time. The race may result in double-free and other memory corruptions.
> >
> > To fix the problem, let's protect the body of cgroup_bpf_release()
> > with cgroup_mutex, as it was effectively previously, when all this
> > code was called from the cgroup release path with cgroup mutex held.
> >
> > Also make sure, that we don't leave already freed pointers to the
> > effective prog arrays. Otherwise, they can be released again by
> > the update path. It wasn't necessary before, because previously
> > the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
> > itself were released together.
>
> I thought dying cgroup won't have any children cgroups ?
It's not completely true, a dying cgroup can't have living children.
> It should have been empty with no tasks inside it?
Right.
> Only some resources are still held?
Right.
> mutex and zero init are highly suspicious.
> It feels that cgroup_bpf_release is called too early.
An alternative solution is to bump the refcounter on
every update path, and explicitly skip de-bpf'ed cgroups.
>
> Thinking from another angle... if child cgroups can still attach then
> this bpf_release is broken.
Hm, what do you mean under attach? It's not possible to attach
a new prog, but if a prog is attached to a parent cgroup,
a pointer can spill through "effective" array.
But I agree, it's broken. Update path should ignore such
cgroups (cgroups, which cgroup_bpf was released). I'll take a look.
> The code should be
> calling __cgroup_bpf_detach() one by one to make sure
> update_effective_progs() is called, since descendant are still
> sort-of alive and can attach?
Not sure I get you. Dying cgroup is a leaf cgroup.
>
> My money is on 'too early'.
> May be cgroup is not dying ?
> Just cgroup_sk_free() is called on the last socket and
> this auto-detach logic got triggered incorrectly?
So, once again, what's my picture:
A
A/B
A/B/C
cpu1: cpu2:
rmdir C attach new prog to A
C got dying update A, update B, update C...
C's cgroup_bpf is released C's effective progs is replaced with new one
old is double freed
It looks like it can be reproduced without any sockets.
Thanks!
On 6/23/19 9:02 PM, Roman Gushchin wrote:
> On Sun, Jun 23, 2019 at 08:29:21PM -0700, Alexei Starovoitov wrote:
>> On 6/23/19 7:30 PM, Roman Gushchin wrote:
>>> Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
>>> from cgroup itself"), cgroup_bpf release occurs asynchronously
>>> (from a worker context), and before the release of the cgroup itself.
>>>
>>> This introduced a previously non-existing race between the release
>>> and update paths. E.g. if a leaf's cgroup_bpf is released and a new
>>> bpf program is attached to the one of ancestor cgroups at the same
>>> time. The race may result in double-free and other memory corruptions.
>>>
>>> To fix the problem, let's protect the body of cgroup_bpf_release()
>>> with cgroup_mutex, as it was effectively previously, when all this
>>> code was called from the cgroup release path with cgroup mutex held.
>>>
>>> Also make sure, that we don't leave already freed pointers to the
>>> effective prog arrays. Otherwise, they can be released again by
>>> the update path. It wasn't necessary before, because previously
>>> the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
>>> itself were released together.
>>
>> I thought dying cgroup won't have any children cgroups ?
>
> It's not completely true, a dying cgroup can't have living children.
>
>> It should have been empty with no tasks inside it?
>
> Right.
>
>> Only some resources are still held?
>
> Right.
>
>> mutex and zero init are highly suspicious.
>> It feels that cgroup_bpf_release is called too early.
>
> An alternative solution is to bump the refcounter on
> every update path, and explicitly skip de-bpf'ed cgroups.
>
>>
>> Thinking from another angle... if child cgroups can still attach then
>> this bpf_release is broken.
>
> Hm, what do you mean under attach? It's not possible to attach
> a new prog, but if a prog is attached to a parent cgroup,
> a pointer can spill through "effective" array.
>
> But I agree, it's broken. Update path should ignore such
> cgroups (cgroups, which cgroup_bpf was released). I'll take a look.
>
>> The code should be
>> calling __cgroup_bpf_detach() one by one to make sure
>> update_effective_progs() is called, since descendant are still
>> sort-of alive and can attach?
>
> Not sure I get you. Dying cgroup is a leaf cgroup.
>
>>
>> My money is on 'too early'.
>> May be cgroup is not dying ?
>> Just cgroup_sk_free() is called on the last socket and
>> this auto-detach logic got triggered incorrectly?
>
> So, once again, what's my picture:
>
> A
> A/B
> A/B/C
>
> cpu1: cpu2:
> rmdir C attach new prog to A
> C got dying update A, update B, update C...
> C's cgroup_bpf is released C's effective progs is replaced with new one
> old is double freed
>
> It looks like it can be reproduced without any sockets.
I see.
Does it mean that css_for_each_descendant walks dying cgroups ?
I guess the fix then is to avoid walking them in update_effective_progs ?
On Tue, Jun 25, 2019 at 08:50:55AM -0700, Alexei Starovoitov wrote:
> On 6/23/19 9:02 PM, Roman Gushchin wrote:
> > On Sun, Jun 23, 2019 at 08:29:21PM -0700, Alexei Starovoitov wrote:
> >> On 6/23/19 7:30 PM, Roman Gushchin wrote:
> >>> Since commit 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf
> >>> from cgroup itself"), cgroup_bpf release occurs asynchronously
> >>> (from a worker context), and before the release of the cgroup itself.
> >>>
> >>> This introduced a previously non-existing race between the release
> >>> and update paths. E.g. if a leaf's cgroup_bpf is released and a new
> >>> bpf program is attached to the one of ancestor cgroups at the same
> >>> time. The race may result in double-free and other memory corruptions.
> >>>
> >>> To fix the problem, let's protect the body of cgroup_bpf_release()
> >>> with cgroup_mutex, as it was effectively previously, when all this
> >>> code was called from the cgroup release path with cgroup mutex held.
> >>>
> >>> Also make sure, that we don't leave already freed pointers to the
> >>> effective prog arrays. Otherwise, they can be released again by
> >>> the update path. It wasn't necessary before, because previously
> >>> the update path couldn't see such a cgroup, as cgroup_bpf and cgroup
> >>> itself were released together.
> >>
> >> I thought dying cgroup won't have any children cgroups ?
> >
> > It's not completely true, a dying cgroup can't have living children.
> >
> >> It should have been empty with no tasks inside it?
> >
> > Right.
> >
> >> Only some resources are still held?
> >
> > Right.
> >
> >> mutex and zero init are highly suspicious.
> >> It feels that cgroup_bpf_release is called too early.
> >
> > An alternative solution is to bump the refcounter on
> > every update path, and explicitly skip de-bpf'ed cgroups.
> >
> >>
> >> Thinking from another angle... if child cgroups can still attach then
> >> this bpf_release is broken.
> >
> > Hm, what do you mean under attach? It's not possible to attach
> > a new prog, but if a prog is attached to a parent cgroup,
> > a pointer can spill through "effective" array.
> >
> > But I agree, it's broken. Update path should ignore such
> > cgroups (cgroups, which cgroup_bpf was released). I'll take a look.
> >
> >> The code should be
> >> calling __cgroup_bpf_detach() one by one to make sure
> >> update_effective_progs() is called, since descendant are still
> >> sort-of alive and can attach?
> >
> > Not sure I get you. Dying cgroup is a leaf cgroup.
> >
> >>
> >> My money is on 'too early'.
> >> May be cgroup is not dying ?
> >> Just cgroup_sk_free() is called on the last socket and
> >> this auto-detach logic got triggered incorrectly?
> >
> > So, once again, what's my picture:
> >
> > A
> > A/B
> > A/B/C
> >
> > cpu1: cpu2:
> > rmdir C attach new prog to A
> > C got dying update A, update B, update C...
> > C's cgroup_bpf is released C's effective progs is replaced with new one
> > old is double freed
> >
> > It looks like it can be reproduced without any sockets.
>
> I see.
> Does it mean that css_for_each_descendant walks dying cgroups ?
Yes.
> I guess the fix then is to avoid walking them in update_effective_progs ?
>
Yes, this is close to what I'm testing now. We basically need to skip cgroups,
which bpf refcounter is 0 (and in atomic mode). These cgroups can't invoke bpf
programs, so there is no point in updates.