The reported issue is an attempted write in a cgroup file, by a zombie
who has/had acct(2) enabled. Such a write needs cgroup_ns for access
checking. Ordinary acct_process() would not be affected by this as it is
called well before exit_task_namespaces(). However, the reported NULL
dereference is a different acct data writer:
Call Trace:
<TASK>
kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
__kernel_write+0x5d1/0xaf0 fs/read_write.c:535
do_acct_process+0x112a/0x17b0 kernel/acct.c:518
acct_pin_kill+0x27/0x130 kernel/acct.c:173
pin_kill+0x2a6/0x940 fs/fs_pin.c:44
mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81
cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130
task_work_run+0x146/0x1c0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0x705/0x24f0 kernel/exit.c:832
do_group_exit+0x168/0x2d0 kernel/exit.c:929
get_signal+0x16b0/0x2090 kernel/signal.c:2820
arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
handle_signal_work kernel/entry/common.c:148 [inline]
exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
__syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
entry_SYSCALL_64_after_hwframe+0x44/0xae
i.e. called as one of task_work_run() entries.
The historical commit 8aac62706ada ("move exit_task_namespaces() outside
of exit_notify()") argues that exit_task_namespaces() must come before
exit_task_work() because ipc_ns cleanup calls fput/task_work_add.
There is accompanying commit e7b2c4069252 ("fput: task_work_add() can
fail if the caller has passed exit_task_work()") in the original series
that makes fput() robust in situations when task_work_add() cannot be
used anymore.
So in order to ensure that task_work_run() entries of the exiting task
have the nsproxy still available, swap the order of
exit_task_namespaces() and exit_task_work().
This change may appear like a partial revert of 8aac62706ada but this
particular ordering change shouldn't matter with the fix from
e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify
simpler) still holds.
Reported-by: [email protected]
Link: https://lore.kernel.org/r/[email protected]
Cc: Oleg Nesterov <[email protected]>
Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/exit.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
I wasn't able to reproduce the syzbot's crash manually so the effectiveness of
the fix is only based on the reasoning.
diff --git a/kernel/exit.c b/kernel/exit.c
index f702a6a63686..0c2abdebb87c 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -828,8 +828,8 @@ void __noreturn do_exit(long code)
exit_fs(tsk);
if (group_dead)
disassociate_ctty(1);
- exit_task_namespaces(tsk);
exit_task_work(tsk);
+ exit_task_namespaces(tsk);
exit_thread(tsk);
/*
--
2.33.1
Adding the security list and a couple of other people because I feel
like I just opened pandora's box, and I don't know how bad this issue
is.
TL;DR the cgroup file system is checking permissions at write time.
Michal Koutný <[email protected]> writes:
> The reported issue is an attempted write in a cgroup file, by a zombie
> who has/had acct(2) enabled. Such a write needs cgroup_ns for access
> checking. Ordinary acct_process() would not be affected by this as it is
> called well before exit_task_namespaces(). However, the reported NULL
> dereference is a different acct data writer:
>
> Call Trace:
> <TASK>
> kernfs_fop_write_iter+0x3b6/0x510 fs/kernfs/file.c:296
> __kernel_write+0x5d1/0xaf0 fs/read_write.c:535
> do_acct_process+0x112a/0x17b0 kernel/acct.c:518
> acct_pin_kill+0x27/0x130 kernel/acct.c:173
> pin_kill+0x2a6/0x940 fs/fs_pin.c:44
> mnt_pin_kill+0xc1/0x170 fs/fs_pin.c:81
> cleanup_mnt+0x4bc/0x510 fs/namespace.c:1130
> task_work_run+0x146/0x1c0 kernel/task_work.c:164
> exit_task_work include/linux/task_work.h:32 [inline]
> do_exit+0x705/0x24f0 kernel/exit.c:832
> do_group_exit+0x168/0x2d0 kernel/exit.c:929
> get_signal+0x16b0/0x2090 kernel/signal.c:2820
> arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
> handle_signal_work kernel/entry/common.c:148 [inline]
> exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
> exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
> __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
> syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
> do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
> entry_SYSCALL_64_after_hwframe+0x44/0xae
>
> i.e. called as one of task_work_run() entries.
>
> The historical commit 8aac62706ada ("move exit_task_namespaces() outside
> of exit_notify()") argues that exit_task_namespaces() must come before
> exit_task_work() because ipc_ns cleanup calls fput/task_work_add.
>
> There is accompanying commit e7b2c4069252 ("fput: task_work_add() can
> fail if the caller has passed exit_task_work()") in the original series
> that makes fput() robust in situations when task_work_add() cannot be
> used anymore.
>
> So in order to ensure that task_work_run() entries of the exiting task
> have the nsproxy still available, swap the order of
> exit_task_namespaces() and exit_task_work().
>
> This change may appear like a partial revert of 8aac62706ada but this
> particular ordering change shouldn't matter with the fix from
> e7b2c4069252 and the other reason for 8aac62706ada (keeping exit_notify
> simpler) still holds.
I think I follow your reasoning and I think it will even fix the issue
but no. That is completely very much not the correct fix.
That permission check in cgroup_file_write is just plain wrong.
Permission checks on files need to happen at open time, not at
write time. It is all too easy to confuse something that writes
to stdout to write to a file of your choosing to make permission
checking at write time a good idea.
It looks like the this issue was introduced in commit 5136f6365ce3
("cgroup: implement "nsdelegate" mount option").
I wish I knew where I was when that patch was posted for review so
I could suggest a better implementation.
That said I can't quite tell if the test should be moved into
cgroup_file_open or if there is a permission entry that would work.
Oh bleep!
The likely named cgroup_procs_write_permission isn't a permission method
at all. It is called from cgroup_procs_write which I believe is
called from cgroup_file_write.
I may be wrong but at first glance this looks like the cgroup code is
going to need significant surgery to get the permission checks happening
at open time where they belong.
Please don't apply this patch.
exit_task_work running after exit_task_namespaces is the messenger
that just told us about something ugly.
Eric
> Reported-by: [email protected]
> Link: https://lore.kernel.org/r/[email protected]
> Cc: Oleg Nesterov <[email protected]>
> Fixes: 5136f6365ce3 ("cgroup: implement "nsdelegate" mount option")
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/exit.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> I wasn't able to reproduce the syzbot's crash manually so the effectiveness of
> the fix is only based on the reasoning.
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f702a6a63686..0c2abdebb87c 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -828,8 +828,8 @@ void __noreturn do_exit(long code)
> exit_fs(tsk);
> if (group_dead)
> disassociate_ctty(1);
> - exit_task_namespaces(tsk);
> exit_task_work(tsk);
> + exit_task_namespaces(tsk);
> exit_thread(tsk);
>
> /*
Hello, Eric.
On Wed, Dec 08, 2021 at 12:45:54PM -0600, Eric W. Biederman wrote:
> Permission checks on files need to happen at open time, not at
> write time. It is all too easy to confuse something that writes
> to stdout to write to a file of your choosing to make permission
> checking at write time a good idea.
Whether a given operation is allowed or not is dependent on the content of
the write. I don't see how it can be moved to open time. I'm not a fan of
the fs based interface but they're not real files and it has been like this
from the beginning.
> That said I can't quite tell if the test should be moved into
> cgroup_file_open or if there is a permission entry that would work.
It can't.
> I may be wrong but at first glance this looks like the cgroup code is
> going to need significant surgery to get the permission checks happening
> at open time where they belong.
There's no way to change that without changing the interface drastically.
> Please don't apply this patch.
>
> exit_task_work running after exit_task_namespaces is the messenger
> that just told us about something ugly.
I don't see good alternatives here. You got any?
Thanks.
--
tejun
On Wed, Dec 8, 2021 at 11:06 AM Tejun Heo <[email protected]> wrote:
>
> > That said I can't quite tell if the test should be moved into
> > cgroup_file_open or if there is a permission entry that would work.
>
> It can't.
Two options:
(a) anybody doing "current process" permission checks at IO time
should then also check that the credentials are still the same as when
the file was opened.
(b) alternatively, go ahead and do the permission check at IO time,
but do it using "file->f_cred" (ie the open-time permission), not the
current process ones.
In the above, (a) and (b) are basically the same: it uses f_cred for
permission checking. The only difference is that in (a) you may be
using some function that _technically_ uses the implicit "current
credentials" (there are many of them), and you just separately make
sure that those current credentials are identical to what they were at
open time.
Obviously (b) is hugely preferred, but sometimes the code organization
(ie "file or f_cred just isn't passed down deep enough") means that
(a) might be the only realistic option.
IOW, it's not *wrong* to do permission checking at IO time, but it
absolutely needs to be done using the open-time credentials.
Linus
Hello,
On Wed, Dec 08, 2021 at 11:39:43AM -0800, Linus Torvalds wrote:
> (b) alternatively, go ahead and do the permission check at IO time,
> but do it using "file->f_cred" (ie the open-time permission), not the
> current process ones.
>
> In the above, (a) and (b) are basically the same: it uses f_cred for
> permission checking. The only difference is that in (a) you may be
> using some function that _technically_ uses the implicit "current
> credentials" (there are many of them), and you just separately make
> sure that those current credentials are identical to what they were at
> open time.
>
> Obviously (b) is hugely preferred, but sometimes the code organization
> (ie "file or f_cred just isn't passed down deep enough") means that
> (a) might be the only realistic option.
>
> IOW, it's not *wrong* to do permission checking at IO time, but it
> absolutely needs to be done using the open-time credentials.
Yeah, (b) sounds good to me. Will look into it.
Thanks.
--
tejun
Hello,
On Wed, Dec 08, 2021 at 09:49:17AM -1000, Tejun Heo wrote:
> > (b) alternatively, go ahead and do the permission check at IO time,
> > but do it using "file->f_cred" (ie the open-time permission), not the
> > current process ones.
So, I have sth like the following. It builds and euid-open test case behaves
as expected (can't evade delegation restrictions if the fd was opened with
lesser euid). The namespace part is a bit more involved as f_cred doesn't
capture them on open. I made cgroup file open path capture it and use that
for all permission checks. Please let me know if anything looks weird.
Otherwise, I'm gonna add selftests and prep the patchset.
Thanks.
Index: work/kernel/cgroup/cgroup-v1.c
===================================================================
--- work.orig/kernel/cgroup/cgroup-v1.c
+++ work/kernel/cgroup/cgroup-v1.c
@@ -504,10 +504,11 @@ static ssize_t __cgroup1_procs_write(str
goto out_unlock;
/*
- * Even if we're attaching all tasks in the thread group, we only
- * need to check permissions on one of them.
+ * Even if we're attaching all tasks in the thread group, we only need
+ * to check permissions on one of them. Check permissions using the
+ * credentials from file open to protect against inherited fd attacks.
*/
- cred = current_cred();
+ cred = of->file->f_cred;
tcred = get_task_cred(task);
if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
!uid_eq(cred->euid, tcred->uid) &&
Index: work/kernel/cgroup/cgroup.c
===================================================================
--- work.orig/kernel/cgroup/cgroup.c
+++ work/kernel/cgroup/cgroup.c
@@ -3630,6 +3630,7 @@ static int cgroup_cpu_pressure_show(stru
static ssize_t cgroup_pressure_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, enum psi_res res)
{
+ struct cgroup_file_ctx *ctx = of->priv;
struct psi_trigger *new;
struct cgroup *cgrp;
struct psi_group *psi;
@@ -3648,7 +3649,7 @@ static ssize_t cgroup_pressure_write(str
return PTR_ERR(new);
}
- psi_trigger_replace(&of->priv, new);
+ psi_trigger_replace(&ctx->psi.trigger, new);
cgroup_put(cgrp);
@@ -3679,12 +3680,16 @@ static ssize_t cgroup_cpu_pressure_write
static __poll_t cgroup_pressure_poll(struct kernfs_open_file *of,
poll_table *pt)
{
- return psi_trigger_poll(&of->priv, of->file, pt);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ return psi_trigger_poll(&ctx->psi.trigger, of->file, pt);
}
static void cgroup_pressure_release(struct kernfs_open_file *of)
{
- psi_trigger_replace(&of->priv, NULL);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ psi_trigger_replace(&ctx->psi.trigger, NULL);
}
bool cgroup_psi_enabled(void)
@@ -3811,24 +3816,42 @@ static ssize_t cgroup_kill_write(struct
static int cgroup_file_open(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx;
+ int ret;
- if (cft->open)
- return cft->open(of);
- return 0;
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->ns = current->nsproxy->cgroup_ns;
+ get_cgroup_ns(ctx->ns);
+
+ of->priv = ctx;
+
+ if (!cft->open)
+ return 0;
+
+ ret = cft->open(of);
+ if (ret)
+ kfree(ctx);
+ return ret;
}
static void cgroup_file_release(struct kernfs_open_file *of)
{
struct cftype *cft = of_cft(of);
+ struct cgroup_file_ctx *ctx = of->priv;
if (cft->release)
cft->release(of);
+ put_cgroup_ns(ctx->ns);
+ kfree(ctx);
}
static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
size_t nbytes, loff_t off)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
+ struct cgroup_file_ctx *ctx = of->priv;
struct cgroup *cgrp = of->kn->parent->priv;
struct cftype *cft = of_cft(of);
struct cgroup_subsys_state *css;
@@ -3845,7 +3868,7 @@ static ssize_t cgroup_file_write(struct
*/
if ((cgrp->root->flags & CGRP_ROOT_NS_DELEGATE) &&
!(cft->flags & CFTYPE_NS_DELEGATABLE) &&
- ns != &init_cgroup_ns && ns->root_cset->dfl_cgrp == cgrp)
+ ctx->ns != &init_cgroup_ns && ctx->ns->root_cset->dfl_cgrp == cgrp)
return -EPERM;
if (cft->write)
@@ -4751,21 +4774,23 @@ void css_task_iter_end(struct css_task_i
static void cgroup_procs_release(struct kernfs_open_file *of)
{
- if (of->priv) {
- css_task_iter_end(of->priv);
- kfree(of->priv);
+ struct cgroup_file_ctx *ctx = of->priv;
+
+ if (ctx->procs.it) {
+ css_task_iter_end(ctx->procs.it);
+ kfree(ctx->procs.it);
}
}
static void *cgroup_procs_next(struct seq_file *s, void *v, loff_t *pos)
{
struct kernfs_open_file *of = s->private;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
if (pos)
(*pos)++;
- return css_task_iter_next(it);
+ return css_task_iter_next(ctx->procs.it);
}
static void *__cgroup_procs_start(struct seq_file *s, loff_t *pos,
@@ -4773,7 +4798,8 @@ static void *__cgroup_procs_start(struct
{
struct kernfs_open_file *of = s->private;
struct cgroup *cgrp = seq_css(s)->cgroup;
- struct css_task_iter *it = of->priv;
+ struct cgroup_file_ctx *ctx = of->priv;
+ struct css_task_iter *it = ctx->procs.it;
/*
* When a seq_file is seeked, it's always traversed sequentially
@@ -4786,7 +4812,7 @@ static void *__cgroup_procs_start(struct
it = kzalloc(sizeof(*it), GFP_KERNEL);
if (!it)
return ERR_PTR(-ENOMEM);
- of->priv = it;
+ ctx->procs.it = it;
css_task_iter_start(&cgrp->self, iter_flags, it);
} else if (!(*pos)) {
css_task_iter_end(it);
@@ -4838,9 +4864,9 @@ static int cgroup_may_write(const struct
static int cgroup_procs_write_permission(struct cgroup *src_cgrp,
struct cgroup *dst_cgrp,
- struct super_block *sb)
+ struct super_block *sb,
+ struct cgroup_namespace *ns)
{
- struct cgroup_namespace *ns = current->nsproxy->cgroup_ns;
struct cgroup *com_cgrp = src_cgrp;
int ret;
@@ -4869,11 +4895,12 @@ static int cgroup_procs_write_permission
static int cgroup_attach_permissions(struct cgroup *src_cgrp,
struct cgroup *dst_cgrp,
- struct super_block *sb, bool threadgroup)
+ struct super_block *sb, bool threadgroup,
+ struct cgroup_namespace *ns)
{
int ret = 0;
- ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb);
+ ret = cgroup_procs_write_permission(src_cgrp, dst_cgrp, sb, ns);
if (ret)
return ret;
@@ -4890,8 +4917,10 @@ static int cgroup_attach_permissions(str
static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
bool threadgroup)
{
+ struct cgroup_file_ctx *ctx = of->priv;
struct cgroup *src_cgrp, *dst_cgrp;
struct task_struct *task;
+ const struct cred *saved_cred;
ssize_t ret;
bool locked;
@@ -4909,9 +4938,16 @@ static ssize_t __cgroup_procs_write(stru
src_cgrp = task_cgroup_from_root(task, &cgrp_dfl_root);
spin_unlock_irq(&css_set_lock);
- /* process and thread migrations follow same delegation rule */
+ /*
+ * Process and thread migrations follow same delegation rule. Check
+ * permissions using the credentials from file open to protect against
+ * inherited fd attacks.
+ */
+ saved_cred = override_creds(of->file->f_cred);
ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
- of->file->f_path.dentry->d_sb, threadgroup);
+ of->file->f_path.dentry->d_sb,
+ threadgroup, ctx->ns);
+ revert_creds(saved_cred);
if (ret)
goto out_finish;
@@ -6130,7 +6166,8 @@ static int cgroup_css_set_fork(struct ke
goto err;
ret = cgroup_attach_permissions(cset->dfl_cgrp, dst_cgrp, sb,
- !(kargs->flags & CLONE_THREAD));
+ !(kargs->flags & CLONE_THREAD),
+ current->nsproxy->cgroup_ns);
if (ret)
goto err;
Index: work/kernel/cgroup/cgroup-internal.h
===================================================================
--- work.orig/kernel/cgroup/cgroup-internal.h
+++ work/kernel/cgroup/cgroup-internal.h
@@ -65,6 +65,20 @@ static inline struct cgroup_fs_context *
return container_of(kfc, struct cgroup_fs_context, kfc);
}
+struct cgroup_file_ctx {
+ struct cgroup_namespace *ns;
+
+ union {
+ struct {
+ struct css_task_iter *it;
+ } procs;
+
+ struct {
+ void *trigger;
+ } psi;
+ };
+};
+
/*
* A cgroup can be associated with multiple css_sets as different tasks may
* belong to different cgroups on different hierarchies. In the other
On Wed, Dec 08, 2021 at 01:07:54PM -1000, Tejun Heo <[email protected]> wrote:
> + saved_cred = override_creds(of->file->f_cred);
> ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
> - of->file->f_path.dentry->d_sb, threadgroup);
> + of->file->f_path.dentry->d_sb,
> + threadgroup, ctx->ns);
> + revert_creds(saved_cred);
I wonder now whether such a wrap shouldn't also be around cgroup_kill()
too (+ replacement of send_sig() with group_send_sig_info() [1])?
This shouldn't break the use case of passing cgroup kill fd to a less
privileged task for (auto)destruction purposes but on the other hand it
would prevent subverting the fd to a more privileged confused task to
kill otherwise disallowed processes.
Thanks,
Michal
[1] https://lore.kernel.org/r/[email protected]/
On Thu, Dec 09, 2021 at 02:44:19PM +0100, Michal Koutný wrote:
> On Wed, Dec 08, 2021 at 01:07:54PM -1000, Tejun Heo <[email protected]> wrote:
>
> > + saved_cred = override_creds(of->file->f_cred);
> > ret = cgroup_attach_permissions(src_cgrp, dst_cgrp,
> > - of->file->f_path.dentry->d_sb, threadgroup);
> > + of->file->f_path.dentry->d_sb,
> > + threadgroup, ctx->ns);
> > + revert_creds(saved_cred);
>
> I wonder now whether such a wrap shouldn't also be around cgroup_kill()
> too (+ replacement of send_sig() with group_send_sig_info() [1])?
send_sig() isn't used that was changed in response to a review. I'm
confused.
>
> This shouldn't break the use case of passing cgroup kill fd to a less
> privileged task for (auto)destruction purposes but on the other hand it
> would prevent subverting the fd to a more privileged confused task to
> kill otherwise disallowed processes.
Kill and freeze only do time permission checking at open. Why would you
introduce another write time check?
On Thu, Dec 09, 2021 at 03:08:26PM +0100, Christian Brauner <[email protected]> wrote:
> send_sig() isn't used that was changed in response to a review. I'm
> confused.
Sorry for ambiguity, I meant this instance [1].
> Kill and freeze only do time permission checking at open. Why would you
> introduce another write time check?
Let's have a cgroup G with tasks t1,...,tn (run by user u) and some
monitoring tasks m1,...,mk belonging to a different user v != u.
Currently u can kill also the tasks of v -- I'm not sure if that's
intentional. My argument would apply if it wasn't -- it'd be suscebtible
to similar abuse, i.e. passing the opened fd to a more privileged
process to kill also v's tasks. (But if the intention is to be able to
kill anyone in the cgroup, then it likely doesn't matter.)
Michal
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v5.16-rc4#n3762
On Thu, Dec 09, 2021 at 03:47:00PM +0100, Michal Koutný wrote:
> On Thu, Dec 09, 2021 at 03:08:26PM +0100, Christian Brauner <[email protected]> wrote:
> > send_sig() isn't used that was changed in response to a review. I'm
> > confused.
>
> Sorry for ambiguity, I meant this instance [1].
Sure, seems good.
>
> > Kill and freeze only do time permission checking at open. Why would you
> > introduce another write time check?
>
> Let's have a cgroup G with tasks t1,...,tn (run by user u) and some
> monitoring tasks m1,...,mk belonging to a different user v != u.
>
> Currently u can kill also the tasks of v -- I'm not sure if that's
> intentional. My argument would apply if it wasn't -- it'd be suscebtible
That was discussed and is intentional and is supposed to mirror the
behavior of cgroup.freeze. Delegated killing was supposed to work and
was one use-case.
On Thu, Dec 09, 2021 at 04:06:55PM +0100, Christian Brauner <[email protected]> wrote:
> That was discussed and is intentional and is supposed to mirror the
> behavior of cgroup.freeze. Delegated killing was supposed to work and
> was one use-case.
Thanks for the clarification.
The cgroup_kill() doesn't need the override_creds() treating then
(clearing my previous wondering).
Michal
On Wed, Dec 08, 2021 at 12:45:54PM -0600, "Eric W. Biederman" <[email protected]> wrote:
> TL;DR the cgroup file system is checking permissions at write time.
Thank you for bringing that up (handled in a separate thread now).
> I think I follow your reasoning and I think it will even fix the issue
> but no.
FTR, part of Tejun's series [1] ensures that cgroup_ns is accessed
directly without nsproxy and a reference to it is kept while the file
is opened. I.e. that'd properly fix this particular crash reported by
syzbot.
> Please don't apply this patch.
>
> exit_task_work running after exit_task_namespaces is the messenger
> that just told us about something ugly.
In (my) theory some other task_work callbacks could (transitively) rely
on the current->nsproxy which could still be cleared by
exit_task_namespaces().
Is there another reason why to have exit_task_namespaces() before
exit_task_work()?
Thanks,
Michal
[1] https://lore.kernel.org/r/[email protected]/
Michal Koutný <[email protected]> writes:
> On Wed, Dec 08, 2021 at 12:45:54PM -0600, "Eric W. Biederman" <[email protected]> wrote:
>> TL;DR the cgroup file system is checking permissions at write time.
>
> Thank you for bringing that up (handled in a separate thread now).
>
>> I think I follow your reasoning and I think it will even fix the issue
>> but no.
>
> FTR, part of Tejun's series [1] ensures that cgroup_ns is accessed
> directly without nsproxy and a reference to it is kept while the file
> is opened. I.e. that'd properly fix this particular crash reported by
> syzbot.
>
>> Please don't apply this patch.
>>
>> exit_task_work running after exit_task_namespaces is the messenger
>> that just told us about something ugly.
>
> In (my) theory some other task_work callbacks could (transitively) rely
> on the current->nsproxy which could still be cleared by
> exit_task_namespaces().
> Is there another reason why to have exit_task_namespaces() before
> exit_task_work()?
We already have the principle that things are going to be cleaned up
before exit_task_work is called and exit_files depends upon that.
So I think the burden is to find a good reason why exit_task_work should
move not to defend it.
If we don't want things cleaned up before exit_task_work it should come
at the start of do_exit and exit_files and others need to stop depending
upon it. Which seems like challenging change to make.
Eric