LinuxLists.cc - proc_flush

2017-12-18 21:44:45

Subject: proc_flush_task oops

I've hit this twice today. It's odd, because afaics, none of this code
has really changed in a long time.

Dave

Oops: 0000 [#1] SMP
CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
FS: 00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
? release_task+0xaf/0x680
release_task+0xd2/0x680
? wait_consider_task+0xb82/0xce0
wait_consider_task+0xbe9/0xce0
? do_wait+0xe1/0x330
do_wait+0x151/0x330
kernel_wait4+0x8d/0x150
? task_stopped_code+0x50/0x50
SYSC_wait4+0x95/0xa0
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
? do_syscall_64+0x60/0x210
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f7cb82603aa
RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 <49> 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
CR2: 0000000000000000
---[ end trace 53d67a6481059862 ]---

All code
========
0: c1 e2 04 shl $0x4,%edx
3: 44 8b 60 30 mov 0x30(%rax),%r12d
7: 48 8b 40 38 mov 0x38(%rax),%rax
b: 44 8b 34 11 mov (%rcx,%rdx,1),%r14d
f: 48 c7 c2 60 3a f5 81 mov $0xffffffff81f53a60,%rdx
16: 44 89 e1 mov %r12d,%ecx
19: 4c 8b 68 58 mov 0x58(%rax),%r13
1d: e8 4b b4 77 00 callq 0x77b46d
22: 89 44 24 14 mov %eax,0x14(%rsp)
26: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
2b:* 49 8b 7d 00 mov 0x0(%r13),%rdi <-- trapping instruction
2f: e8 b9 6a f9 ff callq 0xfffffffffff96aed
34: 48 85 c0 test %rax,%rax
37: 74 1a je 0x53
39: 48 89 c7 mov %rax,%rdi
3c: 48 rex.W
3d: 89 .byte 0x89
3e: 44 rex.R
3f: 24 .byte 0x24

Code starting with the faulting instruction
===========================================
0: 49 8b 7d 00 mov 0x0(%r13),%rdi
4: e8 b9 6a f9 ff callq 0xfffffffffff96ac2
9: 48 85 c0 test %rax,%rax
c: 74 1a je 0x28
e: 48 89 c7 mov %rax,%rdi
11: 48 rex.W
12: 89 .byte 0x89
13: 44 rex.R
14: 24 .byte 0x24

This looks like an inlined part of proc_flush_task_mnt

dentry = d_hash_and_lookup(mnt->mnt_root, &name);
4f99: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
4f9e: 49 8b 7d 00 mov 0x0(%r13),%rdi
4fa2: e8 00 00 00 00 callq 4fa7 <proc_flush_task+0x97>

So it looks like this..

3097 for (i = 0; i <= pid->level; i++) {
3098 upid = &pid->numbers[i];
3099 proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
3100 tgid->numbers[i].nr);
3101 }

somehow passed a null upid->ns->proc_mnt down there.

I'll try and narrow down a reproducer tomorrow.

Any obvious recent changes that might explain this, or did I just
finally appease the entropy gods enough to find the right combination of
args to hit this ?

Dave

2017-12-18 22:15:46

by Al Viro

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Mon, Dec 18, 2017 at 04:44:38PM -0500, Dave Jones wrote:
> I've hit this twice today. It's odd, because afaics, none of this code
> has really changed in a long time.

Which tree had that been?

2017-12-18 23:10:17

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Mon, Dec 18, 2017 at 10:15:41PM +0000, Al Viro wrote:
> On Mon, Dec 18, 2017 at 04:44:38PM -0500, Dave Jones wrote:
> > I've hit this twice today. It's odd, because afaics, none of this code
> > has really changed in a long time.
>
> Which tree had that been?

Linus, rc4.

Dave

2017-12-18 23:50:55

by Linus Torvalds

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Mon, Dec 18, 2017 at 3:10 PM, Dave Jones <[email protected]> wrote:
> On Mon, Dec 18, 2017 at 10:15:41PM +0000, Al Viro wrote:
> > On Mon, Dec 18, 2017 at 04:44:38PM -0500, Dave Jones wrote:
> > > I've hit this twice today. It's odd, because afaics, none of this code
> > > has really changed in a long time.
> >
> > Which tree had that been?
>
> Linus, rc4.

Ok, so the original report was marked as spam for me for whatever
reason. I ended up re-analyzing the oops, but came to the same
conclusion you did: it's a NULL mnt pointer in proc_flush_task_mnt().

The code disassembles to

0: c1 e2 04 shl $0x4,%edx
3: 44 8b 60 30 mov 0x30(%rax),%r12d
7: 48 8b 40 38 mov 0x38(%rax),%rax
b: 44 8b 34 11 mov (%rcx,%rdx,1),%r14d
f: 48 c7 c2 60 3a f5 81 mov $0xffffffff81f53a60,%rdx
16: 44 89 e1 mov %r12d,%ecx
19: 4c 8b 68 58 mov 0x58(%rax),%r13
1d: e8 4b b4 77 00 callq 0x77b46d
22: 89 44 24 14 mov %eax,0x14(%rsp)
26: 48 8d 74 24 10 lea 0x10(%rsp),%rsi
2b:* 49 8b 7d 00 mov 0x0(%r13),%rdi <-- trapping instruction
2f: e8 b9 6a f9 ff callq 0xfffffffffff96aed
34: 48 85 c0 test %rax,%rax
37: 74 1a je 0x53
39: 48 89 c7 mov %rax,%rdi

and just matching that up against the code I see generated, that first
call is the call to snprintf, and the second call is to
d_hash_and_lookup.

So it's one of these two patterns (pid vs tgid):

name.len = snprintf(buf, sizeof(buf), "%d", pid);
/* no ->d_hash() rejects on procfs */
dentry = d_hash_and_lookup(mnt->mnt_root, &name);

and that "mov 0x0(%r13),%rdi" that traps is "mnt->mnt_root".

But I don't see what would have changed in this area recently.

Do you end up saving the seeds that cause crashes? Is this
reproducible? (Other than seeing it twoce, of course)

Linus

2017-12-19 01:22:17

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
> On Mon, Dec 18, 2017 at 3:10 PM, Dave Jones <[email protected]> wrote:
> > On Mon, Dec 18, 2017 at 10:15:41PM +0000, Al Viro wrote:
> > > On Mon, Dec 18, 2017 at 04:44:38PM -0500, Dave Jones wrote:
> > > > I've hit this twice today. It's odd, because afaics, none of this code
> > > > has really changed in a long time.
> > >
> > > Which tree had that been?
> >
> > Linus, rc4.
>
> Ok, so the original report was marked as spam for me for whatever
> reason. I ended up re-analyzing the oops, but came to the same
> conclusion you did: it's a NULL mnt pointer in proc_flush_task_mnt().
> ..
> But I don't see what would have changed in this area recently.
>
> Do you end up saving the seeds that cause crashes? Is this
> reproducible? (Other than seeing it twoce, of course)

Hit it another two times in the last hour, so it's pretty reproducable.
Running it now with some more logging, will see if that yields any extra clues.

Dave

2017-12-19 03:39:34

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:

> But I don't see what would have changed in this area recently.
>
> Do you end up saving the seeds that cause crashes? Is this
> reproducible? (Other than seeing it twoce, of course)

Only clue so far, is every time I'm able to trigger it, the last thing
the child process that triggers it did, was an execveat.

Telling it to just fuzz execveat doesn't instantly trigger it, so it
must be a combination of some other syscall. I'll leave a script running
overnight to see if I can binary search the other syscalls in
combination with it.

One other thing: I said this was rc4, but it was actually rc4 + all the
x86 stuff from today. There's enough creepy stuff in that pile, that
I'll try with just plain rc4 tomorrow too.

Dave

2017-12-19 10:49:59

by Tetsuo Handa

[permalink] [raw]

Subject: Re: proc_flush_task oops

On 2017/12/19 12:39, Dave Jones wrote:
> On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>
> > But I don't see what would have changed in this area recently.
> >
> > Do you end up saving the seeds that cause crashes? Is this
> > reproducible? (Other than seeing it twoce, of course)
>
> Only clue so far, is every time I'm able to trigger it, the last thing
> the child process that triggers it did, was an execveat.
>
> Telling it to just fuzz execveat doesn't instantly trigger it, so it
> must be a combination of some other syscall. I'll leave a script running
> overnight to see if I can binary search the other syscalls in
> combination with it.
>
> One other thing: I said this was rc4, but it was actually rc4 + all the
> x86 stuff from today. There's enough creepy stuff in that pile, that
> I'll try with just plain rc4 tomorrow too.
>
> Dave
>

When hitting an oops at finalization function using fuzzing tool, checking for
corresponding initialization function and previous error messages might help.

It seems to me that there are error paths which allow nsproxy to be replaced
without assigning ->pid_ns_for_children->proc_mnt.

----------
static __latent_entropy struct task_struct *copy_process(
unsigned long clone_flags,
unsigned long stack_start,
unsigned long stack_size,
int __user *child_tidptr,
struct pid *pid,
int trace,
unsigned long tls,
int node)
{
(...snipped...)
retval = copy_namespaces(clone_flags, p); // Allocates p->nsproxy->pid_ns_for_children
if (retval)
goto bad_fork_cleanup_mm;
retval = copy_io(clone_flags, p);
if (retval)
goto bad_fork_cleanup_namespaces; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
if (retval)
goto bad_fork_cleanup_io; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.

if (pid != &init_struct_pid) {
pid = alloc_pid(p->nsproxy->pid_ns_for_children); // Initializes p->nsproxy->pid_ns_for_children->proc_mnt upon success.
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
}
}
(...snipped...)
}

int copy_namespaces(unsigned long flags, struct task_struct *tsk)
{
struct nsproxy *old_ns = tsk->nsproxy;
struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
struct nsproxy *new_ns;

if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWCGROUP)))) {
get_nsproxy(old_ns);
return 0;
}

if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;

/*
* CLONE_NEWIPC must detach from the undolist: after switching
* to a new ipc namespace, the semaphore arrays from the old
* namespace are unreachable. In clone parlance, CLONE_SYSVSEM
* means share undolist with parent, so we must forbid using
* it along with CLONE_NEWIPC.
*/
if ((flags & (CLONE_NEWIPC | CLONE_SYSVSEM)) ==
(CLONE_NEWIPC | CLONE_SYSVSEM))
return -EINVAL;

new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs); // Allocates new nsproxy.
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);

tsk->nsproxy = new_ns; // p->nsproxy is updated with ->pid_ns_for_children->proc_mnt == NULL.
return 0;
}

static struct nsproxy *create_new_namespaces(unsigned long flags,
struct task_struct *tsk, struct user_namespace *user_ns,
struct fs_struct *new_fs)
{
struct nsproxy *new_nsp;
int err;

new_nsp = create_nsproxy();
if (!new_nsp)
return ERR_PTR(-ENOMEM);

new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
if (IS_ERR(new_nsp->mnt_ns)) {
err = PTR_ERR(new_nsp->mnt_ns);
goto out_ns;
}

new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
if (IS_ERR(new_nsp->uts_ns)) {
err = PTR_ERR(new_nsp->uts_ns);
goto out_uts;
}

new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
if (IS_ERR(new_nsp->ipc_ns)) {
err = PTR_ERR(new_nsp->ipc_ns);
goto out_ipc;
}

new_nsp->pid_ns_for_children =
copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children); // Returns with ->proc_mnt == NULL.
if (IS_ERR(new_nsp->pid_ns_for_children)) {
err = PTR_ERR(new_nsp->pid_ns_for_children);
goto out_pid;
}

new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
tsk->nsproxy->cgroup_ns);
if (IS_ERR(new_nsp->cgroup_ns)) {
err = PTR_ERR(new_nsp->cgroup_ns);
goto out_cgroup;
}

new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
if (IS_ERR(new_nsp->net_ns)) {
err = PTR_ERR(new_nsp->net_ns);
goto out_net;
}

return new_nsp; // Returns with ->pid_ns_for_children->proc_mnt == NULL.

out_net:
put_cgroup_ns(new_nsp->cgroup_ns);
out_cgroup:
if (new_nsp->pid_ns_for_children)
put_pid_ns(new_nsp->pid_ns_for_children);
out_pid:
if (new_nsp->ipc_ns)
put_ipc_ns(new_nsp->ipc_ns);
out_ipc:
if (new_nsp->uts_ns)
put_uts_ns(new_nsp->uts_ns);
out_uts:
if (new_nsp->mnt_ns)
put_mnt_ns(new_nsp->mnt_ns);
out_ns:
kmem_cache_free(nsproxy_cachep, new_nsp);
return ERR_PTR(err);
}

struct pid_namespace *copy_pid_ns(unsigned long flags,
struct user_namespace *user_ns, struct pid_namespace *old_ns)
{
if (!(flags & CLONE_NEWPID))
return get_pid_ns(old_ns);
if (task_active_pid_ns(current) != old_ns)
return ERR_PTR(-EINVAL);
return create_pid_namespace(user_ns, old_ns); // Returns with ->proc_mnt == NULL.
}

static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
struct pid_namespace *parent_pid_ns)
{
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
struct ucounts *ucounts;
int err;

err = -EINVAL;
if (!in_userns(parent_pid_ns->user_ns, user_ns))
goto out;

err = -ENOSPC;
if (level > MAX_PID_NS_LEVEL)
goto out;
ucounts = inc_pid_namespaces(user_ns);
if (!ucounts)
goto out;

err = -ENOMEM;
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL); // Initializes ->proc_mnt with NULL.
if (ns == NULL)
goto out_dec;

idr_init(&ns->idr);

ns->pid_cachep = create_pid_cachep(level + 1);
if (ns->pid_cachep == NULL)
goto out_free_idr;

err = ns_alloc_inum(&ns->ns);
if (err)
goto out_free_idr;
ns->ns.ops = &pidns_operations;

kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
ns->user_ns = get_user_ns(user_ns);
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);

return ns; // Returns with ->proc_mnt == NULL.

out_free_idr:
idr_destroy(&ns->idr);
kmem_cache_free(pid_ns_cachep, ns);
out_dec:
dec_pid_namespaces(ucounts);
out:
return ERR_PTR(err);
}

struct pid *alloc_pid(struct pid_namespace *ns)
{
(...snipped...)
if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns)) {
disable_pid_allocation(ns);
goto out_free;
}
}
(...snipped...)
}

int pid_ns_prepare_proc(struct pid_namespace *ns)
{
struct vfsmount *mnt;

mnt = kern_mount_data(&proc_fs_type, ns);
if (IS_ERR(mnt))
return PTR_ERR(mnt);

ns->proc_mnt = mnt;
return 0;
}
----------

If one of copy_io(), copy_thread_tls() or alloc_pid() returns an error, creation
of a child fails with p->nsproxy->pid_ns_for_children->proc_mnt == NULL.

Then, when the child exits, the parent waiting at wait() calls
proc_flush_task() which assumes that everything was set up properly.

----------
void proc_flush_task(struct task_struct *task)
{
int i;
struct pid *pid, *tgid;
struct upid *upid;

pid = task_pid(task);
tgid = task_tgid(task);

for (i = 0; i <= pid->level; i++) {
upid = &pid->numbers[i];
proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
tgid->numbers[i].nr);
}
}
----------

2017-12-19 18:25:45

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Tetsuo Handa <[email protected]> writes:

> On 2017/12/19 12:39, Dave Jones wrote:
>> On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>>
>> > But I don't see what would have changed in this area recently.
>> >
>> > Do you end up saving the seeds that cause crashes? Is this
>> > reproducible? (Other than seeing it twoce, of course)
>>
>> Only clue so far, is every time I'm able to trigger it, the last thing
>> the child process that triggers it did, was an execveat.
>>
>> Telling it to just fuzz execveat doesn't instantly trigger it, so it
>> must be a combination of some other syscall. I'll leave a script running
>> overnight to see if I can binary search the other syscalls in
>> combination with it.
>>
>> One other thing: I said this was rc4, but it was actually rc4 + all the
>> x86 stuff from today. There's enough creepy stuff in that pile, that
>> I'll try with just plain rc4 tomorrow too.
>>
>> Dave
>>
>
> When hitting an oops at finalization function using fuzzing tool, checking for
> corresponding initialization function and previous error messages might help.
>
> It seems to me that there are error paths which allow nsproxy to be replaced
> without assigning ->pid_ns_for_children->proc_mnt.
>
> ----------
> static __latent_entropy struct task_struct *copy_process(
> unsigned long clone_flags,
> unsigned long stack_start,
> unsigned long stack_size,
> int __user *child_tidptr,
> struct pid *pid,
> int trace,
> unsigned long tls,
> int node)
> {
> (...snipped...)
> retval = copy_namespaces(clone_flags, p); // Allocates p->nsproxy->pid_ns_for_children
> if (retval)
> goto bad_fork_cleanup_mm;
> retval = copy_io(clone_flags, p);
> if (retval)
> goto bad_fork_cleanup_namespaces; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
> retval = copy_thread_tls(clone_flags, stack_start, stack_size, p, tls);
> if (retval)
> goto bad_fork_cleanup_io; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
>
> if (pid != &init_struct_pid) {
> pid = alloc_pid(p->nsproxy->pid_ns_for_children); // Initializes p->nsproxy->pid_ns_for_children->proc_mnt upon success.
> if (IS_ERR(pid)) {
> retval = PTR_ERR(pid);
> goto bad_fork_cleanup_thread; // p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
> }
> }
> (...snipped...)
> }

> int copy_namespaces(unsigned long flags, struct task_struct *tsk)
> {
> struct nsproxy *old_ns = tsk->nsproxy;
> struct user_namespace *user_ns = task_cred_xxx(tsk, user_ns);
> struct nsproxy *new_ns;
>
> if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
> CLONE_NEWPID | CLONE_NEWNET |
> CLONE_NEWCGROUP)))) {
> get_nsproxy(old_ns);
> return 0;
> }
>
> if (!ns_capable(user_ns, CAP_SYS_ADMIN))
> return -EPERM;
>
> /*
> * CLONE_NEWIPC must detach from the undolist: after switching
> * to a new ipc namespace, the semaphore arrays from the old
> * namespace are unreachable. In clone parlance, CLONE_SYSVSEM
> * means share undolist with parent, so we must forbid using
> * it along with CLONE_NEWIPC.
> */
> if ((flags & (CLONE_NEWIPC | CLONE_SYSVSEM)) ==
> (CLONE_NEWIPC | CLONE_SYSVSEM))
> return -EINVAL;
>
> new_ns = create_new_namespaces(flags, tsk, user_ns, tsk->fs); // Allocates new nsproxy.
> if (IS_ERR(new_ns))
> return PTR_ERR(new_ns);
>
> tsk->nsproxy = new_ns; // p->nsproxy is updated with ->pid_ns_for_children->proc_mnt == NULL.
> return 0;
> }

> static struct nsproxy *create_new_namespaces(unsigned long flags,
> struct task_struct *tsk, struct user_namespace *user_ns,
> struct fs_struct *new_fs)
> {
> struct nsproxy *new_nsp;
> int err;
>
> new_nsp = create_nsproxy();
> if (!new_nsp)
> return ERR_PTR(-ENOMEM);
>
> new_nsp->mnt_ns = copy_mnt_ns(flags, tsk->nsproxy->mnt_ns, user_ns, new_fs);
> if (IS_ERR(new_nsp->mnt_ns)) {
> err = PTR_ERR(new_nsp->mnt_ns);
> goto out_ns;
> }
>
> new_nsp->uts_ns = copy_utsname(flags, user_ns, tsk->nsproxy->uts_ns);
> if (IS_ERR(new_nsp->uts_ns)) {
> err = PTR_ERR(new_nsp->uts_ns);
> goto out_uts;
> }
>
> new_nsp->ipc_ns = copy_ipcs(flags, user_ns, tsk->nsproxy->ipc_ns);
> if (IS_ERR(new_nsp->ipc_ns)) {
> err = PTR_ERR(new_nsp->ipc_ns);
> goto out_ipc;
> }
>
> new_nsp->pid_ns_for_children =
> copy_pid_ns(flags, user_ns, tsk->nsproxy->pid_ns_for_children); // Returns with ->proc_mnt == NULL.
> if (IS_ERR(new_nsp->pid_ns_for_children)) {
> err = PTR_ERR(new_nsp->pid_ns_for_children);
> goto out_pid;
> }
>
> new_nsp->cgroup_ns = copy_cgroup_ns(flags, user_ns,
> tsk->nsproxy->cgroup_ns);
> if (IS_ERR(new_nsp->cgroup_ns)) {
> err = PTR_ERR(new_nsp->cgroup_ns);
> goto out_cgroup;
> }
>
> new_nsp->net_ns = copy_net_ns(flags, user_ns, tsk->nsproxy->net_ns);
> if (IS_ERR(new_nsp->net_ns)) {
> err = PTR_ERR(new_nsp->net_ns);
> goto out_net;
> }
>
> return new_nsp; // Returns with ->pid_ns_for_children->proc_mnt == NULL.
>
> out_net:
> put_cgroup_ns(new_nsp->cgroup_ns);
> out_cgroup:
> if (new_nsp->pid_ns_for_children)
> put_pid_ns(new_nsp->pid_ns_for_children);
> out_pid:
> if (new_nsp->ipc_ns)
> put_ipc_ns(new_nsp->ipc_ns);
> out_ipc:
> if (new_nsp->uts_ns)
> put_uts_ns(new_nsp->uts_ns);
> out_uts:
> if (new_nsp->mnt_ns)
> put_mnt_ns(new_nsp->mnt_ns);
> out_ns:
> kmem_cache_free(nsproxy_cachep, new_nsp);
> return ERR_PTR(err);
> }
>
> struct pid_namespace *copy_pid_ns(unsigned long flags,
> struct user_namespace *user_ns, struct pid_namespace *old_ns)
> {
> if (!(flags & CLONE_NEWPID))
> return get_pid_ns(old_ns);
> if (task_active_pid_ns(current) != old_ns)
> return ERR_PTR(-EINVAL);
> return create_pid_namespace(user_ns, old_ns); // Returns with ->proc_mnt == NULL.
> }
>
> static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,
> struct pid_namespace *parent_pid_ns)
> {
> struct pid_namespace *ns;
> unsigned int level = parent_pid_ns->level + 1;
> struct ucounts *ucounts;
> int err;
>
> err = -EINVAL;
> if (!in_userns(parent_pid_ns->user_ns, user_ns))
> goto out;
>
> err = -ENOSPC;
> if (level > MAX_PID_NS_LEVEL)
> goto out;
> ucounts = inc_pid_namespaces(user_ns);
> if (!ucounts)
> goto out;
>
> err = -ENOMEM;
> ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL); // Initializes ->proc_mnt with NULL.
> if (ns == NULL)
> goto out_dec;
>
> idr_init(&ns->idr);
>
> ns->pid_cachep = create_pid_cachep(level + 1);
> if (ns->pid_cachep == NULL)
> goto out_free_idr;
>
> err = ns_alloc_inum(&ns->ns);
> if (err)
> goto out_free_idr;
> ns->ns.ops = &pidns_operations;
>
> kref_init(&ns->kref);
> ns->level = level;
> ns->parent = get_pid_ns(parent_pid_ns);
> ns->user_ns = get_user_ns(user_ns);
> ns->ucounts = ucounts;
> ns->pid_allocated = PIDNS_ADDING;
> INIT_WORK(&ns->proc_work, proc_cleanup_work);
>
> return ns; // Returns with ->proc_mnt == NULL.
>
> out_free_idr:
> idr_destroy(&ns->idr);
> kmem_cache_free(pid_ns_cachep, ns);
> out_dec:
> dec_pid_namespaces(ucounts);
> out:
> return ERR_PTR(err);
> }
>
> struct pid *alloc_pid(struct pid_namespace *ns)
> {
> (...snipped...)
> if (unlikely(is_child_reaper(pid))) {
> if (pid_ns_prepare_proc(ns)) {
> disable_pid_allocation(ns);
> goto out_free;
> }
> }
> (...snipped...)
> }
>
> int pid_ns_prepare_proc(struct pid_namespace *ns)
> {
> struct vfsmount *mnt;
>
> mnt = kern_mount_data(&proc_fs_type, ns);
> if (IS_ERR(mnt))
> return PTR_ERR(mnt);
>
> ns->proc_mnt = mnt;
> return 0;
> }
> ----------
>
> If one of copy_io(), copy_thread_tls() or alloc_pid() returns an error, creation
> of a child fails with p->nsproxy->pid_ns_for_children->proc_mnt == NULL.
>
> Then, when the child exits, the parent waiting at wait() calls
> proc_flush_task() which assumes that everything was set up properly.

I don't think so.

proc_mnt is allocated when the first pid is allocated in a pid
namespace. And proc_mnt is freed after the last pid in a pid
namespace is freed. That is schedule_work(&ns->proc_work) in free_pid.

Yes we can occassionally have pid namespaces without pids and
thus without a proc_mnt.

But I don't see that allows us to create a task_struct that is
not flushable.

If fork fails there is hashed pid and no task_struct to call
proc_flush_task on.

If fork succeeds there the process contains a pid in a pid namespace.

So while pid_ns_for_children->proc_mnt might be NULL. It does not
follow that pid->ns->proc_mnt can be NULL.

Eric

> ----------
> void proc_flush_task(struct task_struct *task)
> {
> int i;
> struct pid *pid, *tgid;
> struct upid *upid;
>
> pid = task_pid(task);
> tgid = task_tgid(task);
>
> for (i = 0; i <= pid->level; i++) {
> upid = &pid->numbers[i];
> proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
> tgid->numbers[i].nr);
> }
> }
> ----------

2017-12-19 18:27:58

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>
> > But I don't see what would have changed in this area recently.
> >
> > Do you end up saving the seeds that cause crashes? Is this
> > reproducible? (Other than seeing it twoce, of course)
>
> Only clue so far, is every time I'm able to trigger it, the last thing
> the child process that triggers it did, was an execveat.

Is there any chance the excveat might be called from a child thread?

That switching pids between tasks of a process during exec can get a
little bit tricky.

> Telling it to just fuzz execveat doesn't instantly trigger it, so it
> must be a combination of some other syscall. I'll leave a script running
> overnight to see if I can binary search the other syscalls in
> combination with it.

Could we have a buggy syscall that is stomping something?

> One other thing: I said this was rc4, but it was actually rc4 + all the
> x86 stuff from today. There's enough creepy stuff in that pile, that
> I'll try with just plain rc4 tomorrow too.

Eric

2017-12-19 19:30:23

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Tue, Dec 19, 2017 at 12:27:30PM -0600, Eric W. Biederman wrote:
> Dave Jones <[email protected]> writes:
>
> > On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
> >
> > > But I don't see what would have changed in this area recently.
> > >
> > > Do you end up saving the seeds that cause crashes? Is this
> > > reproducible? (Other than seeing it twoce, of course)
> >
> > Only clue so far, is every time I'm able to trigger it, the last thing
> > the child process that triggers it did, was an execveat.
>
> Is there any chance the excveat might be called from a child thread?

If trinity choose one of the exec syscalls, it forks off an extra child
to do it in, on the off-chance that it succeeds, and we never return.
https://github.com/kernelslacker/trinity/blob/master/syscall.c#L139

> That switching pids between tasks of a process during exec can get a
> little bit tricky.
>
> > Telling it to just fuzz execveat doesn't instantly trigger it, so it
> > must be a combination of some other syscall. I'll leave a script running
> > overnight to see if I can binary search the other syscalls in
> > combination with it.
>
> Could we have a buggy syscall that is stomping something?

Not totally impossible I guess, though I would expect that would
manifest in additional random failures, whereas this seems remarkably
consistent.

Dave

2017-12-19 21:45:00

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Tue, Dec 19, 2017 at 12:27:30PM -0600, Eric W. Biederman wrote:
> > Dave Jones <[email protected]> writes:
> >
> > > On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
> > >
> > > > But I don't see what would have changed in this area recently.
> > > >
> > > > Do you end up saving the seeds that cause crashes? Is this
> > > > reproducible? (Other than seeing it twoce, of course)
> > >
> > > Only clue so far, is every time I'm able to trigger it, the last thing
> > > the child process that triggers it did, was an execveat.
> >
> > Is there any chance the excveat might be called from a child thread?
>
> If trinity choose one of the exec syscalls, it forks off an extra child
> to do it in, on the off-chance that it succeeds, and we never return.
> https://github.com/kernelslacker/trinity/blob/master/syscall.c#L139

extrapid = fork();
if (extrapid == 0) {
/* grand-child */
char childname[]="trinity-subchild";
prctl(PR_SET_NAME, (unsigned long) &childname);

__do_syscall(rec, GOING_AWAY);
/* if this was for eg. an successful execve, we should never get here.
* if it failed though... */
_exit(EXIT_SUCCESS);
}

That is interesting.

So the system call sequence is a fork which just succeeded and than an
exec. That reduces the possibilities quite a lot.

With pids there was a recent change that just replaced the pid hash
table and the pid bitmap with and idr. It changes the locking somewhat
and probably changes the timing so that might be the culprit.

I am trying to figure out if there is an interface that would let
ns_last_pid for a pid namespace be accessed before the first pid is
allocated and I am not seeing it. It does not appear to be possible
to mount a proc for a pid namespace you are not currently in.

*Scratches my head* I am not seeing anything obvious.

Eric

2017-12-20 01:54:53

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

[email protected] (Eric W. Biederman) writes:

> Dave Jones <[email protected]> writes:
>
>> On Tue, Dec 19, 2017 at 12:27:30PM -0600, Eric W. Biederman wrote:
>> > Dave Jones <[email protected]> writes:
>> >
>> > > On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>> > >
>> > > > But I don't see what would have changed in this area recently.
>> > > >
>> > > > Do you end up saving the seeds that cause crashes? Is this
>> > > > reproducible? (Other than seeing it twoce, of course)
>> > >
>> > > Only clue so far, is every time I'm able to trigger it, the last thing
>> > > the child process that triggers it did, was an execveat.
>> >
>> > Is there any chance the excveat might be called from a child thread?
>>
>> If trinity choose one of the exec syscalls, it forks off an extra child
>> to do it in, on the off-chance that it succeeds, and we never return.
>> https://github.com/kernelslacker/trinity/blob/master/syscall.c#L139
>
> extrapid = fork();
> if (extrapid == 0) {
> /* grand-child */
> char childname[]="trinity-subchild";
> prctl(PR_SET_NAME, (unsigned long) &childname);
>
> __do_syscall(rec, GOING_AWAY);
> /* if this was for eg. an successful execve, we should never get here.
> * if it failed though... */
> _exit(EXIT_SUCCESS);
> }
>
> That is interesting.
>
>
> So the system call sequence is a fork which just succeeded and than an
> exec. That reduces the possibilities quite a lot.
>
> With pids there was a recent change that just replaced the pid hash
> table and the pid bitmap with and idr. It changes the locking somewhat
> and probably changes the timing so that might be the culprit.
>
> I am trying to figure out if there is an interface that would let
> ns_last_pid for a pid namespace be accessed before the first pid is
> allocated and I am not seeing it. It does not appear to be possible
> to mount a proc for a pid namespace you are not currently in.
>
> *Scratches my head* I am not seeing anything obvious.

Can you try this patch as you reproduce this issue?

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..df9e5d4d8f83 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
+ WARN_ON(!upid->ns->proc_mnt);
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
}

If the warning triggers it means the bug is in alloc_pid and somehow
something has gotten past the is_child_reaper check.

If the warning does not trigger it means something is stomping proc_mnt.

In the entire kernel there are exactly two assignments to proc_mnt.
- kmem_cache_zalloc in create_pid_namespace.
- In pid_ns_prepare_proc where proc_mnt is set to a non-zero value.

On the 29th of Nov syzkaller also hit this and gave me this reproducer
that I can't figure out heads or tails of.

#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
socket$inet6_dccp(0xa, 0x6, 0x0)
unshare(0x20000400)
sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)

#{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
socket$inet6_dccp(0xa, 0x6, 0x0)
unshare(0x20000400)
sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)

Eric

2017-12-20 05:28:08

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Tue, Dec 19, 2017 at 07:54:24PM -0600, Eric W. Biederman wrote:

> > *Scratches my head* I am not seeing anything obvious.
>
> Can you try this patch as you reproduce this issue?
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b13b624e2c49..df9e5d4d8f83 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> goto out_unlock;
> for ( ; upid >= pid->numbers; --upid) {
> /* Make the PID visible to find_pid_ns. */
> + WARN_ON(!upid->ns->proc_mnt);
> idr_replace(&upid->ns->idr, pid, upid->nr);
> upid->ns->pid_allocated++;
> }
>
>
> If the warning triggers it means the bug is in alloc_pid and somehow
> something has gotten past the is_child_reaper check.

You're onto something.

WARNING: CPU: 1 PID: 12020 at kernel/pid.c:213 alloc_pid+0x230/0x280
CPU: 1 PID: 12020 Comm: trinity-c29 Not tainted 4.15.0-rc4-think+ #3
RIP: 0010:alloc_pid+0x230/0x280
RSP: 0018:ffffc90009977d48 EFLAGS: 00010046
RAX: 0000000000000030 RBX: ffff8804fb431280 RCX: 8f5c28f5c28f5c29
RDX: ffff88050a00de40 RSI: ffffffff82005218 RDI: ffff8804fc6aa9a8
RBP: ffff8804fb431270 R08: 0000000000000000 R09: 0000000000000001
R10: ffffc90009977cc0 R11: eab94e31da7171b7 R12: ffff8804fb431260
R13: ffff8804fb431240 R14: ffffffff82005200 R15: ffff8804fb431268
FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f49b906a000 CR3: 00000004f7446001 CR4: 00000000001606e0
DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
copy_process.part.41+0x14fa/0x1e30
_do_fork+0xe7/0x720
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25

followed immediately by...

Oops: 0000 [#1] SMP
CPU: 1 PID: 12020 Comm: trinity-c29 Tainted: G W 4.15.0-rc4-think+ #3
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc90009977c40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc90009977c50 RDI: 0000000000000000
RBP: ffffc90009977c63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc90009977b70 R11: ffffc90009977c64 R12: 0000000000000004
R13: 0000000000000000 R14: 0000000000000004 R15: ffff8804fb431240
FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000004f7446001 CR4: 00000000001606e0
DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
? release_task+0xaf/0x680
release_task+0xd2/0x680
? wait_consider_task+0xb82/0xce0
wait_consider_task+0xbe9/0xce0
? do_wait+0xe1/0x330
do_wait+0x151/0x330
kernel_wait4+0x8d/0x150
? task_stopped_code+0x50/0x50
SYSC_wait4+0x95/0xa0
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
? do_syscall_64+0x60/0x210
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25

2017-12-20 08:00:34

by Dmitry Vyukov

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Wed, Dec 20, 2017 at 2:54 AM, Eric W. Biederman
<[email protected]> wrote:
> [email protected] (Eric W. Biederman) writes:
>
>> Dave Jones <[email protected]> writes:
>>
>>> On Tue, Dec 19, 2017 at 12:27:30PM -0600, Eric W. Biederman wrote:
>>> > Dave Jones <[email protected]> writes:
>>> >
>>> > > On Mon, Dec 18, 2017 at 03:50:52PM -0800, Linus Torvalds wrote:
>>> > >
>>> > > > But I don't see what would have changed in this area recently.
>>> > > >
>>> > > > Do you end up saving the seeds that cause crashes? Is this
>>> > > > reproducible? (Other than seeing it twoce, of course)
>>> > >
>>> > > Only clue so far, is every time I'm able to trigger it, the last thing
>>> > > the child process that triggers it did, was an execveat.
>>> >
>>> > Is there any chance the excveat might be called from a child thread?
>>>
>>> If trinity choose one of the exec syscalls, it forks off an extra child
>>> to do it in, on the off-chance that it succeeds, and we never return.
>>> https://github.com/kernelslacker/trinity/blob/master/syscall.c#L139
>>
>> extrapid = fork();
>> if (extrapid == 0) {
>> /* grand-child */
>> char childname[]="trinity-subchild";
>> prctl(PR_SET_NAME, (unsigned long) &childname);
>>
>> __do_syscall(rec, GOING_AWAY);
>> /* if this was for eg. an successful execve, we should never get here.
>> * if it failed though... */
>> _exit(EXIT_SUCCESS);
>> }
>>
>> That is interesting.
>>
>>
>> So the system call sequence is a fork which just succeeded and than an
>> exec. That reduces the possibilities quite a lot.
>>
>> With pids there was a recent change that just replaced the pid hash
>> table and the pid bitmap with and idr. It changes the locking somewhat
>> and probably changes the timing so that might be the culprit.
>>
>> I am trying to figure out if there is an interface that would let
>> ns_last_pid for a pid namespace be accessed before the first pid is
>> allocated and I am not seeing it. It does not appear to be possible
>> to mount a proc for a pid namespace you are not currently in.
>>
>> *Scratches my head* I am not seeing anything obvious.
>
> Can you try this patch as you reproduce this issue?
>
> diff --git a/kernel/pid.c b/kernel/pid.c
> index b13b624e2c49..df9e5d4d8f83 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> goto out_unlock;
> for ( ; upid >= pid->numbers; --upid) {
> /* Make the PID visible to find_pid_ns. */
> + WARN_ON(!upid->ns->proc_mnt);
> idr_replace(&upid->ns->idr, pid, upid->nr);
> upid->ns->pid_allocated++;
> }
>
>
> If the warning triggers it means the bug is in alloc_pid and somehow
> something has gotten past the is_child_reaper check.
>
> If the warning does not trigger it means something is stomping proc_mnt.
>
> In the entire kernel there are exactly two assignments to proc_mnt.
> - kmem_cache_zalloc in create_pid_namespace.
> - In pid_ns_prepare_proc where proc_mnt is set to a non-zero value.
>
> On the 29th of Nov syzkaller also hit this and gave me this reproducer
> that I can't figure out heads or tails of.

You can ask syzbot to test a patch if there is a reproducer.
Instructions are at the bottom of the report email.

> #{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
> mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
> perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
> socket$inet6_dccp(0xa, 0x6, 0x0)
> unshare(0x20000400)
> sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
> process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
> pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
> clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
> ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
> epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)
>
> #{Threaded:true Collide:true Repeat:true Procs:8 Sandbox:namespace Fault:false FaultCall:-1 FaultNth:0 EnableTun:true UseTmpDir:true HandleSegv:true WaitRepeat:true Debug:false Repro:false}
> mmap(&(0x7f0000000000/0xfff000)=nil, 0xfff000, 0x3, 0x32, 0xffffffffffffffff, 0x0)
> perf_event_open(&(0x7f000025c000)={0x2, 0x78, 0x3e3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf72, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, 0x0, 0xffffffffffffffff, 0xffffffffffffffff, 0x0)
> socket$inet6_dccp(0xa, 0x6, 0x0)
> unshare(0x20000400)
> sendmsg$unix(0xffffffffffffffff, &(0x7f0000001000-0x38)={&(0x7f0000239000-0x8)=@abs={0x0, 0x0, 0x0}, 0x8, &(0x7f0000008000)=[], 0x0, &(0x7f0000001000-0x10)=[@rights={0x200, 0x1, 0x1, [0xffffffffffffffff]}], 0x1, 0x0}, 0x0)
> process_vm_writev(0x0, &(0x7f0000699000-0x70)=[{&(0x7f00006a5000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x4c}, {&(0x7f00007b9000-0x54)="", 0x0}, {&(0x7f00004f3000)="", 0x0}, {&(0x7f00002e3000-0xd6)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xd6}, {&(0x7f0000f2e000)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x52}, {&(0x7f00008e5000-0x10)="00000000000000000000000000000000", 0x10}, {&(0x7f0000a3a000)="", 0x0}], 0x7, &(0x7f0000d05000)=[{&(0x7f0000d64000)="", 0x0}, {&(0x7f0000062000-0x93)="", 0x0}, {&(0x7f0000a16000-0x7e)="000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0x7e}, {&(0x7f00003dc000-0x9a)="", 0x0}, {&(0x7f0000fe3000-0xc7)="00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000", 0xc7}], 0x5, 0x0)
> pselect6(0x40, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x20000, 0x0}, &(0x7f0000cc9000-0x40)={0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, &(0x7f00000de000-0x40)={0xffffffffffffffe1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, &(0x7f00008e6000-0x10)={0x0, 0x989680}, &(0x7f0000205000-0x10)={&(0x7f00006e4000-0x8)={0x0}, 0x8})
> clone(0x20900, &(0x7f0000a94000-0x1)="6f", &(0x7f00002b8000-0x4)=0x0, &(0x7f000029e000)=0x0, &(0x7f00006fe000)="")
> ioctl$KVM_ENABLE_CAP_CPU(0xffffffffffffffff, 0x4068aea3, &(0x7f0000e48000)={0x7b, 0x0, [0x1, 0x1, 0x800, 0x1], [0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0]})
> epoll_ctl$EPOLL_CTL_DEL(0xffffffffffffffff, 0x2, 0xffffffffffffffff)
>
>
> Eric
>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller-bugs" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
> To view this discussion on the web visit https://groups.google.com/d/msgid/syzkaller-bugs/87mv2e17vz.fsf%40xmission.com.
> For more options, visit https://groups.google.com/d/optout.

2017-12-20 18:26:22

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Tue, Dec 19, 2017 at 07:54:24PM -0600, Eric W. Biederman wrote:
>
> > > *Scratches my head* I am not seeing anything obvious.
> >
> > Can you try this patch as you reproduce this issue?
> >
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index b13b624e2c49..df9e5d4d8f83 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
> > goto out_unlock;
> > for ( ; upid >= pid->numbers; --upid) {
> > /* Make the PID visible to find_pid_ns. */
> > + WARN_ON(!upid->ns->proc_mnt);
> > idr_replace(&upid->ns->idr, pid, upid->nr);
> > upid->ns->pid_allocated++;
> > }
> >
> >
> > If the warning triggers it means the bug is in alloc_pid and somehow
> > something has gotten past the is_child_reaper check.
>
> You're onto something.
>
> WARNING: CPU: 1 PID: 12020 at kernel/pid.c:213 alloc_pid+0x230/0x280
> CPU: 1 PID: 12020 Comm: trinity-c29 Not tainted 4.15.0-rc4-think+ #3
> RIP: 0010:alloc_pid+0x230/0x280
> RSP: 0018:ffffc90009977d48 EFLAGS: 00010046
> RAX: 0000000000000030 RBX: ffff8804fb431280 RCX: 8f5c28f5c28f5c29
> RDX: ffff88050a00de40 RSI: ffffffff82005218 RDI: ffff8804fc6aa9a8
> RBP: ffff8804fb431270 R08: 0000000000000000 R09: 0000000000000001
> R10: ffffc90009977cc0 R11: eab94e31da7171b7 R12: ffff8804fb431260
> R13: ffff8804fb431240 R14: ffffffff82005200 R15: ffff8804fb431268
> FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f49b906a000 CR3: 00000004f7446001 CR4: 00000000001606e0
> DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Call Trace:
> copy_process.part.41+0x14fa/0x1e30
> _do_fork+0xe7/0x720
> ? rcu_read_lock_sched_held+0x6c/0x80
> ? syscall_trace_enter+0x2d7/0x340
> do_syscall_64+0x60/0x210
> entry_SYSCALL64_slow_path+0x25/0x25
>
> followed immediately by...
>
> Oops: 0000 [#1] SMP
> CPU: 1 PID: 12020 Comm: trinity-c29 Tainted: G W 4.15.0-rc4-think+ #3
> RIP: 0010:proc_flush_task+0x8e/0x1b0
> RSP: 0018:ffffc90009977c40 EFLAGS: 00010286
> RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
> RDX: 0000000000000000 RSI: ffffc90009977c50 RDI: 0000000000000000
> RBP: ffffc90009977c63 R08: 0000000000000000 R09: 0000000000000002
> R10: ffffc90009977b70 R11: ffffc90009977c64 R12: 0000000000000004
> R13: 0000000000000000 R14: 0000000000000004 R15: ffff8804fb431240
> FS: 00007f49b9065700(0000) GS:ffff88050a000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 00000004f7446001 CR4: 00000000001606e0
> DR0: 00007f0b4c405000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Call Trace:
> ? release_task+0xaf/0x680
> release_task+0xd2/0x680
> ? wait_consider_task+0xb82/0xce0
> wait_consider_task+0xbe9/0xce0
> ? do_wait+0xe1/0x330
> do_wait+0x151/0x330
> kernel_wait4+0x8d/0x150
> ? task_stopped_code+0x50/0x50
> SYSC_wait4+0x95/0xa0
> ? rcu_read_lock_sched_held+0x6c/0x80
> ? syscall_trace_enter+0x2d7/0x340
> ? do_syscall_64+0x60/0x210
> do_syscall_64+0x60/0x210
> entry_SYSCALL64_slow_path+0x25/0x25

I am not seeing where things go wrong, but that puts the recent pid bitmap, bit
hash to idr change in the suspect zone.

Can you try reverting that change:

e8cfbc245e24 ("pid: remove pidhash")
95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")

While keeping the warning in place so we can see if this fixes the
allocation problem?

Eric

2017-12-21 03:16:10

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Wed, Dec 20, 2017 at 12:25:52PM -0600, Eric W. Biederman wrote:
> > >
> > > If the warning triggers it means the bug is in alloc_pid and somehow
> > > something has gotten past the is_child_reaper check.
> >
> > You're onto something.
> >
> I am not seeing where things go wrong, but that puts the recent pid bitmap, bit
> hash to idr change in the suspect zone.
>
> Can you try reverting that change:
>
> e8cfbc245e24 ("pid: remove pidhash")
> 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
>
> While keeping the warning in place so we can see if this fixes the
> allocation problem?

So I can't trigger this any more with those reverted. I seem to hit a
bunch of other long-standing bugs first. I'll keep running it
overnight, but it looks like this is where the problem lies.

Dave

2017-12-21 08:27:24

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Wed, Dec 20, 2017 at 12:25:52PM -0600, Eric W. Biederman wrote:
>> > >
> > > > If the warning triggers it means the bug is in alloc_pid and somehow
> > > > something has gotten past the is_child_reaper check.
> > >
> > > You're onto something.
> > >
> > I am not seeing where things go wrong, but that puts the recent pid bitmap, bit
> > hash to idr change in the suspect zone.
> >
> > Can you try reverting that change:
> >
> > e8cfbc245e24 ("pid: remove pidhash")
> > 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
> >
> > While keeping the warning in place so we can see if this fixes the
> > allocation problem?
>
> So I can't trigger this any more with those reverted. I seem to hit a
> bunch of other long-standing bugs first. I'll keep running it
> overnight, but it looks like this is where the problem lies.

I would really like to hear from the people who made this change if they
are interested in tracking down this failure.

It might be as simple as the locking changed enough that the locking
instrumentation is now slowing things down, and opening up an old race.

I have stared at this code, and written some test programs and I can't
see what is going on. alloc_pid by design and in implementation (as far
as I can see) is always single threaded when allocating the first pid
in a pid namespace. idr_init always initialized idr_next to 0.

So how we can get past:

if (unlikely(is_child_reaper(pid))) {
if (pid_ns_prepare_proc(ns)) {
disable_pid_allocation(ns);
goto out_free;
}
}

with proc_mnt still set to NULL is a mystery to me.

Is there any chance the idr code doesn't always return the lowest valid
free number? So init gets assigned something other than 1?

Eric

2017-12-21 10:38:16

by Alexey Dobriyan

[permalink] [raw]

Subject: Re: proc_flush_task oops

On 12/21/17, Eric W. Biederman <[email protected]> wrote:
> I have stared at this code, and written some test programs and I can't
> see what is going on. alloc_pid by design and in implementation (as far
> as I can see) is always single threaded when allocating the first pid
> in a pid namespace. idr_init always initialized idr_next to 0.
>
> So how we can get past:
>
> if (unlikely(is_child_reaper(pid))) {
> if (pid_ns_prepare_proc(ns)) {
> disable_pid_allocation(ns);
> goto out_free;
> }
> }
>
> with proc_mnt still set to NULL is a mystery to me.
>
> Is there any chance the idr code doesn't always return the lowest valid
> free number? So init gets assigned something other than 1?

Well, this theory is easy to test (attached).

There is a "valid" way to break the code via kernel.ns_last_pid:
unshare+write+fork but the reproducer doesn't seem to use it (or it does?)

Attachments:

pid1.diff (1.19 kB)

2017-12-21 14:25:42

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> On 12/21/17, Eric W. Biederman <[email protected]> wrote:
> > I have stared at this code, and written some test programs and I can't
> > see what is going on. alloc_pid by design and in implementation (as far
> > as I can see) is always single threaded when allocating the first pid
> > in a pid namespace. idr_init always initialized idr_next to 0.
> >
> > So how we can get past:
> >
> > if (unlikely(is_child_reaper(pid))) {
> > if (pid_ns_prepare_proc(ns)) {
> > disable_pid_allocation(ns);
> > goto out_free;
> > }
> > }
> >
> > with proc_mnt still set to NULL is a mystery to me.
> >
> > Is there any chance the idr code doesn't always return the lowest valid
> > free number? So init gets assigned something other than 1?
>
> Well, this theory is easy to test (attached).

I'll give this a shot and report back when I get to the office.

> There is a "valid" way to break the code via kernel.ns_last_pid:
> unshare+write+fork but the reproducer doesn't seem to use it (or it does?)

that sysctl is root only, so that isn't at play here.

Dav

2017-12-21 16:41:55

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> > On 12/21/17, Eric W. Biederman <[email protected]> wrote:
> > > I have stared at this code, and written some test programs and I can't
> > > see what is going on. alloc_pid by design and in implementation (as far
> > > as I can see) is always single threaded when allocating the first pid
> > > in a pid namespace. idr_init always initialized idr_next to 0.
> > >
> > > So how we can get past:
> > >
> > > if (unlikely(is_child_reaper(pid))) {
> > > if (pid_ns_prepare_proc(ns)) {
> > > disable_pid_allocation(ns);
> > > goto out_free;
> > > }
> > > }
> > >
> > > with proc_mnt still set to NULL is a mystery to me.
> > >
> > > Is there any chance the idr code doesn't always return the lowest valid
> > > free number? So init gets assigned something other than 1?
> >
> > Well, this theory is easy to test (attached).
>
> I'll give this a shot and report back when I get to the office.
>
> > There is a "valid" way to break the code via kernel.ns_last_pid:
> > unshare+write+fork but the reproducer doesn't seem to use it (or it does?)
>
> that sysctl is root only, so that isn't at play here.

ns_capable(CAP_SYS_ADMIN) will allow root in a user namespace. So the
sysctl should be fuzzable.

The ns_last_pid sysctl is still not in play because it changes
task_active_pid_ns (aka the pid namespace of the callers pid) not
pid_ns_for_children. So it still is not in play.

Every time I think of a "valid" way to break the code, I double check
myself and find there are already checks in place to prevent that.

Eric

2017-12-21 22:00:49

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:

> > with proc_mnt still set to NULL is a mystery to me.
> >
> > Is there any chance the idr code doesn't always return the lowest valid
> > free number? So init gets assigned something other than 1?
>
> Well, this theory is easy to test (attached).

I didn't hit this BUG, but I hit the same oops in proc_flush_task.

Dave

2017-12-22 01:31:57

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>
> > > with proc_mnt still set to NULL is a mystery to me.
> > >
> > > Is there any chance the idr code doesn't always return the lowest valid
> > > free number? So init gets assigned something other than 1?
> >
> > Well, this theory is easy to test (attached).
>
> I didn't hit this BUG, but I hit the same oops in proc_flush_task.

Scratch one idea.

If it isn't too much trouble can you try this.

I am wondering if somehow the proc_mnt that is NULL is somewhere in the
middle of the stack of pid namespaces.

This adds two warnings. The first just reports which pid namespace in
the stack of pid namespaces is problematic, and the pid number in that
pid namespace. Which should give a whole lot more to go by.

The second warning complains if we manage to create a pid namespace
where the parent pid namespace is not properly set up. The test to
prevent that looks quite robust, but at this point I don't know where to
look.

Thank you very much,
Eric

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..a1e8734afbba 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -210,6 +210,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)
goto out_unlock;
for ( ; upid >= pid->numbers; --upid) {
/* Make the PID visible to find_pid_ns. */
+ WARN(!upid->ns->proc_mnt, "%ld/%d: %d no proc_mnt", (upid - pid->numbers), pid->level, upid->nr);
idr_replace(&upid->ns->idr, pid, upid->nr);
upid->ns->pid_allocated++;
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 0b53eef7d34b..8f4c02c7223a 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -136,6 +136,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->pid_allocated = PIDNS_ADDING;
INIT_WORK(&ns->proc_work, proc_cleanup_work);

+ WARN_ON(!parent_pid_ns->proc_mnt);
+
return ns;

out_free_idr:

2017-12-22 03:35:04

by Dave Jones

[permalink] [raw]

Subject: Re: proc_flush_task oops

On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
> Dave Jones <[email protected]> writes:
>
> > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> >
> > > > with proc_mnt still set to NULL is a mystery to me.
> > > >
> > > > Is there any chance the idr code doesn't always return the lowest valid
> > > > free number? So init gets assigned something other than 1?
> > >
> > > Well, this theory is easy to test (attached).
> >
> > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>
> Scratch one idea.
>
> If it isn't too much trouble can you try this.
>
> I am wondering if somehow the proc_mnt that is NULL is somewhere in the
> middle of the stack of pid namespaces.
>
> This adds two warnings. The first just reports which pid namespace in
> the stack of pid namespaces is problematic, and the pid number in that
> pid namespace. Which should give a whole lot more to go by.
>
> The second warning complains if we manage to create a pid namespace
> where the parent pid namespace is not properly set up. The test to
> prevent that looks quite robust, but at this point I don't know where to
> look.

Progress ?

[ 1653.030190] ------------[ cut here ]------------
[ 1653.030852] 1/1: 2 no proc_mnt
[ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213 alloc_pid+0x24f/0x2a0

2017-12-22 07:58:54

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Dave Jones <[email protected]> writes:

> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
> > Dave Jones <[email protected]> writes:
> >
> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
> > >
> > > > > with proc_mnt still set to NULL is a mystery to me.
> > > > >
> > > > > Is there any chance the idr code doesn't always return the lowest valid
> > > > > free number? So init gets assigned something other than 1?
> > > >
> > > > Well, this theory is easy to test (attached).
> > >
> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
> >
> > Scratch one idea.
> >
> > If it isn't too much trouble can you try this.
> >
> > I am wondering if somehow the proc_mnt that is NULL is somewhere in the
> > middle of the stack of pid namespaces.
> >
> > This adds two warnings. The first just reports which pid namespace in
> > the stack of pid namespaces is problematic, and the pid number in that
> > pid namespace. Which should give a whole lot more to go by.
> >
> > The second warning complains if we manage to create a pid namespace
> > where the parent pid namespace is not properly set up. The test to
> > prevent that looks quite robust, but at this point I don't know where to
> > look.
>
> Progress ?
>
> [ 1653.030190] ------------[ cut here ]------------
> [ 1653.030852] 1/1: 2 no proc_mnt
> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213 alloc_pid+0x24f/0x2a0

Yes. I don't know why Alexey's patch did not fire but this is
confirmation that the first pid allocated was #2 and not #1.

Which explains the pid_mnt not being set, and it is definitely the new
code, changing from the pid bitmap+hash table to an idr.

So it looks like idr_alloc_cyclic in some configuration for the first
allocation returns value #2 instead of value #1.

I don't know that code, and it is quite complicated so I will have to
stare at it some more to even guess why it is doing that.

This is confirmation that reverting those pid changes will fix the
problem. As they are definitely at fault.

Hmm. After a little more staring I have a hunch what is going wrong.
It is just possible that there is a failure in alloc_pid during the
first pid allocation and then idr_next gets left at 2. I need to sleep
before I can think of a patch to test that.

Hmm. A failure and then restart would also explain why Alexey's patch
did not fire. An incomplete reset of state.

Eric

2017-12-22 10:13:30

by Alexey Dobriyan

[permalink] [raw]

Subject: Re: proc_flush_task oops

On 12/22/17, Eric W. Biederman <[email protected]> wrote:
> Dave Jones <[email protected]> writes:
>
>> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
>> > Dave Jones <[email protected]> writes:
>> >
>> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>> > >
>> > > > > with proc_mnt still set to NULL is a mystery to me.
>> > > > >
>> > > > > Is there any chance the idr code doesn't always return the
>> lowest valid
>> > > > > free number? So init gets assigned something other than 1?
>> > > >
>> > > > Well, this theory is easy to test (attached).
>> > >
>> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>> >
>> > Scratch one idea.
>> >
>> > If it isn't too much trouble can you try this.
>> >
>> > I am wondering if somehow the proc_mnt that is NULL is somewhere in
>> the
>> > middle of the stack of pid namespaces.
>> >
>> > This adds two warnings. The first just reports which pid namespace in
>> > the stack of pid namespaces is problematic, and the pid number in that
>> > pid namespace. Which should give a whole lot more to go by.
>> >
>> > The second warning complains if we manage to create a pid namespace
>> > where the parent pid namespace is not properly set up. The test to
>> > prevent that looks quite robust, but at this point I don't know where
>> to
>> > look.
>>
>> Progress ?
>>
>> [ 1653.030190] ------------[ cut here ]------------
>> [ 1653.030852] 1/1: 2 no proc_mnt
>> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213
>> alloc_pid+0x24f/0x2a0
>
> Yes. I don't know why Alexey's patch did not fire but this is
> confirmation that the first pid allocated was #2 and not #1.
>
> Which explains the pid_mnt not being set, and it is definitely the new
> code, changing from the pid bitmap+hash table to an idr.
>
> So it looks like idr_alloc_cyclic in some configuration for the first
> allocation returns value #2 instead of value #1.
>
> I don't know that code, and it is quite complicated so I will have to
> stare at it some more to even guess why it is doing that.
>
> This is confirmation that reverting those pid changes will fix the
> problem. As they are definitely at fault.
>
>
> Hmm. After a little more staring I have a hunch what is going wrong.
> It is just possible that there is a failure in alloc_pid during the
> first pid allocation and then idr_next gets left at 2. I need to sleep
> before I can think of a patch to test that.
>
> Hmm. A failure and then restart would also explain why Alexey's patch
> did not fire. An incomplete reset of state.

You are right (you are also right about sysctl :-\)

unshare
fork
alloc_pid in level 1 succeeds
alloc_pid in level 0 fails, ->idr_next is 2
fork
alloc pid 2
exit

Reliable reproducer and fail injection patch attached

I'd say proper fix is allocating pids in the opposite order
so that failure in the last layer doesn't move IDR cursor
in baby child pidns.

BUG: unable to handle kernel NULL pointer dereference at (null)
IP: proc_flush_task+0x7b/0x180
PGD 3dbb5067 P4D 3dbb5067 PUD 3d11c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2775 Comm: trinity Not tainted
4.15.0-rc4-00202-gead68f216110-dirty #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1.fc27 04/01/2014
RIP: 0010:proc_flush_task+0x7b/0x180
RSP: 0018:ffffc9000016bca8 EFLAGS: 00010296
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000032
RDX: ffffc9000016bcbc RSI: ffffc9000016bcc8 RDI: 0000000000000001
RBP: ffffc9000016bcbb R08: 0000000036373732 R09: 0000000000000000
R10: 0000000000000000 R11: ffffc9000016bcbc R12: 0000000000000002
R13: 0000000000000000 R14: 0000000000000002 R15: ffff88003e2aee00
FS: 00007f9b52c86700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003d24b000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
release_task+0x3c/0x430
? thread_group_cputime_adjusted+0x35/0x40
wait_consider_task+0x7a3/0x7d0
do_wait+0xe5/0x1f0
kernel_wait4+0x74/0x110
? task_stopped_code+0x50/0x50
SyS_wait4+0x77/0x80
? handle_mm_fault+0x4a/0x50
? __do_page_fault+0x1a9/0x3e0
? entry_SYSCALL_64_fastpath+0x13/0x6c
entry_SYSCALL_64_fastpath+0x13/0x6c

Attachments:

proc-flush-task.diff (2.19 kB)
proc-flush-task.c (250.00 B)
Download all attachments

2017-12-22 14:42:27

by Eric W. Biederman

[permalink] [raw]

Subject: Re: proc_flush_task oops

Alexey Dobriyan <[email protected]> writes:

> On 12/22/17, Eric W. Biederman <[email protected]> wrote:
>> Dave Jones <[email protected]> writes:
>>
>>> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
>>> > Dave Jones <[email protected]> writes:
>>> >
>>> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>>> > >
>>> > > > > with proc_mnt still set to NULL is a mystery to me.
>>> > > > >
>>> > > > > Is there any chance the idr code doesn't always return the
>>> lowest valid
>>> > > > > free number? So init gets assigned something other than 1?
>>> > > >
>>> > > > Well, this theory is easy to test (attached).
>>> > >
>>> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>>> >
>>> > Scratch one idea.
>>> >
>>> > If it isn't too much trouble can you try this.
>>> >
>>> > I am wondering if somehow the proc_mnt that is NULL is somewhere in
>>> the
>>> > middle of the stack of pid namespaces.
>>> >
>>> > This adds two warnings. The first just reports which pid namespace in
>>> > the stack of pid namespaces is problematic, and the pid number in that
>>> > pid namespace. Which should give a whole lot more to go by.
>>> >
>>> > The second warning complains if we manage to create a pid namespace
>>> > where the parent pid namespace is not properly set up. The test to
>>> > prevent that looks quite robust, but at this point I don't know where
>>> to
>>> > look.
>>>
>>> Progress ?
>>>
>>> [ 1653.030190] ------------[ cut here ]------------
>>> [ 1653.030852] 1/1: 2 no proc_mnt
>>> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213
>>> alloc_pid+0x24f/0x2a0
>>
>> Yes. I don't know why Alexey's patch did not fire but this is
>> confirmation that the first pid allocated was #2 and not #1.
>>
>> Which explains the pid_mnt not being set, and it is definitely the new
>> code, changing from the pid bitmap+hash table to an idr.
>>
>> So it looks like idr_alloc_cyclic in some configuration for the first
>> allocation returns value #2 instead of value #1.
>>
>> I don't know that code, and it is quite complicated so I will have to
>> stare at it some more to even guess why it is doing that.
>>
>> This is confirmation that reverting those pid changes will fix the
>> problem. As they are definitely at fault.
>>
>>
>> Hmm. After a little more staring I have a hunch what is going wrong.
>> It is just possible that there is a failure in alloc_pid during the
>> first pid allocation and then idr_next gets left at 2. I need to sleep
>> before I can think of a patch to test that.
>>
>> Hmm. A failure and then restart would also explain why Alexey's patch
>> did not fire. An incomplete reset of state.
>
> You are right (you are also right about sysctl :-\)
>
> unshare
> fork
> alloc_pid in level 1 succeeds
> alloc_pid in level 0 fails, ->idr_next is 2
> fork
> alloc pid 2
> exit
>
> Reliable reproducer and fail injection patch attached
>
> I'd say proper fix is allocating pids in the opposite order
> so that failure in the last layer doesn't move IDR cursor
> in baby child pidns.

I agree with you about changing the order. That will make
the code simpler and in the second loop actually conforming C code,
and fix the immediate problem.

I was worrying about the case where the mount of the proc filesystem
fails, but we call disable_pid_allocation in that case. So we won't try
to allocate a pid again. That seems better than any path where we might
have to reset the allocation state.

The nasty thing is that the pid bitmap+hashtable code also did not set
the pointer back and we did not have any problems. AKA the bug is not
new.

Which means with the new code allocating pid numbers is failing much
more often when allocating pids than the old code ever did. That is not
good.

Is it perhaps the GFP_ATOMIC in a context where we could otherwise
sleep?

Eric

2017-12-22 16:11:29

by Alexey Dobriyan

[permalink] [raw]

Subject: [TEST PATCH] pid: fix allocating pid 2 for init (was Re: proc_flush_task oops)

On Fri, Dec 22, 2017 at 08:41:54AM -0600, Eric W. Biederman wrote:
> Alexey Dobriyan <[email protected]> writes:

> > unshare
> > fork
> > alloc_pid in level 1 succeeds
> > alloc_pid in level 0 fails, ->idr_next is 2
> > fork
> > alloc pid 2
> > exit
> >
> > Reliable reproducer and fail injection patch attached
> >
> > I'd say proper fix is allocating pids in the opposite order
> > so that failure in the last layer doesn't move IDR cursor
> > in baby child pidns.
>
> I agree with you about changing the order. That will make
> the code simpler and in the second loop actually conforming C code,
> and fix the immediate problem.

Something like that (barely tested)

---

include/linux/pid_namespace.h | 2 ++
kernel/pid.c | 15 +++++++++++----
kernel/pid_namespace.c | 3 ---
3 files changed, 13 insertions(+), 7 deletions(-)

--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -28,6 +28,8 @@ struct pid_namespace {
unsigned int pid_allocated;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
unsigned int level;
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,6 +146,7 @@ void free_pid(struct pid *pid)

struct pid *alloc_pid(struct pid_namespace *ns)
{
+ struct pid_namespace *pid_ns[MAX_PID_NS_LEVEL + 1];
struct pid *pid;
enum pid_type type;
int i, nr;
@@ -157,12 +158,19 @@ struct pid *alloc_pid(struct pid_namespace *ns)
if (!pid)
return ERR_PTR(retval);

- tmp = ns;
pid->level = ns->level;

- for (i = ns->level; i >= 0; i--) {
+ tmp = ns;
+ do {
+ pid_ns[tmp->level] = tmp;
+ tmp = tmp->parent;
+ } while (tmp->level != 0);
+
+ for (i = 0; i <= ns->level; i++) {
int pid_min = 1;

+ tmp = pid_ns[i];
+
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);

@@ -189,7 +197,6 @@ struct pid *alloc_pid(struct pid_namespace *ns)

pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
- tmp = tmp->parent;
}

if (unlikely(is_child_reaper(pid))) {
@@ -223,7 +230,7 @@ struct pid *alloc_pid(struct pid_namespace *ns)

out_free:
spin_lock_irq(&pidmap_lock);
- while (++i <= ns->level)
+ while (--i >= 0)
idr_remove(&ns->idr, (pid->numbers + i)->nr);

spin_unlock_irq(&pidmap_lock);
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -80,9 +80,6 @@ static void proc_cleanup_work(struct work_struct *work)
pid_ns_release_proc(ns);
}

-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
-
static struct ucounts *inc_pid_namespaces(struct user_namespace *ns)
{
return inc_ucount(ns, current_euid(), UCOUNT_PID_NAMESPACES);

2017-12-24 03:12:49

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [TEST PATCH] pid: fix allocating pid 2 for init (was Re: proc_flush_task oops)

Alexey Dobriyan <[email protected]> writes:

> On Fri, Dec 22, 2017 at 08:41:54AM -0600, Eric W. Biederman wrote:
>> Alexey Dobriyan <[email protected]> writes:
>
>> > unshare
>> > fork
>> > alloc_pid in level 1 succeeds
>> > alloc_pid in level 0 fails, ->idr_next is 2
>> > fork
>> > alloc pid 2
>> > exit
>> >
>> > Reliable reproducer and fail injection patch attached
>> >
>> > I'd say proper fix is allocating pids in the opposite order
>> > so that failure in the last layer doesn't move IDR cursor
>> > in baby child pidns.
>>
>> I agree with you about changing the order. That will make
>> the code simpler and in the second loop actually conforming C code,
>> and fix the immediate problem.
>
> Something like that (barely tested)

I have thought about this some more and I think we can do better.

I don't like the on stack pid_ns array.

The only reason the code calls disable_pid_allocation is that we
don't handle this error.

The semantics of least surprise, are that if we run out of resources
while allocating something, trying again when more resources are
available will make it work.

So it looks like handling the error will improve the quality of the
implemenation, and be a simpler, less dangerous patch.

diff --git a/kernel/pid.c b/kernel/pid.c
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -226,6 +224,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
while (++i <= ns->level)
idr_remove(&ns->idr, (pid->numbers + i)->nr);

+ /* On failure to allocate the first pid, reset the state */
+ if (ns->pid_allocated == PIDNS_ADDING)
+ idr_set_cursor(&ns->idr, 0);
+
spin_unlock_irq(&pidmap_lock);

kmem_cache_free(ns->pid_cachep, pid);

Eric

2017-12-24 03:16:35

by Eric W. Biederman

[permalink] [raw]

Subject: [PATCH] pid: Handle failure to allocate the first pid in a pid namespace

With the replacement of the pid bitmap and hashtable with an idr in
alloc_pid started occassionally failing when allocating the first pid
in a pid namespace. Things were not completely reset resulting in
the first allocated pid getting the number 2 (not 1). Which
further resulted in ns->proc_mnt not getting set and eventually
causing an oops in proc_flush_task.

Oops: 0000 [#1] SMP
CPU: 2 PID: 6743 Comm: trinity-c117 Not tainted 4.15.0-rc4-think+ #2
RIP: 0010:proc_flush_task+0x8e/0x1b0
RSP: 0018:ffffc9000bbffc40 EFLAGS: 00010286
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 00000000fffffffb
RDX: 0000000000000000 RSI: ffffc9000bbffc50 RDI: 0000000000000000
RBP: ffffc9000bbffc63 R08: 0000000000000000 R09: 0000000000000002
R10: ffffc9000bbffb70 R11: ffffc9000bbffc64 R12: 0000000000000003
R13: 0000000000000000 R14: 0000000000000003 R15: ffff8804c10d7840
FS: 00007f7cb8965700(0000) GS:ffff88050a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000003e21ae003 CR4: 00000000001606e0
DR0: 00007fb1d6c22000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
Call Trace:
? release_task+0xaf/0x680
release_task+0xd2/0x680
? wait_consider_task+0xb82/0xce0
wait_consider_task+0xbe9/0xce0
? do_wait+0xe1/0x330
do_wait+0x151/0x330
kernel_wait4+0x8d/0x150
? task_stopped_code+0x50/0x50
SYSC_wait4+0x95/0xa0
? rcu_read_lock_sched_held+0x6c/0x80
? syscall_trace_enter+0x2d7/0x340
? do_syscall_64+0x60/0x210
do_syscall_64+0x60/0x210
entry_SYSCALL64_slow_path+0x25/0x25
RIP: 0033:0x7f7cb82603aa
RSP: 002b:00007ffd60770bc8 EFLAGS: 00000246
ORIG_RAX: 000000000000003d
RAX: ffffffffffffffda RBX: 00007f7cb6cd4000 RCX: 00007f7cb82603aa
RDX: 000000000000000b RSI: 00007ffd60770bd0 RDI: 0000000000007cca
RBP: 0000000000007cca R08: 00007f7cb8965700 R09: 00007ffd607c7080
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd60770bd0 R14: 00007f7cb6cd4058 R15: 00000000cccccccd
Code: c1 e2 04 44 8b 60 30 48 8b 40 38 44 8b 34 11 48 c7 c2 60 3a f5 81 44 89 e1 4c 8b 68 58 e8 4b b4 77 00 89 44 24 14 48 8d 74 24 10 <49> 8b 7d 00 e8 b9 6a f9 ff 48 85 c0 74 1a 48 89 c7 48 89 44 24
RIP: proc_flush_task+0x8e/0x1b0 RSP: ffffc9000bbffc40
CR2: 0000000000000000
---[ end trace 53d67a6481059862 ]---

Improve the quality of the implementation by resetting the place to
start allocating pids on failure to allocate the first pid.

As improving the quality of the implementation is the goal remove the now
unnecesarry disable_pid_allocations call when we fail to mount proc.

Fixes: 95846ecf9dac ("pid: replace pid bitmap implementation with IDR API")
Fixes: 8ef047aaaeb8 ("pid namespaces: make alloc_pid(), free_pid() and put_pid() work with struct upid")
Reported-by: Dave Jones <[email protected]>
Signed-off-by: "Eric W. Biederman" <[email protected]>
---

I have tested this by forcing an allocation failure after allocating pid
1, and the symptoms are identical to what Dave Jones has observed.

I have further confirmed that this change causes the failure to go away.

If no one objects I will send a pull request to Linus in the next day or two.

kernel/pid.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..1e8bb6550ec4 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -193,10 +193,8 @@ struct pid *alloc_pid(struct pid_namespace *ns)
}

if (unlikely(is_child_reaper(pid))) {
- if (pid_ns_prepare_proc(ns)) {
- disable_pid_allocation(ns);
+ if (pid_ns_prepare_proc(ns))
goto out_free;
- }
}

get_pid_ns(ns);
@@ -226,6 +224,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
while (++i <= ns->level)
idr_remove(&ns->idr, (pid->numbers + i)->nr);

+ /* On failure to allocate the first pid, reset the state */
+ if (ns->pid_allocated == PIDNS_ADDING)
+ idr_set_cursor(&ns->idr, 0);
+
spin_unlock_irq(&pidmap_lock);

kmem_cache_free(ns->pid_cachep, pid);
--
2.14.1