2015-08-07 17:09:40

by Herton R. Krzesinski

[permalink] [raw]
Subject: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

The current semaphore code allows a potential use after free: in exit_sem we may
free the task's sem_undo_list while there is still another task looping through
the same semaphore set and cleaning the sem_undo list at freeary function (the
task called IPC_RMID for the same semaphore set).

For example, with a test program [1] running which keeps forking a lot of processes
(which then do a semop call with SEM_UNDO flag), and with the parent right after
removing the semaphore set with IPC_RMID, and a kernel built with CONFIG_SLAB,
CONFIG_SLAB_DEBUG and CONFIG_DEBUG_SPINLOCK, you can easily see something like
the following in the kernel log:

[ 876.146023] Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64
[ 876.146548] 000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk
[ 876.146935] 010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........
[ 876.147313] Prev obj: start=ffff88003b45c180, len=64
[ 876.147686] 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
[ 876.147938] 010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7....
[ 876.148161] Next obj: start=ffff88003b45c200, len=64
[ 876.148401] 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ
[ 876.148759] 010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<....
[ 1947.014490] BUG: spinlock wrong CPU on CPU#2, test/18028
[ 1947.014944] general protection fault: 0000 [#1] SMP
[ 1947.015004] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[ 1947.015004] CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1
[ 1947.015004] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[ 1947.015004] task: ffff88003bf98080 ti: ffff88003750c000 task.ti: ffff88003750c000
[ 1947.015004] RIP: 0010:[<ffffffff810d6053>] [<ffffffff810d6053>] spin_dump+0x53/0xc0
[ 1947.015004] RSP: 0018:ffff88003750fd68 EFLAGS: 00010202
[ 1947.015004] RAX: 000000000000002c RBX: ffff88003cb5be08 RCX: 0000000000000000
[ 1947.015004] RDX: ffff88003fd11518 RSI: ffff88003fd0eb68 RDI: ffff88003fd0eb68
[ 1947.015004] RBP: ffff88003750fd78 R08: 0000000000000000 R09: 000000006b6b6b6b
[ 1947.015004] R10: 000000000000026c R11: 0000000000008000 R12: 6b6b6b6b6b6b6b6b
[ 1947.015004] R13: ffff88003b25f040 R14: ffff88003ad1dcc0 R15: ffff88003b25f040
[ 1947.015004] FS: 00007fda0d30a700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[ 1947.015004] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1947.015004] CR2: 00007fda0d105dd4 CR3: 000000003acac000 CR4: 00000000000406e0
[ 1947.015004] Stack:
[ 1947.015004] ffff88003cb5be08 ffffffff81a4f3b9 ffff88003750fd98 ffffffff810d60f0
[ 1947.015004] ffff88003cb5be08 ffff88003b25f0b8 ffff88003750fdb8 ffffffff810d6171
[ 1947.015004] ffff88003b25f040 ffff88003b25f090 ffff88003750fdc8 ffffffff8175d28e
[ 1947.015004] Call Trace:
[ 1947.015004] [<ffffffff810d60f0>] spin_bug+0x30/0x40
[ 1947.015004] [<ffffffff810d6171>] do_raw_spin_unlock+0x71/0xa0
[ 1947.015004] [<ffffffff8175d28e>] _raw_spin_unlock+0xe/0x10
[ 1947.015004] [<ffffffff812f4472>] freeary+0x82/0x2a0
[ 1947.015004] [<ffffffff8175d30e>] ? _raw_spin_lock+0xe/0x10
[ 1947.015004] [<ffffffff812f5d8e>] semctl_down.clone.0+0xce/0x160
[ 1947.015004] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 1947.015004] [<ffffffff8112e588>] ? __audit_syscall_entry+0xa8/0x100
[ 1947.015004] [<ffffffff812f6056>] SyS_semctl+0x236/0x2c0
[ 1947.015004] [<ffffffff810215ce>] ? syscall_trace_leave+0xde/0x130
[ 1947.015004] [<ffffffff8175d52e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 1947.015004] Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
[ 1947.015004] RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0
[ 1947.015004] RSP <ffff88003750fd68>
[ 1947.045328] ---[ end trace 783ebb76612867a0 ]---
[ 1972.023009] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053]
[ 1972.023011] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[ 1972.023011] CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1
[ 1972.023011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[ 1972.023011] task: ffff88003caaa2c0 ti: ffff88003d2f4000 task.ti: ffff88003d2f4000
[ 1972.023011] RIP: 0010:[<ffffffff8101cb30>] [<ffffffff8101cb30>] native_read_tsc+0x0/0x20
[ 1972.023011] RSP: 0018:ffff88003d2f7c10 EFLAGS: 00000202
[ 1972.023011] RAX: 00000539305fc462 RBX: 00007ffd046999d0 RCX: 00000000305fc462
[ 1972.023011] RDX: 0000000000000539 RSI: 0000000000000000 RDI: 0000000000000001
[ 1972.023011] RBP: ffff88003d2f7c38 R08: 0000000000000000 R09: ffff8800370c0e40
[ 1972.023011] R10: 00007ffd046999d0 R11: 000000000000000c R12: 8000000021a12065
[ 1972.023011] R13: ffff88003d2f7c08 R14: ffffffff8106a938 R15: ffff88003d2f7b98
[ 1972.023011] FS: 00007fda0d30a700(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000
[ 1972.023011] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1972.023011] CR2: 00007fda0ce60fd0 CR3: 000000003d359000 CR4: 00000000000406e0
[ 1972.023011] Stack:
[ 1972.023011] Call Trace:
[ 1972.023011] [<ffffffff8138bbc0>] ? delay_tsc+0x40/0x70
[ 1972.023011] [<ffffffff8138bb3f>] __delay+0xf/0x20
[ 1972.023011] [<ffffffff810d6326>] do_raw_spin_lock+0x96/0x140
[ 1972.023011] [<ffffffff8175d30e>] _raw_spin_lock+0xe/0x10
[ 1972.023011] [<ffffffff812f4b11>] sem_lock_and_putref+0x11/0x70
[ 1972.023011] [<ffffffff812f583f>] SYSC_semtimedop+0x7bf/0x960
[ 1972.023011] [<ffffffff811bec56>] ? handle_mm_fault+0xbf6/0x1880
[ 1972.023011] [<ffffffff810c8429>] ? dequeue_task_fair+0x79/0x4a0
[ 1972.023011] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 1972.023011] [<ffffffff811e47e6>] ? kfree_debugcheck+0x16/0x40
[ 1972.023011] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 1972.023011] [<ffffffff8112e588>] ? __audit_syscall_entry+0xa8/0x100
[ 1972.023011] [<ffffffff81021686>] ? do_audit_syscall_entry+0x66/0x70
[ 1972.023011] [<ffffffff810217c9>] ? syscall_trace_enter_phase1+0x139/0x160
[ 1972.023011] [<ffffffff812f59ee>] SyS_semtimedop+0xe/0x10
[ 1972.023011] [<ffffffff812f3570>] SyS_semop+0x10/0x20
[ 1972.023011] [<ffffffff8175d52e>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 1972.023011] Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9
[ 1972.023011] Kernel panic - not syncing: softlockup: hung tasks

I wasn't able to trigger any badness on a recent kernel without the proper
config debugs enabled, however I have softlockup reports on some kernel versions,
in the semaphore code, which are similar as above (the scenario is seen on some
servers running IBM DB2 which uses semaphore syscalls).

The patch here tries to handle two issues that can be seen on exit_sem: first
we could exit the loop earlier even if we still have list_proc items to go
through (eg. we race with freeary that sets un->semid to -1). The second is that
since sem_undo is protected by RCU etc., I expect we need to call synchonize_rcu()
before leaving the loop, to make sure we wait enough before freeing the
sem_undo_list (ulp) of the task.

After the patch I'm unable to reproduce the problem using the test case [1].

[1] Test case used below:

int sid[NSET];

void thread()
{
struct sembuf op;
int s;
uid_t pid = getuid();

s = rand() % NSET;
op.sem_num = pid % NSEM;
op.sem_op = 1;
op.sem_flg = SEM_UNDO;

semop(sid[s], &op, 1);
exit(EXIT_SUCCESS);
}

void create_set()
{
int i, j;
pid_t p;
union {
int val;
struct semid_ds *buf;
unsigned short int *array;
struct seminfo *__buf;
} un;

/* Create and initialize semaphore set */
for (i = 0; i < NSET; i++) {
sid[i] = semget(IPC_PRIVATE , NSEM, 0644 | IPC_CREAT);
if (sid[i] < 0) {
perror("semget");
exit(EXIT_FAILURE);
}
}
un.val = 0;
for (i = 0; i < NSET; i++) {
for (j = 0; j < NSEM; j++) {
if (semctl(sid[i], j, SETVAL, un) < 0)
perror("semctl");
}
}

/* Launch threads that operate on semaphore set */
for (i = 0; i < NSEM * NSET * NSET; i++) {
p = fork();
if (p < 0)
perror("fork");
if (p == 0)
thread();
}

/* Free semaphore set */
for (i = 0; i < NSET; i++) {
if (semctl(sid[i], NSEM, IPC_RMID))
perror("IPC_RMID");
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}

int main(int argc, char **argv)
{
pid_t p;

srand(time(NULL));

while (1) {
p = fork();
if (p < 0) {
perror("fork");
exit(EXIT_FAILURE);
}
if (p == 0) {
create_set();
goto end;
}

/* Wait for forked processes to exit */
while (wait(NULL)) {
if (errno == ECHILD)
break;
};
}
end:
return 0;
}

Signed-off-by: Herton R. Krzesinski <[email protected]>
---
ipc/sem.c | 24 ++++++++++++++++--------
1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/ipc/sem.c b/ipc/sem.c
index bc3d530..35ccddd 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2074,17 +2074,24 @@ void exit_sem(struct task_struct *tsk)
rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
struct sem_undo, list_proc);
- if (&un->list_proc == &ulp->list_proc)
- semid = -1;
- else
- semid = un->semid;
+ if (&un->list_proc == &ulp->list_proc) {
+ rcu_read_unlock();
+ /* Make sure we wait for any place still referencing
+ * the current ulp to finish */
+ synchronize_rcu();
+ break;
+ }
+ spin_lock(&ulp->lock);
+ semid = un->semid;
+ spin_unlock(&ulp->lock);

+ /* exit_sem raced with IPC_RMID, nothing to do */
if (semid == -1) {
rcu_read_unlock();
- break;
+ continue;
}

- sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
+ sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma)) {
rcu_read_unlock();
@@ -2112,9 +2119,10 @@ void exit_sem(struct task_struct *tsk)
ipc_assert_locked_object(&sma->sem_perm);
list_del(&un->list_id);

- spin_lock(&ulp->lock);
+ /* we should be the last process using this ulp, so no need
+ * to acquire ulp->lock here; we are also protected against
+ * IPC_RMID as we hold sma->sem_perm.lock */
list_del_rcu(&un->list_proc);
- spin_unlock(&ulp->lock);

/* perform adjustments registered in un */
for (i = 0; i < sma->sem_nsems; i++) {
--
2.4.3


2015-08-07 19:31:00

by Aristeu Rozanski

[permalink] [raw]
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

On Fri, Aug 07, 2015 at 02:09:35PM -0300, Herton R. Krzesinski wrote:
> The current semaphore code allows a potential use after free: in exit_sem we may
> free the task's sem_undo_list while there is still another task looping through
> the same semaphore set and cleaning the sem_undo list at freeary function (the
> task called IPC_RMID for the same semaphore set).
>
> For example, with a test program [1] running which keeps forking a lot of processes
> (which then do a semop call with SEM_UNDO flag), and with the parent right after
> removing the semaphore set with IPC_RMID, and a kernel built with CONFIG_SLAB,
> CONFIG_SLAB_DEBUG and CONFIG_DEBUG_SPINLOCK, you can easily see something like
> the following in the kernel log:
(snip)

> Signed-off-by: Herton R. Krzesinski <[email protected]>
> ---
> ipc/sem.c | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/ipc/sem.c b/ipc/sem.c
> index bc3d530..35ccddd 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -2074,17 +2074,24 @@ void exit_sem(struct task_struct *tsk)
> rcu_read_lock();
> un = list_entry_rcu(ulp->list_proc.next,
> struct sem_undo, list_proc);
> - if (&un->list_proc == &ulp->list_proc)
> - semid = -1;
> - else
> - semid = un->semid;
> + if (&un->list_proc == &ulp->list_proc) {
> + rcu_read_unlock();
> + /* Make sure we wait for any place still referencing
> + * the current ulp to finish */
> + synchronize_rcu();
> + break;
> + }
> + spin_lock(&ulp->lock);
> + semid = un->semid;
> + spin_unlock(&ulp->lock);
>
> + /* exit_sem raced with IPC_RMID, nothing to do */
> if (semid == -1) {
> rcu_read_unlock();
> - break;
> + continue;
> }
>
> - sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
> + sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
> /* exit_sem raced with IPC_RMID, nothing to do */
> if (IS_ERR(sma)) {
> rcu_read_unlock();
> @@ -2112,9 +2119,10 @@ void exit_sem(struct task_struct *tsk)
> ipc_assert_locked_object(&sma->sem_perm);
> list_del(&un->list_id);
>
> - spin_lock(&ulp->lock);
> + /* we should be the last process using this ulp, so no need
> + * to acquire ulp->lock here; we are also protected against
> + * IPC_RMID as we hold sma->sem_perm.lock */
> list_del_rcu(&un->list_proc);
> - spin_unlock(&ulp->lock);
>
> /* perform adjustments registered in un */
> for (i = 0; i < sma->sem_nsems; i++) {

I was debugging the same issue and can confirm this fix works and makes
sense.

Acked-by: Aristeu Rozanski <[email protected]>

--
Aristeu

2015-08-09 17:49:15

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

Hi Herton,

On 08/07/2015 07:09 PM, Herton R. Krzesinski wrote:
> The current semaphore code allows a potential use after free: in exit_sem we may
> free the task's sem_undo_list while there is still another task looping through
> the same semaphore set and cleaning the sem_undo list at freeary function (the
> task called IPC_RMID for the same semaphore set).
Correct, good catch!

semid==-1 can happen due to two reasons:
a) end of sem_undo_list (i.e.: last undo structure in CLONE_SYSVSEM
group)
b) parallel IPC_RMID

If semid==-1 happens due to a parallel IPC_RMID, then exit_sem does not
free all sem_undo structures that belong to the current CLONE_SYSVSEM group.
But it does free the sem_undo_list structure.

Since:
- struct sem_undo contains a link to struct sem_undo_list.
- struct sem_undo_list is kfreed immediately at the end of exit_sem()
- the parallel IPC_RMID will find the sem_undo structure, then follow
the link to sem_undo_list to unlink it
-> use after free, spinlock debug errors because spinlock
was already overwritten by slab debug.

(what makes it worse: un->semid is read twice, without synchronization.
It should be read once, with synchronization)

> Signed-off-by: Herton R. Krzesinski<[email protected]>
I would add: Cc: <[email protected]>
> ---
> ipc/sem.c | 24 ++++++++++++++++--------
> 1 file changed, 16 insertions(+), 8 deletions(-)
>
> diff --git a/ipc/sem.c b/ipc/sem.c
> index bc3d530..35ccddd 100644
> --- a/ipc/sem.c
> +++ b/ipc/sem.c
> @@ -2074,17 +2074,24 @@ void exit_sem(struct task_struct *tsk)
> rcu_read_lock();
> un = list_entry_rcu(ulp->list_proc.next,
> struct sem_undo, list_proc);
> - if (&un->list_proc == &ulp->list_proc)
> - semid = -1;
> - else
> - semid = un->semid;
> + if (&un->list_proc == &ulp->list_proc) {
> + rcu_read_unlock();
> + /* Make sure we wait for any place still referencing
> + * the current ulp to finish */
> + synchronize_rcu();
Sorry, no. synchronize_rcu() is a high-latency operation.
We can't call it within exit_sem(). We could use kfree_rcu(), but I
don't see that we need it:

Which race do you imagine?
ulp is accessed by:
- freeary(). Race impossible due to explicit locking.
- exit_sem(). Race impossible due to ulp->refcount
- find_alloc_undo(). Race impossible, because it operates on
current->sysvsem.undo_list.
"current" is in do_exit, thus can't be inside semtimedop().

> + break;
> + }
> + spin_lock(&ulp->lock);
> + semid = un->semid;
> + spin_unlock(&ulp->lock);
Ok/good.
Note (I've tried it first):
Just "READ_ONCE(un->semid)" would be insufficient, because this can happen:
A: thread 1, within freeary:
A: spin_lock(&ulp->lock);
A: un->semid = -1;
B: thread 2, within exit_sem():
B: if (un->semid == -1) exit;
B: kfree(ulp);
A: spin_unlock(&ulp->lock); <<<< use-after-free, bug

>
> + /* exit_sem raced with IPC_RMID, nothing to do */
> if (semid == -1) {
> rcu_read_unlock();
> - break;
> + continue;
> }
>
> - sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
> + sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
> /* exit_sem raced with IPC_RMID, nothing to do */
> if (IS_ERR(sma)) {
> rcu_read_unlock();
Ok.
> @@ -2112,9 +2119,10 @@ void exit_sem(struct task_struct *tsk)
> ipc_assert_locked_object(&sma->sem_perm);
> list_del(&un->list_id);
>
> - spin_lock(&ulp->lock);
> + /* we should be the last process using this ulp, so no need
> + * to acquire ulp->lock here; we are also protected against
> + * IPC_RMID as we hold sma->sem_perm.lock */
> list_del_rcu(&un->list_proc);
> - spin_unlock(&ulp->lock);
>
> /* perform adjustments registered in un */
> for (i = 0; i < sma->sem_nsems; i++) {
a) "we should be the last" or "we are the last"?
b) The bug that you have found is probably old, thus it must go into the
stable kernels as well.
I would not do this change together with the bugfix.

Perhaps make two patches? One cc stable, the other one without cc stable.
--
Manfred

2015-08-10 15:31:55

by Herton R. Krzesinski

[permalink] [raw]
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

On Sun, Aug 09, 2015 at 07:49:08PM +0200, Manfred Spraul wrote:
> Hi Herton,
>
(snip)
> >+++ b/ipc/sem.c
> >@@ -2074,17 +2074,24 @@ void exit_sem(struct task_struct *tsk)
> > rcu_read_lock();
> > un = list_entry_rcu(ulp->list_proc.next,
> > struct sem_undo, list_proc);
> >- if (&un->list_proc == &ulp->list_proc)
> >- semid = -1;
> >- else
> >- semid = un->semid;
> >+ if (&un->list_proc == &ulp->list_proc) {
> >+ rcu_read_unlock();
> >+ /* Make sure we wait for any place still referencing
> >+ * the current ulp to finish */
> >+ synchronize_rcu();
> Sorry, no. synchronize_rcu() is a high-latency operation.
> We can't call it within exit_sem(). We could use kfree_rcu(), but I don't
> see that we need it:
>
> Which race do you imagine?
> ulp is accessed by:
> - freeary(). Race impossible due to explicit locking.
> - exit_sem(). Race impossible due to ulp->refcount
> - find_alloc_undo(). Race impossible, because it operates on
> current->sysvsem.undo_list.
> "current" is in do_exit, thus can't be inside semtimedop().

Well without the synchronize_rcu() and with the semid list loop fix I was still
able to get issues, and I thought the problem is related to racing with IPC_RMID
on freeary again. This is one scenario I would imagine:

A B

freeary()
list_del(&un->list_id)
spin_lock(&un->ulp->lock)
un->semid = -1
list_del_rcu(&un->list_proc)
__list_del_entry(&un->list_proc)
__list_del(entry->prev, entry->next) exit_sem()
next->prev = prev; ...
prev->next = next; ...
... un = list_entry_rcu(ulp->list_proc.next...)
(&un->list_proc)->prev = LIST_POISON2 if (&un->list_proc == &ulp->list_proc) <true, last un removed by thread A>
... kfree(ulp)
spin_unlock(&un->ulp->lock) <---- bug

Now that is a very tight window, but I had problems even when I tried this patch
first:

(...)
- if (&un->list_proc == &ulp->list_proc)
- semid = -1;
- else
- semid = un->semid;
+ if (&un->list_proc == &ulp->list_proc) {
+ rcu_read_unlock();
+ break;
+ }
+ spin_lock(&ulp->lock);
+ semid = un->semid;
+ spin_unlock(&ulp->lock);

+ /* exit_sem raced with IPC_RMID, nothing to do */
if (semid == -1) {
rcu_read_unlock();
- break;
+ synchronize_rcu();
+ continue;
}
(...)

So even with the bad/uneeded synchronize_rcu() which I had placed there, I could
still get issues (however the testing on patch above was on an older kernel than
latest upstream, from RHEL 6, I can test without synchronize_rcu() on latest
upstream, however the affected code is the same). That's when I thought of
scenario above. I was able to get this oops:

<0>BUG: spinlock bad magic on CPU#3, test/5427 (Not tainted)
<4>general protection fault: 0000 [#1] SMP
...
<4>Pid: 5427, comm: test Not tainted 2.6.32-573.el6.1233300.2.x86_64.debug #1 QEMU Standard PC (i440FX + PIIX, 1996)
<4>RIP: 0010:[<ffffffff812c8291>] [<ffffffff812c8291>] spin_bug+0x81/0x100
...
<4>Call Trace:
<4> [<ffffffff812c8375>] _raw_spin_unlock+0x65/0xa0
<4> [<ffffffff8157087b>] _spin_unlock+0x2b/0x40
<4> [<ffffffff81241a66>] freeary+0x96/0x250
<4> [<ffffffff81570a72>] ? _spin_lock+0x62/0x70
<4> [<ffffffff81241cad>] semctl_down.clone.0+0x8d/0x130
<4> [<ffffffff810af1a8>] ? sched_clock_cpu+0xb8/0x110
<4> [<ffffffff810bdb3d>] ? trace_hardirqs_off+0xd/0x10
<4> [<ffffffff810af2ef>] ? cpu_clock+0x6f/0x80
<4> [<ffffffff810c0c5d>] ? lock_release_holdtime+0x3d/0x190
<4> [<ffffffff81243a42>] sys_semctl+0x2a2/0x300
<4> [<ffffffff815702d2>] ? trace_hardirqs_on_thunk+0x3a/0x3f
<4> [<ffffffff8100b0d2>] system_call_fastpath+0x16/0x1b

May be below spin_lock(ulp->lock) could be acquired earlier instead before
checking list_proc being empty? Then this would avoid the need for
synchronize_rcu(), which was my initial idea to avoid the problem.

>
> >+ break;
> >+ }
> >+ spin_lock(&ulp->lock);
> >+ semid = un->semid;
> >+ spin_unlock(&ulp->lock);
> Ok/good.
> Note (I've tried it first):
> Just "READ_ONCE(un->semid)" would be insufficient, because this can happen:
> A: thread 1, within freeary:
> A: spin_lock(&ulp->lock);
> A: un->semid = -1;
> B: thread 2, within exit_sem():
> B: if (un->semid == -1) exit;
> B: kfree(ulp);
> A: spin_unlock(&ulp->lock); <<<< use-after-free, bug
>
> >+ /* exit_sem raced with IPC_RMID, nothing to do */
> > if (semid == -1) {
> > rcu_read_unlock();
> >- break;
> >+ continue;
> > }
> >- sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
> >+ sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
> > /* exit_sem raced with IPC_RMID, nothing to do */
> > if (IS_ERR(sma)) {
> > rcu_read_unlock();
> Ok.
> >@@ -2112,9 +2119,10 @@ void exit_sem(struct task_struct *tsk)
> > ipc_assert_locked_object(&sma->sem_perm);
> > list_del(&un->list_id);
> >- spin_lock(&ulp->lock);
> >+ /* we should be the last process using this ulp, so no need
> >+ * to acquire ulp->lock here; we are also protected against
> >+ * IPC_RMID as we hold sma->sem_perm.lock */
> > list_del_rcu(&un->list_proc);
> >- spin_unlock(&ulp->lock);
> > /* perform adjustments registered in un */
> > for (i = 0; i < sma->sem_nsems; i++) {
> a) "we should be the last" or "we are the last"?

I'll change the comment, as "we are the last".

> b) The bug that you have found is probably old, thus it must go into the
> stable kernels as well.
> I would not do this change together with the bugfix.
>
> Perhaps make two patches? One cc stable, the other one without cc stable.

Ok, I'll separate the changes.

Thanks for reviewing.

> --
> Manfred

--
[]'s
Herton

2015-08-10 19:02:17

by Manfred Spraul

[permalink] [raw]
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

Hi Herton,

On 08/10/2015 05:31 PM, Herton R. Krzesinski wrote:
> Well without the synchronize_rcu() and with the semid list loop fix I was still
> able to get issues, and I thought the problem is related to racing with IPC_RMID
> on freeary again. This is one scenario I would imagine:
>
> A B
>
> freeary()
> list_del(&un->list_id)
> spin_lock(&un->ulp->lock)
> un->semid = -1
> list_del_rcu(&un->list_proc)
> __list_del_entry(&un->list_proc)
> __list_del(entry->prev, entry->next) exit_sem()
> next->prev = prev; ...
> prev->next = next; ...
> ... un = list_entry_rcu(ulp->list_proc.next...)
> (&un->list_proc)->prev = LIST_POISON2 if (&un->list_proc == &ulp->list_proc) <true, last un removed by thread A>
> ... kfree(ulp)
> spin_unlock(&un->ulp->lock) <---- bug
>
> Now that is a very tight window, but I had problems even when I tried this patch
> first:
>
> (...)
> - if (&un->list_proc == &ulp->list_proc)
> - semid = -1;
> - else
> - semid = un->semid;
> + if (&un->list_proc == &ulp->list_proc) {
> + rcu_read_unlock();
What about:
+ spin_unlock_wait(&ulp->lock);
> + break;
> + }
> + spin_lock(&ulp->lock);
> + semid = un->semid;
> + spin_unlock(&ulp->lock);
>
> + /* exit_sem raced with IPC_RMID, nothing to do */
> if (semid == -1) {
> rcu_read_unlock();
> - break;
> + synchronize_rcu();
> + continue;
> }
> (...)
>
> So even with the bad/uneeded synchronize_rcu() which I had placed there, I could
> still get issues (however the testing on patch above was on an older kernel than
> latest upstream, from RHEL 6, I can test without synchronize_rcu() on latest
> upstream, however the affected code is the same). That's when I thought of
> scenario above. I was able to get this oops:
Adding sleep() usually help, too. But it is ugly, so let's try to
understand the race and to fix it.

Best regards,
Manfred

2015-08-11 16:48:37

by Herton R. Krzesinski

[permalink] [raw]
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task using same semaphore set exits

On Mon, Aug 10, 2015 at 09:02:11PM +0200, Manfred Spraul wrote:
> Hi Herton,
>
> On 08/10/2015 05:31 PM, Herton R. Krzesinski wrote:
> >Well without the synchronize_rcu() and with the semid list loop fix I was still
> >able to get issues, and I thought the problem is related to racing with IPC_RMID
> >on freeary again. This is one scenario I would imagine:
> >
> > A B
> >
> >freeary()
> > list_del(&un->list_id)
> > spin_lock(&un->ulp->lock)
> > un->semid = -1
> > list_del_rcu(&un->list_proc)
> > __list_del_entry(&un->list_proc)
> > __list_del(entry->prev, entry->next) exit_sem()
> > next->prev = prev; ...
> > prev->next = next; ...
> > ... un = list_entry_rcu(ulp->list_proc.next...)
> > (&un->list_proc)->prev = LIST_POISON2 if (&un->list_proc == &ulp->list_proc) <true, last un removed by thread A>
> > ... kfree(ulp)
> > spin_unlock(&un->ulp->lock) <---- bug
> >
> >Now that is a very tight window, but I had problems even when I tried this patch
> >first:
> >
> >(...)
> >- if (&un->list_proc == &ulp->list_proc)
> >- semid = -1;
> >- else
> >- semid = un->semid;
> >+ if (&un->list_proc == &ulp->list_proc) {
> >+ rcu_read_unlock();
> What about:
> + spin_unlock_wait(&ulp->lock);

spin_unlock_wait works, thanks (both in theory and in practice, I tested).

> >+ break;
> >+ }
> >+ spin_lock(&ulp->lock);
> >+ semid = un->semid;
> >+ spin_unlock(&ulp->lock);
> >
> >+ /* exit_sem raced with IPC_RMID, nothing to do */
> > if (semid == -1) {
> > rcu_read_unlock();
> >- break;
> >+ synchronize_rcu();
> >+ continue;
> > }
> >(...)
> >
> >So even with the bad/uneeded synchronize_rcu() which I had placed there, I could
> >still get issues (however the testing on patch above was on an older kernel than
> >latest upstream, from RHEL 6, I can test without synchronize_rcu() on latest
> >upstream, however the affected code is the same). That's when I thought of
> >scenario above. I was able to get this oops:
> Adding sleep() usually help, too. But it is ugly, so let's try to understand
> the race and to fix it.

Yes, the race should be what I described earlier, and it's understood. I also
now have proof in practice that without synchronization before exit, I'm able to
trigger a crash on latest upstream kernel as well. This is the change I tested
with latest stock 4.2-rc6:

diff --git a/ipc/sem.c b/ipc/sem.c
index bc3d530..3b8b66b 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2074,17 +2074,26 @@ void exit_sem(struct task_struct *tsk)
rcu_read_lock();
un = list_entry_rcu(ulp->list_proc.next,
struct sem_undo, list_proc);
- if (&un->list_proc == &ulp->list_proc)
- semid = -1;
- else
- semid = un->semid;
+ if (&un->list_proc == &ulp->list_proc) {
+ /* we must wait for freeary() before freeing this ulp,
+ * in case we raced with last sem_undo. There is a small
+ * possibility where we exit while freeary() didn't
+ * finish unlocking sem_undo_list */
+ /*spin_unlock_wait(&ulp->lock);*/
+ rcu_read_unlock();
+ break;
+ }
+ spin_lock(&ulp->lock);
+ semid = un->semid;
+ spin_unlock(&ulp->lock);

+ /* exit_sem raced with IPC_RMID, nothing to do */
if (semid == -1) {
rcu_read_unlock();
- break;
+ continue;
}

- sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
+ sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
/* exit_sem raced with IPC_RMID, nothing to do */
if (IS_ERR(sma)) {
rcu_read_unlock();


I commented the synchronization for testing (the spin_unlock_wait call).

Because of the comment/removing the spin_unlock_wait, I'm then able to trigger
the following crash:

[ 587.768363] BUG: spinlock wrong owner on CPU#2, test/419
[ 587.768434] general protection fault: 0000 [#1] SMP
[ 587.768454] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6
table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_rng virtio_balloon virtio_console virtio_net iosf_mbi crct10dif_pclmul c
rc32_pclmul ghash_clmulni_intel pcspkr snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcor
e qxl ttm drm_kms_helper drm crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[ 587.768712] CPU: 2 PID: 419 Comm: test Not tainted 4.2.0-rc6+ #1
[ 587.768732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[ 587.768762] task: ffff88003ad68340 ti: ffff88003a0ec000 task.ti: ffff88003a0ec000
[ 587.768785] RIP: 0010:[<ffffffff810d60b3>] [<ffffffff810d60b3>] spin_dump+0x53/0xc0
[ 587.768818] RSP: 0018:ffff88003a0efd68 EFLAGS: 00010202
[ 587.768836] RAX: 000000000000002c RBX: ffff88003d016f08 RCX: 0000000000000000
[ 587.768859] RDX: ffff88003fd11518 RSI: ffff88003fd0eb68 RDI: ffff88003fd0eb68
[ 587.768882] RBP: ffff88003a0efd78 R08: 0000000000000000 R09: 000000006b6b6b6b
[ 587.768904] R10: 000000000000025f R11: ffffc90000b30020 R12: 6b6b6b6b6b6b6b6b
[ 587.768927] R13: ffff88003a5e5c40 R14: ffff88003d1eb340 R15: ffff88003a5e5c40
[ 587.768949] FS: 00007fc2fcd89700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[ 587.768981] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 587.769002] CR2: 00007fc2fcb84dd4 CR3: 000000003d230000 CR4: 00000000000406e0
[ 587.769004] Stack:
[ 587.769004] ffff88003d016f08 ffffffff81a4f4ad ffff88003a0efd98 ffffffff810d6150
[ 587.769004] ffff88003d016f08 ffff88003a5e5cb8 ffff88003a0efdb8 ffffffff810d61e2
[ 587.769004] ffff88003a5e5c40 ffff88003a5e5c90 ffff88003a0efdc8 ffffffff8175d50e
[ 587.769004] Call Trace:
[ 587.769004] [<ffffffff810d6150>] spin_bug+0x30/0x40
[ 587.769004] [<ffffffff810d61e2>] do_raw_spin_unlock+0x82/0xa0
[ 587.769004] [<ffffffff8175d50e>] _raw_spin_unlock+0xe/0x10
[ 587.769004] [<ffffffff812f45f2>] freeary+0x82/0x2a0
[ 587.769004] [<ffffffff8175d58e>] ? _raw_spin_lock+0xe/0x10
[ 587.769004] [<ffffffff812f5f2e>] semctl_down.clone.0+0xce/0x160
[ 587.769004] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 587.769004] [<ffffffff8112e5e8>] ? __audit_syscall_entry+0xa8/0x100
[ 587.769004] [<ffffffff812f61f6>] SyS_semctl+0x236/0x2c0
[ 587.769004] [<ffffffff810215ce>] ? syscall_trace_leave+0xde/0x130
[ 587.769004] [<ffffffff8175d7ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 587.769004] Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2d a4 81 31 c0 65 8b 15 8b 40 f3 7e e8 21 33 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89
[ 587.769004] RIP [<ffffffff810d60b3>] spin_dump+0x53/0xc0
[ 587.769004] RSP <ffff88003a0efd68>
[ 587.769563] ---[ end trace 844d186084062ba4 ]---
[ 612.022002] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:446]
[ 612.022002] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_rng virtio_balloon virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore qxl ttm drm_kms_helper drm crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[ 612.022002] CPU: 3 PID: 446 Comm: test Tainted: G D 4.2.0-rc6+ #1
[ 612.022002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[ 612.022002] task: ffff88003e3ab040 ti: ffff88003ac20000 task.ti: ffff88003ac20000
[ 612.022002] RIP: 0010:[<ffffffff8101cb36>] [<ffffffff8101cb36>] native_read_tsc+0x6/0x20
[ 612.022002] RSP: 0018:ffff88003ac23c08 EFLAGS: 00000206
[ 612.022002] RAX: 00000000e0df8c7d RBX: 800000002060b065 RCX: 00000000e0df8c5f
[ 612.022002] RDX: 00000000000001a4 RSI: 0000000000000000 RDI: 0000000000000001
[ 612.022002] RBP: ffff88003ac23c08 R08: 0000000000000000 R09: ffff88003aea5cc0
[ 612.022002] R10: 00007ffdf0e89830 R11: 0000000000000008 R12: ffffffff8106a938
[ 612.022002] R13: ffff88003ac23b98 R14: ffff88003a295300 R15: ffff88003a295000
[ 612.022002] FS: 00007fc2fcd89700(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000
[ 612.022002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 612.022002] CR2: 00007fc2fc8dffd0 CR3: 000000003905d000 CR4: 00000000000406e0
[ 612.022002] Stack:
[ 612.022002] ffff88003ac23c38 ffffffff8138bd30 ffff88003a5e5c40 00000000ac762d50
[ 612.022002] 000000003662aefc 0000000000000001 ffff88003ac23c48 ffffffff8138bcaf
[ 612.022002] ffff88003ac23c78 ffffffff810d6386 ffff88003a5e5c40 ffff880037a0f880
[ 612.022002] Call Trace:
[ 612.022002] [<ffffffff8138bd30>] delay_tsc+0x40/0x70
[ 612.022002] [<ffffffff8138bcaf>] __delay+0xf/0x20
[ 612.022002] [<ffffffff810d6386>] do_raw_spin_lock+0x96/0x140
[ 612.022002] [<ffffffff8175d58e>] _raw_spin_lock+0xe/0x10
[ 612.022002] [<ffffffff812f4cb1>] sem_lock_and_putref+0x11/0x70
[ 612.022002] [<ffffffff812f59df>] SYSC_semtimedop+0x7bf/0x960
[ 612.022002] [<ffffffff811becd6>] ? handle_mm_fault+0xbf6/0x1880
[ 612.022002] [<ffffffff8175d50e>] ? _raw_spin_unlock+0xe/0x10
[ 612.022002] [<ffffffff81194e16>] ? free_pcppages_bulk+0x436/0x5a0
[ 612.022002] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 612.022002] [<ffffffff811e4866>] ? kfree_debugcheck+0x16/0x40
[ 612.022002] [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[ 612.022002] [<ffffffff8112e5e8>] ? __audit_syscall_entry+0xa8/0x100
[ 612.022002] [<ffffffff81021686>] ? do_audit_syscall_entry+0x66/0x70
[ 612.022002] [<ffffffff810217c9>] ? syscall_trace_enter_phase1+0x139/0x160
[ 612.022002] [<ffffffff812f5b8e>] SyS_semtimedop+0xe/0x10
[ 612.022002] [<ffffffff812f36f0>] SyS_semop+0x10/0x20
[ 612.022002] [<ffffffff8175d7ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 612.022002] Code: c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 55 48 89 e5 0f 31 <89> c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9 c3 66 2e 0f 1f 84
[ 612.022002] Kernel panic - not syncing: softlockup: hung tasks

Then in another build with the spin_unlock_wait in, I left the test case running
for some hours, I didn't see issues. In theory also I expect there is no more
problems.

I'll submit a new version of the patch, it's same diff as above but without the
comment.

>
> Best regards,
> Manfred

Thanks,
--
[]'s
Herton