Date: Tue, 11 Aug 2015 13:48:33 -0300
From: "Herton R. Krzesinski" <herton@redhat.com>
To: Manfred Spraul <manfred@colorfullife.com>
Cc: linux-kernel@vger.kernel.org, Andrew Morton <akpm@linux-foundation.org>,
        Davidlohr Bueso <dave@stgolabs.net>, Rafael Aquini <aquini@redhat.com>,
        Joe Perches <joe@perches.com>, Aristeu Rozanski <aris@redhat.com>,
        djeffery@redhat.com
Subject: Re: [PATCH] ipc,sem: fix use after free on IPC_RMID after a task
 using same semaphore set exits
Message-ID: <20150811164833.GA29948@dhcppc4.redhat.com>
References: <1438967375-14877-1-git-send-email-herton@redhat.com>
 <55C79294.2010006@colorfullife.com>
 <20150810153147.GA3540@dhcppc4.redhat.com>
 <55C8F533.1090007@colorfullife.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <55C8F533.1090007@colorfullife.com>
User-Agent: Mutt/1.5.23.1-rc1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 12009
Lines: 211

On Mon, Aug 10, 2015 at 09:02:11PM +0200, Manfred Spraul wrote:
> Hi Herton,
> 
> On 08/10/2015 05:31 PM, Herton R. Krzesinski wrote:
> >Well without the synchronize_rcu() and with the semid list loop fix I was still
> >able to get issues, and I thought the problem is related to racing with IPC_RMID
> >on freeary again. This is one scenario I would imagine:
> >
> >                A                                                  B
> >
> >freeary()
> >   list_del(&un->list_id)
> >   spin_lock(&un->ulp->lock)
> >   un->semid = -1
> >   list_del_rcu(&un->list_proc)
> >     __list_del_entry(&un->list_proc)
> >       __list_del(entry->prev, entry->next)      exit_sem()
> >         next->prev = prev;                        ...
> >         prev->next = next;                        ...
> >         ...                                       un = list_entry_rcu(ulp->list_proc.next...)
> >     (&un->list_proc)->prev = LIST_POISON2         if (&un->list_proc == &ulp->list_proc) <true, last un removed by thread A>
> >   ...                                             kfree(ulp)
> >   spin_unlock(&un->ulp->lock) <---- bug
> >
> >Now that is a very tight window, but I had problems even when I tried this patch
> >first:
> >
> >(...)
> >-               if (&un->list_proc == &ulp->list_proc)
> >-                       semid = -1;
> >-                else
> >-                       semid = un->semid;
> >+               if (&un->list_proc == &ulp->list_proc) {
> >+                       rcu_read_unlock();
> What about:
> + spin_unlock_wait(&ulp->lock);

spin_unlock_wait works, thanks (both in theory and in practice, I tested).

> >+                       break;
> >+               }
> >+               spin_lock(&ulp->lock);
> >+               semid = un->semid;
> >+               spin_unlock(&ulp->lock);
> >
> >+               /* exit_sem raced with IPC_RMID, nothing to do */
> >                 if (semid == -1) {
> >                         rcu_read_unlock();
> >-                       break;
> >+                       synchronize_rcu();
> >+                       continue;
> >                 }
> >(...)
> >
> >So even with the bad/uneeded synchronize_rcu() which I had placed there, I could
> >still get issues (however the testing on patch above was on an older kernel than
> >latest upstream, from RHEL 6, I can test without synchronize_rcu() on latest
> >upstream, however the affected code is the same). That's when I thought of
> >scenario above. I was able to get this oops:
> Adding sleep() usually help, too. But it is ugly, so let's try to understand
> the race and to fix it.

Yes, the race should be what I described earlier, and it's understood. I also
now have proof in practice that without synchronization before exit, I'm able to
trigger a crash on latest upstream kernel as well. This is the change I tested
with latest stock 4.2-rc6:

diff --git a/ipc/sem.c b/ipc/sem.c
index bc3d530..3b8b66b 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2074,17 +2074,26 @@ void exit_sem(struct task_struct *tsk)
                rcu_read_lock();
                un = list_entry_rcu(ulp->list_proc.next,
                                    struct sem_undo, list_proc);
-               if (&un->list_proc == &ulp->list_proc)
-                       semid = -1;
-                else
-                       semid = un->semid;
+               if (&un->list_proc == &ulp->list_proc) {
+                       /* we must wait for freeary() before freeing this ulp,
+                        * in case we raced with last sem_undo. There is a small
+                        * possibility where we exit while freeary() didn't
+                        * finish unlocking sem_undo_list */
+                       /*spin_unlock_wait(&ulp->lock);*/
+                       rcu_read_unlock();
+                       break;
+               }
+               spin_lock(&ulp->lock);
+               semid = un->semid;
+               spin_unlock(&ulp->lock);
 
+               /* exit_sem raced with IPC_RMID, nothing to do */
                if (semid == -1) {
                        rcu_read_unlock();
-                       break;
+                       continue;
                }
 
-               sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, un->semid);
+               sma = sem_obtain_object_check(tsk->nsproxy->ipc_ns, semid);
                /* exit_sem raced with IPC_RMID, nothing to do */
                if (IS_ERR(sma)) {
                        rcu_read_unlock();


I commented the synchronization for testing (the spin_unlock_wait call).

Because of the comment/removing the spin_unlock_wait, I'm then able to trigger
the following crash:

[  587.768363] BUG: spinlock wrong owner on CPU#2, test/419
[  587.768434] general protection fault: 0000 [#1] SMP 
[  587.768454] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6
table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_rng virtio_balloon virtio_console virtio_net iosf_mbi crct10dif_pclmul c
rc32_pclmul ghash_clmulni_intel pcspkr snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcor
e qxl ttm drm_kms_helper drm crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[  587.768712] CPU: 2 PID: 419 Comm: test Not tainted 4.2.0-rc6+ #1
[  587.768732] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  587.768762] task: ffff88003ad68340 ti: ffff88003a0ec000 task.ti: ffff88003a0ec000
[  587.768785] RIP: 0010:[<ffffffff810d60b3>]  [<ffffffff810d60b3>] spin_dump+0x53/0xc0
[  587.768818] RSP: 0018:ffff88003a0efd68  EFLAGS: 00010202
[  587.768836] RAX: 000000000000002c RBX: ffff88003d016f08 RCX: 0000000000000000
[  587.768859] RDX: ffff88003fd11518 RSI: ffff88003fd0eb68 RDI: ffff88003fd0eb68
[  587.768882] RBP: ffff88003a0efd78 R08: 0000000000000000 R09: 000000006b6b6b6b
[  587.768904] R10: 000000000000025f R11: ffffc90000b30020 R12: 6b6b6b6b6b6b6b6b
[  587.768927] R13: ffff88003a5e5c40 R14: ffff88003d1eb340 R15: ffff88003a5e5c40
[  587.768949] FS:  00007fc2fcd89700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
[  587.768981] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  587.769002] CR2: 00007fc2fcb84dd4 CR3: 000000003d230000 CR4: 00000000000406e0
[  587.769004] Stack:
[  587.769004]  ffff88003d016f08 ffffffff81a4f4ad ffff88003a0efd98 ffffffff810d6150
[  587.769004]  ffff88003d016f08 ffff88003a5e5cb8 ffff88003a0efdb8 ffffffff810d61e2
[  587.769004]  ffff88003a5e5c40 ffff88003a5e5c90 ffff88003a0efdc8 ffffffff8175d50e
[  587.769004] Call Trace:
[  587.769004]  [<ffffffff810d6150>] spin_bug+0x30/0x40
[  587.769004]  [<ffffffff810d61e2>] do_raw_spin_unlock+0x82/0xa0
[  587.769004]  [<ffffffff8175d50e>] _raw_spin_unlock+0xe/0x10
[  587.769004]  [<ffffffff812f45f2>] freeary+0x82/0x2a0
[  587.769004]  [<ffffffff8175d58e>] ? _raw_spin_lock+0xe/0x10
[  587.769004]  [<ffffffff812f5f2e>] semctl_down.clone.0+0xce/0x160
[  587.769004]  [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[  587.769004]  [<ffffffff8112e5e8>] ? __audit_syscall_entry+0xa8/0x100
[  587.769004]  [<ffffffff812f61f6>] SyS_semctl+0x236/0x2c0
[  587.769004]  [<ffffffff810215ce>] ? syscall_trace_leave+0xde/0x130
[  587.769004]  [<ffffffff8175d7ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  587.769004] Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2d a4 81 31 c0 65 8b 15 8b 40 f3 7e e8 21 33 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89 
[  587.769004] RIP  [<ffffffff810d60b3>] spin_dump+0x53/0xc0
[  587.769004]  RSP <ffff88003a0efd68>
[  587.769563] ---[ end trace 844d186084062ba4 ]---
[  612.022002] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:446]
[  612.022002] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_rng virtio_balloon virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore qxl ttm drm_kms_helper drm crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib]
[  612.022002] CPU: 3 PID: 446 Comm: test Tainted: G      D         4.2.0-rc6+ #1
[  612.022002] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
[  612.022002] task: ffff88003e3ab040 ti: ffff88003ac20000 task.ti: ffff88003ac20000
[  612.022002] RIP: 0010:[<ffffffff8101cb36>]  [<ffffffff8101cb36>] native_read_tsc+0x6/0x20
[  612.022002] RSP: 0018:ffff88003ac23c08  EFLAGS: 00000206
[  612.022002] RAX: 00000000e0df8c7d RBX: 800000002060b065 RCX: 00000000e0df8c5f
[  612.022002] RDX: 00000000000001a4 RSI: 0000000000000000 RDI: 0000000000000001
[  612.022002] RBP: ffff88003ac23c08 R08: 0000000000000000 R09: ffff88003aea5cc0
[  612.022002] R10: 00007ffdf0e89830 R11: 0000000000000008 R12: ffffffff8106a938
[  612.022002] R13: ffff88003ac23b98 R14: ffff88003a295300 R15: ffff88003a295000
[  612.022002] FS:  00007fc2fcd89700(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000
[  612.022002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  612.022002] CR2: 00007fc2fc8dffd0 CR3: 000000003905d000 CR4: 00000000000406e0
[  612.022002] Stack:
[  612.022002]  ffff88003ac23c38 ffffffff8138bd30 ffff88003a5e5c40 00000000ac762d50
[  612.022002]  000000003662aefc 0000000000000001 ffff88003ac23c48 ffffffff8138bcaf
[  612.022002]  ffff88003ac23c78 ffffffff810d6386 ffff88003a5e5c40 ffff880037a0f880
[  612.022002] Call Trace:
[  612.022002]  [<ffffffff8138bd30>] delay_tsc+0x40/0x70
[  612.022002]  [<ffffffff8138bcaf>] __delay+0xf/0x20
[  612.022002]  [<ffffffff810d6386>] do_raw_spin_lock+0x96/0x140
[  612.022002]  [<ffffffff8175d58e>] _raw_spin_lock+0xe/0x10
[  612.022002]  [<ffffffff812f4cb1>] sem_lock_and_putref+0x11/0x70
[  612.022002]  [<ffffffff812f59df>] SYSC_semtimedop+0x7bf/0x960
[  612.022002]  [<ffffffff811becd6>] ? handle_mm_fault+0xbf6/0x1880
[  612.022002]  [<ffffffff8175d50e>] ? _raw_spin_unlock+0xe/0x10
[  612.022002]  [<ffffffff81194e16>] ? free_pcppages_bulk+0x436/0x5a0
[  612.022002]  [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[  612.022002]  [<ffffffff811e4866>] ? kfree_debugcheck+0x16/0x40
[  612.022002]  [<ffffffff810633fa>] ? __do_page_fault+0x19a/0x430
[  612.022002]  [<ffffffff8112e5e8>] ? __audit_syscall_entry+0xa8/0x100
[  612.022002]  [<ffffffff81021686>] ? do_audit_syscall_entry+0x66/0x70
[  612.022002]  [<ffffffff810217c9>] ? syscall_trace_enter_phase1+0x139/0x160
[  612.022002]  [<ffffffff812f5b8e>] SyS_semtimedop+0xe/0x10
[  612.022002]  [<ffffffff812f36f0>] SyS_semop+0x10/0x20
[  612.022002]  [<ffffffff8175d7ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  612.022002] Code: c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 55 48 89 e5 0f 31 <89> c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9 c3 66 2e 0f 1f 84 
[  612.022002] Kernel panic - not syncing: softlockup: hung tasks

Then in another build with the spin_unlock_wait in, I left the test case running
for some hours, I didn't see issues. In theory also I expect there is no more
problems.

I'll submit a new version of the patch, it's same diff as above but without the
comment.

> 
> Best regards,
>     Manfred

Thanks,
-- 
[]'s
Herton
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/