Hello,
syzbot found the following issue on:
HEAD commit: 2a8120d7b482 Merge tag 's390-6.10-2' of git://git.kernel.o..
git tree: upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17d42b0a980000
kernel config: https://syzkaller.appspot.com/x/.config?x=5dd4fde1337a9e18
dashboard link: https://syzkaller.appspot.com/bug?extid=50e25cfa4f917d41749f
compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
userspace arch: i386
Unfortunately, I don't have any reproducer for this issue yet.
Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-2a8120d7.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/78c72ae6bdaf/vmlinux-2a8120d7.xz
kernel image: https://storage.googleapis.com/syzbot-assets/99dbb805b738/bzImage-2a8120d7.xz
IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: [email protected]
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(l->owner)
WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_slab mm/slub.c:3088 [inline]
WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
Modules linked in:
CPU: 3 PID: 5221 Comm: kworker/3:3 Not tainted 6.9.0-syzkaller-10713-g2a8120d7b482 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: slub_flushwq flush_cpu_slab
RIP: 0010:local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
RIP: 0010:flush_slab mm/slub.c:3088 [inline]
RIP: 0010:flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
Code: ff e8 e5 b2 fc 08 e9 25 ff ff ff e8 db b2 fc 08 e9 46 ff ff ff 90 48 c7 c6 7f 68 37 8d 48 c7 c7 b6 33 37 8d e8 72 88 6f ff 90 <0f> 0b 90 90 e9 dd fe ff ff 90 48 c7 c6 88 68 37 8d 48 c7 c7 b6 33
RSP: 0018:ffffc90002b57c98 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffffe8ffac1021b0 RCX: ffffffff81510229
RDX: ffff88801bb7c880 RSI: ffffffff81510236 RDI: 0000000000000001
RBP: ffff88802790b540 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000002d2d2d2d R12: 0000000000000200
R13: ffffe8ffac1021d0 R14: 0000000000000000 R15: ffffc90002b57d80
FS: 0000000000000000(0000) GS:ffff88802c300000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f73ffec2 CR3: 00000000268ca000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
process_one_work+0x958/0x1ad0 kernel/workqueue.c:3231
process_scheduled_works kernel/workqueue.c:3312 [inline]
worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393
kthread+0x2c1/0x3a0 kernel/kthread.c:389
ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
</TASK>
---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at [email protected].
syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title
If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)
If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report
If you want to undo deduplication, reply with:
#syz undup
CC+: mm folks
On 5/23/24 12:36 PM, Thomas Gleixner wrote:
> CC+: mm folks
>
> On Wed, May 22 2024 at 19:27, syzbot wrote:
>
>> ------------[ cut here ]------------
>> DEBUG_LOCKS_WARN_ON(l->owner)
>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_slab mm/slub.c:3088 [inline]
>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
I'm puzzled by this. We use local_lock_irqsave() on !PREEMPT_RT everywhere.
IIUC this warning says we did the irqsave() and then found out somebody else
already set the owner? But that means they also did that irqsave() and set
themselves as l->owner. Does that mean there would be a spurious irq enable
that didn't go through local_unlock_irqrestore()?
Also this particular stack is from the work, which is scheduled by
queue_work_on() in flush_all_cpus_locked(), which also has a
lockdep_assert_cpus_held() so it should fulfill the "the caller must ensure
the cpu doesn't go away" property. But I think even if this ended up on the
wrong cpu (for the full duration or migrated while processing the work item)
somehow, it wouldn't be able to cause such a warning, but rather corrupt
something else.
Also this code didn't change for a while.
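To make the reasoning above concrete, here is a minimal userspace sketch of
the owner check. It is a model under stated assumptions, not the kernel
implementation: struct model_cpu, model_lock_irqsave(),
model_unlock_irqrestore() and model_irq() are hypothetical names, and the
"warnings" counter stands in for DEBUG_LOCKS_WARN_ON(). It only illustrates
why a second acquire on the same CPU should be impossible unless interrupts
get re-enabled inside the critical section.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not kernel definitions. */
struct task { int pid; };

struct model_cpu {
	bool irqs_enabled;        /* models the CPU's interrupt flag */
	struct task *lock_owner;  /* models l->owner, NULL while unlocked */
	int warnings;             /* counts would-be DEBUG_LOCKS_WARN_ON() hits */
};

/* Save and disable "interrupts", then record the owner. */
static inline bool model_lock_irqsave(struct model_cpu *c, struct task *cur)
{
	bool flags = c->irqs_enabled;

	c->irqs_enabled = false;
	if (c->lock_owner)        /* models DEBUG_LOCKS_WARN_ON(l->owner) */
		c->warnings++;
	c->lock_owner = cur;
	return flags;
}

/* Clear the owner, then restore the saved "interrupt" state. */
static inline void model_unlock_irqrestore(struct model_cpu *c,
					    struct task *cur, bool flags)
{
	if (c->lock_owner != cur) /* models the owner check on release */
		c->warnings++;
	c->lock_owner = NULL;
	c->irqs_enabled = flags;
}

/* An interrupt handler taking the same lock; runs only if IRQs are enabled. */
static inline void model_irq(struct model_cpu *c, struct task *handler)
{
	if (!c->irqs_enabled)
		return;

	bool flags = model_lock_irqsave(c, handler);
	model_unlock_irqrestore(c, handler, flags);
}
```

In this model, calling model_irq() between a lock and its unlock is a no-op
as long as irqs_enabled stays false. Only after something sets irqs_enabled
back to true inside the critical section does the nested acquire find a
non-NULL owner, which is the kind of spurious enable asked about above.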
On Thu, May 23 2024 at 23:03, Vlastimil Babka wrote:
> On 5/23/24 12:36 PM, Thomas Gleixner wrote:
>>> ------------[ cut here ]------------
>>> DEBUG_LOCKS_WARN_ON(l->owner)
>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_slab mm/slub.c:3088 [inline]
>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
>
> I'm puzzled by this. We use local_lock_irqsave() on !PREEMPT_RT everywhere.
> IIUC this warning says we did the irqsave() and then found out somebody else
> already set the owner? But that means they also did that irqsave() and set
> themselves as l->owner. Does that mean there would be a spurious irq enable
> that didn't go through local_unlock_irqrestore()?
>
> Also this particular stack is from the work, which is scheduled by
> queue_work_on() in flush_all_cpus_locked(), which also has a
> lockdep_assert_cpus_held() so it should fulfill the "the caller must ensure
> the cpu doesn't go away" property. But I think even if this ended up on the
> wrong cpu (for the full duration or migrated while processing the work item)
> somehow, it wouldn't be able to cause such a warning, but rather corrupt
> something else.
Indeed. There is another report which makes no sense either:
https://lore.kernel.org/lkml/[email protected]
Both look like data corruption issues caused by whatever...
Thanks,
tglx
On 2024-05-23 23:03:52 [+0200], Vlastimil Babka wrote:
> I'm puzzled by this. We use local_lock_irqsave() on !PREEMPT_RT everywhere.
> IIUC this warning says we did the irqsave() and then found out somebody else
> already set the owner? But that means they also did that irqsave() and set
> themselves as l->owner. Does that mean there would be a spurious irq enable
> that didn't go through local_unlock_irqrestore()?
Correct.
>
> Also this particular stack is from the work, which is scheduled by
> queue_work_on() in flush_all_cpus_locked(), which also has a
> lockdep_assert_cpus_held() so it should fulfill the "the caller must ensure
> the cpu doesn't go away" property. But I think even if this ended up on the
> wrong cpu (for the full duration or migrated while processing the work item)
> somehow, it wouldn't be able to cause such a warning, but rather corrupt
> something else.
Based on
> >> CPU: 3 PID: 5221 Comm: kworker/3:3 Not tainted 6.9.0-syzkaller-10713-g2a8120d7b482 #0
the code was invoked on CPU3 and the kworker was made for CPU3. This is
all fine. All accesses to the lock in question are within a few lines, so
there is no unbalanced lock/unlock or IRQ-unlock which could explain it.
Sebastian
On 5/24/24 12:32 AM, Thomas Gleixner wrote:
> On Thu, May 23 2024 at 23:03, Vlastimil Babka wrote:
>> On 5/23/24 12:36 PM, Thomas Gleixner wrote:
>>>> ------------[ cut here ]------------
>>>> DEBUG_LOCKS_WARN_ON(l->owner)
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 local_lock_acquire include/linux/local_lock_internal.h:30 [inline]
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_slab mm/slub.c:3088 [inline]
>>>> WARNING: CPU: 3 PID: 5221 at include/linux/local_lock_internal.h:30 flush_cpu_slab+0x37f/0x410 mm/slub.c:3146
>>
>> I'm puzzled by this. We use local_lock_irqsave() on !PREEMPT_RT everywhere.
>> IIUC this warning says we did the irqsave() and then found out somebody else
>> already set the owner? But that means they also did that irqsave() and set
>> themselves as l->owner. Does that mean there would be a spurious irq enable
>> that didn't go through local_unlock_irqrestore()?
>>
>> Also this particular stack is from the work, which is scheduled by
>> queue_work_on() in flush_all_cpus_locked(), which also has a
>> lockdep_assert_cpus_held() so it should fulfill the "the caller must ensure
>> the cpu doesn't go away" property. But I think even if this ended up on the
>> wrong cpu (for the full duration or migrated while processing the work item)
>> somehow, it wouldn't be able to cause such a warning, but rather corrupt
>> something else.
>
> Indeed. There is another report which makes no sense either:
>
> https://lore.kernel.org/lkml/[email protected]
That looks like slab->next, which should contain a valid pointer or NULL,
contains 0x13.
slab->next is initialized in put_cpu_partial() from s->cpu_slab->partial.
Here we have corruption inside s->cpu_slab->lock.
> Both look like data corruption issues caused by whatever...
s->cpu_slab is a percpu allocation, so possibly another percpu alloc user has a
buffer overflow?
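As a rough illustration of that hypothesis, the userspace sketch below shows
how an overflow from a neighbouring allocation could leave garbage in both a
pointer field and a lock owner field. Everything in it is an assumption made
for illustration: struct model_cpu_slab is NOT the layout of struct
kmem_cache_cpu, struct model_percpu_area does not reflect how the percpu
allocator places objects, and other_user_write() is a hypothetical buggy
writer. The 0x13 fill byte is borrowed from the other report only as an
example value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout for illustration; NOT struct kmem_cache_cpu. */
struct model_cpu_slab {
	uintptr_t partial;     /* models a pointer that must be valid or NULL */
	uintptr_t lock_owner;  /* models l->owner, zero while unlocked */
};

/* Two allocations that happen to sit next to each other. */
struct model_percpu_area {
	unsigned char other_user[16];  /* a neighbouring user's buffer */
	struct model_cpu_slab victim;
};

/*
 * Buggy writer: it bounds the copy by the size of the whole area instead
 * of its own 16-byte buffer, so a long write spills into the victim.
 */
static inline void other_user_write(struct model_percpu_area *area,
				    const void *src, size_t len)
{
	size_t start = offsetof(struct model_percpu_area, other_user);
	size_t room = sizeof(*area) - start;

	if (len > room)
		len = room;
	memcpy((unsigned char *)area + start, src, len);
}
```

A write of exactly sizeof(other_user) bytes leaves the victim untouched. A
longer one fills partial and lock_owner with the source bytes, so a later
acquire would see a non-zero owner without any lock/unlock imbalance in the
victim's own code.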
> Thanks,
>
> tglx