Hello,
syzbot hit the following crash on upstream commit
86bbbebac1933e6e95e8234c4f7d220c5ddd38bc (Mon Apr 2 18:47:07 2018 +0000)
Merge branch 'ras-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
syzbot dashboard link:
https://syzkaller.appspot.com/bug?extid=dc5ab2babdf22ca091af
So far this crash happened 8 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5688491102961664
syzkaller reproducer:
https://syzkaller.appspot.com/x/repro.syz?id=5709211904245760
Raw console output:
https://syzkaller.appspot.com/x/log.txt?id=5720789257027584
Kernel config:
https://syzkaller.appspot.com/x/.config?id=6801295859785128502
compiler: gcc (GCC) 7.1.1 20170620
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: [email protected]
It will help syzbot understand when the bug is fixed. See footer for
details.
If you forward the report, please keep this part and the footer.
EXT4-fs (sda1): shut down requested (0)
------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
WARNING: CPU: 1 PID: 4441 at kernel/locking/rwsem.c:133
up_write+0x1cc/0x210 kernel/locking/rwsem.c:133
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 4441 Comm: syzkaller594909 Not tainted 4.16.0+ #11
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:17 [inline]
dump_stack+0x1a7/0x27d lib/dump_stack.c:53
panic+0x1f8/0x42c kernel/panic.c:183
__warn+0x1dc/0x200 kernel/panic.c:547
report_bug+0x1f4/0x2b0 lib/bug.c:186
fixup_bug.part.10+0x37/0x80 arch/x86/kernel/traps.c:178
fixup_bug arch/x86/kernel/traps.c:247 [inline]
do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
invalid_op+0x1b/0x40 arch/x86/entry/entry_64.S:986
RIP: 0010:up_write+0x1cc/0x210 kernel/locking/rwsem.c:133
RSP: 0018:ffff8801b349f710 EFLAGS: 00010286
RAX: dffffc0000000008 RBX: ffff8801ccc0ce40 RCX: ffffffff815ae26e
RDX: 0000000000000000 RSI: 1ffff10036693e92 RDI: 1ffff10036693e67
RBP: ffff8801b349f798 R08: fffffbfff10b0659 R09: fffffbfff10b0659
R10: ffff8801b349f708 R11: fffffbfff10b0658 R12: 1ffff10036693ee2
R13: dffffc0000000000 R14: ffff8801b349f770 R15: ffff8801ccc0ce98
percpu_up_write+0xca/0x110 kernel/locking/percpu-rwsem.c:183
sb_freeze_unlock fs/super.c:1390 [inline]
thaw_super+0x1ca/0x260 fs/super.c:1524
thaw_bdev+0x151/0x180 fs/block_dev.c:555
ext4_shutdown fs/ext4/ioctl.c:489 [inline]
ext4_ioctl+0x1f85/0x3e60 fs/ext4/ioctl.c:1048
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
SYSC_ioctl fs/ioctl.c:701 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x440109
RSP: 002b:00007fffce185d28 EFLAGS: 00000213 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440109
RDX: 0000000020000100 RSI: 000000008004587d RDI: 0000000000000003
RBP: 00000000006ca018 R08: 000000000000000f R09: 65732f636f72702f
R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000401990
R13: 0000000000401a20 R14: 0000000000000000 R15: 0000000000000000
Dumping ftrace buffer:
(ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to [email protected].
syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is
merged
into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug
report.
Note: all commands must start from beginning of the line in the email body.
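The footer above lists four reply commands and requires each to start at the beginning of a line. A minimal sketch of that rule follows, assuming a hypothetical parser (the function name and regular expression are illustrative, not syzbot's actual implementation):

```python
import re

# Command names are taken from the report footer; everything else here
# is a hypothetical illustration, not syzbot's real code.
KNOWN_COMMANDS = {"fix", "test", "dup", "invalid"}


def parse_syz_commands(body: str) -> list[tuple[str, str]]:
    """Return (command, argument) pairs found in an email body."""
    commands = []
    for line in body.splitlines():
        # re.match anchors at column 0, so quoted ("> #syz ...") or
        # indented lines are deliberately ignored.
        match = re.match(r"#syz\s+(\w+):?\s*(.*)$", line)
        if match and match.group(1) in KNOWN_COMMANDS:
            commands.append((match.group(1), match.group(2).strip()))
    return commands
```

Under this sketch, a quoted copy of a command in a reply would not be picked up; only the line the sender typed at column 0 counts.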
On Tue, Apr 3, 2018 at 4:01 AM, syzbot
<[email protected]> wrote:
+Ted for ext4 frames
On Wed, Apr 04, 2018 at 09:24:05PM +0200, Dmitry Vyukov wrote:
> On Tue, Apr 3, 2018 at 4:01 AM, syzbot
> <[email protected]> wrote:
> > DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
> > WARNING: CPU: 1 PID: 4441 at kernel/locking/rwsem.c:133 up_write+0x1cc/0x210
> > kernel/locking/rwsem.c:133
> > Kernel panic - not syncing: panic_on_warn set ...
Message-Id: <[email protected]>
On Wed, Apr 04, 2018 at 12:35:04PM -0700, Matthew Wilcox wrote:
> On Wed, Apr 04, 2018 at 09:24:05PM +0200, Dmitry Vyukov wrote:
> > On Tue, Apr 3, 2018 at 4:01 AM, syzbot
> > <[email protected]> wrote:
> > > DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
> > > WARNING: CPU: 1 PID: 4441 at kernel/locking/rwsem.c:133 up_write+0x1cc/0x210
> > > kernel/locking/rwsem.c:133
> > > Kernel panic - not syncing: panic_on_warn set ...
>
> Message-Id: <[email protected]>
>
We were way ahead of syzbot in this case. :-)
I reported the problem Tuesday morning:
https://lkml.org/lkml/2018/4/4/814
And within a few hours Waiman had proposed a fix:
https://patchwork.kernel.org/patch/10322639/
Note also that it's not ext4 specific. It can be trivially reproduced using any one of:
kvm-xfstests -c ext4 generic/068
kvm-xfstests -c btrfs generic/068
kvm-xfstests -c xfs generic/068
(Basically, any file system that supports freeze/thaw.)
Cheers,
- Ted
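The warning in the report comes from the check DEBUG_LOCKS_WARN_ON(sem->owner != get_current()). The toy model below sketches what that kind of check enforces. It is not kernel code: the class and task names are hypothetical, and the scenario (a release by a task other than the recorded owner) is only one assumed way such a check can trip, since the thread does not spell out the root cause.

```python
class OwnerMismatch(Exception):
    """Stands in for the kernel warning; purely illustrative."""


class ToyRwsem:
    """Toy write lock that records which task acquired it."""

    def __init__(self):
        self.owner = None

    def down_write(self, task: str) -> None:
        if self.owner is not None:
            raise RuntimeError("toy model: lock already held")
        self.owner = task

    def up_write(self, task: str) -> None:
        # Same shape as the quoted check: the releasing task must be
        # the task recorded as owner.
        if self.owner != task:
            raise OwnerMismatch(
                f"held by {self.owner!r}, released by {task!r}"
            )
        self.owner = None
```

In this model, `down_write("freezer")` followed by `up_write("thawer")` raises `OwnerMismatch`, while releasing from the same task succeeds.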
On Wed, Apr 04, 2018 at 11:22:00PM -0400, Theodore Y. Ts'o wrote:
> We were way ahead of syzbot in this case. :-)
Not really ... syzbot caught it Monday evening ;-)
Date: Mon, 02 Apr 2018 19:01:01 -0700
From: syzbot <[email protected]>
To: [email protected], [email protected],
[email protected], [email protected]
Subject: WARNING in up_write
On Thu, Apr 5, 2018 at 5:24 AM, Matthew Wilcox <[email protected]> wrote:
>> We were way ahead of syzbot in this case. :-)
>
> Not really ... syzbot caught it Monday evening ;-)
>
> Date: Mon, 02 Apr 2018 19:01:01 -0700
> From: syzbot <[email protected]>
> To: [email protected], [email protected],
> [email protected], [email protected]
> Subject: WARNING in up_write
:)
#syz fix: locking/rwsem: Add up_write_non_owner() for percpu_up_write()
On Wed, Apr 04, 2018 at 08:24:54PM -0700, Matthew Wilcox wrote:
> > We were way ahead of syzbot in this case. :-)
>
> Not really ... syzbot caught it Monday evening ;-)
Rather than arguing over who reported it first, I think that time
would be better spent reflecting on why the syzbot report was
completely ignored until *after* Ted diagnosed the issue
independently and Waiman had already fixed it....
Clearly there is scope for improvement here.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Apr 06, 2018 at 08:32:26AM +1000, Dave Chinner wrote:
> Rather than arguing over who reported it first, I think that time
> would be better spent reflecting on why the syzbot report was
> completely ignored until *after* Ted diagnosed the issue
> independently and Waiman had already fixed it....
>
> Clearly there is scope for improvement here.
>
> Cheers,
>
Well, ultimately a human needed to investigate the syzbot bug report to figure
out what was really going on. In my view, the largest problem is that there are
simply too many bugs, so many are getting ignored. If there were only a few
bugs, then Dmitry would investigate each one and send a "real" bug report of
better quality than the automated system can provide, or even send a fix
directly. But in reality, on the same day this bug was reported, syzbot also
found 10 other bugs, and in the previous 2 days it had found 38 more. No single
person can keep up with that. You can see the current bug list, which has 172
open bugs, on the dashboard at https://syzkaller.appspot.com/. Yes, the kernel
really is that broken. Though, of course most bugs are in specific modules, not
the core kernel.
And although quite a few of these bugs will end up being duplicates or even
already fixed, a human still has to look at each one to figure that out.
(Though, I do think that syzbot should try to automatically detect when a
reproducible bug was already fixed, via bisection. It would cause a few bugs to
be incorrectly considered fixed, but it may be a worthwhile tradeoff.)
These bugs are all over the kernel as well, so most developers don't see the big
picture but rather just see a few bugs for "their" subsystem on "their"
subsystem's mailing list and sometimes demand special attention. Of course,
it's great when people suggest ways to improve the process. But it's not great
when people just don't feel responsible for fixing bugs and wait for
Someone Else to do it.
I'm hoping that in the future the syzbot "team", which seems to actually be just
Dmitry now, can get more resources towards helping fix the bugs. But either
way, in the end Linux is a community effort.
Note also that syzbot wasn't super useful in this particular case because people
running xfstests came across the same bug. But, this is actually a rare case.
Most syzbot bug reports have been for weird corner cases or races that no one
ever thought of before, so there are no existing tests that find them.
Thanks,
Eric
On Thu, Apr 05, 2018 at 05:13:25PM -0700, Eric Biggers wrote:
> Well, ultimately a human needed to investigate the syzbot bug report to figure
> out what was really going on. In my view, the largest problem is that there are
> simply too many bugs, so many are getting ignored. If there were only a few
> bugs, then Dmitry would investigate each one and send a "real" bug report of
> better quality than the automated system can provide, or even send a fix
> directly. But in reality, on the same day this bug was reported, syzbot also
> found 10 other bugs, and in the previous 2 days it had found 38 more. No single
> person can keep up with that. You can see the current bug list, which has 172
> open bugs, on the dashboard at https://syzkaller.appspot.com/. Yes, the kernel
> really is that broken. Though, of course most bugs are in specific modules, not
> the core kernel.
There are a lot of bugs, so it needs to be easier for humans to figure
out which ones they should care about. And not all bugs are created
equal. Some are WARN_ON's that aren't all that important. Others
will hard crash the kernel, but are not likely to be something that
can be turned into a privilege escalation attack. Some bugs are
trivially reproducible, and some take a lot more effort. Making it
easier for humans to decide which ones should be looked at first would
certainly be helpful.
For me the prioritization goes as follows.
1) Is it a regression? If it's a regression, I want to fix it fast.
2) Is it something that can be easily escalated to a privilege escalation attack?
Again, if so, I want to fix it fast.
3) Is it going to get in the way of my development process? Things
that trigger new xfstests failures are important, because it's how I
detect (1).
So I ignored the Syzkaller reports this week because it's hard to
differentiate important bugs from less important ones, and after the
merge window, I want to make sure that I have not introduced any
regressions, and I also want to make sure that commits getting merged
by others have not introduced any regressions in the testing suite
that I use, which is xfstests.
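The three questions above can be sketched as a sort key. This is only an illustration: the field names are hypothetical, and treating the numbered list as a strict ordering is an assumption rather than something the message states.

```python
from dataclasses import dataclass


@dataclass
class Report:
    title: str
    is_regression: bool = False              # question 1
    easy_privilege_escalation: bool = False  # question 2
    breaks_test_suite: bool = False          # question 3, e.g. new xfstests failures


def priority(report: Report) -> int:
    """Lower number means look sooner; 4 means no question matched."""
    if report.is_regression:
        return 1
    if report.easy_privilege_escalation:
        return 2
    if report.breaks_test_suite:
        return 3
    return 4


def triage(reports: list[Report]) -> list[Report]:
    # sorted() is stable, so equally ranked reports keep arrival order.
    return sorted(reports, key=priority)
```

A report that answers none of the questions lands at the back of the queue, which matches the complaint that reports lacking this information are hard to rank.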
This is why I've been asking for the bisection feature --- not to find
out when a bug has been fixed, but to find out when a bug has been
*introduced*. If I know that this is a bug which has been recently
introduced, especially if it has been recently introduced by commits
in my tree, or which I have recently pushed to Linus, I'm going to
care a lot more. If I can't make that determination, I'm going to
deprioritize that bug in favor of those that definitely do meet these
criteria.
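For readers unfamiliar with the idea, finding when a bug was introduced is a binary search over history. The sketch below assumes a linear, oldest-to-newest list of commits, a reliable reproducer wrapped in a hypothetical `is_bad` callback, and a bug that stays present once introduced. It is not how syzbot or git bisect is implemented.

```python
from typing import Callable, Sequence


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is known good and commits[-1] is known bad.
    """
    if len(commits) < 2:
        raise ValueError("need at least one good and one bad commit")
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]
```

Each step halves the remaining range, so even a full merge window's worth of commits needs only a handful of reproducer runs.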
It's not a matter of waiting for someone else to fix it (although I
won't complain if someone does :-). It's that I'm overloaded, and I
have to prioritize the work that I do. If syzbot reports are hard to
parse or hard to prioritize, then I may end up prioritizing other work
as being more important. Sorry, but that's just the way that it is.
Note that I haven't just been complaining about it. I've been working
on ways so that the gce-xfstests and kvm-xfstests test appliances can
more easily be used to work on Syzbot reports. If I can make myself
more efficient, or help other people be more efficient, that's
arguably more important than trying to fix some of the 174 currently
open Syzbot issues --- unless you can tell me that certain ones are
super urgent because they (for example) result in CVSS score > 8.
Cheers,
- Ted
On Thu, Apr 05, 2018 at 05:13:25PM -0700, Eric Biggers wrote:
> Well, ultimately a human needed to investigate the syzbot bug report to figure
> out what was really going on. In my view, the largest problem is that there are
> simply too many bugs, so many are getting ignored.
Well, yeah. And when there's too many bugs, looking at the ones
people are actually hitting tends to take precedence over those
reported by a bot with an image problem...
> If there were only a few bugs, then Dmitry would investigate each
> one and send a "real" bug report of better quality than the
> automated system can provide, or even send a fix directly. But in
> reality, on the same day this bug was reported, syzbot also found
> 10 other bugs, and in the previous 2 days it had found 38 more.
> No single person can keep up with that.
And this is precisely why people turn around and ask the syzbot
developers to do things that make it easier for them to diagnose
the problems syzbot reports.
> You can see the current
> bug list, which has 172 open bugs, on the dashboard at
> https://syzkaller.appspot.com/.
Is that all? That's *nothing*.
> Yes, the kernel really is that
> broken.
Actually, that tells me the kernel is a hell of a lot better than my
experience leads me to believe it is. I'd have expected thousands of
bugs, even tens of thousands of bugs given how many issues we deal
with in individual subsystems on a day to day basis.
> And although quite a few of these bugs will end up being
> duplicates or even already fixed, a human still has to look at
> each one to figure that out. (Though, I do think that syzbot
> should try to automatically detect when a reproducible bug was
> already fixed, via bisection. It would cause a few bugs to be
> incorrectly considered fixed, but it may be a worthwhile
> tradeoff.)
>
> These bugs are all over the kernel as well, so most developers
> don't see the big picture but rather just see a few bugs for
> "their" subsystem on "their" subsystem's mailing list and
> sometimes demand special attention. Of course, it's great when
> people suggest ways to improve the process.
That's not the response I got....
> But it's not great
> when people just don't feel responsible for fixing bugs and wait
> for Someone Else to do it.
The excessive cross posting of the reports is one of the reasons
people think someone else will take care of it. i.e. "Oh, that looks VFS,
that went to -fsdevel, I don't need to look at it"....
Put simply: if you're mounting an XFS filesystem image and something
goes bang, then it should be reported to the XFS list. It does not
need to be cross posted to LKML, -fsdevel, 10 individual developers,
etc. If it's not an XFS problem, then the XFS developers will CC the
relevant lists as needed.
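The single-destination idea can be sketched as follows. Note the assumptions: the proposal above routes by which filesystem was being exercised, whereas this sketch uses source paths in the call trace as a stand-in heuristic, and the prefix table, list names, and fallback are illustrative only.

```python
import re

# Hypothetical path-prefix to mailing-list table, for illustration only.
ROUTES = [
    ("fs/xfs/", "linux-xfs"),
    ("fs/ext4/", "linux-ext4"),
    ("fs/btrfs/", "linux-btrfs"),
    ("fs/", "linux-fsdevel"),
]
FALLBACK = "linux-kernel"


def source_paths(trace: str) -> list[str]:
    # Frames look like "thaw_super+0x1ca/0x260 fs/super.c:1524".
    return re.findall(r"\s(\S+\.[chS]):\d+", trace)


def pick_list(trace: str) -> str:
    """Return exactly one list: the most specific prefix match wins."""
    best = None
    for path in source_paths(trace):
        for prefix, name in ROUTES:
            if path.startswith(prefix) and (
                best is None or len(prefix) > len(best[0])
            ):
                best = (prefix, name)
    return best[1] if best else FALLBACK
```

Applied to the trace in this thread, the heuristic would pick the ext4 list because of the fs/ext4/ioctl.c frames, even though the bug was later shown to reproduce on btrfs and xfs too. That is a reminder that any such routing is a first guess the receiving developers may need to redirect.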
> I'm hoping that in the future the syzbot "team", which seems to
> actually be just Dmitry now, can get more resources towards
> helping fix the bugs. But either way, in the end Linux is a
> community effort.
We don't really need help fixing the bugs - we need help making it
easier to *find the bug* the bot tripped over. That's what the
syzbot team needs to focus on, not tell people that what they got is
all they are going to get.
> Note also that syzbot wasn't super useful in this particular case
> because people running xfstests came across the same bug. But,
> this is actually a rare case. Most syzbot bug reports have been
> for weird corner cases or races that no one ever thought of
> before, so there are no existing tests that find them.
Which is exactly what these whacky "mount a filesystem fragment"
tests it is now doing are exercising. Finding the cause of
corruption related crashes is not easy and takes time. Having the
bot developers add something to the bot that will save the developer
looking at the problem 10 minutes of setup time makes a huge
difference to the effort required to find the problem.
The tool is useless if people find it too hard to make sense of the
bug reports (*cough* lockdep *cough*) or perform triage of the
report. If we want to get the bugs fixed faster, we have to make the
reports from automated tools contain the exact information the
developer needs to solve the problem.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Apr 05, 2018 at 09:37:41PM -0400, Theodore Y. Ts'o wrote:
> Note that I haven't just been complaining about it. I've been working
> on ways so that the gce-xfstests and kvm-xfstests test appliances can
> more easily be used to work on Syzbot reports. If I can make myself
> more efficient, or help other people be more efficient, that's
> arguably more important than trying to fix some of the 174 currently
> open Syzbot issues --- unless you can tell me that certain ones are
> super urgent because they (for example) result in CVSS score > 8.
I've got an initial version of this working for kvm-xfstests. To try
it out, grab the latest version of xfstests-bld from [1], and the
kvm-xfstests image from [2]. For people who have never tried using
kvm-xfstests, see [3].
[1] https://github.com/tytso/xfstests-bld
[2] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/testing/root_fs.img.x86_64
[3] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
If you're interested, please try it out, and send me comments.
Sample usage:
kvm-xfstests syz <path/to/repro.{c,syz}>
kvm-xfstests syz <URL to repro.{c,syz}>
Example run:
% kvm-xfstests syz https://syzkaller.appspot.com/x/repro.syz?id=5709211904245760
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 533 100 533 0 0 2157 0 --:--:-- --:--:-- --:--:-- 2157
Saved downloaded copy at /tmp/tytso-downloaded-repro.syz
Networking disabled.
KERNEL: kernel 4.16.0-xfstests-09576-g38c23685b273 #134 SMP Sun Apr 8 01:36:01 EDT 2018 x86_64
FSTESTVER: e2fsprogs v1.43.6-85-g7595699d0 (Wed, 6 Sep 2017 22:04:14 -0400)
FSTESTVER: fio fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
FSTESTVER: quota 59b280e (Mon, 5 Feb 2018 16:48:22 +0100)
FSTESTVER: stress-ng 977ae35 (Wed, 6 Sep 2017 23:45:03 -0400)
FSTESTVER: syzkaller 66f22a7f (Sat, 7 Apr 2018 14:02:03 +0200)
FSTESTVER: xfsprogs v4.15.1 (Mon, 26 Feb 2018 19:50:56 -0600)
FSTESTVER: xfstests-bld 3be913e (Sun, 8 Apr 2018 01:19:21 -0400)
FSTESTVER: xfstests linux-v3.8-1925-g62cc6d02 (Fri, 23 Mar 2018 22:26:41 -0400)
FSTESTCFG: "all"
FSTESTSET: "syz/001"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "1684.65"
total used free shared buff/cache available
Mem: 1684 140 1479 8 65 1507
Swap: 0 0 0
BEGIN TEST 4k (1 test): Ext4 4k block Sun Apr 8 01:49:02 EDT 2018
DEVICE: /dev/vdd
EXT_MKFS_OPTIONS: -b 4096
EXT_MOUNT_OPTIONS: -o block_validity
FSTYP -- ext4
PLATFORM -- Linux/x86_64 kvm-xfstests 4.16.0-xfstests-09576-g38c23685b273
MKFS_OPTIONS -- -b 4096 /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
syz/001 [01:49:04][ 22.859794] run fstests syz/001 at 2018-04-08 01:49:04
[ 23.385195] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity
[ 23.797611] EXT4-fs (vda): shut down requested (0)
[ 23.855759] ------------[ cut here ]------------
[ 23.860823] DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
[ 23.860881] WARNING: CPU: 1 PID: 1332 at /usr/projects/linux/ext4/kernel/locking/rwsem.c:133 up_write+0x113/0x150
[ 23.876121] CPU: 1 PID: 1332 Comm: syz-executor0 Not tainted 4.16.0-xfstests-09576-g38c23685b273 #134
[ 23.880836] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 23.884080] RIP: 0010:up_write+0x113/0x150
[ 23.885873] RSP: 0018:ffff88005e0b7a68 EFLAGS: 00010286
[ 23.887902] RAX: dffffc0000000008 RBX: ffff880066069038 RCX: ffffffff9002f2ce
[ 23.890392] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000293
[ 23.892200] RBP: ffff8800660690a0 R08: fffffbfff245d71d R09: fffffbfff245d71d
[ 23.894877] R10: ffff88007ffca050 R11: fffffbfff245d71c R12: ffff880066068ce0
[ 23.897244] R13: ffff880066068a30 R14: ffff8800660691e0 R15: ffffffff902fe397
[ 23.899597] FS: 000000000275c940(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
[ 23.902104] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 23.903808] CR2: 00000000006dbb18 CR3: 0000000067c7c000 CR4: 00000000000006e0
[ 23.905954] Call Trace:
[ 23.906721] percpu_up_write+0x4c/0x60
[ 23.907868] thaw_super+0x1c4/0x250
[ 23.908943] thaw_bdev+0x14a/0x170
[ 23.909996] ext4_ioctl+0x1fd8/0x39a0
[ 23.911114] ? alloc_set_pte+0x66d/0xe50
[ 23.912318] ? ext4_ioctl_setflags+0x600/0x600
[ 23.913672] ? drop_futex_key_refs.isra.3+0x65/0xb0
[ 23.915106] ? futex_wake+0x14a/0x400
[ 23.916242] ? futex_wait_restart+0x1e0/0x1e0
[ 23.917589] ? lock_contended+0xd30/0xd30
[ 23.918805] ? alloc_set_pte+0x330/0xe50
[ 23.920025] ? kvm_sched_clock_read+0x21/0x30
[ 23.921369] ? sched_clock+0x5/0x10
[ 23.922442] ? sched_clock_cpu+0x18/0x180
[ 23.923691] ? do_futex+0x3ab/0xa90
[ 23.924783] ? exit_robust_list+0x240/0x240
[ 23.926076] ? do_raw_spin_unlock+0x54/0x220
[ 23.927388] ? ext4_ioctl_setflags+0x600/0x600
[ 23.928758] do_vfs_ioctl+0x18b/0xfb0
[ 23.929893] ? ioctl_preallocate+0x1a0/0x1a0
[ 23.931204] ? SyS_futex+0x1c9/0x270
[ 23.932304] ? SyS_futex+0x1d2/0x270
[ 23.933412] ? do_futex+0xa90/0xa90
[ 23.934502] ? up_read+0x1c/0x110
[ 23.935532] ksys_ioctl+0x42/0x80
[ 23.936564] SyS_ioctl+0x23/0x30
[ 23.937567] ? ksys_ioctl+0x80/0x80
[ 23.938649] do_syscall_64+0x1a0/0x640
[ 23.939813] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 23.941360] RIP: 0033:0x455289
[ 23.942298] RSP: 002b:00007ffea24780d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 23.944588] RAX: ffffffffffffffda RBX: 000000000070bea0 RCX: 0000000000455289
[ 23.946762] RDX: 0000000020000100 RSI: 000000008004587d RDI: 0000000000000003
[ 23.948924] RBP: 000000000275c914 R08: 0000000000000000 R09: 0000000000000000
[ 23.951102] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
[ 23.953287] R13: 00000000000001c5 R14: 00000000006dbb18 R15: 00000000006d90a0
[ 23.955435] Code: 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 48 8b 05 14 d0 c2 03 85 c0 75 86 48 c7 c6 60 2c c6 91 48 c7 c7 20 2c c6 91 e8 ad da f1 ff <0f> 0b e9 6c ff ff ff e8 01 a1 2d 00 e9 2a ff ff ff 48 89 ef e8
[ 23.960064] ---[ end trace f542ead798faa3a9 ]---
....
On Sun, Apr 8, 2018 at 8:31 AM, Theodore Y. Ts'o <[email protected]> wrote:
> On Thu, Apr 05, 2018 at 09:37:41PM -0400, Theodore Y. Ts'o wrote:
>> Note that I haven't just been complaining about it. I've been working
>> on ways so that the gce-xfstests and kvm-xfstests test appliances can
>> more easily be used to work on Syzbot reports. If I can make myself
>> more efficient, or help other people be more efficient, that's
>> arguably more important than trying to fix some of the 174 currently
>> open Syzbot issues --- unless you can tell me that certain ones are
>> super urgent because they (for example) result in CVSS score > 8.
>
> I've got an initial version of this working for kvm-xfstests. To try
> it out, grab the latest version of xfstests-bld from [1], and the
> kvm-xfstests image from [2]. For people who have never tried using
> kvm-xfstests, see [3].
>
> [1] https://github.com/tytso/xfstests-bld
> [2] https://www.kernel.org/pub/linux/kernel/people/tytso/kvm-xfstests/testing/root_fs.img.x86_64
> [3] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md
>
> If you're interested, please try it out, and send me comments.
>
> Sample usage:
>
> kvm-xfstests syz <path/to/repro.{c,syz}>
> kvm-xfstests syz <URL to repro.{c,syz}>
>
> Example run:
>
> % kvm-xfstests syz https://syzkaller.appspot.com/x/repro.syz?id=5709211904245760
/\/\/\/\/\/\/\/\
Nice!
But note that syzkaller is under active development, so pre-canned
binaries may not always work. A mismatched binary may not understand
all syscalls, may fail to parse the program, interpret arguments
differently, execute the program differently, set up a different
environment for the test, etc. A C program, by contrast, captures all
of this, because the code that transforms syzkaller programs into C is
versioned along with the rest of the system.
Strictly speaking, for syzkaller reproducers one needs to use the exact
syzkaller revision listed along with the reproducer, see for example:
https://syzkaller.appspot.com/bug?id=3fb9c4777053e79a6d2a65ac3738664c87629a21
The "#syz test" syzbot command does this. Using a different syzkaller
revision may or may not work.
> % Total % Received % Xferd Average Speed Time Time Time Current
> Dload Upload Total Spent Left Speed
> 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0100 533 100 533 0 0 2157 0 --:--:-- --:--:-- --:--:-- 2157
> Saved downloaded copy at /tmp/tytso-downloaded-repro.syz
> Networking disabled.
> KERNEL: kernel 4.16.0-xfstests-09576-g38c23685b273 #134 SMP Sun Apr 8 01:36:01 EDT 2018 x86_64
> FSTESTVER: e2fsprogs v1.43.6-85-g7595699d0 (Wed, 6 Sep 2017 22:04:14 -0400)
> FSTESTVER: fio fio-3.2 (Fri, 3 Nov 2017 15:23:49 -0600)
> FSTESTVER: quota 59b280e (Mon, 5 Feb 2018 16:48:22 +0100)
> FSTESTVER: stress-ng 977ae35 (Wed, 6 Sep 2017 23:45:03 -0400)
> FSTESTVER: syzkaller 66f22a7f (Sat, 7 Apr 2018 14:02:03 +0200)
> FSTESTVER: xfsprogs v4.15.1 (Mon, 26 Feb 2018 19:50:56 -0600)
> FSTESTVER: xfstests-bld 3be913e (Sun, 8 Apr 2018 01:19:21 -0400)
> FSTESTVER: xfstests linux-v3.8-1925-g62cc6d02 (Fri, 23 Mar 2018 22:26:41 -0400)
> FSTESTCFG: "all"
> FSTESTSET: "syz/001"
> FSTESTEXC: ""
> FSTESTOPT: "aex"
> MNTOPTS: ""
> CPUS: "2"
> MEM: "1684.65"
> total used free shared buff/cache available
> Mem: 1684 140 1479 8 65 1507
> Swap: 0 0 0
> BEGIN TEST 4k (1 test): Ext4 4k block Sun Apr 8 01:49:02 EDT 2018
> DEVICE: /dev/vdd
> EXT_MKFS_OPTIONS: -b 4096
> EXT_MOUNT_OPTIONS: -o block_validity
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 kvm-xfstests 4.16.0-xfstests-09576-g38c23685b273
> MKFS_OPTIONS -- -b 4096 /dev/vdc
> MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
>
> syz/001 [01:49:04][ 22.859794] run fstests syz/001 at 2018-04-08 01:49:04
> [ 23.385195] EXT4-fs (vdc): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity
> [ 23.797611] EXT4-fs (vda): shut down requested (0)
> [ 23.855759] ------------[ cut here ]------------
> [ 23.860823] DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
> [ 23.860881] WARNING: CPU: 1 PID: 1332 at /usr/projects/linux/ext4/kernel/locking/rwsem.c:133 up_write+0x113/0x150
> [ 23.876121] CPU: 1 PID: 1332 Comm: syz-executor0 Not tainted 4.16.0-xfstests-09576-g38c23685b273 #134
> [ 23.880836] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 23.884080] RIP: 0010:up_write+0x113/0x150
> [ 23.885873] RSP: 0018:ffff88005e0b7a68 EFLAGS: 00010286
> [ 23.887902] RAX: dffffc0000000008 RBX: ffff880066069038 RCX: ffffffff9002f2ce
> [ 23.890392] RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000293
> [ 23.892200] RBP: ffff8800660690a0 R08: fffffbfff245d71d R09: fffffbfff245d71d
> [ 23.894877] R10: ffff88007ffca050 R11: fffffbfff245d71c R12: ffff880066068ce0
> [ 23.897244] R13: ffff880066068a30 R14: ffff8800660691e0 R15: ffffffff902fe397
> [ 23.899597] FS: 000000000275c940(0000) GS:ffff88006d600000(0000) knlGS:0000000000000000
> [ 23.902104] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 23.903808] CR2: 00000000006dbb18 CR3: 0000000067c7c000 CR4: 00000000000006e0
> [ 23.905954] Call Trace:
> [ 23.906721] percpu_up_write+0x4c/0x60
> [ 23.907868] thaw_super+0x1c4/0x250
> [ 23.908943] thaw_bdev+0x14a/0x170
> [ 23.909996] ext4_ioctl+0x1fd8/0x39a0
> [ 23.911114] ? alloc_set_pte+0x66d/0xe50
> [ 23.912318] ? ext4_ioctl_setflags+0x600/0x600
> [ 23.913672] ? drop_futex_key_refs.isra.3+0x65/0xb0
> [ 23.915106] ? futex_wake+0x14a/0x400
> [ 23.916242] ? futex_wait_restart+0x1e0/0x1e0
> [ 23.917589] ? lock_contended+0xd30/0xd30
> [ 23.918805] ? alloc_set_pte+0x330/0xe50
> [ 23.920025] ? kvm_sched_clock_read+0x21/0x30
> [ 23.921369] ? sched_clock+0x5/0x10
> [ 23.922442] ? sched_clock_cpu+0x18/0x180
> [ 23.923691] ? do_futex+0x3ab/0xa90
> [ 23.924783] ? exit_robust_list+0x240/0x240
> [ 23.926076] ? do_raw_spin_unlock+0x54/0x220
> [ 23.927388] ? ext4_ioctl_setflags+0x600/0x600
> [ 23.928758] do_vfs_ioctl+0x18b/0xfb0
> [ 23.929893] ? ioctl_preallocate+0x1a0/0x1a0
> [ 23.931204] ? SyS_futex+0x1c9/0x270
> [ 23.932304] ? SyS_futex+0x1d2/0x270
> [ 23.933412] ? do_futex+0xa90/0xa90
> [ 23.934502] ? up_read+0x1c/0x110
> [ 23.935532] ksys_ioctl+0x42/0x80
> [ 23.936564] SyS_ioctl+0x23/0x30
> [ 23.937567] ? ksys_ioctl+0x80/0x80
> [ 23.938649] do_syscall_64+0x1a0/0x640
> [ 23.939813] entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [ 23.941360] RIP: 0033:0x455289
> [ 23.942298] RSP: 002b:00007ffea24780d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 23.944588] RAX: ffffffffffffffda RBX: 000000000070bea0 RCX: 0000000000455289
> [ 23.946762] RDX: 0000000020000100 RSI: 000000008004587d RDI: 0000000000000003
> [ 23.948924] RBP: 000000000275c914 R08: 0000000000000000 R09: 0000000000000000
> [ 23.951102] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
> [ 23.953287] R13: 00000000000001c5 R14: 00000000006dbb18 R15: 00000000006d90a0
> [ 23.955435] Code: 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 48 8b 05 14 d0 c2 03 85 c0 75 86 48 c7 c6 60 2c c6 91 48 c7 c7 20 2c c6 91 e8 ad da f1 ff <0f> 0b e9 6c ff ff ff e8 01 a1 2d 00 e9 2a ff ff ff 48 89 ef e8
> [ 23.960064] ---[ end trace f542ead798faa3a9 ]---
> ....
On Sun, Apr 08, 2018 at 03:18:39PM +0200, Dmitry Vyukov wrote:
>
> But note that syzkaller is under active development, so pre-canned
> binaries may not always work. Mismatching binary may not understand
> all syscalls, fail to parse program, interpret arguments differently,
> execute program differently, setup a different environment for the
> test, etc. Now a C program captures all of this, because code that
> transforms syzkaller programs into C is versioned along with the rest
> of the system.
> Strictly saying, for syzkaller reproducers one needs to use the exact
> syzkaller revision listed along with the reproducer, see for example:
> https://syzkaller.appspot.com/bug?id=3fb9c4777053e79a6d2a65ac3738664c87629a21
> The "#syz test" syzbot command does this. Using a different syzkaller
> revision may or may not work.
Thanks for the warning. I assume you try to maintain backwards
compatibility where possible? It might be nice if you could add some
kind of explicit versioning scheme --- perhaps with a major/minor
version scheme where the syz-executor needs to have the same major
number, and a minor number >= the minor version number of the test?
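To make the proposed rule concrete, here is a minimal sketch of the check
it implies. This only illustrates the suggestion above; syzkaller has no
such version numbers, and the function name and the (major, minor) tuple
representation are hypothetical.

```python
def executor_can_run(executor_version, test_version):
    """Return True if an executor may run a test under the proposed rule.

    Both arguments are hypothetical (major, minor) tuples. The rule is:
    same major number, and executor minor >= test minor.
    """
    exec_major, exec_minor = executor_version
    test_major, test_minor = test_version
    return exec_major == test_major and exec_minor >= test_minor
```

With this rule an executor at (2, 5) would accept a test recorded at
(2, 3), but not one recorded at (2, 7) or (3, 0).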
One of the reasons why the C program is not so useful for me is that
in the Repeat:true case, the C program repeats forever. So for
example, I translate Repeat:true to -repeat=100. See:
https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/usr/local/bin/run-syz
I suppose I could just abort the test after N minutes and assume if
the kernel hasn't crashed, that it's probably not going to. But some
way that the C program can be given an argument or an environment
variable to control how many loops it will run might be useful.
And some kind of hint as to how reliable the repro would be (e.g., some
indication that you should try to run it at least N times to get a
failure at least 95% of the time).
- Ted
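For the "run it at least N times" hint, the arithmetic can be sketched as
follows, assuming each run of the reproducer triggers the bug
independently with the same probability p (an assumption; real
reproducers may not behave this way). The chance of at least one failure
in N runs is 1 - (1 - p)^N, so N must be at least
log(1 - 0.95) / log(1 - p). The function name is illustrative only.

```python
import math


def runs_needed(p_single, confidence=0.95):
    """Smallest N with 1 - (1 - p_single)**N >= confidence.

    Assumes independent runs with a fixed per-run hit probability.
    """
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    if p_single == 1.0:
        return 1  # a fully reliable reproducer needs a single run
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))
```

For example, a reproducer that hits the bug in one run out of ten would
need 29 runs to reach the 95% level under these assumptions.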
On Sun, Apr 8, 2018 at 8:02 PM, Theodore Y. Ts'o <[email protected]> wrote:
> On Sun, Apr 08, 2018 at 03:18:39PM +0200, Dmitry Vyukov wrote:
>>
>> But note that syzkaller is under active development, so pre-canned
>> binaries may not always work. Mismatching binary may not understand
>> all syscalls, fail to parse program, interpret arguments differently,
>> execute program differently, setup a different environment for the
>> test, etc. Now a C program captures all of this, because code that
>> transforms syzkaller programs into C is versioned along with the rest
>> of the system.
>> Strictly saying, for syzkaller reproducers one needs to use the exact
>> syzkaller revision listed along with the reproducer, see for example:
>> https://syzkaller.appspot.com/bug?id=3fb9c4777053e79a6d2a65ac3738664c87629a21
>> The "#syz test" syzbot command does this. Using a different syzkaller
>> revision may or may not work.
>
> Thanks for the warning. I assume you try to maintain backwards
> compatibility where possible? It might be nice if you could add some
> kind of explicit versioning scheme --- perhaps with a major/minor
> version scheme where the syz-executor needs to have the same major
> number, and a minor number >= the minor version number of the test?
We try not to break backwards compatibility without a reason.
Preserving full backwards compatibility within a single binary is
extremely hard. It's like asking the kernel to support every version
that ever existed of every in-memory data structure, along with all of
the non-functional aspects (like any fluctuations in performance). If
someone could give us several additional FTEs for this, then it might be
doable. But even then I don't think it's the best use of the FTE time,
because the version control system already gives us exactly this --
exact behavior on a past revision. On top of this, the backward
compatibility support will surely have bugs too. In the best case we
will spend time debugging why a new version does not precisely model
the behavior of an old version. In the worst case you will test
something and think that the bug is fixed, but it's just that the new
version does not behave exactly as the old one. On top of this, this
still does not give us forward compatibility, something that one wants
in the majority of cases with an old pre-canned binary. On top of this,
the binaries will be huge because they will need to capture exact
versions of all system call descriptions (and the simplest option for
this is keeping copies of all versions). An 87 MiB image definitely
won't be enough to hold this; the binary will be somewhere between gigs
and tens of gigs.
> One of the reasons why the C program is not so useful for me is that
> in the Repeat:true case, the C program repeats forever. So for
> example, I translate Repeat:true to -repeat=100. See:
>
> https://github.com/tytso/xfstests-bld/blob/master/kvm-xfstests/test-appliance/files/usr/local/bin/run-syz
>
> I suppose I could just abort the test after N minutes and assume if
> the kernel hasn't crashed, that it's probably not going to. But some
> way that the C program can be given an argument or an environment
> variable to control how many loops it will run might be useful.
> And some kind of hint as to how reliable the repro would be (e.g., some
> indication that you should try to run it at least N times to get a
> failure at least 95% of the time).
I think:
timeout -s KILL 450 ./a.out
is the solution.
Repro logic runs programs for at most 7.5 minutes, so 450 should be good.
Re env var. There are opposite views too. People complain that
syzkaller C repros are a mess (which they are). Currently they contain
the minimal amount of code needed to reproduce the bugs. If we also
start stuffing some aux logic into them, it won't be helpful. The
timeout command looks just as good.
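For a test harness that wants the same behavior as the timeout command
from inside a script, a minimal sketch might look like the following. It
assumes a Linux host; the function name and return convention are
illustrative and are not part of syzkaller or xfstests-bld. The 450
second default matches the 7.5 minute limit mentioned above.

```python
import os
import signal
import subprocess


def run_with_deadline(argv, seconds=450):
    """Run argv and SIGKILL its whole process group after `seconds`.

    Returns ("exited", returncode) if the program finished on its own,
    or ("timeout", None) if it had to be killed.
    """
    # start_new_session=True gives the child its own process group, so
    # helpers forked by the reproducer are killed together with it.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return ("exited", proc.wait(timeout=seconds))
    except subprocess.TimeoutExpired:
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed child
        return ("timeout", None)
```

A call such as run_with_deadline(["./a.out"]) returning ("timeout", None)
only says the program ran for the full period; whether the kernel warned
or crashed still has to be checked in the console log.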
On Fri, Apr 6, 2018 at 4:01 AM, Dave Chinner <[email protected]> wrote:
> On Thu, Apr 05, 2018 at 05:13:25PM -0700, Eric Biggers wrote:
>> On Fri, Apr 06, 2018 at 08:32:26AM +1000, Dave Chinner wrote:
>> > On Wed, Apr 04, 2018 at 08:24:54PM -0700, Matthew Wilcox wrote:
>> > > On Wed, Apr 04, 2018 at 11:22:00PM -0400, Theodore Y. Ts'o wrote:
>> > > > On Wed, Apr 04, 2018 at 12:35:04PM -0700, Matthew Wilcox wrote:
>> > > > > On Wed, Apr 04, 2018 at 09:24:05PM +0200, Dmitry Vyukov wrote:
>> > > > > > On Tue, Apr 3, 2018 at 4:01 AM, syzbot
>> > > > > > <[email protected]> wrote:
>> > > > > > > DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
>> > > > > > > WARNING: CPU: 1 PID: 4441 at kernel/locking/rwsem.c:133 up_write+0x1cc/0x210
>> > > > > > > kernel/locking/rwsem.c:133
>> > > > > > > Kernel panic - not syncing: panic_on_warn set ...
>> > > > >
>> > > > > Message-Id: <[email protected]>
>> > > > >
>> > > >
>> > > > We were way ahead of syzbot in this case. :-)
>> > >
>> > > Not really ... syzbot caught it Monday evening ;-)
>> >
>> > Rather than arguing over who reported it first, I think that time
>> > would be better spent reflecting on why the syzbot report was
>> > completely ignored until *after* Ted diagnosed the issue
>> > independently and Waiman had already fixed it....
>> >
>> > Clearly there is scope for improvement here.
>> >
>> > Cheers,
>> >
>>
>> Well, ultimately a human needed to investigate the syzbot bug report to figure
>> out what was really going on. In my view, the largest problem is that there are
>> simply too many bugs, so many are getting ignored.
>
> Well, yeah. And when there's too many bugs, looking at the ones
> people are actually hitting tend to take precedence over those
> reported by a bot an image problem...
>
>> If there were only a few bugs, then Dmitry would investigate each
>> one and send a "real" bug report of better quality than the
>> automated system can provide, or even send a fix directly. But in
>> reality, on the same day this bug was reported, syzbot also found
>> 10 other bugs, and in the previous 2 days it had found 38 more.
>> No single person can keep up with that.
>
> And this is precisely why people turn around and ask the syzbot
> developers to do things that make it easier for them to diagnose
> the problems syzbot reports.
>
>> You can see the current
>> bug list, which has 172 open bugs, on the dashboard at
>> https://syzkaller.appspot.com/.
>
> Is that all? That's *nothing*.
>
>> Yes, the kernel really is that
>> broken.
>
> Actually, that tells me the kernel is a hell of a lot better than my
> experience leads me to believe it is. I'd have expected thousands of
> bugs, even tens of thousands of bugs given how many issues we deal
> with in individual subsystems on a day to day basis.
>
>> And although quite a few of these bugs will end up to be
>> duplicates or even already fixed, a human still has to look at
>> each one to figure that out. (Though, I do think that syzbot
>> should try to automatically detect when a reproducible bug was
>> already fixed, via bisection. It would cause a few bugs to be
>> incorrectly considered fixed, but it may be a worthwhile
>> tradeoff.)
>>
>> These bugs are all over the kernel as well, so most developers
>> don't see the big picture but rather just see a few bugs for
>> "their" subsystem on "their" subsystem's mailing list and
>> sometimes demand special attention. Of course, it's great when
>> people suggest ways to improve the process.
>
> That's not the response I got....
>
>> But it's not great
>> when people just don't feel responsible for fixing bugs and wait
>> for Someone Else to do it.
>
> The excessive cross posting of the reports is one of the reasons
> people think someone else will take care of it. i.e. "Oh, that looks VFS,
> that went to -fsdevel, I don't need to look at it"....
>
> Put simply: if you're mounting an XFS filesystem image and something
> goes bang, then it should be reported to the XFS list. It does not
> need to be cross posted to LKML, -fsdevel, 10 individual developers,
> etc. If it's not an XFS problem, then the XFS developers will CC the
> relevant lists as needed.
>
>> I'm hoping that in the future the syzbot "team", which seems to
>> actually be just Dmitry now, can get more resources towards
>> helping fix the bugs. But either way, in the end Linux is a
>> community effort.
>
> We don't really need help fixing the bugs - we need help making it
> easier to *find the bug* the bot tripped over. That's what the
> syzbot team needs to focus on, not tell people that what they got is
> all they are going to get.
>
>> Note also that syzbot wasn't super useful in this particular case
>> because people running xfstests came across the same bug. But,
>> this is actually a rare case. Most syzbot bug reports have been
>> for weird corner cases or races that no one ever thought of
>> before, so there are no existing tests that find them.
>
> Which is exactly what these whacky "mount a filesystem fragment"
> tests it is now doing are exercising. Finding the cause of
> corruption related crashes is not easy and takes time. Having the
> bot developers add something to the bot that will save the developer
> looking at the problem 10 minutes of setup time makes a huge
> difference to the effort required to find the problem.
>
> The tool is useless if people find it too hard to make sense of the
> bug reports (*cough* lockdep *cough*) or perform triage of the
> report. If we want to get the bugs fixed faster, we have to make the
> reports from automated tools contain the exact information the
> developer needs to solve the problem.
Hi,
Regarding feature requests.
We too have limited resources unfortunately and can't handle all
feature requests. Feature requests generally fall into the following
categories:
1. General features that are easy to do.
These are generally done right away (more or less).
2. General features that require significant time.
These are noted and are done as resources permit. For example:
- bisection (https://github.com/google/syzkaller/issues/501)
- kdump collection (https://github.com/google/syzkaller/issues/491)
Examples of what is done already:
- patch testing
- significantly restructured reports
3. Subsystem-specific features that are easy to do.
I don't remember that we got any. I guess they would compete with case 2.
4. Subsystem-specific features that require significant time.
For these we don't have resources at the moment. Our company has
dedicated people for some subsystems (to not go far -- Ted for ext4),
but we don't have people for just any subsystem.
Kernel developers working on Infiniband contributed to syzkaller
themselves, and as far as I understand they are very happy with the
results because it allowed them to find and fix several dozen
critical bugs (without involving us at all), so that's an option too.
Then, the context of the system is not a single subsystem and not a
single bug. Please don't draw all conclusions from a small subset of
cases. At this scale there will inevitably be harder bugs that are
handled worse than a dedicated human would handle them (but a dedicated
human would not be able to handle that number of bugs). But this does
not make the overall effect negative: many hundreds of bugs are getting
fixed. In lots of cases developers pick up bugs from "C program +
repro instructions". There is also a considerable number of simpler bugs
that are getting fixed even without reproducers. This can be the case
for a filesystem too, for example, a NULL deref with an obviously
missing preceding state check, or a KASAN report with all stacks. It's
not possible to know ahead of time whether a bug can be fixed with the
existing information or not. So there is no option of reporting just
the former bugs; we can report either all of them or none of them
(which would mean that none of the bugs are fixed).
Regarding prioritization.
Bisection is on our plate. But note that a WARNING can be misleading.
One of the bad bugs syzkaller has found was exactly a WARNING, a
WARNING while restoring FPU registers on context switch, which means an
interprocess, or host->guest, information leak. One of the worst ones
manifested in no kernel report at all. It was one of these "target
machine just became unresponsive with no self-detected reports".
"There is something wrong with kernel" reports get the lowest priority,
but that one turned out to be a full guest->host escape. Even if it's
just a WARNING, but triggered remotely, that can be a large problem
too. So generally prioritization still requires an expert's attention,
which in turn requires reporting all these bugs in the first place.
It can also be the case that an innocent bug masks critical bugs. For
example, if there is an easy-to-trigger bug on entrance to a
subsystem, nothing else will be discovered until that one is fixed.
There are definitely more than 172 bugs. I agree, thousands. And the
system is generally capable of finding them, it already has found
close to 2000 I think. It's just that the system chokes with existing
bugs and all test machines crash right after boot. The more bugs we
fix, the more new bugs we will see.
Bugs with high CVSS scores are frequently found with similar fuzzing
systems. But these won't be reported by humans on mailing lists, and
these are not bugs people are actually hitting. They look exactly
like this -- some insane inputs to the kernel -- and they are sold and
used to exploit our phones and bank accounts.
Regarding CC lists.
If you see issues there, please improve scripts/get_maintainer.pl.
That's what most people use to find relevant emails when reporting
bugs (when they are not maintainers of this very subsystem and have
some secret knowledge) and that's what syzbot uses. If it produces
wrong results, the scope of the problem is larger than syzbot.
On Thu, Apr 5, 2018 at 10:22 AM, Dmitry Vyukov <[email protected]> wrote:
> On Thu, Apr 5, 2018 at 5:24 AM, Matthew Wilcox <[email protected]> wrote:
>> On Wed, Apr 04, 2018 at 11:22:00PM -0400, Theodore Y. Ts'o wrote:
>>> On Wed, Apr 04, 2018 at 12:35:04PM -0700, Matthew Wilcox wrote:
>>> > On Wed, Apr 04, 2018 at 09:24:05PM +0200, Dmitry Vyukov wrote:
>>> > > On Tue, Apr 3, 2018 at 4:01 AM, syzbot
>>> > > <[email protected]> wrote:
>>> > > > DEBUG_LOCKS_WARN_ON(sem->owner != get_current())
>>> > > > WARNING: CPU: 1 PID: 4441 at kernel/locking/rwsem.c:133 up_write+0x1cc/0x210
>>> > > > kernel/locking/rwsem.c:133
>>> > > > Kernel panic - not syncing: panic_on_warn set ...
>>> >
>>> > Message-Id: <[email protected]>
>>> >
>>>
>>> We were way ahead of syzbot in this case. :-)
>>
>> Not really ... syzbot caught it Monday evening ;-)
>>
>> Date: Mon, 02 Apr 2018 19:01:01 -0700
>> From: syzbot <[email protected]>
>> To: [email protected], [email protected],
>> [email protected], [email protected]
>> Subject: WARNING in up_write
>
> :)
>
> #syz fix: locking/rwsem: Add up_write_non_owner() for percpu_up_write()
The title was later changed to:
#syz fix: locking/rwsem: Add a new RWSEM_ANONYMOUSLY_OWNED flag