Hi folks,
Just updated my test boxes from 4.9 to a current Linus 4.10 merge
window kernel to test the XFS merge I am preparing for Linus.
Unfortunately, all my test VMs using iscsi failed pretty much
instantly on the first mount of an iscsi device:
[ 159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[ 159.374612] XFS (sdb): Mounting V5 Filesystem
[ 159.425710] XFS (sdb): Ending clean mount
[ 160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
[ 160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
[ 160.276565] PGD 336ed067 [ 160.276885] PUD 31b0d067
PMD 0 [ 160.277309]
[ 160.277523] Oops: 0000 [#1] PREEMPT SMP
[ 160.278004] Modules linked in:
[ 160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
[ 160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
[ 160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
[ 160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
[ 160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
[ 160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
[ 160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
[ 160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
[ 160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
[ 160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
[ 160.287362] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[ 160.288340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
[ 160.290084] Call Trace:
[ 160.290429] ? inet_sendpage+0x4d/0x140
[ 160.290957] iscsi_sw_tcp_xmit_segment+0x89/0x110
[ 160.291597] iscsi_sw_tcp_pdu_xmit+0x56/0x180
[ 160.292190] iscsi_tcp_task_xmit+0xb8/0x280
[ 160.292771] iscsi_xmit_task+0x53/0xc0
[ 160.293282] iscsi_xmitworker+0x274/0x310
[ 160.293835] process_one_work+0x1de/0x4d0
[ 160.294388] worker_thread+0x4b/0x4f0
[ 160.294889] kthread+0x10c/0x140
[ 160.295333] ? process_one_work+0x4d0/0x4d0
[ 160.295898] ? kthread_create_on_node+0x40/0x40
[ 160.296525] ret_from_fork+0x25/0x30
[ 160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
[ 160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
[ 160.301584] CR2: 000000000000000c
Known problem, or something new?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> Hi folks,
>
> Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> window kernel to test the XFS merge I am preparing for Linus.
> Unfortunately, all my test VMs using iscsi failed pretty much
> instantly on the first mount of an iscsi device:
>
> [ 159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [ 159.374612] XFS (sdb): Mounting V5 Filesystem
> [ 159.425710] XFS (sdb): Ending clean mount
> [ 160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> [ 160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
FYI, crash is here:
(gdb) l *(iscsi_tcp_segment_done+0x20d)
0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
97 iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
98 struct scatterlist *sg, unsigned int offset)
99 {
100 segment->sg = sg;
101 segment->sg_offset = offset;
102 segment->size = min(sg->length - offset,
103 segment->total_size - segment->total_copied);
104 segment->data = NULL;
105 }
106
So it looks to be sg = NULL, which means there's probably an issue
with the scatterlist...
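FWIW, that faulting address lines up with the offset of sg->length when
sg is NULL. A sketch of the layout, assuming x86_64 with CONFIG_DEBUG_SG=n
(reproduced from memory of include/linux/scatterlist.h, so treat the exact
field list as an assumption):

	/*
	 * Roughly the 4.9-era struct with CONFIG_DEBUG_SG disabled.  With
	 * sg == NULL, "sg->length" dereferences address 0x0c, which is
	 * exactly the CR2 value reported in the oops above.
	 */
	struct scatterlist {
		unsigned long	page_link;	/* offset 0x00 */
		unsigned int	offset;		/* offset 0x08 */
		unsigned int	length;		/* offset 0x0c  <-- CR2 */
		dma_addr_t	dma_address;	/* offset 0x10 */
		/* (dma_length follows when CONFIG_NEED_SG_DMA_LENGTH=y) */
	};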
-Dave.
> [ 160.276565] PGD 336ed067 [ 160.276885] PUD 31b0d067
> PMD 0 [ 160.277309]
> [ 160.277523] Oops: 0000 [#1] PREEMPT SMP
> [ 160.278004] Modules linked in:
> [ 160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> [ 160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [ 160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> [ 160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> [ 160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> [ 160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> [ 160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> [ 160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> [ 160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> [ 160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> [ 160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> [ 160.287362] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> [ 160.288340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> [ 160.290084] Call Trace:
> [ 160.290429] ? inet_sendpage+0x4d/0x140
> [ 160.290957] iscsi_sw_tcp_xmit_segment+0x89/0x110
> [ 160.291597] iscsi_sw_tcp_pdu_xmit+0x56/0x180
> [ 160.292190] iscsi_tcp_task_xmit+0xb8/0x280
> [ 160.292771] iscsi_xmit_task+0x53/0xc0
> [ 160.293282] iscsi_xmitworker+0x274/0x310
> [ 160.293835] process_one_work+0x1de/0x4d0
> [ 160.294388] worker_thread+0x4b/0x4f0
> [ 160.294889] kthread+0x10c/0x140
> [ 160.295333] ? process_one_work+0x4d0/0x4d0
> [ 160.295898] ? kthread_create_on_node+0x40/0x40
> [ 160.296525] ret_from_fork+0x25/0x30
> [ 160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> [ 160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> [ 160.301584] CR2: 000000000000000c
>
>
> Known problem, or something new?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>
--
Dave Chinner
[email protected]
Thanks Dave,
I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
modules loaded (virtio block) so there's something else going on in the
current merge window. I'll keep an eye on it and make sure there's
nothing iSCSI needs fixing for.
Chris
On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > Hi folks,
> >
> > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > window kernel to test the XFS merge I am preparing for Linus.
> > Unfortunately, all my test VMs using iscsi failed pretty much
> > instantly on the first mount of an iscsi device:
> >
> > [ 159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> > [ 159.374612] XFS (sdb): Mounting V5 Filesystem
> > [ 159.425710] XFS (sdb): Ending clean mount
> > [ 160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> > [ 160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
>
> FYI, crash is here:
>
> (gdb) l *(iscsi_tcp_segment_done+0x20d)
> 0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
> 97 iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> 98 struct scatterlist *sg, unsigned int offset)
> 99 {
> 100 segment->sg = sg;
> 101 segment->sg_offset = offset;
> 102 segment->size = min(sg->length - offset,
> 103 segment->total_size - segment->total_copied);
> 104 segment->data = NULL;
> 105 }
> 106
>
> So it looks to be sg = NULL, which means there's probably an issue
> with the scatterlist...
>
> -Dave.
>
> > [ 160.276565] PGD 336ed067 [ 160.276885] PUD 31b0d067
> > PMD 0 [ 160.277309]
> > [ 160.277523] Oops: 0000 [#1] PREEMPT SMP
> > [ 160.278004] Modules linked in:
> > [ 160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > [ 160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > [ 160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > [ 160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> > [ 160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > [ 160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> > [ 160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> > [ 160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> > [ 160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> > [ 160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> > [ 160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> > [ 160.287362] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> > [ 160.288340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> > [ 160.290084] Call Trace:
> > [ 160.290429] ? inet_sendpage+0x4d/0x140
> > [ 160.290957] iscsi_sw_tcp_xmit_segment+0x89/0x110
> > [ 160.291597] iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > [ 160.292190] iscsi_tcp_task_xmit+0xb8/0x280
> > [ 160.292771] iscsi_xmit_task+0x53/0xc0
> > [ 160.293282] iscsi_xmitworker+0x274/0x310
> > [ 160.293835] process_one_work+0x1de/0x4d0
> > [ 160.294388] worker_thread+0x4b/0x4f0
> > [ 160.294889] kthread+0x10c/0x140
> > [ 160.295333] ? process_one_work+0x4d0/0x4d0
> > [ 160.295898] ? kthread_create_on_node+0x40/0x40
> > [ 160.296525] ret_from_fork+0x25/0x30
> > [ 160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > [ 160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> > [ 160.301584] CR2: 000000000000000c
> >
> >
> > Known problem, or something new?
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]
> >
>
> --
> Dave Chinner
> [email protected]
On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> Thanks Dave,
>
> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> modules loaded (virtio block) so there's something else going on in the
> current merge window. I'll keep an eye on it and make sure there's
> nothing iSCSI needs fixing for.
OK, so before this slips through the cracks.....
Linus - your tree as of a few minutes ago still panics immediately
when starting xfstests on iscsi devices. It appears to be a
scatterlist corruption and not an iscsi problem, so the iscsi guys
seem to have bounced it and no-one is looking at it.
I'm disappearing for several months at the end of tomorrow, so I
thought I better make sure you know about it. I've also added
linux-scsi, linux-block to the cc list....
Cheers,
Dave.
> On Thu, Dec 15, 2016 at 09:29:53AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 09:24:11AM +1100, Dave Chinner wrote:
> > > Hi folks,
> > >
> > > Just updated my test boxes from 4.9 to a current Linus 4.10 merge
> > > window kernel to test the XFS merge I am preparing for Linus.
> > > Unfortunately, all my test VMs using iscsi failed pretty much
> > > instantly on the first mount of an iscsi device:
> > >
> > > [ 159.372704] XFS (sdb): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> > > [ 159.374612] XFS (sdb): Mounting V5 Filesystem
> > > [ 159.425710] XFS (sdb): Ending clean mount
> > > [ 160.274438] BUG: unable to handle kernel NULL pointer dereference at 000000000000000c
> > > [ 160.275851] IP: iscsi_tcp_segment_done+0x20d/0x2e0
> >
> > FYI, crash is here:
> >
> > (gdb) l *(iscsi_tcp_segment_done+0x20d)
> > 0xffffffff81b950bd is in iscsi_tcp_segment_done (drivers/scsi/libiscsi_tcp.c:102).
> > 97 iscsi_tcp_segment_init_sg(struct iscsi_segment *segment,
> > 98 struct scatterlist *sg, unsigned int offset)
> > 99 {
> > 100 segment->sg = sg;
> > 101 segment->sg_offset = offset;
> > 102 segment->size = min(sg->length - offset,
> > 103 segment->total_size - segment->total_copied);
> > 104 segment->data = NULL;
> > 105 }
> > 106
> >
> > So it looks to be sg = NULL, which means there's probably an issue
> > with the scatterlist...
> >
> > -Dave.
> >
> > > [ 160.276565] PGD 336ed067 [ 160.276885] PUD 31b0d067
> > > PMD 0 [ 160.277309]
> > > [ 160.277523] Oops: 0000 [#1] PREEMPT SMP
> > > [ 160.278004] Modules linked in:
> > > [ 160.278407] CPU: 0 PID: 16 Comm: kworker/u2:1 Not tainted 4.9.0-dgc #18
> > > [ 160.279224] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> > > [ 160.280314] Workqueue: iscsi_q_2 iscsi_xmitworker
> > > [ 160.280919] task: ffff88003e280000 task.stack: ffffc90000080000
> > > [ 160.281647] RIP: 0010:iscsi_tcp_segment_done+0x20d/0x2e0
> > > [ 160.282312] RSP: 0018:ffffc90000083c38 EFLAGS: 00010206
> > > [ 160.282980] RAX: 0000000000000000 RBX: ffff880039061730 RCX: 0000000000000000
> > > [ 160.283854] RDX: 0000000000001e00 RSI: 0000000000000000 RDI: ffff880039061730
> > > [ 160.284738] RBP: ffffc90000083c90 R08: 0000000000000200 R09: 00000000000005a8
> > > [ 160.285627] R10: 000000009835607d R11: 0000000000000000 R12: 0000000000000200
> > > [ 160.286495] R13: 0000000000000000 R14: ffff8800390615a0 R15: ffff880039061730
> > > [ 160.287362] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> > > [ 160.288340] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [ 160.289113] CR2: 000000000000000c CR3: 0000000031a8d000 CR4: 00000000000006f0
> > > [ 160.290084] Call Trace:
> > > [ 160.290429] ? inet_sendpage+0x4d/0x140
> > > [ 160.290957] iscsi_sw_tcp_xmit_segment+0x89/0x110
> > > [ 160.291597] iscsi_sw_tcp_pdu_xmit+0x56/0x180
> > > [ 160.292190] iscsi_tcp_task_xmit+0xb8/0x280
> > > [ 160.292771] iscsi_xmit_task+0x53/0xc0
> > > [ 160.293282] iscsi_xmitworker+0x274/0x310
> > > [ 160.293835] process_one_work+0x1de/0x4d0
> > > [ 160.294388] worker_thread+0x4b/0x4f0
> > > [ 160.294889] kthread+0x10c/0x140
> > > [ 160.295333] ? process_one_work+0x4d0/0x4d0
> > > [ 160.295898] ? kthread_create_on_node+0x40/0x40
> > > [ 160.296525] ret_from_fork+0x25/0x30
> > > [ 160.297015] Code: 43 18 00 00 00 00 e9 ad fe ff ff 48 8b 7b 30 e8 da e7 ca ff 8b 53 10 44 89 ee 48 89 df 2b 53 14 48 89 43 30 c7 43 40 00 00 00 00 <8b
> > > [ 160.300674] RIP: iscsi_tcp_segment_done+0x20d/0x2e0 RSP: ffffc90000083c38
> > > [ 160.301584] CR2: 000000000000000c
> > >
> > >
> > > Known problem, or something new?
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > [email protected]
> > >
> >
> > --
> > Dave Chinner
> > [email protected]
>
--
Dave Chinner
[email protected]
Hi,
On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
>> Thanks Dave,
>>
>> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
>> modules loaded (virtio block) so there's something else going on in the
>> current merge window. I'll keep an eye on it and make sure there's
>> nothing iSCSI needs fixing for.
>
> OK, so before this slips through the cracks.....
>
> Linus - your tree as of a few minutes ago still panics immediately
> when starting xfstests on iscsi devices. It appears to be a
> scatterlist corruption and not an iscsi problem, so the iscsi guys
> seem to have bounced it and no-one is looking at it.
Hmm. There's not much to go by.
Can somebody in iscsi-land please try to just bisect it - I'm not
seeing a lot of clues to where this comes from otherwise.
There clearly hasn't been anything done to drivers/scsi/*iscsi* since
4.9, so it's coming from elsewhere, presumably the block layer changes
as you say.
But I think we need iscsi people to pinpoint it a bit more, since
nobody else seems to be complaining.
Looking around a bit, the only even halfway suspicious scatterlist
initialization thing I see is commit f9d03f96b988 ("block: improve
handling of the magic discard payload") which used to have a magic
hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
now it does __blk_bvec_map_sg() instead.
Maybe __blk_bvec_map_sg() would want that same
if (!bio->bi_vcnt)
return 0;
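Something like the sketch below is what I mean - the helper name is the
one mentioned above, but the argument list here is a guess for
illustration, not copied from the tree:

	#include <linux/blkdev.h>
	#include <linux/scatterlist.h>

	/*
	 * Sketch only: bail out of the single-bvec mapping path when the
	 * bio carries no payload (bi_vcnt == 0), so a discard-style bio
	 * maps to zero scatterlist entries instead of touching an unused
	 * bvec.
	 */
	static int __blk_bvec_map_sg(struct request_queue *q, struct bio *bio,
				     struct scatterlist *sglist,
				     struct scatterlist **sg)
	{
		struct bio_vec *bv;

		if (!bio->bi_vcnt)		/* no payload attached */
			return 0;

		bv = &bio->bi_io_vec[0];
		*sg = sglist;
		sg_set_page(*sg, bv->bv_page, bv->bv_len, bv->bv_offset);
		return 1;
	}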
Adding Christoph explicitly so that he can tell me why I'm a clueless
weenie. Although I suspect he's already on the scsi and block lists
and would have done so anyway.
> I'm disappearing for several months at the end of tomorrow, so I
> thought I better make sure you know about it. I've also added
> linux-scsi, linux-block to the cc list....
Thanks,
Linus
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Hi,
>
> On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <[email protected]> wrote:
> > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> >> Thanks Dave,
> >>
> >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> >> modules loaded (virtio block) so there's something else going on in the
> >> current merge window. I'll keep an eye on it and make sure there's
> >> nothing iSCSI needs fixing for.
> >
> > OK, so before this slips through the cracks.....
> >
> > Linus - your tree as of a few minutes ago still panics immediately
> > when starting xfstests on iscsi devices. It appears to be a
> > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > seem to have bounced it and no-one is looking at it.
>
> Hmm. There's not much to go by.
>
> Can somebody in iscsi-land please try to just bisect it - I'm not
> seeing a lot of clues to where this comes from otherwise.
Yeah, my hopes of this being quickly resolved by someone else didn't
work out and whatever is going on in that test VM is looking like a
different kind of odd. I'm saving that off for later, and seeing if I
can't do a bisect on the iSCSI issue.
Chris
On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Hi,
> >
> > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <[email protected]> wrote:
> > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > >> Thanks Dave,
> > >>
> > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > >> modules loaded (virtio block) so there's something else going on in the
> > >> current merge window. I'll keep an eye on it and make sure there's
> > >> nothing iSCSI needs fixing for.
> > >
> > > OK, so before this slips through the cracks.....
> > >
> > > Linus - your tree as of a few minutes ago still panics immediately
> > > when starting xfstests on iscsi devices. It appears to be a
> > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > seem to have bounced it and no-one is looking at it.
> >
> > Hmm. There's not much to go by.
> >
> > Can somebody in iscsi-land please try to just bisect it - I'm not
> > seeing a lot of clues to where this comes from otherwise.
>
> Yeah, my hopes of this being quickly resolved by someone else didn't
> work out and whatever is going on in that test VM is looking like a
> different kind of odd. I'm saving that off for later, and seeing if I
> can't do a bisect on the iSCSI issue.
There may be deeper issues. I just started running scalability tests
(e.g. 16-way fsmark create tests) and about a minute in I got a
directory corruption reported - something I hadn't seen in the dev
cycle at all. I unmounted the fs, mkfs'd it again, ran the
workload again and about a minute in this fired:
[628867.607417] ------------[ cut here ]------------
[628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
[628867.610702] Modules linked in:
[628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18
[628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[628867.616179] Workqueue: events rht_deferred_worker
[628867.632422] Call Trace:
[628867.634691] dump_stack+0x63/0x83
[628867.637937] __warn+0xcb/0xf0
[628867.641359] warn_slowpath_null+0x1d/0x20
[628867.643362] shadow_lru_isolate+0x171/0x220
[628867.644627] __list_lru_walk_one.isra.11+0x79/0x110
[628867.645780] ? __list_lru_init+0x70/0x70
[628867.646628] list_lru_walk_one+0x17/0x20
[628867.647488] scan_shadow_nodes+0x34/0x50
[628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410
[628867.649506] shrink_node+0x57/0x90
[628867.650233] do_try_to_free_pages+0xdd/0x230
[628867.651157] try_to_free_pages+0xce/0x1a0
[628867.652342] __alloc_pages_slowpath+0x2df/0x960
[628867.653332] ? __might_sleep+0x4a/0x80
[628867.654148] __alloc_pages_nodemask+0x24b/0x290
[628867.655237] kmalloc_order+0x21/0x50
[628867.656016] kmalloc_order_trace+0x24/0xc0
[628867.656878] __kmalloc+0x17d/0x1d0
[628867.657644] bucket_table_alloc+0x195/0x1d0
[628867.658564] ? __might_sleep+0x4a/0x80
[628867.659449] rht_deferred_worker+0x287/0x3c0
[628867.660366] ? _raw_spin_unlock_irq+0xe/0x30
[628867.661294] process_one_work+0x1de/0x4d0
[628867.662208] worker_thread+0x4b/0x4f0
[628867.662990] kthread+0x10c/0x140
[628867.663687] ? process_one_work+0x4d0/0x4d0
[628867.664564] ? kthread_create_on_node+0x40/0x40
[628867.665523] ret_from_fork+0x25/0x30
[628867.666317] ---[ end trace 7c38634006a9955e ]---
Now, this workload does not touch the page cache at all - it's
entirely an XFS metadata workload, so it should not really be
affecting the working set code.
And worse, on that last error, the /host/ is now going into meltdown
(running 4.7.5) with 32 CPUs all burning down in ACPI code:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
35074 root -2 0 0 0 0 R 99.0 0.0 12:38.92 acpi_pad/12
35079 root -2 0 0 0 0 R 99.0 0.0 12:39.40 acpi_pad/16
35080 root -2 0 0 0 0 R 99.0 0.0 12:39.29 acpi_pad/17
35085 root -2 0 0 0 0 R 99.0 0.0 12:39.35 acpi_pad/22
35087 root -2 0 0 0 0 R 99.0 0.0 12:39.13 acpi_pad/24
35090 root -2 0 0 0 0 R 99.0 0.0 12:38.89 acpi_pad/27
35093 root -2 0 0 0 0 R 99.0 0.0 12:38.88 acpi_pad/30
35063 root -2 0 0 0 0 R 98.1 0.0 12:40.64 acpi_pad/1
35065 root -2 0 0 0 0 R 98.1 0.0 12:40.38 acpi_pad/3
35066 root -2 0 0 0 0 R 98.1 0.0 12:40.30 acpi_pad/4
35067 root -2 0 0 0 0 R 98.1 0.0 12:40.82 acpi_pad/5
35077 root -2 0 0 0 0 R 98.1 0.0 12:39.65 acpi_pad/14
35078 root -2 0 0 0 0 R 98.1 0.0 12:39.58 acpi_pad/15
35081 root -2 0 0 0 0 R 98.1 0.0 12:39.32 acpi_pad/18
35072 root -2 0 0 0 0 R 96.2 0.0 12:40.14 acpi_pad/10
35073 root -2 0 0 0 0 R 96.2 0.0 12:39.39 acpi_pad/11
35076 root -2 0 0 0 0 R 96.2 0.0 12:39.39 acpi_pad/13
35084 root -2 0 0 0 0 R 96.2 0.0 12:39.06 acpi_pad/21
35092 root -2 0 0 0 0 R 96.2 0.0 12:39.14 acpi_pad/29
35069 root -2 0 0 0 0 R 95.2 0.0 12:40.71 acpi_pad/7
35068 root -2 0 0 0 0 R 94.2 0.0 12:40.29 acpi_pad/6
35062 root -2 0 0 0 0 D 93.3 0.0 12:40.56 acpi_pad/0
35064 root -2 0 0 0 0 D 92.3 0.0 12:40.18 acpi_pad/2
35082 root -2 0 0 0 0 R 92.3 0.0 12:39.64 acpi_pad/19
35083 root -2 0 0 0 0 R 92.3 0.0 12:38.98 acpi_pad/20
35086 root -2 0 0 0 0 R 92.3 0.0 12:40.11 acpi_pad/23
35088 root -2 0 0 0 0 R 92.3 0.0 12:39.45 acpi_pad/25
35089 root -2 0 0 0 0 R 92.3 0.0 12:39.11 acpi_pad/26
35070 root -2 0 0 0 0 D 91.3 0.0 12:40.21 acpi_pad/8
35071 root -2 0 0 0 0 D 91.3 0.0 12:39.98 acpi_pad/9
35091 root -2 0 0 0 0 D 91.3 0.0 12:39.33 acpi_pad/28
perf top says:
65.98% [kernel] [k] power_saving_thread
3.27% [kernel] [k] native_queued_spin_lock_slowpath
1.61% [kernel] [k] native_write_msr
1.39% [kernel] [k] update_curr_rt
1.20% [kernel] [k] intel_pstate_update_util
1.01% [kernel] [k] __do_softirq
1.01% [kernel] [k] ktime_get
0.99% [kernel] [k] ktime_get_update_offsets_now
0.93% [kernel] [k] rcu_check_callbacks
0.90% [kernel] [k] _raw_spin_lock
0.88% [kernel] [k] perf_event_task_tick
0.82% [kernel] [k] native_irq_return_iret
0.81% [kernel] [k] run_timer_softirq
0.75% [kernel] [k] trigger_load_balance
No idea how to recover this, so I'm just going to reboot it. Back in
a bit.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all.
By "in the dev cycle", do you mean your XFS changes, or have you been
tracking the merge cycle at least for some testing?
> I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
Well, part of the changes during the merge window were the shadow
entry tracking changes that came in through Andrew's tree. Adding
Johannes Weiner to the participants.
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.
Well, I suspect that anything that creates memory pressure will end up
triggering the working set code, so ..
That said, obviously memory corruption could be involved and result in
random issues too, but I wouldn't really expect that in this code.
It would probably be really useful to get more data points - is the
problem reliably in this area, or is it going to be random and all
over the place.
That said:
> And worse, on that last error, the /host/ is now going into meltdown
> (running 4.7.5) with 32 CPUs all burning down in ACPI code:
The obvious question here is how much you trust the environment if the
host ends up also showing problems. Maybe you do end up having hw
issues pop up too.
The primary suspect would presumably be the development kernel you're
testing triggering something, but it has to be asked..
Linus
On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> Looking around a bit, the only even halfway suspicious scatterlist
> initialization thing I see is commit f9d03f96b988 ("block: improve
> handling of the magic discard payload") which used to have a magic
> hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> now it does __blk_bvec_map_sg() instead.
But that check was only for discard (and discard-like) bios which
had the magic single page that sometimes was unused attached.
For "normal" bios the for_each_segment loop iterates over bi_vcnt,
so it will be ignored anyway. That being said both I and the lists
got CCed halfway through the thread and I haven't seen the original
report, so I'm not really sure what's going on here anyway.
On Thu, Dec 22, 2016 at 04:13:22PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 04:13:03PM -0800, Chris Leech wrote:
> > On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > > Hi,
> > >
> > > On Wed, Dec 21, 2016 at 2:16 PM, Dave Chinner <[email protected]> wrote:
> > > > On Fri, Dec 16, 2016 at 10:59:06AM -0800, Chris Leech wrote:
> > > >> Thanks Dave,
> > > >>
> > > >> I'm hitting a bug at scatterlist.h:140 before I even get any iSCSI
> > > >> modules loaded (virtio block) so there's something else going on in the
> > > >> current merge window. I'll keep an eye on it and make sure there's
> > > >> nothing iSCSI needs fixing for.
> > > >
> > > > OK, so before this slips through the cracks.....
> > > >
> > > > Linus - your tree as of a few minutes ago still panics immediately
> > > > when starting xfstests on iscsi devices. It appears to be a
> > > > scatterlist corruption and not an iscsi problem, so the iscsi guys
> > > > seem to have bounced it and no-one is looking at it.
> > >
> > > Hmm. There's not much to go by.
> > >
> > > Can somebody in iscsi-land please try to just bisect it - I'm not
> > > seeing a lot of clues to where this comes from otherwise.
> >
> > Yeah, my hopes of this being quickly resolved by someone else didn't
> > work out and whatever is going on in that test VM is looking like a
> > different kind of odd. I'm saving that off for later, and seeing if I
> > can't do a bisect on the iSCSI issue.
>
> There may be deeper issues. I just started running scalability tests
> (e.g. 16-way fsmark create tests) and about a minute in I got a
> directory corruption reported - something I hadn't seen in the dev
> cycle at all. I unmounted the fs, mkfs'd it again, ran the
> workload again and about a minute in this fired:
>
> [628867.607417] ------------[ cut here ]------------
> [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> [628867.610702] Modules linked in:
> [628867.611375] CPU: 2 PID: 16925 Comm: kworker/2:97 Tainted: G W 4.9.0-dgc #18
> [628867.613382] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
> [628867.616179] Workqueue: events rht_deferred_worker
> [628867.632422] Call Trace:
> [628867.634691] dump_stack+0x63/0x83
> [628867.637937] __warn+0xcb/0xf0
> [628867.641359] warn_slowpath_null+0x1d/0x20
> [628867.643362] shadow_lru_isolate+0x171/0x220
> [628867.644627] __list_lru_walk_one.isra.11+0x79/0x110
> [628867.645780] ? __list_lru_init+0x70/0x70
> [628867.646628] list_lru_walk_one+0x17/0x20
> [628867.647488] scan_shadow_nodes+0x34/0x50
> [628867.648358] shrink_slab.part.65.constprop.86+0x1dc/0x410
> [628867.649506] shrink_node+0x57/0x90
> [628867.650233] do_try_to_free_pages+0xdd/0x230
> [628867.651157] try_to_free_pages+0xce/0x1a0
> [628867.652342] __alloc_pages_slowpath+0x2df/0x960
> [628867.653332] ? __might_sleep+0x4a/0x80
> [628867.654148] __alloc_pages_nodemask+0x24b/0x290
> [628867.655237] kmalloc_order+0x21/0x50
> [628867.656016] kmalloc_order_trace+0x24/0xc0
> [628867.656878] __kmalloc+0x17d/0x1d0
> [628867.657644] bucket_table_alloc+0x195/0x1d0
> [628867.658564] ? __might_sleep+0x4a/0x80
> [628867.659449] rht_deferred_worker+0x287/0x3c0
> [628867.660366] ? _raw_spin_unlock_irq+0xe/0x30
> [628867.661294] process_one_work+0x1de/0x4d0
> [628867.662208] worker_thread+0x4b/0x4f0
> [628867.662990] kthread+0x10c/0x140
> [628867.663687] ? process_one_work+0x4d0/0x4d0
> [628867.664564] ? kthread_create_on_node+0x40/0x40
> [628867.665523] ret_from_fork+0x25/0x30
> [628867.666317] ---[ end trace 7c38634006a9955e ]---
>
> Now, this workload does not touch the page cache at all - it's
> entirely an XFS metadata workload, so it should not really be
> affecting the working set code.
The system is back up, and I haven't reproduced this problem yet.
However, benchmark results are way off where they should be, and at
times the performance is utterly abysmal. The XFS for-next tree
based on the 4.9 kernel shows none of these problems, so I don't
think there's an XFS problem here. Workload is the same 16-way
fsmark workload that I've been using for years as a performance
regression test.
The workload normally averages around 230k files/s - I'm seeing
an average of ~175k files/s on your current kernel. And there are
periods where performance just completely tanks:
# ./fs_mark -D 10000 -S0 -n 100000 -s 0 -L 32 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7 -d /mnt/scratch/8 -d /mnt/scratch/9 -d /mnt/scratch/10 -d /mnt/scratch/11 -d /mnt/scratch/12 -d /mnt/scratch/13 -d /mnt/scratch/14 -d /mnt/scratch/15
# Version 3.3, 16 thread(s) starting at Thu Dec 22 16:29:20 2016
# Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
# Directories: Time based hash between directories across 10000 subdirectories with 180 seconds per subdirectory.
# File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
# Files info: size 0 bytes, written with an IO size of 16384 bytes per write
# App overhead is time in microseconds spent in the test not doing file writing related system calls.
FSUse% Count Size Files/sec App Overhead
0 1600000 0 256964.5 10696017
0 3200000 0 244151.9 12129380
0 4800000 0 239929.1 11724413
0 6400000 0 235646.5 11998954
0 8000000 0 162537.3 15027053
0 9600000 0 186597.4 17957243
.....
0 43200000 0 184972.6 27911543
0 44800000 0 141642.2 62862142
0 46400000 0 137700.8 39018008
0 48000000 0 92155.0 38277234
0 49600000 0 57053.8 27810215
0 51200000 0 178941.6 33321543
This sort of thing is normally indicative of a memory reclaim or
lock contention problem. Profile showed unusual spinlock contention,
but then I realised there was only one kswapd thread running.
Yup, sure enough, it's caused by a major change in memory reclaim
behaviour:
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
[ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
[ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff]
[ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
[ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
the numa=fake=4 CLI option is broken.
And, just to pile another one in - there's a massive xfs_repair
performance regression, too. It normally takes 4m30s to run over the
50 million inode filesystem created by fsmark. On 4.10:
XFS_REPAIR Summary Thu Dec 22 16:43:58 2016
Phase Start End Duration
Phase 1: 12/22 16:35:36 12/22 16:35:38 2 seconds
Phase 2: 12/22 16:35:38 12/22 16:35:40 2 seconds
Phase 3: 12/22 16:35:40 12/22 16:39:00 3 minutes, 20 seconds
Phase 4: 12/22 16:39:00 12/22 16:40:40 1 minute, 40 seconds
Phase 5: 12/22 16:40:40 12/22 16:40:40
Phase 6: 12/22 16:40:40 12/22 16:43:57 3 minutes, 17 seconds
Phase 7: 12/22 16:43:57 12/22 16:43:57
Total run time: 8 minutes, 21 seconds
So, yeah, there's lots of bad stuff I'm tripping over right now.
This is all now somebody else's problem to deal with - by this time
tomorrow I'm outta here and won't be back for several months....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Dec 22, 2016 at 07:18:27AM +0100, Christoph Hellwig wrote:
> On Wed, Dec 21, 2016 at 03:19:15PM -0800, Linus Torvalds wrote:
> > Looking around a bit, the only even halfway suspicious scatterlist
> > initialization thing I see is commit f9d03f96b988 ("block: improve
> > handling of the magic discard payload") which used to have a magic
> > hack wrt !bio->bi_vcnt, and that got removed. See __blk_bios_map_sg(),
> > now it does __blk_bvec_map_sg() instead.
>
> But that check was only for discard (and discard-like) bios which
> had the magic single page that sometimes was unused attached.
>
> For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> so it will be ignored anyway. That being said both I and the lists
> got CCed halfway through the thread and I haven't seen the original
> report, so I'm not really sure what's going on here anyway.
http://www.gossamer-threads.com/lists/linux/kernel/2587485
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Dec 22, 2016 at 05:30:46PM +1100, Dave Chinner wrote:
> > For "normal" bios the for_each_segment loop iterates over bi_vcnt,
> > so it will be ignored anyway. That being said both I and the lists
> > got CCed halfway through the thread and I haven't seen the original
> > report, so I'm not really sure what's going on here anyway.
>
> http://www.gossamer-threads.com/lists/linux/kernel/2587485
This doesn't look like the discard changes, but if Chris wants to test
without them f9d03f96b988 reverts cleanly.
On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
>
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?
I mean the three months leading up to the 4.10 merge, when all the
XFS changes were being tested against 4.9-rc kernels.
The iscsi problem showed up when I updated the base kernel from
4.9 to 4.10-current last week to test the pullreq I was going to
send you. I've been busy with other stuff until now, so I didn't
upgrade my working trees again until today in the hope the iscsi
problem had already been found and fixed.
> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
>
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
>
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
>
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
>
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
>
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.
The iscsi problem is 100% reproducible: create a pair of iscsi luns,
mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
the filesystems.
The test machine I'm having all these other problems on? stable and
steady as a rock using PMEM devices. Moment I go to use /dev/vdc
(i.e. run load/perf benchmarks) it starts falling over left, right
and center.
And I just smacked into this in the bulkstat phase of the benchmark
(mkfs, fsmark, xfs_repair, mount, bulkstat, find, grep, rm):
[ 2729.750563] BUG: Bad page state in process bstat pfn:14945
[ 2729.751863] page:ffffea0000525140 count:-1 mapcount:0 mapping: (null) index:0x0
[ 2729.753763] flags: 0x4000000000000000()
[ 2729.754671] raw: 4000000000000000 0000000000000000 0000000000000000 ffffffffffffffff
[ 2729.756469] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[ 2729.758276] page dumped because: nonzero _refcount
[ 2729.759393] Modules linked in:
[ 2729.760137] CPU: 7 PID: 25902 Comm: bstat Tainted: G B 4.9.0-dgc #18
[ 2729.761888] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 2729.763943] Call Trace:
[ 2729.764523] <IRQ>
[ 2729.765004] dump_stack+0x63/0x83
[ 2729.765784] bad_page+0xc4/0x130
[ 2729.766552] free_pages_check_bad+0x4f/0x70
[ 2729.767531] free_pcppages_bulk+0x3c5/0x3d0
[ 2729.768513] ? page_alloc_cpu_dead+0x30/0x30
[ 2729.769510] drain_pages_zone+0x41/0x60
[ 2729.770417] drain_pages+0x3e/0x60
[ 2729.771215] drain_local_pages+0x24/0x30
[ 2729.772138] flush_smp_call_function_queue+0x88/0x160
[ 2729.773317] generic_smp_call_function_single_interrupt+0x13/0x30
[ 2729.774742] smp_call_function_single_interrupt+0x27/0x40
[ 2729.776000] smp_call_function_interrupt+0xe/0x10
[ 2729.777102] call_function_interrupt+0x8e/0xa0
[ 2729.778147] RIP: 0010:delay_tsc+0x41/0x90
[ 2729.779085] RSP: 0018:ffffc9000f0cf500 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff03
[ 2729.780852] RAX: 0000000077541291 RBX: ffff88008b5efe40 RCX: 000000000000002e
[ 2729.782514] RDX: 0000057700000000 RSI: 0000057777541291 RDI: 0000000000000001
[ 2729.784167] RBP: ffffc9000f0cf500 R08: 0000000000000007 R09: ffffc9000f0cf678
[ 2729.785818] R10: 0000000000000006 R11: 0000000000001000 R12: 0000000000000061
[ 2729.787480] R13: 0000000000000001 R14: 0000000083214e30 R15: 0000000000000080
[ 2729.789124] </IRQ>
[ 2729.789626] __delay+0xf/0x20
[ 2729.790333] do_raw_spin_lock+0x8c/0x160
[ 2729.791255] _raw_spin_lock+0x15/0x20
[ 2729.792112] list_lru_add+0x1a/0x70
[ 2729.792932] xfs_buf_rele+0x3e7/0x410
[ 2729.793792] xfs_buftarg_shrink_scan+0x6b/0x80
[ 2729.794841] shrink_slab.part.65.constprop.86+0x1dc/0x410
[ 2729.796099] shrink_node+0x57/0x90
[ 2729.796905] do_try_to_free_pages+0xdd/0x230
[ 2729.797914] try_to_free_pages+0xce/0x1a0
[ 2729.798852] __alloc_pages_slowpath+0x2df/0x960
[ 2729.799908] __alloc_pages_nodemask+0x24b/0x290
[ 2729.800963] new_slab+0x2ac/0x380
[ 2729.801743] ___slab_alloc.constprop.82+0x336/0x440
[ 2729.802890] ? kmem_zone_alloc+0x81/0x120
[ 2729.803830] ? xfs_buf_read_map+0x2d/0x1a0
[ 2729.804785] ? kmem_zone_alloc+0x81/0x120
[ 2729.805723] __slab_alloc.isra.78.constprop.81+0x2b/0x40
[ 2729.806970] kmem_cache_alloc+0x142/0x180
[ 2729.807905] kmem_zone_alloc+0x81/0x120
[ 2729.808807] xfs_inode_alloc+0x29/0x1d0
[ 2729.809707] xfs_iget+0x3cd/0x9d0
[ 2729.810496] ? xfs_iflush+0x350/0x350
[ 2729.811358] xfs_bulkstat_one_int+0x87/0x370
[ 2729.812359] xfs_bulkstat_one+0x20/0x30
[ 2729.813259] xfs_bulkstat+0x485/0x690
[ 2729.814125] ? xfs_bulkstat_one_int+0x370/0x370
[ 2729.815183] xfs_ioc_bulkstat+0xfa/0x1d0
[ 2729.816105] xfs_file_ioctl+0x8e4/0xc70
[ 2729.817006] ? _raw_spin_unlock_irq+0xe/0x30
[ 2729.818011] ? finish_task_switch+0xab/0x240
[ 2729.819007] ? __schedule+0x265/0x760
[ 2729.819866] ? _raw_spin_unlock+0xe/0x20
[ 2729.820783] ? preempt_schedule_irq+0x59/0xb0
[ 2729.821797] ? retint_kernel+0x1b/0x1d
[ 2729.822685] do_vfs_ioctl+0x94/0x700
[ 2729.823523] ? __fget+0x77/0xb0
[ 2729.824263] SyS_ioctl+0x79/0x90
[ 2729.825022] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 2729.826105] RIP: 0033:0x7fd25ff088e7
[ 2729.826943] RSP: 002b:00007fcfc4757e18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 2729.828714] RAX: ffffffffffffffda RBX: 00007fd0d4000020 RCX: 00007fd25ff088e7
[ 2729.830372] RDX: 00007fcfc4757e60 RSI: ffffffffc0205865 RDI: 0000000000000003
[ 2729.832020] RBP: 0000000000022010 R08: ffffffffffffffff R09: 0000000000000400
[ 2729.833670] R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000023000
[ 2729.835327] R13: 0000000000022010 R14: 0000000000001000 R15: 00007fd0d4000020
> That said:
>
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
>
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.
I trust the hardware way more than I trust the software running on
it....
> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..
I've seen a couple of guest panics in RCU related code take down the
parent QEMU process recently - asserts in console handling code IIRC
- but those are the only signs I've ever had of host side problems.
That both the host and guest went whacky in new and interesting ways
at exactly the same time might be coincidence, but Occam's Razor
suggests they are linked in some way...
>
> Linus
>
--
Dave Chinner
[email protected]
On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <[email protected]> wrote:
>
> This sort of thing is normally indicative of a memory reclaim or
> lock contention problem. Profile showed unusual spinlock contention,
> but then I realised there was only one kswapd thread running.
> Yup, sure enough, it's caused by a major change in memory reclaim
> behaviour:
>
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> [ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> [ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
> [ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff]
> [ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
> [ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff]
> [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
>
> the numa=fake=4 CLI option is broken.
Ok, I think that is independent of anything else. Removing block
people and adding the x86 people.
I'm not seeing anything at all that would change the fake numa stuff,
but maybe the cpu hotplug changes?
Thomas/Ingo/Peter - Dave is going away for several months, so you
won't get feedback from him, but can you look at this? Or maybe point
me towards the right people - I'm seeing no possible relevant changes
at all for x86 numa since 4.9, so it must be some indirect breakage.
Dave is using fake-numa to do performance testing in a VM, and it's a
big deal for the node optimizations for writeback etc. Do you have any
ideas?
Dave, if you're still around, can you send out the kernel config file
you used...
Linus
On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
> > >
> > > There may be deeper issues. I just started running scalability tests
> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > > directory corruption reported - something I hadn't seen in the dev
> > > cycle at all.
> >
> > By "in the dev cycle", do you mean your XFS changes, or have you been
> > tracking the merge cycle at least for some testing?
>
> I mean the three months leading up to the 4.10 merge, when all the
> XFS changes were being tested against 4.9-rc kernels.
>
> The iscsi problem showed up when I updated the base kernel from
> 4.9 to 4.10-current last week to test the pullreq I was going to
> send you. I've been busy with other stuff until now, so I didn't
> upgrade my working trees again until today in the hope the iscsi
> problem had already been found and fixed.
>
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] ------------[ cut here ]------------
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> >
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> >
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> >
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> >
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> >
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
>
> The iscsi problem is 100% reproducible: create a pair of iscsi luns,
> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
> the filesystems.
>
> The test machine I'm having all these other problems on? stable and
> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
> (i.e. run load/perf benchmarks) it starts falling over left, right
> and center.
I'm not reproducing any problems with xfstests running over iscsi_tcp
right now. Two 10G luns exported from an LIO target, attached directly
to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
TEST_DEV and SCRATCH_DEV.
The virtio scatterlist issue that popped right away for me is triggered
by an hdparm ioctl, which is being run by tuned on Fedora. And that
actually seems to happen back on 4.9 as well :(
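If I'm reading the right tree, scatterlist.h:140 is the virt_addr_valid()
BUG_ON in sg_set_buf() - take the exact line number as an assumption on my
part. That check fires when a driver hands the SG code a buffer that isn't
in the linear mapping, e.g. an on-stack buffer now that stacks can be
vmalloc'd (CONFIG_VMAP_STACK, new in 4.9), which would fit an hdparm
passthrough ioctl path and would also explain seeing it back on 4.9:

	/* include/linux/scatterlist.h, roughly (4.9-era) */
	static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
				      unsigned int buflen)
	{
	#ifdef CONFIG_DEBUG_SG
		/* buf must be in the kernel linear mapping */
		BUG_ON(!virt_addr_valid(buf));
	#endif
		sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
	}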
Chris
On Wed, 21 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
> >
> > There may be deeper issues. I just started running scalability tests
> > (e.g. 16-way fsmark create tests) and about a minute in I got a
> > directory corruption reported - something I hadn't seen in the dev
> > cycle at all.
>
> By "in the dev cycle", do you mean your XFS changes, or have you been
> tracking the merge cycle at least for some testing?
>
> > I unmounted the fs, mkfs'd it again, ran the
> > workload again and about a minute in this fired:
> >
> > [628867.607417] ------------[ cut here ]------------
> > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
>
> Well, part of the changes during the merge window were the shadow
> entry tracking changes that came in through Andrew's tree. Adding
> Johannes Weiner to the participants.
>
> > Now, this workload does not touch the page cache at all - it's
> > entirely an XFS metadata workload, so it should not really be
> > affecting the working set code.
>
> Well, I suspect that anything that creates memory pressure will end up
> triggering the working set code, so ..
>
> That said, obviously memory corruption could be involved and result in
> random issues too, but I wouldn't really expect that in this code.
>
> It would probably be really useful to get more data points - is the
> problem reliably in this area, or is it going to be random and all
> over the place.
Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
soon followed by NULL pointer deref in list_lru_isolate, one time when
I tried out Sunday's git tree. Not seen since, I haven't had time to
investigate, just set it aside as something to worry about if it happens
again. But it looks like shadow_lru_isolate() has issues beyond Dave's
case (I've no XFS and no iscsi), suspect unrelated to his other problems.
Hugh
>
> That said:
>
> > And worse, on that last error, the /host/ is now going into meltdown
> > (running 4.7.5) with 32 CPUs all burning down in ACPI code:
>
> The obvious question here is how much you trust the environment if the
> host ends up also showing problems. Maybe you do end up having hw
> issues pop up too.
>
> The primary suspect would presumably be the development kernel you're
> testing triggering something, but it has to be asked..
>
> Linus
On Thu, 22 Dec 2016, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <[email protected]> wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > [ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff]
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
> > [ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff]
> > [ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > [ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff]
> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> >
> > the numa=fake=4 CLI option is broken.
>
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
>
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
Nope. Double checked it for correctness. The cpu hotplug code there is not
involved in setting up kswapd threads. They are created by the memory node
stuff.
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all for x86 numa since 4.9, so it must be some indirect breakage.
>
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
There was the nodeid -> cpuid mapping stuff, but that was in 4.9 so it's
not a post 4.9 wreckage, but maybe it's related nevertheless.
> Dave, if you're still around, can you send out the kernel config file
> you used...
Config file and kernel command line would be helpful.
Thanks,
tglx
On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <[email protected]> wrote:
> >
> > This sort of thing is normally indicative of a memory reclaim or
> > lock contention problem. Profile showed unusual spinlock contention,
> > but then I realised there was only one kswapd thread running.
> > Yup, sure enough, it's caused by a major change in memory reclaim
> > behaviour:
> >
> > [ 0.000000] Zone ranges:
> > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > [ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > [ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff]
> > [ 0.000000] Movable zone start for each node
> > [ 0.000000] Early memory node ranges
> > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
> > [ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff]
> > [ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > [ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff]
> > [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> >
> > the numa=fake=4 CLI option is broken.
>
> Ok, I think that is independent of anything else. Removing block
> people and adding the x86 people.
>
> I'm not seeing anything at all that would change the fake numa stuff,
> but maybe the cpu hotplug changes?
>
> Thomas/Ingo/Peter - Dave is going away for several months, so you
> won't get feedback from him, but can you look at this? Or maybe point
> me towards the right people - I'm seeing no possible relevant changes
> at all for x86 numa since 4.9, so it must be some indirect breakage.
>
> Dave is using fake-numa to do performance testing in a VM, and it's a
> big deal for the node optimizations for writeback etc. Do you have any
> ideas?
>
> Dave, if you're still around, can you send out the kernel config file
> you used...
Looking at this fresh this morning (i.e. not pissed off by having
everything I tried to do fail in different ways all afternoon) I
found this:
$ grep NUMA .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
# CONFIG_NUMA is not set
$
The .config I was using for 4.9 got 'make oldconfig' upgraded, and
looking at it there's a bunch of stuff that has been turned off that
I know was set:
# CONFIG_EXPERT is not set
# CONFIG_PARAVIRT_SPINLOCKS is not set
# CONFIG_COMPACTION is not set
and stuff I never use so don't set was set, like kernel crash dump,
a bunch of stuff for AMD CPUs, susp/resume and power management
debug, every partition type and filesystem under the sun was
selected, heaps of network devices enabled, etc.
So it looks like the problem has occurred during oldconfig, meaning
I have no idea exactly WTF I was testing. Rebuilding now with a
saner config, see what happens.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Dec 23, 2016 at 07:42:40AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 09:24:12AM -0800, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 10:28 PM, Dave Chinner <[email protected]> wrote:
> > >
> > > This sort of thing is normally indicative of a memory reclaim or
> > > lock contention problem. Profile showed unusual spinlock contention,
> > > but then I realised there was only one kswapd thread running.
> > > Yup, sure enough, it's caused by a major change in memory reclaim
> > > behaviour:
> > >
> > > [ 0.000000] Zone ranges:
> > > [ 0.000000] DMA [mem 0x0000000000001000-0x0000000000ffffff]
> > > [ 0.000000] DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
> > > [ 0.000000] Normal [mem 0x0000000100000000-0x000000083fffffff]
> > > [ 0.000000] Movable zone start for each node
> > > [ 0.000000] Early memory node ranges
> > > [ 0.000000] node 0: [mem 0x0000000000001000-0x000000000009efff]
> > > [ 0.000000] node 0: [mem 0x0000000000100000-0x00000000bffdefff]
> > > [ 0.000000] node 0: [mem 0x0000000100000000-0x00000003bfffffff]
> > > [ 0.000000] node 0: [mem 0x00000005c0000000-0x00000005ffffffff]
> > > [ 0.000000] node 0: [mem 0x0000000800000000-0x000000083fffffff]
> > > [ 0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000083fffffff]
> > >
> > > the numa=fake=4 CLI option is broken.
> >
> > Ok, I think that is independent of anything else. Removing block
> > people and adding the x86 people.
> >
> > I'm not seeing anything at all that would change the fake numa stuff,
> > but maybe the cpu hotplug changes?
> >
> > Thomas/Ingo/Peter - Dave is going away for several months, so you
> > won't get feedback from him, but can you look at this? Or maybe point
> > me towards the right people - I'm seeing no possible relevant changes
> > at all for x86 numa since 4.9, so it must be some indirect breakage.
> >
> > Dave is using fake-numa to do performance testing in a VM, and it's a
> > big deal for the node optimizations for writeback etc. Do you have any
> > ideas?
> >
> > Dave, if you're still around, can you send out the kernel config file
> > you used...
>
> Looking at this fresh this morning (i.e. not pissed off by having
> everything I tried to do fail in different ways all afternoon) I
> found this:
>
> $ grep NUMA .config
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> # CONFIG_NUMA is not set
> $
>
> The .config I was using for 4.9 got 'make oldconfig' upgraded, and
> looking at it there's a bunch of stuff that has been turned off that
> I know was set:
>
> # CONFIG_EXPERT is not set
> # CONFIG_PARAVIRT_SPINLOCKS is not set
> # CONFIG_COMPACTION is not set
>
> and stuff I never use so don't set was set, like kernel crash dump,
> a bunch of stuff for AMD CPUs, susp/resume and power management
> debug, every partition type and filesystem under the sun was
> selected, heaps of network devices enabled, etc.
>
> So it looks like the problem has occurred during oldconfig, meaning
> I have no idea exactly WTF I was testing. Rebuilding now with a
> saner config, see what happens.
Better, but still bad. average files/s is now up to 200k files/s,
so still a good 10-15% off where it should be. xfs_repair is back
down to 10-15% off where it should be, too. bulkstat still fires off
a bad page reference count warning, iscsi still panics immediately.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Ok, so the numa issue was a red herring. With that fixed:
On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <[email protected]> wrote:
>
> Better, but still bad. average files/s is now up to 200k files/s,
> so still a good 10-15% off where it should be. xfs_repair is back
> down to 10-15% off where it should be, too. bulkstat still fires off
> a bad page reference count warning, iscsi still panics immediately.
Do you have CONFIG_BLK_WBT enabled, perhaps?
It's new to this merge window, and I'm not convinced it's been tuned.
Particularly for your kinds of fairly extreme IO loads.
Linus
On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
>
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <[email protected]> wrote:
> >
> > Better, but still bad. average files/s is now up to 200k files/s,
> > so still a good 10-15% off where it should be. xfs_repair is back
> > down to 10-15% off where it should be, too. bulkstat still fires off
> > a bad page reference count warning, iscsi still panics immediately.
>
> Do you have CONFIG_BLK_WBT enabled, perhaps?
Ok, yes, that's enabled. Let me go turn it off and see what happens.
> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.
Well, these aren't extreme IO workloads. Yes, the filesystem does a
lot of work on the CPU, but the IO loads XFS generates aren't
particularly high - maybe 2,000-3,000 IOPS at most, and mostly the
bandwidth is below 100MB/s.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > Ok, so the numa issue was a red herring. With that fixed:
> >
> > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <[email protected]> wrote:
> > >
> > > Better, but still bad. average files/s is now up to 200k files/s,
> > > so still a good 10-15% off where it should be. xfs_repair is back
> > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > a bad page reference count warning, iscsi still panics immediately.
> >
> > Do you have CONFIG_BLK_WBT enabled, perhaps?
>
> Ok, yes, that's enabled. Let me go turn it off and see what happens.
Numbers are still all over the place.
FSUse% Count Size Files/sec App Overhead
.....
0 28800000 0 228175.5 21928812
0 30400000 0 167880.5 39628229
0 32000000 0 124289.5 41420925
0 33600000 0 150577.9 35382318
0 35200000 0 216535.4 16072628
0 36800000 0 233414.4 11846654
0 38400000 0 213812.0 13356633
0 40000000 0 175905.7 53012015
0 41600000 0 157028.7 34700794
0 43200000 0 138829.1 50282461
And the average is now back down to 185k files/s. repair runtime is
unchanged and still 10-15% off...
I've got to run away for a few hours right now, but I'll retest the
4.9 + xfs for-next branch when I get back to see if the problem is
my current config or whether there really is a perf problem lurking
somewhere....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech <[email protected]> wrote:
> On Thu, Dec 22, 2016 at 05:50:12PM +1100, Dave Chinner wrote:
>> On Wed, Dec 21, 2016 at 09:46:37PM -0800, Linus Torvalds wrote:
>> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
>> > >
>> > > There may be deeper issues. I just started running scalability tests
>> > > (e.g. 16-way fsmark create tests) and about a minute in I got a
>> > > directory corruption reported - something I hadn't seen in the dev
>> > > cycle at all.
>> >
>> > By "in the dev cycle", do you mean your XFS changes, or have you been
>> > tracking the merge cycle at least for some testing?
>>
>> I mean the three months leading up to the 4.10 merge, when all the
>> XFS changes were being tested against 4.9-rc kernels.
>>
>> The iscsi problem showed up when I updated the base kernel from
>> 4.9 to 4.10-current last week to test the pullreq I was going to
>> send you. I've been busy with other stuff until now, so I didn't
>> upgrade my working trees again until today in the hope the iscsi
>> problem had already been found and fixed.
>>
>> > > I unmounted the fs, mkfs'd it again, ran the
>> > > workload again and about a minute in this fired:
>> > >
>> > > [628867.607417] ------------[ cut here ]------------
>> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
>> >
>> > Well, part of the changes during the merge window were the shadow
>> > entry tracking changes that came in through Andrew's tree. Adding
>> > Johannes Weiner to the participants.
>> >
>> > > Now, this workload does not touch the page cache at all - it's
>> > > entirely an XFS metadata workload, so it should not really be
>> > > affecting the working set code.
>> >
>> > Well, I suspect that anything that creates memory pressure will end up
>> > triggering the working set code, so ..
>> >
>> > That said, obviously memory corruption could be involved and result in
>> > random issues too, but I wouldn't really expect that in this code.
>> >
>> > It would probably be really useful to get more data points - is the
>> > problem reliably in this area, or is it going to be random and all
>> > over the place.
>>
>> The iscsi problem is 100% reproducible. create a pair of iscsi luns,
>> mkfs, run xfstests on them. iscsi fails a second after xfstests mounts
>> the filesystems.
>>
>> The test machine I'm having all these other problems on? stable and
>> steady as a rock using PMEM devices. Moment I go to use /dev/vdc
>> (i.e. run load/perf benchmarks) it starts falling over left, right
>> and center.
>
> I'm not reproducing any problems with xfstests running over iscsi_tcp
> right now. Two 10G luns exported from an LIO target, attached directly
> to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> TEST_DEV and SCRATCH_DEV.
>
> The virtio scatterlist issue that popped right away for me is triggered
> by an hdparm ioctl, which is being run by tuned on Fedora. And that
> actually seems to happen back on 4.9 as well :(
Could you share the specific hdparm command line? I tried several
random commands over virtio-blk/virtio-scsi and didn't see this problem.
Thanks,
Ming Lei
On Fri, Dec 23, 2016 at 07:53:50AM +0800, Ming Lei wrote:
> On Fri, Dec 23, 2016 at 2:50 AM, Chris Leech <[email protected]> wrote:
> > I'm not reproducing any problems with xfstests running over iscsi_tcp
> > right now. Two 10G luns exported from an LIO target, attached directly
> > to a test VM as sda/sdb and xfstests configured to use sda1/sdb1 as
> > TEST_DEV and SCRATCH_DEV.
> >
> > The virtio scatterlist issue that popped right away for me is triggered
> > by an hdparm ioctl, which is being run by tuned on Fedora. And that
> > actually seems to happen back on 4.9 as well :(
>
> Could you share the specific hdparm command line? I tried several
> random commands over virtio-blk/virtio-scsi and didn't see this problem.
Of course, looks like I've screwed up my bisect run on this so I'm still
taking a look. It triggers for me with 'hdparm -B /dev/vda' but may
also depend on kernel configuration.
I started with the Fedora Rawhide config with a lot of debug on and
trimmed it down with a localmodconfig run in the VM to speed up rebuilds.
Chris
On 12/22/2016 02:10 PM, Linus Torvalds wrote:
> Ok, so the numa issue was a red herring. With that fixed:
>
> On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <[email protected]> wrote:
>>
>> Better, but still bad. average files/s is now up to 200k files/s,
>> so still a good 10-15% off where it should be. xfs_repair is back
>> down to 10-15% off where it should be, too. bulkstat still fires off
>> a bad page reference count warning, iscsi still panics immediately.
>
> Do you have CONFIG_BLK_WBT enabled, perhaps?
>
> It's new to this merge window, and I'm not convinced it's been tuned.
> Particularly for your kinds of fairly extreme IO loads.
We have it running on thousands of machines at Facebook, and I've used
it in all of my testing, haven't seen any issues. It's been run pretty
damn hard.
--
Jens Axboe
On Fri, Dec 23, 2016 at 09:33:36AM +1100, Dave Chinner wrote:
> On Fri, Dec 23, 2016 at 09:15:00AM +1100, Dave Chinner wrote:
> > On Thu, Dec 22, 2016 at 01:10:19PM -0800, Linus Torvalds wrote:
> > > Ok, so the numa issue was a red herring. With that fixed:
> > >
> > > On Thu, Dec 22, 2016 at 1:06 PM, Dave Chinner <[email protected]> wrote:
> > > >
> > > > Better, but still bad. average files/s is now up to 200k files/s,
> > > > so still a good 10-15% off where it should be. xfs_repair is back
> > > > down to 10-15% off where it should be, too. bulkstat still fires off
> > > > a bad page reference count warning, iscsi still panics immediately.
> > >
> > > Do you have CONFIG_BLK_WBT enabled, perhaps?
> >
> > Ok, yes, that's enabled. Let me go turn it off and see what happens.
>
> Numbers are still all over the place.
>
> FSUse% Count Size Files/sec App Overhead
> .....
> 0 28800000 0 228175.5 21928812
> 0 30400000 0 167880.5 39628229
> 0 32000000 0 124289.5 41420925
> 0 33600000 0 150577.9 35382318
> 0 35200000 0 216535.4 16072628
> 0 36800000 0 233414.4 11846654
> 0 38400000 0 213812.0 13356633
> 0 40000000 0 175905.7 53012015
> 0 41600000 0 157028.7 34700794
> 0 43200000 0 138829.1 50282461
>
> And the average is now back down to 185k files/s. repair runtime is
> unchanged and still 10-15% off...
>
> I've got to run away for a few hours right now, but I'll retest the
> 4.9 + xfs for-next branch when I get back to see if the problem is
> my current config or whether there really is a perf problem lurking
> somewhere....
Well, I'm not sure now. Taking that config back to 4.9 gave
results of 210k files/s. A bit faster, but still not the 230k
files/s I'm expecting. So I'm missing something in the .config
at this point, though I'm not sure what.
FWIW, updating from 4.9 to the 4.10 tree, this happened:
$ make oldconfig; make -j 32
scripts/kconfig/conf --oldconfig Kconfig
*
* Restart config...
*
*
* General setup
*
Yup, it definitely did something unexpected. And almost silently, I
might add - I didn't notice this the first time around, and wouldn't
have noticed it this time if I wasn't looking for something strange
to happen.
As it is, I still haven't found what magic config option is
taking away that 10-15% of performance. There are no unexpected debug
options set, and it's nothing obvious in the fs/block layer config.
I'll keep looking for the moment...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
> > > I unmounted the fs, mkfs'd it again, ran the
> > > workload again and about a minute in this fired:
> > >
> > > [628867.607417] ------------[ cut here ]------------
> > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> >
> > Well, part of the changes during the merge window were the shadow
> > entry tracking changes that came in through Andrew's tree. Adding
> > Johannes Weiner to the participants.
> >
> > > Now, this workload does not touch the page cache at all - it's
> > > entirely an XFS metadata workload, so it should not really be
> > > affecting the working set code.
> >
> > Well, I suspect that anything that creates memory pressure will end up
> > triggering the working set code, so ..
> >
> > That said, obviously memory corruption could be involved and result in
> > random issues too, but I wouldn't really expect that in this code.
> >
> > It would probably be really useful to get more data points - is the
> > problem reliably in this area, or is it going to be random and all
> > over the place.
>
> Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> soon followed by NULL pointer deref in list_lru_isolate, one time when
> I tried out Sunday's git tree. Not seen since, I haven't had time to
> investigate, just set it aside as something to worry about if it happens
> again. But it looks like shadow_lru_isolate() has issues beyond Dave's
> case (I've no XFS and no iscsi), suspect unrelated to his other problems.
This seems consistent with what Dave observed: we encounter regular
pages in radix tree nodes on the shadow LRU that should only contain
nodes full of exceptional shadow entries. It could be an issue in the
new slot replacement code and the node tracking callback.
Looking...
On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner <[email protected]> wrote:
> > > > I unmounted the fs, mkfs'd it again, ran the
> > > > workload again and about a minute in this fired:
> > > >
> > > > [628867.607417] ------------[ cut here ]------------
> > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461 shadow_lru_isolate+0x171/0x220
> > >
> > > Well, part of the changes during the merge window were the shadow
> > > entry tracking changes that came in through Andrew's tree. Adding
> > > Johannes Weiner to the participants.
> > >
> > > > Now, this workload does not touch the page cache at all - it's
> > > > entirely an XFS metadata workload, so it should not really be
> > > > affecting the working set code.
> > >
> > > Well, I suspect that anything that creates memory pressure will end up
> > > triggering the working set code, so ..
> > >
> > > That said, obviously memory corruption could be involved and result in
> > > random issues too, but I wouldn't really expect that in this code.
> > >
> > > It would probably be really useful to get more data points - is the
> > > problem reliably in this area, or is it going to be random and all
> > > over the place.
> >
> > Data point: kswapd got WARNING on mm/workingset.c:457 in shadow_lru_isolate,
> > soon followed by NULL pointer deref in list_lru_isolate, one time when
> > I tried out Sunday's git tree. Not seen since, I haven't had time to
> > investigate, just set it aside as something to worry about if it happens
> > again. But it looks like shadow_lru_isolate() has issues beyond Dave's
> > case (I've no XFS and no iscsi), suspect unrelated to his other problems.
>
> This seems consistent with what Dave observed: we encounter regular
> pages in radix tree nodes on the shadow LRU that should only contain
> nodes full of exceptional shadow entries. It could be an issue in the
> new slot replacement code and the node tracking callback.
Both encounters seem to indicate use-after-free. Dave's node didn't
warn about an unexpected node->count / node->exceptional state, but
had entries that were inconsistent with that. Hugh got the counter
warning but crashed on a list_head that's not NULLed in a live node.
workingset_update_node() should be called on page cache radix tree
leaf nodes that go empty. I must be missing an update_node callback
where a leaf node gets freed somewhere.
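To make the intended invariant concrete, here is a stripped-down sketch of
the node tracking (not the actual mm/workingset.c code; the shadow_nodes
list name and the helper name are assumed): a leaf node should only sit on
the shadow LRU while every populated slot in it is an exceptional shadow
entry, and the update_node callback is what keeps the list membership in
sync as slots are replaced:

static void track_shadow_node(struct radix_tree_node *node)
{
        if (node->count && node->count == node->exceptional) {
                /* every populated slot is a shadow entry: node belongs on the LRU */
                if (list_empty(&node->private_list))
                        list_lru_add(&shadow_nodes, &node->private_list);
        } else {
                /* regular pages present (or node empty): take it off the LRU */
                if (!list_empty(&node->private_list))
                        list_lru_del(&shadow_nodes, &node->private_list);
        }
}

A leaf node that gets freed or reused without passing through that callback
would leave a stale list_head on the LRU, which would fit both the bogus
entries Dave saw and the non-NULLed list_head Hugh crashed on.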
On Thu, Dec 22, 2016 at 04:03:56PM -0800, Chris Leech wrote:
> Of course, looks like I've screwed up my bisect run on this so I'm still
> taking a look. It triggers for me with 'hdparm -B /dev/vda' but may
> also depend on kernel configuration.
>
> I started with the fedora rawhide config with a lot of debug on and
> trimed it down with a localmodconfig run in the VM to speed up rebuilds.
I think the configuration dependency is CONFIG_HAVE_ARCH_VMAP_STACK;
I've just reproduced the issue with it, and the backtrace points to
__virtblk_add_req when setting up the sense buffer. And it turns out
that blk_execute_rq tries to do I/O to the on-stack sense buffer.
At least SCSI always has a kmalloced sense buffer, so I guess we'll
need something similar for virtio_blk for now. For 4.11 I plan to
rework how BLOCK_PC commands work entirely, so hopefully we can make
the sense buffer handling a lot less wasteful.
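In other words, the failing pattern is roughly this (an illustrative
sketch, not any specific in-tree caller; rq and q are assumed to already
be set up):

        unsigned char sense[SCSI_SENSE_BUFFERSIZE];     /* on the submitter's stack */

        memset(sense, 0, sizeof(sense));
        rq->sense = sense;
        rq->sense_len = 0;
        blk_execute_rq(q, NULL, rq, 0);                 /* driver DMAs sense data into rq->sense */

With CONFIG_HAVE_ARCH_VMAP_STACK the stack lives in vmalloc space, so the
scatterlist entry the driver builds over that address (virt_to_page() on a
vmalloc address) is bogus and things blow up.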
---
From 0a77bc424ed907c1e99b4756bb498370b498183a Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Fri, 23 Dec 2016 10:57:06 +0100
Subject: virtio_blk: avoid DMA to stack for the sense buffer
Most users of BLOCK_PC requests allocate the sense buffer on the stack,
so to avoid DMA to the stack copy them to a field in the heap allocated
virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
including the SG_IO ioctl from userspace will crash the kernel. Note that
this includes running tools like hdparm even when the host does not have
SCSI passthrough enabled.
Signed-off-by: Christoph Hellwig <[email protected]>
---
drivers/block/virtio_blk.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 5545a67..3c3b8f6 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -56,6 +56,7 @@ struct virtblk_req {
 	struct virtio_blk_outhdr out_hdr;
 	struct virtio_scsi_inhdr in_hdr;
 	u8 status;
+	u8 sense[SCSI_SENSE_BUFFERSIZE];
 	struct scatterlist sg[];
 };
 
@@ -102,7 +103,8 @@ static int __virtblk_add_req(struct virtqueue *vq,
 	}
 
 	if (type == cpu_to_virtio32(vq->vdev, VIRTIO_BLK_T_SCSI_CMD)) {
-		sg_init_one(&sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+		memcpy(vbr->sense, vbr->req->sense, SCSI_SENSE_BUFFERSIZE);
+		sg_init_one(&sense, vbr->sense, SCSI_SENSE_BUFFERSIZE);
 		sgs[num_out + num_in++] = &sense;
 		sg_init_one(&inhdr, &vbr->in_hdr, sizeof(vbr->in_hdr));
 		sgs[num_out + num_in++] = &inhdr;
--
2.1.4
On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig <[email protected]> wrote:
>
> From: Christoph Hellwig <[email protected]>
> Date: Fri, 23 Dec 2016 10:57:06 +0100
> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>
> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
> so to avoid DMA to the stack copy them to a field in the heap allocated
> virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
> including the SG_IO ioctl from userspace will crash the kernel. Note that
> this includes running tools like hdparm even when the host does not have
> SCSI passthrough enabled.
Ugh. This patch is nasty.
I think we should just fix blk_execute_rq() instead.
But from a quick look, we also have at least sg_scsi_ioctl() and
sg_io() doing the same thing.
And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
And resp_requests() in scsi_debug.c.
So I guess ugly it may need to be, and the rule is that the sense
buffer really can be on the stack and you can't DMA to/from it.
Comments from others?
Linus
On 12/23/2016 12:42 PM, Linus Torvalds wrote:
> On Fri, Dec 23, 2016 at 2:00 AM, Christoph Hellwig <[email protected]> wrote:
>>
>> From: Christoph Hellwig <[email protected]>
>> Date: Fri, 23 Dec 2016 10:57:06 +0100
>> Subject: virtio_blk: avoid DMA to stack for the sense buffer
>>
>> Most users of BLOCK_PC requests allocate the sense buffer on the stack,
>> so to avoid DMA to the stack copy them to a field in the heap allocated
>> virtblk_req structure. Without that any attempt at SCSI passthrough I/O,
>> including the SG_IO ioctl from userspace will crash the kernel. Note that
>> this includes running tools like hdparm even when the host does not have
>> SCSI passthrough enabled.
>
> Ugh. This patch is nasty.
>
> I think we should just fix blk_execute_rq() instead.
>
> But from a quick look, we also have at least sg_scsi_ioctl() and
> sg_io() doing the same thing.
>
> And the SG_IO thing in bsg_ioctl(). And spi_execute() in scsi_transport_spi.c
>
> And resp_requests() in scsi_debug.c.
It's not that it's technically hard to fix up, it's more that it's a
pain in the ass to have to do it. For instance, for blk_execute_rq(), we
should either enforce that the caller allocates it dynamically and then
free it, or we need a nasty hack where the caller needs to know he has to
free it. Pretty obvious what I would prefer there.
And yes, there would be a good chunk of other places where this would
need to be fixed up...
> So I guess ugly it may need to be, and the rule is that the sense
> buffer really can be on the stack and you can't DMA to/from it.
> Comments from others?
I'm just wondering why this is being hit now, we have a 4.9 release with
this issue and nobody reported it (that I saw)... Which is pretty sad.
If no one beats me to it, I'll try and get a patch done on Sunday. We're
in the midst of the holidays here.
--
Jens Axboe
On Fri, Dec 23, 2016 at 07:45:45PM -0700, Jens Axboe wrote:
> It's not that it's technically hard to fix up, it's more that it's a
> pain in the ass to have to do it. For instance, for blk_execute_rq(), we
> should either enforce that the caller allocates it dynamically and then
> free it, or we need a nasty hack where the caller needs to know he has to
> free it. Pretty obvious what I would prefer there.
>
> And yes, there would be a good chunk of other places where this would
> need to be fixed up...
My planned rework for the BLOCK_PC code (split all fields for them out
of struct request and move them into a separate, driver-allocated structure)
would fix this up as a side-effect. I really wanted to get it into 4.10,
but I didn't manage to finish it in time. I'll try to get it into 4.11 early.
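To spell out the direction (a hypothetical sketch only - every name below
is made up for illustration, this is not code from any tree): the
passthrough-only fields would move out of struct request into a
per-command area that the block layer allocates alongside each request at
a size the driver asks for, so the sense buffer is always heap memory the
device can DMA into:

struct blk_pc_request {
        unsigned char   cmd[BLK_MAX_CDB];               /* CDB */
        unsigned short  cmd_len;
        int             result;
        unsigned int    sense_len;
        unsigned char   sense[SCSI_SENSE_BUFFERSIZE];   /* driver-owned, DMA-able */
};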
On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
> Ugh. This patch is nasty.
It's the same thing SCSI has done for ages - except that it uses a separate
kmalloc for the sense buffer.
> I think we should just fix blk_execute_rq() instead.
As you found out below it's not just blk_execute_rq, it's the whole
architecture of the BLOCK_PC code, which expects a caller-provided
sense buffer. But with the way blk-mq allocates request structures
we can actually fix it, but I first need to extend the way it allows
drivers to allocate private data to the old request code. I've
actually already implemented that for SCSI a long time ago, and have
started to lift it into the block layer.
Once that is done the callers won't need a sense buffer at all, and
can just look at the driver-provided one. That is currently missing
in virtio-blk, so we'd need something similar to the above patch
anyway.
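Caller side, with the same caveat that this is only a sketch reusing the
made-up blk_pc_request structure from earlier: instead of handing in a
stack buffer, a BLOCK_PC submitter would read the sense data out of the
per-request area after completion, e.g.:

        struct request *rq = blk_get_request(q, READ, GFP_KERNEL);
        struct blk_pc_request *pc = blk_mq_rq_to_pdu(rq);

        /* fill in pc->cmd / pc->cmd_len, map any data buffer, ... */
        blk_execute_rq(q, NULL, rq, 0);

        if (pc->sense_len)
                print_hex_dump_bytes("sense: ", DUMP_PREFIX_NONE,
                                     pc->sense, pc->sense_len);
        blk_put_request(rq);

That keeps the DMA target off the stack no matter how the request was
submitted.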
On 12/24/2016 11:07 AM, Christoph Hellwig wrote:
> On Fri, Dec 23, 2016 at 11:42:45AM -0800, Linus Torvalds wrote:
>> Ugh. This patch is nasty.
>
> It's the same thing SCSI has done for ages - except that it uses a separate
> kmalloc for the sense buffer.
>
>> I think we should just fix blk_execute_rq() instead.
>
> As you found out below it's not just blk_execute_rq, it's the whole
> architecture of the BLOCK_PC code, which expects a caller-provided
> sense buffer. But with the way blk-mq allocates request structures
> we can actually fix it, but I first need to extend the way it allows
> drivers to allocate private data to the old request code. I've
> actually already implemented that for SCSI a long time ago, and have
> started to lift it into the block layer.
>
Would be cool to have a generic sense buffer.
I always found it slightly odd, pretending that 'struct request' is
protocol-agnostic and refusing to add a sense data pointer, but at the
same time having a field 'sense_len' (which gives the length of what
exactly?).
Christoph, do you have a pointer to your patchset?
Not that I'll be able to do any meaningful work until next year, but
having a look would be nice. Just to get a feeling for where you want to
head; I might be able to work on this at the start of January.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
On Sat, Dec 24, 2016 at 02:17:26PM +0100, Hannes Reinecke wrote:
> Christoph, do you have a pointer to your patchset?
> Not that I'll be able to do any meaningful work until next year, but having
> a look would be nice. Just to get a feeling for where you want to head; I
> might be able to work on this at the start of January.
I'll push out a branch once it's reviewable and not my current unbisectable
mess; should be soon.