Hi folks,
I'm intermittently seeing the following oops on at least one powerpc box.
The BUG_ON() is from:
static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb)
{
...
count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
BUG_ON(count > sdb->table.nents);
Looking at the dump it looks like count was 2; I can't work out what nents was.
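For reference, sdb->table.nents is sized from req->nr_phys_segments a few
lines earlier in the same function; roughly, from the 4.4 tree (an
abbreviated sketch, not the verbatim source):

static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb)
{
	int count;

	/* The sg table is sized from the segment count the block layer
	 * computed while building/merging the request. */
	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
					req->mq_ctx != NULL)))
		return BLKPREP_DEFER;

	/* If blk_rq_map_sg() then walks the bios and produces more entries
	 * than that count, the table has already been overrun. */
	count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
	BUG_ON(count > sdb->table.nents);
	sdb->table.nents = count;
	sdb->length = blk_rq_bytes(req);
	return BLKPREP_OK;
}

So a count of 2 with nents of 1 would mean the block layer told SCSI the
request had one physical segment, but then mapped two.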
The machine's just a fairly boring bare metal setup, with a single IPR adapter:
0001:08:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 02)
Subsystem: IBM PCIe3 x8 SAS RAID Internal Adapter 6Gb (57D7)
Flags: bus master, fast devsel, latency 0
Kernel driver in use: ipr
Anyone seen it before or have any ideas?
cheers
systemd[1]: Starting Uncomplicated firewall...
Starting Uncomplicated firewall...
------------[ cut here ]------------
kernel BUG at drivers/scsi/scsi_lib.c:1096!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=2048 NUMA PowerNV
Modules linked in:
CPU: 132 PID: 2699 Comm: kworker/132:1H Not tainted 4.4.0-rc1-54939-ge22a248-dirty #77
Workqueue: kblockd cfq_kick_queue
task: c000000fef147400 ti: c000000feb384000 task.ti: c000000feb384000
NIP: c0000000005ab4a8 LR: c0000000005ab490 CTR: 0000000000000000
REGS: c000000feb387620 TRAP: 0700 Not tainted (4.4.0-rc1-54939-ge22a248-dirty)
MSR: 9000000100029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002228 XER: 00000000
CFAR: c000000000464950 SOFTE: 0
GPR00: c0000000005ab490 c000000feb3878a0 c000000000d77d00 0000000000000002
GPR04: c000000ff2030158 c000000ff47a0c00 0000000000000000 0000000000001000
GPR08: 0000000000000000 0000000000000001 0000000000000000 fffffffffffffff7
GPR12: 0000000000002200 c00000000fde5200 c0000000000c8098 c000000000b39858
GPR16: c000000000ae12c8 c000000000b39948 0000000000000000 0000000000000000
GPR20: c0000000005ab9c0 c0000017f5800144 0000000000000000 c000001e551a6850
GPR24: c0000017f5800140 c0000007efb9c800 0000000000000000 c000001e551a6800
GPR28: 0000000000000000 0000000000000000 c000000ff2030158 c000000feb420240
NIP [c0000000005ab4a8] scsi_init_sgtable+0xa8/0x180
LR [c0000000005ab490] scsi_init_sgtable+0x90/0x180
Call Trace:
[c000000feb3878a0] [c0000000005ab490] scsi_init_sgtable+0x90/0x180 (unreliable)
[c000000feb3878e0] [c0000000005ab5d4] scsi_init_io+0x54/0x160
[c000000feb387930] [c0000000005fb43c] sd_init_command+0x6c/0xb00
[c000000feb3879f0] [c0000000005ac2f8] scsi_setup_cmnd+0x108/0x1a0
[c000000feb387a30] [c0000000005ac594] scsi_prep_fn+0x104/0x1c0
[c000000feb387a70] [c00000000045d85c] blk_peek_request+0x20c/0x390
[c000000feb387af0] [c0000000005ad6a8] scsi_request_fn+0xb8/0x7f0
[c000000feb387bf0] [c0000000004583b4] __blk_run_queue+0x54/0x80
[c000000feb387c20] [c00000000047e178] cfq_kick_queue+0x38/0xc0
[c000000feb387c50] [c0000000000c00cc] process_one_work+0x2ac/0x560
[c000000feb387ce0] [c0000000000c0510] worker_thread+0x190/0x660
[c000000feb387d80] [c0000000000c8198] kthread+0x108/0x130
[c000000feb387e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac
Instruction dump:
2fa30000 409e00e4 e8bf0000 e87e0030 7fc4f378 4beb91ed 60000000 813f0008
78630020 7f891840 7d301026 5529effe <0b090000> 907f0008 38600000 813e005c
---[ end trace 3ea632a512ff48d5 ]---
On Wed, 2015-11-18 at 20:18 +1100, Michael Ellerman wrote:
> Hi folks,
>
> I'm intermittently seeing the following oops on at least one powerpc box.
>
> The BUG_ON() is from:
>
> static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb)
> {
> ...
> count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
> BUG_ON(count > sdb->table.nents);
>
> Looking at the dump it looks like count was 2; I can't work out what nents was.
>
> The machine's just a fairly boring bare metal setup, with a single IPR adapter:
>
> 0001:08:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 02)
> Subsystem: IBM PCIe3 x8 SAS RAID Internal Adapter 6Gb (57D7)
> Flags: bus master, fast devsel, latency 0
> Kernel driver in use: ipr
>
>
> Anyone seen it before or have any ideas?
I'm also seeing it on arm64 in 4.4-rc1
[ 6.859003] Call trace:
[ 6.861439] [<fffffe000074345c>] scsi_init_sgtable+0x84/0x88
[ 6.867072] [<fffffe00007434ac>] scsi_init_io+0x4c/0x1ac
[ 6.872358] [<fffffe000075004c>] sd_setup_read_write_cmnd+0x44/0x844
[ 6.878682] [<fffffe0000750884>] sd_init_command+0x38/0xb0
[ 6.884141] [<fffffe00007436e4>] scsi_setup_cmnd+0xd8/0x13c
[ 6.889686] [<fffffe000074397c>] scsi_prep_fn+0xc0/0x140
[ 6.894973] [<fffffe00003bd0cc>] blk_peek_request+0x148/0x24c
[ 6.900692] [<fffffe0000744754>] scsi_request_fn+0x58/0x648
[ 6.906237] [<fffffe00003b8e80>] __blk_run_queue+0x40/0x58
[ 6.911696] [<fffffe00003b91a4>] blk_run_queue+0x30/0x48
[ 6.916983] [<fffffe00007422e4>] scsi_run_queue+0x204/0x294
[ 6.922528] [<fffffe000074321c>] scsi_end_request+0x13c/0x1a0
[ 6.928247] [<fffffe0000744ef4>] scsi_io_completion+0xf0/0x564
[ 6.934052] [<fffffe000073beb0>] scsi_finish_command+0xe4/0x144
[ 6.939943] [<fffffe00007446cc>] scsi_softirq_done+0x148/0x178
[ 6.945748] [<fffffe00003c39a4>] blk_done_softirq+0x7c/0x94
[ 6.951295] [<fffffe00000c8f44>] __do_softirq+0x114/0x2a0
[ 6.956667] [<fffffe00000c9340>] irq_exit+0x8c/0xe4
[ 6.961522] [<fffffe000009cf00>] handle_IPI+0x170/0x228
[ 6.966721] [<fffffe00000904e0>] gic_handle_irq+0xa0/0xb8
[ 6.972093] Exception stack(0xfffffe03dc143de0 to 0xfffffe03dc143f00)
>
> cheers
>
>
> systemd[1]: Starting Uncomplicated firewall...
> Starting Uncomplicated firewall...
> ------------[ cut here ]------------
> kernel BUG at drivers/scsi/scsi_lib.c:1096!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=2048 NUMA PowerNV
> Modules linked in:
> CPU: 132 PID: 2699 Comm: kworker/132:1H Not tainted 4.4.0-rc1-54939-ge22a248-dirty #77
> Workqueue: kblockd cfq_kick_queue
> task: c000000fef147400 ti: c000000feb384000 task.ti: c000000feb384000
> NIP: c0000000005ab4a8 LR: c0000000005ab490 CTR: 0000000000000000
> REGS: c000000feb387620 TRAP: 0700 Not tainted (4.4.0-rc1-54939-ge22a248-dirty)
> MSR: 9000000100029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002228 XER: 00000000
> CFAR: c000000000464950 SOFTE: 0
> GPR00: c0000000005ab490 c000000feb3878a0 c000000000d77d00 0000000000000002
> GPR04: c000000ff2030158 c000000ff47a0c00 0000000000000000 0000000000001000
> GPR08: 0000000000000000 0000000000000001 0000000000000000 fffffffffffffff7
> GPR12: 0000000000002200 c00000000fde5200 c0000000000c8098 c000000000b39858
> GPR16: c000000000ae12c8 c000000000b39948 0000000000000000 0000000000000000
> GPR20: c0000000005ab9c0 c0000017f5800144 0000000000000000 c000001e551a6850
> GPR24: c0000017f5800140 c0000007efb9c800 0000000000000000 c000001e551a6800
> GPR28: 0000000000000000 0000000000000000 c000000ff2030158 c000000feb420240
> NIP [c0000000005ab4a8] scsi_init_sgtable+0xa8/0x180
> LR [c0000000005ab490] scsi_init_sgtable+0x90/0x180
> Call Trace:
> [c000000feb3878a0] [c0000000005ab490] scsi_init_sgtable+0x90/0x180 (unreliable)
> [c000000feb3878e0] [c0000000005ab5d4] scsi_init_io+0x54/0x160
> [c000000feb387930] [c0000000005fb43c] sd_init_command+0x6c/0xb00
> [c000000feb3879f0] [c0000000005ac2f8] scsi_setup_cmnd+0x108/0x1a0
> [c000000feb387a30] [c0000000005ac594] scsi_prep_fn+0x104/0x1c0
> [c000000feb387a70] [c00000000045d85c] blk_peek_request+0x20c/0x390
> [c000000feb387af0] [c0000000005ad6a8] scsi_request_fn+0xb8/0x7f0
> [c000000feb387bf0] [c0000000004583b4] __blk_run_queue+0x54/0x80
> [c000000feb387c20] [c00000000047e178] cfq_kick_queue+0x38/0xc0
> [c000000feb387c50] [c0000000000c00cc] process_one_work+0x2ac/0x560
> [c000000feb387ce0] [c0000000000c0510] worker_thread+0x190/0x660
> [c000000feb387d80] [c0000000000c8198] kthread+0x108/0x130
> [c000000feb387e30] [c0000000000095b0] ret_from_kernel_thread+0x5c/0xac
> Instruction dump:
> 2fa30000 409e00e4 e8bf0000 e87e0030 7fc4f378 4beb91ed 60000000 813f0008
> 78630020 7f891840 7d301026 5529effe <0b090000> 907f0008 38600000 813e005c
> ---[ end trace 3ea632a512ff48d5 ]---
On Wed, 2015-11-18 at 09:03 -0500, Mark Salter wrote:
> On Wed, 2015-11-18 at 20:18 +1100, Michael Ellerman wrote:
> > Hi folks,
> >
> > I'm intermittently seeing the following oops on at least one powerpc box.
> >
> > The BUG_ON() is from:
> >
> > static int scsi_init_sgtable(struct request *req, struct scsi_data_buffer *sdb)
> > {
> > ...
> > count = blk_rq_map_sg(req->q, req, sdb->table.sgl);
> > BUG_ON(count > sdb->table.nents);
> >
> > Looking at the dump it looks like count was 2; I can't work out what nents was.
> >
> > The machine's just a fairly boring bare metal setup, with a single IPR adapter:
> >
> > 0001:08:00.0 RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) (rev 02)
> > Subsystem: IBM PCIe3 x8 SAS RAID Internal Adapter 6Gb (57D7)
> > Flags: bus master, fast devsel, latency 0
> > Kernel driver in use: ipr
> >
> >
> > Anyone seen it before or have any ideas?
>
> I'm also seeing it on arm64 in 4.4-rc1
Ah thanks, that's a good data point. I was assuming it was a driver bug, but I
take it you're not using IPR :)
> [ 6.859003] Call trace:
> [ 6.861439] [<fffffe000074345c>] scsi_init_sgtable+0x84/0x88
> [ 6.867072] [<fffffe00007434ac>] scsi_init_io+0x4c/0x1ac
> [ 6.872358] [<fffffe000075004c>] sd_setup_read_write_cmnd+0x44/0x844
> [ 6.878682] [<fffffe0000750884>] sd_init_command+0x38/0xb0
> [ 6.884141] [<fffffe00007436e4>] scsi_setup_cmnd+0xd8/0x13c
> [ 6.889686] [<fffffe000074397c>] scsi_prep_fn+0xc0/0x140
> [ 6.894973] [<fffffe00003bd0cc>] blk_peek_request+0x148/0x24c
> [ 6.900692] [<fffffe0000744754>] scsi_request_fn+0x58/0x648
> [ 6.906237] [<fffffe00003b8e80>] __blk_run_queue+0x40/0x58
> [ 6.911696] [<fffffe00003b91a4>] blk_run_queue+0x30/0x48
> [ 6.916983] [<fffffe00007422e4>] scsi_run_queue+0x204/0x294
> [ 6.922528] [<fffffe000074321c>] scsi_end_request+0x13c/0x1a0
> [ 6.928247] [<fffffe0000744ef4>] scsi_io_completion+0xf0/0x564
> [ 6.934052] [<fffffe000073beb0>] scsi_finish_command+0xe4/0x144
> [ 6.939943] [<fffffe00007446cc>] scsi_softirq_done+0x148/0x178
> [ 6.945748] [<fffffe00003c39a4>] blk_done_softirq+0x7c/0x94
> [ 6.951295] [<fffffe00000c8f44>] __do_softirq+0x114/0x2a0
> [ 6.956667] [<fffffe00000c9340>] irq_exit+0x8c/0xe4
> [ 6.961522] [<fffffe000009cf00>] handle_IPI+0x170/0x228
> [ 6.966721] [<fffffe00000904e0>] gic_handle_irq+0xa0/0xb8
> [ 6.972093] Exception stack(0xfffffe03dc143de0 to 0xfffffe03dc143f00)
cheers
It's pretty much guaranteed to be a block layer bug, most likely in the
merge bios to request infrastructure where we don't obey the merging
limits properly.
Does either of you have a known good and first known bad kernel?
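For reference, the merge-side accounting in question looks roughly like this
in the 4.4 tree (an abbreviated sketch of ll_merge_requests_fn() in
block/blk-merge.c, with the early checks omitted, so treat it as a sketch
rather than the exact source):

static int ll_merge_requests_fn(struct request_queue *q, struct request *req,
				struct request *next)
{
	int total_phys_segments;
	unsigned int seg_size =
		req->biotail->bi_seg_back_size + next->bio->bi_seg_front_size;

	total_phys_segments = req->nr_phys_segments + next->nr_phys_segments;

	/* If the bvecs at the boundary can live in one physical segment,
	 * the merged request needs one segment fewer. */
	if (blk_phys_contig_segment(q, req->biotail, next->bio)) {
		if (req->nr_phys_segments == 1)
			req->bio->bi_seg_front_size = seg_size;
		if (next->nr_phys_segments == 1)
			next->biotail->bi_seg_back_size = seg_size;
		total_phys_segments--;
	}

	if (total_phys_segments > queue_max_segments(q))
		return 0;

	/* This count is what scsi_alloc_sgtable() later sizes the
	 * scatterlist from, so it must never be an undercount. */
	req->nr_phys_segments = total_phys_segments;
	return 1;
}

If blk_phys_contig_segment() claims two boundary segments can be clustered
when they can't, nr_phys_segments ends up one short and the BUG_ON() above
fires when blk_rq_map_sg() emits the extra entry.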
On 11/19/2015 09:23 AM, Christoph Hellwig wrote:
> It's pretty much guaranteed to be a block layer bug, most likely in the
> merge bios to request infrastructure where we don't obey the merging
> limits properly.
>
> Does either of you have a known good and first known bad kernel?
Well, I have been fighting a similar issue for several months now,
albeit with multipath enabled. Haven't had much progress with this,
sadly.
Seeing that this is our distro kernel it might or might not be
related; however, as the symptoms are identical there still is a
chance that this is actually a generic block-layer problem.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> It's pretty much guaranteed to be a block layer bug, most likely in the
> merge bios to request infrastructure where we don't obey the merging
> limits properly.
>
> Does either of you have a known good and first known bad kernel?
Not me, I've only hit it one or two times. All I can say is I have hit it in
4.4-rc1.
Laurent, can you narrow it down at all?
cheers
On 20/11/2015 13:10, Michael Ellerman wrote:
> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>
>> It's pretty much guaranteed to be a block layer bug, most likely in the
>> merge bios to request infrastructure where we don't obey the merging
>> limits properly.
>>
>> Does either of you have a known good and first known bad kernel?
>
> Not me, I've only hit it one or two times. All I can say is I have hit it in
> 4.4-rc1.
>
> Laurent, can you narrow it down at all?
I'm trying to bisect it but it's very time consuming, and I think I missed
some bad kernels because I didn't give the system enough time to panic.
But I'm still at it...
Cheers
On Fri, 2015-11-20 at 13:56 +0100, Laurent Dufour wrote:
> On 20/11/2015 13:10, Michael Ellerman wrote:
> > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> >
> > > It's pretty much guaranteed to be a block layer bug, most likely in the
> > > merge bios to request infrastructure where we don't obey the merging
> > > limits properly.
> > >
> > > Does either of you have a known good and first known bad kernel?
> >
> > Not me, I've only hit it one or two times. All I can say is I have hit it in
> > 4.4-rc1.
> >
> > Laurent, can you narrow it down at all?
>
> I'm trying to bisect it but it's very time consuming, and I think I missed
> some bad kernels because I didn't give the system enough time to panic.
>
> But I'm still at it...
>
I usually hit it at boot time. Maybe one out of five times. I'll take a stab
at bisecting today...
On Thu, 2015-11-19 at 16:35 +0100, Hannes Reinecke wrote:
> On 11/19/2015 09:23 AM, Christoph Hellwig wrote:
> > It's pretty much guaranteed to be a block layer bug, most likely in the
> > merge bios to request infrastructure where we don't obey the merging
> > limits properly.
> >
> > Does either of you have a known good and first known bad kernel?
>
> Well, I have been fighting a similar issue for several months now,
> albeit with multipath enabled. Haven't had much progress with this,
> sadly.
> Seeing that this is our distro kernel it might or might not be
> related; however, as the symptoms are identical there still is a
> chance that this is actually a generic block-layer problem.
>
> Cheers,
>
> Hannes
We have seen this also. (e.g. req->nr_phys_segments was 3, but
blk_rq_map_sg() returned 4.) I was suspicious of the patch:
bio: modify __bio_add_page() to accept pages that don't start a new segment
But we put some debugging code in and didn't hit it. We haven't
found the problem yet, either, though. We're still looking.
As Christoph said, it would seem to be a problem with the block layer
merging.
The API for this seems defective, in that blk_rq_map_sg() should
never return a value indicating that it wrote past the end of the
supplied SG array and then depend on the caller to check it.
(We could get data corruption on another I/O if it used adjacent
memory for a different SG list, for example.)
-Ewan
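A bounded variant of the kind described above might look like this
(a hypothetical sketch only: blk_rq_map_sg_bounded() is not a real kernel
interface, and segment clustering is ignored for brevity):

int blk_rq_map_sg_bounded(struct request_queue *q, struct request *rq,
			  struct scatterlist *sglist, unsigned int nents)
{
	unsigned int count = 0;
	struct req_iterator iter;
	struct bio_vec bvec;

	rq_for_each_segment(bvec, rq, iter) {
		/* Simplified: one sg entry per bvec, no clustering. */
		if (count >= nents)
			/* Caller can requeue instead of corrupting memory. */
			return -EOVERFLOW;
		sg_set_page(&sglist[count++], bvec.bv_page,
			    bvec.bv_len, bvec.bv_offset);
	}
	return count;
}

The point is the shape of the interface: the overrun is detected before the
write happens, rather than reported after the fact.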
On 11/20/2015 03:38 PM, Ewan Milne wrote:
> On Thu, 2015-11-19 at 16:35 +0100, Hannes Reinecke wrote:
>> On 11/19/2015 09:23 AM, Christoph Hellwig wrote:
>>> It's pretty much guaranteed to be a block layer bug, most likely in the
>>> merge bios to request infrastructure where we don't obey the merging
>>> limits properly.
>>>
>>> Does either of you have a known good and first known bad kernel?
>>
>> Well, I have been fighting a similar issue for several months now,
>> albeit with multipath enabled. Haven't had much progress with this,
>> sadly.
>> Seeing that this is our distro kernel it might or might not be
>> related; however, as the symptoms are identical there still is a
>> chance that this is actually a generic block-layer problem.
>>
>> Cheers,
>>
>> Hannes
>
> We have seen this also. (e.g. req->nr_phys_segments was 3, but
> blk_rq_map_sg() returned 4.) I was suspicious of the patch:
>
> bio: modify __bio_add_page() to accept pages that don't start a new segment
>
> But we put some debugging code in and didn't hit it. We haven't
> found the problem yet, either, though. We're still looking.
>
Can't we have a joint effort here?
I've been spending a _LOT_ of time trying to debug things here, but
none of the ideas I've come up with have been able to fix anything.
I'm almost tempted to increase the count from scsi_alloc_sgtable()
by one and be done with it ...
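Concretely, that would be a one-line change against the 4.4
scsi_init_sgtable() (an illustrative sketch only; it papers over an
off-by-one undercount rather than fixing it):

-	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments,
+	if (unlikely(scsi_alloc_sgtable(sdb, req->nr_phys_segments + 1,
 					req->mq_ctx != NULL)))
 		return BLKPREP_DEFER;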
> As Christoph said, it would seem to be a problem with the block layer
> merging.
>
> The API for this seems defective, in that blk_rq_map_sg() should
> never return a value indicating that it wrote past the end of the
> supplied SG array and then depend on the caller to check it.
> (We could get data corruption on another I/O if it used adjacent
> memory for a different SG list, for example.)
>
Yeah, the API is bloody useless.
By the time you hit the BUG_ON you've already caused a memory
corruption, so no way you can recover there.
At the very least we should be passing the sg list count into
blk_rq_map_sg(), but that's a core block-layer API and changes here
would require changes to quite a few drivers. Plus it wouldn't
help me for a distribution kernel ...
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> Can't we have a joint effort here?
> I've been spending a _LOT_ of time trying to debug things here, but
> none of the ideas I've come up with have been able to fix anything.
Yes. I'm not the one primarily looking at it, and we don't have a
reproducer in-house. We just have the one dump right now.
>
> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> by one and be done with it ...
>
That might not fix it if it is a problem with the merge code, though.
On 20/11/2015 13:10, Michael Ellerman wrote:
> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>
>> It's pretty much guaranteed to be a block layer bug, most likely in the
>> merge bios to request infrastructure where we don't obey the merging
>> limits properly.
>>
>> Does either of you have a known good and first known bad kernel?
>
> Not me, I've only hit it one or two times. All I can say is I have hit it in
> 4.4-rc1.
>
> Laurent, can you narrow it down at all?
It seems that the panic is triggered by commit bdced438acd8 ("block:
setup bi_phys_segments after splitting"), which was pulled in by merge
d9734e0d1ccf ("Merge branch 'for-4.4/core' of
git://git.kernel.dk/linux-block").
My system panics promptly when running a kernel built at
d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
hours without panicking.
This being said, I can't explain what's going wrong.
Maybe Ming can shed some light here?
Cheers
On Sat, 21 Nov 2015 12:30:14 +0100
Laurent Dufour <[email protected]> wrote:
> On 20/11/2015 13:10, Michael Ellerman wrote:
> > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> >
> >> It's pretty much guaranteed to be a block layer bug, most likely in the
> >> merge bios to request infrastructure where we don't obey the merging
> >> limits properly.
> >>
> >> Does either of you have a known good and first known bad kernel?
> >
> > Not me, I've only hit it one or two times. All I can say is I have hit it in
> > 4.4-rc1.
> >
> > Laurent, can you narrow it down at all?
>
> It seems that the panic is triggered by commit bdced438acd8 ("block:
> setup bi_phys_segments after splitting"), which was pulled in by merge
> d9734e0d1ccf ("Merge branch 'for-4.4/core' of
> git://git.kernel.dk/linux-block").
>
> My system panics promptly when running a kernel built at
> d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
> hours without panicking.
>
> This being said, I can't explain what's going wrong.
>
> Maybe Ming can shed some light here?
Laurent, looks there is one bug in blk_bio_segment_split(), would you
mind testing the following patch to see if it fixes your issue?
---
From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
From: Ming Lei <[email protected]>
Date: Sun, 22 Nov 2015 00:47:13 +0800
Subject: [PATCH] block: fix segment split
Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
always points to the iterator local variable, which is obviously
wrong, so fix it by pointing to the local variable of 'bvprv'.
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index de5716d8..f2efe8a 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
seg_size += bv.bv_len;
bvprv = bv;
- bvprvp = &bv;
+ bvprvp = &bvprv;
sectors += bv.bv_len >> 9;
continue;
}
@@ -108,7 +108,7 @@ new_segment:
nsegs++;
bvprv = bv;
- bvprvp = &bv;
+ bvprvp = &bvprv;
seg_size = bv.bv_len;
sectors += bv.bv_len >> 9;
}
--
1.9.1
Thanks,
Ming
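To see why pointing at the iterator variable matters, here is a minimal
user-space illustration (hypothetical simplified types, not kernel code):
the "previous bvec" pointer ends up aliasing the current element, so every
prev-vs-current comparison degenerates into comparing an element with itself.

#include <stdio.h>

struct bvec { unsigned long page; unsigned int len, offset; };

int main(void)
{
	struct bvec bvs[3] = { { 1, 4096, 0 }, { 2, 4096, 0 }, { 3, 4096, 0 } };
	struct bvec bv, bvprv, *bvprvp = NULL;
	int i;

	for (i = 0; i < 3; i++) {
		bv = bvs[i];		/* like bio_for_each_segment() */
		if (bvprvp)		/* stands in for the merge checks */
			printf("prev page %lu vs cur page %lu\n",
			       bvprvp->page, bv.page);
		bvprv = bv;
		bvprvp = &bv;	/* bug: aliases the loop variable; the
				 * patch's &bvprv is the fix */
	}
	return 0;
}

With &bv this prints "prev page 2 vs cur page 2" and "prev page 3 vs cur
page 3"; with &bvprv it would print the true previous pages (1 and 2).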
On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
> On Sat, 21 Nov 2015 12:30:14 +0100
> Laurent Dufour <[email protected]> wrote:
>
> > On 20/11/2015 13:10, Michael Ellerman wrote:
> > > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> > >
> > > > It's pretty much guaranteed to be a block layer bug, most likely in the
> > > > merge bios to request infrastructure where we don't obey the merging
> > > > limits properly.
> > > >
> > > > Does either of you have a known good and first known bad kernel?
> > >
> > > Not me, I've only hit it one or two times. All I can say is I have hit it in
> > > 4.4-rc1.
> > >
> > > Laurent, can you narrow it down at all?
> >
> > It seems that the panic is triggered by commit bdced438acd8 ("block:
> > setup bi_phys_segments after splitting"), which was pulled in by merge
> > d9734e0d1ccf ("Merge branch 'for-4.4/core' of
> > git://git.kernel.dk/linux-block").
> >
> > My system panics promptly when running a kernel built at
> > d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
> > hours without panicking.
> >
> > This being said, I can't explain what's going wrong.
> >
> > Maybe Ming can shed some light here?
>
> Laurent, looks there is one bug in blk_bio_segment_split(), would you
> mind testing the following patch to see if it fixes your issue?
>
> ---
> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
> From: Ming Lei <[email protected]>
> Date: Sun, 22 Nov 2015 00:47:13 +0800
> Subject: [PATCH] block: fix segment split
>
> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
> always points to the iterator local variable, which is obviously
> wrong, so fix it by pointing to the local variable of 'bvprv'.
>
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-merge.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index de5716d8..f2efe8a 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>
> seg_size += bv.bv_len;
> bvprv = bv;
> - bvprvp = &bv;
> + bvprvp = &bvprv;
> sectors += bv.bv_len >> 9;
> continue;
> }
> @@ -108,7 +108,7 @@ new_segment:
>
> nsegs++;
> bvprv = bv;
> - bvprvp = &bv;
> + bvprvp = &bvprv;
> seg_size = bv.bv_len;
> sectors += bv.bv_len >> 9;
> }
I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
On Mon, Nov 23, 2015 at 7:20 AM, Mark Salter <[email protected]> wrote:
> On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
>> On Sat, 21 Nov 2015 12:30:14 +0100
>> Laurent Dufour <[email protected]> wrote:
>>
>> > On 20/11/2015 13:10, Michael Ellerman wrote:
>> > > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>> > >
>> > > > It's pretty much guaranteed to be a block layer bug, most likely in the
>> > > > merge bios to request infrastructure where we don't obey the merging
>> > > > limits properly.
>> > > >
>> > > > Does either of you have a known good and first known bad kernel?
>> > >
>> > > Not me, I've only hit it one or two times. All I can say is I have hit it in
>> > > 4.4-rc1.
>> > >
>> > > Laurent, can you narrow it down at all?
>> >
>> > It seems that the panic is triggered by commit bdced438acd8 ("block:
>> > setup bi_phys_segments after splitting"), which was pulled in by merge
>> > d9734e0d1ccf ("Merge branch 'for-4.4/core' of
>> > git://git.kernel.dk/linux-block").
>> >
>> > My system panics promptly when running a kernel built at
>> > d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
>> > hours without panicking.
>> >
>> > This being said, I can't explain what's going wrong.
>> >
>> > Maybe Ming can shed some light here?
>>
>> Laurent, looks there is one bug in blk_bio_segment_split(), would you
>> mind testing the following patch to see if it fixes your issue?
>>
>> ---
>> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
>> From: Ming Lei <[email protected]>
>> Date: Sun, 22 Nov 2015 00:47:13 +0800
>> Subject: [PATCH] block: fix segment split
>>
>> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
>> always points to the iterator local variable, which is obviously
>> wrong, so fix it by pointing to the local variable of 'bvprv'.
>>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> block/blk-merge.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index de5716d8..f2efe8a 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>
>> seg_size += bv.bv_len;
>> bvprv = bv;
>> - bvprvp = &bv;
>> + bvprvp = &bvprv;
>> sectors += bv.bv_len >> 9;
>> continue;
>> }
>> @@ -108,7 +108,7 @@ new_segment:
>>
>> nsegs++;
>> bvprv = bv;
>> - bvprvp = &bv;
>> + bvprvp = &bvprv;
>> seg_size = bv.bv_len;
>> sectors += bv.bv_len >> 9;
>> }
>
> I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
OK, it looks like there are still other bugs. Care to share how to reproduce
it on arm64?
thanks,
Ming
On Mon, 2015-11-23 at 08:36 +0800, Ming Lei wrote:
> On Mon, Nov 23, 2015 at 7:20 AM, Mark Salter <[email protected]> wrote:
> > On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
> > > On Sat, 21 Nov 2015 12:30:14 +0100
> > > Laurent Dufour <[email protected]> wrote:
> > >
> > > > On 20/11/2015 13:10, Michael Ellerman wrote:
> > > > > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> > > > >
> > > > > > It's pretty much guaranteed to be a block layer bug, most likely in the
> > > > > > merge bios to request infrastructure where we don't obey the merging
> > > > > > limits properly.
> > > > > >
> > > > > > Does either of you have a known good and first known bad kernel?
> > > > >
> > > > > Not me, I've only hit it one or two times. All I can say is I have hit it in
> > > > > 4.4-rc1.
> > > > >
> > > > > Laurent, can you narrow it down at all?
> > > >
> > > > It seems that the panic is triggered by commit bdced438acd8 ("block:
> > > > setup bi_phys_segments after splitting"), which was pulled in by merge
> > > > d9734e0d1ccf ("Merge branch 'for-4.4/core' of
> > > > git://git.kernel.dk/linux-block").
> > > >
> > > > My system panics promptly when running a kernel built at
> > > > d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
> > > > hours without panicking.
> > > >
> > > > This being said, I can't explain what's going wrong.
> > > >
> > > > Maybe Ming can shed some light here?
> > >
> > > Laurent, looks there is one bug in blk_bio_segment_split(), would you
> > > mind testing the following patch to see if it fixes your issue?
> > >
> > > ---
> > > From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
> > > From: Ming Lei <[email protected]>
> > > Date: Sun, 22 Nov 2015 00:47:13 +0800
> > > Subject: [PATCH] block: fix segment split
> > >
> > > Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
> > > always points to the iterator local variable, which is obviously
> > > wrong, so fix it by pointing to the local variable of 'bvprv'.
> > >
> > > Signed-off-by: Ming Lei <[email protected]>
> > > ---
> > > block/blk-merge.c | 4 ++--
> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > > index de5716d8..f2efe8a 100644
> > > --- a/block/blk-merge.c
> > > +++ b/block/blk-merge.c
> > > @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> > >
> > > seg_size += bv.bv_len;
> > > bvprv = bv;
> > > - bvprvp = &bv;
> > > + bvprvp = &bvprv;
> > > sectors += bv.bv_len >> 9;
> > > continue;
> > > }
> > > @@ -108,7 +108,7 @@ new_segment:
> > >
> > > nsegs++;
> > > bvprv = bv;
> > > - bvprvp = &bv;
> > > + bvprvp = &bvprv;
> > > seg_size = bv.bv_len;
> > > sectors += bv.bv_len >> 9;
> > > }
> >
> > I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
>
> OK, looks there are still other bugs, care to share us how to reproduce
> it on arm64?
>
> thanks,
> Ming
Unfortunately, the best reproducer I have is to boot the platform. I have seen the
BUG a few times post-boot, but I don't have a consistent reproducer. I am using
upstream 4.4-rc1 with this config:
http://people.redhat.com/msalter/fh_defconfig
With 4.4-rc1 on an APM Mustang platform, I see the BUG about once every 6-7 boots.
On an AMD Seattle platform, about every 9 boots.
I have a script that loops through an ssh command to reboot the platform under test.
I manually install test kernels and then run the script and wait for failure. While
debugging, I have tried more minimal configs with which I have been unable to
reproduce the problem even after several hours of reboots. With the above mentioned
fh_defconfig, I have been able to get a failure within 20 or so boots with most
kernel builds but at certain kernel commits, the failure has taken a longer time to
reproduce.
From my POV, I can't say which commit causes the problem. So far, I have not been
able to reproduce at all before commit d9734e0d1ccf but I am currently trying to
reproduce with commit 0d51ce9ca1116 (one merge earlier than d9734e0d1ccf).
Hi Mark,
On Mon, Nov 23, 2015 at 9:50 AM, Mark Salter <[email protected]> wrote:
> On Mon, 2015-11-23 at 08:36 +0800, Ming Lei wrote:
>> On Mon, Nov 23, 2015 at 7:20 AM, Mark Salter <[email protected]> wrote:
>> > On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
>> > > On Sat, 21 Nov 2015 12:30:14 +0100
>> > > Laurent Dufour <[email protected]> wrote:
>> > >
>> > > > On 20/11/2015 13:10, Michael Ellerman wrote:
>> > > > > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>> > > > >
>> > > > > > It's pretty much guaranteed to be a block layer bug, most likely in the
>> > > > > > merge bios to request infrastructure where we don't obey the merging
>> > > > > > limits properly.
>> > > > > >
>> > > > > > Does either of you have a known good and first known bad kernel?
>> > > > >
>> > > > > Not me, I've only hit it one or two times. All I can say is I have hit it in
>> > > > > 4.4-rc1.
>> > > > >
>> > > > > Laurent, can you narrow it down at all?
>> > > >
>> > > > It seems that the panic is triggered by commit bdced438acd8 ("block:
>> > > > setup bi_phys_segments after splitting"), which was pulled in by merge
>> > > > d9734e0d1ccf ("Merge branch 'for-4.4/core' of
>> > > > git://git.kernel.dk/linux-block").
>> > > >
>> > > > My system panics promptly when running a kernel built at
>> > > > d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
>> > > > hours without panicking.
>> > > >
>> > > > This being said, I can't explain what's going wrong.
>> > > >
>> > > > Maybe Ming can shed some light here?
>> > >
>> > > Laurent, looks there is one bug in blk_bio_segment_split(), would you
>> > > mind testing the following patch to see if it fixes your issue?
>> > >
>> > > ---
>> > > From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
>> > > From: Ming Lei <[email protected]>
>> > > Date: Sun, 22 Nov 2015 00:47:13 +0800
>> > > Subject: [PATCH] block: fix segment split
>> > >
>> > > Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
>> > > always points to the iterator local variable, which is obviously
>> > > wrong, so fix it by pointing to the local variable of 'bvprv'.
>> > >
>> > > Signed-off-by: Ming Lei <[email protected]>
>> > > ---
>> > > block/blk-merge.c | 4 ++--
>> > > 1 file changed, 2 insertions(+), 2 deletions(-)
>> > >
>> > > diff --git a/block/blk-merge.c b/block/blk-merge.c
>> > > index de5716d8..f2efe8a 100644
>> > > --- a/block/blk-merge.c
>> > > +++ b/block/blk-merge.c
>> > > @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>> > >
>> > > seg_size += bv.bv_len;
>> > > bvprv = bv;
>> > > - bvprvp = &bv;
>> > > + bvprvp = &bvprv;
>> > > sectors += bv.bv_len >> 9;
>> > > continue;
>> > > }
>> > > @@ -108,7 +108,7 @@ new_segment:
>> > >
>> > > nsegs++;
>> > > bvprv = bv;
>> > > - bvprvp = &bv;
>> > > + bvprvp = &bvprv;
>> > > seg_size = bv.bv_len;
>> > > sectors += bv.bv_len >> 9;
>> > > }
>> >
>> > I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
>>
>> OK, it looks like there are still other bugs. Care to share how to reproduce
>> it on arm64?
>>
>> thanks,
>> Ming
>
> Unfortunately, the best reproducer I have is to boot the platform. I have seen the
> BUG a few times post-boot, but I don't have a consistent reproducer. I am using
> upstream 4.4-rc1 with this config:
>
> http://people.redhat.com/msalter/fh_defconfig
>
> With 4.4-rc1 on an APM Mustang platform, I see the BUG about once every 6-7 boots.
> On an AMD Seattle platform, about every 9 boots.
Thanks for the input, and I will try to reproduce the issue on mustang with
your kernel config.
>
> I have a script that loops through an ssh command to reboot the platform under test.
> I manually install test kernels and then run the script and wait for failure. While
> debugging, I have tried more minimal configs with which I have been unable to
> reproduce the problem even after several hours of reboots. With the above mentioned
> fh_defconfig, I have been able to get a failure within 20 or so boots with most
> kernel builds but at certain kernel commits, the failure has taken a longer time to
> reproduce.
>
> From my POV, I can't say which commit causes the problem. So far, I have not been
> able to reproduce at all before commit d9734e0d1ccf but I am currently trying to
> reproduce with commit 0d51ce9ca1116 (one merge earlier than d9734e0d1ccf).
The patch fixing 'bvprvp' had better be included for the test, because
that issue may have a big effect on computing the physical segment count.
I would also appreciate it if you, Laurent, or anyone else could provide
a debug log, captured with the attached debug patch, when this issue is
triggered.
Thanks,
On 11/20/2015 04:28 PM, Ewan Milne wrote:
> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>> Can't we have a joint effort here?
>> I've been spending a _LOT_ of time trying to debug things here, but
>> none of the ideas I've come up with have been able to fix anything.
>
> Yes. I'm not the one primarily looking at it, and we don't have a
> reproducer in-house. We just have the one dump right now.
>
Oh, I got plenty of them :-(
>>
>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>> by one and be done with it ...
>>
>
> That might not fix it if it is a problem with the merge code, though.
>
Of course not. But it'll be a band-aid to keep the customer happy.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 23/11/2015 00:20, Mark Salter wrote:
> On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
>> On Sat, 21 Nov 2015 12:30:14 +0100
>> Laurent Dufour <[email protected]> wrote:
>>
>>> On 20/11/2015 13:10, Michael Ellerman wrote:
>>>> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>>>>
>>>>> It's pretty much guaranteed to be a block layer bug, most likely in the
>>>>> merge bios to request infrastructure where we don't obey the merging
>>>>> limits properly.
>>>>>
>>>>> Does either of you have a known good and first known bad kernel?
>>>>
>>>> Not me, I've only hit it one or two times. All I can say is I have hit it in
>>>> 4.4-rc1.
>>>>
>>>> Laurent, can you narrow it down at all?
>>>
>>> It seems that the panic is triggered by commit bdced438acd8 ("block:
>>> setup bi_phys_segments after splitting"), which was pulled in by merge
>>> d9734e0d1ccf ("Merge branch 'for-4.4/core' of
>>> git://git.kernel.dk/linux-block").
>>>
>>> My system panics promptly when running a kernel built at
>>> d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
>>> hours without panicking.
>>>
>>> This being said, I can't explain what's going wrong.
>>>
>>> Maybe Ming can shed some light here?
>>
>> Laurent, looks there is one bug in blk_bio_segment_split(), would you
>> mind testing the following patch to see if it fixes your issue?
>>
>> ---
>> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
>> From: Ming Lei <[email protected]>
>> Date: Sun, 22 Nov 2015 00:47:13 +0800
>> Subject: [PATCH] block: fix segment split
>>
>> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
>> always points to the iterator local variable, which is obviously
>> wrong, so fix it by pointing to the local variable of 'bvprv'.
>>
>> Signed-off-by: Ming Lei <[email protected]>
>> ---
>> block/blk-merge.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>> index de5716d8..f2efe8a 100644
>> --- a/block/blk-merge.c
>> +++ b/block/blk-merge.c
>> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>
>> seg_size += bv.bv_len;
>> bvprv = bv;
>> - bvprvp = &bv;
>> + bvprvp = &bvprv;
>> sectors += bv.bv_len >> 9;
>> continue;
>> }
>> @@ -108,7 +108,7 @@ new_segment:
>>
>> nsegs++;
>> bvprv = bv;
>> - bvprvp = &bv;
>> + bvprvp = &bvprv;
>> seg_size = bv.bv_len;
>> sectors += bv.bv_len >> 9;
>> }
>
> I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
On my side, with the patch applied on top of 4.4-rc1, I can't get the
panic anymore.
On 23/11/2015:02:57:19 PM, Laurent Dufour wrote:
> On 23/11/2015 00:20, Mark Salter wrote:
> > On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
> >> On Sat, 21 Nov 2015 12:30:14 +0100
> >> Laurent Dufour <[email protected]> wrote:
> >>
> >>> On 20/11/2015 13:10, Michael Ellerman wrote:
> >>>> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> >>>>
> >>>>> It's pretty much guaranteed to be a block layer bug, most likely in the
> >>>>> merge bios to request infrastructure where we don't obey the merging
> >>>>> limits properly.
> >>>>>
> >>>>> Does either of you have a known good and first known bad kernel?
> >>>>
> >>>> Not me, I've only hit it one or two times. All I can say is I have hit it in
> >>>> 4.4-rc1.
> >>>>
> >>>> Laurent, can you narrow it down at all?
> >>>
> >>> It seems that the panic is triggered by commit bdced438acd8 ("block:
> >>> setup bi_phys_segments after splitting"), which was pulled in by merge
> >>> d9734e0d1ccf ("Merge branch 'for-4.4/core' of
> >>> git://git.kernel.dk/linux-block").
> >>>
> >>> My system panics promptly when running a kernel built at
> >>> d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
> >>> hours without panicking.
> >>>
> >>> This being said, I can't explain what's going wrong.
> >>>
> >>> Maybe Ming can shed some light here?
> >>
> >> Laurent, looks there is one bug in blk_bio_segment_split(), would you
> >> mind testing the following patch to see if it fixes your issue?
> >>
> >> ---
> >> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
> >> From: Ming Lei <[email protected]>
> >> Date: Sun, 22 Nov 2015 00:47:13 +0800
> >> Subject: [PATCH] block: fix segment split
> >>
> >> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
> >> always points to the iterator local variable, which is obviously
> >> wrong, so fix it by pointing to the local variable of 'bvprv'.
> >>
> >> Signed-off-by: Ming Lei <[email protected]>
> >> ---
> >> block/blk-merge.c | 4 ++--
> >> 1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> index de5716d8..f2efe8a 100644
> >> --- a/block/blk-merge.c
> >> +++ b/block/blk-merge.c
> >> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >>
> >> seg_size += bv.bv_len;
> >> bvprv = bv;
> >> - bvprvp = &bv;
> >> + bvprvp = &bvprv;
> >> sectors += bv.bv_len >> 9;
> >> continue;
> >> }
> >> @@ -108,7 +108,7 @@ new_segment:
> >>
> >> nsegs++;
> >> bvprv = bv;
> >> - bvprvp = &bv;
> >> + bvprvp = &bvprv;
> >> seg_size = bv.bv_len;
> >> sectors += bv.bv_len >> 9;
> >> }
> >
> > I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
>
> On my side, with the patch applied on top of 4.4-rc1, I can't get the
> panic anymore.
git bisect shows:
bdced438acd83ad83a6c6fc7f50099b820245ddb is the first bad commit
commit bdced438acd83ad83a6c6fc7f50099b820245ddb
Author: Ming Lei <[email protected]>
Date: Tue Oct 20 23:13:52 2015 +0800
block: setup bi_phys_segments after splitting
Reverting the above commit on top of 4.4-rc1 seems to fix the problem for me.
~Pratyush
On 23/11/2015 16:13, Pratyush Anand wrote:
> On 23/11/2015:02:57:19 PM, Laurent Dufour wrote:
>> On 23/11/2015 00:20, Mark Salter wrote:
>>> On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
>>>> On Sat, 21 Nov 2015 12:30:14 +0100
>>>> Laurent Dufour <[email protected]> wrote:
>>>>
>>>>> On 20/11/2015 13:10, Michael Ellerman wrote:
>>>>>> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>>>>>>
>>>>>>> It's pretty much guaranteed to be a block layer bug, most likely in the
>>>>>>> merge bios to request infrastructure where we don't obey the merging
>>>>>>> limits properly.
>>>>>>>
>>>>>>> Does either of you have a known good and first known bad kernel?
>>>>>>
>>>>>> Not me, I've only hit it one or two times. All I can say is I have hit it in
>>>>>> 4.4-rc1.
>>>>>>
>>>>>> Laurent, can you narrow it down at all?
>>>>>
>>>>> It seems that the panic is triggered by commit bdced438acd8 ("block:
>>>>> setup bi_phys_segments after splitting"), which was pulled in by merge
>>>>> d9734e0d1ccf ("Merge branch 'for-4.4/core' of
>>>>> git://git.kernel.dk/linux-block").
>>>>>
>>>>> My system panics promptly when running a kernel built at
>>>>> d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
>>>>> hours without panicking.
>>>>>
>>>>> This being said, I can't explain what's going wrong.
>>>>>
>>>>> Maybe Ming can shed some light here?
>>>>
>>>> Laurent, looks there is one bug in blk_bio_segment_split(), would you
>>>> mind testing the following patch to see if it fixes your issue?
>>>>
>>>> ---
>>>> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
>>>> From: Ming Lei <[email protected]>
>>>> Date: Sun, 22 Nov 2015 00:47:13 +0800
>>>> Subject: [PATCH] block: fix segment split
>>>>
>>>> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
>>>> always points to the iterator local variable, which is obviously
>>>> wrong, so fix it by pointing to the local variable of 'bvprv'.
>>>>
>>>> Signed-off-by: Ming Lei <[email protected]>
>>>> ---
>>>> block/blk-merge.c | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>>> index de5716d8..f2efe8a 100644
>>>> --- a/block/blk-merge.c
>>>> +++ b/block/blk-merge.c
>>>> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>>>
>>>> seg_size += bv.bv_len;
>>>> bvprv = bv;
>>>> - bvprvp = &bv;
>>>> + bvprvp = &bvprv;
>>>> sectors += bv.bv_len >> 9;
>>>> continue;
>>>> }
>>>> @@ -108,7 +108,7 @@ new_segment:
>>>>
>>>> nsegs++;
>>>> bvprv = bv;
>>>> - bvprvp = &bv;
>>>> + bvprvp = &bvprv;
>>>> seg_size = bv.bv_len;
>>>> sectors += bv.bv_len >> 9;
>>>> }
>>>
>>> I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
>>
>> On my side, with the patch applied on top of 4.4-rc1, I can't get the
>> panic anymore.
>
> git bisect shows:
>
> bdced438acd83ad83a6c6fc7f50099b820245ddb is the first bad commit
> commit bdced438acd83ad83a6c6fc7f50099b820245ddb
> Author: Ming Lei <[email protected]>
> Date: Tue Oct 20 23:13:52 2015 +0800
>
> block: setup bi_phys_segments after splitting
>
> Reverting the above commit on top of 4.4-rc1 seems to fix the problem for me.
That's what I mentioned earlier ;)
Now Ming has sent an additional patch which seems to fix the bug introduced
by commit bdced438acd8. When testing with this new patch I
can't get the panic anymore, but Mark reported he is still hitting it.
On Mon, 23 Nov 2015 10:46:20 +0800
Ming Lei <[email protected]> wrote:
> Hi Mark,
>
> On Mon, Nov 23, 2015 at 9:50 AM, Mark Salter <[email protected]> wrote:
> > On Mon, 2015-11-23 at 08:36 +0800, Ming Lei wrote:
> >> On Mon, Nov 23, 2015 at 7:20 AM, Mark Salter <[email protected]> wrote:
> >> > On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
> >> > > On Sat, 21 Nov 2015 12:30:14 +0100
> >> > > Laurent Dufour <[email protected]> wrote:
> >> > >
> >> > > > On 20/11/2015 13:10, Michael Ellerman wrote:
> >> > > > > On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
> >> > > > >
> >> > > > > > It's pretty much guaranteed to be a block layer bug, most likely in the
> >> > > > > > merge bios to request infrastructure where we don't obey the merging
> >> > > > > > limits properly.
> >> > > > > >
> >> > > > > > Does either of you have a known good and first known bad kernel?
> >> > > > >
> >> > > > > Not me, I've only hit it one or two times. All I can say is I have hit it in
> >> > > > > 4.4-rc1.
> >> > > > >
> >> > > > > Laurent, can you narrow it down at all?
> >> > > >
> >> > > > It seems that the panic is triggered by commit bdced438acd8 ("block:
> >> > > > setup bi_phys_segments after splitting"), which was pulled in by merge
> >> > > > d9734e0d1ccf ("Merge branch 'for-4.4/core' of
> >> > > > git://git.kernel.dk/linux-block").
> >> > > >
> >> > > > My system panics promptly when running a kernel built at
> >> > > > d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
> >> > > > hours without panicking.
> >> > > >
> >> > > > This being said, I can't explain what's going wrong.
> >> > > >
> >> > > > Maybe Ming can shed some light here?
> >> > >
> >> > > Laurent, looks there is one bug in blk_bio_segment_split(), would you
> >> > > mind testing the following patch to see if it fixes your issue?
> >> > >
> >> > > ---
> >> > > From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
> >> > > From: Ming Lei <[email protected]>
> >> > > Date: Sun, 22 Nov 2015 00:47:13 +0800
> >> > > Subject: [PATCH] block: fix segment split
> >> > >
> >> > > Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
> >> > > always points to the iterator local variable, which is obviously
> >> > > wrong, so fix it by pointing to the local variable of 'bvprv'.
> >> > >
> >> > > Signed-off-by: Ming Lei <[email protected]>
> >> > > ---
> >> > > block/blk-merge.c | 4 ++--
> >> > > 1 file changed, 2 insertions(+), 2 deletions(-)
> >> > >
> >> > > diff --git a/block/blk-merge.c b/block/blk-merge.c
> >> > > index de5716d8..f2efe8a 100644
> >> > > --- a/block/blk-merge.c
> >> > > +++ b/block/blk-merge.c
> >> > > @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> >> > >
> >> > > seg_size += bv.bv_len;
> >> > > bvprv = bv;
> >> > > - bvprvp = &bv;
> >> > > + bvprvp = &bvprv;
> >> > > sectors += bv.bv_len >> 9;
> >> > > continue;
> >> > > }
> >> > > @@ -108,7 +108,7 @@ new_segment:
> >> > >
> >> > > nsegs++;
> >> > > bvprv = bv;
> >> > > - bvprvp = &bv;
> >> > > + bvprvp = &bvprv;
> >> > > seg_size = bv.bv_len;
> >> > > sectors += bv.bv_len >> 9;
> >> > > }
> >> >
> >> > I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
> >>
> >> OK, it looks like there are still other bugs. Care to share how to reproduce
> >> it on arm64?
> >>
> >> thanks,
> >> Ming
> >
> > Unfortunately, the best reproducer I have is to boot the platform. I have seen the
> > BUG a few times post-boot, but I don't have a consistent reproducer. I am using
> > upstream 4.4-rc1 with this config:
> >
> > http://people.redhat.com/msalter/fh_defconfig
> >
> > With 4.4-rc1 on an APM Mustang platform, I see the BUG about once every 6-7 boots.
> > On an AMD Seattle platform, about every 9 boots.
>
> Thanks for the input, and I will try to reproduce the issue on mustang with
> your kernel config.
I can reproduce the issue on mustang, and it looks like I understand the story now.
When 64K page size is used on arm64, and the default max segment size of the
block layer is 65536, one segment can include one page at most.
Commit bdced438acd83a ("block: setup bi_phys_segments after splitting") does not
compute bio->bi_seg_front_size and bio->bi_seg_back_size, so one less segment
may be counted, because blk_phys_contig_segment() thinks the last bvec in the 1st
bio and the 1st bvec in the 2nd bio are in one physical segment, which causes the
regression.
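As a worked example of the arithmetic (a standalone illustration, not
kernel code): blk_phys_contig_segment() only rejects clustering the two
boundary bvecs when bi_seg_back_size + bi_seg_front_size exceeds the max
segment size, so sizes that were never computed make an illegal merge look
legal:

#include <stdio.h>

int main(void)
{
	unsigned int max_seg = 65536;	/* queue_max_segment_size() default */
	unsigned int back = 65536;	/* real size of bio 1's last segment */
	unsigned int front = 65536;	/* real size of bio 2's first segment */

	/* Sizes tracked correctly: 64K + 64K > 64K, so the bvecs cannot be
	 * clustered and no segment is saved by the merge. */
	printf("tracked:   mergeable=%d\n", back + front <= max_seg);

	/* Sizes left at 0 by the split path: the same check passes,
	 * nr_phys_segments is decremented, and blk_rq_map_sg() later emits
	 * one more entry than the sg table was allocated for. */
	back = front = 0;
	printf("untracked: mergeable=%d\n", back + front <= max_seg);
	return 0;
}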
It looks like the following patch can fix the issue by computing bio->bi_seg_front_size
and bio->bi_seg_back_size in blk_bio_segment_split().
Mark, thanks again for providing the reproduction steps, and could you run your
test to see if it can fix your issue?
---
From 86b5f33d48715c1150fdcfd9a76e495e7aa913aa Mon Sep 17 00:00:00 2001
From: Ming Lei <[email protected]>
Date: Mon, 23 Nov 2015 20:27:23 +0800
Subject: [PATCH 2/2] blk-merge: fix blk_bio_segment_split
Commit bdced438acd83a(block: setup bi_phys_segments after
splitting) introduces function of computing bio->bi_phys_segments
during bio splitting.
Unfortunately neither bio->bi_seg_front_size nor bio->bi_seg_back_size
is computed, so too many physical segments may be obtained
for one request, since both are used to check whether one segment
can span two bios.
This patch fixes the issue by computing the two variables in
blk_bio_segment_split().
Reported-by: Michael Ellerman <[email protected]>
Reported-by: Mark Salter <[email protected]>
Fixes: bdced438acd83a(block: setup bi_phys_segments after splitting)
Signed-off-by: Ming Lei <[email protected]>
---
block/blk-merge.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/block/blk-merge.c b/block/blk-merge.c
index f2efe8a..50793cd 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -76,6 +76,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
struct bio_vec bv, bvprv, *bvprvp = NULL;
struct bvec_iter iter;
unsigned seg_size = 0, nsegs = 0, sectors = 0;
+ unsigned front_seg_size = bio->bi_seg_front_size;
+ bool do_split = true;
+ struct bio *new = NULL;
bio_for_each_segment(bv, bio, iter) {
if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
@@ -111,13 +114,26 @@ new_segment:
bvprvp = &bvprv;
seg_size = bv.bv_len;
sectors += bv.bv_len >> 9;
+
+ if (nsegs == 1 && seg_size > front_seg_size)
+ front_seg_size = seg_size;
}
- *segs = nsegs;
- return NULL;
+ do_split = false;
split:
*segs = nsegs;
- return bio_split(bio, sectors, GFP_NOIO, bs);
+
+ if (do_split) {
+ new = bio_split(bio, sectors, GFP_NOIO, bs);
+ if (new)
+ bio = new;
+ }
+
+ bio->bi_seg_front_size = front_seg_size;
+ if (seg_size > bio->bi_seg_back_size)
+ bio->bi_seg_back_size = seg_size;
+
+ return do_split ? new : NULL;
}
void blk_queue_split(struct request_queue *q, struct bio **bio,
--
1.9.1
Thanks,
On Mon, Nov 23, 2015 at 11:20 PM, Laurent Dufour
<[email protected]> wrote:
>>
>> Reverting the above commit on top of 4.4-rc1 seems to fix the problem for me.
>
> That's what I mentioned earlier ;)
>
> Now Ming has sent an additional patch which seems to fix the bug introduced
> by commit bdced438acd8. When testing with this new patch I
> can't get the panic anymore, but Mark reported he is still hitting it.
Laurent, thanks for your test of the 1st patch. It looks like there are at
least two problems, and my 2nd patch sent just now should address
Mark's issue, which is caused by bdced438acd83a.
Once the 2nd one is tested OK, I will send out the two together.
Thanks,
Ming
On 23/11/2015 16:27, Ming Lei wrote:
> On Mon, Nov 23, 2015 at 11:20 PM, Laurent Dufour
> <[email protected]> wrote:
>>>
>>> Reverting the above commit on top of 4.4-rc1 seems to fix the problem for me.
>>
>> That's what I mentioned earlier ;)
>>
>> Now Ming has sent an additional patch which seems to fix the bug introduced
>> by commit bdced438acd8. When testing with this new patch I
>> can't get the panic anymore, but Mark reported he is still hitting it.
>
> Laurent, thanks for your test of the 1st patch. It looks like there are at
> least two problems, and my 2nd patch sent just now should address
> Mark's issue, which is caused by bdced438acd83a.
>
> Once the 2nd one is tested OK, I will send out the two together.
FWIW, I applied the 2nd patch, and my system is still running like a charm.
Cheers,
Laurent.
On Mon, 2015-11-23 at 23:27 +0800, Ming Lei wrote:
> On Mon, Nov 23, 2015 at 11:20 PM, Laurent Dufour
> <[email protected]> wrote:
> > >
> > > Reverting the above commit on top of 4.4-rc1 seems to fix the problem for me.
> >
> > That's what I mentioned earlier ;)
> >
> > Now Ming has sent an additional patch which seems to fix the bug introduced
> > by commit bdced438acd8. When testing with this new patch I
> > can't get the panic anymore, but Mark reported he is still hitting it.
>
> Laurent, thanks for your test of the 1st patch. It looks like there are at
> least two problems, and my 2nd patch sent just now should address
> Mark's issue, which is caused by bdced438acd83a.
>
> Once the 2nd one is tested OK, I will send out the two together.
>
> Thanks,
> Ming
Thanks Ming.
With both patches applied, I have been unable to reproduce the problem.
On 11/23/2015 10:21 AM, Ming Lei wrote:
> On Mon, 23 Nov 2015 10:46:20 +0800
> Ming Lei <[email protected]> wrote:
>
>> Hi Mark,
>>
>> On Mon, Nov 23, 2015 at 9:50 AM, Mark Salter <[email protected]> wrote:
>>> On Mon, 2015-11-23 at 08:36 +0800, Ming Lei wrote:
>>>> On Mon, Nov 23, 2015 at 7:20 AM, Mark Salter <[email protected]> wrote:
>>>>> On Sun, 2015-11-22 at 00:56 +0800, Ming Lei wrote:
>>>>>> On Sat, 21 Nov 2015 12:30:14 +0100
>>>>>> Laurent Dufour <[email protected]> wrote:
>>>>>>
>>>>>>> On 20/11/2015 13:10, Michael Ellerman wrote:
>>>>>>>> On Thu, 2015-11-19 at 00:23 -0800, Christoph Hellwig wrote:
>>>>>>>>
>>>>>>>>> It's pretty much guaranteed to be a block layer bug, most likely in the
>>>>>>>>> merge bios to request infrastructure where we don't obey the merging
>>>>>>>>> limits properly.
>>>>>>>>>
>>>>>>>>> Does either of you have a known good and first known bad kernel?
>>>>>>>>
>>>>>>>> Not me, I've only hit it one or two times. All I can say is I have hit it in
>>>>>>>> 4.4-rc1.
>>>>>>>>
>>>>>>>> Laurent, can you narrow it down at all?
>>>>>>>
>>>>>>> It seems that the panic is triggered by commit bdced438acd8 ("block:
>>>>>>> setup bi_phys_segments after splitting"), which was pulled in by merge
>>>>>>> d9734e0d1ccf ("Merge branch 'for-4.4/core' of
>>>>>>> git://git.kernel.dk/linux-block").
>>>>>>>
>>>>>>> My system panics promptly when running a kernel built at
>>>>>>> d9734e0d1ccf, while with commit bdced438acd8 reverted it can run for
>>>>>>> hours without panicking.
>>>>>>>
>>>>>>> This being said, I can't explain what's going wrong.
>>>>>>>
>>>>>>> Maybe Ming can shed some light here?
>>>>>>
>>>>>> Laurent, looks there is one bug in blk_bio_segment_split(), would you
>>>>>> mind testing the following patch to see if it fixes your issue?
>>>>>>
>>>>>> ---
>>>>>> From 6fc701231dcc000bc8bc4b9105583380d9aa31f4 Mon Sep 17 00:00:00 2001
>>>>>> From: Ming Lei <[email protected]>
>>>>>> Date: Sun, 22 Nov 2015 00:47:13 +0800
>>>>>> Subject: [PATCH] block: fix segment split
>>>>>>
>>>>>> Inside blk_bio_segment_split(), previous bvec pointer('bvprvp')
>>>>>> always points to the iterator local variable, which is obviously
>>>>>> wrong, so fix it by pointing to the local variable of 'bvprv'.
>>>>>>
>>>>>> Signed-off-by: Ming Lei <[email protected]>
>>>>>> ---
>>>>>> block/blk-merge.c | 4 ++--
>>>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/block/blk-merge.c b/block/blk-merge.c
>>>>>> index de5716d8..f2efe8a 100644
>>>>>> --- a/block/blk-merge.c
>>>>>> +++ b/block/blk-merge.c
>>>>>> @@ -98,7 +98,7 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
>>>>>>
>>>>>> seg_size += bv.bv_len;
>>>>>> bvprv = bv;
>>>>>> - bvprvp = &bv;
>>>>>> + bvprvp = &bvprv;
>>>>>> sectors += bv.bv_len >> 9;
>>>>>> continue;
>>>>>> }
>>>>>> @@ -108,7 +108,7 @@ new_segment:
>>>>>>
>>>>>> nsegs++;
>>>>>> bvprv = bv;
>>>>>> - bvprvp = &bv;
>>>>>> + bvprvp = &bvprv;
>>>>>> seg_size = bv.bv_len;
>>>>>> sectors += bv.bv_len >> 9;
>>>>>> }
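As a standalone toy (hypothetical code, not from the kernel tree), the bug
class this hunk fixes looks like the following: once bvprvp points at the
per-iteration copy, the "previous" bvec silently becomes the current one on
the next loop pass:

#include <stdio.h>

struct bvec { unsigned int len; };

int main(void)
{
	struct bvec vecs[3] = { { 512 }, { 1024 }, { 2048 } };
	struct bvec bv, bvprv, *bvprvp = NULL;

	for (int i = 0; i < 3; i++) {
		bv = vecs[i];		/* like bio_for_each_segment()'s bv */
		if (bvprvp)
			/* buggy: bvprvp == &bv, so this reads the current
			 * length, not the previous one */
			printf("prev seen as %u, really %u\n",
			       bvprvp->len, bvprv.len);
		bvprv = bv;
		bvprvp = &bv;		/* the bug: should be &bvprv */
	}
	return 0;
}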
>>>>>
>>>>> I'm still hitting the BUG even with this patch applied on top of 4.4-rc1.
>>>>
>>>> OK, it looks like there are still other bugs. Care to share how to reproduce
>>>> it on arm64?
>>>>
>>>> thanks,
>>>> Ming
>>>
>>> Unfortunately, the best reproducer I have is to boot the platform. I have seen the
>>> BUG a few times post-boot, but I don't have a consistent reproducer. I am using
>>> upstream 4.4-rc1 with this config:
>>>
>>> http://people.redhat.com/msalter/fh_defconfig
>>>
>>> With 4.4-rc1 on an APM Mustang platform, I see the BUG about once every 6-7 boots.
>>> On an AMD Seattle platform, about every 9 boots.
>>
>> Thanks for the input, and I will try to reproduce the issue on mustang with
>> your kernel config.
>
> I can reproduce the issue on mustang, and it looks like I understand the story now.
>
> When a 64K page size is used on arm64 and the default block segment size
> is 65536, one segment can include at most one page.
>
> Commit bdced438acd83a ("block: setup bi_phys_segments after splitting") does not
> compute bio->bi_seg_front_size and bio->bi_seg_back_size, so one segment too
> few may be counted, because blk_phys_contig_segment() thinks the last bvec in
> the 1st bio and the 1st bvec in the 2nd bio are in one physical segment,
> causing the regression.
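For reference, the check in question compares those two accumulated sizes
against the queue's max segment size; a sketch from memory of the 4.4-era
block/blk-merge.c (trimmed, so treat it as approximate):

static int blk_phys_contig_segment(struct request_queue *q, struct bio *bio,
				   struct bio *nxt)
{
	if (!blk_queue_cluster(q))
		return 0;

	/*
	 * If bi_seg_back_size/bi_seg_front_size were left unset after a
	 * split, this test passes even when the combined segment would
	 * exceed the 64K max_segment_size -- one segment too few.
	 */
	if (bio->bi_seg_back_size + nxt->bi_seg_front_size >
	    queue_max_segment_size(q))
		return 0;

	/* ... followed by physical-contiguity and boundary checks ... */
	return 1;
}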
>
> It looks like the following patch can fix the issue by computing
> bio->bi_seg_front_size and bio->bi_seg_back_size in blk_bio_segment_split().
>
> Mark, thanks again for providing the reproduction steps, and could you run your
> test to see if it can fix your issue?
>
> ---
> From 86b5f33d48715c1150fdcfd9a76e495e7aa913aa Mon Sep 17 00:00:00 2001
> From: Ming Lei <[email protected]>
> Date: Mon, 23 Nov 2015 20:27:23 +0800
> Subject: [PATCH 2/2] blk-merge: fix blk_bio_segment_split
>
> Commit bdced438acd83a ("block: setup bi_phys_segments after
> splitting") introduced the computation of bio->bi_phys_segments
> during bio splitting.
>
> Unfortunately both bio->bi_seg_front_size and bio->bi_seg_back_size
> aren't computed, so too many physical segments may be obtained
> for one request, since both are used to check whether one segment
> can span two bios.
>
> This patch fixes the issue by computing the two variables in
> blk_bio_segment_split().
>
> Reported-by: Michael Ellerman <[email protected]>
> Reported-by: Mark Salter <[email protected]>
> Fixes: bdced438acd83a ("block: setup bi_phys_segments after splitting")
> Signed-off-by: Ming Lei <[email protected]>
> ---
> block/blk-merge.c | 22 +++++++++++++++++++---
> 1 file changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index f2efe8a..50793cd 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -76,6 +76,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
> struct bio_vec bv, bvprv, *bvprvp = NULL;
> struct bvec_iter iter;
> unsigned seg_size = 0, nsegs = 0, sectors = 0;
> + unsigned front_seg_size = bio->bi_seg_front_size;
> + bool do_split = true;
> + struct bio *new = NULL;
>
> bio_for_each_segment(bv, bio, iter) {
> if (sectors + (bv.bv_len >> 9) > queue_max_sectors(q))
> @@ -111,13 +114,26 @@ new_segment:
> bvprvp = &bvprv;
> seg_size = bv.bv_len;
> sectors += bv.bv_len >> 9;
> +
> + if (nsegs == 1 && seg_size > front_seg_size)
> + front_seg_size = seg_size;
> }
>
> - *segs = nsegs;
> - return NULL;
> + do_split = false;
> split:
> *segs = nsegs;
> - return bio_split(bio, sectors, GFP_NOIO, bs);
> +
> + if (do_split) {
> + new = bio_split(bio, sectors, GFP_NOIO, bs);
> + if (new)
> + bio = new;
> + }
> +
> + bio->bi_seg_front_size = front_seg_size;
> + if (seg_size > bio->bi_seg_back_size)
> + bio->bi_seg_back_size = seg_size;
> +
> + return do_split ? new : NULL;
> }
>
> void blk_queue_split(struct request_queue *q, struct bio **bio,
>
I applied both your patches and tested on Overdrive 3000. This fixes the
issue for me.
Added linux-arm-kernel, since arm64 triggers this issue.
Thanks Ming and testers for your hard work on this.
Tested-by: Alan Ott <[email protected]>
On 11/20/2015 04:28 PM, Ewan Milne wrote:
> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>> Can't we have a joint effort here?
>> I've been spending a _LOT_ of time trying to debug things here, but
>> none of the ideas I've come up with have been able to fix anything.
>
> Yes. I'm not the one primarily looking at it, and we don't have a
> reproducer in-house. We just have the one dump right now.
>
>>
>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>> by one and be done with ...
>>
>
> That might not fix it if it is a problem with the merge code, though.
>
And indeed, it doesn't.
Seems I finally found the culprit.
What happens is this:
We have two paths, with these seg_boundary_masks:
path-1: seg_boundary_mask = 65535,
path-2: seg_boundary_mask = 4294967295,
consequently the DM request queue has this:
md-1: seg_boundary_mask = 65535,
What happens now is that a request is being formatted, and sent
to path 2. During submission req->nr_phys_segments is formatted
with the limits of path 2, arriving at a count of 3.
Now the request gets retried on path 1, but as the NOMERGE request
flag is set req->nr_phys_segments is never updated.
But blk_rq_map_sg() ignores all counters, and just uses the
bi_vec directly, resulting in a count of 4 -> boom.
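As a toy illustration of the mask dependence (hypothetical addresses,
ignoring max_segment_size for brevity): contiguous bvecs may only share a
segment if the segment doesn't straddle the boundary mask, so the same bvec
list counts differently on the two paths.

#include <stdio.h>

struct bv { unsigned long addr; unsigned int len; };

static int count_segments(const struct bv *v, int n, unsigned long mask)
{
	int segs = 1;

	for (int i = 1; i < n; i++) {
		/* merge only if contiguous and within one boundary window,
		 * mirroring the block layer's BIOVEC_SEG_BOUNDARY test */
		if (v[i].addr == v[i - 1].addr + v[i - 1].len &&
		    ((v[i - 1].addr | mask) ==
		     ((v[i].addr + v[i].len - 1) | mask)))
			continue;
		segs++;
	}
	return segs;
}

int main(void)
{
	/* two contiguous 4K pages straddling a 64K line, plus one more */
	struct bv vecs[] = {
		{ 0x1f000, 0x1000 }, { 0x20000, 0x1000 }, { 0x30000, 0x1000 },
	};

	printf("mask 0xffffffff: %d segments\n",
	       count_segments(vecs, 3, 0xffffffffUL));	/* prints 2 */
	printf("mask 0x0000ffff: %d segments\n",
	       count_segments(vecs, 3, 0xffffUL));	/* prints 3 */
	return 0;
}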
So the culprit here is the NOMERGE flag, which is evaluated
via
->dm_dispatch_request()
->blk_insert_cloned_request()
->blk_rq_check_limits()
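For context, the 4.4-era function returns before ever recalculating the
segment count when the request isn't "mergeable" -- and cloned requests
always carry NOMERGE. Roughly, from memory of block/blk-core.c:

int blk_rq_check_limits(struct request_queue *q, struct request *rq)
{
	if (!rq_mergeable(rq))	/* true for clones: NOMERGE is set */
		return 0;	/* skips the recalculation below */

	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
		printk(KERN_ERR "%s: over max size limit.\n", __func__);
		return -EIO;
	}

	/*
	 * Recalculate against *this* queue's limits -- never reached for
	 * cloned requests, so nr_phys_segments stays stale.
	 */
	blk_recalc_rq_segments(rq);
	if (rq->nr_phys_segments > queue_max_segments(q)) {
		printk(KERN_ERR "%s: over max segments limit.\n", __func__);
		return -EIO;
	}

	return 0;
}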
If the above assessment is correct, the following patch should
fix it:
diff --git a/block/blk-core.c b/block/blk-core.c
index 801ced7..12cccd6 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
*/
int blk_rq_check_limits(struct request_queue *q, struct request *rq)
{
- if (!rq_mergeable(rq))
+ if (rq->cmd_type != REQ_TYPE_FS)
return 0;
if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
rq->cmd_flags)) {
Mike? Jens?
Can you comment on it?
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
> [...]
> Seems I finally found the culprit.
>
> What happens is this:
> We have two paths, with these seg_boundary_masks:
>
> path-1: seg_boundary_mask = 65535,
> path-2: seg_boundary_mask = 4294967295,
>
> consequently the DM request queue has this:
>
> md-1: seg_boundary_mask = 65535,
>
> What happens now is that a request is being formatted, and sent
> to path 2. During submission req->nr_phys_segments is formatted
> with the limits of path 2, arriving at a count of 3.
> Now the request gets retried on path 1, but as the NOMERGE request
> flag is set req->nr_phys_segments is never updated.
> But blk_rq_map_sg() ignores all counters, and just uses the
> bi_vec directly, resulting in a count of 4 -> boom.
>
> So the culprit here is the NOMERGE flag, which is evaluated
> via
> ->dm_dispatch_request()
> ->blk_insert_cloned_request()
> ->blk_rq_check_limits()
>
> If the above assessment is correct, the following patch should
> fix it:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 801ced7..12cccd6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> */
> int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> {
> - if (!rq_mergeable(rq))
> + if (rq->cmd_type != REQ_TYPE_FS)
> return 0;
>
> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> rq->cmd_flags)) {
>
>
> Mike? Jens?
> Can you comment on it?
We only support merging on REQ_TYPE_FS already, so how is the above
making it any different? In general, NOMERGE being set or not should not
make a difference. It's only a hint that we need not check further if we
should be merging on this request, since we already tried it once, found
we'd exceed various limits, then set NOMERGE to reflect that.
--
Jens Axboe
On Wed, Nov 25 2015 at 4:04am -0500,
Hannes Reinecke <[email protected]> wrote:
> On 11/20/2015 04:28 PM, Ewan Milne wrote:
> > On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
> >> Can't we have a joint effort here?
> >> I've been spending a _LOT_ of time trying to debug things here, but
> >> none of the ideas I've come up with have been able to fix anything.
> >
> > Yes. I'm not the one primarily looking at it, and we don't have a
> > reproducer in-house. We just have the one dump right now.
> >
> >>
> >> I'm almost tempted to increase the count from scsi_alloc_sgtable()
> >> by one and be done with ...
> >>
> >
> > That might not fix it if it is a problem with the merge code, though.
> >
> And indeed, it doesn't.
How did you arrive at that? Do you have a reproducer now?
> Seems I finally found the culprit.
>
> What happens is this:
> We have two paths, with these seg_boundary_masks:
>
> path-1: seg_boundary_mask = 65535,
> path-2: seg_boundary_mask = 4294967295,
>
> consequently the DM request queue has this:
>
> md-1: seg_boundary_mask = 65535,
>
> What happens now is that a request is being formatted, and sent
> to path 2. During submission req->nr_phys_segments is formatted
> with the limits of path 2, arriving at a count of 3.
> Now the request gets retried on path 1, but as the NOMERGE request
> flag is set req->nr_phys_segments is never updated.
> But blk_rq_map_sg() ignores all counters, and just uses the
> bi_vec directly, resulting in a count of 4 -> boom.
>
> So the culprit here is the NOMERGE flag,
NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
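For reference, roughly what 4.4's __blk_rq_prep_clone() does (from memory of
block/blk-core.c) -- note that it also copies nr_phys_segments straight from
the original request:

static void __blk_rq_prep_clone(struct request *dst, struct request *src)
{
	dst->cpu = src->cpu;
	dst->cmd_flags |= (src->cmd_flags & REQ_CLONE_MASK) | REQ_NOMERGE;
	dst->cmd_type = src->cmd_type;
	dst->__sector = blk_rq_pos(src);
	dst->__data_len = blk_rq_bytes(src);
	dst->nr_phys_segments = src->nr_phys_segments;	/* inherited as-is */
	dst->ioprio = src->ioprio;
	dst->extra_len = src->extra_len;
}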
> which is evaluated via
> ->dm_dispatch_request()
> ->blk_insert_cloned_request()
> ->blk_rq_check_limits()
blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
anyway after reading your mail I'm still left wondering if your proposed
patch is correct.
> If the above assessment is correct, the following patch should
> fix it:
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 801ced7..12cccd6 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
> */
> int blk_rq_check_limits(struct request_queue *q, struct request *rq)
> {
> - if (!rq_mergeable(rq))
> + if (rq->cmd_type != REQ_TYPE_FS)
> return 0;
>
> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
> rq->cmd_flags)) {
>
>
> Mike? Jens?
> Can you comment on it?
You're not explaining the actual change in the patch very well; I think
you're correct but you're leaving the justification as an exercise to
the reviewer:
blk_rq_check_limits() will call blk_recalc_rq_segments() after the
!rq_mergeable() check but you're saying for this case in question we
never get there -- due to the cloned request having NOMERGE set.
So in blk_rq_check_limits() you've unrolled rq_mergeable() and
open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
sense for cloned requests that always have NOMERGE set.
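For readers following along, 4.4's rq_mergeable() is roughly this
(include/linux/blkdev.h), so dropping the flags test leaves only the
cmd_type test:

static inline bool rq_mergeable(struct request *rq)
{
	if (rq->cmd_type != REQ_TYPE_FS)
		return false;

	if (rq->cmd_flags & REQ_NOMERGE_FLAGS)	/* includes REQ_NOMERGE */
		return false;

	return true;
}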
So you're saying that by having blk_rq_check_limits() go on to call
blk_recalc_rq_segments() this bug will be fixed?
BTW, I think blk_rq_check_limits()'s export should be removed and the
function made static and renamed to blk_clone_rq_check_limits(), again:
blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
Seems prudent to make that change now to be clear that this code is only
used by cloned requests.
Thanks,
Mike
On 11/25/2015 07:01 PM, Mike Snitzer wrote:
> On Wed, Nov 25 2015 at 4:04am -0500,
> Hannes Reinecke <[email protected]> wrote:
>
>> On 11/20/2015 04:28 PM, Ewan Milne wrote:
>>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote:
>>>> Can't we have a joint effort here?
>>>> I've been spending a _LOT_ of time trying to debug things here, but
>>>> none of the ideas I've come up with have been able to fix anything.
>>>
>>> Yes. I'm not the one primarily looking at it, and we don't have a
>>> reproducer in-house. We just have the one dump right now.
>>>
>>>>
>>>> I'm almost tempted to increase the count from scsi_alloc_sgtable()
>>>> by one and be done with ...
>>>>
>>>
>>> That might not fix it if it is a problem with the merge code, though.
>>>
>> And indeed, it doesn't.
>
> How did you arrive at that? Do you have a reproducer now?
>
Not a reproducer, but several dumps for analysis.
>> Seems I finally found the culprit.
>>
>> What happens is this:
>> We have two paths, with these seg_boundary_masks:
>>
>> path-1: seg_boundary_mask = 65535,
>> path-2: seg_boundary_mask = 4294967295,
>>
>> consequently the DM request queue has this:
>>
>> md-1: seg_boundary_mask = 65535,
>>
>> What happens now is that a request is being formatted, and sent
>> to path 2. During submission req->nr_phys_segments is formatted
>> with the limits of path 2, arriving at a count of 3.
>> Now the request gets retried on path 1, but as the NOMERGE request
>> flag is set req->nr_phys_segments is never updated.
>> But blk_rq_map_sg() ignores all counters, and just uses the
>> bi_vec directly, resulting in a count of 4 -> boom.
>>
>> So the culprit here is the NOMERGE flag,
>
> NOMERGE is always set in __blk_rq_prep_clone() for cloned requests.
>
Yes.
>> which is evaluated via
>> ->dm_dispatch_request()
>> ->blk_insert_cloned_request()
>> ->blk_rq_check_limits()
>
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits();
> anyway after reading your mail I'm still left wondering if your proposed
> patch is correct.
>
>> If the above assessment is correct, the following patch should
>> fix it:
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 801ced7..12cccd6 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio);
>> */
>> int blk_rq_check_limits(struct request_queue *q, struct request *rq)
>> {
>> - if (!rq_mergeable(rq))
>> + if (rq->cmd_type != REQ_TYPE_FS)
>> return 0;
>>
>> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q,
>> rq->cmd_flags)) {
>>
>>
>> Mike? Jens?
>> Can you comment on it?
>
> You're not explaining the actual change in the patch very well; I think
> you're correct but you're leaving the justification as an exercise to
> the reviewer:
>
> blk_rq_check_limits() will call blk_recalc_rq_segments() after the
> !rq_mergeable() check but you're saying for this case in question we
> never get there -- due to the cloned request having NOMERGE set.
>
> So in blk_rq_check_limits() you've unrolled rq_mergeable() and
> open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS)
>
> I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in
> the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no
> sense for cloned requests that always have NOMERGE set.
>
> So you're saying that by having blk_rq_check_limits() go on to call
> blk_recalc_rq_segments() this bug will be fixed?
>
That is the idea.
I've already established that in all instances I have seen so far
req->nr_phys_segments is _less_ than req->bio->bi_phys_segments.
As it turns out, req->nr_phys_segments _would_ have been updated in
blk_rq_check_limits(), but isn't due to the NOMERGE flag being set
for the cloned request.
So each cloned request inherits the values from the original request,
despite the fact that req->nr_phys_segments _has_ to be evaluated in
the final request_queue context, as the queue limits _might_ be
different from the original (merged) queue limits of the multipath
request queue.
> BTW, I think blk_rq_check_limits()'s export should be removed and the
> function made static and renamed to blk_clone_rq_check_limits(), again:
> blk_insert_cloned_request() is the only caller of blk_rq_check_limits()
>
Actually, seeing Jens' last comment, the check for REQ_TYPE_FS is
pointless too, so we might as well remove the entire if-clause.
> Seems prudent to make that change now to be clear that this code is only
> used by cloned requests.
>
Yeah, that would make sense. I'll be preparing a patch.
With a more detailed description :-)
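A sketch of the direction being discussed -- hypothetical, not the final
patch: make the helper static, rename it for its only caller, and drop the
early return so clones always get their segments recalculated against the
destination queue:

static int blk_clone_rq_check_limits(struct request_queue *q,
				     struct request *rq)
{
	if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, rq->cmd_flags)) {
		printk(KERN_ERR "%s: over max size limit.\n", __func__);
		return -EIO;
	}

	/*
	 * Queue limits may differ from the original (merged) multipath
	 * queue, so always recalculate on this queue's limits.
	 */
	blk_recalc_rq_segments(rq);
	if (rq->nr_phys_segments > queue_max_segments(q)) {
		printk(KERN_ERR "%s: over max segments limit.\n", __func__);
		return -EIO;
	}

	return 0;
}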
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 11/25/2015 06:56 PM, Jens Axboe wrote:
> On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
>> [...]
>
> We only support merging on REQ_TYPE_FS already, so how is the above
> making it any different? In general, NOMERGE being set or not should not
> make a difference. It's only a hint that we need not check further if we
> should be merging on this request, since we already tried it once, found
> we'd exceed various limits, then set NOMERGE to reflect that.
>
The problem is that NOMERGE does too much, as it inhibits _any_ merging.
Unfortunately, the req->nr_phys_segments value is evaluated in the final
_driver_ context _after_ the merging happened; cf
scsi_lib.c:scsi_init_sgtable().
As nr_phys_segments is inherited from the original request (and never
recalculated with the new request queue limits) the following
blk_rq_map_sg() call might end up at a different calculation, especially
after retrying a request on another path.
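A hypothetical condensation of that sequence (sketched names, not the real
functions): the sg table is sized from the cached counter, but filled by
walking the bio_vecs against the current queue's limits:

static int sketch_init_sgtable(struct request_queue *q, struct request *rq,
			       struct scsi_data_buffer *sdb)
{
	/* alloc_sg_entries() is hypothetical, standing in for
	 * scsi_alloc_sgtable(): the table is sized from the counter
	 * that was computed on the old path's queue */
	if (alloc_sg_entries(sdb, rq->nr_phys_segments))	/* e.g. 3 */
		return -ENOMEM;

	/* blk_rq_map_sg() re-walks the bio_vecs under *this* queue's
	 * seg_boundary_mask and can legitimately produce more entries */
	int count = blk_rq_map_sg(q, rq, sdb->table.sgl);	/* e.g. 4 */
	BUG_ON(count > sdb->table.nents);			/* 4 > 3 -> boom */
	return count;
}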
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 11/25/2015 12:10 PM, Hannes Reinecke wrote:
> On 11/25/2015 06:56 PM, Jens Axboe wrote:
>> On 11/25/2015 02:04 AM, Hannes Reinecke wrote:
>>> [...]
>>
>> We only support merging on REQ_TYPE_FS already, so how is the above
>> making it any different? In general, NOMERGE being set or not should not
>> make a difference. It's only a hint that we need not check further if we
>> should be merging on this request, since we already tried it once, found
>> we'd exceed various limits, then set NOMERGE to reflect that.
>>
> The problem is that NOMERGE does too much, as it inhibits _any_ merging.
Right, that is the point of the flag from the block layer view, where it
was originally added for the case mentioned.
> Unfortunately, the req->nr_phys_segments value is evaluated in the final
> _driver_ context _after_ the merging happened; cf
> scsi_lib.c:scsi_init_sgtable().
> As nr_phys_segments is inherited from the original request (and never
> recalculated with the new request queue limits) the following
> blk_rq_map_sg() call might end up at a different calculation, especially
> after retrying a request on another path.
That all sounds pretty horrible. Why is blk_rq_check_limits() checking
for mergeable at all? If merging is disabled on the request, I'm
assuming that's an attempt at an optimization since we know it won't
change. But that should be tracked separately, like how it's done on the
bio.
--
Jens Axboe
On Wed, Nov 25 2015 at 2:24pm -0500,
Jens Axboe <[email protected]> wrote:
> On 11/25/2015 12:10 PM, Hannes Reinecke wrote:
> >[...]
> >The problem is that NOMERGE does too much, as it inhibits _any_ merging.
>
> Right, that is the point of the flag from the block layer view,
> where it was originally added for the case mentioned.
And we really don't want _any_ merging. The merging, if any, will have
already happened in upper DM-multipath's elevator. So there should be
no need to have the underlying SCSI paths do any merging.
> >Unfortunately, the req->nr_phys_segments value is evaluated in the final
> >_driver_ context _after_ the merging happened; cf
> >scsi_lib.c:scsi_init_sgtable().
> >As nr_phys_segments is inherited from the original request (and never
> >recalculated with the new request queue limits) the following
> >blk_rq_map_sg() call might end up at a different calculation, especially
> >after retrying a request on another path.
>
> That all sounds pretty horrible. Why is blk_rq_check_limits()
> checking for mergeable at all? If merging is disabled on the
> request, I'm assuming that's an attempt at an optimization since we
> know it won't change. But that should be tracked separately, like
> how it's done on the bio.
Not clear to me why it was checking for merging...
On Wed, Nov 25 2015 at 3:23pm -0500,
Mike Snitzer <[email protected]> wrote:
> On Wed, Nov 25 2015 at 2:24pm -0500,
> Jens Axboe <[email protected]> wrote:
>
> > [...]
>
> Not clear to me why it was checking for merging...
Ewan pointed out that blk_rq_check_limits()'s call to rq_mergeable() was
introduced as part of Martin's DISCARD cleanup that prepared for
WRITE_SAME, see: commit e2a60da74 ("block: Clean up special command handling logic")
And prior to that, blk_rq_check_limits()'s (rq->cmd_flags & REQ_DISCARD)
check was introduced by some crazy doppelganger named "Ike Snitzer",
see commit: 3383977f ("block: update request stacking methods to support discards")
So basically we probably never needed the extra check in
blk_rq_check_limits() to begin with... Ike was probably paranoid. ;)
On Wed, 25 Nov 2015 20:01:47 +0100,
Hannes Reinecke wrote:
>
> [...]
Do we already have a fix? Right now I hit (likely) this kernel BUG()
on the almost-latest Linus tree (commit 25364a9e54fb8296). It
happened while I started a KVM guest right after a fresh boot. The machine
panicked even before that, so I hit this twice today.
I have a crash dump, so if anything is needed, let me know.
(But the system had to be restarted since it's my main workstation.)
thanks,
Takashi
===
[ 137.211255] ------------[ cut here ]------------
[ 137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0()
[ 137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[ 137.211326] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[ 137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183
[ 137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[ 137.211349] Workqueue: kblockd bfq_kick_queue
[ 137.211352] ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000
[ 137.211357] ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000
[ 137.211362] 0000000000000000 0000000000001000 0000000000000020 ffff880214307b00
[ 137.211367] Call Trace:
[ 137.211373] [<ffffffff813b1cc9>] dump_stack+0x4b/0x72
[ 137.211378] [<ffffffff8106d9a2>] warn_slowpath_common+0x82/0xc0
[ 137.211383] [<ffffffff8106da9a>] warn_slowpath_null+0x1a/0x20
[ 137.211387] [<ffffffff81381340>] blk_rq_map_sg+0x430/0x4d0
[ 137.211393] [<ffffffff81514a60>] ? scsi_init_cmd_errh+0x90/0x90
[ 137.211398] [<ffffffff81514fc4>] scsi_init_sgtable+0x44/0x80
[ 137.211402] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[ 137.211408] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[ 137.211413] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[ 137.211416] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[ 137.211420] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[ 137.211424] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[ 137.211428] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[ 137.211433] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[ 137.211438] [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[ 137.211442] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[ 137.211447] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[ 137.211451] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[ 137.211456] [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[ 137.211460] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.211464] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.211468] [<ffffffff8109206e>] kthread+0xfe/0x120
[ 137.211473] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.211478] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[ 137.211482] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.211485] ---[ end trace b39a62f95b1cad19 ]---
[ 137.211499] ------------[ cut here ]------------
[ 137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096!
[ 137.211511] invalid opcode: 0000 [#1] PREEMPT SMP
[ 137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore
[ 137.211630] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg
[ 137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G W 4.4.0-rc3-test2+ #183
[ 137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013
[ 137.211675] Workqueue: kblockd bfq_kick_queue
[ 137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000
[ 137.211713] RIP: 0010:[<ffffffff81514ff2>] [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[ 137.211747] RSP: 0018:ffff880214307b90 EFLAGS: 00010097
[ 137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006
[ 137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009
[ 137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000
[ 137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320
[ 137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0
[ 137.211930] FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[ 137.211961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0
[ 137.212021] Stack:
[ 137.212048] ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0
[ 137.212082] ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0
[ 137.212115] 000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020
[ 137.212148] Call Trace:
[ 137.212178] [<ffffffff81515045>] scsi_init_io+0x45/0x1d0
[ 137.212205] [<ffffffff815214a8>] sd_setup_read_write_cmnd+0x38/0xa20
[ 137.212232] [<ffffffff810c141d>] ? trace_hardirqs_off+0xd/0x10
[ 137.212258] [<ffffffff81524fc7>] sd_init_command+0x27/0xa0
[ 137.212285] [<ffffffff815152c0>] scsi_setup_cmnd+0xf0/0x150
[ 137.212312] [<ffffffff81515471>] scsi_prep_fn+0xd1/0x170
[ 137.212338] [<ffffffff8137b87c>] blk_peek_request+0x18c/0x320
[ 137.212365] [<ffffffff81516cb2>] scsi_request_fn+0x42/0x610
[ 137.212392] [<ffffffff81374943>] __blk_run_queue+0x33/0x40
[ 137.212418] [<ffffffff813a8015>] bfq_kick_queue+0x25/0x40
[ 137.212444] [<ffffffff8108b18d>] process_one_work+0x1ed/0x660
[ 137.212471] [<ffffffff8108b0fc>] ? process_one_work+0x15c/0x660
[ 137.212498] [<ffffffff8108b64e>] worker_thread+0x4e/0x450
[ 137.212525] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.212552] [<ffffffff8108b600>] ? process_one_work+0x660/0x660
[ 137.212578] [<ffffffff8109206e>] kthread+0xfe/0x120
[ 137.212605] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.212631] [<ffffffff816d19ef>] ret_from_fork+0x3f/0x70
[ 137.212657] [<ffffffff81091f70>] ? kthread_create_on_node+0x230/0x230
[ 137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 137.212798] RIP [<ffffffff81514ff2>] scsi_init_sgtable+0x72/0x80
[ 137.212826] RSP <ffff880214307b90>
On 12/04/2015 09:59 AM, Takashi Iwai wrote:
> On Wed, 25 Nov 2015 20:01:47 +0100,
> Hannes Reinecke wrote:
>> [...]
>
> > Do we already have a fix? Right now I hit (likely) this kernel BUG()
> > on the almost-latest Linus tree (commit 25364a9e54fb8296). It
> > happened while I started a KVM guest right after a fresh boot. The machine
> > panicked even before that, so I hit this twice today.
Update to the tree as of yesterday (or today) and it should work.
25364a9e54fb8296 doesn't include the latest block fixes that were sent
in yesterday; those should fix it. You need commit a88d32af18b8 or newer.
--
Jens Axboe
On Fri, 04 Dec 2015 18:02:58 +0100,
Jens Axboe wrote:
>
> On 12/04/2015 09:59 AM, Takashi Iwai wrote:
> > [...]
> >
> > Do we already have a fix? Right now I hit (likely) this kernel BUG()
> > on the almost-latest Linus tree (commit 25364a9e54fb8296). It
> > happened while I started a KVM guest right after a fresh boot. The machine
> > panicked even before that, so I hit this twice today.
>
> Update to the tree as of yesterday (or today) and it should work.
> 25364a9e54fb8296 doesn't include the latest block fixes that were sent
> in yesterday; those should fix it. You need commit a88d32af18b8 or newer.
Alright, I'll give it a try. Thanks for the information!
Takashi