Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756119AbbLDQ7Z (ORCPT ); Fri, 4 Dec 2015 11:59:25 -0500 Received: from mx2.suse.de ([195.135.220.15]:36272 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755542AbbLDQ7V (ORCPT ); Fri, 4 Dec 2015 11:59:21 -0500 Date: Fri, 04 Dec 2015 17:59:17 +0100 Message-ID: From: Takashi Iwai To: Hannes Reinecke Cc: Mike Snitzer , emilne@redhat.com, Christoph Hellwig , Michael Ellerman , Mark Salter , "James E. J. Bottomley" , brking , linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-block@vger.kernel.org, "Jun'ichi Nomura" , Jens Axboe Subject: Re: kernel BUG at drivers/scsi/scsi_lib.c:1096! In-Reply-To: <5656059B.9010102@suse.de> References: <1447838334.1564.2.camel@ellerman.id.au> <1447855399.3974.24.camel@redhat.com> <1447894964.15206.0.camel@ellerman.id.au> <20151119082325.GA11419@infradead.org> <564DEC41.5010600@suse.de> <1448030316.4067.18.camel@localhost.localdomain> <564F3453.9040603@suse.de> <1448033323.4067.24.camel@localhost.localdomain> <565579A2.4000005@suse.de> <20151125180133.GA18839@redhat.com> <5656059B.9010102@suse.de> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI/1.14.6 (Maruoka) FLIM/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL/10.8 Emacs/24.5 (x86_64-suse-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI 1.14.6 - "Maruoka") Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 13042 Lines: 233 On Wed, 25 Nov 2015 20:01:47 +0100, Hannes Reinecke wrote: > > On 11/25/2015 07:01 PM, Mike Snitzer wrote: > > On Wed, Nov 25 2015 at 4:04am -0500, > > Hannes Reinecke wrote: > > > >> On 11/20/2015 04:28 PM, Ewan Milne wrote: > >>> On Fri, 2015-11-20 at 15:55 +0100, Hannes Reinecke wrote: > >>>> Can't we have a joint effort here? > >>>> I've been spending a _LOT_ of time trying to debug things here, but > >>>> none of the ideas I've come up with have been able to fix anything. > >>> > >>> Yes. I'm not the one primarily looking at it, and we don't have a > >>> reproducer in-house. We just have the one dump right now. > >>> > >>>> > >>>> I'm almost tempted to increase the count from scsi_alloc_sgtable() > >>>> by one and be done with ... > >>>> > >>> > >>> That might not fix it if it is a problem with the merge code, though. > >>> > >> And indeed, it doesn't. > > > > How did you arrive at that? Do you have a reproducer now? > > > Not a reproducer, but several dumps for analysis. > > >> Seems I finally found the culprit. > >> > >> What happens is this: > >> We have two paths, with these seg_boundary_masks: > >> > >> path-1: seg_boundary_mask = 65535, > >> path-2: seg_boundary_mask = 4294967295, > >> > >> consequently the DM request queue has this: > >> > >> md-1: seg_boundary_mask = 65535, > >> > >> What happens now is that a request is being formatted, and sent > >> to path 2. During submission req->nr_phys_segments is formatted > >> with the limits of path 2, arriving at a count of 3. > >> Now the request gets retried on path 1, but as the NOMERGE request > >> flag is set req->nr_phys_segments is never updated. > >> But blk_rq_map_sg() ignores all counters, and just uses the > >> bi_vec directly, resulting in a count of 4 -> boom. > >> > >> So the culprit here is the NOMERGE flag, > > > > NOMERGE is always set in __blk_rq_prep_clone() for cloned requests. > > > Yes. > > >> which is evaluated via > >> ->dm_dispatch_request() > >> ->blk_insert_cloned_request() > >> ->blk_rq_check_limits() > > > > blk_insert_cloned_request() is the only caller of blk_rq_check_limits(); > > anyway after reading your mail I'm still left wondering if your proposed > > patch is correct. > > > >> If the above assessment is correct, the following patch should > >> fix it: > >> > >> diff --git a/block/blk-core.c b/block/blk-core.c > >> index 801ced7..12cccd6 100644 > >> --- a/block/blk-core.c > >> +++ b/block/blk-core.c > >> @@ -1928,7 +1928,7 @@ EXPORT_SYMBOL(submit_bio); > >> */ > >> int blk_rq_check_limits(struct request_queue *q, struct request *rq) > >> { > >> - if (!rq_mergeable(rq)) > >> + if (rq->cmd_type != REQ_TYPE_FS) > >> return 0; > >> > >> if (blk_rq_sectors(rq) > blk_queue_get_max_sectors(q, > >> rq->cmd_flags)) { > >> > >> > >> Mike? Jens? > >> Can you comment on it? > > > > You're not explaining the actual change in the patch very well; I think > > you're correct but you're leaving the justification as an exercise to > > the reviewer: > > > > blk_rq_check_limits() will call blk_recalc_rq_segments() after the > > !rq_mergeable() check but you're saying for this case in question we > > never get there -- due to the cloned request having NOMERGE set. > > > > So in blk_rq_check_limits() you've unrolled rq_mergeable() and > > open-coded the lone remaining check (rq->cmd_type != REQ_TYPE_FS) > > > > I agree that the (rq->cmd_flags & REQ_NOMERGE_FLAGS) check in > > the blk_insert_cloned_request() call-chain (via rq_mergeable()) makes no > > sense for cloned requests that always have NOMERGE set. > > > > So you're saying that by having blk_rq_check_limits() go on to call > > blk_recalc_rq_segments() this bug will be fixed? > > > That is the idea. > > I've already established that in all instances I have seen so far > req->nr_phys_segments is _less_ than req->bio->bi_phys_segments. > > As it turns out, req->nr_phys_segemnts _would_ have been updated in > blk_rq_check_limits(), but isn't due to the NOMERGE flag being set > for the cloned request. > So each cloned request inherits the values from the original request, > despite the fact that req->nr_phys_segments _has_ to be evaluated in > the final request_queue context, as the queue limits _might_ be > different from the original (merged) queue limits of the multipath > request queue. > > > BTW, I think blk_rq_check_limits()'s export should be removed and the > > function made static and renamed to blk_clone_rq_check_limits(), again: > > blk_insert_cloned_request() is the only caller of blk_rq_check_limits() > > > Actually, seeing Jens' last comment the check for REQ_TYPE_FS is > pointless, too, so we might as well remove the entire if-clause. > > > Seems prudent to make that change now to be clear that this code is only > > used by cloned requests. > > > Yeah, that would make sense. I'll be preparing a patch. > With a more detailed description :-) Do we have already a fix? Right now I got (likely) this kernel BUG() on the almost latest Linus tree (commit 25364a9e54fb8296). It happened while I started a KVM right after a fresh boot. The machine paniced even before that, so I hit this twice today. I have a crash dump, so if anything is needed, let me know. (But the system had to be restarted since it's my main workstation.) thanks, Takashi === [ 137.211255] ------------[ cut here ]------------ [ 137.211265] WARNING: CPU: 0 PID: 391 at block/blk-merge.c:435 blk_rq_map_sg+0x430/0x4d0() [ 137.211267] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore [ 137.211326] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg [ 137.211342] CPU: 0 PID: 391 Comm: kworker/0:1H Not tainted 4.4.0-rc3-test2+ #183 [ 137.211344] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 [ 137.211349] Workqueue: kblockd bfq_kick_queue [ 137.211352] ffffffff81a6b805 ffff880214307ab8 ffffffff813b1cc9 0000000000000000 [ 137.211357] ffff880214307af0 ffffffff8106d9a2 ffff88019e7d49c8 0000000000000000 [ 137.211362] 0000000000000000 0000000000001000 0000000000000020 ffff880214307b00 [ 137.211367] Call Trace: [ 137.211373] [] dump_stack+0x4b/0x72 [ 137.211378] [] warn_slowpath_common+0x82/0xc0 [ 137.211383] [] warn_slowpath_null+0x1a/0x20 [ 137.211387] [] blk_rq_map_sg+0x430/0x4d0 [ 137.211393] [] ? scsi_init_cmd_errh+0x90/0x90 [ 137.211398] [] scsi_init_sgtable+0x44/0x80 [ 137.211402] [] scsi_init_io+0x45/0x1d0 [ 137.211408] [] sd_setup_read_write_cmnd+0x38/0xa20 [ 137.211413] [] ? trace_hardirqs_off+0xd/0x10 [ 137.211416] [] sd_init_command+0x27/0xa0 [ 137.211420] [] scsi_setup_cmnd+0xf0/0x150 [ 137.211424] [] scsi_prep_fn+0xd1/0x170 [ 137.211428] [] blk_peek_request+0x18c/0x320 [ 137.211433] [] scsi_request_fn+0x42/0x610 [ 137.211438] [] __blk_run_queue+0x33/0x40 [ 137.211442] [] bfq_kick_queue+0x25/0x40 [ 137.211447] [] process_one_work+0x1ed/0x660 [ 137.211451] [] ? process_one_work+0x15c/0x660 [ 137.211456] [] worker_thread+0x4e/0x450 [ 137.211460] [] ? process_one_work+0x660/0x660 [ 137.211464] [] ? process_one_work+0x660/0x660 [ 137.211468] [] kthread+0xfe/0x120 [ 137.211473] [] ? kthread_create_on_node+0x230/0x230 [ 137.211478] [] ret_from_fork+0x3f/0x70 [ 137.211482] [] ? kthread_create_on_node+0x230/0x230 [ 137.211485] ---[ end trace b39a62f95b1cad19 ]--- [ 137.211499] ------------[ cut here ]------------ [ 137.211506] kernel BUG at drivers/scsi/scsi_lib.c:1096! [ 137.211511] invalid opcode: 0000 [#1] PREEMPT SMP [ 137.211519] Modules linked in: nfsv3 nfs_acl auth_rpcgss oid_registry nfsv4 dns_resolver nfs lockd grace sunrpc fscache parport_pc ppdev parport fuse vmw_vsock_vmci_transport vsock vmw_vmci iscsi_ibft iscsi_boot_sysfs snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm i915 irqbypass crct10dif_pclmul crc32_pclmul snd_emu10k1 snd_hda_codec_ca0132 snd_util_mem aesni_intel snd_ac97_codec snd_hda_intel aes_x86_64 dcdbas snd_hda_codec glue_helper dm_mod ac97_bus lrw gf128mul ablk_helper cryptd snd_rawmidi snd_hda_core i2c_algo_bit snd_seq_device drm_kms_helper snd_hwdep syscopyarea snd_pcm sysfillrect sysimgblt snd_timer fb_sys_fops iTCO_wdt i2c_i801 drm snd iTCO_vendor_support emu10k1_gp battery lpc_ich gameport mfd_core soundcore [ 137.211630] mei_me video mei thermal button usbhid crc32c_intel sr_mod ehci_pci cdrom serio_raw xhci_pci ehci_hcd e1000e xhci_hcd ptp pps_core usbcore usb_common sg [ 137.211660] CPU: 0 PID: 391 Comm: kworker/0:1H Tainted: G W 4.4.0-rc3-test2+ #183 [ 137.211668] Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 [ 137.211675] Workqueue: kblockd bfq_kick_queue [ 137.211681] task: ffff880214754a40 ti: ffff880214304000 task.ti: ffff880214304000 [ 137.211713] RIP: 0010:[] [] scsi_init_sgtable+0x72/0x80 [ 137.211747] RSP: 0018:ffff880214307b90 EFLAGS: 00010097 [ 137.211777] RAX: 000000000000000e RBX: ffff8802147e68b0 RCX: 0000000000000006 [ 137.211807] RDX: 0000000000000024 RSI: 0000000000000000 RDI: 0000000000000009 [ 137.211838] RBP: ffff880214307ba8 R08: 0000000000000001 R09: 0000000000000000 [ 137.211868] R10: ffff8802147e68b0 R11: 0000000000000001 R12: ffff88003711c320 [ 137.211899] R13: 0000000000000000 R14: ffff8800370be000 R15: ffff8802147e68b0 [ 137.211930] FS: 0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000 [ 137.211961] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 137.211991] CR2: 000000000041bee0 CR3: 00000000d85ac000 CR4: 00000000001426f0 [ 137.212021] Stack: [ 137.212048] ffff88003711c200 000000000003b000 0000000000000000 ffff880214307bf0 [ 137.212082] ffffffff81515045 ffff880214307c28 0000000000000092 ffff8802147e68b0 [ 137.212115] 000000000003b000 ffff8800d8a90800 ffff8800370be000 0000000000000020 [ 137.212148] Call Trace: [ 137.212178] [] scsi_init_io+0x45/0x1d0 [ 137.212205] [] sd_setup_read_write_cmnd+0x38/0xa20 [ 137.212232] [] ? trace_hardirqs_off+0xd/0x10 [ 137.212258] [] sd_init_command+0x27/0xa0 [ 137.212285] [] scsi_setup_cmnd+0xf0/0x150 [ 137.212312] [] scsi_prep_fn+0xd1/0x170 [ 137.212338] [] blk_peek_request+0x18c/0x320 [ 137.212365] [] scsi_request_fn+0x42/0x610 [ 137.212392] [] __blk_run_queue+0x33/0x40 [ 137.212418] [] bfq_kick_queue+0x25/0x40 [ 137.212444] [] process_one_work+0x1ed/0x660 [ 137.212471] [] ? process_one_work+0x15c/0x660 [ 137.212498] [] worker_thread+0x4e/0x450 [ 137.212525] [] ? process_one_work+0x660/0x660 [ 137.212552] [] ? process_one_work+0x660/0x660 [ 137.212578] [] kthread+0xfe/0x120 [ 137.212605] [] ? kthread_create_on_node+0x230/0x230 [ 137.212631] [] ret_from_fork+0x3f/0x70 [ 137.212657] [] ? kthread_create_on_node+0x230/0x230 [ 137.212683] Code: 24 08 72 27 41 89 44 24 08 8b 43 5c 5b 41 89 44 24 10 44 89 e8 41 5c 41 5d 5d c3 41 bd 02 00 00 00 5b 44 89 e8 41 5c 41 5d 5d c3 <0f> 0b 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 [ 137.212798] RIP [] scsi_init_sgtable+0x72/0x80 [ 137.212826] RSP -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/