Received: by 2002:a25:7ec1:0:0:0:0:0 with SMTP id z184csp853234ybc; Tue, 19 Nov 2019 10:18:32 -0800 (PST) X-Google-Smtp-Source: APXvYqwl5t8MWQUepGt+Kgr96kOYLuj5wQmQyrIiz+NdHdvXqocj37USAUNdvE59LRQwRBrPtdxy X-Received: by 2002:a17:906:2615:: with SMTP id h21mr36957465ejc.212.1574187512803; Tue, 19 Nov 2019 10:18:32 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1574187512; cv=none; d=google.com; s=arc-20160816; b=eEI/if7cCVFjJu9/3t9WWo0Gzbrbhux+oUVAd8XsNdv2rgLoOxB/FUfj+Jhv35uFRH 4Ecv39PwFhqNQXFbR2Q9oAoTBO+sthYhIvg4ZLU/gO/1mlarNdbHf89Brt+FvgHl6gkz 99D3STS0QdkKVsOfCant+Tn1VhQl61N/nkkKiDrKnDI+Efo6fw6HcBOdsG6nAoW7faCt SxLXHI8qdFFByU+HaVvTWWFhdX5YntohWkWxHBskmCJZRHZ+WFOovslbczHFBHtxtrhO 9pEhnlbQ+D13c1gMw6FxekKIG/HemPpf0eRbD9F5Ad6Rc1mgOaMb1osvGpZM/SSIsVlH dZZw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=8NrpdQf2ZBW4x4Pq9ydpuBKwx0Gsq3Fx7T4KV7KYjyM=; b=eabbAc5WZNpSCRlAsJsT8MXolXvtUv9+OIv4gcGKJl+z/1PgT/IZRe+G3wu9bASXaA wLZPL4T3qyigfW6TYED58wh3sJkBdjK3RgKawCdqnEB9ar/jJIvbELBfKI0lCuR4mi1S K3EOh0o/scdvI3g4g3DegP21FjodWDFNZ9aLMrrBGLz6HqircVNxIRZOqhG0uVp/Lv1z lL2M1OZy8GhUWfaMD58bWHWk+BXOjut9fdN99CarchdkoyRTF7gZPuf8mPFV0ao7DZ50 8Gm4MTZNa7iLRguVCRQrZX6dMJ4cCPJjfT+3OsiZ9G2Qv9rpj6gwU1au8pYcyuZ2VG9b Qi6w== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o18si14233606ejj.427.2019.11.19.10.18.08; Tue, 19 Nov 2019 10:18:32 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727264AbfKSSOj (ORCPT + 99 others); Tue, 19 Nov 2019 13:14:39 -0500 Received: from mx2.suse.de ([195.135.220.15]:48234 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726510AbfKSSOj (ORCPT ); Tue, 19 Nov 2019 13:14:39 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id 87068AF22; Tue, 19 Nov 2019 18:14:36 +0000 (UTC) Date: Tue, 19 Nov 2019 19:14:35 +0100 From: Daniel Wagner To: James Smart Cc: "Martin K. Petersen" , Dick Kennedy , linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] scsi: lpfc: Move work items to a stack list Message-ID: <20191119181435.taxa56wbf4zd4f2f@beryllium.lan> References: <20191105080855.16881-1-dwagner@suse.de> <20191119132854.mwkxx4fixjaoxv4w@beryllium.lan> <4a1fedc9-03f1-7312-fd50-a041a78c0294@broadcom.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4a1fedc9-03f1-7312-fd50-a041a78c0294@broadcom.com> User-Agent: NeoMutt/20180716 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi James, On Tue, Nov 19, 2019 at 09:25:54AM -0800, James Smart wrote: > On 11/19/2019 5:28 AM, Daniel Wagner wrote: > > On Tue, Nov 12, 2019 at 10:15:00PM -0500, Martin K. Petersen wrote: > > > > While trying to understand what's going on in the Oops below I figured > > > > that it could be the result of the invalid pointer access. The patch > > > > still needs testing by our customer but indepent of this I think the > > > > patch fixes a real bug. > > I was able to reproduce the same stack trace with this patch > > applied... That is obviously bad. The good news, I have access to this > > machine, so maybe I able to figure out what's the root cause of this > > crash. > > fyi - Dick and I were taking our time reviewing your patch as the major > concern we had with it was that the splice and then the re-add of the > work-list could have an abort complete before the IO, which could lead to > DC. I think there is potentially a bug. The crash I see might not be related to my attempt here to fix it. After looking at it again, it looks a bit ugly and over complex compared to what should do the trick: @@ -3915,6 +3915,9 @@ lpfc_sli_handle_slow_ring_event_s4(struct lpfc_hba *phba, cq_event, struct lpfc_cq_event, list); spin_unlock_irqrestore(&phba->hbalock, iflag); + if (!cq_event) + break; + switch (bf_get(lpfc_wcqe_c_code, &cq_event->cqe.wcqe_cmpl)) { case CQE_CODE_COMPL_WQE: irspiocbq = container_of(cq_event, struct lpfc_iocbq, > We'll look anew at your repro. Thanks a lot. Anyway, I tried to gather some more data on it. (gdb) l *lpfc_sli4_io_xri_aborted+0x29c 0xa1f2c is in lpfc_sli4_io_xri_aborted (drivers/scsi/lpfc/lpfc_scsi.c:543). 538 } 539 #endif 540 qp->abts_scsi_io_bufs--; 541 spin_unlock(&qp->abts_io_buf_list_lock); 542 543 if (psb->rdata && psb->rdata->pnode) 544 ndlp = psb->rdata->pnode; 545 else 546 ndlp = NULL; 547 I looked at the assembler code and if I am not completely mistaken it is 'psb->rdata' which triggers the KASAN dump. So I though let's reset psb->rdata when the buf is freed in order to avoid the report. So this was just to find out if my theory holds. ff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index ba26df90a36a..8e3a05e44d76 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -791,10 +791,12 @@ lpfc_release_scsi_buf_s4(struct lpfc_hba *phba, struct lpfc_io_buf *psb) if (psb->flags & LPFC_SBUF_XBUSY) { spin_lock_irqsave(&qp->abts_io_buf_list_lock, iflag); psb->pCmd = NULL; + psb->rdata = NULL; list_add_tail(&psb->list, &qp->lpfc_abts_io_buf_list); qp->abts_scsi_io_bufs++; spin_unlock_irqrestore(&qp->abts_io_buf_list_lock, iflag); } else { + psb->rdata = NULL; lpfc_release_io_buf(phba, (struct lpfc_io_buf *)psb, qp); } } Unfortunatly, this didn't help and instead I got a new KASAN report instead: [ 167.006341] ================================================================== [ 167.006428] BUG: KASAN: slab-out-of-bounds in lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006434] Read of size 2 at addr ffff88a035f792e2 by task lpfc_worker_3/1130 [ 167.006435] [ 167.006444] CPU: 4 PID: 1130 Comm: lpfc_worker_3 Tainted: G E 5.4.0-rc1-default+ #3 [ 167.006447] Hardware name: HP ProLiant DL580 Gen9/ProLiant DL580 Gen9, BIOS U17 07/21/2019 [ 167.006449] Call Trace: [ 167.006463] dump_stack+0x71/0xab [ 167.006535] ? lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006545] print_address_description.constprop.6+0x1b/0x2f0 [ 167.006615] ? lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006684] ? lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006688] sd 7:0:0:0: alua: port group 3e9 state N non-preferred supports TolUsNA [ 167.006693] __kasan_report+0x14e/0x192 [ 167.006764] ? lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006770] kasan_report+0xe/0x20 [ 167.006840] lpfc_unreg_login+0x7c/0xc0 [lpfc] [ 167.006912] lpfc_sli_def_mbox_cmpl+0x329/0x560 [lpfc] [ 167.006921] ? _raw_spin_lock_irqsave+0x8d/0xf0 [ 167.006991] ? __lpfc_sli_rpi_release.isra.59+0xc0/0xc0 [lpfc] [ 167.007057] lpfc_sli_handle_mb_event+0x3f3/0x8d0 [lpfc] [ 167.007064] ? _raw_spin_lock_irqsave+0x8d/0xf0 [ 167.007069] ? _raw_write_lock_bh+0xe0/0xe0 [ 167.007138] ? lpfc_mbox_get+0x1d/0xa0 [lpfc] [ 167.007204] ? lpfc_sli_free_hbq+0x80/0x80 [lpfc] [ 167.007277] lpfc_do_work+0x100b/0x26e0 [lpfc] [ 167.007285] ? __schedule+0x6ca/0xc10 [ 167.007288] ? _raw_spin_lock_irqsave+0x8d/0xf0 [ 167.007329] ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc] [ 167.007334] ? wait_woken+0x130/0x130 [ 167.007375] ? lpfc_unregister_unused_fcf+0xc0/0xc0 [lpfc] [ 167.007380] kthread+0x1b3/0x1d0 [ 167.007383] ? kthread_create_worker_on_cpu+0xc0/0xc0 [ 167.007386] ret_from_fork+0x35/0x40 [ 167.007389] [ 167.007391] Allocated by task 667: [ 167.007395] save_stack+0x19/0x80 [ 167.007398] __kasan_kmalloc.constprop.9+0xa0/0xd0 [ 167.007400] __kmalloc+0xfb/0x5d0 [ 167.007439] lpfc_sli4_alloc_resource_identifiers+0x260/0x8a0 [lpfc] [ 167.007478] lpfc_sli4_hba_setup+0xbbd/0x3480 [lpfc] [ 167.007517] lpfc_pci_probe_one_s4.isra.46+0x100b/0x1d70 [lpfc] [ 167.007556] lpfc_pci_probe_one+0x191/0x11b0 [lpfc] [ 167.007563] local_pci_probe+0x74/0xd0 [ 167.007567] work_for_cpu_fn+0x29/0x40 [ 167.007570] process_one_work+0x46e/0x7f0 [ 167.007573] worker_thread+0x4cd/0x6b0 [ 167.007575] kthread+0x1b3/0x1d0 [ 167.007578] ret_from_fork+0x35/0x40 [ 167.007578] [ 167.007580] Freed by task 0: [ 167.007580] (stack is not available) [ 167.007581] [ 167.007584] The buggy address belongs to the object at ffff88a035f790c0 [ 167.007584] which belongs to the cache kmalloc-512 of size 512 [ 167.007587] The buggy address is located 34 bytes to the right of [ 167.007587] 512-byte region [ffff88a035f790c0, ffff88a035f792c0) [ 167.007588] The buggy address belongs to the page: [ 167.007592] page:ffffea0080d7de40 refcount:1 mapcount:0 mapping:ffff888107c00c40 index:0x0 [ 167.007595] flags: 0xd7ffffc0000200(slab) [ 167.007601] raw: 00d7ffffc0000200 ffffea0080ec5588 ffffea0080d3e988 ffff888107c00c40 [ 167.007604] raw: 0000000000000000 ffff88a035f790c0 0000000100000006 0000000000000000 [ 167.007606] page dumped because: kasan: bad access detected [ 167.007606] [ 167.007607] Memory state around the buggy address: [ 167.007610] ffff88a035f79180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 167.007613] ffff88a035f79200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 167.007615] >ffff88a035f79280: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc [ 167.007617] ^ [ 167.007619] ffff88a035f79300: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 167.007622] ffff88a035f79380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 167.007623] ================================================================== (gdb) l *lpfc_unreg_login+0x7c 0x8bb1c is in lpfc_unreg_login (drivers/scsi/lpfc/lpfc_mbox.c:826). 821 memset(pmb, 0, sizeof (LPFC_MBOXQ_t)); 822 823 mb->un.varUnregLogin.rpi = rpi; 824 mb->un.varUnregLogin.rsvd1 = 0; 825 if (phba->sli_rev >= LPFC_SLI_REV3) 826 mb->un.varUnregLogin.vpi = phba->vpi_ids[vpi]; 827 828 mb->mbxCommand = MBX_UNREG_LOGIN; 829 mb->mbxOwner = OWN_HOST; 830 Not sure if this is not again an red herring. Thanks, Daniel