From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Mike Marciniszyn,
    Kaike Wan, Dennis Dalessandro, Jason Gunthorpe
Subject: [PATCH 5.7 107/166] IB/hfi1: Do not destroy hfi1_wq when the device is shut down
Date: Tue, 14 Jul 2020 20:44:32 +0200
Message-Id: <20200714184120.968127370@linuxfoundation.org>
In-Reply-To: <20200714184115.844176932@linuxfoundation.org>
References: <20200714184115.844176932@linuxfoundation.org>

From: Kaike Wan

commit 28b70cd9236563e1a88a6094673fef3c08db0d51 upstream.

The workqueue hfi1_wq is destroyed in function shutdown_device(), which
is called by either shutdown_one() or remove_one(). The function
shutdown_one() is called when the kernel is rebooted while remove_one()
is called when the hfi1 driver is unloaded. When the kernel is rebooted,
hfi1_wq is destroyed while all qps are still active, leading to a
kernel crash:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000102
  IP: [] __queue_work+0x32/0x3e0
  PGD 0
  Oops: 0000 [#1] SMP
  Modules linked in: dm_round_robin nvme_rdma(OE) nvme_fabrics(OE)
    nvme_core(OE) ib_isert iscsi_target_mod target_core_mod ib_ucm mlx4_ib
    iTCO_wdt iTCO_vendor_support mxm_wmi sb_edac intel_powerclamp coretemp
    intel_rapl iosf_mbi kvm rpcrdma sunrpc irqbypass crc32_pclmul
    ghash_clmulni_intel rdma_ucm aesni_intel ib_uverbs lrw gf128mul
    opa_vnic glue_helper ablk_helper ib_iser cryptd ib_umad rdma_cm iw_cm
    ses enclosure libiscsi scsi_transport_sas pcspkr joydev ib_ipoib(OE)
    scsi_transport_iscsi ib_cm sg ipmi_ssif mei_me lpc_ich i2c_i801 mei
    ioatdma ipmi_si dm_multipath ipmi_devintf ipmi_msghandler wmi acpi_pad
    acpi_power_meter hangcheck_timer ip_tables ext4 mbcache jbd2 mlx4_en
    sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea
    sysfillrect sysimgblt fb_sys_fops ttm hfi1(OE) crct10dif_pclmul
    crct10dif_common crc32c_intel drm ahci mlx4_core libahci rdmavt(OE)
    igb megaraid_sas ib_core libata drm_panel_orientation_quirks ptp
    pps_core devlink dca i2c_algo_bit dm_mirror dm_region_hash dm_log
    dm_mod
  CPU: 19 PID: 0 Comm: swapper/19 Kdump: loaded Tainted: G OE ------------ 3.10.0-957.el7.x86_64 #1
  Hardware name: Phegda X2226A/S2600CW, BIOS SE5C610.86B.01.01.0024.021320181901 02/13/2018
  task: ffff8a799ba0d140 ti: ffff8a799bad8000 task.ti: ffff8a799bad8000
  RIP: 0010:[] [] __queue_work+0x32/0x3e0
  RSP: 0018:ffff8a90dde43d80 EFLAGS: 00010046
  RAX: 0000000000000082 RBX: 0000000000000086 RCX: 0000000000000000
  RDX: ffff8a90b924fcb8 RSI: 0000000000000000 RDI: 000000000000001b
  RBP: ffff8a90dde43db8 R08: ffff8a799ba0d6d8 R09: ffff8a90dde53900
  R10: 0000000000000002 R11: ffff8a90dde43de8 R12: ffff8a90b924fcb8
  R13: 000000000000001b R14: 0000000000000000 R15: ffff8a90d2890000
  FS: 0000000000000000(0000) GS:ffff8a90dde40000(0000) knlGS:0000000000000000
  CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000102 CR3: 0000001a70410000 CR4: 00000000001607e0
  Call Trace:
   [] queue_work_on+0x45/0x50
   [] _hfi1_schedule_send+0x6e/0xc0 [hfi1]
   [] hfi1_schedule_send+0x32/0x70 [hfi1]
   [] rvt_rc_timeout+0xe9/0x130 [rdmavt]
   [] ? trigger_load_balance+0x6a/0x280
   [] ? rvt_free_qpn+0x40/0x40 [rdmavt]
   [] call_timer_fn+0x38/0x110
   [] ? rvt_free_qpn+0x40/0x40 [rdmavt]
   [] run_timer_softirq+0x24d/0x300
   [] __do_softirq+0xf5/0x280
   [] call_softirq+0x1c/0x30
   [] do_softirq+0x65/0xa0
   [] irq_exit+0x105/0x110
   [] smp_apic_timer_interrupt+0x48/0x60
   [] apic_timer_interrupt+0x162/0x170
   [] ? cpuidle_enter_state+0x57/0xd0
   [] cpuidle_idle_call+0xde/0x230
   [] arch_cpu_idle+0xe/0xc0
   [] cpu_startup_entry+0x14a/0x1e0
   [] start_secondary+0x1f7/0x270
   [] start_cpu+0x5/0x14

The solution is to destroy the workqueue only when the hfi1 driver is
unloaded, not when the device is shut down. In addition, when the device
is shut down, no more work should be scheduled on the workqueues and the
workqueues are flushed.

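As an illustration only, the ordering described above can be modelled in
userspace C. Every sim_* name below is hypothetical and not part of the
hfi1 driver. The sketch also assumes that the shutdown path sets the
shutdown flag before flushing, which the hunks in this patch do not
show. It is a sketch of the ordering (stop scheduling, flush on
shutdown, destroy only on remove), not the driver's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the HFI1_SHUTDOWN device flag. */
#define SIM_SHUTDOWN 0x1u

/* Hypothetical stand-in for a workqueue: only counts work items. */
struct sim_wq {
	int queued;	/* accepted but not yet run */
	int executed;	/* already run */
};

/* Hypothetical stand-in for the per-device data. */
struct sim_dev {
	unsigned int flags;
	struct sim_wq *wq;	/* NULL once destroyed */
	struct sim_wq storage;
};

static void sim_init(struct sim_dev *dd)
{
	dd->flags = 0;
	dd->storage.queued = 0;
	dd->storage.executed = 0;
	dd->wq = &dd->storage;
}

/*
 * Models the guard added to the schedule helpers: once shutdown has
 * begun, report success without touching the workqueue.
 */
static bool sim_schedule_send(struct sim_dev *dd)
{
	if (dd->flags & SIM_SHUTDOWN)
		return true;

	/* Queueing on a destroyed workqueue is the crash being fixed. */
	assert(dd->wq != NULL);
	dd->wq->queued++;
	return true;
}

/* Shutdown path: stop new work, flush pending work, keep the queue. */
static void sim_shutdown_device(struct sim_dev *dd)
{
	dd->flags |= SIM_SHUTDOWN;	/* assumption: set before flushing */
	if (dd->wq) {
		dd->wq->executed += dd->wq->queued;
		dd->wq->queued = 0;
	}
}

/* Remove path: shut down first, then destroy the queue. */
static void sim_remove_one(struct sim_dev *dd)
{
	sim_shutdown_device(dd);
	dd->wq = NULL;
}
```

In this model a late timer callback that calls sim_schedule_send()
after sim_shutdown_device() sees the flag and returns without
dereferencing the queue, and the queue pointer only becomes NULL in
sim_remove_one().
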
Fixes: 8d3e71136a08 ("IB/{hfi1, qib}: Add handling of kernel restart")
Link: https://lore.kernel.org/r/20200623204047.107638.77646.stgit@awfm-01.aw.intel.com
Cc:
Reviewed-by: Mike Marciniszyn
Signed-off-by: Kaike Wan
Signed-off-by: Dennis Dalessandro
Signed-off-by: Jason Gunthorpe
Signed-off-by: Greg Kroah-Hartman
---
 drivers/infiniband/hw/hfi1/init.c     |   27 +++++++++++++++++++++++----
 drivers/infiniband/hw/hfi1/qp.c       |    5 ++++-
 drivers/infiniband/hw/hfi1/tid_rdma.c |    5 ++++-
 3 files changed, 31 insertions(+), 6 deletions(-)

--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -829,6 +829,25 @@ wq_error:
 }
 
 /**
+ * destroy_workqueues - destroy per port workqueues
+ * @dd: the hfi1_ib device
+ */
+static void destroy_workqueues(struct hfi1_devdata *dd)
+{
+	int pidx;
+	struct hfi1_pportdata *ppd;
+
+	for (pidx = 0; pidx < dd->num_pports; ++pidx) {
+		ppd = dd->pport + pidx;
+
+		if (ppd->hfi1_wq) {
+			destroy_workqueue(ppd->hfi1_wq);
+			ppd->hfi1_wq = NULL;
+		}
+	}
+}
+
+/**
  * enable_general_intr() - Enable the IRQs that will be handled by the
  * general interrupt handler.
  * @dd: valid devdata
@@ -1102,11 +1121,10 @@ static void shutdown_device(struct hfi1_
 		 */
 		hfi1_quiet_serdes(ppd);
 
-		if (ppd->hfi1_wq) {
-			destroy_workqueue(ppd->hfi1_wq);
-			ppd->hfi1_wq = NULL;
-		}
+		if (ppd->hfi1_wq)
+			flush_workqueue(ppd->hfi1_wq);
 		if (ppd->link_wq) {
+			flush_workqueue(ppd->link_wq);
 			destroy_workqueue(ppd->link_wq);
 			ppd->link_wq = NULL;
 		}
@@ -1757,6 +1775,7 @@ static void remove_one(struct pci_dev *p
 	 * clear dma engines, etc.
 	 */
 	shutdown_device(dd);
+	destroy_workqueues(dd);
 
 	stop_timers(dd);
 
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -381,7 +381,10 @@ bool _hfi1_schedule_send(struct rvt_qp *
 	struct hfi1_ibport *ibp =
 		to_iport(qp->ibqp.device, qp->port_num);
 	struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
-	struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device);
+	struct hfi1_devdata *dd = ppd->dd;
+
+	if (dd->flags & HFI1_SHUTDOWN)
+		return true;
 
 	return iowait_schedule(&priv->s_iowait, ppd->hfi1_wq,
 			       priv->s_sde ?
--- a/drivers/infiniband/hw/hfi1/tid_rdma.c
+++ b/drivers/infiniband/hw/hfi1/tid_rdma.c
@@ -5406,7 +5406,10 @@ static bool _hfi1_schedule_tid_send(stru
 	struct hfi1_ibport *ibp =
 		to_iport(qp->ibqp.device, qp->port_num);
 	struct hfi1_pportdata *ppd = ppd_from_ibp(ibp);
-	struct hfi1_devdata *dd = dd_from_ibdev(qp->ibqp.device);
+	struct hfi1_devdata *dd = ppd->dd;
+
+	if ((dd->flags & HFI1_SHUTDOWN))
+		return true;
 
 	return iowait_tid_schedule(&priv->s_iowait, ppd->hfi1_wq,
 				   priv->s_sde ?