Received: by 2002:a05:6a10:16a7:0:0:0:0 with SMTP id gp39csp192786pxb; Wed, 11 Nov 2020 00:54:59 -0800 (PST) X-Google-Smtp-Source: ABdhPJzw7msxVYaOooYjt9T4XGD8lEwkq1TND93GFBh737e9dnaq/+u4bvxK1BpGRcRHTudk/ct3 X-Received: by 2002:aa7:d787:: with SMTP id s7mr3720594edq.205.1605084898815; Wed, 11 Nov 2020 00:54:58 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1605084898; cv=none; d=google.com; s=arc-20160816; b=EsujpNkpAFPJp3ZEaG6+eJvGltl08mRMmWRy5CnXHwWIjx9vxFDfJT7bTxeek+t6KU 8xK62i8EFB6KUgP3ahpPVXsrrId0HBK3OzidIpf127lSwRHd6TFUkuUxFGbsL8kgERn6 m7IAhoeQwyKX5gkiT78d9/mbXqvqCTkmzUtIxdDuxzu5F0XuhrsKPAR+bXii/7YC2AoT g7zot4AwNQHPCoaVKQtDpcdjx2VHRYm7R57C9wXJJTvZYc3rrghUiaE3F5mpC14kWYVZ by3hsSspogSI9xBgSa0nK4spGVy1ipNfEbfq74XAalEVuo1D35qmuOgNfvR19hjWSmt4 051w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :user-agent:references:in-reply-to:date:cc:to:from:subject :message-id:dkim-signature; bh=z1/aW+YKmNuw9aiD10rFgY6KBkltVd7qVsLEcIqPOP4=; b=w+pJC0AUwym6Ltfcx+qIcLmVGelHFdaGIxYb06tSHpMKtGL2KI6gl1RX8hbEHZw1OW eRhxu0mKcWuxPMx5iafBJ/LIDLB74GwnW/FR2ediABBUXSiMi7GauInBTULl6uSKbSPl gZil9mB/8tYCTqvc4eDIwHXqhV2Ie6NuMwaTu2sML0c8JRU3WZnNOblF5T76sT0uwI0W 1u2npTOnJSzTWjxuk3L3r4vCBP27ncF6u3Zb+WaCDNazrrvTNqXVrwV8BIDb7bScQ1Xk tcyTZb16eyq6Tud+B+GKF9cNkcFPazN3L/q/eWATNR4pnNz2R1I6ysLgh/rLACtfKyeV M9yQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=rUf8zgU8; spf=pass (google.com: domain of linux-wireless-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-wireless-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id cc20si951415edb.341.2020.11.11.00.54.30; Wed, 11 Nov 2020 00:54:58 -0800 (PST) Received-SPF: pass (google.com: domain of linux-wireless-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=rUf8zgU8; spf=pass (google.com: domain of linux-wireless-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-wireless-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726343AbgKKIxo (ORCPT + 99 others); Wed, 11 Nov 2020 03:53:44 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37628 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726026AbgKKIxn (ORCPT ); Wed, 11 Nov 2020 03:53:43 -0500 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 31967C0613D1; Wed, 11 Nov 2020 00:53:43 -0800 (PST) Received: by mail-pg1-x543.google.com with SMTP id h6so1038660pgk.4; Wed, 11 Nov 2020 00:53:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=message-id:subject:from:to:cc:date:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=z1/aW+YKmNuw9aiD10rFgY6KBkltVd7qVsLEcIqPOP4=; b=rUf8zgU89T+ptKdrm3imu9Xzdh9pKukAV6oEM4nODNMU9rLneO5Zz/1AdUQaXUzJEY oeGWtuUXLJXrfZh734jIvDovPKnDzrZvbwO9zDZM06LUld36uO4WfJDOxUWFQIIdvnWK aJ8JgMCK7A4XvJ5LPvPNzn99NGhqRYsDDtHY4OvhdPyhcvDlhWOqv44J9l249lVfOobR Kfsownt9oTajo0pKBdJoHWGF7f4sK34VhLiLKdJARArDdIE2zAnBLYU+dWuaNMnm9xHL wRrOKLTmnGGI8KAaDmTVR+zblPe3YR0EUW6C3FRKU3tZprmsxO/jmNS9q+ZMdcWBfKTz FNrg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=z1/aW+YKmNuw9aiD10rFgY6KBkltVd7qVsLEcIqPOP4=; b=dNwj60P+YJLInUU+dcq+Np3XE7iZI/3SIVtaSoS7SWFZ5uPKZsYOLTbvHjcrsXuDnz M9jX5ta/3jx+ISBh82PO0xlW3TWO2vWXqLNn0zb5KOznxy+cMCwYwQdUelhUd/7AUc7i P9VO5ZAsaK5Mzge7uL0SWkKYaX8dGZXnrj59drdSRss8YMZJIPEE+C2U/ehOb4nHxqD1 eXddPeBGnysy5Arv0J9Cnpp8sw7ivjReX/FyWq7Pzirm9yVcnLZePBOlvdEi4sEAMOIT aXIiV3BomPOMPzrSGXErxEfzBAj+vj8aL95bHvv9Y9JRLvGSoDirpnUPqIDbnLpiymro 8PmQ== X-Gm-Message-State: AOAM532sYWsL32UhFJfkEx/twyJ2lR5M+p1mtpN9E0Exw4R8E/BQuJXg OH2sqXsjfWNrT1qeOM96Ijg= X-Received: by 2002:a65:6a13:: with SMTP id m19mr4510161pgu.260.1605084822677; Wed, 11 Nov 2020 00:53:42 -0800 (PST) Received: from [192.168.1.59] (i60-35-254-237.s41.a020.ap.plala.or.jp. [60.35.254.237]) by smtp.gmail.com with ESMTPSA id c3sm1661048pjv.27.2020.11.11.00.53.39 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 11 Nov 2020 00:53:42 -0800 (PST) Message-ID: Subject: Re: [PATCH] mwifiex: pcie: skip cancel_work_sync() on reset failure path From: Tsuchiya Yuto To: Kalle Valo Cc: Amitkumar Karwar , Ganapathi Bhat , Xinming Hu , "David S. Miller" , Jakub Kicinski , linux-wireless@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Maximilian Luz , Andy Shevchenko , verdre@v0yd.nl Date: Wed, 11 Nov 2020 17:53:37 +0900 In-Reply-To: <20201110185139.A1541C433C9@smtp.codeaurora.org> References: <20201028142346.18355-1-kitakar@gmail.com> <20201110185139.A1541C433C9@smtp.codeaurora.org> Content-Type: text/plain; charset="UTF-8" User-Agent: Evolution 3.38.1 MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-wireless@vger.kernel.org On Tue, 2020-11-10 at 18:51 +0000, Kalle Valo wrote: > Tsuchiya Yuto wrote: > > > If a reset is performed, but even the reset fails for some reasons (e.g., > > on Surface devices, the fw reset requires another quirks), > > cancel_work_sync() hangs in mwifiex_cleanup_pcie(). > > > > # firmware went into a bad state > > [...] > > [ 1608.281690] mwifiex_pcie 0000:03:00.0: info: shutdown mwifiex... > > [ 1608.282724] mwifiex_pcie 0000:03:00.0: rx_pending=0, tx_pending=1, cmd_pending=0 > > [ 1608.292400] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed > > [ 1608.292405] mwifiex_pcie 0000:03:00.0: PREP_CMD: card is removed > > # reset performed after firmware went into a bad state > > [ 1609.394320] mwifiex_pcie 0000:03:00.0: WLAN FW already running! Skip FW dnld > > [ 1609.394335] mwifiex_pcie 0000:03:00.0: WLAN FW is active > > # but even the reset failed > > [ 1619.499049] mwifiex_pcie 0000:03:00.0: mwifiex_cmd_timeout_func: Timeout cmd id = 0xfa, act = 0xe000 > > [ 1619.499094] mwifiex_pcie 0000:03:00.0: num_data_h2c_failure = 0 > > [ 1619.499103] mwifiex_pcie 0000:03:00.0: num_cmd_h2c_failure = 0 > > [ 1619.499110] mwifiex_pcie 0000:03:00.0: is_cmd_timedout = 1 > > [ 1619.499117] mwifiex_pcie 0000:03:00.0: num_tx_timeout = 0 > > [ 1619.499124] mwifiex_pcie 0000:03:00.0: last_cmd_index = 0 > > [ 1619.499133] mwifiex_pcie 0000:03:00.0: last_cmd_id: fa 00 07 01 07 01 07 01 07 01 > > [ 1619.499140] mwifiex_pcie 0000:03:00.0: last_cmd_act: 00 e0 00 00 00 00 00 00 00 00 > > [ 1619.499147] mwifiex_pcie 0000:03:00.0: last_cmd_resp_index = 3 > > [ 1619.499155] mwifiex_pcie 0000:03:00.0: last_cmd_resp_id: 07 81 07 81 07 81 07 81 07 81 > > [ 1619.499162] mwifiex_pcie 0000:03:00.0: last_event_index = 2 > > [ 1619.499169] mwifiex_pcie 0000:03:00.0: last_event: 58 00 58 00 58 00 58 00 58 00 > > [ 1619.499177] mwifiex_pcie 0000:03:00.0: data_sent=0 cmd_sent=1 > > [ 1619.499185] mwifiex_pcie 0000:03:00.0: ps_mode=0 ps_state=0 > > [ 1619.499215] mwifiex_pcie 0000:03:00.0: info: _mwifiex_fw_dpc: unregister device > > # mwifiex_pcie_work hang happening > > [ 1823.233923] INFO: task kworker/3:1:44 blocked for more than 122 seconds. > > [ 1823.233932] Tainted: G WC OE 5.10.0-rc1-1-mainline #1 > > [ 1823.233935] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [ 1823.233940] task:kworker/3:1 state:D stack: 0 pid: 44 ppid: 2 flags:0x00004000 > > [ 1823.233960] Workqueue: events mwifiex_pcie_work [mwifiex_pcie] > > [ 1823.233965] Call Trace: > > [ 1823.233981] __schedule+0x292/0x820 > > [ 1823.233990] schedule+0x45/0xe0 > > [ 1823.233995] schedule_timeout+0x11c/0x160 > > [ 1823.234003] wait_for_completion+0x9e/0x100 > > [ 1823.234012] __flush_work.isra.0+0x156/0x210 > > [ 1823.234018] ? flush_workqueue_prep_pwqs+0x130/0x130 > > [ 1823.234026] __cancel_work_timer+0x11e/0x1a0 > > [ 1823.234035] mwifiex_cleanup_pcie+0x28/0xd0 [mwifiex_pcie] > > [ 1823.234049] mwifiex_free_adapter+0x24/0xe0 [mwifiex] > > [ 1823.234060] _mwifiex_fw_dpc+0x294/0x560 [mwifiex] > > [ 1823.234074] mwifiex_reinit_sw+0x15d/0x300 [mwifiex] > > [ 1823.234080] mwifiex_pcie_reset_done+0x50/0x80 [mwifiex_pcie] > > [ 1823.234087] pci_try_reset_function+0x5c/0x90 > > [ 1823.234094] process_one_work+0x1d6/0x3a0 > > [ 1823.234100] worker_thread+0x4d/0x3d0 > > [ 1823.234107] ? rescuer_thread+0x410/0x410 > > [ 1823.234112] kthread+0x142/0x160 > > [ 1823.234117] ? __kthread_bind_mask+0x60/0x60 > > [ 1823.234124] ret_from_fork+0x22/0x30 > > [...] > > > > This is a deadlock caused by calling cancel_work_sync() in > > mwifiex_cleanup_pcie(): > > > > - Device resets are done via mwifiex_pcie_card_reset() > > - which schedules card->work to call mwifiex_pcie_card_reset_work() > > - which calls pci_try_reset_function(). > > - This leads to mwifiex_pcie_reset_done() be called on the same workqueue, > > which in turn calls > > - mwifiex_reinit_sw() and that calls > > - _mwifiex_fw_dpc(). > > > > The problem is now that _mwifiex_fw_dpc() calls mwifiex_free_adapter() > > in case firmware initialization fails. That ends up calling > > mwifiex_cleanup_pcie(). > > > > Note that all those calls are still running on the workqueue. So when > > mwifiex_cleanup_pcie() now calls cancel_work_sync(), it's really waiting > > on itself to complete, causing a deadlock. > > > > This commit fixes the deadlock by skipping cancel_work_sync() on a reset > > failure path. > > > > After this commit, when reset fails, the following output is > > expected to be shown: > > > > kernel: mwifiex_pcie 0000:03:00.0: info: _mwifiex_fw_dpc: unregister device > > kernel: mwifiex: Failed to bring up adapter: -5 > > kernel: mwifiex_pcie 0000:03:00.0: reinit failed: -5 > > > > To reproduce this issue, for example, try putting the root port of wifi > > into D3 (replace "00:1d.3" with your setup). > > > > # put into D3 (root port) > > sudo setpci -v -s 00:1d.3 CAP_PM+4.b=0b > > > > Cc: Maximilian Luz > > Signed-off-by: Tsuchiya Yuto > > Patch applied to wireless-drivers-next.git, thanks. > > 4add4d988f95 mwifiex: pcie: skip cancel_work_sync() on reset failure path > Sorry, but is it too late to ask you to change my commit message? I'd really appreciate it if it's still possible. If it's difficult, please ignore this. Even if it's too late, I hope writing the improved one here still helps someone as long as there is a link to this mail thread in the Link tag. So, I write the improved one below anyway: - added the vendor name (Microsoft) of devices where we observed this issue - fixed some grammatical errors If a reset is performed but even the reset fails for some reasons (e.g., on Microsoft Surface devices, fw reset requires other quirks), cancel_work_sync() in mwifiex_cleanup_pcie() hangs. [...] Thank you.