From: 许春光
Date: Tue, 6 Jun 2023 22:32:45 +0800
Subject: Re: [RFC PATCH 0/4] nvme-tcp: fix hung issues for deleting
To: Sagi Grimberg
Cc: kbusch@kernel.org, axboe@kernel.dk, hch@lst.de, linux-nvme@lists.infradead.org, linux-kernel@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Grimberg,

I have read Ming's patch. It seems that Ming fixes a case my patchset
missed: Ming mainly fixes the hang when reconnect fails. My patchset fixes
a different issue. While error recovery or reconnect is in progress (max
retries not yet reached), the user actively removes the ctrl (nvme
disconnect). This interrupts error recovery or reconnect, but the ctrl is
still frozen and the request queue is still quiesced, so new IO and
timed-out IOs cannot make progress. As a result, nvme_remove_namespaces()
hangs while flushing scan_work or in blk_mq_freeze_queue_wait(), and new IO
hangs in __bio_queue_enter(). It seems that if the first patch adds the
following code, it may also cover Ming's case:

 static void nvme_tcp_reconnect_or_remove(struct nvme_ctrl *ctrl)
 {
 	/* If we are resetting/deleting then do nothing */
 	if (ctrl->state != NVME_CTRL_CONNECTING) {
 		WARN_ON_ONCE(ctrl->state == NVME_CTRL_NEW ||
 			ctrl->state == NVME_CTRL_LIVE);
 		return;
 	}

 	if (nvmf_should_reconnect(ctrl)) {
 		dev_info(ctrl->device, "Reconnecting in %d seconds...\n",
 			ctrl->opts->reconnect_delay);
 		queue_delayed_work(nvme_wq, &to_tcp_ctrl(ctrl)->connect_work,
 				ctrl->opts->reconnect_delay * HZ);
 	} else {
 		dev_info(ctrl->device, "Removing controller...\n");
 		nvme_delete_ctrl(ctrl);
+		nvme_ctrl_reconnect_exit(ctrl);
 	}
 }

Thanks.

Sagi Grimberg wrote on Tue, Jun 6, 2023 at 07:09:
>
>
> > From: Chunguang Xu
> >
> > We found that nvme_remove_namespaces() may hang in flush_work(&ctrl->scan_work)
> > while removing ctrl. The root cause may be that the state of ctrl is changed to
> > NVME_CTRL_DELETING while removing ctrl, which interrupts nvme_tcp_error_recovery_work()/
> > nvme_reset_ctrl_work()/nvme_tcp_reconnect_or_remove(). At this time, ctrl is
> > frozen and the queue is quiesced. Since scan_work may continue to issue IOs to
> > load the partition table, it gets blocked, which leads to nvme_tcp_error_recovery_work()
> > hanging in flush_work(&ctrl->scan_work).
> >
> > After analysis, we found that there are mainly two cases:
> > 1. Since ctrl is frozen, scan_work hangs in __bio_queue_enter() while it issues
> > new IO to load the partition table.
> > 2. Since the queue is quiesced, requeued timed-out IO may hang in the hctx->dispatch
> > queue, leaving scan_work waiting for IO completion.
>
> Hey, can you please look at the discussion with Ming's proposal in
> "nvme: add nvme_delete_dead_ctrl for avoiding io deadlock" ?
> Hi grimberg, I have looked at Ming's patch, I think we may fix
> Looks the same to me.
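
For illustration only, the two cases discussed above can be modelled in a
small user-space C sketch. This is not kernel code: every toy_* name is
hypothetical, and the sketch assumes that the proposed
nvme_ctrl_reconnect_exit() undoes the freeze and quiesce, which the thread
does not spell out. It only shows why a ctrl left frozen and quiesced
cannot let scan_work's IO make progress.

```c
#include <stdbool.h>

/* Toy user-space model; none of these types are the kernel's. */
enum toy_state { TOY_LIVE, TOY_CONNECTING, TOY_DELETING };

struct toy_ctrl {
	enum toy_state state;
	bool frozen;        /* models a frozen queue: new IO blocks on entry */
	bool quiesced;      /* models a quiesced queue: requeued IO is not dispatched */
	int nr_reconnects;
	int max_reconnects; /* -1 means retry forever */
};

/* Error recovery starts: block IO and move to CONNECTING. */
static void toy_error_recovery(struct toy_ctrl *c)
{
	c->frozen = true;
	c->quiesced = true;
	c->state = TOY_CONNECTING;
}

/*
 * Hypothetical stand-in for the proposed nvme_ctrl_reconnect_exit().
 * Assumption: it releases the quiesce and the freeze.
 */
static void toy_reconnect_exit(struct toy_ctrl *c)
{
	c->quiesced = false;
	c->frozen = false;
}

/* Retry while the retry budget is not exhausted. */
static bool toy_should_reconnect(const struct toy_ctrl *c)
{
	return c->max_reconnects == -1 ||
	       c->nr_reconnects < c->max_reconnects;
}

/* IO (for example from the scan work) progresses only if nothing blocks it. */
static bool toy_io_can_progress(const struct toy_ctrl *c)
{
	return !c->frozen && !c->quiesced;
}

/* Model of the reconnect-or-remove decision; with_fix toggles the added call. */
static void toy_reconnect_or_remove(struct toy_ctrl *c, bool with_fix)
{
	if (c->state != TOY_CONNECTING)
		return; /* interrupted by delete/reset: nothing is undone here */

	if (toy_should_reconnect(c)) {
		c->nr_reconnects++; /* schedule another attempt */
		return;
	}

	c->state = TOY_DELETING; /* retries exhausted: remove the ctrl */
	if (with_fix)
		toy_reconnect_exit(c);
}
```

In this model, exhausting the retries without the extra call leaves the
ctrl in TOY_DELETING with IO still blocked, while the extra call lets IO
progress. A ctrl moved to TOY_DELETING by the user in the middle of
recovery returns early and stays blocked either way, which is the case the
rest of the patchset has to handle separately.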