Date: Sun, 28 Apr 2024 15:38:34 +0300
From: Sagi Grimberg
To: kwb
Cc: axboe@fb.com, chunguang.xu@shopee.com, hch@lst.de, james.smart@broadcom.com, kbusch@kernel.org, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: [Bug Report] nvme connect deadlock in allocating tag
In-Reply-To: <20240428102527.37462-1-wangbing.kuang@shopee.com>
References: <20240428102527.37462-1-wangbing.kuang@shopee.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 28/04/2024 13:25, kwb wrote:
>> On 28/04/2024 12:16, Wangbing Kuang wrote:
>>> "The error_recovery work should unquiesce the admin_q, which should fail
>>> fast all pending admin commands,
>>> so it is unclear to me how the connect process gets stuck."
>>> I think the reason is: the command can be unquiesced, but the tag cannot be
>>> returned until the command succeeds.
>> The error recovery also cancels all pending requests. See
>> nvme_cancel_admin_tagset
> nvme_cancel_admin_tagset can cancel requests issued before the admin queue is
> stopped, but cannot cancel requests issued before the next reconnect.
The error recovery does quiesce + nvme_cancel_admin_tagset + unquiesce, so all
subsequent admin I/O should fail immediately upon submission because the
ctrl/queue is not live.

> The timeline is:
> recovery failed (we can reproduce this by hanging I/O for longer)
> -> reconnect delay
> -> multiple nvme list runs (use up the tagset)
> -> reconnect starts (waits for a tag when calling nvme_enable_ctrl and nvme_wait_ready)

Failing all admin I/O is not tied to the next reconnect; it happens well before
that, in the error recovery work. Hence it is still not clear to me how you are
seeing what you are seeing.

It is possible that 5.15 is missing something.

>
>>> "What is step (2) - make nvme io timeout to recover the connection?"
>>> I use spdk-nvmf-target as the backend. It is easy to make the nvmf-target
>>> hang and unhang read/write I/O. So I just hang I/O for over 30 seconds,
>>> which triggers an I/O timeout event on the Linux nvmf host; the I/O
>>> timeout then triggers connection recovery.
>>> By the way, I use multipath=0.
>> Interesting, does this happen with multipath=Y?
>> I didn't expect people to be using multipath=0 for fabrics in the past few
>> years.
> Not certain; I did not test with multipath=Y. We chose multipath=0 because
> it is less code and we need only one path.
>
>>> "Is this reproducing with upstream nvme? or is this some distro kernel
>>> where this happens?"
>>> It is reproduced on a kernel based on v5.15, but I think this is a common
>>> error.
>> It would be beneficial to verify this.
> OK, testing needs more time, but we can first verify it only on v5.15.

We should not be spending time debugging an issue that might have already been
addressed upstream. The first thing to do is to understand whether this
reproduces upstream; if so, fix it; if not, identify the missing patch(es) in
5.15.