Date: Thu, 19 Jul 2018 16:35:34 +0200
From: Johannes Thumshirn
To: Christoph Hellwig
Cc: Sagi Grimberg, Keith Busch, James Smart, Hannes Reinecke, Ewan Milne, Max Gurtovoy, Linux NVMe Mailinglist, Linux Kernel Mailinglist
Subject: Re: [PATCH 0/4] Rework NVMe abort handling
Message-ID: <20180719143534.i36vo45lhz24xbrg@linux-x5ow.site>
In-Reply-To: <20180719142355.GA18800@lst.de>
References: <20180719132838.15556-1-jthumshirn@suse.de> <20180719134203.GA15212@lst.de> <20180719141025.yveza2svhvc2r4lw@linux-x5ow.site> <20180719142355.GA18800@lst.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jul 19, 2018 at 04:23:55PM +0200, Christoph Hellwig wrote:
> On Thu, Jul 19, 2018 at 04:10:25PM +0200, Johannes Thumshirn wrote:
> > The problem I'm trying to solve here is really just single commands
> > timing out because of e.g. a bad switch in between which causes frame
> > loss somewhere.
>
> And that is exactly the case where NVMe abort does not actually work
> in any sensible way.
>
> Remember that while NVMe guarantees ordered delivery inside a given
> queue it does not guarantee anything between multiple queues.
>
> So now you have your buggy FC setup where an I/O command times out
> because your switch delayed it for two hours due to a firmware bug.
>
> After 30 seconds we send an abort over the admin queue, which happens
> to pass through just fine. The controller will tell you: no command
> found as it has never seen it.
>
> Now with the code following what we have in PCIe that just means
> we'll eventually reset the controller after the I/O command times out
> the second time as we still won't have seen a completion for it.

Exactly that was my intention.

> If you incorrectly just continue and resend the command we'll actually
> get the command sent twice and thus a potential bug once the original
> command just gets sent along.

OK, let me see where I'm stuck here.

We're issuing a command, it gets lost due to $REASON and I'm aborting
it. The upper layers then eventually retry the command and it arrives
at the target side. But so does the old command, and we have a
duplicate. Correct?

So if we keep our old behavior and tear down the queues and
re-establish them, then the upper layers retry the command and it
arrives on the target. But shortly afterwards the switch happens to
find the old command in its ingress buffers and decides to forward it
to the target as well. How does that differ?

The CMDID and SQID are probably different but the payload will be the
same, won't it? So we still have our duplicate on the other side,
don't we?

I feel I'm missing something here.

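To make the abort-then-retry sequence discussed above concrete, here is a minimal Python sketch of it. It is a toy model, not kernel code: `Command`, `Target` and `abort_then_retry` are hypothetical names invented for illustration, the payload string stands in for the real command contents, and the retry is assumed to go out on a different queue so that no in-queue ordering guarantee applies to it. It only covers the "abort, then resend" case and says nothing about what happens after a queue teardown.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Command:
    sqid: int      # submission queue the command was sent on
    cid: int       # command identifier, unique only within its queue
    payload: str   # stands in for opcode, LBA range and data


@dataclass
class Target:
    """Toy controller: tracks commands it has seen and payloads it executed."""
    seen: set = field(default_factory=set)
    executed: list = field(default_factory=list)

    def receive_io(self, cmd: Command) -> None:
        self.seen.add((cmd.sqid, cmd.cid))
        self.executed.append(cmd.payload)

    def receive_abort(self, sqid: int, cid: int) -> str:
        # The abort travels on the admin queue, so it can overtake the I/O
        # command it refers to; ordering only holds inside a single queue.
        return "aborted" if (sqid, cid) in self.seen else "not found"


def abort_then_retry() -> tuple:
    target = Target()
    switch_buffer = []  # frames the faulty switch is holding back

    original = Command(sqid=1, cid=7, payload="write LBA 100")
    switch_buffer.append(original)  # the I/O command is delayed in the fabric

    # The host-side timeout fires; the abort reaches the controller first.
    abort_status = target.receive_abort(original.sqid, original.cid)

    # Upper layers retry. Assumption: the retry uses another queue and a
    # fresh command identifier.
    target.receive_io(Command(sqid=2, cid=1, payload=original.payload))

    # The switch finally forwards the delayed frame.
    while switch_buffer:
        target.receive_io(switch_buffer.pop(0))

    return abort_status, target.executed


status, executed = abort_then_retry()
print(status)    # not found
print(executed)  # ['write LBA 100', 'write LBA 100']
```

In this model the abort comes back as "not found" because the controller has not seen the command yet, and the same payload is then executed twice once the delayed frame shows up.
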
Byte,
	Johannes
-- 
Johannes Thumshirn                                          Storage
jthumshirn@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850