Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp1410264rwb; Wed, 7 Dec 2022 12:44:06 -0800 (PST) X-Google-Smtp-Source: AA0mqf7FJDSz+0NcNAhA9ZF+KqNx84mfc+ksGPPjD51GdBMuSpMMQvmlmueznffEyQ0wzVpXpiDE X-Received: by 2002:a17:902:e005:b0:189:f426:ace with SMTP id o5-20020a170902e00500b00189f4260acemr5317552plo.98.1670445845903; Wed, 07 Dec 2022 12:44:05 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670445845; cv=none; d=google.com; s=arc-20160816; b=BZMwp8RWjhGnkb7rjOadZUMgwIxet8eP4mSFo3Q7bx8b/bBg9pW3rYTZn+S+2A7PuQ xS1JpzrHB5A8xTumBQxB/pefW6fBYhFfK0HP3A72Dvn6bNiW0cczKct99uq4Kuas7seI sOF1Aje5hUupehXsmTbyIuAKrNOfK8sYhEKPr4lJtUnp9kp6qw0VOfoTcjJowkZ++1g9 4ZsXTeRKiMBUmEbXDkpuIPInA5OvjTLwjGRdjKHD3eWQRvHy1tUsX3bloVHPQ6AkrfWq mBEcyhp1r2+Dji3HC7MAHmf9IPPafHhec7P5N5hKztiiLPkccgJlbSyGHWi50BOdFmfv RtDg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=vr5WiCvTh3UIK/ZVLkYO3CxfyfQu9sqezzU1lem3qSI=; b=K91ovGzbqKz//MNqdsa6RToeow6u9P0SshzXjysP1ixnjddjJeBk/s88l2RDy8Nx4q WSlB818h5rGuizysUoFxe2dk7kNX50iDcxNa03eNO9T0VCfCYh3mWkb44RnFnGN8NAH/ GwiI9Or3lUjCwCD+MR/O/0eS2s9+kgqt+tawWd/Pg8boLHK8cTn6lc9PVOasPcDjFxCH zeca3BDAWApsB3QVtQgSa7Rz9Iqflkpo3+/PijjOoJAPZmsvLl+fLWTU/ue2/EmQ9tov l4NO3mpdxBrWviiIK0d7slMkkzST5Z0h89ZFINDCi/Irl+1+mA50JyQE8dbBwMFc/7Aa AY0g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ziepe.ca header.s=google header.b=YCUg4Ppb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id g3-20020a056a0023c300b0056c94208afcsi23641703pfc.6.2022.12.07.12.43.56; Wed, 07 Dec 2022 12:44:05 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@ziepe.ca header.s=google header.b=YCUg4Ppb; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229449AbiLGUII (ORCPT + 74 others); Wed, 7 Dec 2022 15:08:08 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37300 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229530AbiLGUIG (ORCPT ); Wed, 7 Dec 2022 15:08:06 -0500 Received: from mail-qv1-xf2e.google.com (mail-qv1-xf2e.google.com [IPv6:2607:f8b0:4864:20::f2e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 50774686AB for ; Wed, 7 Dec 2022 12:08:04 -0800 (PST) Received: by mail-qv1-xf2e.google.com with SMTP id mn15so13421473qvb.13 for ; Wed, 07 Dec 2022 12:08:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ziepe.ca; s=google; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=vr5WiCvTh3UIK/ZVLkYO3CxfyfQu9sqezzU1lem3qSI=; b=YCUg4PpbC8tWQcP/j2pX6xtXhkhIANkGRKqwa0o8UMps3pEfjeu5/R4N/tGf3nha2v CHvipOOXOXX3KqsLbyV5E6K4gQ1fvj6f2baLUE9My4QvRrN+EzRAoztf8iO0QBdU++QW VcALPMS+0FXsZdgsBC2tnHCZpqz0RoA6J145my+5ci8rOi8MjU05ML5Koaw30WwbnhtF dW1tx0K1qE5Z9Ae+uOAW7t0dpfckaNDalca944KTZRmEnnZwC6yASRPqvrNrYQBk9KCb 9vXtSy1YbzDIzj608LLFV97I8x+9AMWk3g5re44m9SF1nGl4OLuf+n6KYOJDa+cjD28u Ll1g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=vr5WiCvTh3UIK/ZVLkYO3CxfyfQu9sqezzU1lem3qSI=; b=60K6qtz4ePh0HOw5x0guS5ETiX0JXfhxfWTia8TyIJ6hmQAl6LbTSgW3dtFINti9uP AGa+mJ3iRqRn6WwyJVwsK2Upeb4/FGlAO1gza+lYRAPGW3JFR7xd7gvsFvSh6At56bnJ W1/zvwGhDtBu2RPeMkoGq6tg2vKgWwAv/n2PlRHj8ZvXxNj7IA7y9WHuzQpMJ1cw0svN 5LirKfJFOkXTzsqoElQAbrCi+go3F6cG4lnjlR+m4+b+TD7XuDwQ4Ts+DMxBuPcFmoCv /Cy+UkTyDFe90oDy137MXajVmDOlfs2wgThKpQmhv3koG6aWcwTT3qRurOl1HQr3M9I/ sohQ== X-Gm-Message-State: ANoB5plsJHMHUbFXlTwvReUp7h0ayAqvx1pFkrPu6Qvj/BzvNkO3LRGu 64l2zqPDa/d+0NHCFzHiLh7VOQ== X-Received: by 2002:a0c:f94e:0:b0:4c7:5d3a:69bf with SMTP id i14-20020a0cf94e000000b004c75d3a69bfmr16196245qvo.9.1670443683420; Wed, 07 Dec 2022 12:08:03 -0800 (PST) Received: from ziepe.ca (hlfxns017vw-47-55-122-23.dhcp-dynamic.fibreop.ns.bellaliant.net. [47.55.122.23]) by smtp.gmail.com with ESMTPSA id bl12-20020a05620a1a8c00b006fbb4b98a25sm17638764qkb.109.2022.12.07.12.08.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Dec 2022 12:08:02 -0800 (PST) Received: from jgg by wakko with local (Exim 4.95) (envelope-from ) id 1p30i2-005Luz-2M; Wed, 07 Dec 2022 16:08:02 -0400 Date: Wed, 7 Dec 2022 16:08:02 -0400 From: Jason Gunthorpe To: Christoph Hellwig Cc: Lei Rao , kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, kvm@vger.kernel.org, eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com, Konrad.wilk@oracle.com, stephen@eideticom.com, hang.yuan@intel.com Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver. Message-ID: References: <20221206165503.GA8677@lst.de> <20221207075415.GB2283@lst.de> <20221207135203.GA22803@lst.de> <20221207163857.GB2010@lst.de> <20221207183333.GA7049@lst.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20221207183333.GA7049@lst.de> X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 07, 2022 at 07:33:33PM +0100, Christoph Hellwig wrote: > On Wed, Dec 07, 2022 at 01:31:44PM -0400, Jason Gunthorpe wrote: > > > Sorry, I meant VF. Your continued using of SR-IOV teminology now keeps > > > confusing my mind so much that I start mistyping things. > > > > Well, what words do you want to use? > > The same I've used through this whole thread: controlling and > controlled function. > > > So I don't think I've learned anything more about your objection. > > > > "fundamentally broken" doesn't help > > The objection is that: > > - in hardware fundamentally only the controlling funtion can > control live migration features on the controlled function, > because the controlled function is assigned to a VM which has > control over it. Yes However hisilicon managed to do their implementation without this, or rather you could say their "controlling function" is a single MMIO BAR page in their PCI VF and their "controlled function" is the rest of the PCI VF. > - for the same reason there is no portable way to even find > the controlling function from a controlled function, unless > you want to equate PF = controlling and VF = controlled, > and even that breaks down in some corner cases As you say, the kernel must know the relationship between controlling->controlled. Nothing else is sane. If the kernel knows this information then we can find a way for the vfio_device to have pointers to both controlling and controlled objects. I have a suggestion below. > - if you want to control live migration from the controlled > VM you need a new vfio subdriver for a function that has > absolutely no new functionality itself related to live > migration (quirks for bugs excepted). I see it differently, the VFIO driver *is* the live migration driver. Look at all the drivers that have come through and they are 99% live migration code. They have, IMHO, properly split the live migration concern out of their controlling/PF driver and placed it in the "VFIO live migration driver". We've done a pretty good job of allowing the VFIO live migration driver to pretty much just focus on live migration stuff and delegate the VFIO part to library code. Excepting quirks and bugs sounds nice, except we actually can't ignore them. Having all the VFIO capabilities is exactly how we are fixing the quirks and bugs in the first place, and I don't see your vision how we can continue to do that if we split all the live migration code into yet another subsystem. For instance how do I trap FLR like mlx5 must do if the drivers/live_migration code cannot intercept the FLR VFIO ioctl? How do I mangle and share the BAR like hisilicon does? Which is really why this is in VFIO in the first place. It actually is coupled in practice, if not in theory. > So by this architecture you build a convoluted mess where you need > tons of vfio subdrivers that mostly just talk to the driver for the > controlling function, which they can't even sanely discover. That's > what I keep calling fundamentally broken. The VFIO live migration drivers will look basically the same if you put them under drivers/live_migration. This cannot be considered a "convoluted mess" as splitting things by subsystem is best-practice, AFAIK. If we accept that drivers/vfio can be the "live migration subsystem" then lets go all the way and have the controlling driver to call vfio_device_group_register() to create the VFIO char device for the controlled function. This solves the "sanely discover" problem because of course the controlling function driver knows what the controlled function is and it can acquire both functions before it calls vfio_device_group_register(). This is actually what I want to do anyhow for SIOV-like functions and VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I really dislike that our current SRIOV model in Linux forces the VF to instantly exist without a chance for the controlling driver to provision it. We have some challenges on how to do this in the kernel, but I think we can overcome them. VFIO is ready for this thanks to all the restructuring work we already did. I'd really like to get away from VFIO having to do all this crazy sysfs crap to activate its driver. I think there is a lot of appeal to having, say, a nvmecli command that just commands the controlling driver to provision a function, enable live migration, configure it and then make it visible via VFIO. The same API regardless if the underlying controlled function technology is PF/VF/SIOV. At least we have been very interested in doing that for networking devices. Jason