Date: Mon, 12 Dec 2022 08:50:46 +0100
From: Christoph Hellwig
To: Jason Gunthorpe
Cc: Christoph Hellwig, Lei Rao, kbusch@kernel.org, axboe@fb.com,
	kch@nvidia.com, sagi@grimberg.me, alex.williamson@redhat.com,
	cohuck@redhat.com, yishaih@nvidia.com,
	shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com,
	mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org,
	linux-nvme@lists.infradead.org, kvm@vger.kernel.org,
	eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com,
	Konrad.wilk@oracle.com, stephen@eideticom.com, hang.yuan@intel.com
Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver.
Message-ID: <20221212075046.GB11162@lst.de>

On Wed, Dec 07, 2022 at 04:08:02PM -0400, Jason Gunthorpe wrote:
> However hisilicon managed to do their implementation without this, or
> rather you could say their "controlling function" is a single MMIO BAR
> page in their PCI VF and their "controlled function" is the rest of
> the PCI VF.

Eww. So you need to carefully filter the BAR and can't actually do
any DMA at all? I'm not sure that is actually practical, especially
not for something with a lot of state.

> If the kernel knows this information then we can find a way for the
> vfio_device to have pointers to both controlling and controlled
> objects. I have a suggestion below.

So now we need to write a vfio shim for every function even if there
is absolutely nothing special about that function? Migrating really
is the controlling function's behavior, and writing a new vfio bit for
every controlled thing just does not scale.

> I see it differently, the VFIO driver *is* the live migration
> driver. Look at all the drivers that have come through and they are
> 99% live migration code.

Yes, and that's the problem, because they are associated with the
controlled function, and now we have a communication problem between
that vfio driver binding to the controlled function and the driver
that actually controls live migration that is associated with the
controlling function.
In other words: you've created a giant mess.

> Excepting quirks and bugs sounds nice, except we actually can't ignore
> them.

I'm not proposing to ignore them. But they should not be needed most
of the time.

> For instance how do I trap FLR like mlx5 must do if the
> drivers/live_migration code cannot intercept the FLR VFIO ioctl?
>
> How do I mangle and share the BAR like hisilicon does?
>
> Which is really why this is in VFIO in the first place. It actually is
> coupled in practice, if not in theory.

So you've created a long term userspace API around working around
buggy and/or misdesigned early designs and now want to force it down
everyone's throat?

Can we please take a step back and think about how things should work,
and only then think how to work around weirdo devices that do strange
things as a second step?

> If we accept that drivers/vfio can be the "live migration subsystem"
> then lets go all the way and have the controlling driver to call
> vfio_device_group_register() to create the VFIO char device for the
> controlled function.

While creating the VFs from the PF driver makes a lot more sense,
remember that vfio is absolutely not the only use case for VFs.
There are plenty of use cases where you want to use them with the
normal kernel driver as well. So the interface to create VFs needs a
way to decide if it should be vfio exported, or use the normal kernel
binding.

> This solves the "sanely discover" problem because of course the
> controlling function driver knows what the controlled function is and
> it can acquire both functions before it calls
> vfio_device_group_register().

Yes.

> This is actually what I want to do anyhow for SIOV-like functions and
> VFIO. Doing it for PCI VFs (or related PFs) is very nice symmetry. I
> really dislike that our current SRIOV model in Linux forces the VF to
> instantly exist without a chance for the controlling driver to
> provision it.

For SIOV you have no other choice anyway.
But I agree that it is the right thing to do for VFIO. Now the next
step is to control live migration from the controlling function, so
that for most sane devices the controlled device does not need all the
pointless boilerplate of its own vfio driver.

> I'd really like to get away from VFIO having to do all this crazy
> sysfs crap to activate its driver. I think there is a lot of appeal to
> having, say, a nvmecli command that just commands the controlling
> driver to provision a function, enable live migration, configure it
> and then make it visible via VFIO. The same API regardless if the
> underlying controlled function technology is PF/VF/SIOV.

Yes.