Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp897775rwb; Wed, 7 Dec 2022 06:23:02 -0800 (PST) X-Google-Smtp-Source: AA0mqf6/V5L16mdVbWpeK4hw+3Fn16oOx9VzrvqRPs1lgnVXpIpWxG+IfmL8fgIZW4W8shnqUx54 X-Received: by 2002:a05:6402:550b:b0:45f:9526:e35a with SMTP id fi11-20020a056402550b00b0045f9526e35amr82725557edb.256.1670422981353; Wed, 07 Dec 2022 06:23:01 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670422981; cv=none; d=google.com; s=arc-20160816; b=jYjxZNQo/+APnT95fv1VSitaM4C/A7b+hTEzbDeuzt45w2jpHwOxFvP0R8Zg+relHe K3UxohwxR8aAohSUh3WXNtbpUiAbzmCoB+xEYUpj4W96nvQLamu2MySO3Zp0JxR6MSOY c6s+U747tGlVxSeoqaRSxDgnnba1+knUTeGzz+qD4BH5YWJ3Ls/HG4UnhqBwDZDMwCQM PcZG64jwGb28W0fd/erGBCMx10rmNp6AF7D6ITG5+Z7zZQgwHzhkUFEPbNOIIOJMwsbv HJpAVS8w7byozJkbfT4T9e1Au4R3Lpo0IIMYIsjAXzkSrivHLHYt+reh+TSBjjo/i4VX ZCVA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=l6xJr+k1Rglf8MP4WV7CssKD0Zhdk6N8F5ziwJ65c+E=; b=MSWUNg4713PED5jnHR/REdFVwDRWGk26lLfv45wtVRnxhNC++XbxJmCfCrMOdkdZV0 3bySMxZ4WLVnETJUDmcMan8/N02WkQ8+amDSHiXH17U9/xFZw15qDMw7Qcae1GDmIsgK 6BWq5bKL4KKZRUbLyAMBuAlTFT9Q0V5hc4p80/MTqJc2w94v2vIWjYEMzyttzkTELBww SFjFWQLWB7Tl7MX6CBhPNLSKGHecRPCJ1CYjnbMoE5qmKd4D8T8qgGp3K0aKwttdIRkX 4JUA8CCL+PO2L/b/RZ8XmaZTDl7GwsA/lne2CS+rfRJqgH404ZCownWAabHgz0BbczM0 CX4g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ziepe.ca header.s=google header.b=WHJVaTga; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id z15-20020a17090655cf00b007af039d0bcasi12315228ejp.429.2022.12.07.06.22.42; Wed, 07 Dec 2022 06:23:01 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@ziepe.ca header.s=google header.b=WHJVaTga; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229868AbiLGNee (ORCPT + 75 others); Wed, 7 Dec 2022 08:34:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35458 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230036AbiLGNeS (ORCPT ); Wed, 7 Dec 2022 08:34:18 -0500 Received: from mail-qt1-x82f.google.com (mail-qt1-x82f.google.com [IPv6:2607:f8b0:4864:20::82f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3DCAF59870 for ; Wed, 7 Dec 2022 05:34:17 -0800 (PST) Received: by mail-qt1-x82f.google.com with SMTP id cg5so16088310qtb.12 for ; Wed, 07 Dec 2022 05:34:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ziepe.ca; s=google; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=l6xJr+k1Rglf8MP4WV7CssKD0Zhdk6N8F5ziwJ65c+E=; b=WHJVaTgaT4PNH+QudjAp44gzIj2kA06/emtbgyqhca7ByTe7tRve3x+VbVfVY4H3nL doA/AKmFKI4LZGyBGKoYAZTt9HD63XICxmeh7io6fkDoKx5ATTsJ/QlgDbj0wkmQ8Zpt 46qN8QrBTRR+DeKKtFTFRYdyKEKshbig7w0o+6L3IL0BNU0DFKi/nBQm3QMcQ4LdotZj WFriWBSrr5zvlcsyxwrJG5X68S8Oz4KrbF6OnCq6u3v6FgPk9KnzGgYlOfajKFE2r4U4 w/7bA1tjMT3RPbe2OZPgArHAAkZQE++UZ5V/F+8PhthPBl6wDBtZwI92D5oUpsZkKH3g Ocyw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=l6xJr+k1Rglf8MP4WV7CssKD0Zhdk6N8F5ziwJ65c+E=; b=w6KVrSPClGoVPLSBDFWHlqTR/kHSPvLsFSoNTuBl4hyY4cxYJRDf6X0oUXlbWtQLlt DCQWb45T1WytP/sYBIOfSj+8iMdmehBy6FnTCu+woJd0ToA8k5mpGfKFGSQyWAGk2Qmv /U6aVk73rCM0kdZivBbGihNzKm9TqzQmEnDb6yJDq+1nMjzjoA/N0lu2bCIoM7YzDQsc pkX0xmHP3Ku8faiaXd1gBlBzMNpKNAezihOL0FaBJixhmGkdNGYFQTbUaxRll5klI53Z t9DSwGqJNMJm3QSNsBDCJBN3BVTfDXuv89LVYdMeeJAosXTuDmdGHm/r5ZYT3fr0xfi9 uvpg== X-Gm-Message-State: ANoB5pmQNc5QGbMoyg4QUFl/jkVmfq9BzpTChjGJ+ie+j4vpPuxScm1W 6bfstmGktTQQR+Xxqo4pxZsyOQ== X-Received: by 2002:ac8:548:0:b0:3a5:2fd6:cf82 with SMTP id c8-20020ac80548000000b003a52fd6cf82mr83999346qth.62.1670420056314; Wed, 07 Dec 2022 05:34:16 -0800 (PST) Received: from ziepe.ca (hlfxns017vw-47-55-122-23.dhcp-dynamic.fibreop.ns.bellaliant.net. [47.55.122.23]) by smtp.gmail.com with ESMTPSA id z18-20020ac87112000000b0039853b7b771sm13614698qto.80.2022.12.07.05.34.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 07 Dec 2022 05:34:15 -0800 (PST) Received: from jgg by wakko with local (Exim 4.95) (envelope-from ) id 1p2uYw-005CTN-Gj; Wed, 07 Dec 2022 09:34:14 -0400 Date: Wed, 7 Dec 2022 09:34:14 -0400 From: Jason Gunthorpe To: Christoph Hellwig Cc: Lei Rao , kbusch@kernel.org, axboe@fb.com, kch@nvidia.com, sagi@grimberg.me, alex.williamson@redhat.com, cohuck@redhat.com, yishaih@nvidia.com, shameerali.kolothum.thodi@huawei.com, kevin.tian@intel.com, mjrosato@linux.ibm.com, linux-kernel@vger.kernel.org, linux-nvme@lists.infradead.org, kvm@vger.kernel.org, eddie.dong@intel.com, yadong.li@intel.com, yi.l.liu@intel.com, Konrad.wilk@oracle.com, stephen@eideticom.com, hang.yuan@intel.com Subject: Re: [RFC PATCH 1/5] nvme-pci: add function nvme_submit_vf_cmd to issue admin commands for VF driver. Message-ID: References: <20221206055816.292304-2-lei.rao@intel.com> <20221206061940.GA6595@lst.de> <20221206135810.GA27689@lst.de> <20221206153811.GB2266@lst.de> <20221206165503.GA8677@lst.de> <20221207075415.GB2283@lst.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20221207075415.GB2283@lst.de> X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 07, 2022 at 08:54:15AM +0100, Christoph Hellwig wrote: > On Tue, Dec 06, 2022 at 03:15:41PM -0400, Jason Gunthorpe wrote: > > What the kernel is doing is providing the abstraction to link the > > controlling function to the VFIO device in a general way. > > > > We don't want to just punt this problem to user space and say 'good > > luck finding the right cdev for migration control'. If the kernel > > struggles to link them then userspace will not fare better on its own. > > Yes. But the right interface for that is to issue the userspace > commands for anything that is not normal PCIe function level > to the controlling funtion, and to discover the controlled functions > based on the controlling functions. I don't think we should mix up how the HW works, what PCI function the commands are executed at, with how the uAPI is designed. The VFIO design assumes that the "vfio migration driver" will talk to both functions under the hood, and I don't see a fundamental problem with this beyond it being awkward with the driver core. It is done this way deliberately. When we did the work on it we found there are problems with some device models, like when you suspend them and then wrongly MMIO touch them they can trigger AER and maybe even a machine check crash. One of the roles of the VFIO driver is to fix these HW bugs and make the uAPI safe. Eg by revoking mmaps or whatever. Even more importantly, we do not want migration to ever be operated unless VFIO is in control of the device. In general, migration resume basically allows userspace to command the device to do effectively any DMA. This is a kernel security problem if the device is not constrained by a user iommu_domain - for security we must not allow userspace to resume a VF that is under kernel control and potentially linked to an passthrough iommu_domain. VFIO provides the security model to make all of this safe - the two concepts cannot become disconnected. Even if we create a new migration uAPI it just means that the nvme driver has to be awkwardly aware of VFIO VF drivers and interlock with their state, and the uAPI is useless unless you already have a VFIO FD open. Even the basic assumption that there would be a controlling/controlled relationship is not universally true. The mdev type drivers, and SIOV-like devices are unlikely to have that. Once you can use PASID the reasons to split things at the HW level go away, and a VF could certainly self-migrate. VFIO's goal is to abstract all of the above stuff. You get one char device that inherently provides the security guarentees required to operate migration. It provides all the necessary hooks to fix up HW issues, which so far every merged device has: - mlx5 has a weird issue where FLR on the VF resets the migration context, we fix that in the VFIO driver (in mlx5 even though the commands are issued via the PF they are logically executed inside the VF context) - hi-silicon has security problems because it doesn't have controlling/controlled, so it needs to carve out BAR MMIO maps and other stuff So while I agree that, in principle, migration and the VFIO VF are seperate concerns our broken reality makes them linked. Even the idea that started this thread - that PF/PF could be a problem - seems to have been explored by the Pensando RFC which is using the aux devices to connect arbitrary PF/PF for their model. The logical model we have been using on complex multi-function devices (like every DPU driver) has been to split the driver into subsystem local code and thread all the pieces together with aux devices. The model has a PCI driver that operates the lowest level of the device, eg the 'admin queue' and then aux devices create subsystem-local drivers (netdev, rdma, vdpa, iscsi, vfio, etc, etc) that ride on the common API exported by the PCI driver. So, when you see both Intel and Pensando proposing this kind of layered model for NVMe where migration is subsystem-local to VFIO, I think this is where the inspiration is coming from. Their native DPU drivers already work this way. Jason