Date: Thu, 28 Apr 2022 14:12:13 -0500
From: Bjorn Helgaas
To: Jake Oshins
Cc: Dexuan Cui, Lorenzo Pieralisi, bhelgaas@google.com, Alex Williamson,
    wei.liu@kernel.org, KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
    linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
    linux-kernel@vger.kernel.org, Michael Kelley (LINUX), robh@kernel.org,
    kw@linux.com, kvm@vger.kernel.org
Subject: Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
Message-ID: <20220428191213.GA36573@bhelgaas>
In-Reply-To:

On Tue, Apr 26, 2022 at 07:25:43PM +0000, Jake Oshins wrote:
> > -----Original Message-----
> > From: Dexuan Cui
> > Sent: Tuesday, April 26, 2022 11:32 AM
> > To: Lorenzo Pieralisi
> > Cc: Jake Oshins; Bjorn Helgaas; bhelgaas@google.com; Alex Williamson;
> > wei.liu@kernel.org; KY Srinivasan; Haiyang Zhang; Stephen Hemminger;
> > linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Michael Kelley (LINUX); robh@kernel.org;
> > kw@linux.com; kvm@vger.kernel.org
> > Subject: RE: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce
> > VM boot time
> >
> > > From: Lorenzo Pieralisi
> > > Sent: Tuesday, April 26, 2022 9:45 AM
> > > > ...
> > > > Sorry I don't quite follow. pci-hyperv allocates MMIO for the bridge
> > > > window in hv_pci_allocate_bridge_windows() and registers the MMIO
> > > > ranges to the core PCI driver via pci_add_resource(), and later the
> > > > core PCI driver probes the bus/device(s), validates the BAR sizes
> > > > and the pre-initialized BAR values, and uses the BAR configuration.
> > > > IMO the whole process doesn't require the bit PCI_COMMAND_MEMORY to
> > > > be pre-set, and there should be no issue with delaying setting the
> > > > bit until a PCI device's .probe() -> pci_enable_device().
> > >
> > > IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
> > > (otherwise the PCI core would toggle it on and off for e.g. BAR sizing).
> > >
> > > Is that correct?
> >
> > Yes, that's the exact purpose of this patch.
> >
> > Do you see any potential architectural issue with the patch?
>
> From my reading of the core PCI code, it looks like this should be safe.

I don't know much about Hyper-V, but in general I don't think the PCI
core should turn on PCI_COMMAND_MEMORY at all unless a driver requests
it.  I assume that if a guest OS depends on PCI_COMMAND_MEMORY being
set, guest firmware would take care of setting it.
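To make the probe-time sequence concrete, below is a minimal sketch of a
generic PCI device driver's probe path.  It is not code from the patch
under discussion, and the driver name and example_probe() are invented;
the point is only that pci_enable_device() is where PCI_COMMAND_MEMORY
finally gets turned on, so the bridge driver should not need to pre-set
it just so the core can validate BARs:

/*
 * Minimal sketch of a generic PCI device driver probe (hypothetical
 * driver; example_probe() is not from any existing driver and is not
 * part of the patch being discussed).  pci_enable_device() is the call
 * that finally sets PCI_COMMAND_MEMORY for the device.
 */
#include <linux/pci.h>

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        void __iomem *regs;
        int rc;

        rc = pci_enable_device(pdev);        /* memory decoding enabled here */
        if (rc)
                return rc;

        rc = pci_request_regions(pdev, "example");
        if (rc)
                goto err_disable;

        regs = pci_ioremap_bar(pdev, 0);     /* map BAR 0 */
        if (!regs) {
                rc = -ENOMEM;
                goto err_release;
        }

        pci_set_master(pdev);
        pci_set_drvdata(pdev, regs);
        return 0;

err_release:
        pci_release_regions(pdev);
err_disable:
        pci_disable_device(pdev);
        return rc;
}

If that reading is right, nothing enables decoding until some driver
binds, which is the behavior the patch relies on.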
> > Jake has some concerns that I don't quite follow.
> > @Jake, could you please explain the concerns with more details?
>
> First, let me say that I really don't know whether this is an issue.
> I know it's an issue with other operating system kernels.  I'm
> curious whether the Linux kernel / Linux PCI driver would behave in
> a way that has an issue here.
>
> The VM has a window of address space into which it chooses to put
> PCI devices' BARs.  The guest OS will generally use the value that
> is already in the BAR, by default, but it can theoretically place
> the device in any free address space.  The subset of the VM's memory
> address space which can be populated by devices' BARs is finite, and
> generally not particularly large.
>
> Imagine a VM that is configured with 25 NVMe controllers, each of
> which requires 64KiB of address space.  (This is just an example.)
> At first boot, all of these NVMe controllers are packed into address
> space, one after the other.
>
> While that VM is running, one of the 25 NVMe controllers fails and
> is replaced with an NVMe controller from a separate manufacturer,
> but this one requires 128KiB of memory, for some reason.  Perhaps it
> implements the "controller buffer" feature of NVMe.  It doesn't fit
> in the hole that was vacated by the failed NVMe controller, so it
> needs to be placed somewhere else in address space.  This process
> continues over months, with several more failures and replacements.
> Eventually, the address space is very fragmented.
>
> At some point, there is an attempt to place an NVMe controller into
> the VM but there is no contiguous block of address space free which
> would allow that NVMe controller to operate.  There is, however,
> enough total address space if the other, currently functioning, NVMe
> controllers are moved from the address space that they are using to
> other ranges, consolidating their usage and reducing fragmentation.
> Let's call this a rebalancing of memory resources.
>
> When the NVMe controllers are moved, a new value is written into
> their BAR.  In general, the PCI spec would require that you clear
> the memory enable bit in the command register (PCI_COMMAND_MEMORY)
> during this move operation, both so that there's never a moment when
> two devices are occupying the same address space and because writing
> a 64-bit BAR atomically isn't possible.  This is the reason that I
> originally wrote the code in this driver to unmap the device from
> the VM's address space when the memory enable bit is cleared.
>
> What I don't know is whether this sequence of operations can ever
> happen in Linux, or perhaps in a VM running Linux.  Will it
> rebalance resources in order to consolidate address space?  If it
> will, will this involve clearing the memory enable bit to ensure
> that two devices never overlap?

This sequence definitely can occur in Linux, but it hasn't yet become
a real priority.  We do already have issues with assigning space for
hot-added devices in general, especially if firmware hasn't assigned
large windows to things like Thunderbolt controllers.  I suspect that
we have or will soon have issues where resource assignment starts
failing after a few hotplugs, e.g., dock/undock events.

There have been patches posted to rebalance resources (quiesce
drivers, reassign, restart drivers), but they haven't gone anywhere
yet for lack of interest and momentum.  I do feel like we're on the
tracks and the train is coming, though ;)

Bjorn
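For reference, the BAR move Jake describes corresponds roughly to the
sequence below at the config-space level.  This is a sketch only, not
code from pci-hyperv or the PCI core; move_mem_bar64() and its
parameters are invented here, and a real rebalance would also have to
quiesce the driver and update the kernel's resource tree first:

#include <linux/pci.h>

/*
 * Sketch of relocating a device's 64-bit memory BAR.  The two config
 * dwords cannot be written atomically, so memory decoding must be off
 * for the duration of the update -- the window in which pci-hyperv
 * would unmap the device from the guest's address space.
 */
static void move_mem_bar64(struct pci_dev *pdev, int bar, u64 new_addr)
{
        int reg = PCI_BASE_ADDRESS_0 + bar * 4;
        u16 cmd;

        /* 1. Stop the device from decoding its current range. */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

        /* 2. Rewrite the BAR, low dword then high dword. */
        pci_write_config_dword(pdev, reg, (u32)new_addr);
        pci_write_config_dword(pdev, reg + 4, (u32)(new_addr >> 32));

        /* 3. Re-enable memory decoding at the new address. */
        pci_write_config_word(pdev, PCI_COMMAND, cmd | PCI_COMMAND_MEMORY);
}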