Date: Wed, 27 Apr 2022 16:41:47 -0600
From: Alex Williamson
To: Jake Oshins
Cc: Dexuan Cui, Lorenzo Pieralisi, Bjorn Helgaas, "bhelgaas@google.com",
 "wei.liu@kernel.org", KY Srinivasan, Haiyang Zhang, Stephen Hemminger,
 "linux-hyperv@vger.kernel.org", "linux-pci@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "Michael Kelley (LINUX)",
 "robh@kernel.org", "kw@linux.com", "kvm@vger.kernel.org"
Subject: Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot time
Message-ID: <20220427164147.330a0bc8.alex.williamson@redhat.com>
In-Reply-To:
References:
Organization: Red Hat

On Tue, 26 Apr 2022 19:25:43 +0000
Jake Oshins wrote:

> > -----Original Message-----
> > From: Dexuan Cui
> > Sent: Tuesday, April 26, 2022 11:32 AM
> > To: Lorenzo Pieralisi
> > Cc: Jake Oshins; Bjorn Helgaas; bhelgaas@google.com; Alex Williamson;
> > wei.liu@kernel.org; KY Srinivasan; Haiyang Zhang; Stephen Hemminger;
> > linux-hyperv@vger.kernel.org; linux-pci@vger.kernel.org;
> > linux-kernel@vger.kernel.org; Michael Kelley (LINUX); robh@kernel.org;
> > kw@linux.com; kvm@vger.kernel.org
> > Subject: RE: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce
> > VM boot time
> >
> > > From: Lorenzo Pieralisi
> > > Sent: Tuesday, April 26, 2022 9:45 AM
> > > > ...
> > > > Sorry, I don't quite follow. pci-hyperv allocates MMIO for the
> > > > bridge window in hv_pci_allocate_bridge_windows() and registers
> > > > the MMIO ranges to the core PCI driver via pci_add_resource(),
> > > > and later the core PCI driver probes the bus/device(s), validates
> > > > the BAR sizes and the pre-initialized BAR values, and uses the
> > > > BAR configuration.  IMO the whole process doesn't require the bit
> > > > PCI_COMMAND_MEMORY to be pre-set, and there should be no issue in
> > > > delaying setting the bit to a PCI device's .probe() ->
> > > > pci_enable_device().
> > >
> > > IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
> > > (otherwise the PCI core would toggle it on and off, e.g. for BAR
> > > sizing).
> > >
> > > Is that correct?
> >
> > Yes, that's the exact purpose of this patch.
> >
> > Do you see any potential architectural issue with the patch?
> > From my reading of the core PCI code, it looks like this should be
> > safe.
> >
> > Jake has some concerns that I don't quite follow.
> > @Jake, could you please explain the concerns in more detail?
>
> First, let me say that I really don't know whether this is an issue.
> I know it's an issue with other operating system kernels.  I'm curious
> whether the Linux kernel / Linux PCI driver would behave in a way that
> has an issue here.
>
> The VM has a window of address space into which it chooses to put PCI
> devices' BARs.  The guest OS will generally pick the value that is
> already in the BAR, by default, but it can theoretically place the
> device in any free address space.  The subset of the VM's memory
> address space which can be populated by devices' BARs is finite, and
> generally not particularly large.

AIUI, if the firmware has programmed the BAR addresses, Linux will
generally try to leave them alone; it's only for unprogrammed devices,
or when using the pci=realloc option, that we'll try to shuffle things
around.

If you talk to bare metal system firmware developers, you might find
disagreement regarding whether the OS or the system firmware owns the
device address space, which I believe also factors into our handling of
the memory space enable bit of the command register.  Minimally, system
firmware is required to allocate resources for, and enable, boot
devices, and often these are left enabled after the hand-off to the OS.
This might include some peripherals; for instance, legacy emulation of
a USB keyboard might leave the USB host controller enabled.  There are
also more devious use cases, where there might be device monitoring
running across the bus under the OS, perhaps via SMI or other means,
which could theoretically break if we start moving devices around.
However, I don't really see any obvious problems with your proposal
that we simply leave the memory enable bit in its hand-off state.
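For reference, here is a minimal sketch of where memory decoding ends
up being enabled under the behaviour Dexuan describes above: the device
is left with PCI_COMMAND_MEMORY clear until a driver binds and its
.probe() calls pci_enable_device().  This is a generic driver skeleton
for illustration only (not code from pci-hyperv or from the patch); the
foo_* names are placeholders.

#include <linux/module.h>
#include <linux/pci.h>

static int foo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        void __iomem *regs;
        int rc;

        /* Turns on I/O and memory decoding (PCI_COMMAND_MEMORY) as needed. */
        rc = pci_enable_device(pdev);
        if (rc)
                return rc;

        /* Claim the BAR resources so nothing else can grab them. */
        rc = pci_request_regions(pdev, "foo");
        if (rc)
                goto err_disable;

        /* Map BAR 0 into the kernel's virtual address space. */
        regs = pci_iomap(pdev, 0, 0);
        if (!regs) {
                rc = -ENOMEM;
                goto err_release;
        }

        pci_set_master(pdev);   /* enable bus mastering if the device does DMA */
        return 0;

err_release:
        pci_release_regions(pdev);
err_disable:
        pci_disable_device(pdev);
        return rc;
}

Nothing in that flow relies on the bit having been pre-set before
probe, which is the core of the argument above.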
> Imagine a VM that is configured with 25 NVMe controllers, each of
> which requires 64KiB of address space.  (This is just an example.)
> At first boot, all of these NVMe controllers are packed into address
> space, one after the other.
>
> While that VM is running, one of the 25 NVMe controllers fails and is
> replaced with an NVMe controller from a separate manufacturer, but
> this one requires 128KiB of memory, for some reason.  Perhaps it
> implements the "controller buffer" feature of NVMe.  It doesn't fit
> in the hole that was vacated by the failed NVMe controller, so it
> needs to be placed somewhere else in address space.  This process
> continues over months, with several more failures and replacements.
> Eventually, the address space is very fragmented.
>
> At some point, there is an attempt to place an NVMe controller into
> the VM but there is no contiguous block of address space free which
> would allow that NVMe controller to operate.  There is, however,
> enough total address space if the other, currently functioning, NVMe
> controllers are moved from the address space that they are using to
> other ranges, consolidating their usage and reducing fragmentation.
> Let's call this a rebalancing of memory resources.
>
> When the NVMe controllers are moved, a new value is written into
> their BARs.  In general, the PCI spec would require that you clear the
> memory enable bit in the command register (PCI_COMMAND_MEMORY) during
> this move operation, both so that there's never a moment when two
> devices are occupying the same address space and because writing a
> 64-bit BAR atomically isn't possible.  This is the reason that I
> originally wrote the code in this driver to unmap the device from the
> VM's address space when the memory enable bit is cleared.
>
> What I don't know is whether this sequence of operations can ever
> happen in Linux, or perhaps in a VM running Linux.  Will it rebalance
> resources in order to consolidate address space?  If it will, will
> this involve clearing the memory enable bit to ensure that two
> devices never overlap?

Once the OS is running and drivers are attached to devices, any
reshuffling of resources for those devices would require coordination
with the driver to release the resources and reprogram them.  Even if
an atomic update of the BAR were possible, that can't account for
possible in-flight use cases, such as p2p DMA.
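To make the ordering Jake describes concrete, a move of a 64-bit BAR
would look roughly like the following at the config-space level.  This
is a hedged sketch using the kernel's generic config accessors, not
code from the PCI core or from pci-hyperv; it omits quiescing the
driver, saving and restoring the rest of the device state, and any
locking.

#include <linux/kernel.h>
#include <linux/pci.h>

static void move_64bit_bar0(struct pci_dev *pdev, u64 new_addr)
{
        u16 cmd;

        /*
         * Disable memory decode so the device never responds at a
         * half-updated address and never overlaps another device.
         */
        pci_read_config_word(pdev, PCI_COMMAND, &cmd);
        pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

        /* A 64-bit BAR spans two dwords, so the update cannot be atomic. */
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_0,
                               lower_32_bits(new_addr));
        pci_write_config_dword(pdev, PCI_BASE_ADDRESS_1,
                               upper_32_bits(new_addr));

        /*
         * Restore the command register, re-enabling memory decode only
         * if it was enabled before the move.
         */
        pci_write_config_word(pdev, PCI_COMMAND, cmd);
}

Whether the Linux PCI core would ever perform such a move on a live,
driver-bound device is exactly the open question in this part of the
thread.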
There were a couple sessions from the 2019 Linux Plumbers conference
that might be useful to review regarding these issues.  IIRC the
first[1] was specifically looking at whether we could do partial BAR
allocations for NVMe devices, where we might have functionality but
reduced performance or features with a partial mapping.  In your
example, perhaps we're replacing a device with one that has twice the
BAR space, but is functional with only half that, so we can slide it
into the same slot as the previous device.  This would likely mean
enlightening the PCI core with device or class specific information.
I've not followed whether anything occurred here.

The second[2] (next session, same recording) discusses problems around
resource allocation and dynamic reallocation.  Again, I haven't
followed further discussions here, but I don't expect much has changed.

Thanks,
Alex

[1] https://youtu.be/ozlQ1XQreac?list=PLVsQ_xZBEyN1PDehCCAiztGf45K_D6txS&t=6481
[2] https://youtu.be/ozlQ1XQreac?list=PLVsQ_xZBEyN1PDehCCAiztGf45K_D6txS&t=7980