Date: Wed, 3 Jan 2024 17:24:26 -0700
From: Alex Williamson
To: Jason Gunthorpe
Cc: Ankit Agrawal, Yishai Hadas, shameerali.kolothum.thodi@huawei.com,
 kevin.tian@intel.com, eric.auger@redhat.com, brett.creeley@amd.com,
 horms@kernel.org, Aniket Agashe, Neo Jia, Kirti Wankhede,
 "Tarun Gupta (SW-GPU)", Vikram Sethi, Andy Currid, Alistair Popple,
 John Hubbard, Dan Williams, "Anuj Aggarwal (SW-GPU)", Matt Ochs,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v15 1/1] vfio/nvgrace-gpu: Add vfio pci variant module
 for grace hopper
Message-ID: <20240103172426.0a1f4ae6.alex.williamson@redhat.com>
In-Reply-To: <20240103193356.GU50406@nvidia.com>
References: <20231217191031.19476-1-ankita@nvidia.com>
 <20231218151717.169f80aa.alex.williamson@redhat.com>
 <20240102091001.5fcc8736.alex.williamson@redhat.com>
 <20240103165727.GQ50406@nvidia.com>
 <20240103110016.5067b42e.alex.williamson@redhat.com>
 <20240103193356.GU50406@nvidia.com>

On Wed, 3 Jan 2024 15:33:56 -0400
Jason Gunthorpe wrote:

> On Wed, Jan 03, 2024 at 11:00:16AM -0700, Alex Williamson wrote:
> > On Wed, 3 Jan 2024 12:57:27 -0400
> > Jason Gunthorpe wrote:
> >
> > > On Tue, Jan 02, 2024 at 09:10:01AM -0700, Alex Williamson wrote:
> > >
> > > > Yes, it's possible to add support that these ranges honor the memory
> > > > enable bit, but it's not trivial and unfortunately even vfio-pci isn't
> > > > a great example of this.
> > >
> > > We talked about this already, the HW architects here confirm there is
> > > no issue with reset and memory enable. You will get all 1's on read
> > > and NOP on write. It doesn't need to implement VMA zap.
> >
> > We talked about reset; I don't recall that we discussed whether the
> > coherent and uncached memory ranges masquerading as PCI BARs here
> > honor the memory enable bit in the command register.
>
> Why does it need to do anything special? If the VM reads or writes
> memory while the master bit is disabled it gets undefined behavior.
> The system doesn't crash and it does something reasonable.

The behavior is actually defined (PCIe 6.0.1, Table 7-4):

  Memory Space Enable - Controls a Function's response to Memory Space
  accesses. When this bit is Clear, all received Memory Space accesses
  are caused to be handled as Unsupported Requests. When this bit is
  Set, the Function is enabled to decode the address and further
  process Memory Space accesses.

From there we get into system error handling decisions, where some
platforms claim to protect data integrity by generating a fault before
allowing drivers to consume the UR response and others are more
lenient. The former platforms are the primary reason that we guard
access to the device when the memory enable bit is cleared.

It seems we've also already opened the door somewhat given our VF
behavior, where we test pci_dev.no_command_memory to allow access
regardless of the state of the emulated command register bit.
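For reference, that test amounts to roughly the following (paraphrasing
from memory rather than quoting the current vfio-pci code, so treat it
as a sketch):

  /*
   * Memory space is considered enabled if the device is a VF whose
   * command register doesn't implement the memory enable bit
   * (pdev->no_command_memory), or if PCI_COMMAND_MEMORY is set in
   * the emulated command register.
   */
  static bool vdev_memory_enabled(struct vfio_pci_core_device *vdev)
  {
          u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);

          return vdev->pdev->no_command_memory ||
                 (cmd & PCI_COMMAND_MEMORY);
  }

ie. for a VF the bit is purely virtual and we already let accesses
through regardless of what the guest wrote.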
AIUI, the address space enable bits are primarily to prevent the
device from decoding accesses during BAR sizing operations or prior to
BAR programming. BAR sizing operations are purely emulated in vfio-pci
and unprogrammed BARs are ignored (ie. not exposed to userspace), so
perhaps as long as it can be guaranteed that an access with the
address space enable bit cleared cannot generate a system level fault,
we actually have no strong argument to strictly enforce the address
space bits.

> > > > around device reset or relative to the PCI command register. The
> > > > variant driver becomes a trivial implementation that masks BARs 2 & 4
> > > > and exposes the ACPI range as a device specific region with only mmap
> > > > support. QEMU can then map the device specific region into VM memory
> > > > and create an equivalent ACPI table for the guest.
> > >
> > > Well, no, probably not. There is an NVIDIA specification for how the
> > > vPCI function should be setup within the VM and it uses the BAR
> > > method, not the ACPI.
> >
> > Is this specification available? It's a shame we've gotten this far
> > without a reference to it.
>
> No, at this point it is internal short form only.
>
> > > There are a lot of VMMs and OSs this needs to support so it must all
> > > be consistent. For better or worse the decision was taken for the vPCI
> > > spec to use BAR not ACPI, in part due to feedback from the broader VMM
> > > ecosystem, and informed by future product plans.
> > >
> > > So, if vfio does special regions then qemu and everyone else has to
> > > fix it to meet the spec.
> >
> > Great, this is the sort of justification and transparency that had not
> > been previously provided. It is curious that only within the past
> > couple months the device ABI changed by adding the uncached BAR, so
> > this hasn't felt like a firm design.
>
> That is to work around some unfortunate HW defect, and is connected to
> another complex discussion about how ARM memory types need to
> work. Originally people hoped this could simply work transparently but
> it eventually became clear it was not possible for the VM to degrade
> from cacheable without VMM help. Thus a mess was born..
>
> > Also I believe it's been stated that the driver supports both the
> > bare metal representation of the device and this model where the
> > coherent memory is mapped as a BAR, so I'm not sure what obstacles
> > remain or how we're positioned for future products if we take the
> > bare metal approach.
>
> It could work, but it is not really the direction that was decided on
> for the vPCI functions.
>
> > > I thought all the meaningful differences are fixed now?
> > >
> > > The main remaining issue seems to be around the config space
> > > emulation?
> >
> > In the development of the virtio-vfio-pci variant driver we noted that
> > r/w access to the IO BAR didn't honor the IO bit in the command
> > register, which was quickly remedied and now returns -EIO if accessed
> > while disabled. We were already adding r/w support to the coherent BAR
> > at the time, as vfio doesn't have a means to express a region as only
> > having mmap support and precedent exists that BAR regions must support
> > these accesses. So it was suggested that r/w accesses should also
> > honor the command register memory enable bit, but of course memory BARs
> > also support mmap, which snowballs into a much more complicated problem
> > than we have in the case of the virtio IO BAR.
>
> I think that has just become too pedantic, accessing the regions with
> the enable bits turned off is broadly undefined behavior. So long as
> the platform doesn't crash, I think it is fine to behave in a simple
> way.
>
> There is no use case for providing stronger emulation of this.

As above, I think I can be convinced this is acceptable given that the
platform and device are essentially one and the same here, with an
understood lack of a system-wide error response.

Now I'm wondering if we should do something different with
virtio-vfio-pci. As a VF, memory space is effectively always enabled,
governed by the SR-IOV MSE bit on the PF, which is assumed to be
static. It doesn't make a lot of sense to track the IO enable bit for
the emulated IO BAR when the memory BAR is always enabled. It's a
fairly trivial amount of code though, so it's not harmful either.
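The gate in the variant driver's emulated IO BAR access path amounts
to something like this (an illustrative sketch, not lifted verbatim
from the driver):

  /*
   * Refuse read/write access to the emulated IO BAR with -EIO when
   * the virtual command register doesn't have the IO space enable
   * bit set.
   */
  u16 cmd = le16_to_cpu(*(__le16 *)&vdev->vconfig[PCI_COMMAND]);

  if (!(cmd & PCI_COMMAND_IO))
          return -EIO;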
> > So where do we go? Do we continue down the path of emulating full PCI
> > semantics relative to these emulated BARs? Does hardware take into
> > account the memory enable bit of the command register? Do we
> > re-evaluate the BAR model for a device specific region?
>
> It has to come out as a BAR in the VM side so all of this can't be
> avoided. The simple answer is we don't need to care about enables
> because there is no reason to care. VMs don't write to memory with the
> enable turned off because on some platforms you will crash the system
> if you do that.
>
> > I'd suggest we take a look at whether we need to continue to pursue
> > honoring the memory enable bit for these BARs and make a conscious and
> > documented decision if we choose to ignore it.
>
> Let's document it.

Ok, I'd certainly appreciate more background in the commit log and code
comments around why it was chosen to expose the device differently than
bare metal, and in particular why we're consciously choosing not to
honor these specific PCI semantics. Presumably we'd remove the -EIO
from the r/w path based on the memory enable state as well.

> > Ideally we could also make this shared spec that we're implementing
> > available to the community to justify the design decisions here.
> > In the case of GPUDirect Cliques we had permission to post the spec
> > to the list so it could be archived to provide a stable link for
> > future reference. Thanks,
>
> Ideally. I'll see that people consider it at least.

Yep, I think it would have helped move this driver along if such a
document had been shared earlier for the sake of transparency, rather
than leaving us to speculate on the motivation and design. Thanks,

Alex