Date: Mon, 7 Jun 2021 09:41:48 -0600
From: Alex Williamson
To: Jason Gunthorpe
Cc: Paolo Bonzini, "Tian, Kevin", Jean-Philippe Brucker, "Jiang, Dave", "Raj, Ashok", "kvm@vger.kernel.org", Jonathan Corbet, Robin Murphy, LKML, "iommu@lists.linux-foundation.org", David Gibson, Kirti Wankhede, David Woodhouse, Jason Wang
Subject: Re: [RFC] /dev/ioasid uAPI proposal
Message-ID: <20210607094148.7e2341fc.alex.williamson@redhat.com>
In-Reply-To: <20210604230108.GB1002214@nvidia.com>
References: <20210604122830.GK1002214@nvidia.com> <20210604092620.16aaf5db.alex.williamson@redhat.com> <815fd392-0870-f410-cbac-859070df1b83@redhat.com> <20210604155016.GR1002214@nvidia.com> <30e5c597-b31c-56de-c75e-950c91947d8f@redhat.com> <20210604160336.GA414156@nvidia.com>
 <2c62b5c7-582a-c710-0436-4ac5e8fd8b39@redhat.com> <20210604172207.GT1002214@nvidia.com> <20210604152918.57d0d369.alex.williamson@redhat.com> <20210604230108.GB1002214@nvidia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe wrote:

> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe wrote:
> >
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > >
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > > to gain extra privilege.
> > > > >
> > > > > Okay, fine, let's turn the question on its head then.
> > > > >
> > > > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > >
> > > > > So, under what conditions do we want to allow VFIO to give a process
> > > > > elevated access to the CPU:
> > > >
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > > which then would be on VFIO and not on KVM.
> > >
> > > At the end of the day we need an ioctl with two arguments:
> > >  - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > >  - The KVM FD to control wbinvd support on
> > >
> > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > lives in, but we have these obnoxious cross module dependencies to
> > > consider..
> > >
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > >
> > > Alex, how about a more fleshed out suggestion:
> > >
> > >  1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > >     it communicates its no-snoop configuration:
> >
> > Communicates to whom?
>
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
>
> > >      - 0 enable, allow WBINVD
> > >      - 1 automatic disable, block WBINVD if the platform
> > >        IOMMU can police it (what we do today)
> > >      - 2 force disable, do not allow WBINVD ever
> >
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").
>
> Here I am outlining the choice and also imagining we might want an
> admin knob to select the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy?  ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives there.
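
To make the three settings quoted above concrete, here is a minimal sketch of one literal reading of that list. Every name in it is hypothetical; none of this is existing kernel uAPI, and it only encodes the list as written: "enable" allows WBINVD, "automatic" blocks it when the platform IOMMU can police no-snoop, and "force disable" never allows it.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not part of any kernel uAPI. */
enum nosnoop_mode {
	NOSNOOP_ENABLE = 0,        /* allow WBINVD */
	NOSNOOP_AUTO_DISABLE = 1,  /* block WBINVD if the IOMMU can police no-snoop */
	NOSNOOP_FORCE_DISABLE = 2, /* never allow WBINVD */
};

/*
 * Literal reading of the quoted list: returns true when WBINVD would be
 * allowed for a device attached under the given mode.
 * iommu_can_block_nosnoop is an assumed input saying whether the platform
 * IOMMU is able to block no-snoop traffic from the device.
 */
bool wbinvd_allowed(enum nosnoop_mode mode, bool iommu_can_block_nosnoop)
{
	switch (mode) {
	case NOSNOOP_ENABLE:
		return true;
	case NOSNOOP_AUTO_DISABLE:
		return !iommu_can_block_nosnoop;
	case NOSNOOP_FORCE_DISABLE:
	default:
		return false;
	}
}
```

This deliberately leaves open what the user sees in "automatic" mode (a failure or a silent promotion to coherent), which is the question raised next.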

I essentially described the "enable" behavior in my previous reply,
user has access to wbinvd if owning a non-coherent capable device
managed in a non-coherent IOASID.  Yes, the user IOASID configuration
controls the latter half of this.

What then is "automatic" mode?  The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking?  Do they get a failure?  Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.

> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").
>
> There are two parts here, like Paolo was leading to. If the process
> has access to WBINVD and then if such an allowed process tells KVM to
> turn on WBINVD in the guest.
>
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
>
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has the ability to create a configuration
where they could have access to wbinvd, but if they haven't created
such a configuration, is the wbinvd a no-op?

> The important detail is that access to WBINVD does not compel the
> process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
> can still choose to create a secure guest by always using IOMMU_CACHE
> in its page tables and not asking KVM to enable WBINVD.

Of course.

> This proposal shifts this policy decision from the kernel to userspace.
> qemu is responsible to determine if KVM should enable wbinvd or not
> based on if it was able to create IOASID's with IOMMU_CACHE.

QEMU is responsible for making sure the VM is consistent; if
non-coherent DMA can occur, wbinvd is emulated.  But it's still the
KVM/IOASID connection that validates that access.

> > Conversely, a user could create a non-coherent IOASID and attach any
> > device to it, regardless of IOMMU backing capabilities.  Only if an
> > assumed non-coherent device is attached would the wbinvd be allowed.
>
> Right, this is exactly the point. Since the user gets to pick if the
> IOASID is coherent or not then an attacker can always reach WBINVD
> using only the device FD. Additional checks don't add to the security
> of the process.
>
> The additional checks you are describing add to the security of the
> guest, however qemu is capable of doing them without more help from the
> kernel.
>
> It is the strength of Paolo's model that KVM should not be able to do
> optionally less, not more than the process itself can do.

I think my previous reply was working towards those guidelines.  I feel
like we're mostly in agreement, but perhaps reading past each other.
Nothing here convinced me against my previous proposal that the
ioasidfd bears responsibility for managing access to a wbinvd ioctl,
and therefore the equivalent KVM access.  Whether wbinvd is allowed or
no-op'd when the user has access to a non-coherent device in a
configuration where the IOMMU prevents non-coherent DMA is maybe still
a matter of personal preference.

> > > It is pretty simple from a /dev/ioasid perspective, covers today's
> > > compat requirement, gives some future option to allow the no-snoop
> > > optimization, and gives a new option for qemu to totally block wbinvd
> > > no matter what.
> >
> > What do you imagine is the use case for totally blocking wbinvd?
>
> If wbinvd is really security important then an operator should endeavor
> to turn it off.
> It can be safely turned off if the operator
> understands the SRIOV devices they are using. ie if you are only using
> mlx5 or a nvme then force it off and be secure, regardless of the
> platform capability.

Ok, I'm not opposed to something like a module option that restricts to
only coherent DMA, but we need to work through how that's exposed and
the userspace behavior.  The most obvious would be that a GET_INFO
ioctl on the ioasidfd indicates the restrictions, a flag on the IOASID
alloc indicates the coherency of the IOASID, and we fail any cases
where the admin policy or hardware support doesn't match (ie. alloc if
it's incompatible with policy, attach if the device/IOMMU backing
violates policy).  This is all a compatible layer with what I described
previously.  Thanks,

Alex
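
For illustration only, the alloc/attach checks described above might look roughly like the following. This is a sketch under stated assumptions, not a proposed implementation: all struct, field, and function names are hypothetical, the errno values are arbitrary choices, and the attach rule (a coherent IOASID rejects a no-snoop capable device when the IOMMU cannot block no-snoop) is one reading of "the device/IOMMU backing violates policy".

```c
#include <errno.h>
#include <stdbool.h>

/* All names below are hypothetical; nothing here is existing kernel uAPI. */
struct ioasid_fd {
	bool coherent_only; /* admin policy, as a GET_INFO ioctl might report it */
	bool wbinvd_ok;     /* set once a qualifying attach has happened */
};

struct ioasid {
	struct ioasid_fd *fd;
	bool coherent;      /* coherency flag requested at alloc time */
};

struct device_caps {
	bool may_nosnoop;           /* Enable No-snoop is not hardwired to zero */
	bool iommu_can_force_snoop; /* backing IOMMU can block no-snoop */
};

/* Fail the alloc when the requested coherency conflicts with admin policy. */
int ioasid_alloc(struct ioasid_fd *fd, bool coherent, struct ioasid *out)
{
	if (fd->coherent_only && !coherent)
		return -EPERM; /* assumed errno */
	out->fd = fd;
	out->coherent = coherent;
	return 0;
}

/* Fail the attach when the device/IOMMU backing cannot honor the IOASID. */
int ioasid_attach(struct ioasid *ioasid, const struct device_caps *dev)
{
	/* A coherent IOASID needs the IOMMU to block no-snoop for this device. */
	if (ioasid->coherent && dev->may_nosnoop && !dev->iommu_can_force_snoop)
		return -EINVAL; /* assumed errno */

	/* Non-coherent capable device in a non-coherent IOASID: wbinvd access. */
	if (!ioasid->coherent && dev->may_nosnoop)
		ioasid->fd->wbinvd_ok = true;
	return 0;
}
```

In this sketch the wbinvd_ok bit on the ioasidfd is what a KVM-facing ioctl would consult, which keeps the access decision in one place.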