Date: Fri, 21 Sep 2018 18:03:14 +0800
From: Kenneth Lee
To: Jerome Glisse
CC: Kenneth Lee, Jonathan Corbet, Herbert Xu, "David S . Miller",
 Joerg Roedel, Alex Williamson, Hao Fang, Zhou Wang, Zaibo Xu,
 Philippe Ombredanne, Greg Kroah-Hartman, Thomas Gleixner, Lu Baolu,
 Sanjay Kumar
Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
Message-ID: <20180921100314.GH207969@Turing-Arch-b>
References: <20180903005204.26041-1-nek.in.cn@gmail.com>
 <20180917014244.GA27596@redhat.com>
In-Reply-To: <20180917014244.GA27596@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> So I want to summarize the issues I have, as this thread has dug deep
> into details. For this I would like to differentiate two cases: first
> the easy one, when relying on SVA/SVM, then the second one, when there
> is no SVA/SVM. In both cases your objectives, as I understand them, are:
>
> [R1]- expose a common user space API that makes it easy to share
>       boilerplate code across many devices (discovering devices,
>       opening device, creating context, creating command queue ...).
> [R2]- try to share the device as much as possible, up to device limits
>       (number of independent queues the device has)
> [R3]- minimize syscalls by allowing user space to directly schedule on
>       the device queue without a round trip to the kernel
>
> I don't think I missed any.
>
>
> (1) Device with SVA/SVM
>
> For that case it is easy, you do not need to be in VFIO or part of
> anything specific in the kernel. There is no security risk (modulo bugs
> in the SVA/SVM silicon). Fork/exec is properly handled and binding a
> process to a device is just a couple dozen lines of code.
>
>
> (2) Device does not have SVA/SVM (or it is disabled)
>
> You still want to allow the device to be part of your framework.
> However here I see fundamental security issues, and you move the burden
> of being careful to user space, which I think is a bad idea. We should
> never trust userspace from kernel space.
>
> To keep the same API for the user space code you want a 1:1 mapping
> between device physical address and process virtual address (i.e. if
> the device accesses device physical address A it is accessing the same
> memory as what is backing virtual address A in the process).
>
> Security issues are on two things:
> [I1]- fork/exec: a process that opened any such device and created an
>       active queue can, without its knowledge, transfer control of its
>       command queue through COW. The parent maps some anonymous region
>       to the device as a command queue buffer, but because of COW the
>       parent can be the first to copy on write, and thus the child can
>       inherit the original pages that are mapped to the hardware.
>       Here the parent loses control and the child gains it.
>

Hi, Jerome,

I have reconsidered your logic. I think the problem can be solved. Let us
separate the SVA/SVM feature into two: fault-from-device and
device-va-awareness. A device with an iommu can support only
device-va-awareness, or both.

VFIO works on top of the iommu, so it will support at least
device-va-awareness.
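
To make the split concrete, here is a rough Python sketch (a model only, not
kernel code). The names SvaFeature, is_supported_combo and has_full_sva are
made up for illustration and appear nowhere in the patchset. It assumes that
fault-from-device is only meaningful together with device-va-awareness:

```python
from enum import Flag, auto


class SvaFeature(Flag):
    """Hypothetical feature bits for the two halves of SVA/SVM."""
    NONE = 0
    DEVICE_VA_AWARENESS = auto()   # device can work on process virtual addresses
    FAULT_FROM_DEVICE = auto()     # device can raise recoverable page faults


def is_supported_combo(features: SvaFeature) -> bool:
    """Accept 'device-va-awareness only' or 'both' (assumption: nothing else)."""
    return bool(features & SvaFeature.DEVICE_VA_AWARENESS)


def has_full_sva(features: SvaFeature) -> bool:
    """True only when both halves are present."""
    both = SvaFeature.DEVICE_VA_AWARENESS | SvaFeature.FAULT_FROM_DEVICE
    return (features & both) == both
```

In this model a VFIO-backed device would always pass is_supported_combo(),
and has_full_sva() tells the two cases apart.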

For the COW problem, it can be treated as an mmu synchronization issue. If
the mmu page table is changed, the change should be synchronized to the
iommu (via iommu_notifier). In the case that the device supports
fault-from-device, it will work fine. In the case that it supports only
device-va-awareness, we can prefault (handle_mm_fault), also via
iommu_notifier, and reset the iommu page table. So this can be considered a
bug in VFIO, can't it?

> [I2]- Because of [R3] you want to allow userspace to schedule commands
>       on the device without doing an ioctl, and thus here user space
>       can schedule any command to the device with any address. What
>       happens if that address has not been mapped by user space
>       is undefined, and in fact cannot be defined, as what each IOMMU
>       does on an invalid address access differs from IOMMU to IOMMU.
>
>       In case of a bad IOMMU, or simply an IOMMU improperly set up by
>       the kernel, this can potentially allow user space to DMA anywhere.
>
> [I3]- By relying on GUP in VFIO you are not abiding by the implicit
>       contract (at least I hope it is implicit) that you should not
>       try to map to the device any file backed vma (private or shared).
>
>       The VFIO code never checks the vma controlling the addresses that
>       are provided to the VFIO_IOMMU_MAP_DMA ioctl. Which means that
>       user space can provide a file backed range.
>
>       I am guessing that the VFIO code never had any issues because its
>       number one user is QEMU and QEMU never does that (and that's good
>       as no one should ever do that).
>
>       So if a process does that you are opening yourself to serious file
>       system corruption (depending on the file system this can lead to
>       total data loss for the filesystem).
>
>       The issue is that once you GUP you never abide by the file system
>       flushing, which write protects the page before writing to disk.
>       So because the page is still mapped with write permission to the
>       device (assuming VFIO_IOMMU_MAP_DMA was a write map), the device
>       can write to the page while it is in the middle of being written
>       back to disk. Consult your nearest file system specialist to ask
>       him how bad that can be.

In that case, we cannot do anything if the device does not support
fault-from-device. But we can reject a write map with a file-backed mapping.

It seems both issues can be solved under the VFIO framework :)
(But of course, I don't mean it has to be.)

>
> [I4]- Design issue: the mdev design, As Far As I Understand It, is about
>       sharing a single device among multiple clients (the most obvious
>       case here is again a QEMU guest). But you are going against that
>       model; in fact AFAIUI you are doing the exact opposite. When there
>       is no SVA/SVM you want only one mdev device, which cannot be
>       shared.
>
>       So this is counterintuitive to the existing mdev design. It is
>       not about sharing a device among multiple users but about giving
>       exclusive access to the device to one user.
>
>
>
> All the reasons above are why I believe a different model would serve
> you and your users better. Below is a design that avoids all of the
> above issues and still delivers all of your objectives, with the
> exception of the third one [R3] when there is no SVA/SVM.
>
>
> Create a subsystem (very much boilerplate code) which devices can
> register themselves against (very much like what you do in your current
> patchset, but outside of VFIO).
>
> That subsystem will create a device file for each registered system and
> expose a common API (i.e. a set of ioctls) for each of those device
> files.
>
> When user space creates a queue (through an ioctl after opening the
> device file) the kernel can return -EBUSY if all the device queues are
> in use, or create a device queue and return a flag like SYNC_ONLY for
> devices that do not have SVA/SVM.
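
The queue creation rule quoted above could be modelled roughly like this. It
is a user space Python model, not kernel code: Device, create_queue and the
numeric value of SYNC_ONLY are invented for illustration, and only -EBUSY and
the SYNC_ONLY name come from the quoted text. It also assumes the -EBUSY
limit applies only to real hardware queues, not to software queues handed
out on devices without SVA/SVM:

```python
import errno

SYNC_ONLY = 0x1  # hypothetical flag value; only the name comes from the thread


class Device:
    """Toy model of a device with a fixed number of hardware queues."""

    def __init__(self, nr_queues: int, has_sva: bool):
        self.nr_queues = nr_queues
        self.has_sva = has_sva
        self.hw_queues_in_use = 0

    def create_queue(self) -> int:
        """Return queue flags (>= 0) on success, or -EBUSY if no queue is free."""
        if not self.has_sva:
            # Assumption: without SVA/SVM the caller gets a software queue,
            # so the hardware queue limit does not apply.
            return SYNC_ONLY
        if self.hw_queues_in_use >= self.nr_queues:
            return -errno.EBUSY
        self.hw_queues_in_use += 1
        return 0
```

With nr_queues=1 and has_sva=True, the second create_queue() call returns
-EBUSY; with has_sva=False every call returns SYNC_ONLY.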
>
> For devices with SVA/SVM, at the time the process creates a queue you
> bind the process PASID to the device queue. From there on userspace can
> schedule commands and use the device without going to kernel space.
>
> For devices without SVA/SVM you create a fake queue that is just pure
> memory and is not related to the device. From there on userspace must
> call an ioctl every time it wants the device to consume its queue
> (hence the SYNC_ONLY flag, for synchronous operation only). The
> kernel portion reads the fake queue exposed to user space and copies
> commands into the real hardware queue, but first it properly maps any
> of the process memory needed for those commands to the device and
> adjusts the device physical addresses with the ones it gets from the
> dma_map API.
>
> With that model it is "easy" to listen to mmu_notifier and to abide by
> it to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> issue by only mapping a fake device queue to userspace.
>
> So yes, with that model it means that every device that wishes to
> support the non SVA/SVM case will have to do extra work (i.e. emulate
> its command queue in software in the kernel). But by doing so, you
> support an unlimited number of processes on your device (i.e. all the
> processes can share one single hardware command queue or multiple
> hardware queues).
>
> The big advantage I see here is that the process does not have to worry
> about doing something wrong. You are protecting yourself and your users
> from stupid mistakes.
>
>
> I hope this is useful to you.
>
> Cheers,
> Jérôme

Cheers
--
			-Kenneth(Hisilicon)
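
For reference, the SYNC_ONLY path described in the quoted proposal can be
modelled in a few lines of Python. This is only a user space simulation, not
driver code: FakeQueue, Kernel, sync, the command layout, the -EFAULT return
and the 0x8000_0000 device address base are all invented for illustration. A
real implementation would rely on mmu_notifier and the dma_map API, as the
quoted text says.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Command:
    addr: int      # address the command operates on
    length: int    # number of bytes covered by the command


@dataclass
class FakeQueue:
    """Plain memory shared with user space; the device never reads it."""
    pending: list = field(default_factory=list)


class Kernel:
    """Toy stand-in for the kernel side of a SYNC_ONLY queue."""

    DMA_BASE = 0x8000_0000  # hypothetical start of the device address window

    def __init__(self, mapped_ranges):
        # Give every allowed (start, size) process range its own device base.
        self.ranges = []
        next_base = self.DMA_BASE
        for start, size in mapped_ranges:
            self.ranges.append((start, size, next_base))
            next_base += size
        self.hw_queue = []  # what the real device would consume

    def _dma_map(self, addr, length):
        """Return a device address, or None if the range is not allowed."""
        for start, size, dev_base in self.ranges:
            if start <= addr and addr + length <= start + size:
                return dev_base + (addr - start)
        return None

    def sync(self, queue: FakeQueue) -> int:
        """Model of the ioctl: validate, translate, then copy to hardware."""
        translated = []
        for cmd in queue.pending:
            dev_addr = self._dma_map(cmd.addr, cmd.length)
            if dev_addr is None:
                return -errno.EFAULT  # reject the batch, hardware untouched
            translated.append(Command(dev_addr, cmd.length))
        self.hw_queue.extend(translated)
        queue.pending.clear()
        return len(translated)
```

The point of the model is that user space only ever writes process virtual
addresses into FakeQueue; every address that reaches hw_queue has been
checked and rewritten by sync() first, which is how the proposal avoids [I2].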