Date: Tue, 29 Jan 2019 14:50:55 -0500
From: Jerome Glisse
To: Jason Gunthorpe
Cc: Logan Gunthorpe, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Greg Kroah-Hartman, Rafael J. Wysocki, Bjorn Helgaas, Christian Koenig,
    Felix Kuehling, linux-pci@vger.kernel.org,
    dri-devel@lists.freedesktop.org, Christoph Hellwig, Marek Szyprowski,
    Robin Murphy, Joerg Roedel, iommu@lists.linux-foundation.org
Subject: Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma
Message-ID: <20190129195055.GH3176@redhat.com>
References: <20190129174728.6430-1-jglisse@redhat.com>
    <20190129174728.6430-4-jglisse@redhat.com>
    <20190129191120.GE3176@redhat.com>
    <20190129193250.GK10108@mellanox.com>
In-Reply-To: <20190129193250.GK10108@mellanox.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 29, 2019 at 07:32:57PM +0000, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
> > On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
> > >
> > >
> > > On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
> > >
> > > > +	/*
> > > > +	 * Optional for device driver that want to allow peer to peer (p2p)
> > > > +	 * mapping of their vma (which can be back by some device memory) to
> > > > +	 * another device.
> > > > +	 *
> > > > +	 * Note that the exporting device driver might not have map anything
> > > > +	 * inside the vma for the CPU but might still want to allow a peer
> > > > +	 * device to access the range of memory corresponding to a range in
> > > > +	 * that vma.
> > > > +	 *
> > > > +	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
> > > > +	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
> > > > +	 * SHOULD ALSO BE SUCCESSFUL.
> > > > +	 * Following this rule allow the importing
> > > > +	 * device to map once during setup and report any failure at that time
> > > > +	 * to the userspace. Further mapping of the same range might happen
> > > > +	 * after mmu notifier invalidation over the range. The exporting device
> > > > +	 * can use this to move things around (defrag BAR space for instance)
> > > > +	 * or do other similar task.
> > > > +	 *
> > > > +	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
> > > > +	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
> > > > +	 * POINT IN TIME WITH NO LOCK HELD.
> > > > +	 *
> > > > +	 * In below function, the device argument is the importing device,
> > > > +	 * the exporting device is the device to which the vma belongs.
> > > > +	 */
> > > > +	long (*p2p_map)(struct vm_area_struct *vma,
> > > > +			struct device *device,
> > > > +			unsigned long start,
> > > > +			unsigned long end,
> > > > +			dma_addr_t *pa,
> > > > +			bool write);
> > > > +	long (*p2p_unmap)(struct vm_area_struct *vma,
> > > > +			  struct device *device,
> > > > +			  unsigned long start,
> > > > +			  unsigned long end,
> > > > +			  dma_addr_t *pa);
> > >
> > > I don't understand why we need new p2p_[un]map function pointers for
> > > this. In subsequent patches, they never appear to be set anywhere and
> > > are only called by the HMM code. I'd have expected it to be called by
> > > some core VMA code and set by HMM as that's what vm_operations_struct is
> > > for.
> > >
> > > But the code as all very confusing, hard to follow and seems to be
> > > missing significant chunks. So I'm not really sure what is going on.
> >
> > It is set by device driver when userspace do mmap(fd) where fd comes
> > from open("/dev/somedevicefile"). So it is set by device driver. HMM
> > has nothing to do with this. It must be set by device driver mmap
> > call back (mmap callback of struct file_operations). For this patch
> > you can completely ignore all the HMM patches.
> > Maybe posting this as
> > 2 separate patchset would make it clearer.
> >
> > For instance see [1] for how a non HMM driver can export its memory
> > by just setting those callback. Note that a proper implementation of
> > this should also include some kind of driver policy on what to allow
> > to map and what to not allow ... All this is driver specific in any
> > way.
>
> I'm imagining that the RDMA drivers would use this interface on their
> per-process 'doorbell' BAR pages - we also wish to have P2P DMA to
> this memory. Also the entire VFIO PCI BAR mmap would be good to cover
> with this too.

Correct, you would set those callbacks in the mmap of your doorbell.

>
> Jerome, I think it would be nice to have a helper scheme - I think the
> simple case would be simple remapping of PCI BAR memory, so if we
> could have, say something like:
>
> static const struct vm_operations_struct my_ops {
>   .p2p_map = p2p_ioremap_map_op,
>   .p2p_unmap = p2p_ioremap_unmap_op,
> }
>
> struct ioremap_data {
>   [..]
> }
>
> fops_mmap() {
>    vma->private_data = &driver_priv->ioremap_data;
>    return p2p_ioremap_device_memory(vma, exporting_device, [..]);
> }
>
> Which closely matches at least what the RDMA drivers do. Where
> p2p_ioremap_device_memory populates p2p_map and p2p_unmap pointers
> with sensible functions, etc.
>
> It looks like vfio would be able to use this as well (though I am
> unsure why vfio uses remap_pfn_range instead of io_remap_pfn range for
> BAR memory..)

Yes, a simple helper that implements a sane default is definitely a
good idea. As I was working with GPUs it was not something that
immediately came to mind (see below). But I can certainly do a sane
set of default helpers that simple device drivers can use right away
without too much thinking on their part. I will add this for the next
posting.

> Do any drivers need more control than this?

GPU drivers do want more control :) GPU drivers move things around all
the time and they have more memory than BAR space (on newer platforms
AMD GPUs do resize the BAR, but that is not the rule for all GPUs). So
GPU drivers actually manage their BAR address space themselves, and
they map and unmap things there. They cannot allow someone to just pin
stuff there randomly, as this would disrupt their regular workflow.
Hence they need control, and they might implement a threshold: for
instance, if they have more than N pages of BAR space mapped for peer
to peer, they can decide to fall back to main memory for any new peer
mapping.

Cheers,
Jérôme
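
A minimal userspace sketch of the threshold policy described above,
under stated assumptions: struct p2p_bar_budget, enum p2p_backing,
p2p_budget_map() and p2p_budget_unmap() are hypothetical names made up
for illustration and are not part of the RFC or of any kernel API. It
only models the accounting an exporting driver might do inside its own
p2p_map/p2p_unmap callbacks, with page-aligned ranges and a 4K page
size assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SHIFT 12	/* assumption: 4K pages */

/* Hypothetical per-exporter bookkeeping; not part of the RFC. */
struct p2p_bar_budget {
	unsigned long mapped_pages;	/* BAR pages currently mapped for peers */
	unsigned long max_pages;	/* the "N" threshold from the text */
};

enum p2p_backing {
	P2P_BACKING_BAR,	/* peer reaches device memory through the BAR */
	P2P_BACKING_SYSMEM,	/* exporter falls back to main memory */
};

/* Number of pages covered by [start, end), start assumed page aligned. */
static unsigned long range_pages(unsigned long start, unsigned long end)
{
	assert(start < end);
	return (end - start + (1UL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
}

/* Decide where a new peer mapping of [start, end) should be backed. */
static enum p2p_backing p2p_budget_map(struct p2p_bar_budget *b,
				       unsigned long start, unsigned long end)
{
	unsigned long npages = range_pages(start, end);

	/* mapped_pages never exceeds max_pages, so this cannot underflow. */
	if (npages > b->max_pages - b->mapped_pages)
		return P2P_BACKING_SYSMEM;

	b->mapped_pages += npages;
	return P2P_BACKING_BAR;
}

/* Give BAR budget back when the importer unmaps a BAR-backed range. */
static void p2p_budget_unmap(struct p2p_bar_budget *b,
			     unsigned long start, unsigned long end,
			     enum p2p_backing backing)
{
	if (backing == P2P_BACKING_BAR) {
		unsigned long npages = range_pages(start, end);

		assert(npages <= b->mapped_pages);
		b->mapped_pages -= npages;
	}
}
```

With a budget of 4 pages, mapping a 3 page range is backed by the BAR,
a following 2 page request falls back to main memory, and the same
request is backed by the BAR again once the first range is unmapped.
Note that the mapping still succeeds in the fallback case; only the
backing changes.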