Date: Thu, 15 Aug 2019 13:35:11 -0400
From: Jerome Glisse
To: Daniel Vetter
Cc: Jason Gunthorpe, Michal Hocko, Andrew Morton, LKML, Linux MM,
 DRI Development, Intel Graphics Development, Peter Zijlstra,
 Ingo Molnar, David Rientjes, Christian König, Masahiro Yamada,
 Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
 Kees Cook, Randy Dunlap, Daniel Vetter
Subject: Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
Message-ID: <20190815173511.GG30916@redhat.com>
References: <20190814134558.fe659b1a9a169c0150c3e57c@linux-foundation.org>
 <20190815084429.GE9477@dhcp22.suse.cz>
 <20190815130415.GD21596@ziepe.ca>
 <20190815143759.GG21596@ziepe.ca>
 <20190815151028.GJ21596@ziepe.ca>
 <20190815163238.GA30781@redhat.com>
 <20190815171622.GL21596@ziepe.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a registration cache is allowed to become
> > > > incoherent with the VMAs because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it. If the device can't do that, then the driver probably
> > > > shouldn't pursue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contingent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running its entire IRQ flow under fs_reclaim locking.
> This is correct. At least for i915 it's already a requirement due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
>
> So yeah writing a gpu driver is not easy.
>
> > > associated with the mm_struct.
> > > In all GPU drivers so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really whether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPUs are not in the registration cache model, i know it might look
> > > like it because of GUP but GUP was used just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
>
> I915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.

Yeah i915 is broken but it never hurt anyone ;) I posted a patch a long
time ago to convert it to hmm but i delayed that until i can get through
making something of GUPfast that can also be used for HMM/ODP users.

Cheers,
Jérôme
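
For readers following the thread, the difference between the 'registration cache' and 'shadow pte' models discussed above can be sketched as a small userspace toy in C. This is not kernel code and not how any driver named in the thread is written. Every type and function name below (reg_cache_entry, spte_mirror, spte_try_map and so on) is hypothetical, and the "wait for the HW" step is reduced to a flag flip. The only point it tries to show is why a registration cache can get away with a non-blocking invalidate and no range-end callback, while a shadow PTE mirror needs the end of the invalidation window before it may map again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */

static bool ranges_overlap(unsigned long a0, unsigned long a1,
                           unsigned long b0, unsigned long b1)
{
    /* Half-open ranges [a0, a1) and [b0, b1). */
    return a0 < b1 && b0 < a1;
}

/*
 * Registration cache: the entry holds page pins, so the memory stays
 * alive even if the VA mapping changes underneath it. Invalidation only
 * marks the entry stale and never waits for the device.
 */
struct reg_cache_entry {
    unsigned long va_start, va_end;
    bool valid;   /* cleared on invalidation */
    int pins;     /* page pins, untouched by invalidation */
};

/* Returns true if the entry was hit by the invalidated range. */
static bool reg_cache_invalidate(struct reg_cache_entry *e,
                                 unsigned long start, unsigned long end)
{
    if (!ranges_overlap(e->va_start, e->va_end, start, end))
        return false;
    e->valid = false;  /* future operations must redo the DMA setup */
    return true;
}

/*
 * Shadow PTE mirror: the device mapping must be gone before the CPU
 * mapping changes, and must not come back until the window closes.
 */
struct spte_mirror {
    unsigned long va_start, va_end;
    bool device_mapped;
    int invalidations_in_flight;  /* open start..end windows */
};

static void spte_invalidate_start(struct spte_mirror *m,
                                  unsigned long start, unsigned long end)
{
    if (!ranges_overlap(m->va_start, m->va_end, start, end))
        return;
    m->invalidations_in_flight++;
    /* Stand-in for "flush device PTEs and wait for the HW to finish". */
    m->device_mapped = false;
}

static void spte_invalidate_end(struct spte_mirror *m,
                                unsigned long start, unsigned long end)
{
    if (!ranges_overlap(m->va_start, m->va_end, start, end))
        return;
    m->invalidations_in_flight--;
}

/* Device fault path: refuses to map while any window is still open. */
static bool spte_try_map(struct spte_mirror *m)
{
    if (m->invalidations_in_flight > 0)
        return false;
    m->device_mapped = true;
    return true;
}
```

In this toy, calling spte_try_map() between spte_invalidate_start() and spte_invalidate_end() fails, which is the window a driver without a range-end callback cannot see. The registration cache entry has no such window: its pins are left alone and only the valid flag changes.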