Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752275AbaKJWuo (ORCPT ); Mon, 10 Nov 2014 17:50:44 -0500 Received: from mail-qc0-f174.google.com ([209.85.216.174]:43728 "EHLO mail-qc0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751178AbaKJWum (ORCPT ); Mon, 10 Nov 2014 17:50:42 -0500 Date: Mon, 10 Nov 2014 17:50:37 -0500 From: Jerome Glisse To: Linus Torvalds Cc: Andrew Morton , Linux Kernel Mailing List , linux-mm , Joerg Roedel , Mel Gorman , "H. Peter Anvin" , Peter Zijlstra , Andrea Arcangeli , Johannes Weiner , Larry Woodman , Rik van Riel , Dave Airlie , Brendan Conoboy , Joe Donohue , Duncan Poole , Sherry Cheung , Subhash Gutti , John Hubbard , Mark Hairgrove , Lucien Dunning , Cameron Buschardt , Arvind Gopalakrishnan , Shachar Raindel , Liran Liss , Roland Dreier , Ben Sander , Greg Stoner , John Bridgman , Michael Mantor , Paul Blinzer , Laurent Morichetti , Alexander Deucher , Oded Gabbay , =?iso-8859-1?B?Suly9G1l?= Glisse Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2. Message-ID: <20141110225036.GB4186@gmail.com> References: <1415644096-3513-1-git-send-email-j.glisse@gmail.com> <1415644096-3513-4-git-send-email-j.glisse@gmail.com> <20141110205814.GA4186@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 10, 2014 at 01:35:24PM -0800, Linus Torvalds wrote: > On Mon, Nov 10, 2014 at 12:58 PM, Jerome Glisse wrote: > > On Mon, Nov 10, 2014 at 12:22:03PM -0800, Linus Torvalds wrote: > > > > Also during Linux Plumber people working on IOMMU expressed there wish to > > see some generic "page table" code that can be share among IOMMU as most > > IOMMU use a page table directory hierarchy for mapping and it is not the > > same as the one use by the CPU. > > If that is the case, why can't it just use the same format as the CPU anyway? I wish i could but GPU or IOMMU do have different page table and page directory entry format. Some fields only make sense on GPU. Even if you look at Intel or AMD IOMMU they use different format. The intention of my patch is to provide common infrastructure code share page table management for different hw each having different entry format. > > You can create page tables that have the same format as the CPU, they > just don't get loaded by the CPU. > > Because quite frankly, I think that's where we want to end up in the > end anyway. You want to be able to have a "struct mm_struct" that just > happens to run on the GPU (or other accelerator). Then, the actual > hardware tables (or whatever) end up acting like just a TLB of that > tree. And in a perfect world, you can actually *share* the page > tables, so that you can have CLONE_VM threads that simply run on the > GPU, and if the CPU process exists, the normal ref-counting of the > "struct mm_struct" will keep the page tables around. > > Because if you *don't* do it that way, you'll always have to have > these magical synchronization models between the two. Which is > obviously what you're adding (the whole invalidation callbacks), but > it really would be lovely if the "heterogeneous" memory model would > aim to be a bit more homogeneous... Again that would be my wish but this is sadly far from being possible. First unlike CPU, GPU are control through a command buffer queue. Inside the command buffer queue you schedule program to run (address of program code and number of threads to spawn as well as others arguments) but you also schedule things GPU page table update for specific process (group of threads). Thus inherently you have no idea on how long a GPU page table update will take, unlike on a CPU with TLB flush and IPI. So code/cmd updating page table on GPU run on a distinct engine than where the code actually using those page table is runing. So scheduling GPU page table update require a fair amount of driver work and also allocation of a slot inside the GPU command buffer queue. Doing all this along side CPU TLB flush inside atomic section sounds insane to me. It could block the CPU for long time nor even mentioning the fact that bug in driver would have more chance to cripple more severly core kernel code path. Second, as i explained the memory bandwidth gap btw CPU and GPU keeps growing. So GPU will keep having discret memory and it will keep being only accessible to the GPU. Even Intel finaly understood that GPU are all about bandwidth and while their solution is let add some special insanely big and fast cache sounds like the right way to go, it is not unless you are ready to have a cache that is several gigabytes in size all coupled with insane heuristic implemented in the mmu silicon to decide what should be inside that fast cache and what should not. So to take advantages of GPU memory you need to migrate what is in system memory to GPU memory and for this you need different page table btw the CPU and the GPU. Things that are migrated inside the GPU memory will have entry pointing to it inside the GPU page table but same address will have special entry inside the CPU page table. This are the two main motivations for having distinct page table that still needs to be synchronize with each others. > > > I am not sure to which locking you are refering to here. The design is > > to allow concurrent readers and faulters to operate at same time. For > > this i need reader to ignore newly faulted|created directory. So during > > table walk done there is a bit of trickery to achieve just that. > > There's two different locking things I really don't like: > > The USE_SPLIT_PTE_PTLOCKS thing is horrible for stuff like this. I > really wouldn't want random library code digging into core data > structures and random VM config options.. > > We do it for the VM, because we scale up to insane loads that do crazy > things, and it matters deeply, and the VM is really really core. I > have yet to see any reason to believe that the same kind of tricks are > needed or appropriate here. Some update to this secondary hw page table will happen on the CPU inside the same code path as the CPU page table update (hidden inside the mmu notifier callback). Hence why i would like to have the same kind of scalability where i do have a spinlock per directory allowing concurrent updates to disjoint address range. > > And the "test_bit(i, wlock->locked)" kind of thing is also > unacceptable, because your "locks" aren't - you don't actually do the > lock acquire/release ordering for those things at all, and you test > them without any synchronization what-so-ever that I can see. As explained the test_bit is not use for synchronization whatsoever, the wlock name is missleading here. It is use as a flag : was this entry modified by previous loop. All inside one CPU thread and never share with other thread. This is not use as synchronization btw different CPU thread at all. I understand that this code might be hard to read and that name of the variable is somewhat missleading. > > > Update to page directory are synchronize through the spinlock of each > > page backing a directory this is why i rely on that option. As explained > > above i am trying to adapt the design of CPU page table to other hw page > > table. The only difference is that the page directory entry and the page > > table entry are different from the CPU and vary from one hw to the other. > > So quite frankly, I think it's wrong. > > Either use the CPU page tables (just don't *load* them on the CPU), or > don't try to claim they are page tables. I really think you shouldn't > mix things up and confuse the issue. They aren't page tables. They > can't even match any particular piece of hardware, since the different > non-CPU "page tables" in the system are just basically random - the > mapping that a GPU uses may look very different from the mappings that > an IOMMU uses. So unlike the real VM page tables that the VM uses that > *try* to actually match the hardware if at all possible, a > device-specific page table very much will be still tied to the device. As explained above i can not reuse the CPU page table first because the entry format is hw dependant second because i want to have different content btw the GPU and CPU page table for memory migration. > > Or am I reading something wrong? Because that's what it looks like > from my reading: your code is written for *some* things to be > dynamically configurable for the sizes of the levels (as 64-bit values > for the shift count? WTF? That's just psychedelic and seems insane) > but a lot seems to be tied to the VM page size and you use the lock in > the page for the page directories, so it doesn't seem like you can > actually ever do the same kind of "match and share the physical > memory" that we do with the VM page tables. It is like to page size because page size on arch we care about is 4k and GPU page table for all hw i care about is also using the magic 4k value. This might very well be false on some future hw and it would then need to be untie from the VM page size. The whole magic shift things is because a 32bit arch might be pair with a GPU that have 64bit entry. The whole point of this patch is to provide common code to walk and update a hw page table from the CPU and allowing concurrent update of that hw page table. So instead of having each single device driver implement its own code for page table walk and management and implement its own synchronization for update i try hear to provide a framework with those 2 features that can be share no matter what is the format of entry use by the hardware. > > So it still looks like just a radix tree to me. With some > configuration options for the size of the elements, but not really to > share the actual page tables with any real hardware (iommu or gpu or > whatever). So as i said above i would want to update some of this page table from CPU and thus i would like to be able share page table walk and locking among different devices. And i believe IOMMU folks would like to do so too, ie share page table walk and locking as common code and everything else as hw specific code. > > Or do you actually have a setup where actual non-CPU hardware actually > walks the page tables you create and call "page tables"? Yes that's my point, hw will walk those page table but CPU would manipulate them. So the format of the entry is dictated by the hw but the way to update those and to walk them on the CPU can be done through common code. Cheers, J?r?me > > Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/