Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751416AbaKJXxH (ORCPT ); Mon, 10 Nov 2014 18:53:07 -0500 Received: from mail-yh0-f48.google.com ([209.85.213.48]:40722 "EHLO mail-yh0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750799AbaKJXxE (ORCPT ); Mon, 10 Nov 2014 18:53:04 -0500 MIME-Version: 1.0 In-Reply-To: <20141110225036.GB4186@gmail.com> References: <1415644096-3513-1-git-send-email-j.glisse@gmail.com> <1415644096-3513-4-git-send-email-j.glisse@gmail.com> <20141110205814.GA4186@gmail.com> <20141110225036.GB4186@gmail.com> Date: Mon, 10 Nov 2014 15:53:03 -0800 X-Google-Sender-Auth: 0ytjcwUdWNfmQ7ngqlYw72IBzds Message-ID: Subject: Re: [PATCH 3/5] lib: lockless generic and arch independent page table (gpt) v2. From: Linus Torvalds To: Jerome Glisse Cc: Andrew Morton , Linux Kernel Mailing List , linux-mm , Joerg Roedel , Mel Gorman , "H. Peter Anvin" , Peter Zijlstra , Andrea Arcangeli , Johannes Weiner , Larry Woodman , Rik van Riel , Dave Airlie , Brendan Conoboy , Joe Donohue , Duncan Poole , Sherry Cheung , Subhash Gutti , John Hubbard , Mark Hairgrove , Lucien Dunning , Cameron Buschardt , Arvind Gopalakrishnan , Shachar Raindel , Liran Liss , Roland Dreier , Ben Sander , Greg Stoner , John Bridgman , Michael Mantor , Paul Blinzer , Laurent Morichetti , Alexander Deucher , Oded Gabbay , =?UTF-8?B?SsOpcsO0bWUgR2xpc3Nl?= Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 10, 2014 at 2:50 PM, Jerome Glisse wrote: > > I wish i could but GPU or IOMMU do have different page table and page directory > entry format. Some fields only make sense on GPU. Even if you look at Intel or > AMD IOMMU they use different format. The intention of my patch is to provide > common infrastructure code share page table management for different hw each > having different entry format. So I am fine with that, it's the details that confuse me. The thing doesn't seem to be generic enough to be used for arbitrary page tables, with (for example) the shifts fixed by the VM page size and the size of the pte entry type. Also, the levels seem to be very infexible, with the page table entries being the simple case, but then you have that "pdep" thing that seems to be just _one_ level of page directory. The thing is, some of those fields are just *crazy*. For example, just look at these: + uint64_t pd_shift; + uint64_t pde_shift; + uint64_t pgd_shift; and making a shift value be 64-bit is *insane*. It makes no sense. The way you use them, you take a value and shift by that, but that means that the shift value cannot *possibly* be bigger than the size (in bits) of the shift value. In other words, those shifts are in the range 0..63. You can hold that in 6 bits. Using a single "unsigned char" would already have two extraneous bits. Yet you don't save them in a byte. You save them in a "uint64_t" that can hold values between 0..18446744073709551615. Doesn't that seem strange and crazy to you? And then you have these *insane* levels. That's what makes me think it's not actually really generic enough to describe a real page table, _or_ it is overkill for describing them. You have that pde_from_pdp() function to ostensibly allow arbitrary page directory formats, but you don't actually give it an address, so that function cannot be used to actually walk the upper levels at all. Instead you have those three shift values (and one mask value) that presumably describe the depths of the different levels of the page tables. And no, it's not some really clever "you describe different levels separately, and they have a link to each other", because there's no longer from one level to the next in the "struct gpt" either. So it seems to be this really odd mixture of trying to be generic, but at the same time there are all those odd details that are very specific to one very particular two-level page table layout. > It is like to page size because page size on arch we care about is 4k > and GPU page table for all hw i care about is also using the magic 4k > value. This might very well be false on some future hw and it would then > need to be untie from the VM page size. Ok, so fixing the page size at PAGE_SIZE might be reasonable. I wonder how that actually works on other architectures that don't have a 4kB page size, but it's possible that the answer is "this only works for hardware that has the same page size as the CPU". Which might be a reasonable assumption. The actual layout is still very odd. And the whole "locked" array I can't make heads or tails of. It is fixed as "PAGE_SIZE bits" (using some pretty odd arithmetic, but that's what it comes out to, but at the same time it seems to not be about the last-level page size, but about some upper-level thing. And that big array is allocated on the stack, etc etc. Not to mention the whole "it's not even a real lock" issue, apparently. This just doesn't make any *sense*. Why is that array PAGE_SIZE bits (ie 4k bits, 512 bytes on x86) in size? Where does that 4k bits come from? THAT is not really the page-size, and the upper entries don't even have PAGE_SIZE number of entries anyway. > The whole magic shift things is because a 32bit arch might be pair with > a GPU that have 64bit entry No, those shift values are never uint64_t. Not on 32-bit, not on 64-bit. In both cases, all the shift values must very much fit in just 6 bits. Six bits is enough to cover it. Not sixtyfour. > The whole point of this patch is to provide > common code to walk and update a hw page table from the CPU and allowing > concurrent update of that hw page table. So quite frankly, I just don't understand *why* it does any of what it does the way it does. It makes no sense. How many levels of directories? Why do we even care? Why the three fixed shifts? So for example, I *like* the notion of just saying "we'll not describe the upper levels of the tree at all, we'll just let those be behind a manual walking function that the tree description is associated with". Before looking closer - and judging by the comments - that was what "pde_from_pdp()" would fo. But no, that one doesn't seem to walk anything, and cannot actually do so without having an address to walk. So it's something else. It should be entirely possible to create a "generic page table walker" (and by generic I mean that it would actually work very naturally with thge CPU page tables too) by just having a "look up last level of page tables" function, and then an iterator that walks just inside that last level. Leave all the upper-level locking to the unspecified "walk the upper layers" function. That would make sense, and sounds like it should be fairly simple. But that's not what this patch is. So the "locked" bits are apparently not about locking, which I guess I should be relieved about since they cannot work as locks. The *number* of bits is odd and unexplained (from the *use*, it looks like the number of PDE's in an upper-level directory, but that is just me trying to figure out the code, and it seems to have nothing to do with PAGE_SIZE), the shifts have three different levels (why?) and are too big. The pde_from_pdp() thing doesn't get an address so it can only work one single entry at a time, and despite claiming to be about scalability and performance I see "synchronize_rcu()" usage which basically guarantees that it cannot possibly be either, and updating must be slow as hell. It all looks very fancy, but very little of it makes *sense* to me. Why is something that isn't a lock called "locked" and "wlock" respectively? Can anybody explain it to me? Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/