To:     Kent Overstreet <kent.overstreet@gmail.com>,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
        linux-mm@kvack.org
Cc:     Johannes Weiner <hannes@cmpxchg.org>,
        Matthew Wilcox <willy@infradead.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        "Darrick J. Wong" <djwong@kernel.org>,
        Christoph Hellwig <hch@infradead.org>,
        David Howells <dhowells@redhat.com>
References: <YUvWm6G16+ib+Wnb@moria.home.lan>
From:   David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: Struct page proposal
Message-ID: <e567ad16-0f2b-940b-a39b-a4d1505bfcb9@redhat.com>
Date:   Thu, 23 Sep 2021 11:03:44 +0200
In-Reply-To: <YUvWm6G16+ib+Wnb@moria.home.lan>

On 23.09.21 03:21, Kent Overstreet wrote:
> One thing that's come out of the folios discussions with both Matthew and
> Johannes is that we seem to be thinking along similar lines regarding our end
> goals for struct page.
> 
> The fundamental reason for struct page is that we need memory to be self
> describing, without any context - we need to be able to go from a generic
> untyped struct page and figure out what it contains: handling physical memory
> failure is the most prominent example, but migration and compaction are more
> common. We need to be able to ask the thing that owns a page of memory "hey,
> stop using this and move your stuff here".
> 
> Matthew's helpfully been coming up with a list of page types:
> https://kernelnewbies.org/MemoryTypes
> 
> But struct page could be a lot smaller than it is now. I think we can get it
> down to two pointers, which means it'll take up 0.4% of system memory. Both
> Matthew and Johannes have ideas for getting it down even further - the main
> thing to note is that virt_to_page() _should_ be an uncommon operation (most of
> the places we're currently using it are completely unnecessary, look at all the
> places we're using it on the zero page). Johannes is thinking two layer radix
> tree, Matthew was thinking about using maple trees - personally, I think that
> 0.4% of system memory is plenty good enough.
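
A quick sanity check of the 0.4% figure quoted above, as a sketch only. It
assumes 4 KiB base pages and 8-byte unsigned longs; the names small_page,
BASE_PAGE_SIZE and page_overhead_percent are made up for illustration and
do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-word descriptor, mirroring the layout in the proposal. */
struct small_page {
	unsigned long allocator;
	unsigned long allocatee;
};

/* Assumption: 4 KiB base pages. Other page sizes change the result. */
#define BASE_PAGE_SIZE 4096UL

/* Per-page descriptor overhead, as a percentage of one base page. */
static double page_overhead_percent(size_t desc_size)
{
	assert(desc_size <= BASE_PAGE_SIZE);
	return 100.0 * (double)desc_size / (double)BASE_PAGE_SIZE;
}
```

With 8-byte longs, sizeof(struct small_page) is 16, so the overhead is
16 / 4096, roughly 0.39%. A 64-byte descriptor (the size usually quoted for
struct page on x86-64 today) comes out at about 1.56% by the same formula.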
> 
> 
> Ok, but what do we do with the stuff currently in struct page?
> -------------------------------------------------------------
> 
> The main thing to note is that in normal operation most folios are going to be
> describing many pages, not just one, so we'll be using _less_ memory overall
> if we allocate them separately. That's cool.
> 
> Of course, for this to make sense, we'll have to get all the other stuff in
> struct page moved into their own types, but file & anon pages are the big one,
> and that's already being tackled.
> 
> Why two ulongs/pointers, instead of just one?
> ---------------------------------------------
> 
> Because one of the things we really want and don't have now is a clean division
> between allocator and allocatee state. Allocator meaning either the buddy
> allocator or slab, allocatee state would be the folio or the network pool state
> or whatever actually called kmalloc() or alloc_pages().
> 
> Right now slab state sits in the same place in struct page where allocatee state
> does, and the reason this is bad is that slab/slub are a hell of a lot faster
> than the buddy allocator, and Johannes wants to move the boundary between slab
> allocations and buddy allocator allocations up to like 64k. If we fix where slab
> state lives, this will become completely trivial to do.
> 
> So if we have this:
> 
> struct page {
> 	unsigned long	allocator;
> 	unsigned long	allocatee;
> };
> 
> The allocator field would be used for either a pointer to slab/slub's state, if
> it's a slab page, or if it's a buddy allocator page it'd encode the order of the
> allocation - like compound order today, and probably whether or not the
> (compound group of) pages is free.
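
One way the allocator word described above could be packed, purely as a
sketch. The proposal does not specify an encoding, so the bit layout here
(bit 0 marks a buddy word, bit 1 marks "free", the order sits above that)
and every name in the block are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, not taken from the proposal. */
#define ALLOC_BUDDY		0x1UL	/* set: buddy word, clear: slab pointer */
#define ALLOC_FREE		0x2UL	/* buddy only: the page group is free */
#define ALLOC_ORDER_SHIFT	2	/* buddy only: order lives above the flags */

/* Stand-in for slab/slub's per-slab state. */
struct slab_state {
	int dummy;
};

static unsigned long allocator_from_slab(struct slab_state *s)
{
	/* The pointer must have bit 0 clear so it cannot look like a buddy word. */
	assert(((uintptr_t)s & ALLOC_BUDDY) == 0);
	return (unsigned long)(uintptr_t)s;
}

static unsigned long allocator_from_buddy(unsigned int order, bool is_free)
{
	return ((unsigned long)order << ALLOC_ORDER_SHIFT) |
	       (is_free ? ALLOC_FREE : 0UL) | ALLOC_BUDDY;
}

static bool allocator_is_buddy(unsigned long a)
{
	return (a & ALLOC_BUDDY) != 0;
}

static bool allocator_is_free(unsigned long a)
{
	return allocator_is_buddy(a) && (a & ALLOC_FREE) != 0;
}

/* Only meaningful when allocator_is_buddy(a) is true. */
static unsigned int allocator_order(unsigned long a)
{
	return (unsigned int)(a >> ALLOC_ORDER_SHIFT);
}

/* Returns NULL for buddy words. */
static struct slab_state *allocator_slab(unsigned long a)
{
	return allocator_is_buddy(a) ? NULL : (struct slab_state *)(uintptr_t)a;
}
```

The point of the sketch is only that one word is enough to carry either
piece of state, with the low bit telling the two cases apart.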
> 
> The allocatee field would be a type-tagged pointer (using the low bits of the
> pointer for the tag) to one of:
>   - struct folio
>   - struct anon_folio, if that becomes a thing
>   - struct network_pool_page
>   - struct pte_page
>   - struct zone_device_page
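
The low-bit tagging described above could look roughly like this. It is a
sketch under stated assumptions: the tag values, the three-bit tag width and
all helper names are hypothetical, and the target objects are assumed to be
at least 8-byte aligned so the low three bits are free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values; the proposal only lists the target types. */
enum allocatee_type {
	ALLOCATEE_FOLIO = 0,
	ALLOCATEE_ANON_FOLIO,
	ALLOCATEE_NETWORK_POOL_PAGE,
	ALLOCATEE_PTE_PAGE,
	ALLOCATEE_ZONE_DEVICE_PAGE,
};

/* Five types need three tag bits, so targets must be 8-byte aligned. */
#define ALLOCATEE_TAG_BITS	3
#define ALLOCATEE_TAG_MASK	((1UL << ALLOCATEE_TAG_BITS) - 1)

/* Pack an aligned pointer and its type tag into one word. */
static unsigned long allocatee_encode(void *ptr, enum allocatee_type type)
{
	assert(((uintptr_t)ptr & ALLOCATEE_TAG_MASK) == 0);
	return (unsigned long)(uintptr_t)ptr | (unsigned long)type;
}

/* Recover the type tag from the low bits. */
static enum allocatee_type allocatee_type_of(unsigned long word)
{
	return (enum allocatee_type)(word & ALLOCATEE_TAG_MASK);
}

/* Recover the pointer by masking the tag bits off. */
static void *allocatee_ptr(unsigned long word)
{
	return (void *)(uintptr_t)(word & ~ALLOCATEE_TAG_MASK);
}
```

Code starting from an untyped page would check allocatee_type_of() first and
only then cast the result of allocatee_ptr() to the matching struct.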
> 
> Then we can further refactor things until all the stuff that's currently crammed
> in struct page lives in types where each struct field means one and precisely
> one thing, and also where we can freely reshuffle and reorganize and add stuff
> to the various types where we couldn't before because it'd make struct page
> bigger.
> 
> Other notes & potential issues:
>   - page->compound_dtor needs to die
> 
>   - page->rcu_head moves into the types that actually need it, no issues there
> 
>   - page->refcount has question marks around it. I think we can also just move it
>     into the types that need it; with RCU derefing the pointer to the folio or
>     whatever and grabbing a ref on folio->refcount can happen under an RCU read
>     lock - there's no real question about whether it's technically possible to
>     get it out of struct page, and I think it would be cleaner overall that way.
> 
>     However, depending on how it's used from code paths that go from generic
>     untyped pages, I could see it turning into more of a hassle than it's worth.
>     More investigation is needed.
> 
>   - page->memcg_data - I don't know whether that one more properly belongs in
>     struct page or in the page subtypes - I'd love it if Johannes could talk
>     about that one.
> 
>   - page->flags - dealing with this is going to be a huge hassle but also where
>     we'll find some of the biggest gains in overall sanity and readability of the
>     code. Right now, PG_locked is super special and ad hoc and I have run into
>     situations multiple times (and Johannes was in vehement agreement on this
>     one) where I simply could not figure out the behaviour of the current code re:
>     who is responsible for locking pages without instrumenting the code with
>     assertions.
> 
>     Meaning anything we do to create and enforce module boundaries between
>     different chunks of code is going to suck, but the end result should be
>     really worthwhile.
> 
> Matthew Wilcox and David Howells have been having conversations on IRC about
> what to do about other page bits. It appears we should be able to kill a lot of
> filesystem usage of both PG_private and PG_private_2 - filesystems in general
> hang state off of page->private, soon to be folio->private, and PG_private in
> current use just indicates whether page->private is nonzero - meaning it's
> completely redundant.
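
If the flag really does nothing more than mirror the pointer, as claimed
above, the check it provides can be derived instead of stored. A minimal
sketch of that claim; folio_sketch and folio_sketch_has_private are
hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a folio; only ->private matters here. */
struct folio_sketch {
	void *private;
};

/*
 * What a PG_private test reduces to under the assumption that the flag
 * only records whether ->private is set.
 */
static bool folio_sketch_has_private(const struct folio_sketch *f)
{
	assert(f != NULL);
	return f->private != NULL;
}
```

This only holds for users where the flag and the pointer are always updated
together; any user that sets one without the other would need auditing first.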
> 

Don't get me wrong, but until there are answers to some of the very
basic questions raised above (especially about everything that lives in
page->flags, which holds more than just page flags, and about refcount, ...),
this isn't very tempting to spend more time on from a reviewer's perspective.

-- 
Thanks,

David / dhildenb