Date: Wed, 22 Sep 2021 21:21:31 -0400
From: Kent Overstreet
To: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Johannes Weiner, Matthew Wilcox, Linus Torvalds, Andrew Morton, "Darrick J. Wong", Christoph Hellwig, David Howells
Subject: Struct page proposal

One thing that's come out of the folios discussions with both Matthew and Johannes is that we seem to be thinking along similar lines regarding our end goals for struct page.

The fundamental reason for struct page is that we need memory to be self describing, without any context - we need to be able to go from a generic untyped struct page and figure out what it contains: handling physical memory failure is the most prominent example, but migration and compaction are more common. We need to be able to ask the thing that owns a page of memory "hey, stop using this and move your stuff here".

Matthew's helpfully been coming up with a list of page types:
https://kernelnewbies.org/MemoryTypes

But struct page could be a lot smaller than it is now. I think we can get it down to two pointers, which means it'll take up 0.4% of system memory. Both Matthew and Johannes have ideas for getting it down even further - the main thing to note is that virt_to_page() _should_ be an uncommon operation (most of the places we're currently using it are completely unnecessary - look at all the places we're using it on the zero page). Johannes is thinking of a two-layer radix tree, Matthew was thinking about using maple trees - personally, I think that 0.4% of system memory is plenty good enough.

Ok, but what do we do with the stuff currently in struct page?
--------------------------------------------------------------

The main thing to note is that in normal operation most folios are going to be describing many pages, not just one - so we'll be using _less_ memory overall if we allocate them separately. That's cool.

Of course, for this to make sense we'll have to get all the other stuff in struct page moved out into its own types, but file & anon pages are the big one, and that's already being tackled.
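To put rough numbers on that - treating today's 64-byte struct page, a ~64-byte separately allocated struct folio, and a 16-page (64k) folio as illustrative assumptions, not measurements - here's a compilable back-of-the-envelope sketch:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;
	unsigned long two_words = 2 * sizeof(unsigned long);	/* proposed struct page */
	unsigned long npages    = 16;				/* one 64k folio */

	/* two words per 4k page is roughly the 0.4% of system memory quoted above */
	printf("per-page overhead: %.2f%%\n", 100.0 * two_words / page_size);

	/* metadata for one 64k folio: per-page state today vs. proposed */
	printf("today:    %lu bytes\n", 64UL * npages);			/* 1024 */
	printf("proposed: %lu bytes\n", two_words * npages + 64);	/* 320 */
	return 0;
}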
Why two ulongs/pointers, instead of just one?
---------------------------------------------

Because one of the things we really want and don't have now is a clean division between allocator and allocatee state. Allocator meaning either the buddy allocator or slab; allocatee state would be the folio or the network pool state or whatever actually called kmalloc() or alloc_pages().

Right now slab state sits in the same place in struct page where allocatee state does, and the reason this is bad is that slab/slub are a hell of a lot faster than the buddy allocator, and Johannes wants to move the boundary between slab allocations and buddy allocator allocations up to something like 64k. If we fix where slab state lives, this becomes completely trivial to do.

So if we have this:

struct page {
	unsigned long allocator;
	unsigned long allocatee;
};

The allocator field would be either a pointer to slab/slub's state, if it's a slab page, or, if it's a buddy allocator page, it'd encode the order of the allocation - like compound order today - and probably whether or not the (compound group of) pages is free.

The allocatee field would be a type-tagged pointer (using the low bits of the pointer for the tag) to one of:

 - struct folio
 - struct anon_folio, if that becomes a thing
 - struct network_pool_page
 - struct pte_page
 - struct zone_device_page

(there's a rough sketch of what the tagging could look like at the end of this mail)

Then we can further refactor things until all the stuff that's currently crammed into struct page lives in types where each struct field means one and precisely one thing, and also where we can freely reshuffle and reorganize and add stuff to the various types, which we couldn't do before because it'd make struct page bigger.

Other notes & potential issues:

 - page->compound_dtor needs to die

 - page->rcu_head moves into the types that actually need it, no issues there

 - page->refcount has question marks around it. I think we can also just move it into the types that need it; with RCU, dereferencing the pointer to the folio (or whatever) and grabbing a ref on folio->refcount can happen under an RCU read lock - there's no real question about whether it's technically possible to get it out of struct page, and I think it would be cleaner overall that way. However, depending on how it's used from code paths that start from generic untyped pages, I could see it turning into more of a hassle than it's worth. More investigation is needed.

 - page->memcg_data - I don't know whether that one more properly belongs in struct page or in the page subtypes - I'd love it if Johannes could talk about that one.

 - page->flags - dealing with this is going to be a huge hassle, but it's also where we'll find some of the biggest gains in overall sanity and readability of the code. Right now PG_locked is super special and ad hoc, and I have run into situations multiple times (and Johannes was in vehement agreement on this one) where I simply could not figure out the behaviour of the current code re: who is responsible for locking pages without instrumenting the code with assertions. Meaning anything we do to create and enforce module boundaries between different chunks of code is going to suck, but the end result should be really worthwhile.

Matthew Wilcox and David Howells have been having conversations on IRC about what to do about other page bits. It appears we should be able to kill a lot of filesystem usage of both PG_private and PG_private_2 - filesystems in general hang state off of page->private, soon to be folio->private, and PG_private in current use just indicates whether page->private is nonzero - meaning it's completely redundant.
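Finally, to make the allocator/allocatee split a bit more concrete, here's a minimal userspace sketch of the type tagging described above. The tag values, PAGE_TYPE_MASK, and the page_type()/page_folio() helpers are all made up for illustration - this isn't a proposed API, it just shows how the low bits of the allocatee word could carry the subtype, assuming the pointed-to structures are at least 8-byte aligned:

#include <assert.h>

enum page_type {			/* lives in the low bits of ->allocatee */
	PAGE_TYPE_FOLIO		= 1,
	PAGE_TYPE_ANON_FOLIO	= 2,
	PAGE_TYPE_NETWORK_POOL	= 3,
	PAGE_TYPE_PTE		= 4,
	PAGE_TYPE_ZONE_DEVICE	= 5,
	PAGE_TYPE_MASK		= 7,	/* 8-byte alignment -> 3 tag bits */
};

struct folio;				/* stands in for the real thing */

struct page {
	unsigned long	allocator;	/* slab pointer, or buddy order + free bit */
	unsigned long	allocatee;	/* tagged pointer to folio/anon_folio/... */
};

static inline enum page_type page_type(const struct page *page)
{
	return (enum page_type)(page->allocatee & PAGE_TYPE_MASK);
}

static inline struct folio *page_folio(const struct page *page)
{
	assert(page_type(page) == PAGE_TYPE_FOLIO);
	return (struct folio *)(page->allocatee & ~(unsigned long)PAGE_TYPE_MASK);
}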