2007-05-18 04:09:12

by Nick Piggin

Subject: [rfc] increase struct page size?!

Hi,

I'd like to be the first to propose an increase to the size of struct page
just for the sake of increasing it!

If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
which is quite a nice number for cache purposes.

However we don't have to let those 8 bytes go to waste: we can use them
to store the virtual address of the page, which kind of makes sense for
64-bit, because they are likely to use complicated memory models.

I'd say all up this is going to decrease overall cache footprint in
fastpaths, both by reducing text and data footprint of page_address and
related operations, and by reducing cacheline footprint of most batched
operations on struct pages.

Flame away :)

--

Many batch operations on struct page are completely random, and as such, I
think it is better if each struct page fits completely into a single
cacheline even if it means being slightly larger.

Don't let this space go to waste though, we can use page->virtual in order
to optimise page_address operations.

Interestingly, the irony of 32-bit architectures setting WANT_PAGE_VIRTUAL
because they have slow multiplications is that without WANT_PAGE_VIRTUAL, the
struct is 32 bytes and so page_address can usually be calculated with a shift.
So WANT_PAGE_VIRTUAL just bloats up the size of struct page for those guys!
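
As a rough userspace illustration of the shift point (not kernel code; the
helper names size_is_pow2, log2_exact and entry_index are made up for this
sketch): turning a byte offset into an array index means dividing by the
entry size, and that is a single shift only when the size is a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* True when size is a power of two, so dividing by it is a single shift. */
static int size_is_pow2(size_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/* log2 of a power-of-two size; only valid when size_is_pow2(size). */
static unsigned int log2_exact(size_t size)
{
	unsigned int shift = 0;

	assert(size_is_pow2(size));
	while ((size >> shift) != 1)
		shift++;
	return shift;
}

/* Index of the entry starting at byte_off in an array of size-byte entries. */
static size_t entry_index(size_t byte_off, size_t size)
{
	assert(size != 0 && byte_off % size == 0);
	if (size_is_pow2(size))
		return byte_off >> log2_exact(size);	/* shift path */
	return byte_off / size;				/* general divide */
}
```

With 64-byte entries, entry_index(320, 64) takes the shift path; with 56-byte
entries, entry_index(280, 56) needs the divide (or a multiply by a reciprocal,
which is what compilers typically emit for a constant divisor).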


Index: linux-2.6/include/linux/mm_types.h
===================================================================
--- linux-2.6.orig/include/linux/mm_types.h
+++ linux-2.6/include/linux/mm_types.h
@@ -9,6 +9,14 @@
struct address_space;

/*
+ * WANT_PAGE_VIRTUAL on 64-bit machines gives a nice 64 byte alignment,
+ * so a struct page will fit entirely into a cacheline on modern CPUs.
+ */
+#if BITS_PER_LONG == 64
+# define WANT_PAGE_VIRTUAL
+#endif
+
+/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
Index: linux-2.6/include/linux/bootmem.h
===================================================================
--- linux-2.6.orig/include/linux/bootmem.h
+++ linux-2.6/include/linux/bootmem.h
@@ -91,7 +91,7 @@ extern void free_bootmem_node(pg_data_t

#ifndef CONFIG_HAVE_ARCH_BOOTMEM_NODE
#define alloc_bootmem_node(pgdat, x) \
- __alloc_bootmem_node(pgdat, x, SMP_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
+ __alloc_bootmem_node(pgdat, x, L1_CACHE_BYTES, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_pages_node(pgdat, x) \
__alloc_bootmem_node(pgdat, x, PAGE_SIZE, __pa(MAX_DMA_ADDRESS))
#define alloc_bootmem_low_pages_node(pgdat, x) \
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c
+++ linux-2.6/mm/page_alloc.c
@@ -2717,6 +2717,9 @@ static void __meminit alloc_node_mem_map
}
#endif
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
+
+ if ((unsigned long)pgdat->node_mem_map & (L1_CACHE_BYTES - 1))
+ printk(KERN_WARNING "node_mem_map is not cacheline aligned!\n");
}

void __meminit free_area_init_node(int nid, struct pglist_data *pgdat,


2007-05-18 04:47:46

by David Miller

Subject: Re: [rfc] increase struct page size?!

From: Nick Piggin <[email protected]>
Date: Fri, 18 May 2007 06:08:54 +0200

> I'd like to be the first to propose an increase to the size of struct page
> just for the sake of increasing it!
>
> If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> which is quite a nice number for cache purposes.
>
> However we don't have to let those 8 bytes go to waste: we can use them
> to store the virtual address of the page, which kind of makes sense for
> 64-bit, because they are likely to use complicated memory models.
>
> I'd say all up this is going to decrease overall cache footprint in
> fastpaths, both by reducing text and data footprint of page_address and
> related operations, and by reducing cacheline footprint of most batched
> operations on struct pages.
>
> Flame away :)

I've toyed with this several times on sparc64, and in my experience
the extra memory reference on page->virtual costs on average about the
same as the non-power-of-2 pointer arithmetic.

The decision is absolutely arbitrary performance-wise, but if you
consider the memory wastage on enormous systems, going without
page->virtual is, I think, clearly better.

2007-05-18 05:12:55

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Thu, May 17, 2007 at 09:47:40PM -0700, David Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Fri, 18 May 2007 06:08:54 +0200
>
> > I'd like to be the first to propose an increase to the size of struct page
> > just for the sake of increasing it!
> >
> > If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> > which is quite a nice number for cache purposes.
> >
> > However we don't have to let those 8 bytes go to waste: we can use them
> > to store the virtual address of the page, which kind of makes sense for
> > 64-bit, because they are likely to use complicated memory models.
> >
> > I'd say all up this is going to decrease overall cache footprint in
> > fastpaths, both by reducing text and data footprint of page_address and
> > related operations, and by reducing cacheline footprint of most batched
> > operations on struct pages.
> >
> > Flame away :)
>
> I've toyed with this several times on sparc64, and in my experience
> the extra memory reference on page->virtual costs on average about the
> same as the non-power-of-2 pointer arithmetic.

Of course it is likely to be in the same cacheline, however if your L1
cache latency or throughput simply isn't up to it, FLATMEM systems
definitely could just not use ->virtual, but still add the extra padding
in the struct page for performance reasons... then they get to use
power-of-2 arithmetic to boot.

The page->virtual thing is just a bonus (although have you seen what
sort of hoops SPARSEMEM has to go through to find page_address?! It
will definitely be a win on those architectures).


> The decision is absolutely arbitrary performance-wise, but if you
> consider the memory wastage on enormous systems, going without
> page->virtual is, I think, clearly better.

0.2% of memory, or 2MB per GB. But considering we already use 14MB per
GB for the page structures, it isn't like I'm introducing an order of
magnitude problem.
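
The figures above can be checked with a few lines of C. This is only a
sketch: it assumes 4KB pages and 1GB = 2^30 bytes, and the helper name
memmap_bytes_per_gb is made up for the example.

```c
#include <assert.h>

/* Bytes of struct page needed to describe one GB (2^30 bytes) of memory. */
static unsigned long memmap_bytes_per_gb(unsigned long page_size,
					 unsigned long struct_size)
{
	assert(page_size != 0);
	return (1UL << 30) / page_size * struct_size;
}
```

With 4096-byte pages this gives 14MB per GB for a 56-byte struct page and
16MB per GB for a 64-byte one, so the extra 8 bytes cost 2MB per GB, which
is a little under 0.2% of memory.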

The real benefit I see comes from cache footprint reduction of operations
on struct page. Consider with a 64 byte cacheline and 56 byte struct page,
then every 8 struct pages, 6 span 2 cachelines (75%). If you consider an
operation like reclaim that uses first and last fields, and assume that
pages are going to end up being random, then you're going to touch 75%
more cachelines. Ditto for allocating and freeing pages.
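
The 6-in-8 figure can be reproduced with a small counting loop. This is a
userspace sketch, assuming the array of structs starts on a cacheline
boundary; count_straddlers is a made-up name.

```c
#include <assert.h>

/*
 * Count how many of the first n structs in a cacheline-aligned array
 * cross a cacheline boundary.
 */
static unsigned int count_straddlers(unsigned int struct_size,
				     unsigned int line_size,
				     unsigned int n)
{
	unsigned int i, crossing = 0;

	assert(struct_size != 0 && line_size != 0);
	for (i = 0; i < n; i++) {
		unsigned int first = i * struct_size;		/* first byte */
		unsigned int last = first + struct_size - 1;	/* last byte */

		if (first / line_size != last / line_size)
			crossing++;
	}
	return crossing;
}
```

For 56-byte structs and 64-byte lines, only the first and the last struct in
each group of 8 sit inside a single line; the other 6 cross a boundary. With
64-byte structs none do.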

2007-05-18 05:22:22

by David Miller

Subject: Re: [rfc] increase struct page size?!

From: Nick Piggin <[email protected]>
Date: Fri, 18 May 2007 07:12:38 +0200

> The page->virtual thing is just a bonus (although have you seen what
> sort of hoops SPARSEMEM has to go through to find page_address?! It
> will definitely be a win on those architectures).

If you set the bit ranges in asm/sparsemem.h properly, as I
have currently on sparc64, it isn't bad at all. It's a
single extra dereference from a table that sits in the main
kernel image and thus is in a locked TLB entry.

SPARSEMEM_EXTREME is pretty much unnecessary and with the
virtual mem-map stuff the sparsemem overhead goes away entirely
and we're back to "page - mem_map" type simple calculations
obviating any dereferencing advantage from page->virtual.
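
A toy userspace model of the two approaches being compared (not the kernel's
implementation; struct toy_page, its virt_addr field standing in for
page->virtual, and the helper names are all assumptions made for this sketch,
as is the 4KB page size):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumes 4KB pages */

struct toy_page {
	unsigned long flags;
	void *virt_addr;	/* cached address, in the role of page->virtual */
};

/* Address computed from the page's position in a flat memmap array. */
static uintptr_t toy_page_address(const struct toy_page *memmap,
				  const struct toy_page *page,
				  uintptr_t base)
{
	size_t pfn;

	assert(page >= memmap);
	pfn = (size_t)(page - memmap);	/* "page - mem_map" */
	return base + ((uintptr_t)pfn << TOY_PAGE_SHIFT);
}

/* Address read back from the struct itself: one load, no arithmetic. */
static uintptr_t toy_page_virtual(const struct toy_page *page)
{
	return (uintptr_t)page->virt_addr;
}
```

Both return the same address when virt_addr has been filled in correctly; the
difference is a pointer subtraction and shift versus one extra load from the
struct.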

> 0.2% of memory, or 2MB per GB. But considering we already use 14MB per
> GB for the page structures, it isn't like I'm introducing an order of
> magnitude problem.

All these little things add up, let's not suck like some other
OSs by having that kind of mentality.

Show me instead a change that makes page struct 8 bytes smaller
:-))))

2007-05-18 05:31:45

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Thu, May 17, 2007 at 10:22:17PM -0700, David Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Fri, 18 May 2007 07:12:38 +0200
>
> > The page->virtual thing is just a bonus (although have you seen what
> > sort of hoops SPARSEMEM has to go through to find page_address?! It
> > will definitely be a win on those architectures).
>
> If you set the bit ranges in asm/sparsemem.h properly, as I
> have currently on sparc64, it isn't bad at all. It's a
> single extra dereference from a table that sits in the main
> kernel image and thus is in a locked TLB entry.

It is still another cacheline, another load and more icache.


> SPARSEMEM_EXTREME is pretty much unnecessary and with the
> virtual mem-map stuff the sparsemem overhead goes away entirely
> and we're back to "page - mem_map" type simple calculations
> obviating any dereferencing advantage from page->virtual.

Sure, but you'd still like to save several KB of icache by doing
power of 2 arithmetic ;)


> > 0.2% of memory, or 2MB per GB. But considering we already use 14MB per
> > GB for the page structures, it isn't like I'm introducing an order of
> > magnitude problem.
>
> All these little things add up, let's not suck like some other
> OSs by having that kind of mentality.
>
> Show me instead a change that makes page struct 8 bytes smaller
> :-))))

They all do add up, but this isn't just wasting memory for no reason,
it is to make much better use of CPU caches. Back when PCs had only
a couple of MB of memory, size-speed optimisations were all the rage
because you had enough memory to throw around on big lookup tables and
such... that's only gone away because the cache cost hurts.

But this is one such size/speed tradeoff that actually should make
better use of the cache. Obviously extensive benchmarks are needed,
but I don't think it should be dismissed.

If you have a big problem with struct page overhead, cutting 8 bytes
off it isn't going to make you much happier -- you need to increase
PAGE_SIZE to get some real order-of-magnitude savings.

2007-05-18 07:21:47

by Andrew Morton

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007 06:08:54 +0200 Nick Piggin <[email protected]> wrote:

> Many batch operations on struct page are completely random,

But they shouldn't be: we should aim to place physically contiguous pages
into logically contiguous pagecache slots, for all the reasons we
discussed.

If/when that happens, there will be a *lot* of locality of reference
against the pageframes in a lot of important codepaths.

2007-05-18 07:32:33

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 12:19:05AM -0700, Andrew Morton wrote:
> On Fri, 18 May 2007 06:08:54 +0200 Nick Piggin <[email protected]> wrote:
>
> > Many batch operations on struct page are completely random,
>
> But they shouldn't be: we should aim to place physically contiguous pages
> into logically contiguous pagecache slots, for all the reasons we
> discussed.

For big IO batch operations, pagecache would be more likely to be
physically contiguous, as would LRU, I suppose.

I'm more thinking of operations where things get reclaimed over time,
touched or dirtied in slightly different orderings, interleaved with
other allocations, etc.


> If/when that happens, there will be a *lot* of locality of reference
> against the pageframes in a lot of important codepaths.

And when it doesn't happen, we eat 75% more cache misses. And for that
matter we eat 75% more cache misses for non-batch operations like
allocating or freeing a page by slab, for example.

2007-05-18 07:45:48

by Andrew Morton

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007 09:32:23 +0200 Nick Piggin <[email protected]> wrote:

> On Fri, May 18, 2007 at 12:19:05AM -0700, Andrew Morton wrote:
> > On Fri, 18 May 2007 06:08:54 +0200 Nick Piggin <[email protected]> wrote:
> >
> > > Many batch operations on struct page are completely random,
> >
> > But they shouldn't be: we should aim to place physically contiguous pages
> > into logically contiguous pagecache slots, for all the reasons we
> > discussed.
>
> For big IO batch operations, pagecache would be more likely to be
> physically contiguous, as would LRU, I suppose.

read(), write(), truncate(), writeback, pagefault. Pretty common stuff.

> I'm more thinking of operations where things get reclaimed over time,
> touched or dirtied in slightly different orderings, interleaved with
> other allocations, etc.

Yes, that can happen. But in such cases we by definition aren't touching
the pageframes very often. I'd assert that when the kernel is really
hitting those pageframes hard, it is commonly doing this in ascending
pagecache order.

>
> > If/when that happens, there will be a *lot* of locality of reference
> > against the pageframes in a lot of important codepaths.
>
> And when it doesn't happen, we eat 75% more cache misses. And for that
> matter we eat 75% more cache misses for non-batch operations like
> allocating or freeing a page by slab, for example.

"measure twice, cut once"

2007-05-18 07:59:53

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 12:43:04AM -0700, Andrew Morton wrote:
> On Fri, 18 May 2007 09:32:23 +0200 Nick Piggin <[email protected]> wrote:
>
> > On Fri, May 18, 2007 at 12:19:05AM -0700, Andrew Morton wrote:
> > > On Fri, 18 May 2007 06:08:54 +0200 Nick Piggin <[email protected]> wrote:
> > >
> > > > Many batch operations on struct page are completely random,
> > >
> > > But they shouldn't be: we should aim to place physically contiguous pages
> > > into logically contiguous pagecache slots, for all the reasons we
> > > discussed.
> >
> > For big IO batch operations, pagecache would be more likely to be
> > physically contiguous, as would LRU, I suppose.
>
> read(), write(), truncate(), writeback, pagefault. Pretty common stuff.

Of course, but if you're doing them on random-ish ranges, or multiple
files, or continually on the same file while parts of it get reclaimed
and reinstantiated...


> > I'm more thinking of operations where things get reclaimed over time,
> > touched or dirtied in slightly different orderings, interleaved with
> > other allocations, etc.
>
> Yes, that can happen. But in such cases we by definition aren't touching
> the pageframes very often. I'd assert that when the kernel is really
> hitting those pageframes hard, it is commonly doing this in ascending
> pagecache order.

I'm not sure that I would always agree. Sure if there is random *IO*
involved, then it is going to be slow and we won't be hitting page frames
so hard. But if there is random pagecache access, it will hit them almost
as hard.


> > > If/when that happens, there will be a *lot* of locality of reference
> > > against the pageframes in a lot of important codepaths.
> >
> > And when it doesn't happen, we eat 75% more cache misses. And for that
> > matter we eat 75% more cache misses for non-batch operations like
> > allocating or freeing a page by slab, for example.
>
> "measure twice, cut once"

Definitely agree there.

2007-05-18 09:42:53

by David Howells

Subject: Re: [rfc] increase struct page size?!

Nick Piggin <[email protected]> wrote:

> I'd like to be the first to propose an increase to the size of struct page
> just for the sake of increasing it!

Heh. I'm surprised you haven't got more adverse reactions.

> If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> which is quite a nice number for cache purposes.

Whilst that's true, if you have to deal with a run of contiguous page structs
(eg: the page allocator, perhaps) it's actually less efficient because it
takes more cache to do it. But, hey, it's a compromise whatever.

In the scheme of things, if we're mostly dealing with individual page structs
(as I think we are), then yes, I think it's probably a good thing to do -
especially with larger page sizes.

> However we don't have to let those 8 bytes go to waste: we can use them
> to store the virtual address of the page, which kind of makes sense for
> 64-bit, because they are likely to use complicated memory models.

That's a good idea, one that's implemented on some platforms anyway. It'll be
especially good with NUMA, I suspect.

> I'd say all up this is going to decrease overall cache footprint in
> fastpaths, both by reducing text and data footprint of page_address and
> related operations, and by reducing cacheline footprint of most batched
> operations on struct pages.

kmap, filling in scatter/gather lists, crypto stuff. I like it.

Can you do this just by turning on WANT_PAGE_VIRTUAL on all 64-bit platforms?

David

2007-05-18 12:07:23

by Andi Kleen

Subject: Re: [rfc] increase struct page size?!


>
> I'd say all up this is going to decrease overall cache footprint in
> fastpaths, both by reducing text and data footprint of page_address and
> related operations, and by reducing cacheline footprint of most batched
> operations on struct pages.

I suspect the cache line footprint is not the main problem here (talking about
only one other cache line), but the potential latency of fetching the other
half. One possible alternative instead of increasing struct page would be to
identify places that commonly touch a page first (e.g. using oprofile) and
then always add a prefetch() there to fetch the other half of the page
early.

prefetch on something that is already in cache should be cheap,
so for the structs that don't straddle cachelines it shouldn't be a big
overhead.
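
A userspace sketch of the idea, using the GCC/Clang __builtin_prefetch
builtin in place of the kernel's prefetch(). The struct toy_page56 layout
(56 bytes on typical 64-bit ABIs) and the function names are invented for
the example; where such a prefetch would actually pay off would have to be
found by profiling.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page56 {
	unsigned long first;	/* stands in for the first field */
	char pad[40];
	unsigned long last;	/* stands in for the last field */
};

/* Hint the CPU to fetch the line holding the struct's last byte. */
static void toy_prefetch_tail(const struct toy_page56 *page)
{
	const char *tail = (const char *)page + sizeof(*page) - 1;

	__builtin_prefetch(tail);
}

/* Touch both ends of the struct, prefetching the far end first. */
static unsigned long toy_touch_both_ends(struct toy_page56 *page)
{
	assert(page != NULL);
	toy_prefetch_tail(page);	/* cheap if the line is already cached */
	page->first++;
	page->last++;
	return page->first + page->last;
}
```

The prefetch is only a hint: it does not change the result, it just gives the
second cacheline a chance to arrive while the first field is being worked on.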

I don't think doing the ->virtual addition will buy very much,
because at least the 64bit architectures will probably move
towards vmemmap where pfn->virt is quite cheap.

Of course the real long term fix for struct page cache overhead
would be larger soft page size.

-Andi

2007-05-18 15:42:43

by Hugh Dickins

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007, Nick Piggin wrote:
>
> If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> which is quite a nice number for cache purposes.
>
> However we don't have to let those 8 bytes go to waste: we can use them
> to store the virtual address of the page, which kind of makes sense for
> 64-bit, because they are likely to use complicated memory models.

Sooner rather than later, don't we need those 8 bytes to expand from
atomic_t to atomic64_t _count and _mapcount? Not that we really need
all 64 bits of both, but I don't know how to work atomically with less.

(Why do I have this sneaking feeling that you're actually wanting
to stick something into the lower bits of page->virtual?)

Hugh

2007-05-18 18:15:28

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007, Nick Piggin wrote:

> However we don't have to let those 8 bytes go to waste: we can use them
> to store the virtual address of the page, which kind of makes sense for
> 64-bit, because they are likely to use complicated memory models.

That is not a valid consideration anymore. There is virtual memmap update
pending with the sparsemem folks that will simplify things.

> Many batch operations on struct page are completely random, and as such, I
> think it is better if each struct page fits completely into a single
> cacheline even if it means being slightly larger.

Right. That would simplify the calculations.

> Don't let this space go to waste though, we can use page->virtual in order
> to optimise page_address operations.

page->virtual is a benefit if the page is cache hot. Otherwise it may
cause a useless lookup.

I wonder if there are other uses for the free space?

2007-05-18 18:15:47

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007, Nick Piggin wrote:

> The page->virtual thing is just a bonus (although have you seen what
> sort of hoops SPARSEMEM has to go through to find page_address?! It
> will definitely be a win on those architectures).

That is on the way out. See the discussion on virtual memmap support in
sparsemem.

2007-05-18 20:37:27

by Tony Luck

Subject: RE: [rfc] increase struct page size?!

> I wonder if there are other uses for the free space?

unsigned long moreflags;

Nick and Hugh were just sparring over adding a couple (or perhaps 8)
flag bits. This would supply 64 new bits ... maybe that would keep
them happy for a few more years.

-Tony

2007-05-19 01:22:50

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 04:42:10PM +0100, Hugh Dickins wrote:
> On Fri, 18 May 2007, Nick Piggin wrote:
> >
> > If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> > which is quite a nice number for cache purposes.
> >
> > However we don't have to let those 8 bytes go to waste: we can use them
> > to store the virtual address of the page, which kind of makes sense for
> > 64-bit, because they are likely to use complicated memory models.
>
> Sooner rather than later, don't we need those 8 bytes to expand from
> atomic_t to atomic64_t _count and _mapcount? Not that we really need
> all 64 bits of both, but I don't know how to work atomically with less.

Yeah, that would be a very good use of it.


> (Why do I have this sneaking feeling that you're actually wanting
> to stick something into the lower bits of page->virtual?)

No, I just thought I would get even more flamed if I was just adding
padding that wasn't even doing _anything_ :) I remembered Ken had a
benchmark where page_address was slow, and the rest is history...

2007-05-19 01:25:39

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 11:14:26AM -0700, Christoph Lameter wrote:
> On Fri, 18 May 2007, Nick Piggin wrote:
>
> > However we don't have to let those 8 bytes go to waste: we can use them
> > to store the virtual address of the page, which kind of makes sense for
> > 64-bit, because they are likely to use complicated memory models.
>
> That is not a valid consideration anymore. There is virtual memmap update
> pending with the sparsemem folks that will simplify things.
>
> > Many batch operations on struct page are completely random, and as such, I
> > think it is better if each struct page fits completely into a single
> > cacheline even if it means being slightly larger.
>
> Right. That would simplify the calculations.

It isn't the calculations I'm worried about, although they'll get simpler
too. It is the cache cost.


> > Don't let this space go to waste though, we can use page->virtual in order
> > to optimise page_address operations.
>
> page->virtual is a benefit if the page is cache hot. Otherwise it may
> cause a useless lookup.

It would be very rare for the page not to be in L1 cache at this point,
because we've likely taken a reference on it and/or locked it or moved it
between lists etc.

> I wonder if there are other uses for the free space?

Hugh points out that we should make _count and _mapcount atomic_long_t's,
which would probably be a better use of the space once your vmemmap goes
in.

2007-05-19 01:30:27

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 10:42:30AM +0100, David Howells wrote:
> Nick Piggin <[email protected]> wrote:
>
> > I'd like to be the first to propose an increase to the size of struct page
> > just for the sake of increasing it!
>
> Heh. I'm surprised you haven't got more adverse reactions.
>
> > If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> > which is quite a nice number for cache purposes.
>
> Whilst that's true, if you have to deal with a run of contiguous page structs
> (eg: the page allocator, perhaps) it's actually less efficient because it
> takes more cache to do it. But, hey, it's a compromise whatever.
>
> In the scheme of things, if we're mostly dealing with individual page structs
> (as I think we are), then yes, I think it's probably a good thing to do -
> especially with larger page sizes.

Yeah, we would end up eating about 12.5% more cachelines for contiguous
runs of pages... but that only kicks in after we've touched 8 of them I
think, and by that point the accesses should be very prefetchable.
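
A quick way to see where the contiguous-run cost kicks in, assuming the run
starts on a cacheline boundary (lines_for_run is a made-up helper for this
sketch):

```c
#include <assert.h>

/* Cachelines covered by n back-to-back structs starting on a line boundary. */
static unsigned int lines_for_run(unsigned int struct_size,
				  unsigned int line_size,
				  unsigned int n)
{
	assert(line_size != 0);
	return (n * struct_size + line_size - 1) / line_size;
}
```

For runs of 1 to 7 structs, 56-byte and 64-byte layouts cover the same
number of 64-byte lines. At 8 structs the 56-byte layout covers 7 lines and
the 64-byte layout covers 8, which is the one-line-in-eight (12.5%)
difference.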

I think the average of 75% more cachelines touched for random accesses
is going to outweigh the contiguous batch savings, but that's just a
guess at this point.

2007-05-19 02:03:49

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?! (now sparsemem vmemmap)

On Sat, 19 May 2007, Nick Piggin wrote:

> Hugh points out that we should make _count and _mapcount atomic_long_t's,
> which would probably be a better use of the space once your vmemmap goes
> in.

Well Andy was going to merge it:

http://marc.info/?l=linux-kernel&m=117620162415620&w=2

Andy when are we going to get the vmemmap patches into sparsemem?

2007-05-19 15:43:32

by Andy Whitcroft

Subject: Re: [rfc] increase struct page size?! (now sparsemem vmemmap)

Christoph Lameter wrote:
> On Sat, 19 May 2007, Nick Piggin wrote:
>
>> Hugh points out that we should make _count and _mapcount atomic_long_t's,
>> which would probably be a better use of the space once your vmemmap goes
>> in.
>
> Well Andy was going to merge it:
>
> http://marc.info/?l=linux-kernel&m=117620162415620&w=2
>
> Andy when are we going to get the vmemmap patches into sparsemem?

Sorry, this has been backed up with all the to-ing and fro-ing on other
things. I am just cleaning up the next round, which includes feedback
from wli and a first stab at PPC64 support. Should be out Monday for review.

-apw

2007-05-19 17:52:43

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007, Nick Piggin wrote:
>> If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
>> which is quite a nice number for cache purposes.
>> However we don't have to let those 8 bytes go to waste: we can use them
>> to store the virtual address of the page, which kind of makes sense for
>> 64-bit, because they are likely to use complicated memory models.

On Fri, May 18, 2007 at 04:42:10PM +0100, Hugh Dickins wrote:
> Sooner rather than later, don't we need those 8 bytes to expand from
> atomic_t to atomic64_t _count and _mapcount? Not that we really need
> all 64 bits of both, but I don't know how to work atomically with less.
> (Why do I have this sneaking feeling that you're actually wanting
> to stick something into the lower bits of page->virtual?)

I wonder how close we get to overflow on ->_mapcount and ->_count.
(untested/uncompiled).


-- wli


Index: mm-2.6.21/include/linux/mm.h
===================================================================
--- mm-2.6.21.orig/include/linux/mm.h 2007-05-19 10:17:17.682653270 -0700
+++ mm-2.6.21/include/linux/mm.h 2007-05-19 10:38:52.376433663 -0700
@@ -248,6 +248,24 @@
* routine so they can be sure the page doesn't go away from under them.
*/

+static inline struct page *compound_head(struct page *page)
+{
+ if (unlikely(PageTail(page)))
+ return page->first_page;
+ return page;
+}
+
+static inline int page_count(struct page *page)
+{
+ return atomic_read(&compound_head(page)->_count);
+}
+
+#ifdef CONFIG_ATOMIC_HIWATER
+unsigned long count_hiwater(void);
+int put_page_testzero(struct page *);
+int get_page_unless_zero(struct page *);
+void get_page(struct page *);
+#else /* !CONFIG_ATOMIC_HIWATER */
/*
* Drop a ref, return true if the refcount fell to zero (the page has no users)
*/
@@ -267,24 +285,13 @@
return atomic_inc_not_zero(&page->_count);
}

-static inline struct page *compound_head(struct page *page)
-{
- if (unlikely(PageTail(page)))
- return page->first_page;
- return page;
-}
-
-static inline int page_count(struct page *page)
-{
- return atomic_read(&compound_head(page)->_count);
-}
-
static inline void get_page(struct page *page)
{
page = compound_head(page);
VM_BUG_ON(atomic_read(&page->_count) == 0);
atomic_inc(&page->_count);
}
+#endif /* !CONFIG_ATOMIC_HIWATER */

static inline struct page *virt_to_head_page(const void *x)
{
Index: mm-2.6.21/mm/Makefile
===================================================================
--- mm-2.6.21.orig/mm/Makefile 2007-05-18 09:58:43.851524250 -0700
+++ mm-2.6.21/mm/Makefile 2007-05-18 09:58:59.484415118 -0700
@@ -31,4 +31,4 @@
obj-$(CONFIG_MIGRATION) += migrate.o
obj-$(CONFIG_SMP) += allocpercpu.o
obj-$(CONFIG_QUICKLIST) += quicklist.o
-
+obj-$(CONFIG_ATOMIC_HIWATER) += atomic-hiwater.o
Index: mm-2.6.21/mm/atomic-hiwater.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ mm-2.6.21/mm/atomic-hiwater.c 2007-05-19 10:46:09.713356074 -0700
@@ -0,0 +1,63 @@
+#include <linux/mm_types.h>
+#include <linux/percpu.h>
+#include <linux/module.h>
+#include <linux/irqflags.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h>
+
+static DEFINE_PER_CPU(unsigned long, __count_hiwater);
+
+unsigned long count_hiwater(void)
+{
+ int cpu;
+ unsigned long *hiwater, count = 0;
+
+ for_each_online_cpu(cpu) {
+ hiwater = &per_cpu(__count_hiwater, cpu);
+ if (*hiwater > count)
+ count = *hiwater;
+ }
+ return count;
+}
+EXPORT_SYMBOL_GPL(count_hiwater);
+
+static void update_count_hiwater(unsigned long count)
+{
+ unsigned long flags, *hiwater;
+
+ local_irq_save(flags);
+ hiwater = &__get_cpu_var(__count_hiwater);
+ if (unlikely(count > *hiwater))
+ *hiwater = count;
+ local_irq_restore(flags);
+}
+
+int get_page_unless_zero(struct page *page)
+{
+ int ret;
+
+ VM_BUG_ON(PageCompound(page));
+ ret = atomic_inc_not_zero(&page->_count);
+ update_count_hiwater(atomic_read(&page->_count));
+ return ret;
+}
+EXPORT_SYMBOL(get_page_unless_zero);
+
+void get_page(struct page *page)
+{
+ page = compound_head(page);
+ VM_BUG_ON(atomic_read(&page->_count) == 0);
+ update_count_hiwater(atomic_inc_return(&page->_count));
+}
+EXPORT_SYMBOL(get_page);
+
+int put_page_testzero(struct page *page)
+{
+ int count;
+
+ VM_BUG_ON(atomic_read(&page->_count) == 0);
+ count = atomic_dec_return(&page->_count);
+ update_count_hiwater(count);
+ return count == 0;
+}
+EXPORT_SYMBOL(put_page_testzero);
Index: mm-2.6.21/mm/Kconfig
===================================================================
--- mm-2.6.21.orig/mm/Kconfig 2007-05-19 10:20:24.361291479 -0700
+++ mm-2.6.21/mm/Kconfig 2007-05-19 10:22:17.231723598 -0700
@@ -168,3 +168,8 @@
depends on QUICKLIST
default "1"

+config ATOMIC_HIWATER
+ bool "Track page reference count high watermarks."
+ default n
+ help
+ This option tracks the largest reference counts seen for a page.
Index: mm-2.6.21/mm/vmstat.c
===================================================================
--- mm-2.6.21.orig/mm/vmstat.c 2007-05-19 10:34:41.382130313 -0700
+++ mm-2.6.21/mm/vmstat.c 2007-05-19 10:38:38.519644010 -0700
@@ -607,6 +607,29 @@
return v + *pos;
}

+#ifdef CONFIG_ATOMIC_HIWATER
+static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+ (*pos)++;
+ if (*pos == ARRAY_SIZE(vmstat_text))
+ return (void *)~0UL;
+ else if (*pos > ARRAY_SIZE(vmstat_text))
+ return NULL;
+ return (unsigned long *)m->private + *pos;
+}
+
+static int vmstat_show(struct seq_file *m, void *arg)
+{
+ unsigned long *l = arg;
+ unsigned long off = l - (unsigned long *)m->private;
+
+ if (off < ARRAY_SIZE(vmstat_text))
+ seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
+ else
+ seq_printf(m, "count_hiwater %lu\n", count_hiwater());
+ return 0;
+}
+#else /* !CONFIG_ATOMIC_HIWATER */
static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
{
(*pos)++;
@@ -623,6 +646,7 @@
seq_printf(m, "%s %lu\n", vmstat_text[off], *l);
return 0;
}
+#endif /* !CONFIG_ATOMIC_HIWATER */

static void vmstat_stop(struct seq_file *m, void *arg)
{

2007-05-19 18:14:25

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 11:14:26AM -0700, Christoph Lameter wrote:
>> Right. That would simplify the calculations.

On Sat, May 19, 2007 at 03:25:30AM +0200, Nick Piggin wrote:
> It isn't the calculations I'm worried about, although they'll get simpler
> too. It is the cache cost.

The cache cost argument is specious. Even misaligned, smaller is
smaller. The cache footprint reduction is merely amortized,
probabilistic, etc.


On Fri, May 18, 2007 at 11:14:26AM -0700, Christoph Lameter wrote:
>> I wonder if there are other uses for the free space?

On Sat, May 19, 2007 at 03:25:30AM +0200, Nick Piggin wrote:
> Hugh points out that we should make _count and _mapcount atomic_long_t's,
> which would probably be a better use of the space once your vmemmap goes
> in.

I'm not so sure about that. I doubt we have issues with that. I say
if there's to be padding to 64B to use the whole of the additional
space for additional flag bits. I'm sure fs's could make good use of
64 spare flag bits, or whatever's left over after the VM has its fill.
Perhaps so many spare flag bits could be used in lieu of buffer_heads.

page->virtual is the same old mistake as it was when it was removed.
The virtual mem_map code should be used to resolve the computational
expense. Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
44 bits or more, about as much as is possible, and one reference per
page per page is not even feasible. Full-length atomic_t's are just
not necessary.

However, there are numerous optimizations and features made possible
with flag bits, which might as well be made cheap by padding struct
page up to the next highest power of 2 bytes with space for flag bits.


-- wli

2007-05-19 18:25:26

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Sat, 19 May 2007, William Lee Irwin III wrote:

> However, there are numerous optimizations and features made possible
> with flag bits, which might as well be made cheap by padding struct
> page up to the next highest power of 2 bytes with space for flag bits.

Well the last time I tried to get this by Andi we became a bit concerned
when we realized that the memory map would grow by 14% in size. Given
that 4k page size challenged platforms have a huge amount of page structs
that growth is significant. I think it would be fine to do it for IA64
with 16k page size but not for x86_64.


2007-05-19 22:13:25

by Andrew Morton

Subject: Re: [rfc] increase struct page size?!

On Sat, 19 May 2007 11:15:01 -0700 William Lee Irwin III wrote:

> Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
> 44 bits or more, about as much as is possible, and one reference per
> page per page is not even feasible. Full-length atomic_t's are just
> not necessary.

You can overflow a page's refcount by mapping it 4G times. That requires
32GB of pagetable memory. It's quite feasible with remap_file_pages().
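
The 32GB figure can be reproduced with a little arithmetic. The sketch below assumes one 8-byte page table entry per mapping (as on x86_64) and ignores upper-level page tables; the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes of last-level page table needed for `mappings` PTEs of `pte_size` bytes. */
static uint64_t pagetable_bytes(uint64_t mappings, uint64_t pte_size)
{
	return mappings * pte_size;
}

/* Value left in a 32-bit counter after `refs` increments starting from zero. */
static uint32_t wrapped_count(uint64_t refs)
{
	return (uint32_t)refs;
}
```

With 2^32 mappings, pagetable_bytes(1ULL << 32, 8) is 32 GiB, and a 32-bit counter incremented that many times is back at zero.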

2007-05-20 04:10:39

by Eric Dumazet

Subject: Re: [rfc] increase struct page size?!

Christoph Lameter wrote:
> On Sat, 19 May 2007, William Lee Irwin III wrote:
>
>> However, there are numerous optimizations and features made possible
>> with flag bits, which might as well be made cheap by padding struct
>> page up to the next highest power of 2 bytes with space for flag bits.
>
> Well the last time I tried to get this by Andi we became a bit concerned
> when we realized that the memory map would grow by 14% in size. Given
> that 4k page size challenged platforms have a huge amount of page structs
> that growth is significant. I think it would be fine to do it for IA64
> with 16k page size but not for x86_64.

This reminds me that Andi attempted in the past to convert 'flags' to a 32-bit field:

http://marc.info/?l=linux-kernel&m=107903527523739&w=2

I wonder why this idea was not taken up; saving 2MB per GB of memory is nice :)

2007-05-20 05:22:48

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
> On Fri, May 18, 2007 at 11:14:26AM -0700, Christoph Lameter wrote:
> >> Right. That would simplify the calculations.
>
> On Sat, May 19, 2007 at 03:25:30AM +0200, Nick Piggin wrote:
> > It isn't the calculations I'm worried about, although they'll get simpler
> > too. It is the cache cost.
>
> The cache cost argument is specious. Even misaligned, smaller is
> smaller.

Of course smaller is smaller ;) Why would that make the cache cost
argument specious?


> The cache footprint reduction is merely amortized,
> probabilistic, etc.

I don't really know what you mean by this, or what part of my cache cost
argument you disagree with...

I think it is that you could construct mem_map access patterns, without
specifically looking at alignment, where a 56 byte struct page would suffer
about 75% more cache misses than a 64 byte aligned one (and you could also
get about 12% fewer cache misses with other access patterns).

I also think the kernel's mem_map access patterns would be more on the
random side, so overall would result in significantly fewer cache misses
with 64 byte aligned pages.

Which part do you disagree with?
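
The 75% and 12% figures can be checked by counting cache lines. This is only a sketch: it assumes 64-byte cache lines and a mem_map array that starts on a cache-line boundary, and the helper names are invented for the example.

```c
#include <stddef.h>

#define CACHELINE 64u

/* Number of cache lines touched by the byte range [off, off + size). */
static unsigned lines_touched(size_t off, size_t size)
{
	return (unsigned)((off + size - 1) / CACHELINE - off / CACHELINE + 1);
}

/*
 * Of the first n elements of an array of `size`-byte structures that
 * starts on a cache-line boundary, count those spanning two lines.
 */
static unsigned straddlers(size_t size, unsigned n)
{
	unsigned i, count = 0;

	for (i = 0; i < n; i++)
		if (lines_touched(i * size, size) > 1)
			count++;
	return count;
}
```

For 56-byte structures, 6 of every 8 straddle a line boundary, so touching the first and last field of a random struct page costs two lines 75% of the time. In the other direction, 8 contiguous 56-byte structures fit in 7 lines instead of 8, which is the roughly 12% saving for sequential access.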


> On Fri, May 18, 2007 at 11:14:26AM -0700, Christoph Lameter wrote:
> >> I wonder if there are other uses for the free space?
>
> On Sat, May 19, 2007 at 03:25:30AM +0200, Nick Piggin wrote:
> > Hugh points out that we should make _count and _mapcount atomic_long_t's,
> > which would probably be a better use of the space once your vmemmap goes
> > in.
>
> I'm not so sure about that. I doubt we have issues with that. I say

The issue is that userspace can DOS or crash the kernel by deliberately
overflowing count or mapcount.


> if there's to be padding to 64B to use the whole of the additional
> space for additional flag bits. I'm sure fs's could make good use of
> 64 spare flag bits, or whatever's left over after the VM has its fill.
> Perhaps so many spare flag bits could be used in lieu of buffer_heads.

Really? 64-bit architectures can already use maybe 16 or 32 more
page flag bits than 32-bit architectures, and I definitely do not want
to increase the size of 32-bit struct page, so I think this wouldn't
work.


> page->virtual is the same old mistake as it was when it was removed.
> The virtual mem_map code should be used to resolve the computational

Don't get too hung up on the page->virtual thing. I'll send another
patch with atomic_t/atomic_long_t conversion.


> expense. Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
> 44 bits or more, about as much as is possible, and one reference per
> page per page is not even feasible. Full-length atomic_t's are just
> not necessary.

I don't know what your 32 + PAGE_SHIFT calculation is for, but yes you
can wrap these counters from userspace on 64-bit architectures.

2007-05-20 07:26:30

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Sat, 19 May 2007 11:15:01 -0700 William Lee Irwin III wrote:
>> Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
>> 44 bits or more, about as much as is possible, and one reference per
>> page per page is not even feasible. Full-length atomic_t's are just
>> not necessary.

On Sat, May 19, 2007 at 03:09:34PM -0700, Andrew Morton wrote:
> You can overflow a page's refcount by mapping it 4G times. That requires
> 32GB of pagetable memory. It's quite feasible with remap_file_pages().

Oh dear, worst-case app behavior. I'm just wrong.


-- wli

2007-05-20 08:51:09

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> The cache cost argument is specious. Even misaligned, smaller is
>> smaller.

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> Of course smaller is smaller ;) Why would that make the cache cost
> argument specious?

It's not possible to ignore aggregation. For instance, take a subset
of mem_map whose size, ignoring alignment, would otherwise fit in the
cache. To completely avoid sharing any cachelines between page
structures, the page structures must be separated by at least one
mem_map index. This is highly unlikely in uniform distributions.


On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> The cache footprint reduction is merely amortized,
>> probabilistic, etc.

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> I don't really know what you mean by this, or what part of my cache cost
> argument you disagree with...
> I think it is that you could construct mem_map access patterns, without
> specifically looking at alignment, where a 56 byte struct page would suffer
> about 75% more cache misses than a 64 byte aligned one (and you could also
> get about 12% fewer cache misses with other access patterns).
> I also think the kernel's mem_map access patterns would be more on the
> random side, so overall would result in significantly fewer cache misses
> with 64 byte aligned pages.
> Which part do you disagree with?

The lack of consideration of the average case. I'll see what I can smoke
out there.


On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> I'm not so sure about that. I doubt we have issues with that. I say

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> The issue is that userspace can DOS or crash the kernel by deliberately
> overflowing count or mapcount.

This was a flat out error.


On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> if there's to be padding to 64B to use the whole of the additional
>> space for additional flag bits. I'm sure fs's could make good use of
>> 64 spare flag bits, or whatever's left over after the VM has its fill.
>> Perhaps so many spare flag bits could be used in lieu of buffer_heads.

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> Really? 64-bit architectures can already use maybe 16 or 32 more
> page flag bits than 32-bit architectures, and I definitely do not want
> to increase the size of 32-bit struct page, so I think this wouldn't
> work.

Actually they can't use most of those flag bits on account of
portability to the 32-bit case. A 32-bit flags field on 64-bit is rather
plausible for that reason.


On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> page->virtual is the same old mistake as it was when it was removed.
>> The virtual mem_map code should be used to resolve the computational

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> Don't get too hung up on the page->virtual thing. I'll send another
> patch with atomic_t/atomic_long_t conversion.

That's fine.


On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
>> expense. Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
>> 44 bits or more, about as much as is possible, and one reference per
>> page per page is not even feasible. Full-length atomic_t's are just
>> not necessary.

On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> I don't know what your 32 + PAGE_SHIFT calculation is for, but yes you
> can wrap these counters from userspace on 64-bit architectures.

That's just an error.


-- wli

2007-05-20 09:26:03

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Sun, May 20, 2007 at 01:46:47AM -0700, William Lee Irwin III wrote:
> On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
> >> The cache cost argument is specious. Even misaligned, smaller is
> >> smaller.
>
> On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> > Of course smaller is smaller ;) Why would that make the cache cost
> > argument specious?
>
> It's not possible to ignore aggregation. For instance, take a subset
> of mem_map whose size, ignoring alignment, would otherwise fit in the
> cache. To completely avoid sharing any cachelines between page
> structures, the page structures must be separated by at least one
> mem_map index. This is highly unlikely in uniform distributions.

But that wasn't my argument. I _know_ there are cases where the smaller
struct would be better, and I'm sure they would even arise in a running
kernel.


> On Sat, May 19, 2007 at 11:15:01AM -0700, William Lee Irwin III wrote:
> >> The cache footprint reduction is merely amortized,
> >> probabilistic, etc.
>
> On Sun, May 20, 2007 at 07:22:29AM +0200, Nick Piggin wrote:
> > I don't really know what you mean by this, or what part of my cache cost
> > argument you disagree with...
> > I think it is that you could construct mem_map access patterns, without
> > specifically looking at alignment, where a 56 byte struct page would suffer
> > about 75% more cache misses than a 64 byte aligned one (and you could also
> > get about 12% fewer cache misses with other access patterns).
> > I also think the kernel's mem_map access patterns would be more on the
> > random side, so overall would result in significantly fewer cache misses
> > with 64 byte aligned pages.
> > Which part do you disagree with?
>
> The lack of consideration of the average case. I'll see what I can smoke
> out there.

I _am_ considering the average case, and I consider the aligned structure
is likely to win on average :) I just don't have numbers for it yet.


2007-05-20 12:56:56

by Andi Kleen

Subject: Re: [rfc] increase struct page size?!

On Sunday 20 May 2007 06:10:16 Eric Dumazet wrote:
> Christoph Lameter wrote:
> > On Sat, 19 May 2007, William Lee Irwin III wrote:
> >
> >> However, there are numerous optimizations and features made possible
> >> with flag bits, which might as well be made cheap by padding struct
> >> page up to the next highest power of 2 bytes with space for flag bits.
> >
> > Well the last time I tried to get this by Andi we became a bit concerned
> > when we realized that the memory map would grow by 14% in size. Given
> > that 4k page size challenged platforms have a huge amount of page structs
> > that growth is significant. I think it would be fine to do it for IA64
> > with 16k page size but not for x86_64.
>
> This reminds me that Andi attempted in the past to convert 'flags' to a 32-bit field:
>
> http://marc.info/?l=linux-kernel&m=107903527523739&w=2
>
> I wonder why this idea was not taken up; saving 2MB per GB of memory is nice :)

It made sense in 2.4, but in 2.6 it doesn't actually save any memory because
there is no field to put into the freed padding.

Besides with the scarcity of pageflags it might make sense to do "64 bit only"
flags at some point.

-Andi

2007-05-20 17:13:59

by Andrea Arcangeli

Subject: Re: [rfc] increase struct page size?!

On Fri, May 18, 2007 at 06:08:54AM +0200, Nick Piggin wrote:
> If we add 8 bytes to struct page on 64-bit machines, it becomes 64 bytes,
> which is quite a nice number for cache purposes.

We had that kind of hardware alignment on many data structures where it
was only wasting memory (e.g. vmas).

There are few places where hardware alignment matters, and struct page
isn't going to be one of them. But feel free to measure it yourself.

> I'd say all up this is going to decrease overall cache footprint in
> fastpaths, both by reducing text and data footprint of page_address and
> related operations, and by reducing cacheline footprint of most batched
> operations on struct pages.

IIRC the math is faster for any x86. Overall I doubt the change is
measurable.

Even if this would be a microoptimization barely measurable in some
microbenchmark, I don't think this one is worth doing. mem_map is such
a bloat that it really has to be as small as it can unless we can
improve performance _significantly_ by enlarging it.

> Interestingly, the irony of 32-bit architectures setting WANT_PAGE_VIRTUAL
> because they have slow multiplications is that without WANT_PAGE_VIRTUAL, the
> struct is 32-bytes and so page_address can usually be calculated with a shift.
> So WANT_PAGE_VIRTUAL just bloats up the size of struct page for those guys!

If you want to drop it you can; there's nothing fundamental that
prevents you from dropping 'virtual' completely from struct page, by
just making the vaddr per-process and storing it on the stack like
with the atomic kmaps. But passing it up the stack may require heavy
changes to various APIs, which is why we took the few-changes lazy
way back then. If it wasn't worth it back then, I doubt it is worth it
now for just pae36.

2007-05-20 22:50:30

by Matthew Wilcox

Subject: Re: [rfc] increase struct page size?!

On Sat, May 19, 2007 at 10:53:20AM -0700, William Lee Irwin III wrote:
> On Fri, May 18, 2007 at 04:42:10PM +0100, Hugh Dickins wrote:
> > Sooner rather than later, don't we need those 8 bytes to expand from
> > atomic_t to atomic64_t _count and _mapcount? Not that we really need
> > all 64 bits of both, but I don't know how to work atomically with less.
> > (Why do I have this sneaking feeling that you're actually wanting
> > to stick something into the lower bits of page->virtual?)
>
> I wonder how close we get to overflow on ->_mapcount and ->_count.
> (untested/uncompiled).

I think the problem is that an attacker can deliberately overflow
->_count, not that it can happen innocuously. By mmaping, say, the page
of libc that contains memcpy() several million times, and forking
enough, can't you make ->_mapcount hit 0? I'm not a VM guy, I just
vaguely remember people talking about this before.

2007-05-21 06:28:37

by Kamezawa Hiroyuki

Subject: Re: [rfc] increase struct page size?!

On Fri, 18 May 2007 13:37:09 -0700
"Luck, Tony" wrote:

> > I wonder if there are other uses for the free space?
>
> unsigned long moreflags;
>
> Nick and Hugh were just sparring over adding a couple (or perhaps 8)
> flag bits. This would supply 64 new bits ... maybe that would keep
> them happy for a few more years.
>
- page->zone
This would free some flag bits and make page_zone() simple.
A software (fake) zone for memory control could also be added?
or

- page->some_memory_controler ?
(I don't know whether the resource controller people want this or not.)

-Kame

2007-05-21 08:07:39

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Sun, May 20, 2007 at 01:46:47AM -0700, William Lee Irwin III wrote:
>> The lack of consideration of the average case. I'll see what I can smoke
>> out there.

On Sun, May 20, 2007 at 11:25:52AM +0200, Nick Piggin wrote:
> I _am_ considering the average case, and I consider the aligned structure
> is likely to win on average :) I just don't have numbers for it yet.

Choosing k distinct integers (mem_map array indices) from the interval
[0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
array indices on average. The average interval length is
(n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
further somewhat hairy, but it should be clear that contiguity arising
from random choice is non-negligible.

In any event, I don't have all that much of an objection to what's
actually proposed, just this particular cache footprint argument.
One can motivate increases in sizeof(struct page), but not this way.

Now that I've been informed of the ->_count and ->_mapcount issues,
I'd say that they're grave and should be corrected even at the cost
of sizeof(struct page).


-- wli

Many thanks to int-e on EfNet #math for help with the calculations
(perhaps better described as doing them outright).
Heavily-edited IRC log (using Knuth's conventions for M, N, and k as
the number of runs):

<int-e:#math> wli: oh maybe this can be solved exactly after all. The number
of configurations of N numbers out of M with exactly k runs is C(N-1, k-1) *
C(M-N+1, k). When there are k runs, the average run length is N/k, obviously.
<int-e:#math> wli: assume there are k runs. Add an empty dummy element at the
end and at the front - then you have (k+1) empty runs between the k runs.
Every run has positive length. The empty runs correspond to a partition of
M-N+2 into k+1 positive numbers, and the occupied runs correspond to one of N
into k positive numbers, which gives that formula.
<int-e:#math> wli: So the average is 1/C(M,N) * sum[k=1 to N] N/k C(N-1,k-1)
C(M-N+1,k) = 1/C(M,N) * sum[k=1 to N] C(N,k) C(M-N+1,k) = 1/C(M,N) * (C(M+1,
N) - 1) = (M+1)/(M-N+1) - 1/C(M,N).
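
Both closed forms (the mean number of runs k(n-k+1)/n and the mean run length (n+1)/(n-k+1) - 1/C(n,k)) can be sanity-checked by exhaustive enumeration for small n. The following is a sketch with invented helper names; it uses a bitmask per subset, so it only works for 1 <= k <= n < 32 and is only practical for small n.

```c
#include <stdint.h>

/* Number of set bits in x. */
static unsigned popcount32(uint32_t x)
{
	unsigned bits = 0;

	while (x) {
		x &= x - 1;	/* clear the lowest set bit */
		bits++;
	}
	return bits;
}

/* Number of maximal runs of adjacent set bits in mask. */
static unsigned count_runs(uint32_t mask)
{
	/* a run starts at each set bit whose lower neighbour is clear */
	return popcount32(mask & ~(mask << 1));
}

/* C(n, k) as a double; exact for the small arguments used here. */
static double binomial(unsigned n, unsigned k)
{
	double c = 1.0;
	unsigned i;

	for (i = 1; i <= k; i++)
		c = c * (n - k + i) / i;
	return c;
}

/* Mean number of runs over all k-subsets of {0..n-1}, by enumeration. */
static double brute_mean_runs(unsigned n, unsigned k)
{
	double total = 0.0, subsets = 0.0;
	uint32_t mask;

	for (mask = 0; mask < (UINT32_C(1) << n); mask++) {
		if (popcount32(mask) != k)
			continue;
		total += count_runs(mask);
		subsets += 1.0;
	}
	return total / subsets;
}

/* Mean of (k / number of runs) over all k-subsets of {0..n-1}. */
static double brute_mean_len(unsigned n, unsigned k)
{
	double total = 0.0, subsets = 0.0;
	uint32_t mask;

	for (mask = 0; mask < (UINT32_C(1) << n); mask++) {
		if (popcount32(mask) != k)
			continue;
		total += (double)k / count_runs(mask);
		subsets += 1.0;
	}
	return total / subsets;
}

/* The closed forms quoted in the message. */
static double formula_mean_runs(unsigned n, unsigned k)
{
	return (double)k * (n - k + 1) / n;
}

static double formula_mean_len(unsigned n, unsigned k)
{
	return (double)(n + 1) / (n - k + 1) - 1.0 / binomial(n, k);
}

/* Floating-point comparison with a small absolute tolerance. */
static int nearly_equal(double a, double b)
{
	double d = a - b;

	return d < 1e-9 && d > -1e-9;
}
```

For example, with n = 3 and k = 2 the three subsets have 1, 2 and 1 runs, giving a mean of 4/3 runs and a mean run length of 5/3, which is what both formulas produce.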

2007-05-21 09:22:56

by Helge Hafting

Subject: Re: [rfc] increase struct page size?!

Andrew Morton wrote:
> On Sat, 19 May 2007 11:15:01 -0700 William Lee Irwin III wrote:
>
>
>> Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
>> 44 bits or more, about as much as is possible, and one reference per
>> page per page is not even feasible. Full-length atomic_t's are just
>> not necessary.
>>
>
> You can overflow a page's refcount by mapping it 4G times. That requires
> 32GB of pagetable memory. It's quite feasible with remap_file_pages().
>
But does anybody ever need to do that?
Such an attack is easily thwarted by refusing to map it more
than, say, 3G times?

Helge Hafting

2007-05-21 09:28:21

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
> On Sun, May 20, 2007 at 01:46:47AM -0700, William Lee Irwin III wrote:
> >> The lack of consideration of the average case. I'll see what I can smoke
> >> out there.
>
> On Sun, May 20, 2007 at 11:25:52AM +0200, Nick Piggin wrote:
> > I _am_ considering the average case, and I consider the aligned structure
> > is likely to win on average :) I just don't have numbers for it yet.
>
> Choosing k distinct integers (mem_map array indices) from the interval
> [0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
> array indices on average. The average interval length is
> (n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
> further somewhat hairy, but it should be clear that contiguity arising
> from random choice is non-negligible.

That doesn't say anything about temporal locality, though.


> In any event, I don't have all that much of an objection to what's
> actually proposed, just this particular cache footprint argument.
> One can motivate increases in sizeof(struct page), but not this way.

Realise that you have to have a run of I think at least 7 or 8 contiguous
pages and temporally close references in order to save a single cacheline.

Then also that if the page being touched is not partially in cache from
an earlier access, then it is statistically going to cost more lines to
touch it (up to 75% if you touch the first and the last field, obviously 0%
if you only touch a single field, but that's unlikely given that you
usually take a reference then do at least something else like check flags).

I think the problem with the cache footprint argument is just whether
it makes any significant difference to performance. But..


> Now that I've been informed of the ->_count and ->_mapcount issues,
> I'd say that they're grave and should be corrected even at the cost
> of sizeof(struct page).

... yeah, something like that would bypass

2007-05-21 09:31:49

by Eric Dumazet

Subject: Re: [rfc] increase struct page size?!

On Mon, 21 May 2007 01:08:13 -0700
William Lee Irwin III wrote:

> Now that I've been informed of the ->_count and ->_mapcount issues,
> I'd say that they're grave and should be corrected even at the cost
> of sizeof(struct page).


As long as we handle 4 KB pages, adding 64 bits per page means 0.2% of overhead. Ouch...

We currently have an overhead of 1.36 % for mem_map
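
These percentages are simply the struct page size divided by the page size. A minimal sketch, assuming the 56-byte struct page on 64-bit mentioned earlier in the thread (the function name is illustrative):

```c
/* Percentage of RAM consumed by per-page metadata of `bytes_per_page` bytes. */
static double memmap_overhead_pct(unsigned bytes_per_page, unsigned page_size)
{
	return 100.0 * bytes_per_page / page_size;
}
```

memmap_overhead_pct(56, 4096) is about 1.367, the extra 8 bytes add about 0.195, and the padded 64-byte structure costs 1.5625. The ratio 64/56 is where the 14% growth in the memory map quoted earlier in the thread comes from.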

Maybe we can still use 32-bit counters, and make sure non-root users cannot
make these counters exceed 2^30. (I believe the high-order bit already has a meaning;
check the page_mapped() definition.)

We could use a special atomic_inc_if_not_huge() function, that could revert to
normal atomic_inc() on machines with less than 32 GB (using an alternative_() variant).

On small setups (or 32-bit arches), atomic_inc_if_not_huge() would unconditionally
increment the counter.

#if !defined(BIG_MACHINES)
static inline int atomic_inc_if_not_huge(atomic_t *v)
{
	atomic_inc(v);
	return 1;
}
#else
extern int atomic_inc_if_not_huge(atomic_t *v);

#endif


/* in a .c file */
/* could be patched at boot time if available memory < 32GB (or other limit) */
#if defined(BIG_MACHINES)
#define MAP_LIMIT_COUNT (1 << 30)	/* the 2^30 limit mentioned above */
int atomic_inc_if_not_huge(atomic_t *v)
{
	/* lazy test, we don't care enough to do a real atomic read-modify-write */
	if (unlikely(atomic_read(v) >= MAP_LIMIT_COUNT)) {
		if (non_root_user())
			return 0;
	}
	atomic_inc(v);
	return 1;
}
#endif
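
For comparison, a non-lazy variant would fold the limit check and the increment into one compare-and-swap loop. The sketch below is userspace C11 with <stdatomic.h>, not kernel code; in the kernel one would presumably build it on atomic_cmpxchg() instead. The function name is hypothetical and the non_root_user() policy check is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAP_LIMIT_COUNT (1 << 30)

/*
 * Increment *v unless it has already reached MAP_LIMIT_COUNT.
 * Returns true if the increment happened.
 */
static bool inc_if_below_limit(atomic_int *v)
{
	int old = atomic_load(v);

	do {
		if (old >= MAP_LIMIT_COUNT)
			return false;
		/* on failure, `old` is reloaded with the current value */
	} while (!atomic_compare_exchange_weak(v, &old, old + 1));
	return true;
}
```

Unlike the lazy test, concurrent callers can never push the counter past the limit here, at the cost of a compare-and-swap loop in place of a plain atomic increment.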

2007-05-21 09:45:21

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 11:12:59AM +0200, Helge Hafting wrote:
> Andrew Morton wrote:
> >On Sat, 19 May 2007 11:15:01 -0700 William Lee Irwin III wrote:
> >
> >
> >>Much the same holds for the atomic_t's; 32 + PAGE_SHIFT is
> >>44 bits or more, about as much as is possible, and one reference per
> >>page per page is not even feasible. Full-length atomic_t's are just
> >>not necessary.
> >>
> >
> >You can overflow a page's refcount by mapping it 4G times. That requires
> >32GB of pagetable memory. It's quite feasible with remap_file_pages().
> >
> But does anybody ever need to do that?
> Such an attack is easily thwarted by refusing to map it more
> than, say, 3G times?

That still allows you to DoS the page.

2007-05-21 11:25:27

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
>> Choosing k distinct integers (mem_map array indices) from the interval
>> [0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
>> array indices on average. The average interval length is
>> (n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
>> further somewhat hairy, but it should be clear that contiguity arising
>> from random choice is non-negligible.

On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> That doesn't say anything about temporal locality, though.

It doesn't need to. If what's in the cache is uniformly distributed,
you get that result for spatial locality. From there, it's counting
cachelines.


On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
>> In any event, I don't have all that much of an objection to what's
>> actually proposed, just this particular cache footprint argument.
>> One can motivate increases in sizeof(struct page), but not this way.

On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> Realise that you have to have a run of I think at least 7 or 8 contiguous
> pages and temporally close references in order to save a single cacheline.
> Then also that if the page being touched is not partially in cache from
> an earlier access, then it is statistically going to cost more lines to
> touch it (up to 75% if you touch the first and the last field, obviously 0%
> if you only touch a single field, but that's unlikely given that you
> usually take a reference then do at least something else like check flags).
> I think the problem with the cache footprint argument is just whether
> it makes any significant difference to performance. But..

The average interval ("run") length is (n+1)/(n-k+1) - 1/C(n,k), so for
that to be >= 8 you need (n+1)/(n-k+1) - 1/C(n,k) >= 8, which certainly holds
when (n+1)/(n-k+1) >= 9, i.e. when n <= (9/8)*k - 1 or k >= (8/9)*(n+1).
Clearly a lower bound on k is required, but not obviously derivable.
k >= 8 is obvious, but the least k where (n+1)/(n-k+1) - 1/C(n,k) >= 8
is not entirely obvious. Numerically solving for the least such k finds
that k actually needs to be relatively close to (8/9)*n. A lower bound
of something like 0.87*n + O(1) probably holds.
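
The numerical search for the least such k can be sketched as follows. The helper names are invented, C(n,k) is evaluated in double precision (adequate for n up to roughly a thousand), and k <= n is assumed.

```c
/* C(n, k) as a double, using the smaller of k and n - k factors. */
static double binomial(unsigned n, unsigned k)
{
	double c = 1.0;
	unsigned i, m = (k < n - k) ? k : n - k;

	for (i = 1; i <= m; i++)
		c = c * (n - m + i) / i;
	return c;
}

/* Mean run length when k of n indices are chosen uniformly at random. */
static double mean_run_len(unsigned n, unsigned k)
{
	return (double)(n + 1) / (n - k + 1) - 1.0 / binomial(n, k);
}

/* Least k in [1, n] whose mean run length reaches target, or 0 if none. */
static unsigned least_k(unsigned n, double target)
{
	unsigned k;

	for (k = 1; k <= n; k++)
		if (mean_run_len(n, k) >= target)
			return k;
	return 0;
}
```

For n = 1000 and a target of 8 this search returns 876, and for n = 100 it returns 89, which is consistent with a lower bound of roughly 0.87*n + O(1).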


On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
>> Now that I've been informed of the ->_count and ->_mapcount issues,
>> I'd say that they're grave and should be corrected even at the cost
>> of sizeof(struct page).

On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> ... yeah, something like that would bypass

Did you get cut off here?


-- wli

2007-05-21 17:06:53

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Sun, 20 May 2007, Nick Piggin wrote:

> I _am_ considering the average case, and I consider the aligned structure
> is likely to win on average :) I just don't have numbers for it yet.

I'd be glad too if you could get some numbers. I did some benchmarking a
few weeks ago on x86_64 and I found only a very minimal performance drop
if the calculation was simplified.

Note also that a smaller structure means that more page structs can be
covered by a given number of cachelines. Doing the alignment may cause
more cacheline misses.

2007-05-21 17:08:34

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Sun, 20 May 2007, Andi Kleen wrote:

> Besides with the scarcity of pageflags it might make sense to do "64 bit only"
> flags at some point.

There is no scarcity of page flags. There is

1. Hoarding by Andrew

2. Waste by Sparsemem (section flags no longer necessary with
virtual memmap)

2 will hopefully be addressed soon and with that 1 will go away.

2007-05-21 22:43:52

by Matt Mackall

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
> > On Sun, May 20, 2007 at 01:46:47AM -0700, William Lee Irwin III wrote:
> > >> The lack of consideration of the average case. I'll see what I can smoke
> > >> out there.
> >
> > On Sun, May 20, 2007 at 11:25:52AM +0200, Nick Piggin wrote:
> > > I _am_ considering the average case, and I consider the aligned structure
> > > is likely to win on average :) I just don't have numbers for it yet.
> >
> > Choosing k distinct integers (mem_map array indices) from the interval
> > [0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
> > array indices on average. The average interval length is
> > (n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
> > further somewhat hairy, but it should be clear that contiguity arising
> > from random choice is non-negligible.
>
> That doesn't say anything about temporal locality, though.
>
>
> > In any event, I don't have all that much of an objection to what's
> > actually proposed, just this particular cache footprint argument.
> > One can motivate increases in sizeof(struct page), but not this way.
>
> Realise that you have to have a run of I think at least 7 or 8 contiguous
> pages and temporally close references in order to save a single cacheline.
>
> Then also that if the page being touched is not partially in cache from
> an earlier access, then it is statistically going to cost more lines to
> touch it (up to 75% if you touch the first and the last field, obviously 0%
> if you only touch a single field, but that's unlikely given that you
> usually take a reference then do at least something else like check flags).
>
> I think the problem with the cache footprint argument is just whether
> it makes any significant difference to performance. But..
>
>
> > Now that I've been informed of the ->_count and ->_mapcount issues,
> > I'd say that they're grave and should be corrected even at the cost
> > of sizeof(struct page).
>
> ... yeah, something like that would bypass

As long as we're throwing out crazy unpopular ideas, try this one:

Divide struct page in two such that all the most commonly used
elements are in one piece that's nicely sized and the rest are in
another. Have two parallel arrays containing these pieces and accessor
functions around the unpopular bits.

Whether there is a sensible divide between popular and unpopular bits
isn't clear to me. But hey, I said it was crazy.
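
A minimal sketch of the two-parallel-arrays layout described above, in
plain user-space C. Which fields count as "popular" is an assumption made
only for illustration, and the names page_hot, page_cold, hot_map, cold_map,
page_cold_of, page_set_private and page_get_private are hypothetical, not
kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES 1024

/* Assumed "popular" fields: touched on most fast paths. */
struct page_hot {
	unsigned long flags;
	int count;		/* reference count */
	int mapcount;
};

/* Assumed "unpopular" fields: reached only through accessors. */
struct page_cold {
	unsigned long private_data;
	void *virtual_addr;
};

/* Two parallel arrays: element i of each describes the same page. */
static struct page_hot  hot_map[NR_PAGES];
static struct page_cold cold_map[NR_PAGES];

/* Map a hot part to its cold part through the shared array index. */
static struct page_cold *page_cold_of(const struct page_hot *hot)
{
	assert(hot >= hot_map && hot < hot_map + NR_PAGES);
	return &cold_map[hot - hot_map];
}

/* Accessor functions around one of the unpopular fields. */
static void page_set_private(struct page_hot *hot, unsigned long v)
{
	page_cold_of(hot)->private_data = v;
}

static unsigned long page_get_private(const struct page_hot *hot)
{
	return page_cold_of(hot)->private_data;
}
```

Callers would only ever hold a struct page_hot pointer; the cost of the
split is one extra index computation and a likely cache miss whenever an
unpopular field is actually needed.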

--
Mathematics is the supreme nostalgia of our time.

2007-05-22 00:31:58

by Kamezawa Hiroyuki

Subject: Re: [rfc] increase struct page size?!

On Mon, 21 May 2007 10:08:06 -0700 (PDT)
Christoph Lameter wrote:

> On Sun, 20 May 2007, Andi Kleen wrote:
>
> > Besides with the scarcity of pageflags it might make sense to do "64 bit only"
> > flags at some point.
>
> There is no scarcity of page flags. There is
>
> 1. Hoarding by Andrew
>
> 2. Waste by Sparsemem (section flags no longer necessary with
> virtual memmap)

For i386(32bit arch), there is not enough space for vmemmap.
For 64bit arch, page flags are not exhausted yet.

-Kame


2007-05-22 00:39:16

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:

> For i386(32bit arch), there is not enough space for vmemmap.

I thought 32 bit would use flatmem? Is memory really sparse on 32
bit? Likely difficult due to lack of address space?

> For 64bit arch, page flags are not exhausted yet.

Right.

2007-05-22 00:52:32

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 04:26:03AM -0700, William Lee Irwin III wrote:
> On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
> >> Choosing k distinct integers (mem_map array indices) from the interval
> >> [0,n-1] results in k(n-k+1)/n non-adjacent intervals of contiguous
> >> array indices on average. The average interval length is
> >> (n+1)/(n-k+1) - 1/C(n,k). Alignment considerations make going much
> >> further somewhat hairy, but it should be clear that contiguity arising
> >> from random choice is non-negligible.
>
> On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> > That doesn't say anything about temporal locality, though.
>
> It doesn't need to. If what's in the cache is uniformly distributed,
> you get that result for spatial locality. From there, it's counting
> cachelines.

OK, so your 'k' is the number of struct pages that are in cache? Then
that's fine.

I'm not sure how many that is going to be, but I would be surprised if
it were a significant proportion of mem_map, even on not-so-large
memory systems.
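
The two closed forms quoted above can be checked by brute force for small
n. The sketch below enumerates every way of choosing k of n array indices,
counts the runs of contiguous indices, and compares the averages with
k(n-k+1)/n and (n+1)/(n-k+1) - 1/C(n,k). It assumes that "average interval
length" means the mean, over all choices, of k divided by the number of
runs; all function names are made up for this example.

```c
#include <assert.h>

/* Number of maximal runs of consecutive set bits in the low n bits. */
static int count_runs(unsigned mask, int n)
{
	int runs = 0;

	for (int i = 0; i < n; i++) {
		int cur  = (mask >> i) & 1;
		int prev = i ? (mask >> (i - 1)) & 1 : 0;

		if (cur && !prev)	/* a run starts at bit i */
			runs++;
	}
	return runs;
}

static int popcount(unsigned mask)
{
	int bits = 0;

	for (; mask; mask >>= 1)
		bits += mask & 1;
	return bits;
}

/* Binomial coefficient C(n,k) as a double. */
static double choose(int n, int k)
{
	double c = 1.0;

	for (int i = 1; i <= k; i++)
		c = c * (n - k + i) / i;
	return c;
}

struct run_stats {
	double mean_runs;	/* average number of runs */
	double mean_run_len;	/* average of k / runs */
};

/* Average over every k-element subset of {0, ..., n-1}. */
static struct run_stats enumerate_runs(int n, int k)
{
	double subsets = 0, runs_sum = 0, len_sum = 0;

	assert(n > 0 && n < 25 && k > 0 && k <= n);
	for (unsigned mask = 0; mask < (1u << n); mask++) {
		if (popcount(mask) != k)
			continue;
		int r = count_runs(mask, n);
		subsets += 1;
		runs_sum += r;
		len_sum += (double)k / r;
	}
	return (struct run_stats){ runs_sum / subsets, len_sum / subsets };
}

static double formula_runs(int n, int k)
{
	return (double)k * (n - k + 1) / n;
}

static double formula_run_len(int n, int k)
{
	return (double)(n + 1) / (n - k + 1) - 1.0 / choose(n, k);
}
```

Calling enumerate_runs(12, 5) and comparing against formula_runs(12, 5) and
formula_run_len(12, 5) is a quick sanity check; the enumeration is
exponential in n, so it is only usable for toy sizes.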


> On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
> >> In any event, I don't have all that much of an objection to what's
> >> actually proposed, just this particular cache footprint argument.
> >> One can motivate increases in sizeof(struct page), but not this way.
>
> On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> > Realise that you have to have a run of I think at least 7 or 8 contiguous
> > pages and temporally close references in order to save a single cacheline.
> > Then also that if the page being touched is not partially in cache from
> > an earlier access, then it is statistically going to cost more lines to
> > touch it (up to 75% if you touch the first and the last field, obviously 0%
> > if you only touch a single field, but that's unlikely given that you
> > usually take a reference then do at least something else like check flags).
> > I think the problem with the cache footprint argument is just whether
> > it makes any significant difference to performance. But..
>
> The average interval ("run") length is (n+1)/(n-k+1) - 1/C(n,k), so for
> that to be >= 8 you need (n+1)/(n-k+1) - 1/C(n,k) >= 8 which also happens
> when (n+1)/(n-k+1) >= 9, i.e. when n <= (9/8)*k - 1 or k >= (8/9)*(n+1).
> Clearly a lower bound on k is required, but not obviously derivable.
> k >= 8 is obvious, but the least k where (n+1)/(n-k+1) - 1/C(n,k) >= 8
> is not entirely obvious. Numerically solving for the least such k finds
> that k actually needs to be relatively close to (8/9)*n. A lower bound
> of something like 0.87*n + O(1) probably holds.

Ah, you worked it out... yeah, I'd guess this is going to be a pretty difficult
condition to satisfy (given that it isn't possible for a 4GB system, even
if you had 32MB of cache to fill entirely with struct pages).
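
The 4GB / 32MB figures can be worked through numerically. The sketch below
assumes 4KB pages and a 56-byte struct page (the size implied by the
original proposal to grow it to 64 bytes), and it drops the tiny 1/C(n,k)
term from the run-length formula, so the threshold it reports is an
approximation. The helper names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Number of struct pages (n) needed to describe mem_bytes of memory. */
static uint64_t nr_struct_pages(uint64_t mem_bytes, uint64_t page_size)
{
	assert(page_size > 0);
	return mem_bytes / page_size;
}

/* Upper bound on k: structures that fit if the cache held nothing else. */
static uint64_t structs_in_cache(uint64_t cache_bytes, uint64_t struct_size)
{
	assert(struct_size > 0);
	return cache_bytes / struct_size;
}

/*
 * Smallest k with (n+1)/(n-k+1) >= target, ignoring the 1/C(n,k) term.
 * A result larger than n means no k can reach the target run length.
 */
static uint64_t least_k_for_run(uint64_t n, uint64_t target)
{
	assert(n >= 1 && target >= 1);
	return n + 1 - (n + 1) / target;
}
```

Under these assumptions n = 1048576, a 32MB cache holds at most 599186
structures of 56 bytes (524288 at 64 bytes), while an average run of 8
needs k >= 917505, which is about 0.875*n.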


> On Mon, May 21, 2007 at 01:08:13AM -0700, William Lee Irwin III wrote:
> >> Now that I've been informed of the ->_count and ->_mapcount issues,
> >> I'd say that they're grave and should be corrected even at the cost
> >> of sizeof(struct page).
>
> On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> > ... yeah, something like that would bypass
>
> Did you get cut off here?

Must have. I was going to say it would bypass the whole speed/size
discussion anyway :P
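
The cacheline counts quoted earlier in this message (a run of roughly 8
contiguous pages to save one line, and up to 75% extra lines for an
unaligned structure) can be reproduced with a little arithmetic. This is a
sketch under two assumptions: 64-byte cachelines, and an array that starts
on a cacheline boundary. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64

/* Number of cachelines touched by the byte range [start, start + len). */
static size_t lines_spanned(size_t start, size_t len)
{
	assert(len > 0);
	return (start + len - 1) / CACHELINE - start / CACHELINE + 1;
}

/* How many of the first `count` array elements straddle two cachelines. */
static size_t count_straddlers(size_t size, size_t count)
{
	size_t n = 0;

	for (size_t i = 0; i < count; i++)
		if (lines_spanned(i * size, size) > 1)
			n++;
	return n;
}
```

With 56-byte elements, 6 of every 8 straddle a line boundary (75%), and 8
contiguous elements cover 7 lines instead of the 8 lines used by 64-byte
elements, which is the single saved cacheline.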

2007-05-22 00:58:58

by Kamezawa Hiroyuki

Subject: Re: [rfc] increase struct page size?!

On Mon, 21 May 2007 17:38:58 -0700 (PDT)
Christoph Lameter wrote:

> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
>
> > For i386(32bit arch), there is not enough space for vmemmap.
>
> I thought 32 bit would use flatmem? Is memory really sparse on 32
> bit? Likely difficult due to lack of address space?
>

Of course, i386 can use flatmem.

I am just concerned that memory hotplug works only with sparsemem.
But I also think we can add memory hotplug for flatmem if necessary.
(I myself have no plans for that now. I wonder whether a memory power-save
mode may be supported by chipsets.)

-Kame

2007-05-22 01:08:39

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 05:43:16PM -0500, Matt Mackall wrote:
> On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> >
> > ... yeah, something like that would bypass
>
> As long as we're throwing out crazy unpopular ideas, try this one:
>
> Divide struct page in two such that all the most commonly used
> elements are in one piece that's nicely sized and the rest are in
> another. Have two parallel arrays containing these pieces and accessor
> functions around the unpopular bits.
>
> Whether there is a sensible divide between popular and unpopular bits
> isn't clear to me. But hey, I said it was crazy.

That would be unpopular with pagecache, because that uses pretty well
all fields.

2007-05-22 01:13:17

by Christoph Lameter

Subject: Re: [rfc] increase struct page size?!

On Tue, 22 May 2007, Nick Piggin wrote:

> That would be unpopular with pagecache, because that uses pretty well
> all fields.

SLUB also uses all fields....

2007-05-22 01:39:19

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
>> ... yeah, something like that would bypass

On Mon, May 21, 2007 at 05:43:16PM -0500, Matt Mackall wrote:
> As long as we're throwing out crazy unpopular ideas, try this one:
> Divide struct page in two such that all the most commonly used
> elements are in one piece that's nicely sized and the rest are in
> another. Have two parallel arrays containing these pieces and accessor
> functions around the unpopular bits.
> Whether there is a sensible divide between popular and unpopular bits
> isn't clear to me. But hey, I said it was crazy.

I have a crazier and even less popular idea. Eliminate struct page
entirely as an accounting structure (and, of course, mem_map with it).
Filesystems can keep the per-page metadata they need in their own
accounting structures, slab mutatis mutandis, etc. The brilliant bit
here is that devolving the accounting structures this way allows the
fs and/or subsystem to arrange for strong cache locality, file offset
adjacency to imply memory adjacency of the page accounting fields,
etc., where grabbing random structures out of some array is a real
cache thrasher.

The page allocation and page replacement algorithms would have to be
adjusted, and things would have to allocate their own refcounts,
supposing they want/need refcounts, but it's not so far out. Refer to
filesystem pages by <mapping, index> pairs, refer to slab pages by
address (virtual and physical are trivially inter-convertible), mock
up something akin to what filesystems do for anonymous pages, etc.
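
A rough sketch of how file offset adjacency could imply memory adjacency
of the per-page accounting fields, assuming a filesystem keeps its own
metadata array per mapping. The types fs_page_meta and mapping, the fixed
MAPPING_PAGES size and the helper mapping_meta are all hypothetical; a real
filesystem would more likely use a tree or extent structure than a flat
array.

```c
#include <assert.h>
#include <stddef.h>

#define MAPPING_PAGES 256

/* Per-page accounting owned by the filesystem, not by a global mem_map. */
struct fs_page_meta {
	unsigned long flags;
	int refcount;		/* only because this user wants refcounts */
	unsigned long pfn;	/* which physical page backs this offset */
};

/* One mapping: metadata laid out in file-offset order. */
struct mapping {
	struct fs_page_meta meta[MAPPING_PAGES];
};

/* Refer to a page by its <mapping, index> pair. */
static struct fs_page_meta *mapping_meta(struct mapping *m, size_t index)
{
	assert(index < MAPPING_PAGES);
	return &m->meta[index];
}
```

Because the array is indexed by file page index, a sequential read walks
the metadata sequentially in memory, regardless of where the backing
physical pages happen to be.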

The real objection everyone's going to have is that driver writers
will stain their shorts when faced with the rules for handling such
things. The thing is, I'm not entirely sure who these driver writers
that would have such trouble are, since the driver writers I know
personally are sophisticates rather than walking disaster areas as such
would imply. I suppose they may not be representative of the whole.


-- wli

P.S. This idea is not plucked out of the air; it has precedents. A
number of microkernels do this, and IIRC k42 does so also.

2007-05-22 01:57:33

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 06:39:51PM -0700, William Lee Irwin III wrote:
> On Mon, May 21, 2007 at 11:27:42AM +0200, Nick Piggin wrote:
> >> ... yeah, something like that would bypass
>
> On Mon, May 21, 2007 at 05:43:16PM -0500, Matt Mackall wrote:
> > As long as we're throwing out crazy unpopular ideas, try this one:
> > Divide struct page in two such that all the most commonly used
> > elements are in one piece that's nicely sized and the rest are in
> > another. Have two parallel arrays containing these pieces and accessor
> > functions around the unpopular bits.
> > Whether there is a sensible divide between popular and unpopular bits
> > isn't clear to me. But hey, I said it was crazy.
>
> I have a crazier and even less popular idea. Eliminate struct page
> entirely as an accounting structure (and, of course, mem_map with it).
> Filesystems can keep the per-page metadata they need in their own
> accounting structures, slab mutatis mutandis, etc. The brilliant bit
> here is that devolving the accounting structures this way allows the
> fs and/or subsystem to arrange for strong cache locality, file offset
> adjacency to imply memory adjacency of the page accounting fields,
> etc., where grabbing random structures out of some array is a real
> cache thrasher.
>
> The page allocation and page replacement algorithms would have to be
> adjusted, and things would have to allocate their own refcounts,
> supposing they want/need refcounts, but it's not so far out. Refer to
> filesystem pages by <mapping, index> pairs, refer to slab pages by

BTW. I think the filesystem APIs (at least the VM-side ones) should be
doing this anyway (not even index, but offset). Passing things like
lists of pages around is just horrible. See my write_begin/write_end
and perform_write aops for (what I think is) a step in the right
direction.


> address (virtual and physical are trivially inter-convertible), mock
> up something akin to what filesystems do for anonymous pages, etc.
>
> The real objection everyone's going to have is that driver writers
> will stain their shorts when faced with the rules for handling such
> things. The thing is, I'm not entirely sure who these driver writers
> that would have such trouble are, since the driver writers I know
> personally are sophisticates rather than walking disaster areas as such
> would imply. I suppose they may not be representative of the whole.

That's not the objection I would have. I would say that firstly, I
don't think the mem_map overhead is very significant (at any rate,
allocated-on-demand metadata is not going to be any smaller if
you fill up on pagecache...). Secondly, I think there is merit to
having the same page metadata used by the major subsystems, because
it helps for locality of reference.

But I haven't explored the idea enough myself to know whether there
would be any really killer benefits to this. Delayed metadata freeing
via RCU without holding up the freeing of the actual page would have
been something, however I can do similar with speculative references
now (or whenever the code gets merged), which doesn't even require the
RCU overhead.
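
For readers unfamiliar with the term, the general shape of a speculative
reference can be sketched in user-space C11 as below. This is not the
kernel code: struct meta, get_unless_zero, put_ref and lookup_speculative
are names invented here, and the sketch assumes the metadata itself is
never freed or reused for another type (as with a static mem_map), which
is what makes touching its refcount without a lock safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct meta {
	atomic_int refcount;	/* 0 means the object is being freed */
};

/* Take a reference only if the object is still live. */
static bool get_unless_zero(struct meta *m)
{
	int old = atomic_load(&m->refcount);

	while (old != 0) {
		/* On failure the CAS reloads `old` with the current value. */
		if (atomic_compare_exchange_weak(&m->refcount, &old, old + 1))
			return true;
	}
	return false;
}

static void put_ref(struct meta *m)
{
	int old = atomic_fetch_sub(&m->refcount, 1);

	assert(old > 0);
}

/*
 * Lockless lookup: read the slot, pin the object speculatively, then
 * recheck that the slot still points at it. A real lookup would retry
 * on failure; this sketch simply reports a miss.
 */
static struct meta *lookup_speculative(_Atomic(struct meta *) *slot)
{
	struct meta *m = atomic_load(slot);

	if (m == NULL || !get_unless_zero(m))
		return NULL;	/* empty slot, or object on its way out */
	if (atomic_load(slot) != m) {
		put_ref(m);	/* slot changed under us: undo the pin */
		return NULL;
	}
	return m;
}
```

The recheck after taking the reference is the important step: once the
pin succeeds the object cannot be freed, so if the slot still points at it
the lookup result is valid.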


> -- wli
>
> P.S. This idea is not plucked out of the air; it has precedents. A
> number of microkernels do this, and IIRC k42 does so also.

Psst, just say "kernels" when you mention this to Linus ;)

2007-05-22 05:03:39

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 06:39:51PM -0700, William Lee Irwin III wrote:
>> address (virtual and physical are trivially inter-convertible), mock
>> up something akin to what filesystems do for anonymous pages, etc.
>> The real objection everyone's going to have is that driver writers
>> will stain their shorts when faced with the rules for handling such
>> things. The thing is, I'm not entirely sure who these driver writers
>> that would have such trouble are, since the driver writers I know
>> personally are sophisticates rather than walking disaster areas as such
>> would imply. I suppose they may not be representative of the whole.

On Tue, May 22, 2007 at 03:57:03AM +0200, Nick Piggin wrote:
> That's not the objection I would have. I would say that firstly, I
> don't think the mem_map overhead is very significant (at any rate,
> allocated-on-demand metadata is not going to be any smaller if
> you fill up on pagecache...). Secondly, I think there is merit to
> having the same page metadata used by the major subsystems, because
> it helps for locality of reference.

The size isn't the advantage being cited; I'd actually expect the net
result to be larger. It's the control over the layout of the metadata
for cache locality and even things like having enough flags, folding
buffer_head -like affairs into the per-page metadata for filesystems
and so reaping cache locality benefits even there (assuming it works
out in other respects), and so on.

Passing pages between subsystems doesn't seem very significant to me.
There isn't going to be much locality of reference, or even any
guarantee that the subsystem gets fed a cache hot page structure. The
subsystem being passed the page will have its own cache hot accounting
structures to stick the information about the memory into.


On Tue, May 22, 2007 at 03:57:03AM +0200, Nick Piggin wrote:
> But I haven't explored the idea enough myself to know whether there
> would be any really killer benefits to this. Delayed metadata freeing
> via RCU without holding up the freeing of the actual page would have
> been something, however I can do similar with speculative references
> now (or whenever the code gets merged), which doesn't even require the
> RCU overhead.

I'm not entirely sure what you're on about there, but it sounds
interesting.


-- wli

2007-05-22 06:25:23

by Nick Piggin

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 10:04:10PM -0700, William Lee Irwin III wrote:
> On Mon, May 21, 2007 at 06:39:51PM -0700, William Lee Irwin III wrote:
> >> address (virtual and physical are trivially inter-convertible), mock
> >> up something akin to what filesystems do for anonymous pages, etc.
> >> The real objection everyone's going to have is that driver writers
> >> will stain their shorts when faced with the rules for handling such
> >> things. The thing is, I'm not entirely sure who these driver writers
> >> that would have such trouble are, since the driver writers I know
> >> personally are sophisticates rather than walking disaster areas as such
> >> would imply. I suppose they may not be representative of the whole.
>
> On Tue, May 22, 2007 at 03:57:03AM +0200, Nick Piggin wrote:
> > That's not the objection I would have. I would say that firstly, I
> > don't think the mem_map overhead is very significant (at any rate,
> > allocated-on-demand metadata is not going to be any smaller if
> > you fill up on pagecache...). Secondly, I think there is merit to
> > having the same page metadata used by the major subsystems, because
> > it helps for locality of reference.
>
> The size isn't the advantage being cited; I'd actually expect the net
> result to be larger. It's the control over the layout of the metadata
> for cache locality and even things like having enough flags, folding
> buffer_head -like affairs into the per-page metadata for filesystems
> and so reaping cache locality benefits even there (assuming it works
> out in other respects), and so on.
>
> Passing pages between subsystems doesn't seem very significant to me.
> There isn't going to be much locality of reference, or even any
> guarantee that the subsystem gets fed a cache hot page structure. The
> subsystem being passed the page will have its own cache hot accounting
> structures to stick the information about the memory into.

Well consider the page allocator and pagecache. The page allocator
uses page metadata rather than eg. a bitmap, and it uses page list
heads for the per-cpu allocator.

If we were to instead perhaps use external bitmaps and arrays to
keep track of pages, then the pagecache would have to go and allocate
its own structures rather than reuse the cache hot page allocator
structures.

Buffer heads might be something that would work well, but we'd still
like to be able to deallocate them without freeing the whole pagecache
(because they tend to be associated with less frequent operations like
IO). But anyway, I don't know. I'm sure there would be cases where it
works better.


> On Tue, May 22, 2007 at 03:57:03AM +0200, Nick Piggin wrote:
> > But I haven't explored the idea enough myself to know whether there
> > would be any really killer benefits to this. Delayed metadata freeing
> > via RCU without holding up the freeing of the actual page would have
> > been something, however I can do similar with speculative references
> > now (or whenever the code gets merged), which doesn't even require the
> > RCU overhead.
>
> I'm not entirely sure what you're on about there, but it sounds
> interesting.

Heh :) Well the lockless pagecache would become basically trivial if we
could RCU-free pagecache pages, however doing that is really awful for
a number of reasons. However if you had a system where the metadata is
decoupled, you could simply RCU-free the 'struct page' (while still
immediately freeing the page itself) which would make lockless pagecache
(and potentially similar things) equally trivial.

I assumed K42 might have been into that angle.

2007-05-22 09:44:29

by Geert Uytterhoeven

Subject: Re: [rfc] increase struct page size?!

On Mon, 21 May 2007, Christoph Lameter wrote:
> On Tue, 22 May 2007, KAMEZAWA Hiroyuki wrote:
> > For i386(32bit arch), there is not enough space for vmemmap.
>
> I thought 32 bit would use flatmem? Is memory really sparse on 32
> bit? Likely difficult due to lack of address space?

Throwing in more crazy comments: many m68k boxes have really sparse memory, due
to lack of memory and large address space.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2007-05-22 10:59:04

by William Lee Irwin III

Subject: Re: [rfc] increase struct page size?!

On Mon, May 21, 2007 at 10:04:10PM -0700, William Lee Irwin III wrote:
>> The size isn't the advantage being cited; I'd actually expect the net
>> result to be larger. It's the control over the layout of the metadata
>> for cache locality and even things like having enough flags, folding
>> buffer_head -like affairs into the per-page metadata for filesystems
>> and so reaping cache locality benefits even there (assuming it works
>> out in other respects), and so on.
>> Passing pages between subsystems doesn't seem very significant to me.
>> There isn't going to be much locality of reference, or even any
>> guarantee that the subsystem gets fed a cache hot page structure. The
>> subsystem being passed the page will have its own cache hot accounting
>> structures to stick the information about the memory into.

On Tue, May 22, 2007 at 08:24:53AM +0200, Nick Piggin wrote:
> Well consider the page allocator and pagecache. The page allocator
> uses page metadata rather than eg. a bitmap, and it uses page list
> heads for the per-cpu allocator.
> If we were to instead perhaps use external bitmaps and arrays to
> keep track of pages, then the pagecache would have to go and allocate
> its own structures rather than reuse the cache hot page allocator
> structures.
> Buffer heads might be something that would work well, but we'd still
> like to be able to deallocate them without freeing the whole pagecache
> (because they tend to be associated with less frequent operations like
> IO). But anyway, I don't know. I'm sure there would be cases where it
> works better.

The page allocator maintains a number of bitmaps, but anyway. Each
subsystem will basically have its own cache-hot structures. Instead
of passing around metadata that's hot, each tries to keep its own
"working set" of metadata hot. Basically yes, it will work better in
some situations and the current metadata passing will work better in
others. I'd expect the control over the layout to be more advantageous
more often, especially since it arranges cache contiguity while pages
are in use.


On Mon, May 21, 2007 at 10:04:10PM -0700, William Lee Irwin III wrote:
>> I'm not entirely sure what you're on about there, but it sounds
>> interesting.

On Tue, May 22, 2007 at 08:24:53AM +0200, Nick Piggin wrote:
> Heh :) Well the lockless pagecache would become basically trivial if we
> could RCU-free pagecache pages, however doing that is really awful for
> a number of reasons. However if you had a system where the metadata is
> decoupled, you could simply RCU-free the 'struct page' (while still
> immediately freeing the page itself) which would make lockless pagecache
> (and potentially similar things) equally trivial.
> I assumed K42 might have been into that angle.

That does sound convenient. I'll add that to the list of benefits.


-- wli