2005-03-14 21:15:05

by Dave Hansen

Subject: [PATCH 0/4] sparsemem intro patches

The following four patches provide the last needed changes before the
introduction of sparsemem. For a more complete description of what this
will do, please see this patch:

http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch

or previous posts on the subject:
http://marc.theaimsgroup.com/?t=110868540700001&r=1&w=2
http://marc.theaimsgroup.com/?l=linux-mm&m=109897373315016&w=2

Three of these are i386-only, but one of them reorganizes the macros
used to manage the space in page->flags, and will affect all platforms.
There are analogous patches to the i386 ones for ppc64, ia64, and
x86_64, but those will be submitted by the normal arch maintainers.

The combination of the four patches has been test-booted on a variety of
i386 hardware, and compiled for ppc64, i386, and x86-64 with about 17
different .configs. It's also been runtime-tested on ia64 configs (with
more patches on top).

-- Dave


2005-03-14 21:54:13

by David Miller

Subject: Re: [PATCH 0/4] sparsemem intro patches

On Mon, 14 Mar 2005 13:14:43 -0800
Dave Hansen wrote:

> Three of these are i386-only, but one of them reorganizes the macros
> used to manage the space in page->flags, and will affect all platforms.
> There are analogous patches to the i386 ones for ppc64, ia64, and
> x86_64, but those will be submitted by the normal arch maintainers.

Sparc64 uses some of the upper page->flags bits to store D-cache
flushing state.

Specifically, PG_arch_1 marks whether the page is scheduled
for delayed D-cache flushing, and bits 24 and up record which CPU the
stores occurred on (and thus which CPU will get the cross-CPU
message to flush its D-cache should the deferred flush actually
occur).

I imagine that since we don't support the domain stuff (yet) on sparc64,
your patches won't break things, but it is something to be aware of.

2005-03-14 22:20:05

by Dave Hansen

Subject: Re: [PATCH 0/4] sparsemem intro patches

On Mon, 2005-03-14 at 13:50 -0800, David S. Miller wrote:
> On Mon, 14 Mar 2005 13:14:43 -0800
> Dave Hansen wrote:
>
> > Three of these are i386-only, but one of them reorganizes the macros
> > used to manage the space in page->flags, and will affect all platforms.
> > There are analogous patches to the i386 ones for ppc64, ia64, and
> > x86_64, but those will be submitted by the normal arch maintainers.
>
> Sparc64 uses some of the upper page->flags bits to store D-cache
> flushing state.
>
> Specifically, PG_arch_1 marks whether the page is scheduled
> for delayed D-cache flushing, and bits 24 and up record which CPU the
> stores occurred on (and thus which CPU will get the cross-CPU
> message to flush its D-cache should the deferred flush actually
> occur).
>
> I imagine that since we don't support the domain stuff (yet) on sparc64,
> your patches won't break things, but it is something to be aware of.

Those bits are used today for page_zone() and page_to_nid(). I assume
that you don't support NUMA, but how do you get around the page_zone()
definition? (a quick grep in asm-sparc64 didn't show anything obvious)

static inline struct zone *page_zone(struct page *page)
{
	return zone_table[page->flags >> NODEZONE_SHIFT];
}

BTW, in theory, the new patch should allow page->flags to be better
managed by a variety of users, including special arch users. An
architecture should be able to relatively easily add the necessary
pieces to reserve them. We could even have an ARCH_RESERVED_BITS macro
or something.

-- Dave

2005-03-14 22:41:47

by David Miller

Subject: Re: [PATCH 0/4] sparsemem intro patches

On Mon, 14 Mar 2005 14:18:31 -0800
Dave Hansen wrote:

> Those bits are used today for page_zone() and page_to_nid(). I assume
> that you don't support NUMA, but how do you get around the page_zone()
> definition? (a quick grep in asm-sparc64 didn't show anything obvious)
>
> static inline struct zone *page_zone(struct page *page)
> {
> 	return zone_table[page->flags >> NODEZONE_SHIFT];
> }

NODEZONE_SHIFT is (64 /* sizeof(page_flags_t)*8 */ -
                   1  /* MAX_NODES_SHIFT */ -
                   2  /* MAX_ZONES_SHIFT */)

Which means the table is indexed by the top 3 bits of page->flags.
Sparc64 only uses a couple bits (specifically, enough to hold
(NR_CPUS - 1)) starting at bit 24, so this should not intersect
the page_zone() usage.
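
A rough sketch of the arithmetic above, not the kernel's actual headers: with a 64-bit flags word, a 1-bit node field and a 2-bit zone field, the shift works out to 61, so the table index comes from the top three bits. The `NZ_`/`nz_` names are hypothetical, and the 6-bit CPU field width (enough for 64 CPUs) is an assumption made only for the example.

```c
#include <stdint.h>

/* Values taken from the message above; the names are hypothetical. */
#define NZ_FLAGS_BITS      64
#define NZ_MAX_NODES_SHIFT 1
#define NZ_MAX_ZONES_SHIFT 2
#define NZ_NODEZONE_SHIFT \
	(NZ_FLAGS_BITS - NZ_MAX_NODES_SHIFT - NZ_MAX_ZONES_SHIFT)

/* Assumed for illustration: CPU field at bit 24, 6 bits wide. */
#define NZ_CPU_SHIFT 24
#define NZ_CPU_BITS  6
#define NZ_CPU_MASK \
	(((UINT64_C(1) << NZ_CPU_BITS) - 1) << NZ_CPU_SHIFT)

/* Index into a zone table: the top three bits of the flags word. */
static inline unsigned int nz_nodezone_index(uint64_t flags)
{
	return (unsigned int)(flags >> NZ_NODEZONE_SHIFT);
}

/* Mask covering the bits the zone-table index is built from. */
static inline uint64_t nz_nodezone_mask(void)
{
	return ~UINT64_C(0) << NZ_NODEZONE_SHIFT;
}
```

Under these assumptions the CPU field occupies bits 24 to 29 and the node/zone index occupies bits 61 to 63, so the two masks share no bits.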

I don't even accidentally modify those bits when setting and clearing
this cpu field.

However, I do notice that I assume NR_CPUS is a power of two. I should
certainly cure that. (Basically, I use ~(NR_CPUS - 1) as a mask).
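
A mask derived from NR_CPUS - 1 is only a contiguous run of one bits when NR_CPUS is a power of two. One possible way to lift that assumption, sketched here with hypothetical `dc_`/`DC_` names and not taken from the sparc64 code, is to size the field by the number of bits needed to hold NR_CPUS - 1.

```c
#include <stdint.h>

#define DC_CPU_SHIFT 24	/* field starts at bit 24, as described above */

/* Number of bits needed to represent every value in 0..nr_cpus-1. */
static inline unsigned int dc_cpu_bits(unsigned int nr_cpus)
{
	unsigned int bits = 0;

	while (bits < 32 && (1u << bits) < nr_cpus)
		bits++;
	return bits;
}

/* Mask for the CPU field, already shifted into place. */
static inline uint64_t dc_cpu_mask(unsigned int nr_cpus)
{
	return ((UINT64_C(1) << dc_cpu_bits(nr_cpus)) - 1) << DC_CPU_SHIFT;
}

/* Store a CPU number without touching any bit outside the field. */
static inline uint64_t dc_set_cpu(uint64_t flags, unsigned int cpu,
				  unsigned int nr_cpus)
{
	uint64_t mask = dc_cpu_mask(nr_cpus);

	return (flags & ~mask) | (((uint64_t)cpu << DC_CPU_SHIFT) & mask);
}

/* Read the CPU number back out of the field. */
static inline unsigned int dc_get_cpu(uint64_t flags, unsigned int nr_cpus)
{
	return (unsigned int)((flags & dc_cpu_mask(nr_cpus)) >> DC_CPU_SHIFT);
}
```

With NR_CPUS = 6, for example, NR_CPUS - 1 is binary 101, so CPU 2 (binary 010) would vanish under the naive mask, whereas the three-bit field keeps it.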

> BTW, in theory, the new patch should allow page->flags to be better
> managed by a variety of users, including special arch users. An
> architecture should be able to relatively easily add the necessary
> pieces to reserve them. We could even have an ARCH_RESERVED_BITS macro
> or something.

That sounds like a great idea. We have several issues like this; perhaps
it's time to create some abstraction accessors via include/asm-*/page-flags.h.
The platform can specify the type and size of page_flags_t, how
to stick node and zone numbers into the field, and whatever else. Furthermore,
we can have an asm-generic/page-flags.h that most folks can just use and
that replicates what occurs right now.

That may be overkill, however.

2005-03-15 02:31:10

by Andrew Morton

Subject: Re: [PATCH 0/4] sparsemem intro patches

Dave Hansen wrote:
>
> The following four patches provide the last needed changes before the
> introduction of sparsemem. For a more complete description of what this
> will do, please see this patch:
>
> http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch

I don't know what to think about this. Can you describe sparsemem a little
further, differentiate it from discontigmem and tell us why we want one?
Is it for memory hotplug? If so, how does it support hotplug?

To which architectures is this useful, and what is the attitude of the
relevant maintenance teams?

Quoting from the above patch:

> Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that
> it can eventually become a complete replacement.
> ...
> This patch introduces CONFIG_FLATMEM. It is used in almost all
> cases where there used to be an #ifndef DISCONTIG, because
> SPARSEMEM and DISCONTIGMEM often have to compile out the same areas
> of code.

Would I be right to worry about increasing complexity, decreased
maintainability and generally increasing mayhem?

If a competent kernel developer who is not familiar with how all this code
hangs together wishes to acquaint himself with it, what steps should he
take?

2005-03-15 03:54:43

by Dave Hansen

Subject: Re: [PATCH 0/4] sparsemem intro patches

On Mon, 2005-03-14 at 18:30 -0800, Andrew Morton wrote:
> Dave Hansen wrote:
> >
> > The following four patches provide the last needed changes before the
> > introduction of sparsemem. For a more complete description of what this
> > will do, please see this patch:
> >
> > http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch
>
> I don't know what to think about this. Can you describe sparsemem a little
> further, differentiate it from discontigmem and tell us why we want one?
>
> Is it for memory hotplug? If so, how does it support hotplug?

Sparsemem is more flexible than discontig, and not tied to any existing
NUMA or MM structures like zones or pgdats. That makes it ideal for
hotplug where those structures are going to be coming and going, sliced
and diced.

Another advantage is that sparse doesn't require each NUMA node's ranges
to be contiguous. It can handle overlapping ranges between nodes with
no problems, where DISCONTIGMEM currently throws away that memory.
DISCONTIGMEM also requires that memory *inside* of a node be contiguous,
and have mem_map for all of it. A once 64GB NUMA node with 63GB of the
memory removed wouldn't have much space left for anything but its
mem_map without sparsemem.
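
To put rough numbers on that example, here is a small sketch of the mem_map cost. The 4 KiB page size and the 32-byte and 64-byte descriptor sizes are assumptions chosen for illustration, not figures from this thread, and `mm_mem_map_bytes` is a hypothetical helper.

```c
#include <stdint.h>

/*
 * Bytes of mem_map needed to describe span_bytes of address space,
 * at one descriptor of struct_page_size bytes per page.
 */
static inline uint64_t mm_mem_map_bytes(uint64_t span_bytes,
					uint64_t page_size,
					uint64_t struct_page_size)
{
	return (span_bytes / page_size) * struct_page_size;
}
```

Under these assumptions, a mem_map covering the whole 64 GiB span takes 512 MiB with a 32-byte descriptor and 1 GiB with a 64-byte one, against only 1 GiB of memory still present. Describing just the 1 GiB that remains would take 8 MiB or 16 MiB.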

> To which architectures is this useful, and what is the attitude of the
> relevant maintenance teams?

We have implementations for NUMAQ, x86 Summit, flat x86, flat x86-64,
flat and NUMA ppc64, and some ia64 configurations. All of those can
either do simulated, virtualized, or actual hardware memory hotplug of
some kind based on the sparsemem implementations.

Not to put words in their mouths, but there hasn't been anything
negative that I can recall in a while from the architecture maintainers.
What was said that was negative was months ago, and resolved. We've
been talking about this to most of them for quite a while now, and I
think they've grown accustomed to the idea. :)

I've cc'd all of the guilty parties. Perhaps they can fill in my vague
statements with actual facts. But, here are the vague statements
anyway:

i386   - Martin Bligh seems happy with it, he helped design it.
x86-64 - Matt Tolentino has approached Andi Kleen with the necessary
         cleanups, and I believe the reaction has been positive. I
         think Andi had some other non-hotplug plans for sparsemem, too.
ppc64  - I can bribe Anton and Paul's employer. Mike Kravetz and Joel
         Schopp have been working on this port, and I believe they've
         kept the maintainers informed and calm.
ia64   - Quote from Jesse Barnes (November 19, 2004):

         > CONFIG_NONLINEAR (SPARSE's old name) should be the *only*
         > memory init code on ia64 when this is done. That means
         > getting rid of both discontig and contig and virtual memmap...

         I believe Jesse's been keeping up with the development as well.


> Quoting from the above patch:
>
> > Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that
> > it can eventually become a complete replacement.
> > ...
> > This patch introduces CONFIG_FLATMEM. It is used in almost all
> > cases where there used to be an #ifndef DISCONTIG, because
> > SPARSEMEM and DISCONTIGMEM often have to compile out the same areas
> > of code.
>
> Would I be right to worry about increasing complexity, decreased
> maintainability and generally increasing mayhem?

You certainly would be. For the time being, this increases the number
of config options and places for us to screw up. However, I am
confident at this point that we're doing the right thing. We had a more
complicated version of sparsemem at first. We stripped it down to the
bare bones, and that's what we would like to submit soon. It has the
capability to replace discontig, and will eventually _reduce_
complexity.

One of my favorite ways to demonstrate why I think it's *simple* is the
architecture ports. The longest added function that I can find in the
ports is 17 lines including whitespace.

139 insertions(+), 36 deletions(-) for ia64:
http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-180-sparsemem-ia64.patch

75 insertions(+), 17 deletions(-) for ppc64:
http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-170-sparsemem-ppc64.patch

x86_64 is broken up a little more, but it's probably smaller than the
ppc64 one.

> If a competent kernel developer who is not familiar with how all this code
> hangs together wishes to acquaint himself with it, what steps should he
> take?

Dan Phillips spelled out the basic concepts of chopping things up into
sections a few years ago:

http://lwn.net/2002/0411/a/discontig.php3

However, we haven't yet implemented the phys_to_virt() translations that
he envisioned. We don't need that unless we need some advanced
hot-remove features, which are many, many months away.

Where should a competent kernel developer look to understand the code
more?

The sparsemem implementation isn't horribly deep. At the implementation
level, it replaces pfn_to_page() and page_to_pfn(). It does that with
an array lookup and some bits from page->flags. I'd check out a few
architectures' current implementations of those functions as well as the
one in the patch referenced at the beginning of the mail:
B-sparse-150-sparsemem.patch .
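
As a toy user-space model of that idea, and not the code from the patch, the sketch below keeps one mem_map pointer per section and stores the section number in the upper bits of the flags word. All `toy_`/`TOY_` names, the 16-page section size and the bit position of the section number are assumptions made for the example.

```c
#include <stddef.h>

#define TOY_SECTION_SHIFT   4	/* 16 pages per section (toy value) */
#define TOY_PAGES_PER_SEC   (1UL << TOY_SECTION_SHIFT)
#define TOY_NR_SECTIONS     8
#define TOY_FLAGS_SEC_SHIFT 24	/* assumed home of the section number */

struct toy_page {
	unsigned long flags;
};

struct toy_mem_section {
	struct toy_page *mem_map;	/* NULL when the section is absent */
};

static struct toy_mem_section toy_sections[TOY_NR_SECTIONS];

/* Attach a mem_map to a section and stamp its number into each page. */
static inline void toy_section_init(unsigned long sec, struct toy_page *map)
{
	unsigned long i;

	for (i = 0; i < TOY_PAGES_PER_SEC; i++)
		map[i].flags = sec << TOY_FLAGS_SEC_SHIFT;
	toy_sections[sec].mem_map = map;
}

/* pfn -> page: array lookup by section, then index within the section. */
static inline struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
	unsigned long sec = pfn >> TOY_SECTION_SHIFT;

	if (sec >= TOY_NR_SECTIONS || !toy_sections[sec].mem_map)
		return NULL;
	return &toy_sections[sec].mem_map[pfn & (TOY_PAGES_PER_SEC - 1)];
}

/* page -> pfn: the section number is recovered from the flags bits. */
static inline unsigned long toy_page_to_pfn(const struct toy_page *page)
{
	unsigned long sec = page->flags >> TOY_FLAGS_SEC_SHIFT;

	return (sec << TOY_SECTION_SHIFT) +
	       (unsigned long)(page - toy_sections[sec].mem_map);
}
```

Because each section carries its own mem_map pointer, sections with no memory behind them cost one NULL entry in the table, not a run of unused page descriptors.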

Next, see how the memory_present() abstraction allows the memory layout
of the system to be either encoded in arch-specific discontig structures
or fed into the arch-independent structures that sparse_init() uses to
set up the mem_section[] array.
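
The two-step flow described above could be modelled roughly as follows. This is a user-space sketch with hypothetical `demo_` names and simplified signatures, using calloc() in place of whatever boot-time allocator the real code uses: arch code reports each populated pfn range, and generic code then sets up a mem_map only for the sections that were reported.

```c
#include <stdlib.h>

#define DEMO_SECTION_SHIFT 4	/* 16 pages per section (toy value) */
#define DEMO_PAGES_PER_SEC (1UL << DEMO_SECTION_SHIFT)
#define DEMO_NR_SECTIONS   8

struct demo_page {
	unsigned long flags;
};

struct demo_mem_section {
	int present;
	struct demo_page *mem_map;
};

static struct demo_mem_section demo_sections[DEMO_NR_SECTIONS];

/* Arch side: mark every section touched by the pfn range [start, end). */
static inline void demo_memory_present(unsigned long start, unsigned long end)
{
	unsigned long sec, last;

	if (start >= end)
		return;
	last = (end - 1) >> DEMO_SECTION_SHIFT;
	for (sec = start >> DEMO_SECTION_SHIFT;
	     sec <= last && sec < DEMO_NR_SECTIONS; sec++)
		demo_sections[sec].present = 1;
}

/*
 * Generic side: allocate a mem_map for each present section.
 * Returns the number of present sections, or -1 if allocation fails.
 */
static inline int demo_sparse_init(void)
{
	int sec, count = 0;

	for (sec = 0; sec < DEMO_NR_SECTIONS; sec++) {
		if (!demo_sections[sec].present)
			continue;
		if (!demo_sections[sec].mem_map) {
			demo_sections[sec].mem_map =
				calloc(DEMO_PAGES_PER_SEC,
				       sizeof(struct demo_page));
			if (!demo_sections[sec].mem_map)
				return -1;
		}
		count++;
	}
	return count;
}
```

The point of the split is that the generic half never needs to know how the architecture discovered its memory layout, only which pfn ranges exist.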

You could also go look at some of the hotplug code, but this email is
getting long enough as it is :)

-- Dave

2005-03-15 14:57:10

by Martin J. Bligh

Subject: Re: [PATCH 0/4] sparsemem intro patches

>> The following four patches provide the last needed changes before the
>> introduction of sparsemem. For a more complete description of what this
>> will do, please see this patch:
>>
>> http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch
>
> I don't know what to think about this. Can you describe sparsemem a little
> further, differentiate it from discontigmem and tell us why we want one?
> Is it for memory hotplug? If so, how does it support hotplug?
>
> To which architectures is this useful, and what is the attitude of the
> relevant maintenance teams?

This isn't just for hotplug by any means. Andy wrote it to get rid of a whole
bunch of different problems, roughly based on some previous work by Dan Phillips
and Dave McCracken (I've added a cc to the actual authors of these patches).
This is the major part of what used to be called CONFIG_NONLINEAR, which we
discussed at last year's kernel summit, and people were pretty enthusiastic
about.

> Quoting from the above patch:
>
>> Sparsemem replaces DISCONTIGMEM when enabled, and it is hoped that
>> it can eventually become a complete replacement.
>> ...
>> This patch introduces CONFIG_FLATMEM. It is used in almost all
>> cases where there used to be an #ifndef DISCONTIG, because
>> SPARSEMEM and DISCONTIGMEM often have to compile out the same areas
>> of code.
>
> Would I be right to worry about increasing complexity, decreased
> maintainability and generally increasing mayhem?

Not really - it cleans up the current mess where discontigmem means, and
is used for, two distinct things: 1. the memory is significantly non-contig
in the physical layout. 2. NUMA support.

It also allows us to support discontiguous memory *within* a NUMA node, which
is important for some systems - we can scrap the added complexity of ia64's
vmemmap stuff, for instance.

Whatever your opinions are on mem hotplug, I think we want CONFIG_SPARSEMEM
to clean up the existing mess of discontig - with or without hotplug. I've
wanted this for a very long time, and was discussing it with Andy at OLS last
year; he came up with a much better, cleaner way to implement it than I had.

It also makes a lot of sense as a foundation for hotplug, which multiple
people seem to want for virtualization stuff.

Anyway, that's what I want it for ;-)

> If a competent kernel developer who is not familiar with how all this code
> hangs together wishes to acquaint himself with it, what steps should he
> take?

Andy, can you explain that further? It may also be worth checking that these
are the correct versions of your patches.

M.

2005-03-17 16:21:58

by Andy Whitcroft

Subject: Re: [PATCH 0/4] sparsemem intro patches

Andrew Morton wrote:
> Dave Hansen wrote:
>
>> The following four patches provide the last needed changes before the
>> introduction of sparsemem. For a more complete description of what this
>> will do, please see this patch:
>>
>> http://www.sr71.net/patches/2.6.11/2.6.11-bk7-mhp1/broken-out/B-sparse-150-sparsemem.patch

> I don't know what to think about this. Can you describe sparsemem a little
> further, differentiate it from discontigmem and tell us why we want one?
> Is it for memory hotplug? If so, how does it support hotplug?

SPARSEMEM was born out of discussions which followed the OLS last year
over the NONLINEAR memory model which was being proposed for hotplug.
We got interested as it appeared that a simple form of NONLINEAR memory
could help us handle some problematic cases with DISCONTIG memory,
particularly the case where we have large intra-node memory holes.

The DISCONTIGMEM memory model appears to have been designed to handle
discontiguous UMA configurations. It was subsequently put into service
to provide node support under NUMA configurations. This dual use seems
to have led to confusing code and compromises on functionality. In its
current form we can only express inter-node memory spaces, making it
very inefficient for NUMA systems with sparse physical inter-node
memory maps, and effectively not supporting some configurations. Also,
although DISCONTIGMEM is a common model between a number of
architectures, there is almost no code overlap.

SPARSEMEM is essentially a replacement for DISCONTIGMEM, providing
support for non-contiguous memory but with the advantage of handling
both inter- and intra-node memory holes. The goal of the implementation
was to design a clean memory model covering the needs of both UMA
and NUMA discontiguous memory layouts whilst providing a basis for
hotplug. This should allow us to consolidate the implementation of the
various "discontiguous" memory models whilst trying to fix their shortcomings.

Hotplug at its most complex puts two requirements on the memory model.
Firstly, it requires the arbitrary replacement of physical memory with
memory which may be at a different address (the breaking of V=P+c) to
cope with the case of memory replacement under unmovable kernel objects.
Secondly, it requires that we cope with memory "all over" the physical map.
SPARSEMEM is geared towards providing the required infrastructure for
the NONLINEAR memory needed in hotplug, the idea being that NONLINEAR would
be layered on top of it and share its implementation.
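
To make "the breaking of V=P+c" concrete, here is a purely illustrative sketch, not part of these patches, of a translation that looks up a per-section physical base where the linear model would subtract one constant. The `nl_`/`NL_` names, the 256 MiB section size and the value chosen for c are all assumptions.

```c
#include <stdint.h>

#define NL_PAGE_OFFSET   UINT64_C(0xC0000000)	/* the constant "c", arbitrary */
#define NL_SECTION_SHIFT 28			/* 256 MiB sections, assumed */
#define NL_SECTION_SIZE  (UINT64_C(1) << NL_SECTION_SHIFT)
#define NL_NR_SECTIONS   16

/* Physical base currently backing each virtual section. */
static uint64_t nl_phys_base[NL_NR_SECTIONS];

/* Start from the linear layout, where V = P + c holds everywhere. */
static inline void nl_init_linear(void)
{
	unsigned int i;

	for (i = 0; i < NL_NR_SECTIONS; i++)
		nl_phys_base[i] = (uint64_t)i << NL_SECTION_SHIFT;
}

/*
 * Translate by table lookup, not by subtracting a single constant.
 * The caller must pass an address inside the mapped range.
 */
static inline uint64_t nl_virt_to_phys(uint64_t vaddr)
{
	uint64_t off = vaddr - NL_PAGE_OFFSET;

	return nl_phys_base[off >> NL_SECTION_SHIFT] +
	       (off & (NL_SECTION_SIZE - 1));
}

/* Re-back one virtual section with memory at a different physical base. */
static inline void nl_replace_section(unsigned int sec, uint64_t new_phys_base)
{
	nl_phys_base[sec] = new_phys_base;
}
```

After nl_replace_section(), the virtual addresses of unmovable objects in that section stay put while the physical memory behind them changes, which is the situation the first requirement describes.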

-apw.

2005-03-28 20:41:17

by Pavel Machek

Subject: Re: [PATCH 0/4] sparsemem intro patches

Hi!

> Three of these are i386-only, but one of them reorganizes the macros
> used to manage the space in page->flags, and will affect all platforms.
> There are analogous patches to the i386 ones for ppc64, ia64, and
> x86_64, but those will be submitted by the normal arch maintainers.
>
> The combination of the four patches has been test-booted on a variety of
> i386 hardware, and compiled for ppc64, i386, and x86-64 with about 17
> different .configs. It's also been runtime-tested on ia64 configs (with
> more patches on top).

Could you try swsusp on i386, too?
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms

2005-03-28 21:23:36

by Dave Hansen

Subject: Re: [PATCH 0/4] sparsemem intro patches

On Sat, 2005-03-19 at 20:33 +0100, Pavel Machek wrote:
> > Three of these are i386-only, but one of them reorganizes the macros
> > used to manage the space in page->flags, and will affect all platforms.
> > There are analogous patches to the i386 ones for ppc64, ia64, and
> > x86_64, but those will be submitted by the normal arch maintainers.
> >
> > The combination of the four patches has been test-booted on a variety of
> > i386 hardware, and compiled for ppc64, i386, and x86-64 with about 17
> > different .configs. It's also been runtime-tested on ia64 configs (with
> > more patches on top).
>
> Could you try swsusp on i386, too?

Runtime, or just compiling?

Have you noticed a real problem?

-- Dave

2005-03-28 22:23:16

by Pavel Machek

Subject: Re: [PATCH 0/4] sparsemem intro patches

Hi!

> > > Three of these are i386-only, but one of them reorganizes the macros
> > > used to manage the space in page->flags, and will affect all platforms.
> > > There are analogous patches to the i386 ones for ppc64, ia64, and
> > > x86_64, but those will be submitted by the normal arch maintainers.
> > >
> > > The combination of the four patches has been test-booted on a variety of
> > > i386 hardware, and compiled for ppc64, i386, and x86-64 with about 17
> > > different .configs. It's also been runtime-tested on ia64 configs (with
> > > more patches on top).
> >
> > Could you try swsusp on i386, too?
>
> Runtime, or just compiling?
>
> Have you noticed a real problem?

I'd prefer runtime, but... No, I did not notice anything, but in the past
we had some "interesting" problems with discontigmem, and this
looks similar.
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!