2004-06-11 10:44:45

by David Howells

[permalink] [raw]
Subject: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size


Hi Linus, Andrew,

Here's a patch to allocate memory for big system hash tables with the bootmem
allocator rather than with the main page allocator.

I've coalesced the meat of the inode and dentry allocation routines into one
such routine in mm/page_alloc.c, which the respective initialisation
functions now call before mem_init() is called.

This routine gets its approximation of memory size by counting up the
ZONE_NORMAL and ZONE_DMA pages (and ZONE_HIGHMEM if requested) in all the
nodes passed to the main allocator by paging_init() (or wherever the arch does
it). It does not use max_low_pfn as that doesn't seem to be available on all
archs, and it doesn't use num_physpages since that includes highmem pages,
which are not available to the kernel for allocating data structures in - and
so may not be appropriate when calculating hash table size.
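
A rough sketch of that counting, with today's zone/pgdat identifiers standing
in for whatever the patch actually uses (the function name here is made up
for illustration):

#include <linux/init.h>
#include <linux/mmzone.h>

/*
 * Sum the ZONE_DMA and ZONE_NORMAL pages over every node, adding
 * ZONE_HIGHMEM only when the caller asks for it. (ZONE_HIGHMEM only
 * exists on CONFIG_HIGHMEM configurations.)
 */
static unsigned long __init count_hashable_pages(int want_highmem)
{
        struct pglist_data *pgdat;
        unsigned long pages = 0;

        for_each_online_pgdat(pgdat) {
                pages += pgdat->node_zones[ZONE_DMA].present_pages;
                pages += pgdat->node_zones[ZONE_NORMAL].present_pages;
                if (want_highmem)
                        pages += pgdat->node_zones[ZONE_HIGHMEM].present_pages;
        }
        return pages;
}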

On the off chance that the size of each hash bucket is not exactly a power
of two, the routine will only allocate as many pages as are necessary to
hold a power-of-two number of buckets, rather than allocating the smallest
power-of-two sized chunk of memory that would hold the same array of
buckets.

The maximum size of any single hash table is given by MAX_SYS_HASH_TABLE_ORDER,
which is now defined in linux/mmzone.h.
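
To illustrate the sizing rule, a standalone toy with made-up numbers (a
12-byte bucket and a 16-page budget; not code from the patch):

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
        unsigned long bucket_size = 12; /* deliberately not a power of two */
        unsigned long budget = 16 * PAGE_SIZE;
        unsigned long buckets = 1;
        unsigned long bytes, pages;

        /* largest power-of-two bucket count that fits the budget */
        while (buckets * 2 * bucket_size <= budget)
                buckets *= 2;

        /* allocate only the pages those buckets actually need */
        bytes = buckets * bucket_size;
        pages = (bytes + PAGE_SIZE - 1) / PAGE_SIZE;

        printf("%lu buckets need %lu pages, not %lu\n",
               buckets, pages, budget / PAGE_SIZE);
        return 0;
}

This prints "4096 buckets need 12 pages, not 16" - four pages saved over
rounding the allocation itself up to a power of two.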

David


Attachments:
bootmem-hash-267rc3.diff (10.26 kB)

2004-06-11 10:49:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

David Howells <[email protected]> wrote:
>
> Here's a patch to allocate memory for big system hash tables with the bootmem
> allocator rather than with main page allocator.

umm, why?

2004-06-11 11:12:37

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size


> > Here's a patch to allocate memory for big system hash tables with the bootmem
> > allocator rather than with main page allocator.
>
> umm, why?

Three reasons:

(1) So that the size can be bigger than MAX_ORDER. IBM have done some testing
on their big PPC64 systems (64GB of RAM) with linux-2.4 and found that
they get better performance if the sizes of the inode cache hash, dentry
cache hash, buffer head hash and page cache hash are increased beyond
MAX_ORDER (order 11).

Now the main allocator can't allocate anything larger than MAX_ORDER, but
the bootmem allocator can (see the back-of-envelope sketch after this
list).

In 2.6 it appears that only the inode and dentry hashes remain of those
four, but there are other hash tables that could use this service.

(2) Changing MAX_ORDER appears to have a number of effects beyond just
limiting the maximum size that can be allocated in one go.

(3) Should someone want a hash table in which each bucket isn't a power of two
in size, memory will be wasted as the chunk of memory allocated will be a
power of two in size (to hold a power of two number of buckets).

On the other hand, using the bootmem allocator means the allocation will
only take up sufficient pages to hold it, rather than the next power of
two up.

Admittedly, this point doesn't apply to the dentry and inode hashes, but
it might to another hash table that might want to use this service.
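
For scale, here's the back-of-envelope sketch referred to in point (1) (my
arithmetic, not figures from IBM's testing):

#include <stdio.h>

/* The largest buddy allocation is 2^(MAX_ORDER - 1) pages, which with
 * 4K pages caps a table of 8-byte hlist_head buckets at about half a
 * million entries - small for a 64GB machine. */
int main(void)
{
        unsigned long page_size = 4096, max_order = 11;
        unsigned long max_bytes = (1UL << (max_order - 1)) * page_size;

        printf("max allocation %lu MB = %lu buckets of 8 bytes\n",
               max_bytes >> 20, max_bytes / 8);
        return 0;
}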

David

2004-06-11 14:22:08

by David Howells

[permalink] [raw]
Subject: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size [#2]


Hi Linus, Andrew,

Here's an update to my patch. Thanks to Ingo Oeser for noticing that the patch
had a couple of problems in the allocation loop (it would never end if an
allocation failed, and the result of the allocation didn't need casting).

David


Attachments:
bootmem-hash-267rc3-2.diff (10.27 kB)

2004-06-11 21:58:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size [#2]

David Howells <[email protected]> wrote:
>
> Here's an update to my patch.

-static void __init dcache_init(unsigned long mempages)
+static void __init dcache_init_early(void)
 {
-	struct hlist_head *d;
-	unsigned long order;
-	unsigned int nr_hash;
-	int i;
+	struct hlist_head *p;
+	int loop;
+
+	dentry_hashtable =
+		alloc_large_system_hash("Dentry cache",
+					sizeof(struct hlist_head),
+					dhash_entries,
+					13,
+					0,
+					&d_hash_shift,
+					&d_hash_mask);
+
+	p = dentry_hashtable;
+	loop = 1 << d_hash_shift;
+	do {
+		INIT_HLIST_HEAD(p);
+		p++;
+		loop--;
+	} while (loop);
+}

We have an opportunity to make this loop less baroque.

	for (i = 0; i < (1 << d_hash_shift); i++)
		INIT_HLIST_HEAD(&p[i]);




+void __init vfs_caches_init_early(void)

This function was declared

+extern void vfs_caches_init_early(void);

whereas in other places you _have_ made the definition of __init functions
include the "__init". I'm not sure which is best, really, but it seems
better to include the __init in the declaration.

+static inline int log2(unsigned long x) __attribute__((pure));

What's the attribute((pure)) for?

It generates a warning on older gcc - please use __attribute_pure__.

+static inline int log2(unsigned long x)

This is just asking for namespace collisions.

+{
+	int r = 0;
+	for (x >>= 1; x > 0; x >>= 1)
+		r++;
+	return r;
+}

The other four or five implementations of log2() use ffx(~n).

2004-06-11 22:01:47

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

David Howells <[email protected]> wrote:
>
> (2) Changing MAX_ORDER appears to have a number of effects beyond just
> limiting the maximum size that can be allocated in one go.

Several architectures implement CONFIG_FORCE_MAX_ZONEORDER and I haven't
heard of larger MAX_ORDERs causing problems.

Certainly, increasing MAX_ORDER is the simplest solution to the problems
which you identify, so we need to substantiate these "number of effects"
much more than this, please.

2004-06-11 23:04:54

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

>> (2) Changing MAX_ORDER appears to have a number of effects beyond just
>> limiting the maximum size that can be allocated in one go.
>
> Several architectures implement CONFIG_FORCE_MAX_ZONEORDER and I haven't
> heard of larger MAX_ORDERs causing problems.
>
> Certainly, increasing MAX_ORDER is the simplest solution to the problems
> which you identify so we need to substantiate these "number of effects"
> much more than this please.

We've hit a problem with alignment issues where the start of the zone is
aligned to 16MB, for instance, and the max grouping is now 256MB. That
generates a "warning: wrong zone alignment: it will crash" error (or
something similar). Andy sent me a patch this morning to throw away
the lower section, which is much nicer than crashing ... but I'd prefer
not to throw that RAM away unless we have to.

Allocating the big-assed hashes out of bootmem seems much cleaner to me,
at least ...

M.

2004-06-11 23:16:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

"Martin J. Bligh" <[email protected]> wrote:
>
> We've hit a problem with alignment issues where the start of the zone is
> aligned to 16MB, for instance, and the max grouping is now 256MB. That
> generates a "warning: wrong zone alignment: it will crash" error (or
> something similar). Andy sent me a patch this morning to throw away
> the lower section, which is much nicer than crashing ... but I'd prefer
> not to throw that RAM away unless we have to.

Confused. Why do we have that test in there at all? We should just toss
the pages one at a time into the buddy list and let the normal coalescing
work it out. That way we'd end up with a single 16MB "page" followed by N
256MB "pages".

> Allocating the big-assed hashes out of bootmem seems much cleaner to me,
> at least ...

Maybe. That code seems fragile and I have premonitions of unhappy arch
maintainers.

2004-06-11 23:20:11

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

--On Friday, June 11, 2004 16:19:20 -0700 Andrew Morton <[email protected]> wrote:

> "Martin J. Bligh" <[email protected]> wrote:
>>
>> We've hit a problem with alignment issues where the start of the zone is
>> aligned to 16MB, for instance, and the max grouping is now 256MB. That
>> generates a "warning: wrong zone alignment: it will crash" error (or
>> something similar). Andy sent me a patch this morning to throw away
>> the lower section, which is much nicer than crashing ... but I'd prefer
>> not to throw that RAM away unless we have to.
>
> Confused. Why do we have that test in there at all? We should just toss
> the pages one at a time into the buddy list and let the normal coalescing
> work it out. That way we'd end up with a single 16MB "page" followed by N
> 256MB "pages".

That's what I thought ... Andy looked at it more than I did, but I think
he's asleep, unfortunately. IIRC, he said the buddy stuff keys off
zone_start_pfn though. Maybe that's fixable ...

M.

2004-06-11 23:28:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

"Martin J. Bligh" <[email protected]> wrote:
>
> --On Friday, June 11, 2004 16:19:20 -0700 Andrew Morton <[email protected]> wrote:
>
> > "Martin J. Bligh" <[email protected]> wrote:
> >>
> >> We've hit a problem with alignment issues where the start of the zone is
> >> aligned to 16MB, for instance, and the max grouping is now 256MB. That
> >> generates a "warning: wrong zone alignment: it will crash" error (or
> >> something similar). Andy sent me a patch this morning to throw away
> >> the lower section, which is much nicer than crashing ... but I'd prefer
> >> not to throw that RAM away unless we have to.
> >
> > Confused. Why do we have that test in there at all? We should just toss
> > the pages one at a time into the buddy list and let the normal coalescing
> > work it out. That way we'd end up with a single 16MB "page" followed by N
> > 256MB "pages".
>
> That's what I thought ... Andy looked at it more than I did, but I think
> he's asleep, unfortunately. IIRC, he said the buddy stuff keys off
> zone_start_pfn though. Maybe that's fixable ...

Doesn't look that way. It uses

(page - zone->zone_mem_map) >> (1 + order)
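
(For illustration - a sketch, not the 2.6.7 source - buddy-finding XORs a
page's index relative to zone_mem_map, so the pairing is only physically
meaningful if index 0 sits on a suitably aligned page frame:)

#include <stdio.h>

int main(void)
{
        unsigned long order = 2;
        unsigned long idx = 4;                          /* index within the zone */
        unsigned long buddy = idx ^ (1UL << order);     /* = 0 here */
        unsigned long aligned_base = 4096;      /* zone starts on a big boundary */
        unsigned long offset_base = 4098;       /* zone starts 2 pages off */

        printf("aligned: pfn %lu <-> pfn %lu (true physical buddies)\n",
               aligned_base + idx, aligned_base + buddy);
        printf("offset:  pfn %lu <-> pfn %lu (not physical buddies)\n",
               offset_base + idx, offset_base + buddy);
        return 0;
}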

2004-06-12 00:21:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

"Martin J. Bligh" <[email protected]> writes:
>
> Allocating the big-assed hashes out of bootmem seems much cleaner to me,
> at least ...

Machines big enough that such big hashes make sense are probably NUMA.
And on NUMA systems you imho should rather use a node-interleaving vmalloc(),
not a big physical allocation on a specific node for these hashes.
This will avoid memory controller hot spots and avoid the problem completely.
Likely it will perform better too.
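
A sketch of what that could look like, written with today's kernel API names
(vmap(), alloc_pages_node(), num_online_nodes()) as stand-ins for their 2004
equivalents - an illustration of the idea, not code from this thread:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Take one page from each online node in round-robin order, then map
 * them virtually contiguously with vmap(). */
static void *interleaved_vmalloc(unsigned long nr_pages)
{
        struct page **pages;
        unsigned long i;
        void *addr;

        pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return NULL;

        for (i = 0; i < nr_pages; i++) {
                pages[i] = alloc_pages_node(i % num_online_nodes(),
                                            GFP_KERNEL, 0);
                if (!pages[i])
                        goto free_pages;
        }

        addr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
        if (addr) {
                kfree(pages);   /* the PTEs are set; the array can go */
                return addr;
        }

free_pages:
        while (i--)
                __free_page(pages[i]);
        kfree(pages);
        return NULL;
}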

-Andi

2004-06-12 06:00:51

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

>> Allocating the big-assed hashes out of bootmem seems much cleaner to me,
>> at least ...
>
> Machines big enough that such big hashes make sense are probably NUMA.
> And on NUMA systems you imho should rather use a node-interleaving vmalloc(),
> not a big physical allocation on a specific node for these hashes.
> This will avoid memory controller hot spots and avoid the problem completely.
> Likely it will perform better too.

I thought Manfred already fixed all that up, didn't he?

M.

2004-06-12 07:37:07

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

On Fri, 2004-06-11 at 17:21, Andi Kleen wrote:
> "Martin J. Bligh" <[email protected]> writes:
> >
> > Allocating the big-assed hashes out of bootmem seems much cleaner to me,
> > at least ...
>
> Machines big enough that such big hashes make sense are probably NUMA.
> And on NUMA systems you imho should rather use a node-interleaving vmalloc(),
> not a big physical allocation on a specific node for these hashes.
> This will avoid memory controller hot spots and avoid the problem completely.
> Likely it will perform better too.

Since vmalloc() maps the pages with small pagetable entries (unlike most
of the rest of the kernel address space), do you think the interleaving
will outweigh any negative TLB effects?

-- Dave

2004-06-12 12:46:55

by Andy Whitcroft

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

--On 11 June 2004 16:30 -0700 Andrew Morton <[email protected]> wrote:

> Doesn't look that way. It uses
>
> (page - zone->zone_mem_map) >> (1 + order)

Hmmm, yes. Does this not mean that we will violate the guarantee that an
object of size N will be aligned at size N? Presumably that's why the error
says 'it'll crash'. If we are offset by two pages, we will allocate 4-page
objects that are only 2-page aligned. I'll have a closer look and see if we
can just 'round down' the zone_mem_map pointer here to give the correct
alignment. As long as we don't try to free those extra pages into the
allocator in the first place we should be ok, as they are marked allocated
in the bitmaps at the start.
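
Concretely, with made-up numbers:

#include <stdio.h>

/* A block the allocator believes is order-2 (4-page) aligned starts
 * two pages past a four-page boundary, so its real alignment is only
 * two pages. */
int main(void)
{
        unsigned long start_pfn = 4098; /* two pages past a 4-page boundary */
        unsigned long pfn = start_pfn + 0;      /* "aligned" block at index 0 */

        printf("pfn %lu: natural alignment %lu page(s)\n", pfn, pfn & -pfn);
        return 0;
}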

-apw

2004-06-12 13:11:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

> Since vmalloc() maps the pages with small pagetable entries (unlike most
> of the rest of the kernel address space), do you think the interleaving
> will outweigh any negative TLB effects?

I think so, yes (assuming you run the benchmark on all CPUs).

-Andi

2004-06-12 15:01:03

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

>> Since vmalloc() maps the pages with small pagetable entries (unlike most
>> of the rest of the kernel address space), do you think the interleaving
>> will outweigh any negative TLB effects?
>
> I think so, yes (assuming you run the benchmark on all CPUs)

On the other hand, there's no reason we can't hack up a version of vmalloc
to use large pages, and interleave only based on that.

M.

2004-06-13 16:09:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size



On Fri, 11 Jun 2004, Andrew Morton wrote:
>
> Confused. Why do we have that test in there at all? We should just toss
> the pages one at a time into the buddy list and let the normal coalescing
> work it out. That way we'd end up with a single 16MB "page" followed by N
> 256MB "pages".

Doesn't work that way. We use the base of the memory area as the "zero
point", and while the buddy allocator itself shouldn't really care where
that zero point is, anybody who expects physical alignment would be really
upset if they don't get it.

So the base address has to be as aligned as anybody could ever want. And
"anybody" wants quite a bit of alignment. Largepages usually want
alignment guarantees, and on most architectures that means a minimum
_physical_ address alignment of at least a few megabytes.

So the rule really should be: make sure that the buddy system base address
is maximally aligned, and if your memory doesn't actually _start_ at that
alignment point, just add the pages and let the buddy allocator build up
all the bitmaps etc for you.
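
With Martin's numbers, that rule works out like this (my arithmetic, not
kernel code):

#include <stdio.h>

/* Round the buddy zero point down to the 256MB max grouping; the 16MB
 * of "pages" below the real start of memory just stay marked allocated
 * in the bitmaps. */
int main(void)
{
        unsigned long page_size = 4096;
        unsigned long zone_start = 16UL << 20;  /* zone starts at 16MB */
        unsigned long grouping = 256UL << 20;   /* max alignment wanted */
        unsigned long base = zone_start & ~(grouping - 1);

        printf("zero point at %luMB; first real page is index %lu\n",
               base >> 20, (zone_start - base) / page_size);
        return 0;
}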

Linus

2004-06-14 10:48:40

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size [try #3]


Here's version 3 of the patch, incorporating comments from Andrew Morton.

Andrew Morton <[email protected]> wrote:
> We have an opportunity to make this loop less baroque.

That seems reasonable. gcc's optimiser should be able to render them the same.

> whereas in other places you _have_ made the definition of __init functions
> include the "__init". I'm not sure which is best, really, but it seems
> better to include the __init in the declaration.

You're probably right. This should probably apply to vfs_caches_init() and
inode_init() too.

> What's the attribute((pure)) for?

It tells gcc that the function's result only depends on its arguments... it
makes no references at all to external data. It's like __attribute__((const))
but stronger. This means that gcc can generate code that caches the result.

See the gcc manual.

> It generates a warning on older gcc - please use __attribute_pure__.

We still support gccs that old?

> The other four or five implementations of log2() use ffx(~n).

Yes... but ffs() and ffz() take int args, not long args. I suspect that
shouldn't matter (the hash table would have to be 4Gig buckets or more in
size for it to be a problem), but why take the chance when we can avoid it
easily?
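
(The worry in concrete form - a toy for a 64-bit box, not a real
configuration:)

#include <stdio.h>

/* A 2^33 bucket count truncates when squeezed through an int
 * parameter; the cast below yields 0 on the usual LP64 targets. */
int main(void)
{
        unsigned long buckets = 1UL << 33;      /* absurdly large table */

        printf("as long: %lu, as int: %d\n", buckets, (int)buckets);
        return 0;
}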

David


Attachments:
bootmem-hash-267rc3-3.diff (11.21 kB)

2004-06-14 11:05:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size [try #3]

David Howells <[email protected]> wrote:
>
>
> > What's the attribute((pure)) for?
>
> It tells gcc that the function's result only depends on its arguments... it
> makes no references at all to external data. It's like __attribute__((const))
> but stronger. This means that gcc can generate code that caches the result.

Of course, but we could use it in zillions of places and we don't. Does it
actually help?

> > It generates a warning on older gcc - please use __attribute_pure__.
>
> We still support gcc's that old?

yup. And gcc-3.4 complains that log2() conflicts with an inbuilt function,
but you renamed it.

> > The other four or five implementations of log2() use ffx(~n).
>
> Yes... but ffs() and ffz() take int args, not long args. I suspect that
> shouldn't matter (that would require the hash table to be calculated at 4Gig
> buckets in size or greater to be a problem), but why take the chance when we
> can avoid it easily?

No, ffz() takes an unsigned long argument.


It's still not clear to me why we cannot simply increase MAX_ORDER. If
that exposes shortcomings in the buddy allocator, that should be where we
expend effort?

2004-06-14 11:41:11

by David Howells

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size [try #3]

> > > What's the attribute((pure)) for?
> >
> > It tells gcc that the function's result only depends on its arguments... it
> > makes no references at all to external data. It's like __attribute__((const))
> > but stronger. This means that gcc can generate code that caches the result.

Actually... I'm wrong about that. I've got "pure" and "const" the wrong way
around... Perhaps I need to go and read the gcc manual too :-)

> Of course, but we could use it in zillions of places and we don't. Does it
> actually help?

Well... it could. It certainly doesn't hurt. Actually, it's probably
irrelevant on inline functions as the compiler can look inside of those.

> > > The other four or five implementations of log2() use ffx(~n).
> >

I presume this should be ffz() not ffx().

> > Yes... but ffs() and ffz() take int args, not long args. I suspect that
> > shouldn't matter (that would require the hash table to be calculated at 4Gig
> > buckets in size or greater to be a problem), but why take the chance when we
> > can avoid it easily?
>
> No, ffz() takes an unsigned long argument.

So it does. However, ffz(~n) is not equivalent to log2(n). This can be
demonstrated by this little test program:

#include <stdio.h>
#include <stdlib.h>

static __inline__ unsigned long ffz(unsigned long word)
{
	__asm__("bsfl %1,%0"
		:"=r" (word)
		:"r" (~word));
	return word;
}

int main(int argc, char *argv[])
{
	for (argv++; *argv; argv++)
		printf("%lu\n", ffz(~strtoul(*argv, NULL, 0)));

	return 0;
}

dhowells>/tmp/ffz 0x10
4
dhowells>/tmp/ffz 0x8
3
dhowells>/tmp/ffz 0x1f
0

This is not entirely correct; the third answer should be 4 too. The bit scan
occurs in the wrong direction.
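
For contrast, a scan from the top end does give floor(log2(n)). Here's the
same harness using gcc's __builtin_clzl as a stand-in (my choice; a kernel
version would use an fls()-style routine):

#include <stdio.h>
#include <stdlib.h>

/* counts leading zeros, so 0x1f -> 63 - 59 = 4; x must be non-zero */
static inline unsigned long log2_floor(unsigned long x)
{
        return 8 * sizeof(unsigned long) - 1 - __builtin_clzl(x);
}

int main(int argc, char *argv[])
{
        for (argv++; *argv; argv++)
                printf("%lu\n", log2_floor(strtoul(*argv, NULL, 0)));

        return 0;
}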

David

2004-06-15 16:43:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH] Permit inode & dentry hash tables to be allocated > MAX_ORDER size

On Sat, 2004-06-12 at 08:00, Martin J. Bligh wrote:
> >> Since vmalloc() maps the pages with small pagetable entries (unlike most
> >> of the rest of the kernel address space), do you think the interleaving
> >> will outweigh any negative TLB effects?
> >
> > I think so, yes (assuming you run the benchmark on all CPUs)
>
> On the other hand, there's no reason we can't hack up a version of vmalloc
> to use large pages, and interleave only based on that.

Think about ppc64, where the large page size is 16MB. That might hurt
interleaving a bit if the structure is only 32MB. It's better than
*everything* being on node 0, but not by much.

-- Dave