2003-11-25 13:36:08

by Jes Sorensen

Subject: hash table sizes

Hi,

On NUMA systems with way too much memory, the current algorithms for
determining the size of the inode and dentry hash tables end up trying
to allocate tables that are so big they may not fit within the physical
memory of a single node. I.e., on a 256 node system with 512GB of RAM and
16KB pages, it basically ends up eating up all the memory on one node
before completing the boot. The inode and dentry hashes are 256MB
each and the IP routing table hash is 128MB.
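
For reference, a minimal user-space sketch (not part of the original mail) of
how the old sizing arrives at 256MB, assuming 16KB pages (PAGE_SHIFT == 14)
and a 64-bit kernel (sizeof(struct hlist_head) == 8):

#include <stdio.h>

int main(void)
{
        unsigned long mempages = (512UL << 30) >> 14; /* 512GB / 16KB pages */
        unsigned long bytes = mempages * 8;           /* one hlist_head per page */
        int order;

        /* the same sizing loop used by dcache_init()/inode_init() */
        for (order = 0; ((1UL << order) << 14) < bytes; order++)
                ;
        printf("%lu MB, order %d\n", bytes >> 20, order); /* 256 MB, order 14 */
        return 0;
}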

I have tried changing the algorithm as below and it seems to produce
reasonable results and almost identical numbers for the smaller /
mid-sized configs I looked at.

This is not meant to be a final patch; any input/opinion is welcome.

Thanks,
Jes

--- orig/linux-2.6.0-test10/fs/dcache.c Sat Oct 25 11:42:58 2003
+++ linux-2.6.0-test10/fs/dcache.c Tue Nov 25 05:33:04 2003
@@ -1549,9 +1549,8 @@
static void __init dcache_init(unsigned long mempages)
{
struct hlist_head *d;
- unsigned long order;
unsigned int nr_hash;
- int i;
+ int i, order;

/*
* A constructor could be added for stable state like the lists,
@@ -1571,12 +1570,17 @@

set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);

+#if 0
#if PAGE_SHIFT < 13
mempages >>= (13 - PAGE_SHIFT);
#endif
mempages *= sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
;
+#endif
+ mempages >>= (23 - (PAGE_SHIFT - 1));
+ order = max(2, fls(mempages));
+ order = min(12, order);

do {
unsigned long tmp;
@@ -1594,7 +1598,7 @@
__get_free_pages(GFP_ATOMIC, order);
} while (dentry_hashtable == NULL && --order >= 0);

- printk(KERN_INFO "Dentry cache hash table entries: %d (order: %ld, %ld bytes)\n",
+ printk(KERN_INFO "Dentry cache hash table entries: %d (order: %d, %ld bytes)\n",
nr_hash, order, (PAGE_SIZE << order));

if (!dentry_hashtable)
--- orig/linux-2.6.0-test10/fs/inode.c Sat Oct 25 11:44:53 2003
+++ linux-2.6.0-test10/fs/inode.c Tue Nov 25 05:33:27 2003
@@ -1333,17 +1333,21 @@
void __init inode_init(unsigned long mempages)
{
struct hlist_head *head;
- unsigned long order;
unsigned int nr_hash;
- int i;
+ int i, order;

for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
init_waitqueue_head(&i_wait_queue_heads[i].wqh);

+#if 0
mempages >>= (14 - PAGE_SHIFT);
mempages *= sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
;
+#endif
+ mempages >>= (23 - (PAGE_SHIFT - 1));
+ order = max(2, fls(mempages));
+ order = min(12, order);

do {
unsigned long tmp;
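
A worked check (arithmetic not from the original mail) of what the new formula
yields for the 512GB, 16KB-page example above:

        mempages = 512GB >> 14             = 2^25 pages
        mempages >>= (23 - (14 - 1))       -> 2^25 >> 10 = 2^15
        order = min(12, max(2, fls(2^15))) = min(12, max(2, 16)) = 12
        table = PAGE_SIZE << 12 = 16KB << 12 = 64MB   (down from 256MB)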


2003-11-25 13:42:34

by William Lee Irwin III

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> + mempages >>= (23 - (PAGE_SHIFT - 1));
> + order = max(2, fls(mempages));
> + order = min(12, order);

This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to represent?


-- wli

2003-11-25 13:54:12

by Jes Sorensen

Subject: Re: hash table sizes

>>>>> "William" == William Lee Irwin <[email protected]> writes:

William> On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
>> + mempages >>= (23 - (PAGE_SHIFT - 1));
>> + order = max(2, fls(mempages));
>> + order = min(12, order);

William> This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to
William> represent?

Well MB >> 2 or consider it trial and error ;-)

Cheers,
Jes

2003-11-25 16:25:58

by Thomas Schlichter

Subject: Re: hash table sizes

Hi Jes,

I was wondering about your funny formula, too (especially because it was 23 -
(PAGE_SHIFT - 1) and not 24 - PAGE_SHIFT ;-). So I wanted to find out why this:
1: mempages >>= (23 - (PAGE_SHIFT - 1));
2: order = max(2, fls(mempages));
3: order = min(12, order);

should be similar to this original code:
4: mempages >>= (14 - PAGE_SHIFT);
5: mempages *= sizeof(struct hlist_head);
6: for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
7: ;

Well, first I saw that lines 2 and 3 force order to be between 2 and 12. OK,
this is new, and I like it ;-) Then I saw you tried to convert the ugly lines
6 + 7 into an fls() call. Fine! (I like that, too.) Lines 6 + 7 can be converted
to the following (almost equivalent) lines:
6 + 7 -> 8a: order = fls(mempages) - PAGE_SHIFT;
or
6 + 7 -> 8b: order = fls(mempages >> PAGE_SHIFT);

Now the shift is not present in line 2, so I folded it from line 8b into line
4 and got:
4 -> 9: mempages >>= 14;
5: mempages *= sizeof(struct hlist_head);
8b -> 10: order = fls(mempages);

But to make the math a bit more correct, the multiplication from line 5 has to
be done before the shift in line 9. So I got the following simple two lines,
equivalent to lines 4 to 7:
5 + 9 -> 11: mempages = (mempages * sizeof(struct hlist_head)) >> 14;
10: order = fls(mempages);

So, now we can compare lines 1 and 11. OK, let's assume PAGE_SHIFT = 12. So we
get:
1 -> 12: mempages >>= (23 - (12 - 1)); // right shift by 12

If sizeof(struct hlist_head) == 4, we get for line 11:
11 -> 13: mempages = (mempages * 4) >> 14; // right shift by 12

And HURRAY, we've got it...!
But they are only equivalent if (PAGE_SHIFT == 12) && (sizeof(struct
hlist_head) == 4)...

So I propose the attached patch, which should be closer to the original and
keeps order from being greater than 12. (Btw. this only changes anything for
more than 2^24 pages of memory, so I think this is a good idea!)
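
As a minimal user-space sketch (not part of the original mail), the claimed
equivalence under those two assumptions can be checked like this:

#include <assert.h>

int main(void)
{
        unsigned long mempages = 4UL << 20;                 /* arbitrary page count */
        unsigned long line_1 = mempages >> (23 - (12 - 1)); /* PAGE_SHIFT == 12 */
        unsigned long line_11 = (mempages * 4) >> 14;       /* 4-byte hlist_head */

        assert(line_1 == line_11); /* both are a right shift by 12 */
        return 0;
}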

Best regards
Thomas Schlichter

On Tuesday 25 November 2003 14:54, Jes Sorensen wrote:
> >>>>> "William" == William Lee Irwin <[email protected]> writes:
>
> William> On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> >> + mempages >>= (23 - (PAGE_SHIFT - 1));
> >> + order = max(2, fls(mempages));
> >> + order = min(12, order);
>
> William> This is curious; what's 23 - (PAGE_SHIFT - 1) supposed to
> William> represent?
>
> Well MB >> 2 or consider it trial and error ;-)
>
> Cheers,
> Jes



2003-11-25 17:52:20

by Antonio Vargas

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 05:25:07PM +0100, Thomas Schlichter wrote:
> Hi Jes,
>
> I was wondering about your funny formula, too (especially because it was 23 -
> (PAGE_SHIFT - 1) and not 24 - PAGE_SHIFT ;-). So I wanted to find out why this:
> 1: mempages >>= (23 - (PAGE_SHIFT - 1));
> 2: order = max(2, fls(mempages));
> 3: order = min(12, order);
>
> should be similar to this original code:
> 4: mempages >>= (14 - PAGE_SHIFT);
> 5: mempages *= sizeof(struct hlist_head);
> 6: for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
> 7: ;
>
> Well, first I saw that lines 2 and 3 force order to be between 2 and 12. OK,
> this is new, and I like it ;-) Then I saw you tried to convert the ugly lines
> 6 + 7 into an fls() call. Fine! (I like that, too.) Lines 6 + 7 can be converted

is fls(x) sort-of log2(x) via some "find-highest-bit-set"?
I recall discussing something related with Jesse Barnes
last November 5 (search for "[DMESG] cpumask_t in action" in lkml).

[SNIP]

Greets, Antonio Vargas

1. Given a program, it always has at least one bug.
2. Given several lines of code, they can always be shortened to fewer lines.
3. By induction, all programs can be reduced
to a single line that does not work.

2003-11-25 17:54:37

by William Lee Irwin III

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 06:52:15PM +0100, Antonio Vargas wrote:
> is fls(x) sort-of log2(x) via some "find-highest-bit-set"?
> I recall discussing something related with Jesse Barnes
> last November 5 (search for "[DMESG] cpumask_t in action" in lkml).
> [SNIP]
> Greets, Antonio Vargas

fls() computes floor(lg(n)) via "find highest bit", yes. It stands
for "find last set".
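
A user-space model of those semantics (a sketch, not the kernel's actual
implementation, which lives in the per-architecture bitops headers; note that
the returned bit position is 1-based):

/* fls_model(1) == 1, fls_model(8) == 4, fls_model(0) == 0 */
static int fls_model(unsigned int x)
{
        int r = 0;

        while (x) {
                x >>= 1;
                r++;
        }
        return r;
}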


-- wli

2003-11-25 20:50:13

by Jack Steiner

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 08:35:49AM -0500, Jes Sorensen wrote:
> Hi,
>
> On NUMA systems with way too much memory, the current algorithms for
> determining the size of the inode and dentry hash tables end up trying
> to allocate tables that are so big they may not fit within the physical
> memory of a single node. I.e., on a 256 node system with 512GB of RAM and
> 16KB pages, it basically ends up eating up all the memory on one node
> before completing the boot. The inode and dentry hashes are 256MB
> each and the IP routing table hash is 128MB.
>
> I have tried changing the algorithm as below and it seems to produce
> reasonable results and almost identical numbers for the smaller /
> mid-sized configs I looked at.
>
> This is not meant to be a final patch; any input/opinion is welcome.
>
> Thanks,
> Jes
>
...

> + mempages >>= (23 - (PAGE_SHIFT - 1));
> + order = max(2, fls(mempages));
> + order = min(12, order);
>

I don't think you want to constrain the allocation to a specific "order". Otherwise,
when the kernel is built with a 64k pagesize, the size of the caches will increase 4X.
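
The arithmetic behind that 4X, as a quick sketch (numbers not from the mail):
the allocation is PAGE_SIZE << order, so for the capped order of 12,

        16KB << 12 = 64MB, but 64KB << 12 = 256MB

i.e. four times more memory for the same order.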

Some architectures (IA64, for example) don't have severe limitations on usage of vmalloc
space. Would it make sense to use vmalloc on these architectures? Even if the
max size of the structures being allocated is limited to an acceptable size, it still
concentrates the memory on a single node. In general, we should try to
distribute the memory references across as many nodes as possible - at least in theory.
(In practice, I wonder if we could actually measure the difference...)


BTW, I think the network code (tcp_init()) has similar problems.

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2003-11-25 21:07:49

by Andrew Morton

Subject: Re: hash table sizes

Jack Steiner <[email protected]> wrote:
>
> > + mempages >>= (23 - (PAGE_SHIFT - 1));
> > + order = max(2, fls(mempages));
> > + order = min(12, order);
> >
>
> I don't think you want to constrain the allocation to a specific "order". Otherwise,
> when the kernel is built with a 64k pagesize, the size of the caches will increase 4X.
>
> Some architectures (IA64, for example) don't have severe limitations on usage of vmalloc
> space. Would it make sense to use vmalloc on these architectures? Even if the
> max size of the structures being allocated is limited to an acceptable size, it still
> concentrates the memory on a single node. In general, we should try to
> distribute the memory references across as many nodes as possible - at least in theory.

I don't think we want to use vmalloc, unless it really can be shown that
the use of an entire memory zone on some machine _still_ provides a
hashtable which is so small that it unacceptably impacts performance.

And that it can be shown that a mega-hashtable is still an appropriate data
structure...

> (In practice, I wonder if we could actually measure the difference.....)

Well yes. I suspect the inode hashtable could be shrunk; I don't think we
do hash-based inode lookups very much?

Also the inode hashtable perhaps could become per superblock, perhaps with
sizing hints from the fs. Or not.

This is hard. We could change the sizing of these tables so that for their
setup arithmetic they use "the size of the zone from which the table is to
be allocated" rather than "the size of all memory". But we do want to make
the size of these tables dependent upon the number of dentries/inodes/etc
which the system is likely to support. And that does depend upon the
amount of direct-addressable memory.


So hum. As a starting point, what happens if we do:

- vfs_caches_init(num_physpages);
+ vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));

?


It would be very nice to have some confirmation that the size of these
tables is being appropriately chosen, too. Maybe we should shrink 'em 32x
and see who complains...


2003-11-25 21:16:33

by Jesse Barnes

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> the size of these tables dependent upon the number of dentries/inodes/etc
> which the system is likely to support. And that does depend upon the
> amount of direct-addressable memory.
>
>
> So hum. As a starting point, what happens if we do:
>
> - vfs_caches_init(num_physpages);
> + vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
>
> ?

Something like that might be ok, but on our system, all memory is in
ZONE_DMA...

Jesse

2003-11-25 21:20:49

by Anton Blanchard

Subject: Re: hash table sizes


Hi,

> Some architectures (IA64, for example) don't have severe limitations on
> usage of vmalloc space. Would it make sense to use vmalloc on these
> architectures? Even if the max size of the structures being allocated
> is limited to an acceptable size, it still concentrates the memory on
> a single node.

The kernel region on many architectures is mapped with large pages, we
would lose that by going to vmalloc.

Anton

2003-11-25 21:24:38

by Andrew Morton

Subject: Re: hash table sizes

[email protected] (Jesse Barnes) wrote:
>
> On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> > the size of these tables dependent upon the number of dentries/inodes/etc
> > which the system is likely to support. And that does depend upon the
> > amount of direct-addressable memory.
> >
> >
> > So hum. As a starting point, what happens if we do:
> >
> > - vfs_caches_init(num_physpages);
> > + vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
> >
> > ?
>
> Something like that might be ok, but on our system, all memory is in
> ZONE_DMA...
>

Well yes, we'd want

vfs_caches_init(min(num_physpages, some_platform_limit()));

which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
evaluate to the size of one of those zones.

If the machine has zillions of small zones then perhaps this will result in
an undersized hashtable.

2003-11-25 23:12:58

by tip-bot for Jack Steiner

Subject: Re: hash table sizes

On Wed, Nov 26, 2003 at 08:16:11AM +1100, Anton Blanchard wrote:
>
> Hi,
>
> > Some architectures (IA64, for example) don't have severe limitations on
> > usage of vmalloc space. Would it make sense to use vmalloc on these
> > architectures? Even if the max size of the structures being allocated
> > is limited to an acceptable size, it still concentrates the memory on
> > a single node.
>
> The kernel region on many architectures is mapped with large pages, we
> would lose that by going to vmalloc.
>
> Anton


That was a concern to me too. However, on IA64, all page structs are
in the vmalloc region (it isn't allocated by vmalloc but is in the
same region as vmalloc'ed pages. They are mapped with 16k pages instead of the 64MB
pages used for memory allocated by kmalloc).

Before switching to 16K pages for the page structs, we made numerous performance
measurements. As far as we could tell, there was no performance degradation
caused by the smaller pages. It seems to me that if page structs are ok
being mapped by 16k pages, the hash tables would be ok too.



--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2003-11-26 02:14:37

by David Miller

Subject: Re: hash table sizes

On Tue, 25 Nov 2003 13:24:39 -0800
Andrew Morton <[email protected]> wrote:

> Well yes, we'd want
>
> vfs_caches_init(min(num_physpages, some_platform_limit()));
>
> which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
> evaluate to the size of one of those zones.
>
> If the machine has zillions of small zones then perhaps this will result in
> an undersized hashtable.

While a platform may have a hard limit, there is also a generic
"usefulness" limit.

If you make the hash tables too huge, you start trading cache misses
on the hash list traversal for misses on the hash array head accesses
themselves. Big hashes can hurt you also if you don't actually use
them to capacity.

Luckily, now that we've moved the page cache over to the radix tree thing,
there are not many hash tables that strongly depend upon the amount
of RAM in the system. Unfortunately, the dentry and inode cache ones
being discussed here are two of the remaining ones.

I also believe that vmalloc()'ing this thing is the wrong idea.

Dynamic growth of hash tables, while intellectually interesting to
consider as a solution to the "use to capacity" issue mentioned above,
is wholly impractical in my experience for two reasons: 1) the locking
is messy if not next to impossible, and 2) getting the higher order allocs
after the system has fully booted is a game of Russian Roulette.

Therefore, I think the safest "fix" for this problem right now is to
just put a hard fixed cap on the hash table sizes instead of
coming up with all sorts of clever new formulas, at least for
the purposes of 2.6.0. We can explore better ideas later.

Andrew's suggested ideas seem to come closest to this.

2003-11-26 03:40:23

by Rik van Riel

Subject: Re: hash table sizes

On Tue, 25 Nov 2003, Jack Steiner wrote:

> That was a concern to me too. However, on IA64, all page structs are in
> the vmalloc region

Which you'll probably want to fix eventually. At least
PAGE_VALID() doesn't seem to work as advertised currently...

(occasionally leading to "false positives", with PAGE_VALID()
saying that a page exists while it really doesn't)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2003-11-26 04:00:25

by William Lee Irwin III

Subject: Re: hash table sizes

On Tue, 25 Nov 2003, Jack Steiner wrote:
>> That was a concern to me too. However, on IA64, all page structs are in
>> the vmalloc region

On Tue, Nov 25, 2003 at 10:39:10PM -0500, Rik van Riel wrote:
> Which you'll probably want to fix eventually. At least
> PAGE_VALID() doesn't seem to work as advertised currently...
> (occasionally leading to "false positives", with PAGE_VALID()
> saying that a page exists while it really doesn't)

Speaking of which, no one's bothered fixing the X crashes on i386
discontigmem. Untested patch below.


-- wli


diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
--- linux-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-23 17:31:56.000000000 -0800
+++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-25 19:54:31.000000000 -0800
@@ -85,13 +85,19 @@ extern struct pglist_data *node_data[];
})
#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
/*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
+ * pfn_valid must absolutely be correct, regardless of speed concerns.
*/
-#define pfn_valid(pfn) ((pfn) < num_physpages)
+#define pfn_valid(pfn) \
+({ \
+ unsigned long __pfn__ = pfn; \
+ u8 __nid__ = pfn_to_nid(__pfn__); \
+ pg_data_t *__pgdat__; \
+ __pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL; \
+ __pgdat__ && \
+ __pfn__ >= __pgdat__->node_start_pfn && \
+ __pfn__ - __pgdat__->node_start_pfn \
+ < __pgdat__->node_spanned_pages; \
+})

/*
* generic node memory support, the following assumptions apply:

2003-11-26 04:19:56

by Andrew Morton

Subject: Re: hash table sizes

William Lee Irwin III <[email protected]> wrote:
>
> Speaking of which, no one's bothered fixing the X crashes on i386
> discontigmem. Untested patch below.
>
> ...
> -#define pfn_valid(pfn) ((pfn) < num_physpages)
> +#define pfn_valid(pfn) \
> +({ \
> + unsigned long __pfn__ = pfn; \
> + u8 __nid__ = pfn_to_nid(__pfn__); \
> + pg_data_t *__pgdat__; \
> + __pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL; \
> + __pgdat__ && \
> + __pfn__ >= __pgdat__->node_start_pfn && \
> + __pfn__ - __pgdat__->node_start_pfn \
> + < __pgdat__->node_spanned_pages; \
> +})

Boggle.

Does this evaluate to the same thing on non-discontigmem? (surely no)

Can we please arrange for it to?

2003-11-26 04:23:27

by William Lee Irwin III

Subject: Re: hash table sizes

William Lee Irwin III <[email protected]> wrote:
>> -#define pfn_valid(pfn) ((pfn) < num_physpages)
>> +#define pfn_valid(pfn) \
>> +({ \
>> + unsigned long __pfn__ = pfn; \
>> + u8 __nid__ = pfn_to_nid(__pfn__); \
>> + pg_data_t *__pgdat__; \
>> + __pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL; \
>> + __pgdat__ && \
>> + __pfn__ >= __pgdat__->node_start_pfn && \
>> + __pfn__ - __pgdat__->node_start_pfn \
>> + < __pgdat__->node_spanned_pages; \
>> +})

On Tue, Nov 25, 2003 at 08:25:45PM -0800, Andrew Morton wrote:
> Boggle.
> Does this evaulate to the same thing on non-discontigmem? (surely no)
> Can we please arrange for it to?

The non-discontigmem case is unaltered, as it's handled in
include/asm-i386/page.h under a #ifdef. The semantics agree.

-- wli

2003-11-26 05:28:14

by Matt Mackall

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 01:24:39PM -0800, Andrew Morton wrote:
> [email protected] (Jesse Barnes) wrote:
> >
> > On Tue, Nov 25, 2003 at 01:07:41PM -0800, Andrew Morton wrote:
> > > the size of these tables dependent upon the number of dentries/inodes/etc
> > > which the system is likely to support. And that does depend upon the
> > > amount of direct-addressable memory.
> > >
> > >
> > > So hum. As a starting point, what happens if we do:
> > >
> > > - vfs_caches_init(num_physpages);
> > > + vfs_caches_init(min(num_physpages, pages_in_ZONE_NORMAL));
> > >
> > > ?
> >
> > Something like that might be ok, but on our system, all memory is in
> > ZONE_DMA...
> >
>
> Well yes, we'd want
>
> vfs_caches_init(min(num_physpages, some_platform_limit()));
>
> which on ia32 would evaluate to nr_free_buffer_pages() and on ia64 would
> evaluate to the size of one of those zones.

I actually just added this to the tree I'm working on:

+ vfs_caches_init(min(1000, num_physpages-16000));

Caches are too expensive on the low end of the scale as well, when the
kernel is taking up most of RAM.

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2003-11-26 05:53:51

by Zhang, Yanmin

Subject: RE: hash table sizes

TCP/IP has the same problem. See functions tcp_init/ip_rt_init.

My colleague's testing on IA64 shows the same result as Jes's, where the
TCP established hash table allocates about 1GB of memory! Actually, it seems
it wants to allocate 4GB of memory at the beginning!

It's not a good idea to try to find one smart formula to fit all scenarios.
We can add new command line parameters to set up the hash table sizes,
such as dentry_ht_size/inode_ht_size/tcp_estab_ht_size/ip_rt_ht_size. If
users don't pass such parameters, then the kernel can use Jes's formula to
work out the size.
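
For illustration, a hypothetical boot command line using the proposed names
(these parameters do not exist yet, and the values are made up):

        linux ... dentry_ht_size=131072 inode_ht_size=65536 tcp_estab_ht_size=524288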

2003-11-26 07:27:20

by Martin J. Bligh

Subject: Re: hash table sizes

> Speaking of which, no one's bothered fixing the X crashes on i386
> discontigmem. Untested patch below.
>
>
> -- wli
>
>
> diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
> --- linux-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-23 17:31:56.000000000 -0800
> +++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-25 19:54:31.000000000 -0800
> @@ -85,13 +85,19 @@ extern struct pglist_data *node_data[];
> })
> #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
> /*
> - * pfn_valid should be made as fast as possible, and the current definition
> - * is valid for machines that are NUMA, but still contiguous, which is what
> - * is currently supported. A more generalised, but slower definition would
> - * be something like this - mbligh:
> - * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
> + * pfn_valid must absolutely be correct, regardless of speed concerns.
> */
> -#define pfn_valid(pfn) ((pfn) < num_physpages)
> +#define pfn_valid(pfn) \
> +({ \
> + unsigned long __pfn__ = pfn; \
> + u8 __nid__ = pfn_to_nid(__pfn__); \
> + pg_data_t *__pgdat__; \
> + __pgdat__ = __nid__ < MAX_NUMNODES ? NODE_DATA(__nid__) : NULL; \
> + __pgdat__ && \
> + __pfn__ >= __pgdat__->node_start_pfn && \
> + __pfn__ - __pgdat__->node_start_pfn \
> + < __pgdat__->node_spanned_pages; \
> +})
>
> /*
> * generic node memory support, the following assumptions apply:

Would it not be rather more readable as something along the lines of:

static inline int pfn_valid (int pfn) {
        int nid = pfn_to_nid(pfn);
        pg_data_t *pgdat;

        if (nid >= MAX_NUMNODES)
                return 0;       /* node invalid */
        pgdat = NODE_DATA(nid);
        if (!pgdat)
                return 0;       /* pgdat invalid */
        if (pfn < pgdat->node_start_pfn)
                return 0;       /* before start of node */
        if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
                return 0;       /* past end of node */
        return 1;
}

However, I'm curious as to why this crashes X, as I don't see how this
code change makes a difference in practice. I didn't think we had any i386
NUMA with memory holes between nodes at the moment, though perhaps the x440
does.

M.

PS. No, I haven't tested my rephrasing of your patch either.

2003-11-26 09:51:49

by William Lee Irwin III

Subject: Re: hash table sizes

On Tue, Nov 25, 2003 at 09:14:32PM -0800, Martin J. Bligh wrote:
> Would it not be rather more readable as something along the lines of:
> static inline int pfn_valid (int pfn) {
>         int nid = pfn_to_nid(pfn);
>         pg_data_t *pgdat;
>
>         if (nid >= MAX_NUMNODES)
>                 return 0;       /* node invalid */
>         pgdat = NODE_DATA(nid);
>         if (!pgdat)
>                 return 0;       /* pgdat invalid */
>         if (pfn < pgdat->node_start_pfn)
>                 return 0;       /* before start of node */
>         if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
>                 return 0;       /* past end of node */
>         return 1;
> }

This header at least used to be included before the relevant structures
were declared (to all appearances, some macros have been converted to
inlines since last I looked).


On Tue, Nov 25, 2003 at 09:14:32PM -0800, Martin J. Bligh wrote:
> However, I'm curious as to why this crashes X, as I don't see how this
> code change makes a difference in practice. I didn't think we had any i386
> NUMA with memory holes between nodes at the moment, though perhaps the x440
> does.
> M.
> PS. No, I haven't tested my rephrasing of your patch either.

mmap() of framebuffers. It takes the box out, not just X. There are
holes just below 4GB regardless. This has actually been reported by
rml and some others.

False positives on pfn_valid() result in manipulations of purported page
structures beyond the bounds of actual allocated pgdat->node_mem_map[]'s,
potentially either corrupting memory or accessing areas outside memory's
limits (the case causing oopsen).

Insubstantially modified version of the above below (it looks fine, I
was just typing from memory). Compile-tested only.


-- wli


diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
--- linux-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-23 17:31:56.000000000 -0800
+++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-26 01:40:36.000000000 -0800
@@ -84,14 +84,30 @@ extern struct pglist_data *node_data[];
+ __zone->zone_start_pfn; \
})
#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
+
+static inline int pfn_to_nid(unsigned long);
/*
- * pfn_valid should be made as fast as possible, and the current definition
- * is valid for machines that are NUMA, but still contiguous, which is what
- * is currently supported. A more generalised, but slower definition would
- * be something like this - mbligh:
- * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
+ * pfn_valid must absolutely be correct, regardless of speed concerns.
*/
-#define pfn_valid(pfn) ((pfn) < num_physpages)
+static inline int pfn_valid(unsigned long pfn)
+{
+ u8 nid = pfn_to_nid(pfn);
+ pg_data_t *pgdat;
+
+ if (nid < MAX_NUMNODES)
+ pgdat = NODE_DATA(nid);
+ else
+ return 0;
+
+ if (!pgdat)
+ return 0;
+ else if (pfn < pgdat->node_start_pfn)
+ return 0;
+ else if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
+ return 0;
+ else
+ return 1;
+}

/*
* generic node memory support, the following assumptions apply:

2003-11-26 16:18:24

by Martin J. Bligh

Subject: Re: hash table sizes

>> However, I'm curious as to why this crashes X, as I don't see how this
>> code change makes a difference in practice. I didn't think we had any i386
>> NUMA with memory holes between nodes at the moment, though perhaps the x440
>> does.
>> M.
>> PS. No, I haven't tested my rephrasing of your patch either.
>
> mmap() of framebuffers. It takes the box out, not just X. There are
> holes just below 4GB regardless. This has actually been reported by
> rml and some others.
>
> False positives on pfn_valid() result in manipulations of purported page
> structures beyond the bounds of actual allocated pgdat->node_mem_map[]'s,
> potentially either corrupting memory or accessing areas outside memory's
> limits (the case causing oopsen).

OK. But the hole from 3.75 - 4GB you're referring to doesn't seem to
fall under that definition.

1) It still has a valid pfn, though the backing memory itself isn't there.
2) It's covered by node_start_pfn ... node_start_pfn + node_spanned_pages,
which is what your patch tests for. Maybe not if you have <= 4GB
per node? Will be if you have more than that.

I agree that pfn_valid absolutely has to be correct. The current definition
was, I thought, correct unless we have holes *between* nodes. I was under
the impression that no box we had uses that setup, but I guess we might
do - were you seeing this on x440 or NUMA-Q?

NUMA-Q looks like this:

BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 0000000000100000 - 00000000e0000000 (usable)
BIOS-e820: 00000000fec00000 - 00000000fec09000 (reserved)
BIOS-e820: 00000000ffe80000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000400000000 (usable)

which should create mem_map from 0 - 4GB contiguous for node 0, AFAICS
(I believe struct pages are created for reserved areas still). Maybe
the srat stuff for x440 does something else.

Your patch is correct, I just don't see that it'll fix the X problem.

> diff -prauN linux-2.6.0-test10/include/asm-i386/mmzone.h pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h
> --- linux-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-23 17:31:56.000000000 -0800
> +++ pfn_valid-2.6.0-test10/include/asm-i386/mmzone.h 2003-11-26 01:40:36.000000000 -0800
> @@ -84,14 +84,30 @@ extern struct pglist_data *node_data[];
> + __zone->zone_start_pfn; \
> })
> #define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
> +
> +static inline int pfn_to_nid(unsigned long);
> /*
> - * pfn_valid should be made as fast as possible, and the current definition
> - * is valid for machines that are NUMA, but still contiguous, which is what
> - * is currently supported. A more generalised, but slower definition would
> - * be something like this - mbligh:
> - * ( pfn_to_pgdat(pfn) && ((pfn) < node_end_pfn(pfn_to_nid(pfn))) )
> + * pfn_valid must absolutely be correct, regardless of speed concerns.
> */
> -#define pfn_valid(pfn) ((pfn) < num_physpages)
> +static inline int pfn_valid(unsigned long pfn)
> +{
> + u8 nid = pfn_to_nid(pfn);
> + pg_data_t *pgdat;
> +
> + if (nid < MAX_NUMNODES)
> + pgdat = NODE_DATA(nid);
> + else
> + return 0;
> +
> + if (!pgdat)
> + return 0;
> + else if (pfn < pgdat->node_start_pfn)
> + return 0;
> + else if (pfn - pgdat->node_start_pfn >= pgdat->node_spanned_pages)
> + return 0;
> + else
> + return 1;
> +}
>
> /*
> * generic node memory support, the following assumptions apply:


Cool, thanks. I'll try runtesting it.

M.

2003-11-26 22:17:23

by Anton Blanchard

Subject: Re: hash table sizes


> That was a concern to me too. However, on IA64, all page structs are
> in the vmalloc region (it isn't allocated by vmalloc but is in the same
> region as vmalloc'ed pages. They are mapped with 16k pages instead of
> the 64MB pages used for memory allocated by kmalloc).
>
> Before switching to 16K pages for the page structs, we made numerous
> performance measurements. As far as we could tell, there was no
> performance degradation caused by the smaller pages. It seems to me that
> if page structs are ok being mapped by 16k pages, the hash tables
> would be ok too.

OK, on ppc64 with a 4kB base pagesize I'd be more worried about using the
vmalloc region.

Anton

2003-11-28 14:15:47

by Jes Sorensen

Subject: Re: hash table sizes

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> [email protected] (Jesse Barnes) wrote:
>> Something like that might be ok, but on our system, all memory is
>> in ZONE_DMA...

Andrew> Well yes, we'd want

Andrew> vfs_caches_init(min(num_physpages,
Andrew> some_platform_limit()));

Andrew> which on ia32 would evaluate to nr_free_buffer_pages() and on
Andrew> ia64 would evaluate to the size of one of those zones.

What about something like this? I believe node_present_pages should be
the same as num_physpages on a non-NUMA machine. If not we can make it
min(num_physpages, NODE_DATA(0)->node_present_pages).

Of course this might not work perfectly if one has multiple nodes and
node 0 has no or very little memory. It would also be nice if one
could spread the various caches onto various nodes, but we can leave
that for stage 2 ;-)

Cheers,
Jes

--- orig/linux-2.6.0-test10/init/main.c Sun Nov 23 17:31:14 2003
+++ linux-2.6.0-test10/init/main.c Fri Nov 28 07:06:45 2003
@@ -447,7 +447,7 @@
proc_caches_init();
buffer_init();
security_scaffolding_startup();
- vfs_caches_init(num_physpages);
+ vfs_caches_init(NODE_DATA(0)->node_present_pages);
radix_tree_init();
signals_init();
/* rootfs populating might need page-writeback */

2003-11-28 14:54:15

by Jack Steiner

Subject: Re: hash table sizes

On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
> >>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>
> Andrew> [email protected] (Jesse Barnes) wrote:
> >> Something like that might be ok, but on our system, all memory is
> >> in ZONE_DMA...
>
> Andrew> Well yes, we'd want
>
> Andrew> vfs_caches_init(min(num_physpages,
> Andrew> some_platform_limit()));
>
> Andrew> which on ia32 would evaluate to nr_free_buffer_pages() and on
> Andrew> ia64 would evaluate to the size of one of those zones.
>
> What about something like this? I believe node_present_pages should be
> the same as num_physpages on a non-NUMA machine. If not we can make it
> min(num_physpages, NODE_DATA(0)->node_present_pages).


I may be missing something but I don't see how this fixes the original
problem that we started with.

The system has a large number of nodes. Physically, each node has the same
amount of memory. After boot, we observe that several nodes have
substantially less memory than other nodes. Some of the imbalance is
due to the kernel data/text being on node 0, but by far, the major source
of the imbalance is the 3 (in 2.4.x) large hash tables that are being
allocated.

I suspect the size of the hash tables is a lot bigger than is needed.
That is certainly the first problem to be fixed, but unless the
required size is a very small percentage (5-10%) of the amount of memory
on a node (2GB to 32GB per node & 256 nodes), we still have a problem.

We run large MPI applications that place threads onto each node. Each
thread needs the same amount of local memory. The maximum problem size
that can be efficiently solved is limited by the amount of free memory
on the smallest node. We need an allocation scheme that doesn't deplete
a significant amount of memory from any single node.



>
> Of course this might not work perfectly if one has multiple nodes and
> node 0 has no or very little memory. It would also be nice if one
> could spread the various caches onto various nodes, but we can leave
> that for stage 2 ;-)
>
> Cheers,
> Jes
>
> --- orig/linux-2.6.0-test10/init/main.c Sun Nov 23 17:31:14 2003
> +++ linux-2.6.0-test10/init/main.c Fri Nov 28 07:06:45 2003
> @@ -447,7 +447,7 @@
> proc_caches_init();
> buffer_init();
> security_scaffolding_startup();
> - vfs_caches_init(num_physpages);
> + vfs_caches_init(NODE_DATA(0)->node_present_pages);
> radix_tree_init();
> signals_init();
> /* rootfs populating might need page-writeback */

--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2003-11-28 16:23:32

by Jes Sorensen

Subject: Re: hash table sizes

>>>>> "Jack" == Jack Steiner <[email protected]> writes:

Jack> On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
>> What about something like this? I believe node_present_pages
>> should be the same as num_physpages on a non-NUMA machine. If not
>> we can make it min(num_physpages,
>> NODE_DATA(0)->node_present_pages).

Jack> The system has a large number of nodes. Physically, each node
Jack> has the same amount of memory. After boot, we observe that
Jack> several nodes have substantially less memory than other
Jack> nodes. Some of the imbalance is due to the kernel data/text
Jack> being on node 0, but by far, the major source of the
Jack> imbalance is the 3 (in 2.4.x) large hash tables that are being
Jack> allocated.

Jack> I suspect the size of the hash tables is a lot bigger than is
Jack> needed. That is certainly the first problem to be fixed, but
Jack> unless the required size is a very small percentage (5-10%) of
Jack> the amount of memory on a node (2GB to 32GB per node & 256
Jack> nodes), we still have a problem.

Jack,

I agree with you; however, as you point out, there are two problems to
deal with: the excessive size of the hash tables on large systems, and
the imbalance from everything going on node zero. My patch only solves
the first problem, or rather works around it.

The problem of allocating structures across multiple nodes is yet
to be solved.

Cheers,
Jes

2003-11-28 19:37:01

by Jack Steiner

Subject: Re: hash table sizes


On Fri, Nov 28, 2003 at 11:22:47AM -0500, Jes Sorensen wrote:
> >>>>> "Jack" == Jack Steiner <[email protected]> writes:
>
> Jack> On Fri, Nov 28, 2003 at 09:15:02AM -0500, Jes Sorensen wrote:
> >> What about something like this? I believe node_present_pages
> >> should be the same as num_physpages on a non-NUMA machine. If not
> >> we can make it min(num_physpages,
> >> NODE_DATA(0)->node_present_pages).
>
> Jack> The system has a large number of nodes. Physically, each node
> Jack> has the same amount of memory. After boot, we observe that
> Jack> several nodes have substantially less memory than other
> Jack> nodes. Some of the imbalance is due to the kernel data/text
> Jack> being on node 0, but by far, the major source of the
> Jack> imbalance is the 3 (in 2.4.x) large hash tables that are being
> Jack> allocated.
>
> Jack> I suspect the size of the hash tables is a lot bigger than is
> Jack> needed. That is certainly the first problem to be fixed, but
> Jack> unless the required size is a very small percentage (5-10%) of
> Jack> the amount of memory on a node (2GB to 32GB per node & 256
> Jack> nodes), we still have a problem.
>
> Jack,
>
> I agree with you; however, as you point out, there are two problems to
> deal with: the excessive size of the hash tables on large systems, and
> the imbalance from everything going on node zero. My patch only solves
> the first problem, or rather works around it.
>
> The problem of allocating structures across multiple nodes is yet
> to be solved.

Jes

Then I still don't understand your proposal. (I probably missed some piece
of the discussion).

You proposed above to limit the allocation to the amount of memory on a node.
I don't see that that does anything on SN systems - allocation is already limited to
that amount because memory between nodes is discontiguous. We need to limit
the allocation to a small percentage of the memory on a node. I
don't see how we can do that without:

- using vmalloc (on systems that don't have vmalloc issues)
OR
- changing algorithms so that a large hash table is not
needed. Either lots of smaller hash tables or ???. I suspect
there are performance issues with this.
OR
- ????

I suppose I need to wait to see the proposal for allocating memory across nodes....


--
Thanks

Jack Steiner ([email protected]) 651-683-5302
Principal Engineer SGI - Silicon Graphics, Inc.


2003-11-28 21:18:41

by Jörn Engel

Subject: Re: hash table sizes

[pruned CC: list]

On Fri, 28 November 2003 13:35:36 -0600, Jack Steiner wrote:
>
> Then I still don't understand your proposal. (I probably missed some piece
> of the discussion).
>
> You proposed above to limit the allocation to the amount of memory on a node.

Jes didn't _limit_ the allocation to the memory on a node, he _based_
it on it, instead of total memory for all nodes. Therefore a 1024
node NUMA machine with 2GB per node has no bigger hash tables than a
single CPU machine with 2GB total memory, however big that may be.

Unless I didn't understand his patch, that is. :)

Jörn

--
"Error protection by error detection and correction."
-- from a university class

2003-11-29 10:40:29

by Manfred Spraul

Subject: Re: hash table sizes

// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 0
// EXTRAVERSION = -test11
--- 2.6/mm/page_alloc.c 2003-11-29 09:46:35.000000000 +0100
+++ build-2.6/mm/page_alloc.c 2003-11-29 11:34:04.000000000 +0100
@@ -681,6 +681,42 @@

EXPORT_SYMBOL(__alloc_pages);

+#ifdef CONFIG_NUMA
+/* Early boot: Everything is done by one cpu, but the data structures will be
+ * used by all cpus - spread them on all nodes.
+ */
+static __init unsigned long get_boot_pages(unsigned int gfp_mask, unsigned int order)
+{
+static int nodenr;
+ int i = nodenr;
+ struct page *page;
+
+ for (;;) {
+ if (i > nodenr + numnodes)
+ return 0;
+ if (node_present_pages(i%numnodes)) {
+ struct zone **z;
+ /* The node contains memory. Check that there is
+ * memory in the intended zonelist.
+ */
+ z = NODE_DATA(i%numnodes)->node_zonelists[gfp_mask & GFP_ZONEMASK].zones;
+ while (*z) {
+ if ( (*z)->free_pages > (1UL<<order))
+ goto found_node;
+ z++;
+ }
+ }
+ i++;
+ }
+found_node:
+ nodenr = i+1;
+ page = alloc_pages_node(i%numnodes, gfp_mask, order);
+ if (!page)
+ return 0;
+ return (unsigned long) page_address(page);
+}
+#endif
+
/*
* Common helper functions.
*/
@@ -688,6 +724,10 @@
{
struct page * page;

+#ifdef CONFIG_NUMA
+ if (unlikely(!system_running))
+ return get_boot_pages(gfp_mask, order);
+#endif
page = alloc_pages(gfp_mask, order);
if (!page)
return 0;
--- 2.6/fs/inode.c 2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/inode.c 2003-11-29 10:19:21.000000000 +0100
@@ -1327,6 +1327,20 @@
wake_up_all(wq);
}

+static __initdata int ihash_entries;
+
+static int __init set_ihash_entries(char *str)
+{
+ get_option(&str, &ihash_entries);
+ if (ihash_entries <= 0) {
+ ihash_entries = 0;
+ return 0;
+ }
+ return 1;
+}
+
+__setup("ihash_entries=", set_ihash_entries);
+
/*
* Initialize the waitqueues and inode hash table.
*/
@@ -1340,8 +1354,16 @@
for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
init_waitqueue_head(&i_wait_queue_heads[i].wqh);

- mempages >>= (14 - PAGE_SHIFT);
- mempages *= sizeof(struct hlist_head);
+ if (!ihash_entries) {
+ ihash_entries = mempages >> (14 - PAGE_SHIFT);
+ /* Limit inode hash size. Override for nfs servers
+ * that handle lots of files.
+ */
+ if (ihash_entries > 1024*1024)
+ ihash_entries = 1024*1024;
+ }
+
+ mempages = ihash_entries*sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
;

--- 2.6/fs/dcache.c 2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/dcache.c 2003-11-29 10:53:15.000000000 +0100
@@ -1546,6 +1546,20 @@
return ino;
}

+static __initdata int dhash_entries;
+
+static int __init set_dhash_entries(char *str)
+{
+ get_option(&str, &dhash_entries);
+ if (dhash_entries <= 0) {
+ dhash_entries = 0;
+ return 0;
+ }
+ return 1;
+}
+
+__setup("dhash_entries=", set_dhash_entries);
+
static void __init dcache_init(unsigned long mempages)
{
struct hlist_head *d;
@@ -1571,10 +1585,18 @@

set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);

+ if (!dhash_entries) {
#if PAGE_SHIFT < 13
- mempages >>= (13 - PAGE_SHIFT);
+ mempages >>= (13 - PAGE_SHIFT);
#endif
- mempages *= sizeof(struct hlist_head);
+ dhash_entries = mempages;
+ /* 8 million is enough for general purpose systems.
+ * For file servers, override with "dhash_entries="
+ */
+ if (dhash_entries > 8*1024*1024)
+ dhash_entries = 8*1024*1024;
+ }
+ mempages = dhash_entries*sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
;
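
With this patch applied, the built-in caps could be overridden from the boot
command line, e.g. (hypothetical values):

        linux ... dhash_entries=16777216 ihash_entries=2097152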

