2004-01-08 23:12:35

by Chen, Kenneth W

Subject: Limit hash table size

The issue of exceedingly large hash tables was discussed on the mailing
list a while back, but it seems to have slipped through the cracks.

What we found is that it's not a problem for x86 (and most other
architectures) because __get_free_pages can't allocate anything beyond
order MAX_ORDER-1 (10), which means those hash tables are at most 4MB
each (assuming a 4K page size). However, on ia64, in order to support
larger hugeTLB page sizes, MAX_ORDER is bumped up to 18, which now means
a 2GB upper limit enforced by the page allocator (assuming a 16K page
size). PPC64 is another example that bumps up MAX_ORDER.
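
[Editorial aside: to make the arithmetic concrete, here is a small
stand-alone sketch of the cap computation. This is not kernel code; the
MAX_ORDER values are the ones quoted above, i.e. 11 on x86 (so the
largest order is 10) and 18 on ia64.]

#include <stdio.h>

int main(void)
{
	/* the page allocator serves at most 2^(MAX_ORDER-1) pages */
	unsigned long x86_cap  = (1UL << (11 - 1)) << 12;  /* 4K pages  */
	unsigned long ia64_cap = (1UL << (18 - 1)) << 14;  /* 16K pages */

	printf("x86 cap:  %4lu MB\n", x86_cap  >> 20);	/* prints 4    */
	printf("ia64 cap: %4lu MB\n", ia64_cap >> 20);	/* prints 2048 */
	return 0;
}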

Last time I checked, the tcp ehash table was taking a whopping (insane!)
2GB on one of our large machines. The dentry and inode hash tables also
take a considerable amount of memory.

This patch simply forces all the hash tables to a maximum order of 10,
which limits them to 16MB each on ia64. People can clean up other parts
of the table size calculation later; at a minimum, this patch doesn't
change any hash sizes already in use on x86.

Andrew, would you please apply?

- Ken Chen
<<hashtable.patch>>
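
[Editorial aside: the attached hashtable.patch is not inlined in the
archive. As a rough sketch only (an assumed shape, not the actual
patch), the cap described above amounts to clamping the order computed
by the existing sizing loops:]

	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
		;
	if (order > 10)		/* cap at 2^10 pages, per the description */
		order = 10;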


Attachments:
hashtable.patch (2.14 kB)

2004-01-09 09:24:50

by Andrew Morton

Subject: Re: Limit hash table size

"Chen, Kenneth W" <[email protected]> wrote:
>
> The issue of exceedingly large hash tables was discussed on the mailing
> list a while back, but it seems to have slipped through the cracks.
>
> What we found is that it's not a problem for x86 (and most other
> architectures) because __get_free_pages can't allocate anything beyond
> order MAX_ORDER-1 (10), which means those hash tables are at most 4MB
> each (assuming a 4K page size). However, on ia64, in order to support
> larger hugeTLB page sizes, MAX_ORDER is bumped up to 18, which now means
> a 2GB upper limit enforced by the page allocator (assuming a 16K page
> size). PPC64 is another example that bumps up MAX_ORDER.
>
> Last time I checked, the tcp ehash table was taking a whopping (insane!)
> 2GB on one of our large machines. The dentry and inode hash tables also
> take a considerable amount of memory.
>
> This patch simply forces all the hash tables to a maximum order of 10,
> which limits them to 16MB each on ia64. People can clean up other parts
> of the table size calculation later; at a minimum, this patch doesn't
> change any hash sizes already in use on x86.

Fair enough; it's better than what we had before and Mr Networking is OK
with it ;) I can't say that I can think of anything smarter. Thanks.

2004-01-09 14:30:09

by Anton Blanchard

Subject: Re: Limit hash table size


Hi,

> Last time I checked, the tcp ehash table was taking a whopping (insane!)
> 2GB on one of our large machines. The dentry and inode hash tables also
> take a considerable amount of memory.
>
> This patch simply forces all the hash tables to a maximum order of 10,
> which limits them to 16MB each on ia64. People can clean up other parts
> of the table size calculation later; at a minimum, this patch doesn't
> change any hash sizes already in use on x86.

By limiting it by order you are crippling ppc64 compared to ia64 :)
(order 10 is only 4MB with ppc64's 4K pages, versus 16MB with ia64's
16K pages). Perhaps we should limit it by size instead.

Have you done any analysis of hash depths on large memory machines? We
had some extremely deep pagecache hash chains in 2.4 on a 64GB machine.
While the radix tree should fix that, who's to say we can't get into a
similar situation with the dcache?

Check out how deep some of the inode hash chains are here:

http://www.ussg.iu.edu/hypermail/linux/kernel/0312.0/0105.html

That was on a 64GB box, from memory. Switch to the current high end,
say 512GB, and those chains could become a real dog's breakfast,
especially if we limit the hash table to 4MB.
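
[Editorial aside: as a back-of-the-envelope illustration of this
concern (every number below is a made-up assumption, not a measurement
from the thread), with a fixed bucket count the average chain depth
grows linearly with the number of cached objects, and hence with
machine size:]

#include <stdio.h>

int main(void)
{
	const unsigned long buckets = 1UL << 20;  /* hypothetical 1M-bucket cap */
	unsigned long mem_gb[] = { 64, 512 };

	for (int i = 0; i < 2; i++) {
		/* assume ~16K cached inodes per GB of RAM (illustrative) */
		unsigned long inodes = mem_gb[i] * 16384;
		printf("%3lu GB: %8lu inodes, avg chain depth %.1f\n",
		       mem_gb[i], inodes, (double)inodes / buckets);
	}
	return 0;
}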

Anton

2004-01-09 19:05:37

by Chen, Kenneth W

Subject: RE: Limit hash table size

Anton Blanchard wrote:

>Have you done any analysis of hash depths on large memory machines? We
>had some extremely deep pagecache hash chains in 2.4 on a 64GB machine.
>While the radix tree should fix that, who's to say we can't get into a
>similar situation with the dcache?

We don't have any data to justify a size change for x86; that was the
main reason we limited the size by page order.


>Check out how deep some of the inode hash chains are here:
>http://www.ussg.iu.edu/hypermail/linux/kernel/0312.0/0105.html

If I read them correctly, most of the distribution is in the first two
buckets, so it doesn't matter whether you have 100 buckets or 1 million
buckets; only the first two are being hammered hard. So are we wasting
memory on the buckets that are not being used?
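
[Editorial aside, for intuition: this is standard balls-in-bins math,
not data from the linked post. With n items hashed uniformly into m
buckets, occupancy is roughly Poisson with mean n/m, so about e^(-n/m)
of the buckets are empty; at a load factor of n/m = 0.25, roughly 78%
of the buckets sit unused.]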

- Ken

2004-01-12 13:39:59

by Anton Blanchard

Subject: Re: Limit hash table size


> We don't have any data to justify a size change for x86; that was the
> main reason we limited the size by page order.

Well, x86 isn't very interesting here; it's all the 64-bit archs that
will end up with TBs of memory in the future.

> If I read them correctly, most of the distribution is in the first two
> buckets, so it doesn't matter whether you have 100 buckets or 1 million
> buckets; only the first two are being hammered hard. So are we wasting
> memory on the buckets that are not being used?

But look at the horrid worst case there. My point is that limiting the
hash without any data is not a good idea. In 2.4 we raised MAX_ORDER on
ppc64 because we spent so much time walking pagecache chains; I'd hate
to see us limit the icache and dcache hashes in 2.6 and end up with a
similar problem.

Why can't we do something like Andrew's recent min_free_kbytes patch
and make the rate of change non-linear? Just slow the increase down as
we get bigger. I agree a 2GB hash table is pretty ludicrous, but a 4MB
one on a 512GB machine (which we sell at the moment) could be, too :)
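
[Editorial aside: one possible shape for such a curve, purely
illustrative and not from any patch in this thread (the knee and the
entries-per-MB ratio are made-up numbers), is linear growth up to a
knee and square-root growth beyond it:]

#include <math.h>
#include <stdio.h>

/* hypothetical sizing curve: linear up to 4GB, sqrt growth after */
static unsigned long hash_entries(unsigned long mem_mb)
{
	const unsigned long knee = 4096;	/* 4GB, assumed */
	const unsigned long per_mb = 64;	/* entries per MB, assumed */

	if (mem_mb <= knee)
		return mem_mb * per_mb;
	return (unsigned long)(knee * per_mb * sqrt((double)mem_mb / knee));
}

int main(void)
{
	unsigned long mb[] = { 1024, 4096, 65536, 524288 }; /* 1G to 512G */

	for (int i = 0; i < 4; i++)
		printf("%6lu MB -> %lu entries\n", mb[i], hash_entries(mb[i]));
	return 0;
}

[Compile with -lm. The point is only that growth continues on big
machines, just much more slowly, instead of being chopped off at a
fixed cap.]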

Anton

2004-01-12 16:50:55

by Manfred Spraul

Subject: Re: Limit hash table size

// $Header$
// Kernel Version:
// VERSION = 2
// PATCHLEVEL = 6
// SUBLEVEL = 0
// EXTRAVERSION = -test11
--- 2.6/fs/inode.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/inode.c	2003-11-29 10:19:21.000000000 +0100
@@ -1327,6 +1327,20 @@
 	wake_up_all(wq);
 }
 
+static __initdata int ihash_entries;
+
+static int __init set_ihash_entries(char *str)
+{
+	get_option(&str, &ihash_entries);
+	if (ihash_entries <= 0) {
+		ihash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("ihash_entries=", set_ihash_entries);
+
 /*
  * Initialize the waitqueues and inode hash table.
  */
@@ -1340,8 +1354,16 @@
 	for (i = 0; i < ARRAY_SIZE(i_wait_queue_heads); i++)
 		init_waitqueue_head(&i_wait_queue_heads[i].wqh);
 
-	mempages >>= (14 - PAGE_SHIFT);
-	mempages *= sizeof(struct hlist_head);
+	if (!ihash_entries) {
+		ihash_entries = mempages >> (14 - PAGE_SHIFT);
+		/* Limit inode hash size. Override for nfs servers
+		 * that handle lots of files.
+		 */
+		if (ihash_entries > 1024*1024)
+			ihash_entries = 1024*1024;
+	}
+
+	mempages = ihash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
 
--- 2.6/fs/dcache.c	2003-11-29 09:46:34.000000000 +0100
+++ build-2.6/fs/dcache.c	2003-11-29 10:53:15.000000000 +0100
@@ -1546,6 +1546,20 @@
 	return ino;
 }
 
+static __initdata int dhash_entries;
+
+static int __init set_dhash_entries(char *str)
+{
+	get_option(&str, &dhash_entries);
+	if (dhash_entries <= 0) {
+		dhash_entries = 0;
+		return 0;
+	}
+	return 1;
+}
+
+__setup("dhash_entries=", set_dhash_entries);
+
 static void __init dcache_init(unsigned long mempages)
 {
 	struct hlist_head *d;
@@ -1571,10 +1585,18 @@
 
 	set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
 
+	if (!dhash_entries) {
 #if PAGE_SHIFT < 13
-	mempages >>= (13 - PAGE_SHIFT);
+		mempages >>= (13 - PAGE_SHIFT);
 #endif
-	mempages *= sizeof(struct hlist_head);
+		dhash_entries = mempages;
+		/* 8 million is enough for general purpose systems.
+		 * For file servers, override with "dhash_entries="
+		 */
+		if (dhash_entries > 8*1024*1024)
+			dhash_entries = 8*1024*1024;
+	}
+	mempages = dhash_entries*sizeof(struct hlist_head);
 	for (order = 0; ((1UL << order) << PAGE_SHIFT) < mempages; order++)
 		;
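
[Editorial aside: with this patch applied, the hash sizes can be set
from the kernel command line at boot. The parameter names come straight
from the __setup() handlers above; the values below are arbitrary
examples for a dedicated file server that wants more than the built-in
1M/8M entry caps. As Ken notes later in the thread, the page
allocator's MAX_ORDER still bounds what can actually be allocated.]

	ihash_entries=4194304 dhash_entries=16777216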


Attachments:
patch-hash-alloc (2.21 kB)

2004-01-14 22:32:46

by Chen, Kenneth W

Subject: RE: Limit hash table size

Anton Blanchard wrote:
> Well, x86 isn't very interesting here; it's all the 64-bit archs
> that will end up with TBs of memory in the future.

To address Anton's concerns on PPC64, we have revised the patch to
enforce the maximum size based on the number of entries instead of the
page order, so differences in page size, pointer size, etc. don't affect
the final calculation. The upper bound is capped at 2M entries. All
numbers on x86 remain the same, as we don't want to disturb already
established and working numbers. See the patch at the end of the email;
it is diffed relative to the 2.6.1-mm3 tree.
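
[Editorial aside: the revised hash2.patch is again only an attachment
in the archive. As a hedged sketch (the 2M figure is from the
description above; the code is an assumed shape, not the actual patch),
an entry-count cap along the lines of Manfred's patch would look like:]

	ihash_entries = mempages >> (14 - PAGE_SHIFT);
	/* cap the entry count directly, so page size and pointer size
	 * differences between architectures cannot inflate the table */
	if (ihash_entries > 2*1024*1024)
		ihash_entries = 2*1024*1024;
	mempages = ihash_entries * sizeof(struct hlist_head);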

> But look at the horrid worst case there. My point is limiting
> the hash without any data is not a good idea. In 2.4 we raised
> MAX_ORDER on ppc64 because we spent so much time walking
> pagecache chains,

I just have to reiterate that when a hash table is made too large, we
start trading cache misses on the head array accesses for misses on the
hash list traversal. Big hashes can hurt you if you don't actually use
the capacity.

> Why cant we do something like Andrews recent min_free_kbytes
> patch and make the rate of change non linear. Just slow the
> increase down as we get bigger. I agree a 2GB hashtable is
> pretty ludicrous, but a 4MB one on a 512GB machine (which
> we sell at the moment) could be too :)

It doesn't need to be over-designed, and generally there is no
one-size-fits-all solution either. Linear scaling has worked fine for
many years and has only just started to tip over on large machines. We
just need to put an upper bound in place before it runs away.

- Ken


Attachments:
hash2.patch (1.73 kB)

2004-01-14 22:29:27

by Chen, Kenneth W

Subject: RE: Limit hash table size

Manfred Spraul wrote:
> What about making the limit configurable with a boot time
> parameter? If someone uses a 512 GB ppc64 as an nfs server,
> he might want a 2 GB inode hash.

I'm sorry, this code won't have any effect beyond the MAX_ORDER defined
for each architecture. It's not possible to get a 2GB hash table on
PPC64, since MAX_ORDER is so far defined as 13 for PPC64, which means an
absolute upper limit of 16MB (2^12 pages of 4K) enforced by the page
allocator.

- Ken

2004-01-18 14:27:14

by Anton Blanchard

Subject: Re: Limit hash table size


Hi,

> To address Anton's concerns on PPC64, we have revised the patch to
> enforce the maximum size based on the number of entries instead of the
> page order, so differences in page size, pointer size, etc. don't
> affect the final calculation. The upper bound is capped at 2M entries.
> All numbers on x86 remain the same, as we don't want to disturb already
> established and working numbers. See the patch at the end of the email;
> it is diffed relative to the 2.6.1-mm3 tree.

That sounds reasonable to me.

Anton