Shrink hashes on small systems
Tweak vfs_caches_init logic so that hashes don't start growing as
quickly on small systems.
tiny-mpm/init/main.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletion(-)
diff -puN init/main.c~hash-sizes init/main.c
--- tiny/init/main.c~hash-sizes 2004-03-20 12:14:43.000000000 -0600
+++ tiny-mpm/init/main.c 2004-03-20 12:14:43.000000000 -0600
@@ -470,7 +470,9 @@ asmlinkage void __init start_kernel(void
buffer_init();
unnamed_dev_init();
security_scaffolding_startup();
- vfs_caches_init(num_physpages);
+ /* Treat machines smaller than 6M as having 2M of memory
+ for hash-sizing purposes */
+ vfs_caches_init(max(500, (int)num_physpages-1000));
radix_tree_init();
signals_init();
/* rootfs populating might need page-writeback */
_
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
Matt Mackall <[email protected]> wrote:
>
> Shrink hashes on small systems
>
> Tweak vfs_caches_init logic so that hashes don't start growing as
> quickly on small systems.
>
> - vfs_caches_init(num_physpages);
> + /* Treat machines smaller than 6M as having 2M of memory
> + for hash-sizing purposes */
> + vfs_caches_init(max(500, (int)num_physpages-1000));
This seems rather arbitrary. It also implicitly "knows" that
PAGE_SIZE=4096.
num_physpages is of course the wrong thing to use here - on small systems
we should be accounting for memory which is pinned by kernel text, etc.
But you're going further than that. What's the theory here?
On Mon, Apr 05, 2004 at 02:02:23PM -0700, Andrew Morton wrote:
> Matt Mackall <[email protected]> wrote:
> >
> > Shrink hashes on small systems
> >
> > Tweak vfs_caches_init logic so that hashes don't start growing as
> > quickly on small systems.
> >
> > - vfs_caches_init(num_physpages);
> > + /* Treat machines smaller than 6M as having 2M of memory
> > + for hash-sizing purposes */
> > + vfs_caches_init(max(500, (int)num_physpages-1000));
>
> This seems rather arbitrary. It also implicitly "knows" that
> PAGE_SIZE=4096.
Yep. I can reword it in terms of pages, if that helps. Boxes with 8k
pages tend to have larger instruction words and data structures by
virtue of being RISC/64bit/etc., so I think 1000 pages is a reasonable
number in either case.
> num_physpages is of course the wrong thing to use here - on small systems
> we should be accounting for memory which is pinned by kernel text, etc.
>
> But you're going further than that. What's the theory here?
Basically, the num_physpages approach doesn't take into account the
size of the kernel/critical userspace at all. So we assume that
anything less than 4M is already tight and that we're not yet in a
position to trade space for performance, so let's just pull that off the top.
Longer term, I think some serious thought needs to go into scaling
hash sizes across the board, but this serves my purposes on the
low-end without changing behaviour asymptotically.
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
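For concreteness, a standalone sketch (not part of the patch, assuming PAGE_SIZE
is 4KB so that 1000 pages is 4MB and the 500-page floor is 2MB) of what the
proposed clamp works out to for a few hypothetical machine sizes:

#include <stdio.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
	int mb[] = { 2, 4, 6, 8, 16, 64 };	/* hypothetical machine sizes in MB */
	int i;

	for (i = 0; i < (int)(sizeof(mb) / sizeof(mb[0])); i++) {
		int num_physpages = mb[i] * 256;	/* 256 4KB pages per MB */
		int effective = MAX(500, num_physpages - 1000);

		printf("%3dMB box: %5d pages, hashes sized as if %5d pages (~%dKB)\n",
		       mb[i], num_physpages, effective, effective * 4);
	}
	return 0;
}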
Matt Mackall <[email protected]> wrote:
>
> Longer term, I think some serious thought needs to go into scaling
> hash sizes across the board, but this serves my purposes on the
> low-end without changing behaviour asymptotically.
Can we make longer-term a bit shorter? init_per_zone_pages_min() only took
a few minutes thinking..
I suspect what we need is to replace num_physpages with nr_free_pages(),
then account for PAGE_SIZE, then muck around with sqrt() for a while then
apply upper and lower bounds to it.
That hard-wired breakpoint seems just too arbitrary to me - some sort of
graduated thing which has a little thought and explanation behind it would
be preferred please.
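As a rough illustration of that kind of graduated sizing (a sketch only; the
scale factor and the lower/upper bounds below are made up for illustration,
not taken from any patch):

#include <stdio.h>

/* naive integer square root; fine for an illustration */
static unsigned long int_sqrt(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* hash table size in bytes for a given amount of free memory */
static unsigned long hash_bytes(unsigned long free_pages, unsigned long page_size)
{
	unsigned long free_kb = free_pages * (page_size >> 10);
	unsigned long bytes = int_sqrt(free_kb) << 4;	/* arbitrary scale factor */

	if (bytes < 4096)		/* lower bound: one 4KB page */
		bytes = 4096;
	if (bytes > 128 * 1024)		/* upper bound: 128KB */
		bytes = 128 * 1024;
	return bytes;
}

int main(void)
{
	unsigned long mb;

	for (mb = 4; mb <= 4096; mb *= 4)
		printf("%4luMB free -> %6lu byte table\n", mb, hash_bytes(mb * 256, 4096));
	return 0;
}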
On Mon, Apr 05, 2004 at 02:38:24PM -0700, Andrew Morton wrote:
> Matt Mackall <[email protected]> wrote:
> >
> > Longer term, I think some serious thought needs to go into scaling
> > hash sizes across the board, but this serves my purposes on the
> > low-end without changing behaviour asymptotically.
>
> Can we make longer-term a bit shorter? init_per_zone_pages_min() only took
> a few minutes thinking..
>
> I suspect what we need is to replace num_physpages with nr_free_pages(),
> then account for PAGE_SIZE, then muck around with sqrt() for a while then
> apply upper and lower bounds to it.
Ok, I can take a look at that. I believe Arjan's been looking at the
upper end of the equation.
On the small end, I think we need to take into account the fact
that:
- a bunch of stuff gets effectively pinned post-init
- there are fewer pages total so paging granularity is bad
- the desired size/performance ratio is heavily skewed towards size
..so I think the head of this curve is squeezing most of these hashes
down to order 0 or 1 for the first 16MB of RAM or so.
On the large end, we obviously have diminishing returns for larger
hashes and lots of dirty cachelines for hash misses. We almost
certainly want sublinear growth here, but sqrt might be too
aggressive. Arjan?
For 2.7, I've been thinking about pulling out a generic lookup API,
which would allow replacing hashing with something like rbtree, etc.,
depending on space and performance criteria.
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
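To make the idea of a generic lookup API concrete, one purely hypothetical
shape it could take (none of these names exist in the kernel; a hash backend
and an rbtree backend would both implement the same ops):

/* Hypothetical pluggable-lookup interface -- an illustration only.
 * Callers see opaque tables and ops; the backend can be a big hash
 * table on large machines or something more compact (rbtree, sorted
 * array) on small ones. */
struct lookup_ops {
	void *(*create)(unsigned long size_hint);
	void  (*destroy)(void *table);
	int   (*insert)(void *table, unsigned long key, void *item);
	void *(*find)(void *table, unsigned long key);
	void  (*remove)(void *table, unsigned long key);
};

/* picked once at boot based on memory size */
extern const struct lookup_ops hash_lookup_ops;		/* hypothetical */
extern const struct lookup_ops rbtree_lookup_ops;	/* hypothetical */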
On Mon, Apr 05, 2004 at 05:59:11PM -0500, Matt Mackall wrote:
> On the large end, we obviously have diminishing returns for larger
> hashes and lots of dirty cachelines for hash misses. We almost
> certainly want sublinear growth here, but sqrt might be too
> aggressive.
Hand wavy.
Memory size is not necessarily predictive of optimal hash size;
certain embedded workloads may want huge TCP hashes but
render farms may only need a few dozen dentries. Why not
just start small and rehash when chains get too long (or too short)?
That gives better cache behavior _and_ memory usage at the
expense of increased latency during rehash. Maybe that's OK?
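A minimal userspace illustration of that grow-and-rehash idea (not from any
patch in this thread; shrinking and error handling omitted): start with a tiny
power-of-two table and rehash into one twice the size once the average chain
length passes a threshold.

#include <stdio.h>
#include <stdlib.h>

struct node {
	unsigned long key;
	struct node *next;
};

struct table {
	struct node **buckets;
	unsigned long nbuckets;		/* always a power of two */
	unsigned long nitems;
};

static unsigned long hash(unsigned long key, unsigned long nbuckets)
{
	return (key * 2654435761UL) & (nbuckets - 1);	/* multiplicative hash */
}

static void rehash(struct table *t)
{
	unsigned long newsize = t->nbuckets * 2, i;
	struct node **nb = calloc(newsize, sizeof(*nb));

	for (i = 0; i < t->nbuckets; i++) {
		struct node *n = t->buckets[i], *next;

		while (n) {			/* move each node to its new chain */
			unsigned long b = hash(n->key, newsize);

			next = n->next;
			n->next = nb[b];
			nb[b] = n;
			n = next;
		}
	}
	free(t->buckets);
	t->buckets = nb;
	t->nbuckets = newsize;
}

static void insert(struct table *t, unsigned long key)
{
	unsigned long b = hash(key, t->nbuckets);
	struct node *n = malloc(sizeof(*n));

	n->key = key;
	n->next = t->buckets[b];
	t->buckets[b] = n;

	if (++t->nitems > t->nbuckets * 2)	/* average chain length > 2: grow */
		rehash(t);
}

int main(void)
{
	struct table t = { calloc(16, sizeof(struct node *)), 16, 0 };
	unsigned long i;

	for (i = 0; i < 100000; i++)
		insert(&t, i * 2654435761UL + 17);	/* arbitrary keys */
	printf("%lu items ended up in %lu buckets\n", t.nitems, t.nbuckets);
	return 0;
}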
> For 2.7, I've been thinking about pulling out a generic lookup API,
> which would allow replacing hashing with something like rbtree, etc.,
> depending on space and performance criteria.
rbtrees have different performance characteristics than a hash, and
hashing seems pretty optimal in the places it's currently used.
But, I'd love to be wrong if it means a faster kernel.
-Bryan
On Mon, Apr 05, 2004 at 02:38:24PM -0700, Andrew Morton wrote:
> Matt Mackall <[email protected]> wrote:
> >
> > Longer term, I think some serious thought needs to go into scaling
> > hash sizes across the board, but this serves my purposes on the
> > low-end without changing behaviour asymptotically.
>
> Can we make longer-term a bit shorter? init_per_zone_pages_min() only took
> a few minutes thinking..
>
> I suspect what we need is to replace num_physpages with nr_free_pages(),
> then account for PAGE_SIZE, then muck around with sqrt() for a while then
> apply upper and lower bounds to it.
>
> That hard-wired breakpoint seems just too arbitrary to me - some sort of
> graduated thing which has a little thought and explanation behind it would
> be preferred please.
Hi,
Arjan told me his changes don't allow orders higher than 5
(thus capping the hash table size at 128K), to avoid possible cache thrashing.
I've done some tests with dbench during the day with different dentry hash table
sizes; here are the results on a 2-way P4 2GB box (default of 1MB hashtable).
I ran three consecutive dbench runs (with 32 clients) after each reboot (one
line per boot), and then six consecutive runs on the last boot.
The numbers are the MB/s figures reported by dbench 2.1.
128K dentry hashtable:
160 145 155
152 147 148
170 132 132 156 154 127
512K dentry hashtable:
156 135 144
153 146 159
157 135 148 149 148 143
1MB dentry hashtable:
167 127 139
160 144 139
144 137 162 153 132 121
Not much of a noticeable difference between them. I was told
SPECWeb benefits from big hash tables. Any other recommended tests?
I have no access to SPECWeb.
Testing the different hash sizes is not trivial because there are
so many different workloads...
Another thing is that we allow the cache to grow too big: 1MB for 2GB,
4MB for 32GB, 8MB for 64GB (on 32-bit; twice as much on 64-bit).
What about the following patch, which calculates the sizes of the VFS
caches based on more correct variables?
It might be good to shrink the caches by half (passing "4" instead of "3" to
vfs_cache_size() does it). We get back pinned lowmem and don't seem
to lose performance. Help with testing is very much appreciated.
PS: Another improvement which might be interesting is non-power-of-2
allocations. That would make the increase in cache size "smoother" as memory
size increases.
AFAICS we don't do that because our allocator is limited.
--- linux-2.6.4.orig/fs/inode.c 2004-03-10 23:55:51.000000000 -0300
+++ linux-2.6.4/fs/inode.c 2004-04-07 01:36:05.000000000 -0300
@@ -1337,6 +1337,8 @@
}
__setup("ihash_entries=", set_ihash_entries);
+int vfs_cache_size(int);
+
/*
* Initialize the waitqueues and inode hash table.
*/
@@ -1351,11 +1353,8 @@
init_waitqueue_head(&i_wait_queue_heads[i].wqh);
if (!ihash_entries)
- ihash_entries = PAGE_SHIFT < 14 ?
- mempages >> (14 - PAGE_SHIFT) :
- mempages << (PAGE_SHIFT - 14);
+ ihash_entries = vfs_cache_size(4);
- ihash_entries *= sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < ihash_entries; order++)
;
--- linux-2.6.4.orig/fs/dcache.c 2004-03-10 23:55:22.000000000 -0300
+++ linux-2.6.4/fs/dcache.c 2004-04-07 01:38:15.000000000 -0300
@@ -1541,6 +1541,38 @@
}
__setup("dhash_entries=", set_dhash_entries);
+unsigned int nr_free_pages(void);
+
+/*
+ * This function is used to calculate the size (in bytes) of the
+ * VFS hash tables.
+ *
+ * Increasing "shift" causes the size to be shrunk harder -- which we
+ * do for small-memory boxes.
+ *
+ */
+
+int __init vfs_cache_size(int shift)
+{
+ unsigned long free_kbytes;
+ unsigned long cache_size;
+
+ free_kbytes = nr_free_pages() * (PAGE_SIZE >> 10);
+
+ if (free_kbytes < (8 * 1024))
+ shift += 3;
+ else if (free_kbytes < (32 * 1024))
+ shift += 2;
+ else if (free_kbytes < (512 * 1024))
+ shift += 1;
+
+ cache_size = int_sqrt(free_kbytes) >> shift;
+ cache_size *= sizeof(struct hlist_head);
+
+ return (cache_size << 10);
+}
+
+
static void __init dcache_init(unsigned long mempages)
{
struct hlist_head *d;
@@ -1567,11 +1599,8 @@
set_shrinker(DEFAULT_SEEKS, shrink_dcache_memory);
if (!dhash_entries)
- dhash_entries = PAGE_SHIFT < 13 ?
- mempages >> (13 - PAGE_SHIFT) :
- mempages << (PAGE_SHIFT - 13);
+ dhash_entries = vfs_cache_size(3);
- dhash_entries *= sizeof(struct hlist_head);
for (order = 0; ((1UL << order) << PAGE_SHIFT) < dhash_entries; order++)
;
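To get a feel for what the formula above produces, a throwaway userspace
sketch that mirrors the vfs_cache_size() arithmetic (assuming 4KB pages, a
4-byte struct hlist_head, and treating all of memory as free, which overstates
things a bit):

#include <math.h>
#include <stdio.h>

/* mirrors vfs_cache_size() from the patch above, in userspace */
static unsigned long cache_bytes(int shift, unsigned long free_kbytes)
{
	unsigned long cache_size;

	if (free_kbytes < (8 * 1024))
		shift += 3;
	else if (free_kbytes < (32 * 1024))
		shift += 2;
	else if (free_kbytes < (512 * 1024))
		shift += 1;

	cache_size = (unsigned long)sqrt(free_kbytes) >> shift;
	cache_size *= 4;		/* sizeof(struct hlist_head) on 32-bit */

	return cache_size << 10;	/* bytes, as returned by the patch */
}

int main(void)
{
	unsigned long mb;

	for (mb = 16; mb <= 4096; mb *= 4)
		printf("%4luMB: dcache ~%7lu bytes, icache ~%7lu bytes\n",
		       mb, cache_bytes(3, mb * 1024), cache_bytes(4, mb * 1024));
	return 0;
}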
On Wed, Apr 07, 2004 at 02:21:03AM -0300, Marcelo Tosatti wrote:
> [...]
> What about the following patch, which calculates the sizes of the VFS
> caches based on more correct variables?
>
> It might be good to shrink the caches by half (passing "4" instead of "3" to
> vfs_cache_size() does it). We get back pinned lowmem and don't seem
> to lose performance. Help with testing is very much appreciated.
I'm working on something similar, core code below.
> PS: Another improvement which might be interesting is non-power-of-2
> allocations. That would make the increase in cache size "smoother"
> as memory size increases. AFAICS we don't do that because our
> allocator is limited.
The other problem is that hash functions frequently take advantage of bit
masking to wrap results, so powers of two are nice.
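That is, with a power-of-two table the wrap is a single AND rather than a
division (a trivial illustration, not from any patch):

/* Wrap a hash value into a table whose size is a power of two:
 * one AND instead of the division that '%' needs for arbitrary sizes. */
static inline unsigned long hash_wrap(unsigned long hash, unsigned long nbuckets)
{
	return hash & (nbuckets - 1);	/* equivalent to hash % nbuckets here */
}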
My hash-sizing code:
/* calc_hash_order - calculate the page order for a hash table
 * @loworder: smallest allowed hash order
 * @lowpages: keep hash size minimal below this number of pages
 * @hiorder: largest order for linear growth
 * @hipages: point at which to switch to logarithmic growth
 * @pages: number of available pages
 *
 * This calculates the page order for hash tables. It uses linear
 * interpolation for memory sizes between lowpages and hipages and
 * then switches to a logarithmic algorithm. For every factor of 10
 * pages beyond hipages, the hash order is increased by 1. The
 * logarithmic piece is used to future-proof the code on large boxes.
 */
int calc_hash_order(int loworder, unsigned long lowpages,
		    int hiorder, unsigned long hipages, unsigned long pages)
{
	unsigned long lowhash = 1UL << loworder, hihash = 1UL << hiorder, hash;
	int order;

	/* below lowpages, keep the hash minimal */
	if (pages <= lowpages)
		return loworder;

	if (pages <= hipages) {
		/* linear interpolation on hash sizes, not hash order */
		hash = lowhash + (pages - lowpages) * (hihash - lowhash) /
			(hipages - lowpages);
		/* round up to the power of two that covers hash */
		order = fls(hash - 1);
	} else {
		/* for every factor of 10 beyond hipages, increase order
		   by one */
		for (order = hiorder; pages > hipages; pages /= 10)
			order++;
	}

	/* clip order range */
	return max(loworder, order);
}
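A hypothetical call, just to show how the knobs are meant to be used (the
parameter values are invented here, not tuned):

/* Hypothetical: dentry hash fixed at order 3 (32KB with 4KB pages) up to
 * 1024 pages (4MB), growing linearly to order 9 (2MB) at 256K pages (1GB),
 * then one extra order per further factor of ten pages. */
int dentry_hash_order = calc_hash_order(3, 1024, 9, 262144, num_physpages);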
--
Matt Mackall : http://www.selenic.com : Linux development and consulting
> Yep. I can reword it in terms of pages, if that helps. Boxes with 8k
> pages tend to have larger instruction words and data structures by
> virtue of being RISC/64bit/etc., so I think 1000 pages is a reasonable
> number in either case.
No. For example, the embedded IBM 4xx CPUs can have several page sizes,
and a size like 16k could be useful for embedded applications. We don't
implement support for that in the ppc32 kernel yet, but we may do so; it
makes a _lot_ of sense for small embedded configs with no swap, as it
reduces page table size and pressure on the TLB software load handlers.
Ben.