2003-11-12 15:22:42

by Erik Jacobson

Subject: available memory imbalance on large NUMA systems

Summary:
--------
We wish to implement node round-robin memory allocation for certain large
kernel hash tables allocated during kernel startup on NUMA systems.
We are interested in getting a community-accepted solution into the
2.6 kernel.

Background:
-----------
NUMA systems are made of multiple nodes connected together by a fast
interconnect to make one large system. Each node has its own set of
processors and memory. There is a notion of memory that is close to a
node (perhaps memory on the node itself) and memory further away (perhaps
located on a different node separated by a router).

When the kernel starts up, certain hash tables are allocated. The routines
that allocate these hashes are not NUMA-aware: they see one large pool of
memory and carve a chunk out of it, often sizing that chunk from the total
memory available on the system.

The end result is that the first node in the system is hit harder in terms
of memory usage than other nodes. On a very large system (32 or more nodes
with 4GB of memory per node, for example), the first node can be left with
less than half of its total memory available.

This imbalance is not desirable for folks wishing to run large computational
jobs that depend on memory being available on all nodes. For example,
certain large MPI programs may be negatively impacted if they expect to be
able to get equal amounts of memory from all nodes.

Example Fix:
------------
To fix this problem, we propose implementing a round-robin memory allocation
scheme. We have included an example implementation as a patch to 2.4.21
(attached). In it, we create a new function in vmalloc.h named
alloc_big_struct. It is based on vmalloc (so the resulting memory does go
through the page tables). This function can be used in place of
__get_free_pages() to allocate certain kernel hashes, such as the page cache
hash table or the dentry hash table.

Now, I understand that this patch would not be accepted by the community as
it stands right now. So think of the patch as an example to illustrate my
point rather than a polished proposal. In fact, this patch may not apply
cleanly to kernel.org 2.4 as-is. The example makes heavy use of vmalloc for
NUMA systems, and I understand (from yesterday :) that this isn't necessarily
desirable. I think the example patch still illustrates what we're trying
to do.

I guess I'm hoping to be pointed in a direction that will have a fair chance
of being accepted into the 2.6 kernel if proposed. Depending on what
direction this takes, I or someone else will attempt to implement something.

Here is a detailed list of the changes in the 2.4 example and what each
does; a rough sketch of each change follows its description.

mm.h: Add a new action modifier, GFP_ROUND_ROBIN. This modifier is used by
alloc_area_pte in vmalloc.c. If the bit is set, round-robin allocation
is used. Add a function called alloc_pages_round_robin and a macro
alloc_page_round_robin that calls it. These are meant to mirror alloc_page
and alloc_pages.
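
A minimal sketch of what the mm.h additions might look like; the flag
value, the unlocked node counter, and the exact prototypes are my
assumptions rather than the patch's actual code:

#define GFP_ROUND_ROBIN 0x800           /* assumed unused gfp bit */

struct page *alloc_pages_round_robin(unsigned int gfp_mask,
                                     unsigned int order);

#define alloc_page_round_robin(gfp_mask) \
        alloc_pages_round_robin((gfp_mask), 0)

/* a possible out-of-line definition; the callers here run at boot
 * time, so the bare counter is shown without locking */
struct page *alloc_pages_round_robin(unsigned int gfp_mask,
                                     unsigned int order)
{
        static int next_node;
        int nid = next_node;

        /* numnodes: 2.4's count of NUMA nodes */
        next_node = (next_node + 1) % numnodes;
        /* strip our private bit before handing the mask down */
        return alloc_pages_node(nid, gfp_mask & ~GFP_ROUND_ROBIN, order);
}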

vmalloc.h: Add function named alloc_big_struct. This function takes the
place of __get_free_pages when we wish to do round-robin allocation.
It takes an order number as input but converts it to a byte count, as
__vmalloc requires. When it calls __vmalloc, it ORs GFP_ROUND_ROBIN into
the gfp mask so alloc_area_pte knows to do round-robin allocation of
memory. If the system isn't NUMA, a macro named alloc_big_struct simply
calls __get_free_pages.
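
A sketch of the vmalloc.h side under the same assumptions; the
prototype is inferred from the description above:

#ifdef CONFIG_NUMA
static inline void *alloc_big_struct(unsigned int gfp_mask,
                                     unsigned int order)
{
        /* convert the page order to a byte count, as __vmalloc expects,
         * and tag the mask so alloc_area_pte spreads pages over nodes */
        return __vmalloc(PAGE_SIZE << order,
                         gfp_mask | GFP_ROUND_ROBIN, PAGE_KERNEL);
}
#else
/* non-NUMA: fall straight through to the old behaviour */
#define alloc_big_struct(gfp_mask, order) \
        ((void *)__get_free_pages((gfp_mask), (order)))
#endif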

vmalloc.c: alloc_area_pte is adjusted to look for GFP_ROUND_ROBIN. If it's
set, alloc_page_round_robin is called. Otherwise, alloc_page is called
like before.
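
The changed hunk inside alloc_area_pte might look roughly like this
(the surrounding loop over the pte range is unchanged and not shown):

        struct page *page;

        if (gfp_mask & GFP_ROUND_ROBIN)
                page = alloc_page_round_robin(gfp_mask);
        else
                page = alloc_page(gfp_mask);
        if (!page)
                return -ENOMEM;
        /* set_pte() on the new page follows, exactly as before */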

numa.c: page_cache_alloc is modified (inside an ifdef CONFIG_NUMA) to
use the new alloc_page_round_robin support instead of alloc_pages_node.
This avoids code duplication.
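
Roughly, under the same assumptions (the signature follows 2.4's
page_cache_alloc; where the patch places it is per the description
above):

#ifdef CONFIG_NUMA
struct page *page_cache_alloc(struct address_space *x)
{
        /* was an open-coded alloc_pages_node() path; now one shared
         * round-robin helper */
        return alloc_page_round_robin(x->gfp_mask);
}
#endif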

tcp.c, buffer.c, inode.c, dcache.c:
When allocating large hash tables, the __get_free_pages call is replaced
with alloc_big_struct to spread the memory use across nodes.
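
For a call site such as the dentry hash setup, the change would look
something like this (variable names follow 2.4's dcache.c, but the
exact context is illustrative):

        /* before: every page comes from the boot node's zonelists */
        dentry_hashtable = (struct list_head *)
                __get_free_pages(GFP_ATOMIC, order);

        /* after: pages behind the vmalloc mapping spread across nodes */
        dentry_hashtable = (struct list_head *)
                alloc_big_struct(GFP_ATOMIC, order);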


Testing
-------
On Altix systems of various sizes, we ran aim7 and compared results. We
found almost no difference in performance between the round-robin-enabled
kernels and kernels without this fix implemented.


Big Hash Tables
---------------
As a side point, some of the hash tables allocated during startup get very
large on large-memory systems (systems with a terabyte of memory, for
example).
Someone may wish to consider implementing a cap on the size of some of these
tables. My example doesn't address this issue - it just spreads the load.
In fact, I don't have an idea as to what reasonable caps would be on these
tables, if any.
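
For illustration only, a cap could be as simple as clamping the
allocation order before a table is sized; the 8MB figure here is an
arbitrary placeholder, not a recommendation:

#define HASH_TABLE_MAX_ORDER get_order(8UL << 20)  /* assumed 8MB cap */

        if (order > HASH_TABLE_MAX_ORDER)
                order = HASH_TABLE_MAX_ORDER;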


The example patch is attached.

--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota


Attachments:
roundrobin.patch (8.00 kB)

2003-11-12 21:09:08

by Andrew Morton

Subject: Re: available memory imbalance on large NUMA systems

Erik Jacobson <[email protected]> wrote:
>
> As a side point, some of the hash tables allocated during startup get very
> large on large-memory systems (systems with a terabyte of memory, for
> example).
> Someone may wish to consider implementing a cap on the size of some of these
> tables.

The patch seems a reasonable way of implementing it, but I think your above
comment lies at the heart of the issue: those tables are just too darn big.

Both the pagecache hash table and the buffer_head hash tables were removed
from 2.6 (but I suspect the structures which replaced them are all still
crammed into the zeroth node?). That leaves the dentry, inode and TCP
hash tables. These need stern examination and benchmarking to decide
whether we really are appropriately sizing them on large machines.

If we can get away with just making these sanely sized then the remaining
issue is the node-round-robining of pagecache allocations. I don't have an
opinion on the desirability of this for NUMA machines in general.