2002-09-21 09:55:57

by Erich Focht

Subject: [PATCH 1/2] node affine NUMA scheduler

Hi,

here is an update of the NUMA scheduler for the 2.5.37 kernel. It
contains some bugfixes and the coupling to discontigmem memory
allocation (memory is allocated from the processes' homenode).

The node affine NUMA scheduler is targeted for multi-node platforms
and built on top of the O(1) scheduler. Its main objective is to
keep the memory access latency for each task as low as possible by
scheduling it on or near the node on which its memory is allocated.
This should achieve the hard-affinity benefits automatically.

The patch comes in two parts. The first part is the core NUMA scheduler;
it is functional without the second part and provides the following features:
- Node aware scheduler (implemented CPU pools).
- Scheduler behaves like O(1) scheduler within a node.
- Equal load among nodes is targeted; stealing tasks from remote nodes
is delayed longer if the current node has average load, less if it is
unloaded.
- Multi-level node hierarchies are supported by stealing delays adjusted
by the relative "node-distance" (memory access latency ratio).

The second part of the patch extends the pooling NUMA scheduler to
have node affine tasks:
- Each process has a homenode assigned to it at creation time
(initial load balancing). Memory will be allocated from this node.
- Each process is preferentially scheduled on its homenode and
attracted back to it if scheduled away for some reason. For
multi-level node hierarchies the task is attracted to its
supernode, too.

The patch was tested on IA64 platforms but should work on NUMAQ i386,
too. Similar code for 2.4.18 (cf. http://home.arcor.de/efocht/sched)
has been running in production environments for months.

Comments, tests, ports to other platforms/architectures are very welcome!

Regards,
Erich


Attachments:
01-numa_sched_core-2.5.patch (29.57 kB)

2002-09-21 09:58:15

by Erich Focht

Subject: Re: [PATCH 2/2] node affine NUMA scheduler

Here comes the second part of the node affine NUMA scheduler.

> The patch comes in two parts. The first part is the core NUMA scheduler,
> it is functional without the second part and provides following features:
> - Node aware scheduler (implemented CPU pools).
> - Scheduler behaves like O(1) scheduler within a node.
> - Equal load among nodes is targeted, stealing tasks from remote nodes
> is delayed more if the current node is averagely loaded, less if it's
> unloaded.
> - Multi-level node hierarchies are supported by stealing delays adjusted
> by the relative "node-distance" (memory access latency ratio).
>
> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
> (initial load balancing). Memory will be allocated from this node.
> - Each process is preferentially scheduled on its homenode and
> attracted back to it if scheduled away for some reason. For
> multi-level node hierarchies the task is attracted to its
> supernode, too.

Regards,
Erich


Attachments:
02-numa_sched_affine-2.5.patch (14.85 kB)

2002-09-21 15:52:36

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler


> The second part of the patch extends the pooling NUMA scheduler to
> have node affine tasks:
> - Each process has a homenode assigned to it at creation time
> (initial load balancing). Memory will be allocated from this node.

Hmmm ... I was wondering how you achieved that without modifying
alloc_pages ... until I saw this bit.

#ifdef CONFIG_NUMA
+#ifdef CONFIG_NUMA_SCHED
+#define numa_node_id() (current->node)
+#else
#define numa_node_id() _cpu_to_node(smp_processor_id())
+#endif
#endif /* CONFIG_NUMA */

I'm not convinced it's a good idea to modify this generic function,
which was meant to tell you what node you're running on. I can't
see it being used anywhere else right now, but wouldn't it be better
to just modify alloc_pages instead to use current->node, and leave
this macro as intended? Or make a process_node_id or something?
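
Something like the following is what I have in mind (untested sketch;
process_node_id is just the name I threw out above):

#ifdef CONFIG_NUMA
/* still answers "which node is this CPU on?" */
#define numa_node_id()		_cpu_to_node(smp_processor_id())
#ifdef CONFIG_NUMA_SCHED
/* answers "which node should this task's memory come from?" */
#define process_node_id()	(current->node)
#else
#define process_node_id()	numa_node_id()
#endif
#endif /* CONFIG_NUMA */

Then alloc_pages (or the discontigmem wrapper) would use
process_node_id() and everything else keeps the original meaning of
numa_node_id().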

Anyway, I'm giving your code a quick spin ... will give you some
results later ;-)

M.


2002-09-21 16:29:31

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> Anyway, I'm giving your code a quick spin ... will give you some
> results later ;-)

Hmmm .... well I ran the One True Benchmark (tm). The patch
*increased* my kernel compile time from about 20s to about 28s.
Not sure I like that idea ;-) Anything you'd like tweaked, or
more info? Both user and system time were up ... I'll grab a
profile of kernel stuff.

M.

2002-09-21 16:43:17

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

>> Anyway, I'm giving your code a quick spin ... will give you some
>> results later ;-)
>
> Hmmm .... well I ran the One True Benchmark (tm). The patch
> *increased* my kernel compile time from about 20s to about 28s.
> Not sure I like that idea ;-) Anything you'd like tweaked, or
> more info? Both user and system time were up ... I'll grab a
> profile of kernel stuff.

From the below, I'd suggest you're getting pages off the wrong
nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
allocator. Are you sure the current->node thing is getting set
correctly? I'll try backing out your alloc_pages tweaking, and
see what happens.

An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

133639 total 0.1390
74447 default_idle
6887 do_anonymous_page
4456 page_remove_rmap
4082 handle_mm_fault
3733 .text.lock.namei
2512 page_add_rmap
2187 __generic_copy_from_user
1989 rmqueue
1964 .text.lock.dec_and_lock
1761 vm_enough_memory
1631 file_read_actor
1599 zap_pte_range
1507 __free_pages_ok
1323 find_get_page
1117 do_no_page
1097 get_empty_filp
1023 link_path_walk

2.5.37-mm1

256745 total 0.2584
82934 default_idle
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap
2904 do_no_page
2674 .text.lock.namei
2566 __alloc_pages
2526 zap_pte_range
2306 __generic_copy_from_user
2218 file_read_actor
1803 vm_enough_memory
1789 do_wp_page
1557 .text.lock.dec_and_lock
1414 find_get_page
1251 do_softirq
1123 release_pages
1086 link_path_walk
1072 get_empty_filp
1038 schedule

2002-09-21 17:08:43

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

OK, well removing that part of the patch gets us back from 28s to
about 21s (compared to 20s virgin); total system time compared to
virgin is up from 59s to 62s, user from 191s to 195s. So it's still
a net loss, but not by nearly as much. Are you determining the target
node on fork or exec? I forget ...

Profile is more comparable. Nothing sticks out any more, but maybe
it just needs some tuning of the balance intervals or something.

153385 total 0.1544
91219 default_idle
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory
1612 __free_pages_ok
1504 file_read_actor
1484 find_get_page
1381 __generic_copy_from_user
1207 do_no_page
1066 schedule
1050 get_empty_filp
1034 link_path_walk

2002-09-21 17:28:55

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

Hi Martin,

thanks for the comments and the testing!

On Saturday 21 September 2002 19:11, Martin J. Bligh wrote:
> > From the below, I'd suggest you're getting pages off the wrong
> > nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> > allocator. Are you sure the current->node thing is getting set
> > correctly? I'll try backing out your alloc_pages tweaking, and
> > see what happens.

The current->node is most probably wrong for most of the kernel threads,
except for migration_thread and ksoftirqd. But it should be fine for
user processes.

It might also be that the default __node_distance matrix is not
optimal for NUMAQ. It is fine for our remote/local latency ratio of
1.6; yours is maybe an order of magnitude larger? Try replacing
15 -> 50, I guess you don't go beyond 4 nodes right now...


> OK, well removing that part of the patch gets us back from 28s to
> about 21s (compared to 20s virgin), total user time compared to
> virgin is up from 59s to 62s, user from 191 to 195. So it's still
> a net loss, but not by nearly as much. Are you determining target
> node on fork or exec ? I forget ...

The default is exec(). You can use
http://home.arcor.de/efocht/sched/nodpol.c
to set the node_policy to do initial load_balancing in fork().
Just do "nodpol -P 2" in the shell before starting the job/task.
This is VERY recommended if you are creating many tasks/threads.
The default behavior is fine for MPI jobs or users starting serial
processes.

> Profile is more comparible. Nothing sticks out any more, but maybe
> it just needs some tuning for balance intervals or something.

Hmmm... There are two changes which might lead to lower performance:
1. load_balance() is not inlined any more.
2. pull_task() steals only one task per load_balance() call. It used
to be at most imbalance/2 tasks (if I remember correctly).

And of course, there is some real additional overhead due to the
initial load balancing, which one feels for short-lived tasks... So
please try "nodpol -P 2" (and reset to the default with "nodpol -P 0").

Did you try the first patch alone? I mean the pooling-only scheduler?

Thanks,
Erich

> 153385 total 0.1544
> 91219 default_idle
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
> 1612 __free_pages_ok
> 1504 file_read_actor
> 1484 find_get_page
> 1381 __generic_copy_from_user
> 1207 do_no_page
> 1066 schedule
> 1050 get_empty_filp
> 1034 link_path_walk

--
Dr. Erich Focht <[email protected]>
NEC European Supercomputer Systems, European HPC Technology Center
Hessbruehlstr. 21B, 70565 Stuttgart, Germany
phone: +49-711-78055-15 fax : +49-711-78055-25

2002-09-21 17:39:46

by William Lee Irwin III

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sat, Sep 21, 2002 at 07:32:59PM +0200, Erich Focht wrote:
> Might also be that the __node_distance matrix which you might use
> by default is not optimal for NUMAQ. It is fine for our remote/local
> latency ratio of 1.6. Yours is maybe an order of magnitude larger?
> Try replacing: 15 -> 50, guess you don't go beyond 4 nodes now...

I'm running with 8 over the weekend, and by and large we go to 16,
though we rarely put all our eggs in one basket.

I'll take it for a spin.


Cheers,
Bill

2002-09-21 23:19:57

by William Lee Irwin III

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

Some 8-quad numbers for 2.5.37 (virgin) follow.

I'll get dcache_rcu and NUMA sched stuff in on the act for round 2.
This is more fs transaction-based (and VM process-spawning overhead)
and not pure I/O throughput so there won't be things like "but
everybody's running at peak I/O bandwidth" obscuring the issues.

real 0m30.854s

c01053ec 16963617 89.009 poll_idle
c0114a48 452526 2.37443 load_balance
c013962c 253666 1.331 get_page_state
c01466de 177354 0.930586 .text.lock.file_table
c0114ec0 150583 0.790117 scheduler_tick
c01547b3 116144 0.609414 .text.lock.namei
c01422ac 94042 0.493444 page_remove_rmap
c0138c24 83293 0.437043 rmqueue
c0141e48 51963 0.272653 page_add_rmap
c012d5ec 37086 0.194592 do_anonymous_page
c01391d0 34454 0.180782 __alloc_pages
c0130e08 32111 0.168488 find_get_page
c0139534 29222 0.153329 nr_free_pages
c012df4c 26734 0.140275 handle_mm_fault
c0146070 25881 0.135799 get_empty_filp
c01a0d2c 25250 0.132488 __generic_copy_from_user
c0111728 22018 0.11553 smp_apic_timer_interrupt
c0112b90 20777 0.109018 pfn_to_nid
c0136238 18410 0.0965983 kmem_cache_free
c0113270 17494 0.091792 do_page_fault
c0138900 17282 0.0906796 __free_pages_ok
c01a0f90 17160 0.0900395 atomic_dec_and_lock
c01a100b 16937 0.0888694 .text.lock.dec_and_lock

2002-09-22 08:11:39

by William Lee Irwin III

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
>> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)

On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> Some 8-quad numbers for 2.5.37 (virgin) follow.

Okay, 2.5.37 virgin with overcommit_memory set to 1 this time.
(compiles with -j256 seem to do better than -j32 or -j48, here is -j256):

... will follow up with 2.5.38-mm1 with and without NUMA sched, at
least if the arrival rate of releases doesn't exceed the benchtime.

c01053ec 1605553 95.6785 poll_idle
c0114a48 7611 0.453556 load_balance
c0114ec0 5303 0.316018 scheduler_tick
c01422ac 5017 0.298974 page_remove_rmap
c01466de 4026 0.239918 .text.lock.file_table
c012d5ec 3290 0.196058 do_anonymous_page
c0141e48 3211 0.191351 page_add_rmap
c012df4c 2920 0.174009 handle_mm_fault
c010d6e8 2844 0.16948 timer_interrupt
c01547b3 2080 0.123952 .text.lock.namei
c0146070 1934 0.115251 get_empty_filp
c01a0d2c 1591 0.0948112 __generic_copy_from_user
c0111728 1477 0.0880177 smp_apic_timer_interrupt
c014633c 1437 0.085634 __fput
c01a0ce0 1346 0.0802111 __generic_copy_to_user
c0138c24 1249 0.0744307 rmqueue
c012e450 1169 0.0696633 vm_enough_memory
c012b694 1056 0.0629294 zap_pte_range
c0130e08 997 0.0594135 find_get_page
c014427c 950 0.0566126 dentry_open
c01391d0 949 0.056553 __alloc_pages
c0151424 892 0.0531563 link_path_walk
c01a0f90 834 0.0496999 atomic_dec_and_lock

2002-09-22 08:26:36

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

Bill,

would you please check the boot messages for the NUMA scheduler before
doing the run. Martin sent me an example where he has:

CPU pools : 1
pool 0 :0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node level 0 : 10
pool_delay matrix:
129

which is clearly wrong. In that case we need to fix the cpu-pools setup
first.

Regards,
Erich

On Sunday 22 September 2002 10:09, William Lee Irwin III wrote:
> On Sat, Sep 21, 2002 at 09:46:05AM -0700, Martin J. Bligh wrote:
> >> An old compile off 2.5.31-mm1 + extras (I don't have 37, but similar)
>
> On Sat, Sep 21, 2002 at 04:18:10PM -0700, William Lee Irwin III wrote:
> > Some 8-quad numbers for 2.5.37 (virgin) follow.
>
> Okay, 2.5.37 virgin with overcommit_memory set to 1 this time.
> (compiles with -j256 seem to do better than -j32 or -j48, here is -j256):
>
> ... will follow up with 2.5.38-mm1 with and without NUMA sched, at
> least if the arrival rate of releases doesn't exceed the benchtime.

2002-09-22 10:31:27

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Saturday 21 September 2002 18:46, Martin J. Bligh wrote:
> > Hmmm .... well I ran the One True Benchmark (tm). The patch
> > *increased* my kernel compile time from about 20s to about 28s.
> > Not sure I like that idea ;-) Anything you'd like tweaked, or
> > more info? Both user and system time were up ... I'll grab a
> > profile of kernel stuff.
>
> From the below, I'd suggest you're getting pages off the wrong
> nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> allocator. Are you sure the current->node thing is getting set
> correctly? I'll try backing out your alloc_pages tweaking, and
> see what happens.

Could you please check in dmesg whether the CPU pools are initialised
correctly? Maybe something goes wrong for your platform.

The node_distance is most probably non-optimal for NUMAQ; that might
need some tuning. The default is set up for a maximum of 8 nodes, with
nodes 1-4 and 5-8 in separate supernodes and latency ratios of 1:1.5:2.
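
To illustrate (example values only, just following the 1:1.5:2 ratios;
this is not necessarily what the patch ships or what NUMAQ needs), the
8-node default looks conceptually like this:

/* example: latency ratios 1 : 1.5 : 2, scaled by 10; the first four
 * nodes share one supernode, the last four the other */
int __node_distance[8][8] = {
	{ 10, 15, 15, 15, 20, 20, 20, 20 },
	{ 15, 10, 15, 15, 20, 20, 20, 20 },
	{ 15, 15, 10, 15, 20, 20, 20, 20 },
	{ 15, 15, 15, 10, 20, 20, 20, 20 },
	{ 20, 20, 20, 20, 10, 15, 15, 15 },
	{ 20, 20, 20, 20, 15, 10, 15, 15 },
	{ 20, 20, 20, 20, 15, 15, 10, 15 },
	{ 20, 20, 20, 20, 15, 15, 15, 10 },
};

The delay before stealing from a remote node scales with these entries,
so increasing the off-diagonal values (like the 15 -> 50 I suggested
earlier) makes the scheduler more reluctant to steal across nodes.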

You could use the attached patch for getting an idea about the load
distribution. It's a quick&dirty hack which creates these files:
/proc/sched/load/rqNN : load of the runqueues, including info on tasks
    not running on their homenode
/proc/sched/history/ilbNN : history of the last 25 initial load
    balancing decisions for runqueue NN
/proc/sched/history/lbNN : the last 25 load balancing decisions on rq NN.

It should be possible to find the reason for the poor performance by
looking at the nr_homenode entries in /proc/sched/load/rqNN.

Thanks,
best regards,
Erich


Attachments:
proc_sched_hist_2.5.37.patch (6.32 kB)

2002-09-22 10:41:22

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Saturday 21 September 2002 17:55, Martin J. Bligh wrote:
> > - Each process has a homenode assigned to it at creation time
> > (initial load balancing). Memory will be allocated from this node.
...
> #ifdef CONFIG_NUMA
> +#ifdef CONFIG_NUMA_SCHED
> +#define numa_node_id() (current->node)
> +#else
> #define numa_node_id() _cpu_to_node(smp_processor_id())
> +#endif
> #endif /* CONFIG_NUMA */
>
> I'm not convinced it's a good idea to modify this generic function,
> which was meant to tell you what node you're running on. I can't
> see it being used anywhere else right now, but wouldn't it be better
> to just modify alloc_pages instead to use current->node, and leave
> this macro as intended? Or make a process_node_id or something?

OK, I see your point and I agree that numa_node_id() should behave like
smp_processor_id(). I'll change alloc_pages instead.

Do you think it makes sense to get memory from the homenode only for
user processes? Many kernel threads currently have the wrong homenode,
and for some of them it's unclear which homenode they should have...

There is an alternative idea (we discussed this at OLS with Andrea, maybe
you remember): allocate memory from the current node and keep statistics
on where it is allocated. Determine the homenode from this (from time to
time) and schedule accordingly. This eliminates the initial load balancing
and leaves it all to the scheduler, but has the drawback that memory can
be somewhat scattered across the nodes. Any comments?

Regards,
Erich

2002-09-22 14:53:47

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> OK, I see your point and I agree that numa_node_is() should be similar to
> smp_processor_id(). I'll change the alloc_pages instead.
>
> Do you think it makes sense to get memory from the homenode only for
> user processes? Many kernel threads have currently the wrong homenode,
> for some of them it's unclear which homenode they should have...

Well yes ... if you can keep things pretty much on their home nodes.
That means some sort of algorithm for updating it, which may be fairly
complex (and doesn't currently seem to work, but maybe that's just
because I only have 1 pool).

> There is an alternative idea (we discussed this at OLS with Andrea, maybe
> you remember): allocate memory from the current node and keep statistics
> on where it is allocated. Determine the homenode from this (from time to
> time) and schedule accordingly. This eliminates the initial load balancing
> and leaves it all to the scheduler, but has the drawback that memory can
> be somewhat scattered across the nodes. Any comments?

Well, that's a lot simpler. Things should end up running on their home
node, and thus will allocate pages from their home node, so it should
be self-reinforcing. The algorithm for the home node is then implicitly
worked out from the scheduler itself, and its actions, so it's one less
set of stuff to write. Would suggest we do this at first, to keep things
as simple as possible so you have something mergeable.

M.

2002-09-22 15:49:12

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

A few comments ... mainly just i386 arch things (which aren't
really your fault, just a lack of the mechanisms being in
mainline), and a request to push a couple of things down into
the arch trees to keep them from rearing their ugly heads in generic code ;-)

M.

> +static int __initdata nr_lnodes = 0;
> +

Use numnodes.

> cpu = ++cpucount;
> +#ifdef CONFIG_NUMA_SCHED
> + cell = SAPICID_TO_PNODE(apicid);
> + if (pnode_to_lnode[cell] < 0) {
> + pnode_to_lnode[cell] = nr_lnodes++;
> + }
> +#endif

pnodes and lnodes are all 1-1, so they're just called nodes for
i386, and there's no such thing as a SAPICID on this platform.

> /*
> * We can't use kernel_thread since we must avoid to
> * reschedule the child.
> @@ -996,6 +1004,10 @@ static void __init smp_boot_cpus(unsigne
> set_bit(0, &cpu_callout_map);
> boot_cpu_logical_apicid = logical_smp_processor_id();
> map_cpu_to_boot_apicid(0, boot_cpu_apicid);
> +#ifdef CONFIG_NUMA_SCHED
> + pnode_to_lnode[SAPICID_TO_PNODE(boot_cpu_apicid)] = nr_lnodes++;
> + printk("boot_cpu_apicid = %d, nr_lnodes = %d, lnode = %d\n", boot_cpu_apicid, nr_lnodes, pnode_to_lnode[0]);
> +#endif

Ditto. All these mappings exist already. No need to reinvent them.
The -mm tree has a generic cpu_to_node macro, which keys off the
logical_apic_id.

http://www.zipworld.com.au/~akpm/linux/patches/2.5/2.5.37/2.5.37-mm1/broken-out/topology-api.patch

> +#ifdef CONFIG_NUMA_SCHED
> +#define NR_NODES 8

MAX_NUMNODES exists for every arch (except maybe ia64 ;-))

> +#define cpu_physical_id(cpuid) (cpu_to_physical_apicid(cpuid))

Not needed, should be buried within a wrapper, not exposed to
generic code.

> +#define SAPICID_TO_PNODE(hwid) (hwid >> 4)

Grrr. No such thing.

> diff -urNp a/include/linux/sched.h b/include/linux/sched.h

> +extern int pnode_to_lnode[NR_NODES];
> +extern char lnode_number[NR_CPUS] __cacheline_aligned;

Can't you push all this down into the arch ....

> +#define CPU_TO_NODE(cpu) lnode_number[cpu]

... by letting them define cpu_to_node() themselves?
(most people don't have lnodes and pnodes, etc).
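
i.e. something roughly like this (sketch only, not the actual -mm
macros):

/* i386 (NUMA-Q): nodes are quads, 4 CPUs each -- sketch only */
#define cpu_to_node(cpu)	((cpu) / 4)

/* ia64 could keep whatever pnode/lnode mapping it needs internally */
#define cpu_to_node(cpu)	(lnode_number[cpu])

and the generic scheduler code then only ever sees cpu_to_node().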

> +EXPORT_SYMBOL(runqueues_address);
> +EXPORT_SYMBOL(numpools);
> +EXPORT_SYMBOL(pool_ptr);
> +EXPORT_SYMBOL(pool_cpus);
> +EXPORT_SYMBOL(pool_nr_cpus);
> +EXPORT_SYMBOL(pool_mask);
> +EXPORT_SYMBOL(sched_migrate_task);

Aren't these internal scheduler things?

> diff -urNp a/kernel/sched.c b/kernel/sched.c

> +int pnode_to_lnode[NR_NODES] = { [0 ... NR_NODES-1] = -1 };
> +char lnode_number[NR_CPUS] __cacheline_aligned;

Ditto.

> + int this_pool = CPU_TO_NODE(this_cpu);
> + int this_pool=CPU_TO_NODE(this_cpu), weight, maxweight=0;

How come you can use the CPU_TO_NODE abstraction here ...

> + /* build translation table for CPU_TO_NODE macro */
> + for (i = 0; i < NR_CPUS; i++)
> + if (cpu_online(i))
> + lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];

But not here?


2002-09-22 17:08:42

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> would you please check the boot messages for the NUMA scheduler before
> doing the run. Martin sent me an example where he has:
>
> CPU pools : 1
> pool 0 :0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
> node level 0 : 10
> pool_delay matrix:
> 129
>
> which is clearly wrong. In that case we need to fix the cpu-pools setup
> first.

OK, well I hacked this for now:

sched.c somewhere:
- lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];
+ lnode_number[i] = i/4;

Which makes the pools work properly. I think you should just use
the cpu_to_node macro here, but the hack will allow us to do some
testing.

Results, averaged over 5 kernel compiles:

Before:
Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
After:
Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%

So you actually take a little less horsepower to do the work, but
don't utilize the CPUs quite as well, so elapsed time is higher.
I seem to recall getting better results from Mike's quick hack
though ... that was a long time back. What were the balancing
issues you mentioned?

M.

2002-09-22 19:16:52

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

I tried putting back the current->node logic now that we have
the correct node IDs, but it made things worse (not as bad as
before) ... it looks like we're still allocating off the wrong
node.

This run is the last one in the list.

Virgin:
Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
7059 do_anonymous_page
4459 page_remove_rmap
3863 handle_mm_fault
3695 .text.lock.namei
2912 page_add_rmap
2458 rmqueue
2119 vm_enough_memory

Both numasched patches, just compile fixes:
Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
38978 do_anonymous_page
36533 rmqueue
35099 __free_pages_ok
5551 page_remove_rmap
4694 handle_mm_fault
3166 page_add_rmap

Both numasched patches, alloc from local node
Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
7475 do_anonymous_page
4564 page_remove_rmap
4167 handle_mm_fault
3467 .text.lock.namei
2520 page_add_rmap
2112 rmqueue
1905 .text.lock.dec_and_lock
1849 zap_pte_range
1668 vm_enough_memory

Both numasched patches, hack node IDs, alloc from local node
Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
5793 do_anonymous_page
4475 page_remove_rmap
4281 handle_mm_fault
3820 .text.lock.namei
2625 page_add_rmap
2028 .text.lock.dec_and_lock
1748 vm_enough_memory
1713 file_read_actor
1672 rmqueue

Both numasched patches, hack node IDs, alloc from current->node
Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
30317 do_anonymous_page
6962 rmqueue
5190 page_remove_rmap
4773 handle_mm_fault
3522 .text.lock.namei
3161 page_add_rmap

2002-09-22 19:21:19

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

>> + int this_pool = CPU_TO_NODE(this_cpu);
>> + int this_pool=CPU_TO_NODE(this_cpu), weight, maxweight=0;
>
> Howcome you can use the CPU_TO_NODE abstraction here ...
>
>> + /* build translation table for CPU_TO_NODE macro */
>> + for (i = 0; i < NR_CPUS; i++)
>> + if (cpu_online(i))
>> + lnode_number[i] = pnode_to_lnode[SAPICID_TO_PNODE(cpu_physical_id(i))];
>
> But not here?

Doh! Because you're building the list to use for CPU_TO_NODE,
obviously ;-) Sorry.

Should still get buried back down into the arch code though.

M.



2002-09-22 21:55:17

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sunday 22 September 2002 21:20, Martin J. Bligh wrote:
> I tried putting back the current->node logic now that we have
> the correct node IDs, but it made things worse (not as bad as
> before, but ... looks like we're still allocing off the wrong
> node.

Thanks a lot for the testing! It looks like something is still
wrong. NUMAQ suffers a lot more from hops across the nodes than
the Azusa, so I expect it is more sensitive to initial
load balancing errors.

The easiest thing to try is "nodpol -P 2" in the shell before
running the test. This changes the initial load balancing policy
from exec() to fork() ("nodpol -P 0" gets you back to the default).

A bit more difficult is tuning the scheduler parameters which can
be done pretty simply by changing the __node_distance matrix. A first
attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to
larger delays when stealing from a remote node.

Anyhow, it would be good to understand what is going on and maybe
simpler tests than a kernel compile can reveal something. Or looking
at the /proc/sched/load/rqNN files (you need the patch I posted
a few mails ago).

In the meantime I'll modify alloc_pages not to take the kernel
threads into account.
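
The idea is roughly this (untested sketch; memory_node is just an
illustrative name):

/* sketch: which node should this allocation come from? */
static inline int memory_node(void)
{
	if (current->mm == NULL)	/* kernel thread: allocate locally */
		return _cpu_to_node(smp_processor_id());
	return current->node;		/* homenode of the user process */
}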

Regards,
Erich

> This run is the last one in the list.
>
> Virgin:
> Elapsed: 20.82s User: 191.262s System: 59.782s CPU: 1206.4%
> 7059 do_anonymous_page
> 4459 page_remove_rmap
> 3863 handle_mm_fault
> 3695 .text.lock.namei
> 2912 page_add_rmap
> 2458 rmqueue
> 2119 vm_enough_memory
>
> Both numasched patches, just compile fixes:
> Elapsed: 28.744s User: 204.62s System: 173.708s CPU: 1315.8%
> 38978 do_anonymous_page
> 36533 rmqueue
> 35099 __free_pages_ok
> 5551 page_remove_rmap
> 4694 handle_mm_fault
> 3166 page_add_rmap
>
> Both numasched patches, alloc from local node
> Elapsed: 21.094s User: 195.808s System: 62.41s CPU: 1224.4%
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
>
> Both numasched patches, hack node IDs, alloc from local node
> Elapsed: 21.918s User: 190.224s System: 59.166s CPU: 1137.4%
> 5793 do_anonymous_page
> 4475 page_remove_rmap
> 4281 handle_mm_fault
> 3820 .text.lock.namei
> 2625 page_add_rmap
> 2028 .text.lock.dec_and_lock
> 1748 vm_enough_memory
> 1713 file_read_actor
> 1672 rmqueue
>
> Both numasched patches, hack node IDs, alloc from current->node
> Elapsed: 24.414s User: 194.86s System: 98.606s CPU: 1201.6%
> 30317 do_anonymous_page
> 6962 rmqueue
> 5190 page_remove_rmap
> 4773 handle_mm_fault
> 3522 .text.lock.namei
> 3161 page_add_rmap


Attachments:
nodpol.c (2.52 kB)

2002-09-22 22:39:12

by William Lee Irwin III

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote:
> A bit more difficult is tuning the scheduler parameters which can
> be done pretty simply by changing the __node_distance matrix. A first
> attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to
> larger delays when stealing from a remote node.

This is not entirely reflective of our architecture. Node-to-node
latencies vary as well. Some notion of whether communication must cross
a lash at the very least should be present.


Cheers,
Bill

2002-09-22 22:47:39

by Martin J. Bligh

[permalink] [raw]
Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> On Sun, Sep 22, 2002 at 11:59:16PM +0200, Erich Focht wrote:
>> A bit more difficult is tuning the scheduler parameters which can
>> be done pretty simply by changing the __node_distance matrix. A first
>> attempt could be: 10 on the diagonal, 100 off-diagonal. This leads to
>> larger delays when stealing from a remote node.
>
> This is not entirely reflective of our architecture. Node-to-node
> latencies vary as well. Some notion of whether communication must cross
> a lash at the very least should be present.

Ummm ... I think it's just flat on or off node, presumably Erich
has "on the diagonal" meaning they're on the same node, and
"off-diagonal" meaning they're not. In which case, what he suggested
seems fine ... it's really about a 20:1 ratio, so I might use 10
and 200, but other than that, it seems correct to me.

M.

2002-09-23 18:15:59

by Erich Focht

Subject: node affine NUMA scheduler: simple benchmark

Here is a simple benchmark which is NUMA sensitive and simulates a simple
but normal situation in an environment running number crunching jobs. It
starts N independent tasks which access a large array in a random manner.
This is both bandwidth and latency sensitive. The output shows on which
node(s) the tasks have spent their lives. Additionally it shows (on a
NUMA scheduler kernel) the homenode (iSched).

Could you please run it on the virgin kernel and on the
"Both numasched patches, hack node IDs, alloc from current->node" one?

Maybe we see what's wrong...

Thanks,
Erich


Attachments:
rand_updt (5.54 kB)

2002-09-23 18:34:08

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Sunday 22 September 2002 16:57, Martin J. Bligh wrote:
> > There is an alternative idea (we discussed this at OLS with Andrea, maybe
> > you remember): allocate memory from the current node and keep statistics
> > on where it is allocated. Determine the homenode from this (from time to
> > time) and schedule accordingly. This eliminates the initial load
> > balancing and leaves it all to the scheduler, but has the drawback that
> > memory can be somewhat scattered across the nodes. Any comments?
>
> Well, that's a lot simpler. Things should end up running on their home
> node, and thus will allocate pages from their home node, so it should
> be self-re-enforcing. The algorithm for the home node is then implicitly
> worked out from the scheduler itself, and its actions, so it's one less
> set of stuff to write. Would suggest we do this at first, to keep things
> as simple as possible so you have something mergeable.

OK, sounds encouraging. So here is my first attempt (attached). You'll
have to apply it on top of the two NUMA scheduler patches and hack
SAPICID_TO_PNODE again.

The patch adds a node_mem[NR_NODES] array to each task. When allocating
memory (in rmqueue) and freeing it (in free_pages_ok) the number of
pages is added/subtracted from that array and the homenode is set to
the node having the largest entry. Is there a better place to put
this (other than rmqueue/free_pages_ok)?
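
In code the idea is roughly this (sketch only; account_node_pages is
just an illustrative name, the patch does it inline in
rmqueue/free_pages_ok, and locking is left out):

/* called with the node id of the zone and +1/-1 for alloc/free */
static void account_node_pages(struct task_struct *p, int nid,
			       int order, int sign)
{
	int n, best = p->node;
	long max = -1;

	p->node_mem[nid] += sign * (1L << order);

	/* homenode = the node holding most of this task's pages */
	for (n = 0; n < numnodes; n++)
		if (p->node_mem[n] > max) {
			max = p->node_mem[n];
			best = n;
		}
	p->node = best;
}

The final loop is exactly the part that makes freeing expensive, which
is problem 1 below.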

I have two problems with this approach:
1: Freeing memory is quite expensive, as it currently involves finding the
maximum of the array node_mem[].
2: I have no idea how tasks sharing the mm structure will behave. I'd
like them to run on different nodes (that's why node_mem is not in mm),
but they could (legally) free pages which they did not allocate and
have wrong values in node_mem[].

Comments?

Regards,
Erich


Attachments:
node_affine_dyn-2.5.37.patch (8.75 kB)

2002-09-23 18:53:56

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

> OK, sounds encouraging. So here is my first attempt (attached). You'll
> have to apply it on top of the two NUMA scheduler patches and hack
> SAPICID_TO_PNODE again.
>
> The patch adds a node_mem[NR_NODES] array to each task. When allocating
> memory (in rmqueue) and freeing it (in free_pages_ok) the number of
> pages is added/subtracted from that array and the homenode is set to
> the node having the largest entry. Is there a better place where to put
> this in (other than rmqueue/free_pages_ok)?
>
> I have two problems with this approach:
> 1: Freeing memory is quite expensive, as it currently involves finding the
> maximum of the array node_mem[].

Bleh ... why? This needs to be calculated much more lazily than this,
or you're going to kick the hell out of any cache affinity. Can you
recalc this in the rebalance code or something instead?

> 2: I have no idea how tasks sharing the mm structure will behave. I'd
> like them to run on different nodes (that's why node_mem is not in mm),
> but they could (legally) free pages which they did not allocate and
> have wrong values in node_mem[].

Yes, that really ought to be per-process, not per task. Which means
locking or atomics ... and overhead. Ick.

For the first cut of the NUMA sched, maybe you could just leave page
allocation alone, and do that separately? Or is that what the second
patch was meant to be?

M.

2002-09-24 21:14:27

by Martin J. Bligh

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

>> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
>> > like them to run on different nodes (that's why node_mem is not in mm),
>> > but they could (legally) free pages which they did not allocate and
>> > have wrong values in node_mem[].
>>
>> Yes, that really ought to be per-process, not per task. Which means
>> locking or atomics ... and overhead. Ick.
>
> Hmm, I think it is sometimes ok to have it per task. For example OpenMP
> parallel jobs working on huge arrays. The "first-touch" of these arrays
> leads to pagefaults generated by the different tasks and thus different
> node_mem[] arrays for each task. As long as they just allocate memory,
> all is well. If they only release it at the end of the job, too. This
> probably goes wrong if we have a long running task that spawns short
> living clones. They inherit the node_mem from the parent but pages
> added by them to the common mm are not reflected in the parent's node_mem
> after their death.

But you're left with a choice whether to base it on the per-task or
per-process information when you make decisions.

1. per-process requires cross-node collation for a data read. Bad.

2. per-task leads to obviously bad decisions when there are significant
amounts of shared data between the tasks of a process (which was often
the point of making them threads in the first place).

Yes, I can imagine a situation for which it would work, as you describe
above ... but that's not the point - this is a general policy, and I
don't think it works in general ... as you say yourself "it is
*sometimes* ok" ;-)

> The first patch needs a correction, add in load_balance()
> if (!busiest) goto out;
> after the call to find_busiest_queue. This works alone. On top of this
> pooling NUMA scheduler we can put the node affinity approach that fits
> best. With or without memory allocation. I'll update the patches and
> their setup code (thanks for the comments!) and resend them.

Excellent! Will try out the correction above and get back to you.
Might be a day or so, I'm embroiled in some other code at the moment.

M.

2002-09-24 21:04:24

by Erich Focht

Subject: Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

On Monday 23 September 2002 20:47, Martin J. Bligh wrote:
> > I have two problems with this approach:
> > 1: Freeing memory is quite expensive, as it currently involves finding
> > the maximum of the array node_mem[].
>
> Bleh ... why? This needs to be calculated much more lazily than this,
> or you're going to kick the hell out of any cache affinity. Can you
> recalc this in the rebalance code or something instead?

You're right, that would be too slow. I'm thinking of marking the tasks
that need recalculation and updating their homenode when their runqueue
is scanned for a task to be stolen.
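
Roughly like this (sketch; need_recalc and node_with_most_pages are
just illustrative names):

/* free_pages_ok(): only flag the task, don't search node_mem[] here */
current->node_mem[nid] -= 1L << order;
current->need_recalc = 1;

/* load_balance(), while scanning a runqueue for a task to steal: */
if (p->need_recalc) {
	p->node = node_with_most_pages(p);	/* argmax over node_mem[] */
	p->need_recalc = 0;
}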

> > 2: I have no idea how tasks sharing the mm structure will behave. I'd
> > like them to run on different nodes (that's why node_mem is not in mm),
> > but they could (legally) free pages which they did not allocate and
> > have wrong values in node_mem[].
>
> Yes, that really ought to be per-process, not per task. Which means
> locking or atomics ... and overhead. Ick.

Hmm, I think it is sometimes ok to have it per task. For example OpenMP
parallel jobs working on huge arrays. The "first-touch" of these arrays
leads to pagefaults generated by the different tasks and thus different
node_mem[] arrays for each task. As long as they just allocate memory,
all is well. If they only release it at the end of the job, that's fine
too. This probably goes wrong if we have a long-running task that spawns
short-lived clones. They inherit the node_mem from the parent, but pages
added by them to the common mm are not reflected in the parent's node_mem
after their death.

> For the first cut of the NUMA sched, maybe you could just leave page
> allocation alone, and do that seperately? or is that what the second
> patch was meant to be?

The first patch needs a correction: in load_balance(), add
	if (!busiest) goto out;
after the call to find_busiest_queue(). This works on its own. On top of this
pooling NUMA scheduler we can put the node affinity approach that fits
best. With or without memory allocation. I'll update the patches and
their setup code (thanks for the comments!) and resend them.

Regards,
Erich