This test runs make -j X vmlinux on a 2.4.17 kernel tree with a very
large config. X is 16 times the number of cpus; this is on a 16-way
NUMA-Q, so we end up with make -j256. The compile is fastest with about
1.5 * num_cpus jobs, but the higher job count puts more stress on the
kernel.
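
For concreteness, the job counts work out as follows. This is only the
arithmetic from the description above; the variable names are made up
for illustration and are not part of kernbench.

```python
# Job-count arithmetic for the run described above (names are illustrative).
num_cpus = 16                        # 16-way NUMA-Q
stress_jobs = 16 * num_cpus          # the X used in this test
fastest_jobs = int(1.5 * num_cpus)   # roughly where the compile runs fastest

print(f"make -j{stress_jobs} vmlinux   # this test")
print(f"make -j{fastest_jobs} vmlinux    # about the fastest setting")
```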
None of the other tests I ran showed anything very interesting.
(the new NUMA sched stuff from Ingo seems to give mild degradations
in -mjb ... probably needs some more tuning).
Going from 59 to 59-mm6, I get:
Kernbench-16:
              Elapsed     User   System      CPU
2.5.59          47.45   568.02   143.17  1498.17
2.5.59-mm6      47.18   567.15   138.62  1495.50
Summary: Scheduler stuff seems like a wash (schedule -> do_schedule).
Seems to be some sort of rearrangement of the dcache stuff which
appears to be mildly beneficial (what's going in there?).
current_kernel_time seems to cost less than half what it did; I'm assuming
the new frlock kernel time stuff is doing that. This workload doesn't
stress that very much, so I'll find a better test for that one ...
2.5.59:      1657 current_kernel_time
2.5.59-mm6:   747 current_kernel_time
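
A quick check of the "less than half" claim, using only the two profile
counts quoted above. A minimal sketch; the variable names are mine.

```python
# Profile counts for current_kernel_time, taken from the two lines above.
before = 1657   # 2.5.59
after = 747     # 2.5.59-mm6

delta = after - before    # additive change, as diffprofile would show it
ratio = after / before    # fraction of the old cost that remains
print(f"delta: {delta}, ratio: {ratio:.2f}")   # prints "delta: -910, ratio: 0.45"
```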
diffprofile (+ gets worse, - gets better).
  2023 do_schedule
   485 dentry_open
   289 .text.lock.file_table
   132 clear_page_tables
   131 pgd_ctor
   113 vma_merge
    75 kmap_atomic
    62 get_empty_filp
    51 can_vma_merge_after
   -52 dget_locked
   -54 vfs_follow_link
   -55 kmem_cache_free
   -66 buffered_rmqueue
   -74 __copy_to_user_ll
   -94 page_add_rmap
  -102 fd_install
  -110 __copy_from_user_ll
  -117 __d_lookup
  -157 do_generic_mapping_read
  -188 path_lookup
  -273 .text.lock.dec_and_lock
  -275 file_ra_state_init
  -283 do_anonymous_page
  -331 pfn_to_nid
  -405 page_remove_rmap
  -413 pgd_alloc
  -427 vm_enough_memory
  -910 current_kernel_time
 -1222 .text.lock.namei
 -2076 total
 -2133 schedule
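
For anyone who wants to reproduce this kind of listing, here is a rough
sketch of the idea: subtract per-symbol counts of one profile from the
other and print the larger changes. It is not the actual diffprofile
script. The "<count> <symbol>" input format, the function names, and the
threshold of 50 are all assumptions, and old_name / new_name are made-up
symbols that only show how a renamed function appears as a matched
plus/minus pair (as with schedule -> do_schedule).

```python
def parse_profile(text):
    """Parse lines of the assumed form '<count> <symbol>' into a dict."""
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        counts[fields[1]] = counts.get(fields[1], 0) + int(fields[0])
    return counts


def diff_profile(before, after, threshold=50):
    """Return (delta, symbol) pairs with |delta| > threshold, worst first."""
    symbols = set(before) | set(after)
    deltas = [(after.get(s, 0) - before.get(s, 0), s) for s in symbols]
    deltas = [(d, s) for d, s in deltas if abs(d) > threshold]
    return sorted(deltas, key=lambda pair: (-pair[0], pair[1]))


# current_kernel_time counts are from the mail; the other two are hypothetical.
old = parse_profile("1657 current_kernel_time\n100 old_name\n")
new = parse_profile("747 current_kernel_time\n90 new_name\n")
for delta, symbol in diff_profile(old, new):
    print(f"{delta:6d} {symbol}")
```

A symbol missing from one profile is treated as having a count of zero,
which is why a rename shows up as two large, nearly cancelling entries.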
On Mon, Jan 27, 2003 at 06:40:15PM +0100, Martin J. Bligh wrote:
> Going from 59 to 59-mm6, I get:
>
> Kernbench-16:
>               Elapsed     User   System      CPU
> 2.5.59          47.45   568.02   143.17  1498.17
> 2.5.59-mm6      47.18   567.15   138.62  1495.50
>
> Summary: Scheduler stuff seems like a wash (schedule -> do_schedule).
> Seems to be some sort of rearrangement of the dcache stuff which
> appears to be mildly beneficial (what's going in there?).
>
> diffprofile (+ gets worse, - gets better).
>
> 2023 do_schedule
> 485 dentry_open
> 289 .text.lock.file_table
Looks like you are getting hit by contention on files_lock. I have
been messing around with some code to split up the files_lock, but
I can't seem to get the locking in the tty layer right.
Hmm.. .text.lock.namei is probably dcache_lock. -mms no longer has
dcache_rcu, so not quite sure what helped you here.
Thanks
Dipankar
On Mon, Jan 27, 2003 at 09:36:52AM -0800, Martin J. Bligh wrote:
> 132 clear_page_tables
> 131 pgd_ctor
> -413 pgd_alloc
The pagetable preconstruction cache hit is spread across
clear_page_tables() and pgd_ctor() with the pgd_ctor patches.
This is the equivalent of the explicit zeroing in pgd_alloc().
Your result appears to imply the overhead has been reduced by 36%,
which is useful evidence for the PAE case. Before this the pgd_alloc()
overhead had only been observed on non-PAE systems.
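
The 36% figure can be checked from the three quoted diffprofile lines.
This is a sketch of the arithmetic only, and it assumes the -413 in
pgd_alloc accounts for the whole of the old explicit-zeroing cost.

```python
# Deltas quoted from the diffprofile above.
pgd_alloc = -413          # explicit zeroing that went away
clear_page_tables = 132   # new cost, part 1
pgd_ctor = 131            # new cost, part 2

net = pgd_alloc + clear_page_tables + pgd_ctor   # net change across all three
reduction = -net / -pgd_alloc                    # assumes 413 was the full old cost
print(f"net change: {net}, reduction: {reduction:.0%}")
# prints "net change: -150, reduction: 36%"
```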
Now, YTF hadn't I seen this before if all it took to bring it out was
a kernel compile? Perhaps diffprof (I prefer the multiplicative flavor
but nm that) of some flavor was lacking.
-- wli