2002-03-19 04:26:35

by Martin J. Bligh

[permalink] [raw]
Subject: Scalability problem (kmap_lock) with -aa kernels

OK, I finally got the -aa kernel series running in conjunction with the
NUMA-Q discontigmem stuff. For some reason which I haven't debugged
yet 2.4.19-pre3-aa2 won't boot on the NUMA-Q even without the discontigmem
stuff in ... so I went back to 2.4.19-pre1-aa1, which I knew worked from
last time around (thanks again for that patch).

So just comparing aa+discontigmem to standard 2.4.18+discontigmem, I see
kernel compile times are about 35s vs 26.5s .... hmmm. Looking at the top
part of the profiles, I see this:

standard:

23991 total 0.0257
7679 default_idle 147.6731
3044 _text_lock_dcache 8.7221
2340 _text_lock_swap 43.3333
1160 do_anonymous_page 3.4940
776 d_lookup 2.8116
650 __free_pages_ok 1.2405
627 lru_cache_add 6.8152
608 do_generic_file_read 0.5468
498 __generic_copy_from_user 4.7885
480 lru_cache_del 21.8182
437 atomic_dec_and_lock 6.0694
426 schedule 0.3017
402 _text_lock_dec_and_lock 16.7500
...
109 kmap_high 0.3028
46 _text_lock_highmem 0.4071

andrea:
38549 total 0.0405
13102 _text_lock_highmem 108.2810
8627 default_idle 165.9038
2578 kunmap_high 14.3222
2556 kmap_high 6.0857
1242 do_anonymous_page 3.2684
1052 _text_lock_swap 22.8696
942 _text_lock_dcache 2.4987
683 do_page_fault 0.4337
587 pte_alloc 1.2332
535 __generic_copy_from_user 5.1442
518 d_lookup 1.8768
443 __free_pages_ok 0.7745
422 lru_cache_add 2.7763

_text_lock_highmem appears to be kmap_lock, looking at dissassembly.
Recompiling with the trusty lockmeter, I see this (on -aa).

33.4% 63.5% 5.4us(7893us) 155us( 16ms)(37.8%) 2551814 36.5% 63.5% 0% kmap_lock_cacheline
17.4% 64.9% 5.7us(7893us) 158us( 16ms)(19.7%) 1275907 35.1% 64.9% 0% kmap_high+0x34
16.0% 62.1% 5.2us( 982us) 152us( 13ms)(18.1%) 1275907 37.9% 62.1% 0% kunmap_high+0x40

Ick. On a vaguely comparible mainline kernel we're looking at:

1.6% 2.7% 0.5us(4208us) 28us(3885us)(0.14%) 716044 97.3% 2.7% 0% kmap_lock
1.2% 2.9% 0.9us(4208us) 35us(3885us)(0.09%) 358022 97.1% 2.9% 0% kmap_high+0x10
0.33% 2.5% 0.2us( 71us) 21us(2598us)(0.05%) 358022 97.5% 2.5% 0% kunmap_high+0xc

Andrea - is this your new highmem pte stuff doing this?
Or is that not even in your tree as yet? Would be a shame if that's
the problem as I really want to get the highmem pte stuff - allows
me to put processes pagetables on their own nodes ....

Thanks,

Martin.



2002-03-19 08:59:11

by Rik van Riel

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Mon, 18 Mar 2002, Martin J. Bligh wrote:

> OK, I finally got the -aa kernel series running in conjunction with the
> NUMA-Q discontigmem stuff. For some reason which I haven't debugged
> yet 2.4.19-pre3-aa2 won't boot on the NUMA-Q even without the discontigmem
> stuff in ... so I went back to 2.4.19-pre1-aa1, which I knew worked from
> last time around (thanks again for that patch).
>
> So just comparing aa+discontigmem to standard 2.4.18+discontigmem, I see
> kernel compile times are about 35s vs 26.5s ....

That's probably Andrea's pte-highmem code, which uses the
global kmap pool instead of doing cpu-local atomic kmaps.

He's been told about this becoming a scalability problem
but didn't believe it since I didn't have any numbers for
him. Maybe he believes it now since you've measured it ;)


> standard:
>
> 23991 total 0.0257
> 7679 default_idle 147.6731
> 3044 _text_lock_dcache 8.7221
> 2340 _text_lock_swap 43.3333
> 1160 do_anonymous_page 3.4940
> ...
> 109 kmap_high 0.3028
> 46 _text_lock_highmem 0.4071
>
> andrea:
> 38549 total 0.0405
> 13102 _text_lock_highmem 108.2810
> 8627 default_idle 165.9038
> 2578 kunmap_high 14.3222
> 2556 kmap_high 6.0857
> 1242 do_anonymous_page 3.2684
> 1052 _text_lock_swap 22.8696

> Andrea - is this your new highmem pte stuff doing this?
> Or is that not even in your tree as yet? Would be a shame if that's
> the problem as I really want to get the highmem pte stuff - allows
> me to put processes pagetables on their own nodes ....

You really want the pte-highmem by Arjan and Ingo, which uses
kmap_atomic (which is CPU-local and doesn't suffer the same
scalability problem as the global kmap pool).

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-20 01:39:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Mon, Mar 18, 2002 at 08:25:42PM -0800, Martin J. Bligh wrote:
> OK, I finally got the -aa kernel series running in conjunction with the
> NUMA-Q discontigmem stuff. For some reason which I haven't debugged
> yet 2.4.19-pre3-aa2 won't boot on the NUMA-Q even without the discontigmem

If you have PIII cpus pre3 mainline has a bug in the machine check code,
this one liner will fix it:

diff -urN 2.4.19pre3/arch/i386/kernel/bluesmoke.c mcheck/arch/i386/kernel/bluesmoke.c
--- 2.4.19pre3/arch/i386/kernel/bluesmoke.c Tue Mar 12 00:07:06 2002
+++ mcheck/arch/i386/kernel/bluesmoke.c Thu Mar 14 12:45:05 2002
@@ -169,7 +169,7 @@
if(l&(1<<8))
wrmsr(MSR_IA32_MCG_CTL, 0xffffffff, 0xffffffff);
banks = l&0xff;
- for(i=0;i<banks;i++)
+ for(i=1;i<banks;i++)
{
wrmsr(MSR_IA32_MC0_CTL+4*i, 0xffffffff, 0xffffffff);
}


> stuff in ... so I went back to 2.4.19-pre1-aa1, which I knew worked from
> last time around (thanks again for that patch).
>
> So just comparing aa+discontigmem to standard 2.4.18+discontigmem, I see
> kernel compile times are about 35s vs 26.5s .... hmmm. Looking at the top
> part of the profiles, I see this:
>
> standard:
>
> 23991 total 0.0257
> 7679 default_idle 147.6731
> 3044 _text_lock_dcache 8.7221
> 2340 _text_lock_swap 43.3333
> 1160 do_anonymous_page 3.4940
> 776 d_lookup 2.8116
> 650 __free_pages_ok 1.2405
> 627 lru_cache_add 6.8152
> 608 do_generic_file_read 0.5468
> 498 __generic_copy_from_user 4.7885
> 480 lru_cache_del 21.8182
> 437 atomic_dec_and_lock 6.0694
> 426 schedule 0.3017
> 402 _text_lock_dec_and_lock 16.7500
> ...
> 109 kmap_high 0.3028
> 46 _text_lock_highmem 0.4071
>
> andrea:
> 38549 total 0.0405
> 13102 _text_lock_highmem 108.2810
> 8627 default_idle 165.9038
> 2578 kunmap_high 14.3222
> 2556 kmap_high 6.0857

One thing I see is not only a scalability problem with the locking, but
it seems kmap_high is also spending an huge amount of time in kernel
compared to the "standard" profiling. That maybe because I increased
too much the size of the pool (the algorithm is O(N)). Can you try again
with this incremental patch applied?

--- 2.4.19pre3aa3/include/asm-i386/highmem.h.~1~ Thu Mar 14 12:48:11 2002
+++ 2.4.19pre3aa3/include/asm-i386/highmem.h Wed Mar 20 01:31:42 2002
@@ -47,7 +47,7 @@
KM_NR_SERIES,
};

-#define LAST_PKMAP 1024
+#define LAST_PKMAP 128
#define PKMAP_SIZE ((LAST_PKMAP*KM_NR_SERIES) << PAGE_SHIFT)
#define PKMAP_BASE (FIXADDR_START - PKMAP_SIZE - PAGE_SIZE) /* left a page in between */
#define PKMAP_NR(virt) (((virt)-PKMAP_BASE) >> PAGE_SHIFT)


If kmap_high was dogslow, all cpus had to stall for a very long time
into _text_lock_highmem (OTOH also kunmap_high has a quite high rate,
infact it has an higher rate, that doesn't make much sense).

> Andrea - is this your new highmem pte stuff doing this?

The pte-highmem stuff has nothing to do with the kmap_high O(N)
complexity that maybe the real reason of this slowdown. (the above patch
decreases N of an order of magnitude and so we'll see if that was the
real problem or not)

However avoiding completly persistent kmaps would definitely give you an
additional scalability at the cost of an additional tlb overhead and
uglier code. I never questioned this point. If you want to run a 1024
CPU system then you don't want the persistent kmaps, period.

So avoiding persistent kmaps in the pte handling would in turn you give
you additional scalability __if__ your workload is very pagetable
intensive (a kernel compile is very pagetable intensive incidentally),
but the very same scalability problem you can find with the pagetables
you will have it also for the cache in different workloads because all
the pagecache is in highmem too and every time you execute a read
syscall you will also need to kmap-persistent a pagecache.

So I will never agree that avoiding the persistent kmaps in the
pagetable handling is a final solution, if the persistent kmaps are a
bottleneck they have to be removed completly, not just from the
pagetable handling. 512 pages = 2M of cache being cached isn't enough if
it doesn't work well for the pagetables (infact such pool of mapped
pages sounds much nicer for the pagetables, where the ram footprint is
much smaller than for the pagecache, and also thanks to the quicklist
that will make more likely to have page->virtual just in cache, btw
removing the quicklist from 2.5 was a bad idea, the global allocator
affinity can be completly polluted under high I/O loads, and the
quicklist was still useful as an active list, but this is totally
offtopic here, it just happened to be broken in 2.5 with the high-pte
patch and that's why it came to mind).

Avoiding persistent kmaps for the whole pagetable handling doesn't
guarantee your scalability problem will go away, you will run in the
_very_same_scalability_problem_ with a just little different workload,
it depends completly on the workload, and lots of common applications
are going to be hurted by the pagecache I/O the same or more than the
pagetable kmaps.

The 2.5 kernel avoids using persistent kmaps for pagetables, that's the
only interesting difference with pte-highmem in 2.4 (all other
differences are not interesting and I prefer pte-highmem for all other
parts for sure), but note that you will still have to pay for an hit if
you want the feature compared to the "standard" 2.4 that you benchmarked
against: in 2.5 the CPU will have to walk pagetables for the kmap areas
after every new kmap because the kmap will be forced to flush the tlb
entry without persistence. The pagetables relative to the kmap atomic
area are shared across all cpus and the cpu issues locked cycles to walk
them.

The whole persistent kmap design is superior in common machines where
people doesn't need to scale up to 16-way and it allows us to write
simpler code compared to the atomic kmaps. It wasn't really designed for
16-way machines. It's not a matter of pte-highmem vs high-pte, it's a
matter of persistent kmaps vs atomic kmaps, not just for the pagetables
but for everything. If you need to avoid the global spinlock you don't
want the persistent kmaps, no matter where they came from
(pagetable/pagecache).

As soon as I'll get your result for the above patch, we'll be able to
choose if the right thing is to remove all persistent kmaps from
pte-highmem (removing them from the pagecache would be too complicated
in 2.4 because it would break the fs API, while just for pte-highmem is
doable in a few hours). It will be a choice oriented for the very high
end 32bit x86 though, my 2-way x86 doesn't showup any locking problem in
the profiles yet.

Many thanks for the useful feedback.

Andrea

2002-03-20 12:29:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Tue, Mar 19, 2002 at 10:15:30PM -0800, Martin J. Bligh wrote:
> > If you have PIII cpus pre3 mainline has a bug in the machine check code,
> > this one liner will fix it:
>
> Thanks - that'll probably save me a bunch of time ...
>
> May I assume, for the sake of discussion, that there's not much
> difference between pre1-aa1 and your latest stuff in the area we're
> talking about here?
>
> > One thing I see is not only a scalability problem with the locking, but
> > it seems kmap_high is also spending an huge amount of time in kernel
> > compared to the "standard" profiling.
>
> One other wierd thing to note in the profiling is that kunmap_high
> is way up in the profile as well - seeing as kunmap_high doesn't
> do the same sort of scanning as kmap_high, it shouldn't be O(N).
> I'm wondering if somehow the profiling attributes some of the spinning
> cost to the calling function. Pure speculation, but there's defintely
> something strange there ....

Yep. What's the profile buffer size? Just make sure to boot with
profile=2 so you'll have a quite accurate precision.

> > That maybe because I increased
> > too much the size of the pool (the algorithm is O(N)). Can you try
> > again with this incremental patch applied?
>
> Sure ... will retest. But by shrinking the size of the pool back down,
> won't you just increase the number of global tlbflushes? Any way you

Yes.

> I appreciate the scanning doesn't scale well, but is the pte-highmem
> stuff the cause of the increase of kmap frequency? There seems to be
> both a frequency and duration problem.

the frequency is higher during a kernel compile due the pte-highmem, but
if you just change the workload and you start 16 tasks simultaenous
reading from a file in cache, you will get the same very frequency no
matter of pte-highmem or not. What you found is a scalability issue with
the persistent kmaps, not introduced by pte-highmem (however with
pte-highmem I have increased its visibility due the additional
usages of the persistent kmaps for certain pte intensive workload and
with a larger pool to scan).

> still need to do a kmap for the pagecache situation you mention?

of course yes. bounce buffers are completly transparent with the
pagecache layer. The very same persistent kmap that is hurting you with
the pagetable handling is happening also in every generic_file_read/write,
it will happen even if you are using ramfs that doesn't do any I/O at
all.

> OK, I guess we're back to the question of whether a local tlb_flush_one
> per kmap is cheaper than a global tlb_flush_all once per LAST_PKMAP
> kmaps. Not just in terms of time to execute, but in terms of how much
> we slow down others by trashing the cache ... I guess that's going to
> be tough to really measure.

On a 16-way considering all cpus will hit the mem bus to re-read all the
pagetables at the very same time plus there will be an IPI flood to 16
cpus, so depending on the design of your hardware it may be bad. More
cpus also means a potential much higher kmap frequency. So I'm pretty
sure for NUMA-Q you don't want the persistent kmaps in pagetables and in
pagecache. Persistent kmaps aren't meant to scale, they're meant to
optimize the more common x86 machines (also UP with say 1G of ram) and
to make the kernel code dealing with highmem simpler and more readable.
Those are the very same reasons I'm using the persistent kmaps in
pte-highmem indeed. But on a 16-way you have different scalability needs
for both pagecache and pagetables.

BTW, I would also clarify that it's not that I never checked any
number/profiling of high end systems yet with pte-highmem, it's just
that your hardware and your workload has to be quite different from the
one where pte-highmem is been developed for. The workloads that were
running out of pagetables due the too many tasks mapping the same 512M
SGA, didn't show any bottleneck in the persistent kmap during the
profiling.

> It would be nice to be able to compare the two different kmap approaches
> against each other - AFAIK, the 2.5 implementation isn't available for
> 2.4 to compare though ... if your stuff is easy to change over to
> atomic_kmap, I'd be happy to compare it (unless shrinking the pool size
> fixes it, in which case we're done).

Correct. If shrinking the pool doesn't make significant difference (for
example it may be acceptable if it would reduce the level of overhead
to the same one of lru_cache_add in the anonymous memory page fault that
you also don't want on a NUMA-Q just for kernel compiles without the
need of swapout anything) I can very easily drop the persistent kmap
usage from my tree so you can try that way too (without adding the
kernel pagetables in kernel stuff in 2.5 and without dropping the
quicklist cpu affine cache like what happened in 2.5).

BTW, before I drop the persistent kmaps from the pagetable handling you
can also make a quick check by removing __GFP_HIGHMEM from the
allocation in mm/memory.c:pte_alloc_one() and verifying the kmap_high
overhead goes away during the kernel compiles (that basically disables
the pte-highmem feature).

> Thanks for taking the time to explain all of this - I have a much
> better idea what's going on now. I'll get you the new numbers tommorow.

You're welcome. Thanks to you to help fixing very-high-end scalability
problems in my tree.

Andrea

2002-03-20 06:15:20

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

> If you have PIII cpus pre3 mainline has a bug in the machine check code,
> this one liner will fix it:

Thanks - that'll probably save me a bunch of time ...

May I assume, for the sake of discussion, that there's not much
difference between pre1-aa1 and your latest stuff in the area we're
talking about here?

> One thing I see is not only a scalability problem with the locking, but
> it seems kmap_high is also spending an huge amount of time in kernel
> compared to the "standard" profiling.

One other wierd thing to note in the profiling is that kunmap_high
is way up in the profile as well - seeing as kunmap_high doesn't
do the same sort of scanning as kmap_high, it shouldn't be O(N).
I'm wondering if somehow the profiling attributes some of the spinning
cost to the calling function. Pure speculation, but there's defintely
something strange there ....

> That maybe because I increased
> too much the size of the pool (the algorithm is O(N)). Can you try
> again with this incremental patch applied?

Sure ... will retest. But by shrinking the size of the pool back down,
won't you just increase the number of global tlbflushes? Any way you
cut it, the kmaps are going to be expensive ... according to the
lockmeter stuff, you're doing about 3.5 times as many kmaps.

> The pte-highmem stuff has nothing to do with the kmap_high O(N)
> complexity that maybe the real reason of this slowdown. (the above
> patch decreases N of an order of magnitude and so we'll see if that
> was the real problem or not)

I appreciate the scanning doesn't scale well, but is the pte-highmem
stuff the cause of the increase of kmap frequency? There seems to be
both a frequency and duration problem.

> So avoiding persistent kmaps in the pte handling would in turn you give
> you additional scalability __if__ your workload is very pagetable
> intensive (a kernel compile is very pagetable intensive incidentally),
> but the very same scalability problem you can find with the pagetables
> you will have it also for the cache in different workloads because all
> the pagecache is in highmem too and every time you execute a read
> syscall you will also need to kmap-persistent a pagecache.

I don't think anyone would deny that making kmap faster / more scalable
would be a Good Thing ;-) I haven't stared at the pagecache code too
much - once we avoid the bounce buffers with Jens' patches, do we
still need to do a kmap for the pagecache situation you mention?

> The 2.5 kernel avoids using persistent kmaps for pagetables, that's the
> only interesting difference with pte-highmem in 2.4 (all other
> differences are not interesting and I prefer pte-highmem for all other
> parts for sure), but note that you will still have to pay for an hit if
> you want the feature compared to the "standard" 2.4 that you benchmarked
> against: in 2.5 the CPU will have to walk pagetables for the kmap areas
> after every new kmap because the kmap will be forced to flush the tlb
> entry without persistence. The pagetables relative to the kmap atomic
> area are shared across all cpus and the cpu issues locked cycles to walk
> them.

OK, I guess we're back to the question of whether a local tlb_flush_one
per kmap is cheaper than a global tlb_flush_all once per LAST_PKMAP
kmaps. Not just in terms of time to execute, but in terms of how much
we slow down others by trashing the cache ... I guess that's going to
be tough to really measure.

It would be nice to be able to compare the two different kmap approaches
against each other - AFAIK, the 2.5 implementation isn't available for
2.4 to compare though ... if your stuff is easy to change over to
atomic_kmap, I'd be happy to compare it (unless shrinking the pool size
fixes it, in which case we're done).

Thanks for taking the time to explain all of this - I have a much
better idea what's going on now. I'll get you the new numbers tommorow.

M.

2002-03-20 16:14:14

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

> Yep. What's the profile buffer size? Just make sure to boot with
> profile=2 so you'll have a quite accurate precision.

Yup, I have profile=2.

> the frequency is higher during a kernel compile due the pte-highmem, but
> if you just change the workload and you start 16 tasks simultaenous
> reading from a file in cache, you will get the same very frequency no
> matter of pte-highmem or not. What you found is a scalability issue with
> the persistent kmaps, not introduced by pte-highmem (however with
> pte-highmem I have increased its visibility due the additional
> usages of the persistent kmaps for certain pte intensive workload and
> with a larger pool to scan).

I understand that, but if the mechanism doesn't work well, let's not
use it any more that we have to ;-) And I have a crazy plan to fix all
this that I'll send out shortly in another email with a more appropriate
title, but that's a bigger change.

> Persistent kmaps aren't meant to scale,

Indeed. If my thinking is correct, they scale as O(1/N^2) - the pool
size is ultimately fixed as we have a limited virtual address space;
more cpus means we use up the pool N times faster and the cost of
the global tlb_flush_all goes up by a factor of N as well.

> Correct. If shrinking the pool doesn't make significant difference (for

OK, here are the results of shrinking the kmap pool: compile times (with
lockmeter, so they may look different from before) go up from 40s (with
1024 pool) to 43s (with 128 pool) - presumably this is the extra cost
of the extra global tlbflushes.

The numbers from the profile I gave you yesterday were without lockmeter.
Looking at the profiles from both runs with lockmeter, we can see that
the high cost of kmap_high and kunmap_high themselves does indeed seem
to be due to the anomoly I noted earlier - we must be counting some of
the spin time, and the anomoly goes away with lockmeter installed. Profiles indicate no measurable kunmap_high, and kmap_high goes from about 238
(with 1024 pool) to 334 (with 128 pool) - the time actually *increases*.

lockstat (1024 pool):

33.4% 63.5% 5.4us(7893us) 155us( 16ms)(37.8%) 2551814 36.5% 63.5% 0% kmap_lock_cacheline
17.4% 64.9% 5.7us(7893us) 158us( 16ms)(19.7%) 1275907 35.1% 64.9% 0% kmap_high+0x34

lockstat (128 pool)

35.5% 67.9% 6.0us(1166us) 171us( 18ms)(43.3%) 2602716 32.1% 67.
9% 0% kmap_lock_cacheline
19.1% 69.6% 6.4us(1166us) 175us( 17ms)(22.7%) 1301358 30.4% 69.
6% 0% kmap_high+0x34

So (as expected from the previous sentence) lock times actually go up
by shrinking the pool.

I don't believe that kmap_high is really O(N) on the size of the pool.
Looking at the code for map_new_virtual, note that we start at where
we left off before: last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
So we don't scan the whole array every time - we just walk through it
one step (on most instances, assuming most of pool is short term use).
On a smaller pool, more of the pool is clogged with long term usage,
so we have more things to "step over" to find an available mapping,
so it's actually more expensive.

Thus I'd conclude your original idea to increase the size of the kmap
pool was perfectly correct.

> example it may be acceptable if it would reduce the level of overhead
> to the same one of lru_cache_add in the anonymous memory page fault that
> you also don't want on a NUMA-Q just for kernel compiles without the
> need of swapout anything) I can very easily drop the persistent kmap
> usage from my tree so you can try that way too (without adding the
> kernel pagetables in kernel stuff in 2.5 and without dropping the
> quicklist cpu affine cache like what happened in 2.5).

If you could give me a patch to do that, I'd be happy to try it out.

> BTW, before I drop the persistent kmaps from the pagetable handling you
> can also make a quick check by removing __GFP_HIGHMEM from the
> allocation in mm/memory.c:pte_alloc_one() and verifying the kmap_high
> overhead goes away during the kernel compiles (that basically disables
> the pte-highmem feature).

I added this change on top of the pool shrinkage (i.e. we're still at 128)
resulting in:

3.4% 4.1% 1.4us(1377us) 31us(1462us)(0.19%) 692386 95.9% 4.1%
0% kmap_lock_cacheline
2.9% 4.4% 2.4us(1377us) 39us(1333us)(0.13%) 346193 95.6% 4.4%
0% kmap_high+0x34

Much better ;-) And compile times are much better ... hard to say
exactly how much, due to some other complications that I won't
delve into ....

M.

2002-03-20 16:49:14

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, Mar 20, 2002 at 08:14:31AM -0800, Martin J. Bligh wrote:
> I don't believe that kmap_high is really O(N) on the size of the pool.

It is O(N), that's the worst case. Of course assuming that the number of
entries in the pool is N and that it is variable, for us it is variable
at compile time.

> Looking at the code for map_new_virtual, note that we start at where
> we left off before: last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
> So we don't scan the whole array every time - we just walk through it
> one step (on most instances, assuming most of pool is short term use).

and if we didn't find anything we call flush_all_zero_pkmaps that does a
whole O(N) scan on the pool to try to release the entries that aren't
pinned and then we try again. In short if we increase the pool size, we
linearly increase the time we spend on it (actually more than linearly
because we'll run out of l1/l2/l3 while the pool size increases)

> If you could give me a patch to do that, I'd be happy to try it out.

I will do that in the next -aa. I've quite a lot of stuff pending
(including Andrew split of my VM fixes to merge).

> I added this change on top of the pool shrinkage (i.e. we're still at 128)
> resulting in:
>
> 3.4% 4.1% 1.4us(1377us) 31us(1462us)(0.19%) 692386 95.9% 4.1%
> 0% kmap_lock_cacheline
> 2.9% 4.4% 2.4us(1377us) 39us(1333us)(0.13%) 346193 95.6% 4.4%
> 0% kmap_high+0x34
>
> Much better ;-) And compile times are much better ... hard to say

Good.

> exactly how much, due to some other complications that I won't
> delve into ....

then just for curiousity you can try to write a program doing:

open a file
while 1:
read 1 byte of it
lseek back to the start of the file

and run 16 copies simultanously and the kmap_lock_cacheline should raise
again similar to previous levels (no matter if it's the same file or
not). You'll also hit the pagecache_lock in such a workload of course
(like you're also hitting the lru_lock during the kernel compiles for
the lru_cache_add of the anonymous ram).

The only brainer problem with the total removal of the persistent kmaps
are the copy-users, what's your idea about it? (and of course I'm not
going to change that anyways in 2.4, it's not a showstopper)

Andrea

2002-03-20 17:41:45

by Rik van Riel

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, 20 Mar 2002, Andrea Arcangeli wrote:
> On Wed, Mar 20, 2002 at 08:14:31AM -0800, Martin J. Bligh wrote:

> > Looking at the code for map_new_virtual, note that we start at where
> > we left off before: last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
> > So we don't scan the whole array every time - we just walk through it
> > one step (on most instances, assuming most of pool is short term use).
>
> and if we didn't find anything we call flush_all_zero_pkmaps that does a
> whole O(N) scan on the pool to try to release the entries that aren't
> pinned and then we try again. In short if we increase the pool size, we
> linearly increase the time we spend on it (actually more than linearly
> because we'll run out of l1/l2/l3 while the pool size increases)

That suggests we'd want a pool size of 1 to be O(1) buzzword
compliant ;)

Going for a smaller pool just doesn't make sense if you want
the mappings to be cached, it could even result in more processes
_sleeping_ on kmap entries to be freed.

Please make a choice, do you want kmaps to be cached or would
you be content to have just cpu-local or maybe process-local
kmaps and get rid of the global kmap pool ?

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-20 18:12:58

by Hugh Dickins

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

My guess: persistent kmaps are okay, kmapped high pagetables are okay,
persistent kmapped high pagetables are okay. What's wrong is how we
flush_all_zero_pkmaps on all cpus, synchronously while holding the
kmap_lock everyone needs to get a new kmap (and hopefully more often,
just inc or dec the pkmap_count of kmap already got). That's what
cries out for redesign: it's served us well but should now be improved.

Hugh

2002-03-20 18:18:30

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

--On Wednesday, March 20, 2002 17:39:59 +0100 Andrea Arcangeli <[email protected]> wrote:

> On Wed, Mar 20, 2002 at 08:14:31AM -0800, Martin J. Bligh wrote:
>> I don't believe that kmap_high is really O(N) on the size of the pool.
>
> It is O(N), that's the worst case. Of course assuming that the number of
> entries in the pool is N and that it is variable, for us it is variable
> at compile time.
> ...
> and if we didn't find anything we call flush_all_zero_pkmaps that does a
> whole O(N) scan on the pool to try to release the entries that aren't
> pinned and then we try again. In short if we increase the pool size, we
> linearly increase the time we spend on it (actually more than linearly
> because we'll run out of l1/l2/l3 while the pool size increases)

Worst case I agree is about is O(N), though you're forgetting the cost of
the global tlb_flush_all. Average case isn't by a long way, it's more like
O(1).

N = size of pool - number of permanent maps (makes the math readable)
F = cost of global tlbflush.

cost without flush = 1, which happens every time we do an alloc.
cost of flush = N + F (N to go through flush_all_pkmaps, zeroing)

=> average cost = 1/N + (N+F)/N = (1+F+N)/N = 1 + (1+F)/N

So the cost (on average) actually gets cheaper as the pool size increases.
Yes, I've simpilified things a little by ignoring the case of stepping over
the permanently allocated maps (this get less frequent as pool size is
increased, thus this actually helps), and by ignoring the cache effects
you mention, but I'm sure you see my point.

If you wanted to fix the worst case for latency issues, I would think it
would be possible to split the pool in half - we could still allocate out
of one half whilst we flush the other half (release the lock during a
half-flush, but mark in a current_half variable (protected by the lock)
where newcomers should be currently allocating from). I haven't really
thought this last bit through, so it might be totally bogus.

But all this is moot, as what's killing me is really the global lock ;-)

>> If you could give me a patch to do that, I'd be happy to try it out.
>
> I will do that in the next -aa. I've quite a lot of stuff pending

Cool - thanks.

> The only brainer problem with the total removal of the persistent kmaps
> are the copy-users, what's your idea about it? (and of course I'm not
> going to change that anyways in 2.4, it's not a showstopper)

Writing that up now ...

M.

2002-03-20 18:27:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, Mar 20, 2002 at 02:41:14PM -0300, Rik van Riel wrote:
> Going for a smaller pool just doesn't make sense if you want
> the mappings to be cached, it could even result in more processes
> _sleeping_ on kmap entries to be freed.

256 simultaenous kmaps isn't that small, you need 256 task page faulting
simultaneously to get any of them sleeping, the bigger problem is the
frequency of the IPI and the reduced cache I think, not the
"_sleeping_". Infact it isn't much slower either. I had a feeling that
256 entreis was a kind of low-limit number to make persistent kmaps
still useful, even if I known it was a bit underpowered, but I needed to
check that it wasn't the O(N) pass hurting.

What I was worried about by seeing such long kmap_high time is that with
lots of concurrent tasks faulting, most of the entries were pinned most
of the time and so the pkmap pass had to be run much more frequently
and in a less useful manner, than on a system where the frequency of the
kmaps is high, but where the simultaneous kmaps aren't frequent.

Of course 2048 entries is better from a caching and tlb flushing
prospective (even if the O(N) pass takes more time), not incidentally I
was using 1024 instead of 128 :). (note that mainline only has 512 with
PAE, this is another reason for which I asked to try with 256)

> Please make a choice, do you want kmaps to be cached or would
> you be content to have just cpu-local or maybe process-local
> kmaps and get rid of the global kmap pool ?

One design idea I had to avoid the locking in the persistnt kmaps around
the copy-user, is to rewrite completly the persistnt code with a pool of
atomic kmaps per-CPU and to pin the task to the current CPU before doing
the copy-user, and then to count the number of reentrant persistent
kmaps happening in the current CPU and if it overflows we block waiting
somebody else to be wakenup and to kunmap. pros: it would avoid all the
locks (complete scalability, all per-cpu), it would avoid the IPI and
the global flush. cons: it will possibly waste some more address space
since not all NR_CPUS are going to be used in all machines, and it will
not be able to do any caching, so page->virtual can be dropped enterely
in x86 then, and it will reduce the ability of the scheduler to
reschedule in idle cpus during copy users around persistent kmaps. And
of course those kmaps should be dropped ASAP, they are meant only for
things like copy-users, or we risk to pin stuff in cpus, it should be
always the scheduler, not the kmap that chooses which cpu the task has
to run on after a wakeup.

And again, such reasoning and the above idea is very-high-end
scalability oriented, a 1G laptop would prefer the current persistent
kmaps, possibly also running invlpg during the O(N) pass instead of the
global tlb flush (we can't do that in SMP because it would take too long
to run such pass in all cpus, so we've to IPI a global flush instead).
And in turn I'm not going to make any change like this in 2.4, it simply
isn't important enough for the 90% of userbase, I will just limit to
avoid persistent kmaps in pte-highmem that is doable without any pain.

btw, this thread is been very useful to learn about some of the timings
and the problematics on the 16-ways NUMA-Q.

comments?

Andrea

2002-03-20 18:30:37

by Martin J. Bligh

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

> => average cost = 1/N + (N+F)/N = (1+F+N)/N = 1 + (1+F)/N

Actually, I think this should have read 1 + (N+F)/N = 1+ 1 + F/N = 2 +F/N
but the point is the same.

Sorry,

M.

2002-03-20 18:41:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, Mar 20, 2002 at 10:16:44AM -0800, Martin J. Bligh wrote:
> --On Wednesday, March 20, 2002 17:39:59 +0100 Andrea Arcangeli <[email protected]> wrote:
>
> > On Wed, Mar 20, 2002 at 08:14:31AM -0800, Martin J. Bligh wrote:
> >> I don't believe that kmap_high is really O(N) on the size of the pool.
> >
> > It is O(N), that's the worst case. Of course assuming that the number of
> > entries in the pool is N and that it is variable, for us it is variable
> > at compile time.
> > ...
> > and if we didn't find anything we call flush_all_zero_pkmaps that does a
> > whole O(N) scan on the pool to try to release the entries that aren't
> > pinned and then we try again. In short if we increase the pool size, we
> > linearly increase the time we spend on it (actually more than linearly
> > because we'll run out of l1/l2/l3 while the pool size increases)
>
> Worst case I agree is about is O(N), though you're forgetting the cost of
> the global tlb_flush_all. Average case isn't by a long way, it's more like
> O(1).

Common case is O(1) of course, but "average" it's a mean between "worst"
and "best/common". I'm not missing anything about the global flush
either.

It _enterely_ depends on the timings.

If the global flush takes 1 hour, and the O(N) pass takes 1usec, you'd
better increase N.

If the O(N) pass takes 1 hour, and the flush takes 1usec, you'd better
drecrease N.

This is always been clear, it's just that I _didn't_ know the exact
timings of your hardware before this thread, so I couldn't calculate
with math what was the best balance between the two (plus there's also
the fact of the process sleeping and the page->virtual cache size that
changes as well and that are other variables that you overlooked in your
below equation, that are non nearly not possible to evaluate with an
exact math formula as well).

2048 looked a fair enough balance, 8M of mapped cache, not an excessive
virtual space wasted, and not too high frequency of the flushes (that
probably are very costly on a 16way). Thanks to Hugh's hold-counter now
the kmaps can also be shared across the higher user. so the 8m of cache
are used at best for both pagecache and pagetables.

> N = size of pool - number of permanent maps (makes the math readable)
> F = cost of global tlbflush.
>
> cost without flush = 1, which happens every time we do an alloc.
> cost of flush = N + F (N to go through flush_all_pkmaps, zeroing)
>
> => average cost = 1/N + (N+F)/N = (1+F+N)/N = 1 + (1+F)/N
>
> So the cost (on average) actually gets cheaper as the pool size increases.
> Yes, I've simpilified things a little by ignoring the case of stepping over
> the permanently allocated maps (this get less frequent as pool size is
> increased, thus this actually helps), and by ignoring the cache effects
> you mention, but I'm sure you see my point.
>
> If you wanted to fix the worst case for latency issues, I would think it
> would be possible to split the pool in half - we could still allocate out
> of one half whilst we flush the other half (release the lock during a
> half-flush, but mark in a current_half variable (protected by the lock)
> where newcomers should be currently allocating from). I haven't really
> thought this last bit through, so it might be totally bogus.
>
> But all this is moot, as what's killing me is really the global lock ;-)

Quite a lot yes. Furthmore the global tlb flush is slow and it has to be
run atomically to avoid people to start using entries still cached in
the tlb with corrupted values. so it doesn't matter much the size of the
pool, the flush_pkmap pass still is slow due the global flush and the
other cpus stalls into the lock.

> >> If you could give me a patch to do that, I'd be happy to try it out.
> >
> > I will do that in the next -aa. I've quite a lot of stuff pending
>
> Cool - thanks.
>
> > The only brainer problem with the total removal of the persistent kmaps
> > are the copy-users, what's your idea about it? (and of course I'm not
> > going to change that anyways in 2.4, it's not a showstopper)
>
> Writing that up now ...

I outlined a possible way to do that too in the previous email, not sure
if there are possibility to make it completly scalar per-cpu, I hope so
because the pinning of the task doesn't look very appealing from a
scheduler standpoint.

Andrea

2002-03-20 18:57:29

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, Mar 20, 2002 at 06:15:10PM +0000, Hugh Dickins wrote:
> My guess: persistent kmaps are okay, kmapped high pagetables are okay,
> persistent kmapped high pagetables are okay. What's wrong is how we

In UP definitely :)

> flush_all_zero_pkmaps on all cpus, synchronously while holding the
> kmap_lock everyone needs to get a new kmap (and hopefully more often,
> just inc or dec the pkmap_count of kmap already got). That's what
> cries out for redesign: it's served us well but should now be improved.

I'm not really sure if the time spent into the O(N) pass is the problem, I
asked him to decrease it and the contention increased (probably due the
increased frequency of the global flushes).

the problem is that the page->virtual cache is global, and so when you
have to drop the page->virtual from the virtual page you need to make a
global tlb flush. It cannot be a local tlb flush, this is the problem,
and if you want to make it a local flush but still having the cache you
need a page->virtual[NR_CPUS] that is not feasible or it would waste too
much ram. However if we could at least run the global tlb flush outside
the spinlock that would just be a nice scalability optimization though,
but even that doesn't seem obvious to implement because being the
virtual entry shared, if we want to make it available or to get it, we
must first do a global flush to be sure not to crash at the first
schedule().

One way that would be completly scalar in the copy-users, that I
outlined in a previous email of the thread, is to make the pool local to
the cpu, but without page->virtual cache and by binding the task to the
current cpu and unbinding it at the kunmap. I don't see other ways to
get rid of the scalability issues in all the places.

Andrea

2002-03-20 19:38:26

by Rik van Riel

[permalink] [raw]
Subject: Re: Scalability problem (kmap_lock) with -aa kernels

On Wed, 20 Mar 2002, Andrea Arcangeli wrote:

> One design idea I had to avoid the locking in the persistnt kmaps around
> the copy-user, is to rewrite completly the persistnt code with a pool of
> atomic kmaps per-CPU and to pin the task to the current CPU before doing
> the copy-user, and then to count the number of reentrant persistent
> kmaps happening in the current CPU and if it overflows we block waiting
> somebody else to be wakenup and to kunmap. pros: it would avoid all the
> locks (complete scalability, all per-cpu), it would avoid the IPI and
> the global flush. cons: it will possibly waste some more address space
> since not all NR_CPUS are going to be used in all machines, and it will
> not be able to do any caching, so page->virtual can be dropped enterely
> in x86 then, and it will reduce the ability of the scheduler to
> reschedule in idle cpus during copy users around persistent kmaps.

Ahhhhh, but once you've reached that point, you might
as well make the kmap array PER PROCESS.

This has some advantages over the per-cpu scheme:
- no address space wastage
- caching is possible
- processes can be rescheduled
- the process can sleep while holding a kmap

It would still avoid the locks and flushes because we
can simply mark the address range used for these per
process kmaps as non-global so they'll be carried with
the process just like the user space page tables are.

This should result in somewhat simpler code than either
a global kmap array or a per-cpu kmap array. The fact
that the amount of kmap space scales with the number of
processes should also completely get rid of processes
sleeping on kmap space.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/