2004-03-04 03:15:39

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote:

> >
> > Definitely not what we expected, but a nice surprise nonetheless.
>
> this is the first time I hear something like this. Maybe you mean the
> 4:4 was actually using more ram for the SGA? Just curious.

I actually recently did MySQL benchmarks using the DBT2 MySQL port.

The test box was a 4-way Xeon w/ HT, 4GB RAM, 8 SATA disks in RAID10.

I used RH AS 3.0 for tests (2.4.21-9.ELxxx)

For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
1450TPM for "smp" kernel, which is some 14% slowdown.

For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
which is over 35% slowdown.





--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/


2004-03-04 03:33:56

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Peter Zaitsev <[email protected]> wrote:
>
> On Sat, 2004-02-28 at 17:43, Andrea Arcangeli wrote:
>
> > >
> > > Definitely not what we expected, but a nice surprise nonetheless.
> >
> > this is the first time I hear something like this. Maybe you mean the
> > 4:4 was actually using more ram for the SGA? Just curious.
>
> I actually recently did MySQL benchmarks using the DBT2 MySQL port.
>
> The test box was a 4-way Xeon w/ HT, 4GB RAM, 8 SATA disks in RAID10.
>
> I used RH AS 3.0 for tests (2.4.21-9.ELxxx)
>
> For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> 1450TPM for "smp" kernel, which is some 14% slowdown.

Please define these terms. What is the difference between "hugemem" and
"smp"?

> For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
> which is over 35% slowdown.

Well no, it is a 56% speedup. Please clarify. Lots.

2004-03-04 03:45:40

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Wed, 2004-03-03 at 19:33, Andrew Morton wrote:



> >
> > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > 1450TPM for "smp" kernel, which is some 14% slowdown.
>
> Please define these terms. What is the difference between "hugemem" and
> "smp"?

Andrew,


Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel
naming. "SMP" corresponds to the normal SMP kernel they have; "hugemem"
is the kernel with the 4G/4G split.

>
> > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
> > which is over 35% slowdown.
>
> Well no, it is a 56% speedup. Please clarify. Lots.

Huh. The numbers should be the other way around of course :) The "smp" kernel
had better performance, some 7000TPM, compared to 4500TPM with the
HugeMem kernel.

Swap was disabled in both cases.


--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/

2004-03-04 04:13:56

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Peter Zaitsev <[email protected]> wrote:
>
> Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel
> naming. "SMP" corresponds to the normal SMP kernel they have; "hugemem"
> is the kernel with the 4G/4G split.
>
> >
> > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
> > > which is over 35% slowdown.
> >
> > Well no, it is a 56% speedup. Please clarify. Lots.
>
> Huh. The numbers should be the other way around of course :) The "smp" kernel
> had better performance, some 7000TPM, compared to 4500TPM with the
> HugeMem kernel.

That's a larger difference than I expected. But then, everyone has been
mysteriously quiet with the 4g/4g benchmarking.

A kernel profile would be interesting. As would an optimisation effort,
which, as far as I know, has never been undertaken.

2004-03-04 04:46:35

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Wed, 2004-03-03 at 20:07, Andrew Morton wrote:

> > Huh. The numbers should be the other way around of course :) The "smp" kernel
> > had better performance, some 7000TPM, compared to 4500TPM with the
> > HugeMem kernel.
>
> That's a larger difference than I expected. But then, everyone has been
> mysteriously quiet with the 4g/4g benchmarking.

Yes. It is larger than I expected as well, but the numbers are pretty
reliable.

>
> A kernel profile would be interesting. As would an optimisation effort,
> which, as far as I know, has never been undertaken.

Just let me know which information you would like me to gather and how
and I'll get it for you.



--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/

2004-03-04 04:51:36

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
> That's a larger difference than I expected. But then, everyone has been

mysql is threaded (it's not using processes that force tlb flushes at
every context switch), so the only time a tlb flush ever happens is when
a syscall or an irq or a page fault happens with 4:4. No tlb flush
would ever happen with 3:1 in the whole workload (yeah, some background
tlb flushing happens anyway when you type a char in bash or move the
mouse of course, but it's very low frequency)

(to be fair, because it's threaded it means they also find the 512m of
address space lost more problematic than the DBs using processes do,
though besides the reduced address space there would be no measurable
slowdown with 2.5:1.5)

Also the 4:4 pretty much depends on vgettimeofday being backported
from the x86-64 tree and a userspace that uses it, so the test may be
repeated with vgettimeofday, though it's very possible mysql isn't using
gettimeofday as much as other databases; especially the I/O bound
workload shouldn't be affected that much by gettimeofday.

another reason could be the xeon bit, all numbers I've seen were on p3,
that's why I was asking about xeon and p4 or more recent.

all random ideas, just guessing.

> mysteriously quiet with the 4g/4g benchmarking.

indeed.

> A kernel profile would be interesting. As would an optimisation effort,
> which, as far as I know, has never been undertaken.

yes, though I doubt you'll find anything interesting in the kernel; the
slowdown should happen because the userspace runs slower, it's like
underclocking the cpu. It's not a bottleneck in the kernel that can be
optimized (at least unless there are bugs in the patch, which I don't
think there are).

2004-03-04 05:10:52

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrea Arcangeli <[email protected]> wrote:
>
> On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
> > That's a larger difference than I expected. But then, everyone has been
>
> mysql is threaded

There is a patch in -mm's 4g/4g implementation
(4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace
copies to happen under page_table_lock. In some threaded apps on SMP this
is likely to cause utterly foul performance.

That's why I'm keeping it as a separate patch. The problem which it fixes
is very obscure indeed and I suspect most implementors will simply drop it
after they've had a two-second peek at the profile results.

hm, I note that the changelog in that patch is junk. I'll fix that up.

Something like:

The current 4g/4g implementation does not guarantee the atomicity of
mprotect() on SMP machines. If one CPU is in the middle of a read() into
a user memory region and another CPU is in the middle of an
mprotect(!PROT_READ) of that region, it is possible for a race to occur
which will result in that read successfully completing _after_ the other
CPU's mprotect() call has returned.

We believe that this could cause misbehaviour of such things as the
boehm garbage collector. This patch provides the mprotect() atomicity by
performing all userspace copies under page_table_lock.
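
To make the interleaving concrete, here is a minimal userspace sketch of
the racy pattern (illustrative only; a single run won't reliably hit the
race):

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;

static void *reader(void *arg)
{
        /* the kernel stores into 'region' from inside read() */
        int fd = open("/dev/zero", O_RDONLY);

        if (read(fd, region, 4096) < 0)
                perror("read");
        close(fd);
        return NULL;
}

int main(void)
{
        pthread_t t;

        region = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_create(&t, NULL, reader, NULL);
        /* the guarantee at issue: once mprotect() returns, no in-flight
           read() on another CPU may still complete its store to 'region' */
        mprotect(region, 4096, PROT_NONE);
        pthread_join(t, NULL);
        return 0;
}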


It is a judgement call. Personally, I wouldn't ship a production kernel
with this patch. People need to be aware of the tradeoff and to think and
test very carefully.

2004-03-04 05:27:31

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Wed, Mar 03, 2004 at 09:10:42PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:
> > > That's a larger difference than I expected. But then, everyone has been
> >
> > mysql is threaded
>
> There is a patch in -mm's 4g/4g implementation
> (4g4g-locked-userspace-copy.patch) which causes all kernel<->userspace
> copies to happen under page_table_lock. In some threaded apps on SMP this
> is likely to cause utterly foul performance.

I see, I wasn't aware of this issue with the copy-user code, thanks
for the info. I definitely agree having a profile of the run would be
nice, since maybe part of the overhead is due to this lock (though I
doubt it's most of the overhead), so we can see if it was that spinlock
generating part of the slowdown.

> That's why I'm keeping it as a separate patch. The problem which it fixes
> is very obscure indeed and I suspect most implementors will simply drop it
> after they've had a two-second peek at the profile results.

I doubt one can ship without it without feeling a bit like cheating;
garbage collectors sometimes depend on mprotect to generate protection
faults, and it's not like nothing is using mprotect in racy ways against
other threads.

> It is a judgement call. Personally, I wouldn't ship a production kernel
> with this patch. People need to be aware of the tradeoff and to think and
> test very carefully.

test what? there's no way to know what sort of proprietary software
people will run on the thing.

Personally I wouldn't feel safe to ship a kernel with a known race
condition add-on. I mean, if you don't know about it and it's an
implementation bug, you know nobody is perfect and you try to fix it if
it happens; but if you know about it and you don't apply it, that's
pretty bad if something goes wrong. Especially because it's a race:
even if you test, it may still happen only a long time later in
production. I would never trade safety for performance; if anything, I'd
try to find a more complex way to serialize against the vmas or similar.

2004-03-04 05:38:31

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrea Arcangeli <[email protected]> wrote:
>
> > It is a judgement call. Personally, I wouldn't ship a production kernel
> > with this patch. People need to be aware of the tradeoff and to think and
> > test very carefully.
>
> test what? there's no way to know what sort of proprietary software
> people will run on the thing.

In the vast majority of cases the application was already racy. It took
davem a very long time to convince me that this was really a bug ;)

> Personally I wouldn't feel safe to ship a kernel with a known race
> condition add-on. I mean, if you don't know about it and it's an
> implementation bug, you know nobody is perfect and you try to fix it if
> it happens; but if you know about it and you don't apply it, that's
> pretty bad if something goes wrong. Especially because it's a race:
> even if you test, it may still happen only a long time later in
> production. I would never trade safety for performance; if anything, I'd
> try to find a more complex way to serialize against the vmas or similar.

Well first people need to understand the problem and convince themselves
that this really is a bug. And yes, there are surely other ways of fixing
it up. One might be to put some sequence counter in the mm_struct and
rerun the mprotect if it detects that someone else snuck in with a
usercopy. Or add an rwsem to the mm_struct and take it for writing in
mprotect.
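
A rough sketch of the rwsem variant ('usercopy_sem' is an invented field
name, not anything in an actual patch):

        /* copy_to_user()/copy_from_user() paths: shared */
        down_read(&mm->usercopy_sem);
        /* ... perform the user copy ... */
        up_read(&mm->usercopy_sem);

        /* sys_mprotect(): exclusive; waits out in-flight copies */
        down_write(&mm->usercopy_sem);
        /* ... change_protection() ... */
        up_write(&mm->usercopy_sem);

Copies would then run concurrently with each other and serialise only
against mprotect(), at the cost of an rwsem operation per usercopy.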

2004-03-04 12:12:47

by Rik van Riel

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Thu, 4 Mar 2004, Andrea Arcangeli wrote:
> On Wed, Mar 03, 2004 at 08:07:04PM -0800, Andrew Morton wrote:

> > A kernel profile would be interesting. As would an optimisation effort,
> > which, as far as I know, has never been undertaken.
>
> yes, though I doubt you'll find anything interesting in the kernel,

Oh, but there is a big bottleneck left, at least in RHEL3.

All the CPUs use the _same_ mm_struct in kernel space, so
all VM operations inside the kernel are effectively single
threaded.

Ingo had a patch to fix that, but it wasn't ready in time.
Maybe it is in the 2.6 patch set, maybe not ...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-04 16:22:34

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote:

Andrea,

> mysql is threaded (it's not using processes that force tlb flushes at
> every context switch), so the only time a tlb flush ever happens is when
> a syscall or an irq or a page fault happens with 4:4. No tlb flush
> would ever happen with 3:1 in the whole workload (yeah, some background
> tlb flushing happens anyway when you type a char in bash or move the
> mouse of course, but it's very low frequency)

Don't we also get a TLB flush due to latching, or are pthread_mutex_lock
etc. implemented without one nowadays?

>
> (to be fair, because it's threaded it means they also find the 512m of
> address space lost more problematic than the DBs using processes do,
> though besides the reduced address space there would be no measurable
> slowdown with 2.5:1.5)

Hm, what 512MB of address space loss are you speaking of here? Are threaded
programs only able to use 2.5G in the 3G/1G memory split?


>
> Also the 4:4 pretty much depends on vgettimeofday being backported
> from the x86-64 tree and a userspace that uses it, so the test may be
> repeated with vgettimeofday, though it's very possible mysql isn't using
> gettimeofday as much as other databases; especially the I/O bound
> workload shouldn't be affected that much by gettimeofday.

You're right. MySQL does not use gettimeofday very frequently now,
actually it uses time() most of the time, as some platforms used to have
huge performance problems with gettimeofday() in the past.

The amount of gettimeofday() use will increase dramatically in the
future so it is good to know about this matter.


--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/

2004-03-04 17:35:49

by Martin J. Bligh

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

> Peter Zaitsev <[email protected]> wrote:
>>
>> Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel
>> naming. "SMP" corresponds to the normal SMP kernel they have; "hugemem"
>> is the kernel with the 4G/4G split.
>>
>> >
>> > > For CPU bound load (10 Warehouses) I got 7000TPM instead of 4500TPM,
>> > > which is over 35% slowdown.
>> >
>> > Well no, it is a 56% speedup. Please clarify. Lots.
>>
>> Huh. The numbers should be the other way around of course :) The "smp" kernel
>> had better performance, some 7000TPM, compared to 4500TPM with the
>> HugeMem kernel.
>
> That's a larger difference than I expected. But then, everyone has been
> mysteriously quiet with the 4g/4g benchmarking.
>
> A kernel profile would be interesting. As would an optimisation effort,
> which, as far as I know, has never been undertaken.

In particular:

1. a diffprofile between the two would be interesting (assuming it's
at least partly an increase in kernel time), or any other way to see exactly
why it's slower (well, TLB flushes, obviously, but what's causing them).

2. If it's gettimeofday hammering it (which it probably is, from previous
comments by others, and my own experience), then vsyscall gettimeofday
(John's patch) may well fix it up.

3. Are you using the extra user address space? Otherwise yes, it'll be
all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is
designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
said before that DB performance can increase linearly with shared area
sizes (for some workloads), so that'd bring you a 100% or so increase
in performance for 4/4 to counter the loss.

M.

2004-03-04 18:13:17

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Thu, Mar 04, 2004 at 08:21:26AM -0800, Peter Zaitsev wrote:
> On Wed, 2004-03-03 at 20:52, Andrea Arcangeli wrote:
>
> Andrea,
>
> > mysql is threaded (it's not using processes that force tlb flushes at
> > every context switch), so the only time a tlb flush ever happens is when
> > a syscall or an irq or a page fault happens with 4:4. No tlb flush
> > would ever happen with 3:1 in the whole workload (yeah, some background
> > tlb flushing happens anyway when you type a char in bash or move the
> > mouse of course, but it's very low frequency)
>
> Don't we also get a TLB flush due to latching, or are pthread_mutex_lock
> etc. implemented without one nowadays?

pthread mutexes use futexes in NPTL and NGPT, or sched_yield in
LinuxThreads; either way they don't need to flush the tlb. The address
space is the same, so there's no need to change address space for the
mutex (otherwise mutexes would be very detrimental too). Kernel threads
don't require a tlb flush either.
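
For reference, a bare-bones sketch of the futex idea, using C11 atomics
for brevity (a real pthread mutex is more elaborate); the point is that
both the uncontended fast path and the contended FUTEX_WAIT stay within
one address space:

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word;            /* 0 = free, 1 = held */

static void lock(void)
{
        int expected = 0;

        while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
                /* contended: sleep in the kernel; same mm, no tlb flush */
                syscall(SYS_futex, &lock_word, FUTEX_WAIT, 1, NULL, NULL, 0);
                expected = 0;
        }
}

static void unlock(void)
{
        atomic_store(&lock_word, 0);
        /* wake one waiter, if any */
        syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0);
}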

> > (to be fair, because it's threaded it means they also find the 512m of
> > address space lost more problematic than the DBs using processes do,
> > though besides the reduced address space there would be no measurable
> > slowdown with 2.5:1.5)
>
> Hm, what 512MB of address space loss are you speaking of here? Are threaded
> programs only able to use 2.5G in the 3G/1G memory split?

I was talking about the 2.5:1.5 split here; 3:1 gives you 3G of address
space (both for threads and processes), while 2.5:1.5 would give you only
2.5G of address space to use (with a loss of 512m that is used by the
kernel to properly handle a 64G box).

> > Also the 4:4 pretty much depends on vgettimeofday being backported
> > from the x86-64 tree and a userspace that uses it, so the test may be
> > repeated with vgettimeofday, though it's very possible mysql isn't using
> > gettimeofday as much as other databases; especially the I/O bound
> > workload shouldn't be affected that much by gettimeofday.
>
> You're right. MySQL does not use gettimeofday very frequently now,
> actually it uses time() most of the time, as some platforms used to have
> huge performance problems with gettimeofday() in the past.
>
> The amount of gettimeofday() use will increase dramatically in the
> future so it is good to know about this matter.

If you noticed, Martin mentioned a >30% figure due to gettimeofday being
called frequently (w/o vsyscalls implementing vgettimeofday like on
x86-64); this figure certainly won't add to your current number
linearly, but you can expect a significant further loss by calling
gettimeofday dramatically more frequently.

2004-03-04 18:16:20

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote:
> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
> said before that DB performance can increase linearly with shared area
> sizes (for some workloads), so that'd bring you a 100% or so increase
> in performance for 4/4 to counter the loss.

that's a nice theory with the benchmarks that run with a 64G working
set, but if your working set is smaller than 32G 99% of the time and
you install the 64G to handle the peak load happening 1% of the time
faster, you'll run 30% slower 99% of the time even if the benchmark
stressing only the 64G working set runs a lot faster than with 32G only.

2004-03-04 19:32:21

by Martin J. Bligh

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

> On Thu, Mar 04, 2004 at 09:35:13AM -0800, Martin J. Bligh wrote:
>> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
>> said before that DB performance can increase linearly with shared area
>> sizes (for some workloads), so that'd bring you a 100% or so increase
>> in performance for 4/4 to counter the loss.
>
> that's a nice theory with the benchmarks that run with a 64G working
> set, but if your working set is smaller than 32G 99% of the time and
> you install the 64G to handle the peak load happening 1% of the time
> faster, you'll run 30% slower 99% of the time even if the benchmark
> stressing only the 64G working set runs a lot faster than with 32G only.

The amount of ram in the system, and the amount consumed by mem_map can,
I think, be taken as static for the purposes of this argument. So I don't
see why the total working set of the machine matters.

What does matter is the per-process user address space set - if the same
argument applied to that (ie most of the time, processes only use 1GB
of shmem each), then I'd agree with you. I don't know whether that's
true or not though ... I'll let the DB people argue that one out.

Much though people hate benchmarks, it's also important to be able to
prove that Linux can run as fast as RandomOtherOS in order to ensure
total world domination for Linux ;-) So it would be nice to ensure the
benchmarks at least have an option to be able to run as fast as possible.

M.

2004-03-04 20:22:38

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Thu, 2004-03-04 at 09:35, Martin J. Bligh wrote:

>
> 2. If it's gettimeofday hammering it (which it probably is, from previous
> comments by others, and my own experience), then vsyscall gettimeofday
> (John's patch) may well fix it up.

Well, as I wrote, MySQL does not use a lot of gettimeofday. It rather
has 2-3 calls to time() per query, but that is a very small number
compared to other syscalls.

>
> 3. Are you using the extra user address space? Otherwise yes, it'll be
> all downside. And 4/4 vs 3/1 isn't really a fair comparison ... 4/4 is
> designed for bigboxen, so 4/4 vs 2/2 would be better, IMHO. People have
> said before that DB performance can increase linearly with shared area
> sizes (for some workloads), so that'd bring you a 100% or so increase
> in performance for 4/4 to counter the loss.

I do not really understand this :)

I know 4/4 was designed for BigBoxes; however, we're more interested in
the side effect it has - having 4G per user process instead of 3G as in
the 3G/1G split. As MySQL is designed as a single process, this is
what's rather important for us.

I was not using the extra address space in this test, as the idea was to
see how much slowdown the 4G/4G split gives you with all else being the
same.

From other benchmarks I know what extra performance an extra 1GB used
for buffers can give.

Bringing these numbers together, I conclude that 4G/4G does not make
sense for most MySQL loads, as 1GB used for internal buffers (vs 1GB
used for file cache) will not give a high enough performance gain to
cover such a major speed loss.

There are exceptions of course, for example the case where your full
workload fits in a 3G cache but will not fit in 2G (a very edge case),
or the case where you need 4G just to manage 10000+ connections with
reasonable buffers etc, which is also far from the most typical scenario.

For "Big Boxes" I just would not advise a 32bit configuration at
all - happily, nowadays you can get 64bit pretty cheap.





--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/

2004-03-05 10:32:10

by Ingo Molnar

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Peter Zaitsev <[email protected]> wrote:

> > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > > 1450TPM for "smp" kernel, which is some 14% slowdown.
> >
> > Please define these terms. What is the difference between "hugemem" and
> > "smp"?
>
> Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel
> naming. "SMP" corresponds to the normal SMP kernel they have; "hugemem"
> is the kernel with the 4G/4G split.

the 'hugemem' kernel also has config_highpte defined which is a bit
redundant - that complexity one could avoid with the 4/4 split. Another
detail: the hugemem kernel also enables PAE, which adds another 2 usecs
to every syscall (!). So these performance numbers only hold if you are
running mysql on x86 using more than 4GB of RAM. (which, given mysql's
threaded design, doesnt make all that much sense.)

But no doubt, the 4/4 split is not for free. If a workload does lots of
high-frequency system-calls then the cost can be pretty high.

vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
for mysql. Also, the highly threaded nature of mysql on the same MM
is pretty much the worst case for the 4/4 design. If it's an
issue, there are multiple ways to mitigate this cost.

but 4/4 is mostly a life-extender for the high end of the x86 platform -
which is dying fast. If i were to decide between some of the highly
intrusive architectural highmem solutions (which all revolve about the
concept of dynamically mapping back and forth) and the simplicity of
4/4, i'd go for 4/4 unless forced otherwise.

Ingo

2004-03-05 14:14:26

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 11:33:08AM +0100, Ingo Molnar wrote:
>
> * Peter Zaitsev <[email protected]> wrote:
>
> > > > For Disk Bound workloads (200 Warehouse) I got 1250TPM for "hugemem" vs
> > > > 1450TPM for "smp" kernel, which is some 14% slowdown.
> > >
> > > Please define these terms. What is the difference between "hugemem" and
> > > "smp"?
> >
> > Sorry if I was unclear. These are suffixes from the RH AS 3.0 kernel
> > naming. "SMP" corresponds to the normal SMP kernel they have; "hugemem"
> > is the kernel with the 4G/4G split.
>
> the 'hugemem' kernel also has config_highpte defined which is a bit
> redundant - that complexity one could avoid with the 4/4 split. Another

the machine only has 4G of ram and you've a huge zone-normal, so I
guess it will cost no more than a percentage point or so.

> detail: the hugemem kernel also enables PAE, which adds another 2 usecs
> to every syscall (!). So these performance numbers only hold if you are
> running mysql on x86 using more than 4GB of RAM. (which, given mysql's
> threaded design, doesnt make all that much sense.)

are you saying you force _all_ people with >4G of ram to use 4:4?!?
that would be way way overkill. 8/16/32G boxes work perfectly with 3:1
with the stock 2.4 VM (after you nuke rmap).

> vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
> for mysql. Also, the highly threaded nature of mysql on the same MM

he said he doesn't use gettimeofday frequently, so most of the flushes
are from other syscalls.

> is pretty much the worst case for the 4/4 design. If it's an

definitely agreed.

> issue, there are multiple ways to mitigate this cost.

how? just curious.

2004-03-05 14:31:11

by Ingo Molnar

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Andrea Arcangeli <[email protected]> wrote:

> [...] 8/16/32G boxes work perfectly with 3:1 with the stock 2.4 VM
> (after you nuke rmap).

the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This
leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500
MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6
times more lowmem. So starting at 32 GB (but often much earlier) the 3/1
split breaks down. And obviously it's a no-go at 64 GB.

in between it all depends on the workload. If the 3:1 split works fine
then sure, use it. There's no one kernel that fits all sizes.

Ingo

2004-03-05 14:33:40

by Ingo Molnar

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Andrea Arcangeli <[email protected]> wrote:

> > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
^^^^^^^^^^^^^^^^^
> > for mysql. Also, the highly threaded nature of mysql on the same MM
>
> he said he doesn't use gettimeofday frequently, so most of the flushes
> are from other syscalls.

you are not reading Pete's and my emails too carefully, are you? Pete
said:

> [...] MySQL does not use gettimeofday very frequently now, actually it
> uses time() most of the time, as some platforms used to have huge
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> performance problems with gettimeofday() in the past.
>
> The amount of gettimeofday() use will increase dramatically in the
> future so it is good to know about this matter.

Ingo

2004-03-05 15:01:22

by Ingo Molnar

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Andrea Arcangeli <[email protected]> wrote:

> I thought time() wouldn't be called more than once per second anyway;
> why would anyone call time() more than once per second?

if mysql in fact calls time() frequently, then it should rather start a
worker thread that updates a global time variable every second.
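
A minimal sketch of that approach, assuming POSIX threads (see Jamie
Lokier's caveats about staleness later in the thread):

#include <pthread.h>
#include <time.h>
#include <unistd.h>

static volatile time_t cached_time;

static void *time_updater(void *arg)
{
        for (;;) {
                cached_time = time(NULL);       /* one syscall per second */
                sleep(1);
        }
        return NULL;
}

/* at startup:
 *         pthread_t t;
 *         cached_time = time(NULL);
 *         pthread_create(&t, NULL, time_updater, NULL);
 * hot paths then read cached_time instead of calling time() */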

Ingo

2004-03-05 14:57:59

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 03:32:10PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > [...] 8/16/32G boxes work perfectly with 3:1 with the stock 2.4 VM
> > (after you nuke rmap).
>
> the mem_map[] on 32G is 400 MB (using the stock 2.4 struct page). This
> leaves ~500 MB for the lowmem zone. It's ridiculously easy to use up 500

yes, mem_map_t takes 384M (32G / 4K = 8M struct pages at 48 bytes each);
that leaves us 879-384 = 495MB of zone-normal.

> MB of lowmem. 500 MB is a lowmem:RAM ratio of 1:60. With 4/4 you have 6
> times more lowmem. So starting at 32 GB (but often much earlier) the 3/1
> split breaks down. And obviously it's a no-go at 64 GB.

It's a no-go for 64G but I would be really pleased to see a workload
triggering the zone-normal shortage in 32G; I've never seen one. And
16G has even more margin.

Note that on a 32G box with my google-logic a correct kernel like latest
2.4 mainline reserves 100% of the zone-normal for allocations that cannot
go in highmem, plus the vm highmem fixes like the bh and inode
zone-normal related reclaims. Without that logic it would be easy to run
oom due to highmem allocations going into zone-normal, but that's just a
vm issue and it's fixed (all fixes should be in mainline already).

> in between it all depends on the workload. If the 3:1 split works fine
> then sure, use it. There's no one kernel that fits all sizes.

yes, the in-between definitely works fine, but there's always plenty of
margin even on the 32G in all the heavy workloads I've seen. I've not a
single pending report for 32G boxes; all the bugreports start at >=48G,
and that tells you those 32G users had 198M of margin free to use for
the peak loads, which is more than enough in practice. I agree it's not
a huge margin, but it's quite reasonable considering they've only 60-70%
of the zone-normal pinned during the workload.

2004-03-05 14:59:12

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 03:34:25PM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > > vsyscall-sys_gettimeofday and vsyscall-sys_time could help quite some
> ^^^^^^^^^^^^^^^^^
> > > for mysql. Also, the highly threaded nature of mysql on the same MM
> >
> > he said he doesn't use gettimeofday frequently, so most of the flushes
> > are from other syscalls.
>
> you are not reading Pete's and my emails too carefully, are you? Pete
> said:

I thought time() wouldn't be called more than once per second anyway;
why would anyone call time() more than once per second?

>
> > [...] MySQL does not use gettimeofday very frequently now, actually it
> > uses time() most of the time, as some platforms used to have huge
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > performance problems with gettimeofday() in the past.
> >
> > The amount of gettimeofday() use will increase dramatically in the
> > future so it is good to know about this matter.
>
> Ingo

2004-03-05 15:25:16

by Ingo Molnar

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Andrea Arcangeli <[email protected]> wrote:

> It's a no-go for 64G but I would be really pleased to see a workload
> triggering the zone-normal shortage in 32G; I've never seen one.
> [...]

have you tried TPC-C/TPC-H?

Ingo

2004-03-05 15:52:42

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 04:26:22PM +0100, Ingo Molnar wrote:
> have you tried TPC-C/TPC-H?

not sure, I'm not the one dealing with the testing, but most relevant
data is public on the official websites. The limit reached is around 5k
users with 8 cpus and 32G, and I don't recall that limit being zone-normal
bound. With 2.6 and bio and remap_file_pages we may reduce the
zone-normal usage as well (after dropping rmap).

But I definitely agree going past that with 3:1 is not feasible.

Overall we may argue about the 32G case (especially a 32-way would be more
problematic due to the 4 times higher per-cpu memory reservation in
zone-normal; I mean, 48M of zone-normal are just wasted in the page
allocator per-cpu logic, without counting the other per-cpu stuff. All
would be easily fixable by limiting the per-cpu sizes, though for 2.4 it's
probably not worth it), but I'm quite comfortable saying that up to
16G (included) 4:4 is worthless unless you've to deal with the rmap
waste IMHO. And <= 16G probably counts for 99% of machines out there,
which are handled optimally by 3:1.

2004-03-05 18:43:29

by Martin J. Bligh

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

> It's a no-go for 64G but I would be really pleased to see a workload
> triggering the zone-normal shortage in 32G; I've never seen one. And
> 16G has even more margin.

The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:

1. mem_map (obviously) (64GB = 704MB of mem_map)

2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)

3. Pagetables (pte_highmem helps; pmds still exist, but are less of a problem -
10,000 tasks would be 117MB)

4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)

5. rmap chains - this is the real killer without objrmap (even 1000 tasks
sharing a 2GB shmem segment will kill you without large pages).

6. vmas - weirdo Oracle things before remap_file_pages especially.

I may have forgotten some, but I think those were the main ones. 10,000 tasks
is a little heavy, but it's easy to scale the numbers around. I guess my main
point is that it's often as much to do with the number of tasks as it is
with just the larger amount of memory - but bigger machines tend to run more
tasks, so it often goes hand-in-hand.

Also bear in mind that as memory gets tight, the reclaimable things like
dcache and icache will get shrunk, which will hurt performance itself too,
so some of the cost of 4/4 is paid back there too. Without shared pagetables,
we may need highpte even on 4/4, which kind of sucks (can be 10% or so hit).

M.

2004-03-05 19:12:57

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 10:42:55AM -0800, Martin J. Bligh wrote:
> > It's a no-go for 64G but I would be really pleased to see a workload
> > triggering the zone-normal shortage in 32G; I've never seen one. And
> > 16G has even more margin.
>
> The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:
>
> 1. mem_map (obviously) (64GB = 704MB of mem_map)

I was asking about 32G; that's half of that, and it leaves 500M free.
64G is a no-go with 3:1.

>
> 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)

the vm is able to reclaim them before running oom, though it has a
performance cost.

> 3. Pagetables (pte_highmem helps; pmds still exist, but are less of a problem -
> 10,000 tasks would be 117MB)

pmds seem to be 13M for 10000 tasks, but maybe I did the math wrong.

>
> 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)

4k stacks then require allocating the task struct in the heap; it
still saves ram, but it's not very different.

>
> 5. rmap chains - this is the real killer without objrmap (even 1000 tasks
> sharing a 2GB shmem segment will kill you without large pages).

this overhead doesn't exist in 2.4.

> 6. vmas - weirdo Oracle things before remap_file_pages especially.

this is one of the main issues of 2.4.

> I may have forgotten some, but I think those were the main ones. 10,000 tasks
> is a little heavy, but it's easy to scale the numbers around. I guess my main
> point is that it's often as much to do with the number of tasks as it is
> with just the larger amount of memory - but bigger machines tend to run more
> tasks, so it often goes hand-in-hand.

yes, it's unlikely an 8-way with 32G can scale up to 10000 tasks
regardless, but maybe things change with a 32-way 32G.

The main thing you didn't mention is the overhead in the per-cpu data
structures; that alone generates an overhead of several dozen mbytes
only in the page allocator, without counting the slab caches,
pagetable caches etc. Putting a high limit on the per-cpu caches
should make a 32-way 32G work fine with 3:1 too though. 8-way is
fine with 32G currently.

other relevant things are the fs stuff like file handles per task and
other pinned slab things.

> Also bear in mind that as memory gets tight, the reclaimable things like
> dcache and icache will get shrunk, which will hurt performance itself too,

for these workloads (the 10000 tasks are the workloads we know very
well) dcache/icache doesn't matter, and still I find 3:1 a more generic
kernel than 4:4 for random workloads too. And if you don't run the 10000
tasks workload then you've the normal-zone free to use for dcache
anyway.

> so some of the cost of 4/4 is paid back there too. Without shared pagetables,
> we may need highpte even on 4/4, which kind of sucks (can be 10% or so hit).

I think pte-highmem is definitely needed on 4:4 too; even if you use
hugetlbfs, that won't cover PAE and the granular window, which is quite a
lot of the ram.

Overall shared pagetables don't pay off for their complexity; rather
than sharing the pagetables it's better not to allocate them in the
first place ;) (hugetlbfs/largepages).

The practical limit of the hardware was 5k tasks, not a kernel issue.
Your 10k example has never been tested, but obviously at some point a
limit will trigger (eventually the get_pid will stop finding a free pid
too ;)

2004-03-05 19:55:38

by Martin J. Bligh

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

>> The things I've seen consume ZONE_NORMAL (which aren't reclaimable) are:
>>
>> 1. mem_map (obviously) (64GB = 704MB of mem_map)
>
> I was asking about 32G; that's half of that, and it leaves 500M free.
> 64G is a no-go with 3:1.

Yup.

>> 2. Buffer_heads (much improved in 2.6, though not completely gone IIRC)
>
> the vm is able to reclaim them before running oom, though it has a
> performance cost.

It didn't use to in SLES8, at least. Maybe it does in 2.6 now; I know
Andrew worked on that a lot.

>> 3. Pagetables (pte_highmem helps; pmds still exist, but are less of a problem -
>> 10,000 tasks would be 117MB)
>
> pmds seem to be 13M for 10000 tasks, but maybe I did the math wrong.

3 pages per task = 12K per task = 120,000KB (~117MB) for 10,000 tasks.
Or that's the way I figured it, at least.

>> 4. Kernel stacks (10,000 tasks would be 78MB - 4K stacks would help obviously)
>
> 4k stacks then require allocating the task struct in the heap; it
> still saves ram, but it's not very different.

In 2.6, I think the task struct is outside the kernel stack either way.
Maybe you were pointing out something else? not sure.

> The main thing you didn't mention is the overhead in the per-cpu data
> structures; that alone generates an overhead of several dozen mbytes
> only in the page allocator, without counting the slab caches,
> pagetable caches etc. Putting a high limit on the per-cpu caches
> should make a 32-way 32G work fine with 3:1 too though. 8-way is
> fine with 32G currently.

Humpf. Do you have a hard figure on how much it actually is per cpu?

> other relevant things are the fs stuff like file handles per task and
> other pinned slab things.

Yeah, that was a huge one we forgot ... sysfs. Particularly with large
numbers of disks, IIRC, though other resources might generate similar
issues.

> I think pte-highmem is definitely needed on 4:4 too, even if you use
> hugetlbfs that won't cover PAE and the granular window which is quite a
> lot of the ram.
>
> Overall shared pageteables doesn't payoff for its complexity, rather
> than sharing the pagetables it's better not to allocate them in the
> first place ;) (hugetlbfs/largepages).

That might be another approach, yes ... some more implicit allocation
stuff would help here - modifying ISV apps is a PITA to get done, and
takes *forever*. Adam wrote some patches that are sitting in my tree,
some of which were ported forward from SLES8. But then we get into
massive problems with them not being swappable, so you need capabilities,
etc, etc. Ugh.

> The practical limit of the hardware was 5k tasks, not a kernel issue.
> Your 10k example has never been tested, but obviously at some point a
> limit will trigger (eventually the get_pid will stop finding a free pid
> too ;)

You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger
boxes will get progressively scarier ;-)

What scares me more is that we can sit playing counting games all day,
but there's always something we will forget. So I'm not keen on playing
brinkmanship games with customers' systems ;-)

M.

2004-03-05 20:11:54

by Jamie Lokier

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Ingo Molnar wrote:
> if mysql in fact calls time() frequently, then it should rather start a
> worker thread that updates a global time variable every second.

That has the same problem as discussed later in this thread with
vsyscall-time: the worker thread may not run immediately when it is woken,
and also setitimer() and select() round up the delay a little more
than expected, so sometimes the global time variable will be out of
date and misordered w.r.t. gettimeofday() and stat() results of
recently modified files.

Also, if there's paging the variable may be out of date by quite a
long time, so mlock() should be used to remove that aspect of the delay.

I don't know if such delays are a problem for MySQL.

-- Jamie

2004-03-05 20:20:08

by Jamie Lokier

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrew Morton wrote:
> We believe that this could cause misbehaviour of such things as the
> boehm garbage collector. This patch provides the mprotect() atomicity by
> performing all userspace copies under page_table_lock.

Can you use a read-write lock, so that userspace copies only need to
take the lock for reading? That doesn't eliminate cacheline bouncing
but does eliminate the serialisation.

Or did you do that already, and found performance is still very low?

> It is a judgement call. Personally, I wouldn't ship a production kernel
> with this patch. People need to be aware of the tradeoff and to think and
> test very carefully.

If this isn't fixed, _please_ provide a way for a garbage collector to
query the kernel as to whether this race condition is present.

-- Jamie

2004-03-05 20:29:09

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 11:55:05AM -0800, Martin J. Bligh wrote:
> It didn't use to in SLES8, at least. Maybe it does in 2.6 now; I know
> Andrew worked on that a lot.

it should in every SLES8 kernel out there too (it wasn't in mainline
until very recently), see the related bhs stuff.

> In 2.6, I think the task struct is outside the kernel stack either way.
> Maybe you were pointing out something else? not sure.

I meant that making the kernel stack 4k pretty much requires moving
the task_struct out of it; making it 4k w/o removing the task_struct
sounds too small.

> > The main thing you didn't mention is the overhead in the per-cpu data
> > structures; that alone generates an overhead of several dozen mbytes
> > only in the page allocator, without counting the slab caches,
> > pagetable caches etc. Putting a high limit on the per-cpu caches
> > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > fine with 32G currently.
>
> Humpf. Do you have a hard figure on how much it actually is per cpu?

not a definitive one, but it's sure more than 2m per cpu, could be 3m
per cpu.

> > other relevant things are the fs stuff like file handles per task and
> > other pinned slab things.
>
> Yeah, that was a huge one we forgot ... sysfs. Particularly with large
> numbers of disks, IIRC, though other resources might generate similar
> issues.

which doesn't need to be mounted during production; hotplug should
mount it, read it and unmount it. It's worthless to leave it mounted. Only
root-only hardware-related stuff should be in sysfs; everything else
that has been abstracted at the kernel level (transparent to
applications) should remain in /proc. Unmounting /proc hurts
production systems; unmounting sysfs should not.

> You mean with the 8cpu box you mentioned above? Yes, probably 5K. Larger
> boxes will get progressively scarier ;-)

yes.

> What scares me more is that we can sit playing counting games all day,
> but there's always something we will forget. So I'm not keen on playing
> brinkmanship games with customers' systems ;-)

this is true for 4:4 too. Also, with 2.4 the system will return -ENOMEM,
not like 2.6 which locks up the box. So it's not a fatal thing if a certain
kernel can't sustain a certain workload on certain hardware, just like
it's not a fatal thing if you run out of memory for the pagetables on a
64bit architecture with a 64bit kernel. My only objective is to make it
feasible to run the most high end workloads on the most high end hardware
with a good safety margin, knowing if something goes wrong the worst that
can happen is that a syscall returns -ENOMEM. There will always be a
malicious workload able to fill the zone-normal: if you fork off tons
of tasks, and you open a gazillion sockets and you flood all of them
at the same time to fill all receive windows, you'll fill your cool 4G
zone-normal of 4:4 in half a second with a 10gigabit NIC.

2004-03-05 20:33:01

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 08:19:55PM +0000, Jamie Lokier wrote:
> Andrew Morton wrote:
> > We believe that this could cause misbehaviour of such things as the
> > boehm garbage collector. This patch provides the mprotect() atomicity by
> > performing all userspace copies under page_table_lock.
>
> Can you use a read-write lock, so that userspace copies only need to
> take the lock for reading? That doesn't eliminate cacheline bouncing
> but does eliminate the serialisation.

normally the bouncing would be the only overhead, but here I also think
the serialization is a significant factor of the contention because the
critical section is taking lots of time. So I would expect some
improvement by using a read/write lock.

2004-03-05 20:41:12

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrea Arcangeli <[email protected]> wrote:
>
> > > The main thing you didn't mention is the overhead in the per-cpu data
> > > structures; that alone generates an overhead of several dozen mbytes
> > > only in the page allocator, without counting the slab caches,
> > > pagetable caches etc. Putting a high limit on the per-cpu caches
> > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > > fine with 32G currently.
> >
> > Humpf. Do you have a hard figure on how much it actually is per cpu?
>
> not a definitive one, but it's sure more than 2m per cpu, could be 3m
> per cpu.

It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL).

That's eight megs on 32-way. Maybe it can be trimmed back a bit, but on
32-way you probably want the locking amortisation more than the 8 megs.

The settings we have in there are still pretty much guesswork. I don't
think anyone has done any serious tuning on them. Any differences are
likely to be small.


2004-03-05 21:07:18

by Andrea Arcangeli

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > > The main thing you didn't mention is the overhead in the per-cpu data
> > > > structures; that alone generates an overhead of several dozen mbytes
> > > > only in the page allocator, without counting the slab caches,
> > > > pagetable caches etc. Putting a high limit on the per-cpu caches
> > > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > > > fine with 32G currently.
> > >
> > > Humpf. Do you have a hard figure on how much it actually is per cpu?
> >
> > not a definitive one, but it's sure more than 2m per cpu, could be 3m
> > per cpu.
>
> It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL).

3m per cpu with all 3m in zone normal.

2004-03-05 21:29:24

by Martin J. Bligh

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

> * Andrea Arcangeli <[email protected]> wrote:
>
>> It's a no-go for 64G but I would be really pleased to see a workload
>> triggering the zone-normal shortage in 32G; I've never seen one.
>> [...]
>
> have you tried TPC-C/TPC-H?

We're doing those here. Publishing results will be tricky due to their
draconian rules, but I'm sure you'll be able to read between the lines ;-)

OASB (Oracle apps) is the other total killer I've found in the past.

M.

2004-03-05 21:44:50

by Jamie Lokier

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrea Arcangeli wrote:
> > Can you use a read-write lock, so that userspace copies only need to
> > take the lock for reading? That doesn't eliminate cacheline bouncing
> > but does eliminate the serialisation.
>
> normally the bouncing would be the only overhead, but here I also think
> the serialization is a significant factor of the contention because the
> critical section is taking lots of time. So I would expect some
> improvement by using a read/write lock.

For something as significant as user<->kernel data transfers, it might
be worth eliminating the bouncing as well - by using per-CPU, per-mm
spinlocks.

User<->kernel data transfers would take the appropriate per-CPU lock
for the current mm, and not take page_table_lock. Everything that
normally takes page_table_lock would still do so, and would also take
all of the per-CPU locks.

That does require a set of per-CPU spinlocks to be allocated whenever
a new mm is allocated (although the sets could be cached so it needn't
be slow).
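
Sketched roughly (the 'usercopy_lock' array is invented for illustration;
this is the classic big-reader-lock pattern):

        /* user<->kernel copy path: touch only the local CPU's lock */
        spinlock_t *lock = &mm->usercopy_lock[smp_processor_id()];

        spin_lock(lock);
        /* ... perform the copy ... */
        spin_unlock(lock);

        /* mprotect() path: exclude copies on every CPU */
        int cpu;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                spin_lock(&mm->usercopy_lock[cpu]);
        /* ... change protections ... */
        for (cpu = NR_CPUS - 1; cpu >= 0; cpu--)
                spin_unlock(&mm->usercopy_lock[cpu]);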

-- Jamie

2004-03-05 22:10:22

by Andrew Morton

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Andrea Arcangeli <[email protected]> wrote:
>
> On Fri, Mar 05, 2004 at 12:41:19PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > > > The main thing you didn't mention is the overhead in the per-cpu data
> > > > > structures; that alone generates an overhead of several dozen mbytes
> > > > > only in the page allocator, without counting the slab caches,
> > > > > pagetable caches etc. Putting a high limit on the per-cpu caches
> > > > > should make a 32-way 32G work fine with 3:1 too though. 8-way is
> > > > > fine with 32G currently.
> > > >
> > > > Humpf. Do you have a hard figure on how much it actually is per cpu?
> > >
> > > not a definitive one, but it's sure more than 2m per cpu, could be 3m
> > > per cpu.
> >
> > It'll average out to 68 pages per cpu. (4 in ZONE_DMA, 64 in ZONE_NORMAL).
>
> 3m per cpu with all 3m in zone normal.

In the page allocator? How did you arrive at this figure?

2004-03-06 05:13:12

by Jamie Lokier

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Jamie Lokier wrote:
> Ingo Molnar wrote:
> > if mysql in fact calls time() frequently, then it should rather start a
> > worker thread that updates a global time variable every second.
>
> That has the same problem as discussed later in this thread with
> vsyscall-time: the worker thread may not run immediately when it is woken,
> and also setitimer() and select() round up the delay a little more
> than expected, so sometimes the global time variable will be out of
> date and misordered.
>
> I don't know if such delays are a problem for MySQL.

I still don't know about MySQL, but I have just encountered some code of
my own which does break if time() returns significantly out-of-date
values.

Any code which is structured like this will break:

time_t timeout = time(0) + TIMEOUT_IN_SECONDS;

do {
        /* Do some stuff which takes a little while. */
} while (time(0) <= timeout);

It goes wrong when time() returns a value that is in the past, and
then jumps forward to the correct time suddenly. The timeout of the
above code is reduced by the size of that jump. If the jump is larger
than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely.

That sort of code is a prime candidate for the method of using a
worker thread updating a global variable, so it's really important
to take care when using it.

-- Jamie

2004-03-06 12:57:13

by Magnus Naeslund(f)

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Jamie Lokier wrote:
[snip]
>
> Any code which is structured like this will break:
>
> time_t timeout = time(0) + TIMEOUT_IN_SECONDS;
>
> do {
>         /* Do some stuff which takes a little while. */
> } while (time(0) <= timeout);
>
> It goes wrong when time() returns a value that is in the past, and
> then jumps forward to the correct time suddenly. The timeout of the
> above code is reduced by the size of that jump. If the jump is larger
> than TIMEOUT_IN_SECONDS, the timeout mechanism is defeated completely.
>
> That sort of code is a prime candidate for the method of using a
> worker thread updating a global variable, so it's really important
> to take care when using it.
>

But isn't this kind of code a known buggy way of implementing timeouts?
Shouldn't it be like:

time_t x = time(0);
do {
        ...
} while (time(0) - x >= TIMEOUT_IN_SECONDS);

Of course it can't handle times in the past, but it won't easily get hung
with regard to leaps or wraparounds (if used with other functions).

Regards

Magnus


2004-03-06 13:13:44

by Magnus Naeslund(f)

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

Magnus Naeslund(t) wrote:
>
> But isn't this kind of code a known buggy way of implementing timeouts?
> Shouldn't it be like:
>
> time_t x = time(0);
> do {
>         ...
> } while (time(0) - x >= TIMEOUT_IN_SECONDS);

I meant:
} while (time(0) - x < TIMEOUT_IN_SECONDS);

Also if time_t is signed, that needs to be taken care of.
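
Putting the corrected pattern together, a sketch (function name invented);
casting the difference to unsigned makes a backwards time() jump fail
closed - the loop times out early instead of running longer:

#include <time.h>

#define TIMEOUT_IN_SECONDS 30

void do_stuff_with_timeout(void)
{
        time_t start = time(0);

        do {
                /* do some stuff which takes a little while */
        } while ((unsigned long)(time(0) - start) < TIMEOUT_IN_SECONDS);
}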

Magnus - butterfingers

2004-03-07 06:51:09

by Peter Zaitsev

Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Fri, 2004-03-05 at 07:02, Ingo Molnar wrote:
> * Andrea Arcangeli <[email protected]> wrote:
>
> > I thought time() wouldn't be called more than once per second anyway;
> > why would anyone call time() more than once per second?
>
> if mysql in fact calls time() frequently, then it should rather start a
> worker thread that updates a global time variable every second.

Ingo, Andrea,

I would not say MySQL calls time() that often; it is normally two
times per query (to measure query execution time), and perhaps a
couple of times more.

Looking at typical profiling results it takes much less than 1% of the
time, even for very simple query loads.

Rather than changing the design of how time is computed, I think we
would do better to improve the accuracy - nowadays 1 second is far too
coarse.
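
As an illustration of what finer-grained timing could look like (just
a sketch using gettimeofday(); this is not MySQL's actual code):

#include <stdio.h>
#include <sys/time.h>

/* Elapsed time between two timestamps, with microsecond resolution. */
static double elapsed_seconds(struct timeval start, struct timeval end)
{
        return (end.tv_sec - start.tv_sec) +
               (end.tv_usec - start.tv_usec) / 1e6;
}

int main(void)
{
        struct timeval t0, t1;

        gettimeofday(&t0, 0);
        /* ... run the query ... */
        gettimeofday(&t1, 0);

        printf("query took %.6f seconds\n", elapsed_seconds(t0, t1));
        return 0;
}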


--
Peter Zaitsev, Senior Support Engineer
MySQL AB, http://www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/

2004-03-07 08:40:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Andrea Arcangeli <[email protected]> wrote:

> [...] but I'm quite comfortable saying that up to 16G (included) 4:4
> is worthless unless you have to deal with the rmap waste IMHO. [...]

i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
lowmem zone. (it had to do with many files and having them as a big
dentry cache, so yes, it's unfixable unless you start putting inodes
into highmem which is crazy. And yes, performance broke down unless most
of the dentries/inodes were cached in lowmem.)

as i said - it all depends on the workload, and users are amazingly
creative at finding all sorts of workloads. Whether 4:4 or 3:1 is thus
workload dependent.

should lowmem footprint be reduced? By all means yes, but only as long
as it doesn't jeopardize the real 64-bit platforms. Is 3:1 adequate as
a generic x86 kernel for absolutely everything up to and including 16
GB? Strong no. [not to mention that 'up to 16 GB' is an artificial
limit created by us which won't satisfy an IHV that has a hw line with
RAM up to 32 or 64 GB. It doesn't matter that 90% of the customers
won't have that much RAM, it's a basic "can it scale to that much RAM"
question.]

so i think the right answer is to have 4:4 around to cover the bases -
and those users who have workloads that will run fine on 3:1 should run
3:1.

(not to mention the range of users who need 4GB _userspace_.)

but i'm quite strongly convinced that 'getting rid' of the 'pte chain
overhead' in favor of questionable lowmem space gains for a dying
(high-end server) platform is very shortsighted. [getting rid of them
for the purposes of the 64-bit platforms could be OK, but the
argumentation isn't that strong there, i think.]

Ingo

2004-03-07 10:29:51

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)



Ingo Molnar wrote:

>* Andrea Arcangeli <[email protected]> wrote:
>
>
>>[...] but I'm quite comfortable saying that up to 16G (included) 4:4
>>is worthless unless you have to deal with the rmap waste IMHO. [...]
>>
>
>i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
>lowmem zone. (it had to do with many files and having them as a big
>dentry cache, so yes, it's unfixable unless you start putting inodes
>into highmem which is crazy. And yes, performance broke down unless most
>of the dentries/inodes were cached in lowmem.)
>
>

If you still have any of these workloads around, they would be
good to test on the memory management changes in Andrew's mm tree,
which should correctly balance slab on highmem systems. Linus'
tree has a few problems here.

But if you really have a lot more than 800MB of active dentries,
then maybe 4:4 would be a win?

2004-03-07 11:54:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)


* Jamie Lokier <[email protected]> wrote:

> Ingo Molnar wrote:
> > if mysql in fact calls time() frequently, then it should rather start a
> > worker thread that updates a global time variable every second.
>
> That has the same problem as discussed later in this thread with
> vsyscall-time: the worker thread may not run immediately when it is
> woken, and setitimer() and select() also round up the delay a little
> more than expected, so sometimes the global time variable will be out
> of date and misordered w.r.t. gettimeofday() and stat() results of
> recently modified files.

we don't have any guarantees wrt. the synchronization of the time()
and the gettimeofday() clocks - irrespective of vsyscalls, do we?

Ingo

2004-03-07 17:23:27

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Sun, Mar 07, 2004 at 09:41:20AM +0100, Ingo Molnar wrote:
>
> * Andrea Arcangeli <[email protected]> wrote:
>
> > [...] but I'm quite comfortable saying that up to 16G (included) 4:4
> > is worthless unless you have to deal with the rmap waste IMHO. [...]
>
> i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
> lowmem zone. (it had to do with many files and having them as a big

was that a kernel with rmap or w/o rmap?

> but i'm quite strongly convinced that 'getting rid' of the 'pte chain
> overhead' in favor of questionable lowmem space gains for a dying
> (high-end server) platform is very shortsighted. [getting rid of them
> for purposes of the 64-bit platforms could be OK, but the argumentation
> isnt that strong there i think.]

disagree: the reason I'm doing it is for the 64-bit platforms, I
couldn't care less about x86. the vm is dog-slow with rmap.

2004-03-07 17:33:17

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)

On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote:
>
>
> Ingo Molnar wrote:
>
> >* Andrea Arcangeli <[email protected]> wrote:
> >
> >
> >>[...] but I'm quite comfortable saying that up to 16G (included) 4:4
> >>is worthless unless you have to deal with the rmap waste IMHO. [...]
> >>
> >
> >i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
> >lowmem zone. (it had to do with many files and having them as a big
> >dentry cache, so yes, it's unfixable unless you start putting inodes
> >into highmem which is crazy. And yes, performance broke down unless most
> >of the dentries/inodes were cached in lowmem.)
> >
> >
>
> If you still have any of these workloads around, they would be

I also have workloads that would die with 4:4 and rmap.

The question is whether they tested this on the stock 2.4 or 2.4-aa
VM, or on kernels with rmap.

Most kernels are also broken w.r.t. lowmem reservation; there are huge
vm design breakages in tons of 2.4 kernels out there, and those
breakages would generate lowmem shortages too. So just saying "the 8G
box runs out of lowmem" is meaningless unless we know exactly which
kind of 2.4 incarnation was running on that box.

For instance, google was running out of the lowmem zone even on 2.5G
boxes until I fixed it, and the fix was merged in mainline only around
2.4.23. So unless I'm sure all the relevant fixes were applied, "the
8G box runs out of lowmem" means nothing to me: it was running out of
lowmem for me too, for ages, even on 4G boxes, until I fixed all those
issues in the vm - issues not related to the pinned amount of memory.

Alternatively, if they can count the number of tasks and the number of
open files, we can do the math and count the megabytes of lowmem
pinned; that would demonstrate whether it was a limitation of 3:1
rather than a design bug of the vm in use on that box.
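
As a rough illustration of that math (the per-object sizes and the
task/file counts below are assumptions for a 2.4-era x86 kernel, not
measured figures):

#include <stdio.h>

/* Assumed per-object lowmem costs on 2.4/x86: two lowmem pages of
   kernel stack per task, and roughly 128 bytes per open struct file. */
#define KERNEL_STACK_BYTES 8192UL
#define STRUCT_FILE_BYTES  128UL

int main(void)
{
        unsigned long tasks = 10000;    /* hypothetical workload */
        unsigned long files = 100000;
        unsigned long pinned = tasks * KERNEL_STACK_BYTES +
                               files * STRUCT_FILE_BYTES;

        printf("~%lu MB of lowmem pinned\n", pinned >> 20);
        return 0;
}

With those assumed counts the pinned total comes out around 90 MB,
nowhere near an ~800 MB lowmem zone; that is the kind of comparison
that would settle the question.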

2004-03-08 05:15:24

by Nick Piggin

[permalink] [raw]
Subject: Re: 2.4.23aa2 (bugfixes and important VM improvements for the high end)



Andrea Arcangeli wrote:

>On Sun, Mar 07, 2004 at 09:29:37PM +1100, Nick Piggin wrote:
>
>>
>>Ingo Molnar wrote:
>>
>>
>>>* Andrea Arcangeli <[email protected]> wrote:
>>>
>>>
>>>
>>>>[...] but I'm quite comfortable saying that up to 16G (included) 4:4
>>>>is worthless unless you have to deal with the rmap waste IMHO. [...]
>>>>
>>>>
>>>i've seen workloads on 8G RAM systems that easily filled up the ~800 MB
>>>lowmem zone. (it had to do with many files and having them as a big
>>>dentry cache, so yes, it's unfixable unless you start putting inodes
>>>into highmem which is crazy. And yes, performance broke down unless most
>>>of the dentries/inodes were cached in lowmem.)
>>>
>>>
>>>
>>If you still have any of these workloads around, they would be
>>
>
>I also have workloads that would die with 4:4 and rmap.
>
>

I don't doubt that, and of course no amount of tinkering with
reclaim will help where you are dying due to pinned lowmem.

Ingo's workload sounded like one where the slab cache reclaim
improvements in recent -mm kernels might possibly help. I was purely
interested in it for testing the reclaim changes.