2008-06-06 06:17:37

by Solofo.Ramangalahy

Subject: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

The size in bytes of a SysV IPC message queue, msgmnb, is too small
for large machines, but we don't want to bloat small machines.

Several methods are used already to modify (mainly increase) msgmnb:
. distribution specific patch
. system wide sysctl.conf
. application specific tuning via /proc/sys/kernel/msgmnb

Integrating this series would:
. reflect hardware and software evolutions and diversity,
. reduce configuration/tuning for the applications.

Here is the timeline of the evolution of MSG* #defines:
Year            1994   1999    1999    2008
Version         1.0    2.3.27  2.3.30  2.6.24
#define MSGMNI  128    128     16      16
#define MSGMAX  4056   8192    8192    8192
#define MSGMNB  16384  16384   16384   16384

This patch series scales msgmnb, with respect to the number of
cpus/cores for larger machines. For uniprocessor machines the value
does not increase.

This series is similar to (and depends on) the series which scales
msgmni, the number of IPC message queue identifiers, to the amount of
low memory.
While Nadia's previous series scaled msgmni along the memory axis,
hence the message pool (msgmni x msgmnb), this series uses a second
axis: the number of online CPUs.
As well as covering the (cpu, memory) space of machine sizes, this
reflects the parallelism allowed by lockless send/receive for
in-flight messages in queues (msgmnb / msgmax messages).

The initial scaling is done at initialization of the ipc namespace.
Furthermore, the value becomes dynamic with respect to cpu hotplug.

The msgmni and msgmnb values become dependent, as the value of msgmni
is computed with respect to the value of msgmnb.
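As a rough sketch of that dependence (the MSG_MEM_SCALE divisor and the function name here are assumptions; the real computation also accounts for the number of ipc namespaces and for reserved memory, so this is illustrative rather than the patch code):

```c
#include <assert.h>

#define MSGMNI        16   /* historical default, kept as a floor */
#define MSG_MEM_SCALE 32   /* assumed: fraction of low memory granted to the pool */

/* Sketch: msgmni is derived from low memory and from msgmnb so that the
 * whole message pool (msgmni * msgmnb) stays a bounded fraction of RAM.
 * Raising msgmnb therefore forces msgmni to be recomputed. */
static unsigned long compute_msgmni(unsigned long lowmem_bytes,
                                    unsigned long msgmnb)
{
    unsigned long allowed = lowmem_bytes / MSG_MEM_SCALE / msgmnb;
    return allowed < MSGMNI ? MSGMNI : allowed;
}
```

Under these assumptions, 4 GB of memory and msgmnb = 65536 give msgmni = 2048, the same ballpark as the 1982 reported later in this thread for a 4 GB machine.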

The series is as follows:
. patch 1 introduces the scaling function
. patch 2 deals with cpu hotplug
. patch 3 allows user space to disable the scaling mechanism
. patch 4 allows user space to reenable the scaling mechanism
. patch 5 finer grain disabling/reenabling scaling mechanism
(disconnect msgmnb and msgmni)
. patch 6 adds documentation

---

The series applies to 2.6.26-rc2-mm1 + patch suppressing KERN_INFO
messages as discussed at:
http://article.gmane.org/gmane.linux.kernel/686229
"[PATCH 1/1] Only output msgmni value at boot time"
(in mmotm: ipc-only-output-msgmni-value-at-boot-time.patch)

The plan would be to have this ready for the 2.6.27 merge window if
there are no objections.

Documentation/sysctl/kernel.txt |  27 ++++++++++++++++++++++
include/linux/ipc_namespace.h   |   4 ++-
include/linux/msg.h             |   5 ++++
ipc/ipc_sysctl.c                |  48 ++++++++++++++++++++++++++++++----------
ipc/ipcns_notifier.c            |  23 +++++++------------
ipc/msg.c                       |  25 +++++++++++++++++---
ipc/util.c                      |  28 +++++++++++++++++++++++
ipc/util.h                      |   1
8 files changed, 131 insertions(+), 30 deletions(-)

--
Solofo Ramangalahy
Bull SA.


2008-06-06 08:23:49

by Nick Piggin

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

On Friday 06 June 2008 16:09, [email protected] wrote:
> The size in bytes of a SysV IPC message queue, msgmnb, is too small
> for large machines, but we don't want to bloat small machines

What's your evidence for this? Can you provide before / after
performance numbers?

Also, when scaling things like this, it is probably more usual
to use a log scale rather than linear, so that's a thought.

>
> Several methods are used already to modify (mainly increase) msgmnb:
> . distribution specific patch
> . system wide sysctl.conf
> . application specific tuning via /proc/sys/kernel/msgmnb
>
> Integrating this series would:
> . reflect hardware and software evolutions and diversity,
> . reduce configuration/tuning for the applications.
>
> Here is the timeline of the evolution of MSG* #defines:
> Year 1994 1999 1999 2008
> Version 1.0 2.3.27 2.3.30 2.6.24
> #define MSGMNI 128 128 16 16
> #define MSGMAX 4056 8192 8192 8192
> #define MSGMNB 16384 16384 16384 16384
>
> This patch series scales msgmnb, with respect to the number of
> cpus/cores for larger machines. For uniprocessor machines the value
> does not increase.
>
> This series is similar to (and depends on) the series which scales
> msgmni, the number of IPC message queue identifiers, to the amount of
> low memory.
> While Nadia's previous series scaled msgmni along the memory axis,
> hence the message pool (msgmni x msgmnb), this series uses a second
> axis: the number of online CPUs.
> As well as covering the (cpu,memory) space of machines size, this
> reflects the parallelism allowed by lockless send/receive for
> in-flight messages in queues (msgmnb / msgmax messages).
>
> The initial scaling is done at initialization of the ipc namespace.
> Furthermore, the value becomes dynamic with respect to cpu hotplug.
>
> The msgmni and msgmnb values become dependent, as the value of msgmni
> is computed with respect to the value of msgmnb.
>
> The series is as follows:
> . patch 1 introduces the scaling function
> . patch 2 deals with cpu hotplug
> . patch 3 allows user space to disable the scaling mechanism
> . patch 4 allows user space to reenable the scaling mechanism
> . patch 5 finer grain disabling/reenabling scaling mechanism
> (disconnect msgmnb and msgmni)
> . patch 6 adds documentation
>
> ---
>
> The series applies to 2.6.26-rc2-mm1 + patch suppressing KERN_INFO
> messages as discussed at:
> http://article.gmane.org/gmane.linux.kernel/686229
> "[PATCH 1/1] Only output msgmni value at boot time"
> (in mmotm: ipc-only-output-msgmni-value-at-boot-time.patch)
>
> The plan would be to have this ready for the 2.6.27 merge window if
> there are no objections.
>
> Documentation/sysctl/kernel.txt | 27 ++++++++++++++++++++++
> include/linux/ipc_namespace.h | 4 ++-
> include/linux/msg.h | 5 ++++
> ipc/ipc_sysctl.c | 48 ++++++++++++++++++++++++++++++----------
> ipc/ipcns_notifier.c | 23 +++++++------------
> ipc/msg.c | 25 +++++++++++++++++---
> ipc/util.c | 28 +++++++++++++++++++++++
> ipc/util.h | 1
> 8 files changed, 131 insertions(+), 30 deletions(-)

2008-06-06 10:20:37

by Solofo.Ramangalahy

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

Hi Nick,

Nick Piggin writes:
> On Friday 06 June 2008 16:09, [email protected] wrote:
> > The size in bytes of a SysV IPC message queue, msgmnb, is too small
> > for large machines, but we don't want to bloat small machines
>
> What's your evidence for this? Can you provide before / after
> performance numbers?

Maybe I have not been clear enough: this is not directly about
performance, but rather about changing the default value. So maybe
"scale" in the title is misleading.

The evidence would be that these default values are changed either
by a patch or "manually":
> > Several methods are used already to modify (mainly increase) msgmnb:
> > . distribution specific patch
> > . system wide sysctl.conf
> > . application specific tuning via /proc/sys/kernel/msgmnb

Further "evidence" could be found by googling for "linux msgmnb 65536":
tuning guides for benchmarks and recommended application configurations
increase the value.

This just sets default values. Performance test results would be no
different from those obtained by setting the values manually before
running the tests.

So here:
> > Here is the timeline of the evolution of MSG* #defines:
> > Year 1994 1999 1999 2008
> > Version 1.0 2.3.27 2.3.30 2.6.24
> > #define MSGMNI 128 128 16 16
> > #define MSGMAX 4056 8192 8192 8192
> > #define MSGMNB 16384 16384 16384 16384

I get 65536 instead of 16384 for msgmnb
(and 1982 instead of 16 for msgmni)
on my 4-cpu/4GB x86_64 machine.

Some results with pmsg, used in recent discussions about performance:
16384/16:
./pmsg 4 10 |grep Total
Total: 9795993

65536/1982:
./pmsg 4 10 |grep Total
Total: 9829590

> Also, when scaling things like this, it is probably more usual to
> use a log scale rather than linear, so that's a thought.

Agreed, in general.
Here, there are only 4 values, so I do not think it is worth using a
log scale.
If different values are desirable (finer grain, bigger, ...), then the
formula can easily be refined:
min(MSGMNB * num_online_cpus(), MSGMNB * MSG_CPU_SCALE);
What I did for the formula was simply to take the old value, a known
modified value, and the intermediate values.
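Written out, with MSG_CPU_SCALE assumed to be 4 (inferred from the 65536 = 4 x 16384 target value; a sketch, not the patch itself), the formula behaves like this:

```c
#include <assert.h>

#define MSGMNB        16384  /* historical default max bytes per queue */
#define MSG_CPU_SCALE 4      /* assumed cap factor: 4 * MSGMNB = 65536 */

/* min(MSGMNB * num_online_cpus(), MSGMNB * MSG_CPU_SCALE):
 * linear in the number of online CPUs, capped at the known-good 65536. */
static unsigned long scale_msgmnb(unsigned int ncpus)
{
    unsigned long scaled = (unsigned long)MSGMNB * ncpus;
    unsigned long cap    = (unsigned long)MSGMNB * MSG_CPU_SCALE;
    return scaled < cap ? scaled : cap;
}
```

So the only four values produced are 16384, 32768, 49152 and 65536; a uniprocessor keeps the old default.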

I hope this answers your questions,
--
solofo




2008-06-07 14:38:25

by Manfred Spraul

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

[email protected] wrote:
> The size in bytes of a SysV IPC message queue, msgmnb, is too small
> for large machines, but we don't want to bloat small machines
>
> Several methods are used already to modify (mainly increase) msgmnb:
> . distribution specific patch
> . system wide sysctl.conf
> . application specific tuning via /proc/sys/kernel/msgmnb
>
>
Which distributions use a patch?

The whole configuration can be done from user space, thus I assumed that
a sysctl.conf value (or in the worst case: a dbus/hal daemon that
updates /proc/sys/kernel/msgmnb) could do the job.

--
Manfred

2008-06-08 07:19:24

by Solofo.Ramangalahy

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

Manfred Spraul writes:
> > The size in bytes of a SysV IPC message queue, msgmnb, is too small
> > for large machines, but we don't want to bloat small machines
> >
> > Several methods are used already to modify (mainly increase) msgmnb:
> > . distribution specific patch
> > . system wide sysctl.conf
> > . application specific tuning via /proc/sys/kernel/msgmnb
> >
> >
> Which distributions use a patch?

opensuse has this:

"The defaults are too small for most users."

[...]
#define MSGMNI 16 /* <= IPCMNI */ /* max # of msg queue identifiers */
-#define MSGMAX 8192 /* <= INT_MAX */ /* max size of message (bytes) */
-#define MSGMNB 16384 /* <= INT_MAX */ /* default max size of a message queue */
+#define MSGMAX 65536 /* <= INT_MAX */ /* max size of message (bytes) */
+#define MSGMNB 65536 /* <= INT_MAX */ /* default max size of a message queue */
[...]

--
solofo

2008-06-10 06:55:45

by Nadia Derbey

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

[email protected] wrote:
> The size in bytes of a SysV IPC message queue, msgmnb, is too small
> for large machines, but we don't want to bloat small machines
>
> Several methods are used already to modify (mainly increase) msgmnb:
> . distribution specific patch
> . system wide sysctl.conf
> . application specific tuning via /proc/sys/kernel/msgmnb
>
> Integrating this series would:
> . reflect hardware and software evolutions and diversity,
> . reduce configuration/tuning for the applications.
>
> Here is the timeline of the evolution of MSG* #defines:
> Year 1994 1999 1999 2008
> Version 1.0 2.3.27 2.3.30 2.6.24
> #define MSGMNI 128 128 16 16
> #define MSGMAX 4056 8192 8192 8192
> #define MSGMNB 16384 16384 16384 16384
>
> This patch series scales msgmnb, with respect to the number of
> cpus/cores for larger machines. For uniprocessor machines the value
> does not increase.
>
> This series is similar to (and depends on) the series which scales
> msgmni, the number of IPC message queue identifiers, to the amount of
> low memory.
> While Nadia's previous series scaled msgmni along the memory axis,
> hence the message pool (msgmni x msgmnb), this series uses a second
> axis: the number of online CPUs.
> As well as covering the (cpu,memory) space of machines size, this
> reflects the parallelism allowed by lockless send/receive for
> in-flight messages in queues (msgmnb / msgmax messages).
>
> The initial scaling is done at initialization of the ipc namespace.
> Furthermore, the value becomes dynamic with respect to cpu hotplug.
>
> The msgmni and msgmnb values become dependent, as the value of msgmni
> is computed with respect to the value of msgmnb.
>
> The series is as follows:
> . patch 1 introduces the scaling function
> . patch 2 deals with cpu hotplug
> . patch 3 allows user space to disable the scaling mechanism
> . patch 4 allows user space to reenable the scaling mechanism
> . patch 5 finer grain disabling/reenabling scaling mechanism
> (disconnect msgmnb and msgmni)
> . patch 6 adds documentation
>

Solofo,

Patches 3 and 4 are useless imho. If you really really want to keep
them, you should at least merge them.

Regards,
Nadia

2008-06-23 13:14:54

by Solofo.Ramangalahy

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

Hi Manfred,

This part is more difficult to answer than the other one:
> The whole configuration can be done from user space, thus I assumed that
> a sysctl.conf value [...] could do the job

Yes, while this is (still) possible, it can become cumbersome with
namespaces, hotplug, etc.

> (or in the worst case: a dbus/hal daemon that
> updates /proc/sys/kernel/msgmnb) [...]

This would probably mean one daemon per ipc namespace.
The patches seem lighter.

There have been related discussions regarding the kernel space
vs. user space approach in the threads:
. "Change in default vm_dirty_ratio"
http://lkml.org/lkml/2007/6/18/471
. "[RFC][PATCH 0/6] Automatic kernel tunables (AKT)"
http://lkml.org/lkml/2007/1/16/16
(and probably others)

So it seems there is no consensus on doing it either way: kernel or
user space.

Hmm... now this makes me think that you did not change the MSGMNB
value when you changed MSGMNI and MSGMAX.
Maybe that was on purpose?

--
solofo

2008-06-24 18:00:50

by Manfred Spraul

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

[email protected] wrote:
> Humm... now this make me think that you did not change the MSGMNB
> value when you changed MSGMNI and MSGMAX.
> Maybe that was on purpose?
>
>
I was afraid that it might break user space applications that queue a
few kb of messages.
That's also the reason for
> if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
> 1 + msq->q_qnum <= msq->q_qbytes) {
> break;
> }
It's possible to send 0-byte messages even if the message queue is full
[except that you can't send more than MSGMNB messages].
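The quoted check can be read in isolation as follows (field names taken from the snippet above; a standalone sketch, not the kernel code):

```c
#include <assert.h>
#include <stddef.h>

/* The msgsnd() admission check quoted above: a message may be queued if
 * both the byte total and the message count stay within q_qbytes (msgmnb).
 * For a 0-byte message the byte term always holds, so only the
 * "1 + q_qnum <= q_qbytes" term limits it: at most msgmnb messages can be
 * queued, even when the queue is already byte-full. */
static int msg_fits(size_t msgsz, size_t q_cbytes,
                    size_t q_qnum, size_t q_qbytes)
{
    return msgsz + q_cbytes <= q_qbytes && 1 + q_qnum <= q_qbytes;
}
```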

--
Manfred

2008-06-25 06:21:32

by Solofo.Ramangalahy

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

Manfred Spraul writes:
> > Humm... now this make me think that you did not change the MSGMNB
> > value when you changed MSGMNI and MSGMAX.
> > Maybe that was on purpose?
> >
> >
> I was afraid that it might break user space applications that queue a
> few kb of messages.

OK, the choice of a maximum value of 65536, which has already been in
use for several months/years, was made partly out of the same concern.
Besides, as the values are not enforced, we should be relatively safe.

Searching the archives, I also found usage of a value "around a MB".

> That's also the reason for
> > if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
> > 1 + msq->q_qnum <= msq->q_qbytes) {
> > break;
> > }
> It's possible to send 0-byte messages even if the message queue is full
> [except that you can't send more than MSGMNB messages].

Thanks for this information. I should add that I checked that no
regression was introduced with ltp-full-20080531, but I did not look
more closely (e.g. at the coverage of this part of the code).

--
solofo

2008-06-25 10:12:21

by Nadia Derbey

Subject: Re: [RFC -mm 0/6] sysv ipc: scale msgmnb with the number of cpus

Manfred Spraul wrote:
> [email protected] wrote:
>
>> Humm... now this make me think that you did not change the MSGMNB
>> value when you changed MSGMNI and MSGMAX.
>> Maybe that was on purpose?
>>
>>
>
> I was afraid that it might break user space applications that queue a
> few kb of messages.
> That's also the reason for
>
>> if (msgsz + msq->q_cbytes <= msq->q_qbytes &&
>> 1 + msq->q_qnum <= msq->q_qbytes) {
>> break;
>> }
>
> It's possible to send 0-byte messages even if the message queue is full
> [except that you can't send more than MSGMNB messages].
>

Manfred,

If I'm not missing anything when reading the code, sending up to MSGMNB
0-byte messages would make us enqueue MSGMNB msg_msg structures (this
is the worst case, where no receiver is waiting for those messages).
==> MSGMNB * 24 bytes (or 48 bytes in 64-bit mode)
==> 384 KB with the current MSGMNB value (16K).

But 1.5 MB with MSGMNB=64K
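For reference, that arithmetic as a sketch (the 24/48-byte msg_msg header sizes are the figures quoted above, not something this snippet measures):

```c
#include <assert.h>

/* Worst case: msgmnb queued 0-byte messages, each pinning one msg_msg
 * header (24 bytes on 32-bit, 48 bytes on 64-bit, per the figures above). */
static unsigned long worst_case_bytes(unsigned long msgmnb,
                                      unsigned long msg_msg_size)
{
    return msgmnb * msg_msg_size;
}
```

That is 16384 * 24 = 393216 bytes (384 KB) with today's default, versus 65536 * 24 = 1572864 bytes (1.5 MB) with the proposed maximum, and 3 MB on 64-bit.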

Even if it is a worst case, it should be considered, and maybe we should
refine the formula Solofo has proposed if you think this is not a
reasonable value.
Maybe add a dependency on the memory size?

Regards,
Nadia