The size in bytes of a SysV IPC message queue, msgmnb, is too small
for large machines, but we don't want to bloat small machines.
This series change ("scale") the default value of
/proc/sys/kernel/msgmnb.
Several methods are used already to modify (mainly increase) msgmnb:
. distribution specific patch (e.g. openSUSE)
. system wide sysctl.conf
. application specific tuning via /proc/sys/kernel/msgmnb
Integrating this series would:
. reflect hardware and software evolutions and diversity,
. reduce configuration/tuning for the applications.
Here is the timeline of the evolution of MSG* #defines:
Year 1994 1999 1999 2008
Version 1.0 2.3.27 2.3.30 2.6.24
#define MSGMNI 128 128 16 16
#define MSGMAX 4056 8192 8192 8192
#define MSGMNB 16384 16384 16384 16384
This patch series increases msgmnb with respect to the number of
cpus/cores for larger machines. For uniprocessor machines the value
does not increase.
This series is similar to (and depends on) the series which scales
msgmni, the number of IPC message queue identifiers, to the amount of
low memory.
While Nadia's previous series scaled msgmni along the memory axis,
hence the message pool (msgmni x msgmnb), this series uses a second
axis: the number of online CPUs.
As well as covering the (cpu,memory) space of machines size, this
reflects the parallelism allowed by lockless send/receive for
in-flight messages in queues (msgmnb / msgmax messages).
The initial scaling is done at initialization of the ipc namespace.
Furthermore, the value becomes dynamic with respect to cpu hotplug,
decreasing/increasing when a cpu is taken offline/online.
The msgmni and msgmnb values become dependent, as the value of msgmni
is computed with respect to the value of msgmnb.
Other solutions could be possible, like using a dbus/hal daemon. This
patches seems light enough not to go to user space. In particular, the
computation formula is simple.
The series is as follows:
. patch 1 introduces the scaling function
. patch 2 deals with cpu hotplug
. patch 3 finer grain disabling/reenabling scaling mechanism
(disconnect msgmnb and msgmni)
---
The series applies to 2.6.26-rc5-mm3
Compared to the RFC, the following changes have been made:
. reduce use of "scale" word which leads to confusion about the fact
that this is not directly a performance patch [Nick]
. mention that the formula is simple, not needing logarithm (or user
space) [Nick]
. example of distribution using patch [Manfred]
. mention hal/dbus daemon [Manfred]
. do not reenable msgmni recomputation when reenabling msgmnb. It
suffices to do a one shot recomputation [Nadia]
. patch 3 and 4 have been merged with patch 1 [Nadia]
. Integrated documentation with patch [Matt]
Thanks for the help!
. corrected a bug in the last patch
(forgot to add braces when adding statement in if)
The last remark of Nadia about the third patch has not been addressed
(other than keeping it like that):
"Doing this, we are completly loosing the benefits of the notification
chains: since the the notifier blocks remain registered + we are
unconditionally adding a test at the top of each recompute routine. But
the other choice would have been to define another notifier chain
dedicated to msgmnb. I'm not convinced about what is the best solution?"
Documentation/sysctl/kernel.txt | 27 +++++++++++++++++++
include/linux/ipc_namespace.h | 4 ++
include/linux/msg.h | 5 +++
ipc/ipc_sysctl.c | 56 ++++++++++++++++++++++++++++++----------
ipc/ipcns_notifier.c | 23 ++++++----------
ipc/msg.c | 23 +++++++++++++---
ipc/util.c | 28 ++++++++++++++++++++
ipc/util.h | 1
8 files changed, 135 insertions(+), 32 deletions(-)
--
On Tue, 24 Jun 2008 11:34:52 +0200
<[email protected]> wrote:
> The size in bytes of a SysV IPC message queue, msgmnb, is too small
> for large machines, but we don't want to bloat small machines.
>
> This series change ("scale") the default value of
> /proc/sys/kernel/msgmnb.
I'm afraid I've lost track of what's happening here. Did we come up
with an alternative to "magical positive-versus-negative number trick"?
Your patch #1 adds and uses recompute_msgmnb() without adding the
declaration to a header file. Your patch #2 does add the
recompute_msgmnb() to a header file, so we have a window in which the
build is broken, which is bad.
recompute_msgmnb() isn't a terribly good globally-visible identifier,
btw. It is nice to add some subsystem identifer as a prefix. There's
little chance of this symbol colliding with anything else, so this is a
minor cosmetic thing in this case.
Andrew Morton writes:
> I'm afraid I've lost track of what's happening here. Did we come up
> with an alternative to "magical positive-versus-negative number trick"?
Yes, several proposals:
1. /proc/sys/kernel/automatic-msgmnb
(attempt attached).
2. a variation /sys/kernel/automatic/msgmn*
as a mean to alleviate the doubling of the interface.
> Your patch #1 adds and uses recompute_msgmnb() without adding the
> declaration to a header file. Your patch #2 does add the
> recompute_msgmnb() to a header file, so we have a window in which the
> build is broken, which is bad.
Sorry. Another fix needed to my quilt workflow.
Thanks.
> recompute_msgmnb() isn't a terribly good globally-visible identifier,
> btw. It is nice to add some subsystem identifer as a prefix. There's
> little chance of this symbol colliding with anything else, so this is a
> minor cosmetic thing in this case.
ok. msg_recompute_msgmnb seems to be the better name.
--
solofo
Index: b/ipc/msg.c
===================================================================
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -38,6 +38,7 @@
#include <linux/rwsem.h>
#include <linux/nsproxy.h>
#include <linux/ipc_namespace.h>
+#include <linux/cpumask.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -90,9 +91,11 @@ void recompute_msgmni(struct ipc_namespa
unsigned long allowed;
int nb_ns;
+ if (!ns->msg_ctlmni_activated)
+ return;
si_meminfo(&i);
allowed = (((i.totalram - i.totalhigh) / MSG_MEM_SCALE) * i.mem_unit)
- / MSGMNB;
+ / ns->msg_ctlmnb;
nb_ns = atomic_read(&nr_ipc_ns);
allowed /= nb_ns;
@@ -108,11 +111,23 @@ void recompute_msgmni(struct ipc_namespa
ns->msg_ctlmni = allowed;
}
+/*
+ * Scale msgmnb with the number of online cpus, up to 4x MSGMNB.
+ */
+void msg_recompute_msgmnb(struct ipc_namespace *ns)
+{
+ if (!ns->msg_ctlmnb_activated)
+ return;
+ ns->msg_ctlmnb =
+ min(MSGMNB * num_online_cpus(), MSGMNB * MSG_CPU_SCALE);
+}
void msg_init_ns(struct ipc_namespace *ns)
{
ns->msg_ctlmax = MSGMAX;
- ns->msg_ctlmnb = MSGMNB;
+ ns->msg_ctlmnb_activated = 1;
+ ns->msg_ctlmni_activated = 1;
+ msg_recompute_msgmnb(ns);
recompute_msgmni(ns);
@@ -132,8 +147,8 @@ void __init msg_init(void)
{
msg_init_ns(&init_ipc_ns);
- printk(KERN_INFO "msgmni has been set to %d\n",
- init_ipc_ns.msg_ctlmni);
+ printk(KERN_INFO "msgmni has been set to %d, msgmnb to %d\n",
+ init_ipc_ns.msg_ctlmni, init_ipc_ns.msg_ctlmnb);
ipc_init_proc_interface("sysvipc/msg",
" key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
Index: b/include/linux/msg.h
===================================================================
--- a/include/linux/msg.h
+++ b/include/linux/msg.h
@@ -58,6 +58,12 @@ struct msginfo {
* more than 16 GB : msgmni = 32K (IPCMNI)
*/
#define MSG_MEM_SCALE 32
+/*
+ * Scaling factor to compute msgmnb: ns->msg_ctlmnb is between MSGMNB
+ * and MSGMNB * MSG_CPU_SCALE. This leads to a max msgmnb value of
+ * 65536 which is an already used and recommended value.
+ */
+#define MSG_CPU_SCALE 4
#define MSGMNI 16 /* <= IPCMNI */ /* max # of msg queue identifiers */
#define MSGMAX 8192 /* <= INT_MAX */ /* max size of message (bytes) */
Index: b/ipc/ipc_sysctl.c
===================================================================
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -31,19 +31,22 @@ static void *get_ipc(ctl_table *table)
* hand and it has a callback routine registered on the ipc namespace notifier
* chain: we don't want such tunables to be recomputed anymore upon memory
* add/remove or ipc namespace creation/removal.
- * They can come back to a recomputable state by being set to a <0 value.
*/
-static void tunable_set_callback(int val)
+static void tunable_set_callback(int val, ctl_table *table)
{
- if (val >= 0)
- unregister_ipcns_notifier(current->nsproxy->ipc_ns);
- else {
- /*
- * Re-enable automatic recomputing only if not already
- * enabled.
- */
- recompute_msgmni(current->nsproxy->ipc_ns);
- cond_register_ipcns_notifier(current->nsproxy->ipc_ns);
+ int tunable = table->ctl_name;
+
+ switch (tunable) {
+ case KERN_MSGMNB:
+ current->nsproxy->ipc_ns->msg_ctlmnb_activated = 0;
+ break;
+ case KERN_MSGMNI:
+ current->nsproxy->ipc_ns->msg_ctlmni_activated = 0;
+ break;
+ default:
+ printk(KERN_ERR "ipc: unexpected value %s\n",
+ table->procname);
+ break;
}
}
@@ -70,9 +73,10 @@ static int proc_ipc_callback_dointvec(ct
rc = proc_dointvec(&ipc_table, write, filp, buffer, lenp, ppos);
- if (write && !rc && lenp_bef == *lenp)
- tunable_set_callback(*((int *)(ipc_table.data)));
-
+ if (write && !rc && lenp_bef == *lenp) {
+ BUG_ON(table == NULL);
+ tunable_set_callback(*((int *)(ipc_table.data)), table);
+ }
return rc;
}
@@ -147,8 +151,8 @@ static int sysctl_ipc_registered_data(ct
* Tunable has successfully been changed from userland
*/
int *data = get_ipc(table);
-
- tunable_set_callback(*data);
+ BUG_ON(table == NULL);
+ tunable_set_callback(*data, table);
}
return rc;
@@ -210,6 +214,15 @@ static struct ctl_table ipc_kern_table[]
.data = &init_ipc_ns.msg_ctlmnb,
.maxlen = sizeof (init_ipc_ns.msg_ctlmnb),
.mode = 0644,
+ .proc_handler = proc_ipc_callback_dointvec,
+ .strategy = sysctl_ipc_registered_data,
+ },
+ {
+ .ctl_name = KERN_AUTOMATIC_MSGMNB,
+ .procname = "automatic_msgmnb",
+ .data = &init_ipc_ns.msg_ctlmnb_activated,
+ .maxlen = sizeof (init_ipc_ns.msg_ctlmnb_activated),
+ .mode = 0644,
.proc_handler = proc_ipc_dointvec,
.strategy = sysctl_ipc_data,
},
Index: b/include/linux/ipc_namespace.h
===================================================================
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -12,6 +12,7 @@
#define IPCNS_MEMCHANGED 0x00000001 /* Notify lowmem size changed */
#define IPCNS_CREATED 0x00000002 /* Notify new ipc namespace created */
#define IPCNS_REMOVED 0x00000003 /* Notify ipc namespace removed */
+#define IPCNS_CPUCHANGED 0x00000004 /* Notify cpu hotplug addition/removal */
#define IPCNS_CALLBACK_PRI 0
@@ -33,7 +34,9 @@ struct ipc_namespace {
int msg_ctlmax;
int msg_ctlmnb;
+ int msg_ctlmnb_activated;
int msg_ctlmni;
+ int msg_ctlmni_activated;
atomic_t msg_bytes;
atomic_t msg_hdrs;
@@ -52,7 +55,6 @@ extern atomic_t nr_ipc_ns;
#define INIT_IPC_NS(ns) .ns = &init_ipc_ns,
extern int register_ipcns_notifier(struct ipc_namespace *);
-extern int cond_register_ipcns_notifier(struct ipc_namespace *);
extern int unregister_ipcns_notifier(struct ipc_namespace *);
extern int ipcns_notify(unsigned long);
Index: b/ipc/util.c
===================================================================
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -34,6 +34,7 @@
#include <linux/nsproxy.h>
#include <linux/rwsem.h>
#include <linux/memory.h>
+#include <linux/cpu.h>
#include <linux/ipc_namespace.h>
#include <asm/unistd.h>
@@ -96,6 +97,32 @@ static int ipc_memory_callback(struct no
#endif /* CONFIG_MEMORY_HOTPLUG */
+#ifdef CONFIG_HOTPLUG_CPU
+
+static void ipc_cpu_notifier(struct work_struct *work)
+{
+ ipcns_notify(IPCNS_CPUCHANGED);
+}
+
+static DECLARE_WORK(ipc_cpu_wq, ipc_cpu_notifier);
+
+static int __cpuinit ipc_cpu_callback(struct notifier_block *nfb,
+ unsigned long action, void *hcpu)
+{
+ switch (action) {
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ schedule_work(&ipc_cpu_wq);
+ break;
+ default:
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+#endif /* CONFIG_HOTPLUG_CPU */
/**
* ipc_init - initialise IPC subsystem
*
@@ -112,6 +139,7 @@ static int __init ipc_init(void)
msg_init();
shm_init();
hotplug_memory_notifier(ipc_memory_callback, IPC_CALLBACK_PRI);
+ hotcpu_notifier(ipc_cpu_callback, IPC_CALLBACK_PRI);
register_ipcns_notifier(&init_ipc_ns);
return 0;
}
Index: b/ipc/ipcns_notifier.c
===================================================================
--- a/ipc/ipcns_notifier.c
+++ b/ipc/ipcns_notifier.c
@@ -26,16 +26,20 @@ static int ipcns_callback(struct notifie
unsigned long action, void *arg)
{
struct ipc_namespace *ns;
-
+ ns = container_of(self, struct ipc_namespace, ipcns_nb);
switch (action) {
+ case IPCNS_CPUCHANGED:
+ /*
+ * Fall through.
+ * We do not scale msgmnb with IPC namespace
+ * add/remove for simplicity (adjustment of the
+ * message pool is done indirectly via msgmni).
+ */
+ msg_recompute_msgmnb(ns);
case IPCNS_MEMCHANGED: /* amount of lowmem has changed */
case IPCNS_CREATED:
case IPCNS_REMOVED:
/*
- * It's time to recompute msgmni
- */
- ns = container_of(self, struct ipc_namespace, ipcns_nb);
- /*
* No need to get a reference on the ns: the 1st job of
* free_ipc_ns() is to unregister the callback routine.
* blocking_notifier_chain_unregister takes the wr lock to do
@@ -61,15 +65,6 @@ int register_ipcns_notifier(struct ipc_n
return blocking_notifier_chain_register(&ipcns_chain, &ns->ipcns_nb);
}
-int cond_register_ipcns_notifier(struct ipc_namespace *ns)
-{
- memset(&ns->ipcns_nb, 0, sizeof(ns->ipcns_nb));
- ns->ipcns_nb.notifier_call = ipcns_callback;
- ns->ipcns_nb.priority = IPCNS_CALLBACK_PRI;
- return blocking_notifier_chain_cond_register(&ipcns_chain,
- &ns->ipcns_nb);
-}
-
int unregister_ipcns_notifier(struct ipc_namespace *ns)
{
return blocking_notifier_chain_unregister(&ipcns_chain,
Index: b/ipc/util.h
===================================================================
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -122,6 +122,7 @@ extern struct msg_msg *load_msg(const vo
extern int store_msg(void __user *dest, struct msg_msg *msg, int len);
extern void recompute_msgmni(struct ipc_namespace *);
+extern void msg_recompute_msgmnb(struct ipc_namespace *);
static inline int ipc_buildid(int id, int seq)
{
Index: b/include/linux/sysctl.h
===================================================================
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -163,6 +163,7 @@ enum
KERN_MAX_LOCK_DEPTH=74,
KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+ KERN_AUTOMATIC_MSGMNB=77, /* int: automatic msgmnb computation */
};
Index: b/kernel/sysctl_check.c
===================================================================
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -104,6 +104,7 @@ static const struct trans_ctl_table tran
{ KERN_MAX_LOCK_DEPTH, "max_lock_depth" },
{ KERN_NMI_WATCHDOG, "nmi_watchdog" },
{ KERN_PANIC_ON_NMI, "panic_on_unrecovered_nmi" },
+ { KERN_AUTOMATIC_MSGMNB, "automatic_msgmnb" },
{}
};
[email protected] wrote:
> Andrew Morton writes:
> > I'm afraid I've lost track of what's happening here. Did we come up
> > with an alternative to "magical positive-versus-negative number trick"?
>
> Yes, several proposals:
> 1. /proc/sys/kernel/automatic-msgmnb
> (attempt attached).
>
> 2. a variation /sys/kernel/automatic/msgmn*
> as a mean to alleviate the doubling of the interface.
>
> > Your patch #1 adds and uses recompute_msgmnb() without adding the
> > declaration to a header file. Your patch #2 does add the
> > recompute_msgmnb() to a header file, so we have a window in which the
> > build is broken, which is bad.
>
> Sorry. Another fix needed to my quilt workflow.
> Thanks.
>
> > recompute_msgmnb() isn't a terribly good globally-visible identifier,
> > btw. It is nice to add some subsystem identifer as a prefix. There's
> > little chance of this symbol colliding with anything else, so this is a
> > minor cosmetic thing in this case.
>
> ok. msg_recompute_msgmnb seems to be the better name.
>
Solofo,
I had a look at your patch, and noticed that it is mixing msgmni and
msgmnb stuff.
Since the msgmni part is already in mainline, I think we need first to
fix the interface issue for msgmni: I've got a patch ready.
And your patch, when reviwed, can then come on top of it.
Sending the patch right now.
Regards,
Nadia