[PATCH 01/06]
Defines the auto_tune structure: this is the structure that contains the
information needed by the adjustment routine for a given tunable.
Also defines the registration routines.
The fork kernel component defines a tunable structure for the threads-max
tunable and registers it.
Signed-off-by: Nadia Derbey <[email protected]>
---
Documentation/00-INDEX | 2
Documentation/auto_tune.txt | 333 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/akt.h | 166 +++++++++++++++++++++
include/linux/akt_ops.h | 109 ++++++++++++++
init/Kconfig | 2
init/main.c | 2
kernel/Makefile | 1
kernel/autotune/Kconfig | 26 +++
kernel/autotune/Makefile | 7
kernel/autotune/akt.c | 119 +++++++++++++++
kernel/fork.c | 18 ++
11 files changed, 785 insertions(+)
Index: linux-2.6.20-rc4/Documentation/auto_tune.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/Documentation/auto_tune.txt 2007-01-29 12:54:09.000000000 +0100
@@ -0,0 +1,333 @@
+ Automatic Kernel Tunables
+ =========================
+
+ Nadia Derbey ([email protected])
+
+
+
+This feature aims at making the kernel automatically change the tunables
+values as it sees resources running out.
+
+The AKT framework is made of 2 parts:
+
+1) Kernel part:
+Interfaces are provided to the kernel subsystems, to (un)register the
+tunables that might be automatically tuned in the future.
+
+Registering a tunable consists of the following steps:
+- a structure is declared and filled by the kernel subsystem for the
+registered tunable
+- that tunable structure is registered into sysfs
+
+Registration should be done during the kernel subsystem initialization step.
+
+Unregistering a tunable is the reverse operation. It should not be necessary
+for the kernel subsystems: it is only useful when unloading modules that would
+have registered a tunable during their loading step.
+
+The routines interfaces are the following:
+
+1.1) Declaring a tunable:
+
+A tunable structure should be declared and defined by the kernel subsystems as
+follows:
+
+DEFINE_TUNABLE(structure_name, threshold, min, max,
+ tunable_variable_ptr, checked_variable_ptr,
+ tunable_variable_type);
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+- threshold: percentage to apply to the tunable value to detect if adjustment
+is needed
+
+- min: minimum value the tunable can ever reach (needed when adjusting down
+the tunable)
+
+- max: maximum value the tunable can ever reach (needed when adjusting up the
+tunable)
+
+- tunable_variable_ptr: address of the tunable that will be adjusted if
+needed.
+(ex: in kernel/fork.c it is max_threads's address)
+
+- checked_variable_ptr: address of the variable that is controlled by the
+tunable. This is the calling subsystem's object counter.
+(ex: in kernel/fork.c it is nr_threads's address: nr_threads should
+always remain < max_threads)
+
+- tunable_variable_type: this type is important since it helps choosing the
+appropriate automatic tuning routine.
+Presently, it can be one of int / size_t and this can easily be enhanced.
+
+The automatic tuning routine (i.e. the routine that should be called when
+automatic tuning is activated) is set to the default one:
+default_auto_tuning_<type>().
+<type> is chosen according to the tunable_variable_type parameters.
+All the previously listed parameters are useful to this routine.
+Refer to the description of the automatic adjustment routine to see how
+these parameters are actually used.
+
+Refer to "Updating the auto-tuning function pointer" to know how to set
+this routine to another one.
+
+
+1.2) Updating a tunable's characteristics
+
+1.2.1) Updating min / max values:
+
+Sometimes, when calling DEFINE_TUNABLE(), the min and max values are not
+exactly known, yet. In that case, the following routine should be called
+once these values are known:
+
+set_tunable_min_max(structure_name, new_min, new_max)
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+- new_min: minimum value the tunable can ever reach
+
+- new_max: maximum value the tunable can ever reach
+
+1.2.2) Updating the auto-tuning function pointer:
+
+If the default auto-tuning routine doesn't fit your needs, you can define
+another one and associate it to the tunable using the following routine:
+
+set_autotuning_routine(structure_name, auto_tune)
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+- auto_tune: routine that should be called when automatic tuning is activated.
+If this parameter is not NULL, it should be set to a function pointer defined
+by the kernel subsystem caller. See 1.5) for the routine prototype. See also
+maxfiles_auto_tuning() in fs/file_table.c for an example.
+
+
+1.3) Registering a tunable:
+
+Once declared and its min / max / auto_tuning routine updated, the tunable
+structure should be registered using the following routine:
+
+int register_tunable(struct auto_tune *tunable_addr);
+
+Parameters:
+- tunable_addr: address of the tunable structure previsouly declared.
+
+Return value:
+- 0 : successful
+- < 0 : failure
+
+
+Registering a tunable makes it potentially automatically adjustable:
+the tunable is viewed as a kobject with 3 attributes (i.e. 3 files at sysfs
+level):
+- autotune (rw): enables to (de)activate the auto tuning for that tunable
+- min (rw): enables to play with the min tunable value
+- max (rw): enables to play with the max tunable value
+
+The only way to make a registered tunable automatically adjustable is through
+sysfs (see the sysfs part for more details).
+
+
+
+1.4) Unregistering a tunable:
+
+int unregister_tunable(struct auto_tune *reg_tun_addr);
+
+Parameters:
+- reg_tun_addr: address of the tunable structure to unregister
+
+
+This routine is only useful for modules: when unloading, they should
+unregister any previously registered tunable.
+
+
+
+1.5) Automatic tuning routine:
+
+The 2nd main service provided by the kernel part is a function pointer
+(auto_tune_func): it points to the routine that actually automatically
+adjusts the tunable passed in as a parameter.
+
+This is accomplished by one of the following:
+- if an automatic tuning routine has been provided during the tunable
+declaration, that routine will actually be called.
+- if no automatic tuning routine has been provided, the default one is called.
+NOTE: it can process one of the following types, depending on the type used
+ when declaring the tunable (see DEFINE_TUNABLE above): int, size_t.
+
+
+If the automatic tuning routine is provided by the kernel subsystem caller,
+it should be declared as follows:
+
+int <routine_name>(int cmd, struct auto_tune *params);
+
+Parameters:
+- cmd: tuning direction
+ . AKT_UP: the tunable will be adjusted upwards (i.e. its value is
+ increased if needed)
+ . AKT_DOWN: the tunable is adjusted downwards (i.e. its value is
+ decreased if needed)
+- params: pointer to the previously registered tunable structure
+
+
+Any kernel subsystem that has registered a tunable should call
+auto_tune_func() as follows:
+
++-------------------------+--------------------------------------------+
+| Step | Routine to call |
++-------------------------+--------------------------------------------+
+| Declaration phase | DEFINE_TUNABLE(name, values...); |
++-------------------------+--------------------------------------------+
+| Initialization routine | set_tunable_min_max(name, min, max); |
+| | set_autotuning_routine(name, routine); |
+| | register_tunable(&name); |
+| Note: the 1st 2 calls | |
+| are optional | |
++-------------------------+--------------------------------------------+
+| Alloc | activate_auto_tuning(AKT_UP, &name); |
++-------------------------+--------------------------------------------+
+| Free | activate_auto_tuning(AKT_DOWN, &name); |
++-------------------------+--------------------------------------------+
+| module_exit() routine | unregister_tunable(&name); |
++-------------------------+--------------------------------------------+
+
+activate_auto_tuning is a static inline defined in akt.h, that does the
+following:
+. if <tunable is registered> and <auto tuning is allowed for tunable>
+. call the routine stored in tunable->auto_tune
+
+
+The effect of the default automatic tuning routine is the following:
+
+ +----------------------------------------------------------------+
+ | Tunable automatically adjustable |
+ +---------------+------------------------------------------------+
+ | NO | YES |
++----------+---------------+------------------------------------------------+
+| AKT_UP | No effect | If the tunable value exceeds the specified |
+| | | threshold, that value is increased up to a |
+| | | maximum value. |
+| | | The maximum value is specified during the |
+| | | tunable declaration and can be changed at any |
+| | | time through sysfs |
++----------+---------------+------------------------------------------------+
+| AKT_DOWN | No effect | If the tunable value falls under the specified |
+| | | threshold, that value is decreased down to a |
+| | | minimum value. |
+| | | The minimum value is specified during the |
+| | | tunable declaration and can be changed at any |
+| | | time through sysfs |
++----------+---------------+------------------------------------------------+
+
+
+1.6. Default automatic adjustment routine
+
+The last service provided by AKT at the kernel level is the default automatic
+adjustment routine. As seen, above, this routine supports various tunables
+types. It works as follows (only the AKT_UP direction is described here -
+AKT_DOWN does the reverse operation):
+
+The 2nd parameter passed in to this routine is a pointer to a previously
+registered tunable structure. That structure contains the following fields
+(see 1.1 for the detailed description):
+- threshold
+- key
+- min
+- max
+- tunable
+- checked
+
+When this routine is entered, it does the following:
+1. <*checked> is compared to <*tunable> * threshold
+2. if <*checked> is greater, <*tunable> is set to:
+ <*tunable> + (<*tunable> * (100 - threshold) / 100)
+
+
+
+1.6) akt and sysfs:
+
+AKT uses sysfs to enable the tunables management from the user world (mainly
+making them automatic or manual).
+
+akt uses sysfs in the following way:
+- a tunables subsystem (tunables_subsys) is declared and registered during akt
+initialization.
+- registering a tunable is equivalent to registering the corresponding kobject
+within that subsystem.
+- each tunable kobject has 3 associated attributes, all with a RW mode (i.e.
+the show() and store() methods are provided for them):
+ . autotune: enables to (de)activate automatic tuning for the tunable
+ . max: enables to set a new maximum value for the tunable
+ . min: enables to set a new minimum value for the tunable
+
+
+1.7) tunables that are namespace dependent
+
+In this paragraph, the particular case of tunables that are namespace
+dependent is presented.
+
+1.7.1) Declaring a tunable:
+
+The tunable structure for such tunables should be declared in the namespace
+structure that contains the associated tunable (ex: the tunable structure for
+msg_ctlmni should be declared in the ipc_namespace structure).
+
+The tunable structure should be declared as follows:
+
+DECLARE_TUNABLE(structure_name);
+
+Parameters:
+- structure_name: this is the name of the tunable structure
+
+1.7.2) Initializing the tunable structure
+
+Then the tunable structure should be initialized by calling the following
+routine:
+
+init_tunable_ipcns(namespace_ptr, structure_name, threshold, min, max,
+ tunable_variable_ptr, checked_variable_ptr,
+ tunable_variable_type);
+
+Parameters:
+- namespace_ptr: pointer to the namespace the tunable belongs to.
+
+See DEFINE_TUNABLE for the other parameters.
+
+1.7.3) Registering the tunable structure
+
+register_tunable should be called, giving it the tunable structure address
+that belongs to the init namespace.
+
+This applies to activate_auto_tuning too.
+
+All the routines that show/store attributes or that do the auto tuning are
+namespace dependent.
+
+
+2) User part:
+
+As seen above, the only way to activate automatic tuning is from user side:
+- the directory /sys/tunables is created during the init phase.
+- each time a tunable is registered by a kernel subsystem, a directory is
+created for it under /sys/tunables.
+- This directory contains 1 file for each tunable kobject attribute:
++-----------+---------------+-------------------+--------------------------+
+| attribute | default value | how to set it | effect |
++-----------+---------------+-------------------+--------------------------+
+| autotune | 0 | echo 1 > autotune | makes the tunable |
+| | | | automatic |
+| | | echo 0 > autotune | makes the tunable manual |
++-----------+---------------+-------------------+--------------------------+
+| max | max value set | echo <M> > max | sets the tunable max |
+| | during tunable| | value to <M> |
+| | definition | | |
++-----------+---------------+-------------------+--------------------------+
+| min | min value set | echo <m> > min | sets the tunable min |
+| | during tunable| | value to <m> |
+| | definition | | |
++-----------+---------------+-------------------+--------------------------+
+
Index: linux-2.6.20-rc4/Documentation/00-INDEX
===================================================================
--- linux-2.6.20-rc4.orig/Documentation/00-INDEX 2007-01-29 12:39:29.000000000 +0100
+++ linux-2.6.20-rc4/Documentation/00-INDEX 2007-01-29 12:55:49.000000000 +0100
@@ -52,6 +52,8 @@ applying-patches.txt
- description of various trees and how to apply their patches.
arm/
- directory with info about Linux on the ARM architecture.
+auto_tune.txt
+ - info on the Automatic Kernel Tunables (AKT) feature.
basic_profiling.txt
- basic instructions for those who wants to profile Linux kernel.
binfmt_misc.txt
Index: linux-2.6.20-rc4/include/linux/akt.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/include/linux/akt.h 2007-01-29 14:59:38.000000000 +0100
@@ -0,0 +1,166 @@
+/*
+ * linux/include/akt.h
+ *
+ * Automatic Kernel Tunables support for Linux.
+ * This file contains structures definitions and prototypes needed for AKT
+ * support.
+ *
+ * Copyright (C) 2006 Bull S.A.S
+ *
+ * Author: Nadia Derbey <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef AKT_H
+#define AKT_H
+
+#include <linux/types.h>
+#include <linux/kobject.h>
+
+
+
+/*
+ * First parameter passed to the adjustment routine
+ */
+#define AKT_UP 0 /* adjustment "up" */
+#define AKT_DOWN 1 /* adjustment "down" */
+
+
+struct auto_tune;
+/*
+ * Automatic adjustment routine.
+ * Returns 0, if the tunable value has not been changed, 1 else
+ */
+typedef int (*auto_tune_fn)(int, struct auto_tune *);
+
+
+/*
+ * Structure used to describe the min / max values for a tunable inside the
+ * auto_tune structure.
+ */
+struct tunable_limit {
+ ulong value;
+};
+
+
+
+/*
+ * This is the structure that describes a tunable. One of these structures is
+ * allocated for each registered tunable, and the associated kobject exported
+ * via sysfs.
+ *
+ * The structure lock (tunable_lck) protects
+ * against concurrent accesses to tunable and checked pointers
+ *
+ * A pointer to this structure is passed in to the automatic adjustment
+ * routine.
+ * automatic adjustment principle is the following:
+ * AKT_UP:
+ * 1. *checked is compared to *tunable * threshold
+ * 2. if *checked is greater, the tunable is adjusted up
+ * AKT_DOWN: reverse operation
+ */
+struct auto_tune {
+ spinlock_t tunable_lck; /* serializes access to the stucture fields */
+ auto_tune_fn auto_tune; /* auto tuning routine registered by the */
+ /* calling kernel susbsystem. If NULL, the */
+ /* auto tuning routine that will be called */
+ /* is the default one that processes uints */
+ const char *name;
+ unsigned char flags; /* Only 2 bits are meaningful: */
+ /* bit 0: 1 if the associated tunable can */
+ /* be automatically adjusted */
+ /* bits 1: 1 if the tunable has been */
+ /* registered */
+ /* bits 2-7: unused */
+ char threshold; /* threshold to enable the adjustment expressed as */
+ /* a %age */
+ struct tunable_limit min; /* min value the tunable can ever */
+ /* reach */
+ struct tunable_limit max; /* max value the tunable can ever */
+ /* reach */
+ void *tunable; /* address of the tunable to adjust */
+ void *checked; /* address of the variable that is controlled by */
+ /* the tunable. This is the calling subsystem's */
+ /* object counter */
+};
+
+
+/*
+ * Flags for a registered tunable
+ */
+#define TUNABLE_REGISTERED 0x02
+
+
+/*
+ * When calling this routine the tunable lock should be held
+ */
+static inline int is_tunable_registered(struct auto_tune *tunable)
+{
+ return (tunable->flags & TUNABLE_REGISTERED) == TUNABLE_REGISTERED;
+}
+
+
+#ifdef CONFIG_AKT
+
+
+
+#define TUNABLE_INIT(_name, _thresh, _min, _max, _tun, _chk, type) \
+ { \
+ .tunable_lck = SPIN_LOCK_UNLOCKED, \
+ .auto_tune = default_auto_tuning_##type, \
+ .name = (_name), \
+ .flags = 0, \
+ .threshold = (_thresh), \
+ .min = { \
+ .value = (_min), \
+ }, \
+ .max = { \
+ .value = (_max), \
+ }, \
+ .tunable = (_tun), \
+ .checked = (_chk), \
+ }
+
+
+#define DEFINE_TUNABLE(s, thr, min, max, tun, chk, type) \
+ struct auto_tune s = TUNABLE_INIT(#s, thr, min, max, tun, chk, type)
+
+#define set_tunable_min_max(s, _min, _max) \
+ do { \
+ (s).min.value = _min; \
+ (s).max.value = _max; \
+ } while (0)
+
+
+extern int register_tunable(struct auto_tune *);
+extern int unregister_tunable(struct auto_tune *);
+
+
+#else /* CONFIG_AKT */
+
+
+#define DEFINE_TUNABLE(s, thresh, min, max, tun, chk, type)
+#define set_tunable_min_max(s, min, max) do { } while (0)
+
+#define register_tunable(a) 0
+#define unregister_tunable(a) 0
+
+#endif /* CONFIG_AKT */
+
+extern void fork_late_init(void);
+
+#endif /* AKT_H */
Index: linux-2.6.20-rc4/include/linux/akt_ops.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/include/linux/akt_ops.h 2007-01-29 13:34:45.000000000 +0100
@@ -0,0 +1,109 @@
+/*
+ * linux/include/akt_ops.h
+ *
+ * Automatic Kernel Tunables support for Linux.
+ * This file contains the definitions for the type dependent routines
+ * needed for AKT support.
+ *
+ * Copyright (C) 2006 Bull S.A.S
+ *
+ * Author: Nadia Derbey <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef AKT_OPS_H
+#define AKT_OPS_H
+
+
+#ifdef CONFIG_AKT
+
+/**
+ * default_auto_tuning - Default automatic tuning routine
+ * @direction: controls the adjustment direction (up / down)
+ * @p: registered tunable structure
+ *
+ * This is the routine called to accomplish auto tuning if none has been
+ * specified for a tunable.
+ * It can be called by any kernel subsystem that is allocating or freeing an
+ * object whose maximum value is controlled by a tunable.
+ * ex: max # of semaphore ids is controlled by sc_semmni
+ * ==> this routine might be called by sys_semget() to "adjust up"
+ * and by semctl_down() to "adjust down"
+ *
+ * Upwards adjustment:
+ * Adjustment is needed if the checked variable has reached
+ * (threshold / 100 * tunable)
+ * In that case, tunable is set to
+ * (tunable + tunable * (100 - threshold) / 100)
+ *
+ * Downards adjustment:
+ * Adjustment is needed if the checked variable has fallen under
+ * (threshold / 100 * tunable previous value)
+ * In that case tunable is set back to its previous value, i.e. to
+ * (tunable * 100 / (200 - threshold))
+ *
+ * NOTES:
+ * 1. This routine should be called with the p->tunable_lck lock held
+ * 2. Type independent - can be one of int / size_t
+ * This list of types can easily be enhanced as needed.
+ *
+ * Returns: 1 - the tunable has been adjusted
+ * 0 - else
+ */
+#define __default_auto_tuning(direction, p, type) \
+( { \
+ int __rc; \
+ ulong _chk = (ulong) *((type *) p->checked); \
+ ulong _tun = (ulong) *((type *) p->tunable); \
+ ulong _thr = p->threshold; \
+ ulong _min = p->min.value; \
+ ulong _max = p->max.value; \
+ \
+ if (direction == AKT_UP) { \
+ if ((_chk >= (_tun * _thr) / 100) && (_tun < _max)) { \
+ ulong ___x = (_tun * (200 - _thr)) / 100; \
+ *((type *) p->tunable) = min((type) _max, \
+ (type) ___x); \
+ __rc = 1; \
+ } else \
+ __rc = 0; \
+ } else { \
+ if ((_chk < (_tun * _thr) / (200 - _thr)) && (_tun>_min)) { \
+ ulong ___x = (_tun * 100) / (200 - _thr); \
+ *((type *) p->tunable) = max((type) _min, \
+ (type) ___x); \
+ __rc = 1; \
+ } else \
+ __rc = 0; \
+ } \
+ __rc; \
+} )
+
+static inline int default_auto_tuning_int(int dir, struct auto_tune *p)
+{
+ return __default_auto_tuning(dir, p, int);
+}
+
+static inline int default_auto_tuning_size_t(int dir, struct auto_tune *p)
+{
+ return __default_auto_tuning(dir, p, size_t);
+}
+
+
+
+#endif /* CONFIG_AKT */
+
+#endif /* AKT_OPS_H */
Index: linux-2.6.20-rc4/init/Kconfig
===================================================================
--- linux-2.6.20-rc4.orig/init/Kconfig 2007-01-29 12:39:30.000000000 +0100
+++ linux-2.6.20-rc4/init/Kconfig 2007-01-29 13:35:53.000000000 +0100
@@ -466,6 +466,8 @@ config VM_EVENT_COUNTERS
on EMBEDDED systems. /proc/vmstat will only show page counts
if VM event counters are disabled.
+source "kernel/autotune/Kconfig"
+
endmenu # General setup
config RT_MUTEXES
Index: linux-2.6.20-rc4/init/main.c
===================================================================
--- linux-2.6.20-rc4.orig/init/main.c 2007-01-29 12:39:30.000000000 +0100
+++ linux-2.6.20-rc4/init/main.c 2007-01-29 13:36:41.000000000 +0100
@@ -54,6 +54,7 @@
#include <linux/pid_namespace.h>
#include <linux/compile.h>
#include <linux/device.h>
+#include <linux/akt.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -613,6 +614,7 @@ asmlinkage void __init start_kernel(void
signals_init();
/* rootfs populating might need page-writeback */
page_writeback_init();
+ fork_late_init();
#ifdef CONFIG_PROC_FS
proc_root_init();
#endif
Index: linux-2.6.20-rc4/kernel/Makefile
===================================================================
--- linux-2.6.20-rc4.orig/kernel/Makefile 2007-01-29 12:39:30.000000000 +0100
+++ linux-2.6.20-rc4/kernel/Makefile 2007-01-29 13:37:39.000000000 +0100
@@ -50,6 +50,7 @@ obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_UTS_NS) += utsname.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
+obj-$(CONFIG_AKT) += autotune/
ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
# According to Alan Modra <[email protected]>, the -fno-omit-frame-pointer is
Index: linux-2.6.20-rc4/kernel/autotune/Kconfig
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/kernel/autotune/Kconfig 2007-01-29 13:38:19.000000000 +0100
@@ -0,0 +1,26 @@
+#
+# Automatic Kernel Tunables
+#
+
+config AKT
+ bool "Automatic kernel tunables support (AKT)"
+ depends on PROC_FS && SYSFS
+ help
+ This is a functionality that enables automatic adjustment of kernel
+ tunables: when this feature is enabled the kernel can automatically
+ change the tunables values as it sees resources running out.
+
+ The list of kernel tunables that can potentially be automatically
+ adjusted can found under /sys/tunables.
+
+ In order to make a tunable actually automatic, issue the following
+ command:
+ echo 1 > /sys/tunables/<tunable_name>/autotune
+
+ In order to make it manual, issue the following command:
+ echo 0 > /sys/tunables/<tunable_name>/autotune
+
+ See Documentation/auto_tune.txt for more details.
+
+ If unsure, say N.
+
Index: linux-2.6.20-rc4/kernel/autotune/akt.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/kernel/autotune/akt.c 2007-01-29 14:03:16.000000000 +0100
@@ -0,0 +1,119 @@
+/*
+ * linux/kernel/autotune/akt.c
+ *
+ * Automatic Kernel Tunables for Linux - Kernel support
+ *
+ * Copyright (C) 2006 Bull S.A.S
+ *
+ * Author: Nadia Derbey <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/*
+ * FUNCTIONS:
+ * register_tunable (exported)
+ * unregister_tunable (exported)
+ */
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/akt.h>
+
+
+/**
+ * register_tunable - Inserts a tunable structure into sysfs
+ * @tun: tunable structure to be registered
+ *
+ * Checks the tunable structure fields and inserts it into sysfs.
+ * This routine is called by any kernel subsystem that wants to use akt
+ * services (automatic tunables adjustment) in the future
+ *
+ * NOTE: when calling this routine, the tunable structure should have already
+ * been filled by defining it with DEFINE_TUNABLE()
+ *
+ * Returns: 0 - successful
+ * <0 - failure
+ */
+int register_tunable(struct auto_tune *tun)
+{
+ if (tun == NULL) {
+ printk(KERN_ERR
+ "AKT: Bad tunable structure pointer (NULL)\n");
+ return -EINVAL;
+ }
+
+ if (tun->threshold <= 0 || tun->threshold >= 100) {
+ printk(KERN_ERR "AKT: Bad threshold (%d) value - should be in"
+ " the [1-99] interval\n", tun->threshold);
+ return -EINVAL;
+ }
+
+ if (tun->tunable == NULL) {
+ printk(KERN_ERR "AKT: Bad tunable pointer (NULL)\n");
+ return -EINVAL;
+ }
+
+ if (tun->checked == NULL) {
+ printk(KERN_ERR "AKT: Bad checked value pointer (NULL)\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Check the min / max value
+ */
+ if (tun->min.value > tun->max.value) {
+ printk(KERN_ERR "AKT: Bad min / max values\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(register_tunable);
+
+
+/**
+ * unregister_tunable - Removes a tunable structure from sysfs
+ * @reg_tun: registered tunable structure to be removed
+ *
+ * This routine is called by any kernel subsystem that doesn't need the akt
+ * services anymore
+ *
+ * NOTE: @reg_tun should point to a previously registered tunable
+ *
+ * Returns: 0 - successful
+ * <0 - failure
+ */
+int unregister_tunable(struct auto_tune *reg_tun)
+{
+ if (reg_tun == NULL) {
+ printk(KERN_ERR "AKT: Bad tunable address (NULL)\n");
+ return -EINVAL;
+ }
+
+ spin_lock(®_tun->tunable_lck);
+
+ BUG_ON(!is_tunable_registered(reg_tun));
+
+ reg_tun->flags = 0;
+
+ spin_unlock(®_tun->tunable_lck);
+
+ return 0;
+}
+
+EXPORT_SYMBOL_GPL(unregister_tunable);
Index: linux-2.6.20-rc4/kernel/autotune/Makefile
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.20-rc4/kernel/autotune/Makefile 2007-01-29 13:40:06.000000000 +0100
@@ -0,0 +1,7 @@
+#
+# Makefile for akt
+#
+
+obj-y := akt.o
+
+
Index: linux-2.6.20-rc4/kernel/fork.c
===================================================================
--- linux-2.6.20-rc4.orig/kernel/fork.c 2007-01-29 12:39:30.000000000 +0100
+++ linux-2.6.20-rc4/kernel/fork.c 2007-01-29 13:41:44.000000000 +0100
@@ -49,6 +49,8 @@
#include <linux/delayacct.h>
#include <linux/taskstats_kern.h>
#include <linux/random.h>
+#include <linux/akt.h>
+#include <linux/akt_ops.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -65,6 +67,13 @@ int nr_threads; /* The idle threads do
int max_threads; /* tunable limit on nr_threads */
+#define THREADTHRESH 80
+/*
+ * The actual values for min and max will be known during fork_init
+ */
+DEFINE_TUNABLE(max_threads_akt, THREADTHRESH, 0, 0, &max_threads,
+ &nr_threads, int);
+
DEFINE_PER_CPU(unsigned long, process_counts) = 0;
__cacheline_aligned DEFINE_RWLOCK(tasklist_lock); /* outer */
@@ -152,12 +161,21 @@ void __init fork_init(unsigned long memp
if(max_threads < 20)
max_threads = 20;
+ set_tunable_min_max(max_threads_akt, max_threads, mempages / 2);
+
init_task.signal->rlim[RLIMIT_NPROC].rlim_cur = max_threads/2;
init_task.signal->rlim[RLIMIT_NPROC].rlim_max = max_threads/2;
init_task.signal->rlim[RLIMIT_SIGPENDING] =
init_task.signal->rlim[RLIMIT_NPROC];
}
+void __init fork_late_init(void)
+{
+ if (register_tunable(&max_threads_akt))
+ printk(KERN_WARNING
+ "AKT: Failed registering tunable max_threads\n");
+}
+
static struct task_struct *dup_task_struct(struct task_struct *orig)
{
struct task_struct *tsk;
--
[email protected] writes:
> +
> +This feature aims at making the kernel automatically change the tunables
> +values as it sees resources running out.
The only reason we have resource limit is to avoid DOS when one
resource consumes too much memory. When there is no such danger then
there isn't any reason to have a limit at all and it could be just
eliminated (or set to unlimited by default)
Your feature doesn't address the DOS and without that there isn't
any reason to have limits at all. So what's the point?
I agree that some of the default limits we have are not very useful
on modern machines. I guess you're trying to address that.
I would suggest the following strategy:
- Review any limits we have and make sure they make sense.
- Figure out if they actually serve a useful purpose
e.g. what happens when they are exceeded, is there a DOS?.
If yes can the DOS be addressed in a better way (e.g. by allowing to shrink
the resource by a shrinker callback).
Some of the existing limits are clearly bogus, e.g. the limit
on shared memory.
For others i don't see a good alternative. e.g. if you don't limit
the number of files allocated the only alternative would be to kill
processes when they allocate too many files. Is that really preferable
to a errno?
- If they serve a useful purpose then check if the default is useful
on a modern machine. Or make them scale with the amount of memory
like many limits already do.
-Andi
Andi Kleen wrote:
> [email protected] writes:
>
>>+
>>+This feature aims at making the kernel automatically change the tunables
>>+values as it sees resources running out.
>
>
> The only reason we have resource limit is to avoid DOS when one
> resource consumes too much memory. When there is no such danger then
> there isn't any reason to have a limit at all and it could be just
> eliminated (or set to unlimited by default)
Automatic tuning is a way to set the limit to unlimited, in a sense,
doesn't it? With this feature, we can leave the default limits as they
are for an "every-day" usage, and when a particular application runs on
the machine, authorize the limit to grow up as needed.
>
> Your feature doesn't address the DOS and without that there isn't
> any reason to have limits at all. So what's the point?
As I told Eric Biederman in another mail, DoS in ensured in AKT by
exporting the min and max values for each tunable to sysfs (actually
Eric complained about these min and max :-( ). These are RW atrributes
that make it possible for a sysadmin to set the max value a tunable can
ever reach, instead of letting it grow up to huge values.
>
> I agree that some of the default limits we have are not very useful
> on modern machines. I guess you're trying to address that.
Yep
>
> I would suggest the following strategy:
>
> - Review any limits we have and make sure they make sense.
>
> - Figure out if they actually serve a useful purpose
> e.g. what happens when they are exceeded, is there a DOS?.
> If yes can the DOS be addressed in a better way (e.g. by allowing to shrink
> the resource by a shrinker callback).
>
> Some of the existing limits are clearly bogus, e.g. the limit
> on shared memory.
>
> For others i don't see a good alternative. e.g. if you don't limit
> the number of files allocated the only alternative would be to kill
> processes when they allocate too many files. Is that really preferable
> to a errno?
Agree with you, BUT between the default max_files and the "too many
files" situation, there is a gap that can be crossed by automatically
tuning max_files, isn't there? e.g. max_file default value is NR_FILE
(0x2000), while Oracle expects to have it set to 0x10000.
>
> - If they serve a useful purpose then check if the default is useful
> on a modern machine. Or make them scale with the amount of memory
> like many limits already do.
>
> -Andi
>
Regards,
Nadia
On Tuesday 13 February 2007 11:18, Nadia Derbey wrote:
> Andi Kleen wrote:
> > [email protected] writes:
> >
> >>+
> >>+This feature aims at making the kernel automatically change the tunables
> >>+values as it sees resources running out.
> >
> >
> > The only reason we have resource limit is to avoid DOS when one
> > resource consumes too much memory. When there is no such danger then
> > there isn't any reason to have a limit at all and it could be just
> > eliminated (or set to unlimited by default)
>
> Automatic tuning is a way to set the limit to unlimited, in a sense,
> doesn't it? With this feature, we can leave the default limits as they
> are for an "every-day" usage, and when a particular application runs on
> the machine, authorize the limit to grow up as needed.
That would be effectively no limit, so why not just do away with them
completely? You have to solve the DOS issues first of course, either way.
> > Your feature doesn't address the DOS and without that there isn't
> > any reason to have limits at all. So what's the point?
>
> As I told Eric Biederman in another mail, DoS in ensured in AKT by
> exporting the min and max values for each tunable to sysfs (actually
> Eric complained about these min and max :-( ). These are RW atrributes
> that make it possible for a sysadmin to set the max value a tunable can
> ever reach, instead of letting it grow up to huge values.
Then you could just set the limit always to that max value and would
get the same effect.
Min limit doesn't make much sense to me because near all Linux data structures
are on demand only anyways (excluding mempools and a few special cases)
and min is defined as the current number of used resources.
> Agree with you, BUT between the default max_files and the "too many
> files" situation, there is a gap that can be crossed by automatically
> tuning max_files, isn't there? e.g. max_file default value is NR_FILE
> (0x2000), while Oracle expects to have it set to 0x10000.
The reason NR_FILE is by default relatively low is mostly because the existing
ulimits suck :-) .i.e. you want to limit the memory consumption
of a user you limit the number of the user's processes and the number of files
per process -- then the max memory they can allocate in files is process ulimit*nr_files
[in theory in practice there are holes like in flight fds in unix socket fd passing]
But of course that ends up with either NR_FILE being far too low or process far
too low or too much memory anyways. In practice it doesn't work particularly
well.
The real solution for that is "beancounter" which has been posted in several
forms recently (from OpenVZ and from Google). Basically it adds real limits
(for much more than just file descriptors) per uid or container.
With such infrastructure in place NR_FILES could be set to unlimited, as long
as you have a reasonable (large) default per uid.
Similar reasoning applies to other resources which are currently limited.
I think you're just attacking the symptoms here instead of the basic issue.
-Andi