2011-05-13 23:24:55

by Andi Kleen

[permalink] [raw]
Subject: Add a sysconf syscall

We ran into some problems with programs querying then number
of processors at startup through sysconf. This calls into /proc
from glibc, which is quite slow and scales poorly if the program
is frequently executed. This happens with programs using
Sleepycat DB for example.

While there are worarounds -- like using sched_getaffinity --
they don't have quite equivalent semantics.

This patchkit adds a sysconf() syscall to the kernel that allows
to get this information much faster. It only implements a subset
of sysconf() -- all the information the kernel knows usefully
about.

Another advantage is that the kernel can in several cases offer
more accurate information than glibc, which has to guess.

The syscall is quite compat and not a significant burden.

In addition I did a few cleanups first to export all the information
needed by sysconf clearly from other subsystems.

-Andi


2011-05-13 23:24:57

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 1/5] VFS: Make symlink nesting limit a define

From: Andi Kleen <[email protected]>

Use a SYMLOOP_MAX define for the total symlink nesting limit instead
of a hardcoded 40. No behaviour change.

Used in a followon patch

Signed-off-by: Andi Kleen <[email protected]>
---
fs/namei.c | 6 +++---
include/linux/limits.h | 1 +
include/linux/namei.h | 1 +
3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 54fc993..9d0a92e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -763,7 +763,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
if (link->mnt == nd->path.mnt)
mntget(link->mnt);

- if (unlikely(current->total_link_count >= 40)) {
+ if (unlikely(current->total_link_count >= SYMLOOP_MAX)) {
*p = ERR_PTR(-ELOOP); /* no ->put_link(), please */
path_put(&nd->path);
return -ELOOP;
@@ -871,7 +871,7 @@ static int follow_automount(struct path *path, unsigned flags,
return -EISDIR;

current->total_link_count++;
- if (current->total_link_count >= 40)
+ if (current->total_link_count >= SYMLOOP_MAX)
return -ELOOP;

mnt = path->dentry->d_op->d_automount(path);
@@ -1369,7 +1369,7 @@ static inline int walk_component(struct nameidata *nd, struct path *path,

/*
* This limits recursive symlink follows to 8, while
- * limiting consecutive symlinks to 40.
+ * limiting consecutive symlinks to 40 (SYMLOOP_MAX)
*
* Without that kind of total limit, nasty chains of consecutive
* symlinks can cause almost arbitrarily long lookups.
diff --git a/include/linux/limits.h b/include/linux/limits.h
index 2d0f941..2f39657 100644
--- a/include/linux/limits.h
+++ b/include/linux/limits.h
@@ -14,6 +14,7 @@
#define XATTR_NAME_MAX 255 /* # chars in an extended attribute name */
#define XATTR_SIZE_MAX 65536 /* size of an extended attribute value (64k) */
#define XATTR_LIST_MAX 65536 /* size of extended attribute namelist (64k) */
+#define SYMLOOP_MAX 40 /* max number of total symlink nests */

#define RTSIG_MAX 32

diff --git a/include/linux/namei.h b/include/linux/namei.h
index eba45ea..403e39c 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -4,6 +4,7 @@
#include <linux/dcache.h>
#include <linux/linkage.h>
#include <linux/path.h>
+#include <linux/limits.h>

struct vfsmount;

--
1.7.4.4

2011-05-13 23:25:00

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 2/5] Move max_threads variable declaration into include file

From: Andi Kleen <[email protected]>

Multiple files have a private extern for the max_threads variable
defined in fork.c. Declare it in linux/sched.h and remove the
special externs.

Needed for followon patch.

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/sched.h | 1 +
kernel/kmod.c | 2 --
kernel/sysctl.c | 1 -
3 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 781abd1..34b487b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -134,6 +134,7 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);

extern unsigned long total_forks;
extern int nr_threads;
+extern int max_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned long nr_running(void);
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 9cd0591..3c12d13 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -39,8 +39,6 @@

#include <trace/events/module.h>

-extern int max_threads;
-
static struct workqueue_struct *khelper_wq;

#ifdef CONFIG_MODULES
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c0bb324..ece9b94 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -88,7 +88,6 @@
/* External variables not in a header file. */
extern int sysctl_overcommit_memory;
extern int sysctl_overcommit_ratio;
-extern int max_threads;
extern int core_uses_pid;
extern int suid_dumpable;
extern char core_pattern[];
--
1.7.4.4

2011-05-13 23:26:53

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 3/5] EXEC: Use define for stack to argument size limit

From: Andi Kleen <[email protected]>

Change the hardcoded 4 constant for the argument size limit in exec
to a exported constant.

Needed for followon patch.

Open: better place for the constant?

Signed-off-by: Andi Kleen <[email protected]>
---
fs/exec.c | 2 +-
include/linux/fs.h | 2 ++
2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5e62d26..1707832 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -225,7 +225,7 @@ struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
* to work from.
*/
rlim = current->signal->rlim;
- if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / 4) {
+ if (size > ACCESS_ONCE(rlim[RLIMIT_STACK].rlim_cur) / ARG_MAX_FACTOR) {
put_page(page);
return NULL;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dbd860a..fcd84a6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -33,6 +33,8 @@
#define SEEK_END 2 /* seek relative to end of file */
#define SEEK_MAX SEEK_END

+#define ARG_MAX_FACTOR 4
+
struct fstrim_range {
__u64 start;
__u64 len;
--
1.7.4.4

2011-05-13 23:25:08

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 4/5] Add a sysconf syscall

From: Andi Kleen <[email protected]>

During testing we found some cases where a library wants to know
the number of CPUs for internal tuning, and calls sysconf for that.
glibc then reads /proc/stat which is very slow and scales poorly,
when the program is executed often.

For example sleepycat DB has this problem.

This patch adds a sysconf system call to avoid this problem.
This adds very little code to the kernel, but gives a large speedup.

It is intended to be called from glibc.

It is not a 100% POSIX sysconf -- some values in there are only
known to the C library, but supplies all values usefully
known to the kernel.

In some cases it is more accurate than glibc can do because it doesn't
have to guess. So when some value changes in the kernel it can
return the current value.
---
include/linux/sysconf.h | 23 ++++++++++++++
kernel/Makefile | 2 +-
kernel/sysconf.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 101 insertions(+), 1 deletions(-)
create mode 100644 include/linux/sysconf.h
create mode 100644 kernel/sysconf.c

diff --git a/include/linux/sysconf.h b/include/linux/sysconf.h
new file mode 100644
index 0000000..6d64ef7
--- /dev/null
+++ b/include/linux/sysconf.h
@@ -0,0 +1,23 @@
+#ifndef _LINUX_SYSCONF_H
+#define _LINUX_SYSCONF 1
+
+/*
+ * Subset of the glibc values for the entries the kernel implements.
+ */
+
+#define _SC_ARG_MAX 0
+#define _SC_CHILD_MAX 1
+#define _SC_CLK_TCK 2
+#define _SC_NGROUPS_MAX 3
+#define _SC_OPEN_MAX 4
+#define _SC_PAGESIZE 30
+#define _SC_SEM_NSEMS_MAX 32
+#define _SC_SIGQUEUE_MAX 34
+#define _SC_UIO_MAXIOV 60
+#define _SC_NPROCESSORS_CONF 83
+#define _SC_NPROCESSORS_ONLN 84
+#define _SC_PHYS_PAGES 85
+#define _SC_AVPHYS_PAGES 86
+#define _SC_SYMLOOP_MAX 173
+
+#endif
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..6ef66ca 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -10,7 +10,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
- async.o range.o jump_label.o
+ async.o range.o jump_label.o sysconf.o
obj-y += groups.o

ifdef CONFIG_FUNCTION_TRACER
diff --git a/kernel/sysconf.c b/kernel/sysconf.c
new file mode 100644
index 0000000..f41db10
--- /dev/null
+++ b/kernel/sysconf.c
@@ -0,0 +1,77 @@
+#include <linux/syscalls.h>
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/swap.h>
+#include <linux/sched.h>
+#include <linux/limits.h>
+#include <linux/sysconf.h>
+#include <linux/ipc_namespace.h>
+
+/* Do lockless because we only read a single number */
+static long rlimit_or(int rlim, long max)
+{
+ unsigned long cur = ACCESS_ONCE(current->signal->rlim[rlim].rlim_cur);
+
+ return cur == RLIM_INFINITY ? max : cur;
+}
+
+/*
+ * POSIX sysconf subset. Some programs need this in relatively fast paths
+ * and /proc is too slow for them.
+ *
+ * Note this is only a subset of the values supported by POSIX.
+ * We assume the C library handles the others.
+ */
+SYSCALL_DEFINE1(sysconf, int, name)
+{
+ switch (name) {
+ case _SC_ARG_MAX:
+ return rlimit_or(RLIMIT_STACK, ARG_MAX*ARG_MAX_FACTOR) /
+ ARG_MAX_FACTOR;
+
+ case _SC_CHILD_MAX:
+ return rlimit_or(RLIMIT_NPROC, max_threads);
+
+ case _SC_CLK_TCK:
+ return HZ;
+
+ case _SC_SEM_NSEMS_MAX:
+ return current->nsproxy->ipc_ns->sem_ctls[1];
+
+ case _SC_OPEN_MAX:
+ return rlimit_or(RLIMIT_NOFILE, sysctl_nr_open);
+
+ case _SC_SIGQUEUE_MAX:
+ /* or fallback based on memory? */
+ return rlimit_or(RLIMIT_SIGPENDING, INT_MAX);
+
+ case _SC_UIO_MAXIOV:
+ return UIO_MAXIOV;
+
+ case _SC_PAGESIZE:
+ return PAGE_SIZE;
+
+ case _SC_SYMLOOP_MAX:
+ return SYMLOOP_MAX;
+
+ case _SC_PHYS_PAGES:
+ return totalram_pages;
+
+ case _SC_AVPHYS_PAGES:
+ return nr_free_pages();
+
+ case _SC_NPROCESSORS_CONF:
+ return num_possible_cpus();
+
+ case _SC_NPROCESSORS_ONLN:
+ return num_online_cpus();
+
+ case _SC_NGROUPS_MAX:
+ return NGROUPS_MAX;
+
+ default:
+ return -EINVAL;
+ }
+}
+
--
1.7.4.4

2011-05-13 23:26:10

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 5/5] Hook up sysconf syscall for all architectures

From: Andi Kleen <[email protected]>

Signed-off-by: Andi Kleen <[email protected]>
---
arch/arm/include/asm/unistd.h | 1 +
arch/arm/kernel/calls.S | 1 +
arch/blackfin/include/asm/unistd.h | 3 ++-
arch/blackfin/mach-common/entry.S | 1 +
arch/ia64/include/asm/unistd.h | 3 ++-
arch/ia64/kernel/entry.S | 1 +
arch/m68k/include/asm/unistd.h | 3 ++-
arch/m68k/kernel/entry_mm.S | 1 +
arch/m68k/kernel/syscalltable.S | 1 +
arch/microblaze/include/asm/unistd.h | 3 ++-
arch/microblaze/kernel/syscall_table.S | 1 +
arch/mips/include/asm/unistd.h | 9 ++++++---
arch/mips/kernel/scall32-o32.S | 1 +
arch/mips/kernel/scall64-64.S | 1 +
arch/mips/kernel/scall64-n32.S | 1 +
arch/mips/kernel/scall64-o32.S | 1 +
arch/powerpc/include/asm/systbl.h | 1 +
arch/powerpc/include/asm/unistd.h | 3 ++-
arch/s390/include/asm/unistd.h | 3 ++-
arch/s390/kernel/compat_wrapper.S | 5 +++++
arch/s390/kernel/syscalls.S | 1 +
arch/sh/include/asm/unistd_32.h | 3 ++-
arch/sh/include/asm/unistd_64.h | 3 ++-
arch/sh/kernel/syscalls_32.S | 1 +
arch/sh/kernel/syscalls_64.S | 1 +
arch/sparc/include/asm/unistd.h | 3 ++-
arch/sparc/kernel/systbls_32.S | 2 +-
arch/sparc/kernel/systbls_64.S | 4 ++--
arch/x86/ia32/ia32entry.S | 1 +
arch/x86/include/asm/unistd_32.h | 3 ++-
arch/x86/include/asm/unistd_64.h | 2 ++
arch/x86/kernel/syscall_table_32.S | 1 +
include/asm-generic/unistd.h | 5 ++++-
33 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/arch/arm/include/asm/unistd.h b/arch/arm/include/asm/unistd.h
index 87dbe3e..d9b693b 100644
--- a/arch/arm/include/asm/unistd.h
+++ b/arch/arm/include/asm/unistd.h
@@ -400,6 +400,7 @@
#define __NR_open_by_handle_at (__NR_SYSCALL_BASE+371)
#define __NR_clock_adjtime (__NR_SYSCALL_BASE+372)
#define __NR_syncfs (__NR_SYSCALL_BASE+373)
+#define __NR_sysconf (__NR_SYSCALL_BASE+374)

/*
* The following SWIs are ARM private.
diff --git a/arch/arm/kernel/calls.S b/arch/arm/kernel/calls.S
index 7fbf28c..46c6a1b 100644
--- a/arch/arm/kernel/calls.S
+++ b/arch/arm/kernel/calls.S
@@ -383,6 +383,7 @@
CALL(sys_open_by_handle_at)
CALL(sys_clock_adjtime)
CALL(sys_syncfs)
+ CALL(sys_sysconf)
#ifndef syscalls_counted
.equ syscalls_padding, ((NR_syscalls + 3) & ~3) - NR_syscalls
#define syscalls_counted
diff --git a/arch/blackfin/include/asm/unistd.h b/arch/blackfin/include/asm/unistd.h
index ff9a9f3..6e29297 100644
--- a/arch/blackfin/include/asm/unistd.h
+++ b/arch/blackfin/include/asm/unistd.h
@@ -397,8 +397,9 @@
#define __NR_open_by_handle_at 376
#define __NR_clock_adjtime 377
#define __NR_syncfs 378
+#define __NR_sysconf 379

-#define __NR_syscall 379
+#define __NR_syscall 380
#define NR_syscalls __NR_syscall

/* Old optional stuff no one actually uses */
diff --git a/arch/blackfin/mach-common/entry.S b/arch/blackfin/mach-common/entry.S
index f96933f..49daa46 100644
--- a/arch/blackfin/mach-common/entry.S
+++ b/arch/blackfin/mach-common/entry.S
@@ -1753,6 +1753,7 @@ ENTRY(_sys_call_table)
.long _sys_open_by_handle_at
.long _sys_clock_adjtime
.long _sys_syncfs
+ .long _sys_sysconf

.rept NR_syscalls-(.-_sys_call_table)/4
.long _sys_ni_syscall
diff --git a/arch/ia64/include/asm/unistd.h b/arch/ia64/include/asm/unistd.h
index 404d037..9d73923 100644
--- a/arch/ia64/include/asm/unistd.h
+++ b/arch/ia64/include/asm/unistd.h
@@ -319,11 +319,12 @@
#define __NR_open_by_handle_at 1327
#define __NR_clock_adjtime 1328
#define __NR_syncfs 1329
+#define __NR_sysconf 1330

#ifdef __KERNEL__


-#define NR_syscalls 306 /* length of syscall table */
+#define NR_syscalls 307 /* length of syscall table */

/*
* The following defines stop scripts/checksyscalls.sh from complaining about
diff --git a/arch/ia64/kernel/entry.S b/arch/ia64/kernel/entry.S
index 6de2e23..dee434f 100644
--- a/arch/ia64/kernel/entry.S
+++ b/arch/ia64/kernel/entry.S
@@ -1775,6 +1775,7 @@ sys_call_table:
data8 sys_open_by_handle_at
data8 sys_clock_adjtime
data8 sys_syncfs
+ data8 sys_sysconf

.org sys_call_table + 8*NR_syscalls // guard against failures to increase NR_syscalls
#endif /* __IA64_ASM_PARAVIRTUALIZED_NATIVE */
diff --git a/arch/m68k/include/asm/unistd.h b/arch/m68k/include/asm/unistd.h
index 29e1790..acc8f01 100644
--- a/arch/m68k/include/asm/unistd.h
+++ b/arch/m68k/include/asm/unistd.h
@@ -347,10 +347,11 @@
#define __NR_open_by_handle_at 341
#define __NR_clock_adjtime 342
#define __NR_syncfs 343
+#define __NR_sysconf 344

#ifdef __KERNEL__

-#define NR_syscalls 344
+#define NR_syscalls 345

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/m68k/kernel/entry_mm.S b/arch/m68k/kernel/entry_mm.S
index 1359ee6..0280b58 100644
--- a/arch/m68k/kernel/entry_mm.S
+++ b/arch/m68k/kernel/entry_mm.S
@@ -754,4 +754,5 @@ sys_call_table:
.long sys_open_by_handle_at
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf

diff --git a/arch/m68k/kernel/syscalltable.S b/arch/m68k/kernel/syscalltable.S
index 9b8393d..e8ea4cc 100644
--- a/arch/m68k/kernel/syscalltable.S
+++ b/arch/m68k/kernel/syscalltable.S
@@ -362,6 +362,7 @@ ENTRY(sys_call_table)
.long sys_open_by_handle_at
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf

.rept NR_syscalls-(.-sys_call_table)/4
.long sys_ni_syscall
diff --git a/arch/microblaze/include/asm/unistd.h b/arch/microblaze/include/asm/unistd.h
index 30edd61..cfc0c7b 100644
--- a/arch/microblaze/include/asm/unistd.h
+++ b/arch/microblaze/include/asm/unistd.h
@@ -390,8 +390,9 @@
#define __NR_open_by_handle_at 372
#define __NR_clock_adjtime 373
#define __NR_syncfs 374
+#define __NR_sysconf 375

-#define __NR_syscalls 375
+#define __NR_syscalls 376

#ifdef __KERNEL__
#ifndef __ASSEMBLY__
diff --git a/arch/microblaze/kernel/syscall_table.S b/arch/microblaze/kernel/syscall_table.S
index 85cea81..4350241 100644
--- a/arch/microblaze/kernel/syscall_table.S
+++ b/arch/microblaze/kernel/syscall_table.S
@@ -379,3 +379,4 @@ ENTRY(sys_call_table)
.long sys_open_by_handle_at
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf
diff --git a/arch/mips/include/asm/unistd.h b/arch/mips/include/asm/unistd.h
index fa2e37e..590290c 100644
--- a/arch/mips/include/asm/unistd.h
+++ b/arch/mips/include/asm/unistd.h
@@ -363,11 +363,12 @@
#define __NR_open_by_handle_at (__NR_Linux + 340)
#define __NR_clock_adjtime (__NR_Linux + 341)
#define __NR_syncfs (__NR_Linux + 342)
+#define __NR_sysconf (__NR_Linux + 343)

/*
* Offset of the last Linux o32 flavoured syscall
*/
-#define __NR_Linux_syscalls 342
+#define __NR_Linux_syscalls 343

#endif /* _MIPS_SIM == _MIPS_SIM_ABI32 */

@@ -682,11 +683,12 @@
#define __NR_open_by_handle_at (__NR_Linux + 299)
#define __NR_clock_adjtime (__NR_Linux + 300)
#define __NR_syncfs (__NR_Linux + 301)
+#define __NR_sysconf (__NR_Linux + 302)

/*
* Offset of the last Linux 64-bit flavoured syscall
*/
-#define __NR_Linux_syscalls 301
+#define __NR_Linux_syscalls 302

#endif /* _MIPS_SIM == _MIPS_SIM_ABI64 */

@@ -1006,11 +1008,12 @@
#define __NR_open_by_handle_at (__NR_Linux + 304)
#define __NR_clock_adjtime (__NR_Linux + 305)
#define __NR_syncfs (__NR_Linux + 306)
+#define __NR_sysconf (__NR_Linux + 307)

/*
* Offset of the last N32 flavoured syscall
*/
-#define __NR_Linux_syscalls 306
+#define __NR_Linux_syscalls 307

#endif /* _MIPS_SIM == _MIPS_SIM_NABI32 */

diff --git a/arch/mips/kernel/scall32-o32.S b/arch/mips/kernel/scall32-o32.S
index 7f5468b..f5b5ccc 100644
--- a/arch/mips/kernel/scall32-o32.S
+++ b/arch/mips/kernel/scall32-o32.S
@@ -590,6 +590,7 @@ einval: li v0, -ENOSYS
sys sys_open_by_handle_at 3 /* 4340 */
sys sys_clock_adjtime 2
sys sys_syncfs 1
+ sys sys_sysconf 1
.endm

/* We pre-compute the number of _instruction_ bytes needed to
diff --git a/arch/mips/kernel/scall64-64.S b/arch/mips/kernel/scall64-64.S
index a2e1fcb..16b40f5 100644
--- a/arch/mips/kernel/scall64-64.S
+++ b/arch/mips/kernel/scall64-64.S
@@ -429,4 +429,5 @@ sys_call_table:
PTR sys_open_by_handle_at
PTR sys_clock_adjtime /* 5300 */
PTR sys_syncfs
+ PTR sys_sysconf
.size sys_call_table,.-sys_call_table
diff --git a/arch/mips/kernel/scall64-n32.S b/arch/mips/kernel/scall64-n32.S
index b2c7624..32c7ba2 100644
--- a/arch/mips/kernel/scall64-n32.S
+++ b/arch/mips/kernel/scall64-n32.S
@@ -429,4 +429,5 @@ EXPORT(sysn32_call_table)
PTR sys_open_by_handle_at
PTR compat_sys_clock_adjtime /* 6305 */
PTR sys_syncfs
+ PTR sys_sysconf
.size sysn32_call_table,.-sysn32_call_table
diff --git a/arch/mips/kernel/scall64-o32.S b/arch/mips/kernel/scall64-o32.S
index 049a9c8..b9e751c 100644
--- a/arch/mips/kernel/scall64-o32.S
+++ b/arch/mips/kernel/scall64-o32.S
@@ -547,4 +547,5 @@ sys_call_table:
PTR compat_sys_open_by_handle_at /* 4340 */
PTR compat_sys_clock_adjtime
PTR sys_syncfs
+ PTR sys_sysconf
.size sys_call_table,.-sys_call_table
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index 60f64b1..6889207 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -352,3 +352,4 @@ SYSCALL_SPU(name_to_handle_at)
COMPAT_SYS_SPU(open_by_handle_at)
COMPAT_SYS_SPU(clock_adjtime)
SYSCALL_SPU(syncfs)
+SYSCALL_SPU(sysconf)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index 3c21564..e6c9a73 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -371,10 +371,11 @@
#define __NR_open_by_handle_at 346
#define __NR_clock_adjtime 347
#define __NR_syncfs 348
+#define __NR_sysconf 349

#ifdef __KERNEL__

-#define __NR_syscalls 349
+#define __NR_syscalls 350

#define __NR__exit __NR_exit
#define NR_syscalls __NR_syscalls
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index e821525..442cc13 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -276,7 +276,8 @@
#define __NR_open_by_handle_at 336
#define __NR_clock_adjtime 337
#define __NR_syncfs 338
-#define NR_syscalls 339
+#define __NR_sysconf 339
+#define NR_syscalls 340

/*
* There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 1dc96ea..988f859 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1904,3 +1904,8 @@ compat_sys_clock_adjtime_wrapper:
sys_syncfs_wrapper:
lgfr %r2,%r2 # int
jg sys_syncfs
+
+ .globl sys_sysconf_wrapper
+sys_sysconf_wrapper:
+ lgfr %r2,%r2 # int
+ jg sys_sysconf
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 9c65fd4..b97c80d 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -347,3 +347,4 @@ SYSCALL(sys_name_to_handle_at,sys_name_to_handle_at,sys_name_to_handle_at_wrappe
SYSCALL(sys_open_by_handle_at,sys_open_by_handle_at,compat_sys_open_by_handle_at_wrapper)
SYSCALL(sys_clock_adjtime,sys_clock_adjtime,compat_sys_clock_adjtime_wrapper)
SYSCALL(sys_syncfs,sys_syncfs,sys_syncfs_wrapper)
+SYSCALL(sys_sysconf,sys_sysconf,sys_sysconf_wrapper)
diff --git a/arch/sh/include/asm/unistd_32.h b/arch/sh/include/asm/unistd_32.h
index ca7765e..ccf2de4 100644
--- a/arch/sh/include/asm/unistd_32.h
+++ b/arch/sh/include/asm/unistd_32.h
@@ -373,8 +373,9 @@
#define __NR_open_by_handle_at 360
#define __NR_clock_adjtime 361
#define __NR_syncfs 362
+#define __NR_sysconf 363

-#define NR_syscalls 363
+#define NR_syscalls 364

#ifdef __KERNEL__

diff --git a/arch/sh/include/asm/unistd_64.h b/arch/sh/include/asm/unistd_64.h
index a694009..b0ff4d9 100644
--- a/arch/sh/include/asm/unistd_64.h
+++ b/arch/sh/include/asm/unistd_64.h
@@ -394,10 +394,11 @@
#define __NR_open_by_handle_at 371
#define __NR_clock_adjtime 372
#define __NR_syncfs 373
+#define __NR_sysconf 374

#ifdef __KERNEL__

-#define NR_syscalls 374
+#define NR_syscalls 375

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/sh/kernel/syscalls_32.S b/arch/sh/kernel/syscalls_32.S
index 030966a..705d2de 100644
--- a/arch/sh/kernel/syscalls_32.S
+++ b/arch/sh/kernel/syscalls_32.S
@@ -380,3 +380,4 @@ ENTRY(sys_call_table)
.long sys_open_by_handle_at /* 360 */
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf
diff --git a/arch/sh/kernel/syscalls_64.S b/arch/sh/kernel/syscalls_64.S
index ca0a614..c288875 100644
--- a/arch/sh/kernel/syscalls_64.S
+++ b/arch/sh/kernel/syscalls_64.S
@@ -400,3 +400,4 @@ sys_call_table:
.long sys_open_by_handle_at
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf
diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h
index 9d897b6..0b0b362 100644
--- a/arch/sparc/include/asm/unistd.h
+++ b/arch/sparc/include/asm/unistd.h
@@ -404,8 +404,9 @@
#define __NR_open_by_handle_at 333
#define __NR_clock_adjtime 334
#define __NR_syncfs 335
+#define __NR_sysconf 336

-#define NR_syscalls 336
+#define NR_syscalls 337

#ifdef __32bit_syscall_numbers__
/* Sparc 32-bit only has the "setresuid32", "getresuid32" variants,
diff --git a/arch/sparc/kernel/systbls_32.S b/arch/sparc/kernel/systbls_32.S
index 47ac73c..1eeae1f 100644
--- a/arch/sparc/kernel/systbls_32.S
+++ b/arch/sparc/kernel/systbls_32.S
@@ -84,4 +84,4 @@ sys_call_table:
/*320*/ .long sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv
/*325*/ .long sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_event_open, sys_recvmmsg, sys_fanotify_init
/*330*/ .long sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
-/*335*/ .long sys_syncfs
+/*335*/ .long sys_syncfs, sys_sysconf
diff --git a/arch/sparc/kernel/systbls_64.S b/arch/sparc/kernel/systbls_64.S
index 4f3170c..1dc8e34 100644
--- a/arch/sparc/kernel/systbls_64.S
+++ b/arch/sparc/kernel/systbls_64.S
@@ -85,7 +85,7 @@ sys_call_table32:
/*320*/ .word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, compat_sys_preadv
.word compat_sys_pwritev, compat_sys_rt_tgsigqueueinfo, sys_perf_event_open, compat_sys_recvmmsg, sys_fanotify_init
/*330*/ .word sys32_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, compat_sys_open_by_handle_at, compat_sys_clock_adjtime
- .word sys_syncfs
+ .word sys_syncfs, sys_sysconf

#endif /* CONFIG_COMPAT */

@@ -162,4 +162,4 @@ sys_call_table:
/*320*/ .word sys_dup3, sys_pipe2, sys_inotify_init1, sys_accept4, sys_preadv
.word sys_pwritev, sys_rt_tgsigqueueinfo, sys_perf_event_open, sys_recvmmsg, sys_fanotify_init
/*330*/ .word sys_fanotify_mark, sys_prlimit64, sys_name_to_handle_at, sys_open_by_handle_at, sys_clock_adjtime
- .word sys_syncfs
+ .word sys_syncfs, sys_sysconf
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 849a9d2..493fb4f 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -848,4 +848,5 @@ ia32_sys_call_table:
.quad compat_sys_open_by_handle_at
.quad compat_sys_clock_adjtime
.quad sys_syncfs
+ .quad sys_sysconf
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index a755ef5..da83485 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -350,10 +350,11 @@
#define __NR_open_by_handle_at 342
#define __NR_clock_adjtime 343
#define __NR_syncfs 344
+#define __NR_sysconf 345

#ifdef __KERNEL__

-#define NR_syscalls 345
+#define NR_syscalls 346

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 160fa76..2948044 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -677,6 +677,8 @@ __SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)
__SYSCALL(__NR_clock_adjtime, sys_clock_adjtime)
#define __NR_syncfs 306
__SYSCALL(__NR_syncfs, sys_syncfs)
+#define __NR_sysconf 307
+__SYSCALL(__NR_sysconf, sys_sysconf)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index abce34d..e926eb3 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -344,3 +344,4 @@ ENTRY(sys_call_table)
.long sys_open_by_handle_at
.long sys_clock_adjtime
.long sys_syncfs
+ .long sys_sysconf
diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index 07c40d5..1092b02 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -654,9 +654,12 @@ __SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)
__SYSCALL(__NR_clock_adjtime, sys_clock_adjtime)
#define __NR_syncfs 267
__SYSCALL(__NR_syncfs, sys_syncfs)
+#define __NR_sysconf 268
+__SYSCALL(__NR_sysconf, sys_sysconf)
+

#undef __NR_syscalls
-#define __NR_syscalls 268
+#define __NR_syscalls 269

/*
* All syscalls below here should go away really,
--
1.7.4.4

2011-05-14 01:21:52

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

From: Andi Kleen <[email protected]>
Date: Fri, 13 May 2011 16:24:19 -0700

> diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h
> index 9d897b6..0b0b362 100644
> --- a/arch/sparc/include/asm/unistd.h
> +++ b/arch/sparc/include/asm/unistd.h
> @@ -404,8 +404,9 @@
> #define __NR_open_by_handle_at 333
> #define __NR_clock_adjtime 334
> #define __NR_syncfs 335
> +#define __NR_sysconf 336
>
> -#define NR_syscalls 336
> +#define NR_syscalls 337
>

x86, powerpc, and sparc entries will conflict with already assigned
slots for sys_sendmmsg() in networking tree.

And Eric W. Bierderman has changes that also add system calls to
all architectures for his "setns" namespace changes.

2011-05-14 02:23:52

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

On Fri, May 13, 2011 at 19:24, Andi Kleen wrote:
>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>  arch/blackfin/mach-common/entry.S      |    1 +

i wish more people would wire up new syscalls for arches ;)

i dont recall off the top of my head acking any other pending new
syscall which would cause a conflict, so:
Acked-by: Mike Frysinger <[email protected]>

i generally wait for syscalls to hit mainline before i wire them up myself
-mike

2011-05-14 02:51:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

On Fri, May 13, 2011 at 09:21:24PM -0400, David Miller wrote:
> From: Andi Kleen <[email protected]>
> Date: Fri, 13 May 2011 16:24:19 -0700
>
> > diff --git a/arch/sparc/include/asm/unistd.h b/arch/sparc/include/asm/unistd.h
> > index 9d897b6..0b0b362 100644
> > --- a/arch/sparc/include/asm/unistd.h
> > +++ b/arch/sparc/include/asm/unistd.h
> > @@ -404,8 +404,9 @@
> > #define __NR_open_by_handle_at 333
> > #define __NR_clock_adjtime 334
> > #define __NR_syncfs 335
> > +#define __NR_sysconf 336
> >
> > -#define NR_syscalls 336
> > +#define NR_syscalls 337
> >
>
> x86, powerpc, and sparc entries will conflict with already assigned
> slots for sys_sendmmsg() in networking tree.

Thanks. I'll fix it up on the next try, hopefully those
changes are in mainline then.

-Andi

--
[email protected] -- Speaking for myself only.

2011-05-14 06:58:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall


* Andi Kleen <[email protected]> wrote:

> From: Andi Kleen <[email protected]>
>
> During testing we found some cases where a library wants to know
> the number of CPUs for internal tuning, and calls sysconf for that.
> glibc then reads /proc/stat which is very slow and scales poorly,
> when the program is executed often.
>
> For example sleepycat DB has this problem.
>
> This patch adds a sysconf system call to avoid this problem.
> This adds very little code to the kernel, but gives a large speedup.
>
> It is intended to be called from glibc.
>
> It is not a 100% POSIX sysconf -- some values in there are only
> known to the C library, but supplies all values usefully
> known to the kernel.
>
> In some cases it is more accurate than glibc can do because it doesn't
> have to guess. So when some value changes in the kernel it can
> return the current value.
> ---
> include/linux/sysconf.h | 23 ++++++++++++++
> kernel/Makefile | 2 +-
> kernel/sysconf.c | 77 +++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 101 insertions(+), 1 deletions(-)
> create mode 100644 include/linux/sysconf.h
> create mode 100644 kernel/sysconf.c

What glibc does (opening /proc/stat) is rather stupid, but i think your syscall
is just as stupid a solution as well, you are just implementing a weird
filesystem, and a pretty slow one at that.

Note that these are mostly constant or semi-constant values that are updated
very rarely:

+#define _SC_ARG_MAX 0
+#define _SC_CHILD_MAX 1
+#define _SC_CLK_TCK 2
+#define _SC_NGROUPS_MAX 3
+#define _SC_OPEN_MAX 4
+#define _SC_PAGESIZE 30
+#define _SC_SEM_NSEMS_MAX 32
+#define _SC_SIGQUEUE_MAX 34
+#define _SC_UIO_MAXIOV 60
+#define _SC_NPROCESSORS_CONF 83
+#define _SC_NPROCESSORS_ONLN 84
+#define _SC_PHYS_PAGES 85
+#define _SC_AVPHYS_PAGES 86
+#define _SC_SYMLOOP_MAX 173

If glibc is stupid and reads /proc/stat to receive something it could cache or
mmap() itself then hey, did you consider fixing glibc or creating a sane libc?
It's open-source.

Do not help glibc remain stupid and sloppy until eternity!

If we *really* want to offer kernel help for these values even then your
solution is still wrong: then the proper solution would be to define a standard
*data* structure and map it as a vsyscall *data* page - essentially a
kernel-guaranteed data mmap(), with no extra syscall needed!

That could have other uses as well in the future.

That way it would much faster than your current code btw.

So unless there are some compelling arguments in favor of sys_sysconf() that i
missed, i have to NAK this:

Nacked-by: Ingo Molnar <[email protected]>

Thanks,

Ingo

2011-05-14 16:34:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

> What glibc does (opening /proc/stat) is rather stupid, but i think your syscall

I don't think it has any other choice today. So if anything is "stupid"
it is the kernel for not providing efficient interfaces for this.

> Note that these are mostly constant or semi-constant values that are updated
> very rarely:

That's not true. Most of them are dynamic. Take a look at the patch.
Also several of those have changed recently.

> If glibc is stupid and reads /proc/stat to receive something it could cache or
> mmap() itself then hey, did you consider fixing glibc or creating a sane libc?
Caching doesn't help when you have a workload that exec()s a lot.
Also some of these values can change at runtime.

> If we *really* want to offer kernel help for these values even then your
> solution is still wrong: then the proper solution would be to define a standard
> *data* structure and map it as a vsyscall *data* page - essentially a
> kernel-guaranteed data mmap(), with no extra syscall needed!

That's quite complicted because several of those are dynamically computed
based on other values. Sometimes they are also not tied to the mm_struct -- like
the vsyscall is. For example some of the rlimits are per task, not VM.
Basically your proposal doesn't interact well with clone().

Even if we ignored that semantic problem it would need another writable page
per task because the values cannot be shared.

Also I never liked the idea of having more writable pages per task,
It increases the memory footprint of a single process more. Given a 4K
page is not a lot, but lots of 4K pages add up. Some workloads like
to have lots of small processes and I think that's a valuable use
case Linux should stay lean and efficient at.

[OK in theory one could do COW for the page and share it but that would
get really complicated]

I also don't think it's THAT performance critical to justify the vsyscall.
The simple syscall is already orders of magnitude faster than /proc, and
seems to solve the performance problems we've seen completely.

It's also simple and straight forward and simple to userstand and maintain.
I doubt any of that would apply to a vsyscall solution.

I don't think the additional effort for a vsyscall would be worth
it at this point, unless there's some concrete example that would
justify it. Even then it wouldn't work for some of the values.

Also a vsyscall doesn't help on non x86 anyways.

As for efficiency: I thought about doing a batched interface where
the user could pass in an array of values to fill in. But at least for the
workloads I looked at the application usually calls sysconf() where
the array size would be always 1. And that's the standard interface.
This might be still worth doing, but I would like to see a concrete
use case first.

> That could have other uses as well in the future.

Hmm for what?

Note we already have a fast mechanism to pass some thing to glibc
in the aux vector.

>
> That way it would much faster than your current code btw.
>
> So unless there are some compelling arguments in favor of sys_sysconf() that i
> missed, i have to NAK this:

Well see above for lots of reasons you missed. They are understandable
mistakes for someone who first looks at the problem though. I'll
attempt to improve the documentation next time.

-Andi

2011-05-16 13:37:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall


* Andi Kleen <[email protected]> wrote:

> > What glibc does (opening /proc/stat) is rather stupid, but i think your
> > syscall
>
> I don't think it has any other choice today. [...]

Sure it has a choice: most of the sysconf values are global so creating a
permanent mmap()-able data in /tmp or elsewhere and mapping it unless it's
inaccessible isnt that particularly hard to cache most of the system-wide
constants, now is it?

The CPU count could be updated from CPU hotplug events.

rlimits can be queried using getrlimit().

If at that point glibc said to us: "hey, lets work in making it even faster"
then there would be no reason for us to criticise glibc. Right now gibc does
not even *try* to be smart about it.

> [...] So if anything is "stupid" it is the kernel for not providing efficient
> interfaces for this.
>
> > Note that these are mostly constant or semi-constant values that are
> > updated very rarely:
>
> That's not true. Most of them are dynamic. Take a look at the patch.
> Also several of those have changed recently.

As i said they are mostly constant or semi-constant values that are updated
very rarely.

If you think that i am wrong then do me the basic courtesy of mentioning the
examples that you think disprove my claim, instead of broadly pointing me to
your patch ...

> > If glibc is stupid and reads /proc/stat to receive something it could cache
> > or mmap() itself then hey, did you consider fixing glibc or creating a sane
> > libc?
>
> Caching doesn't help when you have a workload that exec()s a lot. Also some
> of these values can change at runtime.

Here you modify your claim. Now it's not 'dynamic' anymore but 'can change'?

Which one is it now, "mostly constant or semi-constant values that are updated
very rarely" as i claim, "dynamic" as you first claimed or "can change" as you
claim here (which is also pretty ambiguous)?

> > If we *really* want to offer kernel help for these values even then your
> > solution is still wrong: then the proper solution would be to define a
> > standard *data* structure and map it as a vsyscall *data* page -
> > essentially a kernel-guaranteed data mmap(), with no extra syscall needed!
>
> That's quite complicted because several of those are dynamically computed
> based on other values. Sometimes they are also not tied to the mm_struct --
> like the vsyscall is. For example some of the rlimits are per task, not VM.
> Basically your proposal doesn't interact well with clone().

Threads with different rlimits but shared VM are extreme rarities.

Could we please concentrate on the common case? A '-1' in the data page can let
the code fall back to some slow path.

Also note that rlimit itself already has an interface to query them:
getrlimit(). So if you do not want the complexity of caching rlimits in the
data page you do not have to start with that complexity.

[ But it can be done: modifying the rlimit (which predominantly only happens in
the login process) is rare and happens in a parent task. ]

> Even if we ignored that semantic problem it would need another writable page
> per task because the values cannot be shared.

Sure most of the values can be shared.

Most of them are exactly one of a low number of variants for all tasks in the
system, for typical Linux bootups. I suspect if the data page was COW-ed but
inherited across exec() it would behave exactly the way it should be: inherited
by all tasks but privatized if a task modifies it for itself and all children.

Also, the first step could be simplified by not exposing rlimits - as rlimits
are already exposed via other channels.

> Also I never liked the idea of having more writable pages per task, [...]

If you limit it to your own faulty implementation then i'm not surprised that
you do not like it.

> [...] It increases the memory footprint of a single process more. Given a 4K
> page is not a lot, but lots of 4K pages add up. Some workloads like to have
> lots of small processes and I think that's a valuable use case Linux should
> stay lean and efficient at.
>
> [OK in theory one could do COW for the page and share it but that would get
> really complicated]

Why would it get complicated? It's not writable to user-space, that's all that
is special about it.

> I also don't think it's THAT performance critical to justify the vsyscall.

You apparently did not understand the gist of my point: it's the CONCEPT of
adding a crappy catch-all sysconf() interface that sucks IMHO. It's simply bad
taste.

If you want to expose data then expose the data intelligently, not some poor
system call interface that is also slower.

> The simple syscall is already orders of magnitude faster than /proc, and
> seems to solve the performance problems we've seen completely.

A handful of horse manure is less stinky than a big pile of it, still i wouldnt
want to eat either.

> It's also simple and straight forward and simple to userstand and maintain. I
> doubt any of that would apply to a vsyscall solution.

Note: i did not suggest a vsyscall, but a vsyscall *data area*. There's a big
difference between the two!

It could be passed down to user-space using a new auxiliary entry (see
fs/binfmt_elf.c), as it's really part of a task's environment conceptually.

> I don't think the additional effort for a vsyscall would be worth it at this
> point, unless there's some concrete example that would justify it. Even then
> it wouldn't work for some of the values.
>
> Also a vsyscall doesn't help on non x86 anyways.

There's nothing x86 about aux entries.

> As for efficiency: I thought about doing a batched interface where
> the user could pass in an array of values to fill in. But at least for the
> workloads I looked at the application usually calls sysconf() where
> the array size would be always 1. And that's the standard interface.
> This might be still worth doing, but I would like to see a concrete
> use case first.
>
> > That could have other uses as well in the future.
>
> Hmm for what?

*Iff* the concensus is that we are fine with a new page per task/thread then we
could use it for all sorts of nifty things like the current CPU id for example.

> Note we already have a fast mechanism to pass some thing to glibc in the aux
> vector.

So when you got so far in your reply why did you not delete your above (false)
outrage about the vsyscall, which i never suggested and which you thus forced
me to reply to?

> > That way it would much faster than your current code btw.
> >
> > So unless there are some compelling arguments in favor of sys_sysconf()
> > that i missed, i have to NAK this:
>
> Well see above for lots of reasons you missed. They are understandable
> mistakes for someone who first looks at the problem though. I'll attempt to
> improve the documentation next time.

I don't think your condescending tone towards me is warranted or fair, i
offered a fair technical criticism of your patch series. Why are you
attacking me like this?

Thanks,

Ingo

2011-05-16 15:43:14

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

On Sat, May 14, 2011 at 1:24 AM, Andi Kleen <[email protected]> wrote:
> +/*
> + * POSIX sysconf subset. Some programs need this in relatively fast paths
> + * and /proc is too slow for them.
> + *
> + * Note this is only a subset of the values supported by POSIX.
> + * We assume the C library handles the others.
> + */
> +SYSCALL_DEFINE1(sysconf, int, name)
> +{
> + ? ? ? switch (name) {
> + ? ? ? case _SC_ARG_MAX:
> + ? ? ? ? ? ? ? return rlimit_or(RLIMIT_STACK, ARG_MAX*ARG_MAX_FACTOR) /
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ARG_MAX_FACTOR;
> +
> + ? ? ? case _SC_CHILD_MAX:
> + ? ? ? ? ? ? ? return rlimit_or(RLIMIT_NPROC, max_threads);
> +
> + ? ? ? case _SC_CLK_TCK:
> + ? ? ? ? ? ? ? return HZ;
> +
> + ? ? ? case _SC_SEM_NSEMS_MAX:
> + ? ? ? ? ? ? ? return current->nsproxy->ipc_ns->sem_ctls[1];
> +
> + ? ? ? case _SC_OPEN_MAX:
> + ? ? ? ? ? ? ? return rlimit_or(RLIMIT_NOFILE, sysctl_nr_open);
> +
> + ? ? ? case _SC_SIGQUEUE_MAX:
> + ? ? ? ? ? ? ? /* or fallback based on memory? */
> + ? ? ? ? ? ? ? return rlimit_or(RLIMIT_SIGPENDING, INT_MAX);
> +
> + ? ? ? case _SC_UIO_MAXIOV:
> + ? ? ? ? ? ? ? return UIO_MAXIOV;
> +
> + ? ? ? case _SC_PAGESIZE:
> + ? ? ? ? ? ? ? return PAGE_SIZE;
> +
> + ? ? ? case _SC_SYMLOOP_MAX:
> + ? ? ? ? ? ? ? return SYMLOOP_MAX;
> +
> + ? ? ? case _SC_PHYS_PAGES:
> + ? ? ? ? ? ? ? return totalram_pages;
> +
> + ? ? ? case _SC_AVPHYS_PAGES:
> + ? ? ? ? ? ? ? return nr_free_pages();
> +
> + ? ? ? case _SC_NPROCESSORS_CONF:
> + ? ? ? ? ? ? ? return num_possible_cpus();
> +
> + ? ? ? case _SC_NPROCESSORS_ONLN:
> + ? ? ? ? ? ? ? return num_online_cpus();
> +
> + ? ? ? case _SC_NGROUPS_MAX:
> + ? ? ? ? ? ? ? return NGROUPS_MAX;
> +
> + ? ? ? default:
> + ? ? ? ? ? ? ? return -EINVAL;
> + ? ? ? }
> +}

...and libc will start making many such calls in a row in order to retrieve
a dozen of such values.

It's rather inefficient to return just one word.
Try to return more data per call.

Pass a pointer to the result struct and its length.
Future-proof API, such as using generously wide data types,
passing "version of struct" input parameter to facilitate incompatible
future changes, etc.

Pass a bit flag variable which says what data you want to have:
all-zeros means "give me only the cheap stuff which needs no calculations",
such as NGROUPS_MAX.
If SYSCONF_NPROCESSORS_ONLN bit is set, also return num_online_cpus().
If SYSCONF_AVPHYS_PAGES bit is set, return nr_free_pages().
Etc.

Maybe it makes sense to pass bits in the struct rather than in a parameter,
to not get caught by "we ran out of bits in the word!" problem later.

--
vda

2011-05-16 15:51:57

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

On 05/14/2011 12:34 PM, Andi Kleen wrote:
>> What glibc does (opening /proc/stat) is rather stupid, but i think your syscall
>
> I don't think it has any other choice today. So if anything is "stupid"
> it is the kernel for not providing efficient interfaces for this.
>
>> Note that these are mostly constant or semi-constant values that are updated
>> very rarely:
>
> That's not true. Most of them are dynamic. Take a look at the patch.
> Also several of those have changed recently.
>
>> If glibc is stupid and reads /proc/stat to receive something it could cache or
>> mmap() itself then hey, did you consider fixing glibc or creating a sane libc?
> Caching doesn't help when you have a workload that exec()s a lot.
> Also some of these values can change at runtime.
>
>> If we *really* want to offer kernel help for these values even then your
>> solution is still wrong: then the proper solution would be to define a standard
>> *data* structure and map it as a vsyscall *data* page - essentially a
>> kernel-guaranteed data mmap(), with no extra syscall needed!
>
> That's quite complicted because several of those are dynamically computed
> based on other values. Sometimes they are also not tied to the mm_struct -- like
> the vsyscall is. For example some of the rlimits are per task, not VM.
> Basically your proposal doesn't interact well with clone().
>
> Even if we ignored that semantic problem it would need another writable page
> per task because the values cannot be shared.
>
> Also I never liked the idea of having more writable pages per task,
> It increases the memory footprint of a single process more. Given a 4K
> page is not a lot, but lots of 4K pages add up. Some workloads like
> to have lots of small processes and I think that's a valuable use
> case Linux should stay lean and efficient at.
>
> [OK in theory one could do COW for the page and share it but that would
> get really complicated]
>
> I also don't think it's THAT performance critical to justify the vsyscall.
> The simple syscall is already orders of magnitude faster than /proc, and
> seems to solve the performance problems we've seen completely.
>
> It's also simple and straight forward and simple to userstand and maintain.
> I doubt any of that would apply to a vsyscall solution.
>
> I don't think the additional effort for a vsyscall would be worth
> it at this point, unless there's some concrete example that would
> justify it. Even then it wouldn't work for some of the values.

You could also add a vsyscall function and have that function pull the
values from vvar data. It would be very fast (i.e. almost as fast as
mmaping some data) and it would be more portable.

If you built it on top of the vdso cleanup in my rdtsc patches, it would
be dead simple, too. (Think three or four lines of code and no linker
hackery).

>
> Also a vsyscall doesn't help on non x86 anyways.
>

Add one? x86-32 at least would benefit a lot. (I'm not volunteering
right now...)

--Andy

2011-05-16 16:01:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

> ...and libc will start making many such calls in a row in order to retrieve
> a dozen of such values.

It doesn't because the user interface is sysconf(). So the user program
just asks for it piece by piece.

> It's rather inefficient to return just one word.
> Try to return more data per call.

I considered that, but is there a concrete use case?
I didn't want to code it up without concrete use case.

> Pass a pointer to the result struct and its length.
> Future-proof API, such as using generously wide data types,
> passing "version of struct" input parameter to facilitate incompatible
> future changes, etc.

Please provide a use case for all this complexity.

[... more complexity snipped ... ]

-Andi

2011-05-16 16:08:12

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

> You could also add a vsyscall function and have that function pull the
> values from vvar data. It would be very fast (i.e. almost as fast as
> mmaping some data) and it would be more portable.

vvar is a global. The variable data in sysconf is per thread
(and not even per mm)

Even if you ignore the semantic problem of thread versus mm, you would
need a new writable page per process and update it any time
the rlimits and other information is updated.

On the other hand my solution is dead simple and already solves
the problem.

-Andi

2011-05-16 17:06:31

by Andrew Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

On Mon, May 16, 2011 at 12:08 PM, Andi Kleen <[email protected]> wrote:
>> You could also add a vsyscall function and have that function pull the
>> values from vvar data. ?It would be very fast (i.e. almost as fast as
>> mmaping some data) and it would be more portable.
>
> vvar is a global. The variable data in sysconf is per thread
> (and not even per mm)
>
> Even if you ignore the semantic problem of thread versus mm, you would
> need a new writable page per process and update it any time
> the rlimits and other information is updated.
>
> On the other hand my solution is dead simple and already solves
> the problem.

True. I should read the entire list of variables before making
suggestions that won't work. :)

--Andy

>
> -Andi
>

2011-05-16 17:39:58

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

On Mon, May 16, 2011 at 12:06:40PM -0500, Steve Munroe wrote:
d Seems like the Aux Vector would do job. Simple, in memory, LIBC already



It's difficult to update the aux vector dynamically after the program has
started. Applications in theory could reuse it for something else, so it
wouldn't be safe to poke in there (and quite complex too)

One of the main motivations for the new call was _SC_NPROCESSORS_ONLN,
which can change dynamically.

So if we put that into the aux vector the program would only ever know about
the CPUs online at startup time. Not good.

> fills in some sysconf values from the auxv and has infrastructure for that.
> And the perform conscious hacker can access the auxv directly if they want.

rlimits are not static and the they don't map to clone semantics (see all the
other mails on this) Various values are based on rlimits.

vsyscalls would work for the online CPUs, but not for a lot of other things.

I suppose if this was just to solve the online cpu case it may be a suitable
solution, but I tried to be a bit more general.

Here's a handy reference table:

vdso aux syscall
constant values x x x
rlimit based values -* - x
runtime values like cpus x - x
portable to all archs - x x
performance best best good enough
coding complexity high low low


* unless you break clone and add a writable page to each program and be complex

So you can see the aux vector is not doing very well overall.
The vdso is somewhat better, but only if you sacrify the rlimits
and accept high complexity.

The syscall is imho still the sweet spot overall. It's excellent
everywhere, except it doesn't have the best performance. But it seems
good enough.

-Andi

2011-05-16 20:51:52

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

> The syscall is only current if you call it frequently (i.e. every time),
> but there is tendency to go local (static memory) caching to avoid the
> syscall overhead. Something like _SC_NPROCESSORS_ONLN that is used to set
> up OpenMP/auto-parallel threading for major loops is likely to be cached.

That is true. Still I assume other programs would ask for it more frequently.

Longer term at some point we probably need a proper CPU hotplug
event interface, but short term polling is the standard model.

>
> But if you look at the gettimeofday/vdso implementation, the time_base to
> time conversion is semi-dynamic for the NTPd case and we handle that in
> VDSO. There is shared memory access between the VDSO and the kernel for
> this case.

Yes, I wrote that code for x86 ;-)

> Is problem with rlimit that it is process or session dependent? But is it
> really dynamic once a process starts?

It can be changed any time per thread (actually per signal context).
Now that's probably not common, but I don't think a new ABI should
ignore the possibility of it changing anyways.

> It is not clear one size fits all here.

What partition would you suggest?

Also do you have a use case where the syscall is too slow?

I'm basically trying to find a good tradeoff for implementation effort
vs usefullness vs good semantics vs performance here.

-Andi
--
[email protected] -- Speaking for myself only.

2011-05-17 10:59:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall


* Andi Kleen <[email protected]> wrote:

> > Auxv is simpler yet.
> >
>
> Yes simple and doesn't work.

And i've noticed one bug and three uncleanlinesses in your series.

See how destructive obstructionist comments that intentionally cut down on
essential detail (which you do to gain a discussion advantage) are to the
healthy flow of a discussion? Yet you are doing it all the time on lkml.

Why are you doing that, do you really enjoy inefficient discussions where you
regularly throw in cryptic, ambiguous half-references to disrupt contrary
opinion?

Please try to be more constructive in discussions.

Thanks,

Ingo

2011-05-17 11:25:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall


Andi,

AFAICS you have not replied to my detailed analysis below yet (you only replied
to a sub-reply, not addressing most of the points i raised) - should i expect a
reply in the days to come, or will you follow your past pattern of
unpredictably ignoring review concerns?

Firstly i think the obvious canonical fix is to expose the number of CPUs in a
separate file in /proc and not in /proc/stat which is expensive to construct
and parse. Also arguably sleepycat should not need to read the number of CPUs
all the time - why does it do that?

Secondly, even assuming that fast access to all these system globals is
desirable i question the choice to include rlimits and nsems_max: they are not
really system globals (so why are they in sysconf() to begin with?) and they
can already be queried cheaply using getrlimits()/semctl().

These are the values that make the sysconf() information per task and
complicate the data structure.

If we look at all non-rlimit values:

_SC_CLK_TCK: return HZ;
_SC_UIO_MAXIOV: return UIO_MAXIOV;
_SC_PAGESIZE: return PAGE_SIZE;
_SC_SYMLOOP_MAX: return SYMLOOP_MAX;
_SC_PHYS_PAGES: return totalram_pages;
_SC_AVPHYS_PAGES: return nr_free_pages();
_SC_NPROCESSORS_CONF: return num_possible_cpus();
_SC_NPROCESSORS_ONLN: return num_online_cpus();
_SC_NGROUPS_MAX: return NGROUPS_MAX;

These are all global, read-mostly values.

Were these rlimits not included we could do a sane global vdso page with the
NPROCESSORS number exposed (which is what the sleepycat example you cited
really needs it appears) and expose it to glibc using a new auxiliary entry (or
via a vsyscall that returns the pointer).

It would become very simple and no COW trickery is needed in that case either.

See my arguments in more detail below.

Thanks,

Ingo

* Ingo Molnar <[email protected]> wrote:

> * Andi Kleen <[email protected]> wrote:
>
> > > What glibc does (opening /proc/stat) is rather stupid, but i think your
> > > syscall
> >
> > I don't think it has any other choice today. [...]
>
> Sure it has a choice: most of the sysconf values are global so creating a
> permanent mmap()-able data in /tmp or elsewhere and mapping it unless it's
> inaccessible isnt that particularly hard to cache most of the system-wide
> constants, now is it?
>
> The CPU count could be updated from CPU hotplug events.
>
> rlimits can be queried using getrlimit().
>
> If at that point glibc said to us: "hey, lets work in making it even faster"
> then there would be no reason for us to criticise glibc. Right now gibc does
> not even *try* to be smart about it.
>
> > [...] So if anything is "stupid" it is the kernel for not providing efficient
> > interfaces for this.
> >
> > > Note that these are mostly constant or semi-constant values that are
> > > updated very rarely:
> >
> > That's not true. Most of them are dynamic. Take a look at the patch.
> > Also several of those have changed recently.
>
> As i said they are mostly constant or semi-constant values that are updated
> very rarely.
>
> If you think that i am wrong then do me the basic courtesy of mentioning the
> examples that you think disprove my claim, instead of broadly pointing me to
> your patch ...
>
> > > If glibc is stupid and reads /proc/stat to receive something it could cache
> > > or mmap() itself then hey, did you consider fixing glibc or creating a sane
> > > libc?
> >
> > Caching doesn't help when you have a workload that exec()s a lot. Also some
> > of these values can change at runtime.
>
> Here you modify your claim. Now it's not 'dynamic' anymore but 'can change'?
>
> Which one is it now, "mostly constant or semi-constant values that are updated
> very rarely" as i claim, "dynamic" as you first claimed or "can change" as you
> claim here (which is also pretty ambiguous)?
>
> > > If we *really* want to offer kernel help for these values even then your
> > > solution is still wrong: then the proper solution would be to define a
> > > standard *data* structure and map it as a vsyscall *data* page -
> > > essentially a kernel-guaranteed data mmap(), with no extra syscall needed!
> >
> > That's quite complicted because several of those are dynamically computed
> > based on other values. Sometimes they are also not tied to the mm_struct --
> > like the vsyscall is. For example some of the rlimits are per task, not VM.
> > Basically your proposal doesn't interact well with clone().
>
> Threads with different rlimits but shared VM are extreme rarities.
>
> Could we please concentrate on the common case? A '-1' in the data page can let
> the code fall back to some slow path.
>
> Also note that rlimit itself already has an interface to query them:
> getrlimit(). So if you do not want the complexity of caching rlimits in the
> data page you do not have to start with that complexity.
>
> [ But it can be done: modifying the rlimit (which predominantly only happens in
> the login process) is rare and happens in a parent task. ]
>
> > Even if we ignored that semantic problem it would need another writable page
> > per task because the values cannot be shared.
>
> Sure most of the values can be shared.
>
> Most of them are exactly one of a low number of variants for all tasks in the
> system, for typical Linux bootups. I suspect if the data page was COW-ed but
> inherited across exec() it would behave exactly the way it should be: inherited
> by all tasks but privatized if a task modifies it for itself and all children.
>
> Also, the first step could be simplified by not exposing rlimits - as rlimits
> are already exposed via other channels.
>
> > Also I never liked the idea of having more writable pages per task, [...]
>
> If you limit it to your own faulty implementation then i'm not surprised that
> you do not like it.
>
> > [...] It increases the memory footprint of a single process more. Given a 4K
> > page is not a lot, but lots of 4K pages add up. Some workloads like to have
> > lots of small processes and I think that's a valuable use case Linux should
> > stay lean and efficient at.
> >
> > [OK in theory one could do COW for the page and share it but that would get
> > really complicated]
>
> Why would it get complicated? It's not writable to user-space, that's all that
> is special about it.
>
> > I also don't think it's THAT performance critical to justify the vsyscall.
>
> You apparently did not understand the gist of my point: it's the CONCEPT of
> adding a crappy catch-all sysconf() interface that sucks IMHO. It's simply bad
> taste.
>
> If you want to expose data then expose the data intelligently, not some poor
> system call interface that is also slower.
>
> > The simple syscall is already orders of magnitude faster than /proc, and
> > seems to solve the performance problems we've seen completely.
>
> A handful of horse manure is less stinky than a big pile of it, still i wouldnt
> want to eat either.
>
> > It's also simple and straight forward and simple to userstand and maintain. I
> > doubt any of that would apply to a vsyscall solution.
>
> Note: i did not suggest a vsyscall, but a vsyscall *data area*. There's a big
> difference between the two!
>
> It could be passed down to user-space using a new auxiliary entry (see
> fs/binfmt_elf.c), as it's really part of a task's environment conceptually.
>
> > I don't think the additional effort for a vsyscall would be worth it at this
> > point, unless there's some concrete example that would justify it. Even then
> > it wouldn't work for some of the values.
> >
> > Also a vsyscall doesn't help on non x86 anyways.
>
> There's nothing x86 about aux entries.
>
> > As for efficiency: I thought about doing a batched interface where
> > the user could pass in an array of values to fill in. But at least for the
> > workloads I looked at the application usually calls sysconf() where
> > the array size would be always 1. And that's the standard interface.
> > This might be still worth doing, but I would like to see a concrete
> > use case first.
> >
> > > That could have other uses as well in the future.
> >
> > Hmm for what?
>
> *Iff* the concensus is that we are fine with a new page per task/thread then we
> could use it for all sorts of nifty things like the current CPU id for example.
>
> > Note we already have a fast mechanism to pass some thing to glibc in the aux
> > vector.
>
> So when you got so far in your reply why did you not delete your above (false)
> outrage about the vsyscall, which i never suggested and which you thus forced
> me to reply to?
>
> > > That way it would much faster than your current code btw.
> > >
> > > So unless there are some compelling arguments in favor of sys_sysconf()
> > > that i missed, i have to NAK this:
> >
> > Well see above for lots of reasons you missed. They are understandable
> > mistakes for someone who first looks at the problem though. I'll attempt to
> > improve the documentation next time.
>
> I don't think your condescending tone towards me is warranted or fair, i
> offered a fair technical criticism of your patch series. Why are you
> attacking me like this?
>
> Thanks,
>
> Ingo

--
Thanks,

Ingo

2011-05-17 12:33:29

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH 4/5] Add a sysconf syscall

On Mon, May 16, 2011 at 6:01 PM, Andi Kleen <[email protected]> wrote:
>> ...and libc will start making many such calls in a row in order to retrieve
>> a dozen of such values.
>
> It doesn't because the user interface is sysconf(). So the user program
> just asks for it piece by piece.

Not if libc is caching known-to-be-constant-until-reboot data on the first call.

>> It's rather inefficient to return just one word.
>> Try to return more data per call.
>
> I considered that, but is there a concrete use case?
> I didn't want to code it up without concrete use case.

Look at recent history of having to add more syscalls
(such as signalfd/signalfd4) only because we didn't think through
how they will need to be extended.

It's likely you will need to return more than one long word
for some data, eventually. Therefore, better add struct now
than needing to add horrible hacks later in order to be able
to return an uint64_t value.

--
vda

2011-05-24 01:46:26

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

On Fri, May 13, 2011 at 22:23, Mike Frysinger wrote:
> On Fri, May 13, 2011 at 19:24, Andi Kleen wrote:
>>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>>  arch/blackfin/mach-common/entry.S      |    1 +
>
> i wish more people would wire up new syscalls for arches ;)
>
> i dont recall off the top of my head acking any other pending new
> syscall which would cause a conflict, so:
> Acked-by: Mike Frysinger <[email protected]>
>
> i generally wait for syscalls to hit mainline before i wire them up myself

i'm wiring up sendmmsg on Blackfin systems and will be pushing in a
day or two, so this will need a little tweaking after that hits
mainline
-mike

2011-05-26 18:04:32

by Mike Frysinger

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

On Mon, May 23, 2011 at 21:46, Mike Frysinger wrote:
> On Fri, May 13, 2011 at 22:23, Mike Frysinger wrote:
>> On Fri, May 13, 2011 at 19:24, Andi Kleen wrote:
>>>  arch/blackfin/include/asm/unistd.h     |    3 ++-
>>>  arch/blackfin/mach-common/entry.S      |    1 +
>>
>> i wish more people would wire up new syscalls for arches ;)
>>
>> i dont recall off the top of my head acking any other pending new
>> syscall which would cause a conflict, so:
>> Acked-by: Mike Frysinger <[email protected]>
>>
>> i generally wait for syscalls to hit mainline before i wire them up myself
>
> i'm wiring up sendmmsg on Blackfin systems and will be pushing in a
> day or two, so this will need a little tweaking after that hits
> mainline

ok, i lied about that. i kept the patch local so that when your merge
goes through, there wont be any conflicts. you are still planning on
pushing this for 2.6.40-rc1 right ? if not, i'll just send out my
sendmmsg wiring now.
-mike

2011-05-26 18:46:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 5/5] Hook up sysconf syscall for all architectures

> ok, i lied about that. i kept the patch local so that when your merge
> goes through, there wont be any conflicts. you are still planning on
> pushing this for 2.6.40-rc1 right ? if not, i'll just send out my
> sendmmsg wiring now.

I don't plan to push it for .40 currently.

-Andi
--
[email protected] -- Speaking for myself only