LinuxLists.cc - [patch 02/12] Immediate Values - Architecture Independent Code

2009-09-24 14:44:01

Subject: [patch 02/12] Immediate Values - Architecture Independent Code

Immediate values are used as read mostly variables that are rarely updated. They
use code patching to modify the values inscribed in the instruction stream. It
provides a way to save precious cache lines that would otherwise have to be used
by these variables.

There is a generic _imv_read() version, which uses standard global
variables, and optimized per architecture imv_read() implementations,
which use a load immediate to remove a data cache hit. When the immediate values
functionnality is disabled in the kernel, it falls back to global variables.

It adds a new rodata section "__imv" to place the pointers to the enable
value. Immediate values activation functions sits in kernel/immediate.c.

Immediate values refer to the memory address of a previously declared integer.
This integer holds the information about the state of the immediate values
associated, and must be accessed through the API found in linux/immediate.h.

At module load time, each immediate value is checked to see if it must be
enabled. It would be the case if the variable they refer to is exported from
another module and already enabled.

In the early stages of start_kernel(), the immediate values are updated to
reflect the state of the variable they refer to.

* Why should this be merged *

It improves performances on heavy memory I/O workloads.

An interesting result shows the potential this infrastructure has by
showing the slowdown a simple system call such as getppid() suffers when it is
used under heavy user-space cache trashing:

Random walk L1 and L2 trashing surrounding a getppid() call:
(note: in this test, do_syscal_trace was taken at each system call, see
Documentation/immediate.txt in these patches for details)
- No memory pressure : getppid() takes 1573 cycles
- With memory pressure : getppid() takes 15589 cycles

We therefore have a slowdown of 10 times just to get the kernel variables from
memory. Another test on the same architecture (Intel P4) measured the memory
latency to be 559 cycles. Therefore, each cache line removed from the hot path
would improve the syscall time of 3.5% in these conditions.

Changelog:

- section __imv is already SHF_ALLOC
- Because of the wonders of ELF, section 0 has sh_addr and sh_size 0. So
the if (immediateindex) is unnecessary here.
- Remove module_mutex usage: depend on functions implemented in module.c for
that.
- Does not update tainted module's immediate values.
- remove imv_*_t types, add DECLARE_IMV() and DEFINE_IMV().
- imv_read(&var) becomes imv_read(var) because of this.
- Adding a new EXPORT_IMV_SYMBOL(_GPL).
- remove imv_if(). Should use if (unlikely(imv_read(var))) instead.
- Wait until we have gcc support before we add the imv_if macro, since
its form may have to change.
- Dont't declare the __imv section in vmlinux.lds.h, just put the content
in the rodata section.
- Simplify interface : remove imv_set_early, keep track of kernel boot
status internally.
- Remove the ALIGN(8) before the __imv section. It is packed now.
- Uses an IPI busy-loop on each CPU with interrupts disabled as a simple,
architecture agnostic, update mechanism.
- Use imv_* instead of immediate_*.
- Updating immediate values, cannot rely on smp_call_function() b/c
synchronizing cpus using IPIs leads to deadlocks. Process A held a read lock
on tasklist_lock, then process B called apply_imv_update(). Process A received
the IPI and begins executing ipi_busy_loop(). Then process C takes a write
lock irq on the task list lock, before receiving the IPI. Thus, process A
holds up process C, and C can't get an IPI b/c interrupts are disabled. Solve
this problem by using a new 'ALL_CPUS' parameter to stop_machine_run(). Which
runs a function on all cpus after they are busy looping and have disabled
irqs. Since this is done in a new process context, we don't have to worry
about interrupted spin_locks. Also, less lines of code. Has survived 24 hours+
of testing...
- Folded "fix use immediate" patch.

Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Jason Baron <[email protected]>
CC: Rusty Russell <[email protected]>
CC: Adrian Bunk <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Christoph Hellwig <[email protected]>
CC: [email protected]
CC: [email protected]
---
include/asm-generic/vmlinux.lds.h | 3
include/linux/immediate.h | 94 +++++++++++++++++++
include/linux/module.h | 16 +++
init/main.c | 7 +
kernel/Makefile | 1
kernel/immediate.c | 183 ++++++++++++++++++++++++++++++++++++++
kernel/module.c | 41 ++++++++
7 files changed, 345 insertions(+)

Index: linux.trees.git/include/linux/module.h
===================================================================
--- linux.trees.git.orig/include/linux/module.h 2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/include/linux/module.h 2009-09-24 08:59:42.000000000 -0400
@@ -16,6 +16,7 @@
#include <linux/kobject.h>
#include <linux/moduleparam.h>
#include <linux/tracepoint.h>
+#include <linux/immediate.h>

#include <asm/local.h>
#include <asm/module.h>
@@ -330,6 +331,10 @@ struct module
struct tracepoint *tracepoints;
unsigned int num_tracepoints;
#endif
+#ifdef CONFIG_IMMEDIATE
+ const struct __imv *immediate;
+ unsigned int num_immediate;
+#endif

#ifdef CONFIG_TRACING
const char **trace_bprintk_fmt_start;
@@ -533,6 +538,9 @@ extern void print_modules(void);
extern void module_update_tracepoints(void);
extern int module_get_iter_tracepoints(struct tracepoint_iter *iter);

+extern void _module_imv_update(void);
+extern void module_imv_update(void);
+
#else /* !CONFIG_MODULES... */
#define EXPORT_SYMBOL(sym)
#define EXPORT_SYMBOL_GPL(sym)
@@ -653,6 +661,14 @@ static inline int module_get_iter_tracep
return 0;
}

+static inline void _module_imv_update(void)
+{
+}
+
+static inline void module_imv_update(void)
+{
+}
+
#endif /* CONFIG_MODULES */

struct device_driver;
Index: linux.trees.git/kernel/module.c
===================================================================
--- linux.trees.git.orig/kernel/module.c 2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/module.c 2009-09-24 08:58:28.000000000 -0400
@@ -36,6 +36,7 @@
#include <linux/cpu.h>
#include <linux/moduleparam.h>
#include <linux/errno.h>
+#include <linux/immediate.h>
#include <linux/err.h>
#include <linux/vermagic.h>
#include <linux/notifier.h>
@@ -2236,6 +2237,11 @@ static noinline struct module *load_modu
mod->ctors = section_objs(hdr, sechdrs, secstrings, ".ctors",
sizeof(*mod->ctors), &mod->num_ctors);
#endif
+#ifdef CONFIG_IMMEDIATE
+ mod->immediate = section_objs(hdr, sechdrs, secstrings, "__imv",
+ sizeof(*mod->immediate),
+ &mod->num_immediate);
+#endif

#ifdef CONFIG_TRACEPOINTS
mod->tracepoints = section_objs(hdr, sechdrs, secstrings,
@@ -3000,3 +3006,38 @@ int module_get_iter_tracepoints(struct t
return found;
}
#endif
+
+#ifdef CONFIG_IMMEDIATE
+/**
+ * _module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Module_mutex must be held be the caller.
+ */
+void _module_imv_update(void)
+{
+ struct module *mod;
+
+ list_for_each_entry(mod, &modules, list) {
+ if (mod->taints)
+ continue;
+ imv_update_range(mod->immediate,
+ mod->immediate + mod->num_immediate);
+ }
+}
+EXPORT_SYMBOL_GPL(_module_imv_update);
+
+/**
+ * module_imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ * Takes module_mutex.
+ */
+void module_imv_update(void)
+{
+ mutex_lock(&module_mutex);
+ _module_imv_update();
+ mutex_unlock(&module_mutex);
+}
+EXPORT_SYMBOL_GPL(module_imv_update);
+#endif
Index: linux.trees.git/kernel/immediate.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/kernel/immediate.c 2009-09-24 08:58:28.000000000 -0400
@@ -0,0 +1,183 @@
+/*
+ * Copyright (C) 2007 Mathieu Desnoyers
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/immediate.h>
+#include <linux/memory.h>
+#include <linux/cpu.h>
+#include <linux/stop_machine.h>
+
+#include <asm/cacheflush.h>
+#include <asm/atomic.h>
+
+/*
+ * Kernel ready to execute the SMP update that may depend on trap and ipi.
+ */
+static int imv_early_boot_complete;
+static atomic_t stop_machine_first;
+static int wrote_text;
+
+extern const struct __imv __start___imv[];
+extern const struct __imv __stop___imv[];
+
+static int stop_machine_imv_update(void *imv_ptr)
+{
+ struct __imv *imv = imv_ptr;
+
+ if (atomic_dec_and_test(&stop_machine_first)) {
+ text_poke((void *)imv->imv, (void *)imv->var, imv->size);
+ smp_wmb(); /* make sure other cpus see that this has run */
+ wrote_text = 1;
+ } else {
+ while (!wrote_text)
+ smp_rmb();
+ sync_core();
+ }
+
+ flush_icache_range(imv->imv, imv->imv + imv->size);
+
+ return 0;
+}
+
+/*
+ * imv_mutex nests inside module_mutex. imv_mutex protects builtin
+ * immediates and module immediates.
+ */
+static DEFINE_MUTEX(imv_mutex);
+
+/**
+ * apply_imv_update - update one immediate value
+ * @imv: pointer of type const struct __imv to update
+ *
+ * Update one immediate value. Must be called with imv_mutex held.
+ * It makes sure all CPUs are not executing the modified code by having them
+ * busy looping with interrupts disabled.
+ * It does _not_ protect against NMI and MCE (could be a problem with Intel's
+ * errata if we use immediate values in their code path).
+ */
+static int apply_imv_update(const struct __imv *imv)
+{
+ /*
+ * If the variable and the instruction have the same value, there is
+ * nothing to do.
+ */
+ switch (imv->size) {
+ case 1:
+ if (*(uint8_t *)imv->imv == *(uint8_t *)imv->var)
+ return 0;
+ break;
+ case 2:
+ if (*(uint16_t *)imv->imv == *(uint16_t *)imv->var)
+ return 0;
+ break;
+ case 4:
+ if (*(uint32_t *)imv->imv == *(uint32_t *)imv->var)
+ return 0;
+ break;
+ case 8:
+ if (*(uint64_t *)imv->imv == *(uint64_t *)imv->var)
+ return 0;
+ break;
+ default:
+ return -EINVAL;
+ }
+
+ if (imv_early_boot_complete) {
+ mutex_lock(&text_mutex);
+ atomic_set(&stop_machine_first, 1);
+ wrote_text = 0;
+ stop_machine(stop_machine_imv_update, (void *)imv, NULL);
+ mutex_unlock(&text_mutex);
+ } else
+ text_poke_early((void *)imv->imv, (void *)imv->var,
+ imv->size);
+ return 0;
+}
+
+/**
+ * imv_update_range - Update immediate values in a range
+ * @begin: pointer to the beginning of the range
+ * @end: pointer to the end of the range
+ *
+ * Updates a range of immediates.
+ */
+void imv_update_range(const struct __imv *begin,
+ const struct __imv *end)
+{
+ const struct __imv *iter;
+ int ret;
+
+ mutex_lock(&imv_mutex);
+ for (iter = begin; iter < end; iter++) {
+ ret = apply_imv_update(iter);
+ if (imv_early_boot_complete && ret)
+ printk(KERN_WARNING
+ "Invalid immediate value. "
+ "Variable at %p, "
+ "instruction at %p, size %hu\n",
+ (void *)iter->imv,
+ (void *)iter->var, iter->size);
+ }
+ mutex_unlock(&imv_mutex);
+}
+EXPORT_SYMBOL_GPL(imv_update_range);
+
+/**
+ * imv_update - update all immediate values in the kernel
+ *
+ * Iterate on the kernel core and modules to update the immediate values.
+ */
+void core_imv_update(void)
+{
+ /* Core kernel imvs */
+ imv_update_range(__start___imv, __stop___imv);
+}
+EXPORT_SYMBOL_GPL(core_imv_update);
+
+void __init imv_init_complete(void)
+{
+ imv_early_boot_complete = 1;
+}
+
+int imv_module_notify(struct notifier_block *self,
+ unsigned long val, void *data)
+{
+ struct module *mod = data;
+
+ switch (val) {
+ case MODULE_STATE_COMING:
+ imv_update_range(mod->immediate,
+ mod->immediate + mod->num_immediate);
+ break;
+ case MODULE_STATE_GOING:
+ /* All references will be gone, no update required. */
+ break;
+ }
+ return 0;
+}
+
+struct notifier_block imv_module_nb = {
+ .notifier_call = imv_module_notify,
+ .priority = 0,
+};
+
+static int init_imv(void)
+{
+ return register_module_notifier(&imv_module_nb);
+}
+__initcall(init_imv);
Index: linux.trees.git/init/main.c
===================================================================
--- linux.trees.git.orig/init/main.c 2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/init/main.c 2009-09-24 08:58:28.000000000 -0400
@@ -97,6 +97,11 @@ static inline void mark_rodata_ro(void)
#ifdef CONFIG_TC
extern void tc_init(void);
#endif
+#ifdef USE_IMMEDIATE
+extern void imv_init_complete(void);
+#else
+static inline void imv_init_complete(void) { }
+#endif

enum system_states system_state __read_mostly;
EXPORT_SYMBOL(system_state);
@@ -544,6 +549,7 @@ asmlinkage void __init start_kernel(void
boot_init_stack_canary();

cgroup_init_early();
+ core_imv_update();

local_irq_disable();
early_boot_irqs_off();
@@ -685,6 +691,7 @@ asmlinkage void __init start_kernel(void
cpuset_init();
taskstats_init_early();
delayacct_init();
+ imv_init_complete();

check_bugs();

Index: linux.trees.git/include/asm-generic/vmlinux.lds.h
===================================================================
--- linux.trees.git.orig/include/asm-generic/vmlinux.lds.h 2009-09-24 08:52:54.000000000 -0400
+++ linux.trees.git/include/asm-generic/vmlinux.lds.h 2009-09-24 08:58:28.000000000 -0400
@@ -202,6 +202,9 @@
*(__vermagic) /* Kernel version magic */ \
*(__markers_strings) /* Markers: strings */ \
*(__tracepoints_strings)/* Tracepoints: strings */ \
+ VMLINUX_SYMBOL(__start___imv) = .; \
+ *(__imv) /* Immediate values: pointers */ \
+ VMLINUX_SYMBOL(__stop___imv) = .; \
} \
\
.rodata1 : AT(ADDR(.rodata1) - LOAD_OFFSET) { \
Index: linux.trees.git/include/linux/immediate.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux.trees.git/include/linux/immediate.h 2009-09-24 08:58:28.000000000 -0400
@@ -0,0 +1,94 @@
+#ifndef _LINUX_IMMEDIATE_H
+#define _LINUX_IMMEDIATE_H
+
+/*
+ * Immediate values, can be updated at runtime and save cache lines.
+ *
+ * (C) Copyright 2007 Mathieu Desnoyers <[email protected]>
+ *
+ * This file is released under the GPLv2.
+ * See the file COPYING for more details.
+ */
+
+#ifdef CONFIG_IMMEDIATE
+
+struct __imv {
+ unsigned long var; /* Pointer to the identifier variable of the
+ * immediate value
+ */
+ unsigned long imv; /*
+ * Pointer to the memory location of the
+ * immediate value within the instruction.
+ */
+ unsigned char size; /* Type size. */
+} __attribute__ ((packed));
+
+#include <asm/immediate.h>
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i) \
+ do { \
+ name##__imv = (i); \
+ core_imv_update(); \
+ module_imv_update(); \
+ } while (0)
+
+/*
+ * Internal update functions.
+ */
+extern void core_imv_update(void);
+extern void imv_update_range(const struct __imv *begin,
+ const struct __imv *end);
+
+#else
+
+/*
+ * Generic immediate values: a simple, standard, memory load.
+ */
+
+/**
+ * imv_read - read immediate variable
+ * @name: immediate value name
+ *
+ * Reads the value of @name.
+ */
+#define imv_read(name) _imv_read(name)
+
+/**
+ * imv_set - set immediate variable (with locking)
+ * @name: immediate value name
+ * @i: required value
+ *
+ * Sets the value of @name, taking the module_mutex if required by
+ * the architecture.
+ */
+#define imv_set(name, i) (name##__imv = (i))
+
+static inline void core_imv_update(void) { }
+static inline void module_imv_update(void) { }
+
+#endif
+
+#define DECLARE_IMV(type, name) extern __typeof__(type) name##__imv
+#define DEFINE_IMV(type, name) __typeof__(type) name##__imv
+
+#define EXPORT_IMV_SYMBOL(name) EXPORT_SYMBOL(name##__imv)
+#define EXPORT_IMV_SYMBOL_GPL(name) EXPORT_SYMBOL_GPL(name##__imv)
+
+/**
+ * _imv_read - Read immediate value with standard memory load.
+ * @name: immediate value name
+ *
+ * Force a data read of the immediate value instead of the immediate value
+ * based mechanism. Useful for __init and __exit section data read.
+ */
+#define _imv_read(name) (name##__imv)
+
+#endif
Index: linux.trees.git/kernel/Makefile
===================================================================
--- linux.trees.git.orig/kernel/Makefile 2009-09-24 08:52:55.000000000 -0400
+++ linux.trees.git/kernel/Makefile 2009-09-24 09:00:15.000000000 -0400
@@ -88,6 +88,7 @@ obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
+obj-$(CONFIG_IMMEDIATE) += immediate.o
obj-$(CONFIG_LATENCYTOP) += latencytop.o
obj-$(CONFIG_FUNCTION_TRACER) += trace/
obj-$(CONFIG_TRACING) += trace/

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-09-25 04:21:23

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <[email protected]> wrote:

> Immediate values are used as read mostly variables that are rarely updated. They
> use code patching to modify the values inscribed in the instruction stream. It
> provides a way to save precious cache lines that would otherwise have to be used
> by these variables.

What a hare-brained concept.

> * Why should this be merged *
>
> It improves performances on heavy memory I/O workloads.
>
> An interesting result shows the potential this infrastructure has by
> showing the slowdown a simple system call such as getppid() suffers when it is
> used under heavy user-space cache trashing:
>
> Random walk L1 and L2 trashing surrounding a getppid() call:
> (note: in this test, do_syscal_trace was taken at each system call, see
> Documentation/immediate.txt in these patches for details)
> - No memory pressure : getppid() takes 1573 cycles
> - With memory pressure : getppid() takes 15589 cycles

Our ideas of what constitutes an "interesting result" differ.

Do you have any data which indicates that this thing is of any real
benefit to anyone for anything?

2009-09-27 23:23:45

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

* Andrew Morton ([email protected]) wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <[email protected]> wrote:
>
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
>
> What a hare-brained concept.
>

Hi Andrew,

Improving performance by specializing the implementation has been
studied thoroughly by many in the past, especially for JIT compilers.
What I am proposing here is merely a very specific use of the concept,
applied to read-often variables.

> > * Why should this be merged *
> >
> > It improves performances on heavy memory I/O workloads.
> >
> > An interesting result shows the potential this infrastructure has by
> > showing the slowdown a simple system call such as getppid() suffers when it is
> > used under heavy user-space cache trashing:
> >
> > Random walk L1 and L2 trashing surrounding a getppid() call:
> > (note: in this test, do_syscal_trace was taken at each system call, see
> > Documentation/immediate.txt in these patches for details)
> > - No memory pressure : getppid() takes 1573 cycles
> > - With memory pressure : getppid() takes 15589 cycles
>
> Our ideas of what constitutes an "interesting result" differ.
>
> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

Yep. See the benchmarks I just ran below.

Immediate Values Benchmarks

Kernel 2.6.31-tip
8-core Xeon, 2.0Ghz, E5405
gcc version 4.3.2 (Debian 4.3.2-1.1)

Test workload: build the Linux kernel tree, cache-hot, make -j10

Executive result summary:

In these tests, each system call has an added workload, which is to read a fixed
number of integers from randomly chosen cache lines within an array and perform
a branch. The implementation is added to ptrace.c. The baseline is an unmodified
kernel.

* Baseline: sys 0m57.63s

* 4096 integer reads, random locations sys 2m21.781s
* 4096 integer reads, immediate values sys 1m44.695s

* 128 integer reads, random locations sys 0m59.348s
* 128 integer reads, immediate values sys 0m58.640s

* 32 integer reads, random locations sys 0m58.68s
* 32 integer reads, immediate values sys 0m57.60s

These numbers show that by turning read-often data accesses into immediate
values, we can speed up the kernel.

Binary size results:

* 4096 integer reads, random locations
text data bss dec hex filename
66079 648 262156 328883 504b3 arch/x86/kernel/ptrace.o

* 4096 integer reads, immediate values
text data bss dec hex filename
66079 74412 262156 402647 624d7 arch/x86/kernel/ptrace.o

As we notice, the size of text is the same, same for bss, but the data size
increases with immediate values. The section headers confirms that this extra
data is put in the __imv section, which is only accessed when immediate value
updates are performed.

So the tradeoff is: immediate values use more cache-cold space to increase
speed.

Therefore, if we can turn a significant amount of fast-path read-often variables
into immediate values, this should lead to a performance gain. Also,
given we can expect the fastpath cache-line footprint to grow with the
next kernel releases (this has been a trend I've seen a lot of people
complaining about), immediate values should help minimizing this by
removing the d-cache hit from such read-often variables, leaving a
i-cache hit within a mostly sequential instruction stream.

A quick look at the vmlinux section headers:

vmlinux: file format elf64-x86-64

Sections:
Idx Name Size VMA LMA File off Algn
13 .data.read_mostly 00002df0 ffffffff80859440 0000000000859440 00859440 2**6
CONTENTS, ALLOC, LOAD, DATA

Shows that we have about 11.48kB of read mostly data in the kernel image
which could be turned into immediate values. This is without counting
the modules. If only a portion of this data is not only read mostly, but
also read often, then we will see a clear performance improvement.

Thanks,

Mathieu

Detailed test results follow.
----------------------------------------

* Baseline:

# size of kernel original ptrace.o

text data bss dec hex filename
12863 648 8 13519 34cf arch/x86/kernel/ptrace.o

# time make -j10

real 1m25.358s
user 9m7.506s
sys 0m57.856s

real 1m21.580s
user 9m7.362s
sys 0m57.212s

real 1m21.361s
user 9m6.358s
sys 0m57.824s

* 4096 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# size of modified ptrace.o

text data bss dec hex filename
66079 648 262156 328883 504b3 arch/x86/kernel/ptrace.o

# section headers

arch/x86/kernel/ptrace.o: file format elf64-x86-64

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000f4e8 0000000000000000 0000000000000000 00000040 2**4
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
1 .data 000000a8 0000000000000000 0000000000000000 0000f540 2**5
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 0004000c 0000000000000000 0000000000000000 0000f600 2**5
ALLOC
3 .rodata 00000988 0000000000000000 0000000000000000 0000f600 2**5
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
4 .fixup 0000005b 0000000000000000 0000000000000000 0000ff88 2**0
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
5 __ex_table 00000090 0000000000000000 0000000000000000 0000ffe8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
6 .smp_locks 00000028 0000000000000000 0000000000000000 00010078 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
7 .rodata.str1.8 000001f2 0000000000000000 0000000000000000 000100a0 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
8 .rodata.str1.1 00000097 0000000000000000 0000000000000000 00010292 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
9 __tracepoints 00000080 0000000000000000 0000000000000000 00010340 2**5
CONTENTS, ALLOC, LOAD, RELOC, DATA
10 _ftrace_events 00000160 0000000000000000 0000000000000000 000103c0 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
11 __tracepoints_strings 00000013 0000000000000000 0000000000000000 00010520 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
12 .comment 0000001f 0000000000000000 0000000000000000 00010533 2**0
CONTENTS, READONLY
13 .note.GNU-stack 00000000 0000000000000000 0000000000000000 00010552 2**0
CONTENTS, READONLY

# pattern

820: 83 3d 00 00 00 00 01 cmpl $0x1,0x0(%rip) # 827 <test_pollute_cache+0x7>
827: 0f 84 cb cf 00 00 je d7f8 <test_pollute_cache+0xcfd8>

# time make -j10

real 1m36.075s
user 9m15.163s
sys 2m21.781s

* 4096 imv read per system call
(CONFIG_IMMEDIATE=y)

# size of modified ptrace.o

text data bss dec hex filename
66079 74412 262156 402647 624d7 arch/x86/kernel/ptrace.o

(note: data is larger due to __imv table, which is used only for updates)

# section headers

arch/x86/kernel/ptrace.o: file format elf64-x86-64

Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000f4e8 0000000000000000 0000000000000000 00000040 2**4
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
1 .data 000000a8 0000000000000000 0000000000000000 0000f540 2**5
CONTENTS, ALLOC, LOAD, RELOC, DATA
2 .bss 0004000c 0000000000000000 0000000000000000 0000f600 2**5
ALLOC
3 .rodata 00000988 0000000000000000 0000000000000000 0000f600 2**5
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
4 .fixup 0000005b 0000000000000000 0000000000000000 0000ff88 2**0
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
5 __ex_table 00000090 0000000000000000 0000000000000000 0000ffe8 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
6 .smp_locks 00000028 0000000000000000 0000000000000000 00010078 2**3
CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
7 __discard 00005004 0000000000000000 0000000000000000 000100a0 2**0
CONTENTS, READONLY
8 __imv 00012024 0000000000000000 0000000000000000 000150a4 2**0
CONTENTS, ALLOC, LOAD, RELOC, DATA
9 .rodata.str1.8 000001f2 0000000000000000 0000000000000000 000270c8 2**3
CONTENTS, ALLOC, LOAD, READONLY, DATA
10 .rodata.str1.1 00000097 0000000000000000 0000000000000000 000272ba 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
11 __tracepoints 00000080 0000000000000000 0000000000000000 00027360 2**5
CONTENTS, ALLOC, LOAD, RELOC, DATA
12 _ftrace_events 00000160 0000000000000000 0000000000000000 000273e0 2**3
CONTENTS, ALLOC, LOAD, RELOC, DATA
13 __tracepoints_strings 00000013 0000000000000000 0000000000000000 00027540 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
14 .comment 0000001f 0000000000000000 0000000000000000 00027553 2**0
CONTENTS, READONLY
15 .note.GNU-stack 00000000 0000000000000000 0000000000000000 00027572 2**0
CONTENTS, READONLY

# pattern

820: b8 00 00 00 00 mov $0x0,%eax
825: ff c8 dec %eax
827: 0f 84 d3 cf 00 00 je d800 <test_pollute_cache+0xcfe0>

# time make -j10

real 1m30.688s
user 9m7.770s
sys 1m44.695s

* 128 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# time make -j10

real 1m27.801s
user 9m12.447s
sys 0m59.348s

* 128 imv read per system call
(CONFIG_IMMEDIATE=y)

# time make -j10

real 1m22.454s
user 9m5.822s
sys 0m58.640s

* 32 cache lines read per system call (random cache lines)
(CONFIG_IMMEDIATE=n)

# time make -j10

real 1m21.539s
user 9m6.946s
sys 0m57.888s

real 1m26.789s
user 9m11.606s
sys 0m59.392s

real 1m29.461s
user 9m12.195s
sys 0m58.768s

avg sys: 58.68s

* 32 imv read per system call
(CONFIG_IMMEDIATE=y)

# time make -j10

real 1m21.844s
user 9m7.278s
sys 0m57.648s

real 1m22.123s
user 9m6.850s
sys 0m56.848s

real 1m24.589s
user 9m5.674s
sys 0m58.328s

avg sys: 57.60s

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-09-28 01:23:36

by Andi Kleen

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Thu, Sep 24, 2009 at 09:20:13PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <[email protected]> wrote:
>
> > Immediate values are used as read mostly variables that are rarely updated. They
> > use code patching to modify the values inscribed in the instruction stream. It
> > provides a way to save precious cache lines that would otherwise have to be used
> > by these variables.
>
> What a hare-brained concept.

The concept makes a lot of sense. Cache misses are extremly costly
on modern CPUs and when the workload has blown the caches away in user space
it can literally be hundreds or even thousands of cycles to fetch
a data cache line.

Similar optimizations are also quite common in compilers (with
profile feedback) and JITs.

> Do you have any data which indicates that this thing is of any real
> benefit to anyone for anything?

There's a lot of data around that the kernel has very little IPC
due to a lot of cache misses in some workloads. In general this applies
to anything that touches a lot of data in user space and blows
the caches away.

-Andi

--
[email protected] -- Speaking for myself only.

2009-09-28 17:47:59

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Mon, 28 Sep 2009 03:23:37 +0200 Andi Kleen <[email protected]> wrote:

> On Thu, Sep 24, 2009 at 09:20:13PM -0700, Andrew Morton wrote:
> > On Thu, 24 Sep 2009 09:26:28 -0400 Mathieu Desnoyers <[email protected]> wrote:
> >
> > > Immediate values are used as read mostly variables that are rarely updated. They
> > > use code patching to modify the values inscribed in the instruction stream. It
> > > provides a way to save precious cache lines that would otherwise have to be used
> > > by these variables.
> >
> > What a hare-brained concept.
>
> The concept makes a lot of sense.

But does it?

> Cache misses are extremly costly
> on modern CPUs and when the workload has blown the caches away in user space
> it can literally be hundreds or even thousands of cycles to fetch
> a data cache line.

Well yes. But for a kernel dcache entry to have been replaced by a
userspace one, userspace will, on average, have itself incurred a *lot*
of dcache misses. So we just spent a lot of CPU cycles in userspace,
so the cost of the in-kernel dcache miss is relatively small.

That's how caches work! If a kernel variable is read frequently, it's
still in dcache. If it's read infrequently, it falls out of dcache but
that doesn't matter much because it's read infrequently!

And lo, it appears that we're unable to observe any measurable benefit
from the changes, so we're cooking up weird fake testcases to be able to
drag this thing out of the noise floor.

Obviously the change will have _some_ performance benefit. But is it
enough to justify the addition of yet more tricksy code to maintain?
That's a very different question.

> There's a lot of data around that the kernel has very little IPC
> due to a lot of cache misses in some workloads.

Kernel gets a lot of cache misses, but that's usually against
userspace, pagecache, net headers/data, etc. I doubt if it gets many
misses against a small number of small, read-mostly data items which is
what this patch addresses.

And it is a *small* number of things to which this change is
applicable. This is because the write operation for these read-mostly
variables becomes very expensive indeed. This means that we cannot use
"immediate values" for any variable which can conceivable be modified
at high frequency by any workload.

For example, how do we know it's safe to use immediate-values for
anything which can be modified from userspace, such as a sysfs-accessed
tunable? How do we know this won't take someone's odd-but-legitimate
workload and shoot it in the head?

Summary:

- at this stage no real-world beenefit has been demonstrated afaict

- the feature is narrowly applicable anyway

- it addes complexity and maintenance cost

2009-09-28 18:03:33

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Mon, 28 Sep 2009 10:46:17 -0700
Andrew Morton <[email protected]> wrote:
>
> Kernel gets a lot of cache misses, but that's usually against
> userspace, pagecache, net headers/data, etc. I doubt if it gets many
> misses against a small number of small, read-mostly data items which
> is what this patch addresses.
>
> And it is a *small* number of things to which this change is
> applicable. This is because the write operation for these read-mostly
> variables becomes very expensive indeed. This means that we cannot
> use "immediate values" for any variable which can conceivable be
> modified at high frequency by any workload.

btw just to add to this:
caches are unified code/data after L1 in general... it then does not
matter much if you encode the "almost constant" in the codestream or
slightly farther away, in both cases it takes up cache space.
(you can argue "but in the data case it might pull in a whole cacheline
just for this".. but that's a case for us to pack such read mostly
things properly)

And for L1.. well.. the L2 latency is not THAT much bigger. And L1 is
tiny. more icache pressure hurts just as much as having more dcache
pressure there.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-28 18:45:17

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

* Arjan van de Ven ([email protected]) wrote:
> On Mon, 28 Sep 2009 10:46:17 -0700
> Andrew Morton <[email protected]> wrote:
> >
> > Kernel gets a lot of cache misses, but that's usually against
> > userspace, pagecache, net headers/data, etc. I doubt if it gets many
> > misses against a small number of small, read-mostly data items which
> > is what this patch addresses.
> >
> > And it is a *small* number of things to which this change is
> > applicable. This is because the write operation for these read-mostly
> > variables becomes very expensive indeed. This means that we cannot
> > use "immediate values" for any variable which can conceivable be
> > modified at high frequency by any workload.
>
> btw just to add to this:
> caches are unified code/data after L1 in general... it then does not
> matter much if you encode the "almost constant" in the codestream or
> slightly farther away, in both cases it takes up cache space.

Standard read from memory will typically need to have the address of the
data to access as operand to the instruction in i-cache, plus the data
in d-cache.

Compared to this, immediate values remove the need to have a pointer in
the i-cache, so the overall footprint, even for L2 cache, is lower.

> (you can argue "but in the data case it might pull in a whole cacheline
> just for this".. but that's a case for us to pack such read mostly
> things properly)
>
> And for L1.. well.. the L2 latency is not THAT much bigger. And L1 is
> tiny. more icache pressure hurts just as much as having more dcache
> pressure there.

Immediate values does not add i-cache pressure. They just remove d-cache
pressure. So it saves L1 d-cache, and the L1 i-cache pressure stays
mostly unchanged.

Thanks,

Mathieu

>
>
> --
> Arjan van de Ven Intel Open Source Technology Centre
> For development, discussion and tips for power savings,
> visit http://www.lesswatts.org

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-09-28 19:54:45

by Andi Kleen

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

> btw just to add to this:
> caches are unified code/data after L1 in general... it then does not
> matter much if you encode the "almost constant" in the codestream or
> slightly farther away, in both cases it takes up cache space.

It does take up cache space, but when it's embedded in the instruction
stream the CPU has it usually already prefetched (CPUs are very good
at prefetching instructions). That's often not the case
with arbitary data accesses like this.

-Andi

--
[email protected] -- Speaking for myself only.

2009-09-28 20:11:11

by Andi Kleen

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

> That's how caches work! If a kernel variable is read frequently, it's
> still in dcache. If it's read infrequently, it falls out of dcache but
> that doesn't matter much because it's read infrequently!

You're assuming that the CPU's cache LRU is perfect. e.g. that it
never gets swamped by a lot of short term accesses. And that it has perfect
insight if something is really frequently used or not. But that's
not true. A short term cache pig, even when it uses the data only
once, can swamp it and throw out all the kernel state, even when it's
accessed frequently enough. It's a bit similar to all the similar problems
with the page cache LRU.
>
> And lo, it appears that we're unable to observe any measurable benefit
> from the changes, so we're cooking up weird fake testcases to be able to
> drag this thing out of the noise floor.

Yes, a really measurable improvement would be great. One problem
right now is that not enough users are there, so measuring
something would first need more users to really reduce the cache misses.
>
> Obviously the change will have _some_ performance benefit. But is it
> enough to justify the addition of yet more tricksy code to maintain?

I don't think the code is particularly tricky. Especially the user API
is very simple and neat.
>
> And it is a *small* number of things to which this change is
> applicable. This is because the write operation for these read-mostly
> variables becomes very expensive indeed. This means that we cannot use
> "immediate values" for any variable which can conceivable be modified
> at high frequency by any workload.

A natural target is any sysctl for example.

>
> For example, how do we know it's safe to use immediate-values for
> anything which can be modified from userspace, such as a sysfs-accessed
> tunable? How do we know this won't take someone's odd-but-legitimate
> workload and shoot it in the head?

You're arguing we should tune for sysctl performance? That doesn't make
sense to me.

>
>
> Summary:
>
> - at this stage no real-world beenefit has been demonstrated afaict

Yes that's an issue that needs to be addressed.

A good way would be probably to do some measurements on cache misses
for given workloads and then convert all applicable global references
and see how much difference it makes.

> - the feature is narrowly applicable anyway

I don't think so, there are quite a lot of global flag variables.

% find /proc/sys -type f | wc -l
565

not counting sysfs, boot options and other things.
>
> - it addes complexity and maintenance cost

Very little as far as I can see.

-Andi

--
[email protected] -- Speaking for myself only.

2009-09-28 20:37:24

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Mon, 28 Sep 2009 21:54:45 +0200
Andi Kleen <[email protected]> wrote:

> > btw just to add to this:
> > caches are unified code/data after L1 in general... it then does not
> > matter much if you encode the "almost constant" in the codestream or
> > slightly farther away, in both cases it takes up cache space.
>
> It does take up cache space, but when it's embedded in the instruction
> stream the CPU has it usually already prefetched (CPUs are very good
> at prefetching instructions). That's often not the case
> with arbitary data accesses like this.

this makes me wonder what happens when a variable is used in multiple
places... that makes the icache overhead multiply right?

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-09-28 21:17:34

by Andrew Morton

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On Mon, 28 Sep 2009 22:11:08 +0200
Andi Kleen <[email protected]> wrote:

> > For example, how do we know it's safe to use immediate-values for
> > anything which can be modified from userspace, such as a sysfs-accessed
> > tunable? How do we know this won't take someone's odd-but-legitimate
> > workload and shoot it in the head?
>
> You're arguing we should tune for sysctl performance? That doesn't make
> sense to me.

We're talking about a tiny tiny performance gain (one which thus far
appears to be unobserveable) on the read-side traded off against a
tremendous slowdown on the write-side.

That's OK for people whose workloads use the expected read-vs-write
ratio. But there's always someone out there who does something
peculiar. There will be people who simply cannot accept large
slowdowns in writes to particular tunables. Who these people are and
which tunables they care about we do not know.

No, I'm not saying we should "tune for sysctl performance". I'm saying
we should tune for not making Linux utterly uselessly slow for people
for whom it previously worked OK.

It means we'd have to look very carefully at each tunable and decide
whether there's any conceivable situation in which someone would want
to alter it frequently. If so, we need to leave it alone.

How many tunables will that leave behind, and how much use was it to
speed that remainder up by a teensy amount? Who knows.

2009-09-28 21:44:18

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

On 09/28/2009 01:37 PM, Arjan van de Ven wrote:
>
> this makes me wonder what happens when a variable is used in multiple
> places... that makes the icache overhead multiply right?
>

On x86, the icache overhead can often be zero or close to zero -- or
even negative in a fairly common subcase[1] -- simply because you are
dropping a displacement used to fetch a global variable with an
immediate in the code itself.

For 8- or 16-bit data items this is even more of a win in terms of
icache space; for 64-bit data it is always a lose.

It is also worth noting that the way this is implemented as a graft-on
rather than with compiler support means that the full instruction set
cannot exploited -- x86 can often use a memory operand or immediate as
part of an operation. This adds icache pressure.

-hpa

[1] Common subcase:

movl global, %reg ; 6 bytes (unless reg is eax on 32 bits)
movl $immed, %reg ; 5 bytes

2009-09-28 22:01:13

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

* Andrew Morton ([email protected]) wrote:
> On Mon, 28 Sep 2009 22:11:08 +0200
> Andi Kleen <[email protected]> wrote:
>
> > > For example, how do we know it's safe to use immediate-values for
> > > anything which can be modified from userspace, such as a sysfs-accessed
> > > tunable? How do we know this won't take someone's odd-but-legitimate
> > > workload and shoot it in the head?
> >
> > You're arguing we should tune for sysctl performance? That doesn't make
> > sense to me.
>
> We're talking about a tiny tiny performance gain (one which thus far
> appears to be unobserveable) on the read-side traded off against a
> tremendous slowdown on the write-side.
>
> That's OK for people whose workloads use the expected read-vs-write
> ratio. But there's always someone out there who does something
> peculiar. There will be people who simply cannot accept large
> slowdowns in writes to particular tunables. Who these people are and
> which tunables they care about we do not know.
>
> No, I'm not saying we should "tune for sysctl performance". I'm saying
> we should tune for not making Linux utterly uselessly slow for people
> for whom it previously worked OK.
>
> It means we'd have to look very carefully at each tunable and decide
> whether there's any conceivable situation in which someone would want
> to alter it frequently. If so, we need to leave it alone.
>
> How many tunables will that leave behind, and how much use was it to
> speed that remainder up by a teensy amount? Who knows.
>

BTW, when/if we get the OK from Intel to use a breakpoint/IPI-based
scheme to perform the updates rather than using the heavyweight
stop_machine(), this update performance question will be much less of a
concern.

hpa is currently looking into this.

Thanks,

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68

2009-09-28 22:10:06

by Mathieu Desnoyers

[permalink] [raw]

Subject: Re: [patch 02/12] Immediate Values - Architecture Independent Code

* H. Peter Anvin ([email protected]) wrote:
> On 09/28/2009 01:37 PM, Arjan van de Ven wrote:
> >
> > this makes me wonder what happens when a variable is used in multiple
> > places... that makes the icache overhead multiply right?
> >
>
> On x86, the icache overhead can often be zero or close to zero -- or
> even negative in a fairly common subcase[1] -- simply because you are
> dropping a displacement used to fetch a global variable with an
> immediate in the code itself.
>
> For 8- or 16-bit data items this is even more of a win in terms of
> icache space; for 64-bit data it is always a lose.
>
> It is also worth noting that the way this is implemented as a graft-on
> rather than with compiler support means that the full instruction set
> cannot exploited -- x86 can often use a memory operand or immediate as
> part of an operation. This adds icache pressure.

Indeed, these cases could make good use of compiler support to let
immediate values be added to a wider range of operations. Currently,
being limited to "mov" is somewhat limiting on x86. We could definitely
do better.

Mathieu

>
> -hpa
>
> [1] Common subcase:
>
> movl global, %reg ; 6 bytes (unless reg is eax on 32 bits)
> movl $immed, %reg ; 5 bytes
>

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68