2013-08-09 23:04:33

by Andi Kleen

Subject: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

The x86 user access functions (*_user) were originally very well tuned,
with partial inline code and other optimizations.

Then over time various new checks -- particularly the sleep checks for
a voluntary preempt kernel -- destroyed a lot of those tunings.

A typical user access operation now does multiple useless
function calls. Also, without force inlining, gcc's inlining
policy makes it even worse, adding more unnecessary calls.

Here's a typical example from ftrace:

 10)               |  might_fault() {
 10)               |    _cond_resched() {
 10)               |      should_resched() {
 10)               |        need_resched() {
 10)   0.063 us    |          test_ti_thread_flag();
 10)   0.643 us    |        }
 10)   1.238 us    |      }
 10)   1.845 us    |    }
 10)   2.438 us    |  }

So we spent 2.5us doing nothing (OK, it's a bit less without the
ftrace overhead, but still pretty bad).

Then in other cases we would have an out of line function,
but would actually do the might_sleep() checks in the inlined
caller. This doesn't make any sense at all.
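
To illustrate, here is a simplified sketch of that pattern (modeled on
the copy_from_user() wrapper in uaccess_64.h, not a verbatim quote):

	static inline unsigned long
	copy_from_user(void *to, const void __user *from, unsigned long n)
	{
		might_fault();		/* check expands in every caller */
		return _copy_from_user(to, from, n);	/* out of line anyway */
	}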

There were also a few other problems, for example the x86-64 uaccess
code regularly falls back to string functions, even though a simple
mov would be enough. For example every futex access to the lock
variable would actually use string instructions, even though
it's just 4 bytes.
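
For comparison, a fixed 4 byte access can be a single mov through
__get_user (sketch only, uaddr standing in for the futex address;
this is not the actual futex code):

	u32 val;

	/* compiles down to one movl plus an exception table entry */
	if (__get_user(val, (u32 __user *)uaddr))
		return -EFAULT;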

This patch kit is an attempt to get us back to sane code,
mostly by doing proper inlining and doing sleep checks in the right
place. Unfortunately I had to add one tree sweep to avoid a nasty
include loop.

It costs a bit of text space, but I think it's worth it
(if only to keep my blood pressure down while reading ftrace logs...)

I haven't done any particular benchmarks, but important low level
functions just ought to be fast.

64bit:
    text    data     bss      dec    hex filename
13249492 1881328 1159168 16289988 f890c4 vmlinux-before-uaccess
13260877 1877232 1159168 16297277 f8ad3d vmlinux-uaccess
+11k text, +0.08%

32bit:
    text   data     bss      dec    hex filename
11223248 899512 1916928 14039688 d63a88 vmlinux-before-uaccess
11230358 895416 1916928 14042702 d6464e vmlinux-uaccess
+7k text, +0.06%


2013-08-09 23:04:28

by Andi Kleen

Subject: [PATCH 02/13] x86: Include linux/sched.h in asm/uaccess.h

From: Andi Kleen <[email protected]>

uaccess.h uses might_sleep, but there is currently no explicit include for it.
Since an upcoming patch moves might_sleep into sched.h, include sched.h here.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 5ee2687..8fa3bd6 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -3,6 +3,7 @@
/*
* User space memory access functions
*/
+#include <linux/sched.h>
#include <linux/errno.h>
#include <linux/compiler.h>
#include <linux/thread_info.h>
--
1.8.3.1

2013-08-09 23:04:31

by Andi Kleen

Subject: [PATCH 06/13] x86: Add 32bit versions of SAVE_ALL/RESTORE_ALL to calling.h

From: Andi Kleen <[email protected]>

Add 32bit versions of SAVE_ALL/RESTORE_ALL to calling.h. Needed for
the following patches. These are much simpler than both the 64bit
and the 32bit entry* versions, just saving and restoring all registers
without anything fancy.

They differ from the entry_32.S versions in not changing the segment
registers and not doing cld, so they are only suitable for in-kernel
use, not for transitions from or to user space. The resulting stack
frame is not a standard pt_regs frame. The main use case is calling
C code from assembler when all the registers need to be preserved.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/calling.h | 49 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/x86/include/asm/calling.h b/arch/x86/include/asm/calling.h
index 0fa6750..3365cea 100644
--- a/arch/x86/include/asm/calling.h
+++ b/arch/x86/include/asm/calling.h
@@ -48,6 +48,8 @@ For 32-bit we have the following conventions - kernel is built with

#include <asm/dwarf2.h>

+#ifdef CONFIG_X86_64
+
/*
* 64-bit system call stack frame layout defines and helpers,
* for assembly code:
@@ -192,3 +194,50 @@ For 32-bit we have the following conventions - kernel is built with
.macro icebp
.byte 0xf1
.endm
+
+#else
+
+/*
+ * For 32bit only simplified versions of SAVE_ALL/RESTORE_ALL. These
+ * are different from the entry_32.S versions in not changing the segment
+ * registers. So only suitable for in kernel use, not when transitioning
+ * from or to user space. The resulting stack frame is not a standard
+ * pt_regs frame. The main use case is calling C code from assembler
+ * when all the registers need to be preserved.
+ */
+
+ .macro SAVE_ALL
+ pushl_cfi %eax
+ CFI_REL_OFFSET eax, 0
+ pushl_cfi %ebp
+ CFI_REL_OFFSET ebp, 0
+ pushl_cfi %edi
+ CFI_REL_OFFSET edi, 0
+ pushl_cfi %esi
+ CFI_REL_OFFSET esi, 0
+ pushl_cfi %edx
+ CFI_REL_OFFSET edx, 0
+ pushl_cfi %ecx
+ CFI_REL_OFFSET ecx, 0
+ pushl_cfi %ebx
+ CFI_REL_OFFSET ebx, 0
+ .endm
+
+ .macro RESTORE_ALL
+ popl_cfi %ebx
+ CFI_RESTORE ebx
+ popl_cfi %ecx
+ CFI_RESTORE ecx
+ popl_cfi %edx
+ CFI_RESTORE edx
+ popl_cfi %esi
+ CFI_RESTORE esi
+ popl_cfi %edi
+ CFI_RESTORE edi
+ popl_cfi %ebp
+ CFI_RESTORE ebp
+ popl_cfi %eax
+ CFI_RESTORE eax
+ .endm
+
+#endif
--
1.8.3.1

2013-08-09 23:04:32

by Andi Kleen

Subject: [PATCH 04/13] Move might_sleep and friends from kernel.h to sched.h

From: Andi Kleen <[email protected]>

These are really related to scheduling, so they should be in sched.h.
Users will usually need to schedule anyway.

The advantage of having them there is that we can access some of the
scheduler inlines to make their fast path more efficient. This will come
in a follow-on patch.

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/kernel.h | 35 -----------------------------------
include/linux/sched.h | 38 ++++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 482ad2d..badcc13 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -141,35 +141,6 @@ struct completion;
struct pt_regs;
struct user;

-#ifdef CONFIG_PREEMPT_VOLUNTARY
-extern int _cond_resched(void);
-# define might_resched() _cond_resched()
-#else
-# define might_resched() do { } while (0)
-#endif
-
-#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
- void __might_sleep(const char *file, int line, int preempt_offset);
-/**
- * might_sleep - annotation for functions that can sleep
- *
- * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
- *
- * This is a useful debugging help to be able to catch problems early and not
- * be bitten later when the calling function happens to sleep when it is not
- * supposed to.
- */
-# define might_sleep() \
- do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
-#else
- static inline void __might_sleep(const char *file, int line,
- int preempt_offset) { }
-# define might_sleep() do { might_resched(); } while (0)
-#endif
-
-#define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
-
/*
* abs() handles unsigned and signed longs, ints, shorts and chars. For all
* input types abs() returns a signed long.
@@ -193,12 +164,6 @@ extern int _cond_resched(void);
(__x < 0) ? -__x : __x; \
})

-#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
-void might_fault(void);
-#else
-static inline void might_fault(void) { }
-#endif
-
extern struct atomic_notifier_head panic_notifier_list;
extern long (*panic_blink)(int state);
__printf(1, 2)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d722490..773f21d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2433,6 +2433,35 @@ extern int __cond_resched_softirq(void);
__cond_resched_softirq(); \
})

+#ifdef CONFIG_PREEMPT_VOLUNTARY
+extern int _cond_resched(void);
+# define might_resched() _cond_resched()
+#else
+# define might_resched() do { } while (0)
+#endif
+
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+ void __might_sleep(const char *file, int line, int preempt_offset);
+/**
+ * might_sleep - annotation for functions that can sleep
+ *
+ * this macro will print a stack trace if it is executed in an atomic
+ * context (spinlock, irq-handler, ...).
+ *
+ * This is a useful debugging help to be able to catch problems early and not
+ * be bitten later when the calling function happens to sleep when it is not
+ * supposed to.
+ */
+# define might_sleep() \
+ do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)
+#else
+ static inline void __might_sleep(const char *file, int line,
+ int preempt_offset) { }
+# define might_sleep() do { might_resched(); } while (0)
+#endif
+
+#define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
+
static inline void cond_resched_rcu(void)
{
#if defined(CONFIG_DEBUG_ATOMIC_SLEEP) || !defined(CONFIG_PREEMPT_RCU)
@@ -2442,6 +2471,15 @@ static inline void cond_resched_rcu(void)
#endif
}

+#ifdef CONFIG_PROVE_LOCKING
+void might_fault(void);
+#else
+static inline void might_fault(void)
+{
+ might_sleep();
+}
+#endif
+
/*
* Does a critical section need to be broken due to another
* task waiting?: (technically does not depend on CONFIG_PREEMPT,
--
1.8.3.1

2013-08-09 23:05:14

by Andi Kleen

Subject: [PATCH 05/13] sched: mark should_resched() __always_inline

From: Andi Kleen <[email protected]>

At least gcc 4.6 and some earlier versions do not inline this function.
Since it's small and on relatively hot paths, force-inline it.

Signed-off-by: Andi Kleen <[email protected]>
---
kernel/sched/core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 74d7c04..23df96a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3767,7 +3767,7 @@ SYSCALL_DEFINE0(sched_yield)
return 0;
}

-static inline int should_resched(void)
+static __always_inline int should_resched(void)
{
return need_resched() && !(preempt_count() & PREEMPT_ACTIVE);
}
--
1.8.3.1

2013-08-09 23:04:26

by Andi Kleen

Subject: [PATCH 07/13] Add might_fault_debug_only()

From: Andi Kleen <[email protected]>

Add a might_fault_debug_only() that only does something in the PROVE_LOCKING
case, but does not cond_resched for PREEMPT_VOLUNTARY. This is for
cases where the cond_resched is done elsewhere.
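
A sketch of the intended use (hypothetical wrapper shown here; patch 10
does this for real in copy_{from,to}_user):

	static inline unsigned long
	my_copy_from_user(void *to, const void __user *from, unsigned long n)
	{
		might_fault_debug_only();	/* PROVE_LOCKING check only */
		/* the out of line code does the cond_resched itself */
		return _copy_from_user(to, from, n);
	}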

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/sched.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 773f21d..bb7a08a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2473,11 +2473,13 @@ static inline void cond_resched_rcu(void)

#ifdef CONFIG_PROVE_LOCKING
void might_fault(void);
+#define might_fault_debug_only() might_fault()
#else
static inline void might_fault(void)
{
might_sleep();
}
+#define might_fault_debug_only() do {} while(0)
#endif

/*
--
1.8.3.1

2013-08-09 23:06:12

by Andi Kleen

Subject: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

From: Andi Kleen <[email protected]>

Move the cond_resched() check for CONFIG_PREEMPT_VOLUNTARY into
the low level copy_*_user code. This avoids some code bloat and
makes the check much more efficient by avoiding unnecessary function calls.

This is currently only done for the non-__ variants.

For the sleep debug case the call is still done in the caller.

I did not do this for copy_in_user() or the nocache variants because there's
no obvious place to put the check, and those calls are comparatively rare.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 4 ++--
arch/x86/lib/copy_user_64.S | 5 +++--
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 64476bb..b327057 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -58,7 +58,7 @@ static inline unsigned long __must_check copy_from_user(void *to,
{
int sz = __compiletime_object_size(to);

- might_fault();
+ might_fault_debug_only();
if (likely(sz == -1 || sz >= n))
n = _copy_from_user(to, from, n);
#ifdef CONFIG_DEBUG_VM
@@ -71,7 +71,7 @@ static inline unsigned long __must_check copy_from_user(void *to,
static __always_inline __must_check
int copy_to_user(void __user *dst, const void *src, unsigned size)
{
- might_fault();
+ might_fault_debug_only();

return _copy_to_user(dst, src, size);
}
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index a30ca15..7039fc9 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -18,6 +18,7 @@
#include <asm/alternative-asm.h>
#include <asm/asm.h>
#include <asm/smap.h>
+#include "user-common.h"

/*
* By placing feature2 after feature1 in altinstructions section, we logically
@@ -73,7 +74,7 @@
/* Standard copy_to_user with segment limit checking */
ENTRY(_copy_to_user)
CFI_STARTPROC
- GET_THREAD_INFO(%rax)
+ GET_THREAD_AND_SCHEDULE %rax
movq %rdi,%rcx
addq %rdx,%rcx
jc bad_to_user
@@ -88,7 +89,7 @@ ENDPROC(_copy_to_user)
/* Standard copy_from_user with segment limit checking */
ENTRY(_copy_from_user)
CFI_STARTPROC
- GET_THREAD_INFO(%rax)
+ GET_THREAD_AND_SCHEDULE %rax
movq %rsi,%rcx
addq %rdx,%rcx
jc bad_from_user
--
1.8.3.1

2013-08-09 23:06:13

by Andi Kleen

Subject: [PATCH 03/13] tree-sweep: Include linux/sched.h for might_sleep users

From: Andi Kleen <[email protected]>

might_sleep is moving from linux/kernel.h to linux/sched.h, so any users
need to include linux/sched.h.

This was done with a mechanical script and some of the includes may be
redundant (already pulled in by some other include file). However it's
good practice to always include any needed symbols from the top level
.c file.

Tested with x86-64 allyesconfig. I used to do an x86-32 allyesconfig
on an old kernel, but since that is broken now I didn't retest.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/arm/common/mcpm_entry.c | 1 +
arch/arm/mach-omap2/omap_hwmod.c | 1 +
arch/arm/mm/highmem.c | 1 +
arch/blackfin/kernel/bfin_gpio.c | 1 +
arch/frv/mm/highmem.c | 1 +
arch/m32r/include/asm/uaccess.h | 1 +
arch/microblaze/include/asm/highmem.h | 1 +
arch/mn10300/include/asm/uaccess.h | 1 +
arch/parisc/include/asm/cacheflush.h | 1 +
arch/powerpc/include/asm/highmem.h | 1 +
arch/powerpc/kernel/rtas.c | 1 +
arch/powerpc/lib/checksum_wrappers_64.c | 1 +
arch/powerpc/lib/usercopy_64.c | 1 +
arch/tile/mm/highmem.c | 1 +
arch/x86/include/asm/checksum_32.h | 1 +
arch/x86/lib/csum-wrappers_64.c | 1 +
arch/x86/mm/highmem_32.c | 1 +
arch/x86/mm/mmio-mod.c | 1 +
block/blk-cgroup.c | 1 +
block/blk-core.c | 1 +
block/genhd.c | 1 +
drivers/base/dma-buf.c | 1 +
drivers/block/rsxx/dev.c | 1 +
drivers/dma/ipu/ipu_irq.c | 1 +
drivers/gpio/gpiolib.c | 1 +
drivers/ide/ide-io.c | 1 +
drivers/infiniband/hw/amso1100/c2_cq.c | 1 +
drivers/infiniband/hw/cxgb3/iwch_cm.c | 1 +
drivers/infiniband/hw/cxgb4/cm.c | 1 +
drivers/md/dm.c | 1 +
drivers/md/raid5.c | 1 +
drivers/mmc/core/core.c | 1 +
drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 1 +
drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h | 1 +
drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c | 1 +
drivers/net/ethernet/intel/e1000e/netdev.c | 1 +
drivers/net/ethernet/intel/igbvf/netdev.c | 1 +
drivers/net/ethernet/sfc/falcon.c | 1 +
drivers/net/ieee802154/at86rf230.c | 1 +
drivers/net/ieee802154/fakelb.c | 1 +
drivers/net/wireless/ath/carl9170/usb.c | 1 +
drivers/net/wireless/ath/wil6210/wmi.c | 1 +
drivers/net/wireless/b43/dma.c | 1 +
drivers/net/wireless/b43/main.c | 1 +
drivers/net/wireless/b43/phy_a.c | 1 +
drivers/net/wireless/b43/phy_g.c | 1 +
drivers/net/wireless/b43legacy/dma.c | 1 +
drivers/net/wireless/b43legacy/radio.c | 1 +
drivers/net/wireless/cw1200/cw1200_spi.c | 1 +
drivers/net/wireless/iwlwifi/dvm/sta.c | 1 +
drivers/net/wireless/iwlwifi/iwl-op-mode.h | 1 +
drivers/net/wireless/iwlwifi/iwl-trans.h | 1 +
drivers/net/wireless/libertas_tf/cmd.c | 1 +
drivers/pci/iov.c | 1 +
drivers/pci/pci.c | 1 +
drivers/platform/olpc/olpc-ec.c | 1 +
drivers/ssb/driver_pcicore.c | 1 +
drivers/staging/lustre/lustre/llite/remote_perm.c | 1 +
drivers/staging/lustre/lustre/obdclass/cl_lock.c | 1 +
drivers/staging/lustre/lustre/obdclass/cl_object.c | 1 +
drivers/staging/lustre/lustre/obdclass/cl_page.c | 1 +
drivers/staging/lustre/lustre/osc/osc_lock.c | 1 +
drivers/staging/lustre/lustre/osc/osc_page.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/client.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/gss/gss_cli_upcall.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/gss/gss_pipefs.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/sec.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/sec_config.c | 1 +
drivers/staging/lustre/lustre/ptlrpc/sec_gc.c | 1 +
drivers/usb/core/hcd.c | 1 +
drivers/usb/core/urb.c | 1 +
drivers/video/atmel_lcdfb.c | 1 +
fs/block_dev.c | 1 +
fs/buffer.c | 1 +
fs/dcache.c | 1 +
fs/ext3/inode.c | 1 +
fs/ext4/ext4_jbd2.c | 1 +
fs/ext4/inode.c | 1 +
fs/ext4/mballoc.c | 1 +
fs/file_table.c | 1 +
fs/inode.c | 1 +
fs/jbd/revoke.c | 1 +
fs/jbd2/revoke.c | 1 +
fs/locks.c | 1 +
fs/nfs/nfs4filelayoutdev.c | 1 +
fs/nfs/nfs4proc.c | 1 +
fs/nfs/nfs4state.c | 1 +
fs/nilfs2/the_nilfs.c | 1 +
fs/xfs/xfs_mount.c | 1 +
include/asm-generic/gpio.h | 1 +
include/linux/buffer_head.h | 1 +
include/linux/clk.h | 1 +
include/linux/gpio.h | 1 +
include/linux/highmem.h | 1 +
include/linux/pagemap.h | 1 +
kernel/fork.c | 1 +
kernel/freezer.c | 1 +
kernel/irq/chip.c | 1 +
kernel/nsproxy.c | 1 +
kernel/printk/printk.c | 1 +
kernel/sched/core.c | 1 +
kernel/smp.c | 1 +
mm/hugetlb.c | 1 +
mm/memory.c | 1 +
mm/mmap.c | 1 +
mm/rmap.c | 1 +
net/caif/cfcnfg.c | 1 +
net/mac80211/driver-ops.h | 1 +
net/mac80211/key.c | 1 +
net/mac80211/main.c | 1 +
net/mac80211/sta_info.c | 1 +
net/phonet/pep.c | 1 +
net/sunrpc/clnt.c | 1 +
net/wimax/op-msg.c | 1 +
net/wimax/op-reset.c | 1 +
net/wimax/op-rfkill.c | 1 +
net/wireless/wext-proc.c | 1 +
sound/core/info.c | 1 +
virt/kvm/async_pf.c | 1 +
120 files changed, 120 insertions(+)

diff --git a/arch/arm/common/mcpm_entry.c b/arch/arm/common/mcpm_entry.c
index 370236d..c083f90 100644
--- a/arch/arm/common/mcpm_entry.c
+++ b/arch/arm/common/mcpm_entry.c
@@ -10,6 +10,7 @@
*/

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/init.h>
#include <linux/irqflags.h>

diff --git a/arch/arm/mach-omap2/omap_hwmod.c b/arch/arm/mach-omap2/omap_hwmod.c
index 7f4db12..04a2674 100644
--- a/arch/arm/mach-omap2/omap_hwmod.c
+++ b/arch/arm/mach-omap2/omap_hwmod.c
@@ -127,6 +127,7 @@
*/
#undef DEBUG

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/io.h>
diff --git a/arch/arm/mm/highmem.c b/arch/arm/mm/highmem.c
index 21b9e1b..a8be1f1 100644
--- a/arch/arm/mm/highmem.c
+++ b/arch/arm/mm/highmem.c
@@ -10,6 +10,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/highmem.h>
#include <linux/interrupt.h>
diff --git a/arch/blackfin/kernel/bfin_gpio.c b/arch/blackfin/kernel/bfin_gpio.c
index ed978f1..0efe3d5 100644
--- a/arch/blackfin/kernel/bfin_gpio.c
+++ b/arch/blackfin/kernel/bfin_gpio.c
@@ -6,6 +6,7 @@
* Licensed under the GPL-2 or later
*/

+#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/module.h>
#include <linux/err.h>
diff --git a/arch/frv/mm/highmem.c b/arch/frv/mm/highmem.c
index bed9a9b..766ff5b 100644
--- a/arch/frv/mm/highmem.c
+++ b/arch/frv/mm/highmem.c
@@ -8,6 +8,7 @@
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>

diff --git a/arch/m32r/include/asm/uaccess.h b/arch/m32r/include/asm/uaccess.h
index 84fe7ba..a097157 100644
--- a/arch/m32r/include/asm/uaccess.h
+++ b/arch/m32r/include/asm/uaccess.h
@@ -11,6 +11,7 @@
/*
* User space memory access functions
*/
+#include <linux/sched.h>
#include <linux/errno.h>
#include <linux/thread_info.h>
#include <asm/page.h>
diff --git a/arch/microblaze/include/asm/highmem.h b/arch/microblaze/include/asm/highmem.h
index d046389..40c5b59 100644
--- a/arch/microblaze/include/asm/highmem.h
+++ b/arch/microblaze/include/asm/highmem.h
@@ -20,6 +20,7 @@
#ifdef __KERNEL__

#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/uaccess.h>
#include <asm/fixmap.h>
diff --git a/arch/mn10300/include/asm/uaccess.h b/arch/mn10300/include/asm/uaccess.h
index 5372787..274c9c2 100644
--- a/arch/mn10300/include/asm/uaccess.h
+++ b/arch/mn10300/include/asm/uaccess.h
@@ -14,6 +14,7 @@
/*
* User space memory access functions
*/
+#include <linux/sched.h>
#include <linux/thread_info.h>
#include <linux/kernel.h>
#include <asm/page.h>
diff --git a/arch/parisc/include/asm/cacheflush.h b/arch/parisc/include/asm/cacheflush.h
index f0e2784..ab4ed76 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -2,6 +2,7 @@
#define _PARISC_CACHEFLUSH_H

#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/uaccess.h>
#include <asm/tlbflush.h>

diff --git a/arch/powerpc/include/asm/highmem.h b/arch/powerpc/include/asm/highmem.h
index caaf6e0..721bc1b 100644
--- a/arch/powerpc/include/asm/highmem.h
+++ b/arch/powerpc/include/asm/highmem.h
@@ -22,6 +22,7 @@

#ifdef __KERNEL__

+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <asm/kmap_types.h>
#include <asm/tlbflush.h>
diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
index 80b5ef4..e479bcf 100644
--- a/arch/powerpc/kernel/rtas.c
+++ b/arch/powerpc/kernel/rtas.c
@@ -11,6 +11,7 @@
* 2 of the License, or (at your option) any later version.
*/

+#include <linux/sched.h>
#include <stdarg.h>
#include <linux/kernel.h>
#include <linux/types.h>
diff --git a/arch/powerpc/lib/checksum_wrappers_64.c b/arch/powerpc/lib/checksum_wrappers_64.c
index 08e3a33..8d9598c 100644
--- a/arch/powerpc/lib/checksum_wrappers_64.c
+++ b/arch/powerpc/lib/checksum_wrappers_64.c
@@ -17,6 +17,7 @@
*
* Author: Anton Blanchard <[email protected]>
*/
+#include <linux/sched.h>
#include <linux/export.h>
#include <linux/compiler.h>
#include <linux/types.h>
diff --git a/arch/powerpc/lib/usercopy_64.c b/arch/powerpc/lib/usercopy_64.c
index 5eea6f3..8c8cfa6 100644
--- a/arch/powerpc/lib/usercopy_64.c
+++ b/arch/powerpc/lib/usercopy_64.c
@@ -6,6 +6,7 @@
* as published by the Free Software Foundation; either version
* 2 of the License, or (at your option) any later version.
*/
+#include <linux/sched.h>
#include <linux/module.h>
#include <asm/uaccess.h>

diff --git a/arch/tile/mm/highmem.c b/arch/tile/mm/highmem.c
index 347d123..f82b1e0 100644
--- a/arch/tile/mm/highmem.c
+++ b/arch/tile/mm/highmem.c
@@ -12,6 +12,7 @@
* more details.
*/

+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
#include <linux/pagemap.h>
diff --git a/arch/x86/include/asm/checksum_32.h b/arch/x86/include/asm/checksum_32.h
index 46fc474..b9aa5d0 100644
--- a/arch/x86/include/asm/checksum_32.h
+++ b/arch/x86/include/asm/checksum_32.h
@@ -2,6 +2,7 @@
#define _ASM_X86_CHECKSUM_32_H

#include <linux/in6.h>
+#include <linux/sched.h>

#include <asm/uaccess.h>

diff --git a/arch/x86/lib/csum-wrappers_64.c b/arch/x86/lib/csum-wrappers_64.c
index 25b7ae8..aaba241 100644
--- a/arch/x86/lib/csum-wrappers_64.c
+++ b/arch/x86/lib/csum-wrappers_64.c
@@ -4,6 +4,7 @@
*
* Wrappers of assembly checksum functions for x86-64.
*/
+#include <linux/sched.h>
#include <asm/checksum.h>
#include <linux/module.h>

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index 4500142..1212a56 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -1,3 +1,4 @@
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
#include <linux/swap.h> /* for totalram_pages */
diff --git a/arch/x86/mm/mmio-mod.c b/arch/x86/mm/mmio-mod.c
index 0057a7a..07b0235 100644
--- a/arch/x86/mm/mmio-mod.c
+++ b/arch/x86/mm/mmio-mod.c
@@ -24,6 +24,7 @@

#define DEBUG 1

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/debugfs.h>
#include <linux/slab.h>
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 290792a..20664e2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -11,6 +11,7 @@
* Nauman Rafique <[email protected]>
*/
#include <linux/ioprio.h>
+#include <linux/sched.h>
#include <linux/kdev_t.h>
#include <linux/module.h>
#include <linux/err.h>
diff --git a/block/blk-core.c b/block/blk-core.c
index 93a18d1..cf9bb93 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -11,6 +11,7 @@
/*
* This handles all read/write requests to block devices
*/
+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/backing-dev.h>
diff --git a/block/genhd.c b/block/genhd.c
index dadf42b..a5012aa9 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -2,6 +2,7 @@
* gendisk handling
*/

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/genhd.h>
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 6687ba7..b116b28 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -22,6 +22,7 @@
* this program. If not, see <http://www.gnu.org/licenses/>.
*/

+#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/dma-buf.h>
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index d7af441..a033bce 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -23,6 +23,7 @@
*/

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/pci.h>
diff --git a/drivers/dma/ipu/ipu_irq.c b/drivers/dma/ipu/ipu_irq.c
index 2e284a4..7f4b2e5 100644
--- a/drivers/dma/ipu/ipu_irq.c
+++ b/drivers/dma/ipu/ipu_irq.c
@@ -7,6 +7,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/sched.h>
#include <linux/init.h>
#include <linux/err.h>
#include <linux/spinlock.h>
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index ff0fd65..906a80d 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1,3 +1,4 @@
+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/interrupt.h>
diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index 177db6d..9aa758e 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -24,6 +24,7 @@
*/


+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/string.h>
diff --git a/drivers/infiniband/hw/amso1100/c2_cq.c b/drivers/infiniband/hw/amso1100/c2_cq.c
index 49e0e85..07f9d3a 100644
--- a/drivers/infiniband/hw/amso1100/c2_cq.c
+++ b/drivers/infiniband/hw/amso1100/c2_cq.c
@@ -35,6 +35,7 @@
* SOFTWARE.
*
*/
+#include <linux/sched.h>
#include <linux/gfp.h>

#include "c2.h"
diff --git a/drivers/infiniband/hw/cxgb3/iwch_cm.c b/drivers/infiniband/hw/cxgb3/iwch_cm.c
index 3e094cd..a0fba9c 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_cm.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_cm.c
@@ -29,6 +29,7 @@
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/list.h>
#include <linux/slab.h>
diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 65c30ea..f0e6a5b 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -29,6 +29,7 @@
* CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/list.h>
#include <linux/workqueue.h>
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9e39d2b..43ee49a 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -5,6 +5,7 @@
* This file is released under the GPL.
*/

+#include <linux/sched.h>
#include "dm.h"
#include "dm-uevent.h"

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 78ea443..fcf17e7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -43,6 +43,7 @@
* miss any bits.
*/

+#include <linux/sched.h>
#include <linux/blkdev.h>
#include <linux/kthread.h>
#include <linux/raid/pq.h>
diff --git a/drivers/mmc/core/core.c b/drivers/mmc/core/core.c
index 49a5bca..f1832b0 100644
--- a/drivers/mmc/core/core.c
+++ b/drivers/mmc/core/core.c
@@ -10,6 +10,7 @@
* it under the terms of the GNU General Public License version 2 as
* published by the Free Software Foundation.
*/
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/interrupt.h>
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index e06186c..95eed3f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -17,6 +17,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
index 8f03c98..d908719 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c
@@ -19,6 +19,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/crc32.h>
#include <linux/netdevice.h>
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
index d143a7c..01ebe72 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.h
@@ -20,6 +20,7 @@
#define BNX2X_SRIOV_H

#include "bnx2x_vfpf.h"
+#include <linux/sched.h>
#include "bnx2x.h"

enum sample_bulletin_result {
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c
index 98366ab..962e22a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c
@@ -17,6 +17,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include "bnx2x_stats.h"
#include "bnx2x_cmn.h"
#include "bnx2x_sriov.h"
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 77f81cb..36aca7e 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -28,6 +28,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/init.h>
diff --git a/drivers/net/ethernet/intel/igbvf/netdev.c b/drivers/net/ethernet/intel/igbvf/netdev.c
index 93eb7ee..95d5430 100644
--- a/drivers/net/ethernet/intel/igbvf/netdev.c
+++ b/drivers/net/ethernet/intel/igbvf/netdev.c
@@ -27,6 +27,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/types.h>
#include <linux/init.h>
diff --git a/drivers/net/ethernet/sfc/falcon.c b/drivers/net/ethernet/sfc/falcon.c
index 71998e7..a7a390f 100644
--- a/drivers/net/ethernet/sfc/falcon.c
+++ b/drivers/net/ethernet/sfc/falcon.c
@@ -8,6 +8,7 @@
* by the Free Software Foundation, incorporated herein by reference.
*/

+#include <linux/sched.h>
#include <linux/bitops.h>
#include <linux/delay.h>
#include <linux/pci.h>
diff --git a/drivers/net/ieee802154/at86rf230.c b/drivers/net/ieee802154/at86rf230.c
index 6f10b49..59350e3 100644
--- a/drivers/net/ieee802154/at86rf230.c
+++ b/drivers/net/ieee802154/at86rf230.c
@@ -21,6 +21,7 @@
* Alexander Smirnov <[email protected]>
*/
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/interrupt.h>
#include <linux/gpio.h>
diff --git a/drivers/net/ieee802154/fakelb.c b/drivers/net/ieee802154/fakelb.c
index b8d2217..124969c 100644
--- a/drivers/net/ieee802154/fakelb.c
+++ b/drivers/net/ieee802154/fakelb.c
@@ -23,6 +23,7 @@
*/

#include <linux/module.h>
+#include <linux/sched.h>
#include <linux/timer.h>
#include <linux/platform_device.h>
#include <linux/netdevice.h>
diff --git a/drivers/net/wireless/ath/carl9170/usb.c b/drivers/net/wireless/ath/carl9170/usb.c
index 307bc0d..a6bc868 100644
--- a/drivers/net/wireless/ath/carl9170/usb.c
+++ b/drivers/net/wireless/ath/carl9170/usb.c
@@ -37,6 +37,7 @@
* OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
*/

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/usb.h>
diff --git a/drivers/net/wireless/ath/wil6210/wmi.c b/drivers/net/wireless/ath/wil6210/wmi.c
index dc8059a..bc61ab3 100644
--- a/drivers/net/wireless/ath/wil6210/wmi.c
+++ b/drivers/net/wireless/ath/wil6210/wmi.c
@@ -15,6 +15,7 @@
*/

#include <linux/etherdevice.h>
+#include <linux/sched.h>
#include <linux/if_arp.h>

#include "wil6210.h"
diff --git a/drivers/net/wireless/b43/dma.c b/drivers/net/wireless/b43/dma.c
index f7c70b3..2bdbd63 100644
--- a/drivers/net/wireless/b43/dma.c
+++ b/drivers/net/wireless/b43/dma.c
@@ -27,6 +27,7 @@

*/

+#include <linux/sched.h>
#include "b43.h"
#include "dma.h"
#include "main.h"
diff --git a/drivers/net/wireless/b43/main.c b/drivers/net/wireless/b43/main.c
index 0e933bb..6e49c09 100644
--- a/drivers/net/wireless/b43/main.c
+++ b/drivers/net/wireless/b43/main.c
@@ -32,6 +32,7 @@

*/

+#include <linux/sched.h>
#include <linux/delay.h>
#include <linux/init.h>
#include <linux/module.h>
diff --git a/drivers/net/wireless/b43/phy_a.c b/drivers/net/wireless/b43/phy_a.c
index a6c3810..0a7d6b6 100644
--- a/drivers/net/wireless/b43/phy_a.c
+++ b/drivers/net/wireless/b43/phy_a.c
@@ -26,6 +26,7 @@

*/

+#include <linux/sched.h>
#include <linux/slab.h>

#include "b43.h"
diff --git a/drivers/net/wireless/b43/phy_g.c b/drivers/net/wireless/b43/phy_g.c
index 12f467b..01141cd 100644
--- a/drivers/net/wireless/b43/phy_g.c
+++ b/drivers/net/wireless/b43/phy_g.c
@@ -26,6 +26,7 @@

*/

+#include <linux/sched.h>
#include "b43.h"
#include "phy_g.h"
#include "phy_common.h"
diff --git a/drivers/net/wireless/b43legacy/dma.c b/drivers/net/wireless/b43legacy/dma.c
index faeafe2..f709415 100644
--- a/drivers/net/wireless/b43legacy/dma.c
+++ b/drivers/net/wireless/b43legacy/dma.c
@@ -27,6 +27,7 @@

*/

+#include <linux/sched.h>
#include "b43legacy.h"
#include "dma.h"
#include "main.h"
diff --git a/drivers/net/wireless/b43legacy/radio.c b/drivers/net/wireless/b43legacy/radio.c
index 8961776..ff59e7e 100644
--- a/drivers/net/wireless/b43legacy/radio.c
+++ b/drivers/net/wireless/b43legacy/radio.c
@@ -29,6 +29,7 @@

*/

+#include <linux/sched.h>
#include <linux/delay.h>

#include "b43legacy.h"
diff --git a/drivers/net/wireless/cw1200/cw1200_spi.c b/drivers/net/wireless/cw1200/cw1200_spi.c
index d063760..c2795d4 100644
--- a/drivers/net/wireless/cw1200/cw1200_spi.c
+++ b/drivers/net/wireless/cw1200/cw1200_spi.c
@@ -14,6 +14,7 @@
*/

#include <linux/module.h>
+#include <linux/sched.h>
#include <linux/gpio.h>
#include <linux/delay.h>
#include <linux/spinlock.h>
diff --git a/drivers/net/wireless/iwlwifi/dvm/sta.c b/drivers/net/wireless/iwlwifi/dvm/sta.c
index c3c13ce..71df3ce 100644
--- a/drivers/net/wireless/iwlwifi/dvm/sta.c
+++ b/drivers/net/wireless/iwlwifi/dvm/sta.c
@@ -27,6 +27,7 @@
*
*****************************************************************************/
#include <linux/etherdevice.h>
+#include <linux/sched.h>
#include <net/mac80211.h>
#include "iwl-trans.h"
#include "dev.h"
diff --git a/drivers/net/wireless/iwlwifi/iwl-op-mode.h b/drivers/net/wireless/iwlwifi/iwl-op-mode.h
index 98c7aa7..92695a6 100644
--- a/drivers/net/wireless/iwlwifi/iwl-op-mode.h
+++ b/drivers/net/wireless/iwlwifi/iwl-op-mode.h
@@ -64,6 +64,7 @@
#define __iwl_op_mode_h__

#include <linux/debugfs.h>
+#include <linux/sched.h>

struct iwl_op_mode;
struct iwl_trans;
diff --git a/drivers/net/wireless/iwlwifi/iwl-trans.h b/drivers/net/wireless/iwlwifi/iwl-trans.h
index 8d91422c..231949f 100644
--- a/drivers/net/wireless/iwlwifi/iwl-trans.h
+++ b/drivers/net/wireless/iwlwifi/iwl-trans.h
@@ -63,6 +63,7 @@
#ifndef __iwl_trans_h__
#define __iwl_trans_h__

+#include <linux/sched.h>
#include <linux/ieee80211.h>
#include <linux/mm.h> /* for page_address */
#include <linux/lockdep.h>
diff --git a/drivers/net/wireless/libertas_tf/cmd.c b/drivers/net/wireless/libertas_tf/cmd.c
index 909ac36..fa69e9f 100644
--- a/drivers/net/wireless/libertas_tf/cmd.c
+++ b/drivers/net/wireless/libertas_tf/cmd.c
@@ -9,6 +9,7 @@
*/
#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include <linux/sched.h>
#include <linux/hardirq.h>
#include <linux/slab.h>
#include <linux/export.h>
diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index de8ffac..a581e34 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -8,6 +8,7 @@
* Address Translation Service 1.0
*/

+#include <linux/sched.h>
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/mutex.h>
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e37fea6..6b44f63 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -7,6 +7,7 @@
* Copyright 1997 -- 2000 Martin Mares <[email protected]>
*/

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/delay.h>
#include <linux/init.h>
diff --git a/drivers/platform/olpc/olpc-ec.c b/drivers/platform/olpc/olpc-ec.c
index 0f9f859..dd3984a 100644
--- a/drivers/platform/olpc/olpc-ec.c
+++ b/drivers/platform/olpc/olpc-ec.c
@@ -6,6 +6,7 @@
* Licensed under the GPL v2 or later.
*/
#include <linux/completion.h>
+#include <linux/sched.h>
#include <linux/debugfs.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
diff --git a/drivers/ssb/driver_pcicore.c b/drivers/ssb/driver_pcicore.c
index d75b72b..c343ee7 100644
--- a/drivers/ssb/driver_pcicore.c
+++ b/drivers/ssb/driver_pcicore.c
@@ -8,6 +8,7 @@
* Licensed under the GNU/GPL. See COPYING for details.
*/

+#include <linux/sched.h>
#include <linux/ssb/ssb.h>
#include <linux/pci.h>
#include <linux/export.h>
diff --git a/drivers/staging/lustre/lustre/llite/remote_perm.c b/drivers/staging/lustre/lustre/llite/remote_perm.c
index 68b2dc4..ceac936 100644
--- a/drivers/staging/lustre/lustre/llite/remote_perm.c
+++ b/drivers/staging/lustre/lustre/llite/remote_perm.c
@@ -44,6 +44,7 @@
#define DEBUG_SUBSYSTEM S_LLITE

#include <linux/module.h>
+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/version.h>

diff --git a/drivers/staging/lustre/lustre/obdclass/cl_lock.c b/drivers/staging/lustre/lustre/obdclass/cl_lock.c
index d34e044..3fb55ff 100644
--- a/drivers/staging/lustre/lustre/obdclass/cl_lock.c
+++ b/drivers/staging/lustre/lustre/obdclass/cl_lock.c
@@ -41,6 +41,7 @@
#define DEBUG_SUBSYSTEM S_CLASS

#include <obd_class.h>
+#include <linux/sched.h>
#include <obd_support.h>
#include <lustre_fid.h>
#include <linux/list.h>
diff --git a/drivers/staging/lustre/lustre/obdclass/cl_object.c b/drivers/staging/lustre/lustre/obdclass/cl_object.c
index cdb5fba..72a1ac4 100644
--- a/drivers/staging/lustre/lustre/obdclass/cl_object.c
+++ b/drivers/staging/lustre/lustre/obdclass/cl_object.c
@@ -52,6 +52,7 @@
#define DEBUG_SUBSYSTEM S_CLASS

#include <linux/libcfs/libcfs.h>
+#include <linux/sched.h>
/* class_put_type() */
#include <obd_class.h>
#include <obd_support.h>
diff --git a/drivers/staging/lustre/lustre/obdclass/cl_page.c b/drivers/staging/lustre/lustre/obdclass/cl_page.c
index bb93359..f5189f8 100644
--- a/drivers/staging/lustre/lustre/obdclass/cl_page.c
+++ b/drivers/staging/lustre/lustre/obdclass/cl_page.c
@@ -41,6 +41,7 @@
#define DEBUG_SUBSYSTEM S_CLASS

#include <linux/libcfs/libcfs.h>
+#include <linux/sched.h>
#include <obd_class.h>
#include <obd_support.h>
#include <linux/list.h>
diff --git a/drivers/staging/lustre/lustre/osc/osc_lock.c b/drivers/staging/lustre/lustre/osc/osc_lock.c
index 640bc3d..d38347b 100644
--- a/drivers/staging/lustre/lustre/osc/osc_lock.c
+++ b/drivers/staging/lustre/lustre/osc/osc_lock.c
@@ -43,6 +43,7 @@
# include <linux/libcfs/libcfs.h>
/* fid_build_reg_res_name() */
#include <lustre_fid.h>
+#include <linux/sched.h>

#include "osc_cl_internal.h"

diff --git a/drivers/staging/lustre/lustre/osc/osc_page.c b/drivers/staging/lustre/lustre/osc/osc_page.c
index baba959..ad2c6c7 100644
--- a/drivers/staging/lustre/lustre/osc/osc_page.c
+++ b/drivers/staging/lustre/lustre/osc/osc_page.c
@@ -41,6 +41,7 @@
#define DEBUG_SUBSYSTEM S_OSC

#include "osc_cl_internal.h"
+#include <linux/sched.h>

static void osc_lru_del(struct client_obd *cli, struct osc_page *opg, bool del);
static void osc_lru_add(struct client_obd *cli, struct osc_page *opg);
diff --git a/drivers/staging/lustre/lustre/ptlrpc/client.c b/drivers/staging/lustre/lustre/ptlrpc/client.c
index 22f7e65..6791b65 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/client.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/client.c
@@ -39,6 +39,7 @@
#define DEBUG_SUBSYSTEM S_RPC

#include <obd_support.h>
+#include <linux/sched.h>
#include <obd_class.h>
#include <lustre_lib.h>
#include <lustre_ha.h>
diff --git a/drivers/staging/lustre/lustre/ptlrpc/gss/gss_cli_upcall.c b/drivers/staging/lustre/lustre/ptlrpc/gss/gss_cli_upcall.c
index 142c789..b2b893c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/gss/gss_cli_upcall.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/gss/gss_cli_upcall.c
@@ -40,6 +40,7 @@

#define DEBUG_SUBSYSTEM S_SEC
#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/dcache.h>
diff --git a/drivers/staging/lustre/lustre/ptlrpc/gss/gss_pipefs.c b/drivers/staging/lustre/lustre/ptlrpc/gss/gss_pipefs.c
index 3df7257..9dddaa8 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/gss/gss_pipefs.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/gss/gss_pipefs.c
@@ -48,6 +48,7 @@

#define DEBUG_SUBSYSTEM S_SEC
#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/dcache.h>
diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec.c b/drivers/staging/lustre/lustre/ptlrpc/sec.c
index 36e8bed..28dc4e6 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec.c
@@ -41,6 +41,7 @@
#define DEBUG_SUBSYSTEM S_SEC

#include <linux/libcfs/libcfs.h>
+#include <linux/sched.h>
#include <linux/crypto.h>
#include <linux/key.h>

diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_config.c b/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
index a45a392..c4b206c 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_config.c
@@ -37,6 +37,7 @@
#define DEBUG_SUBSYSTEM S_SEC

#include <linux/libcfs/libcfs.h>
+#include <linux/sched.h>
#include <linux/crypto.h>
#include <linux/key.h>

diff --git a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
index 4c96a14a..b512cbc 100644
--- a/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
+++ b/drivers/staging/lustre/lustre/ptlrpc/sec_gc.c
@@ -41,6 +41,7 @@
#define DEBUG_SUBSYSTEM S_SEC

#include <linux/libcfs/libcfs.h>
+#include <linux/sched.h>

#include <obd_support.h>
#include <obd_class.h>
diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 014dc99..ae84867 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -23,6 +23,7 @@
*/

#include <linux/bcd.h>
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/version.h>
#include <linux/kernel.h>
diff --git a/drivers/usb/core/urb.c b/drivers/usb/core/urb.c
index 16927fa..f033bcd0 100644
--- a/drivers/usb/core/urb.c
+++ b/drivers/usb/core/urb.c
@@ -1,3 +1,4 @@
+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/string.h>
#include <linux/bitops.h>
diff --git a/drivers/video/atmel_lcdfb.c b/drivers/video/atmel_lcdfb.c
index effdb37..915836b 100644
--- a/drivers/video/atmel_lcdfb.c
+++ b/drivers/video/atmel_lcdfb.c
@@ -8,6 +8,7 @@
* more details.
*/

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/platform_device.h>
#include <linux/dma-mapping.h>
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7bda5c..00f49af 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -5,6 +5,7 @@
* Copyright (C) 2001 Andrea Arcangeli <[email protected]> SuSE
*/

+#include <linux/sched.h>
#include <linux/init.h>
#include <linux/mm.h>
#include <linux/fcntl.h>
diff --git a/fs/buffer.c b/fs/buffer.c
index 4d74335..050c03b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -18,6 +18,7 @@
* async buffer flushing, 1999 Andrea Arcangeli <[email protected]>
*/

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/syscalls.h>
#include <linux/fs.h>
diff --git a/fs/dcache.c b/fs/dcache.c
index 87bdb53..ab41273 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -14,6 +14,7 @@
* the dcache entry is deleted or garbage collected.
*/

+#include <linux/sched.h>
#include <linux/syscalls.h>
#include <linux/string.h>
#include <linux/mm.h>
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 2bd8548..8caf634 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -22,6 +22,7 @@
* Assorted race fixes, rewrite of ext3_get_block() by Al Viro, 2000
*/

+#include <linux/sched.h>
#include <linux/highuid.h>
#include <linux/quotaops.h>
#include <linux/writeback.h>
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 72a3600..c94cb65 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -2,6 +2,7 @@
* Interface between ext4 and JBD
*/

+#include <linux/sched.h>
#include "ext4_jbd2.h"

#include <trace/events/ext4.h>
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd32a2e..d12f990 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -18,6 +18,7 @@
* Assorted race fixes, rewrite of ext4_get_block() by Al Viro, 2000
*/

+#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/time.h>
#include <linux/jbd2.h>
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 4bbbf13b..108e7f6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -22,6 +22,7 @@
*/

#include "ext4_jbd2.h"
+#include <linux/sched.h>
#include "mballoc.h"
#include <linux/log2.h>
#include <linux/module.h>
diff --git a/fs/file_table.c b/fs/file_table.c
index b44e4c5..b35edc4 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -5,6 +5,7 @@
* Copyright (C) 1997 David S. Miller ([email protected])
*/

+#include <linux/sched.h>
#include <linux/string.h>
#include <linux/slab.h>
#include <linux/file.h>
diff --git a/fs/inode.c b/fs/inode.c
index d6dfb09..ef3a12d 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -2,6 +2,7 @@
* (C) 1997 Linus Torvalds
* (C) 1999 Andrea Arcangeli <[email protected]> (dynamic inode allocation)
*/
+#include <linux/sched.h>
#include <linux/export.h>
#include <linux/fs.h>
#include <linux/mm.h>
diff --git a/fs/jbd/revoke.c b/fs/jbd/revoke.c
index 25c713e..4642e55 100644
--- a/fs/jbd/revoke.c
+++ b/fs/jbd/revoke.c
@@ -81,6 +81,7 @@
*/

#ifndef __KERNEL__
+#include <linux/sched.h>
#include "jfs_user.h"
#else
#include <linux/time.h>
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index 198c9c1..b406894 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -81,6 +81,7 @@
*/

#ifndef __KERNEL__
+#include <linux/sched.h>
#include "jfs_user.h"
#else
#include <linux/time.h>
diff --git a/fs/locks.c b/fs/locks.c
index b27a300..b8a6709 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -114,6 +114,7 @@
* Stephen Rothwell <[email protected]>, June, 2000.
*/

+#include <linux/sched.h>
#include <linux/capability.h>
#include <linux/file.h>
#include <linux/fdtable.h>
diff --git a/fs/nfs/nfs4filelayoutdev.c b/fs/nfs/nfs4filelayoutdev.c
index 95604f6..c26be3b 100644
--- a/fs/nfs/nfs4filelayoutdev.c
+++ b/fs/nfs/nfs4filelayoutdev.c
@@ -29,6 +29,7 @@
*/

#include <linux/nfs_fs.h>
+#include <linux/sched.h>
#include <linux/vmalloc.h>
#include <linux/module.h>
#include <linux/sunrpc/addr.h>
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index cf11799..90319a3 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -35,6 +35,7 @@
* SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/

+#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/delay.h>
#include <linux/errno.h>
diff --git a/fs/nfs/nfs4state.c b/fs/nfs/nfs4state.c
index e22862f..c9acdbc 100644
--- a/fs/nfs/nfs4state.c
+++ b/fs/nfs/nfs4state.c
@@ -39,6 +39,7 @@
*/

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/nfs_fs.h>
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 94c451c..462be53 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -21,6 +21,7 @@
*
*/

+#include <linux/sched.h>
#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/blkdev.h>
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 2b0ba35..eb9ba15 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -15,6 +15,7 @@
* along with this program; if not, write the Free Software Foundation,
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
+#include <linux/sched.h>
#include "xfs.h"
#include "xfs_fs.h"
#include "xfs_types.h"
diff --git a/include/asm-generic/gpio.h b/include/asm-generic/gpio.h
index bde6469..17114e0 100644
--- a/include/asm-generic/gpio.h
+++ b/include/asm-generic/gpio.h
@@ -2,6 +2,7 @@
#define _ASM_GENERIC_GPIO_H

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/of.h>
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 91fa9a9..432e212 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -8,6 +8,7 @@
#define _LINUX_BUFFER_HEAD_H

#include <linux/types.h>
+#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/linkage.h>
#include <linux/pagemap.h>
diff --git a/include/linux/clk.h b/include/linux/clk.h
index 9a6d045..8b440a2 100644
--- a/include/linux/clk.h
+++ b/include/linux/clk.h
@@ -14,6 +14,7 @@

#include <linux/err.h>
#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/notifier.h>

struct device;
diff --git a/include/linux/gpio.h b/include/linux/gpio.h
index 552e3f4..be83d05 100644
--- a/include/linux/gpio.h
+++ b/include/linux/gpio.h
@@ -2,6 +2,7 @@
#define __LINUX_GPIO_H

#include <linux/errno.h>
+#include <linux/sched.h>

/* see Documentation/gpio.txt */

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 7fb31da..435de87 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -2,6 +2,7 @@
#define _LINUX_HIGHMEM_H

#include <linux/fs.h>
+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/bug.h>
#include <linux/mm.h>
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..a409785 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -5,6 +5,7 @@
* Copyright 1995 Linus Torvalds
*/
#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/list.h>
#include <linux/highmem.h>
diff --git a/kernel/fork.c b/kernel/fork.c
index 403d2bb..f12e417 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -11,6 +11,7 @@
* management can be a bitch. See 'mm/memory.c': 'copy_page_range()'
*/

+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/unistd.h>
diff --git a/kernel/freezer.c b/kernel/freezer.c
index b462fa1..1d975d1 100644
--- a/kernel/freezer.c
+++ b/kernel/freezer.c
@@ -4,6 +4,7 @@
* Originally from kernel/power/process.c
*/

+#include <linux/sched.h>
#include <linux/interrupt.h>
#include <linux/suspend.h>
#include <linux/export.h>
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index a3bb14f..9948f28 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -10,6 +10,7 @@
* Detailed information is available in Documentation/DocBook/genericirq
*/

+#include <linux/sched.h>
#include <linux/irq.h>
#include <linux/msi.h>
#include <linux/module.h>
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 364ceab..d018644 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -13,6 +13,7 @@
* Pavel Emelianov <[email protected]>
*/

+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/export.h>
#include <linux/nsproxy.h>
diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 5b5a708..c97c144 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -17,6 +17,7 @@
*/

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/tty.h>
#include <linux/tty_driver.h>
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b7c32cb..74d7c04 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -26,6 +26,7 @@
* Thomas Gleixner, Mike Kravetz
*/

+#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/nmi.h>
diff --git a/kernel/smp.c b/kernel/smp.c
index fe9f773..5b58226 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -9,6 +9,7 @@
#include <linux/export.h>
#include <linux/percpu.h>
#include <linux/init.h>
+#include <linux/sched.h>
#include <linux/gfp.h>
#include <linux/smp.h>
#include <linux/cpu.h>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..e342674 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2,6 +2,7 @@
* Generic hugetlb support.
* (C) Nadia Yvette Chambers, April 2004
*/
+#include <linux/sched.h>
#include <linux/list.h>
#include <linux/init.h>
#include <linux/module.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1ce2e2a..61add89 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -38,6 +38,7 @@
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
*/

+#include <linux/sched.h>
#include <linux/kernel_stat.h>
#include <linux/mm.h>
#include <linux/hugetlb.h>
diff --git a/mm/mmap.c b/mm/mmap.c
index 1edbaa3..1fda64c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -7,6 +7,7 @@
*/

#include <linux/kernel.h>
+#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/backing-dev.h>
#include <linux/mm.h>
diff --git a/mm/rmap.c b/mm/rmap.c
index cd356df..ca1c188 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -42,6 +42,7 @@
* pte map lock
*/

+#include <linux/sched.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
diff --git a/net/caif/cfcnfg.c b/net/caif/cfcnfg.c
index fa39fc2..2809fd5 100644
--- a/net/caif/cfcnfg.c
+++ b/net/caif/cfcnfg.c
@@ -6,6 +6,7 @@

#define pr_fmt(fmt) KBUILD_MODNAME ":%s(): " fmt, __func__

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/stddef.h>
#include <linux/slab.h>
diff --git a/net/mac80211/driver-ops.h b/net/mac80211/driver-ops.h
index b931c96..f6de52c 100644
--- a/net/mac80211/driver-ops.h
+++ b/net/mac80211/driver-ops.h
@@ -1,6 +1,7 @@
#ifndef __MAC80211_DRIVER_OPS
#define __MAC80211_DRIVER_OPS

+#include <linux/sched.h>
#include <net/mac80211.h>
#include "ieee80211_i.h"
#include "trace.h"
diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index e39cc91..9ae77d9 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -9,6 +9,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/sched.h>
#include <linux/if_ether.h>
#include <linux/etherdevice.h>
#include <linux/list.h>
diff --git a/net/mac80211/main.c b/net/mac80211/main.c
index 091088a..53db5ee 100644
--- a/net/mac80211/main.c
+++ b/net/mac80211/main.c
@@ -8,6 +8,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/sched.h>
#include <net/mac80211.h>
#include <linux/module.h>
#include <linux/init.h>
diff --git a/net/mac80211/sta_info.c b/net/mac80211/sta_info.c
index aeb967a..5a5ae34 100644
--- a/net/mac80211/sta_info.c
+++ b/net/mac80211/sta_info.c
@@ -7,6 +7,7 @@
* published by the Free Software Foundation.
*/

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/init.h>
#include <linux/etherdevice.h>
diff --git a/net/phonet/pep.c b/net/phonet/pep.c
index e774117..76bb2dc 100644
--- a/net/phonet/pep.c
+++ b/net/phonet/pep.c
@@ -22,6 +22,7 @@
* 02110-1301 USA
*/

+#include <linux/sched.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/socket.h>
diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 74f6a70..6efea48 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -19,6 +19,7 @@


#include <linux/module.h>
+#include <linux/sched.h>
#include <linux/types.h>
#include <linux/kallsyms.h>
#include <linux/mm.h>
diff --git a/net/wimax/op-msg.c b/net/wimax/op-msg.c
index 0694d62..2d48d26 100644
--- a/net/wimax/op-msg.c
+++ b/net/wimax/op-msg.c
@@ -71,6 +71,7 @@
* wimax_msg_alloc()
* wimax_msg_send()
*/
+#include <linux/sched.h>
#include <linux/device.h>
#include <linux/slab.h>
#include <net/genetlink.h>
diff --git a/net/wimax/op-reset.c b/net/wimax/op-reset.c
index 7ceffe3..30af0ee 100644
--- a/net/wimax/op-reset.c
+++ b/net/wimax/op-reset.c
@@ -28,6 +28,7 @@
* disconnect and reconnect the device).
*/

+#include <linux/sched.h>
#include <net/wimax.h>
#include <net/genetlink.h>
#include <linux/wimax.h>
diff --git a/net/wimax/op-rfkill.c b/net/wimax/op-rfkill.c
index 7ab60ba..6efe0bd 100644
--- a/net/wimax/op-rfkill.c
+++ b/net/wimax/op-rfkill.c
@@ -60,6 +60,7 @@
* wimax_rfkill_rm() [called by wimax_dev_add/rm()]
*/

+#include <linux/sched.h>
#include <net/wimax.h>
#include <net/genetlink.h>
#include <linux/wimax.h>
diff --git a/net/wireless/wext-proc.c b/net/wireless/wext-proc.c
index e98a01c..e771c39 100644
--- a/net/wireless/wext-proc.c
+++ b/net/wireless/wext-proc.c
@@ -16,6 +16,7 @@
* The content of the file is basically the content of "struct iw_statistics".
*/

+#include <linux/sched.h>
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/seq_file.h>
diff --git a/sound/core/info.c b/sound/core/info.c
index e79baa1..719a8ef 100644
--- a/sound/core/info.c
+++ b/sound/core/info.c
@@ -19,6 +19,7 @@
*
*/

+#include <linux/sched.h>
#include <linux/init.h>
#include <linux/time.h>
#include <linux/mm.h>
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index ea475cd..b44cea0 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -20,6 +20,7 @@
* Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301, USA.
*/

+#include <linux/sched.h>
#include <linux/kvm_host.h>
#include <linux/slab.h>
#include <linux/module.h>
--
1.8.3.1

2013-08-09 23:06:10

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 08/13] x86: Move cond_resched into the out of line put_user code

From: Andi Kleen <[email protected]>

CONFIG_PREEMPT_VOLUNTARY kernels always do a cond_resched in put_user().
Currently this is done in the caller and in an inefficient way (multiple
function calls to decide to do nothing). Move the reschedule check
into the low level functions instead, where it can be merged cheaply
with the address limit check.

For the DEBUG_SLEEP case we still do the call in the caller.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess.h | 2 +-
arch/x86/lib/putuser.S | 12 +++++++++++-
arch/x86/lib/user-common.h | 12 ++++++++++++
3 files changed, 24 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/lib/user-common.h

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 8fa3bd6..6cbe976 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -246,7 +246,7 @@ extern void __put_user_8(void);
int __ret_pu; \
__typeof__(*(ptr)) __pu_val; \
__chk_user_ptr(ptr); \
- might_fault(); \
+ might_fault_debug_only(); \
__pu_val = x; \
switch (sizeof(*(ptr))) { \
case 1: \
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index fc6ba17..9ba7f52 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -16,6 +16,8 @@
#include <asm/errno.h>
#include <asm/asm.h>
#include <asm/smap.h>
+#include <asm/calling.h>
+#include "user-common.h"


/*
@@ -31,7 +33,7 @@
*/

#define ENTER CFI_STARTPROC ; \
- GET_THREAD_INFO(%_ASM_BX)
+ GET_THREAD_AND_SCHEDULE %_ASM_BX
#define EXIT ASM_CLAC ; \
ret ; \
CFI_ENDPROC
@@ -99,3 +101,11 @@ END(bad_put_user)
#ifdef CONFIG_X86_32
_ASM_EXTABLE(5b,bad_put_user)
#endif
+
+ENTRY(user_schedule)
+ CFI_STARTPROC
+ SAVE_ALL
+ call _cond_resched
+ RESTORE_ALL
+ ret
+ CFI_ENDPROC
diff --git a/arch/x86/lib/user-common.h b/arch/x86/lib/user-common.h
new file mode 100644
index 0000000..d61cd1a
--- /dev/null
+++ b/arch/x86/lib/user-common.h
@@ -0,0 +1,12 @@
+ .macro GET_THREAD_AND_SCHEDULE reg
+ GET_THREAD_INFO(\reg)
+#ifdef CONFIG_PREEMPT_VOLUNTARY
+ testl $_TIF_NEED_RESCHED,TI_flags(\reg)
+ jnz 1f
+2:
+ .section .fixup,"ax"
+1: call user_schedule
+ jmp 2b
+ .previous
+#endif
+ .endm
--
1.8.3.1

2013-08-09 23:06:09

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 13/13] x86: drop cond rescheds from __copy_{from,to}_user

From: Andi Kleen <[email protected]>

The __copy_* variants are currently more expensive than the non-__ copy_*_user
variants in CONFIG_PREEMPT_VOLUNTARY because they make an additional function
call to might_fault().

Since they are usually used in a row with other functions that also schedule,
or only in the thin compat layers, and since __get/__put_user do not do an
explicit reschedule check either, drop the check here for the non-debug case.

The normal (non-__) copy_*_user variants will of course still schedule.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess_32.h | 6 +++---
arch/x86/include/asm/uaccess_64.h | 4 ++--
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_32.h b/arch/x86/include/asm/uaccess_32.h
index 7f760a9..e656ee9 100644
--- a/arch/x86/include/asm/uaccess_32.h
+++ b/arch/x86/include/asm/uaccess_32.h
@@ -81,7 +81,7 @@ __copy_to_user_inatomic(void __user *to, const void *from, unsigned long n)
static __always_inline unsigned long __must_check
__copy_to_user(void __user *to, const void *from, unsigned long n)
{
- might_fault();
+ might_fault_debug_only();
return __copy_to_user_inatomic(to, from, n);
}

@@ -136,7 +136,7 @@ __copy_from_user_inatomic(void *to, const void __user *from, unsigned long n)
static __always_inline unsigned long
__copy_from_user(void *to, const void __user *from, unsigned long n)
{
- might_fault();
+ might_fault_debug_only();
if (__builtin_constant_p(n)) {
unsigned long ret;

@@ -158,7 +158,7 @@ __copy_from_user(void *to, const void __user *from, unsigned long n)
static __always_inline unsigned long __copy_from_user_nocache(void *to,
const void __user *from, unsigned long n)
{
- might_fault();
+ might_fault_debug_only();
if (__builtin_constant_p(n)) {
unsigned long ret;

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 831f4a3..86959f5 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -122,7 +122,7 @@ int __copy_from_user_nocheck(void *dst, const void __user *src, unsigned size)
static __always_inline __must_check
int __copy_from_user(void *dst, const void __user *src, unsigned size)
{
- might_fault();
+ might_fault_debug_only();
return __copy_from_user_nocheck(dst, src, size);
}

@@ -172,7 +172,7 @@ int __copy_to_user_nocheck(void __user *dst, const void *src, unsigned size)
static __always_inline __must_check
int __copy_to_user(void __user *dst, const void *src, unsigned size)
{
- might_fault();
+ might_fault_debug_only();
return __copy_to_user_nocheck(dst, src, size);
}

--
1.8.3.1

2013-08-09 23:04:25

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 01/13] x86: Add 1/2/4/8 byte optimization to 64bit __copy_{from,to}_user_inatomic

From: Andi Kleen <[email protected]>

The 64bit __copy_{from,to}_user_inatomic always called
copy_user_generic, but skipped the special optimizations for 1/2/4/8
byte accesses.

This especially hurts the futex call, which accesses the 4 byte futex
user value with a complicated fast string operation in a function call,
instead of a single movl.
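
With the constant size path the 4 byte case instead goes through the
existing __get_user_asm arm of the size switch, roughly:

	case 4:
		__get_user_asm(*(u32 *)dst, (u32 __user *)src,
			       ret, "l", "k", "=r", 4);	/* one movl + fixup */
		return ret;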

Use __copy_{from,to}_user for _inatomic instead to get the same
optimizations. The only problem was the might_fault() in those functions.
So move that into a new wrapper and call __copy_{f,t}_user_nocheck()
from *_inatomic directly.

32bit already did this correctly by duplicating the code.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 24 ++++++++++++++++++------
1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index 4f7923d..64476bb 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -77,11 +77,10 @@ int copy_to_user(void __user *dst, const void *src, unsigned size)
}

static __always_inline __must_check
-int __copy_from_user(void *dst, const void __user *src, unsigned size)
+int __copy_from_user_nocheck(void *dst, const void __user *src, unsigned size)
{
int ret = 0;

- might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic(dst, (__force void *)src, size);
switch (size) {
@@ -121,11 +120,17 @@ int __copy_from_user(void *dst, const void __user *src, unsigned size)
}

static __always_inline __must_check
-int __copy_to_user(void __user *dst, const void *src, unsigned size)
+int __copy_from_user(void *dst, const void __user *src, unsigned size)
+{
+ might_fault();
+ return __copy_from_user_nocheck(dst, src, size);
+}
+
+static __always_inline __must_check
+int __copy_to_user_nocheck(void __user *dst, const void *src, unsigned size)
{
int ret = 0;

- might_fault();
if (!__builtin_constant_p(size))
return copy_user_generic((__force void *)dst, src, size);
switch (size) {
@@ -165,6 +170,13 @@ int __copy_to_user(void __user *dst, const void *src, unsigned size)
}

static __always_inline __must_check
+int __copy_to_user(void __user *dst, const void *src, unsigned size)
+{
+ might_fault();
+ return __copy_to_user_nocheck(dst, src, size);
+}
+
+static __always_inline __must_check
int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
{
int ret = 0;
@@ -220,13 +232,13 @@ int __copy_in_user(void __user *dst, const void __user *src, unsigned size)
static __must_check __always_inline int
__copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
{
- return copy_user_generic(dst, (__force const void *)src, size);
+ return __copy_from_user_nocheck(dst, (__force const void *)src, size);
}

static __must_check __always_inline int
__copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
{
- return copy_user_generic((__force void *)dst, src, size);
+ return __copy_to_user_nocheck((__force void *)dst, src, size);
}

extern long __copy_user_nocache(void *dst, const void __user *src,
--
1.8.3.1

2013-08-09 23:07:31

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 11/13] sched: Inline the need_resched test into the caller for _cond_resched

From: Andi Kleen <[email protected]>

_cond_resched is very common in kernel calls, e.g. it's used in every user
access. Usually it does at least two explicit calls just to decide to do
nothing: _cond_resched and should_resched(). Inline a need_resched()
into the caller to avoid these calls in the common case of no reschedule
being needed.
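
At the call site the new might_resched() then reduces to roughly:

	if (need_resched())		/* inline test of TIF_NEED_RESCHED */
		_cond_resched();	/* out of line call only when needed */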

Previously this would have been very expensive in terms of binary size
because there were a lot of inlined cond_resched()s in copy_*_user()
and put/get_user(). But with the newest changes to x86 uaccess.h
these are not inlined anymore, so we can use a slightly bigger, but
much faster fast path version.

Signed-off-by: Andi Kleen <[email protected]>
---
include/linux/sched.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb7a08a..9e0efa9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2435,7 +2435,7 @@ extern int __cond_resched_softirq(void);

#ifdef CONFIG_PREEMPT_VOLUNTARY
extern int _cond_resched(void);
-# define might_resched() _cond_resched()
+# define might_resched() (need_resched() ? _cond_resched() : 0)
#else
# define might_resched() do { } while (0)
#endif
--
1.8.3.1

2013-08-09 23:07:53

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 09/13] x86: Move cond_resched into the out of line get_user code

From: Andi Kleen <[email protected]>

CONFIG_PREEMPT_VOLUNTARY kernels always do a cond_resched in get_user().
Currently this is done in the caller and in an inefficient way (multiple
function calls to decide to do nothing). Move the reschedule check
into the low level functions instead, where it can be merged cheaply
with the address limit check.

For the DEBUG_SLEEP case we still do the call in the caller.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess.h | 3 +--
arch/x86/lib/getuser.S | 9 +++++----
2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index 6cbe976..1ab2d2a 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -165,7 +165,7 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
int __ret_gu; \
register __inttype(*(ptr)) __val_gu asm("%edx"); \
__chk_user_ptr(ptr); \
- might_fault(); \
+ might_fault_debug_only(); \
asm volatile("call __get_user_%P3" \
: "=a" (__ret_gu), "=r" (__val_gu) \
: "0" (ptr), "i" (sizeof(*(ptr)))); \
@@ -541,4 +541,3 @@ extern struct movsl_mask {
#endif

#endif /* _ASM_X86_UACCESS_H */
-
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index a451235..0288aa9 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -33,11 +33,12 @@
#include <asm/thread_info.h>
#include <asm/asm.h>
#include <asm/smap.h>
+#include "user-common.h"

.text
ENTRY(__get_user_1)
CFI_STARTPROC
- GET_THREAD_INFO(%_ASM_DX)
+ GET_THREAD_AND_SCHEDULE %_ASM_DX
cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
@@ -52,7 +53,7 @@ ENTRY(__get_user_2)
CFI_STARTPROC
add $1,%_ASM_AX
jc bad_get_user
- GET_THREAD_INFO(%_ASM_DX)
+ GET_THREAD_AND_SCHEDULE %_ASM_DX
cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
@@ -67,7 +68,7 @@ ENTRY(__get_user_4)
CFI_STARTPROC
add $3,%_ASM_AX
jc bad_get_user
- GET_THREAD_INFO(%_ASM_DX)
+ GET_THREAD_AND_SCHEDULE %_ASM_DX
cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
@@ -83,7 +84,7 @@ ENTRY(__get_user_8)
#ifdef CONFIG_X86_64
add $7,%_ASM_AX
jc bad_get_user
- GET_THREAD_INFO(%_ASM_DX)
+ GET_THREAD_AND_SCHEDULE %_ASM_DX
cmp TI_addr_limit(%_ASM_DX),%_ASM_AX
jae bad_get_user
ASM_STAC
--
1.8.3.1

2013-08-09 23:07:52

by Andi Kleen

[permalink] [raw]
Subject: [PATCH 12/13] x86: move __copy_*_nocache might fault check out of line

From: Andi Kleen <[email protected]>

We can just as well do the normal conditional resched check out of line.
This saves one function call.

Signed-off-by: Andi Kleen <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 6 ++++--
arch/x86/lib/copy_user_nocache_64.S | 7 +++++++
2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index b327057..831f4a3 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -243,12 +243,14 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)

extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);
+extern long __copy_user_nocache_might_fault(void *dst, const void __user *src,
+ unsigned size, int zerorest);

static inline int
__copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
{
- might_fault();
- return __copy_user_nocache(dst, src, size, 1);
+ might_fault_debug_only();
+ return __copy_user_nocache_might_fault(dst, src, size, 1);
}

static inline int
diff --git a/arch/x86/lib/copy_user_nocache_64.S b/arch/x86/lib/copy_user_nocache_64.S
index 6a4f43c..8e6b0bd 100644
--- a/arch/x86/lib/copy_user_nocache_64.S
+++ b/arch/x86/lib/copy_user_nocache_64.S
@@ -16,6 +16,7 @@
#include <asm/thread_info.h>
#include <asm/asm.h>
#include <asm/smap.h>
+#include "user-common.h"

.macro ALIGN_DESTINATION
#ifdef FIX_ALIGNMENT
@@ -43,6 +44,12 @@
#endif
.endm

+ENTRY(__copy_user_nocache_might_fault)
+ CFI_STARTPROC
+ GET_THREAD_AND_SCHEDULE %rax
+ CFI_ENDPROC
+ /* fall through */
+
/*
* copy_user_nocache - Uncached memory copy with exception handling
* This will force destination/source out of cache for more performance.
--
1.8.3.1

2013-08-10 04:43:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/09/2013 04:04 PM, Andi Kleen wrote:
>
> This patch kit is an attempt to get us back to sane code,
> mostly by doing proper inlining and doing sleep checks in the right
> place. Unfortunately I had to add one tree sweep to avoid a nasty
> include loop.
>
> It costs a bit of text space, but I think it's worth it
> (if only to keep my blood pressure down while reading ftrace logs...)
>

Looks nice at first glance.

Now, here is a bigger question: shouldn't we be deprecating/getting rid
of PREEMPT_VOLUNTARY in favor of PREEMPT?

-hpa

2013-08-10 05:57:03

by Mike Galbraith

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Fri, 2013-08-09 at 21:42 -0700, H. Peter Anvin wrote:
> On 08/09/2013 04:04 PM, Andi Kleen wrote:
> >
> > This patch kit is an attempt to get us back to sane code,
> > mostly by doing proper inlining and doing sleep checks in the right
> > place. Unfortunately I had to add one tree sweep to avoid a nasty
> > include loop.
> >
> > It costs a bit of text space, but I think it's worth it
> > (if only to keep my blood pressure down while reading ftrace logs...)
> >
>
> Looks nice at first glance.
>
> Now, here is a bigger question: shouldn't we be deprecating/getting rid
> of PREEMPT_VOLUNTARY in favor of PREEMPT?

I sure hope not, PREEMPT munches throughput. If you need PREEMPT, seems
to me what you _really_ need is PREEMPT_RT (the real deal), so
eventually deprecating PREEMPT makes more sense to me.

-Mike

2013-08-10 15:43:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

On Fri, Aug 9, 2013 at 4:04 PM, Andi Kleen <[email protected]> wrote:
> From: Andi Kleen <[email protected]>
>
> Move the cond_resched() check for CONFIG_PREEMPT_VOLUNTARY into
> the low level copy_*_user code. This avoids some code bloat and
> makes the check much more efficient by avoiding unnecessary function calls.

May I suggest going one step further, and just removing the
cond_resched() _entirely_, leaving just the debug test?

There really is zero reason for doing a cond_resched() for user
accesses. If they take a page fault, then yes, by all means do that
(and maybe we should add one to the page fault trap if we don't have
it already), but without a page fault they really aren't that
expensive.

We do many more expensive things without any cond_resched(), and doing
that cond_resched() really doesn't make much sense *unless* there's a
big expensive loop involved.
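
That would leave might_fault() as a pure debug hook, something like
(a sketch; the non-debug build compiles it away entirely):

	#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
	void might_fault(void);		/* out of line debug assertion only */
	#else
	static inline void might_fault(void) { }	/* no preemption point */
	#endif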

Most of this series looks fine, but I really think that we
could/should just take that extra step, and say "no, user accesses
don't imply that we need to check for scheduling".

Linus

2013-08-10 16:09:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/09/2013 10:55 PM, Mike Galbraith wrote:
>>
>> Now, here is a bigger question: shouldn't we be deprecating/getting rid
>> of PREEMPT_VOLUNTARY in favor of PREEMPT?
>
> I sure hope not, PREEMPT munches throughput. If you need PREEMPT, seems
> to me what you _really_ need is PREEMPT_RT (the real deal), so
> eventually deprecating PREEMPT makes more sense to me.
>

Do you have any quantification of "munches throughput?" It seems odd
that it would be worse than polling for preempt all over the kernel, but
perhaps the additional locking is what costs.

-hpa

2013-08-10 16:10:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

> Most of this series looks fine, but I really think that we
> could/should just take that extra step, and say "no, user accesses
> don't imply that we need to check for scheduling".

Hmm. I can do that, but wouldn't that make CONFIG_PREEMPT_VOLUNTARY
mostly equivalent to CONFIG_PREEMPT_NONE?

Need to check how many other reschedule tests are left then for VOLUNTARY.

-Andi

--
[email protected] -- Speaking for myself only.

2013-08-10 16:27:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

On Sat, Aug 10, 2013 at 9:10 AM, Andi Kleen <[email protected]> wrote:
>
> Hmm. I can do that, but wouldn't that make CONFIG_PREEMPT_VOLUNTARY
> mostly equivalent to CONFIG_PREEMPT_NONE?

According to the Kconfig help, PREEMPT_VOLUNTARY is about the
*explicit* preemption points. And we do have a lot of them in
"might_sleep()".

And personally, I think it makes a *lot* more sense to have a
"might_sleep()" in the MM allocators than it does to have it in
copy_from_user().

But yes, it's obviously a gray area on exactly where you'd want to put
them. But for a "get_user()" that can literally be just a few
instructions? Why would we make that a preemption point? Or even a
copy of few tens of bytes?

I'd *much* rather have preemption points in places that actually loop,
and that have real work associated with them.

Now, the *debug* logic is entirely different, of course. Maybe the
problem is that we have mixed up the two so badly, and we have
"might_sleep()" that implies more of a debug issue than a preemption
issue, and then people add those because they want the debug coverage
(and then you *absolutely* want it even for a single-byte user mode
access). And then because the concept is tied together with
preemption, we end up doing preemption even for that single-byte
access despite the fact that it makes no sense whatsoever.

So I think your patches are fine, but I do think we should take a
deeper look at this.

Linus

2013-08-10 16:43:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Sat, Aug 10, 2013 at 9:09 AM, H. Peter Anvin <[email protected]> wrote:
>
> Do you have any quantification of "munches throughput?" It seems odd
> that it would be worse than polling for preempt all over the kernel, but
> perhaps the additional locking is what costs.

Actually, the big thing for true preemption is not so much the preempt
count itself, but the fact that when the preempt count goes back to
zero we have that "check if we should have been preempted" thing.

And in particular, the conditional function call that goes along with it.

The thing is, even if that is almost never taken, just the fact that
there is a conditional function call very often makes code generation
*much* worse. A function that is a leaf function with no stack frame
with no preemption often turns into a non-leaf function with stack
frames when you enable preemption, just because it had a RCU read
region which disabled preemption.
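
For reference, preempt_enable() today expands to roughly:

	#define preempt_enable() \
	do { \
		preempt_enable_no_resched(); \
		barrier(); \
		preempt_check_resched(); \
	} while (0)

	#define preempt_check_resched() \
	do { \
		if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
			preempt_schedule(); \
	} while (0)

and it's that trailing conditional call the compiler has to assume can
happen on every enable.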

It's similar to the kind of code generation issue that Andi's patches
are trying to work on.

Andi did the "test and jump to a different section to call the
scheduler with registers saved" as an assembly stub in one of his
patches in this series exactly to avoid the cost of this for the
might_sleep() case, and generated that GET_THREAD_AND_SCHEDULE asm
macro for it. But look at that asm macro, and compare it to
"preempt_check_resched()"..

I have often wanted to have access to that kind of thing from C code.
It's not unusual. Think lock failure paths, not Tom Jones.

Linus

2013-08-10 17:18:46

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/10/2013 09:43 AM, Linus Torvalds wrote:
> On Sat, Aug 10, 2013 at 9:09 AM, H. Peter Anvin <[email protected]> wrote:
>>
>> Do you have any quantification of "munches throughput?" It seems odd
>> that it would be worse than polling for preempt all over the kernel, but
>> perhaps the additional locking is what costs.
>
> Actually, the big thing for true preemption is not so much the preempt
> count itself, but the fact that when the preempt count goes back to
> zero we have that "check if we should have been preempted" thing.
>
> And in particular, the conditional function call that goes along with it.
>
> The thing is, even if that is almost never taken, just the fact that
> there is a conditional function call very often makes code generation
> *much* worse. A function that is a leaf function with no stack frame
> with no preemption often turns into a non-leaf function with stack
> frames when you enable preemption, just because it had a RCU read
> region which disabled preemption.
>
> It's similar to the kind of code generation issue that Andi's patches
> are trying to work on.
>
> Andi did the "test and jump to a different section to call the
> scheduler with registers saved" as an assembly stub in one of his
> patches in this series exactly to avoid the cost of this for the
> might_sleep() case, and generated that GET_THREAD_AND_SCHEDULE asm
> macro for it. But look at that asm macro, and compare it to
> "preempt_check_resched()"..
>
> I have often wanted to have access to that kind of thing from C code.
> It's not unusual. Think lock failure paths, not Tom Jones.
>

Hmm... if that is really the big issue then I'm wondering if
preempt_enable() &c shouldn't be rewritten in assembly... if nothing
else to get the outbound call out of view of the C compiler; it could
even be turned into an exception instruction.

There are a few other things that one have to wonder about: the
preempt_count is currently located in the thread_info structure, but
since by definition we can't switch a thread that is preemption-locked
it should work in a percpu variable as well.

We could then play a really ugly stunt by marking NEED_RESCHED by adding
0x7fffffff to the counter. Then the whole sequence becomes something like:

subl $1,%fs:preempt_count
jno 1f
call __naked_preempt_schedule /* Or a trap */
1:

For architectures with conditional traps the trapping option becomes
even more attractive.

-hpa


2013-08-10 18:23:12

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

On Sat, Aug 10, 2013 at 09:27:33AM -0700, Linus Torvalds wrote:
> Now, the *debug* logic is entirely different, of course. Maybe the
> problem is that we have mixed up the two so badly, and we have
> "might_sleep()" that implies more of a debug issue than a preemption
> issue, and then people add those because they want the debug coverage
> (and then you *absolutely* want it even for a single-byte user
> mode access). And then because the concept is tied together with
> preemption, we end up doing preemption even for that single-byte
> access despite the fact that it makes no sense what-so-ever.

Sounds like the debug aspect and the preemption point addition need
to be sort-of split into two different functions/macros and each used
separately.

Something like: keep the current might_sleep and add a debug_sleep or
similar which does only the __might_sleep without the resched...
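
Roughly, in terms of the current definitions (debug_sleep being the
new name):

	/* debug assertion only, no preemption point */
	# define debug_sleep()	__might_sleep(__FILE__, __LINE__, 0)

	/* debug assertion plus explicit preemption point, as today */
	# define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__, 0); might_resched(); } while (0)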

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-08-10 18:51:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Sat, Aug 10, 2013 at 10:18 AM, H. Peter Anvin <[email protected]> wrote:
>
> We could then play a really ugly stunt by marking NEED_RESCHED by adding
> 0x7fffffff to the counter. Then the whole sequence becomes something like:
>
> subl $1,%fs:preempt_count
> jno 1f
> call __naked_preempt_schedule /* Or a trap */

This is indeed one of the few cases where we probably *could* use
trapv or something like that in theory, but those instructions tend to
be slow enough that even if you don't take the trap, you'd be better
off just testing by hand.

However, it's worse than you think. Preempt count is per-thread, not
per-cpu. So to access preempt-count, we currently have to look up
thread_info (which is per-cpu or stack-based).

I'd *like* to make preempt-count be per-cpu, and then copy it at
thread switch time, and it's been discussed. But as things are now,
preemption enable is quite expensive, and looks something like

movq %gs:kernel_stack,%rdx #, pfo_ret__
subl $1, -8124(%rdx) #, ti_22->preempt_count
movq %gs:kernel_stack,%rdx #, pfo_ret__
movq -8136(%rdx), %rdx # MEM[(const long unsigned int *)ti_27 + 16B], D.
andl $8, %edx #, D.34545
jne .L139 #,

and that's actually the *good* case (ie not counting any extra costs
of turning leaf functions into non-leaf ones).

That "kernel_stack" thing is actually getting the thread_info pointer,
and it doesn't get cached because gcc thinks the preempt_count value
might alias. Sad, sad, sad. We actually used to do better back when we
used actual tricks with the stack registers and used a const inline
asm to let gcc know it could re-use the value etc.

It would be *lovely* if we
(a) made preempt-count per-cpu and just copied it at thread-switch
(b) made the NEED_RESCHED bit be part of preempt-count (rather than
thread flags) and just made it the high bit

and then maybe we could just do

subl $1, %fs:preempt_count
js .L139

with the actual schedule call being done as an

asm volatile("call user_schedule": : :"memory");

that Andi introduced that doesn't pollute the register space. Note
that you still want the *test* to be done in C code, because together
with "unlikely()" you'd likely do pretty close to optimal code
generation, and hiding the decrement and test and conditional jump in
asm you wouldn't get the proper instruction scheduling and branch
following that gcc does.
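
Put together, preempt_enable() could then look something like (a
sketch only; it assumes preempt_count is a percpu int whose high bit
doubles as the NEED_RESCHED flag, and user_schedule rechecks properly):

	static __always_inline void preempt_enable(void)
	{
		barrier();
		/* result goes negative only when the NEED_RESCHED bit is set */
		if (unlikely(__this_cpu_dec_return(preempt_count) < 0))
			asm volatile("call user_schedule" : : : "memory");
	}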

I dunno. It looks like a fair amount of effort. But as things are now,
the code generation difference between PREEMPT_NONE and PREEMPT is
actually fairly noticeable. And PREEMPT_VOLUNTARY - which is supposed
to be almost as cheap as PREEMPT_NONE - has lots of bad cases too, as
Andi noticed.

Linus

2013-08-10 19:19:01

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

Right... I mentioned the need to move the preempt count into percpu and the other restructuring... all of that seems essential for this not to suck.

Linus Torvalds <[email protected]> wrote:
>On Sat, Aug 10, 2013 at 10:18 AM, H. Peter Anvin <[email protected]> wrote:
>>
>> We could then play a really ugly stunt by marking NEED_RESCHED by
>adding
>> 0x7fffffff to the counter. Then the whole sequence becomes something
>like:
>>
>> subl $1,%fs:preempt_count
>> jno 1f
>> call __naked_preempt_schedule /* Or a trap */
>
>This is indeed one of the few cases where we probably *could* use
>trapv or something like that in theory, but those instructions tend to
>be slow enough that even if you don't take the trap, you'd be better
>off just testing by hand.
>
>However, it's worse than you think. Preempt count is per-thread, not
>per-cpu. So to access preempt-count, we currently have to look up
>thread_info (which is per-cpu or stack-based).
>
>I'd *like* to make preempt-count be per-cpu, and then copy it at
>thread switch time, and it's been discussed. But as things are now,
>preemption enable is quite expensive, and looks something like
>
> movq %gs:kernel_stack,%rdx #, pfo_ret__
> subl $1, -8124(%rdx) #, ti_22->preempt_count
> movq %gs:kernel_stack,%rdx #, pfo_ret__
> movq -8136(%rdx), %rdx # MEM[(const long unsigned int *)ti_27 + 16B], D.
> andl $8, %edx #, D.34545
> jne .L139 #,
>
>and that's actually the *good* case (ie not counting any extra costs
>of turning leaf functions into non-leaf ones).
>
>That "kernel_stack" thing is actually getting the thread_info pointer,
>and it doesn't get cached because gcc thinks the preempt_count value
>might alias. Sad, sad, sad. We actually used to do better back when we
>used actual tricks with the stack registers and used a const inline
>asm to let gcc know it could re-use the value etc.
>
>It would be *lovely* if we
> (a) made preempt-count per-cpu and just copied it at thread-switch
> (b) made the NEED_RESCHED bit be part of preempt-count (rather than
>thread flags) and just made it the high bit
>
>and then maybe we could just do
>
> subl $1, %fs:preempt_count
> js .L139
>
>with the actual schedule call being done as an
>
> asm volatile("call user_schedule": : :"memory");
>
>that Andi introduced that doesn't pollute the register space. Note
>that you still want the *test* to be done in C code, because together
>with "unlikely()" you'd likely do pretty close to optimal code
>generation, and hiding the decrement and test and conditional jump in
>asm you wouldn't get the proper instruction scheduling and branch
>following that gcc does.
>
>I dunno. It looks like a fair amount of effort. But as things are now,
>the code generation difference between PREEMPT_NONE and PREEMPT is
>actually fairly noticeable. And PREEMPT_VOLUNTARY - which is supposed
>to be almost as cheap as PREEMPT_NONE - has lots of bad cases too, as
>Andi noticed.
>
> Linus

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-10 20:27:13

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/10/2013 11:51 AM, Linus Torvalds wrote:
> That "kernel_stack" thing is actually getting the thread_info pointer,
> and it doesn't get cached because gcc thinks the preempt_count value
> might alias.

This is just plain braindamaged. Somewhere on my list of things is to
merge thread_info and task_struct and have a pointer to the unified
structure.

-hpa

2013-08-10 20:38:46

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

On Sat, 10 August 2013 20:23:09 +0200, Borislav Petkov wrote:
>
> Sounds like the debug aspect and the preemption point addition need
> to be sort-of split into two different functions/macros and each used
> separately.
>
> Something like keep the current might_sleep and have debug_sleep or
> similar which does only __might_sleep without the resched...

I would argue for using "might_sleep" for the debug variant. Before
reading this thread I wasn't even aware of the non-debug aspect.
After all, might_sleep naturally reads like some assertion.
"might_preempt" for the non-debug version? "cond_preempt"?

Jörn

--
It is the mark of an educated mind to be able to entertain a thought
without accepting it.
-- Aristotle

2013-08-10 23:01:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/10/2013 11:51 AM, Linus Torvalds wrote:
> Note that you still want the *test* to be done in C code, because together
> with "unlikely()" you'd likely do pretty close to optimal code
> generation, and hiding the decrement and test and conditional jump in
> asm you wouldn't get the proper instruction scheduling and branch
> following that gcc does.

How much instruction scheduling can gcc actually do given that there is
a barrier involved? I guess some on the RISC architectures which need
load/subtract/store...

-hpa

2013-08-11 04:17:47

by Mike Galbraith

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Sat, 2013-08-10 at 09:09 -0700, H. Peter Anvin wrote:
> On 08/09/2013 10:55 PM, Mike Galbraith wrote:
> >>
> >> Now, here is a bigger question: shouldn't we be deprecating/getting rid
> >> of PREEMPT_VOUNTARY in favor of PREEMPT?
> >
> > I sure hope not, PREEMPT munches throughput. If you need PREEMPT, seems
> > to me what you _really_ need is PREEMPT_RT (the real deal), so
> > eventually depreciating PREEMPT makes more sense to me.
> >
>
> Do you have any quantification of "munches throughput?" It seems odd
> that it would be worse than polling for preempt all over the kernel, but
> perhaps the additional locking is what costs.

I hadn't compared in ages, so made some fresh samples.

Q6600 3.11-rc4 (three runs each; last column normalized to voluntary)

vmark
voluntary 169808 155826 154741 1.000
preempt 149354 124016 128436 .836

That should be ~worst case, it hates preemption.

tbench 8
voluntary 1027.96 1028.76 1044.60 1.000
preempt 929.06 935.01 928.64 .900

hackbench -l 10000
voluntary 23.146 23.124 23.230 1.000
preempt 25.065 24.633 24.789 1.071

kbuild vmlinux
voluntary 3m44.842s 3m42.975s 3m42.954s 1.000
preempt 3m46.141s 3m45.835s 3m45.953s 1.010

Compute load comparisons are boring 'course.

-Mike

2013-08-11 04:28:07

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/10/2013 09:17 PM, Mike Galbraith wrote:
>>
>> Do you have any quantification of "munches throughput?" It seems odd
>> that it would be worse than polling for preempt all over the kernel, but
>> perhaps the additional locking is what costs.
>
> I hadn't compared in ages, so made some fresh samples.
>
> Q6600 3.11-rc4
>
> vmark
> voluntary 169808 155826 154741 1.000
> preempt 149354 124016 128436 .836
>
> That should be ~worst case, it hates preemption.
>
> tbench 8
> voluntary 1027.96 1028.76 1044.60 1.000
> preempt 929.06 935.01 928.64 .900
>
> hackbench -l 10000
> voluntary 23.146 23.124 23.230 1.000
> preempt 25.065 24.633 24.789 1.071
>
> kbuild vmlinux
> voluntary 3m44.842s 3m42.975s 3m42.954s 1.000
> preempt 3m46.141s 3m45.835s 3m45.953s 1.010
>
> Compute load comparisons are boring 'course.
>

I presume voluntary is indistinguishable from no preemption at all?

Either way, that is definitely a reproducible test case, so if someone
is willing to take on optimizing preemption they can use vmark as the
litmus test. It would be really awesome if we genuinely could get the
cost of preemption down to where it just doesn't matter.

-hpa

2013-08-11 04:38:42

by Mike Galbraith

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Sat, 2013-08-10 at 21:27 -0700, H. Peter Anvin wrote:
> On 08/10/2013 09:17 PM, Mike Galbraith wrote:
> >>
> >> Do you have any quantification of "munches throughput?" It seems odd
> >> that it would be worse than polling for preempt all over the kernel, but
> >> perhaps the additional locking is what costs.
> >
> > I hadn't compared in ages, so made some fresh samples.
> >
> > Q6600 3.11-rc4
> >
> > vmark
> > voluntary 169808 155826 154741 1.000
> > preempt 149354 124016 128436 .836
> >
> > That should be ~worst case, it hates preemption.
> >
> > tbench 8
> > voluntary 1027.96 1028.76 1044.60 1.000
> > preempt 929.06 935.01 928.64 .900
> >
> > hackbench -l 10000
> > voluntary 23.146 23.124 23.230 1.000
> > preempt 25.065 24.633 24.789 1.071
> >
> > kbuild vmlinux
> > voluntary 3m44.842s 3m42.975s 3m42.954s 1.000
> > preempt 3m46.141s 3m45.835s 3m45.953s 1.010
> >
> > Compute load comparisons are boring 'course.
> >
>
> I presume voluntary is indistinguishable from no preemption at all?

No, all preemption options produce performance deltas.

> Either way, that is definitely a reproducible test case, so if someone
> is willing to take on optimizing preemption they can use vmark as the
> litmus test. It would be really awesome if we genuinely could get the
> cost of preemption down to where it just doesn't matter.

You have to eat more scheduler cycles, that's what PREEMPT does for a
living. Release a lock, wham.

-Mike

2013-08-11 04:57:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

That sounds like an issue with specific preemption policies.

Mike Galbraith <[email protected]> wrote:
>On Sat, 2013-08-10 at 21:27 -0700, H. Peter Anvin wrote:
>> On 08/10/2013 09:17 PM, Mike Galbraith wrote:
>> >>
>> >> Do you have any quantification of "munches throughput?" It seems
>odd
>> >> that it would be worse than polling for preempt all over the
>kernel, but
>> >> perhaps the additional locking is what costs.
>> >
>> > I hadn't compared in ages, so made some fresh samples.
>> >
>> > Q6600 3.11-rc4
>> >
>> > vmark
>> > voluntary 169808 155826 154741 1.000
>> > preempt 149354 124016 128436 .836
>> >
>> > That should be ~worst case, it hates preemption.
>> >
>> > tbench 8
>> > voluntary 1027.96 1028.76 1044.60 1.000
>> > preempt 929.06 935.01 928.64 .900
>> >
>> > hackbench -l 10000
>> > voluntary 23.146 23.124 23.230 1.000
>> > preempt 25.065 24.633 24.789 1.071
>> >
>> > kbuild vmlinux
>> > voluntary 3m44.842s 3m42.975s 3m42.954s 1.000
>> > preempt 3m46.141s 3m45.835s 3m45.953s 1.010
>> >
>> > Compute load comparisons are boring 'course.
>> >
>>
>> I presume voluntary is indistinguishable from no preemption at all?
>
>No, all preemption options produce performance deltas.
>
>> Either way, that is definitely a reproducible test case, so if
>someone
>> is willing to take on optimizing preemption they can use vmark as the
>> litmus test. It would be really awesome if we genuinely could get
>the
>> cost of preemption down to where it just doesn't matter.
>
>You have to eat more scheduler cycles, that's what PREEMPT does for a
>living. Release a lock, wham.
>
>-Mike

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-11 06:00:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Sat, 2013-08-10 at 21:57 -0700, H. Peter Anvin wrote:
> That sounds like an issue with specific preemption policies.

Actually, voluntary/nopreempt delta for _these_ loads was nil.

-Mike

2013-08-13 18:10:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On 08/09/2013 04:04 PM, Andi Kleen wrote:
> The x86 user access functions (*_user) were originally very well tuned,
> with partial inline code and other optimizations.
>
> Then over time various new checks -- particularly the sleep checks for
> a voluntary preempt kernel -- destroyed a lot of the tunings
>

Hi Andi,

Are you going to respin this patchset to address the feedback?

-hpa

2013-08-13 18:13:00

by Andi Kleen

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Tue, Aug 13, 2013 at 11:09:21AM -0700, H. Peter Anvin wrote:
> On 08/09/2013 04:04 PM, Andi Kleen wrote:
> > The x86 user access functions (*_user) were originally very well tuned,
> > with partial inline code and other optimizations.
> >
> > Then over time various new checks -- particularly the sleep checks for
> > a voluntary preempt kernel -- destroyed a lot of the tunings
> >
>
> Hi Andi,
>
> Are you going to respin this patchset to address the feedback?

Yes. I'm dropping all the reschedule checks in the user access paths.

But you could already merge the first patch, it's independent
of all the others and not affected by the feedback.

-Andi

2013-08-14 18:23:57

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 07/13] Add might_fault_debug_only()

On Fri, Aug 09, 2013 at 04:04:14PM -0700, Andi Kleen wrote:
> From: Andi Kleen <[email protected]>
>
> Add a might_fault_debug_only() that only does something in the PROVE_LOCKING
> case, but does not cond_resched for PREEMPT_VOLUNTARY. This is for
> cases where the cond_resched is done elsewhere.
>
> Signed-off-by: Andi Kleen <[email protected]>
> ---
> include/linux/sched.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 773f21d..bb7a08a 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2473,11 +2473,13 @@ static inline void cond_resched_rcu(void)
>
> #ifdef CONFIG_PROVE_LOCKING
> void might_fault(void);
> +#define might_fault_debug_only() might_fault()
> #else
> static inline void might_fault(void)
> {
> might_sleep();
> }
> +#define might_fault_debug_only() do {} while(0)

Hi Andi, this is against which kernel version?
In 3.11-rc3 I see:

#if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
void might_fault(void);
#else
static inline void might_fault(void) { }
#endif

So it's not clear to me how it's different from your might_fault_debug_only ..


> #endif
>
> /*
> --
> 1.8.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-08-14 18:26:28

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

On Fri, Aug 09, 2013 at 04:04:07PM -0700, Andi Kleen wrote:
> The x86 user access functions (*_user) were originally very well tuned,
> with partial inline code and other optimizations.
>
> Then over time various new checks -- particularly the sleep checks for
> a voluntary preempt kernel -- destroyed a lot of the tunings
>
> A typical user access operation is now doing multiple useless
> function calls. Also, without force inlining, gcc's inlining
> policy makes it even worse, adding more unnecessary calls.
>
> Here's a typical example from ftrace:
>
> 10) | might_fault() {
> 10) | _cond_resched() {
> 10) | should_resched() {
> 10) | need_resched() {
> 10) 0.063 us | test_ti_thread_flag();
> 10) 0.643 us | }
> 10) 1.238 us | }
> 10) 1.845 us | }
> 10) 2.438 us | }
>
> So we spent 2.5us doing nothing (ok it's a bit less without
> ftrace, but still pretty bad)

Hmm, which kernel version is this?

I thought I fixed this for good in
commit 114276ac0a3beb9c391a410349bd770653e185ce
Author: Michael S. Tsirkin <[email protected]>
Date: Sun May 26 17:32:13 2013 +0300
mm, sched: Drop voluntary schedule from might_fault()

might_fault shouldn't be calling cond_resched anymore.

Did this get reverted at some point?
I hope not, there's more code relying on this now
(see e.g. 662bbcb2747c2422cf98d3d97619509379eee466)

> Then in other cases we would have an out of line function,
> but would actually do the might_sleep() checks in the inlined
> caller. This doesn't make any sense at all.
>
> There were also a few other problems, for example the x86-64 uaccess
> code regularly falls back to string functions, even though a simple
> mov would be enough. For example every futex access to the lock
> variable would actually use string instructions, even though
> it's just 4 bytes.
>
> This patch kit is an attempt to get us back to sane code,
> mostly by doing proper inlining and doing sleep checks in the right
> place. Unfortunately I had to add one tree sweep to avoid a nasty
> include loop.
>
> It costs a bit of text space, but I think it's worth it
> (if only to keep my blood pressure down while reading ftrace logs...)
>
> I haven't done any particular benchmarks, but important low level
> functions just ought to be fast.
>
> 64bit:
> 13249492 1881328 1159168 16289988 f890c4 vmlinux-before-uaccess
> 13260877 1877232 1159168 16297277 f8ad3d vmlinux-uaccess
> + 11k, +0.08%
>
> 32bit:
> 11223248 899512 1916928 14039688 d63a88 vmlinux-before-uaccess
> 11230358 895416 1916928 14042702 d6464e vmlinux-uaccess
> + 7k, +0.06%
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-08-14 22:08:19

by Andi Kleen

[permalink] [raw]
Subject: Re: Re-tune x86 uaccess code for PREEMPT_VOLUNTARY

> I thought I fixed this for good in
> commit 114276ac0a3beb9c391a410349bd770653e185ce
> Author: Michael S. Tsirkin <[email protected]>
> Date: Sun May 26 17:32:13 2013 +0300
> mm, sched: Drop voluntary schedule from might_fault()

You're right, this was an older kernel. So you already fixed
part of it. Great.

I'll remove those parts.

-Andi

2013-08-15 05:02:48

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

On Sat, Aug 10, 2013 at 08:42:58AM -0700, Linus Torvalds wrote:
> On Fri, Aug 9, 2013 at 4:04 PM, Andi Kleen <[email protected]> wrote:
> > From: Andi Kleen <[email protected]>
> >
> > Move the cond_resched() check for CONFIG_PREEMPT_VOLUNTARY into
> > the low level copy_*_user code. This avoids some code bloat and
> > makes the check much more efficient by avoiding unnecessary function calls.
>
> May I suggest going one step further, and just removing the
> cond_resched() _entirely_, leaving just the debug test?
>
> There really is zero reason for doing a cond_resched() for user
> accesses. If they take a page fault, then yes, by all means do that
> (and maybe we should add one to the page fault trap if we don't have
> it already), but without a page fault they really aren't that
> expensive.
>
> We do many more expensive things without any cond_resched(), and doing
> that cond_resched() really doesn't make much sense *unless* there's a
> big expensive loop involved.
>
> Most of this series looks fine, but I really think that we
> could/should just take that extra step, and say "no, user accesses
> don't imply that we need to check for scheduling".
>
> Linus

In fact we have been doing exactly this since 3.11-rc1.

--
MST

2013-08-20 21:03:36

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 10/13] x86: Move cond resched for copy_{from,to}_user into low level code 64bit

>> Hmm. I can do that, but wouldn't that make CONFIG_PREEMPT_VOLUNTARY
>> mostly equivalent to CONFIG_PREEMPT_NONE?
>
> According to the Kconfig help, PREEMPT_VOLUNTARY is about the
> *explicit* preemption points. And we do have a lot of them in
> "might_sleep()".
>
> And personally, I think it makes a *lot* more sense to have a
> "might_sleep()" in the MM allocators than it does to have it in
> copy_from_user().

AFAIK, MM allocation already does that.

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, nodemask_t *nodemask)
{
(snip)
might_sleep_if(gfp_mask & __GFP_WAIT);


btw, sorry for the very late response. I hadn't noticed this thread.