2015-07-18 03:18:23

by Ingo Molnar

[permalink] [raw]
Subject: [GIT PULL] x86 fixes

Linus,

Please pull the latest x86-urgent-for-linus git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-urgent-for-linus

# HEAD: 5aaeb5c01c5b6c0be7b7aadbf3ace9f3a4458c3d x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86

Two families of fixes:

- Fix an FPU context related boot crash on newer x86 hardware with larger
context sizes than what most people test. To fix this without ugly kludges or
extensive reverts we had to touch core task allocator, to allow x86 to
determine the task size dynamically, at boot time.

I've tested it on a number of x86 platforms, and I cross-built it to a handful
of architectures:

(warns) (warns)
testing x86-64: -git: pass ( 0), -tip: pass ( 0)
testing x86-32: -git: pass ( 0), -tip: pass ( 0)
testing arm: -git: pass ( 1359), -tip: pass ( 1359)
testing cris: -git: pass ( 1031), -tip: pass ( 1031)
testing m32r: -git: pass ( 1135), -tip: pass ( 1135)
testing m68k: -git: pass ( 1471), -tip: pass ( 1471)
testing mips: -git: pass ( 1162), -tip: pass ( 1162)
testing mn10300: -git: pass ( 1058), -tip: pass ( 1058)
testing parisc: -git: pass ( 1846), -tip: pass ( 1846)
testing sparc: -git: pass ( 1185), -tip: pass ( 1185)

... so I hope the cross-arch impact 'none', as intended.

(by Dave Hansen)

- Fix various NMI handling related bugs unearthed by the big asm code rewrite
and generally make the NMI code more robust and more maintainable while at it.
These changes are a bit late in the cycle, I hope they are still acceptable.

(by Andy Lutomirski)

out-of-topic modifications in x86-urgent-for-linus:
-----------------------------------------------------
arch/Kconfig # 5aaeb5c01c5b: x86/fpu, sched: Introduce CO
fs/proc/kcore.c # 5aaeb5c01c5b: x86/fpu, sched: Introduce CO
kernel/fork.c # 0c8c0f03e3a2: x86/fpu, sched: Dynamically

Thanks,

Ingo

------------------>
Andy Lutomirski (9):
x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
x86/nmi/64: Remove asm code that saves CR2
x86/nmi/64: Switch stacks on userspace NMI entry
x86/nmi/64: Improve nested NMI comments
x86/nmi/64: Reorder nested NMI checks
x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection
x86/nmi/64: Minor asm simplification
x86/nmi/64: Make the "NMI executing" variable more consistent
x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code

Dave Hansen (1):
x86/fpu, sched: Dynamically allocate 'struct fpu'

Ingo Molnar (1):
x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86


arch/Kconfig | 4 +
arch/x86/Kconfig | 1 +
arch/x86/Kconfig.debug | 12 ++
arch/x86/entry/entry_64.S | 299 ++++++++++++++++++++++++++-------------
arch/x86/include/asm/fpu/types.h | 72 +++++-----
arch/x86/include/asm/processor.h | 10 +-
arch/x86/kernel/fpu/init.c | 40 ++++++
arch/x86/kernel/nmi.c | 123 +++++++---------
arch/x86/kernel/process.c | 2 +-
fs/proc/kcore.c | 4 +-
include/linux/sched.h | 16 ++-
kernel/fork.c | 7 +-
12 files changed, 376 insertions(+), 214 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index bec6666a3cc4..8a8ea7110de8 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -221,6 +221,10 @@ config ARCH_TASK_STRUCT_ALLOCATOR
config ARCH_THREAD_INFO_ALLOCATOR
bool

+# Select if arch wants to size task_struct dynamically via arch_task_struct_size:
+config ARCH_WANTS_DYNAMIC_TASK_STRUCT
+ bool
+
config HAVE_REGS_AND_STACK_ACCESS_API
bool
help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 3dbb7e7909ca..b3a1a5d77d92 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,6 +41,7 @@ config X86
select ARCH_USE_CMPXCHG_LOCKREF if X86_64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+ select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_IPC_PARSE_VERSION if X86_32
select ARCH_WANT_OPTIONAL_GPIOLIB
diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index a15893d17c55..d8c0d3266173 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -297,6 +297,18 @@ config OPTIMIZE_INLINING

If unsure, say N.

+config DEBUG_ENTRY
+ bool "Debug low-level entry code"
+ depends on DEBUG_KERNEL
+ ---help---
+ This option enables sanity checks in x86's low-level entry code.
+ Some of these sanity checks may slow down kernel entries and
+ exits or otherwise impact performance.
+
+ This is currently used to help test NMI code.
+
+ If unsure, say N.
+
config DEBUG_NMI_SELFTEST
bool "NMI Selftest"
depends on DEBUG_KERNEL && X86_LOCAL_APIC
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 3bb2c4302df1..8cb3e438f21e 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1237,11 +1237,12 @@ ENTRY(nmi)
* If the variable is not set and the stack is not the NMI
* stack then:
* o Set the special variable on the stack
- * o Copy the interrupt frame into a "saved" location on the stack
- * o Copy the interrupt frame into a "copy" location on the stack
+ * o Copy the interrupt frame into an "outermost" location on the
+ * stack
+ * o Copy the interrupt frame into an "iret" location on the stack
* o Continue processing the NMI
* If the variable is set or the previous stack is the NMI stack:
- * o Modify the "copy" location to jump to the repeate_nmi
+ * o Modify the "iret" location to jump to the repeat_nmi
* o return back to the first NMI
*
* Now on exit of the first NMI, we first clear the stack variable
@@ -1250,31 +1251,151 @@ ENTRY(nmi)
* a nested NMI that updated the copy interrupt stack frame, a
* jump will be made to the repeat_nmi code that will handle the second
* NMI.
+ *
+ * However, espfix prevents us from directly returning to userspace
+ * with a single IRET instruction. Similarly, IRET to user mode
+ * can fault. We therefore handle NMIs from user space like
+ * other IST entries.
*/

/* Use %rdx as our temp variable throughout */
pushq %rdx

+ testb $3, CS-RIP+8(%rsp)
+ jz .Lnmi_from_kernel
+
+ /*
+ * NMI from user mode. We need to run on the thread stack, but we
+ * can't go through the normal entry paths: NMIs are masked, and
+ * we don't want to enable interrupts, because then we'll end
+ * up in an awkward situation in which IRQs are on but NMIs
+ * are off.
+ */
+
+ SWAPGS
+ cld
+ movq %rsp, %rdx
+ movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+ pushq 5*8(%rdx) /* pt_regs->ss */
+ pushq 4*8(%rdx) /* pt_regs->rsp */
+ pushq 3*8(%rdx) /* pt_regs->flags */
+ pushq 2*8(%rdx) /* pt_regs->cs */
+ pushq 1*8(%rdx) /* pt_regs->rip */
+ pushq $-1 /* pt_regs->orig_ax */
+ pushq %rdi /* pt_regs->di */
+ pushq %rsi /* pt_regs->si */
+ pushq (%rdx) /* pt_regs->dx */
+ pushq %rcx /* pt_regs->cx */
+ pushq %rax /* pt_regs->ax */
+ pushq %r8 /* pt_regs->r8 */
+ pushq %r9 /* pt_regs->r9 */
+ pushq %r10 /* pt_regs->r10 */
+ pushq %r11 /* pt_regs->r11 */
+ pushq %rbx /* pt_regs->rbx */
+ pushq %rbp /* pt_regs->rbp */
+ pushq %r12 /* pt_regs->r12 */
+ pushq %r13 /* pt_regs->r13 */
+ pushq %r14 /* pt_regs->r14 */
+ pushq %r15 /* pt_regs->r15 */
+
+ /*
+ * At this point we no longer need to worry about stack damage
+ * due to nesting -- we're on the normal thread stack and we're
+ * done with the NMI stack.
+ */
+
+ movq %rsp, %rdi
+ movq $-1, %rsi
+ call do_nmi
+
+ /*
+ * Return back to user mode. We must *not* do the normal exit
+ * work, because we don't want to enable interrupts. Fortunately,
+ * do_nmi doesn't modify pt_regs.
+ */
+ SWAPGS
+ jmp restore_c_regs_and_iret
+
+.Lnmi_from_kernel:
+ /*
+ * Here's what our stack frame will look like:
+ * +---------------------------------------------------------+
+ * | original SS |
+ * | original Return RSP |
+ * | original RFLAGS |
+ * | original CS |
+ * | original RIP |
+ * +---------------------------------------------------------+
+ * | temp storage for rdx |
+ * +---------------------------------------------------------+
+ * | "NMI executing" variable |
+ * +---------------------------------------------------------+
+ * | iret SS } Copied from "outermost" frame |
+ * | iret Return RSP } on each loop iteration; overwritten |
+ * | iret RFLAGS } by a nested NMI to force another |
+ * | iret CS } iteration if needed. |
+ * | iret RIP } |
+ * +---------------------------------------------------------+
+ * | outermost SS } initialized in first_nmi; |
+ * | outermost Return RSP } will not be changed before |
+ * | outermost RFLAGS } NMI processing is done. |
+ * | outermost CS } Copied to "iret" frame on each |
+ * | outermost RIP } iteration. |
+ * +---------------------------------------------------------+
+ * | pt_regs |
+ * +---------------------------------------------------------+
+ *
+ * The "original" frame is used by hardware. Before re-enabling
+ * NMIs, we need to be done with it, and we need to leave enough
+ * space for the asm code here.
+ *
+ * We return by executing IRET while RSP points to the "iret" frame.
+ * That will either return for real or it will loop back into NMI
+ * processing.
+ *
+ * The "outermost" frame is copied to the "iret" frame on each
+ * iteration of the loop, so each iteration starts with the "iret"
+ * frame pointing to the final return target.
+ */
+
/*
- * If %cs was not the kernel segment, then the NMI triggered in user
- * space, which means it is definitely not nested.
+ * Determine whether we're a nested NMI.
+ *
+ * If we interrupted kernel code between repeat_nmi and
+ * end_repeat_nmi, then we are a nested NMI. We must not
+ * modify the "iret" frame because it's being written by
+ * the outer NMI. That's okay; the outer NMI handler is
+ * about to about to call do_nmi anyway, so we can just
+ * resume the outer NMI.
*/
- cmpl $__KERNEL_CS, 16(%rsp)
- jne first_nmi
+
+ movq $repeat_nmi, %rdx
+ cmpq 8(%rsp), %rdx
+ ja 1f
+ movq $end_repeat_nmi, %rdx
+ cmpq 8(%rsp), %rdx
+ ja nested_nmi_out
+1:

/*
- * Check the special variable on the stack to see if NMIs are
- * executing.
+ * Now check "NMI executing". If it's set, then we're nested.
+ * This will not detect if we interrupted an outer NMI just
+ * before IRET.
*/
cmpl $1, -8(%rsp)
je nested_nmi

/*
- * Now test if the previous stack was an NMI stack.
- * We need the double check. We check the NMI stack to satisfy the
- * race when the first NMI clears the variable before returning.
- * We check the variable because the first NMI could be in a
- * breakpoint routine using a breakpoint stack.
+ * Now test if the previous stack was an NMI stack. This covers
+ * the case where we interrupt an outer NMI after it clears
+ * "NMI executing" but before IRET. We need to be careful, though:
+ * there is one case in which RSP could point to the NMI stack
+ * despite there being no NMI active: naughty userspace controls
+ * RSP at the very beginning of the SYSCALL targets. We can
+ * pull a fast one on naughty userspace, though: we program
+ * SYSCALL to mask DF, so userspace cannot cause DF to be set
+ * if it controls the kernel's RSP. We set DF before we clear
+ * "NMI executing".
*/
lea 6*8(%rsp), %rdx
/* Compare the NMI stack (rdx) with the stack we came from (4*8(%rsp)) */
@@ -1286,25 +1407,20 @@ ENTRY(nmi)
cmpq %rdx, 4*8(%rsp)
/* If it is below the NMI stack, it is a normal NMI */
jb first_nmi
- /* Ah, it is within the NMI stack, treat it as nested */
+
+ /* Ah, it is within the NMI stack. */
+
+ testb $(X86_EFLAGS_DF >> 8), (3*8 + 1)(%rsp)
+ jz first_nmi /* RSP was user controlled. */
+
+ /* This is a nested NMI. */

nested_nmi:
/*
- * Do nothing if we interrupted the fixup in repeat_nmi.
- * It's about to repeat the NMI handler, so we are fine
- * with ignoring this one.
+ * Modify the "iret" frame to point to repeat_nmi, forcing another
+ * iteration of NMI handling.
*/
- movq $repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja 1f
- movq $end_repeat_nmi, %rdx
- cmpq 8(%rsp), %rdx
- ja nested_nmi_out
-
-1:
- /* Set up the interrupted NMIs stack to jump to repeat_nmi */
- leaq -1*8(%rsp), %rdx
- movq %rdx, %rsp
+ subq $8, %rsp
leaq -10*8(%rsp), %rdx
pushq $__KERNEL_DS
pushq %rdx
@@ -1318,61 +1434,42 @@ ENTRY(nmi)
nested_nmi_out:
popq %rdx

- /* No need to check faults here */
+ /* We are returning to kernel mode, so this cannot result in a fault. */
INTERRUPT_RETURN

first_nmi:
- /*
- * Because nested NMIs will use the pushed location that we
- * stored in rdx, we must keep that space available.
- * Here's what our stack frame will look like:
- * +-------------------------+
- * | original SS |
- * | original Return RSP |
- * | original RFLAGS |
- * | original CS |
- * | original RIP |
- * +-------------------------+
- * | temp storage for rdx |
- * +-------------------------+
- * | NMI executing variable |
- * +-------------------------+
- * | copied SS |
- * | copied Return RSP |
- * | copied RFLAGS |
- * | copied CS |
- * | copied RIP |
- * +-------------------------+
- * | Saved SS |
- * | Saved Return RSP |
- * | Saved RFLAGS |
- * | Saved CS |
- * | Saved RIP |
- * +-------------------------+
- * | pt_regs |
- * +-------------------------+
- *
- * The saved stack frame is used to fix up the copied stack frame
- * that a nested NMI may change to make the interrupted NMI iret jump
- * to the repeat_nmi. The original stack frame and the temp storage
- * is also used by nested NMIs and can not be trusted on exit.
- */
- /* Do not pop rdx, nested NMIs will corrupt that part of the stack */
+ /* Restore rdx. */
movq (%rsp), %rdx

- /* Set the NMI executing variable on the stack. */
- pushq $1
+ /* Make room for "NMI executing". */
+ pushq $0

- /* Leave room for the "copied" frame */
+ /* Leave room for the "iret" frame */
subq $(5*8), %rsp

- /* Copy the stack frame to the Saved frame */
+ /* Copy the "original" frame to the "outermost" frame */
.rept 5
pushq 11*8(%rsp)
.endr

/* Everything up to here is safe from nested NMIs */

+#ifdef CONFIG_DEBUG_ENTRY
+ /*
+ * For ease of testing, unmask NMIs right away. Disabled by
+ * default because IRET is very expensive.
+ */
+ pushq $0 /* SS */
+ pushq %rsp /* RSP (minus 8 because of the previous push) */
+ addq $8, (%rsp) /* Fix up RSP */
+ pushfq /* RFLAGS */
+ pushq $__KERNEL_CS /* CS */
+ pushq $1f /* RIP */
+ INTERRUPT_RETURN /* continues at repeat_nmi below */
+1:
+#endif
+
+repeat_nmi:
/*
* If there was a nested NMI, the first NMI's iret will return
* here. But NMIs are still enabled and we can take another
@@ -1381,16 +1478,20 @@ ENTRY(nmi)
* it will just return, as we are about to repeat an NMI anyway.
* This makes it safe to copy to the stack frame that a nested
* NMI will update.
+ *
+ * RSP is pointing to "outermost RIP". gsbase is unknown, but, if
+ * we're repeating an NMI, gsbase has the same value that it had on
+ * the first iteration. paranoid_entry will load the kernel
+ * gsbase if needed before we call do_nmi. "NMI executing"
+ * is zero.
*/
-repeat_nmi:
+ movq $1, 10*8(%rsp) /* Set "NMI executing". */
+
/*
- * Update the stack variable to say we are still in NMI (the update
- * is benign for the non-repeat case, where 1 was pushed just above
- * to this very stack slot).
+ * Copy the "outermost" frame to the "iret" frame. NMIs that nest
+ * here must not modify the "iret" frame while we're writing to
+ * it or it will end up containing garbage.
*/
- movq $1, 10*8(%rsp)
-
- /* Make another copy, this one may be modified by nested NMIs */
addq $(10*8), %rsp
.rept 5
pushq -6*8(%rsp)
@@ -1399,9 +1500,9 @@ ENTRY(nmi)
end_repeat_nmi:

/*
- * Everything below this point can be preempted by a nested
- * NMI if the first NMI took an exception and reset our iret stack
- * so that we repeat another NMI.
+ * Everything below this point can be preempted by a nested NMI.
+ * If this happens, then the inner NMI will change the "iret"
+ * frame to point back to repeat_nmi.
*/
pushq $-1 /* ORIG_RAX: no syscall to restart */
ALLOC_PT_GPREGS_ON_STACK
@@ -1415,28 +1516,11 @@ ENTRY(nmi)
*/
call paranoid_entry

- /*
- * Save off the CR2 register. If we take a page fault in the NMI then
- * it could corrupt the CR2 value. If the NMI preempts a page fault
- * handler before it was able to read the CR2 register, and then the
- * NMI itself takes a page fault, the page fault that was preempted
- * will read the information from the NMI page fault and not the
- * origin fault. Save it off and restore it if it changes.
- * Use the r12 callee-saved register.
- */
- movq %cr2, %r12
-
/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
movq %rsp, %rdi
movq $-1, %rsi
call do_nmi

- /* Did the NMI take a page fault? Restore cr2 if it did */
- movq %cr2, %rcx
- cmpq %rcx, %r12
- je 1f
- movq %r12, %cr2
-1:
testl %ebx, %ebx /* swapgs needed? */
jnz nmi_restore
nmi_swapgs:
@@ -1444,11 +1528,26 @@ ENTRY(nmi)
nmi_restore:
RESTORE_EXTRA_REGS
RESTORE_C_REGS
- /* Pop the extra iret frame at once */
+
+ /* Point RSP at the "iret" frame. */
REMOVE_PT_GPREGS_FROM_STACK 6*8

- /* Clear the NMI executing stack variable */
- movq $0, 5*8(%rsp)
+ /*
+ * Clear "NMI executing". Set DF first so that we can easily
+ * distinguish the remaining code between here and IRET from
+ * the SYSCALL entry and exit paths. On a native kernel, we
+ * could just inspect RIP, but, on paravirt kernels,
+ * INTERRUPT_RETURN can translate into a jump into a
+ * hypercall page.
+ */
+ std
+ movq $0, 5*8(%rsp) /* clear "NMI executing" */
+
+ /*
+ * INTERRUPT_RETURN reads the "iret" frame and exits the NMI
+ * stack in a single instruction. We are returning to kernel
+ * mode, so this cannot result in a fault.
+ */
INTERRUPT_RETURN
END(nmi)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 0637826292de..c49c5173158e 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -189,6 +189,7 @@ union fpregs_state {
struct fxregs_state fxsave;
struct swregs_state soft;
struct xregs_state xsave;
+ u8 __padding[PAGE_SIZE];
};

/*
@@ -198,40 +199,6 @@ union fpregs_state {
*/
struct fpu {
/*
- * @state:
- *
- * In-memory copy of all FPU registers that we save/restore
- * over context switches. If the task is using the FPU then
- * the registers in the FPU are more recent than this state
- * copy. If the task context-switches away then they get
- * saved here and represent the FPU state.
- *
- * After context switches there may be a (short) time period
- * during which the in-FPU hardware registers are unchanged
- * and still perfectly match this state, if the tasks
- * scheduled afterwards are not using the FPU.
- *
- * This is the 'lazy restore' window of optimization, which
- * we track though 'fpu_fpregs_owner_ctx' and 'fpu->last_cpu'.
- *
- * We detect whether a subsequent task uses the FPU via setting
- * CR0::TS to 1, which causes any FPU use to raise a #NM fault.
- *
- * During this window, if the task gets scheduled again, we
- * might be able to skip having to do a restore from this
- * memory buffer to the hardware registers - at the cost of
- * incurring the overhead of #NM fault traps.
- *
- * Note that on modern CPUs that support the XSAVEOPT (or other
- * optimized XSAVE instructions), we don't use #NM traps anymore,
- * as the hardware can track whether FPU registers need saving
- * or not. On such CPUs we activate the non-lazy ('eagerfpu')
- * logic, which unconditionally saves/restores all FPU state
- * across context switches. (if FPU state exists.)
- */
- union fpregs_state state;
-
- /*
* @last_cpu:
*
* Records the last CPU on which this context was loaded into
@@ -288,6 +255,43 @@ struct fpu {
* deal with bursty apps that only use the FPU for a short time:
*/
unsigned char counter;
+ /*
+ * @state:
+ *
+ * In-memory copy of all FPU registers that we save/restore
+ * over context switches. If the task is using the FPU then
+ * the registers in the FPU are more recent than this state
+ * copy. If the task context-switches away then they get
+ * saved here and represent the FPU state.
+ *
+ * After context switches there may be a (short) time period
+ * during which the in-FPU hardware registers are unchanged
+ * and still perfectly match this state, if the tasks
+ * scheduled afterwards are not using the FPU.
+ *
+ * This is the 'lazy restore' window of optimization, which
+ * we track though 'fpu_fpregs_owner_ctx' and 'fpu->last_cpu'.
+ *
+ * We detect whether a subsequent task uses the FPU via setting
+ * CR0::TS to 1, which causes any FPU use to raise a #NM fault.
+ *
+ * During this window, if the task gets scheduled again, we
+ * might be able to skip having to do a restore from this
+ * memory buffer to the hardware registers - at the cost of
+ * incurring the overhead of #NM fault traps.
+ *
+ * Note that on modern CPUs that support the XSAVEOPT (or other
+ * optimized XSAVE instructions), we don't use #NM traps anymore,
+ * as the hardware can track whether FPU registers need saving
+ * or not. On such CPUs we activate the non-lazy ('eagerfpu')
+ * logic, which unconditionally saves/restores all FPU state
+ * across context switches. (if FPU state exists.)
+ */
+ union fpregs_state state;
+ /*
+ * WARNING: 'state' is dynamically-sized. Do not put
+ * anything after it here.
+ */
};

#endif /* _ASM_X86_FPU_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 43e6519df0d5..944f1785ed0d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -390,9 +390,6 @@ struct thread_struct {
#endif
unsigned long gs;

- /* Floating point and extended processor state */
- struct fpu fpu;
-
/* Save middle states of ptrace breakpoints */
struct perf_event *ptrace_bps[HBP_NUM];
/* Debug status used for traps, single steps, etc... */
@@ -418,6 +415,13 @@ struct thread_struct {
unsigned long iopl;
/* Max allowed port in the bitmap, in bytes: */
unsigned io_bitmap_max;
+
+ /* Floating point and extended processor state */
+ struct fpu fpu;
+ /*
+ * WARNING: 'fpu' is dynamically-sized. It *MUST* be at
+ * the end.
+ */
};

/*
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 32826791e675..0b39173dd971 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -4,6 +4,8 @@
#include <asm/fpu/internal.h>
#include <asm/tlbflush.h>

+#include <linux/sched.h>
+
/*
* Initialize the TS bit in CR0 according to the style of context-switches
* we are using:
@@ -136,6 +138,43 @@ static void __init fpu__init_system_generic(void)
unsigned int xstate_size;
EXPORT_SYMBOL_GPL(xstate_size);

+/* Enforce that 'MEMBER' is the last field of 'TYPE': */
+#define CHECK_MEMBER_AT_END_OF(TYPE, MEMBER) \
+ BUILD_BUG_ON(sizeof(TYPE) != offsetofend(TYPE, MEMBER))
+
+/*
+ * We append the 'struct fpu' to the task_struct:
+ */
+static void __init fpu__init_task_struct_size(void)
+{
+ int task_size = sizeof(struct task_struct);
+
+ /*
+ * Subtract off the static size of the register state.
+ * It potentially has a bunch of padding.
+ */
+ task_size -= sizeof(((struct task_struct *)0)->thread.fpu.state);
+
+ /*
+ * Add back the dynamically-calculated register state
+ * size.
+ */
+ task_size += xstate_size;
+
+ /*
+ * We dynamically size 'struct fpu', so we require that
+ * it be at the end of 'thread_struct' and that
+ * 'thread_struct' be at the end of 'task_struct'. If
+ * you hit a compile error here, check the structure to
+ * see if something got added to the end.
+ */
+ CHECK_MEMBER_AT_END_OF(struct fpu, state);
+ CHECK_MEMBER_AT_END_OF(struct thread_struct, fpu);
+ CHECK_MEMBER_AT_END_OF(struct task_struct, thread);
+
+ arch_task_struct_size = task_size;
+}
+
/*
* Set up the xstate_size based on the legacy FPU context size.
*
@@ -287,6 +326,7 @@ void __init fpu__init_system(struct cpuinfo_x86 *c)
fpu__init_system_generic();
fpu__init_system_xstate_size_legacy();
fpu__init_system_xstate();
+ fpu__init_task_struct_size();

fpu__init_system_ctx_switch();
}
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index c3e985d1751c..d05bd2e2ee91 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -408,15 +408,15 @@ static void default_do_nmi(struct pt_regs *regs)
NOKPROBE_SYMBOL(default_do_nmi);

/*
- * NMIs can hit breakpoints which will cause it to lose its
- * NMI context with the CPU when the breakpoint does an iret.
- */
-#ifdef CONFIG_X86_32
-/*
- * For i386, NMIs use the same stack as the kernel, and we can
- * add a workaround to the iret problem in C (preventing nested
- * NMIs if an NMI takes a trap). Simply have 3 states the NMI
- * can be in:
+ * NMIs can page fault or hit breakpoints which will cause it to lose
+ * its NMI context with the CPU when the breakpoint or page fault does an IRET.
+ *
+ * As a result, NMIs can nest if NMIs get unmasked due an IRET during
+ * NMI processing. On x86_64, the asm glue protects us from nested NMIs
+ * if the outer NMI came from kernel mode, but we can still nest if the
+ * outer NMI came from user mode.
+ *
+ * To handle these nested NMIs, we have three states:
*
* 1) not running
* 2) executing
@@ -430,15 +430,14 @@ NOKPROBE_SYMBOL(default_do_nmi);
* (Note, the latch is binary, thus multiple NMIs triggering,
* when one is running, are ignored. Only one NMI is restarted.)
*
- * If an NMI hits a breakpoint that executes an iret, another
- * NMI can preempt it. We do not want to allow this new NMI
- * to run, but we want to execute it when the first one finishes.
- * We set the state to "latched", and the exit of the first NMI will
- * perform a dec_return, if the result is zero (NOT_RUNNING), then
- * it will simply exit the NMI handler. If not, the dec_return
- * would have set the state to NMI_EXECUTING (what we want it to
- * be when we are running). In this case, we simply jump back
- * to rerun the NMI handler again, and restart the 'latched' NMI.
+ * If an NMI executes an iret, another NMI can preempt it. We do not
+ * want to allow this new NMI to run, but we want to execute it when the
+ * first one finishes. We set the state to "latched", and the exit of
+ * the first NMI will perform a dec_return, if the result is zero
+ * (NOT_RUNNING), then it will simply exit the NMI handler. If not, the
+ * dec_return would have set the state to NMI_EXECUTING (what we want it
+ * to be when we are running). In this case, we simply jump back to
+ * rerun the NMI handler again, and restart the 'latched' NMI.
*
* No trap (breakpoint or page fault) should be hit before nmi_restart,
* thus there is no race between the first check of state for NOT_RUNNING
@@ -461,49 +460,36 @@ enum nmi_states {
static DEFINE_PER_CPU(enum nmi_states, nmi_state);
static DEFINE_PER_CPU(unsigned long, nmi_cr2);

-#define nmi_nesting_preprocess(regs) \
- do { \
- if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) { \
- this_cpu_write(nmi_state, NMI_LATCHED); \
- return; \
- } \
- this_cpu_write(nmi_state, NMI_EXECUTING); \
- this_cpu_write(nmi_cr2, read_cr2()); \
- } while (0); \
- nmi_restart:
-
-#define nmi_nesting_postprocess() \
- do { \
- if (unlikely(this_cpu_read(nmi_cr2) != read_cr2())) \
- write_cr2(this_cpu_read(nmi_cr2)); \
- if (this_cpu_dec_return(nmi_state)) \
- goto nmi_restart; \
- } while (0)
-#else /* x86_64 */
+#ifdef CONFIG_X86_64
/*
- * In x86_64 things are a bit more difficult. This has the same problem
- * where an NMI hitting a breakpoint that calls iret will remove the
- * NMI context, allowing a nested NMI to enter. What makes this more
- * difficult is that both NMIs and breakpoints have their own stack.
- * When a new NMI or breakpoint is executed, the stack is set to a fixed
- * point. If an NMI is nested, it will have its stack set at that same
- * fixed address that the first NMI had, and will start corrupting the
- * stack. This is handled in entry_64.S, but the same problem exists with
- * the breakpoint stack.
+ * In x86_64, we need to handle breakpoint -> NMI -> breakpoint. Without
+ * some care, the inner breakpoint will clobber the outer breakpoint's
+ * stack.
*
- * If a breakpoint is being processed, and the debug stack is being used,
- * if an NMI comes in and also hits a breakpoint, the stack pointer
- * will be set to the same fixed address as the breakpoint that was
- * interrupted, causing that stack to be corrupted. To handle this case,
- * check if the stack that was interrupted is the debug stack, and if
- * so, change the IDT so that new breakpoints will use the current stack
- * and not switch to the fixed address. On return of the NMI, switch back
- * to the original IDT.
+ * If a breakpoint is being processed, and the debug stack is being
+ * used, if an NMI comes in and also hits a breakpoint, the stack
+ * pointer will be set to the same fixed address as the breakpoint that
+ * was interrupted, causing that stack to be corrupted. To handle this
+ * case, check if the stack that was interrupted is the debug stack, and
+ * if so, change the IDT so that new breakpoints will use the current
+ * stack and not switch to the fixed address. On return of the NMI,
+ * switch back to the original IDT.
*/
static DEFINE_PER_CPU(int, update_debug_stack);
+#endif

-static inline void nmi_nesting_preprocess(struct pt_regs *regs)
+dotraplinkage notrace void
+do_nmi(struct pt_regs *regs, long error_code)
{
+ if (this_cpu_read(nmi_state) != NMI_NOT_RUNNING) {
+ this_cpu_write(nmi_state, NMI_LATCHED);
+ return;
+ }
+ this_cpu_write(nmi_state, NMI_EXECUTING);
+ this_cpu_write(nmi_cr2, read_cr2());
+nmi_restart:
+
+#ifdef CONFIG_X86_64
/*
* If we interrupted a breakpoint, it is possible that
* the nmi handler will have breakpoints too. We need to
@@ -514,22 +500,8 @@ static inline void nmi_nesting_preprocess(struct pt_regs *regs)
debug_stack_set_zero();
this_cpu_write(update_debug_stack, 1);
}
-}
-
-static inline void nmi_nesting_postprocess(void)
-{
- if (unlikely(this_cpu_read(update_debug_stack))) {
- debug_stack_reset();
- this_cpu_write(update_debug_stack, 0);
- }
-}
#endif

-dotraplinkage notrace void
-do_nmi(struct pt_regs *regs, long error_code)
-{
- nmi_nesting_preprocess(regs);
-
nmi_enter();

inc_irq_stat(__nmi_count);
@@ -539,8 +511,17 @@ do_nmi(struct pt_regs *regs, long error_code)

nmi_exit();

- /* On i386, may loop back to preprocess */
- nmi_nesting_postprocess();
+#ifdef CONFIG_X86_64
+ if (unlikely(this_cpu_read(update_debug_stack))) {
+ debug_stack_reset();
+ this_cpu_write(update_debug_stack, 0);
+ }
+#endif
+
+ if (unlikely(this_cpu_read(nmi_cr2) != read_cr2()))
+ write_cr2(this_cpu_read(nmi_cr2));
+ if (this_cpu_dec_return(nmi_state))
+ goto nmi_restart;
}
NOKPROBE_SYMBOL(do_nmi);

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9cad694ed7c4..397688beed4b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,7 @@ EXPORT_SYMBOL_GPL(idle_notifier_unregister);
*/
int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
{
- *dst = *src;
+ memcpy(dst, src, arch_task_struct_size);

return fpu__copy(&dst->thread.fpu, &src->thread.fpu);
}
diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
index 91a4e6426321..92e6726f6e37 100644
--- a/fs/proc/kcore.c
+++ b/fs/proc/kcore.c
@@ -92,7 +92,7 @@ static size_t get_kcore_size(int *nphdr, size_t *elf_buflen)
roundup(sizeof(CORE_STR), 4)) +
roundup(sizeof(struct elf_prstatus), 4) +
roundup(sizeof(struct elf_prpsinfo), 4) +
- roundup(sizeof(struct task_struct), 4);
+ roundup(arch_task_struct_size, 4);
*elf_buflen = PAGE_ALIGN(*elf_buflen);
return size + *elf_buflen;
}
@@ -415,7 +415,7 @@ static void elf_kcore_store_hdr(char *bufp, int nphdr, int dataoff)
/* set up the task structure */
notes[2].name = CORE_STR;
notes[2].type = NT_TASKSTRUCT;
- notes[2].datasz = sizeof(struct task_struct);
+ notes[2].datasz = arch_task_struct_size;
notes[2].data = current;

nhdr->p_filesz += notesize(&notes[2]);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ae21f1591615..04b5ada460b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1522,8 +1522,6 @@ struct task_struct {
/* hung task detection */
unsigned long last_switch_count;
#endif
-/* CPU-specific state of this task */
- struct thread_struct thread;
/* filesystem information */
struct fs_struct *fs;
/* open file information */
@@ -1778,8 +1776,22 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+/* CPU-specific state of this task */
+ struct thread_struct thread;
+/*
+ * WARNING: on x86, 'thread_struct' contains a variable-sized
+ * structure. It *MUST* be at the end of 'task_struct'.
+ *
+ * Do not put anything below here!
+ */
};

+#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
+extern int arch_task_struct_size __read_mostly;
+#else
+# define arch_task_struct_size (sizeof(struct task_struct))
+#endif
+
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

diff --git a/kernel/fork.c b/kernel/fork.c
index 1bfefc6f96a4..dbd9b8d7b7cc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -287,6 +287,11 @@ static void set_max_threads(unsigned int max_threads_suggested)
max_threads = clamp_t(u64, threads, MIN_THREADS, MAX_THREADS);
}

+#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
+/* Initialized by the architecture: */
+int arch_task_struct_size __read_mostly;
+#endif
+
void __init fork_init(void)
{
#ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
@@ -295,7 +300,7 @@ void __init fork_init(void)
#endif
/* create a slab on which task_structs can be allocated */
task_struct_cachep =
- kmem_cache_create("task_struct", sizeof(struct task_struct),
+ kmem_cache_create("task_struct", arch_task_struct_size,
ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
#endif


2015-07-20 07:20:47

by Heiko Carstens

[permalink] [raw]
Subject: Re: [GIT PULL] x86 fixes

On Sat, Jul 18, 2015 at 05:18:10AM +0200, Ingo Molnar wrote:
> Linus,
>
> Please pull the latest x86-urgent-for-linus git tree from:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86-urgent-for-linus
>
> # HEAD: 5aaeb5c01c5b6c0be7b7aadbf3ace9f3a4458c3d x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86
>
> Two families of fixes:
>
> - Fix an FPU context related boot crash on newer x86 hardware with larger
> context sizes than what most people test. To fix this without ugly kludges or
> extensive reverts we had to touch core task allocator, to allow x86 to
> determine the task size dynamically, at boot time.
>
> I've tested it on a number of x86 platforms, and I cross-built it to a handful
> of architectures:
>
> (warns) (warns)
> testing x86-64: -git: pass ( 0), -tip: pass ( 0)
> testing x86-32: -git: pass ( 0), -tip: pass ( 0)
> testing arm: -git: pass ( 1359), -tip: pass ( 1359)
> testing cris: -git: pass ( 1031), -tip: pass ( 1031)
> testing m32r: -git: pass ( 1135), -tip: pass ( 1135)
> testing m68k: -git: pass ( 1471), -tip: pass ( 1471)
> testing mips: -git: pass ( 1162), -tip: pass ( 1162)
> testing mn10300: -git: pass ( 1058), -tip: pass ( 1058)
> testing parisc: -git: pass ( 1846), -tip: pass ( 1846)
> testing sparc: -git: pass ( 1185), -tip: pass ( 1185)
>
> ... so I hope the cross-arch impact 'none', as intended.
>
> (by Dave Hansen)

Unfortunately not true. It breaks the build on s390 since a couple of
displacements used in asm code now get too large:

arch/s390/kernel/entry.S:181: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
arch/s390/kernel/entry.S:191: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
arch/s390/kernel/entry.S:423: Error: operand out of range (0x0000000000001924 is not between 0x0000000000000000 and 0x0000000000000fff)
arch/s390/kernel/entry.S:437: Error: operand out of range (0x00000000000018e8 is not between 0x0000000000000000 and 0x0000000000000fff)
arch/s390/kernel/entry.S:438: Error: operand out of range (0x00000000000018e0 is not between 0x0000000000000000 and 0x0000000000000fff)
arch/s390/kernel/entry.S:439: Error: operand out of range (0x00000000000018f0 is not between 0x0000000000000000 and 0x0000000000000fff)
make[1]: *** [arch/s390/kernel/entry.o] Error 1

Let's see how we can fix this.

2015-07-20 08:00:38

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] sched, s390: Fix the fallout of increasing the offset of 'thread_struct' within 'task_struct'


* Heiko Carstens <[email protected]> wrote:

> > I've tested it on a number of x86 platforms, and I cross-built it to a handful
> > of architectures:
> >
> > (warns) (warns)
> > testing x86-64: -git: pass ( 0), -tip: pass ( 0)
> > testing x86-32: -git: pass ( 0), -tip: pass ( 0)
> > testing arm: -git: pass ( 1359), -tip: pass ( 1359)
> > testing cris: -git: pass ( 1031), -tip: pass ( 1031)
> > testing m32r: -git: pass ( 1135), -tip: pass ( 1135)
> > testing m68k: -git: pass ( 1471), -tip: pass ( 1471)
> > testing mips: -git: pass ( 1162), -tip: pass ( 1162)
> > testing mn10300: -git: pass ( 1058), -tip: pass ( 1058)
> > testing parisc: -git: pass ( 1846), -tip: pass ( 1846)
> > testing sparc: -git: pass ( 1185), -tip: pass ( 1185)
> >
> > ... so I hope the cross-arch impact 'none', as intended.
> >
> > (by Dave Hansen)
>
> Unfortunately not true. It breaks the build on s390 since a couple of
> displacements used in asm code now get too large:
>
> arch/s390/kernel/entry.S:181: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:191: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:423: Error: operand out of range (0x0000000000001924 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:437: Error: operand out of range (0x00000000000018e8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:438: Error: operand out of range (0x00000000000018e0 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:439: Error: operand out of range (0x00000000000018f0 is not between 0x0000000000000000 and 0x0000000000000fff)
> make[1]: *** [arch/s390/kernel/entry.o] Error 1
>
> Let's see how we can fix this.

There's also a traps.c build breakage reported below - and an RFC fix for it.

Thanks,

Ingo

=========================>

* Guenter <[email protected]> wrote:

> Hi,
>
> Commit 0c8c0f03e3a2 ("x86/fpu, sched: Dynamically allocate 'struct fpu'")
> causes s390 builds in mainline to fail as follows.
>
> arch/s390/kernel/traps.c: Assembler messages:
> arch/s390/kernel/traps.c:262: Error: operand out of range
> (0x00000000000023e8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/traps.c:300: Error: operand out of range
> (0x00000000000023e8 is not between 0x0000000000000000 and 0x0000000000000fff)


Yeah, so I'm really out on a limb here as I know next to nothing about s390
assembly, but the build failure appears to be analogous to the arm64 one: the
offset of thread_struct fields within task_struct increased due to commit
0c8c0f03e3a2 ("x86/fpu, sched: Dynamically allocate 'struct fpu'"), which
increased assembly offsets beyond the limit this instruction can apparently
encode.

Does the (untested!) patch below help?

It's an equivalent transformation on the C side, but it might cause GCC to
generate different assembly code, because we now have a temporary variable with
much smaller offsets.

The code is also a tiny bit cleaner this way, as the 'current->thread.fp_regs'
pattern isn't repeated twice.

In case this works:

Signed-off-by: Ingo Molnar <[email protected]>

Thanks,

Ingo

================>

arch/s390/kernel/traps.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 4d96c9f53455..db6f0eec55b5 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -251,6 +251,7 @@ int alloc_vector_registers(struct task_struct *tsk)

void vector_exception(struct pt_regs *regs)
{
+ s390_fp_regs *fp_regs = &current->thread.fp_regs;
int si_code, vic;

if (!MACHINE_HAS_VX) {
@@ -259,8 +260,9 @@ void vector_exception(struct pt_regs *regs)
}

/* get vector interrupt code from fpc */
- asm volatile("stfpc %0" : "=m" (current->thread.fp_regs.fpc));
- vic = (current->thread.fp_regs.fpc & 0xf00) >> 8;
+ asm volatile("stfpc %0" : "=m" (fp_regs->fpc));
+ vic = (fp_regs->fpc & 0xf00) >> 8;
+
switch (vic) {
case 1: /* invalid vector operation */
si_code = FPE_FLTINV;

2015-07-20 08:12:32

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] sched, s390: Fix the fallout of increasing the offset of 'thread_struct' within 'task_struct'

On Mon, 20 Jul 2015 10:00:32 +0200
Ingo Molnar <[email protected]> wrote:

> * Heiko Carstens <[email protected]> wrote:
>
> > arch/s390/kernel/entry.S:181: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:191: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:423: Error: operand out of range (0x0000000000001924 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:437: Error: operand out of range (0x00000000000018e8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:438: Error: operand out of range (0x00000000000018e0 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:439: Error: operand out of range (0x00000000000018f0 is not between 0x0000000000000000 and 0x0000000000000fff)
> > make[1]: *** [arch/s390/kernel/entry.o] Error 1
> >
> > Let's see how we can fix this.
>
> There's also a traps.c build breakage reported below - and an RFC fix for it.

This patch should fix it for good.
--
>From 54b3c4b85572bf5acc698cfca15d2c9cad446c44 Mon Sep 17 00:00:00 2001
From: Martin Schwidefsky <[email protected]>
Date: Mon, 20 Jul 2015 10:01:46 +0200
Subject: [PATCH] s390: adapt entry.S to the move of thread_struct

git commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
"x86/fpu, sched: Dynamically allocate 'struct fpu'"
moved the thread_struct to the end of the task_struct.

This causes some of the offsets used in entry.S to overflow their
instruction operand field. To fix this use aghi to create a
dedicated pointer for the thread_struct.

Signed-off-by: Martin Schwidefsky <[email protected]>
---
arch/s390/kernel/asm-offsets.c | 15 +++++++--------
arch/s390/kernel/entry.S | 13 +++++++++----
arch/s390/kernel/traps.c | 4 ++--
3 files changed, 18 insertions(+), 14 deletions(-)

diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index c7d1b9d..a2da259 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -23,15 +23,15 @@

int main(void)
{
- DEFINE(__THREAD_info, offsetof(struct task_struct, stack));
- DEFINE(__THREAD_ksp, offsetof(struct task_struct, thread.ksp));
- DEFINE(__THREAD_mm_segment, offsetof(struct task_struct, thread.mm_segment));
- BLANK();
+ DEFINE(__TASK_thread_info, offsetof(struct task_struct, stack));
+ DEFINE(__TASK_thread, offsetof(struct task_struct, thread));
DEFINE(__TASK_pid, offsetof(struct task_struct, pid));
BLANK();
- DEFINE(__THREAD_per_cause, offsetof(struct task_struct, thread.per_event.cause));
- DEFINE(__THREAD_per_address, offsetof(struct task_struct, thread.per_event.address));
- DEFINE(__THREAD_per_paid, offsetof(struct task_struct, thread.per_event.paid));
+ DEFINE(__THREAD_ksp, offsetof(struct thread_struct, ksp));
+ DEFINE(__THREAD_per_cause, offsetof(struct thread_struct, per_event.cause));
+ DEFINE(__THREAD_per_address, offsetof(struct thread_struct, per_event.address));
+ DEFINE(__THREAD_per_paid, offsetof(struct thread_struct, per_event.paid));
+ DEFINE(__THREAD_trap_tdb, offsetof(struct thread_struct, trap_tdb));
BLANK();
DEFINE(__TI_task, offsetof(struct thread_info, task));
DEFINE(__TI_flags, offsetof(struct thread_info, flags));
@@ -176,7 +176,6 @@ int main(void)
DEFINE(__LC_VDSO_PER_CPU, offsetof(struct _lowcore, vdso_per_cpu_data));
DEFINE(__LC_GMAP, offsetof(struct _lowcore, gmap));
DEFINE(__LC_PGM_TDB, offsetof(struct _lowcore, pgm_tdb));
- DEFINE(__THREAD_trap_tdb, offsetof(struct task_struct, thread.trap_tdb));
DEFINE(__GMAP_ASCE, offsetof(struct gmap, asce));
DEFINE(__SIE_PROG0C, offsetof(struct kvm_s390_sie_block, prog0c));
DEFINE(__SIE_PROG20, offsetof(struct kvm_s390_sie_block, prog20));
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 3238893..84062e7 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -178,17 +178,21 @@ _PIF_WORK = (_PIF_PER_TRAP)
*/
ENTRY(__switch_to)
stmg %r6,%r15,__SF_GPRS(%r15) # store gprs of prev task
- stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev
- lg %r4,__THREAD_info(%r2) # get thread_info of prev
- lg %r5,__THREAD_info(%r3) # get thread_info of next
+ lgr %r1,%r2
+ aghi %r1,__TASK_thread # thread_struct of prev task
+ lg %r4,__TASK_thread_info(%r2) # get thread_info of prev
+ lg %r5,__TASK_thread_info(%r3) # get thread_info of next
+ stg %r15,__THREAD_ksp(%r1) # store kernel stack of prev
+ lgr %r1,%r3
+ aghi %r1,__TASK_thread # thread_struct of next task
lgr %r15,%r5
aghi %r15,STACK_INIT # end of kernel stack of next
stg %r3,__LC_CURRENT # store task struct of next
stg %r5,__LC_THREAD_INFO # store thread info of next
stg %r15,__LC_KERNEL_STACK # store end of kernel stack
+ lg %r15,__THREAD_ksp(%r1) # load kernel stack of next
lctl %c4,%c4,__TASK_pid(%r3) # load pid to control reg. 4
mvc __LC_CURRENT_PID+4(4,%r0),__TASK_pid(%r3) # store pid of next
- lg %r15,__THREAD_ksp(%r3) # load kernel stack of next
lmg %r6,%r15,__SF_GPRS(%r15) # load gprs of next task
br %r14

@@ -417,6 +421,7 @@ ENTRY(pgm_check_handler)
LAST_BREAK %r14
lg %r15,__LC_KERNEL_STACK
lg %r14,__TI_task(%r12)
+ aghi %r14,__TASK_thread # pointer to thread_struct
lghi %r13,__LC_PGM_TDB
tm __LC_PGM_ILC+2,0x02 # check for transaction abort
jz 2f
diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 4d96c9f..7bea81d 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -259,7 +259,7 @@ void vector_exception(struct pt_regs *regs)
}

/* get vector interrupt code from fpc */
- asm volatile("stfpc %0" : "=m" (current->thread.fp_regs.fpc));
+ asm volatile("stfpc %0" : "=Q" (current->thread.fp_regs.fpc));
vic = (current->thread.fp_regs.fpc & 0xf00) >> 8;
switch (vic) {
case 1: /* invalid vector operation */
@@ -297,7 +297,7 @@ void data_exception(struct pt_regs *regs)

location = get_trap_ip(regs);

- asm volatile("stfpc %0" : "=m" (current->thread.fp_regs.fpc));
+ asm volatile("stfpc %0" : "=Q" (current->thread.fp_regs.fpc));
/* Check for vector register enablement */
if (MACHINE_HAS_VX && !current->thread.vxrs &&
(current->thread.fp_regs.fpc & FPC_DXC_MASK) == 0xfe00) {
--
2.3.8


--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2015-07-20 08:20:15

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] s390, sched: Fix thread_struct move fallout to __switch_to()


* Heiko Carstens <[email protected]> wrote:

> Unfortunately not true. It breaks the build on s390 since a couple of
> displacements used in asm code now get too large:
>
> arch/s390/kernel/entry.S:181: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:191: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:423: Error: operand out of range (0x0000000000001924 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:437: Error: operand out of range (0x00000000000018e8 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:438: Error: operand out of range (0x00000000000018e0 is not between 0x0000000000000000 and 0x0000000000000fff)
> arch/s390/kernel/entry.S:439: Error: operand out of range (0x00000000000018f0 is not between 0x0000000000000000 and 0x0000000000000fff)
> make[1]: *** [arch/s390/kernel/entry.o] Error 1
>
> Let's see how we can fix this.

Sorry about this!

So I looked at the __switch_to() assembly, and the main complication appears to be
that we have two uses of 'prev':

lg %r4,__THREAD_info(%r2) # get thread_info of prev


stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev

the __THREAD_info offset is best expressed relative to task_struct - the
__THREAD_ksp offset is best expressed relative to thread_struct.

I.e. I think the best fix would be to extend the signature of s390's __switch_to()
from (prev,next) to (prev,next,prev_thread,next_thread).

Is the C parameter mapping r2,r3,r4,r5? If yes then the patch below should do the
trick. (Utterly untested.)

Note:

- I renamed __THREAD_info to __TASK_thread_info, to better separate task_struct
and thread_struct offsets syntactically and visually.

- I removed __THREAD_mm_segment which is unused

I also looked at fixing pgm_check_handler(), but my s390-fu gave up completely on
that one: task_struct is in 'r14', but this is a hardware entry function it
appears. So to fix it we'd have to pick a temporary register, put thread_struct
pointer into it, and fix up these offsets:

arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_address, offsetof(struct task_struct, thread.per_event.address));
arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_paid, offsetof(struct task_struct, thread.per_event.paid));
arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_trap_tdb, offsetof(struct task_struct, thread.trap_tdb));

to be thread_struct relative.

Thanks,

Ingo

==============>

arch/s390/include/asm/switch_to.h | 7 +++++--
arch/s390/kernel/asm-offsets.c | 5 ++---
arch/s390/kernel/entry.S | 14 +++++++++-----
3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/arch/s390/include/asm/switch_to.h b/arch/s390/include/asm/switch_to.h
index d62e7a69605f..aa854b600938 100644
--- a/arch/s390/include/asm/switch_to.h
+++ b/arch/s390/include/asm/switch_to.h
@@ -10,7 +10,10 @@
#include <linux/thread_info.h>
#include <asm/ptrace.h>

-extern struct task_struct *__switch_to(void *, void *);
+extern struct task_struct *
+__switch_to(struct task_struct *prev, struct task_struct *next,
+ struct thread_struct *prev_thread, struct thread_struct *next_thrad);
+
extern void update_cr_regs(struct task_struct *task);

static inline int test_fp_ctl(u32 fpc)
@@ -169,7 +172,7 @@ static inline void restore_access_regs(unsigned int *acrs)
restore_access_regs(&next->thread.acrs[0]); \
restore_ri_cb(next->thread.ri_cb, prev->thread.ri_cb); \
} \
- prev = __switch_to(prev,next); \
+ prev = __switch_to(prev, next, &prev->thread, &next->thread); \
} while (0)

#endif /* __ASM_SWITCH_TO_H */
diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index c7d1b9d09011..2b422297660f 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -23,9 +23,8 @@

int main(void)
{
- DEFINE(__THREAD_info, offsetof(struct task_struct, stack));
- DEFINE(__THREAD_ksp, offsetof(struct task_struct, thread.ksp));
- DEFINE(__THREAD_mm_segment, offsetof(struct task_struct, thread.mm_segment));
+ DEFINE(__TASK_thread_info, offsetof(struct task_struct, stack));
+ DEFINE(__THREAD_ksp, offsetof(struct thread_struct, ksp));
BLANK();
DEFINE(__TASK_pid, offsetof(struct task_struct, pid));
BLANK();
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index 3238893c9d4f..c2fcb9e7ad25 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -170,17 +170,21 @@ _PIF_WORK = (_PIF_PER_TRAP)
.section .kprobes.text, "ax"

/*
- * Scheduler resume function, called by switch_to
+ * Low level scheduler resume function, called by switch_to():
+ *
* gpr2 = (task_struct *) prev
* gpr3 = (task_struct *) next
+ * gpr4 = (thread_struct *) prev->thread
+ * gpr5 = (thread_struct *) next->thread
+ *
* Returns:
* gpr2 = prev
*/
ENTRY(__switch_to)
stmg %r6,%r15,__SF_GPRS(%r15) # store gprs of prev task
- stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev
- lg %r4,__THREAD_info(%r2) # get thread_info of prev
- lg %r5,__THREAD_info(%r3) # get thread_info of next
+ stg %r15,__THREAD_ksp(%r4) # store kernel stack of prev
+ lg %r4,__TASK_thread_info(%r2) # get thread_info of prev
+ lg %r5,__TASK_thread_info(%r3) # get thread_info of next
lgr %r15,%r5
aghi %r15,STACK_INIT # end of kernel stack of next
stg %r3,__LC_CURRENT # store task struct of next
@@ -188,7 +192,7 @@ ENTRY(__switch_to)
stg %r15,__LC_KERNEL_STACK # store end of kernel stack
lctl %c4,%c4,__TASK_pid(%r3) # load pid to control reg. 4
mvc __LC_CURRENT_PID+4(4,%r0),__TASK_pid(%r3) # store pid of next
- lg %r15,__THREAD_ksp(%r3) # load kernel stack of next
+ lg %r15,__THREAD_ksp(%r5) # load kernel stack of next
lmg %r6,%r15,__SF_GPRS(%r15) # load gprs of next task
br %r14

2015-07-20 08:33:56

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] s390, sched: Fix thread_struct move fallout to __switch_to()

On Mon, 20 Jul 2015 10:20:04 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Heiko Carstens <[email protected]> wrote:
>
> > Unfortunately not true. It breaks the build on s390 since a couple of
> > displacements used in asm code now get too large:
> >
> > arch/s390/kernel/entry.S:181: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:191: Error: operand out of range (0x00000000000018a8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:423: Error: operand out of range (0x0000000000001924 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:437: Error: operand out of range (0x00000000000018e8 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:438: Error: operand out of range (0x00000000000018e0 is not between 0x0000000000000000 and 0x0000000000000fff)
> > arch/s390/kernel/entry.S:439: Error: operand out of range (0x00000000000018f0 is not between 0x0000000000000000 and 0x0000000000000fff)
> > make[1]: *** [arch/s390/kernel/entry.o] Error 1
> >
> > Let's see how we can fix this.
>
> Sorry about this!
>
> So I looked at the __switch_to() assembly, and the main complication appears to be
> that we have two uses of 'prev':
>
> lg %r4,__THREAD_info(%r2) # get thread_info of prev
>
>
> stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev
>
> the __THREAD_info offset is best expressed relative to task_struct - the
> __THREAD_ksp offset is best expressed relative to thread_struct.
>
> I.e. I think the best fix would be to extend the signature of s390's __switch_to()
> from (prev,next) to (prev,next,prev_thread,next_thread).
>
> Is the C parameter mapping r2,r3,r4,r5? If yes then the patch below should do the
> trick. (Utterly untested.)
>
> Note:
>
> - I renamed __THREAD_info to __TASK_thread_info, to better separate task_struct
> and thread_struct offsets syntactically and visually.
>
> - I removed __THREAD_mm_segment which is unused
>
> I also looked at fixing pgm_check_handler(), but my s390-fu gave up completely on
> that one: task_struct is in 'r14', but this is a hardware entry function it
> appears. So to fix it we'd have to pick a temporary register, put thread_struct
> pointer into it, and fix up these offsets:
>
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_address, offsetof(struct task_struct, thread.per_event.address));
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_paid, offsetof(struct task_struct, thread.per_event.paid));
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_trap_tdb, offsetof(struct task_struct, thread.trap_tdb));
>
> to be thread_struct relative.

Just use the patch I've sent. No worries :-)

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.

2015-07-20 08:34:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] s390, sched: Fix thread_struct move fallout to __switch_to()


* Ingo Molnar <[email protected]> wrote:

> I also looked at fixing pgm_check_handler(), but my s390-fu gave up completely
> on that one: task_struct is in 'r14', but this is a hardware entry function it
> appears. So to fix it we'd have to pick a temporary register, put thread_struct
> pointer into it, and fix up these offsets:
>
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_address, offsetof(struct task_struct, thread.per_event.address));
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_per_paid, offsetof(struct task_struct, thread.per_event.paid));
> arch/s390/kernel/asm-offsets.c: DEFINE(__THREAD_trap_tdb, offsetof(struct task_struct, thread.trap_tdb));
>
> to be thread_struct relative.

Here's a stab at that: we can move 'r14' from 'current' to 'current->thread' and
back relatively cheaply I think. If I got the mnemonics right.

Thanks,

Ingo

===============================>

arch/s390/kernel/asm-offsets.c | 9 +++++----
arch/s390/kernel/entry.S | 2 ++
2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/s390/kernel/asm-offsets.c b/arch/s390/kernel/asm-offsets.c
index 2b422297660f..26edbc4b5078 100644
--- a/arch/s390/kernel/asm-offsets.c
+++ b/arch/s390/kernel/asm-offsets.c
@@ -24,13 +24,14 @@
int main(void)
{
DEFINE(__TASK_thread_info, offsetof(struct task_struct, stack));
+ DEFINE(__TASK_thread_struct, offsetof(struct task_struct, thread));
DEFINE(__THREAD_ksp, offsetof(struct thread_struct, ksp));
BLANK();
DEFINE(__TASK_pid, offsetof(struct task_struct, pid));
BLANK();
- DEFINE(__THREAD_per_cause, offsetof(struct task_struct, thread.per_event.cause));
- DEFINE(__THREAD_per_address, offsetof(struct task_struct, thread.per_event.address));
- DEFINE(__THREAD_per_paid, offsetof(struct task_struct, thread.per_event.paid));
+ DEFINE(__THREAD_per_cause, offsetof(struct thread_struct, per_event.cause));
+ DEFINE(__THREAD_per_address, offsetof(struct thread_struct, per_event.address));
+ DEFINE(__THREAD_per_paid, offsetof(struct thread_struct, per_event.paid));
BLANK();
DEFINE(__TI_task, offsetof(struct thread_info, task));
DEFINE(__TI_flags, offsetof(struct thread_info, flags));
@@ -175,7 +176,7 @@ int main(void)
DEFINE(__LC_VDSO_PER_CPU, offsetof(struct _lowcore, vdso_per_cpu_data));
DEFINE(__LC_GMAP, offsetof(struct _lowcore, gmap));
DEFINE(__LC_PGM_TDB, offsetof(struct _lowcore, pgm_tdb));
- DEFINE(__THREAD_trap_tdb, offsetof(struct task_struct, thread.trap_tdb));
+ DEFINE(__THREAD_trap_tdb, offsetof(struct thread_struct, trap_tdb));
DEFINE(__GMAP_ASCE, offsetof(struct gmap, asce));
DEFINE(__SIE_PROG0C, offsetof(struct kvm_s390_sie_block, prog0c));
DEFINE(__SIE_PROG20, offsetof(struct kvm_s390_sie_block, prog20));
diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
index c2fcb9e7ad25..9806b2f94985 100644
--- a/arch/s390/kernel/entry.S
+++ b/arch/s390/kernel/entry.S
@@ -421,6 +421,7 @@ ENTRY(pgm_check_handler)
LAST_BREAK %r14
lg %r15,__LC_KERNEL_STACK
lg %r14,__TI_task(%r12)
+ ahi %r14,__TASK_thread_struct # r14 now points to 'current->thread'
lghi %r13,__LC_PGM_TDB
tm __LC_PGM_ILC+2,0x02 # check for transaction abort
jz 2f
@@ -448,6 +449,7 @@ ENTRY(pgm_check_handler)
nill %r10,0x007f
sll %r10,2
je .Lsysc_return
+ ahi %r14,-__TASK_thread_struct # r14 now points to 'current'
lgf %r1,0(%r10,%r1) # load address of handler routine
lgr %r2,%r11 # pass pointer to pt_regs
basr %r14,%r1 # branch to interrupt-handler

2015-07-20 08:38:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched, s390: Fix the fallout of increasing the offset of 'thread_struct' within 'task_struct'


* Martin Schwidefsky <[email protected]> wrote:

> diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
> index 3238893..84062e7 100644
> --- a/arch/s390/kernel/entry.S
> +++ b/arch/s390/kernel/entry.S
> @@ -178,17 +178,21 @@ _PIF_WORK = (_PIF_PER_TRAP)
> */
> ENTRY(__switch_to)
> stmg %r6,%r15,__SF_GPRS(%r15) # store gprs of prev task
> - stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev
> - lg %r4,__THREAD_info(%r2) # get thread_info of prev
> - lg %r5,__THREAD_info(%r3) # get thread_info of next
> + lgr %r1,%r2
> + aghi %r1,__TASK_thread # thread_struct of prev task
> + lg %r4,__TASK_thread_info(%r2) # get thread_info of prev
> + lg %r5,__TASK_thread_info(%r3) # get thread_info of next
> + stg %r15,__THREAD_ksp(%r1) # store kernel stack of prev
> + lgr %r1,%r3
> + aghi %r1,__TASK_thread # thread_struct of next task
> lgr %r15,%r5
> aghi %r15,STACK_INIT # end of kernel stack of next
> stg %r3,__LC_CURRENT # store task struct of next
> stg %r5,__LC_THREAD_INFO # store thread info of next
> stg %r15,__LC_KERNEL_STACK # store end of kernel stack
> + lg %r15,__THREAD_ksp(%r1) # load kernel stack of next
> lctl %c4,%c4,__TASK_pid(%r3) # load pid to control reg. 4
> mvc __LC_CURRENT_PID+4(4,%r0),__TASK_pid(%r3) # store pid of next
> - lg %r15,__THREAD_ksp(%r3) # load kernel stack of next
> lmg %r6,%r15,__SF_GPRS(%r15) # load gprs of next task
> br %r14

Btw., I think we'd be slightly better off with the variant I sent: that way the
offset arithmetics are done in C code and are ready in the relevant registers by
the time they are used in __switch_to().

>
> @@ -417,6 +421,7 @@ ENTRY(pgm_check_handler)
> LAST_BREAK %r14
> lg %r15,__LC_KERNEL_STACK
> lg %r14,__TI_task(%r12)
> + aghi %r14,__TASK_thread # pointer to thread_struct
> lghi %r13,__LC_PGM_TDB
> tm __LC_PGM_ILC+2,0x02 # check for transaction abort
> jz 2f

Don't we also need the chunk I have:

@@ -448,6 +449,7 @@ ENTRY(pgm_check_handler)
nill %r10,0x007f
sll %r10,2
je .Lsysc_return
+ ahi %r14,-__TASK_thread_struct # r14 now points to 'current'
lgf %r1,0(%r10,%r1) # load address of handler routine
lgr %r2,%r11 # pass pointer to pt_regs
basr %r14,%r1 # branch to interrupt-handler

Because the 'basr' line relies on 'r14' having task_struct, I think?

But I don't really know what I'm talking about here ...

Thanks,

Ingo

2015-07-20 08:56:12

by Martin Schwidefsky

[permalink] [raw]
Subject: Re: [PATCH] sched, s390: Fix the fallout of increasing the offset of 'thread_struct' within 'task_struct'

On Mon, 20 Jul 2015 10:38:47 +0200
Ingo Molnar <[email protected]> wrote:

>
> * Martin Schwidefsky <[email protected]> wrote:
>
> > diff --git a/arch/s390/kernel/entry.S b/arch/s390/kernel/entry.S
> > index 3238893..84062e7 100644
> > --- a/arch/s390/kernel/entry.S
> > +++ b/arch/s390/kernel/entry.S
> > @@ -178,17 +178,21 @@ _PIF_WORK = (_PIF_PER_TRAP)
> > */
> > ENTRY(__switch_to)
> > stmg %r6,%r15,__SF_GPRS(%r15) # store gprs of prev task
> > - stg %r15,__THREAD_ksp(%r2) # store kernel stack of prev
> > - lg %r4,__THREAD_info(%r2) # get thread_info of prev
> > - lg %r5,__THREAD_info(%r3) # get thread_info of next
> > + lgr %r1,%r2
> > + aghi %r1,__TASK_thread # thread_struct of prev task
> > + lg %r4,__TASK_thread_info(%r2) # get thread_info of prev
> > + lg %r5,__TASK_thread_info(%r3) # get thread_info of next
> > + stg %r15,__THREAD_ksp(%r1) # store kernel stack of prev
> > + lgr %r1,%r3
> > + aghi %r1,__TASK_thread # thread_struct of next task
> > lgr %r15,%r5
> > aghi %r15,STACK_INIT # end of kernel stack of next
> > stg %r3,__LC_CURRENT # store task struct of next
> > stg %r5,__LC_THREAD_INFO # store thread info of next
> > stg %r15,__LC_KERNEL_STACK # store end of kernel stack
> > + lg %r15,__THREAD_ksp(%r1) # load kernel stack of next
> > lctl %c4,%c4,__TASK_pid(%r3) # load pid to control reg. 4
> > mvc __LC_CURRENT_PID+4(4,%r0),__TASK_pid(%r3) # store pid of next
> > - lg %r15,__THREAD_ksp(%r3) # load kernel stack of next
> > lmg %r6,%r15,__SF_GPRS(%r15) # load gprs of next task
> > br %r14
>
> Btw., I think we'd be slightly better off with the variant I sent: that way the
> offset arithmetics are done in C code and are ready in the relevant registers by
> the time they are used in __switch_to().

If we save any instruction then not many, we are trading additional parameter
passing vs. four additional instructions in __switch_to. With this variant
the fallout is kept at a minimum which I would prefer at this point.

> >
> > @@ -417,6 +421,7 @@ ENTRY(pgm_check_handler)
> > LAST_BREAK %r14
> > lg %r15,__LC_KERNEL_STACK
> > lg %r14,__TI_task(%r12)
> > + aghi %r14,__TASK_thread # pointer to thread_struct
> > lghi %r13,__LC_PGM_TDB
> > tm __LC_PGM_ILC+2,0x02 # check for transaction abort
> > jz 2f
>
> Don't we also need the chunk I have:
>
> @@ -448,6 +449,7 @@ ENTRY(pgm_check_handler)
> nill %r10,0x007f
> sll %r10,2
> je .Lsysc_return
> + ahi %r14,-__TASK_thread_struct # r14 now points to 'current'
> lgf %r1,0(%r10,%r1) # load address of handler routine
> lgr %r2,%r11 # pass pointer to pt_regs
> basr %r14,%r1 # branch to interrupt-handler
>
> Because the 'basr' line relies on 'r14' having task_struct, I think?
>
> But I don't really know what I'm talking about here ...

The basr instruction is one of the function calling instructions and
%r14 is used for the return address. Between function calls the register
is basically free. So far it is used for the task_struct pointer
(the code following the "lg %r14,__TI_task(%r12)"). What I noticed is
that the pointer is only used to access the embedded thread_struct.
To get from the task_struct to a thread_struct a single aghi is
needed. The rest is moving stuff around in asm-offsets.c

--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.