Based on feedback on [1], it was requested that the first few
patches be made ready as soon as possible. To facilitate FSGSBASE,
some helper functions and refactoring work are incorporated.
Besides that, the series includes Andy's fix for accurate FS/GS base
reads and a cleanup of the vDSO initialization.
Changes from V1:
* Rename the x86-64 PER_CPU segment to CPU_NUMBER
* Add i386 CPU_NUMBER equivalent to x86-64 at GDT entry 23
* Add additional helper function to store CPU number
[1] FSGSBASE patch set V2: https://lkml.org/lkml/2018/5/31/686
[2] infrastructure to enable FSGSBASE V1: https://lkml.org/lkml/2018/6/4/887
Andy Lutomirski (1):
x86/fsgsbase/64: Make ptrace read FS/GS base accurately
Chang S. Bae (7):
x86/fsgsbase/64: Introduce FS/GS base helper functions
x86/fsgsbase/64: Use FS/GS base helpers in core dump
x86/fsgsbase/64: Factor out load FS/GS segments from __switch_to
x86/msr: write_rdtscp_aux() to use wrmsr_safe()
x86/segments/64: Rename PER_CPU segment to CPU_NUMBER
x86/segments/32: Introduce CPU_NUMBER segment
x86/vdso: Move out the CPU number store
arch/x86/entry/vdso/vgetcpu.c | 4 +-
arch/x86/entry/vdso/vma.c | 38 +--------
arch/x86/include/asm/elf.h | 6 +-
arch/x86/include/asm/fsgsbase.h | 47 +++++++++++
arch/x86/include/asm/msr.h | 2 +-
arch/x86/include/asm/segment.h | 60 ++++++++++---
arch/x86/include/asm/vgtod.h | 4 +-
arch/x86/kernel/cpu/common.c | 5 ++
arch/x86/kernel/process_64.c | 181 +++++++++++++++++++++++++++++++---------
arch/x86/kernel/ptrace.c | 28 ++-----
arch/x86/kernel/setup_percpu.c | 25 ++++++
11 files changed, 284 insertions(+), 116 deletions(-)
create mode 100644 arch/x86/include/asm/fsgsbase.h
--
2.7.4
The new entry will be equivalent to that of x86-64, which
stores the CPU number. The entry is placed at GDT entry 23
by bumping entries 23-28 down by one; these are all
kernel-internal segments and so have no impact on user space.
The CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
for the default (flat, initial) user space %ss. %ss is used
rather than %ds because it is less likely to be changed, and
because %ss is the segment also defined this way on 64-bit.
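As an illustration, user code could then read the CPU number from
the segment limit with LSL. A minimal sketch assuming the layout
described above (the function name is hypothetical, and the ZF
error check after LSL is omitted for brevity):

	/*
	 * Hypothetical sketch: read the CPU number via the limit of the
	 * CPU_NUMBER segment (GDT entry 23, RPL 3, i.e. __CPU_NUMBER_SEG).
	 */
	static inline unsigned int cpu_number_from_lsl(void)
	{
		const unsigned int seg = (23 << 3) | 3;
		unsigned int p;

		asm volatile("lsl %1, %0" : "=r" (p) : "r" (seg));
		return p & 0xfff;	/* low 12 bits hold the CPU number */
	}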
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/segment.h | 30 +++++++++++++++++++++---------
1 file changed, 21 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 492e3d1..fca55d7 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -77,14 +77,14 @@
* 20 - PNPBIOS support <=== cacheline #6
* 21 - PNPBIOS support
* 22 - PNPBIOS support
- * 23 - APM BIOS support
+ * 23 - CPU number
* 24 - APM BIOS support <=== cacheline #7
* 25 - APM BIOS support
*
- * 26 - ESPFIX small SS
- * 27 - per-cpu [ offset to per-cpu data area ]
- * 28 - stack_canary-20 [ for stack protector ] <=== cacheline #8
- * 29 - unused
+ * 26 - APM BIOS support
+ * 27 - ESPFIX small SS
+ * 28 - per-cpu [ offset to per-cpu data area ] <=== cacheline #8
+ * 29 - stack_canary-20 [ for stack protector ]
* 30 - unused
* 31 - TSS for double fault handler
*/
@@ -102,11 +102,12 @@
#define GDT_ENTRY_PNPBIOS_DS 20
#define GDT_ENTRY_PNPBIOS_TS1 21
#define GDT_ENTRY_PNPBIOS_TS2 22
-#define GDT_ENTRY_APMBIOS_BASE 23
+#define GDT_ENTRY_CPU_NUMBER 23
+#define GDT_ENTRY_APMBIOS_BASE 24
-#define GDT_ENTRY_ESPFIX_SS 26
-#define GDT_ENTRY_PERCPU 27
-#define GDT_ENTRY_STACK_CANARY 28
+#define GDT_ENTRY_ESPFIX_SS 27
+#define GDT_ENTRY_PERCPU 28
+#define GDT_ENTRY_STACK_CANARY 29
#define GDT_ENTRY_DOUBLEFAULT_TSS 31
@@ -140,6 +141,13 @@
/* another data segment: */
#define PNP_TS2 (GDT_ENTRY_PNPBIOS_TS2*8)
+/*
+ * CPU_NUMBER segment is at '%ss + 80' for the default (flat, initial)
+ * user space %ss (on 64-bit as well). %ss is used rather than %ds
+ * because it is less likely to be changed, matching the 64-bit layout.
+ */
+#define __CPU_NUMBER_SEG (GDT_ENTRY_CPU_NUMBER*8 + 3)
+
#ifdef CONFIG_SMP
# define __KERNEL_PERCPU (GDT_ENTRY_PERCPU*8)
#else
@@ -206,6 +214,10 @@
#define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS*8 + 3)
#define __USER32_DS __USER_DS
#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS*8 + 3)
+/*
+ * CPU_NUMBER segment is at '%ss (USER_DS) + 80',
+ * as on 32-bit, and for the same reason
+ */
#define __CPU_NUMBER_SEG (GDT_ENTRY_CPU_NUMBER*8 + 3)
#endif
--
2.7.4
Instead of open coding, the new load_fsgs() helper cleans up
__switch_to() and is symmetric with the FS/GS segment save.
When FSGSBASE is enabled, an X86_FEATURE_FSGSBASE check will
be incorporated.
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
---
arch/x86/kernel/process_64.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index e498671..cebf240 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,15 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+static __always_inline void load_fsgs(struct thread_struct *prev,
+ struct thread_struct *next)
+{
+ load_seg_legacy(prev->fsindex, prev->fsbase,
+ next->fsindex, next->fsbase, FS);
+ load_seg_legacy(prev->gsindex, prev->gsbase,
+ next->gsindex, next->gsbase, GS);
+}
+
static unsigned long task_seg_base(struct task_struct *task,
unsigned short selector)
{
@@ -588,10 +597,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
- load_seg_legacy(prev->fsindex, prev->fsbase,
- next->fsindex, next->fsbase, FS);
- load_seg_legacy(prev->gsindex, prev->gsbase,
- next->gsindex, next->gsbase, GS);
+ load_fsgs(prev, next);
switch_fpu_finish(next_fpu, cpu);
--
2.7.4
The CPU (and node) number is written, early enough, to the
segment limit of the per-CPU data entry and to the TSC_AUX MSR.
This information is retrieved by vgetcpu in user space and will
also be loaded from the paranoid entry path once FSGSBASE is
enabled. So, the setup is moved out of the vDSO into the CPU
initialization path, where IST setup is serialized.
The now-redundant segment setup in entry/vdso/vma.c is
removed; a substantial code removal. It removes a
hotplug notifier, makes a facility useful to both the kernel
and userspace unconditionally available much sooner, and
allows unification with i386. (Thanks to HPA for suggesting
the cleanup.)
Signed-off-by: Chang S. Bae <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vgetcpu.c | 4 ++--
arch/x86/entry/vdso/vma.c | 41 +----------------------------------------
arch/x86/include/asm/segment.h | 25 +++++++++++++++++++++++++
arch/x86/include/asm/vgtod.h | 2 --
arch/x86/kernel/cpu/common.c | 5 +++++
arch/x86/kernel/setup_percpu.c | 25 +++++++++++++++++++++++++
6 files changed, 58 insertions(+), 44 deletions(-)
diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
index 8ec3d1f..3284069 100644
--- a/arch/x86/entry/vdso/vgetcpu.c
+++ b/arch/x86/entry/vdso/vgetcpu.c
@@ -18,9 +18,9 @@ __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
p = __getcpu();
if (cpu)
- *cpu = p & VGETCPU_CPU_MASK;
+ *cpu = lsl_tscp_to_cpu(p);
if (node)
- *node = p >> 12;
+ *node = lsl_tscp_to_node(p);
return 0;
}
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 833e229..3f9d43f 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -332,43 +332,6 @@ static __init int vdso_setup(char *s)
return 0;
}
__setup("vdso=", vdso_setup);
-#endif
-
-#ifdef CONFIG_X86_64
-static void vgetcpu_cpu_init(void *arg)
-{
- int cpu = smp_processor_id();
- struct desc_struct d = { };
- unsigned long node = 0;
-#ifdef CONFIG_NUMA
- node = cpu_to_node(cpu);
-#endif
- if (static_cpu_has(X86_FEATURE_RDTSCP))
- write_rdtscp_aux((node << 12) | cpu);
-
- /*
- * Store cpu number in limit so that it can be loaded
- * quickly in user space in vgetcpu. (12 bits for the CPU
- * and 8 bits for the node)
- */
- d.limit0 = cpu | ((node & 0xf) << 12);
- d.limit1 = node >> 4;
- d.type = 5; /* RO data, expand down, accessed */
- d.dpl = 3; /* Visible to user code */
- d.s = 1; /* Not a system segment */
- d.p = 1; /* Present */
- d.d = 1; /* 32-bit */
-
- write_gdt_entry(get_cpu_gdt_rw(cpu),
- GDT_ENTRY_CPU_NUMBER,
- &d,
- DESCTYPE_S);
-}
-
-static int vgetcpu_online(unsigned int cpu)
-{
- return smp_call_function_single(cpu, vgetcpu_cpu_init, NULL, 1);
-}
static int __init init_vdso(void)
{
@@ -378,9 +341,7 @@ static int __init init_vdso(void)
init_vdso_image(&vdso_image_x32);
#endif
- /* notifier priority > KVM */
- return cpuhp_setup_state(CPUHP_AP_X86_VDSO_VMA_ONLINE,
- "x86/vdso/vma:online", vgetcpu_online, NULL);
+ return 0;
}
subsys_initcall(init_vdso);
#endif /* CONFIG_X86_64 */
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index fca55d7..a2b1172 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -236,6 +236,31 @@
#define GDT_ENTRY_TLS_ENTRIES 3
#define TLS_SIZE (GDT_ENTRY_TLS_ENTRIES* 8)
+/* Bit size and mask of CPU number stored in the per CPU data (and TSC_AUX) */
+#define LSL_TSCP_CPU_SIZE 12
+#define LSL_TSCP_CPU_MASK 0xfff
+
+#ifndef __ASSEMBLY__
+
+/* Helper functions to store/load CPU and node numbers */
+
+static inline unsigned long make_lsl_tscp(int cpu, unsigned long node)
+{
+ return ((node << LSL_TSCP_CPU_SIZE) | cpu);
+}
+
+static inline unsigned int lsl_tscp_to_cpu(unsigned long x)
+{
+ return (x & LSL_TSCP_CPU_MASK);
+}
+
+static inline unsigned int lsl_tscp_to_node(unsigned long x)
+{
+ return (x >> LSL_TSCP_CPU_SIZE);
+}
+
+#endif
+
#ifdef __KERNEL__
/*
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index 9cd9036..24e69b3 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -79,8 +79,6 @@ static inline void gtod_write_end(struct vsyscall_gtod_data *s)
#ifdef CONFIG_X86_64
-#define VGETCPU_CPU_MASK 0xfff
-
static inline unsigned int __getcpu(void)
{
unsigned int p;
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 38276f5..c7b54f0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1665,6 +1665,11 @@ void cpu_init(void)
wrmsrl(MSR_FS_BASE, 0);
wrmsrl(MSR_KERNEL_GS_BASE, 0);
+#ifdef CONFIG_NUMA
+ write_rdtscp_aux(make_lsl_tscp(cpu, early_cpu_to_node(cpu)));
+#else
+ write_rdtscp_aux(make_lsl_tscp(cpu, 0));
+#endif
barrier();
x86_configure_nx();
diff --git a/arch/x86/kernel/setup_percpu.c b/arch/x86/kernel/setup_percpu.c
index ea554f8..61ab2e2 100644
--- a/arch/x86/kernel/setup_percpu.c
+++ b/arch/x86/kernel/setup_percpu.c
@@ -163,6 +163,30 @@ static inline void setup_percpu_segment(int cpu)
#endif
}
+static inline void setup_cpu_number_segment(int cpu)
+{
+#ifdef CONFIG_NUMA
+ unsigned long node = early_cpu_to_node(cpu);
+#else
+ unsigned long node = 0;
+#endif
+ struct desc_struct d = GDT_ENTRY_INIT(0x40f5, 0x0,
+ make_lsl_tscp(cpu, node));
+
+ /*
+ * CPU_NUMBER segment flags:
+ * type: RO data, expand down, accessed
+ * dpl: Visible to user code
+ * s: Not a system segment
+ * p: Present
+ * d: 32-bit
+ */
+ write_gdt_entry(get_cpu_gdt_rw(cpu),
+ GDT_ENTRY_CPU_NUMBER,
+ &d,
+ DESCTYPE_S);
+}
+
void __init setup_per_cpu_areas(void)
{
unsigned int cpu;
@@ -223,6 +247,7 @@ void __init setup_per_cpu_areas(void)
per_cpu(cpu_number, cpu) = cpu;
setup_percpu_segment(cpu);
setup_stack_canary_segment(cpu);
+ setup_cpu_number_segment(cpu);
/*
* Copy data used in early init routines from the
* initial arrays to the per cpu data areas. These
--
2.7.4
From: Andy Lutomirski <[email protected]>
ptrace can read FS/GS base using the register access API
(PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these
mechanisms return the actual FS/GS base.
This will improve debuggability by providing the correct information
to the ptracer (GDB etc.).
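For example, a debugger can read the tracee's FS base through the
register access API; a minimal user-space sketch, assuming pid is a
stopped tracee and omitting error handling:

	#include <stddef.h>
	#include <sys/ptrace.h>
	#include <sys/user.h>

	/* After this patch, this returns the actual FS base of the tracee */
	long fsbase = ptrace(PTRACE_PEEKUSER, pid,
			     offsetof(struct user_regs_struct, fs_base), 0);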
Signed-off-by: Andy Lutomirski <[email protected]>
[chang: Rebase and revise patch description]
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/kernel/process_64.c | 67 +++++++++++++++++++++++++++++++++-----------
1 file changed, 51 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ace0158..e498671 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -279,6 +279,49 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+static unsigned long task_seg_base(struct task_struct *task,
+ unsigned short selector)
+{
+ unsigned short idx = selector >> 3;
+ unsigned long base;
+
+ if (likely((selector & SEGMENT_TI_MASK) == 0)) {
+ if (unlikely(idx >= GDT_ENTRIES))
+ return 0;
+
+ /*
+ * There are no user segments in the GDT with nonzero bases
+ * other than the TLS segments.
+ */
+ if (idx < GDT_ENTRY_TLS_MIN || idx > GDT_ENTRY_TLS_MAX)
+ return 0;
+
+ idx -= GDT_ENTRY_TLS_MIN;
+ base = get_desc_base(&task->thread.tls_array[idx]);
+ } else {
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ struct ldt_struct *ldt;
+
+ /*
+ * If performance here mattered, we could protect the LDT
+ * with RCU. This is a slow path, though, so we can just
+ * take the mutex.
+ */
+ mutex_lock(&task->mm->context.lock);
+ ldt = task->mm->context.ldt;
+ if (unlikely(!ldt || idx >= ldt->nr_entries))
+ base = 0;
+ else
+ base = get_desc_base(ldt->entries + idx);
+ mutex_unlock(&task->mm->context.lock);
+#else
+ base = 0;
+#endif
+ }
+
+ return base;
+}
+
void write_fsbase(unsigned long fsbase)
{
/* set the selector to 0 to not confuse __switch_to */
@@ -297,16 +340,12 @@ unsigned long read_task_fsbase(struct task_struct *task)
{
unsigned long fsbase;
- if (task == current) {
+ if (task == current)
fsbase = read_fsbase();
- } else {
- /*
- * XXX: This will not behave as expected if called
- * if fsindex != 0. This preserves an existing bug
- * that will be fixed.
- */
+ else if (task->thread.fsindex == 0)
fsbase = task->thread.fsbase;
- }
+ else
+ fsbase = task_seg_base(task, task->thread.fsindex);
return fsbase;
}
@@ -315,16 +354,12 @@ unsigned long read_task_gsbase(struct task_struct *task)
{
unsigned long gsbase;
- if (task == current) {
+ if (task == current)
gsbase = read_inactive_gsbase();
- } else {
- /*
- * XXX: This will not behave as expected if called
- * if gsindex != 0. Same bug preservation as above
- * read_task_fsbase.
- */
+ else if (task->thread.gsindex == 0)
gsbase = task->thread.gsbase;
- }
+ else
+ gsbase = task_seg_base(task, task->thread.gsindex);
return gsbase;
}
--
2.7.4
On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>
> Using wrmsr_safe() can make code a bit simpler by removing
> some condition check
NAK.
If we really want to stop checking the cpu feature and unconditionally
write the MSR, then there should be a single patch that does it. If
we don't, then we should not make this change, since it will make the
very useful warning if we screwed it up go away.
I think we should continue checking the cpu feature, personally. I
also think that, if we make the change, it should not be part of the
FSGSBASE series.
On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>
> The new entry will be equivalent to that of x86-64 which
> stores CPU number. The entry is placed in segment 23 in GDT
> by bumping down 23-28 by one, which are all kernel-internal
> segments and so have no impact on user space.
>
> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
> for the default (flat, initial) user space %ss.
No, it won't :( This is because, on Xen PV, user code very frequently
sees a different, Xen-supplied "flat" SS value. This is definitely
true right now on 64-bit, and I'm reasonably confident it's also the
case on 32-bit.
As it stands, as far as I can tell, we don't have a "cpu number"
segment on 32-bit kernels. I see no compelling reason to add one, and
we should definitely not add one as part of the FSGSBASE series. I
think the right solution is to rename the 64-bit segment to
"CPU_NUMBER" and then have the rearrangement of the initialization
code as a followup patch. The goal is to make the patches
individually reviewable. As it stands, this patch adds some #defines
without making them work, which is extra confusing.
Given how many times we screwed it up, I really want the patch that
moves the initialization of the 64-bit CPU number to be obviously
correct and to avoid changing the semantics of anything except the
actual CPU number fields during boot.
So NAK to this patch, at least as part of the FSGSBASE series.
(My apologies -- a bunch of this is because I along with everyone else
misunderstood the existing code.)
On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>
> 64-bit doesn't use the entry for per CPU data, but for CPU
> numbers. The change will clarify the real usage of this
> entry in GDT.
Acked-by: Andy Lutomirski <[email protected]>
This is a *huge* improvement.
On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>
> The CPU (and node) number will be written, as early enough,
> to the segment limit of per CPU data and TSC_AUX MSR entry.
> The information has been retrieved by vgetcpu in user space
> and will be also loaded from the paranoid entry, when
> FSGSBASE enabled. So, it is moved out from vDSO to the CPU
> initialization path where IST setup is serialized.
>
Please split this into two patches. One patch should do the cleanups
and one patch should move the code.
> diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
> index 8ec3d1f..3284069 100644
> --- a/arch/x86/entry/vdso/vgetcpu.c
> +++ b/arch/x86/entry/vdso/vgetcpu.c
> @@ -18,9 +18,9 @@ __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
> p = __getcpu();
>
> if (cpu)
> - *cpu = p & VGETCPU_CPU_MASK;
> + *cpu = lsl_tscp_to_cpu(p);
> if (node)
> - *node = p >> 12;
> + *node = lsl_tscp_to_node(p);
> return 0;
This goes in the cleanup patch.
> +/* Bit size and mask of CPU number stored in the per CPU data (and TSC_AUX) */
> +#define LSL_TSCP_CPU_SIZE 12
> +#define LSL_TSCP_CPU_MASK 0xfff
> +
> +#ifndef __ASSEMBLY__
> +
> +/* Helper functions to store/load CPU and node numbers */
> +
> +static inline unsigned long make_lsl_tscp(int cpu, unsigned long node)
> +{
> + return ((node << LSL_TSCP_CPU_SIZE) | cpu);
> +}
> +
> +static inline unsigned int lsl_tscp_to_cpu(unsigned long x)
> +{
> + return (x & LSL_TSCP_CPU_MASK);
> +}
> +
> +static inline unsigned int lsl_tscp_to_node(unsigned long x)
> +{
> + return (x >> LSL_TSCP_CPU_SIZE);
> +}
As do these. But maybe they should be #ifdef CONFIG_X86_64 to make it
clear that they are not presently supported on 32-bit systems.
(Unless I'm wrong and they are supported.)
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 38276f5..c7b54f0 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1665,6 +1665,11 @@ void cpu_init(void)
>
> wrmsrl(MSR_FS_BASE, 0);
> wrmsrl(MSR_KERNEL_GS_BASE, 0);
> +#ifdef CONFIG_NUMA
> + write_rdtscp_aux(make_lsl_tscp(cpu, early_cpu_to_node(cpu)));
> +#else
> + write_rdtscp_aux(make_lsl_tscp(cpu, 0));
> +#endif
Why the ifdef? early_cpu_to_node() should return 0 if !CONFIG_NUMA.
> +static inline void setup_cpu_number_segment(int cpu)
> +{
> +#ifdef CONFIG_NUMA
> + unsigned long node = early_cpu_to_node(cpu);
> +#else
> + unsigned long node = 0;
> +#endif
This duplicates half the rdtscp_aux code. How about making this one
function setup_cpu_number() that does all of it?
> + struct desc_struct d = GDT_ENTRY_INIT(0x40f5, 0x0,
> + make_lsl_tscp(cpu, node));
Please get rid of the GDT_ENTRY_INIT stuff and just fill out all the
fields, just like it is in the code that you're moving. No one enjoys
decoding hex segment descriptors. (Well, Linus and hpa probably
*love* doing it just to show off.)
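For reference, the open-coded form from the code being moved fills
the fields explicitly, roughly like this (cpu and node assumed in
scope, as in vgetcpu_cpu_init()):

	struct desc_struct d = { };

	d.limit0 = cpu | ((node & 0xf) << 12);	/* 12 bits CPU, low node bits */
	d.limit1 = node >> 4;			/* remaining node bits */
	d.type = 5;	/* RO data, expand down, accessed */
	d.dpl = 3;	/* Visible to user code */
	d.s = 1;	/* Not a system segment */
	d.p = 1;	/* Present */
	d.d = 1;	/* 32-bit */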
Using wrmsr_safe() makes the code a bit simpler by removing
some condition checks.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/msr.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 04addd6..09bec62 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -317,7 +317,7 @@ static inline int wrmsrl_safe(u32 msr, u64 val)
#define write_tsc(low, high) wrmsr(MSR_IA32_TSC, (low), (high))
-#define write_rdtscp_aux(val) wrmsr(MSR_TSC_AUX, (val), 0)
+#define write_rdtscp_aux(val) wrmsr_safe(MSR_TSC_AUX, (val), 0)
struct msr *msrs_alloc(void);
void msrs_free(struct msr *msrs);
--
2.7.4
64-bit doesn't use this entry for per-CPU data, but for the
CPU number. The change clarifies the real usage of this
entry in the GDT.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/entry/vdso/vma.c | 5 ++++-
arch/x86/include/asm/segment.h | 5 ++---
arch/x86/include/asm/vgtod.h | 2 +-
3 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556..833e229 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -359,7 +359,10 @@ static void vgetcpu_cpu_init(void *arg)
d.p = 1; /* Present */
d.d = 1; /* 32-bit */
- write_gdt_entry(get_cpu_gdt_rw(cpu), GDT_ENTRY_PER_CPU, &d, DESCTYPE_S);
+ write_gdt_entry(get_cpu_gdt_rw(cpu),
+ GDT_ENTRY_CPU_NUMBER,
+ &d,
+ DESCTYPE_S);
}
static int vgetcpu_online(unsigned int cpu)
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index 8f09012b..492e3d1 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -186,8 +186,7 @@
#define GDT_ENTRY_TLS_MIN 12
#define GDT_ENTRY_TLS_MAX 14
-/* Abused to load per CPU data from limit */
-#define GDT_ENTRY_PER_CPU 15
+#define GDT_ENTRY_CPU_NUMBER 15
/*
* Number of entries in the GDT table:
@@ -207,7 +206,7 @@
#define __USER_DS (GDT_ENTRY_DEFAULT_USER_DS*8 + 3)
#define __USER32_DS __USER_DS
#define __USER_CS (GDT_ENTRY_DEFAULT_USER_CS*8 + 3)
-#define __PER_CPU_SEG (GDT_ENTRY_PER_CPU*8 + 3)
+#define __CPU_NUMBER_SEG (GDT_ENTRY_CPU_NUMBER*8 + 3)
#endif
diff --git a/arch/x86/include/asm/vgtod.h b/arch/x86/include/asm/vgtod.h
index fb856c9..9cd9036 100644
--- a/arch/x86/include/asm/vgtod.h
+++ b/arch/x86/include/asm/vgtod.h
@@ -96,7 +96,7 @@ static inline unsigned int __getcpu(void)
alternative_io ("lsl %[p],%[seg]",
".byte 0xf3,0x0f,0xc7,0xf8", /* RDPID %eax/rax */
X86_FEATURE_RDPID,
- [p] "=a" (p), [seg] "r" (__PER_CPU_SEG));
+ [p] "=a" (p), [seg] "r" (__CPU_NUMBER_SEG));
return p;
}
--
2.7.4
With the new helpers, FS/GS base access is centralized.
Eventually, when the FSGSBASE instructions are enabled, the
access will also be faster.
The helpers are used by the ptrace APIs (PTRACE_ARCH_PRCTL,
PTRACE_SETREG, PTRACE_GETREG, etc). The idea is to keep
the FS/GS update mechanism organized.
The "inactive" GS base refers to the base that is backed up
at kernel entry and belongs to the inactive (user) task.
The bug that returns a stale FS/GS base value (when the
index is nonzero) is preserved here and will be fixed by
the next patch.
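With the helpers, e.g. the arch_prctl paths reduce to single calls
(abridged from the diff below):

	case ARCH_SET_FS:
		ret = write_task_fsbase(task, arg2);
		break;
	case ARCH_GET_FS: {
		unsigned long base = read_task_fsbase(task);

		ret = put_user(base, (unsigned long __user *)arg2);
		break;
	}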
Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/fsgsbase.h | 47 ++++++++++++++
arch/x86/kernel/process_64.c | 132 +++++++++++++++++++++++++++++-----------
arch/x86/kernel/ptrace.c | 28 +++------
3 files changed, 153 insertions(+), 54 deletions(-)
create mode 100644 arch/x86/include/asm/fsgsbase.h
diff --git a/arch/x86/include/asm/fsgsbase.h b/arch/x86/include/asm/fsgsbase.h
new file mode 100644
index 0000000..f00a8a6
--- /dev/null
+++ b/arch/x86/include/asm/fsgsbase.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_FSGSBASE_H
+#define _ASM_FSGSBASE_H 1
+
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_X86_64
+
+#include <asm/msr-index.h>
+
+/*
+ * Read/write a task's fsbase or gsbase. This returns the value that
+ * the FS/GS base would have (if the task were to be resumed). These
+ * work on current or on a different non-running task.
+ */
+unsigned long read_task_fsbase(struct task_struct *task);
+unsigned long read_task_gsbase(struct task_struct *task);
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase);
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase);
+
+/* Helper functions for reading/writing FS/GS base */
+
+static inline unsigned long read_fsbase(void)
+{
+ unsigned long fsbase;
+
+ rdmsrl(MSR_FS_BASE, fsbase);
+ return fsbase;
+}
+
+void write_fsbase(unsigned long fsbase);
+
+static inline unsigned long read_inactive_gsbase(void)
+{
+ unsigned long gsbase;
+
+ rdmsrl(MSR_KERNEL_GS_BASE, gsbase);
+ return gsbase;
+}
+
+void write_inactive_gsbase(unsigned long gsbase);
+
+#endif /* CONFIG_X86_64 */
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_FSGSBASE_H */
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 12bb445..ace0158 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -54,6 +54,7 @@
#include <asm/vdso.h>
#include <asm/intel_rdt_sched.h>
#include <asm/unistd.h>
+#include <asm/fsgsbase.h>
#ifdef CONFIG_IA32_EMULATION
/* Not included via unistd.h */
#include <asm/unistd_32_ia32.h>
@@ -278,6 +279,94 @@ static __always_inline void load_seg_legacy(unsigned short prev_index,
}
}
+void write_fsbase(unsigned long fsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(FS, 0);
+ wrmsrl(MSR_FS_BASE, fsbase);
+}
+
+void write_inactive_gsbase(unsigned long gsbase)
+{
+ /* set the selector to 0 to not confuse __switch_to */
+ loadseg(GS, 0);
+ wrmsrl(MSR_KERNEL_GS_BASE, gsbase);
+}
+
+unsigned long read_task_fsbase(struct task_struct *task)
+{
+ unsigned long fsbase;
+
+ if (task == current) {
+ fsbase = read_fsbase();
+ } else {
+ /*
+ * XXX: This will not behave as expected if called
+ * if fsindex != 0. This preserves an existing bug
+ * that will be fixed.
+ */
+ fsbase = task->thread.fsbase;
+ }
+
+ return fsbase;
+}
+
+unsigned long read_task_gsbase(struct task_struct *task)
+{
+ unsigned long gsbase;
+
+ if (task == current) {
+ gsbase = read_inactive_gsbase();
+ } else {
+ /*
+ * XXX: This will not behave as expected if called
+ * if gsindex != 0. Same bug preservation as above
+ * read_task_fsbase.
+ */
+ gsbase = task->thread.gsbase;
+ }
+
+ return gsbase;
+}
+
+int write_task_fsbase(struct task_struct *task, unsigned long fsbase)
+{
+ int cpu;
+
+ /*
+ * Not strictly needed for fs, but do it for symmetry
+ * with gs
+ */
+ if (unlikely(fsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.fsbase = fsbase;
+ if (task == current)
+ write_fsbase(fsbase);
+ task->thread.fsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
+int write_task_gsbase(struct task_struct *task, unsigned long gsbase)
+{
+ int cpu;
+
+ if (unlikely(gsbase >= TASK_SIZE_MAX))
+ return -EPERM;
+
+ cpu = get_cpu();
+ task->thread.gsbase = gsbase;
+ if (task == current)
+ write_inactive_gsbase(gsbase);
+ task->thread.gsindex = 0;
+ put_cpu();
+
+ return 0;
+}
+
int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
unsigned long arg, struct task_struct *p, unsigned long tls)
{
@@ -618,54 +707,27 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
long do_arch_prctl_64(struct task_struct *task, int option, unsigned long arg2)
{
int ret = 0;
- int doit = task == current;
- int cpu;
switch (option) {
- case ARCH_SET_GS:
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.gsindex = 0;
- task->thread.gsbase = arg2;
- if (doit) {
- load_gs_index(0);
- ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, arg2);
- }
- put_cpu();
+ case ARCH_SET_GS: {
+ ret = write_task_gsbase(task, arg2);
break;
- case ARCH_SET_FS:
- /* Not strictly needed for fs, but do it for symmetry
- with gs */
- if (arg2 >= TASK_SIZE_MAX)
- return -EPERM;
- cpu = get_cpu();
- task->thread.fsindex = 0;
- task->thread.fsbase = arg2;
- if (doit) {
- /* set the selector to 0 to not confuse __switch_to */
- loadsegment(fs, 0);
- ret = wrmsrl_safe(MSR_FS_BASE, arg2);
- }
- put_cpu();
+ }
+ case ARCH_SET_FS: {
+ ret = write_task_fsbase(task, arg2);
break;
+ }
case ARCH_GET_FS: {
unsigned long base;
- if (doit)
- rdmsrl(MSR_FS_BASE, base);
- else
- base = task->thread.fsbase;
+ base = read_task_fsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
case ARCH_GET_GS: {
unsigned long base;
- if (doit)
- rdmsrl(MSR_KERNEL_GS_BASE, base);
- else
- base = task->thread.gsbase;
+ base = read_task_gsbase(task);
ret = put_user(base, (unsigned long __user *)arg2);
break;
}
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index ed5c4cd..b2f0beb 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -39,6 +39,7 @@
#include <asm/hw_breakpoint.h>
#include <asm/traps.h>
#include <asm/syscall.h>
+#include <asm/fsgsbase.h>
#include "tls.h"
@@ -396,12 +397,11 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
/*
- * When changing the segment base, use do_arch_prctl_64
- * to set either thread.fs or thread.fsindex and the
- * corresponding GDT slot.
+ * When changing the FS base, use the same
+ * mechanism as for do_arch_prctl_64
*/
if (child->thread.fsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_FS, value);
+ return write_task_fsbase(child, value);
return 0;
case offsetof(struct user_regs_struct,gs_base):
/*
@@ -410,7 +410,7 @@ static int putreg(struct task_struct *child,
if (value >= TASK_SIZE_MAX)
return -EIO;
if (child->thread.gsbase != value)
- return do_arch_prctl_64(child, ARCH_SET_GS, value);
+ return write_task_gsbase(child, value);
return 0;
#endif
}
@@ -434,20 +434,10 @@ static unsigned long getreg(struct task_struct *task, unsigned long offset)
return get_flags(task);
#ifdef CONFIG_X86_64
- case offsetof(struct user_regs_struct, fs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.fsbase;
- }
- case offsetof(struct user_regs_struct, gs_base): {
- /*
- * XXX: This will not behave as expected if called on
- * current or if fsindex != 0.
- */
- return task->thread.gsbase;
- }
+ case offsetof(struct user_regs_struct, fs_base):
+ return read_task_fsbase(task);
+ case offsetof(struct user_regs_struct, gs_base):
+ return read_task_gsbase(task);
#endif
}
--
2.7.4
When the new FSGSBASE instructions are enabled, this read
will become faster.
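For illustration, a sketch of the eventual fast path with FSGSBASE
enabled (assumed future code, not part of this patch; the function
name is hypothetical):

	/* Sketch only: RDFSBASE reads the FS base directly, avoiding RDMSR */
	static inline unsigned long rd_fsbase_insn(void)
	{
		unsigned long fsbase;

		asm volatile("rdfsbase %0" : "=r" (fsbase));
		return fsbase;
	}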
Based-on-code-from: Andy Lutomirski <[email protected]>
Signed-off-by: Chang S. Bae <[email protected]>
Reviewed-by: Andi Kleen <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Reviewed-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/elf.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 0d157d2..49acefb 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -10,6 +10,7 @@
#include <asm/ptrace.h>
#include <asm/user.h>
#include <asm/auxvec.h>
+#include <asm/fsgsbase.h>
typedef unsigned long elf_greg_t;
@@ -205,7 +206,6 @@ void set_personality_ia32(bool);
#define ELF_CORE_COPY_REGS(pr_reg, regs) \
do { \
- unsigned long base; \
unsigned v; \
(pr_reg)[0] = (regs)->r15; \
(pr_reg)[1] = (regs)->r14; \
@@ -228,8 +228,8 @@ do { \
(pr_reg)[18] = (regs)->flags; \
(pr_reg)[19] = (regs)->sp; \
(pr_reg)[20] = (regs)->ss; \
- rdmsrl(MSR_FS_BASE, base); (pr_reg)[21] = base; \
- rdmsrl(MSR_KERNEL_GS_BASE, base); (pr_reg)[22] = base; \
+ (pr_reg)[21] = read_fsbase(); \
+ (pr_reg)[22] = read_inactive_gsbase(); \
asm("movl %%ds,%0" : "=r" (v)); (pr_reg)[23] = v; \
asm("movl %%es,%0" : "=r" (v)); (pr_reg)[24] = v; \
asm("movl %%fs,%0" : "=r" (v)); (pr_reg)[25] = v; \
--
2.7.4
On Wed, Jun 6, 2018 at 10:25 AM Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>>
>> The CPU (and node) number will be written, as early enough,
>> to the segment limit of per CPU data and TSC_AUX MSR entry.
>> The information has been retrieved by vgetcpu in user space
>> and will be also loaded from the paranoid entry, when
>> FSGSBASE enabled. So, it is moved out from vDSO to the CPU
>> initialization path where IST setup is serialized.
>>
> Please split this into two patches. One patch should do the cleanups
> and one patch should move the code.
>> diff --git a/arch/x86/entry/vdso/vgetcpu.c b/arch/x86/entry/vdso/vgetcpu.c
>> index 8ec3d1f..3284069 100644
>> --- a/arch/x86/entry/vdso/vgetcpu.c
>> +++ b/arch/x86/entry/vdso/vgetcpu.c
>> @@ -18,9 +18,9 @@ __vdso_getcpu(unsigned *cpu, unsigned *node, struct getcpu_cache *unused)
>> p = __getcpu();
>>
>> if (cpu)
>> - *cpu = p & VGETCPU_CPU_MASK;
>> + *cpu = lsl_tscp_to_cpu(p);
>> if (node)
>> - *node = p >> 12;
>> + *node = lsl_tscp_to_node(p);
>> return 0;
> This goes in the cleanup patch.
>> +/* Bit size and mask of CPU number stored in the per CPU data (and TSC_AUX) */
>> +#define LSL_TSCP_CPU_SIZE 12
>> +#define LSL_TSCP_CPU_MASK 0xfff
>> +
>> +#ifndef __ASSEMBLY__
>> +
>> +/* Helper functions to store/load CPU and node numbers */
>> +
>> +static inline unsigned long make_lsl_tscp(int cpu, unsigned long node)
>> +{
>> + return ((node << LSL_TSCP_CPU_SIZE) | cpu);
>> +}
>> +
>> +static inline unsigned int lsl_tscp_to_cpu(unsigned long x)
>> +{
>> + return (x & LSL_TSCP_CPU_MASK);
>> +}
>> +
>> +static inline unsigned int lsl_tscp_to_node(unsigned long x)
>> +{
>> + return (x >> LSL_TSCP_CPU_SIZE);
>> +}
> As do these. But maybe they should be #ifdef CONFIG_X86_64 to make it
> clear that they are not presently supported on 32-bit systems.
> (Unless I'm wrong and they are supported.)
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 38276f5..c7b54f0 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -1665,6 +1665,11 @@ void cpu_init(void)
>>
>> wrmsrl(MSR_FS_BASE, 0);
>> wrmsrl(MSR_KERNEL_GS_BASE, 0);
>> +#ifdef CONFIG_NUMA
>> + write_rdtscp_aux(make_lsl_tscp(cpu, early_cpu_to_node(cpu)));
>> +#else
>> + write_rdtscp_aux(make_lsl_tscp(cpu, 0));
>> +#endif
> Why the ifdef? early_cpu_to_node() should return 0 if !CONFIG_NUMA.
>> +static inline void setup_cpu_number_segment(int cpu)
>> +{
>> +#ifdef CONFIG_NUMA
>> + unsigned long node = early_cpu_to_node(cpu);
>> +#else
>> + unsigned long node = 0;
>> +#endif
> This duplicates half the rdtscp_aux code. How about making this one
> function setup_cpu_number() that does all of it?
>> + struct desc_struct d = GDT_ENTRY_INIT(0x40f5, 0x0,
>> + make_lsl_tscp(cpu, node));
> Please get rid of the GDT_ENTRY_INIT stuff and just fill out all the
> fields, just like it is in the code that you're moving. No one enjoys
> decoding hex segment descriptors. (Well, Linus and hpa probably
> *love* doing it just to show off.)
Will update the patches as you commented.
Thanks
Chang
On Wed, Jun 6, 2018 at 1:16 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>>
>> The new entry will be equivalent to that of x86-64 which
>> stores CPU number. The entry is placed in segment 23 in GDT
>> by bumping down 23-28 by one, which are all kernel-internal
>> segments and so have no impact on user space.
>>
>> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
>> for the default (flat, initial) user space %ss.
>
> No, it won't :( This is because, on Xen PV, user code very frequently
> sees a different, Xen-supplied "flat" SS value. This is definitely
> true right now on 64-bit, and I'm reasonably confident it's also the
> case on 32-bit.
>
> As it stands, as far as I can tell, we don't have a "cpu number"
> segment on 32-bit kernels. I see no compelling reason to add one, and
> we should definitely not add one as part of the FSGSBASE series. I
> think the right solution is to rename the 64-bit segment to
> "CPU_NUMBER" and then have the rearrangement of the initialization
> code as a followup patch. The goal is to make the patches
> individually reviewable. As it stands, this patch adds some #defines
> without making them work, which is extra confusing.
>
> Given how many times we screwed it up, I really want the patch that
> moves the initialization of the 64-bit CPU number to be obviously
> correct and to avoid changing the semantics of anything except the
> actual CPU number fields during boot.
>
> So NAK to this patch, at least as part of the FSGSBASE series.
>
> (My apologies -- a bunch of this is because I along with everyone else
> misunderstood the existing code.)
The sole purpose of this segment is for the getcpu() function in the
VDSO. No other userspace code can rely on its presence or location.
--
Brian Gerst
On Wed, Jun 6, 2018 at 12:07 PM Brian Gerst <[email protected]> wrote:
>
> On Wed, Jun 6, 2018 at 1:16 PM, Andy Lutomirski <[email protected]> wrote:
> > On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
> >>
> >> The new entry will be equivalent to that of x86-64 which
> >> stores CPU number. The entry is placed in segment 23 in GDT
> >> by bumping down 23-28 by one, which are all kernel-internal
> >> segments and so have no impact on user space.
> >>
> >> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
> >> for the default (flat, initial) user space %ss.
> >
> > No, it won't :( This is because, on Xen PV, user code very frequently
> > sees a different, Xen-supplied "flat" SS value. This is definitely
> > true right now on 64-bit, and I'm reasonably confident it's also the
> > case on 32-bit.
> >
> > As it stands, as far as I can tell, we don't have a "cpu number"
> > segment on 32-bit kernels. I see no compelling reason to add one, and
> > we should definitely not add one as part of the FSGSBASE series. I
> > think the right solution is to rename the 64-bit segment to
> > "CPU_NUMBER" and then have the rearrangement of the initialization
> > code as a followup patch. The goal is to make the patches
> > individually reviewable. As it stands, this patch adds some #defines
> > without making them work, which is extra confusing.
> >
> > Given how many times we screwed it up, I really want the patch that
> > moves the initialization of the 64-bit CPU number to be obviously
> > correct and to avoid changing the semantics of anything except the
> > actual CPU number fields during boot.
> >
> > So NAK to this patch, at least as part of the FSGSBASE series.
> >
> > (My apologies -- a bunch of this is because I along with everyone else
> > misunderstood the existing code.)
>
> The sole purpose of this segment is for the getcpu() function in the
> VDSO. No other userspace code can rely on its presence or location.
>
Agreed. But this means that no code whatsoever should use it on a
32-bit kernel, so let's not add support.
On June 6, 2018 12:07:15 PM PDT, Brian Gerst <[email protected]> wrote:
>On Wed, Jun 6, 2018 at 1:16 PM, Andy Lutomirski <[email protected]>
>wrote:
>> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae
><[email protected]> wrote:
>>>
>>> The new entry will be equivalent to that of x86-64 which
>>> stores CPU number. The entry is placed in segment 23 in GDT
>>> by bumping down 23-28 by one, which are all kernel-internal
>>> segments and so have no impact on user space.
>>>
>>> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80'
>>> for the default (flat, initial) user space %ss.
>>
>> No, it won't :( This is because, on Xen PV, user code very
>frequently
>> sees a different, Xen-supplied "flat" SS value. This is definitely
>> true right now on 64-bit, and I'm reasonably confident it's also the
>> case on 32-bit.
>>
>> As it stands, as far as I can tell, we don't have a "cpu number"
>> segment on 32-bit kernels. I see no compelling reason to add one,
>and
>> we should definitely not add one as part of the FSGSBASE series. I
>> think the right solution is to rename the 64-bit segment to
>> "CPU_NUMBER" and then have the rearrangement of the initialization
>> code as a followup patch. The goal is to make the patches
>> individually reviewable. As it stands, this patch adds some #defines
>> without making them work, which is extra confusing.
>>
>> Given how many times we screwed it up, I really want the patch that
>> moves the initialization of the 64-bit CPU number to be obviously
>> correct and to avoid changing the semantics of anything except the
>> actual CPU number fields during boot.
>>
>> So NAK to this patch, at least as part of the FSGSBASE series.
>>
>> (My apologies -- a bunch of this is because I along with everyone
>else
>> misunderstood the existing code.)
>
>The sole purpose of this segment is for the getcpu() function in the
>VDSO. No other userspace code can rely on its presence or location.
>
>--
>Brian Gerst
Unfortunately that is not true in reality :(
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
> On Wed, Jun 6, 2018 at 10:25 AM Andy Lutomirski <[email protected]> wrote:
>> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae <[email protected]> wrote:
>> +static inline void setup_cpu_number_segment(int cpu)
>> +{
>> +#ifdef CONFIG_NUMA
>> + unsigned long node = early_cpu_to_node(cpu);
>> +#else
>> + unsigned long node = 0;
>> +#endif
> This duplicates half the rdtscp_aux code. How about making this one
> function setup_cpu_number() that does all of it?
It looks like the segment setup is done during early boot
(single-threaded). So, separating the TSC_AUX MSR setup out
into each CPU's initialization still seems reasonable to me.
Thanks,
Chang