From: "H. Peter Anvin" <[email protected]>
Give a debugger access to the visible part of the GDT and LDT. This
allows a debugger to find out what a particular segment descriptor
corresponds to; e.g. if %cs is 16, 32, or 64 bits.
v3:
Requalify LDT segments for selectors that have actually changed.
v2:
Add missing change to elf.h for the very last patch.
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
Cc: H. Peter Anvin <[email protected]>
arch/x86/Kconfig | 4 +
arch/x86/include/asm/desc.h | 24 +++-
arch/x86/include/asm/ldt.h | 16 +++
arch/x86/include/asm/segment.h | 10 ++
arch/x86/kernel/Makefile | 3 +-
arch/x86/kernel/ldt.c | 283 ++++++++++++++++++++++++++++++++---------
arch/x86/kernel/ptrace.c | 103 ++++++++++++++-
arch/x86/kernel/tls.c | 102 +++++----------
arch/x86/kernel/tls.h | 8 +-
include/uapi/linux/elf.h | 2 +
10 files changed, 413 insertions(+), 142 deletions(-)
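As an illustration of the intended use, here is a minimal sketch (not part
of the series; error handling omitted, LDT selectors not handled, and it
assumes a 64-bit debugger, since struct user_desc only carries the lm bit
on x86-64) of how a debugger could classify %cs of a stopped tracee via
the new regset:

#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <asm/ldt.h>		/* struct user_desc */

#ifndef NT_X86_GDT
#define NT_X86_GDT 0x203	/* added by this series */
#endif

static const char *cs_mode(pid_t pid, unsigned short cs)
{
	struct user_desc gdt[16];	/* the user-visible slots */
	struct iovec iov = { .iov_base = gdt, .iov_len = sizeof(gdt) };

	if (cs & 4)		/* TI set: LDT selector, see NT_X86_LDT */
		return "ldt";
	if ((cs >> 3) >= sizeof(gdt)/sizeof(gdt[0]))
		return "out of range";
	if (ptrace(PTRACE_GETREGSET, pid, NT_X86_GDT, &iov) < 0)
		return "unknown";

	if (gdt[cs >> 3].lm)	/* L bit set: 64-bit code segment */
		return "64-bit";
	return gdt[cs >> 3].seg_32bit ? "32-bit" : "16-bit";
}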
From: "H. Peter Anvin" <[email protected]>
Instead of using X86_32 || IA32_EMULATION, which is rather ugly,
especially in the Makefile, create a dedicated config symbol for the
TLS area. It will be used further in subsequent patches.
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/kernel/Makefile | 3 +--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f1dbb4ee19d7..e48036e3f158 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2937,6 +2937,10 @@ config X86_DMA_REMAP
config HAVE_GENERIC_GUP
def_bool y
+config X86_TLS_AREA
+ def_bool y
+ depends on IA32_EMULATION || X86_32
+
source "net/Kconfig"
source "drivers/Kconfig"
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 02d6f5cf4e70..d26ef8a40a88 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -65,8 +65,7 @@ obj-y += resource.o
obj-y += process.o
obj-y += fpu/
obj-y += ptrace.o
-obj-$(CONFIG_X86_32) += tls.o
-obj-$(CONFIG_IA32_EMULATION) += tls.o
+obj-$(CONFIG_X86_TLS_AREA) += tls.o
obj-y += step.o
obj-$(CONFIG_INTEL_TXT) += tboot.o
obj-$(CONFIG_ISA_DMA_API) += i8237.o
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
This is generic code which is potentially useful in other contexts.
Unfortunately modify_ldt() is kind of stupid in that it returns
descriptors in CPU format but takes them in a different format; regsets
*have* to use the same format in both directions.
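To illustrate the mismatch, here is the same flat 32-bit DPL-3 code
segment in both formats (a kernel-internal sketch, for illustration only;
the GDT_ENTRY_INIT flags value matches what the kernel uses for the
default user code segment):

#include <asm/desc_defs.h>	/* struct desc_struct, GDT_ENTRY_INIT */
#include <uapi/asm/ldt.h>	/* struct user_desc */

/* CPU format: 8 bytes, what modify_ldt(0, ...) hands back */
struct desc_struct hw = GDT_ENTRY_INIT(0xc0fb, 0, 0xfffff);

/* user_desc format: 16 bytes, what modify_ldt(1, ...) takes */
struct user_desc ud = {
	.entry_number	 = 0,
	.base_addr	 = 0,
	.limit		 = 0xfffff,
	.seg_32bit	 = 1,	/* D bit */
	.contents	 = 2,	/* type >> 2: non-conforming code */
	.read_exec_only	 = 0,
	.limit_in_pages	 = 1,	/* G bit */
	.seg_not_present = 0,
	.useable	 = 0,	/* AVL bit */
};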
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/include/asm/desc.h | 22 ++++++++++++++++++++++
arch/x86/kernel/tls.c | 19 -------------------
2 files changed, 22 insertions(+), 19 deletions(-)
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 13c5ee878a47..5e69f993c9ff 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -386,6 +386,28 @@ static inline void set_desc_limit(struct desc_struct *desc, unsigned long limit)
desc->limit1 = (limit >> 16) & 0xf;
}
+
+static inline void fill_user_desc(struct user_desc *info, int idx,
+ const struct desc_struct *desc)
+{
+ memset(info, 0, sizeof(*info));
+ info->entry_number = idx;
+ if (desc->s && desc->dpl == 3) {
+ info->base_addr = get_desc_base(desc);
+ info->limit = get_desc_limit(desc);
+ info->seg_32bit = desc->d;
+ info->contents = desc->type >> 2;
+ info->read_exec_only = !(desc->type & 2);
+ info->limit_in_pages = desc->g;
+ info->seg_not_present = !desc->p;
+ info->useable = desc->avl;
+#ifdef CONFIG_X86_64
+ info->lm = desc->l;
+#endif
+ }
+}
+
void update_intr_gate(unsigned int n, const void *addr);
void alloc_intr_gate(unsigned int n, const void *addr);
diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c
index a5b802a12212..7b8ecb760707 100644
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -197,25 +197,6 @@ SYSCALL_DEFINE1(set_thread_area, struct user_desc __user *, u_info)
* Get the current Thread-Local Storage area:
*/
-static void fill_user_desc(struct user_desc *info, int idx,
- const struct desc_struct *desc)
-
-{
- memset(info, 0, sizeof(*info));
- info->entry_number = idx;
- info->base_addr = get_desc_base(desc);
- info->limit = get_desc_limit(desc);
- info->seg_32bit = desc->d;
- info->contents = desc->type >> 2;
- info->read_exec_only = !(desc->type & 2);
- info->limit_in_pages = desc->g;
- info->seg_not_present = !desc->p;
- info->useable = desc->avl;
-#ifdef CONFIG_X86_64
- info->lm = desc->l;
-#endif
-}
-
int do_get_thread_area(struct task_struct *p, int idx,
struct user_desc __user *u_info)
{
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
We don't want to advertise to user space how many slots the kernel
GDT has, but user space can trivially find out what the last
user-accessible GDT slot is. Add a #define for it so we can use
it when sizing a regset.
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/include/asm/segment.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/segment.h b/arch/x86/include/asm/segment.h
index e293c122d0d5..5eb809eec048 100644
--- a/arch/x86/include/asm/segment.h
+++ b/arch/x86/include/asm/segment.h
@@ -115,6 +115,11 @@
*/
#define GDT_ENTRIES 32
+/*
+ * Last user-visible GDT slot
+ */
+#define GDT_LAST_USER GDT_ENTRY_DEFAULT_USER_DS
+
/*
* Segment selector values corresponding to the above entries:
*/
@@ -194,6 +199,11 @@
*/
#define GDT_ENTRIES 16
+/*
+ * Last user-visible GDT slot
+ */
+#define GDT_LAST_USER GDT_ENTRY_PER_CPU
+
/*
* Segment selector values corresponding to the above entries:
*
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
Provide ptrace/regset access to the LDT, if one exists. This
interface provides both read and write access. The write code is
unified with modify_ldt(); the read code doesn't have enough
similarity, so it has been kept separate.
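As a sketch of the resulting interface (not part of this patch; error
handling omitted), a tracer could rewrite the first two LDT slots of a
stopped tracee like this. Note that with the regset the slot is
determined by the position in the buffer, not by entry_number, and
SETREGSET always starts at slot 0:

#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/uio.h>
#include <asm/ldt.h>		/* struct user_desc */

#ifndef NT_X86_LDT
#define NT_X86_LDT 0x204	/* added by this patch */
#endif

static long write_ldt_slots(pid_t pid)
{
	struct user_desc ldt[2] = {
		{ .base_addr = 0x10000, .limit = 0xfffff,
		  .seg_32bit = 1, .limit_in_pages = 1 },
		{ .base_addr = 0x20000, .limit = 0xfffff,
		  .seg_32bit = 1, .limit_in_pages = 1 },
	};
	struct iovec iov = { .iov_base = ldt, .iov_len = sizeof(ldt) };

	/* Writes slots 0..1; reading back works via PTRACE_GETREGSET */
	return ptrace(PTRACE_SETREGSET, pid, NT_X86_LDT, &iov);
}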
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/include/asm/desc.h | 2 +-
arch/x86/include/asm/ldt.h | 16 ++++
arch/x86/kernel/ldt.c | 209 +++++++++++++++++++++++++++++++++++---------
arch/x86/kernel/ptrace.c | 20 +++++
include/uapi/linux/elf.h | 1 +
5 files changed, 204 insertions(+), 44 deletions(-)
create mode 100644 arch/x86/include/asm/ldt.h
diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 5e69f993c9ff..12a759d76314 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -3,7 +3,7 @@
#define _ASM_X86_DESC_H
#include <asm/desc_defs.h>
-#include <asm/ldt.h>
+#include <uapi/asm/ldt.h>
#include <asm/mmu.h>
#include <asm/fixmap.h>
#include <asm/irq_vectors.h>
diff --git a/arch/x86/include/asm/ldt.h b/arch/x86/include/asm/ldt.h
new file mode 100644
index 000000000000..302ec2d6d45d
--- /dev/null
+++ b/arch/x86/include/asm/ldt.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Ptrace interface to the LDT
+ *
+ */
+
+#ifndef _ARCH_X86_INCLUDE_ASM_LDT_H
+
+#include <linux/regset.h>
+#include <uapi/asm/ldt.h>
+
+extern user_regset_active_fn regset_ldt_active;
+extern user_regset_get_fn regset_ldt_get;
+extern user_regset_set_fn regset_ldt_set;
+
+#endif /* _ARCH_X86_INCLUDE_ASM_LDT_H */
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 601d24268a99..e80dfde1f82f 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -392,53 +392,39 @@ static int read_default_ldt(void __user *ptr, unsigned long bytecount)
return bytecount;
}
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+
+static int do_write_ldt(const struct task_struct *target,
+ const void *kbuf, const void __user *ubuf,
+ unsigned long bytecount, unsigned int index,
+ bool oldmode)
{
- struct mm_struct *mm = current->mm;
+ struct mm_struct *mm = target->mm;
struct ldt_struct *new_ldt, *old_ldt;
- unsigned int old_nr_entries, new_nr_entries;
+ unsigned int count, old_nr_entries, new_nr_entries;
struct user_desc ldt_info;
+ const struct user_desc *kptr = kbuf;
+ const struct user_desc __user *uptr = ubuf;
struct desc_struct ldt;
+ unsigned short first_index, last_index;
int error;
- error = -EINVAL;
- if (bytecount != sizeof(ldt_info))
- goto out;
- error = -EFAULT;
- if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info)))
- goto out;
+ if (bytecount % sizeof(ldt_info) ||
+ bytecount >= LDT_ENTRY_SIZE*LDT_ENTRIES)
+ return -EINVAL;
- error = -EINVAL;
- if (ldt_info.entry_number >= LDT_ENTRIES)
- goto out;
- if (ldt_info.contents == 3) {
- if (oldmode)
- goto out;
- if (ldt_info.seg_not_present == 0)
- goto out;
- }
+ count = bytecount/sizeof(ldt_info);
+ if (index >= LDT_ENTRIES || index + count > LDT_ENTRIES)
+ return -EINVAL;
- if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
- LDT_empty(&ldt_info)) {
- /* The user wants to clear the entry. */
- memset(&ldt, 0, sizeof(ldt));
- } else {
- if (!IS_ENABLED(CONFIG_X86_16BIT) && !ldt_info.seg_32bit) {
- error = -EINVAL;
- goto out;
- }
-
- fill_ldt(&ldt, &ldt_info);
- if (oldmode)
- ldt.avl = 0;
- }
+ first_index = index;
+ last_index = index + count - 1;
if (down_write_killable(&mm->context.ldt_usr_sem))
return -EINTR;
- old_ldt = mm->context.ldt;
+ old_ldt = mm->context.ldt;
old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
- new_nr_entries = max(ldt_info.entry_number + 1, old_nr_entries);
+ new_nr_entries = max(index + count, old_nr_entries);
error = -ENOMEM;
new_ldt = alloc_ldt_struct(new_nr_entries);
@@ -446,9 +432,46 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
goto out_unlock;
if (old_ldt)
- memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);
+ memcpy(new_ldt->entries, old_ldt->entries,
+ old_nr_entries * LDT_ENTRY_SIZE);
+
+ while (count--) {
+ error = -EFAULT;
+ if (kptr) {
+ memcpy(&ldt_info, kptr++, sizeof(ldt_info));
+ } else {
+ if (__copy_from_user(&ldt_info, uptr++,
+ sizeof(ldt_info)))
+ goto out_free;
+ }
+
+ error = -EINVAL;
+ if (ldt_info.contents == 3) {
+ if (oldmode)
+ goto out_free;
+ if (ldt_info.seg_not_present == 0)
+ goto out_free;
+ }
+
+ if ((oldmode && !ldt_info.base_addr && !ldt_info.limit) ||
+ LDT_empty(&ldt_info)) {
+ /* The user wants to clear the entry. */
+ memset(&ldt, 0, sizeof(ldt));
+		} else {
+			if (!IS_ENABLED(CONFIG_X86_16BIT) &&
+			    !ldt_info.seg_32bit) {
+				error = -EINVAL;
+				goto out_free;
+			}
+
+			fill_ldt(&ldt, &ldt_info);
+			if (oldmode)
+				ldt.avl = 0;
+		}
+
+ new_ldt->entries[index++] = ldt;
+ }
- new_ldt->entries[ldt_info.entry_number] = ldt;
finalize_ldt_struct(new_ldt);
/*
@@ -466,20 +489,41 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
*/
if (!WARN_ON_ONCE(old_ldt))
free_ldt_pgtables(mm);
- free_ldt_struct(new_ldt);
- goto out_unlock;
+ goto out_free;
}
- install_ldt(mm, new_ldt, ldt_info.entry_number, ldt_info.entry_number);
- free_ldt_struct(old_ldt);
+ install_ldt(mm, new_ldt, first_index, last_index);
+
+ /* Success! */
+ new_ldt = old_ldt; /* Free the old LDT, not the new one */
error = 0;
+out_free:
+ free_ldt_struct(new_ldt);
out_unlock:
up_write(&mm->context.ldt_usr_sem);
-out:
return error;
}
+static int write_ldt(void __user *ptr, unsigned long bytecount, bool oldmode)
+{
+ struct user_desc ldt_info;
+
+ /*
+ * We have to read the LDT entry number from the structure
+ * ahead of time, so it is not practical to allow more than
+ * one update here.
+ */
+ if (bytecount != sizeof(ldt_info))
+ return -EINVAL;
+
+ if (copy_from_user(&ldt_info, ptr, sizeof(ldt_info)))
+ return -EFAULT;
+
+ return do_write_ldt(current, &ldt_info, NULL, sizeof(ldt_info),
+ ldt_info.entry_number, oldmode);
+}
+
SYSCALL_DEFINE3(modify_ldt, int , func , void __user * , ptr ,
unsigned long , bytecount)
{
@@ -490,13 +534,13 @@ SYSCALL_DEFINE3(modify_ldt, int , func , void __user * , ptr ,
ret = read_ldt(ptr, bytecount);
break;
case 1:
- ret = write_ldt(ptr, bytecount, 1);
+ ret = write_ldt(ptr, bytecount, true);
break;
case 2:
ret = read_default_ldt(ptr, bytecount);
break;
case 0x11:
- ret = write_ldt(ptr, bytecount, 0);
+ ret = write_ldt(ptr, bytecount, false);
break;
}
/*
@@ -510,3 +554,82 @@ SYSCALL_DEFINE3(modify_ldt, int , func , void __user * , ptr ,
*/
return (unsigned int)ret;
}
+
+int regset_ldt_active(struct task_struct *target,
+ const struct user_regset *regset)
+{
+ struct mm_struct *mm = target->mm;
+ const struct desc_struct *p;
+ int n;
+
+ down_read(&mm->context.ldt_usr_sem);
+
+ if (!mm->context.ldt) {
+ n = 0;
+ } else {
+ n = mm->context.ldt->nr_entries;
+ p = mm->context.ldt->entries + n;
+ while (n > 0 && desc_empty(--p))
+ n--;
+ }
+
+ up_read(&mm->context.ldt_usr_sem);
+ return n;
+}
+
+int regset_ldt_get(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ void *kbuf, void __user *ubuf)
+{
+ struct mm_struct *mm = target->mm;
+ const struct desc_struct *p;
+ struct user_desc udesc;
+ unsigned int index, nr_entries;
+ int err = 0;
+
+ if (pos % sizeof(struct user_desc))
+ return -EINVAL;
+
+ index = pos/sizeof(struct user_desc);
+
+ down_read(&mm->context.ldt_usr_sem);
+
+ if (!mm->context.ldt) {
+ nr_entries = 0;
+ p = NULL;
+ } else {
+ nr_entries = mm->context.ldt->nr_entries;
+ p = mm->context.ldt->entries + index;
+ }
+
+ while (count >= sizeof(struct user_desc) && index < nr_entries) {
+ fill_user_desc(&udesc, index++, p++);
+ err = user_regset_copyout(&pos, &count, &kbuf, &ubuf,
+ &udesc, pos, pos + sizeof(udesc));
+ if (err)
+ goto out_unlock;
+ }
+
+ up_read(&mm->context.ldt_usr_sem);
+
+ return user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf,
+ nr_entries*sizeof(struct user_desc),
+ LDT_ENTRIES*sizeof(struct user_desc));
+out_unlock:
+ up_read(&mm->context.ldt_usr_sem);
+ return err;
+}
+
+int regset_ldt_set(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ const void *kbuf, const void __user *ubuf)
+{
+ if (pos % sizeof(struct user_desc))
+ return -EINVAL;
+
+ pos /= sizeof(struct user_desc);
+ return do_write_ldt(target, kbuf, ubuf, count, pos, false);
+}
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 5ce10310f440..4400ef506b5d 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1340,6 +1340,16 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.active = gdt_active, .get = gdt_get,
.set = regset_gdt_set
},
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ [REGSET_LDT] = {
+ .core_note_type = NT_X86_LDT,
+ .n = LDT_ENTRIES,
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = regset_ldt_active, .get = regset_ldt_get,
+ .set = regset_ldt_set
+ },
+#endif
};
static const struct user_regset_view user_x86_64_view = {
@@ -1404,6 +1414,16 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.active = gdt_active,
.get = gdt_get, .set = regset_gdt_set
},
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+ [REGSET_LDT] = {
+ .core_note_type = NT_X86_LDT,
+ .n = LDT_ENTRIES,
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = regset_ldt_active,
+ .get = regset_ldt_get, .set = regset_ldt_set
+ },
+#endif
};
static const struct user_regset_view user_x86_32_view = {
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 8c386906eba8..9d3564fd6daa 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -401,6 +401,7 @@ typedef struct elf64_shdr {
#define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
#define NT_X86_GDT 0x203 /* x86 GDT content (user visible) */
+#define NT_X86_LDT 0x204 /* x86 LDT content (if any) */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
Provide access to the user-visible part of the GDT via a regset in
ptrace(). Note that we already provide a regset for the TLS area part
of the GDT; these can trivially be unified by looking at the contents
of the regset structure, especially since the TLS area is the only
user-modifiable part of the GDT.
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/kernel/ptrace.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++--
arch/x86/kernel/tls.c | 83 ++++++++++++++++++------------------------------
arch/x86/kernel/tls.h | 8 +++--
include/uapi/linux/elf.h | 1 +
4 files changed, 118 insertions(+), 57 deletions(-)
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index e2ee403865eb..5ce10310f440 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -50,6 +50,7 @@ enum x86_regset {
REGSET_XSTATE,
REGSET_TLS,
REGSET_IOPERM32,
+ REGSET_GDT
};
struct pt_regs_offset {
@@ -747,6 +748,60 @@ static int ioperm_get(struct task_struct *target,
0, IO_BITMAP_BYTES);
}
+/*
+ * These provide read access to the GDT. As the only part that is
+ * writable is the TLS area, that code is in tls.c.
+ */
+static int gdt_get(struct task_struct *target,
+ const struct user_regset *regset,
+ unsigned int pos, unsigned int count,
+ void *kbuf, void __user *ubuf)
+{
+ struct desc_struct gdt_copy[GDT_LAST_USER + 1];
+ const struct desc_struct *p;
+ struct user_desc udesc;
+ unsigned int index, endindex;
+ int err;
+
+ if (pos % sizeof(struct user_desc))
+ return -EINVAL;
+
+ /* Get a snapshot of the GDT from an arbitrary CPU */
+ memcpy(gdt_copy, get_current_gdt_ro(), sizeof(gdt_copy));
+
+ /* Copy over the TLS area */
+ memcpy(&gdt_copy[GDT_ENTRY_TLS_MIN], target->thread.tls_array,
+ sizeof(target->thread.tls_array));
+
+ /* Descriptor zero is never accessible */
+ memset(&gdt_copy[0], 0, sizeof(gdt_copy[0]));
+
+ index = pos/sizeof(struct user_desc);
+ endindex = index + count/sizeof(struct user_desc);
+ endindex = min_t(unsigned int, GDT_LAST_USER + 1 - regset->bias,
+ endindex);
+
+ p = &gdt_copy[index + regset->bias];
+
+ while (count && index < endindex) {
+ fill_user_desc(&udesc, index++, p++);
+ err = user_regset_copyout(&pos, &count, &kbuf, &ubuf, &udesc,
+ pos, pos + sizeof(udesc));
+ if (err)
+ return err;
+ }
+
+ /* Return zero for the rest of the regset, if applicable. */
+ return user_regset_copyout_zero(&pos, &count, &kbuf, &ubuf, 0, -1);
+}
+
+static int gdt_active(struct task_struct *target,
+ const struct user_regset *regset)
+{
+ (void)target;
+ return GDT_LAST_USER + 1;
+}
+
/*
* Called by kernel/ptrace.c when detaching..
*
@@ -1262,7 +1317,8 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.core_note_type = NT_PRFPREG,
.n = sizeof(struct user_i387_struct) / sizeof(long),
.size = sizeof(long), .align = sizeof(long),
- .active = regset_xregset_fpregs_active, .get = xfpregs_get, .set = xfpregs_set
+ .active = regset_xregset_fpregs_active, .get = xfpregs_get,
+ .set = xfpregs_set
},
[REGSET_XSTATE] = {
.core_note_type = NT_X86_XSTATE,
@@ -1276,6 +1332,14 @@ static struct user_regset x86_64_regsets[] __ro_after_init = {
.size = sizeof(long), .align = sizeof(long),
.active = ioperm_active, .get = ioperm_get
},
+ [REGSET_GDT] = {
+ .core_note_type = NT_X86_GDT,
+ .n = LDT_ENTRIES, /* Theoretical maximum */
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = gdt_active, .get = gdt_get,
+ .set = regset_gdt_set
+ },
};
static const struct user_regset_view user_x86_64_view = {
@@ -1309,7 +1373,8 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.core_note_type = NT_PRXFPREG,
.n = sizeof(struct user32_fxsr_struct) / sizeof(u32),
.size = sizeof(u32), .align = sizeof(u32),
- .active = regset_xregset_fpregs_active, .get = xfpregs_get, .set = xfpregs_set
+ .active = regset_xregset_fpregs_active, .get = xfpregs_get,
+ .set = xfpregs_set
},
[REGSET_XSTATE] = {
.core_note_type = NT_X86_XSTATE,
@@ -1323,7 +1388,7 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.size = sizeof(struct user_desc),
.align = sizeof(struct user_desc),
.active = regset_tls_active,
- .get = regset_tls_get, .set = regset_tls_set
+ .get = gdt_get, .set = regset_gdt_set
},
[REGSET_IOPERM32] = {
.core_note_type = NT_386_IOPERM,
@@ -1331,6 +1396,14 @@ static struct user_regset x86_32_regsets[] __ro_after_init = {
.size = sizeof(u32), .align = sizeof(u32),
.active = ioperm_active, .get = ioperm_get
},
+ [REGSET_GDT] = {
+ .core_note_type = NT_X86_GDT,
+ .n = LDT_ENTRIES, /* Theoretical maximum */
+ .size = sizeof(struct user_desc),
+ .align = sizeof(struct user_desc),
+ .active = gdt_active,
+ .get = gdt_get, .set = regset_gdt_set
+ },
};
static const struct user_regset_view user_x86_32_view = {
@@ -1399,3 +1472,7 @@ void send_sigtrap(struct task_struct *tsk, struct pt_regs *regs,
/* Send us the fake SIGTRAP */
force_sig_info(SIGTRAP, &info, tsk);
}
+
+/*
+ * Copy out a set of segment descriptors in user_desc format.
+ */
diff --git a/arch/x86/kernel/tls.c b/arch/x86/kernel/tls.c
index 7b8ecb760707..fd8aa21654ff 100644
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -231,67 +231,46 @@ int regset_tls_active(struct task_struct *target,
return n;
}
-int regset_tls_get(struct task_struct *target, const struct user_regset *regset,
- unsigned int pos, unsigned int count,
- void *kbuf, void __user *ubuf)
-{
- const struct desc_struct *tls;
-
- if (pos >= GDT_ENTRY_TLS_ENTRIES * sizeof(struct user_desc) ||
- (pos % sizeof(struct user_desc)) != 0 ||
- (count % sizeof(struct user_desc)) != 0)
- return -EINVAL;
-
- pos /= sizeof(struct user_desc);
- count /= sizeof(struct user_desc);
-
- tls = &target->thread.tls_array[pos];
-
- if (kbuf) {
- struct user_desc *info = kbuf;
- while (count-- > 0)
- fill_user_desc(info++, GDT_ENTRY_TLS_MIN + pos++,
- tls++);
- } else {
- struct user_desc __user *u_info = ubuf;
- while (count-- > 0) {
- struct user_desc info;
- fill_user_desc(&info, GDT_ENTRY_TLS_MIN + pos++, tls++);
- if (__copy_to_user(u_info++, &info, sizeof(info)))
- return -EFAULT;
- }
- }
-
- return 0;
-}
-
-int regset_tls_set(struct task_struct *target, const struct user_regset *regset,
+/* The only part of the GDT that is settable is the TLS area */
+int regset_gdt_set(struct task_struct *target,
+ const struct user_regset *regset,
unsigned int pos, unsigned int count,
const void *kbuf, const void __user *ubuf)
{
struct user_desc infobuf[GDT_ENTRY_TLS_ENTRIES];
- const struct user_desc *info;
- int i;
-
- if (pos >= GDT_ENTRY_TLS_ENTRIES * sizeof(struct user_desc) ||
- (pos % sizeof(struct user_desc)) != 0 ||
- (count % sizeof(struct user_desc)) != 0)
+ const struct user_desc * const info = infobuf;
+ const unsigned int minpos =
+ GDT_ENTRY_TLS_MIN * sizeof(struct user_desc);
+ const unsigned int maxpos =
+ (GDT_ENTRY_TLS_MAX+1) * sizeof(struct user_desc);
+ int err;
+ unsigned int index, ntls, i;
+
+ if (pos % sizeof(struct user_desc))
return -EINVAL;
- if (kbuf)
- info = kbuf;
- else if (__copy_from_user(infobuf, ubuf, count))
- return -EFAULT;
- else
- info = infobuf;
+ pos += regset->bias * sizeof(struct user_desc);
+
+ /* Ignore entries before the TLS region */
+ err = user_regset_copyin_ignore(&pos, &count, &kbuf, &ubuf, 0, minpos);
+ if (err)
+ return err;
- for (i = 0; i < count / sizeof(struct user_desc); i++)
- if (!tls_desc_okay(info + i))
+ /* Load the TLS descriptor information */
+ index = pos/sizeof(struct user_desc);
+ ntls = count/sizeof(struct user_desc);
+ ntls = min_t(unsigned int, GDT_ENTRY_TLS_ENTRIES, ntls);
+ err = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
+ infobuf, minpos, maxpos);
+ if (err)
+ return err;
+
+ for (i = 0; i < ntls; i++) {
+ if (!tls_desc_okay(&info[i]))
return -EINVAL;
+ }
- set_tls_desc(target,
- GDT_ENTRY_TLS_MIN + (pos / sizeof(struct user_desc)),
- info, count / sizeof(struct user_desc));
+ set_tls_desc(target, index, info, ntls);
return 0;
}
diff --git a/arch/x86/kernel/tls.h b/arch/x86/kernel/tls.h
index 2f083a2fe216..936680e904d0 100644
--- a/arch/x86/kernel/tls.h
+++ b/arch/x86/kernel/tls.h
@@ -14,8 +14,12 @@
#include <linux/regset.h>
+#ifdef CONFIG_X86_TLS_AREA
extern user_regset_active_fn regset_tls_active;
-extern user_regset_get_fn regset_tls_get;
-extern user_regset_set_fn regset_tls_set;
+extern user_regset_set_fn regset_gdt_set;
+#else
+#define regset_tls_active NULL
+#define regset_gdt_set NULL
+#endif
#endif /* _ARCH_X86_KERNEL_TLS_H */
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 4e12c423b9fe..8c386906eba8 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -400,6 +400,7 @@ typedef struct elf64_shdr {
#define NT_386_TLS 0x200 /* i386 TLS slots (struct user_desc) */
#define NT_386_IOPERM 0x201 /* x86 io permission bitmap (1=deny) */
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
+#define NT_X86_GDT 0x203 /* x86 GDT content (user visible) */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
There is no point in having two different sizes for the "default ldt",
a concept which is obsolete anyway. Since this is kernel-dependent
and not user-space dependent, a 32-bit app needs to be able to accept
the 64-bit value anyway, so use that value, which is the larger of the
two.
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/kernel/ldt.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 18e9f4c0633d..601d24268a99 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -383,12 +383,8 @@ static int read_ldt(void __user *ptr, unsigned long bytecount)
static int read_default_ldt(void __user *ptr, unsigned long bytecount)
{
- /* CHECKME: Can we use _one_ random number ? */
-#ifdef CONFIG_X86_32
- unsigned long size = 5 * sizeof(struct desc_struct);
-#else
- unsigned long size = 128;
-#endif
+ const unsigned long size = 128;
+
if (bytecount > size)
bytecount = size;
if (clear_user(ptr, bytecount))
--
2.14.4
From: "H. Peter Anvin" <[email protected]>
It is not only %ds and %es which contain cached user descriptor
information; %fs and %gs do as well.
To make sure we don't do something stupid that will affect processes
which wouldn't want this requalification, be more restrictive about
which selector numbers will be requalified: they need to be LDT
selectors (which by definition are never null), have an RPL of 3
(always the case in user space unless null), and match the updated
descriptor.
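To make the check concrete (illustration only, using the masks from
<asm/segment.h>):

  /* LDT entry 1, RPL 3, i.e. 0x0f: a requalification candidate */
  unsigned short ldt_sel = (1 << 3) | SEGMENT_LDT | 3;
  /* GDT selector 0x2b (index 5, TI clear): never requalified */
  unsigned short gdt_sel = (5 << 3) | SEGMENT_GDT | 3;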
The infrastructure is set up to allow a range of descriptors; this
will be used in a subsequent patch.
Signed-off-by: H. Peter Anvin (Intel) <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Chang S. Bae <[email protected]>
Cc: Markus T. Metzger <[email protected]>
---
arch/x86/kernel/ldt.c | 70 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 54 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index c9b14020f4dd..18e9f4c0633d 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -29,36 +29,68 @@
#include <asm/mmu_context.h>
#include <asm/syscalls.h>
-static void refresh_ldt_segments(void)
-{
+struct flush_ldt_info {
+ struct mm_struct *mm;
+ unsigned short first_desc;
+ unsigned short last_desc;
+};
+
#ifdef CONFIG_X86_64
+
+static inline bool
+need_requalify(unsigned short sel, const struct flush_ldt_info *info)
+{
+ /* Must be an LDT segment descriptor with an RPL of 3 */
+ if ((sel & (SEGMENT_TI_MASK|SEGMENT_RPL_MASK)) != (SEGMENT_LDT|3))
+ return false;
+
+ return sel >= info->first_desc && sel <= info->last_desc;
+}
+
+static void refresh_ldt_segments(const struct flush_ldt_info *info)
+{
unsigned short sel;
/*
- * Make sure that the cached DS and ES descriptors match the updated
- * LDT.
+ * Make sure that the cached DS/ES/FS/GS descriptors
+ * match the updated LDT, if the specific selectors point
+ * to LDT entries that have changed.
*/
savesegment(ds, sel);
- if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+ if (need_requalify(sel, info))
loadsegment(ds, sel);
savesegment(es, sel);
- if ((sel & SEGMENT_TI_MASK) == SEGMENT_LDT)
+ if (need_requalify(sel, info))
loadsegment(es, sel);
-#endif
+
+ savesegment(fs, sel);
+ if (need_requalify(sel, info))
+ loadsegment(fs, sel);
+
+ savesegment(gs, sel);
+ if (need_requalify(sel, info))
+ load_gs_index(sel);
}
+#else
+/* On 32 bits, entry_32.S takes care of this on kernel exit */
+static void refresh_ldt_segments(const struct flush_ldt_info *info)
+{
+ (void)info;
+}
+#endif
+
/* context.lock is held by the task which issued the smp function call */
-static void flush_ldt(void *__mm)
+static void flush_ldt(void *_info)
{
- struct mm_struct *mm = __mm;
+ const struct flush_ldt_info *info = _info;
- if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
+ if (this_cpu_read(cpu_tlbstate.loaded_mm) != info->mm)
return;
- load_mm_ldt(mm);
-
- refresh_ldt_segments();
+ load_mm_ldt(info->mm);
+ refresh_ldt_segments(info);
}
/* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
@@ -223,15 +255,21 @@ static void finalize_ldt_struct(struct ldt_struct *ldt)
paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
}
-static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt,
+ unsigned short first_index, unsigned short last_index)
{
+ struct flush_ldt_info info;
+
mutex_lock(&mm->context.lock);
/* Synchronizes with READ_ONCE in load_mm_ldt. */
smp_store_release(&mm->context.ldt, ldt);
/* Activate the LDT for all CPUs using current's mm. */
- on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+ info.mm = mm;
+ info.first_desc = (first_index << 3)|SEGMENT_LDT|3;
+ info.last_desc = (last_index << 3)|SEGMENT_LDT|3;
+ on_each_cpu_mask(mm_cpumask(mm), flush_ldt, &info, true);
mutex_unlock(&mm->context.lock);
}
@@ -436,7 +474,7 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
goto out_unlock;
}
- install_ldt(mm, new_ldt);
+ install_ldt(mm, new_ldt, ldt_info.entry_number, ldt_info.entry_number);
free_ldt_struct(old_ldt);
error = 0;
--
2.14.4
* H. Peter Anvin, Intel <[email protected]> wrote:
> From: "H. Peter Anvin" <[email protected]>
>
> Give a debugger access to the visible part of the GDT and LDT. This
> allows a debugger to find out what a particular segment descriptor
> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
>
> v3:
> Requalify LDT segments for selectors that have actually changed.
>
> v2:
> Add missing change to elf.h for the very last patch.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Chang S. Bae <[email protected]>
> Cc: Markus T. Metzger <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
>
> arch/x86/Kconfig | 4 +
> arch/x86/include/asm/desc.h | 24 +++-
> arch/x86/include/asm/ldt.h | 16 +++
> arch/x86/include/asm/segment.h | 10 ++
> arch/x86/kernel/Makefile | 3 +-
> arch/x86/kernel/ldt.c | 283 ++++++++++++++++++++++++++++++++---------
> arch/x86/kernel/ptrace.c | 103 ++++++++++++++-
> arch/x86/kernel/tls.c | 102 +++++----------
> arch/x86/kernel/tls.h | 8 +-
> include/uapi/linux/elf.h | 2 +
> 10 files changed, 413 insertions(+), 142 deletions(-)
Ok, this looks mostly good to me at a quick glance, but could you please do one
more iteration and more explicitly describe where you change/extend existing ABIs?
I.e. instead of a terse and somewhat vague summary:
> x86/tls,ptrace: provide regset access to the GDT
>
> Give a debugger access to the visible part of the GDT and LDT. This
> allows a debugger to find out what a particular segment descriptor
> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
Please make it really explicit how the ABI is affected, both in the title and in
the description, and also _explain_ how this helps us over what we had before,
plus also list limitations of the new ABI.
I.e. something like:
x86/tls, ptrace: Extend the ptrace ABI with the new REGSET_GDT method
Add the new REGSET_GDT ptrace variant to PTRACE_{GET|SET}REGSET,
to give read and write access to the GDT:
- Previously only certain parts of the GDT were visible, and only via random
ABIs and instructions - there was no architectured access to all of it.
- The SETREGSET variant is only allowed to change the TLS area of the GDT.
(or so.)
This is important not just for documentation and library support purposes, but
also to be able to review it properly, to match 'intent' to 'implementation'.
(It might also help later on in finding bugs/quirks, if any.)
Please do this for all patches in the series that change the ABI.
Thanks,
Ingo
On June 21, 2018 6:58:51 PM PDT, Ingo Molnar <[email protected]> wrote:
>
>* H. Peter Anvin, Intel <[email protected]> wrote:
>
>> From: "H. Peter Anvin" <[email protected]>
>>
>> Give a debugger access to the visible part of the GDT and LDT. This
>> allows a debugger to find out what a particular segment descriptor
>> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
>>
>> v3:
>> Requalify LDT segments for selectors that have actually changed.
>>
>> v2:
>> Add missing change to elf.h for the very last patch.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Chang S. Bae <[email protected]>
>> Cc: Markus T. Metzger <[email protected]>
>> Cc: H. Peter Anvin <[email protected]>
>>
>> arch/x86/Kconfig | 4 +
>> arch/x86/include/asm/desc.h | 24 +++-
>> arch/x86/include/asm/ldt.h | 16 +++
>> arch/x86/include/asm/segment.h | 10 ++
>> arch/x86/kernel/Makefile | 3 +-
>> arch/x86/kernel/ldt.c | 283
>++++++++++++++++++++++++++++++++---------
>> arch/x86/kernel/ptrace.c | 103 ++++++++++++++-
>> arch/x86/kernel/tls.c | 102 +++++----------
>> arch/x86/kernel/tls.h | 8 +-
>> include/uapi/linux/elf.h | 2 +
>> 10 files changed, 413 insertions(+), 142 deletions(-)
>
>Ok, this looks mostly good to me at a quick glance, but could you
>please do one
>more iteration and more explicitly describe where you change/extend
>existing ABIs?
>
>I.e. instead of a terse and somewhat vague summary:
>
>> x86/tls,ptrace: provide regset access to the GDT
>>
>> Give a debugger access to the visible part of the GDT and LDT. This
>> allows a debugger to find out what a particular segment descriptor
>> corresponds to; e.g. if %cs is 16, 32, or 64 bits.
>
>Please make it really explicit how the ABI is affected, both in the
>title and in
>the description, and also _explain_ how this helps us over what we had
>before,
>plus also list limitations of the new ABI.
>
>I.e. something like:
>
> x86/tls, ptrace: Extend the ptrace ABI with the new REGSET_GDT method
>
> Add the new REGSET_GDT ptrace variant to PTRACE_{GET|SET}REGSET,
> to give read and write access to the GDT:
>
>- Previously only certain parts of the GDT were visible, and only via
>random
>ABIs and instructions - there was no architectured access to all of it.
>
>- The SETREGSET variant is only allowed to change the TLS area of the
>GDT.
>
>(or so.)
>
>This is important not just for documentation and library support
>purposes, but
>also to be able to review it properly, to match 'intent' to
>'implementation'.
>
>(It might also help later on in finding bugs/quirks, if any.)
>
>Please do this for all patches in the series that change the ABI.
>
>Thanks,
>
> Ingo
ACK.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
<[email protected]> wrote:
>
> From: "H. Peter Anvin" <[email protected]>
>
> It is not only %ds and %es which contain cached user descriptor
> information; %fs and %gs do as well.
>
> To make sure we don't do something stupid that will affect processes
> which wouldn't want this requalification, be more restrictive about
> which selector numbers will be requalified: they need to be LDT
> selectors (which by definition are never null), have an RPL of 3
> (always the case in user space unless null), and match the updated
> descriptor.
That RPL3 part is false. The following program does:
#include <stdio.h>
int main()
{
unsigned short sel;
asm volatile ("mov %%ss, %0" : "=rm" (sel));
sel &= ~3;
printf("Will write 0x%hx to GS\n", sel);
asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
asm volatile ("mov %%gs, %0" : "=rm" (sel));
printf("GS = 0x%hx\n", sel);
return 0;
}
prints:
Will write 0x28 to GS
GS = 0x28
The x86 architecture is *insane*.
Other than that, this patch seems generally sensible. But my
objection that it's incorrect with FSGSBASE enabled for %fs and %gs
still applies.
On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
<[email protected]> wrote:
>
> From: "H. Peter Anvin" <[email protected]>
>
> Provide access to the user-visible part of the GDT via a regset in
> ptrace(). Note that we already provide a regset for the TLS area part
> of the GDT; these can trivially be unified by looking at the contents
> of the regset structure, especially since the TLS area is the only
> user-modifiable part of the GDT.
This seems reasonable, although I'm not sure I see the point of making
REGSET_GDT writable. (I see the point of making the LDT writable, but
that's a different story.)
> +static int gdt_get(struct task_struct *target,
> + const struct user_regset *regset,
> + unsigned int pos, unsigned int count,
> + void *kbuf, void __user *ubuf)
> +{
> + struct desc_struct gdt_copy[GDT_LAST_USER + 1];
> + const struct desc_struct *p;
> + struct user_desc udesc;
> + unsigned int index, endindex;
> + int err;
> +
> + if (pos % sizeof(struct user_desc))
> + return -EINVAL;
> +
> + /* Get a snapshot of the GDT from an arbitrary CPU */
> + memcpy(gdt_copy, get_current_gdt_ro(), sizeof(gdt_copy));
> +
> + /* Copy over the TLS area */
> + memcpy(&gdt_copy[GDT_ENTRY_TLS_MIN], target->thread.tls_array,
> + sizeof(target->thread.tls_array));
> +
> + /* Descriptor zero is never accessible */
> + memset(&gdt_copy[0], 0, sizeof(gdt_copy[0]));
I think you should also mask out all system segments and all RPL != 3 segments.
> +static int gdt_active(struct task_struct *target,
> + const struct user_regset *regset)
> +{
> + (void)target;
> + return GDT_LAST_USER + 1;
> +}
I can't find the code, if any, that calls ->active. But, if it
exists, is the result exposed to user space at all? If so, I think
this should return the maximum theoretical size of the GDT.
> + [REGSET_GDT] = {
> + .core_note_type = NT_X86_GDT,
> + .n = LDT_ENTRIES, /* Theoretical maximum */
> + .size = sizeof(struct user_desc),
> + .align = sizeof(struct user_desc),
> + .active = gdt_active,
> + .get = gdt_get, .set = regset_gdt_set
As above, barring a reason why it's useful, I think that removing .set
would make sense.
--Andy
On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
<[email protected]> wrote:
>
> From: "H. Peter Anvin" <[email protected]>
>
> Provide ptrace/regset access to the LDT, if one exists. This
> interface provides both read and write access. The write code is
> unified with modify_ldt(); the read code doesn't have enough
> similarity, so it has been kept separate.
For this and for the GDT, you've chosen to use struct user_desc as
your format instead of using a native hardware descriptor format. Any
particular reason why? If nothing else, it will bloat core files a
bit more than needed.
On June 22, 2018 7:49:13 AM PDT, Andy Lutomirski <[email protected]> wrote:
>On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
><[email protected]> wrote:
>>
>> From: "H. Peter Anvin" <[email protected]>
>>
>> Provide ptrace/regset access to the LDT, if one exists. This
>> interface provides both read and write access. The write code is
>> unified with modify_ldt(); the read code doesn't have enough
>> similarity, so it has been kept separate.
>
>For this and for the GDT, you've chosen to use struct user_desc as
>your format instead of using a native hardware descriptor format. Any
>particular reason why? If nothing else, it will bloat core files a
>bit more than needed.
I did because REGSET_TLS was implemented that way, and that is simply a subset of the GDT (which made the same code trivially applicable to both). modify_ldt() does it *both* ways for extra fun (one for reading and one for writing).
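Concretely, a sketch of the asymmetry (error handling omitted):

#include <stdint.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/ldt.h>	/* struct user_desc, LDT_ENTRIES, LDT_ENTRY_SIZE */

int main(void)
{
	struct user_desc ud = { .entry_number = 0, .limit = 0xfffff,
				.seg_32bit = 1, .limit_in_pages = 1 };
	static uint64_t raw[LDT_ENTRIES];	/* 8 bytes per entry */

	/* write: takes the 16-byte user_desc format */
	syscall(SYS_modify_ldt, 0x11, &ud, sizeof(ud));

	/* read: returns raw 8-byte hardware descriptors */
	syscall(SYS_modify_ldt, 0, raw, sizeof(raw));
	return 0;
}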
->active is defined as "beyond this point the regset contains only the default value", which seemed appropriate in this case.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Fri, Jun 22, 2018 at 8:06 AM <[email protected]> wrote:
>
> On June 22, 2018 7:49:13 AM PDT, Andy Lutomirski <[email protected]> wrote:
> >On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel
> ><[email protected]> wrote:
> >>
> >> From: "H. Peter Anvin" <[email protected]>
> >>
> >> Provide ptrace/regset access to the LDT, if one exists. This
> >> interface provides both read and write access. The write code is
> >> unified with modify_ldt(); the read code doesn't have enough
> >> similarity, so it has been kept separate.
> >
> >For this and for the GDT, you've chosen to use struct user_desc as
> >your format instead of using a native hardware descriptor format. Any
> >particular reason why? If nothing else, it will bloat core files a
> >bit more than needed.
>
> I did because REGSET_TLS was implemented that way, and that is simply a subset of the GDT (which made the same code trivially applicable to both). modify_ldt() does it *both* ways for extra fun (one for reading and one for writing).
>
> ->active is defined as "beyond this point the regset contains only the default value", which seemed appropriate in this case.
I saw that definition too. But I'm still very unclear as to what, if
anything, the code actually does :)
--Andy
On 06/22/18 07:24, Andy Lutomirski wrote:
>
> That RPL3 part is false. The following program does:
>
> #include <stdio.h>
>
> int main()
> {
> unsigned short sel;
> asm volatile ("mov %%ss, %0" : "=rm" (sel));
> sel &= ~3;
> printf("Will write 0x%hx to GS\n", sel);
> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
> asm volatile ("mov %%gs, %0" : "=rm" (sel));
> printf("GS = 0x%hx\n", sel);
> return 0;
> }
>
> prints:
>
> Will write 0x28 to GS
> GS = 0x28
>
> The x86 architecture is *insane*.
>
> Other than that, this patch seems generally sensible. But my
> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
> still applies.
>
Ugh, you're right... I misremembered. The CPL simply overrides the RPL
rather than trapping.
We still need to give legacy applications which have zero idea about the
separate bases that apply only to 64-bit mode a way to DTRT. Requiring
these old crufty applications to do something new is not an option.
As ugly as it is, I'm thinking the Right Thing is to simply make it a
part of the Linux ABI that if the FS or GS selector registers point into
the LDT then we will requalify them; if a 64-bit app does that then they
get that behavior. This isn't something that will happen
asynchronously, and if a 64-bit process loads an LDT value into FS or
GS, they are considered to have opted in to that behavior.
The only other sensible option is to conditionalize this on the affected
process being in !64-bit mode. I don't like that myself.
-hpa
> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin <[email protected]> wrote:
>
>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>
>> That RPL3 part is false. The following program does:
>>
>> #include <stdio.h>
>>
>> int main()
>> {
>> unsigned short sel;
>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>> sel &= ~3;
>> printf("Will write 0x%hx to GS\n", sel);
>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>> printf("GS = 0x%hx\n", sel);
>> return 0;
>> }
>>
>> prints:
>>
>> Will write 0x28 to GS
>> GS = 0x28
>>
>> The x86 architecture is *insane*.
>>
>> Other than that, this patch seems generally sensible. But my
>> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
>> still applies.
>>
>
> Ugh, you're right... I misremembered. The CPL simply overrides the RPL
> rather than trapping.
>
> We still need to give legacy applications which have zero idea about the
> separate bases that apply only to 64-bit mode a way to DTRT. Requiring
> these old crufty applications to do something new is not an option.
>
> As ugly as it is, I'm thinking the Right Thing is to simply make it a
> part of the Linux ABI that if the FS or GS selector registers point into
> the LDT then we will requalify them; if a 64-bit app does that then they
> get that behavior. This isn't something that will happen
> asynchronously, and if a 64-bit process loads an LDT value into FS or
> GS, they are considered to have opted in to that behavior.
But the old and crusty apps don’t depend on requalification because we never used to do it.
I’m not convinced we ever need to refresh the base. In fact, we could start preserving the base of LDT-referencing FS/GS across context switches even without FSGSBASE at some minor performance cost, but I don’t really see the point. I still think my proposed semantics are easy to implement and preserve the ABI even if they have the sad property that the FSGSBASE behavior and the non-FSGSBASE behavior end up different.
>
> The only other sensible option is to conditionalize this on the affected
> process being in !64-bit mode. I don't like that myself.
Please no.
I really wish Intel had made WRGSBASE at CPL 3 require GS==0. The current ISA was a mistake.
>
> -hpa
On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski <[email protected]> wrote:
>
>
>
>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin <[email protected]> wrote:
>>
>>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>>
>>> That RPL3 part is false. The following program does:
>>>
>>> #include <stdio.h>
>>>
>>> int main()
>>> {
>>> unsigned short sel;
>>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>> sel &= ~3;
>>> printf("Will write 0x%hx to GS\n", sel);
>>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>> printf("GS = 0x%hx\n", sel);
>>> return 0;
>>> }
>>>
>>> prints:
>>>
>>> Will write 0x28 to GS
>>> GS = 0x28
>>>
>>> The x86 architecture is *insane*.
>>>
>>> Other than that, this patch seems generally sensible. But my
>>> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
>>> still applies.
>>>
>>
>> Ugh, you're right... I misremembered. The CPL simply overrides the RPL
>> rather than trapping.
>>
>> We still need to give legacy applications which have zero idea about the
>> separate bases that apply only to 64-bit mode a way to DTRT. Requiring
>> these old crufty applications to do something new is not an option.
>
>>
>> As ugly as it is, I'm thinking the Right Thing is to simply make it a
>> part of the Linux ABI that if the FS or GS selector registers point into
>> the LDT then we will requalify them; if a 64-bit app does that then they
>> get that behavior. This isn't something that will happen
>> asynchronously, and if a 64-bit process loads an LDT value into FS or
>> GS, they are considered to have opted in to that behavior.
>
> But the old and crusty apps don’t depend on requalification because we never used to do it.
>
> I’m not convinced we ever need to refresh the base. In fact, we could start preserving the base of LDT-referencing FS/GS across context switches even without FSGSBASE at some minor performance cost, but I don’t really see the point. I still think my proposed semantics are easy to implement and preserve the ABI even if they have the sad property that the FSGSBASE behavior and the non-FSGSBASE behavior end up different.
>
There's another reasonable solution: do exactly what your patch does,
minus the bugs. We would need to get the RPL != 3 case right (easy)
and the case where there's a non-running thread using the selector in
question. The latter is probably best handled by adding a flag to
thread_struct that says "fsbase needs reloading from the descriptor
table" and only applies if the selector is in the LDT or TLS area. Or
we could hijack a high bit in the selector. Then we'd need to update
everything that uses the fields.
On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski <[email protected]> wrote:
>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski <[email protected]>
>wrote:
>>
>>
>>
>>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
><[email protected]> wrote:
>>>
>>>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>>>
>>>> That RPL3 part is false. The following program does:
>>>>
>>>> #include <stdio.h>
>>>>
>>>> int main()
>>>> {
>>>> unsigned short sel;
>>>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>>> sel &= ~3;
>>>> printf("Will write 0x%hx to GS\n", sel);
>>>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>>> printf("GS = 0x%hx\n", sel);
>>>> return 0;
>>>> }
>>>>
>>>> prints:
>>>>
>>>> Will write 0x28 to GS
>>>> GS = 0x28
>>>>
>>>> The x86 architecture is *insane*.
>>>>
>>>> Other than that, this patch seems generally sensible. But my
>>>> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
>>>> still applies.
>>>>
>>>
>>> Ugh, you're right... I misremembered. The CPL simply overrides the
>RPL
>>> rather than trapping.
>>>
>>> We still need to give legacy applications which have zero idea about
>the
>>> separate bases that apply only to 64-bit mode a way to DTRT.
>Requiring
>>> these old crufty applications to do something new is not an option.
>>
>>>
>>> As ugly as it is, I'm thinking the Right Thing is to simply make it
>a
>>> part of the Linux ABI that if the FS or GS selector registers point
>into
>>> the LDT then we will requalify them; if a 64-bit app does that then
>they
>>> get that behavior. This isn't something that will happen
>>> asynchronously, and if a 64-bit process loads an LDT value into FS
>or
>>> GS, they are considered to have opted in to that behavior.
>>
>> But the old and crusty apps don’t depend on requalification because
>we never used to do it.
>>
>> I’m not convinced we ever need to refresh the base. In fact, we could
>start preserving the base of LDT-referencing FS/GS across context
>switches even without FSGSBASE at some minor performance cost, but I
>don’t really see the point. I still think my proposed semantics are
>easy to implement and preserve the ABI even if they have the sad
>property that the FSGSBASE behavior and the non-FSGSBASE behavior end
>up different.
>>
>
>There's another reasonable solution: do exactly what your patch does,
>minus the bugs. We would need to get the RPL != 3 case right (easy)
>and the case where there's a non-running thread using the selector in
>question. The latter is probably best handled by adding a flag to
>thread_struct that says "fsbase needs reloading from the descriptor
>table" and only applies if the selector is in the LDT or TLS area. Or
>we could hijack a high bit in the selector. Then we'd need to update
>everything that uses the fields.
Obviously fix the bugs.
How would you control this bit?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On June 27, 2018 11:22:14 AM PDT, [email protected] wrote:
>On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski <[email protected]>
>wrote:
>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski
><[email protected]>
>>wrote:
>>>
>>>
>>>
>>>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
>><[email protected]> wrote:
>>>>
>>>>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>>>>
>>>>> That RPL3 part is false. The following program does:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main()
>>>>> {
>>>>> unsigned short sel;
>>>>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>>>> sel &= ~3;
>>>>> printf("Will write 0x%hx to GS\n", sel);
>>>>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>>>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>>>> printf("GS = 0x%hx\n", sel);
>>>>> return 0;
>>>>> }
>>>>>
>>>>> prints:
>>>>>
>>>>> Will write 0x28 to GS
>>>>> GS = 0x28
>>>>>
>>>>> The x86 architecture is *insane*.
>>>>>
>>>>> Other than that, this patch seems generally sensible. But my
>>>>> objection that it's incorrect with FSGSBASE enabled for %fs and
>%gs
>>>>> still applies.
>>>>>
>>>>
>>>> Ugh, you're right... I misremembered. The CPL simply overrides the
>>RPL
>>>> rather than trapping.
>>>>
>>>> We still need to give legacy applications which have zero idea
>about
>>the
>>>> separate bases that apply only to 64-bit mode a way to DTRT.
>>Requiring
>>>> these old crufty applications to do something new is not an option.
>>>
>>>>
>>>> As ugly as it is, I'm thinking the Right Thing is to simply make it
>>a
>>>> part of the Linux ABI that if the FS or GS selector registers point
>>into
>>>> the LDT then we will requalify them; if a 64-bit app does that then
>>they
>>>> get that behavior. This isn't something that will happen
>>>> asynchronously, and if a 64-bit process loads an LDT value into FS
>>or
>>>> GS, they are considered to have opted in to that behavior.
>>>
>>> But the old and crusty apps don’t depend on requalification because
>>we never used to do it.
>>>
>>> I’m not convinced we ever need to refresh the base. In fact, we
>could
>>start preserving the base of LDT-referencing FS/GS across context
>>switches even without FSGSBASE at some minor performance cost, but I
>>don’t really see the point. I still think my proposed semantics are
>>easy to implement and preserve the ABI even if they have the sad
>>property that the FSGSBASE behavior and the non-FSGSBASE behavior end
>>up different.
>>>
>>
>>There's another reasonable solution: do exactly what your patch does,
>>minus the bugs. We would need to get the RPL != 3 case right (easy)
>>and the case where there's a non-running thread using the selector in
>>question. The latter is probably best handled by adding a flag to
>>thread_struct that says "fsbase needs reloading from the descriptor
>>table" and only applies if the selector is in the LDT or TLS area. Or
>>we could hijack a high bit in the selector. Then we'd need to update
>>everything that uses the fields.
>
>Obviously fix the bugs.
>
>How would you control this bit?
I can personally think of these options:
1. A prctl() to disable requalification;
2. Make the new instructions trap until used. This will add to the startup time of legitimate users of these instructions;
3. Either of these, but start out in "off" mode until one of the descriptor table system calls is called.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Jun 27, 2018 at 11:22 AM, <[email protected]> wrote:
> On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski <[email protected]> wrote:
>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski <[email protected]>
>>wrote:
>>>
>>>
>>>
>>>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin
>><[email protected]> wrote:
>>>>
>>>>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>>>>
>>>>> That RPL3 part is false. The following program does:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> int main()
>>>>> {
>>>>> unsigned short sel;
>>>>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>>>> sel &= ~3;
>>>>> printf("Will write 0x%hx to GS\n", sel);
>>>>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>>>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>>>> printf("GS = 0x%hx\n", sel);
>>>>> return 0;
>>>>> }
>>>>>
>>>>> prints:
>>>>>
>>>>> Will write 0x28 to GS
>>>>> GS = 0x28
>>>>>
>>>>> The x86 architecture is *insane*.
>>>>>
>>>>> Other than that, this patch seems generally sensible. But my
>>>>> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
>>>>> still applies.
>>>>>
>>>>
>>>> Ugh, you're right... I misremembered. The CPL simply overrides the
>>>> RPL rather than trapping.
>>>>
>>>> We still need to give legacy applications which have zero idea about
>>>> the separate bases that apply only to 64-bit mode a way to DTRT.
>>>> Requiring these old crufty applications to do something new is not
>>>> an option.
>>>
>>>>
>>>> As ugly as it is, I'm thinking the Right Thing is to simply make it
>>>> a part of the Linux ABI that if the FS or GS selector registers
>>>> point into the LDT then we will requalify them; if a 64-bit app
>>>> does that then they get that behavior. This isn't something that
>>>> will happen asynchronously, and if a 64-bit process loads an LDT
>>>> value into FS or GS, they are considered to have opted in to that
>>>> behavior.
>>>
>>> But the old and crusty apps don’t depend on requalification because
>>> we never used to do it.
>>>
>>> I’m not convinced we ever need to refresh the base. In fact, we
>>> could start preserving the base of LDT-referencing FS/GS across
>>> context switches even without FSGSBASE at some minor performance
>>> cost, but I don’t really see the point. I still think my proposed
>>> semantics are easy to implement and preserve the ABI even if they
>>> have the sad property that the FSGSBASE behavior and the
>>> non-FSGSBASE behavior end up different.
>>>
>>
>>There's another reasonable solution: do exactly what your patch does,
>>minus the bugs. We would need to get the RPL != 3 case right (easy)
>>and the case where there's a non-running thread using the selector in
>>question. The latter is probably best handled by adding a flag to
>>thread_struct that says "fsbase needs reloading from the descriptor
>>table" and only applies if the selector is in the LDT or TLS area. Or
>>we could hijack a high bit in the selector. Then we'd need to update
>>everything that uses the fields.
>
> Obviously fix the bugs.
>
> How would you control this bit?
Sorry, I was wrong in my previous email. Let me try this again:
Notwithstanding the RPL thing, the reason I don't like your patch as
is, and the reason I didn't write a similar patch myself, is that it
will behave nondeterministically on an FSGSBASE kernel. Suppose there
are two threads, A and B, that share an mm. A has %fs == 0x7 and
FSBASE = 0. The LDT has the base for entry 0 set to 0.
Now thread B calls modify_ldt to change the base for entry 0 to 1.
The Obviously Sane (tm) behavior is for task A's FSBASE to
asynchronously change to 1. This is the only deterministic behavior
that is even possible on a 32-bit kernel, and it's the only
not-totally-nutty behavior that is possible on a 64-bit non-FSGSBASE
kernel, and it's still perfectly reasonable for FSGSBASE. The problem
is that it's not so easy to implement.
With your patch, on an FSGSBASE kernel, we get the desired behavior if
thread A is running while thread B calls modify_ldt(). But we get
different behavior if thread A is stopped -- thread A's FSBASE will
remain set to 0.
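For what it's worth, the setup is easy to stage from user space. The sketch below is mine, not from the patch; it assumes x86-64 Linux, that LDT entry 0 is free, and compilation with -pthread. It parks thread A on an LDT-based %fs while the main thread plays the part of thread B and rewrites the descriptor; which FSBASE thread A then observes is exactly the behavior in question:

#define _GNU_SOURCE
#include <asm/ldt.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Write LDT entry 0 with the given base (modify_ldt func 0x11). */
static int write_ldt_entry0(unsigned int base)
{
	struct user_desc d;

	memset(&d, 0, sizeof(d));
	d.entry_number = 0;
	d.base_addr = base;
	d.limit = 0xfffff;
	d.seg_32bit = 1;
	d.limit_in_pages = 1;
	d.useable = 1;
	return syscall(SYS_modify_ldt, 0x11, &d, sizeof(d));
}

static void *thread_a(void *arg)
{
	unsigned short sel = (0 << 3) | 4 | 3;	/* LDT entry 0, RPL 3 == 0x7 */

	asm volatile ("mov %0, %%fs" :: "rm" (sel));
	pause();	/* sit stopped while thread B edits the LDT */
	return NULL;
}

int main(void)
{
	pthread_t a;

	write_ldt_entry0(0);			/* entry 0: base = 0 */
	pthread_create(&a, NULL, thread_a, NULL);
	sleep(1);				/* let A load %fs and park */
	write_ldt_entry0(1);			/* thread B: base = 1 */
	printf("entry 0 rewritten; does A's FSBASE follow?\n");
	return 0;
}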
With that in mind, my email was otherwise garbage, and the magic "bit"
idea was total crap.
I can see three vaguely credible ways to implement this.
1. thread B walks all threads on the system, notices that thread A has
the same mm, and asynchronously fixes it up. The locking is a bit
tricky, and the performance isn't exactly great. Maybe that's okay.
2. We finally add an efficient way to find all threads that share an
mm and do (1) but faster.
3. We add enough bookkeeping to thread_struct so that, the next time
thread A runs or has ptrace try to read its FSBASE, we notice that
FSBASE is stale and fix it up.
(3) will perform the best, but the implementation is probably nasty.
If we want modify_ldt() to only reset the base for the modified
records, we probably need a version number for each of the 8192
possible LDT entries stored in ldt_struct (which will double its size,
but so what?). Then we need thread_struct to store the version number
of the LDT entries that fsindex and gsindex refer to. Now we make
sure that every code path that reads fsbase or gsbase first calls some
revalidate_fs_and_gs() function that will reset the bases and maybe
even the selectors if needed. Getting the locking right on that last
bit is possibly a bit tricky, since we may need the LDT lock to be
held across the revalidation *and* whatever subsequent code actually
reads the values.
I think that (3) is the nicest solution, but it would need to be implemented.
What do you think?
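To make (3) a bit more concrete, here is a stand-alone model of the versioned-LDT bookkeeping; every name in it (ldt_model, fs_version, revalidate_fs) is invented for this sketch, and real kernel concerns such as locking and the actual desc_struct layout are deliberately elided:

#include <stdint.h>
#include <stdio.h>

#define LDT_ENTRIES	8192

struct ldt_model {
	uint64_t base[LDT_ENTRIES];	/* stands in for desc_struct */
	uint32_t version[LDT_ENTRIES];	/* bumped on every write */
};

struct thread_model {
	unsigned short fsindex;
	uint64_t fsbase;
	uint32_t fs_version;	/* LDT generation fsbase was read at */
};

/* modify_ldt() side: update an entry and bump its generation. */
static void ldt_write(struct ldt_model *ldt, unsigned int idx, uint64_t base)
{
	ldt->base[idx] = base;
	ldt->version[idx]++;
}

/* Reader side: run before any use of fsbase (switch-in, ptrace). */
static void revalidate_fs(struct thread_model *t, const struct ldt_model *ldt)
{
	unsigned int idx = t->fsindex >> 3;

	if (!(t->fsindex & 4))		/* not an LDT selector */
		return;
	if (t->fs_version != ldt->version[idx]) {
		t->fsbase = ldt->base[idx];	/* refresh the stale base */
		t->fs_version = ldt->version[idx];
	}
}

int main(void)
{
	static struct ldt_model ldt;
	struct thread_model a = { .fsindex = 0x7 };	/* LDT entry 0, RPL 3 */

	revalidate_fs(&a, &ldt);	/* version matches; nothing to do */
	ldt_write(&ldt, 0, 1);		/* "thread B" rewrites entry 0 */
	revalidate_fs(&a, &ldt);	/* "thread A" notices on next use */
	printf("fsbase = %llu\n", (unsigned long long)a.fsbase);
	return 0;
}

In the real thing, the version check and whatever subsequently uses fsbase would have to sit under the LDT lock, per the locking concern above.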
On June 28, 2018 1:33:24 PM PDT, Andy Lutomirski <[email protected]> wrote:
>On Wed, Jun 27, 2018 at 11:22 AM, <[email protected]> wrote:
>> On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski <[email protected]>
>> wrote:
>>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski <[email protected]>
>>>wrote:
>>>>
>>>>
>>>>
>>>>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin <[email protected]> wrote:
>>>>>
>>>>>> On 06/22/18 07:24, Andy Lutomirski wrote:
>>>>>>
>>>>>> That RPL3 part is false. The following program does:
>>>>>>
>>>>>> #include <stdio.h>
>>>>>>
>>>>>> int main()
>>>>>> {
>>>>>> unsigned short sel;
>>>>>> asm volatile ("mov %%ss, %0" : "=rm" (sel));
>>>>>> sel &= ~3;
>>>>>> printf("Will write 0x%hx to GS\n", sel);
>>>>>> asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3));
>>>>>> asm volatile ("mov %%gs, %0" : "=rm" (sel));
>>>>>> printf("GS = 0x%hx\n", sel);
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> prints:
>>>>>>
>>>>>> Will write 0x28 to GS
>>>>>> GS = 0x28
>>>>>>
>>>>>> The x86 architecture is *insane*.
>>>>>>
>>>>>> Other than that, this patch seems generally sensible. But my
>>>>>> objection that it's incorrect with FSGSBASE enabled for %fs and %gs
>>>>>> still applies.
>>>>>>
>>>>>
>>>>> Ugh, you're right... I misremembered. The CPL simply overrides
>>>>> the RPL rather than trapping.
>>>>>
>>>>> We still need to give legacy applications which have zero idea
>>>>> about the separate bases that apply only to 64-bit mode a way to
>>>>> DTRT. Requiring these old crufty applications to do something new
>>>>> is not an option.
>>>>
>>>>>
>>>>> As ugly as it is, I'm thinking the Right Thing is to simply make
>>>>> it a part of the Linux ABI that if the FS or GS selector registers
>>>>> point into the LDT then we will requalify them; if a 64-bit app
>>>>> does that then they get that behavior. This isn't something that
>>>>> will happen asynchronously, and if a 64-bit process loads an LDT
>>>>> value into FS or GS, they are considered to have opted in to that
>>>>> behavior.
>>>>
>>>> But the old and crusty apps don’t depend on requalification
>>>> because we never used to do it.
>>>>
>>>> I’m not convinced we ever need to refresh the base. In fact, we
>>>> could start preserving the base of LDT-referencing FS/GS across
>>>> context switches even without FSGSBASE at some minor performance
>>>> cost, but I don’t really see the point. I still think my proposed
>>>> semantics are easy to implement and preserve the ABI even if they
>>>> have the sad property that the FSGSBASE behavior and the
>>>> non-FSGSBASE behavior end up different.
>>>>
>>>
>>>There's another reasonable solution: do exactly what your patch does,
>>>minus the bugs. We would need to get the RPL != 3 case right (easy)
>>>and the case where there's a non-running thread using the selector in
>>>question. The latter is probably best handled by adding a flag to
>>>thread_struct that says "fsbase needs reloading from the descriptor
>>>table" and only applies if the selector is in the LDT or TLS area.
>Or
>>>we could hijack a high bit in the selector. Then we'd need to update
>>>everything that uses the fields.
>>
>> Obviously fix the bugs.
>>
>> How would you control this bit?
>
>Sorry, I was wrong in my previous email. Let me try this again:
>
>Notwithstanding the RPL thing, the reason I don't like your patch as
>is, and the reason I didn't write a similar patch myself, is that it
>will behave nondeterministically on an FSGSBASE kernel. Suppose there
>are two threads, A and B, that share an mm. A has %fs == 0x7 and
>FSBASE = 0. The LDT has the base for entry 0 set to 0.
>
>Now thread B calls modify_ldt to change the base for entry 0 to 1.
>
>The Obviously Sane (tm) behavior is for task A's FSBASE to
>asynchronously change to 1. This is the only deterministic behavior
>that is even possible on a 32-bit kernel, and it's the only
>not-totally-nutty behavior that is possible on a 64-bit non-FSGSBASE
>kernel, and it's still perfectly reasonable for FSGSBASE. The problem
>is that it's not so easy to implement.
>
>With your patch, on an FSGSBASE kernel, we get the desired behavior if
>thread A is running while thread B calls modify_ldt(). But we get
>different behavior if thread A is stopped -- thread A's FSBASE will
>remain set to 0.
>
>With that in mind, my email was otherwise garbage, and the magic "bit"
>idea was total crap.
>
>I can see three vaguely credible ways to implement this.
>
>1. thread B walks all threads on the system, notices that thread A has
>the same mm, and asynchronously fixes it up. The locking is a bit
>tricky, and the performance isn't exactly great. Maybe that's okay.
>
>2. We finally add an efficient way to find all threads that share an
>mm and do (1) but faster.
>
>3. We add enough bookkeeping to thread_struct so that, the next time
>thread A runs or has ptrace try to read its FSBASE, we notice that
>FSBASE is stale and fix it up.
>
>(3) will perform the best, but the implementation is probably nasty.
>If we want modify_ldt() to only reset the base for the modified
>records, we probably need a version number for each of the 8192
>possible LDT entries stored in ldt_struct (which will double its size,
>but so what?). Then we need thread_struct to store the version number
>of the LDT entries that fsindex and gsindex refer to. Now we make
>sure that every code path that reads fsbase or gsbase first calls some
>revalidate_fs_and_gs() function that will reset the bases and maybe
>even the selectors if needed. Getting the locking right on that last
>bit is possibly a bit tricky, since we may need the LDT lock to be
>held across the revalidation *and* whatever subsequent code actually
>reads the values.
>
>I think that (3) is the nicest solution, but it would need to be
>implemented.
>
>What do you think?
Ok, now I grok what you're getting at, and I do agree with your problem statement. I think I can figure this out.