2019-11-06 20:58:14

by Thomas Gleixner

Subject: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

Sane ioperm() users only set the few bits in the I/O space which they need to
access. But the update logic of the TSS I/O bitmap always copies from byte
0 up to the last byte in the task's bitmap which contains a zero permission bit.

That means that for access only to port 65535 the full 8K bitmap has to be
copied even if all the bytes in the TSS I/O bitmap are already filled with
0xff.

Calculate both the position of the first zero bit and the last zero bit to
limit the range which needs to be copied. This does not solve the problem
when the previous task had only byte 0 cleared and the next one has only
the last byte cleared, but trying to solve that would be too complex and
heavyweight for the context switch path. As ioperm() usage is very rare,
the case which is optimized is the single task/process which uses ioperm().

Signed-off-by: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/processor.h | 28 ++++---
arch/x86/kernel/cpu/common.c | 3
arch/x86/kernel/ioport.c | 141 +++++++++++++++++++++++++++------------
arch/x86/kernel/process.c | 51 +++++++++-----
arch/x86/kernel/ptrace.c | 2
5 files changed, 152 insertions(+), 73 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -365,19 +365,19 @@ struct tss_struct {
struct x86_hw_tss x86_tss;

/*
- * Store the dirty size of the last io bitmap offender. The next
- * one will have to do the cleanup as the switch out to a non
- * io bitmap user will just set x86_tss.io_bitmap_base to a value
- * outside of the TSS limit. So for sane tasks there is no need
- * to actually touch the io_bitmap at all.
+ * Store the dirty byte range of the last io bitmap offender. The
+ * next one will have to do the cleanup because the switch out to a
+ * non I/O bitmap user will just set x86_tss.io_bitmap_base to a
+ * value outside of the TSS limit to not penalize tasks which do
+ * not use the I/O bitmap at all.
*/
- unsigned int io_bitmap_prev_max;
+ unsigned int io_zerobits_start;
+ unsigned int io_zerobits_end;

/*
- * The extra 1 is there because the CPU will access an
- * additional byte beyond the end of the IO permission
- * bitmap. The extra byte must be all 1 bits, and must
- * be within the limit.
+ * The extra 1 is there because the CPU will access an additional
+ * byte beyond the end of the I/O permission bitmap. The extra byte
+ * must have all bits set and must be within the TSS limit.
*/
unsigned long io_bitmap[IO_BITMAP_LONGS + 1];
} __aligned(PAGE_SIZE);
@@ -496,8 +496,12 @@ struct thread_struct {
/* IO permissions: */
unsigned long *io_bitmap_ptr;
unsigned long iopl;
- /* Max allowed port in the bitmap, in bytes: */
- unsigned io_bitmap_max;
+ /*
+ * The byte range in the I/O permission bitmap which contains zero
+ * bits.
+ */
+ unsigned int io_zerobits_start;
+ unsigned int io_zerobits_end;

mm_segment_t addr_limit;

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1864,7 +1864,8 @@ void cpu_init(void)
/* Initialize the TSS. */
tss_setup_ist(tss);
tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
- tss->io_bitmap_prev_max = 0;
+ tss->io_zerobits_start = IO_BITMAP_BYTES;
+ tss->io_zerobits_end = 0;
memset(tss->io_bitmap, 0xff, sizeof(tss->io_bitmap));
set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);

--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -26,9 +26,11 @@
*/
long ksys_ioperm(unsigned long from, unsigned long num, int turn_on)
{
+ unsigned int first, last, next, start, end, copy_start, copy_len;
struct thread_struct *t = &current->thread;
+ unsigned long *bitmap, *tofree = NULL;
struct tss_struct *tss;
- unsigned int i, max_long, bytes, bytes_updated;
+ char *src, *dst;

if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
@@ -37,63 +39,120 @@ long ksys_ioperm(unsigned long from, uns
return -EPERM;

/*
- * If it's the first ioperm() call in this thread's lifetime, set the
- * IO bitmap up. ioperm() is much less timing critical than clone(),
- * this is why we delay this operation until now:
+ * I/O bitmap storage is allocated on demand, but don't bother if
+ * the task does not have one and permissions are only cleared.
*/
- if (!t->io_bitmap_ptr) {
- unsigned long *bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+ if (!t->io_bitmap_ptr && !turn_on)
+ return 0;

- if (!bitmap)
- return -ENOMEM;
+ /*
+ * Allocate a temporary bitmap to minimize the amount of work
+ * in the atomic region.
+ */
+ bitmap = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
+ if (!bitmap)
+ return -ENOMEM;

+ if (!t->io_bitmap_ptr)
memset(bitmap, 0xff, IO_BITMAP_BYTES);
- t->io_bitmap_ptr = bitmap;
- set_thread_flag(TIF_IO_BITMAP);
+ else
+ memcpy(bitmap, t->io_bitmap_ptr, IO_BITMAP_BYTES);
+
+ /* Update the bitmap */
+ if (turn_on) {
+ bitmap_clear(bitmap, from, num);
+ } else {
+ bitmap_set(bitmap, from, num);
+ }
+
+ /* Get the new range */
+ first = find_first_zero_bit(bitmap, IO_BITMAP_BITS);
+
+ for (last = next = first; next < IO_BITMAP_BITS; last = next) {
+ /* Find the next set bit and update last */
+ next = find_next_bit(bitmap, IO_BITMAP_BITS, last);
+ last = next - 1;
+ if (next == IO_BITMAP_BITS)
+ break;
+ /* Find the next zero bit and continue searching */
+ next = find_next_zero_bit(bitmap, IO_BITMAP_BITS, next);
+ }
+
+ /* Calculate the byte boundaries for the updated region */
+ copy_start = from / 8;
+ copy_len = (round_up(from + num, 8) / 8) - copy_start;
+
+ /*
+ * Update the per thread storage and the TSS bitmap. This must be
+ * done with preemption disabled to prevent racing against a
+ * context switch.
+ */
+ preempt_disable();
+ tss = this_cpu_ptr(&cpu_tss_rw);

+ if (!t->io_bitmap_ptr) {
+ unsigned int tss_start = tss->io_zerobits_start;
+ /*
+ * If the task did not use the I/O bitmap yet then the
+ * perhaps stale content in the TSS needs to be taken into
+ * account. If tss start is out of bounds the TSS storage
+ * does not contain a zero bit and it's sufficient just to
+ * copy the new range over.
+ */
+ if (tss_start < IO_BITMAP_BYTES) {
+ unsigned int tss_end = tss->io_zerobits_end;
+ unsigned int copy_end = copy_start + copy_len;
+
+ copy_start = min(tss_start, copy_start);
+ copy_len = max(tss_end, copy_end) - copy_start;
+ }
+ }
+
+ /* Copy the changed range over to the TSS bitmap */
+ dst = (char *)tss->io_bitmap;
+ src = (char *)bitmap;
+ memcpy(dst + copy_start, src + copy_start, copy_len);
+
+ if (first >= IO_BITMAP_BITS) {
+ /*
+ * If the resulting bitmap has all permissions dropped, clear
+ * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
+ * invalid. Deallocate both the new and the thread's bitmap.
+ */
+ clear_thread_flag(TIF_IO_BITMAP);
+ tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
+ tofree = bitmap;
+ bitmap = NULL;
+ } else {
/*
- * Now that we have an IO bitmap, we need our TSS limit to be
- * correct. It's fine if we are preempted after doing this:
- * with TIF_IO_BITMAP set, context switches will keep our TSS
- * limit correct.
+ * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
+ * the bitmap offset valid and make sure that the TSS limit
+ * is correct. It might have been wrecked by a VMEXIT.
*/
- preempt_disable();
+ set_thread_flag(TIF_IO_BITMAP);
+ tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
refresh_tss_limit();
- preempt_enable();
}

/*
- * do it in the per-thread copy and in the TSS ...
+ * Update the range in the thread and the TSS
*
- * Disable preemption via get_cpu() - we must not switch away
- * because the ->io_bitmap_max value must match the bitmap
- * contents:
+ * Get the byte position of the first zero bit and calculate
+ * the length of the range in which zero bits exist.
*/
- tss = &per_cpu(cpu_tss_rw, get_cpu());
-
- if (turn_on)
- bitmap_clear(t->io_bitmap_ptr, from, num);
- else
- bitmap_set(t->io_bitmap_ptr, from, num);
+ start = first / 8;
+ end = first < IO_BITMAP_BITS ? round_up(last, 8) / 8 : 0;
+ t->io_zerobits_start = tss->io_zerobits_start = start;
+ t->io_zerobits_end = tss->io_zerobits_end = end;

/*
- * Search for a (possibly new) maximum. This is simple and stupid,
- * to keep it obviously correct:
+ * Finally exchange the bitmap pointer in the thread.
*/
- max_long = 0;
- for (i = 0; i < IO_BITMAP_LONGS; i++)
- if (t->io_bitmap_ptr[i] != ~0UL)
- max_long = i;
-
- bytes = (max_long + 1) * sizeof(unsigned long);
- bytes_updated = max(bytes, t->io_bitmap_max);
-
- t->io_bitmap_max = bytes;
-
- /* Update the TSS: */
- memcpy(tss->io_bitmap, t->io_bitmap_ptr, bytes_updated);
+ bitmap = xchg(&t->io_bitmap_ptr, bitmap);
+ preempt_enable();

- put_cpu();
+ kfree(bitmap);
+ kfree(tofree);

return 0;
}
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -110,7 +110,8 @@ void exit_thread(struct task_struct *tsk
tss = this_cpu_ptr(&cpu_tss_rw);

t->io_bitmap_ptr = NULL;
- t->io_bitmap_max = 0;
+ t->io_zerobits_start = IO_BITMAP_BYTES;
+ t->io_zerobits_end = 0;
clear_thread_flag(TIF_IO_BITMAP);
/* Invalidate the io bitmap base in the TSS */
tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
@@ -141,7 +142,8 @@ static inline int copy_io_bitmap(struct
tsk->thread.io_bitmap_ptr = kmemdup(current->thread.io_bitmap_ptr,
IO_BITMAP_BYTES, GFP_KERNEL);
if (!tsk->thread.io_bitmap_ptr) {
- tsk->thread.io_bitmap_max = 0;
+ tsk->thread.io_zerobits_start = IO_BITMAP_BYTES;
+ tsk->thread.io_zerobits_end = 0;
return -ENOMEM;
}
set_tsk_thread_flag(tsk, TIF_IO_BITMAP);
@@ -153,7 +155,8 @@ static inline void free_io_bitmap(struct
if (tsk->thread.io_bitmap_ptr) {
kfree(tsk->thread.io_bitmap_ptr);
tsk->thread.io_bitmap_ptr = NULL;
- tsk->thread.io_bitmap_max = 0;
+ tsk->thread.io_zerobits_start = IO_BITMAP_BYTES;
+ tsk->thread.io_zerobits_end = 0;
}
}

@@ -354,27 +357,39 @@ void arch_setup_new_exec(void)
}
}

+static void tss_update_io_bitmap(struct tss_struct *tss,
+ struct thread_struct *thread)
+{
+ unsigned int start, len;
+ char *src, *dst;
+
+ /*
+ * Copy at least the byte range of the incoming task's bitmap which
+ * covers the permitted I/O ports.
+ *
+ * If the previous task which used an I/O bitmap had more bits
+ * permitted, then the copy needs to cover those as well so they
+ * get turned off.
+ */
+ start = min(tss->io_zerobits_start, thread->io_zerobits_start);
+ len = max(tss->io_zerobits_end, thread->io_zerobits_end) - start;
+ src = (char *)thread->io_bitmap_ptr;
+ dst = (char *)tss->io_bitmap;
+ memcpy(dst + start, src + start, len);
+
+ /* Store the new start/end and set io_bitmap_base valid */
+ tss->io_zerobits_start = thread->io_zerobits_start;
+ tss->io_zerobits_end = thread->io_zerobits_end;
+ tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
+}
+
static inline void switch_to_bitmap(struct thread_struct *next,
unsigned long tifp, unsigned long tifn)
{
struct tss_struct *tss = this_cpu_ptr(&cpu_tss_rw);

if (tifn & _TIF_IO_BITMAP) {
- /*
- * Copy at least the size of the incoming tasks bitmap
- * which covers the last permitted I/O port.
- *
- * If the previous task which used an io bitmap had more
- * bits permitted, then the copy needs to cover those as
- * well so they get turned off.
- */
- memcpy(tss->io_bitmap, next->io_bitmap_ptr,
- max(tss->io_bitmap_prev_max, next->io_bitmap_max));
-
- /* Store the new max and set io_bitmap_base valid */
- tss->io_bitmap_prev_max = next->io_bitmap_max;
- tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
-
+ tss_update_io_bitmap(tss, next);
/*
* Make sure that the TSS limit is covering the io bitmap.
* It might have been cut down by a VMEXIT to 0x67 which
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -697,7 +697,7 @@ static int ptrace_set_debugreg(struct ta
static int ioperm_active(struct task_struct *target,
const struct user_regset *regset)
{
- return DIV_ROUND_UP(target->thread.io_bitmap_max, regset->size);
+ return DIV_ROUND_UP(target->thread.io_zerobits_end, regset->size);
}

static int ioperm_get(struct task_struct *target,



2019-11-07 01:13:41

by Linus Torvalds

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <[email protected]> wrote:
>
> Calculate both the position of the first zero bit and the last zero bit to
> limit the range which needs to be copied. This does not solve the problem
> when the previous task had only byte 0 cleared and the next one has only
> byte 65535 cleared, but trying to solve that would be too complex and
> heavyweight for the context switch path. As the ioperm() usage is very rare
> the case which is optimized is the single task/process which uses ioperm().

Hmm.

I may read this patch wrong, but from what I can tell, if we really
have just one process with an io bitmap, we're doing unnecessary
copies.

If we really have just one process that has an iobitmap, I think we
could just keep the bitmap of that process entirely unchanged. Then,
when we switch away from it, we set the io_bitmap_base to an invalid
base outside the TSS segment, and when we switch back, we set it back
to the valid one. No actual bitmap copies at all.

So I think that rather than the "begin/end offset" games, we should
perhaps have a "what was the last process that used the IO bitmap for
this TSS" pointer (and, I think, some sequence counter, so that when
the process updates its bitmap, it invalidates that case)?

Of course, you can do *both*, but if we really think that the common
case is "one special process", then I think the begin/end offset is
useless, but a "last bitmap process" would be very useful.

Am I missing something?

Linus

2019-11-07 07:41:44

by Ingo Molnar

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further


* Thomas Gleixner <[email protected]> wrote:

> @@ -365,19 +365,19 @@ struct tss_struct {
> struct x86_hw_tss x86_tss;
>
> /*
> - * Store the dirty size of the last io bitmap offender. The next
> - * one will have to do the cleanup as the switch out to a non
> - * io bitmap user will just set x86_tss.io_bitmap_base to a value
> - * outside of the TSS limit. So for sane tasks there is no need
> - * to actually touch the io_bitmap at all.
> + * Store the dirty byte range of the last io bitmap offender. The
> + * next one will have to do the cleanup because the switch out to a
> + * non I/O bitmap user will just set x86_tss.io_bitmap_base to a
> + * value outside of the TSS limit to not penalize tasks which do
> + * not use the I/O bitmap at all.
> */
> - unsigned int io_bitmap_prev_max;
> + unsigned int io_zerobits_start;
> + unsigned int io_zerobits_end;
>
> /*
> - * The extra 1 is there because the CPU will access an
> - * additional byte beyond the end of the IO permission
> - * bitmap. The extra byte must be all 1 bits, and must
> - * be within the limit.
> + * The extra 1 is there because the CPU will access an additional
> + * byte beyond the end of the I/O permission bitmap. The extra byte
> + * must have all bits set and must be within the TSS limit.
> */
> unsigned long io_bitmap[IO_BITMAP_LONGS + 1];
> } __aligned(PAGE_SIZE);

Note that on 32-bit kernels this blows up our CPU area calculations:

./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_181’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
./include/linux/compiler.h:331:4: note: in definition of macro ‘__compiletime_assert’
./include/linux/compiler.h:350:2: note: in expansion of macro ‘_compiletime_assert’
./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
arch/x86/mm/cpu_entry_area.c:181:2: note: in expansion of macro ‘BUILD_BUG_ON’
make[2]: *** [scripts/Makefile.build:265: arch/x86/mm/cpu_entry_area.o] Error 1

Thanks,

Ingo

2019-11-07 07:47:52

by Thomas Gleixner

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Wed, 6 Nov 2019, Linus Torvalds wrote:
> I may read this patch wrong, but from what I can tell, if we really
> have just one process with an io bitmap, we're doing unnecessary
> copies.
>
> If we really have just one process that has an iobitmap, I think we
> could just keep the bitmap of that process entirely unchanged. Then,
> when we switch away from it, we set the io_bitmap_base to an invalid
> base outside the TSS segment, and when we switch back, we set it back
> to the valid one. No actual bitmap copies at all.
>
> So I think that rather than the "begin/end offset" games, we should
> perhaps have a "what was the last process that used the IO bitmap for
> this TSS" pointer (and, I think, some sequence counter, so that when
> the process updates its bitmap, it invalidates that case)?
>
> Of course, you can do *both*, but if we really think that the common
> case is "one special process", then I think the begin/end offset is
> useless, but a "last bitmap process" would be very useful.
>
> Am I missing something?

No. You are right. I'll have a look at that.

Thanks,

tglx

2019-11-07 07:48:50

by Thomas Gleixner

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, 7 Nov 2019, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
> > /*
> > - * The extra 1 is there because the CPU will access an
> > - * additional byte beyond the end of the IO permission
> > - * bitmap. The extra byte must be all 1 bits, and must
> > - * be within the limit.
> > + * The extra 1 is there because the CPU will access an additional
> > + * byte beyond the end of the I/O permission bitmap. The extra byte
> > + * must have all bits set and must be within the TSS limit.
> > */
> > unsigned long io_bitmap[IO_BITMAP_LONGS + 1];
> > } __aligned(PAGE_SIZE);
>
> Note that on 32-bit kernels this blows up our CPU area calculations:
>
> ./include/linux/compiler.h:350:38: error: call to ‘__compiletime_assert_181’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE
> ./include/linux/compiler.h:331:4: note: in definition of macro ‘__compiletime_assert’
> ./include/linux/compiler.h:350:2: note: in expansion of macro ‘_compiletime_assert’
> ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’
> ./include/linux/build_bug.h:50:2: note: in expansion of macro ‘BUILD_BUG_ON_MSG’
> arch/x86/mm/cpu_entry_area.c:181:2: note: in expansion of macro ‘BUILD_BUG_ON’
> make[2]: *** [scripts/Makefile.build:265: arch/x86/mm/cpu_entry_area.o] Error 1

Duh. /me goes investigate.

2019-11-07 08:18:39

by Ingo Molnar

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further


* Thomas Gleixner <[email protected]> wrote:

> + /* Update the bitmap */
> + if (turn_on) {
> + bitmap_clear(bitmap, from, num);
> + } else {
> + bitmap_set(bitmap, from, num);
> + }
> +
> + /* Get the new range */
> + first = find_first_zero_bit(bitmap, IO_BITMAP_BITS);
> +
> + for (last = next = first; next < IO_BITMAP_BITS; last = next) {
> + /* Find the next set bit and update last */
> + next = find_next_bit(bitmap, IO_BITMAP_BITS, last);
> + last = next - 1;
> + if (next == IO_BITMAP_BITS)
> + break;
> + /* Find the next zero bit and continue searching */
> + next = find_next_zero_bit(bitmap, IO_BITMAP_BITS, next);
> + }
> +
> + /* Calculate the byte boundaries for the updated region */
> + copy_start = from / 8;
> + copy_len = (round_up(from + num, 8) / 8) - copy_start;

This might seem like a small detail, but since we do the range tracking
and copying at byte granularity anyway, why not do the zero range search
at byte granularity as well?

I bet it's faster and simpler as well than the bit-searching.

We could also change the bitmap over to a char or u8 based array and lose
all the sizeof(long) indexing headaches and the resulting type casts for
anything but the actual bitmap_set/clear() calls, etc.?

I.e. now that most of the logic is byte granular, the basic data
structure might as well reflect that?

> + /*
> + * Update the per thread storage and the TSS bitmap. This must be
> + * done with preemption disabled to prevent racing against a
> + * context switch.
> + */
> + preempt_disable();
> + tss = this_cpu_ptr(&cpu_tss_rw);
>
> + if (!t->io_bitmap_ptr) {
> + unsigned int tss_start = tss->io_zerobits_start;
> + /*
> + * If the task did not use the I/O bitmap yet then the
> + * perhaps stale content in the TSS needs to be taken into
> + * account. If tss start is out of bounds the TSS storage
> + * does not contain a zero bit and it's sufficient just to
> + * copy the new range over.
> + */

s/tss/TSS

> + if (tss_start < IO_BITMAP_BYTES) {
> + unsigned int tss_end = tss->io_zerobits_end;
> + unsigned int copy_end = copy_start + copy_len;
> +
> + copy_start = min(tss_start, copy_start);
> + copy_len = max(tss_end, copy_end) - copy_start;
> + }
> + }
> +
> + /* Copy the changed range over to the TSS bitmap */
> + dst = (char *)tss->io_bitmap;
> + src = (char *)bitmap;
> + memcpy(dst + copy_start, src + copy_start, copy_len);
> +
> + if (first >= IO_BITMAP_BITS) {
> + /*
> + * If the resulting bitmap has all permissions dropped, clear
> + * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
> + * invalid. Deallocate both the new and the thread's bitmap.
> + */
> + clear_thread_flag(TIF_IO_BITMAP);
> + tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
> + tofree = bitmap;
> + bitmap = NULL;

BTW., wouldn't it be simpler to just say that if a thread uses IO ops
even once, it gets a bitmap and that's it? I.e. we could further simplify
this seldom used piece of code.

> + } else {
> /*
> + * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
> + * the bitmap offset valid and make sure that the TSS limit
> + * is correct. It might have been wreckaged by a VMEXiT.
> */
> + set_thread_flag(TIF_IO_BITMAP);
> + tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
> refresh_tss_limit();
> }

I'm wondering, shouldn't we call refresh_tss_limit() in both branches, or
is a VMEXIT-wreckaged TSS limit harmless if we otherwise have
io_bitmap_base set to IO_BITMAP_OFFSET_INVALID?


> /*
> + * Update the range in the thread and the TSS
> *
> + * Get the byte position of the first zero bit and calculate
> + * the length of the range in which zero bits exist.
> */
> + start = first / 8;
> + end = first < IO_BITMAP_BITS ? round_up(last, 8) / 8 : 0;
> + t->io_zerobits_start = tss->io_zerobits_start = start;
> + t->io_zerobits_end = tss->io_zerobits_end = end;
>
> /*
> + * Finally exchange the bitmap pointer in the thread.
> */
> + bitmap = xchg(&t->io_bitmap_ptr, bitmap);
> + preempt_enable();
>
> + kfree(bitmap);
> + kfree(tofree);
>
> return 0;


Thanks,

Ingo

2019-11-07 08:28:24

by Ingo Molnar

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further


* Linus Torvalds <[email protected]> wrote:

> On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <[email protected]> wrote:
> >
> > Calculate both the position of the first zero bit and the last zero bit to
> > limit the range which needs to be copied. This does not solve the problem
> > when the previous task had only byte 0 cleared and the next one has only
> > byte 65535 cleared, but trying to solve that would be too complex and
> > heavyweight for the context switch path. As the ioperm() usage is very rare
> > the case which is optimized is the single task/process which uses ioperm().
>
> Hmm.
>
> I may read this patch wrong, but from what I can tell, if we really
> have just one process with an io bitmap, we're doing unnecessary
> copies.
>
> If we really have just one process that has an iobitmap, I think we
> could just keep the bitmap of that process entirely unchanged. Then,
> when we switch away from it, we set the io_bitmap_base to an invalid
> base outside the TSS segment, and when we switch back, we set it back
> to the valid one. No actual bitmap copies at all.
>
> So I think that rather than the "begin/end offset" games, we should
> perhaps have a "what was the last process that used the IO bitmap for
> this TSS" pointer (and, I think, some sequence counter, so that when
> the process updates its bitmap, it invalidates that case)?
>
> > Of course, you can do *both*, but if we really think that the common
> case is "one special process", then I think the begin/end offset is
> useless, but a "last bitmap process" would be very useful.
>
> Am I missing something?

In fact on SMP systems this would result in a very nice optimization:
pretty quickly *all* TSS's would be populated with that single task's
bitmap, and it would persist even across migrations from CPU to CPU.

I'd love to get rid of the offset caching and bit scanning games as well
- it doesn't really help in a number of common scenarios and it
complicates this non-trivial piece of code a *LOT* - and we probably
don't really have the natural testing density of this code anymore to
find any regressions quickly.

So intuitively I'd suggest we gravitate towards the simplest
implementation, with a good caching optimization for the single-task
case.

I.e. the model I'm suggesting is that if a task uses ioperm() or iopl()
then it should have a bitmap from that point on until exit(), even if
it's all zeroes or all ones. Most applications that are using those
primitives really need it all the time and are using just a few ioports,
so all the tracking doesn't help much anyway.

On a related note, another simplification would be that in principle we
could also use just a single bitmap and emulate iopl() as an ioperm(all)
or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed
ioperm()/iopl() uses, but is that ABI actually being relied on in
practice?

Thanks,

Ingo

2019-11-07 09:22:01

by Willy Tarreau

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
> I.e. the model I'm suggesting is that if a task uses ioperm() or iopl()
> then it should have a bitmap from that point on until exit(), even if
> it's all zeroes or all ones. Most applications that are using those
> primitives really need it all the time and are using just a few ioports,
> so all the tracking doesn't help much anyway.

I'd go even further, considering that any task having called ioperm()
or iopl() once is granted access to all 64k ports for life: if the task
was granted access to any port, it will be able to request access for any
other port anyway. And we cannot claim that finely filtering accesses
brings any particular reliability in my opinion, considering that it's
generally possible to make the system really sick by starting to play
with most I/O ports. So for me that becomes a matter of trusted vs not
trusted task. Then we can simply have two pages of 0xFF to describe
their I/O access bitmap.

> On a related note, another simplification would be that in principle we
> could also use just a single bitmap and emulate iopl() as an ioperm(all)
> or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed
> ioperm()/iopl() uses, but is that ABI actually being relied on in
> practice?

You mean you'd have a unified map for all tasks? In this case I think
it's simpler and equivalent to simply ignore the values in the calls
and grant full perms to the 64k ports range after the calls were
validated. I could be totally wrong and missing something obvious
though.

Regards,
Willy

2019-11-07 10:02:28

by Thomas Gleixner

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, 7 Nov 2019, Willy Tarreau wrote:
> On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
> > I.e. the model I'm suggesting is that if a task uses ioperm() or iopl()
> > then it should have a bitmap from that point on until exit(), even if
> > it's all zeroes or all ones. Most applications that are using those
> > primitives really need it all the time and are using just a few ioports,
> > so all the tracking doesn't help much anyway.
>
> I'd go even further, considering that any task having called ioperm()
> or iopl() once is granted access to all 64k ports for life: if the task
> was granted access to any port, it will be able to request access for any
> other port anyway. And we cannot claim that finely filtering accesses
> brings any particular reliability in my opinion, considering that it's
> generally possible to make the system really sick by starting to play
> with most I/O ports. So for me that becomes a matter of trusted vs not
> trusted task. Then we can simply have two pages of 0xFF to describe
> their I/O access bitmap.
>
> > On a related note, another simplification would be that in principle we
> > could also use just a single bitmap and emulate iopl() as an ioperm(all)
> > or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed
> > ioperm()/iopl() uses, but is that ABI actually being relied on in
> > practice?
>
> You mean you'd have a unified map for all tasks ? In this case I think
> it's simpler and equivalent to simply ignore the values in the calls
> and grant full perms to the 64k ports range after the calls were
> validated. I could be totally wrong and missing something obvious
> though.

Changing ioperm(single port, port range) to be ioperm(all) is going to
break a bunch of test cases which actually check whether the permission is
restricted to a single I/O port or the requested port range.

Thanks,

tglx

2019-11-07 10:18:08

by Willy Tarreau

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 07, 2019 at 11:00:27AM +0100, Thomas Gleixner wrote:
> Changing ioperm(single port, port range) to be ioperm(all) is going to
> break a bunch of test cases which actually check whether the permission is
> restricted to a single I/O port or the requested port range.

But out of curiosity, are these solely test cases or things that real
applications do? We could imagine having a sysctl entry to indicate
whether or not we want strict compatibility with older code in which
case we'd take the slow path, or a modernized behavior using only the
fast path. If we managed to deal with mmap_min_addr over time, I think
it should be manageable to deal with the rare applications using ioperm().

Willy

2019-11-07 10:25:22

by H. Peter Anvin

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On November 7, 2019 2:00:27 AM PST, Thomas Gleixner <[email protected]> wrote:
>On Thu, 7 Nov 2019, Willy Tarreau wrote:
>> On Thu, Nov 07, 2019 at 09:25:41AM +0100, Ingo Molnar wrote:
>> > I.e. the model I'm suggesting is that if a task uses ioperm() or iopl()
>> > then it should have a bitmap from that point on until exit(), even if
>> > it's all zeroes or all ones. Most applications that are using those
>> > primitives really need it all the time and are using just a few ioports,
>> > so all the tracking doesn't help much anyway.
>>
>> I'd go even further, considering that any task having called ioperm()
>> or iopl() once is granted access to all 64k ports for life: if the task
>> was granted access to any port, it will be able to request access for any
>> other port anyway. And we cannot claim that finely filtering accesses
>> brings any particular reliability in my opinion, considering that it's
>> generally possible to make the system really sick by starting to play
>> with most I/O ports. So for me that becomes a matter of trusted vs not
>> trusted task. Then we can simply have two pages of 0xFF to describe
>> their I/O access bitmap.
>>
>> > On a related note, another simplification would be that in principle we
>> > could also use just a single bitmap and emulate iopl() as an ioperm(all)
>> > or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed
>> > ioperm()/iopl() uses, but is that ABI actually being relied on in
>> > practice?
>>
>> You mean you'd have a unified map for all tasks ? In this case I think
>> it's simpler and equivalent to simply ignore the values in the calls
>> and grant full perms to the 64k ports range after the calls were
>> validated. I could be totally wrong and missing something obvious
>> though.
>
>Changing ioperm(single port, port range) to be ioperm(all) is going to
>break a bunch of test cases which actually check whether the permission is
>restricted to a single I/O port or the requested port range.
>
>Thanks,
>
> tglx

This seems very undesirable... as much as we might wish otherwise, the port bitmap is the equivalent of the MMU, and there are definitely users doing direct device I/O out there.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2019-11-07 10:30:59

by Willy Tarreau

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 07, 2019 at 02:19:19AM -0800, [email protected] wrote:
> >Changing ioperm(single port, port range) to be ioperm(all) is going to
> >break a bunch of test cases which actually check whether the permission
> >is restricted to a single I/O port or the requested port range.
> >
> >Thanks,
> >
> > tglx
>
> This seems very undesirable... as much as we might wish otherwise, the port
> bitmap is the equivalent to the MMU, and there are definitely users doing
> direct device I/O out there.

Doing these, sure, but doing these while ranges are really checked ?
I mean, the MMU grants you access to the pages you were assigned. Here
with the I/O bitmap you just have to ask for access to port X and you
get it. I could understand the benefit if we had EBUSY in return but
that's not the case, you can actually request access to a port range
another device driver or process is currently using, and mess up with
what it does even by accident. I remember streaming 1-bit music in
userland from the LED of my floppy drive in the late-90s, it used to
cause some trouble to the floppy driver when using mtools in parallel :-)

Willy

2019-11-07 13:01:23

by Willy Tarreau

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 07, 2019 at 02:50:20AM -0800, [email protected] wrote:
> You get access to the ports you are assigned, just like pages you are
> assigned... the rest is kernel policy, or, for that matter, privileged
> userspace (get permissions to the necessary ports, then drop privilege... the
> usual stuff.)

I agree, my point is that there's already no policy checking at the
moment ports are assigned, hence a process having the permissions to
request just port 0x70-0x71 to read the hwclock will also have permission
to request access to the sensor chip at 0x2E and trigger a watchdog
reset or stop the CPU fan. Thus any policy enforcement is solely done
by the requesting process itself, assuming it doesn't simply use iopl()
already, which grants everything.

This is why I'm really wondering if the real use cases that need all
this stuff still exist at all in practice.

Willy

2019-11-07 14:46:10

by H. Peter Anvin

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On November 7, 2019 2:27:56 AM PST, Willy Tarreau <[email protected]> wrote:
>On Thu, Nov 07, 2019 at 02:19:19AM -0800, [email protected] wrote:
>> >Changing ioperm(single port, port range) to be ioperm(all) is going to
>> >break a bunch of test cases which actually check whether the permission
>> >is restricted to a single I/O port or the requested port range.
>> >
>> >Thanks,
>> >
>> > tglx
>>
>> This seems very undesirable... as much as we might wish otherwise, the port
>> bitmap is the equivalent to the MMU, and there are definitely users doing
>> direct device I/O out there.
>
>Doing these, sure, but doing these while ranges are really checked ?
>I mean, the MMU grants you access to the pages you were assigned. Here
>with the I/O bitmap you just have to ask for access to port X and you
>get it. I could understand the benefit if we had EBUSY in return but
>that's not the case, you can actually request access to a port range
>another device driver or process is currently using, and mess up with
>what it does even by accident. I remember streaming 1-bit music in
>userland from the LED of my floppy drive in the late-90s, it used to
>cause some trouble to the floppy driver when using mtools in parallel :-)
>
>Willy

You get access to the ports you are assigned, just like pages you are assigned... the rest is kernel policy, or, for that matter, privileged userspace (get permissions to the necessary ports, then drop privilege... the usual stuff.)

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2019-11-07 16:46:23

by Eric W. Biederman

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

Willy Tarreau <[email protected]> writes:

> On Thu, Nov 07, 2019 at 02:50:20AM -0800, [email protected] wrote:
>> You get access to the ports you are assigned, just like pages you are
>> assigned... the rest is kernel policy, or, for that matter, privileged
>> userspace (get permissions to the necessary ports, then drop privilege... the
>> usual stuff.)
>
> I agree, my point is that there's already no policy checking at the
> moment ports are assigned, hence a process having the permissions to
> request just port 0x70-0x71 to read the hwclock will also have permission
> to request access to the sensor chip at 0x2E and trigger a watchdog
> reset or stop the CPU fan. Thus any policy enforcement is solely done
> by the requesting process itself, assuming it doesn't simply use iopl()
> already, which grants everything.
>
> This is why I'm really wondering if the real use cases that need all
> this stuff still exist at all in practice.

My memory is that the applications which didn't need fine-grained access to
ports would just use iopl.

Further, a quick look shows that dosemu uses ioperm in a fine-grained
manner. From memory it would grant a handful of ports to allow directly
accessing a device and depended on the rest of the port accesses being
disallowed so it could trap and emulate them.

So yes I do believe making ioperm ioperm(all) will break userspace.

Eric
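
The split Eric describes (a handful of ports passed straight through via ioperm(), everything else trapping so the monitor can emulate it) can be sketched with a monitor-side permission bitmap. This is a hedged illustration with invented names, not dosemu's actual data structures:

```c
#include <stdbool.h>
#include <stdint.h>

#define PORT_SPACE 65536                   /* x86 I/O port space */

/* Hypothetical monitor-side policy: bit set = port is passed straight
 * through to hardware (the monitor would also grant it via ioperm());
 * bit clear = the access faults in the guest and gets emulated. */
struct port_policy {
    uint8_t passthrough[PORT_SPACE / 8];
};

static void grant_port(struct port_policy *p, uint16_t port)
{
    p->passthrough[port / 8] |= (uint8_t)(1u << (port % 8));
}

static bool is_passthrough(const struct port_policy *p, uint16_t port)
{
    return (p->passthrough[port / 8] >> (port % 8)) & 1u;
}
```

A dosemu-like monitor would grant, say, 0x3f8-0x3ff for a pass-through serial port and route every other trapped port access to its emulation code.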

2019-11-07 16:54:57

by Linus Torvalds

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 7, 2019 at 8:45 AM Eric W. Biederman <[email protected]> wrote:
>
> Further a quick look shows that dosemu uses ioperm in a fine grained
> manner. From memory it would allow a handful of ports to allow directly
> accessing a device and depended on the rest of the port accesses to be
> disallowed so it could trap and emulate them.

Yes. Making ioperm() some all-or-nothing thing would be horribly bad,
and has no real advantages that I can see.

Linus

2019-11-07 17:03:06

by Willy Tarreau

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 07, 2019 at 10:45:09AM -0600, Eric W. Biederman wrote:
> Further a quick look shows that dosemu uses ioperm in a fine grained
> manner. From memory it would allow a handful of ports to allow directly
> accessing a device and depended on the rest of the port accesses to be
> disallowed so it could trap and emulate them.
>
> So yes I do believe making ioperm ioperm(all) will break userspace.

OK, and I must confess I almost forgot about dosemu, that's a good point!

Thanks,
Willy

2019-11-07 18:04:59

by Thomas Gleixner

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, 7 Nov 2019, Ingo Molnar wrote:
> * Thomas Gleixner <[email protected]> wrote:
> This might seem like a small detail, but since we do the range tracking
> and copying at byte granularity anyway, why not do the zero range search
> at byte granularity as well?
>
> I bet it's faster and simpler as well than the bit-searching.

Not necessarily. The bitmap search uses unsigned long internally at least
to the point where it finds a zero bit.

But probably we should just ditch that optimization and rather have the
sequence number to figure out whether something needs to be copied at all.

> > + if (first >= IO_BITMAP_BITS) {
> > + /*
> > + * If the resulting bitmap has all permissions dropped, clear
> > + * TIF_IO_BITMAP and set the IO bitmap offset in the TSS to
> > + * invalid. Deallocate both the new and the thread's bitmap.
> > + */
> > + clear_thread_flag(TIF_IO_BITMAP);
> > + tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_INVALID;
> > + tofree = bitmap;
> > + bitmap = NULL;
>
> BTW., wouldn't it be simpler to just say that if a thread uses IO ops
> even once, it gets a bitmap and that's it? I.e. we could further simplify
> this seldom used piece of code.

Maybe.

> > + } else {
> > /*
> > + * I/O bitmap contains zero bits. Set TIF_IO_BITMAP, make
> > + * the bitmap offset valid and make sure that the TSS limit
> > + * is correct. It might have been wreckaged by a VMEXiT.
> > */
> > + set_thread_flag(TIF_IO_BITMAP);
> > + tss->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET_VALID;
> > refresh_tss_limit();
> > }
>
> I'm wondering, shouldn't we call refresh_tss_limit() in both branches, or
> is a VMEXIT-wreckaged TSS limit harmless if we otherwise have
> io_bitmap_base set to IO_BITMAP_OFFSET_INVALID?

Yes, because the VMEXIT wreckage restricts the TSS limit to 0x67, which is
the end of the hardware TSS struct. As the invalid offset points outside
the TSS limit in any case, it does not matter. It always #GP's when an I/O
access happens in user space.

Thanks,

tglx

2019-11-07 19:25:26

by Brian Gerst

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Wed, Nov 6, 2019 at 4:01 PM Thomas Gleixner <[email protected]> wrote:
>
> Sane ioperm() users only set the few bits in the I/O space which they need to
> access. But the update logic of the TSS I/O bitmap copies always from byte
> 0 to the last byte in the tasks bitmap which contains a zero permission bit.
>
> That means that for access only to port 65535 the full 8K bitmap has to be
> copied even if all the bytes in the TSS I/O bitmap are already filled with
> 0xff.
>
> Calculate both the position of the first zero bit and the last zero bit to
> limit the range which needs to be copied. This does not solve the problem
> when the previous task had only byte 0 cleared and the next one has only
> byte 65535 cleared, but trying to solve that would be too complex and
> heavyweight for the context switch path. As the ioperm() usage is very rare
> the case which is optimized is the single task/process which uses ioperm().

Here is a different idea: We already map the TSS virtually in
cpu_entry_area. Why not page-align the IO bitmap and remap it to the
task's bitmap on task switch? That would avoid all copying on task
switch.

--
Brian Gerst
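
For reference, the first/last zero bit range limiting described in the quoted changelog can be sketched in plain C. This is a byte-granular toy with invented names, not the kernel's actual bit-scanning implementation:

```c
#include <stddef.h>
#include <stdint.h>

#define IO_BITMAP_BYTES (65536 / 8)     /* 8K bitmap covering 64K ports */

/*
 * Hypothetical helper: find the byte range [*first, *last] containing
 * any zero (access granted) bits, so a context switch only has to copy
 * that range into the TSS instead of the whole 8K.
 * Returns 0 when a zero bit exists, -1 when the bitmap is all ones.
 */
static int io_bitmap_dirty_range(const uint8_t *bitmap,
                                 size_t *first, size_t *last)
{
    size_t lo = IO_BITMAP_BYTES, hi = 0;

    for (size_t i = 0; i < IO_BITMAP_BYTES; i++) {
        if (bitmap[i] != 0xff) {        /* byte holds a zero permission bit */
            if (lo == IO_BITMAP_BYTES)
                lo = i;
            hi = i;
        }
    }
    if (lo == IO_BITMAP_BYTES)
        return -1;                      /* all ones: nothing to copy */
    *first = lo;
    *last = hi;
    return 0;
}
```

Clearing only the bits for ports 0x70-0x71, for instance, yields a single-byte range instead of an 8K copy.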

2019-11-07 19:57:51

by Linus Torvalds

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <[email protected]> wrote:
>
> Here is a different idea: We already map the TSS virtually in
> cpu_entry_area. Why not page-align the IO bitmap and remap it to the
> task's bitmap on task switch? That would avoid all copying on task
> switch.

We map the tss _once_, statically, percpu, without ever changing it,
and then we just (potentially) change a couple of fields in it on
process switch.

Your idea isn't horrible, but it would involve a TLB flush for the
page when the io bitmap changes. Which is almost certainly more
expensive than just copying the bitmap intelligently.

Particularly since I do think that the copy can basically be done
effectively never, assuming there really aren't multiple concurrent
users of ioperm() (and iopl).

Linus

2019-11-07 21:05:06

by Brian Gerst

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <[email protected]> wrote:
> >
> > Here is a different idea: We already map the TSS virtually in
> > cpu_entry_area. Why not page-align the IO bitmap and remap it to the
> > task's bitmap on task switch? That would avoid all copying on task
> > switch.
>
> We map the tss _once_, statically, percpu, without ever changing it,
> and then we just (potentially) change a couple of fields in it on
> process switch.
>
> Your idea isn't horrible, but it would involve a TLB flush for the
> page when the io bitmap changes. Which is almost certainly more
> expensive than just copying the bitmap intelligently.
>
> Particularly since I do think that the copy can basically be done
> effectively never, assuming there really aren't multiple concurrent
> users of ioperm() (and iopl).

There wouldn't have to be a flush on every task switch. If we make it
so that tasks that don't use a bitmap just unmap the pages in the
cpu_entry_area and set tss.io_bitmap_base to outside the segment
limit, we would only have to flush when switching from a task using
the bitmap (because the next task uses a different bitmap or we are
unmapping it). If the previous task doesn't have a bitmap the pages
in cpu_entry_area were unmapped and can't be in the TLB, so no flush
is needed.

Going a step further, we could track which task is mapped to the
current cpu like proposed above, and only flush when a different task
needs the IO bitmap, or when the bitmap is being freed on task exit.

--
Brian Gerst

2019-11-07 21:33:31

by Thomas Gleixner

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, 7 Nov 2019, Brian Gerst wrote:
> On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <[email protected]> wrote:
> > >
> > > Here is a different idea: We already map the TSS virtually in
> > > cpu_entry_area. Why not page-align the IO bitmap and remap it to the
> > > task's bitmap on task switch? That would avoid all copying on task
> > > switch.
> >
> > We map the tss _once_, statically, percpu, without ever changing it,
> > and then we just (potentially) change a couple of fields in it on
> > process switch.
> >
> > Your idea isn't horrible, but it would involve a TLB flush for the
> > page when the io bitmap changes. Which is almost certainly more
> > expensive than just copying the bitmap intelligently.
> >
> > Particularly since I do think that the copy can basically be done
> > effectively never, assuming there really aren't multiple concurrent
> > users of ioperm() (and iopl).
>
> There wouldn't have to be a flush on every task switch. If we make it
> so that tasks that don't use a bitmap just unmap the pages in the
> cpu_entry_area and set tss.io_bitmap_base to outside the segment
> limit, we would only have to flush when switching from a task using
> the bitmap (because the next task uses a different bitmap or we are
> unmapping it). If the previous task doesn't have a bitmap the pages
> in cpu_entry_area were unmapped and can't be in the TLB, so no flush
> is needed.

Funny. I was just debating exactly this with Peter Zijlstra over IRC :)

> Going a step further, we could track which task is mapped to the
> current cpu like proposed above, and only flush when a different task
> needs the IO bitmap, or when the bitmap is being freed on task exit.

Yes.

But we really should check what aside from DOSEMU is still using this. None
of the machines I checked has a single instance of ioperm()/iopl() usage.

So the real question is whether it's worth the trouble, or if we are just
better off copying when there is an actual user and the sequence count of the
bitmap differs from the one which was active last time.

Thanks,

tglx
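
The sequence count idea above can be sketched as follows. The layout and names are hypothetical, purely to show the "skip the 8K copy when the generation matches" logic, not the eventual kernel implementation:

```c
#include <stdint.h>
#include <string.h>

#define IO_BITMAP_BYTES (65536 / 8)

/* Hypothetical per-task bitmap with a generation number that ioperm()
 * would bump on every update. */
struct io_bitmap {
    uint64_t sequence;
    uint8_t  map[IO_BITMAP_BYTES];
};

/* Hypothetical per-cpu TSS-side cache: which generation is currently
 * loaded (0 meaning "nothing loaded yet"). */
struct tss_bitmap_cache {
    uint64_t loaded_sequence;
    uint8_t  map[IO_BITMAP_BYTES];
};

/*
 * If the incoming task's bitmap generation already matches what the
 * TSS holds, the context switch skips the copy entirely.
 * Returns 1 when a copy was performed, 0 when it was skipped.
 */
static int tss_update_io_bitmap(struct tss_bitmap_cache *tss,
                                const struct io_bitmap *iobm)
{
    if (tss->loaded_sequence == iobm->sequence)
        return 0;                       /* same bitmap already loaded */
    memcpy(tss->map, iobm->map, IO_BITMAP_BYTES);
    tss->loaded_sequence = iobm->sequence;
    return 1;
}
```

In the single-offender case discussed in the thread, every switch back to the same task then hits the early return and the 8K copy happens only after an actual ioperm() change.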

2019-11-07 21:46:09

by Linus Torvalds

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <[email protected]> wrote:
>
> There wouldn't have to be a flush on every task switch.

No. But we'd have to flush on any switch that currently does that memcpy.

And my point is that a tlb flush (even the single-page case) is likely
more expensive than the memcpy.

> Going a step further, we could track which task is mapped to the
> current cpu like proposed above, and only flush when a different task
> needs the IO bitmap, or when the bitmap is being freed on task exit.

Well, that's exactly my "track the last task" optimization for copying
the thing.

IOW, it's the same optimization as avoiding the memcpy.

Which I think is likely very effective, but also makes it fairly
pointless to then try to be clever..

So the basic issue remains that playing VM games has almost
universally been slower and more complex than simply not playing VM
games. TLB flushes - even invlpg - tend to be pretty slow.

Of course, we probably end up invalidating the TLB's anyway, so maybe
in this case we don't care. The ioperm bitmap is _technically_
per-thread, though, so it should be flushed even if the VM isn't
flushed...

Linus

2019-11-07 23:23:43

by H. Peter Anvin

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On November 7, 2019 1:32:15 PM PST, Thomas Gleixner <[email protected]> wrote:
>On Thu, 7 Nov 2019, Brian Gerst wrote:
>> On Thu, Nov 7, 2019 at 2:54 PM Linus Torvalds
>> <[email protected]> wrote:
>> >
>> > On Thu, Nov 7, 2019 at 11:24 AM Brian Gerst <[email protected]> wrote:
>> > >
>> > > Here is a different idea: We already map the TSS virtually in
>> > > cpu_entry_area. Why not page-align the IO bitmap and remap it to the
>> > > task's bitmap on task switch? That would avoid all copying on task
>> > > switch.
>> >
>> > We map the tss _once_, statically, percpu, without ever changing it,
>> > and then we just (potentially) change a couple of fields in it on
>> > process switch.
>> >
>> > Your idea isn't horrible, but it would involve a TLB flush for the
>> > page when the io bitmap changes. Which is almost certainly more
>> > expensive than just copying the bitmap intelligently.
>> >
>> > Particularly since I do think that the copy can basically be done
>> > effectively never, assuming there really aren't multiple concurrent
>> > users of ioperm() (and iopl).
>>
>> There wouldn't have to be a flush on every task switch. If we make it
>> so that tasks that don't use a bitmap just unmap the pages in the
>> cpu_entry_area and set tss.io_bitmap_base to outside the segment
>> limit, we would only have to flush when switching from a task using
>> the bitmap (because the next task uses a different bitmap or we are
>> unmapping it). If the previous task doesn't have a bitmap the pages
>> in cpu_entry_area were unmapped and can't be in the TLB, so no flush
>> is needed.
>
>Funny. I was just debating exactly this with Peter Ziljstra over IRC :)
>
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
>
>Yes.
>
>But, we really should check what aside of DoSemu is using this still. None
>of my machines I checked have a single instance of ioperm()/iopl() usage.
>
>So the real question is whether it's worth the trouble or if we are just
>better off to copy if there is an actual user and the sequence count of the
>bitmap is different than the one which was active last time.
>
>Thanks,
>
> tglx

I have written software using this, because of far better real-time performance. I just want to punch a hole (just like mmapping an MMIO device).

I do agree that we should not optimize for the rare case.
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

2019-11-08 01:17:14

by H. Peter Anvin

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On 2019-11-07 13:44, Linus Torvalds wrote:
> On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <[email protected]> wrote:
>>
>> There wouldn't have to be a flush on every task switch.
>
> No. But we'd have to flush on any switch that currently does that memcpy.
>
> And my point is that a tlb flush (even the single-page case) is likely
> more expensive than the memcpy.
>
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
>
> Well, that's exactly my "track the last task" optimization for copying
> the thing.
>
> IOW, it's the same optimization as avoiding the memcpy.
>
> Which I think is likely very effective, but also makes it fairly
> pointless to then try to be clever..
>
> So the basic issue remains that playing VM games has almost
> universally been slower and more complex than simply not playing VM
> games. TLB flushes - even invlpg - tends to be pretty slow.
>
> Of course, we probably end up invalidating the TLB's anyway, so maybe
> in this case we don't care. The ioperm bitmap is _technically_
> per-thread, though, so it should be flushed even if the VM isn't
> flushed...
>

One option, probably a lot saner (if we care at all; after all, copying 8K
really isn't that much, but it might have some impact on real-time processes,
which is one of the rather few use cases for direct I/O) would be to keep the
bitmask in a pre-formatted TSS (ioperm being per thread, so no concerns about
the TSS being in use on another processor), copy the TSS fields (88 bytes)
over if and only if the thread has been migrated to a different CPU, and then
switch the TSS itself rather than copying the bitmap. For the common case
(no ioperms) we use the standard per-cpu TSS.

That being said, I don't actually know that copying 88 bytes + LTR is any
cheaper than copying 8K.

-hpa

2019-11-08 02:14:15

by Brian Gerst

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On Thu, Nov 7, 2019 at 8:12 PM H. Peter Anvin <[email protected]> wrote:
>
> On 2019-11-07 13:44, Linus Torvalds wrote:
> > On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <[email protected]> wrote:
> >>
> >> There wouldn't have to be a flush on every task switch.
> >
> > No. But we'd have to flush on any switch that currently does that memcpy.
> >
> > And my point is that a tlb flush (even the single-page case) is likely
> > more expensive than the memcpy.
> >
> >> Going a step further, we could track which task is mapped to the
> >> current cpu like proposed above, and only flush when a different task
> >> needs the IO bitmap, or when the bitmap is being freed on task exit.
> >
> > Well, that's exactly my "track the last task" optimization for copying
> > the thing.
> >
> > IOW, it's the same optimization as avoiding the memcpy.
> >
> > Which I think is likely very effective, but also makes it fairly
> > pointless to then try to be clever..
> >
> > So the basic issue remains that playing VM games has almost
> > universally been slower and more complex than simply not playing VM
> > games. TLB flushes - even invlpg - tends to be pretty slow.
> >
> > Of course, we probably end up invalidating the TLB's anyway, so maybe
> > in this case we don't care. The ioperm bitmap is _technically_
> > per-thread, though, so it should be flushed even if the VM isn't
> > flushed...
> >
>
> One option, probably a lot saner (if we care at all, after all, copying 8K
> really isn't that much, but it might have some impact on real-time processes,
> which is one of the rather few use cases for direct I/O) would be to keep the
> bitmask in a pre-formatted TSS (ioperm being per thread, so no concerns about
> the TSS being in use on another processor), and copy the TSS fields (88 bytes)
> over if and only if the thread has been migrated to a different CPU, then
> switch the TSS rather than switching For the common case (no ioperms) we use
> the standard per-cpu TSS.
>
> That being said, I don't actually know that copying 88 bytes + LTR is any
> cheaper than copying 8K.

I don't think that can work. The TSS has to be at a fixed address in
the cpu_entry_area so that it is visible when running in usermode
(thanks to Meltdown).

--
Brian Gerst

2019-11-10 17:20:16

by Andy Lutomirski

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On 11/7/19 12:25 AM, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>> On Wed, Nov 6, 2019 at 12:57 PM Thomas Gleixner <[email protected]> wrote:
>>>
>>> Calculate both the position of the first zero bit and the last zero bit to
>>> limit the range which needs to be copied. This does not solve the problem
>>> when the previous task had only byte 0 cleared and the next one has only
>>> byte 65535 cleared, but trying to solve that would be too complex and
>>> heavyweight for the context switch path. As the ioperm() usage is very rare
>>> the case which is optimized is the single task/process which uses ioperm().
>>
>> Hmm.
>>
>> I may read this patch wrong, but from what I can tell, if we really
>> have just one process with an io bitmap, we're doing unnecessary
>> copies.
>>
>> If we really have just one process that has an iobitmap, I think we
>> could just keep the bitmap of that process entirely unchanged. Then,
>> when we switch away from it, we set the io_bitmap_base to an invalid
>> base outside the TSS segment, and when we switch back, we set it back
>> to the valid one. No actual bitmap copies at all.
>>
>> So I think that rather than the "begin/end offset" games, we should
>> perhaps have a "what was the last process that used the IO bitmap for
>> this TSS" pointer (and, I think, some sequence counter, so that when
>> the process updates its bitmap, it invalidates that case)?
>>
>> Of course, you can do *nboth*, but if we really think that the common
>> case is "one special process", then I think the begin/end offset is
>> useless, but a "last bitmap process" would be very useful.
>>
>> Am I missing something?
>
> In fact on SMP systems this would result in a very nice optimization:
> pretty quickly *all* TSS's would be populated with that single task's
> bitmap, and it would persist even across migrations from CPU to CPU.
>
> I'd love to get rid of the offset caching and bit scanning games as well
> - it doesn't really help in a number of common scenarios and it
> complicates this non-trivial piece of code a *LOT* - and we probably
> don't really have the natural testing density of this code anymore to
> find any regressions quickly.

I think we should not over-optimize this. I am all for penalizing
ioperm() and iopl() users as much as is convenient for us. There is
simply no legitimate use case. Sorry, DPDK, but "virtio-legacy sucks,
let's optimize the crap out of something that is slow anyway and use
iopl()" is not a good excuse. Just use the %*!7 syscall to write to
/sys/.../resource0 and suck up the probably negligible performance hit.
And tell your customers to upgrade their hypervisors. And quit
kvetching about performance of the control plane on an old
software-emulated NIC while you're at it.

For the TLB case, it's worth tracking who last used which ASID and
whether it's still up to date, since *everyone* uses the MMU. For
ioperm, I don't really believe this is worth it.

>
> So intuitively I'd suggest we gravitate towards the simplest
> implementation, with a good caching optimization for the single-task
> case.

I agree with the first bit, but caching on an SMP system is necessarily
subtle. Some kind of invalidation is needed.

>
> I.e. the model I'm suggesting is that if a task uses ioperm() or iopl()
> then it should have a bitmap from that point on until exit(), even if
> it's all zeroes or all ones. Most applications that are using those
> primitives really need it all the time and are using just a few ioports,
> so all the tracking doesn't help much anyway.
>
> On a related note, another simplification would be that in principle we
> could also use just a single bitmap and emulate iopl() as an ioperm(all)
> or ioperm(none) calls. Yeah, it's not fully ABI compatible for mixed
> ioperm()/iopl() uses, but is that ABI actually being relied on in
> practice?
>

Let's please keep the ABI. Or rather, let's attempt to eventually
remove the ABI, but let's not change it in the mean time please.

2019-11-10 17:24:19

by Andy Lutomirski

Subject: Re: [patch 5/9] x86/ioport: Reduce ioperm impact for sane usage further

On 11/7/19 1:44 PM, Linus Torvalds wrote:
> On Thu, Nov 7, 2019 at 1:00 PM Brian Gerst <[email protected]> wrote:
>>
>> There wouldn't have to be a flush on every task switch.
>
> No. But we'd have to flush on any switch that currently does that memcpy.
>
> And my point is that a tlb flush (even the single-page case) is likely
> more expensive than the memcpy.
>
>> Going a step further, we could track which task is mapped to the
>> current cpu like proposed above, and only flush when a different task
>> needs the IO bitmap, or when the bitmap is being freed on task exit.
>
> Well, that's exactly my "track the last task" optimization for copying
> the thing.
>
> IOW, it's the same optimization as avoiding the memcpy.
>
> Which I think is likely very effective, but also makes it fairly
> pointless to then try to be clever..
>
> So the basic issue remains that playing VM games has almost
> universally been slower and more complex than simply not playing VM
> games. TLB flushes - even invlpg - tends to be pretty slow.
>

With my TLB-handling-writing-and-reviewing hat on, NAK to any VM games
here.

Honestly, I almost think we should unconditionally copy the whole 8K for
the case where the ioperm() syscall has been used (i.e. not emulating
iopl()). The benefit simply does not justify the risk of getting it
wrong. I'm okay, but barely, with optimizing the end of the copied
range. Optimizing the start of the copied range is pushing it. Playing
MMU tricks and getting all the per-task-ioperm and invalidation right is
way beyond reasonable.

Even the time spent discussing how to optimize a case that has literally
one known user that none of us can bring ourselves to care about much
seems wasteful.
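
For what it's worth, the "optimize only the end of the copied range" variant Andy is barely okay with reduces to one comparison: always start the copy at byte 0 and trim only the tail. A minimal sketch with an invented helper, at byte granularity:

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes that must be copied into the TSS
 * bitmap when switching in a task whose last zero (access granted) bit
 * lives in byte 'task_last_zero_byte', given that the TSS's first
 * 'tss_prev_dirty_len' bytes may still hold the previous offender's
 * permissions and must be overwritten. The copy always starts at byte 0. */
static size_t io_bitmap_copy_len(size_t task_last_zero_byte,
                                 size_t tss_prev_dirty_len)
{
    size_t needed = task_last_zero_byte + 1;

    return needed > tss_prev_dirty_len ? needed : tss_prev_dirty_len;
}
```

This avoids the subtle start-of-range bookkeeping while still keeping the common low-port case (e.g. ports 0x70-0x71) down to a handful of bytes per switch.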