From: Andy Lutomirski
Date: Mon, 3 Oct 2016 13:54:02 -0700
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace
To: Rik van Riel, Dave Hansen
Cc: linux-kernel@vger.kernel.org, X86 ML, Thomas Gleixner, Paolo Bonzini,
    Ingo Molnar, Andrew Lutomirski, "H. Peter Anvin", Borislav Petkov

On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM,  wrote:
>> >
>> >  #define CREATE_TRACE_POINTS
>> >  #include
>> > @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>> >         if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> >                 exit_to_usermode_loop(regs, cached_flags);
>> >
>> > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > +               switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>>         exit_to_usermode_loop(regs, cached_flags);
>>         cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>>         clear_thread_flag(TIF_LOAD_FPU);
>>         switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
> if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>         clear_thread_flag(TIF_LOAD_FPU);
>         switch_fpu_return();
> }

True, but we've got that cached_flags variable already...

>> > +static inline void switch_fpu_finish(void)
>> >  {
>> > +       set_thread_flag(TIF_LOAD_FPU);
>> >  }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
>> would make sense to make your changes deeper into the FPU core
>> innards so that, for example, we'd have explicit functions that
>> cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.  Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the FPU that
we have.  Off the top of my head, we have:

1. In-CPU state is valid and in-memory state is invalid.  (This is
   the case when we're running user code, for example.)

2. In-memory state is valid.

2a. In-CPU state is *also* valid for some particular task.  In this
    state, no one may write to either FPU state copy for that task
    lest they get out of sync.

2b. In-CPU state is not valid.

Rik, your patches further complicate this by making it possible to
have the in-CPU state be valid for a non-current task even when
non-idle.

This is why I'm thinking that we should maybe revisit the API a bit
to make this clearer and to avoid scattering around all the state
manipulation quite so much.  I propose, as a straw man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
current.  After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
writable for current.  Invalidates any in-memory copy.  After calling
this, we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
for current.  After calling this, we're in state 2a or 2b.
user_fpregs_move_to_memory() -- Makes the in-memory state be valid
and writable for current.  After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the CPU state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers.  MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().

There are a couple of ways we could deal with TIF_LOAD_FPU in this
scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu().  This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date.  It also means we get TIF_REFRESH_FPU for free for
newly forked threads.

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both.  I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
>    if necessary
>
> 3) cause FPU state in memory to be updated from registers,
>    if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before returning to userspace.
>> > + */
>> > +void switch_fpu_return(void)
>> > +{
>> > +       struct fpu *fpu = &current->thread.fpu;
>> > +       bool preload;
>> > +       /*
>> > +        * If the task has used the math, pre-load the FPU on xsave
>> > +        * processors or if the past 5 consecutive context-switches
>> > +        * used math.
>> > +        */
>> > +       preload = static_cpu_has(X86_FEATURE_FPU) &&
>> > +                 fpu->fpstate_active &&
>> > +                 (use_eager_fpu() || fpu->counter > 5);
>> > +
>> > +       if (preload) {
>> > +               prefetch(&fpu->state);
>> > +               fpu->counter++;
>> > +               __fpregs_activate(fpu);
>> > +               trace_x86_fpu_regs_activated(fpu);
>> > +
>> > +               /* Don't change CR0.TS if we just switch! */
>> > +               if (!__this_cpu_read(fpu_active)) {
>> > +                       __fpregs_activate_hw();
>> > +                       __this_cpu_write(fpu_active, true);
>> > +               }
>>
>> We should just finish getting rid of all TS uses.
>
> Agreed.  I will rebase on top of your FPU changes, that
> will make things easier for everybody.
>
> --
> All Rights Reversed.

--
Andy Lutomirski
AMA Capital Management, LLC