From: Willy Tarreau <w@1wt.eu>
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: tglx@linutronix.de, gnomes@lxorguk.ukuu.org.uk,
        torvalds@linux-foundation.org, Willy Tarreau <w@1wt.eu>
Subject: [PATCH RFC 0/4] Per-task PTI activation
Date: Mon,  8 Jan 2018 17:12:15 +0100
Message-Id: <1515427939-10999-1-git-send-email-w@1wt.eu>
Sender: linux-kernel-owner@vger.kernel.org

Hi!

I experimented a bit with the possibility of enabling/disabling PTI per
task. Please keep in mind that this is not my area of expertise at all, but
by doing so I could recover the initial performance without disabling PTI on
the whole system.

What I did in this series consists of the following:
  - addition of a new per-task TIF_NOPTI flag. Please note that I'm not
    proud of the way I did it, as 32 flags were already taken. The flags
    are declared as "long" so there are 32 more flags available on x86_64
    but C and asm disagree on the type of 1<<32 so I had to declare the
    hex value by hand... By the way I even suspect that _TIF_FSCHECK is
    wrong once cast to a long, I think it causes sign extension into the
    32 upper bits since it's supposed to be signed.

  - addition of a set of arch_prctl() calls (ARCH_GET_NOPTI and
    ARCH_SET_NOPTI), to check and change the activation of the
    protection. The change requires CAP_SYS_RAWIO and can be done in
    a wrapper (that's how I tested).

  - the user PGD was marked with _PAGE_NX so that an accidental leak of
    CR3 would not go undetected. I obviously had to disable this, since
    in this case we do want such a user task to run without switching the
    PGD. I think this could perhaps be done per task. Another approach
    might consist in dealing with 3 PGDs and using a different one for
    unprotected tasks, but that really starts to sound like overkill.

  - upon return to userspace, I check if the task's flags contain the
    new TIF_NOPTI or not. If it does contain it, then we don't switch
    the CR3.

  - upon entry into the kernel from userspace, we can't access the task's
    flags but we can already check if CR3 points to the kernel or user PGD,
    and we refrain from switching if it's already the system one.

By doing so I could recover the initial performance of haproxy in a VM,
going from 12400 connections per second to 21000 once started with this
trivial wrapper:

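  #include <asm/prctl.h>
  #include <stdio.h>
  #include <sys/syscall.h>
  #include <unistd.h>
  
  #ifndef ARCH_SET_NOPTI
  #define ARCH_SET_NOPTI 0x1022
  #endif
  
  int main(int argc, char **argv)
  {
          if (argc < 2)
                  return 1;
          /* glibc ships no prototype for arch_prctl(), so use syscall() */
          if (syscall(SYS_arch_prctl, ARCH_SET_NOPTI, 1UL) < 0)
                  perror("arch_prctl(ARCH_SET_NOPTI)");
          argv++;
          execvp(argv[0], argv);
          perror("execvp");       /* only reached if the exec failed */
          return 1;
  }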

I have not yet run it on real hardware. Before trying to go a bit further
I'd like to know if such an approach is acceptable or if I'm doing anything
stupid and looking in the wrong direction.

Thanks!
Willy


Willy Tarreau (4):
  x86/thread_info: add TIF_NOPTI to disable PTI per task
  x86/arch_prctl: add ARCH_GET_NOPTI and ARCH_SET_NOPTI to
    enable/disable PTI
  x86/pti: don't mark the user PGD with _PAGE_NX.
  x86/entry/pti: don't switch PGD on tasks holding flag TIF_NOPTI

 arch/x86/entry/calling.h           | 23 +++++++++++++++++++++++
 arch/x86/include/asm/thread_info.h |  8 ++++++++
 arch/x86/include/uapi/asm/prctl.h  |  3 +++
 arch/x86/kernel/process_64.c       | 24 ++++++++++++++++++++++++
 arch/x86/mm/pti.c                  |  2 ++
 5 files changed, 60 insertions(+)

-- 
1.7.12.1