From: Andrei Vagin
Date: Tue, 26 Jul 2022 01:33:27 -0700
Subject: Re: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system calls
To: Sean Christopherson
Cc: Paolo Bonzini, linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
    Wanpeng Li, Vitaly Kuznetsov, Jianfeng Tan, Adin Scannell,
    Konstantin Bogomolov, Etienne Perot, Andy Lutomirski, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org,
    "H. Peter Anvin"

On Fri, Jul 22, 2022 at 4:41 PM Sean Christopherson wrote:
>
> +x86 maintainers, patch 1 most definitely needs acceptance from folks beyond KVM.
>
> On Fri, Jul 22, 2022, Andrei Vagin wrote:
> > Another option is the KVM platform. In this case, the Sentry (gVisor
> > kernel) can run in a guest ring0 and create/manage multiple address
> > spaces. Its performance is much better than the ptrace one, but it is
> > still not great compared with the native performance. This change
> > optimizes the most critical part, which is the syscall overhead.
>
> What exactly is the source of the syscall overhead,

Here are perf traces for two cases: when "guest" syscalls are executed via
hypercalls and when syscalls are executed by the user-space VMM:
https://gist.github.com/avagin/f50a6d569440c9ae382281448c187f4e

And here are the two tests that I use to collect these traces:
https://github.com/avagin/linux-task-diag/commit/4e19c7007bec6a15645025c337f2e85689b81f99

If we compare these traces, we can see that in the second case we spend
extra time in vmx_prepare_switch_to_guest, fpu_swap_kvm_fpstate, vcpu_put,
and syscall_exit_to_user_mode.

> and what alternatives have been explored? Making arbitrary syscalls from
> within KVM is mildly terrifying.

"Mildly terrifying" is a fair description :). If I were in your place, I
would look at it the same way.

I understand the concerns about calling syscalls from KVM code, and this is
why the feature is hidden behind a separate capability that has to be
enabled explicitly.

We can also think about restricting the set of system calls that this
hypercall is allowed to execute. In the user-space changes for gVisor, we
already have a list of system calls that are never executed via this
hypercall. For example, sigprocmask is never executed through it, because
the KVM vCPU has its own signal mask. Another example is the ioctl syscall,
because it could be one of the kvm ioctls.

As for alternatives, we have explored several approaches:

== Host Ring3/Guest ring0 mixed mode ==

This is how the gVisor KVM platform works right now. We don't have a
separate hypervisor; the Sentry takes over its functions. The Sentry
creates a KVM virtual machine instance, sets it up, and handles VMEXITs.
As a result, the Sentry runs in host ring3 and guest ring0 and can
transparently switch between these two contexts.

When the Sentry starts, it creates a new KVM instance and maps its memory
into guest physical memory. Then it builds a set of page tables for the
Sentry that mirror the host virtual address space. Because the host and
guest address spaces are identical, the Sentry can switch between these two
contexts.

The bluepill function switches the Sentry into guest ring0. It executes a
privileged instruction (CLI) that is a no-op in the guest (by design, since
we ensure interrupts are disabled for guest ring0 execution) and triggers a
signal on the host. The signal is handled by bluepillHandler, which takes a
virtual CPU and runs it with the current thread state grabbed from the
signal frame.
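Very roughly, in C instead of Go, the idea looks like this (vcpu_fd setup,
signal registration, sregs/FPU transfer, and error handling are all
omitted; the names are illustrative, not gVisor's actual code):

/*
 * Rough sketch only: the real implementation is Go code in
 * pkg/sentry/platform/kvm and transfers much more state.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <ucontext.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int vcpu_fd;                       /* KVM vCPU file descriptor */

static void bluepill_handler(int sig, siginfo_t *info, void *ucontext)
{
        ucontext_t *uc = ucontext;
        greg_t *g = uc->uc_mcontext.gregs;
        struct kvm_regs regs = {
                .rip    = g[REG_RIP],     /* the faulting CLI; legal in guest ring0 */
                .rsp    = g[REG_RSP],
                .rflags = g[REG_EFL],
                .rax    = g[REG_RAX], .rbx = g[REG_RBX],
                .rcx    = g[REG_RCX], .rdx = g[REG_RDX],
                .rsi    = g[REG_RSI], .rdi = g[REG_RDI],
                .rbp    = g[REG_RBP],
        };

        /* Load the interrupted thread state and run it in guest ring0. */
        ioctl(vcpu_fd, KVM_SET_REGS, &regs);
        ioctl(vcpu_fd, KVM_RUN, 0);
        /* On VMEXIT: copy the vCPU state back into *uc and return, so the
         * thread continues on the host from where it stopped in the guest. */
}

static void bluepill(void)
{
        /* No-op in guest ring0; raises #GP -> SIGSEGV in host ring3. */
        asm volatile("cli");
}

Entering the guest this way costs a signal delivery plus a KVM_RUN ioctl on
every switch, which is one of the downsides listed below.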
As in regular VMs, user processes have their own address spaces (page
tables) and run in guest ring3. So when the Sentry is about to execute a
user process, it needs to be sure that it is running inside the VM, and
that is exactly the point where it calls bluepill(). Then it runs the user
process with its page tables until the process triggers an exception or a
system call. All such events are trapped and handled in the Sentry.

The Sentry itself is a normal Linux process that can trigger faults and
execute system calls. To handle these events, the Sentry has to return to
host mode. When the ring0 sysenter or exception entry points detect an
event that comes from the Sentry, they save the current thread state in a
per-CPU structure and trigger a VMEXIT. This brings us back into
bluepillHandler, where we put the thread state into the signal frame and
return from the signal handler, so the Sentry resumes at the point where it
was in the VM.

In this scheme, the Sentry syscall time is 3600ns. This is for the case
when a system call is issued from guest ring0.

The benefit of this approach is that only the first system call triggers a
vmexit and all subsequent syscalls are executed on the host natively. But
it has downsides:
* Each Sentry system call triggers a full exit to host ring3.
* Each vmenter/vmexit requires triggering a signal, which is expensive.
* It doesn't allow supporting Confidential Computing (SEV-ES/SGX). The
  Sentry has to be fully enclosed in a VM to support these technologies.

== Execute system calls from a user-space VMM ==

In this case, the Sentry always runs inside the VM, and the syscall handler
in guest ring0 triggers a vmexit to transfer control to the VMM (a user
process running in host ring3); the VMM executes the required system call
and transfers control back to the Sentry. In effect, this implements the
suggested hypercall in user space. The Sentry syscall time is 2100ns in
this case. A rough sketch of this flow is below.
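This is the VMM side of that flow, assuming the guest puts the request into
a shared page and forces the exit with an OUT to a magic port. The port
number, the request layout, and the exit mechanism are assumptions for the
sketch, not what gVisor actually does:

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/kvm.h>

#define SYSCALL_PORT 0x1234               /* hypothetical "do a syscall" port */

struct syscall_req {                      /* lives in guest/host shared memory */
        uint64_t nr;
        uint64_t args[6];
        int64_t  ret;
};

static void vmm_run_loop(int vcpu_fd, struct kvm_run *run,
                         struct syscall_req *req)
{
        for (;;) {
                ioctl(vcpu_fd, KVM_RUN, 0);

                if (run->exit_reason == KVM_EXIT_IO &&
                    run->io.port == SYSCALL_PORT) {
                        /* Back in host ring3: this round trip is the
                         * expensive part. */
                        req->ret = syscall(req->nr,
                                           req->args[0], req->args[1],
                                           req->args[2], req->args[3],
                                           req->args[4], req->args[5]);
                        continue;         /* re-enter the guest after the OUT */
                }
                /* handle other exit reasons ... */
                break;
        }
}

Every Sentry syscall in this scheme pays for the full round trip out to
host ring3 and back; the proposed hypercall executes the syscall inside KVM
on the VMEXIT path instead, which is where the difference between 2100ns
and 1000ns comes from.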
The new hypercall does the same but without switching to host ring3. It
reduces the Sentry syscall time to 1000ns.

== A new BPF hook to handle vmexits ==

https://github.com/avagin/linux-task-diag/commits/kvm-bpf

This approach allows us to reach the same performance numbers, but it gives
less control over who uses this functionality and how. Second, it requires
adding a few questionable BPF helpers, such as one to call syscalls from
BPF hooks.

== Non-KVM platforms ==

We are experimenting with non-KVM platforms. We have the ptrace platform,
but it is mostly for experiments due to the slowness of the ptrace
interface.

Another idea was to add the process_vm_exec system call:
https://lwn.net/Articles/852662/

This system call can significantly increase performance compared with the
ptrace platform, but it is still slower than the KVM platform in its
current form (without the new hypercall). And this is true only if the KVM
platform runs on bare metal; with nested virtualization, the KVM platform
becomes much slower, which is expected.

We have another idea, to use seccomp notify to trap system calls, but it
requires some kernel changes to reach reasonable performance. I am working
on those changes and will present them soon.

I want to emphasize that non-KVM platforms don't allow us to implement the
confidential computing concept in gVisor, and that is one of our main goals
for the KVM platform.

All the numbers above were collected on the same host (Xeon(R) Gold 6268CL,
5.19-rc5).

> > The idea of using vmcall to execute system calls isn't new. Two large users
> > of gVisor (Google and AntFinancial) have out-of-tree code to implement such
> > hypercalls.
> >
> > In the Google kernel, we have a kvm-like subsystem designed especially
> > for gVisor. This change is the first step of integrating it into the KVM
> > code base and making it available to all Linux users.
>
> Can you please lay out the complete set of changes that you will be proposing?
> Doesn't have to be gory details, but at a minimum there needs to be a high level
> description that very clearly defines the scope of what changes you want to make
> and what the end result will look like.
>
> It's practically impossible to review this series without first understanding the
> bigger picture, e.g. if KVM_HC_HOST_SYSCALL is ultimately useless without the other
> bits you plan to upstream, then merging it without a high level of confidence that
> the other bits are acceptable is a bad idea since it commits KVM to supporting
> unused ABI.

I was not precise in my description. This is the only change that we need
right now. The gVisor KVM platform is a real thing that exists today and
works on upstream kernels:
https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/platform/kvm/

This hypercall improves its performance and makes it comparable with the
Google-internal platform.

Thanks,
Andrei