Date: Fri, 22 Jul 2022 16:02:36 -0700
Message-Id: <20220722230241.1944655-1-avagin@google.com>
Subject: [PATCH 0/5] KVM/x86: add a new hypercall to execute host system
From: Andrei Vagin
To: Paolo Bonzini
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org, Andrei Vagin,
    Sean Christopherson, Wanpeng Li, Vitaly Kuznetsov, Jianfeng Tan,
    Adin Scannell, Konstantin Bogomolov, Etienne Perot
X-Mailing-List: linux-kernel@vger.kernel.org

There is a class of applications that use KVM to manage multiple address
spaces rather than use it as an isolation boundary.
In all other respects, they are normal processes that execute system
calls, handle signals, etc. Currently, each time such a process needs to
interact with the operating system, it has to switch to the host and
back to the guest. Such full switches are expensive and significantly
increase the overhead of system calls. The new hypercall reduces this
overhead by more than a factor of two.

The new hypercall runs system calls on the host. As with native system
calls, seccomp filters are executed before the system call. The
hypercall takes one argument, a pointer to a pt_regs structure in the
host address space, which provides the registers needed to execute a
system call according to the calling convention. Arguments are passed in
%rdi, %rsi, %rdx, %r10, %r8 and %r9, and the return code is stored in
%rax. The hypercall returns 0 if a system call has been executed.
Otherwise, it returns an error code.

This series introduces a new capability that has to be set to enable the
hypercall. The new hypercall is a backdoor for regular virtual machines,
so it is disabled by default. There is another standard way to allow
hypercalls, via cpuid. It has not been used because one of the common
ways to manage them is to request all available features and enable them
all together. In this case, it is a hard requirement that the new
hypercall can be enabled only intentionally.

= Background =

gVisor is one such application. It is an application kernel written in
Go that implements a substantial portion of the Linux system call
interface. gVisor intercepts application system calls and acts as the
guest kernel. It has a platform abstraction that implements interception
of syscalls, basic context switching, and memory mapping functionality.

Currently, it has two platforms: ptrace and KVM. The ptrace platform
uses PTRACE_SYSEMU to execute user code without allowing it to perform
host system calls, and it creates stub processes to manage user address
spaces.
This platform is primarily for testing because of its poor performance.

Another option is the KVM platform. In this case, the Sentry (the gVisor
kernel) can run in guest ring 0 and create and manage multiple address
spaces. Its performance is much better than that of the ptrace platform,
but it is still not great compared with native performance. This change
optimizes the most critical part, which is the syscall overhead.

The idea of using vmcall to execute system calls isn't new. Two large
users of gVisor (Google and Ant Financial) have out-of-tree code to
implement such hypercalls.

In the Google kernel, we have a kvm-like subsystem designed especially
for gVisor. This change is the first step of integrating it into the KVM
code base and making it available to all Linux users.

Cc: Paolo Bonzini
Cc: Sean Christopherson
Cc: Wanpeng Li
Cc: Vitaly Kuznetsov
Cc: Jianfeng Tan
Cc: Adin Scannell
Cc: Konstantin Bogomolov
Cc: Etienne Perot

Andrei Vagin (5):
  kernel: add a new helper to execute system calls from kernel code
  kvm: add controls to enable/disable paravirtualized system calls
  KVM/x86: add a new hypercall to execute host system calls.
  selftests/kvm/x86_64: set rax before vmcall
  selftests/kvm/x86_64: add tests for KVM_HC_HOST_SYSCALL

 Documentation/virt/kvm/x86/hypercalls.rst     |  15 ++
 arch/x86/entry/common.c                       |  48 ++++++
 arch/x86/include/asm/syscall.h                |   1 +
 arch/x86/include/uapi/asm/kvm_para.h          |   2 +
 arch/x86/kvm/cpuid.c                          |  25 +++
 arch/x86/kvm/cpuid.h                          |   8 +-
 arch/x86/kvm/x86.c                            |  37 +++++
 include/uapi/linux/kvm.h                      |   1 +
 include/uapi/linux/kvm_para.h                 |   1 +
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |   4 +
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +-
 .../kvm/x86_64/kvm_pv_syscall_test.c          | 145 ++++++++++++++++++
 14 files changed, 289 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_pv_syscall_test.c

-- 
2.37.0.rc0.161.g10f37bed90-goog
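
For illustration only, the register convention described in the cover
letter can be sketched in user space. This is not code from the series:
struct hc_regs, hc_prepare() and hc_emulate_host() are hypothetical
names, the struct is a reduced stand-in for pt_regs, and the sketch
assumes the system call number travels in rax as in the native x86-64
convention (the patches may use a different field). The "host side" is
emulated with syscall(2) instead of a real vmcall, so it only shows how
arguments and the return code map onto registers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical, reduced stand-in for the pt_regs fields named above. */
struct hc_regs {
	unsigned long rax;                        /* assumed: syscall number in, return code out */
	unsigned long rdi, rsi, rdx, r10, r8, r9; /* arguments 1..6 */
};

/* Fill the register block for one system call (illustrative helper). */
static void hc_prepare(struct hc_regs *r, long nr,
		       unsigned long a1, unsigned long a2, unsigned long a3,
		       unsigned long a4, unsigned long a5, unsigned long a6)
{
	assert(r != NULL);
	r->rax = (unsigned long)nr;
	r->rdi = a1;
	r->rsi = a2;
	r->rdx = a3;
	r->r10 = a4;
	r->r8 = a5;
	r->r9 = a6;
}

/*
 * User-space emulation of what the host would do with the register block:
 * run the system call and store its result in rax. The function itself
 * returns 0 to mirror "the hypercall returns 0 if a system call has been
 * executed". syscall(2) reports failure as -1 plus errno, so the result is
 * converted to the kernel-style negative errno value.
 */
static int hc_emulate_host(struct hc_regs *r)
{
	long ret = syscall((long)r->rax, r->rdi, r->rsi, r->rdx,
			   r->r10, r->r8, r->r9);

	r->rax = (unsigned long)(ret == -1 ? -(long)errno : ret);
	return 0;
}
```

With these helpers, preparing SYS_getpid and passing the block to
hc_emulate_host() leaves the caller's pid in rax, and a failing call
such as close(-1) leaves a negative errno value there. The two result
channels stay separate: the function's own return value says whether a
system call ran at all, while rax carries that system call's result.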