Date: Sat, 26 Feb 2022 00:15:18 +0000
Message-Id: <20220226001546.360188-1-seanjc@google.com>
Subject: [PATCH v3 00/28] KVM: x86/mmu: Overhaul TDP MMU zapping and flushing
From: Sean Christopherson
Reply-To: Sean Christopherson
To: Paolo Bonzini, Christian Borntraeger, Janosch Frank, Claudio Imbrenda
Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, David Hildenbrand, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, David Matlack, Ben Gardon, Mingwei Zhang
X-Mailing-List: linux-kernel@vger.kernel.org

Overhaul TDP MMU's handling of zapping and TLB flushing to reduce the
number of TLB flushes, fix soft lockups and RCU stalls, avoid blocking
vCPUs for long durations while zapping paging structures, and clean up
the zapping code.

The largest cleanup is to separate the flows for zapping roots (zap
_everything_), zapping leaf SPTEs (zap guest mappings for whatever reason),
and zapping a specific SP (NX recovery).  They're currently smushed into a
single zap_gfn_range(), which was a good idea at the time, but became a
mess when trying to handle the different rules, e.g. TLB flushes aren't
needed when zapping a root because KVM can safely zap a root if and only
if it's unreachable.

To solve the soft lockups, stalls, and vCPU performance issues:

 - Defer remote TLB flushes to the caller when zapping TDP MMU shadow
   pages by relying on RCU to ensure the paging structure isn't freed
   until all vCPUs have exited the guest.

 - Allow yielding when zapping TDP MMU roots in response to the root's
   last reference being put.  This requires a bit of trickery to ensure
   the root is reachable via mmu_notifier, but it's not too gross.

 - Zap roots in two passes to avoid holding RCU for potentially hundreds
   of seconds when zapping a guest with terabytes of memory that is
   backed entirely by 4KiB SPTEs.

 - Zap defunct roots asynchronously via the common work_queue so that a
   vCPU doesn't get stuck doing the work if the vCPU happens to drop the
   last reference to a root.

The selftest at the end allows populating a guest with the max amount of
memory allowed by the underlying architecture.  The most I've tested is
~64TiB (MAXPHYADDR=46) as I don't have easy access to a system with
MAXPHYADDR=52.  The selftest compiles on arm64 and s390x, but otherwise
hasn't been tested outside of x86-64.
The selftest will hopefully do something useful as is, but there's a
non-zero chance it won't get past init with a high max memory.  Running on
x86 without the TDP MMU is comically slow.

v3:
 - Drop patches that were applied.
 - Rebase to latest kvm/queue.
 - Collect a review. [David]
 - Use helper instead of goto to zap roots in two passes. [David]
 - Add patches to disallow REMOVED "old" SPTE when atomically
   setting SPTE.

v2:
 - https://lore.kernel.org/all/20211223222318.1039223-1-seanjc@google.com
 - Drop patches that were applied.
 - Collect reviews for patches that weren't modified. [Ben]
 - Abandon the idea of taking invalid roots off the list of roots.
 - Add a patch to fix misleading/wrong comments with respect to KVM's
   responsibilities in the "fast zap" flow, specifically that all SPTEs
   must be dropped before the zap completes.
 - Rework yielding in kvm_tdp_mmu_put_root() to keep the root visible
   while yielding.
 - Add patch to zap roots in two passes. [Mingwei, David]
 - Add a patch to asynchronously zap defunct roots.
 - Add the selftest.

v1: https://lore.kernel.org/all/20211120045046.3940942-1-seanjc@google.com

Sean Christopherson (28):
  KVM: x86/mmu: Use common iterator for walking invalid TDP MMU roots
  KVM: x86/mmu: Check for present SPTE when clearing dirty bit in TDP MMU
  KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap
  KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic
  KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush
  KVM: x86/mmu: Require mmu_lock be held for write in unyielding root iter
  KVM: x86/mmu: Check for !leaf=>leaf, not PFN change, in TDP MMU SP removal
  KVM: x86/mmu: Batch TLB flushes from TDP MMU for MMU notifier change_spte
  KVM: x86/mmu: Drop RCU after processing each root in MMU notifier hooks
  KVM: x86/mmu: Add helpers to read/write TDP MMU SPTEs and document RCU
  KVM: x86/mmu: WARN if old _or_ new SPTE is REMOVED in non-atomic path
  KVM: x86/mmu: Refactor low-level TDP MMU set SPTE helper to take raw vals
  KVM: x86/mmu: Zap only the target TDP MMU shadow page in NX recovery
  KVM: x86/mmu: Skip remote TLB flush when zapping all of TDP MMU
  KVM: x86/mmu: Add dedicated helper to zap TDP MMU root shadow page
  KVM: x86/mmu: Require mmu_lock be held for write to zap TDP MMU range
  KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()
  KVM: x86/mmu: Do remote TLB flush before dropping RCU in TDP MMU resched
  KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages
  KVM: x86/mmu: Allow yielding when zapping GFNs for defunct TDP MMU root
  KVM: x86/mmu: Zap roots in two passes to avoid inducing RCU stalls
  KVM: x86/mmu: Zap defunct roots via asynchronous worker
  KVM: x86/mmu: Check for a REMOVED leaf SPTE before making the SPTE
  KVM: x86/mmu: WARN on any attempt to atomically update REMOVED SPTE
  KVM: selftests: Move raw KVM_SET_USER_MEMORY_REGION helper to utils
  KVM: selftests: Split out helper to allocate guest mem via memfd
  KVM: selftests: Define cpu_relax() helpers for s390 and x86
  KVM: selftests: Add test to populate a VM with the max possible guest mem

 arch/x86/kvm/mmu/mmu.c                        |  42 +-
 arch/x86/kvm/mmu/mmu_internal.h               |  15 +-
 arch/x86/kvm/mmu/tdp_iter.c                   |   6 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |  15 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    | 595 ++++++++++++------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  26 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   3 +
 .../selftests/kvm/include/kvm_util_base.h     |   5 +
 .../selftests/kvm/include/s390x/processor.h   |   8 +
 .../selftests/kvm/include/x86_64/processor.h  |   5 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  66 +-
 .../selftests/kvm/max_guest_memory_test.c     | 292 +++++++++
 .../selftests/kvm/set_memory_region_test.c    |  35 +-
 14 files changed, 832 insertions(+), 282 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/max_guest_memory_test.c


base-commit: 625e7ef7da1a4addd8db41c2504fe8a25b93acd5
-- 
2.35.1.574.g5d30c73bfb-goog