Date: Thu, 22 Dec 2022 01:33:30 +0000
Message-ID: <20221222013330.831474-1-jackyli@google.com>
Subject: [PATCH] x86/mm/cpa: get rid of the cpa lock
From: Jacky Li
To: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov, "H. Peter Anvin"
Cc: Marc Orr, Alper Gun, x86@kernel.org, linux-kernel@vger.kernel.org,
    Jacky Li
X-Mailing-List: linux-kernel@vger.kernel.org

This RFC is to solicit feedback on how to remove/disable the CPA lock
for modern x86 CPUs.
We suspect it can be removed for older x86 CPUs as well, per the third
bullet in the full reasoning below. However, offlist discussion at LPC
suggested that doing so could be too risky, because these changes are
hard to test on very old CPUs.

The cpa_lock was introduced in commit ad5ca55f6bdb ("x86, cpa: srlz
cpa(), global flush tlb after splitting big page and before doing cpa")
to solve a race condition in which one CPU splits a large page entry
while changing its attributes, while another CPU with stale large TLB
entries is also changing page attributes.

There are three reasons to remove or modify the cpa_lock today.

First, the cpa_lock is inefficient because it is a global spin lock. It
only protects against the race when multiple threads modify the same
large page entry, yet it prevents all parallelism even when threads are
updating different 4K page entries, which is the far more common case.

Second, as stated in arch/x86/include/asm/set_memory.h, "the API does
not provide exclusion between various callers - including callers that
operation on other mappings of the same physical page." That is, the
caller is expected to handle the race where two threads modify the same
page entry; the API should only handle it when the race can crash the
kernel. That might have been true back in 2008, because the commit
cover letter quoted: "If the two translations differ with respect to
page frame or attributes (e.g., permissions), processor behavior is
undefined and may be implementation specific. The processor may use a
page frame or attributes that correspond to neither translation."
However, it is no longer true today per Intel's spec [1]: "the TLBs may
subsequently contain multiple translations for the address range (one
for each page size). A reference to a linear address in the address
range may use any of these translations."
Third, even if this race condition can crash the kernel on old
hardware, the specific race that the cpa_lock was introduced to protect
against in 2008 has already been protected by pgd_lock since commit
c0a759abf5a6 ("x86/mm/cpa: Move flush_tlb_all()") in 2018, which moved
the flush_tlb_all() from outside pgd_lock to inside it. Today, when one
CPU is splitting a large page and changing attributes, another CPU must
wait until the global TLB flush is done and pgd_lock is released; after
that there are no stale large TLB entries left to change on that CPU.
(I gave a talk at LPC [2] with pseudo code explaining why this race
condition is protected by pgd_lock today.)

It is true that with code this old, the cpa_lock might protect more
race conditions than the one it was introduced for in 2008, or some old
hardware may depend on the cpa_lock for undocumented behavior. So
removing the lock outright might not be a good idea, but that probably
should not mean we keep the inefficient code forever. I would
appreciate any suggestions on how to navigate this lock removal from
the folks on the To and Cc lists.

[1] Intel® 64 and IA-32 Architectures Software Developer's Manual,
    Volume 3A: System Programming Guide, Part 1, Section 4.10.2.
[2] https://youtu.be/LFJQ1PGGF7Q?t=330

Signed-off-by: Jacky Li
---
 arch/x86/mm/pat/set_memory.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 356758b7d4b4..84ad8198830f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -62,14 +62,6 @@ enum cpa_warn {
 
 static const int cpa_warn_level = CPA_PROTECT;
 
-/*
- * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
- * using cpa_lock. So that we don't allow any other cpu, with stale large tlb
- * entries change the page attribute in parallel to some other cpu
- * splitting a large page entry along with changing the attribute.
- */
-static DEFINE_SPINLOCK(cpa_lock);
-
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
 #define CPA_PAGES_ARRAY 4
@@ -1127,7 +1119,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * (e.g., permissions), processor behavior is undefined and may
 	 * be implementation-specific."
 	 *
-	 * We do this global tlb flush inside the cpa_lock, so that we
+	 * We do this global tlb flush inside the pgd_lock, so that we
 	 * don't allow any other cpu, with stale tlb entries change the
 	 * page attribute in parallel, that also falls into the
 	 * just split large page entry.
@@ -1143,11 +1135,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
 {
 	struct page *base;
 
-	if (!debug_pagealloc_enabled())
-		spin_unlock(&cpa_lock);
 	base = alloc_pages(GFP_KERNEL, 0);
-	if (!debug_pagealloc_enabled())
-		spin_lock(&cpa_lock);
 	if (!base)
 		return -ENOMEM;
 
@@ -1759,11 +1747,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int primary)
 	if (cpa->flags & (CPA_ARRAY | CPA_PAGES_ARRAY))
 		cpa->numpages = 1;
 
-	if (!debug_pagealloc_enabled())
-		spin_lock(&cpa_lock);
 	ret = __change_page_attr(cpa, primary);
-	if (!debug_pagealloc_enabled())
-		spin_unlock(&cpa_lock);
 	if (ret)
 		goto out;
 
-- 
2.39.0.314.g84b9a713c41-goog