2022-12-22 01:45:48

by Jacky Li

Subject: [PATCH] x86/mm/cpa: get rid of the cpa lock

This RFC is to solicit feedback on how to remove/disable the CPA lock
for modern x86 CPUs. We suspect it can be removed for older x86 CPUs
as well per the third bullet in our full reasoning below. However,
offlist discussion at LPC suggested that doing so could be too risky
because it is hard to test these changes on very old CPUs.

The cpa_lock was introduced in commit ad5ca55f6bdb ("x86, cpa: srlz
cpa(), global flush tlb after splitting big page and before doing cpa")
to solve a race condition where one cpu is splitting a large
page entry along with changing the attribute, while another cpu with
stale large tlb entries is also changing the page attributes.

There are 3 reasons to remove/modify this cpa_lock today.

First, this cpa_lock is inefficient because it’s a global spin lock.
It only protects the race condition when multiple threads are
modifying the same large page entry while preventing all
parallelization when threads are updating different 4K page entries,
which is much more common.

Second, as stated in arch/x86/include/asm/set_memory.h,
"the API does not provide exclusion between various callers -
including callers that operation on other mappings of the same
physical page."
the caller should handle the race condition where two threads are
modifying the same page entry. The API should only handle it when this
race condition can crash the kernel, which might have been true back
in 2008 because the commit cover letter mentioned
"If the two translations differ with respect to page frame or
attributes (e.g., permissions), processor behavior is
undefined and may be implementation specific. The processor
may use a page frame or attributes that correspond to neither
translation;"
However it’s no longer true today per Intel's spec [1]:
"the TLBs may subsequently contain multiple translations for
the address range (one for each page size). A reference to a
linear address in the address range may use any of these
translations."

Third, even though it’s possible in old hardware that this race
condition can crash the kernel, this specific race condition that
cpa_lock was trying to protect when introduced in 2008 has already
been protected by pgd_lock today, thanks to the commit c0a759abf5a6
("x86/mm/cpa: Move flush_tlb_all()") in 2018 that moves the
flush_tlb_all() from outside pgd_lock to inside. Therefore today when
one cpu is splitting the large page and changing attributes, the other
cpu will need to wait until the global tlb flush is done and pgd_lock
gets released, and after that there won’t be stale large tlb entries
to change within this cpu. (I did a talk in LPC [2] that has a pseudo
code explaining why the race condition is protected by pgd_lock today)

It’s true that with such old code, the cpa_lock might protect more
race conditions than those that it was introduced to protect in 2008,
or some old hardware may depend on the cpa_lock for undocumented
behavior. So removing the lock directly might not be a good idea, but
it probably should not mean that we need to keep the inefficient code
forever. I would appreciate any suggestion to navigate this lock
removal from the folks on the to and cc list.

[1] Intel® 64 and IA-32 Architectures Software Developer’s Manual,
Volume 3A: System Programming Guide, Part 1, Section 4.10.2.
[2] https://youtu.be/LFJQ1PGGF7Q?t=330

Signed-off-by: Jacky Li <[email protected]>
---
arch/x86/mm/pat/set_memory.c | 18 +-----------------
1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 356758b7d4b4..84ad8198830f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -62,14 +62,6 @@ enum cpa_warn {
 
 static const int cpa_warn_level = CPA_PROTECT;
 
-/*
- * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
- * using cpa_lock. So that we don't allow any other cpu, with stale large tlb
- * entries change the page attribute in parallel to some other cpu
- * splitting a large page entry along with changing the attribute.
- */
-static DEFINE_SPINLOCK(cpa_lock);
-
 #define CPA_FLUSHTLB 1
 #define CPA_ARRAY 2
 #define CPA_PAGES_ARRAY 4
@@ -1127,7 +1119,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * (e.g., permissions), processor behavior is undefined and may
 	 * be implementation-specific."
 	 *
-	 * We do this global tlb flush inside the cpa_lock, so that we
+	 * We do this global tlb flush inside the pgd_lock, so that we
 	 * don't allow any other cpu, with stale tlb entries change the
 	 * page attribute in parallel, that also falls into the
 	 * just split large page entry.
@@ -1143,11 +1135,7 @@ static int split_large_page(struct cpa_data *cpa, pte_t *kpte,
 {
 	struct page *base;
 
-	if (!debug_pagealloc_enabled())
-		spin_unlock(&cpa_lock);
 	base = alloc_pages(GFP_KERNEL, 0);
-	if (!debug_pagealloc_enabled())
-		spin_lock(&cpa_lock);
 	if (!base)
 		return -ENOMEM;
 
@@ -1759,11 +1747,7 @@ static int __change_page_attr_set_clr(struct cpa_data *cpa, int primary)
 		if (cpa->flags & (CPA_ARRAY | CPA_PAGES_ARRAY))
 			cpa->numpages = 1;
 
-		if (!debug_pagealloc_enabled())
-			spin_lock(&cpa_lock);
 		ret = __change_page_attr(cpa, primary);
-		if (!debug_pagealloc_enabled())
-			spin_unlock(&cpa_lock);
 		if (ret)
 			goto out;
 
--
2.39.0.314.g84b9a713c41-goog


2023-01-06 11:15:34

by Ingo Molnar

Subject: Re: [PATCH] x86/mm/cpa: get rid of the cpa lock


* Jacky Li <[email protected]> wrote:

> It’s true that with such old code, the cpa_lock might protect more
> race conditions than those that it was introduced to protect in 2008,
> or some old hardware may depend on the cpa_lock for undocumented
> behavior. So removing the lock directly might not be a good idea, but
> it probably should not mean that we need to keep the inefficient code
> forever. I would appreciate any suggestion to navigate this lock
> removal from the folks on the to and cc list.

> -/*
> - * Serialize cpa() (for !DEBUG_PAGEALLOC which uses large identity mappings)
> - * using cpa_lock. So that we don't allow any other cpu, with stale large tlb
> - * entries change the page attribute in parallel to some other cpu
> - * splitting a large page entry along with changing the attribute.
> - */
> -static DEFINE_SPINLOCK(cpa_lock);

Yeah, so I'm *really* tempted to just remove cpa_lock if there's no in-code
documented uses of it - your patch provides *exhaustive* background.

The thing is, even in the worst-case if it breaks anything, it will get
investigated, documented better and maybe reverted - which would *still* be
an improvement over today, because we turn undocumented code into
documented code.

We cannot indefinitely keep a global lock just because we fear it might
have some undocumented dependencies...

But no strong feelings either way - I've added a few more Cc:s to discuss
this more widely.

Unless there are objections I'd be inclined to give this patch a try, and
keep an eye open for regressions, it's not difficult to revert either.

Thanks,

Ingo

2023-01-09 20:46:17

by Thomas Gleixner

Subject: Re: [PATCH] x86/mm/cpa: get rid of the cpa lock

Jacky!

On Thu, Dec 22 2022 at 01:33, Jacky Li wrote:
> This RFC is to solicit feedback on how to remove/disable the CPA lock
> for modern x86 CPUs. We suspect it can be removed for older x86 CPUs
> as well per the third bullet in our full reasoning below. However,
> offlist discussion at LPC suggested that doing so could be too risky
> because it is hard to test these changes on very old CPUs.

Definitely so.

> The cpa_lock was introduced in commit ad5ca55f6bdb ("x86, cpa: srlz
> cpa(), global flush tlb after splitting big page and before doing cpa")
> to solve a race condition where one cpu is splitting a large
> page entry along with changing the attribute, while another cpu with
> stale large tlb entries is also changing the page attributes.
>
> There are 3 reasons to remove/modify this cpa_lock today.
>
> First, this cpa_lock is inefficient because it’s a global spin lock.
> It only protects the race condition when multiple threads are
> modifying the same large page entry while preventing all
> parallelization when threads are updating different 4K page entries,
> which is much more common.

It does not matter whether a particular operation is common or not,
really. Either the lock is required for protection or not.

> Second, as stated in arch/x86/include/asm/set_memory.h,
> "the API does not provide exclusion between various callers -
> including callers that operation on other mappings of the same
> physical page."
>
> the caller should handle the race condition where two threads are
> modifying the same page entry.

The API deals with memory ranges, and of course the caller is responsible
for ensuring that no two concurrent calls change the same memory range.

But the caller is completely oblivious about large pages. That's an
internal implementation detail of the CPA code and that code is
responsible for serialization of large page splits and the resulting
subtleties.

Assume:
   BASEADDR is covered by a large TLB

   CPU 0                              CPU 1
   cpa(BASEADDR, PAGE_SIZE, protA);   cpa(BASEADDR+PAGE_SIZE, PAGE_SIZE, protB);

is completely correct from an API usage point of view, no?
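
To make the example above concrete, here is a minimal userspace sketch (not kernel code). It only checks the arithmetic: the two requested ranges do not overlap, yet both fall inside one large page. The 2 MiB large page size, the SKETCH_* constants and all function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096ULL               /* 4 KiB base page */
#define SKETCH_LARGE_SIZE  (2ULL * 1024 * 1024)  /* assumed 2 MiB large page */

/* Hypothetical stand-in for the (addr, size) pair a caller passes in. */
struct sketch_range {
	uint64_t start;
	uint64_t len;
};

/* True if the two half-open ranges [start, start + len) intersect. */
static bool ranges_overlap(struct sketch_range a, struct sketch_range b)
{
	return a.start < b.start + b.len && b.start < a.start + a.len;
}

/* True if both addresses are covered by the same large mapping. */
static bool same_large_page(uint64_t a, uint64_t b)
{
	return (a / SKETCH_LARGE_SIZE) == (b / SKETCH_LARGE_SIZE);
}

/* Mirrors the CPU 0 / CPU 1 calls above with an arbitrary aligned base. */
void cpa_example_check(void)
{
	uint64_t base = 4 * SKETCH_LARGE_SIZE;
	struct sketch_range cpu0 = { base, SKETCH_PAGE_SIZE };
	struct sketch_range cpu1 = { base + SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE };

	assert(!ranges_overlap(cpu0, cpu1));             /* valid API usage */
	assert(same_large_page(cpu0.start, cpu1.start)); /* same large page */
}
```

Both callers respect the "no concurrent calls on the same range" rule, so any serialization needed because they share a large page has to come from inside the CPA code.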

> The API should only handle it when this race condition can crash the
> kernel, which might have been true back in 2008 because the commit

Might have been true? The crashes were real.

> cover letter mentioned
> "If the two translations differ with respect to page frame or
> attributes (e.g., permissions), processor behavior is
> undefined and may be implementation specific. The processor
> may use a page frame or attributes that correspond to neither
> translation;"
> However it’s no longer true today per Intel's spec [1]:
> "the TLBs may subsequently contain multiple translations for
> the address range (one for each page size). A reference to a
> linear address in the address range may use any of these
> translations."

That's a partial quote. The full sentence is:

"If software modifies the paging structures so that the page size used
for a 4-KByte range of linear addresses changes, the TLBs may
subsequently contain multiple translations for the address range (one
for each page size). A reference to a linear address in the address
range may use any of these translations. Which translation is used may
vary from one execution to another, and the choice may be
implementation-specific."

The important part is the first part of the first sentence, which only
talks about changing the page size used, but does not talk about
changing attributes in a conflicting way. The latter is the real issue
which was addressed back then if my memory does not trick me.

It's still today a real issue with certain PAT combinations.

> Third, even though it’s possible in old hardware that this race
> condition can crash the kernel, this specific race condition that
> cpa_lock was trying to protect when introduced in 2008 has already
> been protected by pgd_lock today, thanks to the commit c0a759abf5a6
> ("x86/mm/cpa: Move flush_tlb_all()") in 2018 that moves the
> flush_tlb_all() from outside pgd_lock to inside. Therefore today when
> one cpu is splitting the large page and changing attributes, the other
> cpu will need to wait until the global tlb flush is done and pgd_lock
> gets released, and after that there won’t be stale large tlb entries
> to change within this cpu. (I did a talk in LPC [2] that has a pseudo
> code explaining why the race condition is protected by pgd_lock today)

A link to a video does not replace a coherent written explanation of why
a change is correct. Changelogs have to be self-contained and fully
explanatory.

I agree that there is no issue vs. two CPUs trying to split the same
large page concurrently. They are properly serialized by pgd_lock, but
there are other scenarios too:

   BASEADDR is covered by a large TLB

   CPU 0                                CPU 1

   cpa(BASEADDR, PAGE_SIZE, protA)

     observes large TLB
       split_large_page()
         spin_lock(pgd_lock);
         __set_pmd_pte(...);

                                        cpa(BASEADDR + PAGE_SIZE, PAGE_SIZE, protB)

                                          observes 4k PTE in lookup_addr_cpa()
                                          and proceeds
         flush_tlb_all();

Today this is fully serialized via cpa_lock and CPU1 cannot proceed
before the split is complete (including the flush), so this needs a
proper explanation too.
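
The interleaving above can be replayed in a small single-threaded toy model. This is only a sketch of one reading of the scenario: struct cpa_model and the model_* helpers are hypothetical, there is no real locking or hardware behavior in it, and "conflict" here simply means that a CPU still caches the large translation while the 4k PTE for the same address already carries a different attribute.

```c
#include <assert.h>
#include <stdbool.h>

enum { NCPUS = 2, PTES_PER_LARGE = 512 };

/* Toy state: one large mapping plus each CPU's cached view of it. */
struct cpa_model {
	bool split;                    /* page table now uses 4k PTEs */
	int  pte_prot[PTES_PER_LARGE]; /* attribute of each 4k PTE */
	bool tlb_large[NCPUS];         /* CPU still caches the large entry */
	int  tlb_large_prot[NCPUS];    /* attribute cached with that entry */
};

/* Start with an unsplit large page that every CPU has cached. */
static void model_init(struct cpa_model *m, int prot)
{
	m->split = false;
	for (int i = 0; i < PTES_PER_LARGE; i++)
		m->pte_prot[i] = prot;
	for (int c = 0; c < NCPUS; c++) {
		m->tlb_large[c] = true;
		m->tlb_large_prot[c] = prot;
	}
}

/* CPU 0: the __set_pmd_pte() step, before any flush has happened. */
static void model_split(struct cpa_model *m)
{
	m->split = true;
}

/* CPU 0: the flush_tlb_all() step drops every cached large entry. */
static void model_flush_tlb_all(struct cpa_model *m)
{
	for (int c = 0; c < NCPUS; c++)
		m->tlb_large[c] = false;
}

/* CPU 1: sees a 4k PTE and changes its attribute. */
static void model_set_4k_prot(struct cpa_model *m, int idx, int prot)
{
	assert(m->split && idx >= 0 && idx < PTES_PER_LARGE);
	m->pte_prot[idx] = prot;
}

/* Stale large entry with an attribute that differs from the 4k PTE? */
static bool model_conflict(const struct cpa_model *m, int cpu, int idx)
{
	return m->split && m->tlb_large[cpu] &&
	       m->tlb_large_prot[cpu] != m->pte_prot[idx];
}
```

Replaying split, then the 4k attribute change, then the flush makes model_conflict() true until the flush runs. With the order split, flush, change (which is what cpa_lock enforces today) it never becomes true in this model.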

Thanks,

tglx