2024-03-07 13:42:55

by Yosry Ahmed

Subject: [RFC PATCH 0/3] x86/mm: LAM fixups and cleanups

This series has a few fixups and cleanups for LAM code noticed through
code inspection. They are marked as RFC because I am not very familiar
with this code so I may have gotten something (or all of it) wrong.

Yosry Ahmed (3):
x86/mm: fix LAM cr3 mask inconsistency during context switch
x86/mm: make sure LAM is up-to-date during context switching
x86/mm: cleanup prctl_enable_tagged_addr() nr_bits error checking

arch/x86/include/asm/tlbflush.h | 11 ++++--
arch/x86/kernel/process_64.c | 10 ++---
arch/x86/mm/tlb.c | 67 +++++++++++++++++++--------------
3 files changed, 48 insertions(+), 40 deletions(-)

--
2.44.0.278.ge034bb2e1d-goog



2024-03-07 13:43:18

by Yosry Ahmed

Subject: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

During context switching, if we are not switching to a new mm and no TLB
flush is needed, we do not write CR3. However, it is possible that a
user thread enables LAM while a kthread is running on a different CPU
with the old LAM CR3 mask. If the kthread context switches into any
thread of that user process, it may not write CR3 with the new LAM mask,
which would cause the user thread to run with a misconfigured CR3 that
disables LAM on the CPU.

Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
problems were observed in practice; this was found by code inspection.

Note that it is possible that mm->context.lam_cr3_mask changes throughout
switch_mm_irqs_off(). But since LAM can only be enabled by a
single-threaded process on its own behalf, in that case we cannot be
switching to a user thread in that same process; we can only be
switching to another kthread using the borrowed mm or to a different user
process, which should be fine.

Fixes: 82721d8b25d7 ("x86/mm: Handle LAM on context switch")
Signed-off-by: Yosry Ahmed <[email protected]>
---
arch/x86/mm/tlb.c | 50 ++++++++++++++++++++++++++++-------------------
1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2975d3f89a5de..3610c23499085 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -503,11 +503,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
{
struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ u64 cpu_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
+ bool need_flush = false, need_lam_update = false;
unsigned cpu = smp_processor_id();
unsigned long new_lam;
u64 next_tlb_gen;
- bool need_flush;
u16 new_asid;

/* We don't want flush_tlb_func() to run concurrently with us. */
@@ -570,32 +571,41 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
!cpumask_test_cpu(cpu, mm_cpumask(next))))
cpumask_set_cpu(cpu, mm_cpumask(next));

+ /*
+ * tlbstate_lam_cr3_mask() may be outdated if a different thread
+ * has enabled LAM while we were borrowing its mm on this CPU.
+ * Make sure we update CR3 in case we are switching to another
+ * thread in that process.
+ */
+ if (tlbstate_lam_cr3_mask() != mm_lam_cr3_mask(next))
+ need_lam_update = true;
+
/*
* If the CPU is not in lazy TLB mode, we are just switching
* from one thread in a process to another thread in the same
* process. No TLB flush required.
*/
- if (!was_lazy)
- return;
+ if (was_lazy) {
+ /*
+ * Read the tlb_gen to check whether a flush is needed.
+ * If the TLB is up to date, just use it. The barrier
+ * synchronizes with the tlb_gen increment in the TLB
+ * shootdown code.
+ */
+ smp_mb();
+ next_tlb_gen = atomic64_read(&next->context.tlb_gen);
+ if (cpu_tlb_gen < next_tlb_gen) {
+ /*
+ * TLB contents went out of date while we were
+ * in lazy mode.
+ */
+ new_asid = prev_asid;
+ need_flush = true;
+ }
+ }

- /*
- * Read the tlb_gen to check whether a flush is needed.
- * If the TLB is up to date, just use it.
- * The barrier synchronizes with the tlb_gen increment in
- * the TLB shootdown code.
- */
- smp_mb();
- next_tlb_gen = atomic64_read(&next->context.tlb_gen);
- if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
- next_tlb_gen)
+ if (!need_flush && !need_lam_update)
return;
-
- /*
- * TLB contents went out of date while we were in lazy
- * mode. Fall through to the TLB switching code below.
- */
- new_asid = prev_asid;
- need_flush = true;
} else {
/*
* Apply process to process speculation vulnerability
--
2.44.0.278.ge034bb2e1d-goog


2024-03-07 13:43:41

by Yosry Ahmed

Subject: [RFC PATCH 3/3] x86/mm: cleanup prctl_enable_tagged_addr() nr_bits error checking

In prctl_enable_tagged_addr(), we check that nr_bits is in the correct
range, but we do so in a twisted if/else block where the correct case is
sandwiched between two error cases doing exactly the same thing.

Simplify the if condition and pull the correct case outside with the
rest of the success code path.

Signed-off-by: Yosry Ahmed <[email protected]>
---
arch/x86/kernel/process_64.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 33b268747bb7b..3f381906bbe1d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -771,17 +771,13 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
return -EBUSY;
}

- if (!nr_bits) {
- mmap_write_unlock(mm);
- return -EINVAL;
- } else if (nr_bits <= LAM_U57_BITS) {
- mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
- mm->context.untag_mask = ~GENMASK(62, 57);
- } else {
+ if (!nr_bits || nr_bits > LAM_U57_BITS) {
mmap_write_unlock(mm);
return -EINVAL;
}

+ mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
+ mm->context.untag_mask = ~GENMASK(62, 57);
write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
set_tlbstate_lam_mode(mm);
set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
--
2.44.0.278.ge034bb2e1d-goog


2024-03-07 13:43:48

by Yosry Ahmed

Subject: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
'new_lam', which is later passed to load_new_mm_cr3(). However, there is
a call to set_tlbstate_lam_mode() in between which will read
'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
If we race with another thread updating 'mm->context.lam_cr3_mask', the
value in 'cpu_tlbstate.lam' could end up being different from CR3.

Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
mask that was set to 'cpu_tlbstate.lam', and use that mask in
switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
read the mask once and use it consistenly.

No practical problems have been observed from this, but it's a recipe
for future problems (e.g. debug warnings in switch_mm_irqs_off() or
__get_current_cr3_fast() could fire). It is unlikely that this can cause
any real issues since only a single-threaded process can update its own
LAM mask, so the race here could happen when context switching between
kthreads using a borrowed MM. In this case, it's unlikely that LAM is
important. If it is, then we would need to ensure all CPUs using the mm
are updated before returning to userspace when LAM is enabled -- but we
don't do that.

While we are at it, remove the misguiding comment that states that
'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs. This
can happen without a race: a different thread could have just enabled
LAM since the last context switch on the current CPU. Replace it with a
hopefully clearer comment above set_tlbstate_lam_mode().

Signed-off-by: Yosry Ahmed <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 11 +++++++----
arch/x86/mm/tlb.c | 17 ++++++++---------
2 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 25726893c6f4d..a4ddb20f84fe7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -399,11 +399,13 @@ static inline u64 tlbstate_lam_cr3_mask(void)
return lam << X86_CR3_LAM_U57_BIT;
}

-static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
- this_cpu_write(cpu_tlbstate.lam,
- mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
+ unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
+
+ this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
+ return lam;
}

#else
@@ -413,8 +415,9 @@ static inline u64 tlbstate_lam_cr3_mask(void)
return 0;
}

-static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
+static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
+ return 0;
}
#endif
#endif /* !MODULE */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 51f9f56941058..2975d3f89a5de 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -503,9 +503,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
{
struct mm_struct *prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
- unsigned long new_lam = mm_lam_cr3_mask(next);
bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
+ unsigned long new_lam;
u64 next_tlb_gen;
bool need_flush;
u16 new_asid;
@@ -561,11 +561,6 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) !=
next->context.ctx_id);

- /*
- * If this races with another thread that enables lam, 'new_lam'
- * might not match tlbstate_lam_cr3_mask().
- */
-
/*
* Even in lazy TLB mode, the CPU should stay set in the
* mm_cpumask. The TLB shootdown code can figure out from
@@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
barrier();
}

- set_tlbstate_lam_mode(next);
+ /*
+ * Even if we are not actually switching mm's, another thread could have
+ * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
+ * and the loaded CR3 use the up-to-date mask.
+ */
+ new_lam = set_tlbstate_lam_mode(next);
if (need_flush) {
this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
@@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)

/* LAM expected to be disabled */
WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
- WARN_ON(mm_lam_cr3_mask(mm));

/*
* Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
@@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
this_cpu_write(cpu_tlbstate.next_asid, 1);
this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
- set_tlbstate_lam_mode(mm);
+ WARN_ON(set_tlbstate_lam_mode(mm));

for (i = 1; i < TLB_NR_DYN_ASIDS; i++)
this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
--
2.44.0.278.ge034bb2e1d-goog


2024-03-07 15:29:58

by Dave Hansen

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 05:39, Yosry Ahmed wrote:
> - /*
> - * Read the tlb_gen to check whether a flush is needed.
> - * If the TLB is up to date, just use it.
> - * The barrier synchronizes with the tlb_gen increment in
> - * the TLB shootdown code.
> - */
> - smp_mb();
> - next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> - next_tlb_gen)
> + if (!need_flush && !need_lam_update)
> return;

Instead of all this new complexity, why not just inc_mm_tlb_gen() at the
site where LAM is enabled? That should signal to any CPU thread that
its TLB is out of date and it needs to do a full CR3 reload.
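
For illustration, a minimal sketch of that idea (not part of the posted
series, and assuming inc_mm_tlb_gen() can safely be called at that point
under the mmap_write_lock already held there) would be a one-liner in
prctl_enable_tagged_addr():

	mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
	mm->context.untag_mask = ~GENMASK(62, 57);

	/*
	 * Hypothetical addition: bump the mm's TLB generation so that any
	 * CPU lazily running with this mm sees a stale tlb_gen on its next
	 * context switch and does a full CR3 reload, picking up the new
	 * LAM bits.
	 */
	inc_mm_tlb_gen(mm);

	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
	set_tlbstate_lam_mode(mm);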

Also, have you been able to actually catch this scenario in practice?
Considering how fun this code path is, a little effort at an actual
reproduction would be really appreciated.

2024-03-07 17:22:59

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024 at 01:39:14PM +0000, Yosry Ahmed wrote:
> In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> a call to set_tlbstate_lam_mode() in between which will read
> 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> If we race with another thread updating 'mm->context.lam_cr3_mask', the
> value in 'cpu_tlbstate.lam' could end up being different from CR3.

What other thread? LAM can only be enabled when the process has single
thread. And cannot be disabled. See MM_CONTEXT_LOCK_LAM.

> While we are at it, remove the misguiding comment that states that
> 'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs.

The comment is indeed misguiding, but for different reason. It is leftover
from the earlier version of LAM patchset.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-07 17:32:24

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 3/3] x86/mm: cleanup prctl_enable_tagged_addr() nr_bits error checking

On Thu, Mar 07, 2024 at 01:39:16PM +0000, Yosry Ahmed wrote:
> In prctl_enable_tagged_addr(), we check that nr_bits is in the correct
> range, but we do so in a twisted if/else block where the correct case is
> sandwiched between two error cases doing exactly the same thing.
>
> Simplify the if condition and pull the correct case outside with the
> rest of the success code path.

I'm okay either way.

I structured the code this way as I had separate patch that adds also
LAM_U48. But it is unlikely to get upstreamed.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-07 17:37:11

by Dave Hansen

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

I know we all have different rules, but any time you could spend absorbing:

https://www.kernel.org/doc/html/next/process/maintainer-tip.html

would be appreciated, especially:

> The condensed patch description in the subject line should start with
> a uppercase letter and should be written in imperative tone.


On 3/7/24 05:39, Yosry Ahmed wrote:
> In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> a call to set_tlbstate_lam_mode() in between which will read
> 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> If we race with another thread updating 'mm->context.lam_cr3_mask', the
> value in 'cpu_tlbstate.lam' could end up being different from CR3.

Your description is fine (modulo the we's). But I slightly reworded it
to make it more plainly readable:

LAM can only be enabled when a process is single-threaded. But _kernel_
threads can temporarily use a single-threaded process's mm. That means
that a context-switching kernel thread can race and observe the mm's LAM
metadata (mm->context.lam_cr3_mask) change.

The context switch code does two logical things with that metadata:
populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
'cpu_tlbstate.lam' and CR3 can end up out of sync.

This de-synchronization is currently harmless. But it is confusing and
might lead to warnings or real bugs.

--

> Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
> mask that was set to 'cpu_tlbstate.lam', and use that mask in
> switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
> read the mask once and use it consistenly.

Spell checking is also appreciated.

..
> -static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
> +static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
> - this_cpu_write(cpu_tlbstate.lam,
> - mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
> + unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> +
> + this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
> this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
> + return lam;
> }

The comments about races need to be _here_ so that the purpose of the
READ_ONCE() is clear.

It would also be nice to call out the rule that this can only
meaningfully be called once per context switch.
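
For instance, folding both of those points into the hunk above might look
something like this (the comment wording is illustrative, not from the
posted patch; the body is unchanged):

static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
	/*
	 * mm->context.lam_cr3_mask can change concurrently, e.g. while a
	 * kthread borrowing this mm is being switched in, so read it once
	 * and return the value that was written to cpu_tlbstate.lam so the
	 * caller can keep CR3 consistent with it.  Only meaningful to call
	 * once per context switch.
	 */
	unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);

	this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
	this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
	return lam;
}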

> @@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> barrier();
> }
>
> - set_tlbstate_lam_mode(next);
> + /*
> + * Even if we are not actually switching mm's, another thread could have
> + * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
> + * and the loaded CR3 use the up-to-date mask.
> + */

I kinda dislike how the comment talks about the details of what
set_tlbstate_lam_mode() does. It would be much better to put the meat
of this comment at the set_tlbstate_lam_mode() definition.

> + new_lam = set_tlbstate_lam_mode(next);
> if (need_flush) {
> this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);

This is less a complaint about your change and more of the existing
code, but I wish it was more obvious that set_tlbstate_lam_mode() is
logically shuffling data (once) from 'next' into the tlbstate.

The naming makes it sound like it is modifying the tlbstate of 'next'.

But I don't have any particularly brilliant ideas to fix it either.
Maybe just:

/* new_lam is effectively cpu_tlbstate.lam */

> @@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)
>
> /* LAM expected to be disabled */
> WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> - WARN_ON(mm_lam_cr3_mask(mm));
>
> /*
> * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
> @@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
> this_cpu_write(cpu_tlbstate.next_asid, 1);
> this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> - set_tlbstate_lam_mode(mm);
> + WARN_ON(set_tlbstate_lam_mode(mm));

The "set_" naming bugs me in both of the sites that get modified here.
I'd be fine with a new name that fits better, if we can think of one.


2024-03-07 17:56:14

by Dave Hansen

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 09:29, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
>> During context switching, if we are not switching to new mm and no TLB
>> flush is needed, we do not write CR3. However, it is possible that a
>> user thread enables LAM while a kthread is running on a different CPU
>> with the old LAM CR3 mask. If the kthread context switches into any
>> thread of that user process, it may not write CR3 with the new LAM mask,
>> which would cause the user thread to run with a misconfigured CR3 that
>> disables LAM on the CPU.
> I don't think it is possible. As I said we can only enable LAM when the
> process has single thread. If it enables LAM concurrently with kernel
> thread and kernel thread gets control on the same CPU after the userspace
> thread of the same process LAM is already going to be enabled. No need in
> special handling.

I think it's something logically like this:

// main thread
                                        kthread_use_mm()
                                        cr3 |= mm->lam_cr3_mask;
mm->lam_cr3_mask = foo;
                                        cpu_tlbstate.lam = mm->lam_cr3_mask;

Obviously the kthread's LAM state is going to be random. It's
fundamentally racing with the enabling thread. That part is fine.

The main pickle is the fact that CR3 and cpu_tlbstate.lam are out of
sync. That seems worth fixing.

Or is there something else that keeps this whole thing from racing in
the first place?

2024-03-07 18:50:05

by Sean Christopherson

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024, Dave Hansen wrote:
> I know we all have different rules, but any time you could spend absorbing:
>
> https://www.kernel.org/doc/html/next/process/maintainer-tip.html
>
> would be appreciated, especially:
>
> > The condensed patch description in the subject line should start with
> > a uppercase letter and should be written in imperative tone.

And while you're at it,

https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html

which my crystal ball says will be helpful in your future. ;-)

2024-03-07 20:29:02

by Yosry Ahmed

Subject: Re: [RFC PATCH 3/3] x86/mm: cleanup prctl_enable_tagged_addr() nr_bits error checking

On Thu, Mar 07, 2024 at 07:31:44PM +0200, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 01:39:16PM +0000, Yosry Ahmed wrote:
> > In prctl_enable_tagged_addr(), we check that nr_bits is in the correct
> > range, but we do so in a twisted if/else block where the correct case is
> > sandwiched between two error cases doing exactly the same thing.
> >
> > Simplify the if condition and pull the correct case outside with the
> > rest of the success code path.
>
> I'm okay either way.
>
> I structured the code this way as I had separate patch that adds also
> LAM_U48. But it is unlikely to get upstreamed.

I see, thanks for the context. For now, I think this makes the code a
little bit clearer.

2024-03-07 20:42:44

by Yosry Ahmed

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024 at 09:36:58AM -0800, Dave Hansen wrote:
> I know we all have different rules, but any time you could spend absorbing:
>
> https://www.kernel.org/doc/html/next/process/maintainer-tip.html

Thanks for the quick review and tips.

I didn't know this existed, I will take a look before respinning.

>
> would be appreciated, especially:
>
> > The condensed patch description in the subject line should start with
> > a uppercase letter and should be written in imperative tone.
>
>
> On 3/7/24 05:39, Yosry Ahmed wrote:
> > In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> > 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> > a call to set_tlbstate_lam_mode() in between which will read
> > 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> > If we race with another thread updating 'mm->context.lam_cr3_mask', the
> > value in 'cpu_tlbstate.lam' could end up being different from CR3.
>
> Your description is fine (modulo the we's). But I slightly reworded it
> to make it more plainly readable:
>
> LAM can only be enabled when a process is single-threaded. But _kernel_
> threads can temporarily use a single-threaded process's mm. That means
> that a context-switching kernel thread can race and observe the mm's LAM
> metadata (mm->context.lam_cr3_mask) change.
>
> The context switch code does two logical things with that metadata:
> populate CR3 and populate 'cpu_tlbstate.lam'. If it hits this race,
> 'cpu_tlbstate.lam' and CR3 can end up out of sync.
>
> This de-synchronization is currently harmless. But it is confusing and
> might lead to warnings or real bugs.

Thanks a lot! I will adopt your version moving forward :)

>
> --
>
> > Fix the problem by updating set_tlbstate_lam_mode() to return the LAM
> > mask that was set to 'cpu_tlbstate.lam', and use that mask in
> > switch_mm_irqs_off() when writing CR3. Use READ_ONCE to make sure we
> > read the mask once and use it consistenly.
>
> Spell checking is also appreciated.

I did run checkpatch. Did it miss something?

>
> ...
> > -static inline void set_tlbstate_lam_mode(struct mm_struct *mm)
> > +static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> > {
> > - this_cpu_write(cpu_tlbstate.lam,
> > - mm->context.lam_cr3_mask >> X86_CR3_LAM_U57_BIT);
> > + unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
> > +
> > + this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
> > this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
> > + return lam;
> > }
>
> The comments about races need to be _here_ so that the purpose of the
> READ_ONCE() is clear.
>
> It would also be nice to call out the rule that this can only
> meaningfully be called once per context switch.

I wanted the comments in switch_mm_irqs_off() where the races actually
matter, but I guess I can make the comment more generic and specify that
the return value is used to write CR3, so that the READ_ONCE() keeps CR3
and tlbstate.lam consistent.

>
> > @@ -633,7 +628,12 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> > barrier();
> > }
> >
> > - set_tlbstate_lam_mode(next);
> > + /*
> > + * Even if we are not actually switching mm's, another thread could have
> > + * updated mm->context.lam_cr3_mask. Make sure tlbstate_lam_cr3_mask()
> > + * and the loaded CR3 use the up-to-date mask.
> > + */
>
> I kinda dislike how the comment talks about the details of what
> set_tlbstate_lam_mode() does. It would be much better to put the meat
> of this comment at the set_tlbstate_lam_mode() definition.

Agreed. I will move most comments to set_tlbstate_lam_mode().

>
> > + new_lam = set_tlbstate_lam_mode(next);
> > if (need_flush) {
> > this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
> > this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
>
> This is less a complaint about your change and more of the existing
> code, but I wish it was more obvious that set_tlbstate_lam_mode() is
> logically shuffling data (once) from 'next' into the tlbstate.
>
> The naming makes it sound like it is modifying the tlbstate of 'next'.

We can update the function name to make it more verbose, maybe something
like update_cpu_tlbstate_lam()? We can also try to put "return"
somewhere in the name to imply that it returns the LAM mask it sets, but
I can't make that look pretty.

>
> But I don't have any particularly brilliant ideas to fix it either.
> Maybe just:
>
> /* new_lam is effectively cpu_tlbstate.lam */
>
> > @@ -705,7 +705,6 @@ void initialize_tlbstate_and_flush(void)
> >
> > /* LAM expected to be disabled */
> > WARN_ON(cr3 & (X86_CR3_LAM_U48 | X86_CR3_LAM_U57));
> > - WARN_ON(mm_lam_cr3_mask(mm));
> >
> > /*
> > * Assert that CR4.PCIDE is set if needed. (CR4.PCIDE initialization
> > @@ -724,7 +723,7 @@ void initialize_tlbstate_and_flush(void)
> > this_cpu_write(cpu_tlbstate.next_asid, 1);
> > this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, mm->context.ctx_id);
> > this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, tlb_gen);
> > - set_tlbstate_lam_mode(mm);
> > + WARN_ON(set_tlbstate_lam_mode(mm));
>
> The "set_" naming bugs me in both of the sites that get modified here.
> I'd be with a new name that fits better, if we can think of one.

Is it because it's not clear we are updating cpu_tlbstate (in which case
I think update_cpu_tlbstate_lam() is an improvement), or is it because
the function returns a value now? If the latter we can put "return" in
the name somewhere, or keep the function void and pass in an output
parameter.

2024-03-07 20:44:29

by Yosry Ahmed

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024 at 10:49:51AM -0800, Sean Christopherson wrote:
> On Thu, Mar 07, 2024, Dave Hansen wrote:
> > I know we all have different rules, but any time you could spend absorbing:
> >
> > https://www.kernel.org/doc/html/next/process/maintainer-tip.html
> >
> > would be appreciated, especially:
> >
> > > The condensed patch description in the subject line should start with
> > > a uppercase letter and should be written in imperative tone.
>
> And while you're at it,
>
> https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html
>
> which my crystal ball says will be helpful in your future. ;-)

I am sincerely hoping that there are no contradictions between different
maintainer docs, otherwise my brain will break really fast.


2024-03-07 20:51:48

by Yosry Ahmed

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024 at 07:22:36PM +0200, Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2024 at 01:39:14PM +0000, Yosry Ahmed wrote:
> > In switch_mm_irqs_off(), we read the 'mm->context.lam_cr3_mask' into
> > 'new_lam', which is later passed to load_new_mm_cr3(). However, there is
> > a call to set_tlbstate_lam_mode() in between which will read
> > 'mm->context.lam_cr3_mask' again and set 'cpu_tlbstate.lam' accordingly.
> > If we race with another thread updating 'mm->context.lam_cr3_mask', the
> > value in 'cpu_tlbstate.lam' could end up being different from CR3.
>
> What other thread? LAM can only be enabled when the process has single
> thread. And cannot be disabled. See MM_CONTEXT_LOCK_LAM.

Right, but a kthread may run with that single-threaded process's mm
IIUC. I think this can happen via kthread_use_mm() or if we context
switch directly from the user process to the kthread (context_switch()
doesn't seem to update the mm in this case).

>
> > While we are at it, remove the misguiding comment that states that
> > 'new_lam' may not match tlbstate_lam_cr3_mask() if a race occurs.
>
> The comment is indeed misguiding, but for different reason. It is leftover
> from the earlier version of LAM patchset.

2024-03-07 21:06:49

by Kirill A. Shutemov

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
> During context switching, if we are not switching to new mm and no TLB
> flush is needed, we do not write CR3. However, it is possible that a
> user thread enables LAM while a kthread is running on a different CPU
> with the old LAM CR3 mask. If the kthread context switches into any
> thread of that user process, it may not write CR3 with the new LAM mask,
> which would cause the user thread to run with a misconfigured CR3 that
> disables LAM on the CPU.

I don't think it is possible. As I said we can only enable LAM when the
process has single thread. If it enables LAM concurrently with kernel
thread and kernel thread gets control on the same CPU after the userspace
thread of the same process LAM is already going to be enabled. No need in
special handling.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-07 21:08:48

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 09:56:07AM -0800, Dave Hansen wrote:
> On 3/7/24 09:29, Kirill A. Shutemov wrote:
> > On Thu, Mar 07, 2024 at 01:39:15PM +0000, Yosry Ahmed wrote:
> >> During context switching, if we are not switching to new mm and no TLB
> >> flush is needed, we do not write CR3. However, it is possible that a
> >> user thread enables LAM while a kthread is running on a different CPU
> >> with the old LAM CR3 mask. If the kthread context switches into any
> >> thread of that user process, it may not write CR3 with the new LAM mask,
> >> which would cause the user thread to run with a misconfigured CR3 that
> >> disables LAM on the CPU.
> > I don't think it is possible. As I said we can only enable LAM when the
> > process has single thread. If it enables LAM concurrently with kernel
> > thread and kernel thread gets control on the same CPU after the userspace
> > thread of the same process LAM is already going to be enabled. No need in
> > special handling.
>
> I think it's something logically like this:
>
> // main thread
> kthread_use_mm()
> cr3 |= mm->lam_cr3_mask;
> mm->lam_cr3_mask = foo;
> cpu_tlbstate.lam = mm->lam_cr3_mask;

IIUC it doesn't have to be through kthread_use_mm(). If we context
switch directly from the user thread to a kthread, the kthread will keep
using the user thread's mm AFAICT.

>
> Obviously the kthread's LAM state is going to be random. It's
> fundamentally racing with the enabling thread. That part is fine.
>
> The main pickle is the fact that CR3 and cpu_tlbstate.lam are out of
> sync. That seems worth fixing.

That's what is fixed by patch 1, specifically a race between
switch_mm_irqs_off() and LAM being enabled. This patch is fixing a
different problem:

CPU 1                                   CPU 2
/* user thread running */
context_switch() /* to kthread */
                                        /* user thread enables LAM */
                                        context_switch()
context_switch() /* to user thread */

In this case, there are no races, but the second context switch on CPU 1
may not write CR3 (if TLB is up-to-date), in which case we will run the
user thread with CR3 having the wrong LAM mask. This could cause bigger
problems, right?

>
> Or is there something else that keeps this whole thing from racing in
> the first place?

+1 that would be good to know, but I didn't find anything.

2024-03-07 21:40:04

by Dave Hansen

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 13:04, Yosry Ahmed wrote:
> I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> that we correctly handle LAM updates. We can certainly add a comment,
> but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
>
> WDYT?

The mm generations are literally there so that if the mm changes, all
the CPUs know they need an update. Changing LAM enabling is 100%
consistent with telling other CPUs that they need an update.

I'd be curious if Andy feels differently though.

>> Considering how fun this code path is, a little effort at an actual
>> reproduction would be really appreciated.
>
> I tried reproducing it but gave up quickly. We need a certain sequence
> of events to happen:
>
> CPU 1                                 CPU 2
> kthread_use_mm()
>                                       /* user thread enables LAM */
>                                       context_switch()
> context_switch() /* to user thread */

First, it would be fine to either create a new kthread for reproduction
purposes or to hack an existing one. For instance, have the LAM
prctl() take an extra ref on the mm and stick it in a global variable:

mmgrab(current->mm);
global_mm = current->mm;

Then in the kthread, grab the mm and use it:

while (!global_mm);
kthread_use_mm(global_mm);
... check for the race
mmdrop(global_mm);

You can also hackily wait for the thread to move with a stupid spin loop:

while (smp_processor_id() != 1);

and then actually move it with sched_setaffinity() from userspace. That
can make it easier to get that series of events to happen in lockstep.
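
For the userspace half, a hypothetical helper to move the test thread
between CPUs (illustrative only, error handling omitted) could be as
simple as:

#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to 'cpu' (pid 0 == calling thread). */
static void move_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

Calling move_to_cpu(1) from the test program then lets a kernel-side
spin loop like the one above make progress.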

2024-03-07 21:48:40

by Dave Hansen

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 13:08, Yosry Ahmed wrote:
> CPU 1 CPU 2
> /* user thread running */
> context_switch() /* to kthread */
> /* user thread enables LAM */
> context_switch()
> context_switch() /* to user thread */
>
> In this case, there are no races, but the second context switch on CPU 1
> may not write CR3 (if TLB is up-to-date), in which case we will run the
> user thread with CR3 having the wrong LAM mask. This could cause bigger
> problems, right?

Yes, but this is precisely the kind of thing that we should solve with
mm generations. Its sole purpose is to thwart the optimization where
mm_prev==mm_next and we might not rewrite CR3.

2024-03-07 22:13:14

by Sean Christopherson

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On Thu, Mar 07, 2024, Yosry Ahmed wrote:
> On Thu, Mar 07, 2024 at 10:49:51AM -0800, Sean Christopherson wrote:
> > On Thu, Mar 07, 2024, Dave Hansen wrote:
> > > I know we all have different rules, but any time you could spend absorbing:
> > >
> > > https://www.kernel.org/doc/html/next/process/maintainer-tip.html
> > >
> > > would be appreciated, especially:
> > >
> > > > The condensed patch description in the subject line should start with
> > > > a uppercase letter and should be written in imperative tone.
> >
> > And while you're at it,
> >
> > https://www.kernel.org/doc/html/next/process/maintainer-kvm-x86.html
> >
> > which my crystal ball says will be helpful in your future. ;-)
>
> I am sincerely hoping that there are no contradictions between different
> maintainer docs, otherwise my brain will break really fast.

Heh, as long as you don't contribute to net/, you'll be fine. kvm-x86 mostly just
follows tip, with a few minor exceptions. But the exceptions are explicitly
documented in kvm-x86, and I am intentionally forgiving/lax on them, because yeah,
getting pulled in multiple directions is no fun.

2024-03-07 22:29:40

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 01:39:53PM -0800, Dave Hansen wrote:
> On 3/7/24 13:04, Yosry Ahmed wrote:
> > I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> > hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> > that we correctly handle LAM updates. We can certainly add a comment,
> > but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
> >
> > WDYT?
>
> The mm generations are literally there so that if the mm changes that
> all the CPUs know they need an update. Changing LAM enabling is 100%
> consistent with telling other CPUs that they need an update.
>
> I'd be curious of Andy feels differently though.

The mm generations are TLB-specific and all the code using them implies
as much (e.g. look at the checks in switch_mm_irqs_off() when prev ==
next). We can go around and update comments and/or function names to
make them more generic, but this seems excessive. If we don't, the code
becomes less clear imo.

I agree that the use case here is essentially the same (let other
CPUs know they need to write CR3), but I still think that since the LAM
case is just a simple one-time enablement, an explicit check in
switch_mm_irqs_off() would be clearer.

Just my 2c, let me know what you prefer :)

>
> >> Considering how fun this code path is, a little effort at an actual
> >> reproduction would be really appreciated.
> >
> > I tried reproducing it but gave up quickly. We need a certain sequence
> > of events to happen:
> >
> > CPU 1 CPU 2
> > kthread_use_mm()
> > /* user thread enables LAM */
> > context_switch()
> > context_switch() /* to user thread */
>
> First, it would be fine to either create a new kthread for reproduction
> purposes or to hack an existing one. For instance, have have the LAM
> prctl() take an extra ref on the mm and stick it in a global variable:
>
> mmgrab(current->mm);
> global_mm = current->mm;
>
> Then in the kthread, grab the mm and use it:
>
> while (!global_mm);
> kthread_use_mm(global_mm);
> ... check for the race
> mmdrop(global_mm);
>
> You can also hackily wait for thread to move with a stupid spin loop:
>
> while (smp_processor_id() != 1);
>
> and then actually move it with sched_setaffinity() from userspace. That
> can make it easier to get that series of events to happen in lockstep.

I will take a stab at doing something similar and let you know, thanks.

2024-03-07 22:30:21

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 01:48:28PM -0800, Dave Hansen wrote:
> On 3/7/24 13:08, Yosry Ahmed wrote:
> > CPU 1 CPU 2
> > /* user thread running */
> > context_switch() /* to kthread */
> > /* user thread enables LAM */
> > context_switch()
> > context_switch() /* to user thread */
> >
> > In this case, there are no races, but the second context switch on CPU 1
> > may not write CR3 (if TLB is up-to-date), in which case we will run the
> > user thread with CR3 having the wrong LAM mask. This could cause bigger
> > problems, right?
>
> Yes, but this is precisely the kind of thing that we should solve with
> mm generations. Its sole purpose is to thwart optimization where the
> mm_prev==mm_next and might not rewrite CR3.

I think it's clearer to have an explicit check for outdated LAM in
switch_mm_irqs_off(), as I mentioned in my other reply.

2024-03-07 22:42:06

by Dave Hansen

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 14:29, Yosry Ahmed wrote:
> Just my 2c, let me know what you prefer :)

I'll let Andy weigh in here, but I think I kinda already mentioned at
least twice what I prefer. :)

2024-03-07 22:45:22

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 7, 2024 at 2:41 PM Dave Hansen <[email protected]> wrote:
>
> On 3/7/24 14:29, Yosry Ahmed wrote:
> > Just my 2c, let me know what you prefer :)
>
> I'll let Andy weigh in here, but I think I kinda already mentioned at
> least twice what I prefer. :)

Well I was hoping my last email changed your mind, but obviously not :)

2024-03-07 23:22:16

by Yosry Ahmed

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

> > The "set_" naming bugs me in both of the sites that get modified here.
> > I'd be with a new name that fits better, if we can think of one.
>
> Is it because it's not clear we are updating cpu_tlbstate (in which case
> I think update_cpu_tlbstate_lam() is an improvement), or is it because
> the function returns a value now? If the latter we can put "return" in
> the name somewhere, or keep the function void and pass in an output
> parameter.

Or we can avoid returning a value from the helper and avoid passing an
mm. The callers would be more verbose: they'll have to call
mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
advantage of this is that we can move the READ_ONCE to
switch_mm_irqs_off() and keep the comment here.

WDYT?

2024-03-07 23:33:05

by Dave Hansen

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

On 3/7/24 15:21, Yosry Ahmed wrote:
>>> The "set_" naming bugs me in both of the sites that get modified here.
>>> I'd be with a new name that fits better, if we can think of one.
>> Is it because it's not clear we are updating cpu_tlbstate (in which case
>> I think update_cpu_tlbstate_lam() is an improvement), or is it because
>> the function returns a value now?

That's part of it.

>> If the latter we can put "return" in the name somewhere, or keep
>> the function void and pass in an output parameter.
No, adding a "_return" won't make it better.

> Or we can avoid returning a value from the helper and avoid passing an
> mm. The callers would be more verbose, they'll have to call
> mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
> helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
> advantage of this is that we can move the READ_ONCE to
> switch_mm_irqs_off() and keep the comment here.


One thing I don't like about set_tlbstate_lam_mode() is that it's not
obvious that it's writing to "cpu_tlbstate" and it's right smack in the
middle of a bunch of other writers to the same thing.

But I'm also not sure that open-coding it at its three call sites makes
things better overall.

I honestly don't have any brilliant suggestions.

2024-03-07 23:37:51

by Yosry Ahmed

Subject: Re: [RFC PATCH 1/3] x86/mm: fix LAM cr3 mask inconsistency during context switch

> > Or we can avoid returning a value from the helper and avoid passing an
> > mm. The callers would be more verbose, they'll have to call
> > mm_lam_cr3_mask() and mm_untag_mask() and pass the results into the
> > helper (set_tlbstate_lam_mode() or update_cpu_tlbstate_lam()). Another
> > advantage of this is that we can move the READ_ONCE to
> > switch_mm_irqs_off() and keep the comment here.
>
>
> One thing I don't like about set_tlbstate_lam_mode() is that it's not
> obvious that it's writing to "cpu_tlbstate" and its right smack in the
> middle of a bunch of other writers to the same thing.
>
> But I'm also not sure that open-coding it at its three call sites makes
> things better overall.
>
> I honestly don't have any brilliant suggestions.

Let me ponder this a little bit and try to come up with something; I
think a mix of renaming and open-coding could make an improvement.

Thanks!

2024-03-08 01:26:25

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 10:29:29PM +0000, Yosry Ahmed wrote:
> On Thu, Mar 07, 2024 at 01:39:53PM -0800, Dave Hansen wrote:
> > On 3/7/24 13:04, Yosry Ahmed wrote:
> > > I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
> > > hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
> > > that we correctly handle LAM updates. We can certainly add a comment,
> > > but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
> > >
> > > WDYT?
> >
> > The mm generations are literally there so that if the mm changes that
> > all the CPUs know they need an update. Changing LAM enabling is 100%
> > consistent with telling other CPUs that they need an update.
> >
> > I'd be curious of Andy feels differently though.
>
> The mm generations are TLB-specific and all the code using them implies
> as such (e.g. look at the checks in switch_mm_irqs_off() when prev ==
> next). We can go around and update comments and/or function names to
> make them more generic, but this seems excessive. If we don't, the code
> becomes less clear imo.
>
> I agree that the use case here is essentially the same (let other
> CPUs know they need to write CR3), but I still think that since the LAM
> case is just a simple one-time enablement, an explicit check in
> switch_mm_irqs_off() would be clearer.
>
> Just my 2c, let me know what you prefer :)
>
> >
> > >> Considering how fun this code path is, a little effort at an actual
> > >> reproduction would be really appreciated.
> > >
> > > I tried reproducing it but gave up quickly. We need a certain sequence
> > > of events to happen:
> > >
> > > CPU 1 CPU 2
> > > kthread_use_mm()
> > > /* user thread enables LAM */
> > > context_switch()
> > > context_switch() /* to user thread */
> >
> > First, it would be fine to either create a new kthread for reproduction
> > purposes or to hack an existing one. For instance, have have the LAM
> > prctl() take an extra ref on the mm and stick it in a global variable:
> >
> > mmgrab(current->mm);
> > global_mm = current->mm;
> >
> > Then in the kthread, grab the mm and use it:
> >
> > while (!global_mm);
> > kthread_use_mm(global_mm);
> > ... check for the race
> > mmdrop(global_mm);
> >
> > You can also hackily wait for thread to move with a stupid spin loop:
> >
> > while (smp_processor_id() != 1);
> >
> > and then actually move it with sched_setaffinity() from userspace. That
> > can make it easier to get that series of events to happen in lockstep.
>
> I will take a stab at doing something similar and let you know, thanks.

I came up with a kernel patch that I *think* may reproduce the problem
with enough iterations. Userspace only needs to enable LAM, so I think
the selftest can be enough to trigger it.

However, there is no hardware with LAM at my disposal, and IIUC I cannot
use QEMU without KVM to run a kernel with LAM. I was planning to do more
testing before sending a non-RFC version, but apparently I cannot do
any testing beyond building at this point (including reproducing) :/

Let me know how you want to proceed. I can send a non-RFC v1 based on
the feedback I got on the RFC, but it will only be build tested.

For the record, here is the diff that I *think* may reproduce the bug:

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 33b268747bb7b..c37a8c26a3c21 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)

#define LAM_U57_BITS 6

+static int kthread_fn(void *_mm)
+{
+ struct mm_struct *mm = _mm;
+
+ /*
+ * Wait for LAM to be enabled then schedule. Hopefully we will context
+ * switch directly into the task that enabled LAM due to CPU pinning.
+ */
+ kthread_use_mm(mm);
+ while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags));
+ schedule();
+ return 0;
+}
+
static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
{
+ struct task_struct *kthread_task;
+ int kthread_cpu;
+
if (!cpu_feature_enabled(X86_FEATURE_LAM))
return -ENODEV;

@@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
return -EINVAL;
}

+ /* Pin the task to the current CPU */
+ set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
+
+ /* Run a kthread on another CPU and wait for it to start */
+ kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false),
+ kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread");
+ while (!task_is_running(kthread_task));
+
write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
set_tlbstate_lam_mode(mm);
set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);

+ /* Move the task to the kthread CPU */
+ set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu));
+
mmap_write_unlock(mm);

return 0;
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 51f9f56941058..3afb53f1a1901 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
next_tlb_gen)
- return;
+ BUG_ON(new_lam != tlbstate_lam_cr3_mask());

/*
* TLB contents went out of date while we were in lazy


2024-03-08 01:35:45

by Andy Lutomirski

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

Catching up a bit...

On Thu, Mar 7, 2024, at 5:39 AM, Yosry Ahmed wrote:
> During context switching, if we are not switching to new mm and no TLB
> flush is needed, we do not write CR3. However, it is possible that a
> user thread enables LAM while a kthread is running on a different CPU
> with the old LAM CR3 mask. If the kthread context switches into any
> thread of that user process, it may not write CR3 with the new LAM mask,
> which would cause the user thread to run with a misconfigured CR3 that
> disables LAM on the CPU.

So I think (off the top of my head -- haven't thought about it all that hard) that LAM is logically like PCE and LDT: it's a property of an mm that is only rarely changed, and it doesn't really belong as part of the tlb_gen mechanism. And, critically, it's not worth the effort and complexity to try to optimize LAM changes when we have a lazy CPU (just like PCE and LDT) (whereas TLB flushes are performance critical and are absolutely worth optimizing).

So...

>
> Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
> problems were observed in practice, this was found by code inspection.

I think it should be fixed with a much bigger hammer: explicit IPIs. Just don't ever let it get out of date, like install_ldt().
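
A rough sketch of that approach, modelled on how install_ldt() broadcasts
with on_each_cpu_mask() (the function name enable_lam_func() and its exact
placement are illustrative, not from the posted series):

/*
 * Hypothetical IPI handler: runs on every CPU in mm_cpumask(mm) and
 * fixes up CR3 and cpu_tlbstate on CPUs currently running with this mm.
 */
static void enable_lam_func(void *__mm)
{
	struct mm_struct *mm = __mm;

	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
		return;

	write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
	set_tlbstate_lam_mode(mm);
}

prctl_enable_tagged_addr() would then call
on_each_cpu_mask(mm_cpumask(mm), enable_lam_func, mm, true) after setting
mm->context.lam_cr3_mask, instead of only updating the local CPU.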

>
> Not that it is possible that mm->context.lam_cr3_mask changes throughout
> switch_mm_irqs_off(). But since LAM can only be enabled by a
> single-threaded process on its own behalf, in that case we cannot be
> switching to a user thread in that same process, we can only be
> switching to another kthread using the borrowed mm or a different user
> process, which should be fine.

The thought process is even simpler with the IPI: it *can* change while switching, but it will resynchronize immediately once IRQs turn back on. And whoever changes it will *synchronize* with us, which would otherwise require extremely complex logic to get right.

And...

> - if (!was_lazy)
> - return;
> + if (was_lazy) {
> + /*
> + * Read the tlb_gen to check whether a flush is needed.
> + * If the TLB is up to date, just use it. The barrier
> + * synchronizes with the tlb_gen increment in the TLB
> + * shootdown code.
> + */
> + smp_mb();

This is actually rather expensive -- from old memory, we're talking maybe 20 cycles here, but this path is *very* hot and we try fairly hard to make it be fast. If we get the happy PCID path, it's maybe 100-200 cycles, so this is like a 10% regression. Ouch.

And you can delete all of this if you accept my suggestion.

2024-03-08 01:47:51

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 05:34:21PM -0800, Andy Lutomirski wrote:
> Catching up a bit...
>
> On Thu, Mar 7, 2024, at 5:39 AM, Yosry Ahmed wrote:
> > During context switching, if we are not switching to new mm and no TLB
> > flush is needed, we do not write CR3. However, it is possible that a
> > user thread enables LAM while a kthread is running on a different CPU
> > with the old LAM CR3 mask. If the kthread context switches into any
> > thread of that user process, it may not write CR3 with the new LAM mask,
> > which would cause the user thread to run with a misconfigured CR3 that
> > disables LAM on the CPU.
>
> So I think (off the top of my head -- haven't thought about it all that hard) that LAM is logically like PCE and LDT: it's a property of an mm that is only rarely changed, and it doesn't really belong as part of the tlb_gen mechanism. And, critically, it's not worth the effort and complexity to try to optimize LAM changes when we have a lazy CPU (just like PCE and LDT) (whereas TLB flushes are performance critical and are absolutely worth optimizing).
>
> So...
>
> >
> > Fix this by making sure we write a new CR3 if LAM is not up-to-date. No
> > problems were observed in practice, this was found by code inspection.
>
> I think it should be fixed with a much bigger hammer: explicit IPIs. Just don't ever let it get out of date, like install_ldt().

I like this, and I think earlier versions of the code did this. I think
the code now assumes it's fine to not send an IPI since only
single-threaded processes can enable LAM, but this means we have to
handle kthreads switching to user threads with outdated LAMs (what this
patch is trying to do).

I also think there is currently an assumption that it's fine for
kthreads to run with an incorrect LAM, which is mostly fine, but the IPI
also drops that assumption.

>
> >
> > Not that it is possible that mm->context.lam_cr3_mask changes throughout
> > switch_mm_irqs_off(). But since LAM can only be enabled by a
> > single-threaded process on its own behalf, in that case we cannot be
> > switching to a user thread in that same process, we can only be
> > switching to another kthread using the borrowed mm or a different user
> > process, which should be fine.
>
> The thought process is even simpler with the IPI: it *can* change while switching, but it will resynchronize immediately once IRQs turn back on. And whoever changes it will *synchronize* with us, which would otherwise require extremely complex logic to get right.
>
> And...
>
> > - if (!was_lazy)
> > - return;
> > + if (was_lazy) {
> > + /*
> > + * Read the tlb_gen to check whether a flush is needed.
> > + * If the TLB is up to date, just use it. The barrier
> > + * synchronizes with the tlb_gen increment in the TLB
> > + * shootdown code.
> > + */
> > + smp_mb();
>
> This is actually rather expensive -- from old memory, we're talking maybe 20 cycles here, but this path is *very* hot and we try fairly hard to make it be fast. If we get the happy PCID path, it's maybe 100-200 cycles, so this is like a 10% regression. Ouch.

This is not newly introduced by this patch. I merely refactored this
code (reversed the if conditions). I think if we keep the current
approach I should move this refactoring to a separate patch to make
things clearer.

>
> And you can delete all of this if you accept my suggestion.

I like it very much. The problem now is, as I told Dave, I realized I
cannot do any testing beyond compilation due to lack of hardware. I am
happy to send a next version if this is acceptable or if someone else
can test.

2024-03-08 08:09:32

by Yosry Ahmed

Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

> I came up with a kernel patch that I *think* may reproduce the problem
> with enough iterations. Userspace only needs to enable LAM, so I think
> the selftest can be enough to trigger it.
>
> However, there is no hardware with LAM at my disposal, and IIUC I cannot
> use QEMU without KVM to run a kernel with LAM. I was planning to do more
> testing before sending a non-RFC version, but apparently I cannot do
> any testing beyond building at this point (including reproducing) :/
>
> Let me know how you want to proceed. I can send a non-RFC v1 based on
> the feedback I got on the RFC, but it will only be build tested.
>
> For the record, here is the diff that I *think* may reproduce the bug:

Okay, I was actually able to run _some_ testing with the diff below on
_a kernel_, and I hit the BUG_ON pretty quickly. If I did things
correctly, this BUG_ON means that even though we have an outdated LAM in
our CR3, we will not update CR3 because the TLB is up-to-date.

I can work on a v1 now with the IPI approach that Andy suggested. A
small kink is that we may still hit the BUG_ON with that fix, but in
that case it should be fine to not write CR3 because once we re-enable
interrupts we will receive the IPI and fix it. IOW, the diff below will
still BUG with the proposed fix, but it should be okay.

One thing I am not clear about with the IPI approach: if we use
mm_cpumask() to limit the IPI scope, we need to make sure that we read
mm_lam_cr3_mask() *after* we update the cpumask in switch_mm_irqs_off(),
which makes me think we'll need a barrier (and Andy said we want to
avoid those in this path). But looking at the code I see:

        /*
         * Start remote flushes and then read tlb_gen.
         */
        if (next != &init_mm)
                cpumask_set_cpu(cpu, mm_cpumask(next));
        next_tlb_gen = atomic64_read(&next->context.tlb_gen);

This code doesn't have a barrier. How do we make sure the read actually
happens after the write?

If no barrier is needed there, then I think we can similarly just read
the LAM mask after cpumask_set_cpu().
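
Concretely, the ordering I'm asking about would be something like the
sketch below (whether an explicit barrier is needed between the write
and the read is exactly the open question):

        /* Publish that this CPU now uses next ... */
        if (next != &init_mm)
                cpumask_set_cpu(cpu, mm_cpumask(next));

        /*
         * ... and only then sample the LAM mask, so that a concurrent
         * prctl_enable_tagged_addr() either sees this CPU in mm_cpumask()
         * (and IPIs it), or we observe the updated lam_cr3_mask here.
         */
        new_lam = mm_lam_cr3_mask(next);
        next_tlb_gen = atomic64_read(&next->context.tlb_gen);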

>
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 33b268747bb7b..c37a8c26a3c21 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -750,8 +750,25 @@ static long prctl_map_vdso(const struct vdso_image *image, unsigned long addr)
>
> #define LAM_U57_BITS 6
>
> +static int kthread_fn(void *_mm)
> +{
> + struct mm_struct *mm = _mm;
> +
> + /*
> + * Wait for LAM to be enabled then schedule. Hopefully we will context
> + * switch directly into the task that enabled LAM due to CPU pinning.
> + */
> + kthread_use_mm(mm);
> + while (!test_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags));
> + schedule();
> + return 0;
> +}
> +
> static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> {
> + struct task_struct *kthread_task;
> + int kthread_cpu;
> +
> if (!cpu_feature_enabled(X86_FEATURE_LAM))
> return -ENODEV;
>
> @@ -782,10 +799,22 @@ static int prctl_enable_tagged_addr(struct mm_struct *mm, unsigned long nr_bits)
> return -EINVAL;
> }
>
> + /* Pin the task to the current CPU */
> + set_cpus_allowed_ptr(current, cpumask_of(smp_processor_id()));
> +
> + /* Run a kthread on another CPU and wait for it to start */
> + kthread_cpu = cpumask_next_wrap(smp_processor_id(), cpu_online_mask, 0, false);
> + kthread_task = kthread_run_on_cpu(kthread_fn, mm, kthread_cpu, "lam_repro_kthread");
> + while (!task_is_running(kthread_task));
> +
> write_cr3(__read_cr3() | mm->context.lam_cr3_mask);
> set_tlbstate_lam_mode(mm);
> set_bit(MM_CONTEXT_LOCK_LAM, &mm->context.flags);
>
> + /* Move the task to the kthread CPU */
> + set_cpus_allowed_ptr(current, cpumask_of(kthread_cpu));
> +
> mmap_write_unlock(mm);
>
> return 0;
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index 51f9f56941058..3afb53f1a1901 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -593,7 +593,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
> next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> next_tlb_gen)
> - return;
> + BUG_ON(new_lam != tlbstate_lam_cr3_mask());
>
> /*
> * TLB contents went out of date while we were in lazy
>

2024-03-08 15:24:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On 3/7/24 17:34, Andy Lutomirski wrote:
>> Fix this by making sure we write a new CR3 if LAM is not
>> up-to-date. No problems were observed in practice, this was found
>> by code inspection.
> I think it should be fixed with a much bigger hammer: explicit IPIs.
> Just don't ever let it get out of date, like install_ldt().
I guess it matters whether the thing that matters is having a persistent
inconsistency or a temporary one. IPIs will definitely turn a permanent
one into a temporary one.

But this is all easier to reason about if we can get rid of even the
temporary inconsistency.

Wouldn't this be even simpler than IPIs?

static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
{
        unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);

+       /* LAM is for userspace only. Ignore it for kernel threads: */
+       if (tsk->flags & PF_KTHREAD)
+               return 0;

        this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
        this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
        return lam;
}

2024-03-08 17:33:19

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Fri, Mar 08, 2024 at 01:47:39AM +0000, Yosry Ahmed wrote:
> I like it very much. The problem now is, as I told Dave, I realized I
> cannot do any testing beyond compilation due to lack of hardware. I am
> happy to send a next version if this is acceptable or if someone else
> can test.

I have non-upstreamable QEMU patch that adds LAM emulation, if it helps:

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 93b1ca810bf4..fe887a86a156 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1295,6 +1295,19 @@ void tlb_set_page(CPUState *cpu, vaddr addr,
prot, mmu_idx, size);
}

+
+static vaddr clean_addr(CPUState *cpu, vaddr addr)
+{
+ CPUClass *cc = CPU_GET_CLASS(cpu);
+
+ if (cc->tcg_ops->do_clean_addr) {
+ addr = cc->tcg_ops->do_clean_addr(cpu, addr);
+ }
+
+ return addr;
+}
+
+
/*
* Note: tlb_fill() can trigger a resize of the TLB. This means that all of the
* caller's prior references to the TLB table (e.g. CPUTLBEntry pointers) must
@@ -1867,9 +1880,10 @@ static bool mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
* Probe for an atomic operation. Do not allow unaligned operations,
* or io operations to proceed. Return the host address.
*/
-static void *atomic_mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
+static void *atomic_mmu_lookup(CPUState *cpu, vaddr address, MemOpIdx oi,
int size, uintptr_t retaddr)
{
+ vaddr addr = clean_addr(cpu, address);
uintptr_t mmu_idx = get_mmuidx(oi);
MemOp mop = get_memop(oi);
int a_bits = get_alignment_bits(mop);
@@ -2002,10 +2016,11 @@ static void *atomic_mmu_lookup(CPUState *cpu, vaddr addr, MemOpIdx oi,
* The bytes are concatenated in big-endian order with @ret_be.
*/
static uint64_t int_ld_mmio_beN(CPUState *cpu, CPUTLBEntryFull *full,
- uint64_t ret_be, vaddr addr, int size,
+ uint64_t ret_be, vaddr address, int size,
int mmu_idx, MMUAccessType type, uintptr_t ra,
MemoryRegion *mr, hwaddr mr_offset)
{
+ vaddr addr = clean_addr(cpu, address);
do {
MemOp this_mop;
unsigned this_size;
@@ -2543,10 +2558,11 @@ static Int128 do_ld16_mmu(CPUState *cpu, vaddr addr,
* return the bytes of @val_le beyond @p->size that have not been stored.
*/
static uint64_t int_st_mmio_leN(CPUState *cpu, CPUTLBEntryFull *full,
- uint64_t val_le, vaddr addr, int size,
+ uint64_t val_le, vaddr address, int size,
int mmu_idx, uintptr_t ra,
MemoryRegion *mr, hwaddr mr_offset)
{
+ vaddr addr = clean_addr(cpu, address);
do {
MemOp this_mop;
unsigned this_size;
diff --git a/include/hw/core/tcg-cpu-ops.h b/include/hw/core/tcg-cpu-ops.h
index bf8ff8e3eec1..eaa8e09a6226 100644
--- a/include/hw/core/tcg-cpu-ops.h
+++ b/include/hw/core/tcg-cpu-ops.h
@@ -140,6 +140,12 @@ struct TCGCPUOps {
MMUAccessType access_type,
int mmu_idx, uintptr_t retaddr);

+
+ /**
+ * @do_clean_addr: Callback for clearing metadata/tags from the address.
+ */
+ vaddr (*do_clean_addr)(CPUState *cpu, vaddr addr);
+
/**
* @adjust_watchpoint_address: hack for cpu_check_watchpoint used by ARM
*/
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 2666ef380891..1bbfd31042b2 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -739,7 +739,7 @@ void x86_cpu_vendor_words2str(char *dst, uint32_t vendor1,
#define TCG_7_0_EDX_FEATURES (CPUID_7_0_EDX_FSRM | CPUID_7_0_EDX_KERNEL_FEATURES)

#define TCG_7_1_EAX_FEATURES (CPUID_7_1_EAX_FZRM | CPUID_7_1_EAX_FSRS | \
- CPUID_7_1_EAX_FSRC | CPUID_7_1_EAX_CMPCCXADD)
+ CPUID_7_1_EAX_FSRC | CPUID_7_1_EAX_CMPCCXADD | CPUID_7_1_EAX_LAM)
#define TCG_7_1_EDX_FEATURES 0
#define TCG_7_2_EDX_FEATURES 0
#define TCG_APM_FEATURES 0
@@ -968,7 +968,7 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
"fsrc", NULL, NULL, NULL,
NULL, NULL, NULL, NULL,
NULL, "amx-fp16", NULL, "avx-ifma",
- NULL, NULL, NULL, NULL,
+ NULL, NULL, "lam", NULL,
NULL, NULL, NULL, NULL,
},
.cpuid = {
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 952174bb6f52..6ef9afd443b7 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -238,6 +238,9 @@ typedef enum X86Seg {
#define CR0_CD_MASK (1U << 30)
#define CR0_PG_MASK (1U << 31)

+#define CR3_LAM_U57 (1ULL << 61)
+#define CR3_LAM_U48 (1ULL << 62)
+
#define CR4_VME_MASK (1U << 0)
#define CR4_PVI_MASK (1U << 1)
#define CR4_TSD_MASK (1U << 2)
@@ -261,6 +264,7 @@ typedef enum X86Seg {
#define CR4_SMAP_MASK (1U << 21)
#define CR4_PKE_MASK (1U << 22)
#define CR4_PKS_MASK (1U << 24)
+#define CR4_LAM_SUP (1U << 28)

#define CR4_RESERVED_MASK \
(~(target_ulong)(CR4_VME_MASK | CR4_PVI_MASK | CR4_TSD_MASK \
@@ -269,7 +273,8 @@ typedef enum X86Seg {
| CR4_OSFXSR_MASK | CR4_OSXMMEXCPT_MASK | CR4_UMIP_MASK \
| CR4_LA57_MASK \
| CR4_FSGSBASE_MASK | CR4_PCIDE_MASK | CR4_OSXSAVE_MASK \
- | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK))
+ | CR4_SMEP_MASK | CR4_SMAP_MASK | CR4_PKE_MASK | CR4_PKS_MASK \
+ | CR4_LAM_SUP ))

#define DR6_BD (1 << 13)
#define DR6_BS (1 << 14)
@@ -932,6 +937,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
#define CPUID_7_1_EAX_AMX_FP16 (1U << 21)
/* Support for VPMADD52[H,L]UQ */
#define CPUID_7_1_EAX_AVX_IFMA (1U << 23)
+/* Linear Address Masking */
+#define CPUID_7_1_EAX_LAM (1U << 26)

/* Support for VPDPB[SU,UU,SS]D[,S] */
#define CPUID_7_1_EDX_AVX_VNNI_INT8 (1U << 4)
@@ -2525,6 +2532,24 @@ static inline bool hyperv_feat_enabled(X86CPU *cpu, int feat)
return !!(cpu->hyperv_features & BIT(feat));
}

+static inline uint64_t cr3_reserved_bits(CPUX86State *env)
+{
+ uint64_t reserved_bits;
+
+ if (!(env->efer & MSR_EFER_LMA)) {
+ return 0;
+ }
+
+ reserved_bits = (~0ULL) << env_archcpu(env)->phys_bits;
+
+ if (env->features[FEAT_7_1_EAX] & CPUID_7_1_EAX_LAM) {
+ reserved_bits &= ~(CR3_LAM_U48 | CR3_LAM_U57);
+ }
+
+ return reserved_bits;
+}
+
+
static inline uint64_t cr4_reserved_bits(CPUX86State *env)
{
uint64_t reserved_bits = CR4_RESERVED_MASK;
diff --git a/target/i386/helper.c b/target/i386/helper.c
index 2070dd0dda1f..4901c9c17b1e 100644
--- a/target/i386/helper.c
+++ b/target/i386/helper.c
@@ -262,7 +262,7 @@ hwaddr x86_cpu_get_phys_page_attrs_debug(CPUState *cs, vaddr addr,
}

if (la57) {
- pml5e_addr = ((env->cr[3] & ~0xfff) +
+ pml5e_addr = ((env->cr[3] & PG_ADDRESS_MASK) +
(((addr >> 48) & 0x1ff) << 3)) & a20_mask;
pml5e = x86_ldq_phys(cs, pml5e_addr);
if (!(pml5e & PG_PRESENT_MASK)) {
diff --git a/target/i386/tcg/helper-tcg.h b/target/i386/tcg/helper-tcg.h
index effc2c1c9842..11f75ea475e3 100644
--- a/target/i386/tcg/helper-tcg.h
+++ b/target/i386/tcg/helper-tcg.h
@@ -84,6 +84,7 @@ bool x86_cpu_tlb_fill(CPUState *cs, vaddr address, int size,
G_NORETURN void x86_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
MMUAccessType access_type,
int mmu_idx, uintptr_t retaddr);
+vaddr x86_cpu_clean_addr(CPUState *cpu, vaddr addr);
#endif

/* cc_helper.c */
diff --git a/target/i386/tcg/sysemu/excp_helper.c b/target/i386/tcg/sysemu/excp_helper.c
index 8f7011d96631..1bc71170e6a3 100644
--- a/target/i386/tcg/sysemu/excp_helper.c
+++ b/target/i386/tcg/sysemu/excp_helper.c
@@ -163,7 +163,7 @@ static bool mmu_translate(CPUX86State *env, const TranslateParams *in,
/*
* Page table level 5
*/
- pte_addr = (in->cr3 & ~0xfff) + (((addr >> 48) & 0x1ff) << 3);
+ pte_addr = (in->cr3 & PG_ADDRESS_MASK) + (((addr >> 48) & 0x1ff) << 3);
if (!ptw_translate(&pte_trans, pte_addr)) {
return false;
}
@@ -638,3 +638,30 @@ G_NORETURN void x86_cpu_do_unaligned_access(CPUState *cs, vaddr vaddr,
X86CPU *cpu = X86_CPU(cs);
handle_unaligned_access(&cpu->env, vaddr, access_type, retaddr);
}
+
+
+static inline int64_t sign_extend64(uint64_t value, int index)
+{
+ int shift = 63 - index;
+ return (int64_t)(value << shift) >> shift;
+}
+
+vaddr x86_cpu_clean_addr(CPUState *cs, vaddr addr)
+{
+ CPUX86State *env = &X86_CPU(cs)->env;
+ bool la57 = env->cr[4] & CR4_LA57_MASK;
+
+ if (addr >> 63) {
+ if (env->cr[4] & CR4_LAM_SUP) {
+ return sign_extend64(addr, la57 ? 56 : 47);
+ }
+ } else {
+ if (env->cr[3] & CR3_LAM_U57) {
+ return sign_extend64(addr, 56);
+ } else if (env->cr[3] & CR3_LAM_U48) {
+ return sign_extend64(addr, 47);
+ }
+ }
+
+ return addr;
+}
diff --git a/target/i386/tcg/sysemu/misc_helper.c b/target/i386/tcg/sysemu/misc_helper.c
index edb7c3d89408..aecb523e777d 100644
--- a/target/i386/tcg/sysemu/misc_helper.c
+++ b/target/i386/tcg/sysemu/misc_helper.c
@@ -98,8 +98,7 @@ void helper_write_crN(CPUX86State *env, int reg, target_ulong t0)
cpu_x86_update_cr0(env, t0);
break;
case 3:
- if ((env->efer & MSR_EFER_LMA) &&
- (t0 & ((~0ULL) << env_archcpu(env)->phys_bits))) {
+ if (t0 & cr3_reserved_bits(env)) {
cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
}
if (!(env->efer & MSR_EFER_LMA)) {
diff --git a/target/i386/tcg/sysemu/svm_helper.c b/target/i386/tcg/sysemu/svm_helper.c
index 5d6de2294fa1..e981b124d975 100644
--- a/target/i386/tcg/sysemu/svm_helper.c
+++ b/target/i386/tcg/sysemu/svm_helper.c
@@ -305,8 +305,7 @@ void helper_vmrun(CPUX86State *env, int aflag, int next_eip_addend)
cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
}
new_cr3 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr3));
- if ((env->efer & MSR_EFER_LMA) &&
- (new_cr3 & ((~0ULL) << cpu->phys_bits))) {
+ if (new_cr3 & cr3_reserved_bits(env)) {
cpu_vmexit(env, SVM_EXIT_ERR, 0, GETPC());
}
new_cr4 = x86_ldq_phys(cs, env->vm_vmcb + offsetof(struct vmcb, save.cr4));
diff --git a/target/i386/tcg/tcg-cpu.c b/target/i386/tcg/tcg-cpu.c
index cca19cd40e81..8ceeb954364e 100644
--- a/target/i386/tcg/tcg-cpu.c
+++ b/target/i386/tcg/tcg-cpu.c
@@ -118,6 +118,7 @@ static const TCGCPUOps x86_tcg_ops = {
.record_sigbus = x86_cpu_record_sigbus,
#else
.tlb_fill = x86_cpu_tlb_fill,
+ .do_clean_addr = x86_cpu_clean_addr,
.do_interrupt = x86_cpu_do_interrupt,
.cpu_exec_halt = x86_cpu_exec_halt,
.cpu_exec_interrupt = x86_cpu_exec_interrupt,
--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-08 18:18:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Fri, Mar 08, 2024 at 07:23:58AM -0800, Dave Hansen wrote:
> On 3/7/24 17:34, Andy Lutomirski wrote:
> >> Fix this by making sure we write a new CR3 if LAM is not
> >> up-to-date. No problems were observed in practice, this was found
> >> by code inspection.
> > I think it should be fixed with a much bigger hammer: explicit IPIs.
> > Just don't ever let it get out of date, like install_ldt().
> I guess it matters whether the thing that matters is having a persistent
> inconsistency or a temporary one. IPIs will definitely turn a permanent
> one into a temporary one.
>
> But this is all easier to reason about if we can get rid of even the
> temporary inconsistency.
>
> Wouldn't this be even simpler than IPIs?
>
> static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
>         unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
>
> +       /* LAM is for userspace only. Ignore it for kernel threads: */
> +       if (tsk->flags & PF_KTHREAD)
> +               return 0;

I like this approach. kthread_use_mm() WARNs if it is called for a
non-PF_KTHREAD task, so it should be okay.

I was worried that it would also exclude io_uring, but worker threads
don't have the flag set.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-09 02:19:35

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Fri, Mar 08, 2024 at 07:23:58AM -0800, Dave Hansen wrote:
> On 3/7/24 17:34, Andy Lutomirski wrote:
> >> Fix this by making sure we write a new CR3 if LAM is not
> >> up-to-date. No problems were observed in practice, this was found
> >> by code inspection.
> > I think it should be fixed with a much bigger hammer: explicit IPIs.
> > Just don't ever let it get out of date, like install_ldt().
> I guess it matters whether the thing that matters is having a persistent
> inconsistency or a temporary one. IPIs will definitely turn a permanent
> one into a temporary one.
>
> But this is all easier to reason about if we can get rid of even the
> temporary inconsistency.
>
> Wouldn't this be even simpler than IPIs?
>
> static inline unsigned long set_tlbstate_lam_mode(struct mm_struct *mm)
> {
>         unsigned long lam = READ_ONCE(mm->context.lam_cr3_mask);
>
> +       /* LAM is for userspace only. Ignore it for kernel threads: */
> +       if (tsk->flags & PF_KTHREAD)
> +               return 0;
>
>         this_cpu_write(cpu_tlbstate.lam, lam >> X86_CR3_LAM_U57_BIT);
>         this_cpu_write(tlbstate_untag_mask, mm->context.untag_mask);
>         return lam;
> }

Hmm I don't see how this fixes the problem addressed by this patch. I
think this fixes the problem addressed by patch 1, where CR3 and
cpu_tlbstate.lam may get out of sync if LAM enablement races with
switch_mm_irqs_off().

However, this patch is fixing a deeper problem (an actual bug).
Precisely this situation:

CPU 1                                   CPU 2
/* kthread */
kthread_use_mm()
                                        /* user thread */
                                        prctl_enable_tagged_addr()
                                        /* LAM enabled */
                                        context_switch() /* to CPU 1 */
switch_mm_irqs_off()
/* user thread */
---> LAM is disabled here <---


When switch_mm_irqs_off() runs on CPU 1 to switch from the kthread to
the user thread, because the mm is not actually changing, we may not
write CR3. In this case, the user thread runs on CPU 1 with LAM
disabled, right?

The IPI would fix this problem because prctl_enable_tagged_addr() will
make sure that CPU 1 enables LAM before it returns to userspace.
Alternatively, this patch fixes the problem by making sure we write CR3
in switch_mm_irqs_off() if LAM is out-of-date.

I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
problem. Do you mind elaborating?


2024-03-09 16:34:51

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> problem. Do you mind elaborating?

Define what problem is.

Yes, in this scenario kthread gets more permissive LAM mode than it needs.
But nothing breaks.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-09 21:37:57

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > problem. Do you mind elaborating?
>
> Define what problem is.
>
> Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> But nothing breaks.


The problem here is not how the kthread runs at all. It is the fact
that if that kthread context switches into the user process that has
enabled LAM, it may not update CR3 because the mm doesn't change.
switch_mm_irqs_off() will only update CR3 in this case if there is a
pending TLB flush. Otherwise, we just return, even if the LAM for this
mm has changed.

This can cause the process that has enabled LAM to run with LAM
disabled and fault on tagged addresses, right? Did I miss something?

2024-03-11 12:42:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Sat, Mar 09, 2024 at 01:37:06PM -0800, Yosry Ahmed wrote:
> On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > > problem. Do you mind elaborating?
> >
> > Define what problem is.
> >
> > Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> > But nothing breaks.
>
>
> The problem here is not how the kthread runs at all. It is the fact
> that if that kthread context switches into the user process that has
> enabled LAM, it may not update CR3 because the mm doesn't change.
> switch_mm_irqs_off() will only update CR3 in this case if there is a
> pending TLB flush. Otherwise, we just return, even if the LAM for this
> mm has changed.
>
> This can cause the process that has enabled LAM to run with LAM
> disabled and fault on tagged addresses, right? Did I miss something?

You are right. I think IPI is the way to go.

Will you prepare a patch?

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-03-11 18:28:50

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Mon, Mar 11, 2024 at 5:42 AM Kirill A. Shutemov
<[email protected]> wrote:
>
> On Sat, Mar 09, 2024 at 01:37:06PM -0800, Yosry Ahmed wrote:
> > On Sat, Mar 9, 2024 at 8:34 AM Kirill A. Shutemov
> > <[email protected]> wrote:
> > >
> > > On Sat, Mar 09, 2024 at 02:19:19AM +0000, Yosry Ahmed wrote:
> > > > I don't see how skipping set_tlbstate_lam_mode() for kthreads fixes this
> > > > problem. Do you mind elaborating?
> > >
> > > Define what problem is.
> > >
> > > Yes, in this scenario kthread gets more permissive LAM mode than it needs.
> > > But nothing breaks.
> >
> >
> > The problem here is not how the kthread runs at all. It is the fact
> > that if that kthread context switches into the user process that has
> > enabled LAM, it may not update CR3 because the mm doesn't change.
> > switch_mm_irqs_off() will only update CR3 in this case if there is a
> > pending TLB flush. Otherwise, we just return, even if the LAM for this
> > mm has changed.
> >
> > This can cause the process that has enabled LAM to run with LAM
> > disabled and fault on tagged addresses, right? Did I miss something?
>
> You are right. I think IPI is the way to go.
>
> Will you prepare a patch?

I am working on it.

2024-03-07 21:04:45

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [RFC PATCH 2/3] x86/mm: make sure LAM is up-to-date during context switching

On Thu, Mar 07, 2024 at 07:29:34AM -0800, Dave Hansen wrote:
> On 3/7/24 05:39, Yosry Ahmed wrote:
> > - /*
> > - * Read the tlb_gen to check whether a flush is needed.
> > - * If the TLB is up to date, just use it.
> > - * The barrier synchronizes with the tlb_gen increment in
> > - * the TLB shootdown code.
> > - */
> > - smp_mb();
> > - next_tlb_gen = atomic64_read(&next->context.tlb_gen);
> > - if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
> > - next_tlb_gen)
> > + if (!need_flush && !need_lam_update)
> > return;
>
> Instead of all this new complexity, why not just inc_mm_tlb_gen() at the
> site where LAM is enabled? That should signal to any CPU thread that
> its TLB is out of date and it needs to do a full CR3 reload.

It's not really a lot of complexity imo; most of the patch is inverting
the if conditions that return when a TLB flush is not needed into a
single if block that sets need_flush to true. I can split this into a
different patch if it helps review, or I can do something like this to
keep the diff concise:

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2975d3f89a5de..17ab105522287 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -576,7 +576,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
* process. No TLB flush required.
*/
if (!was_lazy)
- return;
+ goto check_return;

/*
* Read the tlb_gen to check whether a flush is needed.
@@ -588,7 +588,7 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
next_tlb_gen = atomic64_read(&next->context.tlb_gen);
if (this_cpu_read(cpu_tlbstate.ctxs[prev_asid].tlb_gen) ==
next_tlb_gen)
- return;
+ goto check_return;

/*
* TLB contents went out of date while we were in lazy
@@ -596,6 +596,9 @@ void switch_mm_irqs_off(struct mm_struct *unused, struct mm_struct *next,
*/
new_asid = prev_asid;
need_flush = true;
+check_return:
+ if (!need_flush && /* LAM up-to-date */)
+ return;
} else {
/*
* Apply process to process speculation vulnerability

...but I still prefer the current patch, to avoid the goto. I think the
flow is easier to follow.

I thought about doing inc_mm_tlb_gen() when LAM is updated, but it felt
hacky and more importantly doesn't make it clear in switch_mm_irqs_off()
that we correctly handle LAM updates. We can certainly add a comment,
but I think an explicit check for CPU LAM vs. mm LAM is much clearer.
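
For completeness, the inc_mm_tlb_gen() variant would be roughly the
sketch below in prctl_enable_tagged_addr() (the exact placement is my
guess):

        mm->context.lam_cr3_mask = X86_CR3_LAM_U57;
        mm->context.untag_mask = ~GENMASK(62, 57);
        /*
         * Bump the mm's tlb_gen so that lazy CPUs see a stale local
         * tlb_gen on the next switch_mm_irqs_off() and take the
         * need_flush path, which rewrites CR3 (and with it the LAM bits).
         */
        inc_mm_tlb_gen(mm);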

WDYT?

>
> Also, have you been able to actually catch this scenario in practice?

Nope, by code inspection (as I admitted in the commit log).

> Considering how fun this code path is, a little effort at an actual
> reproduction would be really appreciated.

I tried reproducing it but gave up quickly. We need a certain sequence
of events to happen:

CPU 1                                   CPU 2
kthread_use_mm()
                                        /* user thread enables LAM */
                                        context_switch()
context_switch() /* to user thread */


IIUC we don't really need kthread_use_mm(); we just need the kthread to
be scheduled on the CPU directly after the user thread, so maybe
something like:

CPU 1                                   CPU 2
/* user thread running */
context_switch() /* to kthread */
                                        /* user thread enables LAM */
                                        context_switch()
context_switch() /* to user thread */

It doesn't seem easy to reproduce. WDYT?