LinuxLists.cc - [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Mon, Sep 20 2021 at 19:23, Fenghua Yu wrote:
>
> +#ifdef CONFIG_INTEL_IOMMU_SVM
> +void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
> +#else
> +static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
> +#endif

This code is again defining that PASID is entirely restricted to
INTEL. It's true, that no other vendor supports this, but PASID is
a non-vendor specific concept.

Sticking this into INTEL code means that any other PASID implementation
has to rip it out again from INTEL code and make it a run time property.

The refcounting issue should be the same for all PASID mechanisms which
attach PASID to a mm. What's INTEL specific about that?

So can we pretty please do that correct right away?

Thanks,

tglx

2021-09-23 16:42:09

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Thu, Sep 23, 2021 at 04:36:50PM +0200, Thomas Gleixner wrote:
> On Mon, Sep 20 2021 at 19:23, Fenghua Yu wrote:
> >
> > +#ifdef CONFIG_INTEL_IOMMU_SVM
> > +void pasid_put(struct task_struct *tsk, struct mm_struct *mm);
> > +#else
> > +static inline void pasid_put(struct task_struct *tsk, struct mm_struct *mm) { }
> > +#endif
>
> This code is again defining that PASID is entirely restricted to
> INTEL. It's true, that no other vendor supports this, but PASID is
> a non-vendor specific concept.
>
> Sticking this into INTEL code means that any other PASID implementation
> has to rip it out again from INTEL code and make it a run time property.
>
> The refcounting issue should be the same for all PASID mechanisms which
> attach PASID to a mm. What's INTEL specific about that?
>
> So can we pretty please do that correct right away?

It's a bit messy (surprise).

There are two reasons to hold a refcount on a PASID

1) The process has done a bind on a device that uses PASIDs

This one isn't dependent on Intel.

2) A task within a process is using ENQCMD (and thus holds
a reference on the PASID because IA32_PASID MSR for this
task has the PASID value loaded with the enable bit set).

This is (currently) Intel specific (until others
implement an ENQCMD-like feature to allow apps to
access PASID enabled devices without going through
the OS).

Perhaps some better function naming might help? E.g. have
a task_pasid_put() function that handles the process exit
case separatley from the device unbind case.

void task_pasid_put(void)
{
if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
return;

if (current->has_valid_pasid) {
mutex_lock(&pasid_mutex);
iommu_sva_free_pasid(mm);
mutex_unlock(&pasid_mutex);
}
}

-Tony

2021-09-23 17:50:52

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> On Thu, Sep 23, 2021 at 04:36:50PM +0200, Thomas Gleixner wrote:
>> This code is again defining that PASID is entirely restricted to
>> INTEL. It's true, that no other vendor supports this, but PASID is
>> a non-vendor specific concept.
>>
>> Sticking this into INTEL code means that any other PASID implementation
>> has to rip it out again from INTEL code and make it a run time property.
>>
>> The refcounting issue should be the same for all PASID mechanisms which
>> attach PASID to a mm. What's INTEL specific about that?
>>
>> So can we pretty please do that correct right away?
>
> It's a bit messy (surprise).
>
> There are two reasons to hold a refcount on a PASID
>
> 1) The process has done a bind on a device that uses PASIDs
>
> This one isn't dependent on Intel.

Yep.

> 2) A task within a process is using ENQCMD (and thus holds
> a reference on the PASID because IA32_PASID MSR for this
> task has the PASID value loaded with the enable bit set).
>
> This is (currently) Intel specific (until others
> implement an ENQCMD-like feature to allow apps to
> access PASID enabled devices without going through
> the OS).

Right, but all it does is to grab another reference on the PASID and if
the task exits it needs to be dropped, right?

So you already added 'has_valid_pasid' to task_struct, which is a
misnomer btw. The meaning of this flag is that the task is actually
using the PASID (in the ENQCMD case written to the MSR) beyond the
purposes of the PASID which is attached to current->mm.

But the information which is important from a pasid resource management
POV is whether the task actually holds a seperate refcount on the PASID
or not. That's completely generic. It does not matter whether the task
uses it to populate the IA32_PASID_MSR or to just keeps it in memory
just because it can. The point is that is holds a refcount.

So we can have a generic interface:

int iommu_pasid_get_task_ref()
{
if (current->holds_pasid_ref)
return -EYOUGOTONEALREADY;

if (!has_pasid(current->mm)
return -EWHATAREYOUASKINGFOR;

if (!iommu->pasid_get_ref)
return -EHOWDIDYOUMAKEUPAPASID;

if (iommu->pasid_get_ref())
return -ETHEIOMMUDOESNOTLIKEYOU;

current->holds_pasid_ref = true;
return 0;
}

Actually letting this return a bool should be good enough, but you get
the idea.

void iommu_pasid_put_task_ref()
{
if (!current->holds_pasid_ref)
return;
current->holds_pasid_ref = false;
iommu->pasid_put_ref();
}

Which makes the exception handling in traps.c the real x86/intel
specific muck:

bool fixup_pasid_exception(...)
{
/* ENCQMD and future muck */
if (!per_task_pasid_usage_enabled())
return false;
if (iommu_pasid_get_ref())
return false;
fpu_write_task_pasid();
return true;
}

fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
and be done with it.

The task exit code can just call iommu_pasid_put_task_ref() from the
generic code and not from within x86.

That means you want in Kconfig:

PASID_TASK_REFS
bool

and select that when a IOMMU supporting it is enabled and provide either
the proper protypes or stub functions depending on that.

Hmm?

Thanks,

tglx

2021-09-23 23:11:21

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
> PASIDs are fundamentally hardware resources in a shared address space.
> There is a limited number of them to use ENQCMD on shared workqueue.
> They must be shared and managed. They can not, for instance, be
> statically allocated to processes.
>
> Free PASID eagerly by sending IPIs in unbind was disabled due to locking
> and other issues in commit 9bfecd058339 ("x86/cpufeatures: Force disable
> X86_FEATURE_ENQCMD and remove update_pasid()").
>
> Lazy PASID free is implemented in order to re-enable the ENQCMD feature.
> PASIDs are currently reference counted and are centered around device
> usage. To support lazy PASID free, reference counts are tracked in the
> following scenarios:
>
> 1. The PASID's reference count is initialized as 1 when the PASID is first
> allocated in bind. This is already implemented.
> 2. A reference is taken when a device is bound to the mm and dropped
> when the device is unbound from the mm. This reference tracks device
> usage of the PASID. This is already implemented.
> 3. A reference is taken when a task's IA32_PASID MSR is initialized in
> #GP fix up and dropped when the task exits. This reference tracks
> the task usage of the PASID. It is implemented here.

I think this is unnecessarily complicated because it's buying in to the existing ISA misconception that PASID has anything to do with a task. A PASID belongs to an mm, full stop. Now the ISA is nasty and we have tasks that have *noticed* that their mm has a PASID and tasks that have not noticed this fact, but that should be irrelevant to essentially everything except the fault handler.

So just refcount the thing the obvious way: take a reference when you stick the PASID in the mm_struct and drop the reference in __mmdrop(). Problem solved. You could probably drop it more aggressively in __mmput(), and the comment explaining why is left as an exercise to the reader -- if a kernel thread starts doing ENQCMD, we have worse things to worry about :)

--Andy

2021-09-23 23:24:28

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Thu, Sep 23, 2021 at 04:09:18PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:

> I think this is unnecessarily complicated because it's buying in to the
> existing ISA misconception that PASID has anything to do with a task.
> A PASID belongs to an mm, full stop. Now the ISA is nasty and we have
> tasks that have *noticed* that their mm has a PASID and tasks that have
> not noticed this fact, but that should be irrelevant to essentially
> everything except the fault handler.
>
> So just refcount the thing the obvious way: take a reference when you
> stick the PASID in the mm_struct and drop the reference in __mmdrop().
> Problem solved. You could probably drop it more aggressively in
> __mmput(), and the comment explaining why is left as an exercise to the
> reader -- if a kernel thread starts doing ENQCMD, we have worse things
> to worry about :)

That doesn't match well with the non-x86 usage of PASIDs. The code there
bumps the reference count on each device bind, and decrements on each
device unbind.

If we don't keep a reference count for each task that has IA32_PASID
set up we could have this sequence

1) Process binds to a PASID capable device
2) Task uses ENQCMD, so PASID MSR is set up.
3) Process unbinds the device, reference count on PASID
goes to zero. PASID is freed. PASID is reallocated to
another task.
4) Task from step #2 uses ENQCMD to submit a descriptor
and device now processes virtual addresses based on mappings
in the new task.

Now you might say that at step 3 we need to hunt down all the
tasks that have PASID enabled and disabled ... but that's the
same painful code that we avoided when we said that we would
not make Linux hand out a PASID to all existing tasks in a
process on the first bind operation.

-Tony

2021-09-24 05:18:48

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Thu, Sep 23, 2021, at 4:22 PM, Luck, Tony wrote:
> On Thu, Sep 23, 2021 at 04:09:18PM -0700, Andy Lutomirski wrote:
>> On Mon, Sep 20, 2021, at 12:23 PM, Fenghua Yu wrote:
>
>> I think this is unnecessarily complicated because it's buying in to the
>> existing ISA misconception that PASID has anything to do with a task.
>> A PASID belongs to an mm, full stop. Now the ISA is nasty and we have
>> tasks that have *noticed* that their mm has a PASID and tasks that have
>> not noticed this fact, but that should be irrelevant to essentially
>> everything except the fault handler.
>>
>> So just refcount the thing the obvious way: take a reference when you
>> stick the PASID in the mm_struct and drop the reference in __mmdrop().
>> Problem solved. You could probably drop it more aggressively in
>> __mmput(), and the comment explaining why is left as an exercise to the
>> reader -- if a kernel thread starts doing ENQCMD, we have worse things
>> to worry about :)
>
> That doesn't match well with the non-x86 usage of PASIDs. The code there
> bumps the reference count on each device bind, and decrements on each
> device unbind.

Can you elaborate on how that works? Is there an architecture where there is a bona fide per task PASID?

>
> If we don't keep a reference count for each task that has IA32_PASID
> set up we could have this sequence
>
> 1) Process binds to a PASID capable device

Okay, so the mm has that PASID set up and a reference is taken.

> 2) Task uses ENQCMD, so PASID MSR is set up.

Yep.

> 3) Process unbinds the device, reference count on PASID
> goes to zero. PASID is freed. PASID is reallocated to
> another task.

It had better not. We had an entire phone call in which we agreed that the entire lazy-MSR-setup approach only makes any sense if everyone pinky swears that an mm will *never* change its PASID once it has a PASID.

> 4) Task from step #2 uses ENQCMD to submit a descriptor
> and device now processes virtual addresses based on mappings
> in the new task.
>
> Now you might say that at step 3 we need to hunt down all the
> tasks that have PASID enabled and disabled ... but that's the
> same painful code that we avoided when we said that we would
> not make Linux hand out a PASID to all existing tasks in a
> process on the first bind operation.
>

Exactly. Which means that the mm ought to pin that PASID for as long as it exists. What am I missing?

Sure, one can invent a situation in which you start two threads, and one of those threads binds a device, does ENQCMD, unbinds the device, and exits. Then the other thread *in the same mm* binds another device and gets a new PASID. And it all works. But I really don't think this special case is worth optimizing for.

> -Tony

2021-09-24 16:14:15

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

Hi, Thomas,

On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> >
> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> > and be done with it.
> >
> > The task exit code can just call iommu_pasid_put_task_ref() from the
> > generic code and not from within x86.
>
> But OTOH why do you need a per task reference count on the PASID at all?
>
> The PASID is fundamentaly tied to the mm and the mm can't go away before
> the threads have gone away unless this magically changed after I checked
> that ~20 years ago.

There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
to allocate and free PASID:

1. Statically allocate a PASID once a mm is created and free it in mm
exit. No PASID allocation/free during the mm's lifetime. Then
up to 1M processes can be created due to 1M PASIDs limitation.
We don't want this method because the 1M processes limitation.

2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
are three ways to free it:
(a) Actively free it in close(fd)->unbind(dev, mm) by sending
IPIs to tell all tasks using the PASID to clear the IA32_PASID
MSR. This has locking issues similar to the actively loading
IA32_PASID MSR which was force disabled in upstream. So won't work.
(b) Passively free the PASID in destroy_context(mm) in mm exit. Once
the PASID is allocated, it stays with the process for the lifetime. It's
better than #1 because the PASID is allocated only on demand.
(c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
is no usage as implemented in this series. Tracking the PASID usage
per task provides a chance to free the PASID on task exit. The
PASID has a better chance to be freed earlier than mm exit in #(b).

This series uses #2 and #(c) to allocate and free the PASID for a better
chance to ease the 1M PASIDs limitation pressure. For example, a thread
doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
its sibling threads are still running.

Thanks.

-Fenghua

2021-09-24 16:14:40

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
> >
> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> > and be done with it.
> >
> > The task exit code can just call iommu_pasid_put_task_ref() from the
> > generic code and not from within x86.
>
> But OTOH why do you need a per task reference count on the PASID at all?
>
> The PASID is fundamentaly tied to the mm and the mm can't go away before
> the threads have gone away unless this magically changed after I checked
> that ~20 years ago.

It would be possible to avoid a per-task reference to the PASID by
taking an extra reference when mm->pasid is first allocated using
the CONFIG_PASID_TASK_REFS you proposed yesterday to define a function
to take the extra reference, and another to drop it when the mm is
finally freed ... with stubs to do nothing on architectures that
don't create per-task PASID context.

This solution works, but is functionally different from Fenghua's
proposal for this case:

Process clones a task
task binds a device
task accesses device using PASID
task unbinds device
task exits (but process lives on)

Fenghua will free the PASID because the reference count goes
back to zero. The "take an extra reference and keep until the
mm is freed" option would needlessly hold onto the PASID.

This seems like an odd usage case ... even if it does exist, a process
that does this may spawn another task that does the same thing.

If you think it is sufficiently simpler to use the "extra reference"
option (invoking the "perfect is the enemy of good enough" rule) then we
can change.

-Tony

2021-09-24 20:50:10

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
> On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
>
> fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
> and be done with it.
>
> The task exit code can just call iommu_pasid_put_task_ref() from the
> generic code and not from within x86.

But OTOH why do you need a per task reference count on the PASID at all?

The PASID is fundamentaly tied to the mm and the mm can't go away before
the threads have gone away unless this magically changed after I checked
that ~20 years ago.

Thanks,

tglx

2021-09-24 23:34:24

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Fri, Sep 24, 2021, at 9:12 AM, Luck, Tony wrote:
> On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
>> On Thu, Sep 23 2021 at 19:48, Thomas Gleixner wrote:
>> > On Thu, Sep 23 2021 at 09:40, Tony Luck wrote:
>> >
>> > fpu_write_task_pasid() can just grab the pasid from current->mm->pasid
>> > and be done with it.
>> >
>> > The task exit code can just call iommu_pasid_put_task_ref() from the
>> > generic code and not from within x86.
>>
>> But OTOH why do you need a per task reference count on the PASID at all?
>>
>> The PASID is fundamentaly tied to the mm and the mm can't go away before
>> the threads have gone away unless this magically changed after I checked
>> that ~20 years ago.
>
> It would be possible to avoid a per-task reference to the PASID by
> taking an extra reference when mm->pasid is first allocated using
> the CONFIG_PASID_TASK_REFS you proposed yesterday to define a function
> to take the extra reference, and another to drop it when the mm is
> finally freed ... with stubs to do nothing on architectures that
> don't create per-task PASID context.
>
> This solution works, but is functionally different from Fenghua's
> proposal for this case:
>
> Process clones a task
> task binds a device
> task accesses device using PASID
> task unbinds device
> task exits (but process lives on)
>
> Fenghua will free the PASID because the reference count goes
> back to zero. The "take an extra reference and keep until the
> mm is freed" option would needlessly hold onto the PASID.
>
> This seems like an odd usage case ... even if it does exist, a process
> that does this may spawn another task that does the same thing.
>
> If you think it is sufficiently simpler to use the "extra reference"
> option (invoking the "perfect is the enemy of good enough" rule) then we
> can change.

I think the perfect and the good are a bit confused here. If we go for "good", then we have an mm owning a PASID for its entire lifetime. If we want "perfect", then we should actually do it right: teach the kernel to update an entire mm's PASID setting all at once. This isn't *that* hard -- it involves two things:

1. The context switch code needs to resync PASID. Unfortunately, this adds some overhead to every context switch, although a static_branch could minimize it for non-PASID users.
2. A change to an mm's PASID needs to sent an IPI, but that IPI can't touch FPU state. So instead the IPI should use task_work_add() to make sure PASID gets resynced.

And there is still no per-task refcounting.

After all, the not so perfect attempt at perfect in this patch set won't actually work if a thread pool is involved.

2021-09-25 06:38:52

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> 1. The context switch code needs to resync PASID. Unfortunately, this adds some overhead to every context switch, although a static_branch could minimize it for non-PASID users.

Any solution that adds to context switch time isn't going to meet
the definition of "perfect" either.

-Tony

2021-09-25 23:14:49

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

Fenghua,

On Fri, Sep 24 2021 at 16:12, Fenghua Yu wrote:
> On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
>> But OTOH why do you need a per task reference count on the PASID at all?
>>
>> The PASID is fundamentaly tied to the mm and the mm can't go away before
>> the threads have gone away unless this magically changed after I checked
>> that ~20 years ago.
>
> There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
> to allocate and free PASID:
>
> 1. Statically allocate a PASID once a mm is created and free it in mm
> exit. No PASID allocation/free during the mm's lifetime. Then
> up to 1M processes can be created due to 1M PASIDs limitation.
> We don't want this method because the 1M processes limitation.

I'm not so worried about the 1M limitation, but it obviously makes sense
to avoid that because allocating stuff which is not used is pointless in
general.

> 2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
> are three ways to free it:
> (a) Actively free it in close(fd)->unbind(dev, mm) by sending
> IPIs to tell all tasks using the PASID to clear the IA32_PASID
> MSR. This has locking issues similar to the actively loading
> IA32_PASID MSR which was force disabled in upstream. So won't work.

Exactly.

> (b) Passively free the PASID in destroy_context(mm) in mm exit. Once
> the PASID is allocated, it stays with the process for the lifetime. It's
> better than #1 because the PASID is allocated only on demand.

Which is simple and makes a lot of sense. See below.

> (c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
> is no usage as implemented in this series. Tracking the PASID usage
> per task provides a chance to free the PASID on task exit. The
> PASID has a better chance to be freed earlier than mm exit in #(b).
>
> This series uses #2 and #(c) to allocate and free the PASID for a better
> chance to ease the 1M PASIDs limitation pressure. For example, a thread
> doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
> its sibling threads are still running.

I'm not seeing that as a realistic problem. Applications which use this
kind of devices are unlikely to behave exactly that way.

2^20 PASIDs are really plenty and just adding code for the theoretical
case of PASID pressure is a pointless exercise IMO. It just adds
complexity for no reason.

IMO reality will be that either you have long lived processes with tons
of threads which use such devices over and over or short lived forked
processes which open the device, do the job, close and exit. Both
scenarios are fine with allocate on first use and drop on process exit.

I think with your approach you create overhead for applications which
use thread pools where the threads get work thrown at them and do open()
-> do_stuff() -> close() and then go back to wait for the next job which
will do exactly the same thing. So you add the overhead of refcounts in
general and in the worst case if the refcount drops to zero then the
next worker has to allocate a new PASID instead of just moving on.

So unless you have a really compelling real world usecase argument, I'm
arguing that the PASID pressure problem is a purely academic exercise.

I think you are conflating two things here:

1) PASID lifetime
2) PASID MSR overhead

Which is not correct: You still can and have to optimize the per thread
behaviour vs. the PASID MSR: Track per thread whether it ever needed the
PASID and act upon that.

If the thread just does EMQCMD once in it's lifetime, then so be
it. That's not a realistic use case, really.

And if someone does this then this does not mean we have to optimize for
that. Optimizing for possible stupid implementations is the wrong
approach. There is no technial measure against stupidity. If that would
exist the world would be a much better place.

You really have to think about the problem space you are working
on. There are problems which need a 'get it right at the first shot'
solution because they create user space ABI or otheer hard to fix
dependencies.

That's absolutely not the case here.

Get the basic simple support correct and work from there. Trying to
solve all possible theoretical problems upfront is simply not possible
and a guarantee for not making progress.

"Keep it simple" and "correctness first" are still the best working
engineering principles.

They do not prevent us from revisiting this _if_ there is a real world
problem which makes enough sense to implement a finer grained solution.

Thanks,

tglx

2021-09-28 16:46:54

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

Hi, Thomas,

On Sun, Sep 26, 2021 at 01:13:50AM +0200, Thomas Gleixner wrote:
> Fenghua,
>
> On Fri, Sep 24 2021 at 16:12, Fenghua Yu wrote:
> > On Fri, Sep 24, 2021 at 03:18:12PM +0200, Thomas Gleixner wrote:
> >> But OTOH why do you need a per task reference count on the PASID at all?
> >>
> >> The PASID is fundamentaly tied to the mm and the mm can't go away before
> >> the threads have gone away unless this magically changed after I checked
> >> that ~20 years ago.
> >
> > There are up to 1M PASIDs because PASID is 20-bit. I think there are a few ways
> > to allocate and free PASID:
> >
> > 1. Statically allocate a PASID once a mm is created and free it in mm
> > exit. No PASID allocation/free during the mm's lifetime. Then
> > up to 1M processes can be created due to 1M PASIDs limitation.
> > We don't want this method because the 1M processes limitation.
>
> I'm not so worried about the 1M limitation, but it obviously makes sense
> to avoid that because allocating stuff which is not used is pointless in
> general.
>
> > 2. A PASID is allocated to the mm in open(dev)->bind(dev, mm). There
> > are three ways to free it:
> > (a) Actively free it in close(fd)->unbind(dev, mm) by sending
> > IPIs to tell all tasks using the PASID to clear the IA32_PASID
> > MSR. This has locking issues similar to the actively loading
> > IA32_PASID MSR which was force disabled in upstream. So won't work.
>
> Exactly.
>
> > (b) Passively free the PASID in destroy_context(mm) in mm exit. Once
> > the PASID is allocated, it stays with the process for the lifetime. It's
> > better than #1 because the PASID is allocated only on demand.
>
> Which is simple and makes a lot of sense. See below.
>
> > (c) Passively free the PASID in deactive_mm(mm) or unbind() whenever there
> > is no usage as implemented in this series. Tracking the PASID usage
> > per task provides a chance to free the PASID on task exit. The
> > PASID has a better chance to be freed earlier than mm exit in #(b).
> >
> > This series uses #2 and #(c) to allocate and free the PASID for a better
> > chance to ease the 1M PASIDs limitation pressure. For example, a thread
> > doing open(dev)->ENQCMD->close(fd)->exit(2) will not occupy a PASID while
> > its sibling threads are still running.
>
> I'm not seeing that as a realistic problem. Applications which use this
> kind of devices are unlikely to behave exactly that way.
>
> 2^20 PASIDs are really plenty and just adding code for the theoretical
> case of PASID pressure is a pointless exercise IMO. It just adds
> complexity for no reason.
>
> IMO reality will be that either you have long lived processes with tons
> of threads which use such devices over and over or short lived forked
> processes which open the device, do the job, close and exit. Both
> scenarios are fine with allocate on first use and drop on process exit.
>
> I think with your approach you create overhead for applications which
> use thread pools where the threads get work thrown at them and do open()
> -> do_stuff() -> close() and then go back to wait for the next job which
> will do exactly the same thing. So you add the overhead of refcounts in
> general and in the worst case if the refcount drops to zero then the
> next worker has to allocate a new PASID instead of just moving on.
>
> So unless you have a really compelling real world usecase argument, I'm
> arguing that the PASID pressure problem is a purely academic exercise.
>
> I think you are conflating two things here:
>
> 1) PASID lifetime
> 2) PASID MSR overhead
>
> Which is not correct: You still can and have to optimize the per thread
> behaviour vs. the PASID MSR: Track per thread whether it ever needed the
> PASID and act upon that.
>
> If the thread just does EMQCMD once in it's lifetime, then so be
> it. That's not a realistic use case, really.
>
> And if someone does this then this does not mean we have to optimize for
> that. Optimizing for possible stupid implementations is the wrong
> approach. There is no technial measure against stupidity. If that would
> exist the world would be a much better place.
>
> You really have to think about the problem space you are working
> on. There are problems which need a 'get it right at the first shot'
> solution because they create user space ABI or otheer hard to fix
> dependencies.
>
> That's absolutely not the case here.
>
> Get the basic simple support correct and work from there. Trying to
> solve all possible theoretical problems upfront is simply not possible
> and a guarantee for not making progress.
>
> "Keep it simple" and "correctness first" are still the best working
> engineering principles.
>
> They do not prevent us from revisiting this _if_ there is a real world
> problem which makes enough sense to implement a finer grained solution.

Sure. Will free the PASID in destroy_context() on mm exit and won't track
the PASID usage per task. The code will be simpler and clearer.

Thank you very much for your insight!

-Fenghua

2021-09-29 09:57:51

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
> I think the perfect and the good are a bit confused here. If we go for
> "good", then we have an mm owning a PASID for its entire lifetime. If
> we want "perfect", then we should actually do it right: teach the
> kernel to update an entire mm's PASID setting all at once. This isn't
> *that* hard -- it involves two things:
>
> 1. The context switch code needs to resync PASID. Unfortunately, this
> adds some overhead to every context switch, although a static_branch
> could minimize it for non-PASID users.

> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
> touch FPU state. So instead the IPI should use task_work_add() to
> make sure PASID gets resynced.

What do we need 1 for? Any PASID change can be achieved using 2 no?

Basically, call task_work_add() on all relevant tasks [1], then IPI
spray the current running of those and presto.

[1] it is nigh on impossible to find all tasks sharing an mm in any sane
way due to CLONE_MM && !CLONE_THREAD.

2021-09-29 12:31:10

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:
> On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
>> I think the perfect and the good are a bit confused here. If we go for
>> "good", then we have an mm owning a PASID for its entire lifetime. If
>> we want "perfect", then we should actually do it right: teach the
>> kernel to update an entire mm's PASID setting all at once. This isn't
>> *that* hard -- it involves two things:
>>
>> 1. The context switch code needs to resync PASID. Unfortunately, this
>> adds some overhead to every context switch, although a static_branch
>> could minimize it for non-PASID users.
>
>> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
>> touch FPU state. So instead the IPI should use task_work_add() to
>> make sure PASID gets resynced.
>
> What do we need 1 for? Any PASID change can be achieved using 2 no?
>
> Basically, call task_work_add() on all relevant tasks [1], then IPI
> spray the current running of those and presto.
>
> [1] it is nigh on impossible to find all tasks sharing an mm in any sane
> way due to CLONE_MM && !CLONE_THREAD.

Why would we want any of that at all?

Process starts, no PASID assigned.

bind to device -> PASID is allocated and assigned to the mm

some task of the process issues ENQCMD -> #GP -> write PASID MSR

After that the PASID is saved and restored as part of the XSTATE and
there is no extra overhead in context switch or return to user space.

All tasks of the process which did never use ENQCMD don't care and their
PASID xstate is in init state.

There is absolutely no point in enforcing that all tasks of the process
have the PASID activated immediately when it is assigned. If they need
it they get it via the #GP fixup and everything just works.

Looking at that patch again, none of this muck in fpu__pasid_write() is
required at all. The whole exception fixup is:

if (!user_mode(regs))
return false;

if (!current->mm->pasid)
return false;

if (current->pasid_activated)
return false;

wrmsrl(MSR_IA32_PASID, current->mm->pasid);
current->pasid_activated = true;
return true;

There is zero requirement to look at TIF_NEED_FPU_LOAD or
fpregs_state_valid() simply because the #GP comes straight from user
space which means the FPU registers contain the current tasks user space
state.

If TIF_NEED_FPU_LOAD would be set or fpregs_state_valid() would be false
after the user_mode() check then this would simply be a bug somewhere
else and has nothing to do with this PASID fixup.

So no need for magic update_one_xstate_feature() wrappers, no
concurrency concerns, nothing.

It's that simple, really. Anything more complex is just a purely
academic exercise which creates more problems than it solves.

Thanks,

tglx

2021-09-29 17:48:13

[permalink] [raw]

Subject: RE: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

> There is zero requirement to look at TIF_NEED_FPU_LOAD or
> fpregs_state_valid() simply because the #GP comes straight from user
> space which means the FPU registers contain the current tasks user space
> state.

Just to double confirm ... there is no point in the #GP handler up to this point
where pre-emption can occur?

-Tony

2021-09-29 17:53:38

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> On 9/29/21 05:28, Thomas Gleixner wrote:
>> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> required at all. The whole exception fixup is:
>>
>> if (!user_mode(regs))
>> return false;
>>
>> if (!current->mm->pasid)
>> return false;
>>
>> if (current->pasid_activated)
>> return false;
>
> <-- preemption or BH here: kaboom.

Sigh, this had obviously to run in the early portion of #GP, i.e. before
enabling interrupts.

For me that's more than obvious and I apologize that I failed to mention
it.

>>
>> wrmsrl(MSR_IA32_PASID, current->mm->pasid);
>
> This needs the actual sane fpstate writing helper -- see other email.

And with that there is no fpstate write helper required at all.

Stop this overengineering madness already.

Thanks,

tglx

2021-09-29 18:09:56

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29, 2021, at 10:41 AM, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
>> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
>> > On 9/29/21 05:28, Thomas Gleixner wrote:
>> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
>> >> required at all. The whole exception fixup is:
>> >>
>> >> if (!user_mode(regs))
>> >> return false;
>> >>
>> >> if (!current->mm->pasid)
>> >> return false;
>> >>
>> >> if (current->pasid_activated)
>> >> return false;
>> >
>> > <-- preemption or BH here: kaboom.
>>
>> Sigh, this had obviously to run in the early portion of #GP, i.e. before
>> enabling interrupts.
>
> Like this? Obviously with some comment about why this is being done.
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> unsigned long gp_addr;
> int ret;
>
> + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> + return;
> + }
> +

This could do with a WARN_ON_ONCE(TIF_NEED_LOAD_FPU) imo.

Is instrumentation allowed to touch FPU state?

> cond_local_irq_enable(regs);
>
> if (static_cpu_has(X86_FEATURE_UMIP)) {
>
> -Tony

2021-09-29 19:16:17

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > On 9/29/21 05:28, Thomas Gleixner wrote:
> >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> >> required at all. The whole exception fixup is:
> >>
> >> if (!user_mode(regs))
> >> return false;
> >>
> >> if (!current->mm->pasid)
> >> return false;
> >>
> >> if (current->pasid_activated)
> >> return false;
> >
> > <-- preemption or BH here: kaboom.
>
> Sigh, this had obviously to run in the early portion of #GP, i.e. before
> enabling interrupts.

Like this? Obviously with some comment about why this is being done.

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..a848a59291e7 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
unsigned long gp_addr;
int ret;

+ if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
+ current->pasid_activated = 1;
+ wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
+ return;
+ }
+
cond_local_irq_enable(regs);

if (static_cpu_has(X86_FEATURE_UMIP)) {

-Tony

2021-09-29 19:28:22

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

Hi, Thomas,

On Wed, Sep 29, 2021 at 09:51:15AM -0700, Luck, Tony wrote:
> > There is zero requirement to look at TIF_NEED_FPU_LOAD or
> > fpregs_state_valid() simply because the #GP comes straight from user
> > space which means the FPU registers contain the current tasks user space
> > state.
>
> Just to double confirm ... there is no point in the #GP handler up to this point
> where pre-emption can occur?

Same question here. The fixup function is called after cond_local_irq_enable().
If an interrupt comes before fixup_pasid_exception(), the interrupt may
use FPU and call kernel_fpu_begin_mask()->set(TIF_NEED_FPU_LOAD)->
__cpu_invalidate_fpregs_state(). Then writing to the IA32_PASID MSR. When
exiting to user, the FPU states will be restored to the FPU regs including
the IA32_PASID MSR. So the MSR could be different from the value written in
fixup_pasid_execption(). Is it possible?

Or should fixup_pasid_exception() be called before cond_local_irq_enable()?

Thanks.

-Fenghua

2021-09-29 19:28:42

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On 9/29/21 05:28, Thomas Gleixner wrote:
> On Wed, Sep 29 2021 at 11:54, Peter Zijlstra wrote:
>> On Fri, Sep 24, 2021 at 04:03:53PM -0700, Andy Lutomirski wrote:
>>> I think the perfect and the good are a bit confused here. If we go for
>>> "good", then we have an mm owning a PASID for its entire lifetime. If
>>> we want "perfect", then we should actually do it right: teach the
>>> kernel to update an entire mm's PASID setting all at once. This isn't
>>> *that* hard -- it involves two things:
>>>
>>> 1. The context switch code needs to resync PASID. Unfortunately, this
>>> adds some overhead to every context switch, although a static_branch
>>> could minimize it for non-PASID users.
>>
>>> 2. A change to an mm's PASID needs to sent an IPI, but that IPI can't
>>> touch FPU state. So instead the IPI should use task_work_add() to
>>> make sure PASID gets resynced.
>>
>> What do we need 1 for? Any PASID change can be achieved using 2 no?
>>
>> Basically, call task_work_add() on all relevant tasks [1], then IPI
>> spray the current running of those and presto.
>>
>> [1] it is nigh on impossible to find all tasks sharing an mm in any sane
>> way due to CLONE_MM && !CLONE_THREAD.
>
> Why would we want any of that at all?
>
> Process starts, no PASID assigned.
>
> bind to device -> PASID is allocated and assigned to the mm
>
> some task of the process issues ENQCMD -> #GP -> write PASID MSR
>
> After that the PASID is saved and restored as part of the XSTATE and
> there is no extra overhead in context switch or return to user space.
>
> All tasks of the process which did never use ENQCMD don't care and their
> PASID xstate is in init state.
>
> There is absolutely no point in enforcing that all tasks of the process
> have the PASID activated immediately when it is assigned. If they need
> it they get it via the #GP fixup and everything just works.
>
> Looking at that patch again, none of this muck in fpu__pasid_write() is
> required at all. The whole exception fixup is:
>
> if (!user_mode(regs))
> return false;
>
> if (!current->mm->pasid)
> return false;
>
> if (current->pasid_activated)
> return false;

<-- preemption or BH here: kaboom.

>
> wrmsrl(MSR_IA32_PASID, current->mm->pasid);

This needs the actual sane fpstate writing helper -- see other email.

2021-09-29 19:31:11

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

Hi, Tony,

On Wed, Sep 29, 2021 at 10:41:42AM -0700, Luck, Tony wrote:
> On Wed, Sep 29, 2021 at 07:15:53PM +0200, Thomas Gleixner wrote:
> > On Wed, Sep 29 2021 at 09:59, Andy Lutomirski wrote:
> > > On 9/29/21 05:28, Thomas Gleixner wrote:
> > >> Looking at that patch again, none of this muck in fpu__pasid_write() is
> > >> required at all. The whole exception fixup is:
> > >>
> > >> if (!user_mode(regs))
> > >> return false;
> > >>
> > >> if (!current->mm->pasid)
> > >> return false;
> > >>
> > >> if (current->pasid_activated)
> > >> return false;
> > >
> > > <-- preemption or BH here: kaboom.
> >
> > Sigh, this had obviously to run in the early portion of #GP, i.e. before
> > enabling interrupts.
>
> Like this? Obviously with some comment about why this is being done.
>
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..a848a59291e7 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -536,6 +536,12 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> unsigned long gp_addr;
> int ret;
>

Add
+#ifdef CONFIG_IOMMU_SUPPORT
because mm->pasid and current-pasid_activated are defined only if
CONFIG_IOMMU_SUPPORT is defined.

> + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {

Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
the IA32_PASID MSR is only used when ENQCMD is enabled?

> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> + return;
> + }
> +

+endif

> cond_local_irq_enable(regs);
>
> if (static_cpu_has(X86_FEATURE_UMIP)) {

Thanks.

-Fenghua

2021-09-29 20:09:43

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29 2021 at 11:31, Tony Luck wrote:
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index a58800973aed..5a3c87fd65de 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,
>
> #define GPFSTR "general protection fault"
>
> +/*
> + * When a user executes the ENQCMD instruction it will #GP
> + * fault if the IA32_PASID MSR has not been set up with a
> + * valid PASID.
> + * So if the process has been allocated a PASID (mm->pasid)
> + * AND the IA32_PASID MSR has not been initialized, try to
> + * fix this #GP by initializing the IA32_PASID MSR.
> + * If the #GP was for some other reason, it will trigger
> + * again, but this routine will return false and the #GP
> + * will be processed.
> + */
> +static void try_fixup_pasid(void)
> +{
> + if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
> + return false;
> +
> +#ifdef CONFIG_IOMMU_SUPPORT
> + if (current->mm->pasid && !current->pasid_activated) {
> + current->pasid_activated = 1;
> + wrmsrl(MSR_IA32_PASID, current->mm->pasid);
> + return true;
> + }
> +#endif
> + return false;
> +}
> +
> DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> {
> char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
> @@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
> unsigned long gp_addr;
> int ret;
>
> + if (user_mode(regs) && try_fixup_pasid())
> + return;
> +
> cond_local_irq_enable(regs);
>
> if (static_cpu_has(X86_FEATURE_UMIP)) {

Amen.

2021-09-29 21:00:50

[permalink] [raw]

Subject: Re: [PATCH 5/8] x86/mmu: Add mm-based PASID refcounting

On Wed, Sep 29, 2021 at 06:07:22PM +0000, Fenghua Yu wrote:
> Add
> +#ifdef CONFIG_IOMMU_SUPPORT
> because mm->pasid and current-pasid_activated are defined only if
> CONFIG_IOMMU_SUPPORT is defined.
>
> > + if (user_mode(regs) && current->mm->pasid && !current->pasid_activated) {
>
> Maybe need to add "&& cpu_feature_enabled(X86_FEATURE_ENQCMD)" because
> the IA32_PASID MSR is only used when ENQCMD is enabled?
>
> > + current->pasid_activated = 1;
> > + wrmsrl(MSR_IA32_PASID, current->mm->pasid | MSR_IA32_PASID_VALID);
> > + return;
> > + }
> > +
>
> +endif

New version that addresses those issues. Has ugly #ifdef in C
code :-( If that's unacceptable, then could create some stub
functions, or add a call to __try_fixup_pasid() that's in a
file in the iommu code that is only built when CONFIG_IOMMU_SUPPORT
is set. But either of those move the details far away from the
#GP handler so make extra work for anyone trying to follow along
with what is happening here.

-Tony

---

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a58800973aed..5a3c87fd65de 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -528,6 +528,32 @@ static enum kernel_gp_hint get_kernel_gp_address(struct pt_regs *regs,

#define GPFSTR "general protection fault"

+/*
+ * When a user executes the ENQCMD instruction it will #GP
+ * fault if the IA32_PASID MSR has not been set up with a
+ * valid PASID.
+ * So if the process has been allocated a PASID (mm->pasid)
+ * AND the IA32_PASID MSR has not been initialized, try to
+ * fix this #GP by initializing the IA32_PASID MSR.
+ * If the #GP was for some other reason, it will trigger
+ * again, but this routine will return false and the #GP
+ * will be processed.
+ */
+static void try_fixup_pasid(void)
+{
+ if (!cpu_feature_enabled(X86_FEATURE_ENQCMD))
+ return false;
+
+#ifdef CONFIG_IOMMU_SUPPORT
+ if (current->mm->pasid && !current->pasid_activated) {
+ current->pasid_activated = 1;
+ wrmsrl(MSR_IA32_PASID, current->mm->pasid);
+ return true;
+ }
+#endif
+ return false;
+}
+
DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
{
char desc[sizeof(GPFSTR) + 50 + 2*sizeof(unsigned long) + 1] = GPFSTR;
@@ -536,6 +562,9 @@ DEFINE_IDTENTRY_ERRORCODE(exc_general_protection)
unsigned long gp_addr;
int ret;

+ if (user_mode(regs) && try_fixup_pasid())
+ return;
+
cond_local_irq_enable(regs);

if (static_cpu_has(X86_FEATURE_UMIP)) {

2021-09-30 01:06:10