2022-05-30 11:51:28

by Miaohe Lin

[permalink] [raw]
Subject: [PATCH v4 1/4] mm: reduce the rcu lock duration

Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
extends the period of the rcu_read_lock until after the permissions checks
are done to prevent the task pointed to from changing from under us. But
the task_struct refcount is also taken at that time, the reference to task
is guaranteed to be stable. So it's unnecessary to extend the period of
the rcu_read_lock. Release the rcu lock after task refcount is successfully
grabbed to reduce the rcu holding time.

Reviewed-by: Muchun Song <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Oscar Salvador <[email protected]>
Signed-off-by: Miaohe Lin <[email protected]>
Cc: Huang Ying <[email protected]>
Cc: David Howells <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 3 +--
mm/migrate.c | 3 +--
2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b4ba3ee810e..2dad094177bf 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
goto out;
}
get_task_struct(task);
+ rcu_read_unlock();

err = -EINVAL;

@@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
* Use the regular "ptrace_may_access()" checks.
*/
if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
- rcu_read_unlock();
err = -EPERM;
goto out_put;
}
- rcu_read_unlock();

task_nodes = cpuset_mems_allowed(task);
/* Is the user allowed to access the target nodes? */
diff --git a/mm/migrate.c b/mm/migrate.c
index e51588e95f57..e88ebb88fa6f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
return ERR_PTR(-ESRCH);
}
get_task_struct(task);
+ rcu_read_unlock();

/*
* Check if this process has the right to modify the specified
* process. Use the regular "ptrace_may_access()" checks.
*/
if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
- rcu_read_unlock();
mm = ERR_PTR(-EPERM);
goto out;
}
- rcu_read_unlock();

mm = ERR_PTR(security_task_movememory(task));
if (IS_ERR(mm))
--
2.23.0



2022-05-31 21:51:26

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On 30.05.22 13:30, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.
>
> Reviewed-by: Muchun Song <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Miaohe Lin <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Christoph Lameter <[email protected]>


Reviewed-by: David Hildenbrand <[email protected]>

--
Thanks,

David / dhildenb


2022-05-31 22:45:52

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On Tue, May 31, 2022 at 01:58:31PM +0100, Matthew Wilcox wrote:
> On Mon, May 30, 2022 at 07:30:13PM +0800, Miaohe Lin wrote:
> > Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> > extends the period of the rcu_read_lock until after the permissions checks
> > are done to prevent the task pointed to from changing from under us. But
> > the task_struct refcount is also taken at that time, the reference to task
> > is guaranteed to be stable. So it's unnecessary to extend the period of
> > the rcu_read_lock. Release the rcu lock after task refcount is successfully
> > grabbed to reduce the rcu holding time.
>
> But why bother? You know the RCU read lock isn't a "real" lock, right?

Looking over this code some more, I think this may harm performance.
ptrace_may_access() itself takes the rcu_read_lock(). So we currently
have:

rcu_read_lock()
rcu_read_lock();
rcu_read_unlock();
rcu_read_unlock();

In at least one RCU configuration, rcu_read_lock() maps to
preempt_disable(). Nested preempt_disable() just bump a counter, while
that counter reaching zero incurs some actual work. So nested
rcu_read_lock() can be better than sequential lock/unlock/lock/unlock.

This needs far better justification.

2022-06-01 14:13:47

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On 2022/5/31 14:06, Ying Huang wrote:
> On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>> extends the period of the rcu_read_lock until after the permissions checks
>> are done to prevent the task pointed to from changing from under us. But
>> the task_struct refcount is also taken at that time, the reference to task
>> is guaranteed to be stable. So it's unnecessary to extend the period of
>> the rcu_read_lock. Release the rcu lock after task refcount is successfully
>> grabbed to reduce the rcu holding time.
>
> Sorry for late reply, I am busy on something else recently.

That's all right. Many thanks for your hard work. :)

>
> I have just read the whole thread of the original patch discussion.
> During discussion, in
>
> https://lore.kernel.org/lkml/[email protected]/
>
> a patch that is same as your one is proposed. Then in the following
> message, Eric think that the rcu read lock should be released until
> permission is checked,
>
> https://lore.kernel.org/lkml/[email protected]/
>
> "
> At the moment I suspect the permissions checks are not safe unless
> performed under both rcu_read_lock and task_lock to ensure that
> the task<->mm association does not change on us while we are
> working. Even with that the cred can change under us but at least
> we know the cred will be valid until rcu_read_unlock happens.
> "
>
> So the rcu lock duration is enlarged in the following message.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> But, after some thought, I don't think extended rcu read lock adds much
> value. Because after permission checking the permission may still be
> changed. There's no much difference.
>
> So, I have no objection to the patch itself. But you should add more
> information in patch description about why the RCU proected region is
> extended and why we can reduce it.

Does below patch description makes sense for you?

"
Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
extends the period of the rcu_read_lock until after the permissions checks
are done because it suspects the permissions checks are not safe unless
performed under both rcu_read_lock and task_lock to ensure the task<->mm
association does not change on us while we are working [1]. But extended
rcu read lock does not add much value. Because after permission checking
the permission may still be changed. There's no much difference. So it's
unnecessary to extend the period of the rcu_read_lock. Release the rcu
lock after task refcount is successfully grabbed to reduce the rcu holding
time.

[1] https://lore.kernel.org/lkml/[email protected]/
"

>
> Best Regards,
> Huang, Ying

Thanks again!

>
>> Reviewed-by: Muchun Song <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Reviewed-by: Oscar Salvador <[email protected]>
>> Signed-off-by: Miaohe Lin <[email protected]>
>> Cc: Huang Ying <[email protected]>
>> Cc: David Howells <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> ---
>>  mm/mempolicy.c | 3 +--
>>  mm/migrate.c | 3 +--
>>  2 files changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 0b4ba3ee810e..2dad094177bf 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>>   goto out;
>>   }
>>   get_task_struct(task);
>> + rcu_read_unlock();
>>  
>>
>>
>>
>>   err = -EINVAL;
>>  
>>
>>
>>
>> @@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>>   * Use the regular "ptrace_may_access()" checks.
>>   */
>>   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
>> - rcu_read_unlock();
>>   err = -EPERM;
>>   goto out_put;
>>   }
>> - rcu_read_unlock();
>>  
>>
>>
>>
>>   task_nodes = cpuset_mems_allowed(task);
>>   /* Is the user allowed to access the target nodes? */
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index e51588e95f57..e88ebb88fa6f 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
>>   return ERR_PTR(-ESRCH);
>>   }
>>   get_task_struct(task);
>> + rcu_read_unlock();
>>  
>>
>>
>>
>>   /*
>>   * Check if this process has the right to modify the specified
>>   * process. Use the regular "ptrace_may_access()" checks.
>>   */
>>   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
>> - rcu_read_unlock();
>>   mm = ERR_PTR(-EPERM);
>>   goto out;
>>   }
>> - rcu_read_unlock();
>>  
>>
>>
>>
>>   mm = ERR_PTR(security_task_movememory(task));
>>   if (IS_ERR(mm))
>
>
>
> .
>


2022-06-01 19:58:14

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.

Sorry for late reply, I am busy on something else recently.

I have just read the whole thread of the original patch discussion.
During discussion, in

https://lore.kernel.org/lkml/[email protected]/

a patch that is same as your one is proposed. Then in the following
message, Eric think that the rcu read lock should be released until
permission is checked,

https://lore.kernel.org/lkml/[email protected]/

"
At the moment I suspect the permissions checks are not safe unless
performed under both rcu_read_lock and task_lock to ensure that
the task<->mm association does not change on us while we are
working. Even with that the cred can change under us but at least
we know the cred will be valid until rcu_read_unlock happens.
"

So the rcu lock duration is enlarged in the following message.

https://lore.kernel.org/lkml/[email protected]/

But, after some thought, I don't think extended rcu read lock adds much
value. Because after permission checking the permission may still be
changed. There's no much difference.

So, I have no objection to the patch itself. But you should add more
information in patch description about why the RCU proected region is
extended and why we can reduce it.

Best Regards,
Huang, Ying

> Reviewed-by: Muchun Song <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Oscar Salvador <[email protected]>
> Signed-off-by: Miaohe Lin <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> ---
>  mm/mempolicy.c | 3 +--
>  mm/migrate.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 0b4ba3ee810e..2dad094177bf 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1609,6 +1609,7 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>   goto out;
>   }
>   get_task_struct(task);
> + rcu_read_unlock();
>  
>
>
>
>   err = -EINVAL;
>  
>
>
>
> @@ -1617,11 +1618,9 @@ static int kernel_migrate_pages(pid_t pid, unsigned long maxnode,
>   * Use the regular "ptrace_may_access()" checks.
>   */
>   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> - rcu_read_unlock();
>   err = -EPERM;
>   goto out_put;
>   }
> - rcu_read_unlock();
>  
>
>
>
>   task_nodes = cpuset_mems_allowed(task);
>   /* Is the user allowed to access the target nodes? */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index e51588e95f57..e88ebb88fa6f 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1902,17 +1902,16 @@ static struct mm_struct *find_mm_struct(pid_t pid, nodemask_t *mem_nodes)
>   return ERR_PTR(-ESRCH);
>   }
>   get_task_struct(task);
> + rcu_read_unlock();
>  
>
>
>
>   /*
>   * Check if this process has the right to modify the specified
>   * process. Use the regular "ptrace_may_access()" checks.
>   */
>   if (!ptrace_may_access(task, PTRACE_MODE_READ_REALCREDS)) {
> - rcu_read_unlock();
>   mm = ERR_PTR(-EPERM);
>   goto out;
>   }
> - rcu_read_unlock();
>  
>
>
>
>   mm = ERR_PTR(security_task_movememory(task));
>   if (IS_ERR(mm))



2022-06-01 20:12:34

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On Mon, May 30, 2022 at 07:30:13PM +0800, Miaohe Lin wrote:
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done to prevent the task pointed to from changing from under us. But
> the task_struct refcount is also taken at that time, the reference to task
> is guaranteed to be stable. So it's unnecessary to extend the period of
> the rcu_read_lock. Release the rcu lock after task refcount is successfully
> grabbed to reduce the rcu holding time.

But why bother? You know the RCU read lock isn't a "real" lock, right?

2022-06-01 21:18:24

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

Miaohe Lin <[email protected]> writes:

> On 2022/5/31 14:06, Ying Huang wrote:
>> On Mon, 2022-05-30 at 19:30 +0800, Miaohe Lin wrote:
>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>> extends the period of the rcu_read_lock until after the permissions checks
>>> are done to prevent the task pointed to from changing from under us. But
>>> the task_struct refcount is also taken at that time, the reference to task
>>> is guaranteed to be stable. So it's unnecessary to extend the period of
>>> the rcu_read_lock. Release the rcu lock after task refcount is successfully
>>> grabbed to reduce the rcu holding time.
>>
>> Sorry for late reply, I am busy on something else recently.
>
> That's all right. Many thanks for your hard work. :)
>
>>
>> I have just read the whole thread of the original patch discussion.
>> During discussion, in
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> a patch that is same as your one is proposed. Then in the following
>> message, Eric think that the rcu read lock should be released until
>> permission is checked,
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> "
>> At the moment I suspect the permissions checks are not safe unless
>> performed under both rcu_read_lock and task_lock to ensure that
>> the task<->mm association does not change on us while we are
>> working. Even with that the cred can change under us but at least
>> we know the cred will be valid until rcu_read_unlock happens.
>> "
>>
>> So the rcu lock duration is enlarged in the following message.
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> But, after some thought, I don't think extended rcu read lock adds much
>> value. Because after permission checking the permission may still be
>> changed. There's no much difference.
>>
>> So, I have no objection to the patch itself. But you should add more
>> information in patch description about why the RCU proected region is
>> extended and why we can reduce it.
>
> Does below patch description makes sense for you?
>
> "
> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
> extends the period of the rcu_read_lock until after the permissions checks
> are done because it suspects the permissions checks are not safe unless
> performed under both rcu_read_lock and task_lock to ensure the task<->mm
> association does not change on us while we are working [1]. But extended
> rcu read lock does not add much value. Because after permission checking
> the permission may still be changed. There's no much difference. So it's
> unnecessary to extend the period of the rcu_read_lock. Release the rcu
> lock after task refcount is successfully grabbed to reduce the rcu holding
> time.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> "

It doesn't make sense to me.

I don't see any sleeping functions called from find_mm_struct or
kernel_migrate_pages in the area kernel_migrate_pages in the area of the
code protected by get_task_struct. So at a very basic level I see a
justification for dirtying a cache line twice with get_task_struct and
put_task_struct to reduce rcu_read_lock hold times.

I would contend that a reasonable cleanup based up on the current state
of the code would be to extend the rcu_read_lock over get_task_mm so
that a reference to task_struct does not need to be taken. That has
the potential to reduce contention and reduce lock hold times.


The code is missing a big fat comment with the assertion that it is ok
if the permission checks are racy because the race is small, and the
worst case thing that happens is the page is migrated to another
numa node.


Given that the get_mm_task takes task_lock the cost of dirtying the
cache line is already being paid. Perhaps not extending task_lock hold
times a little bit is justified, but I haven't seen that case made.

This seems like code that is called little enough it would be better for
it to be correct, and not need big fat comments explaining why it
doesn't matter that they code is deliberately buggy.


In short it does not make sense to me to justify a patch for performance
reasons when it appears that extending the rcu_read_lock hold time and
not touch the task reference count would stop dirtying a cache line and
likely have more impact.

Eric

2022-06-01 21:29:42

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

Miaohe Lin <[email protected]> writes:

> On 2022/6/1 0:09, Eric W. Biederman wrote:
>> Miaohe Lin <[email protected]> writes:
> snip
>>>
>>> "
>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>> extends the period of the rcu_read_lock until after the permissions checks
>>> are done because it suspects the permissions checks are not safe unless
>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>> association does not change on us while we are working [1]. But extended
>>> rcu read lock does not add much value. Because after permission checking
>>> the permission may still be changed. There's no much difference. So it's
>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>> time.
>>>
>>> [1] https://lore.kernel.org/lkml/[email protected]/
>>> "
>>
>> It doesn't make sense to me.
>>
>> I don't see any sleeping functions called from find_mm_struct or
>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>> code protected by get_task_struct. So at a very basic level I see a
>> justification for dirtying a cache line twice with get_task_struct and
>> put_task_struct to reduce rcu_read_lock hold times.
>>
>> I would contend that a reasonable cleanup based up on the current state
>> of the code would be to extend the rcu_read_lock over get_task_mm so
>
> If so, security_task_movememory will be called inside rcu lock. It might
> call sleeping functions, e.g. smack_log(). I think it's not a good
> idea.

In general the security functions are not allowed to sleep.
The audit mechanism typically preallocates memory so it does
not need to sleep. From a quick skim through the code I don't
see smack_log sleeping.

Certainly the security hooks are not supposed to be inserted into the
code that they prevent reasonable implementation decisions.

Which is to say if there is a good (non-security hook reason) for
supporting sleeping deal with it. Otherwise the security hooks has a
bug and needs to get fixed/removed.

>> that a reference to task_struct does not need to be taken. That has
>> the potential to reduce contention and reduce lock hold times.
>>
>>
>> The code is missing a big fat comment with the assertion that it is ok
>> if the permission checks are racy because the race is small, and the
>> worst case thing that happens is the page is migrated to another
>> numa node.
>>
>>
>> Given that the get_mm_task takes task_lock the cost of dirtying the
>> cache line is already being paid. Perhaps not extending task_lock hold
>> times a little bit is justified, but I haven't seen that case made.
>>
>> This seems like code that is called little enough it would be better for
>> it to be correct, and not need big fat comments explaining why it
>> doesn't matter that they code is deliberately buggy.
>>
>
> Agree. A big fat comments will make code hard to follow.

No.

The code is impossible to follow currently.

The code either requires a comment pointing out that it is deliberately
racy, or the code needs to be fixed.

Clever and subtle code always requires a comment if for no other
reason then to alert the reader that something a typical is going on.

>> In short it does not make sense to me to justify a patch for performance
>> reasons when it appears that extending the rcu_read_lock hold time and
>> not touch the task reference count would stop dirtying a cache line and
>> likely have more impact.
>
> IMHO, incremented task refcount should make code works correctly. And extending
> the rcu_read_lock over get_task_mm will break the things because sleeping functions
> might be called while holding rcu lock.

Which sleeping functions?

I can't find any. In particular smack_task_movememory calls
smk_curacc_on_task which is the same function called by
security_task_getpgid. security_task_getpgid is called
under rcu_read_lock. So smack won't sleep.

> Does the patch itself makes sense for you? Should I rephase the commit log further?
> I'm afraid I didn't get your point correctly.

The patch makes no sense to me because I don't see it doing anything
worth doing.

get/put_task_struct both dirty a cache line and are expensive especially
when contended. Dirtying a cache line that is contended is the pretty
much the most expensive native cpu operation. In pathological scenarios
I have seen it take up to 1s. Realistically in a cache cold scenario
(which is not as bad as a contended scenario) you are looking at 100ns
or more just to execute get_task_struct/put_task_struct. That is the
kind of cost it would be nice to avoid all together (even if the
pathological case never comes up).

So I see two pieces of code where we could use the cheap operation
rcu_read_lock/rcu_read_unlock to remove the expensive operation
get_task_struct/put_task_struct. Instead I see people removing
rcu_read_lock/rcu_read_unlock.

That makes no sense. Especially as your implied reason for making this
change is to make the code have better performance. Improving
performance is the reason for making critical sections smaller isn't it?

Eric

2022-06-03 18:12:04

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On 2022/6/1 22:37, Eric W. Biederman wrote:
> Miaohe Lin <[email protected]> writes:
>
>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>> Miaohe Lin <[email protected]> writes:
>> snip
>>>>
>>>> "
>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>> are done because it suspects the permissions checks are not safe unless
>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>> association does not change on us while we are working [1]. But extended
>>>> rcu read lock does not add much value. Because after permission checking
>>>> the permission may still be changed. There's no much difference. So it's
>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>> time.
>>>>
>>>> [1] https://lore.kernel.org/lkml/[email protected]/
>>>> "
>>>
>>> It doesn't make sense to me.
>>>
>>> I don't see any sleeping functions called from find_mm_struct or
>>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>>> code protected by get_task_struct. So at a very basic level I see a
>>> justification for dirtying a cache line twice with get_task_struct and
>>> put_task_struct to reduce rcu_read_lock hold times.
>>>
>>> I would contend that a reasonable cleanup based up on the current state
>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>
>> If so, security_task_movememory will be called inside rcu lock. It might
>> call sleeping functions, e.g. smack_log(). I think it's not a good
>> idea.
>
> In general the security functions are not allowed to sleep.
> The audit mechanism typically preallocates memory so it does
> not need to sleep. From a quick skim through the code I don't
> see smack_log sleeping.
>
> Certainly the security hooks are not supposed to be inserted into the
> code that they prevent reasonable implementation decisions.
>
> Which is to say if there is a good (non-security hook reason) for
> supporting sleeping deal with it. Otherwise the security hooks has a
> bug and needs to get fixed/removed.

I see. Many thanks for explanation.

>
>>> that a reference to task_struct does not need to be taken. That has
>>> the potential to reduce contention and reduce lock hold times.
>>>
>>>
>>> The code is missing a big fat comment with the assertion that it is ok
>>> if the permission checks are racy because the race is small, and the
>>> worst case thing that happens is the page is migrated to another
>>> numa node.
>>>
>>>
>>> Given that the get_mm_task takes task_lock the cost of dirtying the
>>> cache line is already being paid. Perhaps not extending task_lock hold
>>> times a little bit is justified, but I haven't seen that case made.
>>>
>>> This seems like code that is called little enough it would be better for
>>> it to be correct, and not need big fat comments explaining why it
>>> doesn't matter that they code is deliberately buggy.
>>>
>>
>> Agree. A big fat comments will make code hard to follow.
>
> No.
>
> The code is impossible to follow currently.
>
> The code either requires a comment pointing out that it is deliberately
> racy, or the code needs to be fixed.
>
> Clever and subtle code always requires a comment if for no other
> reason then to alert the reader that something a typical is going on.

Yes, clever and subtle code requires a comment but others might not.

>
>>> In short it does not make sense to me to justify a patch for performance
>>> reasons when it appears that extending the rcu_read_lock hold time and
>>> not touch the task reference count would stop dirtying a cache line and
>>> likely have more impact.
>>
>> IMHO, incremented task refcount should make code works correctly. And extending
>> the rcu_read_lock over get_task_mm will break the things because sleeping functions
>> might be called while holding rcu lock.
>
> Which sleeping functions?
>
> I can't find any. In particular smack_task_movememory calls
> smk_curacc_on_task which is the same function called by
> security_task_getpgid. security_task_getpgid is called
> under rcu_read_lock. So smack won't sleep.

Sorry, I didn't take a close look at smack_log code. So I thought it could sleep.

>
>> Does the patch itself makes sense for you? Should I rephase the commit log further?
>> I'm afraid I didn't get your point correctly.
>
> The patch makes no sense to me because I don't see it doing anything
> worth doing.
>
> get/put_task_struct both dirty a cache line and are expensive especially
> when contended. Dirtying a cache line that is contended is the pretty
> much the most expensive native cpu operation. In pathological scenarios
> I have seen it take up to 1s. Realistically in a cache cold scenario
> (which is not as bad as a contended scenario) you are looking at 100ns
> or more just to execute get_task_struct/put_task_struct. That is the
> kind of cost it would be nice to avoid all together (even if the
> pathological case never comes up).
>
> So I see two pieces of code where we could use the cheap operation
> rcu_read_lock/rcu_read_unlock to remove the expensive operation
> get_task_struct/put_task_struct. Instead I see people removing
> rcu_read_lock/rcu_read_unlock.
>
> That makes no sense. Especially as your implied reason for making this
> change is to make the code have better performance. Improving
> performance is the reason for making critical sections smaller isn't it?
>

I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
remove the expensive operation get_task_struct/put_task_struct and thus avoid possible
cacheline penalty. But is the extended rcu lock enough to ensure the task reference
is stable when calling security check functions and performing cpuset node validation?
Or this race is just OK and can be left alone with a comment?

> Eric

Many thanks for your comment and suggestion!

> .
>

2022-06-06 05:15:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

Miaohe Lin <[email protected]> writes:

> On 2022/6/1 22:37, Eric W. Biederman wrote:
>> Miaohe Lin <[email protected]> writes:
>>
>>> On 2022/6/1 0:09, Eric W. Biederman wrote:
>>>> Miaohe Lin <[email protected]> writes:
>>> snip
>>>>>
>>>>> "
>>>>> Commit 3268c63eded4 ("mm: fix move/migrate_pages() race on task struct")
>>>>> extends the period of the rcu_read_lock until after the permissions checks
>>>>> are done because it suspects the permissions checks are not safe unless
>>>>> performed under both rcu_read_lock and task_lock to ensure the task<->mm
>>>>> association does not change on us while we are working [1]. But extended
>>>>> rcu read lock does not add much value. Because after permission checking
>>>>> the permission may still be changed. There's no much difference. So it's
>>>>> unnecessary to extend the period of the rcu_read_lock. Release the rcu
>>>>> lock after task refcount is successfully grabbed to reduce the rcu holding
>>>>> time.
>>>>>
>>>>> [1] https://lore.kernel.org/lkml/[email protected]/
>>>>> "
>>>>
>>>> It doesn't make sense to me.
>>>>
>>>> I don't see any sleeping functions called from find_mm_struct or
>>>> kernel_migrate_pages in the area kernel_migrate_pages in the area of the
>>>> code protected by get_task_struct. So at a very basic level I see a
>>>> justification for dirtying a cache line twice with get_task_struct and
>>>> put_task_struct to reduce rcu_read_lock hold times.
>>>>
>>>> I would contend that a reasonable cleanup based up on the current state
>>>> of the code would be to extend the rcu_read_lock over get_task_mm so
>>>
>>> If so, security_task_movememory will be called inside rcu lock. It might
>>> call sleeping functions, e.g. smack_log(). I think it's not a good
>>> idea.
>>
>> In general the security functions are not allowed to sleep.
>> The audit mechanism typically preallocates memory so it does
>> not need to sleep. From a quick skim through the code I don't
>> see smack_log sleeping.
>>
>> Certainly the security hooks are not supposed to be inserted into the
>> code that they prevent reasonable implementation decisions.
>>
>> Which is to say if there is a good (non-security hook reason) for
>> supporting sleeping deal with it. Otherwise the security hooks has a
>> bug and needs to get fixed/removed.
>
> I see. Many thanks for explanation.
>
>>
>>>> that a reference to task_struct does not need to be taken. That has
>>>> the potential to reduce contention and reduce lock hold times.
>>>>
>>>>
>>>> The code is missing a big fat comment with the assertion that it is ok
>>>> if the permission checks are racy because the race is small, and the
>>>> worst case thing that happens is the page is migrated to another
>>>> numa node.
>>>>
>>>>
>>>> Given that the get_mm_task takes task_lock the cost of dirtying the
>>>> cache line is already being paid. Perhaps not extending task_lock hold
>>>> times a little bit is justified, but I haven't seen that case made.
>>>>
>>>> This seems like code that is called little enough it would be better for
>>>> it to be correct, and not need big fat comments explaining why it
>>>> doesn't matter that they code is deliberately buggy.
>>>>
>>>
>>> Agree. A big fat comments will make code hard to follow.
>>
>> No.
>>
>> The code is impossible to follow currently.
>>
>> The code either requires a comment pointing out that it is deliberately
>> racy, or the code needs to be fixed.
>>
>> Clever and subtle code always requires a comment if for no other
>> reason then to alert the reader that something a typical is going on.
>
> Yes, clever and subtle code requires a comment but others might not.
>
>>
>>>> In short it does not make sense to me to justify a patch for performance
>>>> reasons when it appears that extending the rcu_read_lock hold time and
>>>> not touch the task reference count would stop dirtying a cache line and
>>>> likely have more impact.
>>>
>>> IMHO, incremented task refcount should make code works correctly. And extending
>>> the rcu_read_lock over get_task_mm will break the things because sleeping functions
>>> might be called while holding rcu lock.
>>
>> Which sleeping functions?
>>
>> I can't find any. In particular smack_task_movememory calls
>> smk_curacc_on_task which is the same function called by
>> security_task_getpgid. security_task_getpgid is called
>> under rcu_read_lock. So smack won't sleep.
>
> Sorry, I didn't take a close look at smack_log code. So I thought it could sleep.
>
>>
>>> Does the patch itself makes sense for you? Should I rephase the commit log further?
>>> I'm afraid I didn't get your point correctly.
>>
>> The patch makes no sense to me because I don't see it doing anything
>> worth doing.
>>
>> get/put_task_struct both dirty a cache line and are expensive especially
>> when contended. Dirtying a cache line that is contended is the pretty
>> much the most expensive native cpu operation. In pathological scenarios
>> I have seen it take up to 1s. Realistically in a cache cold scenario
>> (which is not as bad as a contended scenario) you are looking at 100ns
>> or more just to execute get_task_struct/put_task_struct. That is the
>> kind of cost it would be nice to avoid all together (even if the
>> pathological case never comes up).
>>
>> So I see two pieces of code where we could use the cheap operation
>> rcu_read_lock/rcu_read_unlock to remove the expensive operation
>> get_task_struct/put_task_struct. Instead I see people removing
>> rcu_read_lock/rcu_read_unlock.
>>
>> That makes no sense. Especially as your implied reason for making this
>> change is to make the code have better performance. Improving
>> performance is the reason for making critical sections smaller isn't it?
>>
>
> I think you're right. We should extend the rcu_read_lock over get_task_mm so we can
> remove the expensive operation get_task_struct/put_task_struct and thus avoid possible
> cacheline penalty. But is the extended rcu lock enough to ensure the task reference
> is stable when calling security check functions and performing cpuset node validation?
> Or this race is just OK and can be left alone with a comment?

The extending the rcu_read_lock is just about removing the expense of
get_task_struct/put_task_struct. It can be handled separately from
other issues at play in this code.


The rcu_read_lock guarantees that task does not go away. The
rcu_read_lock does not guarantee that task->mm does not change.



A separate but related issue is should the task_lock in get_task_mm
be extended to cover the security checks so that everything checks
against the same mm.

Possibly the code could be refactored or reordered so that everything
is guaranteed to check against the same mm.


If the checks are not made to guarantee they are all checking against
the same mm, the code needs a comment to say that there is a tiny race.
The comment should say we don't care about the tiny race because
the worst that can happen is that a page is migrated to a different
numa node. And so we don't care.


Eric

2022-06-18 00:44:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On Tue, 7 Jun 2022 17:19:53 +0800 Miaohe Lin <[email protected]> wrote:

> >
> >
> > If the checks are not made to guarantee they are all checking against
> > the same mm, the code needs a comment to say that there is a tiny race.
> > The comment should say we don't care about the tiny race because
> > the worst that can happen is that a page is migrated to a different
> > numa node. And so we don't care.
> >
> >
>
> I tend to do this one. To make sure, is the below code change what you suggest?

Eric went quiet.

As I understand it, any changes arising from this discussion can be
done as a followup patch. So I'm planning on moving this 4-patch
series into the non-rebasing mm-stable branch late next week.


2022-06-18 03:21:47

by Miaohe Lin

[permalink] [raw]
Subject: Re: [PATCH v4 1/4] mm: reduce the rcu lock duration

On 2022/6/18 8:23, Andrew Morton wrote:
> On Tue, 7 Jun 2022 17:19:53 +0800 Miaohe Lin <[email protected]> wrote:
>
>>>
>>>
>>> If the checks are not made to guarantee they are all checking against
>>> the same mm, the code needs a comment to say that there is a tiny race.
>>> The comment should say we don't care about the tiny race because
>>> the worst that can happen is that a page is migrated to a different
>>> numa node. And so we don't care.
>>>
>>>
>>
>> I tend to do this one. To make sure, is the below code change what you suggest?
>
> Eric went quiet.
>
> As I understand it, any changes arising from this discussion can be
> done as a followup patch. So I'm planning on moving this 4-patch
> series into the non-rebasing mm-stable branch late next week.

I will send a followup patch when discussion is done. Thanks.

>
>
> .
>