2018-03-26 18:22:16

by Yang Shi

[permalink] [raw]
Subject: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

mmap_sem is on the hot path of kernel, and it very contended, but it is
abused too. It is used to protect arg_start|end and evn_start|end when
reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
sense since those proc files just expect to read 4 values atomically and
not related to VM, they could be set to arbitrary values by C/R.

And, the mmap_sem contention may cause unexpected issue like below:

INFO: task ps:14018 blocked for more than 120 seconds.
Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
ps D 0 14018 1 0x00000004
ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
Call Trace:
[<ffffffff817154d0>] ? __schedule+0x250/0x730
[<ffffffff817159e6>] schedule+0x36/0x80
[<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
[<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
[<ffffffff81717db0>] down_read+0x20/0x40
[<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
[<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
[<ffffffff81241d87>] __vfs_read+0x37/0x150
[<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
[<ffffffff81242266>] vfs_read+0x96/0x130
[<ffffffff812437b5>] SyS_read+0x55/0xc0
[<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
for them to mitigate the abuse of mmap_sem.

So, introduce a new spinlock in mm_struct to protect the concurrent access
to arg_start|end and env_start|end.

And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
sem for writing to protect against others") changed down_read to
down_write to avoid write race condition in prctl_set_mm(). Since we
already have dedicated lock to protect them, it is safe to change back
to down_read.

Signed-off-by: Yang Shi <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mateusz Guzik <[email protected]>
Cc: Cyrill Gorcunov <[email protected]>
---
v1 --> v2:
* Use spinlock instead of rwlock per Mattew's suggestion
* Replace down_write to down_read in prctl_set_mm (see commit log for details)

fs/proc/base.c | 8 ++++----
include/linux/mm_types.h | 2 ++
kernel/fork.c | 1 +
kernel/sys.c | 14 ++++++++++----
mm/init-mm.c | 1 +
5 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324..e0282b6 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
goto out_mmput;
}

- down_read(&mm->mmap_sem);
+ spin_lock(&mm->arg_lock);
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;
- up_read(&mm->mmap_sem);
+ spin_unlock(&mm->arg_lock);

BUG_ON(arg_start > arg_end);
BUG_ON(env_start > env_end);
@@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf,
if (!mmget_not_zero(mm))
goto free;

- down_read(&mm->mmap_sem);
+ spin_lock(&mm->arg_lock);
env_start = mm->env_start;
env_end = mm->env_end;
- up_read(&mm->mmap_sem);
+ spin_unlock(&mm->arg_lock);

while (count > 0) {
size_t this_len, max_len;
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fd1af6b..3be4588 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -413,6 +413,8 @@ struct mm_struct {
unsigned long def_flags;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
+
+ spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */
unsigned long arg_start, arg_end, env_start, env_end;

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
diff --git a/kernel/fork.c b/kernel/fork.c
index e5d9d40..6540ae7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
mm->pinned_vm = 0;
memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
+ spin_lock_init(&mm->arg_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
mm_init_owner(mm, p);
diff --git a/kernel/sys.c b/kernel/sys.c
index f2289de..17bddd2 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
return error;
}

- down_write(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);

/*
* We don't validate if these members are pointing to
@@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
mm->start_brk = prctl_map.start_brk;
mm->brk = prctl_map.brk;
mm->start_stack = prctl_map.start_stack;
+
+ spin_lock(&mm->arg_lock);
mm->arg_start = prctl_map.arg_start;
mm->arg_end = prctl_map.arg_end;
mm->env_start = prctl_map.env_start;
mm->env_end = prctl_map.env_end;
+ spin_unlock(&mm->arg_lock);

/*
* Note this update of @saved_auxv is lockless thus
@@ -1996,7 +1999,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
if (prctl_map.auxv_size)
memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));

- up_write(&mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return 0;
}
#endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -2063,7 +2066,7 @@ static int prctl_set_mm(int opt, unsigned long addr,

error = -EINVAL;

- down_write(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);
vma = find_vma(mm, addr);

prctl_map.start_code = mm->start_code;
@@ -2149,14 +2152,17 @@ static int prctl_set_mm(int opt, unsigned long addr,
mm->start_brk = prctl_map.start_brk;
mm->brk = prctl_map.brk;
mm->start_stack = prctl_map.start_stack;
+
+ spin_lock(&mm->arg_lock);
mm->arg_start = prctl_map.arg_start;
mm->arg_end = prctl_map.arg_end;
mm->env_start = prctl_map.env_start;
mm->env_end = prctl_map.env_end;
+ spin_unlock(&mm->arg_lock);

error = 0;
out:
- up_write(&mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return error;
}

diff --git a/mm/init-mm.c b/mm/init-mm.c
index f94d5d1..66cce4c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -23,6 +23,7 @@ struct mm_struct init_mm = {
.mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
.page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(init_mm.mmlist),
+ .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
.user_ns = &init_user_ns,
INIT_MM_CONTEXT(init_mm)
};
--
1.8.3.1



2018-03-26 18:38:48

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> +++ b/kernel/sys.c
> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> return error;
> }
>
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
>
> /*
> * We don't validate if these members are pointing to
> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> mm->start_brk = prctl_map.start_brk;
> mm->brk = prctl_map.brk;
> mm->start_stack = prctl_map.start_stack;
> +
> + spin_lock(&mm->arg_lock);
> mm->arg_start = prctl_map.arg_start;
> mm->arg_end = prctl_map.arg_end;
> mm->env_start = prctl_map.env_start;
> mm->env_end = prctl_map.env_end;
> + spin_unlock(&mm->arg_lock);
>
> /*
> * Note this update of @saved_auxv is lockless thus

I see the argument for the change to a write lock was because of a BUG
validating arg_start and arg_end, but more generally, we are updating these
values, so a write-lock is probably a good idea, and this is a very rare
operation to do, so we don't care about making this more parallel. I would
not make this change (but if other more knowledgable people in this area
disagree with me, I will withdraw my objection to this part).


2018-03-26 19:22:49

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote:
> On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> > +++ b/kernel/sys.c
> > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> > return error;
> > }
> >
> > - down_write(&mm->mmap_sem);
> > + down_read(&mm->mmap_sem);
> >
> > /*
> > * We don't validate if these members are pointing to
> > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> > mm->start_brk = prctl_map.start_brk;
> > mm->brk = prctl_map.brk;
> > mm->start_stack = prctl_map.start_stack;
> > +
> > + spin_lock(&mm->arg_lock);
> > mm->arg_start = prctl_map.arg_start;
> > mm->arg_end = prctl_map.arg_end;
> > mm->env_start = prctl_map.env_start;
> > mm->env_end = prctl_map.env_end;
> > + spin_unlock(&mm->arg_lock);
> >
> > /*
> > * Note this update of @saved_auxv is lockless thus
>
> I see the argument for the change to a write lock was because of a BUG
> validating arg_start and arg_end, but more generally, we are updating these
> values, so a write-lock is probably a good idea, and this is a very rare
> operation to do, so we don't care about making this more parallel. I would
> not make this change (but if other more knowledgable people in this area
> disagree with me, I will withdraw my objection to this part).

Say we've two syscalls running prctl_set_mm_map in parallel, and imagine
one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30
and @brk = 20. Since now the call is guarded by _read_ the both calls
unlocked and due to OO engine it may happen then when both finish
we have @start_brk = 30 and @brk = 10. In turn "write" semaphore
has been take to have consistent data on exit, either you have [20;10]
or [30;20] assigned not something mixed.

That said I think using read-lock here would be a bug.

Cyrill

2018-03-26 19:43:34

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
> mmap_sem is on the hot path of kernel, and it very contended, but it is
> abused too. It is used to protect arg_start|end and evn_start|end when
> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
> sense since those proc files just expect to read 4 values atomically and
> not related to VM, they could be set to arbitrary values by C/R.
>

They are not arbitrary - there is basic validation performed when
setting them.

> And, the mmap_sem contention may cause unexpected issue like below:
>
> INFO: task ps:14018 blocked for more than 120 seconds.
> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> ps D 0 14018 1 0x00000004
> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> Call Trace:
> [<ffffffff817154d0>] ? __schedule+0x250/0x730
> [<ffffffff817159e6>] schedule+0x36/0x80
> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> [<ffffffff81717db0>] down_read+0x20/0x40
> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
> [<ffffffff81241d87>] __vfs_read+0x37/0x150
> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
> [<ffffffff81242266>] vfs_read+0x96/0x130
> [<ffffffff812437b5>] SyS_read+0x55/0xc0
> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>
> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
> for them to mitigate the abuse of mmap_sem.
>

While switching to arg spinlock here will relieve mmap_sem to an extent,
it wont help with the problem you are seeing here.

proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you
got yet another down_read(&mm->mmap_sem);.

i.e. the issue you ran into is well known and predates my change.

The problem does not stem from contention either, but blocking for a
long time while holding the lock - the most common example is dealing
with dead nfs mount vs mmaped areas.

I don't have good ideas how to fix the problem. The least bad I came up
with was to trylock with a timeout - after a failure either return an
error or resort to returning p_comm. ps/top could be modified to
fallback to snatching the name from /status.

Since the lock owner is now being stored in the semaphore, perhaps the
above routine can happily spin until it grabs the lock or the owner is
detected to have gone into uninterruptible sleep and react accordingly.

I don't know whether it is feasible to somehow avoid the mmap lock
altogether.

If it has to be there no matter what the code can be refactored to grab
it once and relock only if copyout would fault. This would in particular
reduce the number of times it is taken to begin with and still provide
the current synchronisation against prctl. But the fundamental problem
will remain.

That said, refactoring above will have the same effect as your patch and
will avoid growing mm_struct.

That's my $0,03. MM overlords have to comment on what to do with this.

> So, introduce a new spinlock in mm_struct to protect the concurrent access
> to arg_start|end and env_start|end.
>
> And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
> sem for writing to protect against others") changed down_read to
> down_write to avoid write race condition in prctl_set_mm(). Since we
> already have dedicated lock to protect them, it is safe to change back
> to down_read.
>
> Signed-off-by: Yang Shi <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Mateusz Guzik <[email protected]>
> Cc: Cyrill Gorcunov <[email protected]>
> ---
> v1 --> v2:
> * Use spinlock instead of rwlock per Mattew's suggestion
> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>
> fs/proc/base.c | 8 ++++----
> include/linux/mm_types.h | 2 ++
> kernel/fork.c | 1 +
> kernel/sys.c | 14 ++++++++++----
> mm/init-mm.c | 1 +
> 5 files changed, 18 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index 9298324..e0282b6 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
> goto out_mmput;
> }
>
> - down_read(&mm->mmap_sem);
> + spin_lock(&mm->arg_lock);
> arg_start = mm->arg_start;
> arg_end = mm->arg_end;
> env_start = mm->env_start;
> env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
> + spin_unlock(&mm->arg_lock);
>
> BUG_ON(arg_start > arg_end);
> BUG_ON(env_start > env_end);
> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf,
> if (!mmget_not_zero(mm))
> goto free;
>
> - down_read(&mm->mmap_sem);
> + spin_lock(&mm->arg_lock);
> env_start = mm->env_start;
> env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
> + spin_unlock(&mm->arg_lock);
>
> while (count > 0) {
> size_t this_len, max_len;
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index fd1af6b..3be4588 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -413,6 +413,8 @@ struct mm_struct {
> unsigned long def_flags;
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> +
> + spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */
> unsigned long arg_start, arg_end, env_start, env_end;
>
> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e5d9d40..6540ae7 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> mm->pinned_vm = 0;
> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
> spin_lock_init(&mm->page_table_lock);
> + spin_lock_init(&mm->arg_lock);
> mm_init_cpumask(mm);
> mm_init_aio(mm);
> mm_init_owner(mm, p);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index f2289de..17bddd2 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> return error;
> }
>
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
>
> /*
> * We don't validate if these members are pointing to
> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> mm->start_brk = prctl_map.start_brk;
> mm->brk = prctl_map.brk;
> mm->start_stack = prctl_map.start_stack;
> +
> + spin_lock(&mm->arg_lock);
> mm->arg_start = prctl_map.arg_start;
> mm->arg_end = prctl_map.arg_end;
> mm->env_start = prctl_map.env_start;
> mm->env_end = prctl_map.env_end;
> + spin_unlock(&mm->arg_lock);
>
> /*
> * Note this update of @saved_auxv is lockless thus
> @@ -1996,7 +1999,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> if (prctl_map.auxv_size)
> memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
>
> - up_write(&mm->mmap_sem);
> + up_read(&mm->mmap_sem);
> return 0;
> }
> #endif /* CONFIG_CHECKPOINT_RESTORE */
> @@ -2063,7 +2066,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
>
> error = -EINVAL;
>
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> vma = find_vma(mm, addr);
>
> prctl_map.start_code = mm->start_code;
> @@ -2149,14 +2152,17 @@ static int prctl_set_mm(int opt, unsigned long addr,
> mm->start_brk = prctl_map.start_brk;
> mm->brk = prctl_map.brk;
> mm->start_stack = prctl_map.start_stack;
> +
> + spin_lock(&mm->arg_lock);
> mm->arg_start = prctl_map.arg_start;
> mm->arg_end = prctl_map.arg_end;
> mm->env_start = prctl_map.env_start;
> mm->env_end = prctl_map.env_end;
> + spin_unlock(&mm->arg_lock);
>
> error = 0;
> out:
> - up_write(&mm->mmap_sem);
> + up_read(&mm->mmap_sem);
> return error;
> }
>
> diff --git a/mm/init-mm.c b/mm/init-mm.c
> index f94d5d1..66cce4c 100644
> --- a/mm/init-mm.c
> +++ b/mm/init-mm.c
> @@ -23,6 +23,7 @@ struct mm_struct init_mm = {
> .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
> .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
> .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
> + .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
> .user_ns = &init_user_ns,
> INIT_MM_CONTEXT(init_mm)
> };
> --
> 1.8.3.1
>

--
Mateusz Guzik

2018-03-26 21:11:35

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> That said I think using read-lock here would be a bug.

If I understand correctly, the caller can't set both fields atomically, for
prctl() does not receive both fields at one call.

prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);

Then, I wonder whether reading arg_start|end and env_start|end atomically makes
sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?

2018-03-26 21:13:12

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/26/18 3:42 PM, Mateusz Guzik wrote:
> On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
>> mmap_sem is on the hot path of kernel, and it very contended, but it is
>> abused too. It is used to protect arg_start|end and evn_start|end when
>> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make
>> sense since those proc files just expect to read 4 values atomically and
>> not related to VM, they could be set to arbitrary values by C/R.
>>
> They are not arbitrary - there is basic validation performed when
> setting them.
>
>> And, the mmap_sem contention may cause unexpected issue like below:
>>
>> INFO: task ps:14018 blocked for more than 120 seconds.
>> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1
>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>> ps D 0 14018 1 0x00000004
>> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>> Call Trace:
>> [<ffffffff817154d0>] ? __schedule+0x250/0x730
>> [<ffffffff817159e6>] schedule+0x36/0x80
>> [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>> [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>> [<ffffffff81717db0>] down_read+0x20/0x40
>> [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>> [<ffffffff81241d87>] __vfs_read+0x37/0x150
>> [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>> [<ffffffff81242266>] vfs_read+0x96/0x130
>> [<ffffffff812437b5>] SyS_read+0x55/0xc0
>> [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock
>> for them to mitigate the abuse of mmap_sem.
>>
> While switching to arg spinlock here will relieve mmap_sem to an extent,
> it wont help with the problem you are seeing here.
>
> proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you
> got yet another down_read(&mm->mmap_sem);.
>
> i.e. the issue you ran into is well known and predates my change.
>
> The problem does not stem from contention either, but blocking for a
> long time while holding the lock - the most common example is dealing
> with dead nfs mount vs mmaped areas.
>
> I don't have good ideas how to fix the problem. The least bad I came up
> with was to trylock with a timeout - after a failure either return an
> error or resort to returning p_comm. ps/top could be modified to
> fallback to snatching the name from /status.
>
> Since the lock owner is now being stored in the semaphore, perhaps the
> above routine can happily spin until it grabs the lock or the owner is
> detected to have gone into uninterruptible sleep and react accordingly.
>
> I don't know whether it is feasible to somehow avoid the mmap lock
> altogether.
>
> If it has to be there no matter what the code can be refactored to grab
> it once and relock only if copyout would fault. This would in particular
> reduce the number of times it is taken to begin with and still provide
> the current synchronisation against prctl. But the fundamental problem
> will remain.
>
> That said, refactoring above will have the same effect as your patch and
> will avoid growing mm_struct.
>
> That's my $0,03. MM overlords have to comment on what to do with this.

Thanks for the comment. Yes, the spin lock absolutely can't solve all
the mmap_sem scalability issue. Actually, I already proposed a
preliminary RFC to try to mitigate the mmap_sem issue. It is still under
review.

Other than that, we also found mmap_sem is abused somewhere, so this
patch is proposed to reduce the abuse to mmap_sem.

Yang

>
>> So, introduce a new spinlock in mm_struct to protect the concurrent access
>> to arg_start|end and env_start|end.
>>
>> And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap
>> sem for writing to protect against others") changed down_read to
>> down_write to avoid write race condition in prctl_set_mm(). Since we
>> already have dedicated lock to protect them, it is safe to change back
>> to down_read.
>>
>> Signed-off-by: Yang Shi <[email protected]>
>> Cc: Alexey Dobriyan <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Mateusz Guzik <[email protected]>
>> Cc: Cyrill Gorcunov <[email protected]>
>> ---
>> v1 --> v2:
>> * Use spinlock instead of rwlock per Mattew's suggestion
>> * Replace down_write to down_read in prctl_set_mm (see commit log for details)
>>
>> fs/proc/base.c | 8 ++++----
>> include/linux/mm_types.h | 2 ++
>> kernel/fork.c | 1 +
>> kernel/sys.c | 14 ++++++++++----
>> mm/init-mm.c | 1 +
>> 5 files changed, 18 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/proc/base.c b/fs/proc/base.c
>> index 9298324..e0282b6 100644
>> --- a/fs/proc/base.c
>> +++ b/fs/proc/base.c
>> @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf,
>> goto out_mmput;
>> }
>>
>> - down_read(&mm->mmap_sem);
>> + spin_lock(&mm->arg_lock);
>> arg_start = mm->arg_start;
>> arg_end = mm->arg_end;
>> env_start = mm->env_start;
>> env_end = mm->env_end;
>> - up_read(&mm->mmap_sem);
>> + spin_unlock(&mm->arg_lock);
>>
>> BUG_ON(arg_start > arg_end);
>> BUG_ON(env_start > env_end);
>> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf,
>> if (!mmget_not_zero(mm))
>> goto free;
>>
>> - down_read(&mm->mmap_sem);
>> + spin_lock(&mm->arg_lock);
>> env_start = mm->env_start;
>> env_end = mm->env_end;
>> - up_read(&mm->mmap_sem);
>> + spin_unlock(&mm->arg_lock);
>>
>> while (count > 0) {
>> size_t this_len, max_len;
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index fd1af6b..3be4588 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -413,6 +413,8 @@ struct mm_struct {
>> unsigned long def_flags;
>> unsigned long start_code, end_code, start_data, end_data;
>> unsigned long start_brk, brk, start_stack;
>> +
>> + spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */
>> unsigned long arg_start, arg_end, env_start, env_end;
>>
>> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e5d9d40..6540ae7 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>> mm->pinned_vm = 0;
>> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
>> spin_lock_init(&mm->page_table_lock);
>> + spin_lock_init(&mm->arg_lock);
>> mm_init_cpumask(mm);
>> mm_init_aio(mm);
>> mm_init_owner(mm, p);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index f2289de..17bddd2 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>> return error;
>> }
>>
>> - down_write(&mm->mmap_sem);
>> + down_read(&mm->mmap_sem);
>>
>> /*
>> * We don't validate if these members are pointing to
>> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>> mm->start_brk = prctl_map.start_brk;
>> mm->brk = prctl_map.brk;
>> mm->start_stack = prctl_map.start_stack;
>> +
>> + spin_lock(&mm->arg_lock);
>> mm->arg_start = prctl_map.arg_start;
>> mm->arg_end = prctl_map.arg_end;
>> mm->env_start = prctl_map.env_start;
>> mm->env_end = prctl_map.env_end;
>> + spin_unlock(&mm->arg_lock);
>>
>> /*
>> * Note this update of @saved_auxv is lockless thus
>> @@ -1996,7 +1999,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>> if (prctl_map.auxv_size)
>> memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
>>
>> - up_write(&mm->mmap_sem);
>> + up_read(&mm->mmap_sem);
>> return 0;
>> }
>> #endif /* CONFIG_CHECKPOINT_RESTORE */
>> @@ -2063,7 +2066,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
>>
>> error = -EINVAL;
>>
>> - down_write(&mm->mmap_sem);
>> + down_read(&mm->mmap_sem);
>> vma = find_vma(mm, addr);
>>
>> prctl_map.start_code = mm->start_code;
>> @@ -2149,14 +2152,17 @@ static int prctl_set_mm(int opt, unsigned long addr,
>> mm->start_brk = prctl_map.start_brk;
>> mm->brk = prctl_map.brk;
>> mm->start_stack = prctl_map.start_stack;
>> +
>> + spin_lock(&mm->arg_lock);
>> mm->arg_start = prctl_map.arg_start;
>> mm->arg_end = prctl_map.arg_end;
>> mm->env_start = prctl_map.env_start;
>> mm->env_end = prctl_map.env_end;
>> + spin_unlock(&mm->arg_lock);
>>
>> error = 0;
>> out:
>> - up_write(&mm->mmap_sem);
>> + up_read(&mm->mmap_sem);
>> return error;
>> }
>>
>> diff --git a/mm/init-mm.c b/mm/init-mm.c
>> index f94d5d1..66cce4c 100644
>> --- a/mm/init-mm.c
>> +++ b/mm/init-mm.c
>> @@ -23,6 +23,7 @@ struct mm_struct init_mm = {
>> .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem),
>> .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
>> .mmlist = LIST_HEAD_INIT(init_mm.mmlist),
>> + .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
>> .user_ns = &init_user_ns,
>> INIT_MM_CONTEXT(init_mm)
>> };
>> --
>> 1.8.3.1
>>


2018-03-26 21:24:23

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/26/18 5:10 PM, Tetsuo Handa wrote:
> On 2018/03/27 4:21, Cyrill Gorcunov wrote:
>> That said I think using read-lock here would be a bug.
> If I understand correctly, the caller can't set both fields atomically, for
> prctl() does not receive both fields at one call.
>
> prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
>
> Then, I wonder whether reading arg_start|end and env_start|end atomically makes
> sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?

It might trap into dead loop if those are set to wrong values, right?



2018-03-26 21:31:02

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
> On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > That said I think using read-lock here would be a bug.
>
> If I understand correctly, the caller can't set both fields atomically, for
> prctl() does not receive both fields at one call.
>
> prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
>

True, but the key moment is that two/three/four system calls can
run simultaneously. And while previously they are ordered by "write",
with read lock they are completely unordered and this is really
worries me. To be fair I would prefer to drop this old per-field
interface completely. This per-field interface was rather an ugly
solution from my side.

> Then, I wonder whether reading arg_start|end and env_start|end atomically makes
> sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?

Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.

2018-03-26 22:01:12

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/26/18 3:21 PM, Cyrill Gorcunov wrote:
> On Mon, Mar 26, 2018 at 11:37:25AM -0700, Matthew Wilcox wrote:
>> On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote:
>>> +++ b/kernel/sys.c
>>> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>>> return error;
>>> }
>>>
>>> - down_write(&mm->mmap_sem);
>>> + down_read(&mm->mmap_sem);
>>>
>>> /*
>>> * We don't validate if these members are pointing to
>>> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>>> mm->start_brk = prctl_map.start_brk;
>>> mm->brk = prctl_map.brk;
>>> mm->start_stack = prctl_map.start_stack;
>>> +
>>> + spin_lock(&mm->arg_lock);
>>> mm->arg_start = prctl_map.arg_start;
>>> mm->arg_end = prctl_map.arg_end;
>>> mm->env_start = prctl_map.env_start;
>>> mm->env_end = prctl_map.env_end;
>>> + spin_unlock(&mm->arg_lock);
>>>
>>> /*
>>> * Note this update of @saved_auxv is lockless thus
>> I see the argument for the change to a write lock was because of a BUG
>> validating arg_start and arg_end, but more generally, we are updating these
>> values, so a write-lock is probably a good idea, and this is a very rare
>> operation to do, so we don't care about making this more parallel. I would
>> not make this change (but if other more knowledgable people in this area
>> disagree with me, I will withdraw my objection to this part).
> Say we've two syscalls running prctl_set_mm_map in parallel, and imagine
> one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30
> and @brk = 20. Since now the call is guarded by _read_ the both calls
> unlocked and due to OO engine it may happen then when both finish
> we have @start_brk = 30 and @brk = 10. In turn "write" semaphore
> has been take to have consistent data on exit, either you have [20;10]
> or [30;20] assigned not something mixed.
>
> That said I think using read-lock here would be a bug.

Yes it sounds so. However, it was down_read before
ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for
writing to protect against others"). And, that commit is for fixing the
concurrent writing to arg_* and env_*. I just checked that commit, but
omitted the brk part. The potential issue mentioned by you should exist
before that commit, but might be just not discovered or very rare to hit.

I will change it back to down_write.

Thanks,
Yang

>
> Cyrill


2018-03-26 22:02:45

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

Cyrill Gorcunov wrote:
> On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
> > On 2018/03/27 4:21, Cyrill Gorcunov wrote:
> > > That said I think using read-lock here would be a bug.
> >
> > If I understand correctly, the caller can't set both fields atomically, for
> > prctl() does not receive both fields at one call.
> >
> > prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
> >
>
> True, but the key moment is that two/three/four system calls can
> run simultaneously. And while previously they are ordered by "write",
> with read lock they are completely unordered and this is really
> worries me.

Yes, we need exclusive lock when updating these fields.

> To be fair I would prefer to drop this old per-field
> interface completely. This per-field interface was rather an ugly
> solution from my side.

But this is userspace visible API and thus we cannot change.

>
> > Then, I wonder whether reading arg_start|end and env_start|end atomically makes
> > sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?
>
> Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
>

You are not missing my point. What I thought is

+retry:
- down_read(&mm->mmap_sem);
arg_start = mm->arg_start;
arg_end = mm->arg_end;
env_start = mm->env_start;
env_end = mm->env_end;
- up_read(&mm->mmap_sem);

- BUG_ON(arg_start > arg_end);
- BUG_ON(env_start > env_end);
+ if (unlikely(arg_start > arg_end || env_start > env_end)) {
+ cond_resched();
+ goto retry;
+ }

for reading these fields.

By the way, /proc/pid/ readers are serving as a canary who tells something
mm_mmap related problem is happening. On the other hand, it is sad that
such canary cannot be terminated by signal due to use of unkillable waits.
I wish we can use killable waits.

2018-03-26 22:14:24

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/26/18 6:00 PM, Tetsuo Handa wrote:
> Cyrill Gorcunov wrote:
>> On Tue, Mar 27, 2018 at 06:10:09AM +0900, Tetsuo Handa wrote:
>>> On 2018/03/27 4:21, Cyrill Gorcunov wrote:
>>>> That said I think using read-lock here would be a bug.
>>> If I understand correctly, the caller can't set both fields atomically, for
>>> prctl() does not receive both fields at one call.
>>>
>>> prctl(PR_SET_MM, PR_SET_MM_ARG_START xor PR_SET_MM_ARG_END xor PR_SET_MM_ENV_START xor PR_SET_MM_ENV_END, new value, 0, 0);
>>>
>> True, but the key moment is that two/three/four system calls can
>> run simultaneously. And while previously they are ordered by "write",
>> with read lock they are completely unordered and this is really
>> worries me.
> Yes, we need exclusive lock when updating these fields.
>
>> To be fair I would prefer to drop this old per-field
>> interface completely. This per-field interface was rather an ugly
>> solution from my side.
> But this is userspace visible API and thus we cannot change.
>
>>> Then, I wonder whether reading arg_start|end and env_start|end atomically makes
>>> sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?
>> Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
>>
> You are not missing my point. What I thought is
>
> +retry:
> - down_read(&mm->mmap_sem);
> arg_start = mm->arg_start;
> arg_end = mm->arg_end;
> env_start = mm->env_start;
> env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
>
> - BUG_ON(arg_start > arg_end);
> - BUG_ON(env_start > env_end);
> + if (unlikely(arg_start > arg_end || env_start > env_end)) {
> + cond_resched();
> + goto retry;

Can't it trap into dead loop if the condition is always false?

> + }
>
> for reading these fields.
>
> By the way, /proc/pid/ readers are serving as a canary who tells something
> mm_mmap related problem is happening. On the other hand, it is sad that
> such canary cannot be terminated by signal due to use of unkillable waits.
> I wish we can use killable waits.

I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do
this a few weeks ago. In the review, akpm suggested mitigate the
mmap_sem contention instead of using killable version workaround. Then
the preliminary unmaping by section patches
(https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion,
we decided to eliminate the mmap_sem abuse, this is where the patch came
from.

Yang



2018-03-27 06:53:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue 27-03-18 02:20:39, Yang Shi wrote:
[...]
The patch looks reasonable to me. Maybe it would be better to be more
explicit about the purpose of the patch. As others noticed, this alone
wouldn't solve the mmap_sem contention issues. I _think_ that if you
were more explicit about the mmap_sem abuse it would trigger less
questions.

I have just one more question. Now that you are touching this area,
would you be willing to remove the following ugliness?

> diff --git a/kernel/sys.c b/kernel/sys.c
> index f2289de..17bddd2 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> return error;
> }
>
> - down_write(&mm->mmap_sem);
> + down_read(&mm->mmap_sem);

Why do we need to hold mmap_sem here and call find_vma, when only
PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the
new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Thanks!
--
Michal Hocko
SUSE Labs

2018-03-27 07:34:38

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote:
> > Say we've two syscalls running prctl_set_mm_map in parallel, and imagine
> > one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30
> > and @brk = 20. Since now the call is guarded by _read_ the both calls
> > unlocked and due to OO engine it may happen then when both finish
> > we have @start_brk = 30 and @brk = 10. In turn "write" semaphore
> > has been take to have consistent data on exit, either you have [20;10]
> > or [30;20] assigned not something mixed.
> >
> > That said I think using read-lock here would be a bug.
>
> Yes it sounds so. However, it was down_read before
> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing
> to protect against others"). And, that commit is for fixing the concurrent
> writing to arg_* and env_*. I just checked that commit, but omitted the brk
> part. The potential issue mentioned by you should exist before that commit,
> but might be just not discovered or very rare to hit.
>
> I will change it back to down_write.

down_read before was a bug ;) And it was not discovered earlier simply
because not that many users of this interface exist, namely only criu
as far as I know by now.

2018-03-27 07:38:53

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 07:00:56AM +0900, Tetsuo Handa wrote:
>
> > To be fair I would prefer to drop this old per-field
> > interface completely. This per-field interface was rather an ugly
> > solution from my side.
>
> But this is userspace visible API and thus we cannot change.

Hi! We could deplrecate this API call for a couple of releases
and then if nobody complain we could rip it off completely.
There should not be many users I think, didn't heard that
someone except criu used it ever.

> > > Then, I wonder whether reading arg_start|end and env_start|end atomically makes
> > > sense. Just retry reading if arg_start > env_end or env_start > env_end is fine?
> >
> > Tetsuo, let me re-read this code tomorrow, maybe I miss something obvious.
> >
>
> You are not missing my point. What I thought is
>
> +retry:
> - down_read(&mm->mmap_sem);
> arg_start = mm->arg_start;
> arg_end = mm->arg_end;
> env_start = mm->env_start;
> env_end = mm->env_end;
> - up_read(&mm->mmap_sem);
>
> - BUG_ON(arg_start > arg_end);
> - BUG_ON(env_start > env_end);
> + if (unlikely(arg_start > arg_end || env_start > env_end)) {
> + cond_resched();
> + goto retry;
> + }
>
> for reading these fields.

I fear such contentional cycles are acceptable if only they
are guaranteed to finish eventually. Which doesn't look so
in the code above.

Cyrill

2018-03-27 07:39:56

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Mon, Mar 26, 2018 at 06:12:55PM -0400, Yang Shi wrote:
> > + if (unlikely(arg_start > arg_end || env_start > env_end)) {
> > + cond_resched();
> > + goto retry;
>
> Can't it trap into dead loop if the condition is always false?

Yes, unfortunately it can.

> > + }
> >
> > for reading these fields.
> >
> > By the way, /proc/pid/ readers are serving as a canary who tells something
> > mm_mmap related problem is happening. On the other hand, it is sad that
> > such canary cannot be terminated by signal due to use of unkillable waits.
> > I wish we can use killable waits.
>
> I already proposed patches (https://lkml.org/lkml/2018/2/26/1197) to do this
> a few weeks ago. In the review, akpm suggested mitigate the mmap_sem
> contention instead of using killable version workaround. Then the
> preliminary unmaping by section patches
> (https://lkml.org/lkml/2018/3/20/786) were proposed. In the discussion, we
> decided to eliminate the mmap_sem abuse, this is where the patch came from.

2018-03-27 14:33:00

by Mateusz Guzik

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.
>

From what I gather even with other fixes the kernel will still end up
grabbing the semaphore. In this case I don't see what's the upside of
adding the spinlock for args. The downside is growth of mm_struct.

i.e. the code can be refactored to just hold the lock and relock only if
necessary (unable to copy to user without faulting)

--
Mateusz Guzik

2018-03-27 14:44:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue 27-03-18 16:31:23, Mateusz Guzik wrote:
> On Tue, Mar 27, 2018 at 08:29:39AM +0200, Michal Hocko wrote:
> > On Tue 27-03-18 02:20:39, Yang Shi wrote:
> > [...]
> > The patch looks reasonable to me. Maybe it would be better to be more
> > explicit about the purpose of the patch. As others noticed, this alone
> > wouldn't solve the mmap_sem contention issues. I _think_ that if you
> > were more explicit about the mmap_sem abuse it would trigger less
> > questions.
> >
>
> >From what I gather even with other fixes the kernel will still end up
> grabbing the semaphore. In this case I don't see what's the upside of
> adding the spinlock for args. The downside is growth of mm_struct.

Because accessing the specific address in the address space can be later
changed to use a more fine-grained locking. There are people
experimenting with range locking. These mmap_sem abusers, on the other
hand, will require the full range lock without a good reason. So it is
really worth it to remove them and replace by a more fine grained
locking.

If the mm_struct grow is a real concern (I haven't checked that) then we
can use a set of hashed locks or something else.
--
Michal Hocko
SUSE Labs

2018-03-27 18:39:36

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/27/18 2:29 AM, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.

Yes, sure.

>
> I have just one more question. Now that you are touching this area,
> would you be willing to remove the following ugliness?
>
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index f2289de..17bddd2 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>> return error;
>> }
>>
>> - down_write(&mm->mmap_sem);
>> + down_read(&mm->mmap_sem);
> Why do we need to hold mmap_sem here and call find_vma, when only
> PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the
> new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Actually, I didn't think of why. It looks prctl_set_mm() checks if vma
does exist when it tries to set stack_start, argv_* and env_*, btw not
only env_end.

Cyrill may be able to give us some hint since C/R is the main user of
this API.

Yang

>
> Thanks!


2018-03-27 18:45:48

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/27/18 3:32 AM, Cyrill Gorcunov wrote:
> On Mon, Mar 26, 2018 at 05:59:49PM -0400, Yang Shi wrote:
>>> Say we've two syscalls running prctl_set_mm_map in parallel, and imagine
>>> one have @start_brk = 20 @brk = 10 and second caller has @start_brk = 30
>>> and @brk = 20. Since now the call is guarded by _read_ the both calls
>>> unlocked and due to OO engine it may happen then when both finish
>>> we have @start_brk = 30 and @brk = 10. In turn "write" semaphore
>>> has been take to have consistent data on exit, either you have [20;10]
>>> or [30;20] assigned not something mixed.
>>>
>>> That said I think using read-lock here would be a bug.
>> Yes it sounds so. However, it was down_read before
>> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for writing
>> to protect against others"). And, that commit is for fixing the concurrent
>> writing to arg_* and env_*. I just checked that commit, but omitted the brk
>> part. The potential issue mentioned by you should exist before that commit,
>> but might be just not discovered or very rare to hit.
>>
>> I will change it back to down_write.
> down_read before was a bug ;) And it was not discovered earlier simply
> because not that many users of this interface exist, namely only criu
> as far as I know by now.

Thanks for confirming this. I assumed so :-)



2018-03-27 18:53:31

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote:
> > Why do we need to hold mmap_sem here and call find_vma, when only
> > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the
> > new lock and take the mmap_sem only for PR_SET_MM_ENV_END.
>
> Actually, I didn't think of why. It looks prctl_set_mm() checks if vma does
> exist when it tries to set stack_start, argv_* and env_*, btw not only
> env_end.
>
> Cyrill may be able to give us some hint since C/R is the main user of this
> API.

First and most important it makes code smaller. This prctl call is really
rarely used. Of course we can optimize it, but as I said I would prefer
to simply deprecate this old interface (and I gonne to do so once time
permit).

2018-03-28 13:13:10

by Michal Hocko

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct

On Tue 27-03-18 21:52:17, Cyrill Gorcunov wrote:
> On Tue, Mar 27, 2018 at 02:38:11PM -0400, Yang Shi wrote:
> > > Why do we need to hold mmap_sem here and call find_vma, when only
> > > PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the
> > > new lock and take the mmap_sem only for PR_SET_MM_ENV_END.
> >
> > Actually, I didn't think of why. It looks prctl_set_mm() checks if vma does
> > exist when it tries to set stack_start, argv_* and env_*, btw not only
> > env_end.
> >
> > Cyrill may be able to give us some hint since C/R is the main user of this
> > API.
>
> First and most important it makes code smaller. This prctl call is really
> rarely used. Of course we can optimize it, but as I said I would prefer
> to simply deprecate this old interface (and I gonne to do so once time
> permit).

Ohh, it would be really great if we can remove this thingy altogether. I
cannot say it has a wee bit of my sympathy.

--
Michal Hocko
SUSE Labs

2018-04-02 02:01:37

by Yang Shi

[permalink] [raw]
Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct



On 3/26/18 11:29 PM, Michal Hocko wrote:
> On Tue 27-03-18 02:20:39, Yang Shi wrote:
> [...]
> The patch looks reasonable to me. Maybe it would be better to be more
> explicit about the purpose of the patch. As others noticed, this alone
> wouldn't solve the mmap_sem contention issues. I _think_ that if you
> were more explicit about the mmap_sem abuse it would trigger less
> questions.
>
> I have just one more question. Now that you are touching this area,
> would you be willing to remove the following ugliness?
>
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index f2289de..17bddd2 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>> return error;
>> }
>>
>> - down_write(&mm->mmap_sem);
>> + down_read(&mm->mmap_sem);
> Why do we need to hold mmap_sem here and call find_vma, when only
> PR_SET_MM_ENV_END: is consuming it? I guess we can replace it wit the
> new lock and take the mmap_sem only for PR_SET_MM_ENV_END.

Sorry for taking a little bit longer to get back since I was traveling.
I think all the stuff can be protected by the new arg_lock except
mm->brk since arg_lock can't prevent from concurrent writing from sys_brk().

We may use arg_lock to protect everything else other than mm->brk. The
complexity sounds acceptable.

Of course, as Cyrill mentioned, he prefers to deprecating prctl_set_mm
since C/R is the only user of it. We may wait until he is done?

Thanks,
Yang

>
> Thanks!