Received: by 10.213.65.68 with SMTP id h4csp25299imn; Mon, 26 Mar 2018 14:13:12 -0700 (PDT) X-Google-Smtp-Source: AIpwx48jmw/vZz6P26exVKrB3z4GdzFN9cqKNM7BXSPgF/PjzruxbkMNIHSawk68XJ/75pOB6Zca X-Received: by 10.98.10.131 with SMTP id 3mr6090311pfk.112.1522098792299; Mon, 26 Mar 2018 14:13:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1522098792; cv=none; d=google.com; s=arc-20160816; b=tTplUp3RpNF2/vLJF9WrsUgRHBkQHWOKIHH7OlP/uS8lREeoDIRyisRJdGzLoH+P2U nu6e89a7CYpgRgUZPKWoq4ZuE2z0+4yZ6O0Ide3QH0Q3RiCVZ88HIxOObXZk07gic+wQ /T6dIGcxkZDX4LemExTzJf9ZpgYCgMx/GwEr/SW4soVBZzQnFm4ft4VVuCtm1uJKTqp5 dSmgqlLFG376fprN9MCzT7FNV1NGUOCfaK7neK16aOe9iLj0HHptWbA6TSPgYiqEf/gl sxS0nekwH0/zfS2/3o9Tp3uQcLU6F5aRI3BCxetfzM0Dm6fOsDgj7Q6UGiHtLk+S8Rra 1HCg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language :content-transfer-encoding:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:arc-authentication-results; bh=y9gmesjnxL3yx+dGwan4+vNE3XUzIY4d/NMwEUckG+c=; b=nWOB/rX/iEcOqpZQFCC4vmF3oM/loCSdEuHww9CN0KJCYKLT6Drml9RsD6iogq7aME apGxr90FWJCrPUvrNwULGTxJvZkx/AZ066Sb5QjZW0+iEohYVw5XjTAWfhyGypCZXiKR D+Dkv2yekGI8nR/dVGe4y6xOrMILD9J6r9vFVK18B7ecNkXr0wyHc0hPJZeUhv6t4lGB 7K4zVOm/oYBAF1OsdBDhL+3ZzHGEz3DuIifKPaukOVOTOE/zFQXwlGB1jYOFyQ7xEoBY ayYwUG27gGb8Hk0E8qXTgdrBnR0408r1Yt5SbU69XcanwgeKg26rFDaRp2PRFd2x2lWL c+Yw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e34-v6si11404559plb.281.2018.03.26.14.12.57; Mon, 26 Mar 2018 14:13:12 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=alibaba.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752189AbeCZVLY (ORCPT + 99 others); Mon, 26 Mar 2018 17:11:24 -0400 Received: from out30-133.freemail.mail.aliyun.com ([115.124.30.133]:39629 "EHLO out30-133.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751732AbeCZVLW (ORCPT ); Mon, 26 Mar 2018 17:11:22 -0400 X-Alimail-AntiSpam: AC=PASS;BC=-1|-1;BR=01201311R601e4;CH=green;FP=0|-1|-1|-1|0|-1|-1|-1;HT=e01e01422;MF=yang.shi@linux.alibaba.com;NM=1;PH=DS;RN=8;SR=0;TI=SMTPD_---0T-8he0e_1522098658; Received: from US-143344MP.local(mailfrom:yang.shi@linux.alibaba.com fp:65.207.107.50) by smtp.aliyun-inc.com(127.0.0.1); Tue, 27 Mar 2018 05:11:04 +0800 Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct To: Mateusz Guzik Cc: adobriyan@gmail.com, mhocko@kernel.org, willy@infradead.org, gorcunov@openvz.org, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <1522088439-105930-1-git-send-email-yang.shi@linux.alibaba.com> <20180326194210.7bf6u2wo44oh4n7z@mguzik> From: Yang Shi Message-ID: <81f12069-6ac7-a2c8-a87f-5170a9508780@linux.alibaba.com> Date: Mon, 26 Mar 2018 17:10:57 -0400 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:52.0) Gecko/20100101 Thunderbird/52.2.1 MIME-Version: 1.0 In-Reply-To: <20180326194210.7bf6u2wo44oh4n7z@mguzik> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 3/26/18 3:42 PM, Mateusz Guzik wrote: > On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: >> mmap_sem is on the hot path of kernel, and it very contended, but it is >> abused too. It is used to protect arg_start|end and evn_start|end when >> reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make >> sense since those proc files just expect to read 4 values atomically and >> not related to VM, they could be set to arbitrary values by C/R. >> > They are not arbitrary - there is basic validation performed when > setting them. > >> And, the mmap_sem contention may cause unexpected issue like below: >> >> INFO: task ps:14018 blocked for more than 120 seconds. >> Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 >> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this >> message. >> ps D 0 14018 1 0x00000004 >> ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 >> ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 >> 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 >> Call Trace: >> [] ? __schedule+0x250/0x730 >> [] schedule+0x36/0x80 >> [] rwsem_down_read_failed+0xf0/0x150 >> [] call_rwsem_down_read_failed+0x18/0x30 >> [] down_read+0x20/0x40 >> [] proc_pid_cmdline_read+0xd9/0x4e0 >> [] ? do_filp_open+0xa5/0x100 >> [] __vfs_read+0x37/0x150 >> [] ? security_file_permission+0x9b/0xc0 >> [] vfs_read+0x96/0x130 >> [] SyS_read+0x55/0xc0 >> [] entry_SYSCALL_64_fastpath+0x1a/0xc5 >> >> Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock >> for them to mitigate the abuse of mmap_sem. >> > While switching to arg spinlock here will relieve mmap_sem to an extent, > it wont help with the problem you are seeing here. > > proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you > got yet another down_read(&mm->mmap_sem);. > > i.e. the issue you ran into is well known and predates my change. > > The problem does not stem from contention either, but blocking for a > long time while holding the lock - the most common example is dealing > with dead nfs mount vs mmaped areas. > > I don't have good ideas how to fix the problem. The least bad I came up > with was to trylock with a timeout - after a failure either return an > error or resort to returning p_comm. ps/top could be modified to > fallback to snatching the name from /status. > > Since the lock owner is now being stored in the semaphore, perhaps the > above routine can happily spin until it grabs the lock or the owner is > detected to have gone into uninterruptible sleep and react accordingly. > > I don't know whether it is feasible to somehow avoid the mmap lock > altogether. > > If it has to be there no matter what the code can be refactored to grab > it once and relock only if copyout would fault. This would in particular > reduce the number of times it is taken to begin with and still provide > the current synchronisation against prctl. But the fundamental problem > will remain. > > That said, refactoring above will have the same effect as your patch and > will avoid growing mm_struct. > > That's my $0,03. MM overlords have to comment on what to do with this. Thanks for the comment. Yes, the spin lock absolutely can't solve all the mmap_sem scalability issue. Actually, I already proposed a preliminary RFC to try to mitigate the mmap_sem issue. It is still under review. Other than that, we also found mmap_sem is abused somewhere, so this patch is proposed to reduce the abuse to mmap_sem. Yang > >> So, introduce a new spinlock in mm_struct to protect the concurrent access >> to arg_start|end and env_start|end. >> >> And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap >> sem for writing to protect against others") changed down_read to >> down_write to avoid write race condition in prctl_set_mm(). Since we >> already have dedicated lock to protect them, it is safe to change back >> to down_read. >> >> Signed-off-by: Yang Shi >> Cc: Alexey Dobriyan >> Cc: Michal Hocko >> Cc: Matthew Wilcox >> Cc: Mateusz Guzik >> Cc: Cyrill Gorcunov >> --- >> v1 --> v2: >> * Use spinlock instead of rwlock per Mattew's suggestion >> * Replace down_write to down_read in prctl_set_mm (see commit log for details) >> >> fs/proc/base.c | 8 ++++---- >> include/linux/mm_types.h | 2 ++ >> kernel/fork.c | 1 + >> kernel/sys.c | 14 ++++++++++---- >> mm/init-mm.c | 1 + >> 5 files changed, 18 insertions(+), 8 deletions(-) >> >> diff --git a/fs/proc/base.c b/fs/proc/base.c >> index 9298324..e0282b6 100644 >> --- a/fs/proc/base.c >> +++ b/fs/proc/base.c >> @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf, >> goto out_mmput; >> } >> >> - down_read(&mm->mmap_sem); >> + spin_lock(&mm->arg_lock); >> arg_start = mm->arg_start; >> arg_end = mm->arg_end; >> env_start = mm->env_start; >> env_end = mm->env_end; >> - up_read(&mm->mmap_sem); >> + spin_unlock(&mm->arg_lock); >> >> BUG_ON(arg_start > arg_end); >> BUG_ON(env_start > env_end); >> @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf, >> if (!mmget_not_zero(mm)) >> goto free; >> >> - down_read(&mm->mmap_sem); >> + spin_lock(&mm->arg_lock); >> env_start = mm->env_start; >> env_end = mm->env_end; >> - up_read(&mm->mmap_sem); >> + spin_unlock(&mm->arg_lock); >> >> while (count > 0) { >> size_t this_len, max_len; >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index fd1af6b..3be4588 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -413,6 +413,8 @@ struct mm_struct { >> unsigned long def_flags; >> unsigned long start_code, end_code, start_data, end_data; >> unsigned long start_brk, brk, start_stack; >> + >> + spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */ >> unsigned long arg_start, arg_end, env_start, env_end; >> >> unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ >> diff --git a/kernel/fork.c b/kernel/fork.c >> index e5d9d40..6540ae7 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, >> mm->pinned_vm = 0; >> memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); >> spin_lock_init(&mm->page_table_lock); >> + spin_lock_init(&mm->arg_lock); >> mm_init_cpumask(mm); >> mm_init_aio(mm); >> mm_init_owner(mm, p); >> diff --git a/kernel/sys.c b/kernel/sys.c >> index f2289de..17bddd2 100644 >> --- a/kernel/sys.c >> +++ b/kernel/sys.c >> @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data >> return error; >> } >> >> - down_write(&mm->mmap_sem); >> + down_read(&mm->mmap_sem); >> >> /* >> * We don't validate if these members are pointing to >> @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data >> mm->start_brk = prctl_map.start_brk; >> mm->brk = prctl_map.brk; >> mm->start_stack = prctl_map.start_stack; >> + >> + spin_lock(&mm->arg_lock); >> mm->arg_start = prctl_map.arg_start; >> mm->arg_end = prctl_map.arg_end; >> mm->env_start = prctl_map.env_start; >> mm->env_end = prctl_map.env_end; >> + spin_unlock(&mm->arg_lock); >> >> /* >> * Note this update of @saved_auxv is lockless thus >> @@ -1996,7 +1999,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data >> if (prctl_map.auxv_size) >> memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); >> >> - up_write(&mm->mmap_sem); >> + up_read(&mm->mmap_sem); >> return 0; >> } >> #endif /* CONFIG_CHECKPOINT_RESTORE */ >> @@ -2063,7 +2066,7 @@ static int prctl_set_mm(int opt, unsigned long addr, >> >> error = -EINVAL; >> >> - down_write(&mm->mmap_sem); >> + down_read(&mm->mmap_sem); >> vma = find_vma(mm, addr); >> >> prctl_map.start_code = mm->start_code; >> @@ -2149,14 +2152,17 @@ static int prctl_set_mm(int opt, unsigned long addr, >> mm->start_brk = prctl_map.start_brk; >> mm->brk = prctl_map.brk; >> mm->start_stack = prctl_map.start_stack; >> + >> + spin_lock(&mm->arg_lock); >> mm->arg_start = prctl_map.arg_start; >> mm->arg_end = prctl_map.arg_end; >> mm->env_start = prctl_map.env_start; >> mm->env_end = prctl_map.env_end; >> + spin_unlock(&mm->arg_lock); >> >> error = 0; >> out: >> - up_write(&mm->mmap_sem); >> + up_read(&mm->mmap_sem); >> return error; >> } >> >> diff --git a/mm/init-mm.c b/mm/init-mm.c >> index f94d5d1..66cce4c 100644 >> --- a/mm/init-mm.c >> +++ b/mm/init-mm.c >> @@ -23,6 +23,7 @@ struct mm_struct init_mm = { >> .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), >> .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), >> .mmlist = LIST_HEAD_INIT(init_mm.mmlist), >> + .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), >> .user_ns = &init_user_ns, >> INIT_MM_CONTEXT(init_mm) >> }; >> -- >> 1.8.3.1 >>