Received: by 10.213.65.68 with SMTP id h4csp1694806imn; Mon, 26 Mar 2018 12:43:34 -0700 (PDT) X-Google-Smtp-Source: AG47ELsCexQi/7ChHe+0Q3zCQbYcOizCJS/EV9sAdyujw9kUpzbeCCrkb2xxKIpIGD7SMHFsM8FZ X-Received: by 2002:a17:902:3181:: with SMTP id x1-v6mr24934994plb.338.1522093414544; Mon, 26 Mar 2018 12:43:34 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1522093414; cv=none; d=google.com; s=arc-20160816; b=oQpBuYDqnq4cULShxxESPmtdBmbY5WSnRuGFdNYSbyXMM189bWnlFZIQljiZ3Bm96M N4RC1hbg+2aP7OrKzt4cHl4Mtvd59jERhjjDknEnHg+RAN1QOCfkwZqrohDNkL6yGOk7 YZIkdTlSvDE9UEYwSxRolntY+ipPUeJTG7/q5YeYPm0vyXgCls9Q0X0pvd3MOIXoeQtp l0I25BAQlMBYjJ/f20tCJ7YU2nBMOTZ6onpEl4D1UcANZUjj1WmZzo6REDCGIE0wD+VE 1e4kOKQ4lQZMCk6Gwxd4H/r7eQR3N7oV6qCWu+ZJ1Lqb1pmyrjo2s2hVmXyHTDL03/CE mfqg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:arc-authentication-results; bh=ctZAsqV0iPeMpaUeUYSJpECHwC5UBdVyshbb7tYIh4M=; b=z+tlkpR3+678RwaQ5Yaqej+Bp7eOj4cYptmCE9W8OGPy/CEIbZTgNnVN2tFound6kn Al9BgAgMwPnDXy5bgUCHk7k2MBLiUwMy9tsIoapAI1qAvZ0NgccsSwKE+pVYj4wEUyDA P0/JO7i1q21QNPKF13qxoBFOrN2PDkfBeWhqKI+FIcYbVbYf+mPQGO+MI10iP0ZOPk3G rJlKTZg2ZGW4v7Erh5Qh/L5tbH0g0QpBrjoOg4Ypyb05IYvp0qns3VzMfu0zlM4LfKin yp9Qr58vRoZF6mSCW3GJIiY8err2XOa+oo3oBR5Mfj8HHvyNaEIFnmKk7SbE1x+AMcfc 7FvA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e127si12004934pfc.315.2018.03.26.12.43.19; Mon, 26 Mar 2018 12:43:34 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751912AbeCZTmR (ORCPT + 99 others); Mon, 26 Mar 2018 15:42:17 -0400 Received: from mx3-rdu2.redhat.com ([66.187.233.73]:44770 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751024AbeCZTmQ (ORCPT ); Mon, 26 Mar 2018 15:42:16 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.rdu2.redhat.com [10.11.54.3]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 3D3BE813F73E; Mon, 26 Mar 2018 19:42:16 +0000 (UTC) Received: from mguzik (ovpn-204-70.brq.redhat.com [10.40.204.70]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 5DDC210FFE6F; Mon, 26 Mar 2018 19:42:14 +0000 (UTC) Date: Mon, 26 Mar 2018 21:42:11 +0200 From: Mateusz Guzik To: Yang Shi Cc: adobriyan@gmail.com, mhocko@kernel.org, willy@infradead.org, gorcunov@openvz.org, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [v2 PATCH] mm: introduce arg_lock to protect arg_start|end and env_start|end in mm_struct Message-ID: <20180326194210.7bf6u2wo44oh4n7z@mguzik> References: <1522088439-105930-1-git-send-email-yang.shi@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: <1522088439-105930-1-git-send-email-yang.shi@linux.alibaba.com> User-Agent: Mutt/1.6.0.1 (2016-04-01) X-Scanned-By: MIMEDefang 2.78 on 10.11.54.3 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Mon, 26 Mar 2018 19:42:16 +0000 (UTC) X-Greylist: inspected by milter-greylist-4.5.16 (mx1.redhat.com [10.11.55.8]); Mon, 26 Mar 2018 19:42:16 +0000 (UTC) for IP:'10.11.54.3' DOMAIN:'int-mx03.intmail.prod.int.rdu2.redhat.com' HELO:'smtp.corp.redhat.com' FROM:'mguzik@redhat.com' RCPT:'' Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 27, 2018 at 02:20:39AM +0800, Yang Shi wrote: > mmap_sem is on the hot path of kernel, and it very contended, but it is > abused too. It is used to protect arg_start|end and evn_start|end when > reading /proc/$PID/cmdline and /proc/$PID/environ, but it doesn't make > sense since those proc files just expect to read 4 values atomically and > not related to VM, they could be set to arbitrary values by C/R. > They are not arbitrary - there is basic validation performed when setting them. > And, the mmap_sem contention may cause unexpected issue like below: > > INFO: task ps:14018 blocked for more than 120 seconds. > Tainted: G E 4.9.79-009.ali3000.alios7.x86_64 #1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this > message. > ps D 0 14018 1 0x00000004 > ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0 > ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040 > 00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000 > Call Trace: > [] ? __schedule+0x250/0x730 > [] schedule+0x36/0x80 > [] rwsem_down_read_failed+0xf0/0x150 > [] call_rwsem_down_read_failed+0x18/0x30 > [] down_read+0x20/0x40 > [] proc_pid_cmdline_read+0xd9/0x4e0 > [] ? do_filp_open+0xa5/0x100 > [] __vfs_read+0x37/0x150 > [] ? security_file_permission+0x9b/0xc0 > [] vfs_read+0x96/0x130 > [] SyS_read+0x55/0xc0 > [] entry_SYSCALL_64_fastpath+0x1a/0xc5 > > Both Alexey Dobriyan and Michal Hocko suggested to use dedicated lock > for them to mitigate the abuse of mmap_sem. > While switching to arg spinlock here will relieve mmap_sem to an extent, it wont help with the problem you are seeing here. proc_pid_cmdline_read -> access_process_vm -> __access_remote_vm and you got yet another down_read(&mm->mmap_sem);. i.e. the issue you ran into is well known and predates my change. The problem does not stem from contention either, but blocking for a long time while holding the lock - the most common example is dealing with dead nfs mount vs mmaped areas. I don't have good ideas how to fix the problem. The least bad I came up with was to trylock with a timeout - after a failure either return an error or resort to returning p_comm. ps/top could be modified to fallback to snatching the name from /status. Since the lock owner is now being stored in the semaphore, perhaps the above routine can happily spin until it grabs the lock or the owner is detected to have gone into uninterruptible sleep and react accordingly. I don't know whether it is feasible to somehow avoid the mmap lock altogether. If it has to be there no matter what the code can be refactored to grab it once and relock only if copyout would fault. This would in particular reduce the number of times it is taken to begin with and still provide the current synchronisation against prctl. But the fundamental problem will remain. That said, refactoring above will have the same effect as your patch and will avoid growing mm_struct. That's my $0,03. MM overlords have to comment on what to do with this. > So, introduce a new spinlock in mm_struct to protect the concurrent access > to arg_start|end and env_start|end. > > And, commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap > sem for writing to protect against others") changed down_read to > down_write to avoid write race condition in prctl_set_mm(). Since we > already have dedicated lock to protect them, it is safe to change back > to down_read. > > Signed-off-by: Yang Shi > Cc: Alexey Dobriyan > Cc: Michal Hocko > Cc: Matthew Wilcox > Cc: Mateusz Guzik > Cc: Cyrill Gorcunov > --- > v1 --> v2: > * Use spinlock instead of rwlock per Mattew's suggestion > * Replace down_write to down_read in prctl_set_mm (see commit log for details) > > fs/proc/base.c | 8 ++++---- > include/linux/mm_types.h | 2 ++ > kernel/fork.c | 1 + > kernel/sys.c | 14 ++++++++++---- > mm/init-mm.c | 1 + > 5 files changed, 18 insertions(+), 8 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index 9298324..e0282b6 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -242,12 +242,12 @@ static ssize_t proc_pid_cmdline_read(struct file *file, char __user *buf, > goto out_mmput; > } > > - down_read(&mm->mmap_sem); > + spin_lock(&mm->arg_lock); > arg_start = mm->arg_start; > arg_end = mm->arg_end; > env_start = mm->env_start; > env_end = mm->env_end; > - up_read(&mm->mmap_sem); > + spin_unlock(&mm->arg_lock); > > BUG_ON(arg_start > arg_end); > BUG_ON(env_start > env_end); > @@ -929,10 +929,10 @@ static ssize_t environ_read(struct file *file, char __user *buf, > if (!mmget_not_zero(mm)) > goto free; > > - down_read(&mm->mmap_sem); > + spin_lock(&mm->arg_lock); > env_start = mm->env_start; > env_end = mm->env_end; > - up_read(&mm->mmap_sem); > + spin_unlock(&mm->arg_lock); > > while (count > 0) { > size_t this_len, max_len; > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index fd1af6b..3be4588 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -413,6 +413,8 @@ struct mm_struct { > unsigned long def_flags; > unsigned long start_code, end_code, start_data, end_data; > unsigned long start_brk, brk, start_stack; > + > + spinlock_t arg_lock; /* protect concurrent access to arg_* and env_* */ > unsigned long arg_start, arg_end, env_start, env_end; > > unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */ > diff --git a/kernel/fork.c b/kernel/fork.c > index e5d9d40..6540ae7 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -898,6 +898,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, > mm->pinned_vm = 0; > memset(&mm->rss_stat, 0, sizeof(mm->rss_stat)); > spin_lock_init(&mm->page_table_lock); > + spin_lock_init(&mm->arg_lock); > mm_init_cpumask(mm); > mm_init_aio(mm); > mm_init_owner(mm, p); > diff --git a/kernel/sys.c b/kernel/sys.c > index f2289de..17bddd2 100644 > --- a/kernel/sys.c > +++ b/kernel/sys.c > @@ -1959,7 +1959,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data > return error; > } > > - down_write(&mm->mmap_sem); > + down_read(&mm->mmap_sem); > > /* > * We don't validate if these members are pointing to > @@ -1980,10 +1980,13 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data > mm->start_brk = prctl_map.start_brk; > mm->brk = prctl_map.brk; > mm->start_stack = prctl_map.start_stack; > + > + spin_lock(&mm->arg_lock); > mm->arg_start = prctl_map.arg_start; > mm->arg_end = prctl_map.arg_end; > mm->env_start = prctl_map.env_start; > mm->env_end = prctl_map.env_end; > + spin_unlock(&mm->arg_lock); > > /* > * Note this update of @saved_auxv is lockless thus > @@ -1996,7 +1999,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data > if (prctl_map.auxv_size) > memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv)); > > - up_write(&mm->mmap_sem); > + up_read(&mm->mmap_sem); > return 0; > } > #endif /* CONFIG_CHECKPOINT_RESTORE */ > @@ -2063,7 +2066,7 @@ static int prctl_set_mm(int opt, unsigned long addr, > > error = -EINVAL; > > - down_write(&mm->mmap_sem); > + down_read(&mm->mmap_sem); > vma = find_vma(mm, addr); > > prctl_map.start_code = mm->start_code; > @@ -2149,14 +2152,17 @@ static int prctl_set_mm(int opt, unsigned long addr, > mm->start_brk = prctl_map.start_brk; > mm->brk = prctl_map.brk; > mm->start_stack = prctl_map.start_stack; > + > + spin_lock(&mm->arg_lock); > mm->arg_start = prctl_map.arg_start; > mm->arg_end = prctl_map.arg_end; > mm->env_start = prctl_map.env_start; > mm->env_end = prctl_map.env_end; > + spin_unlock(&mm->arg_lock); > > error = 0; > out: > - up_write(&mm->mmap_sem); > + up_read(&mm->mmap_sem); > return error; > } > > diff --git a/mm/init-mm.c b/mm/init-mm.c > index f94d5d1..66cce4c 100644 > --- a/mm/init-mm.c > +++ b/mm/init-mm.c > @@ -23,6 +23,7 @@ struct mm_struct init_mm = { > .mmap_sem = __RWSEM_INITIALIZER(init_mm.mmap_sem), > .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock), > .mmlist = LIST_HEAD_INIT(init_mm.mmlist), > + .arg_lock = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock), > .user_ns = &init_user_ns, > INIT_MM_CONTEXT(init_mm) > }; > -- > 1.8.3.1 > -- Mateusz Guzik