Date: Thu, 9 Jan 2020 10:52:40 +0800
From: Wei Yang
To: Konstantin Khlebnikov
Cc: Wei Yang, linux-mm@kvack.org, Andrew Morton,
 linux-kernel@vger.kernel.org, Rik van Riel, Li Xinhai, "Kirill A.
 Shutemov"
Subject: Re: [PATCH v2 1/2] mm/rmap: fix and simplify reusing mergeable anon_vma as parent when fork
Message-ID: <20200109025240.GA2000@richard>
References: <157839239609.694.10268055713935919822.stgit@buzz> <20200108023211.GC13943@richard>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jan 08, 2020 at 01:40:44PM +0300, Konstantin Khlebnikov wrote:
>On 08/01/2020 05.32, Wei Yang wrote:
>> On Tue, Jan 07, 2020 at 01:19:56PM +0300, Konstantin Khlebnikov wrote:
>> > This fixes some misconceptions in commit 4e4a9eb92133 ("mm/rmap.c: reuse
>> > mergeable anon_vma as parent when fork"). It merges anon-vma in unexpected
>> > way but fortunately still produces valid anon-vma tree, so nothing crashes.
>> >
>> > If in parent VMAs: SRC1 SRC2 .. SRCn share anon-vma ANON0, then after fork
>> > before all patches in child process related VMAs: DST1 DST2 .. DSTn will
>> > fork independent anon-vmas: ANON1 ANON2 .. ANONn (each is child of ANON0).
>> > Before this patch only DST1 will fork new ANON1 and following DST2 .. DSTn
>> > will share parent's ANON0 (i.e. anon-vma tree is valid but isn't optimal).
>> > With this patch DST1 will create new ANON1 and DST2 .. DSTn will share it.
>> >
>> > Root problem caused by initialization order in dup_mmap(): vma->vm_prev
>> > is set after calling anon_vma_fork(). Thus in anon_vma_fork() it points to
>> > previous VMA in parent mm.
>> >
>> > Second problem is hidden behind first one: assumption "Parent has vm_prev,
>> > which implies we have vm_prev" is wrong if first VMA in parent mm has set
>> > flag VM_DONTCOPY. Luckily prev->anon_vma doesn't dereference NULL pointer
>> > because in current code 'prev' actually is same as 'pprev'.
>> >
>> > Third hidden problem is linking between VMA and anon-vmas whose pages it
>> > could contain. Loop in anon_vma_clone() attaches only parent's anon-vmas,
>> > shared anon-vma isn't attached. But every mapped page stays reachable in
>> > rmap because we erroneously share anon-vma from parent's previous VMA.
>> >
>> > This patch moves sharing logic out of anon_vma_clone() into more specific
>> > anon_vma_fork() because this supposed to work only at fork() and simply
>> > reuses anon_vma from previous VMA if it is forked from the same anon-vma.
>> >
>> > Signed-off-by: Konstantin Khlebnikov
>> > Reported-by: Li Xinhai
>> > Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
>> > Link: https://lore.kernel.org/linux-mm/CALYGNiNzz+dxHX0g5-gNypUQc3B=8_Scp53-NTOh=zWsdUuHAw@mail.gmail.com/T/#t
>> > ---
>> >  include/linux/rmap.h |    3 ++-
>> >  kernel/fork.c        |    2 +-
>> >  mm/rmap.c            |   23 +++++++++-------------
>> >  3 files changed, 12 insertions(+), 16 deletions(-)
>> >
>> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> > index 988d176472df..560e4480dcd0 100644
>> > --- a/include/linux/rmap.h
>> > +++ b/include/linux/rmap.h
>> > @@ -143,7 +143,8 @@ void anon_vma_init(void);	/* create anon_vma_cachep */
>> >  int  __anon_vma_prepare(struct vm_area_struct *);
>> >  void unlink_anon_vmas(struct vm_area_struct *);
>> >  int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
>> > -int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
>> > +int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma,
>> > +		  struct vm_area_struct *prev);
>> >
>> >  static inline int anon_vma_prepare(struct vm_area_struct *vma)
>> >  {
>> > diff --git a/kernel/fork.c b/kernel/fork.c
>> > index 2508a4f238a3..c33626993831 100644
>> > --- a/kernel/fork.c
>> > +++ b/kernel/fork.c
>> > @@ -556,7 +556,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> >  			tmp->anon_vma = NULL;
>> >  			if (anon_vma_prepare(tmp))
>> >  				goto fail_nomem_anon_vma_fork;
>> > -		} else if (anon_vma_fork(tmp, mpnt))
>> > +		} else if (anon_vma_fork(tmp, mpnt, prev))
>> >  			goto fail_nomem_anon_vma_fork;
>> >  		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
>> >  		tmp->vm_next = tmp->vm_prev = NULL;
>> > diff --git a/mm/rmap.c b/mm/rmap.c
>> > index b3e381919835..3c1e04389291 100644
>> > --- a/mm/rmap.c
>> > +++ b/mm/rmap.c
>> > @@ -269,19 +269,6 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>> >  {
>> >  	struct anon_vma_chain *avc, *pavc;
>> >  	struct anon_vma *root = NULL;
>> > -	struct vm_area_struct *prev = dst->vm_prev, *pprev = src->vm_prev;
>> > -
>> > -	/*
>> > -	 * If parent share anon_vma with its vm_prev, keep this sharing in in
>> > -	 * child.
>> > -	 *
>> > -	 * 1. Parent has vm_prev, which implies we have vm_prev.
>> > -	 * 2. Parent and its vm_prev have the same anon_vma.
>> > -	 */
>> > -	if (!dst->anon_vma && src->anon_vma &&
>> > -	    pprev && pprev->anon_vma == src->anon_vma)
>> > -		dst->anon_vma = prev->anon_vma;
>> > -
>> >
>> >  	list_for_each_entry_reverse(pavc, &src->anon_vma_chain, same_vma) {
>> >  		struct anon_vma *anon_vma;
>> > @@ -332,7 +319,8 @@ int anon_vma_clone(struct vm_area_struct *dst, struct vm_area_struct *src)
>> >   * the corresponding VMA in the parent process is attached to.
>> >   * Returns 0 on success, non-zero on failure.
>> >   */
>> > -int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>> > +int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma,
>> > +		  struct vm_area_struct *prev)
>> >  {
>> >  	struct anon_vma_chain *avc;
>> >  	struct anon_vma *anon_vma;
>> > @@ -342,6 +330,13 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
>> >  	if (!pvma->anon_vma)
>> >  		return 0;
>> >
>> > +	/* Share anon_vma with previous VMA if it has the same parent. */
>> > +	if (prev && prev->anon_vma &&
>> > +	    prev->anon_vma->parent == pvma->anon_vma) {
>> > +		vma->anon_vma = prev->anon_vma;
>> > +		return anon_vma_clone(vma, prev);
>> > +	}
>> > +
>>
>> I am afraid this one change the intended behavior. Let's put a chart to
>> describe.
>>
>> Commit 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when
>> fork") tries to improve the following situation.
>>
>> Before the commit, the behavior is like this:
>>
>> Parent process:
>>
>>       +-----+
>>       | pav |<-----------------+----------------------+
>>       +-----+                  |                      |
>>                                |                      |
>>                    +-----------+          +-----------+
>>                    |pprev      |          |pvma       |
>>                    +-----------+          +-----------+
>>
>> Child Process
>>
>>
>>       +-----+                     +-----+
>>       | av1 |<-----------------+  | av2 |<------------+
>>       +-----+                  |  +-----+             |
>>                                |                      |
>>                    +-----------+          +-----------+
>>                    |prev       |          |vma        |
>>                    +-----------+          +-----------+
>>
>>
>> Parent pprev and pvma share the same anon_vma due to
>> find_mergeable_anon_vma(). While the anon_vma_clone() would pick up different
>> anon_vma for child process's vma.
>>
>> The purpose of my commit is to give child process the following shape.
>>
>>       +-----+
>>       | av  |<-----------------+----------------------+
>>       +-----+                  |                      |
>>                                |                      |
>>                    +-----------+          +-----------+
>>                    |prev       |          |vma        |
>>                    +-----------+          +-----------+
>>
>> After this, we reduce the extra "av2" for child process. But yes, because of
>> the two reasons you found, it didn't do the exact thing.
>>
>> While if my understanding is correct, the anon_vma_clone() would pick up any
>> anon_vma in its process tree, except parent's. If this fails to get a reusable
>> one, anon_vma_fork() would allocate one, whose parent is pvma->anon_vma.
>>
>> Let me summarise original behavior:
>>
>>   * if anon_vma_clone succeed, it find one anon_vma in the process tree, but
>>     it could not be pvma->anon_vma
>>   * if anon_vma_clone fail, it will allocate a new anon_vma and its parent is
>>     pvma->anon_vma
>>
>> Then take a look into your code here.
>>
>> "prev->anon_vma->parent == pvma->anon_vma" means prev->anon_vma parent is
>> pvma's anon_vma. If my understanding is correct, this just match the second
>> case. For "prev", we didn't find a reusable anon_vma and allocate a new one.
>>
>> But how about the first case? prev reuse an anon_vma in the process tree which
>> is not parent's?
>
>If anon_vma_clone() pick old anon-vma for first vma in sharing chain (prev)
>then second vma (vma) will fork new anon-vma (unless pick another old anon-vma),
>then third vma will share it. And so on.

No, I am afraid you are not correct here. Or I don't understand your sentence.

This is my understanding of the behavior before my commit. Suppose av1 and
av2 are both reused from old anon_vmas. And if my understanding is correct,
they are different from pvma->anon_vma. Then how does your code match this
situation?

      +-----+                     +-----+
      | av1 |<-----------------+  | av2 |<------------+
      +-----+                  |  +-----+             |
                               |                      |
                   +-----------+          +-----------+
                   |prev       |          |vma        |
                   +-----------+          +-----------+

Would you explain what you mean by the second and third vma in your sentence?
Which case are you trying to illustrate?

>Fork works left to right - we don't know about next vma to predict sharing and
>choose better options.
>
>But reusing old vma doesn't allocates new one. It's better to not reuse them

You mean reuse old anon_vma here?

>second time because this makes tree less optimal (and actually not a tree anymore).
>This is just a trick to prevent unlimited growth anon-vma chains in background:
>while each anon-vma has at least one vma or two children then their count is
>limited with count of vmas which are visible and limited.
>
>>
>> > 	/* Drop inherited anon_vma, we'll reuse existing or allocate new. */
>> > 	vma->anon_vma = NULL;
>> >
>>
>> --
>> Wei Yang
>> Help you, Help me
>>

-- 
Wei Yang
Help you, Help me
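
The two cases discussed above for the quoted check "prev->anon_vma->parent == pvma->anon_vma" can be made concrete with a small userspace sketch. This is only a toy model, not kernel code: toy_anon_vma, toy_vma, toy_would_share() and toy_run_cases() are hypothetical names, the structures keep only the two pointers the condition reads, and the model assumes pvma->anon_vma is non-NULL (the quoted anon_vma_fork() returns early otherwise). Unlike the kernel, the toy root simply has a NULL parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; only the fields the check reads. */
struct toy_anon_vma {
	struct toy_anon_vma *parent;	/* anon_vma this one was forked from */
};

struct toy_vma {
	struct toy_anon_vma *anon_vma;
};

/*
 * Mirrors the condition in the quoted patch: the child VMA shares prev's
 * anon_vma only if that anon_vma is a direct child of the parent VMA's
 * anon_vma. Assumes pvma->anon_vma is non-NULL.
 */
static bool toy_would_share(const struct toy_vma *prev,
			    const struct toy_vma *pvma)
{
	return prev && prev->anon_vma &&
	       prev->anon_vma->parent == pvma->anon_vma;
}

/* Builds the two cases from the summary and checks what the condition says. */
static void toy_run_cases(void)
{
	struct toy_anon_vma old = { NULL };	/* older anon_vma up the tree */
	struct toy_anon_vma pav = { &old };	/* parent process's anon_vma */
	struct toy_anon_vma av1 = { &pav };	/* newly allocated for prev */
	struct toy_vma pvma = { &pav };
	struct toy_vma prev_new = { &av1 };	/* second case: new anon_vma */
	struct toy_vma prev_old = { &old };	/* first case: reused old one */

	assert(toy_would_share(&prev_new, &pvma));	/* vma shares av1 */
	assert(!toy_would_share(&prev_old, &pvma));	/* no sharing here */
	assert(!toy_would_share(NULL, &pvma));		/* first VMA, no prev */
}
```

In this model the condition holds only when prev got a newly allocated anon_vma (the second case). When prev reused an older anon_vma (the first case) the condition is false, so the next VMA does not share with prev. That first case is the situation the question above asks about.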