Date: Wed, 23 Sep 2020 14:17:56 -0300
From: Jason Gunthorpe
To: Peter Xu
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Linus Torvalds,
	Michal Hocko, Kirill Shutemov, Jann Horn, Oleg Nesterov,
	Kirill Tkhai, Hugh Dickins, Leon Romanovsky, Jan Kara,
	John Hubbard, Christoph Hellwig, Andrew Morton, Andrea Arcangeli
Subject: Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()
Message-ID: <20200923171756.GC9916@ziepe.ca>
References: <20200921211744.24758-1-peterx@redhat.com> <20200921212031.25233-1-peterx@redhat.com> <20200922120505.GH8409@ziepe.ca> <20200923152409.GC59978@xz-x1>
In-Reply-To: <20200923152409.GC59978@xz-x1>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Sep 23, 2020 at 11:24:09AM -0400, Peter Xu wrote:
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at
05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because
> > > follow-up copy-on-write on these pages could cause the pinned pages to
> > > be replaced by random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected, so that
> > > future handling will be done at the PTE level (with our latest changes,
> > > each of the small pages will be copied). We can achieve this by letting
> > > copy_huge_pmd() return -EAGAIN for pinned pages, so that we'll fall
> > > through in copy_pmd_range() and finally land in the next
> > > copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far they do not support
> > > anonymous pages. But they can actually be handled the same as the huge
> > > PMDs, even if splitting huge PUDs means erasing the PUD entries. It'll
> > > guarantee that the follow-up fault-ins will remap the same pages in
> > > either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and
> > > clean enough. It should be fine, since we're tackling a very rare case
> > > just to make sure userspaces that pinned some thps will still work even
> > > without MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu
> > > ---
> > >  mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > >  1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > >  	src_page = pmd_page(pmd);
> > >  	VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > +	/*
> > > +	 * If this page is a potentially pinned page, split and retry the fault
> > > +	 * with smaller page size. Normally this should not happen because the
> > > +	 * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > > +	 * best effort that the pinned pages won't be replaced by another
> > > +	 * random page during the coming copy-on-write.
> > > +	 */
> > > +	if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > +		     page_maybe_dma_pinned(src_page))) {
> > > +		pte_free(dst_mm, pgtable);
> > > +		spin_unlock(src_ptl);
> > > +		spin_unlock(dst_ptl);
> > > +		__split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > +		return -EAGAIN;
> > > +	}
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.

Sure, splitting is safe, but testing has_pinned without checking COW is
not, for the reason Jann explained.

The 'maybe' in page_maybe_dma_pinned() means it can return true when the
correct answer is false. It can never return false when the correct
answer is true.

It is the same when has_pinned is involved: the combined expression must
never return false when true is correct. Which means it can only be
applied to COW cases.

Jason