Date: Mon, 25 Nov 2019 21:33:50 +0300
From: "Kirill A. Shutemov"
To: Yang Shi
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com,
 akpm@linux-foundation.org, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
Message-ID: <20191125183350.5gmcln6t3ofszbsy@box>
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com>
 <20191125093611.hlamtyo4hvefwibi@box>
 <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>
In-Reply-To: <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>

On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
> 
> 
> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> > On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
> > > Currently when truncating shmem file, if the range is partial of THP
> > > (start or end is in the middle of THP), the pages actually will just get
> > > cleared rather than being freed unless the range cover the whole THP.
> > > Even though all the subpages are truncated (randomly or sequentially),
> > > the THP may still be kept in page cache. This might be fine for some
> > > usecases which prefer preserving THP.
> > > 
> > > But, when doing balloon inflation in QEMU, QEMU actually does hole punch
> > > or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
> > > So, when using shmem THP as memory backend QEMU inflation actually doesn't
> > > work as expected since it doesn't free memory. But, the inflation
> > > usecase really needs get the memory freed. Anonymous THP will not get
> > > freed right away too but it will be freed eventually when all subpages are
> > > unmapped, but shmem THP would still stay in page cache.
> > > 
> > > To protect the usecases which may prefer preserving THP, introduce a
> > > new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
> > > preferred behavior if truncating partial THP. This mode just makes
> > > sense to tmpfs for the time being.
> > We need to clarify interaction with khugepaged. This implementation
> > doesn't do anything to prevent khugepaged from collapsing the range back
> > to THP just after the split.
> 
> Yes, it doesn't. Will clarify this in the commit log.

Okay, but I'm not sure that documentation alone will be enough. We need a
proper design.
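
For reference, the base-page-granularity hole punch that the quoted commit
message describes can be sketched from userspace roughly as below. This is
only an illustration under stated assumptions: the helper names
(make_populated_shmem, punch_one_base_page, allocated_bytes) are made up for
this sketch, a memfd stands in for the QEMU memory backend, and the
FALLOC_FL_SPLIT_HPAGE flag proposed in the RFC is deliberately not used,
since it exists only in the patch under review.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a shmem-backed file of `len` bytes and
 * write to every byte so that all of its pages are allocated. */
static int make_populated_shmem(size_t len)
{
    int fd = memfd_create("balloon-demo", 0);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    memset(p, 0xaa, len);
    munmap(p, len);
    return fd;
}

/* Hypothetical helper: punch a hole covering exactly one base page at
 * `offset`, keeping the file size unchanged. */
static int punch_one_base_page(int fd, off_t offset)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                     offset, (off_t)page);
}

/* Hypothetical helper: bytes currently backing the file, taken from
 * st_blocks (counted in 512-byte units). Returns -1 on error. */
static long long allocated_bytes(int fd)
{
    struct stat st;
    if (fstat(fd, &st) < 0)
        return -1;
    return (long long)st.st_blocks * 512;
}
```

Comparing allocated_bytes() before and after punch_one_base_page() is one
way to observe the behaviour under discussion: with base pages the value is
expected to drop by one page, while with a shmem THP covering the range it
may stay unchanged, because the subpage is only cleared.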
> > > @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
> > >  			}
> > >  			unlock_page(page);
> > >  		}
> > > +rescan_split:
> > >  		pagevec_remove_exceptionals(&pvec);
> > >  		pagevec_release(&pvec);
> > > +
> > > +		if (split && PageTransCompound(page)) {
> > > +			/* The THP may get freed under us */
> > > +			if (!get_page_unless_zero(compound_head(page)))
> > > +				goto rescan_out;
> > > +
> > > +			lock_page(page);
> > > +
> > > +			/*
> > > +			 * The extra pins from page cache lookup have been
> > > +			 * released by pagevec_release().
> > > +			 */
> > > +			if (!split_huge_page(page)) {
> > > +				unlock_page(page);
> > > +				put_page(page);
> > > +				/* Re-look up page cache from current index */
> > > +				goto again;
> > > +			}
> > > +			unlock_page(page);
> > > +			put_page(page);
> > > +		}
> > > +rescan_out:
> > >  		index++;
> > >  	}
> > Doing get_page_unless_zero() just after you've dropped the pin for the
> > page looks very suboptimal.
> 
> If I don't drop the pins the THP can't be split. And, there might be more
> than one pins from find_get_entries() if I read the code correctly. For
> example, truncate 8K length in the middle of THP, the THP's refcount would
> get bumped twice since two sub pages would be returned.

Pin the page before pagevec_release() and avoid get_page_unless_zero().

The current code is buggy. You need to check that the page still belongs
to the file after the speculative lookup.

-- 
 Kirill A. Shutemov
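
The ordering asked for in the reply above (take a pin while the lookup
references still keep the page alive, drop the lookup references, lock the
page, then recheck that it still belongs to the file) can be modelled in
plain userspace C. The following is a toy, single-threaded model and not
kernel code: every name in it (toy_page, toy_mapping, toy_pin_for_split and
so on) is invented for this sketch, and the lock is just a flag. It is one
reading of the suggestion, not the patch author's or the reviewer's
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for struct address_space and struct page. */
struct toy_mapping { int id; };

struct toy_page {
    int refcount;                /* must stay > 0 while anyone uses the page */
    struct toy_mapping *mapping; /* NULL once the page has been truncated */
    bool locked;                 /* stand-in for the page lock */
};

static void toy_get_page(struct toy_page *p)
{
    assert(p->refcount > 0);     /* only legal while someone else holds a ref */
    p->refcount++;
}

static void toy_put_page(struct toy_page *p)
{
    assert(p->refcount > 0);
    p->refcount--;
}

/* Drop the references taken by the batched lookup
 * (stand-in for pagevec_release()). */
static void toy_release_lookup_refs(struct toy_page *p, int nr)
{
    while (nr-- > 0)
        toy_put_page(p);
}

/*
 * Returns true with the page locked and holding exactly one pin of ours
 * if the page still belongs to `m`, so a split may be attempted.
 * Returns false with the lock and our pin dropped otherwise.
 */
static bool toy_pin_for_split(struct toy_page *p, struct toy_mapping *m,
                              int lookup_refs)
{
    toy_get_page(p);                         /* pin before releasing */
    toy_release_lookup_refs(p, lookup_refs); /* extra pins would block a split */
    p->locked = true;                        /* stand-in for lock_page() */

    if (p->mapping != m) {                   /* truncated or reused meanwhile */
        p->locked = false;
        toy_put_page(p);
        return false;
    }
    return true;
}
```

Because the pin is taken before the lookup references are dropped, the
refcount never reaches zero in between, so nothing like
get_page_unless_zero() is needed; the mapping check under the lock covers
the case where the page was truncated after the lookup.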