Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com> <20191125093611.hlamtyo4hvefwibi@box>
From: Yang Shi
Message-ID: <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>
Date: Mon, 25 Nov 2019 10:24:38 -0800
In-Reply-To: <20191125093611.hlamtyo4hvefwibi@box>
X-Mailing-List: linux-kernel@vger.kernel.org

On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>> Currently when truncating shmem file, if the range is partial of THP
>> (start or end is in the middle of THP), the pages actually will just get
>> cleared rather than being freed unless the range covers the whole THP.
>> Even though all the subpages are truncated (randomly or sequentially),
>> the THP may still be kept in page cache. This might be fine for some
>> usecases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>> or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>> work as expected since it doesn't free memory. But, the inflation
>> usecase really needs to get the memory freed. Anonymous THP will not get
>> freed right away too but it will be freed eventually when all subpages are
>> unmapped, but shmem THP would still stay in page cache.
>>
>> To protect the usecases which may prefer preserving THP, introduce a
>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>> preferred behavior if truncating partial THP. This mode just makes
>> sense to tmpfs for the time being.
> We need to clarify interaction with khugepaged. This implementation
> doesn't do anything to prevent khugepaged from collapsing the range back
> to THP just after the split.

Yes, it doesn't. Will clarify this in the commit log.

>
>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  			}
>>  			unlock_page(page);
>>  		}
>> +rescan_split:
>>  		pagevec_remove_exceptionals(&pvec);
>>  		pagevec_release(&pvec);
>> +
>> +		if (split && PageTransCompound(page)) {
>> +			/* The THP may get freed under us */
>> +			if (!get_page_unless_zero(compound_head(page)))
>> +				goto rescan_out;
>> +
>> +			lock_page(page);
>> +
>> +			/*
>> +			 * The extra pins from page cache lookup have been
>> +			 * released by pagevec_release().
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-look up page cache from current index */
>> +				goto again;
>> +			}
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +rescan_out:
>>  		index++;
>>  	}
> Doing get_page_unless_zero() just after you've dropped the pin for the
> page looks very suboptimal.

If I don't drop the pins, the THP can't be split. And there might be more
than one pin from find_get_entries(), if I read the code correctly. For
example, when truncating an 8K range in the middle of a THP, the THP's
refcount would get bumped twice, since two subpages would be returned.

>
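
The pin arithmetic in the last reply can be sketched as a user-space toy model. This is not kernel code and does not call find_get_entries(); it only assumes, as the reply does, that the lookup returns one entry per subpage in the range and that each entry holds one pin on the THP. The names thp_lookup_pins, BASE_PAGE_SIZE and HPAGE_NR are hypothetical, and 4K base pages with 2M THPs are assumed.

```c
#include <assert.h>

#define BASE_PAGE_SIZE 4096UL	/* assumption: 4K base pages */
#define HPAGE_NR       512UL	/* assumption: 2M THP = 512 subpages */

/*
 * Toy model (hypothetical helper, not a kernel API): count the pins a
 * per-subpage lookup would take on one THP for the byte range
 * [lstart, lend). thp_index is the page index of the THP's first subpage.
 * Both offsets must be base-page aligned.
 */
static unsigned long thp_lookup_pins(unsigned long thp_index,
				     unsigned long lstart, unsigned long lend)
{
	unsigned long first, last, thp_end;

	assert(lstart % BASE_PAGE_SIZE == 0 && lend % BASE_PAGE_SIZE == 0);

	first = lstart / BASE_PAGE_SIZE;	/* first page index in range */
	last = lend / BASE_PAGE_SIZE;		/* one past the last index */
	thp_end = thp_index + HPAGE_NR;		/* one past the THP's last subpage */

	/* Clamp the range to the subpages that belong to this THP. */
	if (first < thp_index)
		first = thp_index;
	if (last > thp_end)
		last = thp_end;

	return last > first ? last - first : 0;
}
```

Under these assumptions, thp_lookup_pins(0, 1UL << 20, (1UL << 20) + 8192) evaluates to 2, which matches the 8K example: both pins have to be dropped before split_huge_page() can succeed.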