Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com,
    akpm@linux-foundation.org, linux-mm@kvack.org,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Mon, 25 Nov 2019 11:33:41 -0800
Message-ID: <9a68b929-2f84-083d-0ac8-2ceb3eab8785@linux.alibaba.com>
In-Reply-To: <20191125183350.5gmcln6t3ofszbsy@box>
References: <1574471132-55639-1-git-send-email-yang.shi@linux.alibaba.com>
    <20191125093611.hlamtyo4hvefwibi@box>
    <3a35da3a-dff0-a8ca-8269-3018fff8f21b@linux.alibaba.com>
    <20191125183350.5gmcln6t3ofszbsy@box>

On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>
>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>> Currently when truncating shmem file, if the range is partial of THP
>>>> (start or end is in the middle of THP), the pages actually will just get
>>>> cleared rather than being freed unless the range covers the whole THP.
>>>> Even though all the subpages are truncated (randomly or sequentially),
>>>> the THP may still be kept in page cache. This might be fine for some
>>>> usecases which prefer preserving THP.
>>>>
>>>> But, when doing balloon inflation in QEMU, QEMU actually does hole punch
>>>> or MADV_DONTNEED in base page size granularity if hugetlbfs is not used.
>>>> So, when using shmem THP as memory backend QEMU inflation actually doesn't
>>>> work as expected since it doesn't free memory. But, the inflation
>>>> usecase really needs to get the memory freed.
>>>> Anonymous THP will not get freed right away too but it will be freed
>>>> eventually when all subpages are unmapped, but shmem THP would still
>>>> stay in page cache.
>>>>
>>>> To protect the usecases which may prefer preserving THP, introduce a
>>>> new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting THP is
>>>> preferred behavior if truncating partial THP. This mode just makes
>>>> sense to tmpfs for the time being.
>>> We need to clarify interaction with khugepaged. This implementation
>>> doesn't do anything to prevent khugepaged from collapsing the range back
>>> to THP just after the split.
>> Yes, it doesn't. Will clarify this in the commit log.
> Okay, but I'm not sure that documentation alone will be enough. We need
> proper design.

Maybe we could hold the inode lock for read during collapse_file(). shmem
fallocate already takes the inode lock for write, so this should
synchronize hole punch with khugepaged. shmem only needs the inode lock
for llseek and fallocate, and I suppose they are not called frequently
enough to have an impact on khugepaged. llseek might be called often, but
it should be quite fast. However, they might get blocked by khugepaged.
Holding a rwsem while collapsing a THP sounds safe.

Or we could set VM_NOHUGEPAGE in the shmem inode's flags at hole punch and
clear it after the truncate, then check the flag in khugepaged before
doing the collapse. khugepaged should not need to hold the inode lock
during the collapse, since the lock could be released after the flag is
checked.
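A rough user-space sketch of the flag idea, not kernel code. The names toy_inode, no_collapse, toy_punch_begin(), toy_punch_end() and toy_try_collapse() are made up for illustration, and the model is single-threaded: it only shows the intended ordering (set the flag, split and truncate, clear the flag) and assumes the real check in khugepaged would be done under the inode lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a shmem inode; every name here is hypothetical. */
struct toy_inode {
	atomic_bool no_collapse;	/* plays the role of the proposed flag */
	bool huge;			/* true while the range is backed by a THP */
};

void toy_inode_init(struct toy_inode *inode, bool huge)
{
	assert(inode != NULL);
	atomic_init(&inode->no_collapse, false);
	inode->huge = huge;
}

/* Hole punch entry: forbid collapse first, then split the huge page. */
void toy_punch_begin(struct toy_inode *inode)
{
	atomic_store(&inode->no_collapse, true);
	inode->huge = false;	/* stands in for the split plus truncate */
}

/* Hole punch exit: collapse is allowed again from here on. */
void toy_punch_end(struct toy_inode *inode)
{
	atomic_store(&inode->no_collapse, false);
}

/* khugepaged side: check the flag before attempting the collapse. */
bool toy_try_collapse(struct toy_inode *inode)
{
	if (atomic_load(&inode->no_collapse))
		return false;	/* hole punch in progress, skip this range */
	inode->huge = true;
	return true;
}
```

Calling toy_try_collapse() between toy_punch_begin() and toy_punch_end() is refused in this model; once the flag is cleared the collapse goes through again, so the flag only covers the window of the hole punch itself.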
>
>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>  			}
>>>>  			unlock_page(page);
>>>>  		}
>>>> +rescan_split:
>>>>  		pagevec_remove_exceptionals(&pvec);
>>>>  		pagevec_release(&pvec);
>>>> +
>>>> +		if (split && PageTransCompound(page)) {
>>>> +			/* The THP may get freed under us */
>>>> +			if (!get_page_unless_zero(compound_head(page)))
>>>> +				goto rescan_out;
>>>> +
>>>> +			lock_page(page);
>>>> +
>>>> +			/*
>>>> +			 * The extra pins from page cache lookup have been
>>>> +			 * released by pagevec_release().
>>>> +			 */
>>>> +			if (!split_huge_page(page)) {
>>>> +				unlock_page(page);
>>>> +				put_page(page);
>>>> +				/* Re-look up page cache from current index */
>>>> +				goto again;
>>>> +			}
>>>> +			unlock_page(page);
>>>> +			put_page(page);
>>>> +		}
>>>> +rescan_out:
>>>>  		index++;
>>>>  	}
>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>> page looks very suboptimal.
>> If I don't drop the pins the THP can't be split. And, there might be more
>> than one pin from find_get_entries() if I read the code correctly. For
>> example, truncate 8K length in the middle of THP, the THP's refcount would
>> get bumped twice since two sub pages would be returned.
> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>
> Current code is buggy. You need to check that the page still belongs to
> the file after speculative lookup.

Yes, I missed this point. Thanks for the suggestion.
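The ordering suggested above can be sketched with a toy refcount model in user space. This is not the kernel implementation: toy_page, TOY_CACHE_PINS and toy_split_after_lookup() are hypothetical names, the page cache is assumed to hold exactly one pin, and locking is left out. It only illustrates taking our own pin while the lookup pins are still held, dropping the lookup pins, rechecking that the page still belongs to the file, and only then splitting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page: all fields and helpers are illustrative, not kernel API. */
struct toy_page {
	int refcount;		/* total pins, including the page cache's */
	const void *mapping;	/* owning file; changes if the page moved away */
	bool huge;
};

enum { TOY_CACHE_PINS = 1 };	/* assumption: the page cache holds one pin */

void toy_get_page(struct toy_page *page)
{
	assert(page->refcount > 0);	/* legal only while a pin is already held */
	page->refcount++;
}

void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

/*
 * lookup_pins is the number of pins the lookup took on this page
 * (one per returned subpage). Returns true if the page was split.
 */
bool toy_split_after_lookup(struct toy_page *page, const void *mapping,
			    int lookup_pins)
{
	/* 1. Pin first: the lookup pins guarantee refcount > 0 here. */
	toy_get_page(page);

	/* 2. Drop the lookup pins (stand-in for releasing the pagevec). */
	while (lookup_pins-- > 0)
		toy_put_page(page);

	/* 3. The page would be locked here; recheck that it is still ours. */
	if (page->mapping != mapping) {
		toy_put_page(page);
		return false;
	}

	/* 4. Split only when our pin is the only one besides the cache's. */
	bool ok = page->huge && page->refcount == TOY_CACHE_PINS + 1;
	if (ok)
		page->huge = false;

	toy_put_page(page);
	return ok;
}
```

With two lookup pins (the 8K example above), the model ends with only the page cache pin left after a successful split, and it refuses to split a page whose mapping no longer matches the file being truncated.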