Subject: Re: [v2 PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Hugh Dickins
Cc: kirill.shutemov@linux.intel.com, aarcange@redhat.com,
    akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Wed, 26 Feb 2020 09:43:53 -0800
References: <1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com>
    <00f0bb7d-3c25-a65f-ea94-3e2de8e9bcdd@linux.alibaba.com>

On 2/24/20 7:46 PM, Hugh Dickins wrote:
> On Tue, 14 Jan 2020, Yang Shi wrote:
>> On 12/4/19 4:15 PM, Hugh Dickins wrote:
>>> On Wed, 4 Dec 2019, Yang Shi wrote:
>>>
>>>> Currently when truncating a shmem file, if the range covers only part
>>>> of a THP (start or end is in the middle of the THP), the pages actually
>>>> just get cleared rather than freed unless the range covers the whole
>>>> THP. Even after all the subpages have been truncated (randomly or
>>>> sequentially), the THP may still be kept in the page cache. This might
>>>> be fine for some use cases which prefer preserving THP.
>>>>
>>>> But when doing balloon inflation in QEMU, QEMU actually does the hole
>>>> punch or MADV_DONTNEED in base page size granularity if hugetlbfs is
>>>> not used. So when using shmem THP as the memory backend, QEMU inflation
>>>> doesn't work as expected since it doesn't free memory. But the inflation
>>>> use case really needs to get the memory freed. Anonymous THP will not
>>>> get freed right away either, but it will be freed eventually when all
>>>> subpages are unmapped; shmem THP would still stay in the page cache.
>>>>
>>>> Split the THP right away when doing a partial hole punch, and if the
>>>> split fails just clear the page so that reads of the hole-punched area
>>>> return zero.
>>>>
>>>> Cc: Hugh Dickins
>>>> Cc: Kirill A. Shutemov
>>>> Cc: Andrea Arcangeli
>>>> Signed-off-by: Yang Shi
>>>> ---
>>>> v2: * Adopted the comment from Kirill.
>>>>     * Dropped the fallocate mode flag, THP split is the default behavior.
>>>>     * Blended Hugh's implementation with my v1 patch. TBH I'm not very
>>>>       keen on Hugh's find_get_entries() hack (basically neutral), but
>>>>       without that hack
>>> Thanks for giving it a try. I'm not neutral about my find_get_entries()
>>> hack: it surely had to go (without it, I'd have just pushed my own patch).
>>> I've not noticed anything wrong with your patch, and it's in the right
>>> direction, but I'm still not thrilled with it. I also remember that I
>>> got the looping wrong in my first internal attempt (fixed in what I sent),
>>> and need to be very sure of the try-again-versus-move-on-to-next conditions
>>> before agreeing to anything. No rush, I'll come back to this in the days
>>> or months ahead: I'll try to find a less gotoey blend of yours and mine.
>> Hi Hugh,
>>
>> Any update on this one?
> I apologize for my dreadful unresponsiveness.
>
> I've spent the last day trying to love yours, and considering how mine
> might be improved; but repeatedly arrived at the conclusion that mine is
> about as nice as we can manage, just needing more comment to dignify it.

Those gotos do seem convoluted, I do agree.
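For anyone skimming the thread: what both patches are trying to do boils
down to roughly the sketch below. This is illustrative only, not either of
the actual patches; punch_or_split() is a made-up helper, and it assumes
the caller already holds the page lock, as shmem_undo_range() does.

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/huge_mm.h>
#include <linux/highmem.h>

/*
 * Illustrative sketch only: when a hole punch covers just part of a THP,
 * try to split it so the covered subpages can really be freed; if the
 * split fails (e.g. because of extra pins), fall back to clearing the
 * subpage so a read of the hole returns zero.  Caller holds the page lock.
 */
static void punch_or_split(struct address_space *mapping, struct page *page,
			   pgoff_t start, pgoff_t end)
{
	struct page *head = compound_head(page);

	if (!PageTransCompound(page)) {
		truncate_inode_page(mapping, page);
		return;
	}

	if (head->index >= start && head->index + HPAGE_PMD_NR <= end) {
		/* The whole THP lies inside the hole: drop it as one unit. */
		truncate_inode_page(mapping, head);
		return;
	}

	/* Partial cover: split, then the covered subpage can be freed. */
	if (!split_huge_page(page)) {
		truncate_inode_page(mapping, page);
		return;
	}

	/* Split failed: just clear the covered subpage. */
	clear_highpage(page);
	flush_dcache_page(page);
	set_page_dirty(page);
}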

> I did willingly call my find_get_entries() stopping at PageTransCompound
> a hack; but now think we should just document that behavior and accept it.
> The contortions of your patch come from the need to release those 14 extra
> unwanted references: much better not to get them in the first place.
>
> Neither of us handles a failed split optimally, we treat every tail as an
> opportunity to retry: which is good to recover from transient failures,
> but probably excessive. And we both have to restart the pagevec after
> each attempt, but at least I don't get 14 unwanted extras each time.
>
> What of other find_get_entries() users and its pagevec_lookup_entries()
> wrapper: does an argument to select this "stop at PageTransCompound"
> behavior need to be added?
>
> No. The pagevec_lookup_entries() calls from mm/truncate.c prefer the
> new behavior - evicting the head from page cache removes all the tails
> along with it, so getting the tails is a waste of time there too, just
> as it was in shmem_undo_range().

TBH I'm not a fan of this hack. It would bring in other confusion or
complexity: a pagevec is supposed to count base pages, but now it would
treat a THP as one page, and there might be a mix of base pages and THPs
in one pagevec. But I tend to agree that avoiding those 14 extra pins in
the first place might be the better approach; all the complexity in my
patch is there just to release those extra pins.
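Just to make sure we're talking about the same thing, my reading of the
"stop at PageTransCompound" behavior is roughly the below (illustrative
only, not Hugh's actual change; batch_should_stop() is a made-up helper):
the batch simply ends once it has picked up a huge page, so the caller
gets one pin on the head instead of up to PAGEVEC_SIZE pins on subpages
of the same THP.

#include <linux/pagevec.h>
#include <linux/page-flags.h>

/*
 * Illustrative sketch only: decide whether a page-cache lookup should end
 * its current batch after adding @page.  Stopping at a transparent huge
 * page means the pagevec carries a single reference on the head instead
 * of up to PAGEVEC_SIZE references on subpages of the same THP.
 */
static bool batch_should_stop(struct pagevec *pvec, struct page *page)
{
	if (pagevec_count(pvec) >= PAGEVEC_SIZE)
		return true;	/* batch is full anyway */

	/* A huge page ends the batch: its head stands in for the tails. */
	if (PageTransCompound(page))
		return true;

	return false;
}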

> Whereas shmem_unlock_mapping() and shmem_seek_hole_data(), as they
> stand, are not removing pages from cache, but actually wanting to plod
> through the tails. So those two would be slowed a little, while the
> pagevec collects 1 instead of 15 pages. However: if we care about those
> two at all, it's clear that we should speed them up, by noticing the
> PageTransCompound case and accelerating over it, instead of plodding
> through the tails. Since we haven't bothered with that optimization
> yet, I'm not very worried about slowing them down a little until it's done.
>
> I must take a look at where we stand with tmpfs 64-bit ino tomorrow,
> then re-comment my shmem_punch_compound() patch and post it properly,
> probably the day after. (Reviewing it, I see I have a "page->index +
> HPAGE_PMD_NR <= end" test which needs correcting - I tend to live
> in the past, before v4.13 doubled the 32-bit MAX_LFS_FILESIZE.)
>
> I notice that this thread has veered off into QEMU ballooning
> territory: which may indeed be important, but there's nothing at all
> that I can contribute on that. I certainly do not want to slow down
> anything important, but remain convinced that the correct filesystem
> implementation for punching a hole is to punch a hole.
>
> Hugh
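For the record, the "accelerating over it" part for shmem_unlock_mapping()
and shmem_seek_hole_data() would presumably be something along these lines
(untested, illustrative only; next_scan_index() is a made-up helper):

#include <linux/mm.h>
#include <linux/page-flags.h>

/*
 * Illustrative sketch only: given a page found at @index during an index
 * scan (as in shmem_seek_hole_data()), return the next index to look at.
 * For a subpage of a THP, jump straight past the whole compound page
 * instead of plodding through every tail.
 */
static pgoff_t next_scan_index(struct page *page, pgoff_t index)
{
	if (PageTransCompound(page)) {
		struct page *head = compound_head(page);

		return head->index + compound_nr(head);
	}
	return index + 1;
}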