Subject: Re: [v2 PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: Hugh Dickins
Cc: kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Wed, 4 Dec 2019 16:50:49 -0800
References: <1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com>
On 12/4/19 4:15 PM, Hugh Dickins wrote:
> On Wed, 4 Dec 2019, Yang Shi wrote:
>
>> Currently when truncating a shmem file, if the range covers only part
>> of a THP (start or end is in the middle of the THP), the pages will
>> just be cleared rather than freed, unless the range covers the whole
>> THP. Even when all the subpages are truncated (randomly or
>> sequentially), the THP may still be kept in page cache. This might be
>> fine for some use cases which prefer preserving THP.
>>
>> But, when doing balloon inflation in QEMU, QEMU actually does hole
>> punch or MADV_DONTNEED in base page size granularity if hugetlbfs is
>> not used. So, when using shmem THP as the memory backend, QEMU
>> inflation doesn't work as expected since it doesn't free memory. But
>> the inflation use case really needs the memory freed. Anonymous THP
>> will not get freed right away either, but it will be freed eventually
>> when all subpages are unmapped; shmem THP would otherwise stay in
>> page cache.
>>
>> Split the THP right away when doing a partial hole punch, and if the
>> split fails just clear the page so that reads of the hole-punched
>> area return zero.
>>
>> Cc: Hugh Dickins
>> Cc: Kirill A. Shutemov
>> Cc: Andrea Arcangeli
>> Signed-off-by: Yang Shi
>> ---
>> v2: * Adopted the comment from Kirill.
>>     * Dropped the fallocate mode flag; THP split is the default
>>       behavior.
>>     * Blended Hugh's implementation with my v1 patch. TBH I'm not
>>       very keen on Hugh's find_get_entries() hack (basically
>>       neutral), but without that hack
> Thanks for giving it a try. I'm not neutral about my find_get_entries()
> hack: it surely had to go (without it, I'd have just pushed my own
> patch).

We are on the same page :-)

> I've not noticed anything wrong with your patch, and it's in the right
> direction, but I'm still not thrilled with it. I also remember that I
> got the looping wrong in my first internal attempt (fixed in what I
> sent), and need to be very sure of the try-again-versus-move-on-to-next
> conditions before agreeing to anything. No rush, I'll come back to this
> in days or months ahead: I'll try to find a less goto-ey blend of yours
> and mine.

Yes, those gotos look a bit convoluted, so I added a lot of comments to
improve the readability. Thanks for your time.

> Hugh
>
>> we have to rely on pagevec_release() to release the extra pins and
>> play with goto. This version does it that way. The patch is bigger
>> than Hugh's due to the extra comments making the flow clear.
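
To make the use case concrete, here is a minimal userspace sketch (mine,
not part of the patch; the 2MB THP size, the memfd name, and MADV_HUGEPAGE
actually taking effect are assumptions) of the base-page-size hole punch
that QEMU's balloon effectively does on a shmem-backed guest:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t thp_size = 2UL << 20;	/* assume 2MB PMD-sized THP */
	int fd = memfd_create("thp_hole_demo", 0);	/* shmem backed */

	ftruncate(fd, thp_size);
	char *p = mmap(NULL, thp_size, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	madvise(p, thp_size, MADV_HUGEPAGE);	/* request shmem THP */
	memset(p, 1, thp_size);			/* fault the range in */

	/*
	 * Punch one base page in the middle of the THP. Without this
	 * patch the subpage is only cleared and the THP stays in page
	 * cache; with it, the THP is split and the punched page freed.
	 */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  thp_size / 2, getpagesize());

	printf("hole reads back as %d\n", p[thp_size / 2]);	/* 0 */
	close(fd);
	return 0;
}

(Whether THP is used at all here depends on
/sys/kernel/mm/transparent_hugepage/shmem_enabled.)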
>>
>>  mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
>>  1 file changed, 83 insertions(+), 37 deletions(-)
>>
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 220be9f..1ae0c7f 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  	long nr_swaps_freed = 0;
>>  	pgoff_t index;
>>  	int i;
>> +	bool split = false;
>> +	struct page *page = NULL;
>>
>>  	if (lend == -1)
>>  		end = -1;	/* unsigned, so actually very big */
>>
>>  	pagevec_init(&pvec);
>>  	index = start;
>> +retry:
>>  	while (index < end) {
>>  		pvec.nr = find_get_entries(mapping, index,
>>  			min(end - index, (pgoff_t)PAGEVEC_SIZE),
>> @@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  		if (!pvec.nr)
>>  			break;
>>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>
>>  			index = indices[i];
>>  			if (index >= end)
>> @@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  			if (!trylock_page(page))
>>  				continue;
>>
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>  					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>  					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto split;
>>  				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>  			}
>>
>>  			if (!unfalloc || !PageUptodate(page)) {
>> @@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  			}
>>  			unlock_page(page);
>>  		}
>> +split:
>>  		pagevec_remove_exceptionals(&pvec);
>>  		pagevec_release(&pvec);
>>  		cond_resched();
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup. And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto retry;
>> +			}
>> +
>> +			/* Fail to split THP, move to next index */
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>  		index++;
>>  	}
>>
>> @@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  		return;
>>
>>  	index = start;
>> +again:
>>  	while (index < end) {
>>  		cond_resched();
>>
>> @@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  			continue;
>>  		}
>>  		for (i = 0; i < pagevec_count(&pvec); i++) {
>> -			struct page *page = pvec.pages[i];
>> +			split = false;
>> +			page = pvec.pages[i];
>>
>>  			index = indices[i];
>>  			if (index >= end)
>> @@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>
>>  			lock_page(page);
>>
>> -			if (PageTransTail(page)) {
>> -				/* Middle of THP: zero out the page */
>> -				clear_highpage(page);
>> -				unlock_page(page);
>> -				/*
>> -				 * Partial thp truncate due 'start' in middle
>> -				 * of THP: don't need to look on these pages
>> -				 * again on !pvec.nr restart.
>> -				 */
>> -				if (index != round_down(end, HPAGE_PMD_NR))
>> -					start++;
>> -				continue;
>> -			} else if (PageTransHuge(page)) {
>> -				if (index == round_down(end, HPAGE_PMD_NR)) {
>> +			if (PageTransCompound(page) && !unfalloc) {
>> +				if (PageHead(page) &&
>> +				    index != round_down(end, HPAGE_PMD_NR)) {
>>  					/*
>> -					 * Range ends in the middle of THP:
>> -					 * zero out the page
>> +					 * Fall through when punching whole
>> +					 * THP.
>>  					 */
>> -					clear_highpage(page);
>> -					unlock_page(page);
>> -					continue;
>> +					index += HPAGE_PMD_NR - 1;
>> +					i += HPAGE_PMD_NR - 1;
>> +				} else {
>> +					/*
>> +					 * Split THP for any partial hole
>> +					 * punch.
>> +					 */
>> +					get_page(page);
>> +					split = true;
>> +					goto rescan_split;
>>  				}
>> -				index += HPAGE_PMD_NR - 1;
>> -				i += HPAGE_PMD_NR - 1;
>>  			}
>>
>>  			if (!unfalloc || !PageUptodate(page)) {
>> @@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>  			}
>>  			unlock_page(page);
>>  		}
>> +rescan_split:
>>  		pagevec_remove_exceptionals(&pvec);
>>  		pagevec_release(&pvec);
>> +
>> +		if (split) {
>> +			/*
>> +			 * The pagevec_release() released all extra pins
>> +			 * from pagevec lookup. And we hold an extra pin
>> +			 * and still have the page locked under us.
>> +			 */
>> +			if (!split_huge_page(page)) {
>> +				unlock_page(page);
>> +				put_page(page);
>> +				/* Re-lookup page cache from current index */
>> +				goto again;
>> +			}
>> +
>> +			/*
>> +			 * Split fail, clear the page then move to next
>> +			 * index.
>> +			 */
>> +			clear_highpage(page);
>> +
>> +			unlock_page(page);
>> +			put_page(page);
>> +		}
>> +
>>  		index++;
>>  	}
>>
>> --
>> 1.8.3.1
>>
>>
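
For completeness, the way I'd sanity-check the split-and-free behavior
from userspace (a hypothetical test helper, not part of this patch) is
to watch ShmemHugePages in /proc/meminfo around the partial punch:

#include <stdio.h>

/* Return the ShmemHugePages value in kB, or -1 if not found */
static long shmem_huge_kb(void)
{
	char line[128];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "ShmemHugePages: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

int main(void)
{
	printf("ShmemHugePages: %ld kB\n", shmem_huge_kb());
	return 0;
}

With the patch applied, the counter should drop by a huge page's worth
after the partial hole punch (the THP gets split and the punched subpages
freed); without it, the counter stays put because the THP remains in page
cache.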