From: Yang Shi <yang.shi@linux.alibaba.com>
To: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com,
    akpm@linux-foundation.org
Cc: yang.shi@linux.alibaba.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [v2 PATCH] mm: shmem: allow split THP when truncating THP partially
Date: Wed, 4 Dec 2019 08:42:54 +0800
Message-Id: <1575420174-19171-1-git-send-email-yang.shi@linux.alibaba.com>
X-Mailer: git-send-email 1.8.3.1

Currently, when truncating a shmem file, if the range covers only part of
a THP (start or end falls in the middle of the THP), the pages actually
just get cleared rather than freed, unless the range covers the whole THP.
Even after all the subpages have been truncated (randomly or sequentially),
the THP may still be kept in the page cache.

This might be fine for some use cases which prefer preserving THP.  But,
when doing balloon inflation in QEMU, QEMU actually does the hole punch,
or MADV_DONTNEED, in base page size granularity if hugetlbfs is not used.
So, when using shmem THP as the memory backend, QEMU inflation doesn't
work as expected since it doesn't free memory.  But the inflation use case
really needs the memory to be freed.  Anonymous THP won't get freed right
away either, but it will be freed eventually once all of its subpages are
unmapped; a shmem THP would just stay in the page cache.

Split the THP right away when doing a partial hole punch, and if the split
fails just clear the page so that a read of the hole-punched area returns
zero.

Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
v2: * Adopted the comments from Kirill.
    * Dropped the fallocate mode flag; THP split is the default behavior.
    * Blended Hugh's implementation with my v1 patch.  TBH I'm not very
      keen on Hugh's find_get_entries() hack (basically neutral), but
      without that hack we have to rely on pagevec_release() to release
      the extra pins and play with goto.  This version does it that way.
      The patch is bigger than Hugh's due to the extra comments that make
      the flow clear.
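
For reference, below is a minimal userspace sketch (not part of the patch)
of the ballooning pattern described above: punching holes in a shmem-backed
mapping one base page at a time.  Before this change no single punch covers
the whole THP, so the subpages are only cleared, never freed.  It assumes a
memfd-backed mapping with THP allowed for shmem (see
/sys/kernel/mm/transparent_hugepage/shmem_enabled); the file name and sizes
are arbitrary.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SZ   (2UL << 20)    /* one PMD-sized THP on x86_64 */
#define BASE_PG  4096UL

int main(void)
{
	int fd = memfd_create("shmem-thp-test", 0);    /* tmpfs-backed */
	if (fd < 0 || ftruncate(fd, MAP_SZ) < 0)
		return 1;

	char *p = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, MAP_SZ);    /* fault in; may be backed by a THP */

	/*
	 * Punch the first half of the THP one base page at a time, as
	 * the balloon would.  No single call covers the whole THP, so
	 * without this patch the subpages are only cleared, not freed.
	 */
	for (unsigned long off = 0; off < MAP_SZ / 2; off += BASE_PG)
		if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			      off, BASE_PG) < 0)
			perror("fallocate");

	/* Inspect ShmemPmdMapped / RSS in /proc to observe the effect. */
	munmap(p, MAP_SZ);
	close(fd);
	return 0;
}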
 mm/shmem.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 83 insertions(+), 37 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 220be9f..1ae0c7f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -806,12 +806,15 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	long nr_swaps_freed = 0;
 	pgoff_t index;
 	int i;
+	bool split = false;
+	struct page *page = NULL;
 
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
 
 	pagevec_init(&pvec);
 	index = start;
+retry:
 	while (index < end) {
 		pvec.nr = find_get_entries(mapping, index,
 			min(end - index, (pgoff_t)PAGEVEC_SIZE),
@@ -819,7 +822,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		if (!pvec.nr)
 			break;
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			split = false;
+			page = pvec.pages[i];
 
 			index = indices[i];
 			if (index >= end)
@@ -838,23 +842,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (!trylock_page(page))
 				continue;
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
+			if (PageTransCompound(page) && !unfalloc) {
+				if (PageHead(page) &&
+				    index != round_down(end, HPAGE_PMD_NR)) {
 					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
+					 * Fall through when punching whole
+					 * THP.
 					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
+					index += HPAGE_PMD_NR - 1;
+					i += HPAGE_PMD_NR - 1;
+				} else {
+					/*
+					 * Split THP for any partial hole
+					 * punch.
+					 */
+					get_page(page);
+					split = true;
+					goto split;
 				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
 			}
 
 			if (!unfalloc || !PageUptodate(page)) {
@@ -866,9 +871,29 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
+split:
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
 		cond_resched();
+
+		if (split) {
+			/*
+			 * The pagevec_release() released all extra pins
+			 * from pagevec lookup.  And we hold an extra pin
+			 * and still have the page locked under us.
+			 */
+			if (!split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				/* Re-lookup page cache from current index */
+				goto retry;
+			}
+
+			/* Fail to split THP, move to next index */
+			unlock_page(page);
+			put_page(page);
+		}
+
 		index++;
 	}
 
@@ -901,6 +926,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		return;
 
 	index = start;
+again:
 	while (index < end) {
 		cond_resched();
 
@@ -916,7 +942,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			continue;
 		}
 		for (i = 0; i < pagevec_count(&pvec); i++) {
-			struct page *page = pvec.pages[i];
+			split = false;
+			page = pvec.pages[i];
 
 			index = indices[i];
 			if (index >= end)
@@ -936,30 +963,24 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 			lock_page(page);
 
-			if (PageTransTail(page)) {
-				/* Middle of THP: zero out the page */
-				clear_highpage(page);
-				unlock_page(page);
-				/*
-				 * Partial thp truncate due 'start' in middle
-				 * of THP: don't need to look on these pages
-				 * again on !pvec.nr restart.
-				 */
-				if (index != round_down(end, HPAGE_PMD_NR))
-					start++;
-				continue;
-			} else if (PageTransHuge(page)) {
-				if (index == round_down(end, HPAGE_PMD_NR)) {
+			if (PageTransCompound(page) && !unfalloc) {
+				if (PageHead(page) &&
+				    index != round_down(end, HPAGE_PMD_NR)) {
 					/*
-					 * Range ends in the middle of THP:
-					 * zero out the page
+					 * Fall through when punching whole
+					 * THP.
 					 */
-					clear_highpage(page);
-					unlock_page(page);
-					continue;
+					index += HPAGE_PMD_NR - 1;
+					i += HPAGE_PMD_NR - 1;
+				} else {
+					/*
+					 * Split THP for any partial hole
+					 * punch.
+					 */
+					get_page(page);
+					split = true;
+					goto rescan_split;
 				}
-				index += HPAGE_PMD_NR - 1;
-				i += HPAGE_PMD_NR - 1;
 			}
 
 			if (!unfalloc || !PageUptodate(page)) {
@@ -976,8 +997,33 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			}
 			unlock_page(page);
 		}
+rescan_split:
 		pagevec_remove_exceptionals(&pvec);
 		pagevec_release(&pvec);
+
+		if (split) {
+			/*
+			 * The pagevec_release() released all extra pins
+			 * from pagevec lookup.  And we hold an extra pin
+			 * and still have the page locked under us.
+			 */
+			if (!split_huge_page(page)) {
+				unlock_page(page);
+				put_page(page);
+				/* Re-lookup page cache from current index */
+				goto again;
+			}
+
+			/*
+			 * Split fail, clear the page then move to next
+			 * index.
+			 */
+			clear_highpage(page);
+
+			unlock_page(page);
+			put_page(page);
+		}
+
 		index++;
 	}
 
-- 
1.8.3.1