Subject: Re: [RFC PATCH] mm: shmem: allow split THP when truncating THP partially
From: Yang Shi <yang.shi@linux.alibaba.com>
To: "Kirill A. Shutemov"
Cc: hughd@google.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Date: Tue, 26 Nov 2019 15:34:40 -0800

On 11/25/19 11:33 AM, Yang Shi wrote:
>
>
> On 11/25/19 10:33 AM, Kirill A. Shutemov wrote:
>> On Mon, Nov 25, 2019 at 10:24:38AM -0800, Yang Shi wrote:
>>>
>>> On 11/25/19 1:36 AM, Kirill A. Shutemov wrote:
>>>> On Sat, Nov 23, 2019 at 09:05:32AM +0800, Yang Shi wrote:
>>>>> Currently, when truncating a shmem file, if the range covers only
>>>>> part of a THP (start or end falls in the middle of the THP), the
>>>>> pages just get cleared rather than freed unless the range covers
>>>>> the whole THP.  Even after all the subpages have been truncated
>>>>> (randomly or sequentially), the THP may still be kept in the page
>>>>> cache.  This might be fine for some use cases which prefer
>>>>> preserving THP.
>>>>>
>>>>> But when doing balloon inflation in QEMU, QEMU actually does hole
>>>>> punch or MADV_DONTNEED at base page size granularity if hugetlbfs
>>>>> is not used.  So, when using shmem THP as the memory backend, QEMU
>>>>> inflation doesn't work as expected since it doesn't free memory.
>>>>> The inflation use case really needs the memory to be freed.
>>>>> Anonymous THP is not freed right away either, but it is freed
>>>>> eventually once all of its subpages are unmapped; a shmem THP
>>>>> would just stay in the page cache.
>>>>>
>>>>> To protect the use cases which may prefer preserving THP, introduce
>>>>> a new fallocate mode: FALLOC_FL_SPLIT_HPAGE, which means splitting
>>>>> the THP is the preferred behavior when truncating a partial THP.
>>>>> This mode only makes sense for tmpfs for the time being.
>>>> We need to clarify the interaction with khugepaged. This
>>>> implementation doesn't do anything to prevent khugepaged from
>>>> collapsing the range back to a THP just after the split.
>>> Yes, it doesn't. Will clarify this in the commit log.
>> Okay, but I'm not sure that documentation alone will be enough. We need
>> a proper design.
>
> Maybe we could try to hold the inode lock for read during
> collapse_file(). The shmem fallocate path does acquire the inode lock
> for write, so this should be able to synchronize hole punch and
> khugepaged. And shmem only needs to hold the inode lock for llseek and
> fallocate; I suppose those are not called frequently enough to have an
> impact on khugepaged. llseek might be frequent, but it should be quite
> fast. However, they might get blocked by khugepaged.
>
> It sounds safe to hold a rwsem during collapsing the THP.
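
To make that first idea more concrete, roughly what I have in mind is the
sketch below. Illustrative only: the collapse_file_synced() wrapper name is
made up, the collapse_file() signature is quoted from memory of the current
tree, and I have not checked lock ordering against mmap_sem or the xarray
lock.

	/* Untested sketch, not a real patch. */
	static void collapse_file_synced(struct mm_struct *mm, struct file *file,
					 pgoff_t start, struct page **hpage,
					 int node)
	{
		struct inode *inode = file_inode(file);

		/*
		 * shmem_fallocate() holds the inode lock exclusive around
		 * hole punch, so taking it shared here serializes the
		 * collapse against a concurrent partial truncate / split.
		 */
		if (!inode_trylock_shared(inode))
			return;	/* hole punch in flight, retry on a later pass */

		collapse_file(mm, file, start, hpage, node);

		inode_unlock_shared(inode);
	}

Whether a trylock-and-back-off like this or a plain inode_lock_shared() is
better probably depends on how long the hole punch can take.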
>
> Or we could set VM_NOHUGEPAGE in the shmem inode's flags for the hole
> punch and clear it after the truncate, then check the flag before doing
> the collapse in khugepaged. khugepaged would not need to hold the inode
> lock during the collapse since the lock could be released once the flag
> is checked.

Looking at the code again, the latter option (checking VM_NOHUGEPAGE)
doesn't make sense; it can't prevent khugepaged from collapsing the THP in
parallel.

>
>>
>>>>> @@ -976,8 +1022,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
>>>>>                }
>>>>>                unlock_page(page);
>>>>>            }
>>>>> +rescan_split:
>>>>>            pagevec_remove_exceptionals(&pvec);
>>>>>            pagevec_release(&pvec);
>>>>> +
>>>>> +        if (split && PageTransCompound(page)) {
>>>>> +            /* The THP may get freed under us */
>>>>> +            if (!get_page_unless_zero(compound_head(page)))
>>>>> +                goto rescan_out;
>>>>> +
>>>>> +            lock_page(page);
>>>>> +
>>>>> +            /*
>>>>> +             * The extra pins from page cache lookup have been
>>>>> +             * released by pagevec_release().
>>>>> +             */
>>>>> +            if (!split_huge_page(page)) {
>>>>> +                unlock_page(page);
>>>>> +                put_page(page);
>>>>> +                /* Re-look up page cache from current index */
>>>>> +                goto again;
>>>>> +            }
>>>>> +            unlock_page(page);
>>>>> +            put_page(page);
>>>>> +        }
>>>>> +rescan_out:
>>>>>            index++;
>>>>>        }
>>>> Doing get_page_unless_zero() just after you've dropped the pin for the
>>>> page looks very suboptimal.
>>> If I don't drop the pins, the THP can't be split. And there might be
>>> more than one pin from find_get_entries() if I read the code correctly.
>>> For example, truncating an 8K range in the middle of a THP bumps the
>>> THP's refcount twice since two subpages are returned.
>> Pin the page before pagevec_release() and avoid get_page_unless_zero().
>>
>> The current code is buggy. You need to check that the page still
>> belongs to the file after the speculative lookup.
>
> Yes, I missed this point. Thanks for the suggestion.
>
>>
>
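
Something like the below is what I'm thinking of for the next version,
folding in both of your points. This is only an untested sketch against
the structure of the current patch, just to check I understand the
suggestion; the hpage local is new, and split, mapping and the again
label are the ones already in shmem_undo_range().

		struct page *hpage = NULL;

		if (split && PageTransCompound(page)) {
			/*
			 * Pin the head page *before* pagevec_release()
			 * drops the lookup pins, so the THP can't be
			 * freed under us and get_page_unless_zero()
			 * isn't needed.
			 */
			hpage = compound_head(page);
			get_page(hpage);
		}

		pagevec_remove_exceptionals(&pvec);
		pagevec_release(&pvec);

		if (hpage) {
			lock_page(hpage);

			/*
			 * The page may have been truncated or reclaimed
			 * from the file while we weren't holding the page
			 * lock; re-check it still belongs to this mapping
			 * before splitting.
			 */
			if (hpage->mapping != mapping) {
				unlock_page(hpage);
				put_page(hpage);
			} else if (!split_huge_page(hpage)) {
				unlock_page(hpage);
				put_page(hpage);
				/* Re-look up page cache from current index */
				goto again;
			} else {
				unlock_page(hpage);
				put_page(hpage);
			}
		}
		index++;

As in the current patch, the pagevec pins are gone by the time
split_huge_page() runs, so only our reference remains. Does that match
what you had in mind?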