Received: by 2002:a25:c205:0:0:0:0:0 with SMTP id s5csp635546ybf; Fri, 28 Feb 2020 04:55:53 -0800 (PST) X-Google-Smtp-Source: APXvYqxdXqqFIVqEebXpC/hgMYa3IOzoY6qLVatNWq/uRRKjV8EmEfyED7TH0p9+EwMHX+SPE2gh X-Received: by 2002:a05:6830:154a:: with SMTP id l10mr3232467otp.44.1582894553831; Fri, 28 Feb 2020 04:55:53 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1582894553; cv=none; d=google.com; s=arc-20160816; b=jcbg+BUg81bLVl/+OLGUwwVTwbvBVXrVQgDW23v3Sq+7vGpH0bUvuJ03ScyjW/9/cy av649rO7Mo6JkUCm02KUsihFXrDZGs4Hy56zHric9/SfkrwbrLFr4R9MNqf0LWi0IZdt Ruz+tzU/GcCVMtbflXaR3w4cpdqKXjeNda3yzl3vTkZTqpye2fV5Z+xXur4+FC7mrwWe 0Uq5A51rwjCdaoyP+P8Wb7JnHzgm5ygkA44HOEnaboxtXqpG8ybRmNwX0P2OJ8iHOdBo 5UeaUARuZAKdaOzRg/6QaYhHqN39ADHfwRvYtEhuRm6xQN5Zy95KOg7zn62e21VVBbXH EmbA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :dkim-signature; bh=Muv72YP2IyLq1W5BhIStFTH4ZvryXOf+fLWCoC9qpYQ=; b=SMTWYOabjZtnWIgz67y1Qgmm5m91l/DWzDWl7TnikyqnHm3kg/bylUk7Rw9OBng81z oR1AXPXyWWbONp+7ZzJrNtmcR5hhdT0zu4hrJ/mmU850vVw9eKDDZssJsUoH0Cxs/CB1 71hcAZmLFfGf3zHss8+dYNi/UpcqthKSxRFRaaj8gzC4yRGyHxAwH49qNR0T2qhYistb RPMCJOWFRBMwSX2CqBYmNZB3sEEb2TvvYWVFSUnT9xS8ZeHAdWTzkDdtqeZXLNVW24BB BbJEOljWs0S3yeGo1QqhNanCI4p87yuQ8dE1VKb85Tc+oI3EVH/I0081mDj8cVr4ADEJ fA1g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@shutemov-name.20150623.gappssmtp.com header.s=20150623 header.b=GlaZZ6YV; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t26si1615424oic.169.2020.02.28.04.55.41; Fri, 28 Feb 2020 04:55:53 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@shutemov-name.20150623.gappssmtp.com header.s=20150623 header.b=GlaZZ6YV; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726167AbgB1MyM (ORCPT + 99 others); Fri, 28 Feb 2020 07:54:12 -0500 Received: from mail-lf1-f68.google.com ([209.85.167.68]:46905 "EHLO mail-lf1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725900AbgB1MyM (ORCPT ); Fri, 28 Feb 2020 07:54:12 -0500 Received: by mail-lf1-f68.google.com with SMTP id v6so1991915lfo.13 for ; Fri, 28 Feb 2020 04:54:10 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shutemov-name.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=Muv72YP2IyLq1W5BhIStFTH4ZvryXOf+fLWCoC9qpYQ=; b=GlaZZ6YVAEbZFHeWJ7HE/XB2QcOxTMZAalnfiEzxpYIgJO8w+ULiOB+xpiGf+8F/jY sRXUb55TJPb8m2LkyGPEl4f1evjISFaFNde7Q+8kBK3dIWf3blckzU6z94dDyYCS/729 HiP1f8JckGqHh5L5pJvik21lNCv+rDcXuM3lC+YRxU0IO2vCcPbqo029h+dbt+bBnO89 fpxxAZJDIen6JGltrxE22I3CUEJT17n50CXsEELbd+nHR9w+nh+RzyB5MvOEwd/hLn6d Zf5OgnzxbhwOPv/pzF59sIK/7UUReM1EWPWhcPhC9BoheP4kL1w85/lgaOIZaXjPTSnK QvXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=Muv72YP2IyLq1W5BhIStFTH4ZvryXOf+fLWCoC9qpYQ=; b=bMHGPCf3IcJiA7Fgm+jqEHg9paWw9LyMVQVKaFg64Ysu1rqEIdowOTVsYnb5U46bop qd7DHGwhJWZ4FQJOCnRTVTeDvcA4cXYxWaMDZsp2aVMa4F25zHHBeiq3BSCpeShhTeeN MiGVHqOgECLAXjjXQqT1Koz5/0E4pymUr4SIi0FOj62GXQ79AtVYtcpu6hH0WWeFpYrA gfm3MsoMSz/s+CzzoCWCDemoVp2BnUTtR+MF4WwAgj2fq1ephTgFzUWe6nHPKp1mGx+F hhm204yujiqI2VA9hbEus7ljQ4DBKbV7b6oLIOpdPHfzYQDRfmfKpDFzISG5uU+9l3sF piBQ== X-Gm-Message-State: ANhLgQ2cL2eqw74zxwbKymEfmlNqDDcHwtxMG1EfpXAgd0TAGnH/iqX5 0Sg45CzPLSOzSDld8W+px+0Obal9BGI= X-Received: by 2002:a19:2d53:: with SMTP id t19mr2569895lft.206.1582894449304; Fri, 28 Feb 2020 04:54:09 -0800 (PST) Received: from box.localdomain ([86.57.175.117]) by smtp.gmail.com with ESMTPSA id t21sm5594429ljh.14.2020.02.28.04.54.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 28 Feb 2020 04:54:08 -0800 (PST) Received: by box.localdomain (Postfix, from userid 1000) id 5CA68101297; Fri, 28 Feb 2020 15:54:07 +0300 (+03) Date: Fri, 28 Feb 2020 15:54:07 +0300 From: "Kirill A. Shutemov" To: Matthew Wilcox Cc: Hugh Dickins , Andrew Morton , Yang Shi , Alexander Duyck , "Michael S. Tsirkin" , David Hildenbrand , "Kirill A. Shutemov" , Andrea Arcangeli , linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH] huge tmpfs: try to split_huge_page() when punching hole Message-ID: <20200228125407.5lnmjtnvlcnniokt@box> References: <20200227084704.aolem5nktpricrzo@box> <20200228042646.GF29971@bombadil.infradead.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20200228042646.GF29971@bombadil.infradead.org> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Feb 27, 2020 at 08:26:46PM -0800, Matthew Wilcox wrote: > On Thu, Feb 27, 2020 at 08:04:21PM -0800, Hugh Dickins wrote: > > It's good to consider the implications for hole-punch on a persistent > > filesystem cached with THPs (or lower order compound pages); but I > > disagree that they should behave differently from this patch. > > > > The hole-punch is fundamentally directed at freeing up the storage, yes; > > but its page cache must also be removed, otherwise you have the user > > writing into cache which is not backed by storage, and potentially losing > > the data later. So a hole must be punched in the compound page in that > > case too: in fact, it's then much more important that split_huge_page() > > succeeds - not obvious what the fallback should be if it fails (perhaps > > in that case the compound page must be kept, but all its pmds removed, > > and info on holes kept in spare fields of the compound page, to prevent > > writes and write faults without calling back into the filesystem: > > soluble, but more work than tmpfs needs today)(and perhaps when that > > extra work is done, we would choose to rely on it rather than > > immediately splitting; but it will involve discounting the holes). > > Ooh, a topic that reasonable people can disagree on! Hugh wins me over on this. Removing PMDs will not do much as we track dirty status on compound page level. I see two reasonable options for persistent filesystem to handle the punch hole: - Keep the page and PMD mappings intact, but trigger writeback if page is dirty. After the page is clean we can safely punch hole in the storage. Following write access to the area would trigger page_mkwrite() which would allocate storage accordingly. This is reasonable behaviour if we allow to allocate THPs not fully covered by space allocated on disk. - Try to split the page or drop it completely from the page cache (after write back if need) before punching the hole. Fallback to the first scenario if we cannot split or get rid of the page. I cannot say I strongly prefer one approach over another. The first one fits better with THP attitude: pay for performance with memory (and storage I guess). The second may work better if resources is limited. > The current prototype I have will allocate (huge) pages and then > ask the filesystem to fill them. The filesystem may well find that > the extent is a hole, and if it is, it will fill the page with zeroes. > Then, the application may write to those pages, and if it does, the > filesystem will be notified to create an on-disk extent for that write. > > I haven't looked at the hole-punch path in detail, but presumably it > notifies the filesystem to create a hole extent and zeroes out the > pagecache for that range (possibly by removing entire pages, and with > memset for partial pages). Then a subsequent write to the hole will > cause the filesystem to allocate a new non-hole extent, just like the > previous case. > > I think it's reasonable for the page cache to interpret a hole-punch > request as being a hint that the hole is unlikely to be accessed again, > so allocating new smaller pages for that region of the file (or just > writing back & dropping the covering page altogether) would seem like > a reasonable implementation decision. > > However, it also seems reasonable that just memset() of the affected > region and leaving the page intact would also be an acceptable > implementation. As long as writes to the newly-created hole cause the > page to become dirtied and thus writeback to be in effect. It probably > wouldn't be as good an implementation, but it shouldn't lose writes as > you suggest above. > > I'm not sure I'd choose to split a large page into smaller pages. I think > I'd prefer to allocate lower-order pages and memcpy() the data over. > Again, that's an implementation choice, and not something that should > be visible outside the implementation. Copying over the data has the same limitation as split: you need to get refcounts to the well-known state (no extra pins) before you can proceed. So it will not be never-fail operation. > > [1] http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-pagecache -- Kirill A. Shutemov