Date: Sat, 29 Feb 2020 08:16:10 +1100
From: Dave Chinner
To: Andreas Dilger
Cc: Kirill Tkhai, Christoph Hellwig, Theodore Ts'o, Alexander Viro,
	Mike Snitzer, Jan Kara, Eric Biggers, riteshh@linux.ibm.com,
	krisman@collabora.com, surajjs@amazon.com, dmonakhov@gmail.com,
	mbobrowski@mbobrowski.org, Eric Whitney, sblbir@amazon.com,
	Khazhismel Kumykov, linux-ext4, Linux Kernel Mailing List,
	Linux FS Devel
Subject: Re: [PATCH RFC 5/5] ext4: Add fallocate2() support
Message-ID: <20200228211610.GQ10737@dread.disaster.area>
In-Reply-To: <4933D88C-2A2D-4ACA-823E-BDFEE0CE143F@dilger.ca>

On Fri, Feb 28, 2020 at 08:35:19AM -0700, Andreas Dilger wrote:
> On Feb 27, 2020, at 5:24 AM, Kirill Tkhai wrote:
> > On 27.02.2020 00:51, Andreas Dilger wrote:
> >> On Feb 26, 2020, at 1:05 PM, Kirill Tkhai wrote:
> >> In that case, an interesting userspace interface would be an array of
> >> inode numbers (64-bit please) that should be packed together densely in
> >> the order they are provided (maybe a flag for that). That allows the
> >> filesystem the freedom to find the physical blocks for the allocation,
> >> while userspace can tell which files are related to each other.
> >
> > So, this interface is 3-in-1:
> >
> > 1)finds a placement for inodes extents;
>
> The target allocation size would be sum(size of inodes), which should
> be relatively small in your case).
>
> > 2)assigns this space to some temporary donor inode;
>
> Maybe yes, or just reserves that space from being allocated by anyone.
>
> > 3)calls ext4_move_extents() for each of them.
>
> ... using the target space that was reserved earlier
>
> > Do I understand you right?
>
> Correct. That is my "5 minutes thinking about an interface for grouping
> small files together without exposing kernel internals" proposal for this.

You don't need any special kernel interface with XFS for this. It is
simply:

	mkdir tmpdir
	create O_TMPFILEs in tmpdir

Now all the tmpfiles you create and their data will be co-located
around the location of the tmpdir inode. This is the natural
placement policy of the filesystem, i.e. the filesystem assumes that
files in the same directory are all related, so will be accessed
together and so should be located in relatively close proximity to
each other.

This is a locality optimisation technique that is older than XFS. It
works remarkably well when the filesystem can spread directories
effectively across its address space. It also allows userspace to
use simple techniques to group (or separate) data files as desired.
Indeed, this is how xfs_fsr directs locality for its tmpfiles when
relocating/defragmenting data....

> > If so, then IMO it's good to start from two inodes, because here may code
> > a very difficult algorithm of placement of many inodes, which may require
> > much memory. Is this OK?
>
> Well, if the files are small then it won't be a lot of memory. Even so,
> the kernel would only need to copy a few MB at a time in order to get
> any decent performance, so I don't think that is a huge problem to have
> several MB of dirty data in flight.
>
> > Can we introduce a flag, that some of inode is unmovable?
>
> There are very few flags left in the ext4_inode->i_flags for use.
> You could use "IMMUTABLE" or "APPEND_ONLY" to mean that, but they
> also have other semantics. The EXT4_NOTAIL_FL is for not merging the
> tail of a file, but ext4 doesn't have tails (that was in Reiserfs),
> so we might consider it a generic "do not merge" flag if set?

We've had that in XFS for as long as I can remember. Many
applications were sensitive to the exact layout of the files they
created themselves, so having xfs_fsr defrag/move them about would
cause performance SLAs to be broken.

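As an aside on the "mkdir tmpdir, create O_TMPFILEs in tmpdir" recipe
described earlier in this mail, here is a minimal, untested sketch of
what that could look like from userspace. The function name
create_grouped_tmpfiles and its parameters are made up for
illustration. The code only creates the files; whether their data ends
up near the directory inode depends entirely on the filesystem's
placement policy, and O_TMPFILE itself is not supported by every
filesystem.

```python
import os
import tempfile


def create_grouped_tmpfiles(parent, count):
    """Create `count` unnamed temp files inside a fresh directory under `parent`.

    Returns (tmpdir_path, list_of_fds). The caller owns the descriptors
    and is responsible for closing them and removing the directory.
    Hypothetical helper: it relies on the filesystem to place files that
    share a directory close together, and cannot verify that it did.
    """
    tmpdir = tempfile.mkdtemp(dir=parent)  # the "mkdir tmpdir" step
    fds = []
    try:
        for _ in range(count):
            # O_TMPFILE takes a *directory* path and creates an unnamed
            # inode in the filesystem that holds that directory.
            fds.append(os.open(tmpdir, os.O_TMPFILE | os.O_RDWR, 0o600))
    except OSError:
        # Filesystem (or kernel) without O_TMPFILE support: clean up.
        for fd in fds:
            os.close(fd)
        os.rmdir(tmpdir)
        raise
    return tmpdir, fds
```

Calling create_grouped_tmpfiles("/mnt/scratch", 8), where /mnt/scratch
is a placeholder path, would give eight descriptors whose inodes were
all created under the same new directory. Data written through them is
then subject to whatever same-directory locality the filesystem
applies.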
Indeed, thanks to XFS, ext4 already has an interface that can be
used to set/clear a "no defrag" flag such as you are asking for.
It's the FS_XFLAG_NODEFRAG bit in the FS_IOC_FS[GS]ETXATTR ioctl.
In XFS, that manages the XFS_DIFLAG_NODEFRAG on-disk inode flag,
and it has special meaning for directories. From the 'man 3 xfsctl'
man page where this interface came from:

	Bit 13 (0x2000) - XFS_XFLAG_NODEFRAG
		No defragment file bit - the file should be
		skipped during a defragmentation operation. When
		applied to a directory, new files and directories
		created will inherit the no-defrag bit.

> > Can this interface use a knowledge about underlining device discard granuality?
>
> As I wrote above, ext4+mballoc has a very good appreciation for alignment.
> That was written for RAID storage devices, but it doesn't matter what
> the reason is. It isn't clear if flash discard alignment is easily
> used (it may not be a power-of-two value or similar), but wouldn't be
> harmful to try.

Yup, XFS has similar (but more complex) alignment controls for
directing allocation to match the underlying storage
characteristics, e.g. the stripe unit is also the "small file size
threshold" where the allocation policy changes from packing to
aligning and separating.

> > In the answer to Dave, I wrote a proposition to make fallocate() care about
> > i_write_hint. Could you please comment what you think about that too?
>
> I'm not against that. How the two interact would need to be documented
> first and discussed to see if that makes sene, and then implemented.

Individual filesystems can make their own choices as to what they do
with write hints, including ignoring them and leaving it for the
storage device to decide where to physically place the data. In
many cases, ignoring the hint is the right thing for the
filesystem to do...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
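
For reference, the FS_IOC_FS[GS]ETXATTR interface mentioned above could
be driven from userspace roughly as follows. This is a sketch under
stated assumptions, not a tested tool: the ioctl numbers and the
28-byte struct fsxattr layout are hard-coded from <linux/fs.h> because
Python's fcntl module does not export them, and the helper names
with_nodefrag and set_nodefrag are invented here. Whether a given
filesystem accepts the NODEFRAG bit is up to that filesystem; the call
raises OSError where the ioctl or the flag is not supported.

```python
import fcntl
import os
import struct

# Hard-coded from <linux/fs.h> (assumed layout: 28-byte struct fsxattr).
FS_IOC_FSGETXATTR = 0x801C581F  # _IOR('X', 31, struct fsxattr)
FS_IOC_FSSETXATTR = 0x401C5820  # _IOW('X', 32, struct fsxattr)
FS_XFLAG_NODEFRAG = 0x00002000  # bit 13, "do not defragment"

# struct fsxattr: fsx_xflags, fsx_extsize, fsx_nextents, fsx_projid,
# fsx_cowextsize (all __u32) followed by 8 bytes of padding.
FSXATTR = struct.Struct("=5I8s")


def with_nodefrag(xflags, enable):
    """Return xflags with the NODEFRAG bit set or cleared."""
    if enable:
        return xflags | FS_XFLAG_NODEFRAG
    return xflags & ~FS_XFLAG_NODEFRAG


def set_nodefrag(path, enable=True):
    """Read-modify-write the xflags of `path` to toggle NODEFRAG."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = bytearray(FSXATTR.size)
        # A mutable buffer is filled in place by the "get" ioctl.
        fcntl.ioctl(fd, FS_IOC_FSGETXATTR, buf)
        xflags, extsize, nextents, projid, cowextsize, pad = FSXATTR.unpack(buf)
        new = FSXATTR.pack(with_nodefrag(xflags, enable),
                           extsize, nextents, projid, cowextsize, pad)
        fcntl.ioctl(fd, FS_IOC_FSSETXATTR, new)
    finally:
        os.close(fd)
```

The read-modify-write shape matters because the "set" ioctl replaces
the whole flag word, so the other bits have to be carried over from the
"get" call. Per the man page excerpt above, applying this to a
directory on XFS would make newly created children inherit the bit.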