Date: Wed, 21 Jul 2021 15:24:04 -0700
From: "Darrick J. Wong"
Wong" To: Gao Xiang Cc: linux-erofs@lists.ozlabs.org, linux-fsdevel@vger.kernel.org, LKML , Christoph Hellwig , Matthew Wilcox , Andreas Gruenbacher Subject: Re: [PATCH v5] iomap: support tail packing inline read Message-ID: <20210721222404.GA8639@magnolia> References: <20210721082323.41933-1-hsiangkao@linux.alibaba.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20210721082323.41933-1-hsiangkao@linux.alibaba.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 21, 2021 at 04:23:23PM +0800, Gao Xiang wrote: > This tries to add tail packing inline read to iomap, which can support > several inline tail blocks. Similar to the previous approach, it cleans > post-EOF in one iteration. > > The write path remains untouched since EROFS cannot be used for testing. > It'd be better to be implemented if upcoming real users care rather than > leave untested dead code around. I had a conversation with Gao on IRC this morning, and I think I've finally gotten up to speed on where he's trying to go with this patchset. Maybe that will make review of this patch easier, or at least not muddy the waters further. Right now, inline data in iomap serves exactly two users -- gfs2 and ext4. ext4 doesn't use iomap for buffered IO and doesn't support directio for inline data files, so we can ignore them for now. gfs2 uses iomap for buffered IO, and it stores the inline data after the gfs2_dinode. iomap's inline data functions exist to serve the gfs2 use case, which is why the code has baked-in assumptions that iomap->offset is always zero, and the length is never more than a page. It used to be the case that we'd always attach an iomap_page to a page for blocksize < pagesize files, but as of 5.14-rc2 we're starting to move towards creating and dropping them on demand. IOWs, reads from inline data files always read the entire file contents into the page so we mark the whole page uptodate and do not attach an iomap_page (unlike regular reads). Writes don't attach an iomap_page to inline data files. Writeback attaches an iomap_page. Did I get that much right? Onto the erofs part, now that I've also taken the time to figure out what it's doing by reading the ondisk format in Documentation/. (Thanks for that, erofs developers!) erofs can perform tail packing to reduce internal block fragmentation. Tails of files are written immediately after the ondisk inode, which is why Gao wants to use IOMAP_INLINE for this. Note that erofs tailpacking is /not/ same as what reiserfs does, and the inlinedata model is /not/ the same as what gfs2 does. A tail-packed erofs file mapping looks like this: x = round_down(i_size, blocksize); [0..(x - 1)]: mapped to a range of external blocks [x..(i_size - 1)]: inline data immediately after the inode The previous discussions have gone a bit afield -- there's only one inline data region per file, it won't cross a page boundary because erofs requires blocksize == pagesize, and it's always at the end of the file. I don't know how we got onto the topic of multiple inline data regions or encoded regions in the middle of a file, but that's not on anybody's requirement list today, AFAICT. I suspect that adapting the inlinedata code to support regions that don't start at offset zero but are otherwise page-aligned can be done with fairly minor changes to the accounting, since I think that largely can be done by removing the asserts about offset==0. Did I get that right? 
The next thing the erofs developers want to do is add support for
blocksize < pagesize, presumably so that they can mount a 4k-blocksize
erofs volume on a machine with 64k pages.  For that, I think erofs
needs to be able to read the tail bytes into the middle of an existing
page.  Hence the need to update the per-block uptodate bits in the
iomap_page from the read function, and all the math changes where we
increase the starting address of a copy by (pos - iomap->offset).  The
end result should be that we can handle inline data regions anywhere,
though we won't really have a way to test that until erofs starts
supporting blocksize < pagesize.

Assuming that my assumptions are correct, I think this patch
decomposes into three more targeted changes, one of which applies now,
and the rest of which will go with the later effort:

1) Update the code to handle inline data mappings where iomap->offset
   is not zero but the start of the mapping is always page-aligned.

2) Adapt the inline data code to create and update the iop as
   appropriate.  This could be a little tricky, since I've seen
   elsewhere in the v4 discussion thread that people like the idea of
   not paying the iop overhead for pages that are backed by a single
   extent even when bs < ps.  I suspect we have enough to decide this
   from the *iomap/*srcmap length in iomap_readpage_actor or
   iomap_write_begin, though I've not written any code that tries
   that.

3) Update the code to handle inline data mappings where iomap->offset
   can point to the middle of a page.

My apologies if everyone else already figured all of this out; for all
I know I'm merely scrawling this here as notes to refer back to for
future discussions.

> Cc: Christoph Hellwig
> Cc: Darrick J. Wong
> Cc: Matthew Wilcox
> Cc: Andreas Gruenbacher
> Signed-off-by: Gao Xiang
> ---
> v4: https://lore.kernel.org/r/20210720133554.44058-1-hsiangkao@linux.alibaba.com
> changes since v4:
>  - turn to WARN_ON_ONCE() suggested by Darrick;
>  - fix size to "min(iomap->length + iomap->offset - pos,
>                     PAGE_SIZE - poff)"
>
>  fs/iomap/buffered-io.c | 58 +++++++++++++++++++++++++++---------------
>  fs/iomap/direct-io.c   | 13 +++++++---
>  2 files changed, 47 insertions(+), 24 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 87ccb3438bec..d8436d34a159 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -205,25 +205,27 @@ struct iomap_readpage_ctx {
>  	struct readahead_control *rac;
>  };
>
> -static void
> +static int
>  iomap_read_inline_data(struct inode *inode, struct page *page,
> -		struct iomap *iomap)
> +		struct iomap *iomap, loff_t pos)
>  {
> -	size_t size = i_size_read(inode);
> +	unsigned int size, poff = offset_in_page(pos);
>  	void *addr;
>
> -	if (PageUptodate(page))
> -		return;
> -
> -	BUG_ON(page_has_private(page));
> -	BUG_ON(page->index);
> -	BUG_ON(size > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	/* inline source data must be inside a single page */
> +	if (WARN_ON_ONCE(iomap->length > PAGE_SIZE -
> +			 offset_in_page(iomap->inline_data)))
> +		return -EIO;
> +	/* handle tail-packing blocks cross the current page into the next */
> +	size = min_t(unsigned int, iomap->length + iomap->offset - pos,
> +		     PAGE_SIZE - poff);

Part of my confusion has resulted from this comment -- now that I
think I understand the problem domain better, I realize that the
clamping code here is not because erofs will hand us a tail-packing
iomap that crosses page boundaries; this clamp simply protects us from
memory corruption.
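As a sanity check on the math, plugging in the made-up numbers from my
sketch above (pos == 4096 == iomap->offset, iomap->length == 904, 4k
pages):

	poff = offset_in_page(pos);	/* 4096 % 4096 == 0 */
	size = min_t(unsigned int,
		     iomap->length + iomap->offset - pos,  /* == 904 */
		     PAGE_SIZE - poff);                    /* == 4096 */
	/* size == 904; the memset below zeroes the other 3192 bytes */

The first bound is the live one here; the PAGE_SIZE - poff term exists
only to keep the copy from ever running off the end of the destination
page.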
IOWs, maybe the comment should make clear that we must not copy beyond
the end of the page that inline_data points to:

	/*
	 * iomap->inline_data is a kernel-mapped memory page, so we must
	 * terminate the read at the end of that page.
	 */
	if (WARN_ON_ONCE(...))
		return -EIO;

	size = min_t(...);

TBH I wonder if we merely need a rule that ->iomap_begin must not hand
back an inline data mapping that crosses a page, since I think the
check in the previous line is sufficient.

>  	addr = kmap_atomic(page);
> -	memcpy(addr, iomap->inline_data, size);
> -	memset(addr + size, 0, PAGE_SIZE - size);
> +	memcpy(addr + poff, iomap->inline_data - iomap->offset + pos, size);

I keep seeing this (iomap->inline_data + pos - iomap->offset)
construction in this patch; maybe it should be a helper?
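Something like this completely untested sketch, say (the name is just
a strawman):

	static inline void *iomap_inline_data(struct iomap *iomap, loff_t pos)
	{
		return iomap->inline_data + pos - iomap->offset;
	}

Then the buffered read, write_begin, and directio paths could all call
it instead of open-coding the pointer math.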
> +	memset(addr + poff + size, 0, PAGE_SIZE - poff - size);
>  	kunmap_atomic(addr);
> -	SetPageUptodate(page);
> +	iomap_set_range_uptodate(page, poff, PAGE_SIZE - poff);
> +	return PAGE_SIZE - poff;
>  }
>
>  static inline bool iomap_block_needs_zeroing(struct inode *inode,
> @@ -245,19 +247,23 @@ iomap_readpage_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  	loff_t orig_pos = pos;
>  	unsigned poff, plen;
>  	sector_t sector;
> +	int ret;
>
> -	if (iomap->type == IOMAP_INLINE) {
> -		WARN_ON_ONCE(pos);
> -		iomap_read_inline_data(inode, page, iomap);
> -		return PAGE_SIZE;
> -	}
> -
> -	/* zero post-eof blocks as the page may be mapped */
>  	iop = iomap_page_create(inode, page);
> +	/* needs to skip some leading uptodate blocks */
>  	iomap_adjust_read_range(inode, iop, &pos, length, &poff, &plen);
>  	if (plen == 0)
>  		goto done;
>
> +	if (iomap->type == IOMAP_INLINE) {
> +		ret = iomap_read_inline_data(inode, page, iomap, pos);
> +		if (ret < 0)
> +			return ret;
> +		plen = ret;
> +		goto done;
> +	}
> +
> +	/* zero post-eof blocks as the page may be mapped */
>  	if (iomap_block_needs_zeroing(inode, iomap, pos)) {
>  		zero_user(page, poff, plen);
>  		iomap_set_range_uptodate(page, poff, plen);
> @@ -589,6 +595,18 @@ __iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, int flags,
>  	return 0;
>  }
>
> +static int iomap_write_begin_inline(struct inode *inode, loff_t pos,
> +		struct page *page, struct iomap *srcmap)
> +{
> +	/* needs more work for the tailpacking case, disable for now */
> +	if (WARN_ON_ONCE(srcmap->offset != 0))
> +		return -EIO;
> +	if (PageUptodate(page))
> +		return 0;
> +	iomap_read_inline_data(inode, page, srcmap, 0);
> +	return 0;
> +}
> +
>  static int
>  iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  		struct page **pagep, struct iomap *iomap, struct iomap *srcmap)
> @@ -618,7 +636,7 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  	}
>
>  	if (srcmap->type == IOMAP_INLINE)
> -		iomap_read_inline_data(inode, page, srcmap);
> +		status = iomap_write_begin_inline(inode, pos, page, srcmap);
>  	else if (iomap->flags & IOMAP_F_BUFFER_HEAD)
>  		status = __block_write_begin_int(page, pos, len, NULL, srcmap);
>  	else
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 9398b8c31323..cbadb99fb88c 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -379,22 +379,27 @@ iomap_dio_inline_actor(struct inode *inode, loff_t pos, loff_t length,
>  {
>  	struct iov_iter *iter = dio->submit.iter;
>  	size_t copied;
> +	void *dst = iomap->inline_data + pos - iomap->offset;
>
> -	BUG_ON(pos + length > PAGE_SIZE - offset_in_page(iomap->inline_data));
> +	/* inline data must be inside a single page */
> +	if (WARN_ON_ONCE(length > PAGE_SIZE -
> +			 offset_in_page(iomap->inline_data)))
> +		return -EIO;

Likewise here:

	/*
	 * iomap->inline_data is a kernel-mapped memory page, so we must
	 * terminate the write at the end of that page.
	 */
	if (WARN_ON_ONCE(...))
		return -EIO;

>  	if (dio->flags & IOMAP_DIO_WRITE) {

I thought we weren't allowing writes to an inline mapping unless
iomap->offset == 0?  Why is it necessary to change the directio write
path?  Shouldn't this be:

	/* needs more work for the tailpacking case, disable for now */
	if (WARN_ON_ONCE(pos > 0))
		return -EIO;

--D

>  		loff_t size = inode->i_size;
>
>  		if (pos > size)
> -			memset(iomap->inline_data + size, 0, pos - size);
> -		copied = copy_from_iter(iomap->inline_data + pos, length, iter);
> +			memset(iomap->inline_data + size - iomap->offset,
> +			       0, pos - size);
> +		copied = copy_from_iter(dst, length, iter);
>  		if (copied) {
>  			if (pos + copied > size)
>  				i_size_write(inode, pos + copied);
>  			mark_inode_dirty(inode);
>  		}
>  	} else {
> -		copied = copy_to_iter(iomap->inline_data + pos, length, iter);
> +		copied = copy_to_iter(dst, length, iter);
>  	}
>  	dio->size += copied;
>  	return copied;
> --
> 2.24.4
>