Received: by 2002:a25:824b:0:0:0:0:0 with SMTP id d11csp1424007ybn; Wed, 25 Sep 2019 18:14:50 -0700 (PDT) X-Google-Smtp-Source: APXvYqyzX1Dn+3LAOn9umH/r+IYRIKnpj5TL4bnEdPp/PdeGMZDxWSW1kB8fC8LYVOFxg9XNlMgO X-Received: by 2002:a17:906:19d9:: with SMTP id h25mr941799ejd.43.1569460490465; Wed, 25 Sep 2019 18:14:50 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1569460490; cv=none; d=google.com; s=arc-20160816; b=qGMM3hyt0HDUwdGIIrAWhpo5kO+WoSubQ7Zd1vl+QI+3kvI4FiTwX8vj/hWm2EHEGo yNJ2F9OOuW6sGQjecKZtw8Y2erBFU3TckJbDrRwNfM8vYomEF4xfQF3AllAXcTLcpSjt pv1gBtY9kyDDhEs/CRBsWOkmuWFO0ecx/JIvPoSvqPUQudEpEI2NQvfS622wCPFsPl01 g11qB8IJm6f1ywlxlQW9Fp2J0LuWIIPW0vHPTygar7kSZadpBAGnEUDqdEj5Eb5a7Zda zoxAkkQ72RPajPRnLIstUacU9Jo/IzmRLeGFdjPUxafPYug1epz24Lmn0uthjk2Zp90J Z79w== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=cWHGhu/xEva0NCUnsouhp2Rqjl7y8RfVbbdYtfLeN1A=; b=wZnNNRKDGBhRC6WzUOBNx7fPmrFbG34dEJrbFKV1qB2iWRDAruFYx6ZIsMlsGSiRA8 VO2zHfuFNWC+xZNWutUNt0OaCgnahl7Ewmy1VZQv5i2dfTG8am6zzuJArLm+8m4Vi93T oCRl9tZ9/tfQtUGun9hFOO7HMNlsNfvZo9k79EHYip0jv17UmvEOiuFtuUyzmbQGFyYq aG8qoT9YNmf1mh9j4IT9dtzvxI31c5KGucPWxPOiubTQDl9x479WtghcCXLoSFBkDykb TLCi0PaEF0wjiZ9QmWp/e83Dop5ZjuLeXoucdsGAdb+Jle3jDujmxvAaB1YhKRIjRIn9 8K8g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@mbobrowski-org.20150623.gappssmtp.com header.s=20150623 header.b=NXTlTyI2; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id qh28si271564ejb.210.2019.09.25.18.14.23; Wed, 25 Sep 2019 18:14:50 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@mbobrowski-org.20150623.gappssmtp.com header.s=20150623 header.b=NXTlTyI2; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2408230AbfIXK3e (ORCPT + 99 others); Tue, 24 Sep 2019 06:29:34 -0400 Received: from mail-pg1-f194.google.com ([209.85.215.194]:34664 "EHLO mail-pg1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2405224AbfIXK3e (ORCPT ); Tue, 24 Sep 2019 06:29:34 -0400 Received: by mail-pg1-f194.google.com with SMTP id y35so1093824pgl.1 for ; Tue, 24 Sep 2019 03:29:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=mbobrowski-org.20150623.gappssmtp.com; s=20150623; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=cWHGhu/xEva0NCUnsouhp2Rqjl7y8RfVbbdYtfLeN1A=; b=NXTlTyI2vIC+atFvgwWyJ3X4f1wCJA+4VK9JPTs1mEoi1HY2sx4ltdBARifdKoKcS9 2yx9TcD/NNgkKGikxuWo09JtjlGYqkNrS4fGR5K76CoDkswsT9/pudKgmJYuMUwxkQim AVFMwL6sCWE2Jv2X/udeSLAJL3CdAT2QewEdvE4vqp+tP5my453goJ5AA05WSH2//mXI nipsqAuVLye7htJ4SftWF8jyqGju1+piB2jpn3C0SpvaebgJAlbfN7tJKZeSvd91DLMW JW5XV4ES9R6hTdYZrVjx188QWdHIiODMNHHxuV8LOEYIXwe0n6MtfBBRwhWlC2GPjCuv Fnxg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to:user-agent; bh=cWHGhu/xEva0NCUnsouhp2Rqjl7y8RfVbbdYtfLeN1A=; b=rUe8yjg1HZJ8HxvzsNE3JEEE23Yhn912uO0zr9rRz7N0zjd2LFgDHy9Qpfn8t9f+xO oM6ydFym3UzRqopd5oJFBe6MOpTpedZIfYB5pYPtUkdgVRKqQVQXCooWBn2gg2IjKy6z s9xYFJP6yBt7ZSF3+vs0bMrbaBgjSzyycIY4zo6kNIjNq064qlSzF39XfgIdoOCtxeRZ OCzwQBbiVmYtIIk4lp5nHR4BOpNjtE0yUphuXWetsyH180aF1W4BTp0UDZgxYLJ7KwRx EyODdQVWK7bWXTr5Q6b/Xj/KjX/Vye9Kq7NNopE4YYzuGzdWFG2vc3rMb3FQno7/vDYe bGOQ== X-Gm-Message-State: APjAAAUMXduzaBV4j2nlKgSZszXmXGeZPML5aGSIfWEsc0aYPoE+C0aA 7sbOi/jXUXCS8ZBhS2gCSN6V X-Received: by 2002:a17:90a:2301:: with SMTP id f1mr2163987pje.121.1569320973166; Tue, 24 Sep 2019 03:29:33 -0700 (PDT) Received: from bobrowski ([110.232.114.101]) by smtp.gmail.com with ESMTPSA id s1sm5023004pjs.31.2019.09.24.03.29.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 24 Sep 2019 03:29:32 -0700 (PDT) Date: Tue, 24 Sep 2019 20:29:26 +1000 From: Matthew Bobrowski To: Jan Kara Cc: tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure Message-ID: <20190924102926.GC17526@bobrowski> References: <20190923211011.GH20367@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190923211011.GH20367@quack2.suse.cz> User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote: > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote: > > +/* > > + * For a write that extends the inode size, ext4_dio_write_iter() will > > + * wait for the write to complete. Consequently, operations performed > > + * within this function are still covered by the inode_lock(). On > > + * success, this function returns 0. > > + */ > > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error, > > + unsigned int flags) > > +{ > > + int ret; > > + loff_t offset = iocb->ki_pos; > > + struct inode *inode = file_inode(iocb->ki_filp); > > + > > + if (error) { > > + ret = ext4_handle_failed_inode_extension(inode, offset + size); > > + return ret ? ret : error; > > + } > > + > > + if (flags & IOMAP_DIO_UNWRITTEN) { > > + ret = ext4_convert_unwritten_extents(NULL, inode, > > + offset, size); > > + if (ret) > > + return ret; > > + } > > + > > + if (offset + size > i_size_read(inode)) { > > + ret = ext4_handle_inode_extension(inode, offset, size, 0); > > + if (ret) > > + return ret; > > + } > > With the suggestions I made to your patch 3/6 this could be simplified to: > > if (!error && flags & IOMAP_DIO_UNWRITTEN) { > error = ext4_convert_unwritten_extents(NULL, inode, offset, > size); > } > return ext4_handle_inode_extension(inode, offset, error ? : size, size); > > > Note the change that when ext4_convert_unwritten_extents() fails (although > this should not really happen unless there's some corruption going on), we > do properly truncate possible extents beyond i_size. This sounds good to me. I like this idea. > > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from) > > +{ > > + ssize_t ret; > > + size_t count; > > + loff_t offset = iocb->ki_pos; > > + struct inode *inode = file_inode(iocb->ki_filp); > > + bool extend = false, overwrite = false, unaligned_aio = false; > > + > > + if (!inode_trylock(inode)) { > > + if (iocb->ki_flags & IOCB_NOWAIT) > > + return -EAGAIN; > > + inode_lock(inode); > > + } > > + > > + if (!ext4_dio_checks(inode)) { > > + inode_unlock(inode); > > + /* > > + * Fallback to buffered IO if the operation on the > > + * inode is not supported by direct IO. > > + */ > > + return ext4_buffered_write_iter(iocb, from); > > + } > > + > > + ret = ext4_write_checks(iocb, from); > > + if (ret <= 0) { > > + inode_unlock(inode); > > + return ret; > > + } > > + > > + /* > > + * Unaligned direct AIO must be serialized among each other as > > + * the zeroing of partial blocks of two competing unaligned > > + * AIOs can result in data corruption. > > + */ > > + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) && > > + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) { > > + unaligned_aio = true; > > + inode_dio_wait(inode); > > + } > > + > > + /* > > + * Determine whether the IO operation will overwrite allocated > > + * and initialized blocks. If so, check to see whether it is > > + * possible to take the dioread_nolock path. > > + */ > > + count = iov_iter_count(from); > > + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) && > > + ext4_should_dioread_nolock(inode)) { > > + overwrite = true; > > + downgrade_write(&inode->i_rwsem); > > + } > > + > > + if (offset + count > i_size_read(inode) || > > + offset + count > EXT4_I(inode)->i_disksize) { > > + ext4_update_i_disksize(inode, inode->i_size); > > + extend = true; > > + } > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else, > you need to also have transaction started and call ext4_mark_inode_dirty() > to actually journal the change of i_disksize (ext4_update_i_disksize() > updates only the in-memory copy of the entry). Also the direct IO code > needs to add the inode to the orphan list so that in case of crash, blocks > allocated beyond EOF get truncated on next mount. That is the whole point > of this excercise with i_disksize after all. > > But I'm wondering if i_disksize update is needed. Truncate cannot be in > progress (we hold i_rwsem) and dirty pages will be flushed by > iomap_dio_rw() before we start to allocate any blocks. So it should be > enough to have here: Well, I initially thought the same, however doing some research shows that we have the following edge case: - 45d8ec4d9fd54 and - 73fdad00b208b In fact you can reproduce the exact same i_size corruption issue by running the generic/475 xfstests mutitple times, as articulated within 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may be a problem that resides elsewhere that may need addressing? > if (offset + count > i_size_read(inode)) { > /* > * Add inode to orphan list so that blocks allocated beyond > * EOF get properly truncated in case of crash. > */ > start transaction handle > add inode to orphan list > stop transaction handle > } > > And just leave i_disksize at whatever it currently is. I originally had the code which added the inode to the orphan list here, but then I thought to myself that it'd make more sense to actually do this step closer to the point where we've managed to successfully allocate the required blocks for the write. This prevents the need to spray orphan list clean up code all over the place just to cover the case that a write which had intended to extend the inode beyond i_size had failed prematurely (i.e. before block allocation). So, hence the reason why I thought having it in ext4_iomap_begin() would make more sense, because at that point in the write path, there is enough/or more assurance to make the call around whether we will in fact be able to perform the write which will be extending beyond i_size, or not and consequently whether the inode should be placed onto the orphan list? Ideally I'd like to turn this statement into: if (offset + count > i_size_read(inode)) extend = true; Maybe I'm missing something here and there's actually a really good reason for doing this nice and early? What are your thoughts about what I've mentioned above? > > - * If we added blocks beyond i_size, we need to make sure they > > - * will get truncated if we crash before updating i_size in > > - * ext4_iomap_end(). For faults we don't need to do that (and > > - * even cannot because for orphan list operations inode_lock is > > - * required) - if we happen to instantiate block beyond i_size, > > - * it is because we race with truncate which has already added > > - * the inode to the orphan list. > > + * If we added blocks beyond i_size, we need to make > > + * sure they will get truncated if we crash before > > + * updating the i_size. For faults we don't need to do > > + * that (and even cannot because for orphan list > > + * operations inode_lock is required) - if we happen > > + * to instantiate block beyond i_size, it is because > > + * we race with truncate which has already added the > > + * inode to the orphan list. > > */ > > Just a nit but it would be nice to use full width of 80 columns when > formatting comments so that they don't get unnecessarily long. Sure. I'm blaming emacs for this. :P ----