Date: Mon, 23 Sep 2019 23:10:11 +0200
From: Jan Kara
To: Matthew Bobrowski
Cc: tytso@mit.edu, jack@suse.cz, adilger.kernel@dilger.ca,
	linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	david@fromorbit.com, hch@infradead.org, darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap
 infrastructure
Message-ID: <20190923211011.GH20367@quack2.suse.cz>

I'll try to comment just on top of the refactoring Christoph has suggested...
On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> 	return 0;
> }
>
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> +				 unsigned int flags)
> +{
> +	int ret;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +
> +	if (error) {
> +		ret = ext4_handle_failed_inode_extension(inode, offset + size);
> +		return ret ? ret : error;
> +	}
> +
> +	if (flags & IOMAP_DIO_UNWRITTEN) {
> +		ret = ext4_convert_unwritten_extents(NULL, inode,
> +						     offset, size);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	if (offset + size > i_size_read(inode)) {
> +		ret = ext4_handle_inode_extension(inode, offset, size, 0);
> +		if (ret)
> +			return ret;
> +	}

With the suggestions I made to your patch 3/6 this could be simplified to:

	if (!error && flags & IOMAP_DIO_UNWRITTEN) {
		error = ext4_convert_unwritten_extents(NULL, inode,
						       offset, size);
	}
	return ext4_handle_inode_extension(inode, offset, error ? : size,
					   size);

Note the change that when ext4_convert_unwritten_extents() fails (although
this should not really happen unless there's some corruption going on), we
do properly truncate possible extents beyond i_size.
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	ssize_t ret;
> +	size_t count;
> +	loff_t offset = iocb->ki_pos;
> +	struct inode *inode = file_inode(iocb->ki_filp);
> +	bool extend = false, overwrite = false, unaligned_aio = false;
> +
> +	if (!inode_trylock(inode)) {
> +		if (iocb->ki_flags & IOCB_NOWAIT)
> +			return -EAGAIN;
> +		inode_lock(inode);
> +	}
> +
> +	if (!ext4_dio_checks(inode)) {
> +		inode_unlock(inode);
> +		/*
> +		 * Fallback to buffered IO if the operation on the
> +		 * inode is not supported by direct IO.
> +		 */
> +		return ext4_buffered_write_iter(iocb, from);
> +	}
> +
> +	ret = ext4_write_checks(iocb, from);
> +	if (ret <= 0) {
> +		inode_unlock(inode);
> +		return ret;
> +	}
> +
> +	/*
> +	 * Unaligned direct AIO must be serialized among each other as
> +	 * the zeroing of partial blocks of two competing unaligned
> +	 * AIOs can result in data corruption.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> +	    !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> +		unaligned_aio = true;
> +		inode_dio_wait(inode);
> +	}
> +
> +	/*
> +	 * Determine whether the IO operation will overwrite allocated
> +	 * and initialized blocks. If so, check to see whether it is
> +	 * possible to take the dioread_nolock path.
> +	 */
> +	count = iov_iter_count(from);
> +	if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> +	    ext4_should_dioread_nolock(inode)) {
> +		overwrite = true;
> +		downgrade_write(&inode->i_rwsem);
> +	}
> +
> +	if (offset + count > i_size_read(inode) ||
> +	    offset + count > EXT4_I(inode)->i_disksize) {
> +		ext4_update_i_disksize(inode, inode->i_size);
> +		extend = true;
> +	}

This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
you need to also have a transaction started and call ext4_mark_inode_dirty()
to actually journal the change of i_disksize (ext4_update_i_disksize()
updates only the in-memory copy of the value).
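For completeness, the journalled-update pattern I mean looks roughly like
the sketch below (untested, and the transaction credit count is a guess):

```c
	/*
	 * Sketch only: an i_disksize change must happen under a running
	 * handle and be followed by ext4_mark_inode_dirty(), otherwise
	 * only the in-memory value changes and nothing is journalled.
	 */
	handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);

	if (IS_ERR(handle))
		return PTR_ERR(handle);
	ext4_update_i_disksize(inode, new_size);	/* in-memory update */
	ret = ext4_mark_inode_dirty(handle, inode);	/* journal the change */
	ext4_journal_stop(handle);
```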
Also the direct IO code needs to add the inode to the orphan list so that in
case of a crash, blocks allocated beyond EOF get truncated on the next
mount. That is the whole point of this exercise with i_disksize after all.

But I'm wondering if the i_disksize update is needed at all. Truncate cannot
be in progress (we hold i_rwsem) and dirty pages will be flushed by
iomap_dio_rw() before we start to allocate any blocks. So it should be
enough to have here:

	if (offset + count > i_size_read(inode)) {
		/*
		 * Add inode to orphan list so that blocks allocated beyond
		 * EOF get properly truncated in case of crash.
		 */
		start transaction handle
		add inode to orphan list
		stop transaction handle
	}

And just leave i_disksize at whatever it currently is.

> +
> +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> +	/*
> +	 * Unaligned direct AIO must be the only IO in flight or else
> +	 * any overlapping aligned IO after unaligned IO might result
> +	 * in data corruption. We also need to wait here in the case
> +	 * where the inode is being extended so that inode extension
> +	 * routines in ext4_dio_write_end_io() are covered by the
> +	 * inode_lock().
> +	 */
> +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> +		inode_dio_wait(inode);
> +
> +	if (overwrite)
> +		inode_unlock_shared(inode);
> +	else
> +		inode_unlock(inode);
> +
> +	if (ret >= 0 && iov_iter_count(from))
> +		return ext4_buffered_write_iter(iocb, from);
> +	return ret;
> +}
> +

...
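Spelling out the orphan-list pseudocode above, I'd imagine something like
this (again just a sketch, untested; the credit count is a guess and the
out_unlock label is hypothetical):

```c
	/*
	 * Sketch only: put the inode on the orphan list before issuing
	 * the direct IO so that blocks instantiated beyond EOF get
	 * truncated on the next mount if we crash mid-write.
	 */
	if (offset + count > i_size_read(inode)) {
		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);

		if (IS_ERR(handle)) {
			ret = PTR_ERR(handle);
			goto out_unlock;
		}
		ret = ext4_orphan_add(handle, inode);
		ext4_journal_stop(handle);
		if (ret)
			goto out_unlock;
		extend = true;
	}
```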
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> 			}
> 		}
> 	} else if (flags & IOMAP_WRITE) {
> -		int dio_credits;
> 		handle_t *handle;
> -		int retries = 0;
> +		int dio_credits, retries = 0, m_flags = 0;
>
> -		/* Trim mapping request to maximum we can map at once for DIO */
> +		/*
> +		 * Trim mapping request to maximum we can map at once
> +		 * for DIO.
> +		 */
> 		if (map.m_len > DIO_MAX_BLOCKS)
> 			map.m_len = DIO_MAX_BLOCKS;
> 		dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> 		if (IS_ERR(handle))
> 			return PTR_ERR(handle);
>
> -		ret = ext4_map_blocks(handle, inode, &map,
> -				      EXT4_GET_BLOCKS_CREATE_ZERO);
> +		/*
> +		 * DAX and direct IO are the only two operations that
> +		 * are currently supported with IOMAP_WRITE.
> +		 */
> +		WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> +		if (IS_DAX(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> +		else if (round_down(offset, i_blocksize(inode)) >=
> +			 i_size_read(inode))
> +			m_flags = EXT4_GET_BLOCKS_CREATE;
> +		else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> +			m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> +		ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> +		/*
> +		 * We cannot fill holes in indirect tree based inodes
> +		 * as that could expose stale data in the case of a
> +		 * crash. Use the magic error code to fallback to
> +		 * buffered IO.
> +		 */
> +		if (!m_flags && !ret)
> +			ret = -ENOTBLK;
> +
> 		if (ret < 0) {
> 			ext4_journal_stop(handle);
> 			if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> 		}
>
> 		/*
> -		 * If we added blocks beyond i_size, we need to make sure they
> -		 * will get truncated if we crash before updating i_size in
> -		 * ext4_iomap_end(). For faults we don't need to do that (and
> -		 * even cannot because for orphan list operations inode_lock is
> -		 * required) - if we happen to instantiate block beyond i_size,
> -		 * it is because we race with truncate which has already added
> -		 * the inode to the orphan list.
> +		 * If we added blocks beyond i_size, we need to make
> +		 * sure they will get truncated if we crash before
> +		 * updating the i_size. For faults we don't need to do
> +		 * that (and even cannot because for orphan list
> +		 * operations inode_lock is required) - if we happen
> +		 * to instantiate block beyond i_size, it is because
> +		 * we race with truncate which has already added the
> +		 * inode to the orphan list.
> 		 */

Just a nit, but it would be nice to use the full width of 80 columns when
formatting comments so that they don't get unnecessarily long.

								Honza
-- 
Jan Kara
SUSE Labs, CR