Date: Wed, 25 Sep 2019 17:14:29 +1000
From: Matthew Bobrowski
To: Jan Kara
Cc: tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, david@fromorbit.com, hch@infradead.org,
	darrick.wong@oracle.com
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure
Message-ID: <20190925071429.GA27699@bobrowski>
References: <20190923211011.GH20367@quack2.suse.cz> <20190924102926.GC17526@bobrowski> <20190924141321.GC11819@quack2.suse.cz>
In-Reply-To: <20190924141321.GC11819@quack2.suse.cz>
X-Mailing-List: linux-ext4@vger.kernel.org

On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > +	if (offset + count > i_size_read(inode) ||
> > > > +	    offset + count > EXT4_I(inode)->i_disksize) {
> > > > +		ext4_update_i_disksize(inode, inode->i_size);
> > > > +		extend = true;
> > > > +	}
> > >
> > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > updates only the in-memory copy of the entry). Also the direct IO code
> > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > of this exercise with i_disksize after all.
> > >
> > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > enough to have here:
> >
> > Well, I initially thought the same, however doing some research shows that we
> > have the following edge case:
> > 	- 45d8ec4d9fd54
> > 	and
> > 	- 73fdad00b208b
> >
> > In fact you can reproduce the exact same i_size corruption issue by running
> > the generic/475 xfstests multiple times, as articulated within
> > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > be a problem that resides elsewhere that may need addressing?
>
> Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> there's an unwritten delalloc write beyond the range where the DIO write
> happens.
>
> > > 	if (offset + count > i_size_read(inode)) {
> > > 		/*
> > > 		 * Add inode to orphan list so that blocks allocated beyond
> > > 		 * EOF get properly truncated in case of crash.
> > > 		 */
> > > 		start transaction handle
> > > 		add inode to orphan list
> > > 		stop transaction handle
> > > 	}
> > >
> > > And just leave i_disksize at whatever it currently is.
> >
> > I originally had the code which added the inode to the orphan list here, but
> > then I thought to myself that it'd make more sense to actually do this step
> > closer to the point where we've managed to successfully allocate the required
> > blocks for the write. This prevents the need to spray orphan list clean up
> > code all over the place just to cover the case that a write which had intended
> > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > allocation). So, hence the reason why I thought having it in
> > ext4_iomap_begin() would make more sense, because at that point in the write
> > path, there is enough/or more assurance to make the call around whether we
> > will in fact be able to perform the write which will be extending beyond
> > i_size, or not and consequently whether the inode should be placed onto the
> > orphan list?
> >
> > Ideally I'd like to turn this statement into:
> >
> > 	if (offset + count > i_size_read(inode))
> > 		extend = true;
> >
> > Maybe I'm missing something here and there's actually a really good reason for
> > doing this nice and early? What are your thoughts about what I've mentioned
> > above?
>
> Well, the slight trouble with adding inode to orphan list in
> ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> you need to remove it when IO is done because there's no way to
> propagate that information from ext4_iomap_begin() and checking against
> i_disksize is unreliable because it can change (due to writeback of
> delalloc pages) while direct IO is running. But I think we can overcome
> that by splitting our end_io functions into two - ext4_dio_write_end_io() and
> ext4_dio_extend_write_end_io(). So:
>
> 	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> 	/*
> 	 * Need to check against i_disksize as there may be delalloc writes
> 	 * pending.
> 	 */
> 	if (offset + count > EXT4_I(inode)->i_disksize)
> 		extend = true;

Hm...
I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we should
be using to determine whether an extension will be performed or not? My
understanding is that i_size holds the value that the file size is expected
to be, which is why we previously updated i_disksize to i_size using
ext4_update_i_disksize(). Also, there are cases where offset + count >
EXT4_I(inode)->i_disksize, but offset + count < i_size_read(inode). In that
case we may take an incorrect path somewhere, i.e. below, where the 'extend'
clause is true. I also feel as though we should try to stick to using one
value as the reference for determining whether we're performing an
extension, rather than using i_disksize here and i_size over there, as that
leads to unnecessary confusion?

> 	...
> 	iomap_dio_rw(...,
> 		extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
>
> and ext4_dio_write_end_io() will just take care of conversion of unwritten
> extents on successful IO completion, while ext4_dio_extend_write_end_io()
> will take care of all the complex stuff with orphan handling, extension
> of inode size, and truncation of blocks beyond EOF - and it can do that
> because it is guaranteed to run under the protection of i_rwsem held in
> ext4_dio_write_iter().
>
> Alternatively, we could also just pass NULL instead of
> ext4_dio_extend_write_end_io() and just do all the work explicitly in
> ext4_dio_write_iter() in the 'extend' case. That might be actually the most
> transparent option...

Well, with the changes to the ext4_handle_inode_extension() conditions that
you recommended in patch 2/6, I can't see why we'd need two separate
->end_io() handlers, as we'd just abort early if we're not extending?

> But at this point there are so many suggestions in flight that I need to
> see current state of the code again to be able to tell anything useful :).

Heh, true.
I will post an updated patch series that takes into account most of the
recommendations put forward for this version of the series, and then we can
have a discussion based on that. :)

----
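
For anyone following along, the disagreement about which size to compare against can be modelled in a few lines of userspace C. This is only a sketch of the two candidate checks quoted above, not kernel code: struct model_inode and both helper functions are hypothetical names made up for illustration, and the sizes in the usage note are arbitrary example values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace stand-in for the two sizes discussed in the
 * thread. This is not the kernel's struct inode or ext4_inode_info.
 */
struct model_inode {
	int64_t i_size;     /* in-memory file size */
	int64_t i_disksize; /* file size as recorded on disk */
};

/* Check against i_size, as in "if (offset + count > i_size_read(inode))". */
static bool extends_past_i_size(const struct model_inode *inode,
				int64_t offset, int64_t count)
{
	return offset + count > inode->i_size;
}

/* Check against i_disksize, as in the WARN_ON_ONCE() variant. */
static bool extends_past_i_disksize(const struct model_inode *inode,
				    int64_t offset, int64_t count)
{
	return offset + count > inode->i_disksize;
}
```

With i_size = 8192 and i_disksize = 4096 (for example, a delalloc write still pending beyond the on-disk size), a direct IO write at offset 4096 of 2048 bytes ends at 6144. extends_past_i_disksize() returns true for it while extends_past_i_size() returns false, which is the case where the two checks disagree.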