Message-ID: <88ce2dea-1ddf-7cb1-7f8f-6964b7d17ce5@linux.alibaba.com>
Date: Wed, 20 Sep 2023 15:36:53 +0800
Subject: Re: [bug report] ext4 misses final i_size meta sync under O_DIRECT | O_SYNC semantics after iomap DIO conversion
From: Gao Xiang
To: Dave Chinner
Cc: Jan Kara, linux-ext4@vger.kernel.org, Theodore Ts'o, Matthew Bobrowski, Christoph Hellwig, Joseph Qi, "Darrick J. Wong"
References: <02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com> <20230919120532.5dg7mgdnwd5lezgz@quack3> <9fccc0e4-8f51-d3e7-21de-f85f8837be7f@linux.alibaba.com>
X-Mailing-List: linux-ext4@vger.kernel.org

On 2023/9/20 15:29, Dave Chinner wrote:
> On Tue, Sep 19, 2023 at 09:47:34PM +0800, Gao Xiang wrote:
>>
>> (sorry... add Darrick here...)
>>
>> Hi Jan,
>>
>> On 2023/9/19 20:05, Jan Kara wrote:
>>> Hello!
>>>
>>> On Tue 19-09-23 14:00:04, Gao Xiang wrote:
>>>> Our consumer reports a behavior change between pre-iomap and iomap
>>>> direct io conversion:
>>>>
>>>> If the system crashes after an appending write to a file open with
>>>> O_DIRECT | O_SYNC flag set, file i_size won't be updated even if
>>>> O_SYNC was marked before.
>>>>
>>>> It can be reproduced by a test program in the attachment with
>>>> gcc -o repro repro.c && ./repro testfile && echo c > /proc/sysrq-trigger
>>>>
>>>> After some analysis, we found that before iomap direct I/O conversion,
>>>> the timing was roughly (taking Linux 3.10 codebase as an example):
>>>>
>>>>    ..
>>>>    - ext4_file_dio_write
>>>>      - __generic_file_aio_write
>>>>          ..
>>>>        - ext4_direct_IO  # generic_file_direct_write
>>>>          - ext4_ext_direct_IO
>>>>            - ext4_ind_direct_IO  # final_size > inode->i_size
>>>>              - ..
>>>>              - ret = blockdev_direct_IO()
>>>>              - i_size_write(inode, end)  # orphan && ret > 0 &&
>>>>                                          # end > inode->i_size
>>>>              - ext4_mark_inode_dirty()
>>>>              - ...
>>>>      - generic_write_sync  # handling O_SYNC
>>>>
>>>> So the dirty inode meta will be committed into journal immediately
>>>> if O_SYNC is set. However, After commit 569342dc2485 ("ext4: move
>>>> inode extension/truncate code out from ->iomap_end() callback"),
>>>> the new behavior seems as below:
>>>>
>>>>    ..
>>>>    - ext4_dio_write_iter
>>>>      - ext4_dio_write_checks  # extend = 1
>>>>      - iomap_dio_rw
>>>>        - __iomap_dio_rw
>>>>        - iomap_dio_complete
>>>>          - generic_write_sync
>>>>      - ext4_handle_inode_extension  # extend = 1
>>>>
>>>> So that i_size will be recorded only after generic_write_sync() is
>>>> called. So O_SYNC won't flush the update i_size to the disk.
>>>
>>> Indeed, that looks like a bug. Thanks for report!
>>
>> Thanks for the confirmation!
>>
>>>
>>>> On the other side, after a quick look of XFS side, it will record
>>>> i_size changes in xfs_dio_write_end_io() so it seems that it doesn't
>>>> have this problem.
>>>
>>> Yes, I'm a bit hazy on the details but I think we've decided to call
>>> ext4_handle_inode_extension() directly from ext4_dio_write_iter() because
>>> from ext4_dio_write_end_io() it was difficult to test in a race-free way
>>> whether extending i_size (and i_disksize) is needed or not (we don't
>>> necessarily hold i_rwsem there). I'll think how we could fix the problem
>>> you've reported.
>
> Given that ext4 can run extent conversion in IO completion, it can
> run file extension in IO completion. Yes, that might require
> additional synchronisation of file size updates to co-ordinate
> submission and completion size checks. XFS just uses a spinlock for
> this....
>
>> Yes, another concern is O_DSYNC, I'm quite not sure if the behavior
>> is changed too.
>
> For O_DSYNC, the file size change needs to be covered by the
> call to generic_write_sync() as well. O_DSYNC should be thought of
> as being essentially the same as O_SYNC except for minor details.
>
>> I had a rough feeling that currently iomap DIO behaviors on these are
>> too strict and might not fit in each specific fs detailed
>> implementation, tho.
>
>
> In what way? iomap implements exactly the data integrity semantics
> that are required for O_DSYNC and O_SYNC writes, and it requires
> filesystem end_io method to finalize all metadata modifications
> needed for data integrity purposes.
>
> Keep in mind that iomap is designed around the requirements async IO
> (AIO and io_uring) place on individual IOs: there is no waiting
> context to "finish" the IO before userspace is signalled that it is
> complete. Hence everything related to data integrity needs to be
> done by the filesystem in ->end_io before the iomap completion runs
> generic_write_sync() and signals IO completion....

Yes, I understand iomap is well suited to this model. Anyway, it's
somewhat out of scope on my side.

Thanks,
Gao Xiang

>
> Cheers,
>
> Dave.
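
For readers following the thread, the ordering problem in the two call
traces can be illustrated with a toy user-space model. This is only a
sketch, not kernel code: struct toy_inode and every toy_* function are
hypothetical stand-ins invented for illustration, and "durable_size"
simply models whatever size a sync call would have pushed to stable
storage. It assumes a sync persists exactly the i_size that is visible
at the moment it runs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: none of these names are kernel APIs. */
struct toy_inode {
    size_t i_size;        /* in-memory file size */
    size_t durable_size;  /* size that would survive a crash */
};

/* Stand-in for the i_size update performed on inode extension. */
void toy_extend(struct toy_inode *ino, size_t end)
{
    if (end > ino->i_size)
        ino->i_size = end;
}

/* Stand-in for generic_write_sync(): persists the size visible right now. */
void toy_write_sync(struct toy_inode *ino)
{
    ino->durable_size = ino->i_size;
}

/*
 * Ordering of the pre-iomap trace: extend i_size, then sync.
 * Returns the size that would be seen after a crash once the write returns.
 */
size_t toy_write_extend_then_sync(struct toy_inode *ino, size_t len)
{
    assert(len > 0);
    toy_extend(ino, ino->i_size + len);
    toy_write_sync(ino);
    return ino->durable_size;
}

/*
 * Ordering of the post-conversion trace: sync runs inside I/O completion,
 * and the extension happens only afterwards, so the sync misses it.
 */
size_t toy_write_sync_then_extend(struct toy_inode *ino, size_t len)
{
    size_t end;

    assert(len > 0);
    end = ino->i_size + len;
    toy_write_sync(ino);
    toy_extend(ino, end);
    return ino->durable_size;
}
```

Starting from an empty toy inode, a 4096-byte append through
toy_write_extend_then_sync() leaves durable_size equal to i_size, while
the same append through toy_write_sync_then_extend() leaves
durable_size behind i_size. That gap is the window the report
describes, and it is why the size update has to happen before
generic_write_sync() runs.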