Date: Tue, 13 Jun 2023 09:27:38 +0530
Message-Id: <87zg54580d.fsf@doe.com>
From: Ritesh Harjani (IBM)
To: Matthew Wilcox
Cc: Theodore Ts'o, linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org, Ojaswin Mujoo, Disha Goel, Jan Kara
Subject: Re: [RFCv2 2/5] ext4: Remove PAGE_SIZE assumption of folio from mpage_submit_folio
In-Reply-To:

I am hoping Jan and Ted can correct me if any of my understanding is
incorrect. But here is my view...

Matthew Wilcox writes:

> On Mon, Jun 12, 2023 at 11:55:55PM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox writes:
>> I couldn't respond to your change because I still had some confusion
>> around this suggestion -
>>
>> > So do we care if we write a random fragment of a page after a truncate?
>> > If so, we should add:
>> >
>> >	if (folio_pos(folio) >= size)
>> >		return 0; /* Do we need to account nr_to_write? */
>>
>> I was not sure whether, if we go with the above, it would work with
>> collapse_range. I initially thought that collapse_range truncates the
>> pages between the start and end of the range and can also reduce
>> inode->i_size. That means writeback could find an inode->i_size smaller
>> than the folio_pos(folio) it is writing to. But in that case we can't
>> skip the write in writeback as above, because that write is still
>> required (a spurious write) even though i_size has been reduced, since
>> the corresponding FS blocks have not been truncated.
>>
>> But now, looking at the ext4_collapse_range() code, it doesn't look like
>> that is a problem, because it waits for any dirty data to be written out
>> before truncating. So no matter which folio_pos(folio) writeback is
>> writing, there should not be an issue if we simply return 0 as you
>> suggested above.
>>
>> static int ext4_collapse_range(struct file *file, loff_t offset, loff_t len)
>>
>> <...>
>>	ioffset = round_down(offset, PAGE_SIZE);
>>	/*
>>	 * Write tail of the last page before removed range since it will get
>>	 * removed from the page cache below.
>>	 */
>>	ret = filemap_write_and_wait_range(mapping, ioffset, offset);
>>	if (ret)
>>		goto out_mmap;
>>	/*
>>	 * Write data that will be shifted to preserve them when discarding
>>	 * page cache below. We are also protected from pages becoming dirty
>>	 * by i_rwsem and invalidate_lock.
>>	 */
>>	ret = filemap_write_and_wait_range(mapping, offset + len,
>>					   LLONG_MAX);
>>	truncate_pagecache(inode, ioffset);
>>
>> <... within i_data_sem>
>>	i_size_write(inode, new_size);
>>
>> <...>
>>
>> However, to avoid problems like this, I felt I should do some more code
>> reading. So I was mostly considering your second suggestion, which is
>> this; it keeps the current behaviour as is and does not change it.
>>
>> > If we simply don't care that we're doing a spurious write, then we can
>> > do something like:
>> >
>> > -	len = size & ~PAGE_MASK;
>> > +	len = size & (len - 1);
>
> For all I know, I've found a bug here. I don't know enough about ext4; if
> we have truncated a file, and then write back a page that is past i_size,
> will the block it's writing to have been freed?

I don't think so. If we look at the truncate code, it first reduces
i_size, then calls truncate_pagecache(inode, newsize), and then calls
ext4_truncate(), which frees the corresponding blocks. Since writeback
holds the folio lock until completion, truncate_pagecache() should block
on that folio until its lock has been released.
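Roughly the ordering I mean (heavily trimmed and from memory, so take it
as a sketch rather than a verbatim quote of the ATTR_SIZE shrink path in
ext4_setattr()):

	i_size_write(inode, attr->ia_size);	  /* 1. i_size is reduced first */
<...>
	truncate_pagecache(inode, inode->i_size); /* 2. waits on any locked
						   *    folio, then drops it
						   *    from the folio cache */
<...>
	if (shrink)
		ext4_truncate(inode);		  /* 3. blocks are freed only
						   *    after the folio is gone */
<...>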
- IIUC, if the truncate had already completed, the folio would no longer
  be in the folio cache for writeback to find. The folio cache is kept
  consistent by first truncating the folio out of the cache and only then
  removing/freeing the blocks on the device.

- Also, the reason we update i_size "before" calling truncate_pagecache()
  is to synchronize with mmap/page faults.

> Is this potentially a silent data corruptor?

- Let's consider the case where we are writing past i_size but still
  within the last block, i.e. the folio straddles i_size (a straddling
  write). In that case we require writeback to write out this last folio
  straddling i_size, because truncate will neither remove/free this folio
  nor free the last block. Writeback is supposed to write this last folio
  to disk to keep the cache and the on-disk data consistent, because
  truncate only zeroes out the rest of the folio in the folio cache; I
  don't think it goes and writes that folio out (it's not required to,
  since i_size implies the part of the folio beyond i_size should read as
  zero). So, IMO, writeback is supposed to write this last folio to disk,
  and if we skip this writeout, I think it may cause silent data
  corruption.

But I am not sure about writes beyond the last block containing i_size. I
think those are just spurious writes which do no harm, because truncate
will eventually remove the folio from the file mapping first and only then
release the corresponding disk blocks. So writing those out should do no
harm (the P.S. below puts some toy numbers on both cases).

-ritesh
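P.S. To put some toy numbers on the straddle vs. fully-beyond-EOF cases, a
small userspace sketch (not kernel code; the 64k folio size and the offsets
are made up, and I'm assuming the "len" in the suggested
"len = size & (len - 1)" is folio_size(folio) at that point):

/*
 * Toy arithmetic only.  "old_len" mimics the existing
 * "len = size & ~PAGE_MASK", "new_len" mimics the suggested
 * "len = size & (len - 1)" with len == folio_size(folio).
 */
#include <stdio.h>

#define TOY_PAGE_SIZE	4096ULL
#define TOY_PAGE_MASK	(~(TOY_PAGE_SIZE - 1))
#define TOY_FOLIO_SIZE	65536ULL			/* one 64k folio */

static void one_case(unsigned long long folio_pos, unsigned long long i_size)
{
	unsigned long long old_len = i_size & ~TOY_PAGE_MASK;
	unsigned long long new_len = i_size & (TOY_FOLIO_SIZE - 1);

	printf("folio [%llu, %llu), i_size %llu: old len %llu, new len %llu%s\n",
	       folio_pos, folio_pos + TOY_FOLIO_SIZE, i_size, old_len, new_len,
	       folio_pos >= i_size ? " (folio entirely beyond EOF)" : "");
}

int main(void)
{
	one_case(0, 39000);	  /* i_size inside this folio (straddle)   */
	one_case(65536, 100000);  /* straddling folio at a non-zero offset  */
	one_case(131072, 100000); /* folio fully past EOF after a truncate:
				   * new len is still non-zero, i.e. the
				   * spurious write we are talking about   */
	return 0;
}

For a folio that is exactly PAGE_SIZE both formulas agree; the difference
only shows up with larger folios, and in the last case the new formula
still yields a non-zero length for a folio wholly past EOF, which is
exactly the spurious write in question.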