2024-03-30 01:06:43

by Vadim Fedorenko

Subject: Re: [PATCH 19/26] netfs: New writeback implementation

On 29/03/2024 10:34, Naveen Mamindlapalli wrote:
>> -----Original Message-----
>> From: David Howells <[email protected]>
>> Sent: Thursday, March 28, 2024 10:04 PM
>> To: Christian Brauner <[email protected]>; Jeff Layton <[email protected]>;
>> Gao Xiang <[email protected]>; Dominique Martinet
>> <[email protected]>
>> Cc: David Howells <[email protected]>; Matthew Wilcox
>> <[email protected]>; Steve French <[email protected]>; Marc Dionne
>> <[email protected]>; Paulo Alcantara <[email protected]>; Shyam
>> Prasad N <[email protected]>; Tom Talpey <[email protected]>; Eric Van
>> Hensbergen <[email protected]>; Ilya Dryomov <[email protected]>;
>> [email protected]; [email protected]; [email protected];
>> [email protected]; [email protected]; ceph-
>> [email protected]; [email protected]; [email protected]; linux-
>> [email protected]; [email protected]; [email protected]; linux-
>> [email protected]; Latchesar Ionkov <[email protected]>; Christian
>> Schoenebeck <[email protected]>
>> Subject: [PATCH 19/26] netfs: New writeback implementation
>>
>> The current netfslib writeback implementation creates writeback requests of
>> contiguous folio data and then separately tiles subrequests over the space
>> twice, once for the server and once for the cache. This creates a few
>> issues:
>>
>> (1) Every time there's a discontiguity or a change between writing to only
>> one destination or writing to both, it must create a new request.
>> This makes it harder to do vectored writes.
>>
>> (2) The folios don't have the writeback mark removed until the end of the
>> request - and a request could be hundreds of megabytes.
>>
>> (3) In future, I want to support a larger cache granularity, which will
>> require aggregation of some folios that contain unmodified data (which
>> only need to go to the cache) and some which contain modifications
>> (which need to be uploaded and stored to the cache) - but, currently,
>> these are treated as discontiguous.
>>
>> There's also a move to get everyone to use writeback_iter() to extract
>> writable folios from the pagecache. That said, currently writeback_iter()
>> has some issues that make it less than ideal:
>>
>> (1) there's no way to cancel the iteration, even if you find a "temporary"
>> error that means the current folio and all subsequent folios are going
>> to fail;
>>
>> (2) there's no way to filter the folios being written back - something
>> that will impact Ceph with its ordered snap system;
>>
>> (3) and if you get a folio you can't immediately deal with (say you need
>> to flush the preceding writes), you are left with a folio hanging in
>> the locked state for the duration, when really we should unlock it and
>> relock it later.
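
For context, a ->writepages() implementation pumps folios out of
writeback_iter() roughly like the sketch below (my_write_folio() is a
made-up per-folio handler, not a real API):

	#include <linux/pagemap.h>
	#include <linux/writeback.h>

	/* Hypothetical per-folio handler. */
	static int my_write_folio(struct folio *folio,
				  struct writeback_control *wbc);

	static int my_writepages(struct address_space *mapping,
				 struct writeback_control *wbc)
	{
		struct folio *folio = NULL;
		int error = 0;

		/* Each call returns the next locked, dirty folio to
		 * write; the previous folio and any error are passed
		 * back in so the iterator can finish them off.  NULL
		 * ends the batch. */
		while ((folio = writeback_iter(mapping, wbc, folio, &error)))
			error = my_write_folio(folio, wbc);

		return error;
	}

Points (1)-(3) above are really about the shape of this loop: there's
no clean way to abandon the batch early, skip a folio, or park a folio
unlocked and come back to it later.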
>>
>> In this new implementation, I use writeback_iter() to pump folios,
>> progressively creating two parallel, but separate streams and cleaning up
>> the finished folios as the subrequests complete. Either or both streams
>> can contain gaps, and the subrequests in each stream can be of variable
>> size, don't need to align with each other and don't need to align with the
>> folios.
>>
>> Indeed, subrequests can cross folio boundaries, may cover several folios or
>> a folio may be spanned by multiple subrequests, e.g.:
>>
>>         +---+---+-----+-----+---+----------+
>> Folios: |   |   |     |     |   |          |
>>         +---+---+-----+-----+---+----------+
>>
>>           +------+------+     +----+----+
>> Upload:   |      |      |.....|    |    |
>>           +------+------+     +----+----+
>>
>>         +------+------+------+------+------+
>> Cache:  |      |      |      |      |      |
>>         +------+------+------+------+------+
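
The io_streams[] array that shows up in the code further down is what
carries this.  A rough shape, with invented names - this is not the
actual netfslib definition:

	#include <linux/list.h>
	#include <linux/types.h>

	/* Illustrative only: each destination gets its own stream of
	 * variable-size subrequests, so either stream can leave gaps
	 * without forcing a new request. */
	struct toy_io_stream {
		struct list_head	subrequests;	/* ordered, may be sparse */
		bool			avail;		/* destination in use? */
	};

	struct toy_io_request {
		struct toy_io_stream	io_streams[2];	/* [0] upload, [1] cache */
	};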
>>
>> The progressive subrequest construction permits the algorithm to be
>> preparing both the next upload to the server and the next write to the
>> cache whilst the previous ones are already in progress. Throttling can be
>> applied to control the rate of production of subrequests - and, in any
>> case, we probably want to write them to the server in ascending order,
>> particularly if the file will be extended.
>>
>> Content crypto can also be prepared at the same time as the subrequests and
>> run asynchronously, with the prepped requests being stalled until the
>> crypto catches up with them. This might also be useful for transport
>> crypto, but that happens at a lower layer, so probably would be harder to
>> pull off.
>>
>> The algorithm is split into three parts:
>>
>> (1) The issuer. This walks through the data, packaging it up, encrypting
>> it and creating subrequests. The part of this that generates
>> subrequests only deals with file positions and spans and so is usable
>> for DIO/unbuffered writes as well as buffered writes.
>>
>> (2) The collector. This asynchronously collects completed subrequests,
>> unlocks folios, frees crypto buffers and performs any retries. This
>> runs in a work queue so that the issuer can return to the caller for
>> writeback (so that the VM can have its kswapd thread back) or async
>> writes.
>>
>> (3) The retryer. This pauses the issuer, waits for all outstanding
>> subrequests to complete and then goes through the failed subrequests
>> to reissue them. This may involve reprepping them (with cifs, the
>> credits must be renegotiated, and a subrequest may need splitting),
>> and doing RMW for content crypto if there's a conflicting change on
>> the server.
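
As a sketch of how parts (2) and (3) hang together - again with
invented names, not the netfslib code - completions get queued and a
worker does the cleanup off the issuer's thread:

	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/workqueue.h>

	struct toy_wreq {
		struct work_struct	work;		/* runs toy_collector() */
		struct list_head	completed;	/* subrequests to collect */
		spinlock_t		lock;
	};

	/* Called from each subrequest's completion path. */
	static void toy_subreq_terminated(struct toy_wreq *wreq,
					  struct list_head *subreq_link)
	{
		spin_lock_bh(&wreq->lock);
		list_add_tail(subreq_link, &wreq->completed);
		spin_unlock_bh(&wreq->lock);
		queue_work(system_unbound_wq, &wreq->work);
	}

	static void toy_collector(struct work_struct *work)
	{
		struct toy_wreq *wreq = container_of(work, struct toy_wreq, work);

		/* Walk wreq->completed in order: clear folio writeback
		 * marks, free crypto buffers and hand any failed
		 * subrequests to the retry path. */
	}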
>>
>> [!] Note that some of the functions are prefixed with "new_" to avoid
>> clashes with existing functions. These will be renamed in a later patch
>> that cuts over to the new algorithm.
>>
>> Signed-off-by: David Howells <[email protected]>
>> cc: Jeff Layton <[email protected]>
>> cc: Eric Van Hensbergen <[email protected]>
>> cc: Latchesar Ionkov <[email protected]>
>> cc: Dominique Martinet <[email protected]>
>> cc: Christian Schoenebeck <[email protected]>
>> cc: Marc Dionne <[email protected]>
>> cc: [email protected]
>> cc: [email protected]
>> cc: [email protected]
>> cc: [email protected]

[..snip..]

>> +/*
>> + * Begin a write operation for writing through the pagecache.
>> + */
>> +struct netfs_io_request *new_netfs_begin_writethrough(struct kiocb *iocb, size_t len)
>> +{
>> +	struct netfs_io_request *wreq = NULL;
>> +	struct netfs_inode *ictx = netfs_inode(file_inode(iocb->ki_filp));
>> +
>> +	mutex_lock(&ictx->wb_lock);
>> +
>> +	wreq = netfs_create_write_req(iocb->ki_filp->f_mapping, iocb->ki_filp,
>> +				      iocb->ki_pos, NETFS_WRITETHROUGH);
>> +	if (IS_ERR(wreq))
>> +		mutex_unlock(&ictx->wb_lock);
>> +
>> +	wreq->io_streams[0].avail = true;
>> +	trace_netfs_write(wreq, netfs_write_trace_writethrough);
>
> Missing mutex_unlock() before return.
>

mutex_unlock() happens in new_netfs_end_writethrough() - the lock is
intentionally held across the whole writethrough operation.
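
For reference, the pairing is: new_netfs_begin_writethrough() takes
ictx->wb_lock (dropping it itself only on the error path), and
new_netfs_end_writethrough() drops it once the write has been pushed.
A rough caller-side sketch - the end function's signature is guessed
and the middle step is made up:

	static ssize_t toy_perform_writethrough(struct kiocb *iocb, size_t len)
	{
		struct netfs_io_request *wreq;

		wreq = new_netfs_begin_writethrough(iocb, len);	/* takes wb_lock */
		if (IS_ERR(wreq))
			return PTR_ERR(wreq);	/* wb_lock already dropped */

		/* ... copy data into the pagecache and advance wreq ... */

		return new_netfs_end_writethrough(wreq, iocb);	/* drops wb_lock */
	}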

> Thanks,
> Naveen
>