With Willy's upcoming folio changes, from a filesystem point of view, we're
going to be looking at folios instead of pages, where:
- a folio is a contiguous collection of pages;
- each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say) or
a huge page (say 2M each);
- a folio has one dirty flag and one writeback flag that applies to all
constituent pages;
- a complete folio currently is limited to PMD_SIZE or order 8, but could
theoretically go up to about 2GiB before various integer fields have to be
modified (not to mention the memory allocator).
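To make the single dirty/writeback flag point above concrete, here's a minimal
sketch of what a filesystem-side check might look like, assuming the folio
accessors from Willy's series (folio_test_dirty(), folio_size()); the helper
name is made up:

/* Purely illustrative: with a single dirty flag per folio, any dirtying
 * write, however small, makes the entire folio a writeback candidate.
 */
static size_t folio_dirty_span(struct folio *folio)
{
        if (!folio_test_dirty(folio))
                return 0;

        /* No sub-folio granularity: the flag covers all constituent pages. */
        return folio_size(folio);
}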
Willy is arguing that network filesystems should, except in certain very
special situations (eg. O_SYNC), only write whole folios (limited to EOF).
Some network filesystems, however, currently keep track of which byte ranges
are modified within a dirty page (AFS does; NFS seems to also) and only write
out the modified data.
Also, there are limits to the maximum RPC payload sizes, so writing back large
pages may necessitate multiple writes, possibly to multiple servers.
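To illustrate the splitting (the RPC-issuing function and the wsize parameter
are hypothetical, not any particular filesystem's API), writeback of one folio
might look roughly like:

/* Sketch: write back one folio in chunks no larger than the transport's
 * maximum RPC payload.  Each chunk could in principle be directed at a
 * different server if the file is striped across several.
 */
static int write_folio_in_chunks(struct folio *folio, loff_t i_size, size_t wsize)
{
        loff_t pos = folio_pos(folio);
        size_t len = min_t(loff_t, folio_size(folio), i_size - pos);

        while (len) {
                size_t chunk = min(len, wsize);
                int ret = issue_write_rpc(folio, pos, chunk); /* hypothetical */

                if (ret < 0)
                        return ret;
                pos += chunk;
                len -= chunk;
        }
        return 0;
}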
What I'm trying to do is collate each network filesystem's properties (I'm
including FUSE in that).
So we have the following filesystems:
Plan9
- Doesn't track bytes
- Only writes single pages
AFS
- Max RPC payload theoretically ~5.5 TiB (OpenAFS), ~16EiB (Auristor/kAFS)
- kAFS (Linux kernel)
- Tracks bytes, only writes back what changed
- Writes from up to 65535 contiguous pages.
- OpenAFS/Auristor (UNIX/Linux)
- Deal with cache-sized blocks (configurable, but something from 8K to 2M),
reads and writes in these blocks
- OpenAFS/Auristor (Windows)
- Track bytes, write back only what changed
Ceph
- File divided into objects (typically 2MiB in size), which may be scattered
over multiple servers.
- Max RPC size is therefore object size.
- Doesn't track bytes.
CIFS/SMB
- Writes back just changed bytes immediately under some circumstances
- Doesn't track bytes and writes back whole pages otherwise.
- SMB3 has a max RPC size of 16MiB, with a default of 4MiB
FUSE
- Doesn't track bytes.
- Max 'RPC' size of 256 pages (I think).
NFS
- Tracks modified bytes within a page.
- Max RPC size of 1MiB.
- Files may be constructed of objects scattered over different servers.
OrangeFS
- Doesn't track bytes.
- Multipage writes possible.
If you could help me fill in the gaps, that would be great.
Thanks,
David
On Thu, Aug 5, 2021 at 9:36 AM David Howells <[email protected]> wrote:
>
> Some network filesystems, however, currently keep track of which byte ranges
> are modified within a dirty page (AFS does; NFS seems to also) and only write
> out the modified data.
NFS definitely does. I haven't used NFS in two decades, but I worked
on some of the code (read: I made nfs use the page cache both for
reading and writing) back in my Transmeta days, because NFSv2 was the
default filesystem setup back then.
See fs/nfs/write.c, although I have to admit that I don't recognize
that code any more.
It's fairly important to be able to do streaming writes without having
to read the old contents for some loads. And read-modify-write cycles
are death for performance, so you really want to coalesce writes until
you have the whole page.
That said, I suspect it's also *very* filesystem-specific, to the
point where it might not be worth trying to do in some generic manner.
In particular, NFS had things like interesting credential issues, so
if you have multiple concurrent writers that used different 'struct
file *' to write to the file, you can't just mix the writes. You have
to sync the writes from one writer before you start the writes for the
next one, because one might succeed and the other not.
So you can't just treat it as some random "page cache with dirty byte
extents". You really have to be careful about credentials, timeouts,
etc, and the pending writes have to keep a fair amount of state
around.
At least that was the case two decades ago.
[ goes off and looks. See "nfs_write_begin()" and friends in
fs/nfs/file.c for some of the examples of these things, although it
looks like the code is less aggressive about avoiding the
read-modify-write case than I thought I remembered, and only does it
for write-only opens ]
Linus
On Thu, 2021-08-05 at 10:27 -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <[email protected]>
> wrote:
> >
> > Some network filesystems, however, currently keep track of which
> > byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and
> > only write
> > out the modified data.
>
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
>
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
>
> It's fairly important to be able to do streaming writes without
> having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes
> until
> you have the whole page.
>
> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic
> manner.
>
> In particular, NFS had things like interesting credential issues, so
> if you have multiple concurrent writers that used different 'struct
> file *' to write to the file, you can't just mix the writes. You have
> to sync the writes from one writer before you start the writes for
> the
> next one, because one might succeed and the other not.
>
> So you can't just treat it as some random "page cache with dirty byte
> extents". You really have to be careful about credentials, timeouts,
> etc, and the pending writes have to keep a fair amount of state
> around.
>
> At least that was the case two decades ago.
>
> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]
>
All correct, however there is also the issue that even if we have done
a read-modify-write, we can't always extend the write to cover the
entire page.
If you look at nfs_can_extend_write(), you'll note that we don't extend
the page data if the file is range locked, if the attributes have not
been revalidated, or if the page cache contents are suspected to be
invalid for some other reason.
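Paraphrasing those checks as a sketch (this is not the actual fs/nfs/write.c
code; the predicates are made-up names for the conditions described above):

/* Only expand a sub-page write to cover the whole page when the cached
 * copy can safely be treated as authoritative.
 */
static bool can_extend_write_to_whole_page(struct file *file, struct page *page)
{
        if (file_is_range_locked(file))           /* hypothetical check */
                return false;
        if (!file_attributes_revalidated(file))   /* hypothetical check */
                return false;
        if (page_cache_contents_suspect(page))    /* hypothetical check */
                return false;
        return true;
}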
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Thu, 2021-08-05 at 17:35 +0100, David Howells wrote:
> With Willy's upcoming folio changes, from a filesystem point of view, we're
> going to be looking at folios instead of pages, where:
>
> - a folio is a contiguous collection of pages;
>
> - each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say) or
> a huge page (say 2M each);
>
> - a folio has one dirty flag and one writeback flag that applies to all
> constituent pages;
>
> - a complete folio currently is limited to PMD_SIZE or order 8, but could
> theoretically go up to about 2GiB before various integer fields have to be
> modified (not to mention the memory allocator).
>
> Willy is arguing that network filesystems should, except in certain very
> special situations (eg. O_SYNC), only write whole folios (limited to EOF).
>
> Some network filesystems, however, currently keep track of which byte ranges
> are modified within a dirty page (AFS does; NFS seems to also) and only write
> out the modified data.
>
> Also, there are limits to the maximum RPC payload sizes, so writing back large
> pages may necessitate multiple writes, possibly to multiple servers.
>
> What I'm trying to do is collate each network filesystem's properties (I'm
> including FUSE in that).
>
> So we have the following filesystems:
>
> Plan9
> - Doesn't track bytes
> - Only writes single pages
>
> AFS
> - Max RPC payload theoretically ~5.5 TiB (OpenAFS), ~16EiB (Auristor/kAFS)
> - kAFS (Linux kernel)
> - Tracks bytes, only writes back what changed
> - Writes from up to 65535 contiguous pages.
> - OpenAFS/Auristor (UNIX/Linux)
> - Deal with cache-sized blocks (configurable, but something from 8K to 2M),
> reads and writes in these blocks
> - OpenAFS/Auristor (Windows)
> - Track bytes, write back only what changed
>
> Ceph
> - File divided into objects (typically 2MiB in size), which may be scattered
> over multiple servers.
The default is 4M in modern cephfs clusters, but the rest is correct.
> - Max RPC size is therefore object size.
> - Doesn't track bytes.
>
> CIFS/SMB
> - Writes back just changed bytes immediately under some circumstances
cifs.ko can also just do writes to specific byte ranges synchronously
when it doesn't have the ability to use the cache (i.e. no oplock or
lease). CephFS also does this when it doesn't have the necessary
capabilities (aka caps) to use the pagecache.
If we want to add infrastructure for netfs writeback, then it would be
nice to consider similar infrastructure to handle those cases as well.
> - Doesn't track bytes and writes back whole pages otherwise.
> - SMB3 has a max RPC size of 16MiB, with a default of 4MiB
>
> FUSE
> - Doesn't track bytes.
> - Max 'RPC' size of 256 pages (I think).
>
> NFS
> - Tracks modified bytes within a page.
> - Max RPC size of 1MiB.
> - Files may be constructed of objects scattered over different servers.
>
> OrangeFS
> - Doesn't track bytes.
> - Multipage writes possible.
>
> If you could help me fill in the gaps, that would be great.
--
Jeff Layton <[email protected]>
On Thu, Aug 05, 2021 at 10:27:05AM -0700, Linus Torvalds wrote:
> On Thu, Aug 5, 2021 at 9:36 AM David Howells <[email protected]> wrote:
> > Some network filesystems, however, currently keep track of which byte ranges
> > are modified within a dirty page (AFS does; NFS seems to also) and only write
> > out the modified data.
>
> NFS definitely does. I haven't used NFS in two decades, but I worked
> on some of the code (read: I made nfs use the page cache both for
> reading and writing) back in my Transmeta days, because NFSv2 was the
> default filesystem setup back then.
>
> See fs/nfs/write.c, although I have to admit that I don't recognize
> that code any more.
>
> It's fairly important to be able to do streaming writes without having
> to read the old contents for some loads. And read-modify-write cycles
> are death for performance, so you really want to coalesce writes until
> you have the whole page.
I completely agree with you. The context you're missing is that Dave
wants to do RMW twice. He doesn't do the delaying SetPageUptodate dance.
If the write is less than the whole page, AFS, Ceph and anybody else
using netfs_write_begin() will first read the entire page in and mark
it Uptodate.
Then he wants to track which parts of the page are dirty (at byte
granularity) and send only those bytes to the server in a write request.
So it's the worst of both worlds: first the client does an RMW, then the
server does an RMW (assuming the client's data is no longer in the
server's cache).
The NFS code moves the RMW from the client to the server, and that makes
a load of sense.
> That said, I suspect it's also *very* filesystem-specific, to the
> point where it might not be worth trying to do in some generic manner.
It certainly doesn't make sense for block filesystems. Since they
can only do I/O on block boundaries, a sub-block write has to read in
the surrounding block, and once you're doing that, you might as well
read in the whole page.
Tracking sub-page dirty bits still makes sense. It's on my to-do
list for iomap.
> [ goes off and looks. See "nfs_write_begin()" and friends in
> fs/nfs/file.c for some of the examples of these things, although it
> looks like the code is less aggressive about avoiding the
> read-modify-write case than I thought I remembered, and only does it
> for write-only opens ]
NFS is missing one trick; it could implement aops->is_partially_uptodate
and then it would be able to read back bytes that have already been
written by this client without writing back the dirty ranges and fetching
the page from the server.
Maybe this isn't an important optimisation.
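For what it's worth, a sketch of how that might be wired up, assuming NFS kept
some per-page record of the ranges this client has written (the lookup helper
is hypothetical; the callback takes the struct page-based form current at the
time):

static int nfs_is_partially_uptodate(struct page *page, unsigned long from,
                                     unsigned long count)
{
        /* Report the range as uptodate if these bytes were written or
         * validated by this client, even though the rest of the page
         * has never been fetched from the server.
         */
        return page_range_written_locally(page, from, from + count); /* hypothetical */
}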
On Thu, Aug 05, 2021 at 05:35:33PM +0100, David Howells wrote:
> With Willy's upcoming folio changes, from a filesystem point of view, we're
> going to be looking at folios instead of pages, where:
>
> - a folio is a contiguous collection of pages;
>
> - each page in the folio might be a standard PAGE_SIZE page (4K or 64K, say) or
> a huge page (say 2M each);
This is not a great way to explain folios.
If you're familiar with compound pages, a folio is a new type for
either a base page or the head page of a compound page; nothing more
and nothing less.
If you're not familiar with compound pages, a folio contains 2^n
contiguous pages. They are treated as a single unit.
> - a folio has one dirty flag and one writeback flag that applies to all
> constituent pages;
>
> - a complete folio currently is limited to PMD_SIZE or order 8, but could
> theoretically go up to about 2GiB before various integer fields have to be
> modified (not to mention the memory allocator).
Filesystems should not make an assumption about this ... I suspect
the optimum page size scales with I/O bandwidth; taking PCI bandwidth
as a reasonable proxy, it's doubled five times in twenty years.
> Willy is arguing that network filesystems should, except in certain very
> special situations (eg. O_SYNC), only write whole folios (limited to EOF).
I did also say that the write could be limited by, eg, a byte-range
lease on the file. If the client doesn't have permission to write
a byte range, then it doesn't need to write it back.
Matthew Wilcox <[email protected]> wrote:
> > It's fairly important to be able to do streaming writes without having
> > to read the old contents for some loads. And read-modify-write cycles
> > are death for performance, so you really want to coalesce writes until
> > you have the whole page.
>
> I completely agree with you. The context you're missing is that Dave
> wants to do RMW twice. He doesn't do the delaying SetPageUptodate dance.
Actually, I do the delaying of SetPageUptodate in the new write helpers that
I'm working on - at least to some extent. For a write of any particular size
(which may be more than a page), I only read the first and last pages affected
if they're not completely changed by the write. Note that I have my own
version of generic_perform_write() that allows me to eliminate write_begin and
write_end for any filesystem using it.
Keeping track of which regions are dirty allows merging of contiguous dirty
regions.
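For illustration, the merge is just interval coalescing over whatever record
is used to track the ranges (the structure here is made up):

struct dirty_region {
        loff_t start;   /* inclusive file offset */
        loff_t end;     /* exclusive file offset */
};

/* Extend an existing region if the new range overlaps or abuts it;
 * otherwise the caller records a separate region.
 */
static bool try_merge_dirty_region(struct dirty_region *r, loff_t start, loff_t end)
{
        if (end < r->start || start > r->end)
                return false;
        r->start = min(r->start, start);
        r->end = max(r->end, end);
        return true;
}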
It has occurred to me that I don't actually need the pages to be uptodate and
completely filled out. I'm tracking which bits are dirty - I could defer
reading the missing bits till someone wants to read or mmap.
But that kind of screws with local caching. The local cache might need to
track the missing bits, and we are likely to be using blocks larger than a
page.
Basically, there are a lot of scenarios where not having fully populated pages
sucks. And for streaming writes, wouldn't it be better if you used DIO
writes?
> If the write is less than the whole page, AFS, Ceph and anybody else
> using netfs_write_begin() will first read the entire page in and mark
> it Uptodate.
Indeed - but that function is set to be replaced. What you're missing is that
if someone then tries to read the partially modified page, you may have to do
two reads from the server.
> Then he wants to track which parts of the page are dirty (at byte
> granularity) and send only those bytes to the server in a write request.
Yes. Because other constraints may apply, for example the handling of
conflicting third-party writes. The question here is how much we care about
that - and that's why I'm trying to write back only what's changed where
possible.
That said, if content encryption is thrown into the mix, the minimum we can
write back is whatever the size of the blocks on which encryption is
performed, so maybe we shouldn't care.
Add to that disconnected-operation reconnection resolution, where it might be
handy to have a list of what changed in a file.
> So it's the worst of both worlds: first the client does an RMW, then the
> server does an RMW (assuming the client's data is no longer in the
> server's cache).
Actually, it's not necessarily as bad as you make out. You have to compare the
cost of the server-side RMW with the cost of setting up a read or a write
operation.
And then there's this scenario: Imagine I'm going to modify the middle of a
page which doesn't yet exist. I read the bit at the beginning and the bit at
the end and then try to fill the middle, but now get an EFAULT error. I'm
going to have to do *three* reads if someone wants to read the page.
> The NFS code moves the RMW from the client to the server, and that makes
> a load of sense.
No, it very much depends. It might suck if you have the folio partly cached
locally in fscache, it doesn't work if you have content encryption, and it
would suck if you're doing disconnected operation.
I presume you're advocating that the change is immediately written to the
server, and then you read it back from the server?
> > That said, I suspect it's also *very* filesystem-specific, to the
> > point where it might not be worth trying to do in some generic manner.
>
> It certainly doesn't make sense for block filesystems. Since they
> can only do I/O on block boundaries, a sub-block write has to read in
> the surrounding block, and once you're doing that, you might as well
> read in the whole page.
I'm not trying to do this for block filesystems! However, a block filesystem
- or even a blockdev - might be involved in terms of the local cache.
> Tracking sub-page dirty bits still makes sense. It's on my to-do
> list for iomap.
/me blinks
"bits" as in parts of a page or "bits" as in the PG_dirty bits on the pages
contributing to a folio?
> > [ goes off and looks. See "nfs_write_begin()" and friends in
> > fs/nfs/file.c for some of the examples of these things, although it
> > looks like the code is less aggressive about avoiding the
> > read-modify-write case than I thought I remembered, and only does it
> > for write-only opens ]
>
> NFS is missing one trick; it could implement aops->is_partially_uptodate
> and then it would be able to read back bytes that have already been
> written by this client without writing back the dirty ranges and fetching
> the page from the server.
As mentioned above, I have been considering the possibility of keeping track
of partially dirty non-uptodate pages. Jeff and I have been discussing that
we might want support for explicit RMW anyway for various reasons (e.g. doing
DIO that's not crypto-block aligned,
remote-invalidation/reconnection-resolution handling).
David
Matthew Wilcox <[email protected]> wrote:
> Filesystems should not make an assumption about this ... I suspect
> the optimum page size scales with I/O bandwidth; taking PCI bandwidth
> as a reasonable proxy, it's doubled five times in twenty years.
There are a lot more factors than you make out. Local caching, content
crypto, transport crypto, cost of setting up RPC calls, compounding calls to
multiple servers.
David
On Fri, Aug 06, 2021 at 02:42:37PM +0100, David Howells wrote:
> Matthew Wilcox <[email protected]> wrote:
>
> > > It's fairly important to be able to do streaming writes without having
> > > to read the old contents for some loads. And read-modify-write cycles
> > > are death for performance, so you really want to coalesce writes until
> > > you have the whole page.
> >
> > I completely agree with you. The context you're missing is that Dave
> > wants to do RMW twice. He doesn't do the delaying SetPageUptodate dance.
>
> Actually, I do the delaying of SetPageUptodate in the new write helpers that
> I'm working on - at least to some extent. For a write of any particular size
> (which may be more than a page), I only read the first and last pages affected
> if they're not completely changed by the write. Note that I have my own
> version of generic_perform_write() that allows me to eliminate write_begin and
> write_end for any filesystem using it.
No, that is very much not the same thing. Look at what NFS does, like
Linus said. Consider this test program:
fd = open();
lseek(fd, 5, SEEK_SET);
write(fd, buf, 3);
write(fd, buf2, 10);
write(fd, buf3, 2);
close(fd);
You're going to do an RMW. NFS keeps track of which bytes are dirty,
and writes only those bytes to the server (when that page is eventually
written-back). So yes, it's using the page cache, but it's not doing
an unnecessary read from the server.
> It has occurred to me that I don't actually need the pages to be uptodate and
> completely filled out. I'm tracking which bits are dirty - I could defer
> reading the missing bits till someone wants to read or mmap.
>
> But that kind of screws with local caching. The local cache might need to
> track the missing bits, and we are likely to be using blocks larger than a
> page.
There's nothing to cache. Pages which are !Uptodate aren't going to get
locally cached.
> Basically, there are a lot of scenarios where not having fully populated pages
> sucks. And for streaming writes, wouldn't it be better if you used DIO
> writes?
DIO can't do sub-512-byte writes.
> > If the write is less than the whole page, AFS, Ceph and anybody else
> > using netfs_write_begin() will first read the entire page in and mark
> > it Uptodate.
>
> Indeed - but that function is set to be replaced. What you're missing is that
> if someone then tries to read the partially modified page, you may have to do
> two reads from the server.
NFS doesn't. It writes back the dirty data from the page and then
does a single read of the entire page. And as I said later on, using
->is_partially_uptodate can avoid that for some cases.
> > Then he wants to track which parts of the page are dirty (at byte
> > granularity) and send only those bytes to the server in a write request.
>
> Yes. Because other constraints may apply, for example the handling of
> conflicting third-party writes. The question here is how much we care about
> that - and that's why I'm trying to write back only what's changed where
> possible.
If you care about conflicting writes from different clients, you really
need to establish a cache ownership model. Or treat the page-cache as
write-through.
> > > That said, I suspect it's also *very* filesystem-specific, to the
> > > point where it might not be worth trying to do in some generic manner.
> >
> > It certainly doesn't make sense for block filesystems. Since they
> > can only do I/O on block boundaries, a sub-block write has to read in
> > the surrounding block, and once you're doing that, you might as well
> > read in the whole page.
>
> I'm not trying to do this for block filesystems! However, a block filesystem
> - or even a blockdev - might be involved in terms of the local cache.
You might not be trying to do anything for block filesystems, but we
should think about what makes sense for block filesystems as well as
network filesystems.
> > Tracking sub-page dirty bits still makes sense. It's on my to-do
> > list for iomap.
>
> /me blinks
>
> "bits" as in parts of a page or "bits" as in the PG_dirty bits on the pages
> contributing to a folio?
Perhaps I should have said "Tracking dirtiness on a sub-page basis".
Right now, that looks like a block bitmap, but maybe it should be a
range-based data structure.
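Just to show the bitmap flavour of that (purely illustrative, not the iomap
code): one dirty bit per filesystem block hanging off the folio's private
data, versus storing (start, end) pairs.

struct subfolio_dirty {
        unsigned int  block_bits;   /* log2 of the filesystem block size */
        unsigned long bitmap[];     /* one dirty bit per block in the folio */
};

static void mark_blocks_dirty(struct subfolio_dirty *sd, size_t from, size_t to)
{
        unsigned int first = from >> sd->block_bits;
        unsigned int last  = (to - 1) >> sd->block_bits;  /* 'to' is exclusive */

        bitmap_set(sd->bitmap, first, last - first + 1);
}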
Matthew Wilcox <[email protected]> wrote:
> No, that is very much not the same thing. Look at what NFS does, like
> Linus said. Consider this test program:
>
> fd = open();
> lseek(fd, 5, SEEK_SET);
> write(fd, buf, 3);
> write(fd, buf2, 10);
> write(fd, buf3, 2);
> close(fd);
Yes, I get that. I can do that when there isn't a local cache or content
encryption.
Note that, currently, if the pages (or cache blocks) being read/modified are
beyond the EOF as it stood when the file was opened, truncated down or last
subject to third-party invalidation, I don't go to the server at all.
> > But that kind of screws with local caching. The local cache might need to
> > track the missing bits, and we are likely to be using blocks larger than a
> > page.
>
> There's nothing to cache. Pages which are !Uptodate aren't going to get
> locally cached.
Eh? Of course there is. You've just written some data. That needs to get
copied to the cache as well as the server if that file is supposed to be being
cached (for filesystems that support local caching of files open for writing,
which AFS does).
> > Basically, there are a lot of scenarios where not having fully populated
> > pages sucks. And for streaming writes, wouldn't it be better if you used
> > DIO writes?
>
> DIO can't do sub-512-byte writes.
Yes it can - and it works for my AFS client at least with the patches in my
fscache-iter-2 branch. This is mainly a restriction for block storage devices
we're doing DMA to - but we're typically not doing direct DMA to block storage
devices when talking to a network filesystem.
For AFS, at least, I can just make one big FetchData/StoreData RPC that
reads/writes the entire DIO request in a single op; for other filesystems
(NFS, ceph for example), it needs breaking up into a sequence of RPCs, but
there's no particular reason that I know of that requires it to be 512-byte
aligned on any of these.
Things get more interesting if you're doing DIO to a content-encrypted file
because the block size may be 4096 or even a lot larger - in which case we
would have to do local RMW to handle misaligned writes, but it presents no
particular difficulty.
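The alignment itself is trivial; roughly (power-of-two block size assumed,
names made up):

/* Expand a misaligned write to whole crypto blocks.  Only the first and
 * last blocks need an RMW; everything in between can be encrypted and
 * written straight out.
 */
static void crypto_align_write(loff_t pos, size_t len, size_t blocksize,
                               loff_t *out_pos, size_t *out_len)
{
        loff_t start = round_down(pos, blocksize);
        loff_t end = round_up(pos + len, blocksize);

        *out_pos = start;
        *out_len = end - start;
}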
> You might not be trying to do anything for block filesystems, but we
> should think about what makes sense for block filesystems as well as
> network filesystems.
Whilst that's a good principle, they have very different characteristics that
might make that difficult.
David