2021-08-05 12:34:23

by David Howells

Subject: Could it be made possible to offer "supplementary" data to a DIO write ?

Hi,

I'm working on network filesystem write helpers to go with the read helpers,
and I see situations where I want to write a few bytes to the cache, but have
more data available that could also be written if that would allow the
filesystem/blockdev to optimise its layout.

Say, for example, I need to write a 3-byte change from a page, where that page
is part of a 256K sequence in the pagecache. Currently, I have to round the
3 bytes out to DIO size/alignment, but I could say to the API, for example,
"here's a 256K iterator - I need bytes 225-227 written, but you can write more
if you want to"?

Would it be useful/feasible to have some sort of interface that allows the
offer to be made?

David


2021-08-05 14:00:57

by Matthew Wilcox

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

On Thu, Aug 05, 2021 at 11:19:17AM +0100, David Howells wrote:
> I'm working on network filesystem write helpers to go with the read helpers,
> and I see situations where I want to write a few bytes to the cache, but have
> more data available that could also be written if that would allow the
> filesystem/blockdev to optimise its layout.
>
> Say, for example, I need to write a 3-byte change from a page, where that page
> is part of a 256K sequence in the pagecache. Currently, I have to round the
> 3 bytes out to DIO size/alignment, but I could say to the API, for example,
> "here's a 256K iterator - I need bytes 225-227 written, but you can write more
> if you want to"?

I think you're optimising the wrong thing. No actual storage lets you
write three bytes. You're just pushing the read/modify/write cycle to
the remote end. So you shouldn't even be tracking that three bytes have
been dirtied; you should be working in multiples of i_blocksize().
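
Roughly what I mean, as a sketch (the helper is made up; i_blocksize(),
round_down() and round_up() are the real kernel ones):

/* Expand a dirty byte range to filesystem block boundaries before
 * writeback instead of tracking the raw dirty bytes. */
static void round_dirty_range(struct inode *inode, loff_t *start, loff_t *end)
{
        unsigned int bsize = i_blocksize(inode);   /* 1 << inode->i_blkbits */

        *start = round_down(*start, bsize);
        *end = round_up(*end, bsize);
}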

I don't know of any storage which lets you ask "can I optimise this
further for you by using a larger size". Maybe we have some (software)
compressed storage which could do a better job if given a whole 256kB
block to recompress.

So it feels like you're both tracking dirty data at too fine a
granularity, and getting ahead of actual hardware capabilities by trying
to introduce a too-flexible API.

2021-08-05 14:16:05

by David Howells

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

Matthew Wilcox <[email protected]> wrote:

> > Say, for example, I need to write a 3-byte change from a page, where that
> > page is part of a 256K sequence in the pagecache. Currently, I have to
> > round the 3 bytes out to DIO size/alignment, but I could say to the API,
> > for example, "here's a 256K iterator - I need bytes 225-227 written, but
> > you can write more if you want to"?
>
> I think you're optimising the wrong thing. No actual storage lets you
> write three bytes. You're just pushing the read/modify/write cycle to
> the remote end. So you shouldn't even be tracking that three bytes have
> been dirtied; you should be working in multiples of i_blocksize().

I'm dealing with network filesystems that don't necessarily let you know what
i_blocksize is. Assume it to be 1.

Further, only sending, say, 3 bytes and pushing RMW to the remote end is not
necessarily wrong for a network filesystem for at least two reasons: it
reduces the network loading and it reduces the effects of third-party write
collisions.

> I don't know of any storage which lets you ask "can I optimise this
> further for you by using a larger size". Maybe we have some (software)
> compressed storage which could do a better job if given a whole 256kB
> block to recompress.

It would offer an extent-based filesystem the possibility of adjusting its
extent list. And if you were mad enough to put your cache on a shingled
drive... (though you'd probably need a much bigger block than 256K to make
that useful). Also, jffs2 (if someone used that as a cache) can compress its
blocks.

> So it feels like you're both tracking dirty data at too fine a granularity,
> and getting ahead of actual hardware capabilities by trying to introduce a
> too-flexible API.

We might not know what the h/w caps are and there may be multiple destination
servers with different h/w caps involved. Note that NFS and AFS in the kernel
both currently track at byte granularity and only send the bytes that changed.
The expense of setting up the write op on the server might actually outweigh
the RMW cycle. With something like ceph, the server might actually have a
whole-object RMW/COW, say 4M.

Yet further, if your network fs has byte-range locks/leases and you have a
write lock/lease that ends part way into a page, when you drop that lock/lease
you shouldn't flush any data outside of that range lest you overwrite a range
that someone else has a lock/lease on.

David

2021-08-05 15:09:53

by Matthew Wilcox

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> > If you want to take leases at byte granularity, and then not writeback
> > parts of a page that are outside that lease, feel free. It shouldn't
> > affect how you track dirtiness or how you writethrough the page cache
> > to the disk cache.
>
> Indeed. Handling writes to the local disk cache is different from handling
> writes to the server(s). The cache has a larger block size but I don't have
> to worry about third-party conflicts on it, whereas the server can be taken as
> having no minimum block size, but my write can clash with someone else's.
>
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
>
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be?

If your network protocol doesn't give you a way to ask the server what
size it is, assume 512 bytes and allow it to be overridden by a mount
option.
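
As a sketch of the fallback I mean (the option name is hypothetical):

/* No way to ask the server its block size: assume 512 bytes, but let a
 * "wsize="-style mount option override it, and round writes to this. */
static unsigned int netfs_write_granule(unsigned int mount_wsize)
{
        return mount_wsize ?: 512;
}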

> Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this. What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

Actually, you do. Two situations:

1. Application uses MADV_HUGEPAGE. In response, we create a 2MB
page and mmap it aligned. We use a PMD sized TLB entry and then the
CPU dirties a few bytes with a store. There's no sub-TLB-entry tracking
of dirtiness. It's just the whole 2MB.

2. The bigger the folio, the more writes it will absorb before being
written back. So when you're writing back that 4MB folio, you're not
just servicing this 3 byte write, you're servicing every other write
which hit this 4MB chunk of the file.

There is one exception I've found, and that's O_SYNC writes. These are
pretty rare, and I think I have a solution for them which essentially treats
the page cache as writethrough (for sync writes). We skip marking
the page (folio) as dirty and go straight to marking it as writeback.
We have all the information we need about which bytes to write and we're
actually using the existing page cache infrastructure to do it.
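
In outline it's something like this - only a sketch of the idea, not the
actual iomap code, and the copy/submit helpers here are stand-ins:

/* O_SYNC write treated as writethrough: copy into the page cache, mark
 * the folio writeback (never dirty), push just the affected bytes and
 * wait, so the folio never sits dirty in the cache. */
static int sync_writethrough(struct folio *folio, loff_t pos, size_t len,
                             const void *src)
{
        size_t offset = offset_in_folio(folio, pos);

        copy_into_folio(folio, offset, src, len);  /* stand-in for the copy */
        folio_start_writeback(folio);              /* instead of folio_mark_dirty() */
        submit_folio_range(folio, offset, len);    /* stand-in for the real I/O */
        folio_wait_writeback(folio);               /* O_SYNC: wait for it to land */
        return 0;
}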

I'm working on implementing that in iomap; there are some SMOP-type
problems to solve, but it looks doable.

2021-08-05 15:42:21

by David Howells

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

Matthew Wilcox <[email protected]> wrote:

> > Note that PAGE_SIZE varies across arches and folios are going to
> > exacerbate this. What I don't want to happen is that you read from a
> > file, it creates, say, a 4M (or larger) folio; you change three bytes and
> > then you're forced to write back the entire 4M folio.
>
> Actually, you do. Two situations:
>
> 1. Application uses MADV_HUGEPAGE. In response, we create a 2MB
> page and mmap it aligned. We use a PMD sized TLB entry and then the
> CPU dirties a few bytes with a store. There's no sub-TLB-entry tracking
> of dirtiness. It's just the whole 2MB.

That's a special case. The app specifically asked for it. I'll grant that
with mmap you have to mark a whole page as being dirty - but if you mmapped
it, you need to understand that's what will happen.

> 2. The bigger the folio, the more writes it will absorb before being
> written back. So when you're writing back that 4MB folio, you're not
> just servicing this 3 byte write, you're servicing every other write
> which hit this 4MB chunk of the file.

You can argue it that way - but we already do it bytewise in some filesystems,
so what you want would necessitate a change of behaviour.

Note also that if the page size > max RPC payload size (1MB in NFS, I think),
you have to make multiple write operations to fulfil that writeback; further,
if you have an object-based system you might have to write to multiple
servers to fulfil that writeback, some of which will not actually see any
change.

I wonder if this needs pushing onto the various network filesystem mailing
lists to find out what they want and why.

David

2021-08-05 18:35:09

by Matthew Wilcox

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

On Thu, Aug 05, 2021 at 02:07:03PM +0100, David Howells wrote:
> Matthew Wilcox <[email protected]> wrote:
> > > Say, for example, I need to write a 3-byte change from a page, where that
> > > page is part of a 256K sequence in the pagecache. Currently, I have to
> > > round the 3 bytes out to DIO size/alignment, but I could say to the API,
> > > for example, "here's a 256K iterator - I need bytes 225-227 written, but
> > > you can write more if you want to"?
> >
> > I think you're optimising the wrong thing. No actual storage lets you
> > write three bytes. You're just pushing the read/modify/write cycle to
> > the remote end. So you shouldn't even be tracking that three bytes have
> > been dirtied; you should be working in multiples of i_blocksize().
>
> I'm dealing with network filesystems that don't necessarily let you know what
> i_blocksize is. Assume it to be 1.

That's a really bad idea. The overhead of tracking at byte level
granularity is just not worth it.

> Further, only sending, say, 3 bytes and pushing RMW to the remote end is not
> necessarily wrong for a network filesystem for at least two reasons: it
> reduces the network loading and it reduces the effects of third-party write
> collisions.

You can already get 400Gbit ethernet. Saving 500 bytes by sending
just the 12 bytes that changed is optimising the wrong thing. If you
have two clients accessing the same file at byte granularity, you've
already lost.

> > I don't know of any storage which lets you ask "can I optimise this
> > further for you by using a larger size". Maybe we have some (software)
> > compressed storage which could do a better job if given a whole 256kB
> > block to recompress.
>
> It would offer an extent-based filesystem the possibility of adjusting its
> extent list. And if you were mad enough to put your cache on a shingled
> drive... (though you'd probably need a much bigger block than 256K to make
> that useful). Also, jffs2 (if someone used that as a cache) can compress its
> blocks.

Extent based filesystems create huge extents anyway:

$ /usr/sbin/xfs_bmap *.deb
linux-headers-5.14.0-rc1+_5.14.0-rc1+-1_amd64.deb:
0: [0..16095]: 150008440..150024535
linux-image-5.14.0-rc1+_5.14.0-rc1+-1_amd64.deb:
0: [0..383]: 149991824..149992207
1: [384..103495]: 166567016..166670127
linux-image-5.14.0-rc1+-dbg_5.14.0-rc1+-1_amd64.deb:
0: [0..183]: 149993016..149993199
1: [184..1503623]: 763050936..764554375
linux-libc-dev_5.14.0-rc1+-1_amd64.deb:
0: [0..2311]: 149979624..149981935

This has already happened when you initially wrote to the file backing
the cache. Updates are just going to write to the already-allocated
blocks, unless you've done something utterly inappropriate to the
situation like reflinked the files.

> > So it feels like you're both tracking dirty data at too fine a granularity,
> > and getting ahead of actual hardware capabilities by trying to introduce a
> > too-flexible API.
>
> We might not know what the h/w caps are and there may be multiple destination
> servers with different h/w caps involved. Note that NFS and AFS in the kernel
> both currently track at byte granularity and only send the bytes that changed.
> The expense of setting up the write op on the server might actually outweigh
> the RMW cycle. With something like ceph, the server might actually have a
> whole-object RMW/COW, say 4M.
>
> Yet further, if your network fs has byte-range locks/leases and you have a
> write lock/lease that ends part way into a page, when you drop that lock/lease
> you shouldn't flush any data outside of that range lest you overwrite a range
> that someone else has a lock/lease on.

If you want to take leases at byte granularity, and then not writeback
parts of a page that are outside that lease, feel free. It shouldn't
affect how you track dirtiness or how you writethrough the page cache
to the disk cache.

2021-08-05 18:38:20

by David Howells

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

Matthew Wilcox <[email protected]> wrote:

> You can already get 400Gbit ethernet.

Sorry, but that's not likely to become relevant any time soon. Besides, my
laptop's wifi doesn't really do that yet.

> Saving 500 bytes by sending just the 12 bytes that changed is optimising the
> wrong thing.

In one sense, at least, you're correct. The cost of setting up an RPC to do
the write and setting up crypto is high compared with the difference between
transmitting 3 bytes and 4k bytes.

> If you have two clients accessing the same file at byte granularity, you've
> already lost.

Doesn't stop people doing it, though. People have sqlite, dbm, mail stores,
whatever in their homedirs from the desktop environments. Granted, most of the
time people don't log in twice with the same homedir from two different
machines (and it doesn't - or at least didn't - work with Gnome or KDE).

> Extent based filesystems create huge extents anyway:

Okay, so it's not feasible. That's fine.

> This has already happened when you initially wrote to the file backing
> the cache. Updates are just going to write to the already-allocated
> blocks, unless you've done something utterly inappropriate to the
> situation like reflinked the files.

Or the file is being read random-access and we now have a block we didn't have
before that is contiguous to another block we already have.

> If you want to take leases at byte granularity, and then not writeback
> parts of a page that are outside that lease, feel free. It shouldn't
> affect how you track dirtiness or how you writethrough the page cache
> to the disk cache.

Indeed. Handling writes to the local disk cache is different from handling
writes to the server(s). The cache has a larger block size but I don't have
to worry about third-party conflicts on it, whereas the server can be taken as
having no minimum block size, but my write can clash with someone else's.

Generally, I prefer to write back the minimum I can get away with (as does the
Linux NFS client AFAICT).

However, if everyone agrees that we should only ever write back a multiple of
a certain block size, even to network filesystems, what block size should that
be? Note that PAGE_SIZE varies across arches and folios are going to
exacerbate this. What I don't want to happen is that you read from a file, it
creates, say, a 4M (or larger) folio; you change three bytes and then you're
forced to write back the entire 4M folio.

Note that when content crypto or compression is employed, some multiple of the
size of the encrypted/compressed blocks would be a requirement.

David

2021-08-06 00:18:51

by Adam Borowski

Subject: Re: Could it be made possible to offer "supplementary" data to a DIO write ?

On Thu, Aug 05, 2021 at 03:38:01PM +0100, David Howells wrote:
> Generally, I prefer to write back the minimum I can get away with (as does the
> Linux NFS client AFAICT).
>
> However, if everyone agrees that we should only ever write back a multiple of
> a certain block size, even to network filesystems, what block size should that
> be? Note that PAGE_SIZE varies across arches and folios are going to
> exacerbate this. What I don't want to happen is that you read from a file, it
> creates, say, a 4M (or larger) folio; you change three bytes and then you're
> forced to write back the entire 4M folio.

grep . /sys/class/block/*/queue/minimum_io_size
and also hw_sector_size, logical_block_size, physical_block_size.

The data seems suspect to me, though. I get 4096 for a spinner (looks
sane), 512 for nvme (less than page size), and 4096 for pmem (I'd expect
cacheline or ECC block).


Meow!
--
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄⠀⠀⠀⠀