LinuxLists.cc - Re: [LSF/MM/BPF TOPIC] untorn buffered writes

2024-05-15 19:55:17

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 27/02/2024 23:12, Theodore Ts'o wrote:
> Last year, I talked about an interest in providing databases such as
> MySQL with the ability to issue writes that would not be torn as they
> write 16k database pages[1].
>
> [1] https://lwn.net/Articles/932900/
>

After discussing this topic earlier this week, I would like to know if
there are still objections or concerns with the untorn-writes userspace
API proposed in
https://lore.kernel.org/linux-block/[email protected]/

I feel that the series for supporting direct-IO only, above, is stuck
because of this topic of buffered IO.

So I sent an RFC for buffered untorn-writes last month in
https://lore.kernel.org/linux-fsdevel/[email protected]/,
which did leverage the bs > ps effort. Maybe it did not get noticed due
to being an RFC. It works on the following principles:

- A buffered atomic write requires RWF_ATOMIC flag be set, same as
direct IO. The same other atomic writes rules apply.
- For an inode, only a single size of buffered write is allowed. So for
statx, atomic_write_unit_min = atomic_write_unit_max always for
buffered atomic writes.
- A single folio maps to an atomic write in the pagecache. So inode
address_space folio min order = max order = atomic_write_unit_min/max
- A folio is tagged as "atomic" when atomically written and written back
to storage "atomically", same as direct-IO method would do for an
atomic write.
- If userspace wants to guarantee a buffered atomic write is written to
storage atomically after the write syscall returns, it must use
RWF_SYNC or similar (along with RWF_ATOMIC).
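
The principles above can be modelled with a small Python sketch. The function names, the 4 KiB page size, and the exact validity rule (one naturally aligned unit per write) are assumptions made for illustration; they are not taken from the RFC itself.

```python
PAGE_SIZE = 4096  # assumed base page size, for illustration only


def folio_order(atomic_write_unit, page_size=PAGE_SIZE):
    """Folio order such that one folio covers exactly one atomic write unit."""
    pages, rem = divmod(atomic_write_unit, page_size)
    # The unit must be a power-of-two number of whole pages.
    if rem or pages < 1 or pages & (pages - 1):
        raise ValueError("unit must be a power-of-two multiple of the page size")
    return pages.bit_length() - 1


def buffered_atomic_write_ok(offset, length, unit):
    """Assumed rule: with unit_min == unit_max, each RWF_ATOMIC buffered
    write must be exactly one naturally aligned unit."""
    return length == unit and offset % unit == 0
```

With these assumptions, a 16k unit on 4k pages maps to an order-2 folio, and a 16k write at offset 8k would be rejected as misaligned.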

This is all along the lines of what I described on Monday.

There are no concrete semantics for buffered untorn-writes ATM - like
mixing RWF_ATOMIC write with non-RWF_ATOMIC writes in the pagecache -
but I don't think that this needs to be formalized yet. Or, if it really
does, let me know.

There was also talk in the "limits of buffered IO.. " session - as I
understand - that RWF_ATOMIC for buffered IO should be writethrough. If
anyone wants to discuss that further or describe that issue, then please do.

Anyway, I plan to push the direct IO series for merging in the next
cycle, so let me know what else we need to discuss and reach a conclusion on.

> There is a patch set being worked on by John Garry which provides
> stronger guarantees than what is actually required for this use case,
> called "atomic writes". The proposed interface for this facility
> involves passing a new flag to pwritev2(2), RWF_ATOMIC, which requests
> that the specific write be written to the storage device in an
> all-or-nothing fashion, and if it can not be guaranteed, that the
> write should fail. In this interface, if userspace sends a 128k
> write with the RWF_ATOMIC flag, and the storage device supports
> an all-or-nothing write with the given size and alignment, the
> kernel will guarantee that it will be sent as a single 128k request
> --- although from the database perspective, if it is using 16k
> database pages, it only needs to guarantee that if the write is torn,
> it only happens on a 16k boundary. That is, if the write is split into
> 32k and 96k request, that would be totally fine as far as the database
> is concerned --- and so the RWF_ATOMIC interface is a stronger
> guarantee than what might be needed.
>
> So far, the "atomic write" patchset has only focused on Direct I/O,
> where this stronger guarantee is mostly harmless, even if it is
> unneeded for the original motivating use case. Which might be OK,
> since perhaps there might be other future use cases where they might
> want some 32k writes to be "atomic", while other 128k writes might
> want to be "atomic" (that is to say, persisted with all-or-nothing
> semantics), and the proposed RWF_ATOMIC interface might permit that
> --- even though no one can seem to come up with a credible use case
> that would require this.
>
>
> However, this proposed interface is highly problematic when it comes
> to buffered writes, and the Postgres database uses buffered, not direct
> I/O writes. Suppose the database performs a 16k write, followed by a
> 64k write, followed by a 128k write --- and these writes are done
> using a file descriptor that does not have O_DIRECT enabled, and let's
> suppose they are written using the proposed RWF_ATOMIC flag. In
> order to provide the (stronger than we need) RWF_ATOMIC guarantee, the
> kernel would need to store the fact that certain pages in the page
> cache were dirtied as part of a 16k RWF_ATOMIC write, and other pages
> were dirtied as part of a 64k RWF_ATOMIC write, etc, so that the
> writeback code knows what "atomic" guarantee was made at
> write time. This very quickly becomes a mess.
>
> Another interface that would be much simpler to implement for buffered
> writes would be one where the untorn write granularity is set on a per-file
> descriptor basis, using fcntl(2). We validate whether the untorn
> write granularity is one that can be supported when fcntl(2) is
> called, and we also store in the inode the largest untorn write
> granularity that has been requested by a file descriptor for that
> inode. (When the last file descriptor opened for writing has been
> closed, the largest untorn write granularity for that inode can be set
> back down to zero.)
>
> The write(2) system call will check whether the size and alignment of
> the write are valid given the requested untorn write granularity. And
> in the writeback path, the writeback will detect if there are
> contiguous (aligned) dirty pages, and make sure they are sent to the
> storage device in multiples of the largest requested untorn write
> granularity. This provides only the guarantees required by databases,
> and obviates the need to track which pages were dirtied by an
> RWF_ATOMIC flag, and the size of the RWF_ATOMIC write.
>
> I'd like to discuss at LSF/MM what the best interface would be for
> buffered, untorn writes (I am deliberately avoiding the use of the
> word "atomic" since that presumes stronger guarantees than what we
> need, and because it has led to confusion in previous discussions),
> and what might be needed to support it.
>
> - Ted
>
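
The writeback rule in the quoted fcntl(2)-based proposal can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, page indexes stand in for page-cache state, and a partially dirty granule is simply written out whole (on the assumption that the write-time size and alignment check means pages are dirtied a full granule at a time).

```python
def writeback_chunks(dirty_pages, granule_pages):
    """Group dirty page indexes into (start_page, n_pages) requests that
    consist only of whole, aligned granules."""
    # Map each dirty page to the aligned granule that contains it.
    granules = sorted({p // granule_pages for p in dirty_pages})
    chunks = []
    for g in granules:
        start = g * granule_pages
        if chunks and chunks[-1][0] + chunks[-1][1] == start:
            # Contiguous with the previous request: extend it by one granule.
            chunks[-1] = (chunks[-1][0], chunks[-1][1] + granule_pages)
        else:
            chunks.append((start, granule_pages))
    return chunks
```

Every request produced this way starts on a granule boundary and has a length that is a multiple of the granule, so a tear between requests can only fall on a granule boundary, and no per-page record of the original write size is needed.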

2024-05-23 01:37:19

by Luis Chamberlain

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
> On 27/02/2024 23:12, Theodore Ts'o wrote:
> > Last year, I talked about an interest in providing databases such as
> > MySQL with the ability to issue writes that would not be torn as they
> > write 16k database pages[1].
> >
> > [1] https://lwn.net/Articles/932900/
> >
>
> After discussing this topic earlier this week, I would like to know if there
> are still objections or concerns with the untorn-writes userspace API
> proposed in https://lore.kernel.org/linux-block/[email protected]/
>
> I feel that the series for supporting direct-IO only, above, is stuck
> because of this topic of buffered IO.

I think it was good we had the discussions at LSFMM over it. I
personally don't perceive it as stuck; however, without any consensus
being made obvious or written down anywhere, it would not be clear to
anyone that we did reach any consensus at all. Hope is that lwn captures
any consensus, if any was indeed reached, as you're not making it clear
any was.

In case it helps, as we did with the LBS effort it may also be useful to
put together bi-monthly cabals to follow up progress, and divide and
conquer any pending work items.

> So I sent an RFC for buffered untorn-writes last month in https://lore.kernel.org/linux-fsdevel/[email protected]/,
> which did leverage the bs > ps effort. Maybe it did not get noticed due to
> being an RFC. It works on the following principles:
>
> - A buffered atomic write requires RWF_ATOMIC flag be set, same as
> direct IO. The same other atomic writes rules apply.
> - For an inode, only a single size of buffered write is allowed. So for
> statx, atomic_write_unit_min = atomic_write_unit_max always for
> buffered atomic writes.
> - A single folio maps to an atomic write in the pagecache. So inode
> address_space folio min order = max order = atomic_write_unit_min/max
> - A folio is tagged as "atomic" when atomically written and written back
> to storage "atomically", same as direct-IO method would do for an
> atomic write.
> - If userspace wants to guarantee a buffered atomic write is written to
> storage atomically after the write syscall returns, it must use
> RWF_SYNC or similar (along with RWF_ATOMIC).

From my perspective the above just needs the IOCB atomic support, and
the pending long term work item there is the near-write-through buffered
IO support. We could just wait for buffered-IO support until we have
support for that. I can't think of anything blocking DIO support though,
now that we at least have a mental model of how buffered IO *should*
work.

What about testing? Are you extending fstests, blktests?

Luis

2024-05-23 12:00:49

by John Garry

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 22/05/2024 22:56, Luis Chamberlain wrote:
> On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
>> On 27/02/2024 23:12, Theodore Ts'o wrote:
>>> Last year, I talked about an interest in providing databases such as
>>> MySQL with the ability to issue writes that would not be torn as they
>>> write 16k database pages[1].
>>>
>>> [1] https://lwn.net/Articles/932900/
>>>
>>
>> After discussing this topic earlier this week, I would like to know if there
>> are still objections or concerns with the untorn-writes userspace API
>> proposed in https://lore.kernel.org/linux-block/[email protected]/
>>
>> I feel that the series for supporting direct-IO only, above, is stuck
>> because of this topic of buffered IO.
>
> I think it was good we had the discussions at LSFMM over it. I
> personally don't perceive it as stuck; however, without any consensus
> being made obvious or written down anywhere, it would not be clear to
> anyone that we did reach any consensus at all.

> Hope is that lwn captures any
> consensus, if any was indeed reached, as you're not making it clear any
> was.

That's my point really. There was some positive discussion. I put
across the idea of implementing buffered atomic writes, and now I want
to ensure that everyone is satisfied with that going forward. I think
that an LWN report is now being written.

>
> In case it helps, as we did with the LBS effort it may also be useful to
> put together bi-monthly cabals to follow up progress, and divide and
> conquer any pending work items.

ok, we can consider that.

>
>> So I sent an RFC for buffered untorn-writes last month in https://lore.kernel.org/linux-fsdevel/[email protected]/,
>> which did leverage the bs > ps effort. Maybe it did not get noticed due to
>> being an RFC. It works on the following principles:
>>
>> - A buffered atomic write requires RWF_ATOMIC flag be set, same as
>> direct IO. The same other atomic writes rules apply.
>> - For an inode, only a single size of buffered write is allowed. So for
>> statx, atomic_write_unit_min = atomic_write_unit_max always for
>> buffered atomic writes.
>> - A single folio maps to an atomic write in the pagecache. So inode
>> address_space folio min order = max order = atomic_write_unit_min/max
>> - A folio is tagged as "atomic" when atomically written and written back
>> to storage "atomically", same as direct-IO method would do for an
>> atomic write.
>> - If userspace wants to guarantee a buffered atomic write is written to
>> storage atomically after the write syscall returns, it must use
>> RWF_SYNC or similar (along with RWF_ATOMIC).
>
> From my perspective the above just needs the IOCB atomic support, and
> the pending long term work item there is the near-write-through buffered
> IO support. We could just wait for buffered-IO support until we have
> support for that. I can't think of anything blocking DIO support though,
> now that we at least have a mental model of how buffered IO *should*
> work.

Yes, these are my thoughts as well.

>
> What about testing? Are you extending fstests, blktests?

Yes, so 3 things to mention here:

- We have been looking at adding full test coverage in xfstests.
Catherine Hoang recently started working on this. Most tests will
actually cover the forcealign feature. Indeed, just atomic writes
support testing would be quite limited when compared to forcealign
testing. Furthermore we are also looking at forcealign and atomic writes
testing in fsx.c, as finding forcealign corner cases would be quite
limited with the formalized tests.

- For blktests, we were going to add some basic atomic writes tests
there, like ensuring that misaligned or mis-sized writes are rejected.
This would be the same really for xfstests, above. I don't think that
there are so many tests which we can cover. scsi_debug will support
atomic writes, which can be used for blktests.

- I have done some limited power-fail testing for my NVMe card.

I have 2x challenges here:
- My host does not allow the card port to be manually powered down, so I
need to physically plug out the power cable to test :(
- My NVMe card only supports 4KB power-fail atomic writes, which is
quite small.

The actual power-fail testing involves using fio in verify mode. In
that, each data block has a CRC written per test loop. I just verify
that the CRCs are valid after the power cycle (which they are when block
size is 4KB and lower :)).
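
The per-block CRC idea behind this kind of verification can be sketched in Python. This is not fio's actual on-disk verify format; the 4-byte CRC32 header layout and the function names are assumptions chosen to keep the example short.

```python
import struct
import zlib

HEADER = struct.Struct("<I")  # assumed layout: 4-byte little-endian CRC32 first


def make_block(payload, block_size):
    """Build one block: CRC32 header followed by the zero-padded payload."""
    body = payload.ljust(block_size - HEADER.size, b"\0")
    if len(body) != block_size - HEADER.size:
        raise ValueError("payload too large for block")
    return HEADER.pack(zlib.crc32(body)) + body


def torn_blocks(image, block_size):
    """Return indexes of blocks whose stored CRC does not match their data.
    Assumes len(image) is a multiple of block_size."""
    bad = []
    for i in range(0, len(image), block_size):
        block = image[i:i + block_size]
        (stored,) = HEADER.unpack_from(block)
        if stored != zlib.crc32(block[HEADER.size:]):
            bad.append(i // block_size)
    return bad
```

A block that is half new data and half old data fails the check, while a block that is entirely old or entirely new passes, which is the all-or-nothing property being tested after the power cycle.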

Thanks,
John

2024-05-23 19:06:45

by Christoph Hellwig

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

2024-05-28 09:22:01

by John Garry

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 23/05/2024 13:59, Christoph Hellwig wrote:
> On Wed, May 15, 2024 at 01:54:39PM -0600, John Garry wrote:
>> On 27/02/2024 23:12, Theodore Ts'o wrote:
>>> Last year, I talked about an interest in providing databases such as
>>> MySQL with the ability to issue writes that would not be torn as they
>>> write 16k database pages[1].
>>>
>>> [1] https://lwn.net/Articles/932900/
>>>
>>
>> After discussing this topic earlier this week, I would like to know if there
>> are still objections or concerns with the untorn-writes userspace API
>> proposed in https://lore.kernel.org/linux-block/[email protected]/
>>
>> I feel that the series for supporting direct-IO only, above, is stuck
>> because of this topic of buffered IO.
>
> Just my 2 cents, but I think supporting untorn I/O for buffered I/O
> is an amazingly bad idea that opens up a whole can of worms in terms
> of potential failure paths while not actually having a convincing use
> case.
>
> For buffered I/O something like the atomic msync proposal makes a lot
> more sense, because it actually provides a useful API for non-trivial
> transactions.

Is this what you are talking about:

https://web.eecs.umich.edu/~tpkelly/papers/Failure_atomic_msync_EuroSys_2013.pdf

If so, I am not sure if a mmap interface would work for DB usecase, like
PostgreSQL. I can ask.

2024-05-28 13:57:48

by Christoph Hellwig

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On Tue, May 28, 2024 at 10:21:15AM +0100, John Garry wrote:
> If so, I am not sure if a mmap interface would work for DB usecase, like
> PostgreSQL. I can ask.

Databases really should be using direct I/O for various reasons. And if
Postgres still isn't doing that we shouldn't work around that in the
kernel.

2024-06-01 09:34:27

by Theodore Ts'o

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On Thu, May 23, 2024 at 12:59:57PM +0100, John Garry wrote:
>
> That's my point really. There was some positive discussion. I put across
> the idea of implementing buffered atomic writes, and now I want to ensure
> that everyone is satisfied with that going forward. I think that an LWN
> report is now being written.

I checked in with some PostgreSQL developers after LSF/MM, and
unfortunately, the idea of immediately sending atomic buffered I/O
directly to the storage device is going to be problematic for them.
The problem is that they depend on the kernel to coalesce writes for
them. So if they are doing a large database commit that involves
touching hundreds or thousands of 16k database pages, they today issue
a separate buffered write request for each database page. So if we
turn each one into an immediate SCSI/NVMe write request, that would be
disastrous for performance. Yes, when they migrate to using Direct
I/O, the database is going to have to figure out how to coalesce write
requests; but this is why it's going to take at least 3 years to make
this migration (and some will call this hopelessly optimistic), and
then users will probably wait another 3 to 5 years before they trust
that the database rewrite to use Direct I/O will get it right and
trust their enterprise workloads to it....
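
To make the coalescing point concrete, here is a minimal Python sketch of merging per-page writes into larger extents before they are issued. It is an illustration only: the function name is hypothetical, and it models neither the page cache nor any particular database.

```python
def coalesce(writes):
    """Merge (offset, length) writes that touch or overlap into larger extents."""
    extents = []
    for off, length in sorted(writes):
        if extents and off <= extents[-1][0] + extents[-1][1]:
            # Adjacent to or overlapping the previous extent: grow it.
            start, cur_len = extents[-1]
            extents[-1] = (start, max(cur_len, off + length - start))
        else:
            extents.append((off, length))
    return extents
```

For example, 16k page writes at offsets 0, 16k, 32k and 64k collapse into two requests, one of 48k and one of 16k, instead of four separate ones.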

So I think this goes back to either (a) trying to track which writes
we've promised atomic write semantics, or (b) using a completely
different API that only promises "untorn writes with a specified
granularity" for the untorn buffered writes I/O interface,
in addition to, or instead of, the current "atomic write"
interface which we are currently trying to promulgate for Direct I/O.
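
The weaker guarantee in option (b) can be stated as a simple check, sketched here in Python. The function name and the representation of a write as a list of request lengths are assumptions for illustration.

```python
def tears_only_on_boundary(request_lengths, start_offset, granularity):
    """True if every boundary between consecutive requests falls on a
    multiple of the untorn write granularity."""
    pos = start_offset
    for length in request_lengths[:-1]:
        pos += length
        if pos % granularity:
            return False
    return True
```

Under this check, a 128k write split into 32k and 96k requests is acceptable for 16k database pages, whereas a split into 24k and 104k is not; the stricter "atomic" interface would reject both splits.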

Personally, I'd advocate for two separate interfaces; one for "atomic"
I/O's, and a different one for "untorn writes with a specified
guaranteed granularity". And if XFS folks want to turn the atomic I/O
interface into something where you can do a multi-megabyte atomic
write into something that requires allocating new blocks and
atomically mutating the file system metadata to do this kind of
atomicity --- even though the Database folks Don't Care --- God bless.

But let's have something which *just* promises the guarantee requested
by the primary requesters of this interface, at least for the
buffered I/O case.

Cheers,

- Ted

2024-06-11 15:27:45

by John Garry

Subject: Re: [LSF/MM/BPF TOPIC] untorn buffered writes

On 01/06/2024 10:33, Theodore Ts'o wrote:
> On Thu, May 23, 2024 at 12:59:57PM +0100, John Garry wrote:
>>
>> That's my point really. There was some positive discussion. I put across
>> the idea of implementing buffered atomic writes, and now I want to ensure
>> that everyone is satisfied with that going forward. I think that an LWN
>> report is now being written.
>
> I checked in with some PostgreSQL developers after LSF/MM, and
> unfortunately, the idea of immediately sending atomic buffered I/O
> directly to the storage device is going to be problematic for them.

This was not my idea (for supporting buffered atomic writes).

As I remember, that was a candidate solution for dealing with the
problem that is how to tag a buffered write as atomic. Or deal with
overlapping atomic writes. And that solution is to just write through,
so we don't need to remember if it was atomic.

For performance reasons, I was not keen on that, and prefer the solution
I already mentioned earlier.

> The problem is that they depend on the kernel to coalesce writes for
> them. So if they are doing a large database commit that involves
> touching hundreds or thousands of 16k database pages, they today issue
> a separate buffered write request for each database page. So if we
> turn each one into an immediate SCSI/NVMe write request, that would be
> disastrous for performance.

FWIW, atomic writes support merging in the block layer.

But, that aside, IMHO, talking about performance like this is close to
speculation.

> Yes, when they migrate to using Direct
> I/O, the database is going to have to figure out how to coalesce write
> requests; but this is why it's going to take at least 3 years to make
> this migration (and some will call this hopelessly optimistic), and
> then users will probably wait another 3 to 5 years before they trust
> that the database rewrite to use Direct I/O will get it right and
> trust their enterprise workloads to it....
>
> So I think this goes back to either (a) trying to track which writes
> we've promised atomic write semantics, or (b) using a completely
> different API that only promises "untorn writes with a specified
> granularity" for the untorn buffered writes I/O interface,
> in addition to, or instead of, the current "atomic write"
> interface which we are currently trying to promulgate for Direct I/O.
>
> Personally, I'd advocate for two separate interfaces; one for "atomic"
> I/O's, and a different one for "untorn writes with a specified
> guaranteed granularity". And if XFS folks want to turn the atomic I/O
> interface into something where you can do a multi-megabyte atomic
> write into something that requires allocating new blocks and
> atomically mutating the file system metadata to do this kind of
> atomicity --- even though the Database folks Don't Care --- God bless.

At this stage, if people want buffered atomic writes support for
PostgreSQL - and are not prepared to wait for or help with direct io support
for that DB - then they need to design/extend a kernel API, implement
that, and then port PostgreSQL. Then the performance figures can be
seen. And then try to upstream kernel support.

We have already done such a thing for MySQL for direct IO. We know that
the performance is good, and we want to support it in the kernel today.

>
> But let's have something which *just* promises the guarantee requested
> by the primary requesters of this interface, at least for the
> buffered I/O case.
>

I think that you need to decide whether you want to endorse our direct IO
support today (and give acked-by or similar), or .. live with probably
no support for any sort of atomic writes in the kernel...

Thanks,
John