At last year's LSFMM we learned through Ted Ts'o about the interest by
cloud providers in large atomics [0]. It is a good example where cloud
providers innovated in an area perhaps before storage vendors were
providing hardware support for such features. An example use case was
databases. In short, with large atomics databases can disable their own
version of journaling to increase TPS. Large atomics let you disable things
like MySQL innodb_doublewrite. The feature that allows you to disable this and
use large atomics is known as torn write prevention [1]. At least for MySQL
the default page size for the database (used for columns) is 16k, and so
enabling, for example, a 16k atomic lets you take advantage of this. It was
also mentioned that PostgreSQL only supports buffered-IO, and so it would be
desirable for a solution to support buffered-IO with large atomics as well.
The way cloud providers enable torn write prevention is by using direct IO.

John Garry has been working on adding an API for atomic writes; it would
seem some folks refer to this as the no-tears atomic API. It consists of
two parts, one for the block layer [2] and another set of changes for
XFS [3]. It enables Direct IO support with large atomics. It includes
a userspace API which lets you peg a FS_XFLAG_ATOMICWRITES flag onto a
file, and you then create an XFS filesystem using the XFS realtime
subvolume with an extent alignment. The current user of this API
seems to be SCSI, but obviously this can grow to support others. A neat
feature of this effort is you can have two separate directories with
separate alignment requirements. There is no generic filesystem solution
yet.

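To make the constraints of such an API concrete, here is a minimal sketch in C
of the check an application might run before issuing an untorn write. It
assumes the rules as described in the proposed series: the length is a power
of two, it lies between a minimum and a maximum atomic write unit, and the
offset is naturally aligned to the length. The helper name atomic_write_ok()
and its parameters are hypothetical and are not part of any kernel or libc
API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if x is a non-zero power of two. */
static inline bool is_pow2(uint64_t x)
{
	return x != 0 && (x & (x - 1)) == 0;
}

/*
 * Hypothetical helper: would a write of `len` bytes at file offset `pos`
 * qualify as a single untorn write, given the minimum and maximum atomic
 * write units advertised for the file? Both units are assumed to be
 * powers of two.
 */
bool atomic_write_ok(uint64_t pos, uint64_t len,
		     uint64_t unit_min, uint64_t unit_max)
{
	assert(is_pow2(unit_min) && is_pow2(unit_max));

	if (!is_pow2(len))
		return false;		/* length must be a power of two */
	if (len < unit_min || len > unit_max)
		return false;		/* length must be within the units */
	return (pos & (len - 1)) == 0;	/* offset naturally aligned to len */
}
```

With a 4k minimum and a 64k maximum unit, a 16k write at offset 16k passes
this check, while the same write at offset 4k is rejected because the offset
is not aligned to the write length.
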
Meanwhile we're now at a v2 RFC for LBS support [4]. Although the LBS
effort was originally completely orthogonal to large atomics, it
would seem there is now a direct relationship here worth discussing.
In short, LBS enables buffered-IO large atomic support if the hardware
supports it. We get alignment constraints guaranteed, and we also ensure
we use contiguous memory for the IOs for DMA, since it is built on large
folios. We expect NVMe drives which support large atomics can
easily profit from this without any userspace modification other than
when you create the filesystem.

We reviewed the possible intersection of both efforts at our last LBS cabal
with LBS interested folks and Martin Petersen and John Garry. It is somewhat
unclear exactly how to follow up on some aspects of the no-tear API [5]
but there was agreement about the possible intersection of both efforts,
and that we should discuss this at LSFMM. The goal would be to try to reach
consensus on how the no-tear API and LBS could help those
interested in leveraging large atomics.

Some things to evaluate or for us to discuss:

* no-tear API:
  - allows directories to have separate alignment requirements
    - this might be useful for folks who want to use large IOs with
      large atomics for some workloads but smaller IOs for another
      directory on the same drive. Is this a viable option for
      users who want large atomics but are concerned about being forced
      to use only large writes with LBS?
  - statx is modified to display new alignment considerations
  - atomics are power of 2
  - there seems to be some interest in supporting a no-hardware-accel atomic
    solution, so a software implemented atomic solution. Could someone
    clarify if that's accurate? How is the double write avoided? What are
    the use cases? Do databases use that today?
  - How do we generalize a solution per file? Would extending a min
    order per file be desirable? Is that even tenable?
* LBS:
  - stat will return the block size set, so userspace applications
    using stat / statx will use the larger block size to ensure
    alignment
  - a drive with support for a large atomic but supporting smaller
    logical block sizes will still allow writes to the logical block
    size. If a block driver has a "preference" (in NVMe this would
    be the NPWG for the IU) to write above the logical block size,
    do we want the option to lift the logical block size? In
    retrospect I don't think this is needed given Jan Kara's patches
    to prevent writes to mounted devices [4]. That should
    ensure that if a filesystem takes advantage of a larger physical
    block size and creates a filesystem with it as the sector size,
    userspace won't be mucking around with smaller IOs to the drive
    while it is mounted. But are there any applications which would
    get the block device logical block size instead for DIO?
  - LBS is transparent to userspace applications
  - We've verified *most* IOs are aligned if you use a 16k block size
    but a smaller sector size; the smaller IOs were verified to come
    from the XFS buffer cache. If your drive supports a large atomic
    you can avoid these, as you can lift the sector size since the
    physical block size will be larger than the logical block size.
    For NVMe today this is possible for drives with a large
    NPWG (the IU) and NAWUPF (the large atomic), for example.

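As an illustration of the first LBS point, the sketch below shows how a
userspace application might pick up the block size through stat() and size
its writes accordingly. It is a minimal example in C, not code from the LBS
series; the helper names preferred_io_size() and round_up_to_block() are
hypothetical, and it assumes the reported block size is a power of two.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

/* Return the block size that stat() reports for `path`, or 0 on error. */
uint64_t preferred_io_size(const char *path)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return 0;
	return (uint64_t)st.st_blksize;
}

/* Round `len` up to a multiple of `blksz`; blksz must be a power of two. */
uint64_t round_up_to_block(uint64_t len, uint64_t blksz)
{
	assert(blksz != 0 && (blksz & (blksz - 1)) == 0);
	return (len + blksz - 1) & ~(blksz - 1);
}
```

An application that already sizes its IO from st_blksize would then issue
16k-multiple writes on a filesystem created with a 16k block size, without
any change to the application itself.
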
Tooling:

  - Both efforts stand to gain from a shared verification set of tools
    for alignment and atomic use
  - We have a block layer eBPF alignment tool written by Daniel Gomez [6];
    however, there is a lack of interested parties to help review a simpler
    version of this tool so we can merge it [7]. We could benefit from more
    eyeballs from experienced eBPF / block layer folks.
  - More advanced tools are typically not encouraged, and this leaves us
    wondering what a better home would be other than side forks
  - Other than preventing torn writes, do users of the no-tear API care
    about WAF? While we have a tool for NVMe WAF [8], would
    collaborating on a generic tool be of interest?

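For readers less familiar with WAF (write amplification factor, the ratio of
bytes written to the media to bytes written by the host), here is a rough
sketch in C of how a generic tool could estimate it from observed writes. The
model is an assumption made for illustration: the device is taken to rewrite
whole indirection units (IU) for every host write. It is not a description
of how the tool in [8] works, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes the device would write internally for one host write, under the
 * assumed model that it always rewrites whole indirection units.
 * `iu` is the indirection unit size in bytes and must be a power of two.
 */
uint64_t media_bytes_for_write(uint64_t offset, uint64_t len, uint64_t iu)
{
	uint64_t start, end;

	assert(iu != 0 && (iu & (iu - 1)) == 0);
	if (len == 0)
		return 0;
	start = offset & ~(iu - 1);			/* round down to IU */
	end = (offset + len + iu - 1) & ~(iu - 1);	/* round up to IU */
	return end - start;
}

/* WAF = media bytes written / host bytes written (0.0 if no host writes). */
double waf(uint64_t media_bytes, uint64_t host_bytes)
{
	return host_bytes ? (double)media_bytes / (double)host_bytes : 0.0;
}
```

Under this model a 4k write to a drive with a 16k IU costs 16k on the media,
a WAF of 4, while an aligned 16k write has a WAF of 1. A 16k write that
starts 8k into an IU touches two units and so has a WAF of 2.
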
Any other things folks want to get out of this as a session, provided
there is interest?
[0] https://lwn.net/Articles/932900/
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/storage-twp.html
[2] https://lore.kernel.org/linux-nvme/[email protected]/T/#m4ad28b480a8e12eb51467e17208d98ca50041ff2
[3] https://lore.kernel.org/all/[email protected]/
[4] https://lore.kernel.org/all/[email protected]/T/#u
[5] https://lkml.kernel.org/r/[email protected]
[6] https://github.com/dagmcr/bcc/tree/blkalgn-dump
[7] https://github.com/iovisor/bcc/pull/4813
[8] https://github.com/dagmcr/bcc/tree/nvmeiuwaf
Luis
Luis Chamberlain <[email protected]> writes:
> * no-tear API:
> - allows directories to have separate alignment requirements
> - this might be useful for folks who want to use large IOs with
> large atomics for some workloads but smaller IOs for another
> directory on the same drive. Is this a viable option to some
> users for large atomics with concerns of being forced to use
> only large writes with LBS?
> - statx is modified to display new alignment considerations
> - atomics are power of 2
> - there seems to be some interest in supporting no-hardware-accel atomic
> solution, so a software implemented atomic solution, could someone
> clarify if that's accurate? How is the double write avoided? What are
> the use cases? Do databases use that today?
> - How do we generalize a solution per file? Would extending a min
> order per file be desirable? Is that even tenable?
I would also be interested in this discussion. For example, let's also try
to bring the points below into the agenda:

1. Like LBS, for systems with a large page size of 64k (PowerPC and ARM), we
   should already be able to utilize untorn/atomic writes if the filesystem
   is formatted with a given blocksize (for DIO at least).
   I think we need not even use bigalloc in such a case for ext4.
   So what does it take from Linux filesystems to expose an interface
   to users such that they can start utilizing it?
   (Now this has a catch that the FS still needs to be formatted with a
   given blocksize to utilize untorn writes.)
2. What do others think about adding an O_ATOMIC interface similar to O_DIRECT
   such that applications don't need many changes? We should still have
   RWF_ATOMIC for pwrites, but for open/read/write calls an O_ATOMIC will
   be useful too.
3. Buffered-io is important for Postgres and I have been looking into it
   from the perspective of supporting untorn writes for buffered-io as
   well. It will again maybe be easier to start off with 64k pagesize
   systems or by utilizing large folio support. This way we have less work
   in managing multiple pages which need to be written atomically.
4. We already have an RFC for ext4 multiblock code to support aligned
   allocations which can be used to plug in untorn Direct-io write support
   to ext4 [1]

[1]: https://lore.kernel.org/linux-ext4/[email protected]/
On Thu, Feb 22, 2024 at 01:59:32PM -0800, Luis Chamberlain wrote:
> Meanwhile we're now at a v2 RFC for LBS support [4]. Although the LBS
> effort originally was a completely orthogonal effort to large atomics, it
> would seem there is a direct relationship here now worth discussing.
> In short, LBS enables buffered-IO large atomic support if the hardware
> supports it.
>
> We get both alignment constraints guaranteed and now ensure
> we use contiguous memory for the IOs for DMA too, since it is built on large
> folios. We expect NVMe drives which support large atomics can
> easily profit from this without any userspace modification other than
> when you create the filesystem.
If we combine atomic writes with buffered writeback then we create a
major IO constraint: *all* writes must be atomic in this sort of
setup because we cannot allow multi-sector writes to be torn
randomly in the middle of *any* sector. i.e. the driver needs to
be telling the block device that its maximum IO size is limited by the
max atomic write size the device supports.
With that constraint in place, I don't see how the page cache or
filesystem needs to care about how the underlying storage device
provides its atomic sector sized IO. If the underlying device uses
atomic writes, then it needs to set up all its published IO
constraints that are used by filesystems to build bios around the
limitations of atomic writes. And that bleeds into userspace as
well - it needs to know the sector sizes so it can set up the
filesystem correctly in the first place.
Hence I think there is -zero- overlap between LBS and atomic writes.
Yes, a device can provide a larger sector size via atomic write
support, but that's orthogonal to LBS infrastructure. All the device
needs to do is to set all of the device limits to be based on atomic
write constraints. Nothing else in the kernel or userspace needs to
care, and then the driver can simply add the REQ_ATOMIC flag to all
the write IOs itself....
Note that I'm not talking about IOCB_ATOMIC here: the page cache
doesn't give any guarantees about atomic write semantics. e.g. reads
are allowed to race with writes to the same folio, "atomic" user
writes that span folios can be written back independently (even
whilst the write() is in progress!) breaking the atomicity that
userspace specified.
Hence if we want IOCB_ATOMIC for buffered writes, the first problem
that needs to be solved is providing guaranteed stable atomic write
semantics through the page cache right down to the async writeback
code.....
> We reviewed the possible intersection of both efforts at our last LBS cabal
> with LBS interested folks and Martin Petersen and John Garry. It is somewhat
> unclear exactly how to follow up on some aspects of the no-tear API [5]
> but there was agreement about the possible intersection of both efforts,
> and that we should discuss this at LSFMM. The goal would be to try to reach
> consensus on how the no-tear API and LBS could help those
> interested in leveraging large atomics.
>
> Some things to evaluate or for us to discuss:
>
> * no-tear API:
But I like to cry.
> - allows directories to have separate alignment requirements
> - this might be useful for folks who want to use large IOs with
> large atomics for some workloads but smaller IOs for another
> directory on the same drive. Is this a viable option to some
> users for large atomics with concerns of being forced to use
> only large writes with LBS?
We can already do that with extent size hints in XFS.
> - statx is modified to display new alignment considerations
> - atomics are power of 2
> - there seems to be some interest in supporting no-hardware-accel atomic
> solution, so a software implemented atomic solution, could someone
> clarify if that's accurate? How is the double write avoided? What are
> the use cases? Do databases use that today?
Christoph's proposal for XFS involves using existing internal
copy-on-write infrastructure for IOCB_ATOMIC writes. i.e. it uses
the filesystem journal to do the atomic swap of the new data extent
in place of the old one.
> - How do we generalize a solution per file? Would extending a min
> order per file be desirable? Is that even tenable?
AFAIA, this is already the plan with XFS via a FORCE_ALIGN inode
flag in conjunction with extent size hints.
> * LBS:
> - stat will return the block size set, so userspace applications
> using stat / statx will use the larger block size to ensure
> alignment
> - a drive with support for a large atomic but supporting smaller
> logical block sizes will still allow writes to the logical block
> size. If a block driver has a "preference" (in NVMe this would
> be the NPWG for the IU) to write above the logical block size,
> do we want the option to lift the logical block size? In
> retrospect I don't think this is needed given Jan Kara's patches
> to prevent writes to mounted devices [4], that should
> ensure that if a filesystem takes advantage of a larger physical
> block size and creates a filesystem with it as the sector size,
> userspace won't be mucking around with lower IOs to the drive
> while it is mounted. But, are there any applications which would
> get the block device logical block size instead for DIO?
> - LBS is transparent to userspace applications
> - We've verified *most* IOs are aligned if you use a 16k block size
> but a smaller sector size, the lower IOs were verified to come
> from the XFS buffer cache. If your drive supports a large atomic
> you can avoid these as you can lift the sector size set as the
> physical block size will be larger than the logical block size.
> For NVMe today this is possible for drives with a large
> NPWG (the IU) and NAWUPF (the large atomic), for example.
This is just how the page cache and filesystems behave according to
sector and block size constraints defined by the block device and
mkfs. I'm not sure what you're asking that we comment on or discuss
here...
> Tooling:
>
> - Both efforts stand to gain from a shared verification set of tools
> for alignment and atomic use
> - We have a block layer eBPF alignment tool written by Daniel Gomez [6];
> however, there is a lack of interested parties to help review a simpler
> version of this tool so we can merge it [7]. We could benefit from more
> eyeballs from experienced eBPF / block layer folks.
Running and maintaining eBPF tools on development systems running
custom kernels is a PITA in my experience.
Wouldn't it be better just to add block tracepoint analysis filters
to things like trace-cmd? We already have tracepoints that expose
all the IO operations like queuing, merging, dispatch, etc that
users are familiar with and have scripts and tooling written for.
Adding a filter that calculates IO alignment for traces during
report generation would be much more useful for IO analysis in
general as understanding these behaviours is not specific to atomic
writes.
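
As a rough sketch of what such a filter could compute, the C fragment below
derives the alignment of an IO from the start sector and sector count that
the block tracepoints already report. It assumes 512-byte sector units; the
function names are hypothetical and this is not existing trace-cmd code.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* trace sectors are assumed to be 512 bytes */

/*
 * Largest power-of-two size, in bytes, to which both the start and the
 * length of an IO are aligned.
 */
uint64_t io_alignment_bytes(uint64_t sector, uint32_t nr_sectors)
{
	uint64_t v;

	assert(nr_sectors != 0);
	v = sector | nr_sectors;
	/* Isolate the lowest set bit: the common alignment in sectors. */
	return (v & (~v + 1)) << SECTOR_SHIFT;
}

/* Non-zero if the IO is aligned to `block_bytes` (a power of two). */
int io_is_aligned(uint64_t sector, uint32_t nr_sectors, uint64_t block_bytes)
{
	return io_alignment_bytes(sector, nr_sectors) >= block_bytes;
}
```

For example, a 32-sector IO starting at sector 0 is 16k aligned, while the
same IO starting at sector 8 is only 4k aligned and would be flagged when
checking against a 16k block size.
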
-Dave.
--
Dave Chinner
[email protected]