I see that there are still empty spots in the LSF/MM schedule, so I would like
to have a discussion on allowing direct block mapping of files for devices
(nic, gpu, fpga, ...). This is an mm, fs and block discussion, though the mm
side is pretty light, i.e. only adding two callbacks to vm_operations_struct:
    int (*device_map)(struct vm_area_struct *vma,
                      struct device *importer,
                      struct dma_buf **bufp,
                      unsigned long start,
                      unsigned long end,
                      unsigned flags,
                      dma_addr_t *pa);

    // Some flags I can think of:
    DEVICE_MAP_FLAG_PIN               // i.e. return a dma_buf object
    DEVICE_MAP_FLAG_WRITE             // importer wants to be able to write
    DEVICE_MAP_FLAG_SUPPORT_ATOMIC_OP // importer wants to do atomic operations
                                      // on the mapping

    void (*device_unmap)(struct vm_area_struct *vma,
                         struct device *importer,
                         unsigned long start,
                         unsigned long end,
                         dma_addr_t *pa);
Each filesystem could add this callback and decide whether or not to allow
the importer to directly map blocks. Filesystems can use whatever logic they
want to make that decision. For instance, if there are pages in the page cache
for the range, the filesystem can say no and the device falls back to main
memory. The filesystem can also update its internal data structures to keep
track of direct block mappings.
If the filesystem decides to allow the direct block mapping, it forwards the
request to the block device, which can itself refuse the direct mapping for
any reason: running out of BAR space, peer to peer between the block device
and the importer device not being supported, the block device not wanting to
allow writeable peer mappings, ...
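To make that two-stage veto concrete, here is a minimal userspace sketch of
the decision logic described above (filesystem first, then block device). All
structs, function names and policies below are hypothetical illustrations of
the proposal, not existing kernel interfaces.

```c
#include <stdbool.h>

/* Hypothetical model of the proposed device_map() negotiation: the
 * filesystem gets the first veto, then forwards to the block device,
 * which may refuse for its own reasons (BAR exhaustion, no
 * peer-to-peer path, read-only peer policy, ...). */

#define DEVICE_MAP_FLAG_PIN               (1u << 0)
#define DEVICE_MAP_FLAG_WRITE             (1u << 1)
#define DEVICE_MAP_FLAG_SUPPORT_ATOMIC_OP (1u << 2)

struct fs_state   { bool range_in_page_cache; };
struct bdev_state { unsigned long bar_bytes_free; bool p2p_ok; bool allow_peer_write; };

/* Block device's half of the decision. */
static bool bdev_allow_map(const struct bdev_state *bd,
                           unsigned long len, unsigned flags)
{
    if (bd->bar_bytes_free < len)
        return false;                   /* out of BAR space */
    if (!bd->p2p_ok)
        return false;                   /* no peer-to-peer path */
    if ((flags & DEVICE_MAP_FLAG_WRITE) && !bd->allow_peer_write)
        return false;                   /* writeable peer mapping refused */
    return true;
}

/* Filesystem's half: veto if the range is already in the page cache,
 * otherwise forward the request to the block device. */
static bool fs_device_map(const struct fs_state *fs,
                          const struct bdev_state *bd,
                          unsigned long len, unsigned flags)
{
    if (fs->range_in_page_cache)
        return false;                   /* importer falls back to main memory */
    return bdev_allow_map(bd, len, flags);
}
```

A `false` return here corresponds to the importer's driver falling back to
faulting pages from the page cache.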
So the event flow is:
  1  program mmaps a file (and never intends to access it with the CPU)
  2  program tries to access the mmap from a device A
  3  device A's driver sees the device_map callback on the vma and calls it
  4a on success, device A's driver programs the device with the mapped dma
     address
  4b on failure, device A's driver falls back to faulting so that it can use
     pages from the page cache
This API assumes that the importer supports mmu notifiers and thus that
the fs can invalidate the device mapping at _any_ time by sending an mmu
notifier to all mappings of the file (for a given range in the file or for
the whole file). Obviously you want to minimize disruption and thus only
invalidate when necessary.
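The invalidation side can be sketched the same way: the fs walks the device
mappings of a file and revokes only those overlapping the invalidated range.
Everything below is a hypothetical userspace mock, not the mmu notifier
machinery itself.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of fs-driven invalidation: revoke every device mapping
 * of the file that overlaps [start, end), leave the rest untouched
 * to minimize disruption. In the real proposal this would be driven
 * by an mmu notifier and would flush the device page tables. */

#define MAX_MAPPINGS 8

struct dev_mapping  { unsigned long start, end; bool valid; };
struct file_mappings { struct dev_mapping m[MAX_MAPPINGS]; size_t n; };

static size_t invalidate_range(struct file_mappings *f,
                               unsigned long start, unsigned long end)
{
    size_t revoked = 0;

    for (size_t i = 0; i < f->n; i++) {
        struct dev_mapping *m = &f->m[i];

        /* half-open interval overlap test */
        if (m->valid && m->start < end && start < m->end) {
            m->valid = false;   /* would flush the device page table here */
            revoked++;
        }
    }
    return revoked;
}
```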
The dma_buf parameter can be used to add pinning support for filesystems that
wish to support that case too. Here the mapping lifetime gets disconnected
from the vma and is transferred to the dma_buf allocated by the filesystem.
Again, the filesystem can decide to say no, as pinning blocks has drastic
consequences for the filesystem and block device.
This has some similarities to the hmmap and caching topic (which is about
mapping blocks directly to the CPU, AFAIU), but device mapping can cut some
corners; for instance, some devices can forgo atomic operations on such
mappings and thus can work over PCIE, while the CPU cannot do atomics to a
PCIE BAR.
Also, this API can be used to allow peer to peer access between devices
when the vma is an mmap of a device file and thus the vm_operations_struct
comes from some exporter device driver. So the same two vm_operations_struct
callbacks can be used in more cases than what I just described here.
So I would like to gather people's feedback on the general approach and a few
things like:
    - Do block devices need to be able to invalidate such mappings too?

      It is easy for the fs to invalidate, as it can walk the file's
      mappings, but a block device does not know about files.

    - Do we want to provide some generic implementation to share across
      filesystems?

    - Maybe some shared helpers for block devices that could track the file
      corresponding to a peer mapping?
Cheers,
Jérôme
On Thu, Apr 25, 2019 at 09:38:14PM -0400, Jerome Glisse wrote:
> I see that they are still empty spot in LSF/MM schedule so i would like to
> have a discussion on allowing direct block mapping of file for devices (nic,
> gpu, fpga, ...). This is mm, fs and block discussion, thought the mm side
> is pretty light ie only adding 2 callback to vm_operations_struct:
The filesystem already has infrastructure for the bits it needs to
provide. They are called file layout leases (how many times do I
have to keep telling people this!), and what you do with the lease
for the LBA range the filesystem maps for you is then something you
can negotiate with the underlying block device.
i.e. go look at how xfs_pnfs.c works to hand out block mappings to
remote pNFS clients so they can directly access the underlying
storage. Basically, anyone wanting to map blocks needs a file layout
lease and then to manage the filesystem state over that range via
these methods in the struct export_operations:
    int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
    int (*map_blocks)(struct inode *inode, loff_t offset,
                      u64 len, struct iomap *iomap,
                      bool write, u32 *device_generation);
    int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
                         int nr_iomaps, struct iattr *iattr);
Basically, before you read/write data, you map the blocks. If you've
written data, then you need to commit the blocks (i.e. tell the fs
they've been written to).
The iomap will give you a contiguous LBA range and the block device
they belong to, and you can then use that to do whatever smart DMA stuff
you need to do through the block device directly.
If the filesystem wants the space back (e.g. because of truncate) then
the lease will be revoked. The client must then finish off its
outstanding operations, commit them and release the lease. To access
the file range again, it must renew the lease and remap the file
through ->map_blocks....
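As a rough illustration of that lifecycle (map, write, commit, and remap
after a lease revoke), here is a toy userspace model. The `toy_*` names are
invented for this sketch; the real hooks are the export_operations methods
quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the pNFS-style lease flow: hold a layout lease, map
 * blocks, commit written ranges, and remap after a revoke. Purely
 * illustrative; not the kernel's struct export_operations. */

struct toy_iomap { uint64_t lba; uint64_t len; };

struct toy_layout {
    bool lease_valid;           /* revoked by the fs on e.g. truncate */
    struct toy_iomap map;
    bool committed;
};

/* ->map_blocks analogue: hand out an LBA range under a fresh lease. */
static bool toy_map_blocks(struct toy_layout *l, uint64_t off, uint64_t len)
{
    l->lease_valid = true;
    l->map = (struct toy_iomap){ .lba = off / 512, .len = len };
    l->committed = false;
    return true;
}

/* ->commit_blocks analogue: only valid while the lease is held. */
static bool toy_commit_blocks(struct toy_layout *l)
{
    if (!l->lease_valid)
        return false;           /* must renew the lease and remap first */
    l->committed = true;
    return true;
}

/* Lease revocation, e.g. the fs wants the space back. */
static void toy_revoke_lease(struct toy_layout *l)
{
    l->lease_valid = false;
}
```

After a revoke, the only way forward is to renew the lease and remap, exactly
as described above.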
> So i would like to gather people feedback on general approach and few things
> like:
> - Do block device need to be able to invalidate such mapping too ?
>
> It is easy for fs the to invalidate as it can walk file mappings
> but block device do not know about file.
If you are needing the block device to invalidate filesystem level
information, then your model is all wrong.
> - Do we want to provide some generic implementation to share accross
> fs ?
We already have a generic interface, filesystems other than XFS will
need to implement them.
> - Maybe some share helpers for block devices that could track file
> corresponding to peer mapping ?
If the application hasn't supplied the peer with the file it needs
to access, get a lease from and then map an LBA range out of, then
you are doing it all wrong.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> remote pNFS clients so they can directly access the underlying
> storage. Basically, anyone wanting to map blocks needs a file layout
> lease and then to manage the filesystem state over that range via
> these methods in the struct export_operations:
>
> int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> int (*map_blocks)(struct inode *inode, loff_t offset,
> u64 len, struct iomap *iomap,
> bool write, u32 *device_generation);
> int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> int nr_iomaps, struct iattr *iattr);
Nitpick: get_uuid isn't needed for the lease itself, it just works
around the fact that the original pNFS/block protocol is braindead.
The pNFS/SCSI protocol already switches to a device UUID, and other
users that work locally shouldn't need it either.
On Fri, Apr 26, 2019 at 05:45:53AM -0700, Christoph Hellwig wrote:
> On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> > i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> > remote pNFS clients so they can directly access the underlying
> > storage. Basically, anyone wanting to map blocks needs a file layout
> > lease and then to manage the filesystem state over that range via
> > these methods in the struct export_operations:
> >
> > int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> > int (*map_blocks)(struct inode *inode, loff_t offset,
> > u64 len, struct iomap *iomap,
> > bool write, u32 *device_generation);
> > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> > int nr_iomaps, struct iattr *iattr);
>
> Nipick: get_uuid isn't needed for the least itself, it just works
> around the fact that the original pNFS/block protocol is braindead.
> The pNFS/SCSI prototocol already switches to a device UUID, and other
> users that work locally shouldn't need it either.
Hmmm, this lease interface still doesn't support COW, right?
(Right, xfs_pnfs.c bails out with -ENXIO for reflink files)
It occurs to me that maybe we don't want Goldwyn's IOMAP_DAX_COW
approach (hide the read address in the iomap->inline_data pointer); we
just want two physical source addresses. Then the dax code can turn
that into a memory pointer and file lessees can do sector accesses or
whatever they need to do to write the range before calling
->commit_blocks.
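A sketch of what such a two-address iomap might look like; the field and
function names are hypothetical, not the kernel's struct iomap or Goldwyn's
IOMAP_DAX_COW patches.

```c
#include <stdint.h>

/* The "two physical source addresses" idea: instead of hiding the
 * COW read source in iomap->inline_data, carry both the read (old,
 * shared) and write (newly allocated) addresses explicitly. */
struct cow_iomap {
    uint64_t offset;      /* file offset this extent covers */
    uint64_t length;      /* extent length in bytes */
    uint64_t read_addr;   /* where the existing data lives (shared extent) */
    uint64_t write_addr;  /* freshly allocated destination for the write */
};

/* A lessee doing a sub-extent write reads untouched bytes from
 * read_addr and writes the whole range to write_addr before calling
 * ->commit_blocks. */
static uint64_t cow_source(const struct cow_iomap *m, uint64_t off)
{
    return m->read_addr + (off - m->offset);
}

static uint64_t cow_dest(const struct cow_iomap *m, uint64_t off)
{
    return m->write_addr + (off - m->offset);
}
```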
Oh, right, both of you commented about a dual iomap approach on the v2
btrfs dax support series.
/me goes back to drinking coffee.
--D
On Fri, Apr 26, 2019 at 07:45:07AM -0700, Darrick J. Wong wrote:
> Hmmm, this lease interface still doesn't support COW, right?
>
> (Right, xfs_pnfs.c bails out with -ENXIO for reflink files)
Yes, but that is because no one has bothered to do the work.
pNFS/block and pNFS/scsi explicitly support a compatible COW
scheme, it is just that no one has implemented it yet (at all, I think,
not just for Linux..).
On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> On Thu, Apr 25, 2019 at 09:38:14PM -0400, Jerome Glisse wrote:
> > I see that they are still empty spot in LSF/MM schedule so i would like to
> > have a discussion on allowing direct block mapping of file for devices (nic,
> > gpu, fpga, ...). This is mm, fs and block discussion, thought the mm side
> > is pretty light ie only adding 2 callback to vm_operations_struct:
>
> The filesystem already has infrastructure for the bits it needs to
> provide. They are called file layout leases (how many times do I
> have to keep telling people this!), and what you do with the lease
> for the LBA range the filesystem maps for you is then something you
> can negotiate with the underlying block device.
>
> i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> remote pNFS clients so they can directly access the underlying
> storage. Basically, anyone wanting to map blocks needs a file layout
> lease and then to manage the filesystem state over that range via
> these methods in the struct export_operations:
>
> int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> int (*map_blocks)(struct inode *inode, loff_t offset,
> u64 len, struct iomap *iomap,
> bool write, u32 *device_generation);
> int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> int nr_iomaps, struct iattr *iattr);
>
> Basically, before you read/write data, you map the blocks. if you've
> written data, then you need to commit the blocks (i.e. tell the fs
> they've been written to).
>
> The iomap will give you a contiguous LBA range and the block device
> they belong to, and you can then use that to whatever smart DMA stuff
> you need to do through the block device directly.
>
> If the filesystem wants the space back (e.g. because truncate) then
> the lease will be revoked. The client then must finish off it's
> outstanding operations, commit them and release the lease. To access
> the file range again, it must renew the lease and remap the file
> through ->map_blocks....
Sorry, I should have explained why leases do not work. Here is a list of
lease shortcomings, AFAIK:
    - only one process
    - the program, i.e. userspace, is responsible for doing the right
      thing, so there is a heavy burden on the userspace program
    - lease break time induces latency
    - leases may require privileges for the application
    - they work on file descriptors, not virtual addresses
While what I am trying to achieve is:
    - support any number of processes
    - work on virtual addresses
    - is an optimization, i.e. falling back to the page cache is _always_
      acceptable
    - no changes to userspace programs, i.e. existing programs can
      benefit from this just by running on a kernel with the
      feature, on a system with hardware that supports it
    - allow multiple different devices to map the blocks (can be
      read only if the fabric between devices is not cache coherent)
    - it is an optimization, i.e. avoid wasting main memory if the file
      is only accessed by a device
    - there is _no pin_ and it can be revoked at _any_ time from within
      the kernel, i.e. there is no need to rely on the application to do
      the right thing
    - not only support filesystems but also vmas that come from a device
      file

I do not think I can achieve those objectives with file leases.
The motivation is coming from new storage technology (NVMe with CMB for
instance) where the block device can offer byte addressable access to blocks.
It can be read only, or read and write. When you couple this with gpus,
fpgas and tpus that can crunch massive data sets (in the terabyte range),
avoiding going through main memory becomes an appealing prospect.

If we can achieve that with no disruption to the application programming
model, all the better. By mediating direct block access through the vma we
can achieve that. With no update to the application we can provide a
speed-up (right now storage devices are still a bit slower than main memory,
but PCIE can be the bottleneck) or at the very least save main memory for
other things.
This is why I believe something at the vma level is better suited to
make such a thing as easy and transparent as possible. Note that unlike
GUP there is _no pinning_, so the filesystem is always in total control and
can revoke at _any_ time. Also, because it is all kernel side, we should
achieve much better latency (flushing a device page table is usually faster
than switching to userspace and having userspace call back into the
driver).
>
> > So i would like to gather people feedback on general approach and few things
> > like:
> > - Do block device need to be able to invalidate such mapping too ?
> >
> > It is easy for fs the to invalidate as it can walk file mappings
> > but block device do not know about file.
>
> If you are needing the block device to invalidate filesystem level
> information, then your model is all wrong.
It is _not_ a requirement. It is a feature, and it does not need to be
implemented right away. The motivation comes from block devices that can
manage their PCIE BAR address space dynamically: they might want to
unmap some blocks to make room for other blocks. For this they would need
to make sure that they can revoke access from any device or CPU for which
they might have mapped the blocks they want to evict.
> > - Do we want to provide some generic implementation to share accross
> > fs ?
>
> We already have a generic interface, filesystems other than XFS will
> need to implement them.
>
> > - Maybe some share helpers for block devices that could track file
> > corresponding to peer mapping ?
>
> If the application hasn't supplied the peer with the file it needs
> to access, get a lease from and then map an LBA range out of, then
> you are doing it all wrong.
I do not have the same programming model as the one you have in mind. I
want to allow existing applications, which mmap files and access that
mapping through a device or the CPU, to directly access those blocks through
the virtual address.
Cheers,
Jérôme
On Thu, 2019-04-25 at 21:38 -0400, Jerome Glisse wrote:
> I see that they are still empty spot in LSF/MM schedule so i would like to
> have a discussion on allowing direct block mapping of file for devices (nic,
> gpu, fpga, ...). This is mm, fs and block discussion, thought the mm side
> is pretty light ie only adding 2 callback to vm_operations_struct:
>
> int (*device_map)(struct vm_area_struct *vma,
>                   struct device *importer,
>                   struct dma_buf **bufp,
>                   unsigned long start,
>                   unsigned long end,
>                   unsigned flags,
>                   dma_addr_t *pa);
>
> // Some flags i can think of:
> DEVICE_MAP_FLAG_PIN // ie return a dma_buf object
> DEVICE_MAP_FLAG_WRITE // importer want to be able to write
> DEVICE_MAP_FLAG_SUPPORT_ATOMIC_OP // importer want to do atomic operation
>                                   // on the mapping
>
> void (*device_unmap)(struct vm_area_struct *vma,
>                      struct device *importer,
>                      unsigned long start,
>                      unsigned long end,
>                      dma_addr_t *pa);
>
> Each filesystem could add this callback and decide wether or not to allow
> the importer to directly map block. Filesystem can use what ever logic they
> want to make that decision. For instance if they are page in the page cache
> for the range then it can say no and the device would fallback to main
> memory. Filesystem can also update its internal data structure to keep
> track of direct block mapping.
>
> If filesystem decide to allow the direct block mapping then it forward the
> request to the block device which itself can decide to forbid the direct
> mapping again for any reasons. For instance running out of BAR space or
> peer to peer between block device and importer device is not supported or
> block device does not want to allow writeable peer mapping ...
>
> So event flow is:
> 1 program mmap a file (end never intend to access it with CPU)
> 2 program try to access the mmap from a device A
> 3 device A driver see device_map callback on the vma and call it
> 4a on success device A driver program the device to mapped dma address
> 4b on failure device A driver fallback to faulting so that it can use
>    page from page cache
>
> This API assume that the importer does support mmu notifier and thus that
> the fs can invalidate device mapping at _any_ time by sending mmu notifier
> to all mapping of the file (for a given range in the file or for the whole
> file). Obviously you want to minimize disruption and thus only invalidate
> when necessary.
>
> The dma_buf parameter can be use to add pinning support for filesystem who
> wish to support that case too. Here the mapping lifetime get disconnected
> from the vma and is transfer to the dma_buf allocated by filesystem. Again
> filesystem can decide to say no as pinning blocks has drastic consequence
> for filesystem and block device.
>
> This has some similarities to the hmmap and caching topic (which is mapping
> block directly to CPU AFAIU) but device mapping can cut some corner for
> instance some device can forgo atomic operation on such mapping and thus
> can work over PCIE while CPU can not do atomic to PCIE BAR.
>
> Also this API here can be use to allow peer to peer access between devices
> when the vma is a mmap of a device file and thus vm_operations_struct come
> from some exporter device driver. So same 2 vm_operations_struct call back
> can be use in more cases than what i just described here.
>
> So i would like to gather people feedback on general approach and few things
> like:
>     - Do block device need to be able to invalidate such mapping too ?
>
>       It is easy for fs the to invalidate as it can walk file mappings
>       but block device do not know about file.
>
>     - Do we want to provide some generic implementation to share accross
>       fs ?
>
>     - Maybe some share helpers for block devices that could track file
>       corresponding to peer mapping ?
I'm interested in being a part of this discussion.
>
>
> Cheers,
> Jérôme
On Fri, Apr 26, 2019 at 11:20:45AM -0400, Jerome Glisse wrote:
> On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> > On Thu, Apr 25, 2019 at 09:38:14PM -0400, Jerome Glisse wrote:
> > > I see that they are still empty spot in LSF/MM schedule so i would like to
> > > have a discussion on allowing direct block mapping of file for devices (nic,
> > > gpu, fpga, ...). This is mm, fs and block discussion, thought the mm side
> > > is pretty light ie only adding 2 callback to vm_operations_struct:
> >
> > The filesystem already has infrastructure for the bits it needs to
> > provide. They are called file layout leases (how many times do I
> > have to keep telling people this!), and what you do with the lease
> > for the LBA range the filesystem maps for you is then something you
> > can negotiate with the underlying block device.
> >
> > i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> > remote pNFS clients so they can directly access the underlying
> > storage. Basically, anyone wanting to map blocks needs a file layout
> > lease and then to manage the filesystem state over that range via
> > these methods in the struct export_operations:
> >
> > int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> > int (*map_blocks)(struct inode *inode, loff_t offset,
> > u64 len, struct iomap *iomap,
> > bool write, u32 *device_generation);
> > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> > int nr_iomaps, struct iattr *iattr);
> >
> > Basically, before you read/write data, you map the blocks. if you've
> > written data, then you need to commit the blocks (i.e. tell the fs
> > they've been written to).
> >
> > The iomap will give you a contiguous LBA range and the block device
> > they belong to, and you can then use that to whatever smart DMA stuff
> > you need to do through the block device directly.
> >
> > If the filesystem wants the space back (e.g. because truncate) then
> > the lease will be revoked. The client then must finish off it's
> > outstanding operations, commit them and release the lease. To access
> > the file range again, it must renew the lease and remap the file
> > through ->map_blocks....
>
> Sorry i should have explain why lease do not work. Here are list of
> lease shortcoming AFAIK:
> - only one process
Sorry, what? The lease is taken by an application process that then
hands out the mapping to whatever parts of it - processes, threads,
remote clients, etc - need access. If your application doesn't
have an access co-ordination method, then you're already completely
screwed.
> - program ie userspace is responsible for doing the right thing
> so heavy burden on userspace program
You're asking for direct access to storage owned by the filesystem.
The application *must* play by the filesystem rules. Stop trying to
hack around the fact that the filesystem controls access to the
block mapping.
> - lease break time induce latency
Lease breaks should never happen in normal workloads, so this isn't
an issue. IF you have an application that requires exclusive access,
then ensure that the file can only be accessed by the application
and the lease should never be broken.
But if you are going to ask for filesystems to hand out block
mapping for third party access, the 3rd parties need to play by the
filesystem's access rules, and that means they /must/ break access
if the filesystem asks them to.
> - lease may require privileges for the applications
If you can directly access the underlying block device (which
requires root/CAP_SYS_ADMIN) then the application has sufficient
privilege to get a file layout lease.
> - work on file descriptor not virtual addresses
Sorry, what? You want direct access to the underlying storage device
for direct DMA, not access to the page cache. i.e. you need a
mapping for a range of a file (from offset X to Y) and you most
definitely do not need the file to be virtually mapped for that.
If you want to DMA from a userspace or peer device memory to storage
directly, then you definitely do not want the file to mapped into
the page cache, and so mmap() is most definitely the wrong interface
to be using to set up direct storage access to a file.
> While what i am trying to achieve is:
> - support any number of process
file leases don't prevent that.
> - work on virtual addresses
like direct io, get_user_pages() works just fine for this.
> - is an optimization ie falling back to page cache is _always_
> acceptable
No, it isn't. Falling back to the page cache will break the layout
lease, because the local filesystem does IO that breaks existing
leases. You can't have both a layout lease and page cache access to
the same file.
> - no changes to userspace program ie existing program can
> benefit from this by just running on a kernel with the
> feature on the system with hardware that support this.
That's a pipe dream. Existing direct access applications /don't work/ with
file-backed mmap() ranges. They will not work with DAX, either, so
please stop with the "work with unmodified existing applications"
already.
If you want peer to peer DMA to filesystem managed storage, then you
*must* use the filesystem to manage access to that storage.
> - allow multiple different devices to map the block (can be
> read only if the fabric between devices is not cache coherent)
Nothing about a layout lease prevents that. What the application
does with the layout lease is it's own business.
> - it is an optimization ie avoiding to waste main memory if file
> is only accessed by device
Layout leases don't prevent this - they are explicitly for allowing
this sort of access to be made safely.
> - there is _no pin_ and it can be revoke at _any_ time from within
> the kernel ie there is no need to rely on application to do the
> right thing
Revoke how, exactly? Are you really proposing sending SEGV to user
processes as the revoke mechanism?
> - not only support filesystem but also vma that comes from device
> file
What's a "device file" and how is that any difference from a normal
kernel file?
> The motivation is coming from new storage technology (NVMe with CMB for
> instance) where block device can offer byte addressable access to block.
> It can be read only or read and write. When you couple this with gpu,
> fgpa, tpu that can crunch massive data set (in the tera bytes ranges)
> then avoiding going through main memory becomes an appealing prospect.
>
> If we can achieve that with no disruption to the application programming
> model the better it is. By allowing to mediate direct block access through
> vma we can achieve that.
I have a hammer that I can use to mediate direct block access, too.
That doesn't mean it's the right tool for the job. At its most
fundamental level, the block mapping is between an inode, the file
offset and the LBA range in the block device that the storage device
presents to users. This is entirely /filesystem information/ and we
already have interfaces to manage and arbitrate safe direct storage
access for third parties.
Stop trying to re-invent the wheel and use the one we already have.
> This is why i am believe something at the vma level is better suited to
> make such thing as easy and transparent as possible. Note that unlike
> GUP there is _no pinning_ so filesystem is always in total control and
> can revoke at _any_ time.
Revoke how, exactly? And how do applications pause and restart when
this undefined revoke mechanism is invoked? What happens to access
latency when this revoke occurs and why is this any different to
having a file layout lease revoked?
> Also because it is all kernel side we should
> achieve much better latency (flushing device page table is usualy faster
> then switching to userspace and having userspace calling back into the
> driver).
>
>
> >
> > > So i would like to gather people feedback on general approach and few things
> > > like:
> > > - Do block device need to be able to invalidate such mapping too ?
> > >
> > > It is easy for fs the to invalidate as it can walk file mappings
> > > but block device do not know about file.
> >
> > If you are needing the block device to invalidate filesystem level
> > information, then your model is all wrong.
>
> It is _not_ a requirement. It is a feature and it does not need to be
> implemented right away the motivation comes from block device that can
> manage their PCIE BAR address space dynamicly and they might want to
> unmap some block to make room for other block. For this they would need
> to make sure that they can revoke access from device or CPU they might
> have mapped the block they want to evict.
This has nothing to do with the /layout lease/. Layout leases are
for managing direct device access, not how the application interacts
with the hardware that it has been given a block mapping for.
Jerome, it seems to me like you're conflating hardware management
issues with block device access and LBA management. These are
completely separate things that the application has to manage - the
filesystem and the layout lease doesn't give a shit about whether
the application has exhausted the hardware PCIE BAR space. i.e.
hardware kicking out a user address mapping does not invalidate the
layout lease in any way - it just requires the application to set up
that direct access map in the hardware again. The file offset to
LBA mapping that the layout lease manages is entirely unaffected by
this sort of problem.
> > > - Maybe some share helpers for block devices that could track file
> > > corresponding to peer mapping ?
> >
> > If the application hasn't supplied the peer with the file it needs
> > to access, get a lease from and then map an LBA range out of, then
> > you are doing it all wrong.
>
> I do not have the same programming model than one you have in mind, i
> want to allow existing application which mmap files and access that
> mapping through a device or CPU to directly access those blocks through
> the virtual address.
Which is the *wrong model*.
mmap() of a file-backed mapping does not provide a sane, workable
direct storage access management API. It's fundamentally flawed
because it does not provide any guarantee about the underlying
filesystem information (e.g. the block mapping) and as such, results
in a largely unworkable model that we need all sorts of complexity
to sorta make work.
Layout leases and the export ops provide the application with the
exact information they need to directly access the storage
underneath the filesystem in a safe manner. They do not, in any way,
control how the application then uses that information. If you
really want to use mmap() to access the storage, then you can mmap()
the ranges of the block device the ->map_blocks() method tells you
belong to that file.
You can do whatever you want with those vmas and the filesystem
doesn't care - it's not involved in /any way whatsoever/ with the
data transfer into and out of the storage because ->map_blocks has
guaranteed that the storage is allocated. All the application needs
to do is call ->commit_blocks on each range of the mapping it writes
data into to tell the filesystem it now contains valid data. It's
simple, straightforward, and hard to get wrong from both userspace
and the kernel filesystem side.
Please stop trying to invent new and excitingly complex ways to do
direct block access, because we already have infrastructure that we know
works, that we already support, and that is flexible enough to provide
exactly the sort of direct block device access mechanisms that you are
asking for.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Sat, Apr 27, 2019 at 11:25:16AM +1000, Dave Chinner wrote:
> On Fri, Apr 26, 2019 at 11:20:45AM -0400, Jerome Glisse wrote:
> > On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> > > On Thu, Apr 25, 2019 at 09:38:14PM -0400, Jerome Glisse wrote:
> > > > I see that they are still empty spot in LSF/MM schedule so i would like to
> > > > have a discussion on allowing direct block mapping of file for devices (nic,
> > > > gpu, fpga, ...). This is mm, fs and block discussion, thought the mm side
> > > > is pretty light ie only adding 2 callback to vm_operations_struct:
> > >
> > > The filesystem already has infrastructure for the bits it needs to
> > > provide. They are called file layout leases (how many times do I
> > > have to keep telling people this!), and what you do with the lease
> > > for the LBA range the filesystem maps for you is then something you
> > > can negotiate with the underlying block device.
> > >
> > > i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> > > remote pNFS clients so they can directly access the underlying
> > > storage. Basically, anyone wanting to map blocks needs a file layout
> > > lease and then to manage the filesystem state over that range via
> > > these methods in the struct export_operations:
> > >
> > > int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> > > int (*map_blocks)(struct inode *inode, loff_t offset,
> > > u64 len, struct iomap *iomap,
> > > bool write, u32 *device_generation);
> > > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> > > int nr_iomaps, struct iattr *iattr);
> > >
> > > Basically, before you read/write data, you map the blocks. If you've
> > > written data, then you need to commit the blocks (i.e. tell the fs
> > > they've been written to).
> > >
> > > The iomap will give you a contiguous LBA range and the block device
> > > they belong to, and you can then use that for whatever smart DMA stuff
> > > you need to do through the block device directly.
> > >
> > > If the filesystem wants the space back (e.g. because truncate) then
> > > the lease will be revoked. The client then must finish off its
> > > outstanding operations, commit them and release the lease. To access
> > > the file range again, it must renew the lease and remap the file
> > > through ->map_blocks....
> >
> > Sorry, I should have explained why leases do not work. Here is a list of
> > lease shortcomings AFAIK:
> > - only one process
>
> Sorry, what? The lease is taken by an application process that then
> hands out the mapping to whatever parts of it - processes, threads,
> remote clients, etc - need access. If your application doesn't
> have an access co-ordination method, then you're already completely
> screwed.
Then I am completely screwed :) The thing is that today mmap of a file
does not mandate any kind of synchronization between the processes that
mmap the file, and thus we have a lot of existing applications that are
happy with that programming model; I do not see any reason to force
something new on them.
Here I am only proposing an optimization: possibly skipping the page
cache intermediary if the filesystem and block device allow it and can
do it (depending on any number of runtime conditions).
So leases are inherently not compatible with mmap of a file by multiple
processes. I understand you want to push your access model, but the
reality is that we have existing applications that just do not fit
that model, and I do not see any reason to ask them to change; that
is never a successful approach.
>
> > - the program, i.e. userspace, is responsible for doing the right thing,
> >   so a heavy burden is placed on the userspace program
>
> You're asking for direct access to storage owned by the filesystem.
> The application *must* play by the filesystem rules. Stop trying to
> hack around the fact that the filesystem controls access to the
> block mapping.
This is a filesystem opt-in feature: if a given filesystem does not want
to implement it, it simply does not, and the page cache will be used. It
is not mandatory; I am not forcing anyone. The original motivation was
not filesystems but mmap of device files, but as LSF/MM is coming up, I
thought it would be a good time to propose it for filesystems too. If
you do not want it for your filesystem, then just NAK any patch that
adds it to a filesystem you care about.
>
> > - lease break time induces latency
>
> Lease breaks should never happen in normal workloads, so this isn't
> an issue. IF you have an application that requires exclusive access,
> then ensure that the file can only be accessed by the application
> and the lease should never be broken.
>
> But if you are going to ask for filesystems to hand out block
> mapping for third party access, the 3rd parties need to play by the
> filesystem's access rules, and that means they /must/ break access
> if the filesystem asks them to.
The mmu notifier gives you revocation at any _time_ without a round
trip to user space.
>
> > - leases may require privileges for the application
>
> If you can directly access the underlying block device (which
> requires root/CAP_SYS_ADMIN) then the application has sufficient
> privilege to get a file layout lease.
Again, here I am targeting mmap of files and thus do not necessarily
need the same privileges as a lease (AFAIK).
>
> > - works on file descriptors, not virtual addresses
>
> Sorry, what? You want direct access to the underlying storage device
> for direct DMA, not access to the page cache. i.e. you need a
> mapping for a range of a file (from offset X to Y) and you most
> definitely do not need the file to be virtually mapped for that.
>
> If you want to DMA from a userspace or peer device memory to storage
> directly, then you definitely do not want the file to mapped into
> the page cache, and so mmap() is most definitely the wrong interface
> to be using to set up direct storage access to a file.
I am starting from _existing_ applications that do mmap, so mmap is the
base assumption, my starting point. I am not trying to enable some kind
of new workload; I am trying to allow existing applications to leverage
new storage technology more efficiently without changing a single line
in those applications.
I believe the kernel should always try to improve existing application
workloads with no modification to the applications whenever possible.
>
> > What I am trying to achieve is:
> > - support any number of processes
>
> file leases don't prevent that.
I was under the impression that there could be only one lease at a
time per file. Sorry if I was wrong.
>
> > - work on virtual addresses
>
> like direct io, get_user_pages() works just fine for this.
Here I am getting away from GUP to avoid the pinning issues related
to GUP, hence there is no need in what I propose to pin anything.
What I am trying to do is skip the page cache copy if at all
possible (depending on many factors) so that applications can get
better performance without modification.
> > - is an optimization, i.e. falling back to the page cache is _always_
> > acceptable
>
> No, it isn't. Falling back to the page cache will break the layout
> lease because the local filesystem does IO that breaks existing
> leases. You can't have both a layout lease and page cache access to
> the same file.
Yes, and this is what I want to do; mmap access is where I start
from, so as a starting point it is what I would like to allow. If you
do not want that for your filesystem, fine.
>
> > - no changes to userspace programs, i.e. existing programs can
> >   benefit from this by just running on a kernel with the
> >   feature, on a system with hardware that supports this.
>
> That's a pipe dream. Existing direct access applications /don't work/ with
> file-backed mmap() ranges. They will not work with DAX, either, so
> please stop with the "work with unmodified existing applications"
> already.
I am talking about existing applications that do mmap; I do not care
about applications that do direct access, as that is not what I am
trying to address. I want to improve applications that use mmap.
>
> If you want peer to peer DMA to filesystem managed storage, then you
> *must* use the filesystem to manage access to that storage.
And the callbacks do just that: they give control to the filesystem. If
the callback is not there, then the page cache is used; if it is there,
the filesystem does not always have to succeed, and it can fail and just
fall back to the page cache.
>
> > - allow multiple different devices to map the block (can be
> > read only if the fabric between devices is not cache coherent)
>
> Nothing about a layout lease prevents that. What the application
> does with the layout lease is it's own business.
>
> > - it is an optimization, i.e. avoiding wasting main memory if the file
> >   is only accessed by the device
>
> Layout leases don't prevent this - they are explicitly for allowing
> this sort of access to be made safely.
>
> > - there is _no pin_ and it can be revoked at _any_ time from within
> >   the kernel, i.e. there is no need to rely on the application to do
> >   the right thing
>
> Revoke how, exactly? Are you really proposing sending SEGV to user
> processes as the revoke mechanism?
No, just the mmu notifier, exactly what happens on write-back, truncate, ...
You can walk all the rmaps of a page and trigger the mmu notifier for
it, or reverse-walk the file offset range and trigger the mmu notifier
for those.
So there is no disruption to the application. After revocation it page
faults again and the cycle starts over: either it ends up in the page
cache or the same callback is called again.
>
> > - not only supports filesystems but also vmas that come from device
> >   files
>
> What's a "device file" and how is that any difference from a normal
> kernel file?
Many device drivers expose objects (for instance all the GPU drivers)
through their device file and allow userspace to mmap the device file to
access those objects. At a given offset in the device file you will find
a given object, and this can be per application or global to all
applications.
>
> > The motivation is coming from new storage technology (NVMe with CMB for
> > instance) where block device can offer byte addressable access to block.
> > It can be read only or read and write. When you couple this with gpu,
> > fpga, tpu that can crunch massive data sets (in the terabytes range)
> > then avoiding going through main memory becomes an appealing prospect.
> >
> > If we can achieve that with no disruption to the application programming
> > model the better it is. By allowing to mediate direct block access through
> > vma we can achieve that.
>
> I have a hammer that I can use to mediate direct block access, too.
>
> That doesn't mean it's the right tool for the job. At its most
> fundamental level, the block mapping is between an inode, the file
> offset and the LBA range in the block device that the storage device
> presents to users. This is entirely /filesystem information/ and we
> already have interfaces to manage and arbitrate safe direct storage
> access for third parties.
>
> Stop trying to re-invent the wheel and use the one we already have
I have a starting point, mmap, and this is what I am trying to improve.
>
> > This is why I believe something at the vma level is better suited to
> > make such a thing as easy and transparent as possible. Note that unlike
> > GUP there is _no pinning_, so the filesystem is always in total control
> > and can revoke at _any_ time.
>
> Revoke how, exactly? And how do applications pause and restart when
> this undefined revoke mechanism is invoked? What happens to access
> latency when this revoke occurs and why is this any different to
> having a file layout lease revoked?
Applications do not pause in any way; after invalidation they will page
fault exactly as with truncate or write-back. Then the callback is
called again, and the filesystem can say no this time, in which case it
falls back to the page cache. There is no change to existing
applications; they just work as they do now with no change in behavior.
>
> > Also because it is all kernel side we should
> > achieve much better latency (flushing the device page table is usually
> > faster than switching to userspace and having userspace call back into
> > the driver).
> >
> >
> > >
> > > > So I would like to gather people's feedback on the general approach and a few
> > > > things like:
> > > > - Do block devices need to be able to invalidate such mappings too ?
> > > >
> > > > It is easy for the fs to invalidate as it can walk file mappings
> > > > but the block device does not know about files.
> > >
> > > If you are needing the block device to invalidate filesystem level
> > > information, then your model is all wrong.
> >
> > It is _not_ a requirement. It is a feature and it does not need to be
> > implemented right away; the motivation comes from block devices that can
> > manage their PCIE BAR address space dynamically and might want to
> > unmap some blocks to make room for other blocks. For this they would
> > need to make sure that they can revoke access from any device or CPU
> > that might have mapped the blocks they want to evict.
>
> This has nothing to do with the /layout lease/. Layout leases are
> for managing direct device access, not how the application interacts
> with the hardware that it has been given a block mapping for.
>
> Jerome, it seems to me like you're conflating hardware management
> issues with block device access and LBA management. These are
> completely separate things that the application has to manage - the
> filesystem and the layout lease doesn't give a shit about whether
> the application has exhausted the hardware PCIE BAR space. i.e.
> hardware kicking out a user address mapping does not invalidate the
> layout lease in any way - it just requires the application to set up
> that direct access map in the hardware again. The file offset to
> LBA mapping that the layout lease manages is entirely unaffected by
> this sort of problem.
Ignore this point if it confuses you; it is just something that can be
ignored unless it becomes a problem.
>
> > > > - Maybe some shared helpers for block devices that could track the files
> > > >   corresponding to peer mappings ?
> > >
> > > If the application hasn't supplied the peer with the file it needs
> > > to access, get a lease from and then map an LBA range out of, then
> > > you are doing it all wrong.
> >
> > I do not have the same programming model as the one you have in mind; I
> > want to allow existing applications which mmap files and access that
> > mapping through a device or CPU to directly access those blocks through
> > the virtual address.
>
> Which is the *wrong model*.
>
> mmap() of a file-backed mapping does not provide a sane, workable
> direct storage access management API. It's fundamentally flawed
> because it does not provide any guarantee about the underlying
> filesystem information (e.g. the block mapping) and as such, results
> in a largely unworkable model that we need all sorts of complexity
> to sorta make work.
>
> Layout leases and the export ops provide the application with the
> exact information they need to directly access the storage
> underneath the filesystem in a safe manner. They do not, in any way,
> control how the application then uses that information. If you
> really want to use mmap() to access the storage, then you can mmap()
> the ranges of the block device the ->map_blocks() method tells you
> belong to that file.
>
> You can do whatever you want with those vmas and the filesystem
> doesn't care - it's not involved in /any way whatsoever/ with the
> data transfer into and out of the storage because ->map_blocks has
> guaranteed that the storage is allocated. All the application needs
> to do is call ->commit_blocks on each range of the mapping it writes
> data into to tell the filesystem it now contains valid data. It's
> simple, straightforward, and hard to get wrong from both the userspace
> and the kernel filesystem sides.
>
> Please stop trying to invent new and excitingly complex ways to do
> direct block access, because we already have infrastructure that we know
> works, that we already support, and that is flexible enough to provide
> exactly the sort of direct block device access mechanisms that you are
> asking for.
I understand you do not like mmap, but it has been around long enough
that it is extensively used, and we have to accept that. There is no
changing all the applications that exist out there and that rely on
mmap, and it is those applications that such an optimization would
help.
Cheers,
Jérôme
On Mon, Apr 29, 2019 at 09:26:45AM -0400, Jerome Glisse wrote:
> On Sat, Apr 27, 2019 at 11:25:16AM +1000, Dave Chinner wrote:
> > On Fri, Apr 26, 2019 at 11:20:45AM -0400, Jerome Glisse wrote:
> > > On Fri, Apr 26, 2019 at 04:28:16PM +1000, Dave Chinner wrote:
> > > > On Thu, Apr 25, 2019 at 09:38:14PM -0400, Jerome Glisse wrote:
> > > > > I see that there are still empty spots in the LSF/MM schedule so I would like to
> > > > > have a discussion on allowing direct block mapping of files for devices (nic,
> > > > > gpu, fpga, ...). This is an mm, fs and block discussion, though the mm side
> > > > > is pretty light, i.e. only adding 2 callbacks to vm_operations_struct:
> > > >
> > > > The filesystem already has infrastructure for the bits it needs to
> > > > provide. They are called file layout leases (how many times do I
> > > > have to keep telling people this!), and what you do with the lease
> > > > for the LBA range the filesystem maps for you is then something you
> > > > can negotiate with the underlying block device.
> > > >
> > > > i.e. go look at how xfs_pnfs.c works to hand out block mappings to
> > > > remote pNFS clients so they can directly access the underlying
> > > > storage. Basically, anyone wanting to map blocks needs a file layout
> > > > lease and then to manage the filesystem state over that range via
> > > > these methods in the struct export_operations:
> > > >
> > > > int (*get_uuid)(struct super_block *sb, u8 *buf, u32 *len, u64 *offset);
> > > > int (*map_blocks)(struct inode *inode, loff_t offset,
> > > > u64 len, struct iomap *iomap,
> > > > bool write, u32 *device_generation);
> > > > int (*commit_blocks)(struct inode *inode, struct iomap *iomaps,
> > > > int nr_iomaps, struct iattr *iattr);
> > > >
> > > > Basically, before you read/write data, you map the blocks. If you've
> > > > written data, then you need to commit the blocks (i.e. tell the fs
> > > > they've been written to).
> > > >
> > > > The iomap will give you a contiguous LBA range and the block device
> > > > they belong to, and you can then use that for whatever smart DMA stuff
> > > > you need to do through the block device directly.
> > > >
> > > > If the filesystem wants the space back (e.g. because truncate) then
> > > > the lease will be revoked. The client then must finish off its
> > > > outstanding operations, commit them and release the lease. To access
> > > > the file range again, it must renew the lease and remap the file
> > > > through ->map_blocks....
> > >
> > > Sorry, I should have explained why leases do not work. Here is a list of
> > > lease shortcomings AFAIK:
> > > - only one process
> >
> > Sorry, what? The lease is taken by an application process that then
> > hands out the mapping to whatever parts of it - processes, threads,
> > remote clients, etc - need access. If your application doesn't
> > have an access co-ordination method, then you're already completely
> > screwed.
>
> Then I am completely screwed :) The thing is that today mmap of a file
> does not mandate any kind of synchronization between the processes that
> mmap the file, and thus we have a lot of existing applications that are
> happy with that programming model; I do not see any reason to force
> something new on them.
The whole model is fundamentally broken, and instead of making
things really fucking complex to try and work around the brokenness
we should simply /fix the underlying problem/.
> Here I am only proposing an optimization: possibly skipping the page
> cache intermediary if the filesystem and block device allow it and can
> do it (depending on any number of runtime conditions).
And that's where all the problems and demons lie.
> So leases are inherently not compatible with mmap of a file by multiple
> processes. I understand you want to push your access model, but the
> reality is that we have existing applications that just do not fit
> that model, and I do not see any reason to ask them to change; that
> is never a successful approach.
That's bullshit.
I'm tired of you saying "we must use mmap" and then ignoring the
people saying "mmap on files for direct block device access is
broken - use a layout lease and then /mmap the block device
directly/".
You don't need to change how the applications work - they just need
to add a layer of /access arbitration/ to get direct access to the
block device.
That's the whole point of filesystems - they are an access
arbitration layer for a block device. That's why people use them
instead of managing raw block device space themselves. We have
multiple different applications - some which have nothing to do with
RDMA, peer-to-peer hardware, etc that want direct access to the
block device that is managed by a local filesystem. mmap()ing files
simply doesn't work for them because remote access arbitration is a
set of network protocols, not hardware devices.
We want all these applications to be able to work together in a sane
way. Your insistence that your applications /must/ use mmap to
directly access block devices, but then /must/ abstract that direct
access to the block device by mmap()ing files without any guarantee
that the file mapping is stable is a horrible, unstable, and largely
unworkable architecture.
Just because your applications do it now doesn't mean it's the right
way to do it. The stupid thing about all this is that the change to
use layout leases and layout mapping requests in the applications is
actually very minimal (yeah, if you have a layout lease then FIEMAP
output will actually be stable for the lease range!), and if the
applications and systems are set up properly then the probability of
lease revocation is almost non-existent.
You still set up your direct access mapping from the application via
mmap(), except now it's via mmap() on the block device rather than
through the file.
This is exactly the same model as the pNFS remote access - the
remote client is given a block device mapping, and it goes and
accesses it directly via iscsi, iSER, or whatever RDMA transport the
block device provides the pNFS client. The local filesystem that
handed out the layout lease /just doesn't care/ how the application
interacts with the block device or what it does with the mapping.
And guess what? This works with DAX filesystems and block devices,
too, without the application even having to be aware it's on DAX
capable filesystems and hardware.
I'm arguing for a sane, generic method of offloading direct access
to the block device underneath a filesystem. I'm not trying to
change the way your applications or your peer-to-peer application
mappings work. All I want is that these *filesystem bypass*
storage access methods all use the same access arbitration interface
with the same semantics, and so applications can be completely
agnostic as to what filesystem and/or hardware lies underneath them.
> > What's a "device file" and how is that any difference from a normal
> > kernel file?
>
> Many device drivers expose objects (for instance all the GPU drivers)
> through their device file and allow userspace to mmap the device file to
> access those objects. At a given offset in the device file you will find
> a given object, and this can be per application or global to all
> applications.
This is irrelevant for how you access the block device underneath
a local filesystem. The application has to do something to map these
objects in the "device file" correctly, just like it needs to do
something to correctly map the storage underneath the filesystem.
> > Please stop trying to invent new and excitingly complex ways to do
> > direct block access, because we already have infrastructure that we know
> > works, that we already support, and that is flexible enough to provide
> > exactly the sort of direct block device access mechanisms that you are
> > asking for.
>
> I understand you do not like mmap, but it has been around long enough
> that it is extensively used, and we have to accept that. There is no
> changing all the applications that exist out there and that rely on
> mmap, and it is those applications that such an optimization would
> help.
You're arguing that "mmap() works for me, so it's good enough for
everyone" but filesystem engineers have known this is not true
for years. I also really don't buy this "file leases are going to
fundamentally change how applications work" argument. All it changes
is what the application mmap()s to get direct access to the
underlying storage. It works with any storage hardware and it works
with both local and remote direct access because it completely
avoids the need for the filesystem to manage data transfers and/or
storage hardware.
From the filesystem architecture perspective, we need a clean,
generic, reliable filesystem bypass mechanism people can build any
application on top of. File-backed mappings simply do not provide
the necessary semantics, APIs, guarantees or revocation model
(SIGBUS != revocation model) that we know are required by various
existing userspace applications. Your application(s) are not the
only direct storage access applications people are trying to
implement, nor are they the only ones we'll have to support. They
all need to use the same interface and methods - anything else is
simply going to be an unsupportable, shitty mess.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Apr 29, 2019 at 09:26:45AM -0400, Jerome Glisse wrote:
> This is a filesystem opt-in feature: if a given filesystem does not want
> to implement it, it simply does not, and the page cache will be used. It
> is not mandatory; I am not forcing anyone. The original motivation was
> not filesystems but mmap of device files, but as LSF/MM is coming up, I
> thought it would be a good time to propose it for filesystems too. If
> you do not want it for your filesystem, then just NAK any patch that
> adds it to a filesystem you care about.
No. This is stupid, broken, and wrong. I know we already have
application-visible differences between filesystems, and every single one
of those is a bug. They may be hard bugs to fix, they may be bugs that we
feel like we can't fix, they may never be fixed. But they are all bugs.
Applications should be able to work on any Linux filesystem without
having to care what it is. Code has a tendency to far outlive its
authors' expectations (and indeed sometimes its authors). If 'tar' had
an #ifdef XFS / #elsif EXT4 / #elsif BTRFS / ... #endif, that would be
awful.
We need the same semantics across all major filesystems. Anything else
is us making application developers lives harder than necessary, and
that's unacceptable.