2018-10-02 21:13:42

by Jan Kara

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

[Added ext4, xfs, and linux-api folks to CC for the interface discussion]

On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> > Hello,
> >
> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> > removed the VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in
> > the meantime a certain customer of ours started poking into
> > /proc/<pid>/smaps and looking at the VMA flags there, and if VM_MIXEDMAP is
> > missing among the VMA flags, the application just fails to start,
> > complaining that DAX support is missing in the kernel. The question now is
> > how do we go about this?
>
> OK naive question from me, how do we want an application to be able to
> check if it is running on a DAX mapping?

The question from me is: Should application really care? After all DAX is
just a caching decision. Sure it affects performance characteristics and
memory usage of the kernel but it is not a correctness issue (in particular
we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
supported for current mapping). And in the future the details of what we do
with DAX mapping can change - e.g. I could imagine we might decide to cache
writes in DRAM but do direct PMEM access on reads. And all this could be
auto-tuned based on media properties. And we don't want to tie our hands by
specifying too narrowly how the kernel is going to behave.
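
For completeness, that EOPNOTSUPP behaviour already gives applications a
probe at mmap() time. A minimal sketch (the fallback defines assume the
asm-generic flag values in case the libc headers don't expose them yet):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <errno.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000
#endif

static void *map_sync_or_fallback(int fd, size_t len)
{
        void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

        if (addr == MAP_FAILED && errno == EOPNOTSUPP)
                /* No MAP_SYNC here; fall back to a plain shared
                 * mapping and rely on msync() for integrity. */
                addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
        return addr;
}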

OTOH I understand that e.g. for a large database application the difference
between a DAX and a non-DAX mapping can be the difference between performing
fine and performing terribly / killing the machine, so such an application
might want to determine / force the caching policy to save the sysadmin from
debugging why the application is misbehaving.

> AFAIU DAX is always associated with a file descriptor of some kind (be
> it a real file with filesystem dax or the /dev/dax device file for
> device dax). So could a new fcntl() be of any help here? IS_DAX() only
> checks for the S_DAX flag in inode::i_flags, so this should be doable
> for both fsdax and devdax.

So fcntl() to query DAX usage is one option. Another option is the GETFLAGS
ioctl with which you can query the state of S_DAX flag (works only for XFS
currently). But that inode flag was meant more as a hint "use DAX if
available" AFAIK so that's probably not really suitable for querying
whether DAX is really in use or not. Since DAX is really about caching
policy, I was also thinking that we could use madvise / fadvise for this.
I.e., something like MADV_DIRECT_ACCESS which would return with success if
DAX is in use, with error if not. Later, kernel could use it as a hint to
really force DAX on a mapping and not try clever caching policies...
Thoughts?
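
(For reference, querying that existing per-inode flag looks roughly like the
sketch below - this assumes the XFS FS_IOC_FSGETXATTR interface and, as said
above, it only reports the advisory "use DAX if available" hint, not whether
DAX is actually in effect:)

#include <sys/ioctl.h>
#include <linux/fs.h>
#include <stdbool.h>

/* Reports the advisory per-inode DAX hint only - it does NOT prove the
 * mapping will bypass the page cache, which is exactly the problem. */
static bool inode_dax_hint_set(int fd)
{
        struct fsxattr fsx;

        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
                return false;
        return fsx.fsx_xflags & FS_XFLAG_DAX;
}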

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR


2018-10-03 03:04:11

by Dan Williams

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <[email protected]> wrote:
>
> On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > No, it should not. DAX is an implementation detail that may change
> > > > or go away at any time.
> > >
> > > Well we had an issue with an application checking for dax, this is how
> > > we landed here in the first place.
> >
> > So what exactly is that "DAX" they are querying about (and no, I'm not
> > joking, nor being philosophical).
>
> I believe the application we are speaking about is mostly concerned about
> the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> DRAM, the database running on it is about that size as well and they want
> database state stored somewhere persistently - which they may want to do by
> modifying mmaped database files if they do small updates... So they really
> want to be able to use close to all DRAM for the DB and not leave slack
> space for the kernel page cache to cache 1TB of database files.

VM_MIXEDMAP was never a reliable indication of DAX because it could be
set for random other device-drivers that use vm_insert_mixed(). The
MAP_SYNC flag positively indicates that page cache is disabled for a
given mapping, although whether that property is due to "dax" or some
other kernel mechanics is purely an internal detail.

I'm not opposed to faking out VM_MIXEDMAP if this broken check has
made it into production, but again, it's unreliable.

2018-10-31 15:06:02

by Yasunori Gotou (Fujitsu)

Subject: RE: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Hello,

> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <[email protected]> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > > data is consistent on media before a write fault is completed. A
> > > > > > side-effect is that the page cache will not be used for
> > > > > > writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesystem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
>
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
>
> And that's the whole point.
>
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
>
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
>
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptable outcome.
>
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
>
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
>
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
>
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory. DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
>
> In summary:
>
> MAP_DIRECT is an access hint.
>
> MAP_SYNC provides a data integrity model guarantee.
>
> MAP_SYNC may imply MAP_DIRECT for specific implementations,
> but it does not require or guarantee MAP_DIRECT.
>
> Let's compare that with O_DIRECT:
>
> O_DIRECT is an access hint.
>
> O_DSYNC provides a data integrity model guarantee.
>
> O_DSYNC may imply O_DIRECT for specific implementations, but
> it does not require or guarantee O_DIRECT.
>
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.

Hmmm, then I would like to know all of the reasons why MAP_DIRECT can break.
(I'm not opposed to your opinion, but I need to know them.)

In the O_DIRECT case, in my understanding, the reason O_DIRECT breaks is
"the application specified a wrong alignment", right?

When the filesystem cannot use O_DIRECT and falls back to the page cache,
the system uses more memory than the user expects.
So there is a side effect, and it may cause other trouble
(memory pressure, the expected performance cannot be reached, and so on).

In such a case the administrator (or a technical support engineer) has to
struggle to investigate what the reason is.
If the reason for the breakage is clear, it helps to find the root cause,
and they can ask the developer of the offending application to fix the problem:
"Please fix the alignment!".

So I would like to know: in the MAP_DIRECT case, what are the reasons?
I think it would be helpful for users.
Only splice?

(Maybe such documentation will be necessary....)

Thanks,

>
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
>
> Why make it more complex for applications than it needs to be?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2018-10-18 08:23:32

by Dave Chinner

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> > [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
> >
> > On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
> >> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
> >> > Hello,
> >> >
> >> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
> >> > removed the VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in
> >> > the meantime a certain customer of ours started poking into
> >> > /proc/<pid>/smaps and looking at the VMA flags there, and if VM_MIXEDMAP is
> >> > missing among the VMA flags, the application just fails to start,
> >> > complaining that DAX support is missing in the kernel. The question now is
> >> > how do we go about this?
> >>
> >> OK naive question from me, how do we want an application to be able to
> >> check if it is running on a DAX mapping?
> >
> > The question from me is: Should application really care? After all DAX is
> > just a caching decision. Sure it affects performance characteristics and
> > memory usage of the kernel but it is not a correctness issue (in particular
> > we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> > supported for current mapping). And in the future the details of what we do
> > with DAX mapping can change - e.g. I could imagine we might decide to cache
> > writes in DRAM but do direct PMEM access on reads. And all this could be
> > auto-tuned based on media properties. And we don't want to tie our hands by
> > specifying too narrowly how the kernel is going to behave.
>
> For read and write, I would expect the O_DIRECT open flag to still work,
> even for dax-capable persistent memory. Is that a contentious opinion?

Not contentious at all, because that's the way it currently works.
FYI, XFS decides what to do with read (and similarly writes) like
this:

if (IS_DAX(inode))
        ret = xfs_file_dax_read(iocb, to);
else if (iocb->ki_flags & IOCB_DIRECT)
        ret = xfs_file_dio_aio_read(iocb, to);
else
        ret = xfs_file_buffered_aio_read(iocb, to);

Neither DAX nor O_DIRECT on pmem uses the page cache - the only difference
between the DAX read/write path and the O_DIRECT read/write path
is where the memcpy() into the user buffer is done. For DAX
it's done in the fsdax layer, for O_DIRECT it's done in the pmem
block driver.

> So, what we're really discussing is the behavior for mmap.

Yes.

> MAP_SYNC
> will certainly ensure that the page cache is not used for writes. It
> would also be odd for us to decide to cache reads. The only issue I can
> see is that perhaps the application doesn't want to take a performance
> hit on write faults. I haven't heard that concern expressed in this
> thread, though.
>
> Just to be clear, this is my understanding of the world:
>
> MAP_SYNC
> - file system guarantees that metadata required to reach faulted-in file
> data is consistent on media before a write fault is completed. A
> side-effect is that the page cache will not be used for
> writably-mapped pages.

I think you are conflating current implementation with API
requirements - MAP_SYNC doesn't guarantee anything about page cache
use. The man page definition simply says "supported only for files
supporting DAX" and that it provides certain data integrity
guarantees. It does not define the implementation.

We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
because that's the only way we can currently provide the required
behaviour to userspace. However, if a filesystem can use the page
cache to provide the required functionality, then it's free to do
so.

i.e. if someone implements a pmem-based page cache, MAP_SYNC data
integrity could be provided /without DAX/ by any filesystem using
that persistent page cache. i.e. MAP_SYNC really only requires
mmap() of CPU addressable persistent memory - it does not require
DAX. Right now, however, the only way to get this functionality is
through a DAX capable filesystem on dax capable storage.

And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
COWs new pages in pmem and attaches them to a special per-inode cache
on clean->dirty transition. Then on data sync, background writeback
or crash recovery, it migrates them from the cache into the file map
proper via atomic metadata pointer swaps.

IOWs, NOVA provides the correct MAP_SYNC semantics by using a
separate persistent per-inode write cache to provide the correct
crash recovery semantics for MAP_SYNC.

> and what I think Dan had proposed:
>
> mmap flag, MAP_DIRECT
> - file system guarantees that page cache will not be used to front storage.
> storage MUST be directly addressable. This *almost* implies MAP_SYNC.
> The subtle difference is that a write fault /may/ not result in metadata
> being written back to media.

Similar to O_DIRECT, these semantics do not allow userspace apps to
replace msync/fsync with CPU cache flush operations. So any
application that uses this mode still needs to use either MAP_SYNC
or issue msync/fsync for data integrity.
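
To make that distinction concrete, a rough sketch (assuming libpmem's
pmem_persist() as the userspace flush primitive; the helper names are just
illustrative):

#include <libpmem.h>
#include <sys/mman.h>

/* MAP_SYNC: the metadata needed to reach the data is already durable
 * after the write fault, so flushing CPU caches persists the data. */
static void persist_with_map_sync(void *addr, size_t len)
{
        pmem_persist(addr, len);        /* clwb/clflushopt + fence */
}

/* A MAP_DIRECT-style hint alone (or any ordinary mapping): the app
 * still has to go through the kernel for data integrity. */
static int persist_without_map_sync(void *addr, size_t len)
{
        return msync(addr, len, MS_SYNC);
}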

If the app is using MAP_DIRECT, then what do we do if the filesystem
can't provide the required semantics for that specific operation? In
the case of O_DIRECT, we fall back to buffered IO because it has the
same data integrity semantics as O_DIRECT and will always work. It's
just slower and consumes more memory, but the app continues to work
just fine.

Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
without using the pagecache seems extremely problematic to me. e.g.
an app already has an open MAP_DIRECT file, and a third party
reflinks it or dedupes it and the fs has to fall back to buffered IO
to do COW operations. This isn't the app's fault - the kernel should
just fall back transparently to using the page cache for the
MAP_DIRECT app and just keep working, just like it would if it was
using O_DIRECT read/write.

The point I'm trying to make here is that O_DIRECT is a /hint/, not
a guarantee, and it's done that way to prevent applications from
being presented with transient, potentially fatal error cases
because a filesystem implementation can't do a specific operation
through the direct IO path.

IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
guarantee. Over time we'll end up with filesystems that can
guarantee that MAP_DIRECT is always going to use DAX, in the same
way we have filesystems that guarantee O_DIRECT will always be
O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
no page cache will ever be used, then we are basically saying
"filesystems won't provide MAP_DIRECT even in common, useful cases
because they can't provide MAP_DIRECT in all cases." And that
doesn't seem like a very good solution to me.

> and this is what I think you were proposing, Jan:
>
> madvise flag, MADV_DIRECT_ACCESS
> - same semantics as MAP_DIRECT, but specified via the madvise system call

Seems to be the equivalent of fcntl(F_SETFL, O_DIRECT). Makes sense
to have both MAP_DIRECT and MADV_DIRECT_ACCESS to me - one is an
init time flag, the other is a run time flag.
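
For comparison, the existing run time toggle for O_DIRECT is just:

#define _GNU_SOURCE
#include <fcntl.h>

static int enable_odirect(int fd)
{
        int flags = fcntl(fd, F_GETFL);

        if (flags < 0)
                return -1;
        /* O_DIRECT can be set at open() time or flipped at run time. */
        return fcntl(fd, F_SETFL, flags | O_DIRECT);
}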

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-10-03 21:55:48

by Jan Kara

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed 03-10-18 07:38:50, Dan Williams wrote:
> On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <[email protected]> wrote:
> >
> > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <[email protected]> wrote:
> > > >
> > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > No, it should not. DAX is an implementation detail that may change
> > > > > > > or go away at any time.
> > > > > >
> > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > we landed here in the first place.
> > > > >
> > > > > So what exactly is that "DAX" they are querying about (and no, I'm not
> > > > > joking, nor being philosophical).
> > > >
> > > > I believe the application we are speaking about is mostly concerned about
> > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > DRAM, the database running on it is about that size as well and they want
> > > > database state stored somewhere persistently - which they may want to do by
> > > > modifying mmaped database files if they do small updates... So they really
> > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > space for the kernel page cache to cache 1TB of database files.
> > >
> > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > set for random other device-drivers that use vm_insert_mixed(). The
> > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > given mapping, although whether that property is due to "dax" or some
> > > other kernel mechanics is purely an internal detail.
> > >
> > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > made it into production, but again, it's unreliable.
> >
> > So luckily this particular application wasn't widely deployed yet so we
> > will likely get away with the vendor asking customers to update to a
> > version not looking into smaps and parsing /proc/mounts instead.
> >
> > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > if we had a better interface for applications to query whether they can
> > avoid page cache for mmaps or not.
>
> Yeah, the mount flag is not a good indicator either. I think we need
> to follow through on the per-inode property of DAX. Darrick and I
> discussed just allowing the property to be inherited from the parent
> directory at file creation time. That avoids the dynamic set-up /
> teardown races that seem intractable at this point.
>
> What's wrong with MAP_SYNC as a page-cache detector in the meantime?

So IMHO checking for MAP_SYNC is about as reliable as checking for the 'dax'
mount option. It works now, but nobody promises it will reliably detect DAX in
the future - e.g. there's nothing that prevents MAP_SYNC from working for
mappings using the pagecache if we find a sensible usecase for that.

WRT per-inode DAX property, AFAIU that inode flag is just going to be
advisory thing - i.e., use DAX if possible. If you mount a filesystem with
these inode flags set in a configuration which does not allow DAX to be
used, you will still be able to access such inodes but the access will use
page cache instead. And querying these flags should better show real
on-disk status and not just whether DAX is used as that would result in an
even bigger mess. So this feature seems to be somewhat orthogonal to the
API I'm looking for.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-02 21:51:01

by Jan Kara

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue 02-10-18 07:37:13, Christoph Hellwig wrote:
> On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote:
> > > OK naive question from me, how do we want an application to be able to
> > > check if it is running on a DAX mapping?
> >
> > The question from me is: Should application really care?
>
> No, it should not. DAX is an implementation detail that may change
> or go away at any time.

I agree that whether / how pagecache is used for filesystem access is an
implementation detail of the kernel. OTOH for some workloads it is about
whether kernel needs gigabytes of RAM to cache files or not, which is not a
detail anymore if you want to fully utilize the machine. So people will be
asking for this and will be finding odd ways to determine whether DAX is
used or not (such as poking in smaps). And once there is some widely enough
used application doing this, it is not "stupid application" problem anymore
but the kernel's problem of not maintaining backward compatibility.

So I think we would be better off providing *some* API which applications
can use to determine whether pagecache is used or not and make sure this
API will convey the right information even if we change DAX
implementation or remove it altogether.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-03 22:02:38

by Dan Williams

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <[email protected]> wrote:
>
> On Wed 03-10-18 07:38:50, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <[email protected]> wrote:
> > >
> > > On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > > > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <[email protected]> wrote:
> > > > >
> > > > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > > > No, it should not. DAX is an implementation detail that may change
> > > > > > > > or go away at any time.
> > > > > > >
> > > > > > > Well we had an issue with an application checking for dax, this is how
> > > > > > > we landed here in the first place.
> > > > > >
> > > > > > So what exactly is that "DAX" they are querying about (and no, I'm not
> > > > > > joking, nor being philosophical).
> > > > >
> > > > > I believe the application we are speaking about is mostly concerned about
> > > > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > > > DRAM, the database running on it is about that size as well and they want
> > > > > database state stored somewhere persistently - which they may want to do by
> > > > > modifying mmaped database files if they do small updates... So they really
> > > > > want to be able to use close to all DRAM for the DB and not leave slack
> > > > > space for the kernel page cache to cache 1TB of database files.
> > > >
> > > > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > > > set for random other device-drivers that use vm_insert_mixed(). The
> > > > MAP_SYNC flag positively indicates that page cache is disabled for a
> > > > given mapping, although whether that property is due to "dax" or some
> > > > other kernel mechanics is purely an internal detail.
> > > >
> > > > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > > > made it into production, but again, it's unreliable.
> > >
> > > So luckily this particular application wasn't widely deployed yet so we
> > > will likely get away with the vendor asking customers to update to a
> > > version not looking into smaps and parsing /proc/mounts instead.
> > >
> > > But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> > > if we had a better interface for applications to query whether they can
> > > avoid page cache for mmaps or not.
> >
> > Yeah, the mount flag is not a good indicator either. I think we need
> > to follow through on the per-inode property of DAX. Darrick and I
> > discussed just allowing the property to be inherited from the parent
> > directory at file creation time. That avoids the dynamic set-up /
> > teardown races that seem intractable at this point.
> >
> > What's wrong with MAP_SYNC as a page-cache detector in the meantime?
>
> So IMHO checking for MAP_SYNC is about as reliable as checking for the 'dax'
> mount option. It works now, but nobody promises it will reliably detect DAX in
> the future - e.g. there's nothing that prevents MAP_SYNC from working for
> mappings using the pagecache if we find a sensible usecase for that.

Fair enough.

> WRT per-inode DAX property, AFAIU that inode flag is just going to be
> advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> these inode flags set in a configuration which does not allow DAX to be
> used, you will still be able to access such inodes but the access will use
> page cache instead. And querying these flags should better show real
> on-disk status and not just whether DAX is used as that would result in an
> even bigger mess. So this feature seems to be somewhat orthogonal to the
> API I'm looking for.

True, I imagine once we have that flag we will be able to distinguish
the "saved" property and the "effective / live" property of DAX...
Also it's really not DAX that applications care about as much as "is
there page-cache indirection / overhead for this mapping?". That seems
to be a narrower guarantee that we can make than what "DAX" might
imply.

2018-10-03 23:52:56

by Jan Kara

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed 03-10-18 08:13:37, Dan Williams wrote:
> On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <[email protected]> wrote:
> > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > these inode flags set in a configuration which does not allow DAX to be
> > used, you will still be able to access such inodes but the access will use
> > page cache instead. And querying these flags should better show real
> > on-disk status and not just whether DAX is used as that would result in an
> > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > API I'm looking for.
>
> True, I imagine once we have that flag we will be able to distinguish
> the "saved" property and the "effective / live" property of DAX...
> Also it's really not DAX that applications care about as much as "is
> there page-cache indirection / overhead for this mapping?". That seems
> to be a narrower guarantee that we can make than what "DAX" might
> imply.

Right. So what do people think about my suggestion earlier in the thread to
use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
when DAX is in use, failure otherwise. Later we could extend it to be also
used as a hint for caching policy for the inode...
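
From the application side that would look something like this (just a
sketch - MADV_DIRECT_ACCESS is only proposed here, so the constant below is
a made-up placeholder):

#include <sys/mman.h>

#ifndef MADV_DIRECT_ACCESS
#define MADV_DIRECT_ACCESS 0x100   /* hypothetical, not an existing advice value */
#endif

static int mapping_is_dax(void *addr, size_t len)
{
        /* Success would mean DAX (no page cache) is in use for the range;
         * an error would mean the page cache path is used. */
        return madvise(addr, len, MADV_DIRECT_ACCESS) == 0;
}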

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-04 04:04:15

by Dan Williams

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 3, 2018 at 9:46 AM Jan Kara <[email protected]> wrote:
>
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <[email protected]> wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> >
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
>
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

The only problem is that you can't use it purely as a query. If we
ever did plumb it to be a hint you could not read the state without
writing the state.

mincore(2) seems to be close to the intent of discovering whether RAM is
being consumed for a given address range, but it currently is
implemented to only indicate if *any* mapping is established, not
whether RAM is consumed. I can see an argument that a dax mapped file
should always report an empty mincore vector.
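
For reference, such a probe with today's mincore() semantics would look
roughly like the sketch below - and, as noted, it currently reports
"mapped/resident" rather than "RAM consumed", which is the gap:

#include <sys/mman.h>
#include <unistd.h>
#include <stdlib.h>

/* Count pages of [addr, addr + len) that mincore() reports as in core.
 * A faulted DAX mapping still shows up here, so this cannot yet tell
 * page-cache DRAM usage apart from direct pmem access. */
static long resident_pages(void *addr, size_t len)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        unsigned char *vec = malloc(npages);
        long count = 0;

        if (!vec || mincore(addr, len, vec) < 0) {
                free(vec);
                return -1;
        }
        for (size_t i = 0; i < npages; i++)
                count += vec[i] & 1;
        free(vec);
        return count;
}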

2018-10-30 15:23:04

by Dan Williams

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <[email protected]> wrote:
>
> On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > MAP_SYNC
> > > > - file system guarantees that metadata required to reach faulted-in file
> > > > data is consistent on media before a write fault is completed. A
> > > > side-effect is that the page cache will not be used for
> > > > writably-mapped pages.
> > >
> > > I think you are conflating current implementation with API
> > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > use. The man page definition simply says "supported only for files
> > > supporting DAX" and that it provides certain data integrity
> > > guarantees. It does not define the implementation.
> > >
> > > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > > because that's the only way we can currently provide the required
> > > behaviour to userspace. However, if a filesystem can use the page
> > > cache to provide the required functionality, then it's free to do
> > > so.
> > >
> > > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > > integrity could be provided /without DAX/ by any filesystem using
> > > that persistent page cache. i.e. MAP_SYNC really only requires
> > > mmap() of CPU addressable persistent memory - it does not require
> > > DAX. Right now, however, the only way to get this functionality is
> > > through a DAX capable filesystem on dax capable storage.
> > >
> > > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > > COWs new pages in pmem and attaches them to a special per-inode cache
> > > on clean->dirty transition. Then on data sync, background writeback
> > > or crash recovery, it migrates them from the cache into the file map
> > > proper via atomic metadata pointer swaps.
> > >
> > > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > > separate persistent per-inode write cache to provide the correct
> > > crash recovery semantics for MAP_SYNC.
> >
> > Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> > effectively it will also be able to provide MAP_DIRECT semantics, right?
>
> Yes, I think so. It still needs to do COW on first write fault,
> but then the app has direct access to the data buffer until it is
> cleaned and put back in place. The "put back in place" is just an
> atomic swap of metadata pointers, so it doesn't need the page cache
> at all...
>
> > Because there won't be DRAM between app and persistent storage and I don't
> > think COW tricks or other data integrity methods are that interesting for
> > the application.
>
> Not for the application, but the filesystem still wants to support
> snapshots and other such functionality that requires COW. And NOVA
> doesn't have write-in-place functionality at all - it always COWs
> on the clean->dirty transition.
>
> > Most users of O_DIRECT are concerned about getting close
> > to media speed performance and low DRAM usage...
>
> *nod*
>
> > > > and what I think Dan had proposed:
> > > >
> > > > mmap flag, MAP_DIRECT
> > > > - file system guarantees that page cache will not be used to front storage.
> > > > storage MUST be directly addressable. This *almost* implies MAP_SYNC.
> > > > The subtle difference is that a write fault /may/ not result in metadata
> > > > being written back to media.
> > >
> > > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > > replace msync/fsync with CPU cache flush operations. So any
> > > application that uses this mode still needs to use either MAP_SYNC
> > > or issue msync/fsync for data integrity.
> > >
> > > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > > can't provide the required semantics for that specific operation? In
> > > the case of O_DIRECT, we fall back to buffered IO because it has the
> > > same data integrity semantics as O_DIRECT and will always work. It's
> > > just slower and consumes more memory, but the app continues to work
> > > just fine.
> > >
> > > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > > without using the pagecache seems extremely problematic to me. e.g.
> > > an app already has an open MAP_DIRECT file, and a third party
> > > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > > to do COW operations. This isn't the app's fault - the kernel should
> > > just fall back transparently to using the page cache for the
> > > MAP_DIRECT app and just keep working, just like it would if it was
> > > using O_DIRECT read/write.
> >
> > There's another option of failing reflink / dedupe with EBUSY if the file
> > is mapped with MAP_DIRECT and the filesystem cannot support reflink &
> > MAP_DIRECT together. But there are downsides to that as well.
>
> Yup, not the least that setting MAP_DIRECT can race with a
> reflink....
>
> > > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > > a guarantee, and it's done that way to prevent applications from
> > > being presented with transient, potentially fatal error cases
> > > because a filesystem implementation can't do a specific operation
> > > through the direct IO path.
> > >
> > > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > > guarantee. Over time we'll end up with filesystems that can
> > > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > > way we have filesystems that guarantee O_DIRECT will always be
> > > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > > no page cache will ever be used, then we are basically saying
> > > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > > because they can't provide MAP_DIRECT in all cases." And that
> > > doesn't seem like a very good solution to me.
> >
> > These are good points. I'm just somewhat wary of the situation where users
> > will map files with MAP_DIRECT and then the machine starts thrashing
> > because the file got reflinked and thus pagecache gets used suddenly.
>
> It's still better than apps randomly getting SIGBUS.
>
> FWIW, this suggests that we probably need to be able to host both
> DAX pages and page cache pages on the same file at the same time,
> and be able to handle page faults based on the type of page being
> mapped (different sets of fault ops for different page types?)
> and have fallback paths when the page type needs to be changed
> between direct and cached during the fault....
>
> > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > filesystems) so usually people just won't notice. If fallback for
> > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
>
> Which is just like the situation where O_DIRECT on ext3 was not very
> useful, but on other filesystems like XFS it was fully functional.
>
> IMO, the fact that a specific filesystem has a suboptimal fallback
> path for an uncommon behaviour isn't an argument against MAP_DIRECT
> as a hint - it's actually a feature. If MAP_DIRECT can't be used
> until it's always direct access, then most filesystems wouldn't be
> able to provide any faster paths at all. It's much better to have
> partial functionality now than it is to never have the functionality
> at all, and so we need to design in the flexibility we need to
> iteratively improve implementations without needing API changes that
> will break applications.

The hard guarantee requirement still remains though because an
application that expects combined MAP_SYNC|MAP_DIRECT semantics will
be surprised if the MAP_DIRECT property silently disappears. I think
it still makes some sense as a hint for apps that want to minimize
page cache, but for the applications with a flush from userspace model
I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
still gives the filesystem the option to inject page-cache at will,
but with an application coordination point.

2018-10-31 07:54:47

by Dan Williams

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue, Oct 30, 2018 at 3:49 PM Dave Chinner <[email protected]> wrote:
>
> On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <[email protected]> wrote:
> > > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > > MAP_SYNC
> > > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > > data is consistent on media before a write fault is completed. A
> > > > > > side-effect is that the page cache will not be used for
> > > > > > writably-mapped pages.
> > > > >
> > > > > I think you are conflating current implementation with API
> > > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > > use. The man page definition simply says "supported only for files
> > > > > supporting DAX" and that it provides certain data integrity
> > > > > guarantees. It does not define the implementation.
> ....
> > > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > > filesystems) so usually people just won't notice. If fallback for
> > > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> > >
> > > Which is just like the situation where O_DIRECT on ext3 was not very
> > > useful, but on other filesystems like XFS it was fully functional.
> > >
> > > IMO, the fact that a specific filesystem has a suboptimal fallback
> > > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > > until it's always direct access, then most filesystems wouldn't be
> > > able to provide any faster paths at all. It's much better to have
> > > partial functionality now than it is to never have the functionality
> > > at all, and so we need to design in the flexibility we need to
> > > iteratively improve implementations without needing API changes that
> > > will break applications.
> >
> > The hard guarantee requirement still remains though because an
> > application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> > be surprised if the MAP_DIRECT property silently disappears.
>
> Why would they be surprised? They won't even notice it if the
> filesystem can provide MAP_SYNC without MAP_DIRECT.
>
> And that's the whole point.
>
> MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
> visible to the filesystem and the filesystem does nothing to enforce
> them. If someone does something that requires the page cache (e.g.
> calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
> heap of new work to do. And, in some cases, the filesystem may not
> be able to provide MAP_DIRECT as a result..
>
> IOWs, the filesystem cannot guarantee MAP_DIRECT and the
> circumstances under which MAP_DIRECT will and will not work are
> dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
> applications randomly segfaulting in production as things like
> backups, indexers, etc run over the filesystem and do their work.
>
> This is why MAP_DIRECT needs to be an optimisation, not a
> requirement - things will still work if MAP_DIRECT is not used. What
> matters to these applications is MAP_SYNC - if we break MAP_SYNC,
> then the application data integrity model is violated. That's not an
> acceptable outcome.
>
> The problem, it seems to me, is that people are unable to separate
> MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
> MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
> provide MAP_SYNC functionality on DAX mappings. However, that's a
> /filesystem implementation issue/, not an API guarantee we need to
> provide to userspace.
>
> If we implement a persistent page cache (e.g. allocate page cache
> pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
> could provide applications with the MAP_SYNC data integrity model
> without MAP_DIRECT. Indeed, those filesystems would not even be able
> to provide MAP_DIRECT semantics because they aren't backed by pmem.
>
> Hence if applications that want MAP_SYNC are hard coded
> MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
> those applications are going to fail on a filesystem that provides
> only MAP_SYNC. This is despite the fact the applications would
> function correctly and the data integrity model would be maintained.
> i.e. the failure is because applications have assumed MAP_SYNC can
> only be provided by a DAX implementation, not because MAP_SYNC is
> not supported.
>
> MAP_SYNC really isn't about DAX at all. It's about enabling a data
> integrity model that requires the filesystem to provide userspace
> access to CPU addressable persistent memory. DAX+MAP_DIRECT is just
> one method of providing this functionality, but it's not the only
> method. Our API needs to be future proof rather than an encoding of
> the existing implementation limitations, otherwise apps will have to
> be re-written as every new MAP_SYNC capable technology comes along.
>
> In summary:
>
> MAP_DIRECT is an access hint.
>
> MAP_SYNC provides a data integrity model guarantee.
>
> MAP_SYNC may imply MAP_DIRECT for specific implementations,
> but it does not require or guarantee MAP_DIRECT.
>
> Let's compare that with O_DIRECT:
>
> O_DIRECT is an access hint.
>
> O_DSYNC provides a data integrity model guarantee.
>
> O_DSYNC may imply O_DIRECT for specific implementations, but
> it does not require or guarantee O_DIRECT.
>
> Consistency in access and data integrity models is a good thing. DAX
> and pmem is not an exception. We need to use a model we know works
> and has proven itself over a long period of time.
>
> > I think
> > it still makes some sense as a hint for apps that want to minimize
> > page cache, but for the applications with a flush from userspace model
> > I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> > still gives the filesystem the option to inject page-cache at will,
> > but with an application coordination point.
>
> Why make it more complex for applications than it needs to be?

With the clarification that MAP_SYNC implies "cpu cache flush to
persistent memory page-cache *or* dax to persistent memory" I think
all of the concerns are addressed. I was conflating MAP_DIRECT as "no
page cache indirection", but the indirection does not matter if the
page cache itself is persisted.

2018-10-19 09:01:47

by Dave Chinner

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > MAP_SYNC
> > > - file system guarantees that metadata required to reach faulted-in file
> > > data is consistent on media before a write fault is completed. A
> > > side-effect is that the page cache will not be used for
> > > writably-mapped pages.
> >
> > I think you are conflating current implementation with API
> > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > use. The man page definition simply says "supported only for files
> > supporting DAX" and that it provides certain data integrity
> > guarantees. It does not define the implementation.
> >
> > We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> > because that's the only way we can currently provide the required
> > behaviour to userspace. However, if a filesystem can use the page
> > cache to provide the required functionality, then it's free to do
> > so.
> >
> > i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> > integrity could be provided /without DAX/ by any filesystem using
> > that persistent page cache. i.e. MAP_SYNC really only requires
> > mmap() of CPU addressable persistent memory - it does not require
> > DAX. Right now, however, the only way to get this functionality is
> > through a DAX capable filesystem on dax capable storage.
> >
> > And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> > COWs new pages in pmem and attaches them to a special per-inode cache
> > on clean->dirty transition. Then on data sync, background writeback
> > or crash recovery, it migrates them from the cache into the file map
> > proper via atomic metadata pointer swaps.
> >
> > IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> > separate persistent per-inode write cache to provide the correct
> > crash recovery semantics for MAP_SYNC.
>
> Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
> effectively it will also be able to provide MAP_DIRECT semantics, right?

Yes, I think so. It still needs to do COW on first write fault,
but then the app has direct access to the data buffer until it is
cleaned and put back in place. The "put back in place" is just an
atomic swap of metadata pointers, so it doesn't need the page cache
at all...

> Because there won't be DRAM between app and persistent storage and I don't
> think COW tricks or other data integrity methods are that interesting for
> the application.

Not for the application, but the filesystem still wants to support
snapshots and other such functionality that requires COW. And NOVA
doesn't have write-in-place functionality at all - it always COWs
on the clean->dirty transition.

> Most users of O_DIRECT are concerned about getting close
> to media speed performance and low DRAM usage...

*nod*

> > > and what I think Dan had proposed:
> > >
> > > mmap flag, MAP_DIRECT
> > > - file system guarantees that page cache will not be used to front storage.
> > > storage MUST be directly addressable. This *almost* implies MAP_SYNC.
> > > The subtle difference is that a write fault /may/ not result in metadata
> > > being written back to media.
> >
> > Similar to O_DIRECT, these semantics do not allow userspace apps to
> > replace msync/fsync with CPU cache flush operations. So any
> > application that uses this mode still needs to use either MAP_SYNC
> > or issue msync/fsync for data integrity.
> >
> > If the app is using MAP_DIRECT, then what do we do if the filesystem
> > can't provide the required semantics for that specific operation? In
> > the case of O_DIRECT, we fall back to buffered IO because it has the
> > same data integrity semantics as O_DIRECT and will always work. It's
> > just slower and consumes more memory, but the app continues to work
> > just fine.
> >
> > Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> > without using the pagecache seems extremely problematic to me. e.g.
> > an app already has an open MAP_DIRECT file, and a third party
> > reflinks it or dedupes it and the fs has to fall back to buffered IO
> > to do COW operations. This isn't the app's fault - the kernel should
> > just fall back transparently to using the page cache for the
> > MAP_DIRECT app and just keep working, just like it would if it was
> > using O_DIRECT read/write.
>
> There's another option of failing reflink / dedupe with EBUSY if the file
> is mapped with MAP_DIRECT and the filesystem cannot support reflink &
> MAP_DIRECT together. But there are downsides to that as well.

Yup, not the least that setting MAP_DIRECT can race with a
reflink....

> > The point I'm trying to make here is that O_DIRECT is a /hint/, not
> > a guarantee, and it's done that way to prevent applications from
> > being presented with transient, potentially fatal error cases
> > because a filesystem implementation can't do a specific operation
> > through the direct IO path.
> >
> > IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> > guarantee. Over time we'll end up with filesystems that can
> > guarantee that MAP_DIRECT is always going to use DAX, in the same
> > way we have filesystems that guarantee O_DIRECT will always be
> > O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> > no page cache will ever be used, then we are basically saying
> > "filesystems won't provide MAP_DIRECT even in common, useful cases
> > because they can't provide MAP_DIRECT in all cases." And that
> > doesn't seem like a very good solution to me.
>
> These are good points. I'm just somewhat wary of the situation where users
> will map files with MAP_DIRECT and then the machine starts thrashing
> because the file got reflinked and thus pagecache gets used suddenly.

It's still better than apps randomly getting SIGBUS.

FWIW, this suggests that we probably need to be able to host both
DAX pages and page cache pages on the same file at the same time,
and be able to handle page faults based on the type of page being
mapped (different sets of fault ops for different page types?)
and have fallback paths when the page type needs to be changed
between direct and cached during the fault....

> With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> filesystems) so usually people just won't notice. If fallback for
> MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

Which is just like the situation where O_DIRECT on ext3 was not very
useful, but on other filesystems like XFS it was fully functional.

IMO, the fact that a specific filesystem has a suboptimal fallback
path for an uncommon behaviour isn't an argument against MAP_DIRECT
as a hint - it's actually a feature. If MAP_DIRECT can't be used
until it's always direct access, then most filesystems wouldn't be
able to provide any faster paths at all. It's much better to have
partial functionality now than it is to never have the functionality
at all, and so we need to design in the flexibility we need to
iteratively improve implementations without needing API changes that
will break applications.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-10-04 16:56:41

by Johannes Thumshirn

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 03, 2018 at 06:44:07PM +0200, Jan Kara wrote:
> On Wed 03-10-18 08:13:37, Dan Williams wrote:
> > On Wed, Oct 3, 2018 at 8:07 AM Jan Kara <[email protected]> wrote:
> > > WRT per-inode DAX property, AFAIU that inode flag is just going to be
> > > advisory thing - i.e., use DAX if possible. If you mount a filesystem with
> > > these inode flags set in a configuration which does not allow DAX to be
> > > used, you will still be able to access such inodes but the access will use
> > > page cache instead. And querying these flags should better show real
> > > on-disk status and not just whether DAX is used as that would result in an
> > > even bigger mess. So this feature seems to be somewhat orthogonal to the
> > > API I'm looking for.
> >
> > True, I imagine once we have that flag we will be able to distinguish
> > the "saved" property and the "effective / live" property of DAX...
> > Also it's really not DAX that applications care about as much as "is
> > there page-cache indirection / overhead for this mapping?". That seems
> > to be a narrower guarantee that we can make than what "DAX" might
> > imply.
>
> Right. So what do people think about my suggestion earlier in the thread to
> use madvise(MADV_DIRECT_ACCESS) for this? Currently it would return success
> when DAX is in use, failure otherwise. Later we could extend it to be also
> used as a hint for caching policy for the inode...

Hmm apart from Dan's objection that it can't really be used for a
query, isn't madvise(2) for mmap(2)?

But AFAIU (from looking at the xfs code, so please correct me if I'm
wrong), DAX can be used for the traditional read(2)/write(2) interface
as well.

There is at least:

xfs_file_read_iter()
`-> if (IS_DAX(inode))
    `-> xfs_file_dax_read()
        `-> dax_iomap_rw()

So IMHO something at inode granularity would make more sense to me.

Byte,
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-10-19 05:08:23

by Jeff Moyer

Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Dave,

Thanks for the detailed response! I hadn't considered the NOVA use case
at all.

Cheers,
Jeff

2018-10-03 19:39:14

by Jan Kara

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue 02-10-18 13:18:54, Dan Williams wrote:
> On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <[email protected]> wrote:
> >
> > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > No, it should not. DAX is an implementation detail that may change
> > > > > or go away at any time.
> > > >
> > > > Well we had an issue with an application checking for dax, this is how
> > > > we landed here in the first place.
> > >
> > > So what exactly is that "DAX" they are querying about (and no, I'm not
> > > joking, nor being philosophical).
> >
> > I believe the application we are speaking about is mostly concerned about
> > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > DRAM, the database running on it is about that size as well and they want
> > database state stored somewhere persistently - which they may want to do by
> > modifying mmaped database files if they do small updates... So they really
> > want to be able to use close to all DRAM for the DB and not leave slack
> > space for the kernel page cache to cache 1TB of database files.
>
> VM_MIXEDMAP was never a reliable indication of DAX because it could be
> set for random other device-drivers that use vm_insert_mixed(). The
> MAP_SYNC flag positively indicates that page cache is disabled for a
> given mapping, although whether that property is due to "dax" or some
> other kernel mechanics is purely an internal detail.
>
> I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> made it into production, but again, it's unreliable.

So luckily this particular application wasn't widely deployed yet so we
will likely get away with the vendor asking customers to update to a
version not looking into smaps and parsing /proc/mounts instead.

But I don't find parsing /proc/mounts that beautiful either and I'd prefer
if we had a better interface for applications to query whether they can
avoid page cache for mmaps or not.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-31 07:44:32

by Dave Chinner

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <[email protected]> wrote:
> > On Thu, Oct 18, 2018 at 04:55:55PM +0200, Jan Kara wrote:
> > > On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> > > > On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > > > > MAP_SYNC
> > > > > - file system guarantees that metadata required to reach faulted-in file
> > > > > data is consistent on media before a write fault is completed. A
> > > > > side-effect is that the page cache will not be used for
> > > > > writably-mapped pages.
> > > >
> > > > I think you are conflating current implementation with API
> > > > requirements - MAP_SYNC doesn't guarantee anything about page cache
> > > > use. The man page definition simply says "supported only for files
> > > > supporting DAX" and that it provides certain data integrity
> > > > guarantees. It does not define the implementation.
....
> > > With O_DIRECT the fallback to buffered IO is quite rare (at least for major
> > > filesystems) so usually people just won't notice. If fallback for
> > > MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.
> >
> > Which is just like the situation where O_DIRECT on ext3 was not very
> > useful, but on other filesystems like XFS it was fully functional.
> >
> > IMO, the fact that a specific filesystem has a suboptimal fallback
> > path for an uncommon behaviour isn't an argument against MAP_DIRECT
> > as a hint - it's actually a feature. If MAP_DIRECT can't be used
> > until it's always direct access, then most filesystems wouldn't be
> > able to provide any faster paths at all. It's much better to have
> > partial functionality now than it is to never have the functionality
> > at all, and so we need to design in the flexibility we need to
> > iteratively improve implementations without needing API changes that
> > will break applications.
>
> The hard guarantee requirement still remains though because an
> application that expects combined MAP_SYNC|MAP_DIRECT semantics will
> be surprised if the MAP_DIRECT property silently disappears.

Why would they be surprised? They won't even notice it if the
filesystem can provide MAP_SYNC without MAP_DIRECT.

And that's the whole point.

MAP_DIRECT is a private mapping state. So is MAP_SYNC. They are not
visible to the filesystem and the filesystem does nothing to enforce
them. If someone does something that requires the page cache (e.g.
calls do_splice_direct()) then that MAP_DIRECT mapping has a whole
heap of new work to do. And, in some cases, the filesystem may not
be able to provide MAP_DIRECT as a result.

IOWs, the filesystem cannot guarantee MAP_DIRECT and the
circumstances under which MAP_DIRECT will and will not work are
dynamic. If MAP_DIRECT is supposed to be a guarantee then we'll have
applications randomly segfaulting in production as things like
backups, indexers, etc run over the filesystem and do their work.

This is why MAP_DIRECT needs to be an optimisation, not a
requirement - things will still work if MAP_DIRECT is not used. What
matters to these applications is MAP_SYNC - if we break MAP_SYNC,
then the application data integrity model is violated. That's not an
acceptable outcome.

The problem, it seems to me, is that people are unable to separate
MAP_DIRECT and MAP_SYNC. I suspect that is because, at present,
MAP_SYNC on XFS and ext4 requires MAP_DIRECT. i.e. we can only
provide MAP_SYNC functionality on DAX mappings. However, that's a
/filesystem implementation issue/, not an API guarantee we need to
provide to userspace.

If we implement a persistent page cache (e.g. allocate page cache
pages out of ZONE_DEVICE pmem), then filesystems like XFS and ext4
could provide applications with the MAP_SYNC data integrity model
without MAP_DIRECT. Indeed, those filesystems would not even be able
to provide MAP_DIRECT semantics because they aren't backed by pmem.

Hence if applications that want MAP_SYNC hard-code
MAP_SYNC|MAP_DIRECT and we make MAP_DIRECT a hard guarantee, then
those applications are going to fail on a filesystem that provides
only MAP_SYNC. This is despite the fact the applications would
function correctly and the data integrity model would be maintained.
i.e. the failure is because applications have assumed MAP_SYNC can
only be provided by a DAX implementation, not because MAP_SYNC is
not supported.

MAP_SYNC really isn't about DAX at all. It's about enabling a data
integrity model that requires the filesystem to provide userspace
access to CPU addressable persistent memory. DAX+MAP_DIRECT is just
one method of providing this functionality, but it's not the only
method. Our API needs to be future proof rather than an encoding of
the existing implementation limitations, otherwise apps will have to
be re-written as every new MAP_SYNC capable technology comes along.

In summary:

MAP_DIRECT is an access hint.

MAP_SYNC provides a data integrity model guarantee.

MAP_SYNC may imply MAP_DIRECT for specific implementations,
but it does not require or guarantee MAP_DIRECT.

Let's compare that with O_DIRECT:

O_DIRECT is an access hint.

O_DSYNC provides a data integrity model guarantee.

O_DSYNC may imply O_DIRECT for specific implementations, but
it does not require or guarantee O_DIRECT.

Consistency in access and data integrity models is a good thing. DAX
and pmem is not an exception. We need to use a model we know works
and has proven itself over a long period of time.
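
To make that concrete from the application side, a rough sketch of the
intended pattern follows. MAP_DIRECT here is only the flag being
proposed in this thread - it does not exist in any kernel, so its value
is invented purely for illustration; MAP_SYNC and MAP_SHARED_VALIDATE
are the existing flags:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE     0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC                0x80000
#endif
#ifndef MAP_DIRECT
#define MAP_DIRECT              0x800000        /* hypothetical bit */
#endif

static void *map_pmem_file(int fd, size_t len)
{
        /* Ask for the optimisation (direct access) plus the guarantee. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC | MAP_DIRECT, fd, 0);

        if (p == MAP_FAILED)
                /* Drop the hint, keep the data integrity guarantee. */
                p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

        /* Only if this also fails is the integrity model unavailable
         * and the caller has to fall back to msync()/fsync(). */
        return p;
}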

> I think
> it still makes some sense as a hint for apps that want to minimize
> page cache, but for the applications with a flush from userspace model
> I think that wants to be an F_SETLEASE / F_DIRECTLCK operation. This
> still gives the filesystem the option to inject page-cache at will,
> but with an application coordination point.

Why make it more complex for applications than it needs to be?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-10-02 22:14:59

by Jan Kara

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > No, it should not. DAX is an implementation detail that may change
> > > or go away at any time.
> >
> > Well we had an issue with an application checking for dax, this is how
> > we landed here in the first place.
>
> So what exactly is that "DAX" they are querying about (and no, I'm not
> joking, nor being philosophical).

I believe the application we are speaking about is mostly concerned about
the memory overhead of the page cache. Think of a machine that has ~ 1TB of
DRAM, the database running on it is about that size as well and they want
database state stored somewhere persistently - which they may want to do by
modifying mmaped database files if they do small updates... So they really
want to be able to use close to all DRAM for the DB and not leave slack
space for the kernel page cache to cache 1TB of database files.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-18 22:57:20

by Jan Kara

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Thu 18-10-18 11:25:10, Dave Chinner wrote:
> On Wed, Oct 17, 2018 at 04:23:50PM -0400, Jeff Moyer wrote:
> > MAP_SYNC
> > - file system guarantees that metadata required to reach faulted-in file
> > data is consistent on media before a write fault is completed. A
> > side-effect is that the page cache will not be used for
> > writably-mapped pages.
>
> I think you are conflating current implementation with API
> requirements - MAP_SYNC doesn't guarantee anything about page cache
> use. The man page definition simply says "supported only for files
> supporting DAX" and that it provides certain data integrity
> guarantees. It does not define the implementation.
>
> We've /implemented MAP_SYNC/ as O_DSYNC page fault behaviour,
> because that's the only way we can currently provide the required
> behaviour to userspace. However, if a filesystem can use the page
> cache to provide the required functionality, then it's free to do
> so.
>
> i.e. if someone implements a pmem-based page cache, MAP_SYNC data
> integrity could be provided /without DAX/ by any filesystem using
> that persistent page cache. i.e. MAP_SYNC really only requires
> mmap() of CPU addressable persistent memory - it does not require
> DAX. Right now, however, the only way to get this functionality is
> through a DAX capable filesystem on dax capable storage.
>
> And, FWIW, this is pretty much how NOVA maintains DAX w/ COW - it
> COWs new pages in pmem and attaches them a special per-inode cache
> on clean->dirty transition. Then on data sync, background writeback
> or crash recovery, it migrates them from the cache into the file map
> proper via atomic metadata pointer swaps.
>
> IOWs, NOVA provides the correct MAP_SYNC semantics by using a
> separate persistent per-inode write cache to provide the correct
> crash recovery semantics for MAP_SYNC.

Correct. NOVA would be able to provide MAP_SYNC semantics without DAX. But
effectively it will also be able to provide MAP_DIRECT semantics, right?
Because there won't be DRAM between app and persistent storage and I don't
think COW tricks or other data integrity methods are that interesting for
the application. Most users of O_DIRECT are concerned about getting close
to media speed performance and low DRAM usage...

> > and what I think Dan had proposed:
> >
> > mmap flag, MAP_DIRECT
> > - file system guarantees that page cache will not be used to front storage.
> > storage MUST be directly addressable. This *almost* implies MAP_SYNC.
> > The subtle difference is that a write fault /may/ not result in metadata
> > being written back to media.
>
> Similar to O_DIRECT, these semantics do not allow userspace apps to
> replace msync/fsync with CPU cache flush operations. So any
> application that uses this mode still needs to use either MAP_SYNC
> or issue msync/fsync for data integrity.
>
> If the app is using MAP_DIRECT, then what do we do if the filesystem
> can't provide the required semantics for that specific operation? In
> the case of O_DIRECT, we fall back to buffered IO because it has the
> same data integrity semantics as O_DIRECT and will always work. It's
> just slower and consumes more memory, but the app continues to work
> just fine.
>
> Sending SIGBUS to apps when we can't perform MAP_DIRECT operations
> without using the pagecache seems extremely problematic to me. e.g.
> an app already has an open MAP_DIRECT file, and a third party
> reflinks it or dedupes it and the fs has to fall back to buffered IO
> to do COW operations. This isn't the app's fault - the kernel should
> just fall back transparently to using the page cache for the
> MAP_DIRECT app and just keep working, just like it would if it was
> using O_DIRECT read/write.

There's another option of failing reflink / dedupe with EBUSY if the file
is mapped with MAP_DIRECT and the filesystem cannot support reflink &
MAP_DIRECT together. But there are downsides to that as well.

> The point I'm trying to make here is that O_DIRECT is a /hint/, not
> a guarantee, and it's done that way to prevent applications from
> being presented with transient, potentially fatal error cases
> because a filesystem implementation can't do a specific operation
> through the direct IO path.
>
> IMO, MAP_DIRECT needs to be a hint like O_DIRECT and not a
> guarantee. Over time we'll end up with filesystems that can
> guarantee that MAP_DIRECT is always going to use DAX, in the same
> way we have filesystems that guarantee O_DIRECT will always be
> O_DIRECT (e.g. XFS). But if we decide that MAP_DIRECT must guarantee
> no page cache will ever be used, then we are basically saying
> "filesystems won't provide MAP_DIRECT even in common, useful cases
> because they can't provide MAP_DIRECT in all cases." And that
> doesn't seem like a very good solution to me.

These are good points. I'm just somewhat wary of the situation where users
will map files with MAP_DIRECT and then the machine starts thrashing
because the file got reflinked and thus pagecache gets used suddenly.
With O_DIRECT the fallback to buffered IO is quite rare (at least for major
filesystems) so usually people just won't notice. If fallback for
MAP_DIRECT will be easy to hit, I'm not sure it would be very useful.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-10-02 21:35:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > No, it should not. DAX is an implementation detail that may change
> > or go away at any time.
>
> Well we had an issue with an application checking for dax, this is how
> we landed here in the first place.

So what exactly is that "DAX" they are querying about (and no, I'm not
joking, nor being philosophical).

2018-10-18 04:21:14

by Jeff Moyer

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

Jan Kara <[email protected]> writes:

> [Added ext4, xfs, and linux-api folks to CC for the interface discussion]
>
> On Tue 02-10-18 14:10:39, Johannes Thumshirn wrote:
>> On Tue, Oct 02, 2018 at 12:05:31PM +0200, Jan Kara wrote:
>> > Hello,
>> >
>> > commit e1fb4a086495 "dax: remove VM_MIXEDMAP for fsdax and device dax" has
>> > removed VM_MIXEDMAP flag from DAX VMAs. Now our testing shows that in the
>> > mean time certain customer of ours started poking into /proc/<pid>/smaps
>> > and looks at VMA flags there and if VM_MIXEDMAP is missing among the VMA
>> > flags, the application just fails to start complaining that DAX support is
>> > missing in the kernel. The question now is how do we go about this?
>>
>> OK naive question from me, how do we want an application to be able to
>> check if it is running on a DAX mapping?
>
> The question from me is: Should application really care? After all DAX is
> just a caching decision. Sure it affects performance characteristics and
> memory usage of the kernel but it is not a correctness issue (in particular
> we took care for MAP_SYNC to return EOPNOTSUPP if the feature cannot be
> supported for current mapping). And in the future the details of what we do
> with DAX mapping can change - e.g. I could imagine we might decide to cache
> writes in DRAM but do direct PMEM access on reads. And all this could be
> auto-tuned based on media properties. And we don't want to tie our hands by
> specifying too narrowly how the kernel is going to behave.

For read and write, I would expect the O_DIRECT open flag to still work,
even for dax-capable persistent memory. Is that a contentious opinion?

So, what we're really discussing is the behavior for mmap. MAP_SYNC
will certainly ensure that the page cache is not used for writes. It
would also be odd for us to decide to cache reads. The only issue I can
see is that perhaps the application doesn't want to take a performance
hit on write faults. I haven't heard that concern expressed in this
thread, though.

Just to be clear, this is my understanding of the world:

MAP_SYNC
- file system guarantees that metadata required to reach faulted-in file
data is consistent on media before a write fault is completed. A
side-effect is that the page cache will not be used for
writably-mapped pages.

and what I think Dan had proposed:

mmap flag, MAP_DIRECT
- file system guarantees that page cache will not be used to front storage.
storage MUST be directly addressable. This *almost* implies MAP_SYNC.
The subtle difference is that a write fault /may/ not result in metadata
being written back to media.

and this is what I think you were proposing, Jan:

madvise flag, MADV_DIRECT_ACCESS
- same semantics as MAP_DIRECT, but specified via the madvise system call
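
A rough sketch of how an application might call that madvise variant.
MADV_DIRECT_ACCESS is only a proposal in this thread, so the flag value
below is made up purely for illustration:

#include <sys/mman.h>
#include <stddef.h>

#ifndef MADV_DIRECT_ACCESS
#define MADV_DIRECT_ACCESS      240     /* hypothetical, not a real flag */
#endif

static int mapping_is_direct(void *addr, size_t len)
{
        /* Per the proposal: success means no page cache fronts this
         * mapping; failure means it does, or the kernel does not
         * implement the flag at all. */
        return madvise(addr, len, MADV_DIRECT_ACCESS) == 0;
}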

Cheers,
Jeff

2018-10-02 21:28:02

by Johannes Thumshirn

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> No, it should not. DAX is an implementation detail that may change
> or go away at any time.

Well we had an issue with an application checking for dax, this is how
we landed here in the first place.

It's not that I want them to do it, it's more that they're actually
doing it in all kinds of interesting ways and then complaining when it
doesn't work anymore.

So it's less of an "API beauty prize" problem and more of a "provide a
documented way which we won't break" problem.

Byte,
Johannes
--
Johannes Thumshirn Storage
[email protected] +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

2018-10-02 21:21:00

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Tue, Oct 02, 2018 at 04:29:59PM +0200, Jan Kara wrote:
> > OK naive question from me, how do we want an application to be able to
> > check if it is running on a DAX mapping?
>
> The question from me is: Should application really care?

No, it should not. DAX is an implementation detail that may change
or go away at any time.

2018-10-03 21:27:55

by Dan Williams

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 3, 2018 at 5:51 AM Jan Kara <[email protected]> wrote:
>
> On Tue 02-10-18 13:18:54, Dan Williams wrote:
> > On Tue, Oct 2, 2018 at 8:32 AM Jan Kara <[email protected]> wrote:
> > >
> > > On Tue 02-10-18 07:52:06, Christoph Hellwig wrote:
> > > > On Tue, Oct 02, 2018 at 04:44:13PM +0200, Johannes Thumshirn wrote:
> > > > > On Tue, Oct 02, 2018 at 07:37:13AM -0700, Christoph Hellwig wrote:
> > > > > > No, it should not. DAX is an implementation detail that may change
> > > > > > or go away at any time.
> > > > >
> > > > > Well we had an issue with an application checking for dax, this is how
> > > > > we landed here in the first place.
> > > >
> > > > So what exactly is that "DAX" they are querying about (and no, I'm not
> > > > joking, nor being philosophical).
> > >
> > > I believe the application we are speaking about is mostly concerned about
> > > the memory overhead of the page cache. Think of a machine that has ~ 1TB of
> > > DRAM, the database running on it is about that size as well and they want
> > > database state stored somewhere persistently - which they may want to do by
> > > modifying mmaped database files if they do small updates... So they really
> > > want to be able to use close to all DRAM for the DB and not leave slack
> > > space for the kernel page cache to cache 1TB of database files.
> >
> > VM_MIXEDMAP was never a reliable indication of DAX because it could be
> > set for random other device-drivers that use vm_insert_mixed(). The
> > MAP_SYNC flag positively indicates that page cache is disabled for a
> > given mapping, although whether that property is due to "dax" or some
> > other kernel mechanics is purely an internal detail.
> >
> > I'm not opposed to faking out VM_MIXEDMAP if this broken check has
> > made it into production, but again, it's unreliable.
>
> So luckily this particular application wasn't widely deployed yet so we
> will likely get away with the vendor asking customers to update to a
> version not looking into smaps and parsing /proc/mounts instead.
>
> But I don't find parsing /proc/mounts that beautiful either and I'd prefer
> if we had a better interface for applications to query whether they can
> avoid page cache for mmaps or not.

Yeah, the mount flag is not a good indicator either. I think we need
to follow through on the per-inode property of DAX. Darrick and I
discussed just allowing the property to be inherited from the parent
directory at file creation time. That avoids the dynamic set-up /
teardown races that seem intractable at this point.

What's wrong with MAP_SYNC as a page-cache detector in the meantime?
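
For reference, that probe is already possible today with nothing more
than mmap(2); a minimal sketch, relying on MAP_SYNC failing when it
cannot be honoured:

#define _GNU_SOURCE
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE     0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC                0x80000
#endif

/* Returns 1 if the kernel accepts a MAP_SYNC mapping of fd (on current
 * kernels that means no page cache fronts the mapping), 0 otherwise
 * (EOPNOTSUPP, or EINVAL on kernels without MAP_SHARED_VALIDATE). */
static int probe_map_sync(int fd, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);

        if (p == MAP_FAILED)
                return 0;
        munmap(p, len);
        return 1;
}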

2018-11-02 08:05:19

by Dave Chinner

[permalink] [raw]
Subject: Re: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps

On Wed, Oct 31, 2018 at 05:59:17AM +0000, [email protected] wrote:
> > On Mon, Oct 29, 2018 at 11:30:41PM -0700, Dan Williams wrote:
> > > On Thu, Oct 18, 2018 at 5:58 PM Dave Chinner <[email protected]> wrote:
> > In summary:
> >
> > MAP_DIRECT is an access hint.
> >
> > MAP_SYNC provides a data integrity model guarantee.
> >
> > MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > but it does not require or guarantee MAP_DIRECT.
> >
> > Let's compare that with O_DIRECT:
> >
> > O_DIRECT is an access hint.
> >
> > O_DSYNC provides a data integrity model guarantee.
> >
> > O_DSYNC may imply O_DIRECT for specific implementations, but
> > it does not require or guarantee O_DIRECT.
> >
> > Consistency in access and data integrity models is a good thing. DAX
> > and pmem is not an exception. We need to use a model we know works
> > and has proven itself over a long period of time.
>
> Hmmm, then, I would like to know all of the reasons MAP_DIRECT may break.
> (I'm not opposed to your opinion, but I need to know them.)
>
> In the O_DIRECT case, in my understanding, the reason O_DIRECT breaks is
> "wrong alignment is specified by the application", right?

O_DIRECT has defined memory and offset alignment restrictions, and
will return an error to userspace when they are violated. It does
not fall back to buffered IO in this case. MAP_DIRECT has no
equivalent restriction, so IO alignment of O_DIRECT is largely
irrelevant here.
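
As a concrete illustration of that error behaviour - a sketch, assuming
a 512-byte logical block size (the real value comes from
ioctl(BLKSSZGET) on the underlying block device) and with "some/file"
as a placeholder path:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd = open("some/file", O_RDWR | O_DIRECT);

        if (fd < 0 || posix_memalign(&buf, 4096, 4096))
                return 1;

        pwrite(fd, buf, 4096, 0);            /* aligned: accepted as direct IO */
        pwrite(fd, (char *)buf + 1, 512, 0); /* misaligned buffer: EINVAL      */
        pwrite(fd, buf, 512, 1);             /* misaligned offset: EINVAL      */
        return 0;
}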

What we are talking about here is that some filesystems can only do
certain operations through buffered IO, such as block allocation or
file extension, and so silently fall back to doing them via buffered
IO even when O_DIRECT is specified. The old direct IO code used to
be full of conditionals to allow this - I think DIO_SKIP_HOLES is the
only one remaining:

        /*
         * For writes that could fill holes inside i_size on a
         * DIO_SKIP_HOLES filesystem we forbid block creations: only
         * overwrites are permitted. We will return early to the caller
         * once we see an unmapped buffer head returned, and the caller
         * will fall back to buffered I/O.
         *
         * Otherwise the decision is left to the get_blocks method,
         * which may decide to handle it or also return an unmapped
         * buffer head.
         */
        create = dio->op == REQ_OP_WRITE;
        if (dio->flags & DIO_SKIP_HOLES) {
                if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
                                        i_blkbits))
                        create = 0;
        }

Other cases, like file extension, are caught by the filesystems
before calling into the DIO code itself, so there are multiple avenues
for O_DIRECT transparently falling back to buffered IO.

This means the applications don't fail just because the filesystem
can't do a specific operation via O_DIRECT. The data writes still
succeed because they fall back to buffered IO, and the application
is blissfully unaware that the filesystem behaved that way.

> When the filesystem cannot use O_DIRECT and uses the page cache instead,
> the system uses more memory than the user expects.

That's far better than failing unexpectedly because the app
came across a hole in the file (e.g. someone ran
sparsify across the filesystem).

> So, there is a side effect, and it may cause other trouble.
> (memory pressure, expected performance cannot be gained, and so on...)

Which is why people are supposed to test their systems before they
put them into production.

I've lost count of the number of times I've heard "but O_DIRECT is
supposed to make things faster!" because people don't understand
exactly what it does or means. Bypassing the page cache does not
magically make applications go faster - it puts the responsibility
for doing optimal IO on the application, not the kernel.

MAP_DIRECT will be no different. It's no guarantee that it will make
things faster, or that everything will just work as users expect
them to. It specifically places the responsibility for performing IO
in an optimal fashion on the application and the user for making
sure that it is fit for their purposes. Like O_DIRECT, using
MAP_DIRECT means "I, the application, know exactly what I'm doing,
so get out of the way as much as possible because I'm taking
responsibility for issuing IO in the most optimal manner now".

> In such a case the administrator (or technical support engineer) needs to struggle to
> investigate the reason.

That's no different to performance problems that arise from
inappropriate use of O_DIRECT. It requires a certain level of
expertise to be able to understand and diagnose such issues.

> So, I would like to know: in the MAP_DIRECT case, what are the reasons?
> I think it will be helpful for users.
> Only splice?

The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm
certain that filesystem developers will try to maintain MAP_DIRECT
semantics as much as possible, but it's not going to be possible in
/all situations/ on XFS and ext4 because they simply haven't been
designed with DAX in mind. Filesystems designed specifically for
pmem and DAX might be able to provide MAP_DIRECT in all situations,
but those filesystems don't really exist yet.

This is no different to the early days of O_DIRECT. e.g. ext3
couldn't do O_DIRECT for all operations when it was first
introduced, but over time the functionality improved as the
underlying issues were solved. If O_DIRECT was a guarantee, then
ext3 would have never supported O_DIRECT at all...

> (Maybe such document will be necessary....)

The semantics will need to be documented in the relevant man pages.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-11-02 10:49:01

by Yasunori Gotou (Fujitsu)

[permalink] [raw]
Subject: RE: Problems with VM_MIXEDMAP removal from /proc/<pid>/smaps


> > > MAP_DIRECT is an access hint.
> > >
> > > MAP_SYNC provides a data integrity model guarantee.
> > >
> > > MAP_SYNC may imply MAP_DIRECT for specific implementations,
> > > but it does not require or guarantee MAP_DIRECT.
> > >
> > > Let's compare that with O_DIRECT:
> > >
> > > O_DIRECT is an access hint.
> > >
> > > O_DSYNC provides a data integrity model guarantee.
> > >
> > > O_DSYNC may imply O_DIRECT for specific implementations, but
> > > it does not require or guarantee O_DIRECT.
> > >
> > > Consistency in access and data integrity models is a good thing. DAX
> > > and pmem is not an exception. We need to use a model we know works
> > > and has proven itself over a long period of time.
> >
> > Hmmm, then, I would like to know all of the reasons MAP_DIRECT may break.
> > (I'm not opposed to your opinion, but I need to know them.)
> >
> > In the O_DIRECT case, in my understanding, the reason O_DIRECT breaks is
> > "wrong alignment is specified by the application", right?
>
> O_DIRECT has defined memory and offset alignment restrictions, and
> will return an error to userspace when they are violated. It does
> not fall back to buffered IO in this case. MAP_DIRECT has no
> equivalent restriction, so IO alignment of O_DIRECT is largely
> irrelevant here.
>
> What we are talking about here is that some filesystems can only do
> certain operations through buffered IO, such as block allocation or
> file extension, and so silently fall back to doing them via buffered
> IO even when O_DIRECT is specified. The old direct IO code used to
> be full of conditionals to allow this - I think DIO_SKIP_HOLES is the
> only one remaining:
>
>         /*
>          * For writes that could fill holes inside i_size on a
>          * DIO_SKIP_HOLES filesystem we forbid block creations: only
>          * overwrites are permitted. We will return early to the caller
>          * once we see an unmapped buffer head returned, and the caller
>          * will fall back to buffered I/O.
>          *
>          * Otherwise the decision is left to the get_blocks method,
>          * which may decide to handle it or also return an unmapped
>          * buffer head.
>          */
>         create = dio->op == REQ_OP_WRITE;
>         if (dio->flags & DIO_SKIP_HOLES) {
>                 if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
>                                         i_blkbits))
>                         create = 0;
>         }
>
> Other cases, like file extension, are caught by the filesystems
> before calling into the DIO code itself, so there are multiple avenues
> for O_DIRECT transparently falling back to buffered IO.
>
> This means the applications don't fail just because the filesystem
> can't do a specific operation via O_DIRECT. The data writes still
> succeed because they fall back to buffered IO, and the application
> is blissfully unaware that the filesystem behaved that way.
>
> > When the filesystem cannot use O_DIRECT and uses the page cache instead,
> > the system uses more memory than the user expects.
>
> That's far better than failing unexpectedly because the app
> came across a hole in the file (e.g. someone ran
> sparsify across the filesystem).
>
> > So, there is a side effect, and it may cause other trouble.
> > (memory pressure, expected performance cannot be gained, and so on...)
>
> Which is why people are supposed to test their systems before they
> put them into production.
>
> I've lost count of the number of times I've heard "but O_DIRECT is
> supposed to make things faster!" because people don't understand
> exactly what it does or means. Bypassing the page cache does not
> magically make applications go faster - it puts the responsibility
> for doing optimal IO on the application, not the kernel.
>
> MAP_DIRECT will be no different. It's no guarantee that it will make
> things faster, or that everything will just work as users expect
> them to. It specifically places the responsibility for performing IO
> in an optimal fashion on the application and the user for making
> sure that it is fit for their purposes. Like O_DIRECT, using
> MAP_DIRECT means "I, the application, know exactly what I'm doing,
> so get out of the way as much as possible because I'm taking
> responsibility for issuing IO in the most optimal manner now".
>
> > In such a case the administrator (or technical support engineer) needs to struggle to
> > investigate the reason.
>
> That's no different to performance problems that arise from
> inappropriate use of O_DIRECT. It requires a certain level of
> expertise to be able to understand and diagnose such issues.
>
> > So, I would like to know: in the MAP_DIRECT case, what are the reasons?
> > I think it will be helpful for users.
> > Only splice?
>
> The filesystem can ignore MAP_DIRECT for any reason it needs to. I'm
> certain that filesystem developers will try to maintain MAP_DIRECT
> semantics as much as possible, but it's not going to be possible in
> /all situations/ on XFS and ext4 because they simply haven't been
> designed with DAX in mind. Filesystems designed specifically for
> pmem and DAX might be able to provide MAP_DIRECT in all situations,
> but those filesystems don't really exist yet.
>
> This is no different to the early days of O_DIRECT. e.g. ext3
> couldn't do O_DIRECT for all operations when it was first
> introduced, but over time the functionality improved as the
> underlying issues were solved. If O_DIRECT was a guarantee, then
> ext3 would have never supported O_DIRECT at all...

Hmm, Ok. I see.
Thank you very much for your detailed explanation.

>
> > (Maybe such document will be necessary....)
>
> The semantics will need to be documented in the relevant man pages.

I agree.

Thanks, again.
----
Yasunori Goto