2019-10-22 13:20:41

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 20/10/2019 18:59, [email protected] wrote:
> From: Ira Weiny <[email protected]>
>
> At LSF/MM'19 [1] [2] we discussed applications that overestimate memory
> consumption due to their inability to detect whether the kernel will
> instantiate page cache for a file, and cases where a global dax enable via a
> mount option is too coarse.
>
> The following patch series enables selecting the use of DAX on individual files
> and/or directories on xfs, and lays some groundwork to do so in ext4. In this
> scheme the dax mount option can be omitted to allow the per-file property to
> take effect.
>
> The insight at LSF/MM was to separate the per-mount or per-file "physical"
> capability switch from an "effective" attribute for the file.
>
> At LSF/MM we discussed the difficulties of switching the mode of a file with
> active mappings / page cache. Rather than solve those races the decision was to
> just limit mode flips to 0-length files.
>

What I understand above is that only "writers" before writing any bytes may
turn the flag on, which then persists. But as a very long time user of DAX, usually
it is the writers that are least interesting. With lots of DAX technologies and
emulations the write is slower and needs slow "flushing".

The more interesting and performance gains comes from DAX READs actually.
specially cross the VM guest. (IE. All VMs share host memory or pmem)

This fixture as I understand it, that I need to know before I write if I will
want DAX or not and then the write is DAX as well as reads after that, looks
not very interesting for me as a user.

Just my $0.17
Boaz

> Finally, the physical DAX flag inheritance is maintained from previous work on
> XFS but should be added for other file systems for consistence.
>
>
> [1] https://lwn.net/Articles/787973/
> [2] https://lwn.net/Articles/787233/
>
> To: [email protected]
> Cc: Alexander Viro <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: "Theodore Y. Ts'o" <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
>
> Ira Weiny (5):
> fs/stat: Define DAX statx attribute
> fs/xfs: Isolate the physical DAX flag from effective
> fs/xfs: Separate functionality of xfs_inode_supports_dax()
> fs/xfs: Clean up DAX support check
> fs/xfs: Allow toggle of physical DAX flag
>
> fs/stat.c | 3 +++
> fs/xfs/xfs_ioctl.c | 32 ++++++++++++++------------------
> fs/xfs/xfs_iops.c | 36 ++++++++++++++++++++++++++++++------
> fs/xfs/xfs_iops.h | 2 ++
> include/uapi/linux/stat.h | 1 +
> 5 files changed, 50 insertions(+), 24 deletions(-)
>


2019-10-23 20:04:22

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 22/10/2019 14:21, Boaz Harrosh wrote:
> On 20/10/2019 18:59, [email protected] wrote:
<>
>> At LSF/MM we discussed the difficulties of switching the mode of a file with
>> active mappings / page cache. Rather than solve those races the decision was to
>> just limit mode flips to 0-length files.
>>
>
> What I understand above is that only "writers" before writing any bytes may
> turn the flag on, which then persists. But as a very long time user of DAX, usually
> it is the writers that are least interesting. With lots of DAX technologies and
> emulations the write is slower and needs slow "flushing".
>
> The more interesting and performance gains comes from DAX READs actually.
> specially cross the VM guest. (IE. All VMs share host memory or pmem)
>
> This fixture as I understand it, that I need to know before I write if I will
> want DAX or not and then the write is DAX as well as reads after that, looks
> not very interesting for me as a user.
>
> Just my $0.17
> Boaz
>

I want to say one more thing about this patchset please. I was sleeping on it
and I think it is completely wrong approach with a completely wrong API.

The DAX or not DAX is a matter of transport. and is nothing to do with data
content and persistency. It's like connecting a disk behind, say, a scsi bus and then
take the same DB and putting it behind a multy-path or an NFS server. It is always
the same data.
(Same mistake we did with ndctl which is cry for generations)

For me the DAX or NO-DAX is an application thing and not a data thing.

The right API is perhaps an open O_FLAG where if you are not the first opener
the open returns EBUSY and then the app can decide if to open without the
flag or complain and fail. (Or apply open locks)

If you are a second opener when the file is already opened O_DAX you are
silently running DAX. If you are 2nd open(O_DAX) when already oppened
O_DAX then off course you succeed.
(Also the directory inheritance thing is all mute too)

Please explain the use case behind your model?

Thanks
Boaz

2019-10-24 10:28:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> On 22/10/2019 14:21, Boaz Harrosh wrote:
> > On 20/10/2019 18:59, [email protected] wrote:
> Please explain the use case behind your model?

No application changes needed to control whether they use DAX or
not. It allows the admin to control the application behaviour
completely, so they can turn off DAX if necessary. Applications are
unaware of constraints that may prevent DAX from being used, and so
admins need a mechanism to prevent DAX aware application from
actually using DAX if the capability is present.

e.g. given how slow some PMEM devices are when it comes to writing
data, especially under extremely high concurrency, DAX is not
necessarily a performance win for every application. Admins need a
guaranteed method of turning off DAX in these situations - apps may
not provide such a knob, or even be aware of a thing called DAX...

e.g. the data set being accessed by the application is mapped and
modified by RDMA applications, so those files must not be accessed
using DAX by any application because DAX+RDMA are currently
incompatible. Hence you can have RDMA on pmem devices co-exist
within the same filesystem as other applications using DAX to access
the pmem...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-10-24 11:15:30

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 24/10/2019 01:13, Dave Chinner wrote:
> On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
>> On 22/10/2019 14:21, Boaz Harrosh wrote:
>>> On 20/10/2019 18:59, [email protected] wrote:
>> Please explain the use case behind your model?
>
> No application changes needed to control whether they use DAX or
> not. It allows the admin to control the application behaviour
> completely, so they can turn off DAX if necessary. Applications are
> unaware of constraints that may prevent DAX from being used, and so
> admins need a mechanism to prevent DAX aware application from
> actually using DAX if the capability is present.
>
> e.g. given how slow some PMEM devices are when it comes to writing
> data, especially under extremely high concurrency, DAX is not
> necessarily a performance win for every application. Admins need a
> guaranteed method of turning off DAX in these situations - apps may
> not provide such a knob, or even be aware of a thing called DAX...
>

Thank you Dave for explaining. Forgive my slowness. I now understand
your intention.

But if so please address my first concern. That in the submitted implementation
you must set the flag-bit after the create of the file but before the write.
So exactly the above slow writes must always be DAX if I ever want the file
to be DAX accessed in the future.

In fact I do not see how you do this without changing the application because
most applications create thier own files, so you do not have a chance to set
the DAX-flag before the write happens. So the only effective fixture is the
inheritance from the parent directory.
But then again how do you separate from the slow writes that we would like
none-DAX to the DAX reads that are fast and save so much resources and latency.

What if, say in XFS when setting the DAX-bit we take all the three write-locks
same as a truncate. Then we check that there are no active page-cache mappings
ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

> e.g. the data set being accessed by the application is mapped and
> modified by RDMA applications, so those files must not be accessed
> using DAX by any application because DAX+RDMA are currently
> incompatible. Hence you can have RDMA on pmem devices co-exist
> within the same filesystem as other applications using DAX to access
> the pmem...
>

I actually like the lastest patchset that takes a lease on the file.
But yes an outside admin tool to set different needs.

> Cheers,
> Dave.
>

Yes, thanks
Boaz

2019-10-24 21:01:10

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> On 24/10/2019 01:13, Dave Chinner wrote:
> > On Wed, Oct 23, 2019 at 04:09:50PM +0300, Boaz Harrosh wrote:
> >> On 22/10/2019 14:21, Boaz Harrosh wrote:
> >>> On 20/10/2019 18:59, [email protected] wrote:
> >> Please explain the use case behind your model?
> >
> > No application changes needed to control whether they use DAX or
> > not. It allows the admin to control the application behaviour
> > completely, so they can turn off DAX if necessary. Applications are
> > unaware of constraints that may prevent DAX from being used, and so
> > admins need a mechanism to prevent DAX aware application from
> > actually using DAX if the capability is present.
> >
> > e.g. given how slow some PMEM devices are when it comes to writing
> > data, especially under extremely high concurrency, DAX is not
> > necessarily a performance win for every application. Admins need a
> > guaranteed method of turning off DAX in these situations - apps may
> > not provide such a knob, or even be aware of a thing called DAX...
> >
>
> Thank you Dave for explaining. Forgive my slowness. I now understand
> your intention.
>
> But if so please address my first concern. That in the submitted implementation
> you must set the flag-bit after the create of the file but before the write.
> So exactly the above slow writes must always be DAX if I ever want the file
> to be DAX accessed in the future.

The on disk DAX flag is inherited from the parent directory at
create time. Hence an admin only need to set it on the data
directory of the application when first configuring it, and
everything the app creates will be configured for DAX access
automatically.

Or, alternatively, mkfs sets the flag on the root dir so that
everything in the filesystem uses DAX by default (through
inheritance) unless the admin turns off the flag on a directory
before it starts to be used or on a set of files after they have
been created (because DAX causes problems)...

So, yeah, there's another problem with the basic assertion that we
only need to allow the on disk flag to be changed on zero length
files: we actually want to be able to -clear- the DAX flag when the
file has data attached to it, not just when it is an empty file...

> What if, say in XFS when setting the DAX-bit we take all the three write-locks
> same as a truncate. Then we check that there are no active page-cache mappings
> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)

DAX doesn't have page cache mappings, so anything that relies on
checking page cache state isn't going to work reliably. I also seem
to recall that there was a need to take some vm level lock to really
prevent page fault races, and that we can't safely take that in a
safe combination with all the filesystem locks we need.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-10-25 13:58:50

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On 24/10/2019 10:34, Dave Chinner wrote:
> On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
<>
>
> The on disk DAX flag is inherited from the parent directory at
> create time. Hence an admin only need to set it on the data
> directory of the application when first configuring it, and
> everything the app creates will be configured for DAX access
> automatically.
>

Yes I said that as well. But again I am concerned that this is the
opposite of our Intention. As you said the WRITEs are slow and
do not scale so what we like, and why we have the all problem, is
to WRITE *none*-DAX. And if so then how do we turn the bit ON later
for the fast READs.

> Or, alternatively, mkfs sets the flag on the root dir so that
> everything in the filesystem uses DAX by default (through
> inheritance) unless the admin turns off the flag on a directory
> before it starts to be used

> or on a set of files after they have
> been created (because DAX causes problems)...
>

Yes exactly this can not be done currently.

> So, yeah, there's another problem with the basic assertion that we
> only need to allow the on disk flag to be changed on zero length
> files: we actually want to be able to -clear- the DAX flag when the
> file has data attached to it, not just when it is an empty file...
>

Exactly, This is my concern. And the case that I see most useful is the
opposite where I want to turn it ON, for DAX fast READs.

>> What if, say in XFS when setting the DAX-bit we take all the three write-locks
>> same as a truncate. Then we check that there are no active page-cache mappings
>> ie. a single opener. Then allow to set the bit. Else return EBUISY. (file is in use)
>
> DAX doesn't have page cache mappings, so anything that relies on
> checking page cache state isn't going to work reliably.

I meant on the opposite case, Where the flag was OFF and I want it ON for
fast READs. In that case if I have any users there are pages on the
xarray.
BTW the opposite is also true if we have active DAX users we will have
DAX entries in the xarray. What we want is that there are *no* active
users while we change the file-DAX-mode. Else we fail the change.

> I also seem
> to recall that there was a need to take some vm level lock to really
> prevent page fault races, and that we can't safely take that in a
> safe combination with all the filesystem locks we need.
>

We do not really care with page fault races in the Kernel as long
as I protect the xarray access and these are protected well if we
take truncate locking. But we have a bigger problem that you pointed
out with the change of the operations vector pointer.

I was thinking about this last night. One way to do this is with
file-exclusive-lock. Correct me if I'm wrong:
file-exclusive-readwrite-lock means any other openers will fail and
if there are openers already the lock will fail. Which is what we want
no? to make sure we are the exclusive user of the file while we change
the op vector.
Now the question is if we force the application to take the lock and
Kernel only check that we are locked. Or Kernel take the lock within
the IOCTL.

Lets touch base. As I understand the protocol we want to establish with
the administration tool is:

- File is created, written. read...

- ALL file handles are closed, there are no active users
- File open by single opener for the purpose of changing the DAX-mode
- lock from all other opens
- change the DAX-mode, op vectors
- unlock-exlusivness

- File activity can resume...

That's easy to say, But how can we enforce this protocol?

> Cheers,
> Dave.
>

Thanks Dave
Boaz

2019-10-25 19:08:15

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations

On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote:
> On 24/10/2019 10:34, Dave Chinner wrote:
> > On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote:
> <>
> >
> > The on disk DAX flag is inherited from the parent directory at
> > create time. Hence an admin only need to set it on the data
> > directory of the application when first configuring it, and
> > everything the app creates will be configured for DAX access
> > automatically.
> >
>
> Yes I said that as well.

You said "it must be set between creation and first write",
stating the requirement for an on-disk flag to work. I'm
decribing how that requirement is actually implemented. i.e. what
you are stating is something we actually implemented years ago...

> > I also seem
> > to recall that there was a need to take some vm level lock to really
> > prevent page fault races, and that we can't safely take that in a
> > safe combination with all the filesystem locks we need.
> >
>
> We do not really care with page fault races in the Kernel as long

Oh yes we do. A write fault is a 2-part operation - a read fault to
populate the pte and mapping, then a write fault (->page_mkwrite) to
do all the filesystem work needed to dirty the page and pte.

The read fault sets up the state for the write fault, and if we
change the aops between these two operations, then the
->page_mkwrite implementation goes kaboom.

This isn't a theoretical problem - this is exactly the race
condition that lead us to disabling the flag in the first place.
There is no serialisation between the read and write parts of the
page fault iand the filesystem changing the DAX flag and ops vector,
and so fixing this problem requires hold yet more locks in the
filesystem path to completely lock out page fault processing on the
inode's mapping.

> as I protect the xarray access and these are protected well if we
> take truncate locking. But we have a bigger problem that you pointed
> out with the change of the operations vector pointer.
>
> I was thinking about this last night. One way to do this is with
> file-exclusive-lock. Correct me if I'm wrong:
> file-exclusive-readwrite-lock means any other openers will fail and
> if there are openers already the lock will fail. Which is what we want
> no?

The filesystem ioctls and page faults have no visibility of file
locks. They don't know and can't find out in a sane manner that an
inode has a single -user- reference.

And it introduces a new problem for any application using the
fssetxattr() ioctl - accidentally not setting the S_DAX flag to be
unmodified will now fail, and that means such a change breaks
existing applications. Sure, you can say they are "buggy
applications", but the fact is this user API change breaks them.

Hence I don't think we can change the user API for setting/clearing
this flag like this.

Cheers,

Dave.
--
Dave Chinner
[email protected]