LinuxLists.cc - [PATCH v2 0/6] Miscellaneous DAX patches, take 2

2015-07-03 14:42:20

Subject: [PATCH v2 0/6] Miscellaneous DAX patches, take 2

A miscellaneous set of patches to improve DAX. All are independent of
each other.

I dropped the DAX mmap support for block devices; I started trying
to put in a sysfs toggle and ran into some difficulties, so just drop
that part of the patch for now. There's no problem (that I know of)
with the read()/write() portion of the patch.

I fixed up the conflict between Dave Chinner's changes to ext4 and mine.

I forgot the 'bdev_direct_access() may sleep' patch in the previous
submission.

Matthew Wilcox (6):
dax: Add block size note to documentation
dax: Use copy_from_iter_nocache
ext4: Use ext4_get_block_write() for DAX
vfs: Allow truncate, chomd and chown to be interrupted by fatal
signals
block: Add support for DAX reads/writes to block devices
dax: bdev_direct_access() may sleep

Documentation/filesystems/dax.txt | 6 ++++--
fs/block_dev.c | 10 ++++++++++
fs/dax.c | 8 +++++---
fs/ext4/file.c | 8 ++++----
fs/open.c | 9 ++++++---
5 files changed, 29 insertions(+), 12 deletions(-)

--
2.1.4

2015-07-03 14:40:57

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 1/6] dax: Add block size note to documentation

From: Matthew Wilcox <[email protected]>

For block devices which are small enough, mkfs will default to creating
a filesystem with block sizes smaller than page size.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/filesystems/dax.txt | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index baf4111..7af2851 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -18,8 +18,10 @@ Usage
-----

If you have a block device which supports DAX, you can make a filesystem
-on it as usual. When mounting it, use the -o dax option manually
-or add 'dax' to the options in /etc/fstab.
+on it as usual. The DAX code currently only supports files with a block
+size equal to your kernel's PAGE_SIZE, so you may need to specify a block
+size when creating the filesystem. When mounting it, use the "-o dax"
+option on the command line or add 'dax' to the options in /etc/fstab.

Implementation Tips for Block Driver Writers
--
2.1.4

2015-07-03 14:42:25

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 2/6] dax: Use copy_from_iter_nocache

From: Matthew Wilcox <[email protected]>

When userspace does a write, there's no need for the written data to
pollute the CPU cache. This matches the original XIP code.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 99b5fbc..eaa9e06 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -155,7 +155,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
}

if (iov_iter_rw(iter) == WRITE)
- len = copy_from_iter(addr, max - pos, iter);
+ len = copy_from_iter_nocache(addr, max - pos, iter);
else if (!hole)
len = copy_to_iter(addr, max - pos, iter);
else
--
2.1.4

2015-07-03 14:41:11

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX

From: Matthew Wilcox <[email protected]>

DAX relies on the get_block function either zeroing newly allocated blocks
before they're findable by subsequent calls to get_block, or marking newly
allocated blocks as unwritten. ext4_get_block() cannot create unwritten
extents, but ext4_get_block_write() can.

Reported-by: Andy Rudoff <[email protected]>
Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/ext4/file.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bc313ac..e5bdcb7 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -195,7 +195,7 @@ out:
static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
{
struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16GB. Is that even supported? */
+ /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
int err;
if (!uptodate)
@@ -206,13 +206,13 @@ static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)

static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_fault(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
- /* Is this the right get_block? */
+ return dax_fault(vma, vmf, ext4_get_block_write, ext4_end_io_unwritten);
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block, ext4_end_io_unwritten);
+ return dax_mkwrite(vma, vmf, ext4_get_block_write,
+ ext4_end_io_unwritten);
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
--
2.1.4

2015-07-03 14:42:23

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 4/6] vfs: Allow truncate, chomd and chown to be interrupted by fatal signals

If another task dies while holding the i_mutex, tasks trying to truncate,
chmod or chown the file will hang. Allowing a fatal interrupt to kill
the task is beneficial for the system administrator trying to get the
machine to shut down more cleanly.

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/open.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index e0250bd..4b0061e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -56,7 +56,8 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
if (ret)
newattrs.ia_valid |= ret | ATTR_FORCE;

- mutex_lock(&dentry->d_inode->i_mutex);
+ if (mutex_lock_killable(&dentry->d_inode->i_mutex))
+ return -EINTR;
/* Note any delegations or leases have already been broken: */
ret = notify_change(dentry, &newattrs, NULL);
mutex_unlock(&dentry->d_inode->i_mutex);
@@ -508,7 +509,8 @@ static int chmod_common(struct path *path, umode_t mode)
if (error)
return error;
retry_deleg:
- mutex_lock(&inode->i_mutex);
+ if (mutex_lock_killable(&inode->i_mutex))
+ return -EINTR;
error = security_path_chmod(path, mode);
if (error)
goto out_unlock;
@@ -591,7 +593,8 @@ retry_deleg:
if (!S_ISDIR(inode->i_mode))
newattrs.ia_valid |=
ATTR_KILL_SUID | ATTR_KILL_SGID | ATTR_KILL_PRIV;
- mutex_lock(&inode->i_mutex);
+ if (mutex_lock_killable(&inode->i_mutex))
+ return -EINTR;
error = security_path_chown(path, uid, gid);
if (!error)
error = notify_change(path->dentry, &newattrs, &delegated_inode);
--
2.1.4

2015-07-03 14:41:23

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices

If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/block_dev.c | 4 ++++
fs/dax.c | 6 ++++--
2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4fe10f9..0bb2993 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -152,6 +152,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;

+ if (IS_DAX(inode))
+ return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
+ NULL, DIO_SKIP_DIO_COUNT);
return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
blkdev_get_block, NULL, NULL,
DIO_SKIP_DIO_COUNT);
@@ -1170,6 +1173,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
bdev->bd_disk = disk;
bdev->bd_queue = disk->queue;
bdev->bd_contains = bdev;
+ bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
if (!partno) {
ret = -ENXIO;
bdev->bd_part = disk_get_part(disk, partno);
diff --git a/fs/dax.c b/fs/dax.c
index eaa9e06..c3e21cc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -209,7 +209,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
}

/* Protects against truncate */
- inode_dio_begin(inode);
+ if (!(flags & DIO_SKIP_DIO_COUNT))
+ inode_dio_begin(inode);

retval = dax_io(inode, iter, pos, end, get_block, &bh);

@@ -219,7 +220,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
if ((retval > 0) && end_io)
end_io(iocb, pos, retval, bh.b_private);

- inode_dio_end(inode);
+ if (!(flags & DIO_SKIP_DIO_COUNT))
+ inode_dio_end(inode);
out:
return retval;
}
--
2.1.4

2015-07-03 14:41:49

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH v2 6/6] dax: bdev_direct_access() may sleep

The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method. To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().

Signed-off-by: Matthew Wilcox <[email protected]>
---
fs/block_dev.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 0bb2993..1982437 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -446,6 +446,12 @@ long bdev_direct_access(struct block_device *bdev, sector_t sector,
long avail;
const struct block_device_operations *ops = bdev->bd_disk->fops;

+ /*
+ * The device driver is allowed to sleep, in order to make the
+ * memory directly accessible.
+ */
+ might_sleep();
+
if (size < 0)
return size;
if (!ops->direct_access)
--
2.1.4

2015-07-03 18:30:40

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX

On Fri, Jul 03, 2015 at 10:40:40AM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> DAX relies on the get_block function either zeroing newly allocated blocks
> before they're findable by subsequent calls to get_block, or marking newly
> allocated blocks as unwritten. ext4_get_block() cannot create unwritten
> extents, but ext4_get_block_write() can.

To be clear, this patch has no prerequistes or dependencies, right?
That is, it would be fine if I take this through the ext4 git tree?
Or is there a reason or a preference for carrying this patch
somewhere else?

Also, is there a way I can test the DAX functionality in ext4 using
KVM? If so, can you give me a cheat sheet about how I can do that?

Thanks,

- Ted

2015-07-03 18:48:31

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX

On Fri, Jul 03, 2015 at 02:30:27PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 03, 2015 at 10:40:40AM -0400, Matthew Wilcox wrote:
> > From: Matthew Wilcox <[email protected]>
> >
> > DAX relies on the get_block function either zeroing newly allocated blocks
> > before they're findable by subsequent calls to get_block, or marking newly
> > allocated blocks as unwritten. ext4_get_block() cannot create unwritten
> > extents, but ext4_get_block_write() can.
>
> To be clear, this patch has no prerequistes or dependencies, right?
> That is, it would be fine if I take this through the ext4 git tree?
> Or is there a reason or a preference for carrying this patch
> somewhere else?

Right, no dependencies or prerequisites, completely independent of all
the other patches.

> Also, is there a way I can test the DAX functionality in ext4 using
> KVM? If so, can you give me a cheat sheet about how I can do that?

I don't use KVM, but I can tell you what I do ... (additional explanation
added, not for Ted's benefit, but because people less familiar with
Linux than Ted is may happen upon this email for their own purposes).

In /etc/default/grub, I have this line:

GRUB_CMDLINE_LINUX="memmap=4G!4G"

memmap=nn[KMG]!ss[KMG]
[KNL,X86] Mark specific memory as protected.
Region of memory to be used, from ss to ss+nn.
The memory region may be marked as e820 type 12 (0xc)
and is NVDIMM or ADR memory.

In my kernel config, I have:

CONFIG_X86_PMEM_LEGACY=y
CONFIG_LIBNVDIMM=y
CONFIG_BLK_DEV_PMEM=m

At boot, I "modprobe pmem". On the desktop-class system I'm using as
my development machine, the BIOS doesn't clear RAM between boots (only
power cycles), so the partition table and ext4 filesystem stays good,
and all I have to do is:

mount -odax /dev/pmem0p1 /mnt/ram0/

Also my xfstests local.config:

TEST_DEV=/dev/pmem0p1
TEST_DIR=/mnt/ram0
SCRATCH_DEV=/dev/pmem0p2
SCRATCH_MNT=/mnt/ram1
TEST_FS_MOUNT_OPTS="-o dax"
EXT_MOUNT_OPTIONS="-o dax"
MKFS_OPTIONS="-b4096"

Hope I didn't forget anything.

2015-07-03 19:07:11

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX

On Fri, Jul 03, 2015 at 02:48:24PM -0400, Matthew Wilcox wrote:
>
> At boot, I "modprobe pmem".

Is there a reason why it's important to build and load pmem as a
module? If I use CONFIG_BLK_DEV_PMEM=y (which is more convenient
given how I launch my KVM test appliance), should I expect any
problems?

I assume that this won't detect any bugs caused by missing CLFLUSH
instructions, but I assume that when using NVM as a block device, this
isn't much of an issue, as long as we don't care about torn writes?
(How using NVM with metdata checksums, or any checksums for that
matter, seems to be an interesting question --- how do we recover from
a checksum failure after a power failure?)

- Ted

2015-07-04 05:03:37

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [PATCH v2 1/6] dax: Add block size note to documentation

On Fri, Jul 03, 2015 at 10:40:38AM -0400, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> For block devices which are small enough, mkfs will default to creating
> a filesystem with block sizes smaller than page size.

This seems like an odd note. At least for XFS and btrfs filesystem size
don't change block sizes. ff this is an extN oddity you're probably
better off documenting it there.

2015-07-05 08:44:08

by Boaz Harrosh

[permalink] [raw]

Subject: Re: [PATCH v2 1/6] dax: Add block size note to documentation

On 07/04/2015 08:03 AM, Christoph Hellwig wrote:
> On Fri, Jul 03, 2015 at 10:40:38AM -0400, Matthew Wilcox wrote:
>> From: Matthew Wilcox <[email protected]>
>>
>> For block devices which are small enough, mkfs will default to creating
>> a filesystem with block sizes smaller than page size.
>
> This seems like an odd note. At least for XFS and btrfs filesystem size
> don't change block sizes. ff this is an extN oddity you're probably
> better off documenting it there.
>

Is why I added to brd:
/* This is so fdisk will align partitions on 4k, because of
* direct_access API needing 4k alignment, returning a PFN
* (This is only a problem on very small devices <= 4M,
* otherwise fdisk will align on 1M. Regardless this call
* is harmless)
*/
blk_queue_physical_block_size(brd->brd_queue, PAGE_SIZE);

Perhaps add the same to pmem, and/or perhaps make mkfs.extX
also inspect this and not create blocks smaller then physical_block_size?

Thanks
Boaz

2015-07-05 13:11:37

by Boaz Harrosh

[permalink] [raw]

Subject: Re: [PATCH v2 2/6] dax: Use copy_from_iter_nocache

On 07/03/2015 05:40 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox <[email protected]>
>
> When userspace does a write, there's no need for the written data to
> pollute the CPU cache. This matches the original XIP code.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> fs/dax.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 99b5fbc..eaa9e06 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -155,7 +155,7 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
> }
>
> if (iov_iter_rw(iter) == WRITE)
> - len = copy_from_iter(addr, max - pos, iter);
> + len = copy_from_iter_nocache(addr, max - pos, iter);
> else if (!hole)
> len = copy_to_iter(addr, max - pos, iter);
> else
>

With the current ioremap_nocache at pmem none of this matters for pmem.

For brd yes, so We've been conducting some measurements and regular ext4
(no DAX) benchmark gives 6-16% increase in performance with this above.
And DAX is almost x2 then no DAX.
Is why the network guys been using this for a long time. So I'd say this
is a good default for any page-cache writes. (Think about it it makes sense,
we will 95% of the time flush these to real memory before DMA)

For pmem with any sane cached mapping (We use page-stuct-pmem actually)
Then DAX, for it to actually work (persist) with pmem, needs this:

static size_t copy_from_iter_nt(void *addr, size_t bytes, struct iov_iter *ii)
{
size_t ret = copy_from_iter_nocache(addr, bytes, ii);

if (unlikely((ii->type & ITER_BVEC) || (ii->type & ITER_KVEC))) {
/* FIXME: copy_from_iter_nocache did regular copy for Kernel
* buffers (BVEC or KVEC). Before we fix it do cl_flush
* for now.
*/
cl_flush(addr, bytes, false);
} else {
/* copy_from_iter_nocache only persists in 8-byte aligned words.
* Lets persist remaining unaligned edges.
*/
if (unlikely((ulong)addr & 0x7))
cl_flush(addr, 1, false);
if (unlikely((ulong)(addr + bytes) & 0x7))
cl_flush((addr + bytes), 1, false);
}

return ret;
}

This is based on an not-in-kernel cl_flush().

The first part FIXME above could be fixed with Dan's memcpy_persistent() patches

Cheers
Boaz

2015-07-05 13:29:41

by Boaz Harrosh

[permalink] [raw]

Subject: Re: [PATCH v2 3/6] ext4: Use ext4_get_block_write() for DAX

On 07/03/2015 10:07 PM, Theodore Ts'o wrote:
> On Fri, Jul 03, 2015 at 02:48:24PM -0400, Matthew Wilcox wrote:
>>
>> At boot, I "modprobe pmem".
>
> Is there a reason why it's important to build and load pmem as a
> module? If I use CONFIG_BLK_DEV_PMEM=y (which is more convenient
> given how I launch my KVM test appliance), should I expect any
> problems?
>

This (=y) should work fine. We use it a lot. (with KVM even boot an
image with -dax, with a trick)

Note that DAX need not be tested with pmem only, you can always use brd
at any given point without any reboot.

One more trick for xfstest I use:
memmap=2G!4G,2G!6G

And have two pmem0/1 and don't need to bother with any fdisk. Do need
to mkfs every boot though.

BTW: with kvm a reboot with above memmap will give you back the exact
same memory. halt and "virsh start" is a different story.

> I assume that this won't detect any bugs caused by missing CLFLUSH
> instructions, but I assume that when using NVM as a block device, this
> isn't much of an issue, as long as we don't care about torn writes?
> (How using NVM with metdata checksums, or any checksums for that
> matter, seems to be an interesting question --- how do we recover from
> a checksum failure after a power failure?)
>

Currently pmem maps a very-(very) slow ioremap_nocache. So any Kernel
memory access should be pmem persistent. For a real world faster ioremap,
there are few major pieces still missing in the stack to make it
persistent. Note the even today with ioremap_nocache, any application
mmap (like git) is not persistent.

> - Ted

Cheers
Boaz

2015-07-05 13:47:32

by Boaz Harrosh

[permalink] [raw]

Subject: Re: [PATCH v2 5/6] block: Add support for DAX reads/writes to block devices

On 07/03/2015 05:40 PM, Matthew Wilcox wrote:
> If a block device supports the ->direct_access methods, bypass the normal
> DIO path and use DAX to go straight to memcpy() instead of allocating
> a DIO and a BIO.
>

I can't remember the details but I'm not sure it is safe for mmap to go through
page-cache while DAX is bypassing page-cache. (Because of the pg_mkwrite thing,
while IO), do you find this code safe?

I think you need to force all DAX IO through dax.c. I did not understand why we
need to give the user an option? why would he ever want a cached access to a memory
device (ie two copies of the same thing). Why not DAX hard coded?

> Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
> do_blockdev_direct_IO().
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> fs/block_dev.c | 4 ++++
> fs/dax.c | 6 ++++--
> 2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 4fe10f9..0bb2993 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -152,6 +152,9 @@ blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter, loff_t offset)
> struct file *file = iocb->ki_filp;
> struct inode *inode = file->f_mapping->host;
>
> + if (IS_DAX(inode))
> + return dax_do_io(iocb, inode, iter, offset, blkdev_get_block,
> + NULL, DIO_SKIP_DIO_COUNT);
> return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter, offset,
> blkdev_get_block, NULL, NULL,
> DIO_SKIP_DIO_COUNT);
> @@ -1170,6 +1173,7 @@ static int __blkdev_get(struct block_device *bdev, fmode_t mode, int for_part)
> bdev->bd_disk = disk;
> bdev->bd_queue = disk->queue;
> bdev->bd_contains = bdev;
> + bdev->bd_inode->i_flags = disk->fops->direct_access ? S_DAX : 0;
> if (!partno) {
> ret = -ENXIO;
> bdev->bd_part = disk_get_part(disk, partno);
> diff --git a/fs/dax.c b/fs/dax.c
> index eaa9e06..c3e21cc 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -209,7 +209,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
> }
>
> /* Protects against truncate */
> - inode_dio_begin(inode);
> + if (!(flags & DIO_SKIP_DIO_COUNT))
> + inode_dio_begin(inode);

Is really a separate issue, is it not?

>
> retval = dax_io(inode, iter, pos, end, get_block, &bh);
>
> @@ -219,7 +220,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
> if ((retval > 0) && end_io)
> end_io(iocb, pos, retval, bh.b_private);
>
> - inode_dio_end(inode);
> + if (!(flags & DIO_SKIP_DIO_COUNT))
> + inode_dio_end(inode);
> out:
> return retval;
> }
>

This scares me, the mix of DAX while cached data. I'd vote for hard-coded DAX
all over.

Boaz