2015-08-05 20:19:13

by Jeff Moyer

Subject: regression introduced by "block: Add support for DAX reads/writes to block devices"

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0             isize=256    agcount=4, agsize=524288 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=2097152, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device. This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

	if ((sector + DIV_ROUND_UP(size, 512)) >
	    part_nr_sects_read(bdev->bd_part))
		return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior. However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do? We could make the pmem device's logical
block size fixed at the system page size. Or, we could modify the dax
code to work with blocksize < pagesize. Or, we could continue using the
direct I/O codepath for direct block device access. What do you think?

Thanks,
Jeff and Linda


2015-08-05 22:01:30

by Dave Chinner

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> Hi, Matthew,
>
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=2097152, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=2560, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
>
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device. This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.

Right - we have to be able to do IO to that last sector, so this is
a sanity check to tell if the block dev is large enough. The XFS
kernel code does the same end-of-device sector read when the
filesystem is mounted, too.

> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
>
> if ((sector + DIV_ROUND_UP(size, 512)) >
> part_nr_sects_read(bdev->bd_part))
> return -ERANGE;
>
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior. However,
> we really shouldn't be breaking direct I/O on pmem devices.

If the device is advertising 512 byte sector size support, then this
needs to work, especially as DAX is completely transparent on the
block device. Remember that DAX through a filesystem works on
filesystem data block size boundaries, so a 512 byte sector/4k block
size filesystem will be able to use DAX for mmapped files just fine.

> So, what do you want to do? We could make the pmem device's logical
> block size fixed at the sytem page size. Or, we could modify the dax
> code to work with blocksize < pagesize. Or, we could continue using the
> direct I/O codepath for direct block device access. What do you think?

I don't know how the pmem device sets up its limits. Can you post
the output of:

/sys/block/pmem0/queue/logical_block_size
/sys/block/pmem0/queue/physical_block_size
/sys/block/pmem0/queue/hw_sector_size
/sys/block/pmem0/queue/minimum_io_size
/sys/block/pmem0/queue/optimal_io_size

As these all affect how mkfs.xfs configures the filesystem being
made and so influence the size and alignment of the IO it does....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-06 01:43:09

by Linda Knippers

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 08/05/2015 06:01 PM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> Hi, Matthew,
>>
>> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>>
>> # mkfs -t xfs -f /dev/pmem0
>> meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
>> = sectsz=512 attr=2, projid32bit=1
>> = crc=0 finobt=0
>> data = bsize=4096 blocks=2097152, imaxpct=25
>> = sunit=0 swidth=0 blks
>> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
>> log =internal log bsize=4096 blocks=2560, version=2
>> = sectsz=512 sunit=0 blks, lazy-count=1
>> realtime =none extsz=4096 blocks=0, rtextents=0
>> mkfs.xfs: read failed: Numerical result out of range
>>
>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>> from the last sector of the device. This results in dax_io trying to do
>> a page-sized I/O at 512 bytes from the end of the device.
>
> Right - we have to be able to do IO to that last sector, so this is
> a sanity check to tell if the block dev is large enough. The XFS
> kernel code does the same end-of-device sector read when the
> filesystem is mounted, too.
>
>> bdev_direct_access, receiving this bogus pos/size combo, returns
>> -ERANGE:
>>
>> if ((sector + DIV_ROUND_UP(size, 512)) >
>> part_nr_sects_read(bdev->bd_part))
>> return -ERANGE;
>>
>> Given that file systems supporting dax refuse to mount with a blocksize
>> != page size, I'm guessing this is sort of expected behavior. However,
>> we really shouldn't be breaking direct I/O on pmem devices.
>
> If the device is advertising 512 byte sector size support, then this
> needs to work, especially as DAX is completely transparent on the
> block device. Remember that DAX through a filesystem works on
> filesystem data block size boundaries, so a 512 byte sector/4k block
> size filesystem will be able to use DAX for mmapped files just fine.
>
>> So, what do you want to do? We could make the pmem device's logical
>> block size fixed at the sytem page size. Or, we could modify the dax
>> code to work with blocksize < pagesize. Or, we could continue using the
>> direct I/O codepath for direct block device access. What do you think?
>
> I don't know how the pmem device sets up it's limits. Can you post
> the output of:
>
> /sys/block/pmem0/queue/logical_block_size
512

> /sys/block/pmem0/queue/physical_block_size
512

> /sys/block/pmem0/queue/hw_sector_size
512

> /sys/block/pmem0/queue/minimum_io_size
512

> /sys/block/pmem0/queue/optimal_io_size
0

Let me know if you need anything else.

-- ljk


> As these all affect how mkfs.xfs configures the filesystem being
> made and so influences the size and alignment of the IO is does....
>
> Cheers,
>
> Dave.
>

2015-08-06 03:24:37

by Dave Chinner

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> > On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> >> Hi, Matthew,
> >>
> >> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
> >>
> >> # mkfs -t xfs -f /dev/pmem0
> >> meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
> >> = sectsz=512 attr=2, projid32bit=1
> >> = crc=0 finobt=0
> >> data = bsize=4096 blocks=2097152, imaxpct=25
> >> = sunit=0 swidth=0 blks
> >> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> >> log =internal log bsize=4096 blocks=2560, version=2
> >> = sectsz=512 sunit=0 blks, lazy-count=1
> >> realtime =none extsz=4096 blocks=0, rtextents=0
> >> mkfs.xfs: read failed: Numerical result out of range
> >>
> >> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> >> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> >> from the last sector of the device. This results in dax_io trying to do
> >> a page-sized I/O at 512 bytes from the end of the device.
> >
> > Right - we have to be able to do IO to that last sector, so this is
> > a sanity check to tell if the block dev is large enough. The XFS
> > kernel code does the same end-of-device sector read when the
> > filesystem is mounted, too.
> >
> >> bdev_direct_access, receiving this bogus pos/size combo, returns
> >> -ERANGE:
> >>
> >> if ((sector + DIV_ROUND_UP(size, 512)) >
> >> part_nr_sects_read(bdev->bd_part))
> >> return -ERANGE;
> >>
> >> Given that file systems supporting dax refuse to mount with a blocksize
> >> != page size, I'm guessing this is sort of expected behavior. However,
> >> we really shouldn't be breaking direct I/O on pmem devices.
> >
> > If the device is advertising 512 byte sector size support, then this
> > needs to work, especially as DAX is completely transparent on the
> > block device. Remember that DAX through a filesystem works on
> > filesystem data block size boundaries, so a 512 byte sector/4k block
> > size filesystem will be able to use DAX for mmapped files just fine.
> >
> >> So, what do you want to do? We could make the pmem device's logical
> >> block size fixed at the sytem page size. Or, we could modify the dax
> >> code to work with blocksize < pagesize. Or, we could continue using the
> >> direct I/O codepath for direct block device access. What do you think?
> >
> > I don't know how the pmem device sets up it's limits. Can you post
> > the output of:
> >
> > /sys/block/pmem0/queue/logical_block_size
> 512
>
> > /sys/block/pmem0/queue/physical_block_size
> 512
>
> > /sys/block/pmem0/queue/hw_sector_size
> 512
>
> > /sys/block/pmem0/queue/minimum_io_size
> 512
>
> > /sys/block/pmem0/queue/optimal_io_size
> 0

Ok, so the pmem device is advertising 512 bytes for both
physical and logical sector sizes. That means mkfs.xfs is not doing
anything wrong, i.e. -ERANGE on a read of the last sector of the
block device is a bug in the block device code.

It is not at all obvious from these sector sizes that the block
device is DAX enabled. I'd suggest that you probably want to make
the physical sector size 4k on x86-64 to indicate to filesystem
utilities that 4k alignment of the filesystem is preferred, even if
512 byte IO can be supported in a less efficient manner (i.e.
equivalent of a 512e hard drive)....

You can't really make the logical sector size = PAGE_SIZE, because
on 64k page size machines that will make the sector size larger than
many filesystems support. e.g. XFS only supports sector sizes up to
32k at the moment...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-06 07:52:52

by Boaz Harrosh

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 08/06/2015 06:24 AM, Dave Chinner wrote:
> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
<>
>>>>
>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>>> from the last sector of the device. This results in dax_io trying to do
>>>> a page-sized I/O at 512 bytes from the end of the device.
>>>

This part I do not understand. How is mkfs.xfs reading the sector?
Is it through open(/dev/pmem0, ...)? O_DIRECT?

If so, then yes, the inode of /dev/pmem0 is IS_DAX() and will try
to use the dax.c stuff. (I think; which kernel?)

Which means this is a bug.

>>> Right - we have to be able to do IO to that last sector, so this is
>>> a sanity check to tell if the block dev is large enough. The XFS
>>> kernel code does the same end-of-device sector read when the
>>> filesystem is mounted, too.
>>>
>>>> bdev_direct_access, receiving this bogus pos/size combo, returns
>>>> -ERANGE:
>>>>
>>>> if ((sector + DIV_ROUND_UP(size, 512)) >
>>>> part_nr_sects_read(bdev->bd_part))
>>>> return -ERANGE;
>>>>
>>>> Given that file systems supporting dax refuse to mount with a blocksize
>>>> != page size, I'm guessing this is sort of expected behavior. However,
>>>> we really shouldn't be breaking direct I/O on pmem devices.
>>>

No, this is a BUG. A read/write, buffered or direct, to an IS_DAX() inode
should work at any alignment and size, since with DAX buffered and direct
are the exact same code path, and buffered IO expects IO of any size.

This is probably a bug in the DAX handling of the bdev-inode. Let me
test this. I will send a fix ASAP.

<>
>>> the output of:
>>>
>>> /sys/block/pmem0/queue/logical_block_size
>> 512
>>
>>> /sys/block/pmem0/queue/physical_block_size
>> 512
>>

There is a pending fix for this.
Do you need it sent to stable?

>>> /sys/block/pmem0/queue/hw_sector_size
>> 512
>>
>>> /sys/block/pmem0/queue/minimum_io_size
>> 512
>>
>>> /sys/block/pmem0/queue/optimal_io_size
>> 0

Thanks
Boaz

2015-08-06 14:21:30

by Matthew Wilcox

Subject: RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

I think I see the problem. I'm kind of wrapped up in other things right now; can you try replacing the line in dax_io():

-	bh->b_size = PAGE_ALIGN(end - pos);
+	bh->b_size = ALIGN(end - pos, 1 << blkbits);

-----Original Message-----
From: Jeff Moyer [mailto:[email protected]]
Sent: Wednesday, August 05, 2015 1:19 PM
To: Wilcox, Matthew R; [email protected]
Cc: [email protected]; [email protected]
Subject: regression introduced by "block: Add support for DAX reads/writes to block devices"

Hi, Matthew,

Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:

# mkfs -t xfs -f /dev/pmem0
meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=2097152, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

I sat down with Linda to look into it, and the problem is that mkfs.xfs
sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
from the last sector of the device. This results in dax_io trying to do
a page-sized I/O at 512 bytes from the end of the device.
bdev_direct_access, receiving this bogus pos/size combo, returns
-ERANGE:

if ((sector + DIV_ROUND_UP(size, 512)) >
part_nr_sects_read(bdev->bd_part))
return -ERANGE;

Given that file systems supporting dax refuse to mount with a blocksize
!= page size, I'm guessing this is sort of expected behavior. However,
we really shouldn't be breaking direct I/O on pmem devices.

So, what do you want to do? We could make the pmem device's logical
block size fixed at the sytem page size. Or, we could modify the dax
code to work with blocksize < pagesize. Or, we could continue using the
direct I/O codepath for direct block device access. What do you think?

Thaks,
Jeff and Linda

2015-08-06 15:33:55

by Jeff Moyer

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

"Wilcox, Matthew R" <[email protected]> writes:

> I think I see the problem. I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

That's not gonna work either. :) You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

	if (sector % (PAGE_SIZE / 512))
		return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right? I'll take a crack
at it if you're busy.

Cheers,
Jeff

> -----Original Message-----
> From: Jeff Moyer [mailto:[email protected]]
> Sent: Wednesday, August 05, 2015 1:19 PM
> To: Wilcox, Matthew R; [email protected]
> Cc: [email protected]; [email protected]
> Subject: regression introduced by "block: Add support for DAX reads/writes to block devices"
>
> Hi, Matthew,
>
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=2097152, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=2560, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
>
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device. This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.
> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
>
> if ((sector + DIV_ROUND_UP(size, 512)) >
> part_nr_sects_read(bdev->bd_part))
> return -ERANGE;
>
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior. However,
> we really shouldn't be breaking direct I/O on pmem devices.
>
> So, what do you want to do? We could make the pmem device's logical
> block size fixed at the sytem page size. Or, we could modify the dax
> code to work with blocksize < pagesize. Or, we could continue using the
> direct I/O codepath for direct block device access. What do you think?
>
> Thaks,
> Jeff and Linda

2015-08-06 15:52:01

by Matthew Wilcox

Subject: RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

Yes, that's the result I want. Fundamentally, I think DAX should be able to support devices that are not multiples of PAGE_SIZE in size.

-----Original Message-----
From: Jeff Moyer [mailto:[email protected]]
Sent: Thursday, August 06, 2015 8:34 AM
To: Wilcox, Matthew R
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

"Wilcox, Matthew R" <[email protected]> writes:

> I think I see the problem. I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

That's not gonna work either. :) You'll end up with -EINVAL since
bdev_direct_access wants the sector to be aligned to a page:

if (sector % (PAGE_SIZE / 512))
return -EINVAL;

I think you really want to call direct_access with the full page, and
then tease out the part you want up in dax_io, right? I'll take a crack
at it if you're busy.

Cheers,
Jeff

> -----Original Message-----
> From: Jeff Moyer [mailto:[email protected]]
> Sent: Wednesday, August 05, 2015 1:19 PM
> To: Wilcox, Matthew R; [email protected]
> Cc: [email protected]; [email protected]
> Subject: regression introduced by "block: Add support for DAX reads/writes to block devices"
>
> Hi, Matthew,
>
> Linda Knippers noticed that commit (bbab37ddc20b) breaks mkfs.xfs:
>
> # mkfs -t xfs -f /dev/pmem0
> meta-data=/dev/pmem0 isize=256 agcount=4, agsize=524288 blks
> = sectsz=512 attr=2, projid32bit=1
> = crc=0 finobt=0
> data = bsize=4096 blocks=2097152, imaxpct=25
> = sunit=0 swidth=0 blks
> naming =version 2 bsize=4096 ascii-ci=0 ftype=0
> log =internal log bsize=4096 blocks=2560, version=2
> = sectsz=512 sunit=0 blks, lazy-count=1
> realtime =none extsz=4096 blocks=0, rtextents=0
> mkfs.xfs: read failed: Numerical result out of range
>
> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> from the last sector of the device. This results in dax_io trying to do
> a page-sized I/O at 512 bytes from the end of the device.
> bdev_direct_access, receiving this bogus pos/size combo, returns
> -ERANGE:
>
> if ((sector + DIV_ROUND_UP(size, 512)) >
> part_nr_sects_read(bdev->bd_part))
> return -ERANGE;
>
> Given that file systems supporting dax refuse to mount with a blocksize
> != page size, I'm guessing this is sort of expected behavior. However,
> we really shouldn't be breaking direct I/O on pmem devices.
>
> So, what do you want to do? We could make the pmem device's logical
> block size fixed at the sytem page size. Or, we could modify the dax
> code to work with blocksize < pagesize. Or, we could continue using the
> direct I/O codepath for direct block device access. What do you think?
>
> Thaks,
> Jeff and Linda

2015-08-06 20:35:39

by Dave Chinner

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
> On 08/06/2015 06:24 AM, Dave Chinner wrote:
> > On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> >> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> >>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> <>
> >>>>
> >>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> >>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> >>>> from the last sector of the device. This results in dax_io trying to do
> >>>> a page-sized I/O at 512 bytes from the end of the device.
> >>>
>
> This part I do not understand. how is mkfs.xfs reading the sector?
> Is it through open(/dev/pmem0,...) ? O_DIRECT?

mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
told that it is working on an image file does it fall back to
buffered IO. All of the XFS userspace tools work this way to prevent
page cache pollution issues with read-once or write-once data during
operation.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-06 22:25:31

by Jeff Moyer

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

"Wilcox, Matthew R" <[email protected]> writes:

> I think I see the problem. I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

This works for me. If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
 	return bh->b_state != 0;
 }
 
+/*
+ * This function assumes file system block size (represented by
+ * inode->i_blkbits) is less than or equal to the system page size.
+ */
 static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		loff_t start, loff_t end, get_block_t get_block,
 		struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 	if (pos == max) {
 		unsigned blkbits = inode->i_blkbits;
 		sector_t block = pos >> blkbits;
-		unsigned first = pos - (block << blkbits);
+		long page = pos >> PAGE_SHIFT;
+		unsigned first;	/* byte offset into block */
 		long size;
 
+		/* we can only map entire pages */
+		if (pos & (PAGE_SIZE-1))
+			block = page << (PAGE_SHIFT - blkbits);
+		first = pos - (block << blkbits);
+
 		if (pos == bh_max) {
 			bh->b_size = PAGE_ALIGN(end - pos);
 			bh->b_state = 0;

2015-08-07 18:11:32

by Matthew Wilcox

Subject: RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

So ... what you're doing here is if somebody requests the last 512 bytes, you're asking for the last page. That's going to work as long as the partition is a multiple of PAGE_SIZE, but not if the partition happens to be misaligned. It's an improvement, although I'd be tempted to do this as:

 	if (pos == max) {
 		unsigned blkbits = inode->i_blkbits;
-		sector_t block = pos >> blkbits;
+		long page = pos >> PAGE_SHIFT;
+		sector_t block = page << (PAGE_SHIFT - blkbits);
 		unsigned first = pos - (block << blkbits);
 		long size;

We need to cope with the case where the end of a partition isn't on a page boundary though.

-----Original Message-----
From: Jeff Moyer [mailto:[email protected]]
Sent: Thursday, August 06, 2015 2:31 PM
To: Wilcox, Matthew R
Cc: [email protected]; [email protected]; [email protected]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

"Wilcox, Matthew R" <[email protected]> writes:

> I think I see the problem. I'm kind of wrapped up in other things
> right now; can you try replacing the line in dax_io():
>
> - bh->b_size = PAGE_ALIGN(end - pos);
> + bh->b_size = ALIGN(end - pos, 1 << blkbits);

This works for me. If it looks okay to others, I'll submit a properly
signed-off patch.

Cheers,
Jeff

diff --git a/fs/dax.c b/fs/dax.c
index a7f77e1..b6c4f93 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -98,6 +98,10 @@ static bool buffer_size_valid(struct buffer_head *bh)
return bh->b_state != 0;
}

+/*
+ * This function assumes file system block size (represented by
+ * inode->i_blkbits) is less than or equal to the system page size.
+ */
static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
loff_t start, loff_t end, get_block_t get_block,
struct buffer_head *bh)
@@ -117,9 +121,15 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
if (pos == max) {
unsigned blkbits = inode->i_blkbits;
sector_t block = pos >> blkbits;
- unsigned first = pos - (block << blkbits);
+ long page = pos >> PAGE_SHIFT;
+ unsigned first; /* byte offset into block */
long size;

+ /* we can only map entire pages */
+ if (pos & (PAGE_SIZE-1))
+ block = page << (PAGE_SHIFT - blkbits);
+ first = pos - (block << blkbits);
+
if (pos == bh_max) {
bh->b_size = PAGE_ALIGN(end - pos);
bh->b_state = 0;

2015-08-07 20:41:13

by Jeff Moyer

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

"Wilcox, Matthew R" <[email protected]> writes:

> So ... what you're doing here is if somebody requests the last 512
> bytes, you're asking for the last page. That's going to work as long
> as the partition is a multiple of PAGE_SIZE, but not if the partition
> happens to be misaligned. It's an improvement, although I'd be
> tempted to do this as:
>
> if (pos == max) {
> unsigned blkbits = inode->i_blkbits;
> - sector_t block = pos >> blkbits;
> + long page = pos >> PAGE_SHIFT;
> + sector_t block = page << (PAGE_SHIFT - blkbits);
> unsigned first = pos - (block << blkbits);
> long size;

Yeah, that's easier to read.

> We need to cope with the case where the end of a partition isn't on a
> page boundary though.

Well, that's usually done by falling back to buffered I/O. I gave that
a try and panicked the box. :) I'll keep looking into it, but probably
won't have another patch until next week.

Cheers,
Jeff

2015-08-09 08:52:40

by Boaz Harrosh

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 08/06/2015 11:34 PM, Dave Chinner wrote:
> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>> <>
>>>>>>
>>>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>>>>> from the last sector of the device. This results in dax_io trying to do
>>>>>> a page-sized I/O at 512 bytes from the end of the device.
>>>>>
>>
>> This part I do not understand. how is mkfs.xfs reading the sector?
>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>
> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
> told that it is working on an image file does it fall back to
> buffered IO. All of the XFS userspace tools work this way to prevent
> page cache pollution issues with read-once or write-once data during
> operation.
>

Thanks, yes, that makes sense. This is a bug in the DAX implementation of
the bdev. Since, as you know, with DAX there is no difference between
O_DIRECT and buffered, we must support any aligned IO. I bet it
is something with bdev not giving 4K buffer-heads to dax.c.

Or ... it might just be the infamous bug where the actual partition
they used was not 4k aligned on its start sector, so the last-sector IO
after partition translation came out wrong. That bug should be
fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
by Vishal Verma

Vishal I think we should add CC: [email protected] to your patch
because of these fdisk bugs.

> Cheers,
> Dave.

Thanks
Boaz

2015-08-10 07:42:19

by Boaz Harrosh

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 08/07/2015 11:41 PM, Jeff Moyer wrote:
<>
>
>> We need to cope with the case where the end of a partition isn't on a
>> page boundary though.
>
> Well, that's usually done by falling back to buffered I/O. I gave that
> a try and panicked the box. :) I'll keep looking into it, but probably
> won't have another patch until next week.
>

Let's slow down for a sec, please.

We have all established that an unaligned partition start is BAD and not supported?
(If we want to also support this, which is possible, then all the below is moot.)

Well, we do know that any real pmem device will actually be 128M aligned because of
how the DIMM thing works. So any start-aligned partition means length-aligned as well.
(Emulated pmem is 4k aligned as well.)

That said, you might want to protect against an unaligned start / length. Even though
we have the 4k physical sector patch, a forced fdisk could produce such a partition.

I would suggest that for such un-aligned partitions the code in bdev that sets
IS_DAX on the bdev-inode should only do so if the start / length is aligned.
Else it becomes a !IS_DAX inode and all is fine.
Users will learn soon enough that for dax you need to keep the recommended 4k
alignments.

In the DAX filesystem case the start-un-aligned case is fatal and the FS would
not mount (dax enabled). The length-un-aligned case is not a problem because
the 4k block size mandatory for dax will trim the device length to an FS block
boundary.

So it is only a problem with raw bdev inodes, and I would just not let it be
IS_DAX in the un-aligned case. (Or trim its size.)

Thanks
Boaz

> Cheers,
> Jeff

2015-08-10 16:32:47

by Linda Knippers

Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
> On 08/06/2015 11:34 PM, Dave Chinner wrote:
>> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>>> <>
>>>>>>>
>>>>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>>>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>>>>>> from the last sector of the device. This results in dax_io trying to do
>>>>>>> a page-sized I/O at 512 bytes from the end of the device.
>>>>>>
>>>
>>> This part I do not understand. how is mkfs.xfs reading the sector?
>>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>>
>> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
>> told that it is working on an image file does it fall back to
>> buffered IO. All of the XFS userspace tools work this way to prevent
>> page cache pollution issues with read-once or write-once data during
>> operation.
>>
>
> Thanks, yes makes sense. This is a bug at the DAX implementation of
> bdev. Since as you know with DAX there is no difference between
> O_DIRECT and buffered, we must support any aligned IO. I bet it
> should be something with bdev not giving 4K buffer-heads to dax.c.
>
> Or ... It might just be the infamous bug where the actual partition
> they used was not 4k aligned on its start sector. So the last sector IO
> after partition translation came out wrong. This bug then should be
> fixed by: https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
> by:Vishal Verma
>
> Vishal I think we should add CC: [email protected] to your patch
> because of these fdisk bugs.

That patch does cause 'mkfs -t xfs' to work.

Before:
$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
= sectsz=512 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=2097152, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
mkfs.xfs: read failed: Numerical result out of range

After:

$ sudo mkfs -t xfs -f /dev/pmem3
meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=0 finobt=0
data = bsize=4096 blocks=2097152, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=0
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

$ cat /sys/block/pmem3/queue/logical_block_size
512
$ cat /sys/block/pmem3/queue/physical_block_size
4096
$ cat /sys/block/pmem3/queue/hw_sector_size
512
$ cat /sys/block/pmem3/queue/minimum_io_size
4096

Previously physical_block_size was 512 and minimum_io_size was 0.
What about logical_block_size and hw_sector_size still being 512?

So do we want to change pmem rather than changing DAX?

-- ljk
>
>> Cheers,
>> Dave.
>
> Thanks
> Boaz
>

2015-08-10 21:27:33

by Dave Chinner

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
> On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
> > On 08/06/2015 11:34 PM, Dave Chinner wrote:
> >> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
> >>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
> >>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
> >>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
> >>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
> >>> <>
> >>>>>>>
> >>>>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
> >>>>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
> >>>>>>> from the last sector of the device. This results in dax_io trying to do
> >>>>>>> a page-sized I/O at 512 bytes from the end of the device.
> >>>>>>
> >>>
> >>> This part I do not understand. how is mkfs.xfs reading the sector?
> >>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
> >>
> >> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
> >> told that it is working on an image file does it fall back to
> >> buffered IO. All of the XFS userspace tools work this way to prevent
> >> page cache pollution issues with read-once or write-once data during
> >> operation.
....
> That patch does cause 'mkfs -t xfs' to work.
>
> Before:
> $ sudo mkfs -t xfs -f /dev/pmem3
> meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
> = sectsz=512 attr=2, projid32bit=1
^^^^^^^^^^
....

> $ sudo mkfs -t xfs -f /dev/pmem3
> meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
> = sectsz=4096 attr=2, projid32bit=1
^^^^^^^^^^^

So in the after case, mkfs.xfs is behaving differently and not
exercising the bug. It's seen the:

> $ cat /sys/block/pmem3/queue/logical_block_size
> 512
> $ cat /sys/block/pmem3/queue/physical_block_size
> 4096
^^^^

4k physical block size, and hence configured the filesystem with a
4k sector size so all IO it issues is physically aligned. IOWs,
mkfs.xfs's last sector read is 4k aligned and sized, and therefore
the test has not confirmed that the 512 byte last sector read is
fixed at all.

Isn't there a regression test suite that covers basic block device
functionality that you can use to test these simple corner cases?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-08-10 23:04:10

by Linda Knippers

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 8/10/2015 5:27 PM, Dave Chinner wrote:
> On Mon, Aug 10, 2015 at 12:32:08PM -0400, Linda Knippers wrote:
>> On 8/9/2015 4:52 AM, Boaz Harrosh wrote:
>>> On 08/06/2015 11:34 PM, Dave Chinner wrote:
>>>> On Thu, Aug 06, 2015 at 10:52:47AM +0300, Boaz Harrosh wrote:
>>>>> On 08/06/2015 06:24 AM, Dave Chinner wrote:
>>>>>> On Wed, Aug 05, 2015 at 09:42:54PM -0400, Linda Knippers wrote:
>>>>>>> On 08/05/2015 06:01 PM, Dave Chinner wrote:
>>>>>>>> On Wed, Aug 05, 2015 at 04:19:08PM -0400, Jeff Moyer wrote:
>>>>> <>
>>>>>>>>>
>>>>>>>>> I sat down with Linda to look into it, and the problem is that mkfs.xfs
>>>>>>>>> sets the blocksize of the device to 512 (via BLKBSZSET), and then reads
>>>>>>>>> from the last sector of the device. This results in dax_io trying to do
>>>>>>>>> a page-sized I/O at 512 bytes from the end of the device.
>>>>>>>>
>>>>>
>>>>> This part I do not understand. how is mkfs.xfs reading the sector?
>>>>> Is it through open(/dev/pmem0,...) ? O_DIRECT?
>>>>
>>>> mkfs.xfs uses O_DIRECT. Only if open(O_DIRECT) fails or mkfs.xfs is
>>>> told that it is working on an image file does it fall back to
>>>> buffered IO. All of the XFS userspace tools work this way to prevent
>>>> page cache pollution issues with read-once or write-once data during
>>>> operation.
> ....
>> That patch does cause 'mkfs -t xfs' to work.
>>
>> Before:
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
>> = sectsz=512 attr=2, projid32bit=1
> ^^^^^^^^^^
> ....
>
>> $ sudo mkfs -t xfs -f /dev/pmem3
>> meta-data=/dev/pmem3 isize=256 agcount=4, agsize=524288 blks
>> = sectsz=4096 attr=2, projid32bit=1
> ^^^^^^^^^^^
>
> So in the after case, mkfs.xfs is behaving differently and not
> exercising the bug. It's seen the:
>
>> $ cat /sys/block/pmem3/queue/logical_block_size
>> 512
>> $ cat /sys/block/pmem3/queue/physical_block_size
>> 4096
> ^^^^
>
> 4k physical block size, and hence configured the filesystem with a
> 4k sector size so all IO it issues is physically aligned. IOWs,
> mkfs.xfs's last sector read is 4k aligned and sized, and therefore
> the test has not confirmed that the 512 byte last sector read is
> fixed at all.

That is true. All I reported is that mkfs.xfs now works. I
had some questions about whether that was right.

The underlying problem is still there, as I can demonstrate with
a simple reproducer that just does what mkfs.xfs would have done
before.

> Isn't there a regression test suite that covers basic block device
> functionality that you can use to test these simple corner cases?

If there is, seems like DAX adds a few twists.

-- ljk
>
> Cheers,
>
> Dave.
>

2015-08-12 21:11:39

by Jeff Moyer

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

Boaz Harrosh <[email protected]> writes:

> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
> <>
>>
>>> We need to cope with the case where the end of a partition isn't on a
>>> page boundary though.
>>
>> Well, that's usually done by falling back to buffered I/O. I gave that
>> a try and panicked the box. :) I'll keep looking into it, but probably
>> won't have another patch until next week.
>>
>
> lets slow down for a sec, please.
>
> We have all established that an unaligned partition start is BAD and not supported?

No. Unaligned partitions on RAID arrays or 512e devices are bad because
they result in suboptimal performance. They are most certainly still
supported, though.

-Jeff

2015-08-13 05:32:14

by Boaz Harrosh

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 08/13/2015 12:11 AM, Jeff Moyer wrote:
> Boaz Harrosh <[email protected]> writes:
>
>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>> <>
>>>
>>>> We need to cope with the case where the end of a partition isn't on a
>>>> page boundary though.
>>>
>>> Well, that's usually done by falling back to buffered I/O. I gave that
>>> a try and panicked the box. :) I'll keep looking into it, but probably
>>> won't have another patch until next week.
>>>
>>
>> lets slow down for a sec, please.
>>
>> We have all established that an unaligned partition start is BAD and not supported?
>
> No. Unaligned partitions on RAID arrays or 512e devices are bad because
> they result in suboptimal performance. They are most certainly still
> supported, though.
>

What ?

I meant for dax on pmem or brd. I meant that we *do not* support dax access
on an unaligned partition start. (Non-dax is perfectly supported.)

We did it this way because of the direct_access API, which returns a pfn
that is PAGE_SIZE aligned. We could introduce a pfn+offset but we thought it
was not worth it, and that dax should be page aligned for code simplicity.

Cheers
Boaz

> -Jeff
>

2015-08-13 14:00:10

by Jeff Moyer

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

Boaz Harrosh <[email protected]> writes:

> On 08/13/2015 12:11 AM, Jeff Moyer wrote:
>> Boaz Harrosh <[email protected]> writes:
>>
>>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>>> <>
>>>>
>>>>> We need to cope with the case where the end of a partition isn't on a
>>>>> page boundary though.
>>>>
>>>> Well, that's usually done by falling back to buffered I/O. I gave that
>>>> a try and panicked the box. :) I'll keep looking into it, but probably
>>>> won't have another patch until next week.
>>>>
>>>
>>> lets slow down for a sec, please.
>>>
>>> We have all established that an unaligned partition start is BAD and not supported?
>>
>> No. Unaligned partitions on RAID arrays or 512e devices are bad because
>> they result in suboptimal performance. They are most certainly still
>> supported, though.
>>
>
> What ?
>
> > I meant for dax on pmem or brd. I meant that we *do not* support dax access
> > on an unaligned partition start. (Non-dax is perfectly supported.)

Sorry, I thought your statement was broader than that.

> > We did it this way because of the direct_access API, which returns a pfn
> > that is PAGE_SIZE aligned. We could introduce a pfn+offset but we thought it
> > was not worth it, and that dax should be page aligned for code simplicity.

I'd be fine with changing the persistent memory block device to only
support 4k logical, 4k physical block size. That probably makes the
most sense.

Cheers,
Jeff

2015-08-13 16:42:36

by Linda Knippers

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 8/13/2015 10:00 AM, Jeff Moyer wrote:
> Boaz Harrosh <[email protected]> writes:
>
>> On 08/13/2015 12:11 AM, Jeff Moyer wrote:
>>> Boaz Harrosh <[email protected]> writes:
>>>
>>>> On 08/07/2015 11:41 PM, Jeff Moyer wrote:
>>>> <>
>>>>>
>>>>>> We need to cope with the case where the end of a partition isn't on a
>>>>>> page boundary though.
>>>>>
>>>>> Well, that's usually done by falling back to buffered I/O. I gave that
>>>>> a try and panicked the box. :) I'll keep looking into it, but probably
>>>>> won't have another patch until next week.
>>>>>
>>>>
>>>> lets slow down for a sec, please.
>>>>
>>>> We have all established that an unaligned partition start is BAD and not supported?
>>>
>>> No. Unaligned partitions on RAID arrays or 512e devices are bad because
>>> they result in suboptimal performance. They are most certainly still
>>> supported, though.
>>>
>>
>> What ?
>>
>> I meant for dax on pmem or brd. I meant that we *do not* support dax access
>> on an unaligned partition start. (Non-dax is perfectly supported.)
>
> Sorry, I thought your statement was broader than that.
>
>> We did it this way because of the direct_access API, which returns a pfn
>> that is PAGE_SIZE aligned. We could introduce a pfn+offset but we thought it
>> was not worth it, and that dax should be page aligned for code simplicity.
>
> I'd be fine with changing the persistent memory block device to only
> support 4k logical, 4k physical block size. That probably makes the
> most sense.

If that's what we want, the current patch doesn't do that.
https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html

It causes the physical block size to be PAGE_SIZE but the
logical block size is still 512. However, the minimum_io_size
is now 4096 (same as physical block size, I assume). The
optimal_io_size is still 0. What does that mean?

Whatever we go with, we should do something because 4.2rc6 is still
broken, unable to create a xfs file system on a pmem device, ever
since the change to use DAX on block devices with O_DIRECT.

-- ljk
>
> Cheers,
> Jeff
>

2015-08-13 17:14:53

by Jeff Moyer

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

Linda Knippers <[email protected]> writes:

>> I'd be fine with changing the persistent memory block device to only
>> support 4k logical, 4k physical block size. That probably makes the
>> most sense.
>
> If that's what we want, the current patch doesn't do that.
> https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
>
> It causes the physical block size to be PAGE_SIZE but the
> logical block size is still 512. However, the minimum_io_size
> is now 4096 (same as physical block size, I assume). The
> optimal_io_size is still 0. What does that mean?

physical block size - device's internal block size
logical block size - addressable unit
optimal io size - device's preferred unit for streaming
minimum io size - device's preferred minimum unit for random I/O

See Martin Petersen's "Linux & Advanced Storage Interfaces" document for
more information.

> Whatever we go with, we should do something because 4.2rc6 is still
> broken, unable to create a xfs file system on a pmem device, ever
> since the change to use DAX on block devices with O_DIRECT.

We can change the block device to export logical/physical block sizes of
PAGE_SIZE. However, when persistent memory support comes to platforms
that support page sizes > 32k, xfs will again run into problems (Dave
Chinner mentioned that xfs can't deal with logical block sizes >32k.)
Arguably, you can use pmem and dax on such platforms using RAM today for
testing. Do we care about breaking that?

Cheers,
Jeff

2015-08-13 17:52:17

by Linda Knippers

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On 8/13/2015 1:14 PM, Jeff Moyer wrote:
> Linda Knippers <[email protected]> writes:
>
>>> I'd be fine with changing the persistent memory block device to only
>>> support 4k logical, 4k physical block size. That probably makes the
>>> most sense.
>>
>> If that's what we want, the current patch doesn't do that.
>> https://lists.01.org/pipermail/linux-nvdimm/2015-July/001555.html
>>
>> It causes the physical block size to be PAGE_SIZE but the
>> logical block size is still 512. However, the minimum_io_size
>> is now 4096 (same as physical block size, I assume). The
>> optimal_io_size is still 0. What does that mean?
>
> physical block size - device's internal block size
> logical block size - addressable unit

Right, but it's still reported as 512 and that doesn't work.

> optimal io size - device's preferred unit for streaming

So 0 is ok.

> minimum io size - device's preferred minimum unit for random I/O
>
> See Martin Petersen's "Linux & Advanced Storage Interfaces" document for
> more information.
>
>> Whatever we go with, we should do something because 4.2rc6 is still
>> broken, unable to create a xfs file system on a pmem device, ever
>> since the change to use DAX on block devices with O_DIRECT.
>
> We can change the block device to export logical/physical block sizes of
> PAGE_SIZE. However, when persistent memory support comes to platforms
> that support page sizes > 32k, xfs will again run into problems (Dave
> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
> Arguably, you can use pmem and dax on such platforms using RAM today for
> testing. Do we care about breaking that?

I would think so. AARCH64 uses 64k pages today.

I think Documentation/filesystems/dax.txt could use a little update
too. It has a section "Implementation Tips for Block Driver Writers"
that makes it sound easy but now I wonder if it even works with the
example ram drivers. Should we be able to read any 512 byte
"sector"?

-- ljk
>
> Cheers,
> Jeff
>

2015-08-13 18:19:09

by Jeff Moyer

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

Linda Knippers <[email protected]> writes:

>>> It causes the physical block size to be PAGE_SIZE but the
>>> logical block size is still 512. However, the minimum_io_size
>>> is now 4096 (same as physical block size, I assume). The
>>> optimal_io_size is still 0. What does that mean?
>>
>> physical block size - device's internal block size
>> logical block size - addressable unit
>
> Right, but it's still reported as 512 and that doesn't work.

Understood. :)

>> optimal io size - device's preferred unit for streaming
>
> So 0 is ok.

Correct.

>> We can change the block device to export logical/physical block sizes of
>> PAGE_SIZE. However, when persistent memory support comes to platforms
>> that support page sizes > 32k, xfs will again run into problems (Dave
>> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
>> Arguably, you can use pmem and dax on such platforms using RAM today for
>> testing. Do we care about breaking that?
>
> I would think so. AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore. ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O. For now, we can do as Boaz suggested,
and just turn off dax for those configurations. We could also just
revert the patch that introduced this problem. I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy? What was the real motivation there?

> I think Documentation/filesystems/dax.txt could use a little update
> too. It has a section "Implementation Tips for Block Driver Writers"
> that makes it sound easy but now I wonder if it even works with the
> example ram drivers. Should we be able to read any 512 byte
> "sector"?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector. Simple as that.

Cheers,
Jeff

2015-08-13 19:32:41

by Matthew Wilcox

[permalink] [raw]
Subject: RE: regression introduced by "block: Add support for DAX reads/writes to block devices"

I liked the patch you were pushing to request the *page* containing the requested bytes instead of the *block* containing the requested bytes.

For the misaligned partition problem, I was thinking we should change the direct_access API to return a phys_addr_t instead of a pfn. That way we can return something that isn't actually page aligned, and DAX can take care of making sure it doesn't overshoot the end.

-----Original Message-----
From: Jeff Moyer [mailto:[email protected]]
Sent: Thursday, August 13, 2015 11:19 AM
To: Linda Knippers
Cc: Boaz Harrosh; Wilcox, Matthew R; [email protected]; [email protected]; Verma, Vishal L
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

Linda Knippers <[email protected]> writes:

>>> It causes the physical block size to be PAGE_SIZE but the
>>> logical block size is still 512. However, the minimum_io_size
>>> is now 4096 (same as physical block size, I assume). The
>>> optimal_io_size is still 0. What does that mean?
>>
>> physical block size - device's internal block size
>> logical block size - addressable unit
>
> Right, but it's still reported as 512 and that doesn't work.

Understood. :)

>> optimal io size - device's preferred unit for streaming
>
> So 0 is ok.

Correct.

>> We can change the block device to export logical/physical block sizes of
>> PAGE_SIZE. However, when persistent memory support comes to platforms
>> that support page sizes > 32k, xfs will again run into problems (Dave
>> Chinner mentioned that xfs can't deal with logical block sizes >32k.)
>> Arguably, you can use pmem and dax on such platforms using RAM today for
>> testing. Do we care about breaking that?
>
> I would think so. AARCH64 uses 64k pages today.

So does powerpc, but I guess nobody cares about that anymore. ;-) If
the logical block size is smaller than the page size, we're going to
have to deal with sub-page I/O. For now, we can do as Boaz suggested,
and just turn off dax for those configurations. We could also just
revert the patch that introduced this problem. I really don't know who
is going to care about O_DIRECT I/O performance to a persistent memory
block device.

Willy? What was the real motivation there?

> I think Documentation/filesystems/dax.txt could use a little update
> too. It has a section "Implementation Tips for Block Driver Writers"
> that makes it sound easy but now I wonder if it even works with the
> example ram drivers. Should we be able to read any 512 byte
> "sector"?

If the logical block size is 512 bytes, then you have to be able to do
(direct) I/O to any 512 byte sector. Simple as that.

Cheers,
Jeff

2015-08-14 16:28:28

by Dan Williams

[permalink] [raw]
Subject: Re: regression introduced by "block: Add support for DAX reads/writes to block devices"

On Thu, Aug 13, 2015 at 12:32 PM, Wilcox, Matthew R
<[email protected]> wrote:
> I liked the patch you were pushing to request the *page* containing the requested bytes instead of the *block* containing the requested bytes.
>
> For the misaligned partition problem, I was thinking we should change the direct_access API to return a phys_addr_t instead of a pfn. That way we can return something that isn't actually page aligned, and DAX can take care of making sure it doesn't overshoot the end.


If you go that route please make it __pfn_t + offset rather than
phys_addr_t so we can communicate the PFN_DEV flag among others.