2017-02-06 23:14:09

by Ross Zwisler

Subject: question about ext4 block allocation

I recently hit an issue in my DAX testing where I was unable to get ext4 to
give me 2 MiB sized and aligned block allocations in a situation where I
thought I should be able to. I'm using a PMEM ramdisk of size 16 GiB, created
using the memmap kernel command line parameter.

# fdisk -l /dev/pmem0
Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

The very simple test program I used to reproduce this can be found at the
bottom of this mail. Here is the quick function that I used to recreate my
filesystem each run:

# type go_ext4
go_ext4 is a function
go_ext4 ()
{
    umount /dev/pmem0 2> /dev/null;
    mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0;
    mount -o dax /dev/pmem0 ~/dax;
    cd ~/fsync
}
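
For reference, stride=512 with 4096-byte blocks makes one stride cover
exactly one 2 MiB PMD. A minimal C sketch of that arithmetic follows. It
assumes x86-64 style 2 MiB PMDs, and stride_blocks() is a hypothetical helper
name, not part of any ext4 tooling.

```c
#include <assert.h>
#include <stdint.h>

#define PMD_SIZE_BYTES (2ULL * 1024 * 1024) /* assumption: 2 MiB PMDs */

/* Hypothetical helper: number of filesystem blocks that span one PMD. */
static uint64_t stride_blocks(uint64_t block_size)
{
    /* The block size must be non-zero and divide the PMD size evenly. */
    assert(block_size != 0 && PMD_SIZE_BYTES % block_size == 0);
    return PMD_SIZE_BYTES / block_size;
}
```

Calling stride_blocks(4096) gives 512, which is the value passed to
mkfs.ext4 above.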

To be able to easily see whether DAX is able to use PMDs instead of PTEs, you
can run with the mmots tree (git://git.cmpxchg.org/linux-mmots.git), tag
v4.10-rc4-mmots-2017-01-17-16-32.

Okay, so here's the interesting part. If I create a filesystem and run the
test so it creates a file of size 32 MiB or 128 MiB, I get a PMD fault.
Here's the corresponding tracepoint output:

test-1429 [008] .... 10573.026699: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x7fff

test-1429 [008] .... 10573.026912: dax_pmd_insert_mapping: dev 259:0 ino 0xc
shared write address 0x40280000 length 0x200000 pfn 0x108a00 DEV|MAP
radix_entry 0x114000e

test-1429 [008] .... 10573.026917: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x7fff NOPAGE

Great. That's what I want. But, if I create the filesystem and use the test
to create a file that is 64 MiB in size, the PMD fault fails because the PFN I
get from the filesystem isn't 2MiB aligned:

test-1475 [006] .... 11809.982188: dax_pmd_fault: dev 259:0 ino 0xc shared
WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
0x40400000 pgoff 0x280 max_pgoff 0x3fff

test-1475 [006] .... 11809.982398: dax_pmd_insert_mapping_fallback: dev 259:0
ino 0xc shared write address 0x40280000 length 0x200000 pfn 0x108601 DEV|MAP
radix_entry 0x0

test-1475 [006] .... 11809.982399: dax_pmd_fault_done: dev 259:0 ino 0xc
shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff FALLBACK
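
The pgoff and max_pgoff fields in both traces can be cross-checked by hand.
The following is a rough C sketch, assuming 4 KiB pages and a mapping that
starts at file offset 0 (as in the test program). The helper names
fault_pgoff() and last_pgoff() are hypothetical and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * File page offset of a faulting address. vm_pgoff is the file page offset
 * at which the mapping starts (0 in the test program).
 */
static uint64_t fault_pgoff(uint64_t address, uint64_t vm_start,
                            uint64_t vm_pgoff)
{
    assert(address >= vm_start);
    return ((address - vm_start) >> PAGE_SHIFT) + vm_pgoff;
}

/* Index of the last page of a file with the given non-zero size in bytes. */
static uint64_t last_pgoff(uint64_t file_size)
{
    assert(file_size > 0);
    return (file_size - 1) >> PAGE_SHIFT;
}
```

With address 0x40280000 and vm_start 0x40000000, fault_pgoff() yields 0x280.
last_pgoff() yields 0x7fff for a 128 MiB file and 0x3fff for a 64 MiB file,
matching the max_pgoff values in the two traces.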

The PFN for the block allocation I get from ext4 is 0x108601, which isn't
aligned, so we fail the PG_PMD_COLOUR alignment check in
dax_iomap_pmd_fault(), and use PTEs instead.
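
To illustrate the alignment test, here is a minimal userspace sketch in C. It
assumes 4 KiB pages and 2 MiB PMDs, and it mirrors the kernel's definition of
PG_PMD_COLOUR as (PMD_SIZE >> PAGE_SHIFT) - 1. The helper
pfn_is_pmd_aligned() is a hypothetical name used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT    12            /* assumption: 4 KiB pages */
#define PMD_SIZE      (1ULL << 21)  /* assumption: 2 MiB PMDs */
#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)

/* 512 pages per PMD, so the colour mask covers the low 9 bits of a PFN. */
static_assert(PG_PMD_COLOUR == 0x1ff, "unexpected PMD colour mask");

/* Hypothetical helper: true if the PFN starts on a 2 MiB boundary. */
static bool pfn_is_pmd_aligned(uint64_t pfn)
{
    return (pfn & PG_PMD_COLOUR) == 0;
}
```

The PFN from the working case, 0x108a00, has all nine low bits clear and
passes. The PFN from the 64 MiB case, 0x108601, has low bits 0x001, so it is
one page past a 2 MiB boundary and fails.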

I initially saw this in a test from Xiong:

https://www.mail-archive.com/[email protected]/msg02615.html

and created the attached test to have a simpler reproducer. With Xiong's
test, a run on a 128 MiB file will use all PMDs, and a run on a 64 MiB file
will use all PTEs.

This question is important because eventually we'd like to say to customers
"do X and you should get PMDs when you use DAX", but right now I'm not sure
what X is. :)

Thanks,
- Ross

--- >8 ---
#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define GiB(a) ((a)*1024ULL*1024*1024)
#define MiB(a) ((a)*1024ULL*1024)
#define PAGE(a) ((a)*0x1000)

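void usage(char *prog)
{
	fprintf(stderr, "usage: %s <size in MiB>\n", prog);
	exit(1);
}

void err_exit(char *op, unsigned long len)
{
	fprintf(stderr, "%s(%s) len %lu\n", op, strerror(errno), len);
	exit(1);
}

int main(int argc, char *argv[])
{
	char *data_array = (char*) GiB(1); /* request a 2MiB aligned address with mmap() */
	unsigned long len;
	int fd;

	if (argc < 2)
		usage(basename(argv[0]));

	errno = 0; /* strtoul() only sets errno on failure, so clear it first */
	len = strtoul(argv[1], NULL, 10);
	if (errno == ERANGE)
		err_exit("strtoul", 0);

	fd = open("/root/dax/data", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* drop any old blocks, then preallocate the requested size */
	if (ftruncate(fd, 0) < 0)
		err_exit("ftruncate", 0);
	if (fallocate(fd, 0, 0, MiB(len)) < 0)
		err_exit("fallocate", MiB(len));

	/* map the first 4 MiB of the file at the 2MiB aligned hint address */
	data_array = mmap(data_array, PAGE(0x400), PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, PAGE(0));
	if (data_array == MAP_FAILED)
		err_exit("mmap", PAGE(0x400));

	/* write fault at page 0x280, i.e. 2.5 MiB into the mapping */
	data_array[PAGE(0x280)] = 142;

	fsync(fd);
	close(fd);
	return 0;
}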


2017-02-09 17:52:28

by Ross Zwisler

Subject: Re: question about ext4 block allocation

On Thu, Feb 09, 2017 at 04:30:09PM +0100, Jan Kara wrote:
> Hi Ross,
>
> On Mon 06-02-17 16:14:09, Ross Zwisler wrote:
> > I recently hit an issue in my DAX testing where I was unable to get ext4 to
> > give me 2 MiB sized and aligned block allocations in a situation where I
> > thought I should be able to. I'm using a PMEM ramdisk of size 16 GiB, created
> > using the memmap kernel command line parameter.
> >
> > # fdisk -l /dev/pmem0
> > Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
> > Units: sectors of 1 * 512 = 512 bytes
> > Sector size (logical/physical): 512 bytes / 4096 bytes
> > I/O size (minimum/optimal): 4096 bytes / 4096 bytes
> >
> > The very simple test program I used to reproduce this can be found at the
> > bottom of this mail. Here is the quick function that I used to recreate my
> > filesystem each run:
> >
> > # type go_ext4
> > go_ext4 is a function
> > go_ext4 ()
> > {
> >     umount /dev/pmem0 2> /dev/null;
> >     mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0;
> >     mount -o dax /dev/pmem0 ~/dax;
> >     cd ~/fsync
> > }
>
> ...
>
> > Great. That's what I want. But, if I create the filesystem and use the test
> > to create a file that is 64 MiB in size, the PMD fault fails because the PFN I
> > get from the filesystem isn't 2MiB aligned:
> >
> > test-1475 [006] .... 11809.982188: dax_pmd_fault: dev 259:0 ino 0xc shared
> > WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000 vm_end
> > 0x40400000 pgoff 0x280 max_pgoff 0x3fff
> >
> > test-1475 [006] .... 11809.982398: dax_pmd_insert_mapping_fallback: dev 259:0
> > ino 0xc shared write address 0x40280000 length 0x200000 pfn 0x108601 DEV|MAP
> > radix_entry 0x0
> >
> > test-1475 [006] .... 11809.982399: dax_pmd_fault_done: dev 259:0 ino 0xc
> > shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
> > vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff FALLBACK
> >
> > The PFN for the block allocation I get from ext4 is 0x108601, which isn't
> > aligned, so we fail the PG_PMD_COLOUR alignment check in
> > dax_iomap_pmd_fault(), and use PTEs instead.
>
> Yeah, it's a bug in ext4 allocator. Requests for 128MB are exactly a group
> size so we find completely empty group and satisfy the request. Even larger
> requests will get split into 128MB chunks. 32MB requests are small enough
> that they go via a special path for power-of-two sized requests. However
> 64MB allocation request can be satisfied from somewhat filled group (there
> are sb backup blocks in group 1 in your case) and we screw up when deciding
> whether to treat such request as power-of-two or not and don't align it at
> all in the end.
>
> Another problem is that the stride size ends up unused due to another bug
> in ext4. The second attached patch fixes that issue.
>
> With these two patches applied I get file blocks aligned. That being said
> the stripe-aligned allocator does a poor job of creating large extents
> (larger than stripe-width) however that is more difficult to fix.
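
As a side note on the "128MB is exactly a group size" point above, the
arithmetic can be sketched in C. This assumes default mkfs.ext4 settings (no
-g override and no bigalloc), where a single block-sized bitmap tracks each
group, so a group holds 8 * block_size blocks. The helper names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one bitmap block per group, one bit per filesystem block. */
static uint64_t blocks_per_group(uint64_t block_size)
{
    assert(block_size > 0);
    return 8 * block_size;
}

/* Size of one block group in bytes under the same assumption. */
static uint64_t group_size_bytes(uint64_t block_size)
{
    return blocks_per_group(block_size) * block_size;
}
```

For 4096-byte blocks this gives 32768 blocks per group and 128 MiB per group,
which is room for 64 PMD-sized (2 MiB) extents in a completely empty group.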

Thanks for the fixes! Your patches do fix my simple reproducer so that it
gives me 2 MiB aligned and sized allocations for 64 MiB files, but when I run
Xiong's xfstest I still get misaligned allocations, and only for 64 MiB
files. 32 MiB and 128 MiB files still work and give me PMDs.

I've pared down his xfstest to be a pretty minimal reproducer, and you can
find it here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/xfstests-dev.git/log/?h=ext4_PMD_allocation

You can just revert the top patch in this tree (which basically just comments
out 90% of the test and adds some debug) to get back to Xiong's full test.

The kernel tree I tested with today using your fixes can be found here:

https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=ext4_PMD_allocation

2017-02-09 19:29:48

by Theodore Y. Ts'o

Subject: Re: question about ext4 block allocation

On Thu, Feb 09, 2017 at 10:52:28AM -0700, Ross Zwisler wrote:
> I've pared down his xfstest to be a pretty minimal reproducer, and you can
> find it here:
>
> https://git.kernel.org/cgit/linux/kernel/git/zwisler/xfstests-dev.git/log/?h=ext4_PMD_allocation

I'm getting "No repositories found". And looking at the top level
git.kernel.org, it doesn't look like there is a
zwisler/xfstests-dev.git tree there at all. Was it uploaded correctly?

- Ted

2017-02-09 20:21:54

by Ross Zwisler

Subject: Re: question about ext4 block allocation

On Thu, Feb 09, 2017 at 02:29:48PM -0500, Theodore Ts'o wrote:
> On Thu, Feb 09, 2017 at 10:52:28AM -0700, Ross Zwisler wrote:
> > I've pared down his xfstest to be a pretty minimal reproducer, and you can
> > find it here:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/zwisler/xfstests-dev.git/log/?h=ext4_PMD_allocation
>
> I'm getting "No repositories found". And looking at the top level
> git.kernel.org, it doesn't look like there is a
> zwisler/xfstests-dev.git tree there at all. Was it uploaded correctly?
>
> - Ted

The above link works for me, and I can see it in the top level view:

https://git.kernel.org/cgit/

I just created the repo today, though. Maybe it's just taking a mirror a
second to catch up or something?

Please let me know if you continue to be unable to see it, and I'll figure it
out.

- Ross

2017-02-09 22:54:40

by Theodore Y. Ts'o

Subject: Re: question about ext4 block allocation

On Thu, Feb 09, 2017 at 01:21:54PM -0700, Ross Zwisler wrote:
> On Thu, Feb 09, 2017 at 02:29:48PM -0500, Theodore Ts'o wrote:
> > On Thu, Feb 09, 2017 at 10:52:28AM -0700, Ross Zwisler wrote:
> > > I've pared down his xfstest to be a pretty minimal reproducer, and you can
> > > find it here:
> > >
> > > https://git.kernel.org/cgit/linux/kernel/git/zwisler/xfstests-dev.git/log/?h=ext4_PMD_allocation
> >
> > I'm getting "No repositories found". And looking at the top level
> > > git.kernel.org, it doesn't look like there is a
> > zwisler/xfstests-dev.git tree there at all. Was it uploaded correctly?
> >
> > - Ted
>
> The above link works for me, and I can see it in the top level view:
>
> https://git.kernel.org/cgit/
>
> I just created the repo today, though. Maybe it's just taking a mirror a
> second to catch up or something?

Yup, that must have been what it was. I must have caught a DNS mirror
that was slow in updating.

- Ted