From: Ross Zwisler
Subject: question about ext4 block allocation
Date: Mon, 6 Feb 2017 16:14:09 -0700
Message-ID: <20170206231409.GA16676@linux.intel.com>
To: Jan Kara, Theodore Ts'o, linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Xiong Zhou
Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org
List-Id: linux-ext4.vger.kernel.org

I recently hit an issue in my DAX testing where I was unable to get ext4 to
give me 2 MiB sized and aligned block allocations in a situation where I
thought I should be able to.

I'm using a PMEM ramdisk of size 16 GiB, created using the memmap kernel
command line parameter.

  # fdisk -l /dev/pmem0
  Disk /dev/pmem0: 16 GiB, 17179869184 bytes, 33554432 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 4096 bytes
  I/O size (minimum/optimal): 4096 bytes / 4096 bytes

The very simple test program I used to reproduce this can be found at the
bottom of this mail.

Here is the quick function that I used to recreate my filesystem each run:

  # type go_ext4
  go_ext4 is a function
  go_ext4 ()
  {
      umount /dev/pmem0 2> /dev/null;
      mkfs.ext4 -b 4096 -E stride=512 -F /dev/pmem0;
      mount -o dax /dev/pmem0 ~/dax;
      cd ~/fsync
  }

To be able to easily see whether DAX is able to use PMDs instead of PTEs, you
can run with the mmots tree (git://git.cmpxchg.org/linux-mmots.git), tag
v4.10-rc4-mmots-2017-01-17-16-32.

Okay, so here's the interesting part. If I create a filesystem and run the
test so it creates a file of size 32 MiB or 128 MiB, I get a PMD fault.
Here's the corresponding tracepoint output:

  test-1429 [008] ....
  10573.026699: dax_pmd_fault: dev 259:0 ino 0xc shared
  WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
  vm_end 0x40400000 pgoff 0x280 max_pgoff 0x7fff

  test-1429 [008] ....
  10573.026912: dax_pmd_insert_mapping: dev 259:0 ino 0xc shared write
  address 0x40280000 length 0x200000 pfn 0x108a00 DEV|MAP radix_entry
  0x114000e

  test-1429 [008] ....
  10573.026917: dax_pmd_fault_done: dev 259:0 ino 0xc shared
  WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
  vm_end 0x40400000 pgoff 0x280 max_pgoff 0x7fff NOPAGE

Great. That's what I want.

But, if I create the filesystem and use the test to create a file that is
64 MiB in size, the PMD fault fails because the PFN I get from the filesystem
isn't 2 MiB aligned:

  test-1475 [006] ....
  11809.982188: dax_pmd_fault: dev 259:0 ino 0xc shared
  WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
  vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff

  test-1475 [006] ....
  11809.982398: dax_pmd_insert_mapping_fallback: dev 259:0 ino 0xc shared
  write address 0x40280000 length 0x200000 pfn 0x108601 DEV|MAP radix_entry
  0x0

  test-1475 [006] ....
  11809.982399: dax_pmd_fault_done: dev 259:0 ino 0xc shared
  WRITE|ALLOW_RETRY|KILLABLE|USER address 0x40280000 vm_start 0x40000000
  vm_end 0x40400000 pgoff 0x280 max_pgoff 0x3fff FALLBACK

The PFN for the block allocation I get from ext4 is 0x108601, which isn't
2 MiB aligned, so we fail the PG_PMD_COLOUR alignment check in
dax_iomap_pmd_fault(), and use PTEs instead.

I initially saw this in a test from Xiong:

https://www.mail-archive.com/linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org/msg02615.html

and created the attached test to have a simpler reproducer. With Xiong's
test, a test on a 128 MiB sized file will have all PMDs, and on a 64 MiB file
we'll use all PTEs.

This question is important because eventually we'd like to say to customers
"do X and you should get PMDs when you use DAX", but right now I'm not sure
what X is.
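:)

Thanks,
- Ross

--- >8 ---
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define GiB(a) ((a)*1024ULL*1024*1024)
#define MiB(a) ((a)*1024ULL*1024)
#define PAGE(a) ((a)*0x1000)

void usage(char *prog)
{
	fprintf(stderr, "usage: %s <len in MiB>\n", prog);
	exit(1);
}

void err_exit(char *op, unsigned long len)
{
	fprintf(stderr, "%s(%s) len %lu\n", op, strerror(errno), len);
	exit(1);
}

int main(int argc, char *argv[])
{
	/* request a 2MiB aligned address with mmap() */
	char *data_array = (char*) GiB(1);
	unsigned long len;
	int fd;

	if (argc < 2)
		usage(basename(argv[0]));

	/* file length in MiB; clear errno so a stale ERANGE is not reported */
	errno = 0;
	len = strtoul(argv[1], NULL, 10);
	if (errno == ERANGE)
		err_exit("strtoul", 0);

	fd = open("/root/dax/data", O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0) {
		perror("fd");
		return 1;
	}

	/* drop any old blocks, then preallocate the whole file */
	if (ftruncate(fd, 0) < 0)
		err_exit("ftruncate", 0);
	if (fallocate(fd, 0, 0, MiB(len)) < 0)
		err_exit("fallocate", MiB(len));

	/* map the first 4 MiB of the file (two PMD-sized regions) */
	data_array = mmap(data_array, PAGE(0x400), PROT_READ|PROT_WRITE,
			MAP_SHARED, fd, PAGE(0));
	if (data_array == MAP_FAILED)
		err_exit("mmap", PAGE(0x400));

	/* write fault at page offset 0x280, inside the second 2 MiB region */
	data_array[PAGE(0x280)] = 142;

	fsync(fd);
	close(fd);
	return 0;
}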
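
To make the PG_PMD_COLOUR check mentioned above concrete, here is a minimal sketch of the alignment arithmetic, applied to the two PFNs from the traces. It assumes 4 KiB base pages and 2 MiB PMD mappings (as on x86-64). The SKETCH_* macros and the function names are made up for illustration and are not kernel identifiers; this is not the kernel's code, only the same mask-and-compare idea.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumptions (not stated in the mail): 4 KiB pages, 2 MiB PMD mappings. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SHIFT  21

/* Number of base pages covered by one PMD mapping: 512 here. */
#define SKETCH_PAGES_PER_PMD (1UL << (SKETCH_PMD_SHIFT - SKETCH_PAGE_SHIFT))

/* Local stand-in for a PG_PMD_COLOUR-style mask: 0x1ff here. */
#define SKETCH_PMD_COLOUR (SKETCH_PAGES_PER_PMD - 1)

static_assert((SKETCH_PAGES_PER_PMD & SKETCH_PMD_COLOUR) == 0,
	      "pages per PMD must be a power of two for the mask to work");

/* Offset, in base pages, of a PFN from the preceding 2 MiB boundary. */
unsigned long pfn_pmd_offset(unsigned long pfn)
{
	return pfn & SKETCH_PMD_COLOUR;
}

/* A PFN can back a PMD mapping only if that offset is zero. */
bool pfn_is_pmd_aligned(unsigned long pfn)
{
	return pfn_pmd_offset(pfn) == 0;
}

/* Print the offset and the verdict for one PFN. */
void report_pfn(unsigned long pfn)
{
	printf("pfn 0x%lx: offset %lu pages into a 2 MiB region (%s)\n",
	       pfn, pfn_pmd_offset(pfn),
	       pfn_is_pmd_aligned(pfn) ? "PMD-aligned" : "not PMD-aligned");
}
```

Calling report_pfn(0x108a00) gives an offset of 0 (PMD-aligned), matching the successful dax_pmd_insert_mapping trace. report_pfn(0x108601) gives an offset of 1, so the page backing that PMD-aligned file offset sits one 4 KiB page past a 2 MiB boundary, which matches the fallback trace.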