2014-03-13 18:01:26

by Cédric Le Goater

Subject: ext4 extent issue when page size > block size

Hi,

While running openldap unit tests on a ppc64 system, we have had
issues with the cp command. cp uses the FS_IOC_FIEMAP ioctl to
optimize the copy, and it appeared that the ext4 extent list of
the file did not match all the data that was 'written' on disk.

The system we use has a 64kB page size, and the page size being
greater than the filesystem block size seems to be the root cause
of the problem. One can use a 1kB block size filesystem to reproduce
the issue on a 4kB page size system.
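For reference, a loop-mounted ext4 image with 1kB blocks is enough to
reproduce this on an ordinary 4kB page size machine. A sketch (the image
path, mount point, and binary name are arbitrary; mkfs and mount need root):

```shell
# Create a small ext4 filesystem whose block size (1kB) is below the page size (4kB)
dd if=/dev/zero of=/tmp/ext4-1k.img bs=1M count=64
mkfs.ext4 -q -b 1024 /tmp/ext4-1k.img
mkdir -p /mnt/ext4-1k
mount -o loop /tmp/ext4-1k.img /mnt/ext4-1k

# Build the attached test case and run it with its scratch file on that filesystem
gcc -o mmap_lseek_issue0 mmap_lseek_issue0.c
cp mmap_lseek_issue0 /mnt/ext4-1k/
cd /mnt/ext4-1k && ./mmap_lseek_issue0
```

On an unaffected setup the test prints GOOD; on an affected one it prints
BAD: followed by the extent list it got back from FIEMAP.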

Attached is a simple test case from Anton, which creates extents
as follows:

lseek(48K -1) -> creates [11/1)
p = mmap(128K)
*(p) = 1 -> creates [0/1) with a fault
lseek(128K) -> creates [31/1)
*(p + 49K) = 1 -> creates [12/1) and then merges in [11/2)
munmap(128K)


On a 4kB page size system, the extent list returned by FS_IOC_FIEMAP
looks correct:

Extent 0: logical: 0 physical: 0 length: 4096 flags 0x006
Extent 1: logical: 45056 physical: 0 length: 8192 flags 0x006
Extent 2: logical: 126976 physical: 0 length: 4096 flags 0x007


But, with a 64kB page size, we miss the in-the-middle extent (no page
fault, but the data is on disk):

Extent 0: logical: 0 physical: 0 length: 49152 flags 0x006
Extent 1: logical: 126976 physical: 0 length: 4096 flags 0x007


This looks wrong, right? Or are we doing something wrong? I have been
digging into the ext4 page writeback code. There are some caveats when
blocksize < pagesize, but I am not sure my understanding is correct.


Many thanks,

C.


Attachments:
mmap_lseek_issue0.c (2.50 kB)

2014-03-13 21:24:30

by Jan Kara

Subject: Re: ext4 extent issue when page size > block size

Hello,

On Thu 13-03-14 19:00:06, Cedric Le Goater wrote:
> While running openldap unit tests on a ppc64 system, we have had
> issues with the cp command. cp uses the FS_IOC_FIEMAP ioctl to
> optimize the copy and it appeared that the ext4 extent list of
> the file did not match all the data which was 'written' on disk.
>
> The system we use has a 64kB page size but the page size being
> greater than the filesystem block seems to be the top level reason
> of the problem. One can use a 1kB block size filesystem to reproduce
> the issue on a 4kB page size system.
>
> Attached is a simple test case from Anton, which creates extents
> as follows:
>
> lseek(48K -1) -> creates [11/1)
> p = mmap(128K)
> *(p) = 1 -> creates [0/1) with a fault
> lseek(128K) -> creates [31/1)
> *(p + 49K) = 1 -> creates [12/1) and then merges in [11/2)
> munmap(128K)
>
>
> On a 4kB page size system, the extent list returned by FS_IOC_FIEMAP
> looks correct :
>
> Extent 0: logical: 0 physical: 0 length: 4096 flags 0x006
> Extent 1: logical: 45056 physical: 0 length: 8192 flags 0x006
> Extent 2: logical: 126976 physical: 0 length: 4096 flags 0x007
>
>
> But, with a 64kB page size, we miss the in-the-middle extent (no page
> fault but the data is on disk) :
>
> Extent 0: logical: 0 physical: 0 length: 49152 flags 0x006
> Extent 1: logical: 126976 physical: 0 length: 4096 flags 0x007
>
>
> This looks wrong. Right ? Or are we doing something wrong ? I have been
> digging in the ext4 page writeback code. There are some caveats when
> blocksize < pagesize but I am not sure my understanding is correct.
So you are completely right with the observation that in a case like you
describe we don't create a delayed allocation extent for the block just
beyond EOF. This is a problem that has existed since day one, when delayed
allocation was introduced for ext4 (but it also affects xfs and, I dare
say, any other fs doing delayed allocation). The culprit really is that we
create delayed allocation extents at page fault time - at that point the
file is only 48KB, so we naturally don't reserve blocks beyond those 48KB.
However, after extending the file, the part of the page at offsets beyond
48KB suddenly becomes part of the file, and if you write some data there
(no page fault happens because the page is already marked writable in the
page tables), we won't have any delayed allocation extent backing that data.

One thing to note here is that POSIX specifically says that extending a
file while it is mmapped has undefined consequences for the mapping, so
technically speaking, if we just threw the data away we would still adhere
to the standard. I don't think we should be so harsh, but I mention this
to explain why some weirdness may be acceptable.

Anyway, fixing this isn't completely easy. I was looking into it some
years ago, and the best solution I found back then was to write-protect
the last partial page whenever blocksize < pagesize, we are extending the
file, and the extension creates a hole in the last partial page beyond the
original EOF. This actually requires tweaking not only the truncate path
but also the write path, and the locking there was somewhat hairy, because
we need to write-protect the tail page before updating i_size and make
sure no one can fault it in writeable again before i_size is updated.
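
In outline (my paraphrase as pseudocode, not actual kernel interfaces or
locking), the write-protect approach looks like:

```
extend_isize(inode, new_size):
    if blocksize < pagesize and the extension creates a hole in the
       partial page containing the old EOF:
        write_protect(tail_page)     # so that the next store must fault
        # keep the protection until the new i_size is published, and
        # make sure no fault can map the page writeable in between
    i_size = new_size
    # a later store into the tail page now takes a page fault, giving
    # the fs a chance to create the delayed allocation extent for the
    # newly-exposed blocks
```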

Honza
> /*
> * mmap vs extent issue
> *
> * Copyright (C) 2014 Anton Blanchard <[email protected]>, IBM
> *
> * This program is free software; you can redistribute it and/or
> * modify it under the terms of the GNU General Public License
> * as published by the Free Software Foundation; either version
> * 2 of the License, or (at your option) any later version.
> */
>
> #include <stdio.h>
> #include <string.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <linux/fs.h>
> #include <linux/fiemap.h>
>
> static void check_fiemap(int fd)
> {
> 	struct fiemap *fiemap;
> 	unsigned long i, ex_size;
>
> 	fiemap = malloc(sizeof(struct fiemap));
> 	if (!fiemap) {
> 		perror("malloc");
> 		exit(1);
> 	}
>
> 	memset(fiemap, 0, sizeof(struct fiemap));
> 	fiemap->fm_length = ~0;
>
> 	if (ioctl(fd, FS_IOC_FIEMAP, fiemap) == -1) {
> 		perror("ioctl(FIEMAP)");
> 		exit(1);
> 	}
>
> 	ex_size = sizeof(struct fiemap_extent) * fiemap->fm_mapped_extents;
>
> 	fiemap = realloc(fiemap, sizeof(struct fiemap) + ex_size);
> 	if (!fiemap) {
> 		perror("realloc");
> 		exit(1);
> 	}
>
> 	memset(fiemap->fm_extents, 0, ex_size);
> 	fiemap->fm_extent_count = fiemap->fm_mapped_extents;
> 	fiemap->fm_mapped_extents = 0;
>
> 	if (ioctl(fd, FS_IOC_FIEMAP, fiemap) < 0) {
> 		perror("ioctl(FIEMAP)");
> 		exit(1);
> 	}
>
> 	for (i = 0; i < fiemap->fm_mapped_extents; i++) {
> 		unsigned long start = fiemap->fm_extents[i].fe_logical;
> 		unsigned long end = fiemap->fm_extents[i].fe_logical +
> 			fiemap->fm_extents[i].fe_length;
>
> 		if (start <= 48*1024 && end > 48*1024) {
> 			printf("GOOD\n");
> 			exit(0);
> 		}
> 	}
>
> 	printf("BAD:\n");
> 	for (i = 0; i < fiemap->fm_mapped_extents; i++) {
> 		printf("%ld:\t%016llx %016llx\n", i,
> 		       fiemap->fm_extents[i].fe_logical,
> 		       fiemap->fm_extents[i].fe_length);
> 	}
>
> 	exit(1);
> }
>
> int main(int argc, char *argv[])
> {
> 	char name[] = "mmap-lseek-XXXXXX";
> 	int fd;
> 	char *p;
>
> 	fd = mkstemp(name);
> 	if (fd == -1) {
> 		perror("mkstemp");
> 		exit(1);
> 	}
>
> 	/* Create a 48 kB file */
> 	lseek(fd, 48 * 1024 - 1, SEEK_SET);
> 	if (write(fd, "\0", 1) != 1) {
> 		perror("write");
> 		exit(1);
> 	}
>
> 	/* Map it, allowing space for it to grow */
> 	p = mmap(NULL, 128 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
> 	if (p == MAP_FAILED) {
> 		perror("mmap");
> 		exit(1);
> 	}
>
> 	/* Write to the start of the file */
> 	*(p) = 1;
>
> 	/* Extend the file */
> 	lseek(fd, 128 * 1024 - 1, SEEK_SET);
> 	if (write(fd, "\0", 1) != 1) {
> 		perror("write");
> 		exit(1);
> 	}
>
> 	/* Write to the new space in the first page */
> 	*(p + 49 * 1024) = 1;
>
> 	munmap(p, 128 * 1024);
>
> 	check_fiemap(fd);
>
> 	close(fd);
>
> 	return 0;
> }
>
>

--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-03-14 00:45:42

by Dave Chinner

Subject: Re: ext4 extent issue when page size > block size

On Thu, Mar 13, 2014 at 10:24:28PM +0100, Jan Kara wrote:
> Hello,
>
> On Thu 13-03-14 19:00:06, Cedric Le Goater wrote:
> > While running openldap unit tests on a ppc64 system, we have had
> > issues with the cp command. cp uses the FS_IOC_FIEMAP ioctl to
> > optimize the copy and it appeared that the ext4 extent list of
> > the file did not match all the data which was 'written' on disk.
> >
> > The system we use has a 64kB page size but the page size being
> > greater than the filesystem block seems to be the top level reason
> > of the problem. One can use a 1kB block size filesystem to reproduce
> > the issue on a 4kB page size system.
> >
> > Attached is a simple test case from Anton, which creates extents
> > as follows:
> >
> > lseek(48K -1) -> creates [11/1)
> > p = mmap(128K)
> > *(p) = 1 -> creates [0/1) with a fault
> > lseek(128K) -> creates [31/1)
> > *(p + 49K) = 1 -> creates [12/1) and then merges in [11/2)
> > munmap(128K)

This should be easy to reproduce on a 4k page size machine using a 512
byte block size. Yup, it is.

> > On a 4kB page size system, the extent list returned by FS_IOC_FIEMAP
> > looks correct :
> >
> > Extent 0: logical: 0 physical: 0 length: 4096 flags 0x006
> > Extent 1: logical: 45056 physical: 0 length: 8192 flags 0x006
> > Extent 2: logical: 126976 physical: 0 length: 4096 flags 0x007
> >
> >
> > But, with a 64kB page size, we miss the in-the-middle extent (no page
> > fault but the data is on disk) :
> >
> > Extent 0: logical: 0 physical: 0 length: 49152 flags 0x006
> > Extent 1: logical: 126976 physical: 0 length: 4096 flags 0x007

Pretty much the same:

EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..3]: 11..14 0 (11..14) 4 00000
1: [4..14]: hole 11
2: [15..15]: 15..15 0 (15..15) 1 00000

> > This looks wrong. Right ? Or are we doing something wrong ? I have been
> > digging in the ext4 page writeback code. There are some caveats when
> > blocksize < pagesize but I am not sure my understanding is correct.
> So you are completely right with the observation that in case like you
> describe we don't create delayed allocation extent for the block just
> beyond EOF. This is a problem which exists since day one when delayed
> allocation was introduced for ext4 (but also xfs and I dare to say any
> other fs doing delayed allocation).

The above was done on XFS, because ext4 doesn't support 512 byte
block sizes.

> delayed allocation extents on page fault - at that time file is only 48KB
> so we naturally don't allocate blocks for blocks beyond those 48KB. However
> after extending the file, the part of the page at offsets beyond 48KB
> suddenly becomes part of the file and if you write some data there (no
> page fault happens because the page is already marked writeable in the page
> tables), we won't have any delayed allocation extent backing that data.
>
> One thing to note here is that posix specifically says that extending file
> while it is mmaped has undefined consequences for the mmap so technically
> speaking if we would just throw away the data, we would still adhere to it.
> I don't think we should be so harsh but I mention this to explain why some
> weirdness may be acceptable.

Right - if you touch the mmap()d range beyond EOF before the file
is extended, we segv the application. So, really, it's just a bad
idea to do this.

> Anyway, fixing this isn't completely easy. I was looking into that some
> years ago and the best solution I've found back then was to writeprotect
> the last partial page whenever blocksize < pagesize, we are extending the
> file and creating hole in the last partial page beyond original EOF. This
> actually requires tweaking not only truncate path but also write path and
> the locking was somewhat hairy there because we need to writeprotect the
> tail page before updating i_size and make sure no one can fault it in again
> before the i_size is updated.

It's just another "we can't synchronise page faults against IO"
problem, just like we have with hole punching. The way we've
optimised truncate by using page locks and isize and mapping
checks is fine for simple cases where EOF isn't changing, but
the moment we need isize update vs page fault serialisation, or IO
vs page-fault-inside-EOF serialisation, we're screwed.

The whole concept that page faults are immune to/exempted from
filesystem IO serialisation requirements is fundamentally broken.
Until we fix that we're not going to be able to solve these
nasty corner cases.

Cheers,

Dave.
--
Dave Chinner
[email protected]