2007-08-02 12:41:38

by Petr Tesařík

Subject: mmap behavior on out-of-space conditions

Hello,

while solving a different issue, my colleague Libor Pechacek found a
problem with handling mmapped sparse files. If you mmap the hole inside a
sparse file and write to it, the data gets silently lost if there is not
enough space left on the underlying device.

I found a thread which touched this topic in December 2001 (sic!). I'd
like to quote an email by Andrea Arcangeli:

> On Sun, Dec 30, 2001 at 01:33:24AM -0500, Alexander Viro wrote:
> >
> >
> > On Sat, 29 Dec 2001, Andrew Morton wrote:
> >
> > > Would it be necessary to preallocate the holes at mmap() time? Mad
> > > hand-waving: Could we not perform the instantiation at pagefault time,
> > > and give the caller SIGBUS if we cannot allocate the blocks? Or if
> > > there's an IO error, or quota exceeded.
> >
> > Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
> > There are applications that use mmap() on large and very sparse
> > files.
>
> it's exactly this kind of apps that will be screwed up by silent data
> corruption. the point of the holes is to optimize performance and save
> space, but they shouldn't introduce the possibilty of data corruption.
>
> Note: I'm fine to introduce another way to notify the app about -ENOSPC,
> -ENOSPC on mmap is the most obvious one, but we could still allow the
> current "overcommit" behaviour with a kind of sigbus mentioned by
> Andrew (possibly not sigbus though, since it has just well defined
> semantics for MAP_SHARED, maybe they could be extended, anyways as said
> this is only a matter of API). My point is only that some API should be
> added because your mmap on sparse files are unreliable at the moment.

(see http://marc.info/?l=linux-kernel&m=100975730421590&w=2)

However, this is still not fixed. I am attaching a simple test case;
run it like this:

$ make
$ # become root
# make check

IMO we should go with the SIGBUS solution, but I want to discuss it here
before making a patch.

Kind regards,
Petr Tesarik
SUSE LINUX, L3 Prague


Attachments:
mmap-nospc.tar.gz (1.21 kB)
signature.asc (189.00 B)

2007-08-02 13:06:36

by Peter Zijlstra

Subject: Re: mmap behavior on out-of-space conditions

On Thu, 2007-08-02 at 14:41 +0200, Petr Tesarik wrote:
> Hello,
>
> while solving a different issue, my colleague Libor Pechacek found a
> problem with handling mmapped sparse files. If you mmap the hole inside a
> sparse file and write to it, the data gets silently lost if there is not
> enough space left on the underlying device.

I think Dave's block_page_mkwrite() stuff addresses this as well, no?

http://lkml.org/lkml/2007/3/18/198

Peter

2007-08-02 13:18:44

by Guillaume Chazarain

Subject: Re: mmap behavior on out-of-space conditions

2007/8/2, Peter Zijlstra <[email protected]>:

> I think Dave's block_page_mkwrite() stuff addresses this as well, no?
>
> http://lkml.org/lkml/2007/3/18/198

I saw a similar problem some time ago with msync:
http://lkml.org/lkml/2006/12/29/136 but Andrew didn't
like my patch.

--
Guillaume

2007-08-02 18:53:29

by Andrew Morton

Subject: Re: mmap behavior on out-of-space conditions

On Thu, 2 Aug 2007 15:18:33 +0200
"Guillaume Chazarain" <[email protected]> wrote:

> 2007/8/2, Peter Zijlstra <[email protected]>:
>
> > I think Dave's block_page_mkwrite() stuff addresses this as well, no?
> >
> > http://lkml.org/lkml/2007/3/18/198
>
> I saw a similar problem some time ago with msync:
> http://lkml.org/lkml/2006/12/29/136 but Andrew didn't
> like my patch.
>

Yeah, we need to get that finished off.

The current _design_ of the VM/VFS is such that if the application runs
fsync() or msync() then it will be able to reliably detect any data loss
which has occurred, even if that data loss occurred during a random pageout
attempt by some other process half an hour ago.

However Guillaume has identified some holes in the implementation (I don't
recall the details, but that link is a start).
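
To illustrate the design described above from the application side, here
is a minimal sketch in C. It is not the test case attached to the first
message; the helper name write_hole_and_sync() and its parameters are
made up for this example. It assumes that extending a file with
ftruncate() leaves it fully sparse, which is typical on Linux
filesystems. The point is only that the store into the mapping reports
nothing by itself, so the return values of msync() and fsync() are where
an allocation failure such as ENOSPC is supposed to become visible.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): create a sparse file, dirty
 * part of its hole through a shared mapping, then force writeback.
 * Returns 0 on success or an errno value describing the first failure.
 */
int write_hole_and_sync(const char *path, size_t file_size,
                        size_t offset, size_t len)
{
    int err = 0;

    assert(len > 0 && offset <= file_size && len <= file_size - offset);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return errno;

    /* Assumption: ftruncate() extends the file without allocating blocks. */
    if (ftruncate(fd, (off_t)file_size) < 0) {
        err = errno;
        close(fd);
        return err;
    }

    char *map = mmap(NULL, file_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        err = errno;
        close(fd);
        return err;
    }

    /* This store has no return value, so it cannot report ENOSPC itself. */
    memset(map + offset, 'x', len);

    /* Writeback errors are expected to be reported by these two calls. */
    if (msync(map, file_size, MS_SYNC) < 0)
        err = errno;
    if (err == 0 && fsync(fd) < 0)
        err = errno;

    munmap(map, file_size);
    close(fd);
    return err;
}
```

On a filesystem with free space the helper should return 0. On a full
filesystem the intended outcome is a non-zero errno such as ENOSPC,
subject to the implementation holes mentioned above.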

2007-08-02 22:20:38

by David Chinner

Subject: Re: mmap behavior on out-of-space conditions

On Thu, Aug 02, 2007 at 03:06:15PM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-02 at 14:41 +0200, Petr Tesarik wrote:
> > Hello,
> >
> > while solving a different issue, my colleague Libor Pechacek found a
> > problem with handling mmapped sparse files. If you mmap the hole inside a
> > sparse file and write to it, the data gets silently lost if there is not
> > enough space left on the underlying device.
>
> I think Dave's block_page_mkwrite() stuff addresses this as well, no?
>
> http://lkml.org/lkml/2007/3/18/198

Yes it does. But only XFS hooks that right now. It works, too. ;)

Create and mount 128MB filesystem:

budgie:~ # mkfs.xfs -f -d size=128m /dev/sdb9
....

Fill it up:

budgie:~ # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=127
dd: writing `/mnt/scratch/fred': No space left on device
119+0 records in
118+0 records out
124764160 bytes (125 MB) copied, 4.69505 seconds, 26.6 MB/s
budgie:~ # sync
budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   48K 100% /mnt/scratch

Free up one block:

budgie:~ # xfs_io -f -c "truncate 124760000" /mnt/scratch/fred
budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   52K 100% /mnt/scratch

Create sparse file:

budgie:~ # dd if=/dev/zero of=/mnt/scratch/sparse_mmap_file bs=4k count=1 seek=1000000000
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00022675 seconds, 18.1 MB/s
budgie:~ # ls -l /mnt/scratch/sparse_mmap_file
-rw-r--r-- 1 root root 4096000004096 Aug 3 08:13 /mnt/scratch/sparse_mmap_file

Mmap sparse file and try to write to it:

budgie:~ # xfs_io -f -c "mmap 0 1000000000" -c " mwrite 4000000 50000" /mnt/scratch/sparse_mmap_file
Bus error

There's your bus error. The blocks that were allocated before ENOSPC:

budgie:~ # xfs_bmap -vp /mnt/scratch/sparse_mmap_file
/mnt/scratch/sparse_mmap_file:
 EXT: FILE-OFFSET               BLOCK-RANGE      AG AG-OFFSET             TOTAL
   0: [0..7807]:                hole                                       7808
   1: [7808..7839]:             155520..155551    4 (24448..24479)           32
   2: [7840..7999999999]:       hole                                 7999992160
   3: [8000000000..8000000007]: 155512..155519    4 (24440..24447)            8
budgie:~ #

We got 4x4k data blocks allocated and:

budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   32K 100% /mnt/scratch

That shows that 5 blocks were allocated to hold the 4 data blocks
that led to ENOSPC, i.e. a metadata block of some kind was also
allocated.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group