Hello,
while solving a different issue, my colleague Libor Pechacek found a
problem with handling mmapped sparse files. If you mmap the hole inside a
sparse file and write to it, the data gets silently lost if there is not
enough space left on the underlying device.
I found a thread which touched this topic in December 2001 (sic!). I'd
like to quote an email by Andrea Arcangeli:
> On Sun, Dec 30, 2001 at 01:33:24AM -0500, Alexander Viro wrote:
> >
> >
> > On Sat, 29 Dec 2001, Andrew Morton wrote:
> >
> > > Would it be necessary to preallocate the holes at mmap() time? Mad
> > > hand-waving: Could we not perform the instantiation at pagefault time,
> > > and give the caller SIGBUS if we cannot allocate the blocks? Or if
> > > there's an IO error, or quota exceeded.
> >
> > Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
> > There are applications that use mmap() on large and very sparse
> > files.
>
> it's exactly this kind of apps that will be screwed up by silent data
> corruption. the point of the holes is to optimize performance and save
> space, but they shouldn't introduce the possibility of data corruption.
>
> Note: I'm fine to introduce another way to notify the app about -ENOSPC,
> -ENOSPC on mmap is the most obvious one, but we could still allow the
> current "overcommit" behaviour with a kind of sigbus mentioned by
> Andrew (possibly not sigbus though, since it has just well defined
> semantics for MAP_SHARED, maybe they could be extended, anyways as said
> this is only a matter of API). My point is only that some API should be
> added because your mmap on sparse files are unreliable at the moment.
(see http://marc.info/?l=linux-kernel&m=100975730421590&w=2)
However, this is still not fixed - I am attaching a simple test case,
run it like:
$ make
$ # become root
# make check
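
The attached test case is not reproduced in this thread. As a rough sketch
only (not the original attachment), a program along the following lines
would exercise the scenario described above. The function name
write_into_hole(), its return codes and the 0xAA fill byte are made up for
illustration. Whether the store raises SIGBUS or appears to succeed depends
on the kernel and the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf sigbus_env;

/* Jump back to the caller when the store into the mapping faults. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(sigbus_env, 1);
}

/*
 * Hypothetical helper: create a sparse file of file_size bytes, map it
 * shared, and store len bytes at offset (i.e. into the hole).
 * Returns 0 if the store completed, 1 if it raised SIGBUS,
 * and -1 if the file could not be set up.
 */
int write_into_hole(const char *path, off_t file_size, off_t offset, size_t len)
{
    assert(file_size > 0 && offset >= 0 && (off_t)len <= file_size - offset);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* ftruncate() extends the file without allocating blocks: one big hole. */
    if (ftruncate(fd, file_size) != 0) {
        close(fd);
        return -1;
    }
    char *map = mmap(NULL, (size_t)file_size, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    int result;
    if (sigsetjmp(sigbus_env, 1) == 0) {
        memset(map + offset, 0xAA, len);  /* page faults happen here */
        result = 0;
    } else {
        result = 1;                       /* block allocation was refused */
    }

    sigaction(SIGBUS, &old, NULL);
    munmap(map, (size_t)file_size);
    close(fd);
    return result;
}
```

Run against a nearly full filesystem, a return value of 0 followed by
missing data after remount would correspond to the silent loss reported
here, while 1 would correspond to the SIGBUS behaviour proposed below.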
IMO we should go with the SIGBUS solution, but I want to discuss it here
before making a patch.
Kind regards,
Petr Tesarik
SUSE LINUX, L3 Prague
On Thu, 2007-08-02 at 14:41 +0200, Petr Tesarik wrote:
> Hello,
>
> while solving a different issue, my colleague Libor Pechacek found a
> problem with handling mmapped sparse files. If you mmap the hole inside a
> sparse file and write to it, the data gets silently lost if there is not
> enough space left on the underlying device.
I think Dave's block_page_mkwrite() stuff addresses this as well, no?
http://lkml.org/lkml/2007/3/18/198
Peter
2007/8/2, Peter Zijlstra:
> I think Dave's block_page_mkwrite() stuff addresses this as well, no?
>
> http://lkml.org/lkml/2007/3/18/198
I saw a similar problem some time ago with msync:
http://lkml.org/lkml/2006/12/29/136 but Andrew didn't
like my patch.
--
Guillaume
On Thu, 2 Aug 2007 15:18:33 +0200
"Guillaume Chazarain" wrote:
> 2007/8/2, Peter Zijlstra:
>
> > I think Dave's block_page_mkwrite() stuff addresses this as well, no?
> >
> > http://lkml.org/lkml/2007/3/18/198
>
> I saw a similar problem some time ago with msync:
> http://lkml.org/lkml/2006/12/29/136 but Andrew didn't
> like my patch.
>
Yeah, we need to get that finished off.
The current _design_ of the VM/VFS is such that if the application runs
fsync() or msync() then it will be able to reliably detect any data loss
which has occurred, even if that data loss occurred during a random pageout
attempt by some other process half an hour ago.
However Guillaume has identified some holes in the implementation (I don't
recall the details, but that link is a start).
On Thu, Aug 02, 2007 at 03:06:15PM +0200, Peter Zijlstra wrote:
> On Thu, 2007-08-02 at 14:41 +0200, Petr Tesarik wrote:
> > Hello,
> >
> > while solving a different issue, my colleague Libor Pechacek found a
> > problem with handling mmapped sparse files. If you mmap the hole inside a
> > sparse file and write to it, the data gets silently lost if there is not
> > enough space left on the underlying device.
>
> I think Dave's block_page_mkwrite() stuff addresses this as well, no?
>
> http://lkml.org/lkml/2007/3/18/198
Yes it does. But only XFS hooks that right now. It works, too. ;)
Create and mount 128MB filesystem:
budgie:~ # mkfs.xfs -f -d size=128m /dev/sdb9
....
Fill it up:
budgie:~ # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=127
dd: writing `/mnt/scratch/fred': No space left on device
119+0 records in
118+0 records out
124764160 bytes (125 MB) copied, 4.69505 seconds, 26.6 MB/s
budgie:~ # sync
budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   48K 100% /mnt/scratch
Free up one block:
budgie:~ # xfs_io -f -c "truncate 124760000" /mnt/scratch/fred
budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   52K 100% /mnt/scratch
Create sparse file:
budgie:~ # dd if=/dev/zero of=/mnt/scratch/sparse_mmap_file bs=4k count=1 seek=1000000000
1+0 records in
1+0 records out
4096 bytes (4.1 kB) copied, 0.00022675 seconds, 18.1 MB/s
budgie:~ # ls -l /mnt/scratch/sparse_mmap_file
-rw-r--r-- 1 root root 4096000004096 Aug 3 08:13 /mnt/scratch/sparse_mmap_file
Mmap sparse file and try to write to it:
budgie:~ # xfs_io -f -c "mmap 0 1000000000" -c " mwrite 4000000 50000" /mnt/scratch/sparse_mmap_file
Bus error
There's your bus error. The blocks that were allocated before ENOSPC:
budgie:~ # xfs_bmap -vp /mnt/scratch/sparse_mmap_file
/mnt/scratch/sparse_mmap_file:
 EXT: FILE-OFFSET                 BLOCK-RANGE      AG AG-OFFSET             TOTAL
   0: [0..7807]:                  hole                                       7808
   1: [7808..7839]:               155520..155551    4 (24448..24479)           32
   2: [7840..7999999999]:         hole                                 7999992160
   3: [8000000000..8000000007]:   155512..155519    4 (24440..24447)            8
budgie:~ #
We got 4x4k data blocks allocated and:
budgie:~ # df -h /mnt/scratch
Filesystem            Size  Used Avail Use% Mounted on
/dev/sdb9             124M  120M   32K 100% /mnt/scratch
That shows that 5 blocks were allocated to hold the 4 data blocks
that led to ENOSPC, i.e. a metadata block of some kind was also
allocated.
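
The block counts can be re-derived from the output above. The sketch below
only restates that arithmetic; it assumes 512-byte units in the xfs_bmap
output, KiB units in the df output, and the 4k filesystem block size
mentioned above. The function names are made up for illustration.

```c
enum { BMAP_UNIT = 512, FS_BLOCK = 4096 };

/* Convert an xfs_bmap extent length (512-byte units) to filesystem blocks. */
long bmap_units_to_blocks(long units)
{
    return units * BMAP_UNIT / FS_BLOCK;
}

/* Convert a drop in df "Avail" (KiB) to filesystem blocks. */
long avail_drop_to_blocks(long kib_before, long kib_after)
{
    return (kib_before - kib_after) * 1024 / FS_BLOCK;
}
```

With the numbers from the transcript, bmap_units_to_blocks(32) gives the
4 data blocks of extent 1, and avail_drop_to_blocks(52, 32) gives the
5 blocks consumed in total, leaving 1 block for metadata.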
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group