2010-12-02 13:56:42

by Spelic

Subject: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

Hello all

I noticed what seem to be 4 bugs.
(kernel v2.6.37-rc4 but probably also before)

The first two are one in mkfs.xfs and one in device mapper (the LVM mailing
list, I suppose; otherwise please forward it):

Steps to reproduce:

Boot with a large ramdisk, like ramdisk_size=2097152
(actually I had 14GB ramdisk when I tried this but I don't think it will
make a difference)

Now partition it with a 1GB partition:
fdisk /dev/ram0
n
p
1
1
+1G
w
(only one 1GB physical partition)

Make a devmapper mapping for the partition
kpartx -av /dev/ram0

mkfs.xfs -f /dev/mapper/ram0p1
meta-data=/dev/mapper/ram0p1     isize=256    agcount=4, agsize=66266 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=265064, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Now, lo and behold, the partition is gone!
fdisk /dev/ram0
p
will show no partitions!

You can also check with
dd if=/dev/ram0 bs=1M count=1 | hexdump -C
The entire first MB of /dev/ram0 is zeroed!!

Also,
mount /dev/ram0p1 /mnt
will fail with "unknown filesystem".

I think this shows 2 bugs: first, mkfs.xfs writes outside (before the
beginning of) the device it was asked to operate on.
Second, device mapper does not constrain access to within the boundaries
of the mapped device, which I think it should.

Then I have 2 more bugs for you. Please see my thread on linux-rdma called:
"NFS-RDMA hangs: connection closed (-103)"
in particular this post:
http://www.mail-archive.com/[email protected]/msg06632.html
With NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it
is possible to write a file (2.3GB) which is larger than the size of the
device (1.5GB). One bug, I think, is for the XFS people (because I think
XFS should check whether the filesystem has run out of space), and the
other, I think, is for the /dev/ram people (what mailing list? I am adding
lkml), because I think the device should check whether someone is writing
beyond the end of it.

Thank you
PS: I am not subscribed to lkml so please do not reply ONLY to lkml.


2010-12-02 14:11:36

by Christoph Hellwig

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
option must never be enabled, as it causes block devices to be
randomly renumbered. Together with the ramdisk driver overloading
the BLKFLSBUF ioctl to discard all data it guarantees you to get
data loss like yours.
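
For reference, a minimal stand-alone illustration of the ioctl in question
(this is not mkfs.xfs source; on an ordinary disk BLKFLSBUF merely drops
cached buffers, while brd reuses it to free the ramdisk's pages):

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>		/* BLKFLSBUF */

int main(int argc, char **argv)
{
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <blockdev>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* "flush buffer cache"; on /dev/ram* this destroys the contents */
	if (ioctl(fd, BLKFLSBUF, 0) < 0)
		perror("BLKFLSBUF");
	close(fd);
	return 0;
}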

2010-12-02 14:14:56

by Spelic

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> option must never be enabled, as it causes block devices to be
> randomly renumbered. Together with the ramdisk driver overloading
> the BLKFLSBUF ioctl to discard all data it guarantees you to get
> data loss like yours.
>

Nope...

# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set

2010-12-02 14:15:52

by Spelic

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

Sorry for replying to my own email already
one more thing on the 3rd bug:

On 12/02/2010 02:55 PM, Spelic wrote:
> Hello all
> [CUT]
> .......
> with NFS over <RDMA or IPoIB> over Infiniband over XFS over ramdisk it
> is possible to write a file (2.3GB) which is larger than

This is also reproducible with:
NFS over TCP over Ethernet over XFS over ramdisk.
You don't need infiniband for this.
With ethernet it doesn't hang (that's another bug, for RDMA people, in
the other thread) but the file is still 1.9GB, i.e. larger than the device.


Look, after running the test over ethernet,
at server side:

# ll -h /mnt/ram
total 1.5G
drwxr-xr-x 2 root root 21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

# mount
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
devtmpfs on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
nfsd on /proc/fs/nfsd type nfsd (rw)
binfmt_misc on /proc/sys/fs/binfmt_misc type binfmt_misc
(rw,noexec,nosuid,nodev)
/dev/ram0 on /mnt/ram type xfs (rw)

# blockdev --getsize64 /dev/ram0
1610612736

# dd if=/mnt/ram/zerofile | wc -c
1985937408
3878784+0 records in
3878784+0 records out
1985937408 bytes (2.0 GB) copied, 6.57081 s, 302 MB/s

Feel free to forward to NFS mailing list also if you think it's appropriate.
Thank you

2010-12-02 14:17:39

by Christoph Hellwig

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> >I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> >option must never be enabled, as it causes block devices to be
> >randomly renumbered. Together with the ramdisk driver overloading
> >the BLKFLSBUF ioctl to discard all data it guarantees you to get
> >data loss like yours.
>
> Nope...
>
> # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set

Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
effect.

2010-12-02 21:22:49

by Mike Snitzer

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Thu, Dec 02 2010 at 9:17am -0500,
Christoph Hellwig <[email protected]> wrote:

> On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> > On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > >I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > >option must never be enabled, as it causes block devices to be
> > >randomly renumbered. Together with the ramdisk driver overloading
> > >the BLKFLSBUF ioctl to discard all data it guarantees you to get
> > >data loss like yours.
> >
> > Nope...
> >
> > # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
>
> Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
> effect.

For the benefit of others:
- mkfs.xfs will avoid sending BLKFLSBUF to any device whose major is
ramdisk's major, this dates back to 2004:
http://oss.sgi.com/archives/xfs/2004-08/msg00463.html
- but because a kpartx partition overlay (linear DM mapping) is used for
the /dev/ram0p1 device, mkfs.xfs only sees a device with DM's major
- so mkfs.xfs sends BLKFLSBUF to the DM device blissfully unaware that
the backing device (behind the DM linear target) is a brd device
- DM will forward the BLKFLSBUF ioctl to brd, which triggers
drivers/block/brd.c:brd_ioctl (nuking the entire ramdisk in the
process)

So coming full circle this is what hch was referring to when he
mentioned:
1) "ramdisk driver overloading the BLKFLSBUF ioctl ..."
2) "dm-linear's dumb forwarding of ioctls ..."

I really can't see DM adding a specific check for ramdisk's major when
forwarding the BLKFLSBUF ioctl.

brd has direct partition support (see commit d7853d1f8932c) so maybe
kpartx should just blacklist /dev/ram devices?

Alternatively, what about switching brd away from overloading BLKFLSBUF
to a real implementation of (overloaded) BLKDISCARD support in brd.c?
One that doesn't blindly nuke the entire device but that properly
processes the discard request.
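
Something along these lines, where only the pages covering the discarded
range are dropped (hypothetical sketch, not tested code; brd_free_page()
stands for a per-page "drop the backing page for this sector" helper):

static void brd_discard_range(struct brd_device *brd,
			      sector_t sector, size_t nr_bytes)
{
	while (nr_bytes >= PAGE_SIZE) {
		brd_free_page(brd, sector);
		sector += PAGE_SIZE >> 9;	/* 512-byte sectors per page */
		nr_bytes -= PAGE_SIZE;
	}
	/* a partial trailing page would need to be zeroed, not freed */
}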

Mike

2010-12-02 22:08:21

by Mike Snitzer

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Thu, Dec 02 2010 at 4:22pm -0500,
Mike Snitzer <[email protected]> wrote:

> On Thu, Dec 02 2010 at 9:17am -0500,
> Christoph Hellwig <[email protected]> wrote:
>
> > On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> > > On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > > >I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > > >option must never be enabled, as it causes block devices to be
> > > >randomly renumbered. Together with the ramdisk driver overloading
> > > >the BLKFLSBUF ioctl to discard all data it guarantees you to get
> > > >data loss like yours.
> > >
> > > Nope...
> > >
> > > # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
> >
> > Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
> > effect.
>
> For the benefit of others:
> - mkfs.xfs will avoid sending BLKFLSBUF to any device whose major is
> ramdisk's major, this dates back to 2004:
> http://oss.sgi.com/archives/xfs/2004-08/msg00463.html
> - but because a kpartx partition overlay (linear DM mapping) is used for
> the /dev/ram0p1 device, mkfs.xfs only sees a device with DM's major
> - so mkfs.xfs sends BLKFLSBUF to the DM device blissfully unaware that
> the backing device (behind the DM linear target) is a brd device
> - DM will forward the BLKFLSBUF ioctl to brd, which triggers
> drivers/block/brd.c:brd_ioctl (nuking the entire ramdisk in the
> process)
>
> So coming full circle this is what hch was referring to when he
> mentioned:
> 1) "ramdisk driver overloading the BLKFLSBUF ioctl ..."
> 2) "dm-linear's dumb forwarding of ioctls ..."
>
> I really can't see DM adding a specific check for ramdisk's major when
> forwarding the BLKFLSBUF ioctl.
>
> brd has direct partition support (see commit d7853d1f8932c) so maybe
> kpartx should just blacklist /dev/ram devices?
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of (overloaded) BLKDISCARD support in brd.c?
> One that doesn't blindly nuke the entire device but that properly
> processes the discard request.

Hmm, any chance we could revisit this approach?

http://lkml.indiana.edu/hypermail/linux/kernel/0405.3/0998.html

2010-12-02 23:08:07

by Dave Chinner

[permalink] [raw]
Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Thu, Dec 02, 2010 at 03:14:39PM +0100, Spelic wrote:
> Sorry for replying to my own email already
> one more thing on the 3rd bug:
>
> On 12/02/2010 02:55 PM, Spelic wrote:
> >Hello all
> >[CUT]
> >.......
> >with NFS over <RDMA or IPoIB> over Infiniband over XFS over
> >ramdisk it is possible to write a file (2.3GB) which is larger
> >than
>
> This is also reproducible with:
> NFS over TCP over Ethernet over XFS over ramdisk.
> You don't need infiniband for this.
> With ethernet it doesn't hang (that's another bug, for RDMA people,
> in the other thread) but the file is still 1.9GB, i.e. larger than
> the device.
>
>
> Look, after running the test over ethernet,
> at server side:
>
> # ll -h /mnt/ram
> total 1.5G
> drwxr-xr-x 2 root root 21 2010-12-02 12:54 ./
> drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
> -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

This is a classic ENOSPC vs NFS client writeback overcommit caching
issue. Have a look at the block map output - I bet there are holes in
the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
to check this. du should tell you the same thing.

Basically, the NFS client overcommits the server filesystem space by
doing local writeback caching. Hence it caches 1.9GB of data before
it gets the first ENOSPC error back from the server at around 1.5GB
of written data. At that point, the data that gets ENOSPC errors is
tossed by the NFS client, and an ENOSPC error is placed on the
address space to be reported to the next write/sync call. That gets
to the dd process when it's 1.9GB into the write.

However, there is still (in this case) 400MB of dirty data in the
NFS client cache that it will try to write to the server. Because
XFS uses speculative preallocation and reserves some space for
metadata allocation during delayed allocation, its handling of the
initial ENOSPC condition can result in some space being freed up
again as unused reserved metadata space is returned to the free pool
as delalloc occurs during server writeback. This usually takes a
second or two to complete.

As a result, shortly after the first ENOSPC has been reported and
subsequent writes have also returned ENOSPC, we can have space freed up
and another write will succeed. At that point, the write that succeeds
will be at a different offset from the last one that succeeded, leaving a
hole in the file and moving the EOF well past 1.5GB. That will go on
until there really is no space left at all or the NFS client has no
more dirty data to send.

Basically, what you see is not a bug in XFS; it is a result of NFS
clients being able to overcommit server filesystem space and the
interaction that has with the way the filesystem on the NFS server
handles ENOSPC.
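
From the application's point of view the deferred error only shows up on a
later write() or on fsync(). A minimal illustration (hypothetical mount
point, error handling trimmed to the essentials):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];	/* 1 MiB of nonzero data */
	int fd, i;

	fd = open("/mnt/nfsram/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 0xab, sizeof(buf));
	for (i = 0; i < 2048; i++) {	/* try to write 2 GiB */
		if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
			perror("write");	/* deferred ENOSPC may land here */
			break;
		}
	}
	if (fsync(fd) < 0)
		perror("fsync");	/* or only here, after the writes "succeeded" */
	close(fd);
	return 0;
}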

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-12-03 14:08:32

by Spelic

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On 12/03/2010 12:07 AM, Dave Chinner wrote:
> This is a classic ENOSPC vs NFS client writeback overcommit caching
> issue. Have a look at the block map output - I bet there are holes in
> the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> to check this. du should tell you the same thing.
>
>

Yes you are right!

root@server:/mnt/ram# ll -h
total 1.5G
drwxr-xr-x 2 root root 21 2010-12-02 12:54 ./
drwxr-xr-x 3 root root 4.0K 2010-11-29 23:51 ../
-rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile

root@server:/mnt/ram# ls -lsh
total 1.5G
1.5G -rw-r--r-- 1 root root 1.9G 2010-12-02 15:04 zerofile
(it's a sparse file)

root@server:/mnt/ram# xfs_bmap zerofile
zerofile:
0: [0..786367]: 786496..1572863
1: [786368..1572735]: 2359360..3145727
2: [1572736..2232319]: 1593408..2252991
3: [2232320..2529279]: 285184..582143
4: [2529280..2531327]: hole
5: [2531328..2816407]: 96..285175
6: [2816408..2971511]: 582144..737247
7: [2971512..2971647]: hole
8: [2971648..2975183]: 761904..765439
9: [2975184..2975743]: hole
10: [2975744..2975751]: 765440..765447
11: [2975752..2977791]: hole
12: [2977792..2977799]: 765480..765487
13: [2977800..2979839]: hole
14: [2979840..2979847]: 765448..765455
15: [2979848..2981887]: hole
16: [2981888..2981895]: 765472..765479
17: [2981896..2983935]: hole
18: [2983936..2983943]: 765456..765463
19: [2983944..2985983]: hole
20: [2985984..2985991]: 765464..765471
21: [2985992..3202903]: hole
22: [3202904..3215231]: 737248..749575
23: [3215232..3239767]: hole
24: [3239768..3252095]: 774104..786431
25: [3252096..3293015]: hole
26: [3293016..3305343]: 749576..761903
27: [3305344..3370839]: hole
28: [3370840..3383167]: 2252992..2265319
29: [3383168..3473239]: hole
30: [3473240..3485567]: 2265328..2277655
31: [3485568..3632983]: hole
32: [3632984..3645311]: 2277656..2289983
33: [3645312..3866455]: hole
34: [3866456..3878783]: 2289984..2302311

(many delayed allocation extents cannot be filled because the device
is out of space)

However ...


> Basically, the NFS client overcommits the server filesystem space by
> doing local writeback caching. Hence it caches 1.9GB of data before
> it gets the first ENOSPC error back from the server at around 1.5GB
> of written data. At that point, the data that gets ENOSPC errors is
> tossed by the NFS client, and an ENOSPC error is placed on the
> address space to be reported to the next write/sync call. That gets
> to the dd process when it's 1.9GB into the write.
>

I'm no great expert but isn't this a design flaw in NFS?

OK, in this case we were lucky it was all zeroes, so XFS made a sparse
file and could fit 1.9GB into a 1.5GB device.

In general with nonzero data it seems to me you will get data corruption
because the NFS client thinks it has written the data while the NFS
server really can't write more data than the device size.

It's nice that the NFS server does local writeback caching but it should
also cache the filesystem's free space (and check it periodically, since
nfs-server is presumably not the only process writing in that
filesystem) so that it doesn't accept more data than it can really
write. Alternatively, when free space drops below 1GB (or a reasonable
size based on network speed), nfs-server should turn off filesystem
writeback caching.

I can't repeat the test with urandom because it's too slow (8MB/sec !?).
How come Linux hasn't got an "uurandom" device capable of e.g. 400MB/sec
with only very weak randomness?

But I have repeated the test over ethernet with a bunch of symlinks to a
100MB file created from urandom:

At client side:

# time cat randfile{001..020} | pv -b > /mnt/nfsram/randfile
1.95GB

real 0m22.978s
user 0m0.310s
sys 0m5.360s


At server side:

# ls -lsh ram
total 1.5G
1.5G -rw-r--r-- 1 root root 1.7G 2010-12-03 14:43 randfile
# xfs_bmap ram/randfile
ram/randfile:
0: [0..786367]: 786496..1572863
1: [786368..790527]: 96..4255
2: [790528..1130495]: hole
3: [1130496..1916863]: 2359360..3145727
4: [1916864..2682751]: 1593408..2359295
5: [2682752..3183999]: 285184..786431
6: [3184000..3387207]: 4256..207463
7: [3387208..3387391]: hole
8: [3387392..3391567]: 207648..211823
9: [3391568..3393535]: hole
10: [3393536..3393543]: 211824..211831
11: [3393544..3395583]: hole
12: [3395584..3395591]: 211832..211839
13: [3395592..3397631]: hole
14: [3397632..3397639]: 211856..211863
15: [3397640..3399679]: hole
16: [3399680..3399687]: 211848..211855
17: [3399688..3401727]: hole
18: [3401728..3409623]: 221984..229879
# dd if=/mnt/ram/randfile | wc -c
3409624+0 records in
3409624+0 records out
1745727488
1745727488 bytes (1.7 GB) copied, 5.72443 s, 305 MB/s

The file is still sparse, and this time there is certainly data corruption
(holes will be read back as zeroes).
I understand that the client receives an Input/output error when this
condition is hit, but the file written at the server side has an apparent
size of 1.8GB while the valid data in it is less than that. Is that good
semantics? Wouldn't it be better for the NFS server to turn off writeback
caching when it approaches a disk-full situation?


And then I see another problem:
As you can see, xfs_bmap shows lots of holes, even with the random file
(it was taken from urandom, so you can be sure it hasn't got many zeroes),
already from offset 790528 sectors, which is far from the disk-full
situation...

First I checked whether this also happens when pushing less than 1.5GB of
data. It does not.
Then I tried with exactly 15*100MB (the files are 100MB each, symlinks to
a file which was created with dd if=/dev/urandom of=randfile.rnd bs=1M
count=100)
and this happened:

client side:

# time cat randfile{001..015} | pv -b > /mnt/nfsram/randfile
1.46GB

real 0m18.265s
user 0m0.260s
sys 0m4.460s

(please note: no I/O error at client side! blockdev --getsize64
/dev/ram0 == 1610612736)


server side:

# ls -ls ram
total 1529676
1529676 -rw-r--r-- 1 root root 1571819520 2010-12-03 14:51 randfile

# dd if=/mnt/ram/randfile | wc -c
3069960+0 records in
3069960+0 records out
1571819520
1571819520 bytes (1.6 GB) copied, 5.30442 s, 296 MB/s

# xfs_bmap ram/randfile
ram/randfile:
0: [0..112639]: 96..112735
1: [112640..208895]: 114784..211039
2: [208896..399359]: 285184..475647
3: [399360..401407]: 112736..114783
4: [401408..573439]: 475648..647679
5: [573440..937983]: 786496..1151039
6: [937984..1724351]: 2359360..3145727
7: [1724352..2383871]: 1593408..2252927
8: [2383872..2805695]: 1151040..1572863
9: [2805696..2944447]: 647680..786431
10: [2944448..2949119]: 211040..215711
11: [2949120..3055487]: 2252928..2359295
12: [3055488..3058871]: 215712..219095
13: [3058872..3059711]: hole
14: [3059712..3060143]: 219936..220367
15: [3060144..3061759]: hole
16: [3061760..3061767]: 220368..220375
17: [3061768..3063807]: hole
18: [3063808..3063815]: 220376..220383
19: [3063816..3065855]: hole
20: [3065856..3065863]: 220384..220391
21: [3065864..3067903]: hole
22: [3067904..3067911]: 220392..220399
23: [3067912..3069951]: hole
24: [3069952..3069959]: 220400..220407

Holes in a random file!
This is data corruption, and nobody is notified of it:
no error at the client side or the server side!
Is that good semantics? How could the client get notified of this? Some kind
of fsync maybe?

Thank you

2010-12-03 17:11:47

by Nick Piggin

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Thu, Dec 02, 2010 at 04:22:27PM -0500, Mike Snitzer wrote:
> On Thu, Dec 02 2010 at 9:17am -0500,
> Christoph Hellwig <[email protected]> wrote:
>
> > On Thu, Dec 02, 2010 at 03:14:28PM +0100, Spelic wrote:
> > > On 12/02/2010 03:11 PM, Christoph Hellwig wrote:
> > > >I'm pretty sure you have CONFIG_DEBUG_BLOCK_EXT_DEVT enabled. This
> > > >option must never be enabled, as it causes block devices to be
> > > >randomly renumbered. Together with the ramdisk driver overloading
> > > >the BLKFLSBUF ioctl to discard all data it guarantees you to get
> > > >data loss like yours.
> > >
> > > Nope...
> > >
> > > # CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
> >
> > Hmm, I suspect dm-linear's dumb forwarding of ioctls has the same
> > effect.
>
> For the benefit of others:
> - mkfs.xfs will avoid sending BLKFLSBUF to any device whose major is
> ramdisk's major, this dates back to 2004:
> http://oss.sgi.com/archives/xfs/2004-08/msg00463.html
> - but because a kpartx partition overlay (linear DM mapping) is used for
> the /dev/ram0p1 device, mkfs.xfs only sees a device with DM's major
> - so mkfs.xfs sends BLKFLSBUF to the DM device blissfully unaware that
> the backing device (behind the DM linear target) is a brd device
> - DM will forward the BLKFLSBUF ioctl to brd, which triggers
> drivers/block/brd.c:brd_ioctl (nuking the entire ramdisk in the
> process)
>
> So coming full circle this is what hch was referring to when he
> mentioned:
> 1) "ramdisk driver overloading the BLKFLSBUF ioctl ..."
> 2) "dm-linear's dumb forwarding of ioctls ..."
>
> I really can't see DM adding a specific check for ramdisk's major when
> forwarding the BLKFLSBUF ioctl.
>
> brd has direct partition support (see commit d7853d1f8932c) so maybe
> kpartx should just blacklist /dev/ram devices?
>
> Alternatively, what about switching brd away from overloading BLKFLSBUF
> to a real implementation of (overloaded) BLKDISCARD support in brd.c?
> One that doesn't blindly nuke the entire device but that properly
> processes the discard request.

Yeah the situation really sucks (mkfs.jfs doesn't work on ramdisk
for the same reason).

Unfortunately I want to keep the ioctl for compatibility, but adding new,
saner ones would be welcome. Also, a non-default config option or
load-time parameter for brd to skip the special case would be fine, if
that would help testing on older userspace.
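
For example, a load-time switch might look like this (sketch only; the
parameter name is made up and this is not a tested patch):

/* default keeps today's behaviour; 0 disables the special case so that
 * BLKFLSBUF falls through to the generic buffer-cache flush */
static int rd_legacy_flsbuf = 1;
module_param(rd_legacy_flsbuf, int, S_IRUGO);
MODULE_PARM_DESC(rd_legacy_flsbuf,
		"if 0, BLKFLSBUF no longer frees the whole ramdisk");

/* ...and at the top of brd_ioctl(): */
	if (cmd == BLKFLSBUF && !rd_legacy_flsbuf)
		return -ENOTTY;	/* blkdev_ioctl() then just flushes buffers */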

DISCARD is actually a problem for rd. To actually get proper
correctness, you need to preload brd with pages, otherwise when
doing stress tests, IO can require memory allocations and deadlock.
If we add a discard that frees pages, that introduces the same problem.
If you find any option useful for testing, however, patches are fine --
brd pretty much is only useful for testing nowadays.

2010-12-03 18:15:28

by Theodore Ts'o

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Sat, Dec 04, 2010 at 04:11:40AM +1100, Nick Piggin wrote:
> > Alternatively, what about switching brd away from overloading BLKFLSBUF
> > to a real implementation of (overloaded) BLKDISCARD support in brd.c?
> > One that doesn't blindly nuke the entire device but that properly
> > processes the discard request.
>
> Yeah the situation really sucks (mkfs.jfs doesn't work on ramdisk
> for the same reason).
>
> Unfortunately I want to keep the ioctl for compatibility, but adding new,
> saner ones would be welcome. Also, a non-default config option or
> load-time parameter for brd to skip the special case would be fine, if
> that would help testing on older userspace.

How many programs actually depend on BLKFLSBUF dropping the pages used
in /dev/ram? The fact that it did this at all was a historical
accident of how the original /dev/ram was implemented (in the buffer
cache directly), and not anything that was intended. I think that's
something that we should be able to fix, since the number of programs
that knowingly operate on the ramdisk is quite small. Just a few system
programs used by distributions in their early boot scripts....

So I would argue for dropping the "special" behavior of BLKFLSBUF for
/dev/ram.

- Ted

2010-12-06 04:10:08

by Dave Chinner

Subject: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram

On Fri, Dec 03, 2010 at 03:07:58PM +0100, Spelic wrote:
> On 12/03/2010 12:07 AM, Dave Chinner wrote:
> >This is a classic ENOSPC vs NFS client writeback overcommit caching
> >issue. Have a look at the block map output - I bet there are holes in
> >the file and it's only consuming 1.5GB of disk space. Use xfs_bmap
> >to check this. du should tell you the same thing.
> >
>
> Yes you are right!
....
> root@server:/mnt/ram# xfs_bmap zerofile
> zerofile:
....
> 30: [3473240..3485567]: 2265328..2277655
> 31: [3485568..3632983]: hole
> 32: [3632984..3645311]: 2277656..2289983
> 33: [3645312..3866455]: hole
> 34: [3866456..3878783]: 2289984..2302311
>
> (many delayed allocation extents cannot be filled because the device
> is out of space)
>
> However ...
>
>
> >Basically, the NFS client overcommits the server filesystem space by
> >doing local writeback caching. Hence it caches 1.9GB of data before
> >it gets the first ENOSPC error back from the server at around 1.5GB
> >of written data. At that point, the data that gets ENOSPC errors is
> >tossed by the NFS client, and an ENOSPC error is placed on the
> >address space to be reported to the next write/sync call. That gets
> >to the dd process when it's 1.9GB into the write.
>
> I'm no great expert but isn't this a design flaw in NFS?

Yes, sure is.

[ Well, to be precise the original NFSv2 specification
didn't have this flaw because all writes were synchronous. NFSv3
introduced asynchronous writes (writeback caching) and with it this
problem. NFSv4 does not fix this flaw. ]

> OK, in this case we were lucky it was all zeroes, so XFS made a sparse
> file and could fit 1.9GB into a 1.5GB device.
>
> In general with nonzero data it seems to me you will get data
> corruption because the NFS client thinks it has written the data
> while the NFS server really can't write more data than the device
> size.

Yup, well known issue. Simple rule: don't run your NFS server out of
space.

> It's nice that the NFS server does local writeback caching but it
> should also cache the filesystem's free space (and check it
> periodically, since nfs-server is presumably not the only process
> writing in that filesystem) so that it doesn't accept more data than
> it can really write. Alternatively, when free space drops below 1GB
> (or a reasonable size based on network speed), nfs-server should
> turn off filesystem writeback caching.

This isn't an NFS server problem, or one that can be worked around at
the server. It's an NFS _client_ problem in that it does not get
synchronous ENOSPC errors when using writeback caching. There is no
way for the NFS client to know the server is near ENOSPC conditions
prior to writing the data to the server as clients operate
independently.

If you really want your NFS clients to behave correctly when the
server goes ENOSPC, turn off writeback caching at the client side,
not the server (i.e. use sync mounts on the client side).
Write performance will suck, but if you want sane ENOSPC behaviour...

.....

> Holes in a random file!
> This is data corruption, and nobody is notified of this data
> corruption: no error at client side or server side!
> Is it good semantics? How could client get notified of this? Some
> kind of fsync maybe?

Use wireshark to determine if the server sends an ENOSPC to the
client when the first background write fails. I bet it does and that
your dd write failed with ENOSPC, too. Something stopped it writing
at 1.9GB....

What happens to the remaining cached writeback data in the NFS
client once the server runs out of space is NFS-client-specific
behaviour. If you end up with only bits of the file ending up on the
server, then that's a result of NFS client behaviour, not an NFS
server problem.

Cheers,

Dave.


--
Dave Chinner
[email protected]

2010-12-06 12:21:13

by Spelic

Subject: NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)

On 12/06/2010 05:09 AM, Dave Chinner wrote:
>> [Files become sparse at nfs-server-side upon hitting ENOSPC if NFS client uses local writeback caching]
>>
>>
>> It's nice that the NFS server does local writeback caching but it
>> should also cache the filesystem's free space (and check it
>> periodically, since nfs-server is presumably not the only process
>> writing in that filesystem) so that it doesn't accept more data than
>> it can really write. Alternatively, when free space drops below 1GB
>> (or a reasonable size based on network speed), nfs-server should
>> turn off filesystem writeback caching.
>>
> This isn't an NFS server problem, or one that can be worked around at
> the server. It's an NFS _client_ problem in that it does not get
> synchronous ENOSPC errors when using writeback caching. There is no
> way for the NFS client to know the server is near ENOSPC conditions
> prior to writing the data to the server as clients operate
> independently.
>
> If you really want your NFS clients to behave correctly when the
> server goes ENOSPC, turn off writeback caching at the client side,
> not the server (i.e. use sync mounts on the client side).
> Write performance will suck, but if you want sane ENOSPC behaviour...
>
>

[adding NFS ML in cc]

Thank you for your very clear explanation.

Going without a writeback cache is a problem (write performance sucks, as
you say), but guaranteeing never to reach ENOSPC is also hardly
feasible, especially if humans are logged in at the client side and they
are doing "whatever they want".

I would suggest that either the NFS client do the polling, to see whether
the server is near ENOSPC, and if so disable writeback caching; or the
server do the polling, and if it finds it is in a near-ENOSPC condition,
send a specific message to the clients to warn them so that they can
disable caching.

Done at the client side this wouldn't change the NFS protocol, and it can
be good enough if one can specify how often free space should be polled and
what the free-space threshold is. Or with just one value: specify the
maximum speed at which the server disk can fill (the next polling period
can be inferred from the current free space), and maybe also specify a
minimum polling period (just in case).

Regarding the last part of the email, perhaps I was not clear:


> .....
>
>> Holes in a random file!
>> This is data corruption, and nobody is notified of this data
>> corruption: no error at client side or server side!
>> Is it good semantics? How could client get notified of this? Some
>> kind of fsync maybe?
>>
> Use wireshark to determine if the server sends an ENOSPC to the
> client when the first background write fails. I bet it does and that
> your dd write failed with ENOSPC, too. Something stopped it writing
> at 1.9GB....
>

No, in that case I had written 15x100MB, which was more than the
available space but less than available + writeback cache.
So "cat" finished by itself and never got an ENOSPC error, but some of the
data never reached the disk at the other side.

However today I found that by using fsync, the problem is fortunately
detected:

# time cat randfile{001..015} | pv -b | dd conv=fsync of=/mnt/nfsram/randfile
1.46GB
dd: fsync failed for `/mnt/nfsram/randfile': Input/output error
3072000+0 records in
3072000+0 records out
1572864000 bytes (1.6 GB) copied, 20.9101 s, 75.2 MB/s

real 0m21.364s
user 0m0.470s
sys 0m11.440s


So OK, I understand that processes needing guarantees on written data
should use fsync/fdatasync (which is good practice for a local
filesystem too, actually...).

Thank you

2010-12-06 13:34:07

by Trond Myklebust

Subject: Re: NFS corruption on ENOSPC (was: Re: Bugs in mkfs.xfs, device mapper, xfs, and /dev/ram)

On Mon, 2010-12-06 at 13:20 +0100, Spelic wrote:
> On 12/06/2010 05:09 AM, Dave Chinner wrote:
> >> [Files become sparse at nfs-server-side upon hitting ENOSPC if NFS client uses local writeback caching]
> >>
> >>
> >> It's nice that the NFS server does local writeback caching but it
> >> should also cache the filesystem's free space (and check it
> >> periodically, since nfs-server is presumably not the only process
> >> writing in that filesystem) so that it doesn't accept more data than
> >> it can really write. Alternatively, when free space drops below 1GB
> >> (or a reasonable size based on network speed), nfs-server should
> >> turn off filesystem writeback caching.
> >>
> > This isn't an NFS server problem, or one that can be worked around at
> > the server. It's an NFS _client_ problem in that it does not get
> > synchronous ENOSPC errors when using writeback caching. There is no
> > way for the NFS client to know the server is near ENOSPC conditions
> > prior to writing the data to the server as clients operate
> > independently.
> >
> > If you really want your NFS clients to behave correctly when the
> > server goes ENOSPC, turn off writeback caching at the client side,
> > not the server (i.e. use sync mounts on the client side).
> > Write performance will suck, but if you want sane ENOSPC behaviour...
> >
> >
>
> [adding NFS ML in cc]
>
> Thank you for your very clear explanation.
>
> Going without a writeback cache is a problem (write performance sucks, as
> you say), but guaranteeing never to reach ENOSPC is also hardly
> feasible, especially if humans are logged in at the client side and they
> are doing "whatever they want".
>
> I would suggest that either the NFS client do the polling, to see whether
> the server is near ENOSPC, and if so disable writeback caching; or the
> server do the polling, and if it finds it is in a near-ENOSPC condition,
> send a specific message to the clients to warn them so that they can
> disable caching.



> Done at the client side this wouldn't change the NFS protocol, and it can
> be good enough if one can specify how often free space should be polled and
> what the free-space threshold is. Or with just one value: specify the
> maximum speed at which the server disk can fill (the next polling period
> can be inferred from the current free space), and maybe also specify a
> minimum polling period (just in case).

You can just as easily do this at the application level. The kernel
can't do it any more reliably than the application can, so there really
is no point in doing it there.
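
For example (a rough sketch with an arbitrary 1GB threshold and a
hypothetical path; statvfs() on an NFS mount reports the server's free
space, so it is only a hint, not a guarantee):

#include <stdio.h>
#include <sys/statvfs.h>

static int enough_space(const char *path, unsigned long long need)
{
	struct statvfs sv;

	if (statvfs(path, &sv) < 0) {
		perror("statvfs");
		return 0;
	}
	/* f_bavail is counted in units of f_frsize */
	return (unsigned long long)sv.f_bavail * sv.f_frsize >= need;
}

int main(void)
{
	if (!enough_space("/mnt/nfsram", 1ULL << 30))
		fprintf(stderr, "below 1GB free: write synchronously "
				"or stop writing\n");
	return 0;
}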

We already ensure that when the server does send us an error, we switch
to synchronous operation until the error clears.

Trond