2001-12-20 19:08:19

by Torrey Hoffman

[permalink] [raw]
Subject: RE: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Yes, this does fix the problem. Thank you very much!

Hopefully something like this will make it into 2.4.18?

(I have to admit I applied and tested your patch to
2.4.17-rc2, and did not test -rc2 without your patch.
However, -rc1 has the bug and I don't think any other changes
between -rc1 and -rc2 would have fixed this. If someone out
there wants me to, I can recompile and test -rc2 vanilla.)

Torrey Hoffman


Tachino Nobuhiro ([email protected]) wrote:

> Hello,
>
> Following patch may fix your problem.
>
> diff -r -u linux-2.4.17-rc2.org/drivers/block/rd.c
> linux-2.4.17-rc2/drivers/block/rd.c
> --- linux-2.4.17-rc2.org/drivers/block/rd.c Thu Dec 20 20:30:57 2001
> +++ linux-2.4.17-rc2/drivers/block/rd.c Thu Dec 20 20:46:53 2001
> @@ -194,9 +194,11 @@
> static int ramdisk_readpage(struct file *file, struct page * page)
> {
> if (!Page_Uptodate(page)) {
> - memset(kmap(page), 0, PAGE_CACHE_SIZE);
> - kunmap(page);
> - flush_dcache_page(page);
> + if (!page->buffers) {
> + memset(kmap(page), 0, PAGE_CACHE_SIZE);
> + kunmap(page);
> + flush_dcache_page(page);
> + }
> SetPageUptodate(page);
> }
> UnlockPage(page);
>
>
> grow_dev_page() creates not Uptodate page which has valid
> buffers, so
> it is wrong that ramdisk_readpage() clears whole page unconditionally.
>
>
> At Tue, 18 Dec 2001 12:14:49 -0800,
> Torrey Hoffman wrote:
> >
> > More information! And a workaround!
> >
> > I conjecture that the ramdisk driver (post 2.4.9) only grabs
> > VM pages properly if it is accessed directly, as a dd to
> > /dev/ram0 does. I further conjecture that accessing the
> > ramdisk through a mounted filesystem does not grab pages
> > properly.
> >
> > The reason I believe this is that removing the call to
> > "freeramdisk" from my original script avoids corruption.
> >
> > Another way to avoid ramdisk corruption is to
> > "dd if=/dev/zero of=/dev/ram0 bs=1k count=4000"
> > immediately after the call to freeramdisk.
> >
> > If my conjecture is right, then the corruption is caused
> > because mke2fs on a "freed" /dev/ram0 doesn't touch every
> > block of the fs, leaving "holes" where pages are not properly
> > grabbed from the VM. The resulting filesystem appears to work,
> > but dd'ing from /dev/ram0 gets a broken filesystem image.
> >
> > Note that "freeramdisk /dev/ram0" is pretty much just:
> > #define FLKFLSBUF _IO(0x12,97) /* flush buffer cache */
> > f = open("/dev/ram0", O_RDWR);
> > ioctl(f, BLKFLSBUF);
> >
> > To experiment for yourself, stick the following script in
> > a subdirectory which also contains a "testdir" directory
> > with about 3 MB of data.
> >
> > - - - - - - - -
> > #!/bin/bash
> >
> > # to tickle the bug, do the freeramdisk but not the
> > # dd from /dev/zero to /dev/ram0.
> >
> > freeramdisk /dev/ram0
> > #dd if=/dev/zero of=/dev/ram0 bs=1k count=4000
> >
> > mke2fs -m0 /dev/ram0 4000
> > mount -t ext2 /dev/ram0 /mnt/ramdisk
> > rm -rf /mnt/ramdisk/*
> >
> > cp -a ./testdir /mnt/ramdisk
> > umount /dev/ram0
> >
> > dd if=/dev/ram0 of=ram0.img bs=1k count=4000
> > dd if=ram0.img of=/dev/ram0 bs=1k count=4000
> >
> > mount -t ext2 /dev/ram0 /mnt/ramdisk
> > diff -q -r ./testdir /mnt/ramdisk/testdir
> >
> > # If diff reports mismatches, you saw the bug.
> >
> > umount /dev/ram0
> > - - - - - - - -
> >
> > If the gods of the VM and VFS don't bother to look at
> > it, I might take a peek at the relevant kernel code myself.
> > Might take two months of study before I know enough though.
> >
> > Torrey


2001-12-20 19:47:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

In article <[email protected]> you write:
>Yes, this does fix the problem. Thank you very much!
>
>Hopefully something like this will make it into 2.4.18?

This does not seem quite right.

>Tachino Nobuhiro ([email protected]) wrote:
>> Hello,
>>
>> Following patch may fix your problem.
>>
>> diff -r -u linux-2.4.17-rc2.org/drivers/block/rd.c
>> linux-2.4.17-rc2/drivers/block/rd.c
>> --- linux-2.4.17-rc2.org/drivers/block/rd.c Thu Dec 20 20:30:57 2001
>> +++ linux-2.4.17-rc2/drivers/block/rd.c Thu Dec 20 20:46:53 2001
>> @@ -194,9 +194,11 @@
>> static int ramdisk_readpage(struct file *file, struct page * page)
>> {
>> if (!Page_Uptodate(page)) {
>> - memset(kmap(page), 0, PAGE_CACHE_SIZE);
>> - kunmap(page);
>> - flush_dcache_page(page);
>> + if (!page->buffers) {
>> + memset(kmap(page), 0, PAGE_CACHE_SIZE);
>> + kunmap(page);
>> + flush_dcache_page(page);
>> + }
>> SetPageUptodate(page);
>> }
>> UnlockPage(page);
>>
>>
>> grow_dev_page() creates not Uptodate page which has valid
>> buffers, so
>> it is wrong that ramdisk_readpage() clears whole page unconditionally.

The problem is that having buffers doesn't necessarily always mean that
they are valid, nor that _all_ of them are valid.

Also, if the ramdisk "readpage" code is wrong, then so is the
"prepare_write" code. They share the same logic, which basically says
that "if the page isn't up-to-date, then it is zero". Which is always
true for normal read/write accesses, but as you found out it's not true
when parts of the page have been accessed by filesystems through the
buffers.

So the code _should_ use a common helper, something like

static void ramdisk_updatepage(struct page * page)
{
if (!Page_Uptodate(page)) {
struct buffer_head *bh = page->buffers;
void * address = page_address(page);
if (bh) {
struct buffer_head *tmp = bh;
do {
if (!buffer_uptodate(tmp)) {
memset(address, 0, tmp->b_size);
set_buffer_uptodate(tmp);
}
address += tmp->b_size;
tmp = tmp->b_this_page;
} while (tmp != bh);
} else
memset(address, 0, PAGE_SIZE);
flush_dcache_page(page);
SetPageUptodate(page);
}
}

and then ramdisk_readpage() would just be

kmap(page);
ramdisk_updatepage(page);
UnlockPage(page);
kunmap(page);
return 0;

while ramdisk_prepare_write() would be

ramdisk_updatepage(page);
SetPageDirty(page);
return 0;

NOTE NOTE NOTE! This is untested. Please somebody test it, and in
particular verify that there aren't any stupid highmem bugs, and
resubmit the patch to me and Marcelo, ok?

Linus

2001-12-20 22:57:30

by Alexander Viro

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes



On Thu, 20 Dec 2001, Linus Torvalds wrote:

> The problem is that having buffers doesn't necessarily always mean that
> they are valid, nor that _all_ of them are valid.
>
> Also, if the ramdisk "readpage" code is wrong, then so is the
> "prepare_write" code. They share the same logic, which basically says
> that "if the page isn't up-to-date, then it is zero". Which is always
> true for normal read/write accesses, but as you found out it's not true
> when parts of the page have been accessed by filesystems through the
> buffers.

AFAICS, it's nastier than that. What's to stop buffer_heads to be
freed under memory pressure?

2001-12-20 23:43:12

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Thu, Dec 20, 2001 at 11:46:23AM -0800, Linus Torvalds wrote:
> In article <[email protected]> you write:
> >Yes, this does fix the problem. Thank you very much!
> >
> >Hopefully something like this will make it into 2.4.18?
>
> This does not seem quite right.
>
> >Tachino Nobuhiro ([email protected]) wrote:
> >> Hello,
> >>
> >> Following patch may fix your problem.
> >>
> >> diff -r -u linux-2.4.17-rc2.org/drivers/block/rd.c
> >> linux-2.4.17-rc2/drivers/block/rd.c
> >> --- linux-2.4.17-rc2.org/drivers/block/rd.c Thu Dec 20 20:30:57 2001
> >> +++ linux-2.4.17-rc2/drivers/block/rd.c Thu Dec 20 20:46:53 2001
> >> @@ -194,9 +194,11 @@
> >> static int ramdisk_readpage(struct file *file, struct page * page)
> >> {
> >> if (!Page_Uptodate(page)) {
> >> - memset(kmap(page), 0, PAGE_CACHE_SIZE);
> >> - kunmap(page);
> >> - flush_dcache_page(page);
> >> + if (!page->buffers) {
> >> + memset(kmap(page), 0, PAGE_CACHE_SIZE);
> >> + kunmap(page);
> >> + flush_dcache_page(page);
> >> + }
> >> SetPageUptodate(page);
> >> }
> >> UnlockPage(page);
> >>
> >>
> >> grow_dev_page() creates not Uptodate page which has valid
> >> buffers, so
> >> it is wrong that ramdisk_readpage() clears whole page unconditionally.
>
> The problem is that having buffers doesn't necessarily always mean that
> they are valid, nor that _all_ of them are valid.
>
> Also, if the ramdisk "readpage" code is wrong, then so is the
> "prepare_write" code. They share the same logic, which basically says
> that "if the page isn't up-to-date, then it is zero". Which is always
> true for normal read/write accesses, but as you found out it's not true
> when parts of the page have been accessed by filesystems through the
> buffers.

subtle, while writing and testing that code the buffercache was not
sharing the physical address space yet, so it was stable, then it kept to
work somehow till today because only metadata writes into holes would
trigger it.

>
> So the code _should_ use a common helper, something like
>
> static void ramdisk_updatepage(struct page * page)
> {
> if (!Page_Uptodate(page)) {
> struct buffer_head *bh = page->buffers;
> void * address = page_address(page);
> if (bh) {
> struct buffer_head *tmp = bh;
> do {
> if (!buffer_uptodate(tmp)) {
> memset(address, 0, tmp->b_size);
> set_buffer_uptodate(tmp);
> }
> address += tmp->b_size;
> tmp = tmp->b_this_page;
> } while (tmp != bh);
> } else
> memset(address, 0, PAGE_SIZE);
> flush_dcache_page(page);
> SetPageUptodate(page);
> }
> }
>
> and then ramdisk_readpage() would just be
>
> kmap(page);
> ramdisk_updatepage(page);
> UnlockPage(page);
> kunmap(page);
> return 0;
>
> while ramdisk_prepare_write() would be
>
> ramdisk_updatepage(page);
> SetPageDirty(page);
> return 0;

agreed. rd_blkdev_pagecache_IO will as well do:

err = 0;

kmap(page);
ramdisk_updatepage(page);
kunmap(page);

unlock = 1;

Andrea

2001-12-21 01:39:28

by Tachino Nobuhiro

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes


At Thu, 20 Dec 2001 11:46:23 -0800,
Linus Torvalds wrote:
>
> The problem is that having buffers doesn't necessarily always mean that
> they are valid, nor that _all_ of them are valid.

If following sequence occurs, ramdisk_readpage() may clear the valid buffer
data. I'm not sure whether this really occurs, but if it does,
I think lock_buffer()/unlock_buffer() may be required.


CPU1 CPU2
---------------------------------------------------------------------------
call ext2_alloc_branch() call ramdisk_readpage()

bh = getblk();
lock_buffer(bh);

write data to bh.
ramdisk_readpage()
if (!buffer_uptodate(bh))
do memset()

mark_buffer_uptodate()
unlock_buffer(bh);

2001-12-21 01:52:18

by Andre Hedrick

[permalink] [raw]
Subject: Everyone else but TWO ...


Well now that I have proven the isolation technique to the folks IRC'ng,
just maybe the two Lead Penguins will get catch a clue. I suspect one of
the two is more on top of the issue and more willing to listen.

Regards,


Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project

2001-12-21 01:49:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

I diffed a more advanced version of it. It's not very well tested.

--- 2.4.17rc2aa1/drivers/block/rd.c.~1~ Wed Dec 19 03:01:04 2001
+++ 2.4.17rc2aa1/drivers/block/rd.c Fri Dec 21 01:39:15 2001
@@ -191,26 +191,44 @@
* 2000 Transmeta Corp.
* aops copied from ramfs.
*/
-static int ramdisk_readpage(struct file *file, struct page * page)
+static void ramdisk_updatepage(struct page * page, int need_kmap)
{
if (!Page_Uptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
+ struct buffer_head *bh = page->buffers;
+ void * address;
+
+ if (need_kmap)
+ kmap(page);
+ address = page_address(page);
+ if (bh) {
+ struct buffer_head *tmp = bh;
+ do {
+ if (!buffer_uptodate(tmp)) {
+ memset(address, 0, tmp->b_size);
+ mark_buffer_uptodate(tmp, 1);
+ }
+ address += tmp->b_size;
+ tmp = tmp->b_this_page;
+ } while (tmp != bh);
+ } else
+ memset(address, 0, PAGE_SIZE);
+ if (need_kmap)
+ kunmap(page);
flush_dcache_page(page);
SetPageUptodate(page);
}
+}
+
+static int ramdisk_readpage(struct file *file, struct page * page)
+{
+ ramdisk_updatepage(page, 1);
UnlockPage(page);
return 0;
}

static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
- if (!Page_Uptodate(page)) {
- void *addr = page_address(page);
- memset(addr, 0, PAGE_CACHE_SIZE);
- flush_dcache_page(page);
- SetPageUptodate(page);
- }
+ ramdisk_updatepage(page, 0);
SetPageDirty(page);
return 0;
}
@@ -233,10 +251,18 @@
unsigned long index;
int offset, size, err;

- err = -EIO;
err = 0;
mapping = rd_bdev[minor]->bd_inode->i_mapping;

+ /* writing a buffer cache not uptodate must not clear it */
+ if (sbh->b_page->mapping == mapping) {
+ if (rw == WRITE) {
+ mark_buffer_uptodate(sbh, 1);
+ SetPageDirty(sbh->b_page);
+ }
+ goto out;
+ }
+
index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9);
offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK;
size = sbh->b_size;
@@ -262,11 +288,7 @@
goto out;
err = 0;

- if (!Page_Uptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
- SetPageUptodate(page);
- }
+ ramdisk_updatepage(page, 1);

unlock = 1;
}

actually while testing it I unfortunately found also an fs corruption
bug in the ->prepare_write/commit_write/writepage/direct_IO callbacks.
It's a longstanding one, since get_block born. In short, if get_block
fails while mapping on a certain page during
prepare_write/writepage/direct_IO (like it can happen with a full
filesystem, incidentally ext2 on /dev/ram0 during my testing that is
only 4M and so it overflows fast), the blocks before the ENOSPC failure
will be allocated, but the i_size won't be update accordingly (no commit
write, because prepare_write failed in the middle). for the
generic_file_write users it's easily fixable with an hack (truncating
backwards because we don't know how far we got allocating blocks when we
return from prepare_write), similar hack for the direct_IO one (that
commits only once at the end in function of the direct_IO generated).
But for writepage is quite a pain, infact I believe the writepage blocks
should be reserved in-core, to guarantee we will never fail a truncate
with ENOSPC. With the shared mappings we're effectively doing allocate
on flush... but with the lack of space reservation that makes it
unreliable :)

So for now I did an hack to cure the other two (writepage can still
corrupt the fs as said). I think the right fix (ala 2.5) is to change
the API so we can use the last blocks too, but the below will cure 2.4
and for writepage the right fix IMHO is to do the reservation of the
space.

both holds the i_sem so calling truncate from there cannot race with
any other inode-writer.

--- 2.4.17rc2aa2/mm/filemap.c.~1~ Wed Dec 19 03:43:23 2001
+++ 2.4.17rc2aa2/mm/filemap.c Fri Dec 21 02:21:07 2001
@@ -3026,8 +3025,14 @@

kaddr = kmap(page);
status = mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
- if (status)
+ if (status) {
+ if (status == -ENOSPC && inode->i_op && inode->i_op->truncate) {
+ lock_kernel();
+ inode->i_op->truncate(inode);
+ unlock_kernel();
+ }
goto unlock;
+ }
page_fault = __copy_from_user(kaddr+offset, buf, bytes);
flush_dcache_page(page);
status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
@@ -3090,6 +3095,14 @@
if (end > inode->i_size && !S_ISBLK(inode->i_mode)) {
inode->i_size = end;
mark_inode_dirty(inode);
+ }
+ /* there's the possibility that some get_block left some leftover */
+ if (written != count) {
+ if (inode->i_op && inode->i_op->truncate) {
+ lock_kernel();
+ inode->i_op->truncate(inode);
+ unlock_kernel();
+ }
}
*ppos = end;
invalidate_inode_pages2(mapping);


ah, and of course, other workaround is to use fs-blocksize == PAGE_SIZE 8)

btw, I'll be away the next few days, so I will not be very responsive
until after Christmas.

Andrea

2001-12-29 15:41:34

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Thu, Dec 20, 2001 at 09:56:40PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> > actually while testing it I unfortunately found also an fs corruption
> > bug in the ->prepare_write/commit_write/writepage/direct_IO callbacks.
> > It's a longstanding one, since get_block born. In short, if get_block
> > fails while mapping on a certain page during
> > prepare_write/writepage/direct_IO (like it can happen with a full
> > filesystem, incidentally ext2 on /dev/ram0 during my testing that is
> > only 4M and so it overflows fast), the blocks before the ENOSPC failure
> > will be allocated, but the i_size won't be update accordingly (no commit
> > write, because prepare_write failed in the middle). for the
> > generic_file_write users it's easily fixable with an hack (truncating
> > backwards because we don't know how far we got allocating blocks when we
> > return from prepare_write), similar hack for the direct_IO one (that
> > commits only once at the end in function of the direct_IO generated).
> > But for writepage is quite a pain, infact I believe the writepage blocks
> > should be reserved in-core, to guarantee we will never fail a truncate
> > with ENOSPC. With the shared mappings we're effectively doing allocate
> > on flush... but with the lack of space reservation that makes it
> > unreliable :)
>
> The -ac kernels handled this by zeroing out the accidentally-allocated
> blocks at __block_prepare_write.

actually my fix seems cleaner because it puts the "clearing" in one single
place.

> > So for now I did an hack to cure the other two (writepage can still
> > corrupt the fs as said). I think the right fix (ala 2.5) is to change
> > the API so we can use the last blocks too, but the below will cure 2.4
> > and for writepage the right fix IMHO is to do the reservation of the
> > space.
>
> This is better in a way because it reclaims the eztra few blocks. But
> the -ac approach also works for writepage().

yes, it can solve the metadata corruption (assuming the locking is
right, I can as well call ->truncate within writepage but it's not
obvious at all that it won't race because we don't hold the i_sem within
writepage), but the data corruption still holds. I mean, there's no
failure path to notify userspace that a certain page fault is writing
into a page over an hole, that we don't have space to later allocate on
disk. so to me it sounds like MAP_SHARED should preallocate the space of
the holes so you will know that the writes into the MAP_SHARED segments
won't be lost (current state of things will lead to silent corruption
and pinned dirty pages in ram, aka broken allocate on flush like
previously said).

> Why was that code not brought across?

Who developed that code? Can the author of the code forward port it to
2.4.18pre and post a patch to the list so we can review? thanks,
Avoiding the matadata corruption would be a good start at least for 2.4,
then we should just focus on the writepage locking that could race with the
other "create=1" get_blocks. If it doesn't race I will certainly agree
on that approch for 2.4.

Andrea

2001-12-30 06:23:41

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

[ this thread started at http://www.lib.uaa.alaska.edu/linux-kernel/archive/2001-Week-51/1001.html ]


Andrea Arcangeli wrote:
>
> On Thu, Dec 20, 2001 at 09:56:40PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli wrote:
> > >
> > > ...
> > > actually while testing it I unfortunately found also an fs corruption
> > > bug in the ->prepare_write/commit_write/writepage/direct_IO callbacks.
> > > It's a longstanding one, since get_block born. In short, if get_block
> > > fails while mapping on a certain page during
> > > prepare_write/writepage/direct_IO (like it can happen with a full
> > > filesystem, incidentally ext2 on /dev/ram0 during my testing that is
> > > only 4M and so it overflows fast), the blocks before the ENOSPC failure
> > > will be allocated, but the i_size won't be update accordingly (no commit
> > > write, because prepare_write failed in the middle). for the
> > > generic_file_write users it's easily fixable with an hack (truncating
> > > backwards because we don't know how far we got allocating blocks when we
> > > return from prepare_write), similar hack for the direct_IO one (that
> > > commits only once at the end in function of the direct_IO generated).
> > > But for writepage is quite a pain, infact I believe the writepage blocks
> > > should be reserved in-core, to guarantee we will never fail a truncate
> > > with ENOSPC. With the shared mappings we're effectively doing allocate
> > > on flush... but with the lack of space reservation that makes it
> > > unreliable :)
> >
> > The -ac kernels handled this by zeroing out the accidentally-allocated
> > blocks at __block_prepare_write.
>
> actually my fix seems cleaner because it puts the "clearing" in one single
> place.

I think so too. Even for ext3, which has a very complex truncate,
it appears to be OK.

> > > So for now I did an hack to cure the other two (writepage can still
> > > corrupt the fs as said). I think the right fix (ala 2.5) is to change
> > > the API so we can use the last blocks too, but the below will cure 2.4
> > > and for writepage the right fix IMHO is to do the reservation of the
> > > space.
> >
> > This is better in a way because it reclaims the eztra few blocks. But
> > the -ac approach also works for writepage().
>
> yes, it can solve the metadata corruption (assuming the locking is

Where can metadata corruption occur? A few extra blocks outside
i_size for ext2 directories isn't going to cause corruption, is it?

Or are you referring to i_blocks accounting being incorrect?

> right, I can as well call ->truncate within writepage but it's not
> obvious at all that it won't race because we don't hold the i_sem within
> writepage), but the data corruption still holds.

For sure - holding i_sem on truncate is abolutely required.

> I mean, there's no
> failure path to notify userspace that a certain page fault is writing
> into a page over an hole, that we don't have space to later allocate on
> disk. so to me it sounds like MAP_SHARED should preallocate the space of
> the holes so you will know that the writes into the MAP_SHARED segments
> won't be lost (current state of things will lead to silent corruption
> and pinned dirty pages in ram, aka broken allocate on flush like
> previously said).

Um. How does this differ from an I/O error on flush?

Would it be necessary to preallocate the holes at mmap() time? Mad
hand-waving: Could we not perform the instantiation at pagefault time,
and give the caller SIGBUS if we cannot allocate the blocks? Or if
there's an IO error, or quota exceeded.

> > Why was that code not brought across?
>
> Who developed that code? Can the author of the code forward port it to
> 2.4.18pre and post a patch to the list so we can review? thanks,
> Avoiding the matadata corruption would be a good start at least for 2.4,
> then we should just focus on the writepage locking that could race with the
> other "create=1" get_blocks. If it doesn't race I will certainly agree
> on that approch for 2.4.
>

It appeared in 2.4.2-ac25, and it looks like sct was the author:

o Fix higmem block_prepare_write crash (Stephen Tweedie)

Which is interesting - from the changelog it looks like he was
fixing a different problem! I always though that code was there
to prevent leakage of stale blocks. Stephen?

A 2.4.18-pre1 version is below. It compiles, but I haven't actually
exercised that code path.

It's not a pretty fix, IMO. It leaves dangling blocks outside i_size
which will make fsck unhappy. It also breaks ext3 a little bit -
those blocks should be journalled and in theory we'd need to
clone off private copies of __block_prepare_write() and
unmap_underlying_metadata() to do this. Which would be irritating,
but not the end of the world. (Should have done this in the -ac
version of ext3, but I never noticed it).


--- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Dec 29 21:53:46 2001
@@ -1639,6 +1639,17 @@ static int __block_prepare_write(struct
}
return 0;
out:
+ bh = head;
+ block_start = 0;
+ do {
+ if (buffer_new(bh) && !buffer_uptodate(bh)) {
+ memset(kaddr+block_start, 0, bh->b_size);
+ set_bit(BH_Uptodate, &bh->b_state);
+ mark_buffer_dirty(bh);
+ }
+ block_start += bh->b_size;
+ bh = bh->b_this_page;
+ } while (bh != head);
return err;
}

Question: can someone please define BH_New? Its lifecycle seems
very vague. We never actually seem to *clear* it anywhere for
ext2, and it appears that the kernel will keep on treating a
clearly non-new buffer as "new" all the time. ext3 explicitly
clears BH_New in get_block(), if it finds the block was already
present in the file. I did this because we need the newness
info for internal purposes.

-

2001-12-30 06:33:41

by Alexander Viro

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes



On Sat, 29 Dec 2001, Andrew Morton wrote:

> Would it be necessary to preallocate the holes at mmap() time? Mad
> hand-waving: Could we not perform the instantiation at pagefault time,
> and give the caller SIGBUS if we cannot allocate the blocks? Or if
> there's an IO error, or quota exceeded.

Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
There are applications that use mmap() on large and very sparse
files.

> Question: can someone please define BH_New? Its lifecycle seems
> very vague. We never actually seem to *clear* it anywhere for
> ext2, and it appears that the kernel will keep on treating a
> clearly non-new buffer as "new" all the time. ext3 explicitly
> clears BH_New in get_block(), if it finds the block was already
> present in the file. I did this because we need the newness
> info for internal purposes.

It should be reset when we submit IO. Breakage related to failing
allocation is indeed not new, but that's a long story. And no,
"allocate on mmap()" is not a fix.

2001-12-30 06:42:03

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Alexander Viro wrote:
>
> On Sat, 29 Dec 2001, Andrew Morton wrote:
>
> > Would it be necessary to preallocate the holes at mmap() time? Mad
> > hand-waving: Could we not perform the instantiation at pagefault time,
> > and give the caller SIGBUS if we cannot allocate the blocks? Or if
> > there's an IO error, or quota exceeded.
>
> Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
> There are applications that use mmap() on large and very sparse
> files.

I think Andrea was referring to simply reserving the necessary
amount of disk space, rather than actually instantiating the
blocks. But even that would be a big problem for the applications
which you describe.

> > Question: can someone please define BH_New? Its lifecycle seems
> > very vague. We never actually seem to *clear* it anywhere for
> > ext2, and it appears that the kernel will keep on treating a
> > clearly non-new buffer as "new" all the time. ext3 explicitly
> > clears BH_New in get_block(), if it finds the block was already
> > present in the file. I did this because we need the newness
> > info for internal purposes.
>
> It should be reset when we submit IO.

well... It isn't. And I'd like a chance to review/test any
proposed changes in this area which are outside specific filesystems...

> Breakage related to failing allocation is indeed not new, but
> that's a long story. And no, "allocate on mmap()" is not a fix.

Yup. But what *is* the fix? (filemap_nopage?)

-

2001-12-30 07:18:33

by Alexander Viro

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes



On Sat, 29 Dec 2001, Andrew Morton wrote:

> > Breakage related to failing allocation is indeed not new, but
> > that's a long story. And no, "allocate on mmap()" is not a fix.
>
> Yup. But what *is* the fix? (filemap_nopage?)

For ->writepage() - nothing. It _is_ asynchronous and it _can_ fail.
Due to failing allocations, IO errors, whatever.

Now, the fs consistency stuff is a different story. Fixes had been
in -ac since before 2.4.0 and I distinctly remember at least one of
3 area getting synced with -linus. My fault - I assumed that the
whole patch went there at that point. I'll try to dig the rest out.

2.4.9-ac* is probably a good starting point - they are in generic_file_write()
and in __block_write_full_page()

2001-12-30 07:12:13

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

As you pointed out offline, the -ac kernels fixed the problem using
*both* approaches. Here's a 2.4.18-pre1 version. generic_file_write()
is getting rather icky.

Please let me know if this looks like the way to proceed, and I'll
find a way to actually test the thing.


--- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Dec 29 22:46:21 2001
@@ -1639,6 +1639,17 @@ static int __block_prepare_write(struct
}
return 0;
out:
+ bh = head;
+ block_start = 0;
+ do {
+ if (buffer_new(bh) && !buffer_uptodate(bh)) {
+ memset(kaddr+block_start, 0, bh->b_size);
+ set_bit(BH_Uptodate, &bh->b_state);
+ mark_buffer_dirty(bh);
+ }
+ block_start += bh->b_size;
+ bh = bh->b_this_page;
+ } while (bh != head);
return err;
}

--- linux-2.4.18-pre1/mm/filemap.c Wed Dec 26 11:47:41 2001
+++ linux-akpm/mm/filemap.c Sat Dec 29 23:04:29 2001
@@ -2856,6 +2856,7 @@ generic_file_write(struct file *file,con
ssize_t written;
long status = 0;
int err;
+ int prepare_write_failed = 0;
unsigned bytes;

if ((ssize_t) count < 0)
@@ -3003,8 +3004,10 @@ generic_file_write(struct file *file,con

kaddr = kmap(page);
status = mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
- if (status)
+ if (status) {
+ prepare_write_failed = 1;
goto unlock;
+ }
page_fault = __copy_from_user(kaddr+offset, buf, bytes);
flush_dcache_page(page);
status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
@@ -3026,8 +3029,18 @@ unlock:
UnlockPage(page);
page_cache_release(page);

- if (status < 0)
+ if (status < 0) {
+ if (prepare_write_failed) {
+ /*
+ * If blocksize < pagesize, prepare_write() may have
+ * instantiated a few blocks outside i_size. Trim
+ * these off again.
+ */
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ }
break;
+ }
} while (count);
*ppos = pos;

2001-12-30 07:29:27

by Alexander Viro

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes



On Sat, 29 Dec 2001, Andrew Morton wrote:

> As you pointed out offline, the -ac kernels fixed the problem using
> *both* approaches. Here's a 2.4.18-pre1 version. generic_file_write()
> is getting rather icky.
>
> Please let me know if this looks like the way to proceed, and I'll
> find a way to actually test the thing.

Erm... I don't think so. Come on - vmtruncate() is obvious candidate
for out-of-line. Please, look how it was done in 2.4.9-ac*. BTW,
fs/buffer.c part of patch looks mangled.

2001-12-30 08:03:12

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrdkern el panic woes

Alexander Viro wrote:
>
> Erm... I don't think so. Come on - vmtruncate() is obvious candidate
> for out-of-line. Please, look how it was done in 2.4.9-ac*.

Yeah, I thought that was just too damn ugly to make it out of my
ethernet port. Oh well, let's try it anyway.

> BTW, fs/buffer.c part of patch looks mangled.

Looks OK to me.

Here's the latest. It includes the block_write_full_page()
chunk (minus the access-bh-after-submit_bh bug which was in
the original).

So we have three chunks:

generic_file_write(): truncate the blocks when prepare_write()
fails. This is for ENOSPC on write-to-end-of-file.

block_write_full_page(): submit the blocks which we managed
to map. To prevent exposure of old data.

__block_prepare_write(): zero out and dirty the blocks which
we managed to instantiate. This is to prevent exposure of
stale data on ENOSPC during write() to mid-file.


Question: is the block_write_full_page() change correct? We
mark the page not-up-to-date, but end_buffer_io_async() is
going to mark it uptodate anyway.


And what should we do with BH_New?


--- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Dec 29 23:39:11 2001
@@ -1512,6 +1512,7 @@ static int __block_write_full_page(struc
int err, i;
unsigned long block;
struct buffer_head *bh, *head;
+ int need_unlock;

if (!PageLocked(page))
BUG();
@@ -1567,8 +1568,34 @@ static int __block_write_full_page(struc
return 0;

out:
+ /*
+ * ENOSPC, or some other error. We may already have added some
+ * blocks to the file, so we need to write these out to avoid
+ * exposing stale data.
+ */
ClearPageUptodate(page);
- UnlockPage(page);
+ bh = head;
+ need_unlock = 1;
+ /* Recovery: lock and submit the mapped buffers */
+ do {
+ if (buffer_mapped(bh)) {
+ lock_buffer(bh);
+ need_unlock = 0;
+ }
+ bh = bh->b_this_page;
+ } while (bh != head);
+ do {
+ struct buffer_head *next = bh->b_this_page;
+ if (buffer_mapped(bh)) {
+ set_buffer_async_io(bh);
+ set_bit(BH_Uptodate, &bh->b_state);
+ clear_bit(BH_Dirty, &bh->b_state);
+ submit_bh(WRITE, bh);
+ }
+ bh = next;
+ } while (bh != head);
+ if (need_unlock)
+ UnlockPage(page);
return err;
}

@@ -1639,6 +1666,17 @@ static int __block_prepare_write(struct
}
return 0;
out:
+ bh = head;
+ block_start = 0;
+ do {
+ if (buffer_new(bh) && !buffer_uptodate(bh)) {
+ memset(kaddr+block_start, 0, bh->b_size);
+ set_bit(BH_Uptodate, &bh->b_state);
+ mark_buffer_dirty(bh);
+ }
+ block_start += bh->b_size;
+ bh = bh->b_this_page;
+ } while (bh != head);
return err;
}

--- linux-2.4.18-pre1/mm/filemap.c Wed Dec 26 11:47:41 2001
+++ linux-akpm/mm/filemap.c Sat Dec 29 23:44:03 2001
@@ -3004,7 +3004,7 @@ generic_file_write(struct file *file,con
kaddr = kmap(page);
status = mapping->a_ops->prepare_write(file, page, offset, offset+bytes);
if (status)
- goto unlock;
+ goto sync_failure;
page_fault = __copy_from_user(kaddr+offset, buf, bytes);
flush_dcache_page(page);
status = mapping->a_ops->commit_write(file, page, offset, offset+bytes);
@@ -3029,6 +3029,7 @@ unlock:
if (status < 0)
break;
} while (count);
+done:
*ppos = pos;

if (cached_page)
@@ -3050,6 +3051,18 @@ out:
fail_write:
status = -EFAULT;
goto unlock;
+
+sync_failure:
+ /*
+ * If blocksize < pagesize, prepare_write() may have instantiated a
+ * few blocks outside i_size. Trim these off again.
+ */
+ kunmap(page);
+ UnlockPage(page);
+ page_cache_release(page);
+ if (pos + bytes > inode->i_size)
+ vmtruncate(inode, inode->i_size);
+ goto done;

o_direct:
written = generic_file_direct_IO(WRITE, file, (char *) buf, count, pos);

2001-12-30 10:06:35

by Alan

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd

> Now, the fs consistency stuff is a different story. Fixes had been
> in -ac since before 2.4.0 and I distinctly remember at least one of
> 3 area getting synced with -linus. My fault - I assumed that the
> whole patch went there at that point. I'll try to dig the rest out.

I didnt feed Marcelo the final truncate stuff nor the file write fixes, it
was getting too close to 2.4.17 and they are not completely risk free
sort of changes.

2001-12-30 17:40:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrdkern el panic woes


On Sat, 29 Dec 2001, Andrew Morton wrote:
>
> And what should we do with BH_New?

The only point of BH_New was to not need this horror in three different
places, and have the BH_New bit as a way of saying "this buffer has no
contents yet", and fill it with zeroes in just _one_ place (ie the
readpage path).

However, I don't think it was ever implemented, so if you prefer the
straightforward (brute-force but ugly) approach, just get rid of it.

Linus

2001-12-30 23:57:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Sat, Dec 29, 2001 at 10:19:52PM -0800, Andrew Morton wrote:
> [ this thread started at http://www.lib.uaa.alaska.edu/linux-kernel/archive/2001-Week-51/1001.html ]
>
>
> Andrea Arcangeli wrote:
> >
> > On Thu, Dec 20, 2001 at 09:56:40PM -0800, Andrew Morton wrote:
> > > Andrea Arcangeli wrote:
> > > >
> > > > ...
> > > > actually while testing it I unfortunately found also an fs corruption
> > > > bug in the ->prepare_write/commit_write/writepage/direct_IO callbacks.
> > > > It's a longstanding one, since get_block born. In short, if get_block
> > > > fails while mapping on a certain page during
> > > > prepare_write/writepage/direct_IO (like it can happen with a full
> > > > filesystem, incidentally ext2 on /dev/ram0 during my testing that is
> > > > only 4M and so it overflows fast), the blocks before the ENOSPC failure
> > > > will be allocated, but the i_size won't be update accordingly (no commit
> > > > write, because prepare_write failed in the middle). for the
> > > > generic_file_write users it's easily fixable with an hack (truncating
> > > > backwards because we don't know how far we got allocating blocks when we
> > > > return from prepare_write), similar hack for the direct_IO one (that
> > > > commits only once at the end in function of the direct_IO generated).
> > > > But for writepage is quite a pain, infact I believe the writepage blocks
> > > > should be reserved in-core, to guarantee we will never fail a truncate
> > > > with ENOSPC. With the shared mappings we're effectively doing allocate
> > > > on flush... but with the lack of space reservation that makes it
> > > > unreliable :)
> > >
> > > The -ac kernels handled this by zeroing out the accidentally-allocated
> > > blocks at __block_prepare_write.
> >
> > actually my fix seems cleaner because it puts the "clearing" in one single
> > place.
>
> I think so too. Even for ext3, which has a very complex truncate,
> it appears to be OK.
>
> > > > So for now I did an hack to cure the other two (writepage can still
> > > > corrupt the fs as said). I think the right fix (ala 2.5) is to change
> > > > the API so we can use the last blocks too, but the below will cure 2.4
> > > > and for writepage the right fix IMHO is to do the reservation of the
> > > > space.
> > >
> > > This is better in a way because it reclaims the eztra few blocks. But
> > > the -ac approach also works for writepage().
> >
> > yes, it can solve the metadata corruption (assuming the locking is
>
> Where can metadata corruption occur? A few extra blocks outside
> i_size for ext2 directories isn't going to cause corruption, is it?
>
> Or are you referring to i_blocks accounting being incorrect?

yes, I'm referring to the further blocks wrongly left allocated into the
inode. that's plain metadata corruption (of course, it's not completly
random corruption at least) but just to mention one of the side effects
it can exploited in order to sniff contents of old data on disk (ok,
with ext2 that can happen anytime after an unclean unmount anyways, but
you know ext3 ordered or data journaling would prevent that).

> > right, I can as well call ->truncate within writepage but it's not
> > obvious at all that it won't race because we don't hold the i_sem within
> > writepage), but the data corruption still holds.
>
> For sure - holding i_sem on truncate is abolutely required.

yes, this is also why it's not obvious the "undo" get_block on -ENOSPC
failure within writepage (what I understood from you to be in -ac).

> > I mean, there's no
> > failure path to notify userspace that a certain page fault is writing
> > into a page over an hole, that we don't have space to later allocate on
> > disk. so to me it sounds like MAP_SHARED should preallocate the space of
> > the holes so you will know that the writes into the MAP_SHARED segments
> > won't be lost (current state of things will lead to silent corruption
> > and pinned dirty pages in ram, aka broken allocate on flush like
> > previously said).
>
> Um. How does this differ from an I/O error on flush?

It differs because I/O error is an hardware error and we have no hope if
the platter is corrupted. I mean, if the DB gets corrupted after a
reboot because last night the harddisk created some badblocks is not our
business. We cannot fix it, currently we cannot notify about it, but
after all it wasn't our mistake :). we'll just left some dirty pages
around if those write fails because the hardware couldn't relocate the
badblock transparently during the writes.

If instead the DB gets corrupted because somebody was writing to disk
and the disk filled up, so the writepage allocate on flush couldn't find
any space where to write the dirty data, that sounds like our mistake.
There's no way the app can know about this unless we sigsomething or we
return -ENOSPC on mmap. Checking SuSv2 there's no -ENOSPC contemplated
as retval from mmap, but I remeber there are always the zillon of
exceptions and "always available" error codes, so it's not obvious it's
not contemplated.

I mean, if the mmap API prevents us to return -ENOSPC there's no way we
can build a reliable MAP_SHARED with filesystems that allows holes on
files. You will never know if this MAP_SHARED updates hit the disk,
unless you previously made sure to generate a file without holes (to get
the -ENOSPC failure path with the write(2) syscall).

> Would it be necessary to preallocate the holes at mmap() time? Mad

Not preallocate, but just reserve space, just like we should do with
write(2), the reason write(2) is synchronous with the lowlevel block
allocation is just because we don't have an API with the lowlevel fs to
reserve space. As soon as we'll add this API it will be possible to turn
all the fs out there into allocate on flush for writes (not only for
MAP_SHARED). so if we write a few bytes and delete the file, we won't
hit the lowlevel fs ialloc path for example (but only the superblock
space reservation callback that can also be threaded or maybe lockless
with atomic ops).

> hand-waving: Could we not perform the instantiation at pagefault time,
> and give the caller SIGBUS if we cannot allocate the blocks? Or if
> there's an IO error, or quota exceeded.

sigbus is viable solution indeed, it's a matter of userspace API
specifications actually, not something we can choose liberally.

The fact is, currently we sigbus on MAP_SHARED if the updates are beyond
the end of the i_size, so maybe using sigbus to notify about the other
cases too would be confusing? (app thinks it's writing beyond the end of
i_size, unless it explicitly checks for that)

>
> > > Why was that code not brought across?
> >
> > Who developed that code? Can the author of the code forward port it to
> > 2.4.18pre and post a patch to the list so we can review? thanks,
> > Avoiding the matadata corruption would be a good start at least for 2.4,
> > then we should just focus on the writepage locking that could race with the
> > other "create=1" get_blocks. If it doesn't race I will certainly agree
> > on that approch for 2.4.
> >
>
> It appeared in 2.4.2-ac25, and it looks like sct was the author:
>
> o Fix higmem block_prepare_write crash (Stephen Tweedie)
>
> Which is interesting - from the changelog it looks like he was
> fixing a different problem! I always though that code was there
> to prevent leakage of stale blocks. Stephen?

Makes sense to me, I also didn't recall any specific fix for the i_block
corruption before my report of last week, maybe it fixed both things at
once?

> A 2.4.18-pre1 version is below. It compiles, but I haven't actually
> exercised that code path.
>
> It's not a pretty fix, IMO. It leaves dangling blocks outside i_size
> which will make fsck unhappy. It also breaks ext3 a little bit -
> those blocks should be journalled and in theory we'd need to
> clone off private copies of __block_prepare_write() and
> unmap_underlying_metadata() to do this. Which would be irritating,
> but not the end of the world. (Should have done this in the -ac
> version of ext3, but I never noticed it).
>
>
> --- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
> +++ linux-akpm/fs/buffer.c Sat Dec 29 21:53:46 2001
> @@ -1639,6 +1639,17 @@ static int __block_prepare_write(struct
> }
> return 0;
> out:
> + bh = head;
> + block_start = 0;
> + do {
> + if (buffer_new(bh) && !buffer_uptodate(bh)) {
> + memset(kaddr+block_start, 0, bh->b_size);
> + set_bit(BH_Uptodate, &bh->b_state);
> + mark_buffer_dirty(bh);
> + }
> + block_start += bh->b_size;
> + bh = bh->b_this_page;
> + } while (bh != head);
> return err;
> }

yes, I doesn't deallocate the blocks that was the whole point of the
corruption I mentioned, so it looks it's not covering our case.

>
> Question: can someone please define BH_New? Its lifecycle seems

BH_New means the bh is just born while being mapped. It matters only if
you called get_block with create=1 in order to map an unmapped bh. If
the bh was mapped it doesn't matter, because if it was mapped you didn't
need to call get_block in first place (so we left it set and that's
fine). Calling get_block on mapped bh is a bug. a bh_new is always
bh_mapped.

> very vague. We never actually seem to *clear* it anywhere for
> ext2, and it appears that the kernel will keep on treating a
> clearly non-new buffer as "new" all the time. ext3 explicitly
> clears BH_New in get_block(), if it finds the block was already
> present in the file. I did this because we need the newness
> info for internal purposes.
>
> -

anyways about the sigbus/mmap returning -ENOSPC things, they're low
prio, I think first we should avoid metadata corruption, that's our
internal stuff and we have no constraints about userspace API in order
to fix it. Convering writepage with the "get_block undo" as said
doesn't look obvious due the lack of i_sem and the other writers
depending on stuff not going away under them unless i_sem was acquired.
(OTOH those floating unwriteable dirty pages could run the machine oom)

Andrea

2001-12-31 00:05:51

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Sun, Dec 30, 2001 at 01:33:24AM -0500, Alexander Viro wrote:
>
>
> On Sat, 29 Dec 2001, Andrew Morton wrote:
>
> > Would it be necessary to preallocate the holes at mmap() time? Mad
> > hand-waving: Could we not perform the instantiation at pagefault time,
> > and give the caller SIGBUS if we cannot allocate the blocks? Or if
> > there's an IO error, or quota exceeded.
>
> Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
> There are applications that use mmap() on large and very sparse
> files.

it's exactly this kind of apps that will be screwed up by silent data
corruption. the point of the holes is to optimize performance and save
space, but they shouldn't introduce the possibilty of data corruption.

Note: I'm fine to introduce another way to notify the app about -ENOSPC,
-ENOSPC on mmap is the most obvious one, but we could still allow the
current "overcommit" behaviour with a kind of sigbus mentioned by
Andrew (possibly not sigbus though, since it has just well defined
semantics for MAP_SHARED, maybe they could be extended, anyways as said
this is only a matter of API). My point is only that some API should be
added because your mmap on sparse files are unreliable at the moment.

> > Question: can someone please define BH_New? Its lifecycle seems
> > very vague. We never actually seem to *clear* it anywhere for
> > ext2, and it appears that the kernel will keep on treating a
> > clearly non-new buffer as "new" all the time. ext3 explicitly
> > clears BH_New in get_block(), if it finds the block was already
> > present in the file. I did this because we need the newness
> > info for internal purposes.
>
> It should be reset when we submit IO. Breakage related to failing
> allocation is indeed not new, but that's a long story. And no,
> "allocate on mmap()" is not a fix.

it is a fix (assuming people will understand/expect this retval :), but
yes, the current overcommit behaviour has some value so I agree some
other async API ala sigbus may be more desiderable.

Andrea

2001-12-31 00:08:31

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Sat, Dec 29, 2001 at 10:38:14PM -0800, Andrew Morton wrote:
> Alexander Viro wrote:
> >
> > On Sat, 29 Dec 2001, Andrew Morton wrote:
> >
> > > Would it be necessary to preallocate the holes at mmap() time? Mad
> > > hand-waving: Could we not perform the instantiation at pagefault time,
> > > and give the caller SIGBUS if we cannot allocate the blocks? Or if
> > > there's an IO error, or quota exceeded.
> >
> > Allocation at mmap() Is Not Going To Happen. Consider it vetoed.
> > There are applications that use mmap() on large and very sparse
> > files.
>
> I think Andrea was referring to simply reserving the necessary
> amount of disk space, rather than actually instantiating the
> blocks. But even that would be a big problem for the applications

correct.

> which you describe.

Nod.

>
> > > Question: can someone please define BH_New? Its lifecycle seems
> > > very vague. We never actually seem to *clear* it anywhere for
> > > ext2, and it appears that the kernel will keep on treating a
> > > clearly non-new buffer as "new" all the time. ext3 explicitly
> > > clears BH_New in get_block(), if it finds the block was already
> > > present in the file. I did this because we need the newness
> > > info for internal purposes.
> >
> > It should be reset when we submit IO.
>
> well... It isn't. And I'd like a chance to review/test any

see the other email, it's all right as far I can tell, it doesn't need
to be resetted ever. only mapped bh are bh_new so you will never care
about bh_new any longer because you will never need any further
get_block on the same bh.


> proposed changes in this area which are outside specific filesystems...
>
> > Breakage related to failing allocation is indeed not new, but
> > that's a long story. And no, "allocate on mmap()" is not a fix.
>
> Yup. But what *is* the fix? (filemap_nopage?)
>
> -


Andrea

2001-12-31 00:28:53

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrdkern el panic woes

On Sun, Dec 30, 2001 at 09:40:11AM -0800, Linus Torvalds wrote:
>
> On Sat, 29 Dec 2001, Andrew Morton wrote:
> >
> > And what should we do with BH_New?
>
> The only point of BH_New was to not need this horror in three different
> places, and have the BH_New bit as a way of saying "this buffer has no
> contents yet", and fill it with zeroes in just _one_ place (ie the
> readpage path).

actually bh_new is needed also to serialize with the buffercache, a new
bh mapped in pagecache must be dropped from the buffercache before we
can start using it (unmap_underlying_metadata).

so at least for 2.4 I wouldn't drop it :)

>
> However, I don't think it was ever implemented, so if you prefer the
> straightforward (brute-force but ugly) approach, just get rid of it.
>
> Linus


Andrea

2001-12-31 00:36:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrdkern el panic woes


On Mon, 31 Dec 2001, Andrea Arcangeli wrote:
>
> actually bh_new is needed also to serialize with the buffercache, a new
> bh mapped in pagecache must be dropped from the buffercache before we
> can start using it (unmap_underlying_metadata).

You're right, although it's something of an optimization (ie we could as
well just depend on the "mapped" bit and watch it change).

Linus

2001-12-31 01:00:24

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrdkern el panic woes

On Sun, Dec 30, 2001 at 04:35:46PM -0800, Linus Torvalds wrote:
>
> On Mon, 31 Dec 2001, Andrea Arcangeli wrote:
> >
> > actually bh_new is needed also to serialize with the buffercache, a new
> > bh mapped in pagecache must be dropped from the buffercache before we
> > can start using it (unmap_underlying_metadata).
>
> You're right, although it's something of an optimization (ie we could as
> well just depend on the "mapped" bit and watch it change).

we could even hold the optimization (cache coherency only on new blocks)
by pushing the cache coherency into the lowlevel just like the bh
clearing, but the current buffer_new branch in the library code seems
clean (and potentially a little faster with big softblocksize due the
partial clearing).

Andrea

2001-12-31 10:03:48

by Daniel Phillips

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On December 30, 2001 07:19 am, Andrew Morton wrote:
> Question: can someone please define BH_New? Its lifecycle seems
> very vague.

That's because it is. (vague) In its current incarnation, BH_New is
valid right after calling xxx_get_block, otherwise it's like a floating
signal. This is just sloppy.

> We never actually seem to *clear* it anywhere for
> ext2, and it appears that the kernel will keep on treating a
> clearly non-new buffer as "new" all the time. ext3 explicitly
> clears BH_New in get_block(), if it finds the block was already
> present in the file. I did this because we need the newness
> info for internal purposes.

As I understand it, BH_New is set because xxx_get_block created a
block and didn't initialize it - the initialization is expected to
be done by someone else. So somebody better pick it up before the
block starts transitioning to other states. The correct model
would use scalars for block states, not bits, and we would
enumerate all the correct transitions. Since that isn't going to
happen any time soon, we could clear BH_New every time we set some
other state bit. Or maybe BUG if we ever change another state bit
and BH_NEW is still set, indicating somebody forgot to initialize
the buffer.

Hmm, I'm just taking a look at ext3/inode.c, line 863 - you've just
called getblk, and you will act on BH_New. Does that code ever get
executed?

--
Daniel

2002-01-04 16:39:23

by Stephen C. Tweedie

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Hi,

[catching up from holidays]

On Sat, Dec 29, 2001 at 10:19:52PM -0800, Andrew Morton wrote:
>
> It appeared in 2.4.2-ac25, and it looks like sct was the author:
>
> o Fix higmem block_prepare_write crash (Stephen Tweedie)
>
> Which is interesting - from the changelog it looks like he was
> fixing a different problem! I always though that code was there
> to prevent leakage of stale blocks. Stephen?

The code I fixed was the

- memset(bh->b_data, 0, bh->b_size);
+ memset(kaddr+block_start, 0, bh->b_size);

chunk to prevent the block_prepare_write out: error path from oopsing
on highmem pages -- that's unrelated to the problem at hand here.

Cheers,
Stephen

2002-01-05 07:59:18

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Andrea Arcangeli wrote:
>
> I diffed a more advanced version of it. It's not very well tested.
>
> [ your rd.c patch ]
>

Your patch is working OK for me. I made two changes:

- s/PAGE_SIZE/PAGE_CACHE_SIZE/ in ramdisk_updatepage()

- I think there's an SMP race in rd_blkdev_pagecache_IO() - it looks up
the underlying page in the pagecache and if it is present, it simply
proceeds, assuming that the page is uptodate. But another CPU could have
just added the page and may be in the middle of initialising it.
So I changed rd_blkdev_pagecache_IO() to always lock the page. It
got simpler.

BTW, here's some fun: set up /dev/ram0 and /dev/ram1, both 128 megabytes. Fill
each of them with a big file and then try to umount /dev/ram1. The machine
locks up for sixty seconds running the quadratic search in write_some_buffers().
Sigh. In the lowlatency patch I simply brute-forced this by always setting
dev = NODEV in sync_buffers().

Please review - I'm trying to use the rd driver to test the truncate+ENOSPC
patch and there's just rampant filesystem corruption all over the place
without this patch. It's unusable.


--- linux-2.4.18-pre1/drivers/block/rd.c Fri Dec 21 11:19:13 2001
+++ linux-akpm/drivers/block/rd.c Fri Jan 4 23:50:09 2002
@@ -191,26 +191,44 @@ __setup("ramdisk_blocksize=", ramdisk_bl
* 2000 Transmeta Corp.
* aops copied from ramfs.
*/
-static int ramdisk_readpage(struct file *file, struct page * page)
+static void ramdisk_updatepage(struct page * page, int need_kmap)
{
if (!Page_Uptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
+ struct buffer_head *bh = page->buffers;
+ void * address;
+
+ if (need_kmap)
+ kmap(page);
+ address = page_address(page);
+ if (bh) {
+ struct buffer_head *tmp = bh;
+ do {
+ if (!buffer_uptodate(tmp)) {
+ memset(address, 0, tmp->b_size);
+ mark_buffer_uptodate(tmp, 1);
+ }
+ address += tmp->b_size;
+ tmp = tmp->b_this_page;
+ } while (tmp != bh);
+ } else
+ memset(address, 0, PAGE_CACHE_SIZE);
+ if (need_kmap)
+ kunmap(page);
flush_dcache_page(page);
SetPageUptodate(page);
}
+}
+
+static int ramdisk_readpage(struct file *file, struct page * page)
+{
+ ramdisk_updatepage(page, 1);
UnlockPage(page);
return 0;
}

static int ramdisk_prepare_write(struct file *file, struct page *page, unsigned offset, unsigned to)
{
- if (!Page_Uptodate(page)) {
- void *addr = page_address(page);
- memset(addr, 0, PAGE_CACHE_SIZE);
- flush_dcache_page(page);
- SetPageUptodate(page);
- }
+ ramdisk_updatepage(page, 0);
SetPageDirty(page);
return 0;
}
@@ -233,44 +251,40 @@ static int rd_blkdev_pagecache_IO(int rw
unsigned long index;
int offset, size, err;

- err = -EIO;
err = 0;
mapping = rd_bdev[minor]->bd_inode->i_mapping;

+ /* writing a buffer cache not uptodate must not clear it */
+ if (sbh->b_page->mapping == mapping) {
+ if (rw == WRITE) {
+ mark_buffer_uptodate(sbh, 1);
+ SetPageDirty(sbh->b_page);
+ }
+ goto out;
+ }
+
index = sbh->b_rsector >> (PAGE_CACHE_SHIFT - 9);
offset = (sbh->b_rsector << 9) & ~PAGE_CACHE_MASK;
size = sbh->b_size;

do {
int count;
- struct page ** hash;
struct page * page;
char * src, * dst;
- int unlock = 0;

count = PAGE_CACHE_SIZE - offset;
if (count > size)
count = size;
size -= count;

- hash = page_hash(mapping, index);
- page = __find_get_page(mapping, index, hash);
+ page = grab_cache_page(mapping, index);
if (!page) {
- page = grab_cache_page(mapping, index);
err = -ENOMEM;
- if (!page)
- goto out;
- err = 0;
-
- if (!Page_Uptodate(page)) {
- memset(kmap(page), 0, PAGE_CACHE_SIZE);
- kunmap(page);
- SetPageUptodate(page);
- }
-
- unlock = 1;
+ goto out;
}

+ ramdisk_updatepage(page, 1);
+
index++;

if (rw == READ) {
@@ -294,8 +308,7 @@ static int rd_blkdev_pagecache_IO(int rw
} else {
SetPageDirty(page);
}
- if (unlock)
- UnlockPage(page);
+ UnlockPage(page);
__free_page(page);
} while (size);

2002-01-05 11:48:54

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Andrea Arcangeli wrote:
>
> Note: I'm fine to introduce another way to notify the app about -ENOSPC,
> -ENOSPC on mmap is the most obvious one, but we could still allow the
> current "overcommit" behaviour with a kind of sigbus mentioned by
> Andrew (possibly not sigbus though, since it has just well defined
> semantics for MAP_SHARED, maybe they could be extended, anyways as said
> this is only a matter of API). My point is only that some API should be
> added because your mmap on sparse files are unreliable at the moment.
>

The very least we can do is to return a sensible error code from msync().
At present, if you create a 200 meg mapping on a 100 meg disk, dirty it
all and run msync(MS_SYNC), the damn thing returns zero and you don't
know that you lost half your data.

The patch makes filemap_fdatasync() and filemap_fdatawait() return
an error code, and propagates that up through all callers, including
fsync() and fdatasync(). Please review, especially the nfs and
generic_file_direct_IO() changes.

There's also a half-assed attempt to implement MS_ASYNC in here.
If anyone thinks that's not worth the effort, I won't be offended.



--- linux-2.4.18-pre1/include/linux/fs.h Fri Dec 21 11:19:23 2001
+++ linux-akpm/include/linux/fs.h Sat Jan 5 03:21:07 2002
@@ -1212,8 +1212,8 @@ extern int osync_inode_data_buffers(stru
extern int fsync_inode_buffers(struct inode *);
extern int fsync_inode_data_buffers(struct inode *);
extern int inode_has_buffers(struct inode *);
-extern void filemap_fdatasync(struct address_space *);
-extern void filemap_fdatawait(struct address_space *);
+extern int filemap_fdatasync(struct address_space *);
+extern int filemap_fdatawait(struct address_space *);
extern void sync_supers(kdev_t);
extern int bmap(struct inode *, int);
extern int notify_change(struct dentry *, struct iattr *);
--- linux-2.4.18-pre1/mm/filemap.c Wed Dec 26 11:47:41 2001
+++ linux-akpm/mm/filemap.c Sat Jan 5 03:23:09 2002
@@ -582,8 +582,9 @@ EXPORT_SYMBOL(fail_writepage);
* @mapping: address space structure to write
*
*/
-void filemap_fdatasync(struct address_space * mapping)
+int filemap_fdatasync(struct address_space * mapping)
{
+ int ret = 0;
int (*writepage)(struct page *) = mapping->a_ops->writepage;

spin_lock(&pagecache_lock);
@@ -603,8 +604,11 @@ void filemap_fdatasync(struct address_sp
lock_page(page);

if (PageDirty(page)) {
+ int err;
ClearPageDirty(page);
- writepage(page);
+ err = writepage(page);
+ if (err && !ret)
+ ret = err;
} else
UnlockPage(page);

@@ -612,6 +616,7 @@ void filemap_fdatasync(struct address_sp
spin_lock(&pagecache_lock);
}
spin_unlock(&pagecache_lock);
+ return ret;
}

/**
@@ -621,8 +626,10 @@ void filemap_fdatasync(struct address_sp
* @mapping: address space structure to wait for
*
*/
-void filemap_fdatawait(struct address_space * mapping)
+int filemap_fdatawait(struct address_space * mapping)
{
+ int ret = 0;
+
spin_lock(&pagecache_lock);

while (!list_empty(&mapping->locked_pages)) {
@@ -638,11 +645,14 @@ void filemap_fdatawait(struct address_sp
spin_unlock(&pagecache_lock);

___wait_on_page(page);
+ if (PageError(page))
+ ret = -EIO;

page_cache_release(page);
spin_lock(&pagecache_lock);
}
spin_unlock(&pagecache_lock);
+ return ret;
}

/*
@@ -1519,12 +1529,14 @@ static ssize_t generic_file_direct_IO(in
goto out_free;

/*
- * Flush to disk exlusively the _data_, metadata must remains
+ * Flush to disk exclusively the _data_, metadata must remain
* completly asynchronous or performance will go to /dev/null.
*/
- filemap_fdatasync(mapping);
- retval = fsync_inode_data_buffers(inode);
- filemap_fdatawait(mapping);
+ retval = filemap_fdatasync(mapping);
+ if (retval == 0)
+ retval = fsync_inode_data_buffers(inode);
+ if (retval == 0)
+ retval = filemap_fdatawait(mapping);
if (retval < 0)
goto out_free;

@@ -2141,26 +2153,41 @@ int generic_file_mmap(struct file * file
* The msync() system call.
*/

+/*
+ * We attempt to implement MS_ASYNC, but it's lame. There needs to be a way
+ * of starting async writeout of the metadata and inode.
+ */
static int msync_interval(struct vm_area_struct * vma,
unsigned long start, unsigned long end, int flags)
{
+ int ret = 0;
struct file * file = vma->vm_file;
+
if (file && (vma->vm_flags & VM_SHARED)) {
- int error;
- error = filemap_sync(vma, start, end-start, flags);
+ /* filemap_sync() cannot fail */
+ ret = filemap_sync(vma, start, end-start, flags);

- if (!error && (flags & MS_SYNC)) {
+ if (flags & (MS_SYNC|MS_ASYNC)) {
struct inode * inode = file->f_dentry->d_inode;
+
down(&inode->i_sem);
- filemap_fdatasync(inode->i_mapping);
- if (file->f_op && file->f_op->fsync)
- error = file->f_op->fsync(file, file->f_dentry, 1);
- filemap_fdatawait(inode->i_mapping);
+ ret = filemap_fdatasync(inode->i_mapping);
+ if (flags & MS_SYNC) {
+ int err;
+
+ if (file->f_op && file->f_op->fsync) {
+ err = file->f_op->fsync(file, file->f_dentry, 1);
+ if (err && !ret)
+ ret = err;
+ }
+ err = filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;
+ }
up(&inode->i_sem);
}
- return error;
}
- return 0;
+ return ret;
}

asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
--- linux-2.4.18-pre1/fs/block_dev.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/block_dev.c Sat Jan 5 03:21:07 2002
@@ -171,11 +171,15 @@ static loff_t block_llseek(struct file *

static int __block_fsync(struct inode * inode)
{
- int ret;
+ int ret, err;

- filemap_fdatasync(inode->i_mapping);
- ret = sync_buffers(inode->i_rdev, 1);
- filemap_fdatawait(inode->i_mapping);
+ ret = filemap_fdatasync(inode->i_mapping);
+ err = sync_buffers(inode->i_rdev, 1);
+ if (err && !ret)
+ ret = err;
+ err = filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;

return ret;
}
--- linux-2.4.18-pre1/fs/buffer.c Fri Dec 21 11:19:14 2001
+++ linux-akpm/fs/buffer.c Sat Jan 5 03:21:07 2002
@@ -401,9 +401,9 @@ asmlinkage long sys_fsync(unsigned int f
struct file * file;
struct dentry * dentry;
struct inode * inode;
- int err;
+ int ret, err;

- err = -EBADF;
+ ret = -EBADF;
file = fget(fd);
if (!file)
goto out;
@@ -411,21 +411,27 @@ asmlinkage long sys_fsync(unsigned int f
dentry = file->f_dentry;
inode = dentry->d_inode;

- err = -EINVAL;
- if (!file->f_op || !file->f_op->fsync)
+ ret = -EINVAL;
+ if (!file->f_op || !file->f_op->fsync) {
+ /* Why? We can still call filemap_fdatasync */
goto out_putf;
+ }

/* We need to protect against concurrent writers.. */
down(&inode->i_sem);
- filemap_fdatasync(inode->i_mapping);
+ ret = filemap_fdatasync(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 0);
- filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;
+ err = filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;
up(&inode->i_sem);

out_putf:
fput(file);
out:
- return err;
+ return ret;
}

asmlinkage long sys_fdatasync(unsigned int fd)
@@ -433,9 +439,9 @@ asmlinkage long sys_fdatasync(unsigned i
struct file * file;
struct dentry * dentry;
struct inode * inode;
- int err;
+ int ret, err;

- err = -EBADF;
+ ret = -EBADF;
file = fget(fd);
if (!file)
goto out;
@@ -443,14 +449,18 @@ asmlinkage long sys_fdatasync(unsigned i
dentry = file->f_dentry;
inode = dentry->d_inode;

- err = -EINVAL;
+ ret = -EINVAL;
if (!file->f_op || !file->f_op->fsync)
goto out_putf;

down(&inode->i_sem);
- filemap_fdatasync(inode->i_mapping);
+ ret = filemap_fdatasync(inode->i_mapping);
err = file->f_op->fsync(file, dentry, 1);
- filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;
+ err = filemap_fdatawait(inode->i_mapping);
+ if (err && !ret)
+ ret = err;
up(&inode->i_sem);

out_putf:
--- linux-2.4.18-pre1/fs/nfs/file.c Wed Dec 26 11:47:41 2001
+++ linux-akpm/fs/nfs/file.c Sat Jan 5 03:21:07 2002
@@ -244,6 +244,7 @@ nfs_lock(struct file *filp, int cmd, str
{
struct inode * inode = filp->f_dentry->d_inode;
int status = 0;
+ int status2;

dprintk("NFS: nfs_lock(f=%4x/%ld, t=%x, fl=%x, r=%Ld:%Ld)\n",
inode->i_dev, inode->i_ino,
@@ -278,11 +279,18 @@ nfs_lock(struct file *filp, int cmd, str
* Flush all pending writes before doing anything
* with locks..
*/
- filemap_fdatasync(inode->i_mapping);
+ /*
+ * Shouldn't filemap_fdatasync/wait be inside i_sem?
+ */
+ status = filemap_fdatasync(inode->i_mapping);
down(&inode->i_sem);
- status = nfs_wb_all(inode);
+ status2 = nfs_wb_all(inode);
+ if (status2 && !status)
+ status = status2;
up(&inode->i_sem);
- filemap_fdatawait(inode->i_mapping);
+ status2 = filemap_fdatawait(inode->i_mapping);
+ if (status2 && !status)
+ status = status2;
if (status < 0)
return status;

@@ -300,11 +308,17 @@ nfs_lock(struct file *filp, int cmd, str
*/
out_ok:
if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
- filemap_fdatasync(inode->i_mapping);
+ status2 = filemap_fdatasync(inode->i_mapping);
+ if (status2 && !status)
+ status = status2;
down(&inode->i_sem);
- nfs_wb_all(inode); /* we may have slept */
+ status2 = nfs_wb_all(inode); /* we may have slept */
+ if (status2 && !status)
+ status2 = status;
up(&inode->i_sem);
- filemap_fdatawait(inode->i_mapping);
+ status2 = filemap_fdatawait(inode->i_mapping);
+ if (status2 && !status)
+ status = status2;
nfs_zap_caches(inode);
}
return status;

2002-01-05 14:05:12

by Trond Myklebust

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

>>>>> " " == Andrew Morton <[email protected]> writes:

> out_ok:
> if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type
> != F_UNLCK) {
> - filemap_fdatasync(inode->i_mapping);
> + status2 = filemap_fdatasync(inode->i_mapping);
> + if (status2 && !status)
> + status = status2;
> down(&inode->i_sem);
> - nfs_wb_all(inode); /* we may have slept */
> + status2 = nfs_wb_all(inode); /* we may have slept */
> + if (status2 && !status)
> + status2 = status;
> up(&inode->i_sem);
> - filemap_fdatawait(inode->i_mapping);
> + status2 = filemap_fdatawait(inode->i_mapping);
> + if (status2 && !status)
> + status = status2;
> nfs_zap_caches(inode);
> } return status;

Hmm. I'm not sure about this hunk...

At this point in the code, we already know that we've been granted a
lock by the server. All we are doing is to try to sync any data that
may have been committed while we were waiting on the lock, in order to
ensure that the act of locking provides a cache coherency point.

IMHO it would be wrong to signal that the lock itself has failed just
because some other process has lost data in the filemap_fdata* calls.
It's a different matter with the nfs_wb_all() call: that indicates
that the process has been signalled, so it may indeed make sense to
return that particular error.

Cheers,
Trond

2002-01-07 01:09:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Fri, Jan 04, 2002 at 11:53:59PM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > I diffed a more advanced version of it. It's not very well tested.
> >
> > [ your rd.c patch ]
> >
>
> Your patch is working OK for me. I made two changes:
>
> - s/PAGE_SIZE/PAGE_CACHE_SIZE/ in ramdisk_updatepage()

ok

>
> - I think there's an SMP race in rd_blkdev_pagecache_IO() - it looks up
> the underlying page in the pagecache and if it is present, it simply
> proceeds, assuming that the page is uptodate. But another CPU could have
> just added the page and may be in the middle of initialising it.
> So I changed rd_blkdev_pagecache_IO() to always lock the page. It
> got simpler.

certainly it makes smp simpler agreed, so also ramdisk_updatepage always
runs on locked pages.

the reason it was not locking down the page is that before the
ramdisk_aops was introduced, a read via /dev/ram? would lock down the
pagecache page, and then start the ll_rw_block on the pagecache page,
but with the page locked, so it would deadlock with an unconditional
grab_cache_page. Now with the ramdisk_aops it should be ok because the physical
address space I/O never uses the ->make_request_fn so it shouldn't
recurse on the page lock any longer.

> Please review - I'm trying to use the rd driver to test the truncate+ENOSPC

your patch seems all right to me now (of course with ramdisk_updatepage
run unconditionally). (I'd only set -ENOMEM in the fast path, so the
oom branch becomes an out of line goto, didn't checked the asm generated
though)

Andrea

2002-01-07 03:09:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Sat, Jan 05, 2002 at 03:43:36AM -0800, Andrew Morton wrote:
> static int msync_interval(struct vm_area_struct * vma,
> unsigned long start, unsigned long end, int flags)
> {
> + int ret = 0;
> struct file * file = vma->vm_file;
> +
> if (file && (vma->vm_flags & VM_SHARED)) {
> - int error;
> - error = filemap_sync(vma, start, end-start, flags);
> + /* filemap_sync() cannot fail */

:) yes, I guess it cames from some old debugging code trying to catch
pte corruption, now dead in current kernels.

> + ret = filemap_sync(vma, start, end-start, flags);
>
> - if (!error && (flags & MS_SYNC)) {
> + if (flags & (MS_SYNC|MS_ASYNC)) {

ok, it cannot fail but I prefer you either avoid the 'ret =
filemap_sync' and you make filemap_sync return void, or you also left
'(!err' over here.

> struct inode * inode = file->f_dentry->d_inode;
> +
> down(&inode->i_sem);
> - filemap_fdatasync(inode->i_mapping);
> - if (file->f_op && file->f_op->fsync)
> - error = file->f_op->fsync(file, file->f_dentry, 1);
> - filemap_fdatawait(inode->i_mapping);
> + ret = filemap_fdatasync(inode->i_mapping);
> + if (flags & MS_SYNC) {
> + int err;
> +
> + if (file->f_op && file->f_op->fsync) {
> + err = file->f_op->fsync(file, file->f_dentry, 1);
> + if (err && !ret)
> + ret = err;
> + }
> + err = filemap_fdatawait(inode->i_mapping);
> + if (err && !ret)
> + ret = err;
> + }

sounds right (not something I'd love to do in 2.4 but it's
strightforward enough so I'll take the risk).

> @@ -433,9 +439,9 @@ asmlinkage long sys_fdatasync(unsigned i
> struct file * file;
> struct dentry * dentry;
> struct inode * inode;
> - int err;
> + int ret, err;
>
> - err = -EBADF;
> + ret = -EBADF;
> file = fget(fd);
> if (!file)
> goto out;
> @@ -443,14 +449,18 @@ asmlinkage long sys_fdatasync(unsigned i
> dentry = file->f_dentry;
> inode = dentry->d_inode;
>
> - err = -EINVAL;
> + ret = -EINVAL;
> if (!file->f_op || !file->f_op->fsync)
> goto out_putf;
>
> down(&inode->i_sem);
> - filemap_fdatasync(inode->i_mapping);
> + ret = filemap_fdatasync(inode->i_mapping);
> err = file->f_op->fsync(file, dentry, 1);
> - filemap_fdatawait(inode->i_mapping);
> + if (err && !ret)
> + ret = err;
> + err = filemap_fdatawait(inode->i_mapping);
> + if (err && !ret)
> + ret = err;
> up(&inode->i_sem);
>
> out_putf:

you forgot return ret here.

> --- linux-2.4.18-pre1/fs/nfs/file.c Wed Dec 26 11:47:41 2001
> +++ linux-akpm/fs/nfs/file.c Sat Jan 5 03:21:07 2002
> @@ -244,6 +244,7 @@ nfs_lock(struct file *filp, int cmd, str
> {
> struct inode * inode = filp->f_dentry->d_inode;
> int status = 0;
> + int status2;
>
> dprintk("NFS: nfs_lock(f=%4x/%ld, t=%x, fl=%x, r=%Ld:%Ld)\n",
> inode->i_dev, inode->i_ino,
> @@ -278,11 +279,18 @@ nfs_lock(struct file *filp, int cmd, str
> * Flush all pending writes before doing anything
> * with locks..
> */
> - filemap_fdatasync(inode->i_mapping);
> + /*
> + * Shouldn't filemap_fdatasync/wait be inside i_sem?
> + */

I think it's not necessary, because the list browse it's serialized by
the pagecache_lock and writepage can run outside the i_sem.

> + status = filemap_fdatasync(inode->i_mapping);
> down(&inode->i_sem);
> - status = nfs_wb_all(inode);
> + status2 = nfs_wb_all(inode);
> + if (status2 && !status)
> + status = status2;
> up(&inode->i_sem);
> - filemap_fdatawait(inode->i_mapping);
> + status2 = filemap_fdatawait(inode->i_mapping);
> + if (status2 && !status)
> + status = status2;
> if (status < 0)
> return status;
>
> @@ -300,11 +308,17 @@ nfs_lock(struct file *filp, int cmd, str
> */
> out_ok:
> if ((IS_SETLK(cmd) || IS_SETLKW(cmd)) && fl->fl_type != F_UNLCK) {
> - filemap_fdatasync(inode->i_mapping);
> + status2 = filemap_fdatasync(inode->i_mapping);
> + if (status2 && !status)
> + status = status2;
> down(&inode->i_sem);
> - nfs_wb_all(inode); /* we may have slept */
> + status2 = nfs_wb_all(inode); /* we may have slept */
> + if (status2 && !status)
> + status2 = status;
> up(&inode->i_sem);
> - filemap_fdatawait(inode->i_mapping);
> + status2 = filemap_fdatawait(inode->i_mapping);
> + if (status2 && !status)
> + status = status2;
> nfs_zap_caches(inode);
> }
> return status;

all right, thanks.

Andrea

2002-01-07 03:55:05

by Andrew Morton

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

Andrea Arcangeli wrote:
>
> > + ret = filemap_sync(vma, start, end-start, flags);
> >
> > - if (!error && (flags & MS_SYNC)) {
> > + if (flags & (MS_SYNC|MS_ASYNC)) {
>
> ok, it cannot fail but I prefer you either avoid the 'ret =
> filemap_sync' and you make filemap_sync return void, or you also left
> '(!err' over here.

OK, let's leave the test in place - filemap_sync() has global scope.

> > struct inode * inode = file->f_dentry->d_inode;
> > +
> > down(&inode->i_sem);
> > - filemap_fdatasync(inode->i_mapping);
> > - if (file->f_op && file->f_op->fsync)
> > - error = file->f_op->fsync(file, file->f_dentry, 1);
> > - filemap_fdatawait(inode->i_mapping);
> > + ret = filemap_fdatasync(inode->i_mapping);
> > + if (flags & MS_SYNC) {
> > + int err;
> > +
> > + if (file->f_op && file->f_op->fsync) {
> > + err = file->f_op->fsync(file, file->f_dentry, 1);
> > + if (err && !ret)
> > + ret = err;
> > + }
> > + err = filemap_fdatawait(inode->i_mapping);
> > + if (err && !ret)
> > + ret = err;
> > + }
>
> sounds right (not something I'd love to do in 2.4 but it's
> strightforward enough so I'll take the risk).

It works OK, but in practice makes little difference compared to MS_ASYNC.
We run out of request slots very easily with this sort of thing.

Another option would be to make MS_ASYNC behave the same as MS_SYNC,
which is probably better than just ignoring it as we do at present.
I'll leave it as shown above unless there be objections.

>
> you forgot return ret here.

Whoop, thanks. It was returning `err'.

> > --- linux-2.4.18-pre1/fs/nfs/file.c Wed Dec 26 11:47:41 2001
> > +++ linux-akpm/fs/nfs/file.c Sat Jan 5 03:21:07 2002
> > @@ -244,6 +244,7 @@ nfs_lock(struct file *filp, int cmd, str
> > {
> > struct inode * inode = filp->f_dentry->d_inode;
> > int status = 0;
> > + int status2;
> >
> > dprintk("NFS: nfs_lock(f=%4x/%ld, t=%x, fl=%x, r=%Ld:%Ld)\n",
> > inode->i_dev, inode->i_ino,
> > @@ -278,11 +279,18 @@ nfs_lock(struct file *filp, int cmd, str
> > * Flush all pending writes before doing anything
> > * with locks..
> > */
> > - filemap_fdatasync(inode->i_mapping);
> > + /*
> > + * Shouldn't filemap_fdatasync/wait be inside i_sem?
> > + */
>
> I think it's not necessary, because the list browse it's serialized by
> the pagecache_lock and writepage can run outside the i_sem.

Yup. I've changed this part of the patch based on some discussion
with Trond.


I'll wait until Marcelo looks like he has his head above water and
then send out the final version of the ramdisk, truncate and
fsync/msync patches, cc'ed to yourself and lkml.

Thanks.

2002-01-07 04:32:07

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: ramdisk corruption problems - was: RE: pivot_root and initrd kern el panic woes

On Sun, Jan 06, 2002 at 07:49:05PM -0800, Andrew Morton wrote:
> which is probably better than just ignoring it as we do at present.
> I'll leave it as shown above unless there be objections.

by design defintely better as shown than mainline. a MS_ASYNC currently
doesn't work because the VM will never care flushing dirty mapped pages,
first it will have to unmap them that will never happen unless there's
low-on mem condition, while the filemap_fdatasync will ensure those
pages will hit the disk asynchronously in a sane amount of time
(depending on kupdate). MS_ASYNC manpage says an update is scheduled,
currently it will instead never happen for example if the machine is
idle and there's no vm swap etc... your change will fix it.

> I'll wait until Marcelo looks like he has his head above water and
> then send out the final version of the ramdisk, truncate and
> fsync/msync patches, cc'ed to yourself and lkml.

I'd like to release a new -aa within tomorrow afternoon with every known
bug discussed in this thread fixed, so then I can concentrate back on
another thing, so I'd like to get those new versions ASAP :)

for the buffer_new thing I guess it's simpler to just change the
semantics of it rather than allocating ram on the stack.

Andrea