2007-10-15 08:29:07

by Christian Borntraeger

[permalink] [raw]
Subject: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit unmaintained,
so I decided to send the patch to you :-).
I have CCed Ted, who worked on the code in the 90s. I found no current
email address for Chad Page.

We have seen ramdisk-based install systems where some pages of mapped
libraries and programs were suddenly zeroed under memory pressure. This
should not happen, as the ramdisk avoids freeing its pages by keeping them
dirty all the time.

It turns out that there is a case where the VM makes a ramdisk page clean
without telling the ramdisk driver.
Under memory pressure shrink_zone runs and starts shrink_active_list.
There is a check for buffer_heads_over_limit, and if it is true, pagevec_strip
is called. pagevec_strip calls try_to_release_page. If the mapping has no
releasepage callback, try_to_free_buffers is called. try_to_free_buffers now
has special logic for some file systems to make a dirty page clean if all of
its buffers are clean. That's what happened in our test case.

The solution is to provide a noop-releasepage callback for the ramdisk driver.
This avoids try_to_free_buffers for ramdisk pages.
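
For reference, try_to_release_page() makes that decision roughly like this
(paraphrased from memory, not a verbatim copy of the source):

int try_to_release_page(struct page *page, gfp_t gfp_mask)
{
        struct address_space * const mapping = page->mapping;

        BUG_ON(!PageLocked(page));
        if (PageWriteback(page))
                return 0;

        /* a releasepage callback short-circuits the buffer stripping */
        if (mapping && mapping->a_ops->releasepage)
                return mapping->a_ops->releasepage(page, gfp_mask);
        return try_to_free_buffers(page);
}

With a noop ->releasepage installed, the try_to_free_buffers() fallback is
never reached for ramdisk pages, so the page keeps its dirty bit.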

To trigger the problem, you have to make buffer_heads_over_limit true, which
means:
- lower max_buffer_heads
or
- have a system with lots of high memory

Andrew, if there are no objections, please apply. The patch applies against
today's git.

Signed-off-by: Christian Borntraeger <[email protected]>

---
drivers/block/rd.c | 13 +++++++++++++
1 files changed, 13 insertions(+)

Index: linux-2.6/drivers/block/rd.c
===================================================================
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
}

+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffer_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the dirty
+ * bit of our ram disk pages, which will be eventually freed, even
+ * if the page is still in use.
+ */
+static int ramdisk_releasepage(struct page *page, gfp_t dummy)
+{
+ return 0;
+}
+
static const struct address_space_operations ramdisk_aops = {
.readpage = ramdisk_readpage,
.prepare_write = ramdisk_prepare_write,
@@ -196,6 +208,7 @@ static const struct address_space_operat
.writepage = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
+ .releasepage = ramdisk_releasepage,
};

static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,


2007-10-15 08:56:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> unmaintained, so decided to sent the patch to you :-).
> I have CCed Ted, who did work on the code in the 90s. I found no current
> email address of Chad Page.

This really needs to be fixed...

I can't make up my mind between the approaches to fixing it.

On one hand, I would actually prefer to really mark the buffers
dirty (as in: Eric's fix for this problem[*]) than this patch,
and this seems a bit like a bandaid...

On the other hand, the wound being covered by the bandaid is
actually the code in the buffer layer that does this latent
"cleaning" of the page because it sadly doesn't really keep
track of the pagecache state. But it *still* feels like we
should be marking the rd page's buffers dirty which should
avoid this problem anyway.
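
(For the curious, the buffer-layer code I mean is this fragment of
try_to_free_buffers() -- quoting roughly from memory, so details may differ:)

        ret = drop_buffers(page, &buffers_to_free);
        /*
         * If the filesystem writes its buffers by hand (eg ext3)
         * then we can have clean buffers against a dirty page.  We
         * clean the page here; otherwise the VM will never notice
         * that the filesystem did any IO at all.
         */
        if (ret)
                cancel_dirty_page(page, PAGE_CACHE_SIZE);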

[*] However, hmm, with Eric's patch I guess we'd still have a hole
where filesystems that write their buffers by hand think they are
"cleaning" these things and we're back to square one. That could
be fixed by marking the buffers dirty again?

Why were Eric's patches dropped, BTW? I don't remember.

2007-10-15 09:06:17

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Monday, 15 October 2007, Nick Piggin wrote:
> On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > unmaintained, so decided to sent the patch to you :-).
> > I have CCed Ted, who did work on the code in the 90s. I found no current
> > email address of Chad Page.
>
> This really needs to be fixed...

I obviously agree ;-)
We have seen this problem happen several times.

> I can't make up my mind between the approaches to fixing it.
>
> On one hand, I would actually prefer to really mark the buffers
> dirty (as in: Eric's fix for this problem[*]) than this patch,
> and this seems a bit like a bandaid...

I have never seen these patches, so I cannot comment on them.
>
> On the other hand, the wound being covered by the bandaid is
> actually the code in the buffer layer that does this latent
> "cleaning" of the page because it sadly doesn't really keep
> track of the pagecache state. But it *still* feels like we
> should be marking the rd page's buffers dirty which should
> avoid this problem anyway.

Yes, that would solve the problem as well. As long as we fix
the problem, I am happy. On the other hand, do you see any
obvious problem with this "bandaid"?

Christian

2007-10-15 09:17:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[email protected]> wrote:

> On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > unmaintained, so decided to sent the patch to you :-).
> > I have CCed Ted, who did work on the code in the 90s. I found no current
> > email address of Chad Page.
>
> This really needs to be fixed...

rd.c is fairly mind-boggling vfs abuse.

> I can't make up my mind between the approaches to fixing it.
>
> On one hand, I would actually prefer to really mark the buffers
> dirty (as in: Eric's fix for this problem[*]) than this patch,
> and this seems a bit like a bandaid...
>
> On the other hand, the wound being covered by the bandaid is
> actually the code in the buffer layer that does this latent
> "cleaning" of the page because it sadly doesn't really keep
> track of the pagecache state. But it *still* feels like we
> should be marking the rd page's buffers dirty which should
> avoid this problem anyway.
>
> [*] However, hmm, with Eric's patch I guess we'd still have a hole
> where filesystems that write their buffers by hand think they are
> "cleaning" these things and we're back to square one. That could
> be fixed by marking the buffers dirty again?
>
> Why were Eric's patches dropped, BTW? I don't remember.

runtime problems, iirc.

2007-10-15 09:28:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Monday 15 October 2007 19:05, Christian Borntraeger wrote:
> On Monday, 15 October 2007, Nick Piggin wrote:
> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > > unmaintained, so decided to sent the patch to you :-).
> > > I have CCed Ted, who did work on the code in the 90s. I found no
> > > current email address of Chad Page.
> >
> > This really needs to be fixed...
>
> I obviously agree ;-)
> We have seen this problem happen several times.
>
> > I can't make up my mind between the approaches to fixing it.
> >
> > On one hand, I would actually prefer to really mark the buffers
> > dirty (as in: Eric's fix for this problem[*]) than this patch,
> > and this seems a bit like a bandaid...
>
> I have never seen these patches, so I cannot comment on them.

> > On the other hand, the wound being covered by the bandaid is
> > actually the code in the buffer layer that does this latent
> > "cleaning" of the page because it sadly doesn't really keep
> > track of the pagecache state. But it *still* feels like we
> > should be marking the rd page's buffers dirty which should
> > avoid this problem anyway.
>
> Yes, that would solve the problem as well. As long as we fix
> the problem, I am happy. On the other hand, do you see any
> obvious problem with this "bandaid"?

I don't think so -- in fact, it could be the best candidate for
a minimal fix for stable kernels (anyone disagree? if not, maybe
you could also send this to the stable maintainers?).

But I do want to have this fixed in a "nice" way. eg. I'd like
it to mark the buffers dirty because that actually results in
more reuse of generic kernel code, and also should make rd
behave more naturally (I like using it to test filesystems
because it can expose a lot more concurrency than something like
loop on tmpfs). It should also be possible to actually have
rd's buffer heads get reclaimed as well, preferably while
exercising the common buffer paths and without writing much new
code.

All of that is secondary to fixing the data corruption problem
of course! But the fact that those alternate patches do exist now
means I want to just bring them into the discussion again before
merging one or the other.

2007-10-15 10:13:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Monday 15 October 2007 19:16, Andrew Morton wrote:
> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[email protected]> wrote:
> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > > unmaintained, so decided to sent the patch to you :-).
> > > I have CCed Ted, who did work on the code in the 90s. I found no
> > > current email address of Chad Page.
> >
> > This really needs to be fixed...
>
> rd.c is fairly mind-boggling vfs abuse.

Why do you say that? I guess it is _different_, by necessity(?)
Is there anything that is really bad? I guess it's not nice
that it operates on the pagecache from its request_fn, but the
alternative is to duplicate pages for backing store and buffer
cache (actually that might not be a bad alternative really).

2007-10-15 18:48:31

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

Nick Piggin <[email protected]> writes:

> On Monday 15 October 2007 19:05, Christian Borntraeger wrote:
>> On Monday, 15 October 2007, Nick Piggin wrote:
>> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
>> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
>> > > unmaintained, so decided to sent the patch to you :-).
>> > > I have CCed Ted, who did work on the code in the 90s. I found no
>> > > current email address of Chad Page.
>> >
>> > This really needs to be fixed...
>>
>> I obviously agree ;-)
>> We have seen this problem happen several times.
>>
>> > I can't make up my mind between the approaches to fixing it.
>> >
>> > On one hand, I would actually prefer to really mark the buffers
>> > dirty (as in: Eric's fix for this problem[*]) than this patch,
>> > and this seems a bit like a bandaid...
>>
>> I have never seen these patches, so I cannot comment on them.
>
>> > On the other hand, the wound being covered by the bandaid is
>> > actually the code in the buffer layer that does this latent
>> > "cleaning" of the page because it sadly doesn't really keep
>> > track of the pagecache state. But it *still* feels like we
>> > should be marking the rd page's buffers dirty which should
>> > avoid this problem anyway.
>>
>> Yes, that would solve the problem as well. As long as we fix
>> the problem, I am happy. On the other hand, do you see any
>> obvious problem with this "bandaid"?
>
> I don't think so -- in fact, it could be the best candidate for
> a minimal fix for stable kernels (anyone disagree? if not, maybe
> you could also send this to the stable maintainers?).

A minor one. It still leaves us with buffer heads out of sync with
struct page.

> But I do want to have this fixed in a "nice" way. eg. I'd like
> it to mark the buffers dirty because that actually results in
> more reuse of generic kernel code, and also should make rd
> behave more naturally (I like using it to test filesystems
> because it can expose a lot more concurrency than something like
> loop on tmpfs). It should also be possible to actually have
> rd's buffer heads get reclaimed as well, preferably while
> exercising the common buffer paths and without writing much new
> code.

We actually allow that currently for clean pages which is part
of what makes this tricky.

> All of that is secondary to fixing the data corruption problem
> of course! But the fact that those alternate patches do exist now
> means I want to just bring them into the discussion again before
> merging one or the other.

The core of my original fix was to modify init_page_buffers so that
when we added buffers to a dirty page the buffers became dirty.

Modifying the generic code is a bit spooky because it requires us
to audit the kernel to make certain nothing else depends on the
current behavior in odd ways. Although since init_page_buffers
is only called when we are adding buffer heads to an existing
page I still think that was the proper change.

The historical reason for my patches not getting merged the first
time is that there was some weird issue with reiserfs ramdisks, so
Andrew disabled the code and then dropped it when he discovered
he had had the patch disabled for several releases. I don't think
any causal relationship was ever established, but I didn't hear
enough about the reiserfs ramdisk issue to guess what was going on.

So it looks to me like the important invariant we need to maintain
is that when a ramdisk page is dirty it always has buffers and those
buffers are dirty as well. With a little care we can ensure this
happens with just modifications to rd.c

Eric

2007-10-15 22:38:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

[email protected] (Eric W. Biederman) writes:

> Nick Piggin <[email protected]> writes:
>
>> On Monday 15 October 2007 19:05, Christian Borntraeger wrote:
>>> On Monday, 15 October 2007, Nick Piggin wrote:
>>> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
>>> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
>>> > > unmaintained, so decided to sent the patch to you :-).
>>> > > I have CCed Ted, who did work on the code in the 90s. I found no
>>> > > current email address of Chad Page.
>>> >
>>> > This really needs to be fixed...
>>>
>>> I obviously agree ;-)
>>> We have seen this problem happen several times.
>>>
>>> > I can't make up my mind between the approaches to fixing it.
>>> >
>>> > On one hand, I would actually prefer to really mark the buffers
>>> > dirty (as in: Eric's fix for this problem[*]) than this patch,
>>> > and this seems a bit like a bandaid...
>>>
>>> I have never seen these patches, so I cannot comment on them.
>>
>>> > On the other hand, the wound being covered by the bandaid is
>>> > actually the code in the buffer layer that does this latent
>>> > "cleaning" of the page because it sadly doesn't really keep
>>> > track of the pagecache state. But it *still* feels like we
>>> > should be marking the rd page's buffers dirty which should
>>> > avoid this problem anyway.
>>>
>>> Yes, that would solve the problem as well. As long as we fix
>>> the problem, I am happy. On the other hand, do you see any
>>> obvious problem with this "bandaid"?
>>
>> I don't think so -- in fact, it could be the best candidate for
>> a minimal fix for stable kernels (anyone disagree? if not, maybe
>> you could also send this to the stable maintainers?).
>
> A minor one. It still leaves us with buffer heads out of sync with
> struct page.
>
>> But I do want to have this fixed in a "nice" way. eg. I'd like
>> it to mark the buffers dirty because that actually results in
>> more reuse of generic kernel code, and also should make rd
>> behave more naturally (I like using it to test filesystems
>> because it can expose a lot more concurrency than something like
>> loop on tmpfs). It should also be possible to actually have
>> rd's buffer heads get reclaimed as well, preferably while
>> exercising the common buffer paths and without writing much new
>> code.
>
> We actually allow that currently for clean pages which is part
> of what makes this tricky.
>
>> All of that is secondary to fixing the data corruption problem
>> of course! But the fact that those alternate patches do exist now
>> means I want to just bring them into the discussion again before
>> merging one or the other.
>
> The core of my original fix was to modify init_page_buffers so that
> when we added buffers to a dirty page the buffers became dirty.
>
> Modifying the generic code is a bit spooky because it requires us
> to audit the kernel to make certain nothing else depends on the
> current behavior in odd ways. Although since init_page_buffers
> is only called when we are adding buffer heads to an existing
> page I still think that was the proper change.
>
> The historical reason for my patches not getting merged the first
> time is there was some weird issue with reiserfs ramdisks and so
> Andrew disabled the code, and then dropped it when he had discovered
> he had the patch disabled for several releases. I don't think
> any causal relationship was ever established. But I didn't
> hear enough about the reiserfs ramdisk issue, to make a guess
> what was going on.
>
> So it looks to me like the important invariant we need to maintain
> is that when a ramdisk page is dirty it always has buffers and those
> buffers are dirty as well. With a little care we can ensure this
> happens with just modifications to rd.c

Hah. I looked over my last round of patches again and I have been able
to verify by review the parts I was a little iffy about. I also found
where my cleanups had missed a layering violation in the ramdisk code
and removed some needed code, which probably accounts for the reiserfs
ramdisk problems. Updated patches in a minute.

Eric

2007-10-15 22:41:24

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] rd: Preserve the dirty bit in init_page_buffers()


The problem: when we are trying to free buffers, try_to_free_buffers()
will look at ramdisk pages with clean buffer heads and remove the
dirty bit from the page, resulting in ramdisk pages that still hold data
being removed from the page cache. Ouch!

Buffer heads appear on ramdisk pages when a filesystem calls
__getblk() which through a series of function calls eventually calls
init_page_buffers().

So to fix the mismatch between buffer head and page state this patch
modifies init_page_buffers() to transfer the dirty bit from the page to
the buffer heads like we currently do for the uptodate bit.

This patch is safe, as only __getblk calls init_page_buffers, and
there are only two implementations of block devices cached in the
page cache: the generic implementation in block_dev.c and the
implementation in rd.c.

The generic implementation of block devices always does everything
in terms of buffer heads so it always has buffer heads allocated
before a page is marked dirty so this change does not affect it.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/buffer.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 75b51df..8b87beb 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh = head;
int uptodate = PageUptodate(page);
+ int dirty = PageDirty(page);

do {
if (!buffer_mapped(bh)) {
@@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct block_device *bdev,
bh->b_blocknr = block;
if (uptodate)
set_buffer_uptodate(bh);
+ if (dirty)
+ set_buffer_dirty(bh);
set_buffer_mapped(bh);
}
block++;
--
1.5.3.rc6.17.g1911

2007-10-15 22:42:40

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] rd: Mark ramdisk buffer heads dirty


I have not observed this case but it is possible to get a dirty page
cache with clean buffer heads if we get a clean ramdisk page with
buffer heads generated by a filesystem calling __getblk and then write
to that page from user space through the block device. Then we just
need to hit the proper window and try_to_free_buffers() will mark that
page clean and eventually drop it. Ouch!

To fix this, use the generic __set_page_dirty_buffers in the ramdisk
code so that when we mark a page dirty we also mark its buffer heads
dirty.

Signed-off-by: Eric W. Biederman <[email protected]>
---
drivers/block/rd.c | 13 +------------
1 files changed, 1 insertions(+), 12 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 701ea77..84163da 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space *mapping,
return 0;
}

-/*
- * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
- * want them to contribute to dirty memory accounting.
- */
-static int ramdisk_set_page_dirty(struct page *page)
-{
- if (!TestSetPageDirty(page))
- return 1;
- return 0;
-}
-
static const struct address_space_operations ramdisk_aops = {
.readpage = ramdisk_readpage,
.prepare_write = ramdisk_prepare_write,
.commit_write = ramdisk_commit_write,
.writepage = ramdisk_writepage,
- .set_page_dirty = ramdisk_set_page_dirty,
+ .set_page_dirty = __set_page_dirty_buffers,
.writepages = ramdisk_writepages,
};

--
1.5.3.rc6.17.g1911

2007-10-16 03:15:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

Nick Piggin <[email protected]> writes:

> On Monday 15 October 2007 19:16, Andrew Morton wrote:
>> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[email protected]>
> wrote:
>> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
>> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
>> > > unmaintained, so decided to sent the patch to you :-).
>> > > I have CCed Ted, who did work on the code in the 90s. I found no
>> > > current email address of Chad Page.
>> >
>> > This really needs to be fixed...
>>
>> rd.c is fairly mind-boggling vfs abuse.
>
> Why do you say that? I guess it is _different_, by necessity(?)
> Is there anything that is really bad?

make_page_uptodate() is the most hideous part I have run into.
It has to know details about other layers to know what not
to stomp. I think my incorrect simplification of this is what messed
things up last round.

> I guess it's not nice
> for operating on the pagecache from its request_fn, but the
> alternative is to duplicate pages for backing store and buffer
> cache (actually that might not be a bad alternative really).

Cool. Triple buffering :) Although I guess that would only
apply to metadata these days. Having a separate store would
solve some of the problems, and probably remove the need
for carefully specifying the ramdisk block size. We would
still need the magic restrictions on page allocations though,
and we would use them more often, as the initial write to the
ramdisk would not populate the pages we need.

A very ugly bit seems to be the fact that we assume we can
dereference bh->b_data without any special magic, which
means the ramdisk must live in low memory on 32-bit machines.
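
(From memory, that restriction comes from how set_bh_page() in fs/buffer.c
fills in b_data: for a highmem page it stores only the offset, not a pointer
anyone could dereference. Roughly:)

void set_bh_page(struct buffer_head *bh,
                struct page *page, unsigned long offset)
{
        bh->b_page = page;
        BUG_ON(offset >= PAGE_SIZE);
        if (PageHighMem(page))
                /* not a real pointer; catches illegal uses, keeps the offset */
                bh->b_data = (char *)(0 + offset);
        else
                bh->b_data = page_address(page) + offset;
}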

Eric

2007-10-16 04:04:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Tuesday 16 October 2007 13:14, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Monday 15 October 2007 19:16, Andrew Morton wrote:
> >> On Tue, 16 Oct 2007 00:06:19 +1000 Nick Piggin <[email protected]>
> >
> > wrote:
> >> > On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> >> > > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> >> > > unmaintained, so decided to sent the patch to you :-).
> >> > > I have CCed Ted, who did work on the code in the 90s. I found no
> >> > > current email address of Chad Page.
> >> >
> >> > This really needs to be fixed...
> >>
> >> rd.c is fairly mind-boggling vfs abuse.
> >
> > Why do you say that? I guess it is _different_, by necessity(?)
> > Is there anything that is really bad?
>
> make_page_uptodate() is most hideous part I have run into.
> It has to know details about other layers to now what not
> to stomp. I think my incorrect simplification of this is what messed
> things up, last round.

Not really, it's just named funny. That's just a minor utility
function that more or less does what it says it should do.

The main problem is really that it's implementing a block device
whose data comes from its own buffercache :P. I think.


> > I guess it's not nice
> > for operating on the pagecache from its request_fn, but the
> > alternative is to duplicate pages for backing store and buffer
> > cache (actually that might not be a bad alternative really).
>
> Cool. Triple buffering :) Although I guess that would only
> apply to metadata these days.

Double buffering. You no longer serve data out of your buffer
cache. All filesystem data was already double buffered anyway,
so we'd be just losing out on one layer of savings for metadata.
I think it's worthwhile, given that we'd have a "real" looking
block device and minus these bugs.


> Having a separate store would
> solve some of the problems, and probably remove the need
> for carefully specifying the ramdisk block size. We would
> still need the magic restictions on page allocations though
> and it we would use them more often as the initial write to the
> ramdisk would not populate the pages we need.

What magic restrictions on page allocations? Actually we have
fewer restrictions on page allocations because we can use
highmem! And the lowmem buffercache pages that we currently pin
(unsuccessfully, in the case of this bug) are now completely
reclaimable. And all your buffer heads are now reclaimable.

If you mean GFP_NOIO... I don't see any problem. Block device
drivers have to allocate memory with GFP_NOIO; this may have
been considered magic or deep badness back when the code was
written, but it's pretty simple and accepted now.
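
(Purely as an illustration -- brd_alloc_page() here is a hypothetical helper,
not code from any posted patch -- allocating a backing page from the request
path would look something like this:)

/* sketch: allocate one zeroed backing page while servicing a bio */
static struct page *brd_alloc_page(void)
{
        /*
         * GFP_NOIO because we are called from the block layer and must not
         * recurse into I/O to satisfy the allocation; highmem is fine since
         * the data is only touched through kmap_atomic().
         */
        return alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
}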


> A very ugly bit seems to be the fact that we assume we can
> dereference bh->b_data without any special magic which
> means the ramdisk must live in low memory on 32bit machines.

Yeah but that's not rd.c. You need to rewrite the buffer layer
to fix that (see fsblock ;)).

2007-10-16 04:58:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

Nick Piggin <[email protected]> writes:

>>
>> make_page_uptodate() is most hideous part I have run into.
>> It has to know details about other layers to now what not
>> to stomp. I think my incorrect simplification of this is what messed
>> things up, last round.
>
> Not really, it's just named funny. That's just a minor utility
> function that more or less does what it says it should do.
>
> The main problem is really that it's implementing a block device
> who's data comes from its own buffercache :P. I think.

Well, to put it another way, make_page_uptodate() is the only
place where we really need to know about the upper layers,
given that you can kill ramdisks by coding it as:

static void make_page_uptodate(struct page *page)
{
        clear_highpage(page);
        flush_dcache_page(page);
        SetPageUptodate(page);
}

Something is seriously non-intuitive about that function if
you understand the usual rules for how to use the page cache.

The problem is that we support a case in the buffer cache
where pages are partially uptodate and only the buffer_heads
remember which parts are valid. Assuming we are using them
correctly.

Having to walk through all of the buffer heads in make_page_uptodate
seems to me to be a nasty layering violation in rd.c.
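
(For contrast, the real make_page_uptodate() has to preserve whatever the
buffer heads already claim is valid and only zero the rest -- roughly, from
memory, not a verbatim copy of rd.c:)

static void make_page_uptodate(struct page *page)
{
        if (page_has_buffers(page)) {
                struct buffer_head *bh = page_buffers(page);
                struct buffer_head *head = bh;

                do {
                        /* only zero blocks no buffer holds valid data for */
                        if (!buffer_uptodate(bh)) {
                                memset(bh->b_data, 0, bh->b_size);
                                set_buffer_uptodate(bh);
                        }
                } while ((bh = bh->b_this_page) != head);
        } else {
                memset(page_address(page), 0, PAGE_CACHE_SIZE);
        }
        flush_dcache_page(page);
        SetPageUptodate(page);
}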

>> > I guess it's not nice
>> > for operating on the pagecache from its request_fn, but the
>> > alternative is to duplicate pages for backing store and buffer
>> > cache (actually that might not be a bad alternative really).
>>
>> Cool. Triple buffering :) Although I guess that would only
>> apply to metadata these days.
>
> Double buffering. You no longer serve data out of your buffer
> cache. All filesystem data was already double buffered anyway,
> so we'd be just losing out on one layer of savings for metadata.

Yep we are in agreement there.

> I think it's worthwhile, given that we'd have a "real" looking
> block device and minus these bugs.

For testing purposes I think I can agree with that.

>> Having a separate store would
>> solve some of the problems, and probably remove the need
>> for carefully specifying the ramdisk block size. We would
>> still need the magic restictions on page allocations though
>> and it we would use them more often as the initial write to the
>> ramdisk would not populate the pages we need.
>
> What magic restrictions on page allocations? Actually we have
> fewer restrictions on page allocations because we can use
> highmem!

With the proposed rewrite yes.

> And the lowmem buffercache pages that we currently pin
> (unsuccessfully, in the case of this bug) are now completely
> reclaimable. And all your buffer heads are now reclaimable.

Hmm. Good point. So in net it should save memory even if
it consumes a little more in the worst case.


> If you mean GFP_NOIO... I don't see any problem. Block device
> drivers have to allocate memory with GFP_NOIO; this may have
> been considered magic or deep badness back when the code was
> written, but it's pretty simple and accepted now.

Well I always figured it was a bit rude allocating large amounts
of memory GFP_NOIO but whatever.

>> A very ugly bit seems to be the fact that we assume we can
>> dereference bh->b_data without any special magic which
>> means the ramdisk must live in low memory on 32bit machines.
>
> Yeah but that's not rd.c. You need to rewrite the buffer layer
> to fix that (see fsblock ;)).

I'm not certain which way we should go: take fsblock and run it
in parallel until everything is converted, or use fsblock as a
prototype and, once we have figured out which way we should go,
convert struct buffer_head into struct fsblock one patch at a time.

I'm inclined to think we should evolve the buffer_head.

Eric

2007-10-16 05:26:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> >> make_page_uptodate() is most hideous part I have run into.
> >> It has to know details about other layers to now what not
> >> to stomp. I think my incorrect simplification of this is what messed
> >> things up, last round.
> >
> > Not really, it's just named funny. That's just a minor utility
> > function that more or less does what it says it should do.
> >
> > The main problem is really that it's implementing a block device
> > who's data comes from its own buffercache :P. I think.
>
> Well to put it another way, mark_page_uptodate() is the only
> place where we really need to know about the upper layers.
> Given that you can kill ramdisks by coding it as:
>
> static void make_page_uptodate(struct page *page)
> {
> clear_highpage(page);
> flush_dcache_page(page);
> SetPageUptodate(page);
> }
>
> Something is seriously non-intuitive about that function if
> you understand the usual rules for how to use the page cache.

You're overwriting some buffers that were uptodate and dirty.
That would be expected to cause problems.


> The problem is that we support a case in the buffer cache
> where pages are partially uptodate and only the buffer_heads
> remember which parts are valid. Assuming we are using them
> correctly.
>
> Having to walk through all of the buffer heads in make_page_uptodate
> seems to me to be a nasty layering violation in rd.c

Sure, but it's not just about the buffers. It's the pagecache
in general. It is supposed to be invisible to the device driver,
sitting above it, and yet the driver reaches into the buffercache
and pulls its data out of it.


> > I think it's worthwhile, given that we'd have a "real" looking
> > block device and minus these bugs.
>
> For testing purposes I think I can agree with that.

What non-testing uses does it have?


> >> Having a separate store would
> >> solve some of the problems, and probably remove the need
> >> for carefully specifying the ramdisk block size. We would
> >> still need the magic restictions on page allocations though
> >> and it we would use them more often as the initial write to the
> >> ramdisk would not populate the pages we need.
> >
> > What magic restrictions on page allocations? Actually we have
> > fewer restrictions on page allocations because we can use
> > highmem!
>
> With the proposed rewrite yes.
>
> > And the lowmem buffercache pages that we currently pin
> > (unsuccessfully, in the case of this bug) are now completely
> > reclaimable. And all your buffer heads are now reclaimable.
>
> Hmm. Good point. So in net it should save memory even if
> it consumes a little more in the worst case.

Highmem systems would definitely like it. For others, yes, all
the duplicated pages should be able to get reclaimed if memory
gets tight, along with the buffer heads, so yeah footprint may
be a tad smaller.


> > If you mean GFP_NOIO... I don't see any problem. Block device
> > drivers have to allocate memory with GFP_NOIO; this may have
> > been considered magic or deep badness back when the code was
> > written, but it's pretty simple and accepted now.
>
> Well I always figured it was a bit rude allocating large amounts
> of memory GFP_NOIO but whatever.

You'd rather not, of course, but with dirty data limits now,
it doesn't matter much. (and I doubt anybody outside testing
is going to be hammering like crazy on rd).

Note that the buffercache based ramdisk driver is going to
also be allocating with GFP_NOFS if you're talking about a
filesystem writing to its metadata. In most systems, GFP_NOFS
isn't much different to GFP_NOIO.

We could introduce a mode which allocates pages up front
quite easily if it were a problem (which I doubt it ever would
be).

2007-10-16 07:42:18

by Nick Piggin

[permalink] [raw]
Subject: [patch][rfc] rewrite ramdisk

On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:

> > > What magic restrictions on page allocations? Actually we have
> > > fewer restrictions on page allocations because we can use
> > > highmem!
> >
> > With the proposed rewrite yes.

Here's a quick first hack...

Comments?


Attachments:
rd2.patch (13.61 kB)

2007-10-16 07:53:07

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk


On Oct 16 2007 17:47, Nick Piggin wrote:
>
>Here's a quick first hack...

Inline patches preferred ;-)

>+config BLK_DEV_BRD
>+ tristate "RAM block device support"
>+ ---help---
>+ This is a new based block driver that replaces BLK_DEV_RAM.

based on what? -^

>+ To compile this driver as a module, choose M here: the
>+ module will be called rd.

called brd.ko.

>+/*
>+ * And now the modules code and kernel interface.
>+ */
>+static int rd_nr;
>+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;

Perhaps unsigned?
Perhaps even long for rd_size?

>+module_param(rd_nr, int, 0);
>+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
>+module_param(rd_size, int, 0);
>+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
>+MODULE_LICENSE("GPL");
>+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
>+
>+/* options - nonmodular */
>+#ifndef MODULE
>+static int __init ramdisk_size(char *str)
>+{
>+ rd_size = simple_strtol(str,NULL,0);
>+ return 1;
>+}

Is this, besides for compatibility, really needed?

>+static int __init ramdisk_size2(char *str)
>+{
>+ return ramdisk_size(str);
>+}
>+static int __init rd_nr(char *str)

Err! Overlapping symbols! The rd_nr() function collides with the rd_nr
variable. It also does not seem needed, since it did not exist before.
It should go, you can set the variable with brd.rd_nr=XXX (same
goes for ramdisk_size). What's the point of ramdisk_size2()?

>+{
>+ rd_nr = simple_strtol(str, NULL, 0);
>+ return 1;
>+}
>+__setup("ramdisk=", ramdisk_size);
>+__setup("ramdisk_size=", ramdisk_size2);

__setup("ramdisk_size=", ramdisk_size); maybe, or does not that work?

>+__setup("rd_nr=", rd_nr);
>+#endif
>+
>+
>+static struct brd_device *brd_alloc(int i)
>+{
>+ struct brd_device *brd;
>+ struct gendisk *disk;
>+
>+ brd = kzalloc(sizeof(*brd), GFP_KERNEL);
>+ if (!brd)
>+ goto out;
>+ brd->brd_number = i;

2007-10-16 07:57:21

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffer heads dirty

On Tuesday, 16 October 2007, Eric W. Biederman wrote:

> fs/buffer.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
> drivers/block/rd.c | 13 +------------
> 1 files changed, 1 insertions(+), 12 deletions(-)

Your patches look sane so far. I have applied both patches, and the problem
seems to be gone. I will try to get these patches to our testers.

As long as they don't find new problems:

Acked-by: Christian Borntraeger <[email protected]>

Nick, Eric. What is the best patch for the stable series? Both patches from
Eric or mine? I CCed stable, so they know that something is coming.

Christian

2007-10-16 08:02:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Tuesday 16 October 2007 17:52, Jan Engelhardt wrote:
> On Oct 16 2007 17:47, Nick Piggin wrote:
> >Here's a quick first hack...
>
> Inline patches preferred ;-)

Thanks for reviewing it anyway ;)


> >+config BLK_DEV_BRD
> >+ tristate "RAM block device support"
> >+ ---help---
> >+ This is a new based block driver that replaces BLK_DEV_RAM.
>
> based on what? -^

RAM based.


> >+ To compile this driver as a module, choose M here: the
> >+ module will be called rd.
>
> called brd.ko.

Changed. But it will hopefully just completely replace rd.c,
so I will probably just rename it to rd.c at some point (and
change .config options to stay compatible). Unless someone
sees a problem with that?


> >+/*
> >+ * And now the modules code and kernel interface.
> >+ */
> >+static int rd_nr;
> >+static int rd_size = CONFIG_BLK_DEV_RAM_SIZE;
>
> Perhaps unsigned?
> Perhaps even long for rd_size?

I've taken most of that stuff out of rd.c in an effort to stay
backward compatible. I don't know if it really matters to use long?


> >+module_param(rd_nr, int, 0);
> >+MODULE_PARM_DESC(rd_nr, "Maximum number of brd devices");
> >+module_param(rd_size, int, 0);
> >+MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
> >+MODULE_LICENSE("GPL");
> >+MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
> >+
> >+/* options - nonmodular */
> >+#ifndef MODULE
> >+static int __init ramdisk_size(char *str)
> >+{
> >+ rd_size = simple_strtol(str,NULL,0);
> >+ return 1;
> >+}
>
> Is this, besides for compatibility, really needed?
>
> >+static int __init ramdisk_size2(char *str)
> >+{
> >+ return ramdisk_size(str);
> >+}
> >+static int __init rd_nr(char *str)
>
> Err! Overlapping symbols! The rd_nr() function collides with the rd_nr
> variable.

Thanks... %s gone awry. Fixed to the expected names.


> It also does not seem needed, since it did not exist before.
> It should go, you can set the variable with brd.rd_nr=XXX (same
> goes for ramdisk_size).

But only if it's a module?


> What's the point of ramdisk_size2()?

Back compat. When rewriting the internals, I want to try to avoid
changing any externals if possible. Whether this is the Right
Way to do it or not, I don't know :P


> >+{
> >+ rd_nr = simple_strtol(str, NULL, 0);
> >+ return 1;
> >+}
> >+__setup("ramdisk=", ramdisk_size);
> >+__setup("ramdisk_size=", ramdisk_size2);
>
> __setup("ramdisk_size=", ramdisk_size); maybe, or does not that work?

Didn't try it, but the rd.c code does the same thing so I guess it
doesn't (or didn't, when it was written).

2007-10-16 08:07:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Preserve the dirty bit in init_page_buffers()

On Tuesday 16 October 2007 08:40, Eric W. Biederman wrote:
> The problem: When we are trying to free buffers try_to_free_buffers()
> will look at ramdisk pages with clean buffer heads and remove the
> dirty bit from the page. Resulting in ramdisk pages with data that get
> removed from the page cache. Ouch!
>
> Buffer heads appear on ramdisk pages when a filesystem calls
> __getblk() which through a series of function calls eventually calls
> init_page_buffers().
>
> So to fix the mismatch between buffer head and page state this patch
> modifies init_page_buffers() to transfer the dirty bit from the page to
> the buffer heads like we currently do for the uptodate bit.
>
> This patch is safe as only __getblk calls init_page_buffers, and
> there are only two implementations of block devices cached in the
> page cache. The generic implementation in block_dev.c and the
> implementation in rd.c.
>
> The generic implementation of block devices always does everything
> in terms of buffer heads so it always has buffer heads allocated
> before a page is marked dirty so this change does not affect it.

This is probably a good idea. Was this causing the reiserfs problems?
If so, I think we should be concentrating on what the real problem
is with reiserfs... (or at least why this so obviously correct
looking patch is wrong).

Acked-by: Nick Piggin <[email protected]>

>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> fs/buffer.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 75b51df..8b87beb 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -972,6 +972,7 @@ init_page_buffers(struct page *page, struct block_device *bdev,
> struct buffer_head *head = page_buffers(page);
> struct buffer_head *bh = head;
> int uptodate = PageUptodate(page);
> + int dirty = PageDirty(page);
>
> do {
> if (!buffer_mapped(bh)) {
> @@ -980,6 +981,8 @@ init_page_buffers(struct page *page, struct block_device *bdev,
> bh->b_blocknr = block;
> if (uptodate)
> set_buffer_uptodate(bh);
> + if (dirty)
> + set_buffer_dirty(bh);
> set_buffer_mapped(bh);
> }
> block++;


2007-10-16 08:14:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffer heads dirty

On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote:
> I have not observed this case but it is possible to get a dirty page
> cache with clean buffer heads if we get a clean ramdisk page with
> buffer heads generated by a filesystem calling __getblk and then write
> to that page from user space through the block device. Then we just
> need to hit the proper window and try_to_free_buffers() will mark that
> page clean and eventually drop it. Ouch!
>
> To fix this use the generic __set_page_dirty_buffers in the ramdisk
> code so that when we mark a page dirty we also mark it's buffer heads
> dirty.

Hmm, so we can also have some filesystems writing their own buffers
out by hand (clear_buffer_dirty, submit buffer for IO). Other places
will do similarly dodgy things with filesystem metadata
(fsync_buffers_list, for example).

So your buffers get cleaned again, then your pages get cleaned.
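
(The hand-rolled writeout I mean is roughly the sync_dirty_buffer() pattern --
sketching from memory:)

        lock_buffer(bh);
        if (test_clear_buffer_dirty(bh)) {      /* buffer goes clean here... */
                get_bh(bh);
                bh->b_end_io = end_buffer_write_sync;
                submit_bh(WRITE, bh);   /* ...without the page dirty bit being touched */
        } else {
                unlock_buffer(bh);
        }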

While I said it was a good fix when I saw the patch earlier, I think
it's not closing the entire hole, and as such, Christian's patch is
probably the way to go for stable.

For mainline, *if* we want to keep the old rd.c around at all, I
don't see any harm in this patch so long as Christian's is merged
as well. Sharing common code is always good.

>
> Signed-off-by: Eric W. Biederman <[email protected]>
> ---
> drivers/block/rd.c | 13 +------------
> 1 files changed, 1 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/block/rd.c b/drivers/block/rd.c
> index 701ea77..84163da 100644
> --- a/drivers/block/rd.c
> +++ b/drivers/block/rd.c
> @@ -178,23 +178,12 @@ static int ramdisk_writepages(struct address_space *mapping,
> return 0;
> }
>
> -/*
> - * ramdisk blockdev pages have their own ->set_page_dirty() because we don't
> - * want them to contribute to dirty memory accounting.
> - */
> -static int ramdisk_set_page_dirty(struct page *page)
> -{
> - if (!TestSetPageDirty(page))
> - return 1;
> - return 0;
> -}
> -
> static const struct address_space_operations ramdisk_aops = {
> .readpage = ramdisk_readpage,
> .prepare_write = ramdisk_prepare_write,
> .commit_write = ramdisk_commit_write,
> .writepage = ramdisk_writepage,
> - .set_page_dirty = ramdisk_set_page_dirty,
> + .set_page_dirty = __set_page_dirty_buffers,
> .writepages = ramdisk_writepages,
> };

2007-10-16 08:17:32

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk


On Oct 16 2007 18:07, Nick Piggin wrote:
>Changed. But it will hopefully just completely replace rd.c,
>so I will probably just rename it to rd.c at some point (and
>change .config options to stay compatible). Unless someone
>sees a problem with that?

I do not see a problem with keeping brd either.

>> It also does not seem needed, since it did not exist before.
>> It should go, you can set the variable with brd.rd_nr=XXX (same
>> goes for ramdisk_size).
>
>But only if it's a module?

Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
and you will see.


2007-10-16 08:22:00

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Tuesday 16 October 2007 18:17, Jan Engelhardt wrote:
> On Oct 16 2007 18:07, Nick Piggin wrote:
> >Changed. But it will hopefully just completely replace rd.c,
> >so I will probably just rename it to rd.c at some point (and
> >change .config options to stay compatible). Unless someone
> >sees a problem with that?
>
> I do not see a problem with keeping brd either.

Just doesn't seem to be any point in making it a new and different
module, assuming it can support exactly the same semantics. I'm
only doing so in these first diffs so that they are easier to read
and also easier to do a side by side comparison / builds with the
old rd.c


> >> It also does not seem needed, since it did not exist before.
> >> It should go, you can set the variable with brd.rd_nr=XXX (same
> >> goes for ramdisk_size).
> >
> >But only if it's a module?
>
> Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
> and you will see.

Ah, nice. (I don't use them much!). Still, backward compat I
think is needed if we are to replace rd.c.

2007-10-16 08:49:18

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffer heads dirty

On Tuesday, 16 October 2007, Nick Piggin wrote:
> While I said it was a good fix when I saw the patch earlier, I think
> it's not closing the entire hole, and as such, Christian's patch is
> probably the way to go for stable.

That would be my preferred method: merge Eric's patches and my fixup for 2.6.24-rc.
The only open question is, what was the reiserfs problem? Is it still
caused by Eric's patches?

> For mainline, *if* we want to keep the old rd.c around at all, I
> don't see any harm in this patch so long as Christian's is merged
> as well. Sharing common code is always good.

While the merge window is still open, the new ramdisk code seems to me
more like a 2.6.25-rc thing. We really should test the behaviour in
low-memory scenarios. What do you think?

Christian

2007-10-16 08:53:24

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk


On Oct 16 2007 18:26, Nick Piggin wrote:
>> >> It also does not seem needed, since it did not exist before.
>> >> It should go, you can set the variable with brd.rd_nr=XXX (same
>> >> goes for ramdisk_size).
>> >
>> >But only if it's a module?
>>
>> Attributes always work. Try vt.default_red=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
>> and you will see.
>
>Ah, nice. (I don't use them much!). Still, backward compat I
>think is needed if we are to replace rd.c.
>
Like I said, I did not see rd_nr in Documentation/kernel-parameters.txt
so I thought there was no compat.

2007-10-16 09:09:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk


Well, on that same tune, but with a slightly different implementation.
It compiles, but I need to head to bed so I haven't had a chance
to test it yet.

Nick, it is very similar to yours, with the big difference being that
I embedded a struct address_space instead of rerolling it by
hand, which saves a lot of lines of code.

Eric

---
drivers/block/rd.c
/*
* ramdisk.c - Multiple RAM disk driver - gzip-loading version - v. 0.8 beta.
*
* (C) Chad Page, Theodore Ts'o, et. al, 1995.
*
* This RAM disk is designed to have filesystems created on it and mounted
* just like a regular floppy disk.
*
* It also does something suggested by Linus: use the buffer cache as the
* RAM disk data. This makes it possible to dynamically allocate the RAM disk
* buffer - with some consequences I have to deal with as I write this.
*
* This code is based on the original ramdisk.c, written mostly by
* Theodore Ts'o (TYT) in 1991. The code was largely rewritten by
* Chad Page to use the buffer cache to store the RAM disk data in
* 1995; Theodore then took over the driver again, and cleaned it up
* for inclusion in the mainline kernel.
*
* The original CRAMDISK code was written by Richard Lyons, and
* adapted by Chad Page to use the new RAM disk interface. Theodore
* Ts'o rewrote it so that both the compressed RAM disk loader and the
* kernel decompressor uses the same inflate.c codebase. The RAM disk
* loader now also loads into a dynamic (buffer cache based) RAM disk,
* not the old static RAM disk. Support for the old static RAM disk has
* been completely removed.
*
* Loadable module support added by Tom Dyas.
*
* Further cleanups by Chad Page ([email protected]):
* Cosmetic changes in #ifdef MODULE, code movement, etc.
* When the RAM disk module is removed, free the protected buffers
* Default RAM disk size changed to 2.88 MB
*
* Added initrd: Werner Almesberger & Hans Lermen, Feb '96
*
* 4/25/96 : Made RAM disk size a parameter (default is now 4 MB)
* - Chad Page
*
* Add support for fs images split across >1 disk, Paul Gortmaker, Mar '98
*
* Make block size and block size shift for RAM disks a global macro
* and set blk_size for -ENOSPC, Werner Fink <[email protected]>, Apr '99
*/

#include <linux/string.h>
#include <linux/slab.h>
#include <asm/atomic.h>
#include <linux/bio.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>
#include <linux/buffer_head.h> /* for invalidate_bdev() */
#include <linux/backing-dev.h>
#include <linux/blkpg.h>
#include <linux/writeback.h>

#include <asm/uaccess.h>

#define RAMDISK_MINORS 250

struct ramdisk {
        struct address_space rd_mapping;
        struct gendisk *rd_disk;
        struct request_queue *rd_queue;
        struct list_head rd_list;
};

/*
 * Various static variables go here. Most are used only in the RAM disk code.
 */

static LIST_HEAD(ramdisks);
static DEFINE_MUTEX(ramdisks_mutex);

/*
 * Parameters for the boot-loading of the RAM disk. These are set by
 * init/main.c (from arguments to the kernel command line) or from the
 * architecture-specific setup routine (from the stored boot sector
 * information).
 */
int rd_size = CONFIG_BLK_DEV_RAM_SIZE;  /* Size of the RAM disks */
/*
 * It would be very desirable to have a soft-blocksize (that in the case
 * of the ramdisk driver is also the hardblocksize ;) of PAGE_SIZE because
 * doing that we'll achieve a far better MM footprint. Using a rd_blocksize of
 * BLOCK_SIZE in the worst case we'll make PAGE_SIZE/BLOCK_SIZE buffer-pages
 * unfreeable. With a rd_blocksize of PAGE_SIZE instead we are sure that only
 * 1 page will be protected. Depending on the size of the ramdisk you
 * may want to change the ramdisk blocksize to achieve a better or worse MM
 * behaviour. The default is still BLOCK_SIZE (needed by rd_load_image that
 * supposes the filesystem in the image uses a BLOCK_SIZE blocksize).
 */
static int rd_blocksize = CONFIG_BLK_DEV_RAM_BLOCKSIZE;

/*
 * Copyright (C) 2000 Linus Torvalds.
 *               2000 Transmeta Corp.
 * aops copied from ramfs.
 */

static const struct address_space_operations ramdisk_aops = {
        .readpage = simple_readpage,
        .set_page_dirty = __set_page_dirty_no_writeback,
};

static struct ramdisk *bdev_ramdisk(struct block_device *bdev)
{
        return bdev->bd_disk->private_data;
}

static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
                                  struct address_space *mapping)
{
        pgoff_t index = sector >> (PAGE_CACHE_SHIFT - 9);
        unsigned int vec_offset = vec->bv_offset;
        int offset = (sector << 9) & ~PAGE_CACHE_MASK;
        int size = vec->bv_len;
        int err = 0;

        do {
                int count;
                struct page *page;
                char *src;
                char *dst;

                count = PAGE_CACHE_SIZE - offset;
                if (count > size)
                        count = size;
                size -= count;

                page = grab_cache_page(mapping, index);
                if (!page) {
                        err = -ENOMEM;
                        goto out;
                }

                if (!PageUptodate(page)) {
                        clear_highpage(page);
                        SetPageUptodate(page);
                }

                index++;

                if (rw == READ) {
                        src = kmap_atomic(page, KM_USER0) + offset;
                        dst = kmap_atomic(vec->bv_page, KM_USER1) + vec_offset;
                } else {
                        src = kmap_atomic(vec->bv_page, KM_USER0) + vec_offset;
                        dst = kmap_atomic(page, KM_USER1) + offset;
                }
                offset = 0;
                vec_offset += count;

                memcpy(dst, src, count);

                kunmap_atomic(src, KM_USER0);
                kunmap_atomic(dst, KM_USER1);

                if (rw == WRITE)
                        set_page_dirty(page);
                unlock_page(page);
                put_page(page);
        } while (size);

out:
        return err;
}

/*
 * Basically, my strategy here is to set up a buffer-head which can't be
 * deleted, and make that my Ramdisk. If the request is outside of the
 * allocated size, we must get rid of it...
 *
 * 19-JAN-1998 Richard Gooch <[email protected]> Added devfs support
 *
 */
static int rd_make_request(struct request_queue *q, struct bio *bio)
{
        struct block_device *bdev = bio->bi_bdev;
        struct ramdisk *rd = bdev_ramdisk(bdev);
        struct address_space * mapping = &rd->rd_mapping;
        sector_t sector = bio->bi_sector;
        unsigned long len = bio->bi_size >> 9;
        int rw = bio_data_dir(bio);
        struct bio_vec *bvec;
        int ret = 0, i;

        if (sector + len > get_capacity(bdev->bd_disk))
                goto fail;

        if (rw==READA)
                rw=READ;

        bio_for_each_segment(bvec, bio, i) {
                ret |= rd_blkdev_pagecache_IO(rw, bvec, sector, mapping);
                sector += bvec->bv_len >> 9;
        }
        if (ret)
                goto fail;

        bio_endio(bio, 0);
        return 0;
fail:
        bio_io_error(bio);
        return 0;
}

static int rd_ioctl(struct inode *inode, struct file *file,
                    unsigned int cmd, unsigned long arg)
{
        int error;
        struct block_device *bdev = inode->i_bdev;
        struct ramdisk *rd = bdev_ramdisk(bdev);

        if (cmd != BLKFLSBUF)
                return -ENOTTY;

        /*
         * special: we want to release the ramdisk memory, it's not like with
         * the other blockdevices where this ioctl only flushes away the buffer
         * cache
         */
        error = -EBUSY;
        mutex_lock(&bdev->bd_mutex);
        if (bdev->bd_openers <= 1) {
                truncate_inode_pages(&rd->rd_mapping, 0);
                error = 0;
        }
        mutex_unlock(&bdev->bd_mutex);
        return error;
}

/*
 * This is the backing_dev_info for the blockdev inode itself. It doesn't need
 * writeback and it does not contribute to dirty memory accounting.
 */
static struct backing_dev_info rd_backing_dev_info = {
        .ra_pages = 0,  /* No readahead */
        .capabilities = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK,
        .unplug_io_fn = default_unplug_io_fn,
};

static struct block_device_operations rd_bd_op = {
        .owner = THIS_MODULE,
        .ioctl = rd_ioctl,
};

static struct ramdisk *rd_alloc(int i)
{
struct ramdisk *rd;
struct request_queue *queue;
struct gendisk *disk;
struct address_space *mapping;

rd = kzalloc(sizeof(*rd), GFP_KERNEL);
if (!rd)
goto out;

rd->rd_queue = queue = blk_alloc_queue(GFP_KERNEL);
if (!rd->rd_queue)
goto out_free_dev;

disk = rd->rd_disk = alloc_disk(1);
if (!disk)
goto out_free_queue;

mapping = &rd->rd_mapping;
mapping->a_ops = &ramdisk_aops;
mapping->backing_dev_info = &rd_backing_dev_info;
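/* allow ramdisk pages in highmem; the copy paths map them with kmap_atomic */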
mapping_set_gfp_mask(mapping, __GFP_WAIT | __GFP_HIGH | __GFP_HIGHMEM);

blk_queue_make_request(queue, &rd_make_request);
blk_queue_hardsect_size(queue, rd_blocksize);

/* rd_size is given in kB */
disk->major = RAMDISK_MAJOR;
disk->first_minor = i;
disk->fops = &rd_bd_op;
disk->queue = queue;
disk->flags |= GENHD_FL_SUPPRESS_PARTITION_INFO;
sprintf(disk->disk_name, "ram%d", i);
set_capacity(disk, rd_size * 2);
disk->private_data = rd;
add_disk(rd->rd_disk);
list_add_tail(&rd->rd_list, &ramdisks);
return rd;

out_free_queue:
blk_cleanup_queue(queue);
out_free_dev:
kfree(rd);
out:
return NULL;
}

static void rd_free(struct ramdisk *rd)
{
del_gendisk(rd->rd_disk);
put_disk(rd->rd_disk);
blk_cleanup_queue(rd->rd_queue);
truncate_inode_pages(&rd->rd_mapping, 0);
list_del(&rd->rd_list);
kfree(rd);
}

static struct kobject *rd_probe(dev_t dev, int *part, void *data)
{
struct ramdisk *rd;
struct kobject *kobj;
unsigned int unit;

kobj = ERR_PTR(-ENOMEM);
unit = MINOR(dev);

mutex_lock(&ramdisks_mutex);
list_for_each_entry(rd, &ramdisks, rd_list) {
if (rd->rd_disk->first_minor == unit)
goto found;
}
rd = rd_alloc(MINOR(dev));
found:
if (rd)
kobj = get_disk(rd->rd_disk);
mutex_unlock(&ramdisks_mutex);

*part = 0;
return kobj;
}

/*
* Before freeing the module, invalidate all of the protected buffers!
*/
static void __exit rd_cleanup(void)
{
struct ramdisk *rd, *next;

list_for_each_entry_safe(rd, next, &ramdisks, rd_list)
rd_free(rd);

blk_unregister_region(MKDEV(RAMDISK_MAJOR, 0), RAMDISK_MINORS);
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");
}

/*
* This is the registration and initialization section of the RAM disk driver
*/
static int __init rd_init(void)
{
if (rd_blocksize > PAGE_SIZE || rd_blocksize < 512 ||
(rd_blocksize & (rd_blocksize-1))) {
printk("RAMDISK: wrong blocksize %d, reverting to defaults\n",
rd_blocksize);
rd_blocksize = BLOCK_SIZE;
}

if (register_blkdev(RAMDISK_MAJOR, "ramdisk"))
return -EIO;

blk_register_region(MKDEV(RAMDISK_MAJOR, 0), RAMDISK_MINORS,
THIS_MODULE, rd_probe, NULL, NULL);

/* rd_size is given in kB */
printk("RAMDISK driver initialized: "
"%d RAM disks of %dK size %d blocksize\n",
CONFIG_BLK_DEV_RAM_COUNT, rd_size, rd_blocksize);

return 0;
}

module_init(rd_init);
module_exit(rd_cleanup);

/* options - nonmodular */
#ifndef MODULE
static int __init ramdisk_size(char *str)
{
rd_size = simple_strtol(str,NULL,0);
return 1;
}
static int __init ramdisk_size2(char *str) /* kludge */
{
return ramdisk_size(str);
}
static int __init ramdisk_blocksize(char *str)
{
rd_blocksize = simple_strtol(str,NULL,0);
return 1;
}
__setup("ramdisk=", ramdisk_size);
__setup("ramdisk_size=", ramdisk_size2);
__setup("ramdisk_blocksize=", ramdisk_blocksize);
#endif

/* options - modular */
module_param(rd_size, int, 0);
MODULE_PARM_DESC(rd_size, "Size of each RAM disk in kbytes.");
module_param(rd_blocksize, int, 0);
MODULE_PARM_DESC(rd_blocksize, "Blocksize of each RAM disk in bytes.");
MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);

MODULE_LICENSE("GPL");

2007-10-16 09:23:21

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Christian Borntraeger <[email protected]> writes:

> On Tuesday, 16 October 2007, Eric W. Biederman wrote:
>
>> fs/buffer.c | 3 +++
>> 1 files changed, 3 insertions(+), 0 deletions(-)
>> drivers/block/rd.c | 13 +------------
>> 1 files changed, 1 insertions(+), 12 deletions(-)
>
> Your patches look sane so far. I have applied both patches, and the problem
> seems gone. I will try to get these patches to our testers.
>
> As long as they dont find new problems:

Sounds good. Please make certain to test reiserfs as well as ext2+ as
it seems to strain the ramdisk code more aggressively.

> Acked-by: Christian Borntraeger <[email protected]>
>
> Nick, Eric. What is the best patch for the stable series? Both patches from
> Eric or mine? I CCed stable, so they know that something is coming.

My gut feel says my patches assuming everything tests out ok, as
they actually fix the problem instead of papering over it, and there
isn't really a size difference.

In addition, the change to init_page_buffers is interesting all by
itself. With that patch we now have the option of running block
devices in a mode where we don't bother to cache the buffer heads,
which should reduce memory pressure a bit. Not that an enhancement
like that will show up in stable, but...

Eric

2007-10-16 09:35:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Preserve the dirty bit in init_page_buffers()

Nick Piggin <[email protected]> writes:

> On Tuesday 16 October 2007 08:40, Eric W. Biederman wrote:
>> The problem: When we are trying to free buffers try_to_free_buffers()
>> will look at ramdisk pages with clean buffer heads and remove the
>> dirty bit from the page. Resulting in ramdisk pages with data that get
>> removed from the page cache. Ouch!
>>
>> Buffer heads appear on ramdisk pages when a filesystem calls
>> __getblk() which through a series of function calls eventually calls
>> init_page_buffers().
>>
>> So to fix the mismatch between buffer head and page state this patch
>> modifies init_page_buffers() to transfer the dirty bit from the page to
>> the buffer heads like we currently do for the uptodate bit.
>>
>> This patch is safe as only __getblk calls init_page_buffers, and
>> there are only two implementations of block devices cached in the
>> page cache. The generic implementation in block_dev.c and the
>> implementation in rd.c.
>>
>> The generic implementation of block devices always does everything
>> in terms of buffer heads so it always has buffer heads allocated
>> before a page is marked dirty so this change does not affect it.
>
> This is probably a good idea. Was this causing the reiserfs problems?
> If so, I think we should be concentrating on what the real problem
> is with reiserfs... (or at least why this so obviously correct
> looking patch is wrong).

I think it was my cleanup patch that was sitting on top of this
that caused the problems.

Eric

2007-10-16 19:07:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Nick Piggin <[email protected]> writes:

> On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote:
>> I have not observed this case but it is possible to get a dirty page
>> cache with clean buffer heads if we get a clean ramdisk page with
>> buffer heads generated by a filesystem calling __getblk and then write
>> to that page from user space through the block device. Then we just
>> need to hit the proper window and try_to_free_buffers() will mark that
>> page clean and eventually drop it. Ouch!
>>
>> To fix this use the generic __set_page_dirty_buffers in the ramdisk
>> code so that when we mark a page dirty we also mark it's buffer heads
>> dirty.
>
> Hmm, so we can also have some filesystems writing their own buffers
> out by hand (clear_buffer_dirty, submit buffer for IO). Other places
> will do similarly dodgy things with filesystem metadata
> (fsync_buffers_list, for example).
>
> So your buffers get cleaned again, then your pages get cleaned.

So I just took a little bit of time to look at and think about
the path you are referring to, and I don't see a problem.

The rule with the buffer dirty bit is that you first clear it
and then you submit the write. When the write request finally
makes its way to rd.c we copy the data if necessary and call
set_page_dirty, which will then mark the page and the buffer
dirty again.

In essence the ramdisk code just attempts to lock buffers in
ram by setting their dirty bit, just like we do for pages
in ramfs.
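
As a rough sketch of what that means (ignoring the private_lock and
dirty accounting that the real __set_page_dirty_buffers takes care of),
the set_page_dirty path would look something like:

	/*
	 * Sketch only, not the actual patch: dirtying a page also dirties
	 * any buffer_heads attached to it, so page state and buffer state
	 * can never disagree.
	 */
	static int ramdisk_set_page_dirty_sketch(struct page *page)
	{
		if (page_has_buffers(page)) {
			struct buffer_head *head = page_buffers(page);
			struct buffer_head *bh = head;

			do {
				set_buffer_dirty(bh);
				bh = bh->b_this_page;
			} while (bh != head);
		}
		return !TestSetPageDirty(page);
	}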

The only case where I see the dirty bit getting cleared without
submitting I/O is when dirty state doesn't matter, in which
case, if we get a page full of buffers whose data we don't care
about, it is legitimate to drop the page.

As for ramdisk_writepage, it probably makes sense to remove it,
as the generic code paths seem to work as well or better when the
writepage method is NULL. However, if we do keep it we should call
set_page_dirty there on general principles.

> While I said it was a good fix when I saw the patch earlier, I think
> it's not closing the entire hole, and as such, Christian's patch is
> probably the way to go for stable.

I thought through the logic in try_to_free_buffers and it actually
makes sense to me now. mark_buffer_dirty sets the page dirty bit
so dirty buffers reside on dirty pages. When we submit I/O we
aren't guaranteed to submit all of the dirty buffers for a page
at once, so we don't clear the page dirty bit. With the result
that we can end up with pages with the dirty bit set but all of
their buffers are clean.

Since we rather deliberately allow truly clean pages to be dropped
from the ramdisk, overriding try_to_free_buffers looks wrong,
because then, except for invalidate, we cannot remove buffers
from truly clean pages.

> For mainline, *if* we want to keep the old rd.c around at all, I
> don't see any harm in this patch so long as Christian's is merged
> as well. Sharing common code is always good.

I do agree that, with the amount of code duplication in the buffer
cache, locking pages into the buffer cache seems very error prone
and difficult to maintain. So rewriting rd.c to store its pages
elsewhere sounds very promising.

While I can see Christian's patch as fixing the symptom, I have
a very hard time seeing it as correct. If we did something more
elaborate to replace try_to_free_buffers, such that
we could drop buffers from clean buffer cache pages when they
became such, and so were only suppressing the drop-the-dirty-bit
logic for pages that contain data we want to keep, I would be ok
with it. Just totally skipping buffer head freeing feels
wrong to me.

Eric


2007-10-16 21:29:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> + /*
> + * ram device BLKFLSBUF has special semantics, we want to actually
> + * release and destroy the ramdisk data.
> + */

We won't be able to completely fix this for a while, but the fact
that BLKFLSBUF has special semantics has always been a major wart.
Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
deprecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
there are many tools that actually take advantage of this weird aspect
of ramdisks, so hopefully it's something we could remove in 18
months or so...

- Ted

2007-10-16 22:02:13

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wednesday 17 October 2007 05:06, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Tuesday 16 October 2007 08:42, Eric W. Biederman wrote:
> >> I have not observed this case but it is possible to get a dirty page
> >> cache with clean buffer heads if we get a clean ramdisk page with
> >> buffer heads generated by a filesystem calling __getblk and then write
> >> to that page from user space through the block device. Then we just
> >> need to hit the proper window and try_to_free_buffers() will mark that
> >> page clean and eventually drop it. Ouch!
> >>
> >> To fix this use the generic __set_page_dirty_buffers in the ramdisk
> >> code so that when we mark a page dirty we also mark it's buffer heads
> >> dirty.
> >
> > Hmm, so we can also have some filesystems writing their own buffers
> > out by hand (clear_buffer_dirty, submit buffer for IO). Other places
> > will do similarly dodgy things with filesystem metadata
> > (fsync_buffers_list, for example).
> >
> > So your buffers get cleaned again, then your pages get cleaned.
>
> So I just took a little bit of time to look at and think about
> the path you are referring to, and I don't see a problem.
>
> The rule with the buffer dirty bit is that you first clear it
> and then you submit the write. When the write request finally
> makes it's way to rd.c we copy the data if necessary and call
> set_page_dirty. Which will then mark the page and the buffer
> dirty again.

Oh, maybe you're right. I didn't see it redirty the page there.


> In essence the ramdisk code just attempts to lock buffers in
> ram by setting their dirty bit, just like we do for pages
> in ramfs.

Yeah, which is half the reason why it's so complicated. Logically
it should just hold another reference on the pages rather than
interfere with pagecache state, but it can't do that because it
doesn't always know when a new page is inserted.


> > While I said it was a good fix when I saw the patch earlier, I think
> > it's not closing the entire hole, and as such, Christian's patch is
> > probably the way to go for stable.
>
> I thought through the logic in try_to_free_buffers and it actually
> makes sense to me now. mark_buffer_dirty sets the page dirty bit
> so dirty buffers reside on dirty pages. When we submit I/O we
> aren't guaranteed to submit all of the dirty buffers for a page
> at once, so we don't clear the page dirty bit. With the result
> that we can end up with pages with the dirty bit set but all of
> their buffers are clean.

Yep.


> Since we rather deliberately allow truly clean pages to be dropped
> from the ramdisk overriding try_to_free_buffer_pages looks wrong
> because then except for invalidate we can not remove buffers
> from truly clean pages.

Yeah, if your fix works I guess it is better to use it and converge
code rather than diverge it even more.

2007-10-16 22:03:32

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> > + /*
> > + * ram device BLKFLSBUF has special semantics, we want to actually
> > + * release and destroy the ramdisk data.
> > + */
>
> We won't be able to fix completely this for a while time, but the fact
> that BLKFLSBUF has special semantics has always been a major wart.
> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
> there are many tools that actually take advantage of this wierd aspect
> of ramdisks, so hopefully it's something we could remove in a 18
> months or so...

It would be nice to be able to do that, I agree. The new ramdisk
code will be able to flush the buffer cache and destroy its data
separately, so it can actually be implemented.

2007-10-16 23:48:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

Nick Piggin <[email protected]> writes:

> On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
>> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
>> > + /*
>> > + * ram device BLKFLSBUF has special semantics, we want to actually
>> > + * release and destroy the ramdisk data.
>> > + */
>>
>> We won't be able to fix completely this for a while time, but the fact
>> that BLKFLSBUF has special semantics has always been a major wart.
>> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
>> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
>> there are many tools that actually take advantage of this wierd aspect
>> of ramdisks, so hopefully it's something we could remove in a 18
>> months or so...
>
> It would be nice to be able to do that, I agree. The new ramdisk
> code will be able to flush the buffer cache and destroy its data
> separately, so it can actually be implemented.

So the practical problem is people's legacy boot setups, but those
are quickly going away.

The sane thing is probably something that can be taken as a low
level format command for the block device.

Say: dd if=/dev/zero of=/dev/ramX

I know rewriting the drive with all zeroes can cause a modern
disk to redo its low-level format. And that is something
we can definitely implement without any backwards compatibility
problems.

Hmm. Do we have anything special for punching holes in files?
That would be another sane route to take to remove the special
case for clearing the memory.

Eric

2007-10-17 00:22:58

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
> >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
> >> > + /*
> >> > + * ram device BLKFLSBUF has special semantics, we want to actually
> >> > + * release and destroy the ramdisk data.
> >> > + */
> >>
> >> We won't be able to fix completely this for a while time, but the fact
> >> that BLKFLSBUF has special semantics has always been a major wart.
> >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
> >> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
> >> there are many tools that actually take advantage of this wierd aspect
> >> of ramdisks, so hopefully it's something we could remove in a 18
> >> months or so...
> >
> > It would be nice to be able to do that, I agree. The new ramdisk
> > code will be able to flush the buffer cache and destroy its data
> > separately, so it can actually be implemented.
>
> So the practical problem are peoples legacy boot setups but those
> are quickly going away.

After that, is the ramdisk useful for anything aside from testing?


> The sane thing is probably something that can be taken as a low
> level format command for the block device.
>
> Say: dd if=/dev/zero of=/dev/ramX

We have 2 problems. First is that, for testing/consistency, we
don't want BLKFLSBUF to throw out the data. Maybe hardly anything
uses BLKFLSBUF now, so it could be just a minor problem, but still
one to fix.

Second is actually throwing out the ramdisk data. dd from /dev/zero
isn't trivial because it isn't a "command" from the kernel's POV.
rd could examine the writes to see if they are zero and page aligned,
I suppose... but if you're transitioning everyone over to a new
method anyway, might as well make it a nice one ;)


> I know rewriting the drive with all zeroes can cause a modern
> disk to redo it's low level format. And that is something
> we can definitely implement without any backwards compatibility
> problems.
>
> Hmm. Do we have anything special for punching holes in files?
> That would be another sane route to take to remove the special
> case for clearing the memory.

truncate_range, I suppose. A file descriptor syscall based
alternative for madvise would be nice though (like fallocate).

We could always put something in /sys/block/ram*/xxx

2007-10-17 01:14:38

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

Nick Piggin <[email protected]> writes:

> On Wednesday 17 October 2007 09:48, Eric W. Biederman wrote:
>> Nick Piggin <[email protected]> writes:
>> > On Wednesday 17 October 2007 07:28, Theodore Tso wrote:
>> >> On Tue, Oct 16, 2007 at 05:47:12PM +1000, Nick Piggin wrote:
>> >> > + /*
>> >> > + * ram device BLKFLSBUF has special semantics, we want to actually
>> >> > + * release and destroy the ramdisk data.
>> >> > + */
>> >>
>> >> We won't be able to fix completely this for a while time, but the fact
>> >> that BLKFLSBUF has special semantics has always been a major wart.
>> >> Could we perhaps create a new ioctl, say RAMDISKDESTORY, and add a
>> >> deperecation printk for BLKFLSBUF when passed to the ramdisk? I doubt
>> >> there are many tools that actually take advantage of this wierd aspect
>> >> of ramdisks, so hopefully it's something we could remove in a 18
>> >> months or so...
>> >
>> > It would be nice to be able to do that, I agree. The new ramdisk
>> > code will be able to flush the buffer cache and destroy its data
>> > separately, so it can actually be implemented.
>>
>> So the practical problem are peoples legacy boot setups but those
>> are quickly going away.
>
> After that, is the ramdisk useful for anything aside from testing?
>
>
>> The sane thing is probably something that can be taken as a low
>> level format command for the block device.
>>
>> Say: dd if=/dev/zero of=/dev/ramX
>
> We have 2 problems. First is that, for testing/consistency, we
> don't want BLKFLSBUF to throw out the data. Maybe hardly anything
> uses BLKFLSBUF now, so it could be just a minor problem, but still
> one to fix.

Hmm. This is interesting, because if we don't special-case BLKFLSBUF
we won't be doing anything that affects correctness, just something
that affects efficiency. So I think we can get away with just
changing BLKFLSBUF as long as there is a way to get rid of
the data.

> Second is actually throwing out the ramdisk data. dd from /dev/null
> isn't trivial because it isn't a "command" from the kernel's POV.
> rd could examine the writes to see if they are zero and page aligned,
> I suppose... but if you're transitioning everyone over to a new
> method anyway, might as well make it a nice one ;)

Well, I was thinking you can examine the page you just wrote to,
and if it is all zeros you don't need to cache that page anymore.
Call it intelligent compression.

Further it does make forwards and backwards compatibility simple
because all you would have to do to reliably free a ramdisk is:

dd if=/dev/zero of=/dev/ramX
blockdev --flushbufs /dev/ramX
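
A sketch of that zero-page test, as it might sit in the ramdisk's write
path (names are illustrative, this is not an actual patch):

	static int ramdisk_page_is_zero(struct page *page)
	{
		unsigned long *p = kmap_atomic(page, KM_USER0);
		int i, zero = 1;

		/* an absent ramdisk page reads back as zeroes anyway */
		for (i = 0; i < PAGE_SIZE / sizeof(*p); i++) {
			if (p[i]) {
				zero = 0;
				break;
			}
		}
		kunmap_atomic(p, KM_USER0);
		return zero;
	}

After a WRITE lands in a page, the driver could drop the page from its
cache whenever this returns true.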


>> I know rewriting the drive with all zeroes can cause a modern
>> disk to redo it's low level format. And that is something
>> we can definitely implement without any backwards compatibility
>> problems.
>>
>> Hmm. Do we have anything special for punching holes in files?
>> That would be another sane route to take to remove the special
>> case for clearing the memory.
>
> truncate_range, I suppose. A file descriptor syscall based
> alternative for madvise would be nice though (like fallocate).
>
> We could always put something in /sys/block/ram*/xxx

I guess when I look at this it looks like an operation that
is unique to a ramdisk. Real hard drives have low-level
format operations and the like. If we can find something
standard there that is guaranteed to trash your data we
can use that, and go from less consistency to more.

Eric

2007-10-17 01:48:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Wednesday 17 October 2007 11:13, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:

> > We have 2 problems. First is that, for testing/consistency, we
> > don't want BLKFLSBUF to throw out the data. Maybe hardly anything
> > uses BLKFLSBUF now, so it could be just a minor problem, but still
> > one to fix.
>
> Hmm. This is interesting because we won't be doing anything that
> effects correctness if we don't special case BLKFLSBUF just something
> that effects efficiency. So I think we can get away with just
> changing blkflsbuf as long as there is a way to get rid of
> the data.

Technically, it does change correctness: after BLKFLSBUF, the
ramdisk should contain zeroes.

I'm assuming it would also cause problems in tight embedded
environments if ramdisk ram is supposed to be thrown away but
isn't. So maybe not technically a correctness problem, but could
be the difference between working and not working.


> > Second is actually throwing out the ramdisk data. dd from /dev/null
> > isn't trivial because it isn't a "command" from the kernel's POV.
> > rd could examine the writes to see if they are zero and page aligned,
> > I suppose... but if you're transitioning everyone over to a new
> > method anyway, might as well make it a nice one ;)
>
> Well I was thinking you can examine the page you just wrote to
> and if it is all zero's you don't need to cache that page anymore.
> Call it intelligent compression.
>
> Further it does make forwards and backwards compatibility simple
> because all you would have to do to reliably free a ramdisk is:
>
> dd if=/dev/zero of=/dev/ramX
> blockdev --flushbufs /dev/ramX

Sure, you could do that, but you still presumably need to support
the old behaviour.

As a test vehicle for filesystems, I'd much rather it didn't do this
of course, because subsequent writes would need to reallocate the
page again.

2007-10-17 10:31:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

Nick Piggin <[email protected]> writes:

> On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
>> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
>
>> > > What magic restrictions on page allocations? Actually we have
>> > > fewer restrictions on page allocations because we can use
>> > > highmem!
>> >
>> > With the proposed rewrite yes.
>
> Here's a quick first hack...
>
> Comments?

I have beaten my version of this into working shape, and things
seem ok.

However I'm beginning to think that the real solution is to remove
the dependence on buffer heads for caching the disk mapping for
data pages, and move the metadata buffer heads off of the block
device page cache pages. Although I am just a touch concerned
there may be an issue with using filesystem tools while the
filesystem is mounted if I move the metadata buffer heads.

If we were to move the metadata buffer heads (assuming I haven't
missed some weird dependency) then I think there are a bunch of
weird corner cases that would be simplified.

I guess that is where I look next.

Oh, for what it is worth, I took a quick look at fsblock and I don't think
struct fsblock makes much sense as a block mapping translation layer for
the data path, where the current page cache works well. For less
than the cost of 1 fsblock I can cache all of the translations for a
1K filesystem on a 4K page.

I haven't looked to see if fsblock makes sense to use as a buffer head
replacement yet.

Anyway off to bed with me.

Eric

2007-10-17 12:50:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
> >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
> >> > > What magic restrictions on page allocations? Actually we have
> >> > > fewer restrictions on page allocations because we can use
> >> > > highmem!
> >> >
> >> > With the proposed rewrite yes.
> >
> > Here's a quick first hack...
> >
> > Comments?
>
> I have beaten my version of this into working shape, and things
> seem ok.
>
> However I'm beginning to think that the real solution is to remove
> the dependence on buffer heads for caching the disk mapping for
> data pages, and move the metadata buffer heads off of the block
> device page cache pages. Although I am just a touch concerned
> there may be an issue with using filesystem tools while the
> filesystem is mounted if I move the metadata buffer heads.
>
> If we were to move the metadata buffer heads (assuming I haven't
> missed some weird dependency) then I think there a bunch of
> weird corner cases that would be simplified.

I'd prefer just to go along the lines of what I posted. It's
a pure block device driver which knows absolutely nothing about
vm or vfs.

What are you guys using rd for, and is it really important to
have this supposed buffercache optimisation?


> I guess that is where I look next.
>
> Oh for what it is worth I took a quick look at fsblock and I don't think
> struct fsblock makes much sense as a block mapping translation layer for
> the data path where the current page caches works well.

Except the page cache doesn't store any such translation. fsblock
works very nicely as a buffer_head and nobh-mode replacement there,
but we could probably go one better (for many filesystems) by using
a tree.


> For less
> then the cost of 1 fsblock I can cache all of the translations for a
> 1K filesystem on a 4K page.

I'm not exactly sure what you mean, unless you're talking about an
extent based data structure. fsblock is fairly slim as far as a
generic solution goes. You could probably save 4 or 8 bytes in it,
but I don't know if it's worth the trouble.


> I haven't looked to see if fsblock makes sense to use as a buffer head
> replacement yet.

Well I don't want to get too far off topic on the subject of fsblock,
and block mappings (because I think the ramdisk driver simply should
have nothing to do with any of that)... but fsblock is exactly a
buffer head replacement (so if it doesn't make sense, then I've
screwed something up badly! ;))

2007-10-17 16:14:51

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Eric,

On Tuesday, 16 October 2007, Christian Borntraeger wrote:
> On Tuesday, 16 October 2007, Eric W. Biederman wrote:
>
> > fs/buffer.c | 3 +++
> > 1 files changed, 3 insertions(+), 0 deletions(-)
> > drivers/block/rd.c | 13 +------------
> > 1 files changed, 1 insertions(+), 12 deletions(-)
>
> Your patches look sane so far. I have applied both patches, and the problem
> seems gone. I will try to get these patches to our testers.
>
> As long as they dont find new problems:

Our testers did only a short test, and then they were stopped by problems with
reiserfs. At the moment I cannot say for sure if your patch caused this, but
we got the following BUG

ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr storage.
------------[ cut here ]------------
kernel BUG at /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
illegal operation: 0001 [#1]
Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
CPU: 3 Not tainted
Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
Krnl GPRS: 00000002 7411b5c8 0000002b 00000000
7b04d000 00000001 00000000 76d1de00
7513eec0 00000003 00000012 77f77680
7411b608 fb343b7e fb34404a 7513ee50
Krnl Code: fb344374: a7210002 tmll %r2,2
fb344378: a7840004 brc 8,fb344380
fb34437c: a7f40001 brc 15,fb34437e
>fb344380: 5810d8c2 l %r1,2242(%r13)
fb344384: 5820b03c l %r2,60(%r11)
fb344388: 0de1 basr %r14,%r1
fb34438a: 5810d90e l %r1,2318(%r13)
fb34438e: 5820b03c l %r2,60(%r11)


Looking at the code, this really seems related to dirty buffers, so your patch
is the main suspect at the moment.

if (!barrier) {
/* If there was a write error in the journal - we can't commit
* this transaction - it will be invalid and, if successful,
* will just end up propagating the write error out to
* the file system. */
if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
if (buffer_dirty(jl->j_commit_bh))
1117----> BUG();
mark_buffer_dirty(jl->j_commit_bh) ;
sync_dirty_buffer(jl->j_commit_bh) ;
}
}

Christian

2007-10-17 17:58:46

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Christian Borntraeger <[email protected]> writes:

> Eric,
>
> On Tuesday, 16 October 2007, Christian Borntraeger wrote:
>> On Tuesday, 16 October 2007, Eric W. Biederman wrote:
>>
>> > fs/buffer.c | 3 +++
>> > 1 files changed, 3 insertions(+), 0 deletions(-)
>> > drivers/block/rd.c | 13 +------------
>> > 1 files changed, 1 insertions(+), 12 deletions(-)
>>
>> Your patches look sane so far. I have applied both patches, and the problem
>> seems gone. I will try to get these patches to our testers.
>>
>> As long as they dont find new problems:
>
> Our testers did only a short test, and then they were stopped by problems with
> reiserfs. At the moment I cannot say for sure if your patch caused this, but
> we got the following BUG

Thanks.

> ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr
> storage.
> ------------[ cut here ]------------
> kernel BUG at
> /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
> illegal operation: 0001 [#1]
> Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
> CPU: 3 Not tainted
> Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
> Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
> R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
> Krnl GPRS: 00000002 7411b5c8 0000002b 00000000
> 7b04d000 00000001 00000000 76d1de00
> 7513eec0 00000003 00000012 77f77680
> 7411b608 fb343b7e fb34404a 7513ee50
> Krnl Code: fb344374: a7210002 tmll %r2,2
> fb344378: a7840004 brc 8,fb344380
> fb34437c: a7f40001 brc 15,fb34437e
> >fb344380: 5810d8c2 l %r1,2242(%r13)
> fb344384: 5820b03c l %r2,60(%r11)
> fb344388: 0de1 basr %r14,%r1
> fb34438a: 5810d90e l %r1,2318(%r13)
> fb34438e: 5820b03c l %r2,60(%r11)
>
>
> Looking at the code, this really seems related to dirty buffers, so your patch
> is the main suspect at the moment.

Sounds reasonable.

> if (!barrier) {
> /* If there was a write error in the journal - we can't commit
> * this transaction - it will be invalid and, if successful,
> * will just end up propagating the write error out to
> * the file system. */
> if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
> if (buffer_dirty(jl->j_commit_bh))
> 1117----> BUG();
> mark_buffer_dirty(jl->j_commit_bh) ;
> sync_dirty_buffer(jl->j_commit_bh) ;
> }
> }

Grr. I'm not certain what to make of that.

Given that I should also be able to trigger this case by writing to
the block device through the buffer cache (to the right spot at the
right moment), this feels like a reiserfs bug. Although I feel like
screaming about filesystems that go BUG instead of remounting read-only....

At the same time I increasingly don't think we should allow user space
to dirty or update our filesystem metadata buffer heads. That seems
like asking for trouble.

Did you have both of my changes applied?
To init_page_buffers() and to ramdisk_set_page_dirty?

I'm guessing it was the change to ramdisk_set_page_dirty....

Eric

2007-10-17 18:46:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

Nick Piggin <[email protected]> writes:

> On Wednesday 17 October 2007 20:30, Eric W. Biederman wrote:
>> Nick Piggin <[email protected]> writes:
>> > On Tuesday 16 October 2007 18:08, Nick Piggin wrote:
>> >> On Tuesday 16 October 2007 14:57, Eric W. Biederman wrote:
>> >> > > What magic restrictions on page allocations? Actually we have
>> >> > > fewer restrictions on page allocations because we can use
>> >> > > highmem!
>> >> >
>> >> > With the proposed rewrite yes.
>> >
>> > Here's a quick first hack...
>> >
>> > Comments?
>>
>> I have beaten my version of this into working shape, and things
>> seem ok.
>>
>> However I'm beginning to think that the real solution is to remove
>> the dependence on buffer heads for caching the disk mapping for
>> data pages, and move the metadata buffer heads off of the block
>> device page cache pages. Although I am just a touch concerned
>> there may be an issue with using filesystem tools while the
>> filesystem is mounted if I move the metadata buffer heads.
>>
>> If we were to move the metadata buffer heads (assuming I haven't
>> missed some weird dependency) then I think there a bunch of
>> weird corner cases that would be simplified.
>
> I'd prefer just to go along the lines of what I posted. It's
> a pure block device driver which knows absolutely nothing about
> vm or vfs.
>
> What are you guys using rd for, and is it really important to
> have this supposed buffercache optimisation?

Well what brought this up for me was old user space code using
an initial ramdisk. The actual failure that I saw occurred on
the read path. And fixing init_page_buffers was the real world
fix.

At the moment I'm messing with it because it has become the
itch I've decided to scratch. So at the moment I'm having fun,
learning the block layer, refreshing my VM knowledge and getting my
head around this wreck that we call buffer_heads. The high level
concept of buffer_heads may be sane but the implementation seems to
export a lot of nasty state.

At this point my concern is what makes a clean code change in the
kernel. Because user space can currently play with buffer_heads
by way of the block device and cause lots of havoc (see the recent
reiserfs bug in this thread), I increasingly think
metadata buffer_heads should not share storage with the block
device page cache.

If that change is made then it happens that the current ramdisk
would not need to worry about buffer heads and all of that
nastiness and could just lock pages in the page cache. It would not
be quite as good for testing filesystems but retaining the existing
characteristics would be simple.

After having looked a bit deeper the buffer_heads and the block
devices don't look as intricately tied up as I had first thought.
We still have the nasty case of:
if (buffer_new(bh))
unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
I don't know how that got merged. But otherwise the caches
are fully separate.

So currently it looks to me like there are two big things that will
clean up that part of the code a lot:
- moving the metadata buffer_heads to a magic filesystem inode.
- Using a simpler non-buffer_head returning version of get_block
so we can make simple generic code for generating BIOs.


>> I guess that is where I look next.
>>
>> Oh for what it is worth I took a quick look at fsblock and I don't think
>> struct fsblock makes much sense as a block mapping translation layer for
>> the data path where the current page caches works well.
>
> Except the page cache doesn't store any such translation. fsblock
> works very nicely as a buffer_head and nobh-mode replacement there,
> but we could probably go one better (for many filesystems) by using
> a tree.
>
>> For less
>> then the cost of 1 fsblock I can cache all of the translations for a
>> 1K filesystem on a 4K page.
>
> I'm not exactly sure what you mean, unless you're talking about an
> extent based data structure. fsblock is fairly slim as far as a
> generic solution goes. You could probably save 4 or 8 bytes in it,
> but I don't know if it's worth the trouble.

As a metadata cache manager, perhaps; but for a translation cache we need
8 bytes per page max.

However all we need for a generic translation cache (assuming we still
want one) is an array of sector_t per page.

So what we would want is:
int blkbits_per_page = PAGE_CACHE_SHIFT - inode->i_blkbits;
if (blkbits_per_page <= 0)
blkbits_per_page = 0;
sector_t *blocks = kmalloc(sizeof(sector_t) << blkbits_per_page, GFP_KERNEL);

And to remember if we have stored the translation:
#define UNMAPPED_SECTOR ((sector_t)-1)

...

The core of all of this being something like:
#define MAX_BLOCKS_PER_PAGE (1 << (PAGE_CACHE_SHIFT - 9))
typedef int (page_blocks_t)(struct page *page,
sector_t blocks[MAX_BLOCKS_PER_PAGE],
int create);
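
Presumably each per-page array would start out with every slot unmapped,
along the lines of (illustrative only):

	int i, nblocks = 1 << blkbits_per_page;

	for (i = 0; i < nblocks; i++)
		blocks[i] = UNMAPPED_SECTOR;	/* no translation cached yet */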


>> I haven't looked to see if fsblock makes sense to use as a buffer head
>> replacement yet.
>
> Well I don't want to get too far off topic on the subject of fsblock,
> and block mappings (because I think the ramdisk driver simply should
> have nothing to do with any of that)...
Which I can agree with.

> but fsblock is exactly a buffer head replacement (so if it doesn't
> make sense, then I've screwed something up badly! ;))

By definition!

Eric



2007-10-17 19:16:25

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wed, 2007-10-17 at 11:57 -0600, Eric W. Biederman wrote:
> Christian Borntraeger <[email protected]> writes:
>
> > Eric,
> >
> > On Tuesday, 16 October 2007, Christian Borntraeger wrote:
> >> On Tuesday, 16 October 2007, Eric W. Biederman wrote:
> >>
> >> > fs/buffer.c | 3 +++
> >> > 1 files changed, 3 insertions(+), 0 deletions(-)
> >> > drivers/block/rd.c | 13 +------------
> >> > 1 files changed, 1 insertions(+), 12 deletions(-)
> >>
> >> Your patches look sane so far. I have applied both patches, and the problem
> >> seems gone. I will try to get these patches to our testers.
> >>
> >> As long as they dont find new problems:
> >
> > Our testers did only a short test, and then they were stopped by problems with
> > reiserfs. At the moment I cannot say for sure if your patch caused this, but
> > we got the following BUG
>
> Thanks.
>
> > ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr
> > storage.
> > ------------[ cut here ]------------
> > kernel BUG at
> > /home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
> > illegal operation: 0001 [#1]
> > Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
> > CPU: 3 Not tainted
> > Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
> > Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
> > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
> > Krnl GPRS: 00000002 7411b5c8 0000002b 00000000
> > 7b04d000 00000001 00000000 76d1de00
> > 7513eec0 00000003 00000012 77f77680
> > 7411b608 fb343b7e fb34404a 7513ee50
> > Krnl Code: fb344374: a7210002 tmll %r2,2
> > fb344378: a7840004 brc 8,fb344380
> > fb34437c: a7f40001 brc 15,fb34437e
> > >fb344380: 5810d8c2 l %r1,2242(%r13)
> > fb344384: 5820b03c l %r2,60(%r11)
> > fb344388: 0de1 basr %r14,%r1
> > fb34438a: 5810d90e l %r1,2318(%r13)
> > fb34438e: 5820b03c l %r2,60(%r11)
> >
> >
> > Looking at the code, this really seems related to dirty buffers, so your patch
> > is the main suspect at the moment.
>
> Sounds reasonable.
>
> > if (!barrier) {
> > /* If there was a write error in the journal - we can't commit
> > * this transaction - it will be invalid and, if successful,
> > * will just end up propagating the write error out to
> > * the file system. */
> > if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
> > if (buffer_dirty(jl->j_commit_bh))
> > 1117----> BUG();
> > mark_buffer_dirty(jl->j_commit_bh) ;
> > sync_dirty_buffer(jl->j_commit_bh) ;
> > }
> > }
>
> Grr. I'm not certain how to call that.
>
> Given that I should also be able to trigger this case by writing to
> the block device through the buffer cache (to the write spot at the
> write moment) this feels like a reiserfs bug.
> Although I feel
> screaming about filesystems that go BUG instead of remount read-only....
>

In this case, the commit block isn't allowed to be dirty before reiserfs
decides it is safe to write it. The journal code expects it is the only
spot in the kernel setting buffer heads dirty, and it only does so after
the rest of the log blocks are safely on disk.

Given this is a ramdisk, the check can be ignored, but I'd rather not
sprinkle if (ram_disk) into the FS code....

> At the same time I increasingly don't think we should allow user space
> to dirty or update our filesystem metadata buffer heads. That seems
> like asking for trouble.
>

Demanding trouble ;)

-chris

2007-10-17 20:31:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Chris Mason <[email protected]> writes:

> In this case, the commit block isn't allowed to be dirty before reiserfs
> decides it is safe to write it. The journal code expects it is the only
> spot in the kernel setting buffer heads dirty, and it only does so after
> the rest of the log blocks are safely on disk.

Ok. So the journal code here fundamentally depends on being able to
control the order of the writes, and something else being able to set
the buffer head dirty messes up that control.

> Given this is a ramdisk, the check can be ignored, but I'd rather not
> sprinkle if (ram_disk) into the FS code....

It makes no sense to implement a ramdisk in a way that requires this.


>> At the same time I increasingly don't think we should allow user space
>> to dirty or update our filesystem metadata buffer heads. That seems
>> like asking for trouble.
>>
>
> Demanding trouble ;)

Looks like it. There are even comments in jbd about the same class
of problems. Apparently dump and tune2fs on mounted filesystems have
triggered some of these issues. The practical question is whether any of
this trouble is worth handling.

Thinking about it, I don't believe anyone has ever intentionally built
a filesystem tool that depends on being able to modify a filesystem's
metadata buffer heads while the filesystem is running, and doing that
would seem to be fragile, as it would require a lot of cooperation
between the tool and the filesystem about how the filesystem uses and
implements things.

Now I guess I need to see how difficult a patch would be to give
filesystems magic inodes to keep their metadata buffer heads in.

Eric

2007-10-17 20:57:19

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wed, 2007-10-17 at 14:29 -0600, Eric W. Biederman wrote:
> Chris Mason <[email protected]> writes:
>
> > In this case, the commit block isn't allowed to be dirty before reiserfs
> > decides it is safe to write it. The journal code expects it is the only
> > spot in the kernel setting buffer heads dirty, and it only does so after
> > the rest of the log blocks are safely on disk.
>
> Ok. So the journal code here fundamentally depends on being able to
> control the order of the writes, and something else being able to set
> the buffer head dirty messes up that control.
>

Right.

> >> At the same time I increasingly don't think we should allow user space
> >> to dirty or update our filesystem metadata buffer heads. That seems
> >> like asking for trouble.
> >>
> >
> > Demanding trouble ;)
>
> Looks like it. There are even comments in jbd about the same class
> of problems. Apparently dump and tune2fs on mounted filesystems have
> triggered some of these issues. The practical question is any of this
> trouble worth handling.
>
> Thinking about it. I don't believe anyone has ever intentionally built
> a filesystem tool that depends on being able to modify a file systems
> metadata buffer heads while the filesystem is running, and doing that
> would seem to be fragile as it would require a lot of cooperation
> between the tool and the filesystem about how the filesystem uses and
> implement things.
>

That's right. For example, ext2 is doing directories in the page cache
of the directory inode, so there's a cache alias between the block
device page cache and the directory inode page cache.

> Now I guess I need to see how difficult a patch would be to give
> filesystems magic inodes to keep their metadata buffer heads in.

Not hard, the block device inode is already a magic inode for metadata
buffer heads. You could just make another one attached to the bdev.

But, I don't think I fully understand the problem you're trying to
solve?

-chris


2007-10-17 21:33:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Chris Mason <[email protected]> writes:

>> Thinking about it. I don't believe anyone has ever intentionally built
>> a filesystem tool that depends on being able to modify a file systems
>> metadata buffer heads while the filesystem is running, and doing that
>> would seem to be fragile as it would require a lot of cooperation
>> between the tool and the filesystem about how the filesystem uses and
>> implement things.
>>
>
> That's right. For example, ext2 is doing directories in the page cache
> of the directory inode, so there's a cache alias between the block
> device page cache and the directory inode page cache.
>
>> Now I guess I need to see how difficult a patch would be to give
>> filesystems magic inodes to keep their metadata buffer heads in.
>
> Not hard, the block device inode is already a magic inode for metadata
> buffer heads. You could just make another one attached to the bdev.
>
> But, I don't think I fully understand the problem you're trying to
> solve?


So, from the start:
When we write buffers from the buffer cache we clear buffer_dirty
but not PageDirty.

So try_to_free_buffers() will mark clean any page whose buffer_heads
are all clean, even if the page itself is still dirty.

The ramdisk sets pages dirty to keep them from being removed from the
page cache, just like ramfs.

Unfortunately when those dirty ramdisk pages get buffers on them and
those buffers all go clean and we are trying to reclaim buffer_heads
we drop those pages from the page cache. Ouch!
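
In rough outline (simplified, this is not the exact fs/buffer.c code)
the path that bites us looks like:

	static int try_to_free_buffers_sketch(struct page *page)
	{
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;

		do {
			/* any dirty or busy buffer keeps the page alive */
			if (buffer_dirty(bh) || buffer_locked(bh))
				return 0;
			bh = bh->b_this_page;
		} while (bh != head);

		/*
		 * Every buffer is clean, so the page's dirty bit is
		 * cancelled and the caller may strip the buffers and
		 * reclaim the page, ramdisk data and all.
		 */
		cancel_dirty_page(page, PAGE_CACHE_SIZE);
		return 1;
	}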

We can fix the ramdisk by making certain that buffer_heads
on ramdisk pages stay dirty as well. The problem is this breaks
filesystems like reiserfs and ext3 that expect to be able to make
buffer_heads clean sometimes.

There are other ways to solve this for ramdisks, such as changing
where ramdisks are stored. However fixing the ramdisks this way
still leaves the general problem that there are other paths to the
filesystem metadata buffers, and those other paths cause the code
to be complicated and buggy.

So I'm trying to see if we can untangle this Gordian knot, so the
code becomes more easily maintainable.

To make the buffer cache a helper library instead of required
infrastructure, it looks like two things need to happen.
- Move metadata buffer heads off block devices page cache entries,
- Communicate the mappings of data pages to block device sectors
in a generic way without buffer heads.

How we ultimately fix the ramdisk tends to depend on how we untangle
the buffer head problem. Right now the only simple solution is to
suppress try_to_free_buffers, which is a bit ugly. We can also come
up with a completely separate store for the pages in the buffer cache
but if we wind up moving the metadata buffer heads anyway then that
should not be necessary.

Eric

2007-10-17 21:49:40

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wednesday, 17 October 2007, Eric W. Biederman wrote:
> Did you have both of my changes applied?
> To init_page_buffers() and to ramdisk_set_page_dirty?

Yes, I removed my patch and applied both patches from you.

Christian

2007-10-17 22:23:13

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Christian Borntraeger <[email protected]> writes:

> On Wednesday, 17 October 2007, Eric W. Biederman wrote:
>> Did you have both of my changes applied?
>> To init_page_buffer() and to the ramdisk_set_dirty_page?
>
> Yes, I removed my patch and applied both patches from you.

Thanks.

Grr. Inconsistent rules on a core piece of infrastructure.
It looks like, if there is any trivial/minimal fix, it
is based on your patch suppressing try_to_free_buffers. Ugh.

Eric

2007-10-17 22:59:00

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wed, 2007-10-17 at 15:30 -0600, Eric W. Biederman wrote:
> Chris Mason <[email protected]> writes:
>
> >> Thinking about it. I don't believe anyone has ever intentionally built
> >> a filesystem tool that depends on being able to modify a file systems
> >> metadata buffer heads while the filesystem is running, and doing that
> >> would seem to be fragile as it would require a lot of cooperation
> >> between the tool and the filesystem about how the filesystem uses and
> >> implement things.
> >>
> >
> > That's right. For example, ext2 is doing directories in the page cache
> > of the directory inode, so there's a cache alias between the block
> > device page cache and the directory inode page cache.
> >
> >> Now I guess I need to see how difficult a patch would be to give
> >> filesystems magic inodes to keep their metadata buffer heads in.
> >
> > Not hard, the block device inode is already a magic inode for metadata
> > buffer heads. You could just make another one attached to the bdev.
> >
> > But, I don't think I fully understand the problem you're trying to
> > solve?
>
>
> So the start:
> When we write buffers from the buffer cache we clear buffer_dirty
> but not PageDirty
>
> So try_to_free_buffers() will mark any page with clean buffer_heads
> that is not clean itself clean.
>
> The ramdisk set pages dirty to keep them from being removed from the
> page cache, just like ramfs.
>
So, the problem is using the Dirty bit to indicate pinned. You're
completely right that our current setup of buffer heads and pages and
filesystem metadata is complex and difficult.

But, moving the buffer heads off of the page cache pages isn't going to
make it any easier to use dirty as pinned, especially in the face of
buffer_head users for file data pages.

You've already seen Nick fsblock code, but you can see my general
approach to replacing buffer heads here:

http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h

(alpha quality implementation in extent_map.c and users in inode.c) The
basic idea is to do extent based record keeping for mapping and state of
things in the filesystem, and to avoid attaching these things to the
page.

> Unfortunately when those dirty ramdisk pages get buffers on them and
> those buffers all go clean and we are trying to reclaim buffer_heads
> we drop those pages from the page cache. Ouch!
>
> We can fix the ramdisk by setting making certain that buffer_heads
> on ramdisk pages stay dirty as well. The problem is this breaks
> filesystems like reiserfs and ext3 that expect to be able to make
> buffer_heads clean sometimes.
>
> There are other ways to solve this for ramdisks, such as changing
> where ramdisks are stored. However fixing the ramdisks this way
> still leaves the general problem that there are other paths to the
> filesystem metadata buffers, and those other paths cause the code
> to be complicated and buggy.
>
> So I'm trying to see if we can untangle this Gordian knot, so the
> code because more easily maintainable.
>

Don't get me wrong, I'd love to see a simple and coherent fix for what
reiserfs and ext3 do with buffer head state, but I think for the short
term you're best off pinning the ramdisk pages via some other means.

-chris


2007-10-17 23:30:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Chris Mason <[email protected]> writes:

> So, the problem is using the Dirty bit to indicate pinned. You're
> completely right that our current setup of buffer heads and pages and
> filesystpem metadata is complex and difficult.
>
> But, moving the buffer heads off of the page cache pages isn't going to
> make it any easier to use dirty as pinned, especially in the face of
> buffer_head users for file data pages.

Let me be specific: not moving buffer_heads off of page cache pages,
but moving buffer_heads off of the block device's page cache pages.

My problem is the coupling of how block devices are cached and the
implementation of buffer heads, and by removing that coupling
we can generally make things better. Currently that coupling
means silly things like all block devices being cached in low memory,
which probably isn't what you want if you actually have a use
for block devices.

For the ramdisk case in particular what this means is that there
are no more users that create buffer_head mappings on the block
device cache so using the dirty bit will be safe.

Further it removes the nasty possibility of user space messing with
metadata buffer head state. So the only way those cases can happen is
a code bug, or a hardware bug.

So I think by removing these unnecessary code paths things will
become easier to work with.

> You've already seen Nick fsblock code, but you can see my general
> approach to replacing buffer heads here:
>
> http://oss.oracle.com/mercurial/mason/btrfs-unstable/file/f89e7971692f/extent_map.h
>
> (alpha quality implementation in extent_map.c and users in inode.c) The
> basic idea is to do extent based record keeping for mapping and state of
> things in the filesystem, and to avoid attaching these things to the
> page.

Interesting. Something to dig into.

> Don't get me wrong, I'd love to see a simple and coherent fix for what
> reiserfs and ext3 do with buffer head state, but I think for the short
> term you're best off pinning the ramdisk pages via some other means.

Yes. And the problem is hard enough to trigger that a short term fix
is actually of debatable value. The reason this hasn't shown up more
frequently is that it only ever triggers if you are in the buffer head
reclaim state, which on a 64bit box means you have to use < 4K buffers
and have your ram cache another block device. That plus most people
use initramfs these days.

For the short term we have Christian's other patch which simply
disables calling try_to_free_buffers. Although that really feels
like a hack to me.

For 2.6.25 I think I have a shot at fixing these things cleanly.

Eric

2007-10-18 00:04:53

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote:
> Chris Mason <[email protected]> writes:
>
> > So, the problem is using the Dirty bit to indicate pinned. You're
> > completely right that our current setup of buffer heads and pages and
> > filesystem metadata is complex and difficult.
> >
> > But, moving the buffer heads off of the page cache pages isn't going to
> > make it any easier to use dirty as pinned, especially in the face of
> > buffer_head users for file data pages.
>
> Let me be specific. Not moving buffer_heads off of page cache pages,
> moving buffer_heads off of the block devices page cache pages.
>
> My problem is the coupling of how block devices are cached and the
> implementation of buffer heads, and by removing that coupling
> we can generally make things better. Currently that coupling
> means silly things like all block devices are cached in low memory.
> Which probably isn't what you want if you actually have a use
> for block devices.
>
> For the ramdisk case in particular what this means is that there
> are no more users that create buffer_head mappings on the block
> device cache so using the dirty bit will be safe.

Ok, we move the buffer heads off to a different inode, and that inode
has pages. The pages on the inode still need to get pinned, how does
that pinning happen?

The problem you described where someone cleans a page because the buffer
heads are clean happens already without help from userland. So, keeping
the pages away from userland won't save them from cleaning.

Sorry if I'm reading your suggestion wrong...

-chris


2007-10-18 01:08:20

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch][rfc] rewrite ramdisk

On Thursday 18 October 2007 04:45, Eric W. Biederman wrote:
> At this point my concern is what makes a clean code change in the
> kernel. Because user space can currently play with buffer_heads
> by way of the block device and cause lots of havoc (see the recent

Well if userspace is writing to the filesystem metadata via the
blockdevice while it is running... that's the definition of havoc,
isn't it? ;) Whether or not the writes are going via a unified
metadata/blockdev cache or separate ones.

You really just have to not do that.

The actual reiserfs problem being seen is not because of userspace
going silly, but because ramdisk is hijacking the dirty bits.


> If that change is made then it happens that the current ramdisk
> would not need to worry about buffer heads and all of that
> nastiness and could just lock pages in the page cache. It would not
> be quite as good for testing filesystems but retaining the existing
> characteristics would be simple.

No, it wouldn't. Because if you're proposing to split up the buffer
cache and the metadata cache, then you're back to a 2 cache
solution which basically has the memory characteristics of my
proposal while still being horribly incestuous with the pagecache.


> After having looked a bit deeper the buffer_heads and the block
> devices don't look as intricately tied up as I had first thought.
> We still have the nasty case of:
> if (buffer_new(bh))
> unmap_underlying_metadata(bh->b_bdev, bh->b_blocknr);
> That I don't know how it got merged. But otherwise the caches
> are fully separate.

Well it's needed because some filesystems forget about their old
metadata. It's not really there to solve aliasing with the blockdev
pagecache.
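
The idea behind that call is roughly the following (an illustrative
sketch, not the actual fs/buffer.c implementation; the function name is
made up): when a filesystem allocates a block as new, any stale
buffer_head that the block device cache still holds for that block must
be forgotten, so that its old dirty state cannot later overwrite the
new contents.

#include <linux/buffer_head.h>

/* Illustrative only: the real logic lives in unmap_underlying_metadata()
 * in fs/buffer.c; this just shows the intent. */
static void sketch_forget_stale_metadata(struct block_device *bdev,
					 sector_t block, unsigned blocksize)
{
	struct buffer_head *bh;

	/* any buffer_head the block device cache already has for this block */
	bh = __find_get_block(bdev, block, blocksize);
	if (bh) {
		clear_buffer_dirty(bh);	/* stale data must never be written out */
		wait_on_buffer(bh);	/* let in-flight I/O on it finish */
		__brelse(bh);
	}
}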


> So currently it looks to me like there are two big things that will
> clean up that part of the code a lot:
> - moving the metadata buffer_heads to a magic filesystem inode.
> - Using a simpler non-buffer_head returning version of get_block
> so we can make simple generic code for generating BIOs.

Although this is going off the track of the ramdisk problem. For
that we should just do the rewrite.

2007-10-18 03:28:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Chris Mason <[email protected]> writes:

> On Wed, 2007-10-17 at 17:28 -0600, Eric W. Biederman wrote:
>> Chris Mason <[email protected]> writes:
>>
>> > So, the problem is using the Dirty bit to indicate pinned. You're
>> > completely right that our current setup of buffer heads and pages and
>> > filesystem metadata is complex and difficult.
>> >
>> > But, moving the buffer heads off of the page cache pages isn't going to
>> > make it any easier to use dirty as pinned, especially in the face of
>> > buffer_head users for file data pages.
>>
>> Let me be specific. Not moving buffer_heads off of page cache pages,
>> moving buffer_heads off of the block devices page cache pages.
>>
>> My problem is the coupling of how block devices are cached and the
>> implementation of buffer heads, and by removing that coupling
>> we can generally make things better. Currently that coupling
>> means silly things like all block devices are cached in low memory.
>> Which probably isn't what you want if you actually have a use
>> for block devices.
>>
>> For the ramdisk case in particular what this means is that there
>> are no more users that create buffer_head mappings on the block
>> device cache so using the dirty bit will be safe.
>
> Ok, we move the buffer heads off to a different inode, and that inode
> has pages. The pages on the inode still need to get pinned, how does
> that pinning happen?

> The problem you described where someone cleans a page because the buffer
> heads are clean happens already without help from userland. So, keeping
> the pages away from userland won't save them from cleaning.
>
> Sorry if I'm reading your suggestion wrong...

Yes. I was suggesting that we continue to pin the page cache pages of
the block device inode, and have the buffer cache live on some other
inode, thus not causing me problems by adding clean buffer_heads to my
dirty pages.

Eric

2007-10-18 04:00:07

by Eric W. Biederman

[permalink] [raw]
Subject: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.


If filesystems care at all they want absolute control over the buffer
cache. Controlling which buffers are dirty and when. Because we
keep the buffer cache in the page cache for the block device we have
not quite been giving filesystems that control leading to really weird
bugs.

In addition this tying of the implementation of block device caching
and the buffer cache has resulted in a much more complicated and
limited implementation than necessary. Block devices for example
don't need buffer_heads, and it is perfectly reasonable to cache
block devices in high memory.

To start untangling the worst of this mess this patch introduces a
second block device inode for the buffer cache. All buffer cache
operations are diverted to use the new bd_metadata_inode, which
keeps the weirdness of the metadata requirements isolated in their
own little world.

This should enable future cleanups to diverge and simplify the
address_space_operations of the buffer cache and block device
page cache.

As a side effect of this cleanup the current ramdisk code should
be safe from dropping pages because we never place any buffer heads
on ramdisk pages.

Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/block_dev.c | 45 ++++++++++++++++++++++++++++++++-------------
fs/buffer.c | 17 ++++++++++++-----
fs/ext3/dir.c | 2 +-
fs/ext4/dir.c | 2 +-
fs/fat/inode.c | 2 +-
include/linux/fs.h | 1 +
6 files changed, 48 insertions(+), 21 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 379a446..87a5760 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -59,10 +59,12 @@ static sector_t max_block(struct block_device *bdev)
/* Kill _all_ buffers and pagecache , dirty or not.. */
static void kill_bdev(struct block_device *bdev)
{
- if (bdev->bd_inode->i_mapping->nrpages == 0)
- return;
- invalidate_bh_lrus();
- truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+ if (bdev->bd_inode->i_mapping->nrpages)
+ truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+ if (bdev->bd_metadata_inode->i_mapping->nrpages) {
+ truncate_inode_pages(bdev->bd_metadata_inode->i_mapping, 0);
+ invalidate_bh_lrus();
+ }
}

int set_blocksize(struct block_device *bdev, int size)
@@ -80,6 +82,7 @@ int set_blocksize(struct block_device *bdev, int size)
sync_blockdev(bdev);
bdev->bd_block_size = size;
bdev->bd_inode->i_blkbits = blksize_bits(size);
+ bdev->bd_metadata_inode->i_blkbits = blksize_bits(size);
kill_bdev(bdev);
}
return 0;
@@ -114,7 +117,7 @@ static int
blkdev_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
- if (iblock >= max_block(I_BDEV(inode))) {
+ if (iblock >= max_block(inode->i_bdev)) {
if (create)
return -EIO;

@@ -126,7 +129,7 @@ blkdev_get_block(struct inode *inode, sector_t iblock,
*/
return 0;
}
- bh->b_bdev = I_BDEV(inode);
+ bh->b_bdev = inode->i_bdev;
bh->b_blocknr = iblock;
set_buffer_mapped(bh);
return 0;
@@ -136,7 +139,7 @@ static int
blkdev_get_blocks(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
- sector_t end_block = max_block(I_BDEV(inode));
+ sector_t end_block = max_block(inode->i_bdev);
unsigned long max_blocks = bh->b_size >> inode->i_blkbits;

if ((iblock + max_blocks) > end_block) {
@@ -152,7 +155,7 @@ blkdev_get_blocks(struct inode *inode, sector_t iblock,
}
}

- bh->b_bdev = I_BDEV(inode);
+ bh->b_bdev = inode->i_bdev;
bh->b_blocknr = iblock;
bh->b_size = max_blocks << inode->i_blkbits;
if (max_blocks)
@@ -167,7 +170,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;

- return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
+ return blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_bdev,
iov, offset, nr_segs, blkdev_get_blocks, NULL);
}

@@ -244,7 +247,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t pos, unsigned long nr_segs)
{
struct inode *inode = iocb->ki_filp->f_mapping->host;
- unsigned blkbits = blksize_bits(bdev_hardsect_size(I_BDEV(inode)));
+ unsigned blkbits = blksize_bits(bdev_hardsect_size(inode->i_bdev));
unsigned blocksize_mask = (1 << blkbits) - 1;
unsigned long seg = 0; /* iov segment iterator */
unsigned long nvec; /* number of bio vec needed */
@@ -292,7 +295,7 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,

/* bio_alloc should not fail with GFP_KERNEL flag */
bio = bio_alloc(GFP_KERNEL, nvec);
- bio->bi_bdev = I_BDEV(inode);
+ bio->bi_bdev = inode->i_bdev;
bio->bi_end_io = blk_end_aio;
bio->bi_private = iocb;
bio->bi_sector = pos >> blkbits;
@@ -498,6 +501,8 @@ static void bdev_clear_inode(struct inode *inode)
}
list_del_init(&bdev->bd_list);
spin_unlock(&bdev_lock);
+ iput(bdev->bd_metadata_inode);
+ bdev->bd_metadata_inode = NULL;
}

static const struct super_operations bdev_sops = {
@@ -566,7 +571,7 @@ static LIST_HEAD(all_bdevs);
struct block_device *bdget(dev_t dev)
{
struct block_device *bdev;
- struct inode *inode;
+ struct inode *inode, *md_inode;

inode = iget5_locked(bd_mnt->mnt_sb, hash(dev),
bdev_test, bdev_set, &dev);
@@ -574,6 +579,11 @@ struct block_device *bdget(dev_t dev)
if (!inode)
return NULL;

+ /* Get an anonymous inode for the filesystem metadata cache */
+ md_inode = new_inode(bd_mnt->mnt_sb);
+ if (!md_inode)
+ return NULL;
+
bdev = &BDEV_I(inode)->bdev;

if (inode->i_state & I_NEW) {
@@ -582,12 +592,19 @@ struct block_device *bdget(dev_t dev)
bdev->bd_block_size = (1 << inode->i_blkbits);
bdev->bd_part_count = 0;
bdev->bd_invalidated = 0;
+ bdev->bd_metadata_inode = md_inode;
inode->i_mode = S_IFBLK;
inode->i_rdev = dev;
inode->i_bdev = bdev;
inode->i_data.a_ops = &def_blk_aops;
mapping_set_gfp_mask(&inode->i_data, GFP_USER);
inode->i_data.backing_dev_info = &default_backing_dev_info;
+ md_inode->i_mode = S_IFBLK;
+ md_inode->i_rdev = dev;
+ md_inode->i_bdev = bdev;
+ md_inode->i_data.a_ops = &def_blk_aops;
+ mapping_set_gfp_mask(&md_inode->i_data, GFP_USER);
+ md_inode->i_data.backing_dev_info = &default_backing_dev_info;
spin_lock(&bdev_lock);
list_add(&bdev->bd_list, &all_bdevs);
spin_unlock(&bdev_lock);
@@ -604,7 +621,7 @@ long nr_blockdev_pages(void)
long ret = 0;
spin_lock(&bdev_lock);
list_for_each_entry(bdev, &all_bdevs, bd_list) {
- ret += bdev->bd_inode->i_mapping->nrpages;
+ ret += bdev->bd_metadata_inode->i_mapping->nrpages;
}
spin_unlock(&bdev_lock);
return ret;
@@ -1099,6 +1116,7 @@ void bd_set_size(struct block_device *bdev, loff_t size)
unsigned bsize = bdev_hardsect_size(bdev);

bdev->bd_inode->i_size = size;
+ bdev->bd_metadata_inode->i_size = size;
while (bsize < PAGE_CACHE_SIZE) {
if (size & bsize)
break;
@@ -1106,6 +1124,7 @@ void bd_set_size(struct block_device *bdev, loff_t size)
}
bdev->bd_block_size = bsize;
bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+ bdev->bd_metadata_inode->i_blkbits = blksize_bits(bsize);
}
EXPORT_SYMBOL(bd_set_size);

diff --git a/fs/buffer.c b/fs/buffer.c
index faceb5e..2c044b6 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -166,8 +166,11 @@ int sync_blockdev(struct block_device *bdev)
{
int ret = 0;

- if (bdev)
+ if (bdev) {
ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
+ if (!ret)
+ ret = filemap_write_and_wait(bdev->bd_metadata_inode->i_mapping);
+ }
return ret;
}
EXPORT_SYMBOL(sync_blockdev);
@@ -261,7 +264,7 @@ EXPORT_SYMBOL(thaw_bdev);
static struct buffer_head *
__find_get_block_slow(struct block_device *bdev, sector_t block)
{
- struct inode *bd_inode = bdev->bd_inode;
+ struct inode *bd_inode = bdev->bd_metadata_inode;
struct address_space *bd_mapping = bd_inode->i_mapping;
struct buffer_head *ret = NULL;
pgoff_t index;
@@ -347,12 +350,16 @@ out:
void invalidate_bdev(struct block_device *bdev)
{
struct address_space *mapping = bdev->bd_inode->i_mapping;
+ struct address_space *meta_mapping = bdev->bd_metadata_inode->i_mapping;

- if (mapping->nrpages == 0)
+ if (mapping->nrpages)
+ invalidate_mapping_pages(mapping, 0, -1);
+
+ if (meta_mapping->nrpages == 0)
return;

invalidate_bh_lrus();
- invalidate_mapping_pages(mapping, 0, -1);
+ invalidate_mapping_pages(meta_mapping, 0, -1);
}

/*
@@ -1009,7 +1016,7 @@ static struct page *
grow_dev_page(struct block_device *bdev, sector_t block,
pgoff_t index, int size)
{
- struct inode *inode = bdev->bd_inode;
+ struct inode *inode = bdev->bd_metadata_inode;
struct page *page;
struct buffer_head *bh;

diff --git a/fs/ext3/dir.c b/fs/ext3/dir.c
index c2c3491..a46305e 100644
--- a/fs/ext3/dir.c
+++ b/fs/ext3/dir.c
@@ -140,7 +140,7 @@ static int ext3_readdir(struct file * filp,
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
page_cache_sync_readahead(
- sb->s_bdev->bd_inode->i_mapping,
+ sb->s_bdev->bd_metadata_inode->i_mapping,
&filp->f_ra, filp,
index, 1);
filp->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT;
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index e11890a..eaab1db 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -139,7 +139,7 @@ static int ext4_readdir(struct file * filp,
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
page_cache_sync_readahead(
- sb->s_bdev->bd_inode->i_mapping,
+ sb->s_bdev->bd_metadata_inode->i_mapping,
&filp->f_ra, filp,
index, 1);
filp->f_ra.prev_pos = (loff_t)index << PAGE_CACHE_SHIFT;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 46b8a67..a8485b6 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -1484,7 +1484,7 @@ int fat_flush_inodes(struct super_block *sb, struct inode *i1, struct inode *i2)
if (!ret && i2)
ret = writeback_inode(i2);
if (!ret) {
- struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;
+ struct address_space *mapping = sb->s_bdev->bd_metadata_inode->i_mapping;
ret = filemap_flush(mapping);
}
return ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f70d52c..aeac9d3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -519,6 +519,7 @@ struct address_space {
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
struct inode * bd_inode; /* will die */
+ struct inode * bd_metadata_inode;
int bd_openers;
struct mutex bd_mutex; /* open/close mutex */
struct semaphore bd_mount_sem;
--
1.5.3.rc6.17.g1911

2007-10-18 04:33:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

On Wed, 17 Oct 2007 21:59:02 -0600 [email protected] (Eric W. Biederman) wrote:

> If filesystems care at all they want absolute control over the buffer
> cache. Controlling which buffers are dirty and when. Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control leading to really weird
> bugs.
>
> In addition this tying of the implementation of block device caching
> and the buffer cache has resulted in a much more complicated and
> limited implementation than necessary. Block devices for example
> don't need buffer_heads, and it is perfectly reasonable to cache
> block devices in high memory.
>
> To start untangling the worst of this mess this patch introduces a
> second block device inode for the buffer cache. All buffer cache
> operations are diverted to use the new bd_metadata_inode, which
> keeps the weirdness of the metadata requirements isolated in their
> own little world.

I don't think we little angels want to tread here. There are so many
weirdo things out there which will break if we bust the coherence between
the fs and /dev/hda1. Online resize, online fs checkers, various local
tools which people have hacked up to look at metadata in a live fs,
direct-io access to the underlying fs, heaven knows how many boot loader
installers, etc. Certainly I couldn't enumerate them all.

The mere thought of all this scares the crap out of me.


I don't actually see what the conceptual problem is with the existing
implementation. The buffer_head is a finer-grained view onto the
blockdev's pagecache: it provides additional states and additional locking
against a finer-grained section of the page. It works well.

Yeah, the highmem thing is a bit of a problem (but waning in importance).
But we can fix that by teaching individual filesystems about kmap and then
tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount
time. If anyone cares, which they don't.
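
As a rough illustration of that suggestion (a minimal sketch, not an
actual patch; the function name and the idea of calling it at mount
time are assumptions), letting a block device's page cache use highmem
would look something like:

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/gfp.h>

/* Hypothetical mount-time hook for a filesystem whose metadata paths
 * already kmap() their buffers, so highmem pages are acceptable. */
static void example_allow_highmem_blockdev_cache(struct super_block *sb)
{
	struct address_space *mapping = sb->s_bdev->bd_inode->i_mapping;

	mapping_set_gfp_mask(mapping,
			     mapping_gfp_mask(mapping) | __GFP_HIGHMEM);
}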

2007-10-18 05:11:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

On Thursday 18 October 2007 13:59, Eric W. Biederman wrote:
> If filesystems care at all they want absolute control over the buffer
> cache. Controlling which buffers are dirty and when. Because we
> keep the buffer cache in the page cache for the block device we have
> not quite been giving filesystems that control leading to really weird
> bugs.

Mmm. Like I said, when a live filesystem is mounted on a bdev,
it isn't like you want userspace to go dancing around on it without
knowing exactly what it is doing.

The kernel more or less does the right thing here with respect to
the *state* of the data[*] (that is, buffer heads and pagecache).
It's when you actually start changing the data itself that you'll
blow up the filesystem.

[*] The ramdisk code is simply buggy, right? (and not the buffer
cache)

The idea of your patch in theory is OK, but Andrew raises valid
points about potential coherency problems, I think.

2007-10-18 09:26:24

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Am Donnerstag, 18. Oktober 2007 schrieb Eric W. Biederman:
> Grr. Inconsistent rules on a core piece of infrastructure.
> It looks like that if there is any trivial/minimal fix it
> is based on your patch suppressing try_to_free_buffers. Ugh.
>
> Eric

Ok. What do you think about having my patch for 2.6.23 stable and for
2.6.24, and doing a nicer fix (an rd rewrite, for example) post 2.6.24?

Christian

2007-10-19 21:28:57

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

Andrew Morton <[email protected]> writes:

>
> I don't think we little angels want to tread here. There are so many
> weirdo things out there which will break if we bust the coherence between
> the fs and /dev/hda1.

We broke coherence between the fs and /dev/hda1 when we introduced
the page cache years ago, and weird hacky cases like
unmap_underlying_metadata don't change that. Currently only
metadata is more or less in sync with the contents of /dev/hda1.

> Online resize, online fs checkers, various local
> tools which people have hacked up to look at metadata in a live fs,
> direct-io access to the underlying fs, heaven knows how many boot loader
> installers, etc. Certainly I couldn't enumerate them all.

Well I took a look at ext3. For online resize all of the writes are
done by the fs not by the user space tool. For e2fsck of a read-only
filesystem currently we do cache the buffers for the super block and
reexamine those blocks when we mount read-only.

Which makes my patch by itself unsafe. If, however, ext3 and anyone
else who does things like that were to reread the data rather than
merely reexamine it, we should be fine.

Fundamentally doing anything like this requires some form of
synchronization, and if that synchronization does not exist
today there will be bugs. Further decoupling things only makes that
requirement clearer.

Unfortunately because of things like the ext3 handling of remounting
from ro to rw this doesn't fall into the quick trivial fix category :(

> I don't actually see what the conceptual problem is with the existing
> implementation. The buffer_head is a finer-grained view onto the
> blockdev's pagecache: it provides additional states and additional locking
> against a finer-grained section of the page. It works well.

The buffer_head itself seems to be a reasonable entity.

The buffer cache is a monster. It does not follow the ordinary rules
of the page cache, making it extremely hard to reason about.

Currently in the buffer cache there are buffer_heads we are not
allowed to make dirty which hold dirty data. Some filesystems
panic the kernel when they notice this. Others like ext3 use a
different bit to remember that the buffer is dirty.
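
The "different bit" here is BH_JBDDirty; conceptually the move looks
like the sketch below (an illustrative paraphrase, not the actual
fs/jbd code; the function name is made up).

#include <linux/buffer_head.h>
#include <linux/jbd.h>

/* Illustrative only: remember dirtiness in a journal-private bit so the
 * VM no longer sees BH_Dirty and cannot write the buffer out on its own. */
static void example_take_over_dirty_bit(struct buffer_head *bh)
{
	if (test_clear_buffer_dirty(bh))
		set_buffer_jbddirty(bh);
}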

Because of ordering considerations the buffer cache does not hold a
consistent view of what has been scheduled for being written to disk.
It instead holds partially complete pages.

The only place we should ever clear the dirty bit is just before
calling writepage, but try_to_free_buffers clears the dirty bit!

We have buffers on pages without a mapping!

In general the buffer cache violates a primary rule of comprehensible
programming. The buffer cache does not have a clear enough definition
to make it clear what things are bugs and what things are features.

99% of the weird strange behavior in rd.c is because of the buffer
cache not following the normal rules.

> Yeah, the highmem thing is a bit of a problem (but waning in importance).
> But we can fix that by teaching individual filesystems about kmap and then
> tweak the blockdev's caching policy with mapping_set_gfp_mask() at mount
> time. If anyone cares, which they don't.

This presumes I want to use a filesystem on my block device. Where I
would care most is when I am doing things like fsck or mkfs on an
unmounted filesystem. Where having buffer_heads is just extra memory
pressure slowing things down, and similarly for highmem. We have
to sync the filesystem before mounting but we have to do that anyway
for all of the non metadata so that isn't new.

Anyway my main objective was to get a good grasp on the buffer cache
and the mm layer again. Which I now more or less have. While I think
the buffer cache needs a bunch of tender loving care before it becomes
sane I have other projects that I intend to complete before I try
anything in this area.

Eric

2007-10-19 21:37:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

Nick Piggin <[email protected]> writes:

>
> [*] The ramdisk code is simply buggy, right? (and not the buffer
> cache)

From the perspective of the ramdisk, it expects the buffer cache to
simply be a user of the page cache, and thus the buffer cache
is horribly buggy.

From the perspective of the buffer cache, it is still the block device
cache in the kernel, the way it works sets the rules for how caching
should be done in the kernel, and it doesn't give any mind to this
newfangled page cache thingy.

> The idea of your patch in theory is OK, but Andrew raises valid
> points about potential coherency problems, I think.

There are certainly implementation issues in various filesystems
to overcome before remounting read-write after doing a fsck
on a read-only filesystem will work as it does today. So my patch
is incomplete.

Eric

2007-10-19 22:47:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Mark ramdisk buffers heads dirty

Christian Borntraeger <[email protected]> writes:

> Am Donnerstag, 18. Oktober 2007 schrieb Eric W. Biederman:
>> Grr. Inconsistent rules on a core piece of infrastructure.
>> It looks like that if there is any trivial/minimal fix it
>> is based on your patch suppressing try_to_free_buffers. Ugh.
>>
>> Eric
>
> Ok. What do you think about having my patch for 2.6.23 stable, for 2.6.24
> and doing a nicer fix (rd rewrite for example for post 2.6.24)?

Looking at it. If we don't get carried away, using our own private
inode is barely more difficult than stomping on releasepage, and
in a number of ways a whole lot more subtle. At least for 2.6.24 I
think it makes a sane fix, and quite possibly as a back port as well.

Eric

2007-10-19 22:52:33

by Eric W. Biederman

[permalink] [raw]
Subject: [PATCH] rd: Use a private inode for backing storage


Currently the ramdisk tries to keep the block device page cache pages
from being marked clean and dropped from memory. That fails for
filesystems that use the buffer cache because the buffer cache is not
an ordinary buffer cache user and depends on the generic block device
address space operations being used.

To fix all of those associated problems this patch allocates a private
inode to store the ramdisk pages in.

The result is slightly more memory used for metadata, an extra copy
when reading or writing directly to the block device, and changing the
software block size does not lose the contents of the ramdisk. Most
of all this ensures we don't lose data during normal use of the
ramdisk.

I deliberately avoid the cleanup that is now possible because this
patch is intended to be a bug fix.

Signed-off-by: Eric W. Biederman <[email protected]>
---
drivers/block/rd.c | 19 ++++++++++++++++---
1 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/block/rd.c b/drivers/block/rd.c
index 08176d2..a52f153 100644
--- a/drivers/block/rd.c
+++ b/drivers/block/rd.c
@@ -62,6 +62,7 @@
/* Various static variables go here. Most are used only in the RAM disk code.
*/

+static struct inode *rd_inode[CONFIG_BLK_DEV_RAM_COUNT];
static struct gendisk *rd_disks[CONFIG_BLK_DEV_RAM_COUNT];
static struct block_device *rd_bdev[CONFIG_BLK_DEV_RAM_COUNT];/* Protected device data */
static struct request_queue *rd_queue[CONFIG_BLK_DEV_RAM_COUNT];
@@ -267,7 +268,7 @@ static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
static int rd_make_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
- struct address_space * mapping = bdev->bd_inode->i_mapping;
+ struct address_space * mapping = rd_inode[MINOR(bdev->bd_dev)]->i_mapping;
sector_t sector = bio->bi_sector;
unsigned long len = bio->bi_size >> 9;
int rw = bio_data_dir(bio);
@@ -312,6 +313,7 @@ static int rd_ioctl(struct inode *inode, struct file *file,
mutex_lock(&bdev->bd_mutex);
if (bdev->bd_openers <= 2) {
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
+ truncate_inode_pages(rd_inode[iminor(inode)]->i_mapping, 0);
error = 0;
}
mutex_unlock(&bdev->bd_mutex);
@@ -344,20 +346,30 @@ static int rd_open(struct inode *inode, struct file *filp)
unsigned unit = iminor(inode);

if (rd_bdev[unit] == NULL) {
+ struct inode *ramdisk_inode;
struct block_device *bdev = inode->i_bdev;
struct address_space *mapping;
unsigned bsize;
gfp_t gfp_mask;

+ ramdisk_inode = new_inode(bdev->bd_inode->i_sb);
+ if (!ramdisk_inode)
+ return -ENOMEM;
+
inode = igrab(bdev->bd_inode);
rd_bdev[unit] = bdev;
+ rd_inode[unit] = ramdisk_inode;
bdev->bd_openers++;
bsize = bdev_hardsect_size(bdev);
bdev->bd_block_size = bsize;
inode->i_blkbits = blksize_bits(bsize);
inode->i_size = get_capacity(bdev->bd_disk)<<9;

- mapping = inode->i_mapping;
+ ramdisk_inode->i_mode = S_IFBLK;
+ ramdisk_inode->i_bdev = bdev;
+ ramdisk_inode->i_rdev = bdev->bd_dev;
+
+ mapping = ramdisk_inode->i_mapping;
mapping->a_ops = &ramdisk_aops;
mapping->backing_dev_info = &rd_backing_dev_info;
bdev->bd_inode_backing_dev_info = &rd_file_backing_dev_info;
@@ -377,7 +389,7 @@ static int rd_open(struct inode *inode, struct file *filp)
* for the page allocator emergency pools to keep the ramdisk
* driver happy.
*/
- gfp_mask = mapping_gfp_mask(mapping);
+ gfp_mask = GFP_USER;
gfp_mask &= ~(__GFP_FS|__GFP_IO);
gfp_mask |= __GFP_HIGH;
mapping_set_gfp_mask(mapping, gfp_mask);
@@ -409,6 +421,7 @@ static void __exit rd_cleanup(void)
del_gendisk(rd_disks[i]);
put_disk(rd_disks[i]);
blk_cleanup_queue(rd_queue[i]);
+ iput(rd_inode[i]);
}
unregister_blkdev(RAMDISK_MAJOR, "ramdisk");

--
1.5.3.rc6.17.g1911

2007-10-21 04:30:05

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> Andrew Morton <[email protected]> writes:
> > I don't think we little angels want to tread here. There are so many
> > weirdo things out there which will break if we bust the coherence between
> > the fs and /dev/hda1.
>
> We broke coherence between the fs and /dev/hda1 when we introduced
> the page cache years ago,

Not for metadata. And I wouldn't expect many filesystem analysis
tools to care about data.


> and weird hacky cases like
> unmap_underlying_metadata don't change that.

unmap_underlying_metadata isn't about raw block device access at
all, though (if you write to the filesystem via the blockdevice
when it isn't expecting it, it's going to blow up regardless).


> Currently only
> metadata is more or less in sync with the contents of /dev/hda1.

It either is or it isn't, right? And it is, isn't it? (at least
for the common filesystems).

2007-10-21 04:34:17

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> Currently the ramdisk tries to keep the block device page cache pages
> from being marked clean and dropped from memory. That fails for
> filesystems that use the buffer cache because the buffer cache is not
> an ordinary buffer cache user and depends on the generic block device
> address space operations being used.
>
> To fix all of those associated problems this patch allocates a private
> inode to store the ramdisk pages in.
>
> The result is slightly more memory used for metadata, an extra copy
> when reading or writing directly to the block device, and changing the
> software block size does not lose the contents of the ramdisk. Most
> of all this ensures we don't lose data during normal use of the
> ramdisk.
>
> I deliberately avoid the cleanup that is now possible because this
> patch is intended to be a bug fix.

This just breaks coherency again like the last patch. That's a
really bad idea especially for stable (even if nothing actually
was to break, we'd likely never know about it anyway).

Christian's patch should go upstream and into stable. For 2.6.25-6,
my rewrite should just replace what's there. Using address spaces
to hold the ramdisk pages just confuses the issue even if they
*aren't* actually wired up to the vfs at all. Saving 20 lines is
not a good reason to use them.

2007-10-21 04:54:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

Nick Piggin <[email protected]> writes:

> On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
>> Andrew Morton <[email protected]> writes:
>> > I don't think we little angels want to tread here. There are so many
>> > weirdo things out there which will break if we bust the coherence between
>> > the fs and /dev/hda1.
>>
>> We broke coherence between the fs and /dev/hda1 when we introduced
>> the page cache years ago,
>
> Not for metadata. And I wouldn't expect many filesystem analysis
> tools to care about data.

Well tools like dump certainly weren't happy when we made the change.

>> and weird hacky cases like
>> unmap_underlying_metadata don't change that.
>
> unmap_underlying_metadata isn't about raw block device access at
> all, though (if you write to the filesystem via the blockdevice
> when it isn't expecting it, it's going to blow up regardless).

Well my goal with separating things is so that we could decouple two
pieces of code that have different usage scenarios, and where
supporting both scenarios simultaneously appears to me to needlessly
complicate the code.

Added to that we could then tune the two pieces of code for their
different users.

>> Currently only
>> metadata is more or less in sync with the contents of /dev/hda1.
>
> It either is or it isn't, right? And it is, isn't it? (at least
> for the common filesystems).

ext2 doesn't store directories in the buffer cache.

Journaling filesystems and filesystems that do ordered writes
game the buffer cache. Putting in data that should not yet
be written to disk. That gaming is where reiserfs goes BUG
and where JBD moves the dirty bit to a different dirty bit.

So as far as I can tell what is in the buffer cache is not really
in sync with what should be on disk at any given moment except
when everything is clean.

My suspicion is that actually reading from disk is likely to
give a more coherent view of things. Because there at least
we have the writes as they are expected to be seen by fsck
to recover the data, and a snapshot there should at least
be recoverable. Whereas a snapshot provides no such guarantees.

Eric

2007-10-21 05:11:16

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Nick Piggin <[email protected]> writes:

> On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
>> Currently the ramdisk tries to keep the block device page cache pages
>> from being marked clean and dropped from memory. That fails for
>> filesystems that use the buffer cache because the buffer cache is not
>> an ordinary buffer cache user and depends on the generic block device
>> address space operations being used.
>>
>> To fix all of those associated problems this patch allocates a private
>> inode to store the ramdisk pages in.
>>
>> The result is slightly more memory used for metadata, an extra copy
>> when reading or writing directly to the block device, and changing the
>> software block size does not lose the contents of the ramdisk. Most
>> of all this ensures we don't lose data during normal use of the
>> ramdisk.
>>
>> I deliberately avoid the cleanup that is now possible because this
>> patch is intended to be a bug fix.
>
> This just breaks coherency again like the last patch. That's a
> really bad idea especially for stable (even if nothing actually
> was to break, we'd likely never know about it anyway).

Not a chance. The only way we make it to that inode is through block
device I/O so it lives at exactly the same level in the hierarchy as
a real block device. My patch is the considered rewrite boiled down
to its essentials and made a trivial patch.

It fundamentally fixes the problem, and doesn't attempt to reconcile
the incompatible expectations of the ramdisk code and the buffer cache.

> Christian's patch should go upstream and into stable. For 2.6.25-6,
> my rewrite should just replace what's there. Using address spaces
> to hold the ramdisk pages just confuses the issue even if they
> *aren't* actually wired up to the vfs at all. Saving 20 lines is
> not a good reason to use them.

Well, it is more like saving 100 lines. Not having to reexamine complicated
infrastructure code and doing things the same way ramfs is. I think
that combination is a good reason. Especially since I can do with a
16 line patch as I just demonstrated. It is a solid and simple
incremental change.

Eric

2007-10-21 05:30:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Sunday 21 October 2007 15:10, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Saturday 20 October 2007 08:51, Eric W. Biederman wrote:
> >> Currently the ramdisk tries to keep the block device page cache pages
> >> from being marked clean and dropped from memory. That fails for
> >> filesystems that use the buffer cache because the buffer cache is not
> >> an ordinary buffer cache user and depends on the generic block device
> >> address space operations being used.
> >>
> >> To fix all of those associated problems this patch allocates a private
> >> inode to store the ramdisk pages in.
> >>
> >> The result is slightly more memory used for metadata, an extra copy
> >> when reading or writing directly to the block device, and changing the
> >> software block size does not lose the contents of the ramdisk. Most
> >> of all this ensures we don't lose data during normal use of the
> >> ramdisk.
> >>
> >> I deliberately avoid the cleanup that is now possible because this
> >> patch is intended to be a bug fix.
> >
> > This just breaks coherency again like the last patch. That's a
> > really bad idea especially for stable (even if nothing actually
> > was to break, we'd likely never know about it anyway).
>
> Not a chance.

Yes it does. It is exactly breaking the coherency between block
device and filesystem metadata coherency that Andrew cared about.
Whether or not that matters, that is a much bigger conceptual
change than simply using slightly more (reclaimable) memory in
some situations that my patch does.

If you want to try convincing people to break that coherency,
fine, but it has to be done consistently and everywhere rather than
for a special case in rd.c.


> The only way we make it to that inode is through block
> device I/O so it lives at exactly the same level in the hierarchy as
> a real block device.

No, it doesn't. A real block device driver does have its own
buffer cache as its backing store. It doesn't know about
readpage or writepage or set_page_dirty or buffers or pagecache.


> My patch is the considered rewrite boiled down
> to its essentials and made a trivial patch.

What's the considered rewrite here? The rewrite I posted is the
only one so far that's come up that I would consider [worthy],
while these patches are just more of the same wrongness.


> It fundamentally fixes the problem, and doesn't attempt to reconcile
> the incompatible expectations of the ramdisk code and the buffer cache.

It fixes the bug in question, but not because it makes any
fundamental improvement to the conceptual ickyness of rd.c. I
don't know what you mean about attempting to reconcile the
incompatible [stuff]?


> > Christian's patch should go upstream and into stable. For 2.6.25-6,
> > my rewrite should just replace what's there. Using address spaces
> > to hold the ramdisk pages just confuses the issue even if they
> > *aren't* actually wired up to the vfs at all. Saving 20 lines is
> > not a good reason to use them.
>
> Well, it is more like saving 100 lines. Not having to reexamine complicated
> infrastructure code and doing things the same way ramfs is.

Using radix_tree_insert instead of add_to_page_cache is hardly
complicated. If anything it is simpler because it isn't actually
confusing the issues and it is a much better fit with real block
device drivers, which is actually the thing that is familiar to
block device driver maintainers.
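
The rewrite Nick refers to is not reproduced in this thread, but the
shape of the radix-tree approach he is describing is roughly the sketch
below; the struct, function names and locking details are illustrative
assumptions, not code from his posted patch.

#include <linux/radix-tree.h>
#include <linux/spinlock.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical per-device state for a radix-tree backed ramdisk. */
struct rd_dev {
	spinlock_t		lock;
	struct radix_tree_root	pages;	/* backing store, indexed by page offset */
};

/* Find the backing page for @idx, allocating a zeroed one on first use. */
static struct page *rd_get_page(struct rd_dev *rd, pgoff_t idx)
{
	struct page *page, *newpage;

	spin_lock(&rd->lock);
	page = radix_tree_lookup(&rd->pages, idx);
	spin_unlock(&rd->lock);
	if (page)
		return page;

	newpage = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
	if (!newpage)
		return NULL;
	if (radix_tree_preload(GFP_NOIO)) {
		__free_page(newpage);
		return NULL;
	}

	spin_lock(&rd->lock);
	page = radix_tree_lookup(&rd->pages, idx);
	if (!page) {
		newpage->index = idx;
		radix_tree_insert(&rd->pages, idx, newpage); /* cannot fail after preload */
		page = newpage;
		newpage = NULL;
	}
	spin_unlock(&rd->lock);
	radix_tree_preload_end();

	if (newpage)
		__free_page(newpage);	/* somebody else inserted first */
	return page;
}

Because these pages never live in any address_space, nothing in the VM
can clean or reclaim them behind the driver's back; the driver frees
them itself when the device is torn down or BLKFLSBUF is issued.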


> I think
> that combination is a good reason. Especially since I can do it with a
> 16 line patch as I just demonstrated. It is a solid and simple
> incremental change.

My opinion is that it is a much bigger behavioural change because
it results in incoherent buffer cache / blockdev cache, and it
results in code which is still ugly and conceptually wrong.

2007-10-21 05:41:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> >> Andrew Morton <[email protected]> writes:
> >> > I don't think we little angels want to tread here. There are so many
> >> > weirdo things out there which will break if we bust the coherence
> >> > between the fs and /dev/hda1.
> >>
> >> We broke coherence between the fs and /dev/hda1 when we introduced
> >> the page cache years ago,
> >
> > Not for metadata. And I wouldn't expect many filesystem analysis
> > tools to care about data.
>
> Well tools like dump certainly weren't happy when we made the change.

Doesn't that give you any suspicion that other tools mightn't
be happy if we make this change, then?


> >> and weird hacky cases like
> >> unmap_underlying_metadata don't change that.
> >
> > unmap_underlying_metadata isn't about raw block device access at
> > all, though (if you write to the filesystem via the blockdevice
> > when it isn't expecting it, it's going to blow up regardless).
>
> Well my goal with separating things is so that we could decouple two
> pieces of code that have different usage scenarios, and where
> supporting both scenarios simultaneously appears to me to needlessly
> complicate the code.
>
> Added to that we could then tune the two pieces of code for their
> different users.

I don't see too much complication from it. If we can actually
simplify things or make useful tuning, maybe it will be worth
doing.


> >> Currently only
> >> metadata is more or less in sync with the contents of /dev/hda1.
> >
> > It either is or it isn't, right? And it is, isn't it? (at least
> > for the common filesystems).
>
> ext2 doesn't store directories in the buffer cache.

Oh that's what you mean. OK, agreed there. But for the filesystems
and types of metadata that can now expect to have coherency, doing
this will break that expectation.

Again, I have no opinions either way on whether we should do that
in the long run. But doing it as a kneejerk response to braindead
rd.c code is wrong because of what *might* go wrong and we don't
know about.


> Journaling filesystems and filesystems that do ordered writes
> game the buffer cache. Putting in data that should not yet
> be written to disk. That gaming is where reiserfs goes BUG
> and where JBD moves the dirty bit to a different dirty bit.

Filesystems really want better control of writeback, I think.
This isn't really a consequence of the unified blockdev pagecache
/ metadata buffer cache, it is just that most of the important
things they do are with metadata.

If they have their own metadata inode, then they'll need to game
the cache for it, or the writeback code for that inode somehow
too.


> So as far as I can tell what is in the buffer cache is not really
> in sync with what should be on disk at any given moment except
> when everything is clean.

Naturally. It is a writeback cache.


> My suspicion is that actually reading from disk is likely to
> give a more coherent view of things. Because there at least
> we have the writes as they are expected to be seen by fsck
> to recover the data, and a snapshot there should at least
> be recoverable. Whereas a snapshot provides no such guarantees.

ext3 fsck I don't think is supposed to be run under a read/write
filesystem, so it's going to explode if you do that regardless.

2007-10-21 06:49:52

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Nick Piggin <[email protected]> writes:

> Yes it does. It is exactly breaking the coherency between block
> device and filesystem metadata coherency that Andrew cared about.
> Whether or not that matters, that is a much bigger conceptual
> change than simply using slightly more (reclaimable) memory in
> some situations that my patch does.
>
> If you want to try convincing people to break that coherency,
> fine, but it has to be done consistently and everywhere rather than
> for a special case in rd.c.

Nick. Reread the patch. The only thing your arguments have
established for me is that this patch is not obviously correct. Which
makes it ineligible for a back port. Frankly I suspect the whole
issue is too subtle and rare to make any backport make any sense. My
apologies Christian.

>> The only way we make it to that inode is through block
>> device I/O so it lives at exactly the same level in the hierarchy as
>> a real block device.
>
> No, it doesn't. A real block device driver does have its own
> buffer cache as its backing store. It doesn't know about
> readpage or writepage or set_page_dirty or buffers or pagecache.

Well those pages are only accessed through rd_blkdev_pagecache_IO
and rd_ioctl.

The address space operations can (after my patch) be deleted or
be replaced by their generic versions. I just didn't take that
step because it was an unnecessary change and I wanted the minimal
change for a backport.

>> My patch is the considered rewrite boiled down
>> to its essentials and made a trivial patch.
>
> What's the considered rewrite here? The rewrite I posted is the
> only one so far that's come up that I would consider [worthy],
> while these patches are just more of the same wrongness.

Well it looks like you were blind when you read the patch.

Because the semantics between the two are almost identical,
except I managed to implement BLKFLSBUF in a backwards compatible
way by flushing both the buffer cache and my private cache. You
failed to flush the buffer cache in your implementation.

Yes. I use an inode 99% for its mapping and the mapping 99% for its
radix_tree. But having truncate_inode_pages and grab_cache_page
continue to work sure is convenient. I certainly think it makes it a
lot simpler to audit the code to change just one thing at a time (the
backing store) than to rip out and replace everything and then try and
prove that the two patches are equivalent.
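
To make the convenience argument concrete, a rough illustration (the
names are made up and this is not code from the patch above) of what
writing into the private inode's mapping looks like when the ordinary
page cache helpers keep working:

#include <linux/pagemap.h>
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Illustrative helper: copy @len bytes into the backing page at @index. */
static int example_rd_write(struct address_space *mapping, pgoff_t index,
			    unsigned int offset, const char *src,
			    unsigned int len)
{
	struct page *page;
	char *dst;

	/* grab_cache_page does the radix tree lookup/insert and locking */
	page = grab_cache_page(mapping, index);
	if (!page)
		return -ENOMEM;

	dst = kmap_atomic(page, KM_USER0);
	memcpy(dst + offset, src, len);
	kunmap_atomic(dst, KM_USER0);

	SetPageUptodate(page);
	set_page_dirty(page);	/* the dirty bit is what pins the page */
	unlock_page(page);
	page_cache_release(page);
	return 0;
}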

Eric

2007-10-21 07:10:37

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in its own mappings.

Nick Piggin <[email protected]> writes:

> On Sunday 21 October 2007 14:53, Eric W. Biederman wrote:
>> Nick Piggin <[email protected]> writes:
>> > On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
>> >> Andrew Morton <[email protected]> writes:
>> >> > I don't think we little angels want to tread here. There are so many
>> >> > weirdo things out there which will break if we bust the coherence
>> >> > between the fs and /dev/hda1.
>> >>
>> >> We broke coherence between the fs and /dev/hda1 when we introduced
>> >> the page cache years ago,
>> >
>> > Not for metadata. And I wouldn't expect many filesystem analysis
>> > tools to care about data.
>>
>> Well tools like dump certainly weren't happy when we made the change.
>
> Doesn't that give you any suspicion that other tools mightn't
> be happy if we make this change, then?

I read a representative sample of the relevant tools before replying
to Andrew.

>> >> and weird hacky cases like
>> >> unmap_underlying_metadata don't change that.
>> >
>> > unmap_underlying_metadata isn't about raw block device access at
>> > all, though (if you write to the filesystem via the blockdevice
>> > when it isn't expecting it, it's going to blow up regardless).
>>
>> Well my goal with separating things is so that we could decouple two
>> pieces of code that have different usage scenarios, and where
>> supporting both scenarios simultaneously appears to me to needlessly
>> complicate the code.
>>
>> Added to that we could then tune the two pieces of code for their
>> different users.
>
> I don't see too much complication from it. If we can actually
> simplify things or make useful tuning, maybe it will be worth
> doing.

That was my feeling that we could simplify things. The block layer
page cache operations certainly.

I know in the filesystems that use the buffer cache like reiser and
JBD they could stop worrying about the buffers becoming mysteriously
dirty. Beyond that I think there is a lot of opportunity I just
haven't looked much yet.

>> >> Currently only
>> >> metadata is more or less in sync with the contents of /dev/hda1.
>> >
>> > It either is or it isn't, right? And it is, isn't it? (at least
>> > for the common filesystems).
>>
>> ext2 doesn't store directories in the buffer cache.
>
> Oh that's what you mean. OK, agreed there. But for the filesystems
> and types of metadata that can now expect to have coherency, doing
> this will break that expectation.
>
> Again, I have no opinions either way on whether we should do that
> in the long run. But doing it as a kneejerk response to braindead
> rd.c code is wrong because of what *might* go wrong and we don't
> know about.

The rd.c code would be perfectly valid if someone weren't forcing buffer
heads on its pages. It is a conflict of expectations.

Regardless I didn't do it as a kneejerk and I don't think that
patch should be merged at this time. I proposed it because as I
see it that starts untangling the mess that is the buffer cache.
rd.c was just my entry point into understanding how all of those
pieces work. I was doing my best to completely explore my options
and what the code was doing before settling on the fix for rd.c

>> Journaling filesystems and filesystems that do ordered writes
>> game the buffer cache. Putting in data that should not yet
>> be written to disk. That gaming is where reiserfs goes BUG
>> and where JBD moves the dirty bit to a different dirty bit.
>
> Filesystems really want better control of writeback, I think.
> This isn't really a consequence of the unified blockdev pagecache
> / metadata buffer cache, it is just that most of the important
> things they do are with metadata.

Yes.

> If they have their own metadata inode, then they'll need to game
> the cache for it, or the writeback code for that inode somehow
> too.

Yes. Although they will at least get the guarantee that no one
else is dirtying their pages at strange times.


>> So as far as I can tell what is in the buffer cache is not really
>> in sync with what should be on disk at any given moment except
>> when everything is clean.
>
> Naturally. It is a writeback cache.

Not that so much as the order in which things go into the cache
does not match the order the blocks go to disk.

>> My suspicion is that actually reading from disk is likely to
>> give a more coherent view of things. Because there at least
>> we have the writes as they are expected to be seen by fsck
>> to recover the data, and a snapshot there should at least
>> be recoverable. Whereas a snapshot provides no such guarantees.
>
> ext3 fsck I don't think is supposed to be run under a read/write
> filesystem, so it's going to explode if you do that regardless.

Yes. I was thinking of dump or something like that here. Where
we simply read out the data and try to make some coherent sense
of it. If we see a version of the metadata that points to things
that have not been finished yet or is in the process of being
written to that could be a problem.

When going through the buffer cache as far as I can tell people
don't use little things like page lock when writing data so
the page cache reads can potentially race with what should
be atomic writes.

Eric

2007-10-21 07:29:32

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Am Sonntag, 21. Oktober 2007 schrieb Eric W. Biederman:
> Nick. Reread the patch. The only thing your arguments have
> established for me is that this patch is not obviously correct. Which
> makes it ineligible for a back port. Frankly I suspect the whole
> issue is too subtle and rare to make any backport make any sense. My
> apologies Christian.

About being rare: when I force the VM to be more aggressive at reclaiming buffers
by using the following patch:
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3225,7 +3225,7 @@ void __init buffer_init(void)
* Limit the bh occupancy to 10% of ZONE_NORMAL
*/
nrpages = (nr_free_buffer_pages() * 10) / 100;
- max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+ max_buffer_heads = 0;
hotcpu_notifier(buffer_cpu_notify, 0);
}

I can actually cause data corruption within some seconds. So I think the
problem is real enough to be worth fixing.

I still don't fully understand what issues you have with my patch.
- it obviously fixes the problem
- I am not aware of any regression it introduces
- its small

One concern you had was the fact that buffer heads are out of sync with
struct pages. Testing your first patch revealed that this is actually needed
by reiserfs - and maybe others.
I can also see that my patch looks a bit like a bandaid that cobbles the rd
pieces together.
Is there anything else, that makes my patch unmergeable in your opinion?


Christian

2007-10-21 08:24:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Christian Borntraeger <[email protected]> writes:

> Am Sonntag, 21. Oktober 2007 schrieb Eric W. Biederman:
>> Nick. Reread the patch. The only thing your arguments have
>> established for me is that this patch is not obviously correct. Which
>> makes it ineligible for a back port. Frankly I suspect the whole
>> issue is too subtle and rare to make any backport make any sense. My
>> apologies Christian.
>
> About being rare: when I force the VM to be more aggressive at reclaiming buffers
> by using the following patch:
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -3225,7 +3225,7 @@ void __init buffer_init(void)
> * Limit the bh occupancy to 10% of ZONE_NORMAL
> */
> nrpages = (nr_free_buffer_pages() * 10) / 100;
> - max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
> + max_buffer_heads = 0;
> hotcpu_notifier(buffer_cpu_notify, 0);
> }
>
> I can actually cause data corruption within some seconds. So I think the
> problem is real enough to be worth fixing.

Let me put it another way. Looking at /proc/slabinfo I can get
37 buffer_heads per page. I can allocate 10% of memory in
buffer_heads before we start to reclaim them. So it requires just
over 3.7 buffer_heads on every page of low memory to even trigger
this case. That is a large 1k filesystem or a weird-sized partition
that we have written to directly.
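
The 3.7 figure falls straight out of those two numbers; a trivial check
(plain userspace C, with the figures quoted above hard-coded as
assumptions, not measured here):

#include <stdio.h>

int main(void)
{
	double bh_per_page = 37.0;  /* buffer_heads per slab page, from slabinfo */
	double budget = 0.10;       /* max_buffer_heads covers 10% of low memory */

	/* average buffer_heads per low-memory page needed before
	 * buffer_heads_over_limit can become true */
	printf("%.1f buffer_heads per low-mem page\n", bh_per_page * budget);
	return 0;
}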

That makes this condition very rare in practice without your patch.

Especially since even after we reach the above condition we have
to have enough vm pressure to find a page with clean buffer heads
that is dirty in the ramdisk.

While it can be done deterministically usually it is pretty hard
to trigger and pretty easy to work around by simply using partition
sizes that are a multiple of 4k and 4k block sized filesystems.

> I still don't fully understand what issues you have with my patch.
> - it obviously fixes the problem
> - I am not aware of any regression it introduces
> - its small

My primary issue with your patch is that it continues the saga of
trying to use the buffer cache to store the data, which is a serious
review problem, and clearly not what we want to do long term.

> One concern you had, was the fact that buffer heads are out of sync with
> struct pages. Testing your first patch revealed that this is actually needed
> by reiserfs - and maybe others.
> I can also see, that my patch looks a bit like a bandaid that cobbles the rd
> pieces together.

> Is there anything else, that makes my patch unmergeable in your
> opinion?

For Linus's tree the consensus is that to fix rd.c we need to have a
backing store that lives somewhere besides the page cache/buffer cache
for /dev/ram0. Doing that prevents
all of the weird issues.

Now we have the question of which patch gets us there. I contend
I have implemented it with my last little patch that this thread
is a reply to. Nick hasn't seen that just yet.

So if we have a small patch that can implement the proper long
term fix I contend we are in better shape.

As for backports we can worry about that after we get something
sane merged upstream.

Eric

2007-10-21 09:44:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Sunday 21 October 2007 16:48, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > Yes it does. It is exactly breaking the coherency between block
> > device and filesystem metadata coherency that Andrew cared about.
> > Whether or not that matters, that is a much bigger conceptual
> > change than simply using slightly more (reclaimable) memory in
> > some situations that my patch does.
> >
> > If you want to try convincing people to break that coherency,
> > fine, but it has to be done consistently and everywhere rather than
> > for a special case in rd.c.
>
> Nick. Reread the patch. The only thing your arguments have
> established for me is that this patch is not obviously correct. Which
> makes it ineligible for a back port.

OK, I missed that you set the new inode's aops to the ramdisk_aops
rather than the bd_inode. Which doesn't make a lot of sense because
you just have a lot of useless aops there now.


> Frankly I suspect the whole
> issue is to subtle and rare to make any backport make any sense. My
> apologies Christian.

It's a data corruption issue. I think it should be fixed.


> >> The only way we make it to that inode is through block
> >> device I/O so it lives at exactly the same level in the hierarchy as
> >> a real block device.
> >
> > No, it doesn't. A real block device driver does have its own
> > buffer cache as it's backing store. It doesn't know about
> > readpage or writepage or set_page_dirty or buffers or pagecache.
>
> Well those pages are only accessed through rd_blkdev_pagecache_IO
> and rd_ioctl.

Wrong. It will be via the LRU, will get ->writepage() called,
block_invalidate_page, etc. And I guess also via sb->s_inodes, where
drop_pagecache_sb might do stuff to it (although it probably escapes
harm). But you're right that it isn't the obviously correct fix for
the problem.


> >> My patch is the considered rewrite boiled down
> >> to it's essentials and made a trivial patch.
> >
> > What's the considered rewrite here? The rewrite I posted is the
> > only one so far that's come up that I would consider [worthy],
> > while these patches are just more of the same wrongness.
>
> Well it looks like you were blind when you read the patch.

If you think it is a nice way to go, then I think you are
blind ;)


> Because the semantics between the two are almost identical,
> except I managed to implement BLKFLSBUF in a backwards compatible
> way by flushing both the buffer cache and my private cache. You
> failed to flush the buffer cache in your implementation.

Obviously a simple typo that can be fixed by adding one line
of code.


> Yes. I use an inode 99% for it's mapping and the mapping 99% for it's
> radix_tree. But having truncate_inode_pages and grab_cache_page
> continue to work sure is convenient.

It's horrible. And using truncate_inode_pages / grab_cache_page and
new_inode is an incredible argument to save a few lines of code. You
obviously didn't realise your so called private pages would get
accessed via the LRU, for example. This is making a relatively
larger logical change than my patch, because now as well as having
a separate buffer cache and backing store, you are also making the
backing store pages visible to the VM.


> I certainly think it makes it a
> lot simpler to audit the code to change just one thing at a time (the
> backing store) then to rip out and replace everything and then try and
> prove that the two patches are equivalent.

I think it's a bad idea just to stir the shit. We should take the
simple fix for the problem, and then fix it properly.

2007-10-21 10:02:14

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> Christian Borntraeger <[email protected]> writes:

> Let me put it another way. Looking at /proc/slabinfo I can get
> 37 buffer_heads per page. I can allocate 10% of memory in
> buffer_heads before we start to reclaim them. So it requires just
> over 3.7 buffer_heads on very page of low memory to even trigger
> this case. That is a large 1k filesystem or a weird sized partition,
> that we have written to directly.

On a highmem machine it could be relatively common.


> > I still dont fully understand what issues you have with my patch.
> > - it obviously fixes the problem
> > - I am not aware of any regression it introduces
> > - its small
>
> My primary issue with your patch is that it continues the saga the
> trying to use buffer cache to store the data which is a serious
> review problem, and clearly not what we want to do long term.

You don't want to change that for a stable patch, however.
It fixes the bug.


> > One concern you had, was the fact that buffer heads are out of sync with
> > struct pages. Testing your first patch revealed that this is actually
> > needed by reiserfs - and maybe others.
> > I can also see, that my patch looks a bit like a bandaid that cobbles the
> > rd pieces together.
> >
> > Is there anything else, that makes my patch unmergeable in your
> > opinion?
>
> For linus's tree the consensus is that to fix rd.c that we
> need to have a backing store that is stored somewhere besides
> in the page cache/buffer cache for /dev/ram0. Doing that prevents
> all of the weird issues.
>
> Now we have the question of which patch gets us there. I contend
> I have implemented it with my last little patch that this thread
> is a reply to. Nick hasn't seen that just yet.

Or ever will. It isn't as though my whole argument against it was
based on my mistaken belief that your patch served the bdev inode
directly from its backing store.


> So if we have a small patch that can implement the proper long
> term fix I contend we are in better shape.

I just don't think what you have is the proper fix. Calling into
the core VFS and VM just because it currently does something that
happens to work for you, but is completely unrelated to what you
are conceptually doing, is not the right fix.

Also, the patch I posted is big because it did other stuff
with dynamically allocated ramdisks from loop (ie. a modern
rewrite). As it is applied to rd.c and split into chunks, the
actual patch to switch to the new backing store isn't actually
that big. I'll submit it to -mm after things stabilise after
the merge window too.

2007-10-21 17:57:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Nick Piggin <[email protected]> writes:

> OK, I missed that you set the new inode's aops to the ramdisk_aops
> rather than the bd_inode. Which doesn't make a lot of sense because
> you just have a lot of useless aops there now.

Not totally useless: as you have mentioned, they are accessed by
the LRU, so we still need something there, just not much.

>> Frankly I suspect the whole
>> issue is to subtle and rare to make any backport make any sense. My
>> apologies Christian.
>
> It's a data corruption issue. I think it should be fixed.

Of course it should be fixed. I just don't know if a backport
makes sense. The problem once understood is hard to trigger
and easy to avoid.

>> >> The only way we make it to that inode is through block
>> >> device I/O so it lives at exactly the same level in the hierarchy as
>> >> a real block device.
>> >
>> > No, it doesn't. A real block device driver does have its own
>> > buffer cache as it's backing store. It doesn't know about
>> > readpage or writepage or set_page_dirty or buffers or pagecache.
>>
>> Well those pages are only accessed through rd_blkdev_pagecache_IO
>> and rd_ioctl.
>
> Wrong. It will be via the LRU, will get ->writepage() called,
> block_invalidate_page, etc. And I guess also via sb->s_inodes, where
> drop_pagecache_sb might do stuff to it (although it probably escapes
> harm). But you're right that it isn't the obviously correct fix for
> the problem.

Yes. We will be accessed via the LRU. Yes, I knew that. The truth is
that whatever we do we will be visible to the LRU. My preference runs
towards having something that is less of a special case than a new,
potentially huge cache that is completely unaccounted for and whose
infrastructure we have to build and maintain independently. The
ramdisk code doesn't seem interesting enough long term to get people
to do independent maintenance.

With my patch we are on the road to being just like the ramdisk code
for maintenance purposes, except having a different GFP mask.

I think I might have to send in a patch that renames
block_invalidatepage to block_invalidate_page, which is how everyone
seems to be spelling it.

Anyway, nothing in the ramdisk code sets PagePrivate, so
block_invalidatepage should never be called, nor try_to_free_buffers
for that matter.
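
For reference, the gate I am talking about is try_to_release_page();
from memory it looks roughly like this (the exact code may differ a
bit between versions):

int try_to_release_page(struct page *page, gfp_t gfp_mask)
{
	struct address_space * const mapping = page->mapping;

	BUG_ON(!PageLocked(page));
	if (PageWriteback(page))
		return 0;

	/* A releasepage method short-circuits try_to_free_buffers. */
	if (mapping && mapping->a_ops->releasepage)
		return mapping->a_ops->releasepage(page, gfp_mask);
	return try_to_free_buffers(page);
}

and callers such as pagevec_strip only get here for pages that have
PagePrivate set in the first place.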

>> >> My patch is the considered rewrite boiled down
>> >> to it's essentials and made a trivial patch.
>> >
>> > What's the considered rewrite here? The rewrite I posted is the
>> > only one so far that's come up that I would consider [worthy],
>> > while these patches are just more of the same wrongness.
>>
>> Well it looks like you were blind when you read the patch.
>
> If you think it is a nice way to go, then I think you are
> blind ;)

Well, we each have different tastes. I think my patch
is a sane, sensible, small incremental step that does just what
is needed to fix the problem. It doesn't drag a lot of other
stuff into the problem like a rewrite would, so we can easily verify
it.

>> Because the semantics between the two are almost identical,
>> except I managed to implement BLKFLSBUF in a backwards compatible
>> way by flushing both the buffer cache and my private cache. You
>> failed to flush the buffer cache in your implementation.
>
> Obviously a simple typo that can be fixed by adding one line
> of code.

Quite true. I noticed the breakage in mine because the patch
was so simple, and I only had to worry about one aspect. With
a rewrite I missed it because there was so much other noise I
couldn't see any issues.


>> Yes. I use an inode 99% for it's mapping and the mapping 99% for it's
>> radix_tree. But having truncate_inode_pages and grab_cache_page
>> continue to work sure is convenient.
>
> It's horrible. And using truncate_inode_pages / grab_cache_page and
> new_inode is an incredible argument to save a few lines of code. You
> obviously didn't realise your so called private pages would get
> accessed via the LRU, for example.

I did, but that is relatively minor. Using the pagecache this
way for this purpose is a well-established idiom in the kernel,
so I didn't figure I was introducing anything too hard to maintain.
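
To make the idiom concrete, here is a sketch (with a made-up helper
name; this is not the code from my patch): a driver-private inode
whose mapping's radix tree holds the backing pages, so helpers like
grab_cache_page() and truncate_inode_pages() keep working.

#include <linux/pagemap.h>
#include <linux/highmem.h>

/* rd_backing_page() is an illustrative name only. */
static struct page *rd_backing_page(struct address_space *backing,
				    pgoff_t index)
{
	struct page *page;

	page = grab_cache_page(backing, index);	/* find or create, returns locked */
	if (!page)
		return NULL;
	if (!PageUptodate(page)) {
		clear_highpage(page);		/* unwritten blocks read back as zeroes */
		SetPageUptodate(page);
	}
	unlock_page(page);
	return page;				/* caller drops the page reference */
}

Teardown is then just truncate_inode_pages() on the private mapping
followed by iput() on the private inode.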

> This is making a relatively
> larger logical change than my patch, because now as well as having
> a separate buffer cache and backing store, you are also making the
> backing store pages visible to the VM.

I am continuing to have the backing store pages visible to the VM,
and from that perspective it is a smaller change than where we are
today. Not that we can truly hide pages from the VM.

>> I certainly think it makes it a
>> lot simpler to audit the code to change just one thing at a time (the
>> backing store) then to rip out and replace everything and then try and
>> prove that the two patches are equivalent.
>
> I think it's a bad idea just to stir the shit. We should take the
> simple fix for the problem, and then fix it properly.

The code in rd.c isn't terrible, and it isn't shit. There is only one
fundamental problem with it: rd.c is fundamentally incompatible with
the buffer cache. Currently rd.c is a legitimate, if a bit ugly,
user of the page cache. The ugliness comes from working around
the buffer cache placing buffer heads on its pages.

With my patch I stop using the same pages as the buffer cache, which
removes that one fundamental problem.

Once we get the problem actually fixed, I have no problem with
cleaning up the code. I even have patches queued; I just believe
in changing only one thing at a time if I can.

Eric

2007-10-21 18:40:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

Nick Piggin <[email protected]> writes:

> On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
>> Christian Borntraeger <[email protected]> writes:
>
>> Let me put it another way. Looking at /proc/slabinfo I can get
>> 37 buffer_heads per page. I can allocate 10% of memory in
>> buffer_heads before we start to reclaim them. So it requires just
>> over 3.7 buffer_heads on very page of low memory to even trigger
>> this case. That is a large 1k filesystem or a weird sized partition,
>> that we have written to directly.
>
> On a highmem machine it it could be relatively common.

Possibly. But the same proportions still hold. 1k filesystems
are not the default these days and ramdisks are relatively uncommon.
The memory quantities involved are all low mem.

This is an old problem. If it were common I suspect we would have
noticed and fixed it long ago. As far as I can tell this problem dates
back to 2.5.13, in the commit below (which starts clearing the dirty
bit in try_to_free_buffers). rd.c had made the transition from using
BH_Protected to using the dirty bit to lock its pages into the page
cache sometime earlier.

>commit acc98edfe3abf531df6cc0ba87f857abdd3552ad
>Author: akpm <akpm>
>Date: Tue Apr 30 20:52:10 2002 +0000
>
> [PATCH] writeback from address spaces
>
> [ I reversed the order in which writeback walks the superblock's
> dirty inodes. It sped up dbench's unlink phase greatly. I'm
> such a sleaze ]
>
> The core writeback patch. Switches file writeback from the dirty
> buffer LRU over to address_space.dirty_pages.
>
> - The buffer LRU is removed
>
> - The buffer hash is removed (uses blockdev pagecache lookups)
>
> - The bdflush and kupdate functions are implemented against
> address_spaces, via pdflush.
>
> - The relationship between pages and buffers is changed.
>
> - If a page has dirty buffers, it is marked dirty
> - If a page is marked dirty, it *may* have dirty buffers.
> - A dirty page may be "partially dirty". block_write_full_page
> discovers this.
>
> - A bunch of consistency checks of the form
>
> if (!something_which_should_be_true())
> buffer_error();
>
> have been introduced. These fog the code up but are important for
> ensuring that the new buffer/page code is working correctly.
>
> - New locking (inode.i_bufferlist_lock) is introduced for exclusion
> from try_to_free_buffers(). This is needed because set_page_dirty
> is called under spinlock, so it cannot lock the page. But it
> needs access to page->buffers to set them all dirty.
>
> i_bufferlist_lock is also used to protect inode.i_dirty_buffers.
>
> - fs/inode.c has been split: all the code related to file data writeback
> has been moved into fs/fs-writeback.c
>
> - Code related to file data writeback at the address_space level is in
> the new mm/page-writeback.c
>
> - try_to_free_buffers() is now non-blocking
>
> - Switches vmscan.c over to understand that all pages with dirty data
> are now marked dirty.
>
> - Introduces a new a_op for VM writeback:
>
> ->vm_writeback(struct page *page, int *nr_to_write)
>
> this is a bit half-baked at present. The intent is that the address_space
> is given the opportunity to perform clustered writeback. To allow it to
> opportunistically write out disk-contiguous dirty data which may be in other zones.
> To allow delayed-allocate filesystems to get good disk layout.
>
> - Added address_space.io_pages. Pages which are being prepared for
> writeback. This is here for two reasons:
>
> 1: It will be needed later, when BIOs are assembled direct
> against pagecache, bypassing the buffer layer. It avoids a
> deadlock which would occur if someone moved the page back onto the
> dirty_pages list after it was added to the BIO, but before it was
> submitted. (hmm. This may not be a problem with PG_writeback logic).
>
> 2: Avoids a livelock which would occur if some other thread is continually
> redirtying pages.
>
> - There are two known performance problems in this code:
>
> 1: Pages which are locked for writeback cause undesirable
> blocking when they are being overwritten. A patch which leaves
> pages unlocked during writeback comes later in the series.
>
>
> 2: While inodes are under writeback, they are locked. This
> causes namespace lookups against the file to get unnecessarily
> blocked in wait_on_inode(). This is a fairly minor problem.
>
> I don't have a fix for this at present - I'll fix this when I
> attach dirty address_spaces direct to super_blocks.
>
> - The patch vastly increases the amount of dirty data which the
> kernel permits highmem machines to maintain. This is because the
> balancing decisions are made against the amount of memory in the
> machine, not against the amount of buffercache-allocatable memory.
>
> This may be very wrong, although it works fine for me (2.5 gigs).
>
> We can trivially go back to the old-style throttling with
> s/nr_free_pagecache_pages/nr_free_buffer_pages/ in
> balance_dirty_pages(). But better would be to allow blockdev
> mappings to use highmem (I'm thinking about this one, slowly). And
> to move writer-throttling and writeback decisions into the VM (modulo
> the file-overwriting problem).
>
> - Drops 24 bytes from struct buffer_head. More to come.
>
> - There's some gunk like super_block.flags:MS_FLUSHING which needs to
> be killed. Need a better way of providing collision avoidance
> between pdflush threads, to prevent more than one pdflush thread
> working a disk at the same time.
>
> The correct way to do that is to put a flag in the request queue to
> say "there's a pdlfush thread working this disk". This is easy to
> do: just generalise the "ra_pages" pointer to point at a struct which
> includes ra_pages and the new collision-avoidance flag.
>
> BKrev: 3ccf03faM0no6SEm3ltNUkHf4BH1ag


> You don't want to change that for a stable patch, however.
> It fixes the bug.

No, it avoids the bug, which is something slightly different.
Further, I contend that it is not obviously correct that there
are no other side effects (because it doesn't actually fix the
bug), and that makes it of dubious value for a backport.

If I had to slap a patch on there at this point, just implementing
an empty try_to_release_page (which disables try_to_free_buffers)
would be my choice. I just think something that has existed, at
least potentially, for the entire 2.6 series and is easy to avoid
is a bit dubious to fix now.

> I just don't think what you have is the proper fix. Calling
> into the core vfs and vm because right now it does something
> that works for you but is completely unrelated to what you
> are conceptually doing is not the right fix.

I think there is a strong conceptual relation and other code
doing largely the same thing is already in the kernel (ramfs). Plus
my gut feel says shared code will make maintenance easier.

You do have a point that the reuse may not be perfect and if that
is the case we need to watch out for the potential to mess things
up.

So far I don't see any problems with the reuse.

> Also, the patch I posted is big because it did other stuff
> with dynamically allocated ramdisks from loop (ie. a modern
> rewrite). As it is applied to rd.c and split into chunks, the
> actual patch to switch to the new backing store isn't actually
> that big. I'll submit it to -mm after things stabilise after
> the merge window too.

Sounds like a plan.

Eric


2007-10-22 00:15:57

by David Chinner

[permalink] [raw]
Subject: Re: [RFC][PATCH] block: Isolate the buffer cache in it's own mappings.

On Sun, Oct 21, 2007 at 02:24:46PM +1000, Nick Piggin wrote:
> On Saturday 20 October 2007 07:27, Eric W. Biederman wrote:
> > Currently only
> > metadata is more or less in sync with the contents of /dev/hda1.
>
> It either is or it isn't, right? And it is, isn't it? (at least
> for the common filesystems).

It is not true for XFS - its metadata is not in sync with /dev/<block>
at all, as all the cached metadata is kept in a different address space
to the raw block device.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2007-10-22 00:36:48

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Monday 22 October 2007 03:56, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > OK, I missed that you set the new inode's aops to the ramdisk_aops
> > rather than the bd_inode. Which doesn't make a lot of sense because
> > you just have a lot of useless aops there now.
>
> Not totally useless as you have mentioned they are accessed by
> the lru so we still need something there just not much.

Well, the ones that are unused are totally useless ;) I would
have expected them to be removed.


> >> Frankly I suspect the whole
> >> issue is to subtle and rare to make any backport make any sense. My
> >> apologies Christian.
> >
> > It's a data corruption issue. I think it should be fixed.
>
> Of course it should be fixed. I just don't know if a backport
> makes sense. The problem once understood is hard to trigger
> and easy to avoid.

I mean backported. That's just me though; I don't know the nuances
of -stable releases. It could be that they'd rather not risk introducing
something worse, which would be fair enough.


> >> Well those pages are only accessed through rd_blkdev_pagecache_IO
> >> and rd_ioctl.
> >
> > Wrong. It will be via the LRU, will get ->writepage() called,
> > block_invalidate_page, etc. And I guess also via sb->s_inodes, where
> > drop_pagecache_sb might do stuff to it (although it probably escapes
> > harm). But you're right that it isn't the obviously correct fix for
> > the problem.
>
> Yes. We will be accessed via the LRU. Yes I knew that.

OK it just didn't sound like it, seeing as you said that's the only
way they are accessed.


> The truth is
> whatever we do we will be visible to the LRU.

No. With my patch, nothing in the ramdisk code is visible to the LRU.
Which is how it should be.


> My preferences run
> towards having something that is less of a special case then a new
> potentially huge cache that is completely unaccounted for, that we
> have to build and maintain all of the infrastructure for
> independently.

It's not a cache, and it's not unaccounted for. It's specified exactly
with the rd sizing parameters. I don't know why you would say your
patch is better in this regard. Your ramdisk backing store will be
accounted as pagecache, which is completely wrong.
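
To be concrete about what I mean, a rough sketch (illustrative names
only, not the actual patch; locking and radix_tree_preload() handling
omitted): backing pages allocated privately and indexed in a
driver-owned radix tree, so they are never in the pagecache and never
visible to the LRU.

#include <linux/radix-tree.h>
#include <linux/gfp.h>
#include <linux/mm.h>

static struct page *rd_private_page(struct radix_tree_root *tree,
				    pgoff_t index)
{
	struct page *page;

	page = radix_tree_lookup(tree, index);
	if (page)
		return page;

	page = alloc_page(GFP_NOIO | __GFP_ZERO);
	if (!page)
		return NULL;

	if (radix_tree_insert(tree, index, page)) {
		__free_page(page);	/* tree allocation failed or we lost a race */
		return NULL;
	}
	return page;
}

The memory consumed is then bounded only by the rd sizing parameters,
exactly as described above.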


> The ramdisk code doesn't seem interesting enough
> long term to get people to do independent maintenance.
>
> With my patch we are on the road to being just like the ramdisk
> for maintenance purposes code except having a different GFP mask.

You can be using highmem, BTW. And technically it probably isn't
correct to use GFP_USER.


> > If you think it is a nice way to go, then I think you are
> > blind ;)
>
> Well we each have different tastes. I think my patch
> is a sane sensible small incremental step that does just what
> is needed to fix the problem. It doesn't drag a lot of other
> stuff into the problem like a rewrite would so we can easily verify
> it.

The small incremental step that fixes the problem is Christian's
patch.


> > It's horrible. And using truncate_inode_pages / grab_cache_page and
> > new_inode is an incredible argument to save a few lines of code. You
> > obviously didn't realise your so called private pages would get
> > accessed via the LRU, for example.
>
> I did but but that is relatively minor. Using the pagecache this
> way for this purpose is a well established idiom in the kernel
> so I didn't figure I was introducing anything to hard to maintain.

Where else is this an established idiom?


> > This is making a relatively
> > larger logical change than my patch, because now as well as having
> > a separate buffer cache and backing store, you are also making the
> > backing store pages visible to the VM.
>
> I am continuing to have the backing store pages visible to the VM,
> and from that perspective it is a smaller change then where we are
> today.

It is smaller in lines of code, but it is a larger change, because
what you are doing is two things: you are firstly discontinuing the
use of the buffer cache for the backing store, and secondly you are
introducing a new backing store which registers an inode with the
VFS and pages with the pagecache.

My patch does the same thing without those last two questionable
steps.

You now have to treat those backing store pages as pagecache pages,
and hope you have set the right flags and registered the right aops.


> Not that we can truly hide pages from the VM.

Any page you allocate is your private page. The problem is you're
just sending them back to the VM again.

2007-10-22 02:02:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Monday 22 October 2007 04:39, Eric W. Biederman wrote:
> Nick Piggin <[email protected]> writes:
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> >> Christian Borntraeger <[email protected]> writes:
> >>
> >> Let me put it another way. Looking at /proc/slabinfo I can get
> >> 37 buffer_heads per page. I can allocate 10% of memory in
> >> buffer_heads before we start to reclaim them. So it requires just
> >> over 3.7 buffer_heads on very page of low memory to even trigger
> >> this case. That is a large 1k filesystem or a weird sized partition,
> >> that we have written to directly.
> >
> > On a highmem machine it it could be relatively common.
>
> Possibly. But the same proportions still hold. 1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

You don't need 1K filesystems to have buffers attached though,
of course. You can hit the limit with a 4K filesystem with less
than 8GB in pagecache I'd say.
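
Roughly, assuming something like 896MB of low memory on a 32-bit
highmem box:

	max_buffer_heads ~= 3.7 * lowmem_pages
	                 ~= 3.7 * 229000 pages (~896MB lowmem)
	                 ~= 850000 buffer heads

With 4k blocks that is one buffer head per page, so somewhere around
3.3GB of pagecache carrying buffers is already enough to make
buffer_heads_over_limit true.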



> > You don't want to change that for a stable patch, however.
> > It fixes the bug.
>
> No it avoids the bug which is something slightly different.
> Further I contend that it is not obviously correct that there
> are no other side effects (because it doesn't actually fix the
> bug), and that makes it of dubious value for a backport.

The bug in question isn't exactly that it uses buffercache for its
backing store (although that's obviously rather hairy as well). It's
this specific problem sequence. And it does fix the problem.


> If I had to slap a patch on there at this point just implementing
> an empty try_to_release_page (which disables try_to_free_buffers)
> would be my choice.

How is that better? Now you're making the exact same change for
all filesystems that you didn't think was obviously correct for
rd.c.


> > I just don't think what you have is the proper fix. Calling
> > into the core vfs and vm because right now it does something
> > that works for you but is completely unrelated to what you
> > are conceptually doing is not the right fix.
>
> I think there is a strong conceptual relation and other code
> doing largely the same thing is already in the kernel (ramfs). Plus
> my gut feel says shared code will make maintenance easier.

ramfs is rather a different case. Filesystems intimately know
about the pagecache.


> You do have a point that the reuse may not be perfect and if that
> is the case we need to watch out for the potential to mess things
> up.
>
> So far I don't see any problems with the reuse.

It's just wrong. I guess if you don't see that by now, then we
have to just disagree.

2007-10-22 22:52:41

by Chris Mason

[permalink] [raw]
Subject: Re: [PATCH] rd: Use a private inode for backing storage

On Sun, 21 Oct 2007 12:39:30 -0600
[email protected] (Eric W. Biederman) wrote:

> Nick Piggin <[email protected]> writes:
>
> > On Sunday 21 October 2007 18:23, Eric W. Biederman wrote:
> >> Christian Borntraeger <[email protected]> writes:
> >
> >> Let me put it another way. Looking at /proc/slabinfo I can get
> >> 37 buffer_heads per page. I can allocate 10% of memory in
> >> buffer_heads before we start to reclaim them. So it requires just
> >> over 3.7 buffer_heads on very page of low memory to even trigger
> >> this case. That is a large 1k filesystem or a weird sized
> >> partition, that we have written to directly.
> >
> > On a highmem machine it it could be relatively common.
>
> Possibly. But the same proportions still hold. 1k filesystems
> are not the default these days and ramdisks are relatively uncommon.
> The memory quantities involved are all low mem.

It is definitely common during run time. It was seen in practice enough
to be reproducible and get fixed for the non-ramdisk case.

The big underlying question is which ramdisk usage case we are
shooting for. Keeping the ramdisk pages off the LRU can certainly help
the VM if larger ramdisks used at runtime are very common.

Otherwise, I'd say to keep it as simple as possible and use Eric's
patch. By simple I'm not counting lines of code; I'm counting overall
readability between something everyone knows (page cache usage) and
something specific to ramdisks (Nick's patch).

-chris