2008-11-20 00:38:45

by Toshiyuki Okajima

[permalink] [raw]
Subject: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

Hi.

I found it possible that even if a lot of pages can be logically released,
they cannot be released by try_to_release_page, and then they keep remaining.

This case enables an oom-killer to happen easily.

Details of the root cause and my patch which fixes it are shown below.
---
The direct data blocks can be released by the member function, releasepage()
of their mapping of the filesystem i-node.
(If an ext3 has the i-node, ext3_releasepage() is used as releasepage().)

On the other hand, the indirect data blocks (ext3) are attempted to be released
by try_to_free_buffers(). (And other metadata are also done by it.)
Because a block device has its mapping, and doesn't have own member function
to release a page.

But try_to_free_buffers() is a generic function which releases buffer_heads
(and a page), and no buffer_head can be released if a buffer_head has private
data (like journal_head) because the buffer_head's reference counter is bigger
than 0. Therefore, try_to_free_buffers() cannot release a buffer_head even if
it is possible to release its private data.

As a result, oom-killer may happen when a system memory is exhausted even if
it is possible to release a lot of private data and their pages, because
try_to_free_buffers() doesn't release such pages.

In order to solve this situation, we add a member function into a block device
to release private data and then the page.
This member function is:
- registered at a filesystem initialization time (get_sb_bdev())
- unregistered at a filesystem unmount time (kill_block_super())

This member function's pointer is located in a bdev_inode structure.
Besides, a client which registers it is also added into this structure.
A client for a filesystem is its superblock.

If we use an ext3, this additional member function can do equal processing to
ext3_releasepage() by using the superblock. And a block device's releasepage()
is necessary to call this additional member function. Therefore we need a
member function, 'releasepage' of the mapping of a block device.

Changing like them becomes possible to release private data and then the page
via try_to_release_page().
Therefore it becomes difficult for oom-killer to happen than before.
Because this patch enables journal_heads to be released more efficiently
in case of ext3.

I will post patches to solve it (ext3/ext4 version):
(1) [patch 1/3] vfs: release block-device-mapping buffer_heads which have the
filesystem private data for avoiding oom-killer
(2) [patch 2/3] ext3: release block-device-mapping buffer_heads which have the
filesystem private data for avoiding oom-killer
(3) [patch 3/3] ext4: release block-device-mapping buffer_heads which have the
filesystem private data for avoiding oom-killer

[Additional information]
I have confirmed that JBD on 2.6.28-rc4 to which my patch was applied could keep
running for long time without oom-killer under the heavy loads.
(Of course, JBD without the patch cannot keep running for long time
under the same situation.)
* This patch needs Ted's fix which was posted at "Wed, 5 Nov 2008 09:05:07 -0500"
* as "[PATCH] jbd: don't give up looking for space so easily in
* __log_wait_for_space".
* Because "no transactions" error happens easily by releasing journal_heads
* efficiently with my patch.
* But linux-2.6.28-rc4 includes his patch. Therefore I don't care about this.

Any comments are welcome.

Best Regards,
Toshiyuki Okajima


2008-11-24 21:15:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

On Thu, 20 Nov 2008 09:27:11 +0900
Toshiyuki Okajima <[email protected]> wrote:

> Hi.
>
> I found it possible that even if a lot of pages can be logically released,
> they cannot be released by try_to_release_page, and then they keep remaining.
>
> This case enables an oom-killer to happen easily.
>
> Details of the root cause and my patch which fixes it are shown below.
> ---
> The direct data blocks can be released by the member function, releasepage()
> of their mapping of the filesystem i-node.
> (If an ext3 has the i-node, ext3_releasepage() is used as releasepage().)
>
> On the other hand, the indirect data blocks (ext3) are attempted to be released
> by try_to_free_buffers(). (And other metadata are also done by it.)
> Because a block device has its mapping, and doesn't have own member function
> to release a page.
>
> But try_to_free_buffers() is a generic function which releases buffer_heads
> (and a page), and no buffer_head can be released if a buffer_head has private
> data (like journal_head) because the buffer_head's reference counter is bigger
> than 0. Therefore, try_to_free_buffers() cannot release a buffer_head even if
> it is possible to release its private data.
>
> As a result, oom-killer may happen when a system memory is exhausted even if
> it is possible to release a lot of private data and their pages, because
> try_to_free_buffers() doesn't release such pages.
>
> In order to solve this situation, we add a member function into a block device
> to release private data and then the page.
> This member function is:
> - registered at a filesystem initialization time (get_sb_bdev())
> - unregistered at a filesystem unmount time (kill_block_super())
>
> This member function's pointer is located in a bdev_inode structure.
> Besides, a client which registers it is also added into this structure.
> A client for a filesystem is its superblock.
>
> If we use an ext3, this additional member function can do equal processing to
> ext3_releasepage() by using the superblock. And a block device's releasepage()
> is necessary to call this additional member function. Therefore we need a
> member function, 'releasepage' of the mapping of a block device.
>
> Changing like them becomes possible to release private data and then the page
> via try_to_release_page().
> Therefore it becomes difficult for oom-killer to happen than before.
> Because this patch enables journal_heads to be released more efficiently
> in case of ext3.
>
> I will post patches to solve it (ext3/ext4 version):
> (1) [patch 1/3] vfs: release block-device-mapping buffer_heads which have the
> filesystem private data for avoiding oom-killer
> (2) [patch 2/3] ext3: release block-device-mapping buffer_heads which have the
> filesystem private data for avoiding oom-killer
> (3) [patch 3/3] ext4: release block-device-mapping buffer_heads which have the
> filesystem private data for avoiding oom-killer
>
> [Additional information]
> I have confirmed that JBD on 2.6.28-rc4 to which my patch was applied could keep
> running for long time without oom-killer under the heavy loads.
> (Of course, JBD without the patch cannot keep running for long time
> under the same situation.)
> * This patch needs Ted's fix which was posted at "Wed, 5 Nov 2008 09:05:07 -0500"
> * as "[PATCH] jbd: don't give up looking for space so easily in
> * __log_wait_for_space".
> * Because "no transactions" error happens easily by releasing journal_heads
> * efficiently with my patch.
> * But linux-2.6.28-rc4 includes his patch. Therefore I don't care about this.
>

I'm scratching my head trying to work out why we never encountered and
fixed this before.

Is it possible that you have a very large number of filesystems
mounted, and/or that they have large journals?



Would it not be more logical if the ->client_releasepage function
pointer were a member of the blockdev address_space_operations, rather
than some random field in the blockdev inode? That arrangement might
well be reused in the future, when some other address_space needs to
talk to a different address_space to make a page reclaimable.

2008-11-25 06:13:50

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

Hi Andrew,
Thanks for your comments.

> On Thu, 20 Nov 2008 09:27:11 +0900
> Toshiyuki Okajima <[email protected]> wrote:
<SNIP>
>
> I'm scratching my head trying to work out why we never encountered and
> fixed this before.

> Is it possible that you have a very large number of filesystems
> mounted, and/or that they have large journals?

Yes, I think it happen more easily under those conditions.

Actually, I encountered this situation if conditions were:
- on the x86 architecture (The size of Normal zone is only 800MB
even if the huge memory (more than 4GB) install.)
- reserving the big memory (more than 100MB) for the kdump kernel.
(The memory obtains from Normal Zone.)
- mounting the large number of ext3 filesystems (more than 50).

And the following operations were done:
- many I/Os were issued to many filesystems sequentially and continuously.
(They made many journal_heads (and buffer_heads).
=> they were metadata.)
- issuing the I/Os to many filesystems were stopped.
(This caused many metadata to remain.)

By their operations, the number of remaining the journal_heads was
more than 100000 (They occupied 400MB (The same number of buffer_heads remained
and the block size was 4096B)). We cannot release those journal_heads because
checkpointing the transactions are not executed till some I/Os are issued to
the filesystems or the filesystems were unmounting.
And many other slab caches which couldn't be released occupied about 300MB.
Therefore about 800MB memory couldn't be released.
As a result, there was no room in the Normal zone.

I think you could not encounter it because you haven't done such the following:
- You reserve the big memory for the kdump kernel.
- You issue many I/Os to each ext3 filesystem sequentially and continuously,
and then you never issue some I/Os to the filesystems at all afterwards.
(Especially, you do the operations which causes many metadata to remain.
Example: Delete many files which are huge.)

> Would it not be more logical if the ->client_releasepage function
> pointer were a member of the blockdev address_space_operations, rather
> than some random field in the blockdev inode? That arrangement might
> well be reused in the future, when some other address_space needs to
> talk to a different address_space to make a page reclaimable.

I think it logical to replace a default ->releasepage with a function pointer
which a client (FS) passed, but I don't think it logical to add a new member
function in address space in order to release a client page. Because new
function is called from ->releasepage, so I think this function pointer should
not be put in the same level as the releasepage of address space.

Though, it is difficult to replace ->releasepage member with a client function
because there is no exclusive operation while this function is calling.

So, I made this patch (without replacing ->releasepage).

How about my thought?

Best Regards,
Toshiyuki Okajima


2008-11-25 06:22:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

On Mon, Nov 24, 2008 at 01:13:52PM -0800, Andrew Morton wrote:
>
> I'm scratching my head trying to work out why we never encountered and
> fixed this before.
>
> Is it possible that you have a very large number of filesystems
> mounted, and/or that they have large journals?

In Okajima-san's original test case, he was running on a 128 meg
system, and created an ext3 filesystem which with a 512 meg journal.
This could very well be a "Doctor, doctor, it hurts when I do that"
sort of thing, although I can see this being a believable situation
for small NAS boxes. Even with his patches, though, this sort of
configuration can run into memory pressure problems, since if a
transaction grows too big, too many memory pages will get pinned, and
we currently won't close a transaction early if we seem to be pinning
too much memory and we are running into memory pressure issues
(although this would be a good idea). So his patches will postpone
the OOM killer from running, but if the journal is sufficiently large
(thus allowing sufficiently large transactions), and the amount of
memory sufficiently small, we still could run into OOM situations.

> Would it not be more logical if the ->client_releasepage function
> pointer were a member of the blockdev address_space_operations, rather
> than some random field in the blockdev inode? That arrangement might
> well be reused in the future, when some other address_space needs to
> talk to a different address_space to make a page reclaimable.

I'm not sure how your suggestion would be implemented practically
speaking. Right now as I undersand things the blockdev's a_ops field
is an attribute of the block device and is normally a pointer to
default_blkdev_aops. In Okajima-san's patch, we define the blkdev's
releasepages to be a function which looks like which
client_releasepage to call in the blockdev inode. If we put that
field in the blockdev a_ops structure, then when when we mount the
filesystem, we would have to make a copy of the blockdev's a_ops
structure so we can set the client_releasepage pointer. That seems to
be rather inefficient and kludgy, so I don't imagine that's what you
meant. What am I missing?

- Ted

2008-11-25 06:30:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

On Tue, 25 Nov 2008 15:13:37 +0900 Toshiyuki Okajima <[email protected]> wrote:

> Hi Andrew,
> Thanks for your comments.
>
> > On Thu, 20 Nov 2008 09:27:11 +0900
> > Toshiyuki Okajima <[email protected]> wrote:
> <SNIP>
> >
> > I'm scratching my head trying to work out why we never encountered and
> > fixed this before.
>
> > Is it possible that you have a very large number of filesystems
> > mounted, and/or that they have large journals?
>
> Yes, I think it happen more easily under those conditions.
>
> Actually, I encountered this situation if conditions were:
> - on the x86 architecture (The size of Normal zone is only 800MB
> even if the huge memory (more than 4GB) install.)
> - reserving the big memory (more than 100MB) for the kdump kernel.
> (The memory obtains from Normal Zone.)
> - mounting the large number of ext3 filesystems (more than 50).
>
> And the following operations were done:
> - many I/Os were issued to many filesystems sequentially and continuously.
> (They made many journal_heads (and buffer_heads).
> => they were metadata.)
> - issuing the I/Os to many filesystems were stopped.
> (This caused many metadata to remain.)
>
> By their operations, the number of remaining the journal_heads was
> more than 100000 (They occupied 400MB (The same number of buffer_heads remained
> and the block size was 4096B)). We cannot release those journal_heads because
> checkpointing the transactions are not executed till some I/Os are issued to
> the filesystems or the filesystems were unmounting.
> And many other slab caches which couldn't be released occupied about 300MB.
> Therefore about 800MB memory couldn't be released.
> As a result, there was no room in the Normal zone.
>
> I think you could not encounter it because you haven't done such the following:
> - You reserve the big memory for the kdump kernel.
> - You issue many I/Os to each ext3 filesystem sequentially and continuously,
> and then you never issue some I/Os to the filesystems at all afterwards.
> (Especially, you do the operations which causes many metadata to remain.
> Example: Delete many files which are huge.)

yup.

> > Would it not be more logical if the ->client_releasepage function
> > pointer were a member of the blockdev address_space_operations, rather
> > than some random field in the blockdev inode? That arrangement might
> > well be reused in the future, when some other address_space needs to
> > talk to a different address_space to make a page reclaimable.
>
> I think it logical to replace a default ->releasepage with a function pointer
> which a client (FS) passed, but I don't think it logical to add a new member
> function in address space in order to release a client page. Because new
> function is called from ->releasepage, so I think this function pointer should
> not be put in the same level as the releasepage of address space.
>
> Though, it is difficult to replace ->releasepage member with a client function
> because there is no exclusive operation while this function is calling.
>
> So, I made this patch (without replacing ->releasepage).
>
> How about my thought?

yeah, I don't have particularly strong opinions either way. If it
needs changing later, we can change it.

2008-11-25 07:32:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

Okajima-san,

I've added your patches to the ext4 patch queue for testing. I did
make some changes to the commit descriptions for clarity's sake

-------------------------
vfs: add releasepages hooks to block devices which can be used by file systems

From: Toshiyuki Okajima <[email protected]>

Implement blkdev_releasepage() to release the buffer_heads and page
after we release private data which belongs to a client of the block
device, such as a filesystem.

blkdev_releasepage() call the client's releasepage() which is
registered by blkdev_register_client_releasepage() to release its
private data.

Signed-off-by: Toshiyuki Okajima <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
-------------------------
ext3: provide function to release journal heads under memory pressure

From: Toshiyuki Okajima <[email protected]>

Pages in the page cache belonging to ext3 data files are released via
the ext3_releasepage() function specified in the ext3 inode's
address_space_ops. However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
-------------------------
ext4: implement release_metadata to release private data under memory pressure

From: Toshiyuki Okajima <[email protected]>

Pages in the page cache belonging to ext4 data files are released via
the ext4_releasepage() function specified in the ext4 inode's
address_space_ops. However, metadata blocks (such as indirect blocks,
directory blocks, etc) are managed via the block device
address_space_ops, and they can not be released by
try_to_free_buffers() if they have a journal head attached to them.

To address this, we supply a release_metadata function which is called
by the block device's blkdev_releasepage() function, which calls
journal_try_to_free_buffers() function to free the metadata.

Signed-off-by: Toshiyuki Okajima <[email protected]>
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
----------------------

Also, in the 2nd and 3rd patches, I made change to the comment
describing ext[34]_release_metadata. It now reads:

/*
* Try to release metadata pages (indirect blocks, directories) which are
* mapped via the block device. Since these pages could have journal heads
* which would prevent try_to_free_buffers() from freeing them, we must use
* jbd[2] layer's try_to_free_buffers() function to release them.
*/

I hope you agree these changes are more descriptive of what is going
on in the patch.

Best regards,

- Ted

2008-11-25 08:07:04

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [RESEND][PATCH 0/3 BUG,RFC] release block-device-mapping buffer_heads which have the filesystem private data for avoiding oom-killer

Ted-san,
Thanks for your applying.

Theodore Tso wrote:
> Okajima-san,
>
> I've added your patches to the ext4 patch queue for testing. I did
> make some changes to the commit descriptions for clarity's sake
>
> -------------------------
> vfs: add releasepages hooks to block devices which can be used by file systems
<SNIP>
> -------------------------
> ext3: provide function to release journal heads under memory pressure
<SNIP>
> -------------------------
> ext4: implement release_metadata to release private data under memory pressure
<SNIP>
> ----------------------
>
> Also, in the 2nd and 3rd patches, I made change to the comment
> describing ext[34]_release_metadata. It now reads:
>
> /*
> * Try to release metadata pages (indirect blocks, directories) which are
> * mapped via the block device. Since these pages could have journal heads
> * which would prevent try_to_free_buffers() from freeing them, we must use
> * jbd[2] layer's try_to_free_buffers() function to release them.
> */
>
> I hope you agree these changes are more descriptive of what is going
> on in the patch.
OK!

I agree to your all appropriate changes.

(I am looking forward to merging them to mainstream kernel early. :-))

Best regards,
Toshiyuki Okajima