2007-05-03 17:38:30

by Alex Tomas

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

Andrew Morton wrote:
> We can make great improvements here, and I've (twice) previously described
> how: hoist the entire ordered-mode data handling out of ext3, and out of
> the buffer_head layer and move it up into the VFS pagecache layer.
> Basically, do ordered-data with a commit-time inode walk, calling
> do_sync_mapping_range().
>
> Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
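
For reference, the commit-time walk proposed above might look roughly
like the sketch below; do_sync_mapping_range() is the in-kernel backend
of sys_sync_file_range(), while the dirty-inode walk and its locking are
schematic only:

/* sketch: at commit time, push ordered data out before the commit record */
struct inode *inode;

list_for_each_entry(inode, &sb->s_dirty, i_list) {
        do_sync_mapping_range(inode->i_mapping, 0, LLONG_MAX,
                              SYNC_FILE_RANGE_WAIT_BEFORE |
                              SYNC_FILE_RANGE_WRITE |
                              SYNC_FILE_RANGE_WAIT_AFTER);
}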

I'm not sure it's that easy.

If we move to pages, then we have to mark the pages to be flushed while
holding the transaction open. Now take delayed allocation into account:
we need to allocate a number of blocks at once and then mark all the
pages mapped, again within the context of the same transaction. So,
would an implementation look like the following?

generic_writepages() {
        /* collect a set of contiguous dirty pages */
        foo_get_blocks() {
                foo_journal_start();          /* open a transaction */
                foo_new_blocks();             /* allocate the whole cluster at once */
                foo_attach_blocks_to_inode(); /* metadata update under the handle */
                generic_mark_pages_mapped();  /* map pages/bhs inside the handle */
                foo_journal_stop();           /* only now may commit proceed */
        }
}

Another question: will it scale well, given that the number of dirty
inodes can be much larger than the number of inodes with dirty mapped
blocks (in the delayed-allocation case, for example)?

thanks, Alex





2007-05-03 23:55:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously described
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer.
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> >
> > Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
>
> I'm not sure it's that easy.
>
> If we move to pages, then we have to mark the pages to be flushed while
> holding the transaction open. Now take delayed allocation into account:
> we need to allocate a number of blocks at once and then mark all the
> pages mapped, again within the context of the same transaction.

Yes, there can be issues with needing to allocate journal space within the
context of a commit. But

a) If the page has newly allocated space on disk then the metadata which
refers to that page is already in the journal: no new journal space
needed.

b) If the page doesn't have space allocated on disk then we don't need
to write it out at ordered-mode commit time, because the post-recovery
filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?

However b) might lead to the hey-my-file-is-full-of-zeroes problem.
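
As a per-page decision procedure the three cases might be sketched as
below; buffer_mapped() and buffer_new() are real buffer-state tests,
while the surrounding walk and write_it() are invented for illustration:

if (!buffer_mapped(bh)) {
        /* (b) delayed allocation, no disk mapping yet: post-recovery
         * metadata holds no reference to this data, so skip it */
} else if (buffer_new(bh)) {
        /* (a) freshly allocated: the referring metadata is already in
         * the journal; just write the data, no new journal space */
        write_it(bh);
} else {
        /* (c) plain overwrite: no metadata update was needed; written
         * only for ordering semantics */
        write_it(bh);
}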

> So, would an implementation look like the following?
>
> generic_writepages() {
>         /* collect a set of contiguous dirty pages */
>         foo_get_blocks() {
>                 foo_journal_start();          /* open a transaction */
>                 foo_new_blocks();             /* allocate the whole cluster at once */
>                 foo_attach_blocks_to_inode(); /* metadata update under the handle */
>                 generic_mark_pages_mapped();  /* map pages/bhs inside the handle */
>                 foo_journal_stop();           /* only now may commit proceed */
>         }
> }
>
> Another question: will it scale well, given that the number of dirty
> inodes can be much larger than the number of inodes with dirty mapped
> blocks (in the delayed-allocation case, for example)?

Possibly - zillions of dirty-for-atime inodes might get in the way. A
short-term fix would be to create a separate dirty-inode list on the
superblock (ugh). A long-term fix is to rip out all the per-superblock
dirty-inode lists and use a radix-tree. Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
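
With such a tree, commit could use the radix-tree's tagged gang lookup
much as the pagecache already does for dirty pages. In the sketch below
the per-superblock tree and the dirty tag are hypothetical, but
radix_tree_gang_lookup_tag() is the real API:

struct inode *batch[16];
unsigned long first = 0;
unsigned int i, nr;

/* tagged, restartable walk over dirty inodes, keyed by inode number */
while ((nr = radix_tree_gang_lookup_tag(&sb->s_dirty_tree,
                                        (void **)batch, first,
                                        ARRAY_SIZE(batch),
                                        INODE_TAG_DIRTY)) > 0) {
        for (i = 0; i < nr; i++)
                write_inode_pages(batch[i]);    /* hypothetical helper */
        first = batch[nr - 1]->i_ino + 1;       /* restart past the last hit */
}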

2007-05-04 06:18:36

by Alex Tomas

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

Andrew Morton wrote:
> Yes, there can be issues with needing to allocate journal space within the
> context of a commit. But

No, no, this isn't required. We only need to mark pages/blocks within
the transaction; otherwise a race is possible: we allocate blocks in a
transaction, then the transaction starts to commit, and only then do we
mark the pages/blocks that have to be flushed before the commit, which
is too late.

> a) If the page has newly allocated space on disk then the metadata which
> refers to that page is already in the journal: no new journal space
> needed.
>
> b) If the page doesn't have space allocated on disk then we don't need
> to write it out at ordered-mode commit time, because the post-recovery
> filesystem will not have any references to that page.
>
> c) If the page is dirty due to overwrite then no metadata update was required.
>
> IOW, under what circumstances would an ordered-mode commit need to allocate
> space for a delayed-allocate page?

No need to allocate space within the commit thread, I think; we only
need to take care of the race I described above. In a hackish version
of data=ordered for delayed allocation I used a counter of submitted
bios carrying newly-allocated blocks, and the commit thread waits for
the counter to reach 0.
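
The counter scheme fits in a few lines; atomic_t, wait_event() and
wake_up() are the real primitives, while the t_ordered_* fields and the
hook points are made up for the sketch:

/* when submitting a bio that carries newly-allocated blocks */
atomic_inc(&trans->t_ordered_count);

/* from the bio completion handler */
if (atomic_dec_and_test(&trans->t_ordered_count))
        wake_up(&trans->t_ordered_wait);

/* commit thread, before it writes the commit record */
wait_event(trans->t_ordered_wait,
           atomic_read(&trans->t_ordered_count) == 0);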

>
> However b) might lead to the hey-my-file-is-full-of-zeroes problem.
>

thanks, Alex

2007-05-04 06:39:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit. But
>
> No, no, this isn't required. We only need to mark pages/blocks within
> the transaction; otherwise a race is possible: we allocate blocks in a
> transaction, then the transaction starts to commit, and only then do we
> mark the pages/blocks that have to be flushed before the commit, which
> is too late.

I don't understand. Can you please describe the race in more detail?

2007-05-04 06:57:41

by Alex Tomas

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

Andrew Morton wrote:
> On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>> Yes, there can be issues with needing to allocate journal space within the
>>> context of a commit. But
>> No, no, this isn't required. We only need to mark pages/blocks within
>> the transaction; otherwise a race is possible: we allocate blocks in a
>> transaction, then the transaction starts to commit, and only then do we
>> mark the pages/blocks that have to be flushed before the commit, which
>> is too late.
>
> I don't understand. Can you please describe the race in more detail?

If I understood your idea right, then in data=ordered mode the commit
thread writes out all dirty mapped blocks before the real commit.

Say we have two threads: t1 is a thread doing flushing and t2 is the
commit thread:

t1                                      t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()
                                        going to commit
                                        find inode I dirty
                                        do NOT find these blocks because
                                        they're allocated only, but
                                        pages/bhs aren't mapped to them
                                        start commit

map pages/bhs to just-allocated blocks


So, either we mark pages/bhs some way within
journal_start()--journal_stop(), or the commit thread has to do a
lookup over all dirty pages. The latter doesn't sound nice, IMHO.
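
The first option could amount to the fragment below;
ext3_journal_start()/ext3_journal_stop() and ext3_new_blocks() are real
entry points of that era, while mark_page_ordered() is a hypothetical
helper tying each page to the running transaction so commit can find it:

handle = ext3_journal_start(inode, needed_blocks);
ext3_new_blocks(handle, inode, goal, &count, &err);  /* allocate cluster */
for (i = 0; i < nr_pages; i++)
        mark_page_ordered(pages[i], handle);  /* hypothetical: commit
                                               * flushes these first */
ext3_journal_stop(handle);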

thanks, Alex



2007-05-04 07:18:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:
> >
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit. But
> >> No, no, this isn't required. We only need to mark pages/blocks within
> >> the transaction; otherwise a race is possible: we allocate blocks in a
> >> transaction, then the transaction starts to commit, and only then do we
> >> mark the pages/blocks that have to be flushed before the commit, which
> >> is too late.
> >
> > I don't understand. Can you please describe the race in more detail?
>
> If I understood your idea right, then in data=ordered mode the commit
> thread writes out all dirty mapped blocks before the real commit.
>
> Say we have two threads: t1 is a thread doing flushing and t2 is the
> commit thread:
>
> t1                                      t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding. The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean? Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().


>                                         going to commit
>                                         find inode I dirty
>                                         do NOT find these blocks because
>                                         they're allocated only, but
>                                         pages/bhs aren't mapped to them
>                                         start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page(). Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.
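
In code the commit-time check would be roughly the following;
lock_page() and the buffer-state tests are real, the surrounding walk
is schematic:

lock_page(page);        /* blocks until the flushing thread is done */
if (page_has_buffers(page)) {
        struct buffer_head *bh = page_buffers(page);

        if (buffer_mapped(bh) && buffer_dirty(bh))
                /* another thread mapped it meanwhile: write it out */
                ll_rw_block(WRITE, 1, &bh);
}
unlock_page(page);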



It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search. But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data. Files which
have chattr +j would screw things up, as usual.

I assume (hope) that your delayed allocation code implements
->writepages()? Doing the allocation one-page-at-a-time sounds painful...

>
> map pages/bhs to just-allocated blocks
>
>
> So, either we mark pages/bhs some way within
> journal_start()--journal_stop(), or the commit thread has to do a
> lookup over all dirty pages. The latter doesn't sound nice, IMHO.
>

I don't think I'm understanding you fully yet.

2007-05-04 07:39:45

by Alex Tomas

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

Andrew Morton wrote:
> I'm still not understanding. The terms you're using are a bit ambiguous.
>
> What does "find some dirty unallocated blocks" mean? Find a page which is
> dirty and which does not have a disk mapping?
>
> Normally the above operation would be implemented via
> ext4_writeback_writepage(), and it runs under lock_page().

I'm mostly worried about the delayed-allocation case. My impression was
that holding a number of pages locked isn't a good idea, even if they're
locked in index order. So I was going to mark a number of pages
writeback, then allocate blocks for all of them at once, then put the
proper block numbers into the bh's (or PG_mappedtodisk?).
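
That is, something along these lines; set_page_writeback() is real (and
must be called with the page locked), while foo_new_blocks() and
map_page_buffers() are stand-ins, and dirty-flag handling is omitted:

/* phase 1: move the batch into writeback, then drop the page locks */
for (i = 0; i < nr; i++) {
        lock_page(pages[i]);
        set_page_writeback(pages[i]);
        unlock_page(pages[i]);
}

/* phase 2: a single allocation for the whole batch */
block = foo_new_blocks(handle, inode, nr);

/* phase 3: map each page's buffers to its blocks and submit the I/O */
for (i = 0; i < nr; i++)
        map_page_buffers(pages[i], block + i);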

>
>
>>                                        going to commit
>>                                        find inode I dirty
>>                                        do NOT find these blocks because
>>                                        they're allocated only, but
>>                                        pages/bhs aren't mapped to them
>>                                        start commit
>
> I think you're assuming here that commit would be using ->t_sync_datalist
> to locate dirty buffer_heads.

Nope, I mean the sb->inode->page walk.

> But under this proposal, t_sync_datalist just gets removed: the new
> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> understanding you, the way in which we'd handle any such race is to make
> kjournald's writeback of the dirty pages block in lock_page(). Once it
> gets the page lock it can look to see if some other thread has mapped the
> page to disk.

If I'm right that holding a number of pages locked is a bad idea, then
the pages won't be locked but under writeback. Of course kjournald can
block on writeback as well, but how does it find the pages with *newly
allocated* blocks only?

> It may turn out that kjournald needs a private way of getting at the
> I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> anyway, with a tagged search. But I expect that a single pass through the
> superblock's dirty inodes would suffice for ordered-data. Files which
> have chattr +j would screw things up, as usual.

Not dirty inodes only, but rather some fast way to find pages with
newly allocated pages.

> I assume (hope) that your delayed allocation code implements
> ->writepages()? Doing the allocation one-page-at-a-time sounds painful...

Indeed; this is the root cause of all this complexity.

thanks, Alex


2007-05-04 08:03:46

by Andrew Morton

[permalink] [raw]
Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > I'm still not understanding. The terms you're using are a bit ambiguous.
> >
> > What does "find some dirty unallocated blocks" mean? Find a page which is
> > dirty and which does not have a disk mapping?
> >
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
>
> I'm mostly worried about the delayed-allocation case. My impression was
> that holding a number of pages locked isn't a good idea, even if they're
> locked in index order. So I was going to mark a number of pages
> writeback, then allocate blocks for all of them at once, then put the
> proper block numbers into the bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome. If someone comes in and does
an fsync() we've lost our synchronisation point. Yes, all callers happen
to do

        lock_page(page);
        wait_on_page_writeback(page);

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals. We're OK wrt page reclaim and we're OK wrt
truncate and invalidate. As long as the page is uptodate we _should_ be OK
wrt readpage(). But still, it'd be better to use the standard locking
rather than inventing new rules, if possible.


I'd be 100% OK with locking multiple pages in ascending pgoff_t order.
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow. But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
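
In code the standard-locking variant is simple; ascending index order
is what makes concurrent lockers deadlock-free:

/* pages[] collected in ascending ->index order */
for (i = 0; i < nr; i++)
        lock_page(pages[i]);    /* everyone locks in the same order */

/* ...allocate blocks, map buffers, set pages writeback... */

for (i = 0; i < nr; i++)
        unlock_page(pages[i]);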


> >
> >
> >>                                        going to commit
> >>                                        find inode I dirty
> >>                                        do NOT find these blocks because
> >>                                        they're allocated only, but
> >>                                        pages/bhs aren't mapped to them
> >>                                        start commit
> >
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
>
> Nope, I mean the sb->inode->page walk.
>
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page(). Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
>
> If I'm right that holding a number of pages locked is a bad idea, then
> the pages won't be locked but under writeback. Of course kjournald can
> block on writeback as well, but how does it find the pages with *newly
> allocated* blocks only?

I don't think we'd want kjournald to do that. Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view. If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search. But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data. Files which
> > have chattr +j would screw things up, as usual.
>
> Not dirty inodes only, but rather some fast way to find pages with
> newly allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon.
It's what we do now.