LinuxLists.cc - Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose when FS is under heavy write load (massive starvation)

On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger wrote:

> On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all
> > happens to any other files. On ext3, it seems to sync the global journal,
> > which means that just about *everything* that writes even a single byte
> > (well, at least anything journalled, which would be all the normal
> > directory ops etc) to disk will just *stop* dead cold!
> >
> > It's horrid. And it really is ext3, not "fsync()".
> >
> > I used to run reiserfs, and it had its problems, but this was the
> > "feature" of ext3 that I've disliked most. If you run a MUA with local
> > mail, it will do fsync's for most things, and things really hiccup if you
> > are doing some other writes at the same time. In contrast, with reiser, if
> > you did a big untar or some other big write, if somebody fsync'ed a small
> > file, it wasn't even a blip on the radar - the fsync would sync just that
> > small thing.
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too. The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file. People used to complain about this with reiserfs all the
> time having corrupt data in new files after a crash, which is why I believe
> it was fixed.

People still complain about hey-my-files-are-all-full-of-zeroes on XFS.

> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed. That is doubly bad if you have a very large journal. I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
>

We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer.
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().

Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.

And guess what? We can then partly fix _this_ problem too. If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.
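
The selective-sync idea above can be sketched as a small user-space model in C. This is only an illustration under stated assumptions: `struct model_inode`, its two flags and `ordered_commit_walk()` are hypothetical names, not kernel interfaces, and real code would presumably call do_sync_mapping_range() on each selected inode rather than collect inode numbers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of an inode; not a kernel structure. */
struct model_inode {
    int  ino;
    bool has_dirty_pages;       /* pagecache pages awaiting writeback */
    bool alloc_in_this_commit;  /* block-allocation metadata is part of
                                   the committing transaction */
};

/*
 * Commit-time inode walk on behalf of fsync(fsync_ino).
 * An inode's pages are selected only if it is the fsync target or if the
 * committing transaction carries block-allocation metadata for it.
 * Returns the number of inode numbers stored in out[].
 */
static size_t ordered_commit_walk(const struct model_inode *inodes, size_t n,
                                  int fsync_ino, int *out)
{
    size_t count = 0;

    assert(inodes != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct model_inode *in = &inodes[i];

        if (!in->has_dirty_pages)
            continue;                   /* nothing to write */
        if (in->ino == fsync_ino || in->alloc_in_this_commit)
            out[count++] = in->ino;     /* must reach disk before commit */
        /* otherwise: dirty by overwrite only, may be skipped */
    }
    return count;
}
```

In this model an inode that is dirty only because of overwrites (no new allocation in the transaction) is left for ordinary background writeback, which is the "partly fix" case described above.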

Weep. It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.

2007-04-28 05:45:54

On Sat, 28 Apr 2007, Mikulas Patocka wrote:

> On Fri, 27 Apr 2007, Bill Huey wrote:
> Hi
>
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2);

--- BTW, I don't think that writing to unallocated parts of the disk is a good
idea. These filesystems have cool write benchmarks, but one subtle (and
unbenchmarkable) problem:
they group files according to the time when they were created and not
according to the directory hierarchy.
When the user has a directory with project files and has edited different
files at different times, normal filesystems will place the files near
each other (so that "grep blabla *" is fast), while log-structured
filesystems will scatter the files over the whole disk.

Mikulas

2007-04-28 06:10:25

On Fri, 27 Apr 2007, Bill Huey wrote:

> On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
>> Oh, well.. Journalling sucks.
>>
>> I was actually _really_ hoping that somebody would come along and tell
>> everybody that this whole journal-logging is stupid, and that it's just
>> better to not ever re-write blocks on disk, but instead write to new
>> blocks with version numbers (and not re-use old blocks until new versions
>> are stable on disk).
>>
>> There was even somebody who did something like that for a PhD thesis, I
>> forget the details (and it apparently died when the thesis was presumably
>> accepted ;).
>
> That sounds a whole lot like NetApp's WAFL file system and is heavily
> patented.
>
> bill

Hi

SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
phase tree filesystems (TUX2); it writes inside normal used structures,
but it marks each structure with generation tags --- when it updates
global table of tags, it atomically makes several structures valid. I
don't know about this idea being used elsewhere.

Its fsync is slow too (needs to write all (meta)data too), but it at
least doesn't livelock --- fsync is basically:
* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write
was in progress) and wait for completion
* update global generation count table
* release the lock
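
The five steps above can be sketched as a toy in-memory model in C. Everything here is an assumption made for illustration: `struct model_fs`, `write_dirty_buffers()`, `model_fsync()` and the callback that stands in for concurrent writers are hypothetical and are not taken from the SpadFS sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUF 8

/* Hypothetical in-memory model of a filesystem; not SpadFS code. */
struct model_fs {
    bool     dirty[NBUF];       /* one dirty flag per buffer */
    bool     metadata_locked;   /* true while metadata updates are blocked */
    unsigned generation_disk;   /* generation table as last written */
    unsigned generation_mem;    /* generation table in memory */
};

/* Write every dirty buffer and "wait"; returns how many were written. */
static size_t write_dirty_buffers(struct model_fs *fs)
{
    size_t written = 0;

    for (size_t i = 0; i < NBUF; i++) {
        if (fs->dirty[i]) {
            fs->dirty[i] = false;
            written++;
        }
    }
    return written;
}

/* Example stand-in for another thread dirtying a buffer during pass 1. */
static void example_writer(struct model_fs *fs)
{
    fs->dirty[5] = true;
}

/*
 * The fsync sequence. Returns the number of buffers written while the
 * lock was held, i.e. the size of the second, bounded pass.
 */
static size_t model_fsync(struct model_fs *fs,
                          void (*concurrent_writer)(struct model_fs *))
{
    size_t second;

    assert(fs != NULL);
    write_dirty_buffers(fs);                  /* 1: bulk write, no lock held */
    if (concurrent_writer)
        concurrent_writer(fs);                /* updates racing with pass 1 */
    fs->metadata_locked = true;               /* 2: block metadata updates */
    second = write_dirty_buffers(fs);         /* 3: only what was re-dirtied */
    fs->generation_disk = fs->generation_mem; /* 4: publish generation table */
    fs->metadata_locked = false;              /* 5: release the lock */
    return second;
}
```

The point the model tries to show is that the pass done under the lock only has to cover buffers dirtied during the first pass, and nothing new can be dirtied once the lock is held, so the sequence terminates.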

Maybe Suse will be paying me from this autumn to add more features to it
--- so far it works, doesn't eat data, but isn't well known :)

Mikulas

2007-04-28 08:44:35

by Matthias Andree

On Fri, 27 Apr 2007, Linus Torvalds wrote:

>
>
> On Fri, 27 Apr 2007, Marat Buharov wrote:
> >
> > On 4/27/07, Andrew Morton wrote:
> > > Aside: why the heck do applications think that their data is so important
> > > that they need to fsync it all the time. I used to run a kernel on my
> > > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > > pleasurable.
> >
> > So, if having fake fsync() and fdatasync() is pleasurable for laptop
> > and desktop, maybe it's time to add an option to Kconfig which
> > disables normal fsync behaviour in favor of robust desktop?
>
> This really is an ext3 issue, not "fsync()".
>
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,

This behavior has been in Linux and sort of official since the early
2.4.X days - remember the discussion on fsync()ing directory changes for
MTAs that led to the mount option "dirsync" for ext?fs so that rename(),
link() and stuff like that became synchronous even without fsync()ing
the parent directory? I can look up archive references if need be.

Surely four years ago, if not five (this is off the top of my head, not
a quotable fact I verified from the LKML archives though).

> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsync's for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.

It's not as though I'd recommend reiserfs. I have seen one major
corruption recently in openSUSE 10.2 with ext3, but I've had constant
headaches with reiserfs since the day it went into S.u.S.E. kernels at
the time until I switched away from reiserfs some years ago.

--
Matthias Andree

2007-04-28 08:45:49

by Matthias Andree

On Fri, 27 Apr 2007, Linus Torvalds wrote:

> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).

Only that you need direct-overwrite support to be able to safely trash
data you no longer need...

--
Matthias Andree

2007-04-28 20:46:29

> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
>
> But wedging for 20 minutes is probably excessive punishment.

What I wonder most is why vim fsyncs its swapfile regularly (blocking typing
while it does so) and doesn't fsync the resulting file on :w :-/

Mikulas

2007-04-28 21:12:41

by Lee Revell

On 4/28/07, Mikulas Patocka wrote:
> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
> during that) and doesn't fsync the resulting file on :w :-/

Never seen this. Why would fsync block typing unless vim was doing
disk IO for every keystroke?

Lee

2007-04-28 21:58:23

by Bill Huey

On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2); it writes inside normal used structures,
> but it marks each structure with generation tags --- when it updates
> global table of tags, it atomically makes several structures valid. I
> don't know about this idea being used elsewhere.

So how is this generation structure organized ? paper ?

bill

2007-04-28 22:38:06

> On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
>> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
>> phase tree filesystems (TUX2); it writes inside normal used structures,
>> but it marks each structure with generation tags --- when it updates
>> global table of tags, it atomically makes several structures valid. I
>> don't know about this idea being used elsewhere.
>
> So how is this generation structure organized ? paper ?

Paper is in CITSA 2006 proceedings (but you likely don't have them and I
signed some statement that I can't post it elsewhere :-( )

Basically the idea is this:
* you have array containing 65536 32-bit numbers --- crash count table ---
that array is on disk and in memory (see struct __spadfs->cct in my sources)
* you have 16-bit value --- crash count, that value is on disk and in memory
too (see struct __spadfs->cc)

* On mount, you load crash count table and crash count from disk to
memory. You increment crash count on disk (but leave old in memory). You
increment one entry in crash count table - cct[cc] in memory, but leave
old on disk.
* On sync you write all metadata buffers, do write barrier, write one
sector of crash count table from memory to disk and do write
barrier again.
* On unmount, you sync and decrement crash count on disk.

--- so crash count counts crashes --- it is increased each time you mount
and don't unmount.

Consistency of structures:
* Each directory entry has two tags --- 32-bit transaction count (txc)
and 16-bit crash count(cc).
* You create directory entry with entry->txc = fs->txc[fs->cc] and
entry->cc = fs->cc
* Directory entry is considered valid if fs->txc[entry->cc] >= entry->txc
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during directory
scan, as if it wasn't there
--- so you create a directory entry and it's valid. If the system crashes,
it will load crash count table from disk, and the value there is one less than
entry->txc, so the entry will be invalid. It will also run with increased
cc, so it will never touch txc at an old index, so the entry will stay invalid
forever.
--- if you sync, you write crash count table to disk and directory entry
will be atomically made valid forever (because values in crash count table
never decrease)

In my implementation, the top bit of entry->txc is used to mark whether
the entry is scheduled for adding or delete, so that you can atomically
add one directory entry and delete other.

Space allocation bitmaps or lists are managed in such a way that there are
two copies and cc/txc pair determining which one is valid.

Files are extended in such a way that each file has two "size" entries and
cc/txc pair denoting which one is valid, so that you can atomically
extend/truncate file and mark its space allocated/freed in bitmaps or
lists (BTW. this cc/txc pair is the same one that denotes if the directory
entry is valid and another bit determines one of these two functions ---
to save space).
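
A minimal C sketch of the crash-count scheme for directory entries, written as a user-space model. It follows the description above, but the type and function names (`cc_disk`, `cc_mem`, `cc_entry`, `cc_mount()` and so on) are hypothetical and are not the SpadFS sources. The table is called cct here, matching the first mention above; the later text refers to the same table as txc. The add/delete flag in the top bit of txc and the two-copy bitmaps are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CCT_ENTRIES 65536

/* Persistent state: what survives a crash (hypothetical layout). */
struct cc_disk {
    uint32_t cct[CCT_ENTRIES];  /* crash count table */
    uint16_t cc;                /* crash count */
};

/* In-memory state of a mounted filesystem. */
struct cc_mem {
    uint32_t cct[CCT_ENTRIES];
    uint16_t cc;
};

/* The two tags stored in each directory entry. */
struct cc_entry {
    uint32_t txc;
    uint16_t cc;
};

/* Mount: load both values, bump cc on disk only, bump cct[cc] in memory only. */
static void cc_mount(struct cc_disk *d, struct cc_mem *m)
{
    assert(d != NULL && m != NULL);
    memcpy(m->cct, d->cct, sizeof m->cct);
    m->cc = d->cc;
    d->cc++;            /* disk now assumes this mount ends in a crash */
    m->cct[m->cc]++;    /* new entries get a value the disk has not seen */
}

/* Tag a new directory entry with the current in-memory values. */
static struct cc_entry cc_create(const struct cc_mem *m)
{
    struct cc_entry e = { m->cct[m->cc], m->cc };
    return e;
}

/* The validity rule: table value at the entry's cc must have caught up. */
static bool cc_valid(const struct cc_mem *m, const struct cc_entry *e)
{
    return m->cct[e->cc] >= e->txc;
}

/* Sync: after metadata is on disk, write the table entry (barriers omitted). */
static void cc_sync(struct cc_disk *d, const struct cc_mem *m)
{
    d->cct[m->cc] = m->cct[m->cc];
}

/* Clean unmount: sync, then undo the mount-time increment of cc on disk. */
static void cc_unmount(struct cc_disk *d, const struct cc_mem *m)
{
    cc_sync(d, m);
    d->cc--;
}
```

A crash is modelled by simply calling `cc_mount()` again on the same `cc_disk` without an intervening `cc_unmount()`: entries created since the last `cc_sync()` then fail `cc_valid()`, while entries created before it keep passing.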

Mikulas

2007-04-29 20:49:29

by Mark Lord

Lee Revell wrote:
> On 4/28/07, Mikulas Patocka wrote:
>> I most wonder, why vim fsyncs its swapfile regularly (blocking typing
>> during that) and doesn't fsync the resulting file on :w :-/
>
> Never seen this. Why would fsync block typing unless vim was doing
> disk IO for every keystroke?

It does do that, for the crash-recovery files it maintains.

2007-05-03 17:38:28

Andrew Morton wrote:
> We can make great improvements here, and I've (twice) previously described
> how: hoist the entire ordered-mode data handling out of ext3, and out of
> the buffer_head layer and move it up into the VFS pagecache layer.
> Basically, do ordered-data with a commit-time inode walk, calling
> do_sync_mapping_range().
>
> Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.

I'm not sure it's that easy.

if we move to pages, then we have to mark pages to be flushed holding
transaction open. now take delayed allocation into account: we need
to allocate number of blocks at once and then mark all pages mapped,
again within context of the same transaction. so, an implementation
would look like the following?

generic_writepages() {
    /* collect set of contig. dirty pages */
    foo_get_blocks() {
        foo_journal_start();
        foo_new_blocks();
        foo_attach_blocks_to_inode();
        generic_mark_pages_mapped();
        foo_journal_stop();
    }
}

another question is will it scale well given number of dirty inodes
can be much larger than number of inodes with dirty mapped blocks
(in delayed allocation case, for example) ?

thanks, Alex

2007-05-03 23:55:18

On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas wrote:

> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously described
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer.
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> >
> > Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
>
> I'm not sure it's that easy.
>
> if we move to pages, then we have to mark pages to be flushed holding
> transaction open. now take delayed allocation into account: we need
> to allocate number of blocks at once and then mark all pages mapped,
> again within context of the same transaction.

Yes, there can be issues with needing to allocate journal space within the
context of a commit. But

a) If the page has newly allocated space on disk then the metadata which
refers to that page is already in the journal: no new journal space
needed.

b) If the page doesn't have space allocated on disk then we don't need
to write it out at ordered-mode commit time, because the post-recovery
filesystem will not have any references to that page.

c) If the page is dirty due to overwrite then no metadata update was required.

IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?

However b) might lead to the hey-my-file-is-full-of-zeroes problem.

> so, an implementation
> would look like the following?
>
> generic_writepages() {
>     /* collect set of contig. dirty pages */
>     foo_get_blocks() {
>         foo_journal_start();
>         foo_new_blocks();
>         foo_attach_blocks_to_inode();
>         generic_mark_pages_mapped();
>         foo_journal_stop();
>     }
> }
>
> another question is will it scale well given number of dirty inodes
> can be much larger than number of inodes with dirty mapped blocks
> (in delayed allocation case, for example) ?

Possibly - zillions of dirty-for-atime inodes might get in the way. A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug). A long-term fix is to rip all the per-superblock
dirty-inode lists and use a radix-tree. Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.

2007-05-04 06:18:32

Andrew Morton wrote:
> Yes, there can be issues with needing to allocate journal space within the
> context of a commit. But

no-no, this isn't required. we only need to mark pages/blocks within
transaction, otherwise race is possible when we allocate blocks in transaction,
then transaction starts to commit, then we mark pages/blocks to be flushed
before commit.

> a) If the page has newly allocated space on disk then the metadata which
> refers to that page is already in the journal: no new journal space
> needed.
>
> b) If the page doesn't have space allocated on disk then we don't need
> to write it out at ordered-mode commit time, because the post-recovery
> filesystem will not have any references to that page.
>
> c) If the page is dirty due to overwrite then no metadata update was required.
>
> IOW, under what circumstances would an ordered-mode commit need to allocate
> space for a delayed-allocate page?

no need to allocate space within commit thread, I think. only to take care
of the race I described above. in hackish version of data=ordered for delayed
allocation I used counter of submitted bio's with newly-allocated blocks and
commit thread waits for the counter to reach 0.

>
> However b) might lead to the hey-my-file-is-full-of-zeroes problem.
>

thanks, Alex

2007-05-04 06:39:07

On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas wrote:

> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit. But
>
> no-no, this isn't required. we only need to mark pages/blocks within
> transaction, otherwise race is possible when we allocate blocks in transaction,
> then transaction starts to commit, then we mark pages/blocks to be flushed
> before commit.

I don't understand. Can you please describe the race in more detail?

2007-05-04 06:57:29

Andrew Morton wrote:
> On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas wrote:
>
>> Andrew Morton wrote:
>>> Yes, there can be issues with needing to allocate journal space within the
>>> context of a commit. But
>> no-no, this isn't required. we only need to mark pages/blocks within
>> transaction, otherwise race is possible when we allocate blocks in transaction,
>> then transaction starts to commit, then we mark pages/blocks to be flushed
>> before commit.
>
> I don't understand. Can you please describe the race in more detail?

if I understood your idea right, then in data=ordered mode, commit thread writes
all dirty mapped blocks before real commit.

say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread

t1                                      t2
find dirty inode I
find some dirty unallocated blocks
journal_start()
allocate blocks
attach them to I
journal_stop()

                                        going to commit
                                        find inode I dirty
                                        do NOT find these blocks because they're
                                          allocated only, but pages/bhs aren't mapped
                                          to them
                                        start commit

map pages/bhs to just-allocated blocks

so, either we mark pages/bhs someway within journal_start()--journal_stop() or
commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO.

thanks, Alex

2007-05-04 07:18:42

On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas wrote:

> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas wrote:
> >
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit. But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> transaction, otherwise race is possible when we allocate blocks in transaction,
> >> then transaction starts to commit, then we mark pages/blocks to be flushed
> >> before commit.
> >
> > I don't understand. Can you please describe the race in more detail?
>
> if I understood your idea right, then in data=ordered mode, commit thread writes
> all dirty mapped blocks before real commit.
>
> say, we have two threads: t1 is a thread doing flushing and t2 is a commit thread
>
> t1 t2
> find dirty inode I
> find some dirty unallocated blocks
> journal_start()
> allocate blocks
> attach them to I
> journal_stop()

I'm still not understanding. The terms you're using are a bit ambiguous.

What does "find some dirty unallocated blocks" mean? Find a page which is
dirty and which does not have a disk mapping?

Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().

> going to commit
> find inode I dirty
> do NOT find these blocks because they're
> allocated only, but pages/bhs aren't mapped
> to them
> start commit

I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.

But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page(). Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.

It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search. But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data. Files which
have chattr +j would screw things up, as usual.

I assume (hope) that your delayed allocation code implements
->writepages()? Doing the allocation one-page-at-a-time sounds painful...

>
> map pages/bhs to just allocate blocks
>
>
> so, either we mark pages/bhs someway within journal_start()--journal_stop() or
> commit thread should do lookup for all dirty pages. the latter doesn't sound nice, IMHO.
>

I don't think I'm understanding you fully yet.

2007-05-04 07:39:43

Andrew Morton wrote:
> I'm still not understanding. The terms you're using are a bit ambiguous.
>
> What does "find some dirty unallocated blocks" mean? Find a page which is
> dirty and which does not have a disk mapping?
>
> Normally the above operation would be implemented via
> ext4_writeback_writepage(), and it runs under lock_page().

I'm mostly worried about delayed allocation case. My impression was that
holding number of pages locked isn't a good idea, even if they're locked
in index order. so, I was going to turn number of pages writeback, then
allocate blocks for all of them at once, then put proper blocknr's into
bh's (or PG_mappedtodisk?).

>
>
>> going to commit
>> find inode I dirty
>> do NOT find these blocks because they're
>> allocated only, but pages/bhs aren't mapped
>> to them
>> start commit
>
> I think you're assuming here that commit would be using ->t_sync_datalist
> to locate dirty buffer_heads.

nope, I mean sb->inode->page walk.

> But under this proposal, t_sync_datalist just gets removed: the new
> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> understanding you, the way in which we'd handle any such race is to make
> kjournald's writeback of the dirty pages block in lock_page(). Once it
> gets the page lock it can look to see if some other thread has mapped the
> page to disk.

if I'm right holding number of pages locked, then they won't be locked, but
writeback. of course kjournald can block on writeback as well, but how does
it find pages with *newly allocated* blocks only?

> It may turn out that kjournald needs a private way of getting at the
> I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> anyway, with a tagged search. But I expect that a single pass through the
> superblock's dirty inodes would suffice for ordered-data. Files which
> have chattr +j would screw things up, as usual.

not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.

> I assume (hope) that your delayed allocation code implements
> ->writepages()? Doing the allocation one-page-at-a-time sounds painful...

indeed. this is a root cause of all this complexity.

thanks, Alex

2007-05-04 08:03:41

On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas wrote:

> Andrew Morton wrote:
> > I'm still not understanding. The terms you're using are a bit ambiguous.
> >
> > What does "find some dirty unallocated blocks" mean? Find a page which is
> > dirty and which does not have a disk mapping?
> >
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
>
> I'm mostly worried about delayed allocation case. My impression was that
> holding number of pages locked isn't a good idea, even if they're locked
> in index order. so, I was going to turn number of pages writeback, then
> allocate blocks for all of them at once, then put proper blocknr's into
> bh's (or PG_mappedtodisk?).

ooh, that sounds hacky and quite worrisome. If someone comes in and does
an fsync() we've lost our synchronisation point. Yes, all callers happen
to do

	lock_page();
	wait_on_page_writeback();

(I think) but we've never considered a bare PageWriteback() as something
which protects page internals. We're OK wrt page reclaim and we're OK wrt
truncate and invalidate. As long as the page is uptodate we _should_ be OK
wrt readpage(). But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.

I'd be 100% OK with locking multiple pages in ascending pgoff_t order.
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow. But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
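
Why ascending-index locking is safe can be sketched with a toy model in C. This is an illustration only: `model_page_locked[]`, `cmp_pgoff()` and `lock_pages_ascending()` are hypothetical names, a plain array index stands in for pgoff_t, indices within a batch are assumed to be distinct, and where this model backs out and returns false real code would sleep in lock_page().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NPAGES 16

/* Hypothetical stand-in for the per-page lock bit of one mapping. */
static bool model_page_locked[MODEL_NPAGES];

/* qsort comparator for page indices, written to avoid overflow. */
static int cmp_pgoff(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;

    return (x > y) - (x < y);
}

/*
 * Lock a batch of pages in ascending index order. Because every caller
 * takes locks in the same global order, two callers with overlapping
 * batches contend on their lowest shared page first, so a cycle of
 * "holds A, waits for B" / "holds B, waits for A" cannot form.
 * Sorts idx[] in place. On contention, releases what it took and
 * returns false.
 */
static bool lock_pages_ascending(unsigned long *idx, size_t n)
{
    qsort(idx, n, sizeof *idx, cmp_pgoff);
    for (size_t i = 0; i < n; i++) {
        assert(idx[i] < MODEL_NPAGES);
        if (model_page_locked[idx[i]]) {
            while (i-- > 0)                 /* back out in reverse order */
                model_page_locked[idx[i]] = false;
            return false;
        }
        model_page_locked[idx[i]] = true;
    }
    return true;
}
```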

> >
> >
> >> going to commit
> >> find inode I dirty
> >> do NOT find these blocks because they're
> >> allocated only, but pages/bhs aren't mapped
> >> to them
> >> start commit
> >
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
>
> nope, I mean sb->inode->page walk.
>
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page(). Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
>
> if I'm right holding number of pages locked, then they won't be locked, but
> writeback. of course kjournald can block on writeback as well, but how does
> it find pages with *newly allocated* blocks only?

I don't think we'd want kjournald to do that. Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view. If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.

> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search. But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data. Files which
> > have chattr +j would screw things up, as usual.
>
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.

Newly allocated blocks, you mean?

Just write out the overwritten blocks as well as the new ones, I reckon.
It's what we do now.

2007-08-16 18:33:31

Andrew Morton wrote:
>>> But under this proposal, t_sync_datalist just gets removed: the new
>>> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
>>> understanding you, the way in which we'd handle any such race is to make
>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
>>> gets the page lock it can look to see if some other thread has mapped the
>>> page to disk.
>> if I'm right holding number of pages locked, then they won't be locked, but
>> writeback. of course kjournald can block on writeback as well, but how does
>> it find pages with *newly allocated* blocks only?
>
> I don't think we'd want kjournald to do that. Even if a page was dirtied
> by an overwrite, we'd want to write it back during commit, just from a
> quality-of-implementation point of view. If we were to leave these pages
> unwritten during commit then a post-recovery file could have a mix of
> up-to-five-second-old data and up-to-30-seconds-old data.

trying to implement this, I've come to think that there is one significant
difference between t_sync_datalist and sb->inode->page walk: t_sync_datalist
is per-transaction. IOW, it doesn't change once transaction is closed. in
contrast, nothing (currently) would prevent others from modifying pages while
commit is in progress. I think this is a serious disadvantage of the solution.

what I'd propose is sort of in-core tracker for all data-related IOs in flight
(assigned to specific transaction) and wait for their completion in commit
thread.

thanks, Alex
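
A minimal userspace sketch of the tracker idea proposed above, assuming a per-transaction counter of in-flight data IOs protected by a mutex and condition variable. Every name here (txn_io_tracker, tracker_io_start, tracker_io_done, tracker_commit_wait) is hypothetical and is not taken from jbd or from any posted patch; the "closed" flag is also an assumption, added so that no new IO can join a transaction once commit has begun.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-transaction tracker; not a real jbd structure. */
struct txn_io_tracker {
    pthread_mutex_t lock;
    pthread_cond_t  idle;     /* signalled when inflight drops to zero */
    unsigned int    inflight; /* data IOs submitted but not yet completed */
    int             closed;   /* assumption: set once commit has started */
};

void tracker_init(struct txn_io_tracker *t)
{
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->idle, NULL);
    t->inflight = 0;
    t->closed = 0;
}

/*
 * Called when a data IO is submitted on behalf of this transaction.
 * Returns 0 on success, or -1 if the transaction is already closed,
 * in which case the caller would have to attach the IO to the next one.
 */
int tracker_io_start(struct txn_io_tracker *t)
{
    int ret = 0;

    pthread_mutex_lock(&t->lock);
    if (t->closed)
        ret = -1;
    else
        t->inflight++;
    pthread_mutex_unlock(&t->lock);
    return ret;
}

/* Called from the IO completion path. */
void tracker_io_done(struct txn_io_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    assert(t->inflight > 0);
    if (--t->inflight == 0)
        pthread_cond_broadcast(&t->idle);
    pthread_mutex_unlock(&t->lock);
}

/* Commit thread: stop accepting new IOs, then wait for the rest to finish. */
void tracker_commit_wait(struct txn_io_tracker *t)
{
    pthread_mutex_lock(&t->lock);
    t->closed = 1;
    while (t->inflight > 0)
        pthread_cond_wait(&t->idle, &t->lock);
    pthread_mutex_unlock(&t->lock);
}
```

Because the set of tracked IOs is frozen when the transaction closes, the commit thread waits only for work that was already assigned to it, which is the property t_sync_datalist has and a live sb->inode->page walk lacks.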

2007-08-16 18:46:14

On Thu, 16 Aug 2007 22:20:06 +0400
Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> >>> But under this proposal, t_sync_datalist just gets removed: the new
> >>> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> >>> understanding you, the way in which we'd handle any such race is to make
> >>> kjournald's writeback of the dirty pages block in lock_page(). Once it
> >>> gets the page lock it can look to see if some other thread has mapped the
> >>> page to disk.
> >> if I'm right holding number of pages locked, then they won't be locked, but
> >> writeback. of course kjournald can block on writeback as well, but how does
> >> it find pages with *newly allocated* blocks only?
> >
> > I don't think we'd want kjournald to do that. Even if a page was dirtied
> > by an overwrite, we'd want to write it back during commit, just from a
> > quality-of-implementation point of view. If we were to leave these pages
> > unwritten during commit then a post-recovery file could have a mix of
> > up-to-five-second-old data and up-to-30-seconds-old data.
>
> trying to implement this I've got to think that there is one significant
> difference between t_sync_datalist and sb->inode->page walk: t_sync_datalist
> is per-transaction. IOW, it doesn't change once transaction is closed. in
> contrast, nothing (currently) would prevent others to modify pages while
> commit is in progress.

That can happen at present - there's nothing to stop a process from modifying
a page which is undergoing ordered-data commit-time writeout.

2007-08-17 02:25:15

Andrew Morton wrote:
> On Thu, 16 Aug 2007 22:20:06 +0400
> Alex Tomas <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>>>> But under this proposal, t_sync_datalist just gets removed: the new
>>>>> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
>>>>> understanding you, the way in which we'd handle any such race is to make
>>>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
>>>>> gets the page lock it can look to see if some other thread has mapped the
>>>>> page to disk.
>>>> if I'm right holding number of pages locked, then they won't be locked, but
>>>> writeback. of course kjournald can block on writeback as well, but how does
>>>> it find pages with *newly allocated* blocks only?
>>> I don't think we'd want kjournald to do that. Even if a page was dirtied
>>> by an overwrite, we'd want to write it back during commit, just from a
>>> quality-of-implementation point of view. If we were to leave these pages
>>> unwritten during commit then a post-recovery file could have a mix of
>>> up-to-five-second-old data and up-to-30-seconds-old data.
>> trying to implement this I've got to think that there is one significant
>> difference between t_sync_datalist and sb->inode->page walk: t_sync_datalist
>> is per-transaction. IOW, it doesn't change once transaction is closed. in
>> contrast, nothing (currently) would prevent others to modify pages while
>> commit is in progress.
>
> That can happen at present - there's nothing to stop a process from modifying
> a page which is undergoing ordered-data commit-time writeout.

I tend to think it's still a bit different: the set of pages doesn't change with
t_sync_datalist. with the sb->inode->page approach even a silly dd will be able to
*add* a bunch of new pages while we're syncing the first ones. why shouldn't we
fix this?

thanks, Alex

2007-08-17 06:53:03

On Fri, 17 Aug 2007 06:24:47 +0400 Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > On Thu, 16 Aug 2007 22:20:06 +0400
> > Alex Tomas <[email protected]> wrote:
> >
> >> Andrew Morton wrote:
> >>>>> But under this proposal, t_sync_datalist just gets removed: the new
> >>>>> ordered-data mode _only_ need to do the sb->inode->page walk. So if I'm
> >>>>> understanding you, the way in which we'd handle any such race is to make
> >>>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
> >>>>> gets the page lock it can look to see if some other thread has mapped the
> >>>>> page to disk.
> >>>> if I'm right holding number of pages locked, then they won't be locked, but
> >>>> writeback. of course kjournald can block on writeback as well, but how does
> >>>> it find pages with *newly allocated* blocks only?
> >>> I don't think we'd want kjournald to do that. Even if a page was dirtied
> >>> by an overwrite, we'd want to write it back during commit, just from a
> >>> quality-of-implementation point of view. If we were to leave these pages
> >>> unwritten during commit then a post-recovery file could have a mix of
> >>> up-to-five-second-old data and up-to-30-seconds-old data.
> >> trying to implement this I've got to think that there is one significant
> >> difference between t_sync_datalist and sb->inode->page walk: t_sync_datalist
> >> is per-transaction. IOW, it doesn't change once transaction is closed. in
> >> contrast, nothing (currently) would prevent others to modify pages while
> >> commit is in progress.
> >
> > That can happen at present - there's nothing to stop a process from modifying
> > a page which is undergoing ordered-data commit-time writeout.
>
> I tend to think it's still a bit different: set of pages doesn't change with
> t_sync_datalist. with sb->inode->page approach even silly dd will be able to
> *add* a bunch of new pages while we're syncing first ones. why shouldn't we
> fix this?
>

Sort-of. But the per-superblock, per-inode writeback code is pretty
careful to avoid livelocks. The per-inode writeback is a strict single
linear sweep across the file. It'll basically write out anything which was
dirty when it was called. The per-superblock inode walk isn't as accurate
as that, because of the difficulties of juggling list_heads. But we're
slowly working on that, and I suspect it'll be good enough for ext3
purposes already.
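
To illustrate why a strict single linear sweep cannot livelock, here is a simplified userspace sketch. It assumes a file is modelled as an array of per-page dirty flags; struct file_pages, sweep_once and the redirty hook are hypothetical names for illustration only, not kernel interfaces, and the real per-inode writeback path is considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a file's page cache: one dirty flag per page. */
struct file_pages {
    unsigned char *dirty;
    size_t         nr_pages;
};

/* Hook that simulates a concurrent writer dirtying pages mid-sweep. */
typedef void (*redirty_hook)(struct file_pages *f, size_t cursor, void *arg);

/*
 * One linear pass over the file. The end index is sampled once at entry
 * and the cursor only moves forward, so a page dirtied again behind the
 * cursor is left for the next call instead of being chased.
 * Returns the number of pages "written" (here: flags cleared).
 */
size_t sweep_once(struct file_pages *f, redirty_hook hook, void *arg)
{
    size_t written = 0;
    size_t end;

    assert(f != NULL);
    end = f->nr_pages;
    for (size_t i = 0; i < end; i++) {
        if (f->dirty[i]) {
            f->dirty[i] = 0;   /* stands in for submitting the page */
            written++;
        }
        if (hook)
            hook(f, i, arg);   /* writer may dirty pages at any point */
    }
    return written;
}

/* Example hook: a writer that keeps re-dirtying page 0. */
void redirty_first_page(struct file_pages *f, size_t cursor, void *arg)
{
    (void)cursor;
    (void)arg;
    f->dirty[0] = 1;
}
```

With the example hook, a sweep over a four-page file still finishes after four steps; page 0 simply remains dirty and is picked up by the following sweep. Pages dirtied ahead of the cursor are written in the same pass, which is one reason the sweep is only "basically" a snapshot of what was dirty at call time.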

2007-08-17 08:37:00

Andrew Morton wrote:
> Sort-of. But the per-superblock, per-inode writeback code is pretty
> careful to avoid livelocks. The per-inode writeback is a strict single
> linear sweep across the file. It'll basically write out anything which was
> dirty when it was called. The per-superblock inode walk isn't as accurate
> as that, because of the difficulties of juggling list_heads. But we're
> slowly working on that, and I suspect it'll be good enough for ext3
> purposes already.

I'd say that these are two different mechanisms solving different problems:
1) VFS/MM does periodic updates and uses regular writeback
2) data=ordered is to avoid metadata pointing to not-written-yet data

we can't use regular writeback in commit thread as long as it can fall into
allocation. so, we'd have to add one more WB mode (btw, i have a patch which
skips non-allocated blocks in writeback if special WB mode is requested).

OTOH, the faster we go through data sync part of commit, the better. given
that lots of inodes can be dirty with no data to sync, it's going to take
long in some cases. it's especially bad because commit doesn't scale to many
CPUs.

also, why would we need to flush *everything* every 5s? just because ext3 does
this? sounds strange. if somebody really needs this we could add this possibility
to the regular writeback path (making it tunable). but I'd rather have
a separate (fast, lightweight, scalable) mechanism to support data=ordered.

thanks, Alex

2007-08-17 09:03:40

On Fri, 17 Aug 2007 12:36:32 +0400 Alex Tomas <[email protected]> wrote:

> Andrew Morton wrote:
> > Sort-of. But the per-superblock, per-inode writeback code is pretty
> > careful to avoid livelocks. The per-inode writeback is a strict single
> > linear sweep across the file. It'll basically write out anything which was
> > dirty when it was called. The per-superblock inode walk isn't as accurate
> > as that, because of the difficulties of juggling list_heads. But we're
> > slowly working on that, and I suspect it'll be good enough for ext3
> > purposes already.
>
> I'd say that these are two different mechanism solving different problems:
> 1) VFS/MM does periodic updates and uses regular writeback
> 2) data=ordered is to avoid metadata pointing to not-written-yet data

VFS/MM can do _much_ more than that! Look at struct writeback_control.

That code path has many different modes of operation: it is used for
regular pdflush writeback, sync, fsync, throttling, etc. Probably one of
its modes will be sufficient. If we want to change ext3's existing
semantics and add an "only writeback uninitialised blocks" mode then
that'll be pretty straightforward: add more control information to
writeback_control and go for it.
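
As a rough sketch of what "add more control information" could look like, the following userspace C models a writeback mode selector and the per-page decision it drives. It is only loosely inspired by the idea of struct writeback_control and does not reproduce its real fields; enum wb_sketch_mode, struct page_state and wb_should_write are all hypothetical names, and the page flags are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical writeback modes; not real kernel constants. */
enum wb_sketch_mode {
    WB_SKETCH_ALL,              /* write every dirty page */
    WB_SKETCH_SKIP_UNALLOCATED, /* skip pages that would need block allocation */
    WB_SKETCH_ONLY_NEW_BLOCKS,  /* write only pages whose blocks are newly
                                   allocated and not yet initialised on disk */
};

/* Hypothetical per-page state consulted by the writeback walk. */
struct page_state {
    bool dirty;        /* page has unwritten modifications */
    bool has_block;    /* a disk block is already mapped to the page */
    bool block_is_new; /* the mapped block has never been written */
};

/* Decide whether a page should be written under the given mode. */
bool wb_should_write(enum wb_sketch_mode mode, const struct page_state *p)
{
    assert(p != NULL);
    if (!p->dirty)
        return false;

    switch (mode) {
    case WB_SKETCH_ALL:
        return true;
    case WB_SKETCH_SKIP_UNALLOCATED:
        /* the commit thread must not fall into allocation */
        return p->has_block;
    case WB_SKETCH_ONLY_NEW_BLOCKS:
        /* overwrites of already-initialised blocks are left alone */
        return p->has_block && p->block_is_new;
    }
    return false;
}
```

The second mode corresponds to a walk that is safe to run from the commit thread because it never triggers allocation; the third corresponds to the changed semantics in which pages dirtied purely by overwrites are left to regular writeback.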

> we can't use regular writeback in commit thread as long as it can fall into
> allocation. so, we'd have to add one more WB mode (btw, i have a patch which
> skips non-allocated blocks in writeback if special WB mode is requested).

yup

> OTOH, the faster we go through data sync part of commit, the better. given
> that lots of inodes can be dirty with no data to sync, it's going to take
> long in some cases. it's especially bad because commit doesn't scale to many
> CPUs.

eh?

> also, why would we need to flush *everything* every 5s? just because ext3 does
> this? sounds strange. if somebody really need this we could add this possibility
> to regular writeback path (making it tunable). but I'd rather prefer to have
> a separate (fast, lightweight, scalable) mechanism to support data=ordered.
>

Yeah, that would make sense, perhaps.

Or just speed the existing stuff up. iirc the main problem in there is unrelated
to data writeback. There are situations where the running transaction has to block
behind metadata writeout which the committing transaction is performing. I
reluctantly put that in years ago to get us out of a tight spot and it
never got optimised.

2007-08-17 18:43:16