On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[email protected]> wrote:
> Greetings,
>
> As subject states, my GUI is going away for extended periods of time
> when my very full and likely highly fragmented (how does one find out?)
> filesystem is under heavy write load. While write is under way, if
> amarok (mp3 player) is running, no song change will occur until write is
> finished, and the GUI can go _entirely_ comatose for very long periods.
> Usually, it will come back to life after write is finished, but
> occasionally, a complete GUI restart is necessary.
I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
lunch for so long in the kernel that some time-based thing went bad.
> The longest comatose period to date was ~20 minutes with 2.6.20.7 a few
> days ago. I was letting SuSE's software update programs update my SuSE
> 10.2 system, and started a bonnie while it was running (because I had
> been seeing this on recent kernels, and wanted to see if it was in
> stable as well), WHAM, instant dead GUI. When this happens, kbd and
> mouse events work fine, so I hot-keyed to a VT via CTRL+ALT+F1, and
> killed the bonnie. No joy, GUI stayed utterly comatose until the
> updater finished roughly 20 minutes later, at which time the shells I'd
> tried to start popped up, and all worked as if nothing bad had ever
> happened. During the time in between, no window could be brought into
> focus, nada.
>
> While a bonnie is writing, if I poke KDE's menu button, that will
> instantly trigger nastiness, and a trace (this one was with a cfs
> kernel, but I just did same with virgin 2.6.21) shows that "kicker",
> KDE's launcher proggy does an fdatasync for some reason, and that's the
> end of its world for ages. When clicking on amarok's icon, it does an
> fsync, and that's the last thing that will happen in its world until
> write is done as well. I've repeated this with CFQ and AS IO
> schedulers.
Well that all sucks.
> I have a couple of old kernels lying around that I can test with, but I
> think it's going to be the same. Seems to be ext3's journal that is
> causing my woes. Below this trace of kicker is one of amarok during
> its dead-to-the-world time.
>
> Box is a 3GHz P4, Intel ICH5; SMP/UP doesn't matter. The .config of the
> latest kernel tested is attached. Mount options are
> noatime,nodiratime,acl,user_xattr.
>
>
> [ 308.046646] kicker D 00000044 0 5897 1 (NOTLB)
> [ 308.052611] f32abe4c 00200082 83398b5a 00000044 c01c251e f32ab000 f32ab000 c01169b6
> [ 308.060926] f772fbcc cdc7e694 00000039 8339857a 00000044 83398b5a 00000044 00000000
> [ 308.069422] c1b5ab00 c1b5aac0 00353928 f32abe80 c01c7ab8 00000000 f32ab000 c1b5ab10
> [ 308.077927] Call Trace:
> [ 308.080568] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
> [ 308.085549] [<c01c250f>] journal_stop+0x1a1/0x22a
> [ 308.090364] [<c01c2fce>] journal_force_commit+0x1d/0x20
> [ 308.095699] [<c01bac1e>] ext3_force_commit+0x24/0x26
> [ 308.100774] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
> [ 308.105771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
> [ 308.111633] [<c0187641>] sync_inode+0x15/0x38
> [ 308.116093] [<c01b1695>] ext3_sync_file+0xbd/0xc8
> [ 308.120900] [<c0189a09>] do_fsync+0x58/0x8b
> [ 308.125188] [<c0189a5c>] __do_fsync+0x20/0x2f
> [ 308.129656] [<c0189a7b>] sys_fdatasync+0x10/0x12
> [ 308.134384] [<c0103eec>] sysenter_past_esp+0x5d/0x81
> [ 308.139441] =======================
Right. One possibility here is that bonnie is stuffing new dirty blocks
onto the committing transaction's ordered-data list and JBD commit is
livelocking. Only we're not supposed to be putting those blocks on that
list.
Another livelock possibility is that bonnie is redirtying pages faster than
commit can write them out, so commit got livelocked:
When I was doing the original port-from-2.2 I found that an application
which does
for ( ; ; )
        pwrite(fd, "", 1, 0);
would permanently livelock the fs. I fixed that, but it was six years ago,
and perhaps we later unfixed it.
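For anyone who wants to try that at home, the full test is tiny. A sketch
(the filename and open flags are my choices):

    /* pwrite-loop.c: redirty the same byte forever.  If commit can be
     * livelocked by redirtying, this should wedge the filesystem. */
    #define _XOPEN_SOURCE 500
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd = open(argc > 1 ? argv[1] : "testfile",
                          O_WRONLY | O_CREAT, 0644);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            for ( ; ; )
                    pwrite(fd, "", 1, 0);   /* writes a single NUL byte */
    }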
It would be most interesting to try data=writeback.
> [ 311.755953] bonnie D 00000046 0 6146 5929 (NOTLB)
> [ 311.761929] e7622a60 00200082 04d7e5fe 00000046 03332bd5 00000000 e7622000 c02c0c54
> [ 311.770244] d8eaabcc e7622a64 f7d0c3ec 04d7e521 00000046 04d7e5fe 00000046 00000000
> [ 311.778758] e7622aa4 f4f94400 e7622aac e7622a68 c04a2f06 e7622a70 c018b105 e7622a8c
> [ 311.787261] Call Trace:
> [ 311.789904] [<c04a2f06>] io_schedule+0xe/0x16
> [ 311.794373] [<c018b105>] sync_buffer+0x2e/0x32
> [ 311.798927] [<c04a3756>] __wait_on_bit_lock+0x3f/0x62
> [ 311.804089] [<c04a37d8>] out_of_line_wait_on_bit_lock+0x5f/0x67
> [ 311.810115] [<c018b248>] __lock_buffer+0x2b/0x31
> [ 311.814846] [<c018bb56>] sync_dirty_buffer+0x88/0xc3
> [ 311.819921] [<c01c3ce4>] journal_dirty_data+0x1dd/0x205
> [ 311.825256] [<c01b3300>] ext3_journal_dirty_data+0x12/0x37
> [ 311.830858] [<c01b333a>] journal_dirty_data_fn+0x15/0x1c
> [ 311.836280] [<c01b277d>] walk_page_buffers+0x36/0x68
> [ 311.841347] [<c01b552f>] ext3_ordered_writepage+0x11a/0x191
> [ 311.847027] [<c0152133>] generic_writepages+0x1f3/0x305
> [ 311.852344] [<c015227c>] do_writepages+0x37/0x39
> [ 311.857064] [<c0186aa7>] __writeback_single_inode+0x96/0x3a9
> [ 311.862842] [<c0187037>] sync_sb_inodes+0x1bc/0x27f
> [ 311.867830] [<c01875e3>] writeback_inodes+0x98/0xe1
> [ 311.872819] [<c015240a>] balance_dirty_pages_ratelimited_nr+0xc4/0x1bf
> [ 311.879461] [<c014df55>] generic_file_buffered_write+0x32e/0x677
> [ 311.885576] [<c014e580>] __generic_file_aio_write_nolock+0x2e2/0x57f
> [ 311.892044] [<c014e87d>] generic_file_aio_write+0x60/0xd4
> [ 311.897553] [<c01b14f7>] ext3_file_write+0x27/0xa5
> [ 311.902455] [<c016ab7b>] do_sync_write+0xcd/0x103
> [ 311.907270] [<c016b37a>] vfs_write+0xa8/0x128
> [ 311.911738] [<c016b873>] sys_write+0x3d/0x64
> [ 311.916111] [<c0103eec>] sysenter_past_esp+0x5d/0x81
That's normal. But bonnie _is_ blocking here, so it obviously cannot be
dirtying buffers while it's doing that.
> [ 311.921185] =======================
> [ 311.924763] pdflush D 00000046 0 6147 5 (L-TLB)
> [ 311.930739] ec7e2ef0 00000046 03f2b0ea 00000046 ec7e2f0c c0186b45 ec7e2000 c01169b6
> [ 311.939052] ea14069c ec7e2f00 ec7e2f00 03f2afc9 00000046 03f2b0ea 00000046 00000282
> [ 311.947557] ec7e2f00 ffffab4c ec7e2f30 ec7e2f20 c04a3689 00000400 00000840 c0681648
> [ 311.956062] Call Trace:
> [ 311.958703] [<c04a3689>] schedule_timeout+0x44/0xa4
> [ 311.963683] [<c04a2eda>] io_schedule_timeout+0xe/0x16
> [ 311.968827] [<c01568c7>] congestion_wait+0x4c/0x61
> [ 311.973721] [<c0152603>] background_writeout+0x2f/0x8f
> [ 311.978969] [<c0152b68>] pdflush+0xe7/0x1d6
> [ 311.983255] [<c012e8fb>] kthread+0xc5/0xc9
> [ 311.987465] [<c0104aab>] kernel_thread_helper+0x7/0x1c
OK, that's normal.
> [ 1421.790647] amarokapp D 00000148 0 6428 1 (NOTLB)
> [ 1421.796620] e303ce4c 00000082 823c9fd2 00000148 00000148 ee1a3030 e303c000 00000001
> [ 1421.804944] ee1a316c dfef91d0 00000000 823c9f15 00000148 823c9fd2 00000148 00000246
> [ 1421.813447] dfef91c0 dfef9180 0035465c e303ce80 c01c7ab8 00000000 e303c000 dfef91d0
> [ 1421.821962] Call Trace:
> [ 1421.824603] [<c01c7ab8>] log_wait_commit+0x9d/0x11f
> [ 1421.829582] [<c01c250f>] journal_stop+0x1a1/0x22a
> [ 1421.834397] [<c01c2fce>] journal_force_commit+0x1d/0x20
> [ 1421.839732] [<c01bac1e>] ext3_force_commit+0x24/0x26
> [ 1421.844790] [<c01b50ea>] ext3_write_inode+0x2d/0x3b
> [ 1421.849771] [<c0186cf0>] __writeback_single_inode+0x2df/0x3a9
> [ 1421.855633] [<c0187641>] sync_inode+0x15/0x38
> [ 1421.860103] [<c01b1695>] ext3_sync_file+0xbd/0xc8
> [ 1421.864918] [<c0189a09>] do_fsync+0x58/0x8b
> [ 1421.869213] [<c0189a5c>] __do_fsync+0x20/0x2f
> [ 1421.873673] [<c0189a8a>] sys_fsync+0xd/0xf
> [ 1421.877874] [<c0103eec>] sysenter_past_esp+0x5d/0x81
>
hm, fsync.
Aside: why the heck do applications think that their data is so important
that they need to fsync it all the time. I used to run a kernel on my
laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
pleasurable.
But wedging for 20 minutes is probably excessive punishment.
Bottom line: no idea. Please see if data=writeback changes things, maybe
try the pwrite() loop.
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> On Fri, 27 Apr 2007 09:59:27 +0200 Mike Galbraith <[email protected]> wrote:
>
> > Greetings,
> >
> > As subject states, my GUI is going away for extended periods of time
> > when my very full and likely highly fragmented (how does one find out?)
> > filesystem is under heavy write load. While write is under way, if
> > amarok (mp3 player) is running, no song change will occur until write is
> > finished, and the GUI can go _entirely_ comatose for very long periods.
> > Usually, it will come back to life after write is finished, but
> > occasionally, a complete GUI restart is necessary.
>
> I'd be suspecting a GUI bug if a restart is necessary. Perhaps it went to
> lunch for so long in the kernel that some time-based thing went bad.
Yeah, there have been some KDE updates, maybe something went south. I
know for sure that nothing this horrible used to happen during IO. But
then when I used to regularly test IO, my disk heads didn't have to
traverse nearly as much either.
> Right. One possibility here is that bonnie is stuffing new dirty blocks
> onto the committing transaction's ordered-data list and JBD commit is
> livelocking. Only we're not supposed to be putting those blocks on that
> list.
>
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
> for ( ; ; )
>         pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.
I'll try that.
> It would be most interesting to try data=writeback.
Seems somewhat better, but nothing close to tolerable. I still had to
hot-key to a VT and kill the bonnie.
> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
I thought unkind thoughts when I saw those traces :)
Thanks,
-Mike
On Fri, 2007-04-27 at 01:33 -0700, Andrew Morton wrote:
> Another livelock possibility is that bonnie is redirtying pages faster than
> commit can write them out, so commit got livelocked:
>
> When I was doing the original port-from-2.2 I found that an application
> which does
>
> for ( ; ; )
>         pwrite(fd, "", 1, 0);
>
> would permanently livelock the fs. I fixed that, but it was six years ago,
> and perhaps we later unfixed it.
Well, box doesn't seem the least bit upset after quite a while now, so I
guess it didn't get unfixed.
-Mike
On 4/27/07, Andrew Morton <[email protected]> wrote:
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
So, if having fake fsync() and fdatasync() is pleasurable for laptops
and desktops, maybe it's time to add an option to Kconfig which
disables the normal fsync behaviour in favor of a robust desktop?
On Fri, 2007-04-27 at 15:59 +0400, Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[email protected]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptops
> and desktops, maybe it's time to add an option to Kconfig which
> disables the normal fsync behaviour in favor of a robust desktop?
Nah, just teaching user-space to behave itself should be sufficient;
there is just no way kicker can justify doing an fdatasync(), I mean,
come on, it's just showing a friggin menu. I have always wondered why that
thing was so damn slow, as if it needed to fetch stuff from all
four corners of the disk, feh!
Just sliding over a sub-menu can take more than a second; I mean, it
_really_ is faster to just start things from your favourite shell.
No way is globally disabling fsync() a good thing. I guess Andrew just
is a sucker for punishment :-)
Marat Buharov wrote:
> On 4/27/07, Andrew Morton <[email protected]> wrote:
>> Aside: why the heck do applications think that their data is so important
>> that they need to fsync it all the time. I used to run a kernel on my
>> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
>> pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptops
> and desktops, maybe it's time to add an option to Kconfig which
> disables the normal fsync behaviour in favor of a robust desktop?
Sure, a noop fsync/fdatasync would speed up some things. And I am sure
Andrew Morton knew what he was doing and understood the consequences.
But unless you care nothing about your data, you should not do it. It is
as simple as that. No, it does not give you a robust desktop!!
-Manoj
--
Manoj Joseph
http://kerneljunkie.blogspot.com/
Peter Zijlstra wrote:
>
> No way is globally disabling fsync() a good thing. I guess Andrew just
> is a sucker for punishment :-)
Mmm... perhaps another nice thing to include in laptop-mode operation?
On Fri, 27 Apr 2007, Marat Buharov wrote:
>
> On 4/27/07, Andrew Morton <[email protected]> wrote:
> > Aside: why the heck do applications think that their data is so important
> > that they need to fsync it all the time. I used to run a kernel on my
> > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > pleasurable.
>
> So, if having fake fsync() and fdatasync() is pleasurable for laptops
> and desktops, maybe it's time to add an option to Kconfig which
> disables the normal fsync behaviour in favor of a robust desktop?
This really is an ext3 issue, not "fsync()".
On a good filesystem, when you do "fsync()" on a file, nothing at all
happens to any other files. On ext3, it seems to sync the global journal,
which means that just about *everything* that writes even a single byte
(well, at least anything journalled, which would be all the normal
directory ops etc) to disk will just *stop* dead cold!
It's horrid. And it really is ext3, not "fsync()".
I used to run reiserfs, and it had its problems, but this was the
"feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsync's for most things, and things really hiccup if you
are doing some other writes at the same time. In contrast, with reiser, if
you did a big untar or some other big write, if somebody fsync'ed a small
file, it wasn't even a blip on the radar - the fsync would sync just that
small thing.
Maybe I'm wrong on the exact details (I'm not really up on the ext3
journal handling ;^), but you don't even have to know about any internals
at all: you can just test it. Gaak.
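(A quick way to see it: run a big write in one shell, and something like
this rough sketch in another - the filename is arbitrary - and watch the
fsync() latencies grow:)

    /* fsync-timer.c: dirty one byte of a small file, time the fsync() */
    #define _XOPEN_SOURCE 500
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("small-file", O_WRONLY | O_CREAT, 0644);
            struct timeval t0, t1;

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            for ( ; ; ) {
                    pwrite(fd, "x", 1, 0);
                    gettimeofday(&t0, NULL);
                    fsync(fd);
                    gettimeofday(&t1, NULL);
                    printf("fsync: %ld ms\n",
                           (long)((t1.tv_sec - t0.tv_sec) * 1000 +
                                  (t1.tv_usec - t0.tv_usec) / 1000));
                    sleep(1);
            }
    }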
Linus
On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,
> which means that just about *everything* that writes even a single byte
> (well, at least anything journalled, which would be all the normal
> directory ops etc) to disk will just *stop* dead cold!
>
> It's horrid. And it really is ext3, not "fsync()".
>
> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> > mail, it will do fsync's for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.
It's true that this is a "feature" of ext3 with data=ordered (the default),
but I suspect the same thing is now true in reiserfs too. The reason is
that if a journal commit doesn't flush the data as well then a crash will
result in garbage (from old deleted files) being visible in the newly
allocated file. People used to complain about this with reiserfs all the
time (corrupt data showing up in new files after a crash), which is why I
believe it was fixed.
There definitely are some problems with the ext3 journal commit though.
If the journal is full it will cause the whole journal to checkpoint out
to the filesystem synchronously even if just space for a small transaction
is needed. That is doubly bad if you have a very large journal. I believe
Alex has a patch to have it checkpoint much smaller chunks to the fs.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
On Fri, 2007-04-27 at 13:31 -0600, Andreas Dilger wrote:
> I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
I wouldn't be averse to test driving such a patch (understatement). You
have a pointer?
-Mike
On Fri, 27 Apr 2007, Andreas Dilger wrote:
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too.
Oh, well.. Journalling sucks.
I was actually _really_ hoping that somebody would come along and tell
everybody that this whole journal-logging is stupid, and that it's just
better to not ever re-write blocks on disk, but instead write to new
blocks with version numbers (and not re-use old blocks until new versions
are stable on disk).
There was even somebody who did something like that for a PhD thesis, I
forget the details (and it apparently died when the thesis was presumably
accepted ;).
Linus
The idea has not died, and some NAS/file server vendors have already been
doing this for some time. (I am not sure, but is WAFL the same thing?)
> -----Original Message-----
> From: Linus Torvalds
> Sent: Friday, April 27, 2007 12:51 PM
> To: Andreas Dilger
> Cc: Marat Buharov; Andrew Morton; Mike Galbraith; LKML; Jens Axboe;
>     [email protected]; Alex Tomas
> Subject: Re: [ext3][kernels >= 2.6.20.7 at least] KDE going comatose
>     when FS is under heavy write load (massive starvation)
>
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
> >
> > It's true that this is a "feature" of ext3 with data=ordered (the default),
> > but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
> Linus
On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
>
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
That sounds a whole lot like NetApp's WAFL file system and is heavily patented.
bill
Linus Torvalds wrote:
> There was even somebody who did something like that for a PhD thesis, I
> forget the details (and it apparently died when the thesis was presumably
> accepted ;).
>
>
You mean SpadFS [1], right?
> Linus
>
Gabriel
[1] http://artax.karlin.mff.cuni.cz/~mikulas/spadfs/
Linus Torvalds wrote:
>
> On Fri, 27 Apr 2007, Andreas Dilger wrote:
>> It's true that this is a "feature" of ext3 with data=ordered (the default),
>> but I suspect the same thing is now true in reiserfs too.
>
> Oh, well.. Journalling sucks.
Go back to ext2? ;)
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
Ah, "copy on write"! ZFS (Sun) and WAFL (NetApp) does this. Don't know
about WAFL, but ZFS does logging too.
-Manoj
--
Manoj Joseph
http://kerneljunkie.blogspot.com/
On Fri, 27 Apr 2007 13:31:30 -0600
Andreas Dilger <[email protected]> wrote:
> On Apr 27, 2007 08:30 -0700, Linus Torvalds wrote:
> > On a good filesystem, when you do "fsync()" on a file, nothing at all
> > happens to any other files. On ext3, it seems to sync the global journal,
> > which means that just about *everything* that writes even a single byte
> > (well, at least anything journalled, which would be all the normal
> > directory ops etc) to disk will just *stop* dead cold!
> >
> > It's horrid. And it really is ext3, not "fsync()".
> >
> > I used to run reiserfs, and it had its problems, but this was the
> > "feature" of ext3 that I've disliked most. If you run a MUA with local
> > mail, it will do fsync's for most things, and things really hiccup if you
> > are doing some other writes at the same time. In contrast, with reiser, if
> > you did a big untar or some other big write, if somebody fsync'ed a small
> > file, it wasn't even a blip on the radar - the fsync would sync just that
> > small thing.
>
> It's true that this is a "feature" of ext3 with data=ordered (the default),
> but I suspect the same thing is now true in reiserfs too. The reason is
> that if a journal commit doesn't flush the data as well then a crash will
> result in garbage (from old deleted files) being visible in the newly
> allocated file. People used to complain about this with reiserfs all the
> time (corrupt data showing up in new files after a crash), which is why I
> believe it was fixed.
People still complain about hey-my-files-are-all-full-of-zeroes on XFS.
> There definitely are some problems with the ext3 journal commit though.
> If the journal is full it will cause the whole journal to checkpoint out
> to the filesystem synchronously even if just space for a small transaction
> is needed. That is doubly bad if you have a very large journal. I believe
> Alex has a patch to have it checkpoint much smaller chunks to the fs.
>
We can make great improvements here, and I've (twice) previously described
how: hoist the entire ordered-mode data handling out of ext3, and out of
the buffer_head layer and move it up into the VFS pagecache layer.
Basically, do ordered-data with a commit-time inode walk, calling
do_sync_mapping_range().
Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
And guess what? We can then partly fix _this_ problem too. If we're
running a commit on behalf of fsync(inode1) and we come across an inode2
which doesn't have any block allocation metadata in this commit, we don't
need to sync inode2's pages.
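In rough code, that walk would be something like this (a sketch only:
inode_has_new_blocks() is a made-up predicate for the inode2 test above,
and a real version would need proper iget/iput care):

    /* sketch: commit-time ordered-data writeout done in the VFS */
    static void ordered_commit_writeout(struct super_block *sb)
    {
            struct inode *inode;

            list_for_each_entry(inode, &sb->s_dirty, i_list) {
                    /* made-up predicate: does this commit contain block
                     * allocation metadata for this inode? */
                    if (!inode_has_new_blocks(inode))
                            continue;
                    do_sync_mapping_range(inode->i_mapping, 0, LLONG_MAX,
                                          SYNC_FILE_RANGE_WRITE |
                                          SYNC_FILE_RANGE_WAIT_AFTER);
            }
    }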
Weep. It's times like this when I want to escape all this patch-wrangling
nonsense and go do some real stuff.
On Sat, 28 Apr 2007, Mikulas Patocka wrote:
> On Fri, 27 Apr 2007, Bill Huey wrote:
> Hi
>
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2);
--- BTW, I don't think that writing to unallocated parts of the disk is a
good idea. These filesystems have cool write benchmarks, but one subtle (and
unbenchmarkable) problem:
They group files according to the time when they were created, not
according to the directory hierarchy.
When the user has a directory with project files and edited different
files at different times, normal filesystems will place the files near
each other (so that "grep blabla *" is fast), while log-structured
filesystems will scatter the files over the whole disk.
Mikulas
On Fri, 27 Apr 2007, Bill Huey wrote:
> On Fri, Apr 27, 2007 at 12:50:34PM -0700, Linus Torvalds wrote:
>> Oh, well.. Journalling sucks.
>>
>> I was actually _really_ hoping that somebody would come along and tell
>> everybody that this whole journal-logging is stupid, and that it's just
>> better to not ever re-write blocks on disk, but instead write to new
>> blocks with version numbers (and not re-use old blocks until new versions
>> are stable on disk).
>>
>> There was even somebody who did something like that for a PhD thesis, I
>> forget the details (and it apparently died when the thesis was presumably
>> accepted ;).
>
> That sounds a whole lot like NetApp's WAFL file system and is heavily
> patented.
>
> bill
Hi
SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
phase tree filesystems (TUX2); it writes inside normal used structures,
but it marks each structure with generation tags --- when it updates
global table of tags, it atomically makes several structures valid. I
don't know about this idea being used elsewhere.
Its fsync is slow too (it needs to write all (meta)data as well), but it at
least doesn't livelock --- fsync is basically:
* write all buffers and wait for completion
* take lock preventing metadata updates
* write all buffers again (those that were updated while previous write
was in progress) and wait for completion
* update global generation count table
* release the lock
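In skeleton form (the function names here are mine, not the real ones):

    int spadfs_fsync(struct spadfs *fs)
    {
            write_dirty_buffers_and_wait(fs);       /* pass 1 */
            lock_metadata_updates(fs);
            write_dirty_buffers_and_wait(fs);       /* pass 2: redirtied */
            write_crash_count_table(fs);            /* validates everything */
            unlock_metadata_updates(fs);
            return 0;
    }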
Maybe SuSE will be paying me from this autumn to add more features to it
--- so far it works and doesn't eat data, but it isn't much known :)
Mikulas
On Fri, 27 Apr 2007, Linus Torvalds wrote:
>
>
> On Fri, 27 Apr 2007, Marat Buharov wrote:
> >
> > On 4/27/07, Andrew Morton <[email protected]> wrote:
> > > Aside: why the heck do applications think that their data is so important
> > > that they need to fsync it all the time. I used to run a kernel on my
> > > laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> > > pleasurable.
> >
> > So, if having fake fsync() and fdatasync() is pleasurable for laptops
> > and desktops, maybe it's time to add an option to Kconfig which
> > disables the normal fsync behaviour in favor of a robust desktop?
>
> This really is an ext3 issue, not "fsync()".
>
> On a good filesystem, when you do "fsync()" on a file, nothing at all
> happens to any other files. On ext3, it seems to sync the global journal,
This behavior has been in Linux and sort of official since the early
2.4.X days - remember the discussion on fsync()ing directory changes for
MTAs that led to the mount option "dirsync" for ext?fs so that rename(),
link() and stuff like that became synchronous even without fsync()ing
the parent directory? I can look up archive references if need be.
Surely four years ago, if not five (this is off the top of my head, not a
quotable fact verified from the LKML archives, though).
> I used to run reiserfs, and it had its problems, but this was the
> "feature" of ext3 that I've disliked most. If you run a MUA with local
> mail, it will do fsync's for most things, and things really hiccup if you
> are doing some other writes at the same time. In contrast, with reiser, if
> you did a big untar or some other big write, if somebody fsync'ed a small
> file, it wasn't even a blip on the radar - the fsync would sync just that
> small thing.
It's not as though I'd recommend reiserfs. I have seen one major
corruption recently in openSUSE 10.2 with ext3, but I had constant
headaches with reiserfs from the day it went into S.u.S.E. kernels
until I switched away from reiserfs some years ago.
--
Matthias Andree
On Fri, 27 Apr 2007, Linus Torvalds wrote:
> Oh, well.. Journalling sucks.
>
> I was actually _really_ hoping that somebody would come along and tell
> everybody that this whole journal-logging is stupid, and that it's just
> better to not ever re-write blocks on disk, but instead write to new
> blocks with version numbers (and not re-use old blocks until new versions
> are stable on disk).
Only that you need direct-overwrite support to be able to safely trash
data you no longer need...
--
Matthias Andree
> hm, fsync.
>
> Aside: why the heck do applications think that their data is so important
> that they need to fsync it all the time. I used to run a kernel on my
> laptop which had "return 0;" at the top of fsync() and fdatasync(). Most
> pleasurable.
>
> But wedging for 20 minutes is probably excessive punishment.
What I wonder about most is why vim fsyncs its swapfile regularly (blocking
typing while it does) yet doesn't fsync the resulting file on :w :-/
Mikulas
On 4/28/07, Mikulas Patocka <[email protected]> wrote:
> What I wonder about most is why vim fsyncs its swapfile regularly (blocking
> typing while it does) yet doesn't fsync the resulting file on :w :-/
Never seen this. Why would fsync block typing unless vim was doing
disk IO for every keystroke?
Lee
On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
> phase tree filesystems (TUX2); it writes inside normal used structures,
> but it marks each structure with generation tags --- when it updates
> global table of tags, it atomically makes several structures valid. I
> don't know about this idea being used elsewhere.
So how is this generation structure organized? Is there a paper?
bill
> On Sat, Apr 28, 2007 at 07:37:17AM +0200, Mikulas Patocka wrote:
>> SpadFS doesn't write to unallocated parts like log filesystems (LFS) or
>> phase tree filesystems (TUX2); it writes inside normal used structures,
>> but it marks each structure with generation tags --- when it updates
>> global table of tags, it atomically makes several structures valid. I
>> don't know about this idea being used elsewhere.
>
> So how is this generation structure organized? Is there a paper?
The paper is in the CITSA 2006 proceedings (but you likely don't have them,
and I signed some statement that I can't post it elsewhere :-( )
Basically the idea is this:
* you have an array containing 65536 32-bit numbers --- the crash count table ---
that array is on disk and in memory (see struct __spadfs->cct in my sources)
* you have a 16-bit value --- the crash count; that value is on disk and in memory
too (see struct __spadfs->cc)
* On mount, you load the crash count table and the crash count from disk to
memory. You increment the crash count on disk (but leave the old value in memory).
You increment one entry in the crash count table --- cct[cc] --- in memory, but
leave the old value on disk.
* On sync, you write all metadata buffers, do a write barrier, write one
sector of the crash count table from memory to disk and do a write
barrier again.
* On unmount, you sync and decrement the crash count on disk.
--- so the crash count counts crashes --- it is increased each time you mount
and don't unmount.
Consistency of structures:
* Each directory entry has two tags --- a 32-bit transaction count (txc)
and a 16-bit crash count (cc).
* You create a directory entry with entry->txc = fs->cct[fs->cc] and
entry->cc = fs->cc
* A directory entry is considered valid if fs->cct[entry->cc] >= entry->txc
(see macro CC_VALID)
* If the directory entry is not valid, it is skipped during directory
scan, as if it wasn't there
--- so you create a directory entry and it's valid. If the system crashes,
it will load the crash count table from disk, where the value is one less
than entry->txc, so the entry will be invalid. It will also run with an
increased cc, so it will never touch cct[] at an old index, so the entry
will stay invalid forever.
--- if you sync, you write the crash count table to disk and the directory
entry is atomically made valid forever (because values in the crash count
table never decrease)
In my implementation, the top bit of entry->txc is used to mark whether
the entry is scheduled for add or delete, so that you can atomically
add one directory entry and delete another.
Space allocation bitmaps or lists are managed in such a way that there are
two copies, with a cc/txc pair determining which one is valid.
Files are extended in such a way that each file has two "size" entries and
a cc/txc pair denoting which one is valid, so that you can atomically
extend/truncate a file and mark its space allocated/freed in bitmaps or
lists (BTW, this cc/txc pair is the same one that denotes whether the
directory entry is valid, and another bit determines which of these two
functions it serves --- to save space).
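The core validity rule boils down to something like this (a condensed
userspace sketch with simplified names, ignoring the flag bit in txc):

    #include <stdint.h>

    struct spadfs {
            uint32_t cct[65536];    /* crash count table (in-memory copy) */
            uint16_t cc;            /* crash count of this mount */
    };

    struct tags {
            uint32_t txc;           /* cct[cc] (in memory) at creation */
            uint16_t cc;            /* crash count at creation */
    };

    /* CC_VALID: the entry is valid once the table entry it was tagged
     * against has caught up with it; after a crash the on-disk table is
     * one behind, so unsynced entries become invalid */
    static int cc_valid(const struct spadfs *fs, const struct tags *e)
    {
            return fs->cct[e->cc] >= e->txc;
    }

    static void tag_new_entry(const struct spadfs *fs, struct tags *e)
    {
            e->txc = fs->cct[fs->cc];   /* one ahead of the disk copy */
            e->cc = fs->cc;
    }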
Mikulas
Lee Revell wrote:
> On 4/28/07, Mikulas Patocka <[email protected]> wrote:
>> What I wonder about most is why vim fsyncs its swapfile regularly (blocking
>> typing while it does) yet doesn't fsync the resulting file on :w :-/
>
> Never seen this. Why would fsync block typing unless vim was doing
> disk IO for every keystroke?
It does do that, for the crash-recovery files it maintains.
Andrew Morton wrote:
> We can make great improvements here, and I've (twice) previously described
> how: hoist the entire ordered-mode data handling out of ext3, and out of
> the buffer_head layer and move it up into the VFS pagecache layer.
> Basically, do ordered-data with a commit-time inode walk, calling
> do_sync_mapping_range().
>
> Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
I'm not sure it's that easy.
If we move to pages, then we have to mark pages to be flushed while holding
the transaction open. Now take delayed allocation into account: we need
to allocate a number of blocks at once and then mark all the pages mapped,
again within the context of the same transaction. So, an implementation
would look like the following?
generic_writepages() {
        /* collect set of contig. dirty pages */
        foo_get_blocks() {
                foo_journal_start();
                foo_new_blocks();
                foo_attach_blocks_to_inode();
                generic_mark_pages_mapped();
                foo_journal_stop();
        }
}
Another question is: will it scale well, given that the number of dirty
inodes can be much larger than the number of inodes with dirty mapped
blocks (in the delayed allocation case, for example)?
thanks, Alex
On Thu, 03 May 2007 21:38:10 +0400
Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > We can make great improvements here, and I've (twice) previously described
> > how: hoist the entire ordered-mode data handling out of ext3, and out of
> > the buffer_head layer and move it up into the VFS pagecache layer.
> > Basically, do ordered-data with a commit-time inode walk, calling
> > do_sync_mapping_range().
> >
> > Do it in the VFS. Make reiserfs use it, remove reiserfs ordered-mode too.
> > Make XFS use it, fix the hey-my-files-are-all-full-of-zeroes problem there.
>
> I'm not sure it's that easy.
>
> If we move to pages, then we have to mark pages to be flushed while holding
> the transaction open. Now take delayed allocation into account: we need
> to allocate a number of blocks at once and then mark all the pages mapped,
> again within the context of the same transaction.
Yes, there can be issues with needing to allocate journal space within the
context of a commit. But
a) If the page has newly allocated space on disk then the metadata which
refers to that page is already in the journal: no new journal space
needed.
b) If the page doesn't have space allocated on disk then we don't need
to write it out at ordered-mode commit time, because the post-recovery
filesystem will not have any references to that page.
c) If the page is dirty due to overwrite then no metadata update was required.
IOW, under what circumstances would an ordered-mode commit need to allocate
space for a delayed-allocate page?
However b) might lead to the hey-my-file-is-full-of-zeroes problem.
> so, an implementation
> would look like the following?
>
> generic_writepages() {
>         /* collect set of contig. dirty pages */
>         foo_get_blocks() {
>                 foo_journal_start();
>                 foo_new_blocks();
>                 foo_attach_blocks_to_inode();
>                 generic_mark_pages_mapped();
>                 foo_journal_stop();
>         }
> }
>
> Another question is: will it scale well, given that the number of dirty
> inodes can be much larger than the number of inodes with dirty mapped
> blocks (in the delayed allocation case, for example)?
Possibly - zillions of dirty-for-atime inodes might get in the way. A
short-term fix would be to create a separate dirty-inode list on the
superblock (ug). A long-term fix is to rip all the per-superblock
dirty-inode lists and use a radix-tree. Not for lookup purposes, but for
the tree's ability to do tagged and restartable searches.
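The tagged-search part is straightforward with the existing radix-tree
API; roughly like this (the dirty-inode tree itself, and the helper,
being the hypothetical part):

    /* sketch: restartable walk over a hypothetical radix tree of dirty
     * inodes, indexed by inode number and tagged when the inode dirties */
    #define INODE_TAG_DIRTY 0

    struct inode *batch[16];
    unsigned long next = 0;
    unsigned int i, n;

    do {
            n = radix_tree_gang_lookup_tag(&sb->s_dirty_tree,
                                           (void **)batch, next, 16,
                                           INODE_TAG_DIRTY);
            for (i = 0; i < n; i++) {
                    next = batch[i]->i_ino + 1;     /* restart point */
                    write_inode_pages(batch[i]);    /* made-up helper */
            }
    } while (n);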
Andrew Morton wrote:
> Yes, there can be issues with needing to allocate journal space within the
> context of a commit. But
no-no, this isn't required. We only need to mark pages/blocks within the
transaction; otherwise a race is possible: we allocate blocks in a transaction,
then the transaction starts to commit, and only then do we mark the
pages/blocks to be flushed before commit.
> a) If the page has newly allocated space on disk then the metadata which
> refers to that page is already in the journal: no new journal space
> needed.
>
> b) If the page doesn't have space allocated on disk then we don't need
> to write it out at ordered-mode commit time, because the post-recovery
> filesystem will not have any references to that page.
>
> c) If the page is dirty due to overwrite then no metadata update was required.
>
> IOW, under what circumstances would an ordered-mode commit need to allocate
> space for a delayed-allocate page?
No need to allocate space within the commit thread, I think; only to take care
of the race I described above. In a hackish version of data=ordered for delayed
allocation I used a counter of submitted bios with newly-allocated blocks, and
the commit thread waits for the counter to reach 0.
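Roughly like this (a sketch; the names are made up):

    static atomic_t ordered_bios = ATOMIC_INIT(0);
    static DECLARE_WAIT_QUEUE_HEAD(ordered_bio_wait);

    /* when submitting a bio that writes newly-allocated blocks */
    atomic_inc(&ordered_bios);
    submit_bio(WRITE, bio);

    /* in the bio's completion callback */
    if (atomic_dec_and_test(&ordered_bios))
            wake_up(&ordered_bio_wait);

    /* in the commit thread, before writing the commit record */
    wait_event(ordered_bio_wait, atomic_read(&ordered_bios) == 0);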
>
> However b) might lead to the hey-my-file-is-full-of-zeroes problem.
>
thanks, Alex
On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > Yes, there can be issues with needing to allocate journal space within the
> > context of a commit. But
>
> no-no, this isn't required. We only need to mark pages/blocks within the
> transaction; otherwise a race is possible: we allocate blocks in a transaction,
> then the transaction starts to commit, and only then do we mark the
> pages/blocks to be flushed before commit.
I don't understand. Can you please describe the race in more detail?
Andrew Morton wrote:
> On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>> Yes, there can be issues with needing to allocate journal space within the
>>> context of a commit. But
>> no-no, this isn't required. We only need to mark pages/blocks within the
>> transaction; otherwise a race is possible: we allocate blocks in a transaction,
>> then the transaction starts to commit, and only then do we mark the
>> pages/blocks to be flushed before commit.
>
> I don't understand. Can you please describe the race in more detail?
If I understood your idea right, then in data=ordered mode the commit thread
writes all dirty mapped blocks before the real commit.
Say we have two threads: t1 is a thread doing flushing and t2 is the commit
thread:
t1: find dirty inode I
t1: find some dirty unallocated blocks
t1: journal_start()
t1: allocate blocks
t1: attach them to I
t1: journal_stop()
t2: going to commit
t2: find inode I dirty
t2: do NOT find these blocks because they're allocated only, but
    pages/bhs aren't mapped to them
t2: start commit
t1: map pages/bhs to just-allocated blocks
So, either we mark pages/bhs some way within journal_start()--journal_stop(),
or the commit thread has to do a lookup for all dirty pages. The latter
doesn't sound nice, IMHO.
thanks, Alex
On Fri, 04 May 2007 10:57:12 +0400 Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > On Fri, 04 May 2007 10:18:12 +0400 Alex Tomas <[email protected]> wrote:
> >
> >> Andrew Morton wrote:
> >>> Yes, there can be issues with needing to allocate journal space within the
> >>> context of a commit. But
> >> no-no, this isn't required. we only need to mark pages/blocks within
> >> transaction, otherwise race is possible when we allocate blocks in transaction,
> >> then transacton starts to commit, then we mark pages/blocks to be flushed
> >> before commit.
> >
> > I don't understand. Can you please describe the race in more detail?
>
> If I understood your idea right, then in data=ordered mode the commit thread
> writes all dirty mapped blocks before the real commit.
>
> Say we have two threads: t1 is a thread doing flushing and t2 is the commit
> thread:
>
> t1: find dirty inode I
> t1: find some dirty unallocated blocks
> t1: journal_start()
> t1: allocate blocks
> t1: attach them to I
> t1: journal_stop()
I'm still not understanding. The terms you're using are a bit ambiguous.
What does "find some dirty unallocated blocks" mean? Find a page which is
dirty and which does not have a disk mapping?
Normally the above operation would be implemented via
ext4_writeback_writepage(), and it runs under lock_page().
> t2: going to commit
> t2: find inode I dirty
> t2: do NOT find these blocks because they're allocated only, but
>     pages/bhs aren't mapped to them
> t2: start commit
I think you're assuming here that commit would be using ->t_sync_datalist
to locate dirty buffer_heads.
But under this proposal, t_sync_datalist just gets removed: the new
ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
understanding you, the way in which we'd handle any such race is to make
kjournald's writeback of the dirty pages block in lock_page(). Once it
gets the page lock it can look to see if some other thread has mapped the
page to disk.
It may turn out that kjournald needs a private way of getting at the
I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
had the radix-tree-of-dirty-inodes thing then that's easy enough to do
anyway, with a tagged search. But I expect that a single pass through the
superblock's dirty inodes would suffice for ordered-data. Files which
have chattr +j would screw things up, as usual.
I assume (hope) that your delayed allocation code implements
->writepages()? Doing the allocation one-page-at-a-time sounds painful...
>
> t1: map pages/bhs to just-allocated blocks
>
>
> So, either we mark pages/bhs some way within journal_start()--journal_stop(),
> or the commit thread has to do a lookup for all dirty pages. The latter
> doesn't sound nice, IMHO.
>
I don't think I'm understanding you fully yet.
Andrew Morton wrote:
> I'm still not understanding. The terms you're using are a bit ambiguous.
>
> What does "find some dirty unallocated blocks" mean? Find a page which is
> dirty and which does not have a disk mapping?
>
> Normally the above operation would be implemented via
> ext4_writeback_writepage(), and it runs under lock_page().
I'm mostly worried about the delayed allocation case. My impression was that
holding a number of pages locked isn't a good idea, even if they're locked
in index order. So I was going to mark a number of pages writeback, then
allocate blocks for all of them at once, then put the proper block numbers
into the bh's (or PG_mappedtodisk?).
>
>
>> t2: going to commit
>> t2: find inode I dirty
>> t2: do NOT find these blocks because they're allocated only, but
>>     pages/bhs aren't mapped to them
>> t2: start commit
>
> I think you're assuming here that commit would be using ->t_sync_datalist
> to locate dirty buffer_heads.
nope, I mean sb->inode->page walk.
> But under this proposal, t_sync_datalist just gets removed: the new
> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> understanding you, the way in which we'd handle any such race is to make
> kjournald's writeback of the dirty pages block in lock_page(). Once it
> gets the page lock it can look to see if some other thread has mapped the
> page to disk.
If I'm right about holding a number of pages locked, then they won't be locked
but under writeback. Of course kjournald can block on writeback as well, but
how does it find pages with *newly allocated* blocks only?
> It may turn out that kjournald needs a private way of getting at the
> I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> anyway, with a tagged search. But I expect that a single pass through the
> superblock's dirty inodes would suffice for ordered-data. Files which
> have chattr +j would screw things up, as usual.
not dirty inodes only, but rather some fast way to find pages with newly
allocated pages.
> I assume (hope) that your delayed allocation code implements
> ->writepages()? Doing the allocation one-page-at-a-time sounds painful...
Indeed, this is the root cause of all this complexity.
thanks, Alex
On Fri, 04 May 2007 11:39:22 +0400 Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > I'm still not understanding. The terms you're using are a bit ambiguous.
> >
> > What does "find some dirty unallocated blocks" mean? Find a page which is
> > dirty and which does not have a disk mapping?
> >
> > Normally the above operation would be implemented via
> > ext4_writeback_writepage(), and it runs under lock_page().
>
> I'm mostly worried about the delayed allocation case. My impression was that
> holding a number of pages locked isn't a good idea, even if they're locked
> in index order. So I was going to mark a number of pages writeback, then
> allocate blocks for all of them at once, then put the proper block numbers
> into the bh's (or PG_mappedtodisk?).
ooh, that sounds hacky and quite worrisome. If someone comes in and does
an fsync() we've lost our synchronisation point. Yes, all callers happen
to do
        lock_page();
        wait_on_page_writeback();
(I think) but we've never considered a bare PageWriteback() as something
which protects page internals. We're OK wrt page reclaim and we're OK wrt
truncate and invalidate. As long as the page is uptodate we _should_ be OK
wrt readpage(). But still, it'd be better to use the standard locking
rather than inventing new rules, if poss.
I'd be 100% OK with locking multiple pages in ascending pgoff_t order.
Locking the page is the standard way of doing this synchronisation and the
only problem I can think of is that having a tremendous number of pages
locked could cause the wake_up_page() waitqueue hashes to get overloaded
and go slow. But it's also possible to lock many, many pages with
readahead and nobody has reported problems in there.
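For concreteness, the rule is just (a sketch):

    /* lock a batch of pages in ascending ->index order; any two threads
     * locking overlapping batches then contend in the same order and
     * cannot deadlock against each other */
    static void lock_page_batch(struct page **pages, int nr)
    {
            int i;

            for (i = 0; i < nr; i++) {      /* pages[] sorted by index */
                    lock_page(pages[i]);
                    wait_on_page_writeback(pages[i]);
            }
    }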
> >
> >
> >> t2: going to commit
> >> t2: find inode I dirty
> >> t2: do NOT find these blocks because they're allocated only, but
> >>     pages/bhs aren't mapped to them
> >> t2: start commit
> >
> > I think you're assuming here that commit would be using ->t_sync_datalist
> > to locate dirty buffer_heads.
>
> nope, I mean sb->inode->page walk.
>
> > But under this proposal, t_sync_datalist just gets removed: the new
> > ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> > understanding you, the way in which we'd handle any such race is to make
> > kjournald's writeback of the dirty pages block in lock_page(). Once it
> > gets the page lock it can look to see if some other thread has mapped the
> > page to disk.
>
> If I'm right about holding a number of pages locked, then they won't be locked
> but under writeback. Of course kjournald can block on writeback as well, but
> how does it find pages with *newly allocated* blocks only?
I don't think we'd want kjournald to do that. Even if a page was dirtied
by an overwrite, we'd want to write it back during commit, just from a
quality-of-implementation point of view. If we were to leave these pages
unwritten during commit then a post-recovery file could have a mix of
up-to-five-second-old data and up-to-30-seconds-old data.
> > It may turn out that kjournald needs a private way of getting at the
> > I_DIRTY_PAGES inodes to do this properly, but I don't _think_ so. If we
> > had the radix-tree-of-dirty-inodes thing then that's easy enough to do
> > anyway, with a tagged search. But I expect that a single pass through the
> > superblock's dirty inodes would suffice for ordered-data. Files which
> > have chattr +j would screw things up, as usual.
>
> not dirty inodes only, but rather some fast way to find pages with newly
> allocated pages.
Newly allocated blocks, you mean?
Just write out the overwritten blocks as well as the new ones, I reckon.
It's what we do now.
Andrew Morton wrote:
>>> But under this proposal, t_sync_datalist just gets removed: the new
>>> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
>>> understanding you, the way in which we'd handle any such race is to make
>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
>>> gets the page lock it can look to see if some other thread has mapped the
>>> page to disk.
>> If I'm right about holding a number of pages locked, then they won't be locked
>> but under writeback. Of course kjournald can block on writeback as well, but
>> how does it find pages with *newly allocated* blocks only?
>
> I don't think we'd want kjournald to do that. Even if a page was dirtied
> by an overwrite, we'd want to write it back during commit, just from a
> quality-of-implementation point of view. If we were to leave these pages
> unwritten during commit then a post-recovery file could have a mix of
> up-to-five-second-old data and up-to-30-seconds-old data.
Trying to implement this, I've come to think that there is one significant
difference between t_sync_datalist and the sb->inode->page walk: t_sync_datalist
is per-transaction. IOW, it doesn't change once the transaction is closed. In
contrast, nothing (currently) would prevent others from modifying pages while
the commit is in progress. I think this is a serious disadvantage of the solution.
What I'd propose is a sort of in-core tracker for all data-related IOs in flight
(assigned to a specific transaction), with the commit thread waiting for their
completion.
thanks, Alex
On Thu, 16 Aug 2007 22:20:06 +0400
Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> >>> But under this proposal, t_sync_datalist just gets removed: the new
> >>> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> >>> understanding you, the way in which we'd handle any such race is to make
> >>> kjournald's writeback of the dirty pages block in lock_page(). Once it
> >>> gets the page lock it can look to see if some other thread has mapped the
> >>> page to disk.
> >> If I'm right about holding a number of pages locked, then they won't be locked
> >> but under writeback. Of course kjournald can block on writeback as well, but
> >> how does it find pages with *newly allocated* blocks only?
> >
> > I don't think we'd want kjournald to do that. Even if a page was dirtied
> > by an overwrite, we'd want to write it back during commit, just from a
> > quality-of-implementation point of view. If we were to leave these pages
> > unwritten during commit then a post-recovery file could have a mix of
> > up-to-five-second-old data and up-to-30-seconds-old data.
>
> Trying to implement this, I've come to think that there is one significant
> difference between t_sync_datalist and the sb->inode->page walk: t_sync_datalist
> is per-transaction. IOW, it doesn't change once the transaction is closed. In
> contrast, nothing (currently) would prevent others from modifying pages while
> the commit is in progress.
That can happen at present - there's nothing to stop a process from modifying
a page which is undergoing ordered-data commit-time writeout.
Andrew Morton wrote:
> On Thu, 16 Aug 2007 22:20:06 +0400
> Alex Tomas <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>>>> But under this proposal, t_sync_datalist just gets removed: the new
>>>>> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
>>>>> understanding you, the way in which we'd handle any such race is to make
>>>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
>>>>> gets the page lock it can look to see if some other thread has mapped the
>>>>> page to disk.
>>>> If I'm right about holding a number of pages locked, then they won't be locked
>>>> but under writeback. Of course kjournald can block on writeback as well, but
>>>> how does it find pages with *newly allocated* blocks only?
>>> I don't think we'd want kjournald to do that. Even if a page was dirtied
>>> by an overwrite, we'd want to write it back during commit, just from a
>>> quality-of-implementation point of view. If we were to leave these pages
>>> unwritten during commit then a post-recovery file could have a mix of
>>> up-to-five-second-old data and up-to-30-seconds-old data.
>> Trying to implement this, I've come to think that there is one significant
>> difference between t_sync_datalist and the sb->inode->page walk: t_sync_datalist
>> is per-transaction. IOW, it doesn't change once the transaction is closed. In
>> contrast, nothing (currently) would prevent others from modifying pages while
>> the commit is in progress.
>
> That can happen at present - there's nothing to stop a process from modifying
> a page which is undergoing ordered-data commit-time writeout.
I tend to think it's still a bit different: the set of pages doesn't change with
t_sync_datalist. With the sb->inode->page approach, even a silly dd will be able
to *add* a bunch of new pages while we're syncing the first ones. Why shouldn't
we fix this?
thanks, Alex
On Fri, 17 Aug 2007 06:24:47 +0400 Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > On Thu, 16 Aug 2007 22:20:06 +0400
> > Alex Tomas <[email protected]> wrote:
> >
> >> Andrew Morton wrote:
> >>>>> But under this proposal, t_sync_datalist just gets removed: the new
> >>>>> ordered-data mode _only_ needs to do the sb->inode->page walk. So if I'm
> >>>>> understanding you, the way in which we'd handle any such race is to make
> >>>>> kjournald's writeback of the dirty pages block in lock_page(). Once it
> >>>>> gets the page lock it can look to see if some other thread has mapped the
> >>>>> page to disk.
> >>>> If I'm right about holding a number of pages locked, then they won't be locked
> >>>> but under writeback. Of course kjournald can block on writeback as well, but
> >>>> how does it find pages with *newly allocated* blocks only?
> >>> I don't think we'd want kjournald to do that. Even if a page was dirtied
> >>> by an overwrite, we'd want to write it back during commit, just from a
> >>> quality-of-implementation point of view. If we were to leave these pages
> >>> unwritten during commit then a post-recovery file could have a mix of
> >>> up-to-five-second-old data and up-to-30-seconds-old data.
> >> Trying to implement this, I've come to think that there is one significant
> >> difference between t_sync_datalist and the sb->inode->page walk: t_sync_datalist
> >> is per-transaction. IOW, it doesn't change once the transaction is closed. In
> >> contrast, nothing (currently) would prevent others from modifying pages while
> >> the commit is in progress.
> >
> > That can happen at present - there's nothing to stop a process from modifying
> > a page which is undergoing ordered-data commit-time writeout.
>
> I tend to think it's still a bit different: the set of pages doesn't change with
> t_sync_datalist. With the sb->inode->page approach, even a silly dd will be able
> to *add* a bunch of new pages while we're syncing the first ones. Why shouldn't
> we fix this?
>
Sort-of. But the per-superblock, per-inode writeback code is pretty
careful to avoid livelocks. The per-inode writeback is a strict single
linear sweep across the file. It'll basically write out anything which was
dirty when it was called. The per-superblock inode walk isn't as accurate
as that, because of the difficulties of juggling list_heads. But we're
slowly working on that, and I suspect it'll be good enough for ext3
purposes already.
Andrew Morton wrote:
> Sort-of. But the per-superblock, per-inode writeback code is pretty
> careful to avoid livelocks. The per-inode writeback is a strict single
> linear sweep across the file. It'll basically write out anything which was
> dirty when it was called. The per-superblock inode walk isn't as accurate
> as that, because of the difficulties of juggling list_heads. But we're
> slowly working on that, and I suspect it'll be good enough for ext3
> purposes already.
I'd say that these are two different mechanisms solving different problems:
1) VFS/MM does periodic updates and uses regular writeback
2) data=ordered is there to avoid metadata pointing to not-yet-written data
We can't use regular writeback in the commit thread as long as it can fall into
block allocation. So we'd have to add one more WB mode (btw, I have a patch which
skips non-allocated blocks in writeback if a special WB mode is requested).
OTOH, the faster we go through the data sync part of commit, the better. Given
that lots of inodes can be dirty with no data to sync, it's going to take
long in some cases. It's especially bad because commit doesn't scale to many
CPUs.
Also, why would we need to flush *everything* every 5s? Just because ext3 does
this? Sounds strange. If somebody really needs this, we could add the possibility
to the regular writeback path (making it tunable). But I'd prefer to have
a separate (fast, lightweight, scalable) mechanism to support data=ordered.
thanks, Alex
On Fri, 17 Aug 2007 12:36:32 +0400 Alex Tomas <[email protected]> wrote:
> Andrew Morton wrote:
> > Sort-of. But the per-superblock, per-inode writeback code is pretty
> > careful to avoid livelocks. The per-inode writeback is a strict single
> > linear sweep across the file. It'll basically write out anything which was
> > dirty when it was called. The per-superblock inode walk isn't as accurate
> > as that, because of the difficulties of juggling list_heads. But we're
> > slowly working on that, and I suspect it'll be good enough for ext3
> > purposes already.
>
> I'd say that these are two different mechanisms solving different problems:
> 1) VFS/MM does periodic updates and uses regular writeback
> 2) data=ordered is there to avoid metadata pointing to not-yet-written data
VFS/MM can do _much_ more than that! Look at struct writeback_control.
That code path has many different modes of operation: it is used for
regular pdflush writeback, sync, fsync, throttling, etc. Probably one of
its modes will be sufficient. If we want to change ext3's existing
semantics and add an "only writeback uninitialised blocks" mode then
that'll be pretty straightforward: add more control information to
writeback_control and go for it.
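Something like this, say (the for_commit bit being the invented part):

    struct writeback_control wbc = {
            .sync_mode      = WB_SYNC_ALL,
            .nr_to_write    = LONG_MAX,
            .range_start    = 0,
            .range_end      = LLONG_MAX,
            .for_commit     = 1,    /* invented flag: only write pages
                                     * with newly-allocated blocks */
    };

    writeback_inodes(&wbc);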
> We can't use regular writeback in the commit thread as long as it can fall into
> block allocation. So we'd have to add one more WB mode (btw, I have a patch which
> skips non-allocated blocks in writeback if a special WB mode is requested).
yup
> OTOH, the faster we go through the data sync part of commit, the better. Given
> that lots of inodes can be dirty with no data to sync, it's going to take
> long in some cases. It's especially bad because commit doesn't scale to many
> CPUs.
eh?
> Also, why would we need to flush *everything* every 5s? Just because ext3 does
> this? Sounds strange. If somebody really needs this, we could add the possibility
> to the regular writeback path (making it tunable). But I'd prefer to have
> a separate (fast, lightweight, scalable) mechanism to support data=ordered.
>
Yeah, that would make sense, perhaps.
Or just speed the existing stuff up. iirc the main problem in there is unrelated
to data writeback. There are situations where the running transaction has to block
behind metadata writeout which the committing transaction is performing. I
reluctantly put that in years ago to get us out of a tight spot and it
never got optimised.
Andrew Morton wrote:
>> OTOH, the faster we go through the data sync part of commit, the better. Given
>> that lots of inodes can be dirty with no data to sync, it's going to take
>> long in some cases. It's especially bad because commit doesn't scale to many
>> CPUs.
>
> eh?
I mean that the number of inodes to scan can be an order of magnitude larger
than the number of inodes needing sync for a given transaction. The commit
thread has to scan them all (quite an amount of CPU, I guess) and we can't use
more than one CPU to speed the scan up.
>> Also, why would we need to flush *everything* every 5s? Just because ext3 does
>> this? Sounds strange. If somebody really needs this, we could add the possibility
>> to the regular writeback path (making it tunable). But I'd prefer to have
>> a separate (fast, lightweight, scalable) mechanism to support data=ordered.
>>
>
> Yeah, that would make sense, perhaps.
>
> Or just speed the existing stuff up. iirc the main problem in there is unrelated
> to data writeback. There are situations where the running transaction has to block
> behind metadata writeout which the committing transaction is performing. I
> reluctantly put that in years ago to get us out of a tight spot and it
> never got optimised.
AFAIU, existing writeback is built around the dirty-inode list and the dirty
bit in the per-inode radix tree. In order to avoid scanning too much (see
above) we'd need a separate list and probably one more bit in the radix tree,
plus some code to allow writeback to use the new list/tag.
As for the main problem ... I'd very much appreciate any details. Probably it
was about several blocks in a page, where one block is allocated in transaction
1 and the next block is being allocated in transaction 2?
thanks, Alex