2006-03-09 16:22:58

by OGAWA Hirofumi

[permalink] [raw]
Subject: [PATCH] Write the inode itself in block_fsync()

Hi,

For block device's inode, we don't write a inode's meta data
itself. But, I think we should write inode's meta data for fsync().

Signed-off-by: OGAWA Hirofumi <[email protected]>
---

fs/block_dev.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c
--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900
+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900
@@ -19,6 +19,7 @@
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/buffer_head.h>
+#include <linux/writeback.h>
#include <linux/mpage.h>
#include <linux/mount.h>
#include <linux/uio.h>
@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file *

static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
- return sync_blockdev(I_BDEV(filp->f_mapping->host));
+ struct inode *inode = dentry->d_inode;
+ int ret = 0, err;
+
+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ .nr_to_write = 0, /* sys_fsync did this */
+ };
+ ret = sync_inode(inode, &wbc);
+ }
+ err = sync_blockdev(I_BDEV(filp->f_mapping->host));
+ if (!ret)
+ ret = err;
+ return err;
}

/*
_


2006-03-10 01:05:47

by Sam Vilain

[permalink] [raw]
Subject: Re: [PATCH] Write the inode itself in block_fsync()

OGAWA Hirofumi wrote:

>Hi,
>
>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().
>
>

Ouch... won't that halve performance of database transaction logs?

>Signed-off-by: OGAWA Hirofumi <[email protected]>
>---
>
> fs/block_dev.c | 16 +++++++++++++++-
> 1 file changed, 15 insertions(+), 1 deletion(-)
>
>diff -puN fs/block_dev.c~block_fsync-write-inode fs/block_dev.c
>--- linux-2.6/fs/block_dev.c~block_fsync-write-inode 2006-03-05 21:51:15.000000000 +0900
>+++ linux-2.6-hirofumi/fs/block_dev.c 2006-03-05 22:28:28.000000000 +0900
>@@ -19,6 +19,7 @@
> #include <linux/module.h>
> #include <linux/blkpg.h>
> #include <linux/buffer_head.h>
>+#include <linux/writeback.h>
> #include <linux/mpage.h>
> #include <linux/mount.h>
> #include <linux/uio.h>
>@@ -232,7 +233,20 @@ static loff_t block_llseek(struct file *
>
> static int block_fsync(struct file *filp, struct dentry *dentry, int datasync)
> {
>- return sync_blockdev(I_BDEV(filp->f_mapping->host));
>+ struct inode *inode = dentry->d_inode;
>+ int ret = 0, err;
>+
>+ if (!datasync || (inode->i_state & I_DIRTY_DATASYNC)) {
>+ struct writeback_control wbc = {
>+ .sync_mode = WB_SYNC_ALL,
>+ .nr_to_write = 0, /* sys_fsync did this */
>+ };
>+ ret = sync_inode(inode, &wbc);
>+ }
>+ err = sync_blockdev(I_BDEV(filp->f_mapping->host));
>+ if (!ret)
>+ ret = err;
>+ return err;
> }
>
> /*
>_
>-
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
>

2006-03-10 04:12:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Write the inode itself in block_fsync()

Sam Vilain <[email protected]> wrote:
>
> OGAWA Hirofumi wrote:
>
> >Hi,
> >
> >For block device's inode, we don't write a inode's meta data
> >itself. But, I think we should write inode's meta data for fsync().
> >
> >
>
> Ouch... won't that halve performance of database transaction logs?

Yes, it could well cause a lot more seeking to do atime and/or mtime
writes. Which aren't terribly important, really.

Unless I'm missing something, I suspect we'd be better off without this,
even though it's a correctness fix :(

2006-03-10 14:12:27

by Bart Samwel

[permalink] [raw]
Subject: Re: [PATCH] Write the inode itself in block_fsync()

Andrew Morton wrote:
> Sam Vilain <[email protected]> wrote:
>> OGAWA Hirofumi wrote:
>>
>> >For block device's inode, we don't write a inode's meta data
>> >itself. But, I think we should write inode's meta data for fsync().
>>
>> Ouch... won't that halve performance of database transaction logs?
>
> Yes, it could well cause a lot more seeking to do atime and/or mtime
> writes. Which aren't terribly important, really.
>
> Unless I'm missing something, I suspect we'd be better off without this,
> even though it's a correctness fix :(

Maybe atime/mtime aren't important, but I would be unhappy if a file
size change wasn't written to disk on fsync.

Anyway, shouldn't databases be using a combination of fixed-size files
and fdatasync? fsync doesn't perform well by definition, and I guess the
only reason databases still use it is because the kernel failed to
implement the sucky part of the behaviour.

A complex but perhaps viable suggestion: as atime/mtime are stored on
disk in second granularity (on ext3 at least, don't know about other
fss), wouldn't it somehow be possible to only regard atime/mtime changes
as real changes when i_(a|c)time.tv_sec changes? This would enable fsync
to write the inode once every second instead of on every fsync. The
performance drop would be much less dramatic than writing the inode on
every fsync, and it would at least yield correct behaviour.

Cheers,
Bart

2006-03-10 15:18:34

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: [PATCH] Write the inode itself in block_fsync()

Bart Samwel <[email protected]> writes:

> Andrew Morton wrote:
>> Sam Vilain <[email protected]> wrote:
>>> OGAWA Hirofumi wrote:
> >>
>>> >For block device's inode, we don't write a inode's meta data
>>> >itself. But, I think we should write inode's meta data for fsync().
>>>
>>> Ouch... won't that halve performance of database transaction logs?
>>
>> Yes, it could well cause a lot more seeking to do atime and/or mtime
>> writes. Which aren't terribly important, really.
>>
>> Unless I'm missing something, I suspect we'd be better off without this,
>> even though it's a correctness fix :(
>
> Maybe atime/mtime aren't important, but I would be unhappy if a file
> size change wasn't written to disk on fsync.

Please don't worry, we should be doing a right thing for normal files
already. This patch is just for block device file.

> Anyway, shouldn't databases be using a combination of fixed-size files
> and fdatasync? fsync doesn't perform well by definition, and I guess the
> only reason databases still use it is because the kernel failed to
> implement the sucky part of the behaviour.

Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so,
usually this patch doesn't change fdatasync() at all.

Umm... however, I also can understand what akpm says.... check some databases.

berkeley db 4.4: use fdatasync() if available
mysql 5.0: use fdatasync() if available (innobase)
use fsync() (bdb)
postgresql: use fdatasync() if available
sqlite: use fsync

After all, I don't know.

> A complex but perhaps viable suggestion: as atime/mtime are stored
> on disk in second granularity (on ext3 at least, don't know about
> other fss), wouldn't it somehow be possible to only regard
> atime/mtime changes as real changes when i_(a|c)time.tv_sec changes?
> This would enable fsync to write the inode once every second instead
> of on every fsync. The performance drop would be much less dramatic
> than writing the inode on every fsync, and it would at least yield
> correct behaviour.

Yes, and we are already doing it.

Thanks.
--
OGAWA Hirofumi <[email protected]>

2006-03-10 17:33:13

by Bart Samwel

[permalink] [raw]
Subject: Re: [PATCH] Write the inode itself in block_fsync()

OGAWA Hirofumi wrote:
> Bart Samwel <[email protected]> writes:
>
>> Andrew Morton wrote:
>>> Sam Vilain <[email protected]> wrote:
>>>> OGAWA Hirofumi wrote:
>> >>
>>>> Ouch... won't that halve performance of database transaction logs?
>>> Yes, it could well cause a lot more seeking to do atime and/or mtime
>>> writes. Which aren't terribly important, really.
>>>
>>> Unless I'm missing something, I suspect we'd be better off without this,
>>> even though it's a correctness fix :(
>> Maybe atime/mtime aren't important, but I would be unhappy if a file
>> size change wasn't written to disk on fsync.
>
> Please don't worry, we should be doing a right thing for normal files
> already. This patch is just for block device file.

Ahhh, I missed that. I interpreted:

>For block device's inode, we don't write a inode's meta data
>itself. But, I think we should write inode's meta data for fsync().

as "for block devices we don't, for normal files, yes", but apparently
that's not what you meant. :-)

>> Anyway, shouldn't databases be using a combination of fixed-size files
>> and fdatasync? fsync doesn't perform well by definition, and I guess the
>> only reason databases still use it is because the kernel failed to
>> implement the sucky part of the behaviour.
>
> Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so,
> usually this patch doesn't change fdatasync() at all.
>
> Umm... however, I also can understand what akpm says.... check some databases.
>
> berkeley db 4.4: use fdatasync() if available
> mysql 5.0: use fdatasync() if available (innobase)
> use fsync() (bdb)
> postgresql: use fdatasync() if available
> sqlite: use fsync

Nice piece of info. Apparently all of the "large" database engines can
use fdatasync, only the smaller ones (bdb, sqlite) don't. I've done some
extra research:

* From a quick look at the docs it seems to me that bdb can't be
configured to put its transaction log directly on a block device, so bdb
won't be affected.

* SQLite definitely can't write logs to a block device, the docs
explicitly say that the transaction log is a regular file with a
specific name, so we can write off sqlite as well. (It does seem to use
fdatasync btw, since version 3.2.6, see http://www.sqlite.org/changes.html.)

If we've missed none, that leaves only proprietary databases at risk.
But I would be genuinely surprised if a database like Oracle would use
fsync. If we assume that Oracle et al. are not a problem, the risks of
this patch are very low.

Cheers,
Bart