2010-04-15 04:43:46

by Anton Blanchard

Subject: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices


We are seeing a large regression in database performance on recent kernels.
The database opens a block device with O_DIRECT|O_SYNC and a number of threads
write to different regions of the file at the same time.

A simple test case is below. I haven't defined DEVICE to anything since getting
it wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
see about 17MB/sec and only a few threads in IO wait:

procs -----io---- -system-- -----cpu------
 r  b    bi    bo   in   cs us sy id wa st
 0  3     0 16170  656 2259  0  0 86 14  0
 0  2     0 16704  695 2408  0  0 92  8  0
 0  2     0 17308  744 2653  0  0 86 14  0
 0  2     0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

	mutex_lock(&mapping->host->i_mutex);
	err = fop->fsync(file, dentry, datasync);
	if (!ret)
		ret = err;
	mutex_unlock(&mapping->host->i_mutex);
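
For reference, the lines just above that in fs/sync.c (quoted roughly from
the 2.6.33-era source, so the exact surroundings may differ) are:

	ret = filemap_write_and_wait_range(mapping, start, end);

	/*
	 * We need to protect against concurrent writers, which could cause
	 * livelocks in fsync_buffers_list().
	 */
	mutex_lock(&mapping->host->i_mutex);

so every O_SYNC writer funnels through the one i_mutex even though writeback
on its range has already been issued and waited on.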

Commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
of what is going on:

  Use these new helpers for syncing from generic VFS functions. This makes
  O_SYNC writes to block devices acquire i_mutex for syncing. If we really
  care about this, we can make block_fsync() drop the i_mutex and reacquire
  it before it returns.

Thanks Jan for such a good commit message! The patch below drops the i_mutex
in blkdev_fsync as suggested. With it the testcase improves from 17MB/sec to
68MB/sec:

procs -----io---- -system-- -----cpu------
 r  b    bi    bo   in   cs us sy id wa st
 0  7     0 65536 1000 3878  0  0 70 30  0
 0 34     0 69632 1016 3921  0  1 46 53  0
 0 57     0 69632 1000 3921  0  0 55 45  0
 0 53     0 69640  754 4111  0  0 81 19  0

I'd appreciate any comments from the I/O guys on whether this is the right approach.


Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

/* Round VAL up to the next multiple of SIZE (SIZE must be a power of two) */
#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	/* O_DIRECT needs an aligned buffer: overallocate and align to 1k */
	b = malloc(BUFSIZE + 1024);
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	/* Each thread rewrites its own BUFSIZE region of the device forever */
	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;

		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}
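
To reproduce, build and run it with something like the following (the binary
name is arbitrary):

	gcc -Wall -O2 -o osync_test osync_test.c -lpthread
	./osync_test O_SYNC

and watch the throughput with vmstat 1 in another terminal.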


Signed-off-by: Anton Blanchard <[email protected]>
---

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2010-04-14 12:55:50.000000000 +1000
+++ linux-2.6/fs/block_dev.c 2010-04-14 13:17:45.000000000 +1000
@@ -406,16 +406,24 @@ static loff_t block_llseek(struct file *
 
 int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct inode *bd_inode = filp->f_mapping->host;
+	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 
+	mutex_unlock(&bd_inode->i_mutex);
+
 	error = sync_blockdev(bdev);
-	if (error)
+	if (error) {
+		mutex_lock(&bd_inode->i_mutex);
 		return error;
+	}
 
 	error = blkdev_issue_flush(bdev, NULL);
 	if (error == -EOPNOTSUPP)
 		error = 0;
+
+	mutex_lock(&bd_inode->i_mutex);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);


2010-04-15 08:47:55

by Jan Kara

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices

On Thu 15-04-10 14:40:39, Anton Blanchard wrote:
>
> We are seeing a large regression in database performance on recent kernels.
> The database opens a block device with O_DIRECT|O_SYNC and a number of threads
> write to different regions of the file at the same time.
>
> A simple test case is below. I haven't defined DEVICE to anything since getting
> it wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
> see about 17MB/sec and only a few threads in IO wait:
>
> procs -----io---- -system-- -----cpu------
>  r  b    bi    bo   in   cs us sy id wa st
>  0  3     0 16170  656 2259  0  0 86 14  0
>  0  2     0 16704  695 2408  0  0 92  8  0
>  0  2     0 17308  744 2653  0  0 86 14  0
>  0  2     0 17933  759 2777  0  0 89 10  0
>
> Most threads are blocking in vfs_fsync_range, which has:
>
> 	mutex_lock(&mapping->host->i_mutex);
> 	err = fop->fsync(file, dentry, datasync);
> 	if (!ret)
> 		ret = err;
> 	mutex_unlock(&mapping->host->i_mutex);
...
Just a few style nitpicks:

> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c 2010-04-14 12:55:50.000000000 +1000
> +++ linux-2.6/fs/block_dev.c 2010-04-14 13:17:45.000000000 +1000
> @@ -406,16 +406,24 @@ static loff_t block_llseek(struct file *
>
> int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
> {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
> 	int error;
>
Could you please add a comment here? Like "There is no need to
protect syncing of the block device by i_mutex and it unnecessarily
serializes workloads with several O_SYNC writers to the block device".

> +	mutex_unlock(&bd_inode->i_mutex);
> +
> 	error = sync_blockdev(bdev);
> -	if (error)
> +	if (error) {
> +		mutex_lock(&bd_inode->i_mutex);
> 		return error;
Usually, "goto out" is preferred instead of the above.

> +	}
>
> 	error = blkdev_issue_flush(bdev, NULL);
> 	if (error == -EOPNOTSUPP)
> 		error = 0;
> +
And define out: here.

> +	mutex_lock(&bd_inode->i_mutex);
> +
> 	return error;
> }
> EXPORT_SYMBOL(blkdev_fsync);
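
Putting both nitpicks together, the tail of the function would then look
something like this (untested sketch):

	error = sync_blockdev(bdev);
	if (error)
		goto out;

	error = blkdev_issue_flush(bdev, NULL);
	if (error == -EOPNOTSUPP)
		error = 0;
out:
	mutex_lock(&bd_inode->i_mutex);
	return error;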

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-15 10:04:18

by Jens Axboe

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices

On Thu, Apr 15 2010, Anton Blanchard wrote:
>
> We are seeing a large regression in database performance on recent kernels.
> The database opens a block device with O_DIRECT|O_SYNC and a number of threads
> write to different regions of the file at the same time.
>
> A simple test case is below. I haven't defined DEVICE to anything since getting
> it wrong will destroy your data :) On a 3 disk LVM with a 64k chunk size we
> see about 17MB/sec and only a few threads in IO wait:
>
> procs -----io---- -system-- -----cpu------
>  r  b    bi    bo   in   cs us sy id wa st
>  0  3     0 16170  656 2259  0  0 86 14  0
>  0  2     0 16704  695 2408  0  0 92  8  0
>  0  2     0 17308  744 2653  0  0 86 14  0
>  0  2     0 17933  759 2777  0  0 89 10  0
>
> Most threads are blocking in vfs_fsync_range, which has:
>
> 	mutex_lock(&mapping->host->i_mutex);
> 	err = fop->fsync(file, dentry, datasync);
> 	if (!ret)
> 		ret = err;
> 	mutex_unlock(&mapping->host->i_mutex);
>
> Commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
> syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
> of what is going on:
>
> Use these new helpers for syncing from generic VFS functions. This makes
> O_SYNC writes to block devices acquire i_mutex for syncing. If we really
> care about this, we can make block_fsync() drop the i_mutex and reacquire
> it before it returns.
>
> Thanks Jan for such a good commit message! The patch below drops the i_mutex
> in blkdev_fsync as suggested. With it the testcase improves from 17MB/sec to
> 68MB/sec:
>
> procs -----io---- -system-- -----cpu------
>  r  b    bi    bo   in   cs us sy id wa st
>  0  7     0 65536 1000 3878  0  0 70 30  0
>  0 34     0 69632 1016 3921  0  1 46 53  0
>  0 57     0 69632 1000 3921  0  0 55 45  0
>  0 53     0 69640  754 4111  0  0 81 19  0
>
> I'd appreciate any comments from the I/O guys on whether this is the right approach.

Looks good to me, I see Jan already made a few style suggestions.

--
Jens Axboe

2010-04-15 10:43:12

by Christoph Hellwig

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices

On Thu, Apr 15, 2010 at 02:40:39PM +1000, Anton Blanchard wrote:
> int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
> {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
> 	int error;
>
> +	mutex_unlock(&bd_inode->i_mutex);
> +
> 	error = sync_blockdev(bdev);

Actually you can just drop this call entirely. sync_blockdev is an
overcomplicated alias for filemap_write_and_wait on the block device
inode, which is exactly what we did just before calling into ->fsync.

It might still be worth dropping i_mutex for the cache flush, though.
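
(For reference, sync_blockdev() in fs/block_dev.c is, more or less:

	int sync_blockdev(struct block_device *bdev)
	{
		int ret = 0;

		if (bdev)
			ret = filemap_write_and_wait(bdev->bd_inode->i_mapping);
		return ret;
	}

so calling it from ->fsync just repeats the writeback and wait that
vfs_fsync_range has already done on the same mapping.)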

2010-04-15 13:34:31

by Jan Kara

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices

On Thu 15-04-10 12:42:24, Christoph Hellwig wrote:
> On Thu, Apr 15, 2010 at 02:40:39PM +1000, Anton Blanchard wrote:
> > int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
> > {
> > -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> > +	struct inode *bd_inode = filp->f_mapping->host;
> > +	struct block_device *bdev = I_BDEV(bd_inode);
> > 	int error;
> >
> > +	mutex_unlock(&bd_inode->i_mutex);
> > +
> > 	error = sync_blockdev(bdev);
>
> Actually you can just drop this call entirely. sync_blockdev is an
> overcomplicated alias for filemap_write_and_wait on the block device
> inode, which is exactly what we did just before calling into ->fsync.
>
> It might still be worth dropping i_mutex for the cache flush, though.
Yeah, that's a good point... Anton, just remove sync_blockdev() from
blkdev_fsync completely please.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-04-20 02:29:56

by Anton Blanchard

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices


Hi,

> Actually you can just drop this call entirely. sync_blockdev is an
> overcomplicated alias for filemap_write_and_wait on the block device
> inode, which is exactly what we did just before calling into ->fsync.
>
> It might still be worth dropping i_mutex for the cache flush, though.

Thanks for the feedback Jan + Christoph. New patch on the way.

Anton

2010-04-20 02:35:03

by Anton Blanchard

Subject: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices


We are seeing a large regression in database performance on recent kernels.
The database opens a block device with O_DIRECT|O_SYNC and a number of threads
write to different regions of the file at the same time.

A simple test case is below. I haven't defined DEVICE since getting it wrong
will destroy your data :) On a 3 disk LVM with a 64k chunk size we see about
17MB/sec and only a few threads in IO wait:

procs -----io---- -system-- -----cpu------
 r  b    bi    bo   in   cs us sy id wa st
 0  3     0 16170  656 2259  0  0 86 14  0
 0  2     0 16704  695 2408  0  0 92  8  0
 0  2     0 17308  744 2653  0  0 86 14  0
 0  2     0 17933  759 2777  0  0 89 10  0

Most threads are blocking in vfs_fsync_range, which has:

	mutex_lock(&mapping->host->i_mutex);
	err = fop->fsync(file, dentry, datasync);
	if (!ret)
		ret = err;
	mutex_unlock(&mapping->host->i_mutex);

commit 148f948ba877f4d3cdef036b1ff6d9f68986706a (vfs: Introduce new helpers for
syncing after writing to O_SYNC file or IS_SYNC inode) offers some explanation
of what is going on:

  Use these new helpers for syncing from generic VFS functions. This makes
  O_SYNC writes to block devices acquire i_mutex for syncing. If we really
  care about this, we can make block_fsync() drop the i_mutex and reacquire
  it before it returns.

Thanks Jan for such a good commit message! As well as dropping i_mutex,
Christoph suggests we should remove the call to sync_blockdev():

> sync_blockdev is an overcomplicated alias for filemap_write_and_wait on
> the block device inode, which is exactly what we did just before calling
> into ->fsync.

The patch below incorporates both suggestions. With it the testcase improves
from 17MB/sec to 68MB/sec:

procs -----io---- -system-- -----cpu------
 r  b    bi    bo   in   cs us sy id wa st
 0  7     0 65536 1000 3878  0  0 70 30  0
 0 34     0 69632 1016 3921  0  1 46 53  0
 0 57     0 69632 1000 3921  0  0 55 45  0
 0 53     0 69640  754 4111  0  0 81 19  0


Testcase:

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

#define NR_THREADS 64
#define BUFSIZE (64 * 1024)

#define DEVICE "/dev/mapper/XXXXXX"

/* Round VAL up to the next multiple of SIZE (SIZE must be a power of two) */
#define ALIGN(VAL, SIZE) (((VAL)+(SIZE)-1) & ~((SIZE)-1))

static int fd;

static void *doit(void *arg)
{
	unsigned long offset = (long)arg;
	char *b, *buf;

	/* O_DIRECT needs an aligned buffer: overallocate and align to 1k */
	b = malloc(BUFSIZE + 1024);
	buf = (char *)ALIGN((unsigned long)b, 1024);
	memset(buf, 0, BUFSIZE);

	/* Each thread rewrites its own BUFSIZE region of the device forever */
	while (1)
		pwrite(fd, buf, BUFSIZE, offset);
}

int main(int argc, char *argv[])
{
	int flags = O_RDWR|O_DIRECT;
	int i;
	unsigned long offset = 0;

	if (argc > 1 && !strcmp(argv[1], "O_SYNC"))
		flags |= O_SYNC;

	fd = open(DEVICE, flags);
	if (fd == -1) {
		perror("open");
		exit(1);
	}

	for (i = 0; i < NR_THREADS-1; i++) {
		pthread_t tid;

		pthread_create(&tid, NULL, doit, (void *)offset);
		offset += BUFSIZE;
	}
	doit((void *)offset);

	return 0;
}


Signed-off-by: Anton Blanchard <[email protected]>
---

Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c 2010-04-20 11:28:32.000000000 +1000
+++ linux-2.6/fs/block_dev.c 2010-04-20 11:36:46.000000000 +1000
@@ -406,16 +406,23 @@ static loff_t block_llseek(struct file *
 
 int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
 {
-	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
+	struct inode *bd_inode = filp->f_mapping->host;
+	struct block_device *bdev = I_BDEV(bd_inode);
 	int error;
 
-	error = sync_blockdev(bdev);
-	if (error)
-		return error;
-
+	/*
+	 * There is no need to serialise calls to blkdev_issue_flush with
+	 * i_mutex and doing so causes performance issues with concurrent
+	 * O_SYNC writers to a block device.
+	 */
+	mutex_unlock(&bd_inode->i_mutex);
+
 	error = blkdev_issue_flush(bdev, NULL);
 	if (error == -EOPNOTSUPP)
 		error = 0;
+
+	mutex_lock(&bd_inode->i_mutex);
+
 	return error;
 }
 EXPORT_SYMBOL(blkdev_fsync);

2010-04-22 19:25:00

by Jan Kara

Subject: Re: [PATCH] Fix regression in O_DIRECT|O_SYNC writes to block devices

On Tue 20-04-10 12:30:47, Anton Blanchard wrote:
<snip>
> Signed-off-by: Anton Blanchard <[email protected]>
The patch looks good to me now.

Acked-by: Jan Kara <[email protected]>

> Index: linux-2.6/fs/block_dev.c
> ===================================================================
> --- linux-2.6.orig/fs/block_dev.c 2010-04-20 11:28:32.000000000 +1000
> +++ linux-2.6/fs/block_dev.c 2010-04-20 11:36:46.000000000 +1000
> @@ -406,16 +406,23 @@ static loff_t block_llseek(struct file *
>
> int blkdev_fsync(struct file *filp, struct dentry *dentry, int datasync)
> {
> -	struct block_device *bdev = I_BDEV(filp->f_mapping->host);
> +	struct inode *bd_inode = filp->f_mapping->host;
> +	struct block_device *bdev = I_BDEV(bd_inode);
> 	int error;
>
> -	error = sync_blockdev(bdev);
> -	if (error)
> -		return error;
> -
> +	/*
> +	 * There is no need to serialise calls to blkdev_issue_flush with
> +	 * i_mutex and doing so causes performance issues with concurrent
> +	 * O_SYNC writers to a block device.
> +	 */
> +	mutex_unlock(&bd_inode->i_mutex);
> +
> 	error = blkdev_issue_flush(bdev, NULL);
> 	if (error == -EOPNOTSUPP)
> 		error = 0;
> +
> +	mutex_lock(&bd_inode->i_mutex);
> +
> 	return error;
> }
> EXPORT_SYMBOL(blkdev_fsync);
--
Jan Kara <[email protected]>
SUSE Labs, CR