2008-08-06 19:09:00

by Josef Bacik

Subject: [PATCH 1/2] add hrtimer_sleep_ns helper function

Hello,

This patch adds a helper function so that code within the kernel can sleep
for a specified number of nanoseconds. It is needed for the next patch in the
series. Thank you,
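
As a purely illustrative example (not part of the patch), a caller that wants
a short, uninterruptible delay could then do something like:

	/* illustrative only: sleep for ~500 microseconds, ignoring signals */
	hrtimer_sleep_ns(500 * NSEC_PER_USEC, 0);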

Signed-off-by: Josef Bacik <[email protected]>

Index: linux-2.6/include/linux/hrtimer.h
===================================================================
--- linux-2.6.orig/include/linux/hrtimer.h
+++ linux-2.6/include/linux/hrtimer.h
@@ -345,6 +345,7 @@ extern long hrtimer_nanosleep_restart(st

extern void hrtimer_init_sleeper(struct hrtimer_sleeper *sl,
struct task_struct *tsk);
+extern int hrtimer_sleep_ns(u64 ns, int interruptible);

/* Soft interrupt function to run the hrtimer queues: */
extern void hrtimer_run_queues(void);
Index: linux-2.6/kernel/hrtimer.c
===================================================================
--- linux-2.6.orig/kernel/hrtimer.c
+++ linux-2.6/kernel/hrtimer.c
@@ -1459,12 +1459,14 @@ void hrtimer_init_sleeper(struct hrtimer
#endif
}

-static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
+static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode,
+ int interruptible)
{
hrtimer_init_sleeper(t, current);

do {
- set_current_state(TASK_INTERRUPTIBLE);
+ set_current_state(interruptible ? TASK_INTERRUPTIBLE :
+ TASK_UNINTERRUPTIBLE);
hrtimer_start(&t->timer, t->timer.expires, mode);
if (!hrtimer_active(&t->timer))
t->task = NULL;
@@ -1482,6 +1484,22 @@ static int __sched do_nanosleep(struct h
return t->task == NULL;
}

+int hrtimer_sleep_ns(u64 ns, int interruptible)
+{
+ struct hrtimer_sleeper t;
+ int ret = 0;
+
+ hrtimer_init_on_stack(&t.timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+ t.timer.expires = ktime_add_ns(ktime_get(), ns);
+
+ ret = do_nanosleep(&t, HRTIMER_MODE_ABS, interruptible);
+
+ destroy_hrtimer_on_stack(&t.timer);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(hrtimer_sleep_ns);
+
static int update_rmtp(struct hrtimer *timer, struct timespec __user *rmtp)
{
struct timespec rmt;
@@ -1508,7 +1526,7 @@ long __sched hrtimer_nanosleep_restart(s
HRTIMER_MODE_ABS);
t.timer.expires.tv64 = restart->nanosleep.expires;

- if (do_nanosleep(&t, HRTIMER_MODE_ABS))
+ if (do_nanosleep(&t, HRTIMER_MODE_ABS, 1))
goto out;

rmtp = restart->nanosleep.rmtp;
@@ -1534,7 +1552,7 @@ long hrtimer_nanosleep(struct timespec *

hrtimer_init_on_stack(&t.timer, clockid, mode);
t.timer.expires = timespec_to_ktime(*rqtp);
- if (do_nanosleep(&t, mode))
+ if (do_nanosleep(&t, mode, 1))
goto out;

/* Absolute timers do not update the rmtp value and restart: */


2008-08-06 19:17:27

by Josef Bacik

Subject: [PATCH 2/2] improve ext3 fsync batching

Hello,

Fsync batching in ext3 is somewhat flawed when it comes to disks that are very
fast. Now we do an unconditional sleep for 1 second, which is great on slow
disks like SATA and such, but on fast disks this means just sitting around and
waiting for nothing. This patch measures the time it takes to commit a
transaction to the disk, and sleeps based on the speed of the underlying disk.
Using the following fs_mark command to test the speeds

./fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t 2

I got the following results (with write caching turned off):

type   threads   with patch   without patch
sata      2          26.4          27.8
sata      4          44.6          44.4
sata      8          70.4          72.8
sata     16          75.2          89.6
sata     32          92.7          96.0
ram       1        2399.1        2398.8
ram       2         257.3        3603.0
ram       4         395.6        4827.9
ram       8         659.0        4721.1
ram      16        1326.4        4373.3
ram      32        1964.2        3816.3

I used a ramdisk to emulate a "fast" disk since I don't happen to have a
clariion sitting around. I didn't test single thread in the sata case as it
should be relatively the same between the two. Thanks,

Signed-off-by: Josef Bacik <[email protected]>

Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -288,6 +288,8 @@ void journal_commit_transaction(journal_
int flags;
int err;
unsigned long blocknr;
+ ktime_t start_time;
+ u64 commit_time;
char *tagp = NULL;
journal_header_t *header;
journal_block_tag_t *tag = NULL;
@@ -400,6 +402,7 @@ void journal_commit_transaction(journal_
commit_transaction->t_state = T_FLUSH;
journal->j_committing_transaction = commit_transaction;
journal->j_running_transaction = NULL;
+ start_time = ktime_get();
commit_transaction->t_log_start = journal->j_head;
wake_up(&journal->j_wait_transaction_locked);
spin_unlock(&journal->j_state_lock);
@@ -873,6 +876,18 @@ restart_loop:
J_ASSERT(commit_transaction == journal->j_committing_transaction);
journal->j_commit_sequence = commit_transaction->t_tid;
journal->j_committing_transaction = NULL;
+ commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+ /*
+ * weight the commit time higher than the average time so we don't
+ * react too strongly to vast changes in commit time
+ */
+ if (likely(journal->j_average_commit_time))
+ journal->j_average_commit_time = (commit_time*3 +
+ journal->j_average_commit_time) / 4;
+ else
+ journal->j_average_commit_time = commit_time;
+
spin_unlock(&journal->j_state_lock);

if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -25,6 +25,7 @@
#include <linux/timer.h>
#include <linux/mm.h>
#include <linux/highmem.h>
+#include <linux/hrtimer.h>

static void __journal_temp_unlink_buffer(struct journal_head *jh);

@@ -49,6 +50,7 @@ get_transaction(journal_t *journal, tran
{
transaction->t_journal = journal;
transaction->t_state = T_RUNNING;
+ transaction->t_start_time = ktime_get();
transaction->t_tid = journal->j_transaction_sequence++;
transaction->t_expires = jiffies + journal->j_commit_interval;
spin_lock_init(&transaction->t_handle_lock);
@@ -1361,7 +1363,7 @@ int journal_stop(handle_t *handle)
{
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
- int old_handle_count, err;
+ int err;
pid_t pid;

J_ASSERT(journal_current_handle() == handle);
@@ -1397,11 +1399,18 @@ int journal_stop(handle_t *handle)
*/
pid = current->pid;
if (handle->h_sync && journal->j_last_sync_writer != pid) {
+ u64 commit_time, trans_time;
+
journal->j_last_sync_writer = pid;
- do {
- old_handle_count = transaction->t_handle_count;
- schedule_timeout_uninterruptible(1);
- } while (old_handle_count != transaction->t_handle_count);
+
+ spin_lock(&journal->j_state_lock);
+ commit_time = journal->j_average_commit_time;
+ spin_unlock(&journal->j_state_lock);
+
+ trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+ transaction->t_start_time));
+ if (trans_time < commit_time)
+ hrtimer_sleep_ns(commit_time, 1);
}

current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
unsigned long t_expires;

/*
+ * When this transaction started, in nanoseconds [no locking]
+ */
+ ktime_t t_start_time;
+
+ /*
* How many handles used this transaction? [t_handle_lock]
*/
int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s

pid_t j_last_sync_writer;

+ u64 j_average_commit_time;
+
/*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here

2008-08-06 19:23:54

by Josef Bacik

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Wed, Aug 06, 2008 at 03:15:36PM -0400, Josef Bacik wrote:
> Hello,
>
> Fsync batching in ext3 is somewhat flawed when it comes to disks that are very
> fast. Now we do an unconditional sleep for 1 second, which is great on slow
> disks like SATA and such, but on fast disks this means just sitting around and
> waiting for nothing. This patch measures the time it takes to commit a
> transaction to the disk, and sleeps based on the speed of the underlying disk.
> Using the following fs_mark command to test the speeds
>
> ./fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t 2
>
> I got the following results (with write cacheing turned off)
>
> type threads with patch without patch
> sata 2 26.4 27.8
> sata 4 44.6 44.4
> sata 8 70.4 72.8
> sata 16 75.2 89.6
> sata 32 92.7 96.0
> ram 1 2399.1 2398.8
> ram 2 257.3 3603.0
> ram 4 395.6 4827.9
> ram 8 659.0 4721.1
> ram 16 1326.4 4373.3
> ram 32 1964.2 3816.3
>

Err, sorry about that, the "with patch" and "without patch" columns should be
reversed. Thanks,

Josef

2008-08-19 04:32:11

by Andrew Morton

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Wed, 6 Aug 2008 15:15:36 -0400 Josef Bacik <[email protected]> wrote:

> Hello,
>
> Fsync batching in ext3 is somewhat flawed when it comes to disks that are very
> fast. Now we do an unconditional sleep for 1 second,

It sleeps for one jiffy, not one second.

> which is great on slow
> disks like SATA and such, but on fast disks this means just sitting around and
> waiting for nothing. This patch measures the time it takes to commit a
> transaction to the disk, and sleeps based on the speed of the underlying disk.
> Using the following fs_mark command to test the speeds
>
> ./fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t 2
>
> I got the following results (with write cacheing turned off)
>
> type threads with patch without patch
> sata 2 26.4 27.8
> sata 4 44.6 44.4
> sata 8 70.4 72.8
> sata 16 75.2 89.6
> sata 32 92.7 96.0
> ram 1 2399.1 2398.8
> ram 2 257.3 3603.0
> ram 4 395.6 4827.9
> ram 8 659.0 4721.1
> ram 16 1326.4 4373.3
> ram 32 1964.2 3816.3
>
> I used a ramdisk to emulate a "fast" disk since I don't happen to have a
> clariion sitting around. I didn't test single thread in the sata case as it
> should be relatively the same between the two. Thanks,

This is all a bit mysterious. That delay doesn't have much at all to
do with commit times. The code is looping around giving other
userspace processes an opportunity to get scheduled and to run an fsync
and to join the current transaction rather than having to start a new
one.

(that code was quite effective when I first added it, but in more
recent testing, which was some time ago, it doesn't appear to provide
any improvement. This needs to be understood)

Also, I'd expect that the average commit time is much longer than one
jiffy on most disks, and perhaps even on fast disks and maybe even on
ramdisk. So perhaps what's happened here is that you've increased the
sleep period and more tasks are joining particular transactions.

Or you've shortened the sleep time (which wasn't really doing anything
useful) and this causes tasks to spend less time asleep.

Or something else.


2008-08-19 05:44:27

by Andreas Dilger

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Aug 18, 2008 21:31 -0700, Andrew Morton wrote:
> On Wed, 6 Aug 2008 15:15:36 -0400 Josef Bacik <[email protected]> wrote:
> > Using the following fs_mark command to test the speeds
> >
> > ./fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t 2
> >
> > I got the following results (with write cacheing turned off)
> >
> > type threads with patch without patch
> > sata 2 26.4 27.8
> > sata 4 44.6 44.4
> > sata 8 70.4 72.8
> > sata 16 75.2 89.6
> > sata 32 92.7 96.0
> > ram 1 2399.1 2398.8
> > ram 2 257.3 3603.0
> > ram 4 395.6 4827.9
> > ram 8 659.0 4721.1
> > ram 16 1326.4 4373.3
> > ram 32 1964.2 3816.3
> >
> > I used a ramdisk to emulate a "fast" disk since I don't happen to have a
> > clariion sitting around. I didn't test single thread in the sata case as it
> > should be relatively the same between the two. Thanks,
>
> This is all a bit mysterious. That delay doesn't have much at all to
> do with commit times. The code is looping around giving other
> userspace processes an opportunity to get scheduled and to run an fsync
> and to join the current transaction rather than having to start a new
> one.
>
> (that code was quite effective when I first added it, but in more
> recent testing, which was some time ago, it doesn't appear to provide
> any improvement. This needs to be understood)

I don't think it is mysterious at all. With a HZ=100 system 1 jiffie
is 10ms, which was comparable to the seek time of a disk, so sleeping
for 1 jiffie to avoid doing 2 transactions was a win. With a flash
device (simulated by RAM here) seek time is 1ms so waiting 10ms
isn't going to be useful if there are only 2 threads and both have
already joined the transaction.

> Also, I'd expect that the average commit time is much longer that one
> jiffy on most disks, and perhaps even on fast disks and maybe even on
> ramdisk. So perhaps what's happened here is that you've increased the
> sleep period and more tasks are joining particular transactions.
>
> Or you've shortened the sleep time (which wasn't really doing anything
> useful) and this causes tasks to spend less time asleep.

I think both are true. By making the sleep time dynamic it removes
the "useless" sleep time, but can also increase the sleep time if
there are many threads and the commit cost is better amortized over
more operations.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-08-19 11:01:11

by Ric Wheeler

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

Andreas Dilger wrote:
> On Aug 18, 2008 21:31 -0700, Andrew Morton wrote:
>
>> On Wed, 6 Aug 2008 15:15:36 -0400 Josef Bacik <[email protected]> wrote:
>>
>>> Using the following fs_mark command to test the speeds
>>>
>>> ./fs_mark -d /mnt/ext3-test -s 4096 -n 2000 -D 64 -t 2
>>>
>>> I got the following results (with write cacheing turned off)
>>>
>>> type threads with patch without patch
>>> sata 2 26.4 27.8
>>> sata 4 44.6 44.4
>>> sata 8 70.4 72.8
>>> sata 16 75.2 89.6
>>> sata 32 92.7 96.0
>>> ram 1 2399.1 2398.8
>>> ram 2 257.3 3603.0
>>> ram 4 395.6 4827.9
>>> ram 8 659.0 4721.1
>>> ram 16 1326.4 4373.3
>>> ram 32 1964.2 3816.3
>>>
>>> I used a ramdisk to emulate a "fast" disk since I don't happen to have a
>>> clariion sitting around. I didn't test single thread in the sata case as it
>>> should be relatively the same between the two. Thanks,
>>>
>> This is all a bit mysterious. That delay doesn't have much at all to
>> do with commit times. The code is looping around giving other
>> userspace processes an opportunity to get scheduled and to run an fsync
>> and to join the current transaction rather than having to start a new
>> one.
>>
>> (that code was quite effective when I first added it, but in more
>> recent testing, which was some time ago, it doesn't appear to provide
>> any improvement. This needs to be understood)
>>
>
> I don't think it is mysterious at all. With a HZ=100 system 1 jiffie
> is 10ms, which was comparable to the seek time of a disk, so sleeping
> for 1 jiffie to avoid doing 2 transactions was a win. With a flash
> device (simulated by RAM here) seek time is 1ms so waiting 10ms
> isn't going to be useful if there are only 2 threads and both have
> already joined the transaction.
>

The code was originally tuned to S-ATA & ATA disk response times which
are closer to 12-15ms. Sleeping for 10ms (100HZ kernel) or 4ms (250HZ)
did not overly penalize the low thread count case and worked well for
higher thread counts (and ext3 special cases the single threaded writer
so no sleep happens).

This is still a really, really good thing to do, but we need to sleep
less when the device characteristics are radically different. For
example, a fibre channel attached disk array drops that 12-15 ms down to
1.5 ms (not to mention RAM disks!).
>
>> Also, I'd expect that the average commit time is much longer that one
>> jiffy on most disks, and perhaps even on fast disks and maybe even on
>> ramdisk. So perhaps what's happened here is that you've increased the
>> sleep period and more tasks are joining particular transactions.
>>
>> Or you've shortened the sleep time (which wasn't really doing anything
>> useful) and this causes tasks to spend less time asleep.
>>
>
> I think both are true. By making the sleep time dynamic it removes
> the "useless" sleep time, but can also increase the sleep time if
> there are many threads and the commit cost is better amortized over
> more operations.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>

It would be great to be able to use this batching technique for faster
devices, but we currently sleep 3-4 times longer waiting to batch for an
array than it takes to complete the transaction.

Thanks!

Ric



2008-08-19 17:56:38

by Andrew Morton

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:

> It would be great to be able to use this batching technique for faster
> devices, but we currently sleep 3-4 times longer waiting to batch for an
> array than it takes to complete the transaction.

Obviously, tuning that delay down to the minimum necessary is a good
thing. But doing it based on commit-time seems indirect at best. What
happens on a slower disk when commit times are in the tens of
milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
when commit times go up to seconds?

Perhaps a better scheme would be to tune it based on how many other
processes are joining that transaction. If it's "zero" then decrease
the timeout. But one would need to work out how to increase it, which
perhaps could be done by detecting the case where process A runs an
fsync when a commit is currently in progress, and that commit was
caused by process B's fsync.

But before doing all that I would recommend/ask that the following be
investigated:

- How effective is the present code?

- What happens when it is simply removed?

- Add instrumentation (a counter and a printk) to work out how
many other tasks are joining this task's transaction.

- If the answer is "zero" or "small", work out why.

- See if we can increase its effectiveness.

Because it could be that the code broke. There might be issues with
higher-level locks which are preventing the batching. For example, if
all the files which the test app is syncing are in the same directory,
perhaps all the tasks are piling up on that directory's i_mutex?

2008-08-19 18:08:46

by Ric Wheeler

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

Andrew Morton wrote:
> On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
>
>
>> It would be great to be able to use this batching technique for faster
>> devices, but we currently sleep 3-4 times longer waiting to batch for an
>> array than it takes to complete the transaction.
>>
>
> Obviously, tuning that delay down to the minimum necessary is a good
> thing. But doing it based on commit-time seems indirect at best. What
> happens on a slower disk when commit times are in the tens of
> milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
> when commit times go up to seconds?
>

Transactions on that busier drive would take longer, so we would sleep
longer, which would allow us to batch up more into one transaction. That
should be a good result, and it should reset to a shorter sleep time when the
drive gets less busy (and transactions get shorter).

> Perhaps a better scheme would be to tune it based on how many other
> processes are joining that transaction. If it's "zero" then decrease
> the timeout. But one would need to work out how to increase it, which
> perhaps could be done by detecting the case where process A runs an
> fsync when a commit is currently in progress, and that commit was
> caused by process B's fsync.
>
This is really, really a property of the device's latency at any given
point in time. If there are no other processes running, we could do an
optimization and not wait.

> But before doing all that I would recommend/ask that the following be
> investigated:
>
> - How effective is the present code?
>

It causes the most expensive storage (arrays) to run 3-4 times slower
than it should on a synchronous write workload (NFS server, mail
server?) with more than 1 thread. For example, against a small EMC
array, I saw write rates of 720 files/sec against ext3 with 1 thread,
and 225 (if I remember correctly) with 2 ;-)

> - What happens when it is simply removed?
>
If you remove the code, you will not see the throughput rise when you go
multithreaded on existing slow devices (S-ATA/ATA for example). Faster
devices will not see that 2 threaded drop.

> - Add instrumentation (a counter and a printk) to work out how
> many other tasks are joining this task's transaction.
>
> - If the answer is "zero" or "small", work out why.
>
> - See if we can increase its effectiveness.
>
> Because it could be that the code broke. There might be issues with
> higher-level locks which are preventing the batching. For example, if
> all the files which the test app is syncing are in the same directory,
> perhaps all the tasks are piling up on that directory's i_mutex?
>

I have to admit that I don't see the down side here - we have shown a
huge increase for arrays (embarrassingly huge increase for RAM disks)
and see no degradation for the S-ATA/ATA case.

The code is not broken (having been there and done the performance
tuning on the original code), it just did not account for the widely
varying average response times for different classes of storage ;-)

ric



2008-08-19 18:43:27

by Ric Wheeler

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

Andrew Morton wrote:
> On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
>
>
>> It would be great to be able to use this batching technique for faster
>> devices, but we currently sleep 3-4 times longer waiting to batch for an
>> array than it takes to complete the transaction.
>>
>
> Obviously, tuning that delay down to the minimum necessary is a good
> thing. But doing it based on commit-time seems indirect at best. What
> happens on a slower disk when commit times are in the tens of
> milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
> when commit times go up to seconds?
>
> Perhaps a better scheme would be to tune it based on how many other
> processes are joining that transaction. If it's "zero" then decrease
> the timeout. But one would need to work out how to increase it, which
> perhaps could be done by detecting the case where process A runs an
> fsync when a commit is currently in progress, and that commit was
> caused by process B's fsync.
>
> But before doing all that I would recommend/ask that the following be
> investigated:
>
> - How effective is the present code?
>
> - What happens when it is simply removed?
>
> - Add instrumentation (a counter and a printk) to work out how
> many other tasks are joining this task's transaction.
>
> - If the answer is "zero" or "small", work out why.
>
> - See if we can increase its effectiveness.
>
> Because it could be that the code broke. There might be issues with
> higher-level locks which are preventing the batching. For example, if
> all the files which the test app is syncing are in the same directory,
> perhaps all the tasks are piling up on that directory's i_mutex?
>

One other way to think about this is as a fairly normal queuing problem:

(1) arrival rate is the rate at which we see new tasks coming into
the code
(2) service time is basically the time spent committing the
transaction to storage

and we have the assumption that some number of tasks can join a
transaction more or less for "free."

What the existing code assumes is that all devices have an equal service
time. That worked well as long as we only looked at devices that were
roughly equal (10-20 ms latencies) or used a higher HZ for the kernel
(1000HZ and you don't see this as much as with 100HZ).
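
(To put rough numbers on it, using figures from earlier in the thread: with a
250HZ kernel the fixed wait is 4ms while a fibre channel array commit is
around 1.5ms, so the wait is roughly 2-3 times the service time; with a 100HZ
kernel the 10ms wait is closer to 7 times.)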

The two key issues that Josef's code tried to address are that first
assumption (that all devices have a similar service time) and the tie
between how long we wait and HZ. It would seem to be generically
useful to be able to sleep for less than 1 jiffie, not just for file
systems, but maybe also in some other contexts?

ric


2008-08-19 19:15:10

by Matthew Wilcox

Subject: Re: [PATCH 1/2] add hrtimer_sleep_ns helper function

On Wed, Aug 06, 2008 at 03:08:19PM -0400, Josef Bacik wrote:
> -static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> +static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode,
> + int interruptible)
> {
> hrtimer_init_sleeper(t, current);
>
> do {
> - set_current_state(TASK_INTERRUPTIBLE);
> + set_current_state(interruptible ? TASK_INTERRUPTIBLE :
> + TASK_UNINTERRUPTIBLE);

I don't see any users (in this patch or the next) of people wanting
uninterruptible nanosleeps. We shouldn't be introducing new
TASK_UNINTERRUPTIBLE users, but instead using TASK_KILLABLE if the user
really can't cope with signals in a sensible manner.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-08-19 19:18:01

by Josef Bacik

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Tue, Aug 19, 2008 at 10:56:38AM -0700, Andrew Morton wrote:
> On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
>
> > It would be great to be able to use this batching technique for faster
> > devices, but we currently sleep 3-4 times longer waiting to batch for an
> > array than it takes to complete the transaction.
>
> Obviously, tuning that delay down to the minimum necessary is a good
> thing. But doing it based on commit-time seems indirect at best. What
> happens on a slower disk when commit times are in the tens of
> milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
> when commit times go up to seconds?
>
> Perhaps a better scheme would be to tune it based on how many other
> processes are joining that transaction. If it's "zero" then decrease
> the timeout. But one would need to work out how to increase it, which
> perhaps could be done by detecting the case where process A runs an
> fsync when a commit is currently in progress, and that commit was
> caused by process B's fsync.
>
> But before doing all that I would recommend/ask that the following be
> investigated:
>
> - How effective is the present code?
>
> - What happens when it is simply removed?
>
> - Add instrumentation (a counter and a printk) to work out how
> many other tasks are joining this task's transaction.
>
> - If the answer is "zero" or "small", work out why.
>
> - See if we can increase its effectiveness.
>
> Because it could be that the code broke. There might be issues with
> higher-level locks which are preventing the batching. For example, if
> all the files which the test app is syncing are in the same directory,
> perhaps all the tasks are piling up on that directory's i_mutex?

There is no problem with the current code on normal desktop boxes with sata
drives. This optimization is fantastic and greatly increases throughput. The
problem is with low latency drives, where sleeping for 1 jiffie (depending on
HZ) is entirely too long relative to the latency of the disk. I had thought
about tracking the number of syncing threads per transaction, but I'm worried
about the normal case of a running box, ie where the only thing running fsync
is syslog. In that case the "average fsyncing threads" count would be 1, so
when syslog ran fsync we'd bypass the sleep and just commit, which would
result in utter crap performance from the current code. Measuring the time it
takes to commit a transaction was a nice uniform way to figure out how long we
may need to wait in order for a useful amount of stuff to be added to the
transaction, and it is self tuning to the underlying disk. The goal was
to maintain the awesome performance we currently get with high latency devices
while at the same time fixing the crappy performance we see with low latency
disks. I hope that helps. Thanks,

Josef

2008-08-19 19:22:16

by Josef Bacik

Subject: Re: [PATCH 1/2] add hrtimer_sleep_ns helper function

On Tue, Aug 19, 2008 at 01:15:08PM -0600, Matthew Wilcox wrote:
> On Wed, Aug 06, 2008 at 03:08:19PM -0400, Josef Bacik wrote:
> > -static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> > +static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode,
> > + int interruptible)
> > {
> > hrtimer_init_sleeper(t, current);
> >
> > do {
> > - set_current_state(TASK_INTERRUPTIBLE);
> > + set_current_state(interruptible ? TASK_INTERRUPTIBLE :
> > + TASK_UNINTERRUPTIBLE);
>
> I don't see any users (in this patch or the next) of people wanting
> uninterruptible nanosleeps. We shouldn't be introducing new
> TASK_UNINTERRUPTIBLE users, but instead using TASK_KILLABLE if the user
> really can't cope with signals in a sensible manner.
>

Hmm, doh, sorry about that; the 2/2 patch of this series should be passing 0,
not 1, since we need to be uninterruptible. I figured this sort of thing would
be used by fs's/device drivers where TASK_UNINTERRUPTIBLE is desired. If that
is not appropriate, let me know and I can use TASK_KILLABLE or whatever else
the preference is. Thanks,

Josef

2008-08-19 19:37:13

by Matthew Wilcox

Subject: Re: [PATCH 1/2] add hrtimer_sleep_ns helper function

On Tue, Aug 19, 2008 at 03:22:16PM -0400, Josef Bacik wrote:
> On Tue, Aug 19, 2008 at 01:15:08PM -0600, Matthew Wilcox wrote:
> > I don't see any users (in this patch or the next) of people wanting
> > uninterruptible nanosleeps. We shouldn't be introducing new
> > TASK_UNINTERRUPTIBLE users, but instead using TASK_KILLABLE if the user
> > really can't cope with signals in a sensible manner.
>
> Hmm doh, sorry about that the 2/2 patch of this series should be passing 0 not 1
> since we need to be uninterruptible.

I think that's taught us that '0' and '1' are insufficiently
descriptive, and we should either be passing in a state (ie
do_nanosleep(x, y, TASK_FOO)) or have separate do_nanosleep_killable()
and do_nanosleep_interruptible().
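
Purely as a sketch of the first option (and assuming the new
hrtimer_sleep_ns() helper would take the same parameter), callers could then
say what they actually mean:

	/* sketch only, not a tested patch: take a task state, not a 0/1 flag */
	extern int hrtimer_sleep_ns(u64 ns, long state);

	/* the jbd caller from patch 2/2 would then read: */
	hrtimer_sleep_ns(commit_time, TASK_KILLABLE);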

> I figured this sort of thing would be used
> by fs's/device drivers where TASK_UNINTERRUPTIBLE is desired. If that is not
> appropriate let me know and I can use TASK_KILLABLE or whatever else the
> preference is. Thanks,

We should be trying to accommodate the user's wishes wherever possible.
If they say kill -9, they really mean it, and we should stop waiting.
Now, I don't think we should abort the journal_stop(). That's probably
going too far in this instance. But we should stop waiting for other
tasks to join in, and finish up as quickly as possible.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-08-19 19:39:47

by Josef Bacik

Subject: Re: [PATCH 1/2] add hrtimer_sleep_ns helper function

On Tue, Aug 19, 2008 at 01:36:57PM -0600, Matthew Wilcox wrote:
> On Tue, Aug 19, 2008 at 03:22:16PM -0400, Josef Bacik wrote:
> > On Tue, Aug 19, 2008 at 01:15:08PM -0600, Matthew Wilcox wrote:
> > > I don't see any users (in this patch or the next) of people wanting
> > > uninterruptible nanosleeps. We shouldn't be introducing new
> > > TASK_UNINTERRUPTIBLE users, but instead using TASK_KILLABLE if the user
> > > really can't cope with signals in a sensible manner.
> >
> > Hmm doh, sorry about that the 2/2 patch of this series should be passing 0 not 1
> > since we need to be uninterruptible.
>
> I think that's taught us that '0' and '1' are insufficiently
> descriptive, and we should either be passing in a state (ie
> do_nanosleep(x, y, TASK_FOO), or have separate do_nanosleep_killable()
> and do_nanosleep_interruptible()).
>
> > I figured this sort of thing would be used
> > by fs's/device drivers where TASK_UNINTERRUPTIBLE is desired. If that is not
> > appropriate let me know and I can use TASK_KILLABLE or whatever else the
> > preference is. Thanks,
>
> We should be trying to accommodate the user's wishes wherever possible.
> If they say kill -9, they really mean it, and we should stop waiting.
> Now, I don't think we should abort the journal_stop(). That's probably
> going too far in this instance. But we should stop waiting for other
> tasks to join in, and finish up as quickly as possible.

Fair enough, I will make the changes in the next round, thanks much.

Josef

2008-08-19 20:29:14

by Andrew Morton

Subject: Re: [PATCH 2/2] improve ext3 fsync batching


Could I just point out that this is a very very painful way of writing
a changelog? All these new revelations are important and relevant and
should have been there!

On Tue, 19 Aug 2008 14:08:31 -0400
Ric Wheeler <[email protected]> wrote:

> Andrew Morton wrote:
> > On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
> >
> >
> >> It would be great to be able to use this batching technique for faster
> >> devices, but we currently sleep 3-4 times longer waiting to batch for an
> >> array than it takes to complete the transaction.
> >>
> >
> > Obviously, tuning that delay down to the minimum necessary is a good
> > thing. But doing it based on commit-time seems indirect at best. What
> > happens on a slower disk when commit times are in the tens of
> > milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
> > when commit times go up to seconds?
> >
>
> Transactions on that busier drive would take longer, we would sleep
> longer which would allow us to batch up more into one transaction. That
> should be a good result and it should reset when the drive gets less
> busy (and transactions shorter) to a shorter sleep time.

Has this been empirically confirmed?

Commits can take tens of seconds, and causing an fsync() to block
itself for such periods could have quite bad effects. How do we (ie:
I) know that there are no such scenarios with this change?

> > Perhaps a better scheme would be to tune it based on how many other
> > processes are joining that transaction. If it's "zero" then decrease
> > the timeout. But one would need to work out how to increase it, which
> > perhaps could be done by detecting the case where process A runs an
> > fsync when a commit is currently in progress, and that commit was
> > caused by process B's fsync.
> >
> This is really, really a property of the device's latency at any given
> point in time. If there are no other processes running, we could do an
> optimization and not wait.

well yes. This represents yet another attempt to predict future
application behaviour. The way in which we _usually_ do that is by
monitoring past application behaviour.

Only this patch didn't do that (directly) and I'm wondering why not.

> > But before doing all that I would recommend/ask that the following be
> > investigated:
> >
> > - How effective is the present code?
> >
>
> It causes the most expensive storage (arrays) to run 3-4 times slower
> than they should on a synchronous write workload (NFS server, mail
> server?) with more than 1 thread. For example, against a small EMC
> array, I saw single threaded write rates of 720 files/sec against ext3
> with 1 thread, 225 (if I remember correctly) with 2 ;-)

Current code has:

/*
* Implement synchronous transaction batching. If the handle
* was synchronous, don't force a commit immediately. Let's
* yield and let another thread piggyback onto this transaction.
* Keep doing that while new threads continue to arrive.
* It doesn't cost much - we're about to run a commit and sleep
* on IO anyway. Speeds up many-threaded, many-dir operations
* by 30x or more...
*
* But don't do this if this process was the most recent one to
* perform a synchronous write. We do this to detect the case where a
* single process is doing a stream of sync writes. No point in waiting
* for joiners in that case.
*/

has the 30x been reproduced? If not, what broke? If so, what effect
did the proposed change have upon it?

> > - What happens when it is simply removed?
> >
> If you remove the code, you will not see the throughput rise when you go
> multithreaded on existing slow devices (S-ATA/ATA for example). Faster
> devices will not see that 2 threaded drop.

See above - has this been tested and confirmed?

> > - Add instrumentation (a counter and a printk) to work out how
> > many other tasks are joining this task's transaction.
> >
> > - If the answer is "zero" or "small", work out why.
> >
> > - See if we can increase its effectiveness.
> >
> > Because it could be that the code broke. There might be issues with
> > higher-level locks which are preventing the batching. For example, if
> > all the files which the test app is syncing are in the same directory,
> > perhaps all the tasks are piling up on that directory's i_mutex?
> >
>
> I have to admit that I don't see the down side here - we have shown a
> huge increase for arrays (embarrassingly huge increase for RAM disks)
> and see no degradation for the S-ATA/ATA case.
>
> The code is not broken (having been there and done the performance
> tuning on the original code), it just did not account for the widely
> varying average response times for different classes of storage ;-)

Well, as I said - last time I checked, it did seem to be broken. By
what means did you confirm that it is still effective, and what were
the results?



2008-08-19 20:35:50

by Andrew Morton

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Tue, 19 Aug 2008 14:43:27 -0400
Ric Wheeler <[email protected]> wrote:

> Andrew Morton wrote:
> > On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
> >
> >
> >> It would be great to be able to use this batching technique for faster
> >> devices, but we currently sleep 3-4 times longer waiting to batch for an
> >> array than it takes to complete the transaction.
> >>
> >
> > Obviously, tuning that delay down to the minimum necessary is a good
> > thing. But doing it based on commit-time seems indirect at best. What
> > happens on a slower disk when commit times are in the tens of
> > milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
> > when commit times go up to seconds?
> >
> > Perhaps a better scheme would be to tune it based on how many other
> > processes are joining that transaction. If it's "zero" then decrease
> > the timeout. But one would need to work out how to increase it, which
> > perhaps could be done by detecting the case where process A runs an
> > fsync when a commit is currently in progress, and that commit was
> > caused by process B's fsync.
> >
> > But before doing all that I would recommend/ask that the following be
> > investigated:
> >
> > - How effective is the present code?
> >
> > - What happens when it is simply removed?
> >
> > - Add instrumentation (a counter and a printk) to work out how
> > many other tasks are joining this task's transaction.
> >
> > - If the answer is "zero" or "small", work out why.
> >
> > - See if we can increase its effectiveness.
> >
> > Because it could be that the code broke. There might be issues with
> > higher-level locks which are preventing the batching. For example, if
> > all the files which the test app is syncing are in the same directory,
> > perhaps all the tasks are piling up on that directory's i_mutex?
> >
>
> One other way to think about this is as a fairly normal queuing problem:
>
> (1) arrival rate is the rate at which we see new tasks coming into
> the code
> (2) service time is basically the time spent committing the
> transaction to storage
>
> and we have the assumption that some number of tasks can join a
> transaction more or less for "free."
>
> What the existing code assumes is that all devices have an equal service
> time. That worked well as long as we only looked at devices that were
> roughly equal (10-20 ms latencies) or used a higher HZ for the kernel
> (1000HZ and you don't see this as much as with 100HZ).
>
> The two key issues that Josef's code tried to address are that first
> assumption that all devices have a similar service time and the tie
> between how long we wait and the HZ.

yes, I see the (indirect) logic. But I wonder whether there's a
direct way of measuring the thing we really want to measure.

Also, I'd be heaps less scared of the change if it did

commit_time = min(commit_time, one jiffy)
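
A minimal illustration of that cap (sketch only, assuming the sleep stays in
journal_stop() and taking NSEC_PER_SEC / HZ as one jiffy's worth of
nanoseconds):

	/* sketch: never sleep longer than the old one-jiffy behaviour */
	commit_time = min_t(u64, commit_time, NSEC_PER_SEC / HZ);
	if (trans_time < commit_time)
		hrtimer_sleep_ns(commit_time, 1);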


> It would seem to be generically
> useful to be able to sleep for less than 1 jiffie, not just for file
> systems, but maybe also in some other contexts?

Sure.

2008-08-19 20:55:47

by Ric Wheeler

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

Andrew Morton wrote:
> Could I just point out that this is a very very painful way of writing
> a changelog? All these new revelations are important and relevant and
> should have been there!
>

Agreed, some of this was bounced around only in the email thread.
> On Tue, 19 Aug 2008 14:08:31 -0400
> Ric Wheeler <[email protected]> wrote:
>
>
>> Andrew Morton wrote:
>>
>>> On Tue, 19 Aug 2008 07:01:11 -0400 Ric Wheeler <[email protected]> wrote:
>>>
>>>
>>>
>>>> It would be great to be able to use this batching technique for faster
>>>> devices, but we currently sleep 3-4 times longer waiting to batch for an
>>>> array than it takes to complete the transaction.
>>>>
>>>>
>>> Obviously, tuning that delay down to the minimum necessary is a good
>>> thing. But doing it based on commit-time seems indirect at best. What
>>> happens on a slower disk when commit times are in the tens of
>>> milliseconds? When someone runs a concurrent `dd if=/dev/zero of=foo'
>>> when commit times go up to seconds?
>>>
>>>
>> Transactions on that busier drive would take longer, we would sleep
>> longer which would allow us to batch up more into one transaction. That
>> should be a good result and it should reset when the drive gets less
>> busy (and transactions shorter) to a shorter sleep time.
>>
>
> Has this been empirically confirmed?
>

No, this was just me thinking about how a shared disk can become
sluggish. It is a good experiment to try (run the test on one host,
inject a heavy load from a second, watch the behaviour).

> Commits can takes tens of seconds and causing an fsync() to block
> itself for such periods could have quite bad effects. How do we (ie:
> I) know that there are no such scenarios with this change?
>

A very good point - I like your suggestion to do the minimum of avg
sleep time vs 1 jiffie in the follow up message.

>
>>> Perhaps a better scheme would be to tune it based on how many other
>>> processes are joining that transaction. If it's "zero" then decrease
>>> the timeout. But one would need to work out how to increase it, which
>>> perhaps could be done by detecting the case where process A runs an
>>> fsync when a commit is currently in progress, and that commit was
>>> caused by process B's fsync.
>>>
>>>
>> This is really, really a property of the device's latency at any given
>> point in time. If there are no other processes running, we could do an
>> optimization and not wait.
>>
>
> well yes. This represents yet another attempt to predict future
> application behaviour. The way in which we _usually_ do that is by
> monitoring past application behaviour.
>
This whole area is very similar to the IO elevator area where some of
the same device characteristics are measured.


> Only this patch didn't do that (directly) and I'm wondering why not.
>

The average transaction commit time is a direct measurement of the past
behaviour, right?

>
>>> But before doing all that I would recommend/ask that the following be
>>> investigated:
>>>
>>> - How effective is the present code?
>>>
>>>
>> It causes the most expensive storage (arrays) to run 3-4 times slower
>> than they should on a synchronous write workload (NFS server, mail
>> server?) with more than 1 thread. For example, against a small EMC
>> array, I saw single threaded write rates of 720 files/sec against ext3
>> with 1 thread, 225 (if I remember correctly) with 2 ;-)
>>
>
> Current code has:
>
> /*
> * Implement synchronous transaction batching. If the handle
> * was synchronous, don't force a commit immediately. Let's
> * yield and let another thread piggyback onto this transaction.
> * Keep doing that while new threads continue to arrive.
> * It doesn't cost much - we're about to run a commit and sleep
> * on IO anyway. Speeds up many-threaded, many-dir operations
> * by 30x or more...
> *
> * But don't do this if this process was the most recent one to
> * perform a synchronous write. We do this to detect the case where a
> * single process is doing a stream of sync writes. No point in waiting
> * for joiners in that case.
> */
>
> has the 30x been reproduced? If not, what broke? If so, what effect
> did the proposed change have upon it?
>

The huge gain was only in the case of a RAM disk test which I assume was
not tested against the original patch early on. Against an array (with a
250HZ kernel), we saw a 2.5x speedup with the new code.

>
>>> - What happens when it is simply removed?
>>>
>>>
>> If you remove the code, you will not see the throughput rise when you go
>> multithreaded on existing slow devices (S-ATA/ATA for example). Faster
>> devices will not see that 2 threaded drop.
>>
>
> See above - has this been tested and confirmed?
>

Yes - we (back at EMC) did remove the logic and the fast devices will
write at least at their starting rate (700+ files/sec).

>
>>> - Add instrumentation (a counter and a printk) to work out how
>>> many other tasks are joining this task's transaction.
>>>
>>> - If the answer is "zero" or "small", work out why.
>>>
>>> - See if we can increase its effectiveness.
>>>
>>> Because it could be that the code broke. There might be issues with
>>> higher-level locks which are preventing the batching. For example, if
>>> all the files which the test app is syncing are in the same directory,
>>> perhaps all the tasks are piling up on that directory's i_mutex?
>>>
>>>
>> I have to admit that I don't see the down side here - we have shown a
>> huge increase for arrays (embarrassingly huge increase for RAM disks)
>> and see no degradation for the S-ATA/ATA case.
>>
>> The code is not broken (having been there and done the performance
>> tuning on the original code), it just did not account for the widely
>> varying average response times for different classes of storage ;-)
>>
>
> Well, as I said - last time I checked, it did seem to be broken. By
> what means did you confirm that it is still effective, and what were
> the results?
>
>

I think Josef posted those results for S-ATA earlier in the thread and
they were still working. We can repost/rerun to give more detail...

ric




2008-08-19 21:19:14

by Andrew Morton

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

On Tue, 19 Aug 2008 16:55:47 -0400
Ric Wheeler <[email protected]> wrote:

> >
> >>> Perhaps a better scheme would be to tune it based on how many other
> >>> processes are joining that transaction. If it's "zero" then decrease
> >>> the timeout. But one would need to work out how to increase it, which
> >>> perhaps could be done by detecting the case where process A runs an
> >>> fsync when a commit is currently in progress, and that commit was
> >>> caused by process B's fsync.
> >>>
> >>>
> >> This is really, really a property of the device's latency at any given
> >> point in time. If there are no other processes running, we could do an
> >> optimization and not wait.
> >>
> >
> > well yes. This represents yet another attempt to predict future
> > application behaviour. The way in which we _usually_ do that is by
> > monitoring past application behaviour.
> >
> This whole area is very similar to the IO elevator area where some of
> the same device characteristics are measured.
>
>
> > Only this patch didn't do that (directly) and I'm wondering why not.
> >
>
> The average transaction commit time is a direct measurement of the past
> behaviour, right?

Of commit time, yes. Somewhat - it has obvious failures during
transients.

But there's an assumption here that commit time is usefully correlated
with "mean time between fsync()s", which might introduce further
inaccuracies.

> >
> >>> But before doing all that I would recommend/ask that the following be
> >>> investigated:
> >>>
> >>> - How effective is the present code?
> >>>
> >>>
> >> It causes the most expensive storage (arrays) to run 3-4 times slower
> >> than they should on a synchronous write workload (NFS server, mail
> >> server?) with more than 1 thread. For example, against a small EMC
> >> array, I saw single threaded write rates of 720 files/sec against ext3
> >> with 1 thread, 225 (if I remember correctly) with 2 ;-)
> >>
> >
> > Current code has:
> >
> > /*
> > * Implement synchronous transaction batching. If the handle
> > * was synchronous, don't force a commit immediately. Let's
> > * yield and let another thread piggyback onto this transaction.
> > * Keep doing that while new threads continue to arrive.
> > * It doesn't cost much - we're about to run a commit and sleep
> > * on IO anyway. Speeds up many-threaded, many-dir operations
> > * by 30x or more...
> > *
> > * But don't do this if this process was the most recent one to
> > * perform a synchronous write. We do this to detect the case where a
> > * single process is doing a stream of sync writes. No point in waiting
> > * for joiners in that case.
> > */
> >
> > has the 30x been reproduced? If not, what broke? If so, what effect
> > did the proposed change have upon it?
> >
>
> The huge gain was only in the case of a RAM disk test which I assume was
> not tested against the original patch early on. Against an array (with a
> 250HZ kernel), we saw a 2.5x speedup with the new code.

You only answered question 3/3 ;)

I am concerned that the current code is no longer working correctly.
If so then apart from the obvious we-should-fix-that issue, there's
also the possibility that we're adding more stuff on top of the broken
stuff rather than fixing the broken stuff.

> >
> >>> - What happens when it is simply removed?
> >>>
> >>>
> >> If you remove the code, you will not see the throughput rise when you go
> >> multithreaded on existing slow devices (S-ATA/ATA for example). Faster
> >> devices will not see that 2 threaded drop.
> >>
> >
> > See above - has this been tested and confirmed?
> >
>
> Yes - we (back at EMC) did remove the logic and the fast devices will
> write at least at their starting rate (700+ files/sec).

Did you observe the effect upon slower devices?

> >
> >>> - Add instrumentation (a counter and a printk) to work out how
> >>> many other tasks are joining this task's transaction.
> >>>
> >>> - If the answer is "zero" or "small", work out why.
> >>>
> >>> - See if we can increase its effectiveness.
> >>>
> >>> Because it could be that the code broke. There might be issues with
> >>> higher-level locks which are preventing the batching. For example, if
> >>> all the files which the test app is syncing are in the same directory,
> >>> perhaps all the tasks are piling up on that directory's i_mutex?
> >>>
> >>>
> >> I have to admit that I don't see the down side here - we have shown a
> >> huge increase for arrays (embarrassingly huge increase for RAM disks)
> >> and see no degradation for the S-ATA/ATA case.
> >>
> >> The code is not broken (having been there and done the performance
> >> tuning on the original code), it just did not account for the widely
> >> varying average response times for different classes of storage ;-)
> >>
> >
> > Well, as I said - last time I checked, it did seem to be broken. By
> > what means did you confirm that it is still effective, and what were
> > the results?
> >
> >
>
> I think Josef posted those results for S-ATA earlier in the thread and
> they were still working. We can repost/rerun to give more detail...

Yes please.

2008-08-19 21:29:16

by Ric Wheeler

Subject: Re: [PATCH 2/2] improve ext3 fsync batching

Andrew Morton wrote:
> On Tue, 19 Aug 2008 16:55:47 -0400
> Ric Wheeler <[email protected]> wrote:
>
>
>>>
>>>
>>>>> Perhaps a better scheme would be to tune it based on how many other
>>>>> processes are joining that transaction. If it's "zero" then decrease
>>>>> the timeout. But one would need to work out how to increase it, which
>>>>> perhaps could be done by detecting the case where process A runs an
>>>>> fsync when a commit is currently in progress, and that commit was
>>>>> caused by process B's fsync.
>>>>>
>>>>>
>>>>>
>>>> This is really, really a property of the device's latency at any given
>>>> point in time. If there are no other processes running, we could do an
>>>> optimization and not wait.
>>>>
>>>>
>>> well yes. This represents yet another attempt to predict future
>>> application behaviour. The way in which we _usually_ do that is by
>>> monitoring past application behaviour.
>>>
>>>
>> This whole area is very similar to the IO elevator area where some of
>> the same device characteristics are measured.
>>
>>
>>
>>> Only this patch didn't do that (directly) and I'm wondering why not.
>>>
>>>
>> The average transaction commit time is a direct measurement of the past
>> behaviour, right?
>>
>
> Of commit time, yes. Somewhat - it has obvious failures during
> transients.
>
> But there's an assumption here that commit time is usefully correlated
> with "mean time between fsync()s", which might introduce further
> inaccuracies.
>
>
>>>
>>>
>>>>> But before doing all that I would recommend/ask that the following be
>>>>> investigated:
>>>>>
>>>>> - How effective is the present code?
>>>>>
>>>>>
>>>>>
>>>> It causes the most expensive storage (arrays) to run 3-4 times slower
>>>> than they should on a synchronous write workload (NFS server, mail
>>>> server?) with more than 1 thread. For example, against a small EMC
>>>> array, I saw single threaded write rates of 720 files/sec against ext3
>>>> with 1 thread, 225 (if I remember correctly) with 2 ;-)
>>>>
>>>>
>>> Current code has:
>>>
>>> /*
>>> * Implement synchronous transaction batching. If the handle
>>> * was synchronous, don't force a commit immediately. Let's
>>> * yield and let another thread piggyback onto this transaction.
>>> * Keep doing that while new threads continue to arrive.
>>> * It doesn't cost much - we're about to run a commit and sleep
>>> * on IO anyway. Speeds up many-threaded, many-dir operations
>>> * by 30x or more...
>>> *
>>> * But don't do this if this process was the most recent one to
>>> * perform a synchronous write. We do this to detect the case where a
>>> * single process is doing a stream of sync writes. No point in waiting
>>> * for joiners in that case.
>>> */
>>>
>>> has the 30x been reproduced? If not, what broke? If so, what effect
>>> did the proposed change have upon it?
>>>
>>>
>> The huge gain was only in the case of a RAM disk test which I assume was
>> not tested against the original patch early on. Against an array (with a
>> 250HZ kernel), we saw a 2.5x speedup with the new code.
>>
>
> You only answered question 3/3 ;)
>
> I am concerned that the current code is no longer working correctly.
> If so then apart from the obvious we-should-fix-that issue, there's
> also the possibility that we're adding more stuff on top of the broken
> stuff rather than fixing the broken stuff.
>

I think it is still working for slow devices; our testing will try to
verify that.


>
>>>
>>>
>>>>> - What happens when it is simply removed?
>>>>>
>>>>>
>>>>>
>>>> If you remove the code, you will not see the throughput rise when you go
>>>> multithreaded on existing slow devices (S-ATA/ATA for example). Faster
>>>> devices will not see that 2 threaded drop.
>>>>
>>>>
>>> See above - has this been tested and confirmed?
>>>
>>>
>> Yes - we (back at EMC) did remove the logic and the fast devices will
>> write at least at their starting rate (700+ files/sec).
>>
>
> Did you observe the effect upon slower devices?
>

Slower devices were still great with this code, but that can vary with
high HZ kernels. Could you have tested and seen poor results with a
1000HZ kernel (where the slow devices don't sleep long enough)?

>
>>>
>>>
>>>>> - Add instrumentation (a counter and a printk) to work out how
>>>>> many other tasks are joining this task's transaction.
>>>>>
>>>>> - If the answer is "zero" or "small", work out why.
>>>>>
>>>>> - See if we can increase its effectiveness.
>>>>>
>>>>> Because it could be that the code broke. There might be issues with
>>>>> higher-level locks which are preventing the batching. For example, if
>>>>> all the files which the test app is syncing are in the same directory,
>>>>> perhaps all the tasks are piling up on that directory's i_mutex?
>>>>>
>>>>>
>>>>>
>>>> I have to admit that I don't see the down side here - we have shown a
>>>> huge increase for arrays (embarrassingly huge increase for RAM disks)
>>>> and see no degradation for the S-ATA/ATA case.
>>>>
>>>> The code is not broken (having been there and done the performance
>>>> tuning on the original code), it just did not account for the widely
>>>> varying average response times for different classes of storage ;-)
>>>>
>>>>
>>> Well, as I said - last time I checked, it did seem to be broken. By
>>> what means did you confirm that it is still effective, and what were
>>> the results?
>>>
>>>
>>>
>> I think Josef posted those results for S-ATA earlier in the thread and
>> they were still working. We can repost/rerun to give more detail...
>>
>
> Yes please.
>
We should do a clean retest and repost with the types of storage we have
at hand. Roughly, that would include:

* RAM disk
* mid sized array (FC first, iSCSI second?)
* S-ATA
* SAS

It would also be interesting to test with various HZ settings (100, 250,
1000).

Would that fill in most of the interesting spots?

Not sure how much time we have to cover all of these, but we will take a
stab at it ;-)

ric