2008-11-04 16:22:47

by Josef Bacik

[permalink] [raw]
Subject: [PATCH] imporve jbd2 fsync batching

Hello,

This is the jbd2 version of the jbd fsync batching patch.

This patch removes the static sleep time in favor of a more self optimizing
approach where we measure the average amount of time it takes to commit a
transaction to disk and the ammount of time a transaction has been running. If
somebody does a sync write or an fsync() traditionally we would sleep for 1
jiffies, which depending on the value of HZ could be a significant amount of
time compared to how long it takes to commit a transaction to the underlying
storage. With this patch instead of sleeping for a jiffie, we check to see if
the amount of time this transaction has been running is less than the average
commit time, and if it is we sleep for the delta using schedule_hrtimeout to
give us a higher precision sleep time. This greatly benefits high end storage
where you could end up sleeping for longer than it takes to commit the
transaction and therefore sitting idle instead of allowing the transaction to be
committed by keeping the sleep time to a minimum so you are sure to always be
doing something. Thank you,

Signed-off-by: Josef Bacik <[email protected]>

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 8b119e1..3169db9 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -332,6 +332,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
int flags;
int err;
unsigned long long blocknr;
+ ktime_t start_time;
+ u64 commit_time;
char *tagp = NULL;
journal_header_t *header;
journal_block_tag_t *tag = NULL;
@@ -458,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
commit_transaction->t_state = T_FLUSH;
journal->j_committing_transaction = commit_transaction;
journal->j_running_transaction = NULL;
+ start_time = ktime_get();
commit_transaction->t_log_start = journal->j_head;
wake_up(&journal->j_wait_transaction_locked);
spin_unlock(&journal->j_state_lock);
@@ -972,6 +975,17 @@ restart_loop:
J_ASSERT(commit_transaction == journal->j_committing_transaction);
journal->j_commit_sequence = commit_transaction->t_tid;
journal->j_committing_transaction = NULL;
+ commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+
+ /*
+ * weight the commit time higher than the average time so we don't
+ * react too strongly to vast changes in the commit time
+ */
+ if (likely(journal->j_average_commit_time))
+ journal->j_average_commit_time = (commit_time*3 +
+ journal->j_average_commit_time) / 4;
+ else
+ journal->j_average_commit_time = commit_time;
spin_unlock(&journal->j_state_lock);

if (commit_transaction->t_checkpoint_list == NULL &&
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 39b7805..ef705f9 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -25,6 +25,7 @@
#include <linux/timer.h>
#include <linux/mm.h>
#include <linux/highmem.h>
+#include <linux/hrtimer.h>

static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);

@@ -48,6 +49,7 @@ jbd2_get_transaction(journal_t *journal, transaction_t *transaction)
{
transaction->t_journal = journal;
transaction->t_state = T_RUNNING;
+ transaction->t_start_time = ktime_get();
transaction->t_tid = journal->j_transaction_sequence++;
transaction->t_expires = jiffies + journal->j_commit_interval;
spin_lock_init(&transaction->t_handle_lock);
@@ -1193,7 +1195,7 @@ int jbd2_journal_stop(handle_t *handle)
{
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
- int old_handle_count, err;
+ int err;
pid_t pid;

J_ASSERT(journal_current_handle() == handle);
@@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle)
* on IO anyway. Speeds up many-threaded, many-dir operations
* by 30x or more...
*
+ * We try and optimize the sleep time against what the underlying disk
+ * can do, instead of having a static sleep time. This is usefull for
+ * the case where our storage is so fast that it is more optimal to go
+ * ahead and force a flush and wait for the transaction to be committed
+ * than it is to wait for an arbitrary amount of time for new writers to
+ * join the transaction. We acheive this by measuring how long it takes
+ * to commit a transaction, and compare it with how long this
+ * transaction has been running, and if run time < commit time then we
+ * sleep for the delta and commit. This greatly helps super fast disks
+ * that would see slowdowns as more threads started doing fsyncs.
+ *
* But don't do this if this process was the most recent one to
* perform a synchronous write. We do this to detect the case where a
* single process is doing a stream of sync writes. No point in waiting
@@ -1229,11 +1242,26 @@ int jbd2_journal_stop(handle_t *handle)
*/
pid = current->pid;
if (handle->h_sync && journal->j_last_sync_writer != pid) {
+ u64 commit_time, trans_time;
+
journal->j_last_sync_writer = pid;
- do {
- old_handle_count = transaction->t_handle_count;
- schedule_timeout_uninterruptible(1);
- } while (old_handle_count != transaction->t_handle_count);
+
+ spin_lock(&journal->j_state_lock);
+ commit_time = journal->j_average_commit_time;
+ spin_unlock(&journal->j_state_lock);
+
+ trans_time = ktime_to_ns(ktime_sub(ktime_get(),
+ transaction->t_start_time));
+
+ commit_time = min_t(u64, commit_time,
+ 1000*jiffies_to_usecs(1));
+
+ if (trans_time < commit_time) {
+ ktime_t expires = ktime_add_ns(ktime_get(),
+ commit_time);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+ }
}

current->journal_info = NULL;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index c7d106e..b8b8744 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -637,6 +637,11 @@ struct transaction_s
unsigned long t_expires;

/*
+ * When this transaction started, in nanoseconds [no locking]
+ */
+ ktime_t t_start_time;
+
+ /*
* How many handles used this transaction? [t_handle_lock]
*/
int t_handle_count;
@@ -938,8 +943,18 @@ struct journal_s
struct buffer_head **j_wbuf;
int j_wbufsize;

+ /*
+ * this is the pid of hte last person to run a synchronous operation
+ * through the journal
+ */
pid_t j_last_sync_writer;

+ /*
+ * the average amount of time in nanoseconds it takes to commit a
+ * transaction to disk. [j_state_lock]
+ */
+ u64 j_average_commit_time;
+
/* This function is called when a transaction is closed */
void (*j_commit_callback)(journal_t *,
transaction_t *);


2008-11-04 20:52:27

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

Thanks!!

Once tiny spelling nit, which I'll fix up before I put it in the ext4
patch queue.

On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> @@ -1222,6 +1224,17 @@ int jbd2_journal_stop(handle_t *handle)
> * on IO anyway. Speeds up many-threaded, many-dir operations
> * by 30x or more...
> *
> + * We try and optimize the sleep time against what the underlying disk
> + * can do, instead of having a static sleep time. This is usefull for

s/usefull/useful/

- Ted

2008-11-04 22:20:07

by Leroy van Logchem

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching


> > + * can do, instead of having a static sleep time. This is usefull for
>
> s/usefull/useful/

and:

+ * join the transaction. We acheive this by measuring how long it takes

s/acheive/achieve/

--
Leroy



2008-11-05 23:10:26

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Nov 04, 2008 11:10 -0500, Josef Bacik wrote:
> somebody does a sync write or an fsync() traditionally we would sleep for 1
> jiffies, which depending on the value of HZ could be a significant amount of
> time compared to how long it takes to commit a transaction to the underlying
> storage. With this patch instead of sleeping for a jiffie, we check to see if
> the amount of time this transaction has been running is less than the average
> commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> give us a higher precision sleep time. This greatly benefits high end storage
> where you could end up sleeping for longer than it takes to commit the
> transaction and therefore sitting idle instead of allowing the transaction to
> be committed by keeping the sleep time to a minimum so you are sure to always
> be doing something.

There was no reply to my previous comments about making the maximum sleep
time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
on the jiffies value, which may change between 1ms and 10ms.

I don't object to this being included in ext4, but I suspect it could do
even better, or at least be more consistent than depending on the HZ value.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-11-06 00:27:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.

I know the whole point is to make it be self-tuning, but I wonder if
we should have some knobs to control the min and max sleep times.

- Ted

2008-11-06 12:46:02

by Ric Wheeler

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

Theodore Tso wrote:
> On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
>
>> There was no reply to my previous comments about making the maximum sleep
>> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
>> on the jiffies value, which may change between 1ms and 10ms.
>>
>
> I know the whole point is to make it be self-tuning, but I wonder if
> we should have some knobs to control the min and max sleep times.
>
> - Ted
>

I think that it could be useful to have a tunable knob for the max sleep
time. It would certainly be capped at a much lower level for a RAM disk
than it would be for a very large RAID volume...

Ric


2008-11-06 14:35:38

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Wed, Nov 05, 2008 at 04:10:12PM -0700, Andreas Dilger wrote:
> On Nov 04, 2008 11:10 -0500, Josef Bacik wrote:
> > somebody does a sync write or an fsync() traditionally we would sleep for 1
> > jiffies, which depending on the value of HZ could be a significant amount of
> > time compared to how long it takes to commit a transaction to the underlying
> > storage. With this patch instead of sleeping for a jiffie, we check to see if
> > the amount of time this transaction has been running is less than the average
> > commit time, and if it is we sleep for the delta using schedule_hrtimeout to
> > give us a higher precision sleep time. This greatly benefits high end storage
> > where you could end up sleeping for longer than it takes to commit the
> > transaction and therefore sitting idle instead of allowing the transaction to
> > be committed by keeping the sleep time to a minimum so you are sure to always
> > be doing something.
>
> There was no reply to my previous comments about making the maximum sleep
> time be a fixed value (e.g. 15ms) instead of having it arbitrarily based
> on the jiffies value, which may change between 1ms and 10ms.
>
> I don't object to this being included in ext4, but I suspect it could do
> even better, or at least be more consistent than depending on the HZ value.
>

Sorry I missed that comment. The reason I set the top to be 1 jiffies was
because thats what it used to be, so in an effort to try and keep regressions
from happening I just left it as the max. I'm fine with setting the max to some
other more constant value. Thanks,

Josef

2008-11-25 10:22:37

by Theodore Ts'o

[permalink] [raw]
Subject: [PATCH] ext4: add fsync batch tuning knobs

On Thu, Nov 06, 2008 at 07:45:38AM -0500, Ric Wheeler wrote:
> I think that it could be useful to have a tunable knob for the max sleep
> time. It would certainly be capped at a much lower level for a RAM disk
> than it would be for a very large RAID volume...

Here is a patch which implements tuning knobs for Josef's fsync
batching patch, at least for ext4. I've implemented both a minimum
and maximum batch time. I've also changed the default max batch time
to be 15ms, independent of HZ setting.

It'd be great if someone could run some benchmarks to see what would
be ideal ways to twiddle these knobs for different types of storage
devices (and workloads, presumably).

- Ted

-------------
ext4: add fsync batch tuning knobs

From: "Theodore Ts'o" <[email protected]>

Add new mount options, min_batch_time and max_batch_time, which
controls how long the jbd2 layer should wait for additional filesystem
operations to get batched into a synchronous handle.

Signed-off-by: "Theodore Ts'o" <[email protected]>
---
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8370ffd..6244c5d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -328,6 +328,7 @@ struct ext4_mount_options {
uid_t s_resuid;
gid_t s_resgid;
unsigned long s_commit_interval;
+ u32 s_min_batch_time, s_max_batch_time;
#ifdef CONFIG_QUOTA
int s_jquota_fmt;
char *s_qf_names[MAXQUOTAS];
@@ -806,6 +807,12 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
#define EXT4_DEFM_JMODE_WBACK 0x0060

/*
+ * Default journal batch times
+ */
+#define EXT4_DEF_MIN_BATCH_TIME 0
+#define EXT4_DEF_MAX_BATCH_TIME 15000 /* 15ms */
+
+/*
* Structure of a directory entry
*/
#define EXT4_NAME_LEN 255
diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index 2f3b8b1..f5b5d56 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -74,6 +74,8 @@ struct ext4_sb_info {
struct journal_s *s_journal;
struct list_head s_orphan;
unsigned long s_commit_interval;
+ u32 s_max_batch_time;
+ u32 s_min_batch_time;
struct block_device *journal_bdev;
#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer; /* For turning read-only (crash simulation) */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4aa2040..35500ed 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -681,10 +681,19 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
#endif
if (!test_opt(sb, RESERVATION))
seq_puts(seq, ",noreservation");
- if (sbi->s_commit_interval) {
+ if (sbi->s_commit_interval != JBD2_DEFAULT_MAX_COMMIT_AGE*HZ) {
seq_printf(seq, ",commit=%u",
(unsigned) (sbi->s_commit_interval / HZ));
}
+ if (sbi->s_min_batch_time != EXT4_DEF_MIN_BATCH_TIME) {
+ seq_printf(seq, ",min_batch_time=%u",
+ (unsigned) sbi->s_min_batch_time);
+ }
+ if (sbi->s_max_batch_time != EXT4_DEF_MAX_BATCH_TIME) {
+ seq_printf(seq, ",max_batch_time=%u",
+ (unsigned) sbi->s_min_batch_time);
+ }
+
/*
* We're changing the default of barrier mount option, so
* let's always display its mount state so it's clear what its
@@ -850,7 +859,8 @@ enum {
Opt_nouid32, Opt_debug, Opt_oldalloc, Opt_orlov,
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
- Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+ Opt_commit, Opt_min_batch_time, Opt_max_batch_time,
+ Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_data_err_abort, Opt_data_err_ignore,
@@ -889,6 +899,8 @@ static const match_table_t tokens = {
{Opt_nobh, "nobh"},
{Opt_bh, "bh"},
{Opt_commit, "commit=%u"},
+ {Opt_min_batch_time, "min_batch_time=%u"},
+ {Opt_max_batch_time, "max_batch_time=%u"},
{Opt_journal_update, "journal=update"},
{Opt_journal_inum, "journal=%u"},
{Opt_journal_dev, "journal_dev=%u"},
@@ -1107,6 +1119,20 @@ static int parse_options(char *options, struct super_block *sb,
option = JBD2_DEFAULT_MAX_COMMIT_AGE;
sbi->s_commit_interval = HZ * option;
break;
+ case Opt_max_batch_time:
+ if (match_int(&args[0], &option))
+ return 0;
+ if (option < 0)
+ return 0;
+ if (option == 0)
+ option = EXT4_DEF_MAX_BATCH_TIME;
+ sbi->s_max_batch_time = option;
+ case Opt_min_batch_time:
+ if (match_int(&args[0], &option))
+ return 0;
+ if (option < 0)
+ return 0;
+ sbi->s_min_batch_time = option;
case Opt_data_journal:
data_opt = EXT4_MOUNT_JOURNAL_DATA;
goto datacheck;
@@ -1955,6 +1981,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)

sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
+ sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
+ sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
+ sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

set_opt(sbi->s_mount_opt, RESERVATION);
set_opt(sbi->s_mount_opt, BARRIER);
@@ -2484,11 +2513,9 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
{
struct ext4_sb_info *sbi = EXT4_SB(sb);

- if (sbi->s_commit_interval)
- journal->j_commit_interval = sbi->s_commit_interval;
- /* We could also set up an ext4-specific default for the commit
- * interval here, but for now we'll just fall back to the jbd
- * default. */
+ journal->j_commit_interval = sbi->s_commit_interval;
+ journal->j_min_batch_time = sbi->s_min_batch_time;
+ journal->j_max_batch_time = sbi->s_max_batch_time;

spin_lock(&journal->j_state_lock);
if (test_opt(sb, BARRIER))
@@ -2977,6 +3004,8 @@ static int ext4_remount(struct super_block *sb, int *flags, char *data)
old_opts.s_resuid = sbi->s_resuid;
old_opts.s_resgid = sbi->s_resgid;
old_opts.s_commit_interval = sbi->s_commit_interval;
+ old_opts.s_min_batch_time = sbi->s_min_batch_time;
+ old_opts.s_max_batch_time = sbi->s_max_batch_time;
#ifdef CONFIG_QUOTA
old_opts.s_jquota_fmt = sbi->s_jquota_fmt;
for (i = 0; i < MAXQUOTAS; i++)
@@ -3106,6 +3135,8 @@ restore_opts:
sbi->s_resuid = old_opts.s_resuid;
sbi->s_resgid = old_opts.s_resgid;
sbi->s_commit_interval = old_opts.s_commit_interval;
+ sbi->s_min_batch_time = old_opts.s_min_batch_time;
+ sbi->s_max_batch_time = old_opts.s_max_batch_time;
#ifdef CONFIG_QUOTA
sbi->s_jquota_fmt = old_opts.s_jquota_fmt;
for (i = 0; i < MAXQUOTAS; i++) {
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index e70d657..dfacd9e 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -961,6 +961,8 @@ static journal_t * journal_init_common (void)
spin_lock_init(&journal->j_state_lock);

journal->j_commit_interval = (HZ * JBD2_DEFAULT_MAX_COMMIT_AGE);
+ journal->j_min_batch_time = 0;
+ journal->j_max_batch_time = 15000; /* 15ms */

/* The journal is marked for error until we succeed with recovery! */
journal->j_flags = JBD2_ABORT;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 0e1fc34..e7de7ac 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1255,9 +1255,11 @@ int jbd2_journal_stop(handle_t *handle)
trans_time = ktime_to_ns(ktime_sub(ktime_get(),
transaction->t_start_time));

+ commit_time = max_t(u64, commit_time,
+ 1000*journal->j_min_batch_time);
commit_time = min_t(u64, commit_time,
- 1000*jiffies_to_usecs(1));

2008-11-25 10:23:38

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> + commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> +
> + /*
> + * weight the commit time higher than the average time so we don't
> + * react too strongly to vast changes in the commit time
> + */
> + if (likely(journal->j_average_commit_time))
> + journal->j_average_commit_time = (commit_time*3 +
> + journal->j_average_commit_time) / 4;

Stupid question... if the goal is to not have the average commit time
not react too strongly to sudden and vast changes to the commit time,
would it be better to do this instead:

> + journal->j_average_commit_time = (commit_time +
> + journal->j_average_commit_time*3) / 4;


- Ted

2008-11-25 22:41:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Nov 25, 2008 05:23 -0500, Theodore Ts'o wrote:
> On Tue, Nov 04, 2008 at 11:10:25AM -0500, Josef Bacik wrote:
> > + commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> > +
> > + /*
> > + * weight the commit time higher than the average time so we don't
> > + * react too strongly to vast changes in the commit time
> > + */
> > + if (likely(journal->j_average_commit_time))
> > + journal->j_average_commit_time = (commit_time*3 +
> > + journal->j_average_commit_time) / 4;
>
> Stupid question... if the goal is to not have the average commit time
> not react too strongly to sudden and vast changes to the commit time,
> would it be better to do this instead:
>
> > + journal->j_average_commit_time = (commit_time +
> > + journal->j_average_commit_time*3) / 4;

Actually, yes. That is my fault, I'd suggested the original change to
Josef.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


2008-11-26 05:11:01

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > Stupid question... if the goal is to not have the average commit time
> > not react too strongly to sudden and vast changes to the commit time,
> > would it be better to do this instead:
> >
> > > + journal->j_average_commit_time = (commit_time +
> > > + journal->j_average_commit_time*3) / 4;
>
> Actually, yes. That is my fault, I'd suggested the original change to
> Josef.

BTW, I've added a patch to display the average commit time it does
vary wildly, especially on a laptop hard drive. While the system is
idle, and the occasional commits need to wait for the hard drive to
wake up, leads to a average commit time of around 80-140ms. If the
disk is just getting lightly tickled, such that it never has a chance
to go to sleep, the average commit time can drop down to around
20-25ms. If the hard drive is really busy, then the average commit
time can go up to 40-50ms.

Increasing the weight as described below will slow down the move to
these long-term averages, but at least for laptop or Green Star drives
with power savings enabled, the average commit time does seem to vary
in some non-inintuitive ways. Of course, if we are capping the max
transaction time at 15ms, most of this will never be visible, but it
would probably be interesting to test out this patch on a fast SSD or
an expensive enterprise array to see whether are similar surprising
variations in the average commit time.

- Ted

2008-11-26 13:22:59

by Josef Bacik

[permalink] [raw]
Subject: Re: [PATCH] imporve jbd2 fsync batching

On Wed, Nov 26, 2008 at 12:10:57AM -0500, Theodore Tso wrote:
> On Tue, Nov 25, 2008 at 03:41:15PM -0700, Andreas Dilger wrote:
> > > Stupid question... if the goal is to not have the average commit time
> > > not react too strongly to sudden and vast changes to the commit time,
> > > would it be better to do this instead:
> > >
> > > > + journal->j_average_commit_time = (commit_time +
> > > > + journal->j_average_commit_time*3) / 4;
> >
> > Actually, yes. That is my fault, I'd suggested the original change to
> > Josef.
>
> BTW, I've added a patch to display the average commit time it does
> vary wildly, especially on a laptop hard drive. While the system is
> idle, and the occasional commits need to wait for the hard drive to
> wake up, leads to a average commit time of around 80-140ms. If the
> disk is just getting lightly tickled, such that it never has a chance
> to go to sleep, the average commit time can drop down to around
> 20-25ms. If the hard drive is really busy, then the average commit
> time can go up to 40-50ms.
>
> Increasing the weight as described below will slow down the move to
> these long-term averages, but at least for laptop or Green Star drives
> with power savings enabled, the average commit time does seem to vary
> in some non-inintuitive ways. Of course, if we are capping the max
> transaction time at 15ms, most of this will never be visible, but it
> would probably be interesting to test out this patch on a fast SSD or
> an expensive enterprise array to see whether are similar surprising
> variations in the average commit time.
>

I was printing out the commit time when I was testing this and with high end
storage the commit time didn't vary all that much. With my SATA drive on a
normal running box it didn't vary all that much either, it would drop down to
like 1ms because of syslog, but other than that it would ramp up when there was
load and it would slowly come back down as the load went away. Also do you
want a new patch with that fix, or will you just fix it up? Thanks,

Josef

2008-12-02 14:45:48

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH] ext4: add fsync batch tuning knobs

On Tue, Nov 25, 2008 at 05:22:32AM -0500, Theodore Tso wrote:
....
...

> {Opt_bh, "bh"},
> {Opt_commit, "commit=%u"},
> + {Opt_min_batch_time, "min_batch_time=%u"},
> + {Opt_max_batch_time, "max_batch_time=%u"},
> {Opt_journal_update, "journal=update"},
> {Opt_journal_inum, "journal=%u"},
> {Opt_journal_dev, "journal_dev=%u"},
> @@ -1107,6 +1119,20 @@ static int parse_options(char *options, struct super_block *sb,
> option = JBD2_DEFAULT_MAX_COMMIT_AGE;
> sbi->s_commit_interval = HZ * option;
> break;
> + case Opt_max_batch_time:
> + if (match_int(&args[0], &option))
> + return 0;
> + if (option < 0)
> + return 0;
> + if (option == 0)
> + option = EXT4_DEF_MAX_BATCH_TIME;
> + sbi->s_max_batch_time = option;


We need a break here .


> + case Opt_min_batch_time:
> + if (match_int(&args[0], &option))
> + return 0;
> + if (option < 0)
> + return 0;
> + sbi->s_min_batch_time = option;

same here

> case Opt_data_journal:
> data_opt = EXT4_MOUNT_JOURNAL_DATA;
> goto datacheck;
> @@ -1955,6 +1981,9 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>
> sbi->s_resuid = le16_to_cpu(es->s_def_resuid);
> sbi->s_resgid = le16_to_cpu(es->s_def_resgid);
> + sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
> + sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
> + sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;
>
> set_opt(sbi->s_mount_opt, RESERVATION);
> set_opt(sbi->s_mount_opt, BARRIER);
....
..
The patch also needs documentation of new mount options

-aneesh