2008-07-14 16:15:26

by Ric Wheeler

Subject: transaction batching performance & multi-threaded synchronous writers


Here is a pointer to the older patch & some results:

http://www.spinics.net/lists/linux-fsdevel/msg13121.html

I will retry this on some updated kernels, but would not expect to see a
difference since the code has not been changed ;-)

ric



2008-07-14 17:17:39

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a
> difference since the code has not been changed ;-)
>

I've been thinking: the problem with this for slower disks is that with
the patch I provided we're not really allowing multiple things to be
batched, since one thread comes up, does the sync, and waits for the sync
to finish. In the meantime the next thread comes up and does the
log_wait_commit() in order to let more threads join the transaction, but
in the case of fs_mark with only 2 threads there won't be another one,
since the original thread is waiting for the log to commit. So when the
log finishes committing, thread 1 gets woken up to do its thing, and
thread 2 gets woken up as well; thread 2 does its commit and waits for it
to finish, and thread 1 comes in and gets stuck in log_wait_commit(). So
this essentially kills the optimization, which is why everything goes
better on faster disks: the faster disks don't need the original
optimization.

So this is what I was thinking: perhaps we track the average time a
commit takes, and then if the current transaction has been running for
less than the average commit time, we sleep and wait for more things to
join the transaction, and then we commit. How does that idea sound?
Thanks,

Josef
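
To make the proposal concrete, here is a minimal user-space sketch of the
heuristic Josef describes: keep a running average of the commit cost and,
on a sync handle, keep sleeping while the current transaction is younger
than that average so other writers can join. The names, the millisecond
clock, and the 1ms sleep tick are illustrative assumptions, not kernel
code; the actual patch appears later in the thread.

/*
 * Sketch only: a user-space model of "wait out the average commit time".
 * record_commit() maintains the running average the same way the later
 * patch does; batch_wait() is the journal_stop()-side wait.
 */
#include <stdio.h>
#include <time.h>

static unsigned long avg_commit_ms;     /* running average of commit cost */

static unsigned long now_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000UL + ts.tv_nsec / 1000000UL;
}

/* Called once per commit with the measured commit duration. */
static void record_commit(unsigned long commit_ms)
{
        if (!avg_commit_ms)
                avg_commit_ms = commit_ms;
        else
                avg_commit_ms = (avg_commit_ms + commit_ms) / 2;
}

/* Sync path: sleep while the transaction is younger than the average. */
static void batch_wait(unsigned long txn_start_ms)
{
        struct timespec tick = { 0, 1000000 };  /* ~1ms tick */

        while (now_ms() < txn_start_ms + avg_commit_ms)
                nanosleep(&tick, NULL);
}

int main(void)
{
        unsigned long start = now_ms();

        record_commit(8);       /* pretend one commit took 8ms */
        record_commit(12);      /* running average becomes 10ms */
        batch_wait(start);      /* sleeps until start + ~10ms */
        printf("waited ~%lu ms\n", now_ms() - start);
        return 0;
}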

2008-07-14 17:26:42

by Ric Wheeler

Subject: Re: transaction batching performance & multi-threaded synchronous writers

Josef Bacik wrote:
> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
>> Here is a pointer to the older patch & some results:
>>
>> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>>
>> I will retry this on some updated kernels, but would not expect to see a
>> difference since the code has not been changed ;-)
>>
>>
>
> I've been thinking: the problem with this for slower disks is that with
> the patch I provided we're not really allowing multiple things to be
> batched, since one thread comes up, does the sync, and waits for the
> sync to finish. In the meantime the next thread comes up and does the
> log_wait_commit() in order to let more threads join the transaction,
> but in the case of fs_mark with only 2 threads there won't be another
> one, since the original thread is waiting for the log to commit. So
> when the log finishes committing, thread 1 gets woken up to do its
> thing, and thread 2 gets woken up as well; thread 2 does its commit and
> waits for it to finish, and thread 1 comes in and gets stuck in
> log_wait_commit(). So this essentially kills the optimization, which is
> why everything goes better on faster disks: the faster disks don't need
> the original optimization.
>
> So this is what I was thinking: perhaps we track the average time a
> commit takes, and then if the current transaction has been running for
> less than the average commit time, we sleep and wait for more things to
> join the transaction, and then we commit. How does that idea sound?
> Thanks,
>
> Josef
>
I think that this is moving in the right direction. If you think about
it, we are basically trying to do the same kind of thing that the IO
scheduler does - anticipate future requests and plug the file-system
level queue for a reasonable bit of time. The problem space is very
similar - devices of various speeds and a need to self-tune the batching
dynamically.

It would be great to be able to share the approach (if not the actual
code) ;-)

ric


2008-07-15 07:58:36

by Andreas Dilger

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
> Perhaps we track the average time a commit takes, and then if the
> current transaction has been running for less than the average commit
> time, we sleep and wait for more things to join the transaction, and
> then we commit. How does that idea sound? Thanks,

The drawback of this approach is that if the thread waits an extra "average
transaction time" for the transaction to commit then this will increase the
average transaction time each time, and it still won't tell you if there
needs to be a wait at all.

What might be more interesting is tracking how many processes had sync
handles on the previous transaction(s), and once that number of processes
have done that work, or the timeout is reached, the transaction is
committed.

While this might seem like a hack for the particular benchmark, this
will also optimize real-world workloads like mail servers, NFS/file
servers, and HTTP servers, where the number of threads running at one
time is generally fixed.

The best way to do that would be to keep a field in the task struct to
track whether a given thread has participated in transaction "T" when
it starts a new handle, and if not then increment the "number of sync
threads on this transaction" counter.

In journal_stop(), if t_num_sync_thr >= prev_num_sync_thr then
the transaction can be committed earlier, and if not then it does a
wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).

While the number of sync threads is growing or constant, the commits
will be rapid, and any "slow" threads will block on the next transaction
and increment its num_sync_thr until the thread count stabilizes (i.e.
after a small number of transactions at startup). After that the wait
will be exactly as long as needed for each thread to participate. If
some threads are too slow or stop processing, there will be a single
sleep and the next transaction will wait for fewer threads the next time.


Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
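
A minimal user-space sketch of this counting scheme, assuming the names
from the description above (num_sync_thr, prev_num_sync_thr) and a small
pid table standing in for the per-task flag:

/*
 * Sketch only: count distinct sync writers per transaction and commit
 * early once the previous transaction's crowd has shown up again.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_WRITERS 64

struct txn {
        int pids[MAX_WRITERS];  /* sync writers seen on this transaction */
        int num_sync_thr;
};

static int prev_num_sync_thr;   /* writers seen on the previous transaction */

/* On starting a handle: count each sync writer once per transaction. */
static void note_sync_writer(struct txn *t, int pid)
{
        for (int i = 0; i < t->num_sync_thr; i++)
                if (t->pids[i] == pid)
                        return;                 /* already counted */
        if (t->num_sync_thr < MAX_WRITERS)
                t->pids[t->num_sync_thr++] = pid;
}

/* journal_stop() side: commit early once the usual crowd has arrived;
 * otherwise the caller would take one short timed wait. */
static bool should_commit_now(const struct txn *t)
{
        return t->num_sync_thr >= prev_num_sync_thr;
}

/* At commit: remember how many sync writers this transaction saw. */
static void txn_committed(const struct txn *t)
{
        prev_num_sync_thr = t->num_sync_thr;
}

int main(void)
{
        struct txn t = { { 0 }, 0 };

        prev_num_sync_thr = 2;  /* the last transaction had 2 writers */
        note_sync_writer(&t, 100);
        printf("commit now? %d\n", should_commit_now(&t));  /* 0: wait */
        note_sync_writer(&t, 101);
        printf("commit now? %d\n", should_commit_now(&t));  /* 1: commit */
        txn_committed(&t);
        return 0;
}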


2008-07-15 11:29:35

by Ric Wheeler

Subject: Re: transaction batching performance & multi-threaded synchronous writers

Andreas Dilger wrote:
> On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
>
>> Perhaps we track the average time a commit takes, and then if the
>> current transaction has been running for less than the average commit
>> time, we sleep and wait for more things to join the transaction, and
>> then we commit. How does that idea sound? Thanks,
>>
>
> The drawback of this approach is that if the thread waits an extra "average
> transaction time" for the transaction to commit then this will increase the
> average transaction time each time, and it still won't tell you if there
> needs to be a wait at all.
>
> What might be more interesting is tracking how many processes had sync
> handles on the previous transaction(s), and once that number of processes
> have done that work, or the timeout is reached, the transaction is
> committed.
>
> While this might seem like a hack for the particular benchmark, this
> will also optimize real-world workloads like mail servers, NFS/file
> servers, and HTTP servers, where the number of threads running at one
> time is generally fixed.
>
> The best way to do that would be to keep a field in the task struct to
> track whether a given thread has participated in transaction "T" when
> it starts a new handle, and if not then increment the "number of sync
> threads on this transaction" counter.
>
> In journal_stop(), if t_num_sync_thr >= prev_num_sync_thr then
> the transaction can be committed earlier, and if not then it does a
> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
>
> While the number of sync threads is growing or constant, the commits
> will be rapid, and any "slow" threads will block on the next transaction
> and increment its num_sync_thr until the thread count stabilizes (i.e.
> after a small number of transactions at startup). After that the wait
> will be exactly as long as needed for each thread to participate. If
> some threads are too slow or stop processing, there will be a single
> sleep and the next transaction will wait for fewer threads the next time.
>
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>
This really sounds like one of those math problems (queuing theory?)
that I never was able to completely wrap my head around back at
university, but the basic things that we have are:

(1) the average time it takes to complete an independent transaction.
This will be different for each target device and can change over time
(a specific odd case is a shared disk, like an array).
(2) the average cost of adding "one more" thread to a transaction. I
think the assumption is that this cost is close to zero.
(3) the rate of arrival of threads trying to join a transaction.
(4) some knowledge of the history of which threads did the past
transactions. It is quite reasonable to never wait if a single thread is
the author of the last (or most of the last?) sequence, which is the
good heuristic in there now.
(5) the minimum time we can effectively wait with a given mechanism
(4ms or 1ms, for example, depending on the HZ in the code today).

I think the trick here is to try and get a heuristic that works without
going nuts in complexity.

The obvious thing we need to keep is the heuristic to not wait if we
detect a single-threaded workload.

It would seem reasonable not to wait if the latency of the device (1
above) is lower than the minimum time the chosen mechanism can wait (5).
For example, if transactions complete in microseconds, as on a ramdisk,
just blast away ;-)

What would be left is figuring out whether the arrival rate (3) predicts
that a new thread will come along before we could finish the current
transaction without waiting.

Does this make any sense? This sounds close to the idea that Josef
proposed above; we would just tweak his proposal to avoid sleeping in
the single-threaded case.

Ric
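
Put together, factors (1), (4) and (5) suggest a gate roughly like the
sketch below. The names and the microsecond units are assumptions, and
the arrival-rate prediction (3) is left open, as in the discussion above:

/*
 * Sketch only: should journal_stop() wait for more writers to join?
 */
#include <stdbool.h>
#include <stdio.h>

struct batch_state {
        unsigned long avg_commit_us;    /* (1) measured commit latency */
        unsigned long min_wait_us;      /* (5) shortest sleep we can take */
        int last_sync_writer;           /* (4) pid of the last sync writer */
};

static bool should_wait_for_joiners(struct batch_state *s, int pid)
{
        /* (4) a lone writer streaming syncs: waiting can never help */
        if (pid == s->last_sync_writer)
                return false;
        s->last_sync_writer = pid;

        /* (1) vs (5): a ramdisk-fast device commits in less time than
         * the shortest sleep, so sleeping costs more than it saves */
        if (s->avg_commit_us < s->min_wait_us)
                return false;

        return true;    /* slow device, multiple writers: batch */
}

int main(void)
{
        struct batch_state ram  = { .avg_commit_us = 50,    .min_wait_us = 4000 };
        struct batch_state sata = { .avg_commit_us = 20000, .min_wait_us = 4000 };

        printf("ramdisk, new pid: %d\n", should_wait_for_joiners(&ram, 1));  /* 0 */
        printf("sata, new pid:    %d\n", should_wait_for_joiners(&sata, 1)); /* 1 */
        printf("sata, same pid:   %d\n", should_wait_for_joiners(&sata, 1)); /* 0 */
        return 0;
}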





2008-07-15 13:10:20

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
> On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
> > Perhaps we track the average time a commit takes, and then if the
> > current transaction has been running for less than the average commit
> > time, we sleep and wait for more things to join the transaction, and
> > then we commit. How does that idea sound? Thanks,
>
> The drawback of this approach is that if the thread waits an extra "average
> transaction time" for the transaction to commit then this will increase the
> average transaction time each time, and it still won't tell you if there
> needs to be a wait at all.
>

I'm not talking about the average transaction lifetime; as you say, that
would be highly dependent on random things that have nothing to do with
the commit time (waiting for locks and such). I'm measuring the time the
actual commit takes: I record the start time in
journal_commit_transaction() when we set running_transaction = NULL, and
the end time right before the wake_up() at the end of
journal_commit_transaction(), so there is an idea of how long committing
a transaction to disk takes. If we only have two threads doing work and
fsyncing, it's going to be a roughly constant time, because we'll only
be writing a certain number of buffers each time.

> What might be more interesting is tracking how many processes had sync
> handles on the previous transaction(s), and once that number of processes
> have done that work, or the timeout is reached, the transaction is
> committed.
>
> While this might seem like a hack for the particular benchmark, this
> will also optimize real-world workloads like mail servers, NFS/file
> servers, and HTTP servers, where the number of threads running at one
> time is generally fixed.
>
> The best way to do that would be to keep a field in the task struct to
> track whether a given thread has participated in transaction "T" when
> it starts a new handle, and if not then increment the "number of sync
> threads on this transaction" counter.
>
> In journal_stop(), if t_num_sync_thr >= prev_num_sync_thr then
> the transaction can be committed earlier, and if not then it does a
> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
>
> While the number of sync threads is growing or constant, the commits
> will be rapid, and any "slow" threads will block on the next transaction
> and increment its num_sync_thr until the thread count stabilizes (i.e.
> after a small number of transactions at startup). After that the wait
> will be exactly as long as needed for each thread to participate. If
> some threads are too slow or stop processing, there will be a single
> sleep and the next transaction will wait for fewer threads the next time.
>

This idea is good, but I'm wondering about the normal user case, i.e.
where syslog is the only thing that does fsync. If we get into a
position where prev_num_sync_thr is always 1, we'll just bypass sleeping
and waiting for other stuff to join the transaction, and sync whenever
syslog pleases, which will likely affect most normal users, especially
if they are like me and have crappy wireless cards that like to spit
stuff into syslog constantly ;). Thanks much,

Josef

2008-07-15 14:22:37

by Ric Wheeler

Subject: Re: transaction batching performance & multi-threaded synchronous writers

Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
>
>> On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
>>
>>> Perhaps we track the average time a commit takes, and then if the
>>> current transaction has been running for less than the average commit
>>> time, we sleep and wait for more things to join the transaction, and
>>> then we commit. How does that idea sound? Thanks,
>>>
>> The drawback of this approach is that if the thread waits an extra "average
>> transaction time" for the transaction to commit then this will increase the
>> average transaction time each time, and it still won't tell you if there
>> needs to be a wait at all.
>>
>>
>
> I'm not talking about the average transaction lifetime; as you say, that
> would be highly dependent on random things that have nothing to do with
> the commit time (waiting for locks and such). I'm measuring the time the
> actual commit takes: I record the start time in
> journal_commit_transaction() when we set running_transaction = NULL, and
> the end time right before the wake_up() at the end of
> journal_commit_transaction(), so there is an idea of how long committing
> a transaction to disk takes. If we only have two threads doing work and
> fsyncing, it's going to be a roughly constant time, because we'll only
> be writing a certain number of buffers each time.
>

I think that this is exactly the interesting measurement to capture. It
will change over time (depending on how loaded the target device is, etc.).

The single-thread case that is special-cased is also an important one to
handle, since it should be fairly common.
>
>
>> What might be more interesting is tracking how many processes had sync
>> handles on the previous transaction(s), and once that number of processes
>> have done that work, or the timeout is reached, the transaction is
>> committed.
>>
>> While this might seem like a hack for the particular benchmark, this
>> will also optimize real-world workloads like mail servers, NFS/file
>> servers, and HTTP servers, where the number of threads running at one
>> time is generally fixed.
>>
>> The best way to do that would be to keep a field in the task struct to
>> track whether a given thread has participated in transaction "T" when
>> it starts a new handle, and if not then increment the "number of sync
>> threads on this transaction" counter.
>>
>> In journal_stop(), if t_num_sync_thr >= prev_num_sync_thr then
>> the transaction can be committed earlier, and if not then it does a
>> wait_event_interruptible_timeout(cur_num_sync_thr >= prev_num_sync_thr, 1).
>>
>> While the number of sync threads is growing or constant, the commits
>> will be rapid, and any "slow" threads will block on the next transaction
>> and increment its num_sync_thr until the thread count stabilizes (i.e.
>> after a small number of transactions at startup). After that the wait
>> will be exactly as long as needed for each thread to participate. If
>> some threads are too slow or stop processing, there will be a single
>> sleep and the next transaction will wait for fewer threads the next time.
>>
>>
>
> This idea is good, but I'm wondering about the normal user case, i.e.
> where syslog is the only thing that does fsync. If we get into a
> position where prev_num_sync_thr is always 1, we'll just bypass sleeping
> and waiting for other stuff to join the transaction, and sync whenever
> syslog pleases, which will likely affect most normal users, especially
> if they are like me and have crappy wireless cards that like to spit
> stuff into syslog constantly ;). Thanks much,
>
> Josef
>

I think this syslog case is just a common example of the same thread
doing a sequence of IOs.

ric


2008-07-15 14:24:31

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Tue, Jul 15, 2008 at 08:51:27AM -0400, Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 01:58:32AM -0600, Andreas Dilger wrote:
> > On Jul 14, 2008 12:58 -0400, Josef Bacik wrote:
> > > Perhaps we track the average time a commit takes, and then if the
> > > current transaction has been running for less than the average commit
> > > time, we sleep and wait for more things to join the transaction, and
> > > then we commit. How does that idea sound? Thanks,
> >
> > The drawback of this approach is that if the thread waits an extra "average
> > transaction time" for the transaction to commit then this will increase the
> > average transaction time each time, and it still won't tell you if there
> > needs to be a wait at all.
> >
>
> I'm not talking about the average transaction lifetime; as you say, that
> would be highly dependent on random things that have nothing to do with
> the commit time (waiting for locks and such). I'm measuring the time the
> actual commit takes: I record the start time in
> journal_commit_transaction() when we set running_transaction = NULL, and
> the end time right before the wake_up() at the end of
> journal_commit_transaction(), so there is an idea of how long committing
> a transaction to disk takes. If we only have two threads doing work and
> fsyncing, it's going to be a roughly constant time, because we'll only
> be writing a certain number of buffers each time.
>

Hmm, now that I think about it you are right: if the thread waits, the
time the transaction has been running could end up greater than the
commit time, and this wouldn't help. Though one would hope that dirtying
pagecache is still several times faster than writing to disk, so a bit
of extra waiting shouldn't be a big deal, and waiting too long already
hurts in the current case anyway. Thanks,

Josef

2008-07-15 18:58:03

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>
> Here is a pointer to the older patch & some results:
>
> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>
> I will retry this on some updated kernels, but would not expect to see a
> difference since the code has not been changed ;-)
>

Ok, here are the numbers (fs_mark files/sec) with the original idea I had proposed.

type   threads     base     patch   speedup
sata         1     17.9      17.3      0.97
sata         2     33.2      34.2      1.03
sata         4     58.4      63.6      1.09
sata         8     78.8      80.8      1.03
sata        16     94.4      97.6      1.03

ram          1   2394.4    1878.0      0.78
ram          2    989.6    2041.1      2.06
ram          4   1466.1    3201.8      2.18
ram          8   1858.1    3362.8      1.81
ram         16   3008.0    3227.7      1.07

I've got to find a fast disk array to test this with, but the ramdisk
results make me happy. They were kind of erratic, though, so I think a
fast disk array will be a more stable measure of how well this patch
does. Either way, it definitely doesn't hurt the slow case, and it
brings stability to the fast case. Thanks much,

Josef


Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
         journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+                                            unsigned long end)
+{
+        if (end >= start)
+                return end - start;
+
+        return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
         int flags;
         int err;
         unsigned long blocknr;
+        unsigned long long commit_time, start_time;
         char *tagp = NULL;
         journal_header_t *header;
         journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
         commit_transaction->t_state = T_FLUSH;
         journal->j_committing_transaction = commit_transaction;
         journal->j_running_transaction = NULL;
+        start_time = jiffies;
         commit_transaction->t_log_start = journal->j_head;
         wake_up(&journal->j_wait_transaction_locked);
         spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
         J_ASSERT(commit_transaction == journal->j_committing_transaction);
         journal->j_commit_sequence = commit_transaction->t_tid;
         journal->j_committing_transaction = NULL;
+        commit_time = elapsed_jiffies(start_time, jiffies);
+        if (unlikely(!journal->j_average_commit_time))
+                journal->j_average_commit_time = commit_time;
+        else
+                journal->j_average_commit_time = (commit_time +
+                                journal->j_average_commit_time) / 2;
         spin_unlock(&journal->j_state_lock);
 
         if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
         transaction->t_journal = journal;
         transaction->t_state = T_RUNNING;
+        transaction->t_start_time = jiffies;
         transaction->t_tid = journal->j_transaction_sequence++;
         transaction->t_expires = jiffies + journal->j_commit_interval;
         spin_lock_init(&transaction->t_handle_lock);
@@ -1361,8 +1362,7 @@ int journal_stop(handle_t *handle)
 {
         transaction_t *transaction = handle->h_transaction;
         journal_t *journal = transaction->t_journal;
-        int old_handle_count, err;
-        pid_t pid;
+        int err;
 
         J_ASSERT(journal_current_handle() == handle);
 
@@ -1395,13 +1395,17 @@ int journal_stop(handle_t *handle)
          * single process is doing a stream of sync writes. No point in waiting
          * for joiners in that case.
          */
-        pid = current->pid;
-        if (handle->h_sync && journal->j_last_sync_writer != pid) {
-                journal->j_last_sync_writer = pid;
-                do {
-                        old_handle_count = transaction->t_handle_count;
+        if (handle->h_sync) {
+                unsigned long commit_time;
+
+                spin_lock(&journal->j_state_lock);
+                commit_time = journal->j_average_commit_time;
+                spin_unlock(&journal->j_state_lock);
+
+                while (time_before(jiffies, commit_time +
+                                   transaction->t_start_time)) {
                         schedule_timeout_uninterruptible(1);
-                } while (old_handle_count != transaction->t_handle_count);
+                }
         }
 
         current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
         unsigned long t_expires;
 
         /*
+         * When this transaction started, in jiffies [no locking]
+         */
+        unsigned long t_start_time;
+
+        /*
          * How many handles used this transaction? [t_handle_lock]
          */
         int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
         pid_t j_last_sync_writer;
 
+        unsigned long long j_average_commit_time;
+
         /*
          * An opaque pointer to fs-private information. ext3 puts its
          * superblock pointer here
2008-07-15 20:29:07

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> >
> > Here is a pointer to the older patch & some results:
> >
> > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> >
> > I will retry this on some updated kernels, but would not expect to see a
> > difference since the code has not been changed ;-)
> >
>
> Ok, here are the numbers (fs_mark files/sec) with the original idea I had proposed.
>
> type   threads     base     patch   speedup
> sata         1     17.9      17.3      0.97
> sata         2     33.2      34.2      1.03
> sata         4     58.4      63.6      1.09
> sata         8     78.8      80.8      1.03
> sata        16     94.4      97.6      1.03
>
> ram          1   2394.4    1878.0      0.78
> ram          2    989.6    2041.1      2.06
> ram          4   1466.1    3201.8      2.18
> ram          8   1858.1    3362.8      1.81
> ram         16   3008.0    3227.7      1.07
>
> I've got to find a fast disk array to test this with, but the ramdisk
> results make me happy. They were kind of erratic, though, so I think a
> fast disk array will be a more stable measure of how well this patch
> does. Either way, it definitely doesn't hurt the slow case, and it
> brings stability to the fast case. Thanks much,
>

Hmm, talking with ric, I should just leave the single-thread stuff
alone. That removes the slight speed regression seen above. Thanks,

Josef


Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -273,6 +273,15 @@ write_out_data:
         journal_do_submit_data(wbuf, bufs);
 }
 
+static inline unsigned long elapsed_jiffies(unsigned long start,
+                                            unsigned long end)
+{
+        if (end >= start)
+                return end - start;
+
+        return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
 /*
  * journal_commit_transaction
  *
@@ -288,6 +297,7 @@ void journal_commit_transaction(journal_
         int flags;
         int err;
         unsigned long blocknr;
+        unsigned long long commit_time, start_time;
         char *tagp = NULL;
         journal_header_t *header;
         journal_block_tag_t *tag = NULL;
@@ -400,6 +410,7 @@ void journal_commit_transaction(journal_
         commit_transaction->t_state = T_FLUSH;
         journal->j_committing_transaction = commit_transaction;
         journal->j_running_transaction = NULL;
+        start_time = jiffies;
         commit_transaction->t_log_start = journal->j_head;
         wake_up(&journal->j_wait_transaction_locked);
         spin_unlock(&journal->j_state_lock);
@@ -873,6 +884,12 @@ restart_loop:
         J_ASSERT(commit_transaction == journal->j_committing_transaction);
         journal->j_commit_sequence = commit_transaction->t_tid;
         journal->j_committing_transaction = NULL;
+        commit_time = elapsed_jiffies(start_time, jiffies);
+        if (unlikely(!journal->j_average_commit_time))
+                journal->j_average_commit_time = commit_time;
+        else
+                journal->j_average_commit_time = (commit_time +
+                                journal->j_average_commit_time) / 2;
         spin_unlock(&journal->j_state_lock);
 
         if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -49,6 +49,7 @@ get_transaction(journal_t *journal, tran
 {
         transaction->t_journal = journal;
         transaction->t_state = T_RUNNING;
+        transaction->t_start_time = jiffies;
         transaction->t_tid = journal->j_transaction_sequence++;
         transaction->t_expires = jiffies + journal->j_commit_interval;
         spin_lock_init(&transaction->t_handle_lock);
@@ -1361,7 +1362,7 @@ int journal_stop(handle_t *handle)
 {
         transaction_t *transaction = handle->h_transaction;
         journal_t *journal = transaction->t_journal;
-        int old_handle_count, err;
+        int err;
         pid_t pid;
 
         J_ASSERT(journal_current_handle() == handle);
@@ -1397,11 +1398,17 @@ int journal_stop(handle_t *handle)
          */
         pid = current->pid;
         if (handle->h_sync && journal->j_last_sync_writer != pid) {
+                unsigned long commit_time;
+
                 journal->j_last_sync_writer = pid;
-                do {
-                        old_handle_count = transaction->t_handle_count;
+
+                spin_lock(&journal->j_state_lock);
+                commit_time = journal->j_average_commit_time;
+                spin_unlock(&journal->j_state_lock);
+
+                while (time_before(jiffies, commit_time +
+                                transaction->t_start_time))
                         schedule_timeout_uninterruptible(1);
-                } while (old_handle_count != transaction->t_handle_count);
         }
 
         current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -543,6 +543,11 @@ struct transaction_s
         unsigned long t_expires;
 
         /*
+         * When this transaction started, in jiffies [no locking]
+         */
+        unsigned long t_start_time;
+
+        /*
          * How many handles used this transaction? [t_handle_lock]
          */
         int t_handle_count;
@@ -800,6 +805,8 @@ struct journal_s
 
         pid_t j_last_sync_writer;
 
+        unsigned long long j_average_commit_time;
+
         /*
          * An opaque pointer to fs-private information. ext3 puts its
         * superblock pointer here

2008-07-15 21:02:52

by Josef Bacik

Subject: Re: transaction batching performance & multi-threaded synchronous writers

On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
> > On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
> > >
> > > Here is a pointer to the older patch & some results:
> > >
> > > http://www.spinics.net/lists/linux-fsdevel/msg13121.html
> > >
> > > I will retry this on some updated kernels, but would not expect to see a
> > > difference since the code has not been changed ;-)
> > >
> >
> > Ok, here are the numbers (fs_mark files/sec) with the original idea I had proposed.
> >
> > type   threads     base     patch   speedup
> > sata         1     17.9      17.3      0.97
> > sata         2     33.2      34.2      1.03
> > sata         4     58.4      63.6      1.09
> > sata         8     78.8      80.8      1.03
> > sata        16     94.4      97.6      1.03
> >
> > ram          1   2394.4    1878.0      0.78
> > ram          2    989.6    2041.1      2.06
> > ram          4   1466.1    3201.8      2.18
> > ram          8   1858.1    3362.8      1.81
> > ram         16   3008.0    3227.7      1.07
> >
> > I've got to find a fast disk array to test this with, but the ramdisk
> > results make me happy. They were kind of erratic, though, so I think a
> > fast disk array will be a more stable measure of how well this patch
> > does. Either way, it definitely doesn't hurt the slow case, and it
> > brings stability to the fast case. Thanks much,
> >
>
> Hmm, talking with ric, I should just leave the single-thread stuff
> alone. That removes the slight speed regression seen above. Thanks,
>

Here are the results with the single-thread stuff put back in, and with
HZ=250 instead of the HZ=1000 used before (again fs_mark files/sec):

type   threads     base     patch
sata         1     21.8      21.6
sata         2     26.2      34.6
sata         4     48.0      58.0
sata         8     70.4      75.2
sata        16     89.6     101.1

ram          1   2505.4    2422.0
ram          2    463.8    3462.3
ram          4    330.4    3653.9
ram          8    995.1    3592.4
ram         16   1335.2    3806.5

Thanks,

Josef
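
The HZ change matters because schedule_timeout_uninterruptible(1) sleeps
for at least one tick, and a tick is 1000/HZ milliseconds, so the minimum
batching wait in these runs went from 1ms (HZ=1000) to 4ms (HZ=250). A
tiny illustration of that arithmetic:

#include <stdio.h>

int main(void)
{
        int configs[] = { 1000, 250 };  /* the two HZ values used here */

        for (int i = 0; i < 2; i++)
                printf("HZ=%-4d -> minimum wait = %d ms\n",
                       configs[i], 1000 / configs[i]);
        return 0;
}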

2008-07-15 22:33:38

by Ric Wheeler

Subject: Re: transaction batching performance & multi-threaded synchronous writers

Josef Bacik wrote:
> On Tue, Jul 15, 2008 at 04:10:10PM -0400, Josef Bacik wrote:
>
>> On Tue, Jul 15, 2008 at 02:39:04PM -0400, Josef Bacik wrote:
>>
>>> On Mon, Jul 14, 2008 at 12:15:23PM -0400, Ric Wheeler wrote:
>>>
>>>> Here is a pointer to the older patch & some results:
>>>>
>>>> http://www.spinics.net/lists/linux-fsdevel/msg13121.html
>>>>
>>>> I will retry this on some updated kernels, but would not expect to see a
>>>> difference since the code has not been changed ;-)
>>>>
>>>>
>>> Ok, here are the numbers (fs_mark files/sec) with the original idea I had proposed.
>>>
>>> type   threads     base     patch   speedup
>>> sata         1     17.9      17.3      0.97
>>> sata         2     33.2      34.2      1.03
>>> sata         4     58.4      63.6      1.09
>>> sata         8     78.8      80.8      1.03
>>> sata        16     94.4      97.6      1.03
>>>
>>> ram          1   2394.4    1878.0      0.78
>>> ram          2    989.6    2041.1      2.06
>>> ram          4   1466.1    3201.8      2.18
>>> ram          8   1858.1    3362.8      1.81
>>> ram         16   3008.0    3227.7      1.07
>>>
>>> I've got to find a fast disk array to test this with, but the ramdisk
>>> results make me happy. They were kind of erratic, though, so I think a
>>> fast disk array will be a more stable measure of how well this patch
>>> does. Either way, it definitely doesn't hurt the slow case, and it
>>> brings stability to the fast case. Thanks much,
>>>
>>>
>> Hmm, talking with ric, I should just leave the single-thread stuff
>> alone. That removes the slight speed regression seen above. Thanks,
>>
>>
>
> Here are the results with the single-thread stuff put back in, and with
> HZ=250 instead of the HZ=1000 used before (again fs_mark files/sec):
>
> type   threads     base     patch
> sata         1     21.8      21.6
> sata         2     26.2      34.6
> sata         4     48.0      58.0
> sata         8     70.4      75.2
> sata        16     89.6     101.1
>
> ram          1   2505.4    2422.0
> ram          2    463.8    3462.3
> ram          4    330.4    3653.9
> ram          8    995.1    3592.4
> ram         16   1335.2    3806.5
>
> Thanks,
>
> Josef
>
These numbers are pretty impressive - we need to get a run on an
array-backed file system as well to round out the picture, and possibly
an SSD (anyone have one out there to play with?).

ric