2008-08-01 12:06:48

by Ric Wheeler

[permalink] [raw]
Subject: high resolution timers, scheduling & sleep granularity


Hi Thomas & Ingo,

Josef has been working on some patches to try and get ext3/4 to
dynamically detect the latency of a storage device and use that base
latency to tune the amount of time we sleep waiting for others to join
in a transaction. The logic in question lives in jbd/transaction.c
(transaction_stop).

The code was originally developed to try and allow multiple threads to
join in a big, slow transaction. For example, transacations that write
to a slow ATA or S-ATA drive take in the neighborhood of 10 to 20 ms.

Faster devices, for example a disk array, can complete the transaction
in 1.3 ms. Even higher speed SSD devices boast of a latency of 0.1ms,
not to mention RAM disks ;-)

The current logic makes us wait way too long, especially with a 250HZ
kernel since we sleep many times longer than it takes to complete the IO ;-)

Do either of you have any thoughts on how to get a better, fine grained
sleep capability that we could use that would allow us to sleep in
sub-jiffie chunks?

Regards,

Ric


2008-08-01 13:49:47

by Josef Bacik

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

On Fri, Aug 01, 2008 at 08:05:37AM -0400, Ric Wheeler wrote:
>
> Hi Thomas & Ingo,
>
> Josef has been working on some patches to try and get ext3/4 to dynamically
> detect the latency of a storage device and use that base latency to tune
> the amount of time we sleep waiting for others to join in a transaction.
> The logic in question lives in jbd/transaction.c (transaction_stop).
>
> The code was originally developed to try and allow multiple threads to join
> in a big, slow transaction. For example, transacations that write to a slow
> ATA or S-ATA drive take in the neighborhood of 10 to 20 ms.
>
> Faster devices, for example a disk array, can complete the transaction in
> 1.3 ms. Even higher speed SSD devices boast of a latency of 0.1ms, not to
> mention RAM disks ;-)
>
> The current logic makes us wait way too long, especially with a 250HZ
> kernel since we sleep many times longer than it takes to complete the IO
> ;-)
>
> Do either of you have any thoughts on how to get a better, fine grained
> sleep capability that we could use that would allow us to sleep in
> sub-jiffie chunks?
>

Hello,

This is the most recent iteration of my patch using hrtimers. It works really
well for ramdisks, so anything with low latency writes is going to be really
fast, but I'm still trying to come up with a smart way to sleep long enough to
not hurt SATA performance. As it stands now I'm getting a 5% decrease in speed
on SATA. So I think I've got the sleep as little as possible part down right,
just can't quite get it to sleep long enough if the disk is slow. Thanks,

Josef


Index: linux-2.6/fs/jbd/commit.c
===================================================================
--- linux-2.6.orig/fs/jbd/commit.c
+++ linux-2.6/fs/jbd/commit.c
@@ -288,6 +288,7 @@ void journal_commit_transaction(journal_
int flags;
int err;
unsigned long blocknr;
+ unsigned long long commit_time, start_time;
char *tagp = NULL;
journal_header_t *header;
journal_block_tag_t *tag = NULL;
@@ -400,6 +401,7 @@ void journal_commit_transaction(journal_
commit_transaction->t_state = T_FLUSH;
journal->j_committing_transaction = commit_transaction;
journal->j_running_transaction = NULL;
+ start_time = jiffies;
commit_transaction->t_log_start = journal->j_head;
wake_up(&journal->j_wait_transaction_locked);
spin_unlock(&journal->j_state_lock);
@@ -873,6 +875,19 @@ restart_loop:
J_ASSERT(commit_transaction == journal->j_committing_transaction);
journal->j_commit_sequence = commit_transaction->t_tid;
journal->j_committing_transaction = NULL;
+ commit_time = elapsed_jiffies(start_time, jiffies);
+ if (unlikely(!journal->j_average_commit_time))
+ journal->j_average_commit_time = commit_time;
+ else
+ journal->j_average_commit_time = (commit_time +
+ journal->j_average_commit_time) / 2;
+
+ if (journal->print_count < 100) {
+ journal->print_count++;
+ printk(KERN_ERR "avg commit time = %lu\n",
+ journal->j_average_commit_time);
+ }
+
spin_unlock(&journal->j_state_lock);

if (commit_transaction->t_checkpoint_list == NULL &&
Index: linux-2.6/fs/jbd/transaction.c
===================================================================
--- linux-2.6.orig/fs/jbd/transaction.c
+++ linux-2.6/fs/jbd/transaction.c
@@ -25,6 +25,7 @@
#include <linux/timer.h>
#include <linux/mm.h>
#include <linux/highmem.h>
+#include <linux/hrtimer.h>

static void __journal_temp_unlink_buffer(struct journal_head *jh);

@@ -49,6 +50,7 @@ get_transaction(journal_t *journal, tran
{
transaction->t_journal = journal;
transaction->t_state = T_RUNNING;
+ transaction->t_start_time = jiffies;
transaction->t_tid = journal->j_transaction_sequence++;
transaction->t_expires = jiffies + journal->j_commit_interval;
spin_lock_init(&transaction->t_handle_lock);
@@ -63,6 +65,32 @@ get_transaction(journal_t *journal, tran
return transaction;
}

+static void precision_sleep(unsigned long time)
+{
+ struct hrtimer_sleeper t;
+
+ hrtimer_init_on_stack(&t.timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
+ hrtimer_init_sleeper(&t, current);
+ t.timer.expires = ktime_add_ns(ktime_get_real(), time);
+
+ do {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
+ if (!hrtimer_active(&t.timer))
+ t.task = NULL;
+
+ if (likely(t.task))
+ schedule();
+
+ hrtimer_cancel(&t.timer);
+ } while (t.task);
+
+ set_current_state(TASK_RUNNING);
+
+ destroy_hrtimer_on_stack(&t.timer);
+}
+
/*
* Handle management.
*
@@ -1361,7 +1389,7 @@ int journal_stop(handle_t *handle)
{
transaction_t *transaction = handle->h_transaction;
journal_t *journal = transaction->t_journal;
- int old_handle_count, err;
+ int err;
pid_t pid;

J_ASSERT(journal_current_handle() == handle);
@@ -1397,11 +1425,31 @@ int journal_stop(handle_t *handle)
*/
pid = current->pid;
if (handle->h_sync && journal->j_last_sync_writer != pid) {
+ unsigned long commit_time;
+ int print_count;
+ unsigned long sleep_time;
+
journal->j_last_sync_writer = pid;
- do {
- old_handle_count = transaction->t_handle_count;
- schedule_timeout_uninterruptible(1);
- } while (old_handle_count != transaction->t_handle_count);
+
+ spin_lock(&journal->j_state_lock);
+ commit_time = journal->j_average_commit_time;
+ spin_unlock(&journal->j_state_lock);
+
+ sleep_time = elapsed_jiffies(transaction->t_start_time, jiffies);
+ if (!sleep_time)
+ sleep_time = 1;
+ sleep_time = (commit_time / sleep_time) * commit_time;
+
+ /*
+ if (print_count >= 100 && print_count < 200) {
+ journal->print_count++;
+ printk(KERN_ERR "sleep_time=%lu\n", (unsigned long)sleep_time);
+ }
+
+ sleep_time = (long)(commit_time - sleep_time);
+ */
+
+ precision_sleep((unsigned long)sleep_time);
}

current->journal_info = NULL;
Index: linux-2.6/include/linux/jbd.h
===================================================================
--- linux-2.6.orig/include/linux/jbd.h
+++ linux-2.6/include/linux/jbd.h
@@ -31,6 +31,7 @@
#include <linux/mutex.h>
#include <linux/timer.h>
#include <linux/lockdep.h>
+#include <linux/jiffies.h>

#define journal_oom_retry 1

@@ -79,6 +80,15 @@ static inline void jbd_free(void *ptr, s
free_pages((unsigned long)ptr, get_order(size));
};

+static inline unsigned long elapsed_jiffies(unsigned long start,
+ unsigned long end)
+{
+ if (end >= start)
+ return end - start;
+
+ return end + (MAX_JIFFY_OFFSET - start) + 1;
+}
+
#define JFS_MIN_JOURNAL_BLOCKS 1024


@@ -543,6 +553,11 @@ struct transaction_s
unsigned long t_expires;

/*
+ * When this transaction started, in jiffies [no locking]
+ */
+ unsigned long t_start_time;
+
+ /*
* How many handles used this transaction? [t_handle_lock]
*/
int t_handle_count;
@@ -800,6 +815,10 @@ struct journal_s

pid_t j_last_sync_writer;

+ unsigned long j_average_commit_time;
+
+ int print_count;
+
/*
* An opaque pointer to fs-private information. ext3 puts its
* superblock pointer here
Index: linux-2.6/kernel/hrtimer.c
===================================================================
--- linux-2.6.orig/kernel/hrtimer.c
+++ linux-2.6/kernel/hrtimer.c
@@ -1458,6 +1458,7 @@ void hrtimer_init_sleeper(struct hrtimer
sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
#endif
}
+EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);

static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
{

2008-08-01 13:58:24

by Ric Wheeler

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

Josef Bacik wrote:
> On Fri, Aug 01, 2008 at 08:05:37AM -0400, Ric Wheeler wrote:
>
>> Hi Thomas & Ingo,
>>
>> Josef has been working on some patches to try and get ext3/4 to dynamically
>> detect the latency of a storage device and use that base latency to tune
>> the amount of time we sleep waiting for others to join in a transaction.
>> The logic in question lives in jbd/transaction.c (transaction_stop).
>>
>> The code was originally developed to try and allow multiple threads to join
>> in a big, slow transaction. For example, transacations that write to a slow
>> ATA or S-ATA drive take in the neighborhood of 10 to 20 ms.
>>
>> Faster devices, for example a disk array, can complete the transaction in
>> 1.3 ms. Even higher speed SSD devices boast of a latency of 0.1ms, not to
>> mention RAM disks ;-)
>>
>> The current logic makes us wait way too long, especially with a 250HZ
>> kernel since we sleep many times longer than it takes to complete the IO
>> ;-)
>>
>> Do either of you have any thoughts on how to get a better, fine grained
>> sleep capability that we could use that would allow us to sleep in
>> sub-jiffie chunks?
>>
>>
>
> Hello,
>
> This is the most recent iteration of my patch using hrtimers. It works really
> well for ramdisks, so anything with low latency writes is going to be really
> fast, but I'm still trying to come up with a smart way to sleep long enough to
> not hurt SATA performance. As it stands now I'm getting a 5% decrease in speed
> on SATA. So I think I've got the sleep as little as possible part down right,
> just can't quite get it to sleep long enough if the disk is slow. Thanks,
>
> Josef
>

I think that this (or similar) kind of precision_sleep() should be
generically useful.

One question on the code, would it be better to measure the average
transaction time in the same units as your precision sleep uses - aren't
jiffies are still too coarse?

It also might be interesting to export the measured average commit time
via some non-printk method (/sys/fs ??)

ric




>
> Index: linux-2.6/fs/jbd/commit.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd/commit.c
> +++ linux-2.6/fs/jbd/commit.c
> @@ -288,6 +288,7 @@ void journal_commit_transaction(journal_
> int flags;
> int err;
> unsigned long blocknr;
> + unsigned long long commit_time, start_time;
> char *tagp = NULL;
> journal_header_t *header;
> journal_block_tag_t *tag = NULL;
> @@ -400,6 +401,7 @@ void journal_commit_transaction(journal_
> commit_transaction->t_state = T_FLUSH;
> journal->j_committing_transaction = commit_transaction;
> journal->j_running_transaction = NULL;
> + start_time = jiffies;
> commit_transaction->t_log_start = journal->j_head;
> wake_up(&journal->j_wait_transaction_locked);
> spin_unlock(&journal->j_state_lock);
> @@ -873,6 +875,19 @@ restart_loop:
> J_ASSERT(commit_transaction == journal->j_committing_transaction);
> journal->j_commit_sequence = commit_transaction->t_tid;
> journal->j_committing_transaction = NULL;
> + commit_time = elapsed_jiffies(start_time, jiffies);
> + if (unlikely(!journal->j_average_commit_time))
> + journal->j_average_commit_time = commit_time;
> + else
> + journal->j_average_commit_time = (commit_time +
> + journal->j_average_commit_time) / 2;
> +
> + if (journal->print_count < 100) {
> + journal->print_count++;
> + printk(KERN_ERR "avg commit time = %lu\n",
> + journal->j_average_commit_time);
> + }
> +
> spin_unlock(&journal->j_state_lock);
>
> if (commit_transaction->t_checkpoint_list == NULL &&
> Index: linux-2.6/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd/transaction.c
> +++ linux-2.6/fs/jbd/transaction.c
> @@ -25,6 +25,7 @@
> #include <linux/timer.h>
> #include <linux/mm.h>
> #include <linux/highmem.h>
> +#include <linux/hrtimer.h>
>
> static void __journal_temp_unlink_buffer(struct journal_head *jh);
>
> @@ -49,6 +50,7 @@ get_transaction(journal_t *journal, tran
> {
> transaction->t_journal = journal;
> transaction->t_state = T_RUNNING;
> + transaction->t_start_time = jiffies;
> transaction->t_tid = journal->j_transaction_sequence++;
> transaction->t_expires = jiffies + journal->j_commit_interval;
> spin_lock_init(&transaction->t_handle_lock);
> @@ -63,6 +65,32 @@ get_transaction(journal_t *journal, tran
> return transaction;
> }
>
> +static void precision_sleep(unsigned long time)
> +{
> + struct hrtimer_sleeper t;
> +
> + hrtimer_init_on_stack(&t.timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
> + hrtimer_init_sleeper(&t, current);
> + t.timer.expires = ktime_add_ns(ktime_get_real(), time);
> +
> + do {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
> + if (!hrtimer_active(&t.timer))
> + t.task = NULL;
> +
> + if (likely(t.task))
> + schedule();
> +
> + hrtimer_cancel(&t.timer);
> + } while (t.task);
> +
> + set_current_state(TASK_RUNNING);
> +
> + destroy_hrtimer_on_stack(&t.timer);
> +}
> +
> /*
> * Handle management.
> *
> @@ -1361,7 +1389,7 @@ int journal_stop(handle_t *handle)
> {
> transaction_t *transaction = handle->h_transaction;
> journal_t *journal = transaction->t_journal;
> - int old_handle_count, err;
> + int err;
> pid_t pid;
>
> J_ASSERT(journal_current_handle() == handle);
> @@ -1397,11 +1425,31 @@ int journal_stop(handle_t *handle)
> */
> pid = current->pid;
> if (handle->h_sync && journal->j_last_sync_writer != pid) {
> + unsigned long commit_time;
> + int print_count;
> + unsigned long sleep_time;
> +
> journal->j_last_sync_writer = pid;
> - do {
> - old_handle_count = transaction->t_handle_count;
> - schedule_timeout_uninterruptible(1);
> - } while (old_handle_count != transaction->t_handle_count);
> +
> + spin_lock(&journal->j_state_lock);
> + commit_time = journal->j_average_commit_time;
> + spin_unlock(&journal->j_state_lock);
> +
> + sleep_time = elapsed_jiffies(transaction->t_start_time, jiffies);
> + if (!sleep_time)
> + sleep_time = 1;
> + sleep_time = (commit_time / sleep_time) * commit_time;
> +
> + /*
> + if (print_count >= 100 && print_count < 200) {
> + journal->print_count++;
> + printk(KERN_ERR "sleep_time=%lu\n", (unsigned long)sleep_time);
> + }
> +
> + sleep_time = (long)(commit_time - sleep_time);
> + */
> +
> + precision_sleep((unsigned long)sleep_time);
> }
>
> current->journal_info = NULL;
> Index: linux-2.6/include/linux/jbd.h
> ===================================================================
> --- linux-2.6.orig/include/linux/jbd.h
> +++ linux-2.6/include/linux/jbd.h
> @@ -31,6 +31,7 @@
> #include <linux/mutex.h>
> #include <linux/timer.h>
> #include <linux/lockdep.h>
> +#include <linux/jiffies.h>
>
> #define journal_oom_retry 1
>
> @@ -79,6 +80,15 @@ static inline void jbd_free(void *ptr, s
> free_pages((unsigned long)ptr, get_order(size));
> };
>
> +static inline unsigned long elapsed_jiffies(unsigned long start,
> + unsigned long end)
> +{
> + if (end >= start)
> + return end - start;
> +
> + return end + (MAX_JIFFY_OFFSET - start) + 1;
> +}
> +
> #define JFS_MIN_JOURNAL_BLOCKS 1024
>
>
> @@ -543,6 +553,11 @@ struct transaction_s
> unsigned long t_expires;
>
> /*
> + * When this transaction started, in jiffies [no locking]
> + */
> + unsigned long t_start_time;
> +
> + /*
> * How many handles used this transaction? [t_handle_lock]
> */
> int t_handle_count;
> @@ -800,6 +815,10 @@ struct journal_s
>
> pid_t j_last_sync_writer;
>
> + unsigned long j_average_commit_time;
> +
> + int print_count;
> +
> /*
> * An opaque pointer to fs-private information. ext3 puts its
> * superblock pointer here
> Index: linux-2.6/kernel/hrtimer.c
> ===================================================================
> --- linux-2.6.orig/kernel/hrtimer.c
> +++ linux-2.6/kernel/hrtimer.c
> @@ -1458,6 +1458,7 @@ void hrtimer_init_sleeper(struct hrtimer
> sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
> #endif
> }
> +EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
>
> static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> {
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2008-08-01 14:19:16

by Josef Bacik

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

On Fri, Aug 01, 2008 at 09:57:33AM -0400, Ric Wheeler wrote:
> Josef Bacik wrote:
>> On Fri, Aug 01, 2008 at 08:05:37AM -0400, Ric Wheeler wrote:
>>
>>> Hi Thomas & Ingo,
>>>
>>> Josef has been working on some patches to try and get ext3/4 to
>>> dynamically detect the latency of a storage device and use that base
>>> latency to tune the amount of time we sleep waiting for others to join in
>>> a transaction. The logic in question lives in jbd/transaction.c
>>> (transaction_stop).
>>>
>>> The code was originally developed to try and allow multiple threads to
>>> join in a big, slow transaction. For example, transacations that write to
>>> a slow ATA or S-ATA drive take in the neighborhood of 10 to 20 ms.
>>>
>>> Faster devices, for example a disk array, can complete the transaction
>>> in 1.3 ms. Even higher speed SSD devices boast of a latency of 0.1ms, not
>>> to mention RAM disks ;-)
>>>
>>> The current logic makes us wait way too long, especially with a 250HZ
>>> kernel since we sleep many times longer than it takes to complete the IO
>>> ;-)
>>>
>>> Do either of you have any thoughts on how to get a better, fine grained
>>> sleep capability that we could use that would allow us to sleep in
>>> sub-jiffie chunks?
>>>
>>>
>>
>> Hello,
>>
>> This is the most recent iteration of my patch using hrtimers. It works really
>> well for ramdisks, so anything with low latency writes is going to be really
>> fast, but I'm still trying to come up with a smart way to sleep long enough to
>> not hurt SATA performance. As it stands now I'm getting a 5% decrease in speed
>> on SATA. So I think I've got the sleep as little as possible part down right,
>> just can't quite get it to sleep long enough if the disk is slow. Thanks,
>>
>> Josef
>>
>
> I think that this (or similar) kind of precision_sleep() should be
> generically useful.
>
> One question on the code, would it be better to measure the average
> transaction time in the same units as your precision sleep uses - aren't
> jiffies are still too coarse?
>

Oh well crap I guess thats why I'm not sleeping long enough, i'm only sleeping
jiffies number of nanoseconds, and jiffies is much higher than nanoseconds...

/me writes jiffies * HZ = seconds backwards on his forehead

Josef

2008-08-01 14:50:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

On Fri, 2008-08-01 at 09:25 -0400, Josef Bacik wrote:

> Index: linux-2.6/fs/jbd/transaction.c
> ===================================================================
> --- linux-2.6.orig/fs/jbd/transaction.c
> +++ linux-2.6/fs/jbd/transaction.c

> @@ -63,6 +65,32 @@ get_transaction(journal_t *journal, tran
> return transaction;
> }
>
> +static void precision_sleep(unsigned long time)
> +{
> + struct hrtimer_sleeper t;
> +
> + hrtimer_init_on_stack(&t.timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
> + hrtimer_init_sleeper(&t, current);
> + t.timer.expires = ktime_add_ns(ktime_get_real(), time);
> +
> + do {
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
> + if (!hrtimer_active(&t.timer))
> + t.task = NULL;
> +
> + if (likely(t.task))
> + schedule();
> +
> + hrtimer_cancel(&t.timer);
> + } while (t.task);
> +
> + set_current_state(TASK_RUNNING);
> +
> + destroy_hrtimer_on_stack(&t.timer);
> +}
> +
> /*
> * Handle management.
> *

I was convinced we already had such a creature,. but I guess I was wrong
as I can't find it ;-)

Anyway, I'm thinking this function ought to live in kernel/hrtimer.c and
possibly get renamed to something like hrtimer_sleep_ns() or some such
(means you can also reuse the do_nanosleep helper in there).

Also, have you considered the impact on platforms that do not support
hrtimers, or don't have high resolution clock events available?

> Index: linux-2.6/kernel/hrtimer.c
> ===================================================================
> --- linux-2.6.orig/kernel/hrtimer.c
> +++ linux-2.6/kernel/hrtimer.c
> @@ -1458,6 +1458,7 @@ void hrtimer_init_sleeper(struct hrtimer
> sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
> #endif
> }
> +EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
>
> static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> {

That also gets rid of this export..

2008-08-01 14:58:11

by Josef Bacik

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

On Fri, Aug 01, 2008 at 04:50:11PM +0200, Peter Zijlstra wrote:
> On Fri, 2008-08-01 at 09:25 -0400, Josef Bacik wrote:
>
> > Index: linux-2.6/fs/jbd/transaction.c
> > ===================================================================
> > --- linux-2.6.orig/fs/jbd/transaction.c
> > +++ linux-2.6/fs/jbd/transaction.c
>
> > @@ -63,6 +65,32 @@ get_transaction(journal_t *journal, tran
> > return transaction;
> > }
> >
> > +static void precision_sleep(unsigned long time)
> > +{
> > + struct hrtimer_sleeper t;
> > +
> > + hrtimer_init_on_stack(&t.timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
> > + hrtimer_init_sleeper(&t, current);
> > + t.timer.expires = ktime_add_ns(ktime_get_real(), time);
> > +
> > + do {
> > + set_current_state(TASK_UNINTERRUPTIBLE);
> > +
> > + hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
> > + if (!hrtimer_active(&t.timer))
> > + t.task = NULL;
> > +
> > + if (likely(t.task))
> > + schedule();
> > +
> > + hrtimer_cancel(&t.timer);
> > + } while (t.task);
> > +
> > + set_current_state(TASK_RUNNING);
> > +
> > + destroy_hrtimer_on_stack(&t.timer);
> > +}
> > +
> > /*
> > * Handle management.
> > *
>
> I was convinced we already had such a creature,. but I guess I was wrong
> as I can't find it ;-)
>
> Anyway, I'm thinking this function ought to live in kernel/hrtimer.c and
> possibly get renamed to something like hrtimer_sleep_ns() or some such
> (means you can also reuse the do_nanosleep helper in there).
>
> Also, have you considered the impact on platforms that do not support
> hrtimers, or don't have high resolution clock events available?
>

Not yet :), this is just a proof of concept atm, eventually if I do use hrtimers
I would just fallback to using jiffies if the arch doesn't support hrtimers as
it still gets significantly better performance than what we currently have.
Thanks,

Josef


> > Index: linux-2.6/kernel/hrtimer.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/hrtimer.c
> > +++ linux-2.6/kernel/hrtimer.c
> > @@ -1458,6 +1458,7 @@ void hrtimer_init_sleeper(struct hrtimer
> > sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
> > #endif
> > }
> > +EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
> >
> > static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
> > {
>
> That also gets rid of this export..
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2008-08-01 15:04:03

by Ric Wheeler

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

Peter Zijlstra wrote:
> On Fri, 2008-08-01 at 09:25 -0400, Josef Bacik wrote:
>
>
>> Index: linux-2.6/fs/jbd/transaction.c
>> ===================================================================
>> --- linux-2.6.orig/fs/jbd/transaction.c
>> +++ linux-2.6/fs/jbd/transaction.c
>>
>
>
>> @@ -63,6 +65,32 @@ get_transaction(journal_t *journal, tran
>> return transaction;
>> }
>>
>> +static void precision_sleep(unsigned long time)
>> +{
>> + struct hrtimer_sleeper t;
>> +
>> + hrtimer_init_on_stack(&t.timer, CLOCK_REALTIME, HRTIMER_MODE_ABS);
>> + hrtimer_init_sleeper(&t, current);
>> + t.timer.expires = ktime_add_ns(ktime_get_real(), time);
>> +
>> + do {
>> + set_current_state(TASK_UNINTERRUPTIBLE);
>> +
>> + hrtimer_start(&t.timer, t.timer.expires, HRTIMER_MODE_ABS);
>> + if (!hrtimer_active(&t.timer))
>> + t.task = NULL;
>> +
>> + if (likely(t.task))
>> + schedule();
>> +
>> + hrtimer_cancel(&t.timer);
>> + } while (t.task);
>> +
>> + set_current_state(TASK_RUNNING);
>> +
>> + destroy_hrtimer_on_stack(&t.timer);
>> +}
>> +
>> /*
>> * Handle management.
>> *
>>
>
> I was convinced we already had such a creature,. but I guess I was wrong
> as I can't find it ;-)
>
> Anyway, I'm thinking this function ought to live in kernel/hrtimer.c and
> possibly get renamed to something like hrtimer_sleep_ns() or some such
> (means you can also reuse the do_nanosleep helper in there).
>
> Also, have you considered the impact on platforms that do not support
> hrtimers, or don't have high resolution clock events available?
>

I agree that this is definitely a much more generically interesting
function.

With high res timers, falling back to the jiffie's granularity (or not
sleeping at all in the case of the file system transaction batching)
seems to be a reasonable fall back,

ric


>
>> Index: linux-2.6/kernel/hrtimer.c
>> ===================================================================
>> --- linux-2.6.orig/kernel/hrtimer.c
>> +++ linux-2.6/kernel/hrtimer.c
>> @@ -1458,6 +1458,7 @@ void hrtimer_init_sleeper(struct hrtimer
>> sl->timer.cb_mode = HRTIMER_CB_IRQSAFE_NO_SOFTIRQ;
>> #endif
>> }
>> +EXPORT_SYMBOL_GPL(hrtimer_init_sleeper);
>>
>> static int __sched do_nanosleep(struct hrtimer_sleeper *t, enum hrtimer_mode mode)
>> {
>>
>
> That also gets rid of this export..
>
>

2008-08-01 18:17:01

by Andreas Dilger

[permalink] [raw]
Subject: Re: high resolution timers, scheduling & sleep granularity

On Aug 01, 2008 09:25 -0400, Josef Bacik wrote:
> + if (unlikely(!journal->j_average_commit_time))
> + journal->j_average_commit_time = commit_time;
> + else
> + journal->j_average_commit_time = (commit_time +
> + journal->j_average_commit_time) / 2;

You may also consider making this a decaying average, so that minor changes
in the workload are smoothed out. Also, it is probably easier to read
likely(foo) instead of unlikely(!foo)...

if (likely(journal->j_average_commit_time != 0))
journal->j_average_commit_time =
(commit_time * 3 + journal->j_average_commit_time) / 4;
else
journal->j_average_commit_time = commit_time;

> + if (journal->print_count < 100) {
> + journal->print_count++;
> + printk(KERN_ERR "avg commit time = %lu\n",
> + journal->j_average_commit_time);
> + }

There is already the jbd stats patch in jbd2 that is reporting this
information for the previous transactions.

>
> + spin_lock(&journal->j_state_lock);
> + commit_time = journal->j_average_commit_time;
> + spin_unlock(&journal->j_state_lock);
> +
> + sleep_time = elapsed_jiffies(transaction->t_start_time, jiffies);
> + if (!sleep_time)
> + sleep_time = 1;
> + sleep_time = (commit_time / sleep_time) * commit_time;

I was also going to comment on the use of jiffies here, but Ric beat me
to it.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.