From: Arjan van de Ven <[email protected]>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
With latencytop, I noticed that the (in-memory) atime updates during a
kernel build had latencies of 6 seconds or longer; this is obviously not
very nice behavior. Other EXT3 journal-related operations had similar or
even longer latencies.
Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work on behalf of other processes and effectively this leads to a twisted
kind of "mass priority inversion" type of behavior.
The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.
The patch below gives kjournald the IOPRIO_CLASS_RT io priority to break
this priority inversion behavior. With this patch, the latencies for atime
updates (and similar operations) go down by a factor of 3x to 4x!
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/ioprio.c | 3 ++-
fs/jbd/journal.c | 12 ++++++++++++
include/linux/ioprio.h | 2 ++
3 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
#include <linux/security.h>
#include <linux/pid_namespace.h>
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
{
int err;
struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
task_unlock(task);
return err;
}
+EXPORT_SYMBOL_GPL(set_task_ioprio);
asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
{
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..2ed3d8f 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/ioprio.h>
#include <asm/uaccess.h>
#include <asm/page.h>
@@ -131,6 +132,17 @@ static int kjournald(void *arg)
journal->j_commit_interval / HZ);
/*
+ * kjournald is the process on which most other processes depend
+ * for doing the filesystem portion of their IO. As such, there exists
+ * the equivalent of a priority inversion situation, where kjournald
+ * would get less priority than it should.
+ *
+ * For this reason we set it to "medium real time priority", which is
+ * higher than regular tasks, but not infinitely powerful.
+ */
+ set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4));
+
+ /*
* And now, wait forever for commit wakeup events.
*/
spin_lock(&journal->j_state_lock);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index f98a656..76dad48 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
*/
extern int ioprio_best(unsigned short aprio, unsigned short bprio);
+extern int set_task_ioprio(struct task_struct *task, int ioprio);
+
#endif
--
1.5.5.1
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
You proposed this a while back and it didn't happen and I forget
why and the changelog doesn't mention any of that?
On Wed, Oct 01 2008, Andrew Morton wrote:
> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
>
> > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
>
> You proposed this a while back and it didn't happen and I forget
> why and the changelog doesn't mention any of that?
I think you called for benchmark results, which I don't think happened.
The patch definitely makes sense, so we should just make sure that we
don't regress elsewhere. Honestly, I'd be surprised if we did...
How about I just toss it into the 2.6.28 testing mix, plenty of time for
testing and such?
--
Jens Axboe
On Thu, 2 Oct 2008 08:27:37 +0200 Jens Axboe <[email protected]> wrote:
> On Wed, Oct 01 2008, Andrew Morton wrote:
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> >
> > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> >
> > You proposed this a while back and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
>
> I think you called for benchmark results, which I don't think happened.
> The patch definitely makes sense, so we should just make sure that we
> don't regress elsewhere. Honestly, I'd be surprised if we did...
Now I think about it, didn't the earlier patch tweak CPU priority and
not IO priority? I forget. <kicks the changelog again>
> How about I just toss it into the 2.6.28 testing mix, plenty of time for
> testing and such?
Many performance regressions don't get noticed for six or twelve
months, by which time everyone is suffering from them (see
kernel/sched.c).
kjournald does huge amounts of not-terribly-important writeback. One
obvious risk is that by making all that bulk writeback high-priority,
read-latency-sensitive applications might suffer latency spikes.
Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground
operations. And once upon a time (seven years ago) it mostly was. But
there was some horrid race which I ended up fixing by introducing one
point where synchronous userspace actions had to block behind kjournald
activity. I spent quite some time on it but couldn't come up with
anything better. It had fairly bad effects on some workloads.
I've forgotten where that code is now, but I don't think it was ever
revisited. It should be.
So. Where are these atime updaters getting blocked?
Andrew Morton <[email protected]> writes:
> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
>
>> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
>
> You proposed this a while back and it didn't happen and I forget
> why and the changelog doesn't mention any of that?
XFS tried this some time ago too.
I think the issue was that real user supplied RT applications don't want to
compete with a "pseudo RT" kjournald.
So it would really need a new priority class between RT and normal priority.
-Andi
--
[email protected]
On Wed, Oct 01 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 08:27:37 +0200 Jens Axboe <[email protected]> wrote:
>
> > On Wed, Oct 01 2008, Andrew Morton wrote:
> > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> > >
> > > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > >
> > > You proposed this a while back and it didn't happen and I forget
> > > why and the changelog doesn't mention any of that?
> >
> > I think you called for benchmark results, which I don't think happened.
> > The patch definitely makes sense, so we should just make sure that we
> > don't regress elsewhere. Honestly, I'd be surprised if we did...
>
> Now I think about it, didn't the earlier patch tweak CPU priority and
> not IO priority? I forget. <kicks the changelog again>
>
> > How about I just toss it into the 2.6.28 testing mix, plenty of time for
> > testing and such?
>
> Many performance regressions don't get noticed for six or twelve
> months, by which time everyone is suffering from them (see
> kernel/sched.c).
>
> kjournald does huge amounts of not-terribly-important writeback. One
> obvious risk is that by making all that bulk writeback high-priority,
> read-latency-sensitive applications might suffer latency spikes.
>
>
>
> Now, kjournald is _supposed_ to be mostly asynchronous wrt foreground
> operations. And once upon a time (seven years ago) it mostly was. But
> there was some horrid race which I ended up fixing by introducing one
> point where synchronous userspace actions had to block behind kjournald
> activity. I spent quite some time on it but couldn't come up with
> anything better. It had fairly bad effects on some workloads.
>
> I've forgotten where that code is now, but I don't think it was ever
> revisited. It should be.
>
> So. Where are these atime updaters getting blocked?
Behind other IO activity I suppose, since it's marked async. A more
appropriate fix may be to mark atime updates as sync IO.
Once upon a time I actually did start tagging metadata IO from the file
system, in ext3. That was mainly for blktrace so we could inspect what
data was what at the trace end, but if we had full coverage (or better,
at least), we could use that to bump priority as well. Probably the
priority should be left at the default, but the IO should be upgraded to
SYNC instead. That'll ensure that it goes out fairly quickly (like other
IO at the same class), but not get preferential treatment apart from
that. Seems like the right thing to do.
--
Jens Axboe
On Thu, Oct 02 2008, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
>
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> >
> >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> >
> > You proposed this a while back and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
>
> XFS tried this some time ago too.
>
> I think the issue was that real user supplied RT applications don't want to
> compete with a "pseudo RT" kjournald.
>
> So it would really need a new priority class between RT and normal priority.
Good point. I think we should mark the IO as sync, and maintain the same
priority level. Any IO that ends up being waited on is sync by
definition, we just need to expand the coverage a bit.
--
Jens Axboe
On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <[email protected]> wrote:
> > So. Where are these atime updaters getting blocked?
>
> Behind other IO activity I suppose, since it's marked async. A more
> appropriate fix may be to mark atime updates as sync IO.
No, they might be getting blocked at a higher level.
An async atime update gets recorded into the current transaction.
kjournald is working on the committing transaction. We try to keep
those separated, to prevent user processes from getting blocked behind
kjournald activity.
But sometimes that doesn't work (including the place where I knowingly
broke it). If we can find and fix the offending piece of jbd logic (a
big if) then all is peachy.
If the above theory turns out to be true then diddling IO priorities
is but a workaround.
On Thu, Oct 02 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <[email protected]> wrote:
>
> > > So. Where are these atime updaters getting blocked?
> >
> > Behind other IO activity I suppose, since it's marked async. A more
> > appropriate fix may be to mark atime updates as sync IO.
>
> No, they might be getting blocked at a higher level.
>
> An async atime update gets recorded into the current transaction.
> kjournald is working on the committing transaction. We try to keep
> those separated, to prevent user processes from getting blocked behind
> kjournald activity.
>
> But sometimes that doesn't work (including the place where I knowingly
> broke it). If we can find and fix the offending piece of jbd logic (a
> big if) then all is peachy.
>
> If the above theory turns out to be true then diddling IO priorities
> is but a workaround.
If diddling with io priorities makes a big difference (and Arjan said it
does), it's clearly stuck inside the io scheduler waiting to be
dispatched. If it was marked sync, it would be going through much
faster. As async, it'll get mixed in with other async writeout and that
doesn't get a lot of attention in case of sync activity.
--
Jens Axboe
On Thu, 2 Oct 2008 10:22:13 +0200 Jens Axboe <[email protected]> wrote:
> On Thu, Oct 02 2008, Andrew Morton wrote:
> > On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <[email protected]> wrote:
> >
> > > > So. Where are these atime updaters getting blocked?
> > >
> > > Behind other IO activity I suppose, since it's marked async. A more
> > > appropriate fix may be to mark atime updates as sync IO.
> >
> > No, they might be getting blocked at a higher level.
> >
> > An async atime update gets recorded into the current transaction.
> > kjournald is working on the committing transaction. We try to keep
> > those separated, to prevent user processes from getting blocked behind
> > kjournald activity.
> >
> > But sometimes that doesn't work (including the place where I knowingly
> > broke it). If we can find and fix the offending piece of jbd logic (a
> > big if) then all is peachy.
> >
> > If the above theory turns out to be true then diddling IO priorities
> > is but a workaround.
>
> If diddling with io priorities makes a big difference (and Arjan said it
> does), it's clearly stuck inside the io scheduler waiting to be
> dispatched. If it was marked sync, it would be going through much
> faster. As async, it'll get mixed in with other async writeout and that
> doesn't get a lot of attention in case of sync activity.
>
Maybe. And maybe marking all journal writeout and all commit activity
as sync would have the same effect with less downside.
But it'd be better to not wait on this IO at all, if poss..
On Thu, Oct 02 2008, Andrew Morton wrote:
> On Thu, 2 Oct 2008 10:22:13 +0200 Jens Axboe <[email protected]> wrote:
>
> > On Thu, Oct 02 2008, Andrew Morton wrote:
> > > On Thu, 2 Oct 2008 09:45:24 +0200 Jens Axboe <[email protected]> wrote:
> > >
> > > > > So. Where are these atime updaters getting blocked?
> > > >
> > > > Behind other IO activity I suppose, since it's marked async. A more
> > > > appropriate fix may be to mark atime updates as sync IO.
> > >
> > > No, they might be getting blocked at a higher level.
> > >
> > > An async atime update gets recorded into the current transaction.
> > > kjournald is working on the committing transaction. We try to keep
> > > those separated, to prevent user processes from getting blocked behind
> > > kjournald activity.
> > >
> > > But sometimes that doesn't work (including the place where I knowingly
> > > broke it). If we can find and fix the offending piece of jbd logic (a
> > > big if) then all is peachy.
> > >
> > > If the above theory turns out to be true then diddling IO priorities
> > > is but a workaround.
> >
> > If diddling with io priorities makes a big difference (and Arjan said it
> > does), it's clearly stuck inside the io scheduler waiting to be
> > dispatched. If it was marked sync, it would be going through much
> > faster. As async, it'll get mixed in with other async writeout and that
> > doesn't get a lot of attention in case of sync activity.
> >
>
> Maybe. And maybe marking all journal writeout and all commit activity
> as sync would have the same effect with less downside.
>
> But it'd be better to not wait on this IO at all, if poss..
Agree, would be much better...
--
Jens Axboe
On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
> On Thu, Oct 02 2008, Andi Kleen wrote:
> > Andrew Morton <[email protected]> writes:
> >
> > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> > >
> > >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > >
> > > You proposed this a while back and it didn't happen and I forget
> > > why and the changelog doesn't mention any of that?
> >
> > XFS tried this some time ago too.
> >
> > I think the issue was that real user supplied RT applications don't want to
> > compete with a "pseudo RT" kjournald.
> >
> > So it would really need a new priority class between RT and normal priority.
>
> Good point. I think we should mark the IO as sync, and maintain the same
> priority level. Any IO that ends up being waited on is sync by
> definition, we just need to expand the coverage a bit.
That's what XFS has always done - mark the journal I/O as sync.
Still, once you load up the elevator, the sync I/O can still get
delayed for hundreds of milliseconds before dispatch, which was
why I started looking at boosting the priority of the log I/O.
It proved to be much more effective at getting the log I/O
dispatched than the existing "mark it sync" technique....
The RT folk were happy with the idea of journal I/O using the
highest non-RT priority for the journal, but I never got around
to testing that out as I had a bunch of other stuff to fix at
the time.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Oct 02 2008, Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
> > On Thu, Oct 02 2008, Andi Kleen wrote:
> > > Andrew Morton <[email protected]> writes:
> > >
> > > > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven <[email protected]> wrote:
> > > >
> > > >> Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> > > >
> > > > You proposed this a while back and it didn't happen and I forget
> > > > why and the changelog doesn't mention any of that?
> > >
> > > XFS tried this some time ago too.
> > >
> > > I think the issue was that real user supplied RT applications don't want to
> > > compete with a "pseudo RT" kjournald.
> > >
> > > So it would really need a new priority class between RT and normal priority.
> >
> > Good point. I think we should mark the IO as sync, and maintain the same
> > priority level. Any IO that ends up being waited on is sync by
> > definition, we just need to expand the coverage a bit.
>
> That's what XFS has always done - mark the journal I/O as sync.
> Still, once you load up the elevator, the sync I/O can still get
> delayed for hundreds of milliseconds before dispatch, which was
> why I started looking at boosting the priority of the log I/O.
> It proved to be much more effective at getting the log I/O
> dispatched than the existing "mark it sync" technique....
Sure, just marking it as sync is not a magic bullet. It'll be in the
first priority for that class, but it'll share bandwidth with other
processes. So if you have lots of IO going on, it can take hundreds of
milliseconds before being dispatched.
RT will always be faster, but mainly by virtue of there being no RT IO
in the first place. And of course preferential treatment is given to
this higher priority scheduling class.
> The RT folk were happy with the idea of journal I/O using the
> highest non-RT priority for the journal, but I never got around
> to testing that out as I had a bunch of other stuff to fix at
> the time.
That's a good idea, just bump the priority a little bit. Arjan, did you
test that out? I'd suggest just trying prio level 0 and still using
best-effort scheduling. Probably still need the sync marking, would be
interesting to experiment with though.
--
Jens Axboe
On Thu, Oct 02, 2008 at 01:03:15AM -0700, Andrew Morton wrote:
>
> An async atime update gets recorded into the current transaction.
> kjournald is working on the committing transaction. We try to keep
> those separated, to prevent user processes from getting blocked behind
> kjournald activity.
>
This is true unless the journal gets too full, and we need to do a
checkpoint operation --- at which point, everything stops. If this
was a metadata-intensive benchmark, and the journal wasn't large
enough, this could be the problem. (And if you make the journal
bigger, then when you *do* finally get forced to do a checkpoint
operation, things get stalled for even longer.)
Arjan, is this *really* about atime updates?  I thought most people
these days run with noatime or relatime. If people *really* want true
atime semantics, the best way to solve this problem would be to have
two dirty flags in the inode --- an "atime dirty" and a "dirty" flag.
The atime dirty bit would not actually cause the inode to get written
to disk, unless either (a) we are unmounting the filesystem, or (b) we
are trying to shrink the inode cache due to memory pressure. If when
we write the inode out to disk, only the atime dirty bit is set, we
can also skip journalling the inode table block. So if there are
people who really care about true atime semantics, without getting
killed by the I/O writes, there are some solutions we can pursue.
But if this is really about the "entangled fsync problem", where we
have a large number of processes writing a large amount of async data,
and then we have a single process writing a small amount of data and
then calling fsync(), then that's a different (and very long-standing)
problem in ext3/4. Raising the I/O priority is probably the only
thing we can do in this circumstance. We could try to do some kind of
complex priority inheritance scheme, but it would certainly be much
simpler to raise the I/O priority. We could choose a level just below
realtime priority, but the reality is that if a real-time process is
trying to write to the filesystem, and we are doing a checkpointing
operation, we're going to be blocking the real-time process anyway, and
it will be a priority inversion. So perhaps the simplest and best
algorithm would be to use a priority level just below real-time when
doing a normal commit, but if we start to do a checkpoint, we go to
IOPRIO_CLASS_RT.
> But sometimes that doesn't work (including the place where I knowingly
> broke it). If we can find and fix the offending piece of jbd logic (a
> big if) then all is peachy.
Do we have workloads that can easily demonstrate this problem? If so,
we can add some tracing code which will allow us to see which theory
is correct, and what is actually happening.
- Ted
On Wed, 1 Oct 2008 21:56:38 -0700
Andrew Morton <[email protected]> wrote:
> On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven
> <[email protected]> wrote:
>
> > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
>
> You proposed this a while back
one year ago to be precise
>and it didn't happen and I forget
> why and the changelog doesn't mention any of that?
Jens didn't like the API I used back then.
Since then the kernel grew (by Jens) a clean API to do the same...
--
Arjan van de Ven Intel Open Source Technology Centre
On Wed, 1 Oct 2008 23:55:01 -0700
>
> I've forgotten where that code is now, but I don't think it was ever
> revisited. It should be.
>
> So. Where are these atime updaters getting blocked?
my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)
Process claws-mail (4896) Total: 2829.7 msec
EXT3: Waiting for journal access 2491.0 msec 88.4 %
Writing back inodes 160.9 msec 5.7 %
synchronous write 78.8 msec 3.0 %
is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)
Waiting for journal access is "journal_get_write_access"
Writing back inodes is "writeback_inodes"
synchronous write is "do_sync_write"
--
Arjan van de Ven Intel Open Source Technology Centre
On Thu, 2 Oct 2008 11:45:37 +0200
Jens Axboe <[email protected]> wrote:
>
> That's a good idea, just bump the priority a little bit. Arjan, did
> you test that out? I'd suggest just trying prio level 0 and still
> using best-effort scheduling. Probably still need the sync marking,
> would be interesting to experiment with though.
I looked at 0, but it appears that 0 is the default for everyone...
if everyone defaulted to > 0 then yes, I would have picked 0.
--
Arjan van de Ven Intel Open Source Technology Centre
On Thu, 2 Oct 2008 08:04:44 -0400
Theodore Tso <[email protected]> wrote:
>
> > But sometimes that doesn't work (including the place where I
> > knowingly broke it). If we can find and fix the offending piece of
> > jbd logic (a big if) then all is peachy.
>
> Do we have workloads that can easily demonstrate this problem? If so,
> we can add some tracing code which will allow us to see which theory
> is correct, and what is actually happening.
I can very easily reproduce it; my mail client (claws) stalls due to
this several seconds at least once every 5 to 10 minutes...
(usually when I'm typing an email.. grumble)
--
Arjan van de Ven Intel Open Source Technology Centre
On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 11:45:37 +0200
> Jens Axboe <[email protected]> wrote:
>
> >
> > That's a good idea, just bump the priority a little bit. Arjan, did
> > you test that out? I'd suggest just trying prio level 0 and still
> > using best-effort scheduling. Probably still need the sync marking,
> > would be interesting to experiment with though.
>
> I looked at 0 but it appears the 0 is the default for everyone...
> if everyone just defaulted to > 0 then yes I would have picked 0.
That's not correct, class BE and value 4 is the default (and the code
defaults to that, if you haven't set a value yourself):
#define IOPRIO_NORM (4)
static inline int task_ioprio(struct io_context *ioc)
{
if (ioprio_valid(ioc->ioprio))
return IOPRIO_PRIO_DATA(ioc->ioprio);
return IOPRIO_NORM;
}
static inline int task_ioprio_class(struct io_context *ioc)
{
if (ioprio_valid(ioc->ioprio))
return IOPRIO_PRIO_CLASS(ioc->ioprio);
return IOPRIO_CLASS_BE;
}
So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
highest priority of the default scheduling class.
--
Jens Axboe
On Thu, 2 Oct 2008 15:27:47 +0200
Jens Axboe <[email protected]> wrote:
> On Thu, Oct 02 2008, Arjan van de Ven wrote:
> > On Thu, 2 Oct 2008 11:45:37 +0200
> > Jens Axboe <[email protected]> wrote:
> >
> > >
> > > That's a good idea, just bump the priority a little bit. Arjan,
> > > did you test that out? I'd suggest just trying prio level 0 and
> > > still using best-effort scheduling. Probably still need the sync
> > > marking, would be interesting to experiment with though.
> >
> > I looked at 0 but it appears the 0 is the default for everyone...
> > if everyone just defaulted to > 0 then yes I would have picked 0.
>
> That's not correct, class BE and value 4 is the default (and the code
> defaults to that, if you haven't set a value yourself):
>
> #define IOPRIO_NORM (4)
> static inline int task_ioprio(struct io_context *ioc)
> {
> if (ioprio_valid(ioc->ioprio))
> return IOPRIO_PRIO_DATA(ioc->ioprio);
>
> return IOPRIO_NORM;
> }
>
> static inline int task_ioprio_class(struct io_context *ioc)
> {
> if (ioprio_valid(ioc->ioprio))
> return IOPRIO_PRIO_CLASS(ioc->ioprio);
>
> return IOPRIO_CLASS_BE;
> }
>
> So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
> highest priority of the default scheduling class.
ok
I checked not by looking at the code, but by running ionice -p <pid> on a
bunch of things, and they came back as 0
>
--
Arjan van de Ven Intel Open Source Technology Centre
On Thu, Oct 02, 2008 at 06:16:29AM -0700, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 08:04:44 -0400
> Theodore Tso <[email protected]> wrote:
>
> >
> > > But sometimes that doesn't work (including the place where I
> > > knowingly broke it). If we can find and fix the offending piece of
> > > jbd logic (a big if) then all is peachy.
> >
> > Do we have workloads that can easily demonstrate this problem? If so,
> > we can add some tracing code which will allow us to see which theory
> > is correct, and what is actually happening.
>
> I can very easily reproduce it; my mail client (claws) stalls due to
> this several seconds at least once every 5 to 10 minutes...
> (usually when I'm typing an email.. grumble)
Are you running with noatime or relatime?
One quick thing to try that might show what is going on is to run
debugfs on the device where your mail directory is located, while
claws is running, and then use the debugfs command "logdump -a" to
dump out the contents of the journal. You can then use ncheck and
icheck to take the FS block numbers from logdump and translate them to
inode numbers, and then from inode numbers to pathnames. That might
give us some insight as to what is going on.
I can whip up a patch which adds some markers which we could use to
find out more information about what is happening.
- Ted
On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 15:27:47 +0200
> Jens Axboe <[email protected]> wrote:
>
> > On Thu, Oct 02 2008, Arjan van de Ven wrote:
> > > On Thu, 2 Oct 2008 11:45:37 +0200
> > > Jens Axboe <[email protected]> wrote:
> > >
> > > >
> > > > That's a good idea, just bump the priority a little bit. Arjan,
> > > > did you test that out? I'd suggest just trying prio level 0 and
> > > > still using best-effort scheduling. Probably still need the sync
> > > > marking, would be interesting to experiment with though.
> > >
> > > I looked at 0 but it appears the 0 is the default for everyone...
> > > if everyone just defaulted to > 0 then yes I would have picked 0.
> >
> > That's not correct, class BE and value 4 is the default (and the code
> > defaults to that, if you haven't set a value yourself):
> >
> > #define IOPRIO_NORM (4)
> > static inline int task_ioprio(struct io_context *ioc)
> > {
> > if (ioprio_valid(ioc->ioprio))
> > return IOPRIO_PRIO_DATA(ioc->ioprio);
> >
> > return IOPRIO_NORM;
> > }
> >
> > static inline int task_ioprio_class(struct io_context *ioc)
> > {
> > if (ioprio_valid(ioc->ioprio))
> > return IOPRIO_PRIO_CLASS(ioc->ioprio);
> >
> > return IOPRIO_CLASS_BE;
> > }
> >
> > So if you use IOPRIO_CLASS_BE and 0 for the ioprio, you will have the
> > highest priority of the default scheduling class.
>
> ok
>
> I checked not by looking at the code, but running ionice -p <pid> on a
> bunch of things and they came back as 0
Yes, it'll report '0' which means 'not set'. The kernel interprets 'not
set' as the default values, BE/4. There's a big difference, since '0'
means that we track CPU nice values, whereas if it returned BE/4 then
that would be a strict/fixed setting.
--
Jens Axboe
On Thu, 2 Oct 2008 15:47:37 +0200
Jens Axboe <[email protected]> wrote:
> Yes, it'll report '0' which means 'not set'. The kernel interprets 'not
> set' as the default values, BE/4. There's a big difference, since '0'
> means that we track CPU nice values, whereas if it returned BE/4 then
> that would be a strict/fixed setting.
argh. "0" means both "not set" and "highest priority".
how about we fix this?
right now "query + set" isn't idempotent; it should be.
--
Arjan van de Ven Intel Open Source Technology Centre
On Thu, 2 Oct 2008 09:46:59 -0400
Theodore Tso <[email protected]> wrote:
> On Thu, Oct 02, 2008 at 06:16:29AM -0700, Arjan van de Ven wrote:
> > On Thu, 2 Oct 2008 08:04:44 -0400
> > Theodore Tso <[email protected]> wrote:
> >
> > >
> > > > But sometimes that doesn't work (including the place where I
> > > > knowingly broke it). If we can find and fix the offending
> > > > piece of jbd logic (a big if) then all is peachy.
> > >
> > > Do we have workloads that can easily demonstrate this problem?
> > > If so, we can add some tracing code which will allow us to see
> > > which theory is correct, and what is actually happening.
> >
> > I can very easily reproduce it; my mail client (claws) stalls due to
> > this several seconds at least once every 5 to 10 minutes...
> > (usually when I'm typing an email.. grumble)
>
> Are you running with noatime or relatime?
yes; I have
mount -o remount,relatime / &
for i in `pidof kjournald` ; do ionice -c1 -p $i ; done
in my /etc/rc.local
> One quick thing to try that might show what is going on is to run
> debugfs on the device where your mail directory is located, while
> claws is running, and then use the debugfs command "logdump -a" to
> dump out the contents of the journal. You can then use ncheck and
> icheck to take the FS block numbers from logdump and translate them to
> inode numbers, and then from inode numbers to pathnames. That might
> give us some insight as to what is going on.
I'll try (sadly I'm in a training session all day today so I might not
get to it quickly)
> I can whip up a patch which adds some markers which we could use to
> find out more information about what is happening.
interesting testcase of the markers concept.
Jens Axboe <[email protected]> wrote:
> On Thu, Oct 02 2008, Dave Chinner wrote:
>> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
>> > Good point. I think we should mark the IO as sync, and maintain the same
>> > priority level. Any IO that ends up being waited on is sync by
>> > definition, we just need to expand the coverage a bit.
>>
>> That's what XFS has always done - mark the journal I/O as sync.
>> Still, once you load up the elevator, the sync I/O can still get
>> delayed for hundreds of milliseconds before dispatch, which was
>> why I started looking at boosting the priority of the log I/O.
>> It proved to be much more effective at getting the log I/O
>> dispatched than the existing "mark it sync" technique....
>
> Sure, just marking it as sync is not a magic bullet. It'll be in the
> first priority for that class, but it'll share bandwidth with other
> processes. So if you have lots of IO going on, it can take hundreds of
milliseconds before being dispatched.
Sounds like you need a priority class besides sync and async.
On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 15:47:37 +0200
> Jens Axboe <[email protected]> wrote:
>
> > Yes, it'll report '0' which means 'not set'. The kernel interprets 'not
> > set' as the default values, BE/4. There's a big difference, since '0'
> > means that we track CPU nice values, whereas if it returned BE/4 then
> > that is a strict/fixed setting.
>
> argh. "0" means both "not set" and "highest priority".
Arjan, please read the interface, it does not...
0 means "not set", period. What matters is the class setting; if that is
non-zero then it is a valid setting. See
#define IOPRIO_PRIO_VALUE(class, data) (((class) << IOPRIO_CLASS_SHIFT) | data)
So highest priority BE is IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0), which
is of course valid.
--
Jens Axboe
On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Wed, 1 Oct 2008 21:56:38 -0700
> Andrew Morton <[email protected]> wrote:
>
> > On Wed, 1 Oct 2008 20:00:34 -0700 Arjan van de Ven
> > <[email protected]> wrote:
> >
> > > Subject: [PATCH] Give kjournald a IOPRIO_CLASS_RT io priority
> >
> > You proposed this a while back
>
> one year ago to be precise
>
> >and it didn't happen and I forget
> > why and the changelog doesn't mention any of that?
>
> Jens didn't like the API I used back then.
> Since then the kernel grew (by Jens) a clean API to do the same...
Actually I don't think I changed anything, did I? IIRC, I just pointed
you at the API :-)
--
Jens Axboe
On Thu, 2 Oct 2008 11:45:37 +0200
Jens Axboe <[email protected]> wrote:
> > The RT folk were happy with the idea of journal I/O using the
> > highest non-RT priority for the journal, but I never got around
> > to testing that out as I had a bunch of other stuff to fix at
> > the time.
>
> That's a good idea, just bump the priority a little bit. Arjan, did
> you test that out? I'd suggest just trying prio level 0 and still
> using best-effort scheduling. Probably still need the sync marking,
> would be interesting to experiment with though.
>
ok, 0 works well enough in quick testing as well... updated patch below
>From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <[email protected]>
Date: Wed, 1 Oct 2008 19:58:18 -0700
Subject: [PATCH] Give kjournald a higher io priority
With latencytop, I noticed that the (in memory) file updates during my
workload (reading mail) had latencies of 6 seconds or longer; this is
obviously not so nice behavior. Other EXT3 journal related operations had
similar or even longer latencies.
Digging into this a bit more, it appears to be an interaction between EXT3
and CFQ in that CFQ tries to be fair to everyone, including kjournald.
However, in reality, kjournald is "special" in that it does a lot of journal
work and effectively this leads to a twisted kind of "mass priority
inversion" type of behavior.
The good news is that CFQ already has the infrastructure to make certain
processes special... JBD just wasn't using that quite yet.
The patch below gives kjournald a slightly higher priority than normal
applications, reducing these latencies significantly.
Signed-off-by: Arjan van de Ven <[email protected]>
---
fs/ioprio.c | 3 ++-
fs/jbd/journal.c | 12 ++++++++++++
include/linux/ioprio.h | 2 ++
3 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/fs/ioprio.c b/fs/ioprio.c
index da3cc46..3bd95dc 100644
--- a/fs/ioprio.c
+++ b/fs/ioprio.c
@@ -27,7 +27,7 @@
#include <linux/security.h>
#include <linux/pid_namespace.h>
-static int set_task_ioprio(struct task_struct *task, int ioprio)
+int set_task_ioprio(struct task_struct *task, int ioprio)
{
int err;
struct io_context *ioc;
@@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
task_unlock(task);
return err;
}
+EXPORT_SYMBOL_GPL(set_task_ioprio);
asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
{
diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
index aa7143a..a859a46 100644
--- a/fs/jbd/journal.c
+++ b/fs/jbd/journal.c
@@ -36,6 +36,7 @@
#include <linux/poison.h>
#include <linux/proc_fs.h>
#include <linux/debugfs.h>
+#include <linux/ioprio.h>
#include <asm/uaccess.h>
#include <asm/page.h>
@@ -131,6 +132,17 @@ static int kjournald(void *arg)
journal->j_commit_interval / HZ);
/*
+ * kjournald is the process on which most other processes depend
+ * for doing the filesystem portion of their IO. As such, there exists
+ * the equivalent of a priority inversion situation, where kjournald
+ * would get less priority than it should.
+ *
+ * For this reason we bump it to the highest best-effort priority,
+ * which is higher than regular tasks get, but not infinitely powerful.
+ */
+ set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
+
+ /*
* And now, wait forever for commit wakeup events.
*/
spin_lock(&journal->j_state_lock);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index f98a656..76dad48 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
*/
extern int ioprio_best(unsigned short aprio, unsigned short bprio);
+extern int set_task_ioprio(struct task_struct *task, int ioprio);
+
#endif
--
1.5.5.1
On Thu, Oct 02 2008, Arjan van de Ven wrote:
> On Thu, 2 Oct 2008 11:45:37 +0200
> Jens Axboe <[email protected]> wrote:
>
> > > The RT folk were happy with the idea of journal I/O using the
> > > highest non-RT priority for the journal, but I never got around
> > > to testing that out as I had a bunch of other stuff to fix at
> > > the time.
> >
> > That's a good idea, just bump the priority a little bit. Arjan, did
> > you test that out? I'd suggest just trying prio level 0 and still
> > using best-effort scheduling. Probably still need the sync marking,
> > would be interesting to experiment with though.
> >
>
> ok, 0 works well enough in quick testing as well... updated patch below
>
> From df64cc4e2ab0c102bbac609dd948958a6f804fd3 Mon Sep 17 00:00:00 2001
> From: Arjan van de Ven <[email protected]>
> Date: Wed, 1 Oct 2008 19:58:18 -0700
> Subject: [PATCH] Give kjournald a higher io priority
>
> With latencytop, I noticed that the (in memory) file updates during my
> workload (reading mail) had latencies of 6 seconds or longer; this is
> obviously not so nice behavior. Other EXT3 journal related operations had
> similar or even longer latencies.
>
> Digging into this a bit more, it appears to be an interaction between EXT3
> and CFQ in that CFQ tries to be fair to everyone, including kjournald.
> However, in reality, kjournald is "special" in that it does a lot of journal
> work and effectively this leads to a twisted kind of "mass priority
> inversion" type of behavior.
>
> The good news is that CFQ already has the infrastructure to make certain
> processes special... JBD just wasn't using that quite yet.
>
> The patch below gives kjournald a slightly higher priority than normal
> applications, reducing these latencies significantly.
>
> Signed-off-by: Arjan van de Ven <[email protected]>
> ---
> fs/ioprio.c | 3 ++-
> fs/jbd/journal.c | 12 ++++++++++++
> include/linux/ioprio.h | 2 ++
> 3 files changed, 16 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ioprio.c b/fs/ioprio.c
> index da3cc46..3bd95dc 100644
> --- a/fs/ioprio.c
> +++ b/fs/ioprio.c
> @@ -27,7 +27,7 @@
> #include <linux/security.h>
> #include <linux/pid_namespace.h>
>
> -static int set_task_ioprio(struct task_struct *task, int ioprio)
> +int set_task_ioprio(struct task_struct *task, int ioprio)
> {
> int err;
> struct io_context *ioc;
> @@ -64,6 +64,7 @@ static int set_task_ioprio(struct task_struct *task, int ioprio)
> task_unlock(task);
> return err;
> }
> +EXPORT_SYMBOL_GPL(set_task_ioprio);
>
> asmlinkage long sys_ioprio_set(int which, int who, int ioprio)
> {
> diff --git a/fs/jbd/journal.c b/fs/jbd/journal.c
> index aa7143a..a859a46 100644
> --- a/fs/jbd/journal.c
> +++ b/fs/jbd/journal.c
> @@ -36,6 +36,7 @@
> #include <linux/poison.h>
> #include <linux/proc_fs.h>
> #include <linux/debugfs.h>
> +#include <linux/ioprio.h>
>
> #include <asm/uaccess.h>
> #include <asm/page.h>
> @@ -131,6 +132,17 @@ static int kjournald(void *arg)
> journal->j_commit_interval / HZ);
>
> /*
> + * kjournald is the process on which most other processes depend
> + * for doing the filesystem portion of their IO. As such, there exists
> + * the equivalent of a priority inversion situation, where kjournald
> + * would get less priority than it should.
> + *
> + * For this reason we bump it to the highest best-effort priority,
> + * which is higher than regular tasks get, but not infinitely powerful.
> + */
> + set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
> +
> + /*
> * And now, wait forever for commit wakeup events.
> */
> spin_lock(&journal->j_state_lock);
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index f98a656..76dad48 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
> */
> extern int ioprio_best(unsigned short aprio, unsigned short bprio);
>
> +extern int set_task_ioprio(struct task_struct *task, int ioprio);
> +
> #endif
> --
> 1.5.5.1
Can we agree on this patch?
--
Jens Axboe
On Thu, 2 Oct 2008 06:12:36 -0700 Arjan van de Ven <[email protected]> wrote:
> On Wed, 1 Oct 2008 23:55:01 -0700
> >
> > I've forgotten where that code is now, but I don't think it was ever
> > revisited. It should be.
> >
> > So. Where are these atime updaters getting blocked?
>
> my reproducer is sadly very simple (claws-mail is my mail client that uses maildir)
>
> Process claws-mail (4896) Total: 2829.7 msec
> EXT3: Waiting for journal access 2491.0 msec 88.4 %
> Writing back inodes 160.9 msec 5.7 %
> synchronous write 78.8 msec 3.0 %
>
> is an example of such a trace (this is with patch, without patch the numbers are about 3x bigger)
>
> Waiting for journal access is "journal_get_write_access"
> Writing back inodes is "writeback_inodes"
> synchronous write is "do_sync_write"
>
Right. Probably the lock_buffer() in do_get_write_access(). kjournald
is checkpointing the committing transaction (writing metadata buffers
back into the fs) and a user process operating on the current
transaction is trying to get access to one of those buffers but has to
wait for the writeout to complete first.
It wasn't always thus. Back in, umm, 2.5.0 we did
/*
* The buffer_locked() || buffer_dirty() tests here are simply an
* optimisation tweak. If anyone else in the system decides to
* lock this buffer later on, we'll blow up. There doesn't seem
* to be a good reason why they should do this.
*/
if (jh->b_cp_transaction &&
(buffer_locked(jh2bh(jh)) || buffer_dirty(jh2bh(jh)))) {
unlock_journal(journal);
lock_buffer(jh2bh(jh));
and I _think_ it was the loss of that which hurt us a lot.
773fc4c63442fbd8237b4805627f6906143204a8 or thereabouts in the old git
tree.
It would be very good if we could again decouple the committing and
current transactions, but I fear that none of us remember sufficiently
well how it all works (or, more importantly, how it all doesn't work
when you make a change).
Of course, that could all be wrong and we could be stuck somewhere
else. A good way to diagnose this stuff would be
--- a/kernel/sched.c~a
+++ a/kernel/sched.c
@@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
void __sched io_schedule(void)
{
struct rq *rq = &__raw_get_cpu_var(runqueues);
+ unsigned long in, out;
delayacct_blkio_start();
atomic_inc(&rq->nr_iowait);
+ in = jiffies;
schedule();
+ out = jiffies;
+ WARN_ON(time_after(out, in + 1 * HZ));
atomic_dec(&rq->nr_iowait);
delayacct_blkio_end();
}
_
perhaps for varying values of "1".
On Thu, 2 Oct 2008 21:22:23 +0200
Jens Axboe <[email protected]> wrote:
> > @@ -131,6 +132,17 @@ static int kjournald(void *arg)
> > journal->j_commit_interval / HZ);
> >
> > /*
> > + * kjournald is the process on which most other processes depend
> > + * for doing the filesystem portion of their IO. As such, there exists
> > + * the equivalent of a priority inversion situation, where kjournald
> > + * would get less priority than it should.
> > + *
> > + * For this reason we bump it to the highest best-effort priority,
> > + * which is higher than regular tasks get, but not infinitely powerful.
> > + */
> > + set_task_ioprio(current, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_BE, 0));
> > +
> > + /*
> > * And now, wait forever for commit wakeup events.
> > */
> > spin_lock(&journal->j_state_lock);
> > diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> > index f98a656..76dad48 100644
> > --- a/include/linux/ioprio.h
> > +++ b/include/linux/ioprio.h
> > @@ -86,4 +86,6 @@ static inline int task_nice_ioclass(struct task_struct *task)
> > */
> > extern int ioprio_best(unsigned short aprio, unsigned short bprio);
> >
> > +extern int set_task_ioprio(struct task_struct *task, int ioprio);
> > +
> > #endif
> > --
> > 1.5.5.1
>
> Can we agree on this patch?
This change will cause _all_ kjournald writeout to have elevated
priority. The majority of that writeout (in data=ordered mode) is file
data, which we didn't intend to change.
The risk here is that this will *worsen* latency for plain old read(),
because now kjournald writeout will be favoured.
There is in fact a good argument for _reducing_ kjournald's IO
priority, not increasing it!
A better approach might be to mark the relevant buffers/bios as needing
higher priority at submit_bh() time (if that's possible). At least
that way we don't accidentally elevate the priority of the bulk data.
It's a bit of a hack, sorry :(
On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> Jens Axboe <[email protected]> wrote:
> > On Thu, Oct 02 2008, Dave Chinner wrote:
> >> On Thu, Oct 02, 2008 at 09:55:11AM +0200, Jens Axboe wrote:
>
> >> > Good point. I think we should mark the IO as sync, and maintain the same
> >> > priority level. Any IO that ends up being waited on is sync by
> >> > definition, we just need to expand the coverage a bit.
> >>
> >> That's what XFS has always done - mark the journal I/O as sync.
> >> Still, once you load up the elevator, the sync I/O can still get
> >> delayed for hundreds of milliseconds before dispatch, which was
> >> why I started looking at boosting the priority of the log I/O.
> >> It proved to be much more effective at getting the log I/O
> >> dispatched than the existing "mark it sync" technique....
> >
> > Sure, just marking it as sync is not a magic bullet. It'll be in the
> > first priority for that class, but it'll share bandwidth with other
> > processes. So if you have lots of IO going on, it can take hundreds of
> > milliseconds before being dispatched.
>
> Sounds like you need a priority class besides sync and async.
There's BIO_META now as well, which I was testing at the same time
as RT priority. Marking all the metadata I/O as BIO_META did help,
but once again I never got to determining if that was a result of
the different tagging or the priority increase.
However, given that only CFQ understands BIO_META, I suspect that
changing the way XFS uses BIO_SYNC to be a combination of BIO_META
and BIO_SYNC would cause significant regressions on other
schedulers.....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <[email protected]> wrote:
> > Can we agree on this patch?
>
> This change will cause _all_ kjournald writeout to have elevated
> priority. The majority of that writeout (in data=ordered mode) is file
> data, which we didn't intend to change.
>
> The risk here is that this will *worsen* latency for plain old read(),
> because now kjournald writeout will be favoured.
>
> There is in fact a good argument for _reducing_ kjournald's IO
> priority, not increasing it!
>
> A better approach might be to mark the relevant buffers/bios as needing
> higher priority at submit_bh() time (if that's possible). At least
> that way we don't accidentally elevate the priority of the bulk data.
You can do that for submit_bio() by calling bio_set_prio() before
submision - I did that for elevating only the XFS journal I/O.
submit_bh() doesn't have any way of passing a priority through to it
right now...
I should resurrect the XFS patches I had and retest them....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, 3 Oct 2008 09:58:49 +1000
Dave Chinner <[email protected]> wrote:
> On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> > On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <[email protected]> wrote:
> > > Can we agree on this patch?
> >
> > This change will cause _all_ kjournald writeout to have elevated
> > priority. The majority of that writeout (in data=ordered mode) is file
> > data, which we didn't intend to change.
> >
> > The risk here is that this will *worsen* latency for plain old read(),
> > because now kjournald writeout will be favoured.
> >
> > There is in fact a good argument for _reducing_ kjournald's IO
> > priority, not increasing it!
> >
> > A better approach might be to mark the relevant buffers/bios as needing
> > higher priority at submit_bh() time (if that's possible). At least
> > that way we don't accidentally elevate the priority of the bulk data.
>
> You can do that for submit_bio() by calling bio_set_prio() before
> submision - I did that for elevating only the XFS journal I/O.
> submit_bh() doesn't have any way of passing a priority through to it
> right now...
Yup. There are plenty of spare bits in buffer_head.b_state.
set_buffer_kludge()?
On Thu, 2 Oct 2008 17:06:46 -0700
Andrew Morton <[email protected]> wrote:
> On Fri, 3 Oct 2008 09:58:49 +1000
> Dave Chinner <[email protected]> wrote:
>
> > On Thu, Oct 02, 2008 at 02:37:13PM -0700, Andrew Morton wrote:
> > > On Thu, 2 Oct 2008 21:22:23 +0200 Jens Axboe <[email protected]> wrote:
> > > > Can we agree on this patch?
> > >
> > > This change will cause _all_ kjournald writeout to have elevated
> > > priority. The majority of that writeout (in data=ordered mode) is file
> > > data, which we didn't intend to change.
> > >
> > > The risk here is that this will *worsen* latency for plain old read(),
> > > because now kjournald writeout will be favoured.
> > >
> > > There is in fact a good argument for _reducing_ kjournald's IO
> > > priority, not increasing it!
> > >
> > > A better approach might be to mark the relevant buffers/bios as needing
> > > higher priority at submit_bh() time (if that's possible). At least
> > > that way we don't accidentally elevate the priority of the bulk data.
> >
> > You can do that for submit_bio() by calling bio_set_prio() before
> > submision - I did that for elevating only the XFS journal I/O.
> > submit_bh() doesn't have any way of passing a priority through to it
> > right now...
>
> Yup. There are plenty of spare bits in buffer_head.b_state.
> set_buffer_kludge()?
>
And if we do decide to run set_buffer_kludge() against these blocks
then it should be done at the time when they are marked dirty, when JBD
refiles them for checkpoint writeback.
This is because kjournald isn't the only process which writes these
buffers. Pretty often (usually?) it is pdflush. The increased IO
prioritisation is a property of the buffer, not of the task which
submits it for IO. I suspect.
On Thu, 2 Oct 2008 13:24:57 -0700
Andrew Morton <[email protected]> wrote:
> Of course, that could all be wrong and we could be stuck somewhere
> else. A good way to diagnose this stuff would be
>
> --- a/kernel/sched.c~a
> +++ a/kernel/sched.c
> @@ -5567,10 +5567,14 @@ EXPORT_SYMBOL(yield);
> void __sched io_schedule(void)
> {
> struct rq *rq = &__raw_get_cpu_var(runqueues);
> + unsigned long in, out;
>
> delayacct_blkio_start();
> atomic_inc(&rq->nr_iowait);
> + in = jiffies;
> schedule();
> + out = jiffies;
> + WARN_ON(time_after(out, in + 1 * HZ));
> atomic_dec(&rq->nr_iowait);
> delayacct_blkio_end();
> }
> _
>
> perhaps for varying values of "1".
I'll run with this. (well with a value of "1000" for 1 ;-)
Also I'll tell latencytop to give me the full backtrace as well ;-)
On Thu, 2 Oct 2008 21:01:17 -0700
Arjan van de Ven <[email protected]> wrote:
> > _
> >
> > perhaps for varying values of "1".
>
caught a few, a few were over 3 1/2 seconds in stall time
(specifically I know this for the last one)
(I've stripped out the ? entries to keep them reasonable)
[ 410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[ 410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
[ 410.168366] [<c042ee64>] warn_on_slowpath+0x41/0x65
[ 410.168414] [<c070ec83>] io_schedule+0x77/0xb0
[ 410.168421] [<c04abc72>] sync_buffer+0x33/0x37
[ 410.168429] [<c070f123>] __wait_on_bit+0x36/0x5d
[ 410.168445] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[ 410.168471] [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[ 410.168478] [<c04de796>] journal_commit_transaction+0x484/0xcb2
[ 410.168519] [<c04e14ae>] kjournald+0xc7/0x1ea
[ 410.168544] [<c043fb87>] kthread+0x3b/0x61
[ 410.168559] [<c040499f>] kernel_thread_helper+0x7/0x10
[ 410.168567] =======================
[ 410.168572] ---[ end trace de523043f88bd9a7 ]---
[ 451.605034] ------------[ cut here ]------------
[ 451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[ 451.605114] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50
[ 451.605133] [<c042ee64>] warn_on_slowpath+0x41/0x65
[ 451.605182] [<c070ec83>] io_schedule+0x77/0xb0
[ 451.605189] [<c04abc72>] sync_buffer+0x33/0x37
[ 451.605198] [<c070f123>] __wait_on_bit+0x36/0x5d
[ 451.605213] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[ 451.605240] [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[ 451.605248] [<c04de796>] journal_commit_transaction+0x484/0xcb2
[ 451.605300] [<c04e14ae>] kjournald+0xc7/0x1ea
[ 451.605324] [<c043fb87>] kthread+0x3b/0x61
[ 451.605339] [<c040499f>] kernel_thread_helper+0x7/0x10
[ 451.605348] =======================
[ 451.605353] ---[ end trace de523043f88bd9a7 ]---
[ 514.780993] ------------[ cut here ]------------
[ 514.781001] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[ 514.781074] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50
[ 514.781092] [<c042ee64>] warn_on_slowpath+0x41/0x65
[ 514.781152] [<c070ec83>] io_schedule+0x77/0xb0
[ 514.781159] [<c04abc72>] sync_buffer+0x33/0x37
[ 514.781167] [<c070f010>] __wait_on_bit_lock+0x34/0x5e
[ 514.781183] [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
[ 514.781210] [<c04abfa1>] __lock_buffer+0x24/0x2a
[ 514.781218] [<c04de597>] journal_commit_transaction+0x285/0xcb2
[ 514.781288] [<c04e14ae>] kjournald+0xc7/0x1ea
[ 514.781313] [<c043fb87>] kthread+0x3b/0x61
[ 514.781334] [<c040499f>] kernel_thread_helper+0x7/0x10
[ 514.781345] =======================
[ 514.781352] ---[ end trace de523043f88bd9a7 ]---
[ 517.064300] ------------[ cut here ]------------
[ 517.064379] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50
[ 517.064398] [<c042ee64>] warn_on_slowpath+0x41/0x65
[ 517.064474] [<c070ec83>] io_schedule+0x77/0xb0
[ 517.064481] [<c04abc72>] sync_buffer+0x33/0x37
[ 517.064490] [<c070f123>] __wait_on_bit+0x36/0x5d
[ 517.064505] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
[ 517.064530] [<c04abbd1>] __wait_on_buffer+0x19/0x1c
[ 517.064538] [<c04de796>] journal_commit_transaction+0x484/0xcb2
[ 517.064579] [<c04e14ae>] kjournald+0xc7/0x1ea
[ 517.064603] [<c043fb87>] kthread+0x3b/0x61
[ 517.064618] [<c040499f>] kernel_thread_helper+0x7/0x10
[ 517.064626] =======================
[ 517.064631] ---[ end trace de523043f88bd9a7 ]---
[ 517.067487] ------------[ cut here ]------------
[ 517.067493] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
[ 517.067556] Pid: 4320, comm: claws-mail Tainted: G W 2.6.27-rc8-tip #50
[ 517.067572] [<c042ee64>] warn_on_slowpath+0x41/0x65
[ 517.067652] [<c070ec83>] io_schedule+0x77/0xb0
[ 517.067659] [<c04abc72>] sync_buffer+0x33/0x37
[ 517.067666] [<c070f010>] __wait_on_bit_lock+0x34/0x5e
[ 517.067682] [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
[ 517.067707] [<c04abfa1>] __lock_buffer+0x24/0x2a
[ 517.067715] [<c04dd7fc>] do_get_write_access+0x64/0x3b1
[ 517.067743] [<c04ddb64>] journal_get_write_access+0x1b/0x2a
[ 517.067752] [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
[ 517.067761] [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
[ 517.067769] [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
[ 517.067777] [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
[ 517.067784] [<c04a7bed>] __mark_inode_dirty+0x29/0x144
[ 517.067794] [<c049e60f>] file_update_time+0x80/0xa9
[ 517.067803] [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
[ 517.067842] [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
[ 517.067850] [<c04cdc65>] ext3_file_write+0x1a/0x89
[ 517.067858] [<c048da41>] do_sync_write+0xab/0xe9
[ 517.067896] [<c048e302>] vfs_write+0x8a/0x12e
[ 517.067903] [<c048e43f>] sys_write+0x3b/0x60
[ 517.067910] [<c0403b0b>] sysenter_do_call+0x12/0x2f
[ 517.067919] =======================
[ 517.067923] ---[ end trace de523043f88bd9a7 ]---
On Thu, 2 Oct 2008 21:23:55 -0700 Arjan van de Ven <[email protected]> wrote:
> On Thu, 2 Oct 2008 21:01:17 -0700
> Arjan van de Ven <[email protected]> wrote:
>
> > > _
> > >
> > > perhaps for varying values of "1".
> >
>
> caught a few, a few were over 3 1/2 seconds in stall time
> (specifically I know this for the last one)
>
> (I've stripped out the ? entries to keep them reasonable)
>
> [ 410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> [ 410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
> [ 410.168366] [<c042ee64>] warn_on_slowpath+0x41/0x65
> [ 410.168414] [<c070ec83>] io_schedule+0x77/0xb0
> [ 410.168421] [<c04abc72>] sync_buffer+0x33/0x37
> [ 410.168429] [<c070f123>] __wait_on_bit+0x36/0x5d
> [ 410.168445] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
> [ 410.168471] [<c04abbd1>] __wait_on_buffer+0x19/0x1c
> [ 410.168478] [<c04de796>] journal_commit_transaction+0x484/0xcb2
> [ 410.168519] [<c04e14ae>] kjournald+0xc7/0x1ea
> [ 410.168544] [<c043fb87>] kthread+0x3b/0x61
> [ 410.168559] [<c040499f>] kernel_thread_helper+0x7/0x10
> [ 410.168567] =======================
> [ 410.168572] ---[ end trace de523043f88bd9a7 ]---
> [ 451.605034] ------------[ cut here ]------------
> [ 451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> [ 451.605114] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50
>
> ...
>
hm, they're all kjournald getting stuck on lock_buffer(). That _may_
be related to the one we care about, but it doesn't seem likely.
> [ 517.067556] Pid: 4320, comm: claws-mail Tainted: G W 2.6.27-rc8-tip #50
> [ 517.067572] [<c042ee64>] warn_on_slowpath+0x41/0x65
> [ 517.067652] [<c070ec83>] io_schedule+0x77/0xb0
> [ 517.067659] [<c04abc72>] sync_buffer+0x33/0x37
> [ 517.067666] [<c070f010>] __wait_on_bit_lock+0x34/0x5e
> [ 517.067682] [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
> [ 517.067707] [<c04abfa1>] __lock_buffer+0x24/0x2a
> [ 517.067715] [<c04dd7fc>] do_get_write_access+0x64/0x3b1
> [ 517.067743] [<c04ddb64>] journal_get_write_access+0x1b/0x2a
> [ 517.067752] [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
> [ 517.067761] [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
> [ 517.067769] [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
> [ 517.067777] [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
> [ 517.067784] [<c04a7bed>] __mark_inode_dirty+0x29/0x144
> [ 517.067794] [<c049e60f>] file_update_time+0x80/0xa9
> [ 517.067803] [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
> [ 517.067842] [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
> [ 517.067850] [<c04cdc65>] ext3_file_write+0x1a/0x89
> [ 517.067858] [<c048da41>] do_sync_write+0xab/0xe9
> [ 517.067896] [<c048e302>] vfs_write+0x8a/0x12e
> [ 517.067903] [<c048e43f>] sys_write+0x3b/0x60
> [ 517.067910] [<c0403b0b>] sysenter_do_call+0x12/0x2f
> [ 517.067919] =======================
> [ 517.067923] ---[ end trace de523043f88bd9a7 ]---
That's the one - the lock_buffer() in do_get_write_access(). It's a
major contention site and it'd be a major win if we could fix it. Even
if we resorted to some nasty thing like taking a temp copy of the
buffer's contents.
> [same claws-mail trace as above snipped]
> That's the one - the lock_buffer() in do_get_write_access(). It's a
> major contention site and it'd be a major win if we could fix it.
> Even if we resorted to some nasty thing like taking a temp copy of the
> buffer's contents.
I also notice it's part of "file_update_time". Do we really need to go all the way
down to this level of synchronicity for that?
(I also randomly wonder if we, in the write path, dirty the inode twice,
once for size and once for time, and if we then also reserve two slots in
the journal for that... but I'm showing my total ignorance of JBD
internals here)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu, 2 Oct 2008 21:40:00 -0700
Andrew Morton <[email protected]> wrote:
> On Thu, 2 Oct 2008 21:23:55 -0700 Arjan van de Ven
> <[email protected]> wrote:
>
> > On Thu, 2 Oct 2008 21:01:17 -0700
> > Arjan van de Ven <[email protected]> wrote:
> >
> > > > _
> > > >
> > > > perhaps for varying values of "1".
> > >
> >
> > caught a few; some were over 3 1/2 seconds in stall time
> > (I know this specifically for the last one)
> >
> > (I've stripped out the ? entries to keep them reasonable)
> >
> > [ 410.168277] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> > [ 410.168347] Pid: 699, comm: kjournald Not tainted 2.6.27-rc8-tip #50
> > [ 410.168366] [<c042ee64>] warn_on_slowpath+0x41/0x65
> > [ 410.168414] [<c070ec83>] io_schedule+0x77/0xb0
> > [ 410.168421] [<c04abc72>] sync_buffer+0x33/0x37
> > [ 410.168429] [<c070f123>] __wait_on_bit+0x36/0x5d
> > [ 410.168445] [<c070f1f5>] out_of_line_wait_on_bit+0xab/0xb3
> > [ 410.168471] [<c04abbd1>] __wait_on_buffer+0x19/0x1c
> > [ 410.168478] [<c04de796>] journal_commit_transaction+0x484/0xcb2
> > [ 410.168519] [<c04e14ae>] kjournald+0xc7/0x1ea
> > [ 410.168544] [<c043fb87>] kthread+0x3b/0x61
> > [ 410.168559] [<c040499f>] kernel_thread_helper+0x7/0x10
> > [ 410.168567] =======================
> > [ 410.168572] ---[ end trace de523043f88bd9a7 ]---
> > [ 451.605034] ------------[ cut here ]------------
> > [ 451.605041] WARNING: at kernel/sched.c:5652 io_schedule+0x77/0xb0()
> > [ 451.605114] Pid: 699, comm: kjournald Tainted: G W 2.6.27-rc8-tip #50
> >
> > ...
> >
>
> hm, they're all kjournald getting stuck on lock_buffer(). That _may_
> be related to the one we care about, but it doesn't seem likely.
one of them (based on the printk timestamps) is less than 3 seconds
before the "real" one, while the real one's delay was 4 seconds.
To me that implies they're at least somewhat correlated...
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu, 2 Oct 2008 21:43:53 -0700 Arjan van de Ven <[email protected]> wrote:
> > [ 517.067572] [<c042ee64>] warn_on_slowpath+0x41/0x65
> > [ 517.067652] [<c070ec83>] io_schedule+0x77/0xb0
> > [ 517.067659] [<c04abc72>] sync_buffer+0x33/0x37
> > [ 517.067666] [<c070f010>] __wait_on_bit_lock+0x34/0x5e
> > [ 517.067682] [<c070f0e5>] out_of_line_wait_on_bit_lock+0xab/0xb3
> > [ 517.067707] [<c04abfa1>] __lock_buffer+0x24/0x2a
> > [ 517.067715] [<c04dd7fc>] do_get_write_access+0x64/0x3b1
> > [ 517.067743] [<c04ddb64>] journal_get_write_access+0x1b/0x2a
> > [ 517.067752] [<c04da374>] __ext3_journal_get_write_access+0x19/0x3c
> > [ 517.067761] [<c04cf672>] ext3_reserve_inode_write+0x34/0x68
> > [ 517.067769] [<c04cf6d5>] ext3_mark_inode_dirty+0x2f/0x46
> > [ 517.067777] [<c04cf7f7>] ext3_dirty_inode+0x53/0x67
> > [ 517.067784] [<c04a7bed>] __mark_inode_dirty+0x29/0x144
> > [ 517.067794] [<c049e60f>] file_update_time+0x80/0xa9
> > [ 517.067803] [<c046b66c>] __generic_file_aio_write_nolock+0x2f0/0x41b
> > [ 517.067842] [<c046bf0d>] generic_file_aio_write+0x5a/0xb7
> > [ 517.067850] [<c04cdc65>] ext3_file_write+0x1a/0x89
> > [ 517.067858] [<c048da41>] do_sync_write+0xab/0xe9
> > [ 517.067896] [<c048e302>] vfs_write+0x8a/0x12e
> > [ 517.067903] [<c048e43f>] sys_write+0x3b/0x60
> > [ 517.067910] [<c0403b0b>] sysenter_do_call+0x12/0x2f
> > [ 517.067919] =======================
> > [ 517.067923] ---[ end trace de523043f88bd9a7 ]---
>
> > That's the one - the lock_buffer() in do_get_write_access(). It's a
> > major contention site and it'd be a major win if we could fix it.
> > Even if we resorted to some nasty thing like taking a temp copy of the
> > buffer's contents.
>
> I also notice it's part of "file_update_time". Do we really need to go all the way
> down to this level of synchronicity for that?
Well, we've tossed that around many times but never implemented it.
Once you get into the details it gets a bit nasty. Need to keep the
dirtiness state in the VFS (or fs) inode, and going backwards from a
plain old buffer_head at commit time isn't possible. We usually
tempfixed the problem by adding increasingly fancy ways of not doing the
atime update at all.
Of course, fixing this running-vs-committing contention point would fix
a lot more things than just atime updates.
> (I also randomly wonder if we, in the write path, dirty the inode twice, once for size and once for time, and
> if we then also reserve two slots in the journal for that.....
That shouldn't be the case - once we have write access to the buffer it
remains freely modifiable for the rest of the transaction period. I
think.
> but I'm showing
> my total ignorance of JBD internals here)
I'm going on senile memories of JBD from five years ago, but the
concepts haven't changed much.
On Thu, 2 Oct 2008 21:50:26 -0700
Andrew Morton <[email protected]> wrote:
> >
> > I also notice it's part of "file_update_time". Do we really need to
> > go all the way down to this level of synchronicity for that?
>
> Well, we've tossed that around many times but never implemented it.
> Once you get into the details it gets a bit nasty. Need to keep the
> dirtiness state in the VFS (or fs) inode, and going backwards from a
> plain old buffer_head at commit time isn't possible. We usually
> tempfixed the problem by adding increasingly fancy ways of not doing
> the atime update at all.
given that this is the write path, I was assuming this was mtime rather
than atime; doesn't change your answer though.
>
> Of course, fixing this running-vs-committing contention point would
> fix a lot more things than just atime updates.
Yes, clearly. It's waaay above my pay grade to hack on, though; JBD is
one of those places in the kernel where fundamental changes scare
me ;-(
>
> > (I also randomly wonder if we, in the write path, dirty the inode
> > twice, once for size and once for time, and if we then also reserve
> > two slots in the journal for that.....
>
> That shouldn't be the case - once we have write access to the buffer
> it remains freely modifiable for the rest of the transaction period.
> I think.
I hope you're right, otherwise we'd always hit this: once for the size
change, then blocking for the mtime. That would thoroughly suck; so
much so that you just must be right.
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Thu, 2 Oct 2008 22:00:40 -0700 Arjan van de Ven <[email protected]> wrote:
> > Of course, fixing this running-vs-committing contention point would
> > fix a lot more things than just atime updates.
>
> yes clearly. It's waaay above my paygrade to hack on though; JBD is one
> of those places in the kernel that scare me for doing fundamental
> changes ;-(
If someone has an hour or so to kill, we can pretend to fix it:
fs/ext3/super.c | 12 +++++++++++-
fs/jbd/transaction.c | 13 +++++++++++--
include/linux/ext3_fs.h | 1 +
include/linux/jbd.h | 1 +
4 files changed, 24 insertions(+), 3 deletions(-)
diff -puN fs/ext3/super.c~a fs/ext3/super.c
--- a/fs/ext3/super.c~a
+++ a/fs/ext3/super.c
@@ -617,6 +617,8 @@ static int ext3_show_options(struct seq_
seq_puts(seq, ",barrier=1");
if (test_opt(sb, NOBH))
seq_puts(seq, ",nobh");
+ if (test_opt(sb, AKPM))
+ seq_puts(seq, ",akpm");
if (test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_JOURNAL_DATA)
seq_puts(seq, ",data=journal");
@@ -757,7 +759,7 @@ enum {
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
- Opt_grpquota
+ Opt_grpquota, Opt_akpm,
};
static match_table_t tokens = {
@@ -808,6 +810,7 @@ static match_table_t tokens = {
{Opt_usrquota, "usrquota"},
{Opt_barrier, "barrier=%u"},
{Opt_resize, "resize"},
+ {Opt_akpm, "akpm"},
{Opt_err, NULL},
};
@@ -1097,6 +1100,9 @@ set_qf_format:
set_opt(sbi->s_mount_opt, QUOTA);
set_opt(sbi->s_mount_opt, GRPQUOTA);
break;
+ case Opt_akpm:
+ set_opt(sbi->s_mount_opt, AKPM);
+ break;
case Opt_noquota:
if (sb_any_quota_enabled(sb) ||
sb_any_quota_suspended(sb)) {
@@ -1986,6 +1992,10 @@ static void ext3_init_journal_params(str
journal->j_flags |= JFS_BARRIER;
else
journal->j_flags &= ~JFS_BARRIER;
+ if (test_opt(sb, AKPM))
+ journal->j_flags |= JFS_AKPM;
+ else
+ journal->j_flags &= ~JFS_AKPM;
spin_unlock(&journal->j_state_lock);
}
diff -puN include/linux/ext3_fs.h~a include/linux/ext3_fs.h
--- a/include/linux/ext3_fs.h~a
+++ a/include/linux/ext3_fs.h
@@ -380,6 +380,7 @@ struct ext3_inode {
#define EXT3_MOUNT_QUOTA 0x80000 /* Some quota option set */
#define EXT3_MOUNT_USRQUOTA 0x100000 /* "old" user quota */
#define EXT3_MOUNT_GRPQUOTA 0x200000 /* "old" group quota */
+#define EXT3_MOUNT_AKPM 0x400000
/* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
diff -puN include/linux/jbd.h~a include/linux/jbd.h
--- a/include/linux/jbd.h~a
+++ a/include/linux/jbd.h
@@ -816,6 +816,7 @@ struct journal_s
#define JFS_FLUSHED 0x008 /* The journal superblock has been flushed */
#define JFS_LOADED 0x010 /* The journal superblock has been loaded */
#define JFS_BARRIER 0x020 /* Use IDE barriers */
+#define JFS_AKPM 0x040
/*
* Function declarations for the journaling transaction and buffer
diff -puN fs/jbd/transaction.c~a fs/jbd/transaction.c
--- a/fs/jbd/transaction.c~a
+++ a/fs/jbd/transaction.c
@@ -537,6 +537,7 @@ do_get_write_access(handle_t *handle, st
int error;
char *frozen_buffer = NULL;
int need_copy = 0;
+ int locked;
if (is_handle_aborted(handle))
return -EROFS;
@@ -552,7 +553,14 @@ repeat:
/* @@@ Need to check for errors here at some point. */
- lock_buffer(bh);
+ locked = 0;
+ if (journal->j_flags & JFS_AKPM) {
+ if (trylock_buffer(bh))
+ locked = 1; /* lolz */
+ } else {
+ lock_buffer(bh);
+ locked = 1;
+ }
jbd_lock_bh_state(bh);
/* We now hold the buffer lock so it is safe to query the buffer
@@ -591,7 +599,8 @@ repeat:
jbd_unexpected_dirty_buffer(jh);
}
- unlock_buffer(bh);
+ if (locked)
+ unlock_buffer(bh);
error = -EROFS;
if (is_handle_aborted(handle)) {
_
(might emit nasty warnings or assertion failures which will need to be
disabled)
Mount a junk partition with `-oakpm' and run some benchmarks. If the
results are "wow" then it's worth spending time on. If the results are
"meh" then we can not bother..
On Thu, 2 Oct 2008 22:24:38 -0700
Andrew Morton <[email protected]> wrote:
>
> Mount a junk partition with `-oakpm' and run some benchmarks. If the
> results are "wow" then it's worth spending time on. If the results
> are "meh" then we can not bother..
>
I've been running with this for a few hours now.
While I can't say IO is perfectly smooth, it does feel a lot better,
and latencytop isn't showing me the huge latency spikes in the mail
client either.
(There's a lot of little stuff, but not the 4+ seconds.)
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Dave Chinner wrote:
> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
>> Sounds like you need a priority class besides sync and async.
>
> There's BIO_META now as well, which I was testing at the same time
> as RT priority. Marking all the metadata I/O as BIO_META did help,
> but once again I never got to determining if that was a result of
> the different tagging or the priority increase.
What exactly do you want META to mean? Strict prioritisation over
all other non-META requests, or just more frequent and/or larger
dispatches? Should META requests be sorted?
> However, given that only CFQ understand BIO_META, I suspect that
> changing the way XFS uses BIO_SYNC to be a combination of BIO_META
> and BIO_SYNC would cause significant regressions on other
> schedulers.....
That shouldn't be a problem. noop doesn't care about any of that stuff,
and deadline doesn't care about BIO_SYNC (more on that below). If the
bios that use META are a subset of those that currently use SYNC, then
we can temporarily change AS to treat META and SYNC equally. Only CFQ
would change in behaviour.
So deadline should probably support BIO_SYNC... below is a patch to do
that. It doesn't have much of an effect on postmark or compilebench on
a single spindle, but I'm guessing that's not the workload or hardware
that is expected to benefit.
---
Subject: [PATCH] deadline-iosched: support SYNC bio/request flag
Support sync/async requests in deadline rather than read/write, as is
done in AS and CFQ.
Signed-off-by: Aaron Carroll <[email protected]>
---
block/deadline-iosched.c | 63 ++++++++++++++++++++++++---------------------
1 files changed, 34 insertions(+), 29 deletions(-)
diff --git a/block/deadline-iosched.c b/block/deadline-iosched.c
index 342448c..b2cfd47 100644
--- a/block/deadline-iosched.c
+++ b/block/deadline-iosched.c
@@ -23,6 +23,11 @@ static const int writes_starved = 2; /* max times reads can starve a write */
static const int fifo_batch = 16; /* # of sequential requests treated as one
by the above parameters. For throughput. */
+enum {
+ REQ_ASYNC,
+ REQ_SYNC,
+};
+
struct deadline_data {
/*
* run time data
@@ -53,7 +58,7 @@ struct deadline_data {
static void deadline_move_request(struct deadline_data *, struct request *);
-#define RQ_RB_ROOT(dd, rq) (&(dd)->sort_list[rq_data_dir((rq))])
+#define RQ_RB_ROOT(dd, rq) (&(dd)->sort_list[rq_is_sync((rq))])
/*
* get the request after `rq' in sector-sorted order
@@ -86,7 +91,7 @@ retry:
static inline void
deadline_del_rq_rb(struct deadline_data *dd, struct request *rq)
{
- const int data_dir = rq_data_dir(rq);
+ const int data_dir = rq_is_sync(rq);
if (dd->next_rq[data_dir] == rq)
dd->next_rq[data_dir] = deadline_latter_request(rq);
@@ -101,7 +106,7 @@ static void
deadline_add_request(struct request_queue *q, struct request *rq)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int data_dir = rq_data_dir(rq);
+ const int data_dir = rq_is_sync(rq);
deadline_add_rq_rb(dd, rq);
@@ -206,10 +211,10 @@ deadline_move_to_dispatch(struct deadline_data *dd, struct request *rq)
static void
deadline_move_request(struct deadline_data *dd, struct request *rq)
{
- const int data_dir = rq_data_dir(rq);
+ const int data_dir = rq_is_sync(rq);
- dd->next_rq[READ] = NULL;
- dd->next_rq[WRITE] = NULL;
+ dd->next_rq[REQ_SYNC] = NULL;
+ dd->next_rq[REQ_ASYNC] = NULL;
dd->next_rq[data_dir] = deadline_latter_request(rq);
dd->last_sector = rq->sector + rq->nr_sectors;
@@ -245,18 +250,18 @@ static inline int deadline_check_fifo(struct deadline_data *dd, int ddir)
static int deadline_dispatch_requests(struct request_queue *q, int force)
{
struct deadline_data *dd = q->elevator->elevator_data;
- const int reads = !list_empty(&dd->fifo_list[READ]);
- const int writes = !list_empty(&dd->fifo_list[WRITE]);
+ const int reads = !list_empty(&dd->fifo_list[REQ_SYNC]);
+ const int writes = !list_empty(&dd->fifo_list[REQ_ASYNC]);
struct request *rq;
int data_dir;
/*
* batches are currently reads XOR writes
*/
- if (dd->next_rq[WRITE])
- rq = dd->next_rq[WRITE];
+ if (dd->next_rq[REQ_ASYNC])
+ rq = dd->next_rq[REQ_ASYNC];
else
- rq = dd->next_rq[READ];
+ rq = dd->next_rq[REQ_SYNC];
if (rq) {
/* we have a "next request" */
@@ -276,12 +281,12 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
*/
if (reads) {
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[READ]));
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[REQ_SYNC]));
if (writes && (dd->starved++ >= dd->writes_starved))
goto dispatch_writes;
- data_dir = READ;
+ data_dir = REQ_SYNC;
goto dispatch_find_request;
}
@@ -292,11 +297,11 @@ static int deadline_dispatch_requests(struct request_queue *q, int force)
if (writes) {
dispatch_writes:
- BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[WRITE]));
+ BUG_ON(RB_EMPTY_ROOT(&dd->sort_list[REQ_ASYNC]));
dd->starved = 0;
- data_dir = WRITE;
+ data_dir = REQ_ASYNC;
goto dispatch_find_request;
}
@@ -338,16 +343,16 @@ static int deadline_queue_empty(struct request_queue *q)
{
struct deadline_data *dd = q->elevator->elevator_data;
- return list_empty(&dd->fifo_list[WRITE])
- && list_empty(&dd->fifo_list[READ]);
+ return list_empty(&dd->fifo_list[REQ_ASYNC])
+ && list_empty(&dd->fifo_list[REQ_SYNC]);
}
static void deadline_exit_queue(elevator_t *e)
{
struct deadline_data *dd = e->elevator_data;
- BUG_ON(!list_empty(&dd->fifo_list[READ]));
- BUG_ON(!list_empty(&dd->fifo_list[WRITE]));
+ BUG_ON(!list_empty(&dd->fifo_list[REQ_SYNC]));
+ BUG_ON(!list_empty(&dd->fifo_list[REQ_ASYNC]));
kfree(dd);
}
@@ -363,12 +368,12 @@ static void *deadline_init_queue(struct request_queue *q)
if (!dd)
return NULL;
- INIT_LIST_HEAD(&dd->fifo_list[READ]);
- INIT_LIST_HEAD(&dd->fifo_list[WRITE]);
- dd->sort_list[READ] = RB_ROOT;
- dd->sort_list[WRITE] = RB_ROOT;
- dd->fifo_expire[READ] = read_expire;
- dd->fifo_expire[WRITE] = write_expire;
+ INIT_LIST_HEAD(&dd->fifo_list[REQ_SYNC]);
+ INIT_LIST_HEAD(&dd->fifo_list[REQ_ASYNC]);
+ dd->sort_list[REQ_SYNC] = RB_ROOT;
+ dd->sort_list[REQ_ASYNC] = RB_ROOT;
+ dd->fifo_expire[REQ_SYNC] = read_expire;
+ dd->fifo_expire[REQ_ASYNC] = write_expire;
dd->writes_starved = writes_starved;
dd->front_merges = 1;
dd->fifo_batch = fifo_batch;
@@ -403,8 +408,8 @@ static ssize_t __FUNC(elevator_t *e, char *page) \
__data = jiffies_to_msecs(__data); \
return deadline_var_show(__data, (page)); \
}
-SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[READ], 1);
-SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[WRITE], 1);
+SHOW_FUNCTION(deadline_read_expire_show, dd->fifo_expire[REQ_SYNC], 1);
+SHOW_FUNCTION(deadline_write_expire_show, dd->fifo_expire[REQ_ASYNC], 1);
SHOW_FUNCTION(deadline_writes_starved_show, dd->writes_starved, 0);
SHOW_FUNCTION(deadline_front_merges_show, dd->front_merges, 0);
SHOW_FUNCTION(deadline_fifo_batch_show, dd->fifo_batch, 0);
@@ -426,8 +431,8 @@ static ssize_t __FUNC(elevator_t *e, const char *page, size_t count) \
*(__PTR) = __data; \
return ret; \
}
-STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[READ], 0, INT_MAX, 1);
-STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[WRITE], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_read_expire_store, &dd->fifo_expire[REQ_SYNC], 0, INT_MAX, 1);
+STORE_FUNCTION(deadline_write_expire_store, &dd->fifo_expire[REQ_ASYNC], 0, INT_MAX, 1);
STORE_FUNCTION(deadline_writes_starved_store, &dd->writes_starved, INT_MIN, INT_MAX, 0);
STORE_FUNCTION(deadline_front_merges_store, &dd->front_merges, 0, 1, 0);
STORE_FUNCTION(deadline_fifo_batch_store, &dd->fifo_batch, 0, INT_MAX, 0);
--
1.6.0.2
On Thu, Oct 02, 2008 at 07:33:04AM -0700, Arjan van de Ven wrote:
> > I can whip up a patch which adds some markers which we could use to
> > find out more information about what is happening.
>
> interesting testcase of the markers concept.
Sorry for the delay; I ran into a minor bug in the Modules.marker
generation support that prevented Systemtap from being able to use
markers. (It had been busted since 2.6.27-rc1, so I guess that gives us
some sense of how often developers use Systemtap. :-)
It looks like Andrew's workaround helps you out, but if you're
willing to run this while your mail reader is running, and correlate
it with the large latency spikes, we might get some interesting
results.
Anyway, here's the patch (against ext4, although could pretty easily
move this to ext3 --- but you can mount an ext3 filesystem as ext4,
although for the moment you do have to run the command "tune2fs -E
test_fs /dev/hdXX" first), and a sample systemtap script. You'll also
need the markers patch, and of course the latest systemtap from their
git repository.
- Ted
P.S. Make sure you install systemtap in /usr/local, instead of trying
to run it out of the build tree. See the following for an interesting
report from Roland McGrath about what happens if you make this mistake:
http://sources.redhat.com/ml/systemtap/2008-q3/msg00809.html
I really think Systemtap has a lot of potential if only they could get
past some "minor usability concerns". So one thing that I think would
really help the Systemtap folks is if more people gave them more, ah,
"constructive feedback" to their mailing list.
On Sat, Oct 04, 2008 at 10:12:52AM -0400, Theodore Tso wrote:
> Anyway, here's the patch (against ext4, although could pretty easily
> move this to ext3 --- but you can mount an ext3 filesystem as ext4,
> although for the moment you do have to run the command "tune2fs -E
> test_fs /dev/hdXX" first) [...]
Shouldn't he also be sure to mount the FS with the "noextents" option,
or has ext4 learned not to enable extents if they're not already
enabled?
--
Joseph Fannin
[email protected]
On Sat, Oct 04, 2008 at 01:14:45PM -0400, Joseph Fannin wrote:
> On Sat, Oct 04, 2008 at 10:12:52AM -0400, Theodore Tso wrote:
>
> > Anyway, here's the patch (against ext4, although could pretty easily
> > move this to ext3 --- but you can mount an ext3 filesystem as ext4,
> > although for the moment you do have to run the command "tune2fs -E
> > test_fs /dev/hdXX" first) [...]
>
> Shouldn't he also be sure to mount the FS with the "noextents" option,
> or has ext4 learned not to enable extents if they're not already
> enabled?
The latter; ext4 will no longer enable extents if not already enabled
as of commit e4079a11, dated July 11th. That means 2.6.26 with any of
the ext4 patchsets (highly recommended if you want to use ext4) or any
mainline release post 2.6.26-git4/2.6.27-rc1.
A number of the ext4 tutorials/howtos available on the web are out of
date, including the one on IBM developerWorks. The best place to get
up-to-date information on using ext4 is
http://ext4.wiki.kernel.org.
Regards,
- Ted
P.S. Note that there are advantages to using the ext4 code base even
on an unconverted filesystem that keeps the ext3 on-disk format. Even
if you don't enable extents or any other ext4-specific feature, the
delayed allocation and soon-to-be-added inode table readahead features
(in the 2.6.27-rc8-ext4-2 patchset, to be pushed during the next merge
window) will improve performance even on an "ext3 filesystem".
On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> Dave Chinner wrote:
>> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
>>> Sounds like you need a priority class besides sync and async.
>>
>> There's BIO_META now as well, which I was testing at the same time
>> as RT priority. Marking all the metadata I/O as BIO_META did help,
>> but once again I never got to determining if that was a result of
>> the different tagging or the priority increase.
>
> What exactly do you want META to mean? Strict prioritisation over
> all other non-META requests, or just more frequent and/or larger
> dispatches? Should META requests be sorted?
The real question is "what was it supposed to mean"? AFAICT, it was
added to a couple of filesystems to be used to tag superblock read
I/O. Why - I don't know - there's a distinct lack of documentation
surrounding these bio flags. :/
Realistically, I'm not sure that having a separate queue for
BIO_META will buy us anything, given that noop is quite often the
fastest scheduler for XFS because it enables interleaved metadata
I/O to be merged with data I/O. Like I said, I was not able to spend
the time to determine exactly how BIO_META affected I/O patterns, so
I can't really comment on whether it is really necessary or not.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Oct 06 2008, Dave Chinner wrote:
> On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > Dave Chinner wrote:
> >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> >>> Sounds like you need a priority class besides sync and async.
> >>
> >> There's BIO_META now as well, which I was testing at the same time
> >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> >> but once again I never got to determining if that was a result of
> >> the different tagging or the priority increase.
> >
> > What exactly do you want META to mean? Strict prioritisation over
> > all other non-META requests, or just more frequent and/or larger
> > dispatches? Should META requests be sorted?
>
> The real question is "what was it supposed to mean"? AFAICT, it was
> added to a couple of filesystems to be used to tag superblock read
> I/O. Why - I don't know - there's a distinct lack of documentation
> surrounding these bio flags. :/
It was added to be able to differentiate between data and meta data IO
when using blktrace, that is all.
> Realistically, I'm not sure that having a separate queue for
> BIO_META will buy us anything, given that noop is quite often the
> fastest scheduler for XFS because it enables interleaved metadata
> I/O to be merged with data I/O. Like I said, I was not able to spend
> the time to determine exactly how BIO_META affected I/O patterns, so
> I can't really comment on whether it is really necessary or not.
There's no separate queue for meta data IO anywhere. CFQ will give
_slight_ preference to meta data IO as a side effect, preferring the
meta IO over otherwise-equal IO when deciding what to serve next in
the same queue. And it will not allow a meta data IO to be preempted
by a data IO.
So using meta should not yield any important boosts by itself.
--
Jens Axboe
On Tue, Oct 07, 2008 at 08:06:53PM +0200, Jens Axboe wrote:
> On Mon, Oct 06 2008, Dave Chinner wrote:
> > On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > > Dave Chinner wrote:
> > >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> > >>> Sounds like you need a priority class besides sync and async.
> > >>
> > >> There's BIO_META now as well, which I was testing at the same time
> > >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> > >> but once again I never got to determining if that was a result of
> > >> the different tagging or the priority increase.
> > >
> > > What exactly do you want META to mean? Strict prioritisation over
> > > all other non-META requests, or just more frequent and/or larger
> > > dispatches? Should META requests be sorted?
> >
> > The real question is "what was it supposed to mean"? AFAICT, it was
> > added to a couple of filesystems to be used to tag superblock read
> > I/O. Why - I don't know - there's a distinct lack of documentation
> > surrounding these bio flags. :/
>
> It was added to be able to differentiate between data and meta data IO
> when using blktrace, that is all.
Ok.
> > Realistically, I'm not sure that having a separate queue for
> > BIO_META will buy us anything, given that noop is quite often the
> > fastest scheduler for XFS because it enables interleaved metadata
> > I/O to be merged with data I/O. Like I said, I was not able to spend
> > the time to determine exactly how BIO_META affected I/O patterns, so
> > I can't really comment on whether it is really necessary or not.
>
> There's no separate queue for meta data IO anywhere. CFQ will give
> _slight_ preference to meta data IO as a side effect, preferring the
> meta IO over otherwise-equal IO when deciding what to serve next in
> the same queue. And it will not allow a meta data IO to be preempted
> by a data IO.
>
> So using meta should not yield any important boosts by itself.
Which means the performance increase I saw on CFQ was a result of
removing the BIO_SYNC tagging "optimisation" XFS uses for metadata,
not of adding BIO_META.....
Could you please document what these tags actually mean and do,
so that other people don't get as confused about this stuff as I
did....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Oct 02, 2008 at 10:24:38PM -0700, Andrew Morton wrote:
>
> Mount a junk partition with `-oakpm' and run some benchmarks. If the
> results are "wow" then it's worth spending time on. If the results are
> "meh" then we can not bother..
>
I've ported the patch to the ext4 filesystem, and dropped it into the
unstable portion of the ext4 patch queue. If we can get someone (hi,
Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we
will find that it makes quite a large difference on that particular
benchmark, since it is fsync-heavy to force a large number of
transactions, and the creation of the inodes should cause multiple
blocks to be entangled between the current and committing
transactions.
- Ted
ext4: akpm's locking hack to fix locking delays
This is a port of the following patch from Andrew Morton to ext4:
http://lkml.org/lkml/2008/10/3/22
This fixes a major contention problem in do_get_write_access() when a
buffer is modified in both the current and committing transaction.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Cc: [email protected]
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index f46a513..23822fb 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -540,6 +540,7 @@ do { \
#define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT 0x1000000 /* Journal Async Commit */
#define EXT4_MOUNT_I_VERSION 0x2000000 /* i_version support */
#define EXT4_MOUNT_DELALLOC 0x8000000 /* Delalloc support */
+#define EXT4_MOUNT_AKPM_LOCK_HACK 0x10000000 /* akpm lock hack */
/* Compatibility, for having both ext2_fs.h and ext4_fs.h included at once */
#ifndef _LINUX_EXT2_FS_H
#define clear_opt(o, opt) o &= ~EXT4_MOUNT_##opt
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 67ebefb..f4e7157 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -752,6 +752,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
seq_puts(seq, ",journal_async_commit");
if (test_opt(sb, NOBH))
seq_puts(seq, ",nobh");
+ if (test_opt(sb, AKPM_LOCK_HACK))
+ seq_puts(seq, ",akpm_lock_hack");
if (!test_opt(sb, EXTENTS))
seq_puts(seq, ",noextents");
if (test_opt(sb, I_VERSION))
@@ -911,7 +913,7 @@ enum {
Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
Opt_grpquota, Opt_extents, Opt_noextents, Opt_i_version,
Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_delalloc, Opt_nodelalloc,
- Opt_inode_readahead_blks
+ Opt_inode_readahead_blks, Opt_akpm_lock_hack,
};
static match_table_t tokens = {
@@ -973,6 +975,7 @@ static match_table_t tokens = {
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
+ {Opt_akpm_lock_hack, "akpm_lock_hack"},
{Opt_err, NULL},
};
@@ -1382,6 +1385,9 @@ set_qf_format:
return 0;
sbi->s_inode_readahead_blks = option;
break;
+ case Opt_akpm_lock_hack:
+ set_opt(sbi->s_mount_opt, AKPM_LOCK_HACK);
+ break;
default:
printk(KERN_ERR
"EXT4-fs: Unrecognized mount option \"%s\" "
@@ -2534,6 +2540,10 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
journal->j_flags |= JBD2_BARRIER;
else
journal->j_flags &= ~JBD2_BARRIER;
+ if (test_opt(sb, AKPM_LOCK_HACK))
+ journal->j_flags |= JBD2_LOCK_HACK;
+ else
+ journal->j_flags &= ~JBD2_LOCK_HACK;
spin_unlock(&journal->j_state_lock);
}
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e5d5405..32c288a 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -546,6 +546,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
int error;
char *frozen_buffer = NULL;
int need_copy = 0;
+ int locked = 0;
if (is_handle_aborted(handle))
return -EROFS;
@@ -561,7 +562,13 @@ repeat:
/* @@@ Need to check for errors here at some point. */
- lock_buffer(bh);
+ if (journal->j_flags & JBD2_LOCK_HACK) {
+ if (trylock_buffer(bh))
+ locked = 1; /* lolz */
+ } else {
+ lock_buffer(bh);
+ locked = 1;
+ }
jbd_lock_bh_state(bh);
/* We now hold the buffer lock so it is safe to query the buffer
@@ -600,7 +607,8 @@ repeat:
jbd_unexpected_dirty_buffer(jh);
}
- unlock_buffer(bh);
+ if (locked)
+ unlock_buffer(bh);
error = -EROFS;
if (is_handle_aborted(handle)) {
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 66c3499..c614dae 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -967,6 +967,7 @@ struct journal_s
#define JBD2_FLUSHED 0x008 /* The journal superblock has been flushed */
#define JBD2_LOADED 0x010 /* The journal superblock has been loaded */
#define JBD2_BARRIER 0x020 /* Use IDE barriers */
+#define JBD2_LOCK_HACK 0x040 /* akpm's locking hack */
/*
* Function declarations for the journaling transaction and buffer
On Wed, 8 Oct 2008 23:00:54 -0400 Theodore Tso <[email protected]> wrote:
> On Thu, Oct 02, 2008 at 10:24:38PM -0700, Andrew Morton wrote:
> >
> > Mount a junk partition with `-oakpm' and run some benchmarks. If the
> > results are "wow" then it's worth spending time on. If the results are
> > "meh" then we can not bother..
> >
>
> I've ported the patch to the ext4 filesystem, and dropped it into the
> unstable portion of the ext4 patch queue.
Useful, thanks.
> If we can get someone (hi,
> Ric!) to run fs_mark with and without -o akpm_lock_hack, I suspect we
> will find that it makes quite a large difference on that particular
> benchmark, since it is fsync-heavy to force a large number of
> transactions, and the creation of the inodes should cause multiple
> blocks that will be entangled between the current and committing
> transactions.
>
fsync? Yes, I suppose so. Repeated modifications to the same
inodes/directories/bitmap blocks/etc will hurt.
A quick test on other quantified workloads would be useful too. If the
results look promising then someone(tm) will need to work out how to
fix this for real.
>
> ext4: akpm's locking hack to fix locking delays
>
> This is a port of the following patch from Andrew Morton to ext4:
>
> http://lkml.org/lkml/2008/10/3/22
>
> This fixes a major contention problem in do_get_write_access() when a
> buffer is modified in both the current and committing transaction.
More specifically: "under checkpoint writeback from the committing
transaction when the current transaction requests write access".
On Wed, Oct 08 2008, Dave Chinner wrote:
> On Tue, Oct 07, 2008 at 08:06:53PM +0200, Jens Axboe wrote:
> > On Mon, Oct 06 2008, Dave Chinner wrote:
> > > On Sat, Oct 04, 2008 at 05:45:00PM +1000, Aaron Carroll wrote:
> > > > Dave Chinner wrote:
> > > >> On Thu, Oct 02, 2008 at 05:32:04PM +0200, Bodo Eggert wrote:
> > > >>> Sounds like you need a priority class besides sync and async.
> > > >>
> > > >> There's BIO_META now as well, which I was testing at the same time
> > > >> as RT priority. Marking all the metadata I/O as BIO_META did help,
> > > >> but once again I never got to determining if that was a result of
> > > >> the different tagging or the priority increase.
> > > >
> > > > What exactly do you want META to mean? Strict prioritisation over
> > > > all other non-META requests, or just more frequent and/or larger
> > > > dispatches? Should META requests be sorted?
> > >
> > > The real question is "what was it supposed to mean"? AFAICT, it was
> > > added to a couple of filesystems to be used to tag superblock read
> > > I/O. Why - I don't know - there's a distinct lack of documentation
> > > surrounding these bio flags. :/
> >
> > It was added to be able to differentiate between data and meta data IO
> > when using blktrace, that is all.
>
> Ok.
>
> > > Realistically, I'm not sure that having a separate queue for
> > > BIO_META will buy us anything, given that noop is quite often the
> > > fastest scheduler for XFS because it enables interleaved metadata
> > > I/O to be merged with data I/O. Like I said, I was not able to spend
> > > the time to determine exactly how BIO_META affected I/O patterns, so
> > > I can't really comment on whether it is really necessary or not.
> >
> > There's no separate queue for meta data IO anywhere. CFQ will give
> > _slight_ preference to meta data IO as a side effect, preferring the
> > meta data IO over otherwise-equal IO when choosing what to serve next
> > in the same queue.
> > And it will not allow preemption of a meta data IO for a data IO.
> >
> > So using meta should not yield any important boosts by itself.
>
> Which means that performance increase I saw on CFQ was a result of
> removing the BIO_SYNC tagging "optimisation" XFS uses for metadata,
> not from adding BIO_META.....
>
> Could you please document what these tags actually mean and do
> so that other people don't get as confused as me about this
> stuff....
Sure, I've added such a patch for 2.6.28.
--
Jens Axboe