Hi,
we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
default .config setting was CONFIG_PREEMPT=y. To get straight to the
point: according to our measurements, this results in severe performance
degradation with our typical and some artificial workload. By "severe" I
mean this:
---
1. "production" load(*), kernel-compiles (make -j5, in both cases
kernels with CONFIG_PREEMPT=n were compiled), dual Xeon 3.06GHz/4GB RAM,
SMP+HT.
2.6.4 with CONFIG_PREEMPT=y:
real 4m41.741s
user 6m13.631s
sys 0m43.729s
2.6.4 with CONFIG_PREEMPT=n:
real 2m20.424s
user 5m54.498s
sys 0m37.297s
The slowness during the compilation was very noticeable on the console.
2. artificial load(**), kernel-compiles (make -j5), single 2.8GHz P4
with SMP+HT:
2.6.4 with CONFIG_PREEMPT=y:
real 7m40.933s
user 5m42.454s
sys 0m28.114s
2.6.4 with CONFIG_PREEMPT=n:
real 7m13.735s
user 4m56.266s
sys 0m35.495s
3. With no load at all during the benchmarking, no noticeable slowdown
was observed.
(*) busy webserver doing apache/mod_perl stuff, ~40 static/~10 dynamic
hits/sec, ~20% CPU usage; typical load has no significant
jumps/anomalies in this time frame (it's also one of 3 boxes in a
cluster with a hardware load balancer)
(**) "ab -n 100000 -c2 <some simple CGI script>" from another box
running at the same time
---
Now, we can go into details about our methodology and what else may have
been borked about our boxes, but what I'm wondering about is (assuming
that we didn't mess things up somewhere and other people can confirm
this!): why in the world would anyone want CONFIG_PREEMPT=y as a default
setting when it has such an impact on performance in actual production
environments? Is 2.6 intended to become the "Desktop Linux" code path?
If not, shouldn't there be a big warning sticker somewhere that says
"DON'T EVEN THINK ABOUT KEEPING THIS DEFAULT SETTING UNLESS ALL YOU WANT
TO DO IS LISTEN TO MP3 AUDIO WHILE USING XFREE86!"? I've been skimming
over older lkml postings about performance problems with 2.6.x and many
of them, while obviously being about systems with CONFIG_PREEMPT=y,
don't even mention the fact that the degradation might be because of
that setting. No one seems to know or care. Why?
Regards,
Marinos
--
Dipl.-Ing. Marinos Yannikos, CEO
Preisvergleich Internet Services AG
Franzensbrückenstraße 8/2/16, A-1020 Wien
Tel./Fax: (+431) 5811609-52/-55
"Marinos J. Yannikos" <[email protected]> wrote:
>
> we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> default .config setting was CONFIG_PREEMPT=y. To get straight to the
> point: according to our measurements, this results in severe performance
> degradation with our typical and some artificial workload. By "severe" I
> mean this:
You're the first to report this. On a little 256MB P4-HT box here, running
2.6.5-rc1:
preempt:
make -j3 vmlinux 272.49s user 16.32s system 196% cpu 2:26.77 total
non-preempt:
make -j3 vmlinux 271.64s user 15.65s system 195% cpu 2:27.25 total
> If not, shouldn't there be a big warning sticker somewhere that says
> "DON'T EVEN THINK ABOUT KEEPING THIS DEFAULT SETTING UNLESS ALL YOU WANT
> TO DO IS LISTEN TO MP3 AUDIO WHILE USING XFREE86!"?
Dude, chill ;) Something seems to be pretty busted there. Is the machine
swapping at all? Under any sort of memory stress? How does it compare
with 2.4 running the same workloads?
Suggest you run some simple IO benchmarks (straight dd, hdparm -t,
bonnie++, tiobench, etc) and compute-intensive tasks, etc. Try to narrow
down exactly what part of the kernel's operation is being impacted. Is it
context switches, I/O rates, disk fragmentation, etc, etc.
If you can distill the regression down to the simplest possible test then we
can identify its cause more easily.
On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> Hi,
>
> we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> default .config setting was CONFIG_PREEMPT=y. To get straight to the
> point: according to our measurements, this results in severe performance
> degradation with our typical and some artificial workload. By "severe" I
> mean this:
this is expected (see the below email; I predicted it in Mar 2000). Keep
preempt turned off always, it's useless. Worst of all we're now taking
spinlocks earlier than needed, and the preempt_count stuff isn't
optimized away by PREEMPT=n; once those bits are fixed too, it'll go
even faster.
preempt just wastes cpu with tons of branches in fast paths that should
take one cycle instead.
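(For concreteness, roughly what those branches look like, simplified from
the 2.6-era include/linux/preempt.h; with PREEMPT=y every spin_unlock()
ends up going through preempt_enable():)

/* simplified sketch; the real macros split this into
 * preempt_enable_no_resched()/preempt_check_resched() */
#define preempt_disable() \
do { \
	inc_preempt_count(); \
	barrier(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	dec_preempt_count(); \
	/* the extra branch taken in every unlock/enable fast path */ \
	if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
		preempt_schedule(); \
} while (0)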
Takashi Iwai did lots of research on the preempt vs lowlatency and
he found that preempt buys nothing and he confirmed my old theories (I
always advocated against preempt; in fact I still advocate for "enabling"
preempt on demand during the copy-user only, that is, enabling preempt
on demand rather than disabling it on demand as happens now, with the
bloat that generates). In fact 2.4-aa has a lower max latency than 2.6
stock with preempt enabled. About my old idea of enabling preempt on
demand (i.e. the opposite of preempt) in the copy-user: we have to check
for reschedule there anyway, so we might as well enable preempt, and that
would still keep the kernel simple and efficient. This way we would bound
the latency during the bulk work (especially important with a bigger page
size or with page clustering).
These fixes from Takashi Iwai bring 2.6 back in line with 2.4. I
suggested using EIP dumps from interrupts to get the hotspots; he
promptly used the RTC for that and fixed up all the spots. Great
job he did, since now we have a very low worst-case sched latency in 2.6
too:
--- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
+++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
@@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
unlock_page(page);
}
page_cache_release(page);
+ cond_resched();
spin_lock(&mapping->page_lock);
}
/*
--- linux/fs/super.c-dist 2004-03-09 19:28:58.482270871 +0100
+++ linux/fs/super.c 2004-03-09 19:29:05.000792950 +0100
@@ -356,6 +356,7 @@ void sync_supers(void)
{
struct super_block * sb;
restart:
+ cond_resched();
spin_lock(&sb_lock);
sb = sb_entry(super_blocks.next);
while (sb != sb_entry(&super_blocks))
--- linux/fs/fs-writeback.c-dist 2004-03-09 19:15:25.237752504 +0100
+++ linux/fs/fs-writeback.c 2004-03-09 19:16:37.630330614 +0100
@@ -360,6 +360,7 @@ writeback_inodes(struct writeback_contro
}
spin_unlock(&sb_lock);
spin_unlock(&inode_lock);
+ cond_resched();
}
/*
I'm actually for dropping preempt from the kernel and trying to
implement my old idea of enabling preempt on demand in a few
latency-critical spots. So instead of worrying about taking spinlocks too
early and calling preempt_disable, the only thing the kernel would do
w.r.t. preempt is:
enter_kernel:
preempt_enable()
copy_user()
preempt_disable()
exit_kernel:
I think it's perfectly acceptable to do the above (i.e. the opposite of
preempt), whereas I think preempt itself is overkill.
More details on this in the old posts (I recall I even did a quick hack
to see if it worked; I'm surprised how old this email is, but apparently
it's still very relevant):
http://www.ussg.iu.edu/hypermail/linux/kernel/0003.1/0998.html
"With the fact we'll have to bloat the fast path (a fast lock like the
above one and all the spinlocks will need an additional
forbid_preempt(smp_processor_id()) the preemtable kernel it's not likely
to be a win.
The latency will decrease without drpping throughtput only in code that
runs for long time with none lock held like the copy_user stuff. That
stuff will run at the same speed as now but with zero scheduler latency.
The _lose_ instead will happen in _all_ the code that grabs any kind of
spinlock because spin_lock/spin_unlock will be slower and the latency
won't decrease for that stuff.
But now by thinking at that stuff I have an idea! Why instead of making
the kernel preemtable we take the other way around? So why instead of
having to forbid scheduling in locked regions, we don't simply allow
rescheduling in some piece of code that we know that will benefit by the
preemtable thing?
The kernel won't be preemtable this way (so we'll keep throughtput in
the locking fast path) but we could mark special section of kernel like
the copy user as preemtable.
It will be quite easy:
static atomic_t cpu_preemptable[NR_CPUS] = { [0 ... NR_CPUS-1] =
ATOMIC_INIT(0), };
#define preemptable_copy_user(...) \
do { \
atomic_inc(&cpu_preemptable[smp_processor_id()]); \
copy_user(...); \
atomic_dec(&cpu_preemptable[smp_processor_id()]); \
} while (0)
[..]
"
I still think after 4 years that such an idea is more appealing than
preempt, and the numbers are starting to prove me right.
Andrea Arcangeli <[email protected]> wrote:
>
> On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > Hi,
> >
> > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > point: according to our measurements, this results in severe performance
> > degradation with our typical and some artificial workload. By "severe" I
> > mean this:
>
> this is expected (see the below email; I predicted it in Mar 2000).
Incorrectly.
> keep preempt turned off always, it's useless.
Preempt is overrated. The infrastructure which it has introduced has been
useful for detecting locking bugs.
It has been demonstrated that preempt improves the average latency. But
not worst-case, because those paths tend to be under spinlock.
> Worst of all we're now taking spinlocks earlier than needed,
Where? CPU scheduler?
> and the preempt_count stuff isn't optimized away by PREEMPT=n,
It should be. If you see somewhere where it isn't, please tell us.
We unconditionally bump the preempt_count in kmap_atomic() so that we can
use atomic kmaps in read() and write(). This is why four concurrent
write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.
> preempt just wastes cpu with tons of branches in fast paths that should
> take one cycle instead.
I don't recall anyone demonstrating even a 1% impact from preemption. If
preemption was really causing slowdowns of this magnitude it would of
course have been noticed. Something strange has happened here and more
investigation is needed.
> ...
> > I still think after 4 years that such an idea is more appealing than
> > preempt, and the numbers are starting to prove me right.
The overhead of CONFIG_PREEMPT is quite modest. Measuring that is simple.
On Thu, Mar 18, 2004 at 01:50:04AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > > Hi,
> > >
> > > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > > point: according to our measurements, this results in severe performance
> > > degradation with our typical and some artificial workload. By "severe" I
> > > mean this:
> >
> > this is expected (see the below email; I predicted it in Mar 2000).
>
> Incorrectly.
>
> > keep preempt turned off always, it's useless.
>
> Preempt is overrated. The infrastructure which it has introduced has been
> useful for detecting locking bugs.
yes, I agree preempt is useful for debugging SMP code on UP systems, but
it's a debugging feature, so it should be disabled on production systems.
> It has been demonstrated that preempt improves the average latency. But
> not worst-case, because those paths tend to be under spinlock.
Sure, I agree it improves average latency; the problem is that there are
nearly no applications that care about the average, what matters is the
worst-case latency only.
> > Worst of all we're now taking spinlocks earlier than needed,
>
> Where? CPU scheduler?
Everywhere, see the kmaps, we spinlock before instead of spinlock after,
the scheduler, lots of places. I mean, people don't call
preempt_disable()
kmap_atomic
spin_lock
they do:
spin_lock
kmap_atomic
so they're effectively optimizing for PREEMPT=y and I don't think this
is optimal for the long term. One can argue the microscalability
slowdown isn't something to worry about, I certainly don't worry about
it too much either, it's more a bad coding habit to spinlock earlier
than needed to avoid preempt_disable.
> > and the preempt_count stuff isn't optimized away by PREEMPT=n,
>
> It should be. If you see somewhere where it isn't, please tell us.
the counter is definitely not optimized away, see:
#define inc_preempt_count() \
do { \
preempt_count()++; \
} while (0)
#define dec_preempt_count() \
do { \
preempt_count()--; \
} while (0)
#define preempt_count() (current_thread_info()->preempt_count)
those are running regardless of PREEMPT=n.
This is debugging code to catch basic preempt issues with
in_interrupt() and friends even with PREEMPT=n, but it wastes one
cacheline per CPU during irq handling.
> We unconditionally bump the preempt_count in kmap_atomic() so that we can
> use atomic kmaps in read() and write(). This is why four concurrent
> write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.
sorry, why should the atomic kmaps read the preempt_count? Are those ++
-- useful for anything more than debugging PREEMPT=y on a kernel
compiled with PREEMPT=n? I thought it was just debugging code with
PREEMPT=n.
I know why the atomic kmaps speed up write but I don't see how
preempt_count can help there when PREEMPT=n; the atomic kmaps are purely
per-cpu and one can't schedule anyway while holding those kmaps (no
matter if inc_preempt_count or not).
> > preempt just wastes cpu with tons of branches in fast paths that should
> > take one cycle instead.
>
> I don't recall anyone demonstrating even a 1% impact from preemption. If
> preemption was really causing slowdowns of this magnitude it would of
> course have been noticed. Something strange has happened here and more
> investigation is needed.
I'm also surprised the slowdown is so huge; maybe he tweaked
CONFIG_SLAB at the same time as PREEMPT? ;) Anyway, there is a slowdown,
and the whole point is that preempt doesn't improve the worst-case
latency at all.
> > ...
> > I still think after 4 years that such an idea is more appealing than
> > preempt, and the numbers are starting to prove me right.
>
> The overhead of CONFIG_PREEMPT is quite modest. Measuring that is simple.
It is quite modest, I agree, but there is an overhead and it doesn't
pay off.
BTW, with preempt enabled there is no guarantee that RCU can ever reach
a quiescent point, and secondly there is no guarantee that you will ever
be allowed to unplug a hotplug CPU, since again there's no guarantee of
reaching a quiescent point. Think of a kernel thread doing for (;;) (i.e.
math computations in the background): to avoid starving RCU the kernel
thread will have to call schedule() explicitly no matter if PREEMPT=y or
PREEMPT=n, again invalidating the point of preempt (the rcu tracking for
PREEMPT=y is also more expensive).
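(To restate that last point as a sketch; crunch_thread() and
do_some_math() are made-up placeholders for lock-free CPU-bound work:)

static int crunch_thread(void *unused)
{
	for (;;) {
		do_some_math();	/* hypothetical CPU-bound work, no locks held */
		/*
		 * As argued above: whether PREEMPT=y or PREEMPT=n, the
		 * thread has to yield explicitly, or RCU (and CPU
		 * hot-unplug) may never see this CPU go quiescent.
		 */
		cond_resched();
	}
	return 0;
}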
Note, the work you and the other preempt developers did with preempt was
great; it wouldn't have been possible to be certain that it wasn't
worthwhile until we had the thing working and fine-grained (i.e. in all
the in_interrupt etc..), and now we know it doesn't pay off. In turn I'm
going to try the explicit-preempt, that is, to explicitly enable preempt
in a few cpu-intensive kernel spots where we don't take locks (i.e.
copy-user), the original suggestion I made 4 years ago. I believe in such
places an explicit-preempt will work best since we already have to check
current->need_resched every few bytes, so adding a branch there should be
very worthwhile. Doing real preempt like now is overkill instead and
should be avoided IMHO.
On Thu, 2004-03-18 at 04:50, Andrew Morton wrote:
> I don't recall anyone demonstrating even a 1% impact from preemption. If
> preemption was really causing slowdowns of this magnitude it would of
> course have been noticed. Something strange has happened here and more
> investigation is needed.
>
> > ...
> > I still think after 4 years that such an idea is more appealing than
> > preempt, and the numbers are starting to prove me right.
>
> The overhead of CONFIG_PREEMPT is quite modest. Measuring that is simple.
Well, I reported an issue on my laptop several weeks ago where network
activity via my aironet wireless adapter would use 60-70% of the CPU but
only when PREEMPT was enabled. Looking back over the list I see other
similar issues with PREEMPT and various network card drivers (8139too
and ne2k show up), although most of those seem to be against preempt in
2.4.x not 2.6.
I think the user that started this thread was seeing significant
regressions during kernel compiles with PREEMPT enabled while the system
also has some additional load from Apache (perhaps with significant
network activity). I think there are cases like this where PREEMPT
seems to trip up.
I'll try to reproduce my issue with current generation kernels (last I
tested with PREEMPT was 2.6.1) and see if my problem is still there.
When I reported the issue last time no one seemed interested so I just
learned to disable preempt.
Later,
Tom
At Thu, 18 Mar 2004 07:03:58 +0100,
Andrea Arcangeli wrote:
>
> On Thu, Mar 18, 2004 at 05:00:01AM +0100, Marinos J. Yannikos wrote:
> > Hi,
> >
> > we upgraded a few production boxes from 2.4.x to 2.6.4 recently and the
> > default .config setting was CONFIG_PREEMPT=y. To get straight to the
> > point: according to our measurements, this results in severe performance
> > degradation with our typical and some artificial workload. By "severe" I
> > mean this:
>
> this is expected (see the below email; I predicted it in Mar 2000). Keep
> preempt turned off always, it's useless. Worst of all we're now taking
> spinlocks earlier than needed, and the preempt_count stuff isn't
> optimized away by PREEMPT=n; once those bits are fixed too, it'll go
> even faster.
>
> preempt just wastes cpu with tons of branches in fast paths that should
> take one cycle instead.
>
> Takashi Iwai did lots of research on the preempt vs lowlatency and
> he found that preempt buys nothing and he confirmed my old theories
well, i personally am not against the current preempt mechanism from
the viewpoint of the audio-processing purpose :) the implementation
is relatively clean and easy.
but i agree with Andrea that surely we can achieve almost the same
RT-performance even without preemption, i.e. with less preempt
overhead. it doesn't need to be the default.
(snip)
> These fixes from Takashi Iwai bring 2.6 back in line with 2.4. I
> suggested using EIP dumps from interrupts to get the hotspots; he
> promptly used the RTC for that and fixed up all the spots. Great
> job he did, since now we have a very low worst-case sched latency in 2.6
> too:
>
> --- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
> +++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
> @@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
> unlock_page(page);
> }
> page_cache_release(page);
> + cond_resched();
> spin_lock(&mapping->page_lock);
> }
> /*
the above one is the major source of RT-latency.
only this oneliner will reduce more than 90% of RT-latencies.
in my case with reiserfs, i got 0.4ms RT-latency with my test suite
(with athlon 2200+).
there is another point to be fixed in the reiserfs journal
transaction. then you'll get 0.1ms RT-latency without preemption.
for ext3, these two spots are relevant.
--- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
+++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
@@ -290,6 +290,9 @@ write_out_data_locked:
commit_transaction->t_sync_datalist = jh;
break;
}
+
+ if (need_resched())
+ break;
} while (jh != last_jh);
if (bufs || need_resched()) {
--- linux-2.6.4-8/fs/ext3/inode.c-dist 2004-03-18 02:33:38.000000000 +0100
+++ linux-2.6.4-8/fs/ext3/inode.c 2004-03-18 02:33:40.000000000 +0100
@@ -1987,6 +1987,7 @@ static void ext3_free_branches(handle_t
if (is_handle_aborted(handle))
return;
+ cond_resched();
if (depth--) {
struct buffer_head *bh;
int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);
i think the first one is needed for preemptive kernel, too.
with these patches, also 0.1-0.2ms RT-latency is achieved.
BTW, my measurement tool is found at
http://www.alsa-project.org/~iwai/latencytest-0.5.2.tar.gz
--
Takashi Iwai <[email protected]> ALSA Developer - http://www.alsa-project.org
On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:
> Note, the work you and the other preempt developers did with preempt was
> great; it wouldn't have been possible to be certain that it wasn't
> worthwhile until we had the thing working and fine-grained (i.e. in all
> the in_interrupt etc..), and now we know it doesn't pay off. In turn I'm
> going to try the explicit-preempt, that is, to explicitly enable preempt
> in a few cpu-intensive kernel spots where we don't take locks (i.e.
> copy-user), the original suggestion I made 4 years ago. I believe in such
> places an explicit-preempt will work best since we already have to check
> current->need_resched every few bytes, so adding a branch there should be
> very worthwhile. Doing real preempt like now is overkill instead and
> should be avoided IMHO.
I think you are really blowing the overhead of kernel preemption out of
proportion. The numbers Marinos J. Yannikos reported are definitely
a bug, an issue, something that will be fixed. The numbers everyone
else has historically shown are in line with Andrew: slight changes and
generally a small improvement to kernel compiles, dbench runs, et
cetera.
I also feel you underestimate the improvements kernel preemption gives.
Yes, the absolute worst case latency probably remains because it tends
to occur under lock (although, it is now easier to pinpoint that latency
and work some magic on the locks). But the variance of the latency goes
way down, too. We smooth out the curve. And these are differences that
matter.
And it can be turned off, so if you don't care about that and are not
debugging atomicity (which preempt is a big help with, right?) then turn
it off.
Oh, and if the PREEMPT=n overhead is really an issue, then I agree that
needs to be fixed :)
Robert Love
On Thu, Mar 18, 2004 at 10:20:23AM -0500, Tom Sightler wrote:
> Well, I reported an issue on my laptop several weeks ago where network
> activity via my aironet wireless adapter would use 60-70% of the CPU but
> only when PREEMPT was enabled. Looking back over the list I see other
sounds like a preempt bug triggering in a softirq, or maybe an SMP bug
triggering in UP.
I certainly agree with Andrew that preempt cannot cause a 60/70%
slowdown with a network card at least (if you do nothing but spinlocking
in a loop then it's possible a 60/70% slowdown instead but the system
time that a wifi card should take is nothing compared to the idle/user
time).
> I'll try to reproduce my issue with current generation kernels (last I
> tested with PREEMPT was 2.6.1) and see if my problem is still there.
> When I reported the issue last time no one seemed interested so I just
> learned to disable preempt.
This is fine IMHO (even if they are preempt bugs), except if your
machine is UP; if it's UP it would be better to track down the bug, since
it may still happen on an SMP box (I wouldn't be too surprised if wifi
drivers aren't smp safe). I certainly agree with Andrew when he says
preempt is a good debugging knob for debugging smp bugs on UP (though I
would leave it off in production UP, like in production SMP).
On Thu, 2004-03-18 at 10:20, Tom Sightler wrote:
> Well, I reported an issue on my laptop several weeks ago where network
> activity via my aironet wireless adapter would use 60-70% of the CPU but
> only when PREEMPT was enabled. Looking back over the list I see other
> similar issues with PREEMPT and various network card drivers (8139too
> and ne2k show up), although most of those seem to be against preempt in
> 2.4.x not 2.6.
I've only seen these against 2.4, right?
> I think the user that started this thread was seeing significant
> regressions during kernel compiles with PREEMPT enabled while the system
> also has some additional load from Apache (perhaps with significant
> network activity). I think there are cases like this where PREEMPT
> seems to trip up.
Well, Apache could definitely be the "issue" if it is using CPU time:
more time to Apache and less time to the compiles, presumably by
design. To really test kernel preemption's impact on throughput, you
need an idle system, otherwise you also test the improved preemptibility
and fairness.
Robert Love
On Thu, Mar 18, 2004 at 04:28:16PM +0100, Takashi Iwai wrote:
> the above one is the major source of RT-latency.
> only this oneliner will reduce more than 90% of RT-latencies.
> in my case with reiserfs, i got 0.4ms RT-latency with my test suite
> (with athlon 2200+).
cool ;)
> there is another point to be fixed in the reiserfs journal
> transaction. then you'll get 0.1ms RT-latency without preemption.
[..]
> i think the first one is needed for preemptive kernel, too.
> with these patches, also 0.1-0.2ms RT-latency is achieved.
amazing.
And without those improvements you did, the worst case for 2.6 mainline
is 8msec, right?
thanks for the great work on lowlatency. I'm sure Andrew will pick up
those improvements immediately :).
On Thu, 2004-03-18 at 10:28, Takashi Iwai wrote:
Hi, Takashi Iwai.
> well, i personally am not against the current preempt mechanism from
> the viewpoint of the audio-processing purpose :) the implementation
> is relatively clean and easy.
Agreed.
> i think the first one is needed for preemptive kernel, too.
> with these patches, also 0.1-0.2ms RT-latency is achieved.
Ohh, interesting. I'll give these a spin with PREEMPT=y and see. Thank
you!
Robert Love
Hi Robert,
On Thu, Mar 18, 2004 at 10:34:22AM -0500, Robert Love wrote:
> I also feel you underestimate the improvements kernel preemption gives.
Takashi benchmarked the worst case latency in very good detail. 2.6
stock with PREEMPT=y has a worse worst-case latency than 2.4-aa. This is
a fact. With Takashi's lowlatency fixes the latency goes below 2.4-aa,
w/ or w/o PREEMPT.
PREEMPT=y doesn't and cannot improve the worst-case latency. This is as
true today as it was 4 years ago.
> Yes, the absolute worst case latency probably remains because it tends
> to occur under lock (although, it is now easier to pinpoint that latency
> and work some magic on the locks). But the variance of the latency goes
> way down, too. We smooth out the curve. And these are differences that
> matter.
I don't think they can matter when the worst case is below 0.2msec.
> And it can be turned off, so if you don't care about that and are not
> debugging atomicity (which preempt is a big help with, right?) then turn
> it off.
I want to implement my old idea, which is to do the opposite of preempt.
I believe that is a much more efficient way to smooth the curve, with
lower overhead and no added kernel complexity.
Preempt is always enabled as soon as the cpu enters the kernel, and it
can be disabled on demand.
I want preempt to be disabled as soon as the cpu enters the kernel, and I
want to enable it on demand _only_ during the copy-user, or similar
cpu-intensive operations, also guaranteeing that those operations come
to an end to avoid RCU starvation.
Then I would like to compare the average latency (the curve); I doubt
it will be any different, and the overhead will be zero (we have to check
need_resched anyway after a copy-user, so we might as well do
preempt_enable/preempt_disable around it).
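(A minimal sketch of what that wrapper could look like, assuming the
kernel otherwise stays non-preemptible as in the enter_kernel/exit_kernel
pseudo-code earlier in the thread; the function name is illustrative:)

static inline unsigned long
preempt_on_demand_copy_from_user(void *dst, const void __user *src,
				 unsigned long n)
{
	unsigned long left;

	preempt_enable();	/* preemptible only for the bulk copy */
	left = __copy_from_user(dst, src, n);
	preempt_disable();	/* back to the non-preemptible default */
	return left;		/* bytes not copied, as usual */
}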
> Oh, and if the PREEMPT=n overhead is really an issue, then I agree that
> needs to be fixed :)
It's not a big issue of course (very low prio thing ;).
Andrea Arcangeli <[email protected]> wrote:
>
> > > Worst of all we're now taking spinlocks earlier than needed,
> >
> > Where? CPU scheduler?
>
> Everywhere, see the kmaps, we spinlock before instead of spinlock after,
> the scheduler, lots of places. I mean, people don't call
>
> preempt_disable()
> kmap_atomic
> spin_lock
>
> they do:
>
> spin_lock
> kmap_atomic
They do? kmap_atomic() disables preemption anyway.
> so they're effectively optimizing for PREEMPT=y and I don't think this
> is optimal for the long term. One can argue the microscalability
> slowdown isn't something to worry about, I certainly don't worry about
> it too much either, it's more a bad coding habit to spinlock earlier
> than needed to avoid preempt_disable.
>
> > > and the preempt_count stuff isn't optimized away by PREEMPT=n,
> >
> > It should be. If you see somewhere where it isn't, please tell us.
>
> the counter is definitely not optimized away, see:
>
> #define inc_preempt_count() \
> do { \
> preempt_count()++; \
> } while (0)
>
> #define dec_preempt_count() \
> do { \
> preempt_count()--; \
> } while (0)
>
> #define preempt_count() (current_thread_info()->preempt_count)
>
> those are running regardless of PREEMPT=n.
The macros are needed for kmap_atomic().
The task->preempt field is also used for tracking the hardirq and softirq
depths (in_interrupt(), in_irq(), in_softirq()) so it cannot be removed
with CONFIG_PREEMPT=n.
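(Roughly, the single per-thread word is carved into bit fields along
these lines; the exact widths are per-architecture and the mask values
below are only illustrative:)

#define PREEMPT_MASK	0x000000ff	/* preempt_disable() nesting depth */
#define SOFTIRQ_MASK	0x0000ff00	/* softirq nesting depth */
#define HARDIRQ_MASK	0x00ff0000	/* hardirq nesting depth */

#define in_irq()	(preempt_count() & HARDIRQ_MASK)
#define in_softirq()	(preempt_count() & SOFTIRQ_MASK)
#define in_interrupt()	(preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))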
> > We unconditionally bump the preempt_count in kmap_atomic() so that we can
> > use atomic kmaps in read() and write(). This is why four concurrent
> > write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.
>
> sorry, why should the atomic kmaps read the preempt_count? Are those ++
> -- useful for anything more than debugging PREEMPT=y on a kernel
> compiled with PREEMPT=n? I thought it was just debugging code with
> PREEMPT=n.
>
> I know why the atomic kmaps speed up write but I don't see how
> preempt_count can help there when PREEMPT=n; the atomic kmaps are purely
> per-cpu and one can't schedule anyway while holding those kmaps (no
> matter if inc_preempt_count or not).
We run inc_preempt_count() in kmap_atomic() so that we can perform
copy_*_user() while holding an atomic kmap in read() and write().
do_page_fault() will detect the elevated preempt count and will return a
short copy.
This could be implemented with a separate task_struct field but given that
preempt_count is always there and the infrastructure exists anyway, using
preempt_count() is tidier.
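(For concreteness, a simplified sketch of that pattern as it appears in
the 2.6-era write() path; this is not the literal mm/filemap.c code and
the helper name is made up:)

static size_t copy_into_page(struct page *page, unsigned long offset,
			     const char __user *buf, unsigned bytes)
{
	char *kaddr;
	unsigned long left;

	kaddr = kmap_atomic(page, KM_USER0);	/* bumps preempt_count */
	left = __copy_from_user(kaddr + offset, buf, bytes);
	kunmap_atomic(kaddr, KM_USER0);

	if (unlikely(left)) {
		/*
		 * The copy faulted: do_page_fault() saw the elevated
		 * preempt count, refused to sleep and returned a short
		 * copy.  Retry with the sleeping (persistent) kmap.
		 */
		kaddr = kmap(page);
		left = __copy_from_user(kaddr + offset, buf, bytes);
		kunmap(page);
	}
	return bytes - left;	/* bytes actually copied */
}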
On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:
> the counter is definitely not optimized away, see:
This is because of work Dave Miller and Ingo did - irq count, softirq
count, and lock count (when PREEMPT=y) are unified into preempt_count.
So it is intended.
The unification makes things cleaner and simpler, using one value in
place of three and one interface and concept in place of many others.
It also gives us a single simple thing to check for an overall notion of
"atomicity", which is what makes debugging so nice.
Robert Love
On Thu, Mar 18, 2004 at 09:39:02AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > > Worst of all we're now taking spinlocks earlier than needed,
> > >
> > > Where? CPU scheduler?
> >
> > Everywhere, see the kmaps, we spinlock before instead of spinlock after,
> > the scheduler, lots of places. I mean, people don't call
> >
> > preempt_disable()
> > kmap_atomic
> > spin_lock
> >
> > they do:
> >
> > spin_lock
> > kmap_atomic
>
> They do? kmap_atomic() disables preemption anyway.
dunno why but see:
spin_lock(&mm->page_table_lock);
page_table = pte_offset_map(pmd, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
etc...
maybe at some point in the past kmap_atomic didn't disable preempt? Just
guessing.
> > so they're effectively optimizing for PREEMPT=y and I don't think this
> > is optimal for the long term. One can argue the microscalability
> > slowdown isn't something to worry about, I certainly don't worry about
> > it too much either, it's more a bad coding habit to spinlock earlier
> > than needed to avoid preempt_disable.
> >
> > > > and the preempt_count stuff isn't optimized away by PREEMPT=n,
> > >
> > > It should be. If you see somewhere where it isn't, please tell us.
> >
> > the counter is definitely not optimized away, see:
> >
> > #define inc_preempt_count() \
> > do { \
> > preempt_count()++; \
> > } while (0)
> >
> > #define dec_preempt_count() \
> > do { \
> > preempt_count()--; \
> > } while (0)
> >
> > #define preempt_count() (current_thread_info()->preempt_count)
> >
> > those are running regardless of PREEMPT=n.
>
> The macros are needed for kmap_atomic().
>
> The task->preempt field is also used for tracking the hardirq and softirq
> depths (in_interrupt(), in_irq(), in_softirq()) so it cannot be removed
> with CONFIG_PREEMPT=n.
>
> > > We unconditionally bump the preempt_count in kmap_atomic() so that we can
> > > use atomic kmaps in read() and write(). This is why four concurrent
> > > write(fd, 1, buf) processes on 4-way is 8x faster than on 2.4 kernels.
> >
> > sorry, why should the atomic kmaps read the preempt_count? Are those ++
> > -- useful for anything more than debugging PREEMPT=y on a kernel
> > compiled with PREEMPT=n? I thought it was just debugging code with
> > PREEMPT=n.
> >
> > I know why the atomic kmaps speed up write but I don't see how
> > preempt_count can help there when PREEMPT=n; the atomic kmaps are purely
> > per-cpu and one can't schedule anyway while holding those kmaps (no
> > matter if inc_preempt_count or not).
>
> We run inc_preempt_count() in kmap_atomic() so that we can perform
> copy_*_user() while holding an atomic kmap in read() and write().
> do_page_fault() will detect the elevated preempt count and will return a
> short copy.
>
> This could be implemented with a separate task_struct field but given that
> preempt_count is always there and the infrastructure exists anyway, using
> preempt_count() is tidier.
>
thanks for the detailed explanation, so I was definitely wrong about
claiming there was more stuff to optimize away with preempt=n; clearly
your atomic copy-user interrupted by the page fault is a great feature.
In fact I wonder if we should try one more time and go atomic
again. What you're doing right now is:
kmap_atomic()
left = copy_user
kunmap_atomic
if (left) {
kmap() <- persistent unscalable
copy-user
kunmap
}
I would suggest we should be even more aggressive like this:
kmap_atomic()
left = copy_user
kunmap_atomic
if (left) {
get_user()
kmap_atomic()
left = copy_user
kunmap_atomic
if (left) {
kmap() <- persistent unscalable
copy-user
kunmap
}
}
It's not going to trigger often anyways, but it's only a few bytecodes
more and it sounds more scalable.
On Thu, Mar 18, 2004 at 12:48:50PM -0500, Robert Love wrote:
> On Thu, 2004-03-18 at 09:51, Andrea Arcangeli wrote:
>
> > the counter is definitely not optimized away, see:
>
> This is because of work Dave Miller and Ingo did - irq count, softirq
> count, and lock count (when PREEMPT=y) are unified into preempt_count.
>
> So it is intended.
>
> The unification makes things cleaner and simpler, using one value in
> place of three and one interface and concept in place of many others.
> It also gives us a single simple thing to check for an overall notion of
> "atomicity", which is what makes debugging so nice.
You're right, I didn't notice the other counters disappeared. Those
counters existed anyway w/o preempt too, so it would have been superfluous
with preempt=y to do the accounting in two places. So this is zero-cost
with preempt=n and I was wrong claiming superfluous preempt leftovers.
Andrea Arcangeli <[email protected]> wrote:
>
> > They do? kmap_atomic() disables preemption anyway.
>
> dunno why but see:
>
> spin_lock(&mm->page_table_lock);
> page_table = pte_offset_map(pmd, address);
>
> pte_unmap(page_table);
> spin_unlock(&mm->page_table_lock);
>
do_wp_page()? The lock is there to pin the pte down isn't it? Maybe it
can be optimised - certainly the spin_unlock() in there can be moved up a
few statements.
> In fact I wonder if we should try one more time and go atomic
> again. What you're doing right now is:
>
> kmap_atomic()
> left = copy_user
> kunmap_atomic
> if (left) {
> kmap() <- persistent unscalable
> copy-user
> kunmap
> }
>
> I would suggest we should be even more aggressive like this:
>
> kmap_atomic()
> left = copy_user
> kunmap_atomic
> if (left) {
> get_user()
> kmap_atomic()
> left = copy_user
> kunmap_atomic
> if (left) {
> kmap() <- persistent unscalable
> copy-user
> kunmap
> }
> }
>
>
> It's not going to trigger often anyways, but it's only a few bytecodes
> more and it sounds more scalable.
Could be. When I did that code I had some printks in the slow path and
although it did trigger, it was rare. We've already faulted the page in by
hand so we should only fall into the kmap() if the page was suddenly stolen
again.
But it was a long time ago - adding some instrumentation here to work out
how often we enter the slow path would tell us whether this is needed.
On Thu, Mar 18, 2004 at 10:26:23AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > They do? kmap_atomic() disables preemption anyway.
> >
> > dunno why but see:
> >
> > spin_lock(&mm->page_table_lock);
> > page_table = pte_offset_map(pmd, address);
> >
> > pte_unmap(page_table);
> > spin_unlock(&mm->page_table_lock);
> >
>
> do_wp_page()? The lock is there to pin the pte down isn't it? Maybe it
> can be optimised - certainly the spin_unlock() in there can be moved up a
> few statements.
It's not needed, we need the lock only when we _read_ the pte, not
during the kmap_atomic. The kmap_atomic is just a window on some highmem
rmap, it has no clue what's there and it can't affect the locking in any
way. The only thing that matters is that we don't schedule.
So regularly taking locks earlier than needed sounds a bit confusing,
since somebody could think they're really needed there.
> Could be. When I did that code I had some printks in the slow path and
> although it did trigger, it was rare. We've already faulted the page in by
I agree it must be _very_ rare ;). Writing zeros isn't very useful
anyway, and while swapping, scalability doesn't matter much anyway.
> hand so we should only fall into the kmap() if the page was suddenly stolen
> again.
Oh, so you mean the page fault isn't only interrupting the copy-user
atomically, but is also going to sleep and page in the
page? I thought you didn't want to allow other tasks to steal the kmap
before you effectively run the kunmap_atomic. I see it can be safe if
kunmap_atomic is a noop though, but you're effectively allowing
scheduling inside a kmap this way.
Andrea Arcangeli <[email protected]> wrote:
>
> > hand so we should only fall into the kmap() if the page was suddenly stolen
> > again.
>
> Oh, so you mean the page fault isn't only interrupting the copy-user
> atomically, but is also going to sleep and page in the
> page? I thought you didn't want to allow other tasks to steal the kmap
> before you effectively run the kunmap_atomic. I see it can be safe if
> kunmap_atomic is a noop though, but you're effectively allowing
> scheduling inside a kmap this way.
No, we don't schedule with the atomic kmap held. What I meant was:
get_user(page); /* fault it in */
kmap_atomic(page);
if (copy_from_user(page))
goto slow_path;
kunmap_atomic(page);
...
slow_path:
kunmap_atomic(page);
kmap(page);
copy_from_user(page);
kunmap(page);
slow_path is only taken if the vm (or truncate in the case of read())
unmapped the page in that little window after the get_user().
On Thu, Mar 18, 2004 at 10:47:29AM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > hand so we should only fall into the kmap() if the page was suddenly stolen
> > > again.
> >
> > Oh, so you mean the page fault isn't only interrupting the copy-user
> > atomically, but is also going to sleep and page in the
> > page? I thought you didn't want to allow other tasks to steal the kmap
> > before you effectively run the kunmap_atomic. I see it can be safe if
> > kunmap_atomic is a noop though, but you're effectively allowing
> > scheduling inside a kmap this way.
>
> No, we don't schedule with the atomic kmap held. What I meant was:
good to hear this (even scheduling inside an atomic kmap could be made
to work, but I would find it less obviously safe since kunmap_atomic would
need to be a noop, for example).
>
> get_user(page); /* fault it in */
> kmap_atomic(page);
> if (copy_from_user(page))
> goto slow_path;
> kunmap_atomic(page);
>
> ...
>
> slow_path:
> kunmap_atomic(page);
> kmap(page);
> copy_from_user(page);
> kunmap(page);
>
I see what you mean now, so you suggest running an unconditional
get_user(). It's hard to tell here. I agree the cost would be nearly
zero, since the byte loaded from memory will go into the L1 cache, and it
would avoid a page fault in some unlikely case compared to what I
proposed, but I don't like it too much because I really like to optimize
as much as possible for the fast path always, so I still prefer to keep
the get_user out of the fast path. But you're right to argue, since it's
hard to tell what's better in practice (certainly one could write a
malicious load where the additional pagefault would be measurable, while
the get_user would always be hardly measurable).
Takashi Iwai <[email protected]> wrote:
>
> > These fixes from Takashi Iwai bring 2.6 back in line with 2.4. I
> > suggested using EIP dumps from interrupts to get the hotspots; he
> > promptly used the RTC for that and fixed up all the spots. Great
> > job he did, since now we have a very low worst-case sched latency in 2.6
> > too:
> >
> > --- linux/fs/mpage.c-dist 2004-03-10 16:26:54.293647478 +0100
> > +++ linux/fs/mpage.c 2004-03-10 16:27:07.405673634 +0100
> > @@ -695,6 +695,7 @@ mpage_writepages(struct address_space *m
> > unlock_page(page);
> > }
> > page_cache_release(page);
> > + cond_resched();
> > spin_lock(&mapping->page_lock);
> > }
> > /*
>
> the above one is the major source of RT-latency.
This one's fine.
> for ext3, these two spots are relevant.
>
> --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> @@ -290,6 +290,9 @@ write_out_data_locked:
> commit_transaction->t_sync_datalist = jh;
> break;
> }
> +
> + if (need_resched())
> + break;
> } while (jh != last_jh);
>
> if (bufs || need_resched()) {
This one I need to think about. Perhaps we can remove the yield point a
few lines above.
One needs to be really careful with the lock-dropping trick - there are
weird situations in which the kernel fails to make any forward progress.
I've been meaning to do another round of latency tuneups for ages, so I'll
check this one out, thanks.
There's also the SMP problem: this CPU could be spinning on a lock with
need_resched() true, but the other CPU is hanging on the lock for ages
because its need_resched() is false. In the 2.4 ll patch I solved that via
the scary hack of broadcasting a reschedule instruction to all CPUs if an
rt-prio task just became runnable. In 2.6-preempt we use
preempt_spin_lock().
But in 2.6 non-preempt we have no solution to this, so worst-case
scheduling latencies on 2.6 SMP CONFIG_PREEMPT=n are high.
Last time I looked the worst-case latency is in fact over in the ext3
checkpoint code. It's under spinlock and tricky to fix.
> --- linux-2.6.4-8/fs/ext3/inode.c-dist 2004-03-18 02:33:38.000000000 +0100
> +++ linux-2.6.4-8/fs/ext3/inode.c 2004-03-18 02:33:40.000000000 +0100
> @@ -1987,6 +1987,7 @@ static void ext3_free_branches(handle_t
> if (is_handle_aborted(handle))
> return;
>
> + cond_resched();
> if (depth--) {
> struct buffer_head *bh;
> int addr_per_block = EXT3_ADDR_PER_BLOCK(inode->i_sb);
>
This one's OK.
btw, several months ago we discussed the idea of adding a sysctl to the
ALSA drivers which would cause a dump_stack() to be triggered if the audio
ISR detected a sound underrun.
This would be a very useful feature, because it increases the number of
low-latency developers from O(2) to O(lots). If some user is complaining
of underruns we can just ask them to turn on the sysctl and we get a trace
pointing at the culprit code.
And believe me, we need the coverage. There are all sorts of weird code
paths which were found during the development of the 2.4 low-latency patch.
i2c drivers, fbdev drivers, all sorts of things which you and I don't
test.
I know it's a matter of
if (sysctl_is_set)
dump_stack();
in snd_pcm_update_hw_ptr_post() somewhere, but my brain burst when working
out the ALSA sysctl architecture.
Is this something you could add please?
At Thu, 18 Mar 2004 11:01:59 -0800,
Andrew Morton wrote:
>
> btw, several months ago we discussed the idea of adding a sysctl to the
> ALSA drivers which would cause a dump_stack() to be triggered if the audio
> ISR detected a sound underrun.
>
> This would be a very useful feature, because it increases the number of
> low-latency developers from O(2) to O(lots). If some user is complaining
> of underruns we can just ask them to turn on the sysctl and we get a trace
> pointing at the culprit code.
>
> And believe me, we need the coverage. There are all sorts of weird code
> paths which were found during the development of the 2.4 low-latency patch.
> i2c drivers, fbdev drivers, all sorts of things which you and I don't
> test.
>
> I know it's a matter of
>
> if (sysctl_is_set)
> dump_stack();
>
> in snd_pcm_update_hw_ptr_post() somewhere, but my brain burst when working
> out the ALSA sysctl architecture.
>
> Is this something you could add please?
oh, sorry, maybe i forgot to tell you that it has already been there
:)
# echo 1 > /proc/asound/card0/pcm0p/xrun_debug
this will show the stacktrace when a buffer overrun/underrun is
detected in the irq handler. it's not perfect, though.
we can add stacktracing in other nasty places, e.g. when the
unexpected h/w pointer is returned (this is usually because of sloppy
irq handling).
Takashi
At Thu, 18 Mar 2004 11:01:59 -0800,
Andrew Morton wrote:
>
> Takashi Iwai <[email protected]> wrote:
> >
> > for ext3, these two spots are relevant.
> >
> > --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> > +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> > @@ -290,6 +290,9 @@ write_out_data_locked:
> > commit_transaction->t_sync_datalist = jh;
> > break;
> > }
> > +
> > + if (need_resched())
> > + break;
> > } while (jh != last_jh);
> >
> > if (bufs || need_resched()) {
>
> This one I need to think about. Perhaps we can remove the yield point a
> few lines above.
yes, i'm afraid that it's also overkill to check this every time.
perhaps we can optimize it a bit better. does the fact that it improves
the latency mean that there are so many locked buffers or non-dirty
buffers in the list?
> One needs to be really careful with the lock-dropping trick - there are
> weird situations in which the kernel fails to make any forward progress.
> I've been meaning to do another round of latency tuneups for ages, so I'll
> check this one out, thanks.
>
> There's also the SMP problem: this CPU could be spinning on a lock with
> need_resched() true, but the other CPU is hanging on the lock for ages
> because its need_resched() is false.
yep, i see a similar problem also in reiserfs's do_journal_end().
it's in lock_kernel().
> In the 2.4 ll patch I solved that via
> the scary hack of broadcasting a reschedule instruction to all CPUs if an
> rt-prio task just became runnable. In 2.6-preempt we use
> preempt_spin_lock().
>
> But in 2.6 non-preempt we have no solution to this, so worst-case
> scheduling latencies on 2.6 SMP CONFIG_PREEMPT=n are high.
hmm, it's bad...
> Last time I looked the worst-case latency is in fact over in the ext3
> checkpoint code. It's under spinlock and tricky to fix.
BTW, i had the worst latency in sis900's timer handler.
it takes 3ms, and hard to fix, too :-<
Takashi
Takashi Iwai <[email protected]> wrote:
>
> oh, sorry, maybe i forgot to tell you that it has already been there
> :)
>
> # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
>
> this will show the stacktrace when a buffer overrun/underrun is
> detected in the irq handler. it's not perfect, though.
heh, you just shrunk my todo list by 0.01%.
Have you had any useful reports from this feature?
At Thu, 18 Mar 2004 11:18:07 -0800,
Andrew Morton wrote:
>
> Takashi Iwai <[email protected]> wrote:
> >
> > oh, sorry, maybe i forgot to tell you that it has already been there
> > :)
> >
> > # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
> >
> > this will show the stacktrace when a buffer overrun/underrun is
> > detected in the irq handler. it's not perfect, though.
>
> heh, you just shrunk my todo list by 0.01%.
>
> Have you had any useful reports from this feature?
not much yet. i should write documentation for this feature now :)
also, the total buffer underrun/overrun problem doesn't happen *so*
often (except for the heavy i/o load, etc).
Takashi
On Thu, 2004-03-18 at 14:08, Takashi Iwai wrote:
> oh, sorry, maybe i forgot to tell you that it has already been there
> :)
>
> # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
>
> this will show the stacktrace when a buffer overrun/underrun is
> detected in the irq handler. it's not perfect, though.
Excellent!
Has this resulted in anything useful?
With KALLSYMS=y, a lot of users now become intelligent bug testers.
Eh, except that I do not have that procfs entry on my system..
Robert Love
Takashi Iwai <[email protected]> wrote:
>
> > > --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> > > +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> > > @@ -290,6 +290,9 @@ write_out_data_locked:
> > > commit_transaction->t_sync_datalist = jh;
> > > break;
> > > }
> > > +
> > > + if (need_resched())
> > > + break;
> > > } while (jh != last_jh);
> > >
> > > if (bufs || need_resched()) {
> >
> > This one I need to think about. Perhaps we can remove the yield point a
> > few lines above.
>
> yes, i'm afraid that it's also overkill to check this every time.
> perhaps we can optimize it a bit better. does the fact that it improves
> the latency mean that there are so many locked buffers or non-dirty
> buffers in the list?
yes, lots of clean buffers.
> > One needs to be really careful with the lock-dropping trick - there are
> > weird situations in which the kernel fails to make any forward progress.
> > I've been meaning to do another round of latency tuneups for ages, so I'll
> > check this one out, thanks.
> >
> > There's also the SMP problem: this CPU could be spinning on a lock with
> > need_resched() true, but the other CPU is hanging on the lock for ages
> > because its need_resched() is false.
>
> yep, i see a similar problem also in reiserfs's do_journal_end().
> it's in lock_kernel().
I have a scheduling point in journal_end() in 2.4. But I added bugs to
reiserfs a couple of times doing this - it's pretty delicate. Beat up on
Chris ;)
> > Last time I looked the worst-case latency is in fact over in the ext3
> > checkpoint code. It's under spinlock and tricky to fix.
>
> BTW, i had the worst latency in sis900's timer handler.
> it takes 3ms, and hard to fix, too :-<
networking in general can cause problems, as can the random driver, which I
hacked rather flakily in 2.4.
davem fixed the tcp_minisock reaping in 2.6, which helps.
On Thu, Mar 18, 2004 at 11:01:59AM -0800, Andrew Morton wrote:
> weird situations in which the kernel fails to make any forward progress.
definitely agreed.
the old lowlatency patches had the same issue too, and that's why I
rejected them for some time, but eventually I changed them to go straight
through the second time.
breaking the loop _once_ (and having to restart from scratch) is ok,
because we may be starting the loop with only 1msec of timeslice left,
for example. The second time through the loop (or at some point) we can
ignore need_resched, to guarantee forward progress.
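(A sketch of that pattern with hypothetical data structures; struct item,
process_item() and walk_items() are made up:)

struct item {			/* hypothetical element type */
	struct list_head list;
	/* ... payload ... */
};

static void walk_items(struct list_head *items, spinlock_t *lock)
{
	struct item *it;
	int restarted = 0;

restart:
	spin_lock(lock);
	list_for_each_entry(it, items, list) {
		process_item(it);	/* hypothetical per-item work */
		if (!restarted && need_resched()) {
			restarted = 1;
			spin_unlock(lock);
			cond_resched();
			goto restart;	/* second pass ignores need_resched() */
		}
	}
	spin_unlock(lock);
}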
> I've been meaning to do another round of latency tuneups for ages, so I'll
> check this one out, thanks.
thanks Andrew!
> There's also the SMP problem: this CPU could be spinning on a lock with
> need_resched() true, but the other CPU is hanging on the lock for ages
> because its need_resched() is false. In the 2.4 ll patch I solved that via
> the scary hack of broadcasting a reschedule instruction to all CPUs if an
> rt-prio task just became runnable. In 2.6-preempt we use
> preempt_spin_lock().
>
> But in 2.6 non-preempt we have no solution to this, so worst-case
> scheduling latencies on 2.6 SMP CONFIG_PREEMPT=n are high.
>
>
> Last time I looked the worst-case latency is in fact over in the ext3
> checkpoint code. It's under spinlock and tricky to fix.
preempt_spin_lock is the wrong way to go; the right way to go is to
fix the broken code that is "hanging on the lock for ages" ;). More
seriously: if you need a lock for ages, you must use a semaphore, not a
spinlock.
spinlocked critical sections must be small. Oh, and you can always
trivially add a spin_lock_schedule() API that does the same thing as
preempt_spin_lock if you really want, because you have an obscure lock
taken inside other locks, but some "quick" reader that could sleep
happens to have to wait for the writer to do the bulk of the work.
The only difference is that spin_lock_schedule must be called in a place
where you're sure you can sleep.
So a spin_lock_schedule is doable just fine, but if you need it, it's
possible the code could be written better with a semaphore or by
avoiding big loops holding locks.
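(A hypothetical spin_lock_schedule() along those lines, only legal in
contexts that are allowed to sleep:)

static inline void spin_lock_schedule(spinlock_t *lock)
{
	while (!spin_trylock(lock)) {
		/* yield instead of spinning if a reschedule is pending */
		if (need_resched())
			schedule();
		else
			cpu_relax();
	}
}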
> low-latency developers from O(2) to O(lots). If some user is complaining
;)
On Thu, Mar 18, 2004 at 08:20:17PM +0100, Takashi Iwai wrote:
> also, the total buffer underrun/overrun problem doesn't happen *so*
> often (except for the heavy i/o load, etc).
you mean because the disk has not enough bandwidth to read the file,
right? (not scheduler related)
On Thu, 2004-03-18 at 14:29, Andrew Morton wrote:
> > yep, i see a similar problem also in reiserfs's do_journal_end().
> > it's in lock_kernel().
>
> I have a scheduling point in journal_end() in 2.4. But I added bugs to
> reiserfs a couple of times doing this - it's pretty delicate. Beat up on
> Chris ;)
;-) Not sure if Takashi is talking about -suse or -mm, the data=ordered
patches change things around. He sent me suggestions for the
data=ordered latencies already, but it shouldn't be against the BKL
there, since I drop it before calling write_ordered_buffers().
-chris
At Thu, 18 Mar 2004 20:43:45 +0100,
Andrea Arcangeli wrote:
>
> On Thu, Mar 18, 2004 at 08:20:17PM +0100, Takashi Iwai wrote:
> > also, the total buffer underrun/overrun problem doesn't happen *so*
> > often (except for the heavy i/o load, etc).
>
> you mean because the disk has not enough bandwidth to read the file,
> right? (not scheduler related)
it's just my general opinion judging from the bug reports from ALSA
users. the i/o load has more influence than scheduler-sensitive
loads. i think i can test the "pseudo" interactivity with the latency
test program running as a normal user. i'll try it tonight.
Takashi
Takashi Iwai <[email protected]> wrote:
>
> --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> @@ -290,6 +290,9 @@ write_out_data_locked:
> commit_transaction->t_sync_datalist = jh;
> break;
> }
> +
> + if (need_resched())
> + break;
> } while (jh != last_jh);
>
> if (bufs || need_resched()) {
Yup, this has problems.
What we're doing here is walking (under spinlock) a ring of buffers
searching for unlocked dirty ones to write out.
Suppose the ring has thousands of locked buffers and there is a RT task
scheduling at 1kHz. Each time need_resched() comes true we break out of
the search, schedule away and then restart the search from the beginning.
So if the time to walk that ring of buffers exceeds one millisecond the
kernel will make no progress. Eventually things will fix themselves up as
the I/O completes against those buffers but in the meanwhile we chew 100%
of CPU.
A few lines higher up we advance the searching start point so the next
search will start at the current position. But that's when it is known
that we made some forward progress. In the above patch we are vulnerable
to the situation where _all_ of the buffers on that ring are locked.
The below might suit, although I haven't tested it. It needs careful
testing too - it will impact I/O scheduling efficiency.
--- 25/fs/jbd/commit.c~a Thu Mar 18 14:23:40 2004
+++ 25-akpm/fs/jbd/commit.c Thu Mar 18 14:28:46 2004
@@ -280,6 +280,22 @@ write_out_data_locked:
goto write_out_data;
}
}
+ } else if (need_resched()) {
+ /*
+ * A locked buffer, and we're being a CPU pig. We
+ * cannot just schedule away because it may be the case
+ * that *all* the buffers are locked. So let's just
+ * wait until this buffer has been written.
+ *
+ * If we already have some buffers in wbuf[] then fine,
+ * we're making forward progress.
+ */
+ if (bufs)
+ break;
+ commit_transaction->t_sync_datalist = jh;
+ spin_unlock(&journal->j_list_lock);
+ wait_on_buffer(bh);
+ goto write_out_data;
}
if (bufs == ARRAY_SIZE(wbuf)) {
/*
_
On Thu, 2004-03-18 at 17:32, Andrew Morton wrote:
> Takashi Iwai <[email protected]> wrote:
> >
> > --- linux-2.6.4-8/fs/jbd/commit.c-dist 2004-03-16 23:00:40.000000000 +0100
> > +++ linux-2.6.4-8/fs/jbd/commit.c 2004-03-18 02:42:41.043448624 +0100
> > @@ -290,6 +290,9 @@ write_out_data_locked:
> > commit_transaction->t_sync_datalist = jh;
> > break;
> > }
> > +
> > + if (need_resched())
> > + break;
> > } while (jh != last_jh);
> >
> > if (bufs || need_resched()) {
>
> Yup, this has problems.
>
> What we're doing here is walking (under spinlock) a ring of buffers
> searching for unlocked dirty ones to write out.
>
> Suppose the ring has thousands of locked buffers and there is a RT task
> scheduling at 1kHz. Each time need_resched() comes true we break out of
> the search, schedule away and then restart the search from the beginning.
>
Why not just put all the locked buffers onto a different list as you
find them or lock them. That way you can always make progress...
-chris
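Purely to illustrate the suggestion, a self-contained toy model of the
two-list scheme might look like this (made-up types; the real thing is the
jbd patch further down the thread):

struct toy_buf {
        struct toy_buf *next;
        int under_io;           /* "locked": a write is in flight */
        int dirty;              /* still needs a write */
};

/*
 * Phase 1 removes one buffer from sync_list per iteration, so it always
 * makes forward progress and may drop locks / reschedule at any point.
 * Buffers that are (or become) under I/O are parked on locked_list;
 * phase 2 only waits on that list.
 */
void toy_commit(struct toy_buf **sync_list, struct toy_buf **locked_list)
{
        while (*sync_list) {
                struct toy_buf *b = *sync_list;

                *sync_list = b->next;
                if (!b->under_io && b->dirty) {
                        b->dirty = 0;           /* "submit" the write */
                        b->under_io = 1;
                }
                if (b->under_io) {
                        b->next = *locked_list; /* park it for phase 2 */
                        *locked_list = b;
                }
                /* a clean, idle buffer simply falls off both lists here;
                 * a real version could cond_resched() safely at this point */
        }
        while (*locked_list) {
                struct toy_buf *b = *locked_list;

                *locked_list = b->next;
                /* a wait_on_buffer() equivalent would go here */
        }
}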
On Thu, 2004-03-18 at 10:37, Andrea Arcangeli wrote:
> On Thu, Mar 18, 2004 at 10:20:23AM -0500, Tom Sightler wrote:
> > Well, I reported an issue on my laptop several weeks ago where network
> > activity via my aironet wireless adapter would use 60-70% of the CPU but
> > only when PREEMPT was enabled. Looking back over the list I see other
>
> sounds like a preempt bug triggering in a softirq, or maybe an SMP bug
> triggering in UP.
>
> I certainly agree with Andrew that preempt cannot cause a 60/70%
> slowdown with a network card at least (if you do nothing but spinlocking
> in a loop then it's possible a 60/70% slowdown instead but the system
> time that a wifi card should take is nothing compared to the idle/user
> time).
Actually, I managed to get two bugs mixed together, my system had a
problem with PREEMPT and ACPI but this certainly seems to be fixed in
the current kernels, at least in 2.6.3-mm2 (my current everyday kernel)
and 2.6.5-rc1-mm2 (just a quick test tonight). Somehow the combination
of PREEMPT and any little applet that used ACPI (battery monitor) would
use a lot of CPU, then using the aironet adapter caused the CPU to
spike. Turning off either PREEMPT or ACPI would get rid of the
problem. I forgot about the ACPI part being in the mix.
I still think the aironet driver acts strange, but I'm not really sure
how to describe it. In 2.6.3-mm2, when I transfer a file, top will show
60% of the cpu time in 'irq', whatever that means. It seems vmstat still
calls this idle. Perhaps I'm interpreting those numbers wrong.
When I enable preempt on the same system my network performance with the
aironet driver drops tremendously, probably by 75%, and the rest of the
system seems to crawl. Somehow without preempt this isn't a problem,
even though it shows 60% of the time in 'irq' the rest of the system
feels responsive. What would be the proper way to track down this type
of problem?
I'm having more problems with 2.6.5-rc1-mm2 and the aironet card. It
seems to perform poorly no matter if preempt is on or off. I'm still
looking into that. I don't think the driver has changed significantly
between the two versions, so I'm hoping I just made a mistake.
Later,
Tom
Chris Mason <[email protected]> wrote:
>
> > What we're doing here is walking (under spinlock) a ring of buffers
> > searching for unlocked dirty ones to write out.
> >
> > Suppose the ring has thousands of locked buffers and there is a RT task
> > scheduling at 1kHz. Each time need_resched() comes true we break out of
> > the search, schedule away and then restart the search from the beginning.
> >
> Why not just put all the locked buffers onto a different list as you
> find them or lock them. That way you can always make progress...
Coz that made me look at the code and hence rewrite the whole thing, damn
you. It's definitely better, but is of scary scope.
diff -puN fs/jbd/commit.c~jbd-move-locked-buffers fs/jbd/commit.c
--- 25/fs/jbd/commit.c~jbd-move-locked-buffers Thu Mar 18 15:01:18 2004
+++ 25-akpm/fs/jbd/commit.c Thu Mar 18 15:53:05 2004
@@ -88,7 +88,6 @@ void journal_commit_transaction(journal_
{
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
- struct journal_head *next_jh, *last_jh;
struct buffer_head *wbuf[64];
int bufs;
int flags;
@@ -222,38 +221,55 @@ void journal_commit_transaction(journal_
err = 0;
/*
* Whenever we unlock the journal and sleep, things can get added
- * onto ->t_datalist, so we have to keep looping back to write_out_data
- * until we *know* that the list is empty.
+ * onto ->t_sync_datalist, so we have to keep looping back to
+ * write_out_data until we *know* that the list is empty.
*/
-write_out_data:
-
+ bufs = 0;
/*
* Cleanup any flushed data buffers from the data list. Even in
* abort mode, we want to flush this out as soon as possible.
- *
- * We take j_list_lock to protect the lists from
- * journal_try_to_free_buffers().
*/
+write_out_data:
spin_lock(&journal->j_list_lock);
-write_out_data_locked:
- bufs = 0;
- next_jh = commit_transaction->t_sync_datalist;
- if (next_jh == NULL)
- goto sync_datalist_empty;
- last_jh = next_jh->b_tprev;
-
- do {
+ while (commit_transaction->t_sync_datalist) {
struct buffer_head *bh;
- jh = next_jh;
- next_jh = jh->b_tnext;
+ jh = commit_transaction->t_sync_datalist;
+ commit_transaction->t_sync_datalist = jh->b_tnext;
bh = jh2bh(jh);
- if (!buffer_locked(bh)) {
+ if (buffer_locked(bh)) {
+ /*
+ * We have a lock ranking problem..
+ */
+ if (!jbd_trylock_bh_state(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ schedule();
+ goto write_out_data;
+ }
+ __journal_unfile_buffer(jh);
+ __journal_file_buffer(jh, jh->b_transaction, BJ_Locked);
+ jbd_unlock_bh_state(bh);
+ if (need_resched()) {
+ spin_unlock(&journal->j_list_lock);
+ cond_resched();
+ goto write_out_data;
+ }
+ } else {
if (buffer_dirty(bh)) {
BUFFER_TRACE(bh, "start journal writeout");
atomic_inc(&bh->b_count);
wbuf[bufs++] = bh;
+ if (bufs == ARRAY_SIZE(wbuf)) {
+ jbd_debug(2, "submit %d writes\n",
+ bufs);
+ spin_unlock(&journal->j_list_lock);
+ ll_rw_block(WRITE, bufs, wbuf);
+ journal_brelse_array(wbuf, bufs);
+ cond_resched();
+ bufs = 0;
+ goto write_out_data;
+ }
} else {
BUFFER_TRACE(bh, "writeout complete: unfile");
/*
@@ -269,66 +285,53 @@ write_out_data_locked:
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
__brelse(bh);
- if (need_resched() && commit_transaction->
- t_sync_datalist) {
- commit_transaction->t_sync_datalist =
- next_jh;
- if (bufs)
- break;
+ if (need_resched()) {
spin_unlock(&journal->j_list_lock);
cond_resched();
goto write_out_data;
}
}
}
- if (bufs == ARRAY_SIZE(wbuf)) {
- /*
- * Major speedup: start here on the next scan
- */
- J_ASSERT(commit_transaction->t_sync_datalist != 0);
- commit_transaction->t_sync_datalist = jh;
- break;
- }
- } while (jh != last_jh);
-
- if (bufs || need_resched()) {
- jbd_debug(2, "submit %d writes\n", bufs);
- spin_unlock(&journal->j_list_lock);
- if (bufs)
- ll_rw_block(WRITE, bufs, wbuf);
- cond_resched();
- journal_brelse_array(wbuf, bufs);
- spin_lock(&journal->j_list_lock);
- goto write_out_data_locked;
}
/*
- * Wait for all previously submitted IO on the data list to complete.
+ * Wait for all previously submitted IO to complete.
*/
- jh = commit_transaction->t_sync_datalist;
- if (jh == NULL)
- goto sync_datalist_empty;
-
- do {
+ while (commit_transaction->t_locked_list) {
struct buffer_head *bh;
- jh = jh->b_tprev; /* Wait on the last written */
+
+ jh = commit_transaction->t_locked_list->b_tprev;
bh = jh2bh(jh);
+ get_bh(bh);
if (buffer_locked(bh)) {
- get_bh(bh);
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh);
if (unlikely(!buffer_uptodate(bh)))
err = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!jbd_trylock_bh_state(bh)) {
put_bh(bh);
- /* the journal_head may have been removed now */
- goto write_out_data;
- } else if (buffer_dirty(bh)) {
- goto write_out_data_locked;
+ spin_unlock(&journal->j_list_lock);
+ schedule();
+ spin_lock(&journal->j_list_lock);
+ continue;
}
- } while (jh != commit_transaction->t_sync_datalist);
- goto write_out_data_locked;
-
-sync_datalist_empty:
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __journal_unfile_buffer(jh);
+ jh->b_transaction = NULL;
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ if (need_resched()) {
+ spin_unlock(&journal->j_list_lock);
+ cond_resched();
+ spin_lock(&journal->j_list_lock);
+ }
+ }
spin_unlock(&journal->j_list_lock);
journal_write_revoke_records(journal, commit_transaction);
diff -puN include/linux/jbd.h~jbd-move-locked-buffers include/linux/jbd.h
--- 25/include/linux/jbd.h~jbd-move-locked-buffers Thu Mar 18 15:07:29 2004
+++ 25-akpm/include/linux/jbd.h Thu Mar 18 15:09:46 2004
@@ -487,6 +487,12 @@ struct transaction_s
struct journal_head *t_reserved_list;
/*
+ * Doubly-linked circular list of all buffers under writeout during
+ * commit [j_list_lock]
+ */
+ struct journal_head *t_locked_list;
+
+ /*
* Doubly-linked circular list of all metadata buffers owned by this
* transaction [j_list_lock]
*/
@@ -1080,6 +1086,7 @@ static inline int jbd_space_needed(journ
#define BJ_LogCtl 6 /* Buffer contains log descriptors */
#define BJ_Reserved 7 /* Buffer is reserved for access by journal */
#define BJ_Types 8
+#define BJ_Locked 9 /* Locked for I/O during commit */
extern int jbd_blocks_per_page(struct inode *inode);
diff -puN fs/jbd/transaction.c~jbd-move-locked-buffers fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-move-locked-buffers Thu Mar 18 15:10:21 2004
+++ 25-akpm/fs/jbd/transaction.c Thu Mar 18 15:56:56 2004
@@ -1010,7 +1010,8 @@ int journal_dirty_data(handle_t *handle,
* the write() data.
*/
if (jh->b_jlist != BJ_None &&
- jh->b_jlist != BJ_SyncData) {
+ jh->b_jlist != BJ_SyncData &&
+ jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "Not stealing");
goto no_journal;
}
@@ -1048,7 +1049,7 @@ int journal_dirty_data(handle_t *handle,
* committing transaction, so might still be left on that
* transaction's metadata lists.
*/
- if (jh->b_jlist != BJ_SyncData) {
+ if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "not on correct data list: unfile");
J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
__journal_unfile_buffer(jh);
@@ -1539,6 +1540,9 @@ void __journal_unfile_buffer(struct jour
case BJ_Reserved:
list = &transaction->t_reserved_list;
break;
+ case BJ_Locked:
+ list = &transaction->t_locked_list;
+ break;
}
__blist_del_buffer(list, jh);
@@ -1576,7 +1580,7 @@ __journal_try_to_free_buffer(journal_t *
spin_lock(&journal->j_list_lock);
if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
- if (jh->b_jlist == BJ_SyncData) {
+ if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
/* A written-back ordered data buffer */
JBUFFER_TRACE(jh, "release data");
__journal_unfile_buffer(jh);
@@ -1985,6 +1989,9 @@ void __journal_file_buffer(struct journa
case BJ_Reserved:
list = &transaction->t_reserved_list;
break;
+ case BJ_Locked:
+ list = &transaction->t_locked_list;
+ break;
}
__blist_add_buffer(list, jh);
_
Andrea Arcangeli wrote:
>
>BTW, with preempt enabled there is no guarantee that RCU can ever reach
>a quiescent point and secondly there is no guarantee that you will ever
>be allowed to unplug a CPU hotline since again there's no guarantee to
>reach a quiescent point. Think a kernel thread doing for (;;) (i.e.
>math computations in background, to avoid starving RCU the kernel thread
>will have to add schedule() explicitly no matter if PREEMPT=y or
>PREEMPT=n, again invalidating the point of preempt, the rcu tracking for
>PREEMPT=y is also more expensive).
>
>
Why is this so? schedule() runs RCU_qsctr(task_cpu(prev))++;
whether it is called directly or via preempt_schedule.
>Note, the work you and the other preempt developers did with preempt was
>great, it wouldn't be possible to be certain that it wasn't worthwhile
>until we had the thing working and finegrained (i.e. in all in_interrupt
>etc..), and now we know it doesn't pay off and in turn I'm going to try
>the explicit-preempt that is to explicitly enable preempt in a few
>cpu-intensive kernel spots where we don't take locks (i.e. copy-user),
>the original suggestion I did 4 years ago, I believe in such places an
>explicit-preempt will work best since we've already to check every few
>bytes the current->need_resched, so adding a branch there should be very
>worthwhile. Doing real preempt like now is overkill instead and should
>be avoided IMHO.
>
>
Technically, I think preempt can still reduce maximum latency
depending on whether the time between explicit resched checks
is greater than the (small) scheduling overhead that preempt
adds.
But yeah, that would be in the noise compared with long
preempt-off regions.
I'm not sure about applications, but maybe some soft-realtime
things like audio software benefit from a lower average latency
(I don't know).
On Thu, 2004-03-18 at 10:28, Takashi Iwai wrote:
> in my case with reiserfs, i got 0.4ms RT-latency with my test suite
> (with athlon 2200+).
>
> there is another point to be fixed in the reiserfs journal
> transaction. then you'll get 0.1ms RT-latency without preemption.
Are you talking about the following patch recently merged in Linus tree?
[PATCH] resierfs: scheduling latency improvements
http://linus.bkbits.net:8080/linux-2.5/cset@40571b49jtE7PzOtsXjBx65-GoDijg
I'm interested to try any patch you might have to help reduce latency
with reiserfs.
At Thu, 18 Mar 2004 22:10:06 -0800,
Andrew Morton wrote:
>
> Takashi Iwai <[email protected]> wrote:
> >
> > BTW, i had the worst latency in sis900's timer handler.
> > it takes 3ms, and hard to fix, too :-<
>
> It's all coming back to me now.
>
> The worst-case latency is during umount, fs/inode.c:invalidate_list() when
> the filesystem has a zillion inodes in icache. Measured 250 milliseconds
> on a 256MB 2.7GHz P4 here. OK, so don't do that.
also that's not included in my test case :)
> The unavoidable worst case is in the RCU callbacks for dcache shrinkage -
> I've seen 25 millisecond holdoffs on the above machine during filesystem
> stresstests when RCU is freeing a huge number of dentries in softirq
> context.
hmm, this wasn't evaluated in my tests either.
it's worth a try. thanks for the info.
> This is Hard To Fix. Dipankar spent quite some time looking into it and
> had patches, but I lost track of where they're at.
couldn't this tasklet be replaced with a workqueue or such?
Takashi
At Thu, 18 Mar 2004 22:07:41 -0500,
Eric St-Laurent wrote:
>
> On Thu, 2004-03-18 at 10:28, Takashi Iwai wrote:
>
> > in my case with reiserfs, i got 0.4ms RT-latency with my test suite
> > (with athlon 2200+).
> >
> > there is another point to be fixed in the reiserfs journal
> > transaction. then you'll get 0.1ms RT-latency without preemption.
>
> Are you talking about the following patch recently merged in Linus tree?
>
> [PATCH] resierfs: scheduling latency improvements
> http://linus.bkbits.net:8080/linux-2.5/cset@40571b49jtE7PzOtsXjBx65-GoDijg
i've tested the suse kernel, so the patch above was already in it.
i'm not sure whether the mm tree already includes all the relevant
fixes... Chris?
what i wrote above is about the loops in do_journal_end(). but i can't
say for sure unless i test the mm kernel again.
> I'm interested to try any patch you might have to help reduce latency
> with reiserfs.
AFAIK, even just the fix of mpage_writepages() will reduce the latency
very much. please give it a try.
Takashi
At Thu, 18 Mar 2004 14:48:07 -0500,
Chris Mason wrote:
>
> On Thu, 2004-03-18 at 14:29, Andrew Morton wrote:
>
> > > yep, i see a similar problem also in reiserfs's do_journal_end().
> > > it's in lock_kernel().
> >
> > I have a scheduling point in journal_end() in 2.4. But I added bugs to
> > reiserfs a couple of times doing this - it's pretty delicate. Beat up on
> > Chris ;)
>
> ;-) Not sure if Takashi is talking about -suse or -mm, the data=ordered
> patches change things around. He sent me suggestions for the
> data=ordered latencies already, but it shouldn't be against the BKL
> there, since I drop it before calling write_ordered_buffers().
i tested only suse kernels recently. will try the mm kernel again later.
ok, let me explain some nasty points i found through the disk i/o load
tests:
- in the loop in do_journal_end(). this happens periodically in
pdflush.
  /* first data block is j_start + 1, so add one to cur_write_start wherever you use it */
  cur_write_start = SB_JOURNAL(p_s_sb)->j_start ;
  cn = SB_JOURNAL(p_s_sb)->j_first ;
  jindex = 1 ; /* start at one so we don't get the desc again */
  while(cn) {
          clear_bit(BH_JNew, &(cn->bh->b_state)) ;
          ....
          next = cn->next ;
          free_cnode(p_s_sb, cn) ;
          cn = next ;
  }
- in write_ordered_buffers().
i still can't figure out where. we already have a cond_resched()
check in the loops. this one is triggered when i write bulk data
in parallel (1GB write with 20 threads at the same time), resulting
in up to 2ms.
a typical stack trace looks like this:
T=36.569 diff=3.64275
comm=reiserfs/0
rtc_interrupt (+cd/e0)
handle_IRQ_event (+2f/60)
do_IRQ (+76/170)
common_interrupt (+18/20)
kfree (+36/50)
reiserfs_free_jh (+34/60)
write_ordered_buffers (+11f/1d0)
flush_commit_list (+3e6/480)
flush_async_commits (+5d/70)
worker_thread (+164/1d0)
flush_async_commits (+0/70)
default_wake_function (+0/10)
default_wake_function (+0/10)
worker_thread (+0/1d0)
kthread (+77/9f)
kthread (+0/9f)
kernel_thread_helper (+5/10)
Takashi
On Fri, 2004-03-19 at 06:23, Takashi Iwai wrote:
> At Thu, 18 Mar 2004 22:07:41 -0500,
> Eric St-Laurent wrote:
> >
> > On Thu, 2004-03-18 at 10:28, Takashi Iwai wrote:
> >
> > > in my case with reiserfs, i got 0.4ms RT-latency with my test suite
> > > (with athlon 2200+).
> > >
> > > there is another point to be fixed in the reiserfs journal
> > > transaction. then you'll get 0.1ms RT-latency without preemption.
> >
> > Are you talking about the following patch recently merged in Linus tree?
> >
> > [PATCH] resierfs: scheduling latency improvements
> > http://linus.bkbits.net:8080/linux-2.5/cset@40571b49jtE7PzOtsXjBx65-GoDijg
>
> i've tested the suse kernel, so the patch above was already in it.
> i'm not sure whether the mm tree already includes all the relevant
> fixes... Chris?
>
The whole reiserfs patchset from -suse is now at:
ftp.suse.com/pub/people/mason/patches/data-logging/experimental/2.6.4
The mm tree does already have the most important reiserfs latency fix,
which is named reiserfs-lock-lat. Andrew sent it along to Linus as well,
so it should be in mainline.
> what i wrote above is about the loops in do_journal_end(). but i can't
> say for sure unless i test the mm kernel again.
>
The spot you found in do_journal_end still needs the cond_resched. The
patches above don't include that one yet (I'll upload it today).
-chris
On Fri, 2004-03-19 at 06:37, Takashi Iwai wrote:
> At Thu, 18 Mar 2004 14:48:07 -0500,
> Chris Mason wrote:
> >
> > On Thu, 2004-03-18 at 14:29, Andrew Morton wrote:
> >
> > > > yep, i see a similar problem also in reiserfs's do_journal_end().
> > > > it's in lock_kernel().
> > >
> > > I have a scheduling point in journal_end() in 2.4. But I added bugs to
> > > reiserfs a couple of times doing this - it's pretty delicate. Beat up on
> > > Chris ;)
> >
> > ;-) Not sure if Takashi is talking about -suse or -mm, the data=ordered
> > patches change things around. He sent me suggestions for the
> > data=ordered latencies already, but it shouldn't be against the BKL
> > there, since I drop it before calling write_ordered_buffers().
>
> i tested only suse kernels recently. will try the mm kernel again later.
>
> ok, let me explain some nasty points i found through the disk i/o load
> tests:
>
> - in the loop in do_journal_end(). this happens periodically in
> pdflush.
>
This one we need.
> - in write_ordered_buffers().
>
> > i still can't figure out where. we already have a cond_resched()
> check in the loops. this one is triggered when i write bulk data
> in parallel (1GB write with 20 threads at the same time), resulting
> in up to 2ms.
>
Hmmm, is this a lock latency or just noticing that write() takes a long
time?
It's probably this:
if (chunk.nr == 0 && need_resched) {
I'll bet the latencies go away when you drop the chunk.nr test. If so
I'll need to benchmark if we want to submit before the schedule
(probably yes, as long as we've already sent down some buffers
previously).
(BTW, this code is only in the -suse patchset, data=ordered adds many
new places for possible latencies)
-chris
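For illustration only, the kind of change being discussed would be along
these lines ("chunk", submit_chunk() and the surrounding loop are
stand-ins, not the actual -suse write_ordered_buffers() code):

        /* hypothetical sketch, inside the buffer-walking loop */
        if (need_resched()) {
                if (chunk.nr)           /* probably submit what we have first */
                        submit_chunk(&chunk);
                cond_resched();         /* no longer gated on chunk.nr == 0 */
        }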
At Fri, 19 Mar 2004 08:46:47 -0500,
Chris Mason wrote:
>
> > - in write_ordered_buffers().
> >
> > i still can't figure out where. we already have a cond_resched()
> > check in the loops. this one is triggered when i write bulk data
> > in parallel (1GB write with 20 threads at the same time), resulting
> > in up to 2ms.
> >
> Hmmm, is this a lock latency or just noticing that write() takes a long
> time?
>
> It's probably this:
> if (chunk.nr == 0 && need_resched) {
>
> I'll bet the latencies go away when you drop the chunk.nr test.
i'll give it a try. thanks for the hints.
Takashi
On Thu, Mar 18, 2004 at 10:10:06PM -0800, Andrew Morton wrote:
> The worst-case latency is during umount, fs/inode.c:invalidate_list() when
> the filesystem has a zillion inodes in icache. Measured 250 milliseconds
> on a 256MB 2.7GHz P4 here. OK, so don't do that.
>
> The unavoidable worst case is in the RCU callbacks for dcache shrinkage -
> I've seen 25 millisecond holdoffs on the above machine during filesystem
> stresstests when RCU is freeing a huge number of dentries in softirq
> context.
What filesystem stresstest was that ?
>
> This is Hard To Fix. Dipankar spent quite some time looking into it and
> had patches, but I lost track of where they're at.
And I am still working on this on a larger scope/scale. Yes, I have
a patch that hands over the rcu callbacks to a per-cpu kernel thread
reducing the softirq time. However this is not really a solution to
the overall problem, IMO. I am collecting some instrumentation
data to understand softirq/rcu behavior during heavy loads and
ways to counter long running softirqs.
Latency isn't the only issue. DoS on route cache is another
issue that needs to be addressed. I have been experimenting
with Robert Olsson's router test and should have some more results
out soon.
Thanks
Dipankar
Dipankar Sarma <[email protected]> wrote:
>
> On Thu, Mar 18, 2004 at 10:10:06PM -0800, Andrew Morton wrote:
> > The worst-case latency is during umount, fs/inode.c:invalidate_list() when
> > the filesystem has a zillion inodes in icache. Measured 250 milliseconds
> > on a 256MB 2.7GHz P4 here. OK, so don't do that.
> >
> > The unavoidable worst case is in the RCU callbacks for dcache shrinkage -
> > I've seen 25 millisecond holdoffs on the above machine during filesystem
> > stresstests when RCU is freeing a huge number of dentries in softirq
> > context.
>
> What filesystem stresstest was that ?
Something which creates a lot of slab, and a bit of memory pressure
basically. Such as make-teeny-files from
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz
At Thu, 18 Mar 2004 15:57:45 -0800,
Andrew Morton wrote:
>
> Chris Mason <[email protected]> wrote:
> >
> > > What we're doing here is walking (under spinlock) a ring of buffers
> > > searching for unlocked dirty ones to write out.
> > >
> > > Suppose the ring has thousands of locked buffers and there is a RT task
> > > scheduling at 1kHz. Each time need_resched() comes true we break out of
> > > the search, schedule away and then restart the search from the beginning.
> > >
> > Why not just put all the locked buffers onto a different list as you
> > find them or lock them. That way you can always make progress...
>
> Coz that made me look at the code and hence rewrite the whole thing, damn
> you. It's definitely better, but is of scary scope.
wow, interesting, i'll give it a shot with your patch.
thanks!
Takashi
Takashi Iwai <[email protected]> wrote:
>
> At Thu, 18 Mar 2004 15:57:45 -0800,
> Andrew Morton wrote:
> >
> > Chris Mason <[email protected]> wrote:
> > >
> > > > What we're doing here is walking (under spinlock) a ring of buffers
> > > > searching for unlocked dirty ones to write out.
> > > >
> > > > Suppose the ring has thousands of locked buffers and there is a RT task
> > > > scheduling at 1kHz. Each time need_resched() comes true we break out of
> > > > the search, schedule away and then restart the search from the beginning.
> > > >
> > > Why not just put all the locked buffers onto a different list as you
> > > find them or lock them. That way you can always make progress...
> >
> > Coz that made me look at the code and hence rewrite the whole thing, damn
> > you. It's definitely better, but is of scary scope.
>
> wow, interesting, i'll give it a shot with your patch.
It had a bug. This version has been well tested.
There's some nasty code in commit which deals with a lock ranking problem.
Currently if it fails to get the lock when the local variable `bufs' is zero
we forget to write out some ordered-data buffers. So a subsequent
crash+recovery could yield stale data in existing files.
Fix it by correctly restarting the t_sync_datalist search.
---
25-akpm/fs/jbd/commit.c | 3 +--
1 files changed, 1 insertion(+), 2 deletions(-)
diff -puN fs/jbd/commit.c~jbd-commit-ordered-fix fs/jbd/commit.c
--- 25/fs/jbd/commit.c~jbd-commit-ordered-fix Thu Mar 18 14:08:13 2004
+++ 25-akpm/fs/jbd/commit.c Thu Mar 18 14:28:52 2004
@@ -262,8 +262,7 @@ write_out_data_locked:
if (!jbd_trylock_bh_state(bh)) {
spin_unlock(&journal->j_list_lock);
schedule();
- spin_lock(&journal->j_list_lock);
- break;
+ goto write_out_data;
}
__journal_unfile_buffer(jh);
jh->b_transaction = NULL;
_
For data=ordered, kjournald at commit time has to write out and wait upon a
long list of buffers. It does this in a rather awkward way with a single
list. It causes complexity and long lock hold times, and makes the addition
of rescheduling points quite hard.
So what we do instead (based on Chris Mason's suggestion) is to add a new
buffer list (t_locked_list) to the journal. It contains buffers which have
been placed under I/O.
So as we walk the t_sync_datalist list we move buffers over to t_locked_list
as they are written out.
When t_sync_datalist is empty we may then walk t_locked_list waiting for the
I/O to complete.
As a side-effect this means that we can remove the nasty synchronous wait in
journal_dirty_data which is there to avoid the kjournald livelock which would
otherwise occur when someone is continuously dirtying a buffer.
---
25-akpm/fs/jbd/commit.c | 143 +++++++++++++++++++++----------------------
25-akpm/fs/jbd/transaction.c | 13 +++
25-akpm/include/linux/jbd.h | 9 ++
3 files changed, 91 insertions(+), 74 deletions(-)
diff -puN fs/jbd/commit.c~jbd-move-locked-buffers fs/jbd/commit.c
--- 25/fs/jbd/commit.c~jbd-move-locked-buffers 2004-03-18 22:28:53.207331840 -0800
+++ 25-akpm/fs/jbd/commit.c 2004-03-18 22:33:57.569061824 -0800
@@ -79,6 +79,21 @@ nope:
}
/*
+ * Try to acquire jbd_lock_bh_state() against the buffer, when j_list_lock is
+ * held. For ranking reasons we must trylock. If we lose, schedule away and
+ * return 0. j_list_lock is dropped in this case.
+ */
+static int inverted_lock(journal_t *journal, struct buffer_head *bh)
+{
+ if (!jbd_trylock_bh_state(bh)) {
+ spin_unlock(&journal->j_list_lock);
+ schedule();
+ return 0;
+ }
+ return 1;
+}
+
+/*
* journal_commit_transaction
*
* The primary function for committing a transaction to the log. This
@@ -88,7 +103,6 @@ void journal_commit_transaction(journal_
{
transaction_t *commit_transaction;
struct journal_head *jh, *new_jh, *descriptor;
- struct journal_head *next_jh, *last_jh;
struct buffer_head *wbuf[64];
int bufs;
int flags;
@@ -222,113 +236,102 @@ void journal_commit_transaction(journal_
err = 0;
/*
* Whenever we unlock the journal and sleep, things can get added
- * onto ->t_datalist, so we have to keep looping back to write_out_data
- * until we *know* that the list is empty.
+ * onto ->t_sync_datalist, so we have to keep looping back to
+ * write_out_data until we *know* that the list is empty.
*/
-write_out_data:
-
+ bufs = 0;
/*
* Cleanup any flushed data buffers from the data list. Even in
* abort mode, we want to flush this out as soon as possible.
- *
- * We take j_list_lock to protect the lists from
- * journal_try_to_free_buffers().
*/
+write_out_data:
+ cond_resched();
spin_lock(&journal->j_list_lock);
-write_out_data_locked:
- bufs = 0;
- next_jh = commit_transaction->t_sync_datalist;
- if (next_jh == NULL)
- goto sync_datalist_empty;
- last_jh = next_jh->b_tprev;
-
- do {
+ while (commit_transaction->t_sync_datalist) {
struct buffer_head *bh;
- jh = next_jh;
- next_jh = jh->b_tnext;
+ jh = commit_transaction->t_sync_datalist;
+ commit_transaction->t_sync_datalist = jh->b_tnext;
bh = jh2bh(jh);
- if (!buffer_locked(bh)) {
+ if (buffer_locked(bh)) {
+ BUFFER_TRACE(bh, "locked");
+ if (!inverted_lock(journal, bh))
+ goto write_out_data;
+ __journal_unfile_buffer(jh);
+ __journal_file_buffer(jh, jh->b_transaction, BJ_Locked);
+ jbd_unlock_bh_state(bh);
+ if (need_resched()) {
+ spin_unlock(&journal->j_list_lock);
+ goto write_out_data;
+ }
+ } else {
if (buffer_dirty(bh)) {
BUFFER_TRACE(bh, "start journal writeout");
- atomic_inc(&bh->b_count);
+ get_bh(bh);
wbuf[bufs++] = bh;
- } else {
- BUFFER_TRACE(bh, "writeout complete: unfile");
- /*
- * We have a lock ranking problem..
- */
- if (!jbd_trylock_bh_state(bh)) {
+ if (bufs == ARRAY_SIZE(wbuf)) {
+ jbd_debug(2, "submit %d writes\n",
+ bufs);
spin_unlock(&journal->j_list_lock);
- schedule();
+ ll_rw_block(WRITE, bufs, wbuf);
+ journal_brelse_array(wbuf, bufs);
+ bufs = 0;
goto write_out_data;
}
+ } else {
+ BUFFER_TRACE(bh, "writeout complete: unfile");
+ if (!inverted_lock(journal, bh))
+ goto write_out_data;
__journal_unfile_buffer(jh);
jh->b_transaction = NULL;
jbd_unlock_bh_state(bh);
journal_remove_journal_head(bh);
- __brelse(bh);
- if (need_resched() && commit_transaction->
- t_sync_datalist) {
- commit_transaction->t_sync_datalist =
- next_jh;
- if (bufs)
- break;
+ put_bh(bh);
+ if (need_resched()) {
spin_unlock(&journal->j_list_lock);
- cond_resched();
goto write_out_data;
}
}
}
- if (bufs == ARRAY_SIZE(wbuf)) {
- /*
- * Major speedup: start here on the next scan
- */
- J_ASSERT(commit_transaction->t_sync_datalist != 0);
- commit_transaction->t_sync_datalist = jh;
- break;
- }
- } while (jh != last_jh);
-
- if (bufs || need_resched()) {
- jbd_debug(2, "submit %d writes\n", bufs);
- spin_unlock(&journal->j_list_lock);
- if (bufs)
- ll_rw_block(WRITE, bufs, wbuf);
- cond_resched();
- journal_brelse_array(wbuf, bufs);
- spin_lock(&journal->j_list_lock);
- goto write_out_data_locked;
}
/*
- * Wait for all previously submitted IO on the data list to complete.
+ * Wait for all previously submitted IO to complete.
*/
- jh = commit_transaction->t_sync_datalist;
- if (jh == NULL)
- goto sync_datalist_empty;
-
- do {
+ while (commit_transaction->t_locked_list) {
struct buffer_head *bh;
- jh = jh->b_tprev; /* Wait on the last written */
+
+ jh = commit_transaction->t_locked_list->b_tprev;
bh = jh2bh(jh);
+ get_bh(bh);
if (buffer_locked(bh)) {
- get_bh(bh);
spin_unlock(&journal->j_list_lock);
wait_on_buffer(bh);
if (unlikely(!buffer_uptodate(bh)))
err = -EIO;
+ spin_lock(&journal->j_list_lock);
+ }
+ if (!inverted_lock(journal, bh)) {
put_bh(bh);
- /* the journal_head may have been removed now */
- goto write_out_data;
- } else if (buffer_dirty(bh)) {
- goto write_out_data_locked;
+ spin_lock(&journal->j_list_lock);
+ continue;
}
- } while (jh != commit_transaction->t_sync_datalist);
- goto write_out_data_locked;
-
-sync_datalist_empty:
+ if (buffer_jbd(bh) && jh->b_jlist == BJ_Locked) {
+ __journal_unfile_buffer(jh);
+ jh->b_transaction = NULL;
+ jbd_unlock_bh_state(bh);
+ journal_remove_journal_head(bh);
+ } else {
+ jbd_unlock_bh_state(bh);
+ }
+ put_bh(bh);
+ if (need_resched()) {
+ spin_unlock(&journal->j_list_lock);
+ cond_resched();
+ spin_lock(&journal->j_list_lock);
+ }
+ }
spin_unlock(&journal->j_list_lock);
journal_write_revoke_records(journal, commit_transaction);
diff -puN include/linux/jbd.h~jbd-move-locked-buffers include/linux/jbd.h
--- 25/include/linux/jbd.h~jbd-move-locked-buffers 2004-03-18 22:28:53.208331688 -0800
+++ 25-akpm/include/linux/jbd.h 2004-03-18 22:28:53.214330776 -0800
@@ -487,6 +487,12 @@ struct transaction_s
struct journal_head *t_reserved_list;
/*
+ * Doubly-linked circular list of all buffers under writeout during
+ * commit [j_list_lock]
+ */
+ struct journal_head *t_locked_list;
+
+ /*
* Doubly-linked circular list of all metadata buffers owned by this
* transaction [j_list_lock]
*/
@@ -1079,7 +1085,8 @@ static inline int jbd_space_needed(journ
#define BJ_Shadow 5 /* Buffer contents being shadowed to the log */
#define BJ_LogCtl 6 /* Buffer contains log descriptors */
#define BJ_Reserved 7 /* Buffer is reserved for access by journal */
-#define BJ_Types 8
+#define BJ_Locked 8 /* Locked for I/O during commit */
+#define BJ_Types 9
extern int jbd_blocks_per_page(struct inode *inode);
diff -puN fs/jbd/transaction.c~jbd-move-locked-buffers fs/jbd/transaction.c
--- 25/fs/jbd/transaction.c~jbd-move-locked-buffers 2004-03-18 22:28:53.210331384 -0800
+++ 25-akpm/fs/jbd/transaction.c 2004-03-18 22:33:38.948892520 -0800
@@ -1010,7 +1010,8 @@ int journal_dirty_data(handle_t *handle,
* the write() data.
*/
if (jh->b_jlist != BJ_None &&
- jh->b_jlist != BJ_SyncData) {
+ jh->b_jlist != BJ_SyncData &&
+ jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "Not stealing");
goto no_journal;
}
@@ -1048,7 +1049,7 @@ int journal_dirty_data(handle_t *handle,
* committing transaction, so might still be left on that
* transaction's metadata lists.
*/
- if (jh->b_jlist != BJ_SyncData) {
+ if (jh->b_jlist != BJ_SyncData && jh->b_jlist != BJ_Locked) {
JBUFFER_TRACE(jh, "not on correct data list: unfile");
J_ASSERT_JH(jh, jh->b_jlist != BJ_Shadow);
__journal_unfile_buffer(jh);
@@ -1539,6 +1540,9 @@ void __journal_unfile_buffer(struct jour
case BJ_Reserved:
list = &transaction->t_reserved_list;
break;
+ case BJ_Locked:
+ list = &transaction->t_locked_list;
+ break;
}
__blist_del_buffer(list, jh);
@@ -1576,7 +1580,7 @@ __journal_try_to_free_buffer(journal_t *
spin_lock(&journal->j_list_lock);
if (jh->b_transaction != 0 && jh->b_cp_transaction == 0) {
- if (jh->b_jlist == BJ_SyncData) {
+ if (jh->b_jlist == BJ_SyncData || jh->b_jlist == BJ_Locked) {
/* A written-back ordered data buffer */
JBUFFER_TRACE(jh, "release data");
__journal_unfile_buffer(jh);
@@ -1985,6 +1989,9 @@ void __journal_file_buffer(struct journa
case BJ_Reserved:
list = &transaction->t_reserved_list;
break;
+ case BJ_Locked:
+ list = &transaction->t_locked_list;
+ break;
}
__blist_add_buffer(list, jh);
_
On Thu, 18 Mar 2004 14:24:59 EST, Robert Love said:
> On Thu, 2004-03-18 at 14:08, Takashi Iwai wrote:
>
> > oh, sorry, maybe i forgot to tell you that it has been already there
> > :)
> >
> > # echo 1 > /proc/asound/card0/pcm0p/xrun_debug
> >
> > this will show the stacktrace when a buffer overrun/underrun is
> > detected in the irq handler. it's not perfect, though.
>
> Excellent!
>
> Has this resulted in anything useful?
>
> With KALLSYMS=y, a lot of users now become intelligent bug testers.
>
> Eh, except that I do not have that procfs entry on my system..
Is one of these yours? My machine has 3 (but only one soundcard) ;)
% find /proc/asound/ -name xrun_debug -ls
4480 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm1c/xrun_debug
4470 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0c/xrun_debug
4462 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0p/xrun_debug
(Are these 3 different controls, or 3 places to set the same variable?)
> Is one of these yours? My machine has 3 (but only one soundcard) ;)
>
> % find /proc/asound/ -name xrun_debug -ls
> 4480 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm1c/xrun_debug
> 4470 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0c/xrun_debug
> 4462 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0p/xrun_debug
>
> (Are these 3 different controls, or 3 places to set the same variable?)
Heh :)
I suspect I need debugging enabled.
Robert Love
Robert Love wrote:
> This is because of work Dave Miller and Ingo did - irq count, softirq
> count, and lock count (when PREEMPT=y) are unified into preempt_count.
(x86 only) Have you considered unifying the "interrupts disabled" bit
into it as well, for spin_lock_irqsave/spin_unlock_irqrestore?
The principle is to use a memory bit or counter instead of "cli" and
"sti", because it is cheaper or even free when it's unified in
preempt_count.
The very first instructions of the interrupt handlers check
preempt_count, and if interrupts are logically disabled they modify
preempt_count to indicate that an irq is pending, save the interrupt
vector number, and return keeping interrupts disabled at the CPU level
(i.e. not using "iret"). Only one vector can be pending in this way,
because we keep CPU interrupts disabled after the first one.
In spin_unlock_irqrestore, it checks preempt_count (as it already
does), and in the slow path if there's an irq pending, calls the
handler for that irq as if it were invoked by the CPU.
That should effectively eliminate the "cli" and "sti" instructions
from spin_lock_irqsave and spin_unlock_irqrestore, saving a few cycles.
-- Jamie
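A minimal, self-contained model of the idea (all names here -
irqs_soft_disabled, pending_vector, handle_irq_vector - are made up; the
real proposal would fold the flag into preempt_count and do the entry-path
check in the low-level assembly):

static int irqs_soft_disabled;          /* would live in preempt_count in practice */
static int pending_vector;              /* 0 = no deferred irq */

void handle_irq_vector(int vec);        /* the usual C-level irq dispatch, assumed */

static void soft_irq_disable(void)      /* what spin_lock_irqsave() would use instead of cli */
{
        irqs_soft_disabled = 1;
}

static void soft_irq_enable(void)       /* what spin_unlock_irqrestore() would use instead of sti */
{
        int vec = pending_vector;

        irqs_soft_disabled = 0;
        if (vec) {                      /* slow path: replay the deferred irq */
                pending_vector = 0;
                handle_irq_vector(vec);
        }
}

/* called first thing from the low-level interrupt entry */
static int soft_irq_entry(int vec)
{
        if (irqs_soft_disabled) {
                pending_vector = vec;   /* only one can be pending; CPU irqs stay off */
                return 0;               /* return without running the handler */
        }
        return 1;                       /* logically enabled: run the handler now */
}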
On Thu, Mar 18, 2004 at 09:09:48PM -0800, William Lee Irwin III wrote:
> On Fri, Mar 19, 2004 at 01:17:47PM +1100, Nick Piggin wrote:
> > Technically, I think preempt can still reduce maximum latency
> > depending on whether the time between explicit resched checks
> > is greater than the (small) scheduling overhead that preempt
> > adds.
> > But yeah, that would be in the noise compared with long
> > preempt-off regions.
> > I'm not sure about applications, but maybe some soft-realtime
> > things like audio software benefit from a lower average latency
> > (I don't know).
>
> Someone really needs to get numbers on this stuff.
Takashi already did it and we know it doesn't reduce the maximum latency.
On Fri, Mar 19, 2004 at 10:52:03PM +0530, Dipankar Sarma wrote:
> On Thu, Mar 18, 2004 at 10:10:06PM -0800, Andrew Morton wrote:
> > The worst-case latency is during umount, fs/inode.c:invalidate_list() when
> > the filesystem has a zillion inodes in icache. Measured 250 milliseconds
> > on a 256MB 2.7GHz P4 here. OK, so don't do that.
> >
> > The unavoidable worst case is in the RCU callbacks for dcache shrinkage -
> > I've seen 25 millisecond holdoffs on the above machine during filesystem
> > stresstests when RCU is freeing a huge number of dentries in softirq
> > context.
>
> What filesystem stresstest was that ?
>
> >
> > This is Hard To Fix. Dipankar spent quite some time looking into it and
> > had patches, but I lost track of where they're at.
>
> And I am still working on this on a larger scope/scale. Yes, I have
> a patch that hands over the rcu callbacks to a per-cpu kernel thread
> reducing the softirq time. However this is not really a solution to
why a per-cpu kernel thread when you can use a softirq?
> the overall problem, IMO. I am collecting some instrumentation
> data to understand softirq/rcu behavior during heavy loads and
> ways to counter long running softirqs.
>
> Latency isn't the only issue. DoS on route cache is another
> issue that needs to be addressed. I have been experimenting
> with Robert Olsson's router test and should have some more results
> out soon.
why don't you simply interrupt rcu_do_batch after a dozen callbacks?
if it gets interrupted you then go ahead and you splice the remaining
entries into another list for a tasklet, then the tasklet will be a
reentrant one, so the ksoftirqd will take care of the latency.
the only valid reason to use the timer irq instead of the tasklet in the
first place is to delay the rcu invocation and coalesce the work
together, but if there's too much work to do you must go back to the
tasklet way that has always been scheduler-friendly.
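A rough, self-contained sketch of that batching idea (plain C with made-up
types; the real callbacks are struct rcu_head entries and the re-queueing
would go through a tasklet):

#define SKETCH_BATCH_LIMIT 12

struct cb {
        struct cb *next;
        void (*func)(void *arg);
        void *arg;
};

/*
 * Run at most SKETCH_BATCH_LIMIT callbacks and return the unprocessed
 * tail, so the caller can hand the rest to a (ksoftirqd-friendly)
 * tasklet instead of hogging the timer softirq.
 */
struct cb *do_batch_limited(struct cb *list)
{
        int done = 0;

        while (list && done < SKETCH_BATCH_LIMIT) {
                struct cb *next = list->next;

                list->func(list->arg);
                list = next;
                done++;
        }
        return list;    /* non-NULL means "reschedule me" */
}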
On Sat, Mar 20, 2004 at 01:24:11PM +0100, Andrea Arcangeli wrote:
> On Fri, Mar 19, 2004 at 10:52:03PM +0530, Dipankar Sarma wrote:
> > the overall problem, IMO. I am collecting some instrumentation
> > data to understand softirq/rcu behavior during heavy loads and
> > ways to counter long running softirqs.
> >
> > Latency isn't the only issue. DoS on route cache is another
> > issue that needs to be addressed. I have been experimenting
> > with Robert Olsson's router test and should have some more results
> > out soon.
>
> why don't you simply interrupt rcu_do_batch after a dozen callbacks?
> if it gets interrupted you then go ahead and you splice the remaining
> entries into another list for a tasklet, then the tasklet will be a
> reentrant one, so the ksoftirqd will take care of the latency.
>
> the only valid reason to use the timer irq instead of the tasklet in the
> first place is to delay the rcu invocation and coalesce the work
> together, but if there's too much work to do you must go back to the
> tasklet way that has always been scheduler-friendly.
Andrea, I *am* working on a throttling mechanism for rcu/softirqs.
I just didn't see the point in publishing it until I had the
measurement results in hand :)
I will publish the results under both router DoS and filesystem workload.
Thanks
Dipankar
On Thu, Mar 18, 2004 at 09:09:48PM -0800, William Lee Irwin III wrote:
>> Someone really needs to get numbers on this stuff.
On Sat, Mar 20, 2004 at 01:14:23PM +0100, Andrea Arcangeli wrote:
> Takashi already did it and we know it doesn't reduce the maximum latency.
I may have missed one of his posts where he gave the results from the
RT test suite. I found a list of functions with some kind of numbers,
though I didn't see a description of what those numbers were and was
looking for something more detailed (e.g. the output of the RT
instrumentation things he had with and without preempt). This is all
mostly curiosity and sort of hoping this gets carried out vaguely
scientifically anyway, so I'm not really arguing one way or the other.
-- wli
On Sat, Mar 20, 2004 at 06:51:11AM -0800, William Lee Irwin III wrote:
> I may have missed one of his posts where he gave the results from the
> RT test suite. I found a list of functions with some kind of numbers,
> though I didn't see a description of what those numbers were and was
> looking for something more detailed (e.g. the output of the RT
> instrumentation things he had with and without preempt). This is all
> mostly curiosity and sort of hoping this gets carried out vaguely
> scientifically anyway, so I'm not really arguing one way or the other.
agreed. what I've seen so far is a great number of graphs, they were
scientific enough for my needs and covering different real-life
workloads, but I'm not sure what Takashi published exactly, you may want
to discuss it with him.
On Sat, Mar 20, 2004 at 06:51:11AM -0800, William Lee Irwin III wrote:
>> I may have missed one of his posts where he gave the results from the
>> RT test suite. I found a list of functions with some kind of numbers,
>> though I didn't see a description of what those numbers were and was
>> looking for something more detailed (e.g. the output of the RT
>> instrumentation things he had with and without preempt). This is all
>> mostly curiosity and sort of hoping this gets carried out vaguely
>> scientifically anyway, so I'm not really arguing one way or the other.
On Sat, Mar 20, 2004 at 04:03:11PM +0100, Andrea Arcangeli wrote:
> agreed. what I've seen so far is a great number of graphs, they were
> scientific enough for my needs and covering different real-life
> workloads, but I'm not sure what Takashi published exactly, you may want
> to discuss it with him.
I didn't see the graphs; they may be helpful for my purposes.
-- wli
On Fri, Mar 19, 2004 at 11:30:16AM +0100, Takashi Iwai wrote:
> > The unavoidable worst case is in the RCU callbacks for dcache shrinkage -
> > I've seen 25 millisecond holdoffs on the above machine during filesystem
> > stresstests when RCU is freeing a huge number of dentries in softirq
> > context.
>
> hmm, this wasn't evaluated in my tests either.
> it's worth a try. thanks for the info.
Please do. This would be a very interesting data point.
>
> > This is Hard To Fix. Dipankar spent quite some time looking into it and
> > had patches, but I lost track of where they're at.
>
> couldn't this tasklet be replaced with a workqueue or such?
Close. I am using per-cpu kernel threads to hand over the rcu callbacks
if there are too many of them. It depends on CONFIG_PREEMPT for better
latency from there onwards.
I can use a workqueue, but I needed the flexibility to experiment with
the kernel thread for now.
Thanks
Dipankar
At Sat, 20 Mar 2004 16:03:11 +0100,
Andrea Arcangeli wrote:
>
> On Sat, Mar 20, 2004 at 06:51:11AM -0800, William Lee Irwin III wrote:
> > I may have missed one of his posts where he gave the results from the
> > RT test suite. I found a list of functions with some kind of numbers,
> > though I didn't see a description of what those numbers were and was
> > looking for something more detailed (e.g. the output of the RT
> > instrumentation things he had with and without preempt). This is all
> > mostly curiosity and sort of hoping this gets carried out vaguely
> > scientifically anyway, so I'm not really arguing one way or the other.
>
> agreed. what I've seen so far is a great number of graphs, they were
> scientific enough for my needs and covering different real-life
> workloads, but I'm not sure what Takashi published exactly, you may want
> to discuss it with him.
sorry, there is no exact description (yet) in public, except for the
pdf of slides i presented a while ago:
http://www.alsa-project.org/~iwai/audio-latency.pdf
the data there are already old (based on 2.6.0-test9), and were
measured by the old version of the latency-test program (a modified version
of Benno Senoner's latencytest suite).
basically, the latency-test program measures the latency between the
time of the (periodic) irq and the time when the user RT-thread is woken
up under different workloads. the user thread can perform an
artificial CPU load (busy loop) to simulate the situation where an RT
process does a heavy job.
the resultant plot shows the critical deadline, the total latency and
the CPU latency (busy loop) responses, as you can see in the above
pdf.
the latest version of latency-test suite is found at
http://www.alsa-project.org/~iwai/latencytest-0.5.3.tar.gz
it uses its own kernel module to generate irqs from RTC and to trace
stacks.
i'll show the results of the recent kernels tomorrow...
--
Takashi Iwai <[email protected]> ALSA Developer - http://www.alsa-project.org
On Wed, Mar 24, 2004 at 02:57:20PM +0100, Takashi Iwai wrote:
> At Sat, 20 Mar 2004 16:03:11 +0100,
> Andrea Arcangeli wrote:
> >
> > On Sat, Mar 20, 2004 at 06:51:11AM -0800, William Lee Irwin III wrote:
> > > I may have missed one of his posts where he gave the results from the
> > > RT test suite. I found a list of functions with some kind of numbers,
> > > though I didn't see a description of what those numbers were and was
> > > looking for something more detailed (e.g. the output of the RT
> > > instrumentation things he had with and without preempt). This is all
> > > mostly curiosity and sort of hoping this gets carried out vaguely
> > > scientifically anyway, so I'm not really arguing one way or the other.
> >
> > agreed. what I've seen so far is a great number of graphs, they were
> > scientific enough for my needs and covering real life different
> > workloads, but I'm not sure what Takashi published exactly, you may want
> > to discuss it with him.
>
> sorry, there is no exact description (yet) in public, except for the
> pdf of slides i presented a while ago:
>
> http://www.alsa-project.org/~iwai/audio-latency.pdf
this is what I meant. To be fair I've also seen lots of raw input data,
but you represented all the interesting points in the presentation.
> it uses its own kernel module to generate irqs from RTC and to trace
> stacks.
>
> i'll show the results of the recent kernels tomorrow...
cool ;)
At Fri, 19 Mar 2004 17:12:55 -0500,
Robert Love wrote:
>
>
> > Is one of these yours? My machine has 3 (but only one soundcard) ;)
> >
> > % find /proc/asound/ -name xrun_debug -ls
> > 4480 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm1c/xrun_debug
> > 4470 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0c/xrun_debug
> > 4462 0 -r--r--r-- 1 root root 0 Mar 19 16:00 /proc/asound/card0/pcm0p/xrun_debug
> >
> > (Are these 3 different controls, or 3 places to set the same variable?)
>
> Heh :)
>
> I suspect I need debugging enabled.
yes, please turn on CONFIG_SND_DEBUG.
these files appear in each PCM stream, so that only one of them can be
activated. for example, pcm0p is the first PCM for playback, pcm1c
the second PCM for capture.
in the latest version of ALSA (the patchset will be submitted by
Jaroslav hopefully soon), the debug messages are suppressed
until xrun_debug > 0 is set, because in many cases they are triggered
by a sloppy irq handler and are harmless.
when xrun_debug > 1, the driver will show stacktraces together with
the xrun (overrun/underrun) notifications.
there is a new document, Documentation/sound/alsa/Procfile.txt,
describing these and other ALSA proc files.
Takashi
On Wed, Mar 24, 2004 at 02:57:20PM +0100, Takashi Iwai wrote:
> http://www.alsa-project.org/~iwai/audio-latency.pdf
> the data there are already old (based on 2.6.0-test9), and were
> measured by the old version of the latency-test program (a modified version
> of Benno Senoner's latencytest suite).
> basically, the latency-test program measures the latency between the
> time of the (periodic) irq and the time when the user RT-thread is woken
> up under different workloads. the user thread can perform an
> artificial CPU load (busy loop) to simulate the situation where an RT
> process does a heavy job.
> the resultant plot shows the critical deadline, the total latency and
> the CPU latency (busy loop) responses, as you can see in the above
> pdf.
> the latest version of latency-test suite is found at
> http://www.alsa-project.org/~iwai/latencytest-0.5.3.tar.gz
> it uses its own kernel module to generate irqs from RTC and to trace
> stacks.
> i'll show the results of the recent kernels tomorrow...
Very nice. This document is the kind of thing I wanted to see.
-- wli