I've been testing for weeks with 2.6.16-rc1 + the latency trace patch
and the longest latencies measured were 10-15ms due to the well known
rt_run_flush issue. Today I got one twice as long, from a new code
path, when I killed a Firefox process with a bunch of acroreads in tabs.
It seems to trigger when a process with a large amount of memory swapped
out exits.
Can this be solved with a cond_resched?
preemption latency trace v1.1.5 on 2.6.16-rc1
--------------------------------------------------------------------
latency: 28130 us, #38020/38020, CPU#0 | (M:rt VP:0, KP:0, SP:0 HP:0)
-----------------
| task: acroread-29870 (uid:1000 nice:0 policy:0 rt_prio:0)
-----------------
_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
Xorg-18254 0d.s2 1us : __trace_start_sched_wakeup (try_to_wake_up)
Xorg-18254 0d.s2 1us : __trace_start_sched_wakeup <<...>-29870> (73 0)
Xorg-18254 0d.s. 2us : wake_up_process (process_timeout)
Xorg-18254 0dn.2 3us < (2110048)
Xorg-18254 0dn.2 4us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 5us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 6us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 6us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 7us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 8us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 8us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 9us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 10us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 11us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 12us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 12us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 13us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 14us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 14us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 15us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 16us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 16us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 17us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 18us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 19us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 19us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 20us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 21us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 21us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 22us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 23us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 23us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 24us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 25us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 25us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 26us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 27us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 27us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 28us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 29us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 30us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 30us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 31us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 32us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 32us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 33us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 34us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 35us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 35us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 36us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 37us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 37us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 38us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 39us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.2 40us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 40us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 41us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 42us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 42us : radix_tree_lookup (find_trylock_page)
[ etc ]
Xorg-18254 0dn.2 28062us : free_swap_and_cache (unmap_vmas)
Xorg-18254 0dn.2 28062us : swap_info_get (free_swap_and_cache)
Xorg-18254 0dn.3 28063us : swap_entry_free (free_swap_and_cache)
Xorg-18254 0dn.3 28064us : find_trylock_page (free_swap_and_cache)
Xorg-18254 0dn.4 28064us : radix_tree_lookup (find_trylock_page)
Xorg-18254 0dn.3 28065us : preempt_schedule (find_trylock_page)
Xorg-18254 0dn.2 28066us : preempt_schedule (free_swap_and_cache)
Xorg-18254 0dn.1 28067us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28073us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28077us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28083us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28089us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28095us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28099us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28101us+: preempt_schedule (unmap_vmas)
Xorg-18254 0dn.1 28106us : preempt_schedule (unmap_vmas)
Xorg-18254 0dn.2 28107us : vm_normal_page (unmap_vmas)
Xorg-18254 0dn.2 28109us : vm_normal_page (unmap_vmas)
Xorg-18254 0dn.2 28109us : vm_normal_page (unmap_vmas)
Xorg-18254 0dn.2 28110us : vm_normal_page (unmap_vmas)
Xorg-18254 0dn.1 28111us : preempt_schedule (unmap_vmas)
Xorg-18254 0dn.. 28112us : preempt_schedule (unmap_vmas)
Xorg-18254 0dn.. 28113us : schedule (preempt_schedule)
Xorg-18254 0dn.. 28113us : stop_trace (schedule)
Xorg-18254 0dn.. 28114us : profile_hit (schedule)
Xorg-18254 0dn.1 28115us+: sched_clock (schedule)
Xorg-18254 0dn.2 28117us : recalc_task_prio (schedule)
Xorg-18254 0dn.2 28119us+: effective_prio (recalc_task_prio)
<...>-29870 0d..2 28123us+: __switch_to (schedule)
<...>-29870 0d..2 28126us : schedule <Xorg-18254> (76 73)
<...>-29870 0d..1 28127us : trace_stop_sched_switched (schedule)
<...>-29870 0d..2 28128us : trace_stop_sched_switched <<...>-29870> (73 0)
<...>-29870 0d..2 28130us : schedule (schedule)
On Tue, 14 Mar 2006, Lee Revell wrote:
> I've been testing for weeks with 2.6.16-rc1 + the latency trace patch
> and the longest latencies measured were 10-15ms due to the well known
> rt_run_flush issue. Today I got one twice as long, from a new code
> path, when I killed a Firefox process with a bunch of acroreads in tabs.
>
> It seems to trigger when a process with a large amount of memory swapped
> out exits.
>
> Can this be solved with a cond_resched?
Not that easily, I think.
Are you testing with CONFIG_PREEMPT=y, as I'd expect? I thought
cond_resched() adds nothing to that case (and we keep on intending
but forgetting to make it compile away to nothing in that case).
Or am I confused?
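
(For illustration only, a rough sketch of the intention, not what is
in the tree today: with CONFIG_PREEMPT=y the preempt_enable() in every
unlock is already a reschedule point, so cond_resched() could in
principle become a no-op:)

	/* Sketch only, not current source: compile cond_resched()
	 * away under CONFIG_PREEMPT, where preempt_enable() at each
	 * unlock already reschedules as needed. */
	#ifdef CONFIG_PREEMPT
	# define cond_resched()	(0)
	#else
	extern int cond_resched(void);
	#endif
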
This is an area (the most notable area) where CONFIG_PREEMPT on and
off make very different decisions on how best to proceed, PREEMPT
advancing in tiny steps and non-PREEMPT doing as much as it can
in one go with preemption disabled.
I suppose, if the swapper_space radix tree is deep, you might get
to see preemption held off for longer than expected, even when
CONFIG_PREEMPT=y; but it'd probably be painful to make those tiny
(8 at a time) steps even tinier.
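
(The step size in question comes from mm/memory.c; I'm quoting it
roughly from memory, so the comment may differ in detail:)

	#ifdef CONFIG_PREEMPT
	# define ZAP_BLOCK_SIZE	(8 * PAGE_SIZE)
	#else
	/* No preempt: go for improved straight-line efficiency */
	# define ZAP_BLOCK_SIZE	(1024 * PAGE_SIZE)
	#endif
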
Whereas if you're checking out CONFIG_PREEMPT undefined, yes,
it's entirely unsurprising that you see long latencies here.
Whichever you're trying, although this trace identifies a different
aspect from the regression you identified in free_pgtables, it is
(probably) another example of how the per-cpu mmu_gather limits us.
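
(For reference, the per-cpu structure in question, roughly as it
appears in include/asm-generic/tlb.h; again from memory, so the
fields may differ in detail:)

	struct mmu_gather {
		struct mm_struct	*mm;
		unsigned int		nr;	/* ~0U means fast mode */
		unsigned int		need_flush;
		unsigned int		fullmm;	/* full mm flush */
		struct page		*pages[FREE_PTE_NR];
	};
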
I'm still working intermittently on the patch to rework that,
which should improve the situation here as in free_pgtables.
Intermittently because I'm not thoroughly convinced by what I've
done so far, it's unsatisfactory in two different ways (one, leaving
sparc64 and powerpc out in the cold, since they deal with the TLB in
a different way, more deeply reliant on the per-cpu buffers; two,
failing to improve the rare but real truncation-of-mapped-file case).
Not ready to show to light of day yet.
I have put a cond_resched into release_pages in that patch, to
improve the CONFIG_PREEMPT off case; but that depends on the other
changes, and cannot simply be extracted from the whole.
Hugh
> [ trace snipped ]
On Tue, 2006-03-14 at 18:40 +0000, Hugh Dickins wrote:
> On Tue, 14 Mar 2006, Lee Revell wrote:
>
> > I've been testing for weeks with 2.6.16-rc1 + the latency trace patch
> > and the longest latencies measured were 10-15ms due to the well known
> > rt_run_flush issue. Today I got one twice as long, from a new code
> > path, when I killed a Firefox process with a bunch of acroreads in tabs.
> >
> > It seems to trigger when a process with a large amount of memory swapped
> > out exits.
> >
> > Can this be solved with a cond_resched?
>
> Not that easily, I think.
>
> Are you testing with CONFIG_PREEMPT=y, as I'd expect? I thought
> cond_resched() adds nothing to that case (and we keep on intending
> but forgetting to make it compile away to nothing in that case).
> Or am I confused?
>
Thanks for the explanation - I'm the one who was confused (PREEMPT is
on, so cond_resched() would be pointless).
Lee
* Hugh Dickins <[email protected]> wrote:
> > It seems to trigger when a process with a large amount of memory swapped
> > out exits.
> >
> > Can this be solved with a cond_resched?
>
> Not that easily, I think.
>
> Are you testing with CONFIG_PREEMPT=y, as I'd expect? I thought
> cond_resched() adds nothing to that case (and we keep on intending but
> forgetting to make it compile away to nothing in that case). Or am I
> confused?
it still has an effect: if PREEMPT_BKL is disabled it drops the BKL. But
if PREEMPT_BKL is enabled too then cond_resched() has no effect.
(cond_resched_lock() does though)
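
(roughly what cond_resched_lock() amounts to - a simplified sketch,
not the exact source:)

	/* simplified sketch of cond_resched_lock(): drop the given
	 * spinlock, reschedule if needed, then retake the lock */
	int cond_resched_lock(spinlock_t *lock)
	{
		if (need_resched()) {
			spin_unlock(lock);
			cond_resched();
			spin_lock(lock);
			return 1;
		}
		return 0;
	}
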
back to the problem reported by Lee:
> > Xorg-18254 0dn.2 3us < (2110048)
> > Xorg-18254 0dn.2 4us : preempt_schedule (free_swap_and_cache)
> > Xorg-18254 0dn.2 5us : free_swap_and_cache (unmap_vmas)
> > Xorg-18254 0dn.2 6us : swap_info_get (free_swap_and_cache)
> > Xorg-18254 0dn.3 6us : swap_entry_free (free_swap_and_cache)
> > Xorg-18254 0dn.3 7us : find_trylock_page (free_swap_and_cache)
hm, where does the latency come from? We do have a lockbreaker in
unmap_vmas():
	if (need_resched() ||
			(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
		if (i_mmap_lock) {
			*tlbp = NULL;
			goto out;
		}
		cond_resched();
	}
why doesn't this break up the 28ms latency?
Ingo
* Lee Revell <[email protected]> wrote:
> I've been testing for weeks with 2.6.16-rc1 + the latency trace patch
> and the longest latencies measured were 10-15ms due to the well known
> rt_run_flush issue. Today I got one twice as long, from a new code
> path, when I killed a Firefox process with a bunch of acroreads in tabs.
>
> It seems to trigger when a process with a large amount of memory
> swapped out exits.
btw., one good way to get such things fixed is to code up a testcase: a
.c file that just has to be run to reproduce the latency. It might be
less trivial to code that up in some cases (like this one - e.g. you
might have to first get a large chunk of memory swapped out, which isn't
easy), but I think it's still worth the effort, as that way you can
gently pressure us lazy upstream maintainers to act quicker, and we can
also easily verify whether the fix does the trick :-)
Ingo
On Tue, 2006-03-14 at 22:22 +0100, Ingo Molnar wrote:
> * Lee Revell <[email protected]> wrote:
>
> > I've been testing for weeks with 2.6.16-rc1 + the latency trace patch
> > and the longest latencies measured were 10-15ms due to the well known
> > rt_run_flush issue. Today I got one twice as long, from a new code
> > path, when I killed a Firefox process with a bunch of acroreads in tabs.
> >
> > It seems to trigger when a process with a large amount of memory
> > swapped out exits.
>
> btw., one good way to get such things fixed is to code up a testcase: a
> .c file that just has to be run to reproduce the latency. It might be
> less trivial to code that up in some cases (like this one - e.g. you
> might have to first get a large chunk of memory swapped out, which isn't
> easy), but I think it's still worth the effort, as that way you can
> gently pressure us lazy upstream maintainers to act quicker, and we can
> also easily verify whether the fix does the trick :-)
Sure, I'll try. But I've been pounding on this kernel for weeks and
only hit this once. And I'm pretty sure I've killed processes with lots
of swapped-out memory. The only new variable was acroread: open 5 PDFs
in Firefox tabs, leave it alone for a few days until the kernel swaps
it all out, then kill it...
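
Something like this untested sketch might approximate it without the
multi-day wait (the size is a guess, and you'd probably still need a
memory hog running alongside to force the swap-out):

	/* untested sketch: dirty a large anonymous area, let it get
	 * swapped out, then exit so unmap_vmas() has to walk a long
	 * run of swap entries */
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	#define SIZE (512UL << 20)	/* guess: should exceed free RAM */

	int main(void)
	{
		char *p = malloc(SIZE);
		unsigned long i;

		if (!p)
			return 1;
		for (i = 0; i < SIZE; i += 4096)
			p[i] = 1;	/* dirty every page */
		printf("dirtied; waiting to be swapped out...\n");
		sleep(600);		/* run a memory hog meanwhile */
		return 0;		/* latency should hit on exit */
	}
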
Lee
On Tue, 2006-03-14 at 22:01 +0100, Ingo Molnar wrote:
> hm, where does the latency come from? We do have a lockbreaker in
> unmap_vmas():
>
> 	if (need_resched() ||
> 			(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
> 		if (i_mmap_lock) {
> 			*tlbp = NULL;
> 			goto out;
> 		}
> 		cond_resched();
> 	}
>
>
> why doesn't this break up the 28ms latency?
>
But the preempt count is >= 2; doesn't that mean some other lock must
also be held, or that someone called preempt_disable()?
Lee
On Tue, 14 Mar 2006, Lee Revell wrote:
> On Tue, 2006-03-14 at 22:01 +0100, Ingo Molnar wrote:
> > hm, where does the latency come from? We do have a lockbreaker in
> > unmap_vmas():
> >
> > 	if (need_resched() ||
> > 			(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
> > 		if (i_mmap_lock) {
> > 			*tlbp = NULL;
> > 			goto out;
> > 		}
> > 		cond_resched();
> > 	}
> >
> >
> > why doesn't this break up the 28ms latency?
That block is actually for PREEMPT n, and for truncating a mapped
file (i_mmap_lock additionally held): all that Lee's PREEMPT y exit
case should need is the tlb_finish_mmu and tlb_gather_mmu around it,
letting preemption in - and the ZAP_BLOCK_SIZE of 8*PAGE_SIZE.
> But the preempt count is >= 2, doesn't that mean some other lock must be
> held also, or someone called preempt_disable?
Yes, as I read the trace (and let me admit, I'm not at all skilled at
reading those traces), and as your swap observation implies, this is
not a problem with ptes present, but with swap entries: and with the
radix tree lookup involved in finding whether they have an associated
struct page in core - all handled while holding page table lock, and
while holding the per-cpu mmu_gather structure.
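
(My rough picture of why your trace shows preempt-depth 2 there - an
approximate call path, sketched from memory, not exact source:)

	/*
	 * unmap_vmas()
	 *   tlb_gather_mmu()                get_cpu(): preempt_count 1
	 *     zap_pte_range()
	 *       spin_lock(page_table_lock)  preempt_count 2
	 *         free_swap_and_cache()     the repeating trace entries
	 *           find_trylock_page()     radix tree lookup on
	 *                                   swapper_space
	 *       spin_unlock(...)
	 *   tlb_finish_mmu()                put_cpu(): preemptible again
	 */
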
Oh, thank you for forcing me to take another look: 2.6.15 did make a
regression there, and this one is very simply remedied. Lee, please
try the patch below (I've done it against 2.6.16-rc6 because that's
what I have to hand, and it would be a better tree for you to test),
and let us know if it fixes your case as I expect - thanks.
(Robin Holt observed how inefficient the small ZAP_BLOCK_SIZE was on
very sparse mmaps, as originally implemented; so he and Nick reworked
it to count only real work done; but the swap entries got put on the
side of "no real work", whereas you've found they may involve very
significant work. My patch below reverses that: yes, I've got some
other cases now going the slow way when they needn't, but they're
too rare to clutter the code for.)
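
(To put numbers on it, if my arithmetic is right, with 4k pages:)

	zap_work = ZAP_BLOCK_SIZE = 8 * PAGE_SIZE = 32768
	present pte:  zap_work -= PAGE_SIZE -> at most 8 per block
	swap entry:   zap_work -= 1         -> up to 32768 per block,
	              each with a radix tree lookup, preemption disabled
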
Hugh
--- 2.6.16-rc6/mm/memory.c 2006-03-12 15:25:45.000000000 +0000
+++ linux/mm/memory.c 2006-03-15 07:32:36.000000000 +0000
@@ -623,11 +623,12 @@ static unsigned long zap_pte_range(struc
(*zap_work)--;
continue;
}
+
+ (*zap_work) -= PAGE_SIZE;
+
if (pte_present(ptent)) {
struct page *page;
- (*zap_work) -= PAGE_SIZE;
-
page = vm_normal_page(vma, addr, ptent);
if (unlikely(details) && page) {
/*
Hugh Dickins wrote:
> Oh, thank you for forcing me to take another look, 2.6.15 did make a
> regression there, and this one is very simply remedied: Lee, please
> try the patch below (I've done it against 2.6.16-rc6 because that's
> what I have to hand; and would be a better tree for you to test),
> and let us know if it fixes your case as I expect - thanks.
>
> (Robin Holt observed how inefficient the small ZAP_BLOCK_SIZE was on
> very sparse mmaps, as originally implemented; so he and Nick reworked
> it to count only real work done; but the swap entries got put on the
> side of "no real work", whereas you've found they may involve very
> significant work. My patch below reverses that: yes, I've got some
> other cases now going the slow way when they needn't, but they're
> too rare to clutter the code for.)
>
I think this patch looks good, thanks Hugh.
> Hugh
>
> --- 2.6.16-rc6/mm/memory.c 2006-03-12 15:25:45.000000000 +0000
> +++ linux/mm/memory.c 2006-03-15 07:32:36.000000000 +0000
> @@ -623,11 +623,12 @@ static unsigned long zap_pte_range(struc
> (*zap_work)--;
> continue;
> }
> +
> + (*zap_work) -= PAGE_SIZE;
> +
> if (pte_present(ptent)) {
> struct page *page;
>
> - (*zap_work) -= PAGE_SIZE;
> -
> page = vm_normal_page(vma, addr, ptent);
> if (unlikely(details) && page) {
> /*
>
--
SUSE Labs, Novell Inc.