When unmapping a region, if the current task doesn't need to reschedule,
don't do a tlb_finish_mmu. This avoids some unnecessary TLB flushes.

In the lmbench tests, this patch gives a 2.1% improvement on the exec
proc item and a 4.2% improvement on the sh proc item.
Signed-off-by: Shaohua Li <[email protected]>
---
linux-2.6.16-rc5-root/mm/memory.c | 7 +++----
1 files changed, 3 insertions(+), 4 deletions(-)
diff -puN mm/memory.c~less_flush mm/memory.c
--- linux-2.6.16-rc5/mm/memory.c~less_flush 2006-03-21 07:22:47.000000000 +0800
+++ linux-2.6.16-rc5-root/mm/memory.c 2006-03-21 07:26:51.000000000 +0800
@@ -837,19 +837,18 @@ unsigned long unmap_vmas(struct mmu_gath
break;
}
- tlb_finish_mmu(*tlbp, tlb_start, start);
-
if (need_resched() ||
(i_mmap_lock && need_lockbreak(i_mmap_lock))) {
+ tlb_finish_mmu(*tlbp, tlb_start, start);
if (i_mmap_lock) {
*tlbp = NULL;
goto out;
}
cond_resched();
+ tlb_start_valid = 0;
+ *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
}
- *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
- tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
_
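For clarity, here is the resulting restart logic in unmap_vmas() after
the patch, paraphrased rather than quoted verbatim from mm/memory.c (the
outer per-VMA loop and the zap calls themselves are elided):

	if (need_resched() ||
	    (i_mmap_lock && need_lockbreak(i_mmap_lock))) {
		/* Flush and free the batch only when we really stop. */
		tlb_finish_mmu(*tlbp, tlb_start, start);
		if (i_mmap_lock) {
			/* Caller must restart; drop the gather entirely. */
			*tlbp = NULL;
			goto out;
		}
		cond_resched();
		/* Start a fresh gather for the next batch. */
		tlb_start_valid = 0;
		*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
	}
	zap_work = ZAP_BLOCK_SIZE;

On the fast path (no reschedule needed, no lock break), the gather now
stays live across ZAP_BLOCK_SIZE blocks instead of being finished and
restarted each time, which is where the saved TLB flushes come from.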
Shaohua Li wrote:
>When unmapping a region, if the current task doesn't need to reschedule,
>don't do a tlb_finish_mmu. This avoids some unnecessary TLB flushes.
>
>In the lmbench tests, this patch gives a 2.1% improvement on the exec
>proc item and a 4.2% improvement on the sh proc item.
>
>
The problem with this is that by the time we _do_ determine that a
reschedule is needed, we might have built up a huge amount of work
to do (which can probably be as much if not more expensive per page
as the unmapping itself), so scheduling latency can still be
unacceptable. I'm afraid I don't think we can include this patch.
One option I've been looking into is my "mmu gather in-place" that
never needs extra tlb flushes and is always preemptible... so that
may be a way forward.
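To make the latency concern concrete: with the generic mmu_gather, each
zapped page is queued on the gather and only freed (with the TLB
flushed) when the batch fills or the gather is finished. Roughly, from
the 2.6-era include/asm-generic/tlb.h (paraphrased from memory;
architectures with their own implementation differ):

	#define FREE_PTE_NR	506	/* pages batched per flush */

	static inline void
	tlb_remove_page(struct mmu_gather *tlb, struct page *page)
	{
		tlb->need_flush = 1;
		if (tlb_fast_mode(tlb)) {
			/* UP fast mode: free immediately, nothing batches */
			free_page_and_swap_cache(page);
			return;
		}
		/* SMP: defer the free; work piles up until a flush */
		tlb->pages[tlb->nr++] = page;
		if (tlb->nr >= FREE_PTE_NR)
			tlb_flush_mmu(tlb, 0, 0);
	}

So the longer the gather is kept open, the more deferred freeing can be
owed at the point we finally decide to reschedule.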
Nick Piggin wrote on Tuesday, March 21, 2006 8:53 PM
> Shaohua Li wrote:
> >When unmapping a region, if the current task doesn't need to reschedule,
> >don't do a tlb_finish_mmu. This avoids some unnecessary TLB flushes.
> >
> >In the lmbench tests, this patch gives a 2.1% improvement on the exec
> >proc item and a 4.2% improvement on the sh proc item.
>
> The problem with this is that by the time we _do_ determine that a
> reschedule is needed, we might have built up a huge amount of work
> to do (which can probably be as much if not more expensive per page
> as the unmapping itself), so scheduling latency can still be
> unacceptable. I'm afraid I don't think we can include this patch.
Interesting. In the old days, since mm->page_table_lock was held across the
entire unmap_vmas() function, it was beneficial to introduce a periodic
reschedule point and to drop the spinlock under pressure. Now that the
page table lock is fine-grained and has been pushed into zap_pte_range(), I
would think scheduling latency would improve from a lock-contention
avoidance point of view. Is that not the case?
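For reference, the fine-grained locking in question: zap_pte_range()
now takes and releases the (possibly split) page table lock around each
PTE batch, roughly as below (heavily simplified from mm/memory.c; the
actual zap work is elided):

	static unsigned long zap_pte_range(struct mmu_gather *tlb,
			struct vm_area_struct *vma, pmd_t *pmd,
			unsigned long addr, unsigned long end,
			long *zap_work, struct zap_details *details)
	{
		struct mm_struct *mm = tlb->mm;
		spinlock_t *ptl;
		pte_t *pte;

		/* Lock only this page table, not the whole mm. */
		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		do {
			/* ... clear the pte, hand the page to
			   tlb_remove_page(), decrement *zap_work ... */
		} while (pte++, addr += PAGE_SIZE,
			 (addr != end && *zap_work > 0));
		pte_unmap_unlock(pte - 1, ptl);

		return addr;
	}

The lock is dropped every time a batch completes, so ptl hold times stay
short even when the region being unmapped is large.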
- Ken
Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 21, 2006 8:53 PM
>
>>Shaohua Li wrote:
>>
>>>In unmaping region, if current task doesn't need reschedule, don't do a
>>>tlb_finish_mmu. This can reduce some tlb flushes.
>>>
>>>In the lmbench tests, this patch gives 2.1% improvement on exec proc
>>>item and 4.2% on sh proc item.
>>
>>The problem with this is that by the time we _do_ determine that a
>>reschedule is needed, we might have built up a huge amount of work
>>to do (which can probably be as much if not more exensive per-page
>>as the unmapping), so scheduling latency can still be unacceptable
>>so I'm afraid I don't think we can include this patch.
>
>
> Interesting. In the old day, since mm->page_table_lock is held for the
> entire unmap_vmas function, it was beneficial to introduce periodic
> reschedule point and to drop the spin lock under pressure. Now that the
> page table lock is fine-grained and is pushed into zap_pte_range(), I
> would think scheduling latency would improve from lock contention
> avoidance point of view. It is not the case?
>
Well, mmu_gather uses a per-cpu data structure and is non-preemptible,
which I guess is one of the main reasons why we have this preemption
point here.
You're right that another good reason would be ptl lock contention;
however, I don't think that alleviating that problem alone would allow
longer mmu_gather scheduling latencies, because the longest latency
is still the tlb_gather_mmu() <--> tlb_finish_mmu() span.
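Concretely, the non-preemptible span (again paraphrasing the 2.6-era
include/asm-generic/tlb.h; architecture-specific versions differ):
tlb_gather_mmu() grabs the per-cpu structure with get_cpu_var(), which
disables preemption, and preemption stays off until the matching
put_cpu_var() in tlb_finish_mmu():

	DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);

	static inline struct mmu_gather *
	tlb_gather_mmu(struct mm_struct *mm, unsigned int full_mm_flush)
	{
		/* get_cpu_var() disables preemption here... */
		struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);

		tlb->mm = mm;
		tlb->fullmm = full_mm_flush;
		/* ~0U marks UP "fast mode" (no batching needed) */
		tlb->nr = num_online_cpus() > 1 ? 0U : ~0U;
		return tlb;
	}

	static inline void
	tlb_finish_mmu(struct mmu_gather *tlb,
		       unsigned long start, unsigned long end)
	{
		/* Free the batched pages and flush the TLB. */
		tlb_flush_mmu(tlb, start, end);
		check_pgt_cache();
		/* ...and re-enables it here. */
		put_cpu_var(mmu_gathers);
	}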
--
SUSE Labs, Novell Inc.
Nick Piggin wrote on Tuesday, March 21, 2006 11:30 PM
> Chen, Kenneth W wrote:
> > Interesting. In the old days, since mm->page_table_lock was held across the
> > entire unmap_vmas() function, it was beneficial to introduce a periodic
> > reschedule point and to drop the spinlock under pressure. Now that the
> > page table lock is fine-grained and has been pushed into zap_pte_range(), I
> > would think scheduling latency would improve from a lock-contention
> > avoidance point of view. Is that not the case?
> >
>
> Well, mmu_gather uses a per-cpu data structure and is non-preemptible,
> which I guess is one of the main reasons why we have this preemption
> point here.
>
> You're right that another good reason would be ptl lock contention;
> however, I don't think that alleviating that problem alone would allow
> longer mmu_gather scheduling latencies, because the longest latency
> is still the tlb_gather_mmu() <--> tlb_finish_mmu() span.
OK, I think it would be beneficial to take a latency measurement again,
just to see how it performs nowadays. The dynamics might have changed.
- Ken
Chen, Kenneth W wrote:
> Nick Piggin wrote on Tuesday, March 21, 2006 11:30 PM
>
>>Well, mmu_gather uses a per-cpu data structure and is non-preemptible,
>>which I guess is one of the main reasons why we have this preemption
>>point here.
>>
>>You're right that another good reason would be ptl lock contention;
>>however, I don't think that alleviating that problem alone would allow
>>longer mmu_gather scheduling latencies, because the longest latency
>>is still the tlb_gather_mmu() <--> tlb_finish_mmu() span.
>
>
> OK, I think it would be beneficial to take a latency measurement again,
> just to see how it performs nowadays. The dynamics might have changed.
>
Well, I wouldn't argue against further investigation or fine-tuning of
the present code; however, also remember that the unconditional
tlb_finish_mmu() this patch aims to avoid never actually lowered ptl
hold times itself.
--
SUSE Labs, Novell Inc.
On Tue, 2006-03-21 at 23:44 -0800, Chen, Kenneth W wrote:
>
> OK, I think it would be beneficial to take a latency measurement again,
> just to see how it performs nowadays. The dynamics might have changed.
I will test this with Ingo's latency tracer as soon as I get a chance.
I had previously posted results showing this to be a problem spot.
Lee