Andrew,
Attached patch implements a low latency version of "zap_page_range()".
Calls with even moderately large page ranges result in very long lock
hold times and consequently very long periods of non-preemptibility.
This function is on my list of the top 3 worst offenders. It is gross.
This new version reimplements zap_page_range() as a loop over
ZAP_BLOCK_SIZE chunks. After each iteration, if a reschedule is
pending, we drop page_table_lock and automagically preempt. Note we can
not blindly drop the locks and reschedule (e.g. for the non-preempt
case) since there is a possibility to enter this codepath holding other
locks.
... I am sure you are familiar with all this; it's the same deal as your
low-latency work. This patch implements the "cond_resched_lock()" as we
discussed sometime back. I think this solution should be acceptable to
you and Linus.
There are other misc. cleanups, too.
This new zap_page_range() yields latency too low to benchmark: <<1ms.
Please, Andrew, add this to your ever-growing list.
Robert Love
diff -urN linux-2.5.32/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.32/include/linux/sched.h Tue Aug 27 15:26:34 2002
+++ linux/include/linux/sched.h Wed Aug 28 18:04:41 2002
@@ -898,6 +898,34 @@
__cond_resched();
}
+#ifdef CONFIG_PREEMPT
+
+/*
+ * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * call schedule, and on return reacquire the lock.
+ *
+ * Note: this does not assume the given lock is the _only_ lock held.
+ * The kernel preemption counter gives us "free" checking that we are
+ * atomic -- let's use it.
+ */
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+ if (need_resched() && preempt_count() == 1) {
+ _raw_spin_unlock(lock);
+ preempt_enable_no_resched();
+ __cond_resched();
+ spin_lock(lock);
+ }
+}
+
+#else
+
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+}
+
+#endif
+
/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
Athread cathreaders should have t->sigmask_lock. */
diff -urN linux-2.5.32/mm/memory.c linux/mm/memory.c
--- linux-2.5.32/mm/memory.c Tue Aug 27 15:26:42 2002
+++ linux/mm/memory.c Wed Aug 28 18:03:11 2002
@@ -389,8 +389,8 @@
{
pgd_t * dir;
- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -401,30 +401,43 @@
tlb_end_vma(tlb, vma);
}
-/*
- * remove user pages in a given range.
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE) /* how big a chunk we loop over */
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- unsigned long start = address, end = address + size;
+ unsigned long end, block;
- /*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
- */
- if (address >= end)
- BUG();
spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
+ /*
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
+ */
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ cond_resched_lock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }
+
spin_unlock(&mm->page_table_lock);
}
Robert Love wrote:
>
> Andrew,
>
> Attached patch implements a low latency version of "zap_page_range()".
>
This doesn't quite do the right thing on SMP.
Note that pages which are to be torn down are buffered in the
mmu_gather_t array. The kernel throws away 507 pages at a
time - this is to reduce the frequency of global TLB invalidations.
(The 507 is, I assume, designed to make the mmu_gather_t be
2048 bytes in size. I recently broke that math, and need to fix
it up).
However with your change, we'll only ever put 256 pages into the
mmu_gather_t. Half of that thing's buffer is unused and the
invalidation rate will be doubled during teardown of large
address ranges.
I suggest that you make ZAP_BLOCK_SIZE be equal to FREE_PTE_NR on
SMP, and 256 on UP.
(We could get fancier and do something like:
	tlb = tlb_gather_mmu(mm, 0);
	while (size) {
		...
		unmap_page_range(ZAP_BLOCK_SIZE pages);
		tlb_flush_mmu(...);
		cond_resched_lock();
	}
	tlb_finish_mmu(...);
	spin_unlock(page_table_lock);
but I don't think that passes the benefit-versus-complexity test.)
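Spelled out, that fancier variant might look roughly like the sketch below.
This is untested and assumes tlb_flush_mmu() may be called repeatedly within
one open gather; note that dropping page_table_lock while a per-CPU
mmu_gather_t is outstanding is precisely the extra complexity being weighed
here.

void zap_page_range(struct vm_area_struct *vma, unsigned long address,
		    unsigned long size)
{
	struct mm_struct *mm = vma->vm_mm;
	mmu_gather_t *tlb;
	unsigned long start = address;
	unsigned long end, block;

	spin_lock(&mm->page_table_lock);
	tlb = tlb_gather_mmu(mm, 0);
	while (size) {
		block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
		end = address + block;

		flush_cache_range(vma, address, end);
		unmap_page_range(tlb, vma, address, end);

		/* flush what has been gathered so far, but keep the
		 * gather open for the next chunk */
		tlb_flush_mmu(tlb, address, end);

		cond_resched_lock(&mm->page_table_lock);

		address += block;
		size -= block;
	}
	tlb_finish_mmu(tlb, start, address);
	spin_unlock(&mm->page_table_lock);
}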
Also, if the kernel is not compiled for preemption then we're
doing a little bit of extra work to no advantage, yes? We can
avoid doing that by setting ZAP_BLOCK_SIZE to infinity.
How does this altered version look? All I changed was the ZAP_BLOCK_SIZE
initialisation.
--- 2.5.32/include/linux/sched.h~llzpr Thu Aug 29 13:01:01 2002
+++ 2.5.32-akpm/include/linux/sched.h Thu Aug 29 13:01:01 2002
@@ -907,6 +907,34 @@ static inline void cond_resched(void)
__cond_resched();
}
+#ifdef CONFIG_PREEMPT
+
+/*
+ * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * call schedule, and on return reacquire the lock.
+ *
+ * Note: this does not assume the given lock is the _only_ lock held.
+ * The kernel preemption counter gives us "free" checking that we are
+ * atomic -- let's use it.
+ */
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+ if (need_resched() && preempt_count() == 1) {
+ _raw_spin_unlock(lock);
+ preempt_enable_no_resched();
+ __cond_resched();
+ spin_lock(lock);
+ }
+}
+
+#else
+
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+}
+
+#endif
+
/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
Athread cathreaders should have t->sigmask_lock. */
--- 2.5.32/mm/memory.c~llzpr Thu Aug 29 13:01:01 2002
+++ 2.5.32-akpm/mm/memory.c Thu Aug 29 13:26:21 2002
@@ -389,8 +389,8 @@ void unmap_page_range(mmu_gather_t *tlb,
{
pgd_t * dir;
- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -401,30 +401,53 @@ void unmap_page_range(mmu_gather_t *tlb,
tlb_end_vma(tlb, vma);
}
-/*
- * remove user pages in a given range.
+#if defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (FREE_PTE_NR * PAGE_SIZE)
+#endif
+
+#if !defined(CONFIG_SMP) && defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE)
+#endif
+
+#if !defined(CONFIG_PREEMPT)
+#define ZAP_BLOCK_SIZE (~(0UL))
+#endif
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- unsigned long start = address, end = address + size;
+ unsigned long end, block;
- /*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
- */
- if (address >= end)
- BUG();
spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
+ /*
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
+ */
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ cond_resched_lock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }
+
spin_unlock(&mm->page_table_lock);
}
On Thu, 2002-08-29 at 16:30, Andrew Morton wrote:
> However with your change, we'll only ever put 256 pages into the
> mmu_gather_t. Half of that thing's buffer is unused and the
> invalidation rate will be doubled during teardown of large
> address ranges.
Agreed. Go for it.
Hm, unless, since 507 vs 256 is not the end of the world and latency is
already low, we want to just make it always (FREE_PTE_NR*PAGE_SIZE)...
As long as the "cond_resched_lock()" is a preempt only thing, I also
agree with making ZAP_BLOCK_SIZE ~0 on !CONFIG_PREEMPT - unless we
wanted to unconditionally drop the locks and let preempt just do the
right thing and also reduce SMP lock contention in the SMP case.
Robert Love
On Thu, 2002-08-29 at 16:40, Robert Love wrote:
> On Thu, 2002-08-29 at 16:30, Andrew Morton wrote:
>
> > However with your change, we'll only ever put 256 pages into the
> > mmu_gather_t. Half of that thing's buffer is unused and the
> > invalidation rate will be doubled during teardown of large
> > address ranges.
>
> Agreed. Go for it.
Oh and put a comment in there explaining what you just said to me :)
Robert Love
Robert Love wrote:
>
> ...
> unless we
> wanted to unconditionally drop the locks and let preempt just do the
> right thing and also reduce SMP lock contention in the SMP case.
That's an interesting point. page_table_lock is one of those locks
which is occasionally held for ages, and frequently held for a short
time.
I suspect that yes, voluntarily popping the lock during the long holdtimes
will allow other CPUs to get on with stuff, and will provide efficiency
increases. (It's a pretty lame way of doing that though).
But I don't recall seeing nasty page_table_lock spintimes on
anyone's lockmeter reports, so we can leave it as-is for now.
On Thu, 2002-08-29 at 17:00, Andrew Morton wrote:
> That's an interesting point. page_table_lock is one of those locks
> which is occasionally held for ages, and frequently held for a short
> time.
Since latency is a direct function of lock held times in the preemptible
kernel, and I am seeing disgusting zap_page_range() latencies, the lock
is held a long time.
So we know it is held forever and a day... but is there contention?
> But I don't recall seeing nasty page_table_lock spintimes on
> anyone's lockmeter reports, so we can leave it as-is for now.
I do not recall seeing this either and I have not done my own tests.
Personally, I would love to rip out the "cond_resched_lock()" and just
do
spin_unlock();
spin_lock();
and be done with it. This gives automatic preemption support and the
SMP benefit. Preemption being an "automatic" consequence of improved
locking was always my selling point (granted, this is a gross example of
improving the locking, but it gets the job done).
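In the zap_page_range() loop that would amount to replacing the preemption
point with an unconditional lock break, roughly like this (sketch only, same
structure as the patch above):

	while (size) {
		block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
		end = address + block;

		flush_cache_range(vma, address, end);
		tlb = tlb_gather_mmu(mm, 0);
		unmap_page_range(tlb, vma, address, end);
		tlb_finish_mmu(tlb, address, end);

		/* unconditional lock break: CONFIG_PREEMPT reschedules
		 * here if need be, and on SMP other CPUs get a crack at
		 * page_table_lock */
		spin_unlock(&mm->page_table_lock);
		spin_lock(&mm->page_table_lock);

		address += block;
		size -= block;
	}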
But, the current implementation was more palatable to you and Linus when
I first posted this, and that counts for something.
Robert Love
Robert Love wrote:
>
> On Thu, 2002-08-29 at 17:00, Andrew Morton wrote:
>
> > That's an interesting point. page_table_lock is one of those locks
> > which is occasionally held for ages, and frequently held for a short
> > time.
>
> Since latency is a direct function of lock held times in the preemptible
> kernel, and I am seeing disgusting zap_page_range() latencies, the lock
> is held a long time.
>
> So we know it is held forever and a day... but is there contention?
I'm sure there is, but nobody has measured the right workload.
Two CLONE_MM threads, one running mmap()/munmap(), the other trying
to fault in some pages. I'm sure someone has some vital application
which does exactly this. They always do :(
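A userspace approximation of that workload might look like the sketch below
(hypothetical and untested; build with -lpthread). One thread churns
mmap()/munmap() over a large populated region, so munmap() spends a long time
in zap_page_range() under page_table_lock, while the other thread, sharing
the same mm, takes page faults of its own:

#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN	(64 * 1024 * 1024)

static void *mapper(void *arg)
{
	void *p;

	for (;;) {
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			continue;
		memset(p, 0, LEN);	/* populate, so munmap() has pages to zap */
		munmap(p, LEN);		/* long zap_page_range() in here */
	}
	return NULL;
}

static void *faulter(void *arg)
{
	char *p;
	unsigned long i;

	for (;;) {
		p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			continue;
		for (i = 0; i < LEN; i += 4096)
			p[i] = 1;	/* take a page fault per page */
		munmap(p, LEN);
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, mapper, NULL);
	pthread_create(&t2, NULL, faulter, NULL);
	pause();
	return 0;
}

Under lockmeter (or similar) this should show whether munmap()'s long
page_table_lock hold times actually translate into spin time for the
faulting thread.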
Robert Love wrote:
>> unless we
>> wanted to unconditionally drop the locks and let preempt just do the
>> right thing and also reduce SMP lock contention in the SMP case.
On Thu, Aug 29, 2002 at 01:59:17PM -0700, Andrew Morton wrote:
> That's an interesting point. page_table_lock is one of those locks
> which is occasionally held for ages, and frequently held for a short
> time.
> I suspect that yes, voluntarily popping the lock during the long holdtimes
> will allow other CPUs to get on with stuff, and will provide efficiency
> increases. (It's a pretty lame way of doing that though).
> But I don't recall seeing nasty page_table_lock spintimes on
> anyone's lockmeter reports, so...
You will. There are just bigger fish to fry at the moment.
Cheers,
Bill
On Thu, 29 Aug 2002, Andrew Morton wrote:
> > So we know it is held forever and a day... but is there contention?
>
> I'm sure there is, but nobody has measured the right workload.
>
> Two CLONE_MM threads, one running mmap()/munmap(), the other trying
> to fault in some pages. I'm sure someone has some vital application
> which does exactly this. They always do :(
Can't fix this one. The mmap()/munmap() needs to have the
mmap_sem for writing as long as it's setting up or tearing
down a VMA while the pagefault path takes the mmap_sem for
reading.
It might be fixable in some dirty way, but I doubt that'll
ever be worth it.
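For reference, the locking pattern being described is roughly the following
(simplified sketch, not the literal kernel code):

	/* mmap()/munmap() side: takes the semaphore exclusively while
	 * the vma list is set up or torn down */
	down_write(&mm->mmap_sem);
	/* ... insert or tear down the vma, zap its pages ... */
	up_write(&mm->mmap_sem);

	/* page fault side: faults may run concurrently with each other,
	 * but not with the writer above */
	down_read(&mm->mmap_sem);
	/* ... look up the vma and handle the fault ... */
	up_read(&mm->mmap_sem);

So a faulting thread and an munmap()ing thread already exclude each other at
the mmap_sem level, before page_table_lock ever comes into the picture.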
regards,
Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://distro.conectiva.com/
In article <[email protected]>,
William Lee Irwin III <[email protected]> wrote:
>Robert Love wrote:
>>> unless we
>>> wanted to unconditionally drop the locks and let preempt just do the
>>> right thing and also reduce SMP lock contention in the SMP case.
>
>On Thu, Aug 29, 2002 at 01:59:17PM -0700, Andrew Morton wrote:
>> That's an interesting point. page_table_lock is one of those locks
>> which is occasionally held for ages, and frequently held for a short
>> time.
>> I suspect that yes, voluntarily popping the lock during the long holdtimes
>> will allow other CPUs to get on with stuff, and will provide efficiency
>> increases. (It's a pretty lame way of doing that though).
>> But I don't recall seeing nasty page_table_lock spintimes on
>> anyone's lockmeter reports, so...
>
>You will. There are just bigger fish to fry at the moment.
You will NOT.
The page_table_lock protects against page stealing of the VM and
concurrent page-faults, nothing else. There is no way you can get
contention on it under any reasonable load that doesn't involve heavy
out-of-memory behaviour, simply because
- the lock is per-mm
- all "regular" paths that care about this also get the mmap semaphore
In short, that spinlock has _zero_ scalability impact. You can
theoretically get contention on it without memory pressure only by
having hundreds of threads page-faulting at the same time (getting a
read-lock on the mmap semaphore), but by then your performance has
nothing to do with the spinlock, and everything to do with the page
faults themselves.
(In fact, I can almost guarantee that most of the long hold-times are
for exit(), not for munmap(). And in that case the spinlock cannot get
any non-pagestealer contention at all, since nobody else is using the
MM)
Linus
On Thu, Aug 29, 2002 at 10:37:02PM +0000, Linus Torvalds wrote:
> You will NOT.
> The page_table_lock protects against page stealing of the VM and
> concurrent page-faults, nothing else. There is no way you can get
> contention on it under any reasonable load that doesn't involve heavy
> out-of-memory behaviour, simply because
> - the lock is per-mm
> - all "regular" paths that care about this also get the mmap semaphore
> In short, that spinlock has _zero_ scalability impact. You can
> theoretically get contention on it without memory pressure only by
> having hundreds of threads page-faulting at the same time (getting a
> read-lock on the mmap semaphore), but by then your performance has
> nothing to do with the spinlock, and everything to do with the page
> faults themselves.
> (In fact, I can almost guarantee that most of the long hold-times are
> for exit(), not for munmap(). And in that case the spinlock cannot get
> any non-pagestealer contention at all, since nobody else is using the
> MM)
All I have to go on is a report this has happened and a low-priority
task to investigate it at some point in the future. I'll send you data
either demonstrating it or exonerating it when I eventually get to it.
Cheers,
Bill