2002-07-20 20:17:28

by Robert Love

Subject: [PATCH] low-latency zap_page_range

The lock hold time in zap_page_range is horrid. This patch breaks the
work up into chunks and relinquishes the lock after each iteration.
This drastically lowers latency by creating a preemption point, as well
as lowering lock contention.

The chunk size is ZAP_BLOCK_SIZE, currently 256*PAGE_SIZE.

This lowers the maximum latency in zap_page_range from 10~20ms (on a
dual Athlon - one of the worst latencies recorded) to unmeasurable.

I made a couple other cleanups and optimizations:

- remove the unneeded dir variable and the call to pgd_offset - nothing
uses them anymore since the code was pushed into unmap_page_range

- remove the duplicate start variable - it is the same as address

- BUG -> BUG_ON in unmap_page_range

- remove redundant BUG from zap_page_range - the same check is
done in unmap_page_range

- better comments

Patch is against 2.5.27, please apply.

Robert Love

diff -urN linux-2.5.27/mm/memory.c linux/mm/memory.c
--- linux-2.5.27/mm/memory.c Sat Jul 20 12:11:17 2002
+++ linux/mm/memory.c Sat Jul 20 12:56:18 2002
@@ -390,8 +390,8 @@
{
pgd_t * dir;

- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -402,34 +402,42 @@
tlb_end_vma(tlb, vma);
}

-/*
- * remove user pages in a given range.
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE) /* how big a chunk we loop over */
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- pgd_t * dir;
- unsigned long start = address, end = address + size;
-
- dir = pgd_offset(mm, address);
+ unsigned long end, block;

/*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
*/
- if (address >= end)
- BUG();
- spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
-
- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
- spin_unlock(&mm->page_table_lock);
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ spin_lock(&mm->page_table_lock);
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ spin_unlock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }
}

/*


2002-07-21 02:44:32

by Pete Zaitcev

Subject: Re: [PATCH] low-latency zap_page_range

> The lock hold time in zap_page_range is horrid. This patch breaks the
> work up into chunks and relinquishes the lock after each iteration.
> This drastically lowers latency by creating a preemption point, as well
> as lowering lock contention.

> void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)

Arjan sent me something similar, done by AKPM, only he did this a
little differently. He added an argument to zap_page_range
which allowed it to work in the old way, if set. Then he set it so
all places would use low latency EXCEPT reads from /dev/zero.
I assume it was some locking somewhere in devices/char/mem.c,
though I was unable to figure out which in particular.

Andrew, care to unconfuse me?

-- Pete

2002-07-22 05:09:09

by Andrew Morton

Subject: Re: [PATCH] low-latency zap_page_range

Pete Zaitcev wrote:
>
> > The lock hold time in zap_page_range is horrid. This patch breaks the
> > work up into chunks and relinquishes the lock after each iteration.
> > This drastically lowers latency by creating a preemption point, as well
> > as lowering lock contention.
>
> > void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
>
> Arjan sent me something similar, done by AKPM, only he did this a
> little differently. He added an argument to zap_page_range
> which allowed it to work in the old way, if set. Then he set it so
> all places would use low latency EXCEPT reads from /dev/zero.
> I assume it was some locking somewhere in devices/char/mem.c,
> though I was unable to figure out which in particular.
>

There are actually quite a few places in the ll patch which don't
pass ZPR_COND_RESCHED into zap_page_range. Places which are
themselves called under locks, places where not enough pages
are being zapped to make it necessary, etc. vmtruncate_list,
some mremap code, others.

Plus some historical notes: there used to be a couple more
flags which could be passed into zap_page_range() to tell
it whether to run flush_cache_range and flush_tlb_range. That
was an irrelevant cleanup. But those calls were later unconditionally
sucked into zap_page_range() anyway.
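
In rough outline, the flagged interface looks something like this - the
ZPR_COND_RESCHED name is real, but the value and the body below are just
a sketch grafted onto Robert's 2.5 loop, not the actual -ll code:

#define ZPR_COND_RESCHED 1      /* may drop the lock and reschedule */

void zap_page_range(struct vm_area_struct *vma, unsigned long address,
                    unsigned long size, int actions)
{
        struct mm_struct *mm = vma->vm_mm;
        mmu_gather_t *tlb;
        unsigned long end, block;

        spin_lock(&mm->page_table_lock);
        while (size) {
                block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
                end = address + block;

                flush_cache_range(vma, address, end);
                tlb = tlb_gather_mmu(mm, 0);
                unmap_page_range(tlb, vma, address, end);
                tlb_finish_mmu(tlb, address, end);

                /*
                 * Callers which hold other locks (vmtruncate_list, some
                 * mremap code, ...) simply omit ZPR_COND_RESCHED and keep
                 * the old hold-it-for-the-duration behaviour.
                 */
                if (actions & ZPR_COND_RESCHED) {
                        spin_unlock(&mm->page_table_lock);
                        spin_lock(&mm->page_table_lock);
                }

                address += block;
                size -= block;
        }
        spin_unlock(&mm->page_table_lock);
}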

-

2002-07-22 05:03:26

by Andrew Morton

Subject: Re: [PATCH] low-latency zap_page_range

Robert Love wrote:
>
> The lock hold time in zap_page_range is horrid.
>

Yes, it is. And although our mandate is to fix things
like this without grafted-on low latency hacks, zap_page_range()
may be one case where simply popping the lock is the best solution.
Not sure.

> ...
> + while (size) {
> + block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
> + end = address + block;
> +
> + spin_lock(&mm->page_table_lock);
> +
> + flush_cache_range(vma, address, end);
> + tlb = tlb_gather_mmu(mm, 0);
> + unmap_page_range(tlb, vma, address, end);
> + tlb_finish_mmu(tlb, address, end);
> +
> + spin_unlock(&mm->page_table_lock);
> +
> + address += block;
> + size -= block;
> + }

This adds probably-unneeded extra work - we shouldn't go
dropping the lock unless that is actually required. ie:
poll ->need_resched first. Possible?

-

2002-07-22 17:16:40

by Robert Love

Subject: Re: [PATCH] low-latency zap_page_range

On Sun, 2002-07-21 at 22:20, Andrew Morton wrote:

> There are actually quite a few places in the ll patch which don't
> pass ZPR_COND_RESCHED into zap_page_range. Places which are
> themselves called under locks, places where not enough pages
> are being zapped to make it necessary, etc. vmtruncate_list,
> some mremap code, others.

The bonus of the approach in my patch is that we won't reschedule unless
the lock count is zero (due to the preemptive kernel), so we do not need
to worry about the above. The only concern would be if there is ever a
case where the critical region is the entire chunk we want to zap (i.e.,
we must zap the entire input range atomically), but I do not see that
ever being the case.
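
To spell out the mechanism I am relying on: with the preemptible kernel,
spin_unlock() boils down to something like the sketch below (the helper
name here is made up; the real macros differ in detail), so the
reschedule can only ever happen once the last held lock is released:

static inline void sketch_spin_unlock(spinlock_t * lock)
{
        _raw_spin_unlock(lock);         /* release the lock word itself */
        /*
         * preempt_enable() drops the reference spin_lock() took on
         * preempt_count and then reschedules only if the count is now
         * zero and need_resched() is set - so holding any other
         * spinlock suppresses the reschedule.
         */
        preempt_enable();
}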

Robert Love

2002-07-22 17:55:13

by Robert Love

Subject: Re: [PATCH] low-latency zap_page_range

On Sun, 2002-07-21 at 22:14, Andrew Morton wrote:

> This adds probably-unneeded extra work - we shouldn't go
> dropping the lock unless that is actually required. ie:
> poll ->need_resched first. Possible?

Sure. What do you think of this?

spin_lock(&mm->page_table_lock);

while (size) {
        block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
        end = address + block;

        flush_cache_range(vma, address, end);
        tlb = tlb_gather_mmu(mm, 0);
        unmap_page_range(tlb, vma, address, end);
        tlb_finish_mmu(tlb, address, end);

        if (need_resched()) {
                /*
                 * If we need to reschedule we will do so
                 * here if we do not hold any other locks.
                 */
                spin_unlock(&mm->page_table_lock);
                spin_lock(&mm->page_table_lock);
        }

        address += block;
        size -= block;
}

spin_unlock(&mm->page_table_lock);

My only issue with the above is that it is _ugly_ compared to the more
natural loop. I.e., this looks much more like explicit lock breaking /
conditional rescheduling, whereas the original loop just happens to
acquire and release the lock on each iteration. Sure, same effect, but
I think it says something about the maintainability and cleanliness of
the function.

One thing about the "overhead" here - the main overhead would be the
lock bouncing in between cachelines on SMP afaict. However, either (a)
there is no SMP contention or (b) there is and dropping the lock
regardless may be a good idea. Thoughts?

Hm, the above also ends up checking need_resched twice (the explicit
need_resched() and again on the final unlock)... we can fix that by
manually calling _raw_spin_unlock and then preempt_schedule, but that
could also result in a (much longer) needless call to preempt_schedule
if an intervening interrupt serviced the request first.

But maybe that is just me... do you like this better? I can redo the
patch as the above.

Robert Love

2002-07-22 18:01:37

by Linus Torvalds

Subject: Re: [PATCH] low-latency zap_page_range



On 22 Jul 2002, Robert Love wrote:
>
> Sure. What do you think of this?

How about adding a "cond_resched_lock()" primitive?

You can do it better as a primitive than as the written-out thing (the
spin_unlock() doesn't need to conditionally test the scheduling point
again, it can just unconditionally call schedule())

And there might be other places that want to drop a lock before scheduling
anyway.

Linus

2002-07-22 18:19:24

by Robert Love

Subject: Re: [PATCH] low-latency zap_page_range

On Mon, 2002-07-22 at 11:05, Linus Torvalds wrote:

> How about adding a "cond_resched_lock()" primitive?
>
> You can do it better as a primitive than as the written-out thing (the
> spin_unlock() doesn't need to conditionally test the scheduling point
> again, it can just unconditionally call schedule())
>
> And there might be other places that want to drop a lock before scheduling
> anyway.

Great idea. I have similar functions in my lock-break patch...

This introduces "cond_resched_lock()" and "break_spin_lock()". Both
take a lock as a parameter. The former drops the lock and reschedules
only if need_resched is set; it is optimized to check only once, etc.
The latter simply unlocks and then relocks.

Patch is against your BK tree.

Robert Love

diff -urN linux-2.5.27/include/linux/sched.h linux/include/linux/sched.h
--- linux-2.5.27/include/linux/sched.h Sat Jul 20 12:11:07 2002
+++ linux/include/linux/sched.h Mon Jul 22 11:16:55 2002
@@ -865,6 +867,30 @@
__cond_resched();
}

+/*
+ * cond_resched_lock() - if a reschedule is pending, drop the given lock,
+ * call schedule, and on return reacquire the lock. Note this assumes
+ * the given lock is the _only_ held lock and that you are not otherwise atomic.
+ */
+static inline void cond_resched_lock(spinlock_t * lock)
+{
+ if (need_resched()) {
+ spin_unlock_no_resched(lock);
+ __cond_resched();
+ spin_lock(lock);
+ }
+}
+
+/*
+ * break_spin_lock - drop and immediately reacquire the given lock. This
+ * creates a preemption point if it is the only held lock.
+ */
+static inline void break_spin_lock(spinlock_t * lock)
+{
+ spin_unlock(lock);
+ spin_lock(lock);
+}
+
/* Reevaluate whether the task has signals pending delivery.
This is required every time the blocked sigset_t changes.
All callers should have t->sigmask_lock. */

2002-07-22 18:25:25

by Robert Love

Subject: Re: [PATCH] low-latency zap_page_range

On Mon, 2002-07-22 at 11:05, Linus Torvalds wrote:

> On 22 Jul 2002, Robert Love wrote:
> >
> > Sure. What do you think of this?
>
> How about adding a "cond_resched_lock()" primitive?

And this patch is an updated zap_page_range() now using the new approach
I posted and Linus's suggested cond_resched_lock method (previous
patch).

Personally I still prefer the simpler loop method... note that
cond_resched_lock() assumes that the lock depth is ALWAYS 1, since we
explicitly call schedule. A safer alternative would be break_spin_lock(),
which will preemptively reschedule automatically, but only if
preempt_count==0 (and only with the preemptible kernel enabled).

This patch also has the other cleanups/optimizations from the original
zap_page_range patch - same patch as before but with the new method.
Patch is against 2.5 BK.

Robert Love

diff -urN linux-2.5.27/mm/memory.c linux/mm/memory.c
--- linux-2.5.27/mm/memory.c Sat Jul 20 12:11:17 2002
+++ linux/mm/memory.c Mon Jul 22 11:18:10 2002
@@ -390,8 +390,8 @@
{
pgd_t * dir;

- if (address >= end)
- BUG();
+ BUG_ON(address >= end);
+
dir = pgd_offset(vma->vm_mm, address);
tlb_start_vma(tlb, vma);
do {
@@ -402,33 +402,43 @@
tlb_end_vma(tlb, vma);
}

-/*
- * remove user pages in a given range.
+#define ZAP_BLOCK_SIZE (256 * PAGE_SIZE) /* how big a chunk we loop over */
+
+/**
+ * zap_page_range - remove user pages in a given range
+ * @vma: vm_area_struct holding the applicable pages
+ * @address: starting address of pages to zap
+ * @size: number of bytes to zap
*/
void zap_page_range(struct vm_area_struct *vma, unsigned long address, unsigned long size)
{
struct mm_struct *mm = vma->vm_mm;
mmu_gather_t *tlb;
- pgd_t * dir;
- unsigned long start = address, end = address + size;
+ unsigned long end, block;

- dir = pgd_offset(mm, address);
+ spin_lock(&mm->page_table_lock);

/*
- * This is a long-lived spinlock. That's fine.
- * There's no contention, because the page table
- * lock only protects against kswapd anyway, and
- * even if kswapd happened to be looking at this
- * process we _want_ it to get stuck.
+ * This was once a long-held spinlock. Now we break the
+ * work up into ZAP_BLOCK_SIZE units and relinquish the
+ * lock after each iteration. This drastically lowers
+ * lock contention and allows for a preemption point.
*/
- if (address >= end)
- BUG();
- spin_lock(&mm->page_table_lock);
- flush_cache_range(vma, address, end);
+ while (size) {
+ block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
+ end = address + block;
+
+ flush_cache_range(vma, address, end);
+ tlb = tlb_gather_mmu(mm, 0);
+ unmap_page_range(tlb, vma, address, end);
+ tlb_finish_mmu(tlb, address, end);
+
+ cond_resched_lock(&mm->page_table_lock);
+
+ address += block;
+ size -= block;
+ }

- tlb = tlb_gather_mmu(mm, 0);
- unmap_page_range(tlb, vma, address, end);
- tlb_finish_mmu(tlb, start, end);
spin_unlock(&mm->page_table_lock);
}


2002-07-22 18:39:11

by Andrew Morton

Subject: Re: [PATCH] low-latency zap_page_range

Robert Love wrote:
>
> On Sun, 2002-07-21 at 22:14, Andrew Morton wrote:
>
> > This adds probably-unneeded extra work - we shouldn't go
> > dropping the lock unless that is actually required. ie:
> > poll ->need_resched first. Possible?
>
> Sure. What do you think of this?
>
> spin_lock(&mm->page_table_lock);
>
> while (size) {
>         block = (size > ZAP_BLOCK_SIZE) ? ZAP_BLOCK_SIZE : size;
>         end = address + block;
>
>         flush_cache_range(vma, address, end);
>         tlb = tlb_gather_mmu(mm, 0);
>         unmap_page_range(tlb, vma, address, end);
>         tlb_finish_mmu(tlb, address, end);
>
>         if (need_resched()) {
>                 /*
>                  * If we need to reschedule we will do so
>                  * here if we do not hold any other locks.
>                  */
>                 spin_unlock(&mm->page_table_lock);
>                 spin_lock(&mm->page_table_lock);
>         }
>
>         address += block;
>         size -= block;
> }
>
> spin_unlock(&mm->page_table_lock);
>
> My only issue with the above is that it is _ugly_ compared to the more
> natural loop. I.e., this looks much more like explicit lock breaking /
> conditional rescheduling, whereas the original loop just happens to
> acquire and release the lock on each iteration. Sure, same effect, but
> I think it says something about the maintainability and cleanliness of
> the function.

Disagree, really. It's not a thing of beauty, but it is completely
obvious what the code is doing and why it is doing it. There are
no subtle side-effects and the whole lot can be understood from a
single screenful. Unmaintainable code is code which requires you
to spend days crawling all over the tree working out what it's doing
and why it's doing it. It's awkward, but it's simple, and I wouldn't
get very worked up over it personally.

> One thing about the "overhead" here - the main overhead would be the
> lock bouncing in between cachelines on SMP afaict. However, either (a)
> there is no SMP contention or (b) there is and dropping the lock
> regardless may be a good idea. Thoughts?

Hard call. In general I suspect it's best to hold onto a lock
for as long as possible, get a lot of work done rather than
reacquiring it all the time. But there are some locks which are
occasionally held for a very long time and are often held for very
short periods. Such as this one (mm->page_table_lock) and pagemap_lru_lock.
In those cases, dropping the lock to let someone else get in, out and
away may help. But not as much as a little bit of locking redesign...

> Hm, the above also ends up checking need_resched twice (the explicit
> need_resched() and again on the final unlock)... we can fix that by
> manually calling _raw_spin_unlock and then preempt_schedule, but that
> could also result in a (much longer) needless call to preempt_schedule
> if an intervening interrupt serviced the request first.

zap_page_range is sometimes called under another lock (e.g., vmtruncate_list).
So there's nothing useful to be done there. Perhaps you should test
->preempt_count as well - if it's greater than one then don't bother with
the lock dropping.

This, btw, probably means that your code won't be very effective yet: large
truncates will still exhibit poor latency. However, truncate _is_ something
which we can improve algorithmically. One possibility is to implement a
radix tree split function: split the radix tree into two trees along the
truncation point, clean up the end and then drop the bulk of the pages
outside locks.

However that's probably going a bit far - my preferred approach would
be to implement a radix tree gang-lookup function. "Find me the next
N pages above index I". No more list-walking in truncate. We can use
gang-lookup nicely in truncate, in readahead and in writeback. But no
immediate plans on that one, alas.
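
Such a gang-lookup would have a shape along these lines - just the idea,
nothing with this signature exists in 2.5.27:

/*
 * Fill 'results' with up to max_items pages whose indices are >=
 * first_index, returning how many were found.  Truncate could then loop
 * on this, batch by batch, instead of walking page lists under the lock.
 */
unsigned int radix_tree_gang_lookup(struct radix_tree_root *root,
                                    void **results,
                                    unsigned long first_index,
                                    unsigned int max_items);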

-

2002-07-22 18:47:08

by Robert Love

Subject: Re: [PATCH] low-latency zap_page_range

On Mon, 2002-07-22 at 11:40, Andrew Morton wrote:

> Disagree, really. It's not a thing of beauty, but it is completely
> obvious what the code is doing and why it is doing it. There are
> no subtle side-effects and the whole lot can be understood from a
> single screenful. Unmaintainable code is code which requires you
> to spend days crawling all over the tree working out what it's doing
> and why it's doing it. It's awkward, but it's simple, and I wouldn't
> get very worked up over it personally.

I agree with your points, although I do not find the previous version any
less so.

> Hard call. In general I suspect it's best to hold onto a lock
> for as long as possible, get a lot of work done rather than
> reacquiring it all the time. But there are some locks which are
> occasionally held for a very long time and are often held for very
> short periods. Such as this one (mm->page_table_lock) and pagemap_lru_lock.
> In those cases, dropping the lock to let someone else get in, out and
> away may help. But not as much as a little bit of locking redesign...

Agreed.

> zap_page_range is sometimes called under another lock (e.g., vmtruncate_list).
> So there's nothing useful to be done there. Perhaps you should test
> ->preempt_count as well - if it's greater than one then don't bother with
> the lock dropping.

Hrm, this means cond_resched_lock() is out of the question here, then.

We could use break_spin_lock(), but that would mean we drop the lock
without checking for need_resched (or wrap it in need_resched() and then
we check twice).

Finally, we could take your approach, change cond_resched_lock() to be:

if (need_resched() && preempt_count() == 1) {
        spin_unlock_no_resched(lock);
        __cond_resched();
        spin_lock(lock);
}

but then we need to break the function up into a preempt and a
non-preempt version as preempt_count() unconditionally returns 0 with
!CONFIG_PREEMPT. Right now the functions I posted do the right thing on
any combination of UP, SMP, and preempt.
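
For concreteness, the split would look something like this (an untested
sketch, not something I have posted as a patch):

#ifdef CONFIG_PREEMPT
static inline void cond_resched_lock(spinlock_t * lock)
{
        /* only worth dropping the lock if it is the sole lock held */
        if (need_resched() && preempt_count() == 1) {
                spin_unlock_no_resched(lock);
                __cond_resched();
                spin_lock(lock);
        }
}
#else
static inline void cond_resched_lock(spinlock_t * lock)
{
        /* preempt_count() is hardwired to 0 here, so just check resched */
        if (need_resched()) {
                spin_unlock_no_resched(lock);
                __cond_resched();
                spin_lock(lock);
        }
}
#endif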

Thoughts?

> This, btw, probably means that your code won't be very effective yet: large
> truncates will still exhibit poor latency. However, truncate _is_ something
> which we can improve algorithmically. One possibility is to implement a
> radix tree split function: split the radix tree into two trees along the
> truncation point, clean up the end and then drop the bulk of the pages
> outside locks.

I would _much_ prefer to tackle these issues via better algorithms...
your suggestion for truncate is good.

Note that I make an exception here (part of my argument for a preemptive
kernel was no more need to do "hackish" conditional scheduling and lock
breaking) because there really is not much you can do to this
algorithm. It does a lot of work on potentially a lot of data and the
cleanest solution we have is to just break it up into chunks.

Robert Love

2002-07-23 00:02:56

by jak

Subject: [PATCH] ext3 lockup fix to lock-break-rml-2.4.18-1.patch

Hi Robert,
The following patch fixes an apparent mismerge of Andrew Morton's low
latency work into the lock breaking patch. Without this fix,
journal_commit_transaction() will loop forever, without making
forward progress, if need_resched was set on entry.

Base against which the patch was derived:

linux.2.4.18.tar.bz2
+ sched-O1-2.4.18-pre8-K3.patch
+ preempt-kernel-rml-2.4.18-rc1-ingo-K3-1.patch
+ lock-break-rml-2.4.18-1.patch

Regards,
Joe


--- fs/jbd/commit.c.orig Mon Jul 22 19:51:43 2002
+++ fs/jbd/commit.c Mon Jul 22 19:52:00 2002
@@ -278,6 +278,7 @@
debug_lock_break(551);
spin_unlock(&journal_datalist_lock);
unlock_journal(journal);
+ unconditional_schedule();
lock_journal(journal);
spin_lock(&journal_datalist_lock);
continue;