2009-04-13 19:45:21

by Ying Han

Subject: [V4][PATCH 0/4]page fault retry with NOPAGE_RETRY

page fault retry with NOPAGE_RETRY
Allow major faults to drop the mmap_sem read lock while waiting for a
synchronous disk read. This allows another thread which wishes to grab
down_write(mmap_sem) to proceed while the current thread is waiting on the
disk I/O.

The patch replaces the 'write' flag of handle_mm_fault() with a
FAULT_FLAG_RETRY flag to indicate that the caller can tolerate a retry in
the filemap_fault call path.

Compiled and booted on x86_64. (Andrew: can you help on other arches?)

changelog[v4]:
- split the patch into four parts. The first two patches are a cleanup of
Linus's "untested" patch, which replaces the boolean "write_access" with a
flag in handle_mm_fault. The last two patches add the
FAULT_FLAG_RETRY support.
- include all arch cleanups.
- add more comments as Andrew suggested.
- cleanups as Wu Fengguang suggested.

changelog[v3]:
- applied fixes and cleanups from Wu Fengguang.
filemap VM_FAULT_RETRY fixes
[PATCH 01/14] mm: fix find_lock_page_retry() return value parsing
[PATCH 02/14] mm: fix major/minor fault accounting on retried fault
[PATCH 04/14] mm: reduce duplicate page fault code
[PATCH 05/14] readahead: account mmap_miss for VM_FAULT_RETRY

- split the patch into two parts. The first part includes FAULT_FLAG_RETRY
support with no change to current users. The second part includes individual
per-architecture cleanups that enable FAULT_FLAG_RETRY.
Currently there are mainly two users of handle_mm_fault; we enable
FAULT_FLAG_RETRY for the actual fault handler and leave get_user_pages
unchanged.

changelog[v2]:
- reduce the runtime overhead by extending the 'write' flag of
handle_mm_fault() to indicate the retry hint.
- add another two branches in filemap_fault with retry logic.
- replace find_lock_page with find_lock_page_retry to make the code
cleaner.

Benchmarks:
case 1. one application has a high count of threads, each faulting in
different pages of a huge file. Benchmarks indicate that the double data
structure walk in the major fault case results in a << 1% performance hit.

case 2. add another thread to the above application which sits in a tight
loop of mmap()/munmap(). Here we measure the loop count in the new thread
while the other threads do the same amount of work as in case one. We got a
<< 3% performance hit on the Complete Time (the benchmark value for case
one) and a 10% performance improvement on the mmap()/munmap() counter.

This patch helps a lot in cases where we have a writer waiting behind all
the readers, so it can execute much faster.

benchmarks from Wu Fengguang:
Just tested the sparse-random-read-on-sparse-file case, and found the
performance impact to be 0.4% (8.706s vs 8.744s) in the worst case.
Kind of acceptable.

without FAULT_FLAG_RETRY:
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.28s user 5.39s system 99% cpu 8.692 total
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.17s user 5.54s system 99% cpu 8.742 total
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.18s user 5.48s system 99% cpu 8.684 total

FAULT_FLAG_RETRY:
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.18s user 5.63s system 99% cpu 8.825 total
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.22s user 5.47s system 99% cpu 8.718 total
iotrace.rb --load stride-100 --mplay /mnt/btrfs-ram/sparse
3.13s user 5.55s system 99% cpu 8.690 total

In the above faked workload, the mmap read page offsets are loaded from
stride-100 and performed on /mnt/btrfs-ram/sparse, which are created by:

seq 0 100 1000000 > stride-100
dd if=/dev/zero of=/mnt/btrfs-ram/sparse bs=1M count=1 seek=1024000

Signed-off-by: Mike Waychison <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Ying Han <[email protected]>


2009-04-13 20:06:23

by Linus Torvalds

Subject: Re: [V4][PATCH 0/4]page fault retry with NOPAGE_RETRY



On Mon, 13 Apr 2009, Ying Han wrote:
>
> Benchmarks:
> case 1. one application has a high count of threads, each faulting in
> different pages of a huge file. Benchmarks indicate that the double data
> structure walk in the major fault case results in a << 1% performance hit.
>
> case 2. add another thread to the above application which sits in a tight
> loop of mmap()/munmap(). Here we measure the loop count in the new thread
> while the other threads do the same amount of work as in case one. We got a
> << 3% performance hit on the Complete Time (the benchmark value for case
> one) and a 10% performance improvement on the mmap()/munmap() counter.
>
> This patch helps a lot in cases where we have a writer waiting behind all
> the readers, so it can execute much faster.

Hmm. I normally think of "<<" as "much smaller than", but the way you use
it makes me wonder. In particular, "<< 3%" sounds very odd. If it's much
smaller than 3%, I'd have expected "<< 1%" again. So it probably isn't.

> benchmarks from Wu Fengguang:
> Just tested the sparse-random-read-on-sparse-file case, and found the
> performance impact to be 0.4% (8.706s vs 8.744s) in the worst case.
> Kind of acceptable.

Well, have you tried the obvious optimization of _not_ doing the RETRY
path when atomic_read(&mm->counter) == 1?

After all, if it's not a threaded app, and it doesn't have a possibility
of concurrent mmap/fault, then why release the lock?

Linus

2009-04-13 20:28:20

by Linus Torvalds

Subject: Re: [V4][PATCH 0/4]page fault retry with NOPAGE_RETRY



On Mon, 13 Apr 2009, Linus Torvalds wrote:
>
> Well, have you tried the obvious optimization of _not_ doing the RETRY
> path when atomic_read(&mm->counter) == 1?
>
> After all, if it's not a threaded app, and it doesn't have a possibility
> of concurrent mmap/fault, then why release the lock?

Ok, so the counter is called 'mm_users', not 'counter'.

Anyway, I would try that in the arch patch, and just see what happens when
you change the

unsigned int fault_flags = FAULT_FLAG_RETRY;

into

unsigned int fault_flags;

..

fault_flags = atomic_read(&mm->mm_users) > 1 ? FAULT_FLAG_RETRY : 0;

where you should probably do that mm dereference only after checking that
you're not in atomic context or something like that (so move the
assignment down).

The reason I'd suggest doing it in the caller of handle_mm_fault() rather
than anywhere deeper in the call chain is that some callers _might_ really
want to get the retry semantics even if they are the only ones. Imagine,
for example, some kind of non-blocking "get_user_pages()" thing.

Linus

2009-04-13 21:04:40

by Ying Han

Subject: Re: [V4][PATCH 0/4]page fault retry with NOPAGE_RETRY

On Mon, Apr 13, 2009 at 12:57 PM, Linus Torvalds
<[email protected]> wrote:
>
>
> On Mon, 13 Apr 2009, Ying Han wrote:
>>
>> Benchmarks:
>> case 1. one application has a high count of threads, each faulting in
>> different pages of a huge file. Benchmarks indicate that the double data
>> structure walk in the major fault case results in a << 1% performance hit.
>>
>> case 2. add another thread to the above application which sits in a tight
>> loop of mmap()/munmap(). Here we measure the loop count in the new thread
>> while the other threads do the same amount of work as in case one. We got a
>> << 3% performance hit on the Complete Time (the benchmark value for case
>> one) and a 10% performance improvement on the mmap()/munmap() counter.
>>
>> This patch helps a lot in cases where we have a writer waiting behind all
>> the readers, so it can execute much faster.
>
> Hmm. I normally think of "<<" as "much smaller than", but the way you use
> it makes me wonder. In particular, "<< 3%" sounds very odd. If it's much
> smaller than 3%, I'd have expected "<< 1%" again. So it probably isn't.

Yes, it should be "< 3%"; I will make the change.

>> benchmarks from Wu Fengguang:
>> Just tested the sparse-random-read-on-sparse-file case, and found the
>> performance impact to be 0.4% (8.706s vs 8.744s) in the worst case.
>> Kind of acceptable.
>
> Well, have you tried the obvious optimization of _not_ doing the RETRY
> path when atomic_read(&mm->counter) == 1?
>
> After all, if it's not a threaded app, and it doesn't have a possibility
> of concurrent mmap/fault, then why release the lock?
>
> Linus
>