2008-11-22 06:47:56

by Ying Han

Subject: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

page fault retry with NOPAGE_RETRY
Allow major faults to drop the mmap_sem read lock while waiting for a
synchronous disk read. This allows another thread which wishes to grab
down_read(mmap_sem) to proceed while the current thread is waiting for disk IO.

The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that the
caller can tolerate a retry in the filemap_fault call path.

The benchmark mmaps a huge file and spawns 64 threads, each faulting in
pages in reverse order. The result shows an 8% performance hit with the
patch.

Future Improvement:
1. It could be more efficient to check whether the mm_struct has changed.
Then we would not need to back all the way out of the pagefault handler in
the cases where the mm_struct has not changed.
2. It is a bit hacky to use a flag in current->flags to determine
whether we have done the retry or not. A more generic way is to pass
back to the page fault handler something like page_fault_args, which gives
the reason for the retry, so that the higher level can handle it
better.

Signed-off-by: Mike Waychison <[email protected]>
Signed-off-by: Ying Han <[email protected]>


arch/x86/mm/fault.c | 25 +++++++++++++++++++---
include/linux/mm.h | 1 +
include/linux/sched.h | 1 +
mm/filemap.c | 53 +++++++++++++++++++++++++++++++++++++++++++++++-
mm/memory.c | 6 +++++
5 files changed, 80 insertions(+), 6 deletions(-)


diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 31e8730..883e9c5 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -577,10 +577,8 @@ int show_unhandled_signals = 1;
* and the problem, and then passes it off to one of the appropriate
* routines.
*/
-#ifdef CONFIG_X86_64
-asmlinkage
-#endif
-void __kprobes do_page_fault(struct pt_regs *regs, unsigned long error_code)
+static void __kprobes __do_page_fault(struct pt_regs *regs,
+ unsigned long error_code)
{
struct task_struct *tsk;
struct mm_struct *mm;
@@ -689,6 +687,7 @@ again:
down_read(&mm->mmap_sem);
}

+retry:
vma = find_vma(mm, address);
if (!vma)
goto bad_area;
@@ -743,6 +742,15 @@ good_area:
goto do_sigbus;
BUG();
}
+
+ if (fault & VM_FAULT_RETRY) {
+ if (current->flags & PF_FAULT_MAYRETRY) {
+ current->flags &= ~PF_FAULT_MAYRETRY;
+ goto retry;
+ }
+ BUG();
+ }
+
if (fault & VM_FAULT_MAJOR)
tsk->maj_flt++;
else
@@ -893,6 +901,16 @@ do_sigbus:
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}

+#ifdef CONFIG_X86_64
+asmlinkage
+#endif
+void do_page_fault(struct pt_regs *regs, unsigned long error_code)
+{
+ current->flags |= PF_FAULT_MAYRETRY;
+ __do_page_fault(regs, error_code);
+ current->flags &= ~PF_FAULT_MAYRETRY;
+}
+
DEFINE_SPINLOCK(pgd_lock);
LIST_HEAD(pgd_list);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffee2f7..d325ae8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -694,6 +694,7 @@ static inline int page_mapped(struct page *page)
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR 0x0004
#define VM_FAULT_WRITE 0x0008 /* Special case for get_user_pages */
+#define VM_FAULT_RETRY 0x0010

#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b483f39..8c41746 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1559,7 +1559,7 @@ extern cputime_t task_gtime(struct task_struct *p);
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester *
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeab
#define PF_FREEZER_NOSIG 0x80000000 /* Freezer won't send signals to it */
-
+#define PF_FAULT_MAYRETRY 0x08000000 /* may drop mmap_sem during fault */
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
diff --git a/mm/filemap.c b/mm/filemap.c
index f3e5f89..2baa519 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1458,6 +1458,8 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_
*/
retry_find:
page = find_lock_page(mapping, vmf->pgoff);
+
+retry_find_nopage:
/*
* For sequential accesses, we use the generic readahead logic.
*/
@@ -1512,6 +1514,7 @@ retry_find:
if (!did_readaround)
ra->mmap_miss--;

+retry_page_update:
/*
* We have a locked page in the page cache, now we need to check
* that it's up-to-date. If not, it is going to be due to an error.
@@ -1547,8 +1550,54 @@ no_cached_page:
* In the unlikely event that someone removed it in the
* meantime, we'll just come back here and read it again.
*/
- if (error >= 0)
- goto retry_find;
+ if (error >= 0) {
+ /*
+ * If caller cannot tolerate a retry in the ->fault path
+ * go back to check the page again.
+ */
+ if (!(current->flags & PF_FAULT_MAYRETRY))
+ goto retry_find;
+
+ /*
+ * Caller is flagged with retry. If page is deleted
+ * already, go back to get a new page, otherwise
+ * check the page is locked or not. If page is
+ * locked, do nopage_retry.
+ */
+ page = find_get_page(mapping, vmf->pgoff);
+ if (!page)
+ goto retry_find_nopage;
+ if (!trylock_page(page)) {
+ struct mm_struct *mm = vma->vm_mm;
+ /*
+ * Page is already locked by someone else.
+ *
+ * We don't want to be holding down_read(mmap_sem)
+ * inside lock_page(). We use wait_on_page_lock here
+ * to just wait until the page is unlocked, but we
+ * don't really need
+ * to lock it.
+ */
+ up_read(&mm->mmap_sem);
+ wait_on_page_locked(page);
+ down_read(&mm->mmap_sem);
+ /*
+ * The VMA tree may have changed at this point.
+ */
+ page_cache_release(page);
+ return VM_FAULT_RETRY;
+ }
+
+ /* Has the page been truncated */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ page_cache_release(page);
+ goto retry_find;
+ }
+
+ goto retry_page_update;
+
+ }

/*
* An error return from page_cache_read can result if the
diff --git a/mm/memory.c b/mm/memory.c
index 164951c..38bd63b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2467,6 +2467,12 @@ static int __do_fault(struct mm_struct *mm, struct vm_a
vmf.page = NULL;

ret = vma->vm_ops->fault(vma, &vmf);
+
+ /* page may be available, but we have to restart the process
+ * because mmap_sem was dropped during the ->fault */
+ if (ret == VM_FAULT_RETRY)
+ return ret;
+
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))
return ret;


2008-11-22 07:15:30

by Andrew Morton

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Fri, 21 Nov 2008 22:47:44 -0800 Ying Han <[email protected]> wrote:

> page fault retry with NOPAGE_RETRY
> Allow major faults to drop the mmap_sem read lock while waiting for a
> synchronous disk read. This allows another thread which wishes to grab
> down_read(mmap_sem) to proceed while the current thread is waiting for disk IO.

Confused. down_read() on an rwsem will already permit multiple threads
to run that section of code concurrently.

The benefit here will be to permit down_write() callers (eg:
sys_mmap()) to get in there and do work.

> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that the
> caller can tolerate a retry in the filemap_fault call path.
>
> The benchmark mmaps a huge file and spawns 64 threads, each faulting in
> pages in reverse order. The result shows an 8% performance hit with the
> patch.

You mean it slowed down 8%? I'm a bit surprised - I'd have expected a
smaller slowdown for an IO-intensive thing like this.

Does it speed anything up?

2008-11-23 09:19:42

by Ingo Molnar

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY


* Ying Han <[email protected]> wrote:

> page fault retry with NOPAGE_RETRY

Interesting patch.

> Allow major faults to drop the mmap_sem read lock while waiting for a
> synchronous disk read. This allows another thread which wishes to grab
> down_read(mmap_sem) to proceed while the current thread is waiting for disk IO.

Do you mean down_write()? down_read() can already be nested
arbitrarily.

> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
> the caller can tolerate a retry in the filemap_fault call path.
>
> The benchmark mmaps a huge file and spawns 64 threads, each
> faulting in pages in reverse order. The result shows an 8%
> performance hit with the patch.

I suspect we also want to see the cases where this change helps?

Also, constructs like this are pretty ugly:

> +#ifdef CONFIG_X86_64
> +asmlinkage
> +#endif
> +void do_page_fault(struct pt_regs *regs, unsigned long error_code)
> +{
> + current->flags |= PF_FAULT_MAYRETRY;
> + __do_page_fault(regs, error_code);
> + current->flags &= ~PF_FAULT_MAYRETRY;
> +}

This seems to be unnecessary runtime overhead to pass in a flag to
handle_mm_fault(). Why not extend the 'write' flag of
handle_mm_fault() to also signal "arch is able to retry"?

Also, _if_ we decide that from-scratch pagefault retries are good, i
see no reason why this should not be extended to all architectures:

The retry should happen purely in the MM layer - all information is
available already, and much of do_page_fault() could generally be
moved into mm/memory.c, with one or two arch-provided standard
callbacks to express certain page fault quirks. (such as vm86 mode on
x86)

(Such a design would allow more nice cleanups - handle_mm_fault()
could inline inside the pagefault handler, etc.)

Also, a few small details. Please use this proper multi-line comment
style:

> + /*
> + * Page is already locked by someone else.
> + *
> + * We don't want to be holding down_read(mmap_sem)
> + * inside lock_page(). We use wait_on_page_lock here
> + * to just wait until the page is unlocked, but we
> + * don't really need
> + * to lock it.
> + */

Not this one:

> + /* page may be available, but we have to restart the process
> + * because mmap_sem was dropped during the ->fault */

Ingo

2008-11-23 18:25:37

by Andrew Morton

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Sun, 23 Nov 2008 10:18:44 +0100 Ingo Molnar <[email protected]> wrote:

> * Ying Han <[email protected]> wrote:
>
> > page fault retry with NOPAGE_RETRY
>
> Interesting patch.

<a grey cell stirs>

ahhh... I thought this all sounded familiar. It surfaced a couple of
years ago, and this was my summary of the intent at the time:

http://lkml.indiana.edu/hypermail/linux/kernel/0609.1/2106.html

It all went round and round for a while, but I don't think anything got
merged. In fact I can't even find Ben's original little spufs/cell
NOPAGE_RETRY code in the tree, and I thought we merged that. Confused.

The question is, of course: does this new code address the issues
which were raised at that time?

2008-11-25 18:43:07

by Ying Han

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Thanks, Ingo, for your comments. I am now working on V2, which should
be posted later today.

On Sun, Nov 23, 2008 at 1:18 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ying Han <[email protected]> wrote:
>
>> page fault retry with NOPAGE_RETRY
>
> Interesting patch.
Thank you, glad to know that.
>
>> Allow major faults to drop the mmap_sem read lock while waiting for a
>> synchronous disk read. This allows another thread which wishes to grab
>> down_read(mmap_sem) to proceed while the current thread is waiting for disk IO.
>
> Do you mean down_write()? down_read() can already be nested
> arbitrarily.
Fixed. It should be down_write().

>> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
>> the caller can tolerate a retry in the filemap_fault call path.
>>
>> The benchmark mmaps a huge file and spawns 64 threads, each
>> faulting in pages in reverse order. The result shows an 8%
>> performance hit with the patch.
>
> I suspect we also want to see the cases where this change helps?
I am working on more benchmarks to show the performance improvement.
>
> Also, constructs like this are pretty ugly:
>
>> +#ifdef CONFIG_X86_64
>> +asmlinkage
>> +#endif
>> +void do_page_fault(struct pt_regs *regs, unsigned long error_code)
>> +{
>> + current->flags |= PF_FAULT_MAYRETRY;
>> + __do_page_fault(regs, error_code);
>> + current->flags &= ~PF_FAULT_MAYRETRY;
>> +}
>
> This seems to be unnecessary runtime overhead to pass in a flag to
> handle_mm_fault(). Why not extend the 'write' flag of
> handle_mm_fault() to also signal "arch is able to retry"?
thanks and fixed in V2

>
> Also, _if_ we decide that from-scratch pagefault retries are good, i
> see no reason why this should not be extended to all architectures:
>
> The retry should happen purely in the MM layer - all information is
> available already, and much of do_page_fault() could generally be
> moved into mm/memory.c, with one or two arch-provided standard
> callbacks to express certain page fault quirks. (such as vm86 mode on
> x86)
>
> (Such a design would allow more nice cleanups - handle_mm_fault()
> could inline inside the pagefault handler, etc.)
I will make a megapatch in V2 with support for each architecture and send
it to Andrew, linux-kernel and linux-arch. Thanks.

>
> Also, a few small details. Please use this proper multi-line comment
> style:
>
>> + /*
>> + * Page is already locked by someone else.
>> + *
>> + * We don't want to be holding down_read(mmap_sem)
>> + * inside lock_page(). We use wait_on_page_lock here
>> + * to just wait until the page is unlocked, but we
>> + * don't really need
>> + * to lock it.
>> + */
thanks and fixed.
> Not this one:
>
>> + /* page may be available, but we have to restart the process
>> + * because mmap_sem was dropped during the ->fault */
>
> Ingo
>

2008-11-26 12:32:57

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Tue, Nov 25, 2008 at 10:42:47AM -0800, Ying Han wrote:
> >> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
> >> the caller can tolerate a retry in the filemap_fault call path.
> >>
> >> The benchmark mmaps a huge file and spawns 64 threads, each
> >> faulting in pages in reverse order. The result shows an 8%
> >> performance hit with the patch.
> >
> > I suspect we also want to see the cases where this change helps?
> I am working on more benchmarks to show the performance improvement.

Can't you share the actual improvement you see inside Google?

Google must be doing something funky with threads, because both
this patch and their new malloc allocator apparently were due to
mmap_sem contention problems, right?

That was before the kernel and glibc got together to fix the stupid
mmap_sem problem in malloc (shown up in that FreeBSD MySQL thread);
and before private futexes. I would be interested to know if Google
still has problems that require this patch...

2008-11-26 19:58:37

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Tue, Nov 25, 2008 at 10:42:47AM -0800, Ying Han wrote:
>>>> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
>>>> the caller can tolerate a retry in the filemap_fault call path.
>>>>
>>>> The benchmark mmaps a huge file and spawns 64 threads, each
>>>> faulting in pages in reverse order. The result shows an 8%
>>>> performance hit with the patch.
>>> I suspect we also want to see the cases where this change helps?
>> I am working on more benchmarks to show the performance improvement.
>
> Can't you share the actual improvement you see inside Google?
>
> Google must be doing something funky with threads, because both
> this patch and their new malloc allocator apparently were due to
> mmap_sem contention problems, right?

One of the big improvements we see with this patch is the ability to
read out files in /proc/pid much faster. Consider the following events:

- an application has a high count of threads sleeping with
read_lock(mmap_sem) held in the fault path (on the order of hundreds).
- one of the threads in the application then blocks in
write_lock(mmap_sem) in the mmap()/munmap() paths
- now our monitoring software tries to read some of the /proc/pid files
and blocks behind the waiting writer due to the fairness of the rwsems.
This basically has to wait for all faults ahead of the reader to
terminate (and let go of the reader lock) and then the writer to have a
go at mmap_sem. This can take an extremely long time.

This patch helps a lot in this case as it keeps the writer from waiting
behind all the waiting readers, so it executes much faster.
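
The stall described in this scenario can be put into a toy arithmetic model. This is a simplification under stated assumptions: a strictly fair rwsem, one queued writer, and a late reader queued behind it. `late_reader_wait_ms()` is a hypothetical helper, and the figures in the usage note are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model of a fair rwsem: a late reader queued behind one writer waits
 * until every current read holder has released the lock (the longest
 * remaining hold time), and then for the writer's own hold time.
 */
static long late_reader_wait_ms(const long *holder_remaining_ms, size_t n,
				long writer_hold_ms)
{
	long longest = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (holder_remaining_ms[i] > longest)
			longest = holder_remaining_ms[i];

	return longest + writer_hold_ms;
}
```

With three faulting threads that keep the read lock for 8000, 12000 and 10000 ms of disk wait and a writer that needs 1 ms, the late reader waits 12001 ms in this model. If the faulting threads instead drop the lock during the disk wait and hold it for only 1 ms each, the same call gives 2 ms.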

>
> That was before the kernel and glibc got together to fix the stupid
> mmap_sem problem in malloc (shown up in that FreeBSD MySQL thread);
> and before private futexes. I would be interested to know if Google
> still has problems that require this patch...
>

I'm not very familiar with the 'malloc' problem in glibc. Was this just
overhead in heap growth/shrinkage causing problems?

2008-11-27 08:56:17

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Wed, Nov 26, 2008 at 11:57:24AM -0800, Mike Waychison wrote:
> Nick Piggin wrote:
> >On Tue, Nov 25, 2008 at 10:42:47AM -0800, Ying Han wrote:
> >>>>The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
> >>>>the caller can tolerate a retry in the filemap_fault call path.
> >>>>
> >>>>The benchmark mmaps a huge file and spawns 64 threads, each
> >>>>faulting in pages in reverse order. The result shows an 8%
> >>>>performance hit with the patch.
> >>>I suspect we also want to see the cases where this change helps?
> >>I am working on more benchmarks to show the performance improvement.
> >
> >Can't you share the actual improvement you see inside Google?
> >
> >Google must be doing something funky with threads, because both
> >this patch and their new malloc allocator apparently were due to
> >mmap_sem contention problems, right?
>
> One of the big improvements we see with this patch is the ability to
> read out files in /proc/pid much faster. Consider the following events:
>
> - an application has a high count of threads sleeping with
> read_lock(mmap_sem) held in the fault path (on the order of hundreds).
> - one of the threads in the application then blocks in
> write_lock(mmap_sem) in the mmap()/munmap() paths
> - now our monitoring software tries to read some of the /proc/pid files
> and blocks behind the waiting writer due to the fairness of the rwsems.
> This basically has to wait for all faults ahead of the reader to
> terminate (and let go of the reader lock) and then the writer to have a
> go at mmap_sem. This can take an extremely long time.
>
> This patch helps a lot in this case as it keeps the writer from waiting
> behind all the waiting readers, so it executes much faster.

Hmm. How quantifiable is the benefit? Does it actually matter that you
can read the proc file much faster? (this is for some automated workload
management daemon or something, right?)

Would it be possible to reduce mmap()/munmap() activity? eg. if it is
due to a heap memory allocator, then perhaps do more batching or set
some hysteresis.


> >That was before the kernel and glibc got together to fix the stupid
> >mmap_sem problem in malloc (shown up in that FreeBSD MySQL thread);
> >and before private futexes. I would be interested to know if Google
> >still has problems that require this patch...
> >
>
> I'm not very familiar with the 'malloc' problem in glibc. Was this just
> overhead in heap growth/shrinkage causing problems?

As far as I understand, glibc does actually separate notions of allocated
virtual memory and allocated pages. By that I mean that it is very careful to
return pages back to the system when they are unused, but it does try to
batch up changes to the virtual address space. Unfortunately, it used to
return pages by doing a mmap call with PROT_NONE, then to start using
that virtual memory again, it would mmap with PROT_READ|PROT_WRITE.

This meant that a malloc/touch/free would look like this:
mmap <- down_write(mmap_sem)
page fault <- down_read(mmap_sem)
mmap <- down_write(mmap_sem)

What we did was to make it use madvise(MADV_DONTNEED) to throw the pages
away. Then the kernel was changed to implement MADV_DONTNEED using only
down_read. Then the same sequence is just this:

page fault <- down_read(mmap_sem)
madvise <- down_read(mmap_sem)

(because changes to virtual memory allocation are batched).

I thought google's malloc was primarily written to fix this bad behaviour,
because with the new behaviour, glibc's malloc seems to beat google's
malloc on the performance and scalability tests I was running at the time,
as well as being more memory footprint friendly.

OTOH, it would be possible that hysteresis watermarks in glibc are not big
enough for a given application, which could introduce virtual address
activity back into the workload. These can be tuned, however.
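
The madvise-based scheme described above can be tried from user space. This is a Linux-specific sketch for a private anonymous mapping, and `release_and_reuse()` is a hypothetical helper name. It populates some pages, throws them away with madvise(MADV_DONTNEED) while keeping the virtual range mapped, and then touches the range again without any further mmap call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper. Returns 0 on success, -1 on failure. On success,
 * *zero_after is 1 if every page read back as zero after being discarded.
 */
static int release_and_reuse(size_t npages, int *zero_after)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = npages * page;
	unsigned char *p;
	size_t off;

	/* One change to the virtual address space. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	/* "malloc/touch": page faults populate the range. */
	memset(p, 0xaa, len);

	/* "free": give the pages back but keep the mapping itself. */
	if (madvise(p, len, MADV_DONTNEED) != 0) {
		munmap(p, len);
		return -1;
	}

	/* Reuse: touching the range faults in fresh zero-filled pages. */
	*zero_after = 1;
	for (off = 0; off < len; off += page)
		if (p[off] != 0)
			*zero_after = 0;

	munmap(p, len);
	return 0;
}
```

Calling `release_and_reuse(4, &z)` should leave `z` at 1 on Linux, since discarded private anonymous pages come back zero-filled on the next touch.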

2008-11-27 09:30:23

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Wed, Nov 26, 2008 at 11:57:24AM -0800, Mike Waychison wrote:
>> Nick Piggin wrote:
>>> On Tue, Nov 25, 2008 at 10:42:47AM -0800, Ying Han wrote:
>>>>>> The patch sets PF_FAULT_MAYRETRY in current->flags to indicate that
>>>>>> the caller can tolerate a retry in the filemap_fault call path.
>>>>>>
>>>>>> The benchmark mmaps a huge file and spawns 64 threads, each
>>>>>> faulting in pages in reverse order. The result shows an 8%
>>>>>> performance hit with the patch.
>>>>> I suspect we also want to see the cases where this change helps?
>>>> I am working on more benchmarks to show the performance improvement.
>>> Can't you share the actual improvement you see inside Google?
>>>
>>> Google must be doing something funky with threads, because both
>>> this patch and their new malloc allocator apparently were due to
>>> mmap_sem contention problems, right?
>> One of the big improvements we see with this patch is the ability to
>> read out files in /proc/pid much faster. Consider the following events:
>>
>> - an application has a high count of threads sleeping with
>> read_lock(mmap_sem) held in the fault path (on the order of hundreds).
>> - one of the threads in the application then blocks in
>> write_lock(mmap_sem) in the mmap()/munmap() paths
>> - now our monitoring software tries to read some of the /proc/pid files
>> and blocks behind the waiting writer due to the fairness of the rwsems.
>> This basically has to wait for all faults ahead of the reader to
>> terminate (and let go of the reader lock) and then the writer to have a
>> go at mmap_sem. This can take an extremely long time.
>>
>> This patch helps a lot in this case as it keeps the writer from waiting
>> behind all the waiting readers, so it executes much faster.
>
> Hmm. How quantifiable is the benefit? Does it actually matter that you
> can read the proc file much faster? (this is for some automated workload
> management daemon or something, right?)

Correct. I don't recall the numbers from the pathological cases we were
seeing, but iirc, it was on the order of 10s of seconds, likely
exacerbated by slower than usual disks. I've been digging through my
inbox to find numbers without much success -- we've been using a variant
of this patch since 2.6.11.

Török however identified mmap taking on the order of several
milliseconds due to this exact problem:

http://lkml.org/lkml/2008/9/12/185

>
> Would it be possible to reduce mmap()/munmap() activity? eg. if it is
> due to a heap memory allocator, then perhaps do more batching or set
> some hysteresis.

I know our tcmalloc team had made great strides to reduce mmap_sem
contention for the heap, but there are various other bits of the stack
that really want to mmap files..

We generally try to avoid such things, but sometimes it a) can't be
easily avoided (third party libraries for instance) and b) when it hits
us, it affects the overall health of the machine/cluster (the monitoring
daemons get blocked, which isn't very healthy).

>
>
>>> That was before the kernel and glibc got together to fix the stupid
>>> mmap_sem problem in malloc (shown up in that FreeBSD MySQL thread);
>>> and before private futexes. I would be interested to know if Google
>>> still has problems that require this patch...
>>>
>> I'm not very familiar with the 'malloc' problem in glibc. Was this just
>> overhead in heap growth/shrinkage causing problems?
>
> As far as I understand, glibc does actually separate notions of allocated
> virtual memory and allocated pages. By that I mean that it is very careful to
> return pages back to the system when they are unused, but it does try to
> batch up changes to the virtual address space. Unfortunately, it used to
> return pages by doing a mmap call with PROT_NONE, then to start using
> that virtual memory again, it would mmap with PROT_READ|PROT_WRITE.
>
> This meant that a malloc/touch/free would look like this:
> mmap <- down_write(mmap_sem)
> page fault <- down_read(mmap_sem)
> mmap <- down_write(mmap_sem)
>
> What we did was to make it use madvise(MADV_DONTNEED) to throw the pages
> away. Then the kernel was changed to implement MADV_DONTNEED using only
> down_read. Then the same sequence is just this:
>
> page fault <- down_read(mmap_sem)
> madvise <- down_read(mmap_sem)
>
> (because changes to virtual memory allocation are batched).
>
> I thought google's malloc was primarily written to fix this bad behaviour,
> because with the new behaviour, glibc's malloc seems to beat google's
> malloc on the performance and scalability tests I was running at the time,
> as well as being more memory footprint friendly.
>
> OTOH, it would be possible that hysteresis watermarks in glibc are not big
> enough for a given application, which could introduce virtual address
> activity back into the workload. These can be tuned, however.
>

2008-11-27 10:00:25

by Peter Zijlstra

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, 2008-11-27 at 01:28 -0800, Mike Waychison wrote:

> Correct. I don't recall the numbers from the pathological cases we were
> seeing, but iirc, it was on the order of 10s of seconds, likely
> exacerbated by slower than usual disks. I've been digging through my
> inbox to find numbers without much success -- we've been using a variant
> of this patch since 2.6.11.

> We generally try to avoid such things, but sometimes it a) can't be
> easily avoided (third party libraries for instance) and b) when it hits
> us, it affects the overall health of the machine/cluster (the monitoring
> daemons get blocked, which isn't very healthy).

If it's only monitoring, there might be another solution: keep
the required data in a separate (approximate) copy so that you don't
need mmap_sem at all to show it.

If your mmap_sem is so contended your latencies are unacceptable, adding
more users to it - even statistics gathering, just isn't going to cure
the situation.

Furthermore, /proc code usually isn't written with performance in mind,
so it's usually simple and robust code. Adding it to a 'hot'-path like
you're doing doesn't seem advisable.

Also, releasing and re-acquiring mmap_sem can significantly add to the
cacheline bouncing that thing already has.

2008-11-27 10:14:47

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 11:00:07AM +0100, Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 01:28 -0800, Mike Waychison wrote:
>
> > Correct. I don't recall the numbers from the pathological cases we were
> > seeing, but iirc, it was on the order of 10s of seconds, likely
> > exacerbated by slower than usual disks. I've been digging through my
> > inbox to find numbers without much success -- we've been using a variant
> > of this patch since 2.6.11.
>
> > We generally try to avoid such things, but sometimes it a) can't be
> > easily avoided (third party libraries for instance) and b) when it hits
> > us, it affects the overall health of the machine/cluster (the monitoring
> > daemons get blocked, which isn't very healthy).
>
> If it's only monitoring, there might be another solution: keep
> the required data in a separate (approximate) copy so that you don't
> need mmap_sem at all to show it.
>
> If your mmap_sem is so contended your latencies are unacceptable, adding
> more users to it - even statistics gathering, just isn't going to cure
> the situation.
>
> Furthermore, /proc code usually isn't written with performance in mind,
> so it's usually simple and robust code. Adding it to a 'hot'-path like
> you're doing doesn't seem advisable.
>
> Also, releasing and re-acquiring mmap_sem can significantly add to the
> cacheline bouncing that thing already has.

Yes, it would be nice to reduce mmap_sem load regardless of any other
fixes or problems. I guess they're not very worried about cacheline
bouncing but more about hold time (how many sockets in these systems?
4 at most?)

I guess it is the pagemap stuff that they use most heavily?

pagemap_read looks like it can use get_user_pages_fast. The smaps and
clear_refs stuff might have been nicer if they could work on ranges
like pagemap. Then they could avoid mmap_sem as well (although maps
would need to be sampled and take mmap_sem I guess).

One problem with dropping mmap_sem is that it hurts priority/fairness.
And it opens a bit of a forward progress hole AFAIKS (maybe theoretical,
but not something to completely ignore). If mmap_sem is very heavily
contended, then the refault is going to take a while to get through,
and then the page might get reclaimed, etc.


2008-11-27 11:08:45

by KOSAKI Motohiro

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

> Furthermore, /proc code usually isn't written with performance in mind,
> so it's usually simple and robust code. Adding it to a 'hot'-path like
> you're doing doesn't seem advisable.
>
> Also, releasing and re-acquiring mmap_sem can significantly add to the
> cacheline bouncing that thing already has.

Interesting.

I tried to demonstrate /proc slowness.

1. create many processes
$ nice ./hackbench 120 process 10000

2. read /proc
$ time ps -ef

0.16s user 0.57s system 1% cpu 46.859 total


HAHAHA!
That is much slower than I expected.


2008-11-27 11:40:23

by Török Edwin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-27 11:28, Mike Waychison wrote:
> Correct. I don't recall the numbers from the pathological cases we
> were seeing, but iirc, it was on the order of 10s of seconds, likely
> exacerbated by slower than usual disks. I've been digging through my
> inbox to find numbers without much success -- we've been using a
> variant of this patch since 2.6.11.
>
> Török however identified mmap taking on the order of several
> milliseconds due to this exact problem:
>
> http://lkml.org/lkml/2008/9/12/185


Hi,

Thanks for the patch. I just tested it on top of 2.6.28-rc6-tip, see
/proc/lock_stat output at the end.

Running my testcase shows no significant performance difference. What am
I doing wrong?

Number of CPUs: 4

Number of threads ->                 1             2             4             8            16
Kernel version                 read   mmap   read   mmap   read   mmap   read   mmap   read   mmap
2.6.27.6                      27.18  25.45  16.03  22.20  11.20  19.90   9.05  19.69   9.15  20.38
2.6.28-rc6-tip + faultretry   28.76  25.24  15.66  20.99  10.74  19.41   9.16  20.21   9.12  20.47

Running clamd:
Without patch: Time: 163.563 sec (2 m 43 s)
With patch: Time: 159.535 sec (2 m 39 s)

Diff of mmap contention (stat_googlepatch is 2.6.28-rc6-tip + faultretry
patch, and stat_thrtest is 2.6.28-rc6-tip w/o patches)
--- /tmp/stat_thrtest 2008-11-27 13:33:51.302846222 +0200
+++ /tmp/stat_googlepatch 2008-11-27 13:33:38.782846981 +0200
@@ -3,1849 +3,32 @@
class name con-bounces
contentions waittime-min waittime-max waittime-total
acq-bounces acquisitions holdtime-min holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

- &mm->mmap_sem-W:  4642  7231  57.00  109595.18  33782959.93  7484  16093  2.41  2276.98  94821.51
- &mm->mmap_sem-R:  31172  58590  35.89  106151.31  234170024.06  39097  202237  2.37  205627.05  40584673.55
+ &mm->mmap_sem-W:  4631  7196  82.92  146246.85  33118519.70  7373  17160  2.52  2653.19  100180.33
+ &mm->mmap_sem-R:  32568  59632  25.40  180539.37  238784236.95  40753  214536  2.59  174742.01  42486255.59
---------------
- &mm->mmap_sem 3548
[<ffffffff80211171>] sys_mmap+0xf1/0x140
- &mm->mmap_sem 3585
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80
- &mm->mmap_sem 58585
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
- &mm->mmap_sem 47
[<ffffffff802110e6>] sys_mmap+0x66/0x140
+ &mm->mmap_sem 3546
[<ffffffff80211171>] sys_mmap+0xf1/0x140
+ &mm->mmap_sem 59631
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
+ &mm->mmap_sem 3586
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
+ &mm->mmap_sem 50
[<ffffffff802110e6>] sys_mmap+0x66/0x140
---------------
- &mm->mmap_sem 2667
[<ffffffff80211171>] sys_mmap+0xf1/0x140
- &mm->mmap_sem 1953
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80
- &mm->mmap_sem 61144
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
- &mm->mmap_sem 26
[<ffffffff802c1011>] sys_mprotect+0xf1/0x280
+ &mm->mmap_sem 3048
[<ffffffff80211171>] sys_mmap+0xf1/0x140
+ &mm->mmap_sem 61747
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
+ &mm->mmap_sem 1997
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
+ &mm->mmap_sem 34
[<ffffffff802110e6>] sys_mmap+0x66/0x140
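
One way to read the lock_stat summary rows is to divide waittime-total by contentions, which gives the average wait per contended acquisition. The Python sketch below does that for the `&mm->mmap_sem-R` rows of my testcase. The field order follows the lock_stat header printed in this message; the helper names (`parse_row`, `mean_wait`) are made up for the example, and the result is in whatever time unit lock_stat reports.

```python
# Column order taken from the lock_stat header in this message.
FIELDS = [
    "con_bounces", "contentions", "waittime_min", "waittime_max",
    "waittime_total", "acq_bounces", "acquisitions", "holdtime_min",
    "holdtime_max", "holdtime_total",
]


def parse_row(text):
    """Parse one lock_stat summary row, tolerating mail-client line wraps."""
    tokens = text.split()
    name = tokens[0].rstrip(":")
    values = [float(tok) for tok in tokens[1:]]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} numbers, got {len(values)}")
    return name, dict(zip(FIELDS, values))


def mean_wait(stats):
    """Average wait time per contention event."""
    return stats["waittime_total"] / stats["contentions"]


# Rows copied from the lock_stat output (2.6.28-rc6-tip without and with the patch).
UNPATCHED_R = """&mm->mmap_sem-R: 31172
58590 35.89 106151.31 234170024.06
39097 202237 2.37 205627.05 40584673.55"""
PATCHED_R = """&mm->mmap_sem-R: 32568
59632 25.40 180539.37 238784236.95
40753 214536 2.59 174742.01 42486255.59"""

for label, row in (("unpatched", UNPATCHED_R), ("patched", PATCHED_R)):
    name, stats = parse_row(row)
    print(f"{label:9s} {name}: {mean_wait(stats):.1f} per contention")
```

By this measure the average read-side wait is essentially the same with and without the patch, which matches the unchanged wall-clock results.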

=====================================
With the patch applied to 2.6.28-rc6-tip, and lockstat enabled:
=====================================

My testcase:
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&mm->mmap_sem-W:  4631  7196  82.92  146246.85  33118519.70  7373  17160  2.52  2653.19  100180.33
&mm->mmap_sem-R:  32568  59632  25.40  180539.37  238784236.95  40753  214536  2.59  174742.01  42486255.59
---------------
&mm->mmap_sem 3546
[<ffffffff80211171>] sys_mmap+0xf1/0x140
&mm->mmap_sem 59631
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
&mm->mmap_sem 3586
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
&mm->mmap_sem 50
[<ffffffff802110e6>] sys_mmap+0x66/0x140
---------------
&mm->mmap_sem 3048
[<ffffffff80211171>] sys_mmap+0xf1/0x140
&mm->mmap_sem 61747
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
&mm->mmap_sem 1997
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
&mm->mmap_sem 34
[<ffffffff802110e6>] sys_mmap+0x66/0x140

...............................................................................................................................................................................................

&sem->wait_lock:  26753  26905  0.56  89.52  42034.48  306909  1285225  0.37  97.94  288043.52
---------------
&sem->wait_lock 9948
[<ffffffff8043a8d3>] __up_read+0x23/0xc0
&sem->wait_lock 11976
[<ffffffff8043a600>] __down_read_trylock+0x20/0x60
&sem->wait_lock 285
[<ffffffff8043a768>] __up_write+0x28/0x170
&sem->wait_lock 438
[<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
---------------
&sem->wait_lock 9960
[<ffffffff8043a600>] __down_read_trylock+0x20/0x60
&sem->wait_lock 7511
[<ffffffff8043a8d3>] __up_read+0x23/0xc0
&sem->wait_lock 5342
[<ffffffff8043a768>] __up_write+0x28/0x170
&sem->wait_lock 403
[<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0

...............................................................................................................................................................................................

clamd:
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&mm->mmap_sem-W:  354386  529027  1.62  125013.00  413785798.68  429370  623695  1.35  16317.50  2514673.72
&mm->mmap_sem-R:  267963  585119  1.56  108510.12  374811350.59  459907  983775  1.74  41604.05  2087538.75
---------------
&mm->mmap_sem 249719
[<ffffffff802110e6>] sys_mmap+0x66/0x140
&mm->mmap_sem 524336
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
&mm->mmap_sem 263474
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
&mm->mmap_sem 1367
[<ffffffff802c1b31>] sys_mprotect+0xf1/0x280
---------------
&mm->mmap_sem 209337
[<ffffffff802110e6>] sys_mmap+0x66/0x140
&mm->mmap_sem 601714
[<ffffffff80229af4>] do_page_fault+0x194/0xaa0
&mm->mmap_sem 210647
[<ffffffff802bfff7>] sys_munmap+0x47/0x80
&mm->mmap_sem 1567
[<ffffffff802c1b31>] sys_mprotect+0xf1/0x280

...............................................................................................................................................................................................

&sem->wait_lock:  122700  126641  0.42  77.94  125372.37  1779026  7368894  0.27  1099.42  3085559.16
---------------
&sem->wait_lock 5943
[<ffffffff8043a768>] __up_write+0x28/0x170
&sem->wait_lock 8615
[<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
&sem->wait_lock 13568
[<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
&sem->wait_lock 49377
[<ffffffff8043a600>] __down_read_trylock+0x20/0x60
---------------
&sem->wait_lock 8097
[<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
&sem->wait_lock 31540
[<ffffffff8043a768>] __up_write+0x28/0x170
&sem->wait_lock 5501
[<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
&sem->wait_lock 33342
[<ffffffff8043a600>] __down_read_trylock+0x20/0x60

...............................................................................................................................................................................................

dcache_lock:  29384  30333  0.39  310.56  15303.87  586666  4963875  0.29  4306.42  1337134.02
-----------
dcache_lock 529
[<ffffffff802e52eb>] dentry_unhash+0x3b/0xa0
dcache_lock 2490
[<ffffffff802f02c0>] d_delete+0x30/0x100
dcache_lock 6437
[<ffffffff80435240>] _atomic_dec_and_lock+0x60/0x80
dcache_lock 4932
[<ffffffff802ef15e>] d_instantiate+0x2e/0x60
-----------
dcache_lock 2446
[<ffffffff802f02c0>] d_delete+0x30/0x100
dcache_lock 598
[<ffffffff802fb8dd>] simple_empty+0x1d/0x90
dcache_lock 3258
[<ffffffff802ed91a>] d_rehash+0x2a/0x60
dcache_lock 4287
[<ffffffff802f0622>] d_alloc+0x162/0x210

=====================================
With the original 2.6.28-rc6-tip, and lockstat enabled:
=====================================
My testcase:
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&mm->mmap_sem-W:  4642  7231  57.00  109595.18  33782959.93  7484  16093  2.41  2276.98  94821.51
&mm->mmap_sem-R:  31172  58590  35.89  106151.31  234170024.06  39097  202237  2.37  205627.05  40584673.55
---------------
&mm->mmap_sem 3548
[<ffffffff80211171>] sys_mmap+0xf1/0x140
&mm->mmap_sem 3585
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80
&mm->mmap_sem 58585
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
&mm->mmap_sem 47
[<ffffffff802110e6>] sys_mmap+0x66/0x140
---------------
&mm->mmap_sem 2667
[<ffffffff80211171>] sys_mmap+0xf1/0x140
&mm->mmap_sem 1953
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80
&mm->mmap_sem 61144
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
&mm->mmap_sem 26
[<ffffffff802c1011>] sys_mprotect+0xf1/0x280

...............................................................................................................................................................................................

&sem->wait_lock:  28621  28702  0.56  93.20  43699.77  314546  1251660  0.37  101.52  288756.82
---------------
&sem->wait_lock 12897
[<ffffffff80439ae0>] __down_read_trylock+0x20/0x60
&sem->wait_lock 10819
[<ffffffff80439db3>] __up_read+0x23/0xc0
&sem->wait_lock 3698
[<ffffffff805cd95c>] __down_read+0x1c/0xbc
&sem->wait_lock 356
[<ffffffff80439c48>] __up_write+0x28/0x170
---------------
&sem->wait_lock 8522
[<ffffffff80439db3>] __up_read+0x23/0xc0
&sem->wait_lock 10661
[<ffffffff80439ae0>] __down_read_trylock+0x20/0x60
&sem->wait_lock 5407
[<ffffffff80439c48>] __up_write+0x28/0x170
&sem->wait_lock 3253
[<ffffffff805cd95c>] __down_read+0x1c/0xbc

clamd:
lock_stat version 0.3
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
class name  con-bounces  contentions  waittime-min  waittime-max  waittime-total  acq-bounces  acquisitions  holdtime-min  holdtime-max  holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

&mm->mmap_sem-W:  356390  538631  1.60  132281.47  451534654.86  428029  620900  1.49  15407.85  2496161.71
&mm->mmap_sem-R:  250235  563157  1.64  131147.55  375041110.73  429837  896476  1.87  54909.72  2644784.48
---------------
&mm->mmap_sem 254006
[<ffffffff802110e6>] sys_mmap+0x66/0x140
&mm->mmap_sem 512446
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
&mm->mmap_sem 268819
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80
&mm->mmap_sem 553
[<ffffffff802c1011>] sys_mprotect+0xf1/0x280
---------------
&mm->mmap_sem 734
[<ffffffff802c1011>] sys_mprotect+0xf1/0x280
&mm->mmap_sem 215625
[<ffffffff802110e6>] sys_mmap+0x66/0x140
&mm->mmap_sem 586901
[<ffffffff805d10d5>] do_page_fault+0x175/0xaa0
&mm->mmap_sem 219682
[<ffffffff802bf4d7>] sys_munmap+0x47/0x80

...............................................................................................................................................................................................

&sem->wait_lock:  120220  123433  0.39  68.17  124965.88  1723668  7164023  0.28  2373.48  3112477.10
---------------
&sem->wait_lock 48491
[<ffffffff80439ae0>] __down_read_trylock+0x20/0x60
&sem->wait_lock 5904
[<ffffffff80439c48>] __up_write+0x28/0x170
&sem->wait_lock 18958
[<ffffffff805cd95c>] __down_read+0x1c/0xbc
&sem->wait_lock 28307
[<ffffffff80439db3>] __up_read+0x23/0xc0
---------------
&sem->wait_lock 34432
[<ffffffff80439db3>] __up_read+0x23/0xc0
&sem->wait_lock 32236
[<ffffffff80439ae0>] __down_read_trylock+0x20/0x60
&sem->wait_lock 32154
[<ffffffff80439c48>] __up_write+0x28/0x170
&sem->wait_lock 5436
[<ffffffff805cd88c>] __down_write_nested+0x1c/0xc0

...............................................................................................................................................................................................

&rq->lock:  29950  30035  0.45  31.72  25917.28  1851965  5875683  0.20  4648.46  3721908.72
---------
&rq->lock 17727
[<ffffffff802332b4>] task_rq_lock+0x54/0xa0
&rq->lock 11937
[<ffffffff805ca689>] schedule+0x159/0x4da
&rq->lock 257
[<ffffffff8023fd60>] scheduler_tick+0x50/0x230
&rq->lock 19
[<ffffffff80237cd5>] double_rq_lock+0x75/0xa0
---------
&rq->lock 15957
[<ffffffff805ca689>] schedule+0x159/0x4da
&rq->lock 13660
[<ffffffff802332b4>] task_rq_lock+0x54/0xa0
&rq->lock 95
[<ffffffff8023fd60>] scheduler_tick+0x50/0x230
&rq->lock 178
[<ffffffff80237c8d>] double_rq_lock+0x2d/0xa0


======================
My .config
======================
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.28-rc6
# Thu Nov 27 12:48:05 2008
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_FAST_CMPXCHG_LOCAL=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
CONFIG_ARCH_HAS_CPU_IDLE_WAIT=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_DEFAULT_IDLE=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_HAVE_CPUMASK_OF_CPU_MAP=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ZONE_DMA32=y
CONFIG_ARCH_POPULATES_NODE_MAP=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_X86_SMP=y
CONFIG_USE_GENERIC_SMP_HELPERS=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_X86_BIOS_REBOOT=y
CONFIG_X86_TRAMPOLINE=y
# CONFIG_KTIME_SCALAR is not set
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_TREE=y
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
# CONFIG_CGROUP_FREEZER is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_RELAY=y
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
# CONFIG_COMPAT_BRK is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_PCI_QUIRKS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_MARKERS=y
CONFIG_OPROFILE=m
# CONFIG_OPROFILE_IBS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_BLK_DEV_BSG is not set
# CONFIG_BLK_DEV_INTEGRITY is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_CLASSIC_RCU=y
# CONFIG_FREEZER is not set

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_FIND_SMP_CONFIG=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_VSMP is not set
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_PARAVIRT_GUEST is not set
# CONFIG_MEMTEST is not set
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
# CONFIG_MPENTIUMM is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_MPSC is not set
CONFIG_MCORE2=y
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_CPU=y
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_INTERNODE_CACHE_BYTES=64
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_P6_NOP=y
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR_64=y
CONFIG_X86_DS=y
CONFIG_X86_PTRACE_BTS=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
# CONFIG_AMD_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
CONFIG_NR_CPUS=32
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
# CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS is not set
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=m
CONFIG_MICROCODE_INTEL=y
# CONFIG_MICROCODE_AMD is not set
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
# CONFIG_NUMA is not set
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_UNEVICTABLE_LRU=y
# CONFIG_X86_CHECK_BIOS_CORRUPTION is not set
CONFIG_X86_RESERVE_LOW_64K=y
CONFIG_MTRR=y
# CONFIG_MTRR_SANITIZER is not set
CONFIG_X86_PAT=y
# CONFIG_EFI is not set
CONFIG_SECCOMP=y
# CONFIG_CC_STACKPROTECTOR is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
# CONFIG_CRASH_DUMP is not set
CONFIG_PHYSICAL_START=0x200000
# CONFIG_RELOCATABLE is not set
CONFIG_PHYSICAL_ALIGN=0x200000
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT_VDSO=y
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y

#
# Power management and ACPI options
#
CONFIG_PM=y
# CONFIG_PM_DEBUG is not set
# CONFIG_SUSPEND is not set
# CONFIG_HIBERNATION is not set
CONFIG_ACPI=y
CONFIG_ACPI_PROCFS=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_SYSFS_POWER=y
CONFIG_ACPI_PROC_EVENT=y
CONFIG_ACPI_AC=m
CONFIG_ACPI_BATTERY=m
CONFIG_ACPI_BUTTON=m
CONFIG_ACPI_FAN=m
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=m
CONFIG_ACPI_HOTPLUG_CPU=y
CONFIG_ACPI_THERMAL=m
CONFIG_ACPI_WMI=m
CONFIG_ACPI_ASUS=m
CONFIG_ACPI_TOSHIBA=m
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ACPI_BLACKLIST_YEAR=0
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_ACPI_SYSTEM=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=m
CONFIG_ACPI_SBS=m

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_TABLE=m
# CONFIG_CPU_FREQ_DEBUG is not set
CONFIG_CPU_FREQ_STAT=m
# CONFIG_CPU_FREQ_STAT_DETAILS is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=m
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m

#
# CPUFreq processor drivers
#
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_POWERNOW_K8=m
CONFIG_X86_POWERNOW_K8_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO=m
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_ACPI_CPUFREQ_PROC_INTF is not set
# CONFIG_X86_SPEEDSTEP_LIB is not set
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_DMAR is not set
# CONFIG_INTR_REMAP is not set
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_PCIEASPM is not set
CONFIG_ARCH_SUPPORTS_MSI=y
CONFIG_PCI_MSI=y
CONFIG_PCI_LEGACY=y
# CONFIG_PCI_DEBUG is not set
CONFIG_HT_IRQ=y
CONFIG_ISA_DMA_API=y
CONFIG_K8_NB=y
# CONFIG_PCCARD is not set
# CONFIG_HOTPLUG_PCI is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_IA32_EMULATION=y
CONFIG_IA32_AOUT=m
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
CONFIG_XFRM_IPCOMP=m
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_ASK_IP_FIB_HASH=y
# CONFIG_IP_FIB_TRIE is not set
CONFIG_IP_FIB_HASH=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
# CONFIG_ARPD is not set
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=m
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_LRO is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
CONFIG_TCP_CONG_BIC=m
CONFIG_TCP_CONG_CUBIC=y
CONFIG_TCP_CONG_WESTWOOD=m
CONFIG_TCP_CONG_HTCP=m
CONFIG_TCP_CONG_HSTCP=m
CONFIG_TCP_CONG_HYBLA=m
CONFIG_TCP_CONG_VEGAS=m
CONFIG_TCP_CONG_SCALABLE=m
CONFIG_TCP_CONG_LP=m
CONFIG_TCP_CONG_VENO=m
CONFIG_TCP_CONG_YEAH=m
CONFIG_TCP_CONG_ILLINOIS=m
# CONFIG_DEFAULT_BIC is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_HTCP is not set
# CONFIG_DEFAULT_VEGAS is not set
# CONFIG_DEFAULT_WESTWOOD is not set
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=m
CONFIG_IPV6_PRIVACY=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
CONFIG_INET6_AH=m
CONFIG_INET6_ESP=m
CONFIG_INET6_IPCOMP=m
CONFIG_IPV6_MIP6=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION=m
CONFIG_IPV6_SIT=m
CONFIG_IPV6_NDISC_NODETYPE=y
CONFIG_IPV6_TUNNEL=m
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_PIMSM_V2=y
# CONFIG_NETLABEL is not set
CONFIG_NETWORK_SECMARK=y
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
CONFIG_NF_CT_ACCT=y
CONFIG_NF_CONNTRACK_MARK=y
CONFIG_NF_CONNTRACK_SECMARK=y
CONFIG_NF_CONNTRACK_EVENTS=y
CONFIG_NF_CT_PROTO_DCCP=m
CONFIG_NF_CT_PROTO_GRE=m
CONFIG_NF_CT_PROTO_SCTP=m
CONFIG_NF_CT_PROTO_UDPLITE=m
CONFIG_NF_CONNTRACK_AMANDA=m
CONFIG_NF_CONNTRACK_FTP=m
CONFIG_NF_CONNTRACK_H323=m
CONFIG_NF_CONNTRACK_IRC=m
CONFIG_NF_CONNTRACK_NETBIOS_NS=m
CONFIG_NF_CONNTRACK_PPTP=m
CONFIG_NF_CONNTRACK_SANE=m
CONFIG_NF_CONNTRACK_SIP=m
CONFIG_NF_CONNTRACK_TFTP=m
CONFIG_NF_CT_NETLINK=m
# CONFIG_NETFILTER_TPROXY is not set
CONFIG_NETFILTER_XTABLES=m
CONFIG_NETFILTER_XT_TARGET_CLASSIFY=m
CONFIG_NETFILTER_XT_TARGET_CONNMARK=m
CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=m
CONFIG_NETFILTER_XT_TARGET_DSCP=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_NFLOG=m
CONFIG_NETFILTER_XT_TARGET_NFQUEUE=m
CONFIG_NETFILTER_XT_TARGET_NOTRACK=m
CONFIG_NETFILTER_XT_TARGET_RATEEST=m
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_NETFILTER_XT_TARGET_SECMARK=m
CONFIG_NETFILTER_XT_TARGET_TCPMSS=m
CONFIG_NETFILTER_XT_TARGET_TCPOPTSTRIP=m
CONFIG_NETFILTER_XT_MATCH_COMMENT=m
CONFIG_NETFILTER_XT_MATCH_CONNBYTES=m
CONFIG_NETFILTER_XT_MATCH_CONNLIMIT=m
CONFIG_NETFILTER_XT_MATCH_CONNMARK=m
CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m
CONFIG_NETFILTER_XT_MATCH_DCCP=m
CONFIG_NETFILTER_XT_MATCH_DSCP=m
CONFIG_NETFILTER_XT_MATCH_ESP=m
CONFIG_NETFILTER_XT_MATCH_HASHLIMIT=m
CONFIG_NETFILTER_XT_MATCH_HELPER=m
CONFIG_NETFILTER_XT_MATCH_IPRANGE=m
CONFIG_NETFILTER_XT_MATCH_LENGTH=m
CONFIG_NETFILTER_XT_MATCH_LIMIT=m
CONFIG_NETFILTER_XT_MATCH_MAC=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m
CONFIG_NETFILTER_XT_MATCH_OWNER=m
CONFIG_NETFILTER_XT_MATCH_POLICY=m
CONFIG_NETFILTER_XT_MATCH_PKTTYPE=m
CONFIG_NETFILTER_XT_MATCH_QUOTA=m
CONFIG_NETFILTER_XT_MATCH_RATEEST=m
CONFIG_NETFILTER_XT_MATCH_REALM=m
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
CONFIG_NETFILTER_XT_MATCH_SCTP=m
CONFIG_NETFILTER_XT_MATCH_STATE=m
CONFIG_NETFILTER_XT_MATCH_STATISTIC=m
CONFIG_NETFILTER_XT_MATCH_STRING=m
CONFIG_NETFILTER_XT_MATCH_TCPMSS=m
CONFIG_NETFILTER_XT_MATCH_TIME=m
CONFIG_NETFILTER_XT_MATCH_U32=m
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
CONFIG_NF_DEFRAG_IPV4=m
CONFIG_NF_CONNTRACK_IPV4=m
CONFIG_NF_CONNTRACK_PROC_COMPAT=y
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_ADDRTYPE=m
CONFIG_IP_NF_MATCH_AH=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_NF_NAT=m
CONFIG_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_NETMAP=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_NF_NAT_SNMP_BASIC=m
CONFIG_NF_NAT_PROTO_DCCP=m
CONFIG_NF_NAT_PROTO_GRE=m
CONFIG_NF_NAT_PROTO_UDPLITE=m
CONFIG_NF_NAT_PROTO_SCTP=m
CONFIG_NF_NAT_FTP=m
CONFIG_NF_NAT_IRC=m
CONFIG_NF_NAT_TFTP=m
CONFIG_NF_NAT_AMANDA=m
CONFIG_NF_NAT_PPTP=m
CONFIG_NF_NAT_H323=m
CONFIG_NF_NAT_SIP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_CLUSTERIP=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_TTL=m
CONFIG_IP_NF_RAW=m
# CONFIG_IP_NF_SECURITY is not set
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m

#
# IPv6: Netfilter Configuration
#
CONFIG_NF_CONNTRACK_IPV6=m
CONFIG_IP6_NF_QUEUE=m
CONFIG_IP6_NF_IPTABLES=m
CONFIG_IP6_NF_MATCH_AH=m
CONFIG_IP6_NF_MATCH_EUI64=m
CONFIG_IP6_NF_MATCH_FRAG=m
CONFIG_IP6_NF_MATCH_OPTS=m
CONFIG_IP6_NF_MATCH_HL=m
CONFIG_IP6_NF_MATCH_IPV6HEADER=m
CONFIG_IP6_NF_MATCH_MH=m
CONFIG_IP6_NF_MATCH_RT=m
CONFIG_IP6_NF_TARGET_LOG=m
CONFIG_IP6_NF_FILTER=m
CONFIG_IP6_NF_TARGET_REJECT=m
CONFIG_IP6_NF_MANGLE=m
CONFIG_IP6_NF_TARGET_HL=m
CONFIG_IP6_NF_RAW=m
# CONFIG_IP6_NF_SECURITY is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
# CONFIG_NET_SCHED is not set
CONFIG_NET_CLS_ROUTE=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_HAMRADIO is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
# CONFIG_BT is not set
CONFIG_AF_RXRPC=m
# CONFIG_AF_RXRPC_DEBUG is not set
# CONFIG_RXKAD is not set
# CONFIG_PHONET is not set
CONFIG_FIB_RULES=y
# CONFIG_WIRELESS is not set
CONFIG_RFKILL=m
CONFIG_RFKILL_INPUT=m
CONFIG_RFKILL_LEDS=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_CONNECTOR is not set
# CONFIG_MTD is not set
# CONFIG_PARPORT is not set
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_CPQ_DA=m
# CONFIG_BLK_CPQ_CISS_DA is not set
CONFIG_BLK_DEV_DAC960=m
CONFIG_BLK_DEV_UMEM=m
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_CRYPTOLOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_SX8=m
# CONFIG_BLK_DEV_UB is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_BLK_DEV_XIP is not set
CONFIG_CDROM_PKTCDVD=m
CONFIG_CDROM_PKTCDVD_BUFFERS=8
# CONFIG_CDROM_PKTCDVD_WCACHE is not set
CONFIG_ATA_OVER_ETH=m
# CONFIG_BLK_DEV_HD is not set
CONFIG_MISC_DEVICES=y
CONFIG_IBM_ASM=m
CONFIG_PHANTOM=m
CONFIG_EEPROM_93CX6=m
CONFIG_SGI_IOC4=m
CONFIG_TIFM_CORE=m
CONFIG_TIFM_7XX1=m
CONFIG_ACER_WMI=m
CONFIG_ASUS_LAPTOP=m
CONFIG_FUJITSU_LAPTOP=m
# CONFIG_FUJITSU_LAPTOP_DEBUG is not set
# CONFIG_HP_WMI is not set
# CONFIG_ICS932S401 is not set
CONFIG_MSI_LAPTOP=m
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
CONFIG_SONY_LAPTOP=m
CONFIG_SONYPI_COMPAT=y
CONFIG_THINKPAD_ACPI=m
# CONFIG_THINKPAD_ACPI_DEBUG is not set
CONFIG_THINKPAD_ACPI_BAY=y
CONFIG_THINKPAD_ACPI_VIDEO=y
CONFIG_THINKPAD_ACPI_HOTKEY_POLL=y
# CONFIG_INTEL_MENLOW is not set
CONFIG_EEEPC_LAPTOP=m
CONFIG_ENCLOSURE_SERVICES=m
# CONFIG_SGI_XP is not set
CONFIG_HP_ILO=m
# CONFIG_SGI_GRU is not set
# CONFIG_C2PORT is not set
CONFIG_HAVE_IDE=y
CONFIG_IDE=y

#
# Please see Documentation/ide/ide.txt for help/info on IDE drives
#
# CONFIG_BLK_DEV_IDE_SATA is not set
CONFIG_IDE_GD=y
CONFIG_IDE_GD_ATA=y
# CONFIG_IDE_GD_ATAPI is not set
# CONFIG_BLK_DEV_IDECD is not set
# CONFIG_BLK_DEV_IDETAPE is not set
# CONFIG_BLK_DEV_IDESCSI is not set
# CONFIG_BLK_DEV_IDEACPI is not set
# CONFIG_IDE_TASK_IOCTL is not set
CONFIG_IDE_PROC_FS=y

#
# IDE chipset support/bugfixes
#
# CONFIG_IDE_GENERIC is not set
# CONFIG_BLK_DEV_PLATFORM is not set
# CONFIG_BLK_DEV_CMD640 is not set
# CONFIG_BLK_DEV_IDEPNP is not set

#
# PCI IDE chipsets support
#
# CONFIG_BLK_DEV_GENERIC is not set
# CONFIG_BLK_DEV_OPTI621 is not set
# CONFIG_BLK_DEV_RZ1000 is not set
# CONFIG_BLK_DEV_AEC62XX is not set
# CONFIG_BLK_DEV_ALI15X3 is not set
# CONFIG_BLK_DEV_AMD74XX is not set
# CONFIG_BLK_DEV_ATIIXP is not set
# CONFIG_BLK_DEV_CMD64X is not set
# CONFIG_BLK_DEV_TRIFLEX is not set
# CONFIG_BLK_DEV_CS5520 is not set
# CONFIG_BLK_DEV_CS5530 is not set
# CONFIG_BLK_DEV_HPT366 is not set
# CONFIG_BLK_DEV_JMICRON is not set
# CONFIG_BLK_DEV_SC1200 is not set
# CONFIG_BLK_DEV_PIIX is not set
# CONFIG_BLK_DEV_IT8213 is not set
# CONFIG_BLK_DEV_IT821X is not set
# CONFIG_BLK_DEV_NS87415 is not set
# CONFIG_BLK_DEV_PDC202XX_OLD is not set
# CONFIG_BLK_DEV_PDC202XX_NEW is not set
# CONFIG_BLK_DEV_SVWKS is not set
# CONFIG_BLK_DEV_SIIMAGE is not set
# CONFIG_BLK_DEV_SIS5513 is not set
# CONFIG_BLK_DEV_SLC90E66 is not set
# CONFIG_BLK_DEV_TRM290 is not set
# CONFIG_BLK_DEV_VIA82CXXX is not set
# CONFIG_BLK_DEV_TC86C001 is not set
# CONFIG_BLK_DEV_IDEDMA is not set

#
# SCSI device support
#
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
CONFIG_SCSI_TGT=m
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_PROC_FS is not set

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=m
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=m
CONFIG_CHR_DEV_SCH=m
CONFIG_SCSI_ENCLOSURE=m

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
# CONFIG_SCSI_LOWLEVEL is not set
# CONFIG_SCSI_DH is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_ACPI=y
CONFIG_SATA_PMP=y
CONFIG_SATA_AHCI=y
CONFIG_SATA_SIL24=y
CONFIG_ATA_SFF=y
# CONFIG_SATA_SVW is not set
CONFIG_ATA_PIIX=m
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SX4 is not set
CONFIG_SATA_SIL=y
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_PATA_ACPI is not set
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
CONFIG_ATA_GENERIC=m
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RZ1000 is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set
CONFIG_PATA_SCH=m
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
# CONFIG_MD_RAID456 is not set
CONFIG_MD_MULTIPATH=y
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_FUSION is not set

#
# IEEE 1394 (FireWire) support
#

#
# Enable only one of the two stacks, unless you know what you are doing
#
# CONFIG_FIREWIRE is not set
# CONFIG_IEEE1394 is not set
# CONFIG_I2O is not set
# CONFIG_MACINTOSH_DRIVERS is not set
CONFIG_NETDEVICES=y
# CONFIG_DUMMY is not set
# CONFIG_BONDING is not set
# CONFIG_MACVLAN is not set
# CONFIG_EQUALIZER is not set
CONFIG_TUN=y
# CONFIG_VETH is not set
# CONFIG_NET_SB1000 is not set
# CONFIG_ARCNET is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_MARVELL_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_ICPLUS_PHY is not set
CONFIG_REALTEK_PHY=y
# CONFIG_FIXED_PHY is not set
# CONFIG_MDIO_BITBANG is not set
CONFIG_NET_ETHERNET=y
CONFIG_MII=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NET_VENDOR_3COM is not set
# CONFIG_NET_TULIP is not set
# CONFIG_HP100 is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
# CONFIG_IBM_NEW_EMAC_NO_FLOW_CTRL is not set
# CONFIG_IBM_NEW_EMAC_MAL_CLR_ICINTSTAT is not set
# CONFIG_IBM_NEW_EMAC_MAL_COMMON_ERR is not set
# CONFIG_NET_PCI is not set
# CONFIG_B44 is not set
# CONFIG_ATL2 is not set
CONFIG_NETDEV_1000=y
# CONFIG_ACENIC is not set
# CONFIG_DL2K is not set
# CONFIG_E1000 is not set
# CONFIG_E1000E is not set
# CONFIG_IP1000 is not set
# CONFIG_IGB is not set
# CONFIG_NS83820 is not set
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_R8169=m
# CONFIG_SIS190 is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
# CONFIG_VIA_VELOCITY is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2 is not set
# CONFIG_QLA3XXX is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_JME is not set
# CONFIG_NETDEV_10000 is not set
# CONFIG_TR is not set

#
# Wireless LAN
#
# CONFIG_WLAN_PRE80211 is not set
# CONFIG_WLAN_80211 is not set
# CONFIG_IWLWIFI_LEDS is not set

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_WAN is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NET_FC is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_ISDN is not set
# CONFIG_PHONE is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
CONFIG_INPUT_POLLDEV=m

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
CONFIG_INPUT_JOYDEV=m
CONFIG_INPUT_EVDEV=m
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
CONFIG_KEYBOARD_ATKBD=y
CONFIG_KEYBOARD_SUNKBD=m
CONFIG_KEYBOARD_LKKBD=m
CONFIG_KEYBOARD_XTKBD=m
CONFIG_KEYBOARD_NEWTON=m
CONFIG_KEYBOARD_STOWAWAY=m
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=m
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
# CONFIG_MOUSE_PS2_ELANTECH is not set
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_SERIAL=m
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_INPUT_JOYSTICK is not set
# CONFIG_INPUT_TABLET is not set
# CONFIG_INPUT_TOUCHSCREEN is not set
# CONFIG_INPUT_MISC is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=m
# CONFIG_SERIO_CT82C710 is not set
CONFIG_SERIO_PCIPS2=m
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_HW_CONSOLE=y
# CONFIG_VT_HW_CONSOLE_BINDING is not set
CONFIG_DEVKMEM=y
# CONFIG_SERIAL_NONSTANDARD is not set
# CONFIG_NOZOMI is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y

#
# Non-8250 serial port support
#
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
# CONFIG_SERIAL_JSM is not set
CONFIG_UNIX98_PTYS=y
# CONFIG_LEGACY_PTYS is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=m
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
CONFIG_NVRAM=m
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_PC8736x_GPIO is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=m
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_CHARDEV=m
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=m
CONFIG_I2C_ALGOPCA=m

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
CONFIG_I2C_ALI1535=m
CONFIG_I2C_ALI1563=m
CONFIG_I2C_ALI15X3=m
CONFIG_I2C_AMD756=m
CONFIG_I2C_AMD8111=m
CONFIG_I2C_I801=m
# CONFIG_I2C_ISCH is not set
CONFIG_I2C_PIIX4=m
CONFIG_I2C_NFORCE2=m
CONFIG_I2C_SIS5595=m
CONFIG_I2C_SIS630=m
CONFIG_I2C_SIS96X=m
CONFIG_I2C_VIA=m
CONFIG_I2C_VIAPRO=m

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
CONFIG_I2C_OCORES=m
CONFIG_I2C_SIMTEC=m

#
# External I2C/SMBus adapter drivers
#
CONFIG_I2C_PARPORT_LIGHT=m
CONFIG_I2C_TAOS_EVM=m
# CONFIG_I2C_TINY_USB is not set

#
# Graphics adapter I2C/DDC channel drivers
#
CONFIG_I2C_VOODOO3=m

#
# Other I2C/SMBus bus drivers
#
CONFIG_I2C_PCA_PLATFORM=m
CONFIG_I2C_STUB=m

#
# Miscellaneous I2C Chip support
#
CONFIG_DS1682=m
# CONFIG_AT24 is not set
CONFIG_SENSORS_EEPROM=m
CONFIG_SENSORS_PCF8574=m
CONFIG_PCF8575=m
# CONFIG_SENSORS_PCA9539 is not set
CONFIG_SENSORS_PCF8591=m
CONFIG_SENSORS_MAX6875=m
CONFIG_SENSORS_TSL2550=m
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_I2C_DEBUG_CHIP is not set
# CONFIG_SPI is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
# CONFIG_GPIOLIB is not set
CONFIG_W1=m

#
# 1-wire Bus Masters
#
CONFIG_W1_MASTER_MATROX=m
# CONFIG_W1_MASTER_DS2490 is not set
CONFIG_W1_MASTER_DS2482=m

#
# 1-wire Slaves
#
CONFIG_W1_SLAVE_THERM=m
CONFIG_W1_SLAVE_SMEM=m
CONFIG_W1_SLAVE_DS2433=m
# CONFIG_W1_SLAVE_DS2433_CRC is not set
CONFIG_W1_SLAVE_DS2760=m
# CONFIG_W1_SLAVE_BQ27000 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
CONFIG_PDA_POWER=m
CONFIG_BATTERY_DS2760=m
# CONFIG_BATTERY_BQ27x00 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7473 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHER is not set
# CONFIG_SENSORS_FSCPOS is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_HWMON_DEBUG_CHIP is not set
CONFIG_THERMAL=m
CONFIG_THERMAL_HWMON=y
# CONFIG_WATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
CONFIG_SSB=m
CONFIG_SSB_SPROM=y
CONFIG_SSB_PCIHOST_POSSIBLE=y
CONFIG_SSB_PCIHOST=y
# CONFIG_SSB_B43_PCI_BRIDGE is not set
# CONFIG_SSB_DEBUG is not set
CONFIG_SSB_DRIVER_PCICORE_POSSIBLE=y
CONFIG_SSB_DRIVER_PCICORE=y

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_REGULATOR is not set

#
# Multimedia devices
#

#
# Multimedia core support
#
CONFIG_VIDEO_DEV=m
CONFIG_VIDEO_V4L2_COMMON=m
CONFIG_VIDEO_ALLOW_V4L1=y
CONFIG_VIDEO_V4L1_COMPAT=y
# CONFIG_DVB_CORE is not set
CONFIG_VIDEO_MEDIA=m

#
# Multimedia drivers
#
# CONFIG_MEDIA_ATTACH is not set
CONFIG_MEDIA_TUNER=m
# CONFIG_MEDIA_TUNER_CUSTOMIZE is not set
CONFIG_MEDIA_TUNER_SIMPLE=m
CONFIG_MEDIA_TUNER_TDA8290=m
CONFIG_MEDIA_TUNER_TDA9887=m
CONFIG_MEDIA_TUNER_TEA5761=m
CONFIG_MEDIA_TUNER_TEA5767=m
CONFIG_MEDIA_TUNER_MT20XX=m
CONFIG_MEDIA_TUNER_XC2028=m
CONFIG_MEDIA_TUNER_XC5000=m
CONFIG_VIDEO_V4L2=m
CONFIG_VIDEO_V4L1=m
CONFIG_VIDEO_CAPTURE_DRIVERS=y
# CONFIG_VIDEO_ADV_DEBUG is not set
# CONFIG_VIDEO_FIXED_MINOR_RANGES is not set
CONFIG_VIDEO_HELPER_CHIPS_AUTO=y
# CONFIG_VIDEO_VIVI is not set
# CONFIG_VIDEO_BT848 is not set
# CONFIG_VIDEO_CPIA is not set
# CONFIG_VIDEO_CPIA2 is not set
# CONFIG_VIDEO_SAA5246A is not set
# CONFIG_VIDEO_SAA5249 is not set
# CONFIG_VIDEO_STRADIS is not set
# CONFIG_VIDEO_ZORAN is not set
# CONFIG_VIDEO_MEYE is not set
# CONFIG_VIDEO_SAA7134 is not set
# CONFIG_VIDEO_MXB is not set
# CONFIG_VIDEO_HEXIUM_ORION is not set
# CONFIG_VIDEO_HEXIUM_GEMINI is not set
# CONFIG_VIDEO_IVTV is not set
# CONFIG_VIDEO_CAFE_CCIC is not set
# CONFIG_SOC_CAMERA is not set
CONFIG_V4L_USB_DRIVERS=y
# CONFIG_USB_VIDEO_CLASS is not set
CONFIG_USB_GSPCA=m
# CONFIG_USB_M5602 is not set
# CONFIG_USB_GSPCA_CONEX is not set
# CONFIG_USB_GSPCA_ETOMS is not set
# CONFIG_USB_GSPCA_FINEPIX is not set
# CONFIG_USB_GSPCA_MARS is not set
# CONFIG_USB_GSPCA_OV519 is not set
# CONFIG_USB_GSPCA_PAC207 is not set
# CONFIG_USB_GSPCA_PAC7311 is not set
# CONFIG_USB_GSPCA_SONIXB is not set
# CONFIG_USB_GSPCA_SONIXJ is not set
# CONFIG_USB_GSPCA_SPCA500 is not set
# CONFIG_USB_GSPCA_SPCA501 is not set
# CONFIG_USB_GSPCA_SPCA505 is not set
# CONFIG_USB_GSPCA_SPCA506 is not set
# CONFIG_USB_GSPCA_SPCA508 is not set
# CONFIG_USB_GSPCA_SPCA561 is not set
# CONFIG_USB_GSPCA_STK014 is not set
# CONFIG_USB_GSPCA_SUNPLUS is not set
# CONFIG_USB_GSPCA_T613 is not set
# CONFIG_USB_GSPCA_TV8532 is not set
# CONFIG_USB_GSPCA_VC032X is not set
# CONFIG_USB_GSPCA_ZC3XX is not set
# CONFIG_VIDEO_PVRUSB2 is not set
# CONFIG_VIDEO_EM28XX is not set
# CONFIG_VIDEO_USBVISION is not set
# CONFIG_USB_VICAM is not set
# CONFIG_USB_IBMCAM is not set
# CONFIG_USB_KONICAWC is not set
# CONFIG_USB_QUICKCAM_MESSENGER is not set
# CONFIG_USB_ET61X251 is not set
# CONFIG_VIDEO_OVCAMCHIP is not set
# CONFIG_USB_OV511 is not set
# CONFIG_USB_SE401 is not set
# CONFIG_USB_SN9C102 is not set
# CONFIG_USB_STV680 is not set
# CONFIG_USB_ZC0301 is not set
# CONFIG_USB_PWC is not set
# CONFIG_USB_ZR364XX is not set
# CONFIG_USB_STKWEBCAM is not set
# CONFIG_USB_S2255 is not set
# CONFIG_RADIO_ADAPTERS is not set
# CONFIG_DAB is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=m
CONFIG_AGP_SIS=m
CONFIG_AGP_VIA=m
# CONFIG_DRM is not set
# CONFIG_VGASTATE is not set
# CONFIG_VIDEO_OUTPUT_CONTROL is not set
# CONFIG_FB is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_CORGI is not set
# CONFIG_BACKLIGHT_PROGEAR is not set
# CONFIG_BACKLIGHT_MBP_NVIDIA is not set
# CONFIG_BACKLIGHT_SAHARA is not set

#
# Display device support
#
# CONFIG_DISPLAY_SUPPORT is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_SOUND=m
CONFIG_SOUND_OSS_CORE=y
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_SEQUENCER=m
# CONFIG_SND_SEQ_DUMMY is not set
CONFIG_SND_OSSEMUL=y
CONFIG_SND_MIXER_OSS=m
CONFIG_SND_PCM_OSS=m
CONFIG_SND_PCM_OSS_PLUGINS=y
CONFIG_SND_SEQUENCER_OSS=y
# CONFIG_SND_DYNAMIC_MINORS is not set
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_VIRMIDI is not set
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
# CONFIG_SND_ALS300 is not set
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=m
# CONFIG_SND_HDA_HWDEP is not set
# CONFIG_SND_HDA_INPUT_BEEP is not set
CONFIG_SND_HDA_CODEC_REALTEK=y
CONFIG_SND_HDA_CODEC_ANALOG=y
CONFIG_SND_HDA_CODEC_SIGMATEL=y
CONFIG_SND_HDA_CODEC_VIA=y
CONFIG_SND_HDA_CODEC_ATIHDMI=y
CONFIG_SND_HDA_CODEC_NVHDMI=y
CONFIG_SND_HDA_CODEC_CONEXANT=y
CONFIG_SND_HDA_CODEC_CMEDIA=y
CONFIG_SND_HDA_CODEC_SI3054=y
CONFIG_SND_HDA_GENERIC=y
# CONFIG_SND_HDA_POWER_SAVE is not set
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_HIFIER is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
# CONFIG_HID_SUPPORT is not set
CONFIG_USB_SUPPORT=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB_ARCH_HAS_OHCI=y
CONFIG_USB_ARCH_HAS_EHCI=y
CONFIG_USB=m
# CONFIG_USB_DEBUG is not set
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEVICEFS=y
CONFIG_USB_DEVICE_CLASS=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_SUSPEND is not set
# CONFIG_USB_OTG is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB is not set
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
# CONFIG_USB_EHCI_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
CONFIG_USB_OHCI_HCD=m
CONFIG_USB_OHCI_HCD_SSB=y
# CONFIG_USB_OHCI_BIG_ENDIAN_DESC is not set
# CONFIG_USB_OHCI_BIG_ENDIAN_MMIO is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_UHCI_HCD=m
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HWA_HCD is not set

#
# Enable Host or Gadget support to see Inventra options
#

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
CONFIG_USB_PRINTER=m
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may also be needed;
#

#
# see USB_STORAGE Help for more information
#
CONFIG_USB_STORAGE=m
# CONFIG_USB_STORAGE_DEBUG is not set
# CONFIG_USB_STORAGE_DATAFAB is not set
# CONFIG_USB_STORAGE_FREECOM is not set
# CONFIG_USB_STORAGE_ISD200 is not set
# CONFIG_USB_STORAGE_DPCM is not set
CONFIG_USB_STORAGE_USBAT=y
CONFIG_USB_STORAGE_SDDR09=y
CONFIG_USB_STORAGE_SDDR55=y
# CONFIG_USB_STORAGE_JUMPSHOT is not set
# CONFIG_USB_STORAGE_ALAUDA is not set
# CONFIG_USB_STORAGE_ONETOUCH is not set
# CONFIG_USB_STORAGE_KARMA is not set
# CONFIG_USB_STORAGE_CYPRESS_ATACB is not set
CONFIG_USB_LIBUSUAL=y

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set

#
# USB port drivers
#
# CONFIG_USB_SERIAL is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_BERRY_CHARGE is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_PHIDGET is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_VST is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=m

#
# LED drivers
#
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_HP_DISK is not set
CONFIG_LEDS_CLEVO_MAIL=m
# CONFIG_LEDS_PCA955X is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
CONFIG_LEDS_TRIGGER_TIMER=m
# CONFIG_LEDS_TRIGGER_IDE_DISK is not set
CONFIG_LEDS_TRIGGER_HEARTBEAT=m
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
CONFIG_LEDS_TRIGGER_DEFAULT_ON=m
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
# CONFIG_EDAC is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
CONFIG_RTC_DRV_DS1307=m
CONFIG_RTC_DRV_DS1374=m
CONFIG_RTC_DRV_DS1672=m
CONFIG_RTC_DRV_MAX6900=m
CONFIG_RTC_DRV_RS5C372=m
CONFIG_RTC_DRV_ISL1208=m
CONFIG_RTC_DRV_X1205=m
CONFIG_RTC_DRV_PCF8563=m
CONFIG_RTC_DRV_PCF8583=m
CONFIG_RTC_DRV_M41T80=m
# CONFIG_RTC_DRV_M41T80_WDT is not set
CONFIG_RTC_DRV_S35390A=m
CONFIG_RTC_DRV_FM3130=m
# CONFIG_RTC_DRV_RX8581 is not set

#
# SPI RTC drivers
#

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
CONFIG_RTC_DRV_DS1511=m
CONFIG_RTC_DRV_DS1553=m
CONFIG_RTC_DRV_DS1742=m
CONFIG_RTC_DRV_STK17TA8=m
CONFIG_RTC_DRV_M48T86=m
# CONFIG_RTC_DRV_M48T35 is not set
CONFIG_RTC_DRV_M48T59=m
# CONFIG_RTC_DRV_BQ4802 is not set
CONFIG_RTC_DRV_V3020=m

#
# on-CPU RTC drivers
#
# CONFIG_DMADEVICES is not set
CONFIG_UIO=m
# CONFIG_UIO_CIF is not set
# CONFIG_UIO_PDRV is not set
# CONFIG_UIO_PDRV_GENIRQ is not set
# CONFIG_UIO_SMX is not set
# CONFIG_UIO_SERCOS3 is not set
# CONFIG_STAGING is not set
CONFIG_STAGING_EXCLUDE_BUILD=y

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_ISCSI_IBFT_FIND is not set

#
# File systems
#
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
# CONFIG_EXT4_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
CONFIG_REISERFS_FS=y
# CONFIG_REISERFS_CHECK is not set
# CONFIG_REISERFS_PROC_INFO is not set
CONFIG_REISERFS_FS_XATTR=y
CONFIG_REISERFS_FS_POSIX_ACL=y
CONFIG_REISERFS_FS_SECURITY=y
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_FILE_LOCKING=y
CONFIG_XFS_FS=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
# CONFIG_XFS_DEBUG is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
CONFIG_PRINT_QUOTA_WARNING=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=m
CONFIG_QUOTACTL=y
CONFIG_AUTOFS_FS=m
CONFIG_AUTOFS4_FS=m
CONFIG_FUSE_FS=m
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_UDF_FS=m
CONFIG_UDF_NLS=y

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_CONFIGFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
CONFIG_CRAMFS=m
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_NETWORK_FILESYSTEMS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
# CONFIG_ACORN_PARTITION_CUMANA is not set
# CONFIG_ACORN_PARTITION_EESOX is not set
CONFIG_ACORN_PARTITION_ICS=y
# CONFIG_ACORN_PARTITION_ADFS is not set
# CONFIG_ACORN_PARTITION_POWERTEC is not set
CONFIG_ACORN_PARTITION_RISCIX=y
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
CONFIG_NLS=m
CONFIG_NLS_DEFAULT="utf8"
# CONFIG_NLS_CODEPAGE_437 is not set
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_PRINTK_TIME=y
CONFIG_ALLOW_WARNINGS=y
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_DEBUG_SHIRQ is not set
CONFIG_DETECT_SOFTLOCKUP=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
# CONFIG_PROVE_LOCKING is not set
CONFIG_LOCKDEP=y
CONFIG_LOCK_STAT=y
# CONFIG_DEBUG_LOCKDEP is not set
# CONFIG_DEBUG_SPINLOCK_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_VIRTUAL is not set
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
CONFIG_FRAME_POINTER=y
# CONFIG_BOOT_PRINTK_DELAY is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_RCU_CPU_STALL_DETECTOR is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
# CONFIG_SYSCTL_SYSCALL_CHECK is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_RING_BUFFER=y
CONFIG_TRACING=y

#
# Tracers
#
CONFIG_FUNCTION_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
# CONFIG_PREEMPT_TRACER is not set
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
# CONFIG_POWER_TRACER is not set
# CONFIG_STACK_TRACER is not set
# CONFIG_BTS_TRACER is not set
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DYNAMIC_PRINTK_DEBUG is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
# CONFIG_KGDB is not set
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_DEBUG_PER_CPU_MAPS is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
# CONFIG_DEBUG_NX_TEST is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_MMIOTRACE is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
# CONFIG_KMEMCHECK is not set
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
# CONFIG_OPTIMIZE_INLINING is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
CONFIG_SECURITY=y
# CONFIG_SECURITYFS is not set
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_NETWORK_XFRM=y
CONFIG_SECURITY_FILE_CAPABILITIES=y
CONFIG_SECURITY_DEFAULT_MMAP_MIN_ADDR=0
# CONFIG_SECURITY_SELINUX is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
# CONFIG_CRYPTO_FIPS is not set
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_GF128MUL=m
CONFIG_CRYPTO_NULL=m
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
CONFIG_CRYPTO_TEST=m

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=m
CONFIG_CRYPTO_GCM=m
CONFIG_CRYPTO_SEQIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=m
CONFIG_CRYPTO_CTR=m
CONFIG_CRYPTO_CTS=m
CONFIG_CRYPTO_ECB=m
CONFIG_CRYPTO_LRW=m
CONFIG_CRYPTO_PCBC=m
CONFIG_CRYPTO_XTS=m

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=y
CONFIG_CRYPTO_XCBC=m

#
# Digest
#
CONFIG_CRYPTO_CRC32C=m
# CONFIG_CRYPTO_CRC32C_INTEL is not set
CONFIG_CRYPTO_MD4=m
CONFIG_CRYPTO_MD5=y
CONFIG_CRYPTO_MICHAEL_MIC=m
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=m
CONFIG_CRYPTO_SHA256=m
CONFIG_CRYPTO_SHA512=m
CONFIG_CRYPTO_TGR192=m
CONFIG_CRYPTO_WP512=m

#
# Ciphers
#
CONFIG_CRYPTO_AES=m
CONFIG_CRYPTO_AES_X86_64=m
CONFIG_CRYPTO_ANUBIS=m
CONFIG_CRYPTO_ARC4=m
CONFIG_CRYPTO_BLOWFISH=m
CONFIG_CRYPTO_CAMELLIA=m
CONFIG_CRYPTO_CAST5=m
CONFIG_CRYPTO_CAST6=m
CONFIG_CRYPTO_DES=m
CONFIG_CRYPTO_FCRYPT=m
CONFIG_CRYPTO_KHAZAD=m
CONFIG_CRYPTO_SALSA20=m
CONFIG_CRYPTO_SALSA20_X86_64=m
CONFIG_CRYPTO_SEED=m
CONFIG_CRYPTO_SERPENT=m
CONFIG_CRYPTO_TEA=m
CONFIG_CRYPTO_TWOFISH=m
CONFIG_CRYPTO_TWOFISH_COMMON=m
CONFIG_CRYPTO_TWOFISH_X86_64=m

#
# Compression
#
CONFIG_CRYPTO_DEFLATE=m
CONFIG_CRYPTO_LZO=m

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_HIFN_795X=m
CONFIG_CRYPTO_DEV_HIFN_795X_RNG=y
CONFIG_HAVE_KVM=y
# CONFIG_VIRTUALIZATION is not set

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_FIND_NEXT_BIT=y
CONFIG_CRC_CCITT=m
CONFIG_CRC16=m
CONFIG_CRC_T10DIF=m
CONFIG_CRC_ITU_T=m
CONFIG_CRC32=y
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m
CONFIG_LZO_COMPRESS=m
CONFIG_LZO_DECOMPRESS=m
CONFIG_TEXTSEARCH=y
CONFIG_TEXTSEARCH_KMP=m
CONFIG_TEXTSEARCH_BM=m
CONFIG_TEXTSEARCH_FSM=m
CONFIG_PLIST=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y


Best regards,
--Edwin

2008-11-27 12:03:46

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 01:39:52PM +0200, Török Edwin wrote:
> On 2008-11-27 11:28, Mike Waychison wrote:
> > Correct. I don't recall the numbers from the pathological cases we
> > were seeing, but iirc, it was on the order of tens of seconds, likely
> > exacerbated by slower than usual disks. I've been digging through my
> > inbox to find numbers without much success -- we've been using a
> > variant of this patch since 2.6.11.
> >
> > Török however identified mmap taking on the order of several
> > milliseconds due to this exact problem:
> >
> > http://lkml.org/lkml/2008/9/12/185
>
>
> Hi,
>
> Thanks for the patch. I just tested it on top of 2.6.28-rc6-tip, see
> /proc/lock_stat output at the end.
>
> Running my testcase shows no significant performance difference. What am
> I doing wrong?

Software may just be doing a lot of mmap/munmap activity. threads +
mmap is never going to be pretty because it is always going to involve
broadcasting tlb flushes to other cores... Software writers shouldn't
be scared of using processes (possibly with some shared memory).
Actually, a lot of things get faster (like malloc, or file descriptor
operations) because locks aren't needed.

Despite common perception, processes are actually much *faster* than
threads when doing common operations like these. They are slightly slower
sometimes with things like creation and exit, or context switching, but
if you're doing huge numbers of those operations, then it is unlikely
to be a performance critical app... :)

(end rant; sorry, that may not have been helpful to your immediate problem,
but we need to be realistic in what complexity we are going to add where in
the kernel in order to speed things up. And we need to steer userspace
away from problems that are fundamentally hard and not going to get easier
with trends -- like virtual address activity with multiple threads)


> ...............................................................................................................................................................................................
>
> &sem->wait_lock: 122700 126641 0.42 77.94 125372.37 1779026 7368894 0.27 1099.42 3085559.16
> ---------------
> &sem->wait_lock 5943 [<ffffffff8043a768>] __up_write+0x28/0x170
> &sem->wait_lock 8615 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
> &sem->wait_lock 13568 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
> &sem->wait_lock 49377 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
> ---------------
> &sem->wait_lock 8097 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
> &sem->wait_lock 31540 [<ffffffff8043a768>] __up_write+0x28/0x170
> &sem->wait_lock 5501 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
> &sem->wait_lock 33342 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
>

Interesting. I have some (ancient) patches to make rwsems more scalable
under heavy load by reducing contention on this lock. They should really
have been merged... Not sure how much it would help, but if you're
interested in testing, I could dust them off.

2008-11-27 12:21:32

by Török Edwin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-27 14:03, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 01:39:52PM +0200, Török Edwin wrote:
>
>> On 2008-11-27 11:28, Mike Waychison wrote:
>>
>>> Correct. I don't recall the numbers from the pathological cases we
>>> were seeing, but iirc, it was on the order of tens of seconds, likely
>>> exacerbated by slower than usual disks. I've been digging through my
>>> inbox to find numbers without much success -- we've been using a
>>> variant of this patch since 2.6.11.
>>>
>>> Török however identified mmap taking on the order of several
>>> milliseconds due to this exact problem:
>>>
>>> http://lkml.org/lkml/2008/9/12/185
>>>
>> Hi,
>>
>> Thanks for the patch. I just tested it on top of 2.6.28-rc6-tip, see
>> /proc/lock_stat output at the end.
>>
>> Running my testcase shows no significant performance difference. What am
>> I doing wrong?
>>
>
> Software may just be doing a lot of mmap/munmap activity. threads +
> mmap is never going to be pretty because it is always going to involve
> broadcasting tlb flushes to other cores... Software writers shouldn't
> be scared of using processes (possibly with some shared memory).
>

It would be interesting to compare the performance of a threaded clamd,
and of a clamd that uses multiple processes.
Distributing tasks will be a bit more tricky, since it would need to use
IPC, instead of mutexes and condition variables.

> Actually, a lot of things get faster (like malloc, or file descriptor
> operations) because locks aren't needed.
>
> Despite common perception, processes are actually much *faster* than
> threads when doing common operations like these. They are slightly slower
> sometimes with things like creation and exit, or context switching, but
> if you're doing huge numbers of those operations, then it is unlikely
> to be a performance critical app... :)
>

How about distributing tasks to a set of worker threads: is the overhead
of using IPC instead of mutexes/cond variables acceptable?

> (end rant; sorry, that may not have been helpful to your immediate problem,
> but we need to be realistic in what complexity we are going to add where in
> the kernel in order to speed things up. And we need to steer userspace
> away from problems that are fundamentally hard and not going to get easier
> with trends -- like virtual address activity with multiple threads)
>

I understood that mmap() is not scalable, however look at
http://lkml.org/lkml/2008/9/12/185, even fopen/fdopen does
an (anonymous) mmap internally.
That does not affect performance that much, since the overhead of a
file-backed mmap + pagefaults is higher.
Rewriting libclamav to not use mmap() would take a significant amount of
time, however I will try to avoid using mmap()
in new code (and prefer pread/read).

Also clamd is a CPU bound application [given fast enough disks ;)] and
having to wait for mmap_sem prevents it from doing "real work".
Most of the time it reads files from /tmp, that should either be in the
page cache, or (in my case) they are always in RAM (I use tmpfs).

So mmapping and reading from these files does not involve disk I/O, yet
threads working with /tmp files still need to wait for disk I/O to
complete because they have to wait on mmap_sem (held by another thread).

>
>
>> ...............................................................................................................................................................................................
>>
>> &sem->wait_lock: 122700 126641 0.42 77.94 125372.37 1779026 7368894 0.27 1099.42 3085559.16
>> ---------------
>> &sem->wait_lock 5943 [<ffffffff8043a768>] __up_write+0x28/0x170
>> &sem->wait_lock 8615 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
>> &sem->wait_lock 13568 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
>> &sem->wait_lock 49377 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
>> ---------------
>> &sem->wait_lock 8097 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
>> &sem->wait_lock 31540 [<ffffffff8043a768>] __up_write+0x28/0x170
>> &sem->wait_lock 5501 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
>> &sem->wait_lock 33342 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
>>
>>
>
> Interesting. I have some (ancient) patches to make rwsems more scalable
> under heavy load by reducing contention on this lock. They should really
> have been merged... Not sure how much it would help, but if you're
> interested in testing, I could dust them off.

Sure, I can test patches (preferably against 2.6.28-rc6-tip).

Best regards,
--Edwin

2008-11-27 12:33:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, 2008-11-27 at 14:21 +0200, Török Edwin wrote:

> How about distributing tasks to a set of worker threads, is the
> overhead of using IPC instead of mutexes/cond variables acceptable?

Inter process pthread mutexes should be very fast in the latest kernels
as they'll avoid the mmap_sem by use of get_user_pages_fast().

Not sure if pthread condition variables also work inter-process.

2008-11-27 12:39:40

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 02:21:16PM +0200, Török Edwin wrote:
> On 2008-11-27 14:03, Nick Piggin wrote:
> >> Running my testcase shows no significant performance difference. What am
> >> I doing wrong?
> >>
> >
> > Software may just be doing a lot of mmap/munmap activity. threads +
> > mmap is never going to be pretty because it is always going to involve
> > broadcasting tlb flushes to other cores... Software writers shouldn't
> > be scared of using processes (possibly with some shared memory).
> >
>
> It would be interesting to compare the performance of a threaded clamd,
> and of a clamd that uses multiple processes.
> Distributing tasks will be a bit more tricky, since it would need to use
> IPC, instead of mutexes and condition variables.

Yes, although you could use PTHREAD_PROCESS_SHARED pthread mutexes on
the shared memory I believe (having never tried it myself).


> > Actually, a lot of things get faster (like malloc, or file descriptor
> > operations) because locks aren't needed.
> >
> > Despite common perception, processes are actually much *faster* than
> > threads when doing common operations like these. They are slightly slower
> > sometimes with things like creation and exit, or context switching, but
> > if you're doing huge numbers of those operations, then it is unlikely
> > to be a performance critical app... :)
> >
>
> > How about distributing tasks to a set of worker threads, is the overhead
> > of using IPC instead of mutexes/cond variables acceptable?

It is really going to depend on a lot of things: what is involved in
distributing tasks, how many cores there are, the cache/TLB architecture
of the system it is running on, etc.

You want to distribute as much work as possible while touching as
little memory as possible, in general.

But if you're distributing threads over cores, and shared caches are
physically tagged (which I think all x86 CPUs are), then you should
be able to have multiple processes operate on shared memory just as
efficiently as multiple threads I think.

And then you also get the advantages of reduced contention on other
shared locks and resources.


> > (end rant; sorry, that may not have been helpful to your immediate problem,
> > but we need to be realistic in what complexity we are going to add where in
> > the kernel in order to speed things up. And we need to steer userspace
> > away from problems that are fundamentally hard and not going to get easier
> > with trends -- like virtual address activity with multiple threads)
> >
>
> I understood that mmap() is not scalable, however look at
> http://lkml.org/lkml/2008/9/12/185, even fopen/fdopen does
> an (anonymous) mmap internally.

Well, I guess that would be all the more reason to avoid threads (and
things like fopen/fdopen fundamentally have to be synchronized between
threads regardless of whether they use mmap() or not, so you're going
to see a win on any OS avoiding threaded code that uses fopen/fdopen).


> That does not affect performance that much, since the overhead of a
> file-backed mmap + pagefaults is higher.
> Rewriting libclamav to not use mmap() would take a significant amount of
> time, however I will try to avoid using mmap()
> in new code (and prefer pread/read).
>
> Also clamd is a CPU bound application [given fast enough disks ;)] and
> having to wait for mmap_sem prevents it from doing "real work".
> Most of the time it reads files from /tmp, that should either be in the
> page cache, or (in my case) they are always in RAM (I use tmpfs).
>
> So mmapping and reading from these files does not involve disk I/O, yet
> threads working with /tmp files still need to wait for disk I/O to
> complete because they have to wait on mmap_sem (held by another thread).

Yeah, it's costly. Even if it didn't take mmap_sem, then it still
needs to broadcast TLB invalidates over the machine, so it would
probably go even faster if it weren't threaded and/or didn't use
mmap/munmap so heavily.


> >> ...............................................................................................................................................................................................
> >>
> >> &sem->wait_lock: 122700 126641 0.42 77.94 125372.37 1779026 7368894 0.27 1099.42 3085559.16
> >> ---------------
> >> &sem->wait_lock 5943 [<ffffffff8043a768>] __up_write+0x28/0x170
> >> &sem->wait_lock 8615 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
> >> &sem->wait_lock 13568 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
> >> &sem->wait_lock 49377 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
> >> ---------------
> >> &sem->wait_lock 8097 [<ffffffff8043a5a0>] __down_write_trylock+0x20/0x60
> >> &sem->wait_lock 31540 [<ffffffff8043a768>] __up_write+0x28/0x170
> >> &sem->wait_lock 5501 [<ffffffff805ce3ac>] __down_write_nested+0x1c/0xc0
> >> &sem->wait_lock 33342 [<ffffffff8043a600>] __down_read_trylock+0x20/0x60
> >>
> >>
> >
> > Interesting. I have some (ancient) patches to make rwsems more scalable
> > under heavy load by reducing contention on this lock. They should really
> > have been merged... Not sure how much it would help, but if you're
> > interested in testing, I could dust them off.
>
> Sure, I can test patches (preferably against 2.6.28-rc6-tip ).

OK, I'll see if I can find them (am overseas at the moment, and I suspect
they are stranded on some stationary rust back home, but I might be able
to find them on the web).

2008-11-27 12:52:24

by Török Edwin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-27 14:39, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 02:21:16PM +0200, Török Edwin wrote:
>
>> On 2008-11-27 14:03, Nick Piggin wrote:
>>
>>>> Running my testcase shows no significant performance difference. What am
>>>> I doing wrong?
>>>>
>>>>
>>>
>>> Software may just be doing a lot of mmap/munmap activity. threads +
>>> mmap is never going to be pretty because it is always going to involve
>>> broadcasting tlb flushes to other cores... Software writers shouldn't
>>> be scared of using processes (possibly with some shared memory).
>>>
>>>
>> It would be interesting to compare the performance of a threaded clamd,
>> and of a clamd that uses multiple processes.
>> Distributing tasks will be a bit more tricky, since it would need to use
>> IPC, instead of mutexes and condition variables.
>>
>
> Yes, although you could use PTHREAD_PROCESS_SHARED pthread mutexes on
> the shared memory I believe (having never tried it myself).
>
>
>
>>> Actually, a lot of things get faster (like malloc, or file descriptor
>>> operations) because locks aren't needed.
>>>
>>> Despite common perception, processes are actually much *faster* than
>>> threads when doing common operations like these. They are slightly slower
>>> sometimes with things like creation and exit, or context switching, but
>>> if you're doing huge numbers of those operations, then it is unlikely
>>> to be a performance critical app... :)
>>>
>>>
>> How about distributing tasks to a set of worker threads, is the overhead
>> of using IPC instead of mutexes/cond variables acceptable?
>>
>
> It is really going to depend on a lot of things: what is involved in
> distributing tasks, how many cores there are, the cache/TLB architecture
> of the system it is running on, etc.
>
> You want to distribute as much work as possible while touching as
> little memory as possible, in general.
>
> But if you're distributing threads over cores, and shared caches are
> physically tagged (which I think all x86 CPUs are), then you should
> be able to have multiple processes operate on shared memory just as
> efficiently as multiple threads I think.
>
> And then you also get the advantages of reduced contention on other
> shared locks and resources.
>

Thanks for the tips, but let's get back to the original question:
why don't I see any performance improvement with the fault-retry patches?

My testcase only compares reading files with mmap vs. reading files with
read, with different numbers of threads.
Leaving aside other reasons why mmap is slower, there should be some
speedup by running 4 threads vs 1 thread, but:

1 thread: read: 27.18, 28.76
1 thread: mmap: 25.45, 25.24
2 threads: read: 16.03, 15.66
2 threads: mmap: 22.20, 20.99
4 threads: read: 9.15, 9.12
4 threads: mmap: 20.38, 20.47

The speed of 4 threads is about the same as for 2 threads with mmap, yet
with read it scales nicely.
And the patch doesn't seem to improve scalability.
How can I find out if the patch works as expected? [i.e. verify that
faults are actually retried, and that they don't keep the semaphore locked]
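
A rough sketch of the two per-file code paths being compared, assuming
each worker thread calls one of them once per file. This is not the actual
testcase; sum_with_read() and sum_with_mmap() are hypothetical names, and
the byte sum is only there to force every page to be touched:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Sum every byte using read(): no mapping is created or torn down. */
int64_t sum_with_read(const char *path)
{
	unsigned char buf[65536];
	int64_t sum = 0;
	ssize_t n;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		for (ssize_t i = 0; i < n; i++)
			sum += buf[i];
	close(fd);
	return n < 0 ? -1 : sum;
}

/*
 * Same sum via mmap(). Every call maps and unmaps the file, and both
 * operations take the process-wide mmap_sem for writing, so concurrent
 * threads serialize here even when the data is already in page cache.
 */
int64_t sum_with_mmap(const char *path)
{
	struct stat st;
	unsigned char *p;
	int64_t sum = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {		/* mmap() rejects zero length */
		close(fd);
		return 0;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);			/* the mapping stays valid */
	if (p == MAP_FAILED)
		return -1;
	for (off_t i = 0; i < st.st_size; i++)
		sum += p[i];		/* first touch of a page faults it in */
	munmap(p, st.st_size);
	return sum;
}
```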

> OK, I'll see if I can find them (am overseas at the moment, and I suspect
> they are stranded on some stationary rust back home, but I might be able
> to find them on the web).

Ok.

Best regards,
--Edwin

2008-11-27 13:05:37

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 02:52:10PM +0200, Török Edwin wrote:
> On 2008-11-27 14:39, Nick Piggin wrote:
> > And then you also get the advantages of reduced contention on other
> > shared locks and resources.
> >
>
> Thanks for the tips, but let's get back to the original question:
> why don't I see any performance improvement with the fault-retry patches?

Because as you said, your app is CPU bound and page faults aren't needing
to sleep very much. There is too much contention on the write side, rather
than too much contention/hold time on the read side.


> My testcase only compares reading files with mmap vs. reading files with
> read, with different numbers of threads.
> Leaving aside other reasons why mmap is slower, there should be some
> speedup by running 4 threads vs 1 thread, but:
>
> 1 thread: read: 27.18, 28.76
> 1 thread: mmap: 25.45, 25.24
> 2 threads: read: 16.03, 15.66
> 2 threads: mmap: 22.20, 20.99
> 4 threads: read: 9.15, 9.12
> 4 threads: mmap: 20.38, 20.47
>
> The speed of 4 threads is about the same as for 2 threads with mmap, yet
> with read it scales nicely.
> And the patch doesn't seem to improve scalability.
> How can I find out if the patch works as expected? [i.e. verify that
> faults are actually retried, and that they don't keep the semaphore locked]

Yeah, that workload will be completely contended on the mmap_sem write-side
if the files are in cache. The Google patch won't help at all in that
case.

2008-11-27 13:08:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
> >Hmm. How quantifiable is the benefit? Does it actually matter that you
> >can read the proc file much faster? (this is for some automated workload
> >management daemon or something, right?)
>
> Correct. I don't recall the numbers from the pathological cases we were
> seeing, but iirc, it was on the order of tens of seconds, likely
> exacerbated by slower than usual disks. I've been digging through my
> inbox to find numbers without much success -- we've been using a variant
> of this patch since 2.6.11.
>
> Török however identified mmap taking on the order of several
> milliseconds due to this exact problem:
>
> http://lkml.org/lkml/2008/9/12/185

Turns out to be a different problem.


> >Would it be possible to reduce mmap()/munmap() activity? eg. if it is
> >due to a heap memory allocator, then perhaps do more batching or set
> >some hysteresis.
>
> I know our tcmalloc team had made great strides to reduce mmap_sem
> contention for the heap, but there are various other bits of the stack
> that really want to mmap files..
>
> We generally try to avoid such things, but sometimes it a) can't be
> easily avoided (third party libraries for instance) and b) when it hits
> us, it affects the overall health of the machine/cluster (the monitoring
> daemons get blocked, which isn't very healthy).

Are you doing appropriate posix_fadvise to prefetch in the files before
faulting, and madvise hints if appropriate?

2008-11-27 13:10:35

by Török Edwin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-27 15:05, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 02:52:10PM +0200, Török Edwin wrote:
>
>> On 2008-11-27 14:39, Nick Piggin wrote:
>>
>>> And then you also get the advantages of reduced contention on other
>>> shared locks and resources.
>>>
>>>
>> Thanks for the tips, but let's get back to the original question:
>> why don't I see any performance improvement with the fault-retry patches?
>>
>
> Because as you said, your app is CPU bound and page faults aren't needing
> to sleep very much. There is too much contention on the write side, rather
> than too much contention/hold time on the read side.
>
>
>
>> My testcase only compares reading files with mmap vs. reading files with
>> read, with different numbers of threads.
>> Leaving aside other reasons why mmap is slower, there should be some
>> speedup by running 4 threads vs 1 thread, but:
>>
>> 1 thread: read: 27.18, 28.76
>> 1 thread: mmap: 25.45, 25.24
>> 2 threads: read: 16.03, 15.66
>> 2 threads: mmap: 22.20, 20.99
>> 4 threads: read: 9.15, 9.12
>> 4 threads: mmap: 20.38, 20.47
>>
>> The speed of 4 threads is about the same as for 2 threads with mmap, yet
>> with read it scales nicely.
>> And the patch doesn't seem to improve scalability.
>> How can I find out if the patch works as expected? [i.e. verify that
>> faults are actually retried, and that they don't keep the semaphore locked]
>>
>
> Yeah, that workload will be completely contended on the mmap_sem write-side
> if the files are in cache. The Google patch won't help at all in that
> case.
>

Ok. Sorry for hijacking the thread, my testcase is not a good testcase
for what this patch tries to solve.

Best regards,
--Edwin

2008-11-27 13:12:26

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 03:10:20PM +0200, Török Edwin wrote:
> On 2008-11-27 15:05, Nick Piggin wrote:
> > On Thu, Nov 27, 2008 at 02:52:10PM +0200, Török Edwin wrote:
> >
> >> On 2008-11-27 14:39, Nick Piggin wrote:
> >>
> >>> And then you also get the advantages of reduced contention on other
> >>> shared locks and resources.
> >>>
> >>>
> >> Thanks for the tips, but let's get back to the original question:
> >> why don't I see any performance improvement with the fault-retry patches?
> >>
> >
> > Because as you said, your app is CPU bound and page faults aren't needing
> > to sleep very much. There is too much contention on the write side, rather
> > than too much contention/hold time on the read side.
> >
> >
> >
> >> My testcase only compares reading files with mmap vs. reading files with
> >> read, with different numbers of threads.
> >> Leaving aside other reasons why mmap is slower, there should be some
> >> speedup by running 4 threads vs 1 thread, but:
> >>
> >> 1 thread: read: 27.18, 28.76
> >> 1 thread: mmap: 25.45, 25.24
> >> 2 threads: read: 16.03, 15.66
> >> 2 threads: mmap: 22.20, 20.99
> >> 4 threads: read: 9.15, 9.12
> >> 4 threads: mmap: 20.38, 20.47
> >>
> >> The speed of 4 threads is about the same as for 2 threads with mmap, yet
> >> with read it scales nicely.
> >> And the patch doesn't seem to improve scalability.
> >> How can I find out if the patch works as expected? [i.e. verify that
> >> faults are actually retried, and that they don't keep the semaphore locked]
> >>
> >
> > Yeah, that workload will be completely contended on the mmap_sem write-side
> > if the files are in cache. The Google patch won't help at all in that
> > case.
> >
>
> Ok. Sorry for hijacking the thread, my testcase is not a good testcase
> for what this patch tries to solve.

No, not at all. It's always really useful to hear any problems like this.
I'd like you to keep participating... for one thing I'd like you to test
my mmap_sem patch ;) (when I finish it)

2008-11-27 13:23:29

by Török Edwin

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-27 15:12, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 03:10:20PM +0200, Török Edwin wrote:
>
>>
>> Ok. Sorry for hijacking the thread, my testcase is not a good testcase
>> for what this patch tries to solve.
>>
>
> No not at all. It's always really useful to hear any problems like this.
> I'd like you to keep participating... for one thing I'd like you to test
> my mmap_sem patch ;) (when I finish it)

Sure, just send me your patch when it is ready (together, or
before/after the rwsems patch).

Best regards,
--Edwin

2008-11-27 19:04:36

by Mike Waychison

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
>>> Hmm. How quantifiable is the benefit? Does it actually matter that you
>>> can read the proc file much faster? (this is for some automated workload
>>> management daemon or something, right?)
>> Correct. I don't recall the numbers from the pathological cases we were
>> seeing, but iirc, it was on the order of tens of seconds, likely
>> exacerbated by slower than usual disks. I've been digging through my
>> inbox to find numbers without much success -- we've been using a variant
>> of this patch since 2.6.11.
>>
>> Török however identified mmap taking on the order of several
>> milliseconds due to this exact problem:
>>
>> http://lkml.org/lkml/2008/9/12/185
>
> Turns out to be a different problem.
>

What do you mean?

>
>>> Would it be possible to reduce mmap()/munmap() activity? eg. if it is
>>> due to a heap memory allocator, then perhaps do more batching or set
>>> some hysteresis.
>> I know our tcmalloc team had made great strides to reduce mmap_sem
>> contention for the heap, but there are various other bits of the stack
>> that really want to mmap files..
>>
>> We generally try to avoid such things, but sometimes it a) can't be
>> easily avoided (third party libraries for instance) and b) when it hits
>> us, it affects the overall health of the machine/cluster (the monitoring
>> daemons get blocked, which isn't very healthy).
>
> Are you doing appropriate posix_fadvise to prefetch in the files before
> faulting, and madvise hints if appropriate?
>

Yes, we've been slowly rolling fadvise hints out, though not to
prefetch, and definitely not for faulting. I don't see how issuing a
prefetch right before we try to fault in a page is going to help
matters. The pages may appear in pagecache, but they won't be uptodate
by the time we look at them anyway, so we're back to square one.

The best use for fadvise we've found is FADV_DONTNEED as it kicks off
any IO for dirty pages asynchronously (except it misses metadata..).
That it drops clean pages is a nice side-benefit. With it, we don't
have to rely on the kernel's heuristics for writeout which lead to
imbalances and latency spikes.

2008-11-27 19:12:20

by Mike Waychison

[permalink] [raw]
Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Peter Zijlstra wrote:
> On Thu, 2008-11-27 at 01:28 -0800, Mike Waychison wrote:
>
>> Correct. I don't recall the numbers from the pathological cases we were
>> seeing, but iirc, it was on the order of tens of seconds, likely
>> exacerbated by slower than usual disks. I've been digging through my
>> inbox to find numbers without much success -- we've been using a variant
>> of this patch since 2.6.11.
>
>> We generally try to avoid such things, but sometimes it a) can't be
>> easily avoided (third party libraries for instance) and b) when it hits
>> us, it affects the overall health of the machine/cluster (the monitoring
>> daemons get blocked, which isn't very healthy).
>
> If it's only monitoring, there might be another solution. If you can keep
> the required data in a separate (approximate) copy so that you don't
> need mmap_sem at all to show them.
>
> If your mmap_sem is so contended your latencies are unacceptable, adding
> more users to it - even statistics gathering, just isn't going to cure
> the situation.
>
> Furthermore, /proc code usually isn't written with performance in mind,
> so it's usually simple and robust code. Adding it to a 'hot'-path like
> you're doing doesn't seem advisable.
>
> Also, releasing and re-acquiring mmap_sem can significantly add to the
> cacheline bouncing that thing already has.
>

This is much less of a worry. We expect to look at these things at a
rate on the order of 1 Hz, so cacheline bouncing becomes negligible.

Lock acquisition latency, however, hurts, and it is silly considering
that the monitor is just another reader. Our monitoring software is acting
as a litmus test here; the real pain is felt by other threads in the same
process that are also blocked trying to acquire the read lock.

2008-11-27 19:23:45

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 11:00:07AM +0100, Peter Zijlstra wrote:
>> On Thu, 2008-11-27 at 01:28 -0800, Mike Waychison wrote:
>>
>>> Correct. I don't recall the numbers from the pathological cases we were
>>> seeing, but iirc, it was on the order of 10s of seconds, likely
>>> exacerbated by slower than usual disks. I've been digging through my
>>> inbox to find numbers without much success -- we've been using a variant
>>> of this patch since 2.6.11.
>>> We generally try to avoid such things, but sometimes it a) can't be
>>> easily avoided (third party libraries for instance) and b) when it hits
>>> us, it affects the overall health of the machine/cluster (the monitoring
>>> daemons get blocked, which isn't very healthy).
>> If it's only monitoring, there might be another solution: you could keep
>> the required data in a separate (approximate) copy so that you don't
>> need mmap_sem at all to show them.
>>
>> If your mmap_sem is so contended your latencies are unacceptable, adding
>> more users to it - even statistics gathering, just isn't going to cure
>> the situation.
>>
>> Furthermore, /proc code usually isn't written with performance in mind,
>> so it's usually simple and robust code. Adding it to a 'hot'-path like
>> you're doing doesn't seem advisable.
>>
>> Also, releasing and re-acquiring mmap_sem can significantly add to the
>> cacheline bouncing that thing already has.
>
> Yes, it would be nice to reduce mmap_sem load regardless of any other
> fixes or problems. I guess they're not very worried about cacheline
> bouncing but more about hold time (how many sockets in these systems?
> 4 at most?)
>
> I guess it is the pagemap stuff that they use most heavily?
>

We aren't using pagemap yet. Reading /proc/pid/maps alone hurts.

> pagemap_read looks like it can use get_user_pages_fast. The smaps and
> clear_refs stuff might have been nicer if they could work on ranges
> like pagemap. Then they could avoid mmap_sem as well (although maps
> would need to be sampled and take mmap_sem I guess).
>
> One problem with dropping mmap_sem is that it hurts priority/fairness.
> And it opens a bit of a (maybe theoretical but not something to completely
> ignore) forward progress hole AFAIKS. If mmap_sem is very heavily
> contended, then the refault is going to take a while to get through,
> and then the page might get reclaimed etc).

Right, this can be an issue. The way around it should be to minimize
the length of time any single lock holder can sit on it. Compare that
with what we have today:

- sleep in major fault with read lock held,
- enqueue writer behind it,
- and make all other faults wait on the rwsem.

Against that, the retry logic seems to be a lot better for forward progress.
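The difference between the two behaviours can be sketched in user space. This is only a model under stated assumptions: a pthread rwlock stands in for mmap_sem, an atomic flag stands in for an uptodate pagecache page, and every name below (as_model, fault_once, handle_fault) is hypothetical rather than taken from the patch. On the first attempt the fault path drops the read lock before the slow read and asks the caller to retry; the retry is not allowed to drop the lock again.

```c
#define _XOPEN_SOURCE 600
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

enum fault_result { FAULT_DONE, FAULT_RETRY };

struct as_model {
	pthread_rwlock_t lock;    /* stands in for mmap_sem */
	atomic_bool page_cached;  /* stands in for an uptodate pagecache page */
	atomic_int slow_reads;    /* number of simulated disk reads */
};

/* Simulated synchronous disk read. */
static void read_page_from_disk(struct as_model *as)
{
	atomic_fetch_add(&as->slow_reads, 1);
	atomic_store(&as->page_cached, true);
}

/*
 * Called with the read lock held. Returns FAULT_DONE with the lock still
 * held, or FAULT_RETRY with the lock already dropped.
 */
static enum fault_result fault_once(struct as_model *as, bool may_retry)
{
	if (atomic_load(&as->page_cached))
		return FAULT_DONE;              /* minor fault */

	if (may_retry) {
		/* let writers and other readers in while we wait for IO */
		pthread_rwlock_unlock(&as->lock);
		read_page_from_disk(as);
		return FAULT_RETRY;
	}

	read_page_from_disk(as);                /* today's behaviour */
	return FAULT_DONE;
}

/* Returns the number of attempts the fault needed. */
int handle_fault(struct as_model *as)
{
	bool may_retry = true;
	int attempts = 0;

	for (;;) {
		attempts++;
		pthread_rwlock_rdlock(&as->lock);
		if (fault_once(as, may_retry) == FAULT_DONE) {
			pthread_rwlock_unlock(&as->lock);
			return attempts;
		}
		may_retry = false;              /* retry at most once */
	}
}
```

A fault on an uncached page takes two attempts and one simulated disk read in this model, and a later fault on the same page completes in one attempt. The model does not capture the fairness concern raised above: nothing stops the page from being evicted between the drop and the retry.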

2008-11-28 09:37:28

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
> Nick Piggin wrote:
> >On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
> >>
> >>Török, however, identified mmap taking on the order of several
> >>milliseconds due to this exact problem:
> >>
> >>http://lkml.org/lkml/2008/9/12/185
> >
> >Turns out to be a different problem.
> >
>
> What do you mean?

His is just contending on the write side. The retry patch doesn't help.


> >>We generally try to avoid such things, but sometimes it a) can't be
> >>easily avoided (third party libraries for instance) and b) when it hits
> >>us, it affects the overall health of the machine/cluster (the monitoring
> >>daemons get blocked, which isn't very healthy).
> >
> >Are you doing appropriate posix_fadvise to prefetch in the files before
> >faulting, and madvise hints if appropriate?
> >
>
> Yes, we've been slowly rolling out fadvise hints, though not to
> prefetch, and definitely not for faulting. I don't see how issuing a
> prefetch right before we try to fault in a page is going to help
> matters. The pages may appear in pagecache, but they won't be uptodate
> by the time we look at them anyway, so we're back to square one.

The whole point of a prefetch is to issue it sufficiently early so
it makes a difference. Actually if you can tell quite well where the
major faults will be, but don't know it sufficiently in advance to
do very good prefetching, then perhaps we could add a new madvise hint
to synchronously bring the page in (dropping the mmap_sem over the IO).
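For reference, the existing hints can be issued at mapping time, well before the first access, roughly as sketched below. The helper name map_with_prefetch is hypothetical. Both hints are advisory, so the sketch ignores their return values, and it makes no claim about how far ahead of the fault they must be issued to help. It does not implement the new synchronous madvise hint suggested above, which does not exist in this thread's kernels.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): map the first len bytes of fd
 * read-only and immediately ask the kernel to start reading them in.
 * Returns MAP_FAILED on error.
 */
void *map_with_prefetch(int fd, size_t len)
{
	void *addr = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);

	if (addr == MAP_FAILED)
		return MAP_FAILED;

	/* Advisory only: the mapping stays usable if a hint is rejected. */
	(void)posix_fadvise(fd, 0, (off_t)len, POSIX_FADV_WILLNEED);
	(void)posix_madvise(addr, len, POSIX_MADV_WILLNEED);

	return addr;
}
```

The caller would do other useful work between this call and the first touch of the mapping, so that the read has a chance to complete before the fault happens.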

2008-11-28 09:41:39

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 11:22:57AM -0800, Mike Waychison wrote:
> Nick Piggin wrote:
> >On Thu, Nov 27, 2008 at 11:00:07AM +0100, Peter Zijlstra wrote:
>
> >pagemap_read looks like it can use get_user_pages_fast. The smaps and
> >clear_refs stuff might have been nicer if they could work on ranges
> >like pagemap. Then they could avoid mmap_sem as well (although maps
> >would need to be sampled and take mmap_sem I guess).
> >
> >One problem with dropping mmap_sem is that it hurts priority/fairness.
> >And it opens a bit of a (maybe theoretical but not something to completely
> >ignore) forward progress hole AFAIKS. If mmap_sem is very heavily
> >contended, then the refault is going to take a while to get through,
> >and then the page might get reclaimed etc).
>
> Right, this can be an issue. The way around it should be to minimize
> the length of time any single lock holder can sit on it. Compared to
> what we have today with:
>
> - sleep in major fault with read lock held,
> - enqueue writer behind it,
> - and make all other faults wait on the rwsem
>
> The retry logic seems to be a lot better for forward progress.

The whole reason why you have the latency is because it is
guaranteeing forward progress for everyone. The retry logic
may work out better in that situation, but it does actually
open a starvation hole.

2008-11-28 12:10:32

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Thu, Nov 27, 2008 at 03:23:14PM +0200, Török Edwin wrote:
> On 2008-11-27 15:12, Nick Piggin wrote:
> > On Thu, Nov 27, 2008 at 03:10:20PM +0200, Török Edwin wrote:
> >
> >>
> >> Ok. Sorry for hijacking the thread, my testcase is not a good testcase
> >> for what this patch tries to solve.
> >>
> >
> > No not at all. It's always really useful to hear any problems like this.
> > I'd like you to keep participating... for one thing I'd like you to test
> > my mmap_sem patch ;) (when I finish it)
>
> Sure, just send me your patch when it is ready (together, or
> before/after the rwsems patch).

This is what I have.

It does two things. Firstly, it switches x86-64 over to use the xadd
algorithm rather than the spinlock algorithm. This is actually significant
in high contention situations, because the spinlock algorithm doesn't allow
concurrent operations on the lock while the queue of waiters is being
manipulated.
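To make the count arithmetic of the xadd scheme concrete, here is a small user-space model using C11 atomics. It uses the 32-bit bias values from the header in the patch below, written as negative numbers so they behave the same in a 64-bit long. The function names are hypothetical, and the failure handling is a simplification: where the real slow path queues the task and sleeps, this model just undoes its add and reports failure.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define ACTIVE_BIAS        1L
#define ACTIVE_MASK        0xffffL
#define WAITING_BIAS       (-(ACTIVE_MASK + 1))          /* -0x10000 */
#define ACTIVE_READ_BIAS   ACTIVE_BIAS
#define ACTIVE_WRITE_BIAS  (WAITING_BIAS + ACTIVE_BIAS)

/* Reader fast path: a single atomic add, success if the new count is > 0. */
bool model_read_lock(atomic_long *count)
{
	long newval = atomic_fetch_add(count, ACTIVE_READ_BIAS)
		      + ACTIVE_READ_BIAS;

	if (newval > 0)
		return true;
	atomic_fetch_sub(count, ACTIVE_READ_BIAS);   /* model only: undo */
	return false;
}

/* Writer fast path: success only if the lock was completely idle. */
bool model_write_lock(atomic_long *count)
{
	long newval = atomic_fetch_add(count, ACTIVE_WRITE_BIAS)
		      + ACTIVE_WRITE_BIAS;

	if (newval == ACTIVE_WRITE_BIAS)
		return true;
	atomic_fetch_sub(count, ACTIVE_WRITE_BIAS);  /* model only: undo */
	return false;
}

void model_read_unlock(atomic_long *count)
{
	atomic_fetch_sub(count, ACTIVE_READ_BIAS);
}

void model_write_unlock(atomic_long *count)
{
	atomic_fetch_sub(count, ACTIVE_WRITE_BIAS);
}
```

The point of the encoding is that the uncontended paths never touch the wait_lock spinlock: one atomic add both takes the lock and reveals whether anyone else holds it or is waiting.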

Secondly, it moves wakeups out from underneath the waiter queue lock. This
is more significant on bigger machines where wakeup latency is worse and/or
runqueue locks are very heavily contended.
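The second change follows a common pattern: detach the waiters onto a private list while holding the queue lock, drop the lock, and only then do the expensive wakeups. A user-space sketch of that pattern follows. The types and names are hypothetical, a pthread mutex stands in for wait_lock, and setting an atomic flag stands in for wake_up_process(). Unlike the patch, this model wakes every waiter rather than distinguishing readers from writers.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct waiter {
	struct waiter *next;
	atomic_int woken;          /* stands in for the task being woken */
};

struct wait_queue {
	pthread_mutex_t lock;      /* stands in for sem->wait_lock */
	struct waiter *head;
};

void enqueue(struct wait_queue *q, struct waiter *w)
{
	pthread_mutex_lock(&q->lock);
	w->next = q->head;
	q->head = w;
	pthread_mutex_unlock(&q->lock);
}

/* Detach all waiters under the lock, wake them after dropping it. */
int wake_all(struct wait_queue *q)
{
	struct waiter *list;
	int n = 0;

	pthread_mutex_lock(&q->lock);
	list = q->head;            /* private list, like wake_list */
	q->head = NULL;
	pthread_mutex_unlock(&q->lock);

	while (list) {
		struct waiter *w = list;

		/* Read the link first: w may vanish once it is woken. */
		list = w->next;
		w->next = NULL;
		atomic_store_explicit(&w->woken, 1, memory_order_release);
		n++;
	}
	return n;
}
```

While the wakeups run, other threads can already take the queue lock to add themselves, which is the latency the change is meant to recover on large machines.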

Now both these changes are going to help *mainly* for the case when there are
a significant number of readers and writers, I think. So your write-heavy
workload may not win anything. I noticed some speedup a long time ago on
some weird java (volanomark) workload.

Thanks,
Nick
---

Index: linux-2.6/include/asm-generic/rwsem-xadd.h
===================================================================
--- /dev/null
+++ linux-2.6/include/asm-generic/rwsem-xadd.h
@@ -0,0 +1,182 @@
+#ifndef _ASM_GENERIC_RWSEM_XADD_H
+#define _ASM_GENERIC_RWSEM_XADD_H
+
+#ifndef _LINUX_RWSEM_H
+#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
+#endif
+
+#ifdef __KERNEL__
+
+/*
+ * R/W semaphores for PPC using the stuff in lib/rwsem.c.
+ * Adapted largely from include/asm-i386/rwsem.h
+ * by Paul Mackerras <[email protected]>.
+ */
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/rwsem.h>
+#include <linux/spinlock.h>
+
+#include <asm/system.h>
+#include <asm/atomic.h>
+
+/*
+ * The MSW of the count is the negated number of active writers and waiting
+ * lockers, and the LSW is the total number of active locks
+ *
+ * The lock count is initialized to 0 (no active and no waiting lockers).
+ *
+ * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
+ * uncontended lock. This can be determined because atomic_long_add_return returns
+ * the new value. Readers increment by 1 and see a positive value when
+ * uncontended, negative if there are writers (and maybe) readers waiting (in
+ * which case it goes to sleep).
+ *
+ * On 32-bit architectures, the value of WAITING_BIAS supports up to 32766
+ * waiting processes. The value of ACTIVE_BIAS supports up to 65535 active
+ * processes. On 64-bit, these are much larger.
+ */
+#define RWSEM_UNLOCKED_VALUE 0x00000000L
+#define RWSEM_ACTIVE_BIAS 0x00000001L
+#if BITS_PER_LONG == 32
+#define RWSEM_ACTIVE_MASK 0x0000ffffL
+#define RWSEM_WAITING_BIAS 0xffff0000L
+#elif BITS_PER_LONG == 64
+#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
+#define RWSEM_WAITING_BIAS 0xffffffff00000000L
+#endif
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
+/*
+ * the semaphore definition
+ */
+struct rw_semaphore {
+ atomic_long_t count;
+ spinlock_t wait_lock;
+ struct list_head wait_list;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+ struct lockdep_map dep_map;
+#endif
+};
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
+#else
+# define __RWSEM_DEP_MAP_INIT(lockname)
+#endif
+
+#define __RWSEM_INITIALIZER(name) \
+ { .count = ATOMIC_LONG_INIT(RWSEM_UNLOCKED_VALUE), \
+ .wait_lock = __SPIN_LOCK_UNLOCKED((name).wait_lock), \
+ .wait_list = LIST_HEAD_INIT((name).wait_list) \
+ __RWSEM_DEP_MAP_INIT(name) }
+
+#define DECLARE_RWSEM(name) \
+ struct rw_semaphore name = __RWSEM_INITIALIZER(name)
+
+extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
+extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
+
+extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
+ struct lock_class_key *key);
+
+#define init_rwsem(sem) \
+ do { \
+ static struct lock_class_key __key; \
+ \
+ __init_rwsem((sem), #sem, &__key); \
+ } while (0)
+
+/*
+ * lock for reading
+ */
+static inline void __down_read(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_inc_return(&sem->count) <= 0))
+ rwsem_down_read_failed(sem);
+}
+
+static inline int __down_read_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ while ((tmp = atomic_long_read(&sem->count)) >= 0) {
+ if (tmp == atomic_long_cmpxchg(&sem->count, tmp,
+ tmp + RWSEM_ACTIVE_READ_BIAS)) {
+ return 1;
+ }
+ }
+ return 0;
+}
+
+/*
+ * lock for writing
+ */
+static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
+ if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
+ rwsem_down_write_failed(sem);
+}
+
+static inline void __down_write(struct rw_semaphore *sem)
+{
+ __down_write_nested(sem, 0);
+}
+
+static inline int __down_write_trylock(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
+ RWSEM_ACTIVE_WRITE_BIAS);
+ return tmp == RWSEM_UNLOCKED_VALUE;
+}
+
+/*
+ * unlock after reading
+ */
+static inline void __up_read(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_dec_return(&sem->count);
+ if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * unlock after writing
+ */
+static inline void __up_write(struct rw_semaphore *sem)
+{
+ if (unlikely(atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS,
+ &sem->count) < 0))
+ rwsem_wake(sem);
+}
+
+/*
+ * downgrade write lock to read lock
+ */
+static inline void __downgrade_write(struct rw_semaphore *sem)
+{
+ long tmp;
+
+ tmp = atomic_long_add_return(-RWSEM_WAITING_BIAS, &sem->count);
+ if (tmp < 0)
+ rwsem_downgrade_wake(sem);
+}
+
+static inline int rwsem_is_locked(struct rw_semaphore *sem)
+{
+ return (atomic_long_read(&sem->count) != 0);
+}
+
+#endif /* __KERNEL__ */
+#endif /* _ASM_GENERIC_RWSEM_XADD_H */
Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c
+++ linux-2.6/lib/rwsem.c
@@ -3,10 +3,15 @@
* Written by David Howells ([email protected]).
* Derived from arch/i386/kernel/semaphore.c
*/
-#include <linux/rwsem.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
#include <linux/sched.h>
-#include <linux/init.h>
#include <linux/module.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+
+#include <asm/system.h>
+#include <asm/atomic.h>

/*
* Initialize an rwsem:
@@ -21,11 +26,10 @@ void __init_rwsem(struct rw_semaphore *s
debug_check_no_locks_freed((void *)sem, sizeof(*sem));
lockdep_init_map(&sem->dep_map, name, key, 0);
#endif
- sem->count = RWSEM_UNLOCKED_VALUE;
+ atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
spin_lock_init(&sem->wait_lock);
INIT_LIST_HEAD(&sem->wait_list);
}
-
EXPORT_SYMBOL(__init_rwsem);

struct rwsem_waiter {
@@ -45,14 +49,15 @@ struct rwsem_waiter {
* - the spinlock must be held by the caller
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if downgrading is false
+ *
+ * The spinlock will be dropped by this function.
*/
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int downgrading, unsigned long flags)
{
+ LIST_HEAD(wake_list);
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
- struct list_head *next;
- signed long oldcount, woken, loop;
+ long oldcount, woken;

if (downgrading)
goto dont_wake_writers;
@@ -61,7 +66,7 @@ __rwsem_do_wake(struct rw_semaphore *sem
* if we can transition the active part of the count from 0 -> 1
*/
try_again:
- oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS, sem)
+ oldcount = atomic_long_add_return(RWSEM_ACTIVE_BIAS, &sem->count)
- RWSEM_ACTIVE_BIAS;
if (oldcount & RWSEM_ACTIVE_MASK)
goto undo;
@@ -79,12 +84,8 @@ __rwsem_do_wake(struct rw_semaphore *sem
* It is an allocated on the waiter's stack and may become invalid at
* any time after that point (due to a wakeup from another source).
*/
- list_del(&waiter->list);
- tsk = waiter->task;
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
goto out;

/* don't want to wake any writers */
@@ -101,44 +102,40 @@ __rwsem_do_wake(struct rw_semaphore *sem
readers_only:
woken = 0;
do {
- woken++;
-
- if (waiter->list.next == &sem->wait_list)
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
+ woken += RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+ if (list_empty(&sem->wait_list))
break;

- waiter = list_entry(waiter->list.next,
+ waiter = list_entry(sem->wait_list.next,
struct rwsem_waiter, list);
-
} while (waiter->flags & RWSEM_WAITING_FOR_READ);

- loop = woken;
- woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
if (!downgrading)
/* we'd already done one increment earlier */
woken -= RWSEM_ACTIVE_BIAS;

- rwsem_atomic_add(woken, sem);
+ atomic_long_add(woken, &sem->count);

- next = sem->wait_list.next;
- for (; loop > 0; loop--) {
- waiter = list_entry(next, struct rwsem_waiter, list);
- next = waiter->list.next;
+out:
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
+ while (!list_empty(&wake_list)) {
+ struct task_struct *tsk;
+ waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
+ list_del(&waiter->list);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
}

- sem->wait_list.next = next;
- next->prev = &sem->wait_list;
-
- out:
- return sem;
+ return;

/* undo the change to count, but check for a transition 1->0 */
undo:
- if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS, sem) != 0)
+ if (atomic_long_add_return(-RWSEM_ACTIVE_BIAS, &sem->count) != 0)
goto out;
goto try_again;
}
@@ -146,35 +143,33 @@ __rwsem_do_wake(struct rw_semaphore *sem
/*
* wait for a lock to be granted
*/
-static struct rw_semaphore __sched *
-rwsem_down_failed_common(struct rw_semaphore *sem,
- struct rwsem_waiter *waiter, signed long adjustment)
+static struct rw_semaphore * __sched rwsem_down_failed_common(struct rw_semaphore *sem,
+ struct rwsem_waiter *waiter, long adjustment)
{
struct task_struct *tsk = current;
- signed long count;
+ unsigned long flags;
+ long count;

set_task_state(tsk, TASK_UNINTERRUPTIBLE);

/* set up my own style of waitqueue */
- spin_lock_irq(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);
waiter->task = tsk;
get_task_struct(tsk);

list_add_tail(&waiter->list, &sem->wait_list);

/* we're now waiting on the lock, but no longer actively read-locking */
- count = rwsem_atomic_update(adjustment, sem);
+ count = atomic_long_add_return(adjustment, &sem->count);

/* if there are no active locks, wake the front queued process(es) up */
if (!(count & RWSEM_ACTIVE_MASK))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irq(&sem->wait_lock);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

/* wait to be given the lock */
- for (;;) {
- if (!waiter->task)
- break;
+ while (waiter->task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -187,8 +182,7 @@ rwsem_down_failed_common(struct rw_semap
/*
* wait for the read lock to be granted
*/
-asmregparm struct rw_semaphore __sched *
-rwsem_down_read_failed(struct rw_semaphore *sem)
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;

@@ -197,12 +191,12 @@ rwsem_down_read_failed(struct rw_semapho
RWSEM_WAITING_BIAS - RWSEM_ACTIVE_BIAS);
return sem;
}
+EXPORT_SYMBOL(rwsem_down_read_failed);

/*
* wait for the write lock to be granted
*/
-asmregparm struct rw_semaphore __sched *
-rwsem_down_write_failed(struct rw_semaphore *sem)
+struct rw_semaphore __sched *rwsem_down_write_failed(struct rw_semaphore *sem)
{
struct rwsem_waiter waiter;

@@ -211,12 +205,13 @@ rwsem_down_write_failed(struct rw_semaph

return sem;
}
+EXPORT_SYMBOL(rwsem_down_write_failed);

/*
* handle waking up a waiter on the semaphore
* - up_read/up_write has decremented the active part of count if we come here
*/
-asmregparm struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
{
unsigned long flags;

@@ -224,19 +219,20 @@ asmregparm struct rw_semaphore *rwsem_wa

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

return sem;
}
+EXPORT_SYMBOL(rwsem_wake);

/*
* downgrade a write lock into a read lock
* - caller incremented waiting part of count and discovered it still negative
* - just wake up any readers at the front of the queue
*/
-asmregparm struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
+struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem)
{
unsigned long flags;

@@ -244,14 +240,10 @@ asmregparm struct rw_semaphore *rwsem_do

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 1, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

return sem;
}
-
-EXPORT_SYMBOL(rwsem_down_read_failed);
-EXPORT_SYMBOL(rwsem_down_write_failed);
-EXPORT_SYMBOL(rwsem_wake);
EXPORT_SYMBOL(rwsem_downgrade_wake);
Index: linux-2.6/include/linux/rwsem.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem.h
+++ linux-2.6/include/linux/rwsem.h
@@ -16,8 +16,8 @@

struct rw_semaphore;

-#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
-#include <linux/rwsem-spinlock.h> /* use a generic implementation */
+#ifdef CONFIG_GENERIC_RWSEM
+#include <asm-generic/rwsem-xadd.h>
#else
#include <asm/rwsem.h> /* use an arch-specific implementation */
#endif
Index: linux-2.6/arch/alpha/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/alpha/include/asm/rwsem.h
+++ /dev/null
@@ -1,259 +0,0 @@
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky <[email protected]>, 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/compiler.h>
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- long count;
-#define RWSEM_UNLOCKED_VALUE 0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS 0x0000000000000001L
-#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
- spinlock_t wait_lock;
- struct list_head wait_list;
-};
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
- LIST_HEAD_INIT((name).wait_list) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count;
- sem->count += RWSEM_ACTIVE_READ_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount < 0))
- rwsem_down_read_failed(sem);
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- long old, new, res;
-
- res = sem->count;
- do {
- new = res + RWSEM_ACTIVE_READ_BIAS;
- if (new <= 0)
- break;
- old = res;
- res = cmpxchg(&sem->count, old, new);
- } while (res != old);
- return res >= 0 ? 1 : 0;
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count;
- sem->count += RWSEM_ACTIVE_WRITE_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount))
- rwsem_down_write_failed(sem);
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- long ret = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- if (ret == RWSEM_UNLOCKED_VALUE)
- return 1;
- return 0;
-}
-
-static inline void __up_read(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count;
- sem->count -= RWSEM_ACTIVE_READ_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- " mb\n"
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount < 0))
- if ((int)oldcount - RWSEM_ACTIVE_READ_BIAS == 0)
- rwsem_wake(sem);
-}
-
-static inline void __up_write(struct rw_semaphore *sem)
-{
- long count;
-#ifndef CONFIG_SMP
- sem->count -= RWSEM_ACTIVE_WRITE_BIAS;
- count = sem->count;
-#else
- long temp;
- __asm__ __volatile__(
- " mb\n"
- "1: ldq_l %0,%1\n"
- " subq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " subq %0,%3,%0\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (count), "=m" (sem->count), "=&r" (temp)
- :"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(count))
- if ((int)count == 0)
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- long oldcount;
-#ifndef CONFIG_SMP
- oldcount = sem->count;
- sem->count -= RWSEM_WAITING_BIAS;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- " mb\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (-RWSEM_WAITING_BIAS), "m" (sem->count) : "memory");
-#endif
- if (unlikely(oldcount < 0))
- rwsem_downgrade_wake(sem);
-}
-
-static inline void rwsem_atomic_add(long val, struct rw_semaphore *sem)
-{
-#ifndef CONFIG_SMP
- sem->count += val;
-#else
- long temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%2,%0\n"
- " stq_c %0,%1\n"
- " beq %0,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (temp), "=m" (sem->count)
- :"Ir" (val), "m" (sem->count));
-#endif
-}
-
-static inline long rwsem_atomic_update(long val, struct rw_semaphore *sem)
-{
-#ifndef CONFIG_SMP
- sem->count += val;
- return sem->count;
-#else
- long ret, temp;
- __asm__ __volatile__(
- "1: ldq_l %0,%1\n"
- " addq %0,%3,%2\n"
- " addq %0,%3,%0\n"
- " stq_c %2,%1\n"
- " beq %2,2f\n"
- ".subsection 2\n"
- "2: br 1b\n"
- ".previous"
- :"=&r" (ret), "=m" (sem->count), "=&r" (temp)
- :"Ir" (val), "m" (sem->count));
-
- return ret;
-#endif
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ALPHA_RWSEM_H */
Index: linux-2.6/arch/ia64/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/ia64/include/asm/rwsem.h
+++ /dev/null
@@ -1,182 +0,0 @@
-/*
- * R/W semaphores for ia64
- *
- * Copyright (C) 2003 Ken Chen <[email protected]>
- * Copyright (C) 2003 Asit Mallick <[email protected]>
- * Copyright (C) 2005 Christoph Lameter <[email protected]>
- *
- * Based on asm-i386/rwsem.h and other architecture implementation.
- *
- * The MSW of the count is the negated number of active writers and
- * waiting lockers, and the LSW is the total number of active locks.
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffffffff00000001 for
- * the case of an uncontended lock. Readers increment by 1 and see a positive
- * value when uncontended, negative if there are writers (and maybe) readers
- * waiting (in which case it goes to sleep).
- */
-
-#ifndef _ASM_IA64_RWSEM_H
-#define _ASM_IA64_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
-#endif
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-#include <asm/intrinsics.h>
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- signed long count;
- spinlock_t wait_lock;
- struct list_head wait_list;
-};
-
-#define RWSEM_UNLOCKED_VALUE __IA64_UL_CONST(0x0000000000000000)
-#define RWSEM_ACTIVE_BIAS __IA64_UL_CONST(0x0000000000000001)
-#define RWSEM_ACTIVE_MASK __IA64_UL_CONST(0x00000000ffffffff)
-#define RWSEM_WAITING_BIAS -__IA64_UL_CONST(0x0000000100000000)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
- LIST_HEAD_INIT((name).wait_list) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-static inline void
-init_rwsem (struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-/*
- * lock for reading
- */
-static inline void
-__down_read (struct rw_semaphore *sem)
-{
- long result = ia64_fetchadd8_acq((unsigned long *)&sem->count, 1);
-
- if (result < 0)
- rwsem_down_read_failed(sem);
-}
-
-/*
- * lock for writing
- */
-static inline void
-__down_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = sem->count;
- new = old + RWSEM_ACTIVE_WRITE_BIAS;
- } while (cmpxchg_acq(&sem->count, old, new) != old);
-
- if (old != 0)
- rwsem_down_write_failed(sem);
-}
-
-/*
- * unlock after reading
- */
-static inline void
-__up_read (struct rw_semaphore *sem)
-{
- long result = ia64_fetchadd8_rel((unsigned long *)&sem->count, -1);
-
- if (result < 0 && (--result & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void
-__up_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = sem->count;
- new = old - RWSEM_ACTIVE_WRITE_BIAS;
- } while (cmpxchg_rel(&sem->count, old, new) != old);
-
- if (new < 0 && (new & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int
-__down_read_trylock (struct rw_semaphore *sem)
-{
- long tmp;
- while ((tmp = sem->count) >= 0) {
- if (tmp == cmpxchg_acq(&sem->count, tmp, tmp+1)) {
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int
-__down_write_trylock (struct rw_semaphore *sem)
-{
- long tmp = cmpxchg_acq(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void
-__downgrade_write (struct rw_semaphore *sem)
-{
- long old, new;
-
- do {
- old = sem->count;
- new = old - RWSEM_WAITING_BIAS;
- } while (cmpxchg_rel(&sem->count, old, new) != old);
-
- if (old < 0)
- rwsem_downgrade_wake(sem);
-}
-
-/*
- * Implement atomic add functionality. These used to be "inline" functions, but GCC v3.1
- * doesn't quite optimize this stuff right and ends up with bad calls to fetchandadd.
- */
-#define rwsem_atomic_add(delta, sem) atomic64_add(delta, (atomic64_t *)(&(sem)->count))
-#define rwsem_atomic_update(delta, sem) atomic64_add_return(delta, (atomic64_t *)(&(sem)->count))
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* _ASM_IA64_RWSEM_H */
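
Note for reviewers: the ia64 read-trylock removed above is a compare-and-swap retry loop that gives up as soon as the count goes negative. Below is a minimal user-space sketch of that pattern using C11 atomics. It is not kernel code: `toy_read_trylock` and the bare `atomic_long` counter are hypothetical stand-ins for `__down_read_trylock()` and `sem->count`, and the default sequentially consistent ordering is stronger than the acquire ordering of `cmpxchg_acq`.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical user-space analogue of the removed ia64
 * __down_read_trylock(): returns 1 if a reader reference was taken,
 * 0 if the count was negative (writer active or lockers queued).
 */
static int toy_read_trylock(atomic_long *count)
{
	long tmp = atomic_load(count);

	while (tmp >= 0) {
		/* On failure, tmp is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &tmp, tmp + 1))
			return 1;
	}
	return 0;
}

/* Small self-check of the two outcomes. */
static void toy_read_trylock_demo(void)
{
	atomic_long count;

	atomic_init(&count, 0);
	assert(toy_read_trylock(&count) == 1);	/* unlocked: succeeds */
	assert(atomic_load(&count) == 1);

	atomic_store(&count, -1);		/* pretend the lock is contended */
	assert(toy_read_trylock(&count) == 0);	/* negative count: gives up */
	assert(atomic_load(&count) == -1);	/* count left untouched */
}
```

Unlike `__down_read()`, the trylock never modifies the count on failure, so no slow-path fixup is needed.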
Index: linux-2.6/arch/powerpc/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/powerpc/include/asm/rwsem.h
+++ /dev/null
@@ -1,173 +0,0 @@
-#ifndef _ASM_POWERPC_RWSEM_H
-#define _ASM_POWERPC_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
-#endif
-
-#ifdef __KERNEL__
-
-/*
- * R/W semaphores for PPC using the stuff in lib/rwsem.c.
- * Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <[email protected]>.
- */
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- /* XXX this should be able to be an atomic_t -- paulus */
- signed int count;
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS (-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, __SPIN_LOCK_UNLOCKED((name).wait_lock), \
- LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
- do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
- } while (0)
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_inc_return((atomic_t *)(&sem->count)) <= 0))
- rwsem_down_read_failed(sem);
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- while ((tmp = sem->count) >= 0) {
- if (tmp == cmpxchg(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- int tmp;
-
- tmp = atomic_add_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count));
- if (unlikely(tmp != RWSEM_ACTIVE_WRITE_BIAS))
- rwsem_down_write_failed(sem);
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- __down_write_nested(sem, 0);
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = atomic_dec_return((atomic_t *)(&sem->count));
- if (unlikely(tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0))
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- if (unlikely(atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count)) < 0))
- rwsem_wake(sem);
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
- atomic_add(delta, (atomic_t *)(&sem->count));
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = atomic_add_return(-RWSEM_WAITING_BIAS, (atomic_t *)(&sem->count));
- if (tmp < 0)
- rwsem_downgrade_wake(sem);
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
- return atomic_add_return(delta, (atomic_t *)(&sem->count));
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_POWERPC_RWSEM_H */
Index: linux-2.6/arch/s390/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/s390/include/asm/rwsem.h
+++ /dev/null
@@ -1,387 +0,0 @@
-#ifndef _S390_RWSEM_H
-#define _S390_RWSEM_H
-
-/*
- * include/asm-s390/rwsem.h
- *
- * S390 version
- * Copyright (C) 2002 IBM Deutschland Entwicklung GmbH, IBM Corporation
- * Author(s): Martin Schwidefsky ([email protected])
- *
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-/*
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consequtive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *);
-extern struct rw_semaphore *rwsem_downgrade_write(struct rw_semaphore *);
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- signed long count;
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifndef __s390x__
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS (-0x00010000)
-#else /* __s390x__ */
-#define RWSEM_UNLOCKED_VALUE 0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS 0x0000000000000001L
-#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#endif /* __s390x__ */
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-/*
- * initialisation
- */
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, __SPIN_LOCK_UNLOCKED((name).wait.lock), \
- LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
-do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
-} while (0)
-
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- signed long old, new;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " ahi %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " aghi %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count),
- "i" (RWSEM_ACTIVE_READ_BIAS) : "cc", "memory");
- if (old < 0)
- rwsem_down_read_failed(sem);
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- signed long old, new;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: ltr %1,%0\n"
- " jm 1f\n"
- " ahi %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b\n"
- "1:"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: ltgr %1,%0\n"
- " jm 1f\n"
- " aghi %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b\n"
- "1:"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count),
- "i" (RWSEM_ACTIVE_READ_BIAS) : "cc", "memory");
- return old >= 0 ? 1 : 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- signed long old, new, tmp;
-
- tmp = RWSEM_ACTIVE_WRITE_BIAS;
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " a %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " ag %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count), "m" (tmp)
- : "cc", "memory");
- if (old != 0)
- rwsem_down_write_failed(sem);
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- __down_write_nested(sem, 0);
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- signed long old;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%2)\n"
- "0: ltr %0,%0\n"
- " jnz 1f\n"
- " cs %0,%4,0(%2)\n"
- " jl 0b\n"
-#else /* __s390x__ */
- " lg %0,0(%2)\n"
- "0: ltgr %0,%0\n"
- " jnz 1f\n"
- " csg %0,%4,0(%2)\n"
- " jl 0b\n"
-#endif /* __s390x__ */
- "1:"
- : "=&d" (old), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count),
- "d" (RWSEM_ACTIVE_WRITE_BIAS) : "cc", "memory");
- return (old == RWSEM_UNLOCKED_VALUE) ? 1 : 0;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- signed long old, new;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " ahi %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " aghi %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count),
- "i" (-RWSEM_ACTIVE_READ_BIAS)
- : "cc", "memory");
- if (new < 0)
- if ((new & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- signed long old, new, tmp;
-
- tmp = -RWSEM_ACTIVE_WRITE_BIAS;
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " a %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " ag %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count), "m" (tmp)
- : "cc", "memory");
- if (new < 0)
- if ((new & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- signed long old, new, tmp;
-
- tmp = -RWSEM_WAITING_BIAS;
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " a %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " ag %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count), "m" (tmp)
- : "cc", "memory");
- if (new > 1)
- rwsem_downgrade_wake(sem);
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(long delta, struct rw_semaphore *sem)
-{
- signed long old, new;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " ar %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " agr %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count), "d" (delta)
- : "cc", "memory");
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline long rwsem_atomic_update(long delta, struct rw_semaphore *sem)
-{
- signed long old, new;
-
- asm volatile(
-#ifndef __s390x__
- " l %0,0(%3)\n"
- "0: lr %1,%0\n"
- " ar %1,%5\n"
- " cs %0,%1,0(%3)\n"
- " jl 0b"
-#else /* __s390x__ */
- " lg %0,0(%3)\n"
- "0: lgr %1,%0\n"
- " agr %1,%5\n"
- " csg %0,%1,0(%3)\n"
- " jl 0b"
-#endif /* __s390x__ */
- : "=&d" (old), "=&d" (new), "=m" (sem->count)
- : "a" (&sem->count), "m" (sem->count), "d" (delta)
- : "cc", "memory");
- return new;
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _S390_RWSEM_H */
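
Note for reviewers: every s390 primitive removed above is the same load / add / compare-and-swap / retry sequence (`l`, `lr`, `ahi` or `a`, `cs`, `jl 0b`), differing only in the delta and in whether the caller tests `old` or `new` afterwards. A rough C11 equivalent of that loop, assuming a hypothetical helper `toy_cs_add` that is not part of the kernel:

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Hypothetical helper mirroring the s390 CS loop: atomically add delta
 * to *count, return the new value and report the old one via old_out.
 */
static long toy_cs_add(atomic_long *count, long delta, long *old_out)
{
	long old = atomic_load(count);		/* l / lg: load current value */
	long new;

	do {
		new = old + delta;		/* lr + ahi: copy and add */
		/*
		 * cs / csg: store new only if *count still equals old;
		 * otherwise old is refreshed and the loop retries (jl 0b).
		 */
	} while (!atomic_compare_exchange_weak(count, &old, new));

	*old_out = old;
	return new;
}

static void toy_cs_add_demo(void)
{
	atomic_long count;
	long old, new;

	atomic_init(&count, 0);

	new = toy_cs_add(&count, 1, &old);	/* like __down_read() */
	assert(old == 0 && new == 1);		/* old >= 0: fast path taken */

	new = toy_cs_add(&count, -1, &old);	/* like __up_read() */
	assert(old == 1 && new == 0);		/* new >= 0: nobody to wake */
}
```

In the removed header the lock paths inspect `old` (for example `if (old < 0)` in `__down_read()`), while the unlock paths inspect `new` to decide whether to call `rwsem_wake()`.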
Index: linux-2.6/arch/sh/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/sh/include/asm/rwsem.h
+++ /dev/null
@@ -1,188 +0,0 @@
-/*
- * include/asm-sh/rwsem.h: R/W semaphores for SH using the stuff
- * in lib/rwsem.c.
- */
-
-#ifndef _ASM_SH_RWSEM_H
-#define _ASM_SH_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- long count;
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS (-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
- LIST_HEAD_INIT((name).wait_list) \
- __RWSEM_DEP_MAP_INIT(name) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
-do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
-} while (0)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (atomic_inc_return((atomic_t *)(&sem->count)) > 0)
- smp_wmb();
- else
- rwsem_down_read_failed(sem);
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- while ((tmp = sem->count) >= 0) {
- if (tmp == cmpxchg(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
- smp_wmb();
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = atomic_add_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count));
- if (tmp == RWSEM_ACTIVE_WRITE_BIAS)
- smp_wmb();
- else
- rwsem_down_write_failed(sem);
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- smp_wmb();
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- int tmp;
-
- smp_wmb();
- tmp = atomic_dec_return((atomic_t *)(&sem->count));
- if (tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- smp_wmb();
- if (atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count)) < 0)
- rwsem_wake(sem);
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
- atomic_add(delta, (atomic_t *)(&sem->count));
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- int tmp;
-
- smp_wmb();
- tmp = atomic_add_return(-RWSEM_WAITING_BIAS, (atomic_t *)(&sem->count));
- if (tmp < 0)
- rwsem_downgrade_wake(sem);
-}
-
-static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- __down_write(sem);
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
- smp_mb();
- return atomic_add_return(delta, (atomic_t *)(&sem->count));
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_SH_RWSEM_H */
Index: linux-2.6/arch/sparc/include/asm/rwsem-const.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/rwsem-const.h
+++ /dev/null
@@ -1,12 +0,0 @@
-/* rwsem-const.h: RW semaphore counter constants. */
-#ifndef _SPARC64_RWSEM_CONST_H
-#define _SPARC64_RWSEM_CONST_H
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS 0xffff0000
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-#endif /* _SPARC64_RWSEM_CONST_H */
Index: linux-2.6/arch/sparc/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/sparc/include/asm/rwsem.h
+++ /dev/null
@@ -1,84 +0,0 @@
-/*
- * rwsem.h: R/W semaphores implemented using CAS
- *
- * Written by David S. Miller ([email protected]), 2001.
- * Derived from asm-i386/rwsem.h
- */
-#ifndef _SPARC64_RWSEM_H
-#define _SPARC64_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <asm/rwsem-const.h>
-
-struct rwsem_waiter;
-
-struct rw_semaphore {
- signed int count;
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
- __RWSEM_DEP_MAP_INIT(name) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
-do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
-} while (0)
-
-extern void __down_read(struct rw_semaphore *sem);
-extern int __down_read_trylock(struct rw_semaphore *sem);
-extern void __down_write(struct rw_semaphore *sem);
-extern int __down_write_trylock(struct rw_semaphore *sem);
-extern void __up_read(struct rw_semaphore *sem);
-extern void __up_write(struct rw_semaphore *sem);
-extern void __downgrade_write(struct rw_semaphore *sem);
-
-static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- __down_write(sem);
-}
-
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
- return atomic_add_return(delta, (atomic_t *)(&sem->count));
-}
-
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
- atomic_add(delta, (atomic_t *)(&sem->count));
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-
-#endif /* _SPARC64_RWSEM_H */
Index: linux-2.6/arch/sparc/lib/rwsem.S
===================================================================
--- linux-2.6.orig/arch/sparc/lib/rwsem.S
+++ /dev/null
@@ -1,204 +0,0 @@
-/*
- * Assembly part of rw semaphores.
- *
- * Copyright (C) 1999 Jakub Jelinek ([email protected])
- */
-
-#include <asm/ptrace.h>
-#include <asm/psr.h>
-
- .section .sched.text, "ax"
- .align 4
-
- .globl ___down_read
-___down_read:
- rd %psr, %g3
- nop
- nop
- nop
- or %g3, PSR_PIL, %g7
- wr %g7, 0, %psr
- nop
- nop
- nop
-#ifdef CONFIG_SMP
-1: ldstub [%g1 + 4], %g7
- tst %g7
- bne 1b
- ld [%g1], %g7
- sub %g7, 1, %g7
- st %g7, [%g1]
- stb %g0, [%g1 + 4]
-#else
- ld [%g1], %g7
- sub %g7, 1, %g7
- st %g7, [%g1]
-#endif
- wr %g3, 0, %psr
- add %g7, 1, %g7
- nop
- nop
- subcc %g7, 1, %g7
- bneg 3f
- nop
-2: jmpl %o7, %g0
- mov %g4, %o7
-3: save %sp, -64, %sp
- mov %g1, %l1
- mov %g4, %l4
- bcs 4f
- mov %g5, %l5
- call down_read_failed
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba ___down_read
- restore %l5, %g0, %g5
-4: call down_read_failed_biased
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba 2b
- restore %l5, %g0, %g5
-
- .globl ___down_write
-___down_write:
- rd %psr, %g3
- nop
- nop
- nop
- or %g3, PSR_PIL, %g7
- wr %g7, 0, %psr
- sethi %hi(0x01000000), %g2
- nop
- nop
-#ifdef CONFIG_SMP
-1: ldstub [%g1 + 4], %g7
- tst %g7
- bne 1b
- ld [%g1], %g7
- sub %g7, %g2, %g7
- st %g7, [%g1]
- stb %g0, [%g1 + 4]
-#else
- ld [%g1], %g7
- sub %g7, %g2, %g7
- st %g7, [%g1]
-#endif
- wr %g3, 0, %psr
- add %g7, %g2, %g7
- nop
- nop
- subcc %g7, %g2, %g7
- bne 3f
- nop
-2: jmpl %o7, %g0
- mov %g4, %o7
-3: save %sp, -64, %sp
- mov %g1, %l1
- mov %g4, %l4
- bcs 4f
- mov %g5, %l5
- call down_write_failed
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba ___down_write
- restore %l5, %g0, %g5
-4: call down_write_failed_biased
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba 2b
- restore %l5, %g0, %g5
-
- .text
- .globl ___up_read
-___up_read:
- rd %psr, %g3
- nop
- nop
- nop
- or %g3, PSR_PIL, %g7
- wr %g7, 0, %psr
- nop
- nop
- nop
-#ifdef CONFIG_SMP
-1: ldstub [%g1 + 4], %g7
- tst %g7
- bne 1b
- ld [%g1], %g7
- add %g7, 1, %g7
- st %g7, [%g1]
- stb %g0, [%g1 + 4]
-#else
- ld [%g1], %g7
- add %g7, 1, %g7
- st %g7, [%g1]
-#endif
- wr %g3, 0, %psr
- nop
- nop
- nop
- cmp %g7, 0
- be 3f
- nop
-2: jmpl %o7, %g0
- mov %g4, %o7
-3: save %sp, -64, %sp
- mov %g1, %l1
- mov %g4, %l4
- mov %g5, %l5
- clr %o1
- call __rwsem_wake
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba 2b
- restore %l5, %g0, %g5
-
- .globl ___up_write
-___up_write:
- rd %psr, %g3
- nop
- nop
- nop
- or %g3, PSR_PIL, %g7
- wr %g7, 0, %psr
- sethi %hi(0x01000000), %g2
- nop
- nop
-#ifdef CONFIG_SMP
-1: ldstub [%g1 + 4], %g7
- tst %g7
- bne 1b
- ld [%g1], %g7
- add %g7, %g2, %g7
- st %g7, [%g1]
- stb %g0, [%g1 + 4]
-#else
- ld [%g1], %g7
- add %g7, %g2, %g7
- st %g7, [%g1]
-#endif
- wr %g3, 0, %psr
- sub %g7, %g2, %g7
- nop
- nop
- addcc %g7, %g2, %g7
- bcs 3f
- nop
-2: jmpl %o7, %g0
- mov %g4, %o7
-3: save %sp, -64, %sp
- mov %g1, %l1
- mov %g4, %l4
- mov %g5, %l5
- mov %g7, %o1
- call __rwsem_wake
- mov %l1, %o0
- mov %l1, %g1
- mov %l4, %g4
- ba 2b
- restore %l5, %g0, %g5
Index: linux-2.6/arch/sparc64/lib/rwsem.S
===================================================================
--- linux-2.6.orig/arch/sparc64/lib/rwsem.S
+++ /dev/null
@@ -1,170 +0,0 @@
-/* rwsem.S: RW semaphore assembler.
- *
- * Written by David S. Miller ([email protected]), 2001.
- * Derived from asm-i386/rwsem.h
- */
-
-#include <asm/rwsem-const.h>
-
- .section .sched.text, "ax"
-
- .globl __down_read
-__down_read:
-1: lduw [%o0], %g1
- add %g1, 1, %g7
- cas [%o0], %g1, %g7
- cmp %g1, %g7
- bne,pn %icc, 1b
- add %g7, 1, %g7
- cmp %g7, 0
- membar #StoreLoad | #StoreStore
- bl,pn %icc, 3f
- nop
-2:
- retl
- nop
-3:
- save %sp, -192, %sp
- call rwsem_down_read_failed
- mov %i0, %o0
- ret
- restore
- .size __down_read, .-__down_read
-
- .globl __down_read_trylock
-__down_read_trylock:
-1: lduw [%o0], %g1
- add %g1, 1, %g7
- cmp %g7, 0
- bl,pn %icc, 2f
- mov 0, %o1
- cas [%o0], %g1, %g7
- cmp %g1, %g7
- bne,pn %icc, 1b
- mov 1, %o1
- membar #StoreLoad | #StoreStore
-2: retl
- mov %o1, %o0
- .size __down_read_trylock, .-__down_read_trylock
-
- .globl __down_write
-__down_write:
- sethi %hi(RWSEM_ACTIVE_WRITE_BIAS), %g1
- or %g1, %lo(RWSEM_ACTIVE_WRITE_BIAS), %g1
-1:
- lduw [%o0], %g3
- add %g3, %g1, %g7
- cas [%o0], %g3, %g7
- cmp %g3, %g7
- bne,pn %icc, 1b
- cmp %g7, 0
- membar #StoreLoad | #StoreStore
- bne,pn %icc, 3f
- nop
-2: retl
- nop
-3:
- save %sp, -192, %sp
- call rwsem_down_write_failed
- mov %i0, %o0
- ret
- restore
- .size __down_write, .-__down_write
-
- .globl __down_write_trylock
-__down_write_trylock:
- sethi %hi(RWSEM_ACTIVE_WRITE_BIAS), %g1
- or %g1, %lo(RWSEM_ACTIVE_WRITE_BIAS), %g1
-1:
- lduw [%o0], %g3
- cmp %g3, 0
- bne,pn %icc, 2f
- mov 0, %o1
- add %g3, %g1, %g7
- cas [%o0], %g3, %g7
- cmp %g3, %g7
- bne,pn %icc, 1b
- mov 1, %o1
- membar #StoreLoad | #StoreStore
-2: retl
- mov %o1, %o0
- .size __down_write_trylock, .-__down_write_trylock
-
- .globl __up_read
-__up_read:
-1:
- lduw [%o0], %g1
- sub %g1, 1, %g7
- cas [%o0], %g1, %g7
- cmp %g1, %g7
- bne,pn %icc, 1b
- cmp %g7, 0
- membar #StoreLoad | #StoreStore
- bl,pn %icc, 3f
- nop
-2: retl
- nop
-3: sethi %hi(RWSEM_ACTIVE_MASK), %g1
- sub %g7, 1, %g7
- or %g1, %lo(RWSEM_ACTIVE_MASK), %g1
- andcc %g7, %g1, %g0
- bne,pn %icc, 2b
- nop
- save %sp, -192, %sp
- call rwsem_wake
- mov %i0, %o0
- ret
- restore
- .size __up_read, .-__up_read
-
- .globl __up_write
-__up_write:
- sethi %hi(RWSEM_ACTIVE_WRITE_BIAS), %g1
- or %g1, %lo(RWSEM_ACTIVE_WRITE_BIAS), %g1
-1:
- lduw [%o0], %g3
- sub %g3, %g1, %g7
- cas [%o0], %g3, %g7
- cmp %g3, %g7
- bne,pn %icc, 1b
- sub %g7, %g1, %g7
- cmp %g7, 0
- membar #StoreLoad | #StoreStore
- bl,pn %icc, 3f
- nop
-2:
- retl
- nop
-3:
- save %sp, -192, %sp
- call rwsem_wake
- mov %i0, %o0
- ret
- restore
- .size __up_write, .-__up_write
-
- .globl __downgrade_write
-__downgrade_write:
- sethi %hi(RWSEM_WAITING_BIAS), %g1
- or %g1, %lo(RWSEM_WAITING_BIAS), %g1
-1:
- lduw [%o0], %g3
- sub %g3, %g1, %g7
- cas [%o0], %g3, %g7
- cmp %g3, %g7
- bne,pn %icc, 1b
- sub %g7, %g1, %g7
- cmp %g7, 0
- membar #StoreLoad | #StoreStore
- bl,pn %icc, 3f
- nop
-2:
- retl
- nop
-3:
- save %sp, -192, %sp
- call rwsem_downgrade_wake
- mov %i0, %o0
- ret
- restore
- .size __downgrade_write, .-__downgrade_write
Index: linux-2.6/arch/x86/include/asm/rwsem.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/rwsem.h
+++ /dev/null
@@ -1,265 +0,0 @@
-/* rwsem.h: R/W semaphores implemented using XADD/CMPXCHG for i486+
- *
- * Written by David Howells ([email protected]).
- *
- * Derived from asm-x86/semaphore.h
- *
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consequtive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _ASM_X86_RWSEM_H
-#define _ASM_X86_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <linux/lockdep.h>
-
-struct rwsem_waiter;
-
-extern asmregparm struct rw_semaphore *
- rwsem_down_read_failed(struct rw_semaphore *sem);
-extern asmregparm struct rw_semaphore *
- rwsem_down_write_failed(struct rw_semaphore *sem);
-extern asmregparm struct rw_semaphore *
- rwsem_wake(struct rw_semaphore *);
-extern asmregparm struct rw_semaphore *
- rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-/*
- * the semaphore definition
- */
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS (-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-
-struct rw_semaphore {
- signed long count;
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-
-#define __RWSEM_INITIALIZER(name) \
-{ \
- RWSEM_UNLOCKED_VALUE, __SPIN_LOCK_UNLOCKED((name).wait_lock), \
- LIST_HEAD_INIT((name).wait_list) __RWSEM_DEP_MAP_INIT(name) \
-}
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
-do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
-} while (0)
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- asm volatile("# beginning down_read\n\t"
- LOCK_PREFIX " incl (%%eax)\n\t"
- /* adds 0x00000001, returns the old value */
- " jns 1f\n"
- " call call_rwsem_down_read_failed\n"
- "1:\n\t"
- "# ending down_read\n\t"
- : "+m" (sem->count)
- : "a" (sem)
- : "memory", "cc");
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- __s32 result, tmp;
- asm volatile("# beginning __down_read_trylock\n\t"
- " movl %0,%1\n\t"
- "1:\n\t"
- " movl %1,%2\n\t"
- " addl %3,%2\n\t"
- " jle 2f\n\t"
- LOCK_PREFIX " cmpxchgl %2,%0\n\t"
- " jnz 1b\n\t"
- "2:\n\t"
- "# ending __down_read_trylock\n\t"
- : "+m" (sem->count), "=&a" (result), "=&r" (tmp)
- : "i" (RWSEM_ACTIVE_READ_BIAS)
- : "memory", "cc");
- return result >= 0 ? 1 : 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- int tmp;
-
- tmp = RWSEM_ACTIVE_WRITE_BIAS;
- asm volatile("# beginning down_write\n\t"
- LOCK_PREFIX " xadd %%edx,(%%eax)\n\t"
- /* subtract 0x0000ffff, returns the old value */
- " testl %%edx,%%edx\n\t"
- /* was the count 0 before? */
- " jz 1f\n"
- " call call_rwsem_down_write_failed\n"
- "1:\n"
- "# ending down_write"
- : "+m" (sem->count), "=d" (tmp)
- : "a" (sem), "1" (tmp)
- : "memory", "cc");
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
- __down_write_nested(sem, 0);
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- signed long ret = cmpxchg(&sem->count,
- RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- if (ret == RWSEM_UNLOCKED_VALUE)
- return 1;
- return 0;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- __s32 tmp = -RWSEM_ACTIVE_READ_BIAS;
- asm volatile("# beginning __up_read\n\t"
- LOCK_PREFIX " xadd %%edx,(%%eax)\n\t"
- /* subtracts 1, returns the old value */
- " jns 1f\n\t"
- " call call_rwsem_wake\n"
- "1:\n"
- "# ending __up_read\n"
- : "+m" (sem->count), "=d" (tmp)
- : "a" (sem), "1" (tmp)
- : "memory", "cc");
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- asm volatile("# beginning __up_write\n\t"
- " movl %2,%%edx\n\t"
- LOCK_PREFIX " xaddl %%edx,(%%eax)\n\t"
- /* tries to transition
- 0xffff0001 -> 0x00000000 */
- " jz 1f\n"
- " call call_rwsem_wake\n"
- "1:\n\t"
- "# ending __up_write\n"
- : "+m" (sem->count)
- : "a" (sem), "i" (-RWSEM_ACTIVE_WRITE_BIAS)
- : "memory", "cc", "edx");
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- asm volatile("# beginning __downgrade_write\n\t"
- LOCK_PREFIX " addl %2,(%%eax)\n\t"
- /* transitions 0xZZZZ0001 -> 0xYYYY0001 */
- " jns 1f\n\t"
- " call call_rwsem_downgrade_wake\n"
- "1:\n\t"
- "# ending __downgrade_write\n"
- : "+m" (sem->count)
- : "a" (sem), "i" (-RWSEM_WAITING_BIAS)
- : "memory", "cc");
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
- asm volatile(LOCK_PREFIX "addl %1,%0"
- : "+m" (sem->count)
- : "ir" (delta));
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
- int tmp = delta;
-
- asm volatile(LOCK_PREFIX "xadd %0,%1"
- : "+r" (tmp), "+m" (sem->count)
- : : "memory");
-
- return tmp + delta;
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ASM_X86_RWSEM_H */
Index: linux-2.6/include/linux/rwsem-spinlock.h
===================================================================
--- linux-2.6.orig/include/linux/rwsem-spinlock.h
+++ /dev/null
@@ -1,78 +0,0 @@
-/* rwsem-spinlock.h: fallback C implementation
- *
- * Copyright (c) 2001 David Howells ([email protected]).
- * - Derived partially from ideas by Andrea Arcangeli <[email protected]>
- * - Derived also from comments by Linus
- */
-
-#ifndef _LINUX_RWSEM_SPINLOCK_H
-#define _LINUX_RWSEM_SPINLOCK_H
-
-#ifndef _LINUX_RWSEM_H
-#error "please don't include linux/rwsem-spinlock.h directly, use linux/rwsem.h instead"
-#endif
-
-#include <linux/spinlock.h>
-#include <linux/list.h>
-
-#ifdef __KERNEL__
-
-#include <linux/types.h>
-
-struct rwsem_waiter;
-
-/*
- * the rw-semaphore definition
- * - if activity is 0 then there are no active readers or writers
- * - if activity is +ve then that is the number of active readers
- * - if activity is -1 then there is one active writer
- * - if wait_list is not empty, then there are processes waiting for the semaphore
- */
-struct rw_semaphore {
- __s32 activity;
- spinlock_t wait_lock;
- struct list_head wait_list;
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- struct lockdep_map dep_map;
-#endif
-};
-
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-# define __RWSEM_DEP_MAP_INIT(lockname) , .dep_map = { .name = #lockname }
-#else
-# define __RWSEM_DEP_MAP_INIT(lockname)
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ 0, __SPIN_LOCK_UNLOCKED(name.wait_lock), LIST_HEAD_INIT((name).wait_list) \
- __RWSEM_DEP_MAP_INIT(name) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key);
-
-#define init_rwsem(sem) \
-do { \
- static struct lock_class_key __key; \
- \
- __init_rwsem((sem), #sem, &__key); \
-} while (0)
-
-extern void __down_read(struct rw_semaphore *sem);
-extern int __down_read_trylock(struct rw_semaphore *sem);
-extern void __down_write(struct rw_semaphore *sem);
-extern void __down_write_nested(struct rw_semaphore *sem, int subclass);
-extern int __down_write_trylock(struct rw_semaphore *sem);
-extern void __up_read(struct rw_semaphore *sem);
-extern void __up_write(struct rw_semaphore *sem);
-extern void __downgrade_write(struct rw_semaphore *sem);
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->activity != 0);
-}
-
-#endif /* __KERNEL__ */
-#endif /* _LINUX_RWSEM_SPINLOCK_H */
Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c
+++ /dev/null
@@ -1,316 +0,0 @@
-/* rwsem-spinlock.c: R/W semaphores: contention handling functions for
- * generic spinlock implementation
- *
- * Copyright (c) 2001 David Howells ([email protected]).
- * - Derived partially from idea by Andrea Arcangeli <[email protected]>
- * - Derived also from comments by Linus
- */
-#include <linux/rwsem.h>
-#include <linux/sched.h>
-#include <linux/module.h>
-
-struct rwsem_waiter {
- struct list_head list;
- struct task_struct *task;
- unsigned int flags;
-#define RWSEM_WAITING_FOR_READ 0x00000001
-#define RWSEM_WAITING_FOR_WRITE 0x00000002
-};
-
-/*
- * initialise the semaphore
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
- /*
- * Make sure we are not reinitializing a held semaphore:
- */
- debug_check_no_locks_freed((void *)sem, sizeof(*sem));
- lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
- sem->activity = 0;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here, then:
- * - the 'active count' _reached_ zero
- * - the 'waiting count' is non-zero
- * - the spinlock must be held by the caller
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only woken if wakewrite is non-zero
- */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
-{
- struct rwsem_waiter *waiter;
- struct task_struct *tsk;
- int woken;
-
- waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
-
- if (!wakewrite) {
- if (waiter->flags & RWSEM_WAITING_FOR_WRITE)
- goto out;
- goto dont_wake_writers;
- }
-
- /* if we are allowed to wake writers try to grant a single write lock
- * if there's a writer at the front of the queue
- * - we leave the 'waiting count' incremented to signify potential
- * contention
- */
- if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
- sem->activity = -1;
- list_del(&waiter->list);
- tsk = waiter->task;
- /* Don't touch waiter after ->task has been NULLed */
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
- goto out;
- }
-
- /* grant an infinite number of read locks to the front of the queue */
- dont_wake_writers:
- woken = 0;
- while (waiter->flags & RWSEM_WAITING_FOR_READ) {
- struct list_head *next = waiter->list.next;
-
- list_del(&waiter->list);
- tsk = waiter->task;
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
- woken++;
- if (list_empty(&sem->wait_list))
- break;
- waiter = list_entry(next, struct rwsem_waiter, list);
- }
-
- sem->activity += woken;
-
- out:
- return sem;
-}
-
-/*
- * wake a single writer
- */
-static inline struct rw_semaphore *
-__rwsem_wake_one_writer(struct rw_semaphore *sem)
-{
- struct rwsem_waiter *waiter;
- struct task_struct *tsk;
-
- sem->activity = -1;
-
- waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
- list_del(&waiter->list);
-
- tsk = waiter->task;
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
- return sem;
-}
-
-/*
- * get a read lock on the semaphore
- */
-void __sched __down_read(struct rw_semaphore *sem)
-{
- struct rwsem_waiter waiter;
- struct task_struct *tsk;
-
- spin_lock_irq(&sem->wait_lock);
-
- if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
- /* granted */
- sem->activity++;
- spin_unlock_irq(&sem->wait_lock);
- goto out;
- }
-
- tsk = current;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-
- /* set up my own style of waitqueue */
- waiter.task = tsk;
- waiter.flags = RWSEM_WAITING_FOR_READ;
- get_task_struct(tsk);
-
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we don't need to touch the semaphore struct anymore */
- spin_unlock_irq(&sem->wait_lock);
-
- /* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
- schedule();
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- }
-
- tsk->state = TASK_RUNNING;
- out:
- ;
-}
-
-/*
- * trylock for reading -- returns 1 if successful, 0 if contention
- */
-int __down_read_trylock(struct rw_semaphore *sem)
-{
- unsigned long flags;
- int ret = 0;
-
-
- spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (sem->activity >= 0 && list_empty(&sem->wait_list)) {
- /* granted */
- sem->activity++;
- ret = 1;
- }
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
-
- return ret;
-}
-
-/*
- * get a write lock on the semaphore
- * - we increment the waiting count anyway to indicate an exclusive lock
- */
-void __sched __down_write_nested(struct rw_semaphore *sem, int subclass)
-{
- struct rwsem_waiter waiter;
- struct task_struct *tsk;
-
- spin_lock_irq(&sem->wait_lock);
-
- if (sem->activity == 0 && list_empty(&sem->wait_list)) {
- /* granted */
- sem->activity = -1;
- spin_unlock_irq(&sem->wait_lock);
- goto out;
- }
-
- tsk = current;
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-
- /* set up my own style of waitqueue */
- waiter.task = tsk;
- waiter.flags = RWSEM_WAITING_FOR_WRITE;
- get_task_struct(tsk);
-
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we don't need to touch the semaphore struct anymore */
- spin_unlock_irq(&sem->wait_lock);
-
- /* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
- schedule();
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- }
-
- tsk->state = TASK_RUNNING;
- out:
- ;
-}
-
-void __sched __down_write(struct rw_semaphore *sem)
-{
- __down_write_nested(sem, 0);
-}
-
-/*
- * trylock for writing -- returns 1 if successful, 0 if contention
- */
-int __down_write_trylock(struct rw_semaphore *sem)
-{
- unsigned long flags;
- int ret = 0;
-
- spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (sem->activity == 0 && list_empty(&sem->wait_list)) {
- /* granted */
- sem->activity = -1;
- ret = 1;
- }
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
-
- return ret;
-}
-
-/*
- * release a read lock on the semaphore
- */
-void __up_read(struct rw_semaphore *sem)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&sem->wait_lock, flags);
-
- if (--sem->activity == 0 && !list_empty(&sem->wait_list))
- sem = __rwsem_wake_one_writer(sem);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
-}
-
-/*
- * release a write lock on the semaphore
- */
-void __up_write(struct rw_semaphore *sem)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&sem->wait_lock, flags);
-
- sem->activity = 0;
- if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
-}
-
-/*
- * downgrade a write lock into a read lock
- * - just wake up any readers at the front of the queue
- */
-void __downgrade_write(struct rw_semaphore *sem)
-{
- unsigned long flags;
-
- spin_lock_irqsave(&sem->wait_lock, flags);
-
- sem->activity = 1;
- if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-EXPORT_SYMBOL(__down_read);
-EXPORT_SYMBOL(__down_read_trylock);
-EXPORT_SYMBOL(__down_write_nested);
-EXPORT_SYMBOL(__down_write);
-EXPORT_SYMBOL(__down_write_trylock);
-EXPORT_SYMBOL(__up_read);
-EXPORT_SYMBOL(__up_write);
-EXPORT_SYMBOL(__downgrade_write);
Index: linux-2.6/include/asm-xtensa/rwsem.h
===================================================================
--- linux-2.6.orig/include/asm-xtensa/rwsem.h
+++ /dev/null
@@ -1,168 +0,0 @@
-/*
- * include/asm-xtensa/rwsem.h
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Largely copied from include/asm-ppc/rwsem.h
- *
- * Copyright (C) 2001 - 2005 Tensilica Inc.
- */
-
-#ifndef _XTENSA_RWSEM_H
-#define _XTENSA_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error "Please don't include <asm/rwsem.h> directly, use <linux/rwsem.h> instead."
-#endif
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
- signed long count;
-#define RWSEM_UNLOCKED_VALUE 0x00000000
-#define RWSEM_ACTIVE_BIAS 0x00000001
-#define RWSEM_ACTIVE_MASK 0x0000ffff
-#define RWSEM_WAITING_BIAS (-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
- spinlock_t wait_lock;
- struct list_head wait_list;
-};
-
-#define __RWSEM_INITIALIZER(name) \
- { RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
- LIST_HEAD_INIT((name).wait_list) }
-
-#define DECLARE_RWSEM(name) \
- struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
- sem->count = RWSEM_UNLOCKED_VALUE;
- spin_lock_init(&sem->wait_lock);
- INIT_LIST_HEAD(&sem->wait_list);
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
- if (atomic_add_return(1,(atomic_t *)(&sem->count)) > 0)
- smp_wmb();
- else
- rwsem_down_read_failed(sem);
-}
-
-static inline int __down_read_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- while ((tmp = sem->count) >= 0) {
- if (tmp == cmpxchg(&sem->count, tmp,
- tmp + RWSEM_ACTIVE_READ_BIAS)) {
- smp_wmb();
- return 1;
- }
- }
- return 0;
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = atomic_add_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count));
- if (tmp == RWSEM_ACTIVE_WRITE_BIAS)
- smp_wmb();
- else
- rwsem_down_write_failed(sem);
-}
-
-static inline int __down_write_trylock(struct rw_semaphore *sem)
-{
- int tmp;
-
- tmp = cmpxchg(&sem->count, RWSEM_UNLOCKED_VALUE,
- RWSEM_ACTIVE_WRITE_BIAS);
- smp_wmb();
- return tmp == RWSEM_UNLOCKED_VALUE;
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
- int tmp;
-
- smp_wmb();
- tmp = atomic_sub_return(1,(atomic_t *)(&sem->count));
- if (tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0)
- rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
- smp_wmb();
- if (atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS,
- (atomic_t *)(&sem->count)) < 0)
- rwsem_wake(sem);
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
- atomic_add(delta, (atomic_t *)(&sem->count));
-}
-
-/*
- * downgrade write lock to read lock
- */
-static inline void __downgrade_write(struct rw_semaphore *sem)
-{
- int tmp;
-
- smp_wmb();
- tmp = atomic_add_return(-RWSEM_WAITING_BIAS, (atomic_t *)(&sem->count));
- if (tmp < 0)
- rwsem_downgrade_wake(sem);
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
- smp_mb();
- return atomic_add_return(delta, (atomic_t *)(&sem->count));
-}
-
-static inline int rwsem_is_locked(struct rw_semaphore *sem)
-{
- return (sem->count != 0);
-}
-
-#endif /* _XTENSA_RWSEM_H */
Index: linux-2.6/arch/alpha/Kconfig
===================================================================
--- linux-2.6.orig/arch/alpha/Kconfig
+++ linux-2.6/arch/alpha/Kconfig
@@ -21,10 +21,7 @@ config MMU
bool
default y

-config RWSEM_GENERIC_SPINLOCK
- bool
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
default y

Index: linux-2.6/arch/arm/Kconfig
===================================================================
--- linux-2.6.orig/arch/arm/Kconfig
+++ linux-2.6/arch/arm/Kconfig
@@ -117,13 +117,10 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/avr32/Kconfig
===================================================================
--- linux-2.6.orig/arch/avr32/Kconfig
+++ linux-2.6/arch/avr32/Kconfig
@@ -42,7 +42,7 @@ config HARDIRQS_SW_RESEND
config GENERIC_IRQ_PROBE
def_bool y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
def_bool y

config GENERIC_TIME
Index: linux-2.6/arch/blackfin/Kconfig
===================================================================
--- linux-2.6.orig/arch/blackfin/Kconfig
+++ linux-2.6/arch/blackfin/Kconfig
@@ -13,14 +13,10 @@ config FPU
bool
default n

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
- default n
-
config BLACKFIN
bool
default y
Index: linux-2.6/arch/cris/Kconfig
===================================================================
--- linux-2.6.orig/arch/cris/Kconfig
+++ linux-2.6/arch/cris/Kconfig
@@ -13,13 +13,10 @@ config ZONE_DMA
bool
default y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config GENERIC_IOMAP
bool
default y
Index: linux-2.6/arch/frv/Kconfig
===================================================================
--- linux-2.6.orig/arch/frv/Kconfig
+++ linux-2.6/arch/frv/Kconfig
@@ -11,13 +11,10 @@ config ZONE_DMA
bool
default y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config GENERIC_FIND_NEXT_BIT
bool
default y
Index: linux-2.6/arch/h8300/Kconfig
===================================================================
--- linux-2.6.orig/arch/h8300/Kconfig
+++ linux-2.6/arch/h8300/Kconfig
@@ -26,14 +26,10 @@ config FPU
bool
default n

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
- default n
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/ia64/Kconfig
===================================================================
--- linux-2.6.orig/arch/ia64/Kconfig
+++ linux-2.6/arch/ia64/Kconfig
@@ -59,7 +59,7 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
default y

Index: linux-2.6/arch/m32r/Kconfig
===================================================================
--- linux-2.6.orig/arch/m32r/Kconfig
+++ linux-2.6/arch/m32r/Kconfig
@@ -244,15 +244,10 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
- depends on M32R
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
- default n
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/m68k/Kconfig
===================================================================
--- linux-2.6.orig/arch/m68k/Kconfig
+++ linux-2.6/arch/m68k/Kconfig
@@ -12,13 +12,10 @@ config MMU
bool
default y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/m68knommu/Kconfig
===================================================================
--- linux-2.6.orig/arch/m68knommu/Kconfig
+++ linux-2.6/arch/m68knommu/Kconfig
@@ -22,14 +22,10 @@ config ZONE_DMA
bool
default y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
- default n
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/mips/Kconfig
===================================================================
--- linux-2.6.orig/arch/mips/Kconfig
+++ linux-2.6/arch/mips/Kconfig
@@ -610,13 +610,10 @@ source "arch/mips/vr41xx/Kconfig"

endmenu

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config ARCH_HAS_ILOG2_U32
bool
default n
Index: linux-2.6/arch/mn10300/Kconfig
===================================================================
--- linux-2.6.orig/arch/mn10300/Kconfig
+++ linux-2.6/arch/mn10300/Kconfig
@@ -23,11 +23,9 @@ config NUMA
config UID16
def_bool y

-config RWSEM_GENERIC_SPINLOCK
- def_bool y
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
+ default y

config GENERIC_HARDIRQS_NO__DO_IRQ
def_bool y
Index: linux-2.6/arch/parisc/Kconfig
===================================================================
--- linux-2.6.orig/arch/parisc/Kconfig
+++ linux-2.6/arch/parisc/Kconfig
@@ -28,11 +28,9 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config RWSEM_GENERIC_SPINLOCK
- def_bool y
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
+ default y

config ARCH_HAS_ILOG2_U32
bool
Index: linux-2.6/arch/powerpc/Kconfig
===================================================================
--- linux-2.6.orig/arch/powerpc/Kconfig
+++ linux-2.6/arch/powerpc/Kconfig
@@ -65,10 +65,7 @@ config LOCKDEP_SUPPORT
bool
default y

-config RWSEM_GENERIC_SPINLOCK
- bool
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
default y

Index: linux-2.6/arch/s390/Kconfig
===================================================================
--- linux-2.6.orig/arch/s390/Kconfig
+++ linux-2.6/arch/s390/Kconfig
@@ -23,11 +23,9 @@ config STACKTRACE_SUPPORT
config HAVE_LATENCYTOP_SUPPORT
def_bool y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
-
-config RWSEM_XCHGADD_ALGORITHM
- def_bool y
+ default y

config ARCH_HAS_ILOG2_U32
bool
Index: linux-2.6/arch/sh/Kconfig
===================================================================
--- linux-2.6.orig/arch/sh/Kconfig
+++ linux-2.6/arch/sh/Kconfig
@@ -34,11 +34,9 @@ config ARCH_DEFCONFIG
default "arch/sh/configs/shx3_defconfig" if SUPERH32
default "arch/sh/configs/cayman_defconfig" if SUPERH64

-config RWSEM_GENERIC_SPINLOCK
- def_bool y
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
+ default y

config GENERIC_BUG
def_bool y
Index: linux-2.6/arch/sparc/Kconfig
===================================================================
--- linux-2.6.orig/arch/sparc/Kconfig
+++ linux-2.6/arch/sparc/Kconfig
@@ -169,13 +169,10 @@ config SUN_IO
bool
default y

-config RWSEM_GENERIC_SPINLOCK
+config GENERIC_RWSEM
bool
default y

-config RWSEM_XCHGADD_ALGORITHM
- bool
-
config GENERIC_FIND_NEXT_BIT
bool
default y
Index: linux-2.6/arch/sparc64/Kconfig
===================================================================
--- linux-2.6.orig/arch/sparc64/Kconfig
+++ linux-2.6/arch/sparc64/Kconfig
@@ -214,10 +214,7 @@ config GENERIC_LOCKBREAK
default y
depends on SMP && PREEMPT

-config RWSEM_GENERIC_SPINLOCK
- bool
-
-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
default y

Index: linux-2.6/arch/um/Kconfig.x86
===================================================================
--- linux-2.6.orig/arch/um/Kconfig.x86
+++ linux-2.6/arch/um/Kconfig.x86
@@ -19,11 +19,9 @@ config X86_32
def_bool !64BIT
select HAVE_AOUT

-config RWSEM_XCHGADD_ALGORITHM
- def_bool X86_XADD
-
-config RWSEM_GENERIC_SPINLOCK
- def_bool !X86_XADD
+config GENERIC_RWSEM
+ bool
+ default y

config 3_LEVEL_PGTABLES
bool "Three-level pagetables (EXPERIMENTAL)" if !64BIT
Index: linux-2.6/arch/x86/Kconfig
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig
+++ linux-2.6/arch/x86/Kconfig
@@ -97,11 +97,9 @@ config GENERIC_GPIO
config ARCH_MAY_HAVE_PC_FDC
def_bool y

-config RWSEM_GENERIC_SPINLOCK
- def_bool !X86_XADD
-
-config RWSEM_XCHGADD_ALGORITHM
- def_bool X86_XADD
+config GENERIC_RWSEM
+ bool
+ default y

config ARCH_HAS_CPU_IDLE_WAIT
def_bool y
Index: linux-2.6/arch/xtensa/Kconfig
===================================================================
--- linux-2.6.orig/arch/xtensa/Kconfig
+++ linux-2.6/arch/xtensa/Kconfig
@@ -23,7 +23,7 @@ config XTENSA
with reasonable minimum requirements. The Xtensa Linux project has
a home page at <http://xtensa.sourceforge.net/>.

-config RWSEM_XCHGADD_ALGORITHM
+config GENERIC_RWSEM
bool
default y

Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -33,8 +33,7 @@ obj-$(CONFIG_HAS_IOMEM) += iomap_copy.o
obj-$(CONFIG_CHECK_SIGNATURE) += check_signature.o
obj-$(CONFIG_DEBUG_LOCKING_API_SELFTESTS) += locking-selftest.o
obj-$(CONFIG_DEBUG_SPINLOCK) += spinlock_debug.o
-lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
-lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
+lib-$(CONFIG_GENERIC_RWSEM) += rwsem.o
lib-$(CONFIG_GENERIC_FIND_FIRST_BIT) += find_next_bit.o
lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += find_next_bit.o
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o

2008-11-28 22:48:13

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 11:22:57AM -0800, Mike Waychison wrote:
>> Nick Piggin wrote:
>>> On Thu, Nov 27, 2008 at 11:00:07AM +0100, Peter Zijlstra wrote:
>>> pagemap_read looks like it can use get_user_pages_fast. The smaps and
>>> clear_refs stuff might have been nicer if they could work on ranges
>>> like pagemap. Then they could avoid mmap_sem as well (although maps
>>> would need to be sampled and take mmap_sem I guess).
>>>
>>> One problem with dropping mmap_sem is that it hurts priority/fairness.
>>> And it opens a bit of a (maybe theoretical but not something to completely
>>> ignore) forward progress hole AFAIKS. If mmap_sem is very heavily
>>> contended, then the refault is going to take a while to get through,
>>> and then the page might get reclaimed, etc.
>> Right, this can be an issue. The way around it should be to minimize
>> the length of time any single lock holder can sit on it. Compared to
>> what we have today with:
>>
>> - sleep in major fault with read lock held,
>> - enqueue writer behind it,
>> - and make all other faults wait on the rwsem
>>
>> The retry logic seems to be a lot better for forward progress.
>
> The whole reason why you have the latency is because it is
> guaranteeing forward progress for everyone. The retry logic
> may work out better in that situation, but it does actually
> open a starvation hole.
>

Right. In practice, though, we haven't seen this cause a problem (and it is
why we'll only allow the path to retry once).

Do you have any suggestions on how we could plug this hole? Perhaps we
could pin a reference to the page in the vm_fault structure across the
retry?

2008-11-28 23:04:14

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
>> Nick Piggin wrote:
>>> On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
>>>> Török however identified mmap taking on the order of several
>>>> milliseconds due to this exact problem:
>>>>
>>>> http://lkml.org/lkml/2008/9/12/185
>>> Turns out to be a different problem.
>>>
>> What do you mean?
>
> His is just contending on the write side. The retry patch doesn't help.
>

I disagree. How do you get 'write contention' from the following paragraph:

"Just to confirm that the problem is with pagefaults and mmap, I dropped
the mmap_sem in filemap_fault, and then
I got same performance in my testprogram for mmap and read. Of course
this is totally unsafe, because the mapping could change at any time."

It reads to me that the writers were held off by the readers sleeping in IO.

>
>>>> We generally try to avoid such things, but sometimes it a) can't be
>>>> easily avoided (third party libraries for instance) and b) when it hits
>>>> us, it affects the overall health of the machine/cluster (the monitoring
>>>> daemons get blocked, which isn't very healthy).
>>> Are you doing appropriate posix_fadvise to prefetch in the files before
>>> faulting, and madvise hints if appropriate?
>>>
>> Yes, we've been slowly rolling out fadvise hints, though not to
>> prefetch, and definitely not for faulting. I don't see how issuing a
>> prefetch right before we try to fault in a page is going to help
>> matters. The pages may appear in pagecache, but they won't be uptodate
>> by the time we look at them anyway, so we're back to square one.
>
> The whole point of a prefetch is to issue it sufficiently early so
> it makes a difference. Actually if you can tell quite well where the
> major faults will be, but don't know it sufficiently in advance to
> do very good prefetching, then perhaps we could add a new madvise hint
> to synchronously bring the page in (dropping the mmap_sem over the IO).
>

Or we could just fix the faulting code to drop the mmap_sem for us? I'm
not sure a new madvise flag could help with the 'starvation hole' issue
you brought up.

2008-11-30 19:38:34

by Török Edwin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-28 14:10, Nick Piggin wrote:
> This is what I have.
>
> It does two things. Firstly, it switches x86-64 over to use the xadd
> algorithm rather than the spinlock algorithm. This is actually significant
> in high contention situations, because the spinlock algorithm doesn't allow
> concurrent operations on the lock while the queue of waiters is being
> manipulated.
>
> Secondly, it moves wakeups out from underneath the waiter queue lock. This
> is more significant on bigger machines where wakeup latency is worse and/or
> runqueue locks are very heavily contended.
>
> Now both these changes are going to help *mainly* for the case when there are
> a significant number of readers and writers, I think. So your write-heavy
> workload may not win anything. I noticed some speedup a long time ago on
> some weird java (volanomark) workload.

Hi,

I just tested your patch on top of tip/master, and my test program
segfaulted :(
Either something is wrong in tip/master, in the patch, or in my program.
This is the first time this test program has segfaulted, and it has no
reason to segfault there.


[ 140.624155] scalability[4995]: segfault at 7f9ce137f000 ip 0000000000401a62 sp 00000000454950a0 error 4 in scalability[400000+3000]
[ 401.640738] scalability[5398]: segfault at 7fdbffba3000 ip 0000000000401a62 sp 00000000423d70a0 error 4 in scalability[400000+3000]

Here is the relevant portion, at 401a62 I read from the mapping:

static void mmap_worker_fn(int fd, off_t len)
{
char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
401a4f: 48 89 c7 mov %rax,%rdi
if(data == MAP_FAILED) {
401a52: 74 36 je 401a8a <mmap_worker_fn+0x5a>
perror("mmap");
abort();
401a54: 31 d2 xor %edx,%edx
401a56: 31 c9 xor %ecx,%ecx
static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;

static size_t execute(const char *data, size_t len)
{
size_t sum = 0, i;
for(i=0;i<len;++i)
401a58: 48 85 db test %rbx,%rbx
401a5b: 74 28 je 401a85 <mmap_worker_fn+0x55>
401a5d: 0f 1f 00 nopl (%rax)
if(data[i] == 'd')
++sum;
401a60: 31 c0 xor %eax,%eax
401a62: 80 3c 17 64 cmpb $0x64,(%rdi,%rdx,1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
This simply reads from the mapping

401a66: 0f 94 c0 sete %al
static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;

Steps to reproduce:
# sync; echo 3 >/proc/sys/vm/drop_caches; sync
# echo 0 >/proc/lock_stat
$ sudo ./scalability 16 /usr/bin/
... prints out results for read, and while running mmap_worker ...
... a message about segmentation fault ....

The testprogram is available here:
http://edwintorok.googlepages.com/tst.tar.gz

My .config:
http://edwintorok.googlepages.com/config

Can you reproduce the crash on your box?
Can I help debugging the problem?

Best regards,
--Edwin

2008-11-30 19:55:19

by Török Edwin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-11-29 01:02, Mike Waychison wrote:
> Nick Piggin wrote:
>> On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
>>> Nick Piggin wrote:
>>>> On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
>>>>> Török however identified mmap taking on the order of several
>>>>> milliseconds due to this exact problem:
>>>>>
>>>>> http://lkml.org/lkml/2008/9/12/185
>>>> Turns out to be a different problem.
>>>>
>>> What do you mean?
>>
>> His is just contending on the write side. The retry patch doesn't help.
>>
>
> I disagree. How do you get 'write contention' from the following
> paragraph:
>
> "Just to confirm that the problem is with pagefaults and mmap, I dropped
> the mmap_sem in filemap_fault, and then
> I got same performance in my testprogram for mmap and read. Of course
> this is totally unsafe, because the mapping could change at any time."
>
> It reads to me that the writers were held off by the readers sleeping
> in IO.

It is true that I have a write/write contention too, but do_page_fault
shows up too on lock_stat.

This is my guess at what happens:
* filemap_fault used to sleep with mmap_sem held while waiting for the
page lock.
* the google patch avoids that, which is fine: if page lock can't be
taken, it drops mmap_sem, waits, then retries the fault once
* however after we acquired the page lock, mapping->a_ops->readpage is
invoked, mmap_sem is NOT dropped here:

error = mapping->a_ops->readpage(file, page);
if (!error) {
wait_on_page_locked(page);

If my understanding is correct, ->readpage does the actual disk I/O and
keeps the page locked; when the lock is released we know the I/O has finished.
So wait_on_page_locked(page) runs with mmap_sem held for read during the
disk I/O, preventing sys_mmap/sys_munmap from making progress.

I don't know how to prove/disprove my guess above, suggestions welcome.

Could the patch be changed to also release the mmap_sem after readpage,
and before wait_on_page_locked?

Best regards,
--Edwin

2008-12-01 04:52:18

by Mike Waychison

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

Török Edwin wrote:
> On 2008-11-29 01:02, Mike Waychison wrote:
>> Nick Piggin wrote:
>>> On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
>>>> Nick Piggin wrote:
>>>>> On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
>>>>>> Török however identified mmap taking on the order of several
>>>>>> milliseconds due to this exact problem:
>>>>>>
>>>>>> http://lkml.org/lkml/2008/9/12/185
>>>>> Turns out to be a different problem.
>>>>>
>>>> What do you mean?
>>> His is just contending on the write side. The retry patch doesn't help.
>>>
>> I disagree. How do you get 'write contention' from the following
>> paragraph:
>>
>> "Just to confirm that the problem is with pagefaults and mmap, I dropped
>> the mmap_sem in filemap_fault, and then
>> I got same performance in my testprogram for mmap and read. Of course
>> this is totally unsafe, because the mapping could change at any time."
>>
>> It reads to me that the writers were held off by the readers sleeping
>> in IO.
>
> It is true that I have a write/write contention too, but do_page_fault
> shows up too on lock_stat.
>
> This is my guess at what happens:
> * filemap_fault used to sleep with mmap_sem held while waiting for the
> page lock.
> * the google patch avoids that, which is fine: if page lock can't be
> taken, it drops mmap_sem, waits, then retries the fault once
> * however after we acquired the page lock, mapping->a_ops->readpage is
> invoked, mmap_sem is NOT dropped here:
>
> error = mapping->a_ops->readpage(file, page);
> if (!error) {
> wait_on_page_locked(page);
>
> If my understanding is correct ->readpage does the actual disk I/O, and
> it keeps the page locked, when the lock is released we know it has finished.
> So wait_on_page_locked(page) holds mmap_sem locked for read during the
> disk I/O, preventing sys_mmap/sys_munmap from making progress.
>
> I don't know how to prove/disprove my guess above, suggestions welcome.
>
> Could the patch be changed to also release the mmap_sem after readpage,
> and before wait_on_page_locked?

Ya, my suspicion is that there is still some other code path where we
are waiting on the locked page with mmap_sem still held. Ying and I
will take a closer look this week.

2008-12-01 08:52:18

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Sun, Nov 30, 2008 at 09:38:18PM +0200, Török Edwin wrote:
> On 2008-11-28 14:10, Nick Piggin wrote:
> > This is what I have.
> >
> > It does two things. Firstly, it switches x86-64 over to use the xadd
> > algorithm rather than the spinlock algorithm. This is actually significant
> > in high contention situations, because the spinlock algorithm doesn't allow
> > concurrent operations on the lock while the queue of waiters is being
> > manipulated.
> >
> > Secondly, it moves wakeups out from underneath the waiter queue lock. This
> > is more significant on bigger machines where wakeup latency is worse and/or
> > runqueue locks are very heavily contended.
> >
> > Now both these changes are going to help *mainly* for the case when there are
> > a significant number of readers and writers, I think. So your write-heavy
> > workload may not win anything. I noticed some speedup a long time ago on
> > some weird java (volanomark) workload.
>
> Hi,
>
> I just tested your patch on top of tip/master, and my testprogram has
> segfaulted :(
> It is either something wrong in tip/master or the patch, or my program.
> This is the first time this testprogram segfaults, and it doesn't have a
> reason to segfault there.
>
>
> [ 140.624155] scalability[4995]: segfault at 7f9ce137f000 ip
> 0000000000401a62 sp 00000000454950a0 error 4 in scalability[400000+3000]
> [ 401.640738] scalability[5398]: segfault at 7fdbffba3000 ip
> 0000000000401a62 sp 00000000423d70a0 error 4 in scalability[400000+3000]
>
> Here is the relevant portion, at 401a62 I read from the mapping:
>
> static void mmap_worker_fn(int fd, off_t len)
> {
> char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
> 401a4f: 48 89 c7 mov %rax,%rdi
> if(data == MAP_FAILED) {
> 401a52: 74 36 je 401a8a <mmap_worker_fn+0x5a>
> perror("mmap");
> abort();
> 401a54: 31 d2 xor %edx,%edx
> 401a56: 31 c9 xor %ecx,%ecx
> static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;
>
> static size_t execute(const char *data, size_t len)
> {
> size_t sum = 0, i;
> for(i=0;i<len;++i)
> 401a58: 48 85 db test %rbx,%rbx
> 401a5b: 74 28 je 401a85 <mmap_worker_fn+0x55>
> 401a5d: 0f 1f 00 nopl (%rax)
> if(data[i] == 'd')
> ++sum;
> 401a60: 31 c0 xor %eax,%eax
> 401a62: 80 3c 17 64 cmpb $0x64,(%rdi,%rdx,1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This simply reads from the mapping
>
> 401a66: 0f 94 c0 sete %al
> static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;
>
> Steps to reproduce:
> # sync; echo 3 >/proc/sys/vm/drop_caches; sync
> # echo 0 >/proc/lock_stat
> $ sudo ./scalability 16 /usr/bin/
> ... prints out results for read, and while running mmap_worker ...
> ... a message about segmentation fault ....
>
> The testprogram is available here:
> http://edwintorok.googlepages.com/tst.tar.gz
>
> My .config:
> http://edwintorok.googlepages.com/config
>
> Can you reproduce the crash on your box?
> Can I help debugging the problem?

Hi Edwin,

Drat, sorry. I haven't been able to do very good testing because I'm
overseas away from my normal test systems :P

By the sound of it, the bug is quite likely to be in the patch I sent
you. I will definitely try to have a working patch for you by next week,
if I'm unable to reproduce the problem here.

Thanks,
Nick

2008-12-01 08:58:56

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Sun, Nov 30, 2008 at 09:54:56PM +0200, Török Edwin wrote:
> On 2008-11-29 01:02, Mike Waychison wrote:
> > Nick Piggin wrote:
> >> On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
> >>> Nick Piggin wrote:
> >>>> On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
> >>>>> Török however identified mmap taking on the order of several
> >>>>> milliseconds due to this exact problem:
> >>>>>
> >>>>> http://lkml.org/lkml/2008/9/12/185
> >>>> Turns out to be a different problem.
> >>>>
> >>> What do you mean?
> >>
> >> His is just contending on the write side. The retry patch doesn't help.
> >>
> >
> > I disagree. How do you get 'write contention' from the following
> > paragraph:
> >
> > "Just to confirm that the problem is with pagefaults and mmap, I dropped
> > the mmap_sem in filemap_fault, and then
> > I got same performance in my testprogram for mmap and read. Of course
> > this is totally unsafe, because the mapping could change at any time."
> >
> > It reads to me that the writers were held off by the readers sleeping
> > in IO.
>
> It is true that I have a write/write contention too, but do_page_fault
> shows up too on lock_stat.
>
> This is my guess at what happens:
> * filemap_fault used to sleep with mmap_sem held while waiting for the
> page lock.
> * the google patch avoids that, which is fine: if page lock can't be
> taken, it drops mmap_sem, waits, then retries the fault once
> * however after we acquired the page lock, mapping->a_ops->readpage is
> invoked, mmap_sem is NOT dropped here:
>
> error = mapping->a_ops->readpage(file, page);
> if (!error) {
> wait_on_page_locked(page);
>
> If my understanding is correct ->readpage does the actual disk I/O, and
> it keeps the page locked, when the lock is released we know it has finished.
> So wait_on_page_locked(page) holds mmap_sem locked for read during the
> disk I/O, preventing sys_mmap/sys_munmap from making progress.

Yes that's exactly right. Ahh, the google patch doesn't solve this
case? Interesting...


> I don't know how to prove/disprove my guess above, suggestions welcome.
>
> Could the patch be changed to also release the mmap_sem after readpage,
> and before wait_on_page_locked?

It should be possible somehow, but it is difficult because after
dropping mmap_sem, then we have to basically retry the whole fault
because the vma might have gone away.
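
To make the "retry the whole fault" point concrete, here is a minimal
user-space sketch of the pattern, not kernel code. It assumes a pthread
rwlock standing in for mmap_sem, a `mapped` flag standing in for the vma
lookup, and a sleep standing in for the disk read; `struct mapping`,
`mapping_init` and `handle_fault` are hypothetical names made up for
illustration.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical user-space stand-ins; none of these names are kernel APIs. */
struct mapping {
    pthread_rwlock_t sem;   /* plays the role of mmap_sem */
    bool mapped;            /* protected by sem; a writer (think munmap) clears it */
    atomic_bool uptodate;   /* page state; set when the simulated read completes */
};

enum fault_result { FAULT_OK, FAULT_SIGSEGV };

int mapping_init(struct mapping *m)
{
    m->mapped = true;
    atomic_init(&m->uptodate, false);
    return pthread_rwlock_init(&m->sem, NULL);
}

/*
 * Handle one simulated fault. *retries reports how many times the lock
 * was dropped for I/O and the lookup had to be redone from scratch.
 */
enum fault_result handle_fault(struct mapping *m, int *retries)
{
    *retries = 0;
    for (;;) {
        pthread_rwlock_rdlock(&m->sem);

        /* Revalidate on every pass: the mapping may have gone away
         * while the lock was dropped. */
        if (!m->mapped) {
            pthread_rwlock_unlock(&m->sem);
            return FAULT_SIGSEGV;
        }
        if (atomic_load(&m->uptodate)) {
            pthread_rwlock_unlock(&m->sem);
            return FAULT_OK;
        }

        /* Page not ready: drop the lock before sleeping so that
         * writers are not held off for the duration of the I/O. */
        pthread_rwlock_unlock(&m->sem);
        usleep(1000);                       /* simulated disk read */
        atomic_store(&m->uptodate, true);
        (*retries)++;
        /* Loop: retake the lock and start the lookup over. */
    }
}
```

The cost is visible in the loop: nothing learned before the unlock can be
trusted afterwards, so the lookup is repeated in full on each pass.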

2008-12-01 11:13:20

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Sun, Nov 30, 2008 at 09:38:18PM +0200, Török Edwin wrote:
> On 2008-11-28 14:10, Nick Piggin wrote:
> > This is what I have.
> >
> > It does two things. Firstly, it switches x86-64 over to use the xadd
> > algorithm rather than the spinlock algorithm. This is actually significant
> > in high contention situations, because the spinlock algorithm doesn't allow
> > concurrent operations on the lock while the queue of waiters is being
> > manipulated.
> >
> > Secondly, it moves wakeups out from underneath the waiter queue lock. This
> > is more significant on bigger machines where wakeup latency is worse and/or
> > runqueue locks are very heavily contended.
> >
> > Now both these changes are going to help *mainly* for the case when there are
> > a significant number of readers and writers, I think. So your write-heavy
> > workload may not win anything. I noticed some speedup a long time ago on
> > some weird java (volanomark) workload.
>
> Hi,
>
> I just tested your patch on top of tip/master, and my testprogram has
> segfaulted :(
> It is either something wrong in tip/master or the patch, or my program.
> This is the first time this testprogram segfaults, and it doesn't have a
> reason to segfault there.
>
>
> [ 140.624155] scalability[4995]: segfault at 7f9ce137f000 ip
> 0000000000401a62 sp 00000000454950a0 error 4 in scalability[400000+3000]
> [ 401.640738] scalability[5398]: segfault at 7fdbffba3000 ip
> 0000000000401a62 sp 00000000423d70a0 error 4 in scalability[400000+3000]
>
> Here is the relevant portion, at 401a62 I read from the mapping:
>
> static void mmap_worker_fn(int fd, off_t len)
> {
> char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
> 401a4f: 48 89 c7 mov %rax,%rdi
> if(data == MAP_FAILED) {
> 401a52: 74 36 je 401a8a <mmap_worker_fn+0x5a>
> perror("mmap");
> abort();
> 401a54: 31 d2 xor %edx,%edx
> 401a56: 31 c9 xor %ecx,%ecx
> static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;
>
> static size_t execute(const char *data, size_t len)
> {
> size_t sum = 0, i;
> for(i=0;i<len;++i)
> 401a58: 48 85 db test %rbx,%rbx
> 401a5b: 74 28 je 401a85 <mmap_worker_fn+0x55>
> 401a5d: 0f 1f 00 nopl (%rax)
> if(data[i] == 'd')
> ++sum;
> 401a60: 31 c0 xor %eax,%eax
> 401a62: 80 3c 17 64 cmpb $0x64,(%rdi,%rdx,1)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> This simply reads from the mapping
>
> 401a66: 0f 94 c0 sete %al
> static pthread_mutex_t thrtime_mtx = PTHREAD_MUTEX_INITIALIZER;
>
> Steps to reproduce:
> # sync; echo 3 >/proc/sys/vm/drop_caches; sync
> # echo 0 >/proc/lock_stat
> $ sudo ./scalability 16 /usr/bin/
> ... prints out results for read, and while running mmap_worker ...
> ... a message about segmentation fault ....
>
> The testprogram is available here:
> http://edwintorok.googlepages.com/tst.tar.gz
>
> My .config:
> http://edwintorok.googlepages.com/config
>
> Can you reproduce the crash on your box?
> Can I help debugging the problem?

BTW, I think it should be very easy for your source code (I see you
updated it since the last posting) to give the kernel good hints about the
IO. I will try a few simple tricks and we can see if they help. (Does this
pattern of touching memory correspond well to how your app works?)

Thanks,
Nick

2008-12-01 11:37:53

by Török Edwin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-12-01 13:13, Nick Piggin wrote:
> BTW. I think your source code (I see you updated it since last posting)
> should be very easy to give good hints to the kernel about the IO. I
> will try a few simple tricks and we can see if they help. (this pattern
> of touching memory corresponds well to how your app works?)

It corresponds well to the latencies involved, but only part of the
behaviour:

- in some cases mmap is used to sequentially read a file (PROT_READ,
MAP_PRIVATE), doing operations like memchr and memcpy on it; my testcase
models this
- in some cases it is used to mmap archives and containers that have
the index at the end (like zip), so it jumps back and forth between the
end of the file and the offset indicated there (using pread here may be
better, but using mmap simplified the code a lot)
- there are multiple threads, each processing a different file; the only
data shared between threads is the signature database, so once a thread
has started working on a file, no other thread touches it
- the goal is to process as many files as possible, which works very well
on some files (PE files mostly), but not on others (where I can't load
all cores to 400%)

In either case it pagefaults a lot, and calls mmap() often, which is
what my testcase attempted to model.

You can completely disable mmap usage in clamav, but the last time I tried,
that slowed things down (it falls back to using fread, and reads the entire
file into memory in the case of zip).
Perhaps I should try turning off mmap for just portions.

If you find something that improves my testcase, I can try on the real
application and let you know if it improved or not (and perhaps create a
new testcase).

If you want, you can test on the original application (it's open source
after all!) too.
I found that scanning my local copy of my Gmail inbox is a good
testcase. I can walk you through how to configure/setup clamav to test.

Best regards,
--Edwin

2008-12-01 11:46:10

by Nick Piggin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On Fri, Nov 28, 2008 at 03:02:43PM -0800, Mike Waychison wrote:
> Nick Piggin wrote:
> >On Thu, Nov 27, 2008 at 11:03:40AM -0800, Mike Waychison wrote:
> >>Nick Piggin wrote:
> >>>On Thu, Nov 27, 2008 at 01:28:41AM -0800, Mike Waychison wrote:
> >>>>Török however identified mmap taking on the order of several
> >>>>milliseconds due to this exact problem:
> >>>>
> >>>>http://lkml.org/lkml/2008/9/12/185
> >>>Turns out to be a different problem.
> >>>
> >>What do you mean?
> >
> >His is just contending on the write side. The retry patch doesn't help.
> >
>
> I disagree. How do you get 'write contention' from the following paragraph:
>
> "Just to confirm that the problem is with pagefaults and mmap, I dropped
> the mmap_sem in filemap_fault, and then
> I got same performance in my testprogram for mmap and read. Of course
> this is totally unsafe, because the mapping could change at any time."
>
> It reads to me that the writers were held off by the readers sleeping in IO.

Yeah, I didn't look closely at your patch. I assumed it was dropping the
mmap_sem for the duration of the IO, so I didn't think there could be
significant read hold time there. Of course reads would still show up
somewhat if there is a lot of write contention, but they would not
necessarily be the cause of the problem.

Anyway...


> >>>>We generally try to avoid such things, but sometimes it a) can't be
> >>>>easily avoided (third party libraries for instance) and b) when it hits
> >>>>us, it affects the overall health of the machine/cluster (the
> >>>>monitoring daemons get blocked, which isn't very healthy).
> >>>Are you doing appropriate posix_fadvise to prefetch in the files before
> >>>faulting, and madvise hints if appropriate?
> >>>
> >>Yes, we've been slowly rolling out fadvise hints out, though not to
> >>prefetch, and definitely not for faulting. I don't see how issuing a
> >>prefetch right before we try to fault in a page is going to help
> >>matters. The pages may appear in pagecache, but they won't be uptodate
> >>by the time we look at them anyway, so we're back to square one.
> >
> >The whole point of a prefetch is to issue it sufficiently early so
> >it makes a difference. Actually if you can tell quite well where the
> >major faults will be, but don't know it sufficiently in advance to
> >do very good prefetching, then perhaps we could add a new madvise hint
> >to synchronously bring the page in (dropping the mmap_sem over the IO).
> >
>
> Or we could just fix the faulting code to drop the mmap_sem for us? I'm

Yeah... I don't exactly call it a "fix"... It's tricky code... In other
cases you slow down.

Of course we should try to improve the kernel for user workloads, but we
also need to steer users toward workloads that are going to work well and
be nice for us to work with and optimise in future.

Threads are, more often than not, not a good solution. There are lots of
other locks in the kernel (and userspace) that are going to slow things
down. I bet this workload would actually be much faster if the app were
designed with processes (and shared memory in the cases where data needs
to be shared).
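
As a rough illustration of the processes-plus-shared-memory design, the
sketch below forks a few workers that update a counter in a MAP_SHARED
anonymous mapping, serialised by a PTHREAD_PROCESS_SHARED mutex placed in
that same mapping. `struct shared_counter`, `run_shared_counter` and
MAX_WORKERS are hypothetical names for this example only; a real app
would share its own data structures this way.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 64   /* arbitrary cap for this sketch */

struct shared_counter {
    pthread_mutex_t lock;   /* must live inside the shared mapping */
    long value;
};

/*
 * Fork nproc worker processes that each increment the shared counter
 * iters times. Returns the final count, or -1 on error.
 */
long run_shared_counter(int nproc, long iters)
{
    if (nproc < 1 || nproc > MAX_WORKERS)
        return -1;

    struct shared_counter *sc = mmap(NULL, sizeof(*sc),
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (sc == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0 ||
        pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutex_init(&sc->lock, &attr) != 0) {
        munmap(sc, sizeof(*sc));
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    sc->value = 0;

    pid_t pids[MAX_WORKERS];
    int started = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0) {
            /* Child: each process has its own mm, so its faults and
             * mmap calls do not contend with its siblings. */
            for (long j = 0; j < iters; j++) {
                pthread_mutex_lock(&sc->lock);
                sc->value++;
                pthread_mutex_unlock(&sc->lock);
            }
            _exit(0);
        }
        pids[started++] = pid;
    }

    for (int i = 0; i < started; i++)
        waitpid(pids[i], NULL, 0);

    long result = (started == nproc) ? sc->value : -1;
    pthread_mutex_destroy(&sc->lock);
    munmap(sc, sizeof(*sc));
    return result;
}
```

Only the explicitly shared region is common to the workers; everything
else, including each worker's own file mappings, is private to one
address space.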


> not sure a new madvise flag could help with the 'starvation hole' issue
> you brought up.

Because the madvise is just a hint, and you still go through with the
fault. So yes, in rare cases the fault may need to re-read the page if it
has been reclaimed in the meantime, but those are the same cases where the
fault retry code has a chance of hitting a starvation issue.
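
For reference, the hints that exist today can be issued as in the sketch
below: posix_fadvise(POSIX_FADV_WILLNEED) on the descriptor and
madvise(MADV_WILLNEED) on the mapping, both before the data is touched.
Both calls are advisory and start readahead asynchronously, which is why
they only help when issued early enough. `count_byte_prefetched` is a
hypothetical helper loosely modelled on the byte-counting loop in the test
program quoted earlier in the thread; it is not taken from that program.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint the kernel, map the file read-only, and
 * count the bytes equal to c. Returns the count, or -1 on error.
 */
long count_byte_prefetched(const char *path, char c)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap of length 0 would fail */
        close(fd);
        return 0;
    }
    size_t len = (size_t)st.st_size;

    /* Advisory only, so a failure here is not treated as fatal.
     * A length of 0 means "through to the end of the file". */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);

    char *data = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping keeps the file alive */
    if (data == MAP_FAILED)
        return -1;

    /* Same hint expressed against the mapping itself. */
    (void)madvise(data, len, MADV_WILLNEED);

    long n = 0;
    for (size_t i = 0; i < len; i++)
        if (data[i] == c)
            n++;

    munmap(data, len);
    return n;
}
```

In a real application the hints would be issued well before the loop,
ideally for the next file while the current one is still being processed.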

2008-12-04 22:28:18

by Ying Han

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

I am trying your test program (scalability) in house, but somehow I got
different results from what you saw. I created 8 files, each 1G in size, on
separate drives (to keep disk seek latency from disturbing the results). I
got these numbers without applying the patch, based on 2.6.26. May I ask how
to reproduce the mmap issue you mentioned?

8 CPU
read_worker
1 threads Real time: 101.058262 s (since task start)
2 threads Real time: 50.670456 s (since task start)
4 threads Real time: 25.904657 s (since task start)
8 threads Real time: 20.090677 s (since task start)
--------------------------------------------------------------------------------
mmap_worker
1 threads Real time: 101.340662 s (since task start)
2 threads Real time: 51.484646 s (since task start)
4 threads Real time: 28.414534 s (since task start)
8 threads Real time: 21.785818 s (since task start)

--Ying

On Thu, Nov 27, 2008 at 4:52 AM, Török Edwin <[email protected]> wrote:
> On 2008-11-27 14:39, Nick Piggin wrote:
>> On Thu, Nov 27, 2008 at 02:21:16PM +0200, Török Edwin wrote:
>>
>>> On 2008-11-27 14:03, Nick Piggin wrote:
>>>
>>>>> Running my testcase shows no significant performance difference. What am
>>>>> I doing wrong?
>>>>>
>>>>>
>>>>
>>>> Software may just be doing a lot of mmap/munmap activity. threads +
>>>> mmap is never going to be pretty because it is always going to involve
>>>> broadcasting tlb flushes to other cores... Software writers shouldn't
>>>> be scared of using processes (possibly with some shared memory).
>>>>
>>>>
>>> It would be interesting to compare the performance of a threaded clamd,
>>> and of a clamd that uses multiple processes.
>>> Distributing tasks will be a bit more tricky, since it would need to use
>>> IPC, instead of mutexes and condition variables.
>>>
>>
>> Yes, although you could use PTHREAD_PROCESS_SHARED pthread mutexes on
>> the shared memory I believe (having never tried it myself).
>>
>>
>>
>>>> Actually, a lot of things get faster (like malloc, or file descriptor
>>>> operations) because locks aren't needed.
>>>>
>>>> Despite common perception, processes are actually much *faster* than
>>>> threads when doing common operations like these. They are slightly slower
>>>> sometimes with things like creation and exit, or context switching, but
>>>> if you're doing huge numbers of those operations, then it is unlikely
>>>> to be a performance critical app... :)
>>>>
>>>>
>>> How about distributing tasks to a set of worker threads, is the overhead
>>> of using IPC instead of
>>> mutexes/cond variables acceptable?
>>>
>>
>> It is really going to depend on a lot of things. What is involved in
>> distributing tasks, how many cores and cache/TLB architecture of the
>> system running on, etc.
>>
>> You want to distribute as much work as possible while touching as
>> little memory as possible, in general.
>>
>> But if you're distributing threads over cores, and shared caches are
>> physically tagged (which I think all x86 CPUs are), then you should
>> be able to have multiple processes operate on shared memory just as
>> efficiently as multiple threads I think.
>>
>> And then you also get the advantages of reduced contention on other
>> shared locks and resources.
>>
>
> Thanks for the tips, but let's get back to the original question:
> why don't I see any performance improvement with the fault-retry patches?
>
> My testcase only compares reading files with mmap vs. reading files with
> read, with different numbers of threads.
> Leaving aside other reasons why mmap is slower, there should be some
> speedup by running 4 threads vs 1 thread, but:
>
> 1 thread: read: 27.18, 28.76
> 1 thread: mmap: 25.45, 25.24
> 2 thread: read: 16.03, 15.66
> 2 thread: mmap: 22.20, 20.99
> 4 thread: read: 9.15, 9.12
> 4 thread: mmap: 20.38, 20.47
>
> The speed of 4 threads is about the same as for 2 threads with mmap, yet
> with read it scales nicely.
> And the patch doesn't seem to improve scalability.
> How can I find out if the patch works as expected? [i.e. verify that
> faults are actually retried, and that they don't keep the semaphore locked]
>
>> OK, I'll see if I can find them (am overseas at the moment, and I suspect
>> they are stranded on some stationary rust back home, but I might be able
>> to find them on the web).
>
> Ok.
>
> Best regards,
> --Edwin
>

2008-12-05 06:50:26

by Török Edwin

Subject: Re: [RFC v1][PATCH]page_fault retry with NOPAGE_RETRY

On 2008-12-05 00:27, Ying Han wrote:
> I am trying your test program(scalability) in house, but somehow i got
> different result as you saw. i created 8 files each with 1G size on
> separate drives( to avoid the latency disturbing of disk seek). I got
> this number without applying the batch based on 2.6.26. May i ask how
> to reproduce the mmap issue you mentioned?
>

Hi,

Try using more files, and of smaller size. I was using /usr/bin, which
has 3632 files, and 571M total.
I am using XFS filesystem: /dev/mapper/vg--all-lv--usr on /usr type xfs
(rw,noatime,logbsize=262144,logbufs=8,logdev=/dev/sdg6,inode64)


> 8 CPU
> read_worker
> 1 threads Real time: 101.058262 s (since task start)
> 2 threads Real time: 50.670456 s (since task start)
> 4 threads Real time: 25.904657 s (since task start)
> 8 threads Real time: 20.090677 s (since task start)
> --------------------------------------------------------------------------------
> mmap_worker
> 1 threads Real time: 101.340662 s (since task start)
> 2 threads Real time: 51.484646 s (since task start)
> 4 threads Real time: 28.414534 s (since task start)
> 8 threads Real time: 21.785818 s (since task start)
>


Try 16 threads, so that there is more contention on the read side as well.

Best regards,
--Edwin