2007-01-12 16:01:35

by Ravikiran G Thirumalai

Subject: High lock spin time for zone->lru_lock under extreme conditions

Hi,
We noticed high interrupt hold-off times while running some memory-intensive
tests on a Sun x4600 8-socket, 16-core x86_64 box: softlockups, lost ticks
and even wall-time drift (which is probably a bug in the x86_64 timer
subsystem).

The test was simple: we have 16 processes, each allocating 3.5G of memory,
touching each and every page, and returning. Each process is bound to a node
(socket), with the local node being the preferred node for allocation
(numactl --cpubind=$node ./numa-membomb --preferred=$node). Each socket has
4G of physical memory and two cores. On start of the test, the machine
becomes unresponsive after some time and prints out softlockup and OOM
messages. We then found the cause of the softlockups to be the excessive
spin times on zone->lru_lock. The fact that spin_lock_irq disables
interrupts while spinning made matters much worse. We instrumented the
spin_lock_irq code and found that the spin time on the lru locks was on the
order of a few seconds (tens of seconds at times), while the hold time was
comparatively shorter.
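
For reference, a minimal sketch of what such a per-node membomb does -- the
actual numa-membomb source is not posted in this thread, so the allocation
size and touch pattern below are assumptions, not the real test case:

/* Minimal membomb sketch: allocate ~3.5G and write one byte per page
 * so every page is actually faulted in.  Node binding is done
 * externally via numactl; this is an illustration only. */
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	size_t size = 3500UL << 20;		/* ~3.5G */
	long pagesz = sysconf(_SC_PAGESIZE);
	char *p = malloc(size);
	size_t i;

	if (!p)
		return 1;
	for (i = 0; i < size; i += pagesz)
		p[i] = 1;			/* touch each and every page */
	return 0;				/* exit; everything is freed at once */
}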

We did not use any lock debugging options and used plain old rdtsc to
measure cycles. (We disabled CPU frequency scaling in the BIOS.) All we did
was this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	local_irq_disable();
------------------------>	rdtsc(t1);
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
------------------------>	rdtsc(t2);
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}

On some runs, we found that the zone->lru_lock spun for 33 seconds or more
while the maximal CS time was 3 seconds or so.

While the softlockups and the like went away when we enabled interrupts
during spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
Andi thought this might be exposing a problem with zone->lru_lock and that
it warrants a discussion on lkml; hence this post. Are there any
plans/patches/ideas to address the spin time under such extreme conditions?
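
For context, the idea referred to above is roughly the following -- a
simplified sketch, not the actual patch from that thread, with the EFLAGS
bit test written out by hand: if the lock is contended, re-enable interrupts
while busy-waiting and disable them again before retrying the acquire.

/* Rough sketch of spinning with interrupts enabled; the real x86_64
 * __raw_spin_lock_flags does this in assembly.  Illustration only,
 * not the patch referenced above. */
static inline void raw_spin_lock_irq_friendly(raw_spinlock_t *lock,
					       unsigned long flags)
{
	for (;;) {
		if (__raw_spin_trylock(lock))
			return;				/* acquired with irqs off */
		if (flags & (1UL << 9))			/* EFLAGS.IF was set */
			local_irq_enable();		/* let pending irqs run */
		while (__raw_spin_is_locked(lock))
			cpu_relax();
		local_irq_disable();			/* irqs off again, retry */
	}
}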

I will be happy to provide any additional information (config/dmesg/test
case) if needed.

Thanks,
Kiran


2007-01-12 17:05:27

by Peter Zijlstra

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, 2007-01-12 at 08:01 -0800, Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory intensive
> tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,
> lost ticks and even wall time drifting (which is probably a bug in the
> x86_64 timer subsystem).
>
> The test was simple, we have 16 processes, each allocating 3.5G of memory
> and and touching each and every page and returning. Each of the process is
> bound to a node (socket), with the local node being the preferred node for
> allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
> socket has 4G of physical memory and there are two cores on each socket. On
> start of the test, the machine becomes unresponsive after sometime and
> prints out softlockup and OOM messages. We then found out the cause
> for softlockups being the excessive spin times on zone_lru lock. The fact
> that spin_lock_irq disables interrupts while spinning made matters very bad.
> We instrumented the spin_lock_irq code and found that the spin time on the
> lru locks was in the order of a few seconds (tens of seconds at times) and
> the hold time was comparatively lesser.
>
> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> local_irq_disable();
> ------------------------> rdtsc(t1);
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> ------------------------> rdtsc(t2);
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> while the maximal CS time was 3 seconds or so.
>
> While the softlockups and the like went away by enabling interrupts during
> spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and
> hence warrants a discussion on lkml, hence this post. Are there any
> plans/patches/ideas to address the spin time under such extreme conditions?
>
> I will be happy to provide any additional information (config/dmesg/test
> case if needed.

I have been tinkering with this because -rt shows similar issues.
Find below the patch so far; it works on UP, but it still went *boom*
the last time I tried it on an actual SMP box.

So take this patch only as an indication of the direction I'm working
in.

One concern I have with this approach is cacheline bouncing.
Perhaps I should retain some form of per-cpu data structure.

---
Subject: mm: streamline zone->lock acquisition on lru_cache_add

Because the LRU pages are buffered on a per-cpu basis, the flush of that
buffer is prone to bouncing around zones. Furthermore, release_pages can
also acquire zone->lock.

Streamline all this by replacing the per-cpu buffer with a per-zone
lockless buffer. Once the buffer is filled, flush it and perform all the
needed operations under one lock acquisition.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/mmzone.h | 12 +++
mm/internal.h | 2
mm/page_alloc.c | 21 ++++++
mm/swap.c | 169 +++++++++++++++++++++++++++++++++----------------
4 files changed, 149 insertions(+), 55 deletions(-)

Index: linux-2.6-rt/include/linux/mmzone.h
===================================================================
--- linux-2.6-rt.orig/include/linux/mmzone.h 2007-01-11 16:27:08.000000000 +0100
+++ linux-2.6-rt/include/linux/mmzone.h 2007-01-11 16:32:08.000000000 +0100
@@ -153,6 +153,17 @@ enum zone_type {
#define ZONES_SHIFT 2
#endif

+/*
+ * must be power of 2 to avoid wrap around artifacts
+ */
+#define PAGEBUF_SIZE 32
+
+struct pagebuf {
+ atomic_t head;
+ atomic_t tail;
+ struct page *pages[PAGEBUF_SIZE];
+};
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
@@ -188,6 +199,7 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];

+ struct pagebuf pagebuf;

ZONE_PADDING(_pad1_)

Index: linux-2.6-rt/mm/swap.c
===================================================================
--- linux-2.6-rt.orig/mm/swap.c 2007-01-11 16:27:08.000000000 +0100
+++ linux-2.6-rt/mm/swap.c 2007-01-11 16:36:34.000000000 +0100
@@ -31,6 +31,8 @@
#include <linux/notifier.h>
#include <linux/init.h>

+#include "internal.h"
+
/* How many pages do we try to swap or page in/out together? */
int page_cluster;

@@ -170,49 +172,131 @@ void fastcall mark_page_accessed(struct

EXPORT_SYMBOL(mark_page_accessed);

+static int __pagebuf_add(struct zone *zone, struct page *page)
+{
+ BUG_ON(page_zone(page) != zone);
+
+ switch (page_count(page)) {
+ case 0:
+ BUG();
+
+ case 1:
+ /*
+ * we're the sole owner, don't bother inserting
+ */
+ if (PageActive(page))
+ ClearPageActive(page);
+ __put_page(page);
+ return 0;
+
+ default:
+ VM_BUG_ON(PageLRU(page));
+ SetPageLRU(page);
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+
+ if (unlikely(put_page_testzero(page))) {
+ /*
+ * we are the last, remove again
+ */
+ VM_BUG_ON(!PageLRU(page));
+ ClearPageLRU(page);
+ del_page_from_lru(zone, page);
+ return 0;
+ }
+ }
+
+ return 1;
+}
+
+static int __pagebuf_size(struct pagebuf *pb)
+{
+ int tail = atomic_read(&pb->tail);
+ int head = atomic_read(&pb->head);
+ int size = (head + PAGEBUF_SIZE - tail) % PAGEBUF_SIZE;
+
+ if (!size && pb->pages[tail])
+ size = PAGEBUF_SIZE;
+
+ return size;
+}
+
+static void __pagebuf_flush(struct zone *zone, int nr_pages)
+{
+ struct pagebuf *pb = &zone->pagebuf;
+ int i;
+
+ for (i = 0; i < nr_pages && __pagebuf_size(pb); i++) {
+ int pos = atomic_inc_return(&pb->tail) % PAGEBUF_SIZE;
+ struct page *page = pb->pages[pos];
+ if (page && cmpxchg(&pb->pages[pos], page, NULL) == page) {
+ if (!__pagebuf_add(zone, page))
+ __free_cache_page(zone, page);
+ }
+ }
+}
+
+static void pagebuf_flush(struct zone *zone)
+{
+ spin_lock_irq(&zone->lock);
+ __pagebuf_flush(zone, PAGEBUF_SIZE);
+ spin_unlock_irq(&zone->lock);
+}
+
+static void pagebuf_add(struct zone *zone, struct page *page)
+{
+ struct pagebuf *pb = &zone->pagebuf;
+ int pos;
+
+ pos = atomic_inc_return(&pb->head) % PAGEBUF_SIZE;
+ if (unlikely(cmpxchg(&pb->pages[pos], NULL, page) != NULL)) {
+ /*
+ * This races, but should be harmless
+ */
+ atomic_dec(&pb->head);
+ spin_lock_irq(&zone->lock);
+
+ if (!__pagebuf_add(zone, page))
+ __free_cache_page(zone, page);
+
+flush:
+ __pagebuf_flush(zone, PAGEBUF_SIZE);
+ spin_unlock_irq(&zone->lock);
+ return;
+ }
+
+ if (__pagebuf_size(pb) > PAGEBUF_SIZE/2 &&
+ spin_trylock_irq(&zone->lock))
+ goto flush;
+}
+
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
*/
-static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU_LOCKED(struct pagevec, lru_add_active_pvecs) = { 0, };
-
void fastcall lru_cache_add(struct page *page)
{
- int cpu;
- struct pagevec *pvec = &get_cpu_var_locked(lru_add_pvecs, &cpu);
-
page_cache_get(page);
- if (!pagevec_add(pvec, page))
- __pagevec_lru_add(pvec);
- put_cpu_var_locked(lru_add_pvecs, cpu);
+ pagebuf_add(page_zone(page), page);
}

void fastcall lru_cache_add_active(struct page *page)
{
- int cpu;
- struct pagevec *pvec = &get_cpu_var_locked(lru_add_active_pvecs, &cpu);
-
- page_cache_get(page);
- if (!pagevec_add(pvec, page))
- __pagevec_lru_add_active(pvec);
- put_cpu_var_locked(lru_add_active_pvecs, cpu);
+ VM_BUG_ON(PageActive(page));
+ SetPageActive(page);
+ lru_cache_add(page);
}

void lru_add_drain(void)
{
- struct pagevec *pvec;
- int cpu;
+ int i;
+ struct zone *zones;

- pvec = &get_cpu_var_locked(lru_add_pvecs, &cpu);
- if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
- put_cpu_var_locked(lru_add_pvecs, cpu);
-
- pvec = &get_cpu_var_locked(lru_add_active_pvecs, &cpu);
- if (pagevec_count(pvec))
- __pagevec_lru_add_active(pvec);
- put_cpu_var_locked(lru_add_active_pvecs, cpu);
+ zones = NODE_DATA(cpu_to_node(cpu))->node_zones;
+ for (i = 0; i < MAX_NR_ZONES; i++)
+ pagebuf_flush(&zones[i]);
}

#ifdef CONFIG_NUMA
@@ -351,25 +435,12 @@ void __pagevec_release_nonlru(struct pag
void __pagevec_lru_add(struct pagevec *pvec)
{
int i;
- struct zone *zone = NULL;

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);

- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
- add_page_to_inactive_list(zone, page);
+ pagebuf_add(page_zone(page), page);
}
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
pagevec_reinit(pvec);
}

@@ -378,27 +449,15 @@ EXPORT_SYMBOL(__pagevec_lru_add);
void __pagevec_lru_add_active(struct pagevec *pvec)
{
int i;
- struct zone *zone = NULL;

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);

- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- VM_BUG_ON(PageLRU(page));
- SetPageLRU(page);
VM_BUG_ON(PageActive(page));
SetPageActive(page);
- add_page_to_active_list(zone, page);
+
+ pagebuf_add(page_zone(page), page);
}
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
pagevec_reinit(pvec);
}

Index: linux-2.6-rt/mm/internal.h
===================================================================
--- linux-2.6-rt.orig/mm/internal.h 2007-01-11 16:27:08.000000000 +0100
+++ linux-2.6-rt/mm/internal.h 2007-01-11 16:27:15.000000000 +0100
@@ -37,4 +37,6 @@ static inline void __put_page(struct pag
extern void fastcall __init __free_pages_bootmem(struct page *page,
unsigned int order);

+extern void __free_cache_page(struct zone *zone, struct page *page);
+
#endif
Index: linux-2.6-rt/mm/page_alloc.c
===================================================================
--- linux-2.6-rt.orig/mm/page_alloc.c 2007-01-11 16:27:08.000000000 +0100
+++ linux-2.6-rt/mm/page_alloc.c 2007-01-11 16:27:15.000000000 +0100
@@ -555,6 +555,27 @@ static void __free_pages_ok(struct page
unlock_cpu_pcp(flags, this_cpu);
}

+void __free_cache_page(struct zone *zone, struct page *page)
+{
+ BUG_ON(PageCompound(page));
+
+ if (PageAnon(page))
+ page->mapping = NULL;
+ if (free_pages_check(page))
+ return;
+
+ if (!PageHighMem(page))
+ debug_check_no_locks_freed(page_address(page),PAGE_SIZE);
+
+ arch_free_page(page, 0);
+ kernel_map_pages(page, 1, 0);
+
+ __count_vm_events(PGFREE, 1);
+ zone->all_unreclaimable = 0;
+ zone->pages_scanned = 0;
+ __free_one_page(page, zone, 0);
+}
+
/*
* permit the bootmem allocator to evade page validation on high-order frees
*/


2007-01-12 19:46:41

by Christoph Lameter

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

> The test was simple, we have 16 processes, each allocating 3.5G of memory
> and and touching each and every page and returning. Each of the process is
> bound to a node (socket), with the local node being the preferred node for
> allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
> socket has 4G of physical memory and there are two cores on each socket. On
> start of the test, the machine becomes unresponsive after sometime and
> prints out softlockup and OOM messages. We then found out the cause
> for softlockups being the excessive spin times on zone_lru lock. The fact
> that spin_lock_irq disables interrupts while spinning made matters very bad.
> We instrumented the spin_lock_irq code and found that the spin time on the
> lru locks was in the order of a few seconds (tens of seconds at times) and
> the hold time was comparatively lesser.

So the issue is two processes contending on the zone lock for one node?
You are overallocating the 4G node with two processes attempting to
allocate 7.5GB? So we go off-node for 3.5G of the allocation?

Does the system scale the right way if you stay within the bounds of node
memory? I.e. allocate 1.5GB from each process?

Have you tried increasing the size of the per cpu caches in
/proc/sys/vm/percpu_pagelist_fraction?

> While the softlockups and the like went away by enabling interrupts during
> spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> Andi thought maybe this is exposing a problem with zone->lru_locks and
> hence warrants a discussion on lkml, hence this post. Are there any
> plans/patches/ideas to address the spin time under such extreme conditions?

Could this be a hardware problem? Some issue with atomic ops in the
Sun hardware?

2007-01-12 21:26:08

by Andrew Morton

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, 12 Jan 2007 11:46:22 -0800 (PST)
Christoph Lameter <[email protected]> wrote:

> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I'd assume so. We don't hold lru_lock for 33 seconds ;)

Probably similar symptoms are demonstrable using other locks, if a
suitable workload is chosen.

Increasing PAGEVEC_SIZE might help. But we do allocate those things
on the stack.
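
For context, the on-stack cost comes from the pagevec layout -- roughly as
in 2.6.20, quoted from memory, so double-check the exact definition:

/* include/linux/pagevec.h, roughly as of 2.6.20: a pagevec is a small
 * on-stack array of page pointers.  With PAGEVEC_SIZE 14 the struct is
 * ~128 bytes on x86_64; bumping PAGEVEC_SIZE to 62 adds roughly another
 * 384 bytes to every stack frame that declares one. */
#define PAGEVEC_SIZE	14

struct pagevec {
	unsigned long nr;
	unsigned long cold;
	struct page *pages[PAGEVEC_SIZE];
};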

2007-01-12 21:40:43

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, Jan 12, 2007 at 11:46:22AM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> > The test was simple, we have 16 processes, each allocating 3.5G of memory
> > and and touching each and every page and returning. Each of the process is
> > bound to a node (socket), with the local node being the preferred node for
> > allocation (numactl --cpubind=$node ./numa-membomb --preferred=$node). Each
> > socket has 4G of physical memory and there are two cores on each socket. On
> > start of the test, the machine becomes unresponsive after sometime and
> > prints out softlockup and OOM messages. We then found out the cause
> > for softlockups being the excessive spin times on zone_lru lock. The fact
> > that spin_lock_irq disables interrupts while spinning made matters very bad.
> > We instrumented the spin_lock_irq code and found that the spin time on the
> > lru locks was in the order of a few seconds (tens of seconds at times) and
> > the hold time was comparatively lesser.
>
> So the issue is two processes contenting on the zone lock for one node?
> You are overallocating the 4G node with two processes attempting to
> allocate 7.5GB? So we go off node for 3.5G of the allocation?

Yes.

>
> Does the system scale the right way if you stay within the bounds of node
> memory? I.e. allocate 1.5GB from each process?

Yes. We see problems only when we oversubscribe memory.

>
> Have you tried increasing the size of the per cpu caches in
> /proc/sys/vm/percpu_pagelist_fraction?

No, not yet. I can give it a try.

>
> > While the softlockups and the like went away by enabling interrupts during
> > spinning, as mentioned in http://lkml.org/lkml/2007/1/3/29 ,
> > Andi thought maybe this is exposing a problem with zone->lru_locks and
> > hence warrants a discussion on lkml, hence this post. Are there any
> > plans/patches/ideas to address the spin time under such extreme conditions?
>
> Could this be a hardware problem? Some issue with atomic ops in the
> Sun hardware?

I think that is unlikely -- when we do not oversubscribe memory, the tests
complete quickly without softlockups and the like. Peter has also noticed
this (presumably on different hardware). I would think this could also be a
lock-unfairness case (cpus of the same node getting the lock and starving
out other nodes) under extreme contention.
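
To illustrate the unfairness point: the x86 spinlock of this era is a plain
decrement-and-test loop with no queueing, so nothing stops one node's cpus
from winning the cacheline over and over. A FIFO "ticket" style lock --
only a sketch, not what the kernel used at the time -- would bound the wait:

/* Sketch of a FIFO (ticket) spinlock: each waiter takes a ticket and
 * the lock is granted in ticket order, so no cpu can be starved no
 * matter how the hardware arbitrates the cacheline.  Illustration only. */
struct ticket_lock {
	atomic_t next;			/* next ticket to hand out */
	atomic_t owner;			/* ticket currently being served */
};

static void ticket_lock(struct ticket_lock *lock)
{
	int me = atomic_inc_return(&lock->next) - 1;

	while (atomic_read(&lock->owner) != me)
		cpu_relax();
}

static void ticket_unlock(struct ticket_lock *lock)
{
	atomic_inc(&lock->owner);
}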

2007-01-12 21:46:09

by Christoph Lameter

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:

> > Does the system scale the right way if you stay within the bounds of node
> > memory? I.e. allocate 1.5GB from each process?
>
> Yes. We see problems only when we oversubscribe memory.

Ok, in that case we can have more than 2 processors trying to acquire the
same zone lock. If they have all exhausted their node-local memory and are
all going off-node, then all processors may be hitting the last node that
has some memory left, which will cause a very high degree of contention.

Moreover, most atomic operations are to remote memory, which compounds the
problem by making the atomic ops take longer. Typically, mature NUMA systems
have implemented hardware provisions that can deal with such high degrees of
contention. If this is simply an SMP system that was turned into a NUMA box,
then this is a new hardware scenario for the engineers.

2007-01-13 01:00:55

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, Jan 12, 2007 at 01:45:43PM -0800, Christoph Lameter wrote:
> On Fri, 12 Jan 2007, Ravikiran G Thirumalai wrote:
>
> Moreover mostatomic operations are to remote memory which is also
> increasing the problem by making the atomic ops take longer. Typically
> mature NUMA system have implemented hardware provisions that can deal with
> such high degrees of contention. If this is simply a SMP system that was
> turned into a NUMA box then this is a new hardware scenario for the
> engineers.

This system uses HT (HyperTransport), as all AMD systems do, but it is one
of the 8-socket systems.

I ran the same test on a 2-node Tyan AMD box and did not notice the
atrocious spin times. It would be interesting to see how a 4-socket HT box
would fare. Unfortunately, I do not have access to one. If someone has access
to such a box, I can provide the test case and instrumentation patches.

It could very well be a hardware limitation in this case, which is all the
more reason to enable interrupts while spinning on spin locks. But whether
lru_lock is the issue is another question.

Thanks,
Kiran

2007-01-13 01:11:31

by Andrew Morton

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, 12 Jan 2007 17:00:39 -0800
Ravikiran G Thirumalai <[email protected]> wrote:

> But is
> lru_lock an issue is another question.

I doubt it, although there might be changes we can make in there to
work around it.

<mentions PAGEVEC_SIZE again>

2007-01-13 04:40:17

by Nick Piggin

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

Ravikiran G Thirumalai wrote:
> Hi,
> We noticed high interrupt hold off times while running some memory intensive
> tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,

[...]

> We did not use any lock debugging options and used plain old rdtsc to
> measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> local_irq_disable();
> ------------------------> rdtsc(t1);
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> ------------------------> rdtsc(t2);
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
>
> On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> while the maximal CS time was 3 seconds or so.

What is the "CS time"?

It would be interesting to know how long the maximal lru_lock *hold* time is,
which could give us a better indication of whether it is a hardware problem.

For example, if the maximum hold time is 10ms, then it might indicate a
hardware fairness problem.

--
SUSE Labs, Novell Inc.

2007-01-13 07:37:00

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> Ravikiran G Thirumalai wrote:
> >Hi,
> >We noticed high interrupt hold off times while running some memory
> >intensive
> >tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,
>
> [...]
>
> >We did not use any lock debugging options and used plain old rdtsc to
> >measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> >this:
> >
> >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> >{
> > local_irq_disable();
> > ------------------------> rdtsc(t1);
> > preempt_disable();
> > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > _raw_spin_lock(lock);
> > ------------------------> rdtsc(t2);
> > if (lock->spin_time < (t2 - t1))
> > lock->spin_time = t2 - t1;
> >}
> >
> >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> >while the maximal CS time was 3 seconds or so.
>
> What is the "CS time"?

Critical Section :). This is the maximal time interval I measured from
t2 above to the point where we release the spin lock. This is the hold
time, I guess.

>
> It would be interesting to know how long the maximal lru_lock *hold* time
> is,
> which could give us a better indication of whether it is a hardware problem.
>
> For example, if the maximum hold time is 10ms, that it might indicate a
> hardware fairness problem.

The maximal hold time was about 3s.

2007-01-13 07:43:04

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Fri, Jan 12, 2007 at 05:11:16PM -0800, Andrew Morton wrote:
> On Fri, 12 Jan 2007 17:00:39 -0800
> Ravikiran G Thirumalai <[email protected]> wrote:
>
> > But is
> > lru_lock an issue is another question.
>
> I doubt it, although there might be changes we can make in there to
> work around it.
>
> <mentions PAGEVEC_SIZE again>

I tested with PAGEVEC_SIZE defined to 62 and 126 -- no difference. I still
see the atrociously high spin times.

Thanks,
Kiran

2007-01-13 07:55:10

by Nick Piggin

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

Ravikiran G Thirumalai wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:

>>What is the "CS time"?
>
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time I guess.
>
>
>>It would be interesting to know how long the maximal lru_lock *hold* time
>>is,
>>which could give us a better indication of whether it is a hardware problem.
>>
>>For example, if the maximum hold time is 10ms, that it might indicate a
>>hardware fairness problem.
>
>
> The maximal hold time was about 3s.

Well, then it doesn't seem very surprising that this could cause a 30s wait
time for one CPU in a 16-core system, regardless of fairness.

I guess most of the contention, and the lock hold times, are coming from
vmscan? Do you know exactly which critical sections are the culprits?

--
SUSE Labs, Novell Inc.

2007-01-13 08:00:31

by Andrew Morton

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

> On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[email protected]> wrote:
> On Sat, Jan 13, 2007 at 03:39:45PM +1100, Nick Piggin wrote:
> > Ravikiran G Thirumalai wrote:
> > >Hi,
> > >We noticed high interrupt hold off times while running some memory
> > >intensive
> > >tests on a Sun x4600 8 socket 16 core x86_64 box. We noticed softlockups,
> >
> > [...]
> >
> > >We did not use any lock debugging options and used plain old rdtsc to
> > >measure cycles. (We disable cpu freq scaling in the BIOS). All we did was
> > >this:
> > >
> > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > >{
> > > local_irq_disable();
> > > ------------------------> rdtsc(t1);
> > > preempt_disable();
> > > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > _raw_spin_lock(lock);
> > > ------------------------> rdtsc(t2);
> > > if (lock->spin_time < (t2 - t1))
> > > lock->spin_time = t2 - t1;
> > >}
> > >
> > >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> > >while the maximal CS time was 3 seconds or so.
> >
> > What is the "CS time"?
>
> Critical Section :). This is the maximal time interval I measured from
> t2 above to the time point we release the spin lock. This is the hold
> time I guess.

By no means. The theory here is that CPU A is taking and releasing the
lock at high frequency, but CPU B never manages to get in and take it, in
which case the maximum acquisition time is much larger than the maximum
hold time.

I'd suggest that you use a similar trick to measure the maximum hold time:
start the timer after we got the lock, stop it just before we release the
lock (assuming that the additional rdtsc delay doesn't "fix" things, of
course...)


2007-01-13 19:53:58

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[email protected]> wrote:
> > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > >{
> > > > local_irq_disable();
> > > > ------------------------> rdtsc(t1);
> > > > preempt_disable();
> > > > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > > _raw_spin_lock(lock);
> > > > ------------------------> rdtsc(t2);
> > > > if (lock->spin_time < (t2 - t1))
> > > > lock->spin_time = t2 - t1;
> > > >}
> > > >
> > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> > > >while the maximal CS time was 3 seconds or so.
> > >
> > > What is the "CS time"?
> >
> > Critical Section :). This is the maximal time interval I measured from
> > t2 above to the time point we release the spin lock. This is the hold
> > time I guess.
>
> By no means. The theory here is that CPUA is taking and releasing the
> lock at high frequency, but CPUB never manages to get in and take it. In
> which case the maximum-acquisition-time is much larger than the
> maximum-hold-time.
>
> I'd suggest that you use a similar trick to measure the maximum hold time:
> start the timer after we got the lock, stop it just before we release the
> lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> course...)

Well, that is exactly what I described above as CS time. The
instrumentation goes like this:

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
	unsigned long long t1, t2;

	local_irq_disable();
	t1 = get_cycles_sync();
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	_raw_spin_lock(lock);
	t2 = get_cycles_sync();
	lock->raw_lock.htsc = t2;
	if (lock->spin_time < (t2 - t1))
		lock->spin_time = t2 - t1;
}
...

void __lockfunc _spin_unlock_irq(spinlock_t *lock)
{
	unsigned long long t1;

	spin_release(&lock->dep_map, 1, _RET_IP_);
	t1 = get_cycles_sync();
	if (lock->cs_time < (t1 - lock->raw_lock.htsc))
		lock->cs_time = t1 - lock->raw_lock.htsc;
	_raw_spin_unlock(lock);
	local_irq_enable();
	preempt_enable();
}

Am I missing something? Is this not what you just described? (The
synchronizing rdtsc might not really be required at all locations, but I
doubt it would contribute a significant fraction of the 33s spin time, or
even of the 3s hold time, on a 2.6 GHz Opteron.)

2007-01-13 21:20:36

by Andrew Morton

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

> On Sat, 13 Jan 2007 11:53:34 -0800 Ravikiran G Thirumalai <[email protected]> wrote:
> On Sat, Jan 13, 2007 at 12:00:17AM -0800, Andrew Morton wrote:
> > > On Fri, 12 Jan 2007 23:36:43 -0800 Ravikiran G Thirumalai <[email protected]> wrote:
> > > > >void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > > > >{
> > > > > local_irq_disable();
> > > > > ------------------------> rdtsc(t1);
> > > > > preempt_disable();
> > > > > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > > > > _raw_spin_lock(lock);
> > > > > ------------------------> rdtsc(t2);
> > > > > if (lock->spin_time < (t2 - t1))
> > > > > lock->spin_time = t2 - t1;
> > > > >}
> > > > >
> > > > >On some runs, we found that the zone->lru_lock spun for 33 seconds or more
> > > > >while the maximal CS time was 3 seconds or so.
> > > >
> > > > What is the "CS time"?
> > >
> > > Critical Section :). This is the maximal time interval I measured from
> > > t2 above to the time point we release the spin lock. This is the hold
> > > time I guess.
> >
> > By no means. The theory here is that CPUA is taking and releasing the
> > lock at high frequency, but CPUB never manages to get in and take it. In
> > which case the maximum-acquisition-time is much larger than the
> > maximum-hold-time.
> >
> > I'd suggest that you use a similar trick to measure the maximum hold time:
> > start the timer after we got the lock, stop it just before we release the
> > lock (assuming that the additional rdtsc delay doesn't "fix" things, of
> > course...)
>
> Well, that is exactly what I described above as CS time.

Seeing the code helps.

> The
> instrumentation goes like this:
>
> void __lockfunc _spin_lock_irq(spinlock_t *lock)
> {
> unsigned long long t1,t2;
> local_irq_disable();
> t1 = get_cycles_sync();
> preempt_disable();
> spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> _raw_spin_lock(lock);
> t2 = get_cycles_sync();
> lock->raw_lock.htsc = t2;
> if (lock->spin_time < (t2 - t1))
> lock->spin_time = t2 - t1;
> }
> ...
>
> void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> {
> unsigned long long t1 ;
> spin_release(&lock->dep_map, 1, _RET_IP_);
> t1 = get_cycles_sync();
> if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> lock->cs_time = t1 - lock->raw_lock.htsc;
> _raw_spin_unlock(lock);
> local_irq_enable();
> preempt_enable();
> }
>
> Am I missing something? Is this not what you just described? (The
> synchronizing rdtsc might not be really required at all locations, but I
> doubt if it would contribute a significant fraction to 33s or even
> the 3s hold time on a 2.6 GHZ opteron).

OK, now we need to do a dump_stack() each time we discover a new max hold
time. That might be a bit tricky: the printk code does spinlocking too, so
things could go recursively deadlocky. Maybe make spin_unlock_irq() return
the hold time, then do:

void lru_spin_unlock_irq(struct zone *zone)
{
	long this_time;

	this_time = spin_unlock_irq(&zone->lru_lock);
	if (this_time > zone->max_time) {
		zone->max_time = this_time;
		dump_stack();
	}
}

or similar.



2007-01-16 02:57:06

by Ravikiran G Thirumalai

Subject: Re: High lock spin time for zone->lru_lock under extreme conditions

On Sat, Jan 13, 2007 at 01:20:23PM -0800, Andrew Morton wrote:
>
> Seeing the code helps.

But there was a subtle problem with the hold-time instrumentation here.
The code assumed that a critical section exiting through spin_unlock_irq
was entered with spin_lock_irq, but that is not always the case, and the
hold-time instrumentation goes bad when that happens (as in
shrink_inactive_list).

>
> > The
> > instrumentation goes like this:
> >
> > void __lockfunc _spin_lock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1,t2;
> > local_irq_disable();
> > t1 = get_cycles_sync();
> > preempt_disable();
> > spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
> > _raw_spin_lock(lock);
> > t2 = get_cycles_sync();
> > lock->raw_lock.htsc = t2;
> > if (lock->spin_time < (t2 - t1))
> > lock->spin_time = t2 - t1;
> > }
> > ...
> >
> > void __lockfunc _spin_unlock_irq(spinlock_t *lock)
> > {
> > unsigned long long t1 ;
> > spin_release(&lock->dep_map, 1, _RET_IP_);
> > t1 = get_cycles_sync();
> > if (lock->cs_time < (t1 - lock->raw_lock.htsc))
> > lock->cs_time = t1 - lock->raw_lock.htsc;
> > _raw_spin_unlock(lock);
> > local_irq_enable();
> > preempt_enable();
> > }
> >
...
>
> OK, now we need to do a dump_stack() each time we discover a new max hold
> time. That might a bit tricky: the printk code does spinlocking too so
> things could go recursively deadlocky. Maybe make spin_unlock_irq() return
> the hold time then do:

What I found, after fixing the above, is that the hold time is not bad --
249461 cycles (roughly 96 microseconds) on the 2.6 GHz Opteron with powernow
disabled in the BIOS. The spin time is still on the order of seconds.

Hence this looks like a hardware fairness issue.

Attaching the instrumentation patch with this email.


Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock.h 2007-01-14 22:36:46.694248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock.h 2007-01-15 15:40:36.554248000 -0800
@@ -6,6 +6,18 @@
#include <asm/page.h>
#include <asm/processor.h>

+/* Like get_cycles, but make sure the CPU is synchronized. */
+static inline unsigned long long get_cycles_sync2(void)
+{
+ unsigned long long ret;
+ unsigned eax;
+ /* Don't do an additional sync on CPUs where we know
+ RDTSC is already synchronous. */
+ alternative_io("cpuid", ASM_NOP2, X86_FEATURE_SYNC_RDTSC,
+ "=a" (eax), "0" (1) : "ebx","ecx","edx","memory");
+ rdtscll(ret);
+ return ret;
+}
/*
* Your basic SMP spinlocks, allowing only a single CPU anywhere
*
@@ -34,6 +46,7 @@ static inline void __raw_spin_lock(raw_s
"jle 3b\n\t"
"jmp 1b\n"
"2:\t" : "=m" (lock->slock) : : "memory");
+ lock->htsc = get_cycles_sync2();
}

/*
@@ -62,6 +75,7 @@ static inline void __raw_spin_lock_flags
"jmp 4b\n"
"5:\n\t"
: "+m" (lock->slock) : "r" ((unsigned)flags) : "memory");
+ lock->htsc = get_cycles_sync2();
}
#endif

@@ -74,11 +88,16 @@ static inline int __raw_spin_trylock(raw
:"=q" (oldval), "=m" (lock->slock)
:"0" (0) : "memory");

+ if (oldval)
+ lock->htsc = get_cycles_sync2();
return oldval > 0;
}

static inline void __raw_spin_unlock(raw_spinlock_t *lock)
{
+ unsigned long long t = get_cycles_sync2();
+ if (lock->hold_time < t - lock->htsc)
+ lock->hold_time = t - lock->htsc;
asm volatile("movl $1,%0" :"=m" (lock->slock) :: "memory");
}

Index: linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/asm-x86_64/spinlock_types.h 2007-01-14 22:36:46.714248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/asm-x86_64/spinlock_types.h 2007-01-15 14:23:37.204248000 -0800
@@ -7,9 +7,11 @@

typedef struct {
unsigned int slock;
+ unsigned long long hold_time;
+ unsigned long long htsc;
} raw_spinlock_t;

-#define __RAW_SPIN_LOCK_UNLOCKED { 1 }
+#define __RAW_SPIN_LOCK_UNLOCKED { 1,0,0 }

typedef struct {
unsigned int lock;
Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock.h 2007-01-14 22:36:48.464248000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock.h 2007-01-14 22:41:30.964248000 -0800
@@ -231,8 +231,8 @@ do { \
# define spin_unlock(lock) __raw_spin_unlock(&(lock)->raw_lock)
# define read_unlock(lock) __raw_read_unlock(&(lock)->raw_lock)
# define write_unlock(lock) __raw_write_unlock(&(lock)->raw_lock)
-# define spin_unlock_irq(lock) \
- do { __raw_spin_unlock(&(lock)->raw_lock); local_irq_enable(); } while (0)
+# define spin_unlock_irq(lock) _spin_unlock_irq(lock)
+/* do { __raw_spin_unlock(&(lock)->raw_lock); local_irq_enable(); } while (0)*/
# define read_unlock_irq(lock) \
do { __raw_read_unlock(&(lock)->raw_lock); local_irq_enable(); } while (0)
# define write_unlock_irq(lock) \
Index: linux-2.6.20-rc4.spin_instru/include/linux/spinlock_types.h
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/include/linux/spinlock_types.h 2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.20-rc4.spin_instru/include/linux/spinlock_types.h 2007-01-15 14:27:50.664248000 -0800
@@ -19,6 +19,7 @@

typedef struct {
raw_spinlock_t raw_lock;
+ unsigned long long spin_time;
#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
unsigned int break_lock;
#endif
@@ -78,7 +79,8 @@ typedef struct {
RW_DEP_MAP_INIT(lockname) }
#else
# define __SPIN_LOCK_UNLOCKED(lockname) \
- (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED, \
+ (spinlock_t) { .raw_lock = __RAW_SPIN_LOCK_UNLOCKED,\
+ .spin_time = 0 \
SPIN_DEP_MAP_INIT(lockname) }
#define __RW_LOCK_UNLOCKED(lockname) \
(rwlock_t) { .raw_lock = __RAW_RW_LOCK_UNLOCKED, \
Index: linux-2.6.20-rc4.spin_instru/kernel/spinlock.c
===================================================================
--- linux-2.6.20-rc4.spin_instru.orig/kernel/spinlock.c 2006-11-29 13:57:37.000000000 -0800
+++ linux-2.6.20-rc4.spin_instru/kernel/spinlock.c 2007-01-15 14:29:47.374248000 -0800
@@ -99,10 +99,15 @@ EXPORT_SYMBOL(_spin_lock_irqsave);

void __lockfunc _spin_lock_irq(spinlock_t *lock)
{
+ unsigned long long t1,t2;
local_irq_disable();
+ t1 = get_cycles_sync();
preempt_disable();
spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
_raw_spin_lock(lock);
+ t2 = get_cycles_sync();
+ if (lock->spin_time < (t2 - t1))
+ lock->spin_time = t2 - t1;
}
EXPORT_SYMBOL(_spin_lock_irq);