LinuxLists.cc - Re: 2.6.26-rc: nfsd hangs for a few sec

2008-06-22 02:10:12

Subject: Re: 2.6.26-rc: nfsd hangs for a few sec

2008/6/22 Mel Gorman <[email protected]>:
> 2. The circular lock itself was considered to be a false positive by David
> Chinner (http://lkml.org/lkml/2008/5/11/253). I've added David to the
> cc. I hate to ask the obvious, but is it possible that LOCKDEP checking
> was not turned on for the kernels before 2.6.26-rc1?
Yes, I bisected it with about the same config (as much as possible
with changing Kconfig),
LOCKDEP was enabled for all bisected configs.

> I spotted at least one problem in the patch in a change made to SLAB that
> needs to be fixed but it is not relevant to the problem at hand as I believe
> Alexandar is using SLUB instead of SLAB. That patch is at the end of the
> mail. Christoph, can you double check that patch please?
Yes, I use SLUB.

> I'm assuming that the few seconds are being spent in reclaim rather than
> working out lock dependency logic. Any chance there is profile information
> showing where all the time is being spent? Just in case, does the stall
> still occur with lockdep turned off?
Hm, I cannot forecast the time when I will have this message, I
gathered readprofile statistics
right after the message:
102 add_preempt_count 1.0303
102 net_rx_action 0.2991
104 __flush_tlb_all 2.3111
115 __tcp_push_pending_frames 0.0626
129 e1000_clean_rx_irq 0.1549
132 _read_unlock_irq 1.7838
136 __rcu_read_unlock 1.3878
137 native_read_tsc 7.2105
153 tcp_ack 0.0261
155 __rcu_advance_callbacks 0.8960
197 local_bh_enable 0.8277
205 e1000_clean 0.4092
206 free_hot_cold_page 0.5754
308 _write_unlock_irq 4.1622
352 get_page_from_freelist 0.3157
448 lock_acquired 0.8854
564 acpi_pm_read 28.2000
618 vprintk 0.6897
738 _spin_unlock_irq 9.9730
863 __do_softirq 5.2945
993 lock_release 2.4458
1166 netpoll_setup 1.5630
1633 _spin_unlock_irqrestore 17.0104
1712 lock_acquire 12.7761
1714 kfree 7.6178
1724 __kmalloc_track_caller 7.7309
2308 kmem_cache_free 12.3422
3189 kmem_cache_alloc 18.7588
18261 default_idle 214.8353
43758 total 0.0175

Could it be useful? I am afraid it is not.
I can try to gather it for lesser time around the event.

I did not mentioned a message from e1000 right after the lockdep message:
e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
Tx Queue <0>
TDH <8f>
TDT <8f>
next_to_use <8f>
next_to_clean <44>
buffer_info[next_to_clean]
time_stamp <601d>
next_to_watch <44>
jiffies <647c>
next_to_watch.status <1>

Perhaps, it is actually the reason for nfsd being stalled ?

I have tried the test without lockdep and it seems it does not stall.
But I am not pretty sure, it is harder to caught hanging without
lockdep message.

> At the moment, I'm a little stumped. I'm going to start looking at diffs
> between 2.6.25 and 2.6.26-rc5 and see what jumps out but alternative theories
> are welcome :/
I found it in 2.6.26-rc1, I think it does not need to search after rc1.