2013-05-01 14:43:54

by Shawn Bohrer

Subject: deadlock on vmap_area_lock

I've got two compute clusters with around 350 machines each which are
running kernels based off of 3.1.9 (yes, I realize this is ancient by
today's standards). All of the machines run a 'find' command once an
hour on one of the mounted XFS filesystems. Occasionally these find
commands get stuck, requiring a reboot of the system. I took a peek
today and saw this with perf:

72.22% find [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--98.84%-- vm_map_ram
| _xfs_buf_map_pages
| xfs_buf_get
| xfs_buf_read
| xfs_trans_read_buf
| xfs_da_do_buf
| xfs_da_read_buf
| xfs_dir2_block_getdents
| xfs_readdir
| xfs_file_readdir
| vfs_readdir
| sys_getdents
| system_call_fastpath
| __getdents64
|
|--1.12%-- _xfs_buf_map_pages
| xfs_buf_get
| xfs_buf_read
| xfs_trans_read_buf
| xfs_da_do_buf
| xfs_da_read_buf
| xfs_dir2_block_getdents
| xfs_readdir
| xfs_file_readdir
| vfs_readdir
| sys_getdents
| system_call_fastpath
| __getdents64
--0.04%-- [...]

Looking at the code, my best guess is that we are spinning on
vmap_area_lock, but I could be wrong. This is the only process
spinning on the machine, so I'm assuming that either another process
has blocked while holding the lock, or perhaps this find process has
tried to acquire vmap_area_lock twice?

I've skimmed through the change logs between 3.1 and 3.9 but nothing
stood out as a fix for this bug. Does this ring a bell with anyone? If
I have a machine that is currently in one of these stuck states, does
anyone have any tips for identifying the process currently holding
the lock?

Additionally, as I mentioned before, I have two clusters of roughly
equal size, though one cluster hits this issue more frequently. On
the cluster with approximately 350 machines we get about 10 stuck
machines a month. The other cluster has about 450 machines, but we
only get about 1 or 2 stuck machines a month. Both clusters run the
same find command every hour, but the workloads on the machines are
different. The cluster that hits the issue more frequently tends to
run more memory-intensive jobs.

I'm open to building some debug kernels to help track this down,
though I can't upgrade all of the machines in one shot so it may take
a while to reproduce. I'm happy to provide any other information if
people have questions.

Thanks,
Shawn

2013-05-01 15:57:46

by David Rientjes

Subject: Re: deadlock on vmap_area_lock

On Wed, 1 May 2013, Shawn Bohrer wrote:

> I've got two compute clusters with around 350 machines each which are
> running kernels based off of 3.1.9 (Yes I realize this is ancient by
> todays standards). All of the machines run a 'find' command once an
> hour on one of the mounted XFS filesystems. Occasionally these find
> commands get stuck requiring a reboot of the system. I took a peek
> today and see this with perf:
>
> 72.22% find [kernel.kallsyms] [k] _raw_spin_lock
> |
> --- _raw_spin_lock
> |
> |--98.84%-- vm_map_ram
> | _xfs_buf_map_pages
> | xfs_buf_get
> | xfs_buf_read
> | xfs_trans_read_buf
> | xfs_da_do_buf
> | xfs_da_read_buf
> | xfs_dir2_block_getdents
> | xfs_readdir
> | xfs_file_readdir
> | vfs_readdir
> | sys_getdents
> | system_call_fastpath
> | __getdents64
> |
> |--1.12%-- _xfs_buf_map_pages
> | xfs_buf_get
> | xfs_buf_read
> | xfs_trans_read_buf
> | xfs_da_do_buf
> | xfs_da_read_buf
> | xfs_dir2_block_getdents
> | xfs_readdir
> | xfs_file_readdir
> | vfs_readdir
> | sys_getdents
> | system_call_fastpath
> | __getdents64
> --0.04%-- [...]
>
> Looking at the code my best guess is that we are spinning on
> vmap_area_lock, but I could be wrong. This is the only process
> spinning on the machine so I'm assuming either another process has
> blocked while holding the lock, or perhaps this find process has tried
> to acquire the vmap_area_lock twice?
>

Significant spinlock contention doesn't necessarily mean that there's a
deadlock, but it also doesn't mean the opposite. Depending on your
definition of "occassionally", would it be possible to run with
CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real
deadlock potential?
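
For reference, this is roughly what that means for the .config (option
names as found under "Kernel hacking" in a 3.x tree, so treat this as a
sketch and double-check against your sources):

# lock dependency tracking plus deadlock "proving"
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# optional: per-lock contention statistics in /proc/lock_stat
CONFIG_LOCK_STAT=y

Anything lockdep objects to will show up as a splat in dmesg the first
time the offending lock ordering is exercised, whether or not the
machine actually deadlocks.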

2013-05-01 16:44:30

by Shawn Bohrer

Subject: Re: deadlock on vmap_area_lock

On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> On Wed, 1 May 2013, Shawn Bohrer wrote:
>
> > I've got two compute clusters with around 350 machines each which are
> > running kernels based off of 3.1.9 (Yes I realize this is ancient by
> > todays standards). All of the machines run a 'find' command once an
> > hour on one of the mounted XFS filesystems. Occasionally these find
> > commands get stuck requiring a reboot of the system. I took a peek
> > today and see this with perf:
> >
> > 72.22% find [kernel.kallsyms] [k] _raw_spin_lock
> > |
> > --- _raw_spin_lock
> > |
> > |--98.84%-- vm_map_ram
> > | _xfs_buf_map_pages
> > | xfs_buf_get
> > | xfs_buf_read
> > | xfs_trans_read_buf
> > | xfs_da_do_buf
> > | xfs_da_read_buf
> > | xfs_dir2_block_getdents
> > | xfs_readdir
> > | xfs_file_readdir
> > | vfs_readdir
> > | sys_getdents
> > | system_call_fastpath
> > | __getdents64
> > |
> > |--1.12%-- _xfs_buf_map_pages
> > | xfs_buf_get
> > | xfs_buf_read
> > | xfs_trans_read_buf
> > | xfs_da_do_buf
> > | xfs_da_read_buf
> > | xfs_dir2_block_getdents
> > | xfs_readdir
> > | xfs_file_readdir
> > | vfs_readdir
> > | sys_getdents
> > | system_call_fastpath
> > | __getdents64
> > --0.04%-- [...]
> >
> > Looking at the code my best guess is that we are spinning on
> > vmap_area_lock, but I could be wrong. This is the only process
> > spinning on the machine so I'm assuming either another process has
> > blocked while holding the lock, or perhaps this find process has tried
> > to acquire the vmap_area_lock twice?
> >
>
> Significant spinlock contention doesn't necessarily mean that there's a
> deadlock, but it also doesn't mean the opposite.

Correct, it doesn't, and I can't prove the find command is not making
progress; however, these finds normally complete in under 15 min and
we've let the stuck ones run for days. Additionally, if this were just
contention I'd expect to see multiple threads/CPUs contending, and I
only have a single CPU pegged running find at 99%. I should clarify
that the perf snippet above was for the entire system. Profiling just
the find command shows:

82.56% find [kernel.kallsyms] [k] _raw_spin_lock
16.63% find [kernel.kallsyms] [k] vm_map_ram
0.13% find [kernel.kallsyms] [k] hrtimer_interrupt
0.04% find [kernel.kallsyms] [k] update_curr
0.03% find [igb] [k] igb_poll
0.03% find [kernel.kallsyms] [k] irqtime_account_process_tick
0.03% find [kernel.kallsyms] [k] account_system_vtime
0.03% find [kernel.kallsyms] [k] task_tick_fair
0.03% find [kernel.kallsyms] [k] perf_event_task_tick
0.03% find [kernel.kallsyms] [k] scheduler_tick
0.03% find [kernel.kallsyms] [k] rb_erase
0.02% find [kernel.kallsyms] [k] native_write_msr_safe
0.02% find [kernel.kallsyms] [k] native_sched_clock
0.02% find [kernel.kallsyms] [k] dma_issue_pending_all
0.02% find [kernel.kallsyms] [k] handle_irq_event_percpu
0.02% find [kernel.kallsyms] [k] timerqueue_del
0.02% find [kernel.kallsyms] [k] run_timer_softirq
0.02% find [kernel.kallsyms] [k] get_mm_counter
0.02% find [kernel.kallsyms] [k] __rcu_pending
0.02% find [kernel.kallsyms] [k] tick_program_event
0.01% find [kernel.kallsyms] [k] __netif_receive_skb
0.01% find [kernel.kallsyms] [k] ip_route_input_common
0.01% find [kernel.kallsyms] [k] __insert_vmap_area
0.01% find [igb] [k] igb_alloc_rx_buffers_adv
0.01% find [kernel.kallsyms] [k] irq_exit
0.01% find [kernel.kallsyms] [k] acct_update_integrals
0.01% find [kernel.kallsyms] [k] apic_timer_interrupt
0.01% find [kernel.kallsyms] [k] tick_sched_timer
0.01% find [kernel.kallsyms] [k] __remove_hrtimer
0.01% find [kernel.kallsyms] [k] do_IRQ
0.01% find [kernel.kallsyms] [k] dev_gro_receive
0.01% find [kernel.kallsyms] [k] net_rx_action
0.01% find [kernel.kallsyms] [k] classify
0.01% find [kernel.kallsyms] [k] __udp4_lib_mcast_deliver
0.01% find [kernel.kallsyms] [k] rb_next
0.01% find [kernel.kallsyms] [k] smp_apic_timer_interrupt
0.01% find [kernel.kallsyms] [k] intel_pmu_disable_all
0.01% find [ioatdma] [k] ioat2_issue_pending
0.01% find [kernel.kallsyms] [k] get_partial_node
0.01% find [kernel.kallsyms] [k] ktime_get
0.01% find [kernel.kallsyms] [k] radix_tree_lookup_element
0.01% find [kernel.kallsyms] [k] swiotlb_map_page
0.01% find [kernel.kallsyms] [k] __schedule
0.01% find [kernel.kallsyms] [k] _raw_spin_lock_irq
0.01% find [kernel.kallsyms] [k] hrtimer_forward
0.01% find [kernel.kallsyms] [k] sched_clock_tick
0.01% find [kernel.kallsyms] [k] clockevents_program_event
0.01% find [kernel.kallsyms] [k] raw_local_deliver
0.01% find [kernel.kallsyms] [k] exit_idle
0.01% find [kernel.kallsyms] [k] sched_clock_cpu
0.01% find [kernel.kallsyms] [k] sched_clock
0.01% find [kernel.kallsyms] [k] idle_cpu
0.01% find [kernel.kallsyms] [k] update_process_times
0.01% find [kernel.kallsyms] [k] tick_dev_program_event

> Depending on your
> definition of "occassionally", would it be possible to run with
> CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real
> deadlock potential?

Yeah, I can probably enable these on a few machines and hope I get
lucky. These machines are used for real work, so I'll have to gauge
how significant the performance impact is to determine how many
machines I can sacrifice to the cause.

Thanks,
Shawn

2013-05-01 17:01:50

by David Rientjes

Subject: Re: deadlock on vmap_area_lock

On Wed, 1 May 2013, Shawn Bohrer wrote:

> Correct it doesn't and I can't prove the find command is not making
> progress, however these finds normally complete in under 15 min and
> we've let the stuck ones run for days. Additionally if this was just
> contention I'd expect to see multiple threads/CPUs contending and I
> only have a single CPU pegged running find at 99%. I should clarify
> that the perf snippet above was for the entire system. Profiling just
> the find command shows:
>
> 82.56% find [kernel.kallsyms] [k] _raw_spin_lock

A couple of options to figure out which spinlock this is: use lockstat
(see Documentation/lockstat.txt), which will also require a kernel
rebuild, some human intervention to collect the stats, and the
accompanying performance degradation; or you could try collecting
/proc/$(pidof find)/stack at regular intervals and figure out which
spinlock it is.
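
For the second option, something as simple as the following, run by hand
on a stuck box, would probably be enough (just a sketch; /proc/<pid>/stack
needs CONFIG_STACKTRACE):

# sample the kernel stack of the stuck find once a second
pid=$(pidof find)
while sleep 1; do
    date
    cat /proc/$pid/stack
    echo ----
done

If the stack keeps ending in the same place under vm_map_ram, that
narrows down which spinlock it is actually sitting on.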

> > Depending on your
> > definition of "occassionally", would it be possible to run with
> > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real
> > deadlock potential?
>
> Yeah, I can probably enable these on a few machines and hope I get
> lucky. These machines are used for real work so I'll have to gauge
> what how significant the performance impact is to determine how many
> machines I can sacrifice to the cause.
>

You'll probably only need to enable it on one machine; if a deadlock
possibility exists here, then lockdep will find it even without hitting
it, since it simply has to exercise the path that leads to it. It does
impose a performance degradation on that one machine, though.

2013-05-01 22:03:10

by Dave Chinner

Subject: Re: deadlock on vmap_area_lock

On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> On Wed, 1 May 2013, Shawn Bohrer wrote:
>
> > I've got two compute clusters with around 350 machines each which are
> > running kernels based off of 3.1.9 (Yes I realize this is ancient by
> > todays standards).

xfs_info output of one of those filesystems? What platform are you
running (32 or 64 bit)?

> > All of the machines run a 'find' command once an
> > hour on one of the mounted XFS filesystems. Occasionally these find
> > commands get stuck requiring a reboot of the system. I took a peek
> > today and see this with perf:
> >
> > 72.22% find [kernel.kallsyms] [k] _raw_spin_lock
> > |
> > --- _raw_spin_lock
> > |
> > |--98.84%-- vm_map_ram
> > | _xfs_buf_map_pages
> > | xfs_buf_get
> > | xfs_buf_read
> > | xfs_trans_read_buf
> > | xfs_da_do_buf
> > | xfs_da_read_buf
> > | xfs_dir2_block_getdents
> > | xfs_readdir
> > | xfs_file_readdir
> > | vfs_readdir
> > | sys_getdents
> > | system_call_fastpath
> > | __getdents64
> > |
> > |--1.12%-- _xfs_buf_map_pages
> > | xfs_buf_get
> > | xfs_buf_read
> > | xfs_trans_read_buf
> > | xfs_da_do_buf
> > | xfs_da_read_buf
> > | xfs_dir2_block_getdents
> > | xfs_readdir
> > | xfs_file_readdir
> > | vfs_readdir
> > | sys_getdents
> > | system_call_fastpath
> > | __getdents64
> > --0.04%-- [...]
> >
> > Looking at the code my best guess is that we are spinning on
> > vmap_area_lock, but I could be wrong. This is the only process
> > spinning on the machine so I'm assuming either another process has
> > blocked while holding the lock, or perhaps this find process has tried
> > to acquire the vmap_area_lock twice?
> >
>
> Significant spinlock contention doesn't necessarily mean that there's a
> deadlock, but it also doesn't mean the opposite. Depending on your
> definition of "occassionally", would it be possible to run with
> CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real
> deadlock potential?

It sure will. We've been reporting that vm_map_ram is doing
GFP_KERNEL allocations from GFP_NOFS context for years, and have
reported plenty of lockdep dumps as a result of it.

But that's not the problem that is occurring above - lockstat is
probably a good thing to look at here to determine exactly what
locks are being contended on....
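
If you have CONFIG_LOCK_STAT in that debug build, collecting the numbers
is roughly this (see Documentation/lockstat.txt for the details):

# turn collection on and clear any existing counters
echo 1 > /proc/sys/kernel/lock_stat
echo 0 > /proc/lock_stat
# ...wait for a find to get stuck, then dump the suspect lock
grep -A 6 vmap_area_lock /proc/lock_stat

The contention and wait-time columns should make it pretty obvious
whether vmap_area_lock (or something else entirely) is the lock being
hammered.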

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-05-02 16:21:20

by Shawn Bohrer

Subject: Re: deadlock on vmap_area_lock

On Thu, May 02, 2013 at 08:03:04AM +1000, Dave Chinner wrote:
> On Wed, May 01, 2013 at 08:57:38AM -0700, David Rientjes wrote:
> > On Wed, 1 May 2013, Shawn Bohrer wrote:
> >
> > > I've got two compute clusters with around 350 machines each which are
> > > running kernels based off of 3.1.9 (Yes I realize this is ancient by
> > > todays standards).
>
> xfs_info output of one of those filesystems? What platform are you
> running (32 or 64 bit)?

# cat /proc/mounts | grep data-cache
/dev/sdb1 /data-cache xfs rw,nodiratime,relatime,attr2,delaylog,noquota 0 0
# xfs_info /data-cache
meta-data=/dev/sdb1              isize=256    agcount=4, agsize=66705344 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=266821376, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=130283, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

These are 64-bit systems. The ones that hit the issue more frequently
have 96 GB of RAM.

> > > All of the machines run a 'find' command once an
> > > hour on one of the mounted XFS filesystems. Occasionally these find
> > > commands get stuck requiring a reboot of the system. I took a peek
> > > today and see this with perf:
> > >
> > > 72.22% find [kernel.kallsyms] [k] _raw_spin_lock
> > > |
> > > --- _raw_spin_lock
> > > |
> > > |--98.84%-- vm_map_ram
> > > | _xfs_buf_map_pages
> > > | xfs_buf_get
> > > | xfs_buf_read
> > > | xfs_trans_read_buf
> > > | xfs_da_do_buf
> > > | xfs_da_read_buf
> > > | xfs_dir2_block_getdents
> > > | xfs_readdir
> > > | xfs_file_readdir
> > > | vfs_readdir
> > > | sys_getdents
> > > | system_call_fastpath
> > > | __getdents64
> > > |
> > > |--1.12%-- _xfs_buf_map_pages
> > > | xfs_buf_get
> > > | xfs_buf_read
> > > | xfs_trans_read_buf
> > > | xfs_da_do_buf
> > > | xfs_da_read_buf
> > > | xfs_dir2_block_getdents
> > > | xfs_readdir
> > > | xfs_file_readdir
> > > | vfs_readdir
> > > | sys_getdents
> > > | system_call_fastpath
> > > | __getdents64
> > > --0.04%-- [...]
> > >
> > > Looking at the code my best guess is that we are spinning on
> > > vmap_area_lock, but I could be wrong. This is the only process
> > > spinning on the machine so I'm assuming either another process has
> > > blocked while holding the lock, or perhaps this find process has tried
> > > to acquire the vmap_area_lock twice?
> > >
> >
> > Significant spinlock contention doesn't necessarily mean that there's a
> > deadlock, but it also doesn't mean the opposite. Depending on your
> > definition of "occassionally", would it be possible to run with
> > CONFIG_PROVE_LOCKING and CONFIG_LOCKDEP to see if it uncovers any real
> > deadlock potential?
>
> It sure will. We've been reporting that vm_map_ram is doing
> GFP_KERNEL allocations from GFP_NOFS context for years, and have
> reported plenty of lockdep dumps as a result of it.
>
> But that's not the problem that is occurring above - lockstat is
> probably a good thing to look at here to determine exactly what
> locks are being contended on....

I've built a kernel with lock_stat, CONFIG_PROVE_LOCKING, and
CONFIG_LOCKDEP, and have one machine running with that kernel. We'll
probably put more machines on this debug kernel as we reboot them, and
hopefully one will trigger the issue.
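
Since lockdep is in the debug kernel, it should also let us dump every
lock currently held on a stuck machine via sysrq, which would answer the
"who is holding it" question directly. Roughly (untested on our kernels):

# sysrq 'd' ("show-all-locks") is available when CONFIG_LOCKDEP is set
echo 1 > /proc/sys/kernel/sysrq    # make sure sysrq is enabled
echo d > /proc/sysrq-trigger       # dump all currently held locks
dmesg | tail -n 100

If some task really is sitting on vmap_area_lock, that should show up
there.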

Thanks,
Shawn
