Hi all,
I'm seeing these BUGs with 3.12.9-rt13.... finally caught the messages.
I was getting frozen machines with no traces left behind, this could
possibly be it (see below - I have to retest with rt15)
-- Fernando
[771508.546420] BUG: soft lockup - CPU#5 stuck for 23s! [SweepSinVsUsm:1421]
[771508.546431] Modules linked in: bnep bluetooth fuse tun act_police
cls_basic cls_flow cls_fw cls_u32 sch_fq_codel sch_tbf sch_prio sch_htb
sch_hfsc sch_ingress sch_sfq nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd sunrpc fscache bridge stp llc xt_CHECKSUM
ipt_rpfilter xt_statistic xt_CT xt_LOG xt_connlimit xt_realm xt_addrtype
xt_comment xt_recent xt_nat ipt_ULOG ipt_MASQUERADE ipt_ECN
ipt_CLUSTERIP ipt_ah xt_set ip_set nf_nat_tftp nf_nat_snmp_basic
nf_conntrack_snmp nf_nat_sip nf_nat_pptp nf_nat_proto_gre nf_nat_irc
nf_nat_h323 nf_nat_ftp nf_nat_amanda ts_kmp nf_conntrack_amanda
nf_conntrack_sane nf_conntrack_tftp nf_conntrack_sip
nf_conntrack_proto_udplite nf_conntrack_proto_sctp nf_conntrack_pptp
nf_conntrack_proto_gre nf_conntrack_netlink nf_conntrack_netbios_ns
nf_conntrack_broadcast
[771508.546441] nf_conntrack_irc nf_conntrack_h323 nf_conntrack_ftp
xt_TPROXY xt_time xt_TCPMSS xt_tcpmss xt_sctp xt_policy xt_pkttype
xt_physdev xt_owner xt_NFQUEUE xt_NFLOG nfnetlink_log xt_multiport
xt_mark xt_mac xt_limit xt_length xt_iprange xt_helper xt_hashlimit
xt_DSCP xt_dscp xt_dccp xt_connmark ebtable_nat xt_CLASSIFY ebtables
xt_AUDIT xt_state iptable_raw iptable_nat nf_nat_ipv4 nf_nat
nf_conntrack_ipv4 nf_defrag_ipv4 iptable_mangle nfnetlink ip6t_REJECT
nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack
ip6table_filter ip6_tables hwmon_vid iTCO_wdt iTCO_vendor_support
gpio_ich eeepc_wmi asus_wmi sparse_keymap rfkill snd_ice1712 snd_cs8427
x86_pkg_temp_thermal snd_i2c snd_ice17xx_ak4xxx snd_ak4xxx_adda coretemp
snd_mpu401_uart snd_rawmidi snd_ac97_codec kvm ac97_bus crct10dif_pclmul
[771508.546446] crc32_pclmul snd_seq crc32c_intel snd_seq_device
ghash_clmulni_intel snd_pcm microcode snd_page_alloc snd_timer snd
serio_raw soundcore r8169 i2c_i801 lpc_ich mfd_core mii mei_me mei
shpchp binfmt_misc usb_storage nouveau i2c_algo_bit drm_kms_helper ttm
drm i2c_core mxm_wmi video wmi
[771508.546447] CPU: 5 PID: 1421 Comm: SweepSinVsUsm Tainted: G W
3.12.9-301.rt13.1.fc20.ccrma.x86_64+rt #1
[771508.546447] Hardware name: System manufacturer System Product
Name/P8Z77-M, BIOS 1406 07/19/2012
[771508.546447] task: ffff8804af8b8000 ti: ffff8807ab410000 task.ti:
ffff8807ab410000
[771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
smp_call_function_many+0x2ca/0x330
[771508.546449] RSP: 0018:ffff8807ab411cf0 EFLAGS: 00000202
[771508.546450] RAX: 0000000000000001 RBX: 0000000000000000 RCX:
ffff8807fe2e9918
[771508.546450] RDX: 0000000000000001 RSI: 0000000000000400 RDI:
0000000000000000
[771508.546450] RBP: ffff8807ab411d50 R08: ffff8807fe4e6108 R09:
0000000000000010
[771508.546451] R10: ffff8807fe4e6108 R11: 0000000000000246 R12:
0000000000000000
[771508.546451] R13: 0000000000000000 R14: 00000000000000df R15:
ffffffff810432e4
[771508.546452] FS: 00007faf34514700(0000) GS:ffff8807fe480000(0000)
knlGS:0000000000000000
[771508.546452] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[771508.546453] CR2: 00007faebb93d000 CR3: 00000006bafd8000 CR4:
00000000001407e0
[771508.546453] Stack:
[771508.546454] ffff8807fe4e6188 0000000100000000 00000000000660c0
ffff8807ab411d60
[771508.546455] ffffffff8105adb0 ffff8807fe5660c0 0000000000000202
ffff880077268480
[771508.546456] ffff880077268780 00007faebb93f000 00007faebafea000
ffff880077268480
[771508.546456] Call Trace:
[771508.546458] [<ffffffff8105adb0>] ? leave_mm+0x80/0x80
[771508.546459] [<ffffffff8105af07>] native_flush_tlb_others+0x37/0x40
[771508.546460] [<ffffffff8105b084>] flush_tlb_mm_range+0xb4/0x280
[771508.546461] [<ffffffff81177173>] tlb_flush_mmu.part.50+0x33/0x90
[771508.546462] [<ffffffff81177d15>] tlb_finish_mmu+0x55/0x60
[771508.546463] [<ffffffff8117a072>] zap_page_range+0x112/0x150
[771508.546465] [<ffffffff81176bc1>] SyS_madvise+0x381/0x7b0
[771508.546466] [<ffffffff81696169>] system_call_fastpath+0x16/0x1b
[771508.546475] Code: 4d ea 24 00 3b 05 9f e5 c2 00 89 c2 0f 8d c3 fd ff
ff 48 98 49 8b 4d 00 48 03 0c c5 e0 64 d0 81 f6 41 20 01 74 cb 0f 1f 00
f3 90 <f6> 41 20 01 75 f8 eb be 0f b6 4d ac 48 8b 55 b8 44 89 ef 48 8b
On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
> [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
> smp_call_function_many+0x2ca/0x330
Can you decode the exact location inside of smp_call_function_many via
addr2line please ?
Thanks,
tglx
On 02/13/2014 02:25 PM, Thomas Gleixner wrote:
> On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
>> [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
>> smp_call_function_many+0x2ca/0x330
>
> Can you decode the exact location inside of smp_call_function_many via
> addr2line please ?
Hope this is useful (adding 0x2ce/0x330 as offsets does not make any
difference, don't know if it should)...
# grep smp_call_function /var/log/messages|tail -1
Feb 12 14:18:21 cmn27 kernel: [771840.224419] RIP:
0010:[<ffffffff810dc60e>] [<ffffffff810dc60e>]
smp_call_function_many+0x2ce/0x330
# addr2line -e
/usr/lib/debug/lib/modules/3.12.10-300.rt15.1.fc20.ccrma.x86_64+rt/vmlinux
ffffffff810dc60e
/usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.10-300.rt15.1.fc20.ccrma.x86_64/kernel/rtmutex.c:1295
This is the only time I was able to catch some logs of the problem (if
it is the same). I had to revert to 3.10.27-rt25 for the time being and
that seems to be holding up well so far.
-- Fernando
On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
> On 02/13/2014 02:25 PM, Thomas Gleixner wrote:
> > On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
> > > [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
> > > smp_call_function_many+0x2ca/0x330
> >
> > Can you decode the exact location inside of smp_call_function_many via
> > addr2line please ?
>
> Hope this is useful (adding 0x2ce/0x330 as offsets does not make any
> difference, don't know if it should)...
>
> # grep smp_call_function /var/log/messages|tail -1
> Feb 12 14:18:21 cmn27 kernel: [771840.224419] RIP: 0010:[<ffffffff810dc60e>]
> [<ffffffff810dc60e>] smp_call_function_many+0x2ce/0x330
> # addr2line -e
> /usr/lib/debug/lib/modules/3.12.10-300.rt15.1.fc20.ccrma.x86_64+rt/vmlinux
> ffffffff810dc60e
> /usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.10-300.rt15.1.fc20.ccrma.x86_64/kernel/rtmutex.c:1295
>
I can't see how the kernel decoder thinks it's smp_call_function_many
but addr2line looks at rtmutex.c
That doesn't make any sense at all. Version mismatch?
Thanks,
tglx
On 02/13/2014 03:55 PM, Thomas Gleixner wrote:
> On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
>
>> On 02/13/2014 02:25 PM, Thomas Gleixner wrote:
>>> On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
>>>> [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
>>>> smp_call_function_many+0x2ca/0x330
>>>
>>> Can you decode the exact location inside of smp_call_function_many via
>>> addr2line please ?
>>
>> Hope this is useful (adding 0x2ce/0x330 as offsets does not make any
>> difference, don't know if it should)...
>>
>> # grep smp_call_function /var/log/messages|tail -1
>> Feb 12 14:18:21 cmn27 kernel: [771840.224419] RIP: 0010:[<ffffffff810dc60e>]
>> [<ffffffff810dc60e>] smp_call_function_many+0x2ce/0x330
>> # addr2line -e
>> /usr/lib/debug/lib/modules/3.12.10-300.rt15.1.fc20.ccrma.x86_64+rt/vmlinux
>> ffffffff810dc60e
>> /usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.10-300.rt15.1.fc20.ccrma.x86_64/kernel/rtmutex.c:1295
>
> I can't see how the kernel decoder thinks it's smp_call_function_many
> but addr2line looks at rtmutex.c
>
> That doesn't make any sense at all. Version mismatch?
Indeed, sorry for the mixup... here I go again, hopefully this one will
make sense:
# addr2line -e
/usr/lib/debug/lib/modules/3.12.9-301.rt13.1.fc20.ccrma.x86_64+rt/vmlinux ffffffff810dc60e
/usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.9-301.rt13.1.fc20.ccrma.x86_64/kernel/smp.c:108
-- Fernando
On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
> On 02/13/2014 03:55 PM, Thomas Gleixner wrote:
> > On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
> >
> > > On 02/13/2014 02:25 PM, Thomas Gleixner wrote:
> > > > On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
> > > > > [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
> > > > > smp_call_function_many+0x2ca/0x330
> > > >
> > > > Can you decode the exact location inside of smp_call_function_many via
> > > > addr2line please ?
> > >
> > > Hope this is useful (adding 0x2ce/0x330 as offsets does not make any
> > > difference, don't know if it should)...
> > >
> > > # grep smp_call_function /var/log/messages|tail -1
> > > Feb 12 14:18:21 cmn27 kernel: [771840.224419] RIP:
> > > 0010:[<ffffffff810dc60e>]
> > > [<ffffffff810dc60e>] smp_call_function_many+0x2ce/0x330
> > > # addr2line -e
> > > /usr/lib/debug/lib/modules/3.12.10-300.rt15.1.fc20.ccrma.x86_64+rt/vmlinux
> > > ffffffff810dc60e
> > > /usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.10-300.rt15.1.fc20.ccrma.x86_64/kernel/rtmutex.c:1295
> >
> > I can't see how the kernel decoder thinks it's smp_call_function_many
> > but addr2line looks at rtmutex.c
> >
> > That doesn't make any sense at all. Version mismatch?
>
> Indeed, sorry for the mixup... here I go again, hopefully this one will make
> sense:
>
> # addr2line -e
> /usr/lib/debug/lib/modules/3.12.9-301.rt13.1.fc20.ccrma.x86_64+rt/vmlinux
> ffffffff810dc60e
> /usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.9-301.rt13.1.fc20.ccrma.x86_64/kernel/smp.c:108
So it's stuck in csd_lock_wait(), which means that the csd of the
target cpu is not free.
Is the machine completely dead or can you still retrieve information
from it?
Thanks,
tglx
On 02/14/2014 02:43 AM, Thomas Gleixner wrote:
> On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
>> On 02/13/2014 03:55 PM, Thomas Gleixner wrote:
>>> On Thu, 13 Feb 2014, Fernando Lopez-Lezcano wrote:
>>>
>>>> On 02/13/2014 02:25 PM, Thomas Gleixner wrote:
>>>>> On Wed, 12 Feb 2014, Fernando Lopez-Lezcano wrote:
>>>>>> [771508.546449] RIP: 0010:[<ffffffff810dc60a>] [<ffffffff810dc60a>]
>>>>>> smp_call_function_many+0x2ca/0x330
>>>>>
>>>>> Can you decode the exact location inside of smp_call_function_many via
>>>>> addr2line please ?
>>
>> # addr2line -e
>> /usr/lib/debug/lib/modules/3.12.9-301.rt13.1.fc20.ccrma.x86_64+rt/vmlinux
>> ffffffff810dc60e
>> /usr/src/debug/kernel-3.12.fc20.ccrma/linux-3.12.9-301.rt13.1.fc20.ccrma.x86_64/kernel/smp.c:108
>
> So it's stuck in csd_lock_wait(), which means that the csd of the
> target cpu is not free.
>
> Is the machine completely dead or can you still retrieve information
> from it?
After migrating to fc20/3.12.x-rtyy I started experiencing freezes in
some workstations. This coincided with one of our students running high
cpu load multi-core computations in them (he had been doing that before
under 3.10.x-rtyy with no problems). In the morning I would find
workstations unresponsive and catatonic. Probably his software was still
eating up cpu as the machines were warm (ie: still under load). No pings
back or keyboard/mouse/display response.
This was the only time I could get information from a machine while it
was in the process of freezing up - but this might have been a different
issue. I was ssh'd in and that terminal became unresponsive. I managed
to ssh in again and looked at the logs. The machine was not completely
frozen but it eventually became completely catatonic. For all I know
this might be different from the locked machines syndrome as it left
traces in the logs (I could forward you all the log entries if you want).
I could try to boot one of the machines into 3.12.xrtyy, replicate the
conditions and wait. What should I look for if I can catch this in the act?
-- Fernando
On Fri, 14 Feb 2014, Fernando Lopez-Lezcano wrote:
> I could try to boot one of the machines into 3.12.xrtyy, replicate the
> conditions and wait. What should I look for if I can catch this in the act?
dmesg and sysrq-t output would be a good start I guess.
Thanks,
tglx