From: Mohammed Naser
Date: Tue, 26 May 2015 14:05:12 -0400
Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues
To: Benjamin Segall
Cc: Kirill Tkhai, Peter Zijlstra, linux-kernel@vger.kernel.org, mingo@redhat.com, Konstantin Khlebnikov

Also, regarding reproduction: I'll try to build a setup that "mimics" the
crash, but unfortunately I don't have a fully reproducible case (though
with cgroup bandwidth limits in place, the server cannot stay up for 24
hours).

Thanks!
Mohammed

On Tue, May 26, 2015 at 1:56 PM, Mohammed Naser wrote:
> Hi Benjamin,
>
> Thank you for getting in touch. With as much as I can do (and I'm
> documenting this just to be sure I didn't take the wrong path), here
> is what I can provide:
>
> The crash occurred at:
>
> [] set_next_entity+0x11/0xb0
> set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
> The value of RDI will be the memory address of the first argument, and
> RSI will be the memory address of the second argument; taking them out
> of the BUG message:
>
> RDI: ffff883ffedf3140 (cfs_rq)
> RSI: 0000000000000000 (se, NULL in this case)
>
> Therefore, `cfs_rq` is at `ffff883ffedf3140`... looking it up:
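[Editorial sketch, not part of the original message: the register-to-argument mapping used above follows the System V AMD64 calling convention, under which the first two integer/pointer arguments arrive in RDI and RSI. With the values from the oops quoted later in this thread:]

```python
# Illustrative sketch: recover set_next_entity()'s arguments from an oops
# register dump.  The System V AMD64 calling convention passes the first
# six integer/pointer arguments in RDI, RSI, RDX, RCX, R8, R9 in order.
ARG_REGS = ["RDI", "RSI", "RDX", "RCX", "R8", "R9"]

def args_from_regs(regs, nargs):
    """Return the first nargs argument values from a register dump."""
    return [regs[r] for r in ARG_REGS[:nargs]]

# Register values copied from the oops in this thread.
regs = {"RDI": 0xffff883ffedf3140, "RSI": 0x0000000000000000}

cfs_rq, se = args_from_regs(regs, 2)
print(hex(cfs_rq))  # 0xffff883ffedf3140 -- the cfs_rq pointer
print(se == 0)      # True -- se is NULL, matching the crash
```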
>
> ==========================================================
> crash> cfs_rq ffff883ffedf3140
> struct cfs_rq {
>   load = {
>     weight = 0,
>     inv_weight = 0
>   },
>   nr_running = 0,
>   h_nr_running = 0,
>   exec_clock = 50727270151190,
>   min_vruntime = 1126033353437518,
>   tasks_timeline = {
>     rb_node = 0x0
>   },
>   rb_leftmost = 0x0,
>   curr = 0x0,
>   next = 0x0,
>   last = 0x0,
>   skip = 0x0,
>   nr_spread_over = 960,
>   runnable_load_avg = 0,
>   blocked_load_avg = 108,
>   decay_counter = {
>     counter = 139289244
>   },
>   last_decay = 139289243,
>   removed_load = {
>     counter = 0
>   },
>   tg_runnable_contrib = 329,
>   tg_load_contrib = 108,
>   h_load = 75,
>   last_h_load_update = 4331424052,
>   h_load_next = 0xffff883fcab5d600,
>   rq = 0xffff883ffedf30c0,
>   on_list = 1,
>   leaf_cfs_rq_list = {
>     next = 0xffff883eec13eec0,
>     prev = 0xffff883fcab5fec0
>   },
>   tg = 0xffffffff81ebdd80,
>   runtime_enabled = 0,
>   runtime_expires = 0,
>   runtime_remaining = 0,
>   throttled_clock = 0,
>   throttled_clock_task = 0,
>   throttled_clock_task_time = 0,
>   throttled = 0,
>   throttle_count = 0,
>   throttled_list = {
>     next = 0xffff883ffedf3250,
>     prev = 0xffff883ffedf3250
>   }
> }
> ==========================================================
>
> Following the memory address of tg at 0xffffffff81ebdd80:
>
> ==========================================================
> crash> task_group ffffffff81ebdd80
> struct task_group {
>   css = {
>     cgroup = 0xffff883fd0858010,
>     ss = 0xffffffff81c4e380,
>     refcnt = {
>       count = {
>         counter = 0
>       },
>       pcpu_count = 0x0,
>       release = 0x0,
>       confirm_kill = 0x0,
>       rcu = {
>         next = 0x0,
>         func = 0x0
>       }
>     },
>     parent = 0x0,
>     sibling = {
>       next = 0xffffffff81ebddc8,
>       prev = 0xffffffff81ebddc8
>     },
>     children = {
>       next = 0xffff881fcfab0448,
>       prev = 0xffff883fd1ad7048
>     },
>     id = 1,
>     flags = 3,
>     serial_nr = 2,
>     callback_head = {
>       next = 0x0,
>       func = 0x0
>     },
>     destroy_work = {
>       data = {
>         counter = 0
>       },
>       entry = {
>         next = 0x0,
>         prev = 0x0
>       },
>       func = 0x0
>     }
>   },
>   se = 0xffff881fff010000,
>   cfs_rq = 0xffff881fff010480,
>   shares = 1024,
>   load_avg = {
>     counter = 1234
>   },
>   runnable_avg = {
>     counter = 20014
>   },
>   rcu = {
>     next = 0x0,
>     func = 0x0
>   },
>   list = {
>     next = 0xffffffff81c4e830,
>     prev = 0xffff883fd1b710e0
>   },
>   parent = 0x0,
>   siblings = {
>     next = 0xffffffff81ebde78,
>     prev = 0xffffffff81ebde78
>   },
>   children = {
>     next = 0xffff883a29cd58f8,
>     prev = 0xffff883fd1b710f8
>   },
>   autogroup = 0x0,
>   cfs_bandwidth = {
>     lock = {
>       raw_lock = {
>         {
>           head_tail = 0,
>           tickets = {
>             head = 0,
>             tail = 0
>           }
>         }
>       }
>     },
>     period = {
>       tv64 = 100000000
>     },
>     quota = 18446744073709551615,
>     runtime = 0,
>     hierarchal_quota = -1,
>     runtime_expires = 0,
>     idle = 0,
>     timer_active = 0,
>     period_timer = {
>       node = {
>         node = {
>           __rb_parent_color = 18446744071594303192,
>           rb_right = 0x0,
>           rb_left = 0x0
>         },
>         expires = {
>           tv64 = 0
>         }
>       },
>       _softexpires = {
>         tv64 = 0
>       },
>       function = 0xffffffff810add70,
>       base = 0xffff881fff40d1a0,
>       state = 0,
>       start_pid = -1,
>       start_site = 0x0,
>       start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>     },
>     slack_timer = {
>       node = {
>         node = {
>           __rb_parent_color = 18446744071594303288,
>           rb_right = 0x0,
>           rb_left = 0x0
>         },
>         expires = {
>           tv64 = 0
>         }
>       },
>       _softexpires = {
>         tv64 = 0
>       },
>       function = 0xffffffff810adcb0,
>       base = 0xffff881fff40d1a0,
>       state = 0,
>       start_pid = -1,
>       start_site = 0x0,
>       start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
>     },
>     throttled_cfs_rq = {
>       next = 0xffffffff81ebdf98,
>       prev = 0xffffffff81ebdf98
>     },
>     nr_periods = 0,
>     nr_throttled = 0,
>     throttled_time = 0
>   }
> }
> ==========================================================
>
> I have the following for the `rq` if it helps (when it crashed):
>
> ==========================================================
> crash> rq.clock ffff883ffedf30c0
>   clock = 146055357448781
> crash> rq.clock_task ffff883ffedf30c0
>   clock_task = 14605535744878
> ==========================================================
>
> Sorry for the long email; I wanted to provide as much as I can.
>
> Thanks,
> Mohammed
>
> On Tue, May 26, 2015 at 1:38 PM, Benjamin Segall wrote:
>> Kirill Tkhai writes:
>>
>>> Hi,
>>>
>>> 25.05.2015, 23:53, "Mohammed Naser":
>>>> Hi Peter,
>>>>
>>>> (resending as plain text, my bad)
>>>>
>>>> Thank you for the reply.
>>>>
>>>> Would you have any ideas on why this would have occurred, or other
>>>> steps to look at? It's my first time attempting to help fix a problem
>>>> like this.
>>>>
>>>> I have a crashdump of the kernel, since this issue repeated itself a
>>>> few times on a loaded KVM host (it's 12GB, however). I can also
>>>> provide the values of cfs_rq before the kernel crash.
>>>>
>>>> ===================================
>>>> [146055.357476] BUG: unable to handle kernel NULL pointer dereference
>>>> at 0000000000000038
>>>> [146055.359620] IP: [] set_next_entity+0x11/0xb0
>>>> [146055.361890] PGD 0
>>>> [146055.364131] Oops: 0000 [#1] SMP
>>>> [146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
>>>> act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
>>>> nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip
>>>> ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl
>>>> mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat
>>>> nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
>>>> nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc
>>>> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
>>>> ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm
>>>> iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp
>>>> libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
>>>> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
>>>> crct10dif_pclmul crc32_pclmul
>>>> [146055.388889] ghash_clmulni_intel aesni_intel aes_x86_64 lrw
>>>> gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me
>>>> mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp
>>>> parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core
>>>> megaraid_sas libahci mdio [last unloaded: ipmi_si]
>>>> [146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
>>>> 3.16.0-37-generic #51~14.04.1-Ubuntu
>>>> [146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS
>>>> 1.0.4 08/28/2014
>>>> [146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
>>>> ffff883a1c168000
>>>> [146055.421817] RIP: 0010:[] []
>>>> set_next_entity+0x11/0xb0
>>>> [146055.428079] RSP: 0018:ffff883a1c16bce8 EFLAGS: 00010092
>>>> [146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
>>>> 00000000044aa200
>>>> [146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
>>>> ffff883ffedf3140
>>>> [146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
>>>> 0000000000000001
>>>> [146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
>>>> ffff883ffedf3140
>>>> [146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
>>>> ffff883ffedf30c0
>>>> [146055.467722] FS: 00007f404919d700(0000) GS:ffff883ffede0000(0000)
>>>> knlGS:ffff880002380000
>>>> [146055.474756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
>>>> 00000000001427e0
>>>> [146055.489134] Stack:
>>>> [146055.496412] 0000000000000000 ffff883ffedf3140 000000000000001f
>>>> ffff883a1c16bd68
>>>> [146055.504053] ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
>>>> ffff883fcab69e90
>>>> [146055.511786] ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
>>>> ffff883fcab6a3c8
>>>> [146055.519551] Call Trace:
>>>> [146055.527330] [] pick_next_task_fair+0x78/0x880
>>>> [146055.535292] [] ? sched_clock+0x9/0x10
>>>> [146055.543379] [] ? sched_clock_cpu+0x85/0xc0
>>>> [146055.551519] [] __schedule+0x11b/0x7a0
>>>> [146055.559722] [] _cond_resched+0x29/0x40
>>>> [146055.568020] [] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
>>>> [146055.576509] [] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
>>>> [146055.585045] [] ? perf_event_context_sched_in+0xa2/0xc0
>>>> [146055.593771] [] do_vfs_ioctl+0x2e0/0x4c0
>>>> [146055.602531] [] ? finish_task_switch+0x108/0x180
>>>> [146055.611413] [] ? kvm_on_user_return+0x74/0x80 [kvm]
>>>> [146055.620339] [] SyS_ioctl+0x81/0xa0
>>>> [146055.629396] [] system_call_fastpath+0x1a/0x1f
>>>> [146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48
>>>> 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54
>>>> 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48
>>>> 8b 80
>>>> [146055.657833] RIP [] set_next_entity+0x11/0xb0
>>>> [146055.667524] RSP
>>>> [146055.677082] CR2: 0000000000000038
>>>> ===================================
>>>
>>> This looks like https://lkml.org/lkml/2015/4/3/231
>>
>> If you're using cfs_bandwidth, it's possible (though I can't tell just
>> from that). If so, and you can reproduce this, could you test
>> https://lkml.org/lkml/2015/4/6/396 (assuming it still applies at all
>> anyway)?
>>
>> If you want to provide a bunch of debugging info, rq->clock,
>> rq->clock_task, *cfs_rq and *se for all rqs and cgroups would be ideal,
>> but getting all that is kinda annoying. The most basic info would be
>> *cfs_rq of the crashing cpu and *cfs_rq->tg.
>
> --
> Mohammed Naser — vexxhost
> -----------------------------------------------------
> D. 514-316-8872
> D. 800-910-1726 ext. 200
> E. mnaser@vexxhost.com
> W. http://vexxhost.com

--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. mnaser@vexxhost.com
W. http://vexxhost.com
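[Editorial sketch, not part of the original thread: the faulting instruction can be read straight off the oops. In the "Code:" line, the byte marked <8b> begins the instruction that trapped, and `8b 46 38` disassembles to `mov 0x38(%rsi),%eax` -- a 4-byte load 0x38 bytes past the pointer in RSI, likely the `se->on_rq` check at the top of set_next_entity() in kernels of this vintage. With RSI (`se`) equal to 0, the faulting address is simply the displacement:]

```python
# Illustrative decode of the trap in the oops above.  "8b 46 38" is
# "mov 0x38(%rsi),%eax": opcode 8b (MOV r32, r/m32), ModR/M 0x46
# (mod=01: disp8, reg=EAX, r/m=RSI), 8-bit displacement 0x38.
rsi = 0x0      # se, as reported in the register dump (NULL)
disp8 = 0x38   # displacement encoded in the instruction

# With a NULL base pointer the effective (faulting) address is just the
# displacement, which is exactly what the oops reports in CR2.
fault_address = rsi + disp8
print(hex(fault_address))  # 0x38, matching "CR2: 0000000000000038"
```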