Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751291AbbEYWBV (ORCPT ); Mon, 25 May 2015 18:01:21 -0400
Received: from forward2o.mail.yandex.net ([37.140.190.31]:49746 "EHLO
        forward2o.mail.yandex.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1750859AbbEYWBT (ORCPT );
        Mon, 25 May 2015 18:01:19 -0400
X-Greylist: delayed 578 seconds by postgrey-1.27 at vger.kernel.org; Mon, 25 May 2015 18:01:18 EDT
From: Kirill Tkhai
To: Mohammed Naser, Peter Zijlstra
Cc: "linux-kernel@vger.kernel.org", "mingo@redhat.com", Konstantin Khlebnikov
In-Reply-To:
References: <1432576851-24831-1-git-send-email-mnaser@vexxhost.com> <1432586334.11346.2.camel@twins>
Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues
MIME-Version: 1.0
Message-Id: <1255671432590695@web13o.yandex.ru>
X-Mailer: Yamail [ http://yandex.ru ] 5.0
Date: Tue, 26 May 2015 00:51:35 +0300
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=koi8-r
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 12465
Lines: 274

Hi,

25.05.2015, 23:53, "Mohammed Naser":
> Hi Peter,
>
> (resending as plain text, my bad)
>
> Thank you for reply.
>
> Would you have any ideas on why this would have occurred or other
> steps to look at?  It's my first time attempting to help fix a problem
> like this.
>
> I have a crashdump of the kernel since this issue repeated itself a
> few times on a loaded KVM host (it's 12GB however).  I can also
> provide the values of cfs_rq before the kernel crash.
>
> ===================================
> [146055.357476] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000038
> [146055.359620] IP: [] set_next_entity+0x11/0xb0
> [146055.361890] PGD 0
> [146055.364131] Oops: 0000 [#1] SMP
> [146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
> act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
> nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip
> ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl
> mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat
> nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
> nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc
> ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
> ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm
> iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp
> libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
> crct10dif_pclmul crc32_pclmul
> [146055.388889]  ghash_clmulni_intel aesni_intel aes_x86_64 lrw
> gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me
> mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp
> parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core
> megaraid_sas libahci mdio [last unloaded: ipmi_si]
> [146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
> 3.16.0-37-generic #51~14.04.1-Ubuntu
> [146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS
> 1.0.4 08/28/2014
> [146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
> ffff883a1c168000
> [146055.421817] RIP: 0010:[]  []
> set_next_entity+0x11/0xb0
> [146055.428079] RSP: 0018:ffff883a1c16bce8  EFLAGS: 00010092
> [146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 00000000044aa200
> [146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff883ffedf3140
> [146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
> 0000000000000001
> [146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
> ffff883ffedf3140
> [146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
> ffff883ffedf30c0
> [146055.467722] FS:  00007f404919d700(0000) GS:ffff883ffede0000(0000)
> knlGS:ffff880002380000
> [146055.474756] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
> 00000000001427e0
> [146055.489134] Stack:
> [146055.496412]  0000000000000000 ffff883ffedf3140 000000000000001f
> ffff883a1c16bd68
> [146055.504053]  ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
> ffff883fcab69e90
> [146055.511786]  ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
> ffff883fcab6a3c8
> [146055.519551] Call Trace:
> [146055.527330]  [] pick_next_task_fair+0x78/0x880
> [146055.535292]  [] ? sched_clock+0x9/0x10
> [146055.543379]  [] ? sched_clock_cpu+0x85/0xc0
> [146055.551519]  [] __schedule+0x11b/0x7a0
> [146055.559722]  [] _cond_resched+0x29/0x40
> [146055.568020]  [] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
> [146055.576509]  [] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
> [146055.585045]  [] ? perf_event_context_sched_in+0xa2/0xc0
> [146055.593771]  [] do_vfs_ioctl+0x2e0/0x4c0
> [146055.602531]  [] ? finish_task_switch+0x108/0x180
> [146055.611413]  [] ? kvm_on_user_return+0x74/0x80 [kvm]
> [146055.620339]  [] SyS_ioctl+0x81/0xa0
> [146055.629396]  [] system_call_fastpath+0x1a/0x1f
> [146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48
> 85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54
> 49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48
> 8b 80
> [146055.657833] RIP  [] set_next_entity+0x11/0xb0
> [146055.667524]  RSP
> [146055.677082] CR2: 0000000000000038
> ===================================

This looks like https://lkml.org/lkml/2015/4/3/231

> Any pointers are appreciated and I'll do my best to do some more
> troubleshooting, just trying to understand the codebase is a task on
> it's own
>
> Thanks Peter,
> Mohammed
>
> On Mon, May 25, 2015 at 4:49 PM, Mohammed Naser wrote:
>> On Mon, May 25, 2015 at 4:38 PM Peter Zijlstra wrote:
>>> On Mon, 2015-05-25 at 14:00 -0400, Mohammed Naser wrote:
>>>> Calling put_prev_task() can result in nr_running being updated
>>>> to zero, which would then crash the system when the kernel
>>>> attempts to pick_next_entity() with an empty queue.
>>> Getting to pick_next_entity() with an empty queue is a bug.

Maybe we should do a global update for all classes, something like the
patch below (warning: it's completely untested).

One problem, though, is that update_curr() of the fair class does not
update the whole hierarchy. It only covers the task's own entity.
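To illustrate that last point, here is a minimal user-space sketch (not
kernel code) of the difference between updating only the task's entity and
walking up the whole group hierarchy. struct toy_entity and the toy_*
helpers are made up for this example; the real fair-class accounting
differs in many details.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a scheduling entity; NOT the kernel's struct sched_entity. */
struct toy_entity {
	struct toy_entity *parent;	/* group entity one level up, NULL at the root */
	unsigned long long exec_start;	/* time this entity was last accounted */
	unsigned long long sum_exec;	/* runtime accounted so far */
};

/* Account the time elapsed since exec_start for a single entity. */
static void toy_update_one(struct toy_entity *se, unsigned long long now)
{
	assert(now >= se->exec_start);
	se->sum_exec += now - se->exec_start;
	se->exec_start = now;
}

/* Per-task update: only the task's own entity is touched. */
static void toy_update_task_only(struct toy_entity *task_se,
				 unsigned long long now)
{
	toy_update_one(task_se, now);
}

/* Hierarchy-wide update: walk from the task up through every parent. */
static void toy_update_hierarchy(struct toy_entity *task_se,
				 unsigned long long now)
{
	struct toy_entity *se;

	for (se = task_se; se; se = se->parent)
		toy_update_one(se, now);
}
```

With a task entity whose parent is a group entity, toy_update_task_only()
leaves the group's sum_exec stale, while toy_update_hierarchy() brings both
levels up to date.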

 kernel/sched/core.c     |  1 +
 kernel/sched/deadline.c | 10 ----------
 kernel/sched/rt.c       |  7 -------
 3 files changed, 1 insertion(+), 17 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e..7717d0b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2821,6 +2821,7 @@ static void __sched __schedule(void)
 
 	if (task_on_rq_queued(prev))
 		update_rq_clock(rq);
+	prev->sched_class->update_curr(rq);
 
 	next = pick_next_task(rq, prev);
 	clear_tsk_need_resched(prev);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7a08d59..7320293 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1102,16 +1102,6 @@ struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
 			return RETRY_TASK;
 	}
 
-	/*
-	 * When prev is DL, we may throttle it in put_prev_task().
-	 * So, we update time before we check for dl_nr_running.
-	 */
-	if (prev->sched_class == &dl_sched_class)
-		update_curr_dl(rq);
-
-	if (unlikely(!dl_rq->dl_nr_running))
-		return NULL;
-
 	put_prev_task(rq, prev);
 
 	dl_se = pick_next_dl_entity(rq, dl_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38..31223c0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1474,13 +1474,6 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
 			return RETRY_TASK;
 	}
 
-	/*
-	 * We may dequeue prev's rt_rq in put_prev_task().
-	 * So, we update time before rt_nr_running check.
-	 */
-	if (prev->sched_class == &rt_sched_class)
-		update_curr_rt(rq);
-
 	if (!rt_rq->rt_queued)
 		return NULL;
 
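
For reference, the ordering issue that the removed RT and DL comments
describe can be modelled in user space roughly like this. It is only a
sketch under simplifying assumptions: struct toy_rq and the toy_* functions
are hypothetical, and "over budget" stands in for whatever makes the
accounting step dequeue prev. Whether the fair-class oops quoted above has
the same cause is not established here.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy run queue for illustration; NOT the kernel's struct rq. */
struct toy_rq {
	int nr_running;		/* entities currently queued */
	bool prev_over_budget;	/* prev exhausted its budget, not yet accounted */
};

/* Accounting step: notices the exhausted budget and dequeues prev. */
static void toy_update_curr(struct toy_rq *rq)
{
	if (rq->prev_over_budget) {
		assert(rq->nr_running > 0);
		rq->nr_running--;
		rq->prev_over_budget = false;
	}
}

/* put_prev also runs the accounting, so it can empty the queue. */
static void toy_put_prev(struct toy_rq *rq)
{
	toy_update_curr(rq);
}

/*
 * Both pick variants return true if the caller would go on to select an
 * entity, and false if they bail out early because the queue is empty.
 */

/* Check first, account later: the check can pass on stale state. */
static bool toy_pick_check_first(struct toy_rq *rq)
{
	if (!rq->nr_running)
		return false;
	toy_put_prev(rq);	/* may drop nr_running to zero after the check */
	return true;		/* caller may now pick from an empty queue */
}

/* Account first, then check: the check sees the final queue state. */
static bool toy_pick_update_first(struct toy_rq *rq)
{
	toy_update_curr(rq);
	if (!rq->nr_running)
		return false;
	toy_put_prev(rq);
	return true;
}
```

Starting from one queued entity that is over budget,
toy_pick_check_first() reports that an entity will be picked even though
the queue ends up empty, whereas toy_pick_update_first() bails out. The
patch above aims for the second ordering by doing the update once in
__schedule() for whatever class prev belongs to.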