2015-05-25 18:01:05

by Mohammed Naser

Subject: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Calling put_prev_task() can result in nr_running being updated
to zero, which would then crash the system when the kernel
attempts to pick_next_entity() with an empty queue.

Signed-off-by: Mohammed Naser <[email protected]>
---
kernel/sched/fair.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d4632f..dd0a74a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5374,12 +5374,11 @@ again:
simple:
cfs_rq = &rq->cfs;
#endif
+ put_prev_task(rq, prev);

if (!cfs_rq->nr_running)
goto idle;

- put_prev_task(rq, prev);
-
do {
se = pick_next_entity(cfs_rq, NULL);
set_next_entity(cfs_rq, se);
@@ -5415,7 +5414,10 @@ idle:
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
struct sched_entity *se = &prev->se;
- struct cfs_rq *cfs_rq;
+ struct cfs_rq *cfs_rq = &rq->cfs;
+
+ if (!cfs_rq->nr_running)
+ return;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
--
2.3.2 (Apple Git-55)


2015-05-25 20:39:01

by Peter Zijlstra

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

On Mon, 2015-05-25 at 14:00 -0400, Mohammed Naser wrote:
> Calling put_prev_task() can result in nr_running being updated
> to zero, which would then crash the system when the kernel
> attempts to pick_next_entity() with an empty queue.

Getting to pick_next_entity() with an empty queue is a bug.
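
For context: the unpatched simple path checks for an empty queue before putting prev, so pick_next_entity() is never supposed to see an empty tree. Abridged from the pre-patch 3.16 pick_next_task_fair() that the diff above modifies:

    simple:
            cfs_rq = &rq->cfs;

            if (!cfs_rq->nr_running)        /* empty queue: go idle ... */
                    goto idle;

            put_prev_task(rq, prev);        /* ... otherwise put prev, then pick */

            do {
                    se = pick_next_entity(cfs_rq, NULL);
                    set_next_entity(cfs_rq, se);
                    cfs_rq = group_cfs_rq(se);
            } while (cfs_rq);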

2015-05-25 20:50:03

by Mohammed Naser

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Hi Peter,

(resending as plain text, my bad)

Thank you for your reply.

Would you have any ideas on why this occurred, or other steps to look at?
It's my first time attempting to help fix a problem like this.

I have a kernel crashdump, since this issue repeated itself a few times
on a loaded KVM host (it's 12 GB, however). I can also provide the
values of cfs_rq before the kernel crash.

===================================
[146055.357476] BUG: unable to handle kernel NULL pointer dereference
at 0000000000000038
[146055.359620] IP: [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.361890] PGD 0
[146055.364131] Oops: 0000 [#1] SMP
[146055.366475] Modules linked in: vhost_net vhost macvtap macvlan
act_police cls_u32 sch_ingress ipmi_si xt_multiport nf_conntrack_ipv6
nf_defrag_ipv6 xt_mac xt_physdev xt_set iptable_raw ip_set_hash_ip
ip_set nfnetlink mpt3sas mpt2sas raid_class scsi_transport_sas mptctl
mptbase veth xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat
nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack
nf_conntrack ipt_REJECT xt_tcpudp dell_rbu bridge stp llc
ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter
ip_tables x_tables nbd openvswitch gre vxlan libcrc32c ib_iser rdma_cm
iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp
libiscsi scsi_transport_iscsi ipmi_devintf intel_rapl
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel dcdbas kvm
crct10dif_pclmul crc32_pclmul
[146055.388889] ghash_clmulni_intel aesni_intel aes_x86_64 lrw
gf128mul dm_multipath glue_helper ablk_helper scsi_dh cryptd mei_me
mei lpc_ich ipmi_msghandler shpchp wmi acpi_power_meter mac_hid lp
parport nls_iso8859_1 igb ixgbe i2c_algo_bit dca ptp ahci pps_core
megaraid_sas libahci mdio [last unloaded: ipmi_si]
[146055.404208] CPU: 31 PID: 67922 Comm: qemu-system-x86 Not tainted
3.16.0-37-generic #51~14.04.1-Ubuntu
[146055.409906] Hardware name: Dell Inc. PowerEdge R630/0CNCJW, BIOS
1.0.4 08/28/2014
[146055.415754] task: ffff883fcab69e90 ti: ffff883a1c168000 task.ti:
ffff883a1c168000
[146055.421817] RIP: 0010:[<ffffffff810a7d31>] [<ffffffff810a7d31>]
set_next_entity+0x11/0xb0
[146055.428079] RSP: 0018:ffff883a1c16bce8 EFLAGS: 00010092
[146055.434377] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
00000000044aa200
[146055.440913] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff883ffedf3140
[146055.447474] RBP: ffff883a1c16bd00 R08: 0000000000000000 R09:
0000000000000001
[146055.454181] R10: 0000000000000004 R11: 0000000000000206 R12:
ffff883ffedf3140
[146055.460968] R13: 000000000000001f R14: 0000000000000001 R15:
ffff883ffedf30c0
[146055.467722] FS: 00007f404919d700(0000) GS:ffff883ffede0000(0000)
knlGS:ffff880002380000
[146055.474756] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[146055.481830] CR2: 0000000000000038 CR3: 0000003a1c45b000 CR4:
00000000001427e0
[146055.489134] Stack:
[146055.496412] 0000000000000000 ffff883ffedf3140 000000000000001f
ffff883a1c16bd68
[146055.504053] ffffffff810af2f8 ffff883ffedf3140 00000000000130c0
ffff883fcab69e90
[146055.511786] ffffffff8101c3b9 ffff883a1c16bd50 ffffffff810a4895
ffff883fcab6a3c8
[146055.519551] Call Trace:
[146055.527330] [<ffffffff810af2f8>] pick_next_task_fair+0x78/0x880
[146055.535292] [<ffffffff8101c3b9>] ? sched_clock+0x9/0x10
[146055.543379] [<ffffffff810a4895>] ? sched_clock_cpu+0x85/0xc0
[146055.551519] [<ffffffff81768afb>] __schedule+0x11b/0x7a0
[146055.559722] [<ffffffff81769579>] _cond_resched+0x29/0x40
[146055.568020] [<ffffffffc0338289>] kvm_arch_vcpu_ioctl_run+0x3e9/0x460 [kvm]
[146055.576509] [<ffffffffc0321ce2>] kvm_vcpu_ioctl+0x2a2/0x5e0 [kvm]
[146055.585045] [<ffffffff81156952>] ? perf_event_context_sched_in+0xa2/0xc0
[146055.593771] [<ffffffff811e7250>] do_vfs_ioctl+0x2e0/0x4c0
[146055.602531] [<ffffffff8109dec8>] ? finish_task_switch+0x108/0x180
[146055.611413] [<ffffffffc032bcd4>] ? kvm_on_user_return+0x74/0x80 [kvm]
[146055.620339] [<ffffffff811e74b1>] SyS_ioctl+0x81/0xa0
[146055.629396] [<ffffffff8176d20d>] system_call_fastpath+0x1a/0x1f
[146055.638500] Code: 83 c4 10 4c 89 f2 4c 89 ee ff d0 49 8b 04 24 48
85 c0 75 e6 eb 99 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54
49 89 fc 53 <8b> 46 38 48 89 f3 85 c0 75 5d 49 8b 84 24 b0 00 00 00 48
8b 80
[146055.657833] RIP [<ffffffff810a7d31>] set_next_entity+0x11/0xb0
[146055.667524] RSP <ffff883a1c16bce8>
[146055.677082] CR2: 0000000000000038
===================================

Any pointers are appreciated, and I'll do my best to do some more
troubleshooting; just understanding the codebase is a task on its own.

Thanks Peter,
Mohammed

On Mon, May 25, 2015 at 4:49 PM, Mohammed Naser <[email protected]> wrote:
> [snip: full quote of the previous message and oops, unchanged from above]



--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [email protected]
W. http://vexxhost.com

2015-05-25 22:01:21

by Kirill Tkhai

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Hi,

25.05.2015, 23:53, "Mohammed Naser" <[email protected]>:
> [snip: quoted report and oops, unchanged from above]

This looks like https://lkml.org/lkml/2015/4/3/231

> [snip: rest of quote]

Maybe we should do a global update for all classes. Something like the patch
below (warning: it's completely untested).

Though one problem is that the fair class's update_curr() does not update the
whole hierarchy; it covers the task's own entity only.

kernel/sched/core.c | 1 +
kernel/sched/deadline.c | 10 ----------
kernel/sched/rt.c | 7 -------
3 files changed, 1 insertion(+), 17 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 613b61e..7717d0b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2821,6 +2821,7 @@ static void __sched __schedule(void)

if (task_on_rq_queued(prev))
update_rq_clock(rq);
+ prev->sched_class->update_curr(rq);

next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 7a08d59..7320293 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1102,16 +1102,6 @@ struct task_struct *pick_next_task_dl(struct rq *rq, struct task_struct *prev)
return RETRY_TASK;
}

- /*
- * When prev is DL, we may throttle it in put_prev_task().
- * So, we update time before we check for dl_nr_running.
- */
- if (prev->sched_class == &dl_sched_class)
- update_curr_dl(rq);
-
- if (unlikely(!dl_rq->dl_nr_running))
- return NULL;
-
put_prev_task(rq, prev);

dl_se = pick_next_dl_entity(rq, dl_rq);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8781a38..31223c0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1474,13 +1474,6 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
return RETRY_TASK;
}

- /*
- * We may dequeue prev's rt_rq in put_prev_task().
- * So, we update time before rt_nr_running check.
- */
- if (prev->sched_class == &rt_sched_class)
- update_curr_rt(rq);
-
if (!rt_rq->rt_queued)
return NULL;
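
Note that ->update_curr() is not a sched_class method in 3.16; the patch above
assumes one is added per class. For the fair class, a hierarchy-wide hook would
presumably need to walk the entity chain, something along these lines (equally
untested sketch):

    static void update_curr_fair(struct rq *rq)
    {
            struct sched_entity *se = &rq->curr->se;

            /* with CONFIG_FAIR_GROUP_SCHED this climbs through the group entities */
            for_each_sched_entity(se)
                    update_curr(cfs_rq_of(se));
    }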

2015-05-26 17:39:13

by Benjamin Segall

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Kirill Tkhai <[email protected]> writes:

> [snip: quoted report and oops, unchanged from above]
>
> This looks like https://lkml.org/lkml/2015/4/3/231


If you're using cfs_bandwidth, it's possible (though I can't tell just
from that). If so and you can reproduce this, could you test
https://lkml.org/lkml/2015/4/6/396 (assuming it still applies at all
anyway)?

If you want to provide a bunch of debugging info, rq->clock,
rq->clock_task, *cfs_rq and *se for all rqs and cgroups would be ideal,
but getting all that is kinda annoying. The most basic info would be
*cfs_rq of the crashing cpu and *cfs_rq->tg.
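
For pulling those without dumping whole structures, crash's struct command can
select members directly, e.g. against the addresses in the oops above:

    crash> struct rq.clock,clock_task ffff883ffedf30c0
    crash> struct cfs_rq.nr_running,curr,tg ffff883ffedf3140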

2015-05-26 17:56:17

by Mohammed Naser

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Hi Benjamin,

Thank you for getting in touch. I've done as much as I can (and I'm
documenting it here just to be sure I didn't take a wrong path); here
is what I can provide:

The crash occurred at:

[<ffffffff810a7d31>] set_next_entity+0x11/0xb0
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

RDI will hold the memory address of the first argument and RSI the
address of the second; taking them out of the BUG message:

RDI: ffff883ffedf3140 (cfs_rq)
RSI: 0000000000000000 (se, NULL in this case)
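
For what it's worth, the faulting bytes in the Code: line, <8b> 46 38, decode
to mov 0x38(%rsi),%eax, i.e. the read of se->on_rq at the top of
set_next_entity() (consistent with on_rq sitting at offset 0x38 of struct
sched_entity on this build); with se == NULL that faults at address 0x38,
matching CR2 above:

    static void
    set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
    {
            /* first dereference of se: oopses here when se == NULL */
            if (se->on_rq) {
                    ...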

Therefore, `cfs_rq` is at `ffff883ffedf3140`... looking it up:

==========================================================
crash> cfs_rq ffff883ffedf3140
struct cfs_rq {
  load = {
    weight = 0,
    inv_weight = 0
  },
  nr_running = 0,
  h_nr_running = 0,
  exec_clock = 50727270151190,
  min_vruntime = 1126033353437518,
  tasks_timeline = {
    rb_node = 0x0
  },
  rb_leftmost = 0x0,
  curr = 0x0,
  next = 0x0,
  last = 0x0,
  skip = 0x0,
  nr_spread_over = 960,
  runnable_load_avg = 0,
  blocked_load_avg = 108,
  decay_counter = {
    counter = 139289244
  },
  last_decay = 139289243,
  removed_load = {
    counter = 0
  },
  tg_runnable_contrib = 329,
  tg_load_contrib = 108,
  h_load = 75,
  last_h_load_update = 4331424052,
  h_load_next = 0xffff883fcab5d600,
  rq = 0xffff883ffedf30c0,
  on_list = 1,
  leaf_cfs_rq_list = {
    next = 0xffff883eec13eec0,
    prev = 0xffff883fcab5fec0
  },
  tg = 0xffffffff81ebdd80 <root_task_group>,
  runtime_enabled = 0,
  runtime_expires = 0,
  runtime_remaining = 0,
  throttled_clock = 0,
  throttled_clock_task = 0,
  throttled_clock_task_time = 0,
  throttled = 0,
  throttle_count = 0,
  throttled_list = {
    next = 0xffff883ffedf3250,
    prev = 0xffff883ffedf3250
  }
}
==========================================================

Following the tg pointer at 0xffffffff81ebdd80:

==========================================================
crash> task_group ffffffff81ebdd80
struct task_group {
  css = {
    cgroup = 0xffff883fd0858010,
    ss = 0xffffffff81c4e380 <cpu_cgrp_subsys>,
    refcnt = {
      count = {
        counter = 0
      },
      pcpu_count = 0x0,
      release = 0x0,
      confirm_kill = 0x0,
      rcu = {
        next = 0x0,
        func = 0x0
      }
    },
    parent = 0x0,
    sibling = {
      next = 0xffffffff81ebddc8 <root_task_group+72>,
      prev = 0xffffffff81ebddc8 <root_task_group+72>
    },
    children = {
      next = 0xffff881fcfab0448,
      prev = 0xffff883fd1ad7048
    },
    id = 1,
    flags = 3,
    serial_nr = 2,
    callback_head = {
      next = 0x0,
      func = 0x0
    },
    destroy_work = {
      data = {
        counter = 0
      },
      entry = {
        next = 0x0,
        prev = 0x0
      },
      func = 0x0
    }
  },
  se = 0xffff881fff010000,
  cfs_rq = 0xffff881fff010480,
  shares = 1024,
  load_avg = {
    counter = 1234
  },
  runnable_avg = {
    counter = 20014
  },
  rcu = {
    next = 0x0,
    func = 0x0
  },
  list = {
    next = 0xffffffff81c4e830 <task_groups>,
    prev = 0xffff883fd1b710e0
  },
  parent = 0x0,
  siblings = {
    next = 0xffffffff81ebde78 <root_task_group+248>,
    prev = 0xffffffff81ebde78 <root_task_group+248>
  },
  children = {
    next = 0xffff883a29cd58f8,
    prev = 0xffff883fd1b710f8
  },
  autogroup = 0x0,
  cfs_bandwidth = {
    lock = {
      raw_lock = {
        {
          head_tail = 0,
          tickets = {
            head = 0,
            tail = 0
          }
        }
      }
    },
    period = {
      tv64 = 100000000
    },
    quota = 18446744073709551615,
    runtime = 0,
    hierarchal_quota = -1,
    runtime_expires = 0,
    idle = 0,
    timer_active = 0,
    period_timer = {
      node = {
        node = {
          __rb_parent_color = 18446744071594303192,
          rb_right = 0x0,
          rb_left = 0x0
        },
        expires = {
          tv64 = 0
        }
      },
      _softexpires = {
        tv64 = 0
      },
      function = 0xffffffff810add70 <sched_cfs_period_timer>,
      base = 0xffff881fff40d1a0,
      state = 0,
      start_pid = -1,
      start_site = 0x0,
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
    },
    slack_timer = {
      node = {
        node = {
          __rb_parent_color = 18446744071594303288,
          rb_right = 0x0,
          rb_left = 0x0
        },
        expires = {
          tv64 = 0
        }
      },
      _softexpires = {
        tv64 = 0
      },
      function = 0xffffffff810adcb0 <sched_cfs_slack_timer>,
      base = 0xffff881fff40d1a0,
      state = 0,
      start_pid = -1,
      start_site = 0x0,
      start_comm = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000"
    },
    throttled_cfs_rq = {
      next = 0xffffffff81ebdf98 <root_task_group+536>,
      prev = 0xffffffff81ebdf98 <root_task_group+536>
    },
    nr_periods = 0,
    nr_throttled = 0,
    throttled_time = 0
  }
}
==========================================================

I have the following for the `rq` at the time of the crash, if it helps:

==========================================================
crash> rq.clock ffff883ffedf30c0
clock = 146055357448781
crash> rq.clock_task ffff883ffedf30c0
clock_task = 14605535744878
==========================================================

Sorry for the long email; I wanted to provide as much as I could.

Thanks,
Mohammed

On Tue, May 26, 2015 at 1:38 PM, <[email protected]> wrote:
> [snip: quoted thread]
>
> If you're using cfs_bandwidth, it's possible (though I can't tell just
> from that). If so and you can reproduce this, could you test
> https://lkml.org/lkml/2015/4/6/396 (assuming it still applies at all
> anyway)?
>
> If you want to provide a bunch of debugging info, rq->clock,
> rq->clock_task, *cfs_rq and *se for all rqs and cgroups would be ideal,
> but getting all that is kinda annoying. The most basic info would be
> *cfs_rq of the crashing cpu and *cfs_rq->tg.



--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [email protected]
W. http://vexxhost.com

2015-05-26 18:05:15

by Mohammed Naser

Subject: Re: [PATCH] sched/fair: Fix null pointer dereference of empty queues

Also, regarding reproduction: I'll try to build a setup that "mimics"
the crash, but unfortunately I don't have a fully reproducible case
(although when cgroup bandwidth limits are in place, a server cannot
stay up for 24 hours).
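
For anyone trying to reproduce, a minimal cgroup-v1 CPU bandwidth setup of the
kind involved here would look roughly like this (illustrative group name and
values, not the exact production config):

    mkdir /sys/fs/cgroup/cpu/vm0
    echo 100000 > /sys/fs/cgroup/cpu/vm0/cpu.cfs_period_us
    echo 50000 > /sys/fs/cgroup/cpu/vm0/cpu.cfs_quota_us    # cap the group at half a CPU
    echo $QEMU_PID > /sys/fs/cgroup/cpu/vm0/tasks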

Thanks!
Mohammed

On Tue, May 26, 2015 at 1:56 PM, Mohammed Naser <[email protected]> wrote:
> [snip: full quote of the previous message]



--
Mohammed Naser — vexxhost
-----------------------------------------------------
D. 514-316-8872
D. 800-910-1726 ext. 200
E. [email protected]
W. http://vexxhost.com