Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753538AbbETNQv (ORCPT ); Wed, 20 May 2015 09:16:51 -0400 Received: from smtp.codeaurora.org ([198.145.29.96]:33216 "EHLO smtp.codeaurora.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752517AbbETNQs (ORCPT ); Wed, 20 May 2015 09:16:48 -0400 Message-ID: In-Reply-To: References: Date: Wed, 20 May 2015 13:16:46 -0000 Subject: Re: Crash in crc32_le during kmemleak_scan() From: vigneshr@codeaurora.org To: vigneshr@codeaurora.org Cc: linux-kernel@vger.kernel.org, catalin.marinas@arm.com, bernd.schubert@itwm.fraunhofer.de User-Agent: SquirrelMail/1.4.22-4.el6 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Priority: 3 (Normal) Importance: Normal Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8370 Lines: 220 Hi, Sharing an update. We got back the test results. As such we did not observe the bug, but we hit another side effect. After kmemleak was disabled due to low memory 96.825883: <6> kmemleak: Cannot allocate a kmemleak_object structure 96.825902: <6> kmemleak: Cannot allocate a kmemleak_object structure 96.825923: <6> kmemleak: Kernel memory leak detector disabled We saw the following warning appearing in the dmesg : 97.023202: <6> kmemleak: Trying to color unknown object at 0xffffffc01f0b2c80 as Grey 97.029735: <6> CPU: 0 PID: 49 Comm: kworker/u8:3 Tainted: G W O 3.10.49-gde07ba9-00027-g9ce640d #1 97.029751: <2> Workqueue: diag_wq diag_read_smd_work_fn 97.029758: <2> Call trace: 97.029769: <2> [] dump_backtrace+0x0/0x270 97.029777: <2> [] show_stack+0x10/0x1c 97.029786: <2> [] dump_stack+0x1c/0x28 97.029794: <2> [] paint_ptr+0x64/0xb8 97.029801: <2> [] kmemleak_not_leak+0x64/0x98 97.029808: <2> [] diagmem_alloc+0x148/0x174 97.029816: <2> [] diag_usb_write+0x80/0x180 97.029823: <2> [] diag_mux_write+0x38/0x4c 97.029830: <2> [] diag_process_smd_read_data+0x270/0x2e4 97.029837: <2> [] diag_smd_send_req+0x578/0x5b0 97.029844: <2> [] diag_read_smd_work_fn+0xc/0x18 97.029852: <2> [] process_one_work+0x258/0x3d0 97.029858: <2> [] worker_thread+0x1f0/0x340 97.029867: <2> [] kthread+0xac/0xb8 This should be fixed by the addition of the following to the patch that was provided earlier : --------------------- 8< -------------------------------- diff --git a/mm/kmemleak.c b/mm/kmemleak.c index 5ec8b71..7d67c87 100644 --- a/mm/kmemleak.c +++ b/mm/kmemleak.c @@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr) { pr_debug("%s(0x%p)\n", __func__, ptr); - if (kmemleak_enabled && ptr && !IS_ERR(ptr)) + if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error) make_gray_object((unsigned long)ptr); else if (kmemleak_early_log) log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0); --------------------- 8< -------------------------------- However, the more concerning thing was that we crashed later at around 100 seconds. I have pasted the snippet of the crash log below : 100.946902: <6> Unable to handle kernel NULL pointer dereference at virtual address 00000000 100.953953: <6> pgd = ffffffc00007d000 100.957338: <2> [00000000] *pgd=000000008c307003, *pmd=000000008c308003, *pte=00e000000b000407 100.965587: <6> Internal error: Oops: 96000006 [#1] PREEMPT SMP 100.971139: <2> Modules linked in: wlan(O) 100.974876: <6> CPU: 0 PID: 246 Comm: kworker/0:2 Tainted: G W O 3.10.49-gde07ba9-00027-g9ce640d #1 100.984254: <2> Workqueue: events kmemleak_do_cleanup 100.988936: <6> task: ffffffc02e7d9580 ti: ffffffc02da0c000 task.ti: ffffffc02da0c000 100.996404: <2> PC is at rb_erase+0x224/0x344 101.000393: <2> LR is at __delete_object+0x3c/0xc8 101.004820: <2> pc : [] lr : [] pstate: 600001c5 101.012198: <2> sp : ffffffc02da0fcb0 101.015497: <2> x29: ffffffc02da0fcb0 x28: 0000000000000009 101.020792: <2> x27: ffffffc0015f9000 x26: 0000000000000000 101.026086: <2> x25: ffffffc034ad2a40 x24: ffffffc02da0c000 101.031380: <2> x23: ffffffc034ad7b00 x22: 0000000000000000 101.036676: <2> x21: 0000000000000140 x20: ffffffc0014e26b8 101.041972: <2> x19: ffffffc0236b03c0 x18: 0000000000000000 101.047266: <2> x17: 0000007f973e7f8c x16: ffffffc00030447c 101.052561: <2> x15: 0000000000000000 x14: 0000000066666667 101.057857: <2> x13: 000000000000000a x12: 0000007f7c95e9f0 101.063152: <2> x11: 731a976167c4c606 x10: 0000000000000013 101.068447: <2> x9 : 0000000000127500 x8 : 0000001780764fe3 101.073743: <2> x7 : ffffffc0236b1bc0 x6 : 0000000000000000 101.079038: <2> x5 : ffffffc02eeb6f41 x4 : 0000000000000000 101.084333: <2> x3 : 0000000000000000 x2 : ffffffc02eeb6f40 101.089628: <2> x1 : ffffffc0016e05b8 x0 : 0000000000000000 101.454125: <6> Process kworker/0:2 (pid: 246, stack limit = 0xffffffc02da0c058) 101.461155: <2> Call trace: 101.463590: <2> [] rb_erase+0x224/0x344 101.468624: <2> [] delete_object_full+0x1c/0x34 101.474353: <2> [] __kmemleak_do_cleanup+0x38/0x54 101.480342: <2> [] kmemleak_do_cleanup+0x38/0x6c 101.486160: <2> [] process_one_work+0x258/0x3d0 101.491887: <2> [] worker_thread+0x1f0/0x340 101.497359: <2> [] kthread+0xac/0xb8 101.502132: <6> Code: f9000046 14000040 f9400843 aa0303e0 (f9400064) Analysis so far : I haven't been able to figure out exactly how the dots are connected and the relation to the change provided earlier but the following is what i could put together : When i checked the post mortem analysis i see the following in the crashing stack : -018|rb_erase( | [locdesc] node = ?, | [X1] root = 0xFFFFFFC0016E05B8) | [X1] root = 0xFFFFFFC0016E05B8 | [locdesc] node = ? | [X2] parent = 0xFFFFFFC02EEB6F40 | [X4] node = 0x0 | [X3] sibling = 0x0 -019|__delete_object( | [X19] object = 0xFFFFFFC0236B03C0) | [X21] flags = 0x0140 -020|delete_object_full( | ?) | [X19] object = 0xFFFFFFC0236B03C0 -021|__kmemleak_do_cleanup() | [X19] object = 0xFFFFFFC0236B03C0 -> ( | [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock = ([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner | [0xFFFFFFC0236B03D8] flags = 0x0, | [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920, [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200), | [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0] next = 0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200), [0xFFFFFFC0236B0400] rb_node = ( [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41, [0xFFFFFFC0236B0408] rb_right = 0x0, [0xFFFFFFC0236B0410] rb_left = 0x0 The node is becoming 0x0 here and is crashing in the following piece of code in rb_erase: 309 } else { 310 sibling = parent->rb_left; 311 if (rb_is_red(sibling)) { and rb_is_red(sibling) resolves into : 96 #define rb_is_red(rb) __rb_is_red((rb)->__rb_parent_color) 1. Here the parent ptr refers to Object pointer residing at 0xFFFFFFC0236B03C0 and we can see that rb_node->rb_left is becoming 0x0 ( this is from the values observed in the crashed stack) [0xFFFFFFC0236B0400] rb_node = ( [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41, [0xFFFFFFC0236B0408] rb_right = 0x0, [0xFFFFFFC0236B0410] rb_left = 0x0 2. Therefore the sibling will take the value 0x0 3. And then when rb->__rb_parent_color operation takes place, we crash since its not able to resolve 0x0 (since rb=0x0 here). So the question is how this variable ended up being this way. 1. I checked the object_list via list walkthrough and it did not report any corruptions. 2. However, the object in question from our call stack, object->object_list pointers looks a bit amiss. object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920, [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200), But its not clear to me how it ended up being in this state. I will continue to debug what went wrong and share my analysis. Please let me know if anything looks obvious to you that i can try out. ==================== Thanks and regards, Vignesh Radhakrishnan QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/