Date: Thu, 21 May 2015 05:45:17 -0000
Subject: Re: Crash in crc32_le during kmemleak_scan()
From: vigneshr@codeaurora.org
To: vigneshr@codeaurora.org
Cc: vigneshr@codeaurora.org, linux-kernel@vger.kernel.org,
    catalin.marinas@arm.com, bernd.schubert@itwm.fraunhofer.de

> Analysis so far:
>
> I haven't been able to figure out exactly how the dots are connected,
> or the relation to the change provided earlier, but the following is
> what I could put together.
>
> The post-mortem analysis shows the following in the crashing stack:
>
> -018|rb_erase(
>     |    [locdesc] node = ?,
>     |    [X1] root = 0xFFFFFFC0016E05B8)
>     |    [X1] root = 0xFFFFFFC0016E05B8
>     |    [locdesc] node = ?
>     |    [X2] parent = 0xFFFFFFC02EEB6F40
>     |    [X4] node = 0x0
>     |    [X3] sibling = 0x0
> -019|__delete_object(
>     |    [X19] object = 0xFFFFFFC0236B03C0)
>     |    [X21] flags = 0x0140
> -020|delete_object_full(
>     |    ?)
>     |    [X19] object = 0xFFFFFFC0236B03C0
> -021|__kmemleak_do_cleanup()
>     |    [X19] object = 0xFFFFFFC0236B03C0 -> (
>     |      [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock =
>                ([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner
>     |      [0xFFFFFFC0236B03D8] flags = 0x0,
>     |      [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0]
>                next = 0xFFFFFFC0236B1920,
>                [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
>     |      [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0]
>                next = 0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200),
>     |      [0xFFFFFFC0236B0400] rb_node = (
>     |        [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
>     |        [0xFFFFFFC0236B0408] rb_right = 0x0,
>     |        [0xFFFFFFC0236B0410] rb_left = 0x0
>
> The node becomes 0x0 here, and the crash occurs in the following piece
> of code in rb_erase():
>
> 309        } else {
> 310                sibling = parent->rb_left;
> 311                if (rb_is_red(sibling)) {
>
> where rb_is_red() resolves to:
>
> 96 #define rb_is_red(rb)   __rb_is_red((rb)->__rb_parent_color)
>
> 1. Here the parent pointer refers to the object residing at
>    0xFFFFFFC0236B03C0, and we can see that rb_node->rb_left is 0x0
>    (from the values observed in the crashed stack):
>
>    [0xFFFFFFC0236B0400] rb_node = (
>    [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
>    [0xFFFFFFC0236B0408] rb_right = 0x0,
>    [0xFFFFFFC0236B0410] rb_left = 0x0
>
> 2. The sibling therefore takes the value 0x0.
>
> 3. The (rb)->__rb_parent_color dereference then crashes, since rb is
>    0x0 and cannot be resolved.
>
> So the question is how this object ended up in this state.
>
> 1. I checked the object_list via a list walkthrough and it did not
>    report any corruption.
>
> 2. However, for the object in question from our call stack, the
>    object->object_list pointers look a bit amiss.
>
>    object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920,
>                   [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200)
>
> But it's not clear to me how it ended up in this state.
>
> I will continue to debug what went wrong and share my analysis. Please
> let me know if anything looks obvious to you that I can try out.

I went through the bug again, and I think the following is happening.
The poisoned pointers above are a strong hint: 0x00200200 is
LIST_POISON2 and 0x00100100 is LIST_POISON1, the values list_del()
writes into a removed list_head, so the object appears to have been
unlinked once already. In other words, there is a possible race between
two cores causing the aforementioned crash:

  Core 0                             |  Core 1
  ___________________________________|__________________________________
  rb_erase()                         |
  __delete_object()                  |
  delete_object_full()               |
                                     |  __delete_object()
                                     |  delete_object_full()
                                     |  kmemleak_free() - call from
                                     |    some other context
  __kmemleak_do_cleanup()            |
  kmemleak_do_cleanup()              |
  process_one_work()                 |
  worker_thread()                    |
  kthread()                          |
  ret_from_fork()                    |

Just after __delete_object() running on core 1 releases the write lock
(kmemleak_lock), core 0 picks up the same object and tries to rb_erase()
it again. This double erase is what causes the crash in this fashion.
Therefore, together with the warning that I quoted in the earlier mail
and the race above, I think the following can be added to the patch
provided earlier:

---------------------------------- 8< -------------------------------------

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5ec8b71..4455bb8 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -959,7 +959,7 @@ void __ref kmemleak_free(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
 		delete_object_full((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_FREE, ptr, 0, 0);
@@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
 		make_gray_object((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);

---------------------------------- 8< -------------------------------------

Please let me know if the above analysis and the addition to the patch
look good to you, or if you have a different opinion. If they look good,
I can run the test once more to see if the crash pops back up.

====================

Thanks and regards,
Vignesh Radhakrishnan

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.