Date: Thu, 21 May 2015 05:45:17 -0000
Subject: Re: Crash in crc32_le during kmemleak_scan()
From: vigneshr@codeaurora.org
To: vigneshr@codeaurora.org
Cc: vigneshr@codeaurora.org, linux-kernel@vger.kernel.org,
    catalin.marinas@arm.com, bernd.schubert@itwm.fraunhofer.de

> Analysis so far:
>
> I haven't been able to figure out exactly how the dots are connected,
> or the relation to the change provided earlier, but the following is
> what I could put together.
>
> The post-mortem analysis shows the following in the crashing stack:
>
> -018|rb_erase(
>     |    [locdesc] node = ?,
>     |    [X1] root = 0xFFFFFFC0016E05B8)
>     |    [X1] root = 0xFFFFFFC0016E05B8
>     |    [locdesc] node = ?
>     |    [X2] parent = 0xFFFFFFC02EEB6F40
>     |    [X4] node = 0x0
>     |    [X3] sibling = 0x0
> -019|__delete_object(
>     |    [X19] object = 0xFFFFFFC0236B03C0)
>     |    [X21] flags = 0x0140
> -020|delete_object_full(
>     |    ?)
>     |    [X19] object = 0xFFFFFFC0236B03C0
> -021|__kmemleak_do_cleanup()
>     |    [X19] object = 0xFFFFFFC0236B03C0 -> (
>     |      [0xFFFFFFC0236B03C0] lock = ([0xFFFFFFC0236B03C0] rlock =
>                ([0xFFFFFFC0236B03C0] raw_lock = ([0xFFFFFFC0236B03C0] owner
>     |      [0xFFFFFFC0236B03D8] flags = 0x0,
>     |      [0xFFFFFFC0236B03E0] object_list = ([NSD:0xFFFFFFC0236B03E0]
>                next = 0xFFFFFFC0236B1920,
>                [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200),
>     |      [0xFFFFFFC0236B03F0] gray_list = ([0xFFFFFFC0236B03F0]
>                next = 0x00100100, [0xFFFFFFC0236B03F8] prev = 0x00200200),
>     |      [0xFFFFFFC0236B0400] rb_node = (
>     |        [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
>     |        [0xFFFFFFC0236B0408] rb_right = 0x0,
>     |        [0xFFFFFFC0236B0410] rb_left = 0x0
>
> The node becomes 0x0 here, and the crash occurs in the following piece
> of code in rb_erase():
>
> 309        } else {
> 310                sibling = parent->rb_left;
> 311                if (rb_is_red(sibling)) {
>
> where rb_is_red() resolves to:
>
> 96 #define rb_is_red(rb)   __rb_is_red((rb)->__rb_parent_color)
>
> 1. Here the parent pointer refers to the object residing at
>    0xFFFFFFC0236B03C0, and we can see that rb_node->rb_left is 0x0
>    (from the values observed in the crashed stack):
>
>    [0xFFFFFFC0236B0400] rb_node = (
>    [0xFFFFFFC0236B0400] __rb_parent_color = 0xFFFFFFC02EEB6F41,
>    [0xFFFFFFC0236B0408] rb_right = 0x0,
>    [0xFFFFFFC0236B0410] rb_left = 0x0
>
> 2. The sibling therefore takes the value 0x0.
>
> 3. The (rb)->__rb_parent_color dereference then crashes, since rb is
>    0x0 and cannot be resolved.
>
> So the question is how this object ended up in this state.
>
> 1. I checked the object_list via a list walkthrough and it did not
>    report any corruption.
>
> 2. However, for the object in question from our call stack, the
>    object->object_list pointers look a bit amiss.
>
>    object_list = ([NSD:0xFFFFFFC0236B03E0] next = 0xFFFFFFC0236B1920,
>                   [NSD:0xFFFFFFC0236B03E8] prev = 0x00200200)
>
> But it's not clear to me how it ended up in this state.
>
> I will continue to debug what went wrong and share my analysis. Please
> let me know if anything looks obvious to you that I can try out.

I went through the bug again, and I think the following is happening.
The poisoned pointers above are a strong hint: 0x00200200 is
LIST_POISON2 and 0x00100100 is LIST_POISON1, the values list_del()
writes into a removed list_head, so the object appears to have been
unlinked once already. In other words, there is a possible race between
two cores causing the aforementioned crash:

  Core 0                             |  Core 1
  ___________________________________|__________________________________
  rb_erase()                         |
  __delete_object()                  |
  delete_object_full()               |
                                     |  __delete_object()
                                     |  delete_object_full()
                                     |  kmemleak_free() - call from
                                     |    some other context
  __kmemleak_do_cleanup()            |
  kmemleak_do_cleanup()              |
  process_one_work()                 |
  worker_thread()                    |
  kthread()                          |
  ret_from_fork()                    |

Just after __delete_object() running on core 1 releases the write lock
(kmemleak_lock), core 0 picks up the same object and tries to rb_erase()
it again. This double erase is what causes the crash in this fashion.
Therefore, together with the warning that I quoted in the earlier mail
and the race above, I think the following can be added to the patch
provided earlier:

---------------------------------- 8< -------------------------------------

diff --git a/mm/kmemleak.c b/mm/kmemleak.c
index 5ec8b71..4455bb8 100644
--- a/mm/kmemleak.c
+++ b/mm/kmemleak.c
@@ -959,7 +959,7 @@ void __ref kmemleak_free(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
 		delete_object_full((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_FREE, ptr, 0, 0);
@@ -1019,7 +1019,7 @@ void __ref kmemleak_not_leak(const void *ptr)
 {
 	pr_debug("%s(0x%p)\n", __func__, ptr);
 
-	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
+	if (kmemleak_enabled && ptr && !IS_ERR(ptr) && !kmemleak_error)
 		make_gray_object((unsigned long)ptr);
 	else if (kmemleak_early_log)
 		log_early(KMEMLEAK_NOT_LEAK, ptr, 0, 0);

---------------------------------- 8< -------------------------------------

Please let me know if the above analysis and the addition to the patch
look good to you, or if you have a different opinion. If they look good,
I can run the test once more to see if the crash pops back up.

====================

Thanks and regards,
Vignesh Radhakrishnan

QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation.