Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S938802AbcJXJxr (ORCPT ); Mon, 24 Oct 2016 05:53:47 -0400
Received: from merlin.infradead.org ([205.233.59.134]:55332
        "EHLO merlin.infradead.org" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S936515AbcJXJxq (ORCPT );
        Mon, 24 Oct 2016 05:53:46 -0400
Date: Mon, 24 Oct 2016 11:53:41 +0200
From: Peter Zijlstra
To: "Ni, BaoleX"
Cc: "mingo@redhat.com", "acme@kernel.org",
        "linux-kernel@vger.kernel.org",
        "alexander.shishkin@linux.intel.com",
        "Liu, Chuansheng", Oleg Nesterov
Subject: Re: hit a KASan bug related to Perf during stress test
Message-ID: <20161024095341.GF3102@twins.programming.kicks-ass.net>
References: <318B87A793BE164187D8851D6CE09D64371C8811@shsmsx102.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <318B87A793BE164187D8851D6CE09D64371C8811@shsmsx102.ccr.corp.intel.com>
User-Agent: Mutt/1.5.23.1 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4869
Lines: 85

On Mon, Oct 24, 2016 at 09:35:46AM +0000, Ni, BaoleX wrote:
>
> [32736.018823] BUG: KASan: use after free in task_tgid_nr_ns+0x35/0xb0 at addr ffff8800265568c0
> [32736.028309] Read of size 8 by task dumpsys/11268
> [32736.033511] =============================================================================
> [32736.042700] BUG task_struct (Tainted: G W O): kasan: bad access detected

'W' this wasn't the first WARN you got, this means this might be the
result of prior borkage.

Also, it says: "BUG task_struct", does that mean task_struct was the
object accessed after free?

> [32736.051002] -----------------------------------------------------------------------------
> [32736.051002]
> [32736.061840] Disabling lock debugging due to kernel taint
> [32736.067830] INFO: Slab 0xffffea0000995400 objects=5 used=3 fp=0xffff880026550000 flags=0x4000000000004080
> [32736.078572] INFO: Object 0xffff880026556440 @offset=25664 fp=0x (null)
> ...
> [32738.776936] CPU: 0 PID: 11268 Comm: dumpsys Tainted: G B W O 3.14.70-x86_64-02260-g162539f #1
> [32738.787092] Hardware name: Insyde CherryTrail/T3 MRD, BIOS CHTMRD.A6.002.016 09/20/2016
> [32738.796082] ffff880026550000 0000000000000086 0000000000000000 ffff880065e05a70
> [32738.796215] ffffffff81fc9427 ffff880065803b40 ffff880026556440 ffff880065e05aa0
> [32738.796345] ffffffff8123fe2d ffff880065803b40 ffffea0000995400 ffff880026556440
> [32738.796475] Call Trace:
> [32738.796510]
> [32738.796585] [] dump_stack+0x67/0x90
> [32738.802404] [] print_trailer+0xfd/0x170
> [32738.808603] [] object_err+0x36/0x40
> [32738.814417] [] kasan_report_error+0x1fd/0x3d0
> [32738.821193] [] ? __rcu_read_unlock+0x24/0x90
> [32738.827881] [] ? preempt_count_sub+0x18/0xf0
> [32738.834565] [] ? perf_output_put_handle+0x5c/0x170
> [32738.841833] [] kasan_report+0x40/0x50
> [32738.847838] [] ? task_tgid_nr_ns+0x35/0xb0
> [32738.854327] [] __asan_load8+0x69/0xa0
> [32738.860333] [] ? perf_output_copy+0x88/0x120
> [32738.867020] [] task_tgid_nr_ns+0x35/0xb0

So here we did:

	perf_event_[pt]id(event, current);

How can _current_ not be valid anymore?

> [32738.873319] [] __perf_event_header__init_id+0xb8/0x200
> [32738.880970] [] perf_prepare_sample+0xa9/0x4a0
> [32738.887754] [] __perf_event_overflow+0x3f0/0x460
> [32738.894835] [] ? x86_perf_event_set_period+0x128/0x210
> [32738.902496] [] perf_event_overflow+0x14/0x20
> [32738.909180] [] intel_pmu_handle_irq+0x25c/0x520
> [32738.916156] [] ? __asan_store8+0x15/0xa0
> [32738.922460] [] perf_event_nmi_handler+0x2b/0x50
> [32738.929437] [] nmi_handle+0x88/0x230
> [32738.935346] [] do_nmi+0x193/0x490
> [32738.940963] [] end_repeat_nmi+0x1a/0x1e
> [32738.947163] [] ? __asan_load8+0x32/0xa0
> [32738.953358] [] ? __asan_load8+0x32/0xa0
> [32738.959554] [] ? __asan_load8+0x32/0xa0
> [32738.965718] <>
> [32738.965787] [] ? check_preempt_wakeup+0x1a2/0x3a0
> [32738.972970] [] check_preempt_curr+0xf8/0x120
> [32738.979658] [] ttwu_do_wakeup+0x1d/0x1b0
> [32738.985953] [] ttwu_do_activate.constprop.105+0x89/0x90
> [32738.993710] [] try_to_wake_up+0x29e/0x4e0
> [32739.000100] [] default_wake_function+0x2f/0x40
> [32739.006979] [] autoremove_wake_function+0x18/0x50
> [32739.014149] [] ? preempt_count_sub+0x18/0xf0
> [32739.020836] [] __wake_up_common+0x79/0xb0
> [32739.027232] [] __wake_up+0x39/0x50
> [32739.032945] [] __call_rcu_nocb_enqueue+0x158/0x160
> [32739.040207] [] __call_rcu+0x12c/0x450

And while we just called release_task(), that call_rcu() should still be
pending at this point, also I don't think that can be current until
after do_task_dead() where we schedule away from the dead task and
change current.

> [32739.046207] [] call_rcu+0x1d/0x20
> [32739.051821] [] release_task+0x6aa/0x8d0
> [32739.058022] [] ? do_raw_write_unlock+0x6f/0xd0
> [32739.064900] [] do_exit+0xe52/0x1020
> [32739.070712] [] SyS_exit+0x22/0x30
> [32739.076328] [] sysenter_dispatch+0x7/0x1f
> [32739.082725] [] ? trace_hardirqs_on_thunk+0x3a/0x3c

Oleg, any idea?