Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754779AbcCAQB7 (ORCPT ); Tue, 1 Mar 2016 11:01:59 -0500 Received: from mail-am1on0084.outbound.protection.outlook.com ([157.56.112.84]:21952 "EHLO emea01-am1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1754543AbcCAQB6 (ORCPT ); Tue, 1 Mar 2016 11:01:58 -0500 Authentication-Results: linaro.org; dkim=none (message not signed) header.d=none;linaro.org; dmarc=none action=none header.from=mellanox.com; Subject: Re: [PATCH 2/4] nmi_backtrace: generate one-line reports for idle cpus To: Daniel Thompson References: <1456782024-7122-1-git-send-email-cmetcalf@ezchip.com> <1456782024-7122-3-git-send-email-cmetcalf@ezchip.com> <56D5A5E6.9050206@linaro.org> CC: Russell King , Thomas Gleixner , Ingo Molnar , Andrew Morton , , Aaron Tomlin , Peter Zijlstra , "Rafael J. Wysocki" , Daniel Lezcano From: Chris Metcalf Message-ID: <56D5BCE6.3010300@mellanox.com> Date: Tue, 1 Mar 2016 11:01:42 -0500 User-Agent: Mozilla/5.0 (X11; Linux i686 on x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.5.1 MIME-Version: 1.0 In-Reply-To: <56D5A5E6.9050206@linaro.org> Content-Type: text/plain; charset="windows-1252"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [12.216.194.146] X-ClientProxiedBy: BLUPR16CA0047.namprd16.prod.outlook.com (2a01:111:e400:5160::15) To AM3PR05MB418.eurprd05.prod.outlook.com (2a01:111:e400:880d::14) X-Microsoft-Exchange-Diagnostics: 1;AM3PR05MB418;2:5McKgiLfUjbrgVCJ/GGFc7lDsFxZQ+a9lwubSPtFLI/6fg2hX6r50rU1RgNWNmHYoOgJu3FTuXJrAbi9f+TiAC81Ii7mutqneEaYzqRdIZTA+CkHyjI05VjRU+Yx5CYEvtl/AlaElKCR2Oau6MfJQg==;3:eIga2Bo+Hk9kEC1y0xsmT9zC9mS6cFc7oI2bgLz3miUSH9WFf/SCV/MgtySxn9j22brwa+qdPXHFBzd18jDM36H+w+r6uuhQm//5CgNpoonylMGAiVbaxtchq4ByZ1fV;25:UurA84GzjwfdQ5H1K1VQSd/3ur4DuUX/iH4zuF3LyjuEqGSU3Y57XaNEGW+lsO4jtbLtdv0zVZ4iRGlsx2djhYGz4Fsl3S9is0zROH5tn8c6pK43dQk5PDl47KYfaP9jzpUH/9aV280+0ZKc7l/qViR2I3pGj4NxrfhNjk90TepkcSQw1Ib7+1pneYOUOO0MGhW8Xc/pVor+M9o17fEBljs8gVTntDr4BeG0gxMUZjWa/1jzCTEle5IgO68d7IdFfmE5q4In7NyBf+8vbR8lUXp5QDg6TN01QorGjTyFBtCTYjAQKefpRDG2lh31mkBRk9Ekr/vKhaegfOd+sQhqoQ== X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:AM3PR05MB418; X-MS-Office365-Filtering-Correlation-Id: fa310a53-43c3-4bc9-da51-08d341eacf95 X-MLNXRule-EZCH-Linux: Rule triggered X-Microsoft-Exchange-Diagnostics: 1;AM3PR05MB418;20:RPTeS9otRVj7+BhtOpe2S1SZMrClngzSoBRRHtG0aBpG6OXoPohtPVJWZYqbN2TyiK+xTQgXRtUk3rz1bS5FjkS3FJ4uV713I6gM5eotdYW0bceqsWWheyXPa6EHVWX/CmMayk/ToQqo4ucFPBc5ugiPQsTvDUQslEirjev/JTbbZ+YU/U4BYmjDhzvfbj3Vue/PKU2qYib4c7GC80xX772ZSjtEhsmL2P0NkI9i3rjToZ1RhIqWUc8iNNOYPb0zvtLIZqv3mCl1Cu3lSj/4MUf+cw8/ognue0+cNqBd/xv7Urq6hW+BKUTBvH8+NTg1LAYMzGyatiLnpkmRHX8Dw+3ePHdAd8cwTs9bWVevZyfwTvZFrZBp4i01aSMS+kjUxF5y9bkmzP8iijZ7ydHs6HkjU684YhL4DYBBF6FjFZohMH840yTHI3KX66QGdSwl4Ex85P92Ip6viHhJCIJupOIrbMTFCkkdNsrRouSozii75hEgFAhsR/WRZ14z7hrB;4:5PAMeEvHz0q11lE6QI88N3VgPV9ELWCZHaMtBDZLLnP5GUuyfjBh+2Y6LoWHcI2IVo2nv+4i+ci12U6n9x3WsR4FJ9rLdXg4Ej4EjVZWRgeivk22K2yOSzAcJCoSRNeuGsVbPVblzDs9kN3NuHFa9DZrOmCSoF0LjDW+6ndGGbLAtN2UU17l6E6bN5cbkClJdfp3HC4ArfCn1JO9to6z1Fat9OyNM657nET31JOsFv0vvJr/mNwdNrkDPKAaq5yGtDHlEU+GmaKFIieMIL19mO6nHrDA4Indi+Vwt7d4fmULVW9xruFf0TuGYDIxUtlk2hR+sgqsyhxs+q15qcqb8BwjzeI32hJCsNeR7kU5J9lDSGqaxvwYB8jnY69AMwNp X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:; X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(2401047)(8121501046)(5005006)(10201501046)(3002001);SRVR:AM3PR05MB418;BCL:0;PCL:0;RULEID:;SRVR:AM3PR05MB418; X-Forefront-PRVS: 086831DFB4 X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10009020)(4630300001)(6049001)(6009001)(24454002)(377454003)(479174004)(189998001)(47776003)(1096002)(110136002)(64126003)(586003)(230700001)(5001960100003)(15975445007)(3846002)(6116002)(5008740100001)(33656002)(36756003)(77096005)(19580405001)(19580395003)(66066001)(81156008)(4326007)(2906002)(42186005)(50466002)(65806001)(4001350100001)(2950100001)(65816999)(76176999)(50986999)(92566002)(40100003)(83506001)(87266999)(5004730100002)(23746002)(54356999)(87976001)(86362001)(122386002)(18886065003);DIR:OUT;SFP:1101;SCL:1;SRVR:AM3PR05MB418;H:[10.15.7.41];FPR:;SPF:None;MLV:sfv;LANG:en; X-Microsoft-Exchange-Diagnostics: =?Windows-1252?Q?1;AM3PR05MB418;23:0VEnQxjXwK45agfyg8L+X0S5JfuXZBHUBC6ic8?= =?Windows-1252?Q?mZC9h+LGDgK097ObURHZXXAOxPDbOErLZemiJcPo+Wsd0C0z2gCpdCiY?= =?Windows-1252?Q?+wfgOCaXkFXxPppVulXvmOKX2ADkJP2vu80Wpn1VYhNKinrTq/ljKHRc?= =?Windows-1252?Q?jvY4UmMivnxYbSKMYSvLmRPb1+lgOzoS7Xm895ydkN5SO/LR67b0Keb4?= =?Windows-1252?Q?oEB6yVae4G4f2dp9AkGqjY413LjamgJ4xIzYT27EhWgn7G1U53DN+EMj?= =?Windows-1252?Q?pAwSTu7dNqPmWtpSgcDHkIlYVyYUeFXKLZRwsUmIZjI3HRH7S8+cRZl4?= =?Windows-1252?Q?fZp7Mqz2iXkSmedmmXf6l/UWfACrvf03SewTFFlKouhle3v69w9HWVjZ?= =?Windows-1252?Q?WnxCj+/wt/uDE55G+uu8XI89TfRje8bKkzxcBYgq3Mh233KAazSTZn1m?= =?Windows-1252?Q?nJa/qb823Xh+nndy64E6cdS+vtFWQMp+Qca5QzaXCizpWGDaRx0xMmg+?= =?Windows-1252?Q?7FX0qEVlMzTHtYzKAyUZ8vhT5rSvj5Ct6KYXPMqiGBacX0vdjpEqdw+o?= =?Windows-1252?Q?9DVQc0baCCVWih5h4/37e/b+cA5Mf3i5m/m7XBxb2hoV0Yt+omCb5JEc?= =?Windows-1252?Q?QI3Dp4dL1Ca1auRKEwLTW6XUpXF4QCsKFyo9ktyahwOXRRLKdl9Ajjnv?= =?Windows-1252?Q?NyNJByM+KoODkGCS14JRiCQ96ewbnxGQsnm0YpKoGDSFsreO1mR6nqrh?= =?Windows-1252?Q?5rpJWZ2SA+c9i1yrh3tkpmMpu8BJ3+lXQBQ0TSyo4i4eiB8+FrX3OILz?= =?Windows-1252?Q?TzUEEaJx+azeEBio4fq3clb1TA86p+rQ6aA6JlUygHbbHagO1tmYs/P6?= =?Windows-1252?Q?z5m8a6bojJDFj2QpZUn2KzcOGZ3YTQNEGPINL7GYHEgOc1QqvVjO9HeX?= =?Windows-1252?Q?wgukzRkclEPJ3LJizBdf2ESH289/f2rISC/hmzOginYoV0V7YcJP9BNi?= =?Windows-1252?Q?+6bU7SjOiChtebZUimEkd6U8peL92BWpa757d4b3yOdZwYhY83wWZ6RH?= =?Windows-1252?Q?4xtBFZoJ1/35HXlOWabb0lfAnNSGFQgRruS/bYo+7hhSNVkZAxbKMxjq?= =?Windows-1252?Q?tLon/DjpzxFpO95NRNtuIdOSc/9FRhgs/fZSAoaYo+0uiDProo6Mqxtv?= =?Windows-1252?Q?VvfrIkMFPawVFyoaS1aHKh49uqNs8GUauAr7Icfue2YXhQzpSai7lwU9?= =?Windows-1252?Q?vmQEFbeFApB+Kk61t/S5B8dx6tQmjbT416LjriMDNJw9cUdqUh68vgD0?= =?Windows-1252?Q?Rj7X89gFCAIOU8fOm+9OHMZw=3D=3D?= X-Microsoft-Exchange-Diagnostics: 1;AM3PR05MB418;5:55CqVBGgKC/8bP8H3OFUZv79oK1EIGWH/L80Q5HZPundFR3UqzNuFKFF+E9bojW5gtzNXZnApCRYt3A5IbLAmnj0srcYyJjYrWpkdZF+zHHDCSOYDz68xSjsyjw+qqNA1bDmg2i4Rx9ynGF3mfLTUQ==;24:mc4R61ODyLZDAIWZNpW4+BM+FO0gsB9obuQDelfpymCNlyrP9E4zvy00z1N70ykQH5KPEDcZzZgdSInQrO33qpIDmGz3i6dm0jIjLR8hmf8= SpamDiagnosticOutput: 1:23 SpamDiagnosticMetadata: NSPM X-OriginatorOrg: Mellanox.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 01 Mar 2016 16:01:53.7416 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: AM3PR05MB418 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5272 Lines: 140 (+PeterZ, Rafael, and Daniel Lezcano for cpuidle and scheduler) On 03/01/2016 09:23 AM, Daniel Thompson wrote: > On 29/02/16 21:40, Chris Metcalf wrote: >> When doing an nmi backtrace of many cores, and most of them are idle, >> the output is a little overwhelming and very uninformative. Suppress >> messages for cpus that are idling when they are interrupted and >> just emit one line, "NMI backtrace for N skipped: idle". > > I can see this makes the logs more attractive but this is code for > emergency situations. > > The idle task is responsible for certain power management activities. > How can you be sure the system is wedged because of bugs in that code? It's a fair point, but as core count increases, you really run the risk of losing the valuable data in a sea of data that isn't. For example, for the architecture I maintain, we have the TILE-Gx72, which is a 72-core chip. If each core's register dump and backtrace is 40 lines, we're up to around 3,000 lines of console output. Filtering that down by a factor of 10x or more (if only a handful of cores happen to be active, which is not uncommon) is a substantial usability improvement. That said, it's true that the original solution I offered (examining just is_idle_task() plus interrupt nesting) is imprecise. It is relatively straightforward to add a bit of per-cpu state that is set at the same moment we currently do stop/start_critical_timings(), which would indicate much more specifically that the cpu was running the idling code itself, and not anything more complex. In that case if the flag was set, you would know you were either sitting on a processor-specific idle instruction in arch_cpu_idle(), or else polling one or two memory locations in a tight loop in cpu_idle_poll(), which presumably would offer sufficient precision to feel safe. Here's an alternative version of the patch which incorporates this idea. Do you think this is preferable? Thanks! commit 5b6dca9bad908ae66fa764025c4e6046a6cc0262 Author: Chris Metcalf Date: Mon Feb 29 11:56:32 2016 -0500 nmi_backtrace: generate one-line reports for idle cpus When doing an nmi backtrace of many cores, and most of them are idle, the output is a little overwhelming and very uninformative. Suppress messages for cpus that are idling when they are interrupted and just emit one line, "NMI backtrace for N skipped: idle". Signed-off-by: Chris Metcalf diff --git a/include/linux/cpuidle.h b/include/linux/cpuidle.h index 786ad32631a6..b8c3c4cf88ad 100644 --- a/include/linux/cpuidle.h +++ b/include/linux/cpuidle.h @@ -206,6 +206,7 @@ static inline int cpuidle_enter_freeze(struct cpuidle_driver *drv, /* kernel/sched/idle.c */ extern void sched_idle_set_state(struct cpuidle_state *idle_state); extern void default_idle_call(void); +extern bool in_cpu_idle(void); #ifdef CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED void cpuidle_coupled_parallel_barrier(struct cpuidle_device *dev, atomic_t *a); diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c index 544a7133cbd1..9aff315f278b 100644 --- a/kernel/sched/idle.c +++ b/kernel/sched/idle.c @@ -52,15 +52,25 @@ static int __init cpu_idle_nopoll_setup(char *__unused) __setup("hlt", cpu_idle_nopoll_setup); #endif +static DEFINE_PER_CPU(bool, cpu_idling); + +/* Was the cpu was in the low-level idle code when interrupted? */ +bool in_cpu_idle(void) +{ + return this_cpu_read(cpu_idling); +} + static inline int cpu_idle_poll(void) { rcu_idle_enter(); trace_cpu_idle_rcuidle(0, smp_processor_id()); local_irq_enable(); stop_critical_timings(); + this_cpu_write(cpu_idling, true); while (!tif_need_resched() && (cpu_idle_force_poll || tick_check_broadcast_expired())) cpu_relax(); + this_cpu_write(cpu_idling, false); start_critical_timings(); trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id()); rcu_idle_exit(); @@ -89,7 +99,9 @@ void default_idle_call(void) local_irq_enable(); } else { stop_critical_timings(); + this_cpu_write(cpu_idling, true); arch_cpu_idle(); + this_cpu_write(cpu_idling, false); start_critical_timings(); } } diff --git a/lib/nmi_backtrace.c b/lib/nmi_backtrace.c index db63ac75eba0..75b5eacaa5d3 100644 --- a/lib/nmi_backtrace.c +++ b/lib/nmi_backtrace.c @@ -17,6 +17,7 @@ #include #include #include +#include #ifdef arch_trigger_cpumask_backtrace /* For reliability, we're prepared to waste bits here. */ @@ -151,11 +152,16 @@ bool nmi_cpu_backtrace(struct pt_regs *regs) /* Replace printk to write into the NMI seq */ this_cpu_write(printk_func, nmi_vprintk); - pr_warn("NMI backtrace for cpu %d\n", cpu); - if (regs) - show_regs(regs); - else - dump_stack(); + if (in_cpu_idle()) { + pr_warn("NMI backtrace for cpu %d skipped: idle\n", + cpu); + } else { + pr_warn("NMI backtrace for cpu %d\n", cpu); + if (regs) + show_regs(regs); + else + dump_stack(); + } this_cpu_write(printk_func, printk_func_save); cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask)); -- Chris Metcalf, Mellanox Technologies http://www.mellanox.com