Subject: Re: [PATCH] thermal: core: Add a back up thermal shutdown mechanism
To: Grygorii Strashko, Eduardo Valentin, Zhang Rui
References: <1490941820-13511-1-git-send-email-j-keerthy@ti.com>
 <20170411172918.GA5193@localhost.localdomain>
 <1491967248.2357.25.camel@intel.com>
 <492e72af-ff33-d193-071e-5bc00df9a8b0@ti.com>
 <20170412040542.GA11305@localhost.localdomain>
 <1491985580.2357.39.camel@intel.com>
 <1491986744.2357.42.camel@intel.com>
 <20170412154358.GA12881@localhost.localdomain>
From: Keerthy
Date: Wed, 12 Apr 2017 22:14:36 +0530
X-Mailing-List: linux-kernel@vger.kernel.org

On Wednesday 12 April 2017 10:01 PM, Grygorii Strashko wrote:
>
>
> On 04/12/2017 10:44 AM, Eduardo Valentin wrote:
>> Hello,
>>
> ...
>
>>
>> I agree. But there is nothing that says it is not reenterable. If you
>> saw something in this line, can you please share?
>>
>>>>> will you generate a patch to do this?
>>>> Sure. I will generate a patch to take care of 1) To make sure that
>>>> orderly_poweroff is called only once right away. I have already
>>>> tested.
>>>>
>>>> for 2) Cancel all the scheduled work queues to monitor the
>>>> temperature.
>>>> I will take some more time to make it and test.
>>>>
>>>> Is that okay? Or you want me to send both together?
>>>>
>>> I think you can send patch for step 1 first.
>>
>> I am happy to see that Keerthy found the problem with his setup and a
>> possible solution. But I have a few concerns here.
>>
>> 1. If regular shutdown process takes 10 seconds, that is a ballpark that
>> thermal should never wait. orderly_poweroff() calls run_cmd() with wait
>> flag set. That means, if regular userland shutdown takes 10s, we are
>> waiting for it. Obviously this is not acceptable. Especially if you set
>> up the critical trip to be 125C. Now, if you properly size the critical
>> trip to fire before the hotspot really reaches 125C, for 10s (or the time
>> it takes to shutdown), then fine. But based on what was described in this
>> thread, his system is waiting 10s on regular shutdown, and his silicon is
>> on out-of-spec temperature for 10s, which is wrong.
>>
>> 2. The above scenario is not acceptable in the long run, especially from
>> a reliability perspective. If orderly_poweroff() has a possibility to
>> simply never return (or take too long), I would say the thermal
>> subsystem is using the wrong API.
>>
>
>
> Hh, I do not see that orderly_poweroff() will wait for anything now:
> void orderly_poweroff(bool force)
> {
> 	if (force) /* do not override the pending "true" */
> 		poweroff_force = true;
> 	schedule_work(&poweroff_work);
> ^^^^^^^ async call. even here can be pretty big delay if system is under pressure
> }
>
>
> static int __orderly_poweroff(bool force)
> {
> 	int ret;
>
> 	ret = run_cmd(poweroff_cmd);

When I tried with multiple orderly_poweroff calls, ret was always 0.
So every 250 ms I see this ret = 0.

> ^^^^ no wait for the process - only for exec. flags == UMH_WAIT_EXEC
>
> 	if (ret && force) {

So it never entered this path. ret = 0, so the if body is not executed.

> 		pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>
> 		/*
> 		 * I guess this should try to kick off some daemon to sync and
> 		 * poweroff asap. Or not even bother syncing if we're doing an
> 		 * emergency shutdown?
> 		 */
> 		emergency_sync();
> 		kernel_power_off();
> ^^^ force power off, but only if run_cmd() failed - for example /sbin/poweroff doesn't exist
> 	}
>
> 	return ret;
> }
>
> static bool poweroff_force;
>
> static void poweroff_work_func(struct work_struct *work)
> {
> 	__orderly_poweroff(poweroff_force);
> }
>
> As a result, thermal has no control of power off any more after calling
> orderly_poweroff() and can get the result of US poweroff binary execution.
>
>>
>> If you are going to implement the above two patches, keep in mind:
>> i. At least within the thermal subsystem, you need to take care of all
>> zones that could trigger a shutdown.
>> ii. serializing the calls to orderly_poweroff() seems to be more
>> concerning than cancelling all monitoring.
>>
>>
>
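
For step 1 (call orderly_poweroff only once, across all zones), the idea
could be sketched roughly as below. This is only a userspace illustration
under stated assumptions, not the actual patch: fake_orderly_poweroff(),
thermal_request_poweroff_once() and the zone names in the usage note are
hypothetical, and a C11 atomic_flag stands in for whatever locking the
thermal core would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for orderly_poweroff(); it only counts requests. */
static int poweroff_requests;

static void fake_orderly_poweroff(bool force)
{
	(void)force;
	poweroff_requests++;
}

/* One flag shared by every zone, so the guard covers all of them. */
static atomic_flag shutdown_requested = ATOMIC_FLAG_INIT;

/*
 * Called by any zone that crosses its critical trip.
 * Returns true only for the caller that actually issued the request.
 */
static bool thermal_request_poweroff_once(const char *zone_name)
{
	assert(zone_name != NULL);

	/* test_and_set returns the previous value: false only for the first caller. */
	if (atomic_flag_test_and_set(&shutdown_requested))
		return false;

	printf("critical temperature reached in %s, requesting shutdown\n",
	       zone_name);
	fake_orderly_poweroff(true);
	return true;
}
```

With this guard, a first call such as
thermal_request_poweroff_once("cpu_thermal") issues the request, and any
later call from the same or another zone (for example "gpu_thermal") returns
false without calling the poweroff stand-in again, even if the zones are
polled every 250 ms.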