Date: Wed, 24 Jun 2015 00:17:19 +0800
Subject: Re: [non-pretimeout,4/7] Watchdog: introduce ARM SBSA watchdog driver
From: Fu Wei
To: Guenter Roeck
Cc: Suravee Suthikulpanit, Linaro ACPI Mailman List,
    linux-watchdog@vger.kernel.org, devicetree@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, Wei Fu,
    G Gregory, Al Stone, Hanjun Guo, Timur Tabi, Ashwin Chaugule,
    Arnd Bergmann, Vipul Gandhi, Wim Van Sebroeck, Jon Masters, Leo Duran,
    Jon Corbet, Mark Rutland, Catalin Marinas, Will Deacon,
    rjw@rjwysocki.net
In-Reply-To: <20150623152131.GA9990@roeck-us.net>
References: <1433958452-23721-5-git-send-email-fu.wei@linaro.org>
    <20150611162810.GA22711@roeck-us.net>
    <20150623152131.GA9990@roeck-us.net>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Guenter,

You can always provide help very quickly, thank you very much :-)

On 23 June 2015 at 23:21, Guenter Roeck wrote:
> On Tue, Jun 23, 2015 at 09:26:35PM +0800, Fu Wei wrote:
>> Hi Guenter,
> [ ...]
>
>> >
>> >> + * When the first timeout occurs, WS0(SPI or LPI) is triggered,
>> >> + * the second timeout period(as long as the first timeout period) starts.
>> >
>> > no longer accurate if WOR is used for the second period.
>> >
>> >> + * In WS0 interrupt routine, panic() will be called for collecting
>> >> + * crashdown info.
>> >> + * If system can not recover from WS0 interrupt routine, then second
>> >> + * timeout occurs, WS1(reset or higher level interrupt) is triggered.
>> >> + * The two timeout period can be set by WOR(32bit).
>> >
>> > The second timeout period is determined by ...
>> >
>> >> + * WOR gives a maximum watch period of around 10s at the maximum
>> >> + * system counter frequency.
>> >> + * The System Counter shall run at maximum of 400MHz.
>> >
>> > "... at the maximum system counter frequency of 400 MHz.", and drop the
>> > last sentence.
>>
>> For the second timeout period, I have discussed with a kdump developers,
>> (1)10s maybe not good enough for all the case of panic + kdump, so
>> maybe we still need to use WCV in the second timeout period
>> (2)in the second timeout period, maybe we need to programme WCV for
>> two reason: a, trigger WS1 to reboot system ASAP; b, feed the watchdog
>> without cleanning WS0 flag.
>>
>> WHY we want to feed the watchdog (keepalive) without cleanning WS0 flag??
>> REASON:
>> (1)if the system context is large, we may need to feed the dog until
>> we get all the things backed up.
>> (2)if system goes wrong, WS0 triggered, then panic--> kdump. if we
>> feed the dog by WRR or programming WOR, WS0 flag will be cleaned. Once
>> system goes wrong again, then panic again.....
>> So this system will be in a panic--kdump--panic--kdump loop, have not
>> chance to reset.
>>
>> So if we are in the second timeout period, we may need to always programme WCV.
>>
> The crashdump kernel is supposed to reload the watchdog driver, which will ping
> the watchdog. If it isn't able to do that in 10 seconds, something is wrong.

Yes, 10s may not be enough for all cases. When I tested kdump on arm64,
it sometimes took 20s.

So I am thinking: can we make the max value of pretimeout > 10s in this
driver?
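
For reference, the "around 10s" figure follows from WOR being 32 bits
wide: 2^32 ticks at 400 MHz is about 10.7 s. Below is a minimal sketch of
that arithmetic in plain C. The helper names wor_max_period_ms() and
wor_period_fits() are hypothetical and not part of the patch, and clk_hz
is an assumed input that a driver would obtain from the system counter.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value a 32-bit WOR can hold, in system counter ticks. */
#define WOR_MAX_TICKS UINT32_MAX

/*
 * Longest period, in whole milliseconds, that a 32-bit WOR can express
 * at the given system counter frequency (clk_hz is an assumed input).
 */
uint64_t wor_max_period_ms(uint64_t clk_hz)
{
	assert(clk_hz > 0);
	return (uint64_t)WOR_MAX_TICKS * 1000u / clk_hz;
}

/* Returns nonzero if a requested period (in ms) fits in WOR at clk_hz. */
int wor_period_fits(uint64_t period_ms, uint64_t clk_hz)
{
	return period_ms <= wor_max_period_ms(clk_hz);
}
```

With these helpers, wor_max_period_ms(400000000) evaluates to 10737 (ms),
so a 20 s kdump run does not fit in WOR at 400 MHz. At 100 MHz the limit
would be about 42.9 s.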
>
>> >> +
>> >> +	status = readl_relaxed(gwdt->control_base + SBSA_GWDT_WCS);
>> >> +	if (status & SBSA_GWDT_WCS_WS1) {
>> >> +		dev_warn(dev, "System reset by WDT(WCV: %llx)\n",
>> >> +			 sbsa_gwdt_get_wcv(wdd));
>> >
>> > WCV here only tells us how many clock cycles were executed since the
>> > system started (or something like that). So I still don't understand
>> > why it is valuable to print that number.
>>
>> this number provides the time of system reset, I thinks that may help
>> admin to analyse the system failure.
>>
> It doesn't mean anything to anyone but you since it is not in a well defined
> time scale.

Maybe I should convert it to seconds?
I think the original value might be better?

> Also, I would be somewhat surprised if WCV would retain its value
> on reset. Much more likely it is the time (in clock cycles) since reset.

Yes, this is mentioned in SBSA:
---------------------
If WS0 is asserted and a timeout refresh occurs then the following must occur:
- If the system is compliant to SBSA level 0 or level 1 then it is
  IMPLEMENTATION DEFINED as to whether the compare value is loaded with the
  sum of the zero-extended watchdog offset register and the current generic
  timer system count value, or whether it retains its current value.
- If the system is compliant to SBSA level 2 or higher the compare value
  must retain its current value. This means that the compare value records
  the time that WS1 is asserted.
---------------------

Hope I understand it correctly.
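
To make my reading of the SBSA text above concrete, here is a rough
software model of the rule, plus the tick-to-seconds conversion mentioned
earlier. This is only a sketch, not driver code: both function names are
hypothetical, level0_1_reloads stands in for the IMPLEMENTATION DEFINED
choice (it is not a real register bit), and clk_hz is an assumed input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Compare value after a timeout refresh that occurs while WS0 is
 * asserted, following the SBSA text quoted above.
 *
 * level0_1_reloads models the IMPLEMENTATION DEFINED choice allowed at
 * SBSA level 0 or 1; it is a hypothetical knob for this sketch.
 */
uint64_t wcv_after_ws0_timeout_refresh(uint64_t wcv, uint32_t wor,
				       uint64_t count, int sbsa_level,
				       bool level0_1_reloads)
{
	assert(sbsa_level >= 0);

	/* Level 2 or higher: WCV must retain its current value. */
	if (sbsa_level >= 2)
		return wcv;

	/* Level 0/1: reload with zero-extended WOR + count, or retain. */
	return level0_1_reloads ? count + (uint64_t)wor : wcv;
}

/* Whole seconds represented by a tick count at clk_hz. */
uint64_t ticks_to_seconds(uint64_t ticks, uint64_t clk_hz)
{
	assert(clk_hz > 0);
	return ticks / clk_hz;
}
```

If this reading is right, on a level 2 system the retained WCV divided by
the counter frequency gives the time of the WS1 assertion in seconds. On
level 0/1 hardware the value may have been reloaded instead.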
Please let me know if I misunderstood something, thanks.

>
> Guenter

--
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021