Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751310AbbEZSDg (ORCPT ); Tue, 26 May 2015 14:03:36 -0400 Received: from mail-qg0-f49.google.com ([209.85.192.49]:34741 "EHLO mail-qg0-f49.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751043AbbEZSDd (ORCPT ); Tue, 26 May 2015 14:03:33 -0400 MIME-Version: 1.0 In-Reply-To: <1431964402-27457-1-git-send-email-dianders@chromium.org> References: <1431964402-27457-1-git-send-email-dianders@chromium.org> From: Alim Akhtar Date: Tue, 26 May 2015 23:32:52 +0530 Message-ID: Subject: Re: [PATCH] mmc: dw_mmc: Wait for data transfer after response errors To: Doug Anderson Cc: Jaehoon Chung , Seungwon Jeon , Ulf Hansson , Alim Akhtar , Sonny Rao , Andrew Bresticker , Heiko Stuebner , Addy Ke , Alexandru Stan , Javier Martinez Canillas , Chris Zhong , Caesar Wang , "linux-mmc@vger.kernel.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9626 Lines: 223 Hi Doug, On peach-pi, I got a hung task once in 4 cold boot as [1]. And every time got a hung task [2] on suspend/resume, triggered exactly from this change. I have a debug print at $SUBJECT change. [1]: ---------------- on boot: [ 240.197190] INFO: task kworker/u16:1:50 blocked for more than 120 seconds. [ 240.202602] Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6 [ 240.208505] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.216307] kworker/u16:1 D c0505528 0 50 2 0x00000000 [ 240.222650] Workqueue: kmmcd mmc_rescan [ 240.226448] [] (__schedule) from [] (schedule+0x34/0x98) [ 240.233496] [] (schedule) from [] (schedule_timeout+0x110/0x164) [ 240.241207] [] (schedule_timeout) from [] (wait_for_common+0xb8/0x14c) [ 240.249454] [] (wait_for_common) from [] (mmc_wait_for_req+0xb0/0x13c) [ 240.257691] [] (mmc_wait_for_req) from [] (mmc_send_tuning+0x140/0x1b0) [ 240.266026] [] (mmc_send_tuning) from [] (dw_mci_exynos_execute_tuning+0x70/0x174) [ 240.275299] [] (dw_mci_exynos_execute_tuning) from [] (dw_mci_execute_tuning+0x2c/0x38) [ 240.285021] [] (dw_mci_execute_tuning) from [] (mmc_execute_tuning+0x30/0x60) [ 240.293864] [] (mmc_execute_tuning) from [] (mmc_init_card+0x8e0/0x1648) [ 240.302283] [] (mmc_init_card) from [] (mmc_attach_mmc+0x90/0x15c) [ 240.310172] [] (mmc_attach_mmc) from [] (mmc_rescan+0x29c/0x2e4) [ 240.317903] [] (mmc_rescan) from [] (process_one_work+0x120/0x324) [ 240.325788] [] (process_one_work) from [] (worker_thread+0x28/0x490) [ 240.333861] [] (worker_thread) from [] (kthread+0xd8/0xf4) [ 240.341058] [] (kthread) from [] (ret_from_fork+0x14/0x34) ----------- [2]: On s2r cycle: ======== [ 71.107181] mmc_host mmc0: Bus speed (slot 0) = 50000000Hz (slot req 400000Hz, actual 396825HZ div = 63) [ 71.147363] mmc_host mmc0: Bus speed (slot 0) = 200000000Hz (slot req 200000000Hz, actual 200000000HZ div = 0) [ 71.155946] dwmmc_exynos 12200000.mmc: [ALIM] : dw_mci_tasklet_func: 1593: cmd->data and err [ 240.197177] INFO: task sh:1645 blocked for more than 120 seconds. [ 240.201802] Not tainted 4.1.0-rc5-00001-g862e58a-dirty #6 [ 240.207713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 240.215518] sh D c0505528 0 1645 1 0x00000000 [ 240.221857] [] (__schedule) from [] (schedule+0x34/0x98) [ 240.228887] [] (schedule) from [] (schedule_timeout+0x110/0x164) [ 240.236606] [] (schedule_timeout) from [] (wait_for_common+0xb8/0x14c) [ 240.244851] [] (wait_for_common) from [] (mmc_wait_for_req+0xb0/0x13c) [ 240.253090] [] (mmc_wait_for_req) from [] (mmc_send_tuning+0x140/0x1b0) [ 240.261417] [] (mmc_send_tuning) from [] (dw_mci_exynos_execute_tuning+0x70/0x174) [ 240.270699] [] (dw_mci_exynos_execute_tuning) from [] (dw_mci_execute_tuning+0x2c/0x38) [ 240.280415] [] (dw_mci_execute_tuning) from [] (mmc_execute_tuning+0x30/0x60) [ 240.289263] [] (mmc_execute_tuning) from [] (mmc_init_card+0x8e0/0x1648) [ 240.297675] [] (mmc_init_card) from [] (_mmc_resume+0x4c/0x78) [ 240.305222] [] (_mmc_resume) from [] (mmc_resume+0x1c/0x58) [ 240.312508] [] (mmc_resume) from [] (mmc_bus_resume+0x1c/0x50) [ 240.320060] [] (mmc_bus_resume) from [] (dpm_run_callback.isra.8+0x1c/0x48) [ 240.328736] [] (dpm_run_callback.isra.8) from [] (device_resume+0xac/0x180) [ 240.337436] [] (device_resume) from [] (dpm_resume+0xe8/0x210) [ 240.344957] [] (dpm_resume) from [] (dpm_resume_end+0xc/0x18) [ 240.352419] [] (dpm_resume_end) from [] (suspend_devices_and_enter+0x10c/0x3e4) [ 240.361437] [] (suspend_devices_and_enter) from [] (pm_suspend+0x168/0x208) [ 240.370110] [] (pm_suspend) from [] (state_store+0x6c/0xbc) [ 240.377400] [] (state_store) from [] (kobj_attr_store+0x14/0x20) [ 240.385123] [] (kobj_attr_store) from [] (sysfs_kf_write+0x44/0x48) [ 240.393103] [] (sysfs_kf_write) from [] (kernfs_fop_write+0xb8/0x194) [ 240.401257] [] (kernfs_fop_write) from [] (__vfs_write+0x2c/0xd8) [ 240.409064] [] (__vfs_write) from [] (vfs_write+0x90/0x14c) [ 240.416350] [] (vfs_write) from [] (SyS_write+0x40/0x8c) [ 240.423378] [] (SyS_write) from [] (ret_fast_syscall+0x0/0x34) ==== I was checking on v4.1-rc5, git hash as below: 862e58a mmc: dw_mmc: Wait for data transfer after response errors ba155e2 Linux 4.1-rc5 5b13966 Not sure if I missed any dependent patch?? Have not checked the dw TRM for this change, will do that as soon as I get access to it. Regards, Alim On Mon, May 18, 2015 at 9:23 PM, Doug Anderson wrote: > According to the DesignWare state machine description, after we get a > "response error" or "response CRC error" we move into data transfer > mode. That means that we don't necessarily need to special case > trying to deal with the failure right away. We can wait until we are > notified that the data transfer is complete (with or without errors) > and then we can deal with the failure. > > It may sound strange to defer dealing with a command that we know will > fail anyway, but this appears to fix a bug. During tuning (CMD19) on > a specific card on an rk3288-based system, we found that we could get > a "response CRC error". Sending the stop command after the "response > CRC error" would then throw the system into a confused state causing > all future tuning phases to report failure. > > When in the confused state, the controller would show these (hex codes > are interrupt status register): > CMD ERR: 0x00000046 (cmd=19) > CMD ERR: 0x0000004e (cmd=12) > DATA ERR: 0x00000208 > DATA ERR: 0x0000020c > CMD ERR: 0x00000104 (cmd=19) > CMD ERR: 0x00000104 (cmd=12) > DATA ERR: 0x00000208 > DATA ERR: 0x0000020c > ... > ... > > It is inherently difficult to deal with the complexity of trying to > correctly send a stop command while a data transfer is taking place > since you need to deal with different corner cases caused by the fact > that the data transfer could complete (with errors or without errors) > during various places in sending the stop command (dw_mci_stop_dma, > send_stop_abort, etc) > > Instead of adding a bunch of extra complexity to deal with this, it > seems much simpler to just use the more straightforward (and less > error-prone) path of letting the data transfer finish. There > shouldn't be any huge benefit to sending the stop command slightly > earlier, anyway. > > Signed-off-by: Doug Anderson > --- > drivers/mmc/host/dw_mmc.c | 20 ++++++++++++++++++++ > 1 file changed, 20 insertions(+) > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c > index 5f5adaf..c081ce2 100644 > --- a/drivers/mmc/host/dw_mmc.c > +++ b/drivers/mmc/host/dw_mmc.c > @@ -1574,6 +1574,26 @@ static void dw_mci_tasklet_func(unsigned long priv) > } > > if (cmd->data && err) { > + /* > + * Controller will move into a data transfer > + * state after a response error or response CRC > + * error. Let's let that finish before trying > + * to send a stop, so we'll go to > + * STATE_SENDING_DATA. > + * > + * Although letting the data transfer take place > + * will waste a bit of time (we already know > + * the command was bad), it can't cause any > + * errors since it's possible it would have > + * taken place anyway if this tasklet got > + * delayed. Allowing the transfer to take place > + * avoids races and keeps things simple. > + */ > + if (err != -ETIMEDOUT) { > + state = STATE_SENDING_DATA; > + continue; > + } > + > dw_mci_stop_dma(host); > send_stop_abort(host, data); > state = STATE_SENDING_STOP; > -- > 2.2.0.rc0.207.ga3a616c > > -- > To unsubscribe from this list: send the line "unsubscribe linux-mmc" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Regards, Alim -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/