Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp4644994pxa; Mon, 10 Aug 2020 14:28:12 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw6aMLOUHDinTSgEZDh4VURDeS3DepKkO9iW/9JcEqyMUhNXMFXRgPp/Eb+cMBKTGL+J+Vh X-Received: by 2002:a17:906:16c8:: with SMTP id t8mr22999640ejd.484.1597094892349; Mon, 10 Aug 2020 14:28:12 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1597094892; cv=none; d=google.com; s=arc-20160816; b=zSV08hTjYnSjZQ/jCt9QRcUYqDvkpH4iZeuCBjHiGWsv+bLLyGWDU8ZQch4Jq56Vbi nYdu1ZMOqtI1Z+TciC2cAnRr6hrzTOAFjIJAtK9jO3iCrDRsIcANV96TjTEhnghyDw6J A/vvt5s5GW9O2mioisYvPZH3CuBn6UhsNiEAAfjhdKVwm5GA1Pl+enVfixX6XnezHtl8 V5rAHslvy3ca+pEBsrcpTtYr87q3o9P1AyO14MGM03nxRN5zHdXhsKxq4pmqufF3mP4j IyjqaI7bMILwV/3TGUUTbILFvv1T58VbCQrJqVLiAAU4PmUsUlyB9Ey6k2EvQf032yI1 8Zvw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=hXASfmqoUl3o8zN6MXHewh7vS0xl4tHelH9X4EiAPTg=; b=V/IfQ1aCmWlosYUCGE/l2LrtbLBSalZ2eRZRuy9mNadkcXGOSS80RXQSr05kaPGbio F+sA//zajNZ7oaJL+SEqD/XNIvcvImC3qzS94fY1+LEPvIaFeWVxvFSVPafbw56VdUc5 oULy6yak/A/R2cVrEdoCe606lB3OHsE3gJWq/kI2naHEn+2emUhi7DIYqWWmLUCMG1Xp Ms0vw8wemc7ZZv9G8yqv7pKyMMgNbYFWPOkrz4J7n7aFavB0avNRN8aAIQgcJfABmyJ/ 10r9FwQ2l5jDFO61xN39/IVoSKRuArPG70FzpdDZFlFstjL5O/yfY2ysMIlgSVPk8nRi bRuQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=iPtviNy9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id bu20si10860644edb.537.2020.08.10.14.27.49; Mon, 10 Aug 2020 14:28:12 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=iPtviNy9; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726844AbgHJV01 (ORCPT + 99 others); Mon, 10 Aug 2020 17:26:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34778 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726825AbgHJV01 (ORCPT ); Mon, 10 Aug 2020 17:26:27 -0400 Received: from mail-vk1-xa44.google.com (mail-vk1-xa44.google.com [IPv6:2607:f8b0:4864:20::a44]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E17DAC061788 for ; Mon, 10 Aug 2020 14:26:26 -0700 (PDT) Received: by mail-vk1-xa44.google.com with SMTP id x142so2187903vke.0 for ; Mon, 10 Aug 2020 14:26:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=hXASfmqoUl3o8zN6MXHewh7vS0xl4tHelH9X4EiAPTg=; b=iPtviNy9RShsJFpYsrUv4+ItQVmKuSBZAs+NqagKy930sVZjRWUTkUxZacbU9KMJ9x CldemxPziILP/Xm4LSWsQAELyezydtv5sifKft2/LSVyc4/58JoPG6k5BIxVVvNHjwp4 uWlD8l3v+F7WMtJqUFLtCaENMe7W0y16GGvzg= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=hXASfmqoUl3o8zN6MXHewh7vS0xl4tHelH9X4EiAPTg=; b=a36OFObiBh16XGljF9L48nVbSVoWXWhl58NL0Uchdo3hkuEWS0UuOXiL2o9rZTtB5A i8LJtGqG5F8/+2fjfmKRPgKnEZVnjGICDx3LyWjRpNwnTsSLUeEXZ3aokNL2FwMEQGZ5 IRMMDBQeQkoS1hI7MvTzNfJFMrW+EKnWiFVjvCoGeF+tqVLH9U3CIeLmjSVfw3KzNorL Imz2eaC8s/9YifJbhEp5WUEV2Apy8UVX+N5TIBs+/MPwIFVpd8ULck/yZnb5JQY+qTOF jztLPzLdHYDTsuHiIR9gq36tYGlZhAnBGK3F1+sdj9bItqjco8tIwT87Co+cZ1Z8b3hT nATw== X-Gm-Message-State: AOAM531PrPqkckAn1KiSdkuSMgLKk57+WAA71jZWDhNqEu6EHfE+x8js raVFMmDQ5ou9yc6e0fR+K/D10fqBikk= X-Received: by 2002:a1f:3d97:: with SMTP id k145mr22560204vka.8.1597094785604; Mon, 10 Aug 2020 14:26:25 -0700 (PDT) Received: from mail-vs1-f47.google.com (mail-vs1-f47.google.com. [209.85.217.47]) by smtp.gmail.com with ESMTPSA id m62sm5614445vsd.21.2020.08.10.14.26.23 for (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 10 Aug 2020 14:26:24 -0700 (PDT) Received: by mail-vs1-f47.google.com with SMTP id y8so4952739vsq.8 for ; Mon, 10 Aug 2020 14:26:23 -0700 (PDT) X-Received: by 2002:a67:2c4f:: with SMTP id s76mr20074899vss.213.1597094783404; Mon, 10 Aug 2020 14:26:23 -0700 (PDT) MIME-Version: 1.0 References: <20200806221904.1.I4455ff86f0ef5281c2a0cd0a4712db614548a5ca@changeid> In-Reply-To: From: Doug Anderson Date: Mon, 10 Aug 2020 14:26:11 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [PATCH] serial: qcom_geni_serial: Fix recent kdb hang To: Akash Asthana Cc: Greg Kroah-Hartman , kgdb-bugreport@lists.sourceforge.net, Mukesh Savaliya , Andy Gross , Bjorn Andersson , Evan Green , Jiri Slaby , linux-arm-msm , LKML , linux-serial@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Mon, Aug 10, 2020 at 5:32 AM Akash Asthana wrote: > > Hi Doug, > > On 8/7/2020 10:49 AM, Douglas Anderson wrote: > > The commit e42d6c3ec0c7 ("serial: qcom_geni_serial: Make kgdb work > > even if UART isn't console") worked pretty well and I've been doing a > > lot of debugging with it. However, recently I typed "dmesg" in kdb > > and then held the space key down to scroll through the pagination. My > > device hung. This was repeatable and I found that it was introduced > > with the aforementioned commit. > > > > It turns out that there are some strange boundary cases in geni where > > in some weird situations it will signal RX_LAST but then will put 0 in > > RX_LAST_BYTE. This means that the entire last FIFO entry is valid. > > IMO that means we received a word in RX_FIFO and it is the last word > hence RX_LAST bit is set. What you say would make logical sense, but it's not how I have observed geni to work. See below. > RX_LAST_BYTE is 0 means none of the bytes are valid in the last word. This would imply that qcom_geni_serial_handle_rx() is also broken though, wouldn't it? Specifically imagine that WORD_CNT is 1 and RX_LAST is set and RX_LAST_BYTE_VALID is true. Here's the logic from that function: total_bytes = BYTES_PER_FIFO_WORD * (word_cnt - 1); if (last_word_partial && last_word_byte_cnt) total_bytes += last_word_byte_cnt; else total_bytes += BYTES_PER_FIFO_WORD; port->handle_rx(uport, total_bytes, drop); As you can see that logic will set "total_bytes" to 4 in the case I'm talking about. > In such scenario we should just read RX_FIFO buffer (to empty it), > discard the word and return NO_POLL_CHAR. Something like below. > > --------------------------------------------------------------------------------------------------------------------------------------------------------- > > else > private_data->poll_cached_bytes_cnt = 4; > > private_data->poll_cached_bytes = > readl(uport->membase + SE_GENI_RX_FIFOn); > } > > + if (!private_data->poll_cached_bytes_cnt) > + return NO_POLL_CHAR; > private_data->poll_cached_bytes_cnt--; > ret = private_data->poll_cached_bytes & 0xff; > ------------------------------------------------------------------------------------------------------------------------------------------------------------- > > Please let me know whether above code helps. Your code will avoid the hang. Yes. ...but it will drop bytes. I devised a quick-n-dirty test. Here's a test of your code: https://crrev.com/c/2346886 ...and here's a test of my code: https://crrev.com/c/2346884 I had to keep a buffer around since it's hard to debug the serial driver. In both cases I put "DOUG" into the buffer when I detect this case. If my theory about how geni worked was wrong then we should expect to see some garbage in the buffer right after the DOUG, right? ...but my code gets the alphabet in nice sequence. Your code drops 4 bytes. NOTE: while poking around with the above two test patches I found it was pretty easy to get geni to drop bytes / hit overflow cases and also to insert bogus 0 bytes in the stream (I believe these are related). I was able to reproduce this: * With ${SUBJECT} patch in place. * With your proposed patch. * With the recent "geni" patches reverted (in other words back to 1 byte per FIFO entry). It's not terribly surprising that we're overflowing since I believe kgdb isn't too keen to read characters at the same time it's writing. That doesn't explain the weird 0-bytes that geni seemed to be inserting, but at least it would explain the overflows. However, even after I fixed this I _still_ was getting problems. Specifically geni seemed to be hiding bytes from me until it was too late. I put logging in and would see this: 1 word in FIFO - wxyz 1 word in FIFO (last set, last FIFO has 1 byte) - \n Check again, still 0 bytes in FIFO Suddenly 16 bytes are in FIFO and S_RX_FIFO_WR_ERR_EN is set. I spent a whole bunch of time poking at this and couldn't find any sort of workaround. Presumably geni is taking some time between me reading the last word out of the FIFO from the "previous" packet and then transitioning to the new packet. I found a lot of references to this process in the hardware register description (see GENI_CFG_REG69, for instance), but I couldn't manage to make the kick to happen any faster. Presumably this isn't a problem for things like Bluetooth since flow control saves them. ...and I guess this isn't a problem in practice because we usually _send_ a lot of data to the host for console/kgdb and it's only the host => DUT path that has problems. > I am not sure about what all scenario can leads to this behavior from > hardware, I will try to get an answer from hardware team. > > Any error bit was set for SE_GENI_S_IRQ_STATUS & SE_GENI_M_IRQ_STATUS > registers? As per above I can see overflows in my test case and geni seems to be behaving pretty badly. If you have ideas on how to fix this I'd love it. However, it still seems like my patch is right because (at least in the cases I tested) it avoids dropping bytes in some cases. It also matches how qcom_geni_serial_handle_rx() works and if that was broken we'd have noticed by now. > I guess the hang was seen because *poll_cached_bytes_cnt* is unsigned > int and it's value was 0, when it's decremented by 1 it's value become > '4294967295' (very large) and dummy RX (0x00) would happen that > > many times before reading any actual RX transfers/bytes. Right. That would be why it was hanging. -Doug