Received: by 2002:a25:8b91:0:0:0:0:0 with SMTP id j17csp19207694ybl; Fri, 3 Jan 2020 18:35:19 -0800 (PST) X-Google-Smtp-Source: APXvYqyBGPA58KVZe9nL7Lr/ezx5p6D2I8SWMSDZqZuknZNMSU2YeuqsOitR1kgwuNVxCMr5OIZb X-Received: by 2002:a9d:7c8a:: with SMTP id q10mr95563383otn.124.1578105319656; Fri, 03 Jan 2020 18:35:19 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1578105319; cv=none; d=google.com; s=arc-20160816; b=1J9wSjXRC6Ms6SyGU164B646pKVxkChj3qGfxyPTZcRSYcuQcjjkzpd91IdtCeWV3L aD/PBBipxe4LtFdYtMfFEUSf2P3GoOnWPvKN6TZWuw+HWh3jEiwYbV3FJlrZpv6MpTXq MpPpjCO+JJpsrwlaNB+hbF7npa04ma21+pwEgxcO7rTP1QX9hiDx75axGMnN+zdPRkCK buifo9sQUYQVyG4q4vD4ZO6jTQUn3gypGbN8qlnuuQ+tm0c9KRUsQjrckhP49TCcTCdY 8QMSsYXKsh3e93gKZCPEBC7WXAC9krYO+8J8plVISeeoCUVRhnpjIQWREr4I51ZgkHLS tqBQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=BQvpmc6bTYtIoGI7mIebVuPQLrewKdFxgpvNfh1jXk8=; b=XB5fmqQSC2fWMmREs4emmduIE0pReSiWcRiEuH5o99Xt8pS7KNsdL/x7ROwvrqmCwQ MhWM51Lhdq3MMAZ3FYVKWOntzfUUmy+OdNdDdOlnd6Ge8xAEUD1rUIBTcscVKm1MsZ4N 1Mxe+kQlh2tMuCJV9Z50+gns+wFRZs7DpRU1JqoIXMN+ROiIZiY6sjeppHDY6bx1HR6y 9Sj4bge29M4P0R9DgpVEfTSbD71E8jFXSWezSMFapqvorKpT9iKNbP3Cfycnnd+M4gff dU4JPeVBpTtV5mj8rSyglFYe3Hw762yzF/uZqF3VnBvcmYtQBJ84KvTrfNO8Rn7yAa07 Vjvg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=tAZK3OEK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r4si10217185otq.188.2020.01.03.18.35.07; Fri, 03 Jan 2020 18:35:19 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=tAZK3OEK; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727264AbgADCe3 (ORCPT + 99 others); Fri, 3 Jan 2020 21:34:29 -0500 Received: from mail-lj1-f194.google.com ([209.85.208.194]:34614 "EHLO mail-lj1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727237AbgADCe2 (ORCPT ); Fri, 3 Jan 2020 21:34:28 -0500 Received: by mail-lj1-f194.google.com with SMTP id z22so40757517ljg.1 for ; Fri, 03 Jan 2020 18:34:26 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=BQvpmc6bTYtIoGI7mIebVuPQLrewKdFxgpvNfh1jXk8=; b=tAZK3OEKd7k1M9evv8ECR6IT4KuCG9ELRfiOy7DNKz4dsEPusPe/W4hprN3ZoOOUz3 +ODcHgUFQ4f7YwPVWXfHrzvzWU3ldmbVznRqqa1zjiUkPu4LpoCrsAknU2pSs7MXrXIu Y/xlv7+HrXX2kZSD/sFcN30bWts4SGSC4NXHj1vHv5ddjdk8O2FOHd/Rolgxf2cns+sm JXsm1ZhUMMiY+ixrFoIcOP0aW8DHcgEfeK6MQn7dP0f2AzG/APBX9Bucp+PPpWqmxSQz DHL2CocOXAEQcVBUweSFLywO+ba7qejLqqpJTYRMTjsrVcuRnFB62Lll2tPT7II4iznT iGbA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=BQvpmc6bTYtIoGI7mIebVuPQLrewKdFxgpvNfh1jXk8=; b=rcYlWBG44LKfWf05l1P7ZN8BGIix5eRoZ3wPn3qdL8leaPyg2XoI+RK8/Y2zJZpZ70 YS0hpFlvtMBaBfx5MyqFFvwVZmjWkFRV4tkSWTJyY8siaf7/lED9CLSjh+S6/y7xIlPw o5gwRPnc+YrYZWZdHFFWpUQzi31vl6ME/fQV3Gq71omFnfrKU3R1MpHqM1k0aMJ3hudF xXd9RxHu6aVLZFeCbf1b1U+d2CbhTMKcblR4OXuLDm5dBtleRoN3MFMGUxJnyie++6ac KMJRpDo2JuB6epYOxHbGNulZuT7zebTA1HXEcDzFnsnoT6hIM3E9Gsk0U2kiNni3zmPl rmKw== X-Gm-Message-State: APjAAAXO1yEFAKrDphgaKkQc7FNAdEGxah/3OSSVrC3suLXLH4NNpJX+ XeeCkgP2/hWvvujhqtftUHosikv0J0qDU1GSzlX2/w== X-Received: by 2002:a2e:9804:: with SMTP id a4mr27224065ljj.10.1578105265406; Fri, 03 Jan 2020 18:34:25 -0800 (PST) MIME-Version: 1.0 References: <1762437703fd150bb535ee488c78c830f107a531.camel@sipsolutions.net> In-Reply-To: From: Kan Yan Date: Fri, 3 Jan 2020 18:34:14 -0800 Message-ID: Subject: Re: PROBLEM: Wireless networking goes down on Acer C720P Chromebook (bisected) To: Justin Capella Cc: Stephen Oberholtzer , Johannes Berg , =?UTF-8?B?VG9rZSBIw7hpbGFuZC1Kw7hyZ2Vuc2Vu?= , "David S. Miller" , linux-wireless , netdev@vger.kernel.org, LKML Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > This AQL stuff sounds pretty nifty, and I'd love to try my hand at > making it work for ath9k (also, since I put so much effort into an > automated build-and-test framework, it'd be a shame to just abandon > it.) However, the ath9k code is rather lacking for comments, so I > don't even know where I should start, except for (I suspect) a call to > `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from > inside ath9k_set_hw_capab()? > In the meantime, I went back to e548f749b096 -- the commit prior to > the one making AQL support opt-in -- and cranked up the debugging. AQL is designed for wireless chipset that uses firmware/hardware offloading, to manage the firmware/hardware queue size. For ath9k, the TX queues are controlled by the host driver and chipsets that use ath9k have a much smaller hardware queue compared to ath10k, so AQL is probably not needed for ath9k. The airtime based TX scheduler alone should be sufficient. > > /sys/kernel/debug/ieee80211/phy0 > > > > airtime_flags = 7 > > > > stations//airtime = > > > > RX: 6583578 us > > TX: 32719 us > > Weight: 256 > > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us > > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us > > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000 > > > > (I have no idea how to interpret this, but that '32719 us' seems odd, > > I thought the airtime usage was in 4us units?) > Me neither, off the top of my head, let's wait for Toke. "TX: 32719 us" is the airtime reported by firmware, which is not in 4us units. There are two airtime: the "consumed" airtime reported by firmware, which is used by the airtimed based TX scheduler to enforce fairness, and the "estimated" airtime used by AQL to control the queue length for frames pending in the firmware/hardware queue, which in 4us unit. > I ran a ping, and saw this: > > - pings coming back in <5ms > - re-enable AQL (echo 7 | tee airtime_flags) > - pings stop coming back immediately > - some seconds later, disable AQL again (echo 3 | tee airtime_flags) > - immediate *flood* of ping replies registered, with times 16000ms, > 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms > - According to the icmp_seq values, all 28 requests were replied to, > and their replies were delivered in-order > > This certainly looks like a missing TX queue restart to me? I don't think TX queue restart is "missing", the TX queue should get restarted when the pending frames is completed and returned to the host driver. However, It looks like there is some issue with the deficit refill logic in ath9k, and the TX queue got blocked due to the negative deficit. On Thu, Jan 2, 2020 at 11:22 PM Justin Capella wrote: > > The rather large negative deficit stands out to me. See this patch, > https://patchwork.kernel.org/patch/11246363/ specifically the comments > by Kan Yan > > On Thu, Jan 2, 2020, 3:14 PM Stephen Oberholtzer wrote: > > > > > > /sys/kernel/debug/ieee80211/phy0 > > > > airtime_flags = 7 > > > > stations//airtime = > > > > RX: 6583578 us > > TX: 32719 us > > Weight: 256 > > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us > > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us > > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/1200 > > On Thu, Jan 2, 2020 at 3:14 PM Stephen Oberholtzer wrote: > > > > On Thu, Jan 2, 2020 at 8:28 AM Johannes Berg wrote: > > > > > > On Tue, 2019-12-31 at 19:49 -0500, Stephen Oberholtzer wrote: > > > > Wireless networking goes down on Acer C720P Chromebook (bisected) > > > > > > > > Culprit: 7a89233a ("mac80211: Use Airtime-based Queue Limits (AQL) on > > > > packet dequeue") > > > > > > > > > > > > > I think I found at least one hole in this, but IIRC (it was before my > > > vacation, sorry) it was pretty unlikely to actually happen. Perhaps > > > there are more though. > > > > > > https://lore.kernel.org/r/b14519e81b6d2335bd0cb7dcf074f0d1a4eec707.camel@sipsolutions.net > > > > > > > > > Do you get any output at all? Like a WARN_ON() for an underflow, or > > > something? > > > > > > johannes > > > > > > > Johannes, > > > > To answer your immediate question, no, I don't get any dmesg output at > > all. Nothing about underruns. > > However, while pursuing other avenues -- specifically, enabling > > mac80211 debugfs and log messages -- I realized that my 'master' was > > out-of-date from linux-stable and did a git pull. Imagine my surprise > > when the resulting kernel did not exhibit the problem! > > > > Apparently, I had been a bit too pessimistic; since the problem > > existed in 5.5-rc1 release, I'd assumed that the problem wouldn't get > > rectified before 5.5. > > > > However, I decided to bisect the fix, and ended up with: 911bde0f > > ("mac80211: Turn AQL into an NL80211_EXT_FEATURE"), which appears to > > have "solved" the problem by just disabling the feature (this is > > ath9k, by the way.) > > > > This AQL stuff sounds pretty nifty, and I'd love to try my hand at > > making it work for ath9k (also, since I put so much effort into an > > automated build-and-test framework, it'd be a shame to just abandon > > it.) However, the ath9k code is rather lacking for comments, so I > > don't even know where I should start, except for (I suspect) a call to > > `wiphy_ext_feature_set(whatever, NL80211_EXT_FEATURE_AQL);` from > > inside ath9k_set_hw_capab()? > > > > In the meantime, I went back to e548f749b096 -- the commit prior to > > the one making AQL support opt-in -- and cranked up the debugging. > > > > I'm not sure how to interpret any of this, but here's what I got: > > > > dmesg output: > > > > Last relevant mention is "moving STA to state 4" which > > happened during startup, before everything shut down. > > > > /sys/kernel/debug/ieee80211/phy0 > > > > airtime_flags = 7 > > > > stations//airtime = > > > > RX: 6583578 us > > TX: 32719 us > > Weight: 256 > > Deficit: VO: -1128 us VI: 11 us BE: -5098636 us BK: 256 us > > Q depth: VO: 3868 us VI: 3636 us BE: 12284 us BK: 0 us > > Q limit[low/high]: VO: 5000/12000 VI: 5000/12000 BE: 5000/12000 BK: 5000/12000 > > > > (I have no idea how to interpret this, but that '32719 us' seems odd, > > I thought the airtime usage was in 4us units?) > > > > > > Doing an 'echo 3 | tee airtime_flags' to clear the (old) AQL-enabled > > bit seemed to *immediately* restore network connectivity. > > > > I ran a ping, and saw this: > > > > - pings coming back in <5ms > > - re-enable AQL (echo 7 | tee airtime_flags) > > - pings stop coming back immediately > > - some seconds later, disable AQL again (echo 3 | tee airtime_flags) > > - immediate *flood* of ping replies registered, with times 16000ms, > > 15000ms, 14000ms, .. down to 1000ms, 15ms, then stabilizing sub-5ms > > - According to the icmp_seq values, all 28 requests were replied to, > > and their replies were delivered in-order > > > > This certainly looks like a missing TX queue restart to me? > > > > > > -- > > -- Stevie-O > > Real programmers use COPY CON PROGRAM.EXE