Received-SPF: pass (google.com: domain of linux-wireless-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Tue, 9 Feb 2021 14:55:31 -0500
From:   Steven Rostedt <rostedt@goodmis.org>
To:     Brian Norris <briannorris@chromium.org>
Cc:     Wen Gong <wgong@codeaurora.org>,
        ath10k <ath10k@lists.infradead.org>,
        linux-wireless <linux-wireless@vger.kernel.org>
Subject: Re: [PATCH] ath10k: change len of trace_ath10k_log_dbg_dump for
 large buffer size
Message-ID: <20210209145531.5977b16d@gandalf.local.home>
In-Reply-To: <CA+ASDXN1V2HYA7C6s-q5bQNSxE7L5GCJiqfiJ_67R_hpUn4b_g@mail.gmail.com>
References: <1612839593-2308-1-git-send-email-wgong@codeaurora.org>
        <CA+ASDXN1V2HYA7C6s-q5bQNSxE7L5GCJiqfiJ_67R_hpUn4b_g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Precedence: bulk

On Tue, 9 Feb 2021 11:35:07 -0800
Brian Norris <briannorris@chromium.org> wrote:

> + Steven Rostedt

Thanks.

> 
> Hi Wen,
> 
> (Trimming down the description a bit:)
> 
> On Mon, Feb 8, 2021 at 6:59 PM Wen Gong <wgong@codeaurora.org> wrote:
> >
> > Kernel panic every time in kernel when running below case:
> > steps:
> > 1. connect to an AP with good signal strength
> > 2. echo 0x7f > /sys/kernel/debug/ieee80211/phy0/ath10k/pktlog_filter
> > 3. echo 0xffff 0 > /sys/kernel/debug/ieee80211/phy0/ath10k/fw_dbglog
> > 4. echo 0 > /sys/module/ath10k_core/parameters/debug_mask
> > 5. sudo trace-cmd record  -e ath10k
> > 6. run "iperf -c 192.168.1.1 -u -b 100m -i 1 -t 30"
> > 7. kernel panic immeditely
> >
> > It always crash at trace_event_raw_event_ath10k_xxx, below is 2 sample:  
> ...
> > The value of prog in filter_match_preds of kernel/trace/trace_events_filter.c
> > is overwrite to the content of the UDP packets's content like this
> > 0x0039383736353433, it is a invalid address, so crash.  
> ...
> > ath10k_htc_send_bundle_skbs allocate skb with size 49792(1556*32), it
> > is bigger than PAGE_SIZE which is normally 4096, then ath10k_sdio_write
> > call ath10k_dbg_dump to trace the large size skb and corrupt the trace
> > data of tracing and lead crash. When disable the TX bundle of SDIO, it
> > does not crash, but TX bundle is for improve throughput, so it is enabled
> > by default. It is useless to call ath10k_dbg_dump to trace the large
> > bundled skb, so this patch trace the top part of large bundled skb.  
> ...
> > trace_event_raw_event_ath10k_log_dbg_dump is generated by compiler, it
> > call trace_event_buffer_reserve got get a struct pointer *entry, its
> > type is trace_event_raw_ath10k_log_dbg_dump which is also generated by
> > compiler, trace_event_buffer_reserve of kernel/trace/trace_events.c
> > call trace_event_buffer_lock_reserve to get ring_buffer_event.
> >
> > In function trace_event_buffer_lock_reserve of kernel/trace/trace.c,
> > the ring_buffer_time_stamp_abs is false and trace_file->flags is 0x40b
> > which is set bit of EVENT_FILE_FL_FILTERED by debugging, so it use the
> > temp buffer this_cpu_read(trace_buffered_event), and the buffer size
> > is 1 page size which allocatee in trace_buffered_event_enable by
> > alloc_pages_node, and then ath10k pass the buffer size > 1 page trigger
> > overflow and crash.
> >
> > Based on upper test, try and debugging, pass large buff size to function
> > trace_ath10k_log_dbg_dump cause crash, and it has ath10k_dbg in
> > ath10k_sdio_write to print the length of skb/buffer, it is not necessary
> > to trace all content of the large skb.  
> 
> Is this the same issue noted in this thread?
> 
> [for-next][PATCH 2/2] tracing: Use temp buffer when filtering events
> https://lore.kernel.org/lkml/f16b14066317f6a926b6636df6974966@codeaurora.org/

Note, that is only used when filtering happens, which doesn't appear to be
the case here.

> 
> It seems like we should still try to get that fixed somehow, even if
> the below change is fine on its own (it probably doesn't make sense to
> such a large amount of data via tracepoints). It would be unfortunate
> for next poor soul to hit the same issues, just because they wanted to
> dump a few KB.

Yeah, it was a design decision to cap the max size of events to just under
PAGE_SIZE. The ring buffer is broken up into pages (for zero copy
transfers to file systems and the network). Thus, no event is allowed to be
bigger than a page (and actually a bit smaller)

That said, it shouldn't have crashed, it should have just failed to record.

I'll test it out and see why it crashed.

-- Steve