Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S971302AbdDTQwB (ORCPT ); Thu, 20 Apr 2017 12:52:01 -0400 Received: from mail-io0-f196.google.com ([209.85.223.196]:35589 "EHLO mail-io0-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S971285AbdDTQv6 (ORCPT ); Thu, 20 Apr 2017 12:51:58 -0400 Subject: Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle To: Frank Rowand , Michael Ellerman , Tyrel Datwyler , robh+dt@kernel.org References: <1492475525-10827-1-git-send-email-tyreld@linux.vnet.ibm.com> <58F6AA35.2040902@gmail.com> <87efwp6v4e.fsf@concordia.ellerman.id.au> <1d51a229-612b-bf09-93d5-6e43b476e2cf@gmail.com> <58F83C66.7030806@gmail.com> Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org, devicetree@vger.kernel.org, nfont@linux.vnet.ibm.com, rostedt@goodmis.org, mingo@redhat.com From: Tyrel Datwyler Message-ID: <64505d40-926d-b302-4ff3-dff71aa5018a@gmail.com> Date: Thu, 20 Apr 2017 09:51:55 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 MIME-Version: 1.0 In-Reply-To: <58F83C66.7030806@gmail.com> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 14485 Lines: 333 On 04/19/2017 09:43 PM, Frank Rowand wrote: >> Something else to keep in mind is that while pr_debugs could be used to >> provide feedback on the reference counts and of_reconfig events they >> don't in anyway tell us where they are happening in the kernel. The > > Yes, that is critical information. When there are refcount issues, the > root cause is at varying levels back in the call stack. > > >> trace infrastructure provides the ability to stack trace those events. >> The following example provides me a lot more information about who is >> doing what and where after I hot-add an ethernet adapter: >> >> # echo stacktrace > /sys/kernel/debug/tracing/trace_options >> # cat trace | grep -A6 "/pci@800000020000018" >> ... >> drmgr-7349 [006] d... 7138.821875: of_node_get: refcount=8, >> dn->full_name=/pci@800000020000018 >> drmgr-7349 [006] d... 7138.821876: >> => .msi_quota_for_device >> => .rtas_setup_msi_irqs >> => .arch_setup_msi_irqs >> => .__pci_enable_msix >> => .pci_enable_msix_range > > Nice! It is great to have function names in the call stack. Agreed. >> To get that same info as far as I know is to add a dump_stack() after >> each pr_debug. > > Here is a patch that I have used. It is not as user friendly in terms > of human readable stack traces (though a very small user space program > should be able to fix that). The patch is cut and pasted into this > email, so probably white space damaged. > > Instead of dumping the stack, each line in the "report" contains > the top six addresses in the call stack. If interesting, they > can be post-processed (as I will show in some examples below). Very cool. I wasn't familiar with the CALLER_ADDR* defines. Thanks for sharing. > > --- > drivers/of/dynamic.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > Index: b/drivers/of/dynamic.c > =================================================================== > --- a/drivers/of/dynamic.c > +++ b/drivers/of/dynamic.c > @@ -13,6 +13,7 @@ > #include > #include > #include > +#include > > #include "of_private.h" > > @@ -27,6 +28,20 @@ struct device_node *of_node_get(struct d > { > if (node) > kobject_get(&node->kobj); > + > + if (node) { > + int k; > + int refcount = refcount_read(&node->kobj.kref.refcount); > + pr_err("XXX get 0x%p %3d [0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx] ", > + node, refcount, > + CALLER_ADDR5, CALLER_ADDR4, CALLER_ADDR3, > + CALLER_ADDR2, CALLER_ADDR1, CALLER_ADDR0); > + // refcount = (refcount > 20) ? 20 : refcount; /* clamp max */ > + for (k = 0; k < refcount; k++) > + pr_cont("+"); > + pr_cont(" %s\n", of_node_full_name(node)); > + } > + > return node; > } > EXPORT_SYMBOL(of_node_get); > @@ -38,8 +53,22 @@ EXPORT_SYMBOL(of_node_get); > */ > void of_node_put(struct device_node *node) > { > + if (node) { > + int k; > + int refcount = refcount_read(&node->kobj.kref.refcount); > + pr_err("XXX put 0x%p %3d [0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx 0x%08lx] ", > + node, refcount, > + CALLER_ADDR5, CALLER_ADDR4, CALLER_ADDR3, > + CALLER_ADDR2, CALLER_ADDR1, CALLER_ADDR0); > + // refcount = (refcount > 20) ? 20 : refcount; /* clamp max */ > + for (k = 0; k < refcount; k++) > + pr_cont("-"); > + pr_cont(" %s\n", of_node_full_name(node)); > + } > + > if (node) > kobject_put(&node->kobj); > + > } > EXPORT_SYMBOL(of_node_put); > > > I have used this mostly in the context of boot time, so there is a lot > of output. My notes on configuration needed for my ARM boards are: > > FIXME: Currently using pr_err() so I don't need to set loglevel on boot. > > So obviously not a user friendly tool!!! > The process is: > - apply patch > - configure, build, boot kernel > - analyze data > - remove patch > > LOG_BUF_SHIFT (was 17) > General Setup -> > [select 21] Kernel log buffer size (16 => 64KB, 17 => 128KB) > > > Device Drivers -> > Device Tree and Open Firmware support -> > Device Tree overlays > > > Want CONFIG_FRAME_POINTER so that CALLER_ADDR* will work. > To be able to enable CONFIG_FRAME_POINTER, need to disable CONFIG_ARM_UNWIND. > > Kernel hacking -> > [unselect] Enable stack unwinding support (EXPERIMENTAL) > CONFIG_FRAME_POINTER will then be selected automatically > > > The output looks like: > > [ 0.231430] OF: XXX get 0xeefeb5dc 2 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++ /smp2p-adsp/slave-kernel > [ 0.231457] OF: XXX get 0xeefeb5dc 3 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814188] +++ /smp2p-adsp/slave-kernel > [ 0.231495] OF: XXX get 0xeefeb5dc 4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814258] ++++ /smp2p-adsp/slave-kernel > [ 0.231537] OF: XXX get 0xeefeb244 4 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] ++++ /smp2p-adsp > [ 0.231568] OF: XXX put 0xeefeb5dc 4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp/slave-kernel > [ 0.231610] OF: XXX get 0xeefe759c 23 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] +++++++++++++++++++++++ / > [ 0.231702] OF: XXX put 0xeefeb244 4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp > [ 0.231744] OF: XXX put 0xeefe759c 23 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ----------------------- / > [ 0.231881] OF: XXX get 0xeefecb2c 22 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e86c 0xc080e678 0xc080de7c] ++++++++++++++++++++++ /soc/interrupt-controller@f9000000 > [ 0.231972] OF: XXX put 0xeefecb2c 22 [0xc080fd94 0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e61c] ---------------------- /soc/interrupt-controller@f9000000 > [ 0.232101] OF: XXX get 0xeefeb5dc 4 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++++ /smp2p-adsp/slave-kernel > [ 0.232134] OF: XXX put 0xeefeb5dc 4 [0xc080fd94 0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e61c] ---- /smp2p-adsp/slave-kernel > [ 0.232178] OF: XXX get 0xeefeb5dc 4 [0xc0814c58 0xc08148b0 0xc080e970 0xc080e894 0xc080e678 0xc080de7c] ++++ /smp2p-adsp/slave-kernel > [ 0.232211] OF: XXX get 0xeefeb5dc 5 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814188] +++++ /smp2p-adsp/slave-kernel > [ 0.232257] OF: XXX get 0xeefeb5dc 6 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814258] ++++++ /smp2p-adsp/slave-kernel > [ 0.232308] OF: XXX get 0xeefeb244 4 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] ++++ /smp2p-adsp > [ 0.232339] OF: XXX put 0xeefeb5dc 6 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ------ /smp2p-adsp/slave-kernel > [ 0.232390] OF: XXX get 0xeefe759c 23 [0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814278 0xc080ccc8] +++++++++++++++++++++++ / > [ 0.232482] OF: XXX put 0xeefeb244 4 [0xc08103c4 0xc0810024 0xc080fd94 0xc0814c58 0xc08149a8 0xc0814284] ---- /smp2p-adsp > > > > But I normally strip off the timestamp, and grep for the "OF: XXX ", > which gets me only the get and put info. It is also easy to grep > for a single node of interest. > > The data fields are: > get or put > the struct device_node address > refcount > a 6 caller deep call stack > for get, one '+' per refcount or for put, one '-' per refcount > the full node name > > The refcount for get is the post get value, for put is the pre put value, > so they are easy to match up for human scanning. The length of the "++++" > and "----" patterns on the end are also intended for easy human scanning. > > Here are two actual refcount issues for the root node on my 4.11-rc1: > > OF: XXX get 0xeefe759c 2 [0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813070 0xc080ccc8] ++ / > OF: XXX put 0xeefe759c 2 [0xc031aa28 0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813014] -- / > OF: XXX get 0xeefe759c 2 [0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813070 0xc080ccc8] ++ / > OF: XXX put 0xeefe759c 2 [0xc031aa40 0xc08138bc 0xc08137fc 0xc08136e4 0xc08136b4 0xc0813014] -- / > OF: XXX get 0xeefe759c 22 [0xc0308518 0xc09330e0 0xc0d00e3c 0xc03017d0 0xc0d3a1fc 0xc080d948] ++++++++++++++++++++++ / > OF: XXX put 0xeefe759c 22 [0xc0308518 0xc09330e0 0xc0d00e3c 0xc03017d0 0xc0d3a1fc 0xc080d8c8] ---------------------- / > OF: XXX get 0xeefe759c 22 [0xc0d00e3c 0xc03017d0 0xc0d3a234 0xc0810684 0xc081061c 0xc080d928] ++++++++++++++++++++++ / > OF: XXX get 0xeefe759c 23 [0xc08103c4 0xc0810024 0xc080fd84 0xc08137b4 0xc0812c88 0xc080ccc8] +++++++++++++++++++++++ / > OF: XXX put 0xeefe759c 23 [0xc08105d8 0xc08103c4 0xc0810024 0xc080fd84 0xc08137b4 0xc0812cb4] ----------------------- / > > The call stack could easily be post-processed, for example using addr2line. > Here is the call stack for when the refcount incremented to 23 from 22 (or > more accurately, to 22 from 21): > > 0xc0d00e3c Line 857 of "init/main.c" > 0xc03017d0 Line 792 of "init/main.c" > 0xc0d3a234 Line 528 of "drivers/of/platform.c" > 0xc0810684 Line 503 of "drivers/of/platform.c" > 0xc081061c Line 267 of "include/linux/of.h" > 0xc080d928 Line 815 of "drivers/of/base.c" > > Which ends up being this code: > > of_platform_default_populate_init() > of_platform_default_populate() > of_platform_populate() > [[ of_find_node_by_path("/") ]] > [[ of_find_node_opts_by_path(path, NULL) ]] > of_node_get(of_root) > > Note that some functions can be left out of the ARM call stack, with > a return going back more than one level. The functions in the call > list above that are enclosed in '[[' and ']]' were found by source > inspection in those cases. The same thing is encountered in ppc64 stack traces. I assume it is generally inlining of small functions, but I've never actually verified that theory. Probably should take the time to investigate, or just ask someone. > > It looks like a put is missing, but about 250 get/put pairs later, > of_platform_populate() does the required put on node "/". > > Then quite a bit later, after lots of balanced gets and puts, there is an > initcall that does a get on the root without a corresponding put. > > > The jump from refcount 2 to refcount 22 is an interesting case, insofar as it > is not the result of of_node_get(). It is instead inside a series of calls to > kobject_add(): Took me a hot minute to track this one down when I first encountered the extreme reference count jump on the root node with no associated of_node_gets. Luckily, the other nodes with children sent me looking in the right direction. > > kernel_init() > kernel_init_freeable() > do_basic_setup() > driver_init() > of_core_init() > for_each_of_allnodes(np) > __of_attach_node_sysfs(np) > kobject_add() > > > Filtering and reformatting is "easily done" with grep and other > normal unix tools. > > For example, a simple stream of command line tools can show a > streamlined report of the refcounts of a single node (in this > case the root node), which can easily be scanned for interesting > events: > > [ 0.199569] 2 ++ / > [ 0.199629] 2 -- / > [ 0.199826] 2 ++ / > [ 0.199886] 2 -- / > [ 0.212549] 22 ++++++++++++++++++++++ / > [ 0.212855] 22 ---------------------- / > [ 0.213087] 22 ++++++++++++++++++++++ / > [ 0.213700] 23 +++++++++++++++++++++++ / > [ 0.213797] 23 ----------------------- / > [ 0.213973] 23 +++++++++++++++++++++++ / > > ... hundreds of boring put/get pairs > > [ 0.458737] 23 ----------------------- / > [ 0.458909] 23 +++++++++++++++++++++++ / > [ 0.459035] 23 ----------------------- / > [ 0.459305] 22 ---------------------- / > [ 0.470255] 22 ++++++++++++++++++++++ / > > ... hundreds of boring put/get pairs > > [ 93.110548] 22 ++++++++++++++++++++++ / > [ 93.140046] 22 ---------------------- / > [ 93.264639] 22 ++++++++++++++++++++++ / > [ 93.389530] 23 +++++++++++++++++++++++ / > [ 93.414269] 23 ----------------------- / > > > You might have noticed that the call trace is not interesting for > most of the gets and puts. There are over 350 get/put pairs for > the root node in the boot that I collected the above examples on, > but only a few instances where the trace matters. Thus leaving > the call stack in a compact format until needed is a feature. > > I will be the first to admit that the tool is not polished and not > easy to use, though it is easily extended with post-processing. > > I wrote the patch as a proof of concept a while ago and have not > fleshed it out. In fact, calling it a "tool" is overstating what > it is. > > >> Further, filters can be set on the tracepoint event fields such that >> trace data could be restricted to a particular device_node or refcount >> threshold. For example: >> >> # cd /sys/kernel/debug/tracing# cat events/of/of_node_get/format >> # echo "dn_name == /pci@800000020000018" > events/of/filter >> >> # cat trace >> drmgr-10542 [003] .... 9630.677001: of_node_put: refcount=5, >> dn->full_name=/pci@800000020000018 >> drmgr-10542 [003] d... 9631.677368: of_node_get: refcount=6, >> dn->full_name=/pci@800000020000018 >> drmgr-10542 [003] .... 9631.677389: of_node_put: refcount=5, >> dn->full_name=/pci@800000020000018 >> drmgr-10542 [003] .... 9631.677390: of_reconfig_notify: >> action=DETACH_NODE, dn->full_name=/pci@800000020000018, prop->name=null, >> old_prop->name=null >> drmgr-10542 [003] .n.. 9632.025656: of_node_put: refcount=4, >> dn->full_name=/pci@800000020000018 >> drmgr-10542 [003] .n.. 9632.025657: of_node_put: refcount=3, >> dn->full_name=/pci@800000020000018 >> >> After setting the filter and doing a hot-remove of the pci device in >> question the trace quickly tells me 3 references are being leaked. In >> combination with the stacktrace option I can quickly correlate call >> sites that take references without releasing them. > > Thanks for sharing that. It is nice seeing your results. Ditto. -Tyrel