Date: Sun, 17 Sep 2023 17:07:47 +0800
From: pengdonglin
To: Alan Maguire, Alexei Starovoitov, Eduard Zingerman
Cc: Martin KaFai Lau, Alexei Starovoitov, Song Liu, Yonghong Song,
 Steven Rostedt, Masami Hiramatsu, dinghui@sangfor.com.cn,
 huangcun@sangfor.com.cn, bpf, LKML
Subject: Re: [RFC PATCH v2] bpf: Using binary search to improve the
 performance of btf_find_by_name_kind
Message-ID: <6784b0d1-0da2-48dd-ad99-bb8b86978519@sangfor.com.cn>
In-Reply-To: <2d520a7b-9166-16c4-5385-5bb90437d45c@oracle.com>
References: <20230909091646.420163-1-pengdonglin@sangfor.com.cn>
 <20ef8441084c9d5fd54f84987afa77eed7fe148e.camel@gmail.com>
 <035ab912d7d6bd11c54c038464795da01dbed2de.camel@gmail.com>
 <5f8d82c3-838e-4d75-bb25-7d98a6d0a37c@sangfor.com.cn>
 <6b77425c-7f09-ae6d-c981-7cb2b3b826bd@oracle.com>
 <774732d7-1603-466e-8df2-3b21314913e5@sangfor.com.cn>
 <2d520a7b-9166-16c4-5385-5bb90437d45c@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Mozilla Thunderbird
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023/9/15 1:14, Alan Maguire wrote:
> On 14/09/2023 14:05, pengdonglin wrote:
>> On 2023/9/14 20:46, Alan Maguire wrote:
>>> On 14/09/2023 11:13, pengdonglin wrote:
>>>> On 2023/9/13 21:34, Alan Maguire wrote:
>>>>> On 13/09/2023 11:32, pengdonglin wrote:
>>>>>> On 2023/9/13 2:46, Alexei Starovoitov wrote:
>>>>>>> On Tue, Sep 12, 2023 at 10:03 AM Eduard Zingerman wrote:
>>>>>>>>
>>>>>>>> On Tue, 2023-09-12 at 09:40 -0700, Alexei Starovoitov wrote:
>>>>>>>>> On Tue, Sep 12, 2023 at 7:19 AM Eduard Zingerman wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, 2023-09-12 at 16:51 +0300, Eduard Zingerman wrote:
>>>>>>>>>>> On Sat, 2023-09-09 at 02:16 -0700, Donglin Peng wrote:
>>>>>>>>>>>> Currently, we are only using the linear search method to find
>>>>>>>>>>>> the type id by the name, which has a time complexity of O(n).
>>>>>>>>>>>> This change involves sorting the names of btf types in
>>>>>>>>>>>> ascending order and using binary search, which has a time
>>>>>>>>>>>> complexity of O(log(n)). This idea was inspired by the
>>>>>>>>>>>> following patch:
>>>>>>>>>>>>
>>>>>>>>>>>> 60443c88f3a8 ("kallsyms: Improve the performance of
>>>>>>>>>>>> kallsyms_lookup_name()").
>>>>>>>>>>>>
>>>>>>>>>>>> At present, this improvement is only for searching in
>>>>>>>>>>>> vmlinux's and module's BTFs, and the kind should only be
>>>>>>>>>>>> BTF_KIND_FUNC or BTF_KIND_STRUCT.
>>>>>>>>>>>>
>>>>>>>>>>>> Another change is the search direction, where we search the
>>>>>>>>>>>> BTF first and then its base; the type id of the first matched
>>>>>>>>>>>> btf_type will be returned.
>>>>>>>>>>>>
>>>>>>>>>>>> Here is a timing result for finding all the type ids of
>>>>>>>>>>>> 67,819 kernel functions in vmlinux's BTF by their names:
>>>>>>>>>>>>
>>>>>>>>>>>> Before: 17000 ms
>>>>>>>>>>>> After:     10 ms
>>>>>>>>>>>>
>>>>>>>>>>>> The average lookup performance has improved about 1700x in
>>>>>>>>>>>> the above scenario.
>>>>>>>>>>>>
>>>>>>>>>>>> However, this change will consume more memory; for example,
>>>>>>>>>>>> 67,819 kernel functions will allocate about 530KB memory.
>>>>>>>>>>>
>>>>>>>>>>> Hi Donglin,
>>>>>>>>>>>
>>>>>>>>>>> I think this is a good improvement.
>>>>>>>>>>> However, I wonder, why did you choose to have a separate name
>>>>>>>>>>> map for each BTF kind?
>>>>>>>>>>>
>>>>>>>>>>> I did some analysis for my local testing kernel config and got
>>>>>>>>>>> such numbers:
>>>>>>>>>>> - total number of BTF objects: 97350
>>>>>>>>>>> - number of FUNC and STRUCT objects: 51597
>>>>>>>>>>> - number of FUNC, STRUCT, UNION, ENUM, ENUM64, TYPEDEF,
>>>>>>>>>>>   DATASEC objects: 56817
>>>>>>>>>>>   (these are all kinds for which lookup by name might make sense)
>>>>>>>>>>> - number of named objects: 54246
>>>>>>>>>>> - number of name collisions:
>>>>>>>>>>>   - unique names: 53985 counts
>>>>>>>>>>>   - 2 objects with the same name: 129 counts
>>>>>>>>>>>   - 3 objects with the same name: 3 counts
>>>>>>>>>>>
>>>>>>>>>>> So, it appears that having a single map for all named objects
>>>>>>>>>>> makes sense and would also simplify the implementation, what
>>>>>>>>>>> do you think?
>>>>>>>>>>
>>>>>>>>>> Some more numbers for my config:
>>>>>>>>>> - 13241 types (struct, union, typedef, enum), log2 13241 = 13.7
>>>>>>>>>> - 43575 funcs, log2 43575 = 15.4
>>>>>>>>>> Thus, having a separate map for types vs functions might save
>>>>>>>>>> ~1.7 search iterations. Is this a significant slowdown in
>>>>>>>>>> practice?
>>>>>>>>>
>>>>>>>>> What do you propose to do in case of duplicates?
>>>>>>>>> func and struct can have the same name, but they will have two
>>>>>>>>> different btf_ids. How do we store them?
>>>>>>>>> Also we might add global vars to BTF. Such a request came up
>>>>>>>>> several times. So we need to make sure our search approach
>>>>>>>>> scales to func, struct, vars. I don't recall whether we search
>>>>>>>>> any other kinds. Separate arrays for different kinds seems ok.
>>>>>>>>> It's a bit of code complexity, but it's not an increase in memory.
>>>>>>>>
>>>>>>>> Binary search gives, say, the lowest index of a thing with name
>>>>>>>> A; then increment the index while the name remains A, looking for
>>>>>>>> the correct kind. Given the name-conflict info from above, 99% of
>>>>>>>> the time there would be no need to iterate, and in very few cases
>>>>>>>> there would be a couple of iterations.
>>>>>>>>
>>>>>>>> The same logic would be necessary with the current approach if
>>>>>>>> different BTF kinds were allowed in BTF_ID_NAME_* cohorts. I
>>>>>>>> figured that these cohorts are mainly a way to split the tree for
>>>>>>>> faster lookups, but maybe that is not the main intent.
>>>>>>>>
>>>>>>>>> With 13k structs and 43k funcs it's 56k * (4 + 4); that's 0.5
>>>>>>>>> Mbyte extra memory. That's quite a bit. Anything we can do to
>>>>>>>>> compress it?
>>>>>>>>
>>>>>>>> That's an interesting question, from the top of my head:
>>>>>>>> pre-sort in pahole (re-assign IDs so that an increasing ID would
>>>>>>>> also mean an "increasing" name), shouldn't be that difficult.
>>>>>>>
>>>>>>> That sounds great. kallsyms are pre-sorted at build time.
>>>>>>> We should do the same with BTF.
>>>>>>> I think GCC can emit BTF directly now and LLVM emits it for bpf
>>>>>>> progs too, but since vmlinux and kernel module BTFs will keep
>>>>>>> being processed through pahole we don't have to make gcc/llvm sort
>>>>>>> things right away. pahole will be enough. The kernel might do an
>>>>>>> 'is it sorted' check during BTF validation and then use binary
>>>>>>> search, or fall back to linear when not-sorted == old pahole.
>>>>>>
>>>>>> Yeah, I agree and will attempt to modify pahole and perform a test.
>>>>>> Do we need to introduce a new macro to control the behavior when
>>>>>> the BTF is not sorted? If it is not sorted, we can use the method
>>>>>> mentioned in this patch or use linear search.
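[Editorial illustration: the lowest-index binary search plus forward scan that Eduard describes could be sketched as below. This is a hypothetical userspace sketch, not the kernel's btf_find_by_name_kind: the btf_name_entry layout, helper names, and kind values are all invented for illustration.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sorted-index entry: one per named BTF object,
 * ordered by name; ties (duplicate names) are adjacent. */
struct btf_name_entry {
	const char *name;
	uint32_t id;
	uint32_t kind;
};

/* Classic lower bound: lowest index whose name compares >= target. */
static int lower_bound(const struct btf_name_entry *tab, int n,
		       const char *name)
{
	int lo = 0, hi = n;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (strcmp(tab[mid].name, name) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	return lo;
}

/* Binary-search for the name, then scan forward through the (rare)
 * duplicates until the requested kind matches; -1 if not found. */
static int find_by_name_kind(const struct btf_name_entry *tab, int n,
			     const char *name, uint32_t kind)
{
	int i = lower_bound(tab, n, name);

	for (; i < n && strcmp(tab[i].name, name) == 0; i++) {
		if (tab[i].kind == kind)
			return (int)tab[i].id;
	}
	return -1;
}
```

Given the collision numbers quoted above (129 two-way and 3 three-way collisions out of ~54k names), the forward scan almost never takes more than one or two extra steps.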
>>>>>
>>>>> One challenge with pahole is that it often runs in parallel mode,
>>>>> so I suspect any sorting would have to be done after merging across
>>>>> threads. Perhaps BTF deduplication time might be a useful time to
>>>>> re-sort by name? BTF dedup happens after BTF has been merged, and a
>>>>> new "sorted" btf_dedup_opts option could be added and controlled by
>>>>> a pahole option. However, dedup is pretty complicated already..
>>>>>
>>>>> One thing we should weigh up though is whether there are benefits
>>>>> to the way BTF is currently laid out. It tends to start with base
>>>>> types, and often-encountered types end up being located towards the
>>>>> start of the BTF data. For example
>>>>>
>>>>> [1] INT 'long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
>>>>> [2] CONST '(anon)' type_id=1
>>>>> [3] VOLATILE '(anon)' type_id=1
>>>>> [4] ARRAY '(anon)' type_id=1 index_type_id=21 nr_elems=2
>>>>> [5] PTR '(anon)' type_id=8
>>>>> [6] CONST '(anon)' type_id=5
>>>>> [7] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
>>>>> [8] CONST '(anon)' type_id=7
>>>>> [9] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)
>>>>> [10] CONST '(anon)' type_id=9
>>>>> [11] TYPEDEF '__s8' type_id=12
>>>>> [12] INT 'signed char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
>>>>> [13] TYPEDEF '__u8' type_id=14
>>>>>
>>>>> So often-used types will be found quickly, even under linear search
>>>>> conditions.
>>>>
>>>> I found that there seems to be no code in the kernel that gets the
>>>> ID of a basic data type by calling btf_find_by_name_kind directly.
>>>> The general usage of this function is to obtain the ID of a
>>>> structure or function. After we get the ID of a structure or
>>>> function, it is O(1) to get the IDs of its members or parameters.
>>>>
>>>> ./kernel/trace/trace_probe.c:383:       id = btf_find_by_name_kind(btf, funcname, BTF_KIND_FUNC);
>>>> ./kernel/bpf/btf.c:3523:        id = btf_find_by_name_kind(btf, value_type, BTF_KIND_STRUCT);
>>>> ./kernel/bpf/btf.c:5504:                id = btf_find_by_name_kind(btf, alloc_obj_fields[i], BTF_KIND_STRUCT);
>>>> ./kernel/bpf/bpf_struct_ops.c:128:      module_id = btf_find_by_name_kind(btf, "module", BTF_KIND_STRUCT);
>>>> ./net/ipv4/bpf_tcp_ca.c:28:     type_id = btf_find_by_name_kind(btf, "sock", BTF_KIND_STRUCT);
>>>> ./net/ipv4/bpf_tcp_ca.c:33:     type_id = btf_find_by_name_kind(btf, "tcp_sock", BTF_KIND_STRUCT);
>>>> ./net/netfilter/nf_bpf_link.c:181:      type_id = btf_find_by_name_kind(btf, name, BTF_KIND_STRUCT);
>>>>
>>>>> When we look at how many lookups by id (which are O(1), since they
>>>>> are done via the btf->types[] array) versus by name, we see:
>>>>>
>>>>> $ grep btf_type_by_id kernel/bpf/*.c|wc -l
>>>>> 120
>>>>> $ grep btf_find_by_nam kernel/bpf/*.c|wc -l
>>>>> 15
>>>>>
>>>>> I don't see a huge number of name-based lookups, and I think most
>>>>> are outside of the hotter codepaths, unless I'm missing some. All
>>>>> of which is to say it would be a good idea to have a clear sense of
>>>>> what will get faster with sorted-by-name BTF. Thanks!
>>>>
>>>> The story goes like this.
>>>>
>>>> I have added a new feature to the function graph called
>>>> "funcgraph_retval", here is the link:
>>>>
>>>> https://lore.kernel.org/all/1fc502712c981e0e6742185ba242992170ac9da8.1680954589.git.pengdonglin@sangfor.com.cn/
>>>>
>>>> We can obtain the return values of almost every function during the
>>>> execution of the kernel through this feature; it can help us analyze
>>>> problems.
>>>
>>> It's a great feature!
>>
>> Thanks.
>>
>>>> However, this feature has two main drawbacks.
>>>>
>>>> 1. Even if a function's return type is void, a return value will
>>>> still be printed.
>>>>
>>>> 2.
>>>> The return value printed may be incorrect when the width of the
>>>> return type is smaller than the generic register.
>>>>
>>>> I think if we can get the return type of the function, then the
>>>> drawbacks mentioned above can be eliminated. The function
>>>> btf_find_by_name_kind can be used to get the ID of the kernel
>>>> function; then we can get its return type easily. If the return type
>>>> is void, the return value recorded will not be printed. If the width
>>>> of the return type is smaller than the generic register, then the
>>>> value stored in the upper bits will be trimmed. I have written a
>>>> demo and these drawbacks were resolved.
>>>>
>>>> However, during my test, I found that it took too much time to read
>>>> the trace log with this feature enabled, because the trace log
>>>> consists of 200,000 lines. The majority of the time was consumed by
>>>> btf_find_by_name_kind, which is called 200,000 times.
>>>>
>>>> So I think the performance of btf_find_by_name_kind may need to be
>>>> improved.
>>>
>>> If I recall, Masami's work uses BTF ids, but can cache them since the
>>> user explicitly asks for specific fields in the trace output. I'm
>>> presuming that's not an option for you due to the fact funcgraph
>>> tracing enables everything (or at least everything under a filter
>>> predicate) and you have limited context to work with, is that right?
>>
>> Yes, right.
>>
>>> Looking at print_graph_entry_leaf(), which I _think_ is where you'd
>>> need to print the retval from, you have access to the function
>>> address via call->func, and I presume you get the name from
>>> snprinting the symbol to a string or similar. So you're stuck in a
>>> context where you have the function address, and from that you can
>>> derive the function name. Is that correct? Thanks!
>>
>> Yes, both print_graph_return and print_graph_entry_leaf will call
>> print_graph_retval to print the return value. Then call
>> sprint_symbol_no_offset with
Then call sprint_symbol_no_offset with >> call->func to get >> the function name, then call btf_find_by_name_kind to get the return type. >> > > So what you ultimately need is a way to map from that information > available to be able to figure out the size of the return value > associated with a function. > > On the BPF side we've been thinking a bit about the relationship between > kernel function addresses and their BTF representations; sometimes > knowing BTF->address mapping is needed for cases where the same function > name has multiple inconsistent function signatures in BTF. We don't > represent function addresses yet in BTF but may end up having to. > The reason I mention this is in an ideal world, it would benefit to > populate kallsyms entries with their associated BTF ids; then a > function would need to be looked up once in kallsyms; that lookup would > benefit from recent speedups, and if it contained the associated BTF id > we'd have an O(1) lookup from the BTF id -> function. Not sure if that > would be tenable from the kallsyms side, but I just wanted to mention > it, as from the above it seems an address-based lookup is a possibility > to solve the return value type lookup problem that you're facing. > > Cc'ed Jiri who had to wrestle with kallsyms for the kprobe multi stuff. > Would the above help do you think? Thank you, but I tend to agree with Alexei. Using kallsyms will consume extra memory too, but it can only be used for functions. I have done a test using kallsyms_lookup + optimized btf_find_by_name_kind, and the time consumed was not excessive, it took about 38ms to find the IDs of 67823 kernel functions. If using a new map for IDs and function addresses, it took about 5ms. > > Thanks! > > Alan >