Date: Tue, 15 Jun 2021 16:38:12 -0300
From: Arnaldo Carvalho de Melo
To: Andrii Nakryiko
Cc: Yonghong Song, Andrii Nakryiko, Jiri Olsa, dwarves@vger.kernel.org,
    bpf, Kernel Team, Linux Kernel Mailing List
Subject: Re: Parallelizing vmlinux BTF encoding.
 was Re: [RFT] Testing 1.22
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jun 15, 2021 at 12:13:55PM -0700, Andrii Nakryiko wrote:
> On Tue, Jun 15, 2021 at 12:01 PM Arnaldo Carvalho de Melo wrote:
> > On Tue, Jun 08, 2021 at 09:59:48AM -0300, Arnaldo Carvalho de Melo wrote:
> > > On Mon, Jun 07, 2021 at 05:53:59PM -0700, Andrii Nakryiko wrote:
> > > > I think it's very fragile and it will be easy to get
> > > > broken/invalid/incomplete BTF. Yonghong already brought up the case

> > > I thought about that as it would be almost like the compiler generating
> > > BTF, but you are right, the vmlinux prep process is a complex beast and
> > > probably it is best to go with the second approach I outlined and you
> > > agreed to be less fragile, so I'll go with that, thanks for your
> > > comments.

> > So, just to write some notes here from what I saw so far:

> > 1. In the LTO cases there are inter-CU references, so the current code
> > combines all CUs into one and we end up not being able to parallelize
> > much. LTO is expensive, so... I'll leave it for later, but yeah, I don't
> > think the current algorithm is ideal, it can be improved.

> Yeah, let's worry about LTO later.

> > 2. The case where there are no inter-CU refs, which so far is the most
> > common, seems easier: we create N threads, all sharing the dwarf_loader
> > state and the btf_encoder, as-is now. We can process one CU per thread,
> > and as soon as we finish it, just grab a lock and call
> > btf_encoder__encode_cu() with the just-produced CU data structures
> > (tags, types, functions, variables, etc), consume them and delete the
> > CU.
> > So each thread will consume one CU, push it to the 'struct btf' class
> > as-is now and then ask for the next CU, using the dwarf_loader state,
> > still under that lock, then go back to processing dwarf tags, then
> > lock, btf add types, rinse, repeat.

> Hmm... wouldn't keeping a "local" per-thread struct btf and just
> appending to it for each processed CU until we run out of CUs be
> simpler?

I thought about this as a logical next step. I would love to have a
'btf__merge_argv(struct btf *btf[])', is there one?

But from what I've read after this first paragraph of yours, let me try
to rephrase:

1. pahole calls btf_encoder__new(...), which creates a single struct btf.

2. dwarf_loader creates N threads, each of which calls a locked
   dwarf_get_next_cu() that returns a CU to process; when a thread
   finishes its CU, it calls btf_encoder__encode_cu() under an
   all-threads lock. Rinse, repeat, until all the threads have consumed
   all CUs.

Then btf_encoder__encode(), which should probably be renamed to
btf_encoder__finish(), calls btf__dedup(encoder->btf) and writes the
ELF or raw file.

My first reaction to your first paragraph was: yeah, we can have
multiple 'struct btf' instances, one per thread, each containing a
subset of the DWARF CUs encoded as BTF, and then I have to merge the
per-thread BTFs and then dedup. I think my rephrase above is better, no?

> So each thread does as much as possible locally without any
> locks. And only at the very end we merge everything together and then
> dedup. Or we can even dedup inside each worker before merging the
> final btf, that probably would give quite a lot of speed up and some
> memory saving. Would be interesting to experiment with that.

> So I like the idea of a fixed pool of threads (can be customized, and
> I'd default to num_workers == num_cpus), but I think we can and should
> keep them independent for as long as possible.
Sure, this should map to whatever the user passes to -j in the kernel
make command line; if nothing is passed as an argument, then default to
getconf(_NPROCESSORS_ONLN).

There is a nice coincidence here where we probably don't care about -J
anymore and want to deal only with -j (detached BTF), which is the same
flag 'make' expects to state how many "jobs" (thread pool size) the
user wants 8-)

> Another disadvantage of generating small struct btfs and then doing
> lock + merge is that we don't get as efficient string re-use; we'll
> churn more on string memory allocation. Keeping bigger local struct
> btfs allows for more efficient memory re-use (and probably a tiny bit
> of CPU savings).

I think we're on the same page: the contention for adding the CU to a
single 'struct btf' (amongst all DWARF loading threads) right after we
produce it should be minimal, so we grab all the advantages: locality
of reference and minimal contention, as reading DWARF and creating
pahole's internal, neutral data structures should cost more than adding
types/functions/variables via the libbpf BTF API.

I.e. we can leave parallelizing the BTF _encoding_ for later; what
we're trying to do now is to parallelize the DWARF _loading_, right?

> So please consider that, it also seems simpler overall.

> > The ordering will be different than what we have now, as some
> > smaller CUs (object files with debug) will be processed faster so
> > will get their btf encoding slots faster, but that, at btf__dedup()
> > time, shouldn't make a difference, right?

> Right, order doesn't matter.

> > I think I'm done with refactoring the btf_encoder code, which should
> > by now be a thin layer on top of the excellent libbpf BTF API, just
> > getting what the previous loader (DWARF) produced and feeding libbpf.

> Great.
> > I thought about fancy thread pools, etc, researching some
> > pre-existing thing or doing some kthread + workqueue lifting from
> > the kernel, but I will instead start with the most spartan code; we
> > can improve later.

> Agree, simple is good. Really curious how much faster we can get. I
> think anything fancy will give a relatively small improvement. The
> biggest one will come from any parallelization.

And I think that is possible, modulo the elfutils libraries saying no,
which I hope will not be the case.

- Arnaldo