From: Suren Baghdasaryan
Date: Thu, 4 May 2023 08:08:13 -0700
Subject: Re: [PATCH 00/40] Memory allocation profiling
To: Michal Hocko
References: <20230501165450.15352-1-surenb@google.com>
Cc: akpm@linux-foundation.org, kent.overstreet@linux.dev, vbabka@suse.cz,
    hannes@cmpxchg.org, roman.gushchin@linux.dev, mgorman@suse.de,
    dave@stgolabs.net, willy@infradead.org, liam.howlett@oracle.com,
    corbet@lwn.net, void@manifault.com, peterz@infradead.org,
    juri.lelli@redhat.com, ldufour@linux.ibm.com, catalin.marinas@arm.com,
    will@kernel.org, arnd@arndb.de, tglx@linutronix.de, mingo@redhat.com,
    dave.hansen@linux.intel.com, x86@kernel.org, peterx@redhat.com,
    david@redhat.com, axboe@kernel.dk, mcgrof@kernel.org,
    masahiroy@kernel.org, nathan@kernel.org, dennis@kernel.org,
    tj@kernel.org, muchun.song@linux.dev, rppt@kernel.org,
    paulmck@kernel.org, pasha.tatashin@soleen.com, yosryahmed@google.com,
    yuzhao@google.com, dhowells@redhat.com, hughd@google.com,
    andreyknvl@gmail.com, keescook@chromium.org, ndesaulniers@google.com,
    gregkh@linuxfoundation.org, ebiggers@google.com, ytcoode@gmail.com,
    vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
    rostedt@goodmis.org, bsegall@google.com, bristot@redhat.com,
    vschneid@redhat.com, cl@linux.com, penberg@kernel.org,
    iamjoonsoo.kim@lge.com, 42.hyeyoo@gmail.com, glider@google.com,
    elver@google.com, dvyukov@google.com, shakeelb@google.com,
    songmuchun@bytedance.com, jbaron@akamai.com, rientjes@google.com,
    minchan@google.com, kaleshsingh@google.com, kernel-team@android.com,
    linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
    iommu@lists.linux.dev, linux-arch@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    linux-modules@vger.kernel.org, kasan-dev@googlegroups.com,
    cgroups@vger.kernel.org

On Thu, May 4, 2023 at 2:07 AM Michal Hocko wrote:
>
> On Wed 03-05-23 08:09:28, Suren Baghdasaryan wrote:
> > On Wed, May 3, 2023 at 12:25 AM Michal Hocko wrote:
> [...]
> > Thanks for summarizing!
> >
> > > At least those I find the most important:
> > > - This is a big change and it adds a significant maintenance burden
> > >   because each allocation entry point needs to be handled specifically.
> > >   The cost will grow with the intended coverage, especially when the
> > >   allocation is hidden in library code.
> >
> > Do you mean that with more allocations in the codebase more codetags
> > will be generated? Is that the concern?
>
> No. I am mostly concerned about the _maintenance_ overhead. For the
> bare tracking (without profiling and thus stack traces) only those
> allocations that are directly inlined into the consumer are really
> of any use. That increases the code impact of the tracing because any
> relevant allocation location has to go through the micro surgery.
>
> E.g. is it really interesting to know that there is a likely memory
> leak in seq_file proper doing an allocation? No, as it is the specific
> implementation using seq_file that is most likely leaking. There are
> other examples like that. See?

Yes, I see that. One-level tracking does not provide all the information
needed to track such issues; something more informative would cost more.
That's why our proposal is to have a lightweight mechanism to get a
high-level picture and then be able to zoom into a specific area using
context capture. If you have ideas to improve this, I'm open to
suggestions.
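Just to make the lightweight first level concrete, the per-call-site
counting is conceptually something like the toy below. This is a
userspace-only illustration, not the actual macros or struct names from
the series (alloc_tag, malloc_tagged and the "alloc_tags" section name
are made up here); it assumes GCC/Clang statement expressions, the
linker-provided __start_*/__stop_* section symbols, and that the tag
entries end up laid out back to back in the section:

#include <stdio.h>
#include <stdlib.h>

/* One of these per annotated allocation call site. */
struct alloc_tag {
	const char *file;
	int line;
	unsigned long bytes;	/* bytes attributed to this call site */
};

/* Provided by the linker for the "alloc_tags" section. */
extern struct alloc_tag __start_alloc_tags[], __stop_alloc_tags[];

/*
 * Annotated allocation: each expansion creates its own static tag in a
 * dedicated section, so the call-site-to-counter link is fixed at
 * compile/link time and no runtime lookup is needed.
 */
#define malloc_tagged(size) ({						\
	static struct alloc_tag _tag					\
		__attribute__((used, section("alloc_tags"))) =		\
		{ __FILE__, __LINE__, 0 };				\
	_tag.bytes += (size);						\
	malloc(size);							\
})

int main(void)
{
	void *a = malloc_tagged(128);	/* two call sites ...           */
	void *b = malloc_tagged(4096);	/* ... two independent counters */

	/* The "report" is a plain walk over the tag section. */
	for (struct alloc_tag *t = __start_alloc_tags;
	     t < __stop_alloc_tags; t++)
		printf("%s:%d accounted %lu bytes\n",
		       t->file, t->line, t->bytes);

	free(a);
	free(b);
	return 0;
}

Because every call site carries its own static tag, there is no runtime
lookup needed to attribute an allocation; the flip side is exactly the
maintenance point above, since each allocation entry point has to be
annotated for the attribution to land in the right place.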
>
> > Or maybe as you commented in
> > another patch that context capturing feature does not limit how many
> > stacks will be captured?
>
> That is a memory overhead which can be really huge and it would be nice
> to be more explicit about that in the cover letter. It is a downside for
> sure, but not something that has a code maintenance impact, and it is an
> opt-in so it can be enabled only when necessary.

You are right, I'll add that into the cover letter.

>
> Quite honestly, though, the more I look into the context capturing part
> it seems to me that there is much more to be reconsidered there, and if
> you really want to move forward with the code tagging part then you
> should drop that for now. It would make the whole series smaller and
> easier to digest.

Sure, I don't see an issue with removing that for now and refining the
mechanism before posting again.

>
> > > - It has been brought up that this is duplicating functionality
> > >   already available via existing tracing infrastructure. You should
> > >   make it very clear why that is not suitable for the job.
> >
> > I experimented with using tracing with _RET_IP_ to implement this
> > accounting. The major issue is the _RET_IP_-to-codetag lookup runtime
> > overhead, which is orders of magnitude higher than the proposed code
> > tagging approach. With the code tagging proposal, that link is resolved
> > at compile time. Since we want this mechanism deployed in production,
> > we want to keep the overhead to the absolute minimum.
> > You asked me before how much overhead would be tolerable and the
> > answer will always be "as small as possible". This is especially true
> > for slab allocators, which are ridiculously fast, and regressing them
> > would be very noticeable (due to the frequent use).
>
> It would have been more convincing if you had some numbers at hand.
> E.g. this is a typical workload we are dealing with. With the compile
> time tags we are able to learn this with that much cost. With dynamic
> tracing we are able to learn this much with that cost. See? As small as
> possible is a rather vague term that different people will have a very
> different idea about.

I'm rerunning my tests with the latest kernel to collect the comparison
data. I profiled these solutions before, but the kernel has changed since
then, so I need to update them.

>
> > There is another issue, which I think can be solved in a smart way but
> > will either affect performance or require more memory. With the
> > tracing approach we don't know beforehand how many individual
> > allocation sites exist, so we have to allocate code tags (or similar
> > structures for counting) at runtime vs compile time. We can be smart
> > about it and allocate in batches or even preallocate more than we need
> > beforehand but, as I said, it will require some kind of compromise.
>
> I have tried our usual distribution config (only vmlinux without modules,
> so the real impact will be larger as we build a lot of stuff into
> modules) just to get an idea:
>
>     text       data        bss        dec      hex  filename
> 28755345   17040322   19845124   65640791  3e99957  vmlinux.before
> 28867168   17571838   19386372   65825378  3ec6a62  vmlinux.after
>
> Less than 1% for text, 3% for data. This is not all that terrible
> for an initial submission, and a more dynamic approach could be added
> later, e.g. with a smaller pre-allocated hash table that could be
> expanded lazily. Anyway, not something I would be losing sleep over.
> This can always be improved later on.

Ah, right. I should have mentioned this overhead too. Thanks for keeping
me honest.
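For the record, those deltas work out to roughly: text +111,823 bytes
(~0.4%), data +531,516 bytes (~3.1%), bss -458,752 bytes, or about
+184,587 bytes (~0.3%) overall for this config.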
> > I understand that code tagging creates additional maintenance burdens
> > but I hope it also produces enough benefits that people will want
> > this. The cost is also hopefully amortized when additional
> > applications like the ones we presented in RFC [1] are built using the
> > same framework.
>
> TBH I am much more concerned about the maintenance burden on the MM side
> than the actual code tagging itself, which is much more self-contained.
> I haven't seen other potential applications of the same infrastructure,
> and maybe the code impact would be much smaller than in the MM proper.
> Our allocator API is really hairy and convoluted.

Yes, other applications are much smaller and cleaner. MM allocation code
is quite complex indeed.

>
> > > - We already have page_owner infrastructure that provides allocation
> > >   tracking data. Why can't it be used/extended?
> >
> > 1. The overhead.
>
> Do you have any numbers?

Will post once my tests are completed.

>
> > 2. Covers only page allocators.
>
> Yes, this sucks.
>
> > I didn't think about extending the page_owner approach to slab
> > allocators, but I suspect it would not be trivial. I don't see
> > attaching an owner to every slab object as a scalable solution. The
> > overhead would again be of concern here.
>
> This would have been a nice argument to mention in the changelog so that
> we know that you have considered that option at least. Why should I (as
> a reviewer) have to guess that?

Sorry, it's hard to remember all the decisions, discussions and
conclusions when working on a feature over a long period of time. I'll
include more information about that.

>
> > I should point out that there was one important technical concern
> > about the lack of a kill switch for this feature, which was an issue
> > for distributions that can't disable the CONFIG flag. In this series
> > we addressed that concern.
>
> Thanks, that is certainly appreciated. I haven't looked deeper into that
> part, but from the cover letter I have understood that
> CONFIG_MEM_ALLOC_PROFILING implies unconditional page_ext and therefore
> the memory overhead associated with that. There seems to be a kill
> switch, nomem_profiling, but from a quick look it doesn't seem to disable
> page_ext allocations. I might be missing something there of course.
> Having a high-level description for that would be really nice as well.

Right, will add a description of that as well. We eliminate the runtime
overhead but not the memory one. However, I believe the latter is also
doable using the page_ext_operations.need callback (rough sketch at the
very end of this mail). Will look into it.

Thanks,
Suren.

>
> > [1] https://lore.kernel.org/all/20220830214919.53220-1-surenb@google.com/
>
> --
> Michal Hocko
> SUSE Labs
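P.S. What I have in mind for the page_ext memory overhead is roughly the
sketch below. This is untested and not taken from the actual patch
series; the page_ext_operations struct with its .need/.size fields and
early_param() are existing kernel interfaces, but the function and
variable names here are made up for illustration, and the real hookup of
nomem_profiling in the series may look different:

/* mm/alloc_tag_page_ext.c (illustrative only) */
#include <linux/page_ext.h>
#include <linux/init.h>
#include <linux/types.h>

static bool mem_profiling_disabled;

/* Kill switch: "nomem_profiling" on the kernel command line. */
static int __init setup_nomem_profiling(char *str)
{
	mem_profiling_disabled = true;
	return 0;
}
early_param("nomem_profiling", setup_nomem_profiling);

/*
 * page_ext reserves per-page space only for clients whose ->need()
 * returns true, so returning false here avoids the page_ext memory
 * overhead when profiling is switched off at boot.
 */
static bool need_alloc_tag_ext(void)
{
	return !mem_profiling_disabled;
}

struct page_ext_operations alloc_tag_page_ext_ops = {
	.size = sizeof(void *),	/* placeholder for the per-page payload */
	.need = need_alloc_tag_ext,
};

/* ...plus an entry in the page_ext_ops[] array in mm/page_ext.c. */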