Date: Tue, 2 Aug 2022 11:01:19 -0400 (EDT)
From: Mathieu Desnoyers
To: Peter Oskolkov
Cc: Peter Zijlstra, Linux Kernel Mailing List, Thomas Gleixner,
    "Paul E. McKenney", Boqun Feng, "H. Peter Anvin", Paul Turner,
    linux-api, Christian Brauner, Florian Weimer, David Laight, carlos,
    Chris Kennelly, Peter Oskolkov
Message-ID: <500891137.95782.1659452479846.JavaMail.zimbra@efficios.com>
References: <20220729190225.12726-1-mathieu.desnoyers@efficios.com>
Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions
X-Mailing-List: linux-kernel@vger.kernel.org

----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov posk@posk.io wrote:

> On Fri, Jul 29, 2022 at
> 12:02 PM Mathieu Desnoyers wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
>
> Thanks a lot, Mathieu - it is really exciting to see this happening!
>
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed Chris Kennelly, who worked on the userspace/tcmalloc
> side, as he can provide more context/details if I miss or misrepresent
> something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.

> The problem:
>
> tcmalloc maintains per-cpu freelists in userspace to make memory
> allocations fast and efficient; it relies on rseq to do so, as any
> manipulation of the freelists has to be protected against thread
> migrations.
>
> However, a typical userspace process in a Google datacenter is
> confined to a relatively small number of CPUs (8-16) via cgroups,
> while the servers typically have a much larger number of physical
> CPUs, so the per-cpu freelist model is somewhat wasteful: if a process
> has at most 10 threads running, for example, but these threads can
> "wander" across 100 CPUs over the lifetime of the process, keeping 100
> freelists instead of 10 noticeably wastes memory.
>
> Note that although a typical process at Google has a limited CPU
> quota, and thus uses only a small number of CPUs at any given time,
> the process may often have many hundreds or thousands of threads, so
> per-thread freelists are not a viable solution to the problem just
> described.
>
> Our current solution:
>
> As you outlined in patch 9, tracking the number of currently running
> threads per address space and exposing this information via a vcpu_id
> abstraction helps tcmalloc noticeably reduce its freelist overhead in
> the "narrow process running on a wide server" situation, which is
> typical at Google.
>
> We have experimented with several approaches here.
> The one that we are currently using is the "flat" model: we allocate
> vcpu IDs ignoring numa nodes.
>
> We did try per-numa-node vcpus, but it did not show any material
> improvement over the "flat" model, perhaps because on our most "wide"
> servers the CPU topology is multi-level. Chris Kennelly may provide
> more details here.

I would really like to know more about Google's per-numa-node vcpu
implementation. I suspect you may have taken a different turn somewhere
in the design which led to these results, but having not seen that
implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

  // This is a prototype extension to the rseq() syscall. Since a process
  // may run on only a few cores at a time, we can use a dense set of
  // "v(irtual) cpus." This can reduce cache requirements, as we only
  // need N caches for the cores we actually run on simultaneously,
  // rather than a cache for every physical core.
  union {
          struct {
                  short numa_node_id;
                  short vcpu_id;
          };
          int vcpu_flat;
  };

Can you tell me more about how the numa_node_id and vcpu_id are
allocated internally, and how they are expected to be used by
userspace?

> On a more technical note, we use atomic operations extensively in the
> kernel to make sure vcpu IDs are "tightly packed": if only N threads
> of a process are currently running on physical CPUs, vcpu IDs will be
> in the range [0, N-1], i.e. no gaps, no going to N and above. This
> does consume some extra CPU cycles, but the RAM savings we gain far
> outweigh the extra CPU cost; it will be interesting to see what you
> can do with the optimizations you propose in this patchset.

The optimizations I propose keep those "tightly packed"
characteristics, but skip the atomic operations in common scenarios.
I'll welcome benchmarks of the added overhead in representative
workloads.

> Again, thanks a lot for this effort!

Thanks for your input.
It really helps steer the effort in the right direction.

Mathieu

> Peter
>
> [...]

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com