From: Florian Weimer
To: Mathieu Desnoyers
Cc: Peter Zijlstra, linux-kernel, Thomas Gleixner, paulmck, Boqun Feng,
    H. Peter Anvin, Paul Turner, linux-api, Christian Brauner,
    David Laight, carlos, Peter Oskolkov
Subject: Re: [RFC PATCH 2/3] rseq: extend struct rseq with per thread group vcpu id
References: <20220201192540.10439-1-mathieu.desnoyers@efficios.com>
    <20220201192540.10439-2-mathieu.desnoyers@efficios.com>
    <87bkzqz75q.fsf@mid.deneb.enyo.de>
    <1075473571.25688.1643746930751.JavaMail.zimbra@efficios.com>
Date: Tue, 01 Feb 2022 21:32:51 +0100
In-Reply-To: <1075473571.25688.1643746930751.JavaMail.zimbra@efficios.com>
Message-ID: <87sft2xr7w.fsf@mid.deneb.enyo.de>
X-Mailing-List: linux-kernel@vger.kernel.org

* Mathieu Desnoyers:

> ----- On Feb 1, 2022, at 3:03 PM, Florian Weimer fw@deneb.enyo.de wrote:
>
>> * Mathieu Desnoyers:
>>
>>> If a thread group has fewer threads than cores, or is limited to run on
>>> few cores concurrently through sched affinity or cgroup cpusets, the
>>> virtual cpu ids will be values close to 0, thus allowing efficient use
>>> of user-space memory for per-cpu data structures.
>>
>> From a userspace programmer perspective, what's a good way to obtain a
>> reasonable upper bound for the possible tg_vcpu_id values?
>
> Some effective upper bounds:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
> - the number of cpus in the cpu affinity mask applied by
>   sched_setaffinity, except in corner-case situations such as cpu
>   hotplug removing all cpus from the affinity set,
> - cgroup cpuset "partition" limits,

Affinity masks and _SC_NPROCESSORS_CONF can be off by more than an
order of magnitude compared to the cgroup cpuset, I think, and those
aren't even that atypical configurations.
The number of concurrent threads sounds more tractable, but I'm
worried about things creating threads behind libc's back (perhaps
io_uring?).  So it couldn't be a hard upper bound.

I'm worried about querying anything cgroup-related because these APIs
have a reputation for being slow, convoluted, and unstable
(effectively not subject to the “don't break userspace” rule).
Hopefully I'm wrong about that.

>> I believe not all users of cgroup cpusets change the affinity mask.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup
> cpuset.  Those are two mechanisms both affecting the scheduler task
> placement.

There are container hosts out there that synthesize an affinity mask
that matches the CPU allocation, assuming that anyone who calls
sched_getaffinity only does so for counting the number of set bits.

> I would expect the user-space code to use some sensible upper bound
> as a hint about how many per-vcpu data structure elements to expect
> (and how many to pre-allocate), but have a "lazy initialization"
> fall-back in case the vcpu id goes up to the number of configured
> processors - 1.  And I suspect that even the number of configured
> processors may change with CRIU.

Sounds reasonable.

>> Is the switch really useful?  I suspect it's faster to just write as
>> much as possible all the time.  The switch should be
>> well-predictable if running uniform userspace, but still …
>
> The switch ensures the kernel don't try to write to a memory area
> beyond the rseq size which has been registered by user-space.  So it
> seems to be useful to ensure we don't corrupt user-space memory.  Or
> am I missing your point ?

Due to the alignment, I think you'd only ever see 32 and 64 bytes for
now?

I'd appreciate it if you could put the maximum supported size and
possibly the alignment in the auxiliary vector, so that we don't have
to issue rseq system calls in a loop on process startup.
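For what it's worth, the hint-plus-lazy-fallback scheme could look
roughly like this single-threaded sketch (hypothetical names; in real
code the index would come from the rseq tg_vcpu_id field, and the
array couldn't simply be realloc'ed, since other threads hold pointers
into it — a two-level table or similar would be needed):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-vcpu slot; stands in for the application's
 * per-cpu data structure. */
struct percpu_slot {
	long counter;
};

static struct percpu_slot *slots;
static size_t nr_slots;

/* Pre-allocate based on a hint, e.g. the affinity mask bit count. */
static int percpu_init(size_t hint)
{
	nr_slots = hint ? hint : 1;
	slots = calloc(nr_slots, sizeof(*slots));
	return slots ? 0 : -1;
}

/* Lazy fall-back: grow if a vcpu id beyond the hint shows up,
 * doubling so growth stays amortized. */
static struct percpu_slot *percpu_get(size_t vcpu_id)
{
	if (vcpu_id >= nr_slots) {
		size_t new_nr = nr_slots;
		while (vcpu_id >= new_nr)
			new_nr *= 2;
		struct percpu_slot *tmp =
			realloc(slots, new_nr * sizeof(*slots));
		if (!tmp)
			return NULL;
		/* Zero only the newly added tail. */
		memset(tmp + nr_slots, 0,
		       (new_nr - nr_slots) * sizeof(*tmp));
		slots = tmp;
		nr_slots = new_nr;
	}
	return &slots[vcpu_id];
}
```

The pre-allocation hint bounds the common-case footprint; the fallback
covers CRIU-style changes to the number of configured processors.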