From: Peter Oskolkov
Date: Wed, 21 Jul 2021 16:44:14 -0700
Subject: Re: [RFC PATCH 4/4 v0.3] sched/umcg: RFC: implement UMCG syscalls
To: Thierry Delisle
Cc: Peter Oskolkov, Andrei Vagin, Ben Segall, Jann Horn, Jim Newsome,
    Joel Fernandes, linux-api@vger.kernel.org,
    Linux Kernel Mailing List, Ingo Molnar, Peter Zijlstra, Paul Turner,
    Thomas Gleixner, Peter Buhr
In-Reply-To: <5790661b-869c-68bd-86fa-62f580e84be1@uwaterloo.ca>
References: <20210716184719.269033-5-posk@google.com>
    <2c971806-b8f6-50b9-491f-e1ede4a33579@uwaterloo.ca>
    <5790661b-869c-68bd-86fa-62f580e84be1@uwaterloo.ca>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Jul 21, 2021 at 12:56 PM Thierry Delisle wrote:
>
> > Yes, this is naturally supported in the current patchset on the kernel
> > side, and is supported in libumcg (to be posted later, when the kernel
> > side is settled); internally at Google, some applications use
> > different "groups" of workers/servers per NUMA node.
>
> Good to know. Cforall has the same feature, where we refer to these groups
> as "clusters". https://doi.org/10.1002/spe.2925 (Section 7)
>
> > Please see the attached atomic_stack.h file - I use it in my tests,
> > things seem to be working. Specifically, atomic_stack_gc does the
> > cleanup. For the kernel side of things, see the third patch in this
> > patchset.
>
> I don't believe the atomic_stack_gc function is robust enough to offer
> any guarantee. I believe that once a node is unlinked, its next pointer
> should be reset immediately, e.g., by writing 0xDEADDEADDEADDEAD. Do your
> tests work if the next pointer is reset immediately on reclaimed nodes?

In my tests, reclaimed nodes have their next pointers immediately set to
point to the list head. If the kernel gets a node with its @next pointing
to something else, then yes, things break down (the kernel kills the
process); this has happened occasionally when I had a bug in the
userspace code.

> As far as I can tell, the reclaimed nodes in atomic_stack_gc still
> contain valid next fields. I believe there is a race which can lead to
> the kernel reading reclaimed nodes. If atomic_stack_gc does not reset
> the fields, this bug could be hidden in the testing.

Could you please provide a bit more detail on when/how the race can
happen? Servers add themselves to the list, so there can be no races
there (servers going idle: add to the list; wait; gc (under a lock);
restore @next; do stuff).
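For reference, the server-side sequence above looks roughly like the
sketch below. This is simplified C; the struct, variable, and helper
names here are mine and only approximate what is in the patchset and in
atomic_stack.h.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

struct node {
        _Atomic(uintptr_t) next;        /* link word the kernel follows */
};

static _Atomic(uintptr_t) idle_head;    /* head of the idle-servers list */
static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Placeholders for the real primitives (e.g. sys_umcg_wait() and the
 * atomic_stack_gc() from the attached header); stubbed out here. */
static void server_wait(void) { /* block until the kernel wakes us */ }
static void idle_list_gc(void) { /* unlink and reclaim dead nodes   */ }

/* Lock-free push: servers add themselves to the idle list. */
static void push_idle(struct node *self)
{
        uintptr_t old = atomic_load(&idle_head);

        do {
                atomic_store(&self->next, old);
        } while (!atomic_compare_exchange_weak(&idle_head, &old,
                                               (uintptr_t)self));
}

static void server_go_idle(struct node *self)
{
        push_idle(self);                /* add to the list   */
        server_wait();                  /* wait              */

        pthread_mutex_lock(&gc_lock);   /* gc (under a lock) */
        idle_list_gc();
        pthread_mutex_unlock(&gc_lock);

        /* restore @next: a reclaimed node points back at the list head,
         * never at stale memory, so the kernel cannot follow a dangling
         * link. */
        atomic_store(&self->next, (uintptr_t)&idle_head);

        /* ... do stuff ... */
}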
Workers are trickier, as they can be woken by signals and then block
again, but stray signals are so bad here that I'm thinking of actually
not letting sleeping workers wake on signals. Other than signals waking
queued/unqueued idle workers, are there any other potential races here?

> A more aggressive test is to put each node in a different page and
> remove read permissions when the node is reclaimed. I'm not sure this
> applies when the kernel is the one reading.
>
> > To keep the kernel side light and simple. To also protect the kernel
> > from spinning if userspace misbehaves. Basically, the overall approach
> > is to delegate most of the work to the userspace, and keep the bare
> > minimum in the kernel.
>
> I'll try to keep this in mind then.
>
> After some thought, I'll suggest a scheme to significantly reduce
> complexity. As I understand it, the idle_workers_ptr entries are linked
> to form one or more Multi-Producer Single-Consumer queues. If each head
> is augmented with a single volatile tid-sized word, servers that want to
> go idle can simply write their id into that word. When the kernel adds
> something to the idle_workers_ptr list, it simply does an XCHG with 0 or
> any INVALID_TID. This scheme only supports one server blocking per
> idle_workers_ptr list. To keep the "kernel side light and simple", you
> can simply require that any extra servers synchronize among each other
> to pick which server is responsible for waiting on behalf of everyone.

I'm not yet convinced that the single-linked-list approach is infeasible.
And if it is, a simple fix would be to have two pointers per list in
struct umcg_task: head and next.

Thanks,
Peter
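P.S. To make sure I read your suggestion correctly, here is the scheme as
I understand it, in rough C. The names, the INVALID_TID value, and the
layout below are mine and purely illustrative, not anything from the
patchset.

#include <stdatomic.h>
#include <stdint.h>

#define INVALID_TID 0u

struct idle_workers_head {
        _Atomic(uintptr_t) first;       /* MPSC list of idle workers */
        _Atomic(uint32_t)  server_tid;  /* the one server blocked on
                                           this list, or INVALID_TID */
};

/* Userspace: the designated server publishes its tid, then blocks.
 * Any extra servers must agree among themselves who gets to do this. */
static void server_announce_and_wait(struct idle_workers_head *h,
                                     uint32_t my_tid)
{
        atomic_store(&h->server_tid, my_tid);
        /* ... block, e.g. in sys_umcg_wait() ... */
}

/* Kernel side, conceptually: after pushing a worker onto the list, XCHG
 * the word with INVALID_TID; a nonzero result is the server to wake. */
static uint32_t kernel_pick_server_to_wake(struct idle_workers_head *h)
{
        return atomic_exchange(&h->server_tid, INVALID_TID);
}

If that matches what you meant, then yes, it is simpler, at the cost of
allowing only one blocked server per list, as you note.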