From: Peter Oskolkov
Date: Wed, 17 Mar 2021 11:43:32 -0700
Subject: Re: [RFC PATCH 0/3 v3] futex/sched: introduce FUTEX_SWAP operation
To: Jim Newsome
Cc: Peter Oskolkov, Linux Kernel Mailing List, Rob Jansen, Ryan Wails,
    Paul Turner, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Ben Segall
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Jim, thank you for your interest!

While FUTEX_SWAP seems to be a nonstarter, there is an off-list
discussion on how to approach the larger problem of userspace
scheduling. A full userspace scheduling patchset will likely take some
time to come together, but the "core" wait/wake/swap patches are more
or less ready, so I'll probably post an early RFC version here in the
next week or two.

CC-ing the maintainers.

Thanks,
Peter

On Wed, Mar 17, 2021 at 10:59 AM Jim Newsome wrote:
>
> I'm not well versed in this part of the kernel (ok, any part, really),
> but I wanted to chime in from a user perspective that I'm very
> interested in this functionality.
>
> We (Rob + Ryan + I, cc'd) are currently developing the second
> generation of the Shadow simulator, which is used by various
> researchers and the Tor Project. In this new architecture, simulated
> network-application processes (such as tor, browsers, and web servers)
> are each run as a native OS process, started by forking and exec'ing
> its unmodified binary. We are interested in supporting large
> simulations (e.g. 50k+ processes), and we expect them to take on the
> order of hours or even days to execute, so scalability and performance
> matter.
>
> We've prototyped two mechanisms for controlling these simulated
> processes, and a third hybrid mechanism that combines the two. I've
> mentioned one of these (ptrace) in another thread ("do_wait: make
> PIDTYPE_PID case O(1) instead of O(n)"). The other mechanism is an
> LD_PRELOAD'd shim that implements the libc interface and communicates
> with Shadow via a syscall-like API over IPC.
>
> So far the most performant version of this IPC we've tried uses a bit
> of shared memory and a pair of semaphores. It looks much like the
> example in Peter's proposal:
>
> > a. T1: futex-wake T2, futex-wait
> > b. T2: wakes, does what it has been woken to do
> > c. T2: futex-wake T1, futex-wait
>
> We've been able to get the switching costs down using CPU pinning and
> SCHED_FIFO. Each physical CPU spends most of its time swapping back
> and forth between a Shadow worker thread and an emulated process. Even
> so, the new architecture is so far slower than the first generation of
> Shadow, which multiplexes the simulated processes into its own handful
> of OS processes (but is complex and fragile).
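To make the quoted handoff concrete for anyone skimming the thread,
here is a rough userspace sketch of steps a-c using plain
FUTEX_WAIT/FUTEX_WAKE. It is illustrative only, not code from Shadow or
from the FUTEX_SWAP patches; the futex() wrapper and handoff() helper
are made-up names, error/EINTR handling is omitted, and in Shadow's
cross-process case the futex words would live in the shared memory
segment mentioned above.

/* Illustrative sketch of the T1/T2 handoff described above: mark the
 * peer runnable and wake it, then sleep on our own futex word until
 * the peer hands control back. Hypothetical names, minimal error
 * handling. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_uint *uaddr, int op, unsigned int val)
{
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Called by T1 for step a, and by T2 for step c (words swapped). */
static void handoff(atomic_uint *self, atomic_uint *peer)
{
        atomic_store(self, 0);              /* mark ourselves idle first   */
        atomic_store(peer, 1);              /* mark the peer runnable      */
        futex(peer, FUTEX_WAKE, 1);         /* "futex-wake" the peer       */
        while (atomic_load(self) == 0)      /* "futex-wait" until the peer */
                futex(self, FUTEX_WAIT, 0); /* wakes us for the next turn  */
}

Each round trip therefore pays for one wake and one wait per side, plus
the scheduler's choice of where to run the woken thread; a combined swap
operation would collapse the wake and the wait into a single call, which
is presumably where the quoted 5-10x figure comes from.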
>
> > With FUTEX_SWAP, steps a and c above can be reduced to one futex
> > operation that runs 5-10 times faster.
>
> IIUC the proposed primitives could let us further improve performance,
> and perhaps drop some of the complexity of attempting to control the
> scheduler via pinning and SCHED_FIFO.
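For completeness, the pinning + SCHED_FIFO setup described above usually
amounts to something like the sketch below. Again illustrative only and
not Shadow's actual code: pin_worker_thread() is a made-up helper, the
cpu and rt_prio values are placeholders, and SCHED_FIFO requires
CAP_SYS_NICE or a suitable RLIMIT_RTPRIO.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to one CPU and switch it to SCHED_FIFO so the
 * worker and the emulated process it pairs with keep ping-ponging on
 * the same core. Hypothetical helper; error handling kept minimal. */
static int pin_worker_thread(int cpu, int rt_prio)
{
        cpu_set_t set;
        struct sched_param sp = { .sched_priority = rt_prio };

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
                return -1;                      /* could not pin        */

        /* Needs CAP_SYS_NICE or an RLIMIT_RTPRIO grant. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
                return -1;                      /* no RT privileges     */

        return 0;
}

If a swap-style futex operation can hand the CPU directly to the target
thread, much of this manual placement and priority tuning should become
unnecessary, which is the simplification Jim is pointing at.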