References: <156763534546.18676.3530557439501101639.stgit@warthog.procyon.org.uk> <17703.1567702907@warthog.procyon.org.uk> <5396.1567719164@warthog.procyon.org.uk> <14883.1567725508@warthog.procyon.org.uk>
In-Reply-To: <14883.1567725508@warthog.procyon.org.uk>
From: Linus Torvalds
Date: Thu, 5 Sep 2019 17:07:25 -0700
Subject: Re: Why add the general notification queue and its sources
To: David Howells
Cc: Ray Strode, Greg Kroah-Hartman, Steven Whitehouse, Nicolas Dichtel, raven@themaw.net, keyrings@vger.kernel.org, linux-usb@vger.kernel.org, linux-block, Christian Brauner, LSM List, linux-fsdevel, Linux API, Linux List Kernel Mailing, Al Viro, "Ray, Debarshi", Robbie Harwood
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Sep 5, 2019 at 4:18 PM David Howells wrote:
>
> Can you write into a pipe from softirq context and/or with spinlocks held
> and/or with the RCU read lock held?  That is a requirement.
> Another is that messages get inserted whole or not at all (or if they are
> truncated, the size field gets updated).

Right now we use a mutex for the buffer locking, so no, pipe buffers
are not irq-safe or atomic. That's due to the whole "we may block on
data from user space" when doing a write.

HOWEVER.

Pipes actually have buffers on two different levels: there's the
actual data buffers themselves (each described by a "struct
pipe_buffer"), and there's the circular queue of them (the
"pipe->bufs[]" array, with pipe->curbuf/nrbufs) that points to
individual data buffers.

And we could easily separate out that data buffer management. Right
now it's not really all that separated: people just do things like

        int newbuf = (pipe->curbuf + bufs) & (pipe->buffers-1);
        struct pipe_buffer *buf = pipe->bufs + newbuf;
        ...
        pipe->nrbufs++;

to add a buffer into that circular array of buffers, but _that_ part
could be made separate. It's just all protected by the pipe mutex
right now, so it has never been an issue.

And yes, atomicity of writes has actually been an integral part of
pipes since forever. It's actually the only unambiguous atomicity that
POSIX guarantees. It only holds for writes to pipes of at most
PIPE_BUF bytes, but that's 4096 on Linux.

> Since one end would certainly be attached to an fd, it looks on the face of it
> that writing into the pipe would require taking pipe->mutex.

That's how the normal synchronization is done, yes. And changing that
in general would be pretty painful. For example, two concurrent
user-space writers might take page faults and just generally be
painful, and the pipe locking needs to serialize that.

So the mutex couldn't go away from pipes in general - it would remain
for read/write/splice mutual exclusion (and it's not just the data it
protects, it's the reader/writer logic for EPIPE etc).

But the low-level pipe->bufs[] handling is another issue entirely.
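
As a rough illustration of the circular-array bookkeeping described above, here is a small user-space model in C. It is only a sketch under stated assumptions: `struct ring_model`, `ring_push()` and `ring_pop()` are made-up names, a plain `int` stands in for a `struct pipe_buffer`, and the serialization that the real code gets from the pipe mutex is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not kernel code. 16 mirrors PIPE_DEF_BUFFERS. */
#define RING_SLOTS 16u

/* The masking trick below only works for power-of-two sizes. */
_Static_assert((RING_SLOTS & (RING_SLOTS - 1)) == 0,
               "RING_SLOTS must be a power of two");

struct ring_model {
    unsigned int curbuf;    /* index of the oldest occupied slot */
    unsigned int nrbufs;    /* number of occupied slots */
    int slots[RING_SLOTS];  /* stand-in for struct pipe_buffer entries */
};

/* Append one entry at the tail; returns 0, or -1 if the ring is full.
 * In the real pipe code the caller would hold the pipe mutex here. */
static int ring_push(struct ring_model *r, int value)
{
    if (r->nrbufs == RING_SLOTS)
        return -1;
    /* Same arithmetic as the fragment in the mail: start + count, wrapped. */
    unsigned int newbuf = (r->curbuf + r->nrbufs) & (RING_SLOTS - 1);
    r->slots[newbuf] = value;
    r->nrbufs++;
    return 0;
}

/* Remove the oldest entry; returns 0, or -1 if the ring is empty. */
static int ring_pop(struct ring_model *r, int *value)
{
    if (r->nrbufs == 0)
        return -1;
    *value = r->slots[r->curbuf];
    r->curbuf = (r->curbuf + 1) & (RING_SLOTS - 1);
    r->nrbufs--;
    return 0;
}
```

Note that `curbuf` and `nrbufs` are updated as two separate plain fields, which is exactly why everything has to sit under one lock in this form.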
Even when a user space writer copies things from user space, it does
so into a pre-allocated buffer that is then attached to the list of
buffers somewhat separately (there's a magical special case where you
can re-use a buffer that is marked as "I can be reused" and append
into an already allocated buffer).

And adding new buffers *could* be done with its own separate locking.
If you have a blocking writer (ie a user space data source), that
would still take the pipe mutex, and it would delay the user space
readers (because the readers also need the mutex), but it should not
be all that hard to just make the whole "curbuf/nrbufs" handling use
its own locking (maybe even some lockless atomics and cmpxchg).

So a kernel writer could "insert" a "struct pipe_buffer" atomically,
and wake up the reader atomically. No need for the other complexity
that is protected by the mutex.

The biggest problem is perhaps that the number of pipe buffers per
pipe is fairly limited by default. PIPE_DEF_BUFFERS is 16, and if we'd
insert using the ->bufs[] array, that would be the limit of "number of
messages". But each message could be any size (we've historically
limited pipe buffers to one page each, but that limit isn't all that
hard. You could put more data in there).

The number of pipe buffers _is_ dynamic, so the above PIPE_DEF_BUFFERS
isn't a hard limit, but it would be the default.

Would it be entirely trivial to do all the above? No. But it's
*literally* just finding the places that work with
pipe->curbuf/nrbufs and making them use atomic updates. You'd find all
the places by just renaming them (and making them atomic or whatever)
and the compiler will tell you "this area needs fixing".

We've actually used pipes for messages before: autofs uses a magic
packetized pipe buffer thing. It didn't need any extra atomicity,
though, so it still all worked with the regular pipe->mutex thing.

And there is a big advantage from using pipes. They really would work
with almost anything.
You could even mix-and-match "data generated by
kernel" and "data done by 'write()' or 'splice()' by a user process".

NOTE! I'm not at all saying that pipes are perfect. You'll find people
who swear by sockets instead. They have their own advantages (and
disadvantages). Most people who do packet-based stuff tend to prefer
sockets, because those have standard packet-based models (Linux pipes
have that packet mode too, but it's certainly not standard, and I'm
not even sure we ever exposed it to user space - it could be that it's
only used by the autofs daemon).

I have a soft spot for pipes, just because I think they are simpler
than sockets. But that soft spot might be misplaced.

               Linus
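
The "lockless atomics and cmpxchg" idea for the curbuf/nrbufs handling mentioned above might look roughly like the following user-space model, written with C11 atomics. Everything here is an assumption made for illustration: `struct msg_ring`, `msg_ring_insert()` and `msg_ring_take()` are invented names, an `int` stands in for a `struct pipe_buffer`, there is exactly one reader, and reader wakeup is not modeled. It is not how the kernel does or would do it, just one way to get "inserted whole or not at all" without a sleeping lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model, not kernel code. 16 mirrors PIPE_DEF_BUFFERS. */
#define MSG_SLOTS 16u

struct msg_slot {
    atomic_bool ready;  /* set by a producer once the payload is written */
    int payload;        /* stand-in for a struct pipe_buffer */
};

struct msg_ring {
    atomic_uint head;   /* next slot to reserve (producers, via cmpxchg) */
    atomic_uint tail;   /* next slot to consume (single reader only) */
    struct msg_slot slot[MSG_SLOTS];
};

static void msg_ring_init(struct msg_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
    for (unsigned int i = 0; i < MSG_SLOTS; i++)
        atomic_init(&r->slot[i].ready, false);
}

/* Insert one message whole or not at all. Never blocks or sleeps;
 * returns false if the ring looks full. Safe for several producers. */
static bool msg_ring_insert(struct msg_ring *r, int payload)
{
    unsigned int head;

    for (;;) {
        /* Read tail first (acquire) so the head read below is never older. */
        unsigned int tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        head = atomic_load_explicit(&r->head, memory_order_relaxed);
        if (head - tail >= MSG_SLOTS)
            return false;
        /* Reserve slot 'head' by bumping the counter with a cmpxchg. */
        if (atomic_compare_exchange_weak_explicit(&r->head, &head, head + 1,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed))
            break;
    }

    struct msg_slot *s = &r->slot[head & (MSG_SLOTS - 1)];
    s->payload = payload;
    /* Publish: the reader only looks at the payload after seeing 'ready'. */
    atomic_store_explicit(&s->ready, true, memory_order_release);
    return true;
}

/* Single reader: take the oldest published message, if there is one. */
static bool msg_ring_take(struct msg_ring *r, int *payload)
{
    unsigned int tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    struct msg_slot *s = &r->slot[tail & (MSG_SLOTS - 1)];

    if (!atomic_load_explicit(&s->ready, memory_order_acquire))
        return false;   /* empty, or reserved but not yet published */
    *payload = s->payload;
    atomic_store_explicit(&s->ready, false, memory_order_relaxed);
    /* Release the slot back to producers only after we are done with it. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

The per-slot `ready` flag is a design choice of this sketch: it lets a producer reserve a slot and fill it in two steps while the reader still only ever sees complete messages. The ring size caps the number of queued messages, matching the "number of messages" limit discussed above.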