Received: by 2002:a25:ab43:0:0:0:0:0 with SMTP id u61csp5614067ybi; Tue, 28 May 2019 16:25:10 -0700 (PDT) X-Google-Smtp-Source: APXvYqwOK+Te1zmImdLHGz9AWIVmWVsUv13d1AOsWvkfHUIbuGRsusa0LhQz+X3PFogudaNO+VRB X-Received: by 2002:a62:36c1:: with SMTP id d184mr126589592pfa.49.1559085910280; Tue, 28 May 2019 16:25:10 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1559085910; cv=none; d=google.com; s=arc-20160816; b=Fs/QwitunKoKM4AbmcjH2HVbn71499/uflsxcpbXCY3/zhggFGAdxvgLb9qexge6+g KurU0sODZT9FtEAaoDVOe7XtzJleBtc/UgIzhRFXhY11cbBUj8XeeUT3cKrF1FgbIVkz pg4KfR1tHE8Yv9Bkcn5c47T3mSg8odzeCQ8UMjnfKQyQiywofCLloReitOByhLFTF/82 +FDxrF8VbVccJVVlSvbFOgsB2qDMjEDEQNTd5r7ENOvChbbXj6Bbt5zVxFGwH5g6a4RF rrr9QFEy7fxcMwpe+54GOdPRmV0WkpH8L7nxsUdGFVlBUGcvkr3JjqADDRD8r41BpHu3 It3A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=bXl2E+8Ofltdc9I18xpTcw4iLvgqqVUYnnHUQU/GAq8=; b=YUJ/UZeEnf2soI5/1JEBdDTqmOPzF+ywm8QkozE6Z3KghQTaCxLDmeUNdPoDDpmg6p qy9QDhJVJdVWg93tAmCSWrCMNF1K6eALuXDn8hq7S68GDcKRnyeZ1hcHiFgQZnAca3Jw y9JYpkbQ9YEmjAlS5OOuoZsFYByQEGsAR8I/TBpIqb8kg6Xprv6IyaBbpHXi8ifb21ix nrwYld6lXi7p+Fh6xkJ5nXUtOlNjRVl3dpfrlwH43HuotHHxG9Pf79nJnNvCfIlgXakd zdis45poUBAP19AAF7zRYKnQB/LeOp6h9L+uZltOO9tHYmJTWCU3crSfYlnfKh3ZBCbI rBZA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=XRdrVhMl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id j66si23367616pgc.224.2019.05.28.16.24.54; Tue, 28 May 2019 16:25:10 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=XRdrVhMl; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727144AbfE1XQy (ORCPT + 99 others); Tue, 28 May 2019 19:16:54 -0400 Received: from mail-ot1-f65.google.com ([209.85.210.65]:42086 "EHLO mail-ot1-f65.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726610AbfE1XQx (ORCPT ); Tue, 28 May 2019 19:16:53 -0400 Received: by mail-ot1-f65.google.com with SMTP id i2so135732otr.9 for ; Tue, 28 May 2019 16:16:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=bXl2E+8Ofltdc9I18xpTcw4iLvgqqVUYnnHUQU/GAq8=; b=XRdrVhMlsJz9dk8XMaJIhK71SNkpeN9/QvL1IFjjOGQL/+2erDzADKEgV2Ewf7N9BF cBCn7ujn+0ZSADdUtnYM6bYVxPorxwZmenlFR6nyf9F/usl1rHTU5jsCrmubxLZEOeF7 uRjdac8tx++6S7JxCvvB/zFu9U6mVc3a/709+2v6vE/NBRNH2hn7NoBGlSOHT+Q/oVOM k5Cx0qcxzwFTZr1V8SKiSavDEM/adYJcvqzB44X2L5HuR5ZNGPugOczgP0sw2Cfrx284 dR12tmV/bA+xqSC+omGlF6HiNYF/bLaYcY4gsz2/oJYnVJIp1XXgxJZqzBh0SHcI77fo 4PsQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=bXl2E+8Ofltdc9I18xpTcw4iLvgqqVUYnnHUQU/GAq8=; b=FKabbDJx+TN+TDx0DgYBUf59Cit3DS2xy0BO0XxL24hUbHPATXTMm+pPFgc4dZqLvv n3+EIju8QBahwprV5kcwxwVJ85+bCeKTcJeiumqslIcq1L8mT2Kk89bAhH4q9SvEJdxk hjWmv93f+T+hLISlGGrUdDomCl+EVgK/lltXelmeDdiv+8QT7hcrH4GUFIoMNyO+bIK7 PTDuSqt61W5SP41q8Xqi7C1Tu/kOttyC1eiXmzmAD8Zh17+8K1MrZZm+yivHFUeXXbum uObYQVNlRCpYlX8WhfzKWLE/Fk6vBSl6sIBMClqRPX41Q3iVdjzXVIVSgxrRE9TqqFP+ z0Yw== X-Gm-Message-State: APjAAAWVA179rO3r01kSneDEs9X1Hmywnn0zujtSuXi0EtMqCFQDKmYQ DzaJU7hNbxheh2vUyFsnHtqZRdbcWiu2qa3pt4d25A== X-Received: by 2002:a9d:6254:: with SMTP id i20mr27881970otk.180.1559085412912; Tue, 28 May 2019 16:16:52 -0700 (PDT) MIME-Version: 1.0 References: <155905930702.7587.7100265859075976147.stgit@warthog.procyon.org.uk> <155905931502.7587.11705449537368497489.stgit@warthog.procyon.org.uk> <11466.1559082515@warthog.procyon.org.uk> In-Reply-To: <11466.1559082515@warthog.procyon.org.uk> From: Jann Horn Date: Wed, 29 May 2019 01:16:26 +0200 Message-ID: Subject: Re: [PATCH 1/7] General notification queue with user mmap()'able ring buffer To: David Howells Cc: Al Viro , raven@themaw.net, linux-fsdevel , Linux API , linux-block@vger.kernel.org, keyrings@vger.kernel.org, linux-security-module , kernel list Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, May 29, 2019 at 12:28 AM David Howells wrote: > Jann Horn wrote: > > I don't see you setting any special properties on the VMA that would > > prevent userspace from extending its size via mremap() - no > > VM_DONTEXPAND or VM_PFNMAP. So I think you might get an out-of-bounds > > access here? > > Should I just set VM_DONTEXPAND in watch_queue_mmap()? Like so: > > vma->vm_flags |= VM_DONTEXPAND; Yeah, that should work. > > > +static void watch_queue_map_pages(struct vm_fault *vmf, > > > + pgoff_t start_pgoff, pgoff_t end_pgoff) > > ... > > > > Same as above. > > Same solution as above? Or do I need ot check start/end_pgoff too? Same solution. > > > +static int watch_queue_mmap(struct file *file, struct vm_area_struct *vma) > > > +{ > > > + struct watch_queue *wqueue = file->private_data; > > > + > > > + if (vma->vm_pgoff != 0 || > > > + vma->vm_end - vma->vm_start > wqueue->nr_pages * PAGE_SIZE || > > > + !(pgprot_val(vma->vm_page_prot) & pgprot_val(PAGE_SHARED))) > > > + return -EINVAL; > > > > This thing should probably have locking against concurrent > > watch_queue_set_size()? > > Yeah. > > static int watch_queue_mmap(struct file *file, > struct vm_area_struct *vma) > { > struct watch_queue *wqueue = file->private_data; > struct inode *inode = file_inode(file); > u8 nr_pages; > > inode_lock(inode); > nr_pages = wqueue->nr_pages; > inode_unlock(inode); > > if (nr_pages == 0 || > ... > return -EINVAL; Looks reasonable. > > > + smp_store_release(&buf->meta.head, len); > > > > Why is this an smp_store_release()? The entire buffer should not be visible to > > userspace before this setup is complete, right? > > Yes - if I put the above locking in the mmap handler. OTOH, it's a one-off > barrier... > > > > + if (wqueue->buffer) > > > + return -EBUSY; > > > > The preceding check occurs without any locks held and therefore has no > > reliable effect. It should probably be moved below the > > inode_lock(...). > > Yeah, it can race. I'll move it into watch_queue_set_size(). Sounds good. > > > +static void free_watch(struct rcu_head *rcu) > > > +{ > > > + struct watch *watch = container_of(rcu, struct watch, rcu); > > > + > > > + put_watch_queue(rcu_access_pointer(watch->queue)); > > > > This should be rcu_dereference_protected(..., 1). > > That shouldn't be necessary. rcu_access_pointer()'s description says: > > * It is also permissible to use rcu_access_pointer() when read-side > * access to the pointer was removed at least one grace period ago, as > * is the case in the context of the RCU callback that is freeing up > * the data, ... > > It's in an rcu callback function, so accessing the __rcu pointers in the RCU'd > struct should be fine with rcu_access_pointer(). Aah, whoops, you're right, I missed that paragraph in the documentation of rcu_access_pointer(). > > > + /* We don't need the watch list lock for the next bit as RCU is > > > + * protecting everything from being deallocated. > > > > Does "everything" mean "the wqueue" or more than that? > > Actually, just 'wqueue' and its buffer. 'watch' is held by us once we've > dequeued it as we now own the ref 'wlist' had on it. 'wlist' and 'wq' must be > pinned by the caller. > > > > + if (release) { > > > + if (wlist->release_watch) { > > > + rcu_read_unlock(); > > > + /* This might need to call dput(), so > > > + * we have to drop all the locks. > > > + */ > > > + wlist->release_watch(wlist, watch); > > > > How are you holding a reference to `wlist` here? You got the reference through > > rcu_dereference(), you've dropped the RCU read lock, and I don't see anything > > that stabilizes the reference. > > The watch record must hold a ref on the watched object if the watch_list has a > ->release_watch() method. In the code snippet above, the watch record now > belongs to us because we unlinked it under the wlist->lock some lines prior. Ah, of course. > However, you raise a good point, and I think the thing to do is to cache > ->release_watch from it and not pass wlist into (*release_watch)(). We don't > need to concern ourselves with cleaning up *wlist as it will be cleaned up > when the target object is removed. > > Keyrings don't have a ->release_watch method and neither does the block-layer > notification stuff. > > > > + if (wqueue->pages && wqueue->pages[0]) > > > + WARN_ON(page_ref_count(wqueue->pages[0]) != 1); > > > > Is there a reason why there couldn't still be references to the pages > > from get_user_pages()/get_user_pages_fast()? > > I'm not sure. I'm not sure what to do if there are. What do you suggest? I would use put_page() instead of manually freeing it; I think that should be enough? I'm not entirely sure though. > > > + n->info &= (WATCH_INFO_LENGTH | WATCH_INFO_TYPE_FLAGS | WATCH_INFO_ID); > > > > Should the non-atomic modification of n->info > > n's an unpublished copy of some userspace data that's local to the function > instance. There shouldn't be any way to race with it at this point. Ah, derp. Yes.