Date: Fri, 23 Feb 2018 18:31:58 +0100
From: Andrea Parri
To: Nikolay Borisov
Cc: LKML, "Paul E. McKenney", mathieu.desnoyers@efficios.com, Peter Zijlstra
Subject: Re: Reasoning about memory ordering
Message-ID: <20180223173158.GA3723@andrea>
In-Reply-To: <0db16ef6-c805-b1f6-527f-8fec149e3df5@suse.com>

On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote:
> Hello,
>
> I'm cc'ing a bunch of people I know are well-versed in
> the black arts of memory ordering!
>
> Currently in btrfs we have roughly the following sequence:
>
> T1:
>
>   i_size_write(inode, newsize);
>   set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>   smp_mb();
>
>   if (atomic_read(&inode->i_dio_count)) {
>           wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP);
>           DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP);
>
>           do {
>                   prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE);
>                   if (atomic_read(&inode->i_dio_count))
>                           schedule();
>           } while (atomic_read(&inode->i_dio_count));
>           finish_wait(wq, &q.wq_entry);
>   }
>
>   smp_mb__before_atomic();
>   clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags);
>
> T2:
>
>   atomic_inc(&inode->i_dio_count);
>   if (iov_iter_rw(iter) == READ) {
>           if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) {
>                   if (atomic_dec_and_test(&inode->i_dio_count))
>                           wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
>           }
>           if (offset >= i_size_read(inode))
>                   return;
>   }
>
> The semantics I'm after are:
>
> 1. If T1 goes to sleep, then T2 would see the
> BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the
> atomic_dec_and_test and possibly wake up T1. This flag serves as a way
> to indicate to possibly multiple T2 (dio readers) that T1 is blocked
> and they should unblock it and resort to acquiring some locks (this is not
> visible in this excerpt of code for brevity). It's sort of a back-off
> mechanism.

I don't see how this could be guaranteed, even in a sequentially consistent
world (disclaimer: I'm certainly not familiar with btrfs): what is wrong with
the following interleaving?

        T1                                      T2

                                                atomic_inc(i_dio_count)
                                                test_bit(NEED_LOCK, flags) // unset
        set_bit(NEED_LOCK, flags)
        atomic_read(i_dio_count) // >1 --> go to sleep

Thanks,
  Andrea

>
> 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep
>
> 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has
> been woken up.
>
> 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see
> the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check
> for i_size should cause T2 to just return (it will also execute atomic_dec_and_test)
>
> Given this is the current state of the code (it's part of btrfs) I believe
> the following could/should be done:
>
> 1. The smp_mb after the set_bit in T1 could be removed, since there is
> already an implied full mb in prepare_to_wait. That is if we go to sleep,
> then T2 is guaranteed to see the flag/i_size_write happening by merit of
> the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't
> go to sleep? I still would like the i_size_write to be visible to T2
>
> 2. The bit clearing code in T1 could be replaced by
> clear_bit_unlock (this was suggested by PeterZ on IRC).
>
> 3. I suspect there is a memory barrier in T2 that is missing. Perhaps
> there should be an smp_mb__before_atomic right before the test_bit so that
> it's ordered with the implied smp_mb in T1's prepare_to_wait.
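
The interleaving raised in the reply can be checked mechanically. The sketch
below is a toy model in plain userspace C, not btrfs or kernel code: it assumes
sequential consistency, keeps only the four operations from the example
(set_bit and atomic_read in T1, atomic_inc and test_bit in T2), and ignores the
wakeup, the decrement and i_size. The names `struct result`, `run` and
`count_outcomes` are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

struct result {
    bool t1_sleeps;   /* T1 read a non-zero i_dio_count */
    bool t2_saw_flag; /* T2 found NEED_LOCK set */
};

/* Number of set bits in the low four bits of m. */
static int popcount4(unsigned m)
{
    return (m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1);
}

/*
 * Execute one interleaving of the four operations. If bit i of mask is
 * set, step i runs T1's next operation, otherwise T2's next operation.
 * Program order inside each thread is therefore always preserved.
 */
static struct result run(unsigned mask)
{
    int dio_count = 0;       /* models inode->i_dio_count */
    bool need_lock = false;  /* models BTRFS_INODE_READDIO_NEED_LOCK */
    int t1_pc = 0, t2_pc = 0;
    struct result r = { false, false };

    assert(popcount4(mask) == 2); /* two operations per thread */

    for (int step = 0; step < 4; step++) {
        if (mask & (1u << step)) {
            if (t1_pc++ == 0)
                need_lock = true;            /* T1: set_bit(NEED_LOCK) */
            else
                r.t1_sleeps = dio_count > 0; /* T1: atomic_read(i_dio_count) */
        } else {
            if (t2_pc++ == 0)
                dio_count++;                 /* T2: atomic_inc(i_dio_count) */
            else
                r.t2_saw_flag = need_lock;   /* T2: test_bit(NEED_LOCK) */
        }
    }
    return r;
}

/* Count the interleavings (out of six) that produce the given outcome. */
static int count_outcomes(bool t1_sleeps, bool t2_saw_flag)
{
    int n = 0;

    for (unsigned mask = 0; mask < 16; mask++) {
        if (popcount4(mask) != 2)
            continue;
        struct result r = run(mask);
        if (r.t1_sleeps == t1_sleeps && r.t2_saw_flag == t2_saw_flag)
            n++;
    }
    return n;
}
```

In this model, `count_outcomes(true, false)` counts the schedules in which T1
goes to sleep although T2 did not see the flag, which is the case described in
the reply; `count_outcomes(false, false)` counts the schedules in which each
thread misses the other's write. Whether the first case matters in btrfs
depends on code not shown in the excerpt, such as the point where T2 drops
i_dio_count on its normal path.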