Received: by 10.223.185.116 with SMTP id b49csp981439wrg; Fri, 23 Feb 2018 09:48:54 -0800 (PST) X-Google-Smtp-Source: AH8x224yJbu68ackCaTPW5fhMEIKAngisCqmiWp9atb2mppzkIQ9wJoqbyCFpgsAeqc+9wavZSJl X-Received: by 10.99.100.67 with SMTP id y64mr2075474pgb.145.1519408134690; Fri, 23 Feb 2018 09:48:54 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1519408134; cv=none; d=google.com; s=arc-20160816; b=Y92hD2jhFyH10F2rRzSIgGNYYEuzosjciT0QD+Aoysv560S1Wgb5vNd0qTt8Hic8IW PxUABr3Z7FnmcezzTb2vqVnPna3KHsRgGCSTHX0VAoVvpr0yidra6bvnu3nTT3z1ll0P TdnW7GPCROtHGU2fUaUoDOlFI9GCmzhgCqoi4TKx1ESwaD+KoLx++2jlsChc7C0BbPNm ATvUCZEaTlwf5chO9tdcsUNGrdWByCf/FeZ2xqArHWRBfHrejQucXXws/t0Lbsk+t4BA NNDWCH+IrvjDiHkNWZVnyedem0/DflET0HMCBjWxFw8MJeNvUHqr5ArfJDc812zebQVr Qu+A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:arc-authentication-results; bh=DLZl8HLNUg43MjAq3pyCPWVDAAXOlBjXRJKrrJCAEOs=; b=vMWKWMwjvQREdL5KtvGFAbd0qMwc7flSBPM0T5tuEfd5iMy33UM0+AXOjtiQKz26mI VfpdkNafujOOKA6AqGDiRf/lditCRCRwr+kU++pMBdhOrK/rRgPXafsapT1nFmohUbIZ Ta3lKj52HKwCxPK1kMJSpmAV6HhgU8tEEcaGGj6uw+zwSx00xMuCSOmywTuZv1uajO24 lXBzewc3EhnG66d2UuqaE4Ve7jebWvY/hY8ZkKZk8wabmwc+I8K5wXApv8XPMDDQvDe/ OHWCgbFL2tWHQV3uRGeAl2i9H17EhUqwsB0f3sQVaiMT0OWvHq8Zs3ubsJerUDqSSYvu gZSw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a14-v6si2068446plt.693.2018.02.23.09.48.39; Fri, 23 Feb 2018 09:48:54 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752119AbeBWRqG (ORCPT + 99 others); Fri, 23 Feb 2018 12:46:06 -0500 Received: from mx2.suse.de ([195.135.220.15]:59471 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751763AbeBWRp7 (ORCPT ); Fri, 23 Feb 2018 12:45:59 -0500 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id E6588ADB7; Fri, 23 Feb 2018 17:45:57 +0000 (UTC) Subject: Re: Reasoning about memory ordering To: Andrea Parri Cc: LKML , "Paul E. McKenney" , mathieu.desnoyers@efficios.com, Peter Zijlstra References: <0db16ef6-c805-b1f6-527f-8fec149e3df5@suse.com> <20180223173158.GA3723@andrea> From: Nikolay Borisov Message-ID: Date: Fri, 23 Feb 2018 19:45:55 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.0 MIME-Version: 1.0 In-Reply-To: <20180223173158.GA3723@andrea> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 23.02.2018 19:31, Andrea Parri wrote: > On Fri, Feb 23, 2018 at 02:30:22PM +0200, Nikolay Borisov wrote: >> Hello, >> >> I'm cc'ing a bunch of people I know are well-versed in >> the black arts of memory ordering! >> >> Currently in btrfs we have roughly the following sequence: >> >> T1: T2: >> i_size_write(inode, newsize); >> set_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); atomic_inc(&inode->i_dio_count); >> smp_mb(); if (iov_iter_rw(iter) == READ) { >> if (test_bit(BTRFS_INODE_READDIO_NEED_LOCK, &BTRFS_I(inode)->runtime_flags)) { >> if (atomic_read(&inode->i_dio_count)) { if (atomic_dec_and_test(&inode->i_dio_count)) >> wait_queue_head_t *wq = bit_waitqueue(&inode->i_state, __I_DIO_WAKEUP); wake_up_bit(&inode->i_state, __I_DIO_WAKEUP); >> DEFINE_WAIT_BIT(q, &inode->i_state, __I_DIO_WAKEUP); } >> if (offset >= i_size_read(inode)) >> do { return; >> prepare_to_wait(wq, &q.wq_entry, TASK_UNINTERRUPTIBLE); } >> if (atomic_read(&inode->i_dio_count)) >> schedule(); >> } while (atomic_read(&inode->i_dio_count)); >> finish_wait(wq, &q.wq_entry); >> } >> >> smp_mb__before_atomic(); >> clear_bit(BTRFS_INODE_READDIO_NEED_LOCK, &inode->runtime_flags); >> >> The semantics I'm after are: >> >> 1. If T1 goes to sleep, then T2 would see the >> BTRFS_INODE_READDIO_NEED_LOCK and hence will execute the >> atomic_dec_and_test and possibly wake up T1. This flag serves as a way >> to indicate to possibly multiple T2 (dio readers) that T1 is blocked >> and they should unblock it and resort to acquiring some locks (this is not >> visible in this excerpt of code for brevity). It's sort of a back-off >> mechanism. > > I don't see how this could be guaranteed, even in a sequentially consistent > world (disclaimer: I'm certainly not familiar with btrfs): what is wrong in > > T1 T2 > > atomic_inc(i_dio_count) > test_bit(NEED_LOCK, flags) // unset > set_bit(NEED_LOCK, flags) > atomic_read(i_dio_count) // >1 > --> go to sleep You are correct, so looking at btrfs_direct_IO again it seems this kind of execution is fine since we also do inode_dio_end (i.e. the atomic_dec_andtest/wake_up) sequence even outside of the test_bit() conditional. So I guess the first requirement is really unsatisfiable/not required. > > Thanks, > Andrea > > >> >> 2. BTRFS_INODE_READDIO_NEED_LOCK bit must be set _before_ going to sleep >> >> 3. BTRFS_INODE_READDIO_NEED_LOCK must be cleared _after_ the thread has >> been woken up. >> >> 4. After T1 is woken up, it's possible that a new T2 comes and doesn't see >> the BTRFS_INODE_READDIO_NEED_LOCK flag set but this is fine, since the check >> for i_size should cause T2 to just return (it will also execute atomic_dec_and_test) >> >> Given this is the current state of the code (it's part of btrfs) I believe >> the following could/should be done: >> >> 1. The smp_mb after the set_bit in T1 could be removed, since there is >> already an implied full mm in prepare_to_wait. That is if we go to sleep, >> then T2 is guaranteed to see the flag/i_size_write happening by merit of >> the implied memory barrier in prepare_to_wait/schedule. But what if it doesn't >> go to sleep? I still would like the i_size_write to be visible to T2 >> >> 2. The bit clearing code in T1 should be possible to be replaced by >> clear_bit_unlock (this was suggested by PeterZ on IRC). >> >> 3. I suspect there is a memory barrier in T2 that is missing. Perhaps >> there should be an smp_mb__before_atomic right before the test_bit so that >> it's ordered with the implied smp_mb in T1's prepare_to_wait. >