Received: by 10.223.164.221 with SMTP id h29csp1980752wrb; Thu, 2 Nov 2017 04:07:21 -0700 (PDT) X-Google-Smtp-Source: ABhQp+SZcz0F3h+ix+tFNIoa0punISu8nVQA8m/f83krR2bo+yNtiR9N2h+OUUukX1n3b5MQiILd X-Received: by 10.159.244.14 with SMTP id x14mr2762768plr.75.1509620841175; Thu, 02 Nov 2017 04:07:21 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1509620841; cv=none; d=google.com; s=arc-20160816; b=rN6/IKH7VAjHeB3Bwo11xziv3BPELBFR/2QuFfe+AlxYOSGSxz0NYrgrRIUsu8oEhi 24ijyQov/qhTy/UjJwhJe+UPEkE1LpkPlu1KUE0KDKyxWmOoB9XFNtzZdDMfnRvODgrw GbhbZWWdJ0fiI86E5wq/nBDkTGnQB4ZEADDi4k6fEkMdBQXx3aaKFv92QbJbYTRGWxCU 9jdUJnCXD+f+cMj9LVOsaixu5kqGR+2LmPRbtpYCzBqoRFsoNUQ8RJxRBMaa+aruPRhC yV6ky9V+yrJmxXyqK2+wxrN9/HUcq7nBNyrJmVagqEwZ9JSArdJQgjxAzrR4vzaOZyZ3 xHxg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:content-transfer-encoding :dlp-reaction:dlp-version:dlp-product:content-language :accept-language:in-reply-to:references:message-id:date:thread-index :thread-topic:subject:cc:to:from:arc-authentication-results; bh=EWFKD8I8U5efY4w6jcWhEI653jF7Aco1Pnc5bnBtqtw=; b=WbH1who/v9KzgiR7HO+t+I0D1QglIpQZxjreMGIFm4bvEebLSJ0oaTPIpQFYJznVLu d4vQo3QXxb1VxI05TBMUug9ECZ9TvisLEEoUvfsa56W3ywMVlrDp640+4Ph/bTp9F58Y 6wmZuBT9S1sbOqL0RnWa8crfjLkujPWJ3r47SBI/4WgdiVA+hV9juHDLjsjjibhtNk1n POWRFEgYSWyoutyZPCtkIk65ef2jrQs01Jj4uEe226168XXgVHXoyShrcLtqNSQwxQpB sKggiaIwVs0X87Zte8DDXnI5xd0e3TpR2qoMVBUMfzPuq2vFyleCJkfQ1CZr6vgKlPwb 375g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id v68si3634415pfj.359.2017.11.02.04.07.06; Thu, 02 Nov 2017 04:07:21 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755538AbdKBLFA convert rfc822-to-8bit (ORCPT + 99 others); Thu, 2 Nov 2017 07:05:00 -0400 Received: from mga09.intel.com ([134.134.136.24]:44064 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755371AbdKBLE6 (ORCPT ); Thu, 2 Nov 2017 07:04:58 -0400 Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by orsmga102.jf.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 02 Nov 2017 04:04:57 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.44,333,1505804400"; d="scan'208";a="331027707" Received: from irsmsx154.ger.corp.intel.com ([163.33.192.96]) by fmsmga004.fm.intel.com with ESMTP; 02 Nov 2017 04:04:53 -0700 Received: from irsmsx102.ger.corp.intel.com ([169.254.2.180]) by IRSMSX154.ger.corp.intel.com ([169.254.12.83]) with mapi id 14.03.0319.002; Thu, 2 Nov 2017 11:04:53 +0000 From: "Reshetova, Elena" To: Peter Zijlstra CC: "linux-kernel@vger.kernel.org" , "gregkh@linuxfoundation.org" , "keescook@chromium.org" , "tglx@linutronix.de" , "mingo@redhat.com" , "ishkamiel@gmail.com" , Will Deacon Subject: RE: [PATCH] refcount: provide same memory ordering guarantees as in atomic_t Thread-Topic: [PATCH] refcount: provide same memory ordering guarantees as in atomic_t Thread-Index: AQHTS+92qOdaqfxtzUSi29Fjx7+W86LxWH8AgAXrypCAAGnUAIAGNEwQ Date: Thu, 2 Nov 2017 11:04:53 +0000 Message-ID: <2236FBA76BA1254E88B949DDB74E612B802B89B8@IRSMSX102.ger.corp.intel.com> References: <1508756984-18980-1-git-send-email-elena.reshetova@intel.com> <20171023131224.GC3165@worktop.lehotels.local> <2236FBA76BA1254E88B949DDB74E612B802B6A2E@IRSMSX102.ger.corp.intel.com> <20171027135624.GY3165@worktop.lehotels.local> In-Reply-To: <20171027135624.GY3165@worktop.lehotels.local> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: dlp-product: dlpe-windows dlp-version: 11.0.0.116 dlp-reaction: no-action x-ctpclassification: CTP_IC x-titus-metadata-40: eyJDYXRlZ29yeUxhYmVscyI6IiIsIk1ldGFkYXRhIjp7Im5zIjoiaHR0cDpcL1wvd3d3LnRpdHVzLmNvbVwvbnNcL0ludGVsMyIsImlkIjoiMzYyMTQ2MzktYzE0Ni00MDQxLThhZGItZGUwNDA3ZGVlNjk5IiwicHJvcHMiOlt7Im4iOiJDVFBDbGFzc2lmaWNhdGlvbiIsInZhbHMiOlt7InZhbHVlIjoiQ1RQX0lDIn1dfV19LCJTdWJqZWN0TGFiZWxzIjpbXSwiVE1DVmVyc2lvbiI6IjE3LjIuNS4xOCIsIlRydXN0ZWRMYWJlbEhhc2giOiJSdWxZUWxXY0JVeEFkaFJcL2NuTnN3aVpqUDRVdWo1ZzNQSHNMNzFXbTVBUXdTa2RZVzFBcktsT0VGMU5SR2d4UiJ9 x-originating-ip: [163.33.239.181] Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 8BIT MIME-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Sorry for delayed reply, but I was actually reading and trying to understand all the involved notions, so it took a while... > On Fri, Oct 27, 2017 at 06:49:55AM +0000, Reshetova, Elena wrote: > > Could we possibly have a bit more elaborate discussion on this? > > > > Or alternatively, what then should be the correct way for a certain > > variable (that behaves like a standard refcounter) to check if the > > relaxed memory ordering is ok? > > The refcount_t includes sufficient ordering for normal reference > counting. It provides a happens-before relation wrt. dropping the > reference, such that all object accesses happen before we drop the > reference; because at that point the object can go away and any further > access would be UAF. > > So if the thing being replaced is purely a reference count, all should > be well. Yes, I understand this, but people are asking a totally different level of detail and inside analysis and given that there are changes, they have a point. So, that's why I am trying to figure the "whole story" below. > > > This is what Thomas was asking me to do for core kernel conversions > > and this is what I don't know how to do correctly. > > Thomas asked you to verify that nothing relies on the stronger ordering > provided by atomic_dec_and_test(). This means you have to audit the code > with an eye towards memory ordering. > > Now the futex one is actually OK. Thomas just wants you to mention that > refcount_dec_and_test is weaker and explain that in this case that is > fine since pi_state::refcount is in fact a pure refcount and > put_pi_state() does not in fact rely on further ordering. > > Just mentioning the above gives the maintainer: > > 1) the information he needs to perform review, mentioning memory > ordering changes means he needs to look at it. > > 2) you identify the place where it changes (put_pi_state) which again > aids him in review by limiting the amount of code he needs to double > check. > > 3) that you actually looked at the problem; giving more trust you > actually know wth you're doing. Well refcount_dec_and_test() is not the only function that has different memory ordering specifics. So, the full answer then for any arbitrary case according to your points above would be smth like: for each substituted function (atomic_* --> refcount_*) that actually has changes in memory ordering *** perform the following: - mention the difference - mention the actual code place where the change would affect ( various put and etc. functions) - potentially try to understand if it would make a difference for this code (again here is where I am not sure how to do it properly) *** the actual list of these functions to me looks like: 1) atomic_inc_not_zero -> refcount_inc_not_zero. Change is from underlying atomic_cmpxchg() to atomic_try_cmpxchg_relaxed () First one implies SMP-conditional general memory barrier (smp_mb()) on each side of the actual operation (at least according to documentation, In reality it goes through so many changes, ifdefs and conditionals that one gets lost Easily in the process). Second one according to the comment implies no memory barriers whatsoever, BUT "provides a control dependency which will order future stores against the inc" So, every store (not load) operation (I guess we are talking here only about store operations that touch the same object, but I wonder how it is defined in terms of memory location? (overlapping?) that comes inside "if refcount_inc_not_zero(){}" cause would only be executed if functions returns true. So, practically what we might "loose" here is any updates on the object protected by this refcounter done by another CPU. But for this purpose we expect the developer to take some other lock/memory barrier into use, right? We only care of incrementing the refcount atomically and make sure we don't do anything with object unless it is ready for us to be used. If yes, then I guess it might be a big change for the code that previously relied on atomic-provided smp_mb() barriers and now instead needs to take an explicit locks/barriers by itself. 2) atomic_** -> refcount_add_not_zero. Fortunately these are super rare and need to see per each case dependent on actual atomic function substituted. 3) atomic_add() --> refcount_add(). This should not make any change since both do not provide memory ordering at all, but there is a comment in the refcount.c that says that refcount_add " provide a control dependency and thereby orders future stores". How is it done given that refcount_add is void returning function?? I am fully confused with this one. 4) atomic_dec () --> refcount_dec (). This one we have discussed extensively before. Again first one implies SMP-conditional general memory barrier (smp_mb()) on each side of the actual operation. Second one only provides "release" ordering meaning that prior both loads and stores must be completed before the operation. So, is it correct to express the difference in this way: atomic_dec (): refcount_dec (): smp_mb(); smp_mb(); do_atomic_dec_operation; do_atomic_dec_operation; smp_mb(); /*note no any barrier here! */ / * we assume that we hold the lock * or anything like this on object * anyway, otherwise code is broken * to start with */ 5) atomic_dec_and_test () --> refcount_dec_and_test (). Same as 4) but in addition we have a control flow barrier. So, is it correct to express the difference in this way: atomic_dec_and_test (): refcount_dec_and_test (): smp_mb(); smp_mb(); if (do_atomic_dec_and_test;) { if (do_atomic_dec_and_test;){ /* 1 --> 0 transition happened */ /* 1 --> 0 transition happened */ smp_mb(); /* in refcounter case we assume do_anything including free(); that we are the last user of object so that no concurrent access can happen*/ /* control_flow_barrier here */ <------ which one btw???? /* only need to guarantee that we free after reaching zero so we are good here */ } } 6) atomic_sub_and_test () --> refcount_sub_and_test (). Identical to 5) 7) atomic_add_unless(&var, -1, 1) --> refcount_dec_not_one(). Should be identical to 4) Lock functions such as refcount_dec_and_lock() & refcount_dec_and_mutex_lock() Provide exactly the same guarantees as they atomic counterparts. Is the above a correct view on this? I think that instead of putting this whole info in the individual commits, how about making some documentation subsection (in memory-barriers or atomics or on its own, don't know what is better) with examples like above? Then in each individual commit I can point to the respective line in the doc and say that here is a particular memory guarantee change and in the case of this particular code it would apply to your put_pi_state() or whatever else function. Otherwise keeping this info only in commits do not benefit any future users of refcount. > > > Also, I got exactly the same question from xfs maintainer, > > so if we provide an API that we hope will be used correctly in the > > future, we should have a way for people to verify it. > > I actually replied to Dave (but have not checked if there's further > email on the thread). I object to the claim that the code is correct if > he cannot explain the memory ordering. > > Aside from that; if/when he explain the ordering requirements of those > sites and those do indeed require more than refcount_dec_and_test() > provides we should add a comment exactly explaining the ordering along > with additional barriers, such as smp_mb__after_atomic(). > > > Maybe it is just me, but I would think that having a way to verify > > that your code is ok with this refcounter-specific relaxed memory > > ordering applies to any memory ordering requirements, refcounters are > > just a subset with certain properties, so is then the full answer is > > to figure out how to verify any memory ordering requirement of your > > code? > > Yes; and this can be quite tricky, but undocumented ordering > requirements are a bug that need to be fixed. Mostly the code is > 'simple' and refcount really does the right and sufficient thing. > > If the maintainer knows or suspects more ordering is required he can > help out; but for that to happen you have to, at the very least, do 1&2 > above. As I said, I want to do more if we are going this way. That's why I am trying to create some kind of type of easy to read doc that would explain people the differences and won't necessary imply reading the whole memory-barrier.txt doc (more on my reading of it below). > > > We can also document this logic in more docs or even maybe try to > > create a better documentation for current memory ordering bits since > > it is not the most easiest read at the moment. Otherwise this might be > > just bad enough reason for many people to avoid refcount_t type if it > > is just easier to tell "let's take just old atomic, we knew it somehow > > worked before".... > > That's just voodoo programming; and while that is rife I have absolutely > no sympathy for it. Memory ordering is hard, but that doesn't mean we > should just blindly sprinkle memory barriers around until it 'works'. > > So while memory-barriers.txt is a longish document, it is readable with > a bit of time and effort. There are also tools being developed that can > help people validate their code. Could you please point to the tools? Might help my understanding also. So, I actually went and tried to __slowly__ read the memory-barriers.txt which indeed makes a lot of sense when you read it in full (not partially like it did it before). But the issue with this doc is not so much the length but not always consequent way of giving the information to unprepared reader. I have many examples that I have wondered for a while what is meant with certain term or notion somewhere at the beginning of the doc only to discover a fully specified answer somewhere after line 1800. I do have many suggestions that might help to improve the readability of the document but it would take me at least a day to write them down (and it is going to be a long read for anyone), so before I go do it, I want to know if anyone is interested to hear :) Anyway this would be a separate exercise to do if people want. Best Regards, Elena. > > There are enough people around that actually get this stuff (it really > is not _that_ hard, but it does require a functional brain) so if you > can describe the problem with sufficient detail we can help out getting > the ordering right (or state its broken :-). From 1582419429900775538@xxx Fri Oct 27 13:59:28 +0000 2017 X-GM-THRID: 1582046402032606032 X-Gmail-Labels: Inbox,Category Forums