Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp6101791imm; Sat, 19 May 2018 17:42:40 -0700 (PDT) X-Google-Smtp-Source: AB8JxZrGr/KWJq1nk6JMahh6AsbN3T1/AlxBu32mEtGpgraNi/kxZaUs6SnE0WXLVM+qw5n/T8pQ X-Received: by 2002:a17:902:7c18:: with SMTP id x24-v6mr15446865pll.173.1526776960353; Sat, 19 May 2018 17:42:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1526776960; cv=none; d=google.com; s=arc-20160816; b=W6vQ0ZtO9PccQXeBpdS7xo9uiS5t5Lmm7VrxSdFTLOfk4uCo3wJjvpLnjogJe+E1fA 8mnsPDdg9+RxKz2bo20cBKQqkTELCstX/pZHTkbmdtZBXGgxohtYNujhCyECuRG+u6ZQ w5FKPt/Duf0hS8l1ymLzPUZggHfH5bwpeRxSPypSjsZn3jbsuM3RwN71gdy39ne6p7P6 /l5uXkcGuL4LteleY4p0rhfLb/Cwh77J2chPYt9CyxWD4a9tir43YqYcX9pgPECJug6c nDuOv2tSXIde81te4yC13YiIOckEDAwoEAMJ0XO4P8fJ6IbVbeUOcDWHHGEvTItc/xUY NfEw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:user-agent:in-reply-to :content-disposition:mime-version:references:reply-to:subject:cc:to :from:date:arc-authentication-results; bh=+U/dSh7xG+iVTC6LYlmxjvYogyKnCKMjkEBocqNGDWU=; b=DrHBplim4f+t4NlvDFdc8Trl2frRxNxgwPij6JhBFYpE8vso3dUVGX/UkSJYJGWDF3 n/LFMIza8/Cu8obN4gtRo3wjwwwyCG2lIqwt2aUSyzCU4xRZkKm6Y4rPBSIje4WXhUrQ IKADyfdfgKh6CFTribhr286+uQ+z5vxhYJ/Hbxoh6fTuxonaUvxrGfAU/kfGSHQ1b68V hk6Ys840IaBWCRrTAVdmEXLAi3LXKdWNtLENfU4rLm6jkkDI0jAtYWPIA03q5lmU732M m72ClYdg4JEIxj49mV5xzXUWNAggBDnY6n+LqP0v6T/T9v7Jj+n7VofSCBqfHbvkRRH6 j/AQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e9-v6si11038641plk.61.2018.05.19.17.42.01; Sat, 19 May 2018 17:42:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752518AbeETAlv (ORCPT + 99 others); Sat, 19 May 2018 20:41:51 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:58252 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1752199AbeETAlu (ORCPT ); Sat, 19 May 2018 20:41:50 -0400 Received: from pps.filterd (m0098421.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w4K0duDG115504 for ; Sat, 19 May 2018 20:41:49 -0400 Received: from e14.ny.us.ibm.com (e14.ny.us.ibm.com [129.33.205.204]) by mx0a-001b2d01.pphosted.com with ESMTP id 2j2fx2jq26-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Sat, 19 May 2018 20:41:49 -0400 Received: from localhost by e14.ny.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Sat, 19 May 2018 20:41:48 -0400 Received: from b01cxnp23032.gho.pok.ibm.com (9.57.198.27) by e14.ny.us.ibm.com (146.89.104.201) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Sat, 19 May 2018 20:41:44 -0400 Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com [9.57.199.108]) by b01cxnp23032.gho.pok.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w4K0fh4M35913976; Sun, 20 May 2018 00:41:43 GMT Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id F3917B2064; Sat, 19 May 2018 21:43:36 -0400 (EDT) Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id BEACBB205F; Sat, 19 May 2018 21:43:36 -0400 (EDT) Received: from paulmck-ThinkPad-W541 (unknown [9.85.187.109]) by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP; Sat, 19 May 2018 21:43:36 -0400 (EDT) Received: by paulmck-ThinkPad-W541 (Postfix, from userid 1000) id 8CCF616C17BA; Sat, 19 May 2018 17:43:18 -0700 (PDT) Date: Sat, 19 May 2018 17:43:18 -0700 From: "Paul E. McKenney" To: Roman Penyaev Cc: linux-block@vger.kernel.org, linux-rdma@vger.kernel.org, Jens Axboe , Christoph Hellwig , Sagi Grimberg , Bart Van Assche , Or Gerlitz , Doug Ledford , Swapnil Ingle , Danil Kipnis , Jack Wang , linux-kernel@vger.kernel.org Subject: Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Reply-To: paulmck@linux.vnet.ibm.com References: <20180518130413.16997-1-roman.penyaev@profitbricks.com> <20180518130413.16997-2-roman.penyaev@profitbricks.com> <20180519163735.GX3803@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 18052000-0052-0000-0000-000002F18EBC X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00009053; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000261; SDB=6.01034806; UDB=6.00529232; IPR=6.00813935; MB=3.00021205; MTD=3.00000008; XFM=3.00000015; UTC=2018-05-20 00:41:47 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18052000-0053-0000-0000-00005CBEF3ED Message-Id: <20180520004318.GY3803@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2018-05-19_15:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1805200005 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: > On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney > wrote: > > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > >> Function is going to be used in transport over RDMA module > >> in subsequent patches. > >> > >> Function returns next element in round-robin fashion, > >> i.e. head will be skipped. NULL will be returned if list > >> is observed as empty. > >> > >> Signed-off-by: Roman Pen > >> Cc: Paul E. McKenney > >> Cc: linux-kernel@vger.kernel.org > >> --- > >> include/linux/rculist.h | 19 +++++++++++++++++++ > >> 1 file changed, 19 insertions(+) > >> > >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h > >> index 127f534fec94..b0840d5ab25a 100644 > >> --- a/include/linux/rculist.h > >> +++ b/include/linux/rculist.h > >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > >> }) > >> > >> /** > >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > >> + * @head: the head for the list. > >> + * @ptr: the list head to take the next element from. > >> + * @type: the type of the struct this is embedded in. > >> + * @memb: the name of the list_head within the struct. > >> + * > >> + * Next element returned in round-robin fashion, i.e. head will be skipped, > >> + * but if list is observed as empty, NULL will be returned. > >> + * > >> + * This primitive may safely run concurrently with the _rcu list-mutation > >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > > > > Of course, all the set of list_next_or_null_rr_rcu() invocations that > > are round-robining a given list must all be under the same RCU read-side > > critical section. For example, the following will break badly: > > > > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > > { > > struct foo *ret; > > > > rcu_read_lock(); > > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > > rcu_read_unlock(); /* BUG */ > > return ret; > > } > > > > You need a big fat comment stating this, at the very least. The resulting > > bug can be very hard to trigger and even harder to debug. > > > > And yes, I know that the same restriction applies to list_next_rcu() > > and friends. The difference is that if you try to invoke those in an > > infinite loop, you will be rapped on the knuckles as soon as you hit > > the list header. Without that knuckle-rapping, RCU CPU stall warnings > > might tempt people to do something broken like take_rr_step() above. > > Hi Paul, > > I need -rr behaviour for doing IO load-balancing when I choose next RDMA > connection from the list in order to send a request, i.e. my code is > something like the following: > > static struct conn *get_and_set_next_conn(void) > { > struct conn *conn; > > conn = rcu_dereferece(rcu_conn); > if (unlikely(!conn)) > return conn; Wait. Don't you need to restart from the beginning of the list in this case? Or does the list never have anything added to it and is rcu_conn initially the first element in the list? > conn = list_next_or_null_rr_rcu(&conn_list, > &conn->entry, > typeof(*conn), > entry); > rcu_assign_pointer(rcu_conn, conn); Linus is correct to doubt this code. You assign a pointer to the current element to rcu_conn, which is presumably a per-CPU or global variable. So far, so good ... > return conn; > } > > rcu_read_lock(); > conn = get_and_set_next_conn(); > if (unlikely(!conn)) { > /* ... */ > } > err = rdma_io(conn, request); > rcu_read_unlock(); ... except that some other CPU might well remove the entry referenced by rcu_conn at this point. It would have to wait for a grace period (e.g., synchronize_rcu()), but the current CPU has exited its RCU read-side critical section, and therefore is not blocking the grace period. Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it might well be referencing the freelist, or, even worse, some other type of structure. What is your code doing to prevent this from happening? (There are ways, but I want to know what you were doing in this case.) > i.e. usage of the @next pointer is under an RCU critical section. > > > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > > macro that makes it more obvious that the whole thing need to be under > > a single RCU read-side critical section? Such a macro would of course be > > an infinite loop if the list never went empty, so presumably there would > > be a break or return statement in there somewhere. > > The difference is that I do not need a loop, I take the @next conn pointer, > save it for the following IO request and do IO for current IO request. > > It seems list_for_each_entry_rcu()-like with immediate "break" in the body > of the loop does not look nice, I personally do not like it, i.e.: > > > static struct conn *get_and_set_next_conn(void) > { > struct conn *conn; > > conn = rcu_dereferece(rcu_conn); > if (unlikely(!conn)) > return conn; > list_for_each_entry_rr_rcu(conn, &conn_list, > entry) { > break; > } > rcu_assign_pointer(rcu_conn, conn); > return conn; > } > > > or maybe I did not fully get your idea? That would not help at all because you are still leaking the pointer out of the RCU read-side critical section. That is completely and utterly broken unless you are somehow cleaning up rcu_conn when you remove the element. And getting that cleanup right is -extremely- tricky. Unless you have some sort of proof of correctness, you will get a NACK from me. More like this: list_for_each_entry_rr_rcu(conn, &conn_list, entry) { do_something_with(conn); if (done_for_now()) break; } > >> + */ > >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > >> +({ \ > >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > > > > Are there any uses for this outside of RDMA? If not, I am with Linus. > > Define this within RDMA, where a smaller number of people can more > > easily be kept aware of the restrictions on use. If it turns out to be > > more generally useful, we can take a look at exactly what makes sense > > more globally. > > The only one list_for_each_entry_rcu()-like macro I am aware of is used in > block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): > > https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > > Does it make sense to implement generic list_next_or_null_rr_rcu() reusing > my list_next_or_null_rr_rcu() variant? Let's start with the basics: It absolutely does not make sense to leak pointers across rcu_read_unlock() unless you have arranged something else to protect the pointed-to data in the meantime. There are a number of ways of implementing this protection. Again, what protection are you using? Your code at the above URL looks plausible to me at first glance: You do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then rcu_read_unlock(). But at second glance, it looks like htcx->queue might have the same vulnerability as rcu_conn in your earlier code. Thanx, Paul > > Even within RDMA, I strongly recommend the big fat comment called out above. > > And the list_for_each_entry_rcu()-like formulation, if that can be made to > > work within RDMA's code structure. > > > > Seem reasonable, or am I missing something here? > > Thanks for clear explanation. > > -- > Roman >