From: Roman Penyaev
Date: Mon, 21 May 2018 15:50:10 +0200
Subject: Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu()
To: "Paul E. McKenney"
Cc: linux-block, linux-rdma, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
    Bart Van Assche, Or Gerlitz, Doug Ledford, Swapnil Ingle, Danil Kipnis,
    Jack Wang, Linux Kernel Mailing List
In-Reply-To: <20180520004318.GY3803@linux.vnet.ibm.com>
References: <20180518130413.16997-1-roman.penyaev@profitbricks.com>
 <20180518130413.16997-2-roman.penyaev@profitbricks.com>
 <20180519163735.GX3803@linux.vnet.ibm.com>
 <20180520004318.GY3803@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney wrote:
> On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote:
>> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney wrote:
>> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote:
>> >> Function is going to be used in transport over RDMA module
>> >> in subsequent patches.
>> >>
>> >> Function returns next element in round-robin fashion,
>> >> i.e. head will be skipped.  NULL will be returned if list
>> >> is observed as empty.
>> >>
>> >> Signed-off-by: Roman Pen
>> >> Cc: Paul E. McKenney
>> >> Cc: linux-kernel@vger.kernel.org
>> >> ---
>> >>  include/linux/rculist.h | 19 +++++++++++++++++++
>> >>  1 file changed, 19 insertions(+)
>> >>
>> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> >> index 127f534fec94..b0840d5ab25a 100644
>> >> --- a/include/linux/rculist.h
>> >> +++ b/include/linux/rculist.h
>> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list,
>> >>  })
>> >>
>> >>  /**
>> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion.
>> >> + * @head:  the head for the list.
>> >> + * @ptr:   the list head to take the next element from.
>> >> + * @type:  the type of the struct this is embedded in.
>> >> + * @memb:  the name of the list_head within the struct.
>> >> + *
>> >> + * Next element returned in round-robin fashion, i.e. head will be skipped,
>> >> + * but if list is observed as empty, NULL will be returned.
>> >> + *
>> >> + * This primitive may safely run concurrently with the _rcu list-mutation
>> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
>> >
>> > Of course, all the set of list_next_or_null_rr_rcu() invocations that
>> > are round-robining a given list must all be under the same RCU read-side
>> > critical section.  For example, the following will break badly:
>> >
>> >    struct foo *take_rr_step(struct list_head *head, struct foo *ptr)
>> >    {
>> >            struct foo *ret;
>> >
>> >            rcu_read_lock();
>> >            ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist);
>> >            rcu_read_unlock();  /* BUG */
>> >            return ret;
>> >    }
>> >
>> > You need a big fat comment stating this, at the very least.  The resulting
>> > bug can be very hard to trigger and even harder to debug.
>> >
>> > And yes, I know that the same restriction applies to list_next_rcu()
>> > and friends.  The difference is that if you try to invoke those in an
>> > infinite loop, you will be rapped on the knuckles as soon as you hit
>> > the list header.  Without that knuckle-rapping, RCU CPU stall warnings
>> > might tempt people to do something broken like take_rr_step() above.
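For illustration, here is a minimal sketch of the safe pattern Paul
describes: the whole round-robin walk stays inside a single RCU read-side
critical section, so no pointer leaks past rcu_read_unlock().  Note that
struct foo, do_work() and done_for_now() are placeholder names for this
sketch, not code from the patch set:

   #include <linux/rculist.h>

   struct foo {
           struct list_head foolist;
           /* payload */
   };

   static void do_work(struct foo *f);   /* placeholder */
   static bool done_for_now(void);       /* placeholder */

   static void process_list_rr(struct list_head *head)
   {
           struct foo *pos;

           rcu_read_lock();
           pos = list_first_or_null_rcu(head, struct foo, foolist);
           while (pos) {
                   do_work(pos);
                   if (done_for_now())
                           break;
                   /*
                    * Round-robin step: the list head is skipped, and
                    * NULL is returned once the list is observed as empty.
                    */
                   pos = list_next_or_null_rr_rcu(head, &pos->foolist,
                                                  struct foo, foolist);
           }
           rcu_read_unlock();  /* pos must not be used past this point */
   }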
>>
>> Hi Paul,
>>
>> I need -rr behaviour for doing IO load-balancing when I choose next RDMA
>> connection from the list in order to send a request, i.e. my code is
>> something like the following:
>>
>>    static struct conn *get_and_set_next_conn(void)
>>    {
>>            struct conn *conn;
>>
>>            conn = rcu_dereference(rcu_conn);
>>            if (unlikely(!conn))
>>                    return conn;
>
> Wait.  Don't you need to restart from the beginning of the list in
> this case?  Or does the list never have anything added to it and is
> rcu_conn initially the first element in the list?

Hi Paul,

No, I continue from the pointer which I assigned on the previous IO,
in order to send IO fairly and keep the load balanced.

Initially @rcu_conn points to the first element, but elements can be
deleted from the list and the list can become empty.  The deletion
code is below.

>>            conn = list_next_or_null_rr_rcu(&conn_list,
>>                                            &conn->entry,
>>                                            typeof(*conn),
>>                                            entry);
>>            rcu_assign_pointer(rcu_conn, conn);
>
> Linus is correct to doubt this code.  You assign a pointer to the current
> element to rcu_conn, which is presumably a per-CPU or global variable.
> So far, so good ...

I use a per-CPU variable; in the first example I did not show that, so
as not to overcomplicate the code.

>>            return conn;
>>    }
>>
>>    rcu_read_lock();
>>    conn = get_and_set_next_conn();
>>    if (unlikely(!conn)) {
>>            /* ... */
>>    }
>>    err = rdma_io(conn, request);
>>    rcu_read_unlock();
>
> ... except that some other CPU might well remove the entry referenced by
> rcu_conn at this point.  It would have to wait for a grace period (e.g.,
> synchronize_rcu()), but the current CPU has exited its RCU read-side
> critical section, and therefore is not blocking the grace period.
> Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it
> might well be referencing the freelist, or, even worse, some other type
> of structure.
>
> What is your code doing to prevent this from happening?  (There are ways,
> but I want to know what you were doing in this case.)

Probably I should have shown the way of removal at the very beginning;
my fault.  Deletion looks like the following (a bit changed and
simplified for the sake of clarity):

   static void remove_connection(struct conn *conn)
   {
           struct conn *next;
           bool need_to_wait = false;
           int cpu;

           /* Do not let RCU list add/delete happen in parallel */
           mutex_lock(&conn_lock);
           list_del_rcu(&conn->entry);

           /* Make sure everybody observes element removal */
           synchronize_rcu();

           /*
            * At this point nobody sees @conn in the list, but we still
            * have a dangling pointer @rcu_conn which _can_ point to
            * @conn.  Since nobody can observe @conn in the list, we
            * guarantee that the IO path will not assign @conn to
            * @rcu_conn, i.e. @rcu_conn can be equal to @conn, but can
            * never again become @conn.
            */

           /*
            * Get @next connection from the current @conn, which is going
            * to be removed.
            */
           next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
                                           typeof(*next), entry);

           /*
            * Here @rcu_conn can be changed by the reader side, so use
            * cmpxchg() in order to keep fairness in load-balancing and
            * not to touch a pointer which may already have been changed
            * by the IO path.
            *
            * The current path can be faster than the IO path, and the
            * following race exists:
            *
            *   CPU0                         CPU1
            *   ----                         ----
            *   conn = rcu_dereference(rcu_conn);
            *   next = list_next_or_null_rr_rcu(conn)
            *
            *                                conn == cmpxchg(rcu_conn, conn, next);
            *                                synchronize_rcu();
            *
            *   rcu_assign_pointer(rcu_conn, next);
            *   ^^^^^^^^^^^^^^^^^^
            *
            * Here @rcu_conn is already equal to @next (done by cmpxchg()),
            * so the assignment of the same pointer is harmless.
            */
           for_each_possible_cpu(cpu) {
                   struct conn **rcu_conn;

                   rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu);
                   if (*rcu_conn != conn)
                           /*
                            * This @cpu will never again pick up @conn,
                            * so it is safe just to choose the next CPU.
                            */
                           continue;

                   if (conn == cmpxchg(rcu_conn, conn, next))
                           /*
                            * @rcu_conn was successfully replaced with
                            * @next; that means someone may still hold
                            * @conn and be dereferencing it, so waiting
                            * for a grace period is required.
                            */
                           need_to_wait = true;
           }

           if (need_to_wait)
                   synchronize_rcu();

           mutex_unlock(&conn_lock);
           kfree(conn);
   }
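To make the per-CPU variant concrete, here is a minimal sketch of what
the reader side might look like.  It is an illustration under stated
assumptions, not the actual patch code: it assumes @pcpu_rcu_conn is the
per-CPU variable referenced by remove_connection() above, allocated with
alloc_percpu(struct conn *), and that the caller holds rcu_read_lock(),
as in the usage example earlier in this mail:

   static struct conn * __percpu *pcpu_rcu_conn; /* = alloc_percpu(struct conn *) */

   /* Caller holds rcu_read_lock(). */
   static struct conn *get_and_set_next_conn(void)
   {
           struct conn **rcu_conn = this_cpu_ptr(pcpu_rcu_conn);
           struct conn *conn, *next;

           conn = rcu_dereference(*rcu_conn);
           if (unlikely(!conn))
                   return NULL;

           /*
            * Round-robin step: the list head is skipped, NULL if the
            * list is observed as empty.
            */
           next = list_next_or_null_rr_rcu(&conn_list, &conn->entry,
                                           typeof(*conn), entry);

           /*
            * Publish the new position for the next IO on this CPU.  If
            * this store races with the cmpxchg() in remove_connection(),
            * both sides write the same @next value (see the race diagram
            * above), so the plain assignment is harmless.
            */
           rcu_assign_pointer(*rcu_conn, next);
           return conn;
   }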
>> i.e. usage of the @next pointer is under an RCU critical section.
>>
>> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like
>> > macro that makes it more obvious that the whole thing needs to be under
>> > a single RCU read-side critical section?  Such a macro would of course be
>> > an infinite loop if the list never went empty, so presumably there would
>> > be a break or return statement in there somewhere.
>>
>> The difference is that I do not need a loop; I take the @next conn pointer,
>> save it for the following IO request and do IO for the current IO request.
>>
>> It seems a list_for_each_entry_rcu()-like macro with an immediate "break"
>> in the body of the loop does not look nice; I personally do not like it,
>> i.e.:
>>
>>    static struct conn *get_and_set_next_conn(void)
>>    {
>>            struct conn *conn;
>>
>>            conn = rcu_dereference(rcu_conn);
>>            if (unlikely(!conn))
>>                    return conn;
>>            list_for_each_entry_rr_rcu(conn, &conn_list,
>>                                       entry) {
>>                    break;
>>            }
>>            rcu_assign_pointer(rcu_conn, conn);
>>            return conn;
>>    }
>>
>> or maybe I did not fully get your idea?
>
> That would not help at all because you are still leaking the pointer out
> of the RCU read-side critical section.  That is completely and utterly
> broken unless you are somehow cleaning up rcu_conn when you remove
> the element.  And getting that cleanup right is -extremely- tricky.
> Unless you have some sort of proof of correctness, you will get a NACK
> from me.

I understand all the consequences of the leaking pointer, and of course
a wrapped loop with RCU lock/unlock is simpler, but in order to keep
load-balancing and IO fairness while avoiding any locks on the IO path
I came up with these RCU tricks and the list_next_or_null_rr_rcu()
macro.

> More like this:
>
>    list_for_each_entry_rr_rcu(conn, &conn_list, entry) {
>            do_something_with(conn);
>            if (done_for_now())
>                    break;
>    }
>
>> >> + */
>> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \
>> >> +({ \
>> >> +	list_next_or_null_rcu(head, ptr, type, memb) ?: \
>> >> +	list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \
>> >
>> > Are there any uses for this outside of RDMA?  If not, I am with Linus.
>> > Define this within RDMA, where a smaller number of people can more
>> > easily be kept aware of the restrictions on use.  If it turns out to be
>> > more generally useful, we can take a look at exactly what makes sense
>> > more globally.
>>
>> The only list_for_each_entry_rcu()-like macro I am aware of is used in
>> block/blk-mq-sched.c and is called list_for_each_entry_rcu_rr():
>>
>> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370
>>
>> Does it make sense to implement a generic list_next_or_null_rr_rcu(),
>> reusing my list_next_or_null_rr_rcu() variant?
>
> Let's start with the basics: It absolutely does not make sense to leak
> pointers across rcu_read_unlock() unless you have arranged something else
> to protect the pointed-to data in the meantime.  There are a number of
> ways of implementing this protection.  Again, what protection are you
> using?
>
> Your code at the above URL looks plausible to me at first glance: You
> do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then
> rcu_read_unlock().  But at second glance, it looks like hctx->queue
> might have the same vulnerability as rcu_conn in your earlier code.

I am not the author of the code at that URL; I provided the link in
answer to your question, to show other possible users of round-robin
semantics for RCU list traversal.

In my list_next_or_null_rr_rcu() case I can't use a loop: I leak the
pointer and indeed have to be very careful.  But perhaps we can come up
with some generic solution to cover both cases: an -rr loop and an -rr
next.

--
Roman