Received: by 2002:ac0:a594:0:0:0:0:0 with SMTP id m20-v6csp1834401imm; Tue, 22 May 2018 10:03:38 -0700 (PDT) X-Google-Smtp-Source: AB8JxZrzTDvtZuW3BnCwTYZ1VrCepxgQ2lwunL9Z4WEx248mx9gq9b02hTj24DisekiDC4+Wcu98 X-Received: by 2002:a17:902:5327:: with SMTP id b36-v6mr26161238pli.316.1527008618188; Tue, 22 May 2018 10:03:38 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1527008618; cv=none; d=google.com; s=arc-20160816; b=skKRchKIi50ONjMhOLwWlGr1L8cULdCt4uaWs4Pr0AryjLSfIHU1toO4AyyhHYoEfZ 8rLMlKRubb52PkfDxybVocvOlsU22rh5Bf2sA+YFakfp2qKhWRxc8oWjrqL/KFv/0q1A oV6UIiGldTYwL1u7r3oWD3Uq8JLpEwPG/7+aSbgyfpmKsw29npd9J2m/T4RiRap7qASM ojPD7sNZM7ZNFMs2ZToMwyTo+3xUpfYEvIDO0KTd48idmvqozm4MZhoDD1jHCb2QAHcA KPM9BmB7XEaqeyr8E8nPWhzVuCfE+igxk0DDzld2AtceEsUOm5fQ6q6iDl8CvQwuMSxR oQOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:user-agent:in-reply-to :content-disposition:mime-version:references:reply-to:subject:cc:to :from:date:arc-authentication-results; bh=Lzv650XshZJbnO+irVATn3kNl1bIxjXxdGbZLQ+BPQo=; b=dk2jQG4W5IeqCFIHuhUePp8RkH5uWSEn3cKKXhbomNzWup4yzH9XOokC25sX37n6Ev s2U2/Z6YyzQLzeTmeTL8yLVzJiv3ygnvdMn1ofiEjxqA/vjNunmv2SHwA63GrSHkRdKH l6rwXL02//9JMOCsdviuuXL41urMPUgJFw1L62/wP66qL80YhVa6F0bnTMPpdtMmHVRR C9V17YNfi9QersJ+vXoyBSbZkqZzUkYzfWoL9jEXUFlbbZaDLki8bmzAv9HJqc1yno87 HZuVNlNLgan4VS3YNDC9KgpBqq1dC7uALzYYRBdxSs6N1b6QLaxOK421mAENUI3nl6y0 mBSA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id s20-v6si10442010pgo.462.2018.05.22.10.03.19; Tue, 22 May 2018 10:03:38 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751461AbeEVRC6 (ORCPT + 99 others); Tue, 22 May 2018 13:02:58 -0400 Received: from mx0a-001b2d01.pphosted.com ([148.163.156.1]:43838 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751229AbeEVRCz (ORCPT ); Tue, 22 May 2018 13:02:55 -0400 Received: from pps.filterd (m0098410.ppops.net [127.0.0.1]) by mx0a-001b2d01.pphosted.com (8.16.0.22/8.16.0.22) with SMTP id w4MGvVba073685 for ; Tue, 22 May 2018 13:02:55 -0400 Received: from e36.co.us.ibm.com (e36.co.us.ibm.com [32.97.110.154]) by mx0a-001b2d01.pphosted.com with ESMTP id 2j4q6e09fr-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Tue, 22 May 2018 13:02:54 -0400 Received: from localhost by e36.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Tue, 22 May 2018 11:02:53 -0600 Received: from b03cxnp07028.gho.boulder.ibm.com (9.17.130.15) by e36.co.us.ibm.com (192.168.1.136) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; Tue, 22 May 2018 11:02:49 -0600 Received: from b01ledav003.gho.pok.ibm.com (b01ledav003.gho.pok.ibm.com [9.57.199.108]) by b03cxnp07028.gho.boulder.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id w4MH2nK661210644; Tue, 22 May 2018 10:02:49 -0700 Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id EEAB3B2257; Tue, 22 May 2018 14:04:38 -0400 (EDT) Received: from b01ledav003.gho.pok.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 97271B2258; Tue, 22 May 2018 14:03:26 -0400 (EDT) Received: from paulmck-ThinkPad-W541 (unknown [9.70.82.108]) by b01ledav003.gho.pok.ibm.com (Postfix) with ESMTP; Tue, 22 May 2018 14:03:26 -0400 (EDT) Received: by paulmck-ThinkPad-W541 (Postfix, from userid 1000) id BEB0D16C0E7A; Tue, 22 May 2018 10:03:13 -0700 (PDT) Date: Tue, 22 May 2018 10:03:13 -0700 From: "Paul E. McKenney" To: Roman Penyaev Cc: linux-block , linux-rdma , Jens Axboe , Christoph Hellwig , Sagi Grimberg , Bart Van Assche , Or Gerlitz , Doug Ledford , Swapnil Ingle , Danil Kipnis , Jack Wang , Linux Kernel Mailing List Subject: Re: [PATCH v2 01/26] rculist: introduce list_next_or_null_rr_rcu() Reply-To: paulmck@linux.vnet.ibm.com References: <20180518130413.16997-1-roman.penyaev@profitbricks.com> <20180518130413.16997-2-roman.penyaev@profitbricks.com> <20180519163735.GX3803@linux.vnet.ibm.com> <20180520004318.GY3803@linux.vnet.ibm.com> <20180521153151.GE3803@linux.vnet.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) X-TM-AS-GCONF: 00 x-cbid: 18052217-0020-0000-0000-00000DFC6A24 X-IBM-SpamModules-Scores: X-IBM-SpamModules-Versions: BY=3.00009066; HX=3.00000241; KW=3.00000007; PH=3.00000004; SC=3.00000261; SDB=6.01036093; UDB=6.00530004; IPR=6.00815222; MB=3.00021242; MTD=3.00000008; XFM=3.00000015; UTC=2018-05-22 17:02:53 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 18052217-0021-0000-0000-00006180C337 Message-Id: <20180522170313.GW3803@linux.vnet.ibm.com> X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2018-05-22_05:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=0 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 impostorscore=0 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1709140000 definitions=main-1805220185 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 22, 2018 at 11:09:16AM +0200, Roman Penyaev wrote: > On Mon, May 21, 2018 at 5:31 PM, Paul E. McKenney > wrote: > > On Mon, May 21, 2018 at 03:50:10PM +0200, Roman Penyaev wrote: > >> On Sun, May 20, 2018 at 2:43 AM, Paul E. McKenney > >> wrote: > >> > On Sat, May 19, 2018 at 10:20:48PM +0200, Roman Penyaev wrote: > >> >> On Sat, May 19, 2018 at 6:37 PM, Paul E. McKenney > >> >> wrote: > >> >> > On Fri, May 18, 2018 at 03:03:48PM +0200, Roman Pen wrote: > >> >> >> Function is going to be used in transport over RDMA module > >> >> >> in subsequent patches. > >> >> >> > >> >> >> Function returns next element in round-robin fashion, > >> >> >> i.e. head will be skipped. NULL will be returned if list > >> >> >> is observed as empty. > >> >> >> > >> >> >> Signed-off-by: Roman Pen > >> >> >> Cc: Paul E. McKenney > >> >> >> Cc: linux-kernel@vger.kernel.org > >> >> >> --- > >> >> >> include/linux/rculist.h | 19 +++++++++++++++++++ > >> >> >> 1 file changed, 19 insertions(+) > >> >> >> > >> >> >> diff --git a/include/linux/rculist.h b/include/linux/rculist.h > >> >> >> index 127f534fec94..b0840d5ab25a 100644 > >> >> >> --- a/include/linux/rculist.h > >> >> >> +++ b/include/linux/rculist.h > >> >> >> @@ -339,6 +339,25 @@ static inline void list_splice_tail_init_rcu(struct list_head *list, > >> >> >> }) > >> >> >> > >> >> >> /** > >> >> >> + * list_next_or_null_rr_rcu - get next list element in round-robin fashion. > >> >> >> + * @head: the head for the list. > >> >> >> + * @ptr: the list head to take the next element from. > >> >> >> + * @type: the type of the struct this is embedded in. > >> >> >> + * @memb: the name of the list_head within the struct. > >> >> >> + * > >> >> >> + * Next element returned in round-robin fashion, i.e. head will be skipped, > >> >> >> + * but if list is observed as empty, NULL will be returned. > >> >> >> + * > >> >> >> + * This primitive may safely run concurrently with the _rcu list-mutation > >> >> >> + * primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock(). > >> >> > > >> >> > Of course, all the set of list_next_or_null_rr_rcu() invocations that > >> >> > are round-robining a given list must all be under the same RCU read-side > >> >> > critical section. For example, the following will break badly: > >> >> > > >> >> > struct foo *take_rr_step(struct list_head *head, struct foo *ptr) > >> >> > { > >> >> > struct foo *ret; > >> >> > > >> >> > rcu_read_lock(); > >> >> > ret = list_next_or_null_rr_rcu(head, ptr, struct foo, foolist); > >> >> > rcu_read_unlock(); /* BUG */ > >> >> > return ret; > >> >> > } > >> >> > > >> >> > You need a big fat comment stating this, at the very least. The resulting > >> >> > bug can be very hard to trigger and even harder to debug. > >> >> > > >> >> > And yes, I know that the same restriction applies to list_next_rcu() > >> >> > and friends. The difference is that if you try to invoke those in an > >> >> > infinite loop, you will be rapped on the knuckles as soon as you hit > >> >> > the list header. Without that knuckle-rapping, RCU CPU stall warnings > >> >> > might tempt people to do something broken like take_rr_step() above. > >> >> > >> >> Hi Paul, > >> >> > >> >> I need -rr behaviour for doing IO load-balancing when I choose next RDMA > >> >> connection from the list in order to send a request, i.e. my code is > >> >> something like the following: > >> >> > >> >> static struct conn *get_and_set_next_conn(void) > >> >> { > >> >> struct conn *conn; > >> >> > >> >> conn = rcu_dereferece(rcu_conn); > >> >> if (unlikely(!conn)) > >> >> return conn; > >> > > >> > Wait. Don't you need to restart from the beginning of the list in > >> > this case? Or does the list never have anything added to it and is > >> > rcu_conn initially the first element in the list? > >> > >> Hi Paul, > >> > >> No, I continue from the pointer, which I assigned on the previous IO > >> in order to send IO fairly and keep load balanced. > >> > >> Initially @rcu_conn points to the first element, but elements can > >> be deleted from the list and list can become empty. > >> > >> The deletion code is below. > >> > >> > > >> >> conn = list_next_or_null_rr_rcu(&conn_list, > >> >> &conn->entry, > >> >> typeof(*conn), > >> >> entry); > >> >> rcu_assign_pointer(rcu_conn, conn); > >> > > >> > Linus is correct to doubt this code. You assign a pointer to the current > >> > element to rcu_conn, which is presumably a per-CPU or global variable. > >> > So far, so good ... > >> > >> I use per-CPU, in the first example I did not show that not to overcomplicate > >> the code. > >> > >> > > >> >> return conn; > >> >> } > >> >> > >> >> rcu_read_lock(); > >> >> conn = get_and_set_next_conn(); > >> >> if (unlikely(!conn)) { > >> >> /* ... */ > >> >> } > >> >> err = rdma_io(conn, request); > >> >> rcu_read_unlock(); > >> > > >> > ... except that some other CPU might well remove the entry referenced by > >> > rcu_conn at this point. It would have to wait for a grace period (e.g., > >> > synchronize_rcu()), but the current CPU has exited its RCU read-side > >> > critical section, and therefore is not blocking the grace period. > >> > Therefore, by the time get_and_set_next_conn() picks up rcu_conn, it > >> > might well be referencing the freelist, or, even worse, some other type > >> > of structure. > >> > > >> > What is your code doing to prevent this from happening? (There are ways, > >> > but I want to know what you were doing in this case.) > >> > >> Probably I should have shown the way of removal at the very beginning, > >> my fault. So deletion looks as the following (a bit changed and > >> simplified for the sake of clearness): > > > > Thank you! Let's see... > > > >> static void remove_connection(conn) > >> { > >> bool need_to_wait = false; > >> int cpu; > >> > >> /* Do not let RCU list add/delete happen in parallel */ > >> mutex_lock(&conn_lock); > >> > >> list_del_rcu(&conn->entry); > >> > >> /* Make sure everybody observes element removal */ > >> synchronize_rcu(); > > > > At this point, any reader who saw the element in the list is done, as you > > comment in fact says. But there might be a pointer to that element in the > > per-CPU variables, however, from this point forward it cannot be the case > > that one of the per-CPU variables gets set to the newly deleted element. > > Which is your next block of code... > > > >> /* > >> * At this point nobody sees @conn in the list, but > >> still we have > >> * dangling pointer @rcu_conn which _can_ point to @conn. Since > >> * nobody can observe @conn in the list, we guarantee > >> that IO path > >> * will not assign @conn to @rcu_conn, i.e. @rcu_conn > >> can be equal > >> * to @conn, but can never again become @conn. > >> */ > >> > >> /* > >> * Get @next connection from current @conn which is going to be > >> * removed. > >> */ > >> next = list_next_or_null_rr_rcu(&conn_list, &conn->entry, > >> typeof(*next), entry); > >> > >> /* > >> * Here @rcu_conn can be changed by reader side, so use @cmpxchg > >> * in order to keep fairness in load-balancing and do not touch > >> * the pointer which can be already changed by the IO path. > >> * > >> * Current path can be faster than IO path and the > >> following race > >> * exists: > >> * > >> * CPU0 CPU1 > >> * ---- ---- > >> * conn = rcu_dereferece(rcu_conn); > >> * next = list_next_or_null_rr_rcu(conn) > >> * > >> * conn == > >> cmpxchg(rcu_conn, conn, next); > >> * synchronize_rcu(); > >> * > >> * rcu_assign_pointer(rcu_conn, next); > >> * ^^^^^^^^^^^^^^^^^^ > >> * > >> * Here @rcu_conn is already equal to @next (done by > >> @cmpxchg), > >> * so assignment to the same pointer is harmless. > >> * > >> */ > >> for_each_possible_cpu(cpu) { > >> struct conn **rcu_conn; > >> > >> rcu_conn = per_cpu_ptr(pcpu_rcu_conn, cpu); > >> if (*rcu_conn != conn) > >> /* > >> * This @cpu will never again pick up @conn, > >> * so it is safe just to choose next CPU. > >> */ > >> continue; > > > > ... Someone else might have picked up rcu_conn at this point... > > > >> if (conn == cmpxchg(rcu_conn, conn, next)) > >> /* > >> * @rcu_conn was successfully replaced > >> with @next, > >> * that means that someone can also hold a @conn > >> * and dereferencing it, so wait for a > >> grace period > >> * is required. > >> */ > >> need_to_wait = true; > > > > ... But if there was any possibility of that, need_to_wait is true, and it > > still cannot be the case that a reader finds the newly deleted element > > in the list, so they cannot find that element, so the pcpu_rcu_conn > > variables cannot be set to it. > > > >> } > >> if (need_to_wait) > >> synchronize_rcu(); > > > > And at this point, the reader that might have picked up rcu_conn > > just before the cmpxchg must have completed. (Good show, by the way! > > Many people miss the fact that they need this second synchronize_rcu().) > > > > Hmmm... What happens if this was the last element in the list, and > > the relevant pcpu_rcu_conn variable references that newly removed > > element? Taking a look at list_next_or_null_rcu() and thus at > > list_next_or_null_rcu(), and it does appear that you get NULL in that > > case, as is right and good. > > Thanks for explicit comments. What I always lack is a good description. > Indeed it is worth to mention that @next can become NULL if that was the > last element, will add comments then. Very good. And for the record, I agree with Linus that this needs to be private. I am very glad that you got it right (or appear to have), and there might come a time when it needs to be generally available, but that time is certainly not now and might well never come. > >> mutex_unlock(&conn_lock); > >> > >> kfree(conn); > >> } > >> > >> > >> > > >> >> i.e. usage of the @next pointer is under an RCU critical section. > >> >> > >> >> > Is it possible to instead do some sort of list_for_each_entry_rcu()-like > >> >> > macro that makes it more obvious that the whole thing need to be under > >> >> > a single RCU read-side critical section? Such a macro would of course be > >> >> > an infinite loop if the list never went empty, so presumably there would > >> >> > be a break or return statement in there somewhere. > >> >> > >> >> The difference is that I do not need a loop, I take the @next conn pointer, > >> >> save it for the following IO request and do IO for current IO request. > >> >> > >> >> It seems list_for_each_entry_rcu()-like with immediate "break" in the body > >> >> of the loop does not look nice, I personally do not like it, i.e.: > >> >> > >> >> > >> >> static struct conn *get_and_set_next_conn(void) > >> >> { > >> >> struct conn *conn; > >> >> > >> >> conn = rcu_dereferece(rcu_conn); > >> >> if (unlikely(!conn)) > >> >> return conn; > >> >> list_for_each_entry_rr_rcu(conn, &conn_list, > >> >> entry) { > >> >> break; > >> >> } > >> >> rcu_assign_pointer(rcu_conn, conn); > >> >> return conn; > >> >> } > >> >> > >> >> > >> >> or maybe I did not fully get your idea? > >> > > >> > That would not help at all because you are still leaking the pointer out > >> > of the RCU read-side critical section. That is completely and utterly > >> > broken unless you are somehow cleaning up rcu_conn when you remove > >> > the element. And getting that cleanup right is -extremely- tricky. > >> > Unless you have some sort of proof of correctness, you will get a NACK > >> > from me. > >> > >> I understand all the consequences of the leaking pointer, and of course > >> wrapped loop with RCU lock/unlock is simpler, but in order to keep > >> load-balancing and IO fairness avoiding any locks on IO path I've come > >> up with these RCU tricks and list_next_or_null_rr_rcu() macro. > > > > At first glance, it appears that you have handled this correctly. But I > > can make mistakes just as easily as the next guy, so what have you done > > to validate your algorithm? > > What we only have is a set of unit-tests which run every night. For this > particular case there is a special stress test which adds/removes RDMA > connections in a loop while IO is performing. Unfortunately I did not > write any synthetic test-case just for testing this isolated algorithm. > (e.g. module with only these RCU functions can be created and list > modification can be easily simulated. Should not be very much difficult) > Do you think it is worth to do? Unfortunately it also does not prove > correctness, like Linus said it is fragile as hell, but for sure I can > burn CPUs testing it for couple of days. Yes, this definitely deserves some -serious- focused stress testing. And formal verification, if that can be arranged. The Linux-kernel memory model does handle RCU, and also minimal linked lists, so might be worth a try. Unfortunately, I don't know of any other tooling that handles RCU. :-( > >> > More like this: > >> > > >> > list_for_each_entry_rr_rcu(conn, &conn_list, entry) { > >> > do_something_with(conn); > >> > if (done_for_now()) > >> > break; > >> > } > >> > > >> >> >> + */ > >> >> >> +#define list_next_or_null_rr_rcu(head, ptr, type, memb) \ > >> >> >> +({ \ > >> >> >> + list_next_or_null_rcu(head, ptr, type, memb) ?: \ > >> >> >> + list_next_or_null_rcu(head, READ_ONCE((ptr)->next), type, memb); \ > >> >> > > >> >> > Are there any uses for this outside of RDMA? If not, I am with Linus. > >> >> > Define this within RDMA, where a smaller number of people can more > >> >> > easily be kept aware of the restrictions on use. If it turns out to be > >> >> > more generally useful, we can take a look at exactly what makes sense > >> >> > more globally. > >> >> > >> >> The only one list_for_each_entry_rcu()-like macro I am aware of is used in > >> >> block/blk-mq-sched.c, is called list_for_each_entry_rcu_rr(): > >> >> > >> >> https://elixir.bootlin.com/linux/v4.17-rc5/source/block/blk-mq-sched.c#L370 > >> >> > >> >> Does it make sense to implement generic list_next_or_null_rr_rcu() reusing > >> >> my list_next_or_null_rr_rcu() variant? > >> > > >> > Let's start with the basics: It absolutely does not make sense to leak > >> > pointers across rcu_read_unlock() unless you have arranged something else > >> > to protect the pointed-to data in the meantime. There are a number of ways > >> > of implementing this protection. Again, what protection are you using? > >> > > >> > Your code at the above URL looks plausible to me at first glance: You > >> > do rcu_read_lock(), a loop with list_for_each_entry_rcu_rr(), then > >> > rcu_read_unlock(). But at second glance, it looks like htcx->queue > >> > might have the same vulnerability as rcu_conn in your earlier code. > >> > >> I am not the author of the code at the URL I specified. I provided the > >> link answering the question to show other possible users of round-robin > >> semantics for RCU list traversal. In my 'list_next_or_null_rr_rcu()' > >> case I can't use a loop, I leak the pointer and indeed have to be very > >> careful. But perhaps we can come up with some generic solution to cover > >> both cases: -rr loop and -rr next. > > > > Ah. Could you please check their update-side code to make sure that it > > looks correct to you? > > BTW authors of this particular -rr loop are in CC, but they keep silence > so far :) According to my shallow understanding @queue can't disappear > from the list on this calling path, i.e. existence of a @queue should be > guaranteed by the fact that the queue has no IO in-flights and only then > it can be removed. But something has to enforce that, right? For example, how does the code know that there are no longer any I/Os in flight, and how does it know that more I/Os won't be issued in the near future, possibly based on old state? But yes, the authors are free to respond. Thanx, Paul