From: John Ogness
To: Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Petr Mladek, Sergey Senozhatsky,
 Steven Rostedt, Linus Torvalds, Greg Kroah-Hartman, Andrea Parri,
 Thomas Gleixner, Sergey Senozhatsky
Subject: Re: [RFC PATCH v2 1/2] printk-rb: add a new printk ringbuffer implementation
References: <20190607162349.18199-1-john.ogness@linutronix.de>
 <20190607162349.18199-2-john.ogness@linutronix.de>
 <20190618114747.GQ3436@hirez.programming.kicks-ass.net>
 <87k1df28x4.fsf@linutronix.de>
 <20190626224034.GK2490@worktop.programming.kicks-ass.net>
Date: Fri, 28 Jun 2019 11:50:33 +0200
Message-ID: <87mui2ujh2.fsf@linutronix.de>

On 2019-06-27, Peter Zijlstra wrote:
>> This is a long response, but we are
>> getting into some fine details about the memory barriers (as well as
>> battling my communication skill level).
>
> So I'm going to reply piecewise to this... so not such long emails,
> but more of them.

I am not sure if I should wait to respond until you have finished going
through the full patch. I will respond to this email, and if you would
like me to wait on further responses, just let me know. Thanks.

>>>> +static void add_descr_list(struct prb_reserved_entry *e)
>>>> +{
>>>> +	struct printk_ringbuffer *rb = e->rb;
>>>> +	struct prb_list *l = &rb->descr_list;
>>>> +	struct prb_descr *d = e->descr;
>>>> +	struct prb_descr *newest_d;
>>>> +	unsigned long newest_id;
>>>> +
>>>> +	/* set as newest */
>>>> +	do {
>>>> +		/* MB5: synchronize add descr */
>>>> +		newest_id = smp_load_acquire(&l->newest);
>>>> +		newest_d = TO_DESCR(rb, newest_id);
>>>> +
>>>> +		if (newest_id == EOL)
>>>> +			WRITE_ONCE(d->seq, 1);
>>>> +		else
>>>> +			WRITE_ONCE(d->seq, READ_ONCE(newest_d->seq) + 1);
>>>> +		/*
>>>> +		 * MB5: synchronize add descr
>>>> +		 *
>>>> +		 * In particular: next written before cmpxchg
>>>> +		 */
>>>> +	} while (cmpxchg_release(&l->newest, newest_id, e->id) != newest_id);
>>>
>>> What does this pair with? I find ->newest usage in:
>>
>> It is pairing with the smp_load_acquire() at the beginning of this
>> loop (also labeled MB5) that is running simultaneously on another
>> CPU. I am avoiding a possible situation where a new descriptor is
>> added but the store of "next" from the previous descriptor is not yet
>> visible, and thus the following cmpxchg would fail, which is not
>> allowed. (Note that "next" is set to EOL shortly before this function
>> is called.)
>>
>> The litmus test for this is:
>>
>> P0(int *newest, int *d_next)
>> {
>> 	// set descr->next to EOL (terminates list)
>> 	WRITE_ONCE(*d_next, 1);
>>
>> 	// set descr as newest
>> 	smp_store_release(*newest, 1);
>> }
>>
>> P1(int *newest, int *d_next)
>> {
>> 	int local_newest;
>> 	int local_next;
>>
>> 	// get newest descriptor
>> 	local_newest = smp_load_acquire(*newest);
>>
>> 	// a new descriptor is set as the newest
>> 	// (not relevant here)
>>
>> 	// read descr->next of previous newest
>> 	// (must be EOL!)
>> 	local_next = READ_ONCE(*d_next);
>> }
>>
>> exists (1:local_newest=1 /\ 1:local_next=0)
>
> I'm having trouble connecting your P1's READ_ONCE() to the actual
> code. You say that is in the same function, but I cannot find a LOAD
> there that would care about the ACQUIRE.

P1's READ_ONCE() is the READ part of the cmpxchg a few lines below:

    WARN_ON_ONCE(cmpxchg_release(&newest_d->next, EOL, e->id) != EOL);

Note that the cmpxchg is a _release because of MB6 (a different memory
barrier pair). But only the READ part of that cmpxchg synchronizes with
MB5. Also note that cmpxchg is used only because of bug checking. If
instead it becomes a blind store (such as you suggest below), then it
changes to smp_store_release().

While investigating this (and the lack of a LOAD), I realized that the
smp_load_acquire() is not needed because @seq is dependent on the load
of @newest. I have implemented and tested these changes. I also added
setting the list terminator to this function, since all callers would
have to do it anyway.

Also, I spent a lot of time trying to put in comments that I think are
_understandable_ and _acceptable_. @Peter: I expect they are way too
long for you. @Andrea: Is this starting to become something that you
would like to see?

/**
 * add_descr_list() - Add a descriptor to the descriptor list.
 *
 * @e: An entry that has already reserved data.
 *
 * The provided entry contains a pointer to a descriptor that has already
 * been reserved for this entry. However, the reserved descriptor is not
 * yet on the list. Add this descriptor as the newest item.
 *
 * A descriptor is added in two steps. The first step is to make this
 * descriptor the newest. The second step is to update @next of the former
 * newest descriptor to point to this one (or set @oldest to this one if
 * this will be the first descriptor on the list).
 */
static void add_descr_list(struct prb_reserved_entry *e)
{
	struct printk_ringbuffer *rb = e->rb;
	struct prb_list *l = &rb->descr_list;
	struct prb_descr *d = e->descr;
	struct prb_descr *newest_d;
	unsigned long newest_id;

	WRITE_ONCE(d->next, EOL);

	do {
		newest_id = READ_ONCE(l->newest);
		newest_d = TO_DESC(rb, newest_id);

		if (newest_id == EOL) {
			WRITE_ONCE(d->seq, 1);
		} else {
			/*
			 * MB5-read: synchronize setting newest descr
			 *
			 * context-pair: 2 writers adding a descriptor via
			 * add_descr_list().
			 *
			 * @newest will load before @seq due to a data
			 * dependency, therefore, the stores of @seq
			 * and @next from the pairing MB5-write context
			 * will be visible.
			 *
			 * Although @next is not loaded by this context,
			 * this context must overwrite the stored @next
			 * value of the pairing MB5-write context.
			 */
			WRITE_ONCE(d->seq, READ_ONCE(newest_d->seq) + 1);
		}

		/*
		 * MB5-write: synchronize setting newest descr
		 *
		 * context-pair: 2 writers adding a descriptor via
		 * add_descr_list().
		 *
		 * Ensure that @next and @seq are stored before @d is
		 * visible via @newest. The pairing MB5-read context
		 * must load this @seq value and must overwrite this
		 * @next value.
		 */
	} while (cmpxchg_release(&l->newest, newest_id, e->id) != newest_id);

	if (unlikely(newest_id == EOL)) {
		/*
		 * MB0-write: synchronize adding first descr
		 *
		 * context-pair: 1 writer adding the first descriptor via
		 * add_descr_list(), 1 reader getting the beginning of
		 * the list via iter_peek_next_id().
		 *
		 * This context recently assigned new values for @id,
		 * @next, @seq. Ensure these are stored before the first
		 * store to @oldest so that the new values are visible
		 * to the reader in the pairing MB0-read context.
		 *
		 * Note: Before this store, the value of @oldest is EOL.
		 */
		smp_store_release(&l->oldest, e->id);
	} else {
		/*
		 * MB6-write: synchronize linking new descr
		 *
		 * context-pair-1: 1 writer adding a descriptor via
		 * add_descr_list(), 1 writer removing a descriptor via
		 * remove_oldest_descr().
		 *
		 * If this is a recycled descriptor, this context
		 * recently stored a new @oldest value. Ensure that
		 * @oldest is stored before storing @next so that
		 * if the pairing MB6-read context sees a non-EOL
		 * @next value, it is ensured that it will also see
		 * an updated @oldest value.
		 *
		 * context-pair-2: 1 writer adding a descriptor via
		 * add_descr_list(), 1 reader iterating the list via
		 * prb_iter_next_valid_entry().
		 *
		 * This context recently assigned new values for @id,
		 * @next, @seq, @data, @data_next. Ensure these are
		 * stored before storing @next of the previously
		 * newest descriptor so that the new values are
		 * visible to the iterating reader in the pairing
		 * MB6-read context.
		 *
		 * Note: Before this store, the value of @next of the
		 * previously newest descriptor is EOL.
		 */
		smp_store_release(&newest_d->next, e->id);
	}
}

The smp_rmb() calls in the reader functions are then commented and
marked with the appropriate MB0-read and MB6-read labels.

> Afaict prb_list is a list head, not a list node (calling it just _list
> is confusing at best).

OK.

> You have a single linked list going from the tail to the head, while
> adding to the head and removing from the tail. And that sounds like a
> FIFO queue:

Yes, but with one important feature: the nodes in the FIFO queue are
labeled with ordered sequence numbers. This is important for printk. I
talk more about this below.
> struct lqueue_head {
> 	struct lqueue_node *head, *tail;
> };
>
> struct lqueue_node {
> 	struct lqueue_node *next;
> };
>
> void lqueue_push(struct lqueue_head *h, struct lqueue_node *n)
> {
> 	struct lqueue_node *prev;
>
> 	n->next = NULL;

Is this safe? Do all compilers understand that @next must be stored
before the xchg() of @head? I would have chosen WRITE_ONCE().

> 	/*
> 	 * xchg() implies RELEASE; and thereby ensures @n is
> 	 * complete before getting published.
> 	 */
> 	prev = xchg(&h->head, n);

Unfortunately it is not that simple, because of sequence numbers. A
node must be assigned a sequence number that is +1 of the previous
node. This must be done before exchanging the head, because immediately
after the xchg() on the head, another CPU could add on to us and
expects our sequence number to already be set. This is why I need
cmpxchg() here.

> 	/*
> 	 * xchg() implies ACQUIRE; and thereby ensures @tail is
> 	 * written after @head, see lqueue_pop()'s smp_rmb().
> 	 */
> 	if (prev)
> 		WRITE_ONCE(prev->next, n);

This needs to be a store_release() so that a reader cannot read @n
while the store to @next is not yet visible. The memory barriers of the
above xchg() do not apply here because readers never read @head.

> 	else
> 		WRITE_ONCE(h->tail, n);

Ditto, but for the tail node in particular.

> }
>
> struct lqueue_node *lqueue_pop(struct lqueue_head *h)
> {
> 	struct lqueue_node *head, *tail, *next;
>
> 	do {
> 		tail = READ_ONCE(h->tail);
> 		/* If the list is empty, nothing to remove. */
> 		if (!tail)
> 			return NULL;
>
> 		/*
> 		 * If we see @tail, we must then also see @head.
> 		 * Pairs with the xchg() in lqueue_push(),
> 		 * ensure no false positive on the singleton
> 		 * test below.
> 		 */
> 		smp_rmb();
> 		head = READ_ONCE(h->head);
>
> 		/* If there is but one item; fail to remove. */
> 		if (head == tail)
> 			return NULL;
>
> 		next = smp_cond_load_relaxed(&tail->next, VAL);

What if a writer is adding a 2nd node to the queue and is interrupted
by an NMI directly after the xchg() in lqueue_push()? Then we have:

* head != tail
* tail->next == NULL

If that interrupting NMI calls lqueue_pop(), the NMI will spin forever.
The following cmpxchg() is not allowed to happen as long as tail->next
is NULL. This is why I synchronize on @next instead, using
(tail && !tail->next) for the singleton test.

> 	} while (cmpxchg(&h->tail, tail, next) != tail);
>
> 	return tail;
> }
>
> Now, you appear to be using desc_ids instead of pointers, but since
> you're not using the actual wrap value; I don't see the benefit of
> using those IDs over straight pointers.

The documentation mentions that descriptor IDs are used to identify
pointers to invalid descriptors. This is used by the readers; see
iter_peek_next_id() and prb_iter_next_valid_entry().

IDs are used for:

- @next of descriptors on the list
- @id, @id_next in the reader iterator
- @id in the data blocks

If changed to pointers, iterators would need to additionally store @seq
values to be able to identify whether the entry they are pointing to is
the entry they expect.

The only advantage I see with pointers is that the ringbuffer could be
more useful generally, independent of whether the data is separate or
within the nodes, or whether the nodes are statically or dynamically
allocated. That is something worth having, even if it is not printk
related. Are you implicitly requesting me to split the prb_ringbuffer
and instead base it on a new "lockless multi-writer multi-reader
sequenced FIFO queue" data structure?

> That is, unless I've overlooked some subtle ABA issue, but then, your
> code doesn't seem to mention that, and I think we're good because if
> we re-use an entry, it can never get back in the same location, since
> we never allow an empty list

I do not understand what you mean here. If a reader has a pointer to an
entry, the entry behind that pointer can certainly change. But that is
not a problem. The reader will recognize that.

> (might also be fixable, haven't thought too hard on this).
:-)

> That said, the above has cmpxchg() vs WRITE_ONCE() and is therefore
> not safe on a number of our architectures. We can either not care
> about performance and use xchg() for the ->tail store, or use
> atomic_long_t and suffer ugly casting.

cmpxchg_release() vs WRITE_ONCE() is not safe?! Can you point me to
documentation about this?

> But the above is, IMO, a more useful and readable abstraction. Let me
> continue in another email (probably tomorrow).

Thank you for taking the time for this.

John Ogness