Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp2021138ybi; Mon, 1 Jul 2019 04:49:37 -0700 (PDT) X-Google-Smtp-Source: APXvYqxNg+3Aooq7VvyVNbQ2aDZpwFVXRh0hOTi2Xu+kgkfpRSIJlwdhUMXOPBpzQr4tvbyiTjSX X-Received: by 2002:a17:902:29a7:: with SMTP id h36mr29324694plb.158.1561981777562; Mon, 01 Jul 2019 04:49:37 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1561981777; cv=none; d=google.com; s=arc-20160816; b=tFR9FKRq72cFv4ciy0QJLPTuaJVu9MrJBsPCa/Krn3HA43QeRmJAvz8D4ufCFTNj78 KAP+JEn4Fh6Kloh4J69yntDqbz46If/Enz3iXYl++WNI9jAAN4BnkN+uDxdlh/s324Da z4QZ7qxpCScQLAOFTs+mVRD+0lxUo9RNQaNHuh0o7skbjotvS6/yHbNEt7ui9jEk37J2 ERTHncfikga8wcF3IvdmRfbJWYnKiIklJmdXM/Gi3naO8reC2ADY88Pm8TkkVMs/ETUv cu3gUy26wh9OVZNzkltPF9xsFOz+PP6qoAou2mJZKEzK5vSOSPTcYRYMQzsnWd4Pn+Fj aWlw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:message-id :in-reply-to:date:references:subject:cc:to:from; bh=1tkSIBV/cc78WXqPXi1YOVBlBMGCTgacMQni/E3Y43w=; b=LFDwoRJe79iyifmBg+FZQ7xvtuL7Nx2IFxMWg73vlF+XcDbd2dErCGoUGi0IRxJ/Kq RjeaH4hX/ywFkOj+kVxx4imMlXTWmVmS0QcxJqMo1TobwcFv5Y/96lsD21WRUr+CU++a Ei4wZgG+yHB1kdGhvZaLbG0fi/Un03jt2Sehpb+ZBcPkbWXJXqbKAqL03GJAvYqq5FrT dhCZspZFRZNzF8quOqyHBA6f5B7xhpIUxtV28sNz12CLTq7tCVyAME3R7G8qrNT/odm2 YRQer0LpbBEWYzBjA6wugInjLvnI8VgvORVwCNrDH0XdfA73gZ5r+169qBY2m7HxX3Wp CgLg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id r11si10140122pjq.108.2019.07.01.04.49.21; Mon, 01 Jul 2019 04:49:37 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728999AbfGAKjr (ORCPT + 99 others); Mon, 1 Jul 2019 06:39:47 -0400 Received: from Galois.linutronix.de ([193.142.43.55]:40131 "EHLO Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727701AbfGAKjq (ORCPT ); Mon, 1 Jul 2019 06:39:46 -0400 Received: from localhost ([127.0.0.1] helo=vostro.local) by Galois.linutronix.de with esmtp (Exim 4.80) (envelope-from ) id 1hhtij-0002uW-7P; Mon, 01 Jul 2019 12:39:37 +0200 From: John Ogness To: Peter Zijlstra Cc: linux-kernel@vger.kernel.org, Petr Mladek , Sergey Senozhatsky , Steven Rostedt , Linus Torvalds , Greg Kroah-Hartman , Andrea Parri , Thomas Gleixner , Sergey Senozhatsky Subject: Re: [RFC PATCH v2 1/2] printk-rb: add a new printk ringbuffer implementation References: <20190607162349.18199-1-john.ogness@linutronix.de> <20190607162349.18199-2-john.ogness@linutronix.de> <20190618114747.GQ3436@hirez.programming.kicks-ass.net> <87k1df28x4.fsf@linutronix.de> <20190626224034.GK2490@worktop.programming.kicks-ass.net> <87mui2ujh2.fsf@linutronix.de> <20190628154435.GZ7905@worktop.programming.kicks-ass.net> Date: Mon, 01 Jul 2019 12:39:35 +0200 In-Reply-To: <20190628154435.GZ7905@worktop.programming.kicks-ass.net> (Peter Zijlstra's message of "Fri, 28 Jun 2019 17:44:35 +0200") Message-ID: <877e92f388.fsf@linutronix.de> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 2019-06-28, Peter Zijlstra wrote: >> I have implemented and tested these changes. I also added setting the >> list terminator to this function, since all callers would have to do >> it anyway. Also, I spent a lot of time trying to put in comments that >> I think are _understandable_ and _acceptable_. >> >> @Peter: I expect they are way too long for you. >> >> @Andrea: Is this starting to become something that you would like to >> see? >> >> /** >> * add_descr_list() - Add a descriptor to the descriptor list. >> * >> * @e: An entry that has already reserved data. >> * >> * The provided entry contains a pointer to a descriptor that has already >> * been reserved for this entry. However, the reserved descriptor is not >> * yet on the list. Add this descriptor as the newest item. >> * >> * A descriptor is added in two steps. The first step is to make this >> * descriptor the newest. The second step is to update @next of the former >> * newest descriptor to point to this one (or set @oldest to this one if >> * this will be the first descriptor on the list). >> */ > > I still think it might be useful to explicitly call out the data > structure more. Even if you cannot use a fully abtracted queue. Agreed. It needs to be clear that the queue management is separate from the data management. > Also, newest/oldest just looks weird to me; I'm expecting head/tail. I will rename it to head/tail. >>> You have a single linked list going from the tail to the head, while >>> adding to the head and removing from the tail. And that sounds like >>> a FIFO queue: >> >> Yes, but with one important feature: the nodes in the FIFO queue are >> labeled with ordered sequence numbers. This is important for >> printk. I talk more about this below. > > But nowhere did/do you say what the actual data structure is, with what > modification for which reason. When you dive into the reader code you will see that the sequence numbers are necessary for readers to recognize that they have missed records. This means that the sequence numbers must be ordered, and AFAICT that is only possible if the queue management code is assigning it. >>> /* >>> * xchg() implies ACQUIRE; and thereby ensures @tail is >>> * written after @head, see lqueue_pop()'s smp_rmb(). >>> */ >>> if (prev) >>> WRITE_ONCE(prev->next, n); >> >> This needs to be a store_release() so that a reader cannot read @n >> but the store to @next is not yet visible. The memory barriers of the >> above xchg() do not apply here because readers never read @head. > > Confused, all stores to @n are before the xchg() so the barrier from > xchg() also order those stores and this store. Sorry. Yes, you are correct. I was confusing your suggested implementation with mine. I'm starting to overthink things and confuse myself about memory barriers. I need to be more careful. > (Note that the qspinlock has a queue not unlike this, but that again > doesn't have to bother with NMIs) Thank you for pointing this out! I will look to qspinlock for some naming guidelines. >>> Now, you appear to be using desc_ids instead of pointers, but since >>> you're not using the actual wrap value; I don't see the benefit of >>> using those IDs over straight pointers. That is, unless I've >>> overlooked some subtle ABA issue, but then, your code doesn't seem >>> to mention that, and I think we're good because if we re-use an >>> entry, it can never get back in the same location, since we never >>> allow an empty list >> >> I do not understand what you mean here. If a reader has a pointer to >> an entry, the entry behind that pointer can certainly change. But >> that isn't a problem. The reader will recognize that. > > ABA is where a cmpxchg has a false positive due to values matching but > not the structure. Thanks for the clarification. And it reminds me why I chose to use desc_ids. If we are using pointers instead of desc_ids, assuming we have a queue with currently only 1 node, the following should cause the ABA problem. CPU0 is perfoming an lqueue_push() to add a 2nd node to the queue. CPU0 CPU1 ---- ---- head = READ_ONCE(h->head); seq = READ_ONCE(head->seq); WRITE_ONCE(n->seq, seq + 1); WRITE_ONCE(n->next, NULL); lqueue_push(); lqueue_pop(); lqueue_push(); cmpxchg_release(&h->head, head, n); WRITE_ONCE(head->next, n); The queue itself will still be intact, but the sequence numbers are now wrong. For this to happen using desc_ids, the above set of calls from CPU1 would need to occur "LONG_MAX/desc_max_count" times in that window. Basically I am using tagged state references with probably >40 bits for the tag (on 64-bit systems). The effect for readers is that they will see a sequence number that is less or equal to the previous seen sequence number. Worthy of a warn message. My code needs to mention all this. >>> That said, the above has cmpxchg() vs WRITE_ONCE() and is therefore >>> not safe on a number of our architectures. We can either not care >>> about performance and use xchg() for the ->tail store, or use >>> atomic_long_t and suffer ugly casting. >> >> cmpxchg_release() vs WRITE_ONCE() is not safe?! Can you point me to >> documentation about this? > > Documentation/atomic_t.txt has this, see the SEMANTICS section on > atomic-set. Thanks. I overlooked that subtle detail. Can I assume NMIs do not exist on architectures that need to implement locking for cmpxchg()? Or did I just hit a major obstacle? I would prefer to replace the affected WRITE_ONCE() with xchg_relaxed() (like qspinlock is doing). Or are there some other subtle advantages of atomic_long_t? John Ogness