From: John Ogness
To: Petr Mladek
Cc: Andrea Parri, Sergey Senozhatsky, Steven Rostedt, Peter Zijlstra,
 Thomas Gleixner, Linus Torvalds, Greg Kroah-Hartman,
 linux-kernel@vger.kernel.org
Subject: Re: [PATCH POC] printk_ringbuffer: Alternative implementation of lockless printk ringbuffer
Date: Tue, 09 Jul 2019 03:34:43 +0200
In-Reply-To: <20190708152311.7u6hs453phhjif3q@pathway.suse.cz> (Petr Mladek's message of "Mon, 8 Jul 2019 17:23:11 +0200")
Message-ID: <874l3wng7g.fsf@linutronix.de>
On 2019-07-08, Petr Mladek wrote:
>>> This is a POC that implements the lockless printk ringbuffer in a
>>> slightly different way. I believe that it is worth considering
>>> because it looks much easier to deal with. The reasons are:
>>>
>>> + The state of each entry is always clear.
>>>
>>> + The write access rights and the validity of the data
>>>   are clear from the state of the entry.
>>>
>>> + It seems that three barriers are enough to synchronize
>>>   readers vs. writers. The rest is done implicitly
>>>   using atomic operations.
>>>
>>> Of course, I might have missed some critical race that can't easily
>>> be solved by the new design. But I do not see it at the moment.
>>
>> Two things jump out at me when looking at the implementation:
>>
>> 1. The code claims that the cmpxchg(seq_newest) in prb_reserve_desc()
>>    guarantees that "The descriptor is ours until the COMMITTED bit is
>>    set." This is not true if seq_newest wraps in that window, allowing
>>    another writer to gain ownership of the same descriptor. For small
>>    descriptor arrays (such as in my test module), this situation is
>>    quite easy to reproduce.
>
> I am not sure that I fully understand the problem. seq_newest
> wraps at 2^30 (32-bit) and at 2^62 (64-bit). It takes a while
> to reuse an existing one. And it does not depend on the size
> of the array.

I am not referring to the unsigned long overflowing. I am referring to
the array index wrapping. This _does_ depend on the size of the array.

> In addition, a new sequence number can get assigned only when
> the descriptor with the conflicting (sharing the same struct
> prb_desc) sequence number is in the reusable state. It means
> that it has to be committed before.

Correct. But taking it _out_ of the reusable state is not atomic, which
opens the window I am referring to.

>> This was one of the reasons I chose to use a linked list. When the
>> descriptor is atomically removed from the linked list, it can _never_
>> be used (or even seen) by any other writer until the owning writer is
>> done with it.
>>
>> I'm not yet sure how that could be fixed with this implementation.
>> The state information is in a separate variable from the head pointer
>> for the descriptor array (seq_newest). This means you cannot
>> atomically reserve descriptors.
>
> In my implementation, the sequence numbers are atomically reserved
> in prb_reserve_desc() by
>
>     } while (cmpxchg(&rb->seq_newest, seq_newest, seq) != seq_newest);
>
> where seq is always seq_newest + 1. We get here only when the
> conflicting seq from the previous wrap is in the reusable state
> and the related data block has been moved outside the valid lpos
> range. This is ensured by prb_remove_desc_oldest(rb, seq_prev_wrap).
>
> Now, the CPU that succeeded with the cmpxchg() becomes
> the exclusive owner of the respective descriptor. The sequence
> number is written into this descriptor _after_ the cmpxchg()
> succeeded.
>
> It is safe because:
>
> + the previous values are no longer used (the descriptor has been
>   marked as reusable, and the lpos values of the related data block
>   have been moved outside the valid range (lpos_oldest, lpos_newest));
>
> + the new values are ignored by readers and other writers until
>   the right sequence number and the committed flag are set
>   in the descriptor.

Let me inline the function we are talking about and add commentary to
illustrate what I am saying:

static int prb_reserve_desc(struct prb_reserved_entry *entry)
{
        unsigned long seq, seq_newest, seq_prev_wrap;
        struct printk_ringbuffer *rb = entry->rb;
        struct prb_desc *desc;
        int err;

        /* Get descriptor for the next sequence number. */
        do {
                seq_newest = READ_ONCE(rb->seq_newest);
                seq = (seq_newest + 1) & PRB_SEQ_MASK;
                seq_prev_wrap = (seq - PRB_DESC_SIZE(rb)) & PRB_SEQ_MASK;

                /*
                 * Remove the conflicting descriptor from the previous
                 * wrap if it was ever used. It might fail when the
                 * related data have not been committed yet.
                 */
                if (seq_prev_wrap == READ_ONCE(rb->seq_oldest)) {
                        err = prb_remove_desc_oldest(rb, seq_prev_wrap);
                        if (err)
                                return err;
                }
        } while (cmpxchg(&rb->seq_newest, seq_newest, seq) != seq_newest);

I am referring to this point in the code, after the cmpxchg().
seq_newest has been incremented, but the descriptor is still in the
unused state and its seq is still one wrap behind. If an NMI occurs
here and the NMI (or some other CPU) inserts enough entries to wrap the
descriptor array, this descriptor will be reserved again, even though
it has already been reserved.

        /*
         * The descriptor is ours until the COMMITTED bit is set.
         * Set its sequence number with all flags cleared.
         */
        desc = SEQ_TO_DESC(rb, seq);
        WRITE_ONCE(desc->dst, seq);

        /*
         * Make sure that anyone sees the new dst/seq before
         * the lpos values and data are manipulated. It is related
         * to the read barrier in prb_read_desc().
         */
        smp_wmb();

*Now* the descriptor is ours. Not before. And it is only exclusively
ours if the above mentioned situation doesn't occur.

        entry->seq = seq;
        return 0;
}

This window doesn't exist with the list approach because reserving a
descriptor simply involves removing the tail (oldest) of the committed
list, which is an atomic operation.
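To make that contrast concrete, here is a minimal userspace sketch of
the single-step ownership transfer. This is not code from either patch:
the names (struct desc, oldest, take_oldest) are made up, it uses C11
atomics rather than the kernel's cmpxchg(), and it ignores the ABA
protection that a real lockless list needs. It only illustrates that
the claim happens in one atomic step:

#include <stdatomic.h>
#include <stddef.h>

struct desc {
        struct desc *next;      /* toward newer descriptors */
        /* payload: seq, lpos, lpos_next, state bits, ... */
};

/* Oldest committed descriptor; writers reclaim from this end. */
static _Atomic(struct desc *) oldest;

static struct desc *take_oldest(void)
{
        struct desc *d, *next;

        d = atomic_load_explicit(&oldest, memory_order_acquire);
        while (d) {
                next = d->next;
                /*
                 * The claim is a single atomic step: on success, no
                 * other writer can see or take d, so it is exclusively
                 * ours until we commit it again. On failure, d has
                 * been reloaded with the current value and we retry.
                 */
                if (atomic_compare_exchange_weak_explicit(&oldest, &d,
                                next, memory_order_acq_rel,
                                memory_order_acquire))
                        return d;
        }
        return NULL;    /* list empty: nothing to reclaim */
}

The point is only that the claim and the ownership transfer are one
step. In the POC, the cmpxchg() on seq_newest and the later
WRITE_ONCE(desc->dst, seq) are two separate steps, and the wrap can
strike between them.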
>> 2. Another issue is when prb_reserve() fails and sets the descriptor
>>    as unused. As it is now, no reader can read beyond that descriptor
>>    until it is recycled. Readers need to know that the descriptor is
>>    bad and can be skipped over. It might be better to handle this the
>>    way I did: go ahead and set the state to committed, but have
>>    invalid lpos/lpos_next values (for example: lpos_next=lpos) so the
>>    reader knows it can skip over the descriptor.
>
> This is exactly what the code does, see prb_make_desc_unused().
> It marks the descriptor as committed so that it can get reused.
> And it sets lpos and lpos_next to the same value so that
> the situation can eventually be detected by readers.

Indeed. Sorry for the noise.
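For anyone following the thread, here is a small sketch of what that
convention means on the reader side. The struct layout and the
prb_read_desc() signature shown here are assumptions for illustration,
not the POC's actual API. The only point is that lpos_next == lpos must
mean "skip", never "end of data":

#include <stdbool.h>

/* Hypothetical view of a committed descriptor. */
struct desc_view {
        unsigned long seq;
        unsigned long lpos;        /* start of the data block */
        unsigned long lpos_next;   /* start of the next data block */
};

/* Assumed reader primitive; the real signature may differ. */
extern int prb_read_desc(void *rb, unsigned long seq,
                         struct desc_view *out);

static void read_entries(void *rb, unsigned long seq)
{
        struct desc_view view;

        while (prb_read_desc(rb, seq, &view) == 0) {
                /*
                 * prb_make_desc_unused() commits the descriptor with
                 * lpos_next == lpos: a valid entry of zero length.
                 */
                if (view.lpos_next != view.lpos) {
                        /* consume data in [view.lpos, view.lpos_next) */
                }
                seq++;  /* empty or not, advance to the next entry */
        }
}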
> BTW: There is one potential problem with my alternative approach.
>
> The descriptors and the related data blocks might get reserved
> in a different order. Now, a descriptor might get reused only
> when the related data block has been moved outside the valid range.
> This operation might also move other data blocks outside the range
> and invalidate descriptors that were reserved later. As a result,
> we might need to invalidate more messages in the log buffer than
> would really be necessary.
>
> If I understand it correctly, this problem does not exist with the
> implementation using links. That is because the descriptors are
> linked in the same order as the reserved data blocks.

Descriptors in the committed list are ordered in commit order (not
reserve order). However, if there are not enough descriptors (i.e.
avgdatabits is higher than the true average), this problem exists with
the list approach as well.

> I am not sure how big the problem with more invalidated messages
> would be in reality. I am not sure if it would be worth
> a more complicated implementation.

I am also not sure how big the problem is in a practical sense.
However, to help avoid this issue, I will increase the
descbits:avgdatabits ratio for v3.

> Anyway, I still need to fully understand the linked approach
> first.

You may want to wait for v3. I have now split the ringbuffer into
multiple generic data structures (as unintentionally suggested[0] by
PeterZ), which helps to clarify the role of each data structure and
also isolates the memory barriers so that it is clear which data
structure requires which memory barriers.

John Ogness

[0] https://lkml.kernel.org/r/20190628154435.GZ7905@worktop.programming.kicks-ass.net