Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp6813013ybi; Mon, 8 Jul 2019 09:06:00 -0700 (PDT) X-Google-Smtp-Source: APXvYqxj8rHd5Jdn3exrUoZkBWUWvpUd4Y3OthJQ5K1GyvFEDh9DqBe832SvesPqyxzQFWpUkGIi X-Received: by 2002:a17:902:9346:: with SMTP id g6mr25956593plp.61.1562601960702; Mon, 08 Jul 2019 09:06:00 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1562601960; cv=none; d=google.com; s=arc-20160816; b=n8kAPADuqJRbCa9m/RYkcV1sss/V8GhKsZZdsXW1Xmy9Ia9EQxGbm7H/lUKaq0xEvG miK4Ue6oqnDgp3tm7UCuBogjU1aapRVYOzoQKNE5eujtRWzP2WYtAjAJp5chdI2yhVqD /xGf+AzonvWDEXGcfoWZOSEQUc3keupTkzZlAh2/aC8UiqOSH1GH/SCjVCQIUidnQUzN wIfyjvnZZiagfD+OnFEs3jaIdzcOX9MVIQe3cB/sjQwq5LJV8peJOlxQoLlm56bvuP4l Bq1KIPG2gKRz9x6qdPXZpXgxOlkMoxTHd5r3lEeSxwPvOfuIhwFO/2jWmNF9ahq0NFhB ITgQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=b4H7uU4WbNHz7sq7Xk5bht+AOSlk+6osv8h+s6baE3Q=; b=RD6sVsmTM9ospLnLm9TY1/P8yJYbFREl9kzfRiYSlLGd7jaVnCmhQ0ydvped4zvHN9 W0JkFehDPpj3xK6NMZI7YauBycIyYvXE87tkxdVUaNDav+sTfTVhOqrS6/7V3/9X23Pq qKLvaM1OCJ3n7V0XAgXGCmgvK/DwFsIXj0Ah/bYxNjWYJxA7vojSS+vV4jkmu78PkY7T RYo/23cWve9S8nmrLen5FxGOVgaTYrIEGuAwv78vJEt7rHzRrTFfJBKpz+umdY1dufRv NI5MiW9zMWNU6z31FigvOBeBHMGlB8VufGXWpSVJ31WEbac3ScuhsaKWTKs4HwwqFHdq yt2A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id d16si18700229pfr.229.2019.07.08.09.05.45; Mon, 08 Jul 2019 09:06:00 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2388144AbfGHPXS (ORCPT + 99 others); Mon, 8 Jul 2019 11:23:18 -0400 Received: from mx2.suse.de ([195.135.220.15]:34854 "EHLO mx1.suse.de" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S2388157AbfGHPXO (ORCPT ); Mon, 8 Jul 2019 11:23:14 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx1.suse.de (Postfix) with ESMTP id BC5D9AF9C; Mon, 8 Jul 2019 15:23:12 +0000 (UTC) Date: Mon, 8 Jul 2019 17:23:11 +0200 From: Petr Mladek To: John Ogness Cc: Andrea Parri , Sergey Senozhatsky , Sergey Senozhatsky , Steven Rostedt , Peter Zijlstra , Thomas Gleixner , Linus Torvalds , Greg Kroah-Hartman , linux-kernel@vger.kernel.org Subject: Re: [PATCH POC] printk_ringbuffer: Alternative implementation of lockless printk ringbuffer Message-ID: <20190708152311.7u6hs453phhjif3q@pathway.suse.cz> References: <20190621140516.h36g4in26pe3rmly@pathway.suse.cz> <20190704103321.10022-1-pmladek@suse.com> <20190704103321.10022-1-pmladek@suse.com> <87r275j15h.fsf@linutronix.de> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87r275j15h.fsf@linutronix.de> User-Agent: NeoMutt/20170912 (1.9.0) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 2019-07-04 16:59:54, John Ogness wrote: > Hi Petr, > > On 2019-07-04, Petr Mladek wrote: > > This is POC that implements the lockless printk ringbuffer slightly > > different way. I believe that it is worth considering because it looks > > much easier to deal with. The reasons are: > > > > + The state of each entry is always clear. > > > > + The write access rights and validity of the data > > are clear from the state of the entry. > > > > + It seems that three barriers are enough to synchronize > > readers vs. writers. The rest is done implicitely > > using atomic operations. > > > > Of course, I might have missed some critical race that can't get > > solved by the new design easily. But I do not see it at the moment. > > Two things jump out at me when looking at the implementation: > > 1. The code claims that the cmpxchg(seq_newest) in prb_reserve_desc() > guarantees that "The descriptor is ours until the COMMITTED bit is set." > This is not true if in that wind seq_newest wraps, allowing another > writer to gain ownership of the same descriptor. For small descriptor > arrays (such as in my test module), this situation is quite easy to > reproduce. I am not sure that I fully understand the problem. seq_newest wraps at 2^30 (32-bit) and at 2^62 (64-bit). It takes a while to reuse an existing one. And it does not depend on the size of the array. In addition, new sequence number can get assigned only when the descriptor with the conflicting (sharing the same struct prb_desc) sequence number is in reusable state. It means that it has to be committed before. > This was one of the reasons I chose to use a linked list. When the > descriptor is atomically removed from the linked list, it can _never_ be > used (or even seen) by any other writer until the owning writer is done > with it. > > I'm not yet sure how that could be fixed with this implementation. The > state information is in a separate variable than the head pointer for > the descriptor array (seq_newest). This means you cannot atomically > reserve descriptors. In my implementation, the sequence numbers are atomically reserved in prb_reserve_desc() by } while (cmpxchg(&rb->seq_newest, seq_newest, seq) != newest_seqs); where seq is always seq_newest + 1. We are here only when the conflicting seq from the previous wrap is in reusable state and the related datablock is moved outside valid lpos range. This is ensured by prb_remove_desc_oldest(rb, seq_prev_wrap). Now, the CPU that succeeded with cmpxchg() becomes the exclusive owner of the respective descriptor. The sequence number is written into this descriptor _after_ cmpxchg() succeeded. It is safe because: + previous values are not longer used (descriptor has been marked as reusable, lpos from the related datablock were moved outside valid range (lpos_oldest, lpos_newest). + new values are ignored by readers and other writers until the right sequence number and the committed flag is set in the descriptor. > 2. Another issue is when prb_reserve() fails and sets the descriptor as > unused. As it is now, no reader can read beyond that descriptor until it > is recycled. Readers need to know that the descriptor is bad and can be > skipped over. It might be better to handle this the way I did: go ahead > and set the state to committed, but have invalid lpos/lpos_next values > (for example: lpos_next=lpos) so the reader knows it can skip over the > descriptor. This is exactly what the code does, see prb_make_desc_unused(). It marks the descriptor as committed so that it can get reused. And it sets lpos and lpos_next to the same value so that the situation can get eventually detected by readers. > > The code compiles but it is not really tested. I wanted to send it > > ASAP in a good enough state before we spend more time on cleaning > > up either of the proposals. > > I am glad to see you put together your implementation. If anything, it > shows you understand the design! If after seeing my next version (v3) > you are still convinced that using a linked list for the descriptors is > too complex, then I can help support your idea to move to an array. I am definitely interested into seeing v3 of your approach. I believe that it will be much easier to understand. Then it will be easier to compare. BTW: There is one potential problem with my alternative approach. The descriptors and the related data blocks might get reserved in different order. Now, the descriptor might get reused only when the related datablock is moved outside the valid range. This operation might move also other data blocks outside the range and invalidate descriptors that were reserved later. As a result we might need to invalidate more messages in the log buffer then would be really necessary. If I get it properly, this problem does not exist with the implementation using links. It is because the descriptors are linked in the same order as the reserved data blocks. I am not sure how big the problem, with more invalidated messages, would be in reality. I am not sure if it would be worth a more complicated implementation. Alternative solution would be to somehow mix tricks from both approaches and get something that is easier to deal with and has less drawbacks. I am not sure it is possible. Anyway, I still need to fully understand the linked approach first. Best Regards, Petr