Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752777AbbH2VIT (ORCPT ); Sat, 29 Aug 2015 17:08:19 -0400 Received: from ns.horizon.com ([71.41.210.147]:59005 "HELO ns.horizon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP id S1752266AbbH2VIS (ORCPT ); Sat, 29 Aug 2015 17:08:18 -0400 Date: 29 Aug 2015 17:08:16 -0400 Message-ID: <20150829210816.16551.qmail@ns.horizon.com> From: "George Spelvin" To: oleg@redhat.com Subject: Re: [PATCH] task_work: remove fifo ordering guarantee Cc: eric.dumazet@gmail.com, linux@horizon.com, linux-kernel@vger.kernel.org, mingo@kernel.org Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3212 Lines: 71 Oleg Nesterov wrote: > On 08/29, Ingo Molnar wrote: >> So I'm wondering, is there any strong reason why we couldn't use a double >> linked list and still do FIFO and remove that silly linear list walking hack? > > This will obviously enlarge callback_head, and it is often embedded. > But this is minor. > > If we use a double linked list we can't do task_work_add() lockless. > So we will need another spinlock_t in task_struct. We can't use pi_lock. You only need a singly linked list for FIFO, but indeed lockless access is a pain. For a LIFO stack, you just do a single compare-and-swap on the head. Once an entry is in the list, it's immutable. For FIFO, you only need one pointer in the nodes, but two in the list head: a head pointer and a tail pointer. The problem for lockless access is that you have to update both the next pointer and the tail pointer, and without very specialized instructions like 680x0's CAS2, there's no way to do them both atomically. So the procedure to append (write) to the list is: - Set your newly added node's next pointer to NULL. - CAS the tail pointer. This establishes your place in line, but does not make it visible to the reader. - At this point, the next pointer is not yours any more and may be written by someone else, but the rest of the node may still be finished, if you like. - But also, there's a sort of priority inversion problem. If a writer stalls here, no following writer is visible to the reader. - Then you update the previous tail (obtained from the CAS) to link via its next pointer to your newly added node. This makes it visible to the reader. - The reader, meanwhile, totally ignores the tail pointer, and only uses the head and next pointers. Now, one trick with the above is that the tail pointer must *only* be accessed by writers. The single-threaded technique of representing an empty queue by using a "struct **" tail pointer and having tail = &head to represent an empty list is not okay. Rather, the reader must never deallocate a node with a NULL next pointer. If the nodes are large, you can minimize the overhead of this by having a dedicated sentinel/dummy node which lives in the queue and gets moved to the tail whenever it reaches the head *and* has a non-NULL next pointer. But the reader may not depend on this node always being visible! The reader might find the sentinel pointing to node A, and then move it to the tail by doing a CAS on the tail pointer, but the tail pointer by then points to B. It's possible that A's next pointer is still NULL and the writer adding B has not gotten to the second step yet. Anyway, it can be done, but honestly the current "empty the stack in batches and then reverse it" is simpler. The cond_resched is ugly, but the overhead is minimal, and the FIFO guarantee is convenient. By the way, it's quite possible (LIFO or FIFO) for the writer to batch up queue adds and only do one CAS to add an arbitrary number. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/