Subject: Re: [PATCH v1] io_uring: reserve word at cqring tail+4 for the user
To: Jens Axboe
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org
References: <20190917091358.3652-1-avi@scylladb.com> <3e886e3a-458d-0fe4-68ff-5925835efb5e@kernel.dk>
From: Avi Kivity
Organization: ScyllaDB
Message-ID: <9b4a780e-72af-125e-d104-4179a859d581@scylladb.com>
Date: Tue, 17 Sep 2019 18:12:50 +0300
In-Reply-To: <3e886e3a-458d-0fe4-68ff-5925835efb5e@kernel.dk>
X-Mailing-List: linux-kernel@vger.kernel.org

On 17/09/2019 17.54, Jens Axboe wrote:
> On 9/17/19 3:13 AM, Avi Kivity wrote:
>> In some applications, a thread waits for I/O events generated by
>> the kernel, and also events generated by other threads in the same
>> application. Typically events from other threads are passed using
>> in-memory queues that are not known to the kernel.
>> As long as the thread is active, it polls for both kernel completions
>> and inter-thread completions; when it is idle, it tells the other
>> threads to use an I/O event to wake it up (e.g. an eventfd or a pipe)
>> and then enters the kernel, waiting for such an event or an ordinary
>> I/O completion.
>>
>> When such a thread goes idle, it typically spins for a while to
>> avoid the kernel entry/exit cost in case an event is forthcoming
>> shortly. While it spins it polls both I/O completions and
>> inter-thread queues.
>>
>> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
>> line to be written to. This can be used with io_uring to wait for a
>> wakeup without spinning (and wasting power and slowing down the other
>> hyperthread). Other threads can also wake up the waiter by doing a
>> safe write to the tail word (which triggers the wakeup), but safe
>> writes are slow as they require an atomic instruction. To speed up
>> those wakeups, reserve a word after the tail for user writes.
>>
>> A thread consuming an io_uring completion queue can then use the
>> following sequences:
>>
>>   - while busy:
>>     - pick up work from the completion queue and from other threads,
>>       and process it
>>
>>   - while idle:
>>     - use UMONITOR/UMWAIT to wait on completions and notifications
>>       from other threads for a short period
>>     - if no work is picked up, let other threads know you will need
>>       a kernel wakeup, and use io_uring_enter to wait indefinitely
> This is cool, I like it.
> A few comments:
>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index cfb48bd088e1..4bd7905cee1d 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -77,12 +77,13 @@
>>
>>  #define IORING_MAX_ENTRIES 4096
>>  #define IORING_MAX_FIXED_FILES 1024
>>
>>  struct io_uring {
>> -       u32 head ____cacheline_aligned_in_smp;
>> -       u32 tail ____cacheline_aligned_in_smp;
>> +       u32 head ____cacheline_aligned;
>> +       u32 tail ____cacheline_aligned;
>> +       u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
>>  };
> Since we have that full cacheline, maybe name this one a bit more
> appropriately as we can add others if we need it. Not a big deal.

You mean, name it for its intended purpose of serving as a write target
for umonitor/umwait wakes? Note that the user won't see the name, and
that it's only accurate for an io_uring that's used for completions.

> But definitely use /* */ style comments :-)

Sorry, in C++-land for a while. You're lucky I didn't turn the whole
thing into a virtual template something.

>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index 1e1652f25cc1..1a6a826a66f3 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
>>   */
>>  #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
>>
>>  struct io_cqring_offsets {
>>         __u32 head;
>> +       // tail is guaranteed to be aligned on a cache line, and to have the
>> +       // following __u32 free for user use. This allows using e.g.
>> +       // UMONITOR/UMWAIT to wait on both writes to head and writes from
>> +       // other threads to the following word.
>>         __u32 tail;
>>         __u32 ring_mask;
>>         __u32 ring_entries;
>>         __u32 overflow;
>>         __u32 cqes;
> Ditto on the comments here.

Sure.

> Would be ideal if we could pair this with an example for liburing, a basic
> test case would be fine. Something that shows how to use it, and verifies
> that it works.

I'll have to look for a machine with waitpkg for that.

> Also, this patch is against master, it should be against for-5.4/io_uring as
> it won't apply there right now.

Sure, will rebase.
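
For illustration only, here is a rough user-space sketch of the short-wait
step from the patch description. It is not the liburing test case and has
not been run on waitpkg hardware. The names cq_ring_sketch, notify_waiter,
wait_short and the 64-byte cache line are assumptions made for this
example; the real ring lives in memory mapped from the kernel, and the
polling loop below stands in for UMONITOR/UMWAIT so that it runs anywhere.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 64 /* assumption: typical x86-64 cache line size */

/* Hypothetical user-space mirror of the ring header after the patch. */
struct cq_ring_sketch {
	alignas(CACHELINE_BYTES) _Atomic uint32_t head;
	alignas(CACHELINE_BYTES) _Atomic uint32_t tail;
	_Atomic uint32_t reserved_for_user; /* tail + 4, same cache line */
};

/* The layout property the patch relies on: one monitored cache line
 * covers both the tail and the user-writable word. */
static_assert(offsetof(struct cq_ring_sketch, tail) % CACHELINE_BYTES == 0,
	      "tail must start a cache line");
static_assert(offsetof(struct cq_ring_sketch, reserved_for_user) ==
	      offsetof(struct cq_ring_sketch, tail) + sizeof(uint32_t),
	      "reserved word must directly follow tail");

enum wake_reason { WAKE_NONE, WAKE_COMPLETION, WAKE_THREAD };

/* Another thread wakes the waiter with a plain store to the reserved
 * word; no read-modify-write on the tail itself is needed. */
void notify_waiter(struct cq_ring_sketch *ring, uint32_t seq)
{
	atomic_store_explicit(&ring->reserved_for_user, seq,
			      memory_order_release);
}

/* Wait for a bounded number of polls for either word to change.
 * On a waitpkg machine each iteration would arm UMONITOR on &ring->tail,
 * re-check both words, and then UMWAIT with a deadline instead of
 * spinning. Returns WAKE_NONE when the caller should fall back to
 * io_uring_enter. */
enum wake_reason wait_short(struct cq_ring_sketch *ring, uint32_t seen_tail,
			    uint32_t seen_user, unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if (atomic_load_explicit(&ring->tail,
					 memory_order_acquire) != seen_tail)
			return WAKE_COMPLETION;
		if (atomic_load_explicit(&ring->reserved_for_user,
					 memory_order_acquire) != seen_user)
			return WAKE_THREAD;
	}
	return WAKE_NONE;
}
```

A caller would remember the last tail and reserved-word values it saw,
call wait_short() when it runs out of work, and only tell other threads
that it needs a kernel wakeup when the result is WAKE_NONE.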