Message-ID: <0b8d7447e129539aec559fa797c07047f5a6a1b2.camel@redhat.com>
Subject: Re: epoll_wait() performance
From: Paolo Abeni
To: Jesper Dangaard Brouer, David Laight
Cc: 'Marek Majkowski', linux-kernel, network dev, kernel-team
Date: Wed, 27 Nov 2019 17:26:48 +0100
In-Reply-To: <20191127164821.1c41deff@carbon>
References: <5f4028c48a1a4673bd3b38728e8ade07@AcuMS.aculab.com>
 <20191127164821.1c41deff@carbon>

On Wed, 2019-11-27 at 16:48 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight wrote:
> > > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > >
> > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > >
> > > That sounds wrong. A single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().
> >
> > My suspicion is that the extra two copy_from_user() calls needed for each recvmsg() are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> >
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this makes (by seeing how much slower duplicating the copy makes it).
> >
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster than the setup for recvmsg(),
> > even allowing for looking up the fd.
> >
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> >
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg();
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> >
> > We really do want to receive all these UDP packets in a timely manner,
> > although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
>
> I have a simple udp_sink tool[1] that cycles through the different
> receive socket system calls. I gave it a quick spin on an F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
>
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
>              run     count     ns/pkt   pps         cycles  payload
> recvMmsg/32  run: 0  10000000  1461.41   684270.96  5261    18  demux:1
> recvmsg      run: 0  10000000   889.82  1123824.84  3203    18  demux:1
> read         run: 0  10000000   974.81  1025841.68  3509    18  demux:1
> recvfrom     run: 0  10000000  1056.51   946513.44  3803    18  demux:1
>
> Normal recvmsg has almost double the performance of recvmmsg.

For stream tests the above is true, provided the BH is able to push the
packets to the socket fast enough. Otherwise recvmmsg() will make user space
even faster: the BH will find the user-space process sleeping more often and
will have to spend more time waking it up.

If a single receive queue is in use, this condition is not easy to meet.
Before the spectre/meltdown and other mitigations, using connected sockets and
removing ct/nf was usually sufficient - at least in my scenarios - to make the
BH fast enough. But that is no longer the case, and I have to use 2 or more
different receive queues.

@David: if I read your message correctly, the pkt rate you are dealing with is
quite low... are we talking about tput or latency? I guess latency could be
measurably higher with recvmmsg() compared to the other syscalls.

How do you measure the relative performance of recvmmsg() and recv()? With a
micro-benchmark/rdtsc()? Am I right that you are usually getting a single
packet per recvmmsg() call?

Thanks,

Paolo
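
For readers following along, here is a minimal C sketch of the two receive
patterns being compared in the thread: recv() followed by a zero-timeout
poll() to check whether a second datagram is already queued, versus a single
non-blocking recvmmsg() batch. The buffer size, batch size and helper names
are illustrative assumptions, not taken from any code referenced above:

#define _GNU_SOURCE          /* for recvmmsg() */
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define PKT_MAX 2048         /* illustrative datagram buffer size */
#define BATCH   32           /* illustrative recvmmsg() batch size */

/* Pattern 1: one recv(), then poll() with a zero timeout just to learn
 * whether another datagram is waiting.  Two syscalls per packet. */
static int recv_then_repoll(int fd, char *buf)
{
        ssize_t len = recv(fd, buf, PKT_MAX, MSG_DONTWAIT);
        if (len < 0)
                return -1;

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        return poll(&pfd, 1, 0) > 0;    /* >0: more data is queued */
}

/* Pattern 2: a single recvmmsg() that may return up to BATCH datagrams.
 * With only one packet queued it still pays the full per-call setup cost. */
static int recv_batch(int fd, char bufs[BATCH][PKT_MAX])
{
        struct mmsghdr msgs[BATCH];
        struct iovec   iovs[BATCH];

        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
                iovs[i].iov_base           = bufs[i];
                iovs[i].iov_len            = PKT_MAX;
                msgs[i].msg_hdr.msg_iov    = &iovs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }
        return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
}

Either way there is a trade-off the thread is circling around: the first
pattern spends a second syscall per packet on the re-poll, while the second
pays recvmmsg()'s setup (including the copy_from_user() of the mmsghdr/iovec
arrays that David points at) even when only one packet is queued.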
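
On the measurement question, a rough sketch of the sort of micro-benchmark
being asked about, timing a send()+recv() pair over loopback with
clock_gettime() rather than rdtsc. The port, iteration count and 18-byte
payload are arbitrary assumptions; swapping the recv() line for a
recv()+poll() pair or a one-packet recvmmsg() would give the relative numbers
under discussion (note it times the sender and receiver paths together, not
the receive syscall in isolation):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

#define N_ITERS 1000000

static long long now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        /* Two UDP sockets over loopback; port 9999 is an arbitrary choice. */
        int rx = socket(AF_INET, SOCK_DGRAM, 0);
        int tx = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = {
                .sin_family      = AF_INET,
                .sin_port        = htons(9999),
                .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
        };
        bind(rx, (struct sockaddr *)&addr, sizeof(addr));
        connect(tx, (struct sockaddr *)&addr, sizeof(addr));

        char pkt[18] = { 0 }, buf[2048];
        long long start = now_ns();
        for (int i = 0; i < N_ITERS; i++) {
                send(tx, pkt, sizeof(pkt), 0);  /* exactly one datagram queued */
                recv(rx, buf, sizeof(buf), 0);  /* the path under test */
        }
        printf("%.2f ns per send+recv pair\n",
               (double)(now_ns() - start) / N_ITERS);
        return 0;
}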