Date: Mon, 1 Feb 2021 17:41:28 +0000
Message-Id: <20210201174132.3534118-1-brianvv@google.com>
Subject: [PATCH net-next v3 0/4] net: use INDIRECT_CALL in some dst_ops
From: Brian Vazquez
To: Brian Vazquez, Brian Vazquez, Eric Dumazet, Luigi Rizzo, "David S . Miller", Jakub Kicinski
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org

This patch series uses the INDIRECT_CALL wrappers in some dst_ops
functions to mitigate retpoline costs. Benefits depend on the platform
as described below.

Background: The kernel rewrites the retpoline code at
__x86_indirect_thunk_r11 depending on the CPU's requirements.
The INDIRECT_CALL wrappers provide hints on possible targets and
save the retpoline overhead using a direct call in case the
target matches one of the hints.
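To illustrate the hint mechanism described above, here is a minimal
user-space sketch. It is loosely modelled on the kernel's INDIRECT_CALL_2()
wrapper but is not the kernel code: SKETCH_INDIRECT_CALL_2, the three
handler functions and dispatch() are hypothetical names made up for this
example, and the real wrappers additionally depend on the kernel's retpoline
configuration.

```c
#include <assert.h>

/* Hypothetical stand-ins for two likely targets and one unlikely target. */
int ipv4_handler(int pkt)  { return pkt + 4; }
int ipv6_handler(int pkt)  { return pkt + 6; }
int other_handler(int pkt) { return pkt; }

typedef int (*handler_fn)(int);

/*
 * Compare the function pointer against the hinted targets. On a match the
 * call is emitted as a direct call, so no retpoline thunk is involved.
 * Only when no hint matches do we fall back to the plain indirect call.
 */
#define SKETCH_INDIRECT_CALL_2(f, f2, f1, ...)			\
	((f) == (f2) ? f2(__VA_ARGS__) :			\
	 (f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Dispatch through the wrapper, hinting the two expected handlers. */
int dispatch(handler_fn fn, int pkt)
{
	return SKETCH_INDIRECT_CALL_2(fn, ipv6_handler, ipv4_handler, pkt);
}
```

Calling dispatch(ipv4_handler, 1) or dispatch(ipv6_handler, 1) takes one of
the direct-call branches; dispatch(other_handler, 1) still works but pays
for the pointer comparisons plus the indirect call.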
The retpoline overhead for the following three cases has been
measured by Luigi Rizzo in microbenchmarks, using CPU performance
counters. Together they cover reasonably well the range of possible
retpoline overheads compared to a plain indirect call (in equal
conditions, specifically with predicted branch, hot cache):

- just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
  In this case the overhead is just 2 clock cycles.

- "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
  In this case the lfence is blocked until pending reads complete, so
  the actual overhead depends on previous instructions.
  In the best case we have measured 15 clock cycles of overhead.

- worst case, e.g. skylake, the full retpoline is used

    __x86_indirect_thunk_r11:
        call set_up_target
    capture_speculation:
        pause
        lfence
        jmp capture_speculation
    .align 16
    set_up_target:
        mov %r11, (%rsp)
        ret

  In this case the overhead has been measured at 35-40 clock cycles.

The actual time saved hence depends on the platform and current clock
speed (which varies heavily, especially when C-states are active).
Also note that the actual benefit might be lower than expected if the
longer retpoline overlaps with some pending memory read.

MEASUREMENTS:
The INDIRECT_CALL wrappers in this patchset involve the processing of
incoming SYN and generation of syncookies. Hence, the test has been run
by configuring a receiving host with a single NIC rx queue, disabling
RPS and RFS so that all processing occurs on the same core.
An external source generates SYN fast enough to saturate the receiving
CPU. We ran two sets of experiments, with and without the dst_output
patch, comparing the number of syncookies generated over a 20s period
in multiple runs.

Assuming the CPU is saturated, the time per packet is
    t = total_time/number_of_packets
and if the two datasets have a statistically meaningful difference,
the difference in times between the two cases gives an estimate
of the benefits from one INDIRECT_CALL.
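The per-packet estimate described above can be sketched as follows. The
function names are hypothetical, and the packet counts in the usage note
are round placeholder numbers, not measurements.

```c
#include <assert.h>

/*
 * Average time per packet in nanoseconds, assuming the CPU was saturated
 * for the whole interval: t = total_time / number_of_packets.
 */
double ns_per_packet(double total_seconds, unsigned long packets)
{
	return total_seconds * 1e9 / (double)packets;
}

/*
 * Estimated saving per packet: per-packet time without the patch minus
 * per-packet time with the patch, over runs of equal length.
 */
double saving_ns(double total_seconds,
		 unsigned long pkts_without, unsigned long pkts_with)
{
	return ns_per_packet(total_seconds, pkts_without) -
	       ns_per_packet(total_seconds, pkts_with);
}
```

For example, 10000000 packets in 20 s corresponds to 2000 ns per packet,
and a run that only reaches 8000000 packets in the same time is 500 ns
per packet slower. A larger packet count means a smaller time per packet.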
Here are the experimental results:

Skylake         Syncookies over 20s (5 tests)
---------------------------------------------------
indirect        9166325 9182023 9170093 9134014 9171082
retpoline       9099308 9126350 9154841 9056377 9122376

Computing the stats on the per-packet time (20/total_packets, in
seconds) gives the following:

$ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
x /tmp/sk-indirect
+ /tmp/sk-retp
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5   2.17817e-06   2.18962e-06     2.181e-06  2.182292e-06 4.3252133e-09
+   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
Difference at 95.0% confidence
        1.2682e-08 +/- 1.01766e-08
        0.581132% +/- 0.466326%
        (Student's t, pooled s = 6.97772e-09)

This suggests a difference of 13ns +/- 10ns.
Our expectation from microbenchmarks was 35-40 cycles per call,
but part of the gains may be eaten by stalls from pending memory reads.
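As a cross-check of the Skylake numbers, the sketch below recomputes the
mean per-packet time of each row and their difference from the syncookie
counts in the table. It also converts the 35-40 cycle microbenchmark figure
into nanoseconds. The function names are hypothetical, and the clock
frequency passed to cycles_to_ns() is purely an assumption: this email does
not state the clock speed of the test machine.

```c
#include <assert.h>
#include <stddef.h>

#define RUN_SECONDS 20.0	/* length of each test run */

/* Syncookie counts copied from the Skylake table. */
static const unsigned long sk_indirect[]  = { 9166325, 9182023, 9170093, 9134014, 9171082 };
static const unsigned long sk_retpoline[] = { 9099308, 9126350, 9154841, 9056377, 9122376 };

/* Mean of the per-run times per packet, in nanoseconds. */
double mean_ns_per_packet(const unsigned long *counts, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += RUN_SECONDS * 1e9 / (double)counts[i];
	return sum / (double)n;
}

/* Difference of the two means: retpoline row minus indirect row. */
double skylake_diff_ns(void)
{
	return mean_ns_per_packet(sk_retpoline, 5) -
	       mean_ns_per_packet(sk_indirect, 5);
}

/* Cycles to nanoseconds at a given clock; the GHz value is an assumption. */
double cycles_to_ns(double cycles, double assumed_ghz)
{
	return cycles / assumed_ghz;
}
```

skylake_diff_ns() comes out at about 12.7 ns, in line with the 1.2682e-08 s
reported by ministat. With a hypothetical 3.0 GHz clock, 35-40 cycles would
be roughly 11.7-13.3 ns; at a different clock speed the figure shifts
accordingly, so this is only a plausibility check.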
For Cascadelake:

Cascadelake     Syncookies over 20s (5 tests)
---------------------------------------------------------
indirect        10339797 10297547 10366826 10378891 10384854
retpoline       10332674 10366805 10320374 10334272 10374087

Computing the stats on the per-packet time (20/total_packets, in
seconds) gives no meaningful difference even at just 80% confidence
(this was expected):

$ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
x /tmp/cl-indirect
+ /tmp/cl-retp
[ministat distribution plot]
    N           Min           Max        Median           Avg        Stddev
x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
+   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
No difference proven at 80.0% confidence

Changed in v3:
 - fix From: tag
 - provide measurements

Changed in v2:
 - fix build issues reported by kernel test robot

Brian Vazquez (4):
  net: use indirect call helpers for dst_input
  net: use indirect call helpers for dst_output
  net: use indirect call helpers for dst_mtu
  net: indirect call helpers for ipv4/ipv6 dst_check functions

 include/net/dst.h     | 25 +++++++++++++++++++++----
 net/core/sock.c       | 12 ++++++++++--
 net/ipv4/ip_input.c   |  1 +
 net/ipv4/ip_output.c  |  1 +
 net/ipv4/route.c      | 13 +++++++++----
 net/ipv4/tcp_ipv4.c   |  5 ++++-
 net/ipv6/ip6_output.c |  1 +
 net/ipv6/route.c      | 13 +++++++++----
 net/ipv6/tcp_ipv6.c   |  5 ++++-
 9 files changed, 60 insertions(+), 16 deletions(-)

-- 
2.30.0.365.g02bc693789-goog