From: Daniel Bristot de Oliveira
To: linux-kernel@vger.kernel.org, x86@kernel.org
Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin",
    Greg Kroah-Hartman, Pavel Tatashin, Masami Hiramatsu,
    "Steven Rostedt (VMware)", Zhou Chengming, Jiri Kosina,
    Josh Poimboeuf, "Peter Zijlstra (Intel)", Chris von Recklinghausen,
    Jason Baron, Scott Wood, Marcelo Tosatti, Clark Williams
Subject: [RFC PATCH 0/6] x86/jump_label: Bound IPIs sent when updating a static key
Date: Mon, 8 Oct 2018 14:52:59 +0200

While tuning a system with CPUs isolated as much as possible, we
noticed that the isolated CPUs were periodically receiving bursts of
12 IPIs.

Tracing the functions that emit IPIs, we saw chronyd - an unprivileged
process - generating the IPIs when changing a static key to enable
network timestamping on sockets.

For instance, the trace showed:

  # trace-cmd record --func-stack -p function -l smp_call_function_single -e irq_vectors -f 'vector == 251'...
  # trace-cmd report...
  [...]
       chronyd-858   [000]   433.286406: function:             smp_call_function_single
       chronyd-858   [000]   433.286408: kernel_stack:
  => smp_call_function_many (ffffffffbc10ab22)
  => smp_call_function (ffffffffbc10abaa)
  => on_each_cpu (ffffffffbc10ac0b)
  => text_poke_bp (ffffffffbc02436a)
  => arch_jump_label_transform (ffffffffbc020ec8)
  => __jump_label_update (ffffffffbc1b300f)
  => jump_label_update (ffffffffbc1b30ed)
  => static_key_slow_inc (ffffffffbc1b334d)
  => net_enable_timestamp (ffffffffbc625c44)
  => sock_enable_timestamp (ffffffffbc6127d5)
  => sock_setsockopt (ffffffffbc612d2f)
  => SyS_setsockopt (ffffffffbc60d5e6)
  => tracesys (ffffffffbc7681c5)
        <idle>-0     [001]   433.286416: call_function_single_entry: vector=251
        <idle>-0     [001]   433.286419: call_function_single_exit: vector=251
  [...
The IPI takes place 12 times]

The static key in question was netstamp_needed_key. A static key
change from enabled->disabled/disabled->enabled causes the code to be
changed, and this is done in three steps:

-- Pseudo-code #1 - Current implementation ---
For each entry of the key to be updated:
    1) add an int3 trap to the address that will be patched
        sync cores (send IPI to all other CPUs)
    2) update all but the first byte of the patched range
        sync cores (send IPI to all other CPUs)
    3) replace the first byte (int3) by the first byte of replacing opcode
        sync cores (send IPI to all other CPUs)
-- Pseudo-code #1 ---

As the static key netstamp_needed_key has four entries (it is used in
four places in the code) in our kernel, 3 IPIs were generated for each
entry, resulting in 12 IPIs. The number of IPIs is therefore linear
with regard to the number 'n' of entries of a key: O(n*3), which is
O(n).

This algorithm works fine for the update of a key with a single entry.
But we think it is possible to optimize the case in which a static key
has more than one entry. For instance, the sched_schedstats jump label
has 56 entries in my (updated) fedora kernel, resulting in 168 IPIs
for each CPU in which the thread that is enabling is _not_ running.

In this patch set, rather than applying each update immediately, all
updates are queued first and then applied at once, rewriting the
pseudo-code #1 in this way:

-- Pseudo-code #2 - This patch ---
1) for each entry in the queue:
        add an int3 trap to the address that will be patched
   sync cores (send IPI to all other CPUs)

2) for each entry in the queue:
        update all but the first byte of the patched range
   sync cores (send IPI to all other CPUs)

3) for each entry in the queue:
        replace the first byte (int3) by the first byte of replacing opcode
   sync cores (send IPI to all other CPUs)
-- Pseudo-code #2 - This patch ---

Doing the update in this way, the number of IPIs becomes O(3) with
regard to the number of entries, which is O(1).
Currently, the jump label of a static key is transformed via the
arch-specific function:

    void arch_jump_label_transform(struct jump_entry *entry,
                                   enum jump_label_type type)

The new approach (batch mode) uses two arch functions. The first has
the same arguments as arch_jump_label_transform(), and is the
function:

    void arch_jump_label_transform_queue(struct jump_entry *entry,
                                         enum jump_label_type type)

Rather than transforming the code, it adds the jump_entry to a queue
of entries to be updated.

After queuing all jump_entries, the function:

    void arch_jump_label_transform_apply(void)

applies the changes in the queue.

One easy way to see the benefits of this patch is switching the
schedstats on and off. For instance:

-------------------------- %< ----------------------------
#!/bin/bash

while true; do
    sysctl -w kernel.sched_schedstats=1
    sleep 2
    sysctl -w kernel.sched_schedstats=0
    sleep 2
done
-------------------------- >% ----------------------------

while watching the IPI count:

-------------------------- %< ----------------------------
# watch -n1 "cat /proc/interrupts | grep Function"
-------------------------- >% ----------------------------

With the current mode, it is possible to see roughly 168 IPIs every 2
seconds, while with this patch the number of IPIs goes down to 3 every
2 seconds.

Although the motivation of this patch is to reduce the noise on
threads that are *not* causing the enabling/disabling of the static
key, counter-intuitively, this patch also improves the performance of
the enabling/disabling (slow) path of the thread that is actually
doing the change. The reason is that the costs of allocating memory,
manipulating the list and freeing memory are smaller than the cost of
sending IPIs. For example, in a system with 24 CPUs, the current cost
of enabling the schedstats key is 170000-ish us, while with this
patch, it decreases to 2200-ish us.

This is an RFC, so comments and criticism about things I am missing
are more than welcome.
The batch of operations was suggested by Scott Wood.

Daniel Bristot de Oliveira (6):
  jump_label: Add for_each_label_entry helper
  jump_label: Add the jump_label_can_update_check() helper
  x86/jump_label: Move checking code away from __jump_label_transform()
  x86/jump_label: Add __jump_label_set_jump_code() helper
  x86/alternative: Split text_poke_bp() into tree steps
  x86/jump_label,x86/alternatives: Batch jump label transformations

 arch/x86/include/asm/jump_label.h    |   2 +
 arch/x86/include/asm/text-patching.h |   9 ++
 arch/x86/kernel/alternative.c        | 115 ++++++++++++++++---
 arch/x86/kernel/jump_label.c         | 161 ++++++++++++++++++++-------
 include/linux/jump_label.h           |   8 ++
 kernel/jump_label.c                  |  46 ++++++--
 6 files changed, 273 insertions(+), 68 deletions(-)

-- 
2.17.1