Date: Fri, 16 Dec 2016 22:49:16 +0530
From: "Naveen N. Rao"
To: Balbir Singh
Cc: Anju T Sudhakar, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, srikar@linux.vnet.ibm.com, mahesh@linux.vnet.ibm.com, paulus@samba.org, mhiramat@kernel.org, ananth@in.ibm.com
Subject: Re: [PATCH V2 0/4] OPTPROBES for powerpc
Message-Id: <20161216171916.GH4109@naverao1-tp.localdomain>
References: <1481732310-7779-1-git-send-email-anju@linux.vnet.ibm.com>

On 2016/12/17 01:46AM, Balbir Singh wrote:
> 
> 
> On 15/12/16 03:18, Anju T Sudhakar wrote:
> > This is the V2 patchset of the kprobes jump optimization
> > (a.k.a. OPTPROBES) for powerpc. Kprobes being an indispensable tool
> > for kernel developers, enhancing its performance is of much
> > importance.
> > 
> > Currently kprobes inserts a trap instruction to probe a running kernel.
> > Jump optimization allows kprobes to replace the trap with a branch,
> > reducing the probe overhead drastically.
> > 
> > In this series, conditional branch instructions are not considered for
> > optimization as they have to be assessed carefully in SMP systems.
> > 
> > The kprobe placed on the kretprobe_trampoline during boot is also
> > optimized in this series. Patch 4/4 provides this.
> > 
> > The first two patches can go independently of the series. The helper
> > functions in these patches are invoked in patch 3/4.
> > 
> > Performance:
> > ============
> > An optimized kprobe on powerpc is 1.05 to 4.7 times faster than a
> > regular kprobe.
> > 
> > Example:
> > 
> > Placed a probe at offset 0x50 in _do_fork().
> > *Time Diff here is the difference in time between just before hitting
> > the probe and just after the probed instruction. mftb() is employed
> > in kernel/fork.c for this purpose.
> > 
> > # echo 0 > /proc/sys/debug/kprobes-optimization
> > Kprobes globally unoptimized
> > [ 233.607120] Time Diff = 0x1f0
> > [ 233.608273] Time Diff = 0x1ee
> > [ 233.609228] Time Diff = 0x203
> > [ 233.610400] Time Diff = 0x1ec
> > [ 233.611335] Time Diff = 0x200
> > [ 233.612552] Time Diff = 0x1f0
> > [ 233.613386] Time Diff = 0x1ee
> > [ 233.614547] Time Diff = 0x212
> > [ 233.615570] Time Diff = 0x206
> > [ 233.616819] Time Diff = 0x1f3
> > [ 233.617773] Time Diff = 0x1ec
> > [ 233.618944] Time Diff = 0x1fb
> > [ 233.619879] Time Diff = 0x1f0
> > [ 233.621066] Time Diff = 0x1f9
> > [ 233.621999] Time Diff = 0x283
> > [ 233.623281] Time Diff = 0x24d
> > [ 233.624172] Time Diff = 0x1ea
> > [ 233.625381] Time Diff = 0x1f0
> > [ 233.626358] Time Diff = 0x200
> > [ 233.627572] Time Diff = 0x1ed
> > 
> > # echo 1 > /proc/sys/debug/kprobes-optimization
> > Kprobes globally optimized
> > [ 70.797075] Time Diff = 0x103
> > [ 70.799102] Time Diff = 0x181
> > [ 70.801861] Time Diff = 0x15e
> > [ 70.803466] Time Diff = 0xf0
> > [ 70.804348] Time Diff = 0xd0
> > [ 70.805653] Time Diff = 0xad
> > [ 70.806477] Time Diff = 0xe0
> > [ 70.807725] Time Diff = 0xbe
> > [ 70.808541] Time Diff = 0xc3
> > [ 70.810191] Time Diff = 0xc7
> > [ 70.811007] Time Diff = 0xc0
> > [ 70.812629] Time Diff = 0xc0
> > [ 70.813640] Time Diff = 0xda
> > [ 70.814915] Time Diff = 0xbb
> > [ 70.815726] Time Diff = 0xc4
> > [ 70.816955] Time Diff = 0xc0
> > [ 70.817778] Time Diff = 0xcd
> > [ 70.818999] Time Diff = 0xcd
> > [ 70.820099] Time Diff = 0xcb
> > [ 70.821333] Time Diff = 0xf0
> > 
> > Implementation:
> > ===================
> > 
> > The trap instruction is replaced by a branch to a detour buffer. To
> > address the range limitation of branch instructions on the Power
> > architecture, the detour buffer slot is allocated from a reserved
> > area. This ensures that the branch is within the ±32 MB range. The
> > current kprobes insn caches allocate the memory area for insn slots
> > with module_alloc(), which will always be beyond the ±32 MB range.
> > 
> 
> The paragraph is a little confusing. We need the detour buffer to be within
> +-32 MB, but then you say we always get memory from module_alloc() beyond
> 32MB.

Yes, I think it can be described better. What Anju is mentioning is that
the existing generic approach for the kprobes insn cache uses
module_alloc(), which is not suitable for us due to the ±32 MB range limit
of relative branches on powerpc. Instead, we reserve a 64k block within
.text and allocate the detour buffer from that area. This puts the detour
buffer in range for most of the symbols and should be a good start.
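Roughly, the approach looks like the sketch below -- this is just to
illustrate the shape of it; the names here (optinsn_slot,
ppc_alloc_insn_page, MAX_OPTINSN_SIZE) are illustrative and not
necessarily what the series ends up using:

#include <linux/kprobes.h>
#include <linux/list.h>
#include <linux/mutex.h>

/* 64k area reserved within .text (e.g. from an asm file or linker script) */
extern char optinsn_slot[];

static void *ppc_alloc_insn_page(void)
{
	/* hand out the reserved area instead of module_alloc() memory */
	return optinsn_slot;
}

static void ppc_free_insn_page(void *page)
{
	/* nothing to do - the area is statically reserved */
}

/* arch-private insn cache backed by the reserved area */
static struct kprobe_insn_cache kprobe_ppc_optinsn_slots = {
	.mutex		= __MUTEX_INITIALIZER(kprobe_ppc_optinsn_slots.mutex),
	.pages		= LIST_HEAD_INIT(kprobe_ppc_optinsn_slots.pages),
	.alloc		= ppc_alloc_insn_page,
	.free		= ppc_free_insn_page,
	.insn_size	= MAX_OPTINSN_SIZE,	/* size of one detour buffer */
	.nr_garbage	= 0,
};

Detour buffer slots would then be handed out via __get_insn_slot() /
__free_insn_slot() on this cache instead of coming from module space.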
> 
> > The detour buffer contains a call to optimized_callback() which in turn
> > calls the pre_handler(). Once the pre-handler is run, the original
> > instruction is emulated from the detour buffer itself. Also, the detour
> > buffer is equipped with a branch back to the normal work flow after the
> > probed instruction is emulated.
> 
> Does the branch itself use registers that need to be saved? I presume

No, we use immediate values to encode the relative address.

> we are going to rely on the +-32MB, what are the guarantees of success
> of such a mechanism?

We explicitly ensure that the return branch is within range as well during
registration (see the sketch at the end of this mail). In fact, this is one
of the reasons why we can't optimize conditional branches - we can't know
in advance where we need to jump back.

> 
> Balbir Singh.
> 

Thanks,
- Naveen
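For reference, the ±32 MB limit discussed above is simply the reach of a
powerpc relative branch, so the registration-time check boils down to
something like this (again a sketch; the helper name is illustrative):

#include <linux/types.h>

/*
 * A relative branch ('b'/'bl') encodes a 24-bit LI field shifted left
 * by two: a signed, word-aligned 26-bit displacement, i.e. it can only
 * reach -0x2000000 .. 0x1fffffc bytes from the branch instruction.
 */
static bool offset_in_branch_range(long offset)
{
	return (offset >= -0x2000000 && offset <= 0x1fffffc &&
		!(offset & 0x3));
}

Both hops have to pass this check when the probe is registered -- probed
address to detour buffer, and detour buffer back to the instruction
following the probed one -- otherwise the kprobe simply stays unoptimized.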