From: Matt Mullins
CC: Steven Rostedt, Ingo Molnar, Martin KaFai Lau, Song Liu, Yonghong Song
Subject: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
Date: Tue, 11 Jun 2019 14:53:04 -0700
Message-ID: <20190611215304.28831-1-mmullins@fb.com>

BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This means up to three levels of nesting must be supported:
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins
---
v1->v2:
  * reverse-Christmas-tree-ize the declarations in bpf_perf_event_output
  * instantiate err more readably

I've done additional testing with the original workload that hit the
irq+raw-tp reentrancy problem, and as far as I can tell, it's still solved
with this approach (as opposed to my earlier per-map-element version).
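To make the approach concrete before the diff: below is a minimal user-space
sketch of the same increment-on-entry/decrement-on-exit pattern the patch
applies to the per-CPU perf_sample_data and pt_regs buffers. It is
illustrative only; the names (scratch, nest_level, MAX_NEST, emit_event) are
invented for this sketch and do not appear in the kernel.

#include <stdio.h>

#define MAX_NEST 3	/* task, irq, and nmi context */

static char scratch[MAX_NEST][64];	/* stands in for the per-CPU sds[3] / regs[3] arrays */
static int nest_level;			/* stands in for the per-CPU *_nest_level counters */

static int emit_event(const char *msg, int depth)
{
	char *slot;
	int err = 0;

	if (++nest_level > MAX_NEST) {	/* deeper than expected: refuse, like the -EBUSY path */
		err = -1;
		goto out;
	}
	slot = scratch[nest_level - 1];	/* private buffer for this nesting level */
	snprintf(slot, sizeof(scratch[0]), "%s (level %d)", msg, nest_level);

	if (depth)			/* simulate an irq/nmi handler re-entering on the same CPU */
		emit_event("nested", depth - 1);

	printf("%s\n", slot);		/* our slot was not clobbered by the nested call */
out:
	nest_level--;
	return err;
}

int main(void)
{
	return emit_event("outer", 2) ? 1 : 0;
}

With a single shared buffer, every level would print whatever the innermost
call wrote last; giving each nesting level its own slot is exactly what the
patch below does for perf_sample_data and pt_regs.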
 kernel/trace/bpf_trace.c | 100 ++++++++++++++++++++++++++++++++-------
 1 file changed, 84 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..1c9a4745e596 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -410,8 +410,6 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
-
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
 			u64 flags, struct perf_sample_data *sd)
@@ -442,24 +440,50 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
 	return perf_event_output(event, sd, regs);
 }
 
+/*
+ * Support executing tracepoints in normal, irq, and nmi context that each call
+ * bpf_perf_event_output
+ */
+struct bpf_trace_sample_data {
+	struct perf_sample_data sds[3];
+};
+
+static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
+static DEFINE_PER_CPU(int, bpf_trace_nest_level);
+
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map,
 	   u64, flags, void *, data, u64, size)
 {
-	struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd);
+	struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
+	int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
 	struct perf_raw_record raw = {
 		.frag = {
 			.size = size,
 			.data = data,
 		},
 	};
+	struct perf_sample_data *sd;
+	int err;
 
-	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-		return -EINVAL;
+	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	sd = &sds->sds[nest_level - 1];
+
+	if (unlikely(flags & ~(BPF_F_INDEX_MASK))) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	perf_sample_data_init(sd, 0, 0);
 	sd->raw = &raw;
 
-	return __bpf_perf_event_output(regs, map, flags, sd);
+	err = __bpf_perf_event_output(regs, map, flags, sd);
+
+out:
+	this_cpu_dec(bpf_trace_nest_level);
+	return err;
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
@@ -822,16 +846,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack.
+ *
+ * Since raw tracepoints run despite bpf_prog_active, support concurrent usage
+ * in normal, irq, and nmi context.
  */
-static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+struct bpf_raw_tp_regs {
+	struct pt_regs regs[3];
+};
+static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs);
+static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level);
+static struct pt_regs *get_bpf_raw_tp_regs(void)
+{
+	struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);
+
+	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
+		this_cpu_dec(bpf_raw_tp_nest_level);
+		return ERR_PTR(-EBUSY);
+	}
+
+	return &tp_regs->regs[nest_level - 1];
+}
+
+static void put_bpf_raw_tp_regs(void)
+{
+	this_cpu_dec(bpf_raw_tp_nest_level);
+}
+
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
 {
-	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	struct pt_regs *regs = get_bpf_raw_tp_regs();
+	int ret;
+
+	if (IS_ERR(regs))
+		return PTR_ERR(regs);
 
 	perf_fetch_caller_regs(regs);
-	return ____bpf_perf_event_output(regs, map, flags, data, size);
+	ret = ____bpf_perf_event_output(regs, map, flags, data, size);
+
+	put_bpf_raw_tp_regs();
+	return ret;
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = {
@@ -848,12 +904,18 @@ static const struct bpf_func_proto bpf_perf_event_output_proto_raw_tp = {
 BPF_CALL_3(bpf_get_stackid_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	   struct bpf_map *, map, u64, flags)
 {
-	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	struct pt_regs *regs = get_bpf_raw_tp_regs();
+	int ret;
+
+	if (IS_ERR(regs))
+		return PTR_ERR(regs);
 
 	perf_fetch_caller_regs(regs);
 	/* similar to bpf_perf_event_output_tp, but pt_regs fetched differently */
-	return bpf_get_stackid((unsigned long) regs, (unsigned long) map,
-			       flags, 0, 0);
+	ret = bpf_get_stackid((unsigned long) regs, (unsigned long) map,
+			      flags, 0, 0);
+	put_bpf_raw_tp_regs();
+	return ret;
 }
 
 static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
@@ -868,11 +930,17 @@ static const struct bpf_func_proto bpf_get_stackid_proto_raw_tp = {
 BPF_CALL_4(bpf_get_stack_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	   void *, buf, u32, size, u64, flags)
 {
-	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	struct pt_regs *regs = get_bpf_raw_tp_regs();
+	int ret;
+
+	if (IS_ERR(regs))
+		return PTR_ERR(regs);
 
 	perf_fetch_caller_regs(regs);
-	return bpf_get_stack((unsigned long) regs, (unsigned long) buf,
-			     (unsigned long) size, flags, 0);
+	ret = bpf_get_stack((unsigned long) regs, (unsigned long) buf,
+			    (unsigned long) size, flags, 0);
+	put_bpf_raw_tp_regs();
+	return ret;
 }
 
 static const struct bpf_func_proto bpf_get_stack_proto_raw_tp = {
-- 
2.17.1
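
For context (not part of the patch): a hypothetical BPF-side program of the
kind that exercises bpf_perf_event_output_raw_tp above. The map and program
names are invented, and the includes assume the bpf_helpers.h shipped with the
kernel selftests.

// SPDX-License-Identifier: GPL-2.0
#include <linux/types.h>
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") events = {
	.type        = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size    = sizeof(int),
	.value_size  = sizeof(__u32),
	.max_entries = 128,	/* typically one slot per possible CPU */
};

SEC("raw_tracepoint/sched_switch")
int on_sched_switch(struct bpf_raw_tracepoint_args *ctx)
{
	__u64 ts = bpf_ktime_get_ns();

	/* A nested program on the same CPU (e.g. attached to a perf NMI)
	 * now gets its own per-CPU pt_regs/perf_sample_data slot instead
	 * of clobbering ours. */
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      &ts, sizeof(ts));
	return 0;
}

char _license[] SEC("license") = "GPL";

Before this change, two such programs nesting on one CPU shared a single
per-CPU pt_regs and perf_sample_data and could corrupt each other's data;
with get_bpf_raw_tp_regs()/put_bpf_raw_tp_regs() and the nest-level counters,
each level gets its own slot, and a fourth level fails with -EBUSY instead of
corrupting memory.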