Received: by 2002:a25:ad19:0:0:0:0:0 with SMTP id y25csp10415879ybi; Wed, 24 Jul 2019 22:49:40 -0700 (PDT) X-Google-Smtp-Source: APXvYqy8aLhiHwkhTMIXxqR7om684m4sHa5yyra05GDGtGuuU9z0QNSK5VnrZaHEydbgvCI8gaS+ X-Received: by 2002:a63:66c5:: with SMTP id a188mr84402176pgc.127.1564033780561; Wed, 24 Jul 2019 22:49:40 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1564033780; cv=none; d=google.com; s=arc-20160816; b=OD77jCAlkLppXxMn4X5PxVuUjkReZntg1Pa7gZJykRG5L4DMGRHZi6B/hvR2EOMHJi 5oEp0sgKRY/YJ03LRmDskcmtxVIWLXjxtTH3NrNqB2b9NoyCeCNmgk7tw60RbgPylcHU aUynqiI6DBHslZeW6t7ykEaFyiKGK/3rvISbFx3mQ0KF+lgS6tFY2+HGqqM5gakAD9qX 7r0mVEtSmUYgTb1jWgfF+zssqjOQTlm4Bze+cXxwlRkSOlrEVOmWWnjOOSxEd6EOZhEB t+8W2ML5YWSE0UCLscvK1cmskZDQlxS1bS8srtrg4YMxJISAIhiyQnFLPr0j5dKxw9Gp 4xlA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=R7rWft6F3bA/tG51c+ZfWgjzu6JzJyAuVH+CKO1ZKEs=; b=ZtqXa+KAt++f4cPYwbz0hI3IillU/o1TCT2cnzLTzqgTN+sOZKKJyce3FgWw4oYcUr Jag8iffchd+c47XyMJyPQB9CghbPMyNysS8c1r1TsFuSq0TL17/kw1aULNcwbSjqw1dW ZMGMVwU2U5hBmNkbYhiiGOR8al4bFJtQdCEUAnjYED1SinX6AxukqaPoRawVSTPLENmn 4LvnmKN8WA+QaKt1M49FdLRJzIROxD9yY2k4IKEOdvx/CuyAibdLa0FENjbf18W4ol4L iRsaHa6DK8htATT0dSrxebj3C/wHM9S2wObaX9VceVH2IcwFV39BaA7L0Nb969JQLPu5 ipdQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=gmAgAkG9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id h189si15797239pgc.236.2019.07.24.22.49.26; Wed, 24 Jul 2019 22:49:40 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20161025 header.b=gmAgAkG9; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729663AbfGXXFC (ORCPT + 99 others); Wed, 24 Jul 2019 19:05:02 -0400 Received: from mail-qt1-f194.google.com ([209.85.160.194]:37729 "EHLO mail-qt1-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726558AbfGXXFB (ORCPT ); Wed, 24 Jul 2019 19:05:01 -0400 Received: by mail-qt1-f194.google.com with SMTP id y26so47239131qto.4; Wed, 24 Jul 2019 16:05:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=R7rWft6F3bA/tG51c+ZfWgjzu6JzJyAuVH+CKO1ZKEs=; b=gmAgAkG9blbkmEcLcxNuRMj67eOiDugp5u0rYEyqC581XKIXAAcCgYGONE2MK1Peol 9JZwpj2FKEvzrCdU+8CAHRtu34WvauxS2zTZu1uUtHoNlBYeAWQ7U//qd+Sx3uZfdk70 a1L3waOw/IfgwgJLjPkUrcEM1gYfjNxImetwrs+i9djjZZm2NQJgFLtMk1HJf2M5HDwN U7VtzLa2+OkuNP/tzgOpxO5R9gWo7ZVbTACkF13I5hJOBtk7LHEHU5kJozlaAqKAA9Ti O0F2PrKmKCm3T+jzVEpOvNdE0gpDKD0jBpsEYAyJgCxryj1wiHCK70yT3YxFAbbZJu6W zOVQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=R7rWft6F3bA/tG51c+ZfWgjzu6JzJyAuVH+CKO1ZKEs=; b=ZDYaxJoZvANrUKtin71Ty+60y+52e2gQfXjs6gh3iCzPS6tKvY2hqi7fCebh7bDpDD enGboK+Y9cuZgISLO0ZqtV68bkYQ/66mffsl+7zurad45q8TACkYVnblDcEY2Xgn7pi6 tc5dFLBQJ0eJK5kLL90SJYHx5HYiFuhPP3LkoUTah3j6c/1L4wHMXQBV+RxuanTg+Doz KJVS06/FJRHA+fGyfarl/K4+jF0q1AjLTV+xkG4v6/OIPB/n+fLeMHrgEty70LobV7XU mAujEn/Ff0nNBEIfU250aRqmNTd6V3Bzpf7NKIXxXTBbNzT3IZoi39AXmln0kyWAArkw 63Ig== X-Gm-Message-State: APjAAAWQMRvUJPWc62IsCK74y3S9jNJMrVknacE6qX8cwBE40MlvS/dw 9EouPImPSQziwWDk8THzPQ5brqx1IpjpmjTj328= X-Received: by 2002:ac8:6a17:: with SMTP id t23mr57874962qtr.183.1564009500202; Wed, 24 Jul 2019 16:05:00 -0700 (PDT) MIME-Version: 1.0 References: <20190724165803.87470-1-brianvv@google.com> <20190724165803.87470-3-brianvv@google.com> In-Reply-To: From: Song Liu Date: Wed, 24 Jul 2019 16:04:48 -0700 Message-ID: Subject: Re: [PATCH bpf-next 2/6] bpf: add BPF_MAP_DUMP command to dump more than one entry per call To: Brian Vazquez Cc: Brian Vazquez , Alexei Starovoitov , Daniel Borkmann , "David S . Miller" , Stanislav Fomichev , Willem de Bruijn , Petar Penkov , open list , Networking , bpf Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jul 24, 2019 at 3:44 PM Brian Vazquez wrote: > > On Wed, Jul 24, 2019 at 2:40 PM Song Liu wrote: > > > > On Wed, Jul 24, 2019 at 10:10 AM Brian Vazquez wrote: > > > > > > This introduces a new command to retrieve multiple number of entries > > > from a bpf map, wrapping the existing bpf methods: > > > map_get_next_key and map_lookup_elem > > > > > > To start dumping the map from the beginning you must specify NULL as > > > the prev_key. > > > > > > The new API returns 0 when it successfully copied all the elements > > > requested or it copied less because there weren't more elements to > > > retrieved (i.e err == -ENOENT). In last scenario err will be masked to 0. > > > > > > On a successful call buf and buf_len will contain correct data and in > > > case prev_key was provided (not for the first walk, since prev_key is > > > NULL) it will contain the last_key copied into the prev_key which will > > > simplify next call. > > > > > > Only when it can't find a single element it will return -ENOENT meaning > > > that the map has been entirely walked. When an error is return buf, > > > buf_len and prev_key shouldn't be read nor used. > > > > > > Because maps can be called from userspace and kernel code, this function > > > can have a scenario where the next_key was found but by the time we > > > try to retrieve the value the element is not there, in this case the > > > function continues and tries to get a new next_key value, skipping the > > > deleted key. If at some point the function find itself trap in a loop, > > > it will return -EINTR. > > > > > > The function will try to fit as much as possible in the buf provided and > > > will return -EINVAL if buf_len is smaller than elem_size. > > > > > > QUEUE and STACK maps are not supported. > > > > > > Note that map_dump doesn't guarantee that reading the entire table is > > > consistent since this function is always racing with kernel and user code > > > but the same behaviour is found when the entire table is walked using > > > the current interfaces: map_get_next_key + map_lookup_elem. > > > It is also important to note that with a locked map, the lock is grabbed > > > for 1 entry at the time, meaning that the returned buf might or might not > > > be consistent. > > > > > > Suggested-by: Stanislav Fomichev > > > Signed-off-by: Brian Vazquez > > > --- > > > include/uapi/linux/bpf.h | 9 +++ > > > kernel/bpf/syscall.c | 117 +++++++++++++++++++++++++++++++++++++++ > > > 2 files changed, 126 insertions(+) > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > index fa1c753dcdbc7..66dab5385170d 100644 > > > --- a/include/uapi/linux/bpf.h > > > +++ b/include/uapi/linux/bpf.h > > > @@ -106,6 +106,7 @@ enum bpf_cmd { > > > BPF_TASK_FD_QUERY, > > > BPF_MAP_LOOKUP_AND_DELETE_ELEM, > > > BPF_MAP_FREEZE, > > > + BPF_MAP_DUMP, > > > }; > > > > > > enum bpf_map_type { > > > @@ -388,6 +389,14 @@ union bpf_attr { > > > __u64 flags; > > > }; > > > > > > + struct { /* struct used by BPF_MAP_DUMP command */ > > > + __aligned_u64 prev_key; > > > + __aligned_u64 buf; > > > + __aligned_u64 buf_len; /* input/output: len of buf */ > > > + __u64 flags; > > > > Please add explanation of flags here. > > got it! > > > Also, we need to update the > > comments of BPF_F_LOCK for BPF_MAP_DUMP. > > What do you mean? I didn't get this part. I meant, current comment says BPF_F_LOCK is for BPF_MAP_UPDATE_ELEM. But it is also used by BPF_MAP_LOOKUP_ELEM and BPF_MAP_DUMP. So current comment is not accurate either. Maybe fix it while you are on it? > > > > > > + __u32 map_fd; > > > + } dump; > > > + > > > struct { /* anonymous struct used by BPF_PROG_LOAD command */ > > > __u32 prog_type; /* one of enum bpf_prog_type */ > > > __u32 insn_cnt; > > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > > index 86cdc2f7bb56e..0c35505aa219f 100644 > > > --- a/kernel/bpf/syscall.c > > > +++ b/kernel/bpf/syscall.c > > > @@ -1097,6 +1097,120 @@ static int map_get_next_key(union bpf_attr *attr) > > > return err; > > > } > > > > > > +/* last field in 'union bpf_attr' used by this command */ > > > +#define BPF_MAP_DUMP_LAST_FIELD dump.map_fd > > > + > > > +static int map_dump(union bpf_attr *attr) > > > +{ > > > + void __user *ukey = u64_to_user_ptr(attr->dump.prev_key); > > > + void __user *ubuf = u64_to_user_ptr(attr->dump.buf); > > > + u32 __user *ubuf_len = u64_to_user_ptr(attr->dump.buf_len); > > > + int ufd = attr->dump.map_fd; > > > + struct bpf_map *map; > > > + void *buf, *prev_key, *key, *value; > > > + u32 value_size, elem_size, buf_len, cp_len; > > > + struct fd f; > > > + int err; > > > + bool first_key = false; > > > + > > > + if (CHECK_ATTR(BPF_MAP_DUMP)) > > > + return -EINVAL; > > > + > > > + if (attr->dump.flags & ~BPF_F_LOCK) > > > + return -EINVAL; > > > + > > > + f = fdget(ufd); > > > + map = __bpf_map_get(f); > > > + if (IS_ERR(map)) > > > + return PTR_ERR(map); > > > + if (!(map_get_sys_perms(map, f) & FMODE_CAN_READ)) { > > > + err = -EPERM; > > > + goto err_put; > > > + } > > > + > > > + if ((attr->dump.flags & BPF_F_LOCK) && > > > + !map_value_has_spin_lock(map)) { > > > + err = -EINVAL; > > > + goto err_put; > > > + } > > > > We can share these lines with map_lookup_elem(). Maybe > > add another helper function? > > Which are the lines you are referring to? the dump.flags? It makes > sense so that way when a new flag is added you only need to modify > them in one spot. I think I misread it. attr->dump.flags is not same as attr->flags. So never mind. > > > > + > > > + if (map->map_type == BPF_MAP_TYPE_QUEUE || > > > + map->map_type == BPF_MAP_TYPE_STACK) { > > > + err = -ENOTSUPP; > > > + goto err_put; > > > + } > > > + > > > + value_size = bpf_map_value_size(map); > > > + > > > + err = get_user(buf_len, ubuf_len); > > > + if (err) > > > + goto err_put; > > > + > > > + elem_size = map->key_size + value_size; > > > + if (buf_len < elem_size) { > > > + err = -EINVAL; > > > + goto err_put; > > > + } > > > + > > > + if (ukey) { > > > + prev_key = __bpf_copy_key(ukey, map->key_size); > > > + if (IS_ERR(prev_key)) { > > > + err = PTR_ERR(prev_key); > > > + goto err_put; > > > + } > > > + } else { > > > + prev_key = NULL; > > > + first_key = true; > > > + } > > > + > > > + err = -ENOMEM; > > > + buf = kmalloc(elem_size, GFP_USER | __GFP_NOWARN); > > > + if (!buf) > > > + goto err_put; > > > + > > > + key = buf; > > > + value = key + map->key_size; > > > + for (cp_len = 0; cp_len + elem_size <= buf_len;) { > > > + if (signal_pending(current)) { > > > + err = -EINTR; > > > + break; > > > + } > > > + > > > + rcu_read_lock(); > > > + err = map->ops->map_get_next_key(map, prev_key, key); > > > > If prev_key is deleted before map_get_next_key(), we get the first key > > again. This is pretty weird. > > Yes, I know. But note that the current scenario happens even for the > old interface (imagine you are walking a map from userspace and you > tried get_next_key the prev_key was removed, you will start again from > the beginning without noticing it). > I tried to sent a patch in the past but I was missing some context: > before NULL was used to get the very first_key the interface relied in > a random (non existent) key to retrieve the first_key in the map, and > I was told what we still have to support that scenario. BPF_MAP_DUMP is slightly different, as you may return the first key multiple times in the same call. Also, BPF_MAP_DUMP is new, so we don't have to support legacy scenarios. Since BPF_MAP_DUMP keeps a list of elements. It is possible to try to look up previous keys. Would something down this direction work? Thanks, Song