From: Andrii Nakryiko
Date: Fri, 9 Sep 2022 15:45:13 -0700
Subject: Re: [PATCH v3 2/4] bpf: Add bpf_user_ringbuf_drain() helper
To: David Vernet <void@manifault.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, andrii@kernel.org, daniel@iogearbox.net,
        kernel-team@fb.com, martin.lau@linux.dev, song@kernel.org, yhs@fb.com,
        john.fastabend@gmail.com, kpsingh@kernel.org, sdf@google.com,
        haoluo@google.com, jolsa@kernel.org, joannelkoong@gmail.com,
        tj@kernel.org, linux-kernel@vger.kernel.org
References: <20220818221212.464487-1-void@manifault.com> <20220818221212.464487-3-void@manifault.com>
On Tue, Aug 30, 2022 at 6:28 AM David Vernet <void@manifault.com> wrote:
>
> On Wed, Aug 24, 2022 at 02:22:44PM -0700, Andrii Nakryiko wrote:
> > > +/* Maximum number of user-producer ringbuffer samples that can be drained in
> > > + * a call to bpf_user_ringbuf_drain().
> > > + */
> > > +#define BPF_MAX_USER_RINGBUF_SAMPLES BIT(17)
> >
> > nit: I don't think using BIT() is appropriate here. 128 * 1024 would
> > be better, IMO. This is not inherently required to be a single-bit
> > constant.
>
> No problem, updated.
>
> > > +
> > >  static inline u32 bpf_map_flags_to_cap(struct bpf_map *map)
> > >  {
> > >         u32 access_flags = map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG);
> > > @@ -2411,6 +2417,7 @@ extern const struct bpf_func_proto bpf_loop_proto;
> > >  extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> > >  extern const struct bpf_func_proto bpf_set_retval_proto;
> > >  extern const struct bpf_func_proto bpf_get_retval_proto;
> > > +extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> [...]
> > > +
> > > +static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t size, u64 flags)
> > > +{
> > > +       u64 producer_pos, consumer_pos;
> > > +
> > > +       /* Synchronizes with smp_store_release() in user-space producer. */
> > > +       producer_pos = smp_load_acquire(&rb->producer_pos);
> > > +
> > > +       /* Using smp_load_acquire() is unnecessary here, as the busy-bit
> > > +        * prevents another task from writing to consumer_pos after it was read
> > > +        * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek().
> > > +        */
> > > +       consumer_pos = rb->consumer_pos;
> > > +       /* Synchronizes with smp_load_acquire() in user-space producer. */
> > > +       smp_store_release(&rb->consumer_pos, consumer_pos + size + BPF_RINGBUF_HDR_SZ);
> > > +
> > > +       /* Prevent the clearing of the busy-bit from being reordered before the
> > > +        * storing of the updated rb->consumer_pos value.
> > > +        */
> > > +       smp_mb__before_atomic();
> > > +       atomic_set(&rb->busy, 0);
> > > +
> > > +       if (!(flags & BPF_RB_NO_WAKEUP)) {
> > > +               /* As a heuristic, if the previously consumed sample caused the
> > > +                * ringbuffer to no longer be full, send an event notification
> > > +                * to any user-space producer that is epoll-waiting.
> > > +                */
> > > +               if (producer_pos - consumer_pos == ringbuf_total_data_sz(rb))
> >
> > I'm a bit confused here. This will be true only if user-space producer
> > filled out entire ringbuf data *exactly* to the last byte with a
> > single record. Or am I misunderstanding this?
>
> I think you're misunderstanding. This will indeed only be true if the ring
> buffer was full (to the last byte as you said) before the last sample was
> consumed, but it doesn't have to have been filled with a single record.
> We're just checking that producer_pos - consumer_pos is the total size of
> the ring buffer, but there can be many samples between consumer_pos and
> producer_pos for that to be the case.
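(To make the byte arithmetic concrete, here is a hedged, stand-alone model
of that check -- the 4096-byte data area and the sample sizes below are
invented for illustration, and this is deliberately not the kernel code:)

#include <assert.h>
#include <stdint.h>

#define DATA_SZ 4096 /* stands in for ringbuf_total_data_sz(rb) */
#define HDR_SZ  8    /* stands in for BPF_RINGBUF_HDR_SZ */

/* Positions are monotonically increasing byte offsets, as in the patch. */
static int was_exactly_full(uint64_t producer_pos, uint64_t consumer_pos)
{
	return producer_pos - consumer_pos == DATA_SZ;
}

int main(void)
{
	/* Four 1016-byte samples: with the 8-byte header each record
	 * occupies 1024 bytes, so four records fill the 4096-byte data
	 * area to the last byte, and the wakeup condition holds even
	 * though no single record filled the buffer.
	 */
	assert(was_exactly_full(4 * (HDR_SZ + 1016), 0));

	/* With 1000-byte samples (1008-byte records), four records leave
	 * 64 bytes unused, so the heuristic never fires.
	 */
	assert(!was_exactly_full(4 * (HDR_SZ + 1000), 0));
	return 0;
}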
you are right, never mind the single-sample part, but I don't think
that's the important part (just something that surprised me and makes
everything even less realistic)

> > If my understanding is correct, how is this a realistic use case and
> > how does this heuristic help at all?
>
> Though I think you may have misunderstood the heuristic, some more
> explanation is probably warranted nonetheless. This heuristic being useful
> relies on two assumptions:
>
> 1. It will be common for user-space to publish statically sized samples.
>
> I think this one is pretty unambiguously true, especially considering that
> BPF_MAP_TYPE_RINGBUF was put to great use with statically sized samples for
> quite some time. I'm open to hearing why that might not be the case.

True, the majority of use cases for BPF ringbuf were fixed-sized, thanks
to the convenience of the reserve/commit API. But the data structure
itself allows variable-sized samples and there are use cases doing this,
plus with dynptr it's now easier to do variable-sized samples efficiently.
So special-casing fixed-sized samples is a bit off, especially
considering #2.

> > 2. The size of the ring buffer is a multiple of the size of a sample.
>
> This one I think is a bit less clear. Users can always size the ring buffer
> to make sure this will be the case, but whether or not that will be
> commonly done is another story.

So I'm almost certain this won't be the case. I don't think anyone is
going to track the exact size of a sample's struct (which will most
probably change over time), and sizing the ringbuf to be both a power-of-2
multiple of page_size *and* a multiple of sizeof(struct my_ringbuf_sample)
is something I don't see anyone doing.

> I'm fine with removing this heuristic for now if it's unclear that it's
> serving a common use-case. We can always add it back in later if we want
> to.

Yes, this looks quite out of place with a bunch of optimistic but
unrealistic assumptions. Doing one notification after drain will be fine
for now, IMO.

>
> > > +                       irq_work_queue(&rb->work);
> > > +
> > > +       }
> > > +}
> > > +
> > > +BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
> > > +          void *, callback_fn, void *, callback_ctx, u64, flags)
> > > +{
> > > +       struct bpf_ringbuf *rb;
> > > +       long num_samples = 0, ret = 0;
> > > +       bpf_callback_t callback = (bpf_callback_t)callback_fn;
> > > +       u64 wakeup_flags = BPF_RB_NO_WAKEUP;
> > > +
> > > +       if (unlikely(flags & ~wakeup_flags))
>
> [...]
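(For readers following along, here is a hedged sketch of how a BPF program
might consume user-ringbuf samples with the helper above, based on the API
added in this series -- the map name, sample struct, and attach point are
made up for illustration:)

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_USER_RINGBUF);
	/* must be a power-of-2 multiple of the page size */
	__uint(max_entries, 256 * 1024);
} user_rb SEC(".maps");

/* hypothetical sample layout shared with the user-space producer */
struct user_sample {
	int pid;
	char comm[16];
};

static long handle_sample(struct bpf_dynptr *dynptr, void *ctx)
{
	struct user_sample s;

	/* copy the sample out of the dynptr handed to the callback */
	if (bpf_dynptr_read(&s, sizeof(s), dynptr, 0, 0))
		return 0; /* skip short samples, keep draining */

	bpf_printk("drained sample from pid %d", s.pid);
	return 0; /* 0 = keep draining, 1 = stop early */
}

SEC("tp/syscalls/sys_enter_getpgid") /* arbitrary attach point for the sketch */
int drain_user_samples(void *ctx)
{
	/* returns the number of samples consumed, or a negative error
	 * such as -EBUSY if another drain of this ringbuf is in flight */
	bpf_user_ringbuf_drain(&user_rb, handle_sample, NULL, 0);
	return 0;
}

On the user-space side, the libbpf changes in this series pair this with
user_ring_buffer__reserve() and user_ring_buffer__submit(), which advance
producer_pos with the release semantics that the smp_load_acquire() in the
kernel consumer above synchronizes with.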