Message-ID: <55AFB2E0.5060307@iogearbox.net>
Date: Wed, 22 Jul 2015 17:12:32 +0200
From: Daniel Borkmann <daniel@iogearbox.net>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>, Silvan Jegen <s.jegen@gmail.com>,
        linux-man@vger.kernel.org, linux-kernel@vger.kernel.org,
        Walter Harms <wharms@bfs.de>
Subject: Re: Edited draft of bpf(2) man page for review/enhancement
References: <556583B4.4000607@gmail.com> <55658DB4.6000106@iogearbox.net> <55AFAD7A.6090009@gmail.com>
In-Reply-To: <55AFAD7A.6090009@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7075
Lines: 180

On 07/22/2015 04:49 PM, Michael Kerrisk (man-pages) wrote:
> Hi Daniel,
>
> Sorry for the long delay in following up....

No worries, eBPF is quite a lot of material. ;)

> On 05/27/2015 11:26 AM, Daniel Borkmann wrote:
>> On 05/27/2015 10:43 AM, Michael Kerrisk (man-pages) wrote:
>>> Hello Alexei,
>>>
>>> I took the draft 3 of the bpf(2) man page that you sent back in March
>>> and did some substantial editing to clarify the language and add a
>>> few technical details. Could you please check the revised version
>>> below, to ensure I did not inject any errors.
>>>
>>> I also added a number of FIXMEs for pieces of the page that need
>>> further work. Could you take a look at these and let me know your
>>> thoughts, please.
>>
>> That's great, thanks! Minor comments:
>>
>> ...
>>> .TH BPF 2 2015-03-10 "Linux" "Linux Programmer's Manual"
>>> .SH NAME
>>> bpf - perform a command on an extended BPF map or program
>>> .SH SYNOPSIS
>>> .nf
>>> .B #include <linux/bpf.h>
>>> .sp
>>> .BI "int bpf(int cmd, union bpf_attr *attr, unsigned int size);"
>>>
>>> .SH DESCRIPTION
>>> The
>>> .BR bpf ()
>>> system call performs a range of operations related to extended
>>> Berkeley Packet Filters.
>>> Extended BPF (or eBPF) is similar to
>>> the original BPF (or classic BPF) used to filter network packets.
>>> For both BPF and eBPF programs,
>>> the kernel statically analyzes the programs before loading them,
>>> in order to ensure that they cannot harm the running system.
>>> .P
>>> eBPF extends classic BPF in multiple ways including the ability to call
>>> in-kernel helper functions (via the
>>> .B BPF_CALL
>>> opcode extension provided by eBPF)
>>> and access shared data structures such as BPF maps.
>>
>> I would perhaps emphasize that maps can be shared among in-kernel
>> eBPF programs, but also between kernel and user space.
>
> This is covered later in the page, under the "BPF maps" subheading.
> Maybe you missed that? (Or did you think it doesn't suffice?)

Okay, I presume you mean:

   Maps are a generic data structure for storage of different types
   and sharing data between the kernel and user-space programs.

Maybe, to emphasize both options a bit (not sure if it's better in
my words, though):

   Maps are a generic data structure for storage of different types
   and allow for sharing data among eBPF kernel programs, but also
   between kernel and user-space applications.
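
To make the user-space half of that sharing a bit more concrete, here
is a minimal sketch, not a definitive implementation. It assumes a
Linux system with <linux/bpf.h> and __NR_bpf available and sufficient
privileges; the helper names (sys_bpf, map_create, map_update,
map_lookup) are made up for illustration and are not part of any
library:

```c
/* Sketch only: user-space access to an eBPF map (Linux-specific). */
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: no libc wrapper for bpf(), so go through syscall(2). */
static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
	return (int)syscall(__NR_bpf, cmd, attr, size);
}

/* Hypothetical helper: create a hash map.
 * Returns a map fd, or -1 with errno set (e.g. EPERM when unprivileged). */
static int map_create(uint32_t key_size, uint32_t value_size,
		      uint32_t max_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));	/* fields not used must be zero */
	attr.map_type = BPF_MAP_TYPE_HASH;
	attr.key_size = key_size;
	attr.value_size = value_size;
	attr.max_entries = max_entries;
	return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* Hypothetical helper: create or overwrite the entry for 'key'. */
static int map_update(int fd, const void *key, const void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = (uint32_t)fd;
	attr.key = (uint64_t)(uintptr_t)key;
	attr.value = (uint64_t)(uintptr_t)value;
	attr.flags = BPF_ANY;
	return sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

/* Hypothetical helper: copy the value stored under 'key' into 'value'. */
static int map_lookup(int fd, const void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = (uint32_t)fd;
	attr.key = (uint64_t)(uintptr_t)key;
	attr.value = (uint64_t)(uintptr_t)value;
	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

An eBPF program that references the same map would read and write the
same entries from inside the kernel; that side is not shown here.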

>>> The programs can be written in a restricted C that is compiled into
>>> .\" FIXME In the next line, what is "a restricted C"? Where does
>>> .\"       one get further information about it?
>>
>> So far only from the kernel samples directory and for tc classifier
>> and action, from the tc man page and/or examples/bpf/ in the tc git
>> tree.
>
> So, given that we are several weeks down the track, and things may have
> changed, I'll re-ask the questions ;-) :
>
> * Is this restricted C documented anywhere?

Not (yet) that I'm aware of. In the short to mid term, we were
thinking of polishing what resides in the kernel documentation, that
is, Documentation/networking/filter.txt, to get it into better
shape, which, I presume, would also include documentation on the
restricted C. So far, examples are provided in the tc-bpf man page
(see link below).

The set of available helper functions callable from eBPF resides
under (enum bpf_func_id):

   https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/bpf.h

> * Is the procedure for compiling this restricted C documented anywhere?
>    (Yes, it's LLVM, but are the suitable pipelines/options documented
>    somewhere?)
>
>>> eBPF bytecode and executed on the in-kernel virtual machine or
>>> just-in-time compiled into native code.
>>> .SS Extended BPF Design/Architecture
>>> .P
>>> .\" FIXME In the following line, what does "different data types" mean?
>>> .\"       Are the values in a map not just blobs?
>>
>> Sort of. Currently, these blobs can have different sizes of keys
>> and values (you can even have structs as keys). The map itself
>> treats them as blobs internally. However, bpf tail calls were added
>> recently, where you can look up another program from an array map and
>> call into it. Here, that particular type of map can only have entries
>> of type eBPF program fd. I think, if needed, adding a paragraph on
>> tail calls could be done as a follow-up after we have an initial man
>> page included in the tree.
>
> Okay -- I've added a FIXME placeholder for this, so we can revisit.

Okay.
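
On the "structs as keys" point, a small sketch of what blob treatment
means in practice, assuming keys are compared byte by byte over
key_size, so padding bytes inside a struct count too. The struct and
function names below are hypothetical and only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key layout; on common ABIs, padding follows 'port'. */
struct flow_key {
	uint32_t addr;
	uint16_t port;
};

/* Zero the whole struct first so that padding bytes are deterministic. */
static void flow_key_init(struct flow_key *k, uint32_t addr, uint16_t port)
{
	assert(k != NULL);
	memset(k, 0, sizeof(*k));
	k->addr = addr;
	k->port = port;
}

/* Compare as a blob-based map would: raw bytes, no knowledge of fields. */
static bool flow_key_same_blob(const struct flow_key *a,
			       const struct flow_key *b)
{
	return memcmp(a, b, sizeof(*a)) == 0;
}
```

Without the memset(), two keys with equal fields could still differ in
their padding and thus be treated as different entries.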

>>> BPF maps are a generic data structure for storage of different data types.
>>> A user process can create multiple maps (with key/value-pairs being
>>> opaque bytes of data) and access them via file descriptors.
>>> BPF programs can access maps from inside the kernel in parallel.
>>> It's up to the user process and BPF program to decide what they store
>>> inside maps.
>>> .P
>>> BPF programs are similar to kernel modules.
>>> They are loaded by the user
>>> process and automatically unloaded when the process exits.
>>
>> Generally that's true. Btw, in the 4.1 kernel, tc(8) also got support
>> for eBPF classifiers and actions, and here it's slightly different: in
>> tc, we load the programs, maps etc, and push down the eBPF program fd in
>> order to let the kernel hold a reference on the program itself.
>>
>> Thus, there, the program fd that the application owns is gone when the
>> application terminates, but the eBPF program itself still lives on
>> inside the kernel. But perhaps it's already too much detail to mention
>> here ...
>
> Well, it should be documented somewhere....

Yep, fwiw, some time ago I hacked together a man page for tc:

https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=cbdd1e6921d21815e35d2a96526cfbad5ac98e09

>>> Each BPF program is a set of instructions that is safe to run until
>>> its completion.
>>> The in-kernel BPF verifier statically determines that the program
>>> terminates and is safe to execute.
>>> .\" FIXME In the following sentence, what does "takes hold" mean?
>>
>> Takes a reference. Meaning, that maps cannot disappear under us while
>> the eBPF program that is using them in the kernel is still alive.
>
> So, I changed this to:
>
> [[
> During verification, the kernel increments reference counts for each of
> the maps that the eBPF program uses,
> so that the selected maps cannot be removed until the program is unloaded.
> ]]
>
> Okay?

Okay.
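
For illustration, the reference counting described above can be
pictured with a toy user-space model. This is only a sketch of the
idea; struct toy_map and its functions are invented here and are not
the kernel's actual data structures:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a map object; not the kernel's real struct. */
struct toy_map {
	int refs;	/* current holders: user fds and loaded programs */
	bool freed;	/* set once the last reference has been dropped */
};

/* Creation: the creating process holds the first reference via its fd. */
static void toy_map_init(struct toy_map *m)
{
	m->refs = 1;
	m->freed = false;
}

/* Taken, e.g., when a program using the map passes verification. */
static void toy_map_get(struct toy_map *m)
{
	assert(!m->freed && m->refs > 0);
	m->refs++;
}

/* Dropped when an fd is closed or a program is unloaded. */
static void toy_map_put(struct toy_map *m)
{
	assert(!m->freed && m->refs > 0);
	if (--m->refs == 0)
		m->freed = true;
}
```

In this model, closing the user's fd while a program still holds its
reference leaves refs at 1, so the map survives until the program is
unloaded as well.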

[...]
> I'll send out a new draft soon, but in the meantime hopefully you
> or Alexei might have a chance to answer some open questions (see my
> other mail to Alexei, which will be sent soon), so I can further edit
> the page before sending it out.

Later on, we should also add a paragraph on eBPF tail calls, but one
step at a time.

Thanks again,
Daniel