Extending the cc-list, since I think this thread is related to the bpf split thread.
On Mon, Jun 2, 2014 at 5:36 AM, Daniel Borkmann <[email protected]> wrote:
> On 05/30/2014 07:12 PM, Chema Gonzalez wrote:
>>
>> On Thu, May 29, 2014 at 4:54 PM, Daniel Borkmann <[email protected]>
>> wrote:
>>>
>>> I actually liked [1] most ... ;)
>
> ...
>
>>> one extension e.g. SKF_AD_DISSECT where you call the external
>>> skb_flow_dissect() helper or similar on?
>>>
>>> I could imagine that either these offsets are squashed into the
>>> return, or stored, if you really need them, from the struct flow_keys
>>> into M[] itself. So you would end up with one e.g. 'ld #keys' call
>>> that e.g. fills out/overwrites your M[] with the dissected metadata.
>>> Then, you'd only have a reason to call that once and do further
>>> processing based on this information. Whether you also need poff
>>> could e.g. be determined if the user does e.g. 'ldx #1' or so, to
>>> indicate to the function it eventually calls that it would further do
>>> the dissect and also give you poff back into fixed, defined M[] slots.
>>
>>
>> IIUC, you're suggesting having the classic BPF user write "ld #keys"
>> to load flow_keys into the stack and then *explicitly* add "ld
>> mem[<offset>]" to access the field she's interested in. Note that if
>> the "ld mem[]" is implicit, then the user must use "ld #thoff" to let
>> the compiler know she wants "ld #keys" followed by "ld
>> mem[<thoff_offset>]" (and not "ld mem[<other_offset>]"), which is
>> equivalent to what we're doing right now*.
>
>
> I think there are many ways to design that, but for something possibly
> generic, what I mean is something like this:
>
> We would know a priori, per the API documentation, which slots in M[]
> are filled with which information, for example:
>
> M[0] <-- nhoff [network header off]
> M[1] <-- nproto [network header proto]
> M[2] <-- thoff [transport header off]
> M[3] <-- tproto [transport header proto]
> M[4] <-- poff [payload off]
>
> Now a user could load the following pseudo bpf_asm style program:
>
> ld #5 <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
> ld #keys <-- triggers the extension to fill the M[] slots
> ld M[0] <-- loads nhoff from M[0] into accu
> <do sth with it>
> ld M[3] <-- loads tproto into accu, etc
> …
imo there are pros and cons in Daniel's and Chema's proposals
for classic BPF extensions.
I like Chema's a bit more, since his proposal doesn't require
changing the classic BPF verifier, and that's always a delicate area
(seccomp also allows using M[] slots).
Both still look like a stopgap until the next classic
extension appears on the horizon.
I think an explicit eBPF program solves this particular task more
cleanly, and it doesn't require changing eBPF.
> A program like:
>
> ld #2
> ld #keys
> ...
>
> Would then on the other hand only fill the first two slots of M[] as
> the user does not need all information from the kernel's flow dissector
> and thus also does not fully need to dissect the skb.
>
> This also permits adding new fields in the future by enabling them
> with ld #6 etc, for example.
>
> I think this would fit much more into the design of BPF (although I
> don't strictly like referring to the kernel's flow dissector and thus
> bypassing BPF's actual purpose; but I understand that it can get quite
> complicated with tunnels, etc).
>
> But this idea would allow you to add only 1 new extension and to have
> it dynamically return a part or all of the information you would need
> in your program. That solves the issue of calling the skb flow
> dissector multiple times just because we have ld #thoff, ld #nhoff,
> etc, and we can avoid making such a stack_layout + filter_prologue
> hack and thus design it more cleanly into the BPF architecture.
>
>
>> The only advantage I can see is that we're only adding one new
>> ancillary load, which is more elegant. I can see some issues with this
>> approach:
>>
>> - usability. The user has to actually know the right offset of
>> thoff, which is cumbersome and may change between kernels. We'd be
>> effectively exporting the flow_keys struct layout to the users.
>
>
> See above.
>
>
>> - have to take care that the classic BPF filter does not muck with
>> mem[] between the "ld #keys" and all of the "ld* mem[]" insns that
>> access it. Note that we're currently storing the flow_keys struct
>> in the stack outside of the mem[] words, so this is not a problem
>> here. (It is only accessible using ebpf insns.)
>
>
> Since it is part of the API/ABI, and a new instruction, it is then
> known to the user that ld #keys would fill particular slots of M[].
> That, however, must be carefully designed so that in the future one
> doesn't need to add an ld #keys2. Not mucking with the M[] fields
> is then the user's task, actually, just the same way as a user
> shouldn't store something to A or X when an ld or ldx would
> knowingly overwrite that previously stored value. I don't think
> it would be any different than that.
>
>
>> *Now, if your comment is ebpf-only, that could be ok. That would mean:
>>
>> - a. the user writes "ld #thoff"
>> - b. the BPF->eBPF compiler generates one common BPF_CALL
>> (__skb_get_flow_dissect) plus an "ebpf_ldx stack[right_offset]". This
>> would not include payload_offset (which needs to run its own function
>> to get the poff from flow_keys).
>
>
> Since we're not using eBPF in user space right now, the comment is not
> about eBPF. I think if you could use eBPF from user space, you wouldn't
> need to add such extensions anymore, but could just write yourself a
> minimal flow dissector in 'restricted' C to solve that problem.
Exactly.
I think exposing eBPF to user space is not a matter of 'if' but 'when'.
I'm not sure how pressing it is now to add another extension to classic,
when the goal of this patch set fits into the eBPF model just fine.
Yes, the eBPF verifier is not in tree yet, and it will obviously take
longer to review than Chema's or Daniel's set.
When eBPF is exposed to user space, the inner header access can
be done in two ways without changing the eBPF instruction set or the
eBPF verifier.
eBPF approach #1:
-- code a packet parser in restricted C (rough sketch below)
Pros:
skb_flow_dissect() stays hidden in the kernel.
No additional uapi headaches, which exist if we start reusing the
in-kernel skb_flow_dissect() either via classic BPF or eBPF.
Such an eBPF program will be just as fast as the in-kernel
skb_flow_dissect() (I provided performance numbers before).
Cons:
the in-kernel skb_flow_dissect() is not reused, but mostly copy-pasted
to user space.
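Such a program could look roughly like this (untested sketch; the
context struct and its fields are made up for illustration, not a
real uapi):

#include <stdint.h>

/* Untested sketch of approach #1: a minimal Ethernet/IPv4 parser in
 * restricted C that computes thoff/poff. The context struct and its
 * fields are invented for illustration. */
struct pkt_ctx {
	const uint8_t *data;	/* start of packet */
	uint32_t len;		/* packet length */
};

static int parse_pkt(const struct pkt_ctx *ctx, uint32_t *thoff,
		     uint32_t *poff)
{
	const uint8_t *p = ctx->data;
	uint32_t nhoff = 14;			/* skip Ethernet header */
	uint8_t ip_proto;

	if (nhoff + 20 > ctx->len)		/* need min IPv4 header */
		return -1;
	ip_proto = p[nhoff + 9];
	*thoff = nhoff + (p[nhoff] & 0x0f) * 4;	/* IHL in 32-bit words */
	if (*thoff + 20 > ctx->len)
		return -1;
	if (ip_proto == 6)			/* TCP: use data offset */
		*poff = *thoff + (p[*thoff + 12] >> 4) * 4;
	else if (ip_proto == 17)		/* UDP: fixed 8-byte header */
		*poff = *thoff + 8;
	else
		return -1;
	return 0;
}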
eBPF approach #2:
-- reuse the in-kernel skb_flow_dissect() via a bpf_call insn from the
eBPF program
Pros:
the eBPF program becomes much shorter and can be done in asm like:
bpf_mov r2, fp
bpf_sub r2, 64
bpf_call bpf_skb_flow_dissect // call in-kernel helper function from eBPF program
bpf_ldx r1, [fp - 64] // access flow_keys data
bpf_ldx r2, [fp - 60]
Cons:
bpf_skb_flow_dissect() (a wrapper on top of skb_flow_dissect()) becomes
a uapi-visible helper function
imo both eBPF approaches are cleaner than extending classic.
On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
...
> imo there are pros and cons in Daniel's and Chema's proposals
> for classic BPF extensions.
> I like Chema's a bit more, since his proposal doesn't require
> changing the classic BPF verifier, and that's always a delicate area
> (seccomp also allows using M[] slots).
What bothers me is that you have to *special-case* this extension
all over the BPF converter and stack; you even need to introduce a
prologue just to walk the whole filter program and check whether the
extension is present. On top of that, instead of one extension you
need to add a couple of them to uapi. I understand Chema's proposal
and his need to easily access these data, but that's why I proposed
the above, if he wants to go for that. Btw, seccomp doesn't allow
loading BPF extensions, as you said yourself, Alexei.
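To illustrate, the pre-pass boils down to something like this
(paraphrased sketch only, not the actual patch; the extension offset
name is made up):

/* Sketch: walk the classic filter once to detect whether any
 * flow-dissector ancillary load is used, so the converter knows to
 * emit an extra prologue. SKF_AD_FLOW_THOFF is a made-up offset. */
static bool filter_uses_flow_dissector(const struct sock_filter *fp,
				       int flen)
{
	int i;

	for (i = 0; i < flen; i++)
		if (fp[i].code == (BPF_LD | BPF_W | BPF_ABS) &&
		    fp[i].k == SKF_AD_OFF + SKF_AD_FLOW_THOFF)
			return true;
	return false;
}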
On Tue, Jun 3, 2014 at 1:33 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
>>
>> imo there are pros and cons in Daniel's and Chema's proposals
>> for classic BPF extensions.
>> I like Chema's a bit more, since his proposal doesn't require
>> changing the classic BPF verifier, and that's always a delicate area
>> (seccomp also allows using M[] slots).
>
>
> What bothers me is that you have to *special-case* this extension
> all over the BPF converter and stack; you even need to introduce a
> prologue just to walk the whole filter program and check whether the
> extension is present. On top of that, instead of one extension you
> need to add a couple of them to uapi. I understand Chema's proposal
Agree. The walking part and multiple anc loads are not clean, but in your
approach you'd need to hack sk_chk_filter() to recognize a very specific
anc load and do even fancier things in check_load_and_stores()
to make sure that 'ld #5, ld #keys' is safe from an M[] usage point of view.
> and his need to easily access these data, but that's why I proposed
> the above, if he wants to go for that. Btw, seccomp doesn't allow
> loading BPF extensions, as you said yourself, Alexei.
of course, but seccomp does use ld/st M[], which will overlap
with your socket extension, and since check_load_and_stores()
will be hacked, seccomp verification needs to be considered as well.
From a user api point of view, your approach is cleaner than Chema's,
but from an implementation point of view, Chema's is much safer and smaller.
Anyway as I said before I'm not excited about either.
I don't think we should be adding classic BPF extensions any more.
The long term headache of supporting classic BPF extensions
outweighs the short term benefits.
Not having to constantly tweak the kernel for such cases was the key
goal of eBPF. My two eBPF approaches to solving Chema's need
are both cleaner, since nothing special is being done in the eBPF
core to support this new need. Both the instruction set and the
verifier stay generic and cover tracing and this new socket use at once.
I do realize that I'm talking here about a future eBPF verifier that
is not in tree yet, and that eBPF is not exposed to user space, but
I think we should consider the longer-term perspective.
If I'm forced to pick between yours or Chema's classic extensions,
I would pick Chema's because it's the lesser evil :)
On Tue, Jun 3, 2014 at 1:15 PM, Alexei Starovoitov <[email protected]> wrote:
> On Tue, Jun 3, 2014 at 1:33 AM, Daniel Borkmann <[email protected]> wrote:
>> On 06/02/2014 06:48 PM, Alexei Starovoitov wrote:
>>>
>>> imo there are pros and cons in Daniel's and Chema's proposals
>>> for classic BPF extensions.
>>> I like Chema's a bit more, since his proposal doesn't require
>>> changing the classic BPF verifier, and that's always a delicate area
>>> (seccomp also allows using M[] slots).
>>
>>
>> What bothers me is that you have to *special-case* this extension
>> all over the BPF converter and stack; you even need to introduce a
>> prologue just to walk the whole filter program and check whether the
>> extension is present. On top of that, instead of one extension you
>> need to add a couple of them to uapi. I understand Chema's proposal
>
> Agree. The walking part and multiple anc loads are not clean, but in your
> approach you'd need to hack sk_chk_filter() to recognize a very specific
> anc load and do even fancier things in check_load_and_stores()
> to make sure that 'ld #5, ld #keys' is safe from an M[] usage point of view.
I think you may be confusing the two parts of the patch, namely
whether we implement the code as 1 anc load with N parameters or as N
anc loads, and the existence of a prologue.
- Adding just 1 anc load ("ld #keys" in your pseudo-code) requires
fewer UAPI changes, but effectively exports the flow_keys struct to the
user (which means we cannot change it in the future). The fact that
you're proposing your own version of flow_keys ( {nhoff, nproto,
thoff, tproto, poff} ) does not make it less of a problem: You're just
exporting an alternative flow_keys (albeit IMO a way better one, as
all the dissected info can be obtained from your 5-tuple). From a
usability PoV, the user may have to either do "ld #0; ld #keys"
(knowing that "#0" refers to "nhoff"), or "ld #nhoff". I definitely
prefer the second, but I can easily rewrite the patch to use the
first.
- Now, the existence of a prologue is a must if you want to ensure the
filter only calls the actual flow dissector once (this is the main
goal of the patch, actually). Having 2 insns separated by N insns that
access data obtained from the same flow dissector call means you have
to both (a) ensure the first caller -- whether it's the first or the
second insn -- does perform the BPF_CALL, and (b) ensure that none of
the N insns in the middle mucks with the result of the first call.
(My solution deals with (b) by using the stack outside of M[], while
yours requires verifying the code. I find mine easier.)
In order to achieve (a), you need the prologue code: Because the code
can have jumps, you can never know whether the "ld #keys" was actually
called or not. What my solution's prologue does is write a 0 to a
flow_inited u32, which states that the flow dissector hasn't been
called. Every load of any of the sub-fields then checks this
flow_inited field and runs the flow dissector iff it's zero (i.e., it
hasn't been called yet), in which case it also sets flow_inited to 1.
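In C-like pseudo-code, the guard that every flow-dissector load
expands to is roughly this (a sketch of the emitted logic; the patch
emits eBPF insns, and flow_keys/flow_inited live on the eBPF stack,
outside M[]):

/* Sketch (C-like pseudo-code of the emitted eBPF guard): */
if (flow_inited == 0) {				/* dissector not run yet */
	skb_flow_dissect(skb, &flow_keys);	/* fill the stack copy */
	flow_inited = 1;			/* remember that it ran */
}
A = flow_keys.thoff;				/* e.g. for "ld #thoff" */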
Your approach needs it too. Citing from your pseudo-code:
> ld #5 <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
> ld #keys <-- triggers the extension to fill the M[] slots
> ld M[0] <-- loads nhoff from M[0] into accu
How does the "ld M[0]" know that the actual flow dissector has already
been called? What if the insn just before the "ld #5" was "jmp +2" ?
In that case, the "ld #keys" would have never been called.
>> and his need to easily access these data, but that's why I proposed
>> the above, if he wants to go for that. Btw, seccomp doesn't allow
>> loading BPF extensions, as you said yourself, Alexei.
>
> of course, but seccomp does use ld/st M[], which will overlap
> with your socket extension, and since check_load_and_stores()
> will be hacked, seccomp verification needs to be considered as well.
> From a user api point of view, your approach is cleaner than Chema's,
> but from an implementation point of view, Chema's is much safer and smaller.
>
> Anyway as I said before I'm not excited about either.
> I don't think we should be adding classic BPF extensions any more.
> The long term headache of supporting classic BPF extensions
> outweighs the short term benefits.
I see a couple of issues with (effectively) freezing classic BPF
development while waiting for direct eBPF access to happen. The first
one is that the kernel has to accept it. I can see many questions
about this, especially security and usability (I'll send an email
about the "split BPF out of core" later). Now, the main issue is
whether/when the tools will support it. IMO, this is useful iff I can
quickly write/reuse filters and run tcpdump filters based on them. I'm
trying to get upstream libpcap to accept support for raw (classic) BPF
filters, and it's taking a long time. I can imagine how they may be
less receptive to supporting a Linux-only eBPF mechanism. Tools do
matter.
Even if eBPF happens, it's not that the extensions are so hard to port
to eBPF: It's one BPF_CALL per extension. And they have a
straightforward porting path to the C-like code.
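For instance, porting something like "ld #thoff" could boil down to a
call plus a stack load, built from the eBPF insn macros (sketch only;
the helper name and the -64 stack offset are made up for illustration):

/* Hypothetical per-extension conversion: one call to an in-kernel
 * helper that fills a flow_keys area on the stack, plus one load of
 * the dissected value into A. */
*insn++ = BPF_MOV64_REG(BPF_REG_ARG1, BPF_REG_CTX);       /* arg1 = skb */
*insn++ = BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP);        /* arg2 = stack area */
*insn++ = BPF_EMIT_CALL(__skb_get_flow_dissect);          /* hypothetical helper */
*insn++ = BPF_LDX_MEM(BPF_W, BPF_REG_A, BPF_REG_FP, -64); /* A = thoff slot */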
-Chema
On 06/03/2014 11:12 PM, Chema Gonzalez wrote:
...
> Your approach needs it too. Citing from your pseudo-code:
>
>> ld #5 <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
>> ld #keys <-- triggers the extension to fill the M[] slots
>> ld M[0] <-- loads nhoff from M[0] into accu
>
> How does the "ld M[0]" know that the actual flow dissector has already
> been called? What if the insn just before the "ld #5" was "jmp +2" ?
> In that case, the "ld #keys" would have never been called.
But that case would be no different from doing something like ...
[...]
jmp foo
ldi #42
st M[0]
foo:
ld M[0]
[...]
... and would then not pass the checker in check_load_and_stores(),
which, as others have already stated, would need to be extended,
of course. It's one possible approach.
>> Anyway as I said before I'm not excited about either.
>> I don't think we should be adding classic BPF extensions any more.
>> The long term headache of supporting classic BPF extensions
>> outweighs the short term benefits.
...
> I see a couple of issues with (effectively) freezing classic BPF
> development while waiting for direct eBPF access to happen. The first
> one is that the kernel has to accept it. I can see many questions
> about this, especially security and usability (I'll send an email
> about the "split BPF out of core" later). Now, the main issue is
> whether/when the tools will support it. IMO, this is useful iff I can
> quickly write/reuse filters and run tcpdump filters based on them. I'm
> trying to get upstream libpcap to accept support for raw (classic) BPF
> filters, and it's taking a long time. I can imagine how they may be
> less receptive to supporting a Linux-only eBPF mechanism. Tools do
> matter.
Grepping through the libpcap code, which tries to be platform independent,
it seems that after all these years, the only extensions you can see support
for in their code are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
just don't care, perhaps they do, who knows, but it looks to me a bit
like they are reluctant about these improvements, maybe for the reason
that other OSes don't support them. That was also one of the reasons that
led me to start writing bpf_asm (tools/net/), to have a small DSL
for more easily trying out BPF code while having _full_ control over it.
Maybe someone should start a binary-compatible Linux-only version of
libpcap, where tcpdump will transparently make use of these low level
improvements eventually. </rant> ;)
Thanks,
Daniel
FWIW, I'm not applying this series.
There is no real consensus and a lot more discussion and time is
needed before anything happens here.
I'll try to revive the discussion for this patch, in case I can
convince you about its implementation. I rebased it to the latest
HEAD, and I'm ready to re-submit.
On Wed, Jun 4, 2014 at 1:51 AM, Daniel Borkmann <[email protected]> wrote:
> On 06/03/2014 11:12 PM, Chema Gonzalez wrote:
> ...
>
>> Your approach needs it too. Citing from your pseudo-code:
>>
>>> ld #5 <-- indicates to fill the first 5 slots of M[], so M[0] to M[4]
>>> ld #keys <-- triggers the extension to fill the M[] slots
>>> ld M[0] <-- loads nhoff from M[0] into accu
>>
>>
>> How does the "ld M[0]" know that the actual flow dissector has already
>> been called? What if the insn just before the "ld #5" was "jmp +2" ?
>> In that case, the "ld #keys" would have never been called.
>
>
> But that case would be no different from doing something like ...
>
> [...]
> jmp foo
> ldi #42
> st M[0]
> foo:
> ld M[0]
> [...]
>
> ... and would then not pass the checker in check_load_and_stores(),
> which, as others have already stated, would need to be extended,
> of course. It's one possible approach.
As Alexei/you have noted, this is moving the complexity to
check_load_and_stores(). It's more complicated than the current
ensure-there's-a-store-preceding-each-load-for-each-M[]-position
check. In particular, your approach requires 2 insns before a memory
position can be considered filled, namely the "ld #5" and the "ld
#keys". That means adding intra-instruction state to the FSM in
check_load_and_stores().
A second issue is that you need to ensure that M[0:5] does not get
polluted between the moment you call "ld #5; ld #keys" and the moment
the code uses the flow-dissector values. This is more complex than
just ensuring the code doesn't access uninit'ed M[].
There are also some concerns about effectively adding a prologue. We
already have one:

$ cat net/core/filter.c
...
int sk_convert_filter(struct sock_filter *prog, int len,
                      struct sock_filter_int *new_prog, int *new_len)
{
	...
	if (new_insn)
		*new_insn = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1);
	new_insn++;

	for (i = 0; i < len; fp++, i++) {
	...
The current prologue is embedded in sk_convert_filter(). The patch
just moves it to its own function, and (potentially) adds another
instruction (which zeroes flow_inited iff there's any call to the flow
dissector).
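Roughly, the emitted prologue would then look like this (illustrative
sketch, not the literal patch; FLOW_INITED_OFF is a made-up stack
offset, and uses_flow_dissector a flag computed during the conversion
pre-pass):

/* Sketch: keep the existing ctx move, and add one store zeroing the
 * on-stack flow_inited word iff the filter references the flow
 * dissector. */
*new_insn++ = BPF_MOV64_REG(BPF_REG_CTX, BPF_REG_ARG1);
if (uses_flow_dissector)
	*new_insn++ = BPF_ST_MEM(BPF_W, BPF_REG_FP, FLOW_INITED_OFF, 0);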
>>> Anyway as I said before I'm not excited about either.
>>> I don't think we should be adding classic BPF extensions any more.
>>> The long term headache of supporting classic BPF extensions
>>> outweighs the short term benefits.
>> I see a couple of issues with (effectively) freezing classic BPF
>> development while waiting for direct eBPF access to happen. The first
>> one is that the kernel has to accept it. I can see many questions
>> about this, especially security and usability (I'll send an email
>> about the "split BPF out of core" later). Now, the main issue is
>> whether/when the tools will support it. IMO, this is useful iff I can
>> quickly write/reuse filters and run tcpdump filters based on them. I'm
>> trying to get upstream libpcap to accept support for raw (classic) BPF
>> filters, and it's taking a long time. I can imagine how they may be
>> less receptive to supporting a Linux-only eBPF mechanism. Tools do
>> matter.
This is a high-level decision, more than a technical one. Do we want
to freeze classic BPF development in Linux, even before we have a
complete eBPF replacement, and zero eBPF tool (libpcap) support?
> Grepping through the libpcap code, which tries to be platform independent,
> it seems that after all these years, the only extensions you can see support
> for in their code are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
Actually they recently added MOD/XOR support. Woo-hoo!
> just don't care, perhaps they do, who knows, but it looks to me a bit
> like they are reluctant about these improvements, maybe for the reason
> that other OSes don't support them.
From the comments in the MOD/XOR patch, the latter seems to be the issue.
> That was also one of the reasons that
> led me to start writing bpf_asm (tools/net/), to have a small DSL
> for more easily trying out BPF code while having _full_ control over it.
>
> Maybe someone should start a binary-compatible Linux-only version of
> libpcap, where tcpdump will transparently make use of these low level
> improvements eventually. </rant> ;)
There's too much code dependent on libpcap to make a replacement possible.
Thanks,
-Chema
On 06/20/2014 11:56 PM, Chema Gonzalez wrote:
...
> This is a high-level decision, more than a technical one. Do we want
> to freeze classic BPF development in Linux, even before we have a
> complete eBPF replacement, and zero eBPF tool (libpcap) support?
In my opinion, I don't think we strictly have to hard-freeze it. The
only concern I see is that conceptually hooking into the flow_dissector
to read out all keys for further processing on top of them 1) sort
of breaks/bypasses the concept of BPF (as it's actually the task of
BPF itself to do this), 2) effectively freezes any changes to the
flow_dissector, as BPF applications making use of it now depend on the
provided offsets for doing further processing on top of them, and 3)
can already be resolved by (re-)writing the kernel's flow dissector
in C-like syntax in user space iff eBPF can be loaded from there with
similar performance. So shouldn't we rather work towards that as a
more generic approach/goal in the mid term, without having to maintain
a very short-term intermediate solution that we need to special-case
along the code and carry around forever?
>> Grepping through the libpcap code, which tries to be platform independent,
>> it seems that after all these years, the only extensions you can see support
>> for in their code are SKF_AD_PKTTYPE and SKF_AD_PROTOCOL. Perhaps they
>
> Actually they recently added MOD/XOR support. Woo-hoo!
Great to hear, still quite some things missing, unfortunately. :/
>> just don't care, perhaps they do, who knows, but it looks to me a bit
>> like they are reluctant about these improvements, maybe for the reason
>> that other OSes don't support them.
>
> From the comments in the MOD/XOR patch, the latter seems to be the issue.
Yep, that's the pain you need to live with when trying to be multi-OS
capable. I assume that in its very origin, the [libpcap] compiler was
probably not designed to handle such differences in various
operating systems (it likely even ran in user space from libpcap directly).
>> That was also one of the reasons that
>> led me to start writing bpf_asm (tools/net/), to have a small DSL
>> for more easily trying out BPF code while having _full_ control over it.
>>
>> Maybe someone should start a binary-compatible Linux-only version of
>> libpcap, where tcpdump will transparently make use of these low level
>> improvements eventually. </rant> ;)
>
> There's too much code dependent on libpcap to make a replacement possible.
Well, I wrote binary-compatible, so applications on top of it won't
care much, as it can be used as a drop-in replacement. That would perhaps
also allow for fanout and other features to be used ...
On Tue, Jun 24, 2014 at 1:14 AM, Daniel Borkmann <[email protected]> wrote:
>> This is a high-level decision, more than a technical one. Do we want
>> to freeze classic BPF development in Linux, even before we have a
>> complete eBPF replacement, and zero eBPF tool (libpcap) support?
>
>
> In my opinion, I don't think we strictly have to hard-freeze it. The
> only concern I see is that conceptually hooking into the flow_dissector
> to read out all keys for further processing on top of them 1) sort
> of breaks/bypasses the concept of BPF (as it's actually the task of
> BPF itself to do this),
I don't think we want to do flow dissection using BPF insns. It's not
easy to write BPF insns, and we already have kernel code that does
that. IMO that's what eBPF calls/BPF ancillary loads are for (e.g.
vlan access).
> 2) effectively freezes any changes to the
> flow_dissector, as BPF applications making use of it now depend on the
> provided offsets for doing further processing on top of them, and 3)
Remember that my approach does not have (user-visible) offsets. It
uses the eBPF stack to dump the output (struct flow_keys) of the flow
dissector (skb_flow_dissect()). The only dependency we're adding is
that, once we provide a BPF ancillary load to access e.g. thoff, we
have to keep providing it.
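For reference, the struct being dumped onto the stack is small;
abbreviated from include/net/flow_keys.h (roughly, as of this writing):

/* Abbreviated sketch of struct flow_keys; see include/net/flow_keys.h
 * for the authoritative definition. The point is that this layout
 * stays kernel-internal rather than becoming uapi: */
struct flow_keys {
	__be32 src;		/* source address (hash input for IPv6) */
	__be32 dst;		/* destination address */
	union {
		__be32 ports;
		__be16 port16[2];	/* source/destination ports */
	};
	u16 thoff;		/* transport header offset */
	u8 ip_proto;		/* transport protocol */
};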
> can already be resolved by (re-)writing the kernel's flow dissector
> in C-like syntax in user space iff eBPF can be loaded from there with
> similar performance. So shouldn't we rather work towards that as a
> more generic approach/goal in the mid term, without having to maintain
> a very short-term intermediate solution that we need to special-case
> along the code and carry around forever?
Once (if) we reach the point where we can do eBPF filters in "C-like
syntax," I'd agree with you that it would be nice to be able to reuse
the same function inside the kernel and as an eBPF library. The same
probably applies to other network functions. Now, I'm not sure what
the model to reuse is: Are we planning to re-write (maybe "re-write" is
too strong, as we will probably only need some minor changes) some of
the kernel functions into this "C--" language so that eBPF can use
them? Do other people agree with this vision?
There's still the problem of whether we want to obsolete classic BPF
in the kernel before the tools (libpcap mainly) accept eBPF. This can
take a long time.
Finally, what's the user's CLI interface you have in mind? Right now,
tcpdump expressions are very handy: I know I can pass "ip[2:2] ==
1500" or "(tcp[13] & 0x03)" to any libpcap-based application. This is
very handy to log into a machine and quickly run tcpdump to get the
packets I'm interested in. What would be the model for using C-- eBPF
filters in the same manner?
Thanks again,
-Chema
On 06/26/2014 12:00 AM, Chema Gonzalez wrote:
...
> There's still the problem of whether we want to obsolete classic BPF
> in the kernel before the tools (libpcap mainly) accept eBPF. This can
> take a long time.
>
> Finally, what's the user's CLI interface you have in mind? Right now,
> tcpdump expressions are very handy: I know I can pass "ip[2:2] ==
> 1500" or "(tcp[13] & 0x03)" to any libpcap-based application. This is
> very handy to log into a machine and quickly run tcpdump to get the
> packets I'm interested in. What would be the model for using C-- eBPF
> filters in the same manner?
Yes, imho, it's a valid question to ask. I think there are a couple
of possibilities for libpcap/tcpdump from a user point of view (note,
I don't strictly think it's the _only_ main user though): 1) iff an
llvm and/or gcc backend gets merged on the compiler side, one could
add a cli interface to run the generated opcodes from a file for
advanced filters, while perhaps classic BPF continues to be supported
via its high-level filter expressions; 2) there could be a Linux-only
compiler in libpcap that translates and makes use of full eBPF (though
significantly more effort to implement); 3) libpcap continues to use
classic BPF as it's currently doing.