On Tue, Jun 30, 2020 at 2:17 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Jun 30, 2020, at 4:56 PM, Eric Dumazet [email protected] wrote:
>
> > On Tue, Jun 30, 2020 at 1:44 PM David Miller <[email protected]> wrote:
> >>
> >> From: Eric Dumazet <[email protected]>
> >> Date: Tue, 30 Jun 2020 13:39:27 -0700
> >>
> >> > The (C) & (B) cases are certainly doable.
> >> >
> >> > The (A) case is more complex; I have no idea what breakage various
> >> > TCP stacks would see if a flow got SACK at some point (in the 3WHS)
> >> > but suddenly becomes Reno.
> >>
> >> I agree that C and B are the easiest to implement without having to
> >> add complicated code to handle various negotiated TCP option
> >> scenarios.
> >>
> >> It does seem that some entities do A, or did I misread your
> >> behavioral analysis of the various implementations, Mathieu?
> >>
> >> Thanks.
> >
> > Yes, another question about Mathieu's cases is to determine the behavior
> > of all these stacks with respect to:
> > SACK option
> > TCP TS option.
>
> I will ask my customer's networking team to investigate these behaviors,
> which will allow me to prepare a thorough reply to the questions raised
> by Eric and David. I expect to have an answer within 2-3 weeks at most.
>
> Thank you!
Great, I am working on adding back support for (B) & (C) by the end of
this week.
On Tue, Jun 30, 2020 at 2:23 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 2:17 PM Mathieu Desnoyers
> <[email protected]> wrote:
> >
> > ----- On Jun 30, 2020, at 4:56 PM, Eric Dumazet [email protected] wrote:
> >
> > > On Tue, Jun 30, 2020 at 1:44 PM David Miller <[email protected]> wrote:
> > >>
> > >> From: Eric Dumazet <[email protected]>
> > >> Date: Tue, 30 Jun 2020 13:39:27 -0700
> > >>
> > >> > The (C) & (B) cases are certainly doable.
> > >> >
> > >> > The (A) case is more complex; I have no idea what breakage various
> > >> > TCP stacks would see if a flow got SACK at some point (in the 3WHS)
> > >> > but suddenly becomes Reno.
> > >>
> > >> I agree that C and B are the easiest to implement without having to
> > >> add complicated code to handle various negotiated TCP option
> > >> scenarios.
> > >>
> > >> It does seem that some entities do A, or did I misread your
> > >> behavioral analysis of the various implementations, Mathieu?
> > >>
> > >> Thanks.
> > >
> > > Yes, another question about Mathieu's cases is to determine the behavior
> > > of all these stacks with respect to:
> > > SACK option
> > > TCP TS option.
> >
> > I will ask my customer's networking team to investigate these behaviors,
> > which will allow me to prepare a thorough reply to the questions raised
> > by Eric and David. I expect to have an answer within 2-3 weeks at most.
> >
> > Thank you!
>
>
> Great, I am working on adding back support for (B) & (C) by the end of
> this week.
Note that the security issue (of sending uninit bytes to the wire) has
been independently fixed with [1]
This means syzbot was able to exercise MD5+TS+SACK ~6 months ago.
It seems we (Linux) do not enable this combination for PASSIVE flows
(according to tcp_synack_options()),
but for ACTIVE flows we do nothing special.
So maybe the code in tcp_synack_options() should be mirrored in
tcp_syn_options() for consistency
(disabling TS if both MD5 and SACK are enabled).
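For reference, a standalone sketch of the option-space arithmetic that
the commit below relies on (the constants are my reading of the aligned
option sizes, so treat them as assumptions rather than authoritative):

/* Illustration only, not kernel code: with MD5 (20 bytes aligned) and
 * TS (12 bytes aligned) in the 40-byte TCP option space, only 8 bytes
 * remain, not enough for even a single SACK block.
 */
#include <stdio.h>

#define MAX_TCP_OPTION_SPACE      40
#define TCPOLEN_MD5SIG_ALIGNED    20
#define TCPOLEN_TSTAMP_ALIGNED    12
#define TCPOLEN_SACK_BASE_ALIGNED  4
#define TCPOLEN_SACK_PERBLOCK      8

int main(void)
{
        int remaining = MAX_TCP_OPTION_SPACE -
                        TCPOLEN_MD5SIG_ALIGNED -
                        TCPOLEN_TSTAMP_ALIGNED;   /* 8 bytes left */
        int sack_blocks = 0;

        if (remaining >= TCPOLEN_SACK_BASE_ALIGNED + TCPOLEN_SACK_PERBLOCK)
                sack_blocks = (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
                              TCPOLEN_SACK_PERBLOCK;

        /* prints: remaining=8 sack_blocks=0 */
        printf("remaining=%d sack_blocks=%d\n", remaining, sack_blocks);
        return 0;
}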
[1]
commit 9424e2e7ad93ffffa88f882c9bc5023570904b55
Author: Eric Dumazet <[email protected]>
Date: Thu Dec 5 10:10:15 2019 -0800
tcp: md5: fix potential overestimation of TCP option space
Back in 2008, Adam Langley fixed the corner case of packets for flows
having all of the following options : MD5 TS SACK
Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
can be cooked from the remaining 8 bytes.
tcp_established_options() correctly sets opts->num_sack_blocks
to zero, but returns 36 instead of 32.
This means TCP cooks packets with 4 extra bytes at the end
of options, containing uninitialized bytes.
Fixes: 33ad798c924b ("tcp: options clean up")
Signed-off-by: Eric Dumazet <[email protected]>
Reported-by: syzbot <[email protected]>
Acked-by: Neal Cardwell <[email protected]>
Acked-by: Soheil Hassas Yeganeh <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index be6d22b8190fa375074062032105879270af4be5..b184f03d743715ef4b2d166ceae651529be77953
100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -755,8 +755,9 @@ static unsigned int tcp_established_options(struct
sock *sk, struct sk_buff *skb
min_t(unsigned int, eff_sacks,
(remaining - TCPOLEN_SACK_BASE_ALIGNED) /
TCPOLEN_SACK_PERBLOCK);
- size += TCPOLEN_SACK_BASE_ALIGNED +
- opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
+ if (likely(opts->num_sack_blocks))
+ size += TCPOLEN_SACK_BASE_ALIGNED +
+ opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
}
return size;
On Tue, Jun 30, 2020 at 2:54 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 2:23 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 30, 2020 at 2:17 PM Mathieu Desnoyers
> > <[email protected]> wrote:
> > >
> > > ----- On Jun 30, 2020, at 4:56 PM, Eric Dumazet [email protected] wrote:
> > >
> > > > On Tue, Jun 30, 2020 at 1:44 PM David Miller <[email protected]> wrote:
> > > >>
> > > >> From: Eric Dumazet <[email protected]>
> > > >> Date: Tue, 30 Jun 2020 13:39:27 -0700
> > > >>
> > > >> > The (C) & (B) cases are certainly doable.
> > > >> >
> > > >> > The (A) case is more complex; I have no idea what breakage various
> > > >> > TCP stacks would see if a flow got SACK at some point (in the 3WHS)
> > > >> > but suddenly becomes Reno.
> > > >>
> > > >> I agree that C and B are the easiest to implement without having to
> > > >> add complicated code to handle various negotiated TCP option
> > > >> scenarios.
> > > >>
> > > >> It does seem that some entities do A, or did I misread your
> > > >> behavioral analysis of the various implementations, Mathieu?
> > > >>
> > > >> Thanks.
> > > >
> > > > Yes, another question about Mathieu's cases is to determine the behavior
> > > > of all these stacks with respect to:
> > > > SACK option
> > > > TCP TS option.
> > >
> > > I will ask my customer's networking team to investigate these behaviors,
> > > which will allow me to prepare a thorough reply to the questions raised
> > > by Eric and David. I expect to have an answer within 2-3 weeks at most.
> > >
> > > Thank you!
> >
> >
> > Great, I am working on adding back support for (B) & (C) by the end of
> > this week.
>
> Note that the security issue (of sending uninit bytes to the wire) has
> been independently fixed with [1]
>
> This means syzbot was able to exercise MD5+TS+SACK ~6 months ago.
>
> It seems we (Linux) do not enable this combination for PASSIVE flows
> (according to tcp_synack_options()),
> but for ACTIVE flows we do nothing special.
>
> So maybe the code in tcp_synack_options() should be mirrored in
> tcp_syn_options() for consistency
> (disabling TS if both MD5 and SACK are enabled).
Oh well, tcp_syn_options() is supposed to have the same logic.
Maybe we have an issue with SYNCOOKIES (with MD5 + TS + SACK)
Nice can of worms.
>
> [1]
>
> commit 9424e2e7ad93ffffa88f882c9bc5023570904b55
> Author: Eric Dumazet <[email protected]>
> Date: Thu Dec 5 10:10:15 2019 -0800
>
> tcp: md5: fix potential overestimation of TCP option space
>
> Back in 2008, Adam Langley fixed the corner case of packets for flows
> having all of the following options : MD5 TS SACK
>
> Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block
> can be cooked from the remaining 8 bytes.
>
> tcp_established_options() correctly sets opts->num_sack_blocks
> to zero, but returns 36 instead of 32.
>
> This means TCP cooks packets with 4 extra bytes at the end
> of options, containing uninitialized bytes.
>
> Fixes: 33ad798c924b ("tcp: options clean up")
> Signed-off-by: Eric Dumazet <[email protected]>
> Reported-by: syzbot <[email protected]>
> Acked-by: Neal Cardwell <[email protected]>
> Acked-by: Soheil Hassas Yeganeh <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index be6d22b8190fa375074062032105879270af4be5..b184f03d743715ef4b2d166ceae651529be77953
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -755,8 +755,9 @@ static unsigned int tcp_established_options(struct
> sock *sk, struct sk_buff *skb
> min_t(unsigned int, eff_sacks,
> (remaining - TCPOLEN_SACK_BASE_ALIGNED) /
> TCPOLEN_SACK_PERBLOCK);
> - size += TCPOLEN_SACK_BASE_ALIGNED +
> - opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> + if (likely(opts->num_sack_blocks))
> + size += TCPOLEN_SACK_BASE_ALIGNED +
> + opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK;
> }
>
> return size;
On Tue, Jun 30, 2020 at 3:07 PM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 2:54 PM Eric Dumazet <[email protected]> wrote:
> >
> > On Tue, Jun 30, 2020 at 2:23 PM Eric Dumazet <[email protected]> wrote:
> > >
> > > On Tue, Jun 30, 2020 at 2:17 PM Mathieu Desnoyers
> > > <[email protected]> wrote:
> > > >
> > > > ----- On Jun 30, 2020, at 4:56 PM, Eric Dumazet [email protected] wrote:
> > > >
> > > > > On Tue, Jun 30, 2020 at 1:44 PM David Miller <[email protected]> wrote:
> > > > >>
> > > > >> From: Eric Dumazet <[email protected]>
> > > > >> Date: Tue, 30 Jun 2020 13:39:27 -0700
> > > > >>
> > > > >> > The (C) & (B) cases are certainly doable.
> > > > >> >
> > > > >> > The (A) case is more complex; I have no idea what breakage various
> > > > >> > TCP stacks would see if a flow got SACK at some point (in the 3WHS)
> > > > >> > but suddenly becomes Reno.
> > > > >>
> > > > >> I agree that C and B are the easiest to implement without having to
> > > > >> add complicated code to handle various negotiated TCP option
> > > > >> scenarios.
> > > > >>
> > > > >> It does seem that some entities do A, or did I misread your
> > > > >> behavioral analysis of the various implementations, Mathieu?
> > > > >>
> > > > >> Thanks.
> > > > >
> > > > > Yes, another question about Mathieu's cases is to determine the behavior
> > > > > of all these stacks with respect to:
> > > > > SACK option
> > > > > TCP TS option.
> > > >
> > > > I will ask my customer's networking team to investigate these behaviors,
> > > > which will allow me to prepare a thorough reply to the questions raised
> > > > by Eric and David. I expect to have an answer within 2-3 weeks at most.
> > > >
> > > > Thank you!
> > >
> > >
> > > Great, I am working on adding back support for (B) & (C) by the end of
> > > this week.
> >
> > Note that the security issue (of sending uninit bytes to the wire) has
> > been independently fixed with [1]
> >
> > This means syzbot was able to exercise MD5+TS+SACK ~6 months ago.
> >
> > It seems we (Linux) do not enable this combination for PASSIVE flows
> > (according to tcp_synack_options()),
> > but for ACTIVE flows we do nothing special.
> >
> > So maybe the code in tcp_synack_options() should be mirrored in
> > tcp_syn_options() for consistency
> > (disabling TS if both MD5 and SACK are enabled).
>
> Oh well, tcp_syn_options() is supposed to have the same logic.
>
> Maybe we have an issue with SYNCOOKIES (with MD5 + TS + SACK)
>
> Nice can of worms.
>
For updates of keys, it seems existing code lacks some RCU care.
MD5 keys use RCU for lookups/hashes, but the replacement of a key does
not allocate a new piece of memory.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 810cc164f795f8e1e8ca747ed5df51bb20fec8a2..ecc0e3fabce8b03bef823cbfc5c1b0a9e24df124
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4034,9 +4034,12 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
tcp_md5sig_key *key)
{
struct scatterlist sg;
+ u8 keylen = key->keylen;
- sg_init_one(&sg, key->key, key->keylen);
- ahash_request_set_crypt(hp->md5_req, &sg, NULL, key->keylen);
+ smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
+
+ sg_init_one(&sg, key->key, keylen);
+ ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
return crypto_ahash_update(hp->md5_req);
}
EXPORT_SYMBOL(tcp_md5_hash_key);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad6435ba6d72ffd8caf783bb25cad7ec151d6909..99916fcc15ca0be12c2c133ff40516f79e6fdf7f
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1113,6 +1113,9 @@ int tcp_md5_do_add(struct sock *sk, const union
tcp_md5_addr *addr,
if (key) {
/* Pre-existing entry - just update that one. */
memcpy(key->key, newkey, newkeylen);
+
+ smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
+
key->keylen = newkeylen;
return 0;
}
----- On Jun 30, 2020, at 6:38 PM, Eric Dumazet [email protected] wrote:
[...]
>
> For updates of keys, it seems existing code lacks some RCU care.
>
> MD5 keys use RCU for lookups/hashes, but the replacement of a key does
> not allocate a new piece of memory.
How is that RCU-safe ?
Based on what I see here:
tcp_md5_do_add() has a comment stating:
"/* This can be called on a newly created socket, from other files */"
which appears to be untrue if this can indeed be called on a live socket.
The path for pre-existing keys does:
key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index);
if (key) {
/* Pre-existing entry - just update that one. */
memcpy(key->key, newkey, newkeylen);
key->keylen = newkeylen;
return 0;
}
AFAIU, this works only if you assume there are no concurrent readers
accessing key->key, else they can see a corrupted key.
The change you are proposing adds smp_wmb/smp_rmb to pair stores
to key before key_len with loads of key_len before key. I'm not sure
what this is trying to achieve, and how it prevents the readers from
observing a corrupted state if the key is updated on a live socket ?
Based on my understanding, this path which deals with pre-existing keys
in-place should only ever be used when there are no concurrent readers,
else a new memory allocation would be needed to guarantee that readers
always observe a valid copy.
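To illustrate what I have in mind, here is a rough kernel-style sketch
of an allocate-and-publish update (hypothetical helper and type names,
not the actual tcp_md5_do_add() code):

/* Sketch only: allocate a new key, publish it with rcu_assign_pointer(),
 * and free the old one after a grace period, so RCU readers only ever
 * see a fully initialized key.
 */
struct my_md5_key {
        struct rcu_head rcu;
        u8              keylen;
        u8              key[TCP_MD5SIG_MAXKEYLEN];
};

static int my_md5_replace(struct my_md5_key __rcu **slot,
                          const u8 *newkey, u8 newkeylen, gfp_t gfp)
{
        struct my_md5_key *old, *new;

        new = kzalloc(sizeof(*new), gfp);
        if (!new)
                return -ENOMEM;
        memcpy(new->key, newkey, newkeylen);
        new->keylen = newkeylen;

        /* writers are assumed to be serialized by the socket lock */
        old = rcu_dereference_protected(*slot, 1);
        rcu_assign_pointer(*slot, new);
        if (old)
                kfree_rcu(old, rcu);
        return 0;
}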
Thanks,
Mathieu
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index
> 810cc164f795f8e1e8ca747ed5df51bb20fec8a2..ecc0e3fabce8b03bef823cbfc5c1b0a9e24df124
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4034,9 +4034,12 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
> int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
> tcp_md5sig_key *key)
> {
> struct scatterlist sg;
> + u8 keylen = key->keylen;
>
> - sg_init_one(&sg, key->key, key->keylen);
> - ahash_request_set_crypt(hp->md5_req, &sg, NULL, key->keylen);
> + smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
> +
> + sg_init_one(&sg, key->key, keylen);
> + ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
> return crypto_ahash_update(hp->md5_req);
> }
> EXPORT_SYMBOL(tcp_md5_hash_key);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index
> ad6435ba6d72ffd8caf783bb25cad7ec151d6909..99916fcc15ca0be12c2c133ff40516f79e6fdf7f
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1113,6 +1113,9 @@ int tcp_md5_do_add(struct sock *sk, const union
> tcp_md5_addr *addr,
> if (key) {
> /* Pre-existing entry - just update that one. */
> memcpy(key->key, newkey, newkeylen);
> +
> + smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
> +
> key->keylen = newkeylen;
> return 0;
> }
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Tue, Jun 30, 2020 at 4:44 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Jun 30, 2020, at 6:38 PM, Eric Dumazet [email protected] wrote:
> [...]
> >
> > For updates of keys, it seems existing code lacks some RCU care.
> >
> > MD5 keys use RCU for lookups/hashes, but the replacement of a key does
> > not allocate a new piece of memory.
>
> How is that RCU-safe ?
>
> Based on what I see here:
>
> tcp_md5_do_add() has a comment stating:
>
> "/* This can be called on a newly created socket, from other files */"
>
> which appears to be untrue if this can indeed be called on a live socket.
"This can be called" is not the same than "this is always called for
newly created socket"
>
> The path for pre-existing keys does:
>
> key = tcp_md5_do_lookup_exact(sk, addr, family, prefixlen, l3index);
> if (key) {
> /* Pre-existing entry - just update that one. */
> memcpy(key->key, newkey, newkeylen);
> key->keylen = newkeylen;
> return 0;
> }
>
> AFAIU, this works only if you assume there are no concurrent readers
> accessing key->key, else they can see a corrupted key.
This is fine.
>
> The change you are proposing adds smp_wmb/smp_rmb to pair stores
> to key before key_len with loads of key_len before key. I'm not sure
> what this is trying to achieve, and how it prevents the readers from
> observing a corrupted state if the key is updated on a live socket ?
By definition if you change the MD5 key on a socket while packets are
flying, the incoming packet could either
1) See old key (packet is dropped)
2) See new key.
So any other outcome (catching an intermediate state) is really not an
issue, since you have already accepted the fact that a packet could be
dropped, and TCP will retransmit.
The TCP MD5 implementation does not support multiple keys for one flow;
you cannot have both old and new keys being checked.
>
> Based on my understanding, this path which deals with pre-existing keys
> in-place should only ever be used when there are no concurrent readers,
> else a new memory allocation would be needed to guarantee that readers
> always observe a valid copy.
This is not _needed_, and since memory allocations can fail, we would
potentially break applications that assume changing an MD5 key can
never fail.
Patch has been sent for review on netdev@ (
https://patchwork.ozlabs.org/project/netdev/patch/[email protected]/
)
Eric Dumazet <[email protected]> wrote:
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 810cc164f795f8e1e8ca747ed5df51bb20fec8a2..ecc0e3fabce8b03bef823cbfc5c1b0a9e24df124
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4034,9 +4034,12 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
> int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
> tcp_md5sig_key *key)
> {
> struct scatterlist sg;
> + u8 keylen = key->keylen;
>
> - sg_init_one(&sg, key->key, key->keylen);
> - ahash_request_set_crypt(hp->md5_req, &sg, NULL, key->keylen);
> + smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
> +
> + sg_init_one(&sg, key->key, keylen);
> + ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
> return crypto_ahash_update(hp->md5_req);
> }
> EXPORT_SYMBOL(tcp_md5_hash_key);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index ad6435ba6d72ffd8caf783bb25cad7ec151d6909..99916fcc15ca0be12c2c133ff40516f79e6fdf7f
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1113,6 +1113,9 @@ int tcp_md5_do_add(struct sock *sk, const union
> tcp_md5_addr *addr,
> if (key) {
> /* Pre-existing entry - just update that one. */
> memcpy(key->key, newkey, newkeylen);
> +
> + smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
> +
> key->keylen = newkeylen;
> return 0;
> }
This doesn't make sense. Your smp_rmb only guarantees that you
see a version of key->key that's newer than keylen. What if the
key got changed twice? You could still read more than what's in
the key (if that's what you're trying to protect against).
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Jun 30, 2020 at 7:02 PM Herbert Xu <[email protected]> wrote:
>
> Eric Dumazet <[email protected]> wrote:
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index 810cc164f795f8e1e8ca747ed5df51bb20fec8a2..ecc0e3fabce8b03bef823cbfc5c1b0a9e24df124
> > 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4034,9 +4034,12 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
> > int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
> > tcp_md5sig_key *key)
> > {
> > struct scatterlist sg;
> > + u8 keylen = key->keylen;
> >
> > - sg_init_one(&sg, key->key, key->keylen);
> > - ahash_request_set_crypt(hp->md5_req, &sg, NULL, key->keylen);
> > + smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
> > +
> > + sg_init_one(&sg, key->key, keylen);
> > + ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
> > return crypto_ahash_update(hp->md5_req);
> > }
> > EXPORT_SYMBOL(tcp_md5_hash_key);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index ad6435ba6d72ffd8caf783bb25cad7ec151d6909..99916fcc15ca0be12c2c133ff40516f79e6fdf7f
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1113,6 +1113,9 @@ int tcp_md5_do_add(struct sock *sk, const union
> > tcp_md5_addr *addr,
> > if (key) {
> > /* Pre-existing entry - just update that one. */
> > memcpy(key->key, newkey, newkeylen);
> > +
> > + smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
> > +
> > key->keylen = newkeylen;
> > return 0;
> > }
>
> This doesn't make sense. Your smp_rmb only guarantees that you
> see a version of key->key that's newer than keylen. What if the
> key got changed twice? You could still read more than what's in
> the key (if that's what you're trying to protect against).
The worst that can happen is a dropped packet, which is going to
happen eventually anyway if an application changes a TCP MD5 key on a
live flow.
The main issue of the prior code was the double read of key->keylen in
tcp_md5_hash_key(), not that a few bytes could change under us.
I used smp_rmb() to ease backports, since old kernels had no
READ_ONCE()/WRITE_ONCE(), only ACCESS_ONCE().
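If it helps, here is a small self-contained illustration of the
double-fetch pattern (user-space C, illustrative only, not the kernel
code):

#include <stdatomic.h>
#include <string.h>

struct key_blob {
        _Atomic unsigned char len;      /* writer keeps len <= sizeof(bytes) */
        unsigned char bytes[80];
};

/* Racy shape: ->len is loaded once for the copy and again for the
 * return value, mirroring the old double read of key->keylen.
 * The two loads can disagree if a writer updates len in between.
 */
static size_t digest_racy(struct key_blob *k, unsigned char *scratch)
{
        memcpy(scratch, k->bytes, k->len);      /* load #1 */
        return k->len;                          /* load #2, may differ */
}

/* Fixed shape: one snapshot feeds every consumer, which is what
 * READ_ONCE(key->keylen) gives us in the patch.
 */
static size_t digest_fixed(struct key_blob *k, unsigned char *scratch)
{
        unsigned char len = atomic_load_explicit(&k->len,
                                                 memory_order_relaxed);

        memcpy(scratch, k->bytes, len);
        return len;
}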
On Tue, Jun 30, 2020 at 07:17:46PM -0700, Eric Dumazet wrote:
>
> The main issue of the prior code was the double read of key->keylen in
> tcp_md5_hash_key(), not that a few bytes could change under us.
>
> I used smp_rmb() to ease backports, since old kernels had no
> READ_ONCE()/WRITE_ONCE(), only ACCESS_ONCE().
If it's the double-read that you're protecting against, you should
just use barrier() and the comment should say so too.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Jun 30, 2020 at 7:23 PM Herbert Xu <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 07:17:46PM -0700, Eric Dumazet wrote:
> >
> > The main issue of the prior code was the double read of key->keylen in
> > tcp_md5_hash_key(), not that a few bytes could change under us.
> >
> > I used smp_rmb() to ease backports, since old kernels had no
> > READ_ONCE()/WRITE_ONCE(), only ACCESS_ONCE().
>
> If it's the double-read that you're protecting against, you should
> just use barrier() and the comment should say so too.
I made this clear in the changelog; do we want comments all over the place?
Do not get me wrong, we had this bug for years and suddenly this is a
big deal...
MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.
Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.
We only want to make sure that in the case key->keylen
is changed, CPUs in tcp_md5_hash_key() won't try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt().
On Tue, 2020-06-30 at 19:30 -0700, Eric Dumazet wrote:
> On Tue, Jun 30, 2020 at 7:23 PM Herbert Xu <[email protected]> wrote:
> > On Tue, Jun 30, 2020 at 07:17:46PM -0700, Eric Dumazet wrote:
> > > The main issue of the prior code was the double read of key->keylen in
> > > tcp_md5_hash_key(), not that a few bytes could change under us.
> > >
> > > I used smp_rmb() to ease backports, since old kernels had no
> > > READ_ONCE()/WRITE_ONCE(), only ACCESS_ONCE().
> >
> > If it's the double-read that you're protecting against, you should
> > just use barrier() and the comment should say so too.
>
> I made this clear in the changelog; do we want comments all over the place?
Having to run git for every line of code isn't great.
Comments in code are better than comments in changelogs.
On Tue, Jun 30, 2020 at 07:30:43PM -0700, Eric Dumazet wrote:
>
> I made this clear in the changelog; do we want comments all over the place?
> Do not get me wrong, we had this bug for years and suddenly this is a
> big deal...
I thought you were adding a new pair of smp_rmb/smp_wmb. If they
already exist in the code then I agree it's not a big deal. But
adding a new pair of bogus smp_Xmb's is bad for maintenance.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Jun 30, 2020 at 7:59 PM Herbert Xu <[email protected]> wrote:
>
> On Tue, Jun 30, 2020 at 07:30:43PM -0700, Eric Dumazet wrote:
> >
> > I made this clear in the changelog; do we want comments all over the place?
> > Do not get me wrong, we had this bug for years and suddenly this is a
> > big deal...
>
> I thought you were adding a new pair of smp_rmb/smp_wmb. If they
> already exist in the code then I agree it's not a big deal. But
> adding a new pair of bogus smp_Xmb's is bad for maintenance.
>
If I knew so many people were excited about TCP / MD5, I would have
posted all my patches on lkml ;)
Without the smp_wmb() we would still need something to prevent KMSAN
from detecting that we read uninitialized bytes,
if key->keylen is increased. (initial content of key->key[] is garbage)
Something like this :
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f111660453241692a17c881dd6dc2910a1236263..c3af8180c7049d5c4987bf5c67e4aff2ed6967c9
100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -4033,11 +4033,9 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
tcp_md5sig_key *key)
{
- u8 keylen = key->keylen;
+ u8 keylen = READ_ONCE(key->keylen); /* paired with
WRITE_ONCE() in tcp_md5_do_add */
struct scatterlist sg;
- smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
-
sg_init_one(&sg, key->key, keylen);
ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
return crypto_ahash_update(hp->md5_req);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 99916fcc15ca0be12c2c133ff40516f79e6fdf7f..0d08e0134335a21d23702e6a5c24a0f2b3c61c6f
100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1114,9 +1114,13 @@ int tcp_md5_do_add(struct sock *sk, const union
tcp_md5_addr *addr,
/* Pre-existing entry - just update that one. */
memcpy(key->key, newkey, newkeylen);
- smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
+ /* Pairs with READ_ONCE() in tcp_md5_hash_key().
+ * Also note that a reader could catch new key->keylen value
+ * but old key->key[], this is the reason we use __GFP_ZERO
+ * at sock_kmalloc() time below these lines.
+ */
+ WRITE_ONCE(key->keylen, newkeylen);
- key->keylen = newkeylen;
return 0;
}
@@ -1132,7 +1136,7 @@ int tcp_md5_do_add(struct sock *sk, const union
tcp_md5_addr *addr,
rcu_assign_pointer(tp->md5sig_info, md5sig);
}
- key = sock_kmalloc(sk, sizeof(*key), gfp);
+ key = sock_kmalloc(sk, sizeof(*key), gfp | __GFP_ZERO);
if (!key)
return -ENOMEM;
if (!tcp_alloc_md5sig_pool()) {
On Tue, Jun 30, 2020 at 08:36:51PM -0700, Eric Dumazet wrote:
>
> If I knew so many people were excited about TCP / MD5, I would have
> posted all my patches on lkml ;)
>
> Without the smp_wmb() we would still need something to prevent KMSAN
> from detecting that we read uninitialized bytes,
> if key->keylen is increased. (initial content of key->key[] is garbage)
>
> Something like this :
LGTM. Thanks,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
----- On Jun 30, 2020, at 11:36 PM, Eric Dumazet [email protected] wrote:
> On Tue, Jun 30, 2020 at 7:59 PM Herbert Xu <[email protected]> wrote:
>>
>> On Tue, Jun 30, 2020 at 07:30:43PM -0700, Eric Dumazet wrote:
>> >
>> > I made this clear in the changelog; do we want comments all over the place?
>> > Do not get me wrong, we had this bug for years and suddenly this is a
>> > big deal...
>>
>> I thought you were adding a new pair of smp_rmb/smp_wmb. If they
>> already exist in the code then I agree it's not a big deal. But
>> adding a new pair of bogus smp_Xmb's is bad for maintenance.
>>
>
> If I knew so many people were excited about TCP / MD5, I would have
> posted all my patches on lkml ;)
>
> Without the smp_wmb() we would still need something to prevent KMSAN
> from detecting that we read uninitialized bytes,
> if key->keylen is increased. (initial content of key->key[] is garbage)
>
> Something like this :
The approach below looks good to me, but you'll also need to annotate
both tcp_md5_hash_key and tcp_md5_do_add with __no_kcsan or use
data_race(expr) to let the concurrency sanitizer know that there is
a known data race which is there on purpose (triggered by memcpy in tcp_md5_do_add
and somewhere within crypto_ahash_update). See Documentation/dev-tools/kcsan.rst
for details.
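Something along these lines on the writer side, for instance (just a
sketch of where the annotation could go, exact placement is of course
up to you):

        if (key) {
                /* Pre-existing entry - just update that one.
                 * The copy intentionally races with readers hashing the
                 * old key (such a packet is simply dropped), so tell
                 * KCSAN the race is known and accepted.
                 */
                data_race(memcpy(key->key, newkey, newkeylen));
                WRITE_ONCE(key->keylen, newkeylen);
                return 0;
        }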
Thanks,
Mathieu
>
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index
> f111660453241692a17c881dd6dc2910a1236263..c3af8180c7049d5c4987bf5c67e4aff2ed6967c9
> 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -4033,11 +4033,9 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
>
> int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
> tcp_md5sig_key *key)
> {
> - u8 keylen = key->keylen;
> + u8 keylen = READ_ONCE(key->keylen); /* paired with
> WRITE_ONCE() in tcp_md5_do_add */
> struct scatterlist sg;
>
> - smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
> -
> sg_init_one(&sg, key->key, keylen);
> ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
> return crypto_ahash_update(hp->md5_req);
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index
> 99916fcc15ca0be12c2c133ff40516f79e6fdf7f..0d08e0134335a21d23702e6a5c24a0f2b3c61c6f
> 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1114,9 +1114,13 @@ int tcp_md5_do_add(struct sock *sk, const union
> tcp_md5_addr *addr,
> /* Pre-existing entry - just update that one. */
> memcpy(key->key, newkey, newkeylen);
>
> - smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
> + /* Pairs with READ_ONCE() in tcp_md5_hash_key().
> + * Also note that a reader could catch new key->keylen value
> + * but old key->key[], this is the reason we use __GFP_ZERO
> + * at sock_kmalloc() time below these lines.
> + */
> + WRITE_ONCE(key->keylen, newkeylen);
>
> - key->keylen = newkeylen;
> return 0;
> }
>
> @@ -1132,7 +1136,7 @@ int tcp_md5_do_add(struct sock *sk, const union
> tcp_md5_addr *addr,
> rcu_assign_pointer(tp->md5sig_info, md5sig);
> }
>
> - key = sock_kmalloc(sk, sizeof(*key), gfp);
> + key = sock_kmalloc(sk, sizeof(*key), gfp | __GFP_ZERO);
> if (!key)
> return -ENOMEM;
> if (!tcp_alloc_md5sig_pool()) {
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Wed, Jul 1, 2020 at 5:19 AM Mathieu Desnoyers
<[email protected]> wrote:
> The approach below looks good to me, but you'll also need to annotate
> both tcp_md5_hash_key and tcp_md5_do_add with __no_kcsan or use
> data_race(expr) to let the concurrency sanitizer know that there is
> a known data race which is there on purpose (triggered by memcpy in tcp_md5_do_add
> and somewhere within crypto_ahash_update). See Documentation/dev-tools/kcsan.rst
> for details.
Sure, I can add a data_race() and let stable teams handle the
backports without it ;)
>
> Thanks,
>
> Mathieu
>
> >
> > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> > index
> > f111660453241692a17c881dd6dc2910a1236263..c3af8180c7049d5c4987bf5c67e4aff2ed6967c9
> > 100644
> > --- a/net/ipv4/tcp.c
> > +++ b/net/ipv4/tcp.c
> > @@ -4033,11 +4033,9 @@ EXPORT_SYMBOL(tcp_md5_hash_skb_data);
> >
> > int tcp_md5_hash_key(struct tcp_md5sig_pool *hp, const struct
> > tcp_md5sig_key *key)
> > {
> > - u8 keylen = key->keylen;
> > + u8 keylen = READ_ONCE(key->keylen); /* paired with
> > WRITE_ONCE() in tcp_md5_do_add */
> > struct scatterlist sg;
> >
> > - smp_rmb(); /* paired with smp_wmb() in tcp_md5_do_add() */
> > -
> > sg_init_one(&sg, key->key, keylen);
> > ahash_request_set_crypt(hp->md5_req, &sg, NULL, keylen);
> > return crypto_ahash_update(hp->md5_req);
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index
> > 99916fcc15ca0be12c2c133ff40516f79e6fdf7f..0d08e0134335a21d23702e6a5c24a0f2b3c61c6f
> > 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1114,9 +1114,13 @@ int tcp_md5_do_add(struct sock *sk, const union
> > tcp_md5_addr *addr,
> > /* Pre-existing entry - just update that one. */
> > memcpy(key->key, newkey, newkeylen);
> >
> > - smp_wmb(); /* pairs with smp_rmb() in tcp_md5_hash_key() */
> > + /* Pairs with READ_ONCE() in tcp_md5_hash_key().
> > + * Also note that a reader could catch new key->keylen value
> > + * but old key->key[], this is the reason we use __GFP_ZERO
> > + * at sock_kmalloc() time below these lines.
> > + */
> > + WRITE_ONCE(key->keylen, newkeylen);
> >
> > - key->keylen = newkeylen;
> > return 0;
> > }
> >
> > @@ -1132,7 +1136,7 @@ int tcp_md5_do_add(struct sock *sk, const union
> > tcp_md5_addr *addr,
> > rcu_assign_pointer(tp->md5sig_info, md5sig);
> > }
> >
> > - key = sock_kmalloc(sk, sizeof(*key), gfp);
> > + key = sock_kmalloc(sk, sizeof(*key), gfp | __GFP_ZERO);
> > if (!key)
> > return -ENOMEM;
> > if (!tcp_alloc_md5sig_pool()) {
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
On Tue, Jun 30, 2020 at 3:07 PM Eric Dumazet <[email protected]> wrote:
> Oh well, tcp_syn_options() is supposed to have the same logic.
>
> Maybe we have an issue with SYNCOOKIES (with MD5 + TS + SACK)
>
> Nice can of worms.
Yes, MD5 does not like SYNCOOKIES in some cases.
In this trace, S is a Linux host and the SYNACK is a syncookie.
Host C is attempting a SYN with MD5+TS+SACK, which a standard Linux
host would not attempt, but we could expect another stack to try this
combination.
The TCP stack believes it can encode the selected TCP options (in the
TSVAL value), but since the MD5 option is set, the TS option is not
sent in the SYNACK. However, we still send other options that MUST NOT
be sent if the TS option could not fit in the TCP option space.
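Roughly, the syncookie dependency on the TS option looks like this
(made-up helper names, just to illustrate the point, not the actual
syncookie code):

/* A stateless server can only remember wscale/SACK by echoing them in
 * the low bits of TSVAL. If MD5 forces the TS option out of the SYNACK,
 * the returning ACK carries no tsecr and the server has to assume that
 * nothing beyond MSS was negotiated.
 */
struct synack_opts {
        unsigned int wscale:4;
        unsigned int sack_ok:1;
};

static unsigned int encode_opts_in_tsval(unsigned int now,
                                         struct synack_opts o)
{
        /* stash the negotiated options in the low bits of TSVAL */
        return (now & ~0x1fu) | o.wscale | (o.sack_ok << 4);
}

static struct synack_opts decode_opts_from_ack(int saw_tsecr,
                                               unsigned int tsecr)
{
        struct synack_opts o = { 0 };

        if (!saw_tsecr)         /* TS squeezed out by MD5: info is lost */
                return o;       /* must assume no wscale, no SACK */
        o.wscale  = tsecr & 0xf;
        o.sack_ok = (tsecr >> 4) & 1;
        return o;
}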
10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535,
options [nop,nop,md5 valid,mss 65495,sackOK,TS val 456965269 ecr
0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602,
win 65535, options [nop,nop,md5 valid,mss
65495,nop,nop,sackOK,nop,wscale 8], length 0
<When the ACK packet comes back from the client, the server has no
state and no TS ecr value, so it must assume no options were negotiated>
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options
[nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256,
options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options
[nop,nop,md5 valid], length 0
We can see, for instance, that the windows are wrong by a factor of
256 (wscale 8).