2020-01-15 17:09:34

by Jakub Witowski

[permalink] [raw]
Subject: [PATCH 0/1] Sequence number out of range fix

During sending messages with high speed, the 'cached' value may have exceeded the 24-bit range
and this value will save into storage.

The current implementation protects us against exceeding max sequence number (0xFFFFFF).

Jakub Witowski (1):
mesh: sequence number out of range fix

mesh/mesh-config-json.c | 4 ++++
mesh/net.c | 3 +++
2 files changed, 7 insertions(+)

--
2.20.1


2020-01-15 18:36:48

by Gix, Brian

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

Hi Jakub,

On Wed, 2020-01-15 at 18:08 +0100, Jakub Witowski wrote:
> During sending messages with high speed, the 'cached' value may have exceeded the 24-bit range
> and this value will save into storage.
>
> The current implementation protects us against exceeding max sequence number (0xFFFFFF).
>
> Jakub Witowski (1):
> mesh: sequence number out of range fix
>
> mesh/mesh-config-json.c | 4 ++++
> mesh/net.c | 3 +++
> 2 files changed, 7 insertions(+)
>

This is a very dangerous course of action. The suggested patch might potentially cause a node to re-use a
sequence number more than once, which would cause a "dirty nonce" condition, and allow unauthorized entities to
derive the encrypted data without the keys.

The Mesh working group put a lot of work into determining what should be range of Sequence Numbers and IV
Indexes, and came up with the 24 bits of sequence numbers, which can be reset to 0 with an IV Index update
every 192 hours. This will allow a node to send 24 packets (every 42ms) a second constantly without running
out of sequence numbers. While this is technically possible, especially with a daemon that might be looping
back messages internally without ever using the OTA interface, it is not really possible in an actual BT driven
real life system.


So here is the solution I would suggest:

Beacuse we store sequence numbers internally with a u32 sized data type, we should *let* the value go over the
max legal sequence nunmbver of 0xFFFFFF (perhaps capping it at 0x01000000 to prevent "super overflows"). Then
we *reject* all send requests with a sequence number > 0xFFFFFF.

Assuming the IV Index update logic is working correctly (triggered internally when the Sequence number goes
over 0x00800000... half the max) the node will work just fine as long as it never tries to send more than one
packet every 42ms for 192 hours without rest... Just as the specification intended. If it does, then the
specification and code will disallow it.

BR,
Brian

2020-01-15 19:09:37

by Michał Lowas-Rzechonek

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

Brian, Jakub,

On 01/15, Gix, Brian wrote:
> This is a very dangerous course of action. The suggested patch might
> potentially cause a node to re-use a sequence number more than once,
> which would cause a "dirty nonce" condition, and allow unauthorized
> entities to derive the encrypted data without the keys.

Good point. Note that his is also possible in the current
implementation: since seq_num is applied to nonce with a 24bit mask,
it's going to wrap around.

> While this is technically possible, especially with a daemon that
> might be looping back messages internally without ever using the OTA
> interface, it is not really possible in an actual BT driven real life
> system.

There is another option to exhaust the 24 bit range: we have the
overcommit mechanism in the storage. Let's say the daemon starts, sends
a few messages (and overcommits the sequence by a certain value), then
either the daemon, or the system crashes.

Do that a few times, and you end up with storage containing sequence
number exceeding 24 bits.

> Beacuse we store sequence numbers internally with a u32 sized data
> type, we should *let* the value go over the max legal sequence
> nunmbver of 0xFFFFFF (perhaps capping it at 0x01000000 to prevent
> "super overflows"). Then we *reject* all send requests with a
> sequence number > 0xFFFFFF.

Sounds good.

--
Michał Lowas-Rzechonek <[email protected]>
Silvair http://silvair.com
Jasnogórska 44, 31-358 Krakow, POLAND

2020-01-15 19:56:59

by Gix, Brian

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

Hi Michał

> On Jan 15, 2020, at 11:09 AM, "[email protected]" <[email protected]> wrote:
>
> Brian, Jakub,
>
>> On 01/15, Gix, Brian wrote:
>> This is a very dangerous course of action. The suggested patch might
>> potentially cause a node to re-use a sequence number more than once,
>> which would cause a "dirty nonce" condition, and allow unauthorized
>> entities to derive the encrypted data without the keys.
>
> Good point. Note that his is also possible in the current
> implementation: since seq_num is applied to nonce with a 24bit mask,
> it's going to wrap around.

The full IV Index is in the nonce, and at 192 hours per IV index, will be unique for something like 1.4 million years.

>
>> While this is technically possible, especially with a daemon that
>> might be looping back messages internally without ever using the OTA
>> interface, it is not really possible in an actual BT driven real life
>> system.
>
> There is another option to exhaust the 24 bit range: we have the
> overcommit mechanism in the storage. Let's say the daemon starts, sends
> a few messages (and overcommits the sequence by a certain value), then
> either the daemon, or the system crashes.
>
> Do that a few times, and you end up with storage containing sequence
> number exceeding 24 bits.


The over commit is calculated based on the usage rate, and the daemon would need to unexpectedly abort (not just ctrl-c or exit) for us to use the over-commit value, as on deliberate exit, the actual sequence used is saved. If an unexpected abort occurs, I think the default daemon restart is 5 seconds?
>
>> Beacuse we store sequence numbers internally with a u32 sized data
>> type, we should *let* the value go over the max legal sequence
>> nunmbver of 0xFFFFFF (perhaps capping it at 0x01000000 to prevent
>> "super overflows"). Then we *reject* all send requests with a
>> sequence number > 0xFFFFFF.
>
> Sounds good.
>
> --
> Michał Lowas-Rzechonek <[email protected]>
> Silvair http://silvair.com
> Jasnogórska 44, 31-358 Krakow, POLAND

2020-01-15 21:05:47

by Michał Lowas-Rzechonek

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

Hi Brian,

On 01/15, Gix, Brian wrote:
> > Good point. Note that his is also possible in the current
> > implementation: since seq_num is applied to nonce with a 24bit mask,
> > it's going to wrap around.
>
> The full IV Index is in the nonce, and at 192 hours per IV index, will
> be unique for something like 1.4 million years.

Yes it is, but that's not the point.

At the moment, net->seq_num is a 32 bit value that *can* exceed 24bit
range, because mesh_net_next_seq_num() doesn't check ranges. So the
raw value can reach 0x1000000 and above.

Now, this raw value is used in send_seg, passed to
mesh_crypto_packet_build, which effectively applies a 24bit mask in line
640:

l_put_be32(seq, packet + 1);
packet[1] = (ctl ? CTL : 0) | (ttl & TTL_MASK);

So this means that when:

- the network is already in iv update (so that you can't increase the
iv_index, maybe you even started the procedure yourself because your
seq_num is above the threshold)

- your sequence number is sufficiently large (because of the "repeated
crash" scenario described below)

Then the actual value used in the nonce will be net->seq_num & 0xffffff,
which is something you have *already* used before. All of that happens
with the same IV index.

> The over commit is calculated based on the usage rate, and the daemon
> would need to unexpectedly abort (not just ctrl-c or exit) for us to
> use the over-commit value

Indeed, that's precisely what I'm talking about - repeated, unhandled
process terminations. We're trying to come up with the patch simply
because this situation has *already happened* on one of our instances.

--
Michał Lowas-Rzechonek <[email protected]>
Silvair http://silvair.com
Jasnogórska 44, 31-358 Krakow, POLAND

2020-01-15 21:33:00

by Gix, Brian

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

On Wed, 2020-01-15 at 22:04 +0100, [email protected] wrote:
> Hi Brian,
>
> On 01/15, Gix, Brian wrote:
> > > Good point. Note that his is also possible in the current
> > > implementation: since seq_num is applied to nonce with a 24bit mask,
> > > it's going to wrap around.
> >
> > The full IV Index is in the nonce, and at 192 hours per IV index, will
> > be unique for something like 1.4 million years.
>
> Yes it is, but that's not the point.
>
> At the moment, net->seq_num is a 32 bit value that *can* exceed 24bit
> range, because mesh_net_next_seq_num() doesn't check ranges. So the
> raw value can reach 0x1000000 and above.
>
> Now, this raw value is used in send_seg, passed to
> mesh_crypto_packet_build, which effectively applies a 24bit mask in line
> 640:

Yes, we should definitely be sanity checking this, and not sending if SeqNum out of range.

>
> l_put_be32(seq, packet + 1);
> packet[1] = (ctl ? CTL : 0) | (ttl & TTL_MASK);
>
> So this means that when:
>
> - the network is already in iv update (so that you can't increase the
> iv_index, maybe you even started the procedure yourself because your
> seq_num is above the threshold)
>
> - your sequence number is sufficiently large (because of the "repeated
> crash" scenario described below)

I think if we are repeatedly crashing, and it is causing a runaway sequence number increase, that being
forbidden to send more packets is a natural consequence, and people should fix the code that was causing the
crash in the first place.

>
> Then the actual value used in the nonce will be net->seq_num & 0xffffff,
> which is something you have *already* used before. All of that happens
> with the same IV index.
>
> > The over commit is calculated based on the usage rate, and the daemon
> > would need to unexpectedly abort (not just ctrl-c or exit) for us to
> > use the over-commit value
>
> Indeed, that's precisely what I'm talking about - repeated, unhandled
> process terminations. We're trying to come up with the patch simply
> because this situation has *already happened* on one of our instances.
>

2020-01-15 22:11:05

by Michał Lowas-Rzechonek

[permalink] [raw]
Subject: Re: [PATCH 0/1] Sequence number out of range fix

On 01/15, Gix, Brian wrote:
> > Now, this raw value is used in send_seg, passed to
> > mesh_crypto_packet_build, which effectively applies a 24bit mask in line
> > 640:
>
> Yes, we should definitely be sanity checking this, and not sending if
> SeqNum out of range.

Ok, we'll do that then.

> I think if we are repeatedly crashing, and it is causing a runaway
> sequence number increase, that being forbidden to send more packets is
> a natural consequence, and people should fix the code that was causing
> the crash in the first place.

You don't say. But it's not a given that the code is in the daemon
itself. Two words: OOM killer.

--
Michał Lowas-Rzechonek <[email protected]>
Silvair http://silvair.com
Jasnogórska 44, 31-358 Krakow, POLAND