LinuxLists.cc - Re: [PATCH] ntp: remove accidental integer wrap-around

2024-05-14 10:59:30

Subject: Re: [PATCH] ntp: remove accidental integer wrap-around

On Tue, May 07 2024 at 04:34, Justin Stitt wrote:
> Using syzkaller alongside the newly reintroduced signed integer overflow
> sanitizer spits out this report:
>
> [ 138.454979] ------------[ cut here ]------------
> [ 138.458089] UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
> [ 138.462134] 9223372036854775807 + 500 cannot be represented in type 'long'
> [ 138.466234] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-00038-gc0a509640e93-dirty #10
> [ 138.471498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> [ 138.477110] Call Trace:
> [ 138.478657] <IRQ>
> [ 138.479964] dump_stack_lvl+0x93/0xd0
> [ 138.482276] handle_overflow+0x171/0x1b0
> [ 138.484699] second_overflow+0x2d6/0x500
> [ 138.487133] accumulate_nsecs_to_secs+0x60/0x160
> [ 138.489931] timekeeping_advance+0x1fe/0x890
> [ 138.492535] update_wall_time+0x10/0x30

Same comment vs. trimming.

> Historically, the signed integer overflow sanitizer did not work in the
> kernel due to its interaction with `-fwrapv` but this has since been
> changed [1] in the newest version of Clang. It was re-enabled in the
> kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
> sanitizer").

Again. Irrelevant to the problem.

> Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
> properly limit the max size of time_maxerror without overflowing during
> the check itself.

This fails to tell what is causing the issue and just talks about what
the patch is doing. The latter can be seen from the patch itself, no?

Something like this:

On second overflow time_maxerror is unconditionally incremented and
the result is checked against NTP_PHASE_LIMIT, but the increment can
overflow into negative space.

Prevent this by checking the overflow condition before incrementing.

Hmm?

But that obviously begs the question why this can happen at all.

#define MAXPHASE 500000000L
#define NTP_PHASE_LIMIT ((MAXPHASE / NSEC_PER_USEC) << 5)

==> NTP_PHASE_LIMIT = 1.6e+07 = 0xf42400

#define MAXFREQ 500000

So how can 0xf42400 + 500000/1000 overflow in the first place?

It can't unless time_maxerror is somehow initialized to a bogus
value and indeed it is:

process_adjtimex_modes()
....
if (txc->modes & ADJ_MAXERROR)
time_maxerror = txc->maxerror;

So that wants to be fixed and not the symptom.

Thanks,

tglx

2024-05-16 23:40:26

by Justin Stitt

[permalink] [raw]

Subject: Re: [PATCH] ntp: remove accidental integer wrap-around

Hi,

On Tue, May 14, 2024 at 3:38 AM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, May 07 2024 at 04:34, Justin Stitt wrote:
> > Using syzkaller alongside the newly reintroduced signed integer overflow
> > sanitizer spits out this report:
> >
> > [ 138.454979] ------------[ cut here ]------------
> > [ 138.458089] UBSAN: signed-integer-overflow in ../kernel/time/ntp.c:461:16
> > [ 138.462134] 9223372036854775807 + 500 cannot be represented in type 'long'
> > [ 138.466234] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 6.8.0-rc2-00038-gc0a509640e93-dirty #10
> > [ 138.471498] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> > [ 138.477110] Call Trace:
> > [ 138.478657] <IRQ>
> > [ 138.479964] dump_stack_lvl+0x93/0xd0
> > [ 138.482276] handle_overflow+0x171/0x1b0
> > [ 138.484699] second_overflow+0x2d6/0x500
> > [ 138.487133] accumulate_nsecs_to_secs+0x60/0x160
> > [ 138.489931] timekeeping_advance+0x1fe/0x890
> > [ 138.492535] update_wall_time+0x10/0x30
>
> Same comment vs. trimming.

Gotcha, in the next version this will be trimmed.

>
> > Historically, the signed integer overflow sanitizer did not work in the
> > kernel due to its interaction with `-fwrapv` but this has since been
> > changed [1] in the newest version of Clang. It was re-enabled in the
> > kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow
> > sanitizer").
>
> Again. Irrelevant to the problem.

Right, I'll move it below the fold.

>
> > Let's introduce a new macro and use that against NTP_PHASE_LIMIT to
> > properly limit the max size of time_maxerror without overflowing during
> > the check itself.
>
> This fails to tell what is causing the issue and just talks about what
> the patch is doing. The latter can be seen from the patch itself, no?
>
> Something like this:
>
> On second overflow time_maxerror is unconditionally incremented and
> the result is checked against NTP_PHASE_LIMIT, but the increment can
> overflow into negative space.
>
> Prevent this by checking the overflow condition before incrementing.
>
> Hmm?

Sounds better :thumbs_up: I'll use this!

>
> But that obviously begs the question why this can happen at all.
>
> #define MAXPHASE 500000000L
> #define NTP_PHASE_LIMIT ((MAXPHASE / NSEC_PER_USEC) << 5)
>
> ==> NTP_PHASE_LIMIT = 1.6e+07 = 0xf42400
>
> #define MAXFREQ 500000
>
> So how can 0xf42400 + 500000/1000 overflow in the first place?
>
> It can't unless time_maxerror is somehow initialized to a bogus
> value and indeed it is:
>
> process_adjtimex_modes()
> ....
> if (txc->modes & ADJ_MAXERROR)
> time_maxerror = txc->maxerror;
>
> So that wants to be fixed and not the symptom.

Isn't this usually supplied from the user and can be some pretty
random stuff? Are you suggesting we update
timekeeping_validate_timex() to include a check to limit the maxerror
field to (NTP_PHASE_LIMIT-(MAXFREQ / NSEC_PER_USEC))? It seems like we
should handle the overflow case where it happens: in
second_overflow().

The clear intent of the existing code was to saturate at
NTP_PHASE_LIMIT, they just did it in a way where the check itself
triggers overflow sanitizers.

>
> Thanks,
>
> tglx

Thanks
Justin

2024-05-16 23:55:41

by Justin Stitt

[permalink] [raw]

Subject: Re: [PATCH] ntp: remove accidental integer wrap-around

On Thu, May 16, 2024 at 4:40 PM Justin Stitt <[email protected]> wrote:
> Isn't this usually supplied from the user and can be some pretty
> random stuff? Are you suggesting we update
> timekeeping_validate_timex() to include a check to limit the maxerror
> field to (NTP_PHASE_LIMIT-(MAXFREQ / NSEC_PER_USEC))? It seems like we
> should handle the overflow case where it happens: in
> second_overflow().

Or, I suppose we could add a check to timekeeping_validate_timex() like:

if (txc->modes & ADJ_MAXERROR) {
if (txc->maxerror < 0 || txc->maxerror > NTP_PHASE_LIMIT)
return -EINVAL;
}

> Thanks
> Justin

2024-05-17 08:49:49

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH] ntp: remove accidental integer wrap-around

On Thu, May 16 2024 at 16:40, Justin Stitt wrote:
> On Tue, May 14, 2024 at 3:38 AM Thomas Gleixner <[email protected]> wrote:
>> So how can 0xf42400 + 500000/1000 overflow in the first place?
>>
>> It can't unless time_maxerror is somehow initialized to a bogus
>> value and indeed it is:
>>
>> process_adjtimex_modes()
>> ....
>> if (txc->modes & ADJ_MAXERROR)
>> time_maxerror = txc->maxerror;
>>
>> So that wants to be fixed and not the symptom.
>
> Isn't this usually supplied from the user and can be some pretty
> random stuff?

Sure it comes from user space and can contain random nonsense as
syzkaller demonstrated.

> Are you suggesting we update timekeeping_validate_timex() to include a
> check to limit the maxerror field to (NTP_PHASE_LIMIT-(MAXFREQ /
> NSEC_PER_USEC))? It seems like we should handle the overflow case
> where it happens: in second_overflow().
>
> The clear intent of the existing code was to saturate at
> NTP_PHASE_LIMIT, they just did it in a way where the check itself
> triggers overflow sanitizers.

The clear intent of the code is to do saturation of a bound value.

Clearly the user space interface fails to validate the input to be in a
sane range and that makes you go and prevent the resulting overflow at
the usage site. Seriously?

Obviously the sanitizer detects the stupid in second_overflow(), but
that does not mean that the proper solution is to add overflow
protection to that code.

Tools are good to pin-point symptoms, but they are by definition
patently bad in root cause analysis. Otherwise we could just let the
tool write the "fix".

The obvious first question in such a case is to ask _WHY_ does
time_maxerror have a bogus value, which clearly cannot be achieved from
regular operation.

Once you figured out that the only way to set time_maxerror to a bogus
value is the user space interface the obvious follow up question is
whether such a value has to be considered as valid or not.

As it is obviously invalid the logical consequence is to add a sanity
check and reject that nonsense at that boundary, no?

Thanks,

tglx