Jack Schmidt reported a bug for the arm32 clock_gettime64 vdso call last
month: https://github.com/richfelker/musl-cross-make/issues/96 and
https://github.com/raspberrypi/linux/issues/3579
As Will Deacon pointed out, this was never reported on the mailing list,
so I'll try to summarize what we know, so this can hopefully be resolved soon.
- This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
clock_gettime64(CLOCK_REALTIME)
- The kernel tree is at https://github.com/raspberrypi/linux/, but I could
see no relevant changes compared to a mainline kernel.
- From the report, I see that the returned time value is larger than the
expected time by 3.4 to 14.5 million seconds in four samples. My
guess is that a random number gets added in at some point.
- From other sources, I found that the Raspberry Pi clocksource runs
at 54 MHz, with a mask value of 0xffffffffffffff. From these numbers
I would expect that reading a completely random hardware register
value would result in an offset of up to 1.33 billion seconds, which is
around a factor of 100 more than the error we see, though similar.
- The test case calls the musl clock_gettime() function, which falls back to
the clock_gettime64() syscall on kernels prior to 5.5, or to the 32-bit
clock_gettime() prior to Linux-5.1. As reported in the bug, Linux-4.19 does
not show the bug.
- The behavior was not reproduced on the same user space in qemu,
though I cannot tell whether the exact same kernel binary was used.
- glibc-2.31 calls the same clock_gettime64() vdso function on arm to
implement clock_gettime(), but earlier versions did not. I have not
seen any reports of this bug, which could be explained by users
generally being on older versions.
- As far as I can tell, there are no reports of this bug from other users,
and so far nobody could reproduce it.
- The current musl git tree has been patched to not call clock_gettime64
on ARM because of this problem, so it cannot be used for reproducing it.
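As a rough check of the clocksource arithmetic in the list above, here is a
minimal sketch. It is not taken from the report, and the helper name
counter_span_seconds is made up for illustration. It only computes how long a
free-running counter with a given mask takes to wrap at a given frequency.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: seconds covered by a free-running counter before it
 * wraps, given its clocksource mask and its frequency in Hz.
 * The result is approximate because the mask is converted to double. */
double counter_span_seconds(uint64_t mask, double hz)
{
    assert(hz > 0.0);
    return ((double)mask + 1.0) / hz;
}
```

With the mask 0xffffffffffffff (56 bits) and 54e6 Hz this gives roughly
1.33e9 seconds, which is about 92 times the largest observed error of
14.5 million seconds.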
If anyone has other information that may help figure out what is going
on, please share.
Arnd
On 19/05/2020 16:54, Arnd Bergmann wrote:
> - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> clock_gettime64(CLOCK_REALTIME)
Does it happen with other clocks as well?
>
> - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> see no relevant changes compared to a mainline kernel.
Is this bug reproducible with a mainline kernel, or can a mainline kernel not
be booted on bcm2711?
>
> - From the report, I see that the returned time value is larger than the
> expected time, by 3.4 to 14.5 million seconds in four samples, my
> guess is that a random number gets added in at some point.
What kind of code are you using to reproduce it? Is it threaded, or does it
issue clock_gettime from signal handlers?
>
> - The current musl git tree has been patched to not call clock_gettime64
> on ARM because of this problem, so it cannot be used for reproducing it.
So should glibc follow musl and remove arm clock_gettime64 vDSO support,
or is this bug localized to a specific kernel version running on
specific hardware?
On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
<[email protected]> wrote:
> On 19/05/2020 16:54, Arnd Bergmann wrote:
> > - This happened reproducibly on Linux-5.6 on a 32-bit Raspberry Pi patched
> > kernel running on a 64-bit Raspberry Pi 4b (bcm2711) when calling
> > clock_gettime64(CLOCK_REALTIME)
>
> Does it happen with other clocks as well?
Unclear.
> > - The kernel tree is at https://github.com/raspberrypi/linux/, but I could
> > see no relevant changes compared to a mainline kernel.
>
> Is this bug reproducible with a mainline kernel, or can a mainline kernel not
> be booted on bcm2711?
Mainline linux-5.6 should boot on that machine but might not have
all the other features, so I think users tend to use the raspberry pi
kernel sources for now.
> > - From the report, I see that the returned time value is larger than the
> > expected time, by 3.4 to 14.5 million seconds in four samples, my
> > guess is that a random number gets added in at some point.
>
> What kind of code are you using to reproduce it? Is it threaded, or does it
> issue clock_gettime from signal handlers?
The reproducer is very simple without threads or signals,
see the start of https://github.com/richfelker/musl-cross-make/issues/96
It does rely on calling into the musl wrapper, not the direct vdso
call.
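For anyone who wants to probe for the symptom without the original test
program, the following is a minimal sketch of a jump detector. It is not the
reproducer from the issue; the function name count_time_jumps and its
threshold parameter are invented here. It goes through the libc
clock_gettime() wrapper, so whether it exercises the vdso depends on the libc
in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical check: sample the clock `samples` times back to back and
 * count how many consecutive pairs go backwards or advance by more than
 * max_step_ns nanoseconds. Returns -1 if clock_gettime() itself fails. */
int count_time_jumps(clockid_t clk, int samples, int64_t max_step_ns)
{
    struct timespec prev, cur;
    int jumps = 0;

    assert(samples > 0 && max_step_ns >= 0);
    if (clock_gettime(clk, &prev) != 0)
        return -1;
    for (int i = 1; i < samples; i++) {
        if (clock_gettime(clk, &cur) != 0)
            return -1;
        /* Difference between the two samples in nanoseconds. */
        int64_t delta =
            ((int64_t)cur.tv_sec - (int64_t)prev.tv_sec) * 1000000000LL
            + ((int64_t)cur.tv_nsec - (int64_t)prev.tv_nsec);
        if (delta < 0 || delta > max_step_ns)
            jumps++;
        prev = cur;
    }
    return jumps;
}
```

On a healthy system a call such as
count_time_jumps(CLOCK_REALTIME, 1000, 1000000000LL) would be expected to
return 0 unless the clock is being stepped at the same time; offsets of
millions of seconds, as in the report, would show up as a nonzero count.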
>
> So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> or is this bug localized to a specific kernel version running on
> specific hardware?
I hope we can figure out what is actually going on soon; there is probably
no need to change glibc before we have.
Arnd
On Tue, May 19, 2020 at 05:24:18PM -0300, Adhemerval Zanella wrote:
> What kind of code are you using to reproduce it? Is it threaded, or does it
> issue clock_gettime from signal handlers?
Original report thread is here:
https://github.com/richfelker/musl-cross-make/issues/96
The reporter originally misunderstood the issue and wrongly attributed
it to a difference between gettimeofday and clock_gettime, but it was
just big jumps between successive vdso clock_gettime64 calls.
No transformation was being done on the output of the vdso function;
as long as it succeeds musl just returns directly with the value it
stored in the timespec. No threads or anything fancy were involved.
Current musl will no longer call it but you should be able to
dlopen("linux-gate.so.1", RTLD_NOW|RTLD_LOCAL) then use dlsym to get
its address and call it (not tested; I've never used it this way).
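A minimal sketch of that dlopen/dlsym approach follows. Like the suggestion
itself, it has not been tested on the affected hardware. The names
find_vdso_cgt64 and struct ktimespec64 are invented for illustration, and
"linux-vdso.so.1" is tried first on the assumption that the vdso is
registered under that name on the target. The struct mirrors the two 64-bit
fields of the kernel's time64 timespec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the kernel's 64-bit timespec (two 64-bit fields). */
struct ktimespec64 {
    int64_t tv_sec;
    int64_t tv_nsec;
};

typedef int (*vdso_cgt64_fn)(int clk, struct ktimespec64 *ts);

/* Hypothetical lookup: return the address of __vdso_clock_gettime64, or
 * NULL if no vdso with that symbol can be found under the names tried. */
vdso_cgt64_fn find_vdso_cgt64(void)
{
    static const char *const names[] = { "linux-vdso.so.1", "linux-gate.so.1" };

    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        void *handle = dlopen(names[i], RTLD_NOW | RTLD_LOCAL);
        if (!handle)
            continue;
        void *sym = dlsym(handle, "__vdso_clock_gettime64");
        if (sym)
            return (vdso_cgt64_fn)sym;
        dlclose(handle);
    }
    return NULL;
}

/* Call the vdso function directly, bypassing the libc wrapper.
 * Returns -1 if the symbol is unavailable. */
int vdso_clock_gettime64_direct(int clk, struct ktimespec64 *ts)
{
    assert(ts != NULL);
    vdso_cgt64_fn fn = find_vdso_cgt64();
    return fn ? fn(clk, ts) : -1;
}
```

On 64-bit systems the symbol is not expected to exist, so the lookup simply
returns NULL there. Older glibc versions need -ldl when linking.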
> > - The current musl git tree has been patched to not call clock_gettime64
> > on ARM because of this problem, so it cannot be used for reproducing it.
>
> So should glibc follow musl and remove arm clock_gettime64 vDSO support,
> or is this bug localized to a specific kernel version running on
> specific hardware?
For musl it was important to disable it asap pending a fix, because
users are expected to generate static binaries, and these could make
it into the wild without anyone realizing they're broken until much
later when run on an affected kernel (especially since pre-5.6 kernels
would hide the issue entirely due to lacking vdso). Ideally a fix will
be something we can detect (e.g. new symbol version) so as not to risk
calling the broken one, but whether that's necessary may depend on
what's affected.
I'm not sure if glibc should do the same; it's not often used in
static linking, and replacing libc (shared lib, or re-static-linking
which LGPL requires you to facilitate to distribute static binaries)
could solve the issue on affected systems.
Rich
The 05/19/2020 22:31, Arnd Bergmann wrote:
> On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> <[email protected]> wrote:
> > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > - As far as I can tell, there are no reports of this bug from other users,
> > > and so far nobody could reproduce it.
note: i could not reproduce it in qemu-system with these configs:
qemu-system-aarch64 + arm64 kernel + compat vdso
qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
qemu-system-arm + cpu max + 32bit arm kernel
so i think it's something specific to that user's setup
(maybe rpi hw bug or gcc miscompiled the vdso or something
with that particular linux, i built my own linux 5.6 because
i did not know the exact kernel version where the bug was seen)
i don't have access to rpi (or other cortex-a53 where i
can install my own kernel) so this is as far as i got.
On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> note: i could not reproduce it in qemu-system with these configs:
>
> qemu-system-aarch64 + arm64 kernel + compat vdso
> qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> qemu-system-arm + cpu max + 32bit arm kernel
>
> so i think it's something specific to that user's setup
> (maybe rpi hw bug or gcc miscompiled the vdso or something
> with that particular linux, i built my own linux 5.6 because
> i did not know the exact kernel version where the bug was seen)
>
> i don't have access to rpi (or other cortex-a53 where i
> can install my own kernel) so this is as far as i got.
If we have a binary of the kernel that's known to be failing on the
hardware, it would be useful to dump its vdso and examine the
disassembly to see if it was miscompiled.
Rich
On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > note: i could not reproduce it in qemu-system with these configs:
> >
> > qemu-system-aarch64 + arm64 kernel + compat vdso
> > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > qemu-system-arm + cpu max + 32bit arm kernel
> >
> > so i think it's something specific to that user's setup
> > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > with that particular linux, i built my own linux 5.6 because
> > i did not know the exact kernel version where the bug was seen)
> >
> > i don't have access to rpi (or other cortex-a53 where i
> > can install my own kernel) so this is as far as i got.
>
> If we have a binary of the kernel that's known to be failing on the
> hardware, it would be useful to dump its vdso and examine the
> disassembly to see if it was miscompiled.
OK, OP posted it and I think we've solved this. See
https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410
And my analysis:
<@dalias> see what i just found on the tracker
<@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out the time32 functions in this case
<@dalias> but not the time64 one
<@dalias> this looks like a real kernel bug that's not hw-specific except breaking on all hardware where the patching-out is needed
<@dalias> we could possibly work around it by refusing to use the time64 vdso unless the time32 one is also present
<@dalias> yep
<@dalias> so i think we've solved this. the kernel thought it wasnt using vdso anymore because it patched it out
<@dalias> but it forgot to patch out the time64 one
<@dalias> so it stopped updating the data needed for vdso to work
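The workaround mentioned in the log (refusing the time64 vdso unless the
time32 one is also present) can be sketched roughly as below. This is only an
illustration of the selection rule, not code from musl or the kernel; the
function name pick_vdso_cgt64 is invented, and the two arguments stand for
whatever a vdso symbol lookup returned for __vdso_clock_gettime and
__vdso_clock_gettime64 (NULL when the symbol is absent or patched out).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical selection rule: only trust the time64 vdso entry point when
 * the time32 one is visible too. If the kernel patched out the time32
 * symbol, it has stopped maintaining the vdso data, so the time64 function
 * must not be used even though its symbol is still there. */
void *pick_vdso_cgt64(void *cgt32, void *cgt64)
{
    if (cgt32 == NULL)
        return NULL;   /* vdso disabled by the kernel: use the syscall */
    return cgt64;      /* may still be NULL on kernels without time64 vdso */
}
```

A caller would fall back to the plain syscall whenever this returns NULL.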
On Wed, May 20, 2020 at 7:09 PM Rich Felker <[email protected]> wrote:
>
> On Wed, May 20, 2020 at 12:08:10PM -0400, Rich Felker wrote:
> > On Wed, May 20, 2020 at 04:41:29PM +0100, Szabolcs Nagy wrote:
> > > The 05/19/2020 22:31, Arnd Bergmann wrote:
> > > > On Tue, May 19, 2020 at 10:24 PM Adhemerval Zanella
> > > > <[email protected]> wrote:
> > > > > On 19/05/2020 16:54, Arnd Bergmann wrote:
> > > note: i could not reproduce it in qemu-system with these configs:
> > >
> > > qemu-system-aarch64 + arm64 kernel + compat vdso
> > > qemu-system-aarch64 + kvm accel (on cortex-a72) + 32bit arm kernel
> > > qemu-system-arm + cpu max + 32bit arm kernel
> > >
> > > so i think it's something specific to that user's setup
> > > (maybe rpi hw bug or gcc miscompiled the vdso or something
> > > with that particular linux, i built my own linux 5.6 because
> > > i did not know the exact kernel version where the bug was seen)
> > >
> > > i don't have access to rpi (or other cortex-a53 where i
> > > can install my own kernel) so this is as far as i got.
> >
> > If we have a binary of the kernel that's known to be failing on the
> > hardware, it would be useful to dump its vdso and examine the
> > disassembly to see if it was miscompiled.
>
> OK, OP posted it and I think we've solved this. See
> https://github.com/richfelker/musl-cross-make/issues/96#issuecomment-631604410
Thanks a lot everyone for figuring this out.
> And my analysis:
>
> <@dalias> see what i just found on the tracker
> <@dalias> patch_vdso/vdso_nullpatch_one in arch/arm/kernel/vdso.c patches out the time32 functions in this case
> <@dalias> but not the time64 one
> <@dalias> this looks like a real kernel bug that's not hw-specific except breaking on all hardware where the patching-out is needed
> <@dalias> we could possibly work around it by refusing to use the time64 vdso unless the time32 one is also present
> <@dalias> yep
> <@dalias> so i think we've solved this. the kernel thought it wasnt using vdso anymore because it patched it out
> <@dalias> but it forgot to patch out the time64 one
> <@dalias> so it stopped updating the data needed for vdso to work
As you mentioned in the issue tracker, the patching was meant as
an optimization and missing it for clock_gettime64 was a mistake but
should by itself not have caused incorrect data to be returned.
I would assume that there is another bug that leads to clock_gettime64
not entering the syscall fallback path as it should but instead returning
bogus data.
Here are some more things I found:
- From reading the linux-5.6 code that was tested, I see that a condition
that leads to patching out the clock_gettime() vdso should also lead to
clock_gettime64() falling back to the syscall after
__arch_get_hw_counter() returns an error, but for some reason that
does not happen. Presumably the presence of the patching meant that
this code path was never much exercised.
A missing 45939ce292b4 ("ARM: 8957/1: VDSO: Match ARMv8 timer in
cntvct_functional()") would explain the problem, if it happened on
linux-5.6-rc7 or earlier. The fix was merged in the final v5.6 though.
- The patching may actually be counterproductive because it means that
clock_gettime(CLOCK_*COARSE, ...) has to go through the system call
when it could just return the time of the last timer tick regardless of the
clocksource.
- We may get bitten by errata handling on 32-bit kernels running on 64-bit
hardware that has errata workarounds in arch/arm64 for compat mode
but not in native arm kernels. ARM64_ERRATUM_1418040,
ARM64_ERRATUM_858921 or SUN50I_ERRATUM_UNKNOWN1
are examples of workarounds that are not used on 32-bit kernels running
on 64-bit hardware.
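To illustrate the fallback behaviour expected in the first item of the list
above, here is a user-space analogue. It is a sketch only: the real fallback
lives inside the vdso code and is not modelled here, and the name
clock_gettime_with_fallback is invented. It assumes struct timespec matches
what SYS_clock_gettime expects, which holds on 64-bit targets and on 32-bit
targets built with a 32-bit time_t.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

typedef int (*fast_cgt_fn)(clockid_t clk, struct timespec *ts);

/* Hypothetical wrapper: try a fast path (for example a vdso entry point)
 * and enter the kernel through the syscall whenever the fast path is
 * missing or reports an error, instead of returning its output. */
int clock_gettime_with_fallback(fast_cgt_fn fast, clockid_t clk,
                                struct timespec *ts)
{
    assert(ts != NULL);
    if (fast != NULL && fast(clk, ts) == 0)
        return 0;
    return (int)syscall(SYS_clock_gettime, clk, ts);
}
```

The point of the pattern is that a fast path that cannot read the hardware
counter has to signal that, so the caller ends up in the syscall rather than
returning stale or bogus data.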
Arnd