2020-10-30 10:05:17

by Lukasz Majewski

Subject: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Hi Andrei, Dmitry,

I do have a question regarding Linux time namespaces with respect to
adding support for virtualizing CLOCK_REALTIME.

According to the patch description [1] and the time_namespaces
documentation [2], CLOCK_REALTIME is not supported (for now?) to avoid
complexity and overhead in the kernel.

Is there any plan to add support for it in the near future?

Why am I asking?

It looks like this kernel feature (with CLOCK_REALTIME support
available) would be very helpful for testing Y2038 compliance of,
e.g., 32-bit glibc ports.

To be more specific - it would be possible to move the time past the
32-bit time_t overflow (i.e. the Y2038 bug) for the process running
Y2038 regression tests on a 64-bit host system. By using Linux time
namespaces, the host's system time would not be affected in any way.

Thanks in advance for your help.

Links:

[1] - https://lkml.org/lkml/2019/10/10/1329
[2] - https://www.man7.org/linux/man-pages/man7/time_namespaces.7.html


Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: [email protected]



2020-10-30 13:11:03

by Thomas Gleixner

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Lukasz,

On Fri, Oct 30 2020 at 11:02, Lukasz Majewski wrote:
> I do have a question regarding the Linux time namespaces in respect of
> adding support for virtualizing the CLOCK_REALTIME.
>
> According to patch description [1] and time_namespaces documentation
> [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
> and overhead in the kernel.
>
> Is there any plan to add support for it in a near future?

Not really. Just having an offset on clock realtime would be incorrect
in a number of ways. Doing it correctly is a massive trainwreck.

For a debug aid, which is what you are looking for, the correctness
would not really matter, but providing that is a really slippery
slope.

If anything, we could hide it behind a debug option which depends on
CONFIG_BROKEN and emits a big fat warning in dmesg with a clear
statement that it _is_ broken, stays so forever, and that any attempt
to "fix" it results in a permanent ban from all kernel lists.

Preferably we don't go there.

Thanks,

tglx


2020-10-30 13:59:51

by Cyril Hrubis

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Hi!
> I do have a question regarding the Linux time namespaces in respect of
> adding support for virtualizing the CLOCK_REALTIME.
>
> According to patch description [1] and time_namespaces documentation
> [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
> and overhead in the kernel.
>
> Is there any plan to add support for it in a near future?
>
> Why I'm asking?
>
> It looks like this kernel feature (with CLOCK_REALTIME support
> available) would be very helpful for testing Y2038 compliance for e.g.
> glibc 32 bit ports.
>
> To be more specific - it would be possible to modify time after time_t
> 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
> regression tests on the host system (64 bit one). By using Linux time
> namespaces the system time will not be affected in any way.

And what exactly is wrong with moving the system time forward for the
duration of the test?

--
Cyril Hrubis
[email protected]

2020-10-30 14:21:55

by Zack Weinberg

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Fri, Oct 30, 2020 at 9:57 AM Cyril Hrubis <[email protected]> wrote:
> > According to patch description [1] and time_namespaces documentation
> > [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
> > and overhead in the kernel.
...
> > To be more specific - [if this were supported] it would be possible to modify time after time_t
> > 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
> > regression tests on the host system (64 bit one). By using Linux time
> > namespaces the system time will not be affected in any way.
>
> And what's exactly wrong with moving the system time forward for a
> duration of the test?

Interference with other processes on the same computer? Some of us
*do* like to run the glibc test suite on computers not entirely
devoted to glibc CI.

zw

2020-10-30 15:14:56

by Thomas Gleixner

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Fri, Oct 30 2020 at 10:02, Zack Weinberg wrote:
> On Fri, Oct 30, 2020 at 9:57 AM Cyril Hrubis <[email protected]> wrote:
>> > According to patch description [1] and time_namespaces documentation
>> > [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
>> > and overhead in the kernel.
> ...
>> > To be more specific - [if this were supported] it would be possible to modify time after time_t
>> > 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
>> > regression tests on the host system (64 bit one). By using Linux time
>> > namespaces the system time will not be affected in any way.
>>
>> And what's exactly wrong with moving the system time forward for a
>> duration of the test?
>
> Interference with other processes on the same computer? Some of us
> *do* like to run the glibc test suite on computers not entirely
> devoted to glibc CI.

That's what virtual machines are for.

2020-10-30 15:46:41

by Lukasz Majewski

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Hi Thomas,

> Lukasz,
>
> On Fri, Oct 30 2020 at 11:02, Lukasz Majewski wrote:
> > I do have a question regarding the Linux time namespaces in respect
> > of adding support for virtualizing the CLOCK_REALTIME.
> >
> > According to patch description [1] and time_namespaces documentation
> > [2] the CLOCK_REALTIME is not supported (for now?) to avoid
> > complexity and overhead in the kernel.
> >
> > Is there any plan to add support for it in a near future?
>
> Not really. Just having an offset on clock realtime would be incorrect
> in a number of ways. Doing it correct is a massive trainwreck.
>
> For a debug aid, which is what you are looking for, the correctness
> would not really matter, but providing that is a really slippery
> slope.
>
> If at all we could hide it under a debug option which depends on
> CONFIG_BROKEN and emitting a big fat warning in dmesg with a clear
> statement that it _is_ broken, stays so forever and any attempt to
> "fix" it results in a permanent ban from all kernel lists.
>
> Preferrably we don't go there.

I see. Thanks for the explanation.

Now, I do use QEMU to emulate an ARM 32-bit system with a recent
kernel (5.1+). It works.

Another option would be to give QEMU's "user mode" a shot to run
cross-compiled tests (using a glibc cross-compiled in an earlier CI
stage). The problem with the above is the reliance on QEMU's emulation
of ARM syscalls (and on whether the syscalls supporting 64-bit time -
i.e. clock_settime64 - are available).

>
> Thanks,
>
> tglx

Best regards,

Lukasz Majewski

--

DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
Phone: (+49)-8142-66989-59 Fax: (+49)-8142-66989-80 Email: [email protected]



2020-10-30 17:02:54

by Carlos O'Donell

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 10/30/20 11:10 AM, Thomas Gleixner via Libc-alpha wrote:
> On Fri, Oct 30 2020 at 10:02, Zack Weinberg wrote:
>> On Fri, Oct 30, 2020 at 9:57 AM Cyril Hrubis <[email protected]> wrote:
>>>> According to patch description [1] and time_namespaces documentation
>>>> [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
>>>> and overhead in the kernel.
>> ...
>>>> To be more specific - [if this were supported] it would be possible to modify time after time_t
>>>> 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
>>>> regression tests on the host system (64 bit one). By using Linux time
>>>> namespaces the system time will not be affected in any way.
>>>
>>> And what's exactly wrong with moving the system time forward for a
>>> duration of the test?
>>
>> Interference with other processes on the same computer? Some of us
>> *do* like to run the glibc test suite on computers not entirely
>> devoted to glibc CI.
>
> That's what virtual machines are for.

Certainly, that is always an option, just like real hardware.

However, every requirement we add to testing reduces the number of
times a developer will run the tests on their system and potentially
catch a problem during development. Yes, CI helps, but "make check"
gives more coverage: more kernel variants get tested in all downstream
rpm %check builds and on developer systems, just like kernel selftests
help today.

glibc uses namespaces in "make check" to increase the number of userspace
and kernel features we can test immediately and easily on developer
*or* distribution build systems.

So the natural extension is to further isolate the testing namespace
using the time namespace to test and verify y2038. If we can't use
namespaces then we'll have to move the tests out to the less
frequently run scripts we use for cross-target toolchain testing,
and so we'll see a 100x drop in coverage.

I expect that more requests for further time isolation will happen
given the utility of this in containers.

If we have to use qemu today then that's where we're at, but again
I expect our use case is representative of more than just glibc.

Does checkpointing work robustly when userspace APIs use
CLOCK_REALTIME (directly or indirectly) in the container?

--
Cheers,
Carlos.

2020-10-30 20:08:01

by Thomas Gleixner

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Fri, Oct 30 2020 at 12:58, Carlos O'Donell wrote:
> On 10/30/20 11:10 AM, Thomas Gleixner via Libc-alpha wrote:
>> That's what virtual machines are for.
>
> Certainly, that is always an option, just like real hardware.
>
> However, every requirement we add to testing reduces the number of
> times that developer will run the test on their system and potentially
> catch a problem during development. Yes, CI helps, but "make check"
> gives more coverage. More kernel variants tested in all downstream rpm
> %check builds or developer systems. Just like kernel self tests help
> today.
>
> glibc uses namespaces in "make check" to increase the number of userspace
> and kernel features we can test immediately and easily on developer
> *or* distribution build systems.
>
> So the natural extension is to further isolate the testing namespace
> using the time namespace to test and verify y2038. If we can't use
> namespaces then we'll have to move the tests out to the less
> frequently run scripts we use for cross-target toolchain testing,
> and so we'll see a 100x drop in coverage.

I understand that.

> I expect that more requests for further time isolation will happen
> given the utility of this in containers.

There was a lengthy discussion about this, and the only "use case"
which was brought up was having different NTP servers in namespaces,
i.e. the leap-second ones and the smearing ones.

Now imagine 1000 containers, each running their own NTP. Guess what
the host does in each timer interrupt? It chases 1000 containers and
updates their notion of CLOCK_REALTIME. In the remaining 5% of CPU
time the 1000 containers can do their computations.

But even if you restrict it to a trivial offset without NTP
capabilities, what's the semantics of that offset when the host time is
set?

- Does the offset just stay the same, so that the container time
simply jumps around with the host time?

- Does it have to change so that the container's notion of realtime is
not affected? That is pretty much equivalent to the NTP case of
chasing a gazillion containers, except it might give the containers
a bit more than 5% of the remaining CPU time.

- Can the offset of the container be changed at runtime, i.e. is
clock_settime() possible from within the container?

There are some other bits related to that as well, but the above is
already mindboggling.

> If we have to use qemu today then that's where we're at, but again
> I expect our use case is representative of more than just glibc.

For testing purposes it might be. For real world use cases not so
much. People tend to rely on the coordinated nature of CLOCK_TAI and
CLOCK_REALTIME.

> Does checkpointing work robustly when userspace APIS use
> CLOCK_REALTIME (directly or indirectly) in the container?

AFAICT, yes. That was the conclusion of the lengthy discussion about
time namespaces and their requirements.

Here is the Linux plumber session related to that:

https://www.youtube.com/watch?v=sjRUiqJVzOA

Thanks,

tglx

2020-10-30 22:23:32

by Carlos O'Donell

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 10/30/20 4:06 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 12:58, Carlos O'Donell wrote:
>> I expect that more requests for further time isolation will happen
>> given the utility of this in containers.
>
> There was a lengthy discussion about this and the only "usecase" which
> was brought up was having different NTP servers in name spaces, i.e. the
> leap second ones and the smearing ones.

In the non-"request for ponies" category:

* Running legacy 32-bit applications in containers with CLOCK_REALTIME set
to some value below y2038.

* Testing kernel and userspace clock handling code without needing to
run on bare-metal, VM, or other.

> Now imagine 1000 containers each running their own NTP. Guess what the
> host does in each timer interrupt? Chasing 1000 containers and update
> their notion of CLOCK_REALTIME. In the remaining 5% CPU time the 1000
> containers can do their computations.

How is this different than balancing any other resource that you give
to a container/vm on a host?

Can you enable 1000 containers running smbd/nmbd and expect to get
great IO performance?

> But even if you restrict it to a trivial offset without NTP
> capabilities, what's the semantics of that offset when the host time is
> set?

Now you're talking about an implementation. This thread is simply
"Would we implement CLOCK_REALTIME?" Is the answer "Maybe, if we solve
all these other problems?"

>> If we have to use qemu today then that's where we're at, but again
>> I expect our use case is representative of more than just glibc.
>
> For testing purposes it might be. For real world use cases not so
> much. People tend to rely on the coordinated nature of CLOCK_TAI and
> CLOCK_REALTIME.

Except we have two real world use cases, at the top of this email,
that could extend to a lot of software. We know legacy 32-bit
applications exist that will break with CLOCK_REALTIME past
y2038. Software exists that manipulates time and needs testing
with specific time values e.g. month crossings, day crossings,
leap year crossings, etc.

>> Does checkpointing work robustly when userspace APIS use
>> CLOCK_REALTIME (directly or indirectly) in the container?
>
> AFAICT, yes. That was the conclusion over the lenghty discussion about
> time name spaces and their requirements.

If this is the case then have we established behaviours that
happen when such processes are migrated to other systems with
different CLOCK_REALTIME clocks? Would these behaviours serve
as the basis of how CLOCK_REALTIME in a namespace would behave?

That is to say that migrating a container to a system with a
different CLOCK_REALTIME should behave similarly to what happens
when CLOCK_REALTIME is changed locally and you have a container
with a unique CLOCK_REALTIME?

> Here is the Linux plumber session related to that:
> https://www.youtube.com/watch?v=sjRUiqJVzOA

Thanks. I watched the session. Informative :-)

--
Cheers,
Carlos.

2020-10-31 01:40:21

by Thomas Gleixner

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Carlos,

On Fri, Oct 30 2020 at 18:19, Carlos O'Donell wrote:
> On 10/30/20 4:06 PM, Thomas Gleixner wrote:
>> On Fri, Oct 30 2020 at 12:58, Carlos O'Donell wrote:
>>> I expect that more requests for further time isolation will happen
>>> given the utility of this in containers.
>>
>> There was a lengthy discussion about this and the only "usecase" which
>> was brought up was having different NTP servers in name spaces, i.e. the
>> leap second ones and the smearing ones.
>
> In the non-"request for ponies" category:
>
> * Running legacy 32-bit applications in containers with CLOCK_REALTIME set
> to some value below y2038.

That's broken to begin with. It was tried with Y2K and failed
miserably.

Any real application which needs access to CLOCK_REALTIME requires
access to something which is at least close to the real time.
> * Testing kernel and userspace clock handling code without needing to
> run on bare-metal, VM, or other.

I grant you that, but it comes with a large can of worms as it opens the
door for 'request for ponies' all over the place.

>> Now imagine 1000 containers each running their own NTP. Guess what the
>> host does in each timer interrupt? Chasing 1000 containers and update
>> their notion of CLOCK_REALTIME. In the remaining 5% CPU time the 1000
>> containers can do their computations.
>
> How is this different than balancing any other resource that you give
> to a container/vm on a host?
>
> Can you enable 1000 containers running smbd/nmbd and expect to get
> great IO performance?

That's bogus. The kernel can control whether these daemons run or not
and how much CPU time they get, as it can control whether any container
application runs or not.

But when it comes down to time correctness, that's a different story.
The moment the kernel allows a gazillion notions of CLOCK_REALTIME, it
has to guarantee correctness for all of them no matter what.

>> But even if you restrict it to a trivial offset without NTP
>> capabilities, what's the semantics of that offset when the host time is
>> set?
>
> Now you're talking about an implementation. This thread is simply
> "Would we implement CLOCK_REALTIME?" Is the answer "Maybe, if we solve
> all these other problems?"

Maybe, if you solve all these problems, which should be finished at
the theoretical level about 20 years from now. As I'm planning to be
retired and Y2038 will have passed by then, feel free to pursue that
route.

>>> If we have to use qemu today then that's where we're at, but again
>>> I expect our use case is representative of more than just glibc.
>>
>> For testing purposes it might be. For real world use cases not so
>> much. People tend to rely on the coordinated nature of CLOCK_TAI and
>> CLOCK_REALTIME.
>
> Except we have two real world use cases, at the top of this email,
> that could extend to a lot of software. We know legacy 32-bit
> applications exist that will break with CLOCK_REALTIME past
> y2038. Software exists that manipulates time and needs testing
> with specific time values e.g. month crossings, day crossings,
> leap year crossings, etc.

Again. I agree with the testing part, but the legacy application part
is wishful thinking at best. IMO it's utter nonsense.

Coming back to your test coverage argument: I really don't see a
problem with requiring qemu to be installed in order to run 'make
check'.

If you can't ask that of your contributors, then asking me to provide
you namespace magic is just hilarious. The contributor who refuses to
install qemu will also insist on running some last-century kernel
which does not even know about namespaces at all.

Instead of asking for ponies, your time might be better spent
providing tools which make it easy to run 'make check' with all the
bells and whistles.

Virtualization is the right answer to the testing problem and if people
really insist on running their broken legacy apps past 2038, then stick
them into a VM and charge boatloads of money for that service.

>>> Does checkpointing work robustly when userspace APIS use
>>> CLOCK_REALTIME (directly or indirectly) in the container?
>>
>> AFAICT, yes. That was the conclusion over the lenghty discussion about
>> time name spaces and their requirements.
>
> If this is the case then have we established behaviours that
> happen when such processes are migrated to other systems with
> different CLOCK_REALTIME clocks? Would these behaviours serve
> as the basis of how CLOCK_REALTIME in a namespace would behave?
>
> That is to say that migrating a container to a system with a
> different CLOCK_REALTIME should behave similarly to what happens
> when CLOCK_REALTIME is changed locally and you have a container
> with a unique CLOCK_REALTIME?

Any application has to be able to deal with CLOCK_REALTIME changing
under its feet no matter what. So why would migrating a container from
host A to host B, which have different notions of CLOCK_REALTIME, make
any difference?

Please stop abusing container migration, which works perfectly fine
now that the real timekeeping problems (CLOCK_MONOTONIC and
CLOCK_BOOTTIME going backwards) are solved, as an argument for
something which can and should be solved entirely in user space.

1) Testing

Virtualization solves that problem. Creating tools to handle it
conveniently for your users/contributors is not rocket science.

2) Legacy applications

It does not matter at all whether you stick the application into a
container which tells the kernel that it runs in some different time
universe, or whether you start the very same application with a libc
variant which uses the Y2038-aware interfaces of the kernel and
pretends to be in the pre-Y2038 time universe when handing time down
to the application.

If you have a bunch of applications which all suffer from the same
problem and are completely disconnected from the real-world notion of
CLOCK_REALTIME, then stick them into a VM and be done with it.

Just because something could be solved at the kernel level does not mean
that it is the right thing to do.

Thanks,

tglx

2020-11-03 12:44:54

by Cyril Hrubis

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Hi!
> Virtualization is the right answer to the testing problem and if people
> really insist on running their broken legacy apps past 2038, then stick
> them into a VM and charge boatloads of money for that service.

Let me just emphasise this with a short story. Before I release LTP I
do a lot of pre-release test runs to make sure that all tests work
well on different distributions and kernel versions.

Before I wrote a script that automated this [1], i.e. runs all the
tests in qemu and filters out the interesting results, it took me a
few days of manual labor to finish the task. Now I just schedule the
jobs and after a day or two I get the results. Even if the tested
kernel crashes, which happens a lot, the machine is restarted
automatically and the test run carries on with the next test. All in
all, the work that went into the solution wasn't that big to begin
with; it took me about a week to write the first prototype from
scratch.

[1] https://github.com/metan-ucw/runltp-ng

--
Cyril Hrubis
[email protected]

2020-11-05 17:27:52

by Carlos O'Donell

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 10/30/20 9:38 PM, Thomas Gleixner wrote:
> Coming back to your test coverage argument. I really don't see a problem
> with the requirement of having qemu installed in order to run 'make
> check'.

Cost. It is cheaper and easier to maintain and deploy containers.

A full VM requires maintaining and updating images and kernel builds
independent of what the developer is using on their development
system. This goes out of date quickly and needs a lot of resources to
maintain.

When you get away from a VM you can then engage the entire developer
community to just run your userspace testing on *their* hardware on *their*
kernels. So I can go to Arm, Intel, AMD, IBM, SUSE, Red Hat, etc. and say:
"All you need to do is run 'make check' and the tests will verify your
hardware and kernel are working correctly." Those developers don't want their
system clocks adjusted during testing, and they are as busy as you and I are.

Container registries and tooling are much lighter weight and support layering
changes on top of base images in ways which allow different testing scenarios.
I don't have any desire to build a similar ecosystem for VM images or wait for
VM+container (kata) tooling to grow up.

If kata grows up quickly perhaps this entire problem becomes solved, but until
then I continue to have a testing need for a distinct CLOCK_REALTIME in a
time namespace (and it need not be unconditional, if I have to engage magic
then I'm happy to do that).

> If you can't ask that from your contributors, then asking me to provide
> you a namespace magic is just hillarious. The contributor who refuses to
> install qemu will also insist to run on some last century kernel which
> does not even know about name spaces at all.

I don't disagree with you: it is *absolutely* a VM tooling issue, and
containers are easier to maintain and deploy. With namespaces I can
build glibc and a sysroot, and run them in isolation very quickly.

Just so I understand, let me reiterate your position:

* Adding a virtualized CLOCK_REALTIME to the kernel is a lot of work
given the expected guarantees for a local system.

* CLOCK_REALTIME is an expensive resource to maintain, even more
expensive than other resources whose usage the kernel can balance.

* On balance it would be better to use a VM or VM+containers (e.g.
kata) as a solution to having a distinct CLOCK_REALTIME in the
container.

Thanks for your feedback.

--
Cheers,
Carlos.

2020-11-07 00:50:04

by Thomas Gleixner

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Thu, Nov 05 2020 at 12:25, Carlos O'Donell wrote:
> On 10/30/20 9:38 PM, Thomas Gleixner wrote:
> If kata grows up quickly perhaps this entire problem becomes solved, but until
> then I continue to have a testing need for a distinct CLOCK_REALTIME in a
> time namespace (and it need not be unconditional, if I have to engage magic
> then I'm happy to do that).

Conditional, that might be a way to go.

Would CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME be a way to go? IOW,
something which is clearly in the debug section of the kernel, which
won't get turned on by distros (*cough*), and comes with a description
that any bug reports against it vs. time correctness are going to be
ignored.

> * Adding CLOCK_REALTIME to the kernel is a lot of work given the expected
> guarantees for a local system.

Correct.

> * CLOCK_REALTIME is an expensive resource to maintain, even more expensive
> than other resources where the kernel can balance their usage.

Correct.

> * On balance it would be better to use vm or vm+containers e.g. kata as a
> solution to having CLOCK_REALTIME distinct in the container.

That'd be the optimal solution, but the above might be a middle ground.

Thanks,

tglx

2020-11-14 10:27:53

by Pavel Machek

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Hi!

> I do have a question regarding the Linux time namespaces in respect of
> adding support for virtualizing the CLOCK_REALTIME.
>
> According to patch description [1] and time_namespaces documentation
> [2] the CLOCK_REALTIME is not supported (for now?) to avoid complexity
> and overhead in the kernel.
>
> Is there any plan to add support for it in a near future?
>
> Why I'm asking?
>
> It looks like this kernel feature (with CLOCK_REALTIME support
> available) would be very helpful for testing Y2038 compliance for e.g.
> glibc 32 bit ports.
>
> To be more specific - it would be possible to modify time after time_t
> 32 bit overflow (i.e. Y2038 bug) on the process running Y2038
> regression tests on the host system (64 bit one). By using Linux time
> namespaces the system time will not be affected in any way.

If a big slowdown is acceptable... you can play games with ptrace. A
project called "subterfugue" should have examples of how to do that.

Best regards,
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2020-11-19 18:40:23

by Carlos O'Donell

Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 11/6/20 7:47 PM, Thomas Gleixner wrote:
> On Thu, Nov 05 2020 at 12:25, Carlos O'Donell wrote:
>> On 10/30/20 9:38 PM, Thomas Gleixner wrote:
>> If kata grows up quickly perhaps this entire problem becomes solved, but until
>> then I continue to have a testing need for a distinct CLOCK_REALTIME in a
>> time namespace (and it need not be unconditional, if I have to engage magic
>> then I'm happy to do that).
>
> Conditional, that might be a way to go.
>
> Would CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME be a way to go? IOW,
> something which is clearly in the debug section of the kernel which wont
> get turned on by distros (*cough*) and comes with a description that any
> bug reports against it vs. time correctness are going to be ignored.

Yes. I would be requiring CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME.

Let me be clear though: the distros have *+debug kernels for which
this CONFIG_DEBUG_* option could get turned on. In Fedora *+debug
kernels we enable all sorts of things like CONFIG_DEBUG_OBJECTS_* and
CONFIG_DEBUG_SPINLOCK etc. etc. etc.

I would push Fedora/RHEL to ship this in the *+debug kernels. That way
I can have it on for my local test/build cycle. Would you be OK with
that?

We could have it disabled by default but enabled via proc, like
unprivileged_userns_clone was at one point? I want to avoid accidental
use in Fedora *+debug kernels unless the developer is actively going
to run tests that require time manipulation, e.g. thousands of DNSSEC
tests with timeouts [1].

I also need a way to determine whether the feature is enabled or
disabled, so I can XFAIL the tests and tell the developer they need to
turn on the feature in the host kernel (and not complain when
CLOCK_REALTIME is wrong). A proc interface solves this in a
straightforward way.

I could then also tell my hardware partners to turn it on during certain test/build
cycles. It violates "ship what you test" but increases test coverage and can be
run as a distinct test cycle. I could also have our internal builders turn this
feature on so we can run rpm %check phases with this feature enabled (operations
might refuse, but in that case my day-to-day developer testing still helps by
orders of magnitude).

Notes:
[1] Petr Špaček commented on DNSSEC and expiration testing as another real-world testing
scenario: https://sourceware.org/pipermail/libc-alpha/2020-November/119785.html
Still a testing scenario, but an example outside of glibc, from networking, where they
need to execute thousands of tests with accelerated timeouts. If vm+containers
catches up, and I think it will, we'll have a solution in a few years.

>> * Adding CLOCK_REALTIME to the kernel is a lot of work given the expected
>> guarantees for a local system.
>
> Correct.
>
>> * CLOCK_REALTIME is an expensive resource to maintain, even more expensive
>> than other resources where the kernel can balance their usage.
>
> Correct.
>
>> * On balance it would be better to use vm or vm+containers e.g. kata as a
>> solution to having CLOCK_REALTIME distinct in the container.
>
> That'd be the optimal solution, but the above might be a middle ground.

Agreed.

--
Cheers,
Carlos.

2020-11-20 00:19:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Thu, Nov 19 2020 at 13:37, Carlos O'Donell wrote:
> On 11/6/20 7:47 PM, Thomas Gleixner wrote:
>> Would CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME be a way to go? IOW,
>> something which is clearly in the debug section of the kernel which wont
>> get turned on by distros (*cough*) and comes with a description that any
>> bug reports against it vs. time correctness are going to be ignored.
>
> Yes. I would be requiring CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME.
>
> Let me be clear though, the distros have *+debug kernels for which this
> CONFIG_DEBUG_* could get turned on? In Fedora *+debug kernels we enable all
> sorts of things like CONFIG_DEBUG_OBJECTS_* and CONFIG_DEBUG_SPINLOCK etc.
> etc. etc.

That's why I wrote '(*cough*)'. It's entirely clear to me that this
would be enabled for whatever raisins.

> I would push Fedora/RHEL to ship this in the *+debug kernels. That way I can have
> this on for local test/build cycle. Would you be OK with that?

Distros ship a lot of weird things. Though that config would be probably
saner than some of the horrors shipped in enterprise production kernels.

> We could have it disabled by default but enabled via proc like
> unprivileged_userns_clone was at one point?

Yes, that'd be mandatory. But see below.

> I want to avoid accidental use in Fedora *+debug kernels unless the
> developer is actively going to run tests that require time
> manipulation e.g. thousands of DNSSEC tests with timeouts [1].

...

> In the case of DNSSEC, protocol conversations have real-time values in them
> which cause "expiration", thus packet captures are useful only if the real
> time clock reflects values during the original conversation. In our case
> packet captures come from real Internet, i.e. we do not have private
> keys used to sign the packets, so we cannot change time values.
>
> This use-case also implies support for settime(): During the course of a
> test we shorten time windows where "nothing happens" and server and
> client are waiting for an event, e.g. for cache expiration on
> client. This window can be hours long so it really _does_ make a
> difference. Oh yes, and for these time jumps we need to move monotonic
> time as well.

I hope you are aware that the time namespace offsets have to be set
_before_ the process starts and can't be changed afterwards,
i.e. settime() is not an option.
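
For the offsets that are supported, the mechanism described in time_namespaces(7) is: unshare(CLONE_NEWTIME), write the offsets to /proc/self/timens_offsets, then fork; the offsets freeze once the first process enters the namespace, which is exactly why there is no settime(). A minimal Python sketch (actually running it needs CAP_SYS_ADMIN, and os.unshare needs Python >= 3.12):

```python
# Sketch of how CLOCK_MONOTONIC/CLOCK_BOOTTIME offsets are applied today:
# unshare(CLONE_NEWTIME), write /proc/self/timens_offsets, then fork --
# the child is the first process in the new namespace and the offsets are
# immutable from that point on.
import os

CLONE_NEWTIME = 0x00000080  # from <linux/sched.h>

def offsets_spec(monotonic=0, boottime=0):
    """Build the text written to /proc/self/timens_offsets.

    One "<clock> <seconds> <nanoseconds>" line per clock; only
    CLOCK_MONOTONIC and CLOCK_BOOTTIME are accepted by the kernel.
    """
    return (f"monotonic {monotonic} 0\n"
            f"boottime {boottime} 0\n")

def spawn_with_offsets(argv, monotonic=0, boottime=0):
    """Run argv as the first process of a new time namespace."""
    os.unshare(CLONE_NEWTIME)
    # The caller is NOT in the new namespace yet; the offsets stay
    # writable until the first child enters it.
    with open("/proc/self/timens_offsets", "w") as f:
        f.write(offsets_spec(monotonic, boottime))
    pid = os.fork()
    if pid == 0:
        os.execvp(argv[0], argv)  # child sees the shifted clocks
    os.waitpid(pid, 0)
```

For example, `spawn_with_offsets(["cat", "/proc/uptime"], boottime=10 * 365 * 86400)` would show an uptime roughly ten years in the future, while CLOCK_REALTIME in the child remains untouched.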

That might limit the usability for your use case and this can't be
changed at all because there might be armed timers and other time
related things which would start to go into full confusion mode.

The supported use case is container live migration and that _is_ very
careful about restoring time and armed timers and if their user space
tools screw it up then they can keep the bits and pieces.

So in order to utilize that you'd have to checkpoint the container,
manipulate the offsets and restore it.

The point is that on changing the time offset after the fact the kernel
would have to chase _all_ armed timers which belong to that namespace
and are related to the affected clock and readjust them to the new
distortion of namespace time. Otherwise they might expire way too late
(which is kinda ok from a correctness POV, but not what you expect) or
too early, which is clearly a NONO. Finding them is not trivial because
some of them are part of a syscall and on stack.

What's worse is that if the host's CLOCK_REALTIME is set, then it'd have
to go through _all_ time namespaces, adjust the offsets, find all timers
of all tasks in each namespace.

Contrary to that the real clock_settime(CLOCK_REALTIME) is not a big
problem, simply because all it takes is to change the time and then kick
all CPUs to reevaluate their first expiring timer. If the clock jumped
backward then they rearm their hardware and are done, if it jumped
forward they expire the ones which are affected and all is good.

The original posix timer implementation did not have separate time bases
and on clock_settime() _all_ armed CLOCK_REALTIME timers in the system
had to be chased down, reevaluated and readjusted. Guess how well that
worked and what kind of limitation that implied.

Aside from this, there are other things, e.g. file times, packet
timestamps etc. which are based on CLOCK_REALTIME. What to do about
them? Translate these to/from namespace time or not? There is a long
list of other horrors related to that.

So _you_ might say, that you don't care about file times, RTC, timers
expiring at the wrong time, packet timestamps and whatever.

But then the next test dude comes around and wants to test exactly
these interfaces and we have to slap the time namespace conversions for
REALTIME and TAI all over the place because we already support the
minimal thing.

Can you see why this is a slippery slope and why I'm extremely reluctant
to even provide the minimal 'distort realtime when the namespace starts'
support?

> Hopefully this illustrates that a real-time namespace is not a "request for
> a pony" :-)

I can understand your pain and why you want to distort time, but please
understand that timekeeping is complex. The primary focus must be
correctness, scalability and maintainability which is already hard
enough to achieve. Just for perspective: it took us only 8 years to
get the kernel halfway 2038-ready (filesystems still outstanding).

So from my point of view asking for distorted time still _is_ a request
for ponies.

The fixed offsets for clock MONOTONIC/BOOTTIME are straightforward,
absolutely make sense and they have a limited scope of exposure. Clock
REALTIME/TAI are very different beasts which entail a slew of horrors.
Adding settime() to the mix makes it exponentially harder.

Thanks,

tglx

2020-11-25 17:12:08

by Petr Špaček

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 20. 11. 20 1:14, Thomas Gleixner wrote:
> On Thu, Nov 19 2020 at 13:37, Carlos O'Donell wrote:
>> On 11/6/20 7:47 PM, Thomas Gleixner wrote:
>>> Would CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME be a way to go? IOW,
>>> something which is clearly in the debug section of the kernel which wont
>>> get turned on by distros (*cough*) and comes with a description that any
>>> bug reports against it vs. time correctness are going to be ignored.
>>
>> Yes. I would be requiring CONFIG_DEBUG_DISTORTED_CLOCK_REALTIME.
>>
>> Let me be clear though, the distros have *+debug kernels for which this
>> CONFIG_DEBUG_* could get turned on? In Fedora *+debug kernels we enable all
>> sorts of things like CONFIG_DEBUG_OBJECTS_* and CONFIG_DEBUG_SPINLOCK etc.
>> etc. etc.
>
> That's why I wrote '(*cough*)'. It's entirely clear to me that this
> would be enabled for whatever raisins.
>
>> I would push Fedora/RHEL to ship this in the *+debug kernels. That way I can have
>> this on for local test/build cycle. Would you be OK with that?
>
> Distros ship a lot of weird things. Though that config would be probably
> saner than some of the horrors shipped in enterprise production kernels.
>
>> We could have it disabled by default but enabled via proc like
>> unprivileged_userns_clone was at one point?
>
> Yes, that'd be mandatory. But see below.
>
>> I want to avoid accidental use in Fedora *+debug kernels unless the
>> developer is actively going to run tests that require time
>> manipulation e.g. thousands of DNSSEC tests with timeouts [1].
>
> ...
>
>> In the case of DNSSEC, protocol conversations have real-time values in them
>> which cause "expiration", thus packet captures are useful only if the real
>> time clock reflects values during the original conversation. In our case
>> packet captures come from real Internet, i.e. we do not have private
>> keys used to sign the packets, so we cannot change time values.
>>
>> This use-case also implies support for settime(): During the course of a
>> test we shorten time windows where "nothing happens" and server and
>> client are waiting for an event, e.g. for cache expiration on
>> client. This window can be hours long so it really _does_ make a
>> difference. Oh yes, and for these time jumps we need to move monotonic
>> time as well.
>
> I hope you are aware that the time namespace offsets have to be set
> _before_ the process starts and can't be changed afterwards,
> i.e. settime() is not an option.
>
> That might limit the usability for your use case and this can't be
> changed at all because there might be armed timers and other time
> related things which would start to go into full confusion mode.
>
> The supported use case is container live migration and that _is_ very
> careful about restoring time and armed timers and if their user space
> tools screw it up then they can keep the bits and pieces.
>
> So in order to utilize that you'd have to checkpoint the container,
> manipulate the offsets and restore it.
>
> The point is that on changing the time offset after the fact the kernel
> would have to chase _all_ armed timers which belong to that namespace
> and are related to the affected clock and readjust them to the new
> distortion of namespace time. Otherwise they might expire way too late
> (which is kinda ok from a correctness POV, but not what you expect) or
> too early, which is clearly a NONO. Finding them is not trivial because
> some of them are part of a syscall and on stack.
>
> What's worse is that if the host's CLOCK_REALTIME is set, then it'd have
> to go through _all_ time namespaces, adjust the offsets, find all timers
> of all tasks in each namespace.
>
> Contrary to that the real clock_settime(CLOCK_REALTIME) is not a big
> problem, simply because all it takes is to change the time and then kick
> all CPUs to reevaluate their first expiring timer. If the clock jumped
> backward then they rearm their hardware and are done, if it jumped
> forward they expire the ones which are affected and all is good.
>
> The original posix timer implementation did not have separate time bases
> and on clock_settime() _all_ armed CLOCK_REALTIME timers in the system
> had to be chased down, reevaluated and readjusted. Guess how well that
> worked and what kind of limitation that implied.
>
> Aside from this, there are other things, e.g. file times, packet
> timestamps etc. which are based on CLOCK_REALTIME. What to do about
> them? Translate these to/from namespace time or not? There is a long
> list of other horrors related to that.
>
> So _you_ might say, that you don't care about file times, RTC, timers
> expiring at the wrong time, packet timestamps and whatever.
>
> But then the next test dude comes around and wants to test exactly
> these interfaces and we have to slap the time namespace conversions for
> REALTIME and TAI all over the place because we already support the
> minimal thing.
>
> Can you see why this is a slippery slope and why I'm extremely reluctant
> to even provide the minimal 'distort realtime when the namespace starts'
> support?
>
>> Hopefully this illustrates that a real-time namespace is not a "request for
>> a pony" :-)
>
> I can understand your pain and why you want to distort time, but please
> understand that timekeeping is complex. The primary focus must be
> correctness, scalability and maintainability which is already hard
> enough to achieve. Just for perspective: it took us only 8 years to
> get the kernel halfway 2038-ready (filesystems still outstanding).
>
> So from my point of view asking for distorted time still _is_ a request
> for ponies.
>
> The fixed offsets for clock MONOTONIC/BOOTTIME are straightforward,
> absolutely make sense and they have a limited scope of exposure. Clock
> REALTIME/TAI are very different beasts which entail a slew of horrors.
> Adding settime() to the mix makes it exponentially harder.

Point taken, I can see it is complex as hell. Maybe settime() would not be necessary if the checkpoint+restore operation is cheap enough, assuming time jumps can be achieved by manipulating images. I will eventually explore criu.org to find out.

Thank you for your time!

--
Petr Špaček @ CZ.NIC

2020-11-26 07:31:31

by Carlos O'Donell

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 11/19/20 7:14 PM, Thomas Gleixner wrote:
> I hope you are aware that the time namespace offsets have to be set
> _before_ the process starts and can't be changed afterwards,
> i.e. settime() is not an option.

I'm not interested in settime(). I saw Petr's request and forwarded it
here to further the educational conversation about CLOCK_REALTIME and
cement a consensus around this issue. I'm happy to evangelize that we
won't support settime() for the specific reasons you call out, and that
way I can give architectural guidance to set up systems in a particular
way to use CRIU or VM+container if needed.

> That might limit the usability for your use case and this can't be
> changed at all because there might be armed timers and other time
> related things which would start to go into full confusion mode.

The use of time in these cases, from first principles, seems to
degenerate into:

* Verify a time-dependent action is correct.

In glibc's case:

* Verify various APIs after y2038.

In Petr's case he could split the test into two such tests
but he would have to reproduce the system state for the second
test and I expect he wants to avoid that.
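
To make "verify after y2038" concrete: a 32-bit signed time_t overflows at 2038-01-19 03:14:07 UTC, which is the instant the test clock needs to step past. A small Python illustration of the boundary and the wraparound:

```python
# What "after y2038" means concretely: the last second representable in
# a 32-bit signed time_t, and the wrap to 1901 one second later.
import datetime
import struct

Y2038_MAX = 2**31 - 1  # last second a 32-bit signed time_t can hold

def as_time32(epoch_seconds):
    """Truncate an epoch value the way a 32-bit signed time_t would."""
    return struct.unpack("<i", struct.pack("<I", epoch_seconds & 0xFFFFFFFF))[0]

# The overflow instant in UTC:
boundary = datetime.datetime.fromtimestamp(Y2038_MAX, tz=datetime.timezone.utc)
# boundary == 2038-01-19 03:14:07+00:00

# One second later, a 32-bit time_t wraps negative:
wrapped = as_time32(Y2038_MAX + 1)  # -> -2147483648
```

A 64-bit host running the glibc tests is unaffected, which is exactly why some form of per-process clock distortion is needed to exercise the 32-bit code paths.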

In that case I think Petr has to use CRIU to start the container,
stop it, advance time, and restart. That should work perfectly
in that use case and solve all the problems by relying on the work
done by CRIU.

> The supported use case is container live migration and that _is_ very
> careful about restoring time and armed timers and if their user space
> tools screw it up then they can keep the bits and pieces.

Agreed.

> So in order to utilize that you'd have to checkpoint the container,
> manipulate the offsets and restore it.

Or use the same mechanisms CRIU uses.

I can't rely on CRIU because I have to bootstrap a toolchain and userspace.
We should be able to write a thin veneer into our own testing wrapper and
emulate whatever CRIU does. We know a priori that our test framework starts
the test without anything having been executed yet. So we have that benefit.
Currently we unshare() for NEWUSER/NEWPID/NEWNS, but I expect that will
get a little more advanced. Already having these namespaces helps immensely
when adding fs-related tests.
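
The unshare() veneer described above can be sketched as follows. The clone-flag values are the <linux/sched.h> constants; the helper names are illustrative, and actually entering the namespaces needs unprivileged user namespaces enabled (plus Python >= 3.12 for os.unshare):

```python
# Sketch of the kind of unshare() wrapper a test harness uses. The flag
# values are the <linux/sched.h> constants; enter_test_container() is an
# illustrative helper, not the glibc framework's actual code.
import os

CLONE_NEWTIME = 0x00000080
CLONE_NEWNS   = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWPID  = 0x20000000

NAMESPACES = {
    "user": CLONE_NEWUSER,
    "pid":  CLONE_NEWPID,
    "mnt":  CLONE_NEWNS,
    "time": CLONE_NEWTIME,
}

def unshare_flags(*names):
    """OR together the clone flags for the requested namespaces."""
    flags = 0
    for name in names:
        flags |= NAMESPACES[name]
    return flags

def enter_test_container(names=("user", "pid", "mnt")):
    # Including NEWUSER in the same unshare() call is what lets an
    # unprivileged process create the other namespaces: the process
    # gains full capabilities inside the new user namespace.
    os.unshare(unshare_flags(*names))
```

Adding "time" to the default tuple is then the only change needed once the wrapper also wants to set MONOTONIC/BOOTTIME offsets for a test.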

> Aside from this, there are other things, e.g. file times, packet
> timestamps etc. which are based on CLOCK_REALTIME. What to do about
> them? Translate these to/from namespace time or not? There is a long
> list of other horrors related to that.

We haven't even started testing for the upcoming negative leap second ;-)

> So _you_ might say, that you don't care about file times, RTC, timers
> expiring at the wrong time, packet timestamps and whatever.

I do care about them, but only given certain contexts.

> But then the next test dude comes around and wants to test exactly
> these interfaces and we have to slap the time namespace conversions for
> REALTIME and TAI all over the place because we already support the
> minimal thing.

That's a decision you need to make when asked those questions.

> Can you see why this is a slippery slope and why I'm extremely reluctant
> to even provide the minimal 'distort realtime when the namespace starts'
> support?

I would argue this is a slippery slope fallacy. If and when we get better
vm+container support we just tear all this code out and tell people to
start using those frameworks. The vm+container frameworks have independent
reasons to exist and so will continue to improve for security isolation
purposes and end up solving time testing issues by allowing us complete
control over the VMs time.

>> Hopefully this illustrates that a real-time namespace is not a "request for
>> a pony" :-)
>
> I can understand your pain and why you want to distort time, but please
> understand that timekeeping is complex. The primary focus must be
> correctness, scalability and maintainability which is already hard
> enough to achieve. Just for perspective: it took us only 8 years to
> get the kernel halfway 2038-ready (filesystems still outstanding).

I agree. The upstream glibc community has been working on y2038 since 2018;
not as long as the kernel.

> So from my point of view asking for distorted time still _is_ a request
> for ponies.

I'm happy if you say it's more work than the value it provides.

> The fixed offsets for clock MONOTONIC/BOOTTIME are straightforward,
> absolutely make sense and they have a limited scope of exposure. Clock
> REALTIME/TAI are very different beasts which entail a slew of horrors.
> Adding settime() to the mix makes it exponentially harder.

Right.

--
Cheers,
Carlos.

2020-11-26 08:31:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

Carlos, Petr,

On Wed, Nov 25 2020 at 15:37, Carlos O'Donell wrote:
> On 11/19/20 7:14 PM, Thomas Gleixner wrote:
>> So from my point of view asking for distorted time still _is_ a request
>> for ponies.
>
> I'm happy if you say it's more work than the value it provides.

Thinking more about it. Would a facility which provides:

CLOCK_FAKE_MONOTONIC|BOOTTIME|REALTIME

where you can go wild on setting time to whatever you want solve
your problem?

Thanks,

tglx

2020-11-26 09:38:42

by Carlos O'Donell

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On 11/25/20 7:17 PM, Thomas Gleixner wrote:
> Carlos, Petr,
>
> On Wed, Nov 25 2020 at 15:37, Carlos O'Donell wrote:
>> On 11/19/20 7:14 PM, Thomas Gleixner wrote:
>>> So from my point of view asking for distorted time still _is_ a request
>>> for ponies.
>>
>> I'm happy if you say it's more work than the value it provides.
>
> Thinking more about it. Would a facility which provides:
>
> CLOCK_FAKE_MONOTONIC|BOOTTIME|REALTIME
>
> where you can go wild on setting time to whatever you want solve
> your problem?

We would need a way to inject CLOCK_FAKE_* in lieu of the real
constants.

There are only two straightforward ways I know of to do that, and
neither is very useful: an alternative build, or syscall hot-path
debug code to alter the constant.

We might write a syscall interception framework using seccomp
and SECCOMP_RET_TRACE, but that involves ptrace'ing the process
under test, and is equivalent to a micro-sandbox. I'm not against
that idea for testing; we would test what we ship.

I don't think eBPF can modify the incoming arguments.

I need to go check if systemtap can modify incoming arguments;
I've never done that in any script.

In what other ways can we inject the new constants?

--
Cheers,
Carlos.

2020-11-26 10:53:41

by Andreas Schwab

[permalink] [raw]
Subject: Re: [Y2038][time namespaces] Question regarding CLOCK_REALTIME support plans in Linux time namespaces

On Nov 25 2020, Carlos O'Donell via Libc-alpha wrote:

> We might write a syscall interception framework using seccomp
> and SECCOMP_RET_TRACE, but that involves ptrace'ing the process
> under test, and is equivalent to a micro-sandbox. I'm not against
> that idea for testing; we would test what we ship.

seccomp and ptrace do not work with qemu linux-user.

Andreas.

--
Andreas Schwab, [email protected]
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510 2552 DF73 E780 A9DA AEC1
"And now for something completely different."