2020-07-07 11:46:38

by Qian Cai

Subject: Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail



> On Jul 7, 2020, at 6:28 AM, Michal Hocko <[email protected]> wrote:
>
> Would you have any examples? Because I find this highly unlikely.
> OVERCOMMIT_NEVER only works when virtual memory is not largely
> overcommitted wrt real memory demand. And that tends to be more of an
> exception than a rule. "Modern" userspace (whatever that means) tends
> to be really hungry for virtual memory which is only used very
> sparsely.
>
> I would argue that either somebody is running
> "OVERCOMMIT_NEVER"-friendly SW and this is a permanent setting, or
> this is not used at all. At least this is my experience.
>
> So I strongly suspect that the LTP test failure is not something we
> should really lose sleep over. It would be nice to find a way to flush
> the existing batches, but I would rather see a real workload that
> suffers from this imprecision.
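
For reference, the failing check boils down to roughly this shape (a
sketch, not the actual LTP source; it needs root to switch the mode):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

static void set_overcommit_mode(int mode)
{
        int fd = open("/proc/sys/vm/overcommit_memory", O_WRONLY);

        if (fd < 0 || dprintf(fd, "%d\n", mode) < 0)
                perror("overcommit_memory");
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        set_overcommit_mode(2);                 /* OVERCOMMIT_NEVER */

        /* Under OVERCOMMIT_NEVER a small request should succeed and an
         * absurdly large one should fail.  With stale per-CPU batches
         * in vm_committed_as, the kernel's view of Committed_AS is off
         * right after the mode switch and either expectation can
         * break. */
        void *small = malloc(1UL << 20);        /* 1 MiB */
        void *huge = malloc(1UL << 44);         /* 16 TiB */

        printf("small: %s, huge: %s\n",
               small ? "ok" : "failed (unexpected)",
               huge ? "ok (unexpected)" : "failed as expected");

        free(small);
        free(huge);
        return 0;
}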

I have heard you say many times that you don't really care about use
cases unless you know exactly how people are using them in your world.

For example, last time you said the LTP OOM tests were totally
artificial and that you cared little whether they fail. Yet I have
found them efficient at flagging issues like race conditions and bad
error-accumulation handling that your “real world use cases” would
take ages to hit, or would never flag at all.

There are just too many valid use cases out in the wild. The
difference is that I admit I don't know, or am not even aware of, all
of them, and I don't believe you do either.

If a patchset breaks an existing behavior that is written down exactly
in the spec, then someone has to prove the change is harmless. For
example, if nobody is going to rely on this behavior now or in the
future, then fix the spec and explain exactly why nobody should rely
on it.


2020-07-07 12:07:47

by Michal Hocko

Subject: Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail

On Tue 07-07-20 07:43:48, Qian Cai wrote:
>
>
> > On Jul 7, 2020, at 6:28 AM, Michal Hocko <[email protected]> wrote:
> >
> > Would you have any examples? Because I find this highly unlikely.
> > OVERCOMMIT_NEVER only works when virtual memory is not largely
> > overcommitted wrt real memory demand. And that tends to be more of an
> > exception than a rule. "Modern" userspace (whatever that means) tends
> > to be really hungry for virtual memory which is only used very
> > sparsely.
> >
> > I would argue that either somebody is running
> > "OVERCOMMIT_NEVER"-friendly SW and this is a permanent setting, or
> > this is not used at all. At least this is my experience.
> >
> > So I strongly suspect that the LTP test failure is not something we
> > should really lose sleep over. It would be nice to find a way to flush
> > the existing batches, but I would rather see a real workload that
> > suffers from this imprecision.
>
> I have heard you say many times that you don't really care about use
> cases unless you know exactly how people are using them in your world.
>
> For example, last time you said the LTP OOM tests were totally
> artificial and that you cared little whether they fail. Yet I have
> found them efficient at flagging issues like race conditions and bad
> error-accumulation handling that your “real world use cases” would
> take ages to hit, or would never flag at all.

Yes, they are effective at hitting corner cases, and that is fine. I
am not dismissing their usefulness. I have tried to explain this many
times, but let me try again. Seeing a corner case and thinking about a
potential fix is one thing. On the other hand, it is not really ideal
to treat such a failure as a hard regression and push for otherwise
useful functionality/improvements to be reverted without a proper
cost/benefit analysis. Sure, having corner cases is not really nice,
but look at this example again. The overcommit setting is a global
thing; it is hard to change it at runtime willy-nilly, because that
might have really detrimental side effects on all running workloads.
So it is quite reasonable to expect that the mode is set either early
after boot or when the system is in a quiescent state, when almost
nothing but very core services are running; the likelihood that the
mode of operation changes under a live workload is low.
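
To make the "hungry for virtual memory" point concrete, here is a
minimal userspace sketch (the 16 GiB figure is arbitrary; assume it
exceeds CommitLimit, but not the heuristic limit, on the machine at
hand):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>

int main(void)
{
        /* A large, private, initially untouched mapping, typical of
         * allocator arenas and runtime reservations. */
        size_t len = 16UL << 30;        /* 16 GiB */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {
                /* Expected under OVERCOMMIT_NEVER: the whole
                 * reservation is charged against CommitLimit up front,
                 * even though almost none of it will be touched. */
                printf("mmap failed: %s\n", strerror(errno));
                return 1;
        }
        memset(p, 0, 4096);     /* touch a single page out of ~4M */
        printf("reserved %zu bytes, touched one page\n", len);
        munmap(p, len);
        return 0;
}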

> There are just too many valid use cases out in the wild. The
> difference is that I admit I don't know, or am not even aware of, all
> of them, and I don't believe you do either.

Me neither, and I am not claiming that. All I am saying is that the
real risk of a regression is low enough that I wouldn't lose sleep
over it. It is perfectly fine to address this proactively if the fix
is reasonably maintainable. I was mostly reacting to your pushing for
a revert based solely on LTP results.
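
For the record, I can imagine such a flush having roughly this shape
(a sketch only; the helper names sync_vm_committed_as/sync_one_cpu are
made up here, and resizing of the batch itself is ignored):

#include <linux/mman.h>         /* declares vm_committed_as */
#include <linux/percpu_counter.h>
#include <linux/workqueue.h>

/* Fold this CPU's batched delta into the global count. */
static void sync_one_cpu(struct work_struct *work)
{
        struct percpu_counter *fbc = &vm_committed_as;
        unsigned long flags;
        s32 delta;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        delta = __this_cpu_read(*fbc->counters);
        fbc->count += delta;
        __this_cpu_sub(*fbc->counters, delta);
        raw_spin_unlock_irqrestore(&fbc->lock, flags);
}

/* Call from the overcommit_memory sysctl handler after switching to
 * OVERCOMMIT_NEVER, so that __vm_enough_memory() sees an accurate sum
 * instead of one skewed by stale per-CPU batches. */
static void sync_vm_committed_as(void)
{
        schedule_on_each_cpu(sync_one_cpu);
}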

LTP is a very useful tool for raising awareness of potential problems,
but you shouldn't follow its results just blindly.

> If a patchset breaks an existing behavior that is written down exactly
> in the spec, then someone has to prove the change is harmless. For
> example, if nobody is going to rely on this behavior now or in the
> future, then fix the spec and explain exactly why nobody should rely
> on it.

I am all for clarifications in the documentation.

--
Michal Hocko
SUSE Labs

2020-07-07 13:08:46

by Qian Cai

Subject: Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail

On Tue, Jul 07, 2020 at 02:06:19PM +0200, Michal Hocko wrote:
> On Tue 07-07-20 07:43:48, Qian Cai wrote:
> >
> >
> > > On Jul 7, 2020, at 6:28 AM, Michal Hocko <[email protected]> wrote:
> > >
> > > Would you have any examples? Because I find this highly unlikely.
> > > OVERCOMMIT_NEVER only works when virtual memory is not largely
> > > overcommitted wrt real memory demand. And that tends to be more of an
> > > exception than a rule. "Modern" userspace (whatever that means) tends
> > > to be really hungry for virtual memory which is only used very
> > > sparsely.
> > >
> > > I would argue that either somebody is running
> > > "OVERCOMMIT_NEVER"-friendly SW and this is a permanent setting, or
> > > this is not used at all. At least this is my experience.
> > >
> > > So I strongly suspect that the LTP test failure is not something we
> > > should really lose sleep over. It would be nice to find a way to flush
> > > the existing batches, but I would rather see a real workload that
> > > suffers from this imprecision.
> >
> > I have heard you say many times that you don't really care about use
> > cases unless you know exactly how people are using them in your world.
> >
> > For example, last time you said the LTP OOM tests were totally
> > artificial and that you cared little whether they fail. Yet I have
> > found them efficient at flagging issues like race conditions and bad
> > error-accumulation handling that your “real world use cases” would
> > take ages to hit, or would never flag at all.
>
> Yes, they are effective at hitting corner cases, and that is fine. I
> am not dismissing their usefulness. I have tried to explain this many
> times, but let me try again. Seeing a corner case and thinking about a
> potential fix is one thing. On the other hand, it is not really ideal
> to treat such a failure as a hard regression and push for otherwise

Well, terms like "corner cases" and "hard regression" are rather
subjective.

> useful functionality/improvements to be reverted without a proper
> cost/benefit analysis. Sure, having corner cases is not really nice,
> but look at this example again. The overcommit setting is a global
> thing; it is hard to change it at runtime willy-nilly, because that
> might have really detrimental side effects on all running workloads.
> So it is quite reasonable to expect that the mode is set either early
> after boot or when the system is in a quiescent state, when almost
> nothing but very core services are running; the likelihood that the
> mode of operation changes under a live workload is low.

I am not really convinced that is the only way people will use those
tunables.

>
> > There are just too many valid use cases out in the wild. The
> > difference is that I admit I don't know, or am not even aware of, all
> > of them, and I don't believe you do either.
>
> Me neither, and I am not claiming that. All I am saying is that the
> real risk of a regression is low enough that I wouldn't lose sleep
> over it. It is perfectly fine to address this proactively if the fix
> is reasonably maintainable. I was mostly reacting to your pushing for
> a revert based solely on LTP results.
>
> LTP is a very useful tool for raising awareness of potential problems,
> but you shouldn't follow its results just blindly.

You must think I am a newbie tester, then, to give me this piece of
advice.

2020-07-07 13:57:51

by Michal Hocko

Subject: Re: [mm] 4e2c82a409: ltp.overcommit_memory01.fail

On Tue 07-07-20 09:04:36, Qian Cai wrote:
> On Tue, Jul 07, 2020 at 02:06:19PM +0200, Michal Hocko wrote:
> > On Tue 07-07-20 07:43:48, Qian Cai wrote:
> > >
> > >
> > > > On Jul 7, 2020, at 6:28 AM, Michal Hocko <[email protected]> wrote:
> > > >
> > > > Would you have any examples? Because I find this highly unlikely.
> > > > OVERCOMMIT_NEVER only works when virtual memory is not largely
> > > > overcommitted wrt real memory demand. And that tends to be more of an
> > > > exception than a rule. "Modern" userspace (whatever that means) tends
> > > > to be really hungry for virtual memory which is only used very
> > > > sparsely.
> > > >
> > > > I would argue that either somebody is running
> > > > "OVERCOMMIT_NEVER"-friendly SW and this is a permanent setting, or
> > > > this is not used at all. At least this is my experience.
> > > >
> > > > So I strongly suspect that the LTP test failure is not something we
> > > > should really lose sleep over. It would be nice to find a way to flush
> > > > the existing batches, but I would rather see a real workload that
> > > > suffers from this imprecision.
> > >
> > > I have heard you say many times that you don't really care about use
> > > cases unless you know exactly how people are using them in your world.
> > >
> > > For example, last time you said the LTP OOM tests were totally
> > > artificial and that you cared little whether they fail. Yet I have
> > > found them efficient at flagging issues like race conditions and bad
> > > error-accumulation handling that your “real world use cases” would
> > > take ages to hit, or would never flag at all.
> >
> > Yes, they are effective at hitting corner cases, and that is fine. I
> > am not dismissing their usefulness. I have tried to explain this many
> > times, but let me try again. Seeing a corner case and thinking about a
> > potential fix is one thing. On the other hand, it is not really ideal
> > to treat such a failure as a hard regression and push for otherwise
>
> Well, terms like "corner cases" and "hard regression" are rather
> subjective.

Existing real-life examples make them less subjective, though.

[...]

> > LTP is a very useful tool for raising awareness of potential problems,
> > but you shouldn't follow its results just blindly.
>
> You must think I am a newbie tester, then, to give me this piece of
> advice.

Not even close. I can clearly see your involvement in testing and how
many good bug reports it results in.
--
Michal Hocko
SUSE Labs