2022-02-17 14:51:25

by Byungchul Park

Subject: Report 2 in ext4 and journal based on v5.17-rc1

[ 9.008161] ===================================================
[ 9.008163] DEPT: Circular dependency has been detected.
[ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
[ 9.008166] ---------------------------------------------------
[ 9.008167] summary
[ 9.008167] ---------------------------------------------------
[ 9.008168] *** DEADLOCK ***
[ 9.008168]
[ 9.008168] context A
[ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008173]
[ 9.008173] context B
[ 9.008174] [S] down_write(mapping.invalidate_lock:0)
[ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008176] [E] up_write(mapping.invalidate_lock:0)
[ 9.008177]
[ 9.008178] context C
[ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008180] [W] down_write(mapping.invalidate_lock:0)
[ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008181]
[ 9.008182] [S]: start of the event context
[ 9.008183] [W]: the wait blocked
[ 9.008183] [E]: the event not reachable
[ 9.008184] ---------------------------------------------------
[ 9.008184] context A's detail
[ 9.008185] ---------------------------------------------------
[ 9.008186] context A
[ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
[ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008189]
[ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008191] (N/A)
[ 9.008191]
[ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
[ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008197] stacktrace:
[ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
[ 9.008203] kthread (kernel/kthread.c:377)
[ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008209]
[ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008212] stacktrace:
[ 9.008213] dept_event (kernel/dependency/dept.c:2337)
[ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008223] kthread (kernel/kthread.c:377)
[ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008226] ---------------------------------------------------
[ 9.008226] context B's detail
[ 9.008227] ---------------------------------------------------
[ 9.008228] context B
[ 9.008228] [S] down_write(mapping.invalidate_lock:0)
[ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
[ 9.008230] [E] up_write(mapping.invalidate_lock:0)
[ 9.008231]
[ 9.008232] [S] down_write(mapping.invalidate_lock:0):
[ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008237] stacktrace:
[ 9.008237] down_write (kernel/locking/rwsem.c:1514)
[ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008241] generic_perform_write (mm/filemap.c:3784)
[ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008250] vfs_write (fs/read_write.c:590)
[ 9.008251] ksys_write (fs/read_write.c:644)
[ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008258]
[ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
[ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
[ 9.008261] stacktrace:
[ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
[ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
[ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
[ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
[ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
[ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
[ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
[ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
[ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
[ 9.008276] generic_perform_write (mm/filemap.c:3784)
[ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008283] vfs_write (fs/read_write.c:590)
[ 9.008284] ksys_write (fs/read_write.c:644)
[ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008287]
[ 9.008288] [E] up_write(mapping.invalidate_lock:0):
[ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
[ 9.008291] ---------------------------------------------------
[ 9.008291] context C's detail
[ 9.008292] ---------------------------------------------------
[ 9.008292] context C
[ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
[ 9.008294] [W] down_write(mapping.invalidate_lock:0)
[ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
[ 9.008296]
[ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
[ 9.008298] (N/A)
[ 9.008298]
[ 9.008299] [W] down_write(mapping.invalidate_lock:0):
[ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008302] stacktrace:
[ 9.008302] down_write (kernel/locking/rwsem.c:1514)
[ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
[ 9.008305] generic_perform_write (mm/filemap.c:3784)
[ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008312] vfs_write (fs/read_write.c:590)
[ 9.008314] ksys_write (fs/read_write.c:644)
[ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008318]
[ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
[ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
[ 9.008321] stacktrace:
[ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
[ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
[ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
[ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
[ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
[ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
[ 9.008334] generic_perform_write (mm/filemap.c:3784)
[ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
[ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
[ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
[ 9.008341] vfs_write (fs/read_write.c:590)
[ 9.008342] ksys_write (fs/read_write.c:644)
[ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
[ 9.008347] ---------------------------------------------------
[ 9.008348] information that might be helpful
[ 9.008348] ---------------------------------------------------
[ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
[ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 9.008353] Call Trace:
[ 9.008354] <TASK>
[ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
[ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
[ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
[ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
[ 9.008364] bfs (kernel/dependency/dept.c:860)
[ 9.008366] add_dep (kernel/dependency/dept.c:1423)
[ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
[ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
[ 9.008372] dept_event (kernel/dependency/dept.c:2337)
[ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
[ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
[ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
[ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
[ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
[ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
[ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
[ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
[ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
[ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
[ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
[ 9.008402] kthread (kernel/kthread.c:377)
[ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
[ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
[ 9.008410] </TASK>


2022-02-21 22:08:35

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1


So I was trying to understand what this report is about for some time but
honestly I have failed...

On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> [ 9.008161] ===================================================
> [ 9.008163] DEPT: Circular dependency has been detected.
> [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> [ 9.008166] ---------------------------------------------------
> [ 9.008167] summary
> [ 9.008167] ---------------------------------------------------
> [ 9.008168] *** DEADLOCK ***
> [ 9.008168]
> [ 9.008168] context A
> [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008173]
> [ 9.008173] context B
> [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> [ 9.008177]
> [ 9.008178] context C
> [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008181]
> [ 9.008182] [S]: start of the event context
> [ 9.008183] [W]: the wait blocked
> [ 9.008183] [E]: the event not reachable

So what situation is your tool complaining about here? Can you perhaps show
it here in a more common visualization like:

TASK1                   TASK2
                        does foo, grabs Z
does X, grabs lock Y
blocks on Z
                        blocks on Y

or something like that? Because I was not able to decipher this from the
report even after trying for some time...

Honza



> [ 9.008184] ---------------------------------------------------
> [ 9.008184] context A's detail
> [ 9.008185] ---------------------------------------------------
> [ 9.008186] context A
> [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008189]
> [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008191] (N/A)
> [ 9.008191]
> [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
> [ 9.008197] stacktrace:
> [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
> [ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
> [ 9.008203] kthread (kernel/kthread.c:377)
> [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008209]
> [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008212] stacktrace:
> [ 9.008213] dept_event (kernel/dependency/dept.c:2337)
> [ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> [ 9.008223] kthread (kernel/kthread.c:377)
> [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008226] ---------------------------------------------------
> [ 9.008226] context B's detail
> [ 9.008227] ---------------------------------------------------
> [ 9.008228] context B
> [ 9.008228] [S] down_write(mapping.invalidate_lock:0)
> [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> [ 9.008230] [E] up_write(mapping.invalidate_lock:0)
> [ 9.008231]
> [ 9.008232] [S] down_write(mapping.invalidate_lock:0):
> [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008237] stacktrace:
> [ 9.008237] down_write (kernel/locking/rwsem.c:1514)
> [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008241] generic_perform_write (mm/filemap.c:3784)
> [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008250] vfs_write (fs/read_write.c:590)
> [ 9.008251] ksys_write (fs/read_write.c:644)
> [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008258]
> [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
> [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
> [ 9.008261] stacktrace:
> [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
> [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
> [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
> [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
> [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
> [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
> [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
> [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
> [ 9.008276] generic_perform_write (mm/filemap.c:3784)
> [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008283] vfs_write (fs/read_write.c:590)
> [ 9.008284] ksys_write (fs/read_write.c:644)
> [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008287]
> [ 9.008288] [E] up_write(mapping.invalidate_lock:0):
> [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
> [ 9.008291] ---------------------------------------------------
> [ 9.008291] context C's detail
> [ 9.008292] ---------------------------------------------------
> [ 9.008292] context C
> [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008294] [W] down_write(mapping.invalidate_lock:0)
> [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
> [ 9.008296]
> [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008298] (N/A)
> [ 9.008298]
> [ 9.008299] [W] down_write(mapping.invalidate_lock:0):
> [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008302] stacktrace:
> [ 9.008302] down_write (kernel/locking/rwsem.c:1514)
> [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> [ 9.008305] generic_perform_write (mm/filemap.c:3784)
> [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008312] vfs_write (fs/read_write.c:590)
> [ 9.008314] ksys_write (fs/read_write.c:644)
> [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008318]
> [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
> [ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008321] stacktrace:
> [ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
> [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
> [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
> [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
> [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
> [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
> [ 9.008334] generic_perform_write (mm/filemap.c:3784)
> [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
> [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
> [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
> [ 9.008341] vfs_write (fs/read_write.c:590)
> [ 9.008342] ksys_write (fs/read_write.c:644)
> [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> [ 9.008347] ---------------------------------------------------
> [ 9.008348] information that might be helpful
> [ 9.008348] ---------------------------------------------------
> [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
> [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> [ 9.008353] Call Trace:
> [ 9.008354] <TASK>
> [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
> [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
> [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
> [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
> [ 9.008364] bfs (kernel/dependency/dept.c:860)
> [ 9.008366] add_dep (kernel/dependency/dept.c:1423)
> [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
> [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
> [ 9.008372] dept_event (kernel/dependency/dept.c:2337)
> [ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
> [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
> [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
> [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
> [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
> [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
> [ 9.008402] kthread (kernel/kthread.c:377)
> [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
> [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
> [ 9.008410] </TASK>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-23 07:09:17

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote:
> On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> > [ 9.008161] ===================================================
> > [ 9.008163] DEPT: Circular dependency has been detected.
> > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> > [ 9.008166] ---------------------------------------------------
> > [ 9.008167] summary
> > [ 9.008167] ---------------------------------------------------
> > [ 9.008168] *** DEADLOCK ***
> > [ 9.008168]
> > [ 9.008168] context A
> > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008173]
> > [ 9.008173] context B
> > [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> > [ 9.008177]
> > [ 9.008178] context C
> > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008181]
> > [ 9.008182] [S]: start of the event context
> > [ 9.008183] [W]: the wait blocked
> > [ 9.008183] [E]: the event not reachable
>
> So what situation is your tool complaining about here? Can you perhaps show
> it here in a more common visualization like:

Sure.

> TASK1                   TASK2
>                         does foo, grabs Z
> does X, grabs lock Y
> blocks on Z
>                         blocks on Y
>
> or something like that? Because I was not able to decipher this from the
> report even after trying for some time...

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait A
--- stuck
                        wait B
                        --- stuck
                                                wait C
                                                --- stuck

wake up B               wake up C               wake up A

where:
   A is a wait_queue, j_wait_commit
   B is a wait_queue, j_wait_transaction_locked
   C is a rwsem, mapping.invalidate_lock

The above is the simplest form. And it's worth noting that Dept focuses
on waits and events themselves rather than grabbing and releasing things
like locks. The following is the more descriptive form of it.

KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)

wait @j_wait_commit
                        ext4_truncate_failed_write()
                        down_write(mapping.invalidate_lock)

                        ext4_truncate()
                        ...
                        wait @j_wait_transaction_locked

                                                ext4_truncate_failed_write()
                                                down_write(mapping.invalidate_lock)

                                                ext4_should_retry_alloc()
                                                ...
                                                __jbd2_log_start_commit()
                                                wake_up(j_wait_commit)
jbd2_journal_commit_transaction()
wake_up(j_wait_transaction_locked)
                        up_write(mapping.invalidate_lock)

I hope this would help you understand the report.
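
If it helps further, here is a self-contained userspace analogue of the
same dependency shape. It's an illustration I wrote with pthreads - not
ext4/jbd2 code - and the names are only borrowed from the report:

/*
 * Illustration only: three threads reproducing the A/B/C cycle above,
 * with two condition variables standing in for the two wait_queues and
 * one rwlock standing in for mapping.invalidate_lock. If task1_like()
 * takes the rwlock first, all three threads block on one another.
 */
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wq_commit = PTHREAD_COND_INITIALIZER;       /* A */
static pthread_cond_t wq_txn_locked = PTHREAD_COND_INITIALIZER;   /* B */
static pthread_rwlock_t inval_lock = PTHREAD_RWLOCK_INITIALIZER;  /* C */
static bool commit_requested, txn_unlocked;

static void *kjournald2_like(void *arg)         /* context A */
{
        pthread_mutex_lock(&m);
        while (!commit_requested)               /* wait A */
                pthread_cond_wait(&wq_commit, &m);
        txn_unlocked = true;
        pthread_cond_broadcast(&wq_txn_locked); /* wake up B */
        pthread_mutex_unlock(&m);
        return NULL;
}

static void *task1_like(void *arg)              /* context B */
{
        pthread_rwlock_wrlock(&inval_lock);     /* grab C */
        pthread_mutex_lock(&m);
        while (!txn_unlocked)                   /* wait B */
                pthread_cond_wait(&wq_txn_locked, &m);
        pthread_mutex_unlock(&m);
        pthread_rwlock_unlock(&inval_lock);     /* release C */
        return NULL;
}

static void *task2_like(void *arg)              /* context C */
{
        pthread_rwlock_wrlock(&inval_lock);     /* wait C */
        pthread_mutex_lock(&m);
        commit_requested = true;
        pthread_cond_broadcast(&wq_commit);     /* wake up A */
        pthread_mutex_unlock(&m);
        pthread_rwlock_unlock(&inval_lock);
        return NULL;
}

int main(void)
{
        pthread_t a, b, c;

        pthread_create(&b, NULL, task1_like, NULL); /* B takes C first */
        pthread_create(&a, NULL, kjournald2_like, NULL);
        pthread_create(&c, NULL, task2_like, NULL);
        pthread_join(a, NULL); /* never returns: A -> C -> B -> A */
        return 0;
}

In the kernel, of course, there are other wakeup sources that this
analogue deliberately leaves out; that's exactly what the next
paragraph is about.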

Yeah... This is what Dept complained about. And as Ted said, the
kthread would be woken up by another wakeup source, so it's not a true
deadlock. However, these three threads, and any other tasks waiting for
any of the events A, B, C, would be stuck for a while until that wakeup
eventually wakes up the kthread, kjournald2.

I guess that's not what ext4 is meant to do. Honestly, though, ext4 and
the journal system look so complicated that I'm not convinced...

Thanks,
Byungchul

>
> Honza
>
>
>
> > [ 9.008184] ---------------------------------------------------
> > [ 9.008184] context A's detail
> > [ 9.008185] ---------------------------------------------------
> > [ 9.008186] context A
> > [ 9.008186] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008187] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008188] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008189]
> > [ 9.008190] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008191] (N/A)
> > [ 9.008191]
> > [ 9.008192] [W] wait(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008193] prepare_to_wait (kernel/sched/wait.c:275)
> > [ 9.008197] stacktrace:
> > [ 9.008198] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> > [ 9.008200] schedule (kernel/sched/core.c:6373 (discriminator 1))
> > [ 9.008201] kjournald2 (fs/jbd2/journal.c:250)
> > [ 9.008203] kthread (kernel/kthread.c:377)
> > [ 9.008206] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008209]
> > [ 9.008209] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008210] __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008212] stacktrace:
> > [ 9.008213] dept_event (kernel/dependency/dept.c:2337)
> > [ 9.008215] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008217] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008218] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> > [ 9.008221] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> > [ 9.008223] kthread (kernel/kthread.c:377)
> > [ 9.008224] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008226] ---------------------------------------------------
> > [ 9.008226] context B's detail
> > [ 9.008227] ---------------------------------------------------
> > [ 9.008228] context B
> > [ 9.008228] [S] down_write(mapping.invalidate_lock:0)
> > [ 9.008229] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > [ 9.008230] [E] up_write(mapping.invalidate_lock:0)
> > [ 9.008231]
> > [ 9.008232] [S] down_write(mapping.invalidate_lock:0):
> > [ 9.008233] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008237] stacktrace:
> > [ 9.008237] down_write (kernel/locking/rwsem.c:1514)
> > [ 9.008239] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008241] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008243] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008245] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008247] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008250] vfs_write (fs/read_write.c:590)
> > [ 9.008251] ksys_write (fs/read_write.c:644)
> > [ 9.008253] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008255] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008258]
> > [ 9.008258] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0):
> > [ 9.008259] prepare_to_wait (kernel/sched/wait.c:275)
> > [ 9.008261] stacktrace:
> > [ 9.008261] __schedule (kernel/sched/sched.h:1318 kernel/sched/sched.h:1616 kernel/sched/core.c:6213)
> > [ 9.008263] schedule (kernel/sched/core.c:6373 (discriminator 1))
> > [ 9.008264] wait_transaction_locked (fs/jbd2/transaction.c:184)
> > [ 9.008266] add_transaction_credits (fs/jbd2/transaction.c:248 (discriminator 3))
> > [ 9.008267] start_this_handle (fs/jbd2/transaction.c:427)
> > [ 9.008269] jbd2__journal_start (fs/jbd2/transaction.c:526)
> > [ 9.008271] __ext4_journal_start_sb (fs/ext4/ext4_jbd2.c:105)
> > [ 9.008273] ext4_truncate (fs/ext4/inode.c:4164)
> > [ 9.008274] ext4_da_write_begin (./include/linux/fs.h:827 fs/ext4/truncate.h:23 fs/ext4/inode.c:2963)
> > [ 9.008276] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008277] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008279] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008281] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008283] vfs_write (fs/read_write.c:590)
> > [ 9.008284] ksys_write (fs/read_write.c:644)
> > [ 9.008285] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008287]
> > [ 9.008288] [E] up_write(mapping.invalidate_lock:0):
> > [ 9.008288] ext4_da_get_block_prep (fs/ext4/inode.c:1795 fs/ext4/inode.c:1829)
> > [ 9.008291] ---------------------------------------------------
> > [ 9.008291] context C's detail
> > [ 9.008292] ---------------------------------------------------
> > [ 9.008292] context C
> > [ 9.008293] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008294] [W] down_write(mapping.invalidate_lock:0)
> > [ 9.008295] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > [ 9.008296]
> > [ 9.008297] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008298] (N/A)
> > [ 9.008298]
> > [ 9.008299] [W] down_write(mapping.invalidate_lock:0):
> > [ 9.008299] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008302] stacktrace:
> > [ 9.008302] down_write (kernel/locking/rwsem.c:1514)
> > [ 9.008304] ext4_da_write_begin (fs/ext4/truncate.h:21 fs/ext4/inode.c:2963)
> > [ 9.008305] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008307] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008309] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008311] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008312] vfs_write (fs/read_write.c:590)
> > [ 9.008314] ksys_write (fs/read_write.c:644)
> > [ 9.008315] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008316] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008318]
> > [ 9.008319] [E] event(&(&journal->j_wait_commit)->dmap:0):
> > [ 9.008320] __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008321] stacktrace:
> > [ 9.008322] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008323] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008324] __jbd2_log_start_commit (fs/jbd2/journal.c:508)
> > [ 9.008326] jbd2_log_start_commit (fs/jbd2/journal.c:527)
> > [ 9.008327] __jbd2_journal_force_commit (fs/jbd2/journal.c:560)
> > [ 9.008329] jbd2_journal_force_commit_nested (fs/jbd2/journal.c:583)
> > [ 9.008331] ext4_should_retry_alloc (fs/ext4/balloc.c:670 (discriminator 3))
> > [ 9.008332] ext4_da_write_begin (fs/ext4/inode.c:2965 (discriminator 1))
> > [ 9.008334] generic_perform_write (mm/filemap.c:3784)
> > [ 9.008335] ext4_buffered_write_iter (fs/ext4/file.c:269)
> > [ 9.008337] ext4_file_write_iter (fs/ext4/file.c:677)
> > [ 9.008339] new_sync_write (fs/read_write.c:504 (discriminator 1))
> > [ 9.008341] vfs_write (fs/read_write.c:590)
> > [ 9.008342] ksys_write (fs/read_write.c:644)
> > [ 9.008343] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
> > [ 9.008345] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113)
> > [ 9.008347] ---------------------------------------------------
> > [ 9.008348] information that might be helpful
> > [ 9.008348] ---------------------------------------------------
> > [ 9.008349] CPU: 0 PID: 89 Comm: jbd2/sda1-8 Tainted: G W 5.17.0-rc1-00015-gb94f67143867-dirty #2
> > [ 9.008352] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> > [ 9.008353] Call Trace:
> > [ 9.008354] <TASK>
> > [ 9.008355] dump_stack_lvl (lib/dump_stack.c:107)
> > [ 9.008358] print_circle (./arch/x86/include/asm/atomic.h:108 ./include/linux/atomic/atomic-instrumented.h:258 kernel/dependency/dept.c:157 kernel/dependency/dept.c:762)
> > [ 9.008360] ? print_circle (kernel/dependency/dept.c:1086)
> > [ 9.008362] cb_check_dl (kernel/dependency/dept.c:1104)
> > [ 9.008364] bfs (kernel/dependency/dept.c:860)
> > [ 9.008366] add_dep (kernel/dependency/dept.c:1423)
> > [ 9.008368] do_event.isra.25 (kernel/dependency/dept.c:1651)
> > [ 9.008370] ? __wake_up_common (kernel/sched/wait.c:108)
> > [ 9.008372] dept_event (kernel/dependency/dept.c:2337)
> > [ 9.008374] __wake_up_common (kernel/sched/wait.c:109)
> > [ 9.008376] __wake_up_common_lock (./include/linux/spinlock.h:428 (discriminator 1) kernel/sched/wait.c:141 (discriminator 1))
> > [ 9.008379] jbd2_journal_commit_transaction (fs/jbd2/commit.c:583)
> > [ 9.008381] ? arch_stack_walk (arch/x86/kernel/stacktrace.c:24)
> > [ 9.008385] ? ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008387] ? dept_enable_hardirq (./arch/x86/include/asm/current.h:15 kernel/dependency/dept.c:241 kernel/dependency/dept.c:999 kernel/dependency/dept.c:1043 kernel/dependency/dept.c:1843)
> > [ 9.008389] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:45 ./arch/x86/include/asm/irqflags.h:80 ./arch/x86/include/asm/irqflags.h:138 ./include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> > [ 9.008392] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/preempt.h:103 ./include/linux/spinlock_api_smp.h:152 kernel/locking/spinlock.c:194)
> > [ 9.008394] ? try_to_del_timer_sync (kernel/time/timer.c:1239)
> > [ 9.008396] kjournald2 (fs/jbd2/journal.c:214 (discriminator 3))
> > [ 9.008398] ? prepare_to_wait_exclusive (kernel/sched/wait.c:431)
> > [ 9.008400] ? commit_timeout (fs/jbd2/journal.c:173)
> > [ 9.008402] kthread (kernel/kthread.c:377)
> > [ 9.008404] ? kthread_complete_and_exit (kernel/kthread.c:332)
> > [ 9.008407] ret_from_fork (arch/x86/entry/entry_64.S:301)
> > [ 9.008410] </TASK>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-02-24 02:09:17

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> On Wed 23-02-22 09:35:34, Byungchul Park wrote:
> > On Mon, Feb 21, 2022 at 08:02:04PM +0100, Jan Kara wrote:
> > > On Thu 17-02-22 20:10:04, Byungchul Park wrote:
> > > > [ 9.008161] ===================================================
> > > > [ 9.008163] DEPT: Circular dependency has been detected.
> > > > [ 9.008164] 5.17.0-rc1-00015-gb94f67143867-dirty #2 Tainted: G W
> > > > [ 9.008166] ---------------------------------------------------
> > > > [ 9.008167] summary
> > > > [ 9.008167] ---------------------------------------------------
> > > > [ 9.008168] *** DEADLOCK ***
> > > > [ 9.008168]
> > > > [ 9.008168] context A
> > > > [ 9.008169] [S] (unknown)(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008171] [W] wait(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008172] [E] event(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008173]
> > > > [ 9.008173] context B
> > > > [ 9.008174] [S] down_write(mapping.invalidate_lock:0)
> > > > [ 9.008175] [W] wait(&(&journal->j_wait_transaction_locked)->dmap:0)
> > > > [ 9.008176] [E] up_write(mapping.invalidate_lock:0)
> > > > [ 9.008177]
> > > > [ 9.008178] context C
> > > > [ 9.008179] [S] (unknown)(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008180] [W] down_write(mapping.invalidate_lock:0)
> > > > [ 9.008181] [E] event(&(&journal->j_wait_commit)->dmap:0)
> > > > [ 9.008181]
> > > > [ 9.008182] [S]: start of the event context
> > > > [ 9.008183] [W]: the wait blocked
> > > > [ 9.008183] [E]: the event not reachable
> > >
> > > So what situation is your tool complaining about here? Can you perhaps show
> > > it here in a more common visualization like:
> >
> > Sure.
> >
> > > TASK1                   TASK2
> > >                         does foo, grabs Z
> > > does X, grabs lock Y
> > > blocks on Z
> > >                         blocks on Y
> > >
> > > or something like that? Because I was not able to decipher this from the
> > > report even after trying for some time...
> >
> > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> >
> > wait A
> > --- stuck
> >                         wait B
> >                         --- stuck
> >                                                 wait C
> >                                                 --- stuck
> >
> > wake up B               wake up C               wake up A
> >
> > where:
> >    A is a wait_queue, j_wait_commit
> >    B is a wait_queue, j_wait_transaction_locked
> >    C is a rwsem, mapping.invalidate_lock
>
> I see. But a situation like this is not necessarily a guarantee of a
> deadlock, is it? I mean there can be a task D that will eventually call, say,
> 'wake up B' and unblock everything and this is how things were designed to
> work? Multiple sources of wakeups are quite common I'd say... What does

Yes. At the very beginning, when I designed Dept, I was thinking for
quite a long time about whether to support multiple wakeup sources or
not. Supporting them would be a better option to avoid non-critical
reports. However, I thought we'd better fix it anyway - not urgent,
though - if there's any single circular dependency. That's why I decided
not to support it for now and wanted to gather the kernel folks'
opinions. The question is which policy we should go with.

> Dept do to prevent false reports in cases like this?
>
> > The above is the simplest form. And it's worth noting that Dept focuses
> > on waits and events themselves rather than grabbing and releasing things
> > like locks. The following is the more descriptive form of it.
> >
> > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> >
> > wait @j_wait_commit
> >                         ext4_truncate_failed_write()
> >                         down_write(mapping.invalidate_lock)
> >
> >                         ext4_truncate()
> >                         ...
> >                         wait @j_wait_transaction_locked
> >
> >                                                 ext4_truncate_failed_write()
> >                                                 down_write(mapping.invalidate_lock)
> >
> >                                                 ext4_should_retry_alloc()
> >                                                 ...
> >                                                 __jbd2_log_start_commit()
> >                                                 wake_up(j_wait_commit)
> > jbd2_journal_commit_transaction()
> > wake_up(j_wait_transaction_locked)
> >                         up_write(mapping.invalidate_lock)
> >
> > I hope this would help you understand the report.
>
> I see, thanks for explanation! So the above scenario is impossible because

My pleasure.

> for anyone to block on @j_wait_transaction_locked the transaction must be
> committing, which is done only by kjournald2 kthread and so that thread
> cannot be waiting at @j_wait_commit. Essentially blocking on
> @j_wait_transaction_locked means @j_wait_commit wakeup was already done.

kjournald2 repeatedly does the wait and the wake_up, so the above
scenario looks possible to me even based on what you explained. Maybe I
should understand the journal internals better for further discussion.
Your explanation is very helpful. Thank you, really.

Thanks,
Byungchul

> I guess this shows there can be non-trivial dependencies between wait
> queues which are difficult to track in an automated way and without such
> tracking we are going to see false positives...
>
> Honza
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-02-24 10:28:33

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > >
> > > wait A
> > > --- stuck
> > >                         wait B
> > >                         --- stuck
> > >                                                 wait C
> > >                                                 --- stuck
> > >
> > > wake up B               wake up C               wake up A
> > >
> > > where:
> > >    A is a wait_queue, j_wait_commit
> > >    B is a wait_queue, j_wait_transaction_locked
> > >    C is a rwsem, mapping.invalidate_lock
> >
> > I see. But a situation like this is not necessarily a guarantee of a
> > deadlock, is it? I mean there can be a task D that will eventually call, say,
> > 'wake up B' and unblock everything and this is how things were designed to
> > work? Multiple sources of wakeups are quite common I'd say... What does
>
> > Yes. At the very beginning, when I designed Dept, I was thinking for
> > quite a long time about whether to support multiple wakeup sources or
> > not. Supporting them would be a better option to avoid non-critical
> > reports. However, I thought we'd better fix it anyway - not urgent,
> > though - if there's any single circular dependency. That's why I decided
> > not to support it for now and wanted to gather the kernel folks'
> > opinions. The question is which policy we should go with.

I see. So supporting only a single wakeup source is fine for locks I guess.
But for general wait queues or other synchronization mechanisms, I'm afraid
it will lead to quite some false positive reports. Just my 2c.

> > Dept do to prevent false reports in cases like this?
> >
> > > The above is the simplest form. And it's worth noting that Dept focuses
> > > on waits and events themselves rather than grabbing and releasing things
> > > like locks. The following is the more descriptive form of it.
> > >
> > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > >
> > > wait @j_wait_commit
> > >                         ext4_truncate_failed_write()
> > >                         down_write(mapping.invalidate_lock)
> > >
> > >                         ext4_truncate()
> > >                         ...
> > >                         wait @j_wait_transaction_locked
> > >
> > >                                                 ext4_truncate_failed_write()
> > >                                                 down_write(mapping.invalidate_lock)
> > >
> > >                                                 ext4_should_retry_alloc()
> > >                                                 ...
> > >                                                 __jbd2_log_start_commit()
> > >                                                 wake_up(j_wait_commit)
> > > jbd2_journal_commit_transaction()
> > > wake_up(j_wait_transaction_locked)
> > >                         up_write(mapping.invalidate_lock)
> > >
> > > I hope this would help you understand the report.
> >
> > I see, thanks for explanation! So the above scenario is impossible because
>
> My pleasure.
>
> > for anyone to block on @j_wait_transaction_locked the transaction must be
> > committing, which is done only by kjournald2 kthread and so that thread
> > cannot be waiting at @j_wait_commit. Essentially blocking on
> > @j_wait_transaction_locked means @j_wait_commit wakeup was already done.
>
> > kjournald2 repeatedly does the wait and the wake_up, so the above
> > scenario looks possible to me even based on what you explained. Maybe I
> > should understand the journal internals better for further discussion.
> > Your explanation is very helpful. Thank you, really.

OK, let me provide you with more details for better understanding :) In
jbd2 we have an object called a 'transaction'. This object can go
through many states, but what matters for our case is that a transaction
is moved into and out of the T_LOCKED state only while the
jbd2_journal_commit_transaction() function is executing, and waiting on
the j_wait_transaction_locked waitqueue is exactly waiting for a
transaction to get out of the T_LOCKED state. The function
jbd2_journal_commit_transaction() is executed only by kjournald. Hence
anyone can see a transaction in the T_LOCKED state only if kjournald is
running inside jbd2_journal_commit_transaction(), and thus kjournald
cannot be sleeping on j_wait_commit at the same time. Does this explain
things?
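
For completeness, the waiting side relies on the usual pattern that
avoids lost wakeups. Roughly - this is only my simplified sketch of
wait_transaction_locked() in fs/jbd2/transaction.c; the real code also
kicks the commit thread and handles journal abort:

        /* Entered with j_state_lock held for reading. */
        DEFINE_WAIT(wait);

        prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
                        TASK_UNINTERRUPTIBLE);
        /*
         * We are queued on the waitqueue before dropping the lock, so
         * a wake_up() from jbd2_journal_commit_transaction() cannot be
         * lost between the state check and the sleep.
         */
        read_unlock(&journal->j_state_lock);
        schedule();
        finish_wait(&journal->j_wait_transaction_locked, &wait);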

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-02-28 10:54:51

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote:
> On Thu 24-02-22 10:11:02, Byungchul Park wrote:
> > On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
> > > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > > >
> > > > wait A
> > > > --- stuck
> > > >                         wait B
> > > >                         --- stuck
> > > >                                                 wait C
> > > >                                                 --- stuck
> > > >
> > > > wake up B               wake up C               wake up A
> > > >
> > > > where:
> > > >    A is a wait_queue, j_wait_commit
> > > >    B is a wait_queue, j_wait_transaction_locked
> > > >    C is a rwsem, mapping.invalidate_lock
> > >
> > > I see. But a situation like this is not necessarily a guarantee of a
> > > deadlock, is it? I mean there can be a task D that will eventually call, say,
> > > 'wake up B' and unblock everything and this is how things were designed to
> > > work? Multiple sources of wakeups are quite common I'd say... What does
> >
> > Yes. At the very beginning, when I designed Dept, I was thinking for
> > quite a long time about whether to support multiple wakeup sources or
> > not. Supporting them would be a better option to avoid non-critical
> > reports. However, I thought we'd better fix it anyway - not urgent,
> > though - if there's any single circular dependency. That's why I decided
> > not to support it for now and wanted to gather the kernel folks'
> > opinions. The question is which policy we should go with.
>
> I see. So supporting only a single wakeup source is fine for locks I guess.
> But for general wait queues or other synchronization mechanisms, I'm afraid
> it will lead to quite some false positive reports. Just my 2c.

Thank you for your feedback.

I realized we've been using "false positive" differently. There exist
three types of code in terms of dependency and deadlock. It's worth
noting that, in Dept, dependencies are built between waits and events.

---

case 1. Code with an actual circular dependency, but not deadlock.

A circular dependency can be broken by a rescue wakeup source, e.g. a
timeout. It's not a deadlock if it's okay that the contexts
participating in the circular dependency, and others waiting for the
events in the circle, are stuck until it gets broken. Otherwise, say,
if that stall is not intended, then it's problematic anyway.

1-1. What if we judge this code is problematic?
1-2. What if we judge this code is good?

case 2. Code with an actual circular dependency, and deadlock.

There's no other wakeup source than those within the circular
dependency. Literally deadlock. It's problematic and critical.

2-1. What if we judge this code is problematic?
2-2. What if we judge this code is good?

case 3. Code with no actual circular dependency, and not deadlock.

Must be good.

3-1. What if we judge this code is problematic?
3-2. What if we judge this code is good?

---

I call only 3-1 a "false positive" circular dependency, while you call
both 1-1 and 3-1 "false positive" deadlocks.

I've been wondering whether the kernel folks, esp. Linus, consider code
with any circular dependency problematic, even if it won't lead to a
deadlock, say, case 1. Even though I designed Dept based on what I
believe is right, of course, I'm willing to change the design according
to the majority opinion.

However, if I were the owner of the kernel, I would never allow case 1,
for better stability, even though the code happens to work okay for now.

Thanks,
Byungchul

> > > Dept do to prevent false reports in cases like this?
> > >
> > > > The above is the simplest form. And it's worth noting that Dept focuses
> > > > on waits and events themselves rather than grabbing and releasing things
> > > > like locks. The following is the more descriptive form of it.
> > > >
> > > > KJOURNALD2(kthread)     TASK1(ksys_write)       TASK2(ksys_write)
> > > >
> > > > wait @j_wait_commit
> > > >                         ext4_truncate_failed_write()
> > > >                         down_write(mapping.invalidate_lock)
> > > >
> > > >                         ext4_truncate()
> > > >                         ...
> > > >                         wait @j_wait_transaction_locked
> > > >
> > > >                                                 ext4_truncate_failed_write()
> > > >                                                 down_write(mapping.invalidate_lock)
> > > >
> > > >                                                 ext4_should_retry_alloc()
> > > >                                                 ...
> > > >                                                 __jbd2_log_start_commit()
> > > >                                                 wake_up(j_wait_commit)
> > > > jbd2_journal_commit_transaction()
> > > > wake_up(j_wait_transaction_locked)
> > > >                         up_write(mapping.invalidate_lock)
> > > >
> > > > I hope this would help you understand the report.
> > >
> > > I see, thanks for explanation! So the above scenario is impossible because
> >
> > My pleasure.
> >
> > > for anyone to block on @j_wait_transaction_locked the transaction must be
> > > committing, which is done only by kjournald2 kthread and so that thread
> > > cannot be waiting at @j_wait_commit. Essentially blocking on
> > > @j_wait_transaction_locked means @j_wait_commit wakeup was already done.
> >
> > kjournald2 repeatedly does the wait and the wake_up, so the above
> > scenario looks possible to me even based on what you explained. Maybe I
> > should understand the journal internals better for further discussion.
> > Your explanation is very helpful. Thank you, really.
>
> OK, let me provide you with more details for better understanding :) In
> jbd2 we have an object called a 'transaction'. This object can go
> through many states, but what matters for our case is that a transaction
> is moved into and out of the T_LOCKED state only while the
> jbd2_journal_commit_transaction() function is executing, and waiting on
> the j_wait_transaction_locked waitqueue is exactly waiting for a
> transaction to get out of the T_LOCKED state. The function
> jbd2_journal_commit_transaction() is executed only by kjournald. Hence
> anyone can see a transaction in the T_LOCKED state only if kjournald is
> running inside jbd2_journal_commit_transaction(), and thus kjournald
> cannot be sleeping on j_wait_commit at the same time. Does this explain
> things?
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-03-03 01:37:15

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Mon, Feb 28, 2022 at 04:25:04PM -0500, Theodore Ts'o wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source, e.g. a
> > > timeout. It's not a deadlock if it's okay that the contexts
> > > participating in the circular dependency, and others waiting for the
> > > events in the circle, are stuck until it gets broken. Otherwise, say,
> > > if that stall is not intended, then it's problematic anyway.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > I've been wondering whether the kernel folks, esp. Linus, consider code
> > > with any circular dependency problematic, even if it won't lead to a
> > > deadlock, say, case 1. Even though I designed Dept based on what I
> > > believe is right, of course, I'm willing to change the design according
> > > to the majority opinion.
> > >
> > > However, if I were the owner of the kernel, I would never allow case 1,
> > > for better stability, even though the code happens to work okay for now.
>
> Note, I used the example of the timeout as the most obvious way of
> explaining that a deadlock is not possible. There is also the much
> more complex explanation which Jan was trying to give, which is what
> leads to the circular dependency. It can happen that when trying to
> start a handle, if either (a) there is not enough space in the journal
> for new handles, or (b) the current transaction is so large that if we
> don't close the transaction and start a new one, we will end up
> running out of space in the future, and so in that case,
> start_this_handle() will block starting any more handles, and then
> wake up the commit thread. The commit thread then waits for the
> currently running threads to complete, before it allows new handles to
> start, and then it will complete the commit. In the case of (a) we
> then need to do a journal checkpoint, which is more work to release
> space in the journal, and only then, can we allow new handles to start.

Thank you for the full explanation of how journal things work.

> The bottom line is (a) it works, (b) there aren't significant delays,
> and for DEPT to complain that this is somehow wrong and we need to
> completely rearchitect perfectly working code because it doesn't
> conform to DEPT's idea of what is "correct" is not acceptable.

Thanks to you and Jan Kara, I realized it's not a real dependency in
the consumer and producer scenario - but again *ONLY IF* there is a
rescue wakeup source. Dept should track the rescue wakeup source instead
in that case.

I won't ask you to rearchitect the working code. The code looks sane.

Thanks a lot.

Thanks,
Byungchul

> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_queued);
> >         if (no work) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 unlock L
> >                 do work
> >                 wake_up(work_done)
> >         }
> > }
> >
> > AFAIU Dept will create dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_done)
> >         if (too much work queued) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 queue work
> >                 unlock L
> >                 wake_up(work_queued)
> >         }
> > }
> >
> > And Dept will create dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Cheers,
>
> - Ted

2022-03-03 05:37:27

by Byungchul Park

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

Ted wrote:
> On Thu, Mar 03, 2022 at 10:00:33AM +0900, Byungchul Park wrote:
> >
> > Unfortunately, it's neither perfect nor safe without another wakeup
> > source - a rescue wakeup source.
> >
> > consumer                        producer
> >
> >                                 lock L
> >                                 (too much work queued == true)
> >                                 unlock L
> >                                 --- preempted
> > lock L
> > unlock L
> > do work
> > lock L
> > unlock L
> > do work
> > ...
> > (no work == true)
> > sleep
> >                                 --- scheduled in
> >                                 sleep
>
> That's not how things work in ext4. It's **way** more complicated

You seem to have gotten it wrong. This example is the one Jan Kara gave
me. I just tried to explain things based on Jan Kara's example, so I
left all the statements that Jan Kara wrote as they were. Plus, the
example was very helpful. Thanks, Jan Kara.

> than that. We have multiple wait channels, one wake up the consumer
> (read: the commit thread), and one which wakes up any processes
> waiting for commit thread to have made forward progress. We also have
> two spin-lock protected sequence numbers, one which indicates the
> currently committed transaction #, and one indicating the transaction #
> that needs to be committed.
>
> On the commit thread, it will sleep on j_wait_commit, and when it is
> woken up, it will check to see if there is work to be done
> (j_commit_sequence != j_commit_request), and if so, do the work, and
> then wake up processes waiting on the wait_queue j_wait_done_commit.
> (Again, all of this uses the pattern, "prepare to wait", then check to
> see if we should sleep, if we do need to sleep, unlock j_state_lock,
> then sleep. So this prevents any races leading to lost wakeups.)
>
> On the start_this_handle() thread, if the current transaction is too
> full, we set j_commit_request to its transaction id to indicate that
> we want the current transaction to be committed, and then we wake up
> the j_wait_commit wait queue and then we enter a loop where do a
> prepare_to_wait in j_wait_done_commit, check to see if
> j_commit_sequence == the transaction id that we want to be completed,
> and if it's not done yet, we unlock the j_state_lock spinlock, and go
> to sleep. Again, because of the prepare_to_wait, there is no chance
> of a lost wakeup.

The above explanation gives me a clear view of how the journal
synchronization works. I appreciate it.
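
Just to check my understanding, I'd summarize the commit-thread side you
describe as roughly the following. This is only my simplified sketch,
not the actual kjournald2 in fs/jbd2/journal.c, which also handles the
commit timer, checkpointing and unmount:

        while (1) {
                DEFINE_WAIT(wait);
                int should_sleep;

                write_lock(&journal->j_state_lock);
                if (journal->j_commit_sequence != journal->j_commit_request) {
                        write_unlock(&journal->j_state_lock);
                        /* Do the work, then wake up anyone waiting for it. */
                        jbd2_journal_commit_transaction(journal);
                        wake_up(&journal->j_wait_done_commit);
                        continue;
                }
                /* Queue ourselves before re-checking, so no wakeup is lost. */
                prepare_to_wait(&journal->j_wait_commit, &wait,
                                TASK_INTERRUPTIBLE);
                should_sleep = journal->j_commit_sequence ==
                               journal->j_commit_request;
                write_unlock(&journal->j_state_lock);
                if (should_sleep)
                        schedule();
                finish_wait(&journal->j_wait_commit, &wait);
        }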

> So there really is no "consumer" and "producer" here. If you really
> insist on using this model, which really doesn't apply, for one

Dept does not assume a "consumer and producer" model at all; Dept works
with general waits and events. *That model is just one of them.*

> thread, it's the consumer with respect to one wait queue, and the
> producer with respect to the *other* wait queue. For the other
> thread, the consumer and producer roles are reversed.
>
> And of course, this is a highly simplified model, since we also have a
> wait queue used by the commit thread to wait for the number of active
> handles on a particular transaction to go to zero, and
> stop_this_handle() will wake up commit thread via this wait queue when
> the last active handle on a particular transaction is retired. (And
> yes, that parameter is also protected by a different spin lock which
> is per-transaction).

This one also gives me a clear view. Thanks a lot.

> So it seems to me that a fundamental flaw in DEPT's model is assuming
> that the only waiting paradigm that can be used is consumer/producer,

No, Dept does not.

> and that's simply not true. The fact that you use the term "lock" is
> also going to lead to a misleading line of reasoning, because properly

"lock/unlock L" comes from the Jan Kara's example. It has almost nothing
to do with the explanation. I just left "lock/unlock L" as a statement
that comes from the Jan Kara's example.

> speaking, they aren't really locks. We are simply using wait channels

I totally agree with you. *They aren't really locks; it's just waits
and wakeups.* That's exactly why I decided to develop Dept. Unlike
Lockdep, Dept is not interested in locks but focuses on waits and wakeup
sources themselves. I think you've gotten Dept quite wrong. Please ask
me more if there's anything you doubt about Dept.

Thanks,
Byungchul

2022-03-03 10:55:38

by Jan Kara

Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> On Mon, Feb 28, 2022 at 11:14:44AM +0100, Jan Kara wrote:
> > On Mon 28-02-22 18:28:26, Byungchul Park wrote:
> > > case 1. Code with an actual circular dependency, but not deadlock.
> > >
> > > A circular dependency can be broken by a rescue wakeup source e.g.
> > > timeout. It's not a deadlock if it's okay that the contexts
> > > participating in the circular dependency, and others waiting for the
> > > events in the circle, are stuck until it gets broken. Otherwise,
> > > say, if that stall is not intended, then it's problematic anyway.
> > >
> > > 1-1. What if we judge this code is problematic?
> > > 1-2. What if we judge this code is good?
> > >
> > > case 2. Code with an actual circular dependency, and deadlock.
> > >
> > > There's no other wakeup source than those within the circular
> > > dependency. Literally deadlock. It's problematic and critical.
> > >
> > > 2-1. What if we judge this code is problematic?
> > > 2-2. What if we judge this code is good?
> > >
> > > case 3. Code with no actual circular dependency, and not deadlock.
> > >
> > > Must be good.
> > >
> > > 3-1. What if we judge this code is problematic?
> > > 3-2. What if we judge this code is good?
> > >
> > > ---
> > >
> > > I call only 3-1 a "false positive" circular dependency, while you
> > > call both 1-1 and 3-1 "false positive" deadlocks.
> > >
> > > I've been wondering whether the kernel folks, esp. Linus, consider
> > > code with any circular dependency problematic, even if it won't lead
> > > to a deadlock, say, case 1. Even though I designed Dept based on what
> > > I believe is right, of course, I'm willing to change the design
> > > according to the majority opinion.
> > >
> > > However, I would never allow case 1 if I were the owner of the kernel
> > > for better stability, even though the code works anyway okay for now.
> >
> > So yes, I call a report for the situation "There is a circular dependency but
> > deadlock is not possible." a false positive. And that is because in my
> > opinion your definition of circular dependency includes schemes that are
> > useful and used in the kernel.
> >
> > Your example in case 1 is kind of borderline (I personally would consider
> > that bug as well) but there are other more valid schemes with multiple
> > wakeup sources like:
> >
> > We have a queue of work to do Q protected by lock L. Consumer process has
> > code like:
> >
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_queued);
> >         if (no work) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 unlock L
> >                 do work
> >                 wake_up(work_done)
> >         }
> > }
> >
> > AFAIU Dept will create a dependency here that 'wakeup work_done' is after
> > 'wait for work_queued'. Producer has code like:
>
> First of all, thank you for this good example.
>
> > while (1) {
> >         lock L
> >         prepare_to_wait(work_done)
> >         if (too much work queued) {
> >                 unlock L
> >                 sleep
> >         } else {
> >                 queue work
> >                 unlock L
> >                 wake_up(work_queued)
> >         }
> > }
> >
> > And Dept will create a dependency here that 'wakeup work_queued' is after
> > 'wait for work_done'. And thus we have a trivial cycle in the dependencies
> > despite the code being perfectly valid and safe.
>
> Unfortunately, it's neither perfect nor safe without another wakeup
> source - a rescue wakeup source.
>
> consumer                          producer
>
>                                   lock L
>                                   (too much work queued == true)
>                                   unlock L
>                                   --- preempted
> lock L
> unlock L
> do work
> lock L
> unlock L
> do work
> ...
> (no work == true)
> sleep
>                                   --- scheduled in
>                                   sleep
>
> This code leads to a deadlock without another wakeup source; that is, it's not safe.

So the scenario you describe above is indeed possible. But the trick is
that the wakeup from 'consumer' as it is doing work will remove
'producer' from the wait queue and change the 'producer' process state
to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()),
the scheduler will just treat this as another preemption point and the
'producer' will immediately or soon continue to run. So indeed we can
think of this as "another wakeup source", but the source is in the CPU
scheduler itself. This is the standard way waitqueues are used in the
kernel...
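
In code form, the window in question is the gap between
prepare_to_wait() and schedule() on the producer side. A minimal sketch
(reusing the hypothetical names from the example above, not real kernel
code):

static void producer_sketch(void)
{
        DEFINE_WAIT(wait);

        spin_lock(&L);
        prepare_to_wait(&work_done, &wait, TASK_UNINTERRUPTIBLE);
        if (too_much_work_queued) {
                spin_unlock(&L);
                /*
                 * Preemption can happen here. If the consumer meanwhile
                 * does the work and calls wake_up(&work_done), that
                 * wake_up() removes this task from the wait queue and
                 * sets it back to TASK_RUNNING. The schedule() below
                 * then acts as a mere preemption point and returns
                 * promptly - the "rescue wakeup source" is the CPU
                 * scheduler itself.
                 */
                schedule();
        } else {
                queue_work_item();      /* hypothetical helper */
                spin_unlock(&L);
                wake_up(&work_queued);
        }
        finish_wait(&work_done, &wait);
}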

> Lastly, just for your information, I need to explain how Dept works a
> little more so that you don't misunderstand it.
>
> Assuming the consumer and producer are guaranteed not to lead to a
> deadlock, like the following, Dept won't report it as a problem:
>
> consumer                          producer
>
>                                   sleep
> wakeup work_done
>                                   queue work
> sleep
>                                   wakeup work_queued
> do work
>                                   sleep
> wakeup work_done
>                                   queue work
> sleep
>                                   wakeup work_queued
> do work
>                                   sleep
> ...                               ...
>
> Dept does not consider all waits preceding an event, but only waits
> that might lead to a deadlock. In this case, Dept works with each
> region independently.
>
> consumer                          producer
>
>                                   sleep <- initiates region 1
> --- region 1 starts
> ...                               ...
> --- region 1 ends
> wakeup work_done
> ...                               ...
>                                   queue work
> ...                               ...
> sleep <- initiates region 2
> --- region 2 starts
> ...                               ...
> --- region 2 ends
>                                   wakeup work_queued
> ...                               ...
> do work
> ...                               ...
>                                   sleep <- initiates region 3
> --- region 3 starts
> ...                               ...
> --- region 3 ends
> wakeup work_done
> ...                               ...
>                                   queue work
> ...                               ...
> sleep <- initiates region 4
> --- region 4 starts
> ...                               ...
> --- region 4 ends
>                                   wakeup work_queued
> ...                               ...
> do work
> ...                               ...
>
> That is, Dept does not build dependencies across different regions. So
> you don't have to worry about unreasonable false positives that much.
>
> Thoughts?

Thanks for the explanation! And what exactly defines the 'regions'? When
some process goes to sleep on some waitqueue, this defines the start of
a region at the place where all the other processes are at that moment,
and a wakeup of the waitqueue is the end of the region?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-03-04 07:16:15

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Mar 03, 2022 at 09:36:25AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 03, 2022 at 02:23:33PM +0900, Byungchul Park wrote:
> > I totally agree with you. *They aren't really locks; they're just
> > waits and wakeups.* That's exactly why I decided to develop Dept.
> > Unlike Lockdep, Dept is not interested in locks but focuses on waits
> > and wakeup sources themselves. I think you misunderstand Dept a lot.
> > Please ask me if there are things about Dept you doubt.
>
> So the question is this --- do you now understand why, even though
> there is a circular dependency, nothing gets stalled in the
> interactions between the two wait channels?

??? I'm afraid I don't get you.

All contexts waiting for any of the events in the circular dependency
chain will definitely be stuck if there is a circular dependency, as I
explained. So we need another wakeup source to break the circle. In the
ext4 code, you might have a wakeup source for breaking the circle.

What I agreed with is:

The case where 1) the circular dependency is inevitable, 2) there is
another wakeup source for breaking the circle, and 3) the duration of
the sleep is short enough, should be acceptable.

Sounds good?

Thanks,
Byungchul

2022-03-04 16:11:12

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Thu, Mar 03, 2022 at 10:54:56AM +0100, Jan Kara wrote:
> On Thu 03-03-22 10:00:33, Byungchul Park wrote:
> > Unfortunately, it's neither perfect nor safe without another wakeup
> > source - a rescue wakeup source.
> >
> > consumer                          producer
> >
> >                                   lock L
> >                                   (too much work queued == true)
> >                                   unlock L
> >                                   --- preempted
> > lock L
> > unlock L
> > do work
> > lock L
> > unlock L
> > do work
> > ...
> > (no work == true)
> > sleep
> >                                   --- scheduled in
> >                                   sleep
> >
> > This code leads to a deadlock without another wakeup source; that is, it's not safe.
>
> So the scenario you describe above is indeed possible. But the trick is
> that the wakeup from 'consumer' as it is doing work will remove
> 'producer' from the wait queue and change the 'producer' process state
> to 'TASK_RUNNING'. So when 'producer' calls sleep (in fact schedule()),
> the scheduler will just treat this as another preemption point and the
> 'producer' will immediately or soon continue to run. So indeed we can
> think of this as "another wakeup source", but the source is in the CPU
> scheduler itself. This is the standard way waitqueues are used in the
> kernel...

Nice! Thanks for the explanation. I will take it into account if needed.

> > Lastly, just for your information, I need to explain how Dept works a
> > little more so that you don't misunderstand it.
> >
> > Assuming the consumer and producer are guaranteed not to lead to a
> > deadlock, like the following, Dept won't report it as a problem:
> >
> > consumer                          producer
> >
> >                                   sleep
> > wakeup work_done
> >                                   queue work
> > sleep
> >                                   wakeup work_queued
> > do work
> >                                   sleep
> > wakeup work_done
> >                                   queue work
> > sleep
> >                                   wakeup work_queued
> > do work
> >                                   sleep
> > ...                               ...
> >
> > Dept does not consider all waits preceding an event, but only waits
> > that might lead to a deadlock. In this case, Dept works with each
> > region independently.
> >
> > consumer                          producer
> >
> >                                   sleep <- initiates region 1
> > --- region 1 starts
> > ...                               ...
> > --- region 1 ends
> > wakeup work_done
> > ...                               ...
> >                                   queue work
> > ...                               ...
> > sleep <- initiates region 2
> > --- region 2 starts
> > ...                               ...
> > --- region 2 ends
> >                                   wakeup work_queued
> > ...                               ...
> > do work
> > ...                               ...
> >                                   sleep <- initiates region 3
> > --- region 3 starts
> > ...                               ...
> > --- region 3 ends
> > wakeup work_done
> > ...                               ...
> >                                   queue work
> > ...                               ...
> > sleep <- initiates region 4
> > --- region 4 starts
> > ...                               ...
> > --- region 4 ends
> >                                   wakeup work_queued
> > ...                               ...
> > do work
> > ...                               ...
> >
> > That is, Dept does not build dependencies across different regions. So
> > you don't have to worry about unreasonable false positives that much.
> >
> > Thoughts?
>
> Thanks for the explanation! And what exactly defines the 'regions'? When
> some process goes to sleep on some waitqueue, this defines the start of
> a region at the place where all the other processes are at that moment,
> and a wakeup of the waitqueue is the end of the region?

Yes. Let me explain it more for better understanding.
(I copied this from the discussion I had with Matthew.)


ideal view
-----------
context X                          context Y

request event E                    ...
write REQUESTEVENT                 when (notice REQUESTEVENT written)
...                                    notice the request from X [S]

--- ideally region 1 starts here
wait for the event                 ...
sleep                              if (can see REQUESTEVENT written)
                                       it's on the way to the event
                                   ...
...
--- ideally region 1 ends here

                                   finally the event [E]

Dept basically works with the above view with regard to waits and
events. But it's very hard to identify the ideal [S] point in practice,
so Dept instead identifies the [S] point by checking WAITSTART with
memory barriers, as follows, which makes Dept work conservatively.


Dept's view
------------
context X                          context Y

request event E                    ...
write REQUESTEVENT                 when (notice REQUESTEVENT written)
...                                    notice the request from X

--- region 2 Dept gives up starts
wait for the event                 ...
write barrier
write WAITSTART                    read barrier
sleep                              when (notice WAITSTART written)
                                       ensure the request has come [S]

--- region 2 Dept gives up ends
--- region 3 starts here
...
                                   if (can see WAITSTART written)
                                       it's on the way to the event
                                   ...
...
--- region 3 ends here

                                   finally the event [E]

In short, Dept works with region 3.
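
Expressed with kernel barrier primitives, the handshake above would look
something like this (a hypothetical sketch; the flag names are invented
for illustration and are not Dept's actual implementation):

        /* context X (the waiter) */
        WRITE_ONCE(request_event, 1);   /* request event E */
        smp_wmb();                      /* order the request before the flag */
        WRITE_ONCE(wait_start, 1);      /* publish WAITSTART */
        schedule();                     /* sleep */

        /* context Y (the event side) */
        if (READ_ONCE(wait_start)) {    /* notice WAITSTART written */
                smp_rmb();              /* pairs with smp_wmb() in X */
                /*
                 * X's request is guaranteed to be visible here, so this
                 * point can conservatively serve as [S]: from here on,
                 * Y is considered on the way to triggering event E.
                 */
        }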

Thanks,
Byungchul

2022-03-05 16:21:53

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Fri, Mar 04, 2022 at 10:26:23PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 09:42:37AM +0900, Byungchul Park wrote:
> >
> > All contexts waiting for any of the events in the circular dependency
> > chain will definitely be stuck if there is a circular dependency, as I
> > explained. So we need another wakeup source to break the circle. In
> > the ext4 code, you might have a wakeup source for breaking the circle.
> >
> > What I agreed with is:
> >
> > The case where 1) the circular dependency is inevitable, 2) there is
> > another wakeup source for breaking the circle, and 3) the duration of
> > the sleep is short enough, should be acceptable.
> >
> > Sounds good?
>
> These dependencies are part of every single ext4 metadata update,
> and if there were any unnecessary sleeps, this would be a major
> performance gap, and this is a very well studied part of ext4.
>
> There are some places where we sleep, sure. In some cases
> start_this_handle() needs to wait for a commit to complete, and the
> commit thread might need to sleep for I/O to complete. But the moment
> the thing that we're waiting for is complete, we wake up all of the
> processes on the wait queue. But in the case where we wait for I/O
> completion, that wakeup is coming from the device driver, when it
> receives the I/O completion interrupt from the hard drive. Is
> that considered an "external source"? Maybe DEPT doesn't recognize
> that this is certain to happen just as day follows the night? (Well,
> maybe the I/O completion interrupt might not happen if the disk drive
> bursts into flames --- but then, you've got bigger problems. :-)

Almost everything you've been blaming on Dept is total nonsense. Based
on what you're saying, I'm convinced that you don't understand even 1%
of how Dept works. You don't even try to understand it before blaming
it.

You don't have to understand or support it. But I can't respond to you
if you keep saying silly things that way.

> In any case, if DEPT is going to report these circular dependencies as
> "bugs that MUST be fixed", it's going to be pure noise and I will
> ignore all DEPT reports, and will push back on having Lockdep replaced

Dept is going to be improved so that what you are concerned about won't
be reported.

> by DEPT --- because Lockdep gives us actionable reports, and if DEPT

Right. Dept should give actionable reports, too.

> can't tell the difference between a valid programming pattern and a
> bug, then it's worse than useless.

Needless to say.

2022-03-05 16:26:46

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Fri, Mar 04, 2022 at 10:40:35PM -0500, Theodore Ts'o wrote:
> On Fri, Mar 04, 2022 at 12:20:02PM +0900, Byungchul Park wrote:
> >
> > I found a point where the two wait channels don't lead to a deadlock
> > in some cases, thanks to Jan Kara. I will fix it so that Dept won't
> > complain about it.
>
> I sent my last (admittedly cranky) message before you sent this. I'm
> glad you finally understood Jan's explanation. I was trying to tell

Not "finally" - I've understood him every time he tried to tell me something.

> you the same thing, but apparently I failed to communicate in a

I don't think so. Your point and Jan's point are different. Everything
he has said makes sense. But yours does not.

> sufficiently clear manner. In any case, what Jan described is a
> fundamental part of how wait queues work, and I'm kind of amazed that
> you were able to implement DEPT without understanding it. (But maybe

Of course, it was possible because all that Dept has to know for its
basic operation is waits and events. Subtle things like what Jan told me
help make Dept better.

> that is why some of the DEPT reports were completely incomprehensible

It's because you are blindly blaming it without understanding how Dept
works at all. I will fix what must be fixed. Don't worry.

> to me; I couldn't interpret why in the world DEPT was saying there was
> a problem.)

I can tell you if you really want to understand why. But I can't if you
are like this.

> In any case, the thing I would ask is a little humility. We regularly
> use lockdep, and we run a huge number of stress tests, throughout each
> development cycle.

Sure.

> So if DEPT is issuing lots of reports about apparently circular
> dependencies, please try to be open to the thought that the fault is

No one claimed that Dept has no faults. I think you worry too much.

> in DEPT, and don't try to argue with maintainers that their code MUST
> be buggy --- but since you don't understand our code, and DEPT must be

No one argued that their code must be buggy, either. So I don't think
you have to worry about something that has never happened.

> theoretically perfect, that it is up to the Maintainers to prove to
> you that their code is correct.
>
> I am going to gently suggest that it is at least as likely, if not
> more likely, that the failure is in DEPT or your understanding of what

No doubt. I already think so. But that doesn't mean I have to keep quiet
instead of discussing how to improve Dept. I will keep improving Dept in
a reasonable way.

> how kernel wait channels and locking works. After all, why would it
> be that we haven't found these problems via our other QA practices?

Let's talk more once you understand at least 10% of how Dept works.
Otherwise, I don't think we can talk in a productive way.

2022-03-10 01:51:41

by Byungchul Park

[permalink] [raw]
Subject: Re: Report 2 in ext4 and journal based on v5.17-rc1

On Sun, Mar 06, 2022 at 09:19:10AM -0500, Theodore Ts'o wrote:
> On Sun, Mar 06, 2022 at 07:51:42PM +0900, Byungchul Park wrote:
> > >
> > > Users of DEPT must not have to understand how DEPT works in order to
> >
> > Users must not have to understand how Dept works, for sure, and
> > haters must not criticize things based on wrong guesses.
>
> For the record, I don't hate DEPT. I *fear* that DEPT will result in
> my getting spammed with a huge number of false positives once automated
> testing systems like Syzkaller, the zero-day test robot, etc., get hold
> of it after it gets merged and start generating hundreds of automated
> reports.

Agreed. Dept should not be part of an *automated testing system* until
it finally matches Lockdep in terms of false positives. However, it's
impossible to achieve that while keeping it out of the tree.

Besides automated testing systems, Dept works great in the middle of
developing something with complicated synchronization. Developers don't
have to worry anymore about real reports - ones that should be reported
- being suppressed by a false positive.

I will explicitly describe EXPERIMENTAL and "Dept might false-alarm" in
Kconfig until it's considered a tool with few false alarms.

> > Sure, it should be done manually. I should do it on my own when that
> > kind of issue arises.
>
> The question here is how often will it need to be done, and how easy

I guess it will rarely happen. I'd like to see, too.

> will it be to "do it manually"? Suppose we mark all of the DEPT false

Very easy. Equal to or easier than what we do for Lockdep. But the
objects of interest would be wait/event objects rather than locks.

> positives before it gets merged? How easy will it be to suppress
> future false positives as the kernel evolves?

Same as - or even better than - what we do for Lockdep.

And we'd better consider those activities as code documentation. Not
only making things work, but also organizing the code and documenting
it in code, is very meaningful.

> Perhaps one method is to have a way to take a particular wait queue,
> or call to schedule(), or at the level of an entire kernel source
> file, and opt it out from DEPT analysis? That way, if DEPT gets
> merged, and a maintainer starts getting spammed by bogus (or

Once Dept gets stable - and hopefully, now that Dept works very
conservatively, there might not be as many false positives as you're
worrying about. The situation is under control.

> That way we don't necessarily need to have a debate over how close to
> zero percent false positives is necessary before DEPT can get merged.

Nonsense. I would agree with you if the same bar had been applied when
Lockdep was merged. Still, I'll try to achieve almost zero false
positives; again, it's impossible to do that out of tree.

> And we avoid needing to force maintainers to prove that a DEPT report

So... it'd be okay if Dept goes in, just not as part of an automated
testing system. Right?

> is a false positive, which is from my experience hard to do, since
> they get accused of being DEPT haters and not understanding DEPT.

Honestly, the problem is not that they don't understand domains other
than the ones they are familiar with; it's another issue. I won't
mention it.

And it sounds like you'd do nothing unless something turns out to be
100% problematic. That's definitely the *easiest* way to maintain
something, because it's the same as not checking its correctness at all.

Even if it's hard to do, repeatedly checking whether the code is really
safe is what surely should be done. Again, I understand it would be
freaking hard. But that doesn't mean we should avoid it.

Here, there seem to be two points you'd like to make:

1. A fundamental question: does Dept track waits and events correctly?
2. Even if so, can Dept consider all the subtle things in the kernel?

For 1, I'm willing to respond to whatever it is. And not only me - we
can make it work perfectly if the concept and direction are *right*.
For 2, I need to ask questions and try my best to fix such issues if
they exist.

Again: Dept won't be part of an *automated testing system* until it
finally matches Lockdep in terms of false positives. Hopefully you are
okay with that.

---
Byungchul