LinuxLists.cc - Re: [PATCH v13 3/3] binder: add transaction latency tracer

2020-11-11 15:16:19

Subject: Re: [PATCH v13 3/3] binder: add transaction latency tracer

On Wed, Nov 11, 2020 at 11:03:06PM +0800, Frankie Chang wrote:
> On Wed, 2020-11-11 at 08:34 +0100, Greg Kroah-Hartman wrote:
> > > - The reason why printing the related information to
> > > kernel information log but not trace buffer is that
> > > some abnormal transactions may be pending for a long
> > > time ago, they could not be recorded due to buffer
> > > limited.
> >
> > Don't abuse the kernel information log for stuff that is just normal
> > operations. What is wrong with using the trace buffers here? That's
> > what they are designed for from what I can tell.
> >
> As mentioned before, time limitation of recording is the reason why we
> don't just use trace here.

What limitation?

> In some long time stability test, such as MTBF,

What is "MTBF"?

> the exception is caused by a series of transactions interaction.
> Some abnormal transactions may be pending for a long time ago, they
> could not be recorded due to buffer limited.

How long of a time is this? If they are pending, only when the timeout
happens is the trace logged, right?

Again, please do not abuse the kernel log for this, that is not what it
is for.

> > > +config BINDER_TRANSACTION_LATENCY_TRACKING
> > > + tristate "Android Binder transaction tracking"
> > > + help
> > > + Used for track abnormal binder transaction which is over threshold,
> > > + when the transaction is done or be free, this transaction would be
> > > + checked whether it executed overtime.
> > > + If yes, printing out the detailed info.
> >
> > Why is this a separate module? Who will ever want this split out?
> >
> The reason we split out a separate module is that we adopted the
> previously discussed recommendations in PATCH v1.
>
> This way all of this tracing code is in-kernel but outside of binder.c.

Putting it in a single file is fine, but what does this benifit doing it
in a separate file? Doesn't it waste more codespace this way?

> > > +/*
> > > + * The reason setting the binder_txn_latency_threshold to 2 sec
> > > + * is that most of timeout abort is greater or equal to 2 sec.
> > > + * Making it configurable to let all users determine which
> > > + * threshold is more suitable.
> > > + */
> > > +static uint32_t binder_txn_latency_threshold = 2;
> > > +module_param_named(threshold, binder_txn_latency_threshold,
> > > + uint, 0644);
> >
> > Again, this isn't the 1990's, please do not add module parameters if at
> > all possible.
> >
>
> Is any recommended method here?
> Because we refer to the method in binder.c, we don't know if this method
> is not suitable.

Look at the individual binder instances. That is what trace should be
on/off for, not for all binder instances in the system at the same time.

thanks,

greg k-h

2020-11-11 16:02:03

by Frankie Chang

[permalink] [raw]

Subject: Re: [PATCH v13 3/3] binder: add transaction latency tracer

On Wed, 2020-11-11 at 16:12 +0100, Greg Kroah-Hartman wrote:
> On Wed, Nov 11, 2020 at 11:03:06PM +0800, Frankie Chang wrote:
> > On Wed, 2020-11-11 at 08:34 +0100, Greg Kroah-Hartman wrote:
> > > > - The reason why printing the related information to
> > > > kernel information log but not trace buffer is that
> > > > some abnormal transactions may be pending for a long
> > > > time ago, they could not be recorded due to buffer
> > > > limited.
> > >
> > > Don't abuse the kernel information log for stuff that is just normal
> > > operations. What is wrong with using the trace buffers here? That's
> > > what they are designed for from what I can tell.
> > >
> > As mentioned before, time limitation of recording is the reason why we
> > don't just use trace here.
>
> What limitation?
>
> > In some long time stability test, such as MTBF,
>
> What is "MTBF"?
>
Mean time between failures (MTBF) is the predicted elapsed time between
inherent failures of a mechanical or electronic system, during normal
system operation.

And we use MTBF script to run long time stress test to
make sure our product stability is no problem.

> > the exception is caused by a series of transactions interaction.
> > Some abnormal transactions may be pending for a long time ago, they
> > could not be recorded due to buffer limited.
>
> How long of a time is this? If they are pending, only when the timeout
> happens is the trace logged, right?
>
> Again, please do not abuse the kernel log for this, that is not what it
> is for.
>
Hmm..Do you mean that make timeout log print to trace buffer but not
kernel log?

The reason We don't do this is that we need to enable these trace events
and enable trace everytimes before testing. But our testing script could
lead to device reboot, and then it continue testing after reboot. The
reboot would make these trace events disable, and we cannot get the
timeout log which happen after reboot.

> > > > +config BINDER_TRANSACTION_LATENCY_TRACKING
> > > > + tristate "Android Binder transaction tracking"
> > > > + help
> > > > + Used for track abnormal binder transaction which is over threshold,
> > > > + when the transaction is done or be free, this transaction would be
> > > > + checked whether it executed overtime.
> > > > + If yes, printing out the detailed info.
> > >
> > > Why is this a separate module? Who will ever want this split out?
> > >
> > The reason we split out a separate module is that we adopted the
> > previously discussed recommendations in PATCH v1.
> >
> > This way all of this tracing code is in-kernel but outside of binder.c.
>
> Putting it in a single file is fine, but what does this benifit doing it
> in a separate file? Doesn't it waste more codespace this way?
>
Yeah, but I think separate file may be more manageable.

> > > > +/*
> > > > + * The reason setting the binder_txn_latency_threshold to 2 sec
> > > > + * is that most of timeout abort is greater or equal to 2 sec.
> > > > + * Making it configurable to let all users determine which
> > > > + * threshold is more suitable.
> > > > + */
> > > > +static uint32_t binder_txn_latency_threshold = 2;
> > > > +module_param_named(threshold, binder_txn_latency_threshold,
> > > > + uint, 0644);
> > >
> > > Again, this isn't the 1990's, please do not add module parameters if at
> > > all possible.
> > >
> >
> > Is any recommended method here?
> > Because we refer to the method in binder.c, we don't know if this method
> > is not suitable.
>
> Look at the individual binder instances. That is what trace should be
> on/off for, not for all binder instances in the system at the same time.
>
But our testing script is not for specific binder instances, it includes
several testing and tests for several processes.

So the reason we trace for all binder instances is that we cannot
predict which process would happen timeout transaction.

Many thanks
Frankie Chang

2020-11-13 15:48:30

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [PATCH v13 3/3] binder: add transaction latency tracer

On Wed, Nov 11, 2020 at 11:59:05PM +0800, Frankie Chang wrote:
> On Wed, 2020-11-11 at 16:12 +0100, Greg Kroah-Hartman wrote:
> > On Wed, Nov 11, 2020 at 11:03:06PM +0800, Frankie Chang wrote:
> > > On Wed, 2020-11-11 at 08:34 +0100, Greg Kroah-Hartman wrote:
> > > > > - The reason why printing the related information to
> > > > > kernel information log but not trace buffer is that
> > > > > some abnormal transactions may be pending for a long
> > > > > time ago, they could not be recorded due to buffer
> > > > > limited.
> > > >
> > > > Don't abuse the kernel information log for stuff that is just normal
> > > > operations. What is wrong with using the trace buffers here? That's
> > > > what they are designed for from what I can tell.
> > > >
> > > As mentioned before, time limitation of recording is the reason why we
> > > don't just use trace here.
> >
> > What limitation?
> >
> > > In some long time stability test, such as MTBF,
> >
> > What is "MTBF"?
> >
> Mean time between failures (MTBF) is the predicted elapsed time between
> inherent failures of a mechanical or electronic system, during normal
> system operation.
>
> And we use MTBF script to run long time stress test to
> make sure our product stability is no problem.

Ok, great.

> > > the exception is caused by a series of transactions interaction.
> > > Some abnormal transactions may be pending for a long time ago, they
> > > could not be recorded due to buffer limited.
> >
> > How long of a time is this? If they are pending, only when the timeout
> > happens is the trace logged, right?
> >
> > Again, please do not abuse the kernel log for this, that is not what it
> > is for.
> >
> Hmm..Do you mean that make timeout log print to trace buffer but not
> kernel log?

Yes.

> The reason We don't do this is that we need to enable these trace events
> and enable trace everytimes before testing. But our testing script could
> lead to device reboot, and then it continue testing after reboot. The
> reboot would make these trace events disable, and we cannot get the
> timeout log which happen after reboot.

When you reboot you enable tracing, that's not an issue, and then you
will see the messages. Nothing is going to work across reboots.

> > > > > +config BINDER_TRANSACTION_LATENCY_TRACKING
> > > > > + tristate "Android Binder transaction tracking"
> > > > > + help
> > > > > + Used for track abnormal binder transaction which is over threshold,
> > > > > + when the transaction is done or be free, this transaction would be
> > > > > + checked whether it executed overtime.
> > > > > + If yes, printing out the detailed info.
> > > >
> > > > Why is this a separate module? Who will ever want this split out?
> > > >
> > > The reason we split out a separate module is that we adopted the
> > > previously discussed recommendations in PATCH v1.
> > >
> > > This way all of this tracing code is in-kernel but outside of binder.c.
> >
> > Putting it in a single file is fine, but what does this benifit doing it
> > in a separate file? Doesn't it waste more codespace this way?
> >
> Yeah, but I think separate file may be more manageable.

Sorry, I mean, "why put this in a separate kernel module", not file.
File is fine.

> > > > > +/*
> > > > > + * The reason setting the binder_txn_latency_threshold to 2 sec
> > > > > + * is that most of timeout abort is greater or equal to 2 sec.
> > > > > + * Making it configurable to let all users determine which
> > > > > + * threshold is more suitable.
> > > > > + */
> > > > > +static uint32_t binder_txn_latency_threshold = 2;
> > > > > +module_param_named(threshold, binder_txn_latency_threshold,
> > > > > + uint, 0644);
> > > >
> > > > Again, this isn't the 1990's, please do not add module parameters if at
> > > > all possible.
> > > >
> > >
> > > Is any recommended method here?
> > > Because we refer to the method in binder.c, we don't know if this method
> > > is not suitable.
> >
> > Look at the individual binder instances. That is what trace should be
> > on/off for, not for all binder instances in the system at the same time.
> >
> But our testing script is not for specific binder instances, it includes
> several testing and tests for several processes.

Then turn it on for all binder instances.

> So the reason we trace for all binder instances is that we cannot
> predict which process would happen timeout transaction.

That's fine, but provide the ability to do this on a per-instance, as
that is what binder now supports. To make it "global" is a big
regression from the way it was changed recently to support different
instances.

thanks,

greg k-h