I get this same oops occassionally - the machine freezes and there doesn't
seem to be any record of the oops on disk.
I captured it on camera -
https://lh3.googleusercontent.com/-K0lNbJrZBMQ/UVOU1vv1vvI/AAAAAAAANqI/pY92mWm3caE/s800/20130327_205245.jpg
If I am reading this right, it dies on this instruction -
0xffffffff8145792d <+349>: divq 0x18(%rcx)
>From the lst file that *seems* to be this inline function -
static inline void intel_pstate_calc_busy(struct cpudata *cpu,
struct sample *sample)
{
u64 core_pct;
sample->pstate_pct_busy = 100 - div64_u64(
ffffffff8145791d: 48 8b 41 20 mov 0x20(%rcx),%rax
ffffffff81457921: 48 8d 04 80 lea (%rax,%rax,4),%rax
ffffffff81457925: 48 8d 04 80 lea (%rax,%rax,4),%rax
ffffffff81457929: 48 c1 e0 02 shl $0x2,%rax
ffffffff8145792d: 48 f7 71 18 divq 0x18(%rcx)
That is -
sample->pstate_pct_busy = 100 - div64_u64(
sample->idletime_us * 100,
sample->duration_us);
So looks like sample->duration_us is 0? If so, that implies that
ktime_us_delta(now, cpu->prev_sample) is zero. I am not entirely sure how
to handle this case - return if sampling too early, or if there is some
other bug making the delta calculation go poof.
Thanks,
Parag
Is there any way to capture the beginning of this trace?
pid_param_set() is on the stack which means that something is changing
the debugfs parameters or the stack is FUBAR.
On 03/27/2013 06:49 PM, Parag Warudkar wrote:
> I get this same oops occassionally - the machine freezes and there doesn't
> seem to be any record of the oops on disk.
>
>
> That is -
> sample->pstate_pct_busy = 100 - div64_u64(
> sample->idletime_us * 100,
> sample->duration_us);
>
I don't see how duration_us can be zero unless somehow I am getting back-to-back
timer callbacks which seems unlikely since the timer is not re-armed until
the timer function is about to return and the driver has done all its work
for the sample period
--Dirk
> So looks like sample->duration_us is 0? If so, that implies that
> ktime_us_delta(now, cpu->prev_sample) is zero. I am not entirely sure how
> to handle this case - return if sampling too early, or if there is some
> other bug making the delta calculation go poof.
>
>
> Thanks,
>
> Parag
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On Wed, Mar 27, 2013 at 10:51 PM, Dirk Brandewie
<[email protected]> wrote:
>
> Is there any way to capture the beginning of this trace?
I tried but since the oops scrolls fast followed by a hard freeze, I
wasn't able to capture it completely.
May be I can try netconsole and see if that helps.
>
> pid_param_set() is on the stack which means that something is changing
> the debugfs parameters or the stack is FUBAR.
>
I somehow doubt the stack is messed up as the call traces are always identical.
(pid_param_set() seems to be in first trace as well.)
>
> I don't see how duration_us can be zero unless somehow I am getting
> back-to-back
> timer callbacks which seems unlikely since the timer is not re-armed until
> the timer function is about to return and the driver has done all its work
> for the sample period
Do the two oops with common call stack suggest back to back callbacks?
I will add some debugging checks tomorrow to see what is going on. But
sounds like a minimal fix would be to guard against callbacks in quick
succession?
i.e. return from sample if ktime_us_delta(now, cpu->prev_sample) is zero?
Thanks,
Parag
On 03/27/2013 08:13 PM, Parag Warudkar wrote:
> On Wed, Mar 27, 2013 at 10:51 PM, Dirk Brandewie
> <[email protected]> wrote:
>>
>> Is there any way to capture the beginning of this trace?
>
> I tried but since the oops scrolls fast followed by a hard freeze, I
> wasn't able to capture it completely.
> May be I can try netconsole and see if that helps.
>
>>
>> pid_param_set() is on the stack which means that something is changing
>> the debugfs parameters or the stack is FUBAR.
>>
> I somehow doubt the stack is messed up as the call traces are always identical.
> (pid_param_set() seems to be in first trace as well.)
>
I agree that the two oops are likely the same but unless something is crawling
through debugfs writing random values to the files there pid_param_set()
should not be on any stack anywhere.
There was a similar bug reported by fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=920289
This bug has not showed up again since rc3 can you try the current rc to see if
you still see the problem?
>>
>> I don't see how duration_us can be zero unless somehow I am getting
>> back-to-back
>> timer callbacks which seems unlikely since the timer is not re-armed until
>> the timer function is about to return and the driver has done all its work
>> for the sample period
>
> Do the two oops with common call stack suggest back to back callbacks?
>
> I will add some debugging checks tomorrow to see what is going on. But
> sounds like a minimal fix would be to guard against callbacks in quick
> succession?
> i.e. return from sample if ktime_us_delta(now, cpu->prev_sample) is zero?
>
> Thanks,
> Parag
>
On Thu, Mar 28, 2013 at 11:35 AM, Dirk Brandewie
<[email protected]> wrote:
>>> pid_param_set() is on the stack which means that something is changing
>>> the debugfs parameters or the stack is FUBAR.
>>>
>> I somehow doubt the stack is messed up as the call traces are always
>> identical.
>> (pid_param_set() seems to be in first trace as well.)
>>
>
> I agree that the two oops are likely the same but unless something is
> crawling
> through debugfs writing random values to the files there pid_param_set()
> should not be on any stack anywhere.
Ok.
>
> There was a similar bug reported by fedora:
> https://bugzilla.redhat.com/show_bug.cgi?id=920289
>
> This bug has not showed up again since rc3 can you try the current rc to see
> if
> you still see the problem?
>
I updated the Fedora BZ - the oops I got is from latest git as of
yesterday. So I think the bug is still there post -rc3.
Parag