2012-05-09 18:31:54

by Stephen Warren

[permalink] [raw]
Subject: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

For me, next-20120508 prints nothing when booted, and I think also
hangs. To solve this, I reverted:

7ff9554 printk: convert byte-buffer to variable-length record buffer

In order to build, I also had to revert:

c4e00da driver-core: extend dev_printk() to pass structured data

Note: I'm running on an ARM system using a serial console, with
earlyprintk enabled.


2012-05-10 19:54:55

by Stephen Warren

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On 05/09/2012 12:31 PM, Stephen Warren wrote:
> For me, next-20120508 prints nothing when booted, and I think also
> hangs. To solve this, I reverted:
>
> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>
> In order to build, I also had to revert:
>
> c4e00da driver-core: extend dev_printk() to pass structured data
>
> Note: I'm running on an ARM system using a serial console, with
> earlyprintk enabled.

This issue still occurs in next-20120510.

I've tracked it down to the assignment of msg->ts_nsec near the end of
log_store(). If I comment this out, everything works. The problem is the
assignment, not the call to local_clock():

fails:
msg->ts_nsec = local_clock();
fails:
msg->ts_nsec = 0;//local_clock();
works:
//msg->ts_nsec = local_clock();

I guess that msg isn't pointing at what it's supposed to. I'll keep
trying to track this down, but if you could think about why this might
happen (or even revert the changes until this is isolated), it'd be great.

2012-05-10 20:07:06

by Kay Sievers

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>> For me, next-20120508 prints nothing when booted, and I think also
>> hangs. To solve this, I reverted:
>>
>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>
>> In order to build, I also had to revert:
>>
>> c4e00da driver-core: extend dev_printk() to pass structured data
>>
>> Note: I'm running on an ARM system using a serial console, with
>> earlyprintk enabled.
>
> This issue still occurs in next-20120510.
>
> I've tracked it down to the assignment of msg->ts_nsec near the end of
> log_store(). If I comment this out, everything works. The problem is the
> assignment, not the call to local_clock():
>
> fails:
>        msg->ts_nsec = local_clock();
> fails:
>        msg->ts_nsec = 0;//local_clock();
> works:
>        //msg->ts_nsec = local_clock();

Weird.

What happens if you change it to:
cpu_clock(logbuf_cpu);
?

If it works, the timestamps look ok?

Kay

2012-05-10 20:09:46

by Stephen Warren

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On 05/10/2012 02:06 PM, Kay Sievers wrote:
> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>> For me, next-20120508 prints nothing when booted, and I think also
>>> hangs. To solve this, I reverted:
>>>
>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>
>>> In order to build, I also had to revert:
>>>
>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>
>>> Note: I'm running on an ARM system using a serial console, with
>>> earlyprintk enabled.
>>
>> This issue still occurs in next-20120510.
>>
>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>> log_store(). If I comment this out, everything works. The problem is the
>> assignment, not the call to local_clock():
>>
>> fails:
>> msg->ts_nsec = local_clock();
>> fails:
>> msg->ts_nsec = 0;//local_clock();
>> works:
>> //msg->ts_nsec = local_clock();
>
> Weird.
>
> What happens if you change it to:
> cpu_clock(logbuf_cpu);
> ?
>
> If it works, the timestamps look ok?

I doubt that would work - after all, assigning 0 fails, but not
performing the assignment at all works. But, I'll go try it...

2012-05-10 20:13:09

by Stephen Warren

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On 05/10/2012 02:09 PM, Stephen Warren wrote:
> On 05/10/2012 02:06 PM, Kay Sievers wrote:
>> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>>> For me, next-20120508 prints nothing when booted, and I think also
>>>> hangs. To solve this, I reverted:
>>>>
>>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>>
>>>> In order to build, I also had to revert:
>>>>
>>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>>
>>>> Note: I'm running on an ARM system using a serial console, with
>>>> earlyprintk enabled.
>>>
>>> This issue still occurs in next-20120510.
>>>
>>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>>> log_store(). If I comment this out, everything works. The problem is the
>>> assignment, not the call to local_clock():
>>>
>>> fails:
>>> msg->ts_nsec = local_clock();
>>> fails:
>>> msg->ts_nsec = 0;//local_clock();
>>> works:
>>> //msg->ts_nsec = local_clock();
>>
>> Weird.
>>
>> What happens if you change it to:
>> cpu_clock(logbuf_cpu);
>> ?
>>
>> If it works, the timestamps look ok?
>
> I doubt that would work - after all, assigning 0 fails, but not
> performing the assignment at all works. But, I'll go try it...

Calling cpu_clock() instead of local_clock() fails in the same way.

2012-05-10 20:15:40

by Kay Sievers

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On Thu, May 10, 2012 at 10:13 PM, Stephen Warren <[email protected]> wrote:
> On 05/10/2012 02:09 PM, Stephen Warren wrote:
>> On 05/10/2012 02:06 PM, Kay Sievers wrote:
>>> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>>>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>>>> For me, next-20120508 prints nothing when booted, and I think also
>>>>> hangs. To solve this, I reverted:
>>>>>
>>>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>>>
>>>>> In order to build, I also had to revert:
>>>>>
>>>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>>>
>>>>> Note: I'm running on an ARM system using a serial console, with
>>>>> earlyprintk enabled.
>>>>
>>>> This issue still occurs in next-20120510.
>>>>
>>>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>>>> log_store(). If I comment this out, everything works. The problem is the
>>>> assignment, not the call to local_clock():
>>>>
>>>> fails:
>>>>        msg->ts_nsec = local_clock();
>>>> fails:
>>>>        msg->ts_nsec = 0;//local_clock();
>>>> works:
>>>>        //msg->ts_nsec = local_clock();
>>>
>>> Weird.
>>>
>>> What happens if you change it to:
>>>   cpu_clock(logbuf_cpu);
>>> ?
>>>
>>> If it works, the timestamps look ok?
>>
>> I doubt that would work - after all, assigning 0 fails, but not
>> performing the assignment at all works. But, I'll go try it...
>
> Calling cpu_clock() instead of local_clock() fails in the same way.

Ok, didn't really see the assign to 0 you tried, sorry. :)

And 'dmesg' works when you run the box with the line commented out?

And 'cat /dev/kmsg'?

Kay

2012-05-10 20:20:25

by Stephen Warren

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On 05/10/2012 02:15 PM, Kay Sievers wrote:
> On Thu, May 10, 2012 at 10:13 PM, Stephen Warren <[email protected]> wrote:
>> On 05/10/2012 02:09 PM, Stephen Warren wrote:
>>> On 05/10/2012 02:06 PM, Kay Sievers wrote:
>>>> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>>>>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>>>>> For me, next-20120508 prints nothing when booted, and I think also
>>>>>> hangs. To solve this, I reverted:
>>>>>>
>>>>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>>>>
>>>>>> In order to build, I also had to revert:
>>>>>>
>>>>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>>>>
>>>>>> Note: I'm running on an ARM system using a serial console, with
>>>>>> earlyprintk enabled.
>>>>>
>>>>> This issue still occurs in next-20120510.
>>>>>
>>>>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>>>>> log_store(). If I comment this out, everything works. The problem is the
>>>>> assignment, not the call to local_clock():
>>>>>
>>>>> fails:
>>>>> msg->ts_nsec = local_clock();
>>>>> fails:
>>>>> msg->ts_nsec = 0;//local_clock();
>>>>> works:
>>>>> //msg->ts_nsec = local_clock();
>>>>
>>>> Weird.
>>>>
>>>> What happens if you change it to:
>>>> cpu_clock(logbuf_cpu);
>>>> ?
>>>>
>>>> If it works, the timestamps look ok?
>>>
>>> I doubt that would work - after all, assigning 0 fails, but not
>>> performing the assignment at all works. But, I'll go try it...
>>
>> Calling cpu_clock() instead of local_clock() fails in the same way.
>
> Ok, didn't really see the assign to 0 you tried, sorry. :)
>
> And 'dmesg' works when you run the box with the line commented out?
>
> And 'cat /dev/kmsg'?

Yes, both work and produce reasonable output. "cat /dev/kmsg" does hang
at the end of the log until I CTRL-C it - is that expected?

2012-05-10 20:26:44

by Kay Sievers

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On Thu, May 10, 2012 at 10:20 PM, Stephen Warren <[email protected]> wrote:
> On 05/10/2012 02:15 PM, Kay Sievers wrote:
>> On Thu, May 10, 2012 at 10:13 PM, Stephen Warren <[email protected]> wrote:
>>> On 05/10/2012 02:09 PM, Stephen Warren wrote:
>>>> On 05/10/2012 02:06 PM, Kay Sievers wrote:
>>>>> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>>>>>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>>>>>> For me, next-20120508 prints nothing when booted, and I think also
>>>>>>> hangs. To solve this, I reverted:
>>>>>>>
>>>>>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>>>>>
>>>>>>> In order to build, I also had to revert:
>>>>>>>
>>>>>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>>>>>
>>>>>>> Note: I'm running on an ARM system using a serial console, with
>>>>>>> earlyprintk enabled.
>>>>>>
>>>>>> This issue still occurs in next-20120510.
>>>>>>
>>>>>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>>>>>> log_store(). If I comment this out, everything works. The problem is the
>>>>>> assignment, not the call to local_clock():
>>>>>>
>>>>>> fails:
>>>>>>        msg->ts_nsec = local_clock();
>>>>>> fails:
>>>>>>        msg->ts_nsec = 0;//local_clock();
>>>>>> works:
>>>>>>        //msg->ts_nsec = local_clock();
>>>>>
>>>>> Weird.
>>>>>
>>>>> What happens if you change it to:
>>>>>   cpu_clock(logbuf_cpu);
>>>>> ?
>>>>>
>>>>> If it works, the timestamps look ok?
>>>>
>>>> I doubt that would work - after all, assigning 0 fails, but not
>>>> performing the assignment at all works. But, I'll go try it...
>>>
>>> Calling cpu_clock() instead of local_clock() fails in the same way.
>>
>> Ok, didn't really see the assign to 0 you tried, sorry. :)
>>
>> And 'dmesg' works when you run the box with the line commented out?
>>
>> And 'cat /dev/kmsg'?
>
> Yes, both work and produce reasonable output.

Really weird. I have zero ideas at what this could be. The messages in
/dev/kmsg have seqnums, you see the 0 as the first?

5,0,0;Linux version 3.4.0-rc6+ ...
6,1,0;Command line: root=/dev/sda1 ...

> "cat /dev/kmsg" does hang
> at the end of the log until I CTRL-C it - is that expected?

Yeah, 'cat' is a blocking read(), tools would use poll().

Kay

2012-05-10 21:58:59

by Stephen Warren

[permalink] [raw]
Subject: Re: Regression due to 7ff9554 "printk: convert byte-buffer to variable-length record buffer"

On 05/10/2012 02:26 PM, Kay Sievers wrote:
> On Thu, May 10, 2012 at 10:20 PM, Stephen Warren <[email protected]> wrote:
>> On 05/10/2012 02:15 PM, Kay Sievers wrote:
>>> On Thu, May 10, 2012 at 10:13 PM, Stephen Warren <[email protected]> wrote:
>>>> On 05/10/2012 02:09 PM, Stephen Warren wrote:
>>>>> On 05/10/2012 02:06 PM, Kay Sievers wrote:
>>>>>> On Thu, May 10, 2012 at 9:54 PM, Stephen Warren <[email protected]> wrote:
>>>>>>> On 05/09/2012 12:31 PM, Stephen Warren wrote:
>>>>>>>> For me, next-20120508 prints nothing when booted, and I think also
>>>>>>>> hangs. To solve this, I reverted:
>>>>>>>>
>>>>>>>> 7ff9554 printk: convert byte-buffer to variable-length record buffer
>>>>>>>>
>>>>>>>> In order to build, I also had to revert:
>>>>>>>>
>>>>>>>> c4e00da driver-core: extend dev_printk() to pass structured data
>>>>>>>>
>>>>>>>> Note: I'm running on an ARM system using a serial console, with
>>>>>>>> earlyprintk enabled.
>>>>>>>
>>>>>>> This issue still occurs in next-20120510.
>>>>>>>
>>>>>>> I've tracked it down to the assignment of msg->ts_nsec near the end of
>>>>>>> log_store(). If I comment this out, everything works. The problem is the
>>>>>>> assignment, not the call to local_clock():
>>>>>>>
>>>>>>> fails:
>>>>>>> msg->ts_nsec = local_clock();
>>>>>>> fails:
>>>>>>> msg->ts_nsec = 0;//local_clock();
>>>>>>> works:
>>>>>>> //msg->ts_nsec = local_clock();
>>>>>>
>>>>>> Weird.
>>>>>>
>>>>>> What happens if you change it to:
>>>>>> cpu_clock(logbuf_cpu);
>>>>>> ?
>>>>>>
>>>>>> If it works, the timestamps look ok?
>>>>>
>>>>> I doubt that would work - after all, assigning 0 fails, but not
>>>>> performing the assignment at all works. But, I'll go try it...
>>>>
>>>> Calling cpu_clock() instead of local_clock() fails in the same way.
>>>
>>> Ok, didn't really see the assign to 0 you tried, sorry. :)
>>>
>>> And 'dmesg' works when you run the box with the line commented out?
>>>
>>> And 'cat /dev/kmsg'?
>>
>> Yes, both work and produce reasonable output.
>
> Really weird. I have zero ideas at what this could be. The messages in
> /dev/kmsg have seqnums, you see the 0 as the first?
>
> 5,0,0;Linux version 3.4.0-rc6+ ...
> 6,1,0;Command line: root=/dev/sda1 ...
>
>> "cat /dev/kmsg" does hang
>> at the end of the log until I CTRL-C it - is that expected?
>
> Yeah, 'cat' is a blocking read(), tools would use poll().

Found the bug. __log_buf isn't correctly aligned, so the very first
message gets the struct log written to an unaligned address. Subsequent
entries would be correctly aligned in log_store(), except the kernel has
already crashed by then.

The ts_nsec field is a 64-bit write, which triggers the alignment
requirement. I'm not quite sure why the previous 16-bit writes for the
various len fields didn't experience an issue, since __log_buf ended up
at an odd address for me. Perhaps ARM doesn't have an alignment
requirement for 16-bit writes, but does for 32-bit and greater.

I'll post a patch in a minute.