2018-11-09 21:37:13

by Qian Cai

Subject: ODEBUG: Out of memory. ODEBUG disabled

It is a bit annoying that booting the latest mainline (3541833fd1f2)
on this aarch64 server with 64 CPUs always causes object debugging
to run out of memory.

I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
to make it work. Is it expected that object debugging is not going
to work with large machines?


2018-11-09 21:43:36

by Yang Shi

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled



On 11/9/18 1:36 PM, Qian Cai wrote:
> It is a bit annoying on this aarch64 server with 64 CPUs that is
> booting the latest mainline (3541833fd1f2) causes object debugging
> always running out of memory.

Could you please paste the detailed failure log?

>
> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
> to make it work. Is it expected that object debugging is not going
> to work with large machines?

I don't think so. I suppose it works well with large CPU counts on x86.

Thanks,
Yang



2018-11-09 21:52:54

by Qian Cai

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled



> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>
>
>
> On 11/9/18 1:36 PM, Qian Cai wrote:
>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>> booting the latest mainline (3541833fd1f2) causes object debugging
>> always running out of memory.
>
> May you please paste the detail failure log?
I assume you mean dmesg.

Here is the dmesg for 64 CPUs,
https://paste.ubuntu.com/p/BnhvXXhn7k/
>>
>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>> to make it work. Is it expected that object debugging is not going
>> to work with large machines?
>
> I don't think so. I'm supposed it works well with large CPU number on x86.
Here is the one with the nr_cpus workaround,
https://paste.ubuntu.com/p/qMpd2CCPSV/

2018-11-09 22:04:38

by Yang Shi

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled



On 11/9/18 1:51 PM, Qian Cai wrote:
>
>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>
>>
>>
>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>> always running out of memory.
>> May you please paste the detail failure log?
> I assume you mean dmesg.
>
> Here is the dmesg for 64 CPUs,
> https://paste.ubuntu.com/p/BnhvXXhn7k/

I don't see the OOM message; it looks like it boots up successfully with 64 CPUs.

>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>> to make it work. Is it expected that object debugging is not going
>>> to work with large machines?
>> I don't think so. I'm supposed it works well with large CPU number on x86.
> Here is the one with nr_cpus workaround,
> https://paste.ubuntu.com/p/qMpd2CCPSV/


2018-11-09 22:09:23

by Waiman Long

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled

On 11/09/2018 04:51 PM, Qian Cai wrote:
>
>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>
>>
>>
>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>> always running out of memory.
>> May you please paste the detail failure log?
> I assume you mean dmesg.
>
> Here is the dmesg for 64 CPUs,
> https://paste.ubuntu.com/p/BnhvXXhn7k/
>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>> to make it work. Is it expected that object debugging is not going
>>> to work with large machines?
>> I don't think so. I'm supposed it works well with large CPU number on x86.
> Here is the one with nr_cpus workaround,
> https://paste.ubuntu.com/p/qMpd2CCPSV/

The debugobjects code has a set of 1024 statically allocated debug
objects that can be used in early boot before the slab memory allocator
is initialized. Apparently, the system may have used up all the
statically allocated objects. Try doubling ODEBUG_POOL_SIZE to see if it
helps.
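
To illustrate the mechanism described above, here is a minimal userspace
sketch of a fixed, statically allocated pool that fails once it is used up.
It is not the kernel's debugobjects code: POOL_SIZE stands in for
ODEBUG_POOL_SIZE, and struct dbg_obj, static_pool, pool_alloc() and
count_failures() are hypothetical names chosen for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for ODEBUG_POOL_SIZE; 1024 is the value quoted above. */
#define POOL_SIZE 1024

struct dbg_obj {
    void *addr; /* address of the object being tracked */
};

/* Statically allocated, so it is usable before any dynamic allocator exists. */
static struct dbg_obj static_pool[POOL_SIZE];
static size_t pool_used;

/*
 * Hand out the next unused slot, or NULL once the static pool is exhausted.
 * In this sketch nothing is ever returned to the pool or refilled.
 */
static struct dbg_obj *pool_alloc(void *addr)
{
    if (pool_used >= POOL_SIZE)
        return NULL;
    static_pool[pool_used].addr = addr;
    return &static_pool[pool_used++];
}

/* Count how many of `requests` allocations fail against the fixed pool. */
static size_t count_failures(size_t requests)
{
    static int dummy;
    size_t failed = 0;

    for (size_t i = 0; i < requests; i++)
        if (!pool_alloc(&dummy))
            failed++;
    return failed;
}
```

On a fresh pool, count_failures(1024) returns 0, and every request after
that fails. In this simplified model a larger POOL_SIZE just moves the
point of exhaustion further out.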

There are also quite a number of warnings in your console log. So there
is certainly something wrong with your kernel or config options.

Cheers,
Longman



2018-11-10 01:47:53

by Qian Cai

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled

> Sent: Friday, November 09, 2018 at 5:08 PM
> From: "Waiman Long" <[email protected]>
> To: "Qian Cai" <[email protected]>, "Yang Shi" <[email protected]>
> Cc: "open list" <[email protected]>, "Thomas Gleixner" <[email protected]>, "Arnd Bergmann" <[email protected]>, "Joel Fernandes (Google)" <[email protected]>, "Zhong Jiang" <[email protected]>
> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
>
> On 11/09/2018 04:51 PM, Qian Cai wrote:
>>
>>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>>
>>>
>>>
>>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>>> always running out of memory.
>>> May you please paste the detail failure log?
>> I assume you mean dmesg.
>>
>> Here is the dmesg for 64 CPUs,
>> https://paste.ubuntu.com/p/BnhvXXhn7k/
>>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>>> to make it work. Is it expected that object debugging is not going
>>>> to work with large machines?
>>> I don't think so. I'm supposed it works well with large CPU number on x86.
>> Here is the one with nr_cpus workaround,
>> https://paste.ubuntu.com/p/qMpd2CCPSV/
>
> The debugobjects code have a set of 1024 statically allocated debug
> objects that can be used in early boot before the slab memory allocator
> is initialized. Apparently, the system may have used up all the
> statically allocated objects. Try double ODEBUG_POOL_SIZE to see if it
> helps.
Great, you are right. Doubling the size makes it work. Does it make sense
to have a kconfig option instead?
>
> There are also quite a number of warnings in your console log. So there
> is certainly something wrong with your kernel or config options.
Yes, I am working on all those warnings. This one is found by ODEBUG,
https://lkml.org/lkml/2018/11/10/136

2018-11-10 14:00:08

by Waiman Long

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled

On 11/09/2018 08:45 PM, Qian Cai wrote:
>> Sent: Friday, November 09, 2018 at 5:08 PM
>> From: "Waiman Long" <[email protected]>
>> To: "Qian Cai" <[email protected]>, "Yang Shi" <[email protected]>
>> Cc: "open list" <[email protected]>, "Thomas Gleixner" <[email protected]>, "Arnd Bergmann" <[email protected]>, "Joel Fernandes (Google)" <[email protected]>, "Zhong Jiang" <[email protected]>
>> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
>>
>> On 11/09/2018 04:51 PM, Qian Cai wrote:
>>>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>>>> always running out of memory.
>>>> May you please paste the detail failure log?
>>> I assume you mean dmesg.
>>>
>>> Here is the dmesg for 64 CPUs,
>>> https://paste.ubuntu.com/p/BnhvXXhn7k/
>>>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>>>> to make it work. Is it expected that object debugging is not going
>>>>> to work with large machines?
>>>> I don't think so. I'm supposed it works well with large CPU number on x86.
>>> Here is the one with nr_cpus workaround,
>>> https://paste.ubuntu.com/p/qMpd2CCPSV/
>> The debugobjects code have a set of 1024 statically allocated debug
>> objects that can be used in early boot before the slab memory allocator
>> is initialized. Apparently, the system may have used up all the
>> statically allocated objects. Try double ODEBUG_POOL_SIZE to see if it
>> helps.
> Great, you are right. Doubling the size makes it work. Does it make sense
> to have a kconfig option instead?

First, I think you need to figure out why your system uses up so many
debug objects in early boot. If there is a legitimate reason for this
behavior, we can talk about having a kconfig option to increase that.

>> There are also quite a number of warnings in your console log. So there
>> is certainly something wrong with your kernel or config options.
> Yes, I am working on all those warnings. This one is found by ODEBUG,
> https://lkml.org/lkml/2018/11/10/136

Cheers,
Longman


2018-11-10 14:17:23

by Qian Cai

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled

On 11/10/18 at 8:59 AM, Waiman Long wrote:

> On 11/09/2018 08:45 PM, Qian Cai wrote:
> >> Sent: Friday, November 09, 2018 at 5:08 PM
> >> From: "Waiman Long" <[email protected]>
> >> To: "Qian Cai" <[email protected]>, "Yang Shi" <[email protected]>
> >> Cc: "open list" <[email protected]>, "Thomas Gleixner" <[email protected]>, "Arnd Bergmann" <[email protected]>, "Joel Fernandes (Google)" <[email protected]>, "Zhong Jiang" <[email protected]>
> >> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
> >>
> >> On 11/09/2018 04:51 PM, Qian Cai wrote:
> >>>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 11/9/18 1:36 PM, Qian Cai wrote:
> >>>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
> >>>>> booting the latest mainline (3541833fd1f2) causes object debugging
> >>>>> always running out of memory.
> >>>> May you please paste the detail failure log?
> >>> I assume you mean dmesg.
> >>>
> >>> Here is the dmesg for 64 CPUs,
> >>> https://paste.ubuntu.com/p/BnhvXXhn7k/
> >>>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
> >>>>> to make it work. Is it expected that object debugging is not going
> >>>>> to work with large machines?
> >>>> I don't think so. I'm supposed it works well with large CPU number on x86.
> >>> Here is the one with nr_cpus workaround,
> >>> https://paste.ubuntu.com/p/qMpd2CCPSV/
> >> The debugobjects code have a set of 1024 statically allocated debug
> >> objects that can be used in early boot before the slab memory allocator
> >> is initialized. Apparently, the system may have used up all the
> >> statically allocated objects. Try double ODEBUG_POOL_SIZE to see if it
> >> helps.
> > Great, you are right. Doubling the size makes it work. Does it make sense
> > to have a kconfig option instead?
>
> First, I think you need to figure out what your system needed to use up
> so many debug objects in early boot. If there is a legitimate reason for
> this behavior, we can talk about having a kconfig option to increase that.
Anybody else not getting ODEBUG OOM with more than 64 CPUs? As
mentioned, restricting to 16 CPUs works fine. How can I figure out why the
system uses so many debug objects?

2018-11-13 04:34:40

by Qian Cai

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled



> On Nov 10, 2018, at 9:11 AM, Qian Cai <[email protected]> wrote:
>
> On 11/10/18 at 8:59 AM, Waiman Long wrote:
>
>> On 11/09/2018 08:45 PM, Qian Cai wrote:
>>>> Sent: Friday, November 09, 2018 at 5:08 PM
>>>> From: "Waiman Long" <[email protected]>
>>>> To: "Qian Cai" <[email protected]>, "Yang Shi" <[email protected]>
>>>> Cc: "open list" <[email protected]>, "Thomas Gleixner" <[email protected]>, "Arnd Bergmann" <[email protected]>, "Joel Fernandes (Google)" <[email protected]>, "Zhong Jiang" <[email protected]>
>>>> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
>>>>
>>>> On 11/09/2018 04:51 PM, Qian Cai wrote:
>>>>>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>>>>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>>>>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>>>>>> always running out of memory.
>>>>>> May you please paste the detail failure log?
>>>>> I assume you mean dmesg.
>>>>>
>>>>> Here is the dmesg for 64 CPUs,
>>>>> https://paste.ubuntu.com/p/BnhvXXhn7k/
>>>>>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>>>>>> to make it work. Is it expected that object debugging is not going
>>>>>>> to work with large machines?
>>>>>> I don't think so. I'm supposed it works well with large CPU number on x86.
>>>>> Here is the one with nr_cpus workaround,
>>>>> https://paste.ubuntu.com/p/qMpd2CCPSV/
>>>> The debugobjects code have a set of 1024 statically allocated debug
>>>> objects that can be used in early boot before the slab memory allocator
>>>> is initialized. Apparently, the system may have used up all the
>>>> statically allocated objects. Try double ODEBUG_POOL_SIZE to see if it
>>>> helps.
>>> Great, you are right. Doubling the size makes it work. Does it make sense
>>> to have a kconfig option instead?
>>
>> First, I think you need to figure out what your system needed to use up
>> so many debug objects in early boot. If there is a legitimate reason for
>> this behavior, we can talk about having a kconfig option to increase that.
> Anybody else not getting ODEBUG OOM with more than 64-CPU? As
> mentioned, restricting to 16-CPU works fine. How can I figure out why the
> system uses so much debug objects?
On another aarch64 server with 256 CPUs, even doubling
ODEBUG_POOL_SIZE to 2048 still gets "ODEBUG: Out of memory. ODEBUG
disabled".

2018-11-14 00:58:58

by Qian Cai

Subject: Re: ODEBUG: Out of memory. ODEBUG disabled



> On Nov 12, 2018, at 11:33 PM, Qian Cai <[email protected]> wrote:
>
>
>
>> On Nov 10, 2018, at 9:11 AM, Qian Cai <[email protected]> wrote:
>>
>> On 11/10/18 at 8:59 AM, Waiman Long wrote:
>>
>>> On 11/09/2018 08:45 PM, Qian Cai wrote:
>>>>> Sent: Friday, November 09, 2018 at 5:08 PM
>>>>> From: "Waiman Long" <[email protected]>
>>>>> To: "Qian Cai" <[email protected]>, "Yang Shi" <[email protected]>
>>>>> Cc: "open list" <[email protected]>, "Thomas Gleixner" <[email protected]>, "Arnd Bergmann" <[email protected]>, "Joel Fernandes (Google)" <[email protected]>, "Zhong Jiang" <[email protected]>
>>>>> Subject: Re: ODEBUG: Out of memory. ODEBUG disabled
>>>>>
>>>>> On 11/09/2018 04:51 PM, Qian Cai wrote:
>>>>>>> On Nov 9, 2018, at 4:42 PM, Yang Shi <[email protected]> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 11/9/18 1:36 PM, Qian Cai wrote:
>>>>>>>> It is a bit annoying on this aarch64 server with 64 CPUs that is
>>>>>>>> booting the latest mainline (3541833fd1f2) causes object debugging
>>>>>>>> always running out of memory.
>>>>>>> May you please paste the detail failure log?
>>>>>> I assume you mean dmesg.
>>>>>>
>>>>>> Here is the dmesg for 64 CPUs,
>>>>>> https://paste.ubuntu.com/p/BnhvXXhn7k/
>>>>>>>> I have to boot the kernel with only 16 CPUs instead (nr_cpus=16)
>>>>>>>> to make it work. Is it expected that object debugging is not going
>>>>>>>> to work with large machines?
>>>>>>> I don't think so. I'm supposed it works well with large CPU number on x86.
>>>>>> Here is the one with nr_cpus workaround,
>>>>>> https://paste.ubuntu.com/p/qMpd2CCPSV/
>>>>> The debugobjects code have a set of 1024 statically allocated debug
>>>>> objects that can be used in early boot before the slab memory allocator
>>>>> is initialized. Apparently, the system may have used up all the
>>>>> statically allocated objects. Try double ODEBUG_POOL_SIZE to see if it
>>>>> helps.
>>>> Great, you are right. Doubling the size makes it work. Does it make sense
>>>> to have a kconfig option instead?
>>>
>>> First, I think you need to figure out what your system needed to use up
>>> so many debug objects in early boot. If there is a legitimate reason for
>>> this behavior, we can talk about having a kconfig option to increase that.
>> Anybody else not getting ODEBUG OOM with more than 64-CPU? As
>> mentioned, restricting to 16-CPU works fine. How can I figure out why the
>> system uses so much debug objects?
> On another aarch64 server with 256-CPU, even double the size of
> ODEBUG_POOL_SIZE, i.e., 2048 will get "ODEBUG: Out of memory. ODEBUG
> disabled”.

OK, here is the problem.

To get aarch64 to work, the initial ODEBUG_POOL_SIZE needs to be:

64 CPUs: 2048
256 CPUs: 8192 (4096 is too small)

Commit 97dd552eb23c added this:

+ /*
+ * Increase the thresholds for allocating and freeing objects
+ * according to the number of possible CPUs available in the system.
+ */
+ debug_objects_pool_size += num_possible_cpus() * 32;

Why the magic number 32?

It needs to be bigger than that for aarch64.

(2048 + 64 x 32 - 1024) / 64 = 48 (works on 64 CPUs)
(4096 + 256 x 32 - 1024) / 256 = 44 (does not work on 256 CPUs)
(8192 + 256 x 32 - 1024) / 256 = 60 (works on 256 CPUs)
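
The arithmetic above can be checked with a small sketch. It only encodes
the formula as written in this message, (initial + cpus x 32 - 1024) / cpus,
and is not taken from the kernel source; objs_per_cpu(), PER_CPU_SCALE and
DEFAULT_POOL_SIZE are hypothetical names for this example.

```c
/* Per-CPU scaling factor from the quoted line of commit 97dd552eb23c. */
#define PER_CPU_SCALE 32
/* Default static pool size (1024) mentioned earlier in the thread. */
#define DEFAULT_POOL_SIZE 1024

/*
 * Hypothetical helper: evaluates (initial + cpus * 32 - 1024) / cpus,
 * where `initial` is the edited ODEBUG_POOL_SIZE and `cpus` is the number
 * of possible CPUs. Integer division matches the whole-number results
 * shown in the message.
 */
static unsigned int objs_per_cpu(unsigned int initial, unsigned int cpus)
{
    return (initial + cpus * PER_CPU_SCALE - DEFAULT_POOL_SIZE) / cpus;
}
```

With these definitions, objs_per_cpu(2048, 64) evaluates to 48,
objs_per_cpu(4096, 256) to 44, and objs_per_cpu(8192, 256) to 60.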