LinuxLists.cc - [BUG] Read-Only THP causes stalls (commit 10359213d)

2015-05-24 19:33:42

Subject: [BUG] Read-Only THP causes stalls (commit 10359213d)

Hi all,

I noticed a regression on my arm64 APM X-Gene system a couple
of weeks back. I would occasionally see the system lock up and see RCU
stalls during the caching phase of kernbench. I then wrote a small
script that does nothing but cache the files
(http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
iterations of the loop. I have since tried to run a bisect from v3.19 to
v4.0 using 100 iterations as my criterion for a good commit.

This resulted in the following first bad commit:

10359213d05acf804558bda7cc9b8422a828d1cd
(mm: incorporate read-only pages into transparent huge pages, 2015-02-11)

Indeed, running the workload on v4.1-rc4 still produced the behavior,
but reverting the above commit gets me through 100 iterations of the
loop.
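
For readers who cannot reach the paste above, the caching loop can be
approximated as follows. This is only a sketch under assumptions, not the
original script: the names cache_tree and run_iterations, the chunk size,
and the choice of directory are all hypothetical. It reads every file under
a tree so that the data is pulled through the page cache, and repeats that
for a fixed number of iterations.

```python
import os

def cache_tree(root, chunk=1 << 20):
    """Read every file under root once; return the number of bytes read.

    Hypothetical stand-in for the caching script referenced in the thread.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while True:
                        data = f.read(chunk)
                        if not data:
                            break
                        total += len(data)
            except OSError:
                # Skip files that cannot be opened or read.
                continue
    return total

def run_iterations(root, iterations=100):
    """Run the caching pass repeatedly; 100 passes is the 'good' threshold."""
    for i in range(1, iterations + 1):
        nbytes = cache_tree(root)
        print(f"iteration {i}: read {nbytes} bytes", flush=True)
    return iterations
```

Calling run_iterations() on a kernel source tree and watching whether the
last iteration is ever printed mirrors the pass/fail test used for the
bisect.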

I have not tried to reproduce on an x86 system. Turning on a bunch
of kernel debugging features *seems* to hide the problem. My config for
the XGene system is defconfig + CONFIG_BRIDGE and
CONFIG_POWER_RESET_XGENE.

Please let me know if I can help test patches or other things I can
do to help. I'm afraid that by simply reading the patch I didn't see
anything obviously wrong with it which would cause this behavior.

Thanks,
-Christoffer

2015-05-25 10:08:57

by Kirill A. Shutemov

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
> Hi all,
>
> I noticed a regression on my arm64 APM X-Gene system a couple
> of weeks back. I would occasionally see the system lock up and see RCU
> stalls during the caching phase of kernbench. I then wrote a small
> script that does nothing but cache the files
> (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
> bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
> iterations of the loop. I have since tried to run a bisect from v3.19 to
> v4.0 using 100 iterations as my criteria for a good commit.
>
> This resulted in the following first bad commit:
>
> 10359213d05acf804558bda7cc9b8422a828d1cd
> (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
>
> Indeed, running the workload on v4.1-rc4 still produced the behavior,
> but reverting the above commit gets me through 100 iterations of the
> loop.
>
> I have not tried to reproduce on an x86 system. Turning on a bunch
> of kernel debugging features *seems* to hide the problem. My config for
> the XGene system is defconfig + CONFIG_BRIDGE and
> CONFIG_POWER_RESET_XGENE.
>
> Please let me know if I can help test patches or other things I can
> do to help. I'm afraid that by simply reading the patch I didn't see
> anything obviously wrong with it which would cause this behavior.

I don't see the problem on x86.

Some backtraces could help to track it down.

--
Kirill A. Shutemov

2015-05-25 10:19:20

by Christoffer Dall

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On Mon, May 25, 2015 at 01:05:15PM +0300, Kirill A. Shutemov wrote:
> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
> > Hi all,
> >
> > I noticed a regression on my arm64 APM X-Gene system a couple
> > of weeks back. I would occasionally see the system lock up and see RCU
> > stalls during the caching phase of kernbench. I then wrote a small
> > script that does nothing but cache the files
> > (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
> > iterations of the loop. I have since tried to run a bisect from v3.19 to
> > v4.0 using 100 iterations as my criteria for a good commit.
> >
> > This resulted in the following first bad commit:
> >
> > 10359213d05acf804558bda7cc9b8422a828d1cd
> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
> >
> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
> > but reverting the above commit gets me through 100 iterations of the
> > loop.
> >
> > I have not tried to reproduce on an x86 system. Turning on a bunch
> > of kernel debugging features *seems* to hide the problem. My config for
> > the XGene system is defconfig + CONFIG_BRIDGE and
> > CONFIG_POWER_RESET_XGENE.
> >
> > Please let me know if I can help test patches or other things I can
> > do to help. I'm afraid that by simply reading the patch I didn't see
> > anything obviously wrong with it which would cause this behavior.
>
> I don't see the problem on x86.

I'm wondering if we could have some weird combination of how the
specific architecture works along with these patches...

>
> Some backtraces could help to track it down.
>
I don't really get backtraces as the system just locks up. But here are
some of the RCU stalls as I've observed them on the console:

http://paste.ubuntu.com/11014701/
http://paste.ubuntu.com/11023143/
http://paste.ubuntu.com/11023261/

Occasionally, I also get this error from the SATA system at the moment
when I power off the device, but not sure if it is related:

ata1: exception Emask 0x10 SAct 0x0 SErr 0x180000 action 0xe frozen
ata1: irq_stat 0x00400000, PHY RDY changed
ata1: SError: { 10B8B Dispar }

Thanks,
-Christoffer

2015-05-25 14:16:33

by Andrea Arcangeli

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

Hello Christoffer,

On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
> Hi all,
>
> I noticed a regression on my arm64 APM X-Gene system a couple
> of weeks back. I would occasionally see the system lock up and see RCU
> stalls during the caching phase of kernbench. I then wrote a small
> script that does nothing but cache the files
> (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
> bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
> iterations of the loop. I have since tried to run a bisect from v3.19 to
> v4.0 using 100 iterations as my criteria for a good commit.
>
> This resulted in the following first bad commit:
>
> 10359213d05acf804558bda7cc9b8422a828d1cd
> (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
>
> Indeed, running the workload on v4.1-rc4 still produced the behavior,
> but reverting the above commit gets me through 100 iterations of the
> loop.
>
> I have not tried to reproduce on an x86 system. Turning on a bunch
> of kernel debugging features *seems* to hide the problem. My config for
> the XGene system is defconfig + CONFIG_BRIDGE and
> CONFIG_POWER_RESET_XGENE.
>
> Please let me know if I can help test patches or other things I can
> do to help. I'm afraid that by simply reading the patch I didn't see
> anything obviously wrong with it which would cause this behavior.

As further confirmation, could you try:

echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

and verify the problem goes away without having to revert the patch?

Accordingly you should reproduce much easier this way (setting
$largevalue to 8192 or something, it doesn't matter).

echo $largevalue > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs

Then push the system into swap with some memhog -r1000 xG.

The patch just allows readonly anon pages to be collapsed along with
read-write ones; the vma permissions allow it, so they have to be
swapcache pages, this is why swap shall be required.
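
The three echo commands above could also be applied from a small helper
that remembers the previous values, so the tunables can be put back after a
test run. This is a sketch only: set_khugepaged and AGGRESSIVE are
hypothetical names, the 8192 value is the example figure from above, and
writing to the real sysfs directory needs root.

```python
import os

KHUGEPAGED = "/sys/kernel/mm/transparent_hugepage/khugepaged"

# Settings intended to make khugepaged scan as aggressively as possible.
AGGRESSIVE = {
    "pages_to_scan": 8192,
    "alloc_sleep_millisecs": 0,
    "scan_sleep_millisecs": 0,
}

def set_khugepaged(settings, base=KHUGEPAGED):
    """Write each tunable under base; return the values that were replaced."""
    previous = {}
    for name, value in settings.items():
        path = os.path.join(base, name)
        with open(path) as f:
            previous[name] = f.read().strip()
        with open(path, "w") as f:
            f.write(str(value))
    return previous
```

Passing the returned dictionary back into set_khugepaged() restores the
earlier configuration.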

Perhaps there's some arch detail that needs fixing but it'll be easier
to track it down once you have a way to reproduce fast.

Thanks!
Andrea

2015-05-26 08:08:27

by Christoffer Dall

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

Hi Andrea,

On Mon, May 25, 2015 at 04:15:25PM +0200, Andrea Arcangeli wrote:
> Hello Christoffer,
>
> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
> > Hi all,
> >
> > I noticed a regression on my arm64 APM X-Gene system a couple
> > of weeks back. I would occasionally see the system lock up and see RCU
> > stalls during the caching phase of kernbench. I then wrote a small
> > script that does nothing but cache the files
> > (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
> > iterations of the loop. I have since tried to run a bisect from v3.19 to
> > v4.0 using 100 iterations as my criteria for a good commit.
> >
> > This resulted in the following first bad commit:
> >
> > 10359213d05acf804558bda7cc9b8422a828d1cd
> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
> >
> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
> > but reverting the above commit gets me through 100 iterations of the
> > loop.
> >
> > I have not tried to reproduce on an x86 system. Turning on a bunch
> > of kernel debugging features *seems* to hide the problem. My config for
> > the XGene system is defconfig + CONFIG_BRIDGE and
> > CONFIG_POWER_RESET_XGENE.
> >
> > Please let me know if I can help test patches or other things I can
> > do to help. I'm afraid that by simply reading the patch I didn't see
> > anything obviously wrong with it which would cause this behavior.
>
> As further confirmation, could you try:
>
> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan

this returns -EINVAL.

But I'm trying now with:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

>
> and verify the problem goes away without having to revert the patch?

will let you know, so far so good...

>
> Accordingly you should reproduce much easier this way (setting
> $largevalue to 8192 or something, it doesn't matter).
>
> echo $largevalue > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>
> Then push the system into swap with some memhog -r1000 xG.

what is memhog? I couldn't find the utility in Google...

I did try with the above settings and just push a bunch of data into
ramfs and tmpfs and indeed the system died very quickly (on v4.0-rc4).

>
> The patch just allows readonly anon pages to be collapsed along with
> read-write ones, the vma permissions allows it, so they have to be
> swapcache pages, this is why swap shall be required.
>
> Perhaps there's some arch detail that needs fixing but it'll be easier
> to track it down once you have a way to reproduce fast.
>
Yes, would be great to be able to reproduce quickly.

Thanks,
-Christoffer

2015-05-26 13:24:16

by Marc Zyngier

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On 26/05/15 09:08, Christoffer Dall wrote:

[...]

>> Then push the system into swap with some memhog -r1000 xG.
>
> what is memhog? I couldn't find the utility in Google...

This looks to be part of the numactl suite, though Debian doesn't seem
to include it in its numactl package...

Thanks,

M.
--
Jazz is not dead. It just smells funny...

2015-05-26 14:24:24

by Steve Capper

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On 26 May 2015 at 09:08, Christoffer Dall wrote:
> Hi Andrea,
>
> On Mon, May 25, 2015 at 04:15:25PM +0200, Andrea Arcangeli wrote:
>> Hello Christoffer,
>>
>> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
>> > Hi all,
>> >
>> > I noticed a regression on my arm64 APM X-Gene system a couple
>> > of weeks back. I would occasionally see the system lock up and see RCU
>> > stalls during the caching phase of kernbench. I then wrote a small
>> > script that does nothing but cache the files
>> > (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
>> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
>> > iterations of the loop. I have since tried to run a bisect from v3.19 to
>> > v4.0 using 100 iterations as my criteria for a good commit.
>> >
>> > This resulted in the following first bad commit:
>> >
>> > 10359213d05acf804558bda7cc9b8422a828d1cd
>> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
>> >
>> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
>> > but reverting the above commit gets me through 100 iterations of the
>> > loop.
>> >
>> > I have not tried to reproduce on an x86 system. Turning on a bunch
>> > of kernel debugging features *seems* to hide the problem. My config for
>> > the XGene system is defconfig + CONFIG_BRIDGE and
>> > CONFIG_POWER_RESET_XGENE.
>> >
>> > Please let me know if I can help test patches or other things I can
>> > do to help. I'm afraid that by simply reading the patch I didn't see
>> > anything obviously wrong with it which would cause this behavior.
>>
>> As further confirmation, could you try:
>>
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>
> this returns -EINVAL.
>
> But I'm trying now with:
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
>>
>> and verify the problem goes away without having to revert the patch?
>
> will let you know, so far so good...
>
>>
>> Accordingly you should reproduce much easier this way (setting
>> $largevalue to 8192 or something, it doesn't matter).
>>
>> echo $largevalue > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>>
>> Then push the system into swap with some memhog -r1000 xG.
>
> what is memhog? I couldn't find the utility in Google...
>
> I did try with the above settings and just push a bunch of data into
> ramfs and tmpfs and indeed the system died very quickly (on v4.0-rc4).
>
>>
>> The patch just allows readonly anon pages to be collapsed along with
>> read-write ones, the vma permissions allows it, so they have to be
>> swapcache pages, this is why swap shall be required.
>>
>> Perhaps there's some arch detail that needs fixing but it'll be easier
>> to track it down once you have a way to reproduce fast.
>>
> Yes, would be great to be able to reproduce quickly.
>
> Thanks,
> -Christoffer
>

Hi Christoffer,
I'm trying to reproduce this on hardware here; but have been unable to
thus far with 4.1-rc2 on X-Gene and Seattle systems.
Also, I tried the memhog + pages_to_scan suggestion from Andrea.

Maybe a silly question, where is your root filesystem located? Is
there anything network mounted?

Cheers,
--
Steve

2015-05-26 14:35:26

by Christoffer Dall

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

Hi Steve,

On Tue, May 26, 2015 at 03:24:20PM +0100, Steve Capper wrote:
> >> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
> >> > Hi all,
> >> >
> >> > I noticed a regression on my arm64 APM X-Gene system a couple
> >> > of weeks back. I would occasionally see the system lock up and see RCU
> >> > stalls during the caching phase of kernbench. I then wrote a small
> >> > script that does nothing but cache the files
> >> > (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
> >> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
> >> > iterations of the loop. I have since tried to run a bisect from v3.19 to
> >> > v4.0 using 100 iterations as my criteria for a good commit.
> >> >
> >> > This resulted in the following first bad commit:
> >> >
> >> > 10359213d05acf804558bda7cc9b8422a828d1cd
> >> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
> >> >
> >> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
> >> > but reverting the above commit gets me through 100 iterations of the
> >> > loop.
> >> >
> >> > I have not tried to reproduce on an x86 system. Turning on a bunch
> >> > of kernel debugging features *seems* to hide the problem. My config for
> >> > the XGene system is defconfig + CONFIG_BRIDGE and
> >> > CONFIG_POWER_RESET_XGENE.
> >> >
> >> > Please let me know if I can help test patches or other things I can
> >> > do to help. I'm afraid that by simply reading the patch I didn't see
> >> > anything obviously wrong with it which would cause this behavior.
> >>
> >> As further confirmation, could you try:
> >>
> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> >
> > this returns -EINVAL.
> >
> > But I'm trying now with:
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
> >>
> >> and verify the problem goes away without having to revert the patch?
> >
> > will let you know, so far so good...
> >
> >>
> >> Accordingly you should reproduce much easier this way (setting
> >> $largevalue to 8192 or something, it doesn't matter).
> >>
> >> echo $largevalue > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
> >>
> >> Then push the system into swap with some memhog -r1000 xG.
> >
> > what is memhog? I couldn't find the utility in Google...
> >
> > I did try with the above settings and just push a bunch of data into
> > ramfs and tmpfs and indeed the system died very quickly (on v4.0-rc4).
> >
> >>
> >> The patch just allows readonly anon pages to be collapsed along with
> >> read-write ones, the vma permissions allows it, so they have to be
> >> swapcache pages, this is why swap shall be required.
> >>
> >> Perhaps there's some arch detail that needs fixing but it'll be easier
> >> to track it down once you have a way to reproduce fast.
> >>
> > Yes, would be great to be able to reproduce quickly.
> >

> I'm trying to reproduce this on hardware here; but have been unable to
> thus far with 4.1-rc2 on a Xgene and Seattle systems.

Really? That's concerning. I think Andre mentioned he could
reproduce...

How many iterations have you run the caching loop for?

Are you using defconfig? I noticed that turning on debugging features
was hiding the problem.

> Also, I tried the memhog + pages_to_scan suggestion from Andrea.

Any chance you could send me the memhog tool?

>
> Maybe a silly question, where is your root filesystem located? Is
> there anything network mounted?
>
It's a regular ext4 on the local SATA disk. Ubuntu Trusty.

Thanks,
-Christoffer

2015-05-26 14:42:53

by Andrea Arcangeli

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On Tue, May 26, 2015 at 10:08:48AM +0200, Christoffer Dall wrote:
> > echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>
> this returns -EINVAL.
>

Oops sorry, I haven't re-read the code, pages_to_scan 0 does not make
sense, it would only be useful for debugging purposes because it
doesn't shut off khugepaged entirely, so it is ok that it returns
-EINVAL, just it won't allow this debug tweak...

> But I'm trying now with:
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> >
> > and verify the problem goes away without having to revert the patch?
>
> will let you know, so far so good...

I only intended to disable khugepaged, to validate the theory it was
that patch that made the difference.

Increasing the sleep time is equivalent to setting pages_to_scan to 0, so
you can use this instead:

echo 3600000 >/sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
echo 3600000 >/sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs

In addition to knowing if it still happens with THP disabled, it's
interesting to know also if it happens with THP enabled but khugepaged
disabled.
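
To keep the test configurations apart (THP disabled, versus THP enabled
with khugepaged effectively idle), it may help to record the state before
each run. The sketch below relies on the sysfs "enabled" file marking the
active choice with square brackets, e.g. "always [madvise] never"; the
function names parse_selected and thp_state are hypothetical.

```python
import os

THP = "/sys/kernel/mm/transparent_hugepage"

def parse_selected(text):
    """Return the bracketed choice from a line like 'always [madvise] never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no selected value in {text!r}")

def thp_state(base=THP):
    """Collect the THP mode and the khugepaged tunables discussed here."""
    with open(os.path.join(base, "enabled")) as f:
        state = {"enabled": parse_selected(f.read())}
    for name in ("pages_to_scan", "scan_sleep_millisecs",
                 "alloc_sleep_millisecs"):
        with open(os.path.join(base, "khugepaged", name)) as f:
            state[name] = int(f.read())
    return state
```

A result with enabled set to "always" and both sleep values at 3600000
would correspond to the "THP enabled but khugepaged disabled" case.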

> what is memhog? I couldn't find the utility in Google...

Somebody answered, yes it's from numactl.

> I did try with the above settings and just push a bunch of data into
> ramfs and tmpfs and indeed the system died very quickly (on v4.0-rc4).

That's fine, memhog was just a way to hit swap. tmpfs pages aren't
candidates for khugepaged THP collapsing, so it would perhaps be quicker
to reproduce with something like memhog that uses anonymous memory. But
since it still happens, it's ok as long as you hit swap.

If other arm systems don't exhibit this problem, perhaps it has to do with
some difference in THP, I recall there were two models for arm.

Thanks,
Andrea

2015-05-26 14:48:49

by Steve Capper

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On 26 May 2015 at 15:35, Christoffer Dall wrote:
> Hi Steve,
>
> On Tue, May 26, 2015 at 03:24:20PM +0100, Steve Capper wrote:
>> >> On Sun, May 24, 2015 at 09:34:04PM +0200, Christoffer Dall wrote:
>> >> > Hi all,
>> >> >
>> >> > I noticed a regression on my arm64 APM X-Gene system a couple
>> >> > of weeks back. I would occasionally see the system lock up and see RCU
>> >> > stalls during the caching phase of kernbench. I then wrote a small
>> >> > script that does nothing but cache the files
>> >> > (http://paste.ubuntu.com/11324767/) and ran that in a loop. On a known
>> >> > bad commit (v4.1-rc2), out of 25 boots, I never saw it get past 21
>> >> > iterations of the loop. I have since tried to run a bisect from v3.19 to
>> >> > v4.0 using 100 iterations as my criteria for a good commit.
>> >> >
>> >> > This resulted in the following first bad commit:
>> >> >
>> >> > 10359213d05acf804558bda7cc9b8422a828d1cd
>> >> > (mm: incorporate read-only pages into transparent huge pages, 2015-02-11)
>> >> >
>> >> > Indeed, running the workload on v4.1-rc4 still produced the behavior,
>> >> > but reverting the above commit gets me through 100 iterations of the
>> >> > loop.
>> >> >
>> >> > I have not tried to reproduce on an x86 system. Turning on a bunch
>> >> > of kernel debugging features *seems* to hide the problem. My config for
>> >> > the XGene system is defconfig + CONFIG_BRIDGE and
>> >> > CONFIG_POWER_RESET_XGENE.
>> >> >
>> >> > Please let me know if I can help test patches or other things I can
>> >> > do to help. I'm afraid that by simply reading the patch I didn't see
>> >> > anything obviously wrong with it which would cause this behavior.
>> >>
>> >> As further confirmation, could you try:
>> >>
>> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> >
>> > this returns -EINVAL.
>> >
>> > But I'm trying now with:
>> >
>> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> >
>> >>
>> >> and verify the problem goes away without having to revert the patch?
>> >
>> > will let you know, so far so good...
>> >
>> >>
>> >> Accordingly you should reproduce much easier this way (setting
>> >> $largevalue to 8192 or something, it doesn't matter).
>> >>
>> >> echo $largevalue > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan
>> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs
>> >> echo 0 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs
>> >>
>> >> Then push the system into swap with some memhog -r1000 xG.
>> >
>> > what is memhog? I couldn't find the utility in Google...
>> >
>> > I did try with the above settings and just push a bunch of data into
>> > ramfs and tmpfs and indeed the system died very quickly (on v4.0-rc4).
>> >
>> >>
>> >> The patch just allows readonly anon pages to be collapsed along with
>> >> read-write ones, the vma permissions allows it, so they have to be
>> >> swapcache pages, this is why swap shall be required.
>> >>
>> >> Perhaps there's some arch detail that needs fixing but it'll be easier
>> >> to track it down once you have a way to reproduce fast.
>> >>
>> > Yes, would be great to be able to reproduce quickly.
>> >
>
>> I'm trying to reproduce this on hardware here; but have been unable to
>> thus far with 4.1-rc2 on a Xgene and Seattle systems.
>
> Really? That's concerning. I think Andre mentioned he could
> reproduce...
>
> How many iterations have you run the caching loop for?
>
> Are you using defconfig? I noticed that turning on debugging features
> was hiding the problem.
>
>> Also, I tried the memhog + pages_to_scan suggestion from Andrea.
>
> Any chance you could send me the memhog tool?
>
>>
>> Maybe a silly question, where is your root filesystem located? Is
>> there anything network mounted?
>>
> It's a regular ext4 on the local SATA disk. Ubuntu Trusty.
>
> Thanks,
> -Christoffer

Sending an email to lakml appears to have been enough to make it hang
on the Xgene :-).
The system is completely frozen, not even the serial port works.

On Seattle, I've hit 100 iterations multiple times without any problems.

Investigating...

Cheers,
--
Steve

2015-05-26 14:59:33

by Andrea Arcangeli

Subject: Re: [BUG] Read-Only THP causes stalls (commit 10359213d)

On Tue, May 26, 2015 at 04:35:47PM +0200, Christoffer Dall wrote:
> Any chance you could send me the memhog tool?

memhog is just the first that comes to mind because I got it
preinstalled everywhere (I only miss it on cyanogenmod as there's no
numactl there... yet).

Anything else would do as well, as long as you allocate lots of
anonymous memory (malloc(); bzero() or just write 1 byte every
4k). The tmpfs trick was fine as well as you'd end up swapping the
anonymous memory allocated by the running apps.

This would be the python version which I actually used sometimes if I
couldn't find something preinstalled and I didn't want to install
packages.

echo 1 >/proc/sys/vm/overcommit_memory
python
a = "a"
while True:
    a += a

This is the more polished way, I just happen to have it installed
everywhere (except the cellphone) so I tend to use it, I think it's
simpler to install the numactl package.

https://github.com/numactl/numactl/blob/master/memhog.c
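
The "write 1 byte every 4k" approach mentioned above can also be sketched
in Python with a private anonymous mapping. This is an illustration under
assumptions, not a replacement for memhog: touch_anonymous is a
hypothetical name, and the page size comes from mmap.PAGESIZE rather than
a hard-coded 4k. MAP_PRIVATE is requested explicitly because a shared
anonymous mapping is backed by shmem, which, like the tmpfs pages
discussed earlier, would not be plain anonymous memory.

```python
import mmap

def touch_anonymous(size_bytes, page_size=mmap.PAGESIZE):
    """Map private anonymous memory and dirty one byte in every page.

    Returns (mapping, pages_touched); keep the mapping referenced for as
    long as the memory should stay allocated.
    """
    buf = mmap.mmap(
        -1,
        size_bytes,
        flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
        prot=mmap.PROT_READ | mmap.PROT_WRITE,
    )
    pages = 0
    for offset in range(0, size_bytes, page_size):
        buf[offset] = 1  # the write faults the page in
        pages += 1
    return buf, pages
```

To push a machine into swap, size_bytes would have to exceed the free RAM,
which is an assumption about the test system rather than something the
sketch checks.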