Hi,
I tracked a huge performance regression for a while and got it bisected
down to commit "e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 -
page-allocator: preserve PFN ordering when __GFP_COLD is set".
The scenario I'm running is a low-memory system (256 MB total) that does
sequential I/O with parallel iozone processes.
One process per disk, each process reading a 2 GB file. The disks I use
are FCP SCSI disks attached to an s390 host. The file system is ext2.
The regression appears as up to 76% loss in throughput at 16 processes
(processes are scaled from 1 to 64, performance is bad everywhere - 16
is just the peak - avg loss is about 40% throughput).
I already know that giving the system just a bit (~64 MB or more) more memory
solves the issue almost completely, probably because there is almost no
"memory pressure" left in that case.
I also know that using direct-I/O instead of I/O through page cache
doesn't have the problem at all.
Comparing sysstat data taken while running the kernels with and
without the bisected patch shows nothing obvious, except that I/O seems
to take much longer (lower interrupt rate etc.).
The patch alone looks very reasonable, so I'd prefer understanding and
fixing the real issue instead of getting it eventually reverted due to
this regression being larger than the one it was intended to fix.
In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk
should behave like before. So maybe the question is "are our pages cold
while they shouldn't be"?
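For reference, this is roughly how the pages end up "cold" in the first
place - a simplified sketch from memory of the 2.6.31-era sources
(include/linux/pagemap.h, mm/readahead.c, mm/page_alloc.c), not a verbatim
quote:

/* include/linux/pagemap.h: page cache pages for reads are allocated cold */
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
        return __page_cache_alloc(mapping_gfp_mask(x) | __GFP_COLD);
}

/* mm/readahead.c, __do_page_cache_readahead(): readahead uses that helper */
page = page_cache_alloc_cold(mapping);

/* mm/page_alloc.c, buffered_rmqueue(): the flag becomes the "cold" argument
 * that the patch below uses to pick list_add() vs. list_add_tail() */
int cold = !!(gfp_flags & __GFP_COLD);

If that sketch is right, every page-cache read in this workload arrives with
cold != 0 and takes the new list_add_tail() path.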
I don't really have a clue yet how patch e084b exactly causes such a big
regression - ideas welcome :-)
Kind regards,
Christian
P.S: The original patch is concise enough to attach it here to act as
reference without bloating the mail too much:
commit e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4
Author: Mel Gorman <[email protected]>
Date: Wed Jul 29 15:02:04 2009 -0700
page-allocator: preserve PFN ordering when __GFP_COLD is set
Fix a post-2.6.24 performance regression caused by
3dfa5721f12c3d5a441448086bee156887daa961 ("page-allocator: preserve PFN
ordering when __GFP_COLD is set").

Narayanan reports "The regression is around 15%. There is no disk
controller as our setup is based on Samsung OneNAND used as a memory
mapped device on a OMAP2430 based board."

The page allocator tries to preserve contiguous PFN ordering when
returning pages such that repeated callers to the allocator have a strong
chance of getting physically contiguous pages, particularly when external
fragmentation is low. However, if the bulk of the allocations have
__GFP_COLD set as they are due to aio_read() for example, then the PFNs
are in reverse PFN order. This can cause performance degradation when
used with IO controllers that could have merged the requests.

This patch attempts to preserve the contiguous ordering of PFNs for
users of __GFP_COLD.
Signed-off-by: Mel Gorman <[email protected]>
Reported-by: Narayananu Gopalakrishnan <[email protected]>
Tested-by: Narayanan Gopalakrishnan <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index caa9268..ae28c22 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -882,7 +882,7 @@ retry_reserve:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
                        unsigned long count, struct list_head *list,
-                       int migratetype)
+                       int migratetype, int cold)
 {
        int i;
@@ -901,7 +901,10 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
                 * merge IO requests if the physical pages are ordered
                 * properly.
                 */
-               list_add(&page->lru, list);
+               if (likely(cold == 0))
+                       list_add(&page->lru, list);
+               else
+                       list_add_tail(&page->lru, list);
                set_page_private(page, migratetype);
                list = &page->lru;
        }
@@ -1119,7 +1122,8 @@ again:
                local_irq_save(flags);
                if (!pcp->count) {
                        pcp->count = rmqueue_bulk(zone, 0,
-                                       pcp->batch, &pcp->list, migratetype);
+                                       pcp->batch, &pcp->list,
+                                       migratetype, cold);
                        if (unlikely(!pcp->count))
                                goto failed;
                }
@@ -1138,7 +1142,8 @@ again:
                /* Allocate more to the pcp list if necessary */
                if (unlikely(&page->lru == &pcp->list)) {
                        pcp->count += rmqueue_bulk(zone, 0,
-                                       pcp->batch, &pcp->list, migratetype);
+                                       pcp->batch, &pcp->list,
+                                       migratetype, cold);
                        page = list_entry(pcp->list.next, struct page, lru);
                }
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
> Hi,
> I tracked a huge performance regression for a while and got it bisected
> down to commit "e084b2d95e48b31aa45f9c49ffc6cdae8bdb21d4 -
> page-allocator: preserve PFN ordering when __GFP_COLD is set".
>
Darn. That is related to IO controllers being able to automatically merge
requests. The problem it was fixing was that pages were arriving in reverse
PFN order, the controller was unable to merge and performance was impaired. Any
controller that can merge should be faster as a result of the patch.
> The scenario I'm running is a low memory system (256M total), that does
> sequential I/O with parallel iozone processes.
> One process per disk, each process reading a 2Gb file. The disks I use
> are fcp scsi disks attached to a s390 host. File system is ext2.
>
I don't know what controller is in use there but does it
opportunistically merge requests if they are on physically contiguous
pages? If so, can it be disabled?
> The regression appears as up to 76% loss in throughput at 16 processes
> (processes are scaled from 1 to 64, performance is bad everywhere - 16
> is just the peak - avg loss is about 40% throughput).
> I already know that giving the system just a bit (~64m+) more memory
> solves the issue almost completely, probably because there is almost no
> "memory pressure" left in that cases.
> I also know that using direct-I/O instead of I/O through page cache
> doesn't have the problem at all.
This makes sense because it's a sequential read load, so readahead is a
factor and that is why __GFP_COLD is used - the data is not for
immediate use so doesn't need to be cache hot.
> Comparing sysstat data taken while running with the kernels just with &
> without the bisected patch shows nothing obvious except that I/O seems
> to take much longer (lower interrupt ratio etc).
>
Maybe the controller is spending an age trying to merge requests because
it should be able to but takes a long time figuring it out?
> The patch alone looks very reasonable, so I'd prefer understanding and
> fixing the real issue instead of getting it eventually reverted due to
> this regression being larger than the one it was intended to fix.
> In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk
> should behave like before. So maybe the question is "are our pages cold
> while they shouldn't be"?
> Well I don't really have a clue yet to explain how patch e084b exactly
> causes that big regression, ideas welcome :-)
>
Only theory I have at the moment is that the controller notices it can
merge requests and either spends a long time figuring out how to do the
merging or performs worse with merged IO requests.
If the problem is in the driver, oprofile might show where the problem lies.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
>
> [...]
>
> I don't know what controller is in use there but does it
> opportunistically merge requests if they are on physically contiguous
> pages? If so, can it be disabled?
>
As far as I could clarify, our controllers don't support such
opportunistic merging.
>> The regression appears as up to 76% loss in throughput at 16 processes
>> (processes are scaled from 1 to 64, performance is bad everywhere - 16
>> is just the peak - avg loss is about 40% throughput).
>> I already know that giving the system just a bit (~64m+) more memory
>> solves the issue almost completely, probably because there is almost no
>> "memory pressure" left in that cases.
>> I also know that using direct-I/O instead of I/O through page cache
>> doesn't have the problem at all.
>>
>
> This makes sense because it's a sequentual read load, so readahead is a
> factor and that is why __GFP_COLD is used - the data is not for
> immediate use so doesn't need to be cache hot.
>
In the meantime I was able to verify that this also applies to random
reads, which are still reads but have zero readahead requests.
I attached more regression data in the post scriptum at the end of the mail.
>
>> Comparing sysstat data taken while running with the kernels just with &
>> without the bisected patch shows nothing obvious except that I/O seems
>> to take much longer (lower interrupt ratio etc).
>>
>>
>
> Maybe the controller is spending an age trying to merge requests because
> it should be able to but takes a long time figuring it out?
>
I thought that too, but now comes the funny part.
I gathered HW statistics from our I/O controllers, and the latency statistics
clearly state that your patch is working as intended - the latency from
entering the controller until the interrupt to the Linux device driver is
~30% lower!
Remember, as I stated above, that they don't support that opportunistic
merging, so I will have some fun finding out why it is faster in HW now :-)
>> The patch alone looks very reasonable, so I'd prefer understanding and
>> fixing the real issue instead of getting it eventually reverted due to
>> this regression being larger than the one it was intended to fix.
>> In the patch it is clear that hot pages (cold==0) freed in rmqueue_bulk
>> should behave like before. So maybe the question is "are our pages cold
>> while they shouldn't be"?
>> Well I don't really have a clue yet to explain how patch e084b exactly
>> causes that big regression, ideas welcome :-)
>>
>>
>
> Only theory I have at the moment is that the controller notices it can
> merge requests and either spends a long time figuring out how to do the
> merging or performs worse with merged IO requests.
>
> If the problem is in the driver, oprofile might show where the problem lies
With the effective throughput dropping by such a large amount while
hardware latency improves by 30%, I agree and suggest the issue is in the
driver.
I'll do some research into breaking down where in our drivers the time is
lost and reply here for advice and comments in regard to what general memory
management could/should/might do.
Kind regards,
Christian
p.s.
FYI, a bit more regression data: now that I had the patch identified, I made
a full regression test scaling from 1 to 64 processes, comparing just
without / with the commit e084b.
I was surprised, but obviously random read is also suffering from that patch.
Sequential Read
Procs Deviation in %
1 -4.9
2 5.2
4 -82.6
8 -65.6
16 -44.2
32 -30.0
64 -37.7
Random Read
Procs Deviation in %
1 -48.3
2 -45.7
4 -50.5
8 -59.0
16 -61.8
32 -48.3
64 -21.0
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Keeping the old discussion in the mail tail, adding the new information
up here where everyone finds it :-)
Things I was able to confirm so far summarized:
- The controller doesn't care about pfn ordering in any way (proved by
HW statistics)
- regression appears in sequential AND random workloads -> also without
readahead
- oprofile & co are no option atm.
  The effective consumed cpu cycles per transferred kb are almost the
  same, so I would not expect sampling to give us huge insights.
  Therefore I expect that it is more a matter of lost time (latency)
  than of more expensive tasks (cpu consumption).
  I don't want to preclude it completely, but sampling has to wait as
  long as we have better tracks to follow.
So the question is where time is lost in Linux. I used blktrace to
create latency summaries.
I only list the random case for discussion as the effects are more clear
in that data.
Abbreviations are (like the blkparse man page explains) - sorted in
order it would appear per request:
A -- remap For stacked devices, incoming i/o is remapped to
device below it in the i/o stack. The remap action details what exactly
is being remapped to what.
G -- get request To send any type of request to a block device, a
struct request container must be allocated first.
I -- inserted A request is being sent to the i/o scheduler for
addition to the internal queue and later service by the driver. The
request is fully formed at this time.
D -- issued A request that previously resided on the block layer
queue or in the i/o scheduler has been sent to the driver.
C -- complete A previously issued request has been completed.
The output will detail the sector and size of that request, as well as
the success or failure of it.
The following table shows the average latencies from A to G, G to I and
so on.
C2A is special and tries to summarize how long it takes after completing
an I/O until the next one arrives in the block device layer.
                      avg-A2G   avg-G2I   avg-I2D   avg-D2C   avg-C2A-in-avg+-stddev   %C2A-in-avg+-stddev
deviation good->bad    -3.48%    -0.56%    -1.57%    -1.31%                  128.69%                97.26%
It clearly shows that all latencies are almost equal once the block device
layer and the device driver are involved. Remember that the throughput of
the good vs. the bad case differs by more than 3x.
But we can also see that the value of C2A increases by a huge amount.
That huge C2A increase leads me to assume that the time is actually lost
"above" the block device layer.
I don't expect the execution speed of iozone as a user process itself to be
affected by commit e084b, so the question is where the time is lost
between the "read" issued by iozone and entering the block device layer.
Actually I expect it somewhere in the area of getting a page cache page
for the I/O. On one hand page handling is what commit e084b changes, and
on the other hand pages are under pressure (sysstat vm effectiveness
~100%, >40% scanned directly in both cases).
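For reference, the allocation on that path looks roughly like this -
abridged from memory from the 2.6.31-era mm/filemap.c (do_generic_file_read),
so treat it as a sketch rather than an exact quote:

no_cached_page:
        /* the page was not cached - allocate a new (cold) page cache page */
        page = page_cache_alloc_cold(mapping);
        if (!page) {
                desc->error = -ENOMEM;
                goto out;
        }
        error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);

This is the same page_cache_alloc_cold() path the readahead code uses, which
would fit with random reads without readahead being hit just as hard.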
I'll continue hunting down the lost time - maybe with ftrace if it is
not concealing the effect by its invasiveness -, any further
ideas/comments welcome.
kind regards,
Christian
Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Mon, Dec 07, 2009 at 03:39:49PM +0100, Christian Ehrhardt wrote:
>> [...]
>>
>> I don't know what controller is in use there but does it
>> opportunistically merge requests if they are on physically contiguous
>> pages? If so, can it be disabled?
>>
> As far as i could clarify it our controllers don't support such a
> opportunistic merging.
>>> The regression appears as up to 76% loss in throughput at 16
>>> processes (processes are scaled from 1 to 64, performance is bad
>>> everywhere - 16 is just the peak - avg loss is about 40% throughput).
>>> I already know that giving the system just a bit (~64m+) more
>>> memory solves the issue almost completely, probably because there
>>> is almost no "memory pressure" left in that cases.
>>> I also know that using direct-I/O instead of I/O through page cache
>>> doesn't have the problem at all.
>>>
>>
>> This makes sense because it's a sequentual read load, so readahead is a
>> factor and that is why __GFP_COLD is used - the data is not for
>> immediate use so doesn't need to be cache hot.
>>
> In the meanwhile I was able to verify that this also applies to random
> reads which are still reads but have zero read ahead requests.
> I attached more regression data in the post scriptum at the end of the
> mail.
>>
>>> Comparing sysstat data taken while running with the kernels just
>>> with & without the bisected patch shows nothing obvious except that
>>> I/O seems to take much longer (lower interrupt ratio etc).
>>>
>>>
>>
>> Maybe the controller is spending an age trying to merge requests because
>> it should be able to but takes a long time figuring it out?
>>
> I thought that too, but now comes the funny part.
> I gathered HW statistics from our I/O controllers and latency
> statistics clearly state that your patch is working as intended - the
> latency from entering the controller until the interrupt to linux
> device driver is ~30% lower!.
> Remember as I stated above that they don't support that opportunistic
> merging so I will have some fun finding out why it is faster in HW now
> :-)
>
>>> The patch alone looks very reasonable, so I'd prefer understanding
>>> and fixing the real issue instead of getting it eventually reverted
>>> due to this regression being larger than the one it was intended to
>>> fix.
>>> In the patch it is clear that hot pages (cold==0) freed in
>>> rmqueue_bulk should behave like before. So maybe the question is
>>> "are our pages cold while they shouldn't be"?
>>> Well I don't really have a clue yet to explain how patch e084b
>>> exactly causes that big regression, ideas welcome :-)
>>>
>>>
>>
>> Only theory I have at the moment is that the controller notices it can
>> merge requests and either spends a long time figuring out how to do the
>> merging or performs worse with merged IO requests.
>>
>> If the problem is in the driver, oprofile might show where the
>> problem lies
> With the effective throughput dropping by such a large amount while
> hardware latency improves by 30% I agree and suggest the issue is in
> the driver.
> I'll do some research in breaking down where in our drivers time is
> lost and reply here for advises and comments in regard to what general
> memory management could/should/might do.
>
> Kind regards,
> Christian
>
> p.s.
> FYI a bit more regression data, now I had the patch identified I made
> a full regression test scaling from 1 to 64 processes.
> Comparing just without / with the commit e084b
> I wondered, but obviously random read is also suffering from that patch.
>
> Sequential Read
> Procs Deviation in %
> 1 -4.9
> 2 5.2
> 4 -82.6
> 8 -65.6
> 16 -44.2
> 32 -30.0
> 64 -37.7
>
> Random Read
> Procs Deviation in %
> 1 -48.3
> 2 -45.7
> 4 -50.5
> 8 -59.0
> 16 -61.8
> 32 -48.3
> 64 -21.0
>
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
> Keeping the old discussion in the mail tail, adding the new information
> up here were everyone finds them :-)
>
> Things I was able to confirm so far summarized:
> - The controller doesn't care about pfn ordering in any way (proved by
> HW statistics)
> - regression appears in sequential AND random workloads -> also without
> readahead
> - oprofile & co are no option atm.
> The effective consumed cpu cycles per transferred kb are almost the
> same so I would not expect sampling would give us huge insights.
> Therefore I expect that it is more a matter of lost time (latency) than
> more expensive tasks (cpu consumption)
But earlier, you said that latency was lower - "latency statistics clearly
state that your patch is working as intended - the latency from entering
the controller until the interrupt to linux device driver is ~30% lower!."
Also, if the controller is doing no merging of IO requests, why is the
interrupt rate lower?
> I don't want to preclude it completely, but sampling has to wait as
> long as we have better tracks to follow
>
> So the question is where time is lost in Linux. I used blktrace to
> create latency summaries.
> I only list the random case for discussion as the effects are more clear
> int hat data.
> Abbreviations are (like the blkparse man page explains) - sorted in
> order it would appear per request:
> A -- remap For stacked devices, incoming i/o is remapped to device
> below it in the i/o stack. The remap action details what exactly is being
> remapped to what.
> G -- get request To send any type of request to a block device, a
> struct request container must be allocated first.
> I -- inserted A request is being sent to the i/o scheduler for
> addition to the internal queue and later service by the driver. The
> request is fully formed at this time.
> D -- issued A request that previously resided on the block layer
> queue or in the i/o scheduler has been sent to the driver.
> C -- complete A previously issued request has been completed. The
> output will detail the sector and size of that request, as well as the
> success or failure of it.
>
> The following table shows the average latencies from A to G, G to I and
> so on.
> C2A is special and tries to summarize how long it takes after completing
> an I/O until the next one arrives in the block device layer.
>
> avg-A2G avg-G2I avg-I2D avg-D2C avg-C2A-in-avg+-stddev %C2A-in-avg+-stddev
> deviation good->bad -3.48% -0.56% -1.57% -1.31% 128.69% 97.26%
>
> It clearly shows that all latencies once block device layer and device
> driver are involved are almost equal. Remember that the throughput of
> good vs. bad case is more than x3.
> But we can also see that the value of C2A increases by a huge amount.
> That huge C2A increase let me assume that the time is actually lost
> "above" the block device layer.
>
To be clear. As C is "completion" and "A" is remapping new IO, it
implies that time is being lost between when one IO completes and
another starts, right?
> I don't expect the execution speed of iozone as user process itself is
> affected by commit e084b,
Not by this much anyway. Let's say cache hotness is a problem; I would
expect some performance loss but not this much.
> so the question is where the time is lost
> between the "read" issued by iozone and entering the block device layer.
> Actually I expect it somewhere in the area of getting a page cache page
> for the I/O. On one hand page handling is what commit e084b changes and
> on the other hand pages are under pressure (systat vm effectiveness
> ~100%, >40% scanned directly in both cases).
>
> I'll continue hunting down the lost time - maybe with ftrace if it is
> not concealing the effect by its invasiveness -, any further
> ideas/comments welcome.
>
One way of finding out if cache hotness was the problem would be to profile
for cache misses and see if there are massive differences with and without
the patch. Is that an option?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Thu, Dec 10, 2009 at 03:36:04PM +0100, Christian Ehrhardt wrote:
>
>> Keeping the old discussion in the mail tail, adding the new information
>> up here were everyone finds them :-)
>>
>> Things I was able to confirm so far summarized:
>> - The controller doesn't care about pfn ordering in any way (proved by
>> HW statistics)
>> - regression appears in sequential AND random workloads -> also without
>> readahead
>> - oprofile & co are no option atm.
>> The effective consumed cpu cycles per transferred kb are almost the
>> same so I would not expect sampling would give us huge insights.
>> Therefore I expect that it is more a matter of lost time (latency) than
>> more expensive tasks (cpu consumption)
>>
>
> But earlier, you said that latency was lower - "latency statistics clearly
> state that your patch is working as intended - the latency from entering
> the controller until the interrupt to linux device driver is ~30% lower!."
>
That's right, but the pure hardware time is only lower because there is
less I/O in flight.
As there is less concurrency and contention that way, a single I/O in HW is
faster.
But in HW it's so fast (in both cases) that, as verified in the Linux layers,
it doesn't matter.
Both cases take more or less the same time from an I/O entering the
block device layer until completion.
> Also, if the controller is doing no merging of IO requests, why is the
> interrupt rate lower?
>
I was wondering about that when I started to work on this too, but the
answer is simply that there are fewer requests coming in per second - and
that implies a lower interrupt rate too.
>> I don't want to preclude it completely, but sampling has to wait as
>> long as we have better tracks to follow
>>
>> So the question is where time is lost in Linux. I used blktrace to
>> create latency summaries.
>> I only list the random case for discussion as the effects are more clear
>> int hat data.
>> Abbreviations are (like the blkparse man page explains) - sorted in
>> order it would appear per request:
>> A -- remap For stacked devices, incoming i/o is remapped to device
>> below it in the i/o stack. The remap action details what exactly is being
>> remapped to what.
>> G -- get request To send any type of request to a block device, a
>> struct request container must be allocated first.
>> I -- inserted A request is being sent to the i/o scheduler for
>> addition to the internal queue and later service by the driver. The
>> request is fully formed at this time.
>> D -- issued A request that previously resided on the block layer
>> queue or in the i/o scheduler has been sent to the driver.
>> C -- complete A previously issued request has been completed. The
>> output will detail the sector and size of that request, as well as the
>> success or failure of it.
>>
>> The following table shows the average latencies from A to G, G to I and
>> so on.
>> C2A is special and tries to summarize how long it takes after completing
>> an I/O until the next one arrives in the block device layer.
>>
>> avg-A2G avg-G2I avg-I2D avg-D2C avg-C2A-in-avg+-stddev %C2A-in-avg+-stddev
>> deviation good->bad -3.48% -0.56% -1.57% -1.31% 128.69% 97.26%
>>
>> It clearly shows that all latencies once block device layer and device
>> driver are involved are almost equal. Remember that the throughput of
>> good vs. bad case is more than x3.
>> But we can also see that the value of C2A increases by a huge amount.
>> That huge C2A increase let me assume that the time is actually lost
>> "above" the block device layer.
>>
>>
>
> To be clear. As C is "completion" and "A" is remapping new IO, it
> implies that time is being lost between when one IO completes and
> another starts, right?
>
>
Absolutely correct
>> I don't expect the execution speed of iozone as user process itself is
>> affected by commit e084b,
>>
> Not by this much anyway. Lets say cache hotness is a problem, I would
> expect some performance loss but not this much.
>
I agree, even if it is cache hotness it wouldn't be that much.
And cold caches would appear as "longer" instructions because they would
need more cycles due to e.g. dcache misses.
But as mentioned before the cycles per transferred amount of data are
the same so I don't expect it is due to cache hot/cold.
>> so the question is where the time is lost
>> between the "read" issued by iozone and entering the block device layer.
>> Actually I expect it somewhere in the area of getting a page cache page
>> for the I/O. On one hand page handling is what commit e084b changes and
>> on the other hand pages are under pressure (systat vm effectiveness
>> ~100%, >40% scanned directly in both cases).
>>
>> I'll continue hunting down the lost time - maybe with ftrace if it is
>> not concealing the effect by its invasiveness -, any further
>> ideas/comments welcome.
>>
>>
>
> One way of finding out if cache hottness was the problem would be to profile
> for cache misses and see if there are massive differences with and without
> the patch. Is that an option?
>
It is an option to verify things, but as mentioned above I would expect
an increased amount of consumed cycles per kb, which I don't see.
I'll track caches anyway to be sure.
My personal current assumption is that either there is some time lost
from the read syscall until the A event blktrace tracks, or I'm wrong
with my assumption about user processes and iozone runs "slower".
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Christian Ehrhardt wrote:
> Mel Gorman wrote:
> [...]
>>>
>>
>> One way of finding out if cache hottness was the problem would be to
>> profile
>> for cache misses and see if there are massive differences with and
>> without
>> the patch. Is that an option?
>>
> It is an option to verify things, but as mentioned above I would
> expect an increase amounts of consumed cycles per kb which I don't see.
> I'll track caches anyway to be sure.
>
> My personal current assumption is that either there is some time lost
> from the read syscall until the A event blocktrace tracks or I'm wrong
> with my assumption about user processes and iozone runs "slower".
>
After I was able to verify that the time is lost above the block device
layer, I ran further tests to check out where the additional latency due
to commit e084b occurs.
With "strace -frT -e trace=open,close,read,write" I could isolate the
continuous stream of read calls done by the iozone child process.
In that data two things were revealed:
a) The time is lost between read and entering the block device layer (not
e.g. due to slow userspace execution or scheduling effects).
   Time spent in userspace until the next read call is almost the
same, but the average time spent "inside" the read call increases from
788µs to 3278µs.
b) The time is lost in a few big peaks.
   The majority of read calls take ~500-750µs in both cases, but there
is a shift of approximately 8% of the calls that now show up as durations
between 15000 and 60000µs.
As mentioned before there is no significant increase of spent cpu cycles
- therefore I looked at how the read call could end up sleeping
(latency, but no consumed cycles).
I found that there is a chance to run into "congestion_wait" in case
__alloc_pages_direct_reclaim() was able to free some pages with
try_to_free_pages() but doesn't get a page in the subsequent
get_page_from_freelist().
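For reference, the code path I mean looks roughly like this - heavily
abridged from memory from the 2.6.31/2.6.32-era mm/page_alloc.c, so a sketch
only:

/* __alloc_pages_direct_reclaim(), abridged */
*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
if (likely(*did_some_progress))
        page = get_page_from_freelist(gfp_mask, nodemask, order,
                                      zonelist, high_zoneidx, alloc_flags,
                                      preferred_zone, migratetype);
return page;                    /* may still be NULL despite the progress */

/* back in __alloc_pages_slowpath(), abridged: reclaim made progress but
 * no page could be taken from the free list -> sleep and retry */
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
        /* Wait for some write requests to complete then retry */
        congestion_wait(BLK_RW_ASYNC, HZ/50);
        goto rebalance;
}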
To get details about that I patched in some simple counters and
ktime_get-based timings. The results are impressive and point exactly
towards the congestion_wait theory.
The following data is again taken with iozone, 1 thread, random read in a
very low memory environment; the counters were zeroed before starting iozone.
Bad = the kernel just with commit e084b, good = the kernel one commit before
e084b. The throughput in the good case while taking this data was
approximately ~168% of that of the bad case.
                                           GOOD      BAD
perf_count_congestion_wait                  245     1388
perf_count_pages_direct_reclaim           10993     5365
perf_count_failed_pages_direct_reclaim        0     1090
Note - the data also shows that the average time spent in the
f_op->aio_read call issued by do_sync_read increases by 68.5%, which is
inversely proportional to the average loss in throughput.
This lets me conclude that patch e084b causes the system to run into
congestion_wait after the semi-failed call to __alloc_pages_direct_reclaim.
Unfortunately I don't understand memory management well enough to find the
relation between e084b actually changing the order and position of freed
pages in the per-cpu free lists (via page->lru) and
__alloc_pages_direct_reclaim not getting pages as effectively as before.
Ideas and comments welcome!
Kind regards,
Christian
P.S: attached are the three patches I used to get those new numbers
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Linux Performance on System z
On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>> [...]
>>>>
>>>
>>> One way of finding out if cache hottness was the problem would be to
>>> profile
>>> for cache misses and see if there are massive differences with and
>>> without
>>> the patch. Is that an option?
>>>
>> It is an option to verify things, but as mentioned above I would
>> expect an increase amounts of consumed cycles per kb which I don't see.
>> I'll track caches anyway to be sure.
>>
>> My personal current assumption is that either there is some time lost
>> from the read syscall until the A event blocktrace tracks or I'm wrong
>> with my assumption about user processes and iozone runs "slower".
>>
>
> After I was able to verify that the time is lost above the block device
> layer I ran further tests to check out where the additional latency due
> to commit e084b occurs.
> With "strace -frT -e trace=open,close,read,write" I could isolate the
> continuous stream of read calls done by the iozone child process.
> In those data two things were revealed:
> a) The time is lost between read and Block device enter (not e.g. due to
> slow userspace execution or scheduling effects)
> Time spent in userspace until the next read call is almost the same,
> but the average time spent "inside" the read call increases from 788µs
> to 3278µs
> b) The time is lost in a few big peaks
> The majority of read calls are ~500-750µs in both cases, but there is
> a slight shift of approximately 8% that occur now as durations between
> 15000 and 60000µs
>
> As mentioned before there is no significant increase of spent cpu cycles
> - therefore I watched out how the read call could end up in sleeps
> (latency but not used cycles).
> I found that there is a chance to run into "congestion_wait" in case
> __alloc_pages_direct_reclaim() was able to free some pages with
> try_to_free_pages() but doesn't get a page in the subsequent
> get_page_from_freelist().
This is true although it was also the case before the patch.
> To get details about that I patched in some simple counters and
> ktime_get based timings. The results are impressive and point exactly
> towards the congestion_wait theory
>
> The following data is again taken with iozone 1 thread random read in a
> very low memory environment, the counters were zeroed before starting
> iozone
> Bad = the kernel just with commit e084b, good = kernel one commit before
> e084b. The throughput in good case while taking this data was
> approximately ~168% of that of the bad case.
>
> GOOD BAD
> perf_count_congestion_wait 245 1388
> perf_count_pages_direct_reclaim 10993 5365
> perf_count_failed_pages_direct_reclaim 0 1090
> Note - the data also shows that the average time spent in the
> f_op->aio_read call issued by do_sync_read increases by 68.5% which is
> directly inversely proportional to the average loss in throughput.
>
> This lets me conclude that patch e084b causes the system to run into
> congestion_wait after the semi-failed call to
> __alloc_pages_direct_reclaim.
You mention that the "GOOD" kernel is one commit before e084b but can
you test with just that patch reverted please? As the patch is only
about list manipulation, I'm failing to see how it can affect
congestion_wait().
> Unfortunately I don't understand memory management enough to find the
> relation between e084b actually changing the order and position of freed
> pages in the LRU list to __alloc_pages_direct_reclaim not getting pages
> as effectively as before.
Off-hand, neither can I. I regret that I'm going off-line for the
Christmas very soon and I'll be in the middle of nowhere with no
internet connection or test machines in the interim. It's going to be
the new year before I get to look at this properly.
Even if this patch is not the real problem, your instrumentation patches
will help pin down the real problem. Thanks and sorry I'll be delayed in
getting back to you properly.
> Ideas and comments welcome!
>
> Kind regards,
> Christian
>
> P.S: attached are the three patches I used to get those new numbers
>
>
> --
>
> Grüsse / regards, Christian Ehrhardt
> IBM Linux Technology Center, Linux Performance on System z
>
> Index: linux-2.6-git-2.6.31-bisect2/fs/read_write.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/fs/read_write.c
> +++ linux-2.6-git-2.6.31-bisect2/fs/read_write.c
> @@ -249,25 +249,37 @@ static void wait_on_retry_sync_kiocb(str
> __set_current_state(TASK_RUNNING);
> }
>
> +unsigned long perf_count_do_sync_read_calls_aio_read = 0;
> +unsigned long perf_tsum_do_sync_read_calls_aio_read = 0;
> +unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb = 0;
> +unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb = 0;
> ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos)
> {
> struct iovec iov = { .iov_base = buf, .iov_len = len };
> struct kiocb kiocb;
> ssize_t ret;
> + ktime_t tv64_pre;
>
> init_sync_kiocb(&kiocb, filp);
> kiocb.ki_pos = *ppos;
> kiocb.ki_left = len;
>
> for (;;) {
> + perf_count_do_sync_read_calls_aio_read++;
> + tv64_pre = ktime_get();
> ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos);
> + perf_tsum_do_sync_read_calls_aio_read += ktime_sub(ktime_get(),tv64_pre).tv64;
> if (ret != -EIOCBRETRY)
> break;
> wait_on_retry_sync_kiocb(&kiocb);
> }
>
> - if (-EIOCBQUEUED == ret)
> + if (-EIOCBQUEUED == ret) {
> + perf_count_do_sync_read_calls_wait_on_sync_kiocb++;
> + tv64_pre = ktime_get();
> ret = wait_on_sync_kiocb(&kiocb);
> + perf_tsum_do_sync_read_calls_wait_on_sync_kiocb += ktime_sub(ktime_get(),tv64_pre).tv64;
> + }
> *ppos = kiocb.ki_pos;
> return ret;
> }
> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -257,6 +257,10 @@ extern unsigned long perf_count_pages_di
> extern unsigned long perf_count_failed_pages_direct_reclaim;
> extern unsigned long perf_count_do_generic_file_read_calls_page_cache_alloc_cold;
> extern unsigned long perf_tsum_do_generic_file_read_calls_page_cache_alloc_cold;
> +extern unsigned long perf_count_do_sync_read_calls_aio_read;
> +extern unsigned long perf_tsum_do_sync_read_calls_aio_read;
> +extern unsigned long perf_count_do_sync_read_calls_wait_on_sync_kiocb;
> +extern unsigned long perf_tsum_do_sync_read_calls_wait_on_sync_kiocb;
> static struct ctl_table perf_table[] = {
> {
> .ctl_name = CTL_UNNUMBERED,
> @@ -297,6 +301,38 @@ static struct ctl_table perf_table[] = {
> .maxlen = sizeof(unsigned long),
> .mode = 0666,
> .proc_handler = &proc_doulongvec_minmax,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_count_do_sync_read_calls_aio_read",
> + .data = &perf_count_do_sync_read_calls_aio_read,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_tsum_do_sync_read_calls_aio_read",
> + .data = &perf_tsum_do_sync_read_calls_aio_read,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_count_do_sync_read_calls_wait_on_sync_kiocb",
> + .data = &perf_count_do_sync_read_calls_wait_on_sync_kiocb,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_tsum_do_sync_read_calls_wait_on_sync_kiocb",
> + .data = &perf_tsum_do_sync_read_calls_wait_on_sync_kiocb,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> },
> { .ctl_name = 0 }
> };
> Index: linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/include/linux/sysctl.h
> +++ linux-2.6-git-2.6.31-bisect2/include/linux/sysctl.h
> @@ -69,6 +69,7 @@ enum
> CTL_BUS=8, /* Busses */
> CTL_ABI=9, /* Binary emulation */
> CTL_CPU=10, /* CPU stuff (speed scaling, etc) */
> + CTL_PERF=11, /* Performance counters and timer sums for debugging */
> CTL_ARLAN=254, /* arlan wireless driver */
> CTL_S390DBF=5677, /* s390 debug */
> CTL_SUNRPC=7249, /* sunrpc debug */
> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -177,6 +177,7 @@ static struct ctl_table_root sysctl_tabl
> .default_set.list = LIST_HEAD_INIT(root_table_header.ctl_entry),
> };
>
> +static struct ctl_table perf_table[];
> static struct ctl_table kern_table[];
> static struct ctl_table vm_table[];
> static struct ctl_table fs_table[];
> @@ -230,6 +231,13 @@ static struct ctl_table root_table[] = {
> .mode = 0555,
> .child = dev_table,
> },
> + {
> + .ctl_name = CTL_PERF,
> + .procname = "perf",
> + .mode = 0555,
> + .child = perf_table,
> + },
> +
> /*
> * NOTE: do not add new entries to this table unless you have read
> * Documentation/sysctl/ctl_unnumbered.txt
> @@ -244,6 +252,19 @@ static int min_wakeup_granularity_ns;
> static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
> #endif
>
> +extern unsigned long perf_count_congestion_wait;
> +static struct ctl_table perf_table[] = {
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_count_congestion_wait",
> + .data = &perf_count_congestion_wait,
> + .mode = 0666,
> + .maxlen = sizeof(unsigned long),
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> + { .ctl_name = 0 }
> +};
> +
> static struct ctl_table kern_table[] = {
> #ifdef CONFIG_SCHED_DEBUG
> {
> Index: linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/mm/backing-dev.c
> +++ linux-2.6-git-2.6.31-bisect2/mm/backing-dev.c
> @@ -305,6 +305,7 @@ void set_bdi_congested(struct backing_de
> }
> EXPORT_SYMBOL(set_bdi_congested);
>
> +unsigned long perf_count_congestion_wait = 0;
> /**
> * congestion_wait - wait for a backing_dev to become uncongested
> * @sync: SYNC or ASYNC IO
> @@ -320,6 +321,7 @@ long congestion_wait(int sync, long time
> DEFINE_WAIT(wait);
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> + perf_count_congestion_wait++;
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> finish_wait(wqh, &wait);
> Index: linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/kernel/sysctl.c
> +++ linux-2.6-git-2.6.31-bisect2/kernel/sysctl.c
> @@ -253,6 +253,8 @@ static int max_wakeup_granularity_ns = N
> #endif
>
> extern unsigned long perf_count_congestion_wait;
> +extern unsigned long perf_count_pages_direct_reclaim;
> +extern unsigned long perf_count_failed_pages_direct_reclaim;
> static struct ctl_table perf_table[] = {
> {
> .ctl_name = CTL_UNNUMBERED,
> @@ -262,6 +264,22 @@ static struct ctl_table perf_table[] = {
> .maxlen = sizeof(unsigned long),
> .proc_handler = &proc_doulongvec_minmax,
> },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_count_pages_direct_reclaim",
> + .data = &perf_count_pages_direct_reclaim,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> + {
> + .ctl_name = CTL_UNNUMBERED,
> + .procname = "perf_count_failed_pages_direct_reclaim",
> + .data = &perf_count_failed_pages_direct_reclaim,
> + .maxlen = sizeof(unsigned long),
> + .mode = 0666,
> + .proc_handler = &proc_doulongvec_minmax,
> + },
> { .ctl_name = 0 }
> };
>
> Index: linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git-2.6.31-bisect2.orig/mm/page_alloc.c
> +++ linux-2.6-git-2.6.31-bisect2/mm/page_alloc.c
> @@ -1605,6 +1605,9 @@ out:
> return page;
> }
>
> +unsigned long perf_count_pages_direct_reclaim = 0;
> +unsigned long perf_count_failed_pages_direct_reclaim = 0;
> +
> /* The really slow allocator path where we enter direct reclaim */
> static inline struct page *
> __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> @@ -1645,6 +1648,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_m
> zonelist, high_zoneidx,
> alloc_flags, preferred_zone,
> migratetype);
> +
> + perf_count_pages_direct_reclaim++;
> + if (!page)
> + perf_count_failed_pages_direct_reclaim++;
> +
> return page;
> }
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
>
>> Christian Ehrhardt wrote:
>>
>> [...]
> You mention that the "GOOD" kernel is one commit before e084b but can
> you test with just that patch reverted please? As the patch is only
> about list manipulation, I'm failing to see how it can affect
> congestion_wait().
>
>
>> Unfortunately I don't understand memory management enough to find the
>> relation between e084b actually changing the order and position of freed
>> pages in the LRU list to __alloc_pages_direct_reclaim not getting pages
>> as effectively as before.
>>
>
> Off-hand, neither can I. I regret that I'm going off-line for the
> Christmas very soon and I'll be in the middle of nowhere with no
> internet connection or test machines in the interim. It's going to be
> the new year before I get to look at this properly.
>
> Even if this patch is not the real problem, your instrumentation patches
> will help pin down the real problem. Thanks and sorry I'll be delayed in
> getting back to you properly.
>
Back to business :-)
First - yes, it is reproducible in 2.6.32 final and fixable by unapplying
e084b.
But I don't think un-applying it is really a fix, as no one really
understands how e084b causes this regression (it might just work around
whatever goes on in the background, and someone would still hit the hidden
issue).
Let's better look at what we know, summarized:
- the patch e084b causes this regression
- it causes it only under very rare, if not theoretical, high memory
constraints
- time is lost between the read system call and entering the block device
layer
- via debugging and discussion we found that the lost time is spent in
congestion_wait
I did not yet check if it is possible, but we might have a spot that
could fix the issue.
congestion_wait states it would "Waits for up to @timeout jiffies for a
backing_dev (any backing_dev) to exit write congestion. If no
backing_devs are congested then just wait for the next write to be
completed.".
Now in some cases (and also in my test) there is absolutely no write to do
or in flight. So maybe it would be possible to either detect this before
calling congestion_wait from page_alloc.c, or let congestion_wait return
an error code or something similar.
I mean, as far as I can see the kernel currently does a loop like that
(pseudo code and simplified):
1. get a page <- fail
2. try_to_free_pages <- returns >=1 pages freed
3. get_page_from_freelist <- fails
4. should_alloc_retry <- true
5. congestion_wait <- FUTILE - just losing time, nothing will be freed
by writes
6. goto 1
-> So maybe we should detect the futility of that congestion_wait call
and not do it at all - retry instantly, go to OOM, ..., whatever (that's
open for discussion).
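Just to make the idea concrete, a purely hypothetical and untested sketch of
where such a check could sit (function and counter names as in 2.6.32-era
page_alloc.c; the NR_WRITEBACK test is only one possible way to detect
"nothing to wait for"):

/* in __alloc_pages_slowpath() - hypothetical, not a tested patch */
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
        /* If nothing is queued for writeback, waiting for write
         * congestion cannot free anything - retry right away.
         * (Illustration only; an unconditional immediate retry could
         * of course spin, so a real fix needs more thought.) */
        if (global_page_state(NR_WRITEBACK) == 0)
                goto rebalance;
        congestion_wait(BLK_RW_ASYNC, HZ/50);
        goto rebalance;
}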
If we can't fix it that way, I would propose we keep the code as is,
hoping that it is a theoretical case that never hits a customer system.
But in that case we should definitely add a userspace-readable counter
to congestion_wait, similar to my debug patches. This would allow
everyone who ever suspects an issue might be related to this to verify
it by checking the congestion_wait counter.
This should not be too hard technically, and it is also not performance
critical as congestion_wait is going to wait anyway. On the other hand,
everyone hates introducing new kernel interfaces that need to be kept
compatible until nirvana - especially for bug tracking it's not the best
idea :-) So it should be our very last resort.
Comments/Opinions to those ideas ?
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Thu, Jan 14, 2010 at 01:30:24PM +0100, Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
>>
>>> Christian Ehrhardt wrote:
>>> [...]
>> You mention that the "GOOD" kernel is one commit before e084b but can
>> you test with just that patch reverted please? As the patch is only
>> about list manipulation, I'm failing to see how it can affect
>> congestion_wait().
>>
>>
>>> Unfortunately I don't understand memory management enough to find the
>>> relation between e084b actually changing the order and position of
>>> freed pages in the LRU list to __alloc_pages_direct_reclaim not
>>> getting pages as effectively as before.
>>>
>>
>> Off-hand, neither can I. I regret that I'm going off-line for the
>> Christmas very soon and I'll be in the middle of nowhere with no
>> internet connection or test machines in the interim. It's going to be
>> the new year before I get to look at this properly.
>>
>> Even if this patch is not the real problem, your instrumentation patches
>> will help pin down the real problem. Thanks and sorry I'll be delayed in
>> getting back to you properly.
>>
>
> Back to business :-)
> First - yes it is reproducible in 2.6.32 final and fixable by unapplying
> e084b.
> But I don't think un-applying it is really a fix as no one really
> understands how e084b cause this regression (it might just work around
> whatever goes on in the background and someone still hits the hidden
> issue).
>
Agreed on reverting is not an option. While it is not understood why it
causes your regression, it's known to fix a regression for hardware that
does do merging.
> Lets better look what we know summarized:
> - the patch e084a causes this regression
> - it causes it only in very rare if not theoretically high memory
> constraints
> - time is lost between read system call and the block device layer enter
> - via debugging and discussion we found that time lost is spent in
> congestion_wait
>
Thanks for the summary.
There is an outside chance it's due to a difference in free page availability
and the associated counters. Particularly if some subsystem is optimistically
trying high-order allocations depending on availability. If there are many
allocation failures, the counters get skewed due to a bug, watermarks are
not met and direct reclaim is used more. A patch is on its way upstream to
fix this: http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
try it please?
There is a report on swap-based workloads being impacted on 2.6.32 and a fix
exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
though, this patch should be applied http://lkml.org/lkml/2009/12/21/245
> I did not yet check if it is possible, but we might have a spot tho
> might fix the issue.
> Congestion_wait states it would "Waits for up to @timeout jiffies for a
> backing_dev (any backing_dev) to exit write congestion. If no
> backing_devs are congested then just wait for the next write to be
> completed.".
> Now in some case (and also my test) there is absolutely no write to do
> or in flight. So maybe it would be possible to either detect this before
> calling congestion_wait from page_alloc.c or let congestion_wait return
> an error code or something similar.
>
While I think that idea has merit, it's fixing the symptons and not the
underlying problem. There is a growing opinion that congestion_wait() being
used in the VM at all is a mistake. If the VM really needs to wait on pages
being written out, it should have been on a waitqueue for a number of pages
rather than a time-based method.
In other words, I'm more interesting in figuring out *why* we enter
congestion_wait() at all with the patch and it's avoided without rather than
what congestion_wait() does when it gets called.
Also, are you sure about the "waiting for the next write to complete"? In
2.6.31 at least, it's waiting on the async write queue. If it's the case that
it waits the full timeout when there is no async traffic then that's obviously
bad as well, but it still doesn't explain why the problem patch makes a difference.
> I mean as far as I can see the kernel currently does a loop like that
> (pseudo code and simplified):
> 1. get a page <- fail
> 2. try_to_free page <- returns >=1 pages freed
> 3. get_page_from_freelist <- fails
> 4. should_alloc_retry <- true
> 5. congestion_wait <- FUTILE - just loosing time, nothing will be freed
> by writes
> 6. goto 1
> -> So maybe we should detect the futility of that congestion_wait call
> and not do it at all repeating instantly, go to oom, .., whatever (thats
> out for discussion).
>
> If we can't fix it that way I would propose we keep the code as is,
It's not my preferred way to fix this although I'm willing to explore it.
I'd much prefer an explanation as to why your testcase is sensitive to the
PFN ordering of the pages it gets back.
One possibility is that your testcase requires a number of
high-order allocations and the patch is preventing them being
merged. If that was the case, the following patch would also help
http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.
> hoping that it is a theoretical case that never hits a customer system.
> But in that case we should definitely add a userspace readable counter
> to congestion_wait similar to my debug patches. This would allow
> everyone that ever assumes an issue might be related to this to verify
> it by checking the congestion_wait counter.
Would the accounting features available with
CONFIG_TASK_DELAY_ACCT be sufficient? There is a description in
Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
amount of time spent on IO might explain the problem.
> This should not be too hard in a technical way, but also not performance
> critical as congestion_wait is going to wait anyway. On the other hand
> everyone hates introducing new kernel interfaces that need to be kept
> compatible until nirvana - especially for bug tracking its not the best
> idea :-) So it should be our very last resort.
>
Agreed.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
I'll keep the old thread below as reference.
After taking a round of ensuring reproducibility and a pile of new
measurements, I can now come back with several new insights.
FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
threads) with a sequential read load, and all of that 4 times to find
potential noise. Since I changed to that load instead of random read with
one thread, and ensure that most memory is cleared beforehand (sync + echo 3 >
/proc/sys/vm/drop_caches + a few sleeps), the noise is now down to <2%.
For detailed questions about the setup feel free to ask me directly, as I
won't flood this thread too much with such details.
So in the past I identified git id e084b as bringing a huge
degradation; that is still true, but I need to revert my old statement
that unapplying e084b on v2.6.32 helps - it simply doesn't.
Another bisect (keeping e084b reverted) brought up git id 5f8dcc21, which
came in later. Each patch unapplied individually doesn't improve
anything. But both patches reverted at the same time on git v2.6.32
bring us back to our old values (remember, that is more than 2 Gb/s
improvement in throughput)!
Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how it
can cause so much trouble.
In the past I identified congestion_wait as the point where the "time is
lost" when comparing good and bad case. Fortunately at least this is
still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
So I upgraded my old counter patches a bit and can now report the following:
BAD CASE                                                         4 THREAD READ   8 THREAD READ   16 THREAD READ   16THR % portions
perf_count_congestion_wait                                                  305            1970             8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority                0               0                0
perf_count_call_congestion_wait_from_alloc_pages_slowpath                   305            1970             8979            100.00%
perf_count_pages_direct_reclaim                                            1153            6818             3221
perf_count_failed_pages_direct_reclaim                                      305            1556             8979
perf_count_failed_pages_direct_reclaim_but_progress                         305            1478             8979             27.87%

GOOD CASE WITH REVERTS                                           4 THREAD READ   8 THREAD READ   16 THREAD READ   16THR % portions
perf_count_congestion_wait                                                   25              76             1114
perf_count_call_congestion_wait_from_alloc_pages_high_priority                0               0                0
perf_count_call_congestion_wait_from_alloc_pages_slowpath                    25              76             1114             99.98%
perf_count_pages_direct_reclaim                                            1054            9150            62297
perf_count_failed_pages_direct_reclaim                                       25              64             1114
perf_count_failed_pages_direct_reclaim_but_progress                          25              57             1114              1.79%
I hope the format is kept, it should be good with every monospace viewer.
You can all clearly see that the patches somehow affect the ratio at
which __alloc_pages_direct_reclaim runs into the race between
try_to_free_pages, which could actually free something, and the call a few
lines below that can't get a page from the free list.
Further out, in the function __alloc_pages_slowpath, this leads to a call to
congestion_wait, which is absolutely futile as there are absolutely no
writes in flight to wait for.
Now this effectively kills up to 80% of our throughput - any idea for
better understanding the link between the patches and this effect is
welcome and might lead to a solution.
FYI - I tried all patches you suggested - none of them affects this.
Updated perf counter patches are attached.
As for some questions in the old thread:
>> Back to business :-)
>> First - yes it is reproducible in 2.6.32 final and fixable by unapplying
>> e084b.
>> But I don't think un-applying it is really a fix as no one really
>> understands how e084b causes this regression (it might just work around
>> whatever goes on in the background and someone still hits the hidden
>> issue).
>>
>>
>
> Agreed on reverting is not an option. While it is not understood why it
> causes your regression, it's known to fix a regression for hardware that
> does do merging.
>
I'm not disputing that point, but fairly compared it is +15% vs. -80% - so it
is not "only" good :-)
>> Lets better look what we know summarized:
>> - the patch e084b causes this regression
>> - it causes it only in very rare, if not purely theoretical, high memory
>> constraint situations
>> - time is lost between the read system call and entry into the block device layer
>> - via debugging and discussion we found that the time lost is spent in
>> congestion_wait
>>
>>
>
> Thanks for the summary.
>
> There is an outside chance it's due to a difference in free page availability
> and the associated counters. Particularly if some subsystem is optimistically
> trying high-order allocations depending on availability. If there are many
> allocation failures, the counters get skewed due to a bug, watermarks are
> not met and direct reclaim is used more. A patch is on its way upstream to
> fix this: http://www.spinics.net/lists/mm-commits/msg75671.html . Can you
> try it please?
>
> There is a report on swap-based workloads being impacted on 2.6.32 and a fix
> exists but it wouldn't have affected 2.6.31. If you are testing on 2.6.32
> though, this patch should be applied http://lkml.org/lkml/2009/12/21/245
>
both are not affecting my scenario at all
>
>> I did not yet check if it is possible, but we might have a spot that
>> might fix the issue.
>> Congestion_wait states it would "Waits for up to @timeout jiffies for a
>> backing_dev (any backing_dev) to exit write congestion. If no
>> backing_devs are congested then just wait for the next write to be
>> completed.".
>> Now in some cases (including my test) there is absolutely no write to do
>> or in flight. So maybe it would be possible to either detect this before
>> calling congestion_wait from page_alloc.c, or let congestion_wait return
>> an error code or something similar.
>>
>>
>
> While I think that idea has merit, it's fixing the symptons and not the
> underlying problem. There is a growing opinion that congestion_wait() being
> used in the VM at all is a mistake. If the VM really needs to wait on pages
> being written out, it should have been on a waitqueue for a number of pages
> rather than a time-based method.
>
> In other words, I'm more interested in figuring out *why* we enter
> congestion_wait() at all with the patch and why it's avoided without it,
> rather than what congestion_wait() does when it gets called.
>
> Also, are you sure about the "waiting for the next write to complete"? In
> 2.6.31 at least, it's waiting on the async write queue. If it's the case that
> it waits the full timeout when there is no async traffic then that's obviously
> bad as well, but it still doesn't explain why the problem patch makes a difference.
>
I agree that this part of the discussion is about symptoms, and for now
we should try to focus on the cause that actually brings up the race in
__alloc_pages_direct_reclaim.
>> I mean as far as I can see the kernel currently does a loop like that
>> (pseudo code and simplified):
>> 1. get a page <- fail
>> 2. try_to_free page <- returns >=1 pages freed
>> 3. get_page_from_freelist <- fails
>> 4. should_alloc_retry <- true
>> 5. congestion_wait <- FUTILE - just losing time, nothing will be freed
>> by writes
>> 6. goto 1
>> -> So maybe we should detect the futility of that congestion_wait call
>> and not do it at all - instead retry instantly, go to oom, ..., whatever
>> (that's open for discussion).
>>
>> If we can't fix it that way I would propose we keep the code as is,
>>
>
> It's not my preferred way to fix this although I'm willing to explore it.
> I'd much prefer an explanation as to why your testcase is sensitive to the
> PFN ordering of the pages it gets back.
>
> One possibility is that your testcase requires a number of
> high-order allocations and the patch is preventing them being
> merged. If that was the case, the following patch would also help
> http://www.spinics.net/lists/mm-commits/msg75672.html but it's less than ideal.
>
>
again not affecting my workload at all
>> hoping that it is a theoretical case that never hits a customer system.
>> But in that case we should definitely add a userspace readable counter
>> to congestion_wait similar to my debug patches. This would allow
>> everyone that ever assumes an issue might be related to this to verify
>> it by checking the congestion_wait counter.
>>
>
> Would the accounting features available with
> CONFIG_TASK_DELAY_ACCT be sufficient? There is a description in
> Documentation/accounting/delay-accounting.txt on how to use it. An abnormal
> amount of time spent on IO might explain the problem.
>
>
Eventually it will appear as a huge increase in I/O wait for no obvious
reason - DELAY_ACCT is fine, actually even plain iostat would be enough.
The congestion_wait counter would just help to tell this issue apart from
others that also increase I/O wait - but that is not too important.
I think we can all live without a counter and rather fix the cause :-)
>> This should not be too hard in a technical way, but also not performance
>> critical as congestion_wait is going to wait anyway. On the other hand
>> everyone hates introducing new kernel interfaces that need to be kept
>> compatible until nirvana - especially for bug tracking its not the best
>> idea :-) So it should be our very last resort.
>>
>>
>
> Agreed.
>
>
--
Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
> I'll keep the old thread below as reference.
>
> After taking a round of ensuring reproducibility and a pile of new
> measurements I now can come back with several new insights.
>
> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
> threads) with sequential read load and all that 4 times to find
> potential noise. But since I changed to that load instead of random read
> with one thread and ensuring the most memory is cleared (sync + echo 3 >
> /proc/sys/vm/drop_caches + a few sleeps). The noise is now down <2%.
> For detailed questions about the setup feel free to ask me directly as I
> won't flood this thread too much with such details.
>
Is there any chance you have a driver script for the test that you could send
me? I'll then try reproducing based on that script and see what happens. I'm
not optimistic I'll be able to reproduce the problem because I think
it's specific to your setup but you never know.
> So in the past I identified git id e084b for bringing a huge
> degradation, that is still true, but I need to revert my old statement
> that unapplying e084b on v2.6.32 helps - it simply doesn't.
>
Hmm, ok. Odd that it used to work but now doesn't.
How reproducible are these results with patch e084b reverted? i.e. I know
it does not work now, but did reverting on the old kernel always fix it
or were there occasions where the figures would be still bad?
> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 which
> came in later. Both patches unapplied individually don't improve
> anything. But both patches reverted at the same time on git v2.6.32
> bring us back to our old values (remember that is more than 2Gb/s
> improvement in throughput)!
>
> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this
> can cause so much trouble.
>
There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
workloads can suffer. A second is that users of high-order allocations can
enter direct reclaim a lot more than previously. This was fixed in commit
a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
your case.
The only other interesting thing in commit 5f8dcc21 is that it increases
the memory overhead of a per-cpu structure. If your memory usage is really
borderline, it might have been just enough to push it over the edge.
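For reference, the structure in question is struct per_cpu_pages. From
memory (simplified, so treat it as a sketch), 5f8dcc21 turned its single
list into one list per migrate type:

/* before 5f8dcc21 (roughly) */
struct per_cpu_pages {
	int count;		/* number of pages in the list */
	int high;		/* high watermark, emptying needed */
	int batch;		/* chunk size for buddy add/remove */
	struct list_head list;	/* the single hot/cold list */
};

/* after 5f8dcc21 (roughly) */
struct per_cpu_pages {
	int count;
	int high;
	int batch;
	/* one list per migrate type kept on the pcp lists */
	struct list_head lists[MIGRATE_PCPTYPES];
};

Only a couple of extra list_heads per pcp structure, but it also means the
free pages a CPU holds are now spread over several lists.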
> In the past I identified congestion_wait as the point where the "time is
> lost" when comparing good and bad case. Fortunately at least this is
> still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
> So I upgraded my old counter patches a bit and can now report the following:
>
> BAD CASE
> 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
> perf_count_congestion_wait 305 1970 8980
> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
> perf_count_pages_direct_reclaim 1153 6818 3221
> perf_count_failed_pages_direct_reclaim 305 1556 8979
Something is wrong with the counters there. For 16 threads, it says direct
reclaim was entered 3221 but it failed 8979 times. How does that work? The
patch you supplied below looks like a patch on top of another debugging patch.
Can you send the full debugging patch please?
> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>
> GOOD CASE WITH REVERTS 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
> perf_count_congestion_wait 25 76 1114
> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath 25 76 1114 99.98%
> perf_count_pages_direct_reclaim 1054 9150 62297
> perf_count_failed_pages_direct_reclaim 25 64 1114
> perf_count_failed_pages_direct_reclaim_but_progress 25 57 1114 1.79%
>
>
> I hope the format is kept, it should be good with every monospace viewer.
It got mangled but I think I've fixed it above. The biggest thing I can see
is that direct reclaim is a lot more successful with the patches reverted but
that in itself doesn't make sense. Neither patch affects how many pages should
be free or reclaimable - just what order they are allocated and freed in.
With both patches reverted, is the performance 100% reliable or does it
sometimes vary?
If reverting 5f8dcc21 is required, I'm leaning towards believing that the
additional memory overhead with that patch is enough to push this workload
over the edge where entering direct reclaim is necessary a lot more to keep
the zones over the min watermark. You mentioned early on that adding 64MB
to the machine makes the problem go away. Do you know what the cut-off point
is? For example, is adding 4MB enough?
If a small amount of memory does help, I'd also test with min_free_kbytes
at a lower value. If reducing it restores the performance, it again implies
that memory usage is the real problem.
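As a reminder of what that tunable controls: min_free_kbytes is translated
into the per-zone watermarks roughly like this (a simplified sketch of
2.6.32's setup_per_zone_wmarks() from memory, not the exact code):

/* pages_min: global min_free_kbytes expressed in pages */
unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);

for_each_zone(zone) {
	/* each zone gets a share proportional to its size;
	 * lowmem_pages is the total of non-highmem present pages */
	unsigned long min = pages_min * zone->present_pages / lowmem_pages;

	zone->watermark[WMARK_MIN]  = min;		/* direct reclaim fights to stay above this */
	zone->watermark[WMARK_LOW]  = min + (min >> 2);	/* kswapd is woken below this */
	zone->watermark[WMARK_HIGH] = min + (min >> 1);	/* kswapd reclaims up to this */
}

So lowering min_free_kbytes directly lowers the point at which the allocator
has to resort to direct reclaim, which should tell us whether this workload
is simply running too close to the min watermark.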
Thing is, even if adding small amounts of memory or reducing
min_free_kbytes helps, it does not explain why e084b ever made a
difference because all that patch does is alter the ordering of pages on
the free list. That might have some cache consequences but it would not
impact direct reclaim like this.
> You can all clearly see that the patches somehow affect the ratio at
> which __alloc_pages_direct_reclaim runs into a race between try_to_free
> pages that could actually free something, but a few lines below can't
> get a page from the free list.
There actually is potentially a significant delay between when direct
reclaim returns and when a new allocation attempt happens. However, it's been
like this for a long time so I don't think it's the issue.
I have a prototype patch that removes usage of congestion_wait() altogether
and puts processes to sleep on a waitqueue waiting for watermarks to restore
or a timeout. However, I think it would be just treating the symptoms here
rather than the real issue.
> Outside in the function alloc_pages_slowpath this leads to a call to
> congestion_wait which is absolutely futile as there are absolutely no
> writes in flight to wait for.
>
> Now this kills effectively up to 80% of our throughput - Any idea of
> better understanding the link between the patches and the effect is
> welcome and might lead to a solution.
>
> FYI - I tried all patches you suggested - none of them affects this.
>
I'm still at a total loss to explain this. Memory overhead of the second
patch is a vague possibility and worth checking out by slightly increasing
available memory on the partition or reducing min_free_kbytes. It does not
explain why the first patch makes a difference.
> <SNIP>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>
>> I'll keep the old thread below as reference.
>>
>> After taking a round of ensuring reproducibility and a pile of new
>> measurements I now can come back with several new insights.
>>
>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
>> threads) with sequential read load and all that 4 times to find
>> potential noise. But since I changed to that load instead of random read
>> with one thread and ensuring the most memory is cleared (sync + echo 3 >
>> /proc/sys/vm/drop_caches + a few sleeps). The noise is now down <2%.
>> For detailed questions about the setup feel free to ask me directly as I
>> won't flood this thread too much with such details.
>>
>>
>
> Is there any chance you have a driver script for the test that you could send
> me? I'll then try reproducing based on that script and see what happens. I'm
> not optimistic I'll be able to reproduce the problem because I think
> it's specific to your setup but you never know.
>
I don't have one as it runs in a bigger automated test environment, but
it is easy enough to write down something comparable.
>> So in the past I identified git id e084b for bringing a huge
>> degradation, that is still true, but I need to revert my old statement
>> that unapplying e084b on v2.6.32 helps - it simply doesn't.
>>
>>
>
> Hmm, ok. Odd that it used to work but now doesn't.
>
> How reproducible are these results with patch e084b reverted? i.e. I know
> it does not work now, but did reverting on the old kernel always fix it
> or were there occasions where the figures would be still bad?
>
Reverting e084b in the past showed something like +40%, so I thought it
helped.
Afterwards I found out it wasn't a good testcase for reproducibility, and
when looking at a series it gave +5%, -7%, +20%, ...
So it was by far too noisy to be useful for bisecting, discussion or fix testing.
That's why I made my big round of reinventing the testcases in a more
reproducible way as described above.
With those the original issue is still very visible - which means 4 runs
comparing 2.6.32 vs 2.6.32-Reve084b are each within max +/-1%,
while gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>
>> Another bisect (keeping e084b reverted) brought up git id 5f8dcc21 which
>> came in later. Both patches unapplied individually don't improve
>> anything. But both patches reverted at the same time on git v2.6.32
>> bring us back to our old values (remember that is more than 2Gb/s
>> improvement in throughput)!
>>
>> Unfortunately 5f8dcc21 is as unobvious as e084b in explaining how this
>> can cause so much trouble.
>>
>
> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
> workloads can suffer. A second is that users of high-order allocations can
> enter direct reclaim a lot more than previously. This was fixed in commit
> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
> your case.
>
> The only other interesting thing in commit 5f8dcc21 is that it increases
> the memory overhead of a per-cpu structure. If your memory usage is really
> borderline, it might have been just enough to push it over the edge.
>
I had these thoughts too, but as the effect only shows up with 5f8dcc21
and e084b reverted at the same time, I'm wondering if it can be that.
But I agree that this should be considered too until a verification can
be done.
>
>> In the past I identified congestion_wait as the point where the "time is
>> lost" when comparing good and bad case. Fortunately at least this is
>> still true when comparing 2.6.32 vs 2.6.32 with both patches reverted.
>> So I upgraded my old counter patches a bit and can now report the following:
>>
>> BAD CASE
>> 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>> perf_count_congestion_wait 305 1970 8980
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
>> perf_count_pages_direct_reclaim 1153 6818 3221
>> perf_count_failed_pages_direct_reclaim 305 1556 8979
>>
>
> Something is wrong with the counters there. For 16 threads, it says direct
> reclaim was entered 3221 but it failed 8979 times. How does that work? The
> patch you supplied below looks like a patch on top of another debugging patch.
> Can you send the full debugging patch please?
>
This is just a single '7' that was accidentally deleted when bringing the
table into plain text format for this mail. The percentage is fine, and
properly it would look like:
4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
perf_count_congestion_wait 305 1970 8980
perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
perf_count_pages_direct_reclaim 1153 6818 32217
perf_count_failed_pages_direct_reclaim 305 1556 8979
perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
There were three patches attached in the last mail which should work in
this order:
~/kernels/linux-2.6$ git checkout -f v2.6.32
~/kernels/linux-2.6$ patch -p1 <
~/Desktop/patches-git/perf-count-congestion-wait.diff
patching file include/linux/sysctl.h
patching file kernel/sysctl.c
patching file mm/backing-dev.c
~/kernels/linux-2.6$ patch -p1 <
~/Desktop/patches-git/perf-count-failed-direct-recalims.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1670 (offset 9 lines).
Hunk #2 succeeded at 1709 (offset 9 lines).
~/kernels/linux-2.6$ patch -p1 <
~/Desktop/patches-git/perf-count-congestion-wait-calls.diff
patching file kernel/sysctl.c
patching file mm/page_alloc.c
Hunk #1 succeeded at 1672 (offset 9 lines).
Hunk #2 succeeded at 1714 (offset 9 lines).
Hunk #3 succeeded at 1738 (offset 9 lines).
Hunk #4 succeeded at 1796 (offset 9 lines).
Hunk #5 succeeded at 1913 (offset 9 lines).
Offsets are due to e084b/5f8dcc21 being reverted or not, it works in
both cases.
Let me know if they work out for you that way.
>
>> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>>
>> GOOD CASE WITH REVERTS 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>> perf_count_congestion_wait 25 76 1114
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 25 76 1114 99.98%
>> perf_count_pages_direct_reclaim 1054 9150 62297
>> perf_count_failed_pages_direct_reclaim 25 64 1114
>> perf_count_failed_pages_direct_reclaim_but_progress 25 57 1114 1.79%
>>
>>
>> I hope the format is kept, it should be good with every monospace viewer.
>>
>
> It got mangled but I think I've fixed it above. The biggest thing I can see
> is that direct reclaim is a lot more successful with the patches reverted but
> that in itself doesn't make sense. Neither patch affects how many pages should
> be free or reclaimable - just what order they are allocated and freed in.
>
> With both patches reverted, is the performance 100% reliable or does it
> sometimes vary?
>
It is 100% reliable now - reliably bad with plain 2.6.32, as well
as reliably much better (e.g. +40% @ 16 threads) with both reverted.
> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
> additional memory overhead with that patch is enough to push this workload
> over the edge where entering direct reclaim is necessary a lot more to keep
> the zones over the min watermark. You mentioned early on that adding 64MB
> to the machine makes the problem go away. Do you know what the cut-off point
> is? For example, is adding 4MB enough?
>
That was another one probably only seen due to the lack of good
reproducibility in my old setup.
I made bigger memory scaling tests with the new setup. For that I ran
the workload with 160 to 356 megabytes in 32mb steps (256m is the default
in all other runs).
The result is that more than 256m memory only brings a slight benefit
(which is ok as the load doesn't reread the data read into page cache
and it just doesn't need much more).
Here is some data about scaling memory, normalized to the 256m values:
- deviation to 256m case - 160m 192m 224m 256m 288m 320m 356m
plain 2.6.32 +9.12% -55.11% -17.15% =100% +1.46% +1.46% +1.46%
2.6.32 - 5f8dcc21 and e084b reverted +6.95% -6.95% -0.80% =100% +2.14% +2.67% +2.67%
------------------------------------------------------------------------------------------------
deviation between each other (all +) 60.64% 182.93% 63.44% 36.50% 37.41% 38.13% 38.13%
What can be seen:
a) more memory brings up to +2.67%/+1.46%, but no more when memory is
increased further (reasonable as we don't reread cached files)
b) decreasing memory drops performance by up to -6.95% @ -64mb with both
patches reverted, but by down to -55% @ -64mb on plain 2.6.32 (compared to
the already much lower 256m throughput)
-> plain 2.6.32 is much more sensitive to lower available memory AND
always a level below
c) there is a weird but very reproducible improvement with even lower
memory - very probably another issue, or rather another solution, and not
related here - but who knows :-)
-> still, both reverted is always much better than plain 2.6.32
> If a small amount of memory does help, I'd also test with min_free_kbytes
> at a lower value? If reducing it restores the performance, it again implies
> that memory usage is the real problem.
>
I'll run a few tests with bigger/smaller numbers of min_free_kbytes just
to be sure.
> Thing is, even if adding small amounts of memory or reducing
> min_free_kbytes helps, it does not explain why e084b ever made a
> difference because all that patch does is alter the ordering of pages on
> the free list. That might have some cache consequences but it would not
> impact direct reclaim like this.
>
>
>> You can all clearly see that the patches somehow affect the ratio at
>> which __alloc_pages_direct_reclaim runs into a race between try_to_free
>> pages that could actually free something, but a few lines below can't
>> get a page from the free list.
>>
>
> There actually is potentially a significant delay from when direct
> reclaim returns and a new allocation attempt happens. However, it's been
> like this for a long time so I don't think it's the issue.
>
Due to the increased percentage of "progress, but !page" in direct_reclaim
I still think it is related.
I agree that merely *having* that potential race between try_to_free and
get_page is not the issue, but I assume it is very likely that the
patches change the timing there so that it happens much more often.
> I have a prototype patch that removes usage of congestion_wait() altogether
> and puts processes to sleep on a waitqueue waiting for watermarks to restore
> or a timeout. However, I think it would be just treating the symptoms here
> rather than the real issue.
>
Anyway I'd be happy to test this patch if you could send it to me,
because as long as we don't have a "real" solution I like such
symptom fixes as a start :-)
At least it would avoid going into congestion_wait even though there are no
dirty pages or I/Os in flight - waiting for watermarks sounds much better to
me, independent of the current issue.
>> Outside in the function alloc_pages_slowpath this leads to a call to
>> congestion_wait which is absolutely futile as there are absolutely no
>> writes in flight to wait for.
>>
>> Now this kills effectively up to 80% of our throughput - Any idea of
>> better understanding the link between the patches and the effect is
>> welcome and might lead to a solution.
>>
>> FYI - I tried all patches you suggested - none of them affects this.
>>
>>
>
> I'm still at a total loss to explain this. Memory overhead of the second
> patch is a vague possibility and worth checking out by slightly increasing
> available memory on the partition or reducing min_free_kbytes. It does not
> explain why the first patch makes a difference.
>
In a discussion with Hannes Reinecke ([email protected]) he brought up that
in a low memory scenario the ordering of the pages might get twisted.
Today we have a single list of pages - hot ones are added to the head and
cold ones to the tail. Both patches affect that list management - e084b
changes the order of some, and 5f8dcc splits the list per migrate type.
What could now happen is that you free pages and get a list like:
H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
In low memory scenarios it could then happen that an allocation for hot pages
falls back to cold pages (as the usual fallback), but couldn't it be that it
then also gets the cold pages in reverse order again?
Without e084b it would be like:
H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with
the issue that cold allocations from the right are reversed -> e084b)
5f8dcc now lowers the size of those lists by splitting them into
different migration types, so maybe both together increase the
chance of hitting such an issue.
I hope I explained that idea right, Hannes please feel free to correct
and extend if needed.
I also added Nick on his request.
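To put the list handling we are talking about next to real code, this is
roughly what the free side looks like on 2.6.32 with 5f8dcc applied, as far
as I read it (simplified, not the exact kernel code):

static void free_hot_cold_page(struct page *page, int cold)
{
	struct zone *zone = page_zone(page);
	struct per_cpu_pages *pcp;
	int migratetype = get_pageblock_migratetype(page);

	/* ... checks, accounting and locking left out ... */
	pcp = &zone_pcp(zone, get_cpu())->pcp;
	if (cold)
		list_add_tail(&page->lru, &pcp->lists[migratetype]);	/* cold -> tail */
	else
		list_add(&page->lru, &pcp->lists[migratetype]);		/* hot  -> head */
	pcp->count++;
	/* ... drain back to the buddy lists once count exceeds high ... */
}

while buffered_rmqueue() takes hot allocations from the head and cold
(__GFP_COLD) ones from the tail of the same per-migratetype list - that is
where the twisted ordering described above could come from.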
>> <SNIP>
>>
>
>
--
Grüße / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>
>
> Mel Gorman wrote:
>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>
>>> I'll keep the old thread below as reference.
>>>
>>> After taking a round of ensuring reproducibility and a pile of new
>>> measurements I now can come back with several new insights.
>>>
>>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
>>> threads) with sequential read load and all that 4 times to find
>>> potential noise. But since I changed to that load instead of random
>>> read wit hone thread and ensuring the most memory is cleared (sync +
>>> echo 3 > /proc/sys/vm/drop_caches + a few sleeps) . The noise is now
>>> down <2%. For detailed questions about the setup feel free to ask me
>>> directly as I won't flood this thread too much with such details.
>>>
>>>
>>
>> Is there any chance you have a driver script for the test that you could send
>> me? I'll then try reproducing based on that script and see what happens. I'm
>> not optimistic I'll be able to reproduce the problem because I think
>> it's specific to your setup but you never know.
>>
>
> I don't have one as it runs in a bigger automated test environment, but
> it is easy enough to write down something comparable.
I'd appreciate it, thanks.
>>> So in the past I identified git id e084b for bringing a huge
>>> degradation, that is still true, but I need to revert my old
>>> statement that unapplying e084b on v2.6.32 helps - it simply
>>> doesn't.
>>>
>>>
>>
>> Hmm, ok. Odd that it used to work but now doesn't.
>>
>> How reproducible are these results with patch e084b reverted? i.e. I know
>> it does not work now, but did reverting on the old kernel always fix it
>> or were there occasions where the figures would be still bad?
>>
>
> Reverting e084b in the past showed something like +40%, so I thought it
> helps.
> Afterwards I found out it wasn't a good testcase for reproducibility and
> when looking at a series it had +5%,-7%,+20%,...
> So by far too noisy to be useful for bisect, discussion or fix testing.
> Thats why I made my big round reinventing the testcases in a more
> reproducible way like described above.
> In those the original issue is still very visible - which means 4 runs
> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
About all e084b can be affecting is cache hotness when fallback is
occurring, but a 60% regression still seems way too high for just that.
>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21
>>> which came in later. Both patches unapplied individually don't
>>> improve anything. But both patches reverted at the same time on git
>>> v2.6.32 bring us back to our old values (remember that is more than
>>> 2Gb/s improvement in throughput)!
>>>
>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this
>>> can cause so much trouble.
>>>
>>
>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>> workloads can suffer. A second is that users of high-order allocations can
>> enter direct reclaim a lot more than previously. This was fixed in commit
>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>> your case.
>>
>> The only other interesting thing in commit 5f8dcc21 is that it increases
>> the memory overhead of a per-cpu structure. If your memory usage is really
>> borderline, it might have been just enough to push it over the edge.
>>
>
> I had this thoughts too, but as it only shows an effect with 5f8dcc21
> and e084b reverted at the same time I'm wondering if it can be that.
> But I agree that this should be considered too until a verification can
> be done.
>>
Another consequence of 5f8dcc is that it is possible for pages to be on the
per-cpu lists when the page allocator is called. There is potential that
page reclaim is being entered as a result even though pages were free on
the per-cpu lists.
It would not explain why reverting e084b makes such a difference, but I
can prototype a patch that falls back to the other per-cpu lists before
calling into the page allocator proper, as it used to do, but without
linearly searching the per-cpu lists.
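Not the actual patch, but to make the idea concrete, the fallback could be
something along these lines in buffered_rmqueue() before giving up on the
pcp lists (purely illustrative, pcp_fallback is a made-up helper name):

/* Illustrative only: look for a spare page on the other pcp lists */
static struct page *pcp_fallback(struct per_cpu_pages *pcp, int migratetype)
{
	int mt;

	for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
		struct list_head *list = &pcp->lists[mt];
		struct page *page;

		if (mt == migratetype || list_empty(list))
			continue;

		/* take from the head, like a hot allocation would */
		page = list_entry(list->next, struct page, lru);
		list_del(&page->lru);
		pcp->count--;
		return page;
	}

	return NULL;
}

That avoids the old linear search within a list while still draining
whatever free pages the CPU is sitting on before we fall into the slow path.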
>>> In the past I identified congestion_wait as the point where the "time
>>> is lost" when comparing good and bad case. Fortunately at least this
>>> is still true when comparing 2.6.32 vs 2.6.32 with both patches
>>> reverted.
>>> So I upgraded my old counter patches a bit and can now report the following:
>>>
>>> BAD CASE
>>> 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>>> perf_count_congestion_wait 305 1970 8980
>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
>>> perf_count_pages_direct_reclaim 1153 6818 3221
>>> perf_count_failed_pages_direct_reclaim 305 1556 8979
>>>
>>
>> Something is wrong with the counters there. For 16 threads, it says direct
>> reclaim was entered 3221 but it failed 8979 times. How does that work? The
>> patch you supplied below looks like a patch on top of another debugging patch.
>> Can you send the full debugging patch please?
>>
>
> This is just a single 7 deleted accidentally when bringing it to a plain
> text format for this mail. The percentage is fine and properly it would
> look like:
>
> 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>
> perf_count_congestion_wait 305 1970 8980
> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
> perf_count_pages_direct_reclaim 1153 6818 32217
> perf_count_failed_pages_direct_reclaim 305 1556 8979
> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>
>
> There were three patches attached in the last mail which should work in
> this order:
>
Bah. I was reading just an automatically-inlined version and didn't notice
the other two attachments. Sorry.
> ~/kernels/linux-2.6$ git checkout -f v2.6.32
> ~/kernels/linux-2.6$ patch -p1 <
> ~/Desktop/patches-git/perf-count-congestion-wait.diff
> patching file include/linux/sysctl.h
> patching file kernel/sysctl.c
> patching file mm/backing-dev.c
> ~/kernels/linux-2.6$ patch -p1 <
> ~/Desktop/patches-git/perf-count-failed-direct-recalims.diff
> patching file kernel/sysctl.c
> patching file mm/page_alloc.c
> Hunk #1 succeeded at 1670 (offset 9 lines).
> Hunk #2 succeeded at 1709 (offset 9 lines).
> ~/kernels/linux-2.6$ patch -p1 <
> ~/Desktop/patches-git/perf-count-congestion-wait-calls.diff
> patching file kernel/sysctl.c
> patching file mm/page_alloc.c
> Hunk #1 succeeded at 1672 (offset 9 lines).
> Hunk #2 succeeded at 1714 (offset 9 lines).
> Hunk #3 succeeded at 1738 (offset 9 lines).
> Hunk #4 succeeded at 1796 (offset 9 lines).
> Hunk #5 succeeded at 1913 (offset 9 lines).
>
> Offsets are due to e084b/5f8dcc21 being reverted or not, it works in
> both cases.
> Let me know if they work out for you that way.
>>
>>> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>>>
>>> GOOD CASE WITH REVERTS 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>>> perf_count_congestion_wait 25 76 1114
>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 25 76 1114 99.98%
>>> perf_count_pages_direct_reclaim 1054 9150 62297
>>> perf_count_failed_pages_direct_reclaim 25 64 1114
>>> perf_count_failed_pages_direct_reclaim_but_progress 25 57 1114 1.79%
>>>
>>>
>>> I hope the format is kept, it should be good with every monospace viewer.
>>>
>>
>> It got mangled but I think I've fixed it above. The biggest thing I can see
>> is that direct reclaim is a lot more successful with the patches reverted but
>> that in itself doesn't make sense. Neither patch affects how many pages should
>> be free or reclaimable - just what order they are allocated and freed in.
>>
>> With both patches reverted, is the performance 100% reliable or does it
>> sometimes vary?
>>
>
> It is 100% percent reliable now - reliably bad with plain 2.6.32 as well
> as reliably much better (e.g. +40% @ 16threads) with both reverted.
>
Ok.
>> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
>> additional memory overhead with that patch is enough to push this workload
>> over the edge where entering direct reclaim is necessary a lot more to keep
>> the zones over the min watermark. You mentioned early on that adding 64MB
>> to the machine makes the problem go away. Do you know what the cut-off point
>> is? For example, is adding 4MB enough?
>>
> That was another one probably only seen due to the lack of good
> reproducibility in my old setup.
Ok.
> I made bigger memory scaling tests with the new setup. Therefore I ran
> the workload with 160 to 356 megabytes in 32mb steps (256 is the default
> in all other runs).
> The result is that more than 256m memory only brings a slight benefit
> (which is ok as the load doesn't reread the data read into page cache
> and it just doesn't need much more).
>
> Here some data about scaling memory normalized to the 256m memory values:
> - deviation to 256m case - 160m 192m 224m 256m 288m 320m 356m
> plain 2.6.32 +9.12% -55.11% -17.15% =100% +1.46% +1.46% +1.46%
> 2.6.32 - 5f8dcc21 and e084b reverted +6.95% -6.95% -0.80% =100% +2.14% +2.67% +2.67%
> ------------------------------------------------------------------------------------------------
> deviation between each other (all +) 60.64% 182.93% 63.44% 36.50% 37.41% 38.13% 38.13%
> What can be seen:
> a) more memory brings up to +2.67%/+1.46%, but not more when further
> increasing memory (reasonable as we don't reread cached files)
> b) decreasing memory drops performance by up to -6.95% @ -64mb with both
> reverted, but down to -55% @ -64mb (compared to the already much lower
> 256m throughput)
> -> plain 2.6.32 is much more sensible to lower available memory AND
> always a level below
> c) there is a weird but very reproducible improvement with even lower
> memory - very probably another issue or better another solution and not
> related here - but who knows :-)
> -> still both reverted is always much better than plain 2.6.32
>
Ok. Two avenues of attack then although both are focused on 5f8dcc21.
The first is the prototype below that should avoid congestion_wait. The
second is to reintroduce a fallback to other per-cpu lists and see if
that was a big factor.
>> If a small amount of memory does help, I'd also test with min_free_kbytes
>> at a lower value? If reducing it restores the performance, it again implies
>> that memory usage is the real problem.
>>
>
> I'll run a few tests with bigger/smaller numbers of min_free_kbytes just
> to be sure.
>> Thing is, even if adding small amounts of memory or reducing
>> min_free_kbytes helps, it does not explain why e084b ever made a
>> difference because all that patch does is alter the ordering of pages on
>> the free list. That might have some cache consequences but it would not
>> impact direct reclaim like this.
>>
>>
>>> You can all clearly see that the patches somehow affect the ratio at
>>> which __alloc_pages_direct_reclaim runs into a race between
>>> try_to_free pages that could actually free something, but a few
>>> lines below can't get a page from the free list.
>>>
>>
>> There actually is potentially a significant delay from when direct
>> reclaim returns and a new allocation attempt happens. However, it's been
>> like this for a long time so I don't think it's the issue.
>>
> Due to the increased percentage of progress, but !page in direct_reclaim
> I still think it is related.
Probably yes, but unfortunately e084b has very little to do with
that logic, although 5f8dcc21 might indirectly be affecting it.
> I agree that "having" that potential race between try_to_free and
> get_page is not the issue, but I assume it is very likely that the
> patches change the timing there so that it happens much more often.
>
Patch e084b would make very little difference to the timing of events
though, except from a cache perspective, and the slowdown is too severe
for that.
>> I have a prototype patch that removes usage of congestion_wait() altogether
>> and puts processes to sleep on a waitqueue waiting for watermarks to restore
>> or a timeout. However, I think it would be just treating the symptoms here
>> rather than the real issue.
>>
>
> Anyway I'd be happy to test this patch if you could sent it to me,
> because as long as we don't have a "real" solution I like such
> symptom-fixes as a start :-)
> At least it would fix going into congestion_wait even without dirty
> pages or I/O's in flight - waiting for watermarks sounds much better to
> me independent to the current issue.
Hmm, the patch as it currently stands is below. However, I warn you that
it has only been boot-tested on qemu with no proper testing done, either
functional or performance. It may just go mad, drink your beer and
chew on your dog.
>>> Outside in the function alloc_pages_slowpath this leads to a call to
>>> congestion_wait which is absolutely futile as there are absolutely no
>>> writes in flight to wait for.
>>>
>>> Now this kills effectively up to 80% of our throughput - Any idea of
>>> better understanding the link between the patches and the effect is
>>> welcome and might lead to a solution.
>>>
>>> FYI - I tried all patches you suggested - none of them affects this.
>>>
>>>
>>
>> I'm still at a total loss to explain this. Memory overhead of the second
>> patch is a vague possibility and worth checking out by slightly increasing
>> available memory on the partition or reducing min_free_kbytes. It does not
>> explain why the first patch makes a difference.
>>
> In a discussion with Hannes Reinecke ([email protected]) he brought up that
> in a low memory scenario the ordering of the pages might twist.
> Today we have the single list of pages - add hot ones to the head and
> cold to the tail. Both patches affect that list management - e084b
> changes the order of some, and 5f8dcc is splitting it per migrate type.
Correct on both counts.
> What now could happen is that you free pages and get a list like:
> H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
> In low mem scenarios it could now happen that a allocation for hot pages
> falls back to cold pages (as usual fallback), but couldnt it be that it
> then also gets the cold pages in reverse order again ?
Yes, this is true, but that would only matter from a cache perspective
since, as you say, your hardware and drivers are doing no automatic merging
of requests.
> Without e084b it would be like:
> H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with
> the issue that cold allocations from the right are reverse -> e084b)
> 5f8dcc is now lowering the size of those lists by splitting it into
> different migration types, so maybe both together are increasing the
> chance to get such an issue.
>
There is a good chance that 5f8dcc is causing direct reclaim to be entered
when it could have been avoided as pages were on the per-cpu lists but
I don't think the difference in cache hotness is enough to explain a 60%
loss on your machine.
> I hope I explained that idea right, Hannes please feel free to correct
> and extend if needed.
> I also added Nick on his request.
The prototype patch for avoiding congestion_wait is below. I'll start
work on a fallback-to-other-percpu-lists patch.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 30fe668..72465c1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -398,6 +398,9 @@ struct zone {
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
+ /* queue for processes waiting for pressure to relieve */
+ wait_queue_head_t *pressure_wq;
+
/*
* Discontig memory support fields.
*/
diff --git a/mm/internal.h b/mm/internal.h
index 6a697bb..d447d57 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -251,6 +251,9 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
#define ZONE_RECLAIM_SUCCESS 1
#endif
+extern void check_zone_pressure(struct zone *zone);
+extern long zonepressure_wait(struct zone *zone, long timeout);
+
extern int hwpoison_filter(struct page *p);
extern u32 hwpoison_filter_dev_major;
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..c43d068 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -9,6 +9,7 @@
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/module.h>
+#include <linux/sched.h>
struct pglist_data *first_online_pgdat(void)
{
@@ -87,3 +88,42 @@ int memmap_valid_within(unsigned long pfn,
return 1;
}
#endif /* CONFIG_ARCH_HAS_HOLES_MEMORYMODEL */
+
+void check_zone_pressure(struct zone *zone)
+{
+ /* If no process is waiting, nothing to do */
+ if (!waitqueue_active(zone->pressure_wq))
+ return;
+
+ /* Check if the high watermark is ok for order 0 */
+ if (zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
+ wake_up_interruptible(zone->pressure_wq);
+}
+
+/**
+ * zonepressure_wait - Wait for pressure on a zone to ease off
+ * @zone: The zone that is expected to be under pressure
+ * @timeout: Wait until pressure is relieved or this timeout is reached
+ *
+ * Waits for up to @timeout jiffies for pressure on a zone to be relieved.
+ * It's considered to be relieved if any direct reclaimer or kswapd brings
+ * the zone above the high watermark
+ */
+long zonepressure_wait(struct zone *zone, long timeout)
+{
+ long ret;
+ DEFINE_WAIT(wait);
+
+ prepare_to_wait(zone->pressure_wq, &wait, TASK_INTERRUPTIBLE);
+
+ /*
+ * XXX: Is io_schedule_timeout() really appropriate here? Maybe not
+ * because it'll get accounted for as IO waiting where that is not
+ * really happening any more. Revisit what sort of wait this should
+ * be
+ */
+ ret = io_schedule_timeout(timeout);
+ finish_wait(zone->pressure_wq, &wait);
+
+ return ret;
+}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..e80de7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1734,8 +1734,10 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx, ALLOC_NO_WATERMARKS,
preferred_zone, migratetype);
- if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ if (!page && gfp_mask & __GFP_NOFAIL) {
+ /* If still failing, wait for pressure on zone to relieve */
+ zonepressure_wait(preferred_zone, HZ/50);
+ }
} while (!page && (gfp_mask & __GFP_NOFAIL));
return page;
@@ -1905,8 +1907,8 @@ rebalance:
/* Check if we should retry the allocation */
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
- /* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ /* Too much pressure, back off a bit and let reclaimers do work */
+ zonepressure_wait(preferred_zone, HZ/50);
goto rebalance;
}
@@ -3254,6 +3256,38 @@ int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
return 0;
}
+static noinline __init_refok
+void zone_pressure_wq_cleanup(struct zone *zone)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ size_t free_size = sizeof(wait_queue_head_t);
+
+ if (!slab_is_available())
+ free_bootmem_node(pgdat, __pa(zone->pressure_wq), free_size);
+ else
+ kfree(zone->pressure_wq);
+}
+
+static noinline __init_refok
+int zone_pressure_wq_init(struct zone *zone)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ size_t alloc_size = sizeof(wait_queue_head_t);
+
+ if (!slab_is_available())
+ zone->pressure_wq = (wait_queue_head_t *)
+ alloc_bootmem_node(pgdat, alloc_size);
+ else
+ zone->pressure_wq = kmalloc(alloc_size, GFP_KERNEL);
+
+ if (!zone->pressure_wq)
+ return -ENOMEM;
+
+ init_waitqueue_head(zone->pressure_wq);
+
+ return 0;
+}
+
static int __zone_pcp_update(void *data)
{
struct zone *zone = data;
@@ -3306,9 +3340,15 @@ __meminit int init_currently_empty_zone(struct zone *zone,
{
struct pglist_data *pgdat = zone->zone_pgdat;
int ret;
- ret = zone_wait_table_init(zone, size);
+
+ ret = zone_pressure_wq_init(zone);
if (ret)
return ret;
+ ret = zone_wait_table_init(zone, size);
+ if (ret) {
+ zone_pressure_wq_cleanup(zone);
+ return ret;
+ }
pgdat->nr_zones = zone_idx(zone) + 1;
zone->zone_start_pfn = zone_start_pfn;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..7a9aaae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1709,6 +1709,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
}
shrink_zone(priority, zone, sc);
+ check_zone_pressure(zone);
}
}
@@ -1954,7 +1955,7 @@ static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat, wait_queue_t *wait, int order)
{
int all_zones_ok;
int priority;
@@ -2080,8 +2081,10 @@ loop_again:
* zone has way too many pages free already.
*/
if (!zone_watermark_ok(zone, order,
- 8*high_wmark_pages(zone), end_zone, 0))
+ 8*high_wmark_pages(zone), end_zone, 0)) {
shrink_zone(priority, zone, &sc);
+ }
+ check_zone_pressure(zone);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -2120,8 +2123,11 @@ loop_again:
if (total_scanned && (priority < DEF_PRIORITY - 2)) {
if (has_under_min_watermark_zone)
count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
- else
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ else {
+ prepare_to_wait(&pgdat->kswapd_wait, wait, TASK_INTERRUPTIBLE);
+ schedule_timeout(HZ/10);
+ finish_wait(&pgdat->kswapd_wait, wait);
+ }
}
/*
@@ -2270,7 +2276,7 @@ static int kswapd(void *p)
* after returning from the refrigerator
*/
if (!ret)
- balance_pgdat(pgdat, order);
+ balance_pgdat(pgdat, &wait, order);
}
return 0;
}
> <SNIP>
> The prototype patch for avoiding congestion_wait is below. I'll start
> work on a fallback-to-other-percpu-lists patch.
>
And here is the prototype of the fallback-to-other-percpu-lists patch.
I'm afraid I've only managed to test it on qemu. My three test machines are
still occupied :(
==== CUT HERE ====
page allocator: Fallback to other per-cpu lists when the target list is empty and memory is low
When a per-cpu list of pages for a given migratetype is empty, the page
allocator is called to refill the PCP list. It's possible when memory
is low that this results in the process entering direct reclaim even
if it wasn't strictly necessary because there were pages free for other
migratetypes. Unconditionally falling back to other PCP lists hurts the
fragmentation-avoidance strategy which is also undesirable.
When the desired PCP list is empty, this patch checks how many free pages
there are on the PCP lists and if refilling the list could result in direct
reclaim. If direct reclaim is unlikely, the PCP list is refilled to maintain
fragmentation-avoidance. Otherwise, a page from an alternative PCP list is
chosen to maintain performance and avoid direct reclaim.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++++++---
1 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8deb9d0..009d683 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1168,6 +1168,39 @@ void split_page(struct page *page, unsigned int order)
set_page_refcounted(page + i);
}
+/* Decide whether to find an alternative PCP list or refill */
+static struct list_head *pcp_fallback(struct zone *zone,
+ struct per_cpu_pages *pcp,
+ int start_migratetype, int cold)
+{
+ int i;
+ int migratetype;
+ struct list_head *list;
+ long free_pages = zone_page_state(zone, NR_FREE_PAGES) - pcp->batch;
+
+ /*
+ * Find a PCPU list with free pages in the same order as
+ * fragmentation-avoidance fallback in the event that refilling
+ * the PCP list may result in direct reclaim
+ */
+ if (pcp->count && free_pages <= low_wmark_pages(zone)) {
+ for (i = 0; i < MIGRATE_PCPTYPES - 1; i++) {
+ migratetype = fallbacks[start_migratetype][i];
+ list = &pcp->lists[migratetype];
+
+ if (!list_empty(list))
+ return list;
+ }
+ }
+
+ /* Alternatively, we need to allocate more memory to the PCP lists */
+ list = &pcp->lists[start_migratetype];
+ pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list,
+ start_migratetype, cold);
+
+ return list;
+}
+
/*
* Really, prep_compound_page() should be called from __rmqueue_bulk(). But
* we cheat by calling it from here, in the order > 0 path. Saves a branch
@@ -1193,9 +1226,7 @@ again:
list = &pcp->lists[migratetype];
local_irq_save(flags);
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
- pcp->batch, list,
- migratetype, cold);
+ list = pcp_fallback(zone, pcp, migratetype, cold);
if (unlikely(list_empty(list)))
goto failed;
}
Mel Gorman wrote:
> On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>>
>> Mel Gorman wrote:
>>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>>
>>>> I'll keep the old thread below as reference.
>>>>
>>>> After taking a round of ensuring reproducibility and a pile of new
>>>> measurements I now can come back with several new insights.
>>>>
>>>> FYI - I'm now running iozone triplets (4, then 8, then 16 parallel
>>>> threads) with sequential read load and all that 4 times to find
>>>> potential noise. But since I changed to that load instead of random
>>>> read wit hone thread and ensuring the most memory is cleared (sync +
>>>> echo 3 > /proc/sys/vm/drop_caches + a few sleeps) . The noise is now
>>>> down <2%. For detailed questions about the setup feel free to ask me
>>>> directly as I won't flood this thread too much with such details.
>>>>
>>>>
>>> Is there any chance you have a driver script for the test that you could send
>>> me? I'll then try reproducing based on that script and see what happens. I'm
>>> not optimistic I'll be able to reproduce the problem because I think
>>> it's specific to your setup but you never know.
>>>
>> I don't have one as it runs in a bigger automated test environment, but
>> it is easy enough to write down something comparable.
>
> I'd appreciate it, thanks.
>
Testing of your two patches starts in a few minutes, thanks in advance.
Here is the info on how to execute the core of the test - fingers crossed that anyone else can reproduce it that way :-)
I use it in a huge automation framework which takes care of setting up the system, disks, gathering statistics and so on, but it essentially comes down to something simple like this:
#!/bin/bash
# reboot your system with 256m
# attach 16 disks (usually up to 64, but 16 should be enough to show the issue)
# mount your disks at /mnt/subw0, /mnt/subw1, ...
for i in 4 8 16 4 8 16 4 8 16 4 8 16
do
sync; sleep 10s; echo 3 > /proc/sys/vm/drop_caches; sleep 2s;
iozone -s 2000m -r 64k -t $i -e -w -R -C -i 0 -F /mnt/subw0 /mnt/subw1 /mnt/subw2 /mnt/subw3 /mnt/subw4 /mnt/subw5 /mnt/subw6 /mnt/subw7 /mnt/subw8 /mnt/subw9 /mnt/subw10 /mnt/subw11 /mnt/subw12 /mnt/subw13 /mnt/subw14 /mnt/subw15
sync; sleep 10s; echo 3 > /proc/sys/vm/drop_caches; sleep 2s;
iozone -s 2000m -r 64k -t $i -e -w -R -C -i 1 -F /mnt/subw0 /mnt/subw1 /mnt/subw2 /mnt/subw3 /mnt/subw4 /mnt/subw5 /mnt/subw6 /mnt/subw7 /mnt/subw8 /mnt/subw9 /mnt/subw10 /mnt/subw11 /mnt/subw12 /mnt/subw13 /mnt/subw14 /mnt/subw15
done
# while we could reduce the number of writes to one 16 thread write I use it that way as it is more similar to our original load (makes no difference anyway)
[...]
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Mel Gorman wrote:
> On Mon, Feb 08, 2010 at 03:01:16PM +0100, Christian Ehrhardt wrote:
>>
>> Mel Gorman wrote:
>>> On Fri, Feb 05, 2010 at 04:51:10PM +0100, Christian Ehrhardt wrote:
>>>
[...]
>>> How reproducible are these results with patch e084b reverted? i.e. I know
>>> it does not work now, but did reverting on the old kernel always fix it
>>> or were there occasions where the figures would be still bad?
>>>
>> Reverting e084b in the past showed something like +40%, so I thought it
>> helps.
>> Afterwards I found out it wasn't a good testcase for reproducibility and
>> when looking at a series it had +5%,-7%,+20%,...
>> So by far too noisy to be useful for bisect, discussion or fix testing.
>> Thats why I made my big round reinventing the testcases in a more
>> reproducible way like described above.
>> In those the original issue is still very visible - which means 4 runs
>> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
>> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>
> About all e084b can be affecting is cache hotness when fallback is occuring
> but 60% regression still seems way too high for just that.
Yes, most of the ideas you, I and a few others had so far went for cache hotness - but the summary was always "couldn't be that much".
On the other hand we have to consider that it could be a small timing difference due to cache or whatever else that leads into congestion_wait, which then waits HZ/50. That way a 1µs cause can lead to 20001µs lost, which then causes the 60%.
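Just to spell that amplification out, here is a user-space toy of the arithmetic (HZ=100 is an assumption for this config; the 20ms result is the same for any HZ divisible by 50):
#include <stdio.h>

int main(void)
{
	const long hz = 100;                  /* assumed jiffies per second */
	const long timeout_jiffies = hz / 50; /* the HZ/50 used around congestion_wait */
	const long timeout_us = timeout_jiffies * (1000000 / hz);
	const long cause_us = 1;              /* hypothetical 1us root cause */

	/* one futile trip into the wait costs the cause plus the full timeout */
	printf("lost per event: %ld us\n", cause_us + timeout_us);
	return 0;
}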
Another important thing is that "cold page" afaik refers to the page content, not the management struct etc. As the time is lost in management only (=> before the actual 4k page content is even accessed), it should not matter to the mm code itself whether a page is cold or hot.
So much for direct_reclaim, which either runs into progress but !page (BAD case) or not (GOOD case) - there only cache effects on struct page, the pcp lists etc. should have an effect.
>>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21
>>>> which came in later. Both patches unapplied individually don't
>>>> improve anything. But both patches reverted at the same time on git
>>>> v2.6.32 bring us back to our old values (remember that is more than
>>>> 2Gb/s improvement in throughput)!
>>>>
>>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this
>>>> can cause so much trouble.
>>>>
>>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>>> workloads can suffer. A second is that users of high-order allocations can
>>> enter direct reclaim a lot more than previously. This was fixed in commit
>>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>>> your case.
>>>
>>> The only other interesting thing in commit 5f8dcc21 is that it increases
>>> the memory overhead of a per-cpu structure. If your memory usage is really
>>> borderline, it might have been just enough to push it over the edge.
>>>
>> I had this thoughts too, but as it only shows an effect with 5f8dcc21
>> and e084b reverted at the same time I'm wondering if it can be that.
>> But I agree that this should be considered too until a verification can
>> be done.
>>>
>
> Another consequence of 5f8dcc is that pages are on the per-cpu lists that
> it is possible for pages to be on the per-cpu lists when the page allocator
> is called. There is potential that page reclaim is being entered as a
> result even though pages were free on the per-cpu lists.
>
> It would not explain why reverting e084b makes such a difference but I
> can prototype a patch that falls back to other per-cpu lists before
> calling the page allocator as it used to do but without linear searching
> the per-cpu lists.
I tested the patch you submitted in another reply.
Short version - it only gives ~+4%, probably by speeding things up a bit, but it doesn't touch the issue in any way.
Some more detail below at the other test result.
[...]
>>
>> 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>>
>> perf_count_congestion_wait 305 1970 8980
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 305 1970 8979 100.00%
>> perf_count_pages_direct_reclaim 1153 6818 32217
>> perf_count_failed_pages_direct_reclaim 305 1556 8979
>> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>>
[...]
>>>
>>>> perf_count_failed_pages_direct_reclaim_but_progress 305 1478 8979 27.87%
>>>>
>>>> GOOD CASE WITH REVERTS 4 THREAD READ 8 THREAD READ 16 THREAD READ 16THR % portions
>>>> perf_count_congestion_wait 25 76 1114
>>>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>>>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 25 76 1114 99.98%
>>>> perf_count_pages_direct_reclaim 1054 9150 62297
>>>> perf_count_failed_pages_direct_reclaim 25 64 1114
>>>> perf_count_failed_pages_direct_reclaim_but_progress 25 57 1114 1.79%
>>>>
>>>>
>>>> I hope the format is kept, it should be good with every monospace viewer.
>>>>
>>> It got manged but I think I've it fixed above. The biggest thing I can see
>>> is that direct reclaim is a lot more successful with the patches reverted but
>>> that in itself doesn't make sense. Neither patch affects how many pages should
>>> be free or reclaimable - just what order they are allocated and freed in.
>>>
>>> With botgh patches reverted, is the performance 100% reliable or does it
>>> sometimes vary?
>>>
>> It is 100% percent reliable now - reliably bad with plain 2.6.32 as well
>> as reliably much better (e.g. +40% @ 16threads) with both reverted.
>>
>
> Ok.
>
>>> If reverting 5f8dcc21 is required, I'm leaning towards believing that the
>>> additional memory overhead with that patch is enough to push this workload
>>> over the edge where entering direct reclaim is necessary a lot more to keep
>>> the zones over the min watermark. You mentioned early on that adding 64MB
>>> to the machine makes the problem go away. Do you know what the cut-off point
>>> is? For example, is adding 4MB enough?
>>>
>> That was another one probably only seen due to the lack of good
>> reproducibility in my old setup.
>
> Ok.
>
>> I made bigger memory scaling tests with the new setup. Therefore I ran
>> the workload with 160 to 356 megabytes in 32mb steps (256 is the default
>> in all other runs).
>> The result is that more than 256m memory only brings a slight benefit
>> (which is ok as the load doesn't reread the data read into page cache
>> and it just doesn't need much more).
>>
>> Here some data about scaling memory normalized to the 256m memory values:
>> - deviation to 256m case - 160m 192m 224m 256m 288m 320m 356m
>> plain 2.6.32 +9.12% -55.11% -17.15% =100% +1.46% +1.46% +1.46%
>> 2.6.32 - 5f8dcc21 and e084b reverted +6.95% -6.95% -0.80% =100% +2.14% +2.67% +2.67%
>> ------------------------------------------------------------------------------------------------
>> deviation between each other (all +) 60.64% 182.93% 63.44% 36.50% 37.41% 38.13% 38.13%
>> What can be seen:
>> a) more memory brings up to +2.67%/+1.46%, but not more when further
>> increasing memory (reasonable as we don't reread cached files)
>> b) decreasing memory drops performance by up to -6.95% @ -64mb with both
>> reverted, but down to -55% @ -64mb (compared to the already much lower
>> 256m throughput)
>> -> plain 2.6.32 is much more sensible to lower available memory AND
>> always a level below
>> c) there is a weird but very reproducible improvement with even lower
>> memory - very probably another issue or better another solution and not
>> related here - but who knows :-)
>> -> still both reverted is always much better than plain 2.6.32
>>
>
> Ok. Two avenues of attack then although both are focused on 5f8dcc21.
> The first is the prototype below that should avoid congestion_wait. The
> second is to reintroduce a fallback to other per-cpu lists and see if
> that was a big factor.
I tested both of your patches, btw thanks for the quick work!
[...]
>
> Hmm, the patch as it currently stands is below. However, I warn you that
> it has only been boot-tested on qemu with no proper testing doing, either
> functional or performance testing. It may just go mad, drink your beer and
> chew on your dog.
Despite all your warnings it worked pretty well.
The patch replacing the congestion_wait calls - which we know are futile in my scenario
anyway - with zone wait calls helps a lot. But, as you suggested, so far only by alleviating the symptoms.
From a throughput perspective alone it looks almost fixed:
4 thread 8 thread 16 thread
plain 2.6.32 100% 100% 100%
2.6.32 avoid direct reclaim 103.85% 103.60% 104.04%
2.6.32 zone wait 125.46% 140.87% 131.89%
2.6.32 zone wait + 5f8dcc21 and e084b reverted 124.46% 143.35% 144.21%
Looking at the counters, this theory (only fixing symptoms) seems to be confirmed. Before, the ratio of perf_count_pages_direct_reclaim calls that ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the good cases and ~30% in the bad cases. As congestion_wait in my case almost always waits the full HZ/50, this is a lot.
With the zone wait patch I see a huge throughput win, most probably from all the (still ~30%!) waits that no longer have to wait for the full timeout.
But reverting 5f8dcc21 and e084b still fixes the issue of so many direct reclaims running into that condition, and that way gives throughput another last push.
% ran into try_to_free did progress but get_page got !page
plain 2.6.32 30.16%
2.6.32 zone wait 29.05%
2.6.32 zone wait + 5f8dcc21 and e084b reverted 2.93%
Note - with the zone wait patch there are also almost twice as many direct_reclaim calls. Probably because several waiters come off the list when the watermark is restored and allocate too much.
Another good side note - 5f8dcc21 and e084b reverted plus your zone wait patch is better than anything I have seen so far :-)
Maybe it's worth measuring both of your patches plus 5f8dcc21 and e084b reverted with 64 threads (the effect grows with the number of threads), just to rejoice in huge throughput values for a short time.
>>>> Outside in the function alloc_pages_slowpath this leads to a call to
>>>> congestion_wait which is absolutely futile as there are absolutely no
>>>> writes in flight to wait for.
>>>>
>>>> Now this kills effectively up to 80% of our throughput - Any idea of
>>>> better understanding the link between the patches and the effect is
>>>> welcome and might lead to a solution.
>>>>
>>>> FYI - I tried all patches you suggested - none of them affects this.
>>>>
>>>>
>>> I'm still at a total loss to explain this. Memory overhead of the second
>>> patch is a vague possibility and worth checking out by slightly increasing
>>> available memory on the partition or reducing min_free_kbytes. It does not
>>> explain why the first patch makes a difference.
>>>
>> In a discussion with Hannes Reinecke ([email protected]) he brought up that
>> in a low memory scenario the ordering of the pages might twist.
>> Today we have the single linst of pages - add hot ones to the head and
>> cold to the tail. Both patches affect that list management - e084b
>> changes the order of some, and 5f8dcc is splitting it per migrate type.
>
> Correct on both counts.
>
>> What now could happen is that you free pages and get a list like:
>> H1 - H2 - H3 - C3 - C2 - C1 (H is hot and C is cold)
>> In low mem scenarios it could now happen that a allocation for hot pages
>> falls back to cold pages (as usual fallback), but couldnt it be that it
>> then also gets the cold pages in reverse order again ?
>
> Yes, this is true but that would only matter from a cache perspective
> as you say your hardware and drivers are doing no automatic merging of
> requests.
>
>> Without e084b it would be like:
>> H1 - H2 - H3 - C1 - C2 - C3 and therefore all at least in order (with
>> the issue that cold allocations from the right are reverse -> e084b)
>> 5f8dcc is now lowering the size of those lists by splitting it into
>> different migration types, so maybe both together are increasing the
>> chance to get such an issue.
>>
>
> There is a good chance that 5f8dcc is causing direct reclaim to be entered
> when it could have been avoided as pages were on the per-cpu lists but
> I don't think the difference in cache hotness is enough to explain a 60%
> loss on your machine.
>
As far as I get it, that's what your patch #2 addresses (page allocator: Fallback to other per-cpu lists when the target list is empty and memory is low), but as mentioned below that gave me only ~+4%.
Which is actually a fine win, just not for the issue currently under discussion.
As a bonus I can now give you a little Tested-by for both patches :-)
[...]
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
> >>> How reproducible are these results with patch e084b reverted? i.e. I know
> >>> it does not work now, but did reverting on the old kernel always fix it
> >>> or were there occasions where the figures would be still bad?
> >>>
> >> Reverting e084b in the past showed something like +40%, so I thought it
> >> helps.
> >> Afterwards I found out it wasn't a good testcase for reproducibility and
> >> when looking at a series it had +5%,-7%,+20%,...
> >> So by far too noisy to be useful for bisect, discussion or fix testing.
> >> Thats why I made my big round reinventing the testcases in a more
> >> reproducible way like described above.
> >> In those the original issue is still very visible - which means 4 runs
> >> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
> >> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
> >
> > About all e084b can be affecting is cache hotness when fallback is occuring
> > but 60% regression still seems way too high for just that.
>
> Yes, most the ideas you, I an a few others so far had went for cache
> hotness - but always the summary was "couldn't be that much".
Well, it's cache hotness or the driver or hardware doing automatic merging
of requests (which you say is not happening). There are two other
possibilities but their effect should be very low as well.
> On the other hand we have to consider that it could be a small timing due
> to cache or whatever else which leads into congestion_wait and then it waits
> HZ/50. That way a 1µs cause can lead to 20001µs lost which then cause 60%.
Maybe although it would imply the window where it can go wrong is
very small. It seems unlikely that it would be hit every time.
> Another important thing is that "cold page" afaik means the page content,
> and not the management struct etc. As the time lost is in management only
> (=> before accessing the actual 4k page content) it should not be important
> if a page is cold/hot for the mm code itself.
True. Even the zeroing should be a non-temporal store so the timing
should not be affected.
> That much for direct_reclaim that either runs into progress, but !page
> (BAD case) or not (GOOD case) - there only cache effects of struct page,
> pcp lists etc should have an effect.
There are only two other possibilities I can think of.
1. The extra parameter to rmqueue_bulk() means that your platform
generates particularly bad code to pass a 6th parameter.
2. The branch inside the for loop is causing serious problems somehow
Causing a 60% slowdown from either of these seems unlikely but there isn't
much choice but to eliminate them as possibilities. The attached patch is
for "option 1" above. I'll think more on how to best handle "option 2".
> >>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21
> >>>> which came in later. Both patches unapplied individually don't
> >>>> improve anything. But both patches reverted at the same time on git
> >>>> v2.6.32 bring us back to our old values (remember that is more than
> >>>> 2Gb/s improvement in throughput)!
> >>>>
> >>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this
> >>>> can cause so much trouble.
> >>>>
> >>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
> >>> workloads can suffer. A second is that users of high-order allocations can
> >>> enter direct reclaim a lot more than previously. This was fixed in commit
> >>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
> >>> your case.
> >>>
> >>> The only other interesting thing in commit 5f8dcc21 is that it increases
> >>> the memory overhead of a per-cpu structure. If your memory usage is really
> >>> borderline, it might have been just enough to push it over the edge.
> >>>
> >> I had this thoughts too, but as it only shows an effect with 5f8dcc21
> >> and e084b reverted at the same time I'm wondering if it can be that.
> >> But I agree that this should be considered too until a verification can
> >> be done.
> >
> > Another consequence of 5f8dcc is that pages are on the per-cpu lists that
> > it is possible for pages to be on the per-cpu lists when the page allocator
> > is called. There is potential that page reclaim is being entered as a
> > result even though pages were free on the per-cpu lists.
> >
> > It would not explain why reverting e084b makes such a difference but I
> > can prototype a patch that falls back to other per-cpu lists before
> > calling the page allocator as it used to do but without linear searching
> > the per-cpu lists.
>
> I tested these patch you submitted in another reply.
> Short - it only gives it ~+4% by probably speeding up things a bit but
> not touching the issue in any way.
> Some more detail below at the other test result.
>
You mention it gains 4% which is not great, but what impact, if any, had
it on the number of calls to direct reclaim?
> <SNIP>
> > Ok. Two avenues of attack then although both are focused on 5f8dcc21.
> > The first is the prototype below that should avoid congestion_wait. The
> > second is to reintroduce a fallback to other per-cpu lists and see if
> > that was a big factor.
>
> I tested both of your patches, btw thanks for the quick work!
>
Thanks for persisting with this! I know it's dragging on a lot longer
than anyone would like.
> > Hmm, the patch as it currently stands is below. However, I warn you that
> > it has only been boot-tested on qemu with no proper testing doing, either
> > functional or performance testing. It may just go mad, drink your beer and
> > chew on your dog.
>
> In opposition to all your warnings it worked pretty well.
Nice one.
> The patch replacing congestion_wait calls - which we know re futile in my scenario
> anyway - by zone_wait calls helps a lot. But so far as you suggested only by alleviating the symptoms.
>
> From a throughput perspective alone it looks almost fixed:
> 4 thread 8 thread 16 thread
> plain 2.6.32 100% 100% 100%
> 2.6.32 avoid direct reclaim 103.85% 103.60% 104.04%
> 2.6.32 zone wait 125.46% 140.87% 131.89%
> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 124.46% 143.35% 144.21%
>
Ok, the fact that the congestion_wait-avoidance patch makes such a large
difference is very interesting, although if the direct reclaim calls were
lower, it would be less pronounced. I will keep those patches on the
back-burner for a while in case we cannot resolve this problem a better
way, but it's nice to know they may have some potential.
> Looking at the counters this theory (only fixing symptoms) seems to be
> confirmed. Before the ratio of perf_count_pages_direct_reclaim calls that
> ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the
> good cases and ~30% in bad cases. As congestion_wait in my case almost always
> waits the full HZ/50 this is a lot.
> With the zone wait patch I see a huge throughput win, most probably by
> all the (still ~30%!) waits that don't have to wait for the full timeout.
That would be my expectation. The underlying problem of
too-many-direct-reclaims is not solved.
> But still reverting 5f8dcc21 and e084b fix the issue of so much direct
> reclaims running in into that condition and that way giving throughput
> another last push.
> % ran into try_to_free did progress but get_page got !page
> plain 2.6.32 30.16%
> 2.6.32 zone wait 29.05%
> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 2.93%
>
And what impact did "Fallback to other per-cpu lists when the target list
is empty and memory is low" have on the number of calls to direct reclaim?
> Note - with the zone wait patch there are also almost twice as much
> direct_reclaim calls. Probably by several waiters coming of the list when
> the watermark is restored and allocating too much.
It could be either a thundering herd problem or just the fact that the
processes are not asleep.
> Another good side note - 5f8dcc21 and e084b reverted plus your zone wait
> patch is better than anything I have seen so far :-)
Well, that's something positive out of all this. The congestion_wait() patch
was developed in response to this thread. Even though I knew it was treating
symptoms and not the problem, congestion_wait() was identified as a bad
idea some time ago because the lack of pages may have nothing to do with IO.
> Maybe its worth to measure both of your patches plus 5f8dcc21 and e084b
> reverted with 64 threads (the effect grows with # threads) just to rejoice
> in huge throughput values for a short time.
:)
> > <SNIP>
> >
> > There is a good chance that 5f8dcc is causing direct reclaim to be entered
> > when it could have been avoided as pages were on the per-cpu lists but
> > I don't think the difference in cache hotness is enough to explain a 60%
> > loss on your machine.
> >
>
> As far as I get it thats what your patch #2 fixed (page allocator: Fallback
> to other per-cpu lists when the target list is empty and memory is low),
> but as mentioned below that gave me +4%.
I need to know what effect, if any, it had on the direct reclaim
figures. If it does drop them, but not by enough, then fallback can be
allowed to occur in more circumstances - for example, by changing
+ if (pcp->count && free_pages <= low_wmark_pages(zone)) {
with
+ if (pcp->count && free_pages <= high_wmark_pages(zone)) {
> Which is actually a fine win, but not for the issue currently in discussion.
> As bonus I can give you now a little tested-by for both patches :-)
>
Thanks :)
Still, the grind continues to try to resolve this :(
1. If the "Fallback to other per-cpu lists" does reduce the number of
calls to direct reclaim, but not by enough, please replace the
comparison of low_wmark_pages with high_wmark_pages and let me know a)
how much it affects throughput and b) how much it affects the number
of calls to direct reclaim
This is addressing potential problems in 5f8dcc
2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
number of parameters being passed is making some unexpected large
difference
This is trying to figure out more what the problem with e084b is
==== CUT HERE ====
rmqueue_bulk-fewer-parameters
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7021c68..180bab2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -936,7 +936,7 @@ retry_reserve:
* a single hold of the lock, for efficiency. Add them to the supplied list.
* Returns the number of new pages which were placed at *list.
*/
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
+static int rmqueue_bulk(struct zone *zone,
unsigned long count, struct list_head *list,
int migratetype, int cold)
{
@@ -944,7 +944,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype);
+ struct page *page = __rmqueue(zone, 0, migratetype);
if (unlikely(page == NULL))
break;
@@ -964,7 +964,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
set_page_private(page, migratetype);
list = &page->lru;
}
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
spin_unlock(&zone->lock);
return i;
}
@@ -1231,7 +1231,7 @@ again:
list = &pcp->lists[migratetype];
local_irq_save(flags);
if (list_empty(list)) {
- pcp->count += rmqueue_bulk(zone, 0,
+ pcp->count += rmqueue_bulk(zone,
pcp->batch, list,
migratetype, cold);
if (unlikely(list_empty(list)))
Mel Gorman wrote:
>>>>> How reproducible are these results with patch e084b reverted? i.e. I know
>>>>> it does not work now, but did reverting on the old kernel always fix it
>>>>> or were there occasions where the figures would be still bad?
>>>>>
>>>> Reverting e084b in the past showed something like +40%, so I thought it
>>>> helps.
>>>> Afterwards I found out it wasn't a good testcase for reproducibility and
>>>> when looking at a series it had +5%,-7%,+20%,...
>>>> So by far too noisy to be useful for bisect, discussion or fix testing.
>>>> Thats why I made my big round reinventing the testcases in a more
>>>> reproducible way like described above.
>>>> In those the original issue is still very visible - which means 4 runs
>>>> comparing 2.6.32 vs 2.6.32-Reve084b being each max +/-1%.
>>>> While gitid-e084b vs. the one just before (51fbb) gives ~-60% all the time.
>>> About all e084b can be affecting is cache hotness when fallback is occuring
>>> but 60% regression still seems way too high for just that.
>> Yes, most the ideas you, I an a few others so far had went for cache
>> hotness - but always the summary was "couldn't be that much".
>
> Well, it's cache hotness or the driver or hardware doing automatic merging
> of requests (which you say is not happening). There are two other
> possibilties but their effect should be very low as well.
>
>> On the other hand we have to consider that it could be a small timing due
>> to cache or whatever else which leads into congestion_wait and then it waits
>> HZ/50. That way a 1µs cause can lead to 20001µs lost which then cause 60%.
>
> Maybe although it would imply the window where it can go wrong is
> very small. It seems unlikely that it would be hit every time.
>
>> Another important thing is that "cold page" afaik means the page content,
>> and not the management struct etc. As the time lost is in management only
>> (=> before accessing the actual 4k page content) it should not be important
>> if a page is cold/hot for the mm code itself.
>
> True. Even the zeroing should be a non-temporal store so the timing
> should not be affected.
>
>> That much for direct_reclaim that either runs into progress, but !page
>> (BAD case) or not (GOOD case) - there only cache effects of struct page,
>> pcp lists etc should have an effect.
>
> There are only two other possibilities I can think of.
>
> 1. The extra parameter to rmqueue_bulk() means that your platform
> generates particularly bad code to pass a 6th parameter.
I never checked that number, but off the top of my head I can tell you that 5 parameters can be passed in registers and starting with the 6th they need to go via the stack.
So much for that hope; some numbers below to keep the results together.
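Purely to illustrate the point, a user-space toy (the concrete registers r2-r6 are how I recall the s390x ELF ABI, so take that detail as an assumption):
#include <stdio.h>

/* on s390x the first five integer arguments travel in registers r2-r6,
 * so f below is the first one that has to be spilled to the stack */
static long six_args(long a, long b, long c, long d, long e, long f)
{
	return a + b + c + d + e + f;
}

int main(void)
{
	printf("%ld\n", six_args(1, 2, 3, 4, 5, 6));
	return 0;
}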
> 2. The branch inside the for loop is causing serious problems somehow
I also know some branches can be tricky on s390, so I look forward to seeing your idea of how we could verify that without changing too much behavior.
>
> Causing a 60% slowdown from either of these seems unlikely but there isn't
> much choice than to eliminate them as possibilities. The attached patch is
> for "option 1" above. I'll think more on how to best handle "option 2"
>>>>>> Another bisect (keepign e084b reverted) brought up git id 5f8dcc21
>>>>>> which came in later. Both patches unapplied individually don't
>>>>>> improve anything. But both patches reverted at the same time on git
>>>>>> v2.6.32 bring us back to our old values (remember that is more than
>>>>>> 2Gb/s improvement in throughput)!
>>>>>>
>>>>>> Unfortunately 5f8dcc21 is as unobvious as e84b in explaining how this
>>>>>> can cause so much trouble.
>>>>>>
>>>>> There is a bug in commit 5f8dcc21. One consequence of it is that swap-based
>>>>> workloads can suffer. A second is that users of high-order allocations can
>>>>> enter direct reclaim a lot more than previously. This was fixed in commit
>>>>> a7016235a61d520e6806f38129001d935c4b6661 but you say that's not the fix in
>>>>> your case.
>>>>>
>>>>> The only other interesting thing in commit 5f8dcc21 is that it increases
>>>>> the memory overhead of a per-cpu structure. If your memory usage is really
>>>>> borderline, it might have been just enough to push it over the edge.
>>>>>
>>>> I had this thoughts too, but as it only shows an effect with 5f8dcc21
>>>> and e084b reverted at the same time I'm wondering if it can be that.
>>>> But I agree that this should be considered too until a verification can
>>>> be done.
>>> Another consequence of 5f8dcc is that pages are on the per-cpu lists that
>>> it is possible for pages to be on the per-cpu lists when the page allocator
>>> is called. There is potential that page reclaim is being entered as a
>>> result even though pages were free on the per-cpu lists.
>>>
>>> It would not explain why reverting e084b makes such a difference but I
>>> can prototype a patch that falls back to other per-cpu lists before
>>> calling the page allocator as it used to do but without linear searching
>>> the per-cpu lists.
>> I tested these patch you submitted in another reply.
>> Short - it only gives it ~+4% by probably speeding up things a bit but
>> not touching the issue in any way.
>> Some more detail below at the other test result.
>>
>
> You mention it gains 4% which is not great, but what impact, if any, had
> it on the number of calls to direct reclaim?
Sorry, I missed adding those numbers yesterday.
Unfortunately it had no effect. Comparing 2.6.32 to 2.6.32 with the avoid-direct-reclaim patch, it has these for e.g. 16 threads:
with avoid direct reclaim (pcp list fallback)
perf_count_pages_direct_reclaim 32909
perf_count_failed_pages_direct_reclaim_but_progress 9802 =29.79%
Remember, original 2.6.32 had:
perf_count_pages_direct_reclaim 33662
perf_count_failed_pages_direct_reclaim_but_progress 10152 =30.16%
So no big progress due to that (~4% TP as mentioned before).
>> <SNIP>
>>> Ok. Two avenues of attack then although both are focused on 5f8dcc21.
>>> The first is the prototype below that should avoid congestion_wait. The
>>> second is to reintroduce a fallback to other per-cpu lists and see if
>>> that was a big factor.
>> I tested both of your patches, btw thanks for the quick work!
>>
>
> Thanks for persisting with this! I know it's dragging on a lot longer
> than anyone would like.
>
>>> Hmm, the patch as it currently stands is below. However, I warn you that
>>> it has only been boot-tested on qemu with no proper testing doing, either
>>> functional or performance testing. It may just go mad, drink your beer and
>>> chew on your dog.
>> In opposition to all your warnings it worked pretty well.
>
> Nice one.
>
>> The patch replacing congestion_wait calls - which we know re futile in my scenario
>> anyway - by zone_wait calls helps a lot. But so far as you suggested only by alleviating the symptoms.
>>
>> From a throughput perspective alone it looks almost fixed:
>> 4 thread 8 thread 16 thread
>> plain 2.6.32 100% 100% 100%
>> 2.6.32 avoid direct reclaim 103.85% 103.60% 104.04%
>> 2.6.32 zone wait 125.46% 140.87% 131.89%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 124.46% 143.35% 144.21%
>>
>
> Ok, the fact that the congestion_wait-avoidance patch makes such a large
> difference is very interesting although if the direct reclaim calls were
> lower, it would be less pronounced. I will be having those patches on the
> back-burner for another while unless we cannot resolve this problem a better
> way but it's nice to know they may have some potential.
>
>> Looking at the counters this theory (only fixing symptoms) seems to be
>> confirmed. Before the ratio of perf_count_pages_direct_reclaim calls that
>> ended in perf_count_failed_pages_direct_reclaim_but_progress was ~2% in the
>> good cases and ~30% in bad cases. As congestion_wait in my case almost always
>> waits the full HZ/50 this is a lot.
>
>> With the zone wait patch I see a huge throughput win, most probably by
>> all the (still ~30%!) waits that don't have to wait for the full timeout.
>
> That would be my expectation. The underlying problem of
> too-many-direct-reclaims is not solved.
>
>> But still reverting 5f8dcc21 and e084b fix the issue of so much direct
>> reclaims running in into that condition and that way giving throughput
>> another last push.
>> % ran into try_to_free did progress but get_page got !page
>> plain 2.6.32 30.16%
>> 2.6.32 zone wait 29.05%
>> 2.6.32 zone wait + 5f8dcc21 and e084b reverted 2.93%
>>
>
> And what impact did "Fallback to other per-cpu lists when the target list
> is empty and memory is low" have on the number of calls to direct reclaim?
see above
>
>> Note - with the zone wait patch there are also almost twice as much
>> direct_reclaim calls. Probably by several waiters coming of the list when
>> the watermark is restored and allocating too much.
>
> It could be either a thundering herd problem or just the fact that the
> processes are not asleep.
That sounds right, but like other things we had it is just related to the symptoms, so I guess we can neglect it for now.
>> Another good side note - 5f8dcc21 and e084b reverted plus your zone wait
>> patch is better than anything I have seen so far :-)
>
> Well, that's something positive out of all this. The congestion_wait() patch
> was developed in response to this thread. Even though I knew it was treating
> symptons and not problems, the congestion_wait() was identified as a bad
> idea some time ago because the lack of pages may have nothing to do with IO.
>
>> Maybe its worth to measure both of your patches plus 5f8dcc21 and e084b
>> reverted with 64 threads (the effect grows with # threads) just to rejoice
>> in huge throughput values for a short time.
>
> :)
>
>>> <SNIP>
>>>
>>> There is a good chance that 5f8dcc is causing direct reclaim to be entered
>>> when it could have been avoided as pages were on the per-cpu lists but
>>> I don't think the difference in cache hotness is enough to explain a 60%
>>> loss on your machine.
>>>
>> As far as I get it thats what your patch #2 fixed (page allocator: Fallback
>> to other per-cpu lists when the target list is empty and memory is low),
>> but as mentioned below that gave me +4%.
>
> I need to know what effect, if any, it had on the direct reclaim
> figures. If it does drop them, but not by enough, then fallback can be
> allowed to occur in more cirsumstances - for example, by changing
>
> + if (pcp->count && free_pages <= low_wmark_pages(zone)) {
>
> with
>
> + if (pcp->count && free_pages <= high_wmark_pages(zone)) {
>
>> Which is actually a fine win, but not for the issue currently in discussion.
>> As bonus I can give you now a little tested-by for both patches :-)
>>
>
> Thanks :)
>
> Still, the grind continues to try resolve this :(
>
> 1. If the "Fallback to other per-cpu lists" does reduce the number of
> calls to direct reclaim, but not by enough, please replace the
> comparison of low_wmark_pages with high_wmark_pages and let me know a)
> how much it affects throughput and b) how much it affects the number
> of calls to direct reclaim
The difference between low and high watermark is minimal.
Here with avoid direct reclaim using the high watermark:
perf_count_pages_direct_reclaim 32839
perf_count_failed_pages_direct_reclaim_but_progress 488 =30.16%
As reference from above, the one with the low watermark:
perf_count_pages_direct_reclaim 32909
perf_count_failed_pages_direct_reclaim_but_progress 9802 =29.79%
=> this is inside the small amount of noise that is left.
>
> This is addressing potential problems in 5f8dcc
>
> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> number of parameters being passed is making some unexpected large
> difference
BINGO - this definitely hit something.
This experimental patch does two things - on one hand it closes the race we had:
4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
perf_count_congestion_wait 13 27 52
perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
perf_count_pages_direct_reclaim 30867 56811 131552
perf_count_failed_pages_direct_reclaim 14 24 52
perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
On the other hand we see that the number of direct_reclaim calls increased by ~x4.
I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on 4x the number of pages at once.
This leaves me with two ideas what the real issue could be:
1. either really the 6th parameter, as this is the first one that needs to go on the stack and that way might open a race and rob gcc of a big pile of optimization chances
2. maybe just the fact that we now only do "order 0" operations changes the behaviour
I think I'll try to cover that tomorrow by keeping the old call with 6 parms but rewriting the order internally to 0 again (and hope I don't need too many tricks to stop gcc from optimizing it out).
Maybe I could also extend statistics to track order in the failing cases vs. all cases or something like this.
I don't have full access to the machines tomorrow, so it might get monday for a proper response.
Thanks a lot Mel, this brought us much closer to a final fix.
> This is trying to figure out more what the problem with e084b is
>
> ==== CUT HERE ====
> rmqueue_bulk-fewer-parameters
[...]
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
> > 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> > number of parameters being passed is making some unexpected large
> > difference
>
> BINGO - this definitely hit something.
> This experimental patch does two things - on one hand it closes the race we had:
>
> 4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
> perf_count_congestion_wait 13 27 52
> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
> perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
> perf_count_pages_direct_reclaim 30867 56811 131552
> perf_count_failed_pages_direct_reclaim 14 24 52
> perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
>
> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
>
> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
But the order parameter was always passed as constant 0 by the caller?
> This leaves me with two ideas what the real issue could be:
> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
It must be something to do with code generation AFAIKS. I'm surprised
the function isn't inlined, giving exactly the same result regardless
of the patch.
Unlikely to be a correctness issue with code generation, but I'm
really surprised that a small difference in performance could have
such a big (and apparently repeatable) effect on behaviour like this.
What's the assembly look like?
On Fri, Feb 12, 2010 at 09:05:19PM +1100, Nick Piggin wrote:
> > This leaves me with two ideas what the real issue could be:
> > 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
>
> It must be something to do wih code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.
>
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
>
> What's the assembly look like?
Oh, and what happens if you always_inline it? I would have thought
gcc should be able to generate the same code in both cases -- unless
I'm really missing something and there is a functional change in
there?
Nick Piggin wrote:
> On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
>>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
>>> number of parameters being passed is making some unexpected large
>>> difference
>> BINGO - this definitely hit something.
>> This experimental patch does two things - on one hand it closes the race we had:
>>
>> 4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
>> perf_count_congestion_wait 13 27 52
>> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
>> perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
>> perf_count_pages_direct_reclaim 30867 56811 131552
>> perf_count_failed_pages_direct_reclaim 14 24 52
>> perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
>>
>> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
>>
>> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
>
> But the order parameter was always passed as constant 0 by the caller?
Uh - I didn't see that at first glance - you're right.
So the x4 in direct_reclaim calls is not caused by e.g. missing order information. Thanks for spotting this.
>
>> This leaves me with two ideas what the real issue could be:
>> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
>
> It must be something to do with code generation AFAIKS. I'm surprised
> the function isn't inlined, giving exactly the same result regardless
> of the patch.
After checking the asm I can tell that rmqueue_bulk is inlined.
Therefore the test of inlining it explicitly, as requested in another mail, can be considered unnecessary.
It actually boils down to something different in Mel's patch - see below.
> Unlikely to be a correctness issue with code generation, but I'm
> really surprised that a small difference in performance could have
> such a big (and apparently repeatable) effect on behaviour like this.
>
> What's the assembly look like?
>
Line 214763
This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which is inlined into
get_page_from_freelist.
!ORDER CHANGED FOR READABILITY!
FAST SLOW
lgr %r2,%r11 #parm 1 from r11 lgr %r2, %r11 #parm 1 from r11 - same
lghi %r3,0 #parm 2 = 0 const lghi %r3,0 #parm 2 = 0 - same
lghi %r4,-1 #parm 3 = -1 const lcr %r4,%r12 #parm 3 as complement of r12?
lgfr %r4,%r4 #parm 3 - clear upper 32 bit?
brasl #call brasl #call
=> This is most probably this part of Mel's patch:
@@ -965,7 +965,7 @@
set_page_private(page, migratetype);
list = &page->lru;
}
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
spin_unlock(&zone->lock);
return i;
}
-> doesn't look too important, but to be sure we can build an executable with
just this change while still passing order to rmqueue_bulk - see below.
Line 214800
Different end of the function. In the slow path there is a check on %r4 and,
depending on it, two different jump targets inside get_page_from_freelist, while in
the fast version there is only an unconditional jump (both after a
_raw_spin_lock_wait).
FAST SLOW
brasl # call _raw_spin_lock_wait brasl # _raw_spin_lock_wait
j 1e6294 get_page_from_freelist lg %r4,288(%r15) # from stack
ltgr %r4,%r4 # compare
jne 1e62a2 get_page_from_freelist
lhi %r12,0 # new call gets r12 @ 0 (GOT?)
j 1e6340 get_page_from_freelist
The rest is the same. So what is left is that the slow variant has a new branch
back into get_page_from_freelist with r12 set to zero. This jump goes directly
after the new lcr instruction seen in the first difference and might therefore be
related to that change.
At first I thought it might not be rmqueue_bulk that misses optimization, but
__mod_zone_page_state - yet that one looks the same in both cases.
Therefore I took a shot at 2.6.32 with just the snippet above applied (the __mod_zone_page_state call without order, which should be fine as we know it is 0 in all cases).
And it is interesting to see that this snippet alone is enough to fix the throughput and the direct_reclaim -> progress&!page ratio.
The differences in the asm are pretty much the same; as rmqueue_bulk was already inlined before, the actually intended change to its parameters was negligible.
I wondered whether what matters is the constant value (-1) or whether the cause was the shift. So I tried:
@@ -965,7 +965,7 @@
set_page_private(page, migratetype);
list = &page->lru;
}
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
spin_unlock(&zone->lock);
return i;
}
Now that one still has the huge throughput loss and the bad ratio of direct_reclaims leading into congestion_wait.
As far as I can see, that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
static inline void zone_page_state_add(long x, struct zone *zone,
enum zone_stat_item item)
{
atomic_long_add(x, &zone->vm_stat[item]);
atomic_long_add(x, &vm_stat[item]);
}
I guess the intention there is to update the zone's free page count by the number of pages taken off the list - and I also guess that -1 as a constant is wrong there.
That would also explain some weird output I saw like this after boot:
h37lp01:~ # cat /proc/meminfo
MemTotal: 251216 kB
MemFree: 483460 kB <- bigger than total
BUT, and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)", even though it is wrong, fixed the issue in terms of throughput and congestion_waits as well as reverting e084b+5f8dcc21 did!
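A tiny user-space model of the accounting illustrates why the wrong constant can even show up as MemFree > MemTotal (the batch size and starting value are made up, it only mimics the two counters):
#include <stdio.h>

int main(void)
{
	long nr_free_real = 64000;      /* pages really on the buddy lists */
	long nr_free_accounted = 64000; /* what NR_FREE_PAGES would report */
	const int batch = 16;           /* assumed pcp->batch */

	for (int refills = 0; refills < 1000; refills++) {
		/* rmqueue_bulk() takes 'batch' pages off the free lists ... */
		nr_free_real -= batch;
		/* ... but the broken hunk accounts only one page per refill */
		nr_free_accounted -= 1;

		/* later all pages get freed again, accounted correctly per page */
		nr_free_real += batch;
		nr_free_accounted += batch;
	}

	printf("real free pages:      %ld\n", nr_free_real);
	printf("accounted free pages: %ld\n", nr_free_accounted);
	return 0;
}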
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>
>
> Nick Piggin wrote:
> > On Thu, Feb 11, 2010 at 05:11:24PM +0100, Christian Ehrhardt wrote:
> >>> 2. Test with the patch below rmqueue_bulk-fewer-parameters to see if the
> >>> number of parameters being passed is making some unexpected large
> >>> difference
> >> BINGO - this definitely hit something.
> >> This experimental patch does two things - on one hand it closes the race we had:
> >>
> >> 4 THREAD READ 8 THREAD READ 16 THREAD READ %ofcalls
> >> perf_count_congestion_wait 13 27 52
> >> perf_count_call_congestion_wait_from_alloc_pages_high_priority 0 0 0
> >> perf_count_call_congestion_wait_from_alloc_pages_slowpath 13 27 52 99.52%
> >> perf_count_pages_direct_reclaim 30867 56811 131552
> >> perf_count_failed_pages_direct_reclaim 14 24 52
> >> perf_count_failed_pages_direct_reclaim_but_progress 14 21 52 0.04% !!!
> >>
> >> On the other hand we see that the number of direct_reclaim calls increased by ~x4.
> >>
> >> I assume (might be totally wrong) that the x4 increase of direct_reclaim calls could be caused by the fact that before we used higher orders which worked on x4 number of pages at once.
> >
> > But the order parameter was always passed as constant 0 by the caller?
>
> Uh - I didn't see that in the first look - you're right.
> So the x4 in direct_reclaim calls are not caused by e.g. missing order information. Thanks for spotting this.
>
> >
> >> This leaves me with two ideas what the real issue could be:
> >> 1. either really the 6th parameter as this is the first one that needs to go on stack and that way might open a race and rob gcc a big pile of optimization chances
> >
> > It must be something to do with code generation AFAIKS. I'm surprised
> > the function isn't inlined, giving exactly the same result regardless
> > of the patch.
>
> after checking asm I can tell that rmqueue_bulk is inlined.
> Therefore the test to inline it explicitly as requested in another mail can be considered negligible.
Well, this is good at least. The expectation was that it was getting
automatically inlined but I'd stopped expecting anything as this problem
is weird.
> It actually boils down to something different in Mels patch - see below.
>
> > Unlikely to be a correctness issue with code generation, but I'm
> > really surprised that a small difference in performance could have
> > such a big (and apparently repeatable) effect on behaviour like this.
> >
> > What's the assembly look like?
> >
>
> Line 214763
> This is the preparation of the __mod_zone_page_state call from rmqueue_bulk which is inlined into
> get_page_from_freelist.
>
> !ORDER CHANGED FOR READABILITY!
> FAST SLOW
> lgr %r2,%r11 #parm 1 from r11 lgr %r2, %r11 #parm 1 from r11 - same
> lghi %r3,0 #parm 2 = 0 const lghi %r3,0 #parm 2 = 0 - same
> lghi %r4,-1 #parm 3 = -1 const lcr %r4,%r12 #parm 3 as complement of r12?
> lgfr %r4,%r4 #parm 3 - clear upper 32 bit?
I'm not an s390 assembler expert, but from the (only) reference I found:
LGHI - loads a halfword (16 bits) into a 64-bit register, here -1 into register 4
LCR - load the complement of count (i.e. -count)
LGFR - load a 64-bit register with a 32-bit value; no idea if this clears the
upper bits or not.
So, they are doing very different things here.... err, crap, because I
broke it. The patch to eliminate the parameter is wrong
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
should have been
- __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+ __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
> brasl #call brasl #call
>
> => This is most probably this part of Mel's patch:
> 23 @@ -965,7 +965,7 @@
> 24 set_page_private(page, migratetype);
> 25 list = &page->lru;
> 26 }
> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -1);
> 29 spin_unlock(&zone->lock);
> 30 return i;
> 31 }
>
> -> doesn't look too important, but to be sure we can build an executable with
> just this change, but still passing order to rmqueue_bulk - see below.
>
It's important, it's broken. That -1 should be -i. Can you test with
that fixed? I'm very sorry.
> Line 214800
> Different end of the function. In the slow path there is a check on %r4 and
> dependent two different jump targets inside get_page_from_freelist, while in
> the fast version there is only an unconditional jump (both after a
> _raw_spin_lock_wait).
>
> FAST SLOW
> brasl # call _raw_spin_lock_wait brasl # _raw_spin_lock_wait
> j 1e6294 get_page_from_freelist lg %r4,288(%r15) # from stack
> ltgr %r4,%r4 # compare
> jne 1e62a2 get_page_from_freelist
> lhi %r12,0 # new call gets r12 @ 0 (GOT?)
> j 1e6340 get_page_from_freelist
>
> The rest is the same. So what is left is that the slow variant has a new branch
> back in to get_page_from_freelist with r12 set to zero. This jump wents directly
> after the also new lcr call seen in the first difference and might therfore be
> related to that change.
>
> Now I first thought it might not be rmqueue_bulk that misses optimization, but
> __mod_zone_page_state - but that one looks the same in both cases.
>
> Therefore I took a shot for 2.6.32 with just that snippet above applied
> (__mod_zone_page_state call without order which should be fine as we know it
> is 0 in all cases). > And it is interesting to see that this snippet alone
> is enough to fix throughput and the direct_reclaim -> progress&!page ratio.
Unfortunately, it's because I broke the free counters in that patch so
we are probably breaking watermarks. I imagine the counters in
/proc/vmstat are very funny looking.
>
> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>
> 23 @@ -965,7 +965,7 @@
> 24 set_page_private(page, migratetype);
> 25 list = &page->lru;
> 26 }
> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
> 29 spin_unlock(&zone->lock);
> 30 return i;
> 31 }
>
> Now that one has still the issue of the very huge throughput loss and the bad ratio of driect_reclaims leading into congestion_wait.
> Now as far as I can see that - oh so important - "-i" or "-1" ends up in zone_page_state_add as variable x:
> static inline void zone_page_state_add(long x, struct zone *zone,
> enum zone_stat_item item)
> {
> atomic_long_add(x, &zone->vm_stat[item]);
> atomic_long_add(x, &vm_stat[item]);
> }
>
> I guess the intention there is to update the zone with the number of pages freed - and I also guess that -1 as constant might be wrong there.
It is.
> That would also explain some weird output I saw like this after boot:
> h37lp01:~ # cat /proc/meminfo
> MemTotal: 251216 kB
> MemFree: 483460 kB bigger than total
>
> BUT, and that is now again open for discussion - "__mod_zone_page_state(zone, NR_FREE_PAGES, -1)" even being wrong - fixed the issue in terms of throughput and the congestion_waits as good as reverting e084b+5f8dcc21!
It "fixes" it only by not calling direct reclaim when it should :(
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
[...]
>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>
>> 23 @@ -965,7 +965,7 @@
>> 24 set_page_private(page, migratetype);
>> 25 list = &page->lru;
>> 26 }
>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>> 29 spin_unlock(&zone->lock);
>> 30 return i;
>> 31 }
>>
[...]
> It "fixes" it only by not calling direct reclaim when it should :(
yeah as we both realized -1 was not right so it was more a crazy workaround :-)
Anyway, with that being a dead end again, I dug even deeper into the details of direct_reclaim. I think we can agree that, from the counters we already have, we know the race between try_to_free making progress and get_page not getting a page - which causes the congestion_wait - is the source of the issue.
To bring some more light into all that, I extended my perf counters to track a few more details in direct_reclaim.
Two patches are attached and apply after the other three already available in that thread.
The intention is
a) to track the time
a1) spent in try_to_free_pages
a2) consumed after try_to_free_pages until get_page_from_freelist
a3) spent in get_page_from_freelist
b1) after seeing that order!=0 leads to drain_all_pages, I wondered if that might differ, even though all calls look like they have order zero
b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progress&!page) - a rough sketch of where these probe points sit follows below
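To make clearer what a1-a3 and b2 measure, here is a rough sketch of the probe points in __alloc_pages_direct_reclaim (heavily simplified, parameter lists trimmed - not the literal attached patch):

static struct page *__alloc_pages_direct_reclaim(/* ... zonelist, order, gfp_mask,
						    nodemask, did_some_progress, ... */)
{
	struct page *page = NULL;
	ktime_t t0, t1, t2, t3;

	cond_resched();

	t0 = ktime_get();
	*did_some_progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
	t1 = ktime_get();		/* a1 = t1 - t0, time in try_to_free_pages */

	cond_resched();
	if (order)
		drain_all_pages();	/* never taken here, order is always 0 */

	if (likely(*did_some_progress)) {
		t2 = ktime_get();	/* a2 = t2 - t1, gap between free and get */
		page = get_page_from_freelist(/* ...args trimmed... */);
		t3 = ktime_get();	/* a3 = t3 - t2, time in get_page_from_freelist */
		/* b2: accumulate *did_some_progress separately for the
		 * page != NULL (fast path) and page == NULL (slow path) case */
	}
	return page;
}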
Naming convention (otherwise it might get confusing here)
Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
Bad case - the scenario e.g. on a clean 2.6.32
Fast path - direct reclaim calls that did not run into progress&!page
Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
Mini summary of what we had before in huge tables:
fast path slow path
GOOD CASE ~98% ~1-3%
BAD CASE ~70% ~30%
-> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
Out of the numbers I got the following things might help us to create a new approach to a solution.
The timings showed that the so-called slow case actually passes through direct_reclaim much faster in the bad case.
GOOD CASE duration
a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
BAD CASE duration deviation to good case in %
a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path: try_to_free is 14% faster, more importantly the time between try_to_free and get_page is 87% (!) faster => less than a fifth, and finally get_page is 54% faster, but that is probably just failing on an almost empty list, which is fast^^.
As I checked, order was always zero, so the time is not spent in drain_all_pages, and the only other thing left might be cond_resched?!
Everything else is a few assignments, so it can't be much else.
But why would e.g. not running into schedule in cond_resched cause get_page to not find anything? I don't know, and I would expect it to be the other way around - the faster you get from freeing to allocating, the more pages should be left.
I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
From those I see that the good case as well as the bad case frees an average of 62 pages in the fast path.
But in the slow path they start to differ - while the good case, which only seldom runs into that path anyway, frees an average of 1.46 pages (that might be the thing causing it to eventually not get a page), the bad case makes a progress of avg 37 pages even in the slow path.
PAGES-FREED fast path slow path
GOOD CASE ~62 ~1.46
BAD CASE ~62 ~37
Thinking of it as asking "how few pages do I have to free until I fall from the fast to the slow path", the kernels behave differently - it looks wrong, but interesting.
The good case only drops to the slow path (!page) with ~1.46 pages freed, while the bad case seems to enter it much earlier, with even 37 pages freed.
As order is always 0 and get_page is afaik about getting just "one" page, I wonder where these 37 pages disappeared, especially as in the bad case it gets to get_page much faster after freeing those ~37 pages.
Comments and ideas welcome!
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Christian Ehrhardt wrote:
> Mel Gorman wrote:
>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
> [...]
>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>
>>> 23 @@ -965,7 +965,7 @@
>>> 24 set_page_private(page, migratetype);
>>> 25 list = &page->lru;
>>> 26 }
>>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>> 29 spin_unlock(&zone->lock);
>>> 30 return i;
>>> 31 }
>>>
> [...]
>> It "fixes" it only by not calling direct reclaim when it should :(
>
> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>
> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>
> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
> Two patches are attached and apply after the other three already available in that thread.
> The intention is
> a) to track the time
> a1) spent in try_to_free_pages
> a2) consumed after try_to_free_pages until get_page_from_freelist
> a3) spent in get_pages_from_freelist
> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>
> Naming convention (otherwise it might get confusing here)
> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
> Bad case - the scenario e.g. on a clean 2.6.32
> Fast path - direct reclaim calls that did not run into progress&!page
> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>
> Mini summary of what we had before in huge tables:
> fast path slow path
> GOOD CASE ~98% ~1-3%
> BAD CASE ~70% ~30%
> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>
> Out of the numbers I got the following things might help us to create a new approach to a solution.
> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>
> GOOD CASE duration
> a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
> a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
> a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
> a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
> a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
> a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
> BAD CASE duration deviation to good case in %
> a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
> a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
> a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
> a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
> a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
> a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
>
> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>
> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
> Everything else are a few assignments so it can't be much else.
> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
The reason here is probably the fact that in the bad case a lot of
processes are waiting on congestion_wait and are therefore not runnable
and that way not scheduled via cond_resched.
I'll test this theory today or tomorrow with cond_resched in
direct_reclaim commented out and expect almost no difference.
> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>
> PAGES-FREED fast path slow path
> GOOD CASE ~62 ~1.46
> BAD CASE ~62 ~37
5f8dcc21 introduced per-migrate-type pcp lists - is it possible that we
run into a scenario where try_to_free frees a lot of pages, but of the
wrong migrate type, and get_page afterwards still can't use them?
At least try_to_free_pages and the shrink_ functions it calls are not
migrate type aware, while get_page and its subsequent buffered_rmqueue
and rmqueue_bulk are - btw, this is where patch e084b comes in.
I only see buffered_rmqueue choosing a specific pcp list based on migrate
type, with a fallback to MIGRATE_RESERVE - is that enough fallback? What
if the reserve is empty too, but a few other types are not, and those
other types are the ones filled by try_to_free?
I'll try to get a per migrate type #pages statistic after direct_reclaim
reaches !page - maybe that can confirm some parts of my theory.
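Something like the following is what I have in mind for that statistic (a sketch only, against the 2.6.32 structures, to be called right after get_page_from_freelist returned NULL):

static void count_free_per_migratetype(struct zone *zone)
{
	unsigned long counts[MIGRATE_TYPES] = { 0 };
	unsigned long flags;
	struct list_head *pos;
	int order, mt;

	spin_lock_irqsave(&zone->lock, flags);
	for (order = 0; order < MAX_ORDER; order++)
		for (mt = 0; mt < MIGRATE_TYPES; mt++)
			list_for_each(pos, &zone->free_area[order].free_list[mt])
				counts[mt] += 1UL << order;	/* in order-0 pages */
	spin_unlock_irqrestore(&zone->lock, flags);

	for (mt = 0; mt < MIGRATE_TYPES; mt++)
		printk(KERN_DEBUG "zone %s migratetype %d: %lu order-0 pages free\n",
		       zone->name, mt, counts[mt]);
}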
> Thinking of it as asking "how few pages do I have to free until I fall from fast to slow path" the kernels behave different it looks wrong but interesting.
> The good case only drops to slow path (!page) in case of ~1.46 pages freed while the bad case seems to enter that much earlier with even 37 pages freed.
>
> As order is always 0 and get_page afaik about getting just "one" page I wonder where these 37 pages disappeared especially as in bad case it is much faster getting to get_pages after freeing those ~37 pages.
>
> Comments and ideas welcome!
>
>
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>> [...]
>>>> The differences in asm are pretty much the same, as before
>>>> rmqueue_bulk was already inlined the actually intended change to its
>>>> parameters was negligible.
>>>> I wondered if it would be important if that is a constant value (-1)
>>>> or if the reason was caused by that shift. So I tried:
>>>>
>>>> 23 @@ -965,7 +965,7 @@
>>>> 24 set_page_private(page, migratetype);
>>>> 25 list = &page->lru;
>>>> 26 }
>>>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>> 29 spin_unlock(&zone->lock);
>>>> 30 return i;
>>>> 31 }
>>>>
>> [...]
>>> It "fixes" it only by not calling direct reclaim when it should :(
>>
>> yeah as we both realized -1 was not right so it was more a crazy
>> workaround :-)
>>
>> Anyway after that being a dead end again I dug even deeper into the
>> details of direct_reclaim - I think we can agree that out of the
>> counters we already know the race between try_to_free making progress
>> and get_page not getting a page causing the congestion_wait is source
>> of the issue.
>>
>> So what I tried to bring some more light into all that was extending
>> my perf counters to track a few more details in direct_reclaim.
>> Two patches are attached and apply after the other three already
>> available in that thread.
>> The intention is
>> a) to track the time
>> a1) spent in try_to_free_pages
>> a2) consumed after try_to_free_pages until get_page_from_freelist
>> a3) spent in get_pages_from_freelist
>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that
>> might differ even all calls look like they have zero
>> b2) tracking the average amount of pages freed by try_to_free_pages
>> for fast path and slow path (progres&!page)
>>
>> Naming convention (otherwise it might get confusing here)
>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted
>> resulting in high throughput and a low ratio of direct_reclaim running
>> into progress&!page
>> Bad case - the scenario e.g. on a clean 2.6.32
>> Fast path - direct reclaim calls that did not run into progress&!page
>> Slow path - direct reclaim calls that ran into progress&!page ending
>> up in a long congestion_wait and therefore called "slow" path
>>
>> Mini summary of what we had before in huge tables:
>> fast path slow path
>> GOOD CASE ~98% ~1-3%
>> BAD CASE ~70% ~30%
>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone
>> threads (worse with even more threads)
>>
>> Out of the numbers I got the following things might help us to create
>> a new approach to a solution.
>> The timings showed that that the so called slow case is actually much
>> faster passing though direct_reclaim in bad case.
>>
>> GOOD CASE duration
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
>> BAD CASE duration deviation
>> to good case in %
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
>>
>> That means that in the bad case the execution is much faster.
>> Especially in the case that eventually runs into the slow path
>> try_to_free is 14% faster, more important the time between try_to_free
>> and get_pages is 87%! faster => less than a fifth and finally get_page
>> is 54% faster, but that is probably just failing in an almost empty
>> list which is fast^^.
>>
>> As I checked order which always was zero the time is not spent in
>> drain_all_pages and the only other thing left might be cond_resched ?!
>> Everything else are a few assignments so it can't be much else.
>> But why would e.g. not running into schedule in cond_resched cause
>> get_pages to not find anything - I don't know and I would expect it
>> should be the other way around - the faster you get from free to get
>> the more pages should be left.
>
> THe reason here is probably the the fact that in the bad case a lot of
> processes are waiting on congestion_wait and are therefore not runnnable
> and that way not scheduled via cond_resched.
>
> I'll test this theory today or tomorrow with cond_resched in
> direct_reclaim commented out and expect almost no difference.
>
>> I thought the progress try_to_free_pages is doing might be important
>> as well so I took numbers for that too.
>> From those I see that the good case as well as the bad case has an
>> average of 62 pages freed in fast path.
>> But in slow path they start to differ - while the good case that is
>> running only seldom in that path anyway frees an average of 1.46 pages
>> (that might be the thing causing it not getting a page eventually) in
>> the bad case it makes a progress of avg 37 pages even in slow path.
>>
>> PAGES-FREED fast path slow path
>> GOOD CASE ~62 ~1.46
>> BAD CASE ~62 ~37
>
> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we
> run in a scenario where try_to_free frees a lot of pages via, but of the
> wrong migrate type?
> And afterwards get_page
> At least try_to_free_page and it's called shrink_ functions is not
> migrate type aware while get_page and its subsequent buffered_rmqueue
> and rmqueue_bulk are - btw here comes patch e084b.
>
> I only see buffered_rmqueue chose a specific pcp list based on migrate
> type, and a fallback to migrate_reserve - is that enough fallback, what
> if the reserve is empty too but a few other types would not and those
> other types are the ones filled by try_to_free?
I just saw the full type iteration in __rmqueue_fallback, but the
average of 37 freed pages still needs to go somewhere so that get_page
doesn't get one.
> I'll try to get a per migrate type #pages statistic after direct_reclaim
> reaches !page - maybe that can confirm some parts of my theory.
This might still be worth a try.
>> Thinking of it as asking "how few pages do I have to free until I fall
>> from fast to slow path" the kernels behave different it looks wrong
>> but interesting.
>> The good case only drops to slow path (!page) in case of ~1.46 pages
>> freed while the bad case seems to enter that much earlier with even 37
>> pages freed.
>>
>> As order is always 0 and get_page afaik about getting just "one" page
>> I wonder where these 37 pages disappeared especially as in bad case it
>> is much faster getting to get_pages after freeing those ~37 pages.
>>
>> Comments and ideas welcome!
>>
>>
>
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
>> Mel Gorman wrote:
>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>> [...]
>>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>>
>>>> 23 @@ -965,7 +965,7 @@
>>>> 24 set_page_private(page, migratetype);
>>>> 25 list = &page->lru;
>>>> 26 }
>>>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>> 29 spin_unlock(&zone->lock);
>>>> 30 return i;
>>>> 31 }
>>>>
>> [...]
>>> It "fixes" it only by not calling direct reclaim when it should :(
>>
>> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>>
>> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>>
>> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
>> Two patches are attached and apply after the other three already available in that thread.
>> The intention is
>> a) to track the time
>> a1) spent in try_to_free_pages
>> a2) consumed after try_to_free_pages until get_page_from_freelist
>> a3) spent in get_pages_from_freelist
>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
>> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>>
>> Naming convention (otherwise it might get confusing here)
>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
>> Bad case - the scenario e.g. on a clean 2.6.32
>> Fast path - direct reclaim calls that did not run into progress&!page
>> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>>
>> Mini summary of what we had before in huge tables:
>> fast path slow path
>> GOOD CASE ~98% ~1-3%
>> BAD CASE ~70% ~30%
>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>>
>> Out of the numbers I got the following things might help us to create a new approach to a solution.
>> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>>
>> GOOD CASE duration
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
>> BAD CASE duration deviation to good case in %
>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
>>
>> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>>
>> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
>> Everything else are a few assignments so it can't be much else.
>> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
>
> THe reason here is probably the the fact that in the bad case a lot of
> processes are waiting on congestion_wait and are therefore not runnnable
> and that way not scheduled via cond_resched.
>
> I'll test this theory today or tomorrow with cond_resched in
> direct_reclaim commented out and expect almost no difference.
>
>> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
>> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
>> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>>
>> PAGES-FREED fast path slow path
>> GOOD CASE ~62 ~1.46
>> BAD CASE ~62 ~37
>
> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we
> run in a scenario where try_to_free frees a lot of pages via, but of the
> wrong migrate type?
It's possible but the window is small. When a threshold is reached on the
PCP lists, they get drained to the buddy lists and later picked up again
by __rmqueue_fallback(). I had considered the possibility of pages of the
wrong type being on the PCP lists which led to the patch "page-allocator:
Fallback to other per-cpu lists when the target list is empty and memory is
low" but you reported it made no difference even when fallback was allowed
with high watermarks.
> And afterwards get_page
> At least try_to_free_page and it's called shrink_ functions is not
> migrate type aware while get_page and its subsequent buffered_rmqueue
> and rmqueue_bulk are -
Unless a large number of pages are pinned on those PCP lists (which
doesn't appear to be the case or the fallback patch would have helped),
the pages of other migratetypes are found by __rmqueue_fallback().
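For reference, the two mechanics involved look roughly like this (from memory, not verbatim 2.6.32 code):

	/* freeing side (free_hot_cold_page()): once a per-cpu list grows
	 * past pcp->high, a whole batch is drained back to the buddy lists */
	if (pcp->count >= pcp->high) {
		free_pcppages_bulk(zone, pcp->batch, pcp);
		pcp->count -= pcp->batch;
	}

	/* allocation side: if the requested migratetype free list is empty,
	 * __rmqueue_fallback() walks the other types in a fixed order, roughly
	 *   MOVABLE     -> RECLAIMABLE, UNMOVABLE, RESERVE
	 *   UNMOVABLE   -> RECLAIMABLE, MOVABLE,   RESERVE
	 *   RECLAIMABLE -> UNMOVABLE,   MOVABLE,   RESERVE
	 * so pages freed under the "wrong" type are still reachable once
	 * they have made it back to the buddy lists. */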
> btw here comes patch e084b.
>
Which still has zero explanation.
> I only see buffered_rmqueue chose a specific pcp list based on migrate
> type, and a fallback to migrate_reserve - is that enough fallback, what
> if the reserve is empty too but a few other types would not and those
> other types are the ones filled by try_to_free?
>
They get found by __rmqueue_fallback().
> I'll try to get a per migrate type #pages statistic after direct_reclaim
> reaches !page - maybe that can confirm some parts of my theory.
>
>> Thinking of it as asking "how few pages do I have to free until I fall from fast to slow path" the kernels behave different it looks wrong but interesting.
>> The good case only drops to slow path (!page) in case of ~1.46 pages freed while the bad case seems to enter that much earlier with even 37 pages freed.
>>
>> As order is always 0 and get_page afaik about getting just "one" page I wonder where these 37 pages disappeared especially as in bad case it is much faster getting to get_pages after freeing those ~37 pages.
>>
>> Comments and ideas welcome!
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
>> Christian Ehrhardt wrote:
>>> Mel Gorman wrote:
>>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
>>> [...]
>>>>> The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
>>>>> I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
>>>>>
>>>>> 23 @@ -965,7 +965,7 @@
>>>>> 24 set_page_private(page, migratetype);
>>>>> 25 list = &page->lru;
>>>>> 26 }
>>>>> 27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
>>>>> 28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
>>>>> 29 spin_unlock(&zone->lock);
>>>>> 30 return i;
>>>>> 31 }
>>>>>
>>> [...]
>>>> It "fixes" it only by not calling direct reclaim when it should :(
>>> yeah as we both realized -1 was not right so it was more a crazy workaround :-)
>>>
>>> Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
>>>
>>> So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
>>> Two patches are attached and apply after the other three already available in that thread.
>>> The intention is
>>> a) to track the time
>>> a1) spent in try_to_free_pages
>>> a2) consumed after try_to_free_pages until get_page_from_freelist
>>> a3) spent in get_pages_from_freelist
>>> b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
>>> b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
>>>
>>> Naming convention (otherwise it might get confusing here)
>>> Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
>>> Bad case - the scenario e.g. on a clean 2.6.32
>>> Fast path - direct reclaim calls that did not run into progress&!page
>>> Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
>>>
>>> Mini summary of what we had before in huge tables:
>>> fast path slow path
>>> GOOD CASE ~98% ~1-3%
>>> BAD CASE ~70% ~30%
>>> -> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
>>>
>>> Out of the numbers I got the following things might help us to create a new approach to a solution.
>>> The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
>>>
>>> GOOD CASE duration
>>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
>>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
>>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
>>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
>>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
>>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
>>> BAD CASE duration deviation to good case in %
>>> a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
>>> a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
>>> a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
>>> a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
>>> a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
>>> a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
>>>
>>> That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
>>>
>>> As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
>>> Everything else are a few assignments so it can't be much else.
>>> But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
>> THe reason here is probably the the fact that in the bad case a lot of
>> processes are waiting on congestion_wait and are therefore not runnnable
>> and that way not scheduled via cond_resched.
>>
>> I'll test this theory today or tomorrow with cond_resched in
>> direct_reclaim commented out and expect almost no difference.
>>
>>> I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.
>>> From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
>>> But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
>>>
>>> PAGES-FREED fast path slow path
>>> GOOD CASE ~62 ~1.46
>>> BAD CASE ~62 ~37
>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that we
>> run in a scenario where try_to_free frees a lot of pages via, but of the
>> wrong migrate type?
>
> It's possible but the window is small. When a threshold is reached on the
> PCP lists, they get drained to the buddy lists and later picked up again
> by __rmqueue_fallback(). I had considered the possibility of pages of the
> wrong type being on the PCP lists which led to the patch "page-allocator:
> Fallback to other per-cpu lists when the target list is empty and memory is
> low" but you reported it made no difference even when fallback was allowed
> with high watermarks.
> [...]
Today I created rather huge debug logs - I'll spare everyone too much
detail.
Eventually it comes down to a kind of cat /proc/zoneinfo-like output,
extended to list everything per migrate type too.
From that I still think there should be plenty of pages to get the
allocation through, as suggested by the high amount of pages freed by
try_to_free.
Due to that I tried calling a second get_page_from_freelist after the
first failing one and after the statistics output, which showed me there
were enough pages available.
Even that second one failed - but this is good, as it allows me to run
that second call with a kind of "debugthis" parameter, which I'll do
tomorrow.
If anyone wants the full 6.2M of that data let me know, but I think the
step-by-step debug of the second try will give us much smaller and
better insights to discuss.
More about that tomorrow,
Christian
P.S. Just one output of it here from a 16 thread read case:
### extended zoneinfo after failed direct reclaim ###
Progress just made in try_to_free 32
Current allocation order 0
Current allocation gfp_flags 201da GFP_COLD '1'
Current allocation migrate type '2'
Current cpu id 0
zone DMA
zni: pages_free 503
min 508
low 635
high 762
scanned 96
spanned 65536
present 64640
nr_free_pages 503
nr_inactive_anon 1536
nr_active_anon 1480
nr_inactive_file 33197
nr_active_file 9492
nr_unevictable 3016
nr_mlock 3016
nr_anon_pages 5188
nr_mapped 2295
nr_file_pages 43714
nr_dirty 206
nr_writeback 0
nr_slab_reclaimable 2762
nr_slab_unreclaimable 5368
nr_page_table_pages 151
nr_kernel_stack 215
nr_unstable 0
nr_bounce 0
nr_vmscan_write 476
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 32
nr_shmem 62
protection: (0,0,0)
Free areas:
Order 0 - nr_free 198 (matching 198 order 0 pages)
MIGRATE_UNMOVABLE 198
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 1 - nr_free 2 (matching 4 order 0 pages)
MIGRATE_UNMOVABLE 1
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 1*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 2 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 3 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 4 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 5 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 6 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 7 - nr_free 0 (matching 0 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 0
MIGRATE_ISOLATE 0
Order 8 - nr_free 1 (matching 256 order 0 pages)
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
MIGRATE_RESERVE 1
MIGRATE_ISOLATE 0
pagesets
cpu: 0 (current cpu)
count: 46
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 46*
vm_stats_threshold: 16
cpu: 1
count: 77
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 77*
vm_stats_threshold: 16
cpu: 2
count: 10
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 10*
vm_stats_threshold: 16
cpu: 3
count: 8
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 8*
vm_stats_threshold: 16
cpu: 4
count: 11
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 11*
vm_stats_threshold: 16
cpu: 5
count: 74
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 74*
vm_stats_threshold: 16
cpu: 6
count: 65
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 65*
vm_stats_threshold: 16
cpu: 7
count: 0
high: 90
batch: 15
MIGRATE_UNMOVABLE 0
MIGRATE_RECLAIMABLE 0
MIGRATE_MOVABLE 0*
vm_stats_threshold: 16
all_unreclaimable: 0
prev_priority: 10
start_pfn: 0
inactive_ratio: 1
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Christian Ehrhardt wrote:
>
>
> Mel Gorman wrote:
>> On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:
>>> Christian Ehrhardt wrote:
>>>> Mel Gorman wrote:
>>>>> On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:
[...]
>>>>
>>>> PAGES-FREED fast path slow path
>>>> GOOD CASE ~62 ~1.46
>>>> BAD CASE ~62 ~37
>>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that
>>> we run in a scenario where try_to_free frees a lot of pages via, but
>>> of the wrong migrate type?
>>
>> It's possible but the window is small. When a threshold is reached on the
>> PCP lists, they get drained to the buddy lists and later picked up again
>> by __rmqueue_fallback(). I had considered the possibility of pages of the
>> wrong type being on the PCP lists which led to the patch "page-allocator:
>> Fallback to other per-cpu lists when the target list is empty and
>> memory is
>> low" but you reported it made no difference even when fallback was
>> allowed
>> with high watermarks.
>> [...]
>
> Today I created rather huge debug logs - I'll spare everyone with too
> much detail.
> Eventually it comes down to some kind of cat /proc/zoneinfo like output
> extended to list all things per migrate type too.
>
> From that I still think there should be plenty of pages to get the
> allocation through, as it was suggested by the high amount of pages
> freed by try_to_free.
> >
[...]
> More about that tomorrow,
Well tomorrow is now, and I think I got some important new insights.
As mentioned before, I realized that a second call still fails most of the time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist and buffered_rmqueue to see where exactly the allocation fails (patch attached).
Now, with debugme active in a second call after direct_reclaim had progress&&!page, I saw the following repeating pattern in most of the cases:
get_page_from_freelist - zone loop - current zone 'DMA'
get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
get_page_from_freelist - return page '(null)'
Ok - now we at least know exactly why it gets no page.
Remember, there are plenty of pages free, as shown in my zoneinfo-like report in the last mail.
I didn't expect that, but watermarks are actually what stops the allocations here.
The zone_watermark_ok check reports that there are not enough pages for the current watermark, and
finally it ends up at zone_reclaim_mode, which is always zero on s390 as we are not CONFIG_NUMA.
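For reference, the check that refuses the zone looks roughly like this in 2.6.32 (trimmed; the per-order part is omitted since order is 0 here):

int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
		      int classzone_idx, int alloc_flags)
{
	long min = mark;
	long free_pages = zone_page_state(z, NR_FREE_PAGES) - (1 << order) + 1;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;	/* <- this is what we hit */
	/* ... per-order checks omitted ... */
	return 1;
}

Plugged in with the numbers from my earlier dump (pages_free 503, min 508, protection/lowmem_reserve 0, and as far as I can tell neither ALLOC_HIGH nor ALLOC_HARDER set for these allocations): 503 <= 508, so the DMA zone is refused although 503 pages plus whatever sits on the pcp lists are technically free.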
Ok, remember my scenario - several parallel iozone sequential read processes - there's not much allocation going on except for the page cache read-ahead related to that read workload.
The allocations for page cache seem to have no special watermarks selected via their GFP flags and therefore get stalled by congestion_wait - which, as there are no writes in flight, consumes its full timeout.
As I see significant impacts on the iozone throughput, and not only e.g. bad counters in direct_reclaim, the congestion_wait stalling seems to happen often/long enough to stall the application I/O itself.
That means that from the time VFS starts a read-ahead, the page allocation seems to be stalled long enough that the data is not ready when the application tries to read it, while it would be if the allocation were fast enough.
A simple test for this theory was to allow those failing allocations a second chance without watermark restrictions before putting them to sleep for such a long time.
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c 2010-02-19 09:53:14.000000000 +0100
+++ linux-2.6-git/mm/page_alloc.c 2010-02-19 09:56:26.000000000 +0100
@@ -1954,7 +1954,22 @@
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ /*
+ * If it gets here try it without watermarks before going
+ * to sleep.
+ *
+ * This will end up in alloc_high_priority and if that fails
+ * once more direct_reclaim but this time without watermark
+ * checks.
+ *
+ * FIXME: that is just for verification - a real fix needs to
+ * ensure e.g. page cache allocations don't drain all pages
+ * under watermark
+ */
+ if (!(alloc_flags & ALLOC_NO_WATERMARKS))
+ alloc_flags &= ALLOC_NO_WATERMARKS;
+ else
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto rebalance;
}
This fixes all the issues I have, but as stated in the FIXME it is unfortunately no fix for the real world.
Unfortunately, even now knowing the place of the issue so well, I don't see the connection to the commits e084b+5f8dcc21 - I couldn't find anything, but did they change the accounting somewhere, or e.g. change the timing/order of watermark updates and allocations?
Eventually it might come down to a discussion of allocation priorities, and we might even keep them as they are and accept this issue - I would still prefer a good second-chance implementation, other page cache allocation flags, or something else that explicitly solves this issue.
Mel's patch that replaces congestion_wait with a wait for the zone watermarks becoming available again is definitely a step in the right direction and should go upstream and into the long term support branches.
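Just to spell out my understanding of that idea (a conceptual sketch only, not Mel's actual patch - the zone->pressure_wq waitqueue is made up for illustration):

	/* allocation slow path, instead of congestion_wait(BLK_RW_ASYNC, HZ/50):
	 * sleep until the zone is back above its low watermark, or the old
	 * timeout expires, whichever comes first */
	wait_event_timeout(zone->pressure_wq,
			   zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0),
			   HZ/50);

	/* freeing side, after pages went back to the buddy lists: */
	if (waitqueue_active(&zone->pressure_wq) &&
	    zone_watermark_ok(zone, 0, low_wmark_pages(zone), 0, 0))
		wake_up(&zone->pressure_wq);

That way the waiters sleep exactly as long as needed instead of a fixed HZ/50.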
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED fast path slow path
> >>>> GOOD CASE ~62 ~1.46
> >>>> BAD CASE ~62 ~37
> >>> 5f8dcc21 introduced per migrate type pcp lists, is it possible that
> >>> we run in a scenario where try_to_free frees a lot of pages via, but
> >>> of the wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on the
> >> PCP lists, they get drained to the buddy lists and later picked up again
> >> by __rmqueue_fallback(). I had considered the possibility of pages of the
> >> wrong type being on the PCP lists which led to the patch "page-allocator:
> >> Fallback to other per-cpu lists when the target list is empty and
> >> memory is
> >> low" but you reported it made no difference even when fallback was
> >> allowed
> >> with high watermarks.
> >> [...]
> >
> > Today I created rather huge debug logs - I'll spare everyone with too
> > much detail.
> > Eventually it comes down to some kind of cat /proc/zoneinfo like output
> > extended to list all things per migrate type too.
> >
> > From that I still think there should be plenty of pages to get the
> > allocation through, as it was suggested by the high amount of pages
> > freed by try_to_free.
> > >
> [...]
>
> > More about that tomorrow,
>
> Well tomorrow is now, and I think I got some important new insights.
>
> As mentioned before I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where the allocations exactly fails (patch
> attached).
>
> Now with debugme active in a second call after it had progress&&!page in direct_reclaim I saw the following repeating pattern in most of the cases:
> get_page_from_freelist - zone loop - current zone 'DMA'
> get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
> get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
> get_page_from_freelist - return page '(null)'
>
> Ok - now we at least exactly know why it gets no page.
> Remember there are plenty of pages like it was in my zoneinfo like report in the last mail.
> I didn't expect that, but actually watermarks are what stops the allocations here.
That is somewhat expected. We also don't want to go underneath them
because that can lead to system deadlock.
> The zone_watermark_ok check reports that there are not enough pages for the current watermark and
> finally it ends with zone_reclaim_mode which is always zero on s390 as we are not CONFIG_NUMA.
>
> Ok remember my scenario - several parallel iozone sequential read processes
> - theres not much allocation going on except for the page cache read ahead
> related to that read workload.
> The allocations for page cache seem to have no special watermarks selected
> via their GFP flags and therefore get stalled by congestion_wait - which in
> consequence of no available writes in flight consumes its full timeout.
>
Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense except restoring something similar to the old behaviour
didn't help either.
> As I see significant impacts to the iozone throughput and not only
> e.g. bad counters in direct_reclaim the congestion_wait stalling seems to
> be so often/long to stall the application I/O itself.
>
> That means from the time VFS starting a read ahead it seems to stall that
> page allocation long enough that the data is not ready when the application
> tries to read it, while it would be if the allocation would be fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.
>
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c 2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c 2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
>
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + /*
> + * If it gets here try it without watermarks before going
> + * to sleep.
> + *
> + * This will end up in alloc_high_priority and if that fails
> + * once more direct_reclaim but this time without watermark
> + * checks.
> + *
> + * FIXME: that is just for verification - a real fix needs to
> + * ensure e.g. page cache allocations don't drain all pages
> + * under watermark
> + */
> + if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> + alloc_flags &= ALLOC_NO_WATERMARKS;
> + else
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> }
>
> This fixes all issues I have, but as stated in the FIXME it is unfortunately
> no fix for the real world.
It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there but anyway.
> Unfortunately even now knowing the place of the issue so well I don't see
> the connection to the commits e084b+5f8dcc21
Still a mystery.
> - I couldn't find something but
> did they change the accounting somewhere or e.g. changed the timing/order
> of watermark updates and allocations?
>
Not that I can think of.
> Eventually it might come down to a discussion of allocation priorities and
> we might even keep them as is and accept this issue - I still would prefer
> a good second chance implementation, other page cache allocation flags or
> something else that explicitly solves this issue.
>
In that line, the patch that replaced congestion_wait() with a waitqueue
makes some sense.
> Mel's patch that replaces congestion_wait with a wait for the zone watermarks
> becoming available again is definitely a step in the right direction and
> should go into upstream and the long term support branches.
I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.
When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
[...]
>
>> Unfortunately even now knowing the place of the issue so well I don't see
>> the connection to the commits e084b+5f8dcc21
>
> Still a mystery.
>
>> - I couldn't find something but
>> did they change the accounting somewhere or e.g. changed the timing/order
>> of watermark updates and allocations?
>>
>
> Not that I can think of.
>
>> Eventually it might come down to a discussion of allocation priorities and
>> we might even keep them as is and accept this issue - I still would prefer
>> a good second chance implementation, other page cache allocation flags or
>> something else that explicitly solves this issue.
>>
>
> In that line, the patch that replaced congestion_wait() with a waitqueue
> makes some sense.
[...]
> I'll need to do a number of tests before I can move that upstream but I
> don't think it's a merge candidate. Unfortunately, I'll be offline for a
> week starting tomorrow so I won't be able to do the testing.
>
> When I get back, I'll revisit those patches with the view to pushing
> them upstream. I hate to treat symptoms here without knowing the
> underlying problem but this has been spinning in circles for ages with
> little forward progress :(
I'll continue with some debugging in search of the real reasons, but if
I can't find a new way to look at it, I think we have to drop it for now.
While I hate fixing symptoms too, I completely agree that it is time to
bring this fix upstream and in fact into the stable kernel too, to have
at least a ~98% workaround out there.
I'm looking forward for your revised patch after you are back and I'm
eager to test this one again.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
Christian Ehrhardt wrote:
> I'll continue with some debugging in search of the real reasons, but if
> I can't find a new way to look at it I think we have to drop it for now.
>
[...]
>
As a last try I partially rewrote my debug patches; they now report what
I call "extended zone info" (like /proc/zoneinfo plus free area and
per-migrate-type counters) once per second from a random direct_reclaim call.
Naming: as before, I call plain 2.6.32 "orig" or the "bad case", and 2.6.32
with e084b and 5f8dcc21 reverted "Rev" or the "good case".
Depending on whether or not an allocation failed before the statistics were
reported, I call the samples "failed" or "worked".
I therefore split the resulting data into four cases: orig-failed,
orig-worked, Rev-failed and Rev-worked.
This could again end up as a report of something most people expected (like
the "stopped by watermark" result last time), but I still think it is worth
reporting and having everyone take a look at it.
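For illustration, the reporting boils down to something like this (a simplified sketch of the idea, not the actual debug patch; it only shows the per-order part, reads are racy but fine for debugging):

	/* dump per-order free counts of a zone, at most once per second */
	static void debug_dump_zone(struct zone *zone)
	{
		static unsigned long next;
		unsigned int order;

		if (time_before(jiffies, next))
			return;
		next = jiffies + HZ;

		for (order = 0; order < MAX_ORDER; order++)
			printk(KERN_DEBUG "zone %s order %u nr_free %lu\n",
				zone->name, order,
				zone->free_area[order].nr_free);
	}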
PRE)
First, and probably most important to keep in mind later on: the good case
seems to have more pages free, usually staying above the watermark, and
therefore does not run into those failed direct_reclaim allocations.
The question for all the observations below is: "can this affect the number
of free pages directly, or is it indirectly a pointer to whatever else is
going on that affects the number of free pages?"
Note for all the data below: the page cache allocations that occur when
running the read workload have __GFP_COLD set and MOVABLE as the preferred
migrate type.
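For reference, where that combination comes from in this workload, as far as I can tell (a sketch based on my reading of mainline, not part of any patch discussed here): readahead allocates page cache pages via page_cache_alloc_cold(), and the mapping's gfp mask is GFP_HIGHUSER_MOVABLE for regular page cache, so allocflags_to_migratetype() picks MIGRATE_MOVABLE:

	#include <linux/pagemap.h>

	/* sketch only: how the read workload's page cache pages end up
	 * being __GFP_COLD and MIGRATE_MOVABLE */
	static struct page *readahead_alloc_example(struct address_space *mapping)
	{
		/* page_cache_alloc_cold() adds __GFP_COLD to the mapping mask */
		struct page *page = page_cache_alloc_cold(mapping);

		/*
		 * mapping_gfp_mask() is GFP_HIGHUSER_MOVABLE for normal page
		 * cache, so allocflags_to_migratetype() returns MIGRATE_MOVABLE
		 * for these allocations.
		 */
		return page;
	}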
1) Free page distribution per order in the free area lists
These numbers cover the distribution of free pages across the free area
lists per order, similar to what /proc/pagetypeinfo reports.
There is a major difference between plain 2.6.32 and the kernel with e084b
and 5f8dcc21 reverted. While the good case shows at least some spread, with
a few elements in orders 2-7, the bad case looks quite different: a huge
peak at order 0, roughly the same at order 1, and much less in orders 2-7.
Both cases keep one order-8 page as reserve at all times.
Pages per Order        0      1     2     3     4     5     6     7    8
Bad Case          272.85  22.10  2.43  0.51  0.14  0.01  0.01  0.06    1
Good Case          97.55  15.29  3.08  1.39  0.77  0.11  0.29  0.90    1
This might not look like much, but converted into equivalent 4k (order-0) pages the numbers look like this:
4k pages per Order     0      1      2      3      4     5      6       7    8
Bad Case          272.85  44.21   9.73   4.08   2.23  0.23   0.45    7.62  256
Good Case          97.55  30.58  12.30  11.12  12.24  3.64  18.67  114.91  256
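The conversion is simply the average count at each order multiplied by 2^order; e.g. the ~0.90 order-7 pages in the good case are worth roughly 0.90 * 128 ≈ 115 order-0 pages (114.91 above, using the unrounded input). A trivial sketch of it:

	/* how the "4k pages per Order" rows were derived from the rows above */
	static double order_to_4k_pages(double avg_count, unsigned int order)
	{
		return avg_count * (double)(1UL << order);
	}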
So something seems to allow grouping into higher orders much better in the
good case.
I wonder if there might be code somewhere effectively doing something like:

	if (could_collapse_into_higher_order(page))
		free(page);
	else
		do_nothing();

-> leaving fewer free pages and fewer higher-order pages in the bad case.
Remember the introduction - what we are eventually looking for is why the
bad case has fewer free pages.
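For reference, whether order-0 frees can group upward is decided purely by the buddy rule: a freed page is merged into the next order only if its buddy - the page at pfn ^ (1 << order) - is itself free at the same order, repeated until a buddy is busy, so there is no explicit keep-or-drop decision like in the pseudocode above. A minimal illustration of the buddy relation (not kernel code):

	/* the buddy of a page at 'pfn' for a given order */
	static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
	{
		return pfn ^ (1UL << order);
	}

Fewer high-order free pages in the bad case therefore means the buddies of the freed pages are more often still allocated there.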
3) Migrate types on free areas
Looking at the numbers above in more detail, i.e. split into the different
migrate types, shows another difference.
The bad case has most of its pages marked unmovable - interestingly almost
exactly the amount of pages that is shifted from the higher orders to
order 0 when comparing the good and bad case.
So this might be related to the different order distribution we see above.
(BTW - on s390 all memory from 0-2Gb is one zone; as we have 256m in these
tests, everything is in one zone.)
BAD CASE
Free pgs per migrate type @ order      0      1     2     3     4     5     6     7    8
MIGRATE_UNMOVABLE                 178.17   0.38  0.00  0.00  0.00  0.00  0.00  0.00    0
MIGRATE_RECLAIMABLE                12.95   0.58  0.01  0.00  0.00  0.00  0.00  0.00    0
MIGRATE_MOVABLE                    81.74  21.14  2.29  0.50  0.13  0.00  0.00  0.00    0
MIGRATE_RESERVE                     0.00   0.00  0.13  0.01  0.01  0.01  0.01  0.06    1
GOOD CASE
Free pgs per migrate type @ order      0      1     2     3     4     5     6     7     8
Normal MIGRATE_UNMOVABLE           21.70   0.14  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Normal MIGRATE_RECLAIMABLE          4.15   0.22  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Normal MIGRATE_MOVABLE             68.71  12.38  0.88  0.63  0.06  0.00  0.00  0.00  0.00
Normal MIGRATE_RESERVE              2.99   2.56  2.19  0.77  0.71  0.11  0.29  0.90  1.00
Normal MIGRATE_ISOLATE              0.00   0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00
Maybe this gives someone a good hint as to why we see that different
grouping, or even why we have fewer free pages in the bad case.
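One mechanism that shifts free pages between migrate types is the allocator's fallback path: when the free list of the requested type is empty, __rmqueue_fallback() steals the largest suitable block from another type and may retag the whole pageblock via move_freepages_block(). For reference, the fallback preference order in 2.6.32's mm/page_alloc.c looks roughly like this (cited from memory, please double-check against the tree):

	static int fallbacks[MIGRATE_TYPES][MIGRATE_TYPES-1] = {
		[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_RESERVE },
		[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_RESERVE },
		[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_RESERVE },
		[MIGRATE_RESERVE]     = { MIGRATE_RESERVE,     MIGRATE_RESERVE,   MIGRATE_RESERVE },
	};

Whether that path is actually hit more often here is pure speculation, but it would be one way for UNMOVABLE to grow at MOVABLE's expense.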
4) PCP list fill ratio
Finally, the last major difference I see is the fill ratio of the pcp lists.
The good case has an average of ~62 pages on the pcp lists, while the bad
case has only ~35:
AVG count on pcp lists
bad case 35.33
good case 62.46
When looking at the migrate types on the pcp lists (which is only possible
with 5f8dcc21 applied, i.e. not reverted), it looks like that is where the
movable pages have gone - something they could not have done before
per-migrate-type pcp lists were introduced.
AVG count per migrate type in bad case
MIGRATE_UNMOVABLE 12.57
MIGRATE_RECLAIMABLE 2.03
MIGRATE_MOVABLE 31.89
Is it possible that with 5f8dcc21 the MIGRATE_MOVABLE pages are drained from
the free areas to the pcp lists more aggressively, leaving mostly
MIGRATE_UNMOVABLE pages behind, which then e.g. cannot be grouped as well -
and that this somehow ends up leaving fewer free pages overall?
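For context, what 5f8dcc21 changed structurally (quoted from memory, please verify against the tree): the single per-cpu list became one list per pcp migrate type, so movable, reclaimable and unmovable frees no longer share a list, while still sharing the count/high/batch limits:

	struct per_cpu_pages {
		int count;	/* number of pages on all the lists */
		int high;	/* high watermark, emptying needed */
		int batch;	/* chunk size for buddy add/remove */

		/* one list per pcp migrate type (unmovable/reclaimable/movable) */
		struct list_head lists[MIGRATE_PCPTYPES];
	};

That at least makes the per-type pcp accounting above possible; whether it also changes how aggressively MOVABLE pages are drained is exactly the open question.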
FIN)
So, that's it from my side.
I look forward to the finalized congestion_wait->zone wait patch however
this turns out (the zone wait is reasonable in my opinion, whether or not
it only fixes the symptom).
But I still have a small amount of hope left that all the data I found here
might give someone the kick to see what's going on backstage in mm because
of those patches.
--
Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
On Thu, Feb 25, 2010 at 04:13:52PM +0100, Christian Ehrhardt wrote:
> Christian Ehrhardt wrote:
> FIN)
> So, that's it from my side.
> I look forward to the finalized congestion_wait->zone wait patch however
> this turns out (the zone wait is reasonable in my opinion, whether or not
> it only fixes the symptom).
> But I still have a small amount of hope left that all the data I found here
> might give someone the kick to see what's going on backstage in mm because
> of those patches.
Hi Christian,
Thanks for doing all this work. I've been looking at a couple of other
regressions while it seemed like you guys were making progress. I can't
seem to reproduce it on my x86-64 system (was anybody able to reproduce
it on x86?)
Anyway I have asked our mainframe guy whether we can set up an
environment.
I agree it is a real problem that really needs to be properly explained.
Thanks,
Nick