When I tested compaction in a low-memory condition, I found that
my benchmark got stuck in congestion_wait() in shrink_inactive_list().
The stall lasts for 1 sec before it can escape. Further investigation
shows that it is caused by a stale vmstat value: vmstat is updated
every 1 sec, so the task is stuck for up to 1 sec.
I guess it is caused by the NR_ISOLATED_XXX updates. Direct
reclaim/compaction isolates some pages. After some processing,
they are returned to the LRU or freed and NR_ISOLATED_XXX is adjusted
back, so it should recover to zero. However, it is possible that some
of the updates are applied to the global counter while others are
applied only to a per-cpu variable. In that case, zone_page_state()
returns a stale value and the task can get stuck.
This problem can be solved by adjusting the zone_page_state() result
with this cpu's vmstat diff. It's sub-optimal because a task on another
cpu can still get stuck due to a stale vmstat value, but it at least
solves some use cases without adding much overhead, so I think it is
worth doing. With this change, I can't reproduce any stall in my test.
Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/vmstat.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 62af0f8..7c84896 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -133,6 +133,9 @@ static inline unsigned long zone_page_state(struct zone *zone,
{
long x = atomic_long_read(&zone->vm_stat[item]);
#ifdef CONFIG_SMP
+ long diff = this_cpu_read(zone->pageset->vm_stat_diff[item]);
+
+ x += diff;
if (x < 0)
x = 0;
#endif
--
1.9.1
On Tue, 24 Nov 2015, Joonsoo Kim wrote:
> When I tested compaction in a low-memory condition, I found that
> my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> The stall lasts for 1 sec before it can escape. Further investigation
> shows that it is caused by a stale vmstat value: vmstat is updated
> every 1 sec, so the task is stuck for up to 1 sec.
vmstat values are not designed to be accurate and are not guaranteed to be
accurate. Comparing to specific values should not be done. If you need an
accurate counter then please use another method of accounting like an
atomic.
On Tue, Nov 24, 2015 at 09:36:09AM -0600, Christoph Lameter wrote:
> On Tue, 24 Nov 2015, Joonsoo Kim wrote:
>
> > When I tested compaction in a low-memory condition, I found that
> > my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> > The stall lasts for 1 sec before it can escape. Further investigation
> > shows that it is caused by a stale vmstat value: vmstat is updated
> > every 1 sec, so the task is stuck for up to 1 sec.
>
> vmstat values are not designed to be accurate and are not guaranteed to be
> accurate. Comparing to specific values should not be done. If you need an
> accurate counter then please use another method of accounting like an
> atomic.
I don't think that maintaining a duplicate counter to guarantee accuracy
is a reasonable solution; it would add more overhead to the system.
Although vmstat values aren't designed for accuracy, they are already
used in some sensitive places, so it is better for them to be more accurate.
What this patch does is just add the current cpu's diff to the global
value when retrieving it, in order to get a more accurate value, and
this is not expensive. I think it doesn't break any design principle of vmstat.
Thanks.
On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
> When I tested compaction in a low-memory condition, I found that
> my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> The stall lasts for 1 sec before it can escape. Further investigation
> shows that it is caused by a stale vmstat value: vmstat is updated
> every 1 sec, so the task is stuck for up to 1 sec.
Wouldn't it be sufficient to use zone_page_state_snapshot in
too_many_isolated?
--
Michal Hocko
SUSE Labs
On 11/25/2015 01:00 PM, Michal Hocko wrote:
> On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
>> When I tested compaction in a low-memory condition, I found that
>> my benchmark got stuck in congestion_wait() in shrink_inactive_list().
>> The stall lasts for 1 sec before it can escape. Further investigation
>> shows that it is caused by a stale vmstat value: vmstat is updated
>> every 1 sec, so the task is stuck for up to 1 sec.
>
> Wouldn't it be sufficient to use zone_page_state_snapshot in
> too_many_isolated?
That sounds better than the ad-hoc half-solution, yeah.
I don't know how performance-sensitive the callers are, but maybe it
could do a non-snapshot check first, and only repeat with _snapshot when
it's about to wait (i.e. the result is true), just to make sure?
OTOH, how big an issue is this? I suspect the system had genuinely been
too_many_isolated(), or very close, in order to hit the condition in the
first place, and the inaccuracy just delays the recovery a bit?
On Wed 25-11-15 14:43:38, Vlastimil Babka wrote:
> On 11/25/2015 01:00 PM, Michal Hocko wrote:
> > On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
> >> When I tested compaction in a low-memory condition, I found that
> >> my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> >> The stall lasts for 1 sec before it can escape. Further investigation
> >> shows that it is caused by a stale vmstat value: vmstat is updated
> >> every 1 sec, so the task is stuck for up to 1 sec.
> >
> > Wouldn't it be sufficient to use zone_page_state_snapshot in
> > too_many_isolated?
>
> That sounds better than the ad-hoc half-solution, yeah.
> I don't know how performance-sensitive the callers are, but maybe it
> could do a non-snapshot check first, and only repeat with _snapshot when
> it's about to wait (i.e. the result is true), just to make sure?
I am not sure this is worth bothering. We are in the reclaim which is
not a hot path.
--
Michal Hocko
SUSE Labs
On Wed, 25 Nov 2015, Joonsoo Kim wrote:
> I don't think that maintaining a duplicate counter to guarantee accuracy
> is a reasonable solution; it would add more overhead to the system.
Simply remove the counter from the vmstat handling and do it differently
then.
> Although vmstat values aren't designed for accuracy, they are already
> used in some sensitive places, so it is better for them to be more accurate.
The design is to sacrifice accuracy and the time the updates occur for
performance reasons. This is not the purpose the counters were designed
for. If you put these demands on the vmstat then you will get complex
convoluted code and compromise performance.
> What this patch does is just add the current cpu's diff to the global
> value when retrieving it, in order to get a more accurate value, and
> this is not expensive. I think it doesn't break any design principle of vmstat.
There have been a number of expectations recently regarding the accuracy
of vmstat. We are on the wrong track here.
On Wed 25-11-15 10:04:44, Christoph Lameter wrote:
> On Wed, 25 Nov 2015, Joonsoo Kim wrote:
>
> > I don't think that maintaining a duplicate counter to guarantee accuracy
> > is a reasonable solution; it would add more overhead to the system.
>
> Simply remove the counter from the vmstat handling and do it differently
> then.
We definitely do not want yet another set of counters. vmstat counters
are not only to be exported into the userspace. We have in kernel users
as well. I do agree that there are users who can cope with some level of
imprecision though and those which depend on the accuracy can use
zone_page_state_snapshot which doesn't impose any overhead on others.
[...]
--
Michal Hocko
SUSE Labs
On Wed, 25 Nov 2015, Michal Hocko wrote:
> > Simply remove the counter from the vmstat handling and do it differently
> > then.
>
> We definitely do not want yet another set of counters. vmstat counters
> are not only to be exported into the userspace. We have in kernel users
> as well. I do agree that there are users who can cope with some level of
> imprecision though and those which depend on the accuracy can use
> zone_page_state_snapshot which doesn't impose any overhead on others.
> [...]
Ok then the proper patch would be to use zone_page_state_snapshot()
instead of zone_page_state() here, instead of modifying zone_page_state().
On Wed, Nov 25, 2015 at 10:04:44AM -0600, Christoph Lameter wrote:
> > Although vmstat values aren't designed for accuracy, they are already
> > used in some sensitive places, so it is better for them to be more accurate.
>
> The design is to sacrifice accuracy and the time the updates occur for
> performance reasons. This is not the purpose the counters were designed
> for. If you put these demands on the vmstat then you will get complex
> convoluted code and compromise performance.
I understand the design decision, but it is better to get a value that
is as accurate as possible when there is no performance problem. My
patch would not cause much performance degradation because it only adds
one this_cpu_read().
Consider the following example. The current implementation returns a
surprising result if someone does the following:
v1 = zone_page_state(XXX);
mod_zone_page_state(XXX, 1);
v2 = zone_page_state(XXX);
v2 would be the same as v1 in most cases even though we have already
updated the counter.
This situation can occur in the page allocation path and elsewhere. If
some task tries to allocate many pages, the watermark check returns the
same values until vmstat is updated, even though some free pages have
already been allocated. There are some adjustments for this imprecision,
but why not make it accurate? I think this change is a reasonable trade-off.
Thanks.
On Wed, Nov 25, 2015 at 01:00:22PM +0100, Michal Hocko wrote:
> On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
> > When I tested compaction in a low-memory condition, I found that
> > my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> > The stall lasts for 1 sec before it can escape. Further investigation
> > shows that it is caused by a stale vmstat value: vmstat is updated
> > every 1 sec, so the task is stuck for up to 1 sec.
>
> Wouldn't it be sufficient to use zone_page_state_snapshot in
> too_many_isolated?
Yes, it would work in this case. But I prefer this patch because all
zone_page_state() users would get the benefit.
Thanks.
On Thu, Nov 26, 2015 at 7:26 AM, Joonsoo Kim <[email protected]> wrote:
> On Wed, Nov 25, 2015 at 01:00:22PM +0100, Michal Hocko wrote:
>> On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
>> > When I tested compaction in a low-memory condition, I found that
>> > my benchmark got stuck in congestion_wait() in shrink_inactive_list().
>> > The stall lasts for 1 sec before it can escape. Further investigation
>> > shows that it is caused by a stale vmstat value: vmstat is updated
>> > every 1 sec, so the task is stuck for up to 1 sec.
>>
>> Wouldn't it be sufficient to use zone_page_state_snapshot in
>> too_many_isolated?
>
This was done by this patch, I believe:
http://lkml.iu.edu/hypermail/linux/kernel/1501.2/00001.html, though
the original issue (a wait of more than 1 sec) was fixed by the vmstat
changes.
On Thu 26-11-15 10:56:12, Joonsoo Kim wrote:
> On Wed, Nov 25, 2015 at 01:00:22PM +0100, Michal Hocko wrote:
> > On Tue 24-11-15 15:22:03, Joonsoo Kim wrote:
> > > When I tested compaction in a low-memory condition, I found that
> > > my benchmark got stuck in congestion_wait() in shrink_inactive_list().
> > > The stall lasts for 1 sec before it can escape. Further investigation
> > > shows that it is caused by a stale vmstat value: vmstat is updated
> > > every 1 sec, so the task is stuck for up to 1 sec.
> >
> > Wouldn't it be sufficient to use zone_page_state_snapshot in
> > too_many_isolated?
>
> Yes, it would work in this case. But I prefer this patch because all
> zone_page_state() users would get the benefit.
Just to make it clear, I am not against your patch in general. I am just
not sure it would help in the too_many_isolated case, where a significant
drift might occur on remote cpus as well, so I am not really sure it is
appropriate for the issue you are seeing.
--
Michal Hocko
SUSE Labs
On Thu, Nov 26, 2015 at 10:52:52AM +0900, Joonsoo Kim wrote:
> On Wed, Nov 25, 2015 at 10:04:44AM -0600, Christoph Lameter wrote:
> > > Although vmstat values aren't designed for accuracy, they are already
> > > used in some sensitive places, so it is better for them to be more accurate.
> >
> > The design is to sacrifice accuracy and the time the updates occur for
> > performance reasons. This is not the purpose the counters were designed
> > for. If you put these demands on the vmstat then you will get complex
> > convoluted code and compromise performance.
>
> I understand the design decision, but it is better to get a value that
> is as accurate as possible when there is no performance problem. My
> patch would not cause much performance degradation because it only adds
> one this_cpu_read().
>
> Consider the following example. The current implementation returns a
> surprising result if someone does the following:
>
> v1 = zone_page_state(XXX);
> mod_zone_page_state(XXX, 1);
> v2 = zone_page_state(XXX);
>
> v2 would be the same as v1 in most cases even though we have already
> updated the counter.
>
> This situation can occur in the page allocation path and elsewhere. If
> some task tries to allocate many pages, the watermark check returns the
> same values until vmstat is updated, even though some free pages have
> already been allocated. There are some adjustments for this imprecision,
> but why not make it accurate? I think this change is a reasonable trade-off.
>
>
Christoph, any comment?
Thanks.