Richard Yao reported a month ago that his system ran into
vmap_area_lock contention during performance analysis via
/proc/meminfo. Andrew asked why his analysis reads /proc/meminfo
so heavily, but he didn't answer.
https://lkml.org/lkml/2014/4/10/416
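
A minimal sketch of such a stress reader (a hypothetical reproducer,
not Richard's actual tool) would be:

	#include <stdio.h>
	#include <string.h>

	/* Re-read /proc/meminfo forever; each read makes the kernel
	 * walk vmap_area_list to fill in the Vmalloc* fields. */
	int main(void)
	{
		char line[256];

		for (;;) {
			FILE *f = fopen("/proc/meminfo", "r");

			if (!f)
				return 1;
			while (fgets(line, sizeof(line), f))
				if (!strncmp(line, "Vmalloc", 7))
					fputs(line, stdout);
			fclose(f);
		}
	}

Running a few dozen of these in parallel should be enough to make the
vmap_area_lock contention stand out in a profile.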
Although I'm not sure whether this is the right usage, there is a
solution that reduces vmap_area_lock contention with no side effect:
just use the rcu list iterator in get_vmalloc_info().
rcu can be used in this function because the RCU protocol is already
respected by writers, ever since Nick Piggin's commit
db64fe02258f1507e13fe5 ("mm: rewrite vmap layer") back in
linux-2.6.28.
Specifically:
insertions use list_add_rcu(),
deletions use list_del_rcu() and kfree_rcu().
Note the rb tree is not used by rcu readers (it would not be safe);
only the vmap_area_list has full RCU protection.
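
For reference, the writer side looks roughly like this (a condensed
sketch of mm/vmalloc.c; rb tree handling and accounting are elided):

	/* Both writers run under vmap_area_lock. */
	static void __insert_vmap_area(struct vmap_area *va)
	{
		struct vmap_area *prev;

		/* ... rb tree insertion elided; prev = predecessor ... */
		if (prev)
			list_add_rcu(&va->list, &prev->list);
		else
			list_add_rcu(&va->list, &vmap_area_list);
	}

	static void __free_vmap_area(struct vmap_area *va)
	{
		rb_erase(&va->rb_node, &vmap_area_root);
		RB_CLEAR_NODE(&va->rb_node);
		list_del_rcu(&va->list);
		/* the vmap_area is only freed after a grace period */
		kfree_rcu(va, rcu_head);
	}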
Note that __purge_vmap_area_lazy() already uses this rcu protection.
	rcu_read_lock();
	list_for_each_entry_rcu(va, &vmap_area_list, list) {
		if (va->flags & VM_LAZY_FREE) {
			if (va->va_start < *start)
				*start = va->va_start;
			if (va->va_end > *end)
				*end = va->va_end;
			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
			list_add_tail(&va->purge_list, &valist);
			va->flags |= VM_LAZY_FREEING;
			va->flags &= ~VM_LAZY_FREE;
		}
	}
	rcu_read_unlock();
v2: add more commit description from Eric
[[email protected]: add more commit description]
Reported-by: Richard Yao <[email protected]>
Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: Joonsoo Kim <[email protected]>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f64632b..fdbb116 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2690,14 +2690,14 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
 
 	prev_end = VMALLOC_START;
 
-	spin_lock(&vmap_area_lock);
+	rcu_read_lock();
 
 	if (list_empty(&vmap_area_list)) {
 		vmi->largest_chunk = VMALLOC_TOTAL;
 		goto out;
 	}
 
-	list_for_each_entry(va, &vmap_area_list, list) {
+	list_for_each_entry_rcu(va, &vmap_area_list, list) {
 		unsigned long addr = va->va_start;
 
 		/*
@@ -2724,7 +2724,7 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
 	vmi->largest_chunk = VMALLOC_END - prev_end;
 
 out:
-	spin_unlock(&vmap_area_lock);
+	rcu_read_unlock();
 }
 #endif
 
--
1.7.9.5
On 06/10/2014 10:19 PM, Joonsoo Kim wrote:
> Richard Yao reported a month ago that his system ran into
> vmap_area_lock contention during performance analysis via
> /proc/meminfo. Andrew asked why his analysis reads /proc/meminfo
> so heavily, but he didn't answer.
>
> https://lkml.org/lkml/2014/4/10/416
>
> Although I'm not sure whether this is the right usage, there is a
> solution that reduces vmap_area_lock contention with no side effect:
> just use the rcu list iterator in get_vmalloc_info().
>
> rcu can be used in this function because the RCU protocol is already
> respected by writers, ever since Nick Piggin's commit
> db64fe02258f1507e13fe5 ("mm: rewrite vmap layer") back in
> linux-2.6.28.
While rcu list traversal over the vmap_area_list is safe, this may
arrive at different results than the spinlocked version. The rcu list
traversal version will not be a 'snapshot' of a single, valid instant
of the entire vmap_area_list, but rather a potential amalgam of
different list states.
This is because the vmap_area_list can continue to change during
list traversal.
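
For example, consider this (hypothetical) interleaving:

   CPU 0: get_vmalloc_info()       CPU 1: writer, vmap_area_lock held
   -------------------------       -----------------------------------
   rcu_read_lock()
   accounts area A
                                   list_del_rcu(&A->list)  /* free A */
                                   list_add_rcu(&B->list)  /* add B  */
   accounts area B
   rcu_read_unlock()

The walker has now counted both A and B, even though the list never
contained both at any single instant.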
Regards,
Peter Hurley
> Specifically:
> insertions use list_add_rcu(),
> deletions use list_del_rcu() and kfree_rcu().
>
> Note the rb tree is not used by rcu readers (it would not be safe);
> only the vmap_area_list has full RCU protection.
>
> Note that __purge_vmap_area_lazy() already uses this rcu protection.
>
> 	rcu_read_lock();
> 	list_for_each_entry_rcu(va, &vmap_area_list, list) {
> 		if (va->flags & VM_LAZY_FREE) {
> 			if (va->va_start < *start)
> 				*start = va->va_start;
> 			if (va->va_end > *end)
> 				*end = va->va_end;
> 			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
> 			list_add_tail(&va->purge_list, &valist);
> 			va->flags |= VM_LAZY_FREEING;
> 			va->flags &= ~VM_LAZY_FREE;
> 		}
> 	}
> 	rcu_read_unlock();
>
> v2: add more commit description from Eric
>
> [[email protected]: add more commit description]
> Reported-by: Richard Yao <[email protected]>
> Acked-by: Eric Dumazet <[email protected]>
> Signed-off-by: Joonsoo Kim <[email protected]>
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index f64632b..fdbb116 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2690,14 +2690,14 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
>  
>  	prev_end = VMALLOC_START;
>  
> -	spin_lock(&vmap_area_lock);
> +	rcu_read_lock();
>  
>  	if (list_empty(&vmap_area_list)) {
>  		vmi->largest_chunk = VMALLOC_TOTAL;
>  		goto out;
>  	}
>  
> -	list_for_each_entry(va, &vmap_area_list, list) {
> +	list_for_each_entry_rcu(va, &vmap_area_list, list) {
>  		unsigned long addr = va->va_start;
>  
>  		/*
> @@ -2724,7 +2724,7 @@ void get_vmalloc_info(struct vmalloc_info *vmi)
>  	vmi->largest_chunk = VMALLOC_END - prev_end;
>  
>  out:
> -	spin_unlock(&vmap_area_lock);
> +	rcu_read_unlock();
>  }
>  #endif
>
>
On Tue, Jun 10, 2014 at 11:32:19PM -0400, Peter Hurley wrote:
> On 06/10/2014 10:19 PM, Joonsoo Kim wrote:
> >Richard Yao reported a month ago that his system ran into
> >vmap_area_lock contention during performance analysis via
> >/proc/meminfo. Andrew asked why his analysis reads /proc/meminfo
> >so heavily, but he didn't answer.
> >
> >https://lkml.org/lkml/2014/4/10/416
> >
> >Although I'm not sure whether this is the right usage, there is a
> >solution that reduces vmap_area_lock contention with no side effect:
> >just use the rcu list iterator in get_vmalloc_info().
> >
> >rcu can be used in this function because the RCU protocol is already
> >respected by writers, ever since Nick Piggin's commit
> >db64fe02258f1507e13fe5 ("mm: rewrite vmap layer") back in
> >linux-2.6.28.
>
> While rcu list traversal over the vmap_area_list is safe, this may
> arrive at different results than the spinlocked version. The rcu list
> traversal version will not be a 'snapshot' of a single, valid instant
> of the entire vmap_area_list, but rather a potential amalgam of
> different list states.
Hello,
Yes, you are right, but I don't think we should be that strict here.
Meminfo is already not a 'snapshot' of a single instant: while we
gather one stat, the other stats can change.
And, although we may arrive at different results than the spinlocked
version, the difference would not be large and would not have any
serious side effect.
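
For example, meminfo_proc_show() already gathers its numbers in
several independent steps, each seeing a different instant (condensed
sketch of fs/proc/meminfo.c, details elided):

	static int meminfo_proc_show(struct seq_file *m, void *v)
	{
		struct sysinfo i;
		struct vmalloc_info vmi;

		si_meminfo(&i);         /* core MM counters: one instant */
		si_swapinfo(&i);        /* swap counters: another instant */
		get_vmalloc_info(&vmi); /* vmalloc stats: yet another */
		/* ... seq_printf() of all the fields ... */
		return 0;
	}

Each step takes its own locks (or none), so one read of /proc/meminfo
never describes a single point in time anyway.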
Thanks.
On Tue, 2014-06-10 at 23:32 -0400, Peter Hurley wrote:
> While rcu list traversal over the vmap_area_list is safe, this may
> arrive at different results than the spinlocked version. The rcu list
> traversal version will not be a 'snapshot' of a single, valid instant
> of the entire vmap_area_list, but rather a potential amalgam of
> different list states.
>
> This is because the vmap_area_list can continue to change during
> list traversal.
As soon as we exit from get_vmalloc_info(), information can be obsolete
anyway, especially if we held a spinlock for the whole list traversal.
So using the spinlock is certainly not protecting anything in this
regard.
On Wed, 11 Jun 2014 13:34:04 +0900 Joonsoo Kim <[email protected]> wrote:
> > While rcu list traversal over the vmap_area_list is safe, this may
> > arrive at different results than the spinlocked version. The rcu list
> > traversal version will not be a 'snapshot' of a single, valid instant
> > of the entire vmap_area_list, but rather a potential amalgam of
> > different list states.
>
> Hello,
>
> Yes, you are right, but I don't think we should be that strict here.
> Meminfo is already not a 'snapshot' of a single instant: while we
> gather one stat, the other stats can change.
> And, although we may arrive at different results than the spinlocked
> version, the difference would not be large and would not have any
> serious side effect.
mm, well... The spinlocked version will at least report a number which
*used* to be true. The new improved racy version could for example see
a bunch of new allocations but fail to see the bunch of frees which
preceded those new allocations. Net result: it reports allocation
totals which exceed anything which this kernel has ever sustained.
But hey, it's only /proc/meminfo:VmallocFoo. I'll eat my hat if anyone
cares about it.