Hi,
we are successfully running several very busy web servers on 2.6.32.* and a
few days ago I decided to upgrade to 2.6.37 (mainly because of the blkio cgroup).
I installed 2.6.37.2 on one of the servers and very strange things started to
happen with Apache web server.
We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing
lots of 'fork' and lots of 'setuid'. I have also noticed that the problem is
happening only on very busy servers.
Everything is ok when Apache is started but, as time passes, its 'root'
processes (Apache processes running under root) are consuming more and more CPU.
Finally, the whole server becomes very unstable and Apache must be restarted.
This repeats until the load on the web sites is much lower (usually around 22:00).
Sometimes it takes 3 hours before a restart is needed, sometimes only 1 hour (again,
depending on the load on the web sites). Here is a graph of CPU utilization showing the
problem (red color); Apache was REstarted at 8:11 and 9:35:
http://watchdog.sk/lkml/cpu-problem.png
Here is how it looks on htop:
http://watchdog.sk/lkml/htop.jpg
And finally here is how it looks with older kernels (yes, when I install an older
kernel, the problem is gone); notice also that I/O wait is much lower and nicer
(blue color):
http://watchdog.sk/lkml/cpu-ok.png
I was also strace-ing the Apache processes which were causing problems, here it is:
http://watchdog.sk/lkml/strace.txt
I'm not 100% sure but I think that CPU was consumed on 'futex' lines.
I tried several kernel versions and found out that everything BEFORE 2.6.36 is
NOT affected and everything from 2.6.36 (inclusive) onwards IS affected.
Versions which I tried and were NOT affected by this problem:
2.6.32.*
2.6.35.11
Versions which I tried and were affected by this problem:
2.6.36
2.6.36.4
2.6.37.2
2.6.37.3
2.6.38-rc8 (the final version was not released yet)
All tests were made on vanilla kernels on Debian Lenny with this config:
http://watchdog.sk/lkml/config
Do you need any other information from me? I'm able to try other versions or
patches but, please, take into account that I have to do this on a _production_
server (I failed to reproduce it in a testing environment). Also, I'm able to try
only one kernel per day.
Thank you !
azurit
On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote:
>
> Hi,
>
> we are successfully running several very busy web servers on 2.6.32.* and
> few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup).
> I installed 2.6.37.2 on one of the servers and very strange things started to
> happen with Apache web server.
>
> We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing
> lots of 'fork' and lots of 'setuid'. I have also noticed that problem is
> happening only on very busy servers.
>
> Everything is ok when Apache is started but as time is passing by, its 'root'
> processes (Apache processes running under root) are consuming more and more CPU.
> Finally, the whole server becames very unstable and Apache must be restarted.
> This is repeating until the load on web sites is much lower (usually on 22:00).
> Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again,
> depends on load on web sites). Here is the graph of CPU utilization showing the
> problem (red color), Apache was REstarted at 8:11 and 9:35:
> http://watchdog.sk/lkml/cpu-problem.png
>
> Here is how it looks on htop:
> http://watchdog.sk/lkml/htop.jpg
>
> And finally here is how it looks with older kernels (yes, when i install older
> kernel, problem is gone), notice also that I/O wait is much lower and nicer
> (blue color):
> http://watchdog.sk/lkml/cpu-ok.png
>
> I was also strace-ing Apache processes which were doing problems, here it is:
> http://watchdog.sk/lkml/strace.txt
>
> I'm not 100% sure but I think that CPU was consumed on 'futex' lines.
>
> I tried several kernel versions and find out that everything BEFORE 2.6.36 is
> NOT affected and everything AFTER 2.6.36 (included) is affected.
>
> Versions which I tried and were NOT affected by this problem:
> 2.6.32.*
> 2.6.35.11
>
> Versions which I tried and were affected by this problem:
> 2.6.36
> 2.6.36.4
> 2.6.37.2
> 2.6.37.3
> 2.6.38-rc8 (final version was not released yet)
>
> All tests were made on vanilla kernels on Debian Lenny with this config:
> http://watchdog.sk/lkml/config
>
> Do you need any other information from me ? I'm able to try other versions or
> patches but, please, take into account that I have to do this on _production_
> server (I failed to reproduce it in testing environment). Also, I'm able to try
> only one kernel per day.
Ick, one kernel per day might make this a bit difficult, but if there
was any way you could use 'git bisect' to try to narrow this down to the
patch that caused this problem, it would be great.
You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
there and try to offer you different chances to find the problem.
thanks,
greg k-h
On Wed, Mar 16, 2011 at 05:15:19PM -0700, Greg Kroah-Hartman wrote:
> > Do you need any other information from me ? I'm able to try other versions or
> > patches but, please, take into account that I have to do this on _production_
> > server (I failed to reproduce it in testing environment). Also, I'm able to try
> > only one kernel per day.
>
> Ick, one kernel per day might make this a bit difficult, but if there
> was any way you could use 'git bisect' to try to narrow this down to the
> patch that caused this problem, it would be great.
>
> You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
> there and try to offer you different chances to find the problem.
Comparing the output of a perf profile between the good/bad kernels might
narrow it down faster than a bisect if something obvious sticks out.
Dave
Bisecting: 5103 revisions left to test after this (roughly 12 steps)
If I'm right, it will take 12 reboots. I'm really able to reboot only once per day and NOT during the weekend, so this will take 2.5 weeks.
What about that 'perf' tool? Can anyone, please, tell me how exactly I should run it to gather useful data?
Thank you.
______________________________________________________________
> Od: "Dave Jones"
> Komu: Greg KH
> Dátum: 17.03.2011 01:53
> Predmet: Re: Regression from 2.6.36
>
> CC: [email protected] On Wed, Mar 16, 2011 at 05:15:19PM -0700, Greg Kroah-Hartman wrote:
> > Do you need any other information from me ? I'm able to try other versions or
> > patches but, please, take into account that I have to do this on _production_
> > server (I failed to reproduce it in testing environment). Also, I'm able to try
> > only one kernel per day.
>
> Ick, one kernel per day might make this a bit difficult, but if there
> was any way you could use 'git bisect' to try to narrow this down to the
> patch that caused this problem, it would be great.
>
> You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
> there and try to offer you different chances to find the problem.
Comparing the output of a perf profile between the good/bad kernels might
narrow it down faster than a bisect if something obvious sticks out.
Dave
I have finally completed bisection, here are the results:
a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6 is first bad commit
commit a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6
Author: Changli Gao <[email protected]>
Date: Tue Aug 10 18:01:35 2010 -0700
vfs: use kmalloc() to allocate fdmem if possible
Use kmalloc() to allocate fdmem if possible.
vmalloc() is used as a fallback solution for fdmem allocation. A new
helper function __free_fdtable() is introduced to reduce the lines of
code.
A potential bug, vfree() a memory allocated by kmalloc(), is fixed.
[[email protected]: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
Signed-off-by: Changli Gao <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
:040000 040000 a7b3997bc754f573b4a309cda1a0774ea95c235e 4241a4f2115c60e5c1dc1879c85c9911fa077807 M fs
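For context, the allocation helper as it stands after this commit (it is visible in the '-' lines of the patches quoted later in the thread) is:

static inline void *alloc_fdmem(unsigned int size)
{
	void *data;

	/* Try a physically contiguous kmalloc() first, whatever the size... */
	data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
	if (data != NULL)
		return data;

	/* ...and only fall back to vmalloc() if that fails. */
	return vmalloc(size);
}

Before this commit, alloc_fdmem() used kmalloc() only for allocations up to PAGE_SIZE and went straight to vmalloc() for anything larger, which is the behaviour Eric Dumazet suggests returning to further down in the thread.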
______________________________________________________________
> Od: "Greg KH" <[email protected]>
> Komu: azurIt <[email protected]>
> Dátum: 17.03.2011 01:15
> Predmet: Re: Regression from 2.6.36
>
> CC: [email protected] On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote:
>
> Hi,
>
> we are successfully running several very busy web servers on 2.6.32.* and
> few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup).
> I installed 2.6.37.2 on one of the servers and very strange things started to
> happen with Apache web server.
>
> We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing
> lots of 'fork' and lots of 'setuid'. I have also noticed that problem is
> happening only on very busy servers.
>
> Everything is ok when Apache is started but as time is passing by, its 'root'
> processes (Apache processes running under root) are consuming more and more CPU.
> Finally, the whole server becames very unstable and Apache must be restarted.
> This is repeating until the load on web sites is much lower (usually on 22:00).
> Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again,
> depends on load on web sites). Here is the graph of CPU utilization showing the
> problem (red color), Apache was REstarted at 8:11 and 9:35:
> http://watchdog.sk/lkml/cpu-problem.png
>
> Here is how it looks on htop:
> http://watchdog.sk/lkml/htop.jpg
>
> And finally here is how it looks with older kernels (yes, when i install older
> kernel, problem is gone), notice also that I/O wait is much lower and nicer
> (blue color):
> http://watchdog.sk/lkml/cpu-ok.png
>
> I was also strace-ing Apache processes which were doing problems, here it is:
> http://watchdog.sk/lkml/strace.txt
>
> I'm not 100% sure but I think that CPU was consumed on 'futex' lines.
>
> I tried several kernel versions and find out that everything BEFORE 2.6.36 is
> NOT affected and everything AFTER 2.6.36 (included) is affected.
>
> Versions which I tried and were NOT affected by this problem:
> 2.6.32.*
> 2.6.35.11
>
> Versions which I tried and were affected by this problem:
> 2.6.36
> 2.6.36.4
> 2.6.37.2
> 2.6.37.3
> 2.6.38-rc8 (final version was not released yet)
>
> All tests were made on vanilla kernels on Debian Lenny with this config:
> http://watchdog.sk/lkml/config
>
> Do you need any other information from me ? I'm able to try other versions or
> patches but, please, take into account that I have to do this on _production_
> server (I failed to reproduce it in testing environment). Also, I'm able to try
> only one kernel per day.
Ick, one kernel per day might make this a bit difficult, but if there
was any way you could use 'git bisect' to try to narrow this down to the
patch that caused this problem, it would be great.
You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
there and try to offer you different chances to find the problem.
thanks,
greg k-h
Cc'd a few people.
Also, the series which introduced this was discussed at:
http://lkml.org/lkml/2010/5/3/53
On 04/07/2011 12:01 PM, azurIt wrote:
>
> I have finally completed bisection, here are the results:
>
>
>
> a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6 is first bad commit
> commit a892e2d7dcdfa6c76e60c50a8c7385c65587a2a6
> Author: Changli Gao <[email protected]>
> Date: Tue Aug 10 18:01:35 2010 -0700
>
> vfs: use kmalloc() to allocate fdmem if possible
>
> Use kmalloc() to allocate fdmem if possible.
>
> vmalloc() is used as a fallback solution for fdmem allocation. A new
> helper function __free_fdtable() is introduced to reduce the lines of
> code.
>
> A potential bug, vfree() a memory allocated by kmalloc(), is fixed.
>
> [[email protected]: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
> Signed-off-by: Changli Gao <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Avi Kivity <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>
>
> :040000 040000 a7b3997bc754f573b4a309cda1a0774ea95c235e 4241a4f2115c60e5c1dc1879c85c9911fa077807 M fs
>
>
>
>
>
>
> ______________________________________________________________
> > From: "Greg KH" <[email protected]>
> > To: azurIt <[email protected]>
> > Date: 17.03.2011 01:15
> > Subject: Re: Regression from 2.6.36
> >
> > CC: [email protected] On Tue, Mar 15, 2011 at 02:25:27PM +0100, azurIt wrote:
> >
> > Hi,
> >
> > we are successfully running several very busy web servers on 2.6.32.* and
> > few days ago I decided to upgrade to 2.6.37 (mainly because of blkio cgroup).
> > I installed 2.6.37.2 on one of the servers and very strange things started to
> > happen with Apache web server.
> >
> > We are using Apache with MPM-ITK ( http://mpm-itk.sesse.net/ ) so it is doing
> > lots of 'fork' and lots of 'setuid'. I have also noticed that problem is
> > happening only on very busy servers.
> >
> > Everything is ok when Apache is started but as time is passing by, its 'root'
> > processes (Apache processes running under root) are consuming more and more CPU.
> > Finally, the whole server becames very unstable and Apache must be restarted.
> > This is repeating until the load on web sites is much lower (usually on 22:00).
> > Sometimes it takes 3 hours when restart is needed, sometimes only 1 hour (again,
> > depends on load on web sites). Here is the graph of CPU utilization showing the
> > problem (red color), Apache was REstarted at 8:11 and 9:35:
> > http://watchdog.sk/lkml/cpu-problem.png
> >
> > Here is how it looks on htop:
> > http://watchdog.sk/lkml/htop.jpg
> >
> > And finally here is how it looks with older kernels (yes, when i install older
> > kernel, problem is gone), notice also that I/O wait is much lower and nicer
> > (blue color):
> > http://watchdog.sk/lkml/cpu-ok.png
> >
> > I was also strace-ing Apache processes which were doing problems, here it is:
> > http://watchdog.sk/lkml/strace.txt
> >
> > I'm not 100% sure but I think that CPU was consumed on 'futex' lines.
> >
> > I tried several kernel versions and find out that everything BEFORE 2.6.36 is
> > NOT affected and everything AFTER 2.6.36 (included) is affected.
> >
> > Versions which I tried and were NOT affected by this problem:
> > 2.6.32.*
> > 2.6.35.11
> >
> > Versions which I tried and were affected by this problem:
> > 2.6.36
> > 2.6.36.4
> > 2.6.37.2
> > 2.6.37.3
> > 2.6.38-rc8 (final version was not released yet)
> >
> > All tests were made on vanilla kernels on Debian Lenny with this config:
> > http://watchdog.sk/lkml/config
> >
> > Do you need any other information from me ? I'm able to try other versions or
> > patches but, please, take into account that I have to do this on _production_
> > server (I failed to reproduce it in testing environment). Also, I'm able to try
> > only one kernel per day.
>
> Ick, one kernel per day might make this a bit difficult, but if there
> was any way you could use 'git bisect' to try to narrow this down to the
> patch that caused this problem, it would be great.
>
> You can mark 2.6.35 as working and 2.6.36 as bad and git will go from
> there and try to offer you different chances to find the problem.
>
> thanks,
>
> greg k-h
thanks,
--
js
suse labs
On Thu, Apr 7, 2011 at 6:19 PM, Jiri Slaby <[email protected]> wrote:
> Cced few people.
>
> Also the series which introduced this were discussed at:
> http://lkml.org/lkml/2010/5/3/53
>
I guess this is because lots of fdts are now allocated by kmalloc(),
not vmalloc(), and we kfree() them in the RCU callback.
How about deferring all of the removal to the workqueue? This may
hurt performance, I think.
Anyway, something like the patch below... does it make sense?
Not-yet-signed-off-by: WANG Cong <[email protected]>
---
diff --git a/fs/file.c b/fs/file.c
index 0be3447..34dc355 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -96,20 +96,14 @@ void free_fdtable_rcu(struct rcu_head *rcu)
container_of(fdt, struct files_struct, fdtab));
return;
}
- if (!is_vmalloc_addr(fdt->fd) && !is_vmalloc_addr(fdt->open_fds)) {
- kfree(fdt->fd);
- kfree(fdt->open_fds);
- kfree(fdt);
- } else {
- fddef = &get_cpu_var(fdtable_defer_list);
- spin_lock(&fddef->lock);
- fdt->next = fddef->next;
- fddef->next = fdt;
- /* vmallocs are handled from the workqueue context */
- schedule_work(&fddef->wq);
- spin_unlock(&fddef->lock);
- put_cpu_var(fdtable_defer_list);
- }
+
+ fddef = &get_cpu_var(fdtable_defer_list);
+ spin_lock(&fddef->lock);
+ fdt->next = fddef->next;
+ fddef->next = fdt;
+ schedule_work(&fddef->wq);
+ spin_unlock(&fddef->lock);
+ put_cpu_var(fdtable_defer_list);
}
On Thursday 07 April 2011 at 19:21 +0800, Américo Wang wrote:
> On Thu, Apr 7, 2011 at 6:19 PM, Jiri Slaby <[email protected]> wrote:
> > Cced few people.
> >
> > Also the series which introduced this were discussed at:
> > http://lkml.org/lkml/2010/5/3/53
> >
>
> I guess this is due to that lots of fdt are allocated by kmalloc(),
> not vmalloc(), and we kfree() them in rcu callback.
>
> How about deferring all of the removal to workqueue? This may
> hurt performance I think.
>
> Anyway, like the patch below... makes sense?
>
> Not-yet-signed-off-by: WANG Cong <[email protected]>
>
> ---
> diff --git a/fs/file.c b/fs/file.c
> index 0be3447..34dc355 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -96,20 +96,14 @@ void free_fdtable_rcu(struct rcu_head *rcu)
> container_of(fdt, struct files_struct, fdtab));
> return;
> }
> - if (!is_vmalloc_addr(fdt->fd) && !is_vmalloc_addr(fdt->open_fds)) {
> - kfree(fdt->fd);
> - kfree(fdt->open_fds);
> - kfree(fdt);
> - } else {
> - fddef = &get_cpu_var(fdtable_defer_list);
> - spin_lock(&fddef->lock);
> - fdt->next = fddef->next;
> - fddef->next = fdt;
> - /* vmallocs are handled from the workqueue context */
> - schedule_work(&fddef->wq);
> - spin_unlock(&fddef->lock);
> - put_cpu_var(fdtable_defer_list);
> - }
> +
> + fddef = &get_cpu_var(fdtable_defer_list);
> + spin_lock(&fddef->lock);
> + fdt->next = fddef->next;
> + fddef->next = fdt;
> + schedule_work(&fddef->wq);
> + spin_unlock(&fddef->lock);
> + put_cpu_var(fdtable_defer_list);
> }
Nope, this makes no sense at all.
It's probably the other way around. We want to free those blocks ASAP.
A fix would be to make alloc_fdmem() use vmalloc() if size is more than
4 pages, or whatever limit is reached.
We had a similar memory problem in fib_trie in the past: we force a
synchronize_rcu() every XXX MBytes allocated to make sure we don't have
too much RAM waiting to be freed in RCU queues.
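The fib_trie trick Eric refers to can be sketched roughly as follows (hypothetical names, illustration only; the real code lives in fib_trie.c around trie_rebalance()): batch the objects on a list and, once too many bytes are pending, force a synchronize_rcu() and free the whole batch.

#include <linux/list.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

#define SYNC_THRESHOLD_BYTES	(512 * 1024)	/* arbitrary threshold for this sketch */

struct my_obj {					/* hypothetical object type */
	struct list_head free_list;
	/* ... payload ... */
};

static LIST_HEAD(pending_free_list);
static size_t pending_free_bytes;

/* The object must already be unlinked from the RCU-protected structure. */
static void queue_deferred_free(struct my_obj *obj)
{
	list_add(&obj->free_list, &pending_free_list);
	pending_free_bytes += sizeof(*obj);

	if (pending_free_bytes > SYNC_THRESHOLD_BYTES) {
		struct my_obj *o, *tmp;

		synchronize_rcu();	/* all readers of the old pointers are done */
		list_for_each_entry_safe(o, tmp, &pending_free_list, free_list) {
			list_del(&o->free_list);
			kfree(o);
		}
		pending_free_bytes = 0;
	}
}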
On Thursday 07 April 2011 at 13:57 +0200, Eric Dumazet wrote:
> We had a similar memory problem in fib_trie in the past : We force a
> synchronize_rcu() every XXX Mbytes allocated to make sure we dont have
> too much ram waiting to be freed in rcu queues.
This was done in commit c3059477fce2d956
(ipv4: Use synchronize_rcu() during trie_rebalance()).
It was possible in fib_trie because we hold the RTNL lock, so managing
a counter was free.
In the fs case, we might use a percpu_counter if we really want to limit the
amount of space.
Now, I am not even sure we should care that much; we could just forget
about using high-order pages here.
diff --git a/fs/file.c b/fs/file.c
index 0be3447..7ba26fe 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -41,12 +41,6 @@ static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
static inline void *alloc_fdmem(unsigned int size)
{
- void *data;
-
- data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
- if (data != NULL)
- return data;
-
return vmalloc(size);
}
On Thu, Apr 7, 2011 at 8:13 PM, Eric Dumazet <[email protected]> wrote:
> On Thursday 07 April 2011 at 13:57 +0200, Eric Dumazet wrote:
>
>> We had a similar memory problem in fib_trie in the past: we force a
>> synchronize_rcu() every XXX MBytes allocated to make sure we don't have
>> too much RAM waiting to be freed in RCU queues.
I don't think there is too much memory allocated by vmalloc to free.
My patch should reduce the size of the memory allocated by vmalloc().
I think the real problem is that kmalloc'ed memory is always rounded up
to 2^n pages, so more memory is used than before.
>
> This was done in commit c3059477fce2d956
> (ipv4: Use synchronize_rcu() during trie_rebalance())
>
> It was possible in fib_trie because we hold RTNL lock, so managing
> a counter was free.
>
> In fs case, we might use a percpu_counter if we really want to limit the
> amount of space.
>
> Now, I am not even sure we should care that much and could just forget
> about this high order pages use.
In normal cases, only a few fds are used and the fdtable isn't larger than
one page, so we should use kmalloc to reduce the memory cost. Maybe we
should set an upper limit for kmalloc() here. One page?
azurIt, would you please test the patch attached? Thanks.
--
Regards,
Changli Gao([email protected])
On Thursday 07 April 2011 at 23:27 +0800, Changli Gao wrote:
> azurIt, would you please test the patch attached? Thanks.
>
Yes of course, I meant to reverse the patch
(use kmalloc() under PAGE_SIZE, vmalloc() for 'big' allocs).
Don't fall back to vmalloc() if kmalloc() fails.
	if (size <= PAGE_SIZE)
		return kmalloc(size, GFP_KERNEL);
	else
		return vmalloc(size);
>azurIt, would you please test the patch attached? Thanks.
This patch fixed the problem; I used 2.6.36.4 for testing. Do you need me to also test other kernel versions or patches?
Thank you very much!
______________________________________________________________
> Od: "Changli Gao" <[email protected]>
> Komu: Eric Dumazet <[email protected]>
> Dátum: 07.04.2011 17:27
> Predmet: Re: Regression from 2.6.36
>
> CC: "Américo Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], "Andrew Morton" <[email protected]>, [email protected], [email protected], "Jiri Slaby" <[email protected]>
>On Thu, Apr 7, 2011 at 8:13 PM, Eric Dumazet <[email protected]> wrote:
>> Le jeudi 07 avril 2011 à 13:57 +0200, Eric Dumazet a écrit :
>>
>>> We had a similar memory problem in fib_trie in the past : We force a
>>> synchronize_rcu() every XXX Mbytes allocated to make sure we dont have
>>> too much ram waiting to be freed in rcu queues.
>
>I don't think there is too much memory allocated by vmalloc to free.
>My patch should reduce the size of the memory allocated by vmalloc().
>I think the real problem is kfree always returns the memory, whose
>size is aligned to 2^n pages, and more memory are used than before.
>
>>
>> This was done in commit c3059477fce2d956
>> (ipv4: Use synchronize_rcu() during trie_rebalance())
>>
>> It was possible in fib_trie because we hold RTNL lock, so managing
>> a counter was free.
>>
>> In fs case, we might use a percpu_counter if we really want to limit the
>> amount of space.
>>
>> Now, I am not even sure we should care that much and could just forget
>> about this high order pages use.
>
>In normal cases, only a few fds are used, the ftable isn't larger than
>one page, so we should use kmalloc to reduce the memory cost. Maybe we
>should set a upper limit for kmalloc() here. One page?
>
>
>--
>Regards,
>Changli Gao([email protected])
>
>
On Thu, 07 Apr 2011 17:36:26 +0200
Eric Dumazet <[email protected]> wrote:
> On Thursday 07 April 2011 at 23:27 +0800, Changli Gao wrote:
>
> > azurlt, would you please test the patch attached? Thanks.
> >
>
> Yes of course, I meant to reverse the patch
>
> (use kmalloc() under PAGE_SIZE, vmalloc() for 'big' allocs)
>
>
> Dont fallback to vmalloc if kmalloc() fails.
>
>
> if (size <= PAGE_SIZE)
> return kmalloc(size, GFP_KERNEL);
> else
> return vmalloc(size);
>
It's somewhat unclear (to me) what caused this regression.
Is it because the kernel is now doing large kmalloc()s for the fdtable,
and this makes the page allocator go nuts trying to satisfy high-order
page allocation requests?
Is it because the kernel now will usually free the fdtable
synchronously within the rcu callback, rather than deferring this to a
workqueue?
The latter seems unlikely, so I'm thinking this was a case of
high-order-allocations-considered-harmful?
On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
<[email protected]> wrote:
>
> It's somewhat unclear (to me) what caused this regression.
>
> Is it because the kernel is now doing large kmalloc()s for the fdtable,
> and this makes the page allocator go nuts trying to satisfy high-order
> page allocation requests?
>
> Is it because the kernel now will usually free the fdtable
> synchronously within the rcu callback, rather than deferring this to a
> workqueue?
>
> The latter seems unlikely, so I'm thinking this was a case of
> high-order-allocations-considered-harmful?
>
Maybe, but I am not sure. Maybe my patch causes too much internal
fragmentation. For example, when asking for 5 pages you get 8 pages and 3
pages are wasted, so memory thrashing eventually happens.
--
Regards,
Changli Gao([email protected])
On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <[email protected]> wrote:
> On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> <[email protected]> wrote:
> >
> > It's somewhat unclear (to me) what caused this regression.
> >
> > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > and this makes the page allocator go nuts trying to satisfy high-order
> > page allocation requests?
> >
> > Is it because the kernel now will usually free the fdtable
> > synchronously within the rcu callback, rather than deferring this to a
> > workqueue?
> >
> > The latter seems unlikely, so I'm thinking this was a case of
> > high-order-allocations-considered-harmful?
> >
>
> Maybe, but I am not sure. Maybe my patch causes too many inner
> fragments. For example, when asking for 5 pages, get 8 pages, and 3
> pages are wasted, then memory thrash happens finally.
That theory sounds less likely, but could be tested by using
alloc_pages_exact().
On Tuesday 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <[email protected]> wrote:
>
> > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > <[email protected]> wrote:
> > >
> > > It's somewhat unclear (to me) what caused this regression.
> > >
> > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > and this makes the page allocator go nuts trying to satisfy high-order
> > > page allocation requests?
> > >
> > > Is it because the kernel now will usually free the fdtable
> > > synchronously within the rcu callback, rather than deferring this to a
> > > workqueue?
> > >
> > > The latter seems unlikely, so I'm thinking this was a case of
> > > high-order-allocations-considered-harmful?
> > >
> >
> > Maybe, but I am not sure. Maybe my patch causes too many inner
> > fragments. For example, when asking for 5 pages, get 8 pages, and 3
> > pages are wasted, then memory thrash happens finally.
>
> That theory sounds less likely, but could be tested by using
> alloc_pages_exact().
>
Very unlikely, since fdtable sizes are powers of two, unless you hit
sysctl_nr_open and it was changed (default value being 2^20)
Dear All,
I am trying to understand how memory fragmentation occurs in Linux using many malloc calls.
I am trying to reproduce the page fragmentation problem in Linux 2.6.29.x on a Linux mobile device (without swap) using a small malloc (in a loop) test program with a BLOCK_SIZE of (64*(4*K)).
I am then monitoring the page changes in /proc/buddyinfo after each operation.
From the output I can see that the page values under buddyinfo keep changing. But I am not able to relate these changes to my malloc BLOCK_SIZE.
I mean, with my BLOCK_SIZE of (2^6 x 4K ==> 2^6 PAGES) the 2^6th block count under /proc/buddyinfo should change. But this is not the actual behaviour.
Whatever the block size is, the buddyinfo changes only for 2^0 or 2^1 or 2^2 or 2^3.
I am trying to measure the level of fragmentation after each page allocation.
Can somebody explain to me in detail how exactly /proc/buddyinfo changes after each allocation and deallocation?
Thanks,
Pintu
On Wed, Apr 13, 2011 at 2:54 PM, Pintu Agarwal <[email protected]> wrote:
> Dear All,
>
> I am trying to understand how memory fragmentation occurs in linux using many malloc calls.
> I am trying to reproduce the page fragmentation problem in linux 2.6.29.x on a linux mobile(without Swap) using a small malloc(in loop) test program of BLOCK_SIZE (64*(4*K)).
> And then monitoring the page changes in /proc/buddyinfo after each operation.
> From the output I can see that the page values under buddyinfo keeps changing. But I am not able to relate these changes with my malloc BLOCK_SIZE.
> I mean with my BLOCK_SIZE of (2^6 x 4K ==> 2^6 PAGES) the 2^6 th block under /proc/buddyinfo should change. But this is not the actual behaviour.
> Whatever is the blocksize, the buddyinfo changes only for 2^0 or 2^1 or 2^2 or 2^3.
>
> I am trying to measure the level of fragmentation after each page allocation.
> Can somebody explain me in detail, how actually /proc/buddyinfo changes after each allocation and deallocation.
>
What malloc() sees is the virtual memory of the process, while buddyinfo
shows physical memory pages.
When you malloc() 64K of memory, the kernel may not allocate 64K of
physical memory for you all at once.
Thanks.
Hi,
My requirement is that I want to measure the memory fragmentation level in Linux kernel 2.6.29 (ARM Cortex-A8 without swap).
How can I measure the fragmentation level (percentage) from /proc/buddyinfo?
Example: after each page allocation operation, I need to measure the fragmentation level. If the level is above 80%, I will trigger an OOM or something to the user.
How can I reproduce this memory fragmentation scenario using a sample program?
Here is my sample program: (to check page allocation using malloc)
----------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

#define PAGE_SIZE (4*1024)
#define MEM_BLOCK (64*PAGE_SIZE)
#define MAX_LIMIT (16)

int main()
{
	char *ptr[MAX_LIMIT+1] = {NULL,};
	int i = 0;

	printf("Requesting <%d> blocks of memory of block size <%d>........\n", MAX_LIMIT, MEM_BLOCK);
	system("cat /proc/buddyinfo");
	system("cat /proc/zoneinfo | grep free_pages");
	printf("*****************************************\n\n\n");

	for (i = 0; i < MAX_LIMIT; i++)
	{
		ptr[i] = (char *)malloc(sizeof(char)*MEM_BLOCK);
		if (ptr[i] == NULL)
		{
			printf("ERROR : malloc failed(counter %d) <%s>\n", i, strerror(errno));
			system("cat /proc/buddyinfo");
			system("cat /proc/zoneinfo | grep free_pages");
			printf("press any key to terminate......");
			getchar();
			exit(0);
		}
		memset(ptr[i], 1, MEM_BLOCK);
		sleep(1);
		//system("cat /proc/buddyinfo");
		//system("cat /proc/zoneinfo | grep free_pages");
		//printf("-----------------------------------------\n");
	}

	sleep(1);
	system("cat /proc/buddyinfo");
	system("cat /proc/zoneinfo | grep free_pages");
	printf("-----------------------------------------\n");
	printf("press any key to end......");
	getchar();

	for (i = 0; i < MAX_LIMIT; i++)
	{
		if (ptr[i] != NULL)
		{
			free(ptr[i]);
		}
	}
	printf("DONE !!!\n");
	return 0;
}
EACH BLOCK SIZE = 64 Pages ==> (64 * 4 * 1024)
TOTAL BLOCKS = 16
----------------------------------------------
In my linux2.6.29 ARM machine, the initial /proc/buddyinfo shows the following:
Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
Node 1, zone DMA 15 320 423 225 97 26 1 0 0 0 0 0
After running my sample program (with 16 iterations) the buddyinfo output is as follows:
Requesting <16> blocks of memory of block size <262144>........
Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
Node 1, zone DMA 15 301 419 224 96 27 1 0 0 0 0 0
nr_free_pages 169
nr_free_pages 6545
*****************************************
Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
Node 1, zone DMA 18 2 305 226 96 27 1 0 0 0 0 0
nr_free_pages 169
nr_free_pages 5514
-----------------------------------------
The requested block size is 64 pages (2^6) for each block.
But if we see the output after 16 iterations the buddyinfo allocates pages only from Node 1 , (2^0, 2^1, 2^2, 2^3).
But the actual allocation should happen from (2^6) block in buddyinfo.
Questions:
1) How to analyse buddyinfo based on each page block size?
2) How and in what scenario the buddyinfo changes?
3) Can we rely completely on buddyinfo information for measuring the level of fragmentation?
Can somebody throw some more light on this???
Thanks,
Pintu
--- On Wed, 4/13/11, Américo Wang <[email protected]> wrote:
> From: Américo Wang <[email protected]>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Pintu Agarwal" <[email protected]>
> Cc: "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Jiri Slaby" <[email protected]>, "azurIt" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
> Date: Wednesday, April 13, 2011, 6:44 AM
> On Wed, Apr 13, 2011 at 2:54 PM,
> Pintu Agarwal <[email protected]>
> wrote:
> > Dear All,
> >
> > I am trying to understand how memory fragmentation
> occurs in linux using many malloc calls.
> > I am trying to reproduce the page fragmentation
> problem in linux 2.6.29.x on a linux mobile(without Swap)
> using a small malloc(in loop) test program of BLOCK_SIZE
> (64*(4*K)).
> > And then monitoring the page changes in
> /proc/buddyinfo after each operation.
> > From the output I can see that the page values under
> buddyinfo keeps changing. But I am not able to relate these
> changes with my malloc BLOCK_SIZE.
> > I mean with my BLOCK_SIZE of (2^6 x 4K ==> 2^6
> PAGES) the 2^6 th block under /proc/buddyinfo should change.
> But this is not the actual behaviour.
> > Whatever is the blocksize, the buddyinfo changes only
> for 2^0 or 2^1 or 2^2 or 2^3.
> >
> > I am trying to measure the level of fragmentation
> after each page allocation.
> > Can somebody explain me in detail, how actually
> /proc/buddyinfo changes after each allocation and
> deallocation.
> >
>
> What malloc() sees is virtual memory of the process, while
> buddyinfo
> shows physical memory pages.
>
> When you malloc() 64K memory, the kernel may not allocate a
> 64K
> physical memory at one time
> for you.
>
> Thanks.
>
On Wed, 13 Apr 2011 15:56:00 +0200, Pintu Agarwal
<[email protected]> wrote:
> My requirement is, I wanted to measure memory fragmentation level in
> linux kernel2.6.29 (ARM cortex A8 without swap).
> How can I measure fragmentation level(percentage) from /proc/buddyinfo ?
[...]
> In my linux2.6.29 ARM machine, the initial /proc/buddyinfo shows the
> following:
> Node 0, zone DMA 17 22 1 1 0 1
> 1 0 0 0 0 0
> Node 1, zone DMA 15 320 423 225 97 26
> 1 0 0 0 0 0
>
> After running my sample program (with 16 iterations) the buddyinfo
> output is as follows:
> Requesting <16> blocks of memory of block size <262144>........
> Node 0, zone DMA 17 22 1 1 0 1
> 1 0 0 0 0 0
> Node 1, zone DMA 15 301 419 224 96 27
> 1 0 0 0 0 0
> nr_free_pages 169
> nr_free_pages 6545
> *****************************************
>
>
> Node 0, zone DMA 17 22 1 1 0 1
> 1 0 0 0 0 0
> Node 1, zone DMA 18 2 305 226 96 27
> 1 0 0 0 0 0
> nr_free_pages 169
> nr_free_pages 5514
> -----------------------------------------
>
> The requested block size is 64 pages (2^6) for each block.
> But if we see the output after 16 iterations the buddyinfo allocates
> pages only from Node 1 , (2^0, 2^1, 2^2, 2^3).
> But the actual allocation should happen from (2^6) block in buddyinfo.
No. When you call malloc() only virtual address space is allocated. The
actual allocation of physical space occurs when user space accesses the
memory (either reads or writes) and it happens one page at a time.
As a matter of fact, if you have a limited number of 0-order pages and
allocate a block of 64 pages in user space, later accessing the memory,
what really happens is that the kernel allocates the 0-order pages and,
when it runs out of those, splits a 1-order page into two 0-order pages
and takes one of them.
Because of the MMU, fragmentation of physical memory is not an issue for
normal user space programs.
It becomes an issue once you deal with hardware that has no MMU
and no support for scatter-gather DMA, or with some big kernel structures.
/proc/buddyinfo tells you how many free pages of given order there are
in the system. You may interpret it in such a way that the bigger number
of the low order pages the bigger fragmentation of physical memory. If
there was no fragmentation (for some definition of the term) you'd get only
the highest order pages and at most one page for each lower order.
Again though, this fragmentation is not an issue for user space programs.
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--
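As a rough illustration of that reading of /proc/buddyinfo, here is a small user-space sketch (the 64-page threshold simply mirrors the test program above, and the percentage is an arbitrary score rather than an established metric) that reports, per zone, what share of the free memory still sits in large blocks:

/* Field layout assumed from the buddyinfo samples quoted above. */
#include <stdio.h>
#include <stdlib.h>

#define MIN_ORDER 6	/* 64-page blocks, as in the test program */

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		char node[32], zone[32];
		unsigned long total = 0, large = 0;
		int pos = 0, order;
		char *p;

		if (sscanf(line, "Node %31[^,], zone %31s%n", node, zone, &pos) < 2)
			continue;
		p = line + pos;
		for (order = 0; ; order++) {
			char *end;
			unsigned long count = strtoul(p, &end, 10);

			if (end == p)		/* no more columns */
				break;
			p = end;
			total += count << order;	/* free pages in this column */
			if (order >= MIN_ORDER)
				large += count << order;
		}
		printf("Node %s zone %s: %lu free pages, %.1f%% in order >= %d blocks\n",
		       node, zone, total,
		       total ? 100.0 * large / total : 0.0, MIN_ORDER);
	}
	fclose(f);
	return 0;
}

The lower that share, the harder it is to find physically contiguous 64-page blocks; as noted above, though, that only matters for the kernel (or MMU-less DMA), not for ordinary malloc() users.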
On Wed, 13 Apr 2011 04:37:36 +0200
Eric Dumazet <[email protected]> wrote:
> On Tuesday 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
> > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <[email protected]> wrote:
> >
> > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > > <[email protected]> wrote:
> > > >
> > > > It's somewhat unclear (to me) what caused this regression.
> > > >
> > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > > and this makes the page allocator go nuts trying to satisfy high-order
> > > > page allocation requests?
> > > >
> > > > Is it because the kernel now will usually free the fdtable
> > > > synchronously within the rcu callback, rather than deferring this to a
> > > > workqueue?
> > > >
> > > > The latter seems unlikely, so I'm thinking this was a case of
> > > > high-order-allocations-considered-harmful?
> > > >
> > >
> > > Maybe, but I am not sure. Maybe my patch causes too many inner
> > > fragments. For example, when asking for 5 pages, get 8 pages, and 3
> > > pages are wasted, then memory thrash happens finally.
> >
> > That theory sounds less likely, but could be tested by using
> > alloc_pages_exact().
> >
>
> Very unlikely, since fdtable sizes are powers of two, unless you hit
> sysctl_nr_open and it was changed (default value being 2^20)
>
So am I correct in believing that this regression is due to the
high-order allocations putting excess stress onto page reclaim?
If so, then how large _are_ these allocations? This perhaps can be
determined from /proc/slabinfo. They must be pretty huge, because slub
likes to do excessively-large allocations and the system handles that
reasonably well.
I suppose that a suitable fix would be
From: Andrew Morton <[email protected]>
Azurit reports large increases in system time after 2.6.36 when running
Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
to allocate fdmem if possible").
That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.
Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.
Reported-by: azurIt <[email protected]>
Cc: Changli Gao <[email protected]>
Cc: Americo Wang <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/file.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff -puN fs/file.c~a fs/file.c
--- a/fs/file.c~a
+++ a/fs/file.c
@@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
*/
static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
-static inline void *alloc_fdmem(unsigned int size)
+static void *alloc_fdmem(unsigned int size)
{
- void *data;
-
- data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
- if (data != NULL)
- return data;
-
+ /*
+ * Very large allocations can stress page reclaim, so fall back to
+ * vmalloc() if the allocation size will be considered "large" by the VM.
+ */
+ if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
+ void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ if (data != NULL)
+ return data;
+ }
return vmalloc(size);
}
_
On Wed, 13 Apr 2011 14:16:00 -0700
Andrew Morton <[email protected]> wrote:
> fs/file.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
bah, stupid compiler.
--- a/fs/file.c~vfs-avoid-large-kmallocs-for-the-fdtable
+++ a/fs/file.c
@@ -9,6 +9,7 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/mmzone.h>
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/slab.h>
@@ -39,14 +40,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
*/
static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
-static inline void *alloc_fdmem(unsigned int size)
+static void *alloc_fdmem(unsigned int size)
{
- void *data;
-
- data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
- if (data != NULL)
- return data;
-
+ /*
+ * Very large allocations can stress page reclaim, so fall back to
+ * vmalloc() if the allocation size will be considered "large" by the VM.
+ */
+ if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
+ void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ if (data != NULL)
+ return data;
+ }
return vmalloc(size);
}
_
On Wed, 13 Apr 2011, Andrew Morton wrote:
> Azurit reports large increases in system time after 2.6.36 when running
> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
>
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
>
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
>
> Reported-by: azurIt <[email protected]>
> Cc: Changli Gao <[email protected]>
> Cc: Americo Wang <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> fs/file.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> */
> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
> {
> - void *data;
> -
> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> - if (data != NULL)
> - return data;
> -
> + /*
> + * Very large allocations can stress page reclaim, so fall back to
> + * vmalloc() if the allocation size will be considered "large" by the VM.
> + */
> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> + if (data != NULL)
> + return data;
> + }
> return vmalloc(size);
> }
>
It's a shame that we can't at least try kmalloc() with sufficiently large
sizes by doing something like
	gfp_t flags = GFP_NOWAIT | __GFP_NOWARN;

	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
		flags |= GFP_KERNEL;
	data = kmalloc(size, flags);
	if (data)
		return data;
	return vmalloc(size);
which would at least attempt to use the slab allocator.
On Wed, 13 Apr 2011 14:44:16 -0700 (PDT)
David Rientjes <[email protected]> wrote:
> > -static inline void *alloc_fdmem(unsigned int size)
> > +static void *alloc_fdmem(unsigned int size)
> > {
> > - void *data;
> > -
> > - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > - if (data != NULL)
> > - return data;
> > -
> > + /*
> > + * Very large allocations can stress page reclaim, so fall back to
> > + * vmalloc() if the allocation size will be considered "large" by the VM.
> > + */
> > + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> > + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > + if (data != NULL)
> > + return data;
> > + }
> > return vmalloc(size);
> > }
> >
>
> It's a shame that we can't at least try kmalloc() with sufficiently large
> sizes by doing something like
>
> gfp_t flags = GFP_NOWAIT | __GFP_NOWARN;
>
> if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> flags |= GFP_KERNEL;
> data = kmalloc(size, flags);
> if (data)
> return data;
> return vmalloc(size);
>
> which would at least attempt to use the slab allocator.
Maybe. If the fdtable is that huge then the fork() is probably going
to be pretty slow anyway. And the large allocation might cause
depletion of high-order free pages and might cause fragmentation of
even-higher-order pages by splitting them up. </handwaving>
On Wednesday 13 April 2011 at 14:16 -0700, Andrew Morton wrote:
> So am I correct in believing that this regression is due to the
> high-order allocations putting excess stress onto page reclaim?
>
I suppose so.
> If so, then how large _are_ these allocations? This perhaps can be
> determined from /proc/slabinfo. They must be pretty huge, because slub
> likes to do excessively-large allocations and the system handles that
> reasonably well.
>
> I suppose that a suitable fix would be
>
>
> From: Andrew Morton <[email protected]>
>
> Azurit reports large increases in system time after 2.6.36 when running
> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
>
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
>
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
>
> Reported-by: azurIt <[email protected]>
> Cc: Changli Gao <[email protected]>
> Cc: Americo Wang <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> fs/file.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> */
> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
> {
> - void *data;
> -
> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> - if (data != NULL)
> - return data;
> -
> + /*
> + * Very large allocations can stress page reclaim, so fall back to
> + * vmalloc() if the allocation size will be considered "large" by the VM.
> + */
> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> + if (data != NULL)
> + return data;
> + }
> return vmalloc(size);
> }
>
> _
>
Acked-by: Eric Dumazet <[email protected]>
#define PAGE_ALLOC_COSTLY_ORDER 3
On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
Thanks !
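The arithmetic behind that number, as a quick sketch (assuming 4 KiB pages and 8-byte struct file pointers on x86_64):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* x86_64 */
	unsigned long costly_order = 3;		/* PAGE_ALLOC_COSTLY_ORDER */
	unsigned long kmalloc_limit = page_size << costly_order;	/* 32768 bytes */
	unsigned long fd_slots = kmalloc_limit / sizeof(void *);	/* one pointer per fd */

	/* On x86_64 this prints 32768 bytes and 4096 entries. */
	printf("kmalloc() tried up to %lu bytes, i.e. a ->fd array of %lu entries\n",
	       kmalloc_limit, fd_slots);
	return 0;
}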
On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <[email protected]> wrote:
> > --- a/fs/file.c~a
> > +++ a/fs/file.c
> > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> > */
> > static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
> >
> > -static inline void *alloc_fdmem(unsigned int size)
> > +static void *alloc_fdmem(unsigned int size)
> > {
> > - void *data;
> > -
> > - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > - if (data != NULL)
> > - return data;
> > -
> > + /*
> > + * Very large allocations can stress page reclaim, so fall back to
> > + * vmalloc() if the allocation size will be considered "large" by the VM.
> > + */
> > + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> > + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > + if (data != NULL)
> > + return data;
> > + }
> > return vmalloc(size);
> > }
> >
> > _
> >
>
> Acked-by: Eric Dumazet <[email protected]>
>
> #define PAGE_ALLOC_COSTLY_ORDER 3
>
> On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
Thanks. I added the cc:stable to the changelog.
It'd be nice to get this tested if poss, to confirm that it actually
fixes things.
Also, Melpoke.
On Wednesday 13 April 2011 at 22:28 -0700, Andrew Morton wrote:
> On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <[email protected]> wrote:
>
> > > --- a/fs/file.c~a
> > > +++ a/fs/file.c
> > > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> > > */
> > > static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
> > >
> > > -static inline void *alloc_fdmem(unsigned int size)
> > > +static void *alloc_fdmem(unsigned int size)
> > > {
> > > - void *data;
> > > -
> > > - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > > - if (data != NULL)
> > > - return data;
> > > -
> > > + /*
> > > + * Very large allocations can stress page reclaim, so fall back to
> > > + * vmalloc() if the allocation size will be considered "large" by the VM.
> > > + */
> > > + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
> > > + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> > > + if (data != NULL)
> > > + return data;
> > > + }
> > > return vmalloc(size);
> > > }
> > >
> > > _
> > >
> >
> > Acked-by: Eric Dumazet <[email protected]>
> >
> > #define PAGE_ALLOC_COSTLY_ORDER 3
> >
> > On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
>
> Thanks. I added the cc:stable to the changelog.
>
> It'd be nice to get this tested if poss, to confrm that it actually
> fixes things.
>
> Also, Melpoke.
Azurit, could you check how many fds are opened by your Apache servers?
(must be related to number of virtual hosts / acces_log / error_log
files)
Pick one pid from ps list
ps aux | grep apache
ls /proc/{pid_of_one_apache}/fd | wc -l
or
lsof -p { pid_of_one_apache} | tail -n 2
apache2 8501 httpadm 13w REG 104,7 2350407 3866638 /data/logs/httpd/rewrites.log
apache2 8501 httpadm 14r 0000 0,10 0 263148343 eventpoll
Here it's "14"
Thanks
Thanks Mr. Michal for all your comments :)
As I understand from your comments, malloc from user space will not have much impact on memory fragmentation.
Will memory fragmentation be visible if I do kmalloc from a kernel module????
> No. When you call malloc() only virtual address space
> is allocated. The
> actual allocation of physical space occurs when user space
> accesses the
> memory (either reads or writes) and it happens page at a
> time.
Here, if I do memset then I am accessing the memory... right? I am already doing that in my sample program.
> what really happens is that kernel allocates the 0-order
> pages and when
> it runs out of those, splits a 1-order page into two
> 0-order pages and
> takes one of those.
Actually, if I understand the buddy allocator, it allocates pages from top to bottom. That means it checks for the highest order block that can be allocated; if possible it allocates pages from that block, otherwise it splits the next highest block into two.
What will happen if the next higher blocks are all empty???
Is memory fragmentation always caused by kernel space programs and not by user space at all??
Can you provide me with some references for mitigating memory fragmentation in Linux?
Thanks,
Pintu
--- On Wed, 4/13/11, Michal Nazarewicz <[email protected]> wrote:
> From: Michal Nazarewicz <[email protected]>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Am?rico Wang" <[email protected]>, "Pintu Agarwal" <[email protected]>
> Cc: "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Jiri Slaby" <[email protected]>, "azurIt" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
> Date: Wednesday, April 13, 2011, 10:25 AM
> On Wed, 13 Apr 2011 15:56:00 +0200, Pintu Agarwal <[email protected]> wrote:
> > My requirement is, I wanted to measure memory fragmentation level in
> > linux kernel2.6.29 (ARM cortex A8 without swap).
> > How can I measure fragmentation level(percentage) from /proc/buddyinfo ?
>
> [...]
>
> > In my linux2.6.29 ARM machine, the initial /proc/buddyinfo shows the following:
> > Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
> > Node 1, zone DMA 15 320 423 225 97 26 1 0 0 0 0 0
> >
> > After running my sample program (with 16 iterations) the buddyinfo output is as follows:
> > Requesting <16> blocks of memory of block size <262144>........
> > Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
> > Node 1, zone DMA 15 301 419 224 96 27 1 0 0 0 0 0
> > nr_free_pages 169
> > nr_free_pages 6545
> > *****************************************
> >
> > Node 0, zone DMA 17 22 1 1 0 1 1 0 0 0 0 0
> > Node 1, zone DMA 18 2 305 226 96 27 1 0 0 0 0 0
> > nr_free_pages 169
> > nr_free_pages 5514
> > -----------------------------------------
> >
> > The requested block size is 64 pages (2^6) for each block.
> > But if we see the output after 16 iterations the buddyinfo allocates pages only from Node 1 , (2^0, 2^1, 2^2, 2^3).
> > But the actual allocation should happen from (2^6) block in buddyinfo.
>
> No. When you call malloc() only virtual address space is allocated. The
> actual allocation of physical space occurs when user space accesses the
> memory (either reads or writes) and it happens one page at a time.
>
> As a matter of fact, if you have a limited number of 0-order pages and
> allocate a block of 64 pages in user space, later accessing the memory,
> what really happens is that the kernel allocates the 0-order pages and,
> when it runs out of those, splits a 1-order page into two 0-order pages
> and takes one of them.
>
> Because of the MMU, fragmentation of physical memory is not an issue for
> normal user space programs.
>
> It becomes an issue once you deal with hardware that has no MMU
> and no support for scatter-gather DMA, or with some big kernel structures.
>
> /proc/buddyinfo tells you how many free pages of given order there are
> in the system. You may interpret it in such a way that the bigger number
> of the low order pages the bigger fragmentation of physical memory. If
> there was no fragmentation (for some definition of the term) you'd get only
> the highest order pages and at most one page for each lower order.
>
> Again though, this fragmentation is not an issue for user space programs.
>
> --
> Best regards, _ _
> .o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
> ..o | Computer Science, Michal "mina86" Nazarewicz (o o)
> ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--
>
Here it is:
# ls /proc/31416/fd | wc -l
5926
azur
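For what it's worth, a back-of-the-envelope sketch of what that count implies for the fdtable allocation (assuming the table is sized to the next power-of-two number of slots and 8-byte struct file pointers; the exact rounding in alloc_fdtable() may differ):

#include <stdio.h>

/* Hypothetical helper: round up to the next power of two. */
static unsigned long roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	unsigned long open_fds = 5926;			/* count reported above */
	unsigned long slots = roundup_pow2(open_fds);	/* 8192 under this assumption */
	unsigned long bytes = slots * sizeof(void *);	/* ~64 KiB for the ->fd array */

	printf("%lu fds -> %lu slots -> %lu KiB, i.e. an order-4 kmalloc(), above PAGE_ALLOC_COSTLY_ORDER\n",
	       open_fds, slots, bytes >> 10);
	return 0;
}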
______________________________________________________________
> Od: "Eric Dumazet" <[email protected]>
> Komu: Andrew Morton <[email protected]>
> Dátum: 14.04.2011 08:32
> Predmet: Re: Regression from 2.6.36
>
> CC: "Changli Gao" <[email protected]>, "Américo Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>, "Mel Gorman" <[email protected]>
>Le mercredi 13 avril 2011 à 22:28 -0700, Andrew Morton a écrit :
>> On Thu, 14 Apr 2011 04:10:58 +0200 Eric Dumazet <[email protected]> wrote:
>>
>> > > --- a/fs/file.c~a
>> > > +++ a/fs/file.c
>> > > @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> > > */
>> > > static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>> > >
>> > > -static inline void *alloc_fdmem(unsigned int size)
>> > > +static void *alloc_fdmem(unsigned int size)
>> > > {
>> > > - void *data;
>> > > -
>> > > - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > - if (data != NULL)
>> > > - return data;
>> > > -
>> > > + /*
>> > > + * Very large allocations can stress page reclaim, so fall back to
>> > > + * vmalloc() if the allocation size will be considered "large" by the VM.
>> > > + */
>> > > + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
>> > > + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> > > + if (data != NULL)
>> > > + return data;
>> > > + }
>> > > return vmalloc(size);
>> > > }
>> > >
>> > > _
>> > >
>> >
>> > Acked-by: Eric Dumazet <[email protected]>
>> >
>> > #define PAGE_ALLOC_COSTLY_ORDER 3
>> >
>> > On x86_64, this means we try kmalloc() up to 4096 files in fdtable.
>>
>> Thanks. I added the cc:stable to the changelog.
>>
>> It'd be nice to get this tested if possible, to confirm that it actually
>> fixes things.
>>
>> Also, Mel: poke.
>
>Azurit, could you check how many fds are opened by your apache servers?
>(must be related to the number of virtual hosts / access_log / error_log
>files)
>
>Pick one pid from ps list
>ps aux | grep apache
>
>ls /proc/{pid_of_one_apache}/fd | wc -l
>
>or
>
>lsof -p { pid_of_one_apache} | tail -n 2
>apache2 8501 httpadm 13w REG 104,7 2350407 3866638 /data/logs/httpd/rewrites.log
>apache2 8501 httpadm 14r 0000 0,10 0 263148343 eventpoll
>
>Here it's "14"
>
>Thanks
>
>
>
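For reference, the same count can also be taken with a small C sketch instead of the shell one-liners above (pid 31416 is just the example process from this thread and would be substituted as needed):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    /* Count entries in /proc/<pid>/fd, like "ls /proc/<pid>/fd | wc -l".
     * 31416 is the example pid from this thread; substitute as needed. */
    DIR *d = opendir("/proc/31416/fd");
    struct dirent *e;
    int count = 0;

    if (!d) {
        perror("opendir");
        return 1;
    }
    while ((e = readdir(d)) != NULL)
        if (e->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    closedir(d);
    printf("%d open fds\n", count);
    return 0;
}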
On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2011 04:37:36 +0200
> Eric Dumazet <[email protected]> wrote:
>
> > On Tuesday, 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <[email protected]> wrote:
> > >
> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
> > > > <[email protected]> wrote:
> > > > >
> > > > > It's somewhat unclear (to me) what caused this regression.
> > > > >
> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
> > > > > and this makes the page allocator go nuts trying to satisfy high-order
> > > > > page allocation requests?
> > > > >
> > > > > Is it because the kernel now will usually free the fdtable
> > > > > synchronously within the rcu callback, rather than deferring this to a
> > > > > workqueue?
> > > > >
> > > > > The latter seems unlikely, so I'm thinking this was a case of
> > > > > high-order-allocations-considered-harmful?
> > > > >
> > > >
> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
> > > > fragmentation. For example, when asking for 5 pages we get 8 pages, and 3
> > > > pages are wasted, so memory thrashing eventually happens.
> > >
> > > That theory sounds less likely, but could be tested by using
> > > alloc_pages_exact().
> > >
> >
> > Very unlikely, since fdtable sizes are powers of two, unless you hit
> > sysctl_nr_open and it was changed (default value being 2^20)
> >
>
> So am I correct in believing that this regression is due to the
> high-order allocations putting excess stress onto page reclaim?
>
This is very plausible but it would be nice to get confirmation on
what the size of the fdtable was to be sure. If it's big enough for
high-order allocations and it's a fork-heavy workload with memory
mostly in use, the fork() latencies could be getting very high. In
addition, each fork is potentially kicking kswapd awake (to rebalance
the zone for higher orders). I do not see CONFIG_COMPACTION enabled
meaning that if I'm right in that kswapd is awake and fork() is
entering direct reclaim, then we are lumpy reclaiming as well which
can stall pretty severely.
> If so, then how large _are_ these allocations? This perhaps can be
> determined from /proc/slabinfo. They must be pretty huge, because slub
> likes to do excessively-large allocations and the system handles that
> reasonably well.
>
I'd be interested in finding out the value of /proc/sys/fs/file-max and
what the output of ulimit -n (max open files) is for the main server. This
should help us determine what the size of the fdtable is.
> I suppose that a suitable fix would be
>
>
> From: Andrew Morton <[email protected]>
>
> Azurit reports large increases in system time after 2.6.36 when running
> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
> to allocate fdmem if possible").
>
> That patch caused the vfs to use kmalloc() for very large allocations and
> this is causing excessive work (and presumably excessive reclaim) within
> the page allocator.
>
> Fix it by falling back to vmalloc() earlier - when the allocation attempt
> would have been considered "costly" by reclaim.
>
> Reported-by: azurIt <[email protected]>
> Cc: Changli Gao <[email protected]>
> Cc: Americo Wang <[email protected]>
> Cc: Jiri Slaby <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> fs/file.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff -puN fs/file.c~a fs/file.c
> --- a/fs/file.c~a
> +++ a/fs/file.c
> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> */
> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>
> -static inline void *alloc_fdmem(unsigned int size)
> +static void *alloc_fdmem(unsigned int size)
> {
> - void *data;
> -
> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> - if (data != NULL)
> - return data;
> -
> + /*
> + * Very large allocations can stress page reclaim, so fall back to
> + * vmalloc() if the allocation size will be considered "large" by the VM.
> + */
> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
The reporter will need to retest to confirm this is really ok. The patch that was
reported to help avoided high-order allocations entirely. If fork-heavy
workloads are really entering direct reclaim and increasing fork latency
enough to ruin performance, then this patch will also suffer. How much
it helps depends on how big the fdtable is.
> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
> + if (data != NULL)
> + return data;
> + }
> return vmalloc(size);
> }
>
I'm attaching a primitive perl script that reports high-order allocation
latencies. It'd be interesting to see what its output looks like,
particularly when the server is in trouble, if the bug reporter has the
time.
--
Mel Gorman
SUSE Labs
On Thursday, 14 April 2011 at 11:08 +0200, azurIt wrote:
> Here it is:
>
>
> # ls /proc/31416/fd | wc -l
> 5926
Hmm, if it's a 32bit kernel, I am afraid Andrew's patch won't solve the
problem...
[On a 32bit kernel, we still use kmalloc() up to 8192 files]
It's a completely 64bit system.
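For reference, the cut-off in the patch works out as below; this back-of-the-envelope sketch just evaluates PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER and divides by the pointer size for the two word sizes:

#include <stdio.h>

int main(void)
{
    unsigned long page_size = 4096;          /* typical x86 */
    unsigned long limit = page_size << 3;    /* PAGE_ALLOC_COSTLY_ORDER == 3 -> 32 KiB */

    /* The fd array holds one struct file pointer per slot, so the largest
     * fdtable still served by kmalloc() differs between 64bit and 32bit. */
    printf("kmalloc() cut-off: %lu bytes\n", limit);
    printf("  64bit pointers: up to %lu files\n", limit / 8);  /* 4096 */
    printf("  32bit pointers: up to %lu files\n", limit / 4);  /* 8192 */
    return 0;
}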
______________________________________________________________
> Od: "Eric Dumazet" <[email protected]>
> Komu: azurIt <[email protected]>
> Dátum: 14.04.2011 12:28
> Predmet: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <[email protected]>, "Changli Gao" <[email protected]>, "Américo Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>, "Mel Gorman" <[email protected]>
>On Thursday, 14 April 2011 at 11:08 +0200, azurIt wrote:
>> Here it is:
>>
>>
>> # ls /proc/31416/fd | wc -l
>> 5926
>
>Hmm, if it's a 32bit kernel, I am afraid Andrew's patch won't solve the
>problem...
>
>[On a 32bit kernel, we still use kmalloc() up to 8192 files]
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
On Thu, 14 Apr 2011 08:44:50 +0200, Pintu Agarwal
<[email protected]> wrote:
> As I understand from your comments, malloc from user space will
> not have much impact on memory fragmentation.
It has an impact, just like any kind of allocation; it just doesn't care
about fragmentation of physical memory. You can have only 0-order pages and
still successfully allocate megabytes of memory with malloc().
> Will the memory fragmentation be visible if I do kmalloc from
> the kernel module?
It will be more visible in the sense that if you allocate 8 KiB, the kernel
will have to find 8 KiB of contiguous physical memory (i.e. a 1-order page).
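As an illustration of that kind of request, a minimal module sketch (the module name and message are made up for the example) doing a single 8 KiB kmalloc(), which the buddy allocator must satisfy with an order-1, physically contiguous block:

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>

static void *buf;

static int __init fragtest_init(void)
{
    /* 8 KiB == two contiguous pages == an order-1 allocation on x86 */
    buf = kmalloc(8192, GFP_KERNEL);
    if (!buf)
        return -ENOMEM;
    pr_info("fragtest: got 8 KiB of physically contiguous memory at %p\n", buf);
    return 0;
}

static void __exit fragtest_exit(void)
{
    kfree(buf);
}

module_init(fragtest_init);
module_exit(fragtest_exit);
MODULE_LICENSE("GPL");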
>> No. When you call malloc() only virtual address space is allocated.
>> The actual allocation of physical space occurs when user space accesses
>> the memory (either reads or writes) and it happens a page at a time.
>
> Here, if I do memset then I am accessing the memory... right? I am
> already doing that in my sample program.
Yes. But note that even though it's a single memset() call, you are
accessing a page at a time and the kernel is allocating a page at a time.
On some architectures (not ARM) you could access two pages with a single
instruction, but I think that would result in two page faults anyway. I
might be wrong, though; the details are not important.
>> what really happens is that the kernel allocates the 0-order pages and,
>> when it runs out of those, splits a 1-order page into two 0-order pages
>> and takes one of those.
>
> Actually, if I understand the buddy allocator, it allocates pages from top
> to bottom.
No. If you want to allocate a single 0-order page, buddy looks for a
free 0-order page. If one is not found, it will look for a 1-order page
and split it. This goes up until buddy reaches a (MAX_ORDER-1)-order page.
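The splitting described above can be sketched in a few lines of user-space C; free_count[] is a toy stand-in for the per-order free lists, not kernel code:

#include <stdio.h>

#define MAX_ORDER 11

static unsigned long free_count[MAX_ORDER];  /* free blocks per order */

/* Allocate one block of 'order', splitting a larger block if needed. */
static int alloc_block(int order)
{
    int o = order;

    while (o < MAX_ORDER && free_count[o] == 0)
        o++;                     /* find the smallest big-enough free block */
    if (o == MAX_ORDER)
        return -1;               /* nothing free at all */
    free_count[o]--;             /* take it */
    while (o > order) {
        o--;
        free_count[o]++;         /* split: one buddy goes back on the free list */
    }
    return 0;
}

int main(void)
{
    int i;

    free_count[3] = 1;           /* start with a single order-3 (8-page) block */
    alloc_block(0);              /* ask for one 0-order page */
    for (i = 0; i < MAX_ORDER; i++)
        if (free_count[i])
            printf("order %d: %lu free\n", i, free_count[i]);
    return 0;
}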
> Is memory fragmentation always caused by kernel space programs
> and not by user space at all?
Well, no. If you allocate memory in user space, the kernel will have to
allocate physical memory, and *every* allocation may contribute to
fragmentation. The point is that all allocations from user space are
single-page allocations even if you malloc() MiBs of memory.
> Can you provide me with some references for migitating memory
> fragmentation in linux?
I'm not sure what you mean by that.
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--
Hello Mr. Michal,
Thanks for your comments.
Sorry. There was a small typo in my last sentence (mitigating not *migitating* memory fragmentation)
That means how can I measure the memory fragmentation either from user space or from kernel space.
Is there a way to measure the amount of memory fragmentation in linux?
Can you provide me some references for that?
Thanks,
Pintu
--- On Thu, 4/14/11, Michal Nazarewicz <[email protected]> wrote:
> From: Michal Nazarewicz <[email protected]>
> Subject: Re: Regarding memory fragmentation using malloc....
> To: "Am?rico Wang" <[email protected]>, "Pintu Agarwal" <[email protected]>
> Cc: "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Jiri Slaby" <[email protected]>, "azurIt" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
> Date: Thursday, April 14, 2011, 5:47 AM
> On Thu, 14 Apr 2011 08:44:50 +0200, Pintu Agarwal <[email protected]> wrote:
> > As I understand from your comments, malloc from user space will not
> > have much impact on memory fragmentation.
>
> It has an impact, just like any kind of allocation; it just doesn't care
> about fragmentation of physical memory.  You can have only 0-order pages
> and still successfully allocate megabytes of memory with malloc().
>
> > Will the memory fragmentation be visible if I do kmalloc from
> > the kernel module?
>
> It will be more visible in the sense that if you allocate 8 KiB, the
> kernel will have to find 8 KiB of contiguous physical memory (i.e. a
> 1-order page).
>
> >> No.  When you call malloc() only virtual address space is allocated.
> >> The actual allocation of physical space occurs when user space accesses
> >> the memory (either reads or writes) and it happens a page at a time.
> >
> > Here, if I do memset then I am accessing the memory... right? I am
> > already doing that in my sample program.
>
> Yes.  But note that even though it's a single memset() call, you are
> accessing a page at a time and the kernel is allocating a page at a time.
>
> On some architectures (not ARM) you could access two pages with a single
> instruction, but I think that would result in two page faults anyway.  I
> might be wrong, though; the details are not important.
>
> >> what really happens is that the kernel allocates the 0-order pages and,
> >> when it runs out of those, splits a 1-order page into two 0-order pages
> >> and takes one of those.
> >
> > Actually, if I understand the buddy allocator, it allocates pages from
> > top to bottom.
>
> No.  If you want to allocate a single 0-order page, buddy looks for a
> free 0-order page.  If one is not found, it will look for a 1-order page
> and split it.  This goes up until buddy reaches a (MAX_ORDER-1)-order page.
>
> > Is memory fragmentation always caused by kernel space programs and not
> > by user space at all?
>
> Well, no.  If you allocate memory in user space, the kernel will have to
> allocate physical memory, and *every* allocation may contribute to
> fragmentation.  The point is that all allocations from user space are
> single-page allocations even if you malloc() MiBs of memory.
>
> > Can you provide me with some references for migitating memory
> > fragmentation in linux?
>
> I'm not sure what you mean by that.
>
> --
> Best regards,                                            _     _
> .o. | Liege of Serenely Enlightened Majesty of         o' \,=./ `o
> ..o | Computer Science,  Michal "mina86" Nazarewicz      (o o)
> ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected].  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
On Thu, 14 Apr 2011 14:24:56 +0200, Pintu Agarwal
<[email protected]> wrote:
> Sorry. There was a small typo in my last sentence (mitigating not
> *migitating* memory fragmentation)
> That means how can I measure the memory fragmentation either from user
> space or from kernel space.
> Is there a way to measure the amount of memory fragmentation in linux?
I'm still not entirely sure what you need. You may try to measure
fragmentation by the number of low-order pages -- the more low-order pages
compared to high-order pages, the bigger the fragmentation.
As for how to mitigate it... There's memory compaction. There are some
optimisations in the buddy system. I'm probably not the best person to
ask anyway.
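Following that idea, a rough sketch that parses /proc/buddyinfo and reports, per zone, how much of the free memory sits in blocks below a chosen order; the metric is just the heuristic described above, not something the kernel exports:

#include <stdio.h>
#include <stdlib.h>

#define MAX_ORDER    11
#define TARGET_ORDER 3   /* "big enough" threshold for this example */

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/buddyinfo", "r");

    if (!f) {
        perror("/proc/buddyinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        int node, n = 0, order;
        char zone[32], *p;
        unsigned long count, total = 0, small = 0;

        if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &n) < 2)
            continue;
        p = line + n;
        for (order = 0; order < MAX_ORDER; order++) {
            count = strtoul(p, &p, 10);
            total += count << order;           /* free pages held at this order */
            if (order < TARGET_ORDER)
                small += count << order;
        }
        if (total)
            printf("Node %d zone %-8s: %5.1f%% of free pages in blocks below order %d\n",
                   node, zone, 100.0 * small / total, TARGET_ORDER);
    }
    fclose(f);
    return 0;
}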
--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, Michal "mina86" Nazarewicz (o o)
ooo +-----<email/xmpp: [email protected]>-----ooO--(_)--Ooo--
Also this new patch is working fine and fixing the problem.
Mel, I cannot run your script:
# perl watch-highorder-latency.pl
Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
# ls -ld /sys/kernel/debug/
ls: cannot access /sys/kernel/debug/: No such file or directory
azur
______________________________________________________________
> Od: "Mel Gorman" <[email protected]>
> Komu: Andrew Morton <[email protected]>
> Dátum: 14.04.2011 12:25
> Predmet: Re: Regression from 2.6.36
>
> CC: "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Am?rico Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
>On Wed, Apr 13, 2011 at 02:16:00PM -0700, Andrew Morton wrote:
>> On Wed, 13 Apr 2011 04:37:36 +0200
>> Eric Dumazet <[email protected]> wrote:
>>
>> > On Tuesday, 12 April 2011 at 18:31 -0700, Andrew Morton wrote:
>> > > On Wed, 13 Apr 2011 09:23:11 +0800 Changli Gao <[email protected]> wrote:
>> > >
>> > > > On Wed, Apr 13, 2011 at 6:49 AM, Andrew Morton
>> > > > <[email protected]> wrote:
>> > > > >
>> > > > > It's somewhat unclear (to me) what caused this regression.
>> > > > >
>> > > > > Is it because the kernel is now doing large kmalloc()s for the fdtable,
>> > > > > and this makes the page allocator go nuts trying to satisfy high-order
>> > > > > page allocation requests?
>> > > > >
>> > > > > Is it because the kernel now will usually free the fdtable
>> > > > > synchronously within the rcu callback, rather than deferring this to a
>> > > > > workqueue?
>> > > > >
>> > > > > The latter seems unlikely, so I'm thinking this was a case of
>> > > > > high-order-allocations-considered-harmful?
>> > > > >
>> > > >
>> > > > Maybe, but I am not sure. Maybe my patch causes too much internal
>> > > > fragmentation. For example, when asking for 5 pages we get 8 pages, and 3
>> > > > pages are wasted, so memory thrashing eventually happens.
>> > >
>> > > That theory sounds less likely, but could be tested by using
>> > > alloc_pages_exact().
>> > >
>> >
>> > Very unlikely, since fdtable sizes are powers of two, unless you hit
>> > sysctl_nr_open and it was changed (default value being 2^20)
>> >
>>
>> So am I correct in believing that this regression is due to the
>> high-order allocations putting excess stress onto page reclaim?
>>
>
>This is very plausible but it would be nice to get confirmation on
>what the size of the fdtable was to be sure. If it's big enough for
>high-order allocations and it's a fork-heavy workload with memory
>mostly in use, the fork() latencies could be getting very high. In
>addition, each fork is potentially kicking kswapd awake (to rebalance
>the zone for higher orders). I do not see CONFIG_COMPACTION enabled
>meaning that if I'm right in that kswapd is awake and fork() is
>entering direct reclaim, then we are lumpy reclaiming as well which
>can stall pretty severely.
>
>> If so, then how large _are_ these allocations? This perhaps can be
>> determined from /proc/slabinfo. They must be pretty huge, because slub
>> likes to do excessively-large allocations and the system handles that
>> reasonably well.
>>
>
>I'd be interested in finding out the value of /proc/sys/fs/file-max and
>what the output of ulimit -n (max open files) is for the main server. This
>should help us determine what the size of the fdtable is.
>
>> I suppose that a suitable fix would be
>>
>>
>> From: Andrew Morton <[email protected]>
>>
>> Azurit reports large increases in system time after 2.6.36 when running
>> Apache. It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
>> to allocate fdmem if possible").
>>
>> That patch caused the vfs to use kmalloc() for very large allocations and
>> this is causing excessive work (and presumably excessive reclaim) within
>> the page allocator.
>>
>> Fix it by falling back to vmalloc() earlier - when the allocation attempt
>> would have been considered "costly" by reclaim.
>>
>> Reported-by: azurIt <[email protected]>
>> Cc: Changli Gao <[email protected]>
>> Cc: Americo Wang <[email protected]>
>> Cc: Jiri Slaby <[email protected]>
>> Cc: Eric Dumazet <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Signed-off-by: Andrew Morton <[email protected]>
>> ---
>>
>> fs/file.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>>
>> diff -puN fs/file.c~a fs/file.c
>> --- a/fs/file.c~a
>> +++ a/fs/file.c
>> @@ -39,14 +39,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
>> */
>> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>>
>> -static inline void *alloc_fdmem(unsigned int size)
>> +static void *alloc_fdmem(unsigned int size)
>> {
>> - void *data;
>> -
>> - data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> - if (data != NULL)
>> - return data;
>> -
>> + /*
>> + * Very large allocations can stress page reclaim, so fall back to
>> + * vmalloc() if the allocation size will be considered "large" by the VM.
>> + */
>> + if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER) {
>
>The reporter will need to retest to confirm this is really ok. The patch that was
>reported to help avoided high-order allocations entirely. If fork-heavy
>workloads are really entering direct reclaim and increasing fork latency
>enough to ruin performance, then this patch will also suffer. How much
>it helps depends on how big the fdtable is.
>
>> + void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>> + if (data != NULL)
>> + return data;
>> + }
>> return vmalloc(size);
>> }
>>
>
>I'm attaching a primitive perl script that reports high-order allocation
>latencies. It'd be interesting to see what its output looks like,
>particularly when the server is in trouble, if the bug reporter has the
>time.
>
>--
>Mel Gorman
>SUSE Labs
>
>
On Fri, Apr 15, 2011 at 11:59:03AM +0200, azurIt wrote:
>
> Also this new patch is working fine and fixing the problem.
>
> Mel, I cannot run your script:
> # perl watch-highorder-latency.pl
> Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
>
> # ls -ld /sys/kernel/debug/
> ls: cannot access /sys/kernel/debug/: No such file or directory
>
mount -t debugfs none /sys/kernel/debug
If it still doesn't work, sysfs or the necessary FTRACE options are
not enabled on your .config. I'll give you a list if that is the case.
Thanks.
--
Mel Gorman
SUSE Labs
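This is not Mel's script, but a crude stand-in along the same lines can be built on the standard kmem tracepoints, assuming debugfs is mounted and the tracepoint/ftrace options are enabled in the config: enable mm_page_alloc and print every allocation of order 4 or higher as it happens.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TRACE_DIR "/sys/kernel/debug/tracing"

static int write_file(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fputs(val, f);
    fclose(f);
    return 0;
}

int main(void)
{
    char line[1024];
    FILE *pipe;

    if (write_file(TRACE_DIR "/events/kmem/mm_page_alloc/enable", "1")) {
        perror("enable mm_page_alloc");
        return 1;
    }
    pipe = fopen(TRACE_DIR "/trace_pipe", "r");
    if (!pipe) {
        perror("trace_pipe");
        return 1;
    }
    /* Each mm_page_alloc event line contains "order=N"; report the big ones. */
    while (fgets(line, sizeof(line), pipe)) {
        char *p = strstr(line, "order=");

        if (p && atoi(p + 6) >= 4)
            fputs(line, stdout);
    }
    return 0;
}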
# mount -t debugfs none /sys/kernel/debug
mount: mount point /sys/kernel/debug does not exist
# mkdir /sys/kernel/debug
mkdir: cannot create directory `/sys/kernel/debug': No such file or directory
config file used for testing is here:
http://watchdog.sk/lkml/config
azur
______________________________________________________________
> Od: "Mel Gorman" <[email protected]>
> Komu: azurIt <[email protected]>
> Dátum: 15.04.2011 12:47
> Predmet: Re: Regression from 2.6.36
>
> CC: "Mel Gorman" <[email protected]>, "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Am?rico Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
>On Fri, Apr 15, 2011 at 11:59:03AM +0200, azurIt wrote:
>>
>> Also this new patch is working fine and fixing the problem.
>>
>> Mel, I cannot run your script:
>> # perl watch-highorder-latency.pl
>> Failed to open /sys/kernel/debug/tracing/set_ftrace_filter for writing at watch-highorder-latency.pl line 17.
>>
>> # ls -ld /sys/kernel/debug/
>> ls: cannot access /sys/kernel/debug/: No such file or directory
>>
>
>mount -t debugfs none /sys/kernel/debug
>
>If it still doesn't work, sysfs or the necessary FTRACE options are
>not enabled on your .config. I'll give you a list if that is the case.
>
>Thanks.
>
>--
>Mel Gorman
>SUSE Labs
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
On Fri, 2011-04-15 at 12:56 +0200, azurIt wrote:
> # mount -t debugfs none /sys/kernel/debug
> mount: mount point /sys/kernel/debug does not exist
>
> # mkdir /sys/kernel/debug
> mkdir: cannot create directory `/sys/kernel/debug': No such file or directory
>
Mount sysfs first
mount -t sysfs none /sys
>
> config file used for testing is here:
> http://watchdog.sk/lkml/config
>
Try setting the following
CONFIG_TRACEPOINTS=y
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_FTRACE_NMI_ENTER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_GENERIC_TRACER=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_MMIOTRACE=y
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
--
Mel Gorman
SUSE Labs
sysfs was already mounted:
# mount
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot the new kernel tonight. Hope it won't degrade performance much.
______________________________________________________________
> Od: "Mel Gorman" <[email protected]>
> Komu: azurIt <[email protected]>
> Dátum: 15.04.2011 13:17
> Predmet: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Am?rico Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
>On Fri, 2011-04-15 at 12:56 +0200, azurIt wrote:
>> # mount -t debugfs none /sys/kernel/debug
>> mount: mount point /sys/kernel/debug does not exist
>>
>> # mkdir /sys/kernel/debug
>> mkdir: cannot create directory `/sys/kernel/debug': No such file or directory
>>
>
>Mount sysfs first
>
>mount -t sysfs none /sys
>
>>
>> config file used for testing is here:
>> http://watchdog.sk/lkml/config
>>
>
>Try setting the following
>
>CONFIG_TRACEPOINTS=y
>CONFIG_STACKTRACE=y
>CONFIG_USER_STACKTRACE_SUPPORT=y
>CONFIG_NOP_TRACER=y
>CONFIG_TRACER_MAX_TRACE=y
>CONFIG_FTRACE_NMI_ENTER=y
>CONFIG_CONTEXT_SWITCH_TRACER=y
>CONFIG_GENERIC_TRACER=y
>CONFIG_FTRACE=y
>CONFIG_FUNCTION_TRACER=y
>CONFIG_FUNCTION_GRAPH_TRACER=y
>CONFIG_IRQSOFF_TRACER=y
>CONFIG_SCHED_TRACER=y
>CONFIG_FTRACE_SYSCALLS=y
>CONFIG_STACK_TRACER=y
>CONFIG_BLK_DEV_IO_TRACE=y
>CONFIG_DYNAMIC_FTRACE=y
>CONFIG_FTRACE_MCOUNT_RECORD=y
>CONFIG_FTRACE_SELFTEST=y
>CONFIG_FTRACE_STARTUP_TEST=y
>CONFIG_MMIOTRACE=y
>CONFIG_HAVE_MMIOTRACE_SUPPORT=y
>
>--
>Mel Gorman
>SUSE Labs
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
On Fri, 2011-04-15 at 13:36 +0200, azurIt wrote:
> sysfs was already mounted:
>
> # mount
> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>
>
> I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot the new kernel tonight. Hope it won't degrade performance much.
>
It's only for curiosity's sake. As you report the patch fixes the
problem, it matches the theory that it's allocator latency. The script
would confirm it for sure, but it's not a high priority.
--
Mel Gorman
SUSE Labs
So it's really not necessary? It would be better for us if you could go without it, as it means running a buggy kernel for one more day.
Which kernel versions will include this fix ?
Thank you very much!
azur
______________________________________________________________
> Od: "Mel Gorman" <[email protected]>
> Komu: azurIt <[email protected]>
> Dátum: 15.04.2011 15:01
> Predmet: Re: Regression from 2.6.36
>
> CC: "Andrew Morton" <[email protected]>, "Eric Dumazet" <[email protected]>, "Changli Gao" <[email protected]>, "Am?rico Wang" <[email protected]>, "Jiri Slaby" <[email protected]>, [email protected], [email protected], [email protected], "Jiri Slaby" <[email protected]>
>On Fri, 2011-04-15 at 13:36 +0200, azurIt wrote:
>> sysfs was already mounted:
>>
>> # mount
>> sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
>>
>>
>> I have enabled all of the options you suggested and also CONFIG_DEBUG_FS ;) I will boot the new kernel tonight. Hope it won't degrade performance much.
>>
>
>It's only for curiosity's sake. As you report the patch fixes the
>problem, it matches the theory that it's allocator latency. The script
>would confirm it for sure, but it's not a high priority.
>
>--
>Mel Gorman
>SUSE Labs
>
>
On Fri, 2011-04-15 at 15:21 +0200, azurIt wrote:
> So it's really not necessary? It would be better for us if you could go without it, as it means running a buggy kernel for one more day.
>
I can live without it.
> Which kernel versions will include this fix ?
>
As it's a performance fix, I would guess 2.6.39 only. I don't think
-stable picks up performance fixes, but I could be wrong.
--
Mel Gorman
SUSE Labs
Andrew,
which kernel versions will include this patch ? Thank you.
azur
______________________________________________________________
> Od: "Andrew Morton" <[email protected]>
> Komu: Eric Dumazet <[email protected]>,Changli Gao <[email protected]>,Américo Wang <[email protected]>,Jiri Slaby <[email protected]>, azurIt <[email protected]>,[email protected], [email protected],[email protected], Jiri Slaby <[email protected]>,Mel Gorman <[email protected]>
> Dátum: 13.04.2011 23:26
> Predmet: Re: Regression from 2.6.36
>
>On Wed, 13 Apr 2011 14:16:00 -0700
>Andrew Morton <[email protected]> wrote:
>
>> fs/file.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>
>bah, stupid compiler.
>
>
>--- a/fs/file.c~vfs-avoid-large-kmallocs-for-the-fdtable
>+++ a/fs/file.c
>@@ -9,6 +9,7 @@
> #include <linux/module.h>
> #include <linux/fs.h>
> #include <linux/mm.h>
>+#include <linux/mmzone.h>
> #include <linux/time.h>
> #include <linux/sched.h>
> #include <linux/slab.h>
>@@ -39,14 +40,17 @@ int sysctl_nr_open_max = 1024 * 1024; /*
> */
> static DEFINE_PER_CPU(struct fdtable_defer, fdtable_defer_list);
>
>-static inline void *alloc_fdmem(unsigned int size)
>+static void *alloc_fdmem(unsigned int size)
> {
>- void *data;
>-
>- data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>- if (data != NULL)
>- return data;
>-
>+ /*
>+ * Very large allocations can stress page reclaim, so fall back to
>+ * vmalloc() if the allocation size will be considered "large" by the VM.
>+ */
>+ if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
>+ void *data = kmalloc(size, GFP_KERNEL|__GFP_NOWARN);
>+ if (data != NULL)
>+ return data;
>+ }
> return vmalloc(size);
> }
>
>_
>
>--
>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>the body of a message to [email protected]
>More majordomo info at http://vger.kernel.org/majordomo-info.html
>Please read the FAQ at http://www.tux.org/lkml/
>
On Tue, 19 Apr 2011 21:29:20 +0200
"azurIt" <[email protected]> wrote:
> which kernel versions will include this patch ? Thank you.
Probably 2.6.39. If so, some later 2.6.38.x too.