2015-06-24 08:27:47

by Vlastimil Babka

Subject: Re: Write throughput impaired by touching dirty_ratio

[add some CC's]

On 06/19/2015 05:16 PM, Mark Hills wrote:
> I noticed that any change to vm.dirty_ratio causes write throughput to
> plummet -- to around 5Mbyte/sec.
>
> <system bootup, kernel 4.0.5>
>
> # dd if=/dev/zero of=/path/to/file bs=1M
>
> # sysctl vm.dirty_ratio
> vm.dirty_ratio = 20
> <all ok; writes at ~150Mbyte/sec>
>
> # sysctl vm.dirty_ratio=20
> <all continues to be ok>
>
> # sysctl vm.dirty_ratio=21
> <writes drop to ~5Mbyte/sec>
>
> # sysctl vm.dirty_ratio=20
> <writes continue to be slow at ~5Mbyte/sec>
>
> The test shows that return to the previous value does not restore the old
> behaviour. I return the system to usable state with a reboot.
>
> Reads continue to be fast and are not affected.
>
> A quick look at the code suggests differing behaviour from
> writeback_set_ratelimit on startup, and that some of the calculations (eg.
> global_dirty_limit) are badly behaved once the system has booted.

Hmm, so the only thing that dirty_ratio_handler() changes, apart from
vm_dirty_ratio itself, is ratelimit_pages through writeback_set_ratelimit(). So
I assume the problem is with ratelimit_pages. There's num_online_cpus() used in
the calculation, which I think would differ between the initial system state
(where we are called by page_writeback_init()) and later, when all CPUs are
onlined. But I don't see the CPU onlining code updating the limit (unlike
memory hotplug, which does that), so that's suspicious.
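
For reference, the handler is basically just this (from mm/page-writeback.c,
trimmed, and quoting from memory, so double-check the details):

int dirty_ratio_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *lenp, loff_t *ppos)
{
        int old_ratio = vm_dirty_ratio;
        int ret;

        ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
        if (ret == 0 && write && vm_dirty_ratio != old_ratio) {
                writeback_set_ratelimit();
                vm_dirty_bytes = 0;     /* ratio and bytes are exclusive */
        }
        return ret;
}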

Another suspicious thing is that global_dirty_limits() looks at the current
process's flags. It seems odd to me that the process calling the sysctl would
determine a value global to the system.

If you are brave enough (and have the kernel configured properly and with
debuginfo), you can verify how the value of the ratelimit_pages variable
changes on the live system, using the crash tool. Just start it, and if
everything works, you can inspect the live system. It's a bit complicated
since there are two static variables called "ratelimit_pages" in the kernel,
so we can't print them easily (or I don't know how). First we have to get the
variable address:

crash> sym ratelimit_pages
ffffffff81e67200 (d) ratelimit_pages
ffffffff81ef4638 (d) ratelimit_pages

One will be absurdly high (probably less so on your 32-bit system), so it's not the one we want:

crash> rd -d ffffffff81ef4638 1
ffffffff81ef4638: 4294967328768

The second will have a smaller value:
(my system after boot with dirty ratio = 20)
crash> rd -d ffffffff81e67200 1
ffffffff81e67200: 1577

(after changing to 21)
crash> rd -d ffffffff81e67200 1
ffffffff81e67200: 1570

(after changing back to 20)
crash> rd -d ffffffff81e67200 1
ffffffff81e67200: 1496

So yes, it does differ, but not drastically. A difference between 1 and 8 online
CPUs would look different, I think. So my theory above is questionable. But
you might try what it looks like on your system...

>
> The system is an HP xw6600, running i686 kernel. This happens whether
> internal SATA HDD, SSD or external USB drive is used. I first saw this on
> kernel 4.0.4, and 4.0.5 is also affected.

So what was the last version where you changed the dirty ratio and it worked
fine?

>
> It would surprise me if I'm the only person who was setting dirty_ratio.
>
> Have others seen this behaviour? Thanks
>


2015-06-24 09:16:59

by Michal Hocko

Subject: Re: Write throughput impaired by touching dirty_ratio

On Wed 24-06-15 10:27:36, Vlastimil Babka wrote:
> [add some CC's]
>
> On 06/19/2015 05:16 PM, Mark Hills wrote:
[...]
> > The system is an HP xw6600, running i686 kernel. This happens whether

How many CPUs does the machine have?

> > internal SATA HDD, SSD or external USB drive is used. I first saw this on
> > kernel 4.0.4, and 4.0.5 is also affected.

OK, so this is a 32b kernel, which might be the most important part. What is
the value of /proc/sys/vm/highmem_is_dirtyable? Also, how does your lowmem
vs highmem look when you are setting the ratio (cat /proc/zoneinfo)?

It seems Vlastimil is right: a bogus ratelimit_pages is calculated
and your writers are throttled every few pages.
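
For context, the throttling entry point looks roughly like this (condensed
from mm/page-writeback.c; error handling and several details elided, so
treat it as a sketch rather than the exact code):

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
        int ratelimit = current->nr_dirtied_pause;
        int *p;

        preempt_disable();
        /*
         * Per-CPU backstop: with a tiny ratelimit_pages this counter
         * trips almost immediately, forcing ratelimit to 0 and thus a
         * balance_dirty_pages() call for nearly every dirtied page.
         */
        p = this_cpu_ptr(&bdp_ratelimits);
        if (unlikely(current->nr_dirtied >= ratelimit))
                *p = 0;
        else if (unlikely(*p >= ratelimit_pages)) {
                *p = 0;
                ratelimit = 0;
        }
        preempt_enable();

        if (unlikely(current->nr_dirtied >= ratelimit))
                balance_dirty_pages(mapping, current->nr_dirtied);
}
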
--
Michal Hocko
SUSE Labs

2015-06-24 22:52:46

by Mark Hills

Subject: Re: Write throughput impaired by touching dirty_ratio

On Wed, 24 Jun 2015, Vlastimil Babka wrote:

> [add some CC's]
>
> On 06/19/2015 05:16 PM, Mark Hills wrote:
> > I noticed that any change to vm.dirty_ratio causes write throughput to
> > plummet -- to around 5Mbyte/sec.
> >
> > <system bootup, kernel 4.0.5>
> >
> > # dd if=/dev/zero of=/path/to/file bs=1M
> >
> > # sysctl vm.dirty_ratio
> > vm.dirty_ratio = 20
> > <all ok; writes at ~150Mbyte/sec>
> >
> > # sysctl vm.dirty_ratio=20
> > <all continues to be ok>
> >
> > # sysctl vm.dirty_ratio=21
> > <writes drop to ~5Mbyte/sec>
> >
> > # sysctl vm.dirty_ratio=20
> > <writes continue to be slow at ~5Mbyte/sec>
> >
> > The test shows that return to the previous value does not restore the old
> > behaviour. I return the system to usable state with a reboot.
> >
> > Reads continue to be fast and are not affected.
> >
> > A quick look at the code suggests differing behaviour from
> > writeback_set_ratelimit on startup, and that some of the calculations (eg.
> > global_dirty_limit) are badly behaved once the system has booted.
>
> Hmm, so the only thing that dirty_ratio_handler() changes, apart from
> vm_dirty_ratio itself, is ratelimit_pages through writeback_set_ratelimit(). So
> I assume the problem is with ratelimit_pages. There's num_online_cpus() used in
> the calculation, which I think would differ between the initial system state
> (where we are called by page_writeback_init()) and later, when all CPUs are
> onlined. But I don't see the CPU onlining code updating the limit (unlike
> memory hotplug, which does that), so that's suspicious.
>
> Another suspicious thing is that global_dirty_limits() looks at the current
> process's flags. It seems odd to me that the process calling the sysctl would
> determine a value global to the system.

Yes, I also spotted this. The fragment of code is:

        tsk = current;
        if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
                background += background / 4;
                dirty += dirty / 4;
        }

It seems to imply the code was not always used from the /proc interface.
It's relevant in a moment...

> If you are brave enough (and have the kernel configured properly and with
> debuginfo),

I'm brave... :) I hadn't seen this tool before, thanks for introducing me
to it, I will use it more now, I'm sure.

> you can verify how the value of the ratelimit_pages variable changes on the
> live system, using the crash tool. Just start it, and if everything works,
> you can inspect the live system. It's a bit complicated since there are
> two static variables called "ratelimit_pages" in the kernel, so we can't
> print them easily (or I don't know how). First we have to get the
> variable address:
>
> crash> sym ratelimit_pages
> ffffffff81e67200 (d) ratelimit_pages
> ffffffff81ef4638 (d) ratelimit_pages
>
> One will be absurdly high (probably less so on your 32-bit system), so it's not the one we want:
>
> crash> rd -d ffffffff81ef4638 1
> ffffffff81ef4638: 4294967328768
>
> The second will have a smaller value:
> (my system after boot with dirty ratio = 20)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1577
>
> (after changing to 21)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1570
>
> (after changing back to 20)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1496

In my case there's only one such symbol (perhaps because this kernel
config is quite slimmed down?)

crash> sym ratelimit_pages
c148b618 (d) ratelimit_pages

(bootup with dirty_ratio 20)
crash> rd -d ratelimit_pages
c148b618: 78

(after changing to 21)
crash> rd -d ratelimit_pages
c148b618: 16

(after changing back to 20)
crash> rd -d ratelimit_pages
c148b618: 16

Compared to your system, even the bootup value seems pretty low.

So I am new to this code, but I took a look. Seems like we're basically
hitting the lower bound of 16.

void writeback_set_ratelimit(void)
{
        unsigned long background_thresh;
        unsigned long dirty_thresh;
        global_dirty_limits(&background_thresh, &dirty_thresh);
        global_dirty_limit = dirty_thresh;
        ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
        if (ratelimit_pages < 16)
                ratelimit_pages = 16;
}

From this code, we don't have dirty_thresh preserved, but we do have
global_dirty_limit:

crash> rd -d global_dirty_limit
c1545080: 0

And if that is zero then:

        ratelimit_pages = 0 / (num_online_cpus() * 32)
                        = 0

So it seems like this is the path to follow.

The function global_dirty_limits() produces the value for dirty_thresh
and, aside from a potential increase by 25% (the task-dependent adjustment
mentioned before), the value is derived as:

        if (vm_dirty_bytes)
                dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
        else
                dirty = (vm_dirty_ratio * available_memory) / 100;

I checked the vm_dirty_bytes codepath and that works:

(vm.dirty_bytes = 1048576000, 1000MB)
crash> rd -d ratelimit_pages
c148b618: 1000

Therefore it's the 'else' case, and this points to available_memory being
zero, or near it (in my case < 5). This value is the direct result of
global_dirtyable_memory(), which I've annotated with some values:

static unsigned long global_dirtyable_memory(void)
{
        unsigned long x;

        x = global_page_state(NR_FREE_PAGES);          // 2648091
        x -= min(x, dirty_balance_reserve);            // - 175522

        x += global_page_state(NR_INACTIVE_FILE);      // + 156369
        x += global_page_state(NR_ACTIVE_FILE);        // + 3475 = 2632413

        if (!vm_highmem_is_dirtyable)
                x -= highmem_dirtyable_memory(x);

        return x + 1;   /* Ensure that we never return 0 */
}

If I'm correct here, the global counters include the highmem pages, and it
implies that highmem_dirtyable_memory() is returning a value only slightly
less than, or equal to, the sum of the others.
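
For reference, highmem_dirtyable_memory() condenses to something like this
(comments elided; the clamp at the end is the important part):

static unsigned long highmem_dirtyable_memory(unsigned long total)
{
#ifdef CONFIG_HIGHMEM
        int node;
        unsigned long x = 0;

        for_each_node_state(node, N_HIGH_MEMORY) {
                struct zone *z =
                        &NODE_DATA(node)->node_zones[ZONE_HIGHMEM];

                x += zone_dirtyable_memory(z);
        }
        /* never report more highmem than the total we were given */
        return min(x, total);
#else
        return 0;
#endif
}

Plugging in my zoneinfo numbers below: 2536526 free + 80138 inactive file +
273523 active file is ~2890187 highmem pages (less the zone's share of the
dirty balance reserve), which exceeds the running total of 2632413, so the
min() returns 'total' and global_dirtyable_memory() collapses to 1.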

To test, I flipped the vm_highmem_is_dirtyable (which had no effect until
I forced it to re-evaluate ratelimit_pages):

$ echo 1 > /proc/sys/vm/highmem_is_dirtyable
$ echo 21 > /proc/sys/vm/dirty_ratio
$ echo 20 > /proc/sys/vm/dirty_ratio

crash> rd -d ratelimit_pages
c148b618: 2186

The value is now healthy, more so than even the value we started
with on bootup.

My questions and observations are:

* What does highmem_is_dirtyable actually mean, and should it really
default to 1?

Is it actually a misnomer? Since it's only used in
global_dirtyable_memory(), it doesn't actually prevent dirtying of
highmem; it just attempts to place a limit that corresponds to the
amount of non-highmem. I have limited understanding at the moment, but
that would be something different.

* That the codepaths around setting highmem_is_dirtyable from /proc
are broken; setting it also needs to make a call to
writeback_set_ratelimit() (see the sketch after this list)

* Even with highmem_is_dirtyable=1, there's still a sizeable difference
between the value on bootup (78) and the evaluation once booted (2186).
This goes in the wrong direction and is far too big a difference to be
solely num_online_cpus() switching from 1 to 8.
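
A minimal sketch of what I mean for the second point, mirroring
dirty_ratio_handler() (the handler name is mine; nothing like it exists
today, as far as I can see):

static int highmem_is_dirtyable_handler(struct ctl_table *table, int write,
		void __user *buffer, size_t *lenp, loff_t *ppos)
{
        int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

        /* the knob changes global_dirtyable_memory(), so the cached
         * ratelimit_pages must be re-derived as well */
        if (ret == 0 && write)
                writeback_set_ratelimit();
        return ret;
}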

The machine is 32-bit with 12GiB of RAM.

For info, I posted a typical zoneinfo, below.

> So yes, it does differ, but not drastically. A difference between 1 and 8
> online CPUs would look different, I think. So my theory above is
> questionable. But you might try what it looks like on your system...
>
> >
> > The system is an HP xw6600, running i686 kernel. This happens whether
> > internal SATA HDD, SSD or external USB drive is used. I first saw this on
> > kernel 4.0.4, and 4.0.5 is also affected.
>
> So what was the last version where you changed the dirty ratio and it worked
> fine?

Sorry, I don't know when it broke. I don't immediately have access to an
old kernel to test, but I could do that if necessary.

> > It would surprise me if I'm the only person who was setting dirty_ratio.
> >
> > Have others seen this behaviour? Thanks
> >
>

Thanks, I hope you find this useful.

--
Mark


Node 0, zone DMA
pages free 1566
min 196
low 245
high 294
scanned 0
spanned 4095
present 3989
managed 3970
nr_free_pages 1566
nr_alloc_batch 49
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 163
nr_active_file 1129
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 1292
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 842
nr_slab_unreclaimable 162
nr_page_table_pages 17
nr_kernel_stack 4
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 661
nr_written 661
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 377, 12165, 12165)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 8
all_unreclaimable: 0
start_pfn: 1
inactive_ratio: 1
Node 0, zone Normal
pages free 37336
min 4789
low 5986
high 7183
scanned 0
spanned 123902
present 123902
managed 96773
nr_free_pages 37336
nr_alloc_batch 331
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 4016
nr_active_file 26672
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 1
nr_file_pages 30684
nr_dirty 4
nr_writeback 0
nr_slab_reclaimable 19865
nr_slab_unreclaimable 4673
nr_page_table_pages 1027
nr_kernel_stack 281
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 14354
nr_written 21672
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 0, 94302, 94302)
pagesets
cpu: 0
count: 78
high: 186
batch: 31
vm stats threshold: 24
cpu: 1
count: 140
high: 186
batch: 31
vm stats threshold: 24
cpu: 2
count: 116
high: 186
batch: 31
vm stats threshold: 24
cpu: 3
count: 100
high: 186
batch: 31
vm stats threshold: 24
cpu: 4
count: 70
high: 186
batch: 31
vm stats threshold: 24
cpu: 5
count: 82
high: 186
batch: 31
vm stats threshold: 24
cpu: 6
count: 144
high: 186
batch: 31
vm stats threshold: 24
cpu: 7
count: 59
high: 186
batch: 31
vm stats threshold: 24
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 1
Node 0, zone HighMem
pages free 2536526
min 128
low 37501
high 74874
scanned 0
spanned 3214338
present 3017668
managed 3017668
nr_free_pages 2536526
nr_alloc_batch 10793
nr_inactive_anon 2118
nr_active_anon 118021
nr_inactive_file 80138
nr_active_file 273523
nr_unevictable 3475
nr_mlock 3475
nr_anon_pages 119672
nr_mapped 48158
nr_file_pages 357567
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 2766
nr_dirtied 1882996
nr_written 1695681
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 151
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 171
high: 186
batch: 31
vm stats threshold: 64
cpu: 1
count: 80
high: 186
batch: 31
vm stats threshold: 64
cpu: 2
count: 91
high: 186
batch: 31
vm stats threshold: 64
cpu: 3
count: 173
high: 186
batch: 31
vm stats threshold: 64
cpu: 4
count: 114
high: 186
batch: 31
vm stats threshold: 64
cpu: 5
count: 159
high: 186
batch: 31
vm stats threshold: 64
cpu: 6
count: 130
high: 186
batch: 31
vm stats threshold: 64
cpu: 7
count: 62
high: 186
batch: 31
vm stats threshold: 64
all_unreclaimable: 0
start_pfn: 127998
inactive_ratio: 10

2015-06-25 09:21:07

by Michal Hocko

Subject: Re: Write throughput impaired by touching dirty_ratio

On Wed 24-06-15 23:26:49, Mark Hills wrote:
> On Wed, 24 Jun 2015, Vlastimil Babka wrote:
[...]
> > Another suspicious thing is that global_dirty_limits() looks at the current
> > process's flags. It seems odd to me that the process calling the sysctl would
> > determine a value global to the system.
>
> Yes, I also spotted this. The fragment of code is:
>
>         tsk = current;
>         if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
>                 background += background / 4;
>                 dirty += dirty / 4;
>         }

Yes, this might be confusing for the proc path, but it shouldn't be hit
there, because PF_LESS_THROTTLE is currently only used from the nfs code
(to tell the throttling code not to throttle it because it is freeing
memory) and you usually do not set proc values from the RT context. So
this shouldn't matter.

[...]
> crash> sym ratelimit_pages
> c148b618 (d) ratelimit_pages
>
> (bootup with dirty_ratio 20)
> crash> rd -d ratelimit_pages
> c148b618: 78
>
> (after changing to 21)
> crash> rd -d ratelimit_pages
> c148b618: 16
>
> (after changing back to 20)
> crash> rd -d ratelimit_pages
> c148b618: 16
>
> Compared to your system, even the bootup value seems pretty low.
>
> So I am new to this code, but I took a look. Seems like we're basically
> hitting the lower bound of 16.

Yes this is really low and as suspected your writers are throttled every
few pages.

>
> void writeback_set_ratelimit(void)
> {
>         unsigned long background_thresh;
>         unsigned long dirty_thresh;
>         global_dirty_limits(&background_thresh, &dirty_thresh);
>         global_dirty_limit = dirty_thresh;
>         ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
>         if (ratelimit_pages < 16)
>                 ratelimit_pages = 16;
> }
>
> From this code, we don't have dirty_thresh preserved, but we do have
> global_dirty_limit:
>
> crash> rd -d global_dirty_limit
> c1545080: 0

This is really bad.

> And if that is zero then:
>
>         ratelimit_pages = 0 / (num_online_cpus() * 32)
>                         = 0
>
> So it seems like this is the path to follow.
>
> The function global_dirty_limits() produces the value for dirty_thresh
> and, aside from a potential increase by 25% (the task-dependent adjustment
> mentioned before), the value is derived as:
>
>         if (vm_dirty_bytes)
>                 dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
>         else
>                 dirty = (vm_dirty_ratio * available_memory) / 100;
>
> I checked the vm_dirty_bytes codepath and that works:
>
> (vm.dirty_bytes = 1048576000, 1000MB)
> crash> rd -d ratelimit_pages
> c148b618: 1000
>
> Therefore it's the 'else' case, and this points to available_memory being
> zero, or near it (in my case < 5).

OK, so it looks like you basically do not have any dirtyable memory.
Which smells like a highmem issue.

> This value is the direct result of
> global_dirtyable_memory(), which I've annotated with some values:
>
> static unsigned long global_dirtyable_memory(void)
> {
>         unsigned long x;
>
>         x = global_page_state(NR_FREE_PAGES);          // 2648091
>         x -= min(x, dirty_balance_reserve);            // - 175522
>
>         x += global_page_state(NR_INACTIVE_FILE);      // + 156369
>         x += global_page_state(NR_ACTIVE_FILE);        // + 3475 = 2632413
>
>         if (!vm_highmem_is_dirtyable)
>                 x -= highmem_dirtyable_memory(x);
>
>         return x + 1;   /* Ensure that we never return 0 */
> }
>
> If I'm correct here, the global counters include the highmem pages, and it
> implies that highmem_dirtyable_memory() is returning a value only slightly
> less than, or equal to, the sum of the others.

Exactly!

> To test, I flipped the vm_highmem_is_dirtyable (which had no effect until
> I forced it to re-evaluate ratelimit_pages):
>
> $ echo 1 > /proc/sys/vm/highmem_is_dirtyable
> $ echo 21 > /proc/sys/vm/dirty_ratio
> $ echo 20 > /proc/sys/vm/dirty_ratio
>
> crash> rd -d ratelimit_pages
> c148b618: 2186
>
> The value is now healthy, more so than even the value we started
> with on bootup.

From your /proc/zoneinfo:
> Node 0, zone HighMem
> pages free 2536526
> min 128
> low 37501
> high 74874
> scanned 0
> spanned 3214338
> present 3017668
> managed 3017668

You have 11G of highmem, which is a lot wrt. the lowmem

> Node 0, zone Normal
> pages free 37336
> min 4789
> low 5986
> high 7183
> scanned 0
> spanned 123902
> present 123902
> managed 96773

which is only 378M! So something had to eat a portion of the lowmem.
I think it is a bad idea to use a 32b kernel with that amount of memory in
general. The lowmem pressure is made even worse by the fact that something
is eating an already precious amount of lowmem. What is the reason to stick
with a 32b kernel anyway?

> My questions and observations are:
>
> * What does highmem_is_dirtyable actually mean, and should it really
> default to 1?

It says whether highmem should be considered dirtyable. It is not by
default. See more for motivation in 195cf453d2c3 ("mm/page-writeback:
highmem_is_dirtyable option").

> Is it actually a misnomer? Since it's only used in
> global_dirtyable_memory(), it doesn't actually prevent dirtying of
> highmem, it just attempts to place a limit that corresponds to the
> amount of non-highmem.I have limited understanding at the moment, but
> that would be something different.
>
> * That the codepaths around setting highmem_is_dirtyable from /proc
> are broken; setting it also needs to make a call to writeback_set_ratelimit()

That should probably be fixed.

> * Even with highmem_is_dirtyable=1, there's still a sizeable difference
> between the value on bootup (78) and the evaluation once booted (2186).
> This goes in the wrong direction and is far too big a difference to be
> solely num_online_cpus() switching from 1 to 8.

I am not sure where the 78 came from, because the default value is 32 and
it is not set anywhere else but in writeback_set_ratelimit(). At least it
looks like that from a quick code inspection. I am not an expert in
that area.
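
For reference, the initializer in mm/page-writeback.c looks like this
(paraphrasing from a quick look, so double-check):

        /* boot-time default; only writeback_set_ratelimit() changes it */
        static unsigned long ratelimit_pages = 32;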

> The machine is 32-bit with 12GiB of RAM.

I think you should really consider a 64b kernel for such a machine. You
would suffer from the lowmem pressure otherwise, and I do not see a good
reason for that. If you depend on a 32b userspace, then it should run just
fine on top of a 64b kernel.

[...]
--
Michal Hocko
SUSE Labs

2015-06-25 09:30:38

by Vlastimil Babka

Subject: Re: Write throughput impaired by touching dirty_ratio

On 06/25/2015 12:26 AM, Mark Hills wrote:
> On Wed, 24 Jun 2015, Vlastimil Babka wrote:
>
>> [add some CC's]
>>
>> On 06/19/2015 05:16 PM, Mark Hills wrote:
>>
>> Hmm, so the only thing that dirty_ratio_handler() changes, apart from
>> vm_dirty_ratio itself, is ratelimit_pages through writeback_set_ratelimit(). So
>> I assume the problem is with ratelimit_pages. There's num_online_cpus() used in
>> the calculation, which I think would differ between the initial system state
>> (where we are called by page_writeback_init()) and later, when all CPUs are
>> onlined. But I don't see the CPU onlining code updating the limit (unlike
>> memory hotplug, which does that), so that's suspicious.
>>
>> Another suspicious thing is that global_dirty_limits() looks at the current
>> process's flags. It seems odd to me that the process calling the sysctl would
>> determine a value global to the system.
>
> Yes, I also spotted this. The fragment of code is:
>
>         tsk = current;
>         if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
>                 background += background / 4;
>                 dirty += dirty / 4;
>         }
>
> It seems to imply the code was not always used from the /proc interface.
> It's relevant in a moment...
>
>> If you are brave enough (and have the kernel configured properly and with
>> debuginfo),
>
> I'm brave... :) I hadn't seen this tool before, thanks for introducing me
> to it, I will use it more now, I'm sure.

Ok I admit I didn't expect so much outcome from my suggestion. Good job :)

>> you can verify how the value of the ratelimit_pages variable changes on the
>> live system, using the crash tool. Just start it, and if everything works,
>> you can inspect the live system. It's a bit complicated since there are
>> two static variables called "ratelimit_pages" in the kernel, so we can't
>> print them easily (or I don't know how). First we have to get the
>> variable address:
>>
>> crash> sym ratelimit_pages
>> ffffffff81e67200 (d) ratelimit_pages
>> ffffffff81ef4638 (d) ratelimit_pages
>>
>> One will be absurdly high (probably less so on your 32-bit system), so it's not the one we want:
>>
>> crash> rd -d ffffffff81ef4638 1
>> ffffffff81ef4638: 4294967328768
>>
>> The second will have a smaller value:
>> (my system after boot with dirty ratio = 20)
>> crash> rd -d ffffffff81e67200 1
>> ffffffff81e67200: 1577
>>
>> (after changing to 21)
>> crash> rd -d ffffffff81e67200 1
>> ffffffff81e67200: 1570
>>
>> (after changing back to 20)
>> crash> rd -d ffffffff81e67200 1
>> ffffffff81e67200: 1496
>
> In my case there's only one such symbol (perhaps because this kernel
> config is quite slimmed down?)
>
> crash> sym ratelimit_pages
> c148b618 (d) ratelimit_pages
>
> (bootup with dirty_ratio 20)
> crash> rd -d ratelimit_pages
> c148b618: 78

With just one symbol you can use:

crash> p ratelimit_pages

This will take the type properly into account, while rd will print a full
32-bit/64-bit word depending on your kernel, which might be larger than the
actual variable. But if there are more symbols of the same name, "p" will
somehow randomly pick one of them and not even warn about it.

[snip]

>>>
>>
>
> Thanks, I hope you find this useful.

Yes, thanks, nice analysis. Since Michal already replied and has more
experience with the reclaim code and dirty throttling, I won't try
adding more.

2015-06-25 12:56:14

by Michal Hocko

Subject: Re: Write throughput impaired by touching dirty_ratio

On Thu 25-06-15 11:20:56, Michal Hocko wrote:
[...]
> From your /proc/zoneinfo:
> > Node 0, zone HighMem
> > pages free 2536526
> > min 128
> > low 37501
> > high 74874
> > scanned 0
> > spanned 3214338
> > present 3017668
> > managed 3017668
>
> You have 11G of highmem, which is a lot wrt. the lowmem
>
> > Node 0, zone Normal
> > pages free 37336
> > min 4789
> > low 5986
> > high 7183
> > scanned 0
> > spanned 123902
> > present 123902
> > managed 96773
>
> which is only 378M! So something had to eat a portion of the lowmem.

And just to clarify: your lowmem has only 123902 pages (+ the DMA zone, which
has 16M, so it doesn't add much), which is ~480M. The lowmem can sit only
in the low 1G (actually less, because part of that is used by the kernel for
special mappings). You only have half of that because, presumably, some
HW has reserved a portion of that address range. So your lowmem zone is
really tiny. Now, part of that range is used for kernel stuff like struct
pages, which have to describe the full memory, and this eats quite a
lot for 3 million pages. So you ended up with only 378M really usable
for all the kernel allocations which cannot live in the highmem (and there
are many of those). This makes for large memory pressure on that zone even
though you might have a huge amount of highmem free. This is the primary
reason why PAE kernels are not really usable for large memory setups
in general. Very specific usecases might work, but even then I would have
to have a very strong reason to stick with a 32b kernel (e.g. a stupid out
of tree driver which is 32b specific, or something similar).
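
As a rough illustration of the struct page overhead (assuming the usual
~32 bytes per struct page on 32b):

        3017668 highmem pages * 32 bytes ≈ 92M of lowmem

gone just to describe the highmem, before anything else is allocated.
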
--
Michal Hocko
SUSE Labs

2015-06-25 21:46:11

by Mark Hills

Subject: Re: Write throughput impaired by touching dirty_ratio

On Thu, 25 Jun 2015, Michal Hocko wrote:

> On Wed 24-06-15 23:26:49, Mark Hills wrote:
> [...]
> > To test, I flipped the vm_highmem_is_dirtyable (which had no effect until
> > I forced it to re-evaluate ratelimit_pages):
> >
> > $ echo 1 > /proc/sys/vm/highmem_is_dirtyable
> > $ echo 21 > /proc/sys/vm/dirty_ratio
> > $ echo 20 > /proc/sys/vm/dirty_ratio
> >
> > crash> rd -d ratelimit_pages
> > c148b618: 2186
> >
> > The value is now healthy, more so than even the value we started
> > with on bootup.
>
> From your /proc/zoneinfo:
> > Node 0, zone HighMem
> > pages free 2536526
> > min 128
> > low 37501
> > high 74874
> > scanned 0
> > spanned 3214338
> > present 3017668
> > managed 3017668
>
> > You have 11G of highmem, which is a lot wrt. the lowmem
>
> > Node 0, zone Normal
> > pages free 37336
> > min 4789
> > low 5986
> > high 7183
> > scanned 0
> > spanned 123902
> > present 123902
> > managed 96773
>
> > which is only 378M! So something had to eat a portion of the lowmem.
> > I think it is a bad idea to use a 32b kernel with that amount of memory in
> > general. The lowmem pressure is made even worse by the fact that something
> > is eating an already precious amount of lowmem.

Yup, that's the "vmalloc=512M" kernel parameter.

That was a requirement for my NVidia GPU to work, but now I have an AMD
card, so I have been able to remove that. It now gives me ~730M, and
provided some relief to ratelimit_pages; it's now at 63 (when dirty_ratio is
set to 20 after boot).

> What is the reason to stick with a 32b kernel anyway?

Because it's ideal for finding edge cases and bugs in kernels :-)

The real reason is more practical. I never had a problem with the 32-bit
one, and as my OS is quite home-grown and has evolved over 10+ years, I
haven't wanted to start again or reinstall.

This is the first time I've been aware of any problem or notable
performance impact -- the PAE kernel has worked very well for me.

The only reason I have so much RAM is that RAM is cheap, and it's a great
disk cache. I'd be more likely to remove some of the RAM than reinstall!

Perhaps someone could kindly explain why I don't have the same problem if
I have, say, 1.5G of RAM? Is it because the page table for 12G is large and
sits in the lowmem?

> > My questions and observations are:
> >
> > * What does highmem_is_dirtyable actually mean, and should it really
> > default to 1?
>
> It says whether highmem should be considered dirtyable. It is not by
> default. See more for motivation in 195cf453d2c3 ("mm/page-writeback:
> highmem_is_dirtyable option").

Thank you, this explanation is useful.

I know very little about the constraints on highmem and lowmem, though I
can make an educated guess (and from reading http://linux-mm.org/HighMemory).

I do have some questions though, perhaps if someone would be happy to
explain.

What is the "excessive scanning" mentioned in that patch, and why is it
any more than I would expect a 64-bit kernel to be doing? ie. what is the
practical downside of me doing:

$ echo 1073741824 > /proc/sys/vm/dirty_bytes

Also, is VMSPLIT_2G likely to be appropriate here if the kernel is
managing larger amounts of total RAM? I enabled it and it increases the
lowmem. Is this a simple tradeoff I am making now between user and kernel
space?

I'm not trying to sit in the dark ages, but the bad I/O throttling is the
only real problem I have suffered by staying 32-bit, and a small tweak has
restored sanity. So it's reasonable to question the logic that is in use.

For example, if we're saying that ratelimit_pages is truly dependent on
free lowmem, then surely it needs to be periodically re-evaluated as the
system is put to use? Setting 'dirty_ratio' implies that it's a ratio of a
fixed, unchanging value.
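
Something like a periodic refresh is what I have in mind, e.g. (just a
sketch; the names are mine, and I have no idea whether this would be an
acceptable approach):

static void ratelimit_refresh_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(ratelimit_refresh_work, ratelimit_refresh_fn);

static void ratelimit_refresh_fn(struct work_struct *work)
{
        /* dirtyable memory is a moving target; recompute from the
         * current state rather than a boot-time snapshot */
        writeback_set_ratelimit();
        schedule_delayed_work(&ratelimit_refresh_work, 10 * HZ);
}

kicked off once from page_writeback_init().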

Many thanks

--
Mark

2015-07-01 15:41:05

by Michal Hocko

Subject: Re: Write throughput impaired by touching dirty_ratio

On Thu 25-06-15 22:45:57, Mark Hills wrote:
> On Thu, 25 Jun 2015, Michal Hocko wrote:
>
> > On Wed 24-06-15 23:26:49, Mark Hills wrote:
> > [...]
> > > To test, I flipped the vm_highmem_is_dirtyable (which had no effect until
> > > I forced it to re-evaluate ratelimit_pages):
> > >
> > > $ echo 1 > /proc/sys/vm/highmem_is_dirtyable
> > > $ echo 21 > /proc/sys/vm/dirty_ratio
> > > $ echo 20 > /proc/sys/vm/dirty_ratio
> > >
> > > crash> rd -d ratelimit_pages
> > > c148b618: 2186
> > >
> > > The value is now healthy, more so than even the value we started
> > > with on bootup.
> >
> > From your /proc/zoneinfo:
> > > Node 0, zone HighMem
> > > pages free 2536526
> > > min 128
> > > low 37501
> > > high 74874
> > > scanned 0
> > > spanned 3214338
> > > present 3017668
> > > managed 3017668
> >
> > You have 11G of highmem, which is a lot wrt. the lowmem
> >
> > > Node 0, zone Normal
> > > pages free 37336
> > > min 4789
> > > low 5986
> > > high 7183
> > > scanned 0
> > > spanned 123902
> > > present 123902
> > > managed 96773
> >
> > which is only 378M! So something had to eat a portion of the lowmem.
> > I think it is a bad idea to use a 32b kernel with that amount of memory in
> > general. The lowmem pressure is made even worse by the fact that something
> > is eating an already precious amount of lowmem.
>
> Yup, that's the "vmalloc=512M" kernel parameter.

I see.

> That was a requirement for my NVidia GPU to work, but now I have an AMD
> card, so I have been able to remove that. It now gives me ~730M, and
> provided some relief to ratelimit_pages; it's now at 63 (when dirty_ratio is
> set to 20 after boot).
>
> > What is the reason to stick with a 32b kernel anyway?
>
> Because it's ideal for finding edge cases and bugs in kernels :-)

OK, then good luck ;)

> The real reason is more practical. I never had a problem with the 32-bit
> one, and as my OS is quite home-grown and has evolved over 10+ years, I
> haven't wanted to start again or reinstall.

I can understand that. I was using a PAE kernel for ages as well, even
though I was aware of all the problems. It wasn't such a big deal
because I didn't have much more than 4G on my machines. But it simply
stopped being practical and I moved on.

> This is the first time I've been aware of any problem or notable
> performance impact -- the PAE kernel has worked very well for me.
>
> The only reason I have so much RAM is that RAM is cheap, and it's a great
> disk cache. I'd be more likely to remove some of the RAM than reinstall!

Well, you do not have to reinstall the whole system. You should be able
to install just a 64b kernel.

> Perhaps someone could kindly explain why I don't have the same problem if
> I have, say, 1.5G of RAM? Is it because the page table for 12G is large and
> sits in the lowmem?

I've tried to explain some of the issues in the other email. Some of the
problems (e.g. performance, where each highmem page has to be mapped when
the kernel wants to access it) do not depend on the amount of memory,
but some of them do (e.g. struct pages, which scale with the amount of
memory).

> > > My questions and observations are:
> > >
> > > * What does highmem_is_dirtyable actually mean, and should it really
> > > default to 1?
> >
> > It says whether highmem should be considered dirtyable. It is not by
> > default. See more for motivation in 195cf453d2c3 ("mm/page-writeback:
> > highmem_is_dirtyable option").
>
> Thank you, this explanation is useful.
>
> I know very little about the constraints on highmem and lowmem, though I
> can make an educated guess (and reading http://linux-mm.org/HighMemory)
>
> I do have some questions though, perhaps if someone would be happy to
> explain.
>
> What is the "excessive scanning" mentioned in that patch, and why is it
> any more than I would expect a 64-bit kernel to be doing?

This is a good question! It wasn't obvious to me either, so I took my
pickaxe and a shovel and dug into the history.
Highmem was removed from the dirty throttling code back in
2005 by Andrea and Rik (https://lkml.org/lkml/2004/12/20/111) because
some mappings couldn't use highmem (e.g. dd of=block_device) and
so didn't get throttled properly; this made huge memory pressure
on lowmem and could even trigger the OOM killer. The code still
considered highmem dirtyable for highmem-capable mappings, but that
was later removed by Linus because it caused other problems
(http://marc.info/?l=git-commits-head&m=117013324728709).

> ie. what is the practical downside of me doing:
>
> $ echo 1073741824 > /proc/sys/vm/dirty_bytes

You could end up having the full lowmem dirty for lowmem-only mappings.

> Also, is VMSPLIT_2G likely to be appropriate here if the kernel is
> managing larger amounts of total RAM? I enabled it and it increases the
> lowmem. Is this a simple tradeoff I am making now between user and kernel
> space?

Your userspace will get only 2G of address space. If this is sufficient
for you, then it will help with your lowmem pressure.
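
For illustration, the relevant x86 config choice (from memory, so
double-check arch/x86/Kconfig):

        CONFIG_VMSPLIT_3G=y   # PAGE_OFFSET 0xC0000000: 3G user / 1G kernel (default)
        CONFIG_VMSPLIT_2G=y   # PAGE_OFFSET 0x80000000: 2G user / 2G kernel

Picking the 2G split roughly doubles the address space available to the
kernel (and thus lowmem), at the cost of capping each process at 2G of
virtual address space.
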
--
Michal Hocko
SUSE Labs