commit 6d2be915 ("mm/readahead.c: fix readahead failure for memoryless NUMA
nodes and limit readahead pages") imposed a 2MB limit on readahead, which
yielded good performance since it avoided unnecessary page caching.
However, it broke the sys_readahead semantics: "readahead() blocks until the
specified data has been read".
This patch retains the fix for memoryless nodes (which used to return zero)
and limits their readahead to 2MB to avoid unnecessary page cache thrashing,
but reverts to the old sanitized readahead for CPUs on nodes with memory.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=79111
Signed-off-by: Raghavendra K T <[email protected]>
---
mm/readahead.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/mm/readahead.c b/mm/readahead.c
index 0ca36a7..4514cf6 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -239,6 +239,24 @@ int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
*/
unsigned long max_sane_readahead(unsigned long nr)
{
+ unsigned long local_free_page;
+ int nid;
+
+ nid = numa_node_id();
+ if (node_present_pages(nid)) {
+ /*
+ * We sanitize readahead size depending on free memory in
+ * the local node.
+ */
+ local_free_page = node_page_state(nid, NR_INACTIVE_FILE)
+ + node_page_state(nid, NR_FREE_PAGES);
+ return min(nr, local_free_page / 2);
+ }
+ /*
+ * Readahead onto remote memory is better than no readahead when local
+ * numa node does not have memory. We limit the readahead to 2MB to
+ * avoid thrashing the page cache.
+ */
return min(nr, MAX_READAHEAD);
}
--
1.7.11.7
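For context, the ceiling is applied on the sys_readahead() path: the hunk
above sits next to force_page_cache_readahead(), which clamps requests
roughly like this (a simplified sketch, not the verbatim mm/readahead.c
source):

int force_page_cache_readahead(struct address_space *mapping,
			       struct file *filp, pgoff_t offset,
			       unsigned long nr_to_read)
{
	/* Clamp the caller's request before queueing any I/O;
	 * this is where max_sane_readahead() takes effect. */
	nr_to_read = max_sane_readahead(nr_to_read);

	/* ...the read-ahead is then submitted in chunks (elided)... */
	return 0;
}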
On Thu, Jul 3, 2014 at 6:02 AM, Raghavendra K T
<[email protected]> wrote:
>
> However it broke sys_readahead semantics: 'readahead() blocks until the specified
> data has been read'
What? Where did you find that insane sentence? And where did you find
an application that depends on that totally insane semantics that sure
as hell was never intentional.
If this comes from some man-page, then the man-page is just full of
sh*t, and is being crazy. The whole and *only* point of readahead() is
that it does *not* block, and you can do it across multiple files.
So NAK NAK NAK. This is insane and completely wrong. And the bugzilla
is crazy too. Why would anybody think that readahead() is the same as
read()?
Linus
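For readers following along, a minimal userspace sketch of the non-blocking
pattern Linus describes: fire off readahead() across several files and move
on, without waiting for the I/O to complete. The file names here are made up
for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Hypothetical paths, purely for illustration. */
	const char *files[] = { "/lib/libc.so.6", "/etc/ld.so.cache" };

	for (int i = 0; i < 2; i++) {
		int fd = open(files[i], O_RDONLY);
		if (fd < 0)
			continue;
		/* Schedules read-ahead of the first 2MB into the page
		 * cache and returns immediately; it does not block
		 * until the data has been read. */
		readahead(fd, 0, 2 * 1024 * 1024);
		close(fd);
	}
	return 0;
}

glibc exposes readahead() under _GNU_SOURCE; since the call only schedules
I/O into the page cache, closing the descriptor right away is fine.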
On 07/03/2014 09:11 PM, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 6:02 AM, Raghavendra K T
> <[email protected]> wrote:
>>
>> However it broke sys_readahead semantics: 'readahead() blocks until the specified
>> data has been read'
>
> What? Where did you find that insane sentence? And where did you find
> an application that depends on that totally insane semantics that sure
> as hell was never intentional.
>
> If this comes from some man-page,
Yes it is.
> then the man-page is just full of
> sh*t, and is being crazy. The whole and *only* point of readahead() is
> that it does *not* block, and you can do it across multiple files.
Entirely agree. In fact, I also had the strong opinion that we should
rather change the man page than make Linux perform badly by doing large
unnecessary readahead when there is no actual read, and the performance
numbers have proved that.
On Thu, Jul 3, 2014 at 11:11 AM, Raghavendra K T
<[email protected]> wrote:
>>
>> What? Where did you find that insane sentence? And where did you find
>> an application that depends on that totally insane semantics that sure
>> as hell was never intentional.
>>
>> If this comes from some man-page,
>
> Yes it is.
Ok, googling actually finds a fairly recent patch to fix it
http://www.spinics.net/lists/linux-mm/msg70517.html
and several much older "that's not true" comments.
I wonder how it ever happened, because it has never actually been true
that readahead() has been synchronous. It *has* been true that large
read-aheads have started so much IO that just the act of starting more
would wait for request allocations etc to free up, so it's not like it
has ever been entirely asynchronous either, but it definitely has
*never* been synchronous afaik.
The new behavior just means that you can't trigger the "request queues
are all so full that we end up blocking waiting for new request
allocations" quite as easily.
That said, the bugzilla entry you mentioned does mention "can't boot
3.14 now". I'm not sure what the meaning of that sentence is, though.
Does it mean "can't boot 3.14 to test it because the machine is busy",
or is it a typo and really meant 3.15, and that some bootup script
*depended* on readahead()? I don't know. It seems strange. It also
seems like it would be very hard to even show this semantically (aside
from timing, and looking at how much of the cache is used like the
test-program does).
So the bugzilla entry worries me a bit - we definitely do not want to
regress in case somebody really relied on timing - but without more
specific information I still think the real bug is just in the
man-page.
Linus
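On "looking at how much of the cache is used like the test-program does":
the standard userspace way to observe that is mincore(). A small sketch
(hypothetical, not the reporter's actual test program):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Print how many pages of a file are resident in the page cache. */
int main(int argc, char **argv)
{
	struct stat st;
	int fd;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;
	if (fstat(fd, &st) < 0 || st.st_size == 0)
		return 1;

	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;

	long psz = sysconf(_SC_PAGESIZE);
	size_t pages = (st.st_size + psz - 1) / psz;
	unsigned char *vec = malloc(pages);

	if (vec && mincore(map, st.st_size, vec) == 0) {
		size_t resident = 0;
		for (size_t i = 0; i < pages; i++)
			resident += vec[i] & 1;
		printf("%zu of %zu pages resident\n", resident, pages);
	}
	free(vec);
	munmap(map, st.st_size);
	close(fd);
	return 0;
}

Running it before and after a readahead() call shows whether the pages
actually landed in the cache, independent of timing.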
On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
<[email protected]> wrote:
>
> So the bugzilla entry worries me a bit - we definitely do not want to
> regress in case somebody really relied on timing - but without more
> specific information I still think the real bug is just in the
> man-page.
Side note: the 2MB limit may be too small. 2M is peanuts on modern
machines, even for fairly slow IO, and there are lots of files (like
glibc etc) that people might want to read-ahead during boot. We
already do bigger read-ahead if people just do "read()" system calls.
So I could certainly imagine that we should increase it.
I do *not* think we should bow down to insane man-pages that have
always been wrong, though, and I don't think we should increase it to
"let's just read-ahead a whole ISO image" kind of sizes..
Linus
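For reference, the ceiling under discussion is the MAX_READAHEAD define
that commit 6d2be915 added to mm/readahead.c, so raising it is a one-line
change. The 8MB value below is only a hypothetical example, not a settled
choice:

/* mm/readahead.c today: 2MB, expressed in page-cache pages. */
#define MAX_READAHEAD	((512*4096)/PAGE_CACHE_SIZE)

/* A hypothetical 8MB ceiling would instead be:
 *	#define MAX_READAHEAD	((2048*4096)/PAGE_CACHE_SIZE)
 */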
On 07/03/2014 11:52 PM, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 11:11 AM, Raghavendra K T
> <[email protected]> wrote:
>>>
>>> If this comes from some man-page,
>>
>> Yes it is.
>
> Ok, googling actually finds a fairly recent patch to fix it
>
> http://www.spinics.net/lists/linux-mm/msg70517.html
>
> and several much older "that's not true" comments.
Thanks. I had missed that.
>
> That said, the bugzilla entry you mentioned does mention "can't boot
> 3.14 now". I'm not sure what the meaning of that sentence is, though.
> Does it mean "can't boot 3.14 to test it because the machine is busy",
> or is it a typo and really meant 3.15, and that some bootup script
> *depended* on readahead()? I don't know. It seems strange.
I think your guess is right; it probably meant to say "I can't boot it
anymore since I already upgraded to 3.15", because the bootup script (if
there is one) would eventually have to do an actual read, IIUC.
On 07/03/2014 11:59 PM, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> So the bugzilla entry worries me a bit - we definitely do not want to
>> regress in case somebody really relied on timing - but without more
>> specific information I still think the real bug is just in the
>> man-page.
>
> Side note: the 2MB limit may be too small. 2M is peanuts on modern
> machines, even for fairly slow IO, and there are lots of files (like
> glibc etc) that people might want to read-ahead during boot. We
> already do bigger read-ahead if people just do "read()" system calls.
> So I could certainly imagine that we should increase it.
>
> I do *not* think we should bow down to insane man-pages that have
> always been wrong, though, and I don't think we should increase it to
> "let's just read-ahead a whole ISO image" kind of sizes..
Okay, how about something like 256MB? I would be happy to send a patch
for that change.
On Thu, Jul 3, 2014 at 11:38 AM, Raghavendra K T
<[email protected]> wrote:
>
> Okay, how about something like 256MB? I would be happy to send a patch
> for that change.
I'd like to see some performance numbers. I know at least Fedora uses
"readahead()" in the startup scripts, do we have any performance
numbers for that?
Also, I think 256MB is actually excessive. People still do have really
slow devices out there. USB-2 is still common, and drives that read at
15MB/s are not unusual. Do we really want to do readahead() that can
take tens of seconds (and *will* take tens of seconds synchronously,
because the IO requests fill up).
So I wouldn't go from 2 to 256. That seems like an excessive jump. I
was more thinking in the 4-8MB range. But even then, I think we should
always have technical reasons (ie preferably numbers) for the change,
not just randomly change it.
Linus
On 07/04/2014 12:08 AM, Raghavendra K T wrote:
> On 07/03/2014 11:59 PM, Linus Torvalds wrote:
>> On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
>> <[email protected]> wrote:
[...]
>>
>> I do *not* think we should bow down to insane man-pages that have
>> always been wrong, though, and I don't think we should increase it to
>> "let's just read-ahead a whole ISO image" kind of sizes..
>
> Okay, how about something like 256MB? I would be happy to send a patch
> for that change.
Sorry, I was too fast. I think something like 16MB? I'll send a patch
with that (unless you have a different size in mind).
On 07/04/2014 12:23 AM, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 11:38 AM, Raghavendra K T
> <[email protected]> wrote:
>>
>> Okay, how about something like 256MB? I would be happy to send a patch
>> for that change.
>
> I'd like to see some performance numbers. I know at least Fedora uses
> "readahead()" in the startup scripts, do we have any performance
> numbers for that?
>
> Also, I think 256MB is actually excessive. People still do have really
> slow devices out there. USB-2 is still common, and drives that read at
> 15MB/s are not unusual. Do we really want to do readahead() that can
> take tens of seconds (and *will* take tens of seconds synchronously,
> because the IO requests fill up).
>
> So I wouldn't go from 2 to 256. That seems like an excessive jump. I
> was more thinking in the 4-8MB range. But even then, I think we should
> always have technical reasons (ie preferably numbers) for the change,
> not just randomly change it.
Okay. I'll take some time to do the analysis. I think we should also
keep in mind the possible remote readahead that would cause an
unnecessary penalty.
On Thu, Jul 03, 2014 at 11:53:57AM -0700, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 11:38 AM, Raghavendra K T
> <[email protected]> wrote:
> >
> > Okay, how about something like 256MB? I would be happy to send a patch
> > for that change.
>
> I'd like to see some performance numbers. I know at least Fedora uses
> "readahead()" in the startup scripts, do we have any performance
> numbers for that?
This got rolled up into systemd a while ago, so it's not just Fedora.
Re: numbers, systemd-analyze and systemd-bootchart look like the way
to figure that out.
https://wiki.archlinux.org/index.php/Improve_boot_performance
Dave
>>>>> "Linus" == Linus Torvalds <[email protected]> writes:
Linus> On Thu, Jul 3, 2014 at 11:22 AM, Linus Torvalds
Linus> <[email protected]> wrote:
>>
>> So the bugzilla entry worries me a bit - we definitely do not want to
>> regress in case somebody really relied on timing - but without more
>> specific information I still think the real bug is just in the
>> man-page.
Linus> Side note: the 2MB limit may be too small. 2M is peanuts on modern
Linus> machines, even for fairly slow IO, and there are lots of files (like
Linus> glibc etc) that people might want to read-ahead during boot. We
Linus> already do bigger read-ahead if people just do "read()" system calls.
Linus> So I could certainly imagine that we should increase it.
Linus> I do *not* think we should bow down to insane man-pages that have
Linus> always been wrong, though, and I don't think we should increase it to
Linus> "let's just read-ahead a whole ISO image" kind of sizes..
This is one of those perennial questions of how to tune this. I agree
we should increase the number, but shouldn't it be based on both the
amount of memory in the machine, the number of devices (or is it all
just one big pool?), and the speed of the actual device doing readahead?
It doesn't make sense to do 32MB of readahead on a USB 1.1 thumb drive
or even a CDROM. But maybe it does for USB3 thumb drives?
John
On Thu, Jul 3, 2014 at 12:43 PM, John Stoffel <[email protected]> wrote:
>
> This is one of those perennial questions of how to tune this. I agree
> we should increase the number, but shouldn't it be based on both the
> amount of memory in the machine, the number of devices (or is it all
> just one big pool?), and the speed of the actual device doing readahead?
Sure. But I don't trust the throughput data for the backing device at
all, especially early at boot. We're supposed to work it out for
writeback over time (based on device contention etc), but I never saw
that working, and for reading I don't think we have even any code to
do so.
And trying to be clever and basing the read-ahead size on the node
memory size was what caused problems to begin with (with memory-less
nodes) that then made us just hardcode the maximum.
So there are certainly better options - in theory. In practice, I
think we don't really care enough, and the better options are
questionably implementable.
I _suspect_ the right number is in that 2-8MB range, and I would
prefer to keep it at the low end at least until somebody really has
numbers (and preferably from different real-life situations).
I also suspect that read-ahead is less of an issue with non-rotational
storage in general, since the only real reason for it tends to be
latency reduction (particularly the "readahead()" kind of big-hammer
thing that is really just useful for priming caches). So there's some
argument to say that it's getting less and less important.
Linus
On Thu, Jul 03, 2014 at 02:58:54PM -0700, Linus Torvalds wrote:
> On Thu, Jul 3, 2014 at 12:43 PM, John Stoffel <[email protected]> wrote:
> >
> > This is one of those perennial questions of how to tune this. I agree
> > we should increase the number, but shouldn't it be based on both the
> > amount of memory in the machine, the number of devices (or is it all
> > just one big pool?), and the speed of the actual device doing readahead?
>
> Sure. But I don't trust the throughput data for the backing device at
> all, especially early at boot. We're supposed to work it out for
> writeback over time (based on device contention etc), but I never saw
> that working, and for reading I don't think we have even any code to
> do so.
>
> And trying to be clever and basing the read-ahead size on the node
> memory size was what caused problems to begin with (with memory-less
> nodes) that then made us just hardcode the maximum.
>
> So there are certainly better options - in theory. In practice, I
> think we don't really care enough, and the better options are
> questionably implementable.
>
> I _suspect_ the right number is in that 2-8MB range, and I would
> prefer to keep it at the low end at least until somebody really has
> numbers (and preferably from different real-life situations).
>
> I also suspect that read-ahead is less of an issue with non-rotational
> storage in general, since the only real reason for it tends to be
> latency reduction (particularly the "readahead()" kind of big-hammer
> thing that is really just useful for priming caches). So there's some
> argument to say that it's getting less and less important.
>
I believe you're right, but we still sort of broke the expectation for
deliberately issued readaheads, and that is what I believe that fellow
was complaining about (poorly worded) in the bugzilla ticket. We recently
got the following report: https://bugzilla.redhat.com/show_bug.cgi?id=1103240
which is pretty much the same thing reported in the kernel's BZ. I did
some empirical tests with iozone (forcing madv_willneed behaviour), and I
double-checked the numbers our performance team got while running their
regression tests, and I honestly couldn't see any change, for better or
worse, that could be directly related to the change in question.
Other than setting a hard ceiling of 2MB for any issued readahead, which
might be seen as trouble for certain users, there seems to be no other
measurable loss here. OTOH, the tangible gain after the change is having
readahead work for NUMA layouts where some CPUs are within a memoryless
node.
I believe we could take David Rientjes' early suggestion and, instead of
fixing a hard limit in max_sane_readahead(), change it to replace the
numa_node_id() call with numa_mem_id() and follow up on the
CONFIG_HAVE_MEMORYLESS_NODES requirements on PPC to have it working
properly (which seems to be the reason that approach was left aside).
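Roughly, an untested sketch of that alternative (leaving aside the
CONFIG_HAVE_MEMORYLESS_NODES plumbing it depends on):

unsigned long max_sane_readahead(unsigned long nr)
{
	/* numa_mem_id() returns the nearest node that actually has
	 * memory, so memoryless nodes no longer compute a zero limit. */
	int nid = numa_mem_id();
	unsigned long free_pages = node_page_state(nid, NR_INACTIVE_FILE)
				 + node_page_state(nid, NR_FREE_PAGES);

	return min(nr, free_pages / 2);
}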
Best regards,
-- Rafael