Hi,
CCing lkml on this,
* Yannick Brosseau ([email protected]) wrote:
> Hi all,
>
> We discovered a performance regression in recent kernels with LTTng
> related to the use of fadvise DONTNEED.
> A call to this syscall is present in the LTTng consumer.
>
> The following kernel commit causes the call to fadvise to sometimes be
> much slower.
>
> Kernel commit info:
> mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> all pages
> main tree: (since 3.9-rc1)
> commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> stable tree: (since 3.8.1)
> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
>
> On the test workload, we observe that the call to fadvise takes about
> 4-5 us before this patch is applied. After applying the patch, the
> syscall sometimes takes anywhere from 5 us up to 4 ms (4000 us). The
> effect on LTTng is that the consumer is frozen for this long period,
> which leads to dropped events in the trace.
We use POSIX_FADV_DONTNEED in LTTng so the kernel knows it's not useful
to keep the trace data around after it is flushed to disk. From what I
gather from the commit changelog, it seems that the POSIX_FADV_DONTNEED
operation now touches kernel data structures shared amongst processors,
which have much higher contention/overhead than previously.
How does your page cache memory usage behave before/after this kernel
commit?
Also, can you try instrumenting the "count", "start_index" and
"end_index" values within fadvise64_64 with commit
67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 applied and log this through
LTTng? This will tell us whether the lru_add_drain_all() hit is taken
for a good reason, or due to an unforeseen off-by-one type of issue in
the new test:
if (count < (end_index - start_index + 1)) {
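Something like the following would be enough (a rough sketch only; I use
trace_printk() here for brevity rather than an actual LTTng tracepoint,
and the exact placement is an assumption):

/* mm/fadvise.c, POSIX_FADV_DONTNEED branch (sketch, not a tested patch) */
if (end_index >= start_index) {
        unsigned long count = invalidate_mapping_pages(mapping,
                                start_index, end_index);

        /* Log the values of interest before the drain decision. */
        trace_printk("DONTNEED count=%lu start_index=%lu end_index=%lu\n",
                     count, (unsigned long)start_index,
                     (unsigned long)end_index);

        if (count < (end_index - start_index + 1)) {
                lru_add_drain_all();
                invalidate_mapping_pages(mapping, start_index,
                                         end_index);
        }
}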
Thanks,
Mathieu
>
> If we remove the call to fadvise in src/common/consumer.c, we don't
> have any dropped events and we don't observe any bad side effects.
> (The added latency seems to come from the new call to
> lru_add_drain_all(). We removed this line and the performance went back
> to normal.)
>
> It's obviously a problem in the kernel, but since it impacts LTTng, we
> wanted to report it here first and ask for advice on what the next step
> should be to solve this problem.
>
> If you want to see for yourself, you can find the trace with the long
> call to fadvise here:
> http://www.dorsal.polymtl.ca/~rbeamonte/3.8.0~autocreated-4469887.tar.gz
>
> Yannick and Raphael
>
> _______________________________________________
> lttng-dev mailing list
> [email protected]
> http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <[email protected]> wrote:
> Hi,
>
> CCing lkml on this,
>
> * Yannick Brosseau ([email protected]) wrote:
> > Hi all,
> >
> > We discovered a performance regression in recent kernels with LTTng
> > related to the use of fadvise DONTNEED.
> > A call to this syscall is present in the LTTng consumer.
> >
> > The following kernel commit cause the call to fadvise to be sometime
> > really slower.
> >
> > Kernel commit info:
> > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> > all pages
> > main tree: (since 3.9-rc1)
> > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > stable tree: (since 3.8.1)
> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> >
> > On the workload test, we observe that the call to fadvise takes about
> > 4-5 us before this patch is applied. After applying the patch, The
> > syscall now takes values from 5 us up to 4 ms (4000 us) sometime. The
> > effect on lttng is that the consumer is frozen for this long period
> > which leads to dropped event in the trace.
That change wasn't terribly efficient - if there are any unpopulated
pages in the range (which is quite likely), fadvise() will now always
call invalidate_mapping_pages() a second time.
Perhaps this is fixable. Say, make lru_add_drain_all() return a
success code, or even teach lru_add_drain_all() to return a code
indicating that one of the spilled pages was (or might have been) on a
particular mapping.
But I don't see why that would cause fadvise(POSIX_FADV_DONTNEED) to
sometimes take four milliseconds(!). Is it possible that a context
switch is now occurring, so the fadvise()-calling task sometimes spends
a few milliseconds asleep?
> We use POSIX_FADV_DONTNEED in LTTng so the kernel know it's not useful
> to keep the trace data around after it is flushed to disk. From what I
> gather from the commit changelog, it seems that the POSIX_FADV_DONTNEED
> operation now touches kernel data structures shared amongst processors
> that have much higher contention/overhead than previously.
>
> How does your page cache memory usage behave prior/after this kernel
> commit ?
>
> Also, can you try instrumenting the "count", "start_index" and
> "end_index" values within fadvise64_64 with commit
> 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732 applied and log this though
> LTTng ? This will tell us whether the lru_add_drain_all() hit is taken
> for a good reason, or due to an unforeseen off-by-one type of issue in
> the new test:
>
> if (count < (end_index - start_index + 1)) {
>
On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <[email protected]> wrote:
>
> > Hi,
> >
> > CCing lkml on this,
> >
> > * Yannick Brosseau ([email protected]) wrote:
> > > Hi all,
> > >
> > > We discovered a performance regression in recent kernels with LTTng
> > > related to the use of fadvise DONTNEED.
> > > A call to this syscall is present in the LTTng consumer.
> > >
> > > The following kernel commit cause the call to fadvise to be sometime
> > > really slower.
> > >
> > > Kernel commit info:
> > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> > > all pages
> > > main tree: (since 3.9-rc1)
> > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > stable tree: (since 3.8.1)
> > > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > >
> > > On the workload test, we observe that the call to fadvise takes about
> > > 4-5 us before this patch is applied. After applying the patch, The
> > > syscall now takes values from 5 us up to 4 ms (4000 us) sometime. The
> > > effect on lttng is that the consumer is frozen for this long period
> > > which leads to dropped event in the trace.
>
> That change wasn't terribly efficient - if there are any unpopulated
> pages in the range (which is quite likely), fadvise() will now always
> call invalidate_mapping_pages() a second time.
>
I did not view the path as being performance critical and did not
anticipate a delay that severe. The original test case was also based on
sequential IO as part of a backup, so I was generally expecting the
range to be populated. I looked at the other users of
lru_add_drain_all(), but there are fairly few. Compaction uses it, but
only when triggered via sysfs or proc. KSM uses it, but it's not likely
to be that noticeable. mlock uses it, but it's unlikely to be called
frequently, so I'm not going to worry about the performance of
lru_add_drain_all() in general. I'll look closer at properly detecting
when it's necessary to call it in the fadvise case.
--
Mel Gorman
SUSE Labs
On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:
> On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > CCing lkml on this,
> > >
> > > * Yannick Brosseau ([email protected]) wrote:
> > > > Hi all,
> > > >
> > > > We discovered a performance regression in recent kernels with LTTng
> > > > related to the use of fadvise DONTNEED.
> > > > A call to this syscall is present in the LTTng consumer.
> > > >
> > > > The following kernel commit cause the call to fadvise to be sometime
> > > > really slower.
> > > >
> > > > Kernel commit info:
> > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> > > > all pages
> > > > main tree: (since 3.9-rc1)
> > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > > stable tree: (since 3.8.1)
> > > > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > > >
> > > > On the workload test, we observe that the call to fadvise takes about
> > > > 4-5 us before this patch is applied. After applying the patch, The
> > > > syscall now takes values from 5 us up to 4 ms (4000 us) sometime. The
> > > > effect on lttng is that the consumer is frozen for this long period
> > > > which leads to dropped event in the trace.
> >
> > That change wasn't terribly efficient - if there are any unpopulated
> > pages in the range (which is quite likely), fadvise() will now always
> > call invalidate_mapping_pages() a second time.
> >
>
> I did not view the path as being performance critical and did not anticipate
> a delay that severe.
Which I should have anticipated; schedule_on_each_cpu() is documented to be slow.
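For reference, the drain path in this kernel is roughly the following
(paraphrased from mm/swap.c, so treat it as a sketch rather than a
verbatim quote):

/* Paraphrased from mm/swap.c around 3.9 -- sketch, not a verbatim copy. */
static void lru_add_drain_per_cpu(struct work_struct *dummy)
{
        lru_add_drain();        /* drain this CPU's pagevecs */
}

/*
 * schedule_on_each_cpu() queues a work item on every online CPU and
 * waits for all of them to finish, which is why a single
 * fadvise(POSIX_FADV_DONTNEED) call can block for milliseconds when
 * any CPU is busy.
 */
int lru_add_drain_all(void)
{
        return schedule_on_each_cpu(lru_add_drain_per_cpu);
}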
> The original test case as well was based on
> sequential IO as part of a backup so I was also generally expecting the
> range to be populated. I looked at the other users of lru_add_drain_all()
> but there are fairly few. compaction uses them but only when used via sysfs
> or proc. ksm uses it but it's not likely to be that noticable. mlock uses
> it but it's unlikely it is being called frequently so I'm not going to
> worry performance of lru_add_drain_all() in general. I'll look closer at
> properly detecting when it's necessarily to call in the fadvise case.
>
This compile-only tested prototype should detect remaining pages in the
range that were not invalidated. This will at least detect unpopulated
pages, but whether it has any impact depends on what LTTng is
invalidating. If it's invalidating a range of per-cpu traces then I
doubt this will work, because there will always be remaining pages.
Similarly, I wonder if passing down the mapping will really help if a
large number of CPUs are tracing, as we end up scheduling work on every
CPU regardless.
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2c28271..e2bb47e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *, int);
#endif
unsigned long invalidate_mapping_pages(struct address_space *mapping,
pgoff_t start, pgoff_t end);
+unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
+ pgoff_t start, pgoff_t end);
static inline void invalidate_remote_inode(struct inode *inode)
{
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 7e09268..0579e60 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
end_index = (endbyte >> PAGE_CACHE_SHIFT);
if (end_index >= start_index) {
- unsigned long count = invalidate_mapping_pages(mapping,
+ unsigned long nr_remaining;
+
+ nr_remaining = invalidate_mapping_pages_check(mapping,
start_index, end_index);
/*
@@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
* a per-cpu pagevec for a remote CPU. Drain all
* pagevecs and try again.
*/
- if (count < (end_index - start_index + 1)) {
+ if (nr_remaining) {
lru_add_drain_all();
invalidate_mapping_pages(mapping, start_index,
end_index);
diff --git a/mm/truncate.c b/mm/truncate.c
index c75b736..86cfc2e 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
}
EXPORT_SYMBOL(truncate_inode_pages);
-/**
- * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
- * @start: the offset 'from' which to invalidate
- * @end: the offset 'to' which to invalidate (inclusive)
- *
- * This function only removes the unlocked pages, if you want to
- * remove all the pages of one inode, you must call truncate_inode_pages.
- *
- * invalidate_mapping_pages() will not block on IO activity. It will not
- * invalidate pages which are dirty, locked, under writeback or mapped into
- * pagetables.
- */
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
- pgoff_t start, pgoff_t end)
+static void __invalidate_mapping_pages(struct address_space *mapping,
+ pgoff_t start, pgoff_t end,
+ unsigned long *ret_nr_invalidated,
+ unsigned long *ret_nr_remaining)
{
struct pagevec pvec;
pgoff_t index = start;
unsigned long ret;
- unsigned long count = 0;
+ unsigned long nr_invalidated = 0;
+ unsigned long nr_remaining = 0;
int i;
/*
@@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (index > end)
break;
- if (!trylock_page(page))
+ if (!trylock_page(page)) {
+ nr_remaining++;
continue;
+ }
WARN_ON(page->index != index);
ret = invalidate_inode_page(page);
unlock_page(page);
@@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
*/
if (!ret)
deactivate_page(page);
- count += ret;
+ else
+ nr_remaining++;
+ nr_invalidated += ret;
}
pagevec_release(&pvec);
mem_cgroup_uncharge_end();
cond_resched();
index++;
}
- return count;
+
+ *ret_nr_invalidated = nr_invalidated;
+ *ret_nr_remaining = nr_remaining;
}
EXPORT_SYMBOL(invalidate_mapping_pages);
+/**
+ * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
+ * @mapping: the address_space which holds the pages to invalidate
+ * @start: the offset 'from' which to invalidate
+ * @end: the offset 'to' which to invalidate (inclusive)
+ *
+ * This function only removes the unlocked pages, if you want to
+ * remove all the pages of one inode, you must call truncate_inode_pages.
+ *
+ * invalidate_mapping_pages() will not block on IO activity. It will not
+ * invalidate pages which are dirty, locked, under writeback or mapped into
+ * pagetables.
+ *
+ * Returns the number of pages invalidated
+ */
+unsigned long invalidate_mapping_pages(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ unsigned long nr_invalidated, nr_remaining;
+
+ __invalidate_mapping_pages(mapping, start, end,
+ &nr_invalidated, &nr_remaining);
+
+ return nr_invalidated;
+}
+
+/**
+ * invalidate_mapping_pages_check - Invalidate all the unlocked pages of one inode and check for remaining pages.
+ * @mapping: the address_space which holds the pages to invalidate
+ * @start: the offset 'from' which to invalidate
+ * @end: the offset 'to' which to invalidate (inclusive)
+ *
+ * This function only removes the unlocked pages, if you want to
+ * remove all the pages of one inode, you must call truncate_inode_pages.
+ *
+ * invalidate_mapping_pages() will not block on IO activity. It will not
+ * invalidate pages which are dirty, locked, under writeback or mapped into
+ * pagetables.
+ *
+ * Returns the number of pages remaining in the invalidated range
+ */
+unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
+ pgoff_t start, pgoff_t end)
+{
+ unsigned long nr_invalidated, nr_remaining;
+
+ __invalidate_mapping_pages(mapping, start, end,
+ &nr_invalidated, &nr_remaining);
+
+ return nr_remaining;
+}
+
/*
* This is like invalidate_complete_page(), except it ignores the page's
* refcount. We do this because invalidate_inode_pages2() needs stronger
* Mel Gorman ([email protected]) wrote:
> On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:
> > On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> > > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > CCing lkml on this,
> > > >
> > > > * Yannick Brosseau ([email protected]) wrote:
> > > > > Hi all,
> > > > >
> > > > > We discovered a performance regression in recent kernels with LTTng
> > > > > related to the use of fadvise DONTNEED.
> > > > > A call to this syscall is present in the LTTng consumer.
> > > > >
> > > > > The following kernel commit cause the call to fadvise to be sometime
> > > > > really slower.
> > > > >
> > > > > Kernel commit info:
> > > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to discard
> > > > > all pages
> > > > > main tree: (since 3.9-rc1)
> > > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > > > stable tree: (since 3.8.1)
> > > > > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > > > >
> > > > > On the workload test, we observe that the call to fadvise takes about
> > > > > 4-5 us before this patch is applied. After applying the patch, The
> > > > > syscall now takes values from 5 us up to 4 ms (4000 us) sometime. The
> > > > > effect on lttng is that the consumer is frozen for this long period
> > > > > which leads to dropped event in the trace.
> > >
> > > That change wasn't terribly efficient - if there are any unpopulated
> > > pages in the range (which is quite likely), fadvise() will now always
> > > call invalidate_mapping_pages() a second time.
> > >
> >
> > I did not view the path as being performance critical and did not anticipate
> > a delay that severe.
>
> Which I should have, schedule_on_each_cpu is documented to be slow.
>
> > The original test case as well was based on
> > sequential IO as part of a backup so I was also generally expecting the
> > range to be populated. I looked at the other users of lru_add_drain_all()
> > but there are fairly few. compaction uses them but only when used via sysfs
> > or proc. ksm uses it but it's not likely to be that noticable. mlock uses
> > it but it's unlikely it is being called frequently so I'm not going to
> > worry performance of lru_add_drain_all() in general. I'll look closer at
> > properly detecting when it's necessarily to call in the fadvise case.
> >
>
> This compile-only tested prototype should detect remaining pages in the rage
> that were not invalidated. This will at least detect unpopulated pages but
> whether it has any impact depends on what lttng is invalidating. If it's
> invalidating a range of per-cpu traces then I doubt this will work because
> there will always be remaining pages. Similarly I wonder if passing down
> the mapping will really help if a large number of cpus are tracing as we
> end up scheduling work on every CPU regardless.
Here is how LTTng consumerd is using POSIX_FADV_DONTNEED. Please let me
know if we are doing something stupid. ;-)
The LTTng consumerd deals with trace packets, which consist of a set of
pages. I'll just discuss how the pages are moved around, leaving out the
discussion of synchronization between producer and consumer.
Whenever the Nth trace packet is ready to be written to disk:
1) splice the entire set of pages of trace packet N to disk through a
pipe,
2) sync_file_range on trace packet N-1 with the following flags:
SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER
so we wait (blocking) on the _previous_ trace packet to be completely
written to disk.
3) posix_fadvise POSIX_FADV_DONTNEED on trace packet N-1.
There are a couple of comments in my consumerd code explaining why we do
this sequence of steps:
/*
* Give hints to the kernel about how we access the file:
* POSIX_FADV_DONTNEED : we won't re-access data in a near future after
* we write it.
*
* We need to call fadvise again after the file grows because the
* kernel does not seem to apply fadvise to non-existing parts of the
* file.
*
* Call fadvise _after_ having waited for the page writeback to
* complete because the dirty page writeback semantic is not well
* defined. So it can be expected to lead to lower throughput in
* streaming.
*/
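For illustration, the sequence of steps above boils down to something
like the following (a simplified sketch, not the actual lttng-consumerd
code; descriptors, the 1MB packet size and error handling are assumed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define PACKET_SIZE     (1024 * 1024)   /* default trace packet size */

/* Write trace packet N from the splice pipe, then clean up packet N-1. */
static int write_packet(int pipe_rd, int out_fd, loff_t packet_n)
{
        loff_t off = packet_n * PACKET_SIZE;
        loff_t prev = off - PACKET_SIZE;
        size_t left = PACKET_SIZE;
        ssize_t ret;

        /* 1) splice the set of pages of packet N to disk through a pipe. */
        while (left) {
                ret = splice(pipe_rd, NULL, out_fd, &off, left,
                             SPLICE_F_MOVE | SPLICE_F_MORE);
                if (ret <= 0)
                        return -1;
                left -= ret;
        }

        if (packet_n == 0)
                return 0;

        /* 2) wait (blocking) for packet N-1 to be completely written out. */
        if (sync_file_range(out_fd, prev, PACKET_SIZE,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0)
                return -1;

        /* 3) hint that packet N-1 won't be re-accessed in the near future. */
        return posix_fadvise(out_fd, prev, PACKET_SIZE, POSIX_FADV_DONTNEED);
}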
Currently, the lttng-consumerd is single-threaded, but we plan to
re-introduce multi-threading, and per-cpu affinity, in the near future.
Yannick will try your patch tomorrow and report whether it improves
performance or not,
Thanks!
Mathieu
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 2c28271..e2bb47e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *, int);
> #endif
> unsigned long invalidate_mapping_pages(struct address_space *mapping,
> pgoff_t start, pgoff_t end);
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> + pgoff_t start, pgoff_t end);
>
> static inline void invalidate_remote_inode(struct inode *inode)
> {
> diff --git a/mm/fadvise.c b/mm/fadvise.c
> index 7e09268..0579e60 100644
> --- a/mm/fadvise.c
> +++ b/mm/fadvise.c
> @@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> end_index = (endbyte >> PAGE_CACHE_SHIFT);
>
> if (end_index >= start_index) {
> - unsigned long count = invalidate_mapping_pages(mapping,
> + unsigned long nr_remaining;
> +
> + nr_remaining = invalidate_mapping_pages_check(mapping,
> start_index, end_index);
>
> /*
> @@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
> * a per-cpu pagevec for a remote CPU. Drain all
> * pagevecs and try again.
> */
> - if (count < (end_index - start_index + 1)) {
> + if (nr_remaining) {
> lru_add_drain_all();
> invalidate_mapping_pages(mapping, start_index,
> end_index);
> diff --git a/mm/truncate.c b/mm/truncate.c
> index c75b736..86cfc2e 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
> }
> EXPORT_SYMBOL(truncate_inode_pages);
>
> -/**
> - * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> - * @mapping: the address_space which holds the pages to invalidate
> - * @start: the offset 'from' which to invalidate
> - * @end: the offset 'to' which to invalidate (inclusive)
> - *
> - * This function only removes the unlocked pages, if you want to
> - * remove all the pages of one inode, you must call truncate_inode_pages.
> - *
> - * invalidate_mapping_pages() will not block on IO activity. It will not
> - * invalidate pages which are dirty, locked, under writeback or mapped into
> - * pagetables.
> - */
> -unsigned long invalidate_mapping_pages(struct address_space *mapping,
> - pgoff_t start, pgoff_t end)
> +static void __invalidate_mapping_pages(struct address_space *mapping,
> + pgoff_t start, pgoff_t end,
> + unsigned long *ret_nr_invalidated,
> + unsigned long *ret_nr_remaining)
> {
> struct pagevec pvec;
> pgoff_t index = start;
> unsigned long ret;
> - unsigned long count = 0;
> + unsigned long nr_invalidated = 0;
> + unsigned long nr_remaining = 0;
> int i;
>
> /*
> @@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> if (index > end)
> break;
>
> - if (!trylock_page(page))
> + if (!trylock_page(page)) {
> + nr_remaining++;
> continue;
> + }
> WARN_ON(page->index != index);
> ret = invalidate_inode_page(page);
> unlock_page(page);
> @@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
> */
> if (!ret)
> deactivate_page(page);
> - count += ret;
> + else
> + nr_remaining++;
> + nr_invalidated += ret;
> }
> pagevec_release(&pvec);
> mem_cgroup_uncharge_end();
> cond_resched();
> index++;
> }
> - return count;
> +
> + *ret_nr_invalidated = nr_invalidated;
> + *ret_nr_remaining = nr_remaining;
> }
> EXPORT_SYMBOL(invalidate_mapping_pages);
>
> +/**
> + * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
> + * @mapping: the address_space which holds the pages to invalidate
> + * @start: the offset 'from' which to invalidate
> + * @end: the offset 'to' which to invalidate (inclusive)
> + *
> + * This function only removes the unlocked pages, if you want to
> + * remove all the pages of one inode, you must call truncate_inode_pages.
> + *
> + * invalidate_mapping_pages() will not block on IO activity. It will not
> + * invalidate pages which are dirty, locked, under writeback or mapped into
> + * pagetables.
> + *
> + * Returns the number of pages invalidated
> + */
> +unsigned long invalidate_mapping_pages(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + unsigned long nr_invalidated, nr_remaining;
> +
> + __invalidate_mapping_pages(mapping, start, end,
> + &nr_invalidated, &nr_remaining);
> +
> + return nr_invalidated;
> +}
> +
> +/**
> + * invalidate_mapping_pages_check - Invalidate all the unlocked pages of one inode and check for remaining pages.
> + * @mapping: the address_space which holds the pages to invalidate
> + * @start: the offset 'from' which to invalidate
> + * @end: the offset 'to' which to invalidate (inclusive)
> + *
> + * This function only removes the unlocked pages, if you want to
> + * remove all the pages of one inode, you must call truncate_inode_pages.
> + *
> + * invalidate_mapping_pages() will not block on IO activity. It will not
> + * invalidate pages which are dirty, locked, under writeback or mapped into
> + * pagetables.
> + *
> + * Returns the number of pages remaining in the invalidated range
> + */
> +unsigned long invalidate_mapping_pages_check(struct address_space *mapping,
> + pgoff_t start, pgoff_t end)
> +{
> + unsigned long nr_invalidated, nr_remaining;
> +
> + __invalidate_mapping_pages(mapping, start, end,
> + &nr_invalidated, &nr_remaining);
> +
> + return nr_remaining;
> +}
> +
> /*
> * This is like invalidate_complete_page(), except it ignores the page's
> * refcount. We do this because invalidate_inode_pages2() needs stronger
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Rob van der Heij ([email protected]) wrote:
> Wouldn't you batch the calls to drop the pages from cache rather than drop
> one packet at a time?
By default for kernel tracing, LTTng's trace packets are 1MB, so I
consider the call to fadvise to be already batched by applying it to 1MB
packets rather than individual pages. Even there, it seems that the
extra overhead added by the lru drain on each CPU is noticeable.
Another reason for not batching this in larger chunks is to limit the
impact of the tracer on the kernel page cache. LTTng limits itself to
its own set of buffers, and uses the page cache only for what is
absolutely needed to perform I/O, but no more.
> Your effort to help Linux mm seems a bit overkill,
Without performing this, I have a situation similar to yours, where
LTTng fills up the page cache very quickly, until it gets to a point
where the memory pressure level increases enough that the consumerd is
blocked until some pages are reclaimed. I really don't care about making
the consumerd "as fast as possible for a while" if it means its
throughput will drop when the page cache is filled. I prefer a constant
slower pace to a short burst followed by slower throughput.
> and you don't want every application to do it like that himself.
Indeed, tracing has always been slightly odd in the sense that it's not
the workload the system is meant to run, but rather a tool that should
have the smallest possible impact on the system's usual operation when
it is used.
> The
> fadvise will not even work when the page is still to be flushed out.
> Without the patch that started the thread, it would 'at random' not work
> due to SMP race condition (not multi-threading).
This is why the lttng consumerd calls:
sync_file_range with flags:
SYNC_FILE_RANGE_WAIT_BEFORE
SYNC_FILE_RANGE_WRITE
SYNC_FILE_RANGE_WAIT_AFTER
on the page range. The purpose of this call is to flush the pages to
disk before calling fadvise(POSIX_FADV_DONTNEED) on the page range.
Hopefully we can manage to make the SMP race scenario work with the
semantics you want without requiring the case where the pages are on the
local CPU, and already flushed to disk, to schedule work (and wait for
it) on each CPU. We can expect that to perform very badly on 64+ core
systems.
As far as LTTng is concerned, if every call to
fadvise(POSIX_FADV_DONTNEED) involves synchronization across all CPUs, I
will reconsider using this hint, because the performance hit would start
to be worse than the benefit of the hint given to the kernel.
Your idea of batching was interesting though. Perhaps it would be good
to let the kernel batch the messaging to remote CPUs within fadvise(),
and perform it asynchronously. That way, we could return to the
application quickly without waiting for page invalidation to be
completed. After all, this is just a hint, and AFAIU there is no need to
guarantee that all those pages are invalidated before fadvise() returns
to user-space.
Thoughts?
Thanks,
Mathieu
>
> I believe the right way would be for Linux to implement the fadvise flags
> properly to guide cache replacement.
>
> My situation was slightly different. I run in a virtualized environment
> where it would be a global improvement for the Linux guest not to cache
> data, even though from Linux mm perspective there is plenty of space and no
> good reason not to keep the data. My scenario intercepted the close() call
> to drop each input file during a tar or file system scan. This avoids
> building a huge page cache with stuff that will not be referenced until
> next backup and thus gets paged out by the hypervisor. There's some more
> here:
> http://zvmperf.wordpress.com/2013/01/27/taming-linux-page-cache/
>
> Rob
>
>
> On 19 June 2013 21:25, Mathieu Desnoyers <[email protected]>wrote:
>
> > * Mel Gorman ([email protected]) wrote:
> > > On Tue, Jun 18, 2013 at 10:29:26AM +0100, Mel Gorman wrote:
> > > > On Mon, Jun 17, 2013 at 02:24:59PM -0700, Andrew Morton wrote:
> > > > > On Mon, 17 Jun 2013 10:13:57 -0400 Mathieu Desnoyers <
> > [email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > CCing lkml on this,
> > > > > >
> > > > > > * Yannick Brosseau ([email protected]) wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We discovered a performance regression in recent kernels with
> > LTTng
> > > > > > > related to the use of fadvise DONTNEED.
> > > > > > > A call to this syscall is present in the LTTng consumer.
> > > > > > >
> > > > > > > The following kernel commit cause the call to fadvise to be
> > sometime
> > > > > > > really slower.
> > > > > > >
> > > > > > > Kernel commit info:
> > > > > > > mm/fadvise.c: drain all pagevecs if POSIX_FADV_DONTNEED fails to
> > discard
> > > > > > > all pages
> > > > > > > main tree: (since 3.9-rc1)
> > > > > > > commit 67d46b296a1ba1477c0df8ff3bc5e0167a0b0732
> > > > > > > stable tree: (since 3.8.1)
> > > > > > >
> > https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit?id=bb01afe62feca1e7cdca60696f8b074416b0910d
> > > > > > >
> > > > > > > On the workload test, we observe that the call to fadvise takes
> > about
> > > > > > > 4-5 us before this patch is applied. After applying the patch,
> > The
> > > > > > > syscall now takes values from 5 us up to 4 ms (4000 us)
> > sometime. The
> > > > > > > effect on lttng is that the consumer is frozen for this long
> > period
> > > > > > > which leads to dropped event in the trace.
> > > > >
> > > > > That change wasn't terribly efficient - if there are any unpopulated
> > > > > pages in the range (which is quite likely), fadvise() will now always
> > > > > call invalidate_mapping_pages() a second time.
> > > > >
> > > >
> > > > I did not view the path as being performance critical and did not
> > anticipate
> > > > a delay that severe.
> > >
> > > Which I should have, schedule_on_each_cpu is documented to be slow.
> > >
> > > > The original test case as well was based on
> > > > sequential IO as part of a backup so I was also generally expecting the
> > > > range to be populated. I looked at the other users of
> > lru_add_drain_all()
> > > > but there are fairly few. compaction uses them but only when used via
> > sysfs
> > > > or proc. ksm uses it but it's not likely to be that noticable. mlock
> > uses
> > > > it but it's unlikely it is being called frequently so I'm not going to
> > > > worry performance of lru_add_drain_all() in general. I'll look closer
> > at
> > > > properly detecting when it's necessarily to call in the fadvise case.
> > > >
> > >
> > > This compile-only tested prototype should detect remaining pages in the
> > rage
> > > that were not invalidated. This will at least detect unpopulated pages
> > but
> > > whether it has any impact depends on what lttng is invalidating. If it's
> > > invalidating a range of per-cpu traces then I doubt this will work
> > because
> > > there will always be remaining pages. Similarly I wonder if passing down
> > > the mapping will really help if a large number of cpus are tracing as we
> > > end up scheduling work on every CPU regardless.
> >
> > Here is how LTTng consumerd is using POSIX_FADV_DONTNEED. Please let me
> > know if we are doing something stupid. ;-)
> >
> > The LTTng consumerd deals with trace packets, which consists of a set of
> > pages. I'll just discuss how the pages are moved around, leaving out
> > discussion about synchronization between producer/consumer.
> >
> > Whenever the Nth trace packet is ready to be written to disk:
> >
> > 1) splice the entire set of pages of trace packet N to disk through a
> > pipe,
> > 2) sync_file_range on trace packet N-1 with the following flags:
> > SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
> > SYNC_FILE_RANGE_WAIT_AFTER
> > so we wait (blocking) on the _previous_ trace packet to be completely
> > written to disk.
> > 3) posix_fadvise POSIX_FADV_DONTNEED on trace packet N-1.
> >
> > There are a couple of comments in my consumerd code explaining why we do
> > this sequence of steps:
> >
> > /*
> > * Give hints to the kernel about how we access the file:
> > * POSIX_FADV_DONTNEED : we won't re-access data in a near future
> > after
> > * we write it.
> > *
> > * We need to call fadvise again after the file grows because the
> > * kernel does not seem to apply fadvise to non-existing parts of
> > the
> > * file.
> > *
> > * Call fadvise _after_ having waited for the page writeback to
> > * complete because the dirty page writeback semantic is not well
> > * defined. So it can be expected to lead to lower throughput in
> > * streaming.
> > */
> >
> > Currently, the lttng-consumerd is single-threaded, but we plan to
> > re-introduce multi-threading, and per-cpu affinity, in a near future.
> >
> > Yannick will try your patch tomorrow and report whether it improves
> > performance or not,
> >
> > Thanks!
> >
> > Mathieu
> >
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 2c28271..e2bb47e 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -2176,6 +2176,8 @@ extern int invalidate_partition(struct gendisk *,
> > int);
> > > #endif
> > > unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > pgoff_t start, pgoff_t end);
> > > +unsigned long invalidate_mapping_pages_check(struct address_space
> > *mapping,
> > > + pgoff_t start, pgoff_t end);
> > >
> > > static inline void invalidate_remote_inode(struct inode *inode)
> > > {
> > > diff --git a/mm/fadvise.c b/mm/fadvise.c
> > > index 7e09268..0579e60 100644
> > > --- a/mm/fadvise.c
> > > +++ b/mm/fadvise.c
> > > @@ -122,7 +122,9 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
> > loff_t len, int advice)
> > > end_index = (endbyte >> PAGE_CACHE_SHIFT);
> > >
> > > if (end_index >= start_index) {
> > > - unsigned long count =
> > invalidate_mapping_pages(mapping,
> > > + unsigned long nr_remaining;
> > > +
> > > + nr_remaining =
> > invalidate_mapping_pages_check(mapping,
> > > start_index, end_index);
> > >
> > > /*
> > > @@ -131,7 +133,7 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset,
> > loff_t len, int advice)
> > > * a per-cpu pagevec for a remote CPU. Drain all
> > > * pagevecs and try again.
> > > */
> > > - if (count < (end_index - start_index + 1)) {
> > > + if (nr_remaining) {
> > > lru_add_drain_all();
> > > invalidate_mapping_pages(mapping,
> > start_index,
> > > end_index);
> > > diff --git a/mm/truncate.c b/mm/truncate.c
> > > index c75b736..86cfc2e 100644
> > > --- a/mm/truncate.c
> > > +++ b/mm/truncate.c
> > > @@ -312,26 +312,16 @@ void truncate_inode_pages(struct address_space
> > *mapping, loff_t lstart)
> > > }
> > > EXPORT_SYMBOL(truncate_inode_pages);
> > >
> > > -/**
> > > - * invalidate_mapping_pages - Invalidate all the unlocked pages of one
> > inode
> > > - * @mapping: the address_space which holds the pages to invalidate
> > > - * @start: the offset 'from' which to invalidate
> > > - * @end: the offset 'to' which to invalidate (inclusive)
> > > - *
> > > - * This function only removes the unlocked pages, if you want to
> > > - * remove all the pages of one inode, you must call
> > truncate_inode_pages.
> > > - *
> > > - * invalidate_mapping_pages() will not block on IO activity. It will not
> > > - * invalidate pages which are dirty, locked, under writeback or mapped
> > into
> > > - * pagetables.
> > > - */
> > > -unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > - pgoff_t start, pgoff_t end)
> > > +static void __invalidate_mapping_pages(struct address_space *mapping,
> > > + pgoff_t start, pgoff_t end,
> > > + unsigned long *ret_nr_invalidated,
> > > + unsigned long *ret_nr_remaining)
> > > {
> > > struct pagevec pvec;
> > > pgoff_t index = start;
> > > unsigned long ret;
> > > - unsigned long count = 0;
> > > + unsigned long nr_invalidated = 0;
> > > + unsigned long nr_remaining = 0;
> > > int i;
> > >
> > > /*
> > > @@ -354,8 +344,10 @@ unsigned long invalidate_mapping_pages(struct
> > address_space *mapping,
> > > if (index > end)
> > > break;
> > >
> > > - if (!trylock_page(page))
> > > + if (!trylock_page(page)) {
> > > + nr_remaining++;
> > > continue;
> > > + }
> > > WARN_ON(page->index != index);
> > > ret = invalidate_inode_page(page);
> > > unlock_page(page);
> > > @@ -365,17 +357,73 @@ unsigned long invalidate_mapping_pages(struct
> > address_space *mapping,
> > > */
> > > if (!ret)
> > > deactivate_page(page);
> > > - count += ret;
> > > + else
> > > + nr_remaining++;
> > > + nr_invalidated += ret;
> > > }
> > > pagevec_release(&pvec);
> > > mem_cgroup_uncharge_end();
> > > cond_resched();
> > > index++;
> > > }
> > > - return count;
> > > +
> > > + *ret_nr_invalidated = nr_invalidated;
> > > + *ret_nr_remaining = nr_remaining;
> > > }
> > > EXPORT_SYMBOL(invalidate_mapping_pages);
> > >
> > > +/**
> > > + * invalidate_mapping_pages - Invalidate all the unlocked pages of one
> > inode
> > > + * @mapping: the address_space which holds the pages to invalidate
> > > + * @start: the offset 'from' which to invalidate
> > > + * @end: the offset 'to' which to invalidate (inclusive)
> > > + *
> > > + * This function only removes the unlocked pages, if you want to
> > > + * remove all the pages of one inode, you must call
> > truncate_inode_pages.
> > > + *
> > > + * invalidate_mapping_pages() will not block on IO activity. It will not
> > > + * invalidate pages which are dirty, locked, under writeback or mapped
> > into
> > > + * pagetables.
> > > + *
> > > + * Returns the number of pages invalidated
> > > + */
> > > +unsigned long invalidate_mapping_pages(struct address_space *mapping,
> > > + pgoff_t start, pgoff_t end)
> > > +{
> > > + unsigned long nr_invalidated, nr_remaining;
> > > +
> > > + __invalidate_mapping_pages(mapping, start, end,
> > > + &nr_invalidated, &nr_remaining);
> > > +
> > > + return nr_invalidated;
> > > +}
> > > +
> > > +/**
> > > + * invalidate_mapping_pages_check - Invalidate all the unlocked pages
> > of one inode and check for remaining pages.
> > > + * @mapping: the address_space which holds the pages to invalidate
> > > + * @start: the offset 'from' which to invalidate
> > > + * @end: the offset 'to' which to invalidate (inclusive)
> > > + *
> > > + * This function only removes the unlocked pages, if you want to
> > > + * remove all the pages of one inode, you must call
> > truncate_inode_pages.
> > > + *
> > > + * invalidate_mapping_pages() will not block on IO activity. It will not
> > > + * invalidate pages which are dirty, locked, under writeback or mapped
> > into
> > > + * pagetables.
> > > + *
> > > + * Returns the number of pages remaining in the invalidated range
> > > + */
> > > +unsigned long invalidate_mapping_pages_check(struct address_space
> > *mapping,
> > > + pgoff_t start, pgoff_t end)
> > > +{
> > > + unsigned long nr_invalidated, nr_remaining;
> > > +
> > > + __invalidate_mapping_pages(mapping, start, end,
> > > + &nr_invalidated, &nr_remaining);
> > > +
> > > + return nr_remaining;
> > > +}
> > > +
> > > /*
> > > * This is like invalidate_complete_page(), except it ignores the page's
> > > * refcount. We do this because invalidate_inode_pages2() needs
> > stronger
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com
> >
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Thu, Jun 20, 2013 at 08:20:16AM -0400, Mathieu Desnoyers wrote:
> * Rob van der Heij ([email protected]) wrote:
> > Wouldn't you batch the calls to drop the pages from cache rather than drop
> > one packet at a time?
>
> By default for kernel tracing, lttng's trace packets are 1MB, so I
> consider the call to fadvise to be already batched by applying it to 1MB
> packets rather than indivitual pages. Even there, it seems that the
> extra overhead added by the lru drain on each CPU is noticeable.
>
> Another reason for not batching this in larger chunks is to limit the
> impact of the tracer on the kernel page cache. LTTng limits itself to
> its own set of buffers, and use the page cache for what is absolutely
> needed to perform I/O, but no more.
I think you are doing it wrong. This is a poster child case for
using Direct IO and completely avoiding the page cache altogether....
> > Your effort to help Linux mm seems a bit overkill,
>
> Without performing this, I have a situation similar as yours, where
> LTTng fills up the page cache very quickly, until it gets to a point
> where memory pressure level increase enough that the consumerd is
> blocked until some pages are reclaimed. I really don't care about making
> the consumerd "as fast as possible for a while" if it means its
> throughput will drop when the page cache is filled. I prefer a constant
> slower pace to a short burst followed by slower throughput.
>
> > and you don't want every application to do it like that himself.
>
> Indeed, tracing has always been slightly odd in the sense that it's not
> the workload the system is meant to run, but rather a tool that should
> have the smallest impact on the usual system's run when it is used.
>
> > The
> > fadvise will not even work when the page is still to be flushed out.
> > Without the patch that started the thread, it would 'at random' not work
> > due to SMP race condition (not multi-threading).
>
> This is why the lttng consumerd calls:
>
> sync_file_range with flags:
> SYNC_FILE_RANGE_WAIT_BEFORE
> SYNC_FILE_RANGE_WRITE
> SYNC_FILE_RANGE_WAIT_AFTER
>
> on the page range. The purpose of this call is to flush the pages to
> disk before calling fadvise(POSIX_FADV_DONTNEED) on the page range.
Yup, you're emulating direct IO semantics with buffered IO.
This seems to be an emerging trend I'm seeing a lot of over the past
few months - I'm hearing about it because of all the weird corner
case behaviours it causes because sync_file_range() doesn't provide
data integrity guarantees and fadvise(DONTNEED) can randomly issue
lots of IO, block for long periods of time, silently do nothing,
remove pages from the page cache and/or some or all of the above.
Direct IO is a model of sanity compared to that mess....
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Dave Chinner ([email protected]) wrote:
> On Thu, Jun 20, 2013 at 08:20:16AM -0400, Mathieu Desnoyers wrote:
> > * Rob van der Heij ([email protected]) wrote:
> > > Wouldn't you batch the calls to drop the pages from cache rather than drop
> > > one packet at a time?
> >
> > By default for kernel tracing, lttng's trace packets are 1MB, so I
> > consider the call to fadvise to be already batched by applying it to 1MB
> > packets rather than indivitual pages. Even there, it seems that the
> > extra overhead added by the lru drain on each CPU is noticeable.
> >
> > Another reason for not batching this in larger chunks is to limit the
> > impact of the tracer on the kernel page cache. LTTng limits itself to
> > its own set of buffers, and use the page cache for what is absolutely
> > needed to perform I/O, but no more.
>
> I think you are doing it wrong. This is a poster child case for
> using Direct IO and completely avoiding the page cache altogether....
I just tried replacing my sync_file_range()+fadvise() calls with passing
the O_DIRECT flag to open(). Unfortunately, I must be doing something
very wrong, because I get only 1/3rd of the throughput, and the page
cache fills up. Any idea why?
Here are my results:
heavy-syscall.c: 30M sigaction() syscalls with bad parameters (they
return immediately). Used as a high-throughput stress test for the
tracer. Tracing to disk with LTTng, all kernel tracepoints activated,
including system calls.
Tracer configuration: per-core buffers split into 4 sub-buffers of
262kB. splice() is used to transfer data from buffers to disk. Runs on
an 8-core Intel machine.
Writing to a software raid-1 ext3 partition.
ext3 mount options: rw,errors=remount-ro
* sync_file_range+fadvise 3.9.8
- with lru drain on fadvise
Kernel cache usage:
Before tracing: 56272k cached
After tracing: 56388k cached
939M /root/lttng-traces/auto-20130702-090430
time ./heavy-syscall
real 0m21.910s
throughput: 42MB/s
* sync_file_range+fadvise 3.9.8
- without lru drain on fadvise: manually reverted
Kernel cache usage:
Before tracing: 67968k cached
After tracing: 67984k cached
945M /root/lttng-traces/auto-20130702-092505
time ./heavy-syscall
real 0m21.872s
throughput: 43MB/s
* O_DIRECT 3.9.8
- O_DIRECT flag on open(), removed fadvise and sync_file_range calls
Kernel cache usage:
Before tracing: 99480k cached
After tracing: 360132k cached
258M /root/lttng-traces/auto-20130702-090603
time ./heavy-syscall
real 0m19.627s
throughput: 13MB/s
* No cache hints 3.9.8
- only removed fadvise and sync_file_range calls
Kernel cache usage:
Before tracing: 103556k cached
After tracing: 363712k cached
945M /root/lttng-traces/auto-20130702-092505
time ./heavy-syscall
real 0m19.672s
throughput: 48MB/s
Thoughts?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
* Mathieu Desnoyers ([email protected]) wrote:
> * Dave Chinner ([email protected]) wrote:
> > On Thu, Jun 20, 2013 at 08:20:16AM -0400, Mathieu Desnoyers wrote:
> > > * Rob van der Heij ([email protected]) wrote:
> > > > Wouldn't you batch the calls to drop the pages from cache rather than drop
> > > > one packet at a time?
> > >
> > > By default for kernel tracing, lttng's trace packets are 1MB, so I
> > > consider the call to fadvise to be already batched by applying it to 1MB
> > > packets rather than indivitual pages. Even there, it seems that the
> > > extra overhead added by the lru drain on each CPU is noticeable.
> > >
> > > Another reason for not batching this in larger chunks is to limit the
> > > impact of the tracer on the kernel page cache. LTTng limits itself to
> > > its own set of buffers, and use the page cache for what is absolutely
> > > needed to perform I/O, but no more.
> >
> > I think you are doing it wrong. This is a poster child case for
> > using Direct IO and completely avoiding the page cache altogether....
>
> I just tried replacing my sync_file_range()+fadvise() calls and instead
> pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> something very wrong, because I get only 1/3rd of the throughput, and
> the page cache fills up. Any idea why ?
Since O_DIRECT does not seem to provide acceptable throughput, it may be
interesting to investigate other ways to lessen the latency impact of
the fadvise DONTNEED hint.
Given it is just a hint, we should be allowed to perform page
deactivation lazily. Is there any fundamental reason to wait for worker
threads on each CPU to complete their lru drain before returning from
fadvise() to user-space?
Thanks,
Mathieu
>
> Here are my results:
>
> heavy-syscall.c: 30M sigaction() syscall with bad parameters (returns
> immediately). Used as high-throughput stress-test for the tracer.
> Tracing to disk with LTTng, all kernel tracepoints activated, including
> system calls.
>
> Tracer configuration: per-core buffers split into 4 sub-buffers of
> 262kB. splice() is used to transfer data from buffers to disk. Runs on a
> 8-core Intel machine.
>
> Writing to a software raid-1 ext3 partition.
> ext3 mount options: rw,errors=remount-ro
>
> * sync_file_range+fadvise 3.9.8
> - with lru drain on fadvise
>
> Kernel cache usage:
> Before tracing: 56272k cached
> After tracing: 56388k cached
>
> 939M /root/lttng-traces/auto-20130702-090430
> time ./heavy-syscall
> real 0m21.910s
> throughput: 42MB/s
>
>
> * sync_file_range+fadvise 3.9.8
> - without lru drain on fadvise: manually reverted
>
> Kernel cache usage:
> Before tracing: 67968k cached
> After tracing: 67984k cached
>
> 945M /root/lttng-traces/auto-20130702-092505
> time ./heavy-syscall
> real 0m21.872s
> throughput: 43MB/s
>
>
> * O_DIRECT 3.9.8
> - O_DIRECT flag on open(), removed fadvise and sync_file_range calls
>
> Kernel cache usage:
> Before tracing: 99480k cached
> After tracing: 360132k cached
>
> 258M /root/lttng-traces/auto-20130702-090603
> time ./heavy-syscall
> real 0m19.627s
> throughput: 13MB/s
>
>
> * No cache hints 3.9.8
> - only removed fadvise and sync_file_range calls
>
> Kernel cache usage:
> Before tracing: 103556k cached
> After tracing: 363712k cached
>
> 945M /root/lttng-traces/auto-20130702-092505
> time ./heavy-syscall
> real 0m19.672s
> throughput: 48MB/s
>
> Thoughts ?
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
On Tue, Jul 02, 2013 at 08:55:14PM -0400, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Dave Chinner ([email protected]) wrote:
> > > On Thu, Jun 20, 2013 at 08:20:16AM -0400, Mathieu Desnoyers wrote:
> > > > * Rob van der Heij ([email protected]) wrote:
> > > > > Wouldn't you batch the calls to drop the pages from cache rather than drop
> > > > > one packet at a time?
> > > >
> > > > By default for kernel tracing, lttng's trace packets are 1MB, so I
> > > > consider the call to fadvise to be already batched by applying it to 1MB
> > > > packets rather than indivitual pages. Even there, it seems that the
> > > > extra overhead added by the lru drain on each CPU is noticeable.
> > > >
> > > > Another reason for not batching this in larger chunks is to limit the
> > > > impact of the tracer on the kernel page cache. LTTng limits itself to
> > > > its own set of buffers, and use the page cache for what is absolutely
> > > > needed to perform I/O, but no more.
> > >
> > > I think you are doing it wrong. This is a poster child case for
> > > using Direct IO and completely avoiding the page cache altogether....
> >
> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> > something very wrong, because I get only 1/3rd of the throughput, and
> > the page cache fills up. Any idea why ?
>
> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> interesting to investigate other ways to lessen the latency impact of
> the fadvise DONTNEED hint.
>
There are cases where O_DIRECT falls back to buffered IO, which is why
you might have found that the page cache was still filling up. There are
a few reasons why this can happen, but I would guess the common cause is
that the range of pages being written was in the page cache already and
could not be invalidated for some reason. I'm guessing this is the
common case for page cache filling even with O_DIRECT, but would not bet
money on it as it's not a problem I have investigated before.
> Given it is just a hint, we should be allowed to perform page
> deactivation lazily. Is there any fundamental reason to wait for worker
> threads on each CPU to complete their lru drain before returning from
> fadvise() to user-space ?
>
Only to make sure the pages are actually dropped as requested. The reason
the wait was introduced in the first place was that page cache was filling
up even with the fadvise calls and causing disruption. In 3.11 disruption
due to this sort of parallel IO should be reduced but making fadvise work
properly is reasonable in itself. Was that patch I posted ever tested or
did I manage to miss it?
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
>> > I just tried replacing my sync_file_range()+fadvise() calls and instead
>> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
>> > something very wrong, because I get only 1/3rd of the throughput, and
>> > the page cache fills up. Any idea why ?
>>
>> Since O_DIRECT does not seem to provide acceptable throughput, it may be
>> interesting to investigate other ways to lessen the latency impact of
>> the fadvise DONTNEED hint.
>>
>
> There are cases where O_DIRECT falls back to buffered IO which is why you
> might have found that page cache was still filling up. There are a few
> reasons why this can happen but I would guess the common cause is that
> the range of pages being written was in the page cache already and could
> not be invalidated for some reason. I'm guessing this is the common case
> for page cache filling even with O_DIRECT but would not bet money on it
> as it's not a problem I investigated before.
Even when O_DIRECT falls back to buffered I/O for writes, it will
invalidate the page cache range described by the buffered I/O once it
completes. For reads, the range is written out synchronously before the
direct I/O is issued. Either way, you shouldn't see the page cache
filling up.
Switching to O_DIRECT often incurs a performance hit, especially if the
application does not submit more than one I/O at a time. Remember,
you're not getting readahead, and you're not getting the benefit of the
writeback code submitting batches of I/O.
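For instance, one way to keep several I/Os in flight on an O_DIRECT
descriptor is Linux AIO via libaio; the sketch below is only an
illustration of the idea (buffer setup, sizes and error handling are
assumed, and nobody in this thread has benchmarked it):

/* Sketch: submit a batch of O_DIRECT writes with libaio (link with -laio). */
#include <libaio.h>

#define NR_IOS  4
#define CHUNK   (256 * 1024)    /* must be aligned for O_DIRECT */

static int write_batch(int fd, void *bufs[NR_IOS], long long base_off)
{
        io_context_t ctx = 0;
        struct iocb cbs[NR_IOS], *cbp[NR_IOS];
        struct io_event events[NR_IOS];
        int i;

        if (io_setup(NR_IOS, &ctx) < 0)
                return -1;

        /* Prepare and submit all writes before waiting on any of them. */
        for (i = 0; i < NR_IOS; i++) {
                io_prep_pwrite(&cbs[i], fd, bufs[i], CHUNK,
                               base_off + (long long)i * CHUNK);
                cbp[i] = &cbs[i];
        }
        if (io_submit(ctx, NR_IOS, cbp) != NR_IOS)
                goto err;

        /* Reap all completions; the device sees a deeper queue this way. */
        if (io_getevents(ctx, NR_IOS, NR_IOS, events, NULL) != NR_IOS)
                goto err;

        io_destroy(ctx);
        return 0;
err:
        io_destroy(ctx);
        return -1;
}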
HTH,
Jeff
On 2013-07-03 04:47, Mel Gorman wrote:
>> Given it is just a hint, we should be allowed to perform page
>> > deactivation lazily. Is there any fundamental reason to wait for worker
>> > threads on each CPU to complete their lru drain before returning from
>> > fadvise() to user-space ?
>> >
> Only to make sure they pages are actually dropped as requested. The reason
> the wait was introduced in the first place was that page cache was filling
> up even with the fadvise calls and causing disruption. In 3.11 disruption
> due to this sort of parallel IO should be reduced but making fadvise work
> properly is reasonable in itself. Was that patch I posted ever tested or
> did I manage to miss it?
I did test it. On our test case, we get a worse result with it.
Yannick
On Wed, Jul 03, 2013 at 10:53:08AM -0400, Jeff Moyer wrote:
> Mel Gorman <[email protected]> writes:
>
> >> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> >> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> >> > something very wrong, because I get only 1/3rd of the throughput, and
> >> > the page cache fills up. Any idea why ?
> >>
> >> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> >> interesting to investigate other ways to lessen the latency impact of
> >> the fadvise DONTNEED hint.
> >>
> >
> > There are cases where O_DIRECT falls back to buffered IO which is why you
> > might have found that page cache was still filling up. There are a few
> > reasons why this can happen but I would guess the common cause is that
> > the range of pages being written was in the page cache already and could
> > not be invalidated for some reason. I'm guessing this is the common case
> > for page cache filling even with O_DIRECT but would not bet money on it
> > as it's not a problem I investigated before.
>
> Even when O_DIRECT falls back to buffered I/O for writes, it will
> invalidate the page cache range described by the buffered I/O once it
> completes. For reads, the range is written out synchronously before the
> direct I/O is issued. Either way, you shouldn't see the page cache
> filling up.
<sigh>
I keep forgetting that filesystems other than XFS have sub-optimal
direct IO implementations. I wish that "silent fallback to buffered
IO" idea had never seen the light of day, and that filesystems
implemented direct IO properly.
> Switching to O_DIRECT often incurs a performance hit, especially if the
> application does not submit more than one I/O at a time. Remember,
> you're not getting readahead, and you're not getting the benefit of the
> writeback code submitting batches of I/O.
With the way IO is being done, there won't be any readahead (write
only workload) and they are directly controlling writeback one chunk
at a time, so there's no writeback caching to do batching, either.
There's no obvious reason that direct IO should be any slower
assuming that the application is actually doing 1MB sized and
aligned IOs as was mentioned, because both methods are directly
dispatching and then waiting for IO completion.
What filesystem is in use here?
Cheers,
Dave.
--
Dave Chinner
[email protected]
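The "directly controlling writeback one chunk at a time" pattern referred to above looks roughly like this; it is a sketch of the general sync_file_range()+fadvise() sequence under discussion, not the exact consumer code, and the 256 kB chunk size is an assumption:

#define _GNU_SOURCE
#include <fcntl.h>

#define CHUNK_SIZE      (256 * 1024)    /* assumed sub-buffer size */

/*
 * After buffered writes have filled [offset, offset + CHUNK_SIZE), push
 * that range to disk synchronously, then hint that the now-clean pages
 * can be dropped from the page cache.
 */
static void flush_and_drop_chunk(int fd, off_t offset)
{
        sync_file_range(fd, offset, CHUNK_SIZE,
                        SYNC_FILE_RANGE_WAIT_BEFORE |
                        SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);

        posix_fadvise(fd, offset, CHUNK_SIZE, POSIX_FADV_DONTNEED);
}

It is the latency of that final posix_fadvise() call that this thread is concerned with.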
* Dave Chinner ([email protected]) wrote:
> On Wed, Jul 03, 2013 at 10:53:08AM -0400, Jeff Moyer wrote:
> > Mel Gorman <[email protected]> writes:
> >
> > >> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> > >> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> > >> > something very wrong, because I get only 1/3rd of the throughput, and
> > >> > the page cache fills up. Any idea why ?
> > >>
> > >> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> > >> interesting to investigate other ways to lessen the latency impact of
> > >> the fadvise DONTNEED hint.
> > >>
> > >
> > > There are cases where O_DIRECT falls back to buffered IO which is why you
> > > might have found that page cache was still filling up. There are a few
> > > reasons why this can happen but I would guess the common cause is that
> > > the range of pages being written was in the page cache already and could
> > > not be invalidated for some reason. I'm guessing this is the common case
> > > for page cache filling even with O_DIRECT but would not bet money on it
> > > as it's not a problem I investigated before.
> >
> > Even when O_DIRECT falls back to buffered I/O for writes, it will
> > invalidate the page cache range described by the buffered I/O once it
> > completes. For reads, the range is written out synchronously before the
> > direct I/O is issued. Either way, you shouldn't see the page cache
> > filling up.
>
> <sigh>
>
> I keep forgetting that filesystems other than XFS have sub-optimal
> direct IO implementations. I wish that "silent fallback to buffered
> IO" idea had never seen the light of day, and that filesystems
> implemented direct IO properly.
>
> > Switching to O_DIRECT often incurs a performance hit, especially if the
> > application does not submit more than one I/O at a time. Remember,
> > you're not getting readahead, and you're not getting the benefit of the
> > writeback code submitting batches of I/O.
>
> With the way IO is being done, there won't be any readahead (write
> only workload) and they are directly controlling writeback one chunk
> at a time, so there's no writeback caching to do batching, either.
> There's no obvious reason that direct IO should be any slower
> assuming that the application is actually doing 1MB sized and
> aligned IOs as was mentioned, because both methods are directly
> dispatching and then waiting for IO completion.
As a clarification, I use 256kB "chunks" (sub-buffers) in my tests, not
1MB. Also, please note that since I'm using splice(), each individual
splice call is internally limited to 16 pages worth of data transfer
(64kB).
> What filesystem is in use here?
My test was performed on ext3 filesystem, that was itself sitting on
raid-1 software raid.
Thanks,
Mathieu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
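Since each splice() call moves at most 16 pages (64 kB, assuming 4 kB pages), flushing one 256 kB sub-buffer takes four calls, roughly along these lines (a sketch assuming the data is already sitting in a pipe; EINTR and short transfers are glossed over):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define SUBBUF_SIZE     (256 * 1024)
#define SPLICE_MAX      (16 * 4096)     /* 16 pages per splice() call */

/* Move one sub-buffer from pipe_fd to out_fd at *out_off, 64 kB at a time. */
static int splice_subbuf(int pipe_fd, int out_fd, loff_t *out_off)
{
        size_t done = 0;

        while (done < SUBBUF_SIZE) {
                ssize_t ret = splice(pipe_fd, NULL, out_fd, out_off,
                                     SPLICE_MAX, SPLICE_F_MOVE);
                if (ret <= 0)
                        return -1;      /* real code would retry on EINTR */
                done += ret;
        }
        return 0;
}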
On Wed, Jul 03, 2013 at 08:31:03PM -0400, Mathieu Desnoyers wrote:
> * Dave Chinner ([email protected]) wrote:
> > On Wed, Jul 03, 2013 at 10:53:08AM -0400, Jeff Moyer wrote:
> > > Mel Gorman <[email protected]> writes:
> > >
> > > >> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> > > >> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> > > >> > something very wrong, because I get only 1/3rd of the throughput, and
> > > >> > the page cache fills up. Any idea why ?
> > > >>
> > > >> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> > > >> interesting to investigate other ways to lessen the latency impact of
> > > >> the fadvise DONTNEED hint.
> > > >>
> > > >
> > > > There are cases where O_DIRECT falls back to buffered IO which is why you
> > > > might have found that page cache was still filling up. There are a few
> > > > reasons why this can happen but I would guess the common cause is that
> > > > the range of pages being written was in the page cache already and could
> > > > not be invalidated for some reason. I'm guessing this is the common case
> > > > for page cache filling even with O_DIRECT but would not bet money on it
> > > > as it's not a problem I investigated before.
> > >
> > > Even when O_DIRECT falls back to buffered I/O for writes, it will
> > > invalidate the page cache range described by the buffered I/O once it
> > > completes. For reads, the range is written out synchronously before the
> > > direct I/O is issued. Either way, you shouldn't see the page cache
> > > filling up.
> >
> > <sigh>
> >
> > I keep forgetting that filesystems other than XFS have sub-optimal
> > direct IO implementations. I wish that "silent fallback to buffered
> > IO" idea had never seen the light of day, and that filesystems
> > implemented direct IO properly.
> >
> > > Switching to O_DIRECT often incurs a performance hit, especially if the
> > > application does not submit more than one I/O at a time. Remember,
> > > you're not getting readahead, and you're not getting the benefit of the
> > > writeback code submitting batches of I/O.
> >
> > With the way IO is being done, there won't be any readahead (write
> > only workload) and they are directly controlling writeback one chunk
> > at a time, so there's no writeback caching to do batching, either.
> > There's no obvious reason that direct IO should be any slower
> > assuming that the application is actually doing 1MB sized and
> > aligned IOs as was mentioned, because both methods are directly
> > dispatching and then waiting for IO completion.
>
> As a clarification, I use 256kB "chunks" (sub-buffers) in my tests, not
> 1MB. Also, please note that since I'm using splice(), each individual
> splice call is internally limited to 16 pages worth of data transfer
> (64kB).
So you are doing 4 write() calls to a single sync_file_range(), not
1:1?
> > What filesystem is in use here?
>
> My test was performed on ext3 filesystem, that was itself sitting on
> raid-1 software raid.
Yup, ext3 has limitations on direct IO that requires block
allocation and will often fall back to buffered IO. ext4 and XFS
should not have the same problems...
Cheers,
Dave.
--
Dave Chinner
[email protected]
* Dave Chinner ([email protected]) wrote:
> On Wed, Jul 03, 2013 at 08:31:03PM -0400, Mathieu Desnoyers wrote:
> > * Dave Chinner ([email protected]) wrote:
> > > On Wed, Jul 03, 2013 at 10:53:08AM -0400, Jeff Moyer wrote:
> > > > Mel Gorman <[email protected]> writes:
> > > >
> > > > >> > I just tried replacing my sync_file_range()+fadvise() calls and instead
> > > > >> > pass the O_DIRECT flag to open(). Unfortunately, I must be doing
> > > > >> > something very wrong, because I get only 1/3rd of the throughput, and
> > > > >> > the page cache fills up. Any idea why ?
> > > > >>
> > > > >> Since O_DIRECT does not seem to provide acceptable throughput, it may be
> > > > >> interesting to investigate other ways to lessen the latency impact of
> > > > >> the fadvise DONTNEED hint.
> > > > >>
> > > > >
> > > > > There are cases where O_DIRECT falls back to buffered IO which is why you
> > > > > might have found that page cache was still filling up. There are a few
> > > > > reasons why this can happen but I would guess the common cause is that
> > > > > the range of pages being written was in the page cache already and could
> > > > > not be invalidated for some reason. I'm guessing this is the common case
> > > > > for page cache filling even with O_DIRECT but would not bet money on it
> > > > > as it's not a problem I investigated before.
> > > >
> > > > Even when O_DIRECT falls back to buffered I/O for writes, it will
> > > > invalidate the page cache range described by the buffered I/O once it
> > > > completes. For reads, the range is written out synchronously before the
> > > > direct I/O is issued. Either way, you shouldn't see the page cache
> > > > filling up.
> > >
> > > <sigh>
> > >
> > > I keep forgetting that filesystems other than XFS have sub-optimal
> > > direct IO implementations. I wish that "silent fallback to buffered
> > > IO" idea had never seen the light of day, and that filesystems
> > > implemented direct IO properly.
> > >
> > > > Switching to O_DIRECT often incurs a performance hit, especially if the
> > > > application does not submit more than one I/O at a time. Remember,
> > > > you're not getting readahead, and you're not getting the benefit of the
> > > > writeback code submitting batches of I/O.
> > >
> > > With the way IO is being done, there won't be any readahead (write
> > > only workload) and they are directly controlling writeback one chunk
> > > at a time, so there's no writeback caching to do batching, either.
> > > There's no obvious reason that direct IO should be any slower
> > > assuming that the application is actually doing 1MB sized and
> > > aligned IOs as was mentioned, because both methods are directly
> > > dispatching and then waiting for IO completion.
> >
> > As a clarification, I use 256kB "chunks" (sub-buffers) in my tests, not
> > 1MB. Also, please note that since I'm using splice(), each individual
> > splice call is internally limited to 16 pages worth of data transfer
> > (64kB).
>
> So you are doing 4 write() calls to a single sync_file_range(), not
> 1:1?
Yes, my test case did 4 splice() calls (each writing 64kB) for each
256kB sync_file_range()+fadvise().
>
> > > What filesystem is in use here?
> >
> > My test was performed on ext3 filesystem, that was itself sitting on
> > raid-1 software raid.
>
> Yup, ext3 has limitations on direct IO that requires block
> allocation and will often fall back to buffered IO. ext4 and XFS
> should not have the same problems...
I'm not really keen on making good tracer performance so dependent on
which file system happens to be used on the system, unless there were
a nice ABI (e.g. fcntl(), IOW for some arguable values of "nice" ;-) )
to ask the kernel whether the file system being written to supports
efficient O_DIRECT or not.
If we had this kind of ABI, then I guess we could indeed use O_DIRECT as
an optimisation in some cases, but it unfortunately does not seem to
solve the latency regression we are seeing when using fadvise
POSIX_FADV_DONTNEED in the general case where we need to use buffered
I/O, but want to play nicely with the page cache. :-/
Thanks,
Mathieu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
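No fcntl() of that kind exists today; about the best a portable application can do is attempt the O_DIRECT open and fall back to buffered I/O if the flag is rejected outright, which still cannot detect an ext3-style silent fallback. A sketch, with the helper name purely illustrative:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/*
 * Open for O_DIRECT writes if the filesystem accepts the flag at all
 * (e.g. EINVAL is returned when it does not), otherwise fall back to
 * buffered I/O.  This cannot detect filesystems that accept O_DIRECT
 * but silently fall back to buffered I/O for allocating writes.
 */
static int open_for_trace_output(const char *path, int *using_direct)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd >= 0) {
                *using_direct = 1;
                return fd;
        }
        if (errno != EINVAL)
                return -1;

        *using_direct = 0;
        return open(path, O_WRONLY | O_CREAT, 0644);
}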
On Wed, Jul 03, 2013 at 02:47:58PM -0400, Yannick Brosseau wrote:
> On 2013-07-03 04:47, Mel Gorman wrote:
> >> Given it is just a hint, we should be allowed to perform page
> >> > deactivation lazily. Is there any fundamental reason to wait for worker
> >> > threads on each CPU to complete their lru drain before returning from
> >> > fadvise() to user-space ?
> >> >
> > Only to make sure the pages are actually dropped as requested. The reason
> > the wait was introduced in the first place was that page cache was filling
> > up even with the fadvise calls and causing disruption. In 3.11 disruption
> > due to this sort of parallel IO should be reduced but making fadvise work
> > properly is reasonable in itself. Was that patch I posted ever tested or
> > did I manage to miss it?
>
> I did test it. On our test case, we get a worse result with it.
>
That's useful to know. Please test the following replacement patch
instead.
---8<---
mm: Only drain CPUs whose pagevecs contain pages backed by a given mapping
cha-cha-cha-changelog
Not-signed-off-by David Bowie
---
include/linux/swap.h | 1 +
include/linux/workqueue.h | 1 +
kernel/workqueue.c | 29 +++++++++++++++++++++++------
mm/fadvise.c | 2 +-
mm/swap.c | 43 +++++++++++++++++++++++++++++++++++++++++++
5 files changed, 69 insertions(+), 7 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1701ce4..7625d2a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -242,6 +242,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern int lru_add_drain_all(void);
+extern int lru_add_drain_mapping(struct address_space *mapping);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_page(struct page *page);
extern void swap_setup(void);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 623488f..89c97e8 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -435,6 +435,7 @@ extern void drain_workqueue(struct workqueue_struct *wq);
extern void flush_scheduled_work(void);
extern int schedule_on_each_cpu(work_func_t func);
+extern int schedule_on_cpumask(work_func_t func, const cpumask_t *mask);
int execute_in_process_context(work_func_t fn, struct execute_work *);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ee8e29a..c359e30 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2949,17 +2949,18 @@ bool cancel_delayed_work_sync(struct delayed_work *dwork)
EXPORT_SYMBOL(cancel_delayed_work_sync);
/**
- * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * schedule_on_cpumask - execute a function synchronously on each CPU in a mask
* @func: the function to call
+ * @mask: the cpumask of CPUs to execute the function on
*
- * schedule_on_each_cpu() executes @func on each online CPU using the
+ * schedule_on_cpumask() executes @func on each CPU specified in @mask using the
* system workqueue and blocks until all CPUs have completed.
- * schedule_on_each_cpu() is very slow.
+ * schedule_on_cpumask() is potentially very slow.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
-int schedule_on_each_cpu(work_func_t func)
+int schedule_on_cpumask(work_func_t func, const cpumask_t *mask)
{
int cpu;
struct work_struct __percpu *works;
@@ -2970,14 +2971,14 @@ int schedule_on_each_cpu(work_func_t func)
get_online_cpus();
- for_each_online_cpu(cpu) {
+ for_each_cpu_mask(cpu, *mask) {
struct work_struct *work = per_cpu_ptr(works, cpu);
INIT_WORK(work, func);
schedule_work_on(cpu, work);
}
- for_each_online_cpu(cpu)
+ for_each_cpu_mask(cpu, *mask)
flush_work(per_cpu_ptr(works, cpu));
put_online_cpus();
@@ -2986,6 +2987,22 @@ int schedule_on_each_cpu(work_func_t func)
}
/**
+ * schedule_on_each_cpu - execute a function synchronously on each online CPU
+ * @func: the function to call
+ *
+ * schedule_on_each_cpu() executes @func on each online CPU using the
+ * system workqueue and blocks until all CPUs have completed.
+ * schedule_on_each_cpu() is very slow.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int schedule_on_each_cpu(work_func_t func)
+{
+ return schedule_on_cpumask(func, cpu_online_mask);
+}
+
+/**
* flush_scheduled_work - ensure that any scheduled work has run to completion.
*
* Forces execution of the kernel-global workqueue and blocks until its
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81..70c4a3d 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -132,7 +132,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
* pagevecs and try again.
*/
if (count < (end_index - start_index + 1)) {
- lru_add_drain_all();
+ lru_add_drain_mapping(mapping);
invalidate_mapping_pages(mapping, start_index,
end_index);
}
diff --git a/mm/swap.c b/mm/swap.c
index dfd7d71..97de395 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -656,6 +656,49 @@ int lru_add_drain_all(void)
}
/*
+ * Drains LRU of some CPUs but only if the associated CPUs pagevec contain
+ * pages managed by the given mapping. This is racy and there is no guarantee
+ * that when it returns that there will still not be pages belonging to the
+ * mapping on a pagevec.
+ */
+int lru_add_drain_mapping(struct address_space *mapping)
+{
+ static cpumask_t pagevec_cpus;
+ int cpu;
+
+ cpumask_clear(&pagevec_cpus);
+
+ /*
+ * get_online_cpus unnecessary as pagevecs persist after hotplug.
+ * Count may change underneath us but at worst some pages are
+ * missed or there is the occasional false positive.
+ */
+ for_each_online_cpu(cpu) {
+ struct pagevec *pvecs = per_cpu(lru_add_pvecs, cpu);
+ struct pagevec *pvec;
+ int lru;
+
+ for_each_lru(lru) {
+ int i;
+ pvec = &pvecs[lru - LRU_BASE];
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+
+ if (page->mapping == mapping) {
+ cpumask_set_cpu(cpu, &pagevec_cpus);
+ goto next_cpu;
+ }
+ }
+ }
+next_cpu:
+ ;
+ }
+
+ return schedule_on_cpumask(lru_add_drain_per_cpu, &pagevec_cpus);
+}
+
+/*
* Batched page_cache_release(). Decrement the reference count on all the
* passed pages. If it fell to zero then remove the page from the LRU and
* free it.